
1

Econometrics - Slides
2011/2012

João Nicolau
2

1 Introduction

1.1 What is Econometrics?

Econometrics is a discipline that "aims to give empirical content to economic relations". It has been defined generally as "the application of mathematics and statistical methods to economic data". Applications of econometrics:

forecasting (e.g. interest rates, inflation rates, and gross domestic product);

studying economic relations;

testing economic theories;

evaluating and implementing government and business policy. For example, what are the effects of political campaign expenditures on voting outcomes? What is the effect of school spending on student performance in the field of education?
3

1.2 Steps in Empirical Economic Analysis


Formulate the question of interest. The question might deal with testing a certain
aspect of an economic theory, or it might pertain to testing the effects of a government
policy.

Build the economic model. An economic model consists of mathematical equations that
describe various relationships. Formal economic modeling is sometimes the starting point
for empirical analysis, but it is more common to use economic theory less formally, or
even to rely entirely on intuition.

Specify the econometric model.

Collect the data.

Estimate and test the econometric model.

Answer the question in step 1.


4

1.3 The Structure of Economic Data

1.3.1 Cross-Sectional Data

A cross-sectional data set: a sample of individuals, households, firms, cities, states, countries, etc. taken at a given point in time. An important feature of cross-sectional data is that they are obtained by random sampling from the underlying population. For example, suppose that $y_i$ is the $i$-th observation of the dependent variable and $x_i$ is the $i$-th observation of the explanatory variable. Random sampling means that

$\{(y_i, x_i)\}$ is an i.i.d. sequence.

This implies that for $i \neq j$

$\operatorname{Cov}(y_i, y_j) = 0, \quad \operatorname{Cov}(x_i, x_j) = 0, \quad \operatorname{Cov}(y_i, x_j) = 0.$

Obviously, if $x_i$ "explains" $y_i$ we will have $\operatorname{Cov}(y_i, x_i) \neq 0$.

Cross-sectional data are closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics.
5

An example of Cross-Sectional Data:


6

Scatterplots may be adequate for analyzing cross-section data:

Models based on cross-sectional data usually satisfy the assumptions covered in the chapter
"Finite-Sample Properties of OLS".
7

1.3.2 Time-Series Data

A time series data set consists of observations on a variable or several variables over time.
E.g.: stock prices, money supply, consumer price index, gross domestic product, annual
homicide rates, and automobile sales figures, etc.

Time series data cannot be assumed to be independent across time. For example, knowing
something about the gross domestic product from last quarter tells us quite a bit about the
likely range of the GDP during this quarter ...

The analysis of time series data is more difficult than that of cross-sectional data. Reasons:

we need to account for the dependent nature of economic time series;

time-series data exhibit unique features such as trends over time and seasonality;

models based on time-series data rarely satisfy the assumptions covered in the chapter "Finite-Sample Properties of OLS". The most adequate assumptions are covered in the chapter "Large-Sample Theory", which is theoretically more advanced.
8

An example of a time series (scatterplots cannot in general be used here, but there are
exceptions):
9

1.3.3 Pooled Cross Sections and Panel or Longitudinal Data

These data sets have both cross-sectional and time series features.

1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis

Ceteris Paribus: “other (relevant) factors being equal”. Plays an important role in causal
analysis.
Example. Suppose that wages depend on education and labor force experience. Your goal
is to measure the “return to education”. If your analysis involves only wages and education
you may not uncover the ceteris paribus effect of education on wages. Consider the following
data:

monthly wages (Euros)   years of experience   years of education
1500                    6                     9
1500                    0                     15
1600                    1                     15
2000                    8                     12
2500                    10                    12
10

Example. In a totalitarian regime how could you measure the ceteris paribus effect of another year of education on wages? You might create 100 clones of a "normal" individual, give each person a different amount of education, and then measure their wages.

Ceteris paribus is relatively easy to analyze with experimental data.

Example (Experimental Data). Consider the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (some others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and subsequently measure the yields.

In economics you have nonexperimental data, so in principle it is difficult to estimate
ceteris paribus effects. However, we will see that econometric methods can simulate a ceteris
paribus experiment. We will be able to do in nonexperimental environments what natural
scientists are able to do in a controlled laboratory setting: keep other factors fixed.
11

2 Finite-Sample Properties of OLS

This chapter covers the finite- or small-sample properties of the OLS estimator, that is, the
statistical properties of the OLS estimator that are valid for any given sample size.

2.1 The Classical Linear Regression Model

The dependent variable is related to several other variables (called the regressors or the
explanatory variables).

Let $y_i$ be the $i$-th observation of the dependent variable.

Let $(x_{i1}, x_{i2}, \ldots, x_{iK})$ be the $i$-th observation of the $K$ regressors. The sample or data set is a collection of those $n$ observations.

The data in economics cannot be generated by experiments (except in experimental eco-


nomics), so both the dependent and independent variables have to be treated as random
variables, variables whose values are subject to chance.
12

2.1.1 The Linearity Assumption

Assumption (1.1 - Linearity). We have

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$

where the $\beta_k$'s are unknown parameters to be estimated, and $\varepsilon_i$ is the unobserved error term.

$\beta_k$'s: regression coefficients. They represent the marginal and separate effects of the regressors.

Example (1.1). (Consumption function): Consider

$con_i = \beta_1 + \beta_2 yd_i + \varepsilon_i.$

$con_i$: consumption; $yd_i$: disposable income. Note: $x_{i1} = 1$, $x_{i2} = yd_i$. The error $\varepsilon_i$ represents other variables besides disposable income that influence consumption. They include: those variables (such as financial assets) that might be observable but the researcher decided not to include as regressors, as well as those variables (such as the "mood" of the consumer) that are hard to measure. The equation is called the simple regression model.
13

The linearity assumption is not as restrictive as it might first seem.

Example (1.2). (Wage equation). Consider

$wage_i = e^{\beta_1} e^{\beta_2 educ_i} e^{\beta_3 tenure_i} e^{\beta_4 expr_i} e^{\varepsilon_i}$

where wage = the wage rate for the individual, educ = education in years, tenure = years on the current job, and expr = experience in the labor market. This equation can be written as

$\log(wage_i) = \beta_1 + \beta_2 educ_i + \beta_3 tenure_i + \beta_4 expr_i + \varepsilon_i.$

The equation is said to be in the semi-log form (or log-level form).

Example. Does this model

$y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 \log x_{i2} + \beta_4 x_{i3}^2 + \varepsilon_i$

violate Assumption 1.1?

There are, of course, cases of genuine nonlinearity. For example

$y_i = \beta_1 + e^{\beta_2 x_{i2}} + \varepsilon_i.$
14

Partial Effects

To simplify, let's consider $K = 2$ and assume that $E(\varepsilon_i \mid x_{i1}, x_{i2}) = 0$.

What is the impact on the conditional expected value of $y$, $E(y_i \mid x_{i1}, x_{i2})$, when $x_{i2}$ is increased by a small amount $\Delta x_{i2}$, i.e. $x_i' = (x_{i1}, x_{i2}) \to (x_{i1}, x_{i2} + \Delta x_{i2})$ (holding the other variable fixed)? Let

$\Delta E(y_i \mid x_i) \equiv E(y_i \mid x_{i1}, x_{i2} + \Delta x_{i2}) - E(y_i \mid x_{i1}, x_{i2}).$

(level-level) $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$: $\Delta E(y_i \mid x_i) = \beta_2 \Delta x_{i2}$.

(level-log) $y_i = \beta_1 + \beta_2 \log(x_{i2}) + \varepsilon_i$: $\Delta E(y_i \mid x_i) \simeq \frac{\beta_2}{100}\left(100\frac{\Delta x_{i2}}{x_{i2}}\right)$.

(log-level) $\log(y_i) = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$: $100\frac{\Delta E(y_i \mid x_i)}{E(y_i \mid x_i)} \simeq (100\beta_2)\,\Delta x_{i2}$ ($100\beta_2$: semi-elasticity).

(log-log) $\log(y_i) = \beta_1 + \beta_2 \log(x_{i2}) + \varepsilon_i$: $100\frac{\Delta E(y_i \mid x_i)}{E(y_i \mid x_i)} \simeq \beta_2\left(100\frac{\Delta x_{i2}}{x_{i2}}\right)$ ($\beta_2$: elasticity).
15

Exercise 2.1. Suppose, for example, the marginal effect of experience on wages declines with the level of experience. How can this be captured?

Exercise 2.2. Provide an interpretation of $\beta_2$ in the following equations:

(a) $con_i = \beta_1 + \beta_2 inc_i + \varepsilon_i$, where inc: income, con: consumption (both measured in dollars). Assume that $\beta_2 = 0.8$;

(b) $\log(wage_i) = \beta_1 + \beta_2 educ_i + \beta_3 tenure_i + \beta_4 expr_i + \varepsilon_i$. Assume that $\beta_2 = 0.05$;

(c) $\log(price_i) = \beta_1 + \beta_2 \log(dist_i) + \varepsilon_i$, where price = housing price and dist = distance from a recently built garbage incinerator. Assume that $\beta_2 = 0.6$.
16

2.1.2 Matrix Notation

We have

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{iK} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix} + \varepsilon_i = x_i'\beta + \varepsilon_i$

where

$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{bmatrix}, \qquad y_i = x_i'\beta + \varepsilon_i.$
17

More compactly,

$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nK} \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$

$y = X\beta + \varepsilon.$

Example. $y_i = \beta_1 + \beta_2 educ_i + \beta_3 exp_i + \varepsilon_i$ ($y_i$ = wages in Euros). An example of cross-sectional data is

$y = \begin{bmatrix} 2000 \\ 2500 \\ 1500 \\ \vdots \\ 5000 \\ 1000 \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 12 & 5 \\ 1 & 15 & 6 \\ 1 & 12 & 3 \\ \vdots & \vdots & \vdots \\ 1 & 17 & 15 \\ 1 & 12 & 1 \end{bmatrix}.$

Important: $y$ and $X$ (or $y_i$ and $x_{ik}$) may be random variables or observed values. We use the same notation for both cases.
18

2.1.3 The Strict Exogeneity Assumption

Assumption (1.2 - Strict exogeneity). $E(\varepsilon_i \mid X) = 0, \ \forall i.$

This assumption can be written as

$E(\varepsilon_i \mid x_1, \ldots, x_n) = 0, \ \forall i.$

With random sampling $\varepsilon_i$ is automatically independent of the explanatory variables for observations other than $i$. This implies that

$E(\varepsilon_i \mid x_j) = 0, \ \forall i, j, \ i \neq j.$

It remains to be analyzed whether or not

$E(\varepsilon_i \mid x_i) \stackrel{?}{=} 0.$
19

The strict exogeneity assumption can fail in situations such as:

(Cross-section or time series) Omitted variables;

(Cross-section or time series) Measurement error in some of the regressors;

(Time series, static models) There is feedback from $y_i$ on future values of $x_i$;

(Time series, dynamic models) There is a lagged dependent variable as a regressor;

(Cross-section or time series) Simultaneity.

Example (Omitted variables). Suppose that wage is determined by

$wage_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + v_i,$

where $x_2$: years of education, $x_3$: ability. Assume that $E(v_i \mid X) = 0$. Since ability is not observed, we instead estimate the model $wage_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$, with $\varepsilon_i = \beta_3 x_{i3} + v_i$. If $\operatorname{Cov}(x_{i2}, x_{i3}) \neq 0$ then

$\operatorname{Cov}(\varepsilon_i, x_{i2}) = \operatorname{Cov}(\beta_3 x_{i3} + v_i, x_{i2}) = \beta_3 \operatorname{Cov}(x_{i3}, x_{i2}) \neq 0 \ \Rightarrow \ E(\varepsilon_i \mid X) \neq 0.$
20

Example (Measurement error in some of the regressors). Consider $y$ = household savings and $w$ = disposable income, and

$y_i = \beta_1 + \beta_2 w_i + v_i, \qquad E(v_i \mid w) = 0.$

Suppose that $w$ cannot be measured absolutely accurately (for example, because of misreporting) and denote the measured value for $w_i$ by $x_{i2}$. We have

$x_{i2} = w_i + u_i.$

Assume: $E(u_i) = 0$, $\operatorname{Cov}(w_i, u_i) = 0$, $\operatorname{Cov}(v_i, u_i) = 0$. Now substituting $w_i = x_{i2} - u_i$ into $y_i = \beta_1 + \beta_2 w_i + v_i$ we obtain

$y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i, \qquad \varepsilon_i = v_i - \beta_2 u_i.$

Hence,

$\operatorname{Cov}(\varepsilon_i, x_{i2}) = \ldots = -\beta_2 \operatorname{Var}(u_i) \neq 0,$

and $\operatorname{Cov}(\varepsilon_i, x_{i2}) \neq 0 \ \Rightarrow \ E(\varepsilon_i \mid X) \neq 0.$
21

Example (Feedback from y on future values of x). Consider a simple static time-series model to explain a city's murder rate ($y_t$) in terms of police officers per capita ($x_t$):

$y_t = \beta_1 + \beta_2 x_t + \varepsilon_t.$

Suppose that the city adjusts the size of its police force based on past values of the murder rate. This means that, say, $x_{t+1}$ might be correlated with $\varepsilon_t$ (since a higher $\varepsilon_t$ leads to a higher $y_t$).

Example (There is a lagged dependent variable as a regressor). See section 2.1.5.

Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is

$kids_i = \beta_1 + \beta_2 educ_i + \varepsilon_i,$

where $\varepsilon_i$ is the unobserved error. (i) What kinds of factors are contained in $\varepsilon_i$? Are these likely to be correlated with level of education? (ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.
22

2.1.4 Implications of Strict Exogeneity

The assumption $E(\varepsilon_i \mid X) = 0, \ \forall i$ implies:

$E(\varepsilon_i) = 0, \ \forall i;$

$E(\varepsilon_i \mid x_j) = 0, \ \forall i, j;$

$E(x_{jk}\varepsilon_i) = 0, \ \forall i, j, k$ (or $E(x_j \varepsilon_i) = 0, \ \forall i, j$): the regressors are orthogonal to the error term for all observations;

$\operatorname{Cov}(x_{jk}, \varepsilon_i) = 0.$

Note: if $E(\varepsilon_i \mid x_j) \neq 0$ or $E(x_{jk}\varepsilon_i) \neq 0$ or $\operatorname{Cov}(x_{jk}, \varepsilon_i) \neq 0$, then $E(\varepsilon_i \mid X) \neq 0$.

23

2.1.5 Strict Exogeneity in Time-Series Models

For time-series models, strict exogeneity can be rephrased as: the regressors are orthogonal to the past, current, and future error terms. However, for most time-series models, strict exogeneity is not satisfied.

Example. Consider

$y_i = \beta y_{i-1} + \varepsilon_i, \qquad E(\varepsilon_i \mid y_{i-1}) = 0$ (thus $E(y_{i-1}\varepsilon_i) = 0$).

Let $x_i = y_{i-1}$. By construction we have

$E(x_{i+1}\varepsilon_i) = E(y_i \varepsilon_i) = \ldots = E(\varepsilon_i^2) \neq 0.$

The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity. However, the estimator may possess good large-sample properties without strict exogeneity.

2.1.6 Other Assumptions of the Model

Assumption (1.3 - no multicollinearity). The rank of the $n \times K$ data matrix $X$ is $K$ with probability 1.

None of the $K$ columns of the data matrix $X$ can be expressed as a linear combination of the other columns of $X$.

Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed jobs, then $tenure_i = expr_i$ for all $i$, in violation of the no multicollinearity assumption. There is no way to distinguish the tenure effect on the wage rate from the experience effect. Remedy: drop $tenure_i$ or $expr_i$ from the wage equation.

Example (Dummy Variable Trap). Consider

$wage_i = \beta_1 + \beta_2 educ_i + \beta_3 female_i + \beta_4 male_i + \varepsilon_i$

where

$female_i = \begin{cases} 1 & \text{if } i \text{ corresponds to a female} \\ 0 & \text{if } i \text{ corresponds to a male} \end{cases}, \qquad male_i = 1 - female_i.$

In vector notation we have

$wage = \beta_1 \mathbf{1} + \beta_2\, educ + \beta_3\, female + \beta_4\, male + \varepsilon.$

It is obvious that $\mathbf{1} = female + male$. Therefore the above model violates Assumption 1.3. One may also justify this using scalar notation: $x_{i1} = female_i + male_i$, because this relationship implies $\mathbf{1} = female + male$. Can you overcome the dummy variable trap by removing $x_{i1} \equiv 1$ from the equation?
25

Exercise 2.4. In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student the sum of hours in the four activities must be 168. (i) In the model

$GPA_i = \beta_1 + \beta_2 study_i + \beta_3 sleep_i + \beta_4 work_i + \beta_5 leisure_i + \varepsilon_i$

does it make sense to hold sleep, work, and leisure fixed, while changing study? (ii) Explain why this model violates Assumption 1.3; (iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption 1.3?

Assumption (1.4 - spherical error variance). The error term satisfies:

$E(\varepsilon_i^2 \mid X) = \sigma^2 > 0, \ \forall i$ (homoskedasticity);

$E(\varepsilon_i \varepsilon_j \mid X) = 0, \ \forall i, j, \ i \neq j$ (no correlation between observations).

Exercise 2.5. Under Assumptions 1.2 and 1.4, show that $\operatorname{Cov}(y_i, y_j \mid X) = 0$ for $i \neq j$.
26

Assumption 1.4 and strict exogeneity imply:

$\operatorname{Var}(\varepsilon_i \mid X) = E(\varepsilon_i^2 \mid X) = \sigma^2;$

$\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0;$

$E(\varepsilon\varepsilon' \mid X) = \sigma^2 I;$

$\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I.$

Note:

$E(\varepsilon\varepsilon' \mid X) = \begin{bmatrix} E(\varepsilon_1^2 \mid X) & E(\varepsilon_1\varepsilon_2 \mid X) & \cdots & E(\varepsilon_1\varepsilon_n \mid X) \\ E(\varepsilon_1\varepsilon_2 \mid X) & E(\varepsilon_2^2 \mid X) & \cdots & E(\varepsilon_2\varepsilon_n \mid X) \\ \vdots & \vdots & \ddots & \vdots \\ E(\varepsilon_1\varepsilon_n \mid X) & E(\varepsilon_2\varepsilon_n \mid X) & \cdots & E(\varepsilon_n^2 \mid X) \end{bmatrix}.$
27

Exercise 2.6. Consider the savings function

$sav_i = \beta_1 + \beta_2 inc_i + \varepsilon_i, \qquad \varepsilon_i = \sqrt{inc_i}\, z_i$

where $z_i$ is a random variable with $E(z_i) = 0$ and $\operatorname{Var}(z_i) = \sigma_z^2$. Assume that $z_i$ is independent of $inc_j$ (for all $i, j$). (i) Show that $E(\varepsilon \mid inc) = 0$; (ii) Show that Assumption 1.4 is violated.

2.1.7 The Classical Regression Model for Random Samples

The sample $(y, X)$ is a random sample if $\{(y_i, x_i)\}$ is i.i.d. (independently and identically distributed) across observations. A random sample automatically implies:

$E(\varepsilon_i \mid X) = E(\varepsilon_i \mid x_i), \qquad E(\varepsilon_i^2 \mid X) = E(\varepsilon_i^2 \mid x_i).$

Therefore Assumptions 1.2 and 1.4 can be rephrased as:

Assumption 1.2: $E(\varepsilon_i \mid x_i) = E(\varepsilon_i) = 0;$

Assumption 1.4: $E(\varepsilon_i^2 \mid x_i) = E(\varepsilon_i^2) = \sigma^2.$
28

2.1.8 “Fixed” Regressors

This is a simplifying (and generally an unrealistic) assumption to make the statistical analysis
tractable. It means that X is exactly the same in repeated samples. Sampling schemes that
support this assumption:

a) Experimental situations. For example, suppose that y represents the yields of a crop
grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation
and fertilizer for each plot. The experiment can be repeated as often as desired, with the
same X. Only y varies across plots.

b) Stratified Sampling (for more details see Wooldridge, chap. 9).


29

2.2 The Algebra of Least Squares

2.2.1 OLS Minimizes the Sum of Squared Residuals

Residual for observation $i$ (evaluated at $\tilde\beta$): $y_i - x_i'\tilde\beta$.

Vector of residuals (evaluated at $\tilde\beta$): $y - X\tilde\beta$.

Sum of squared residuals (SSR):

$SSR(\tilde\beta) = \sum_{i=1}^n (y_i - x_i'\tilde\beta)^2 = (y - X\tilde\beta)'(y - X\tilde\beta).$

The OLS (Ordinary Least Squares) estimator:

$b = \arg\min_{\tilde\beta} SSR(\tilde\beta),$

i.e. $b$ is such that $SSR(b)$ is minimum.

$K = 1$: $y_i = \beta x_i + \varepsilon_i$.

Example. Consider $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$. The data:

y    X
1    1  1
3    1  3
2    1  1
8    1  3
12   1  8

Verify that $SSR(\tilde\beta) = 42$ when $\tilde\beta = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.
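A quick numerical check of this exercise, written as a minimal Python/NumPy sketch (nothing beyond the five observations above and the candidate vector $\tilde\beta = (0, 1)'$ is assumed):

```python
import numpy as np

# Data from the example above: y and X (first column of X is the constant).
y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

beta_tilde = np.array([0., 1.])   # candidate coefficient vector
resid = y - X @ beta_tilde        # residuals evaluated at beta_tilde
print(resid @ resid)              # SSR(beta_tilde) = 42.0
```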
31

2.2.2 Normal Equations

To solve the optimization problem $\min_{\tilde\beta} SSR(\tilde\beta)$ we use classical optimization:

First Order Condition (FOC):

$\frac{\partial SSR(\tilde\beta)}{\partial \tilde\beta} = 0.$

Solve the previous equation with respect to $\tilde\beta$; let $b$ be such a solution.

Second Order Condition (SOC):

$\frac{\partial^2 SSR(\tilde\beta)}{\partial \tilde\beta\, \partial \tilde\beta'}$ is a positive definite matrix $\Rightarrow$ $b$ is a global minimum point.

To easily obtain the FOC we start by writing $SSR(\tilde\beta)$ as

$SSR(\tilde\beta) = (y - X\tilde\beta)'(y - X\tilde\beta) = \ldots = y'y - 2y'X\tilde\beta + \tilde\beta' X'X \tilde\beta.$

Recalling from matrix algebra that

$\frac{\partial a'\tilde\beta}{\partial \tilde\beta} = a, \qquad \frac{\partial \tilde\beta' A \tilde\beta}{\partial \tilde\beta} = 2A\tilde\beta$ (for $A$ symmetric),

we have

$\frac{\partial SSR(\tilde\beta)}{\partial \tilde\beta} = -2X'y + 2X'X\tilde\beta = 0,$

i.e. (replacing $\tilde\beta$ by the solution $b$)

$X'Xb = X'y \qquad$ or $\qquad X'(y - Xb) = 0.$

This is a system with $K$ equations and $K$ unknowns. These equations are called the normal equations. If

$\operatorname{rank}(X) = K \ \Rightarrow \ X'X$ is nonsingular $\ \Rightarrow \ $ there exists $(X'X)^{-1}.$

Therefore, if $\operatorname{rank}(X) = K$ we have a unique solution:

$b = (X'X)^{-1}X'y \qquad$ (the OLS estimator).

The SOC is

$\frac{\partial^2 SSR(\tilde\beta)}{\partial \tilde\beta\, \partial \tilde\beta'} = 2X'X.$

If $\operatorname{rank}(X) = K$ then $2X'X$ is a positive definite matrix, thus $SSR(\tilde\beta)$ is strictly convex in $\mathbb{R}^K$. Hence $b$ is a global minimum point.

The vector of residuals evaluated at $\tilde\beta = b$,

$e = y - Xb,$

is called the vector of OLS residuals (or simply residuals).
34

The normal equations can be written as

$X'e = 0 \ \Leftrightarrow \ \frac{1}{n}\sum_{i=1}^n x_i e_i = 0.$

This shows that the normal equations can be interpreted as the sample analogue of the orthogonality conditions $E(x_i\varepsilon_i) = 0$. Notice the reasoning: by assuming in the population the orthogonality conditions $E(x_i\varepsilon_i) = 0$, we deduce by the method of moments the corresponding sample analogue

$\frac{1}{n}\sum_i x_i (y_i - x_i'\tilde\beta) = 0.$

We obtain the OLS estimator $b$ by solving this equation with respect to $\tilde\beta$.
35

2.2.3 Two Expressions for the OLS Estimator

$b = (X'X)^{-1}X'y$

$b = \left(\frac{X'X}{n}\right)^{-1}\frac{X'y}{n} = S_{xx}^{-1}S_{xy}, \quad$ where

$S_{xx} = \frac{X'X}{n} = \frac{1}{n}\sum_{i=1}^n x_i x_i'$ (sample average of $x_i x_i'$),

$S_{xy} = \frac{X'y}{n} = \frac{1}{n}\sum_{i=1}^n x_i y_i$ (sample average of $x_i y_i$).

Example (continuation of the previous example). Consider the data:

y    X
1    1  1
3    1  3
2    1  1
8    1  3
12   1  8

Obtain $b$, $e$ and $SSR(b)$.
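A minimal Python/NumPy sketch of the computation requested in the example: it forms $S_{xx}$ and $S_{xy}$, solves the normal equations, and reports $b$, $e$ and $SSR(b)$ (only the five observations above are assumed):

```python
import numpy as np

y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
n = len(y)

Sxx = X.T @ X / n                 # sample average of x_i x_i'
Sxy = X.T @ y / n                 # sample average of x_i y_i
b = np.linalg.solve(Sxx, Sxy)     # OLS estimator b = Sxx^{-1} Sxy
e = y - X @ b                     # OLS residuals
print(b, e @ e)                   # b and SSR(b)
print(X.T @ e)                    # normal equations: X'e = 0 (up to rounding)
```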


36

2.2.4 More Concepts and Algebra

The fitted value for observation $i$: $\hat y_i = x_i'b$.

The vector of fitted values: $\hat y = Xb$.

The vector of OLS residuals: $e = y - Xb = y - \hat y$.

The projection matrix $P$ and the annihilator $M$ are defined as

$P = X(X'X)^{-1}X', \qquad M = I - P.$

Properties:

Exercise 2.7. Show that $P$ and $M$ are symmetric and idempotent and

$PX = X, \quad MX = 0, \quad \hat y = Py, \quad e = My = M\varepsilon, \quad SSR = e'e = y'My = \varepsilon'M\varepsilon.$
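A small numerical illustration of Exercise 2.7 on simulated data (a sketch; the data-generating process below is assumed only for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(n) - P                        # annihilator
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

print(np.allclose(P @ X, X),             # PX = X
      np.allclose(M @ X, 0),             # MX = 0
      np.allclose(P @ y, X @ b),         # y_hat = Py
      np.allclose(M @ y, e),             # e = My
      np.allclose(e @ e, y @ M @ y))     # SSR = y'My
```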
37

The OLS estimate of $\sigma^2$ (the variance of the error term), denoted $s^2$, is

$s^2 = \frac{SSR}{n-K} = \frac{e'e}{n-K}.$

$s$ (the square root of $s^2$) is called the standard error of the regression.

The sampling error is

$b - \beta = \ldots = (X'X)^{-1}X'\varepsilon.$

Coefficient of Determination

A measure of goodness of fit is the coefficient of determination

$R^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i - \bar y)^2}, \qquad 0 \le R^2 \le 1.$

It measures the proportion of the variation of $y$ that is accounted for by variation in the regressors $x_j$. Derivation of $R^2$: [board]
38

[Figure: three scatterplots of $y$ and the fitted values $\hat y$ against $x$, illustrating regressions with R^2 = 0.96, R^2 = 0.19 and R^2 = 0.00.]
39

"The most important thing about R^2 is that it is not important" (Goldberger). Why?

We are concerned with parameters in a population, not with goodness of fit in the sample;

We can always increase $R^2$ by adding more explanatory variables. At the limit, if $K = n$ then $R^2 = 1$.

Exercise 2.8. Prove that $K = n \Rightarrow R^2 = 1$ (assume that Assumption 1.3 holds).

It can be proved that

$R^2 = \hat\rho^2, \qquad \hat\rho = \frac{\sum_i (\hat y_i - \bar{\hat y})(y_i - \bar y)/n}{S_{\hat y} S_y}.$

Adjusted coefficient of determination:

$\bar R^2 = 1 - \frac{n-1}{n-K}\left(1 - R^2\right) = 1 - \frac{\sum_{i=1}^n e_i^2/(n-K)}{\sum_{i=1}^n (y_i - \bar y)^2/(n-1)}.$

Contrary to $R^2$, $\bar R^2$ may decline when a variable is added to the set of independent variables.
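A short sketch computing $R^2$ and $\bar R^2$ directly from the definitions above, on simulated data (the data-generating process is assumed only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1., 0.5, -0.3, 0.2]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
tss = np.sum((y - y.mean()) ** 2)                  # total sum of squares

r2 = 1 - (e @ e) / tss                             # coefficient of determination
adj_r2 = 1 - (e @ e / (n - K)) / (tss / (n - 1))   # adjusted R^2
print(r2, adj_r2)                                  # adj_r2 <= r2
```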
40

2.3 Finite-Sample Properties of OLS

First of all we need to recognize that $b$ and $b \mid X$ are random!

Assumptions:
1.1 - Linearity: $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i$.
1.2 - Strict exogeneity: $E(\varepsilon_i \mid X) = 0$.
1.3 - No multicollinearity.
1.4 - Spherical error variance: $E(\varepsilon_i^2 \mid X) = \sigma^2$, $E(\varepsilon_i\varepsilon_j \mid X) = 0$ for $i \neq j$.

Proposition (1.1 - finite-sample properties of b). We have:

(a) (unbiasedness) Under Assumptions 1.1-1.3, $E(b \mid X) = \beta$.
(b) (expression for the variance) Under Assumptions 1.1-1.4, $\operatorname{Var}(b \mid X) = \sigma^2 (X'X)^{-1}$.
(c) (Gauss-Markov Theorem) Under Assumptions 1.1-1.4, the OLS estimator is efficient in the class of linear unbiased estimators (it is the Best Linear Unbiased Estimator). That is, for any unbiased estimator $\hat\beta$ that is linear in $y$, $\operatorname{Var}(b \mid X) \le \operatorname{Var}(\hat\beta \mid X)$ in the matrix sense (i.e. $\operatorname{Var}(\hat\beta \mid X) - \operatorname{Var}(b \mid X)$ is a positive semidefinite matrix).
(d) Under Assumptions 1.1-1.4, $\operatorname{Cov}(b, e \mid X) = 0$. Proof: [board]
41

Proposition (1.2 - Unbiasedness of s^2). Let $s^2 = e'e/(n-K)$. We have

$E(s^2 \mid X) = E(s^2) = \sigma^2.$ Proof: [board]

An unbiased estimator of $\operatorname{Var}(b \mid X)$ is

$\widehat{\operatorname{Var}}(b \mid X) = s^2 (X'X)^{-1}.$

Example. Consider

$colGPA_i = \beta_1 + \beta_2 HSGPA_i + \beta_3 ACT_i + \beta_4 SKIPPED_i + \beta_5 PC_i + \varepsilon_i$

where colGPA: college grade point average (GPA); HSGPA: high school GPA; ACT: achievement examination for college admission; SKIPPED: average lectures missed per week; PC is a binary variable (0/1) that identifies who owns a personal computer. Using a survey of 141 students (Michigan State University) in Fall 1994, we obtained the following results:
42

These results tell us that $n = 141$, $s = 0.325$, $R^2 = 0.259$, $SSR = 14.37$,

$b = \begin{bmatrix} 1.356 \\ 0.4129 \\ 0.0133 \\ -0.071 \\ 0.1244 \end{bmatrix}, \qquad \widehat{\operatorname{Var}}(b \mid X) = \begin{bmatrix} 0.3275^2 & ? & ? & ? & ? \\ ? & 0.0924^2 & ? & ? & ? \\ ? & ? & 0.010^2 & ? & ? \\ ? & ? & ? & 0.026^2 & ? \\ ? & ? & ? & ? & 0.0573^2 \end{bmatrix}.$
43

2.4 More on Regression Algebra

2.4.1 Regression Matrices

Matrix $P = X(X'X)^{-1}X'$:
  $Py$ → fitted values from the regression of $y$ on $X$;
  $Pz$ → ?

Matrix $M = I - P = I - X(X'X)^{-1}X'$:
  $My$ → residuals from the regression of $y$ on $X$;
  $Mz$ → ?

Consider a partition of $X$ as follows: $X = [\, X_1 \ \ X_2 \,]$.

Matrix $P_1 = X_1(X_1'X_1)^{-1}X_1'$:
  $P_1 y$ → ?

Matrix $M_1 = I - P_1 = I - X_1(X_1'X_1)^{-1}X_1'$:
  $M_1 y$ → ?
44

2.4.2 Short and Long Regression Algebra

Partition $X$ as

$X = [\, X_1 \ \ X_2 \,], \qquad X_1: n \times K_1, \quad X_2: n \times K_2, \quad K_1 + K_2 = K.$

Long Regression

We have

$y = \hat y + e = Xb + e = [\, X_1 \ \ X_2 \,]\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + e = X_1 b_1 + X_2 b_2 + e.$

Short Regression

Suppose that we shorten the list of explanatory variables and regress $y$ on $X_1$. We have

$y = \hat y^* + e^* = X_1 b_1^* + e^*$

where

$b_1^* = (X_1'X_1)^{-1}X_1'y, \qquad e^* = M_1 y, \quad M_1 = I - X_1(X_1'X_1)^{-1}X_1'.$
45

How are $b_1^*$ and $e^*$ related to $b_1$ and $e$?

$b_1^*$ vs. $b_1$

We have

$b_1^* = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1 b_1 + X_2 b_2 + e) = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2 + \underbrace{(X_1'X_1)^{-1}X_1'e}_{0} = b_1 + F b_2, \qquad F = (X_1'X_1)^{-1}X_1'X_2.$

Thus, in general, $b_1^* \neq b_1$. Exceptional cases: $b_2 = 0$ or $X_1'X_2 = O \ \Rightarrow \ b_1^* = b_1$.

$e^*$ vs. $e$

We have

$e^* = M_1 y = M_1(X_1 b_1 + X_2 b_2 + e) = M_1 X_1 b_1 + M_1 X_2 b_2 + M_1 e = M_1 X_2 b_2 + e = v + e.$

Thus

$e^{*\prime}e^* = e'e + v'v \ge e'e.$

Thus the SSR of the short regression ($e^{*\prime}e^*$) exceeds the SSR of the long regression ($e'e$), and $e^{*\prime}e^* = e'e$ iff $v = 0$, that is iff $b_2 = 0$.

Example. Illustration of $b_1^* \neq b_1$ and $e^{*\prime}e^* \ge e'e$:

Find $X$, $X_1$, $X_2$, $b$, $b_1$, $b_2$, $b_1^*$, $e^{*\prime}e^*$, $e'e$.


48

2.4.3 Residual Regression

Consider

$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon.$

Premultiplying both sides by $M_1$ and using $M_1 X_1 = 0$, we obtain

$M_1 y = M_1 X_1\beta_1 + M_1 X_2\beta_2 + M_1\varepsilon$

$\tilde y = \tilde X_2 \beta_2 + M_1\varepsilon.$

The OLS gives

$b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' M_1 y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y.$

Thus

$b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y.$
49

Another way to prove $b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y$ (you may skip this proof). We have

$(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'(X_1 b_1 + X_2 b_2 + e) = \underbrace{(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' X_1 b_1}_{0} + \underbrace{(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' X_2 b_2}_{b_2} + \underbrace{(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' e}_{0} = b_2,$

since:

$(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' X_1 b_1 = (\tilde X_2'\tilde X_2)^{-1}X_2' M_1 X_1 b_1 = 0;$

$(\tilde X_2'\tilde X_2)^{-1}\tilde X_2' X_2 b_2 = (\tilde X_2'\tilde X_2)^{-1}X_2' M_1 X_2 b_2 = (X_2' M_1' M_1 X_2)^{-1}X_2' M_1 X_2 b_2 = (X_2' M_1 X_2)^{-1}X_2' M_1 X_2 b_2 = b_2;$

$\tilde X_2' e = X_2' M_1 e = X_2' e = 0.$
50

The conclusion is that we can obtain $b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y$ as follows:

1) Regress $X_2$ on $X_1$ to get the residuals $\tilde X_2 = M_1 X_2$. Interpretation of $\tilde X_2$: $\tilde X_2$ is $X_2$ after the effects of $X_1$ have been removed, or, $\tilde X_2$ is the part of $X_2$ that is uncorrelated with $X_1$.

2) Regress $y$ on $\tilde X_2$ to get the coefficient $b_2$ of the long regression.

OR:

1') Same as 1).

2'a) Regress $y$ on $X_1$ to get the residuals $\tilde y = M_1 y$.

2'b) Regress $\tilde y$ on $\tilde X_2$ to get the coefficient $b_2$ of the long regression.

The conclusion of 1) and 2) is extremely important: $b_2$ relates $y$ to $X_2$ after controlling for the effects of $X_1$. This is why $b_2$ can be obtained from the regression of $y$ on $\tilde X_2$, where $\tilde X_2$ is $X_2$ after the effects of $X_1$ have been removed (fixed or controlled for). This means that $b_2$ has in fact a ceteris paribus interpretation.

To recover $b_1$ we consider the equation $b_1^* = b_1 + Fb_2$. Regress $y$ on $X_1$, obtaining $b_1^* = (X_1'X_1)^{-1}X_1'y$, and now

$b_1 = b_1^* - (X_1'X_1)^{-1}X_1'X_2 b_2 = b_1^* - Fb_2.$

Example. Consider the example on page 9.


52

Example. Consider $X = [\, 1 \ \ exper \ \ tenure \ \ IQ \ \ educ \,]$ and

$X_1 = [\, 1 \ \ exper \ \ tenure \ \ IQ \,], \qquad X_2 = educ.$
53
54

2.4.4 Application of Residual Regression

A) Trend Removal (time series)

Suppose that $y_t$ and $x_t$ have a linear trend. Should the trend term be included in the regression, as in

$y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + \varepsilon_t, \qquad x_{t3} = t,$

or should the variables first be "detrended" and then used without the trend term included, as in

$\tilde y_t = \beta_2 \tilde x_{t2} + \tilde\varepsilon_t \,?$

According to the previous results, the OLS coefficient $b_2$ is the same in both regressions. In the second regression $b_2$ is obtained from the regression of $\tilde y = M_1 y$ on $\tilde x_2 = M_1 x_2$, where

$X_1 = [\, \mathbf{1} \ \ x_3 \,] = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n \end{bmatrix}.$
55

Example. Consider (TXDES: unemployment rate, INF: inflation, t: time)

$TXDES_t = \beta_1 + \beta_2 INF_t + \beta_3 t + \varepsilon_t.$

We will show two ways to obtain $b_2$ (compare EQ01 to EQ04).

EQ01 - Dependent Variable: TXDES, Method: Least Squares, Sample: 1948 2003
  C        4.463068  (Std. Error 0.425856, t-Statistic 10.48023, Prob. 0.0000)
  INF      0.104712  (Std. Error 0.063329, t-Statistic 1.653473, Prob. 0.1041)
  @TREND   0.027788  (Std. Error 0.011806, t-Statistic 2.353790, Prob. 0.0223)

EQ02 - Dependent Variable: TXDES, Method: Least Squares, Sample: 1948 2003
  C        4.801316  (Std. Error 0.379453, t-Statistic 12.65325, Prob. 0.0000)
  @TREND   0.030277  (Std. Error 0.011896, t-Statistic 2.545185, Prob. 0.0138)

EQ03 - Dependent Variable: INF, Method: Least Squares, Sample: 1948 2003
  C        3.230263  (Std. Error 0.802598, t-Statistic 4.024758, Prob. 0.0002)
  @TREND   0.023770  (Std. Error 0.025161, t-Statistic 0.944696, Prob. 0.3490)

EQ04 - Dependent Variable: TXDES_, Method: Least Squares, Sample: 1948 2003
  INF_     0.104712  (Std. Error 0.062167, t-Statistic 1.684382, Prob. 0.0978)
56

B) Seasonal Adjustment and Linear Regression with Seasonal Data

Suppose that we have data on the variable $y$, quarter by quarter, for $m$ years. A way to deal with (deterministic) seasonality is the following:

$y_t = \beta_1 Q_{t1} + \beta_2 Q_{t2} + \beta_3 Q_{t3} + \beta_4 Q_{t4} + \beta_5 x_{t5} + \varepsilon_t$

where

$Q_{ti} = \begin{cases} 1 & \text{in quarter } i \\ 0 & \text{otherwise.} \end{cases}$

Let

$X = [\, Q_1 \ \ Q_2 \ \ Q_3 \ \ Q_4 \ \ x_5 \,], \qquad X_1 = [\, Q_1 \ \ Q_2 \ \ Q_3 \ \ Q_4 \,].$

Previous results show that $b_5$ can be obtained from the regression of $\tilde y = M_1 y$ on $\tilde x_5 = M_1 x_5$. It can be proved that

$\tilde y_t = \begin{cases} y_t - \bar y_{Q1} & \text{in quarter 1} \\ y_t - \bar y_{Q2} & \text{in quarter 2} \\ y_t - \bar y_{Q3} & \text{in quarter 3} \\ y_t - \bar y_{Q4} & \text{in quarter 4} \end{cases}$

where $\bar y_{Qi}$ is the seasonal mean of quarter $i$.
57

C) Deviations from Means

Let $x_1$ be the summer vector (the vector of ones). Instead of regressing $y$ on $[\, x_1 \ \ x_2 \ \cdots \ x_K \,]$ to get $(b_1, b_2, \ldots, b_K)'$, we can regress $y$ on

$\begin{bmatrix} x_{12} - \bar x_2 & \cdots & x_{1K} - \bar x_K \\ \vdots & & \vdots \\ x_{n2} - \bar x_2 & \cdots & x_{nK} - \bar x_K \end{bmatrix}$

to get the same vector $(b_2, \ldots, b_K)'$. We sketch the proof. Let

$X_2 = [\, x_2 \ \cdots \ x_K \,]$

so that

$\hat y = x_1 b_1 + X_2 b_2.$

1) Regress $X_2$ on $x_1$ to get the residuals $\tilde X_2 = M_1 X_2$, where

$M_1 = I - x_1(x_1'x_1)^{-1}x_1' = I - \frac{x_1 x_1'}{n}.$

As we know,

$\tilde X_2 = M_1 X_2 = M_1 [\, x_2 \ \cdots \ x_K \,] = [\, M_1 x_2 \ \cdots \ M_1 x_K \,] = \begin{bmatrix} x_{12} - \bar x_2 & \cdots & x_{1K} - \bar x_K \\ \vdots & & \vdots \\ x_{n2} - \bar x_2 & \cdots & x_{nK} - \bar x_K \end{bmatrix}.$

2) Regress $y$ (or $\tilde y = M_1 y$) on $\tilde X_2$ to get the coefficient $b_2$ of the long regression:

$b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y.$

The intercept can be recovered as

$b_1 = b_1^* - (x_1'x_1)^{-1}x_1' X_2 b_2.$
59

2.4.5 Short and Residual Regression in the Classical Regression Model

Consider:

$y = X_1 b_1 + X_2 b_2 + e$ (long regression),

$y = X_1 b_1^* + e^*$ (short regression).

The correct specification corresponds to the long regression:

$E(y \mid X) = X_1\beta_1 + X_2\beta_2 = X\beta, \qquad \operatorname{Var}(y \mid X) = \sigma^2 I, \ \text{etc.}$
60

A) Short-Regression Coefficients

$b_1^*$ is a biased estimator of $\beta_1$

Given that

$b_1^* = (X_1'X_1)^{-1}X_1'y = b_1 + Fb_2, \qquad F = (X_1'X_1)^{-1}X_1'X_2,$

we have

$E(b_1^* \mid X) = E(b_1 + Fb_2 \mid X) = \beta_1 + F\beta_2,$

$\operatorname{Var}(b_1^* \mid X) = \operatorname{Var}\!\left((X_1'X_1)^{-1}X_1'y \mid X\right) = (X_1'X_1)^{-1}X_1'\operatorname{Var}(y \mid X)X_1(X_1'X_1)^{-1} = \sigma^2 (X_1'X_1)^{-1}.$

Thus, in general, $b_1^*$ is a biased estimator of $\beta_1$ ("omitted-variable bias"), unless:

$\beta_2 = 0$: corresponds to the case of "Irrelevant Omitted Variables";

$F = O$: corresponds to the case of "Orthogonal Explanatory Variables" (in sample space).
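A minimal simulation sketch of the omitted-variable bias, assuming a stylized data-generating process (no intercept, $x_1$ and $x_2$ positively correlated, $\beta_1 = 1$, $\beta_2 = 2$); the short regression of $y$ on $x_1$ alone is biased upwards:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
beta1, beta2 = 1.0, 2.0
b1_short = np.empty(reps)

for r in range(reps):
    x2 = rng.normal(size=n)
    x1 = 0.8 * x2 + rng.normal(size=n)      # x1 correlated with the omitted x2
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b1_short[r] = (x1 @ y) / (x1 @ x1)      # short regression (x2 omitted)

print(b1_short.mean())   # well above beta1 = 1.0: E(b1*|X) = beta1 + F beta2
```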
61

$\operatorname{Var}(b_1 \mid X) \ge \operatorname{Var}(b_1^* \mid X)$ (you may skip the proof)

Consider $b_1 = b_1^* - Fb_2$:

$\operatorname{Var}(b_1 \mid X) = \operatorname{Var}(b_1^* - Fb_2 \mid X) = \operatorname{Var}(b_1^* \mid X) + \operatorname{Var}(Fb_2 \mid X)$, since $\operatorname{Cov}(b_1^*, b_2 \mid X) = O$ [board]
$= \operatorname{Var}(b_1^* \mid X) + F\operatorname{Var}(b_2 \mid X)F'.$

Because $F\operatorname{Var}(b_2 \mid X)F'$ is positive semidefinite (or nonnegative definite), $\operatorname{Var}(b_1 \mid X) \ge \operatorname{Var}(b_1^* \mid X)$.

This relation is still valid if $\beta_2 = 0$. In this case ($\beta_2 = 0$), regressing $y$ on $X_1$ and on irrelevant variables ($X_2$) involves a cost: $\operatorname{Var}(b_1 \mid X) \ge \operatorname{Var}(b_1^* \mid X)$, although $E(b_1 \mid X) = \beta_1$.

In practice there may be a bias-variance trade-off between short and long regression when the target is $\beta_1$.
62

Exercise 2.9. Consider the standard simple regression model $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$ under Assumptions 1.1 through 1.4. Thus, the usual OLS estimators $b_1$ and $b_2$ are unbiased for their respective population parameters. Let $b_2^*$ be the estimator of $\beta_2$ obtained by assuming the intercept is zero, i.e. $\beta_1 = 0$. (i) Find $E(b_2^* \mid X)$. Verify that $b_2^*$ is unbiased for $\beta_2$ when the population intercept $\beta_1$ is zero. Are there other cases where $b_2^*$ is unbiased? (ii) Find the variance of $b_2^*$. (iii) Show that $\operatorname{Var}(b_2^* \mid X) \le \operatorname{Var}(b_2 \mid X)$. (iv) Comment on the trade-off between bias and variance when choosing between $b_2$ and $b_2^*$.

Exercise 2.10. Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

$avgprod_i = \beta_1 + \beta_2 avgtrain_i + \beta_3 avgabil_i + \varepsilon_i.$

Assume that this equation satisfies Assumptions 1.1 through 1.4. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in $b_2^*$ obtained from the simple regression of avgprod on avgtrain?
63

B) Short-Regression Residuals (skip this)

Given that $e^* = M_1 y$ we have

$E(e^* \mid X) = M_1 E(y \mid X) = M_1 E(X_1\beta_1 + X_2\beta_2 \mid X) = \tilde X_2\beta_2,$
$\operatorname{Var}(e^* \mid X) = \operatorname{Var}(M_1 y \mid X) = M_1\operatorname{Var}(y \mid X)M_1' = \sigma^2 M_1.$

Thus $E(e^* \mid X) \neq 0$, unless $\beta_2 = 0$.

Let's see now that the omission of explanatory variables leads to an increase in the expected SSR. We have, by R5,

$E(e^{*\prime}e^* \mid X) = E(y'M_1 y \mid X) = \operatorname{tr}(M_1\operatorname{Var}(y \mid X)) + E(y \mid X)'M_1 E(y \mid X) = \sigma^2\operatorname{tr}(M_1) + \beta_2'\tilde X_2'\tilde X_2\beta_2 = \sigma^2(n - K_1) + \beta_2'\tilde X_2'\tilde X_2\beta_2,$

and $E(e'e \mid X) = \sigma^2(n-K)$, thus

$E(e^{*\prime}e^* \mid X) - E(e'e \mid X) = \sigma^2 K_2 + \beta_2'\tilde X_2'\tilde X_2\beta_2 > 0.$

Notice that: $e^{*\prime}e^* - e'e = b_2'\tilde X_2'\tilde X_2 b_2 \ge 0$ (check $E(b_2'\tilde X_2'\tilde X_2 b_2 \mid X) = \sigma^2 K_2 + \beta_2'\tilde X_2'\tilde X_2\beta_2$).
64

C) Residual Regression

The objective is to characterize $\operatorname{Var}(b_2 \mid X)$.

We know that $b_2 = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y$. Thus

$\operatorname{Var}(b_2 \mid X) = \operatorname{Var}\!\left((\tilde X_2'\tilde X_2)^{-1}\tilde X_2' y \mid X\right) = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\operatorname{Var}(y \mid X)\tilde X_2(\tilde X_2'\tilde X_2)^{-1} = \sigma^2(\tilde X_2'\tilde X_2)^{-1} = \sigma^2(X_2'M_1X_2)^{-1}.$

Now suppose that

$X = [\, X_1 \ \ x_K \,] \quad$ (i.e. $x_K = X_2$).

It follows that

$\operatorname{Var}(b_K \mid X) = \frac{\sigma^2}{x_K' M_1 x_K}$

and $x_K' M_1 x_K$ is the sum of the squared residuals in the auxiliary regression

$x_K = \gamma_1 x_1 + \gamma_2 x_2 + \ldots + \gamma_{K-1} x_{K-1} + \text{error}.$

One can conclude (assuming that $x_1$ is the summer vector):

$R_K^2 = 1 - \frac{x_K' M_1 x_K}{\sum (x_{iK} - \bar x_K)^2}.$

Solving this equation for $x_K' M_1 x_K$ we have

$x_K' M_1 x_K = \left(1 - R_K^2\right)\sum (x_{iK} - \bar x_K)^2.$

We get

$\operatorname{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right)\sum (x_{iK} - \bar x_K)^2} = \frac{\sigma^2}{\left(1 - R_K^2\right) S_{x_K}^2\, n}.$

$\operatorname{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right)\sum (x_{iK} - \bar x_K)^2} = \frac{\sigma^2}{\left(1 - R_K^2\right) S_{x_K}^2\, n}.$

We can conclude that the precision of $b_K$ is high (i.e. $\operatorname{Var}(b_K)$ is small) when:

$\sigma^2$ is low;

$S_{x_K}^2$ is high (imagine the regression $wage = \beta_1 + \beta_2 educ + \varepsilon$; if most people in the sample report the same education, $S_{x_K}^2$ will be low and $\beta_2$ will be estimated very imprecisely);

$n$ is high (a large sample is preferable to a small sample);

$R_K^2$ is low (multicollinearity increases $R_K^2$).
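A numerical check of this variance decomposition on simulated data (sketch; the regressors are assumed correlated so that $R_K^2 > 0$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 150, 2.0
x2 = rng.normal(size=n)
xK = 0.9 * x2 + 0.3 * rng.normal(size=n)     # last regressor, correlated with x2
X = np.column_stack([np.ones(n), x2, xK])

# Var(b_K | X) as the last diagonal element of sigma^2 (X'X)^{-1}
var_direct = sigma2 * np.linalg.inv(X.T @ X)[-1, -1]

# Same quantity via the auxiliary regression of x_K on the remaining regressors
X1 = X[:, :-1]
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
R2_K = 1 - (xK @ M1 @ xK) / np.sum((xK - xK.mean()) ** 2)
S2_xK = np.mean((xK - xK.mean()) ** 2)
var_decomp = sigma2 / ((1 - R2_K) * S2_xK * n)

print(np.allclose(var_direct, var_decomp))   # True
```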
67

Exercise 2.11. Consider: sleep: minutes of sleep at night per week; totwrk: hours worked per week; educ: years of schooling; female: binary variable equal to one if the individual is female. Do women sleep more than men? Explain the differences between the estimates 32.18 and -90.969.

Regression 1 - Dependent Variable: SLEEP, Method: Least Squares, Sample: 1 706
  C        3252.407   (Std. Error 22.22211, t-Statistic 146.3591, Prob. 0.0000)
  FEMALE   32.18074   (Std. Error 33.75413, t-Statistic 0.953387, Prob. 0.3407)
  R-squared 0.001289, Adjusted R-squared -0.000129, S.E. of regression 444.4422, Sum squared resid 1.39E+08, Mean dependent var 3266.356, S.D. dependent var 444.4134, Akaike 15.03435, Schwarz 15.04726

Regression 2 - Dependent Variable: SLEEP, Method: Least Squares, Sample: 1 706
  C        3838.486   (Std. Error 86.67226, t-Statistic 44.28737, Prob. 0.0000)
  TOTWRK   -0.167339  (Std. Error 0.017937, t-Statistic -9.329260, Prob. 0.0000)
  EDUC     -13.88479  (Std. Error 5.657573, t-Statistic -2.454196, Prob. 0.0144)
  FEMALE   -90.96919  (Std. Error 34.27441, t-Statistic -2.654143, Prob. 0.0081)
  R-squared 0.119277, Adjusted R-squared 0.115514, S.E. of regression 417.9581, Sum squared resid 1.23E+08, Mean dependent var 3266.356, S.D. dependent var 444.4134, Akaike 14.91429, Schwarz 14.94012

Example. The goal is to analyze the impact of another year of education on wages. Consider:
wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of
work-related abilities); educ: years of education; exper: years of work experience; tenure:
years with current employer
Regression A - Dependent Variable: LOG(WAGE), Method: Least Squares, Sample: 1 935, White Heteroskedasticity-Consistent Standard Errors & Covariance
  C        5.973062  (Std. Error 0.082272, t-Statistic 72.60160, Prob. 0.0000)
  EDUC     0.059839  (Std. Error 0.006079, t-Statistic 9.843503, Prob. 0.0000)
  R-squared 0.097417, Adjusted R-squared 0.096449, S.E. of regression 0.400320, Sum squared resid 149.5186, Mean dependent var 6.779004, S.D. dependent var 0.421144, Akaike 1.009029, Schwarz 1.019383

Regression B - Dependent Variable: LOG(WAGE), Method: Least Squares, Sample: 1 935, White Heteroskedasticity-Consistent Standard Errors & Covariance
  C        5.496696  (Std. Error 0.112030, t-Statistic 49.06458, Prob. 0.0000)
  EDUC     0.074864  (Std. Error 0.006654, t-Statistic 11.25160, Prob. 0.0000)
  EXPER    0.015328  (Std. Error 0.003405, t-Statistic 4.501375, Prob. 0.0000)
  TENURE   0.013375  (Std. Error 0.002657, t-Statistic 5.033021, Prob. 0.0000)
  R-squared 0.155112, Adjusted R-squared 0.152390, S.E. of regression 0.387729, Sum squared resid 139.9610, Mean dependent var 6.779004, S.D. dependent var 0.421144, Akaike 0.947250, Schwarz 0.967958

Dependent Variable: LOG(WAGE)


Method: Least Squares
Sample: 1 935
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable Coefficient Std. Error t-Statistic Prob.

C 5.210967 0.113778 45.79932 0.0000


EDUC 0.047537 0.008275 5.744381 0.0000
EXPER 0.012897 0.003437 3.752376 0.0002
TENURE 0.011468 0.002686 4.270056 0.0000
IQ 0.004503 0.000989 4.553567 0.0000
KWW 0.006704 0.002070 3.238002 0.0012

R-squared 0.193739 Mean dependent var 6.779004


Adjusted R-squared 0.189400 S.D. dependent var 0.421144
S.E. of regression 0.379170 Akaike info criterion 0.904732
Sum squared resid 133.5622 Schwarz criterion 0.935794
69

Exercise 2.12. Consider

$y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i, \qquad i = 1, \ldots, n$

where $x_{i2}$ is an impulse dummy, i.e. $x_2$ is a column vector with $n-1$ zeros and only one 1. To simplify, let us suppose that this 1 is the first element of $x_2$, i.e.

$x_2' = [\, 1 \ \ 0 \ \cdots \ 0 \,].$

Find and interpret the coefficient from the regression of $y$ on $\tilde x_1 = M_2 x_1$, where $M_2 = I - x_2(x_2'x_2)^{-1}x_2'$ ($\tilde x_1$ is the residual vector from the regression of $x_1$ on $x_2$).

Exercise 2.13. Consider the long regression model (under Assumptions 1.1 through 1.4):

$y = X_1 b_1 + X_2 b_2 + e,$

and the following coefficients (obtained from the short regressions):

$b_1^* = (X_1'X_1)^{-1}X_1'y, \qquad b_2^* = (X_2'X_2)^{-1}X_2'y.$

Decide if you agree or disagree with the following statement: if $\operatorname{Cov}(b_1^*, b_2^* \mid X_1, X_2) = O$ (zero matrix) then $b_1^* = b_1$ and $b_2^* = b_2$.
70

2.5 Multicollinearity

If $\operatorname{rank}(X) < K$ then $b$ is not defined. This is called strict multicollinearity. When this happens, the statistical software will be unable to construct $(X'X)^{-1}$. Since the error is discovered quickly, this is rarely a problem for applied econometric practice.

The more relevant situation is near multicollinearity, which is often called "multicollinearity" for brevity. This is the situation where $X'X$ is near singular, i.e. the columns of $X$ are close to linearly dependent.

Consequence: the individual coefficient estimates will be imprecise. We have shown that

$\operatorname{Var}(b_K \mid X) = \frac{\sigma^2}{\left(1 - R_K^2\right) S_{x_K}^2\, n},$

where $R_K^2$ is the coefficient of determination in the auxiliary regression

$x_K = \gamma_1 x_1 + \gamma_2 x_2 + \ldots + \gamma_{K-1} x_{K-1} + \text{error}.$
71

Exercise 2.14. Do you agree with the following quotations: (a) “But more data is no remedy
for multicollinearity if the additional data are simply "more of the same." So obtaining lots
of small samples from the same population will not help” (Johnston, 1984); (b) “Another
important point is that a high degree of correlation between certain independent variables
can be irrelevant as to how well we can estimate other parameters in the model.”
Exercise 2.15. Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, "You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear." What should be your answer?
72

2.6 Statistical Inference under Normality

Assumption (1.5 - normality of the error term). $\varepsilon \mid X$ is normal.

Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that

$\varepsilon \mid X \sim N(0, \sigma^2 I) \qquad$ and $\qquad y \mid X \sim N(X\beta, \sigma^2 I).$

Suppose that we want to test $H_0: \beta_2 = 1$. Although Proposition 1.1 guarantees that, on average, $b_2$ (the OLS estimate of $\beta_2$) equals 1 if the hypothesis $H_0: \beta_2 = 1$ is true, $b_2$ may not be exactly equal to 1 for a particular sample at hand. Obviously, we cannot conclude that the restriction is false just because the estimate $b_2$ differs from 1. In order to decide whether the sampling error $b_2 - 1$ is "too large" for the restriction to be true, we need to construct from the sampling error some test statistic whose probability distribution is known given the truth of the hypothesis.

The relevant theory is built from the following results:
73

1. $z \sim N(0, I_n) \ \Rightarrow \ z'z \sim \chi^2_{(n)}.$

2. $w_1 \sim \chi^2_{(m)}$, $w_2 \sim \chi^2_{(n)}$, $w_1$ and $w_2$ independent $\ \Rightarrow \ \dfrac{w_1/m}{w_2/n} \sim F(m, n).$

3. $w \sim \chi^2_{(n)}$, $z \sim N(0, 1)$, $w$ and $z$ independent $\ \Rightarrow \ \dfrac{z}{\sqrt{w/n}} \sim t_{(n)}.$

4. Asymptotic results:
$v \sim F(m, n) \ \Rightarrow \ mv \stackrel{d}{\to} \chi^2_{(m)}$ as $n \to \infty$;
$u \sim t_{(n)} \ \Rightarrow \ u \stackrel{d}{\to} N(0, 1)$ as $n \to \infty.$

5. Consider the $n \times 1$ vector $y \mid X \sim N(X\beta, \Sigma)$. Then

$w = (y - X\beta)'\Sigma^{-1}(y - X\beta) \sim \chi^2_{(n)}.$

6. Consider the $n \times 1$ vector $\varepsilon \mid X \sim N(0, I)$. Let $M$ be an $n \times n$ idempotent matrix with $\operatorname{rank}(M) = r \le n$. Then

$\varepsilon' M \varepsilon \mid X \sim \chi^2_{(r)}.$

7. Consider the $n \times 1$ vector $\varepsilon \mid X \sim N(0, I)$. Let $M$ be an $n \times n$ idempotent matrix with $\operatorname{rank}(M) = r \le n$, and let $L$ be a matrix such that $LM = O$. Let $t_1 = M\varepsilon$ and $t_2 = L\varepsilon$. Then $t_1$ and $t_2$ are independent random vectors.

8. $b \mid X \sim N\!\left(\beta, \sigma^2(X'X)^{-1}\right).$

9. Let $r = R\beta$ ($R$ is $p \times K$) with $\operatorname{rank}(R) = p$ (in Hayashi's notation $p$ is equal to $\#r$). Then,

$Rb \mid X \sim N\!\left(r, \sigma^2 R(X'X)^{-1}R'\right).$

10. Let $b_k$ be the $k$th element of $b$ and $q^{kk}$ the $(k,k)$ element of $(X'X)^{-1}$. Then,

$b_k \mid X \sim N(\beta_k, \sigma^2 q^{kk}) \qquad$ or $\qquad z_k = \frac{b_k - \beta_k}{\sqrt{\sigma^2 q^{kk}}} \sim N(0, 1).$

11. $w = (Rb - r)'\left(\sigma^2 R(X'X)^{-1}R'\right)^{-1}(Rb - r) \sim \chi^2_{(p)}.$

12. $w_k = \dfrac{(b_k - \beta_k)^2}{\sigma^2 q^{kk}} \sim \chi^2_{(1)}.$

13. $w_0 = e'e/\sigma^2 \sim \chi^2_{(n-K)}.$

14. The random vectors $b$ and $e$ are independent.

15. Each of the statistics $e$, $e'e$, $w_0$, $s^2$, $\widehat{\operatorname{Var}}(b)$ is independent of each of the statistics $b$, $b_k$, $Rb$, $w$, $w_k$.
76

16. $t_k = \dfrac{b_k - \beta_k}{\hat\sigma_{b_k}} \sim t(n-K)$, where $\hat\sigma^2_{b_k}$ is the $(k,k)$ element of $s^2(X'X)^{-1}$.

17. $\dfrac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}} \sim t(n-K)$, where $R$ is of type $1 \times K$.

18. $F = (Rb - r)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - r)/(ps^2) \sim F(p, n-K).$

Exercise 2.16. Prove results #8, #9, #16 and #18 (take the other results as given).

The two most important results are:

$t_k = \frac{b_k - \beta_k}{\hat\sigma_{b_k}} = \frac{b_k - \beta_k}{SE(b_k)} \sim t(n-K),$

$F = (Rb - r)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - r)/(ps^2) \sim F(p, n-K).$
77

2.6.1 Confidence Intervals and Regions

Let $t_{\alpha/2} \equiv t_{\alpha/2}(n-K)$ be such that

$P\left(|t| < t_{\alpha/2}\right) = 1 - \alpha.$

Let $F_\alpha \equiv F_\alpha(p, n-K)$ be such that

$P(F \le F_\alpha) = 1 - \alpha$ (equivalently $P(F > F_\alpha) = \alpha$).

$(1-\alpha)\times 100\%$ CI for an individual slope coefficient $\beta_k$:

$\left\{\beta_k : \left|\frac{b_k - \beta_k}{\hat\sigma_{b_k}}\right| \le t_{\alpha/2}\right\} \ \Leftrightarrow \ b_k \pm t_{\alpha/2}\,\hat\sigma_{b_k}.$

$(1-\alpha)\times 100\%$ CI for a single linear combination of the elements of $\beta$ ($p = 1$):

$\left\{R\beta : \left|\frac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}}\right| \le t_{\alpha/2}\right\} \ \Leftrightarrow \ Rb \pm t_{\alpha/2}\, s\sqrt{R(X'X)^{-1}R'}.$

In this case $R$ is a $1 \times K$ vector.

$(1-\alpha)\times 100\%$ confidence region for the parameter vector $\lambda = R\beta$:

$\left\{\lambda : (Rb - \lambda)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - \lambda)/s^2 \le p F_\alpha\right\}.$

$(1-\alpha)\times 100\%$ confidence region for the parameter vector $\beta$ (consider $R = I$ in the previous case):

$\left\{\beta : (b - \beta)'(X'X)(b - \beta)/s^2 \le K F_\alpha\right\}, \qquad F_\alpha \equiv F_\alpha(K, n-K).$

Exercise 2.17. Consider $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ where $y_i = wages_i - \overline{wages}$, $x_{i1} = educ_i - \overline{educ}$, $x_{i2} = exper_i - \overline{exper}$. The results are

Dependent Variable: Y
Method: Least Squares
Sample: 1 526

Variable Coefficient Std. Error t-Statistic Prob.

X
X1 0.644272 0.053755 11.98541 0.0000
X2 0.070095 0.010967 6.391393 0.0000

R-squared 0.225162 Mean dependent var 1.34E-15


Adjusted R-squared 0.223683 S.D. dependent var 3.693086
S.E. of regression 3.253935 Akaike info criterion 5.201402
Sum squared resid 5548.160 Schwarz criterion 5.217620
Log likelihood -1365.969 Hannan-Quinn criter. 5.207752
Durbin-Watson stat 1.820274

" # " #
4025:4297 5910:064 1 2:7291 10 4 1:6678 10 5
X0 X = ; X0 X =
5910:064 96706:846 1:6678 10 5 1:1360 10 5

(a) Build the 95% con…dence interval for 2.

(b) Build the 95% con…dence interval for 1 + 2:

(c) Build the 95% con…dence region for the parameter vector :
81

Confidence regions in EViews:

[Figure: 90% and 95% confidence regions (ellipses) for the parameter vector $\beta = (\beta_1, \beta_2)'$, plotted in the (beta1, beta2) plane.]
82

2.6.2 Testing on a Single Parameter

Suppose that we have a hypothesis about the $k$th regression coefficient:

$H_0: \beta_k = \beta_k^0$

($\beta_k^0$ is a specific value, e.g. zero), and that this hypothesis is tested against the alternative hypothesis

$H_1: \beta_k \neq \beta_k^0.$

We do not reject $H_0$ at the $\alpha \times 100\%$ level if $\beta_k^0$ lies within the $(1-\alpha)\times 100\%$ CI for $\beta_k$, i.e. $b_k \pm t_{\alpha/2}\,\hat\sigma_{b_k}$; we reject $H_0$ otherwise. Equivalently, calculate the test statistic

$t_{obs} = \frac{b_k - \beta_k^0}{\hat\sigma_{b_k}}$

and,

if $|t_{obs}| > t_{\alpha/2}$ then reject $H_0$;
if $|t_{obs}| \le t_{\alpha/2}$ then do not reject $H_0$.

The reasoning is as follows. Under the null hypothesis we have

$t_k^0 = \frac{b_k - \beta_k^0}{\hat\sigma_{b_k}} \sim t(n-K).$

If we observe $|t_{obs}| > t_{\alpha/2}$ and $H_0$ is true, then a low-probability event has occurred. We take $|t_{obs}| > t_{\alpha/2}$ as evidence against the null, and the decision should be to reject $H_0$.

Other cases:

$H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k > \beta_k^0$: if $t_{obs} > t_\alpha$ then reject $H_0$ at the $\alpha \times 100\%$ level; otherwise do not reject $H_0$.

$H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k < \beta_k^0$: if $t_{obs} < -t_\alpha$ then reject $H_0$ at the $\alpha \times 100\%$ level; otherwise do not reject $H_0$.
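A short sketch of the two-sided test in Python, using the college GPA example above ($b_2 = 0.4129$, $\hat\sigma_{b_2} = 0.0924$, $n = 141$, $K = 5$) and $H_0: \beta_2 = 0$; SciPy's Student-t distribution supplies the critical value and p-value:

```python
from scipy import stats

b_k, se_bk, n, K = 0.4129, 0.0924, 141, 5   # from the college GPA example
beta_k0, alpha = 0.0, 0.05                  # H0: beta_k = 0, 5% level

t_obs = (b_k - beta_k0) / se_bk
t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)    # two-sided critical value
p_value = 2 * stats.t.sf(abs(t_obs), df=n - K)

print(t_obs, t_crit, p_value)
print("reject H0" if abs(t_obs) > t_crit else "do not reject H0")
```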
84

2.6.3 Issues in Hypothesis Testing

p-value

The p-value (or $p$) is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. $p$ is an informal measure of the evidence against the null hypothesis.

Example. Consider $H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k \neq \beta_k^0$:

p-value $= 2P\left(t_k^0 > |t_{obs}| \mid H_0 \text{ is true}\right).$

A p-value of 0.02 shows little evidence supporting $H_0$: at the 5% level you should reject the $H_0$ hypothesis.

Example. Consider $H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k > \beta_k^0$:

p-value $= P\left(t_k^0 > t_{obs} \mid H_0 \text{ is true}\right).$

EViews reports two-sided p-values: for this one-sided test, divide the reported p-value by two.
85

Reporting the outcome of a test

Correct wording in reporting the outcome of a test involving $H_0: \beta_k = \beta_k^0$ vs. $H_1: \beta_k \neq \beta_k^0$:

When the null is rejected we say that $b_k$ (not $\beta_k$) is significantly different from $\beta_k^0$ at the $\alpha \times 100\%$ level.

When the null isn't rejected we say that $b_k$ (not $\beta_k$) is not significantly different from $\beta_k^0$ at the $\alpha \times 100\%$ level.

Correct wording in reporting the outcome of a test involving $H_0: \beta_k = 0$ vs. $H_1: \beta_k \neq 0$:

When the null is rejected we say that $b_k$ (not $\beta_k$) is significantly different from zero at the $\alpha \times 100\%$ level, or the variable (associated with $b_k$) is statistically significant at the $\alpha \times 100\%$ level.

When the null isn't rejected we say that $b_k$ (not $\beta_k$) is not significantly different from zero at the $\alpha \times 100\%$ level, or the variable is not statistically significant at the $\alpha \times 100\%$ level.

More remarks:

Rejection of the null is not proof that the null is false. Why?

Acceptance of the null is not proof that the null is true. Why? We prefer to use the language "we fail to reject $H_0$ at the $x\%$ level" rather than "$H_0$ is accepted at the $x\%$ level."

In a test of type $H_0: \beta_k = \beta_k^0$, if $\hat\sigma_{b_k}$ is large ($b_k$ is an imprecise estimator) it is more difficult to reject the null: the sample contains little information about the true value of the $\beta_k$ parameter. Remember that $\hat\sigma_{b_k}$ depends on $\sigma^2$, $S^2_{x_k}$, $n$ and $R_k^2$.
87

Statistical Versus Economic Significance

The statistical significance of a variable is determined by the size of $t_{obs} = b_k/se(b_k)$, whereas the economic significance of a variable is related to the size and sign of $b_k$.

Example. Suppose that in a business activity we have

$\widehat{\log(wage_i)} = 0.1 + \underset{(0.001)}{0.01}\, female_i + \ldots \qquad n = 600.$

$H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$. We have:

$t_k^0 = \frac{b_2}{\hat\sigma_{b_2}} \sim t(600 - K) \approx N(0, 1)$ (under the null),

$t_{obs} = \frac{0.01}{0.001} = 10, \qquad \text{p-value} = 2P\left(t_k^0 > |10| \mid H_0 \text{ is true}\right) \approx 0.$

Discuss statistical versus economic significance.
88

Exercise 2.18. Can we say that students at smaller schools perform better than those at larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage of students receiving a passing score on a tenth grade math test (math10). School size is measured by student enrollment (enroll). We will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive. The figure below reports the results. Answer the initial question.

Dependent Variable: MATH10


Method: Least Squares
Sample: 1 408

Variable Coefficient Std. Error t-Statistic Prob.

C 2.274021 6.113794 0.371949 0.7101


TOTCOMP 0.000459 0.000100 4.570030 0.0000
STAFF 0.047920 0.039814 1.203593 0.2295
ENROLL -0.000198 0.000215 -0.917935 0.3592

R-squared 0.054063 Mean dependent var 24.10686


Adjusted R-squared 0.047038 S.D. dependent var 10.49361
S.E. of regression 10.24384 Akaike info criterion 7.500986
Sum squared resid 42394.25 Schwarz criterion 7.540312
Log likelihood -1526.201 Hannan-Quinn criter. 7.516547
F-statistic 7.696528 Durbin-Watson stat 1.668918
Prob(F-statistic) 0.000052
89

Exercise 2.19. We want to relate the median housing price (price) in the community to various community characteristics: nox is the amount of nitrous oxide in the air, in parts per million; dist is a weighted distance of the community from five employment centers, in miles; rooms is the average number of rooms in houses in the community; and stratio is the average student-teacher ratio of schools in the community. Can we conclude that the elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area; see Wooldridge, chapter 4.)

Dependent Variable: LOG(PRICE)


Method: Least Squares
Sample: 1 506

Variable Coefficient Std. Error t-Statistic Prob.

C 11.08386 0.318111 34.84271 0.0000


LOG(NOX) -0.953539 0.116742 -8.167932 0.0000
LOG(DIST) -0.134339 0.043103 -3.116693 0.0019
ROOMS 0.254527 0.018530 13.73570 0.0000
STRATIO -0.052451 0.005897 -8.894399 0.0000

R-squared 0.584032 Mean dependent var 9.941057


Adjusted R-squared 0.580711 S.D. dependent var 0.409255
S.E. of regression 0.265003 Akaike info criterion 0.191679
Sum squared resid 35.18346 Schwarz criterion 0.233444
Log likelihood -43.49487 Hannan-Quinn criter. 0.208059
F-statistic 175.8552 Durbin-Watson stat 0.681595
Prob(F-statistic) 0.000000
90

2.6.4 Test on a Set of Parameters I

Suppose that we have a joint null hypothesis about $\beta$:

$H_0: R\beta = r \quad$ vs. $\quad H_1: R\beta \neq r,$

where $r$ is $p \times 1$ and $R$ is $p \times K$. The test statistic is

$F^0 = (Rb - r)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - r)/(ps^2).$

Let $F_{obs}$ be the observed test statistic. We have:

reject $H_0$ if $F_{obs} > F_\alpha$ (or if p-value $< \alpha$);

do not reject $H_0$ if $F_{obs} \le F_\alpha$.

The reasoning is as follows. Under the null hypothesis we have

$F^0 \sim F(p, n-K).$

If we observe $F_{obs} > F_\alpha$ and $H_0$ is true, then a low-probability event has occurred.

In the case $p = 1$ (a single linear combination of the elements of $\beta$) one may use the test statistic

$t^0 = \frac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}} \sim t(n-K).$
Example. We consider a simple model to compare the returns to education at junior colleges and four-year colleges; for simplicity, we refer to the latter as "universities" (see Wooldridge, chap. 4). The model is

$\log(wages_i) = \beta_1 + \beta_2 jc_i + \beta_3 univ_i + \beta_4 exper_i + \varepsilon_i.$

The population includes working people with a high school degree. jc is the number of years attending a two-year college and univ is the number of years at a four-year college. Note that any combination of junior college and college is allowed, including jc = 0 and univ = 0. The hypothesis of interest is whether a year at a junior college is worth a year at a university: this is stated as $H_0: \beta_2 = \beta_3$. Under $H_0$, another year at a junior college and another year at a university lead to the same ceteris paribus percentage increase in wage. The alternative of interest is one-sided: a year at a junior college is worth less than a year at a university. This is stated as $H_1: \beta_2 < \beta_3$.
92

Dependent Variable: LWAGE


Method: Least Squares
Sample: 1 6763

Variable Coefficient Std. Error t-Statistic Prob.

C 1.472326 0.021060 69.91020 0.0000


JC 0.066697 0.006829 9.766984 0.0000
UNIV 0.076876 0.002309 33.29808 0.0000
EXPER 0.004944 0.000157 31.39717 0.0000

R-squared 0.222442 Mean dependent var 2.248096


Adjusted R-squared 0.222097 S.D. dependent var 0.487692
S.E. of regression 0.430138 Akaike info criterion 1.151172
Sum squared resid 1250.544 Schwarz criterion 1.155205
Log likelihood -3888.687 Hannan-Quinn criter. 1.152564
F-statistic 644.5330 Durbin-Watson stat 1.968444
Prob(F-statistic) 0.000000

$(X'X)^{-1} = \begin{bmatrix} 0.0023972 & 9.4121 \times 10^{-5} & 8.50437 \times 10^{-5} & 1.6780 \times 10^{-5} \\ 9.41217 \times 10^{-5} & 0.0002520 & 1.04201 \times 10^{-5} & 9.2871 \times 10^{-8} \\ 8.50437 \times 10^{-5} & 1.0420 \times 10^{-5} & 2.88090 \times 10^{-5} & 2.12598 \times 10^{-7} \\ 1.67807 \times 10^{-5} & 9.2871 \times 10^{-8} & 2.1259 \times 10^{-7} & 1.3402 \times 10^{-7} \end{bmatrix}$

Under the null, the test statistic is

$t^0 = \frac{Rb - R\beta}{s\sqrt{R(X'X)^{-1}R'}} \sim t(n-K).$
93

We have

$R = [\, 0 \ \ 1 \ \ {-1} \ \ 0 \,],$

$\sqrt{R(X'X)^{-1}R'} = 0.016124827, \qquad s\sqrt{R(X'X)^{-1}R'} = 0.430138 \times 0.016124827 = 0.006936,$

$Rb = [\, 0 \ \ 1 \ \ {-1} \ \ 0 \,]\begin{bmatrix} 1.472326 \\ 0.066697 \\ 0.076876 \\ 0.004944 \end{bmatrix} = -0.01018,$

$R\beta = [\, 0 \ \ 1 \ \ {-1} \ \ 0 \,]\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} = \beta_2 - \beta_3 = 0 \ \text{(under } H_0\text{)},$

$t_{obs} = \frac{-0.01018}{0.006936} = -1.467, \qquad t_{0.05} = 1.645.$

Since $t_{obs} = -1.467 > -t_{0.05} = -1.645$, we do not reject $H_0$ at the 5% level. There is no evidence against $\beta_2 = \beta_3$ at the 5% level.
94

Remark: in this exercise $t^0$ can be written as

$t^0 = \frac{Rb}{s\sqrt{R(X'X)^{-1}R'}} = \frac{b_2 - b_3}{\sqrt{\widehat{\operatorname{Var}}(b_2 - b_3)}} = \frac{b_2 - b_3}{SE(b_2 - b_3)}.$

Exercise 2.20 (continuation). Propose another way to test $H_0: \beta_2 = \beta_3$ against $H_1: \beta_2 < \beta_3$ along the following lines: define $\theta = \beta_2 - \beta_3$, write $\beta_2 = \theta + \beta_3$, plug this into the equation $\log(wages_i) = \beta_1 + \beta_2 jc_i + \beta_3 univ_i + \beta_4 exper_i + \varepsilon_i$ and test $\theta = 0$. Use the database available on the webpage of the course.
95

2.6.5 Test on a Set of Parameters II

We focus on another way to test

$H_0: R\beta = r \quad$ vs. $\quad H_1: R\beta \neq r$

(where $r$ is $p \times 1$ and $R$ is $p \times K$). It can be proved that

$F^0 = (Rb - r)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - r)/(ps^2) = \frac{(e^{*\prime}e^* - e'e)/p}{e'e/(n-K)} = \frac{(R^2 - R^{*2})/p}{(1 - R^2)/(n-K)} \sim F(p, n-K)$

where $*$ refers to the short regression, i.e. the regression subjected to the constraint $R\beta = r$.
96

Example. Consider once again the equation $\log(wages_i) = \beta_1 + \beta_2 jc_i + \beta_3 univ_i + \beta_4 exper_i + \varepsilon_i$ and $H_0: \beta_2 = \beta_3$ against $H_1: \beta_2 \neq \beta_3$. The results of the regression subjected to the constraint $H_0: \beta_2 = \beta_3$ are

Dependent Variable: LWAGE


Method: Least Squares
Sample: 1 6763

Variable Coefficient Std. Error t-Statistic Prob.

C 1.471970 0.021061 69.89198 0.0000


JC+UNIV 0.076156 0.002256 33.75412 0.0000
EXPER 0.004932 0.000157 31.36057 0.0000

R-squared 0.222194 Mean dependent var 2.248096


Adjusted R-squared 0.221964 S.D. dependent var 0.487692
S.E. of regression 0.430175 Akaike info criterion 1.151195
Sum squared resid 1250.942 Schwarz criterion 1.154220
Log likelihood -3889.764 Hannan-Quinn criter. 1.152239
F-statistic 965.5576 Durbin-Watson stat 1.968481
Prob(F-statistic) 0.000000

We have $p = 1$, $e'e = 1250.544$, $e^{*\prime}e^* = 1250.942$ and

$F_{obs} = \frac{(e^{*\prime}e^* - e'e)/p}{e'e/(n-K)} = \frac{(1250.942 - 1250.544)/1}{1250.544/(6763 - 4)} = 2.151, \qquad F_{0.05} = 3.84.$

We do not reject the null at the 5% level, since $F_{obs} = 2.151 < F_{0.05} = 3.84$.
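A one-line verification of this F test with SciPy, using the two sums of squared residuals reported above:

```python
from scipy import stats

ssr_r, ssr_u = 1250.942, 1250.544   # restricted and unrestricted SSR (from the outputs)
n, K, p = 6763, 4, 1                # observations, regressors, restrictions

F_obs = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))
F_crit = stats.f.ppf(0.95, p, n - K)
print(F_obs, F_crit, stats.f.sf(F_obs, p, n - K))   # ~2.15 < ~3.84: do not reject
```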
97

In the case "all slopes zero" (test of the significance of the complete regression), it can be proved that $F^0$ equals

$F^0 = \frac{R^2/(K-1)}{(1 - R^2)/(n-K)}.$

Under the null $H_0: \beta_k = 0, \ k = 2, 3, \ldots, K$, we have $F^0 \sim F(K-1, n-K)$.


Exercise 2.21. Consider the results:
Dependent Variable: Y
Method: Least Squares
Sample: 1 500

Variable Coefficient Std. Error t-Statistic Prob.

C 0.952298 0.237528 4.009200 0.0001


X2 1.322678 1.686759 0.784154 0.4333
X3 2.026896 1.701543 1.191210 0.2341

R-squared 0.300503 Mean dependent var 0.975957


Adjusted R-squared 0.297688 S.D. dependent var 6.337496
S.E. of regression 5.311080 Akaike info criterion 6.183449
Sum squared resid 14019.16 Schwarz criterion 6.208737
Log likelihood -1542.862 Hannan-Quinn criter. 6.193372
F-statistic 106.7551 Durbin-Watson stat 2.052601
Prob(F-statistic) 0.000000

Test: (a) $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$; (b) $H_0: \beta_3 = 0$ vs. $H_1: \beta_3 \neq 0$; (c) $H_0: \beta_2 = \beta_3 = 0$ vs. $H_1: \exists\,\beta_i \neq 0$ ($i = 2, 3$); (d) Are $x_{i2}$ and $x_{i3}$ truly relevant variables? How would you explain the results you obtained in parts (a), (b) and (c)?
98

2.7 Relation to Maximum Likelihood

Having specified the distribution of the error vector, we can use the maximum likelihood (ML) principle to estimate the model parameters $\theta = (\beta', \sigma^2)'$.

2.7.1 The Maximum Likelihood Principle

ML principle: choose the parameter estimates to maximize the probability of obtaining the data. Maximizing the joint density associated with the data, $f(y, X; \tilde\theta)$, leads to the same solution. Therefore:
$$
\text{ML estimator of } \theta = \arg\max_{\tilde\theta} f(y, X; \tilde\theta).
$$
99

Example (Without X). We flipped a coin 10 times. If heads then $y = 1$. Obviously $y \sim \text{Bernoulli}(\theta)$. We don't know if the coin is fair, so we treat $E(Y) = \theta$ as an unknown parameter. Suppose that $\sum_{i=1}^{10} y_i = 6$. We have
$$
f(y;\theta) = f(y_1, \dots, y_n;\theta) = \prod_{i=1}^n f(y_i;\theta)
= \theta^{y_1}(1-\theta)^{1-y_1}\cdots\theta^{y_n}(1-\theta)^{1-y_n}
= \theta^{\sum_i y_i}(1-\theta)^{10-\sum_i y_i} = \theta^6(1-\theta)^4.
$$

[Figure: the joint density $\theta^6(1-\theta)^4$ plotted against $\theta \in [0,1]$; it attains its maximum at $\theta = 0.6$.]
100

To obtain the ML estimate of $\theta$ we proceed with:
$$
\frac{d\,\theta^6(1-\theta)^4}{d\theta} = 0 \;\Leftrightarrow\; \hat\theta = \frac{6}{10}
$$
and since
$$
\frac{d^2\,\theta^6(1-\theta)^4}{d\theta^2} < 0,
$$
$\hat\theta = 0.6$ maximizes $f(y;\tilde\theta)$. $\hat\theta$ is the "most likely" value of $\theta$, that is, the value that maximizes the probability of observing $(y_1, \dots, y_{10})$. Notice that the ML estimator is $\bar y$.

Since $\log x$, $x > 0$, is a strictly increasing function we have: $\hat\theta$ maximizes $f(y;\tilde\theta)$ iff $\hat\theta$ maximizes $\log f(y;\tilde\theta)$, that is,
$$
\hat\theta = \arg\max_{\tilde\theta} f(y, X;\tilde\theta) \;\Leftrightarrow\; \hat\theta = \arg\max_{\tilde\theta} \log f(y, X;\tilde\theta).
$$
In most cases we prefer to solve $\max \log f(y, X;\tilde\theta)$ rather than $\max f(y, X;\tilde\theta)$, since the log transformation greatly simplifies the likelihood (products become sums).
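As an illustration of the ML principle, the maximization can also be done numerically. A minimal sketch in Python (the data vector is any 0/1 sample with six ones, mimicking the example):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])      # 6 heads in 10 tosses (illustrative)

def neg_loglik(theta):
    # minus the Bernoulli log-likelihood: products become sums after taking logs
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())      # both approx 0.6: the ML estimator equals the sample mean
```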
101

2.7.2 Conditional versus Unconditional Likelihood

The joint density $f(y, X; \theta)$ is in general difficult to handle. Consider:
$$
f(y, X; \theta) = f(y\mid X; \zeta)\, f(X; \psi), \qquad \theta = (\zeta', \psi')',
$$
$$
\log f(y, X; \theta) = \log f(y\mid X; \zeta) + \log f(X; \psi).
$$
In general we don't know $f(X; \psi)$.

Example. Consider $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$ where
$$
\varepsilon_i\mid X \sim N(0, \sigma^2) \;\Rightarrow\; y_i\mid X \sim N(x_i'\beta, \sigma^2), \qquad X \sim N(\mu_x, \sigma_x^2 I).
$$
Thus,
$$
\theta = \begin{bmatrix} \zeta \\ \psi \end{bmatrix}, \qquad
\zeta = \begin{bmatrix} \beta \\ \sigma^2 \end{bmatrix}, \qquad
\psi = \begin{bmatrix} \mu_x \\ \sigma_x^2 \end{bmatrix}.
$$
If there is no functional relationship between $\zeta$ and $\psi$ (such as a subset of $\psi$ being a function of $\zeta$), then maximizing $\log f(y, X; \theta)$ with respect to $\theta$ is achieved by separately maximizing $\log f(y\mid X; \zeta)$ with respect to $\zeta$ and maximizing $\log f(X; \psi)$ with respect to $\psi$. Thus the ML estimate of $\zeta$ also maximizes the conditional likelihood $f(y\mid X; \zeta)$.
102

2.7.3 The Log Likelihood for the Regression Model

Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 implies that the distribution of $\varepsilon$ conditional on $X$ is $N(0, \sigma^2 I)$. Thus,
$$
\varepsilon\mid X \sim N(0, \sigma^2 I) \;\Rightarrow\; y\mid X \sim N(X\beta, \sigma^2 I) \;\Rightarrow\;
$$
$$
f(y\mid X; \theta) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} \;\Rightarrow\;
$$
$$
\log f(y\mid X; \theta) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
$$
It can be proved that
$$
\log f(y\mid X; \theta) = \sum_{i=1}^n \log f(y_i\mid x_i)
= -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - x_i'\beta\right)^2.
$$
Proposition (1.5 - ML Estimator of $\beta$ and $\sigma^2$). Suppose Assumptions 1.1-1.5 hold. Then,
$$
\text{ML estimator of } \beta = (X'X)^{-1}X'y, \qquad
\text{ML estimator of } \sigma^2 = \frac{e'e}{n} \neq s^2 = \frac{e'e}{n-K}.
$$
103

We know that $E(s^2) = \sigma^2$. Therefore:
$$
E\!\left(\frac{e'e}{n}\right) \neq \sigma^2, \qquad \lim_{n\to\infty} E\!\left(\frac{e'e}{n}\right) = \sigma^2.
$$
Proposition (1.6 - b is the Best Unbiased Estimator, BUE). Under Assumptions 1.1-1.5, the OLS estimator b of $\beta$ is BUE in that any other unbiased (but not necessarily linear) estimator has larger conditional variance in the matrix sense.

This result should be distinguished from the Gauss-Markov Theorem that b is minimum
variance among those estimators that are unbiased and linear in y. Proposition 1.6 says
that b is minimum variance in a larger class of estimators that includes nonlinear unbiased
estimators. This stronger statement is obtained under the normality assumption (Assumption
1.5) which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov
Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this
possibility is ruled out by the normality assumption.
104

Exercise 2.22. Suppose $y_i = x_i'\beta + \varepsilon_i$ where $\varepsilon_i\mid X \sim t(v)$. Assume that Assumptions 1.1-1.4 hold. Use your intuition to answer “true” or “false” to the following statements:

(a) b is the BLUE;

(b) b is the BUE;

(c) the BUE estimator can only be obtained numerically (i.e. there is not a closed formula
for the BUE estimator).

Just out of curiosity, notice that the log-likelihood function is
$$
\sum_{i=1}^n \log f(y_i\mid x_i)
= -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log\pi - \frac{n}{2}\log(v-2)
+ n\log\!\left(\frac{\Gamma\!\left(\frac{v+1}{2}\right)}{\Gamma\!\left(\frac{v}{2}\right)}\right)
- \frac{v+1}{2}\sum_{i=1}^n \log\!\left(1 + \frac{\left(y_i - x_i'\beta\right)^2}{(v-2)\,\sigma^2}\right).
$$
105

2.8 Generalized Least Squares (GLS)

We have assumed that
$$
E\!\left(\varepsilon_i^2\mid X\right) = \operatorname{Var}(\varepsilon_i\mid X) = \sigma^2 > 0, \quad \forall i \qquad \text{(Homoskedasticity)}
$$
$$
E\!\left(\varepsilon_i\varepsilon_j\mid X\right) = 0, \quad \forall i, j,\ i\neq j \qquad \text{(No correlation between observations)}.
$$
Matrix notation:
$$
E\!\left(\varepsilon\varepsilon'\mid X\right) =
\begin{bmatrix}
E(\varepsilon_1^2\mid X) & E(\varepsilon_1\varepsilon_2\mid X) & \cdots & E(\varepsilon_1\varepsilon_n\mid X) \\
E(\varepsilon_1\varepsilon_2\mid X) & E(\varepsilon_2^2\mid X) & \cdots & E(\varepsilon_2\varepsilon_n\mid X) \\
\vdots & \vdots & \ddots & \vdots \\
E(\varepsilon_1\varepsilon_n\mid X) & E(\varepsilon_2\varepsilon_n\mid X) & \cdots & E(\varepsilon_n^2\mid X)
\end{bmatrix}
=
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
= \sigma^2 I.
$$
106

The assumption $E(\varepsilon\varepsilon'\mid X) = \sigma^2 I$ is violated if either

– $E(\varepsilon_i^2\mid X)$ depends on $X$ → Heteroskedasticity, or

– $E(\varepsilon_i\varepsilon_j\mid X) \neq 0$ → Serial Correlation (we will analyze this case later).

Let's assume now that
$$
E\!\left(\varepsilon\varepsilon'\mid X\right) = \sigma^2 V \qquad (V \text{ depends on } X).
$$
The model $y = X\beta + \varepsilon$ based on Assumptions 1.1-1.3 and $E(\varepsilon\varepsilon'\mid X) = \sigma^2 V$ is called the generalized regression model.

Notice that, by definition, we always have:
$$
E\!\left(\varepsilon\varepsilon'\mid X\right) = \operatorname{Var}(\varepsilon\mid X) = \operatorname{Var}(y\mid X).
$$
107

Example (case where $E(\varepsilon_i^2\mid X)$ depends on $X$). Consider the following model
$$
y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i
$$
to explain household expenditure on food ($y$) as a function of household income. Typical behavior: low-income households do not have the option of extravagant food tastes; they have few choices and are almost forced to spend a particular portion of their income on food. High-income households could have simple food tastes or extravagant food tastes: income by itself is likely to be relatively less important as an explanatory variable.

[Figure: scatter plot of $y$ (expenditure) against $x$ (income); the dispersion of $y$ increases with income.]
108

If $e$ accurately reflects the behavior of $\varepsilon$, the information in the previous figure suggests that the variability of $y_i$ increases as income increases; thus it is reasonable to suppose that $\operatorname{Var}(y_i\mid x_{i2})$ is a function of $x_{i2}$.

This is the same as saying that $E(\varepsilon_i^2\mid x_{i2})$ is a function of $x_{i2}$.

For example, if $E(\varepsilon_i^2\mid x_{i2}) = \sigma^2 x_{i2}^2$ then
$$
E\!\left(\varepsilon\varepsilon'\mid X\right) = \sigma^2
\underbrace{\begin{bmatrix}
x_{12}^2 & 0 & \cdots & 0 \\
0 & x_{22}^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & x_{n2}^2
\end{bmatrix}}_{V}
= \sigma^2 V \neq \sigma^2 I.
$$
109

2.8.1 Consequence of Relaxing Assumption 1.4

1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some
other estimator.

2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The same comments apply to the F-test. Note that $\operatorname{Var}(b\mid X)$ is no longer $\sigma^2(X'X)^{-1}$. In effect,
$$
\operatorname{Var}(b\mid X) = \operatorname{Var}\!\left((X'X)^{-1}X'y \mid X\right)
= (X'X)^{-1}X'\operatorname{Var}(y\mid X)\,X(X'X)^{-1}
= \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}.
$$
On the other hand,
$$
E(s^2\mid X) = \frac{E(e'e\mid X)}{n-K} = \frac{\operatorname{tr}\left(\operatorname{Var}(e\mid X)\right)}{n-K}
= \frac{\sigma^2\operatorname{tr}(MVM)}{n-K} = \frac{\sigma^2\operatorname{tr}(MV)}{n-K}.
$$
The conventional standard errors are incorrect when $\operatorname{Var}(y\mid X) \neq \sigma^2 I$. Confidence region and hypothesis test procedures based on the classical regression model are not valid.
110

3. However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition 1.1 (a)) does not require Assumption 1.4. In effect,
$$
E(b\mid X) = (X'X)^{-1}X'E(y\mid X) = (X'X)^{-1}X'X\beta = \beta, \qquad E(b) = \beta.
$$
Options in the presence of $E(\varepsilon\varepsilon'\mid X) \neq \sigma^2 I$:

– Use $b$ to estimate $\beta$ and $\operatorname{Var}(b\mid X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ for inference purposes. Note that $y\mid X \sim N(X\beta, \sigma^2 V)$ implies
$$
b\mid X \sim N\!\left(\beta,\ \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).
$$
This is not a good solution: if you know $V$ you may use a more efficient estimator, as we will see below. Later on, in the chapter "Large Sample Theory", we will find that $\sigma^2 V$ may be replaced by a consistent estimator.

– Search for a better estimator of $\beta$.
111

2.8.2 Efficient Estimation with Known V

If the value of the matrix function V is known, a BLUE estimator for $\beta$, called generalized least squares (GLS), can be deduced. The basic idea of the derivation is to transform the generalized regression model into a model that satisfies all the assumptions, including Assumption 1.4, of the classical regression model. Consider
$$
y = X\beta + \varepsilon, \qquad E\!\left(\varepsilon\varepsilon'\mid X\right) = \sigma^2 V.
$$
We should multiply both sides of the equation by a nonsingular matrix $C$ (depending on $X$),
$$
Cy = CX\beta + C\varepsilon, \qquad \tilde y = \tilde X\beta + \tilde\varepsilon,
$$
such that the transformed error $\tilde\varepsilon$ verifies $E(\tilde\varepsilon\tilde\varepsilon'\mid X) = \sigma^2 I$, i.e.
$$
E\!\left(\tilde\varepsilon\tilde\varepsilon'\mid X\right)
= E\!\left(C\varepsilon\varepsilon'C'\mid X\right)
= C\,E\!\left(\varepsilon\varepsilon'\mid X\right)C'
= \sigma^2 CVC' = \sigma^2 I,
$$
that is, $CVC' = I$.
112

Given $CVC' = I$, how to find $C$? Since $V$ is by construction symmetric and positive definite, there exists a nonsingular $n\times n$ matrix $C$ such that
$$
V = C^{-1}(C')^{-1} \quad\text{or}\quad V^{-1} = C'C.
$$
Note
$$
CVC' = C\,C^{-1}(C')^{-1}C' = I.
$$
It is easy to see that if $y = X\beta + \varepsilon$ satisfies Assumptions 1.1-1.3 and Assumption 1.5 (but not Assumption 1.4), then
$$
\tilde y = \tilde X\beta + \tilde\varepsilon, \qquad \text{where } \tilde y = Cy,\ \tilde X = CX,
$$
satisfies Assumptions 1.1-1.5. Let
$$
\hat\beta_{GLS} = \left(\tilde X'\tilde X\right)^{-1}\tilde X'\tilde y = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y.
$$
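A minimal sketch of the two equivalent routes to $\hat\beta_{GLS}$ in Python/NumPy (illustrative only; `y`, `X` and the diagonal of `V` are assumed to be given):

```python
import numpy as np

def gls(y, X, V):
    """GLS: beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y."""
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

def gls_via_transform(y, X, v_diag):
    """Same estimator via the transformation C with V^{-1} = C'C (V diagonal here)."""
    C = np.diag(1.0 / np.sqrt(v_diag))                  # then C V C' = I
    y_t, X_t = C @ y, C @ X
    return np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)    # OLS on the transformed data
```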
113

Proposition (1.7 - finite-sample properties of GLS). (a) (unbiasedness) Under Assumptions 1.1-1.3,
$$
E\!\left(\hat\beta_{GLS}\mid X\right) = \beta.
$$
(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption $E(\varepsilon\varepsilon'\mid X) = \sigma^2 V$ that the conditional second moment is proportional to $V$,
$$
\operatorname{Var}\!\left(\hat\beta_{GLS}\mid X\right) = \sigma^2\left(X'V^{-1}X\right)^{-1}.
$$
(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS estimator is efficient in that the conditional variance of any unbiased estimator that is linear in $y$ is greater than or equal to $\operatorname{Var}(\hat\beta_{GLS}\mid X)$ in the matrix sense.

Remark: $\operatorname{Var}(b\mid X) - \operatorname{Var}(\hat\beta_{GLS}\mid X)$ is a positive semidefinite matrix. In particular,
$$
\operatorname{Var}(b_j\mid X) \geq \operatorname{Var}\!\left(\hat\beta_{j,GLS}\mid X\right).
$$


114

2.8.3 A Special Case: Weighted Least Squares (WLS)

Let's suppose that
$$
E\!\left(\varepsilon_i^2\mid X\right) = \sigma^2 v_i \qquad (v_i \text{ is a function of } X).
$$
Recall: $C$ is such that $V^{-1} = C'C$.

We have
$$
V = \begin{bmatrix}
v_1 & 0 & \cdots & 0 \\ 0 & v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v_n
\end{bmatrix}
\;\Rightarrow\;
V^{-1} = \begin{bmatrix}
1/v_1 & 0 & \cdots & 0 \\ 0 & 1/v_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/v_n
\end{bmatrix}
\;\Rightarrow\;
C = \begin{bmatrix}
1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n}
\end{bmatrix}.
$$
115

Now
$$
\tilde y = Cy =
\begin{bmatrix}
1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n}
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} y_1/\sqrt{v_1} \\ y_2/\sqrt{v_2} \\ \vdots \\ y_n/\sqrt{v_n} \end{bmatrix},
$$
$$
\tilde X = CX =
\begin{bmatrix}
1/\sqrt{v_1} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{v_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{v_n}
\end{bmatrix}
\begin{bmatrix}
1 & x_{12} & \cdots & x_{1K} \\ 1 & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & \cdots & x_{nK}
\end{bmatrix}
=
\begin{bmatrix}
1/\sqrt{v_1} & x_{12}/\sqrt{v_1} & \cdots & x_{1K}/\sqrt{v_1} \\
1/\sqrt{v_2} & x_{22}/\sqrt{v_2} & \cdots & x_{2K}/\sqrt{v_2} \\
\vdots & \vdots & \ddots & \vdots \\
1/\sqrt{v_n} & x_{n2}/\sqrt{v_n} & \cdots & x_{nK}/\sqrt{v_n}
\end{bmatrix}.
$$
Another way to express these relations:
$$
\tilde y_i = \frac{y_i}{\sqrt{v_i}}, \qquad \tilde x_{ik} = \frac{x_{ik}}{\sqrt{v_i}}, \qquad i = 1, 2, \dots, n.
$$
116

Example. Suppose that $y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i$,
$$
\operatorname{Var}(y_i\mid x_{i2}) = \operatorname{Var}(\varepsilon_i\mid x_{i2}) = \sigma^2 e^{x_{i2}}, \qquad
\operatorname{Cov}\!\left(y_i, y_j\mid x_{i2}, x_{j2}\right) = 0,
$$
$$
V = \begin{bmatrix}
e^{x_{12}} & 0 & \cdots & 0 \\ 0 & e^{x_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{x_{n2}}
\end{bmatrix}.
$$
Transformed model (matrix notation): $Cy = CX\beta + C\varepsilon$, i.e.
$$
\begin{bmatrix}
\frac{y_1}{\sqrt{e^{x_{12}}}} \\ \vdots \\ \frac{y_n}{\sqrt{e^{x_{n2}}}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{1}{\sqrt{e^{x_{12}}}} & \frac{x_{12}}{\sqrt{e^{x_{12}}}} \\ \vdots & \vdots \\ \frac{1}{\sqrt{e^{x_{n2}}}} & \frac{x_{n2}}{\sqrt{e^{x_{n2}}}}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix}
\frac{\varepsilon_1}{\sqrt{e^{x_{12}}}} \\ \vdots \\ \frac{\varepsilon_n}{\sqrt{e^{x_{n2}}}}
\end{bmatrix}
$$
or (scalar notation):
$$
\tilde y_i = \tilde x_{i1}\beta_1 + \tilde x_{i2}\beta_2 + \tilde\varepsilon_i, \quad i = 1, \dots, n, \qquad
\frac{y_i}{\sqrt{e^{x_{i2}}}} = \frac{1}{\sqrt{e^{x_{i2}}}}\beta_1 + \frac{x_{i2}}{\sqrt{e^{x_{i2}}}}\beta_2 + \frac{\varepsilon_i}{\sqrt{e^{x_{i2}}}}, \quad i = 1, \dots, n.
$$
117

Notice:
$$
\operatorname{Var}(\tilde\varepsilon_i\mid X)
= \operatorname{Var}\!\left(\frac{\varepsilon_i}{\sqrt{e^{x_{i2}}}}\,\Big|\, x_{i2}\right)
= \frac{1}{e^{x_{i2}}}\operatorname{Var}(\varepsilon_i\mid x_{i2})
= \frac{1}{e^{x_{i2}}}\,\sigma^2 e^{x_{i2}} = \sigma^2.
$$
Efficient estimation under a known form of heteroskedasticity is called weighted regression (or weighted least squares, WLS).
Example. Consider $wage_i = \beta_1 + \beta_2\, educ_i + \beta_3\, exper_i + \varepsilon_i$.

[Figures: scatter plots of WAGE against EXPER and of WAGE against EDUC.]
118

Dependent Variable: WAGE
Method: Least Squares
Sample: 1 526

Variable	Coefficient	Std. Error	t-Statistic	Prob.
C	-3.390540	0.766566	-4.423023	0.0000
EDUC	0.644272	0.053806	11.97397	0.0000
EXPER	0.070095	0.010978	6.385291	0.0000

R-squared	0.225162	Mean dependent var	5.896103
Adjusted R-squared	0.222199	S.D. dependent var	3.693086
S.E. of regression	3.257044	Akaike info criterion	5.205204
Sum squared resid	5548.160	Schwarz criterion	5.229531
Log likelihood	-1365.969	Hannan-Quinn criter.	5.214729
F-statistic	75.98998	Durbin-Watson stat	1.820274
Prob(F-statistic)	0.000000

[Figure: scatter plot of the squared residuals (RES2) against EDUC.]

Assume $\operatorname{Var}(\varepsilon_i\mid educ_i, exper_i) = \sigma^2 educ_i^2$. Transformed model:
$$
\frac{wage_i}{educ_i} = \beta_1\frac{1}{educ_i} + \beta_2\frac{educ_i}{educ_i} + \beta_3\frac{exper_i}{educ_i} + \tilde\varepsilon_i, \qquad i = 1, \dots, n.
$$
119

Dependent Variable: WAGE/EDUC


Method: Least Squares
Sample: 1 526 IF EDUC>0

Variable Coefficient Std. Error t-Statistic Prob.

1/EDUC -0.709212 0.549861 -1.289800 0.1977


EDUC/EDUC 0.443472 0.038098 11.64033 0.0000
EXPER/EDUC 0.055355 0.009356 5.916236 0.0000

R-squared 0.105221 Mean dependent var 0.469856


Adjusted R-squared 0.101786 S.D. dependent var 0.265660
S.E. of regression 0.251777 Akaike info criterion 0.085167
Sum squared resid 33.02718 Schwarz criterion 0.109564
Log likelihood -19.31365 Hannan-Quinn criter. 0.094721
Durbin-Watson stat 1.777416
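A hedged sketch of the same weighted regression with statsmodels (`df` is an assumed dataframe holding the wage data with columns `wage`, `educ`, `exper`, and `educ > 0`; under $\operatorname{Var}(\varepsilon_i\mid\cdot) = \sigma^2 educ_i^2$ the WLS weights are $1/educ_i^2$):

```python
import statsmodels.api as sm

X = sm.add_constant(df[["educ", "exper"]])
wls_res = sm.WLS(df["wage"], X, weights=1.0 / df["educ"] ** 2).fit()
print(wls_res.params)   # numerically equivalent to OLS on the variables divided by educ
```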

Exercise 2.23. Let $\{y_i,\ i = 1, 2, \dots\}$ be a sequence of independent random variables with distribution $N(\mu, \sigma_i^2)$, where $\sigma_i^2$ is known (note: we assume $\sigma_1^2 \neq \sigma_2^2 \neq \dots$). When the variances are unequal, the sample mean $\bar y$ is not the best linear unbiased estimator (i.e. BLUE). The BLUE has the form $\hat\mu = \sum_{i=1}^n w_i y_i$ where the $w_i$ are nonrandom weights. (a) Find a condition on the $w_i$ such that $E(\hat\mu) = \mu$; (b) Find the optimal weights $w_i$ that make $\hat\mu$ the BLUE. Hint: you may translate this problem into an econometric framework: if $\{y_i\}$ is a sequence of independent random variables with distribution $N(\mu, \sigma_i^2)$ then $y_i$ can be represented by the equation $y_i = \mu + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma_i^2)$. Then find the GLS estimator of $\mu$.
120

Exercise 2.24. Consider
$$
y_i = \beta x_{i1} + \varepsilon_i, \qquad \beta > 0,
$$
and assume $E(\varepsilon_i\mid X) = 0$, $\operatorname{Var}(\varepsilon_i\mid X) = 1 + |x_{i1}|$, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j\mid X) = 0$. (a) Suppose we have a lot of observations and plot a graph of the observations of $y_i$ and $x_{i1}$. What would the scatter plot look like? (b) Propose an unbiased estimator with minimum variance; (c) Suppose we have the 3 following observations of $(x_{i1}, y_i)$: $(0, 0)$, $(3, 1)$ and $(8, 5)$. Estimate the value of $\beta$ from these 3 observations.

Exercise 2.25. Consider
$$
y_t = \beta_1 + \beta_2 t + \varepsilon_t, \qquad \operatorname{Var}(\varepsilon_t) = \sigma^2 t^2, \qquad t = 1, \dots, 20.
$$
Find $\sigma^2(X'X)^{-1}$, $\operatorname{Var}(b\mid X)$ and $\operatorname{Var}(\hat\beta_{GLS}\mid X)$ and comment on the results. Solution:
$$
\sigma^2\left(X'X\right)^{-1} = \sigma^2\begin{bmatrix} 0.215 & -0.01578 \\ -0.01578 & 0.0015 \end{bmatrix}, \qquad
\operatorname{Var}(b\mid X) = \sigma^2\begin{bmatrix} 13.293 & -1.6326 \\ -1.6326 & 0.25548 \end{bmatrix},
$$
$$
\operatorname{Var}\!\left(\hat\beta_{GLS}\mid X\right) = \sigma^2\begin{bmatrix} 1.0537 & -0.1895 \\ -0.1895 & 0.0840 \end{bmatrix}.
$$
121

Exercise 2.26. A researcher first ran an OLS regression. Then she was given the true V matrix. She transformed the data appropriately and obtained the GLS estimator. For several coefficients, the standard errors in the second regression were larger than those in the first regression. Does this contradict Proposition 1.7? See the previous exercise.

2.8.4 Limiting Nature of GLS

Finite-sample properties of GLS rest on the assumption that the regressors are strictly
exogenous. In time-series models the regressors are not strictly exogenous and the error
is serially correlated.

In practice, the matrix function V is unknown.

V can be estimated from the sample. This approach is called Feasible Generalized Least Squares (FGLS). But if the function V is estimated from the sample, its value $\hat V$ becomes a random variable, which affects the distribution of the GLS estimator. Very little is known about the finite-sample properties of the FGLS estimator. We need to use the large-sample properties ...
122

3 Large-Sample Theory

The finite-sample theory breaks down if one of the following three assumptions is violated:

1. the exogeneity of regressors,

2. the normality of the error term, and

3. the linearity of the regression equation.

This chapter develops an alternative approach based on large-sample theory ($n$ is "sufficiently large").
123

3.1 Review of Limit Theorems for Sequences of Random Variables

3.1.1 Convergence in Probability in Mean Square and in Distribution

Convergence in Probability

A sequence of random scalars $\{z_n\}$ converges in probability to a constant (non-random) $\alpha$ if, for any $\varepsilon > 0$,
$$
\lim_{n\to\infty} P(|z_n - \alpha| > \varepsilon) = 0.
$$
We write
$$
z_n \overset{p}{\to} \alpha \quad\text{or}\quad \operatorname{plim} z_n = \alpha.
$$
As we will see, $z_n$ is usually a sample mean,
$$
z_n = \frac{\sum_{i=1}^n y_i}{n} \quad\text{or}\quad z_n = \frac{\sum_{i=1}^n z_i}{n}.
$$
124

Example. Consider a fair coin. Let $z_i = 1$ if the $i$th toss results in heads and $z_i = 0$ otherwise. Let $\bar z_n = \frac{1}{n}\sum_{i=1}^n z_i$. The following graph suggests that $\bar z_n \overset{p}{\to} 1/2$.
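In place of the graph, a minimal simulation sketch in Python/NumPy showing the running sample mean settling around 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=100_000)                  # fair-coin tosses: 1 = heads
running_mean = np.cumsum(z) / np.arange(1, z.size + 1)
print(running_mean[[9, 99, 9_999, 99_999]])           # drifts towards 0.5 as n grows
```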
125

A sequence of K-dimensional vectors $\{z_n\}$ converges in probability to a K-dimensional vector of constants $\alpha$ if, for any $\varepsilon > 0$,
$$
\lim_{n\to\infty} P(|z_{nk} - \alpha_k| > \varepsilon) = 0, \quad \forall k.
$$
We write
$$
z_n \overset{p}{\to} \alpha.
$$
Convergence in Mean Square

A sequence of random scalars $\{z_n\}$ converges in mean square (or in quadratic mean) to $\alpha$ if
$$
\lim_{n\to\infty} E\!\left[(z_n - \alpha)^2\right] = 0.
$$
The extension to random vectors is analogous to that for convergence in probability.


126

Convergence in Distribution

Let $\{z_n\}$ be a sequence of random scalars and $F_n$ be the cumulative distribution function (c.d.f.) of $z_n$, i.e. $z_n \sim F_n$. We say that $\{z_n\}$ converges in distribution to a random scalar $z$ if the c.d.f. $F_n$ of $z_n$ converges to the c.d.f. $F$ of $z$ at every continuity point of $F$. We write
$$
z_n \overset{d}{\to} z, \qquad \text{where } z \sim F;
$$
$F$ is the asymptotic (or limiting) distribution of $z$. If $F$ is well-known, for example, if $F$ is the cumulative normal $N(0,1)$ distribution, we prefer to write
$$
z_n \overset{d}{\to} N(0,1) \quad (\text{instead of } z_n \overset{d}{\to} z \text{ and } z \sim N(0,1)).
$$
Example. Consider $z_n \sim t(n)$. We know that $z_n \overset{d}{\to} N(0,1)$.

In most applications $z_n$ is of the type
$$
z_n = \sqrt n\,(\bar y - E(y_i)).
$$
Exercise 3.1. For $z_n = \sqrt n\,(\bar y - E(y_i))$ calculate $E(z_n)$ and $\operatorname{Var}(z_n)$ (assume $E(y_i) = \mu$, $\operatorname{Var}(y_i) = \sigma^2$ and $\{y_i\}$ is an i.i.d. sequence).
127

3.1.2 Useful Results

Lemma (2.3 - preservation of convergence for continuous transformations). Suppose $f$ is a vector-valued continuous function that does not depend on $n$. Then:

(a) if $z_n \overset{p}{\to} \alpha$ then $f(z_n) \overset{p}{\to} f(\alpha)$;

(b) if $z_n \overset{d}{\to} z$ then $f(z_n) \overset{d}{\to} f(z)$.

An immediate implication of Lemma 2.3 (a) is that the usual arithmetic operations preserve convergence in probability:
$$
x_n \overset{p}{\to} \alpha,\ y_n \overset{p}{\to} \beta \;\Rightarrow\; x_n + y_n \overset{p}{\to} \alpha + \beta;
$$
$$
x_n \overset{p}{\to} \alpha,\ y_n \overset{p}{\to} \beta \;\Rightarrow\; x_n y_n \overset{p}{\to} \alpha\beta;
$$
$$
x_n \overset{p}{\to} \alpha,\ y_n \overset{p}{\to} \beta \;\Rightarrow\; x_n/y_n \overset{p}{\to} \alpha/\beta, \quad \beta \neq 0;
$$
$$
Y_n \overset{p}{\to} \Gamma \;\Rightarrow\; Y_n^{-1} \overset{p}{\to} \Gamma^{-1} \quad (\Gamma \text{ is invertible}).
$$
128

Lemma (2.4). We have

(a) $x_n \overset{d}{\to} x$, $y_n \overset{p}{\to} \alpha$ $\Rightarrow$ $x_n + y_n \overset{d}{\to} x + \alpha$.

(b) $x_n \overset{d}{\to} x$, $y_n \overset{p}{\to} 0$ $\Rightarrow$ $y_n'x_n \overset{p}{\to} 0$.

(c) $x_n \overset{d}{\to} x$, $A_n \overset{p}{\to} A$ $\Rightarrow$ $A_n x_n \overset{d}{\to} Ax$. In particular, if $x \sim N(0, \Sigma)$, then $A_n x_n \overset{d}{\to} N(0, A\Sigma A')$.

(d) $x_n \overset{d}{\to} x$, $A_n \overset{p}{\to} A$ $\Rightarrow$ $x_n'A_n^{-1}x_n \overset{d}{\to} x'A^{-1}x$ ($A$ is nonsingular).

If $x_n \overset{p}{\to} 0$ we write $x_n = o_p(1)$.

If $x_n - y_n \overset{p}{\to} 0$ we write $x_n = y_n + o_p(1)$.

In part (c) we may write $A_n x_n \overset{d}{=} A x_n$ ($A_n x_n$ and $A x_n$ have the same asymptotic distribution).
129

3.1.3 Viewing Estimators as Sequences of Random Variables

Let $\hat\theta_n$ be an estimator of a parameter vector $\theta$ based on a sample of size $n$. We say that an estimator $\hat\theta_n$ is consistent for $\theta$ if
$$
\hat\theta_n \overset{p}{\to} \theta.
$$
The asymptotic bias of $\hat\theta_n$ is defined as $\operatorname{plim}_{n\to\infty}\hat\theta_n - \theta$. So if the estimator is consistent, its asymptotic bias is zero.

Wooldridge’s quotation:

While not all useful estimators are unbiased, virtually all economists agree that
consistency is a minimal requirement for an estimator. The famous econometrician
Clive W.J. Granger once remarked: “If you can’t get it right as n goes to infinity,
you shouldn’t be in this business.” The implication is that, if your estimator of a
particular population parameter is not consistent, then you are wasting your time.
130

A consistent estimator $\hat\theta_n$ is asymptotically normal if
$$
\sqrt n\,(\hat\theta_n - \theta) \overset{d}{\to} N(0, \Sigma).
$$
Such an estimator is called $\sqrt n$-consistent.

The variance matrix $\Sigma$ is called the asymptotic variance and is denoted $\operatorname{Avar}(\hat\theta_n)$, i.e.
$$
\lim_{n\to\infty}\operatorname{Var}\!\left(\sqrt n\,(\hat\theta_n - \theta)\right) = \operatorname{Avar}(\hat\theta_n) = \Sigma.
$$
Some authors use the notation $\operatorname{Avar}(\hat\theta_n)$ to mean $\Sigma/n$ (which is zero in the limit).
131

3.1.4 Laws of Large Numbers and Central Limit Theorems

Consider
$$
\bar z_n = \frac{1}{n}\sum_{i=1}^n z_i.
$$
We say that $\bar z_n$ obeys the LLN if $\bar z_n \overset{p}{\to} \mu$ where $\mu = E(z_i)$ or $\lim_n E(\bar z_n) = \mu$.

(A Version of Chebychev's Weak LLN) If
$$
\lim E(\bar z_n) = \mu \quad\text{and}\quad \lim \operatorname{Var}(\bar z_n) = 0 \;\Rightarrow\; \bar z_n \overset{p}{\to} \mu.
$$
(Kolmogorov's Second Strong LLN) If $\{z_i\}$ is i.i.d. with $E(z_i) = \mu$, then $\bar z_n \overset{p}{\to} \mu$.

These LLNs extend readily to random vectors by requiring element-by-element convergence.
132

Theorem 1 (Lindeberg-Levy CLT). Let $\{z_i\}$ be i.i.d. with $E(z_i) = \mu$ and $\operatorname{Var}(z_i) = \Sigma$. Then
$$
\sqrt n\,(\bar z_n - \mu) = \frac{1}{\sqrt n}\sum_{i=1}^n (z_i - \mu) \overset{d}{\to} N(0, \Sigma).
$$
Notice that
$$
E\!\left(\sqrt n\,(\bar z_n - \mu)\right) = 0 \;\Rightarrow\; E(\bar z_n) = \mu, \qquad
\operatorname{Var}\!\left(\sqrt n\,(\bar z_n - \mu)\right) = \Sigma \;\Rightarrow\; \operatorname{Var}(\bar z_n) = \Sigma/n.
$$
Given the previous equations, some authors write
$$
\bar z_n \overset{a}{\sim} N\!\left(\mu, \frac{\Sigma}{n}\right).
$$
133

Example. Let $\{z_i\}$ be i.i.d. with distribution $\chi^2(1)$. By the Lindeberg-Levy CLT (scalar case) we have
$$
\bar z_n = \frac{1}{n}\sum_{i=1}^n z_i \overset{a}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right)
$$
where
$$
E(\bar z_n) = \frac{1}{n}\sum_{i=1}^n E(z_i) = E(z_i) = \mu = 1, \qquad
\operatorname{Var}(\bar z_n) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^n z_i\right) = \frac{1}{n}\operatorname{Var}(z_i) = \frac{\sigma^2}{n} = \frac{2}{n}.
$$
134

[Figures: probability density function of $\bar z_n$ (obtained by Monte-Carlo simulation) and probability density function of $\sqrt n\,(\bar z_n - \mu)$ (exact expressions for $n = 5, 10$ and $50$).]
135

Example. In a random sample of size $n = 30$ on the variable $z$ with $E(z) = 10$, $\operatorname{Var}(z) = 9$ but unknown distribution, obtain an approximation to $P(\bar z_n < 9.5)$. We do not know the exact distribution of $\bar z_n$. However, from the Lindeberg-Levy CLT we have
$$
\sqrt n\,\frac{(\bar z_n - \mu)}{\sigma} \overset{d}{\to} N(0,1) \quad\text{or}\quad \bar z_n \overset{a}{\sim} N\!\left(\mu, \frac{\sigma^2}{n}\right).
$$
Hence,
$$
P(\bar z_n < 9.5) = P\!\left(\sqrt n\,\frac{(\bar z_n - \mu)}{\sigma} < \sqrt{30}\,\frac{(9.5 - 10)}{3}\right)
\simeq \Phi(-0.9128) = 0.1807, \quad \text{[}\Phi \text{ is the cdf of } N(0,1)\text{]}.
$$
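A quick check of this approximation in Python (`scipy.stats.norm` gives the standard normal c.d.f. $\Phi$):

```python
import numpy as np
from scipy.stats import norm

mu, var, n = 10, 9, 30
z = np.sqrt(n) * (9.5 - mu) / np.sqrt(var)   # standardized value, approx -0.9129
print(norm.cdf(z))                           # approx 0.1807
```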
136

3.2 Fundamental Concepts in Time-Series Analysis

Stochastic process (SP): a sequence of random variables. For this reason, it is more adequate to write a SP as $\{z_i\}$ (a sequence of random variables) rather than $z_i$ (the random variable at time $i$).
137

3.2.1 Various Classes of Stochastic processes

Definition (Stationary Processes). A SP $\{z_i\}$ is (strictly) stationary if the joint distribution of $(z_1, z_2, \dots, z_s)$ equals that of $(z_{k+1}, z_{k+2}, \dots, z_{k+s})$ for any $s \in \mathbb{N}$ and $k \in \mathbb{Z}$.

Exercise 3.2. Consider a SP $\{z_i\}$ where $E(|g(z_i)|) < \infty$. Show that if $\{z_i\}$ is a strictly stationary process then $E(g(z_i))$ is constant and does not depend on $i$.

The definition implies that any transformation (function) of a stationary process is itself stationary, that is, if $\{z_i\}$ is stationary, then $\{g(z_i)\}$ is. For example, if $\{z_i\}$ is stationary then $\{z_i z_i'\}$ is also stationary.

Definition (Covariance Stationary Processes). A stochastic process $\{z_i\}$ is weakly (or covariance) stationary if: (i) $E(z_i)$ does not depend on $i$, and (ii) $\operatorname{Cov}(z_i, z_{i-j})$ exists, is finite, and depends only on $j$ but not on $i$.

If $\{z_i\}$ is a covariance stationary SP then $\operatorname{Cov}(z_1, z_5) = \operatorname{Cov}(z_{1001}, z_{1005})$.

A transformation (function) of a covariance stationary process may or may not be a covariance stationary process.
138

Example. It can be proved that $\{z_i\}$, $z_i = \sqrt{\alpha_0 + \alpha_1 z_{i-1}^2}\,\varepsilon_i$, where $\{\varepsilon_i\}$ is i.i.d. with mean zero and unit variance, $\alpha_0 > 0$ and $\sqrt{1/3} \leq \alpha_1 < 1$, is a covariance stationary process. However, $w_i = z_i^2$ is not a covariance stationary process as $E(w_i^2)$ does not exist.

Exercise 3.3. Consider the SP $\{u_t\}$ where
$$
u_t = \begin{cases} \eta_t & \text{if } t \leq 2000 \\[4pt] \sqrt{\dfrac{k-2}{k}}\,\xi_t & \text{if } t > 2000 \end{cases}
$$
where $\eta_t$ and $\xi_s$ are independent for all $t$ and $s$, $\eta_t \overset{iid}{\sim} N(0,1)$ and $\xi_s \overset{iid}{\sim} t(k)$. Explain why $\{u_t\}$ is weakly (or covariance) stationary but not strictly stationary.

Definition (White Noise Processes). A white noise process $\{z_i\}$ is a covariance stationary process with zero mean and no serial correlation:
$$
E(z_i) = 0, \qquad \operatorname{Cov}(z_i, z_j) = 0, \quad i \neq j.
$$
139

[Figures: sample paths of four simulated time series.]
140

In the literature there is not a unique definition of ergodicity. We prefer to call what Hayashi calls an "ergodic process" a "weakly dependent process".

Definition. A stationary process $\{z_i\}$ is said to be a weakly dependent process (= ergodic in Hayashi's definition) if, for any two bounded functions $f: \mathbb{R}^{k+1} \to \mathbb{R}$ and $g: \mathbb{R}^{s+1} \to \mathbb{R}$,
$$
\lim_{n\to\infty}\left|E\!\left[f(z_i, \dots, z_{i+k})\, g(z_{i+n}, \dots, z_{i+n+s})\right]\right|
= \left|E\!\left[f(z_i, \dots, z_{i+k})\right]\right|\,\left|E\!\left[g(z_{i+n}, \dots, z_{i+n+s})\right]\right|.
$$
Theorem 2 (S&WD). Let $\{z_i\}$ be a stationary weakly dependent (S&WD) process with $E(z_i) = \mu$. Then $\bar z_n \overset{p}{\to} \mu$.

Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is allowed in this theorem, provided that it disappears in the long run. Since, for any function $f$, $\{f(z_i)\}$ is S&WD stationary whenever $\{z_i\}$ is, this theorem implies that any moment of a S&WD process (if it exists and is finite) is consistently estimated by the sample moment. For example, suppose $\{z_i\}$ is a S&WD process and $E(z_i z_i')$ exists and is finite. Then
$$
\frac{1}{n}\sum_{i=1}^n z_i z_i' \overset{p}{\to} E(z_i z_i').
$$
141

Definition (Martingale). A vector process $\{z_i\}$ is called a martingale with respect to $\{z_i\}$ if
$$
E(z_i\mid z_{i-1}, \dots, z_1) = z_{i-1} \quad \text{for } i \geq 2.
$$
The process
$$
z_i = z_{i-1} + \varepsilon_i,
$$
where $\{\varepsilon_i\}$ is a white noise process with $E(\varepsilon_i\mid z_{i-1}) = 0$, is a martingale since
$$
E(z_i\mid z_{i-1}, \dots, z_1) = E(z_i\mid z_{i-1}) = z_{i-1} + E(\varepsilon_i\mid z_{i-1}) = z_{i-1}.
$$
Definition (Martingale Difference Sequence). A vector process $\{g_i\}$ with $E(g_i) = 0$ is called a martingale difference sequence (MDS) or martingale differences if
$$
E(g_i\mid g_{i-1}, \dots, g_1) = 0.
$$
If $\{z_i\}$ is a martingale, the process defined as $g_i = z_i - z_{i-1}$ is a MDS.

Proposition. If $\{g_i\}$ is a MDS then $\operatorname{Cov}(g_i, g_{i-j}) = 0$, $j \neq 0$.
142

By definition
$$
\operatorname{Var}(\bar g_n) = \frac{1}{n^2}\operatorname{Var}\!\left(\sum_{t=1}^n g_t\right)
= \frac{1}{n^2}\left(\sum_{t=1}^n \operatorname{Var}(g_t) + 2\sum_{j=1}^{n-1}\sum_{i=j+1}^{n}\operatorname{Cov}(g_i, g_{i-j})\right).
$$
However, if $\{g_i\}$ is a stationary MDS with finite second moments then
$$
\sum_{t=1}^n \operatorname{Var}(g_t) = n\operatorname{Var}(g_t), \qquad \operatorname{Cov}(g_i, g_{i-j}) = 0,
$$
so
$$
\operatorname{Var}(\bar g_n) = \frac{1}{n}\operatorname{Var}(g_t).
$$
Definition (Random Walk). Let $\{g_i\}$ be a vector independent white noise process. A random walk, $\{z_i\}$, is a sequence of cumulative sums:
$$
z_i = g_i + g_{i-1} + \dots + g_1.
$$
Exercise 3.4. Show that the random walk can be written as
$$
z_i = z_{i-1} + g_i, \qquad z_1 = g_1.
$$
143

3.2.2 Different Formulations of Lack of Serial Dependence

We have three formulations of a lack of serial dependence for zero-mean covariance stationary processes:

(1) $\{g_i\}$ is independent white noise.

(2) $\{g_i\}$ is a stationary MDS with finite variance.

(3) $\{g_i\}$ is white noise.

(1) ⇒ (2) ⇒ (3).

Exercise 3.5 (Process that satisfies (2) but not (1) - the ARCH process). Consider $g_i = \sqrt{\alpha_0 + \alpha_1 g_{i-1}^2}\,\varepsilon_i$, where $\{\varepsilon_i\}$ is i.i.d. with mean zero and unit variance, $\alpha_0 > 0$ and $|\alpha_1| < 1$. Show that $\{g_i\}$ is a MDS but not an independent white noise.
144

3.2.3 The CLT for S&WD Martingale Difference Sequences

Theorem 3 (Stationary Martingale Differences CLT (Billingsley, 1961)). Let $\{g_i\}$ be a vector martingale difference sequence that is a S&WD process with $E(g_i g_i') = \Sigma$ and let $\bar g_n = \frac{1}{n}\sum_i g_i$. Then
$$
\sqrt n\,\bar g_n = \frac{1}{\sqrt n}\sum_{i=1}^n g_i \overset{d}{\to} N(0, \Sigma).
$$
Theorem 4 (Martingale Differences CLT (White, 1984)). Let $\{g_i\}$ be a vector martingale difference sequence. Suppose that (a) $E(g_i g_i') = \Sigma_i$ is a positive definite matrix with $\frac{1}{n}\sum_{i=1}^n \Sigma_i \to \bar\Sigma$ (a positive definite matrix), (b) $g_i$ has finite 4th moments, (c) $\frac{1}{n}\sum_{i=1}^n g_i g_i' \overset{p}{\to} \bar\Sigma$. Then
$$
\sqrt n\,\bar g_n = \frac{1}{\sqrt n}\sum_{i=1}^n g_i \overset{d}{\to} N(0, \bar\Sigma).
$$
145

3.3 Large-Sample Distribution of the OLS Estimator

The model presented in this section has probably the widest range of economic applications:

No specific distributional assumption (such as the normality of the error term) is required;

The requirement in finite-sample theory that the regressors be strictly exogenous or fixed is replaced by a much weaker requirement that they be "predetermined."

Assumption (2.1 - linearity). $y_i = x_i'\beta + \varepsilon_i$.

Assumption (2.2 - S&WD). $\{(y_i, x_i)\}$ is jointly S&WD.

Assumption (2.3 - predetermined regressors). All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term: $E(x_{ik}\varepsilon_i) = 0$, $\forall i, k$. This can be written as
$$
E(x_i\varepsilon_i) = 0 \quad\text{or}\quad E(g_i) = 0 \quad \text{where } g_i = x_i\varepsilon_i.
$$
Assumption (2.4 - rank condition). $E(x_i x_i') = \Sigma_{xx}$ is nonsingular.
146

Assumption (2.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments). $\{g_i\}$, where $g_i = x_i\varepsilon_i$, is a martingale difference sequence (so, a fortiori, $E(g_i) = 0$). The $K\times K$ matrix of cross moments, $E(g_i g_i')$, is nonsingular. We use $S$ for $\operatorname{Avar}(\bar g)$ (the variance of $\sqrt n\,\bar g$, where $\bar g = \frac{1}{n}\sum_i g_i$). By Assumption 2.2 and the S&WD Martingale Differences CLT, $S = E(g_i g_i')$.

Remarks:

1. (S&WD) A special case of S&WD is that $\{(y_i, x_i)\}$ is i.i.d. (random sample in cross-sectional data).

2. (The model accommodates conditional heteroskedasticity) If $\{(y_i, x_i)\}$ is stationary, then the error term $\varepsilon_i = y_i - x_i'\beta$ is also stationary. The conditional moment $E(\varepsilon_i^2\mid x_i)$ can depend on $x_i$ without violating any previous assumption, as long as $E(\varepsilon_i^2)$ is constant.
147

3. ($E(x_i\varepsilon_i) = 0$ vs. $E(\varepsilon_i\mid x_i) = 0$) The condition $E(\varepsilon_i\mid x_i) = 0$ is stronger than $E(x_i\varepsilon_i) = 0$. In effect,
$$
E(x_i\varepsilon_i) = E\!\left(E(x_i\varepsilon_i\mid x_i)\right) = E\!\left(x_i\,E(\varepsilon_i\mid x_i)\right) = E(x_i\cdot 0) = 0.
$$
4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3 restricts only the contemporaneous relationship between the error term and the regressors. The exogeneity assumption (Assumption 1.2) implies that, for any regressor $k$, $E(x_{jk}\varepsilon_i) = 0$ for all $i$ and $j$, not just for $i = j$. Strict exogeneity is a strong assumption that does not hold in general for time series models.
148

5. (Rank condition as no multicollinearity in the limit) Since
$$
b = \left(\frac{X'X}{n}\right)^{-1}\frac{X'y}{n}
= \left(\frac{1}{n}\sum x_i x_i'\right)^{-1}\frac{1}{n}\sum x_i y_i = S_{xx}^{-1}S_{xy}
$$
where
$$
S_{xx} = \frac{X'X}{n} = \frac{1}{n}\sum x_i x_i' \ \text{(sample average of } x_i x_i'\text{)}, \qquad
S_{xy} = \frac{X'y}{n} = \frac{1}{n}\sum x_i y_i \ \text{(sample average of } x_i y_i\text{)}.
$$
By Assumptions 2.2, 2.4 and the S&WD theorem we have
$$
\frac{X'X}{n} = \frac{1}{n}\sum_{i=1}^n x_i x_i' \overset{p}{\to} E(x_i x_i').
$$
Assumption 2.4 guarantees that the limit in probability of $\frac{X'X}{n}$ has rank $K$.
149

6. (A sufficient condition for $\{g_i\}$ to be a MDS) Since a MDS is zero-mean by definition, Assumption 2.5 is stronger than Assumption 2.3 (the latter is redundant in face of Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of the OLS estimator. A sufficient condition for $\{g_i\}$ to be an MDS is
$$
E(\varepsilon_i\mid \mathcal{F}_i) = 0 \quad\text{where}\quad
\mathcal{F}_i = \mathcal{I}_{i-1}\cup x_i = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \dots, \varepsilon_1, x_i, x_{i-1}, \dots, x_1\}, \quad
\mathcal{I}_{i-1} = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \dots, \varepsilon_1, x_{i-1}, \dots, x_1\}.
$$
(This condition implies that the error term is serially uncorrelated and also is uncorrelated with the current and past regressors.) Proof. Notice: $\{g_i\}$ is a MDS if
$$
E(g_i\mid g_{i-1}, \dots, g_1) = 0, \qquad g_i = x_i\varepsilon_i.
$$
Now, using the condition $E(\varepsilon_i\mid\mathcal{F}_i) = 0$,
$$
E(x_i\varepsilon_i\mid g_{i-1}, \dots, g_1) = E\!\left[E(x_i\varepsilon_i\mid\mathcal{F}_i)\mid g_{i-1}, \dots, g_1\right] = E[0\mid g_{i-1}, \dots, g_1] = 0,
$$
thus $E(\varepsilon_i\mid\mathcal{F}_i) = 0 \;\Rightarrow\; \{g_i\}$ is a MDS.
150

7. (When the regressors include a constant) Assumption 2.5 is
$$
E(x_i\varepsilon_i\mid g_{i-1}, \dots, g_1)
= E\!\left(\begin{bmatrix}1 \\ \vdots \\ x_{iK}\end{bmatrix}\varepsilon_i \,\Big|\, g_{i-1}, \dots, g_1\right) = 0
\;\Rightarrow\; E(\varepsilon_i\mid g_{i-1}, \dots, g_1) = 0,
$$
$$
E(\varepsilon_i\mid \varepsilon_{i-1}, \dots, \varepsilon_1)
= E\!\left(E(\varepsilon_i\mid g_{i-1}, \dots, g_1)\mid \varepsilon_{i-1}, \dots, \varepsilon_1\right) = 0.
$$
Assumption 2.5 implies that the error term itself is a MDS and hence is serially uncorrelated.

8. (S is a matrix of fourth moments)
$$
S = E(g_i g_i') = E(x_i\varepsilon_i\, x_i'\varepsilon_i) = E(\varepsilon_i^2 x_i x_i').
$$
Consistent estimation of S will require an additional assumption.
151

9. (S will take a different expression without Assumption 2.5) In general
$$
\operatorname{Avar}(\bar g) = \lim \operatorname{Var}\!\left(\sqrt n\,\bar g\right)
= \lim \operatorname{Var}\!\left(\sqrt n\,\frac{1}{n}\sum_{i=1}^n g_i\right)
= \lim \operatorname{Var}\!\left(\frac{1}{\sqrt n}\sum_{i=1}^n g_i\right)
= \lim \frac{1}{n}\operatorname{Var}\!\left(\sum_{i=1}^n g_i\right)
$$
$$
= \lim \frac{1}{n}\left(\sum_{i=1}^n \operatorname{Var}(g_i)
+ \sum_{j=1}^{n-1}\sum_{i=j+1}^{n}\left[\operatorname{Cov}(g_i, g_{i-j}) + \operatorname{Cov}(g_{i-j}, g_i)\right]\right)
$$
$$
= \lim \frac{1}{n}\sum_{i=1}^n \operatorname{Var}(g_i)
+ \lim \frac{1}{n}\sum_{j=1}^{n-1}\sum_{i=j+1}^{n}\left[E(g_i g_{i-j}') + E(g_{i-j} g_i')\right].
$$
Given stationarity, we have
$$
\frac{1}{n}\sum_{i=1}^n \operatorname{Var}(g_i) = \operatorname{Var}(g_i).
$$
Thanks to Assumption 2.5 we have $E(g_i g_{i-j}') = E(g_{i-j} g_i') = 0$, so
$$
S = \operatorname{Avar}(\bar g) = \operatorname{Var}(g_i) = E(g_i g_i').
$$
152

Proposition (2.1 - asymptotic distribution of the OLS Estimator). (a) (Consistency of b for $\beta$) Under Assumptions 2.1-2.4,
$$
b \overset{p}{\to} \beta.
$$
(b) (Asymptotic Normality of b) If Assumption 2.3 is strengthened to Assumption 2.5, then
$$
\sqrt n\,(b - \beta) \overset{d}{\to} N\!\left(0, \operatorname{Avar}(b)\right)
$$
where
$$
\operatorname{Avar}(b) = \Sigma_{xx}^{-1}\,S\,\Sigma_{xx}^{-1}.
$$
(c) (Consistent Estimate of Avar(b)) Suppose there is available a consistent estimator $\hat S$ of $S$. Then, under Assumption 2.2, $\operatorname{Avar}(b)$ is consistently estimated by
$$
\widehat{\operatorname{Avar}}(b) = S_{xx}^{-1}\,\hat S\,S_{xx}^{-1}
$$
where
$$
S_{xx} = \frac{X'X}{n} = \frac{1}{n}\sum_{i=1}^n x_i x_i'.
$$
153

Proposition (2.2 - consistent estimation of error variance). Under Assumptions 2.1-2.4,
$$
s^2 = \frac{1}{n-K}\sum_{i=1}^n e_i^2 \overset{p}{\to} E(\varepsilon_i^2)
$$
provided $E(\varepsilon_i^2)$ exists and is finite.

Under conditional homoskedasticity, $E(\varepsilon_i^2\mid x_i) = \sigma^2$ (we will see this in detail later), we have
$$
S = E(g_i g_i') = E(\varepsilon_i^2 x_i x_i') = \dots = \sigma^2 E(x_i x_i') = \sigma^2\Sigma_{xx}
$$
and
$$
\operatorname{Avar}(b) = \Sigma_{xx}^{-1} S\,\Sigma_{xx}^{-1} = \Sigma_{xx}^{-1}\sigma^2\Sigma_{xx}\Sigma_{xx}^{-1} = \sigma^2\Sigma_{xx}^{-1}, \qquad
\widehat{\operatorname{Avar}}(b) = s^2 S_{xx}^{-1} = s^2 n\left(X'X\right)^{-1}.
$$
Thus
$$
b \overset{a}{\sim} N\!\left(\beta, \frac{\widehat{\operatorname{Avar}}(b)}{n}\right) = N\!\left(\beta, s^2\left(X'X\right)^{-1}\right).
$$

3.4 Statistical Inference

Derivation of the distribution of test statistics is easier than in finite-sample theory because we are only concerned about the large-sample approximation to the exact distribution.

Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold, and suppose there is available a consistent estimate $\hat S$ of $S$. As before, let $\widehat{\operatorname{Avar}}(b) = S_{xx}^{-1}\hat S S_{xx}^{-1}$. Then

(a) Under the null hypothesis $H_0: \beta_k = \bar\beta_k$,
$$
t_k^0 = \frac{b_k - \bar\beta_k}{\hat\sigma_{b_k}} \overset{d}{\to} N(0,1), \qquad
\hat\sigma_{b_k}^2 = \frac{\widehat{\operatorname{Avar}}(b_k)}{n} = \frac{\left[S_{xx}^{-1}\hat S S_{xx}^{-1}\right]_{kk}}{n}.
$$
(b) Under the null hypothesis $H_0: R\beta = r$, with $\operatorname{rank}(R) = p$,
$$
W = n(Rb - r)'\left[R\,\widehat{\operatorname{Avar}}(b)\,R'\right]^{-1}(Rb - r) \overset{d}{\to} \chi^2(p).
$$
155

Remarks

$\hat\sigma_{b_k}$ is called the heteroskedasticity-consistent standard error, (heteroskedasticity) robust standard error, or White's standard error. The reason for this terminology is that the error term can be conditionally heteroskedastic. The t-ratio is called the robust t-ratio.

The differences from the finite-sample t-test are: (1) the way the standard error is calculated is different, (2) we use the table of $N(0,1)$ rather than that of $t(n-K)$, and (3) the actual size or exact size of the test (the probability of Type I error given the sample size) equals the nominal size (i.e., the desired significance level $\alpha$) only approximately, although the approximation becomes arbitrarily good as the sample size increases. The difference between the exact size and the nominal size of a test is called the size distortion.

Both tests are consistent in the sense that
$$
\text{power} = P(\text{rejecting the null } H_0\mid H_1 \text{ is true}) \to 1 \text{ as } n\to\infty.
$$

156

3.5 Estimating $S = E(\varepsilon_i^2 x_i x_i')$ Consistently

How to select an estimator for a population parameter? One of the most important methods is the analog estimation method or the method of moments. The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples of analog estimators:

Parameter of the population | Estimator
$E(y_i)$ | $\bar Y$
$\operatorname{Var}(y_i)$ | $S_y^2$
$\sigma_{xy}$ | $S_{xy}$
$\sigma_x^2$ | $S_x^2$
$P(y_i \leq c)$ | $\frac{1}{n}\sum_{i=1}^n I_{\{y_i \leq c\}}$
$\operatorname{median}(y_i)$ | sample median
$\max(y_i)$ | $\max_{i=1,\dots,n}(y_i)$

The analogy principle suggests that $E(\varepsilon_i^2 x_i x_i')$ can be estimated using the estimator
$$
\frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 x_i x_i'.
$$
Since $\varepsilon_i$ is not observable, we need another one:
$$
\hat S = \frac{1}{n}\sum_{i=1}^n e_i^2 x_i x_i'.
$$
Assumption (2.6 - finite fourth moments for regressors). $E\!\left[(x_{ik}x_{ij})^2\right]$ exists and is finite for all $k$ and $j$ ($k, j = 1, \dots, K$).

Proposition (2.4 - consistent estimation of S). Suppose $S = E(\varepsilon_i^2 x_i x_i')$ exists and is finite. Then, under Assumptions 2.1-2.4 and 2.6, $\hat S$ is consistent for $S$.
158

The estimator $\hat S$ can be represented as
$$
\hat S = \frac{1}{n}\sum_{i=1}^n e_i^2 x_i x_i' = \frac{X'BX}{n}
\quad\text{where}\quad
B = \begin{bmatrix}
e_1^2 & 0 & \cdots & 0 \\ 0 & e_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e_n^2
\end{bmatrix}.
$$
Thus $\widehat{\operatorname{Avar}}(b) = S_{xx}^{-1}\hat S S_{xx}^{-1} = n\left(X'X\right)^{-1}X'BX\left(X'X\right)^{-1}$. We have
$$
b \overset{a}{\sim} N\!\left(\beta, \frac{\widehat{\operatorname{Avar}}(b)}{n}\right)
= N\!\left(\beta, \frac{S_{xx}^{-1}\hat S S_{xx}^{-1}}{n}\right)
= N\!\left(\beta, \left(X'X\right)^{-1}X'BX\left(X'X\right)^{-1}\right),
$$
$$
W = n(Rb - r)'\left[R\,\widehat{\operatorname{Avar}}(b)\,R'\right]^{-1}(Rb - r)
= n(Rb - r)'\left[R S_{xx}^{-1}\hat S S_{xx}^{-1}R'\right]^{-1}(Rb - r)
= (Rb - r)'\left[R\left(X'X\right)^{-1}X'BX\left(X'X\right)^{-1}R'\right]^{-1}(Rb - r)
\overset{d}{\to} \chi^2(p).
$$

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.724551 -2.164014 0.0309


FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.825934 -1.898382 0.0582


FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000
160

3.6 Implications of Conditional Homoskedasticity

Assumption (2.7 - conditional homoskedasticity). $E(\varepsilon_i^2\mid x_i) = \sigma^2 > 0$.

Under Assumption 2.7 we have
$$
S = E(\varepsilon_i^2 x_i x_i') = \dots = \sigma^2 E(x_i x_i') = \sigma^2\Sigma_{xx}
\quad\text{and}\quad
\operatorname{Avar}(b) = \Sigma_{xx}^{-1}S\Sigma_{xx}^{-1} = \Sigma_{xx}^{-1}\sigma^2\Sigma_{xx}\Sigma_{xx}^{-1} = \sigma^2\Sigma_{xx}^{-1}.
$$
Proposition (2.5 - large-sample properties of b, t, and F under conditional homoskedasticity). Suppose Assumptions 2.1-2.5 and 2.7 are satisfied. Then

(a) (Asymptotic distribution of b) The OLS estimator b is consistent and asymptotically normal with
$$
\operatorname{Avar}(b) = \sigma^2\Sigma_{xx}^{-1}.
$$
(b) (Consistent estimation of asymptotic variance) Under the same set of assumptions, $\operatorname{Avar}(b)$ is consistently estimated by
$$
\widehat{\operatorname{Avar}}(b) = s^2 S_{xx}^{-1} = n s^2\left(X'X\right)^{-1}.
$$
161

(c) (Asymptotic distribution of the t and F statistics of the finite-sample theory)

Under $H_0: \beta_k = \bar\beta_k$ we have
$$
t_k^0 = \frac{b_k - \bar\beta_k}{\hat\sigma_{b_k}} \overset{d}{\to} N(0,1), \qquad
\hat\sigma_{b_k}^2 = \frac{\widehat{\operatorname{Avar}}(b_k)}{n} = s^2\left[\left(X'X\right)^{-1}\right]_{kk}.
$$
Under $H_0: R\beta = r$ with $\operatorname{rank}(R) = p$, we have
$$
pF^0 \overset{d}{\to} \chi^2(p)
\quad\text{where}\quad
F^0 = \frac{(Rb - r)'\left[R\left(X'X\right)^{-1}R'\right]^{-1}(Rb - r)}{p s^2}.
$$
Notice
$$
pF^0 = \frac{\tilde e'\tilde e - e'e}{e'e/(n-K)} \overset{d}{\to} \chi^2(p)
$$
where $\tilde{\ }$ refers to the short regression or the regression subjected to the constraint $R\beta = r$.

Remark (No need for fourth-moment assumption). By S&WD and Assumptions 2.1-2.4, $s^2 S_{xx} \overset{p}{\to} \sigma^2\Sigma_{xx} = S$. We do not need the fourth-moment assumption (Assumption 2.6) for consistency.
162

3.7 Testing Conditional Homoskedasticity

With the advent of robust standard errors, which allow us to do inference without specifying the conditional second moment, testing conditional homoskedasticity is not as important as it used to be. This section presents only the most popular test, due to White (1980), for the case of random samples.

Let $\psi_i$ be a vector collecting the unique and nonconstant elements of the $K\times K$ symmetric matrix $x_i x_i'$.

Proposition (2.6 - White's Test for Conditional Heteroskedasticity). In addition to Assumptions 2.1 and 2.4, suppose that (a) $\{(y_i, x_i)\}$ is i.i.d. with finite $E(\varepsilon_i^2 x_i x_i')$ (thus strengthening Assumptions 2.2 and 2.5), (b) $\varepsilon_i$ is independent of $x_i$ (thus strengthening Assumption 2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments of $\varepsilon_i$ and $x_i$. Then, under $H_0: E(\varepsilon_i^2\mid x_i) = \sigma^2$ (constant), we have
$$
nR^2 \overset{d}{\to} \chi^2(m)
$$
where $R^2$ is the $R^2$ from the auxiliary regression of $e_i^2$ on a constant and $\psi_i$, and $m$ is the dimension of $\psi_i$.
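A hedged sketch of this test with statsmodels (`res` is assumed to be a fitted OLS results object whose regressor matrix includes a constant):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

# e.g. res = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, res.model.exog)
print(lm_stat, lm_pvalue)      # lm_stat is the nR^2 statistic, compared with chi-square(m)
```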
163

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526
Included observations: 526

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.724551 -2.164014 0.0309


FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000
164

Heteroskedasticity Test: White

F-statistic 5.911627 Prob. F(13,512) 0.0000


Obs*R-squared 68.64843 Prob. Chi-Square(13) 0.0000
Scaled explained SS 227.2648 Prob. Chi-Square(13) 0.0000

Test Equation:
Dependent Variable: RESID^2

Variable Coefficient Std. Error t-Statistic Prob.

C 47.03183 20.19579 2.328794 0.0203


FEMALE -7.205436 10.92406 -0.659593 0.5098
FEMALE*EDUC 0.491073 0.778127 0.631097 0.5283
FEMALE*EXPER -0.154634 0.168490 -0.917768 0.3592
FEMALE*TENURE 0.066832 0.351582 0.190089 0.8493
EDUC -7.693423 2.596664 -2.962811 0.0032
EDUC^2 0.315191 0.086457 3.645652 0.0003
EDUC*EXPER 0.045665 0.036134 1.263789 0.2069
EDUC*TENURE 0.083929 0.054140 1.550226 0.1217
EXPER 0.000257 0.610348 0.000421 0.9997
EXPER^2 -0.009134 0.007010 -1.303002 0.1932
EXPER*TENURE -0.004066 0.017603 -0.230969 0.8174
TENURE -0.298093 0.934417 -0.319015 0.7498
TENURE^2 -0.004633 0.016358 -0.283255 0.7771

R-squared 0.130510 Mean dependent var 8.664083


Adjusted R-squared 0.108433 S.D. dependent var 22.52940
S.E. of regression 21.27289 Akaike info criterion 8.978999
Sum squared resid 231698.4 Schwarz criterion 9.092525
Log likelihood -2347.477 Hannan-Quinn criter. 9.023450
F-statistic 5.911627 Durbin-Watson stat 1.905515
Prob(F-statistic) 0.000000
165

Dependent Variable: WAGE


Method: Least Squares
Included observations: 526
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.825934 -1.898382 0.0582


FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000

3.8 Estimation with Parameterized Conditional Heteroskedasticity

Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still
consistent and asymptotically normal, and valid statistical inference can be conducted with
robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case
of a priori knowledge of the functional form of the conditional second moment, it should be
possible to obtain sharper estimates with smaller asymptotic variance.
166

To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5 by assuming that $\{(y_i, x_i)\}$ is i.i.d.

3.8.1 The Functional Form

The parametric functional form for the conditional second moment we consider is
$$
E(\varepsilon_i^2\mid x_i) = z_i'\alpha
$$
where $z_i$ is a function of $x_i$.

For example, $E(\varepsilon_i^2\mid x_i) = \alpha_1 + \alpha_2 x_{i2}^2$, i.e.
$$
z_i' = \begin{bmatrix} 1 & x_{i2}^2 \end{bmatrix}.
$$
167

3.8.2 WLS with Known α

The WLS (also GLS) estimator can be obtained by applying OLS to the regression
$$
\tilde y_i = \tilde x_i'\beta + \tilde\varepsilon_i
$$
where
$$
\tilde y_i = \frac{y_i}{\sqrt{z_i'\alpha}}, \qquad \tilde x_{ik} = \frac{x_{ik}}{\sqrt{z_i'\alpha}}, \qquad \tilde\varepsilon_i = \frac{\varepsilon_i}{\sqrt{z_i'\alpha}}, \qquad i = 1, 2, \dots, n.
$$
We have
$$
\hat\beta_{GLS} = \hat\beta(V) = \left(\tilde X'\tilde X\right)^{-1}\tilde X'\tilde y = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y.
$$
168

Note that
$$
E(\tilde\varepsilon_i\mid \tilde x_i) = 0.
$$
Therefore, provided that $E(\tilde x_i\tilde x_i')$ is nonsingular, Assumptions 2.1-2.5 are satisfied for the equation $\tilde y_i = \tilde x_i'\beta + \tilde\varepsilon_i$. Furthermore, by construction, the error $\tilde\varepsilon_i$ is conditionally homoskedastic: $E(\tilde\varepsilon_i^2\mid\tilde x_i) = 1$. So Proposition 2.5 applies: the WLS estimator is consistent and asymptotically normal, and the asymptotic variance is
$$
\operatorname{Avar}\!\left(\hat\beta(V)\right) = \left[E(\tilde x_i\tilde x_i')\right]^{-1}
= \operatorname{plim}\left(\frac{1}{n}\sum_{i=1}^n \tilde x_i\tilde x_i'\right)^{-1} \ \text{(by the S\&WD theorem)}
= \operatorname{plim}\left(\frac{1}{n}X'V^{-1}X\right)^{-1}.
$$
Thus $\left(\frac{1}{n}X'V^{-1}X\right)^{-1}$ is a consistent estimator of $\operatorname{Avar}(\hat\beta(V))$.
169

3.8.3 Regression of $e_i^2$ on $z_i$ Provides a Consistent Estimate of α

If α is unknown we need to obtain $\hat\alpha$. Assuming $E(\varepsilon_i^2\mid x_i) = z_i'\alpha$ we have
$$
\varepsilon_i^2 = E(\varepsilon_i^2\mid x_i) + \eta_i
$$
where by construction $E(\eta_i\mid x_i) = 0$. This suggests that the following regression can be considered:
$$
\varepsilon_i^2 = z_i'\alpha + \eta_i.
$$
Provided that $E(z_i z_i')$ is nonsingular, Proposition 2.1 is applicable to this auxiliary regression: the OLS estimator of α is consistent and asymptotically normal. However, we cannot run this regression as $\varepsilon_i$ is not observable. In the previous regression we should replace $\varepsilon_i$ by the consistent estimate $e_i$ (despite the presence of conditional heteroskedasticity). In conclusion, we may obtain a consistent estimate of α by considering the regression of $e_i^2$ on $z_i$ to get
$$
\hat\alpha = \left(\sum_{i=1}^n z_i z_i'\right)^{-1}\sum_{i=1}^n z_i e_i^2.
$$
170

3.8.4 WLS with Estimated α

Step 1: Estimate the equation $y_i = x_i'\beta + \varepsilon_i$ by OLS and compute the OLS residuals $e_i$.

Step 2: Regress $e_i^2$ on $z_i$ to obtain the OLS coefficient estimate $\hat\alpha$.

Step 3: Transform the original variables according to the rules
$$
\tilde y_i = \frac{y_i}{\sqrt{z_i'\hat\alpha}}, \qquad \tilde x_{ik} = \frac{x_{ik}}{\sqrt{z_i'\hat\alpha}}, \qquad i = 1, 2, \dots, n,
$$
and run the OLS estimator with respect to the model $\tilde y_i = \tilde x_i'\beta + \tilde\varepsilon_i$ to obtain the Feasible GLS (FGLS) estimator:
$$
\hat\beta(\hat V) = \left(X'\hat V^{-1}X\right)^{-1}X'\hat V^{-1}y.
$$
171

It can be proved that:

$$
\hat\beta(\hat V) \overset{p}{\to} \beta,
$$
$$
\sqrt n\left(\hat\beta(\hat V) - \beta\right) \overset{d}{\to} N\!\left(0, \operatorname{Avar}(\hat\beta(V))\right),
$$
$\left(\frac{1}{n}X'\hat V^{-1}X\right)^{-1}$ is a consistent estimator of $\operatorname{Avar}(\hat\beta(V))$.

No finite-sample properties are known concerning the estimator $\hat\beta(\hat V)$.
172

3.8.5 A popular specification for $E(\varepsilon_i^2\mid x_i)$

The specification $\varepsilon_i^2 = z_i'\alpha + \eta_i$ may lead to $z_i'\hat\alpha < 0$. To overcome this problem a popular specification for $E(\varepsilon_i^2\mid x_i)$ is
$$
E(\varepsilon_i^2\mid x_i) = \exp\{x_i'\alpha\}
$$
(it guarantees that $\operatorname{Var}(y_i\mid x_i) > 0$ for all $\alpha \in \mathbb{R}^r$). It implies $\log E(\varepsilon_i^2\mid x_i) = x_i'\alpha$. This suggests the following procedure:

a) Regress y on X to get the residual vector e.

b) Run the LS regression of $\log e_i^2$ on $x_i$ to estimate α and calculate
$$
\hat\sigma_i^2 = \exp\{x_i'\hat\alpha\}.
$$
c) Transform the data $\tilde y_i = \dfrac{y_i}{\hat\sigma_i}$, $\tilde x_{ij} = \dfrac{x_{ij}}{\hat\sigma_i}$.

d) Regress $\tilde y$ on $\tilde X$ and obtain $\hat\beta(\hat V)$.
173

Notice also that:
$$
E(\varepsilon_i^2\mid x_i) = \exp\{x_i'\alpha\},
$$
$$
\varepsilon_i^2 = \exp\{x_i'\alpha\} + v_i, \qquad v_i = \varepsilon_i^2 - E(\varepsilon_i^2\mid x_i),
$$
$$
\log \varepsilon_i^2 \approx x_i'\alpha + v_i, \qquad \log e_i^2 \approx x_i'\alpha + v_i.
$$
Example (Part 1). We want to estimate a demand function for daily cigarette consumption
(cigs). The explanatory variables are: log(income) - log of annual income, log(cigprice) -
log of per pack price of cigarettes in cents, educ - years of education, age and restaurn
- binary indicator equal to unity if the person resides in a state with restaurant smoking
restrictions (source: J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and
Statistics 79, 596-593).

Based on the information below, are the standard errors reported in the first table reliable?
174

Heteroskedasticity Test: White

F-statistic 2.159258 Prob. F(25,781) 0.0009


Obs*R-squared 52.17245 Prob. Chi-Square(25) 0.0011
Scaled explained SS 110.0813 Prob. Chi-Square(25) 0.0000
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807 Test Equation:
Dependent Variable: RESID^2
Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob.
C -3.639823 24.07866 -0.151164 0.8799 C 29374.77 20559.14 1.428794 0.1535
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 LOG(INCOME) -1049.630 963.4359 -1.089466 0.2763
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 (LOG(INCOME))^2 -3.941183 17.07122 -0.230867 0.8175
EDUC -0.501498 0.167077 -3.001596 0.0028 (LOG(INCOME))*(LOG(CIGPRIC)) 329.8896 239.2417 1.378897 0.1683
AGE 0.770694 0.160122 4.813155 0.0000 (LOG(INCOME))*EDUC -9.591849 8.047066 -1.191969 0.2336
(LOG(INCOME))*AGE -3.354565 6.682194 -0.502015 0.6158
AGE^2 -0.009023 0.001743 -5.176494 0.0000 (LOG(INCOME))*(AGE^2) 0.026704 0.073025 0.365689 0.7147
RESTAURN -2.825085 1.111794 -2.541016 0.0112 (LOG(INCOME))*RESTAURN -59.88700 49.69039 -1.205203 0.2285
LOG(CIGPRIC) -10340.68 9754.559 -1.060087 0.2894
R-squared 0.052737 Mean dependent var 8.686493 (LOG(CIGPRIC))^2 668.5294 1204.316 0.555111 0.5790
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 (LOG(CIGPRIC))*EDUC 32.91371 59.06252 0.557269 0.5775
S.E. of regression 13.40479 Akaike info criterion 8.037737 (LOG(CIGPRIC))*AGE 62.88164 55.29011 1.137304 0.2558
(LOG(CIGPRIC))*(AGE^2) -0.622371 0.594730 -1.046477 0.2957
Sum squared resid 143750.7 Schwarz criterion 8.078448 (LOG(CIGPRIC))*RESTAURN 862.1577 720.6219 1.196408 0.2319
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 EDUC -117.4705 251.2852 -0.467479 0.6403
F-statistic 7.423062 Durbin-Watson stat 2.012825 EDUC^2 -0.290343 1.287605 -0.225491 0.8217
Prob(F-statistic) 0.000000 EDUC*AGE 3.617048 1.724659 2.097254 0.0363
EDUC*(AGE^2) -0.035558 0.017664 -2.012988 0.0445
EDUC*RESTAURN -2.896490 10.65709 -0.271790 0.7859
AGE -264.1461 235.7624 -1.120391 0.2629
AGE^2 3.468601 3.194651 1.085753 0.2779
AGE*(AGE^2) -0.019111 0.028655 -0.666935 0.5050
AGE*RESTAURN -4.933199 10.84029 -0.455080 0.6492
(AGE^2)^2 0.000118 0.000146 0.807552 0.4196
(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497
RESTAURN -2868.196 2986.776 -0.960299 0.3372

cigs: number of cigarettes smoked per day, log(income): log of annual income, log(cigprice):
log of per pack price of cigarettes in cents, educ: years of education, age and restaurn:
binary indicator equal to unity if the person resides in a state with restaurant smoking re-
strictions.
175

Example (Part 2). Discuss the results of the following figures.

Dependent Variable: CIGS Dependent Variable: CIGS


Method: Least Squares Method: Least Squares
Sample: 1 807 Sample: 1 807
White Heteroskedasticity-Consistent Standard Errors & Covariance
Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob.
C -3.639823 24.07866 -0.151164 0.8799
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 C -3.639823 25.61646 -0.142089 0.8870
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 LOG(INCOME) 0.880268 0.596011 1.476931 0.1401
EDUC -0.501498 0.167077 -3.001596 0.0028 LOG(CIGPRIC) -0.750862 6.035401 -0.124410 0.9010
AGE 0.770694 0.160122 4.813155 0.0000 EDUC -0.501498 0.162394 -3.088167 0.0021
AGE^2 -0.009023 0.001743 -5.176494 0.0000 AGE 0.770694 0.138284 5.573262 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112 AGE^2 -0.009023 0.001462 -6.170768 0.0000
RESTAURN -2.825085 1.008033 -2.802573 0.0052
R-squared 0.052737 Mean dependent var 8.686493
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 R-squared 0.052737 Mean dependent var 8.686493
S.E. of regression 13.40479 Akaike info criterion 8.037737 Adjusted R-squared 0.045632 S.D. dependent var 13.72152
Sum squared resid 143750.7 Schwarz criterion 8.078448 S.E. of regression 13.40479 Akaike info criterion 8.037737
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 Sum squared resid 143750.7 Schwarz criterion 8.078448
F-statistic 7.423062 Durbin-Watson stat 2.012825 Log likelihood -3236.227 Hannan-Quinn criter. 8.053370
Prob(F-statistic) 0.000000 F-statistic 7.423062 Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
176

Example (Part 3). a) Regress y on X to get the residual vector e:

Dependent Variable: CIGS


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

C -3.639823 24.07866 -0.151164 0.8799


LOG(INCOME) 0.880268 0.727783 1.209519 0.2268
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966
EDUC -0.501498 0.167077 -3.001596 0.0028
AGE 0.770694 0.160122 4.813155 0.0000
AGE^2 -0.009023 0.001743 -5.176494 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112

R-squared 0.052737 Mean dependent var 8.686493


Adjusted R-squared 0.045632 S.D. dependent var 13.72152
S.E. of regression 13.40479 Akaike info criterion 8.037737
Sum squared resid 143750.7 Schwarz criterion 8.078448
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370
F-statistic 7.423062 Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
177

b) Run the LS regression log e2i on xi

Dependent Variable: LOG(RES^2)


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

C -1.920691 2.563033 -0.749382 0.4538


LOG(INCOME) 0.291540 0.077468 3.763351 0.0002
LOG(CIGPRIC) 0.195418 0.614539 0.317992 0.7506
EDUC -0.079704 0.017784 -4.481657 0.0000
AGE 0.204005 0.017044 11.96928 0.0000
AGE^2 -0.002392 0.000186 -12.89313 0.0000
RESTAURN -0.627011 0.118344 -5.298213 0.0000

R-squared 0.247362 Mean dependent var 4.207486


Adjusted R-squared 0.241717 S.D. dependent var 1.638575
S.E. of regression 1.426862 Akaike info criterion 3.557468
Sum squared resid 1628.747 Schwarz criterion 3.598178
Log likelihood -1428.438 Hannan-Quinn criter. 3.573101
F-statistic 43.82129 Durbin-Watson stat 2.024587
Prob(F-statistic) 0.000000

Calculate $\hat\sigma_i^2 = \exp\{x_i'\hat\alpha\} = \exp\{\widehat{\log e_i^2}\}$.

Notice: $\widehat{\log e_1^2}, \dots, \widehat{\log e_n^2}$ are the fitted values of the above regression.
178

c) Transform the data
$$
\tilde y_i = \frac{y_i}{\hat\sigma_i}, \qquad \tilde x_{ij} = \frac{x_{ij}}{\hat\sigma_i},
$$
and d) regress $\tilde y$ on $\tilde X$ to obtain $\hat\beta(\hat V)$:

Dependent Variable: CIGS/SIGMA


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

1/SIGMA 5.635471 17.80314 0.316544 0.7517


LOG(INCOME)/SIGMA 1.295239 0.437012 2.963855 0.0031
LOG(CIGPRIC)/SIGMA -2.940314 4.460145 -0.659242 0.5099
EDUC/SIGMA -0.463446 0.120159 -3.856953 0.0001
AGE/SIGMA 0.481948 0.096808 4.978378 0.0000
AGE^2/SIGMA -0.005627 0.000939 -5.989706 0.0000
RESTAURN/SIGMA -3.461064 0.795505 -4.350776 0.0000

R-squared 0.002751 Mean dependent var 0.966192


Adjusted R-squared -0.004728 S.D. dependent var 1.574979
S.E. of regression 1.578698 Akaike info criterion 3.759715
Sum squared resid 1993.831 Schwarz criterion 3.800425
Log likelihood -1510.045 Hannan-Quinn criter. 3.775347
Durbin-Watson stat 2.049719
179

3.8.6 OLS versus WLS

Under certain conditions we have:

$b$ and $\hat\beta(\hat V)$ are consistent.

Assuming that the functional form of the conditional second moment is correctly specified, $\hat\beta(\hat V)$ is asymptotically more efficient than $b$.

It is not clear which estimator is better (in terms of efficiency) in the following situations:

– the functional form of the conditional second moment is misspecified;

– in finite samples, even if the functional form is correctly specified, the large-sample approximation will probably work less well for the WLS estimator than for OLS because of the estimation of the extra parameters (α) involved in the WLS procedure.
180

3.9 Serial Correlation

Because the issue of serial correlation arises almost always in time-series models, we use the
subscript "t" instead of "i" in this section. Throughout this section we assume that the
regressors include a constant. The issue is how to deal with
$$
E(\varepsilon_t\varepsilon_{t-j}\mid x_{t-j}, x_t) \neq 0.
$$
181

3.9.1 Usual Inference is not Valid

When the regressors include a constant (true in virtually all known applications), Assumption
2.5 implies that the error term is a scalar martingale di¤erence sequence, so if the error
is found to be serially correlated (or autocorrelated), that is an indication of a failure of
Assumption 2.5.

We have $\operatorname{Cov}(g_t, g_{t-j}) \neq 0$. In fact,
$$
\operatorname{Cov}(g_t, g_{t-j}) = E\!\left(x_t\varepsilon_t\, x_{t-j}'\varepsilon_{t-j}\right)
= E\!\left[E\!\left(x_t\varepsilon_t x_{t-j}'\varepsilon_{t-j}\mid x_{t-j}, x_t\right)\right]
= E\!\left[x_t x_{t-j}'\, E(\varepsilon_t\varepsilon_{t-j}\mid x_{t-j}, x_t)\right] \neq 0.
$$
Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consistent even if the error is autocorrelated. However, the large-sample properties of b, t, and F of Proposition 2.5 are not valid. To see why, consider
$$
\sqrt n\,(b - \beta) = S_{xx}^{-1}\sqrt n\,\bar g.
$$
182

We have
$$
\operatorname{Avar}(b) = \Sigma_{xx}^{-1} S\,\Sigma_{xx}^{-1}, \qquad
\widehat{\operatorname{Avar}}(b) = S_{xx}^{-1}\hat S S_{xx}^{-1}.
$$
If the errors are not autocorrelated:
$$
S = \operatorname{Var}\!\left(\sqrt n\,\bar g\right) = \operatorname{Var}(g_t).
$$
If the errors are autocorrelated:
$$
S = \operatorname{Var}\!\left(\sqrt n\,\bar g\right)
= \operatorname{Var}(g_t) + \frac{1}{n}\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E(g_t g_{t-j}') + E(g_{t-j} g_t')\right].
$$
Since $\operatorname{Cov}(g_t, g_{t-j}) \neq 0$ and $E(g_{t-j} g_t') \neq 0$ we have
$$
S \neq \operatorname{Var}(g_t), \quad\text{i.e.}\quad S \neq E(g_t g_t').
$$
If the errors are serially correlated we cannot use $s^2\frac{1}{n}\sum_{t=1}^n x_t x_t'$ or $\frac{1}{n}\sum_{t=1}^n e_t^2 x_t x_t'$ (robust to conditional heteroskedasticity) as consistent estimators of S.
183

3.9.2 Testing Serial Correlation

Consider the regression $y_t = x_t'\beta + \varepsilon_t$. We want to test whether or not $\varepsilon_t$ is serially correlated.

Consider
$$
\rho_j = \frac{\operatorname{Cov}(\varepsilon_t, \varepsilon_{t-j})}{\sqrt{\operatorname{Var}(\varepsilon_t)\operatorname{Var}(\varepsilon_{t-j})}}
= \frac{\operatorname{Cov}(\varepsilon_t, \varepsilon_{t-j})}{\operatorname{Var}(\varepsilon_t)}
= \frac{\gamma_j}{\gamma_0}
= \frac{E(\varepsilon_t\varepsilon_{t-j})}{E(\varepsilon_t^2)}.
$$
Since $\rho_j$ is not observable, we need to consider
$$
\tilde\rho_j = \frac{\tilde\gamma_j}{\tilde\gamma_0}, \qquad
\tilde\gamma_j = \frac{1}{n}\sum_{t=j+1}^n \varepsilon_t\varepsilon_{t-j}, \qquad
\tilde\gamma_0 = \frac{1}{n}\sum_{t=1}^n \varepsilon_t^2.
$$
184

Proposition. If $\{\varepsilon_t\}$ is a stationary MDS with $E(\varepsilon_t^2\mid \varepsilon_{t-1}, \varepsilon_{t-2}, \dots) = \sigma^2$, then
$$
\sqrt n\,\tilde\gamma_j \overset{d}{\to} N(0, \sigma^4) \quad\text{and}\quad \sqrt n\,\tilde\rho_j \overset{d}{\to} N(0,1).
$$
Proposition. Under the assumptions of the previous proposition,
$$
\text{Box-Pierce } Q \text{ statistic} = Q_{BP} = \sum_{j=1}^p\left(\sqrt n\,\tilde\rho_j\right)^2 = n\sum_{j=1}^p \tilde\rho_j^2 \overset{d}{\to} \chi^2(p).
$$
However, $\tilde\rho_j$ is still unfeasible as we do not observe the errors. Thus,
$$
\hat\rho_j = \frac{\hat\gamma_j}{\hat\gamma_0}, \qquad
\hat\gamma_j = \frac{1}{n}\sum_{t=j+1}^n e_t e_{t-j}, \qquad
\hat\gamma_0 = \frac{1}{n}\sum_{t=1}^n e_t^2 \ \left(\textstyle\sum_t e_t^2 = \text{SSR}\right).
$$
Exercise 3.6. Prove that $\hat\rho_j$ can be obtained from the regression of $e_t$ on $e_{t-j}$ (without intercept).
185

Testing with Strictly Exogenous Regressors

To test $H_0: \rho_j = 0$ we consider the following proposition:

Proposition (testing for serial correlation with strictly exogenous regressors). Suppose that Assumptions 1.2, 2.1, 2.2, 2.4 are satisfied. Then
$$
\hat\rho_j \overset{p}{\to} 0, \qquad \sqrt n\,\hat\rho_j \overset{d}{\to} N(0,1).
$$
186

To test $H_0: \rho_1 = \rho_2 = \cdots = \rho_p = 0$ we consider the following proposition:

Proposition (Box-Pierce Q & Ljung-Box Q). Suppose that Assumptions 1.2, 2.1, 2.2, 2.4 are satisfied. Then

$Q_{BP} = n\sum_{j=1}^{p} \hat{\rho}_j^2 \xrightarrow{d} \chi^2(p), \qquad Q_{LB} = n(n+2)\sum_{j=1}^{p} \frac{\hat{\rho}_j^2}{n-j} \xrightarrow{d} \chi^2(p).$

It can be shown that the hypothesis $H_0: \rho_1 = \rho_2 = \cdots = \rho_p = 0$ can also be tested through the following auxiliary regression:

regression of $e_t$ on $e_{t-1}, \ldots, e_{t-p}$.

We calculate the F statistic for the hypothesis that the $p$ coefficients of $e_{t-1}, \ldots, e_{t-p}$ are all zero.
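As an illustration, the sketch below computes the residual autocorrelations and the two Q statistics with plain NumPy. The data are simulated and the variable names (y, X, p_lags) are ours, not part of the text; it is only meant to mirror the formulas above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p_lags = 200, 12

# Simulate a regression with AR(1) errors so the test has something to detect.
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.4 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

# OLS residuals e_t
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# rho_hat_j = gamma_hat_j / gamma_hat_0
gamma0 = np.sum(e**2) / n
rho = np.array([np.sum(e[j:] * e[:-j]) / n / gamma0 for j in range(1, p_lags + 1)])

Q_BP = n * np.sum(rho**2)
Q_LB = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, p_lags + 1)))

print("Q_BP =", Q_BP, "p-value =", stats.chi2.sf(Q_BP, p_lags))
print("Q_LB =", Q_LB, "p-value =", stats.chi2.sf(Q_LB, p_lags))
```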
187

Testing with Predetermined, but Not Strictly Exogenous, Regressors


If the regressors are not strictly exogenous, $\sqrt{n}\,\hat{\rho}_j$ no longer has a $N(0,1)$ limiting distribution and the residual-based Q statistic may not be asymptotically chi-squared.

The trick consists in removing the effect of $x_t$ in the regression of $e_t$ on $e_{t-1}, \ldots, e_{t-p}$ by considering now the

regression of $e_t$ on $x_t, e_{t-1}, \ldots, e_{t-p}$

and then calculating the F statistic for the hypothesis that the $p$ coefficients of $e_{t-1}, \ldots, e_{t-p}$ are all zero. This regression is still valid when the regressors are strictly exogenous (so you may always use this regression).

Given

$e_t = \alpha_1 + \alpha_2 x_{t2} + \cdots + \alpha_K x_{tK} + \rho_1 e_{t-1} + \cdots + \rho_p e_{t-p} + \text{error}_t,$

the null hypothesis can be formulated as

$H_0: \rho_1 = \cdots = \rho_p = 0.$
Use the F test:
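A minimal sketch of this auxiliary regression and its F test, again with simulated data and hypothetical variable names (nothing here comes from EViews):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p_lags = 200, 4

# Simulated regression with AR(1) errors.
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.4 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Auxiliary regression: e_t on x_t and e_{t-1},...,e_{t-p} (initial observations dropped).
lags = np.column_stack([e[p_lags - j:n - j] for j in range(1, p_lags + 1)])
Z_r = X[p_lags:]                        # restricted: x_t only
Z_u = np.column_stack([Z_r, lags])      # unrestricted: x_t plus lagged residuals
e_t = e[p_lags:]

ssr_r = np.sum((e_t - Z_r @ np.linalg.lstsq(Z_r, e_t, rcond=None)[0])**2)
ssr_u = np.sum((e_t - Z_u @ np.linalg.lstsq(Z_u, e_t, rcond=None)[0])**2)

df2 = len(e_t) - Z_u.shape[1]
F = ((ssr_r - ssr_u) / p_lags) / (ssr_u / df2)
print("F =", F, "p-value =", stats.f.sf(F, p_lags, df2))
```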
188

EVIEWS
189

Example. Consider the following variables: chnimp, the volume of imports of barium chloride from China; chempi, an index of chemical production (to control for overall demand for barium chloride); gas, the volume of gasoline production (another demand variable); rtwex, an exchange rate index (measuring the strength of the dollar against several other currencies).

Equation 1
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131

Variable Coefficient Std. Error t-Statistic Prob.

C -19.75991 21.08580 -0.937119 0.3505


LOG(CHEMPI) 3.044302 0.478954 6.356142 0.0000
LOG(GAS) 0.349769 0.906247 0.385953 0.7002
LOG(RTWEX) 0.717552 0.349450 2.053378 0.0421

R-squared 0.280905 Mean dependent var 6.174599


Adjusted R-squared 0.263919 S.D. dependent var 0.699738
S.E. of regression 0.600341 Akaike info criterion 1.847421
Sum squared resid 45.77200 Schwarz criterion 1.935213
Log likelihood -117.0061 Hannan-Quinn criter. 1.883095
F-statistic 16.53698 Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
190

Equation 2
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 2.337861 Prob. F(12,115) 0.0102


Obs*R-squared 25.69036 Prob. Chi-Square(12) 0.0119

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Presample missing value lagged residuals set to zero.

Variable Coefficient Std. Error t-Statistic Prob.

C -3.074901 20.73522 -0.148294 0.8824


LOG(CHEMPI) 0.084948 0.457958 0.185493 0.8532
LOG(GAS) 0.110527 0.892301 0.123867 0.9016
LOG(RTWEX) 0.030365 0.333890 0.090942 0.9277
RESID(-1) 0.234579 0.093215 2.516546 0.0132
RESID(-2) 0.182743 0.095624 1.911051 0.0585
RESID(-3) 0.164748 0.097176 1.695366 0.0927
RESID(-4) -0.180123 0.098565 -1.827464 0.0702
RESID(-5) -0.041327 0.099482 -0.415425 0.6786
RESID(-6) 0.038597 0.098345 0.392468 0.6954
RESID(-7) 0.139782 0.098420 1.420268 0.1582
RESID(-8) 0.063771 0.099213 0.642771 0.5217
RESID(-9) -0.154525 0.098209 -1.573441 0.1184
RESID(-10) 0.027184 0.098283 0.276585 0.7826
RESID(-11) -0.049692 0.097140 -0.511550 0.6099
RESID(-12) -0.058076 0.095469 -0.608329 0.5442

R-squared 0.196110 Mean dependent var -3.97E-15


Adjusted R-squared 0.091254 S.D. dependent var 0.593374
S.E. of regression 0.565652 Akaike info criterion 1.812335
Sum squared resid 36.79567 Schwarz criterion 2.163504
Log likelihood -102.7079 Hannan-Quinn criter. 1.955030
F-statistic 1.870289 Durbin-Watson stat 2.015299
Prob(F-statistic) 0.033268
191

If you conclude that the errors are serially correlated you have a few options:

(a) You know (at least approximately) the form of autocorrelation and so you use a feasible GLS estimator.

(b) The second approach parallels the use of the White estimator for heteroskedasticity: you don't know the form of autocorrelation, so you rely on OLS but use a consistent estimator of Avar(b).

(c) You are concerned only with the dynamic specification of the model and with forecasting. You may try to convert your model into a dynamically complete model.

(d) Your model may be misspecified: you respecify the model and the autocorrelation disappears.
192

3.9.3 Question (a): feasible GLS estimator

There are many forms of autocorrelation and each one leads to a different structure for the error covariance matrix V. The most popular form is known as the first-order autoregressive process. In this case the error term in

$y_t = x_t'\beta + \varepsilon_t$

is assumed to follow the AR(1) model

$\varepsilon_t = \rho\,\varepsilon_{t-1} + v_t, \qquad |\rho| < 1,$

where $v_t$ is an error term with mean zero and constant conditional variance that exhibits no serial correlation. We assume that all of Assumptions 2.1-2.5 would hold if $\rho = 0$.
193

Initial model:

$y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t = \rho\,\varepsilon_{t-1} + v_t, \qquad |\rho| < 1.$

The GLS estimator is the OLS estimator applied to the transformed model

$\tilde{y}_t = \tilde{x}_t'\beta + v_t$

where

$\tilde{y}_t = \begin{cases} \sqrt{1-\rho^2}\, y_t & t = 1 \\ y_t - \rho y_{t-1} & t > 1 \end{cases}, \qquad \tilde{x}_t' = \begin{cases} \sqrt{1-\rho^2}\, x_t' & t = 1 \\ (x_t - \rho x_{t-1})' & t > 1 \end{cases}.$

Without the first observation, the transformed model is

$y_t - \rho y_{t-1} = (x_t - \rho x_{t-1})'\beta + v_t.$

If $\rho$ is unknown we may replace it by a consistent estimator, or we may use the nonlinear least squares estimator (EViews).
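A rough sketch of the feasible version (estimate $\rho$ from the OLS residuals, then run OLS on the quasi-differenced data). This is the Cochrane-Orcutt variant that drops the first observation; the variable names and the simulated data are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Simulate y_t = 1 + 0.5 x_t + eps_t with AR(1) errors, rho = 0.6.
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])

# Step 1: OLS and residuals.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols

# Step 2: estimate rho from the regression of e_t on e_{t-1} (no intercept).
rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)

# Step 3: OLS on the quasi-differenced data (first observation dropped).
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_fgls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]

print("rho_hat =", rho_hat)
print("OLS  :", b_ols)
print("FGLS :", b_fgls)
```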
194

Example (continuation of the previous example). Let’s consider the residuals of Equation 1:

Equation 3
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations

Variable Coefficient Std. Error t-Statistic Prob.

C -39.30703 23.61105 -1.664772 0.0985


LOG(CHEMPI) 2.875036 0.658664 4.364949 0.0000
LOG(GAS) 1.213475 1.005164 1.207241 0.2296
LOG(RTWEX) 0.850385 0.468696 1.814362 0.0720
AR(1) 0.309190 0.086011 3.594777 0.0005

R-squared 0.338533 Mean dependent var 6.180590


Adjusted R-squared 0.317366 S.D. dependent var 0.699063
S.E. of regression 0.577578 Akaike info criterion 1.777754
Sum squared resid 41.69947 Schwarz criterion 1.888044
Log likelihood -110.5540 Hannan-Quinn criter. 1.822569
F-statistic 15.99350 Durbin-Watson stat 2.079096
Prob(F-statistic) 0.000000

Inverted AR Roots .31

Exercise 3.7. Consider $y_t = \beta_1 + \beta_2 x_{t2} + \varepsilon_t$ where $\varepsilon_t = \rho\,\varepsilon_{t-1} + v_t$ and $\{v_t\}$ is a white noise process. Using the first differences of the variables one gets $\Delta y_t = \beta_2\,\Delta x_{t2} + \Delta\varepsilon_t$. Show that $\mathrm{Corr}(\Delta\varepsilon_t, \Delta\varepsilon_{t-1}) = -(1-\rho)/2$. Discuss the advantages and disadvantages of differencing the variables as a procedure to remove autocorrelation.
195

3.9.4 Question (b): Heteroskedasticity and Autocorrelation-Consistent (HAC) Covariance Matrix Estimator

For the sake of generality, assume that you also have a problem of heteroskedasticity.

Given

$S = \mathrm{Var}(\sqrt{n}\,\bar{g}) = \mathrm{Var}(g_t) + \frac{1}{n}\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E(g_t g_{t-j}') + E(g_{t-j} g_t')\right] = E(\varepsilon_t^2 x_t x_t') + \frac{1}{n}\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E(\varepsilon_t\varepsilon_{t-j} x_t x_{t-j}') + E(\varepsilon_{t-j}\varepsilon_t x_{t-j} x_t')\right],$

a possible estimator of S based on the analogy principle would be

$\frac{1}{n}\sum_{t=1}^{n} e_t^2 x_t x_t' + \frac{1}{n}\sum_{j=1}^{n_0}\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right), \qquad n_0 < n.$

A major problem with this estimator is that it is not positive semi-definite and hence cannot be a well-defined variance-covariance matrix.
196

Newey and West show that with a suitable weighting function $\omega(j)$, the estimator below is consistent and positive semi-definite:

$\hat{S}_{HAC} = \frac{1}{n}\sum_{t=1}^{n} e_t^2 x_t x_t' + \frac{1}{n}\sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right)$

where the weighting function $\omega(j)$ is

$\omega(j) = 1 - \frac{j}{L+1}.$

The maximum lag L must be determined in advance. Autocorrelations at lags longer than L are ignored. For a moving-average process, this value is in general a small number.

This estimator is known as the heteroskedasticity and autocorrelation-consistent (HAC) covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlation are present but of an unknown form.
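A compact sketch of $\hat{S}_{HAC}$ and the corresponding standard errors with NumPy (simulated data, hypothetical names; the bandwidth L is fixed by hand rather than by the EViews rule shown below):

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 300, 4

# Simulated regression with AR(1) errors.
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.5 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# S_hat_HAC: heteroskedasticity term plus Bartlett-weighted cross products.
S = (X * (e**2)[:, None]).T @ X / n
for j in range(1, L + 1):
    w = 1 - j / (L + 1)                                      # Bartlett weight
    G = (X[j:] * (e[j:] * e[:-j])[:, None]).T @ X[:-j] / n   # sum_t e_t e_{t-j} x_t x_{t-j}' / n
    S += w * (G + G.T)

# Avar_hat(b) = Sxx^{-1} S_hat_HAC Sxx^{-1}; Newey-West standard errors.
Sxx_inv = np.linalg.inv(X.T @ X / n)
avar = Sxx_inv @ S @ Sxx_inv
print("b =", b)
print("HAC s.e. =", np.sqrt(np.diag(avar) / n))
```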
197

Example. For $x_t = 1$, $n = 9$, $L = 3$ we have

$\sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right) = \sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n} 2 e_t e_{t-j}$

$= \omega(1)\left(2e_1e_2 + 2e_2e_3 + 2e_3e_4 + 2e_4e_5 + 2e_5e_6 + 2e_6e_7 + 2e_7e_8 + 2e_8e_9\right)$
$\;+\; \omega(2)\left(2e_1e_3 + 2e_2e_4 + 2e_3e_5 + 2e_4e_6 + 2e_5e_7 + 2e_6e_8 + 2e_7e_9\right)$
$\;+\; \omega(3)\left(2e_1e_4 + 2e_2e_5 + 2e_3e_6 + 2e_4e_7 + 2e_5e_8 + 2e_6e_9\right),$

with

$\omega(1) = 1 - \tfrac{1}{4} = 0.75, \qquad \omega(2) = 1 - \tfrac{2}{4} = 0.50, \qquad \omega(3) = 1 - \tfrac{3}{4} = 0.25.$
198

Newey-West covariance matrix estimator:

$\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\hat{S}_{HAC}S_{xx}^{-1}.$

EVIEWS:

[Figure: the EViews default lag truncation L as a function of the sample size n.] EViews selects $L = \mathrm{floor}\!\left(4\,(n/100)^{2/9}\right)$.
199

Example (continuation ...). Newey-West covariance matrix estimator

$\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\hat{S}_{HAC}S_{xx}^{-1}.$

Equation 4
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Newey-West HAC Standard Errors & Covariance (lag truncation=4)

Variable Coefficient Std. Error t-Statistic Prob.

C -19.75991 26.25891 -0.752503 0.4531


LOG(CHEMPI) 3.044302 0.667155 4.563111 0.0000
LOG(GAS) 0.349769 1.189866 0.293956 0.7693
LOG(RTWEX) 0.717552 0.361957 1.982426 0.0496

R-squared 0.280905 Mean dependent var 6.174599


Adjusted R-squared 0.263919 S.D. dependent var 0.699738
S.E. of regression 0.600341 Akaike info criterion 1.847421
Sum squared resid 45.77200 Schwarz criterion 1.935213
Log likelihood -117.0061 Hannan-Quinn criter. 1.883095
F-statistic 16.53698 Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
200

3.9.5 Question (c): Dynamically Complete Models

Consider

$y_t = \tilde{x}_t'\tilde{\beta} + u_t$

such that $E(u_t \mid \tilde{x}_t) = 0$. This condition, although necessary for consistency, does not preclude autocorrelation. You may try to increase the number of regressors to $x_t$ and get a new regression model

$y_t = x_t'\beta + \varepsilon_t \quad \text{such that} \quad E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots) = 0.$

Written in terms of $y_t$:

$E(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots) = E(y_t \mid x_t).$

Definition. The model $y_t = x_t'\beta + \varepsilon_t$ is dynamically complete (DC) if

$E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots) = 0 \quad \text{or} \quad E(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots) = E(y_t \mid x_t)$

holds (see Wooldridge).
201

Proposition. If a model is DC then the errors are not autocorrelated. Moreover, $\{g_i\}$ is a MDS.

Notice that $E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots) = 0$ can be rewritten as

$E(\varepsilon_i \mid \mathcal{F}_i) = 0$ where

$\mathcal{F}_i = \mathcal{I}_{i-1} \cup x_i = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \ldots, \varepsilon_1, x_i, x_{i-1}, \ldots, x_1\},$
$\mathcal{I}_{i-1} = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \ldots, \varepsilon_1, x_{i-1}, \ldots, x_1\}.$

Example. Consider

$y_t = \beta_1 + \beta_2 x_{t2} + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t,$

where $\{\varepsilon_t\}$ is a white noise process and $E(\varepsilon_t \mid x_{t2}, y_{t-1}, x_{t-1,2}, y_{t-2}, \ldots) = 0$. Set $\tilde{x}_t' = [1 \;\; x_{t2}]$. The above model is not DC since the errors are autocorrelated. Notice that

$E(y_t \mid x_{t2}, y_{t-1}, x_{t-1,2}, y_{t-2}, \ldots) = \beta_1 + \beta_2 x_{t2} + \rho u_{t-1}$

does not coincide with

$E(y_t \mid \tilde{x}_t) = E(y_t \mid x_{t2}) = \beta_1 + \beta_2 x_{t2}.$
202

However, it is easy to obtain a DC model. Since

$u_t = y_t - (\beta_1 + \beta_2 x_{t2}) \;\Rightarrow\; u_{t-1} = y_{t-1} - (\beta_1 + \beta_2 x_{t-1,2}),$

we have

$y_t = \beta_1 + \beta_2 x_{t2} + u_t = \beta_1 + \beta_2 x_{t2} + \rho u_{t-1} + \varepsilon_t = \beta_1 + \beta_2 x_{t2} + \rho\left(y_{t-1} - \beta_1 - \beta_2 x_{t-1,2}\right) + \varepsilon_t.$

This equation can be written in the form

$y_t = \alpha_1 + \alpha_2 x_{t2} + \alpha_3 y_{t-1} + \alpha_4 x_{t-1,2} + \varepsilon_t.$

Let $x_t = (x_{t2}, y_{t-1}, x_{t-1,2})$. The previous model is DC as

$E(y_t \mid x_t, y_{t-1}, x_{t-1}, \ldots) = E(y_t \mid x_t) = \alpha_1 + \alpha_2 x_{t2} + \alpha_3 y_{t-1} + \alpha_4 x_{t-1,2}.$
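A small simulation sketch of this idea (variable names are ours): the static regression leaves autocorrelated residuals, while adding $y_{t-1}$ and $x_{t-1,2}$ essentially removes the autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# y_t = 1 + 0.5 x_t + u_t with u_t = 0.6 u_{t-1} + eps_t.
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

def first_autocorr(resid):
    return np.sum(resid[1:] * resid[:-1]) / np.sum(resid**2)

# Static (not dynamically complete) regression.
X_static = np.column_stack([np.ones(n), x])
e_static = y - X_static @ np.linalg.lstsq(X_static, y, rcond=None)[0]

# Dynamically complete regression: add y_{t-1} and x_{t-1}.
X_dc = np.column_stack([np.ones(n - 1), x[1:], y[:-1], x[:-1]])
e_dc = y[1:] - X_dc @ np.linalg.lstsq(X_dc, y[1:], rcond=None)[0]

print("rho_hat(1), static :", first_autocorr(e_static))  # close to 0.6
print("rho_hat(1), DC     :", first_autocorr(e_dc))      # close to 0
```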


203

Example (continuation ...). Dynamically Complete Model

Equation 6
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 0.810670 Prob. F(12,110) 0.6389


Obs*R-squared 10.56265 Prob. Chi-Square(12) 0.5667

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Equation 5 Date: 05/12/10 Time: 19:13
Dependent Variable: LOG(CHNIMP) Sample: 1978M03 1988M12
Included observations: 130
Method: Least Squares Presample missing value lagged residuals set to zero.
Sample (adjusted): 1978M03 1988M12
Variable Coefficient Std. Error t-Statistic Prob.
Included observations: 130 after adjustments
C 1.025127 26.26657 0.039028 0.9689
LOG(CHEMPI) 1.373671 3.968650 0.346130 0.7299
Variable Coefficient Std. Error t-Statistic Prob. LOG(GAS) -0.279136 1.055889 -0.264361 0.7920
LOG(RTWEX) -0.074592 2.234853 -0.033377 0.9734
C -11.30596 23.24886 -0.486302 0.6276 LOG(CHEMPI(-1)) -1.878917 4.322963 -0.434636 0.6647
LOG(GAS(-1)) 0.315918 1.076831 0.293378 0.7698
LOG(CHEMPI) -7.193799 3.539951 -2.032175 0.0443 LOG(RTWEX(-1)) -0.007029 2.224878 -0.003159 0.9975
LOG(GAS) 1.319540 1.003825 1.314513 0.1911 LOG(CHNIMP(-1)) 0.151065 0.293284 0.515082 0.6075
RESID(-1) -0.189924 0.307062 -0.618520 0.5375
LOG(RTWEX) -0.501520 2.108623 -0.237842 0.8124 RESID(-2) 0.088557 0.124602 0.710715 0.4788
LOG(CHEMPI(-1)) 9.618587 3.602977 2.669622 0.0086 RESID(-3) 0.154141 0.098337 1.567475 0.1199
RESID(-4) -0.125009 0.098681 -1.266795 0.2079
LOG(GAS(-1)) -1.223681 1.002237 -1.220950 0.2245 RESID(-5) -0.035680 0.099831 -0.357407 0.7215
LOG(RTWEX(-1)) 0.935678 2.088961 0.447915 0.6550 RESID(-6) 0.048053 0.098008 0.490291 0.6249
LOG(CHNIMP(-1)) 0.270704 0.084103 3.218710 0.0016 RESID(-7) 0.129226 0.097417 1.326523 0.1874
RESID(-8) 0.052884 0.099891 0.529420 0.5976
RESID(-9) -0.122323 0.102670 -1.191423 0.2361
R-squared 0.394405 Mean dependent var 6.180590 RESID(-10) 0.022149 0.099419 0.222788 0.8241
RESID(-11) 0.034364 0.099973 0.343738 0.7317
Adjusted R-squared 0.359658 S.D. dependent var 0.699063 RESID(-12) -0.038034 0.102071 -0.372628 0.7101
S.E. of regression 0.559400 Akaike info criterion 1.735660
R-squared 0.081251 Mean dependent var -9.76E-15
Sum squared resid 38.17726 Schwarz criterion 1.912123 Adjusted R-squared -0.077442 S.D. dependent var 0.544011
Log likelihood -104.8179 Hannan-Quinn criter. 1.807363 S.E. of regression 0.564683 Akaike info criterion 1.835533
F-statistic 11.35069 Durbin-Watson stat 2.059684 Sum squared resid 35.07532 Schwarz criterion 2.276692
Log likelihood -99.30962 Hannan-Quinn criter. 2.014790
Prob(F-statistic) 0.000000 F-statistic 0.512002 Durbin-Watson stat 2.011429
Prob(F-statistic) 0.952295
204

3.9.6 Question (d): Misspecification

In many cases the finding of autocorrelation is an indication that the model is misspecified. If this is the case, the most natural route is not to change your estimator (from OLS to GLS) but to change your model. Types of misspecification that may lead to a finding of autocorrelation in your OLS residuals:

dynamic misspecification (related to question (c));

omitted variables (that are autocorrelated);

$y_t$ and/or $x_{tk}$ are integrated processes, e.g. $y_t \sim I(1)$;

functional form misspecification.


205

Functional form misspecification. Suppose that the true relationship is

$y_t = \beta_1 + \beta_2 \log t + \varepsilon_t.$

In the following figure we estimate a misspecified functional form, $y_t = \beta_1 + \beta_2 t + \varepsilon_t$. The residuals are clearly autocorrelated.
206

3.10 Time Regressions

Consider

$y_t = \alpha + f(t) + \varepsilon_t$

where $f(t)$ is a function of time (e.g. $f(t) = t$ or $f(t) = t^2$, etc.). This kind of model does not satisfy Assumption 2.2: $\{(y_i, x_i)\}$ is jointly S&WD. This type of nonstationarity is not serious and the OLS is applicable. Let us focus on the case

$y_t = \alpha + \delta t + \varepsilon_t = x_t'\beta + \varepsilon_t, \qquad x_t' = [1 \;\; t], \quad \beta = \begin{bmatrix}\alpha \\ \delta\end{bmatrix}.$

$\alpha + \delta t$ is called the time trend of $y_t$.

Definition. We say that a process is trend stationary if it can be written as the sum of a time trend and a stationary process. The process $\{y_t\}$ here is a special trend-stationary process where the stationary component is independent white noise.
207

3.10.1 The Asymptotic Distribution of the OLS Estimator

Let $b$ be the OLS estimate of $\beta$ based on a sample of size n:

$b = \begin{bmatrix}\hat{\alpha} \\ \hat{\delta}\end{bmatrix} = (X'X)^{-1}X'y.$

Proposition (2.11 - OLS estimation of the time regression). Consider the time regression $y_t = \alpha + \delta t + \varepsilon_t$ where $\varepsilon_t$ is independent white noise with $E(\varepsilon_t^2) = \sigma^2$ and $E(\varepsilon_t^4) < \infty$. Then

$\begin{bmatrix}\sqrt{n}(\hat{\alpha} - \alpha) \\ n^{3/2}(\hat{\delta} - \delta)\end{bmatrix} \xrightarrow{d} N\left(0,\; \sigma^2\begin{bmatrix}1 & 1/2 \\ 1/2 & 1/3\end{bmatrix}^{-1}\right) = N\left(0,\; \sigma^2\begin{bmatrix}4 & -6 \\ -6 & 12\end{bmatrix}\right).$

As in the stationary case, $\hat{\alpha}$ is $\sqrt{n}$-consistent because $\sqrt{n}(\hat{\alpha} - \alpha)$ converges to a (normal) random variable. The OLS estimate of the time coefficient, $\hat{\delta}$, is also consistent, but the speed of convergence is faster: it is $n^{3/2}$-consistent in that $n^{3/2}(\hat{\delta} - \delta)$ converges to a random variable. In this sense, $\hat{\delta}$ is superconsistent.
208

We provide a simpler proof of Proposition 2.11 in the case $y_t = \delta t + \varepsilon_t$ (here $X = (1, 2, \ldots, n)'$). We have

$\hat{\delta} - \delta = (X'X)^{-1}X'\varepsilon = \left(\sum_{t=1}^{n} t^2\right)^{-1}\sum_{t=1}^{n} t\,\varepsilon_t = \frac{\sqrt{\mathrm{Var}\!\left(\sum_{t=1}^{n} t\varepsilon_t\right)}}{\sum_{t=1}^{n} t^2}\cdot\frac{\sum_{t=1}^{n} t\varepsilon_t}{\sqrt{\mathrm{Var}\!\left(\sum_{t=1}^{n} t\varepsilon_t\right)}} = \frac{\sigma\sqrt{\sum_{t=1}^{n} t^2}}{\sum_{t=1}^{n} t^2}\, Z_n, \qquad Z_n \xrightarrow{d} Z \sim N(0, 1).$

Hence

$n^{3/2}\left(\hat{\delta} - \delta\right) = n^{3/2}\,\frac{\sigma\sqrt{\sum_{t=1}^{n} t^2}}{\sum_{t=1}^{n} t^2}\, Z_n.$

Since

$\lim_{n\to\infty} n^{3/2}\,\frac{\sqrt{\sum_{t=1}^{n} t^2}}{\sum_{t=1}^{n} t^2} = \sqrt{3},$

we have

$n^{3/2}\left(\hat{\delta} - \delta\right) \xrightarrow{d} \sqrt{3}\,\sigma Z \sim N(0, 3\sigma^2).$

3.10.2 Hypothesis Testing for Time Regressions

The OLS coefficient estimates of the time regression are asymptotically normal, provided the sampling error is properly scaled. Inference about $\delta$ can be based on

$\frac{n^{3/2}(\hat{\delta} - \delta)}{\sqrt{12\, s^2}} \xrightarrow{d} N(0, 1) \quad \text{in the case } y_t = \alpha + \delta t + \varepsilon_t,$

$\frac{n^{3/2}(\hat{\delta} - \delta)}{\sqrt{3\, s^2}} \xrightarrow{d} N(0, 1) \quad \text{in the case } y_t = \delta t + \varepsilon_t.$
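A quick Monte Carlo sketch of the last result (no-intercept trend model), checking that $n^{3/2}(\hat{\delta}-\delta)/\sqrt{3 s^2}$ is roughly standard normal; everything here is simulated and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(5)
n, delta, n_rep = 200, 0.05, 2000
t = np.arange(1, n + 1)

stats_ = np.empty(n_rep)
for r in range(n_rep):
    eps = rng.normal(size=n)
    y = delta * t + eps
    d_hat = np.sum(t * y) / np.sum(t**2)       # OLS slope in y_t = delta*t + eps_t
    s2 = np.sum((y - d_hat * t)**2) / (n - 1)  # error variance estimate
    stats_[r] = n**1.5 * (d_hat - delta) / np.sqrt(3 * s2)

# Should be close to 0 and 1 if the N(0,1) approximation works.
print("mean =", stats_.mean(), "std =", stats_.std())
```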
210

4 Endogeneity and the GMM

Consider

$y_i = \delta_1 z_{i1} + \delta_2 z_{i2} + \cdots + \delta_K z_{iK} + \varepsilon_i.$

If $\mathrm{Cov}(z_{ij}, \varepsilon_i) \neq 0$ (or $E(z_{ij}\varepsilon_i) \neq 0$) then we say that $z_{ij}$ (the $j$-th regressor) is endogenous. It follows that $E(z_i\varepsilon_i) \neq 0$.

Definition (endogenous regressor). We say that a regressor is endogenous if it is not predetermined (i.e., not orthogonal to the error term), that is, if it does not satisfy the orthogonality condition (Assumption 2.3 does not hold).

If the regressors are endogenous we have, under Assumptions 2.1, 2.2 and 2.4,

$b = \delta + \left(\frac{1}{n}\sum_{i=1}^{n} z_i z_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n} z_i\varepsilon_i \xrightarrow{p} \delta + \Sigma_{zz}^{-1}E(z_i\varepsilon_i) \neq \delta$

since $E(z_i\varepsilon_i) \neq 0$. The term $\Sigma_{zz}^{-1}E(z_i\varepsilon_i)$ is the asymptotic bias.
211

Example (Simple regression model). Consider

$y_i = \delta_1 + \delta_2 z_{i2} + \varepsilon_i.$

The OLS estimator is

$b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = (Z'Z)^{-1}Z'y = \begin{bmatrix} \bar{y} - \dfrac{\widehat{\mathrm{Cov}}(z_{i2}, y_i)}{S_{z_2}^2}\,\bar{z}_2 \\ \dfrac{\widehat{\mathrm{Cov}}(z_{i2}, y_i)}{S_{z_2}^2} \end{bmatrix}$

where

$\widehat{\mathrm{Cov}}(z_{i2}, y_i) = \frac{1}{n}\sum (z_{i2} - \bar{z}_2)(y_i - \bar{y}), \qquad S_{z_2}^2 = \frac{1}{n}\sum (z_{i2} - \bar{z}_2)^2.$

Under Assumption 2.2 we have

$b_2 = \frac{\widehat{\mathrm{Cov}}(z_{i2}, y_i)}{S_{z_2}^2} \xrightarrow{p} \frac{\mathrm{Cov}(z_{i2}, y_i)}{\mathrm{Var}(z_{i2})} = \frac{\mathrm{Cov}(z_{i2}, \delta_1 + \delta_2 z_{i2} + \varepsilon_i)}{\mathrm{Var}(z_{i2})} = \delta_2 + \frac{\mathrm{Cov}(z_{i2}, \varepsilon_i)}{\mathrm{Var}(z_{i2})}.$
212

$b_1 = \bar{y} - \frac{\widehat{\mathrm{Cov}}(z_{i2}, y_i)}{S_{z_2}^2}\,\bar{z}_2 \xrightarrow{p} E(y_i) - \frac{\mathrm{Cov}(z_{i2}, y_i)}{\mathrm{Var}(z_{i2})}E(z_{i2}) = \delta_1 + \delta_2 E(z_{i2}) - \left(\delta_2 + \frac{\mathrm{Cov}(z_{i2}, \varepsilon_i)}{\mathrm{Var}(z_{i2})}\right)E(z_{i2}) = \delta_1 - \frac{\mathrm{Cov}(z_{i2}, \varepsilon_i)}{\mathrm{Var}(z_{i2})}E(z_{i2}).$

If $\mathrm{Cov}(z_{i2}, \varepsilon_i) = 0$ then $b_i \xrightarrow{p} \delta_i$. If $z_{i2}$ is endogenous, $b_1$ and $b_2$ are inconsistent. Show that

$\Sigma_{zz}^{-1}E(z_i\varepsilon_i) = \begin{bmatrix} -\dfrac{\mathrm{Cov}(z_{i2}, \varepsilon_i)}{\mathrm{Var}(z_{i2})}E(z_{i2}) \\ \dfrac{\mathrm{Cov}(z_{i2}, \varepsilon_i)}{\mathrm{Var}(z_{i2})} \end{bmatrix}.$
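A small simulation sketch of this asymptotic bias; the design (with Cov(z2, eps) > 0) and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
d1, d2 = 1.0, 0.5

# Endogenous regressor: z2 and eps share a common component.
common = rng.normal(size=n)
z2 = rng.normal(size=n) + common
eps = rng.normal(size=n) + 0.8 * common       # Cov(z2, eps) = 0.8
y = d1 + d2 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
b = np.linalg.lstsq(Z, y, rcond=None)[0]

bias_b2 = np.cov(z2, eps)[0, 1] / np.var(z2)  # Cov(z2,eps)/Var(z2), about 0.8/2 = 0.4
print("b =", b)
print("plim of b2 should be about", d2 + bias_b2)
```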
213

4.1 Examples of Endogeneity

4.1.1 Simultaneous Equations Bias

Example. Consider

$y_{i1} = \alpha_0 + \alpha_1 y_{i2} + \varepsilon_{i1}$
$y_{i2} = \beta_0 + \beta_1 y_{i1} + \varepsilon_{i2}$

where $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are independent. By construction $y_{i1}$ and $y_{i2}$ are endogenous regressors. In fact, it can be proved that

$\mathrm{Cov}(y_{i2}, \varepsilon_{i1}) = \frac{\beta_1}{1 - \alpha_1\beta_1}\mathrm{Var}(\varepsilon_{i1}) \neq 0, \qquad \mathrm{Cov}(y_{i1}, \varepsilon_{i2}) = \frac{\alpha_1}{1 - \alpha_1\beta_1}\mathrm{Var}(\varepsilon_{i2}) \neq 0.$

Now

$\hat{\alpha}_{1,OLS} \xrightarrow{p} \frac{\mathrm{Cov}(y_{i2}, y_{i1})}{\mathrm{Var}(y_{i2})} = \frac{\mathrm{Cov}(y_{i2}, \alpha_0 + \alpha_1 y_{i2} + \varepsilon_{i1})}{\mathrm{Var}(y_{i2})} = \alpha_1 + \frac{\mathrm{Cov}(y_{i2}, \varepsilon_{i1})}{\mathrm{Var}(y_{i2})} \neq \alpha_1,$

$\hat{\beta}_{1,OLS} \xrightarrow{p} \frac{\mathrm{Cov}(y_{i2}, y_{i1})}{\mathrm{Var}(y_{i1})} = \frac{\mathrm{Cov}(y_{i1}, \beta_0 + \beta_1 y_{i1} + \varepsilon_{i2})}{\mathrm{Var}(y_{i1})} = \beta_1 + \frac{\mathrm{Cov}(y_{i1}, \varepsilon_{i2})}{\mathrm{Var}(y_{i1})} \neq \beta_1.$
214

The OLS estimator is inconsistent for both $\alpha_1$ and $\beta_1$ (and for $\alpha_0$ and $\beta_0$ as well). This phenomenon is known as the simultaneous equations bias or simultaneity bias, because the regressor and the error term are related to each other through a system of simultaneous equations.

Example. Consider

$C_i = \alpha_0 + \alpha_1 Y_i + u_i$ (consumption function)
$Y_i = C_i + I_i$ (GNP identity),

where $\mathrm{Cov}(u_i, I_i) = 0$. It can be proved that

$\hat{\alpha}_{1,OLS} \xrightarrow{p} \alpha_1 + \frac{1}{1 - \alpha_1}\frac{\mathrm{Var}(u_i)}{\mathrm{Var}(Y_i)}.$

Example. See Hayashi:

$q_i^d = \alpha_0 + \alpha_1 p_i + u_i$ (demand equation)
$q_i^s = \beta_0 + \beta_1 p_i + v_i$ (supply equation)
$q_i^d = q_i^s$ (market equilibrium).
215

4.1.2 Errors-in-Variables Bias

We will see that a predetermined regressor necessarily becomes endogenous when measured with error. This problem is ubiquitous, particularly in micro data on households.

Consider

$y_i^* = \beta z_i^* + u_i$

where $z_i^*$ is a predetermined regressor. The variables $y_i^*$ and $z_i^*$ are measured with error:

$y_i = y_i^* + \varepsilon_i \quad \text{and} \quad z_i = z_i^* + v_i.$

Assume that $E(z_i^* u_i) = E(z_i^* \varepsilon_i) = E(z_i^* v_i) = E(v_i u_i) = E(v_i \varepsilon_i) = 0$. The regression equation is

$y_i = \beta z_i + \eta_i, \qquad \eta_i = u_i + \varepsilon_i - \beta v_i.$

Assuming S&WD we have (after some calculations):

$\hat{\beta}_{OLS} = \frac{\sum_i z_i y_i}{\sum_i z_i^2} = \frac{\sum_i z_i y_i / n}{\sum_i z_i^2 / n} \xrightarrow{p} \beta\left(1 - \frac{E(v_i^2)}{E(z_i^2)}\right).$
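A simulation sketch of this attenuation effect (all names and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta = 100_000, 1.0

z_star = rng.normal(size=n)                  # true regressor
u = rng.normal(size=n)
y_star = beta * z_star + u

z = z_star + rng.normal(scale=0.7, size=n)   # measured with error, Var(v) = 0.49
y = y_star + rng.normal(scale=0.5, size=n)

b_ols = np.sum(z * y) / np.sum(z**2)
plim = beta * (1 - 0.49 / (1 + 0.49))        # beta * (1 - E(v^2)/E(z^2))
print("OLS estimate:", b_ols, " predicted plim:", plim)
```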
216

4.1.3 Omitted Variable Bias

Consider the "long regression"

$y = X_1\beta_1 + X_2\beta_2 + u$

and suppose that this model satisfies Assumptions 2.1-2.4 (hence the OLS based on the previous equation is consistent). However, for some reason $X_2$ is not included in the regression model (the "short regression")

$y = X_1\beta_1 + \varepsilon, \qquad \varepsilon = X_2\beta_2 + u.$

We are interested only in $\beta_1$. We have

$b_1 = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'u = \beta_1 + \left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'X_2}{n}\beta_2 + \left(\frac{X_1'X_1}{n}\right)^{-1}\frac{X_1'u}{n}.$
217

This expression converges in probability to

$\beta_1 + \Sigma_{x_1x_1}^{-1}\Sigma_{x_1x_2}\beta_2.$

The conclusion is that $b_1$ is inconsistent if there are omitted variables that are correlated with $X_1$. The variables in $X_1$ are endogenous as long as $\mathrm{Cov}(X_1, X_2) \neq 0$:

$\mathrm{Cov}(X_1, \varepsilon) = \mathrm{Cov}(X_1, X_2\beta_2 + u) = \mathrm{Cov}(X_1, X_2)\beta_2.$


Example. Consider the problem of unobserved ability in a wage equation for working adults. A simple model is

$\log(WAGE_i) = \beta_1 + \beta_2 educ_i + \beta_3 abil_i + u_i$

where $u_i$ is the error term. We put $abil_i$ into the error term, and we are left with the simple regression model

$\log(WAGE_i) = \beta_1 + \beta_2 educ_i + \varepsilon_i$

where $\varepsilon_i = \beta_3 abil_i + u_i$.
218

The OLS will be an inconsistent estimator of $\beta_2$ if $educ_i$ and $abil_i$ are correlated. In effect,

$b_2 \xrightarrow{p} \beta_2 + \frac{\mathrm{Cov}(educ_i, \varepsilon_i)}{\mathrm{Var}(educ_i)} = \beta_2 + \frac{\mathrm{Cov}(educ_i, \beta_3 abil_i + u_i)}{\mathrm{Var}(educ_i)} = \beta_2 + \beta_3\frac{\mathrm{Cov}(educ_i, abil_i)}{\mathrm{Var}(educ_i)}.$
219

4.2 The General Formulation

4.2.1 Regressors and Instruments

Definition. $x_i$ is an instrumental variable (IV) for $z_i$ if (1) $x_i$ is uncorrelated with $\varepsilon_i$, that is, $\mathrm{Cov}(x_i, \varepsilon_i) = 0$ (thus, $x_i$ is a predetermined variable), and (2) $x_i$ is correlated with $z_i$, that is, $\mathrm{Cov}(x_i, z_i) \neq 0$.

Exercise 4.1. Consider $\log(wage_i) = \beta_1 + \beta_2 educ_i + \varepsilon_i$. Omitted variable: ability. (a) Is educ an endogenous variable? (b) Can IQ be considered an IV for educ? And mother's education?

Exercise 4.2. Consider $children_i = \beta_1 + \beta_2 mothereduc_i + \beta_3 motherage_i + \varepsilon_i$. Omitted variable: $bcm_i$, a dummy equal to one if the mother is informed about birth control methods. (a) Is mothereduc endogenous? (b) Suggest an IV for mothereduc.

Exercise 4.3. Consider $score_i = \beta_1 + \beta_2 skipped_i + \varepsilon_i$. Omitted variable: motivation. (a) Is $skipped_i$ endogenous? (b) Can the distance between home (or living quarters) and university be considered an IV?
220

Exercise 4.4. (Wooldridge, Chap. 15) Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade point average for graduating seniors at a large public university:

$GPA_i = \beta_1 + \beta_2 PC_i + \varepsilon_i$

where PC is a binary variable indicating PC ownership. (a) Why might PC ownership be correlated with $\varepsilon_i$? (b) Explain why PC is likely to be related to parents' annual income. Does this mean parental income is a good IV for PC? Why or why not? (c) Suppose that, four years ago, the university gave grants to buy computers to roughly one-half of the incoming students, and the students who received grants were randomly chosen. Carefully explain how you would use this information to construct an instrumental variable for PC. (d) Same question as (c) but suppose that the university gave grant priority to low-income students.

(See the use of IV in errors-in-variables problems in Wooldridge's textbook.)


221

Assumption (3.1 - linearity). The equation to be estimated is linear:

$y_i = z_i'\delta + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$

where $z_i$ is an L-dimensional vector of regressors, $\delta$ is an L-dimensional coefficient vector and $\varepsilon_i$ is an unobservable error term.

Assumption (3.2 - S&WD). Let $x_i$ be a K-dimensional vector to be referred to as the vector of instruments, and let $w_i$ be the unique and nonconstant elements of $(y_i, z_i, x_i)$. $\{w_i\}$ is jointly stationary and weakly dependent.

Assumption (3.3 - orthogonality conditions). All the K variables in $x_i$ are predetermined in the sense that they are all orthogonal to the current error term: $E(x_{ik}\varepsilon_i) = 0$ for all $i$ and $k$. This can be written as

$E\left[x_i(y_i - z_i'\delta)\right] = 0 \quad \text{or} \quad E(g_i) = 0$

where $g_i = x_i\varepsilon_i$.

Notice: $x_i$ should include the "1" (constant). Not only can $x_{i1} = 1$ be considered an IV, but it also guarantees that $E\left[1\cdot(y_i - z_i'\delta)\right] = 0 \Leftrightarrow E(\varepsilon_i) = 0$.
222

Example (3.1). Consider

$q_i = \alpha_0 + \alpha_1 p_i + u_i$ (demand equation)

where $\mathrm{Cov}(p_i, u_i) \neq 0$, and $x_i$ is such that $\mathrm{Cov}(x_i, p_i) \neq 0$ but $\mathrm{Cov}(x_i, u_i) = 0$. Using the previous notation we have:

$y_i = q_i, \qquad z_i = \begin{bmatrix}1 \\ p_i\end{bmatrix}, \quad \delta = \begin{bmatrix}\alpha_0 \\ \alpha_1\end{bmatrix}, \quad L = 2, \qquad x_i = \begin{bmatrix}1 \\ x_i\end{bmatrix}, \quad K = 2, \qquad w_i = \begin{bmatrix}q_i \\ p_i \\ x_i\end{bmatrix}.$

In the above example, $x_i$ and $z_i$ share the same variable (a constant). The instruments that are also regressors are called predetermined regressors, and the rest of the regressors, those that are not included in $x_i$, are called endogenous regressors.
223

Example (3.2 - wage equation). Consider

$LW_i = \delta_1 + \delta_2 S_i + \delta_3 EXPR_i + \delta_4 IQ_i + \varepsilon_i$

where:

$LW_i$ is the log wage of individual i,

$S_i$ is completed years of schooling (we assume predetermined),

$EXPR_i$ is experience in years (we assume predetermined),

$IQ_i$ is IQ (an error-ridden measure of the individual's ability; it is endogenous due to the errors-in-variables problem).

We still have information on:

$AGE_i$ (age of the individual - predetermined),

$MED_i$ (mother's education in years - predetermined).

Note: AGE is excluded from the wage equation, reflecting the underlying assumption that, once experience is controlled for, age has no effect on the wage rate.
224

In terms of the general model,

$y_i = LW_i, \qquad z_i = \begin{bmatrix}1 \\ S_i \\ EXPR_i \\ IQ_i\end{bmatrix}, \quad \delta = \begin{bmatrix}\delta_1 \\ \delta_2 \\ \delta_3 \\ \delta_4\end{bmatrix}, \quad L = 4, \qquad x_i = \begin{bmatrix}1 \\ S_i \\ EXPR_i \\ AGE_i \\ MED_i\end{bmatrix}, \quad K = 5,$

$w_i' = \begin{bmatrix}LW_i & S_i & EXPR_i & IQ_i & AGE_i & MED_i\end{bmatrix}.$
225

4.2.2 Identification

The GMM estimation of the parameter vector $\delta$ is about how to exploit the information afforded by the orthogonality conditions

$E\left[x_i(y_i - z_i'\delta)\right] = 0 \;\Leftrightarrow\; E(x_i z_i')\,\delta = E(x_i y_i).$

$E(x_i z_i')\,\delta = E(x_i y_i)$ can be interpreted as a linear system with K equations where $\delta$ is the unknown vector. Notice: $E(x_i z_i')$ is a $K \times L$ matrix and $E(x_i y_i)$ is a $K \times 1$ vector. Can we solve the system with respect to $\delta$? We need to study the identification of the system.

Assumption (3.4 - rank condition for identification). The $K \times L$ matrix $E(x_i z_i')$ is of full column rank (i.e., its rank equals L, the number of its columns). We denote this matrix by $\Sigma_{xz}$.
226

Example. Consider example 3.2 where

$x_i = \begin{bmatrix}1 \\ S_i \\ EXPR_i \\ AGE_i \\ MED_i\end{bmatrix}, \qquad z_i = \begin{bmatrix}1 \\ S_i \\ EXPR_i \\ IQ_i\end{bmatrix}.$

We have

$x_i z_i' = \begin{bmatrix}1 \\ S_i \\ EXPR_i \\ AGE_i \\ MED_i\end{bmatrix}\begin{bmatrix}1 & S_i & EXPR_i & IQ_i\end{bmatrix} = \begin{bmatrix}1 & S_i & EXPR_i & IQ_i \\ S_i & S_i^2 & S_i EXPR_i & S_i IQ_i \\ EXPR_i & EXPR_i S_i & EXPR_i^2 & EXPR_i IQ_i \\ AGE_i & AGE_i S_i & AGE_i EXPR_i & AGE_i IQ_i \\ MED_i & MED_i S_i & MED_i EXPR_i & MED_i IQ_i\end{bmatrix}.$
227

$E(x_i z_i') = \Sigma_{xz} = \begin{bmatrix}1 & E(S_i) & E(EXPR_i) & E(IQ_i) \\ E(S_i) & E(S_i^2) & E(S_i EXPR_i) & E(S_i IQ_i) \\ E(EXPR_i) & E(EXPR_i S_i) & E(EXPR_i^2) & E(EXPR_i IQ_i) \\ E(AGE_i) & E(AGE_i S_i) & E(AGE_i EXPR_i) & E(AGE_i IQ_i) \\ E(MED_i) & E(MED_i S_i) & E(MED_i EXPR_i) & E(MED_i IQ_i)\end{bmatrix}.$

Assumption 3.4 requires that $\mathrm{rank}(\Sigma_{xz}) = 4$.


228

4.2.3 Order Condition for Identification

Since $\mathrm{rank}(\Sigma_{xz}) \leq \min\{K, L\}$ we have: if $K < L$ then $\mathrm{rank}(\Sigma_{xz}) < L$. Thus a necessary condition for identification is that $K \geq L$.

Definition (order condition for identification). $K \geq L$, i.e. the number of orthogonality conditions (K) must be at least the number of parameters (L).

Definition. We say that the equation is overidentified if the rank condition is satisfied and $K > L$, exactly identified (or just identified) if the rank condition is satisfied and $K = L$, and underidentified (or not identified) if the order condition is not satisfied (i.e., if $K < L$).
229

Example. Consider the system $Ax = b$, with $A = E(x_i z_i')$ and $b = E(x_i y_i)$. It can be proved that the system is always "possible" (it has at least one solution). Consider the following scenarios:

1. If $\mathrm{rank}(A) = L$ and $K = L$ the system is exactly identified. Example:

$\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} = \begin{bmatrix}3 \\ 1\end{bmatrix} \;\Rightarrow\; \begin{cases}x_1 = 2 \\ x_2 = 1\end{cases}$

Note: $\mathrm{rank}(A) = 2 = L = K$.

2. If $\mathrm{rank}(A) = L$ and $K > L$, the system is overidentified. Example:

$\begin{bmatrix}1 & 1 \\ 0 & 1 \\ 0 & 1\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} = \begin{bmatrix}3 \\ 1 \\ 1\end{bmatrix} \;\Rightarrow\; \begin{cases}x_1 = 2 \\ x_2 = 1\end{cases}$

Note: $\mathrm{rank}(A) = 2 = L$ and $K = 3$.
230

3. If $\mathrm{rank}(A) < L$ the system is underidentified. Example:

$\begin{bmatrix}1 & 1 \\ 2 & 2\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} = \begin{bmatrix}2 \\ 4\end{bmatrix} \;\Rightarrow\; x_1 = 2 - x_2, \quad x_2 \in \mathbb{R}.$

Note: $\mathrm{rank}(A) = 1 < L$.

4. If $K < L$ then $\mathrm{rank}(A) < L$ and the system is underidentified. Example:

$\begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}x_1 \\ x_2\end{bmatrix} = 1 \;\Rightarrow\; x_1 = 1 - x_2, \quad x_2 \in \mathbb{R}.$

Note: $\mathrm{rank}(A) = 1$ and $K = 1 < L = 2$.
231

4.2.4 The Assumption for Asymptotic Normality

Assumption (3.5 - $\{g_i\}$ is a martingale difference sequence with finite second moments). Let $g_i = x_i\varepsilon_i$. $\{g_i\}$ is a martingale difference sequence (so $E(g_i) = 0$). The $K \times K$ matrix of cross moments, $E(g_i g_i')$, is nonsingular. Let $S = \mathrm{Avar}(\bar{g})$.

Remarks:

Assumption 3.5 implies $\mathrm{Avar}(\bar{g}) = \lim \mathrm{Var}(\sqrt{n}\,\bar{g}) = E(g_i g_i')$.

Assumption 3.5 implies $\sqrt{n}\,\bar{g} \xrightarrow{d} N(0, \mathrm{Avar}(\bar{g}))$.

If the instruments include a constant, then this assumption implies that the error is a martingale difference sequence (and a fortiori serially uncorrelated).
232

A sufficient and perhaps easier to understand condition for Assumption 3.5 is that

$E(\varepsilon_i \mid \mathcal{F}_i) = 0$ where

$\mathcal{I}_{i-1} = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \ldots, \varepsilon_1, x_{i-1}, \ldots, x_1\},$
$\mathcal{F}_i = \mathcal{I}_{i-1} \cup x_i = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \ldots, \varepsilon_1, x_i, x_{i-1}, \ldots, x_1\}.$

It implies the error term is orthogonal not only to the current but also to the past instruments.

Since $g_i g_i' = \varepsilon_i^2 x_i x_i'$, S is a matrix of fourth moments. Consistent estimation of S will require a fourth-moment assumption, to be specified in Assumption 3.6 below.

If $\{g_i\}$ is serially correlated, then S does not equal $E(g_i g_i')$ and will take a more complicated form.
233

4.3 Generalized Method of Moments (GMM) Defined

The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples:

Parameter of the population: $E(y_i)$; estimator: $\bar{Y}$.
Parameter of the population: $\mathrm{Var}(y_i)$; estimator: $S_y^2$.
Parameter of the population: $E\left[x_i(y_i - z_i'\delta)\right]$; estimator: $\frac{1}{n}\sum_i x_i(y_i - z_i'\delta)$.

Method of moments: choose the parameter estimate so that the corresponding sample moments are also equal to zero. Since we know that $E\left[x_i(y_i - z_i'\delta)\right] = 0$, we choose the parameter estimate $\tilde{\delta}$ so that

$\frac{1}{n}\sum_{i=1}^{n} x_i\left(y_i - z_i'\tilde{\delta}\right) = 0.$
234

Another way of writing $\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - z_i'\tilde{\delta}) = 0$:

$\frac{1}{n}\sum_{i=1}^{n} g_i = 0 \;\Leftrightarrow\; \frac{1}{n}\sum_{i=1}^{n} g(w_i; \tilde{\delta}) = 0 \;\Leftrightarrow\; g_n(\tilde{\delta}) = 0.$

Let's expand $g_n(\tilde{\delta}) = 0$:

$\frac{1}{n}\sum_{i=1}^{n} x_i\left(y_i - z_i'\tilde{\delta}\right) = 0 \;\Leftrightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i z_i'\tilde{\delta} = 0 \;\Leftrightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i z_i'\tilde{\delta} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i \;\Leftrightarrow\; S_{xz}\tilde{\delta} = s_{xy}.$
235

Thus:

$S_{xz}\tilde{\delta} = s_{xy}$ is a system with K (linear) equations in L unknowns: $S_{xz}$ is $K \times L$, $\tilde{\delta}$ is $L \times 1$ and $s_{xy}$ is $K \times 1$.

$S_{xz}\tilde{\delta} = s_{xy}$ is the sample analogue of $E\left[x_i(y_i - z_i'\delta)\right] = 0$, that is, of $E(x_i z_i')\,\delta = E(x_i y_i)$.
236

4.3.1 Method of Moments

Consider

$S_{xz}\tilde{\delta} = s_{xy}.$

If $K = L$ and $\mathrm{rank}(\Sigma_{xz}) = L$ then $\Sigma_{xz} := E(x_i z_i')$ is invertible and $S_{xz}$ is invertible (in probability, for n large enough).

Solving $S_{xz}\tilde{\delta} = s_{xy}$ with respect to $\tilde{\delta}$ gives

$\hat{\delta}_{IV} = S_{xz}^{-1}s_{xy} = \left(\frac{1}{n}\sum_{i=1}^{n} x_i z_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n} x_i y_i = \left(\sum_{i=1}^{n} x_i z_i'\right)^{-1}\sum_{i=1}^{n} x_i y_i = (X'Z)^{-1}X'y.$
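A sketch of this exactly identified IV estimator on simulated data (one endogenous regressor, one excluded instrument; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
d1, d2 = 1.0, 0.5

# z2 is endogenous (correlated with eps); x2 is a valid instrument.
common = rng.normal(size=n)
x2 = rng.normal(size=n)
z2 = 0.8 * x2 + common + rng.normal(size=n)   # relevance: Cov(x2, z2) != 0
eps = 0.7 * common + rng.normal(size=n)       # Cov(z2, eps) != 0, Cov(x2, eps) = 0
y = d1 + d2 * z2 + eps

Z = np.column_stack([np.ones(n), z2])         # regressors
X = np.column_stack([np.ones(n), x2])         # instruments (K = L = 2)

b_ols = np.linalg.lstsq(Z, y, rcond=None)[0]
b_iv = np.linalg.solve(X.T @ Z, X.T @ y)      # (X'Z)^{-1} X'y

print("OLS:", b_ols)   # inconsistent for d2
print("IV :", b_iv)    # close to (1.0, 0.5)
```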
237

Example. Consider

$y_i = \delta_1 + \delta_2 z_{i2} + \varepsilon_i$

and suppose that $\mathrm{Cov}(z_{i2}, \varepsilon_i) \neq 0$, that is, $z_{i2}$ is an endogenous variable. We have $L = 2$, so we need at least $K = 2$ instrumental variables. Let $x_i' = [1 \;\; x_{i2}]$ and suppose that $\mathrm{Cov}(x_{i2}, \varepsilon_i) = 0$ and $\mathrm{Cov}(x_{i2}, z_{i2}) \neq 0$. Thus an IV estimator is

$\hat{\delta}_{IV} = (X'Z)^{-1}X'y.$

Exercise 4.5. Consider the previous example. (a) Show that the IV estimator $\hat{\delta}_{2,IV}$ can be written as

$\hat{\delta}_{2,IV} = \frac{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)(z_{i2} - \bar{z}_2)};$

(b) Show that $\mathrm{Cov}(x_{i2}, y_i) = \delta_2\,\mathrm{Cov}(x_{i2}, z_{i2}) + \mathrm{Cov}(x_{i2}, \varepsilon_i)$; (c) Based on part (b), show that $\hat{\delta}_{2,IV} \xrightarrow{p} \delta_2$ (write the assumptions you need to prove these results).
238

4.3.2 GMM

It may happen that $K > L$ (there are more orthogonality conditions than parameters). In principle, it is better to have as many IVs as possible, so the case $K > L$ is desirable, but then the system $S_{xz}\tilde{\delta} = s_{xy}$ may not have a solution.

Example. Suppose

$S_{xz} = \begin{bmatrix}1.00 & 0.097 & 0.099 \\ 0.097 & 1.011 & 0.059 \\ 0.099 & 0.059 & 0.967 \\ 0.182 & 0.203 & 0.031\end{bmatrix}, \qquad s_{xy} = \begin{bmatrix}1.954 \\ 1.346 \\ 0.900 \\ 0.0262\end{bmatrix}$

($K = 4$, $L = 3$) and try (if you can) to solve $S_{xz}\tilde{\delta} = s_{xy}$. This system is of the same type as

$\begin{cases}\tilde{\delta}_1 + \tilde{\delta}_2 = 1 \\ \tilde{\delta}_3 = 1 \\ \tilde{\delta}_4 + \tilde{\delta}_5 = 5 \\ \tilde{\delta}_1 + \tilde{\delta}_2 = 2\end{cases}$

(the first and fourth equations are incompatible - the system is impossible - there is no solution).
239

This means we cannot set $g_n(\tilde{\delta})$ exactly equal to 0. However, we can at least choose $\tilde{\delta}$ so that $g_n(\tilde{\delta})$ is as close to 0 as possible. In linear algebra two vectors are "close" if the distance between them is relatively small. We will define the distance in $\mathbb{R}^K$ as follows:

the distance between two vectors $a$ and $b$ is equal to $(a - b)'\hat{W}(a - b)$

where $\hat{W}$, called the weighting matrix, is a symmetric positive definite matrix defining the distance.

Example. If

$a = \begin{bmatrix}1 \\ 2\end{bmatrix}, \qquad b = \begin{bmatrix}3 \\ 5\end{bmatrix}, \qquad \hat{W} = \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix},$

the distance between these two vectors is

$(a - b)'\hat{W}(a - b) = \begin{bmatrix}-2 & -3\end{bmatrix}\begin{bmatrix}-2 \\ -3\end{bmatrix} = 2^2 + 3^2 = 13.$
240

Definition (3.1 - GMM estimator). Let $\hat{W}$ be a $K \times K$ symmetric positive definite matrix, possibly dependent on the sample, such that $\hat{W} \xrightarrow{p} W$ as $n \to \infty$, with W symmetric and positive definite. The GMM estimator of $\delta$, denoted $\hat{\delta}(\hat{W})$, is

$\hat{\delta}(\hat{W}) = \arg\min_{\tilde{\delta}} J(\tilde{\delta}, \hat{W})$

where

$J(\tilde{\delta}, \hat{W}) = n\, g_n(\tilde{\delta})'\,\hat{W}\,g_n(\tilde{\delta}) = n\left(s_{xy} - S_{xz}\tilde{\delta}\right)'\hat{W}\left(s_{xy} - S_{xz}\tilde{\delta}\right).$

Proposition. Under Assumptions 3.2 and 3.4,

$\text{GMM estimator: } \hat{\delta}(\hat{W}) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}s_{xy}.$

To prove this proposition you need the following rule:

$\frac{\partial (q'Wq)}{\partial \delta} = 2\,\frac{\partial q'}{\partial \delta}\,Wq$

where q is a $K \times 1$ vector depending on $\delta$ and W is a $K \times K$ matrix not depending on $\delta$.
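A sketch of the closed-form GMM estimator for an arbitrary weighting matrix, here $\hat{W} = S_{xx}^{-1}$ (which, as discussed later, gives the 2SLS estimator); the data and variable names are simulated/ours.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20_000
delta = np.array([1.0, 0.5])

# One endogenous regressor z2, two excluded instruments x2, x3 (K = 3 > L = 2).
common = rng.normal(size=n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
z2 = 0.6 * x2 + 0.6 * x3 + common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)
y = delta[0] + delta[1] * z2 + eps

Z = np.column_stack([np.ones(n), z2])          # regressors (L = 2)
X = np.column_stack([np.ones(n), x2, x3])      # instruments (K = 3)

Sxz = X.T @ Z / n
sxy = X.T @ y / n
W_hat = np.linalg.inv(X.T @ X / n)             # one admissible weighting matrix

# delta_hat(W) = (Sxz' W Sxz)^{-1} Sxz' W sxy
A = Sxz.T @ W_hat @ Sxz
delta_hat = np.linalg.solve(A, Sxz.T @ W_hat @ sxy)
print("GMM with W = Sxx^{-1}:", delta_hat)
```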
241

If $K = L$ then $S_{xz}$ is invertible and $\hat{\delta}(\hat{W})$ reduces to the IV estimator:

$\hat{\delta}(\hat{W}) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}s_{xy} = S_{xz}^{-1}\hat{W}^{-1}(S_{xz}')^{-1}S_{xz}'\hat{W}s_{xy} = S_{xz}^{-1}s_{xy} = \hat{\delta}_{IV}.$

4.3.3 Sampling Error

The GMM estimator can be written as

$\hat{\delta}(\hat{W}) = \delta + \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\bar{g}.$
242

Proof: First consider

$s_{xy} = \frac{1}{n}\sum_i x_i y_i = \frac{1}{n}\sum_i x_i\left(z_i'\delta + \varepsilon_i\right) = \frac{1}{n}\sum_i x_i z_i'\delta + \frac{1}{n}\sum_i x_i\varepsilon_i = S_{xz}\delta + \bar{g}.$

Replacing $s_{xy} = S_{xz}\delta + \bar{g}$ into $\hat{\delta}(\hat{W}) = (S_{xz}'\hat{W}S_{xz})^{-1}S_{xz}'\hat{W}s_{xy}$ produces:

$\hat{\delta}(\hat{W}) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\left(S_{xz}\delta + \bar{g}\right) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}\left(S_{xz}'\hat{W}S_{xz}\right)\delta + \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\bar{g} = \delta + \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\bar{g}.$
243

4.4 Large-Sample Properties of GMM

4.4.1 Asymptotic Distribution of the GMM Estimator

Proposition (3.1 - asymptotic distribution of the GMM estimator). (a) (Consistency) Under Assumptions 3.1-3.4, $\hat{\delta}(\hat{W}) \xrightarrow{p} \delta$; (b) (Asymptotic Normality) If Assumption 3.3 is strengthened to Assumption 3.5, then

$\sqrt{n}\left(\hat{\delta}(\hat{W}) - \delta\right) \xrightarrow{d} N\left(0, \mathrm{Avar}\left(\hat{\delta}(\hat{W})\right)\right)$

where

$\mathrm{Avar}\left(\hat{\delta}(\hat{W})\right) = \left(\Sigma_{xz}'W\Sigma_{xz}\right)^{-1}\Sigma_{xz}'WSW\Sigma_{xz}\left(\Sigma_{xz}'W\Sigma_{xz}\right)^{-1}.$

Recall: $S \equiv E(g_i g_i')$. (c) (Consistent Estimate of $\mathrm{Avar}(\hat{\delta}(\hat{W}))$) Suppose there is available a consistent estimator, $\hat{S}$, of S. Then, under Assumption 3.2, $\mathrm{Avar}(\hat{\delta}(\hat{W}))$ is consistently estimated by

$\widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{W})\right) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}\left(S_{xz}'\hat{W}S_{xz}\right)^{-1}.$
244

4.4.2 Estimation of Error Variance

Proposition (3.2 - consistent estimation of error variance). For any consistent estimator $\hat{\delta}$ of $\delta$, under Assumptions 3.1, 3.2 and the assumption that $E(z_i z_i')$ and $E(\varepsilon_i^2)$ exist and are finite, we have

$\frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 \xrightarrow{p} E(\varepsilon_i^2)$

where $\hat{\varepsilon}_i \equiv y_i - z_i'\hat{\delta}$.

4.4.3 Hypothesis Testing

Proposition (3.3 - robust t-ratio and Wald statistics). Suppose Assumptions 3.1-3.5 hold, and suppose there is available a consistent estimate $\hat{S}$ of $S$ ($= \mathrm{Avar}(\bar{g}) = E(g_i g_i')$). Let

$\widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{W})\right) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}\hat{S}\hat{W}S_{xz}\left(S_{xz}'\hat{W}S_{xz}\right)^{-1}.$
245

Then (a) under the null $H_0: \delta_j = \overline{\delta}_j$,

$t_j = \frac{\sqrt{n}\left(\hat{\delta}_j(\hat{W}) - \overline{\delta}_j\right)}{\sqrt{\widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{W})\right)_{jj}}} = \frac{\hat{\delta}_j(\hat{W}) - \overline{\delta}_j}{SE_j} \xrightarrow{d} N(0, 1)$

where $\widehat{\mathrm{Avar}}(\hat{\delta}(\hat{W}))_{jj}$ is the (j, j) element of $\widehat{\mathrm{Avar}}(\hat{\delta}(\hat{W}))$ and

$SE_j = \sqrt{\frac{1}{n}\,\widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{W})\right)_{jj}}.$

(b) Under the null hypothesis $H_0: R\delta = r$, where p is the number of restrictions and R ($p \times L$) is of full row rank,

$W = n\left(R\hat{\delta}(\hat{W}) - r\right)'\left[R\,\widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{W})\right)R'\right]^{-1}\left(R\hat{\delta}(\hat{W}) - r\right) \xrightarrow{d} \chi^2(p).$
246

4.4.4 Estimation of S

Let

$\hat{S} = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 x_i x_i', \qquad \hat{\varepsilon}_i \equiv y_i - z_i'\hat{\delta}.$

Assumption (3.6 - finite fourth moments). $E\left[(x_{ik} z_{i\ell})^2\right]$ exists and is finite for all $k = 1, \ldots, K$ and $\ell = 1, \ldots, L$.

Proposition (3.4 - consistent estimation of S). Suppose $\hat{\delta}$ is consistent and $S = E(g_i g_i')$ exists and is finite. Then under Assumptions 3.1, 3.2 and 3.6 the estimator $\hat{S}$ above is consistent.
247

4.4.5 Efficient GMM Estimator

The next proposition provides a choice of $\hat{W}$ that minimizes the asymptotic variance.

Proposition (3.5 - optimal choice of the weighting matrix). If $\hat{W}$ is chosen such that

$\hat{W} \xrightarrow{p} S^{-1},$

then the lower bound for the asymptotic variance of the GMM estimators is reached, which is equal to

$\left(\Sigma_{xz}'S^{-1}\Sigma_{xz}\right)^{-1}.$

Definition. The estimator

$\hat{\delta}(\hat{S}^{-1}) = \arg\min_{\tilde{\delta}}\, n\, g_n(\tilde{\delta})'\,\hat{W}\,g_n(\tilde{\delta}), \qquad \hat{W} = \hat{S}^{-1},$

is called the efficient GMM estimator.
248

The efficient GMM estimator can be written as

$\hat{\delta}(\hat{W}) = \left(S_{xz}'\hat{W}S_{xz}\right)^{-1}S_{xz}'\hat{W}s_{xy}, \qquad \hat{\delta}(\hat{S}^{-1}) = \left(S_{xz}'\hat{S}^{-1}S_{xz}\right)^{-1}S_{xz}'\hat{S}^{-1}s_{xy},$

and

$\mathrm{Avar}\left(\hat{\delta}(\hat{S}^{-1})\right) = \left(\Sigma_{xz}'S^{-1}\Sigma_{xz}\right)^{-1}, \qquad \widehat{\mathrm{Avar}}\left(\hat{\delta}(\hat{S}^{-1})\right) = \left(S_{xz}'\hat{S}^{-1}S_{xz}\right)^{-1}.$
249

To calculate the efficient GMM estimator, we need the consistent estimator $\hat{S}$, which depends on $\hat{\varepsilon}_i$. This leads us to the following two-step efficient GMM procedure:

Step 1: Compute $\hat{S} = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 x_i x_i'$, where $\hat{\varepsilon}_i = y_i - z_i'\tilde{\delta}$. To obtain $\tilde{\delta}$:

$\tilde{\delta}(\hat{W}) = \arg\min_{\tilde{\delta}}\, n\left(s_{xy} - S_{xz}\tilde{\delta}\right)'\hat{W}\left(s_{xy} - S_{xz}\tilde{\delta}\right)$

where $\hat{W}$ is a matrix that converges in probability to a symmetric and positive definite matrix, for example

$\hat{W} = S_{xx}^{-1}.$

With this choice, use the (so-called) 2SLS estimator $\hat{\delta}(S_{xx}^{-1})$ to obtain the residuals $\hat{\varepsilon}_i = y_i - z_i'\hat{\delta}$ and $\hat{S} = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 x_i x_i'$.

Step 2: Minimize $J(\tilde{\delta}, \hat{S}^{-1})$ with respect to $\tilde{\delta}$. The minimizer is the efficient GMM estimator,

$\hat{\delta}(\hat{S}^{-1}) = \arg\min_{\tilde{\delta}}\, n\left(s_{xy} - S_{xz}\tilde{\delta}\right)'\hat{S}^{-1}\left(s_{xy} - S_{xz}\tilde{\delta}\right).$
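A sketch of the two-step procedure on a simulated overidentified design (names are ours); the last lines also compute the J statistic used in Section 4.5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 20_000

# Overidentified design: L = 2, K = 3.
common = rng.normal(size=n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
z2 = 0.6 * x2 + 0.6 * x3 + common + rng.normal(size=n)
eps = 0.7 * common + rng.normal(size=n)
y = 1.0 + 0.5 * z2 + eps

Z = np.column_stack([np.ones(n), z2])
X = np.column_stack([np.ones(n), x2, x3])
Sxz, sxy = X.T @ Z / n, X.T @ y / n

def gmm(W):
    A = Sxz.T @ W @ Sxz
    return np.linalg.solve(A, Sxz.T @ W @ sxy)

# Step 1: 2SLS (W = Sxx^{-1}) and S_hat from its residuals.
d_2sls = gmm(np.linalg.inv(X.T @ X / n))
e = y - Z @ d_2sls
S_hat = (X * (e**2)[:, None]).T @ X / n

# Step 2: efficient GMM with W = S_hat^{-1}.
W_eff = np.linalg.inv(S_hat)
d_gmm = gmm(W_eff)

# Hansen's J statistic (overidentifying restrictions, K - L = 1 here).
g_n = X.T @ (y - Z @ d_gmm) / n
J = n * g_n @ W_eff @ g_n
print("2SLS:", d_2sls, " efficient GMM:", d_gmm)
print("J =", J, " p-value =", stats.chi2.sf(J, df=1))
```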
250

Example. (Wooldridge, chap. 15 - data base:card) Wage and education data for a sample
of men in 1976

Dependent Variable: LOG(WAGE)


Method: Least Squares
Sample: 1 3010
Included observations: 3010

Variable Coefficient Std. Error t-Statistic Prob.

C 4.733664 0.067603 70.02193 0.0000


EDUC 0.074009 0.003505 21.11264 0.0000
EXPER 0.083596 0.006648 12.57499 0.0000
EXPER^2 -0.002241 0.000318 -7.050346 0.0000
BLACK -0.189632 0.017627 -10.75828 0.0000
SMSA 0.161423 0.015573 10.36538 0.0000
SOUTH -0.124862 0.015118 -8.259006 0.0000

R-squared 0.290505 Mean dependent var 6.261832


Adjusted R-squared 0.289088 S.D. dependent var 0.443798
S.E. of regression 0.374191 Akaike info criterion 0.874220
Sum squared resid 420.4760 Schwarz criterion 0.888196
Log likelihood -1308.702 Hannan-Quinn criter. 0.879247
F-statistic 204.9318 Durbin-Watson stat 1.861291
Prob(F-statistic) 0.000000

SMSA =1 if in Standard Metropolitan Statistical Area in 1976.

NEAR4 =1 if he grew up near a 4 year college.


251
252

$z_i' = \begin{bmatrix}1 & EDUC_i & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i\end{bmatrix}$
$x_i' = \begin{bmatrix}1 & EXPER_i & EXPER_i^2 & BLACK_i & SMSA_i & SOUTH_i & NEARC4_i & NEARC2_i\end{bmatrix}$

Dependent Variable: LOG(WAGE)


Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation with 1 weight update
Estimation weighting matrix: HAC (Bartlett kernel, Newey-West fixed
bandwidth = 9.0000)
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH
NEARC4 NEARC2

Variable Coefficient Std. Error t-Statistic Prob.

C 3.330464 0.886167 3.758280 0.0002


EDUC 0.157469 0.052578 2.994963 0.0028
EXPER 0.117223 0.022676 5.169509 0.0000
EXPER^2 -0.002277 0.000380 -5.997813 0.0000
BLACK -0.106718 0.056652 -1.883736 0.0597
SMSA 0.119990 0.030595 3.921874 0.0001
SOUTH -0.095977 0.025905 -3.704972 0.0002

R-squared 0.156572 Mean dependent var 6.261832


Adjusted R-squared 0.154887 S.D. dependent var 0.443798
S.E. of regression 0.407983 Sum squared resid 499.8506
Durbin-Watson stat 1.866667 J-statistic 2.200989
Instrument rank 8 Prob(J-statistic) 0.137922
253

4.5 Testing Overidentifying Restrictions

4.5.1 Testing all Orthogonality Conditions

If the equation is exactly identified then $J(\hat{\delta}, \hat{W}) = 0$. If the equation is overidentified then $J(\hat{\delta}, \hat{W}) > 0$. When $\hat{W}$ is chosen optimally, so that $\hat{W} = \hat{S}^{-1} \xrightarrow{p} S^{-1}$, then $J(\hat{\delta}(\hat{S}^{-1}), \hat{S}^{-1})$ is asymptotically chi-squared.

Proposition (3.6 - Hansen's test of overidentifying restrictions). Under Assumptions 3.1-3.5,

$J\left(\hat{\delta}(\hat{S}^{-1}), \hat{S}^{-1}\right) \xrightarrow{d} \chi^2(K - L).$
254

Two comments:

1) This is a specification test, testing whether all the restrictions of the model (which are the assumptions maintained in Proposition 3.6) are satisfied. If $J(\hat{\delta}(\hat{S}^{-1}), \hat{S}^{-1})$ is surprisingly large, it means that either the orthogonality conditions (Assumption 3.3) or the other assumptions (or both) are likely to be false. Only when we are confident about those other assumptions can we interpret a large J statistic as evidence for the endogeneity of some of the K instruments included in $x_i$.

2) Small-sample properties of the test may be a matter of concern.

Example (continuation). EVIEWS provides the J statistic of Proposition 3.6:


255

Dependent Variable: LOG(WAGE)


Method: Generalized Method of Moments
Sample: 1 3010
Included observations: 3010
Linear estimation & iterate weights
Estimation weighting matrix: White
Standard errors & covariance computed using estimation weighting matrix
Convergence achieved after 2 weight iterations
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH
NEARC4 NEARC2

Variable Coefficient Std. Error t-Statistic Prob.

C 3.307001 0.814185 4.061733 0.0000


EDUC 0.158840 0.048355 3.284842 0.0010
EXPER 0.118205 0.021229 5.567988 0.0000
EXPER^2 -0.002296 0.000367 -6.250943 0.0000
BLACK -0.105678 0.051814 -2.039573 0.0415
SMSA 0.117018 0.030158 3.880117 0.0001
SOUTH -0.096095 0.023342 -4.116897 0.0000

R-squared 0.152137 Mean dependent var 6.261832


Adjusted R-squared 0.150443 S.D. dependent var 0.443798
S.E. of regression 0.409055 Sum squared resid 502.4789
Durbin-Watson stat 1.866149 J-statistic 2.673614
Instrument rank 8 Prob(J-statistic) 0.102024
256

4.5.2 Testing Subsets of Orthogonality Conditions

Consider

$x_i = \begin{bmatrix}x_{i1} \\ x_{i2}\end{bmatrix} \quad\begin{matrix}\} \; K_1 \text{ rows} \\ \} \; K - K_1 \text{ rows}\end{matrix}$

We want to test $H_0: E(x_{i2}\varepsilon_i) = 0$.

The basic idea is to compare two J statistics from two separate GMM estimators, one using only the instruments included in $x_{i1}$ and the other using also the suspect instruments $x_{i2}$ in addition to $x_{i1}$. If the inclusion of the suspect instruments significantly increases the J statistic, that is a good reason for doubting the predeterminedness of $x_{i2}$. This restriction is testable if $K_1 \geq L$ (why?).
257

Proposition (3.7 - testing a subset of orthogonality conditions). Suppose that the rank condition is satisfied for $x_{i1}$, so $E(x_{i1}z_i')$ is of full column rank, and that Assumptions 3.1-3.5 hold. Let

$J = n\, g_n(\hat{\delta})'\,\hat{S}^{-1}g_n(\hat{\delta}), \qquad \hat{\delta} = \left(S_{xz}'\hat{S}^{-1}S_{xz}\right)^{-1}S_{xz}'\hat{S}^{-1}s_{xy},$

$J_1 = n\, g_{1n}(\hat{\delta}_1)'\,\hat{S}_{11}^{-1}g_{1n}(\hat{\delta}_1), \qquad \hat{\delta}_1 = \left(S_{x_1z}'\hat{S}_{11}^{-1}S_{x_1z}\right)^{-1}S_{x_1z}'\hat{S}_{11}^{-1}s_{x_1y}.$

Then, under the null $H_0: E(x_{i2}\varepsilon_i) = 0$,

$C \equiv J - J_1 \xrightarrow{d} \chi^2(K - K_1).$
258

Example. EVIEWS 7 performs this test. Following the previous example, suppose you want to test $E(nearc4_i\,\varepsilon_i) = 0$. In our case, $x_{i1}$ is a $7 \times 1$ vector and $x_{i2} = nearc4_i$ is a scalar ($L = 7$, $K_1 = 7$, $K - K_1 = 1$).
259

Instrument Orthogonality C-test Test


Equation: EQ03
Specification: LOG(WAGE) C EDUC EXPER EXPER^2 BLACK SMSA
SOUTH
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH
NEARC4 NEARC2
Test instruments: NEARC4

Value df Probability
Difference in J-stats 2.673614 1 0.1020

J-statistic summary:
Value
Restricted J-statistic 2.673614
Unrestricted J-statistic 5.16E-33

Unrestricted Test Equation:


Dependent Variable: LOG(WAGE)
Method: Generalized Method of Moments
Fixed weighting matrix for test evaluation
Standard errors & covariance computed using estimation weighting matrix
Instrument specification: C EXPER EXPER^2 BLACK SMSA SOUTH
NEARC2

Variable Coefficient Std. Error t-Statistic Prob.

C 0.092557 2.127447 0.043506 0.9653


EDUC 0.349764 0.126360 2.768002 0.0057
EXPER 0.196690 0.052475 3.748287 0.0002
EXPER^2 -0.002445 0.000378 -6.467830 0.0000
BLACK 0.088724 0.129667 0.684247 0.4939
SMSA 0.019006 0.067085 0.283317 0.7770
SOUTH -0.030415 0.046444 -0.654869 0.5126

R-squared -1.171522 Mean dependent var 6.261832


Adjusted R-squared -1.175861 S.D. dependent var 0.443798
S.E. of regression 0.654637 Sum squared resid 1286.934
Durbin-Watson stat 1.818008 J-statistic 5.16E-33
Instrument rank 7
260

4.5.3 Regressor Endogeneity Test

We can use Proposition 3.7 to test for the endogeneity of a subset of regressors.

See example 3.3 of the book.

4.6 Implications of Conditional Homoskedasticity

Assume now:
Assumption (3.7 - conditional homoskedasticity). $E(\varepsilon_i^2 \mid x_i) = \sigma^2$.

This assumption implies

$S \equiv E(g_i g_i') = E(\varepsilon_i^2 x_i x_i') = \sigma^2 E(x_i x_i') = \sigma^2\Sigma_{xx}.$

Its estimator is

$\hat{S} = \hat{\sigma}^2 S_{xx}.$
261

4.6.1 Efficient GMM Becomes 2SLS

The efficient GMM estimator is

$\hat{\delta}(\hat{S}^{-1}) = \left(S_{xz}'\hat{S}^{-1}S_{xz}\right)^{-1}S_{xz}'\hat{S}^{-1}s_{xy} = \left(S_{xz}'(\hat{\sigma}^2 S_{xx})^{-1}S_{xz}\right)^{-1}S_{xz}'(\hat{\sigma}^2 S_{xx})^{-1}s_{xy} = \left(S_{xz}'S_{xx}^{-1}S_{xz}\right)^{-1}S_{xz}'S_{xx}^{-1}s_{xy} \equiv \hat{\delta}_{2SLS}.$

The estimator $\hat{\delta}_{2SLS}$ is called the two-stage least squares (2SLS or TSLS) estimator, for reasons we explain below. It follows that

$\mathrm{Avar}\left(\hat{\delta}_{2SLS}\right) = \sigma^2\left(\Sigma_{xz}'\Sigma_{xx}^{-1}\Sigma_{xz}\right)^{-1}, \qquad \widehat{\mathrm{Avar}}\left(\hat{\delta}_{2SLS}\right) = \hat{\sigma}^2\left(S_{xz}'S_{xx}^{-1}S_{xz}\right)^{-1}.$

Proposition (3.9 - asymptotic properties of 2SLS). Skip.
262

4.6.2 Alternative Derivations of 2SLS

The 2SLS estimator can be written as

$\hat{\delta}_{2SLS} = \left(S_{xz}'S_{xx}^{-1}S_{xz}\right)^{-1}S_{xz}'S_{xx}^{-1}s_{xy} = \left(Z'X(X'X)^{-1}X'Z\right)^{-1}Z'X(X'X)^{-1}X'y.$

Let us interpret the 2SLS estimator as an IV estimator. Use as instruments

$\hat{Z} = X(X'X)^{-1}X'Z$

(or simply $\hat{Z} = X$ if $K = L$). Define the IV estimator as

$\hat{\delta}_{IV} = \left(\frac{1}{n}\sum_{i=1}^{n}\hat{z}_i z_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n}\hat{z}_i y_i = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y = \left(Z'X(X'X)^{-1}X'Z\right)^{-1}Z'X(X'X)^{-1}X'y = \hat{\delta}_{2SLS}.$
263

If $K = L$ then

$\hat{\delta}_{IV} = (X'Z)^{-1}X'y.$

Finally, let us show the 2SLS as the result of two regressions:

1) regress the L regressors on $x_i$ and obtain the fitted values $\hat{z}_i$;

2) regress $y_i$ on $\hat{z}_{i1}, \ldots, \hat{z}_{iL}$ to obtain the estimator $(\hat{Z}'\hat{Z})^{-1}\hat{Z}'y$, which is also $\hat{\delta}_{2SLS}$. In effect,

$\left(\hat{Z}'\hat{Z}\right)^{-1}\hat{Z}'y = \Big(\underbrace{Z'X(X'X)^{-1}X'X(X'X)^{-1}X'Z}_{\hat{Z}'\hat{Z}}\Big)^{-1}\underbrace{Z'X(X'X)^{-1}X'y}_{\hat{Z}'y} = \left(Z'X(X'X)^{-1}X'Z\right)^{-1}Z'X(X'X)^{-1}X'y = \hat{\delta}_{2SLS}.$
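A sketch verifying numerically that the one-shot formula, the IV form with $\hat{Z}$, and the two-regression route give the same $\hat{\delta}_{2SLS}$ (simulated data, names ours):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000

common = rng.normal(size=n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
z2 = 0.6 * x2 + 0.6 * x3 + common + rng.normal(size=n)
y = 1.0 + 0.5 * z2 + 0.7 * common + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z2])      # regressors (L = 2)
X = np.column_stack([np.ones(n), x2, x3])  # instruments (K = 3)

XtX_inv = np.linalg.inv(X.T @ X)

# (i) closed form: (Z'X (X'X)^{-1} X'Z)^{-1} Z'X (X'X)^{-1} X'y
A = Z.T @ X @ XtX_inv @ X.T @ Z
d1 = np.linalg.solve(A, Z.T @ X @ XtX_inv @ X.T @ y)

# First stage: fitted values Z_hat from regressing each regressor on X.
Zhat = X @ np.linalg.lstsq(X, Z, rcond=None)[0]

# (ii) IV with Z_hat as instruments: (Z_hat' Z)^{-1} Z_hat' y
d2 = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)

# (iii) second stage: OLS of y on Z_hat
d3 = np.linalg.lstsq(Zhat, y, rcond=None)[0]

print(d1)
print(d2)
print(d3)   # all three coincide (up to rounding)
```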
264

Exercise 4.6. Consider the equation $y_i = z_i'\delta + \varepsilon_i$ and the instrumental variables $x_i$, where $K = L$. Assume Assumptions 3.1-3.7 and suppose that $x_i$ and $z_i$ are strictly exogenous (so the use of the IV estimator is unnecessary). Show that $\hat{\delta}_{IV} = (X'Z)^{-1}X'y$ is unbiased and consistent but less efficient than $\hat{\delta}_{OLS} = (Z'Z)^{-1}Z'y$. Hint: compare $\mathrm{Var}(\hat{\delta}_{IV} \mid Z, X)$ to $\mathrm{Var}(\hat{\delta}_{OLS} \mid Z, X)$ and notice that an idempotent matrix is positive semi-definite. Also notice that $\mathrm{Var}(\hat{\delta}_{IV} \mid Z, X) - \mathrm{Var}(\hat{\delta}_{OLS} \mid Z, X)$ is positive semi-definite iff $\mathrm{Var}(\hat{\delta}_{OLS} \mid Z, X)^{-1} - \mathrm{Var}(\hat{\delta}_{IV} \mid Z, X)^{-1}$ is positive semi-definite (provided these inverses exist).
