Trix - Post Midsem Merged

The document discusses the large sample properties of the Ordinary Least Squares (OLS) estimator, defining concepts such as convergence in probability and consistent estimators. It highlights that under certain conditions, OLS estimators are consistent and asymptotically normal, even in the presence of non-spherical disturbances like heteroscedasticity. Additionally, it introduces hypothesis testing methods, including the LM test for restrictions and tests for heteroscedasticity such as the Breusch-Pagan and White tests.


1 Large Sample Properties of OLS Estimator

Definition 1. A sequence of random variables {xn : n = 1, 2, ...} converges in probability to a constant a if for all ε > 0, Pr[|xn − a| > ε] → 0 as n → ∞. This is denoted by xn →ᵖ a.

Definition 2. Let {θ̂n : n = 1, 2, ...} be a sequence of estimators of the P × 1 vector θ, where n is the sample size. If θ̂n →ᵖ θ for any value of θ, we say that θ̂n is a consistent estimator of θ.

Theorem 3. Under Assumptions 1-3 (homoscedasticity and normality not required), β̂n →ᵖ β.

Proof. Recall that the population model is (for any observation ex ante):

y = x1 β1 + x2 β2 + ... + xk βk + ε

Let x = (x1 x2 ... xk), so that

y = xβ + ε

and E[ε|x] = 0.
Now,

β̂ = (X′X)⁻¹ (X′Y)

Let xi ≡ (xi1 xi2 ... xik) denote the row vector of all variables for the i-th data point, so that X can be written as the stack of these rows:

X = (x1; x2; ... ; xn)

Then

X′X = (x1′ x2′ ... xn′)(x1; x2; ... ; xn) = Σ_{i=1}^{n} xi′xi

Caution: this is not a scalar. It is exactly equal to the X′X matrix, which is k × k.

Similarly,

X′Y = Σ_{i=1}^{n} xi′yi

Therefore,

β̂ = (Σ_i xi′xi)⁻¹ (Σ_i xi′yi)

or, substituting yi = xiβ + εi,

β̂ = (Σ_i xi′xi)⁻¹ (Σ_i xi′(xiβ + εi))
  = (Σ_i xi′xi)⁻¹ (Σ_i xi′xi) β + (Σ_i xi′xi)⁻¹ (Σ_i xi′εi)
  = β + (Σ_i xi′xi)⁻¹ (Σ_i xi′εi)
  = β + (n⁻¹ Σ_i xi′xi)⁻¹ (n⁻¹ Σ_i xi′εi)

(which is the expression you have seen before: β̂ = β + (X′X)⁻¹X′ε).
Now, by the law of large numbers,

n⁻¹ Σ_i xi′xi →ᵖ E[x′x]

Similarly,

n⁻¹ Σ_i xi′εi →ᵖ E[x′ε]

Under some conditions satisfied here, if n⁻¹ Σ_i xi′xi →ᵖ E[x′x], then (n⁻¹ Σ_i xi′xi)⁻¹ →ᵖ (E[x′x])⁻¹. Hence:

β̂ →ᵖ β + (E[x′x])⁻¹ E[x′ε]

Now we know that E[ε|x] = 0 ⇒ E[x′ε] = 0. Hence

β̂ →ᵖ β

Note that while E[ε|x] = 0 ⇒ E[x′ε] = 0, the latter does not imply the former. So the assumption needed for consistency is even weaker than the one we have made. In other words, E[ε|x] = 0 is sufficient for consistency but not necessary to show consistency of the OLS estimator.

1.1 Inference
We have looked at hypothesis testing under the assumption that ε follows a normal distribution. These tests are called exact tests. However, even if the error terms do not come from a normal distribution, in large samples we can use the Central Limit Theorem to show that the OLS estimators are approximately normally distributed for large n.
While we will not show the proof of the asymptotic normality of the OLS estimator, it is important to point out that it requires all the assumptions needed to prove the Gauss-Markov Theorem. It therefore requires homoscedasticity and the zero conditional mean assumption (so much more than what is required for consistency).
Under these conditions, β̂ is asymptotically normal, that is,

√n (β̂ − β) ~ᵃ N(0, σ² A⁻¹)

where A ≡ E[x′x].
Similar to before, we have to replace σ² by its sample counterpart,

σ̂² = ε̂′ε̂ / n

which is a consistent estimator of σ². Thus

(β̂j − βj) / se(β̂j) ~ᵃ N(0, 1)
1.2 LM Test
y = β1 x1 + β2 x2 + ... + βk xk + ε

If we want to test

H0: β1 = β2 = ... = βq = 0

then the LM test requires estimation of only the restricted model:

y = βq+1 xq+1 + ... + βk xk + ε

If β1 = β2 = ... = βq = 0, then ε will be uncorrelated with each of the excluded variables, that is, xi, i = 1, ..., q.
Carry out the auxiliary regression

ε̂ = π1 x1 + ... + πk xk + error

Note that all the variables have to be included, not just the q excluded variables. Intuitively, the inclusion of xq+1, ..., xk has no impact on this regression because the restricted-model residuals are orthogonal to the included independent variables. Now, if the null hypothesis is indeed true, the R²_ε̂ from this regression should be close to zero. Therefore,

LM = n R²_ε̂ ~ χ²_q

(The variables xq+1, ..., xk need to be included because otherwise nR²_ε̂ will not follow χ²_q.) If LM > cα, then we reject the null hypothesis. This test is also called the nR² test.
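As a rough illustration, the following is a minimal numpy sketch of this nR² LM test; the simulated data and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy import stats

# Minimal sketch of the LM (nR^2) test of H0: the coefficients on the first q
# regressors are zero.  Data and variable names are illustrative only.
rng = np.random.default_rng(0)
n, q = 500, 2
x_test = rng.normal(size=(n, q))                     # regressors whose exclusion is tested
x_keep = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = x_keep @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)   # H0 is true here

def ols_resid(y, X):
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return y - X @ beta

# 1. Estimate the restricted model (tested regressors dropped).
e_r = ols_resid(y, x_keep)

# 2. Auxiliary regression of the restricted residuals on ALL regressors.
X_all = np.column_stack([x_test, x_keep])
e_aux = ols_resid(e_r, X_all)
r2 = 1 - e_aux @ e_aux / ((e_r - e_r.mean()) @ (e_r - e_r.mean()))

# 3. LM = n * R^2 is compared with a chi-squared(q) critical value.
LM = n * r2
print(LM, stats.chi2.sf(LM, df=q))                   # large p-value: fail to reject H0
```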

2 Non-Spherical Disturbances

Consider now a model

Y = Xβ + ε

where E[ε|X] = 0 and

E[εε′|X] = σ²Ω ≡ Σ.

Here Ω is not the identity matrix and is positive definite.

When Ω is not the identity matrix, the errors are referred to as non-spherical. What are the finite sample properties of the OLS estimator when errors are non-spherical? Let us look at the finite sample property of unbiasedness.
Note that

β̂_OLS = (X′X)⁻¹X′Y = β + (X′X)⁻¹X′ε

E[β̂_OLS] = E_X[E(β̂_OLS|X)] = β

Therefore the OLS estimator is still unbiased.

What is the variance of the OLS estimator?

V[β̂_OLS|X] = E[(β̂_OLS − β)(β̂_OLS − β)′|X]
           = E[(X′X)⁻¹X′εε′X(X′X)⁻¹|X]
           = (X′X)⁻¹X′E[εε′|X]X(X′X)⁻¹
           = (X′X)⁻¹X′ΣX(X′X)⁻¹
           = (σ²/n)(n⁻¹X′X)⁻¹(n⁻¹X′ΩX)(n⁻¹X′X)⁻¹

An implication of this is that using σ²(X′X)⁻¹ as the variance matrix is WRONG! Therefore the usual F and t statistics are also wrong. Hence the Gauss-Markov theorem does not hold under non-spherical errors. It is easy to show [Exercise] that

1. The OLS estimator is consistent.
2. The OLS estimator is asymptotically normal. In particular, β̂_OLS ~ᵃ N(β, A⁻¹BA⁻¹), where A ≡ plim n⁻¹X′X and B ≡ plim n⁻¹X′ΩX.
3. The OLS estimator is NO longer the asymptotically efficient estimator.


Let us look at two special cases.

2.0.1 1. Heteroscedasticity:

σ²Ω = σ² diag(ω11, ω22, ..., ωnn) = diag(σ1², σ2², ..., σn²)

Often heteroscedasticity is induced by forms like

E[εε′|X] = σ² diag(x11, x21, ..., xn1)

Here the variance of the error term for each observation is different because it is a function of the value of the variable x1 for that particular observation.
In general, under heteroscedasticity,

E[εε′|X] = diag(E[ε1²|X], E[ε2²|X], ..., E[εn²|X]) ≠ σ²I

We have already seen that OLS estimators under heteroscedasticity (as a special case of non-spherical errors) still retain the properties of consistency and unbiasedness. White suggested that the standard errors can be corrected by using an appropriate estimator of E[εε′|X]. To motivate the correction, let us explore what X′ΣX looks like. Recall that X can be written as the stack of the rows x1, x2, ..., xn, where each xi refers to the vector of the i-th observation on all the variables. Hence,

X′ΣX = (x1′ x2′ ... xn′) diag(E[ε1²|X], ..., E[εn²|X]) (x1; x2; ... ; xn) = Σ_i E[εi²|X] xi′xi

White suggested that this can be estimated by

(1/n) Σ_i ε̂i² xi′xi

and substituted into the correct variance-covariance matrix for the OLS estimator, i.e., (X′X)⁻¹X′ΣX(X′X)⁻¹. Therefore the standard errors calculated this way are called White (also Huber-White) robust standard errors or heteroscedasticity-robust standard errors. The t ratios using these standard errors are called robust t's. This is operationalized in Stata using the robust option.
It is important to note here that this estimator of the variance-covariance matrix has good asymptotic properties and one does not need to know the form of heteroscedasticity. However, beware of its rather poor small-sample properties. The implication of this is that for smaller samples, this estimator should be used with caution.
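A minimal numpy sketch of this correction (the basic sandwich form of the White estimator) is below; the simulated data and names are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of White's heteroscedasticity-robust variance estimator,
# (X'X)^{-1} (sum_i e_i^2 x_i x_i') (X'X)^{-1}.  Names and data are illustrative.
rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(1, 3, size=n)
X = np.column_stack([np.ones(n), x1])
eps = rng.normal(scale=x1, size=n)              # error variance depends on x1
y = X @ np.array([2.0, 1.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta

# "Meat" of the sandwich: sum of e_i^2 * x_i x_i'
meat = (X * e[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv             # White robust variance matrix
V_usual = (e @ e / (n - X.shape[1])) * XtX_inv  # conventional (wrong here) variance

print(np.sqrt(np.diag(V_robust)))               # robust standard errors
print(np.sqrt(np.diag(V_usual)))
```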

Because of the concern for small samples and the efficiency advantages of knowing the form of heteroscedasticity (we will come to this later when we discuss the Generalized Least Squares estimator), it may become important to test for heteroscedasticity. There are many tests that are popular. We discuss two of them here.

0.0.1 Breusch-Pagan Test:

This test is used to test the null hypothesis H0: V(ε|x1, ..., xk) = σ².
Note that under the assumption that E[ε|x1, ..., xk] = 0, this implies that E[ε²|x1, ..., xk] = σ².
To test this null hypothesis, we can posit that (note that x1 = 1):

ε² = δ1 + δ2 x2 + ... + δk xk + v

and test the null that δ2 = ... = δk = 0.
To do this, we use the residual ε̂² instead of ε² and conduct an LM test: LM ≡ nR²_ε̂² ~ χ²_{k−1}. If one suspects that only a smaller set of variables is correlated with ε², use the smaller set, as the inclusion of more variables affects the degrees of freedom.
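A minimal sketch of the Breusch-Pagan test follows, assuming simulated data and illustrative variable names.

```python
import numpy as np
from scipy import stats

# Minimal sketch of the Breusch-Pagan LM test: regress squared OLS residuals on
# the regressors and use n*R^2 ~ chi2(k-1) under the null of homoscedasticity.
rng = np.random.default_rng(2)
n = 500
x2 = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x2])                 # x1 = 1 (constant)
y = X @ np.array([1.0, 0.8]) + rng.normal(scale=1 + x2, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
e2 = (y - X @ beta) ** 2

# Auxiliary regression: squared residuals on the regressors
d = np.linalg.solve(X.T @ X, X.T @ e2)
u = e2 - X @ d
r2 = 1 - u @ u / ((e2 - e2.mean()) @ (e2 - e2.mean()))

LM = n * r2
print(LM, stats.chi2.sf(LM, df=X.shape[1] - 1))       # small p-value: reject homoscedasticity
```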

0.0.2 White Test:

The Breusch-Pagan test sets up a null of a constant conditional variance. However, one can test a weaker null: that ε² is uncorrelated with the xj's, the xj²'s and the cross products xj·xh (j ≠ h).
The test again involves ε̂² and a regression of ε̂² on all these terms. However, a weakness that has been pointed out is that the inclusion of all the variables (linear terms, square terms and cross products) means that there are too many parameters to be estimated (and a consequent loss of degrees of freedom). A trick that has been suggested is to notice that, for any observation i,

ŷi = β̂1 + β̂2 x2i + ... + β̂k xki

Squaring ŷi gives, on the right-hand side, the squares and cross products of all the variables. Therefore if we include ŷi and ŷi² in a regression of ε̂², then we are in effect regressing on all the linear terms, square terms and interaction terms. The test is then an LM test for the joint significance of the coefficients of ŷi and ŷi².
A cautionary note for both tests is that if E[y|x] is mis-specified, then a test of heteroscedasticity can reject H0 even if it is true.
If we know the form of heteroscedasticity, then we have other ways to approach the problem. Let us assume that Σ = σ²Ω is known. Then in the population model,

Y = Xβ + ε

consider the transformation obtained by pre-multiplying both sides by Ω^(−1/2), where Ω^(−1/2)·Ω^(−1/2) = Ω⁻¹ (Ω^(−1/2) is symmetric). Therefore:

Ω^(−1/2)Y = Ω^(−1/2)Xβ + Ω^(−1/2)ε

Define Ω^(−1/2)Y ≡ Y*, Ω^(−1/2)X ≡ X* and Ω^(−1/2)ε ≡ ε*. In terms of the starred variables, the model is

Y* = X*β + ε*

What are the properties of ε*?

E[ε*|X] = E[Ω^(−1/2)ε|X] = Ω^(−1/2)E[ε|X] = 0

Moreover,

Var[ε*|X] = E[ε*ε*′|X] = E[Ω^(−1/2)εε′Ω^(−1/2)|X] = Ω^(−1/2)E[εε′|X]Ω^(−1/2) = σ²Ω^(−1/2)ΩΩ^(−1/2) = σ²I

Therefore in this transformed model, the transformed error terms are homoscedastic. The transformed model, therefore, satisfies all the conditions required for the Gauss-Markov Theorem. Hence the OLS estimator of β using this transformed model has the lowest variance among all linear unbiased estimators.
The OLS estimator of β on the transformed model is:

β̂* = (X*′X*)⁻¹X*′Y*

Expressing β̂* in terms of the untransformed model with X, Y, we get

β̂_GLS = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y

The superscript GLS refers to the Generalized Least Squares estimator. Note that this is not the same estimator as the OLS estimator in terms of the untransformed model.
We can use the same trick of representing the GLS estimator as the OLS estimator on the transformed model to derive other statistics of interest. For example:

V(β̂_GLS|X) = V(β̂*|X) = σ²(X*′X*)⁻¹ = σ²(X′Ω⁻¹X)⁻¹

As noted above, since the OLS estimator of β on the transformed model is BLUE, so is β̂_GLS when V(ε|X) = σ²Ω. It is also easy to show that the GLS estimator is consistent (just note that the conditions for consistency are met for the transformed model).
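A minimal numerical sketch of the GLS transformation, assuming a known diagonal Ω (pure heteroscedasticity); data and names are illustrative.

```python
import numpy as np

# Minimal sketch of GLS when Sigma = sigma^2 * Omega is known, here with
# heteroscedasticity Omega = diag(x1_i).  Names and data are illustrative.
rng = np.random.default_rng(3)
n = 300
x1 = rng.uniform(1, 4, size=n)
X = np.column_stack([np.ones(n), x1])
omega_diag = x1                                   # Var(eps_i | X) proportional to x1_i
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(omega_diag), size=n)

# Transform with Omega^{-1/2}: equivalently weight each row by 1/sqrt(omega_ii)
w = 1.0 / np.sqrt(omega_diag)
X_star, y_star = X * w[:, None], y * w
beta_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Identical to the direct formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y
Om_inv = np.diag(1.0 / omega_diag)
beta_gls_direct = np.linalg.solve(X.T @ Om_inv @ X, X.T @ Om_inv @ y)
print(beta_gls, beta_gls_direct)
```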

Feasible GLS Estimator: The big assumption when discussing the GLS estimator is that you know σ²Ω. In the example it implies that you know σ² and Ω. In most practical cases you don't. It may be, in the example above, that you think the variance function h is a function of all the variables but you are not sure.
In such cases (which will be common), you can model h and use the data to estimate the unknown parameters of the model. That is, use ĥ instead of h. More generally, the Feasible Generalized Least Squares (FGLS) estimator is:

β̂_FGLS = (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹Y

For example, let

Var(ε|X) = σ²h(·) = σ² exp(δ0 + δ1x1 + ... + δkxk)

To estimate this, we can use the model:

log ε² = θ0 + δ1x1 + ... + δkxk + e

where e has all the nice properties we have assumed for the classical linear regression model. Since we do not have data on ε, it can be shown that we can use ε̂ instead. Thus we estimate the model

log ε̂² = θ0 + δ1x1 + ... + δkxk + e′

by OLS and calculate σ̂²ĥ(·) = exp(θ̂0 + δ̂1x1 + ... + δ̂kxk). Note that even if σ² is not separately identified from δ0, it does not matter: even if we transformed the original equation using (σ²Ω)^(−1/2) instead of Ω^(−1/2), nothing substantive would change.
The FGLS estimator is consistent and asymptotically more efficient than the OLS estimator. This has to be true because Ω̂ →ᵖ Ω, so the FGLS estimator converges asymptotically to the GLS estimator. We know that the GLS estimator is more efficient than OLS. Hence, asymptotically, the FGLS estimator is more efficient than the OLS estimator.
In small samples, however, things are not so clear. The FGLS estimator is not unbiased, and it may be less efficient than the OLS estimator. Hence with very small samples, be wary if you are doing FGLS and always report the results using the OLS estimator as a comparison.
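A minimal sketch of this two-step FGLS procedure under the exponential variance model; data and names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of FGLS with Var(eps|x) = sigma^2 * exp(delta0 + delta1*x1):
# estimate the variance function from log squared OLS residuals, then weight.
rng = np.random.default_rng(4)
n = 500
x1 = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x1])
y = X @ np.array([1.0, 1.5]) + rng.normal(size=n) * np.exp(0.5 * x1)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# Step 1: regress log(e^2) on the x's to estimate the variance function
g = np.linalg.solve(X.T @ X, X.T @ np.log(e ** 2))
h_hat = np.exp(X @ g)                          # estimated sigma^2 * h(x), up to scale

# Step 2: weighted (GLS-type) regression using 1/sqrt(h_hat) weights
w = 1.0 / np.sqrt(h_hat)
Xw, yw = X * w[:, None], y * w
beta_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
print(beta_ols, beta_fgls)
```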

0.2 2. Autocorrelation/Serial Correlation


This is the case where error terms are correlated across observations. This can typically arise in temporal data: for example, shocks to stock prices can be correlated over time (autocorrelation). In the cross-sectional context, error terms (unobservables) that belong to the same cluster (say households in the same village) may be correlated with each other (often called serial correlation, though the two terms are also used interchangeably). In the case of autocorrelation (we will refer to both cases as autocorrelation), the variance-covariance matrix has this particular form:

E[εε′|X] = [ σ²ε       E[ε1ε2]   ...  E[ε1εN]
             E[ε2ε1]   σ²ε       ...  E[ε2εN]
             :         :         :    :
             E[εNε1]   E[εNε2]   ...  σ²ε     ]
Notice that once you know this, everything we have discussed above with respect to the GLS estimator and the FGLS estimator goes through. So we can use these estimators in this case as well. Here, we will demonstrate what the variance-covariance matrix looks like for time series data. Time series data follow stochastic processes (you will learn more about this in your time series course). In time series data, as a convention, observation i is referred to as observation t and N is referred to as T. Suppose that the error term in any period t depends on the error term in the previous period t − 1. We can model this as:

εt = ρεt−1 + ut    (1)

This is referred to as an autoregressive model of order 1 (AR(1)). In this model, E[ut] = 0, E[ut²] = σ²u and Cov[ut, us] = 0 if t ≠ s.
We will now derive the exact form that

E[εε′|X]

takes for AR(1). Note that for different time series models, the implied E[εε′|X] will be different and will have to be derived case by case.
Notice that (1) is a difference equation in ε. By repeated backward substitution, we will get an infinite series. That is,

εt = ρ(ρεt−2 + ut−1) + ut
   = ut + ρut−1 + ρ²εt−2
   = ut + ρut−1 + ρ²(ρεt−3 + ut−2)

and so on, till we get the infinite series:

εt = ut + ρut−1 + ρ²ut−2 + ...

Now

V(εt) = V(ut) + ρ²V(ut−1) + ρ⁴V(ut−2) + ... = σ²u + ρ²σ²u + ρ⁴σ²u + ...

If |ρ| < 1 then this sum converges to σ²u/(1 − ρ²). If we denote V(εt) by σ²ε, then we have shown that

σ²ε = σ²u / (1 − ρ²)
Now think of any term of E[εε′|X] that differs by one time period (for example E[ε1ε2], E[ε4ε3], E[ε10ε11]). Let us represent all these terms that differ by one time period by E[εt−1εt] (the terms E[εtεt−1] are the same because of symmetry). Now,

E[εtεt−1] = E[εt−1(ρεt−1 + ut)] = ρE[ε²t−1] + E[εt−1ut]

Since ut is uncorrelated with all the other u's, including all its lagged values, ut is uncorrelated with εt−1. Hence,

E[εtεt−1] = ρE[ε²t−1] = ρσ²ε = ρ σ²u/(1 − ρ²)

In general, it is easy to show that [EXERCISE]

E[εtεt−s] = ρ^s σ²u/(1 − ρ²)

Therefore,

E[εε′|X] = σ²ε [ 1          ρ          ...  ρ^(T−1)
                 ρ          1          ...  ρ^(T−2)
                 :          :          :    :
                 ρ^(T−1)    ρ^(T−2)    ...  1       ]

         = (σ²u/(1 − ρ²)) [ 1          ρ          ...  ρ^(T−1)
                            ρ          1          ...  ρ^(T−2)
                            :          :          :    :
                            ρ^(T−1)    ρ^(T−2)    ...  1       ]
In addition to the GLS estimator, OLS estimators can be used (with the
correct variance covariance matrix). The OLS estimator is still consistent and
unbiased.
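As a small illustration, the following sketch builds the AR(1) variance-covariance matrix above for illustrative values of ρ, σu and T (these numbers are assumptions, not from the notes).

```python
import numpy as np

# Build E[eps eps' | X] = (sigma_u^2 / (1 - rho^2)) * [rho^{|t-s|}]_{t,s} for AR(1).
rho, sigma_u, T = 0.6, 1.0, 5
t = np.arange(T)
Sigma = (sigma_u ** 2 / (1 - rho ** 2)) * rho ** np.abs(t[:, None] - t[None, :])
print(Sigma.round(3))
# Once Sigma (or Omega) is known, the GLS/FGLS formulas of the previous section
# apply unchanged, e.g. beta_gls = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y.
```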
If one does not want to assume any particular time series model, one can use
the estimated OLS residuals to calculate a general variance covariance matrix.
The correction to the variance-covariance matrix using estimated residuals (just like robust standard errors) is called the Newey-West correction. This involves estimating the E[εiεj] terms.
In the presence of autocorrelation, there is at least one particular instance where the OLS estimator is biased and inconsistent. This is the case of a lagged dependent variable. To illustrate this, suppose we are estimating the following model using OLS:
yt = βyt−1 + εt (2)
where,
εt = ρεt−1 + ut (3)
Let all the assumptions we made about ut stand. Assume in addition that
|β| < 1. For OLS estimation of equation (2) to be consistent, the following must
be true:
Cov (yt−1 , εt ) = 0

But this is not the case:

Cov(yt−1, εt) = Cov(βyt−2 + εt−1, εt) = Cov(εt−1, εt)
Since Cov(εt−1, εt ) ̸= 0, the OLS estimator is inconsistent. The presence
of lagged dependent variables, by itself, does not pose any inconsistency. It is
the autocorrelated error terms together with the lagged dependent variable that
causes inconsistency. However, in a large number of cases, even when there is
autocorrelation, the inconsistency of OLS estimator in the presence of lagged
dependent variable comes because of specification error. To see this, let us
return to the above example.
Substitute (3) in (2). We get
yt = βyt−1 + ρεt−1 + ut
Now, we know that
yt−1 = βyt−2 + εt−1 (4)
Substituting for εt−1 in (3) by using (4), we get
yt = βyt−1 + ρyt−1 − βρyt−2 + ut
Therefore,
yt = (β + ρ) yt−1 − βρyt−2 + ut
Now ut is uncorrelated with yt−1 and yt−2 . Hence if we regress yt on yt−1
and yt−2 , we will get consistent estimates of (β + ρ) and −βρ. The inconsistency
came about because if we have AR(1) correlated error terms, the main estima-
tion equation necessarily needs to include yt−2 . So in running the regression
of yt on just yt−1 , we had misspecified the equation and not thought enough
of the dynamic structure of the error terms. So to avoid inconsistency due to
autocorrelation, researchers often include a rich dynamic structure if they need
to include lagged dependent variables. If we assume a particular AR process for
the error terms, it is easy to calculate how many lags of the dependent variables
need to be included as regressors if we want to include any lagged dependent
variable as a regressor [Exercise :Try with ε following AR(2)].
As with heteroscedastic error terms, asymptotic corrections do not work well for small samples. So it is often desirable to elicit the form of the time series process of the error term. We will discuss three tests in this note. The first two tests are specifically for testing whether the error terms follow an AR(1) process.
1. t test for AR(1) in the case of strictly exogenous regressors¹. In the model

εt = ρεt−1 + ut

ut has the usual nice properties. We are testing the null hypothesis H0: ρ = 0. To test this hypothesis, we follow these two steps:

a. Run an OLS regression of yt on xt1, ..., xtk and obtain residuals ε̂t for all t.
b. Run an OLS regression of ε̂t on ε̂t−1. Conduct a t test to check for the significance of the coefficient of ε̂t−1.

¹ Strictly exogenous regressors are such that xt is uncorrelated with εs, s = 1, ..., T. Contrast this with weak exogeneity, which requires only that xt is uncorrelated with εt.
2. Durbin-Watson Test. This test requires all the classical assumptions under the null hypothesis, including normality of the error terms. The DW statistic is:

DW = Σ_{t=2}^{T} (ε̂t − ε̂t−1)² / Σ_{t=1}^{T} ε̂t²

It can be shown that

DW ≈ 2(1 − ρ̂)

where

ρ̂ ≡ Σ_{t=2}^{T} ε̂t ε̂t−1 / Σ_{t=2}^{T} ε̂²t−1

The approximation sign is there because we need to assume that Σ_{t=2}^{T} ε̂²t−1 ≈ Σ_{t=1}^{T} ε̂t². [Exercise]
The null hypothesis is H0: ρ = 0 against the alternative H1: ρ > 0. [This is the more common alternative because time series data are usually positively correlated. A similar method, but with different thresholds, can be derived if the alternative is ρ < 0.] The intuitive rationale comes from the fact that under the null hypothesis DW ≈ 2, while if the alternative is true, DW < 2. Therefore, intuitively, a DW well below 2 indicates that there is autocorrelation of order 1. More formally, however, it is difficult to calculate the distribution of the test statistic under the null hypothesis (hence it is not possible to calculate one threshold value, as in the t or z test). Instead, this test specifies two bounds dU and dL such that: if DW > dU, we fail to reject the null, so there may be no first-order autocorrelation; if DW < dL, we reject the null and there is first-order autocorrelation. At any value between dL and dU, the test is inconclusive.
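A minimal sketch computing the DW statistic and the implied ρ̂ from OLS residuals, using simulated AR(1) errors (all values are illustrative assumptions).

```python
import numpy as np

# Durbin-Watson statistic from OLS residuals, compared with 2*(1 - rho_hat).
rng = np.random.default_rng(5)
T, rho = 200, 0.5
u = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):                       # simulate AR(1) errors
    eps[t] = rho * eps[t - 1] + u[t]
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
y = X @ np.array([1.0, 2.0]) + eps

e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
DW = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
print(DW, 2 * (1 - rho_hat))                # DW is approximately 2*(1 - rho_hat)
```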
If we use the two tests above, we are testing against the alternative of first-order autocorrelation. However, it could well be that the autocorrelation is not of the first order (so we survive the two tests above) but of a higher order. For example, shocks may not be correlated in subsequent periods but may affect outcomes k periods later. To test this we can use the next test:

3. Breusch-Godfrey Test: Suppose we are testing for AR(q), that is, in the following model,

εt = ρ1εt−1 + ... + ρqεt−q + et

To test this hypothesis we need to follow two steps:

a. Run an OLS regression of yt on the xt's. Calculate residuals ε̂t.
b. Run a regression of ε̂t on xt and ε̂t−1, ε̂t−2, ..., ε̂t−q, and conduct a joint test that the coefficients of all the lagged ε̂ terms are equal to zero.

Under homoscedasticity, we can conduct an LM test where LM = (n − q)R²_ε̂ ~ χ²_q.

0.2.1 Correcting for First Order Serial Correlation

We can use transformations similar, in spirit, to the ones above to estimate models with first-order autocorrelated error terms. Recall that

σ²ε = σ²u / (1 − ρ²)

The original model is

yt = β0 + β1xt + εt

Lag this model one period (so write the expression for yt−1) and, for all observations starting from t = 2, carry out the following transformation:

yt − ρyt−1 = (1 − ρ)β0 + β1(xt − ρxt−1) + εt − ρεt−1

We know that ut ≡ εt − ρεt−1. Now it is easy to show that u for observations t and t − 1 are uncorrelated [Exercise: SHOW]. That is,

Cov(ut, ut−1) = 0

Therefore, if ρ were known, we could calculate x̃t ≡ xt − ρxt−1 and ỹt ≡ yt − ρyt−1 and run an OLS regression. However, there is still one catch. Notice that we are dealing with observations t = 2, ..., T. What should be done with the first observation? We could omit the first observation; if you do so, you would be embarking on a Cochrane-Orcutt procedure. Suppose we want to keep the first observation: the model for that observation is still the original model, since it cannot be quasi-differenced. The error variance for this first observation is σ²ε, which is different from σ²u, the variance of the error term in the transformed model. We therefore transform the model for the first observation so that the error variance of the transformed model is σ²u. To do so, we multiply both sides of the original model (for the first observation ONLY) by √(1 − ρ²), so that for t = 1

√(1 − ρ²) y1 = β0 √(1 − ρ²) + β1 √(1 − ρ²) x1 + √(1 − ρ²) ε1

Now the variance of √(1 − ρ²) ε1 is σ²u. This procedure, which keeps the first observation and transforms it differently from the rest of the observations, is called the Prais-Winsten procedure. In general both methods are equivalent to a GLS estimator, with a small difference in the variance-covariance matrix (the first observation).
As with any GLS procedure, this presupposes knowledge of ρ. But ρ is usually unknown. So we follow a procedure where we first run a simple OLS to get the ε̂t's. We then regress ε̂t on ε̂t−1 and estimate ρ̂. This FGLS procedure does not have good small-sample properties. So it is recommended that we follow an iterative procedure, where we start with any ρ̂ (say the one implied by the simple OLS residuals). We then carry out an FGLS procedure to estimate the model. The estimated parameters imply an updated set of residuals. These residuals can be regressed on their first-order lags to get another value of ρ̂. We then use this again to carry out another FGLS, and repeat the process until the values of ρ̂ in subsequent iterations converge.
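A minimal sketch of this iterated procedure in its Cochrane-Orcutt form (first observation dropped); a Prais-Winsten variant would instead rescale that observation by √(1 − ρ̂²). Data and names are illustrative assumptions.

```python
import numpy as np

def cochrane_orcutt(y, x, n_iter=20, tol=1e-8):
    """Iterated Cochrane-Orcutt FGLS for y_t = b0 + b1*x_t + eps_t, AR(1) errors."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # start from plain OLS
    rho = 0.0
    for _ in range(n_iter):
        e = y - X @ beta
        rho_new = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
        # quasi-difference the data for t = 2, ..., T
        ys = y[1:] - rho_new * y[:-1]
        Xs = np.column_stack([(1 - rho_new) * np.ones(T - 1),
                              x[1:] - rho_new * x[:-1]])
        beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho

rng = np.random.default_rng(6)
T, rho_true = 300, 0.7
u = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho_true * eps[t - 1] + u[t]
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + eps
print(cochrane_orcutt(y, x))
```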

0.1.3 Measurement Error
There are almost no known finite sample properties of estimators when there is
measurement error. All the known results are asymptotic.
Assume, for simplicity:
y ∗ = βx∗ + ε
Here * denotes the true values of y and x. Instead, let y and x be the observed
values which are potentially measured with error.

1. Only y is measured with error: so

y = y* + v

In this case, when we run the regression of y on x (= x*), we are actually estimating the regression model

y = βx* + ε + v

As long as x* is uncorrelated with the error term v, we have no problem.

2. Only x is measured with error: so

x = x* + u

y* = βx* + ε = β(x − u) + ε = βx + w

where w = ε − βu.

Cov(x, w) = Cov(x* + u, ε − βu) = −βσ²u

Therefore β̂ from this regression is inconsistent. For this model it is easy to see that

β̂ = Σ xi yi / Σ xi² = (n⁻¹ Σ xi yi) / (n⁻¹ Σ xi²)

plim β̂ = plim n⁻¹ Σ (x*i + ui)(βx*i + εi) / plim n⁻¹ Σ (x*i + ui)²
       = β · plim n⁻¹ Σ x*i² / (plim n⁻¹ Σ x*i² + plim n⁻¹ Σ ui²)
       = β / (1 + σ²u / plim n⁻¹ Σ x*i²)

Notice that as σ²u → ∞, β̂ → 0. This is called attenuation: in this case the β̂ values are muted and close to zero.
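A minimal simulation of this attenuation result; all the numbers below are illustrative assumptions.

```python
import numpy as np

# Attenuation bias: the OLS slope on a mismeasured regressor shrinks toward
# zero by the factor 1 / (1 + sigma_u^2 / var(x*)).
rng = np.random.default_rng(7)
n, beta = 100_000, 2.0
x_star = rng.normal(scale=1.0, size=n)
y = beta * x_star + rng.normal(size=n)

for sigma_u in [0.0, 0.5, 1.0, 2.0]:
    x = x_star + rng.normal(scale=sigma_u, size=n)    # observed, error-ridden x
    b_hat = np.sum(x * y) / np.sum(x * x)
    b_theory = beta / (1 + sigma_u ** 2 / np.var(x_star))
    print(sigma_u, round(b_hat, 3), round(b_theory, 3))
```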
In a multivariate regression, if

Y = X*β + ε

and the observed value is

X = X* + u

then, defining

β̂ = (X′X)⁻¹X′Y = (X′X)⁻¹X′(X*β + ε)
   = (X′X)⁻¹X′X*β + (X′X)⁻¹X′ε
   = (X′X)⁻¹X′(X − u)β + (X′X)⁻¹X′ε
   = (X′X)⁻¹X′Xβ − (X′X)⁻¹X′uβ + (X′X)⁻¹X′ε
   = β − (X′X)⁻¹X′uβ + (X′X)⁻¹X′ε

Now

plim (X′X/n) = Q* + Σu
plim (X′u/n) = Σu
plim (X′ε/n) = 0

So

plim β̂ = β − (Q* + Σu)⁻¹ Σu β

Now what if only one variable has error? This implies:

Σu = diag(σ²u1, 0, ..., 0)

that is, everything except the first diagonal element is equal to zero. Denoting the elements of the inverse of Q* by q^{ij}, we get

plim β̂1 = β1 / (1 + σ²u1 q^{11})

So we have the usual attenuation bias. However, for the other parameters,

plim β̂k = βk − β1 · σ²u1 q^{k1} / (1 + σ²u1 q^{11})

The second part of the expression is of unknown sign, so it is not possible to determine the direction of the bias.

ABHIROOP MUKHOPADHYAY, PLANNING UNIT, ISI (DELHI) MAY 2017

Identification of Demand
and Supply Curves

In this lecture, we explore how to “extract” the demand and supply curves from

data. In empirical microeconomics, this is referred to as “identification” of

demand and supply curves. To begin with, let us illustrate what data from

markets would look like. Figure 1 provides the scatter plot of prices and

quantities of a vegetable (potato) using the National Sample Survey data on

Consumption Expenditure Survey (66th round 2009).


[Figure 1: scatter plot of median price (med_price) against total quantity (total_quantity) across markets]

Each data point plotted represents a market1, that is, each point refers to the

total quantity purchased/sold in the market (total quantity) and median price for

a unit of quantity in the market (that is per kilo).

Let us begin by assuming that the demand curve for the good is

q = a + b·p   (1)

where q is the quantity bought and p is the price of the good. The usual shape of a linear demand curve is a downward-sloping line. Hence, according to theory, we expect (i) a ≥ 0 and (ii) b ≤ 0. (i) is implied by the fact that quantity cannot be a negative number; (ii) is implied by the law of demand.

The first temptation may be to use the knowledge of bivariate linear regression

and estimate the following empirical model:

q_m = a + b·p_m + ε_m ,  m = 1 … M   (1′)

Here, the subscript m refers to a particular market. Recall the data we have are on markets and there are M of them. q_m and p_m therefore refer to the quantity and price, respectively, in the m-th market. ε_m is a market-specific error term (more on this later).

This model can be estimated using the simple Ordinary Least Squares (OLS) estimation method. But for the estimators of a and b to be consistent (statistically speaking²), it must be the case that the covariance between p_m and ε_m is zero,

i.e. Cov(p_m, ε_m) = 0   (2)

1 Consumer purchases and quantities have been aggregated up to the level of a state region. A state is usually divided into state regions based on climate.
2 An estimator is consistent if it converges, in probability, to the true population parameter.


This condition can be seen intuitively, if we invoke some microeconomic theory.

First, the law of demand says that ceteris paribus, when the price of a good rises,

the quantity demanded falls. Ceteris paribus, that is, everything else being same,

imposes some structure on our empirical exercise. For example consider two

markets: A and B. Let us suppose that, in the data, the price is higher in market A than in market B (p_A > p_B). In addition, it may also be true that the average income of people living in market A is higher than in market B. Denote income in each market by I; therefore I_A > I_B. It is therefore likely that price and income are positively correlated.

Let us now think about what equation (1′) signifies. In particular, what is the interpretation of the error term ε_m? ε_m captures all the omitted variables that affect the demand for the good (except price, since that has already been explicitly taken into account). For example, the error term will include income, preference for the good (Bengalis may love potatoes more than Punjabis) and all the other factors that shift the demand curve.

Now bring the two ideas together. ε_m contains information on income (which we have chosen to omit). But we also know that income is positively correlated with price. Hence the error term ε_m will necessarily be positively correlated with p_m. Condition (2) is then violated and we get inconsistent estimators of a and b.

To understand this further, let us try to interpret the coefficient b in the demand

function. b is the change in quantity demanded due to a change in prices, when

nothing else changes. Note that the omitted variable income affects both price

and quantity. So it may lead to a higher price and a higher quantity demanded.

Therefore the coefficient b that we will estimate using equation (1’) will not be


the pure causal effect of price on quantity. It is also picking up the fact that, in

data, if we consider markets with higher prices, they also have higher income. So

the estimate of b will pick up not only the effect of price on quantity (that we

want) but also the effect of income on quantity (that we don’t want).

Once we understand this problem, it is obvious what the solution to the problem

is. Assuming we have data on income (as well as other factors that shift the

demand curve), we need to explicitly put them in the empirical model that is

estimated. Therefore:

q_m = a + b·p_m + c·I_m + d′Z_m + ε_m ,  m = 1 … M   (3)

In specification (3) we have included income explicitly as a variable (as well as a host of other demand shifters represented by the vector of variables Z_m). Hence, it is no longer in the error term. Therefore, even if income and prices are correlated, ε_m and p_m are no longer correlated and all our coefficients, including b, are consistent.

It is important here to point out that if the world were a lab (like in science) to

find how prices affect quantity, we could run a series of tests where only the

prices were changed and nothing else. But in social sciences, we deal with real

world observational data, hence not affording us the luxury of holding other

things constant. The best we can do is to use statistics: more specifically, by the

use of multivariate regression methods, we can calculate partial correlation

coefficients. The partial correlation coefficient between any two variables is the

relation between them, netting out the effect of all other variables; in other


words, it is as if we are holding all these other variables constant (i.e., ceteris

paribus).

Returning to our specification in (3), the bottom line of our discussion is

1. In addition to price, include all variables that shift the demand curve

(referred to as demand shifters here on), for which you may have data.

2. In case you omit a demand shifting variable, or you do not have data for

any demand shifting variable, ask yourself if what is not included (and

hence enters the error term) is correlated with any of the explanatory

variables (like price, income or any variable in Z). If it isn’t correlated, one

can still estimate all the causal relationships.

A similar argument can also be made for supply function estimation. Hence, in

addition to price, one must include all the variables that shift the supply curve.

Now, let us return to the main issue that is the topic for today’s discussion. We

have the data given in Figure (1). In addition, let us assume, we have data on

incomes and other demand shifters. We also have data on variables that shift the

supply curve (for example cost of inputs: wages rates, rental on capital). What

do we do next? What equation should we estimate? Which curve does the data

“identify”, i.e. what regression should I run: the demand side regression equation

or the supply side one?

To understand this further, let us give superscripts ‘d’: to denote demand and ‘s’

to denote supply. Therefore there are two regression equations:


Demand Equation:

q_m^d = a + b·p_m^d + c·I_m + d′Z_m + ε_m^d ,  m = 1 … M   (3′)

and

Supply Equation:

q_m^s = e + f·p_m^s + g′X_m + ε_m^s ,  m = 1 … M   (4)

In (4), X represents the vector of cost shifter variables.

The answer to our question is simple. We should not run either regression. This

is because the data that we have does not come only from a demand function or a

supply function. It comes from both. Figure 2 illustrates this point:

[Figure 2: a supply curve SS and a demand curve DD in (q, p) space; the observed data point is their equilibrium intersection]

Recall that given market demand and supply curves, the final transacted quantity

and price are determined by market equilibrium. Hence the data we see is not

just governed by the demand or supply function but by the market equilibrium

condition.


Market Equilibrium:

q_m^d = q_m^s = q_m   (5)

and

p_m^d = p_m^s = p_m   (6)

It is q_m and p_m that we observe as data.

Therefore, to characterize the data, we need to look simultaneously at equations (3′), (4), (5) and (6).

If we substitute (6) and (5) in (3’) and (4), we find that the data are determined

by the following system of equations:

q_m = a + b·p_m + c·I_m + d′Z_m + ε_m^d ,  m = 1 … M   (7)

q_m = e + f·p_m + g′X_m + ε_m^s ,  m = 1 … M   (8)

Appreciating that there are two equations that generate the data, how does one

estimate, say the demand equation. To think about how to estimate the demand

equation, let us pretend that we are naïve and decide to estimate equation (7)

ignoring (8).

As pointed out earlier, we need to check if equation (2) holds.

But before we check that, we can solve the simultaneous equations to express the two endogenous variables, p_m and q_m, in terms of the variables that are not determined within the model, i.e., I_m, Z_m, X_m, ε_m^d and ε_m^s. To do so, equate (7) and (8). Therefore,

a + b·p_m + c·I_m + d′Z_m + ε_m^d = e + f·p_m + g′X_m + ε_m^s

(b − f)·p_m = (e − a) − c·I_m − d′Z_m − ε_m^d + g′X_m + ε_m^s

p_m = (e − a)/(b − f) − [c/(b − f)]·I_m − [1/(b − f)]·d′Z_m + [1/(b − f)]·g′X_m + (ε_m^s − ε_m^d)/(b − f)   (9)

Now let us evaluate Cov(p_m, ε_m^d). Notice that in (9), p_m is, among other things, a function of ε_m^d. Therefore Cov(p_m, ε_m^d) cannot be equal to zero. So estimating equation (7) by itself will give us an inconsistent estimator of all the coefficients of the equation.

To explore how the demand function can be estimated, look at Figure 3:

[Figure 3: a fixed demand curve D1 intersected by two supply curves S1 and S2; the intersections give the equilibria A and B]

Assume two markets that have the same demand curve. However the two

markets differ in their supply curves. Market 1 has the supply curve S1 , where as

market 2 has the supply curve S2 . What differences would yield different supply


curves? If the two markets have different input costs (wages, rental), then the

marginal cost of production would differ across both the markets. These

differences would yield supply curves that are to the left or right of each other.

The two supply curves lead to different equilibria: A and B. However, if the

demand curve is fixed, notice that A and B give us two points on the demand

curve. Hence if we can identify markets which differ by cost shifters, we can

identify points on the demand curve. Therefore, factors that lead to a parallel

shift in supply curve “identify” the demand curve. A very similar argument goes

through for the estimation of supply curve. Factors that shift the demand curve

“identify” points on the supply curve.

This intuition from theory is used in the estimation of the demand and supply curves. Formally, one can identify the parameters of the demand function by using cost shifters as "instruments". An instrument, in this context, is a variable that is correlated with price but is not correlated with the error term. It is obvious that if we take any variable that is included as part of the group of variables X_m in equation (4), it will be correlated with p_m (by equation (9)) but will have no correlation with ε_m^d.

For example, let us suppose that ε_m^d contains the preference for good quality potatoes, something that we cannot measure and hence cannot account for explicitly. Now it is obvious that in markets where this preference is high, we will observe a higher price, because higher quality goods cost more. But quality in our data set is unknown. This implies that there will be a positive correlation between p_m and ε_m^d (remember we cannot observe quality). In this case, the rental rate for land, wage rates, etc. will serve as instruments because they


obviously affect the price but are not correlated with the preference for good

quality (It can be argued that higher wage rates lead to higher income. Since

rich people may prefer better quality, our argument does not go through.

However, recall that we have already accounted for income by its inclusion in the

demand equation. Once that route is accounted for, the input costs can only affect

the cost of production: hence prices).

While we will not discuss the exact estimation procedure, the estimator is an

application of Method of Moments estimation technique where we estimate the

sample counterpart of the population moment given by:

E[X_m ε_m^d] = 0

(which implies a zero covariance between X_m and ε_m^d).
0.1 Instrumental Variable Estimation
Suppose we are to estimate the following model of wages:

log(wage) = β0 + β1 edu + β2 ability + ν

Typically, we cannot observe ability. If we run a regression without accounting for ability, OLS will be an inconsistent estimator if ability (in the error term) and edu (years of education) are correlated (which is very likely). One option is to use a proxy variable for ability. For example, scores on standardized IQ tests are often used as a proxy for the innate ability of a person. In case a proxy is available, we run an OLS regression of the model:

log(wage) = β0 + β1 edu + β2′ IQ + ξ

Whether we still have an inconsistency, especially for β̂1, will depend on how good a proxy IQ test scores are for ability.¹
In this section, however, we focus on another problem. Suppose we do not have a proxy variable for ability. What can we do? Can we still estimate, consistently, the parameters of the population model

log(wage) = β0 + β1 edu + ε

where

ε ≡ β2 ability + ν

In this model, as we have pointed out above, Cov(edu, ability) ≠ 0, hence Cov(edu, ε) ≠ 0. We can represent this more generally as:

y = β0 + β1 x + ε

where Cov(x, ε) ≠ 0.

To estimate β1, we will need a new estimator, one that requires a new variable, call it z, such that:

1. Cov(z, ε) = 0: that is, z is orthogonal to the error term ε. This can never be checked and one has to argue it using economic logic.
2. Cov(z, x) ≠ 0: this can and must be checked. To do so, we should check whether the coefficient φ1 in the regression

x = φ0 + φ1 z + ω

is significant; that is, test H0: φ1 = 0 (against the alternative φ1 ≠ 0). This OLS regression is called a first-stage regression.

Such a variable z is called an Instrumental Variable (IV). It is important to point out here that an IV is not a proxy variable. Indeed, whatever is a proxy variable is a bad IV, since the proxy is necessarily correlated with the error term (recall that IQ is correlated with ability, which sits in ε).

¹ Typically, since IQ is not equal to ability, the coefficient β2′ ≠ β2.
This can be generalized further to the case where there are k regressors (including a constant). Let x = (1, x2, ..., xk). Suppose xk is endogenous (correlated with the error term) but the other x's are exogenous (uncorrelated with the error term). Consider the instrumental variable z1. We are going to represent the instrumental variable vector by z = (1, x2, ..., x_{k−1}, z1). Notice that each exogenous variable xi, i ≠ k, is an instrumental variable for itself, since it is uncorrelated with ε (Condition 1) and perfectly correlated with itself (Condition 2). Now the orthogonality between z and ε implies that E[z′ε] = 0.
Starting with the equation

y = xβ + ε

we can transform this equation by pre-multiplying both sides by z′:

z′y = z′xβ + z′ε

Taking unconditional expectations on both sides we get:

E[z′y] = E[z′x]β

(since E[z′ε] = 0).

If E[z′x] is of full rank, then

β = (E[z′x])⁻¹ E[z′y]

The IV estimator is given by

β̂_IV = (n⁻¹ Σ_i zi′xi)⁻¹ (n⁻¹ Σ_i zi′yi)

If we put all the n observations of each vector into one matrix and denote them by capital letters, we get

β̂_IV = (Z′X)⁻¹ Z′Y

Notice that when Z = X, then β̂_IV = β̂_OLS.
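A minimal numerical sketch of the formula (Z′X)⁻¹Z′Y for one endogenous regressor and one instrument; the data-generating process and names are illustrative assumptions.

```python
import numpy as np

# IV estimator beta_hat = (Z'X)^{-1} Z'Y with one endogenous regressor.
rng = np.random.default_rng(8)
n = 5000
z = rng.normal(size=n)
ability = rng.normal(size=n)                          # unobserved, ends up in the error
x = 0.8 * z + 0.6 * ability + rng.normal(size=n)      # endogenous regressor
y = 1.0 + 2.0 * x + 1.5 * ability + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)           # (Z'X)^{-1} Z'Y
print(beta_ols, beta_iv)                              # OLS slope biased upward, IV close to 2
```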

0.1.1 Inference using the IV Estimator:

In a model

y = β0 + β1 x + ε

if the errors are homoscedastic with variance equal to σ², then

V(β̂1,IV) = σ² / (n·σ²_x·ρ²_xz)

where ρ²_xz is the square of the correlation between x and z. Contrast this with the variance of the OLS estimator: V(β̂1,OLS) = σ²/(n·σ²_x). Since ρ²_xz ≤ 1, V(β̂_IV) > V(β̂_OLS). The efficiency of the IV estimator ultimately depends on the correlation between x and z: the higher the correlation, the closer the variance of IV is to that of OLS.
In the data, σ²_x is estimated by the sampling variance of x (σ̂²_x = Σ(xi − x̄)²/n), and ρ²_xz can be estimated by the R²_xz of the regression of x on z (the first-stage regression). To estimate σ², we use the conventional estimator based on residuals, Σ ε̂i²/(n − 2). But it is important to point out that the residuals used are ε̂ = y − β̂0,IV − β̂1,IV x. Thus

Est.V(β̂1,IV) = σ̂² / (n·σ̂²_x·R²_xz)

The strength of the correlation between x and z is critical to the application of the IV estimator. We have seen above that an instrument with very low correlation with x increases the variance of the estimator. In fact, low correlation can be even more harmful if we are not a hundred percent sure that z is uncorrelated with ε. It is easy to see that

β̂1,IV →ᵖ β1 + [corr(z, ε)/corr(z, x)] · (σ_ε/σ_x)

Here, in case corr(z, ε) is small (but not zero), an even smaller corr(z, x) makes the inconsistency larger still. This is called the "weak instruments" problem. In most practical applications, it is recommended that the F statistic for a test of the exclusion of z in the first-stage regression have a value of at least 10 (though the choice of threshold is more of a convention than something based on any deep theory).
So far, we have discussed the case where the number of independent variables (including the one endogenous variable xk) is equal to the number of instruments (all the exogenous xi's and z1). Now, what if we have two exogenous variables z1 and z2 over and above the exogenous xi's, where z1 and z2 are uncorrelated with the error term but correlated with xk? Often, in applied work, people will refer to this situation as "we have one endogenous variable and two instruments" (so the fact that the other exogenous xi's are instruments for themselves is not explicitly pointed out). [REMARK: In class I referred to xk as y2 and considered the case where there was only one exogenous variable z1 and two instruments z2 and z3. I have changed the notation here to be consistent with the notation I followed earlier.]
To deal with this, we state, without proving it here, that the best instrument is a linear combination of all the exogenous variables, including z1 and z2 (in case there are more instruments, we can add them to the linear combination). Therefore we run a regression of xk on all the exogenous variables and use the predicted value from this regression. Hence x̂k is the best instrument for xk, where

x̂k = π̂0 + π̂1 x2 + ... + π̂_{k−2} x_{k−1} + γ̂1 z1 + γ̂2 z2
With multiple instruments, IV is also called the Two Stage Least Squares (2SLS) estimator. The name comes from the fact that the 2SLS estimator is identical to running an OLS regression of y on x̂k and the xi's (i ≠ k). To see this, define the matrix X̂ = (x̂1 x̂2 ... x̂k), where each x̂i is a vector containing n observations. Now x̂1 is a vector of ones (the intercept term) and x̂l = xl, l = 2, ..., k−1, since xl perfectly predicts itself in the first-stage regression. The IV estimator we have derived above implies (put Z = X̂):

β̂_IV = (X̂′X)⁻¹ X̂′Y

The OLS estimator of a regression of y on x̂k and the x̂i's (i ≠ k) (where x̂i = xi for the exogenous variables) is

β̂_2SLS = (X̂′X̂)⁻¹ X̂′Y

To show that they are the same, note that X̂ = P_Z X. Use the property that P_Z is idempotent and symmetric, and the equality follows. The same problem with poor efficiency follows, especially if you realize that we are using X̂ instead of X in the regression: X̂ always has less variance than X, since X is X̂ plus the first-stage error. As we have noted before, when independent variables have lower variance, their coefficients are less precisely estimated.
The case of more than one endogenous variable with several instruments is easy to extend. If there are k regressors, k′ of which are endogenous, there need to be at least k′ instruments (over and above the exogenous x's). In the first stage for each endogenous variable, at least one of these new k′ instruments needs to have a coefficient significantly different from zero.
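A minimal sketch of 2SLS with one endogenous regressor and two instruments, checking numerically that (X̂′X)⁻¹X̂′Y and the second-stage OLS (X̂′X̂)⁻¹X̂′Y coincide; the data and names are illustrative assumptions.

```python
import numpy as np

# 2SLS with one endogenous regressor (xk), one exogenous regressor (x2) and
# two instruments (z1, z2).
rng = np.random.default_rng(9)
n = 5000
x2 = rng.normal(size=n)                        # exogenous regressor
z1, z2 = rng.normal(size=n), rng.normal(size=n)
v = rng.normal(size=n)                         # unobserved confounder
xk = 0.5 * x2 + 0.7 * z1 + 0.4 * z2 + v + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 2.0 * xk + 2.0 * v + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, xk])      # regressors in the structural model
Z = np.column_stack([np.ones(n), x2, z1, z2])  # all exogenous variables

# First stage: project X on the column space of Z (only xk actually changes)
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)          # (X_hat'X)^{-1} X_hat'Y
beta_second_stage = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(beta_2sls, beta_second_stage)            # identical up to rounding
```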

0.1.2 Testing for Endogeneity:

Since IV estimators are less efficient than OLS, it may make sense to explore whether there is an endogeneity problem at all. For this we use the Wu-Hausman test. The intuition behind the test is the following: if we do not have endogeneity, then both the OLS and the IV estimators are consistent, but IV is inefficient. Hence the test checks whether the estimates from OLS and IV are different. Consider the following regression:

y = β1 + β2 x2 + ... + βk xk + ε

where xk is endogenous. Let z1 be an instrument for xk. Let us estimate the first-stage regression

xk = δ1 + δ2 x2 + ... + δ_{k−1} x_{k−1} + θ z1 + v

Since all the independent variables in the first stage are exogenous, xk is correlated with ε only if v is correlated with ε. To test this, we estimate the first stage and calculate the residual

v̂ = xk − δ̂1 − δ̂2 x2 − ... − δ̂_{k−1} x_{k−1} − θ̂ z1

Then we run the following OLS regression

y = β1 + β2 x2 + ... + βk xk + π v̂ + ε

and test H0: π = 0 using a standard t test. If we reject the null, then we conclude that xk is endogenous.
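A minimal sketch of this regression-based test; the data-generating process and names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Regression-based Wu-Hausman test: add the first-stage residual v_hat to the
# structural equation and t-test its coefficient (z1 instruments xk).
rng = np.random.default_rng(10)
n = 2000
x2 = rng.normal(size=n)
z1 = rng.normal(size=n)
u = rng.normal(size=n)                             # source of endogeneity
xk = 0.5 * x2 + 0.8 * z1 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 2.0 * xk + 1.5 * u + rng.normal(size=n)

# First stage: regress xk on all exogenous variables, keep the residual
Z = np.column_stack([np.ones(n), x2, z1])
v_hat = xk - Z @ np.linalg.solve(Z.T @ Z, Z.T @ xk)

# Augmented structural regression: y on (1, x2, xk, v_hat)
W = np.column_stack([np.ones(n), x2, xk, v_hat])
coef = np.linalg.solve(W.T @ W, W.T @ y)
resid = y - W @ coef
s2 = resid @ resid / (n - W.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(W.T @ W)))
t_vhat = coef[-1] / se[-1]
print(t_vhat, 2 * stats.t.sf(abs(t_vhat), df=n - W.shape[1]))   # reject H0: xk exogenous
```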

0.2 Testing for Exogeneity of Instruments

The most popular test is the over-identification test. This can be used to test the exogeneity of instruments. The test relies on having more instruments than endogenous variables (hence the term over-identified; as an aside, when the number of instruments is exactly equal to the number of endogenous variables, we say that the problem is just identified).
As an example, consider the case described above where xk is the one endogenous variable and there are two instruments z1 and z2. To carry out this test, let us use only z1 as an instrument and calculate β̂k,IV; call it β̂k^(1) to denote the fact that we used only z1. Similarly, now let us use only z2 as an instrument and estimate β̂k^(2). If both z1 and z2 are exogenous, then β̂k^(1) and β̂k^(2) are both consistent estimators of βk. Hence they must differ only by sampling error. A test (another Hausman test) can therefore be constructed to check the exogeneity of both z1 and z2 by looking at β̂k^(1) − β̂k^(2). Note that if the difference is statistically different from zero, then we have no choice but to conclude that both instruments are bad (since we have no way of separating out whether only one is bad or both are bad).
The procedure of comparing different IV estimates of the same parameter is an example of testing over-identifying restrictions. The general idea is that we have more instruments than we need to consistently estimate the parameters. In the previous example, we had one more instrument than we needed, and this results in one over-identifying restriction that can be tested. In general we can have q more instruments than we need. When q is two or more, comparing several IV estimates is cumbersome. Instead we can easily compute a test statistic based on the 2SLS residuals. The idea is that if all the instruments are exogenous, the 2SLS residuals should be uncorrelated with the instruments. By construction, the 2SLS residual is orthogonal to a linear combination of the k instruments. Calculate

ε̂_IV = y − β̂1,IV − β̂2,IV x2 − ... − β̂k,IV xk

and regress it on all the exogenous variables. Use the nR² test, where R² is obtained from the regression of this residual on all the exogenous variables. Under the null hypothesis of exogeneity, nR² ~ χ²_q, where q is the number of over-identifying restrictions, that is, the number of extra instrumental variables (the number of instruments, excluding the original exogenous variables, minus the number of endogenous variables).
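A minimal sketch of the nR² over-identification test with one endogenous regressor and two instruments; data and names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Over-identification (nR^2) test: regress the 2SLS residual on all exogenous
# variables; under the null of instrument exogeneity n*R^2 ~ chi2(q),
# q = (# instruments) - (# endogenous regressors).
rng = np.random.default_rng(11)
n = 4000
x2, z1, z2 = rng.normal(size=(3, n))
u = rng.normal(size=n)
xk = 0.6 * z1 + 0.6 * z2 + 0.5 * x2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 2.0 * xk + 1.5 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, xk])
Z = np.column_stack([np.ones(n), x2, z1, z2])
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
e_iv = y - X @ beta_2sls                        # 2SLS residual (uses the actual X)

# Auxiliary regression of the 2SLS residual on all exogenous variables
g = np.linalg.solve(Z.T @ Z, Z.T @ e_iv)
u_aux = e_iv - Z @ g
r2 = 1 - u_aux @ u_aux / ((e_iv - e_iv.mean()) @ (e_iv - e_iv.mean()))
q = 1                                           # 2 instruments - 1 endogenous regressor
print(n * r2, stats.chi2.sf(n * r2, df=q))      # large p-value: fail to reject exogeneity
```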

Limited Dependent Variable
Models

• When y is discrete and takes on a small


number of values, it makes no sense to treat
it as an approximately continuous variable.

LOGIT AND PROBIT MODELS FOR


BINARY RESPONSE

PROBLEMS WITH LPM

• The two most important disadvantages are


that the fitted probabilities can be less than
zero or greater than one and
• the partial effect of any explanatory
variable (appearing in level form) is
constant.
Binary response models.

In a binary response model, interest lies


primarily in the response probability

where we use x to denote the


full set of explanatory variables.

In particular:

P(y = 1|x) = G(β0 + β1x1 + ... + βkxk) = G(β0 + xβ)

where G is a function taking on values strictly between zero and one: 0 < G(z) < 1, for all real numbers z.

• This ensures that the estimated response


probabilities are strictly between zero and
one.
In the logit model, G is the logistic function:

G(z) = exp(z) / [1 + exp(z)]

• This is the cumulative distribution function for a standard logistic random variable.

In the probit model, G is the standard normal cumulative distribution function (cdf):

G(z) = Φ(z)

This choice of G again ensures that the response probability is strictly between zero and one for all values of the parameters and the xj.
• Logit and probit models can be derived from
an underlying latent variable model that
satisfies the classical linear model
assumptions.

• Let y* be an unobserved, or latent, variable determined by

y* = β0 + xβ + e

• y is one if y* > 0, and y is zero if y* ≤ 0.

• Assume e is independent of x and that e


either has the standard logistic distribution
or the standard normal distribution.

• In either case, e is symmetrically distributed


about zero, which means that
1-G(-z) =G(z)
for all real numbers z.
• We want to estimate the effect of xj on the probability of success P(y = 1|x), but this is complicated by the nonlinear nature of G(.)

Define g(z) ≡ dG(z)/dz.

In the logit and probit cases, G(.) is a strictly increasing cdf, and so g(z) > 0 for all z.

Therefore, the partial effect of xj on p(x) = P(y = 1|x) depends on x through the positive quantity g(β0 + xβ), which means that the partial effect always has the same sign as βj.
If, say, x1 is a binary explanatory variable, then the partial effect from changing x1 from zero to one, holding all other variables fixed, is simply

G(β0 + β1 + β2x2 + ... + βkxk) − G(β0 + β2x2 + ... + βkxk)

• this depends on all the values of the other xj.


Maximum Likelihood Estimation of Logit and
Probit Models

Assume that we have a random sample of size n. To obtain the maximum likelihood estimator, conditional on the explanatory variables, we need the density of yi given xi. We can write this as

f(y|xi) = [G(β0 + xiβ)]^y [1 − G(β0 + xiβ)]^(1−y),   y = 0, 1

The log-likelihood function for observation i is a function of the parameters and the data (xi, yi) and is obtained by taking the log:

ℓi(β) = yi log G(β0 + xiβ) + (1 − yi) log[1 − G(β0 + xiβ)]
• The MLE of β, denoted by β̂, maximizes this log-likelihood.

• If G(.) is the standard logistic cdf, then β̂ is the logit estimator; if G(.) is the standard normal cdf, then β̂ is the probit estimator.

• Nevertheless, the general theory of


(conditional) MLE for random samples
implies that, under very general conditions,
the MLE is consistent, asymptotically
normal, and asymptotically efficient.

• Standard t tests though standard errors of


estimators look different.

• Multiple hypotheses: the likelihood ratio statistic is twice the difference in the log-likelihoods,

LR = 2(L_ur − L_r)

where L_ur is the log-likelihood of the unrestricted model and L_r that of the restricted model.

BOUND ANALYSIS
WHAT TO DO WHEN YOU DO NOT HAVE A STRONG CAUSAL IDENTIFICATION TECHNIQUE

MOTIVATION

• In most cases, observed controls are an incomplete proxy for the true omitted variable
• Hence the error term still has components that may correlate with the x variables
• Suppose you cannot think of any causal identification technique (IVs …)
• Can you still get a sense of whether your results are "meaningful"?

COMMON APPROACH

• Explore the sensitivity of your main regressor (in an experiment this will be the "treatment effect") to the inclusion of observed controls
• Coefficient is stable after inclusion of the observed controls (further adding variables does not change the coefficients much): a sign that the omitted variable bias is limited

THE PRINCIPLE

• Using observables to identify the bias from unobservables requires making further assumptions about the covariance properties (more on this later)
• Coefficient movements alone are, however, not a sufficient statistic to calculate the bias

MORE IN DETAIL

• Suppose there is a model
• X is the main variable of interest
• Z is observed
• W contains all unobserved variables
• The objective is to estimate the bias on the coefficient of X because of W

ASSUMPTION

• That is, the relation between X and the un-observables is proportional to the relationship between X and the observables
• The degree of proportionality is given by delta
• Altonji et al. (2005) and Oster (2013) incorporate this idea and derive the bias under alternate deltas

NOT JUST LOOK AT COEFF MOVEMENTS

Illustration: Assume that the model that determines wages (Y) is given by

Y = bX + Z* + W

where X is education, and Z* (gamma × Z) and W are two orthogonal components of "ability".

Assume that the variance of W is much larger than the variance of Z*, but both relate to X in the same way; that is, a regression of X on Z* or W yields the same coefficient.

EXAMPLE

• Consider two situations:
  • Researcher observes Z* (low variance control)
  • Researcher observes W (high variance control)
• If the researcher observes the low variance control covariate, then the coefficient of X will look more stable when the control is included
• Not because the bias is small
• Why? The effect on the coefficient of X of adding another variable is given by the covariance between X and this added variable
• A regression of X on Z* or W yields the same coefficient: same correlation
• So if Variance of W > Variance of Z*, then Cov(X, W) > Cov(X, Z*) to yield the same correlation
• So adding Z* to the regression will not change the coefficient of X by much. It will look stable, but that is because of the low variance (and the consequent low covariance)

• Observation: in this scenario, the higher variance control will tend to explain more of Y: R-squares will be higher
• Evaluate the stability with due importance to how large the variance of the added variables is relative to the unobserved (R²)
• So we need to look at the R-square when we add variables relative to what the R-square would be if we added all variables: Rmax

APPLICATION

• The bound analysis calculates the betas for two parameters that need to be provided by the researcher:
  • Gamma (the degree of proportionality)
  • Rmax squared: the highest R² that one would get if we included the omitted variable that is confounding our analysis
• Simple case (under some assumptions)
• Lesser assumptions give different expressions

APPLICATION

• Oster gives us some guidance after surveying a large number of papers across topics which have strong causal analysis
• The average delta is 0.545 and 86% of values fall within the [0,1] range. The ones where delta is greater than 1 are examples where one has left the most important variables out of the regression (household/paternal education in a wage regression): delta = 1 is a defendable upper bound
• The most conservative Rmax is 1. But 9% to 16% of good studies wouldn't survive at this.
• Alternatively, for R-square of the control regression =

APPLICATIONS

• Using randomized studies: omit variables. See the proportion of studies that survive.
• Get an empirically justified bounding value of delta
• 90% of RCT results survive

TYPICAL OUTPUT
Rules for Sample Selection

Suppose the population model you want to estimate is

𝑦 = 𝑥𝛽 + 𝜀

Here 𝑥 is assumed to be exogenous.

But, say, the sample you have in your dataset is a "peculiar" one. For example, suppose we want to estimate the relation between education (𝑥) and wages (𝑦) for the adult population, but we observe these data only for those who work. For those who don't work, the wages they would command in the labour market are not zero; they are "missing" by the act of not working.

If we run an OLS on the data we have, is the estimated 𝛽 consistent?

To answer this question, the easiest way to think is this. Suppose we denote
𝑠 = 𝑠(. ) = 1 if a person is selected into the data set; = 0 otherwise

Therefore the model we are estimating when we use the data on the selected sample is

𝑠(. )𝑦 = 𝑠(. )𝑥𝛽 + 𝑠(. )𝜀

Notice this is effectively the model we are estimating when we use the observation for
whom the y and x are both observed.

The question then for consistency: is 𝐸 [(𝑠(. )𝑥)(𝑠(. )𝜀)] = 0 ?

Now 𝑠(. ) × 𝑠(. ) = 𝑠(. ) , since s is either zero or 1.

Hence, we can equivalently ask if 𝐸 [(𝑠(. )𝑥)(𝜀)] = 0 ? [S.1]

Since 𝑥 and 𝜀 have zero covariance (exogeneity), the question can be answered by asking if
𝑠(. ) and 𝜀 have zero covariance.

Based on this, selection is of two kinds

1. Exogenous selection/sampling: this is when 𝑠 = 𝑠(𝑥) ; that is we are selecting the


sample based on some value of 𝑥. For example if 𝑥 is level of education, then the sample we
are selecting may be a sample of people with less than primary education.
In this case notice since 𝑥 and 𝜀 have zero covariance, therefore 𝑠(𝑥) and 𝜀 also have zero
covariance, since s depends only on 𝑥. Estimating with this sample therefore gives a
consistent estimator of 𝛽

2. Endogenous selection/sampling: this is when 𝑠 = 𝑠(𝑧) where z is correlated to 𝜀 .


An obvious application is the above. Here we are selecting based on whether a person
works or not: hence 𝑠(𝑤𝑜𝑟𝑘) : if the decision to work is determined by factors that also
correlate with the unobservable variables that determine one’s wage (𝜀), then the
estimation will be inconsistent. In general, it is always true that any sample selected, based
on some value of 𝑦, leads to inconsistent estimation. This is because 𝑠(𝑦) = 𝑠(𝑥𝛽 + 𝜀) .
Notice by construction, the selection rule has 𝜀 in it and will therefore not satisfy condition
[S.1] above.

So while sample selection based on 𝑥 is generally harmless (so we can do, for example,
heterogeneity by slicing our data set on the basis of x), slicing the data on the basis of y
leads to inconsistent estimation. If, for example, y was education expenditure on schooling,
you cannot only consider people who spend some positive amount of money (and throw
out the sample of people with 0 expenditure).
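A minimal simulation of the two selection rules (with illustrative numbers only), showing that slicing on x leaves the slope estimate essentially unchanged while slicing on y distorts it.

```python
import numpy as np

# Exogenous selection (a rule based on x) vs endogenous selection (a rule
# based on y, and hence on eps).  All numbers are illustrative.
rng = np.random.default_rng(12)
n, beta = 200_000, 1.0
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = beta * x + eps

def slope(y, x):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

keep_x = x < 0.5              # exogenous selection: depends only on x
keep_y = y > 0.0              # endogenous selection: depends on y (and hence eps)

print(slope(y, x))                        # full sample: close to 1
print(slope(y[keep_x], x[keep_x]))        # still close to 1
print(slope(y[keep_y], x[keep_y]))        # noticeably biased (attenuated here)
```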
