Trix - Post Midsem Merged
Proof. Recall that the population model is (for any observation ex ante):
$$y = x_1\beta_1 + x_2\beta_2 + \ldots + x_k\beta_k + \varepsilon$$
Let
$$x = \begin{pmatrix} x_1 & x_2 & \ldots & x_k \end{pmatrix}$$
so that
$$y = x\beta + \varepsilon$$
and that $E[\varepsilon|x] = 0$.
Now,
$$\hat{\beta} = (X'X)^{-1}(X'Y)$$
Let
$$x^i \equiv \begin{pmatrix} x_{i1} & x_{i2} & \ldots & x_{ik} \end{pmatrix}$$
denote the row vector of all the variables for the $i$th data point, so that $X$ can be written in terms of the $x^i$.
Therefore:
$$X = \begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^n \end{pmatrix}$$
$$X'X = \begin{pmatrix} x^{1\prime} & x^{2\prime} & \ldots & x^{n\prime} \end{pmatrix}\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^n \end{pmatrix}$$
Or,
$$X'X = \sum_{i=1}^{n} x^{i\prime}x^i$$
Caution: each term $x^{i\prime}x^i$ is not a scalar; it is a $k \times k$ matrix, and the sum is exactly equal to the $k \times k$ matrix $X'X$.
Similarly,
$$X'Y = \sum_{i=1}^{n} x^{i\prime}y_i$$
Therefore,
$$\hat{\beta} = \left(\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\sum_{i=1}^{n} x^{i\prime}y_i\right)$$
or,
$$\hat{\beta} = \left(\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\sum_{i=1}^{n} x^{i\prime}\left(x^i\beta + \varepsilon_i\right)\right)$$
$$= \left(\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\sum_{i=1}^{n} x^{i\prime}x^i\right)\beta + \left(\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\sum_{i=1}^{n} x^{i\prime}\varepsilon_i\right)$$
$$= \beta + \left(\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\sum_{i=1}^{n} x^{i\prime}\varepsilon_i\right)$$
$$= \beta + \left(\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}\varepsilon_i\right)$$
(which is the expression you have seen before: $\hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon$)
Now, by the law of large numbers,
$$\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}x^i \xrightarrow{P} E\left[x'x\right]$$
Similarly,
$$\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}\varepsilon_i \xrightarrow{P} E\left[x'\varepsilon\right]$$
Under some conditions satisfied here, if $\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}x^i \xrightarrow{P} E[x'x]$, then $\left(\frac{1}{n}\sum_{i=1}^{n} x^{i\prime}x^i\right)^{-1} \xrightarrow{P} E[x'x]^{-1}$. Hence:
$$\hat{\beta} \xrightarrow{P} \beta + E\left[x'x\right]^{-1}E\left[x'\varepsilon\right]$$
Now we know that $E[\varepsilon|x] = 0 \Rightarrow E[x'\varepsilon] = 0$. Hence
$$\hat{\beta} \xrightarrow{P} \beta$$
Note that while $E[\varepsilon|x] = 0 \Rightarrow E[x'\varepsilon] = 0$, the latter does not imply the former. So the assumption needed for consistency is even weaker than the one we have made: $E[\varepsilon|x] = 0$ is sufficient, but not necessary, for the consistency of the OLS estimator.
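This consistency result is easy to see numerically. Below is a minimal simulation sketch (assuming Python with NumPy; the data-generating process, coefficient values and sample sizes are purely illustrative, not from the notes) in which $\hat{\beta}$ concentrates around $\beta$ as $n$ grows, even though the errors are not normal.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, -0.5])            # true coefficients (illustrative)

def ols(n):
    # constant plus two regressors; E[eps|x] = 0 holds by construction
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    eps = rng.standard_t(df=5, size=n)        # non-normal errors: consistency still holds
    y = X @ beta + eps
    return np.linalg.solve(X.T @ X, X.T @ y)

for n in (50, 500, 5000, 50000):
    print(n, ols(n))                          # estimates approach beta as n grows
```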
1.1 Inference
We have looked at hypothesis testing under the assumption that $\varepsilon$ follows a normal distribution. These tests are called exact tests. However, even if the error terms are not drawn from a normal distribution, in large samples we can use the Central Limit Theorem to show that the OLS estimators are approximately normally distributed for large $n$.
While we will not show the proof of the asymptotic normality of the OLS estimator, it is important to point out that it requires all the assumptions needed to prove the Gauss-Markov Theorem. It therefore requires the homoscedasticity and zero conditional mean assumptions (so much more than what is required for consistency).
Under these conditions, $\hat{\beta}$ is asymptotically normal, that is,
$$\sqrt{n}\left(\hat{\beta} - \beta\right) \stackrel{a}{\sim} N\left(0, \sigma^2 A^{-1}\right)$$
where
$$A \equiv \operatorname{plim}\frac{1}{n}X'X = E\left[x'x\right]$$
Similar to before, we have to replace $\sigma^2$ by its sample counterpart.
$$\hat{\sigma}^2 = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}$$
is a consistent estimator of $\sigma^2$. Thus
$$\frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \stackrel{a}{\sim} N(0, 1)$$
1.2 LM Test
$$y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon$$
If we want to test
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_q = 0$$
then the LM test requires estimation of only the restricted model:
$$y = \beta_{q+1} x_{q+1} + \ldots + \beta_k x_k + \tilde{\varepsilon}$$
If $\beta_1 = \beta_2 = \ldots = \beta_q = 0$, then $\tilde{\varepsilon}$ will be uncorrelated with each of the excluded variables, that is, $x_i$, $i = 1, \ldots, q$.
Carry out the auxiliary regression
$$\tilde{\varepsilon} = \pi_1 x_1 + \ldots + \pi_k x_k + v$$
Note that all the variables have to be included and not just the $q$ excluded variables. Intuitively, the inclusion of $x_{q+1}, \ldots, x_k$ has no impact on this regression because the errors are orthogonal to the included independent variables. Now, if the null hypothesis is indeed true, the $R^2_{\tilde{\varepsilon}}$ from this regression should be close to zero. Therefore,
$$LM = nR^2_{\tilde{\varepsilon}} \sim \chi^2_q$$
(The variables $x_{q+1}, \ldots, x_k$ need to be included because otherwise $nR^2_{\tilde{\varepsilon}}$ will not follow $\chi^2_q$.) If $LM > c_\alpha$, then we reject the null hypothesis. This test is also called the $nR^2$ test.
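A minimal sketch of the LM test (Python with NumPy and SciPy assumed; the simulated data, variable names and the choice of $q = 2$ excluded regressors are illustrative, not from the notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, q = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])   # constant + x1..x4
beta = np.array([1.0, 0.0, 0.0, 0.5, -1.0])                   # H0: beta1 = beta2 = 0 is true here
y = X @ beta + rng.normal(size=n)

# Step 1: estimate the restricted model (x1 and x2 excluded)
Xr = X[:, [0, 3, 4]]
e_tilde = y - Xr @ np.linalg.solve(Xr.T @ Xr, Xr.T @ y)

# Step 2: auxiliary regression of the restricted residuals on ALL regressors
pi_hat = np.linalg.solve(X.T @ X, X.T @ e_tilde)
fitted = X @ pi_hat
R2 = 1 - np.sum((e_tilde - fitted) ** 2) / np.sum((e_tilde - e_tilde.mean()) ** 2)

LM = n * R2
print(LM, stats.chi2.ppf(0.95, df=q))   # reject H0 if LM exceeds the critical value
```

Note that the auxiliary regression uses all the regressors, exactly as emphasized above.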
Now suppose the errors are non-spherical, that is,
$$E[\varepsilon\varepsilon'|X] = \sigma^2\Omega \equiv \Sigma.$$
Then the variance of the OLS estimator is
$$V(\hat{\beta}|X) = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\,|X\right]$$
$$= (X'X)^{-1}X'E[\varepsilon\varepsilon'|X]X(X'X)^{-1}$$
$$= (X'X)^{-1}X'\Sigma X(X'X)^{-1}$$
$$= \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\Omega X\right)\left(\frac{1}{n}X'X\right)^{-1}$$
An implication of this is that using $\sigma^2(X'X)^{-1}$ as the variance matrix is WRONG! Therefore the usual $F$ and $t$ statistics are also wrong. Hence the Gauss-Markov theorem does not hold under non-spherical errors. It is easy to show [Exercise] that
1. The OLS estimator is consistent.
2. The OLS estimator is asymptotically normal. In particular, $\hat{\beta}_{OLS} \stackrel{a}{\sim} N\left(\beta, A^{-1}BA^{-1}\right)$, where $A \equiv \operatorname{plim}\frac{1}{n}X'X$ and $B \equiv \operatorname{plim}\frac{1}{n}X'\Omega X$.
3. The OLS estimator is NO longer the asymptotically efficient estimator.
2.0.1 1. Heteroscedasticity:
$$\sigma^2\Omega = \sigma^2\begin{pmatrix} \omega_{11} & 0 & \ldots & 0 \\ 0 & \omega_{22} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \omega_{nn} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & 0 & \ldots & 0 \\ 0 & \sigma_2^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \sigma_n^2 \end{pmatrix}$$
Here the variance of the error term for each observation is different because it is a function of the value of the variable $x_1$ for that particular observation. In general, under heteroscedasticity,
$$E[\varepsilon\varepsilon'|X] = \begin{pmatrix} E[\varepsilon_1^2|X] & 0 & \ldots & 0 \\ 0 & E[\varepsilon_2^2|X] & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & E[\varepsilon_n^2|X] \end{pmatrix} \neq \sigma^2 I$$
White suggested that $\frac{1}{n}X'\Sigma X$ can be estimated by:
$$\frac{1}{n}\sum_i \hat{\varepsilon}_i^2\, x^{i\prime}x^i$$
and substituted into the correct variance-covariance matrix for the OLS estimator, i.e., $(X'X)^{-1}X'\Sigma X(X'X)^{-1}$. Therefore the standard errors calculated using this are called White (also White-Huber) robust standard errors or heteroscedasticity-robust standard errors. The $t$ ratios using these standard errors are called robust $t$'s. This is operationalized in STATA using the robust option.
It is important to note here that this estimator of the variance-covariance matrix has good asymptotic properties and one does not need to know the form of heteroscedasticity. However, beware of its rather poor small sample properties. The implication of this is that for smaller samples, this estimator should be used with caution.
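As a rough sketch of how the White robust variance is computed (Python with NumPy assumed; the heteroscedastic data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))     # error variance depends on x1
y = X @ np.array([1.0, 2.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# White robust variance: (X'X)^-1 (sum e_i^2 x_i' x_i) (X'X)^-1
meat = (X * e[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv
# Conventional variance sigma^2 (X'X)^-1, which is wrong under heteroscedasticity
V_naive = (e @ e / (n - X.shape[1])) * XtX_inv

print(np.sqrt(np.diag(V_robust)), np.sqrt(np.diag(V_naive)))
```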
Because of the small sample concerns, and because of the efficiency advantages of knowing the form of heteroscedasticity (we will come to this later when we discuss the Generalized Least Squares estimator), it may become important to test for heteroscedasticity. There are many popular tests. We discuss two of them here. The first is based on an auxiliary regression of the squared errors (in practice, the squared OLS residuals) on the regressors:
$$\varepsilon^2 = \delta_1 + \delta_2 x_2 + \ldots + \delta_k x_k + v$$
A joint test of $\delta_2 = \ldots = \delta_k = 0$ is then a test for heteroscedasticity.
To derive the GLS estimator, consider the transformation obtained by pre-multiplying both sides of the model by $\Omega^{-\frac{1}{2}}$, where $\Omega^{-\frac{1}{2}}\Omega^{-\frac{1}{2}} = \Omega^{-1}$ ($\Omega^{-\frac{1}{2}}$ is symmetric). Therefore:
$$\Omega^{-\frac{1}{2}}Y = \Omega^{-\frac{1}{2}}X\beta + \Omega^{-\frac{1}{2}}\varepsilon$$
Define $\Omega^{-\frac{1}{2}}Y \equiv Y^*$, $\Omega^{-\frac{1}{2}}X \equiv X^*$ and $\Omega^{-\frac{1}{2}}\varepsilon \equiv \varepsilon^*$. In terms of the starred variables, the model is
$$Y^* = X^*\beta + \varepsilon^*$$
What are the properties of $\varepsilon^*$?
$$E[\varepsilon^*|X] = E[\Omega^{-\frac{1}{2}}\varepsilon|X] = \Omega^{-\frac{1}{2}}E[\varepsilon|X] = 0$$
Moreover,
$$Var[\varepsilon^*|X] = E[\varepsilon^*\varepsilon^{*\prime}|X] = E[\Omega^{-\frac{1}{2}}\varepsilon\varepsilon'\Omega^{-\frac{1}{2}}|X] = \Omega^{-\frac{1}{2}}E[\varepsilon\varepsilon'|X]\Omega^{-\frac{1}{2}} = \sigma^2\Omega^{-\frac{1}{2}}\Omega\Omega^{-\frac{1}{2}} = \sigma^2 I$$
Therefore, in this transformed model, the transformed error terms are homoscedastic. The transformed model, therefore, satisfies all the conditions required for the Gauss-Markov Theorem. Hence the OLS estimator of $\beta$ using this transformed model has the lowest variance among all linear unbiased estimators. The OLS estimator of $\beta$ on the transformed model is:
$$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^*$$
Expressing $\hat{\beta}^*$ in terms of the untransformed $X$, $Y$, we get
$$\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y$$
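A minimal sketch of the GLS formula (Python with NumPy assumed; the diagonal $\Omega$ and the weights used are illustrative, and in practice $\Omega$ is rarely known):

```python
import numpy as np

def gls(X, y, Omega):
    """GLS estimator: beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y."""
    Oi = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

# toy heteroscedastic example with a KNOWN diagonal Omega
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
omega_ii = 0.2 + X[:, 1] ** 2                 # known variance weights (illustrative)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(omega_ii)

print(gls(X, y, np.diag(omega_ii)))
```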
Feasible GLS Estimator: The big assumption when discussing the GLS estimator is that you know $\sigma^2\Omega$. In the example it implies that you know $\sigma^2$ and $\Omega$. In most practical cases you don't. It may be, in the example above, that you think that $h$ is a function of all the variables but you are not sure.
In such cases (which will be common), you can model $h$ and use data to estimate the unknown parameters of the model. That is, use $\hat{h}$ instead of $h$. More generally, the Feasible Generalized Least Squares (FGLS) estimator is:
$$\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}Y$$
For example, suppose we model the form of heteroscedasticity as
$$\log\varepsilon^2 = \delta_0 + \delta_1 x_1 + \ldots + \delta_k x_k + e$$
where $e$ has all the nice properties we have assumed for the classical linear regression model. Since we do not have data for $\varepsilon$, it can be shown that we can use $\hat{\varepsilon}$ instead. Thus we estimate the model
$$\log\hat{\varepsilon}^2 = \delta_0 + \delta_1 x_1 + \ldots + \delta_k x_k + e'$$
by OLS and calculate $\widehat{\sigma^2 h(\cdot)} = \exp\left(\hat{\delta}_0 + \hat{\delta}_1 x_1 + \ldots + \hat{\delta}_k x_k\right)$. Note that even if $\sigma^2$ is not separately identified from $\delta_0$, it does not matter: even if we transformed the original equation by $(\sigma^2\Omega)^{-\frac{1}{2}}$ instead of $\Omega^{-\frac{1}{2}}$, nothing substantive would change.
The FGLS estimator is consistent and asymptotically more efficient than the OLS estimator. This has to be true because $\hat{\Omega} \xrightarrow{p} \Omega$. Hence the FGLS estimator converges asymptotically to the GLS estimator. We know that the GLS estimator is more efficient than OLS. Hence, asymptotically, the FGLS estimator is more efficient than the OLS estimator.
In small samples, however, things are not so clear. The FGLS estimator is not unbiased, and it may be less efficient than the OLS estimator. Hence with very small samples, be wary if you are doing FGLS and always report the results using the OLS estimator as a comparison.
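A sketch of the FGLS procedure based on the log-variance model above (Python with NumPy assumed; the true variance function and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
h = np.exp(0.5 * x1)                              # true (unknown) variance function
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(h)

# Step 1: OLS, get residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols

# Step 2: model the variance: regress log(e^2) on the regressors
d = np.linalg.solve(X.T @ X, X.T @ np.log(e ** 2))
h_hat = np.exp(X @ d)                             # fitted variance (up to scale)

# Step 3: weighted least squares with weights 1/h_hat (the FGLS estimator)
W = 1.0 / h_hat
b_fgls = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * y))
print(b_ols, b_fgls)
```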
2. Autocorrelation: When the error terms are correlated across observations, whether one calls it serial correlation or autocorrelation (we will refer to both cases as autocorrelation), the variance covariance matrix has this particular form:
$$E[\varepsilon\varepsilon'|X] = \begin{pmatrix} \sigma_\varepsilon^2 & E[\varepsilon_1\varepsilon_2] & \ldots & E[\varepsilon_1\varepsilon_N] \\ E[\varepsilon_2\varepsilon_1] & \sigma_\varepsilon^2 & \ldots & E[\varepsilon_2\varepsilon_N] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_N\varepsilon_1] & E[\varepsilon_N\varepsilon_2] & \ldots & \sigma_\varepsilon^2 \end{pmatrix}$$
Notice that once you know this, everything we have discussed above with respect to the GLS estimator and the FGLS estimator goes through. So we can use these estimators in this case as well. Here, we will demonstrate what the variance covariance matrix looks like for time series data. Time series data follow stochastic processes (you will learn more about this in your time series course). In time series data, as a convention, observation $i$ is referred to as observation $t$ and $N$ is referred to as $T$. Suppose that the error term in any period $t$ depends on the error term in the previous period $t-1$. We can model this as:
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t \quad (1)$$
This is referred to as an Autoregressive model of order 1 (AR(1)). In this model, $E[u_t] = 0$, $E[u_t^2] = \sigma_u^2$ and $Cov[u_t, u_s] = 0$ if $t \neq s$.
We will now derive the exact form that $E[\varepsilon\varepsilon'|X]$ takes for AR(1). Note that for different time series models, the implied $E[\varepsilon\varepsilon'|X]$ will be different and will have to be derived case by case.
Notice that (1) is a difference equation in $\varepsilon$. By repeated backward substitution, we will get an infinite series. That is,
$$\varepsilon_t = \rho(\rho\varepsilon_{t-2} + u_{t-1}) + u_t = u_t + \rho u_{t-1} + \rho^2\varepsilon_{t-2} = u_t + \rho u_{t-1} + \rho^2(\rho\varepsilon_{t-3} + u_{t-2})$$
and so on, till we get the infinite series
$$\varepsilon_t = u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \ldots = \sum_{j=0}^{\infty}\rho^j u_{t-j}$$
Now,
$$V(\varepsilon_t) = V(u_t) + \rho^2 V(u_{t-1}) + \ldots = \sigma_u^2 + \rho^2\sigma_u^2 + \ldots$$
If $|\rho| < 1$ then this sum converges to $\frac{\sigma_u^2}{1-\rho^2}$. If we denote $V(\varepsilon_t)$ by $\sigma_\varepsilon^2$, then we have shown that
$$\sigma_\varepsilon^2 = \frac{\sigma_u^2}{1-\rho^2}$$
Now think of any term of $E[\varepsilon\varepsilon'|X]$ that differs by 1 time period (for example: $E[\varepsilon_1\varepsilon_2]$, $E[\varepsilon_4\varepsilon_3]$, $E[\varepsilon_{10}\varepsilon_{11}]$). Let us represent all these terms that differ by 1 time period by $E[\varepsilon_{t-1}\varepsilon_t]$ (the terms $E[\varepsilon_t\varepsilon_{t-1}]$ are the same because of symmetry). Now,
$$E[\varepsilon_t\varepsilon_{t-1}] = E[\varepsilon_{t-1}(\rho\varepsilon_{t-1} + u_t)] = \rho E\left[\varepsilon_{t-1}^2\right] + E[\varepsilon_{t-1}u_t]$$
Now, since $u_t$ is uncorrelated with all the other $u$'s including all its lagged values, $u_t$ is uncorrelated with $\varepsilon_{t-1}$. Hence,
$$E[\varepsilon_t\varepsilon_{t-1}] = \rho E\left[\varepsilon_{t-1}^2\right] = \rho\sigma_\varepsilon^2 = \rho\frac{\sigma_u^2}{1-\rho^2}$$
In general, it is easy to show that [EXERCISE]
$$E[\varepsilon_t\varepsilon_{t-s}] = \rho^s\frac{\sigma_u^2}{1-\rho^2}$$
Therefore,
$$E[\varepsilon\varepsilon'|X] = \sigma_\varepsilon^2\begin{pmatrix} 1 & \rho & \ldots & \rho^{T-1} \\ \rho & 1 & \ldots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \ldots & 1 \end{pmatrix} = \frac{\sigma_u^2}{1-\rho^2}\begin{pmatrix} 1 & \rho & \ldots & \rho^{T-1} \\ \rho & 1 & \ldots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \ldots & 1 \end{pmatrix}$$
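As a small numerical illustration, this AR(1) variance covariance matrix can be constructed directly from $\rho$ and $\sigma_u^2$ (Python with NumPy assumed; the parameter values are illustrative). The resulting matrix could, for instance, be passed to the GLS sketch given earlier.

```python
import numpy as np

def ar1_error_cov(rho, sigma_u, T):
    """E[ee'|X] for AR(1) errors: (sigma_u^2 / (1 - rho^2)) * rho^{|t-s|}."""
    t = np.arange(T)
    return sigma_u ** 2 / (1 - rho ** 2) * rho ** np.abs(t[:, None] - t[None, :])

print(ar1_error_cov(rho=0.6, sigma_u=1.0, T=5))
```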
In addition to the GLS estimator, the OLS estimator can be used (with the correct variance covariance matrix). The OLS estimator is still consistent and unbiased.
If one does not want to assume any particular time series model, one can use the estimated OLS residuals to calculate a general variance covariance matrix. The correction to the variance covariance matrix using estimated residuals (just like robust standard errors) is called the Newey-West correction. This involves estimating $E[\varepsilon_i\varepsilon_j]$.
In the presence of auto-correlation, there is at least one particular instance
where the OLS estimator is biased and inconsistent. This is the case of lagged
dependent variable. To illustrate this, suppose we are estimating the following
models using OLS:
yt = βyt−1 + εt (2)
where,
εt = ρεt−1 + ut (3)
Let all the assumptions we made about ut stand. Assume in addition that
|β| < 1. For OLS estimation of equation (2) to be consistent, the following must
be true:
Cov (yt−1 , εt ) = 0
But this is not the case.
Cov (yt−1 , εt ) = Cov(βyt−2 + εt−1, εt )
= Cov(εt−1, εt )
Since Cov(εt−1, εt ) ̸= 0, the OLS estimator is inconsistent. The presence
of lagged dependent variables, by itself, does not pose any inconsistency. It is
the autocorrelated error terms together with the lagged dependent variable that
causes inconsistency. However, in a large number of cases, even when there is
autocorrelation, the inconsistency of OLS estimator in the presence of lagged
dependent variable comes because of specification error. To see this, let us
return to the above example.
Substitute (3) in (2). We get
yt = βyt−1 + ρεt−1 + ut
Now, we know that
yt−1 = βyt−2 + εt−1 (4)
Substituting for εt−1 in (3) by using (4), we get
yt = βyt−1 + ρyt−1 − βρyt−2 + ut
Therefore,
yt = (β + ρ) yt−1 − βρyt−2 + ut
Now ut is uncorrelated with yt−1 and yt−2 . Hence if we regress yt on yt−1
and yt−2 , we will get consistent estimates of (β + ρ) and −βρ. The inconsistency
came about because if we have AR(1) correlated error terms, the main estima-
tion equation necessarily needs to include yt−2 . So in running the regression
of yt on just yt−1 , we had misspecified the equation and not thought enough
of the dynamic structure of the error terms. So to avoid inconsistency due to
autocorrelation, researchers often include a rich dynamic structure if they need
to include lagged dependent variables. If we assume a particular AR process for
the error terms, it is easy to calculate how many lags of the dependent variables
need to be included as regressors if we want to include any lagged dependent
variable as a regressor [Exercise :Try with ε following AR(2)].
As with heteroscedastic error terms, asymptotic corrections do not work well for small samples. So it is often desirable to elicit the form of the time series process of the error term. We will discuss three tests in this note (a small numerical sketch follows the list of tests). The first two tests are specifically for testing whether the error terms follow an AR(1) process.
1. t test for AR(1) in the case of strictly exogenous regressors¹. In the model
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t$$
$u_t$ has the usual nice properties. We are testing the null hypothesis $H_0: \rho = 0$. To test this hypothesis, we follow these two steps:
a. Run an OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain residuals $\hat{\varepsilon}_t$ for all $t$.
b. Run an OLS regression of $\hat{\varepsilon}_t$ on $\hat{\varepsilon}_{t-1}$. Conduct a $t$ test to check the significance of the coefficient on $\hat{\varepsilon}_{t-1}$.
¹Strictly exogenous regressors are such that $x_t$ is uncorrelated with $\varepsilon_s$, $s = 1, \ldots, T$. Contrast this with weak exogeneity, which requires only that $x_t$ is uncorrelated with $\varepsilon_t$.
2. Durbin-Watson Test. This test requires all the classical assumptions under the null hypothesis, including normality of the error terms. The DW statistic is:
$$DW = \frac{\sum_{t=2}^{T}\left(\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1}\right)^2}{\sum_{t=1}^{T}\hat{\varepsilon}_t^2}$$
3. Breusch-Godfrey Test: Suppose we are testing for AR(q), that is, in the following model,
$$\varepsilon_t = \rho_1\varepsilon_{t-1} + \ldots + \rho_q\varepsilon_{t-q} + e_t$$
we are testing $H_0: \rho_1 = \ldots = \rho_q = 0$. To test this hypothesis we need to follow two steps:
a. Run an OLS regression of $y_t$ on the $x_t$'s. Calculate the residuals $\hat{\varepsilon}_t$.
b. Run a regression of $\hat{\varepsilon}_t$ on $x_t$ and $\hat{\varepsilon}_{t-1}, \hat{\varepsilon}_{t-2}, \ldots, \hat{\varepsilon}_{t-q}$ and conduct a joint test that the coefficients of all the lagged $\hat{\varepsilon}$ terms are equal to zero.
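A minimal sketch of the first two tests on simulated AR(1) errors (Python with NumPy assumed; the data-generating process and parameter values are illustrative). It computes the OLS residuals, the regression of $\hat{\varepsilon}_t$ on $\hat{\varepsilon}_{t-1}$ with its $t$ ratio, and the DW statistic.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300
x = rng.normal(size=T)

# generate AR(1) errors: e_t = rho * e_{t-1} + u_t
rho, e = 0.5, np.zeros(T)
u = rng.normal(size=T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + u[t]
y = 1.0 + 2.0 * x + e

# Step a: OLS of y on x, keep residuals
X = np.column_stack([np.ones(T), x])
res = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

# Test 1: regress res_t on res_{t-1} and form the t ratio on the slope
r1, r0 = res[1:], res[:-1]
rho_hat = (r0 @ r1) / (r0 @ r0)
s2 = np.sum((r1 - rho_hat * r0) ** 2) / (len(r1) - 1)
t_stat = rho_hat / np.sqrt(s2 / (r0 @ r0))

# Test 2: Durbin-Watson statistic (roughly 2 * (1 - rho_hat))
DW = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)
print(rho_hat, t_stat, DW)
```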
Since $Cov(u_t, u_{t-1}) = 0$, the error term of the quasi-differenced model is serially uncorrelated. Therefore, if $\rho$ were known, we could calculate $x_t - \rho x_{t-1} \equiv \tilde{x}_t$ and $y_t - \rho y_{t-1} \equiv \tilde{y}_t$ and run an OLS regression. However, there is still one catch. Notice that we are dealing with observations $t = 2, \ldots, T$. What should be done with the first observation? We could omit the first observation. If you do so, you would be embarking on the Cochrane-Orcutt procedure. Suppose we want to keep the first observation; the model for that observation is still the original model since it cannot be transformed. The variance of the error term for this first observation is $\sigma_\varepsilon^2$, which is different from $\sigma_u^2$, the variance of the error term for the transformed model. We therefore transform the model for the first observation in a way such that the variance of its error term is also $\sigma_u^2$. To do so, we multiply both sides of the original model (for the first observation ONLY) by $\sqrt{1-\rho^2}$, so that for $t = 1$:
$$\sqrt{1-\rho^2}\,y_1 = \beta_0\sqrt{1-\rho^2} + \beta_1\sqrt{1-\rho^2}\,x_1 + \sqrt{1-\rho^2}\,\varepsilon_1$$
Now the variance of $\sqrt{1-\rho^2}\,\varepsilon_1$ is $\sigma_u^2$. This procedure, which keeps the first observation and transforms it differently from the rest of the observations, is called the Prais-Winsten procedure. In general, both methods are equivalent to doing GLS estimation, with a small difference in the variance covariance matrix (the first observation).
As with any GLS procedure, this presupposes knowledge of $\rho$. But $\rho$ is usually unknown. So we follow a procedure where we first run a simple OLS to get the $\hat{\varepsilon}_t$'s. Then we regress $\hat{\varepsilon}_t$ on $\hat{\varepsilon}_{t-1}$ and estimate $\hat{\rho}$. This FGLS procedure does not have good small sample properties. So it is recommended that we follow an iterative procedure, where we start with any $\hat{\rho}$ (say the one implied by a simple OLS). Then we carry out an FGLS procedure to estimate the model. The estimated parameters will imply an updated set of residuals. These residuals can be regressed on their first order lags to get another value for $\hat{\rho}$. We then use this again to carry out another FGLS, and repeat the process on and on till the values of $\hat{\rho}$ in subsequent iterations converge.
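A minimal sketch of this iterative procedure in its Cochrane-Orcutt form, dropping the first observation (Python with NumPy assumed; the function and variable names are mine):

```python
import numpy as np

def cochrane_orcutt(y, X, n_iter=20, tol=1e-8):
    """Iterative Cochrane-Orcutt: alternate between estimating rho from lagged
    residuals and re-estimating beta on quasi-differenced data."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)       # start from plain OLS
    rho = 0.0
    for _ in range(n_iter):
        res = y - X @ beta
        rho_new = (res[:-1] @ res[1:]) / (res[:-1] @ res[:-1])
        # quasi-difference the data, dropping the first observation
        ys = y[1:] - rho_new * y[:-1]
        Xs = X[1:] - rho_new * X[:-1]
        beta = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho
```

The Prais-Winsten variant would additionally keep a transformed first observation, multiplied by $\sqrt{1-\hat{\rho}^2}$ as described above.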
0.1.3 Measurement Error
There are almost no known finite sample properties of estimators when there is
measurement error. All the known results are asymptotic.
Assume, for simplicity:
y ∗ = βx∗ + ε
Here * denotes the true values of y and x. Instead, let y and x be the observed
values which are potentially measured with error.
$$y = y^* + v$$
In this case, when we run the regression of $y$ on $x\,(= x^*)$, we are actually estimating the regression model:
$$y = \beta x^* + \varepsilon + v$$
Now suppose instead that it is $x$ that is measured with error:
$$x = x^* + u$$
$$y^* = \beta x^* + \varepsilon = \beta(x - u) + \varepsilon = \beta x + w$$
where $w = \varepsilon - \beta u$. Since $Cov(x, w) = -\beta\sigma_u^2 \neq 0$, OLS of $y^*$ on the observed $x$ is inconsistent; under the classical errors-in-variables assumptions,
$$\operatorname{plim}\hat{\beta} = \beta\,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_u^2}$$
Notice that as $\sigma_u^2 \to \infty$, $\hat{\beta} \to 0$. This is called attenuation. In this case the $\hat{\beta}$ values are muted and close to zero.
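A small simulation sketch of attenuation (Python with NumPy assumed; the true coefficient and the grid of measurement error variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta = 100_000, 2.0
x_star = rng.normal(size=n)
y = beta * x_star + rng.normal(size=n)

for sigma_u in (0.0, 0.5, 1.0, 2.0):
    x = x_star + sigma_u * rng.normal(size=n)      # mismeasured regressor
    b_hat = (x @ y) / (x @ x)
    # attenuation: plim b_hat = beta * var(x*) / (var(x*) + sigma_u^2)
    print(sigma_u, b_hat, beta * 1.0 / (1.0 + sigma_u ** 2))
```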
In a multivariate regression, if
$$Y = X^*\beta + \varepsilon$$
and the observed value is
$$X = X^* + u$$
then, defining
$$\hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X^*\beta + \varepsilon)$$
$$= (X'X)^{-1}X'X^*\beta + (X'X)^{-1}X'\varepsilon$$
$$= (X'X)^{-1}X'(X - u)\beta + (X'X)^{-1}X'\varepsilon$$
$$= (X'X)^{-1}X'X\beta - (X'X)^{-1}X'u\beta + (X'X)^{-1}X'\varepsilon$$
$$= \beta - (X'X)^{-1}X'u\beta + (X'X)^{-1}X'\varepsilon$$
Now
$$\operatorname{plim}\left(\frac{X'X}{n}\right) = Q^* + \Sigma_u$$
$$\operatorname{plim}\left(\frac{X'u}{n}\right) = \Sigma_u$$
$$\operatorname{plim}\left(\frac{X'\varepsilon}{n}\right) = 0$$
So
$$\operatorname{plim}\hat{\beta} = \beta - \left(Q^* + \Sigma_u\right)^{-1}\Sigma_u\,\beta$$
Now what if only one variable has measurement error? This implies:
$$\Sigma_u = \begin{pmatrix} \sigma_{u_1}^2 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & 0 \end{pmatrix}$$
that is, everything except the first diagonal element is equal to zero. Denoting the elements of the inverse of $Q^*$ by $q^{ij}$, we get
$$\operatorname{plim}\hat{\beta}_1 = \frac{\beta_1}{1 + \sigma_{u_1}^2 q^{11}}$$
So we have the usual attenuation bias. However, for the other parameters:
$$\operatorname{plim}\hat{\beta}_k = \beta_k - \beta_1\left(\frac{\sigma_{u_1}^2 q^{k1}}{1 + \sigma_{u_1}^2 q^{k1}}\right)$$
ABHIROOP MUKHOPADHYAY, PLANNING UNIT, ISI (DELHI) MAY 2017
Identification of Demand
and Supply Curves
In this lecture, we explore how to "extract" the demand and supply curves from market data on prices and quantities. To begin with, let us illustrate what data from markets would look like. Figure 1 provides a scatter plot of prices and quantities across markets.
[FIGURE 1: scatter plot of market-level quantities and prices]
Each data point plotted represents a market¹; that is, each point refers to the total quantity purchased/sold in the market (total quantity) and the median price in that market.
Let us begin by assuming that the demand curve for the good is
$$q = a + bp \quad (1)$$
where $q$ is the quantity bought and $p$ is the price of the good. The usual shape of a demand curve leads us to expect: (i) $a \geq 0$ and (ii) $b \leq 0$. (i) is implied by the fact that quantity cannot be negative.
The first temptation may be to use the knowledge of bivariate linear regression and estimate
$$q_m = a + b\,p_m + \varepsilon_m, \quad m = 1 \ldots M \quad (1')$$
Here, the subscript $m$ refers to a particular market. Recall the data we have is on markets and there are $M$ of them. $q_m$ and $p_m$ therefore refer to the quantity and price in market $m$ (more on this later).
This model can be estimated using simple Ordinary Least Squares estimation. For this estimator to be consistent (loosely speaking²), it must be the case that the covariance between $p_m$ and $\varepsilon_m$ is zero.
¹ Consumer purchases and quantities have been aggregated up to the level of a state region. A state is usually divided into state regions based on climate.
² An estimator is consistent if it converges, in probability, to the true population parameter.
First, the law of demand says that ceteris paribus, when the price of a good rises, the quantity demanded falls. Ceteris paribus, that is, everything else being the same, imposes some structure on our empirical exercise. For example, consider two markets: A and B. Let us suppose that, in the data, the price is higher in market A than in market B ($p_A > p_B$). In addition, it may also be true that the average income is higher in market A than in market B. Denote the average income in each market by $I$. Therefore $I_A > I_B$. Therefore, it is likely that price and income are positively correlated in the data.
Let us now think about what equation (1') signifies. In particular, what is the interpretation of the error term $\varepsilon_m$? $\varepsilon_m$ captures all the omitted variables that affect the demand for the good (except price, since that has already been explicitly taken into account). For example, the error term will include income, preference for the good (Bengalis may love potatoes more than Punjabis) and all the other omitted factors that shift demand.
Now bring the two ideas together. $\varepsilon_m$ contains information on income (which we have chosen to omit). But we also know that income is positively correlated with price. Hence the error term $\varepsilon_m$ will necessarily be positively correlated with $p_m$.
To understand this further, let us try to interpret the coefficient $b$ in the demand equation: it is meant to capture how quantity changes when price changes and nothing else changes. Note that the omitted variable income affects both price and quantity. So it may lead to a higher price and a higher quantity demanded. Therefore the coefficient $b$ that we will estimate using equation (1') will not be the pure causal effect of price on quantity. It is also picking up the fact that, in the data, if we consider markets with higher prices, they also have higher income. So the estimate of $b$ will pick up not only the effect of price on quantity (that we want) but also the effect of income on quantity (that we don't want).
Once we understand this problem, it is obvious what the solution to the problem is. Assuming we have data on income (as well as other factors that shift the demand curve), we need to explicitly put them in the empirical model that is estimated. Therefore, we augment (1') with income $I_m$ and the other demand shifters $Z_m$ as regressors. Since income is now explicitly included, it is no longer in the error term. Therefore even if income and prices are correlated, $p_m$ and $\varepsilon_m$ are no longer correlated and all our coefficients, including $b$, are consistent.
It is important here to point out that if the world were a lab (like in science), then to find how prices affect quantity we could run a series of tests where only the prices were changed and nothing else. But in social sciences we deal with real world observational data, which does not afford us the luxury of holding other things constant. The best we can do is to use statistics: more specifically, partial correlation coefficients. The partial correlation coefficient between any two variables is the relation between them, netting out the effect of all other variables; in other words, it is as if we are holding all these other variables constant (i.e., ceteris paribus).
1. In addition to price, include all variables that shift the demand curve (referred to as demand shifters from here on), for which you may have data.
2. In case you omit a demand shifting variable, or you do not have data for any demand shifting variable, ask yourself if what is not included (and hence enters the error term) is correlated with any of the explanatory variables (like price, income or any variable in Z). If it isn't correlated, one need not worry.
A similar argument can also be made for supply function estimation. Hence, in addition to price, one must include all the variables that shift the supply curve.
Now, let us return to the main issue that is the topic for today's discussion. We have the data given in Figure (1). In addition, let us assume we have data on incomes and other demand shifters. We also have data on variables that shift the supply curve (for example, costs of inputs: wage rates, rental on capital). What do we do next? What equation should we estimate? Which curve does the data "identify", i.e. what regression should I run: the demand side regression equation or the supply side regression equation?
To understand this further, let us use the superscript 'd' to denote demand and 's' to denote supply.
Demand Equation:
$$q_m^d = a^d + b^d p_m + c^d I_m + d^{d\prime} Z_m + \varepsilon_m^d, \quad m = 1 \ldots M \quad (3')$$
and
Supply Equation:
$$q_m^s = a^s + b^s p_m + c^{s\prime} X_m + \varepsilon_m^s, \quad m = 1 \ldots M \quad (4)$$
The answer to our question is simple. We should not run either regression. This is because the data that we have does not come only from a demand function or only from a supply function.
[FIGURE 2: demand (DD) and supply (SS) curves; the observed data point is their equilibrium]
Recall that given market demand and supply curves, the final transacted quantity
and price are determined by market equilibrium. Hence the data we see is not
just governed by the demand or supply function but by the market equilibrium
condition.
Market Equilibrium:
$$q_m^d = q_m^s = q_m \quad (5)$$
and
$$p_m^d = p_m^s = p_m \quad (6)$$
If we substitute (6) and (5) in (3') and (4), we find that the data are determined by
$$q_m = a^d + b^d p_m + c^d I_m + d^{d\prime} Z_m + \varepsilon_m^d, \quad m = 1 \ldots M \quad (7)$$
$$q_m = a^s + b^s p_m + c^{s\prime} X_m + \varepsilon_m^s, \quad m = 1 \ldots M \quad (8)$$
Appreciating that there are two equations that generate the data, how does one estimate, say, the demand equation? To think about how to estimate the demand equation, let us pretend that we are naïve and decide to estimate equation (7) ignoring (8).
But before we check that, we can solve the simultaneous equations to express the two endogenous variables, $q_m$ and $p_m$, in terms of the variables that are not determined within the model, i.e., $I_m$, $Z_m$, $X_m$, $\varepsilon_m^d$ and $\varepsilon_m^s$. To do so, equate (7) and (8). Therefore,
$$a^d + b^d p_m + c^d I_m + d^{d\prime} Z_m + \varepsilon_m^d = a^s + b^s p_m + c^{s\prime} X_m + \varepsilon_m^s$$
$$p_m = \frac{a^s - a^d}{b^d - b^s} - \frac{c^d}{b^d - b^s}I_m - \frac{d^{d\prime}}{b^d - b^s}Z_m - \frac{1}{b^d - b^s}\varepsilon_m^d + \frac{c^{s\prime}}{b^d - b^s}X_m + \frac{1}{b^d - b^s}\varepsilon_m^s \quad (9)$$
Now let us evaluate $Cov\left(p_m, \varepsilon_m^d\right)$. Notice that in (9), $p_m$ is, among other things, a function of $\varepsilon_m^d$. Therefore $Cov\left(p_m, \varepsilon_m^d\right)$ cannot be equal to zero. So estimating equation (7) by itself will give us an inconsistent estimator of all the coefficients of the equation.
[FIGURE 3: a common demand curve D1 with two supply curves S1 and S2, giving equilibria A and B]
Assume two markets that have the same demand curve. However the two
markets differ in their supply curves. Market 1 has the supply curve S1, whereas market 2 has the supply curve S2. What differences would yield different supply
curves? If the two markets have different input costs (wages, rental), then the
marginal cost of production would differ across both the markets. These
differences would yield supply curves that are to the left or right of each other.
The two supply curves lead to different equilibria: A and B. However, if the
demand curve is fixed, notice that A and B give us two points on the demand
curve. Hence if we can identify markets which differ by cost shifters, we can
identify points on the demand curve. Therefore, factors that lead to a parallel
shift in supply curve “identify” the demand curve. A very similar argument goes
through for the estimation of the supply curve: factors that shift the demand curve identify the supply curve.
This intuition from theory is used in the estimation of demand and supply curves. Formally, one can identify the parameters of the demand function by using an instrumental variable: a variable that is correlated with price but is not correlated with the error term. It is obvious to see that if we take any variable that is included as a part of the group of variables $X_m$ in equation (4), it will be correlated with $p_m$ (by equation (9)) but will have no correlation with $\varepsilon_m^d$.
For example, let us suppose that $\varepsilon_m^d$ contains preference for good quality potatoes, something that we cannot measure and hence cannot account for explicitly. Now it is obvious that in markets where this preference is high, we will observe a higher price because higher quality goods cost more. But quality in our data set is unknown. This implies that there will be a positive correlation between $p_m$ and $\varepsilon_m^d$ (remember we cannot observe quality). In this case, the rental rate for land, wage rates, etc. will serve as instruments because they obviously affect the price but are not correlated with the preference for good quality. (It can be argued that higher wage rates lead to higher incomes, and since rich people may prefer better quality, our argument does not go through. However, recall that we have already accounted for income by its inclusion in the demand equation. Once that route is accounted for, the input costs can only affect demand through price.)
While we will not discuss the exact estimation procedure here, the estimator is an instrumental variable estimator, and it relies on the moment condition
$$E\left[X_m'\varepsilon_m^d\right] = 0$$
(which implies a zero covariance between $X_m$ and $\varepsilon_m^d$).
0.1 Instrumental Variable Estimation
Suppose we are to estimate the following model of wages:
$$wage = \beta_0 + \beta_1 edu + \varepsilon$$
where
$$\varepsilon \equiv \beta_2\,ability + \nu$$
In this model, as we have pointed out above, $Cov(edu, ability) \neq 0$; hence $Cov(edu, \varepsilon) \neq 0$. We can represent this more generally as:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Suppose there is a variable $z$ that satisfies the following two conditions:
1. $Cov(z, \varepsilon) = 0$: that is, $z$ is orthogonal to the error term $\varepsilon$; this can never be checked and one has to argue it using economic logic.
2. $Cov(z, x) \neq 0$: this can and must be checked. To do so, we should check whether the coefficient $\varphi_1$ in the regression
$$x = \varphi_0 + \varphi_1 z + \omega$$
is significant; that is, test $H_0: \varphi_1 = 0$ (against the alternative $\varphi_1 \neq 0$). This OLS regression is called a first stage regression.
Such a variable $z$ is called an Instrumental Variable ($IV$). It is important to point out here that an $IV$ is not a proxy variable. Indeed, whatever is a proxy variable is a bad $IV$, since the proxy is necessarily correlated with the error term (recall that IQ is correlated with $\nu$).¹
¹Typically, since IQ is not equal to ability, the coefficient obtained on IQ is not equal to $\beta_2$.
This can be generalized further to the case where there are $k$ regressors (including a constant): let $x = (1, x_2, \ldots, x_k)$. Suppose $x_k$ is endogenous (correlated with the error term) but the other $x$'s are exogenous (uncorrelated with the error term). Consider the instrumental variable $z_1$. We are going to represent the instrumental variable vector by $z = (1, x_2, \ldots, x_{k-1}, z_1)$. Notice that each exogenous variable $x_i$, $i \neq k$, is an instrumental variable for itself, since it is uncorrelated with $\varepsilon$ (Condition 1) and perfectly correlated with itself (Condition 2). Now the orthogonality between $z$ and $\varepsilon$ implies that $E[z'\varepsilon] = 0$.
Starting with the equation
$$y = x\beta + \varepsilon$$
pre-multiply by $z'$ and take expectations; since $E[z'\varepsilon] = 0$,
$$E[z'y] = E[z'x]\beta$$
If we put all the $n$ observations of each vector in one matrix and denote them by their capital letters, we get
$$\hat{\beta}_{IV} = (Z'X)^{-1}Z'Y$$
Notice that when $Z = X$, then $\hat{\beta}_{IV} = \hat{\beta}_{OLS}$.
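A minimal sketch of the IV formula on simulated data (Python with NumPy assumed; the education/ability story is mimicked with an illustrative data-generating process, and the coefficient values are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
ability = rng.normal(size=n)
z = rng.normal(size=n)                        # instrument: correlated with x, not with ability
x = 1.0 + 0.8 * z + 0.6 * ability + rng.normal(size=n)     # endogenous regressor
y = 2.0 + 1.5 * x + 1.0 * ability + rng.normal(size=n)     # ability sits in the error term

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent: picks up the ability channel
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)      # beta_IV = (Z'X)^{-1} Z'Y
print(b_ols, b_iv)                            # the IV slope should be near 1.5
```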
The asymptotic variance of the IV estimator of the slope (in the single regressor case) is
$$V\left(\hat{\beta}_{1,IV}\right) = \frac{\sigma^2}{n\,\sigma_x^2\,\rho_{xz}^2}$$
where $\rho_{xz}^2$ is the square of the correlation between $x$ and $z$. Contrast this with the variance of the OLS estimator: $V\left(\hat{\beta}_{1,OLS}\right) = \frac{\sigma^2}{n\,\sigma_x^2}$. Since $\rho_{xz}^2 \leq 1$, $V\left(\hat{\beta}_{IV}\right) > V\left(\hat{\beta}_{OLS}\right)$. The efficiency of the $IV$ estimator depends ultimately on the correlation between $x$ and $z$: the higher the correlation, the closer the variance of $IV$ to that of $OLS$.
In the data, $\sigma_x^2$ is estimated by the sampling variance of $x$ $\left(\hat{\sigma}_x^2 = \frac{\sum(x_i - \bar{x})^2}{n}\right)$, and $\rho_{xz}^2$ can be estimated by the $R_{xz}^2$ of the regression of $x$ on $z$ (the first stage regression). To estimate $\sigma^2$, we use the conventional estimator based on the residuals $\left(\frac{1}{n-2}\sum\hat{\varepsilon}^2\right)$. But it is important to point out here that the residuals used are $\hat{\varepsilon} = y - \hat{\beta}_0^{IV} - \hat{\beta}_1^{IV}x$. Thus
$$Est.V\left(\hat{\beta}_{1,IV}\right) = \frac{\hat{\sigma}^2}{n\,\hat{\sigma}_x^2\,R_{xz}^2}$$
The strength of the correlation between $x$ and $z$ is critical to the application of the IV estimator. We have seen above that an instrument with very low correlation with $x$ increases the variance of the estimator. In fact, low correlation can be even more harmful if we are not a hundred percent sure that $z$ is uncorrelated with $\varepsilon$. It is easy to see that
$$\hat{\beta}_{1,IV} \xrightarrow{p} \beta_1 + \frac{corr(z, \varepsilon)}{corr(z, x)}\cdot\frac{\sigma_\varepsilon}{\sigma_x}$$
Here, in case $corr(z, \varepsilon)$ is small (but not zero), an even smaller $corr(z, x)$ makes the inconsistency still larger. This is called the "weak instruments" problem. In most practical applications, it is recommended that the $F$ statistic for a test of the exclusion of $z$ in the first stage regression has a value of at least 10 (though the choice of threshold is more of a convention than something based on any deep theory).
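A rough sketch of the first stage $F$ statistic for the exclusion of the instruments (Python with NumPy assumed; the function name and arguments are mine):

```python
import numpy as np

def first_stage_F(x, Z_excl, Z_incl):
    """F statistic for excluding the instruments Z_excl from the first-stage
    regression of x on [Z_incl, Z_excl]. Z_incl holds the included exogenous
    regressors (e.g. a column of ones); both Z arguments must be 2-D."""
    n = len(x)
    Zr = Z_incl                                    # restricted: instruments excluded
    Zu = np.column_stack([Z_incl, Z_excl])         # unrestricted first stage
    rss_r = np.sum((x - Zr @ np.linalg.lstsq(Zr, x, rcond=None)[0]) ** 2)
    rss_u = np.sum((x - Zu @ np.linalg.lstsq(Zu, x, rcond=None)[0]) ** 2)
    q = Z_excl.shape[1]
    return ((rss_r - rss_u) / q) / (rss_u / (n - Zu.shape[1]))

# usage sketch: F = first_stage_F(x, np.column_stack([z1, z2]), np.ones((len(x), 1)))
```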
So far, we have discussed the case where the number of independent variables (including the one endogenous variable $x_k$) is equal to the number of instruments (all the exogenous $x_i$'s and $z_1$). Now, what if we have two exogenous variables $z_1$ and $z_2$ over and above the exogenous $x_i$'s, where $z_1$ and $z_2$ are uncorrelated with the error term but correlated with $x_k$? Often, in applied work, people will refer to this situation as: "We have one endogenous variable and two instruments" (so the fact that the other exogenous $x_i$ are instruments for themselves is not explicitly pointed out). [REMARK: In the class I referred to $x_k$ as $y_2$ and considered the case where there was only one exogenous variable $z_1$ and two instruments $z_2$ and $z_3$. I have changed the notation here to be consistent with the notation I followed earlier.]
To deal with this, we state, without proving it here, that the best instrument is a linear combination of all the exogenous variables, including $z_1$ and $z_2$ (in case there are more instruments, we can add them to the linear combination). Therefore we run a regression of $x_k$ on all the exogenous variables and use the predicted value from this regression. Hence, $\hat{x}_k$ is the best instrument for $x_k$, where
$$\hat{x}_k = \hat{\pi}_0 + \hat{\pi}_1 x_2 + \ldots + \hat{\pi}_{k-1}x_{k-1} + \hat{\theta}_1 z_1 + \hat{\theta}_2 z_2$$
With multiple instruments, IV is also called the Two Stage Least Squares estimator (2SLS). The name comes from the fact that the 2SLS estimator is identical to running an OLS regression of $y$ on $\hat{x}_k$ and the $x_i$'s ($i \neq k$). To see this, define the matrix $\hat{X} = (\hat{x}_1\;\hat{x}_2\;\ldots\;\hat{x}_k)$, where each $\hat{x}_i$ is a vector containing $n$ observations. Now $\hat{x}_1$ is a vector of ones (the intercept term) and $\hat{x}_l = x_l$, $l = 2, \ldots, k-1$, since $x_l$ will perfectly predict itself in the first stage regression. The IV estimator we have derived above implies (put $Z = \hat{X}$):
$$\hat{\beta}_{IV} = (\hat{X}'X)^{-1}\hat{X}'Y$$
The OLS estimator of a regression of $y$ on $\hat{x}_k$ and the $\hat{x}_i$'s ($i \neq k$) (where $\hat{x}_i = x_i$ for the exogenous variables) is
$$\hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y$$
To show that they are the same, note that $\hat{X} = P_Z X$. Use the property that $P_Z$ is idempotent and symmetric and the equality follows. The same problem with poor efficiency follows, especially if you realize that we are using $\hat{X}$ instead of $X$ in the regression: $\hat{X}$ always has less variance than $X$, since $X$ includes $\hat{X}$ plus the first stage error term. As we have noted before, when independent variables have lower variance, their coefficients are less precisely estimated.
The case of more than one endogenous variable and more instruments is easy to extend. If there are $k$ regressors, $k'$ of which are endogenous, there need to be at least $k'$ instruments (over and above the exogenous $x$'s). In the first stage for each endogenous variable, at least one of these new $k'$ instruments needs to be significantly different from zero.
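A minimal sketch of 2SLS computed literally in two stages (Python with NumPy assumed; the simulated data-generating process with one endogenous regressor and two instruments is illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)    # two excluded instruments
x2 = rng.normal(size=n)                             # included exogenous regressor
e = rng.normal(size=n)
xk = 0.5 * x2 + 0.7 * z1 + 0.4 * z2 + 0.8 * e + rng.normal(size=n)   # endogenous
y = 1.0 + 0.5 * x2 + 2.0 * xk + e

X = np.column_stack([np.ones(n), x2, xk])           # regressors (last one endogenous)
Z = np.column_stack([np.ones(n), x2, z1, z2])       # all exogenous variables

# First stage: fitted values of every column of X given Z (exogenous columns predict themselves)
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
# Second stage: OLS of y on the fitted regressors
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(b_2sls)                                       # coefficient on xk should be near 2.0
```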
We can also test whether the suspect regressor is in fact endogenous. Consider the model
$$y = \beta_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon$$
where $x_k$ is endogenous. Let $z_1$ be an instrument for $x_k$. Let us estimate the first stage regression
$$x_k = \delta_1 + \delta_2 x_2 + \ldots + \delta_{k-1}x_{k-1} + \theta z_1 + v$$
Since all the independent variables in the first stage are exogenous, $x_k$ is correlated with $\varepsilon$ if and only if $v$ is correlated with $\varepsilon$. To test this, we estimate the first stage and calculate the residual
$$\hat{v} = x_k - \hat{\delta}_1 - \hat{\delta}_2 x_2 - \ldots - \hat{\delta}_{k-1}x_{k-1} - \hat{\theta}z_1$$
We then include this residual in the original equation,
$$y = \beta_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \pi\hat{v} + \varepsilon$$
estimate it by OLS, and conduct a $t$ test of $H_0: \pi = 0$; rejecting the null is evidence that $x_k$ is endogenous.
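A minimal sketch of this regression-based endogeneity test (Python with NumPy assumed; the simulated data mirror the 2SLS sketch above, and the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5_000
z1, x2, e = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
xk = 0.5 * x2 + 0.8 * z1 + 0.7 * e + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 0.5 * x2 + 2.0 * xk + e

# First stage: regress xk on the exogenous variables and the instrument, keep residuals
Z = np.column_stack([np.ones(n), x2, z1])
v_hat = xk - Z @ np.linalg.lstsq(Z, xk, rcond=None)[0]

# Augmented regression: a significant coefficient on v_hat signals endogeneity of xk
Xa = np.column_stack([np.ones(n), x2, xk, v_hat])
b = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)
res = y - Xa @ b
V = res @ res / (n - Xa.shape[1]) * np.linalg.inv(Xa.T @ Xa)
print(b[-1] / np.sqrt(V[-1, -1]))        # t ratio on v_hat
```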
The most popular test is the over-identification test. This can be used to test the exogeneity of instruments. This test relies on having more instruments than endogenous variables (hence the term over-identified; as an aside, when the number of instruments is exactly equal to the number of endogenous variables, we say that the problem is just identified).
As an example, consider the case described above where $x_k$ is the one endogenous variable and there are two instruments $z_1$ and $z_2$. To carry out this test, let us use only $z_1$ as an instrument and calculate $\hat{\beta}_k^{IV}$. Let us call it $\hat{\beta}_k^1$ to denote the fact that we used only $z_1$. Similarly, now let us use only $z_2$ as an instrument and estimate $\hat{\beta}_k^2$. If both $z_1$ and $z_2$ are exogenous, then $\hat{\beta}_k^1$ and $\hat{\beta}_k^2$ are both consistent estimators of $\beta_k$. Hence they must differ only by sampling error, and a test (another Hausman test) can be constructed to check the exogeneity of both $z_1$ and $z_2$ by looking at $\hat{\beta}_k^1 - \hat{\beta}_k^2$. Note that if the difference is statistically different from zero, then we have no choice but to conclude that both instruments are bad (since we have no way to separate out whether only one is bad or both are bad).
The procedure of comparing different IV estimates of the same parameter is an example of testing over-identifying restrictions. The general idea is that we have more instruments than we need to consistently estimate the parameters. In the previous example, we had one more instrument than we needed, and this results in one over-identifying restriction that can be tested. In general we can have $q$ more instruments than we need. When $q$ is two or more, comparing several IV estimates is cumbersome. Instead, we can easily compute a test statistic based on the 2SLS residuals. The idea is that if all the instruments are exogenous, the 2SLS residuals should be uncorrelated with the instruments. (By construction, the 2SLS residual is orthogonal to one particular linear combination of the instruments.) Calculate
$$\hat{\varepsilon}_{IV} = y - \hat{\beta}_1^{IV} - \hat{\beta}_2^{IV}x_2 - \ldots - \hat{\beta}_k^{IV}x_k$$
and regress it on all the exogenous variables. Use the $nR^2$ test, where $R^2$ is obtained from the regression of this residual on all the exogenous variables. Under the null hypothesis of exogeneity, $nR^2 \sim \chi_q^2$, where $q$ is the number of over-identifying restrictions, that is, the number of extra instrumental variables (the number of instruments, excluding the original exogenous variables, minus the number of endogenous variables).
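A minimal sketch of the $nR^2$ over-identification test (Python with NumPy and SciPy assumed; the simulated data have one endogenous regressor and two instruments, so $q = 1$, and all names and values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 5_000
z1, z2, x2, e = (rng.normal(size=n) for _ in range(4))
xk = 0.5 * x2 + 0.7 * z1 + 0.4 * z2 + 0.8 * e + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 2.0 * xk + e

X = np.column_stack([np.ones(n), x2, xk])
Z = np.column_stack([np.ones(n), x2, z1, z2])

# 2SLS
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
e_iv = y - X @ b                                   # 2SLS residuals use the ORIGINAL X

# Regress the residuals on all exogenous variables and form nR^2
fit = Z @ np.linalg.lstsq(Z, e_iv, rcond=None)[0]
R2 = 1 - np.sum((e_iv - fit) ** 2) / np.sum((e_iv - e_iv.mean()) ** 2)
q = 1
print(n * R2, stats.chi2.ppf(0.95, q))             # reject exogeneity if nR^2 exceeds the critical value
```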
Limited Dependent Variable Models
MOTIVATION
$$Y = bX + Z^* + W$$
where $X$ is education, and $Z^*$ ($\gamma \times Z$) and $W$ are two orthogonal components of "ability". Assume that the variance of $W$ is much larger than the variance of $Z^*$, but both relate to $X$ in the same way; that is, a regression of $X$ on $Z^*$ or on $W$ yields the same coefficient.
• That is, the relation between $X$ and the un-observables is proportional to the relationship between $X$ and the observables.
• The degree of proportionality is given by a parameter delta.
• Altonji et al. (2005) and Oster (2013) incorporate this idea and derive the bias under alternate deltas.
APPLICATION
• Oster gives us some guidance after surveying a large number of papers across topics which have strong causal analysis.
• The average delta is 0.545 and 86% of values fall within the [0, 1] range. The ones where delta is greater than 1 are examples where the most important variables were left outside the regression (household/paternal education in a wage regression): delta = 1 is a defendable upper bound.
• The most conservative Rmax is 1. But 9% to 16% of good studies wouldn't survive at this.
• Alternatively, Rmax can be set relative to the R square of the control regression.
• Another approach uses randomized studies: omit variables and see the proportion of studies that survive, to get an empirically justified bounding value. About 90% of RCT results survive.
Rules for Sample Selection
𝑦 = 𝑥𝛽 + 𝜀
But, say, the sample you have in your dataset is a "peculiar" one. For example, suppose we want to estimate the relation between education (𝑥) and wages (𝑦) for the adult population, but we observe this data only for those who work. For those who don't work, the wages they would command in the labour market are not zero - they are "missing" by the act of not working.
Can we still estimate 𝛽 consistently using only the selected sample? To answer this question, the easiest way to think about it is this. Suppose we denote
𝑠 = 𝑠(. ) = 1 if a person is selected in the data set; = 0 otherwise
Therefore the model we are estimating when we use the data on the selected sample is
𝑠𝑦 = 𝑠𝑥𝛽 + 𝑠𝜀
Notice this is effectively the model we are estimating when we use the observations for which 𝑦 and 𝑥 are both observed.
Since 𝑥 and 𝜀 have zero covariance (exogeneity), the question can be answered by asking whether 𝑠(. ) and 𝜀 have zero covariance.
If selection depends only on 𝑥, then 𝑠(𝑥) is uncorrelated with 𝜀 (by exogeneity); but if selection depends on 𝑦, then 𝑠 depends on 𝜀 itself. So while sample selection based on 𝑥 is generally harmless (so we can, for example, examine heterogeneity by slicing our data set on the basis of 𝑥), slicing the data on the basis of 𝑦 leads to inconsistent estimation. If, for example, 𝑦 were education expenditure on schooling, you cannot consider only the people who spend some positive amount of money (and throw out the sample of people with 0 expenditure).