Chapter 9
Multicollinearity
9.1

9.1.1 Extreme Collinearity
The standard OLS assumption that $(x_{i1}, x_{i2}, \ldots, x_{ik})$ not be linearly related means that, for any $(c_1, c_2, \ldots, c_{k-1})$,
$$ x_{ik} \neq c_1 x_{i1} + c_2 x_{i2} + \cdots + c_{k-1} x_{i,k-1}. \qquad (9.1) $$
In matrix notation, the assumption is that $x_k \neq X_1 c$ for any vector $c$, where
$$ X_1 = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1,k-1} \\ x_{21} & x_{22} & \cdots & x_{2,k-1} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{n,k-1} \end{pmatrix}, \qquad x_k = \begin{pmatrix} x_{1k} \\ x_{2k} \\ \vdots \\ x_{nk} \end{pmatrix}, \qquad (9.2) $$
and
$$ c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{k-1} \end{pmatrix}. \qquad (9.3) $$
Extreme collinearity occurs when this assumption fails, that is, when $x_k = X_1 c$ exactly for some $c$.
9.1.2 Near Extreme Collinearity
Near extreme collinearity occurs when
$$ x_{ik} = c_1 x_{i1} + c_2 x_{i2} + \cdots + c_{k-1} x_{i,k-1} + v_i \qquad (9.4) $$
or, in matrix form,
$$ x_k = X_1 c + v, \qquad (9.5) $$
where the $v$'s are small relative to the $x$'s. If we think of the $v$'s as random variables they will have small variance (and zero mean if $X$ includes a column of ones).
A convenient way to algebraically express the degree of collinearity is the sample correlation between $x_{ik}$ and $w_i = c_1 x_{i1} + c_2 x_{i2} + \cdots + c_{k-1} x_{i,k-1}$, namely
$$ r_{x,w} = \frac{\mathrm{cov}(x_{ik}, w_i)}{\sqrt{\mathrm{var}(x_{ik})\,\mathrm{var}(w_i)}} = \frac{\mathrm{cov}(w_i + v_i, w_i)}{\sqrt{\mathrm{var}(w_i + v_i)\,\mathrm{var}(w_i)}}. \qquad (9.6) $$
Clearly, as the variance of vi grows small, this value will go to unity. For
near extreme collinearity, we are talking about a high correlation between
at least one variable and some linear combination of the others.
We are interested not only in the possibility of high correlation between $x_{ik}$ and the linear combination $w_i = c_1 x_{i1} + c_2 x_{i2} + \cdots + c_{k-1} x_{i,k-1}$ for a particular choice of $c$, but for any choice of the coefficients. The choice which maximizes the correlation is the choice which minimizes $\sum_{i=1}^n v_i^2$, that is, least squares. Thus $\hat{c} = (X_1'X_1)^{-1}X_1'x_k$ and $\hat{w} = X_1\hat{c}$, and
$$ (r_{x,\hat{w}})^2 = R_k^2 \qquad (9.7) $$
is the $R^2$ of this regression and hence the maximal squared correlation between $x_{ik}$ and the other $x$'s.
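As a quick illustration of this idea, the sketch below computes $R_k^2$ by running the auxiliary least-squares regression of $x_k$ on the other regressors and squaring the correlation between $x_k$ and the fitted combination $\hat{w}$. The data-generating process, sample size, and variable names are hypothetical choices made only for the example; only numpy is assumed.

```python
# Sketch: R_k^2 as the maximal squared correlation between x_k and a linear
# combination of the other regressors, obtained from an auxiliary OLS regression.
# Simulated data for illustration; numpy only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X1 = np.column_stack([np.ones(n), z, rng.normal(size=n)])   # the "other" regressors (with constant)
x_k = 0.8 * z + 0.1 * rng.normal(size=n)                    # x_k nearly collinear with z

# Least-squares choice of c maximizes the correlation: c_hat = (X1'X1)^{-1} X1'x_k
c_hat = np.linalg.solve(X1.T @ X1, X1.T @ x_k)
w_hat = X1 @ c_hat                                          # fitted linear combination

# R_k^2 is the squared correlation between x_k and w_hat
R2_k = np.corrcoef(x_k, w_hat)[0, 1] ** 2
print(f"R_k^2 = {R2_k:.4f}")                                # close to 1 under near collinearity
```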
9.1.3 Absence of Collinearity

At the other extreme, suppose
$$ R_k^2 = 0. \qquad (9.8) $$
That is, $x_{ik}$ has zero correlation with all linear combinations of the other variables, for any ordering of the variables. In terms of the matrices, this requires $\hat{c} = 0$, or
$$ X_1'x_k = 0. \qquad (9.9) $$
9.2 Consequences of Multicollinearity

9.2.1 For OLS Estimation
We will first examine the effect of $x_k$ being highly collinear with the other regressors upon the estimate $b_k$. Now let
$$ x_k = X_1 c + v \qquad (9.10) $$
and write
$$ X'X = X'(X_1 : x_k) = (X'X_1 : X'x_k). \qquad (9.11) $$
By Cramer's rule, the OLS estimate of the last coefficient is
$$ b_k = \frac{|X'X_1 : X'y|}{|X'X|}. \qquad (9.12) $$
As $v \to 0$ and the collinearity becomes exact, the columns of $X$ become linearly dependent, so both the numerator and the denominator go to zero and
$$ b_k = \frac{0}{0}, \qquad (9.13) $$
which is indeterminate.
Similarly, the variance of $b_k$ involves the inverse
$$ (X'X)^{-1} = \frac{1}{|X'X|}\,\mathrm{adj}(X'X) = \frac{1}{|X'X|}\,\mathrm{adj}\!\begin{pmatrix} X_1'X_1 & X_1'x_k \\ x_k'X_1 & x_k'x_k \end{pmatrix}, \qquad (9.14) $$
so the $(k,k)$ element gives
$$ \mathrm{var}(b_k) = \sigma^2\,[(X'X)^{-1}]_{kk} = \frac{\sigma^2}{|X'X|}\,\mathrm{cof}(k,k) = \frac{\sigma^2}{|X'X|}\,|X_1'X_1|. \qquad (9.15) $$
As $v \to 0$, $|X'X| \to 0$ while $|X_1'X_1|$ does not, so
$$ \mathrm{var}(b_k) = \frac{\sigma^2\,|X_1'X_1|}{|X'X|} \to \infty. \qquad (9.16) $$
To see what this variance depends on, partition the regressor matrix as $(\iota : X)$, where $\iota$ is the column of ones and $X$ now denotes the remaining $k-1$ columns. Then
$$ (\iota : X)'(\iota : X) = \begin{pmatrix} \iota'\iota & \iota'X \\ X'\iota & X'X \end{pmatrix} = \begin{pmatrix} n & n\bar{x}' \\ n\bar{x} & X'X \end{pmatrix}. \qquad (9.17) $$
Using the results for the inverse of a partitioned matrix, we find that the lower right-hand $(k-1) \times (k-1)$ submatrix of the inverse is given by
$$ \big(X'X - X'\iota(\iota'\iota)^{-1}\iota'X\big)^{-1} = (X'X - n\bar{x}\bar{x}')^{-1} = \big[(X - \iota\bar{x}')'(X - \iota\bar{x}')\big]^{-1} = (X^{*\prime}X^*)^{-1}, $$
where $X^* = X - \iota\bar{x}'$ denotes the matrix of demeaned regressors.
We now write this demeaned matrix as $X = (X_1 : x_k)$, where $x_k$ is the last, $(k-1)$th, column. Then
$$ X'X = \begin{pmatrix} X_1'X_1 & X_1'x_k \\ x_k'X_1 & x_k'x_k \end{pmatrix}. \qquad (9.18) $$
Using the results for partitioned inverses again, the $(k,k)$ element of the inverse $(X'X)^{-1}$ is given by
$$ [(X'X)^{-1}]_{kk} = \big(x_k'x_k - x_k'X_1(X_1'X_1)^{-1}X_1'x_k\big)^{-1} = (e_k'e_k)^{-1} = \frac{1}{SSE_k} = \frac{1}{SST_k(1-R_k^2)}, $$
where $e_k = (I_n - X_1(X_1'X_1)^{-1}X_1')x_k$ are the OLS residuals from regressing the demeaned $x_k$'s on the other variables, and $SSE_k$, $SST_k$, and $R_k^2$ are the corresponding statistics for this regression. Thus we find
$$ \mathrm{var}(b_k) = \sigma^2[(X'X)^{-1}]_{kk} = \frac{\sigma^2}{x_k'x_k(1-R_k^2)} = \frac{\sigma^2}{\sum_{i=1}^n (x_{ik}-\bar{x}_k)^2\,(1-R_k^2)} = \frac{\sigma^2}{n\,\frac{1}{n}\sum_{i=1}^n (x_{ik}-\bar{x}_k)^2\,(1-R_k^2)}, \qquad (9.19) $$
and the variance of $b_k$ increases with the noise $\sigma^2$ and the correlation $R_k^2$ of $x_k$ with the other variables, and decreases with the sample size $n$ and the signal $\frac{1}{n}\sum_{i=1}^n (x_{ik}-\bar{x}_k)^2$.
Thus, as the collinearity becomes more and more extreme, the variance of $b_k$ (and hence its standard error) grows without bound,
$$ \mathrm{var}(b_k) \to \infty, \qquad (9.20) $$
while the usual estimator of the error variance is unaffected, since
$$ s^2 = \frac{e'e}{n-k} \sim \sigma^2\,\frac{\chi^2_{n-k}}{n-k} \qquad (9.21) $$
continues to hold regardless of the degree of collinearity.
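A small simulation can make the variance formula in (9.19) concrete. The sketch below uses an arbitrary data-generating process and illustrative values of a correlation parameter rho, and compares the exact OLS variance $\sigma^2[(X'X)^{-1}]_{kk}$ with $\sigma^2/(\sum_i (x_{ik}-\bar{x}_k)^2(1-R_k^2))$ as the collinearity is made more extreme.

```python
# Sketch: var(b_k) = sigma^2 / (sum (x_ik - xbar_k)^2 (1 - R_k^2)) blows up as the
# collinearity between x_k and the other regressors becomes extreme.
# Simulated data, numpy only; the DGP is purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 100, 1.0
z = rng.normal(size=n)

for rho in (0.0, 0.9, 0.99, 0.999):
    # x_k built so that its correlation with z is approximately rho
    x_k = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), z, x_k])
    var_bk = sigma2 * np.linalg.inv(X.T @ X)[2, 2]          # exact sigma^2 [(X'X)^{-1}]_kk

    # auxiliary regression of x_k on the other columns gives R_k^2
    X1 = X[:, :2]
    e_k = x_k - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ x_k)
    R2_k = 1 - e_k @ e_k / np.sum((x_k - x_k.mean())**2)
    var_formula = sigma2 / (np.sum((x_k - x_k.mean())**2) * (1 - R2_k))
    print(f"rho={rho:6.3f}  R_k^2={R2_k:.4f}  var(b_k)={var_bk:.4f}  formula={var_formula:.4f}")
```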
9.2.2 For Inferences

Provided collinearity does not become extreme, we still have the ratios $(\hat{\beta}_j - \beta_j)/\sqrt{s^2 d^{jj}} \sim t_{n-k}$, where $d^{jj} = [(X'X)^{-1}]_{jj}$. Although $\hat{\beta}_j$ becomes highly variable as collinearity increases, $d^{jj}$ grows correspondingly larger, thereby compensating. Thus under $H_0: \beta_j = \beta_j^0$, we find $(\hat{\beta}_j - \beta_j^0)/\sqrt{s^2 d^{jj}} \sim t_{n-k}$, as is the case in the absence of collinearity. This result, that the null distribution of the ratios is not impacted as collinearity becomes more extreme, seems not to be fully appreciated in most texts.
The inferential price extracted by collinearity is a loss of power. Suppose the true value is $\beta_j = \beta_j^1 \neq \beta_j^0$ and we test $H_0: \beta_j = \beta_j^0$. We can write
$$ \frac{\hat{\beta}_j - \beta_j^0}{\sqrt{s^2 d^{jj}}} = \frac{\hat{\beta}_j - \beta_j^1}{\sqrt{s^2 d^{jj}}} + \frac{\beta_j^1 - \beta_j^0}{\sqrt{s^2 d^{jj}}}. \qquad (9.22) $$
The first term will continue to follow a $t_{n-k}$ distribution, as argued in the previous paragraph, as collinearity becomes more extreme. However, the second term, which represents a shift term, will grow smaller as collinearity becomes more extreme and $d^{jj}$ becomes larger. Thus we are less likely to shift the statistic into the tail of the ostensible null distribution, and hence less likely to reject the null hypothesis. Formally, $(\hat{\beta}_j - \beta_j^0)/\sqrt{s^2 d^{jj}}$ will have a noncentral $t$ distribution, but the noncentrality parameter will become smaller and smaller as collinearity becomes more extreme.
Alternatively, the inferential impact can be seen through the impact on confidence intervals. Using the standard approach discussed in the previous chapter, we have $[\hat{\beta}_j - a\sqrt{s^2 d^{jj}},\ \hat{\beta}_j + a\sqrt{s^2 d^{jj}}]$ as the 95% confidence interval, where $a$ is the critical value for a .025 tail. Note that as collinearity becomes more extreme and $d^{jj}$ becomes larger, the width of the interval becomes larger as well. Thus we see that the estimates are consistent with a larger and larger set of null hypotheses as the collinearity strengthens. In the limit the interval is consistent with any null hypothesis and we have zero power.
We should emphasize that collinearity does not always cause problems. The shift term in (9.22), for the coefficient on $x_k$, can be written
$$ \frac{\beta_k^1 - \beta_k^0}{\sqrt{s^2 d^{kk}}} = \sqrt{n}\,(\beta_k^1 - \beta_k^0)\Big/\sqrt{\sigma^2\Big/\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^n (x_{ik}-\bar{x}_k)^2\,(1-R_k^2)\Big)}, $$
which clearly depends on factors other than the degree of collinearity. The size of the shift increases with the sample size $n$, the difference between the null and alternative hypotheses $(\beta_k^1 - \beta_k^0)$, and the signal-to-noise ratio $\frac{1}{n}\sum_{i=1}^n (x_{ik}-\bar{x}_k)^2/\sigma^2$. The important question is not whether collinearity is present or extreme, but whether it is extreme enough to eliminate the power of our test. This is also a phenomenon that does not seem to be fully appreciated or well-enough advertised in most texts.
We can easily tell when collinearity is not causing a problem: the coefficients are significant, or we reject the null hypothesis under consideration. Only if apparently important variables are insignificantly different from zero, or have the wrong sign, should we consider the possibility that collinearity is causing problems.
9.2.3 For Prediction

The accuracy of prediction depends on the overall fit of the regression, as summarized by
$$ \frac{s^2}{\widehat{\mathrm{var}}(y)} = \frac{e'e/(n-k)}{(y-\bar{y})'(y-\bar{y})/(n-1)}, \qquad (9.23) $$
rather than on the precision of the individual coefficient estimates, so collinearity need not degrade predictions.
9.2.4 An Illustrative Example

Consider a regression of consumption $C_t$ on income $Y_t$ and wealth $W_t$ for the following ten observations:

$Y_t$: 80, 100, 120, 140, 160, 180, 200, 220, 240, 260
$W_t$: 810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686

The fitted equation is
$$ \hat{C}_t = \hat{\beta}_1 + \hat{\beta}_2 Y_t + \hat{\beta}_3 W_t, \qquad (9.24) $$
with estimated standard errors (given in parentheses beneath the estimates) of $(0.823)$ for the income coefficient and $(0.081)$ for the wealth coefficient. Summary statistics for the regression are: $SSR = 324.446$, $s^2 = 46.35$, and $R^2 = 0.9635$.
The coefficient estimate for the marginal propensity to consume seems to be a reasonable value; however, it is not significantly different from either zero or one. And the coefficient on wealth is negative, which is not consistent with economic theory. Wrong signs and insignificant coefficient estimates on a priori important variables are the classic symptoms of collinearity. As an indicator of the possible collinearity, the squared correlation between $Y_t$ and $W_t$ is .9979, which suggests near extreme collinearity among the explanatory variables.
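The same symptom pattern is easy to reproduce with simulated data of this general consumption-income-wealth form (the sketch below does not use the data of the example above). With wealth generated as nearly proportional to income, the fitted coefficients come out individually insignificant or wrong-signed even though the overall fit is excellent; all numbers in the sketch are hypothetical.

```python
# Sketch of the classic collinearity symptoms: simulate consumption data in which
# income and wealth are nearly proportional, then fit C on (1, Y, W) by OLS.
# Simulated data only; numpy only.
import numpy as np

rng = np.random.default_rng(2)
n = 10
Y = np.linspace(80, 260, n)                                  # income
W = 10 * Y + rng.normal(scale=30, size=n)                    # wealth roughly 10 * income
C = 25 + 0.6 * Y + 0.01 * W + rng.normal(scale=6, size=n)    # "true" consumption relation

X = np.column_stack([np.ones(n), Y, W])
b = np.linalg.solve(X.T @ X, X.T @ C)
e = C - X @ b
s2 = e @ e / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
R2 = 1 - e @ e / np.sum((C - C.mean())**2)

for name, bj, sej in zip(["const", "income", "wealth"], b, se):
    print(f"{name:7s} b = {bj:8.4f}  se = {sej:.4f}  t = {bj/sej:6.2f}")
print(f"R^2 = {R2:.4f}, corr(Y, W)^2 = {np.corrcoef(Y, W)[0, 1]**2:.4f}")
```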
9.3 Detecting Multicollinearity

9.3.1 When Is Multicollinearity a Problem?

9.3.2 Zero-Order Correlations

If there are only two explanatory variables (plus a constant, so $k = 3$), collinearity can be detected from the zero-order (simple) correlation between them,
$$ r_{23} = \frac{\sum_{t=1}^n (x_{t2}-\bar{x}_2)(x_{t3}-\bar{x}_3)}{\sqrt{\sum_{t=1}^n (x_{t2}-\bar{x}_2)^2\,\sum_{t=1}^n (x_{t3}-\bar{x}_3)^2}}; \qquad (9.25) $$
a value close to one in absolute value indicates near extreme collinearity.
9.3.3 Partial Regressions

In the general case ($k > 3$), even if all the zero-order correlations are small, we may still have a problem. For while $x_1$ may not be strongly linearly related to any single $x_i$ ($i \neq 1$), it may be very highly correlated with some linear combination of the $x$'s.
To test for this possibility, we should run regressions of each $x_i$ on all the other $x$'s. If collinearity is present, then one of these regressions will have a high $R^2$ (relative to the $R^2$ of the complete regression).
For example, when $k = 4$ and
$$ y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + \beta_4 x_{t4} + u_t \qquad (9.26) $$
is the regression, then collinearity is indicated when one of the partial regressions
$$ x_{t2} = \alpha_1 + \alpha_3 x_{t3} + \alpha_4 x_{t4} \qquad (9.27) $$
$$ x_{t3} = \gamma_1 + \gamma_2 x_{t2} + \gamma_4 x_{t4} \qquad (9.28) $$
(or the analogous regression for $x_{t4}$) has a high $R^2$.
9.3.4 The F Test

The importance of each partial regression can be assessed with the usual $F$ statistic for that regression; for the regression of $x_{tj}$ on the other explanatory variables,
$$ F_j = \frac{R_j^2/(k-2)}{(1-R_j^2)/(n-k+1)}, \qquad (9.29) $$
which is compared with the critical value of an $F$ distribution with $k-2$ and $n-k+1$ degrees of freedom.
9.3.5 The Condition Number

Belsley, Kuh, and Welsch (1980) suggest an approach that considers the invertibility of $X'X$ directly. First, we transform each column of $X$ so that the columns are of similar scale in terms of variability, by dividing each column by its length so that it has unit length:
$$ x_j = x_j\big/\sqrt{x_j'x_j} \qquad (9.30) $$
for $j = 1, 2, \ldots, k$. Next we find the eigenvalues of the moment matrix of the so-transformed data matrix by finding the $k$ roots $\lambda$ of
$$ \det(X'X - \lambda I_k) = 0. \qquad (9.31) $$
Note that since $X'X$ is positive semi-definite, the eigenvalues will lie between zero and $k$, with values of zero in the event of singularity and values close to zero in the event of near singularity. The condition number of the matrix is taken as the ratio of the largest to smallest of the eigenvalues:
$$ c = \frac{\lambda_{\max}}{\lambda_{\min}}. \qquad (9.32) $$
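A sketch of this diagnostic, under the scaling described in (9.30) and with a hypothetical, nearly collinear design matrix:

```python
# Sketch of a Belsley-Kuh-Welsch style diagnostic: scale each column of X to unit
# length, compute the eigenvalues of X'X, and form the ratio lambda_max / lambda_min.
# Simulated X; numpy only.
import numpy as np

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + 0.01 * rng.normal(size=n)            # nearly collinear with x2
X = np.column_stack([np.ones(n), x2, x3])

X_scaled = X / np.sqrt((X**2).sum(axis=0))     # each column divided by its length, as in (9.30)
eigvals = np.linalg.eigvalsh(X_scaled.T @ X_scaled)
c = eigvals.max() / eigvals.min()              # condition number, as in (9.32)
print("eigenvalues:", np.round(eigvals, 6), " condition number:", round(c, 1))
```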
9.4
9.4.1
Professor Goldberger has quite aptly likened multicollinearity to "micronumerosity," or not enough observations. Recall that the shift term depends on the difference between the null and alternative, the signal-to-noise ratio, and the sample size. For a given signal-to-noise ratio, unless collinearity is extreme, it can always be overcome by increasing the sample size sufficiently. Moreover, we can sometimes gather more data that, hopefully, will not suffer from the collinearity problem. With designed experiments and cross-sections, this is particularly the case. With time-series data this is not feasible, and in any event gathering more data is time-consuming and expensive.
9.4.2 Independent Estimation

Sometimes we can obtain outside estimates. For example, in the Ando-Modigliani consumption equation
$$ C_t = \beta_0 + \beta_1 Y_t + \beta_2 W_t + u_t, \qquad (9.33) $$
income $Y_t$ and wealth $W_t$ are highly collinear in time-series data. If an independent estimate $\hat{\beta}_1$ of the marginal propensity to consume is available (for example, from cross-section data), the remaining parameters can be estimated by applying OLS to
$$ C_t - \hat{\beta}_1 Y_t = \beta_0 + \beta_2 W_t + u_t. \qquad (9.34) $$
9.4.3 Prior Restrictions
(9.35)
(9.36)
9.4.4 Ridge Regression

Another proposed remedy is ridge regression, which minimizes the sum of squared errors
$$ (y - Xb)'(y - Xb) \qquad (9.37) $$
subject to
$$ b'b \leq m. \qquad (9.38) $$
Form the Lagrangian (since the unrestricted $b'b$ is large, we must impose the restriction with equality):
$$ \mathcal{L} = (y - X\tilde{\beta})'(y - X\tilde{\beta}) + \lambda(\tilde{\beta}'\tilde{\beta} - m) = \sum_{t=1}^n\Big(y_t - \sum_{i=1}^k \tilde{\beta}_i x_{ti}\Big)^2 + \lambda\Big(\sum_{i=1}^k \tilde{\beta}_i^2 - m\Big). \qquad (9.39) $$
The first-order condition with respect to $\tilde{\beta}_j$ is
$$ -2\sum_{t=1}^n\Big(y_t - \sum_{i=1}^k \tilde{\beta}_i x_{ti}\Big)x_{tj} + 2\lambda\tilde{\beta}_j = 0, \qquad (9.40) $$
or
$$ \sum_{t=1}^n y_t x_{tj} = \sum_{i=1}^k\sum_{t=1}^n x_{ti}x_{tj}\,\tilde{\beta}_i + \lambda\tilde{\beta}_j. \qquad (9.41) $$
So, in matrix form, we have
$$ X'y = (X'X + \lambda I_k)\tilde{\beta}, \qquad (9.42) $$
and the ridge estimator is
$$ \tilde{\beta} = (X'X + \lambda I_k)^{-1}X'y. \qquad (9.43) $$
Substituting $y = X\beta + u$,
$$ \tilde{\beta} = (X'X + \lambda I_k)^{-1}X'(X\beta + u) = (X'X + \lambda I_k)^{-1}X'X\beta + (X'X + \lambda I_k)^{-1}X'u, \qquad (9.44) $$
and
$$ \mathrm{E}(\tilde{\beta}) = (X'X + \lambda I_k)^{-1}X'X\beta = P\beta, \qquad (9.45) $$
so ridge regression is biased. Rather obviously, as $\lambda$ grows large, the expectation shrinks towards zero, so the bias is towards zero. Next, we find that
$$ \mathrm{Cov}(\tilde{\beta}) = \sigma^2(X'X + \lambda I_k)^{-1}X'X(X'X + \lambda I_k)^{-1} = \sigma^2 Q < \sigma^2(X'X)^{-1}. \qquad (9.46) $$
If $u \sim N(0, \sigma^2 I_n)$, then
$$ \tilde{\beta} \sim N(P\beta, \sigma^2 Q) \qquad (9.47) $$
and inferences are possible only for $P\beta$ rather than for the complete vector $\beta$ itself.
The rather obvious question in using ridge regression is: what is the best choice of $\lambda$? We seek to trade off the increased bias against the reduction in variance. This may be done by considering the mean squared error (MSE), which is given by
$$ \mathrm{MSE}(\tilde{\beta}) = \sigma^2 Q + (P - I_k)\beta\beta'(P - I_k)' = (X'X + \lambda I_k)^{-1}\{\sigma^2 X'X + \lambda^2\beta\beta'\}(X'X + \lambda I_k)^{-1}. $$
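A minimal sketch of the ridge estimator in (9.43), with simulated collinear data and a few arbitrary values of $\lambda$, showing the shrinkage of the estimates toward zero as $\lambda$ grows:

```python
# Sketch of the ridge estimator beta_tilde = (X'X + lambda I)^{-1} X'y.
# Simulated collinear data; the lambda values are illustrative only; numpy only.
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 1.0, -0.5])
y = X @ beta + rng.normal(size=n)

for lam in (0.0, 1.0, 10.0, 100.0):
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
    print(f"lambda = {lam:6.1f}  beta_tilde = {np.round(b_ridge, 3)}")
```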
Chapter 10
Stochastic Explanatory Variables

10.1 Nature of Stochastic X

Consider the model
$$ y_i = x_i'\beta + u_i, \qquad i = 1, 2, \ldots, n, \qquad (10.1) $$
where the explanatory variables $x_i$ are now taken to be stochastic. The consequences for OLS depend on how $x_i$ is related to the disturbance $u_i$.
10.1.1 Independent X

The strongest assumption is that each $x_i$ is independent of the disturbance $u_i$, so the joint density factors:
$$ f(x_i, u_i) = f_x(x_i)\,f_u(u_i). \qquad (10.2) $$

10.1.2

A weaker assumption is that the disturbance has conditional mean zero given the regressors, $\mathrm{E}[u_i|x_i] = 0$, which implies
$$ \mathrm{E}[g(x_i)\,u_i] = 0 \qquad (10.3) $$
for any function $g(\cdot)$. This assumption is motivated by the assumption that our model is simply a statement of conditional expectation, $\mathrm{E}[y_i|x_i] = x_i'\beta$, and may or may not be accompanied by a conditional second moment assumption such as $\mathrm{E}[u_i^2|x_i] = \sigma^2$. Note that independence, along with the unconditional statements $\mathrm{E}[u_i] = 0$ and $\mathrm{E}[u_i^2] = \sigma^2$, implies conditional zero mean and constant conditional variance, but not the reverse.
10.1.3 Uncorrelated X

A still weaker assumption is that $x_{ij}$ and $u_i$ are merely uncorrelated, that is,
$$ \mathrm{cov}(x_{ij}, u_i) = \mathrm{E}[x_{ij}u_i] = 0. \qquad (10.4) $$
10.1.4 Correlated X

Finally, the regressors may be correlated with the disturbances,
$$ \mathrm{cov}(x_{ij}, u_i) = \mathrm{E}[x_{ij}u_i] \neq 0. \qquad (10.5) $$
As we shall see below, this can have quite serious implications for the OLS estimates.
An example is the case of simultaneous equations models, which we will examine later. A second example occurs when our right-hand-side variables are measured with error. Suppose
$$ y_i = \alpha + \beta x_i^* + u_i, \qquad (10.6) $$
but we observe only
$$ x_i = x_i^* + v_i, \qquad (10.7) $$
where $v_i$ is a measurement error. Substituting $x_i^* = x_i - v_i$ into (10.6) gives
$$ y_i = \alpha + \beta x_i + (u_i - \beta v_i). \qquad (10.8) $$
Since $x_i$ contains $v_i$, the regressor is correlated with the composite disturbance $(u_i - \beta v_i)$.
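A simulation of this errors-in-variables setup illustrates the resulting inconsistency; the variances chosen below are arbitrary, and the attenuation factor quoted in the comment follows from those choices.

```python
# Sketch of the errors-in-variables case (10.6)-(10.8): the observed regressor
# x_i = x_i* + v_i is correlated with the composite error (u_i - beta v_i), and the
# OLS slope is biased toward zero. Simulated data; numpy only.
import numpy as np

rng = np.random.default_rng(5)
n, alpha, beta = 100_000, 1.0, 2.0
x_star = rng.normal(size=n)
u = rng.normal(size=n)
v = rng.normal(size=n)                       # measurement error
y = alpha + beta * x_star + u
x = x_star + v                               # observed, error-ridden regressor

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
# with var(x*) = var(v) = 1, plim of the slope is beta * var(x*)/(var(x*)+var(v)) = beta/2
print(f"OLS slope = {b[1]:.3f}  (true beta = {beta}, attenuated plim = {beta * 0.5})")
```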
10.2 Consequences of Stochastic X

10.2.1

Recall that
$$ \hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u = \beta + \Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{n}X'u = \beta + \Big(\frac{1}{n}\sum_{j=1}^n x_jx_j'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iu_i. \qquad (10.9) $$
We will now examine the bias and consistency properties of the estimators under the alternative dependence assumptions.
Uncorrelated X

Suppose that $x_i$ is only assumed to be uncorrelated with $u_i$. Rewrite the second term in (10.9) as
$$ \Big(\frac{1}{n}\sum_{j=1}^n x_jx_j'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iu_i = \frac{1}{n}\sum_{i=1}^n w_iu_i, \qquad (10.10) $$
where $w_i = \big(\frac{1}{n}\sum_{j=1}^n x_jx_j'\big)^{-1}x_i$. Although $x_i$ is uncorrelated with $u_i$, $w_i$ is a function of all the $x$'s and need not be uncorrelated with $u_i$. Thus
$$ \mathrm{E}[\hat{\beta}] = \beta + \mathrm{E}[(X'X)^{-1}X'u] \neq \beta \qquad (10.11) $$
in general, whereupon OLS is biased in finite samples.
Now, each element of $x_ix_i'$ and $x_iu_i$ are i.i.d. random variables with expectations $Q$ and $0$, respectively. Thus, the law of large numbers guarantees that
$$ \frac{1}{n}\sum_{i=1}^n x_ix_i' \xrightarrow{p} \mathrm{E}[x_ix_i'] = Q \qquad (10.12) $$
and
$$ \frac{1}{n}\sum_{i=1}^n x_iu_i \xrightarrow{p} \mathrm{E}[x_iu_i] = 0. \qquad (10.13) $$
It follows that
$$ \mathrm{plim}\,\hat{\beta} = \beta + \mathrm{plim}\Big(\frac{1}{n}X'X\Big)^{-1}\,\mathrm{plim}\,\frac{1}{n}X'u \qquad (10.14) $$
$$ \phantom{\mathrm{plim}\,\hat{\beta}} = \beta + Q^{-1}\cdot 0 = \beta, \qquad (10.15) $$
so OLS remains consistent under uncorrelatedness.
If we assume instead that the disturbances have conditional mean zero, $\mathrm{E}[u_i|x_i] = 0$, then OLS is in fact unbiased, since
$$ \mathrm{E}[\hat{\beta}] = \beta + \mathrm{E}[(X'X)^{-1}X'u] = \beta + \mathrm{E}[(X'X)^{-1}X'\mathrm{E}(u|X)] = \beta + \mathrm{E}[(X'X)^{-1}X'\cdot 0] = \beta. $$
Moreover,
$$ \mathrm{E}[s^2] = \mathrm{E}\big[e'e/(n-k)\big] = \frac{1}{n-k}\,\mathrm{E}\big[u'(I - X(X'X)^{-1}X')u\big] = \sigma^2, \qquad (10.16) $$
provided that $\mathrm{E}[u_i^2|x_i] = \sigma^2$. Naturally, since conditional zero mean implies uncorrelatedness, we have the same consistency results, namely
$$ \mathrm{plim}\,\hat{\beta} = \beta \quad\text{and}\quad \mathrm{plim}\,s^2 = \sigma^2. \qquad (10.17) $$

Independent X
Under the still stronger assumption that $x_i$ is independent of $u_i$ (with $\mathrm{E}[u_i] = 0$ and $\mathrm{E}[u_i^2] = \sigma^2$), the same conclusions hold: the estimators are unbiased,
$$ \mathrm{E}[\hat{\beta}] = \beta \quad\text{and}\quad \mathrm{E}[s^2] = \sigma^2, \qquad (10.18) $$
and consistent,
$$ \mathrm{plim}\,\hat{\beta} = \beta \quad\text{and}\quad \mathrm{plim}\,s^2 = \sigma^2. \qquad (10.19) $$

Correlated X

Suppose instead that $x_i$ is correlated with $u_i$, so that
$$ \mathrm{E}[x_iu_i] = d \neq 0, \qquad (10.20) $$
whereupon
$$ \frac{1}{n}\sum_{i=1}^n x_iu_i \xrightarrow{p} \mathrm{E}[x_iu_i] = d \qquad (10.21) $$
and
$$ \mathrm{plim}\,\hat{\beta} = \beta + Q^{-1}d \neq \beta. \qquad (10.22) $$
Thus OLS is biased and inconsistent when the regressors are correlated with the disturbances.
10.2.2

Recall from (10.9) that
$$ \hat{\beta} - \beta = (X'X)^{-1}X'u, \qquad (10.23) $$
so
$$ \sqrt{n}(\hat{\beta} - \beta) = \Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{\sqrt{n}}X'u. \qquad (10.24) $$
The second factor is
$$ \frac{1}{\sqrt{n}}X'u = \frac{1}{\sqrt{n}}\sum_{i=1}^n x_iu_i, \qquad (10.25) $$
whose typical element is
$$ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{ij}u_i. \qquad (10.26) $$
Under independence, with $\mathrm{E}[u_i] = 0$ and $\mathrm{E}[u_i^2] = \sigma^2$,
$$ \mathrm{E}(x_{ij}u_i) = \mathrm{E}_x[x_{ij}]\,\mathrm{E}_u[u_i] = 0 \qquad (10.27) $$
and
$$ \mathrm{E}(x_{ij}u_i)^2 = \mathrm{E}_x[x_{ij}^2]\,\mathrm{E}_u[u_i^2] = \sigma^2 q_{jj}, \qquad (10.28) $$
where $q_{jj}$ is the $jj$-th element of $Q$. Thus, according to the central limit theorem,
$$ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{ij}u_i \xrightarrow{d} N(0, \sigma^2 q_{jj}). \qquad (10.29) $$
In general,
$$ \frac{1}{\sqrt{n}}\sum_{i=1}^n x_iu_i = \frac{1}{\sqrt{n}}X'u \xrightarrow{d} N(0, \sigma^2 Q). \qquad (10.30) $$
Since $\frac{1}{n}X'X \xrightarrow{p} Q$, it follows that
$$ \sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} Q^{-1}\,\frac{1}{\sqrt{n}}X'u \xrightarrow{d} N(0, \sigma^2 Q^{-1}). \qquad (10.31) $$
For inferences,
$$ \sqrt{n}(\hat{\beta}_j - \beta_j) \xrightarrow{d} N(0, \sigma^2 q^{jj}) \qquad (10.32) $$
and
$$ \frac{\sqrt{n}(\hat{\beta}_j - \beta_j)}{\sqrt{\sigma^2 q^{jj}}} \xrightarrow{d} N(0, 1), \qquad (10.33) $$
where $q^{jj}$ is the $jj$-th element of $Q^{-1}$. In practice $Q$ and $\sigma^2$ are unknown, but they can be estimated consistently by
$$ \widehat{Q} = \frac{1}{n}X'X, \qquad\text{so that}\qquad \widehat{Q}^{-1} = n(X'X)^{-1}, \qquad (10.34) $$
and by $s^2$, whereupon
$$ \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2[(X'X)^{-1}]_{jj}}} \xrightarrow{d} N(0, 1), \qquad (10.36) $$
so the usual OLS test statistics remain justified in large samples.
10.3

10.3.1 Instruments

Consider
$$ y = X\beta + u, \qquad (10.37) $$
where the regressors are correlated with the disturbances,
$$ \mathrm{E}[x_iu_i] = d \neq 0, \qquad (10.38) $$
so that
$$ \mathrm{plim}\,\frac{1}{n}X'u = d. \qquad (10.39) $$
Then
$$ \mathrm{plim}\,\hat{\beta} = \mathrm{plim}\,(X'X)^{-1}X'y = \mathrm{plim}\Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{n}X'y = \beta + Q^{-1}d \qquad (10.40) $$
and
$$ \mathrm{plim}\,\hat{\beta} \neq \beta. \qquad (10.41) $$
Suppose we can find i.i.d. variables $z_i$ that are independent of, and hence uncorrelated with, $u_i$, so that
$$ \mathrm{E}[z_iu_i] = 0. \qquad (10.42) $$
Suppose also that the $z_i$ have a well-behaved second-moment matrix,
$$ \mathrm{E}[z_iz_i'] = M. \qquad (10.43) $$
10.3.2

We premultiply the model (10.37) by the transpose of the $n \times k$ matrix of instruments $Z = (z_1, z_2, \ldots, z_n)'$,
$$ Z'y = Z'X\beta + Z'u, \qquad (10.44) $$
to obtain
$$ \frac{1}{n}Z'y = \frac{1}{n}Z'X\,\beta + \frac{1}{n}Z'u. \qquad (10.45) $$
Since the $z_i$ are uncorrelated with the $u_i$,
$$ \mathrm{plim}\,\frac{1}{n}Z'u = 0, \qquad (10.46) $$
so
$$ \mathrm{plim}\,\frac{1}{n}Z'y = \mathrm{plim}\,\frac{1}{n}Z'X\,\beta \qquad (10.47) $$
or
$$ \beta = \mathrm{plim}\Big[\Big(\frac{1}{n}Z'X\Big)^{-1}\frac{1}{n}Z'y\Big]. \qquad (10.48) $$
Now, the sample analog of this relation is the instrumental variable (IV) estimator
$$ \tilde{\beta} = (Z'X)^{-1}Z'y. \qquad (10.49) $$
10.3.3

Since
$$ \tilde{\beta} = (Z'X)^{-1}Z'y = (Z'X)^{-1}Z'(X\beta + u) \qquad (10.50) $$
$$ \phantom{\tilde{\beta}} = \beta + (Z'X)^{-1}Z'u, \qquad (10.51) $$
we generally have bias, since we are only assured that $z_i$ is uncorrelated with $u_i$, but not that $(Z'X)^{-1}Z'$ is uncorrelated with $u$.
Asymptotically, if $z_i$ is independent of $u_i$, then
$$ \sqrt{n}(\tilde{\beta} - \beta) \xrightarrow{d} N(0, \sigma^2 P^{-1}MP^{-1\prime}), \qquad (10.52) $$
where, as above,
$$ P = \mathrm{plim}\,\frac{1}{n}Z'X \quad\text{and}\quad M = \mathrm{plim}\,\frac{1}{n}Z'Z. \qquad (10.53) $$
Let
$$ \tilde{e} = y - X\tilde{\beta} \qquad (10.54) $$
denote the IV residuals; then
$$ \tilde{s}^2 = \frac{\tilde{e}'\tilde{e}}{n-k} = \frac{1}{n-k}\sum_{t=1}^n \tilde{e}_t^2 \qquad (10.55) $$
is consistent. The ratios
$$ \frac{\tilde{\beta}_j - \beta_j}{\sqrt{\tilde{s}^2\,[(Z'X)^{-1}Z'Z(X'Z)^{-1}]_{jj}}} \xrightarrow{d} N(0, 1). \qquad (10.56) $$
10.3.4 Optimal Instruments

The instruments $z_i$ cannot be just any variables that are independent of and uncorrelated with $u_i$. They should be as closely related to $x_i$ as possible while remaining uncorrelated with $u_i$.
Looking at the asymptotic covariance matrix $\sigma^2 P^{-1}MP^{-1\prime}$, we can see that as $z_i$ and $x_i$ become unrelated and hence uncorrelated,
$$ \mathrm{plim}\,\frac{1}{n}Z'X = P \qquad (10.57) $$
goes to zero. The inverse of $P$ consequently grows large, and $P^{-1}MP^{-1\prime}$ will become large. The consequence of using $z_i$ that are not closely related to $x_i$ is therefore imprecise estimates. In fact, we can speak of the optimal instruments as being all of $x_i$ except the part that is correlated with $u_i$.
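The following sketch contrasts OLS with the IV estimator of (10.49) in a simulated model where the regressor is correlated with the disturbance through a common shock; the instrument, the data-generating process, and all parameter values are hypothetical choices for illustration.

```python
# Sketch: OLS vs. the IV estimator beta_tilde = (Z'X)^{-1} Z'y when the regressor is
# endogenous. The instrument z is related to x but independent of u. Simulated data; numpy only.
import numpy as np

rng = np.random.default_rng(6)
n, beta = 100_000, 1.5
z = rng.normal(size=n)                        # instrument
w = rng.normal(size=n)                        # common shock creating endogeneity
x = z + w + rng.normal(size=n)                # regressor, correlated with u through w
u = w + rng.normal(size=n)
y = beta * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)      # (Z'X)^{-1} Z'y
print(f"OLS slope = {b_ols[1]:.3f}  IV slope = {b_iv[1]:.3f}  (true beta = {beta})")
```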
10.4 Detecting Correlated X

10.4.1 An Incorrect Procedure

With other problems of OLS we have examined the OLS residuals for signs of the problem. In the present case, where the problem is $u_i$ being correlated with $x_i$, we might naturally check whether our proxy for $u_i$, the OLS residuals $e_i$, are correlated with $x_i$. Thus
$$ \sum_{i=1}^n x_ie_i = X'e \qquad (10.58) $$
might be taken as an indication of any correlation between $x_i$ and $u_i$. Unfortunately, one of the properties of OLS guarantees that
$$ X'e = 0. \qquad (10.59) $$
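The point is easy to verify numerically: even with a deliberately endogenous regressor, the OLS residuals are orthogonal to the regressors by construction, so this check reveals nothing. The setup below is a hypothetical simulation.

```python
# Sketch: X'e = 0 holds by construction for OLS residuals, even when x is correlated with u.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
w = rng.normal(size=n)
x = w + rng.normal(size=n)
u = w + rng.normal(size=n)                    # u correlated with x through w
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
print("X'e =", np.round(X.T @ e, 10))         # numerically zero despite the endogeneity
```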
10.4.2 A Priori Information

10.4.3

Consider, for example, the simple Keynesian model
$$ C_t = \alpha + \beta Y_t + u_t \qquad (10.60) $$
$$ Y_t = C_t + G_t, \qquad (10.61) $$
where $C_t$ is consumption, $Y_t$ income, and $G_t$ autonomous expenditure. Substituting (10.60) into (10.61) gives
$$ Y_t = \alpha + \beta Y_t + u_t + G_t \qquad (10.62) $$
or
$$ Y_t = \frac{\alpha}{1-\beta} + \frac{1}{1-\beta}\,G_t + \frac{1}{1-\beta}\,u_t, \qquad (10.63) $$
so the regressor $Y_t$ in (10.60) is correlated with the disturbance $u_t$, and OLS will be inconsistent.
10.4.4 An IV Approach

If $x_i$ and $u_i$ are in fact uncorrelated, both OLS and IV are consistent and
$$ \sqrt{n}(\hat{\beta} - \tilde{\beta}) \xrightarrow{d} N\big(0,\ \sigma^2(P^{-1}MP^{-1\prime} - Q^{-1})\big), \qquad (10.64) $$
where
$$ P = \mathrm{plim}\,\frac{1}{n}Z'X, \qquad M = \mathrm{plim}\,\frac{1}{n}Z'Z, \qquad Q = \mathrm{plim}\,\frac{1}{n}X'X. $$
A large difference between the OLS and IV estimates, judged against this covariance matrix, is therefore evidence that $x_i$ and $u_i$ are correlated.
Chapter 11
Nonscalar Covariance

11.1

11.1.1

Consider the linear model
$$ y = X\beta + u, \qquad \mathrm{E}[u] = 0. \qquad (11.1) $$

11.1.2 Nonscalar Covariance

We now relax the assumption of a scalar covariance matrix and suppose instead that
$$ \mathrm{E}[uu'] = \sigma^2\Omega \neq \sigma^2 I_n, \qquad (11.2) $$
that is, writing $\sigma_{ts} = \mathrm{E}[u_tu_s]$,
$$ \mathrm{E}\!\left[\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}(u_1, u_2, \ldots, u_n)\right] = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{1n} & \sigma_{2n} & \cdots & \sigma_{nn} \end{pmatrix}, \qquad (11.3) $$
so the off-diagonal elements need not be zero and the diagonal elements need not be equal.
11.1.3 Some Examples

Serial Correlation

Consider the model
$$ y_t = \alpha + \beta x_t + u_t, \qquad (11.4) $$
where
$$ u_t = \rho u_{t-1} + \varepsilon_t \qquad (11.5) $$
and $\mathrm{E}[\varepsilon_t] = 0$, $\mathrm{E}[\varepsilon_t^2] = \sigma_\varepsilon^2$, and $\mathrm{E}[\varepsilon_t\varepsilon_s] = 0$ for all $t \neq s$. Here $u_t$ and $u_{t-1}$ are correlated, so $\Omega$ is not diagonal. This is a problem that afflicts a large fraction of time-series regressions.

Heteroscedasticity

Consider the model
$$ C_i = \alpha + \beta Y_i + u_i, \qquad i = 1, 2, \ldots, n, \qquad (11.6) $$
where $C_i$ is consumption and $Y_i$ income for household $i$. The dispersion of consumption around its mean may well grow with the level of income, so the diagonal elements of $\Omega$ are not constant.

A third example arises when two or more related equations are considered jointly, say
$$ y_{t1} = x_{t1}'\beta_1 + u_{t1} $$
$$ y_{t2} = x_{t2}'\beta_2 + u_{t2}. $$
If $u_{t1}$ and $u_{t2}$ are correlated, then the joint model has a nonscalar covariance. If the error terms $u_{t1}$ and $u_{t2}$ are viewed as omitted variables, then it is obvious to ask whether common factors have been omitted and hence the terms are correlated.
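For the serial-correlation example, the implied covariance matrix can be written down explicitly: with stationary $u_t = \rho u_{t-1} + \varepsilon_t$, $\mathrm{var}(u_t) = \sigma_\varepsilon^2/(1-\rho^2)$ and the correlations are $\rho^{|t-s|}$, so $\Omega$ has $(t,s)$ element $\rho^{|t-s|}$. A small sketch, with illustrative parameter values:

```python
# Sketch: the Omega matrix implied by AR(1) disturbances, Omega_{ts} = rho^|t-s|.
# (The full covariance is sigma_eps^2/(1-rho^2) times this matrix.) numpy only.
import numpy as np

def ar1_omega(n: int, rho: float) -> np.ndarray:
    """Omega with (t, s) element rho^|t-s| for AR(1) serial correlation."""
    t = np.arange(n)
    return rho ** np.abs(t[:, None] - t[None, :])

Omega = ar1_omega(5, rho=0.7)
print(np.round(Omega, 3))      # off-diagonal elements decay geometrically but are nonzero
```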
11.2 Consequences of Nonscalar Covariance

11.2.1 For Estimation

The OLS estimator is still
$$ \hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'u. \qquad (11.7) $$
Thus,
$$ \mathrm{E}[\hat{\beta}] = \beta + (X'X)^{-1}X'\mathrm{E}[u] = \beta, \qquad (11.8) $$
so OLS is still unbiased (but not BLUE, since assumptions (ii) and (iii) are not satisfied). Now
$$ \hat{\beta} - \beta = (X'X)^{-1}X'u, \qquad (11.9) $$
so
$$ \mathrm{E}[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] = (X'X)^{-1}X'\mathrm{E}[uu']X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1} \neq \sigma^2(X'X)^{-1}. $$
The diagonal elements of $(X'X)^{-1}X'\Omega X(X'X)^{-1}$ can be either larger or smaller than the corresponding elements of $(X'X)^{-1}$. In certain cases we will be able to establish the direction of the inequality.
Suppose
$$ \frac{1}{n}X'X \xrightarrow{p} Q \ \text{p.d.} \quad\text{and}\quad \frac{1}{n}X'\Omega X \xrightarrow{p} M; \qquad (11.10) $$
then
$$ (X'X)^{-1}X'\Omega X(X'X)^{-1} = \frac{1}{n}\Big[\Big(\frac{1}{n}X'X\Big)^{-1}\frac{1}{n}X'\Omega X\Big(\frac{1}{n}X'X\Big)^{-1}\Big] \xrightarrow{p} 0\cdot Q^{-1}MQ^{-1} = 0, $$
so
$$ \hat{\beta} \xrightarrow{p} \beta \qquad (11.11) $$
and OLS remains consistent.
11.2.2 For Inference

Suppose
$$ u \sim N(0, \sigma^2\Omega); \qquad (11.12) $$
then
$$ \hat{\beta} \sim N\big(\beta,\ \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}\big). \qquad (11.13) $$
Thus
$$ \frac{\hat{\beta}_j - \beta_j}{\sqrt{\sigma^2[(X'X)^{-1}]_{jj}}} \qquad (11.14) $$
is no longer distributed as $N(0,1)$, since the correct variance of $\hat{\beta}_j$ is $\sigma^2[(X'X)^{-1}X'\Omega X(X'X)^{-1}]_{jj}$. Likewise
$$ \frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2[(X'X)^{-1}]_{jj}}} \qquad (11.15) $$
no longer follows the $t_{n-k}$ distribution. We might say that OLS yields biased and inconsistent estimates of the variance-covariance matrix. This means that our statistics will have incorrect size, so we will over- or under-reject a correct null hypothesis.
11.2.3 For Prediction

We seek to predict
$$ y_0 = x_0'\beta + u_0, \qquad (11.16) $$
where the subscript $0$ indicates an observation outside the sample. The OLS (point) predictor is
$$ \hat{y}_0 = x_0'\hat{\beta}, \qquad (11.17) $$
which will be unbiased (but not BLUP). Prediction intervals based on $\sigma^2(X'X)^{-1}$ will be either too wide or too narrow, so the probability content will not be the ostensible value.
11.3

11.3.1

Since $\Omega$ is positive definite, we can write
$$ \Omega = PP' \qquad (11.18) $$
for some nonsingular matrix $P$. Premultiplying the model by $P^{-1}$ gives
$$ P^{-1}y = P^{-1}X\beta + P^{-1}u \qquad (11.19) $$
or
$$ y^* = X^*\beta + u^*, \qquad (11.20) $$
where $y^* = P^{-1}y$, $X^* = P^{-1}X$, and $u^* = P^{-1}u$. Note that $\mathrm{E}[u^*u^{*\prime}] = P^{-1}\mathrm{E}[uu']P^{-1\prime} = \sigma^2 P^{-1}\Omega P^{-1\prime} = \sigma^2 I_n$, so the transformed model has a scalar covariance matrix. Performing OLS on the transformed model yields the generalized least squares or GLS estimator
$$ \hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = \big((P^{-1}X)'P^{-1}X\big)^{-1}(P^{-1}X)'P^{-1}y = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \qquad (11.21) $$
This estimator is also known as the Aitken estimator. Note that GLS reduces to OLS when $\Omega = I_n$.
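A sketch of the GLS estimator in (11.21), using the AR(1) pattern above for $\Omega$: the estimate can be computed either by transforming the data with $P^{-1}$ (here a Cholesky factor, one valid choice of $P$) or directly from $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$. All data and parameter values are simulated for illustration.

```python
# Sketch of GLS: transform by P^{-1} where Omega = P P', or use the direct formula.
# AR(1) Omega, simulated data; numpy only.
import numpy as np

rng = np.random.default_rng(8)
n, rho = 200, 0.8
t = np.arange(n)
Omega = rho ** np.abs(t[:, None] - t[None, :])           # AR(1) pattern
P = np.linalg.cholesky(Omega)                            # Omega = P P'
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
u = P @ rng.normal(size=n)                               # disturbances with covariance Omega
y = X @ beta + u

Pinv = np.linalg.inv(P)
Xs, ys = Pinv @ X, Pinv @ y                              # transformed model, as in (11.20)
b_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)            # OLS on transformed data
b_direct = np.linalg.solve(X.T @ np.linalg.solve(Omega, X), X.T @ np.linalg.solve(Omega, y))
print("GLS via transform:", np.round(b_gls, 3), " direct formula:", np.round(b_direct, 3))
```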
11.3.2

The GLS estimator has
$$ \mathrm{E}[\hat{\beta}^*] = \beta \qquad (11.22) $$
and
$$ \mathrm{E}[(\hat{\beta}^* - \beta)(\hat{\beta}^* - \beta)'] = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}, \qquad (11.23) $$
and the GLS estimator is unbiased and BLUE. We assume the transformed model satisfies the asymptotic properties studied in the previous chapter. First, suppose
$$ \frac{1}{n}X^{*\prime}X^* = \frac{1}{n}X'\Omega^{-1}X \xrightarrow{p} Q^* \ \text{p.d.} \qquad \text{(a)} $$
and
$$ \frac{1}{\sqrt{n}}X^{*\prime}u^* = \frac{1}{\sqrt{n}}X'\Omega^{-1}u \xrightarrow{d} N(0, \sigma^2 Q^*); \qquad \text{(b)} $$
then $\sqrt{n}(\hat{\beta}^* - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{*-1})$. Inference and prediction can proceed as before for the ideal case.
11.3.3 Feasible GLS

In practice $\Omega$ is typically unknown and must be estimated. Given an estimator $\widehat{\Omega}$, with $\widehat{\Omega} = \widehat{P}\widehat{P}'$, the feasible GLS estimator is
$$ \tilde{\beta}^* = (X'\widehat{\Omega}^{-1}X)^{-1}X'\widehat{\Omega}^{-1}y = \beta + (X'\widehat{\Omega}^{-1}X)^{-1}X'\widehat{\Omega}^{-1}u. $$
The small-sample properties of this estimator are problematic, since $\widehat{\Omega}$ will generally be a function of $u$, so the regressors of the feasible transformed model $\widehat{X}^* = \widehat{P}^{-1}X$ become stochastic. The feasible GLS estimator will be biased and non-normal in small samples even if the original disturbances were normal.
It might be supposed that if $\widehat{\Omega}$ is consistent then everything will work out in large samples. Such happiness is not assured, since there are possibly $n(n+1)/2$ nonzero elements in $\Omega$ which can interact with the $x$'s in a pathological fashion. Suppose that (a) and (b) are satisfied and furthermore
$$ \frac{1}{n}\big[X'\widehat{\Omega}^{-1}X - X'\Omega^{-1}X\big] \xrightarrow{p} 0 \qquad \text{(c)} $$
and
$$ \frac{1}{\sqrt{n}}\big[X'\widehat{\Omega}^{-1}u - X'\Omega^{-1}u\big] \xrightarrow{p} 0; \qquad \text{(d)} $$
then
$$ \sqrt{n}(\tilde{\beta}^* - \beta) \xrightarrow{d} N(0, \sigma^2 Q^{*-1}). \qquad (11.24) $$
Thus in large samples, under (a)-(d), the feasible GLS estimator has the same asymptotic distribution as the true GLS estimator. As such it shares the optimality properties of the latter.
11.3.4 Maximum Likelihood

Suppose
$$ u \sim N(0, \sigma^2\Omega); \qquad (11.25) $$
then
$$ y \sim N(X\beta, \sigma^2\Omega) \qquad (11.26) $$
and the likelihood function is
$$ L(\beta, \sigma^2, \Omega; y, X) = f(y|X; \beta, \sigma^2, \Omega) = \frac{1}{(2\pi\sigma^2)^{n/2}\,|\Omega|^{1/2}}\;e^{-\frac{1}{2\sigma^2}(y-X\beta)'\Omega^{-1}(y-X\beta)}. \qquad (11.27) $$
Taking $\Omega$ as given, we can maximize $L(\cdot)$ with respect to $\beta$ by minimizing
$$ (y-X\beta)'\Omega^{-1}(y-X\beta) = (y-X\beta)'P^{-1\prime}P^{-1}(y-X\beta) = (y^*-X^*\beta)'(y^*-X^*\beta), \qquad (11.28) $$
so the maximum likelihood estimator of $\beta$ for known $\Omega$ is just the GLS estimator.
11.4 Seemingly Unrelated Regressions

11.4.1

We consider a model with $G$ agents and a behavioral equation with $n$ observations for each agent. The equation for agent $j$ can be written
$$ y_j = X_j\beta_j + u_j, \qquad (11.29) $$
and stacking the $G$ equations gives
$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{pmatrix} = \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_G \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_G \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_G \end{pmatrix} \qquad (11.30) $$
or more compactly
$$ y = X\beta + u. \qquad (11.31) $$
Within each equation we assume the usual properties,
$$ \mathrm{E}[u_j] = 0 \qquad (11.32) $$
and
$$ \mathrm{E}[u_ju_j'] = \sigma_j^2 I_n, \qquad (11.33) $$
but due to common omitted factors we must allow for the possibility that
$$ \mathrm{E}[u_ju_\ell'] = \sigma_{j\ell}I_n, \qquad j \neq \ell. \qquad (11.34) $$
For the stacked system, then,
$$ \mathrm{E}[u] = 0 \qquad (11.35) $$
and
$$ \mathrm{E}[uu'] = \Sigma \otimes I_n = \Omega, \qquad (11.36) $$
where
$$ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1G} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2G} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1G} & \sigma_{2G} & \cdots & \sigma_G^2 \end{pmatrix}. \qquad (11.37) $$

11.4.2 SUR Estimation
One approach is to estimate each equation separately by OLS,
$$ \hat{\beta}_j = (X_j'X_j)^{-1}X_j'y_j, \qquad (11.38) $$
and as usual these estimators will be unbiased, BLUE (among estimators linear in $y_j$), and under normality
$$ \hat{\beta}_j \sim N\big(\beta_j,\ \sigma_j^2(X_j'X_j)^{-1}\big). \qquad (11.39) $$
This procedure, however, ignores the covariances between equations. Treating all equations as a combined system yields
$$ y = X\beta + u, \qquad (11.40) $$
where
$$ u \sim (0,\ \Sigma \otimes I_n), \qquad (11.41) $$
and the GLS (SUR) estimator is
$$ \hat{\beta}^* = \big(X'(\Sigma \otimes I_n)^{-1}X\big)^{-1}X'(\Sigma \otimes I_n)^{-1}y = \big(X'(\Sigma^{-1} \otimes I_n)X\big)^{-1}X'(\Sigma^{-1} \otimes I_n)y. $$
This estimator will be unbiased and BLUE (for linearity in $y$) and will, in general, be efficient relative to OLS.
If $u$ is multivariate normal, then
$$ \hat{\beta}^* \sim N\big(\beta,\ (X'(\Sigma \otimes I_n)^{-1}X)^{-1}\big). \qquad (11.42) $$
Even if $u$ is not normal, then, with reasonable assumptions about the behavior of $X$, we have
$$ \sqrt{n}(\hat{\beta}^* - \beta) \xrightarrow{d} N\Big(0,\ \big[\lim_{n\to\infty}\tfrac{1}{n}X'(\Sigma \otimes I_n)^{-1}X\big]^{-1}\Big). \qquad (11.43) $$

11.4.3 Diagonal Σ
There are two special cases in which the SUR estimator simplifies to OLS on each equation. The first case is when $\Sigma$ is diagonal. In this case
$$ \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_G^2 \end{pmatrix} \qquad (11.44) $$
and
$$ X'(\Sigma \otimes I_n)^{-1}X = \begin{pmatrix} \frac{1}{\sigma_1^2}X_1'X_1 & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2}X_2'X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_G^2}X_G'X_G \end{pmatrix}. $$
Similarly,
$$ X'(\Sigma \otimes I_n)^{-1}y = \begin{pmatrix} \frac{1}{\sigma_1^2}X_1'y_1 \\ \frac{1}{\sigma_2^2}X_2'y_2 \\ \vdots \\ \frac{1}{\sigma_G^2}X_G'y_G \end{pmatrix}, \qquad (11.45) $$
whereupon
$$ \hat{\beta}^* = \begin{pmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \\ \vdots \\ (X_G'X_G)^{-1}X_G'y_G \end{pmatrix}. \qquad (11.46) $$
So the estimator for each equation is just the OLS estimator for that equation alone.
11.4.4 Identical Regressors

The second case is when each equation has the same set of regressors, i.e. $X_j = X$ for all $j$, so that the stacked regressor matrix is
$$ I_G \otimes X. \qquad (11.47) $$
Then
$$ \hat{\beta}^* = \big((I_G \otimes X)'(\Sigma^{-1} \otimes I_n)(I_G \otimes X)\big)^{-1}(I_G \otimes X)'(\Sigma^{-1} \otimes I_n)y = \big(\Sigma \otimes (X'X)^{-1}\big)(\Sigma^{-1} \otimes X')y = \begin{pmatrix} (X'X)^{-1}X'y_1 \\ (X'X)^{-1}X'y_2 \\ \vdots \\ (X'X)^{-1}X'y_G \end{pmatrix}. $$
In both these cases the other equations have nothing to add to the estimation of the equation of interest, because either the omitted factors are unrelated or the other equations have no additional regressors to help reduce the sum of squared errors for the equation of interest.
11.4.5 Unknown Σ

Note that for this case $\Sigma$ comprises the unknown parameters of $\Omega$ in the general form $\Omega = \Omega(\theta)$. It is of finite length, with $G(G+1)/2$ unique elements, and it can be estimated consistently from the OLS residuals. Let
$$ e_j = y_j - X_j\hat{\beta}_j $$
denote the OLS residuals for agent $j$. Then, by the usual arguments,
$$ \hat{\sigma}_{j\ell} = \frac{1}{n}\sum_{i=1}^n e_{ij}e_{i\ell} $$
and
$$ \widehat{\Sigma} = (\hat{\sigma}_{j\ell}), $$
which can be shown to satisfy (a)-(d), so the resulting feasible GLS estimator will have the same asymptotic distribution as $\hat{\beta}^*$. This estimator is obtained in two steps: in the first step we estimate all equations by OLS and thereby obtain the estimator $\widehat{\Sigma}$; in the second step we obtain the feasible GLS estimator.
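A compact sketch of this two-step procedure for a hypothetical two-equation system with correlated errors; the equation specifications, parameter values, and error covariance are all illustrative choices.

```python
# Sketch of two-step SUR (feasible GLS): (1) equation-by-equation OLS and residual
# covariance Sigma_hat; (2) GLS on the stacked system with Omega_hat = Sigma_hat kron I_n.
# Simulated data; numpy only.
import numpy as np

rng = np.random.default_rng(9)
n, G = 500, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
Sigma_true = np.array([[1.0, 0.8], [0.8, 1.0]])
U = rng.multivariate_normal(np.zeros(G), Sigma_true, size=n)   # correlated errors across equations
y1 = X1 @ np.array([1.0, 2.0]) + U[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + U[:, 1]

# Step 1: OLS per equation, then Sigma_hat from the residuals
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)
b2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
E = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
Sigma_hat = E.T @ E / n

# Step 2: stacked system with Omega_hat = Sigma_hat kron I_n
X = np.block([[X1, np.zeros_like(X2)], [np.zeros_like(X1), X2]])
y = np.concatenate([y1, y2])
Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
b_sur = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
print("SUR estimates:", np.round(b_sur, 3))
```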