1 Regression Analysis and Least Squares Estimators
Abstract
We consider the regression model: $Y_i = X_i'\beta + u_i$, $i = 1, \ldots, n$. This note summarizes the results for asymptotic analysis of the least squares estimator, $\hat\beta$, that make it possible to test: (i) hypotheses about the individual coefficients of $\beta$; and (ii) hypotheses that involve several coefficients, such as $R\beta = r$, where $R$ is a known matrix and $r$ is a known vector.
We consider the regression model
$$Y_i = X_i'\beta + u_i, \qquad i = 1, \ldots, n, \qquad\text{or in matrix notation,}\qquad Y = X\beta + U. \tag{1}$$
The least squares estimator, $\hat\beta$, minimizes the sum of squared residuals, $S(\beta) \equiv \sum_{i=1}^n (Y_i - \beta'X_i)^2$, so that
$$\sum_{i=1}^n \hat u_i^2 = \min_\beta \sum_{i=1}^n \big(Y_i - \beta'X_i\big)^2.$$
The least squares estimator results from solving the first order conditions for a minimum:
$$
\begin{aligned}
\tfrac{\partial}{\partial\beta_0} S(\beta) = 0 \;&\Leftrightarrow\; -2\sum_{i=1}^n \big(Y_i - X_i'\beta\big) = 0\\
\tfrac{\partial}{\partial\beta_1} S(\beta) = 0 \;&\Leftrightarrow\; -2\sum_{i=1}^n x_{1i}\big(Y_i - X_i'\beta\big) = 0\\
&\;\;\vdots\\
\tfrac{\partial}{\partial\beta_k} S(\beta) = 0 \;&\Leftrightarrow\; -2\sum_{i=1}^n x_{ki}\big(Y_i - X_i'\beta\big) = 0
\end{aligned}
$$
Rearranging these equations yields the normal equations
$$
\begin{aligned}
\sum_{i=1}^n Y_i &= \sum_{i=1}^n X_i'\hat\beta\\
\sum_{i=1}^n x_{1i}Y_i &= \sum_{i=1}^n x_{1i}X_i'\hat\beta\\
&\;\;\vdots\\
\sum_{i=1}^n x_{ki}Y_i &= \sum_{i=1}^n x_{ki}X_i'\hat\beta.
\end{aligned}
$$
Solving for $\hat\beta$ gives
$$\hat\beta = \left(\sum_{i=1}^n X_iX_i'\right)^{-1}\sum_{i=1}^n X_iY_i, \tag{2}$$
or, in matrix notation, $\hat\beta = (X'X)^{-1}X'Y$.
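To make the formula concrete, here is a minimal sketch (not part of the original note; the simulated data, coefficient values, and variable names are made-up assumptions) that computes $\hat\beta = (X'X)^{-1}X'Y$ for an artificial data set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])     # n x (k+1) regressor matrix; first column is the constant
beta = np.array([1.0, 0.5, -0.3])        # hypothetical true coefficients
u = rng.normal(size=n)                   # errors with E(u | X) = 0
Y = X @ beta + u

# Least squares estimator from the normal equations: beta_hat = (X'X)^{-1} X'Y, as in (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                          # close to (1.0, 0.5, -0.3)
```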
Substituting $Y = X\beta + U$ into (2) gives
$$\hat\beta = (X'X)^{-1}X'(X\beta + U) = \beta + (X'X)^{-1}X'U, \tag{3}$$
such that
$$\hat\beta - \beta = \left(\sum_{i=1}^n X_iX_i'\right)^{-1}\sum_{i=1}^n X_iu_i, \tag{4}$$
or, in matrix notation, $\hat\beta - \beta = (X'X)^{-1}X'U$.
Assumption A1: $E(U\,|\,X) = 0$.
The expectation of $\hat\beta$ is
$$E(\hat\beta) = \beta + \left(\sum_{i=1}^n X_iX_i'\right)^{-1}\sum_{i=1}^n E(X_iu_i) = \beta, \tag{5}$$
since Assumption A1 implies that $E(X_iu_i) = E[X_iE(u_i|X_i)] = 0$; in matrix notation, $E(\hat\beta) = \beta + (X'X)^{-1}E(X'U) = \beta$.
So $\hat\beta$ is an unbiased estimator (note that $\hat\beta$ depends on $n$, although it is not clear from the definition). When applying asymptotic results one should have the following in mind. The estimator, $\hat\beta_n$, has some (unknown) finite-sample distribution, which is approximated by its asymptotic distribution. The finite-sample distribution and the asymptotic distribution are (most likely) different, so we are making an error when we use the asymptotic distribution. However, when $n$ is large, the difference between the two distributions is likely to be small, and the asymptotic distribution is then a good approximation.
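As a rough illustration of this point, the following sketch (my own, with a made-up data-generating process and coefficient values) simulates the finite-sample distribution of a slope estimator across many replications; the Monte Carlo mean is close to the true value, in line with (5), and the spread can be compared with a normal approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5_000
beta = np.array([1.0, 2.0])                    # hypothetical true intercept and slope

slope_hat = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    u = rng.standard_t(df=5, size=n)           # non-normal errors with E(u | X) = 0
    Y = X @ beta + u
    slope_hat[r] = np.linalg.solve(X.T @ X, X.T @ Y)[1]

print(slope_hat.mean())                        # close to 2.0: the estimator is unbiased
print(np.quantile(slope_hat, [0.025, 0.975]))  # finite-sample spread of the estimator
```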
2 Asymptotic Analysis of $\hat\beta$
So the univariate LLN is just a special case of the multivariate version. The same is true for the
multivariate CLT.
Theorem 2 (Multivariate CLT) Let $\{V_i\}$ be a sequence of $m$-dimensional random vectors that are iid, with mean $\mu_V = E(V_i)$ and covariance matrix $\Sigma_V = \mathrm{var}(V_i) = E[(V_i - \mu_V)(V_i - \mu_V)']$. Then it holds that
$$\frac{1}{\sqrt n}\sum_{i=1}^n (V_i - \mu_V) \;\xrightarrow{d}\; N_m(0, \Sigma_V).$$
Since $\{X_i\}$ is a sequence of iid random variables (vectors), it follows that $\{X_iX_i'\}$ is a sequence of iid random variables (matrices). So by the multivariate LLN it holds that $\frac1n\sum_{i=1}^n X_iX_i' \xrightarrow{p} Q_X \equiv E(X_iX_i')$. Similarly, $\{X_iu_i\}$ is a sequence of iid random variables (vectors), with expected value $E[X_iu_i] = E[E(X_iu_i|X_i)] = E[X_iE(u_i|X_i)] = E[X_i\cdot 0] = 0$. So by the multivariate central limit theorem we have that $\frac{1}{\sqrt n}\sum_{i=1}^n X_iu_i \xrightarrow{d} N_{k+1}(0, \Sigma_v)$, where $\Sigma_v \equiv \mathrm{var}(X_iu_i) = E(X_iX_i'u_i^2)$. Here we are implicitly using Assumption A3, which guarantees that the expected values $E(X_iX_i')$ and $E(X_iX_i'u_i^2)$ are well defined (finite).
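A quick way to see the LLN at work is to compute $\frac1n\sum_i X_iX_i'$ for increasing $n$. In this made-up design (my own, not from the note) with $X_i = (1, x_i)'$ and $x_i \sim N(0,1)$, the limit is $Q_X = I_2$:

```python
import numpy as np

rng = np.random.default_rng(4)

# With X_i = (1, x_i)' and x_i ~ N(0, 1), Q_X = E(X_i X_i') is the 2 x 2 identity matrix.
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    print(n)
    print((X.T @ X) / n)                 # (1/n) sum_i X_i X_i' settles down near Q_X as n grows
```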
Theorem 3 (Linear Transformation of Gaussian Variables) Let $Z \sim N_m(\mu, \Sigma)$, for some vector, $\mu$, $(m\times1)$, and some matrix, $\Sigma$, $(m\times m)$. Let $A$ be an $l\times m$ matrix and $b$ be an $l\times1$ vector. Define the $l$-dimensional random variable $\tilde Z = AZ + b$. Then it holds that $\tilde Z \sim N_l(A\mu + b, A\Sigma A')$.
Combining these results with (4), we obtain
$$\sqrt n(\hat\beta - \beta) = \underbrace{\left(\frac1n\sum_{i=1}^n X_iX_i'\right)^{-1}}_{\xrightarrow{p}\; Q_X^{-1}}\;\underbrace{\frac{1}{\sqrt n}\sum_{i=1}^n X_iu_i}_{\xrightarrow{d}\; N_{k+1}(0,\,\Sigma_v)} \;\xrightarrow{d}\; N_{k+1}\big(0,\; Q_X^{-1}\Sigma_v Q_X^{-1}\big).$$
The covariance matrix $\Sigma_{\sqrt n(\hat\beta - \beta)} \equiv Q_X^{-1}\Sigma_v Q_X^{-1}$ can be estimated by $\hat\Sigma_{\sqrt n(\hat\beta_n - \beta)} \equiv \hat Q_X^{-1}\hat\Sigma_v\hat Q_X^{-1}$, where
$$\hat Q_X \equiv \frac1n\sum_{i=1}^n X_iX_i' \qquad\text{and}\qquad \hat\Sigma_v \equiv \frac{1}{n-k-1}\sum_{i=1}^n X_iX_i'\hat u_i^2.$$
How can we be sure that $\hat\Sigma_{\sqrt n(\hat\beta_n - \beta)}$ is consistent for $\Sigma_{\sqrt n(\hat\beta_n - \beta)}$? We have already seen that $\hat Q_X \xrightarrow{p} Q_X$, and since $\{X_i, Y_i\}$ is iid (Assumption A2), also $\{X_iX_i'(Y_i - X_i'\beta)^2\} = \{X_iX_i'u_i^2\}$ is iid, such that a LLN gives us $\frac1n\sum_{i=1}^n X_iX_i'u_i^2 \xrightarrow{p} E(X_iX_i'u_i^2)$. To establish that $\hat\Sigma_v \xrightarrow{p} \Sigma_v$, we first note that
$$\frac1n\sum_{i=1}^n X_iX_i'\hat u_i^2 = \frac1n\sum_{i=1}^n X_iX_i'u_i^2 + \frac1n\sum_{i=1}^n X_iX_i'(\hat u_i^2 - u_i^2),$$
and it can be shown that $\frac1n\sum_{i=1}^n X_iX_i'(\hat u_i^2 - u_i^2) \xrightarrow{p} 0$ (beyond the scope of EC163), such that
$$\hat\Sigma_v = \frac{n}{n-k-1}\cdot\frac1n\sum_{i=1}^n X_iX_i'\hat u_i^2 \;\xrightarrow{p}\; E(X_iX_i'u_i^2) = \Sigma_v,$$
using $\frac{n}{n-k-1} \to 1$ as $n \to \infty$. Since the mapping $\{Q_X, \Sigma_v\} \mapsto Q_X^{-1}\Sigma_v Q_X^{-1}$ is continuous, we know that $\hat Q_X \xrightarrow{p} Q_X$ and $\hat\Sigma_v \xrightarrow{p} \Sigma_v$ implies that
$$\hat\Sigma_{\sqrt n(\hat\beta_n - \beta)} = \hat Q_X^{-1}\hat\Sigma_v\hat Q_X^{-1} \;\xrightarrow{p}\; Q_X^{-1}\Sigma_v Q_X^{-1} = \Sigma_{\sqrt n(\hat\beta_n - \beta)},$$
as we wanted to show.
Multiplying $\sqrt n(\hat\beta - \beta)$ by $1/\sqrt n$ and adding $\beta$ shows that $\hat\beta$ is asymptotically normally distributed,
$$\hat\beta \;\overset{A}{\sim}\; N_{k+1}\big(\beta,\; \Sigma_{\hat\beta}\big), \qquad\text{where } \Sigma_{\hat\beta} \equiv \tfrac1n\,\Sigma_{\sqrt n(\hat\beta_n - \beta)},$$
which is estimated by $\hat\Sigma_{\hat\beta} \equiv \tfrac1n\,\hat\Sigma_{\sqrt n(\hat\beta_n - \beta)}$.
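The following sketch (not part of the note; the simulated design and variable names are my own assumptions) computes $\hat Q_X$, $\hat\Sigma_v$, and $\hat\Sigma_{\hat\beta} = n^{-1}\hat Q_X^{-1}\hat\Sigma_v\hat Q_X^{-1}$, whose diagonal gives heteroskedasticity-robust standard errors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 400, 2
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])                    # n x (k+1), includes the constant
beta = np.array([1.0, 0.5, -0.3])                       # hypothetical true coefficients
u = rng.normal(size=n) * (1 + 0.5 * np.abs(x[:, 0]))    # heteroskedastic errors with E(u | X) = 0
Y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ beta_hat

Q_hat = (X.T @ X) / n                                   # Q_X hat = (1/n) sum_i X_i X_i'
S_hat = (X.T * u_hat**2) @ X / (n - k - 1)              # Sigma_v hat = 1/(n-k-1) sum_i X_i X_i' u_hat_i^2
Q_inv = np.linalg.inv(Q_hat)
Sigma_beta_hat = Q_inv @ S_hat @ Q_inv / n              # estimated var(beta_hat)
se = np.sqrt(np.diag(Sigma_beta_hat))                   # heteroskedasticity-robust standard errors
print(beta_hat, se)
```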
2.1 Testing Hypotheses about Individual Coefficients
Consider the vector of regression coefficients, $\beta = (\beta_0, \ldots, \beta_k)'$, and suppose that we are interested in the $j$th coefficient, $\beta_j$. We can let $d = (0, \ldots, 0, 1, 0, \ldots, 0)'$ denote the $j$th unit-vector (the vector which has 1 as its $j$th element and zeros otherwise). Then we note that
$$d'\sqrt n(\hat\beta - \beta) = \sqrt n(d'\hat\beta - d'\beta) = \sqrt n(\hat\beta_j - \beta_j),$$
and by Theorem 4 it follows that $\sqrt n(\hat\beta_j - \beta_j) = d'\sqrt n(\hat\beta - \beta) \xrightarrow{d} N_1(0, d'\Sigma_{\sqrt n(\hat\beta - \beta)}d)$. So for large $n$ it holds that
$$\hat\beta_j - \beta_j \;\overset{A}{\sim}\; N_1\big(0,\; d'\Sigma_{\hat\beta}d\big),$$
and the $t$-statistic for the hypothesis $\beta_j = c$ is
$$t_{\beta_j = c} = \frac{\hat\beta_j - c}{\sqrt{d'\hat\Sigma_{\hat\beta}d}},$$
which for large $n$, is approximately distributed as a standard normal, $N(0,1)$. (For moderate values of $n$, it is typically better to use the $t$-distribution with $n-k-1$ degrees of freedom.)
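As a small numerical sketch of this $t$-statistic (the estimates and covariance matrix below are hypothetical placeholders, not taken from the note):

```python
import numpy as np

beta_hat = np.array([1.0, 0.8, -0.2])          # hypothetical estimates
Sigma_hat = np.array([[0.90, 0.10, 0.05],      # hypothetical estimate of var(beta_hat)
                      [0.10, 0.75, 0.20],
                      [0.05, 0.20, 0.60]])

j, c = 2, 0.0                                  # test H0: beta_j = 0 for the last coefficient
d = np.zeros(3)
d[j] = 1.0                                     # j-th unit vector
t_stat = (beta_hat[j] - c) / np.sqrt(d @ Sigma_hat @ d)
print(t_stat, abs(t_stat) > 1.96)              # compare |t| with the 5% critical value of N(0, 1)
```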
2.2 Testing Hypotheses that Involve Several Coefficients
To test hypotheses that involve multiple coefficients we need the following result.
Theorem 5 Let $Z \sim N_m(\mu, \Sigma)$, for some vector, $\mu$, $(m\times1)$, and some (full rank) matrix, $\Sigma$, $(m\times m)$. Then it holds that
$$(Z - \mu)'\Sigma^{-1}(Z - \mu) \;\sim\; \chi^2_m.$$
Here we use $\chi^2_m$ to denote the chi-squared distribution with $m$ degrees of freedom. In our asymptotic analysis the result we need is the following.
Theorem 6 Let $Z_n \xrightarrow{d} N_m(\mu, \Sigma)$ for some vector, $\mu$, $(m\times1)$, and some (full rank) matrix, $\Sigma$, $(m\times m)$. Suppose that $\hat\mu \xrightarrow{p} \mu$ and that $\hat\Sigma \xrightarrow{p} \Sigma$. Then it holds that
$$(Z_n - \hat\mu)'\hat\Sigma^{-1}(Z_n - \hat\mu) \;\xrightarrow{d}\; \chi^2_m.$$
In our setting we have established that $\sqrt n(\hat\beta - \beta) \xrightarrow{d} N_{k+1}(0, \Sigma_{\sqrt n(\hat\beta - \beta)})$ and $\hat\Sigma_{\sqrt n(\hat\beta - \beta)} \xrightarrow{p} \Sigma_{\sqrt n(\hat\beta - \beta)}$. Thus the theorem tells us that
$$\sqrt n(\hat\beta - \beta)'\big[\hat\Sigma_{\sqrt n(\hat\beta - \beta)}\big]^{-1}\sqrt n(\hat\beta - \beta) = (\hat\beta - \beta)'\big[n^{-1}\hat\Sigma_{\sqrt n(\hat\beta - \beta)}\big]^{-1}(\hat\beta - \beta) = (\hat\beta - \beta)'\hat\Sigma_{\hat\beta}^{-1}(\hat\beta - \beta) \;\xrightarrow{d}\; \chi^2_{k+1}.$$
This enables us to test the hypothesis that the vector of regression parameters equals a particular vector, e.g., $H_0\colon \beta = \beta^o$. All we need to do is to compute $(\hat\beta - \beta^o)'\hat\Sigma_{\hat\beta}^{-1}(\hat\beta - \beta^o)$ and compare this (scalar) statistic with the appropriate critical value from the $\chi^2_{k+1}$-distribution. An $F_{m,\infty}$-distributed statistic is simply a $\chi^2_m$-statistic that has been divided by its degrees of freedom. So, should we prefer to use an $F$-test to test $H_0\colon \beta = \beta^o$, we would simply use that
$$F_{\beta=\beta^o} = \frac{(\hat\beta - \beta^o)'\hat\Sigma_{\hat\beta}^{-1}(\hat\beta - \beta^o)}{k+1} \;\xrightarrow{d}\; F_{k+1,\infty},$$
where $F_{\beta=\beta^o}$ denotes the test statistic and $F_{k+1,\infty}$ represents the $F$-distribution with $(k+1, \infty)$ degrees of freedom. (An $F$-distribution has two degrees-of-freedom parameters.)
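A compact sketch of this full-vector test (all numbers below are hypothetical; only the formula above comes from the note):

```python
import numpy as np

beta_hat = np.array([1.0, 0.8, -0.2])                 # hypothetical estimates
Sigma_hat = np.array([[0.90, 0.10, 0.05],
                      [0.10, 0.75, 0.20],
                      [0.05, 0.20, 0.60]])            # hypothetical estimate of var(beta_hat)
beta_o = np.array([1.0, 1.0, 0.0])                    # hypothesized value of the whole vector

diff = beta_hat - beta_o
chi2_stat = diff @ np.linalg.solve(Sigma_hat, diff)   # compare with chi-squared, k+1 = 3 df
F_stat = chi2_stat / beta_hat.size                    # F_{k+1, infinity} version of the same test
print(chi2_stat, F_stat)
```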
Typically we are interested in more complicated hypotheses than $\beta_j = c$ or $\beta = \beta^o$. A general class of hypotheses can be formulated as $H_0\colon R\beta = r$, for some $q \times (k+1)$ matrix, $R$, and some $q \times 1$ vector, $r$. How can we test hypotheses of this kind? First we note that Theorem 4 gives us that
$$R\sqrt n(\hat\beta - \beta) \;\xrightarrow{d}\; N_q\big(R\cdot 0,\; R\,\Sigma_{\sqrt n(\hat\beta - \beta)}R'\big) = N_q\big(0,\; R\,\Sigma_{\sqrt n(\hat\beta - \beta)}R'\big).$$
The left hand side can be rewritten as $R\sqrt n(\hat\beta - \beta) = \sqrt n(R\hat\beta - R\beta) = \sqrt n[(R\hat\beta - r) - (R\beta - r)]$, which equals $\sqrt n(R\hat\beta - r)$ if the null hypothesis is true. So if we divide by $\sqrt n$ we get that $(R\hat\beta - r) \overset{A}{\sim} N_q(0, \tfrac1n R\,\Sigma_{\sqrt n(\hat\beta - \beta)}R') = N_q(0, R\,\Sigma_{\hat\beta}R')$. Thus by using Theorem 6 we can construct a $\chi^2$-test of the hypothesis $H_0\colon R\beta = r$, using the test statistic $(R\hat\beta - r)'\big[R\hat\Sigma_{\hat\beta}R'\big]^{-1}(R\hat\beta - r) \xrightarrow{d} \chi^2_q$, or the equivalent $F$-statistic
$$F_{R\beta=r} = \frac{(R\hat\beta - r)'\big[R\hat\Sigma_{\hat\beta}R'\big]^{-1}(R\hat\beta - r)}{q} \;\xrightarrow{d}\; F_{q,\infty}.$$
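The same recipe for a general $H_0\colon R\beta = r$ might look as follows (again with hypothetical inputs; only the formulas above are taken from the note):

```python
import numpy as np

beta_hat = np.array([1.0, 0.8, -0.2])                 # hypothetical estimates
Sigma_hat = np.array([[0.90, 0.10, 0.05],
                      [0.10, 0.75, 0.20],
                      [0.05, 0.20, 0.60]])            # hypothetical estimate of var(beta_hat)

# H0: beta_1 + beta_2 = 1 and beta_2 = 0  (two linear restrictions, q = 2)
R = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 0.0])

diff = R @ beta_hat - r
V = R @ Sigma_hat @ R.T                               # q x q covariance of R beta_hat under H0
chi2_stat = diff @ np.linalg.solve(V, diff)           # compare with chi-squared, q degrees of freedom
F_stat = chi2_stat / len(r)                           # compare with F_{q, infinity}
print(chi2_stat, F_stat)
```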
Tables with critical values for the $\chi^2_m$-distribution can be found in S&W on page 645. The $F_{m_1,m_2}$-distribution is tabulated on pages 647-649, and the $F_{m,\infty}$-distribution (which you will use most frequently) is tabulated on page 646, and (conveniently) on the very last page of the book.
2.3 A Simple Example
Suppose that we have obtained the least squares estimates
$$\hat\beta = \begin{pmatrix} 4.0 \\ 2.5 \\ 1.5 \end{pmatrix} \qquad\text{and}\qquad \hat\Sigma_{\hat\beta} = \begin{pmatrix} 801/40 & 27/8 & 6 \\ 27/8 & 3/4 & 9/8 \\ 6 & 9/8 & 15/8 \end{pmatrix}.$$
We can now test a number of hypotheses about $\beta = (\beta_1, \beta_2, \beta_3)'$.
1. $H_1\colon \beta_3 = 0$? Here we set $R_1 = (0, 0, 1)$ and $r_1 = 0$. We find that
$$R_1\hat\beta - r_1 = 1.5 \qquad\text{and}\qquad R_1\hat\Sigma_{\hat\beta}R_1' = \tfrac{15}{8},$$
such that
$$F_{\beta_3=0} = (1.5)\Big[\tfrac{15}{8}\Big]^{-1}(1.5) = 1.2 \;\overset{A}{\sim}\; F_{1,\infty}.$$
Note that when we test a single restriction, the $F$-statistic is the square of the $t$-statistic:
$$F_{\beta_3=0} = t^2_{\beta_3=0} = \frac{(\hat\beta_3 - 0)^2}{\widehat{\mathrm{var}}(\hat\beta_3)}.$$
(In this example $t_{\beta_3=0} = 1.5/\sqrt{15/8} \approx 1.095$, and $1.095^2 \approx 1.2$.)
2. $H_2\colon \beta_2 + \beta_3 = 0$? Here we set $R_2 = (0, 1, 1)$ and $r_2 = 0$. We find
$$R_2\hat\beta - r_2 = 2.5 + 1.5 = 4 \qquad\text{and}\qquad R_2\hat\Sigma_{\hat\beta}R_2' = \tfrac{39}{8},$$
such that
$$F_{\beta_2+\beta_3=0} = 4\Big[\tfrac{39}{8}\Big]^{-1}4 = 3.2821 \;\overset{A}{\sim}\; F_{1,\infty}.$$
3. $H_3\colon \beta_2 - \beta_3 = 0$? Here we set $R_3 = (0, 1, -1)$ and $r_3 = 0$. We find
$$R_3\hat\beta - r_3 = 2.5 - 1.5 = 1 \qquad\text{and}\qquad R_3\hat\Sigma_{\hat\beta}R_3' = \tfrac38,$$
such that
$$F_{\beta_2-\beta_3=0} = 1\Big[\tfrac38\Big]^{-1}1 = 2.6666 \;\overset{A}{\sim}\; F_{1,\infty}.$$
4. $H_4\colon \beta_2 = \beta_3 = 0$. Now we set
$$R_4 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad\text{and}\qquad r_4 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
Then
$$R_4\hat\beta - r_4 = \begin{pmatrix} 2.5 \\ 1.5 \end{pmatrix} \qquad\text{and}\qquad R_4\hat\Sigma_{\hat\beta}R_4' = \begin{pmatrix} 3/4 & 9/8 \\ 9/8 & 15/8 \end{pmatrix},$$
such that
$$F_{\beta_2=\beta_3=0} = \frac12\begin{pmatrix} 2.5 & 1.5 \end{pmatrix}\begin{pmatrix} 3/4 & 9/8 \\ 9/8 & 15/8 \end{pmatrix}^{-1}\begin{pmatrix} 2.5 \\ 1.5 \end{pmatrix} = 17.666 \;\overset{A}{\sim}\; F_{2,\infty}.$$
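As a check on the arithmetic in this example, the following sketch (mine, not part of the note) reproduces the four $F$-statistics from $\hat\beta$ and $\hat\Sigma_{\hat\beta}$ as reported above:

```python
import numpy as np

beta_hat = np.array([4.0, 2.5, 1.5])
Sigma_hat = np.array([[801/40, 27/8,  6   ],
                      [27/8,   3/4,   9/8 ],
                      [6,      9/8,   15/8]])   # estimated var(beta_hat) as reported above

def f_test(R, r):
    """Return the F-statistic for H0: R beta = r."""
    R, r = np.atleast_2d(R), np.atleast_1d(r)
    diff = R @ beta_hat - r
    V = R @ Sigma_hat @ R.T
    return (diff @ np.linalg.solve(V, diff)) / len(r)

print(f_test([0, 0, 1], 0))                     # H1: beta_3 = 0          -> 1.2
print(f_test([0, 1, 1], 0))                     # H2: beta_2 + beta_3 = 0 -> 3.28
print(f_test([0, 1, -1], 0))                    # H3: beta_2 - beta_3 = 0 -> 2.67
print(f_test([[0, 1, 0], [0, 0, 1]], [0, 0]))   # H4: beta_2 = beta_3 = 0 -> 17.67
```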
When $X_i$ is random, the concept of conditional expectation is an important tool in the analysis. A conditional expected value is written $E(Y|X = x)$, and denotes the expected value of $Y$ when we know that $X = x$. (So $E(Y)$ is the expected value of $Y$ when we know nothing about other random variables.) When we write $E(Y|X)$, we think of it as a function of $X$. Since $X$ is random, $E(Y|X)$ is also random (in the general case).
Some properties of conditional expected values are listed below.
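As a toy numerical illustration of the difference between $E(Y)$ and $E(Y|X = x)$ (my own example, not from the note), we can approximate conditional expectations by averaging $Y$ within groups defined by $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(1, 4, size=10_000)            # X takes the values 1, 2, 3 with equal probability
Y = 2 * X + rng.normal(size=X.size)            # so E(Y | X = x) = 2x, while E(Y) = 2 E(X) = 4

print(Y.mean())                                # close to E(Y) = 4
for x in (1, 2, 3):
    print(x, Y[X == x].mean())                 # close to E(Y | X = x) = 2, 4, 6
```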