
Econometrics [EM2008]

Lecture 2
The k-variable linear regression model

Irene Mammi

[email protected]

Academic Year 2018/2019

outline

- the k-variable linear regression model
  - matrix formulation
  - partial correlation coefficients
  - inference
  - prediction

- References:
  - Johnston, J. and J. DiNardo (1997), Econometric Methods, 4th Edition, McGraw-Hill, New York, Chapter 3.
the multivariate model

- the bivariate framework is too restrictive for realistic analysis of economic phenomena
- it is generally more useful to specify multivariate relations
- restrict the analysis to a single equation which now includes k variables
- the specification of such a relationship is

  \[
  Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \cdots + \beta_k X_{kt} + u_t, \qquad t = 1, \ldots, n
  \]

  which identifies k − 1 explanatory variables, namely X2, X3, . . . , Xk, that are thought to influence the dependent variable
- nb: the X's may be transformations of other variables, but the relationship is linear in the β coefficients
- assume that the disturbances are white noise
- k + 1 parameters to estimate: the β's and the disturbance variance σ²
matrix formulation of the k-variable model
- matrices are indicated by uppercase bold letters, vectors by lowercase bold letters
- vectors are generally taken as column vectors
- for example,

  \[
  y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad
  x_2 = \begin{pmatrix} X_{21} \\ X_{22} \\ \vdots \\ X_{2n} \end{pmatrix}
  \]

  are n × 1 vectors, also referred to as n-vectors, containing the sample observations on Y and X2
- the n sample observations on the k-variable model can be written as

  \[
  y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u
  \]
matrix formulation of the k-variable model (cont.)

- the y vector is expressed as a linear combination of the x vectors plus the disturbance vector u
- the x_1 vector is a column of ones to allow for the intercept term
- collecting all the x vectors into a matrix X and the β coefficients into a vector β, we can write

  y = Xβ + u

  where

  \[
  X = \begin{pmatrix}
  1 & X_{21} & \cdots & X_{k1} \\
  1 & X_{22} & \cdots & X_{k2} \\
  \vdots & \vdots & \ddots & \vdots \\
  1 & X_{2n} & \cdots & X_{kn}
  \end{pmatrix}
  \qquad \text{and} \qquad
  \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
  \]

  (a small numerical sketch of assembling y and X follows below)
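To make the notation concrete, here is a minimal numpy sketch; the data values and variable names are made up for illustration and are not from the slides. It assembles the n-vector y and the design matrix X with a leading column of ones.

```python
import numpy as np

# hypothetical sample: n = 5 observations on Y and on two regressors X2, X3 (so k = 3)
Y  = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
X2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X3 = np.array([2.0, 1.5, 3.5, 3.0, 4.5])

y = Y                                      # n-vector of observations on Y
X = np.column_stack([np.ones(len(Y)),      # x_1: column of ones for the intercept
                     X2, X3])              # x_2, x_3: regressor columns

print(X.shape)                             # (5, 3): an n x k design matrix
```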
the algebra of least squares

- if the unknown vector β is replaced by some guess or estimate b, this defines a vector of residuals e,

  e = y − Xb

- the least squares principle is to choose b to minimize the residual sum of squares e'e, namely,

  \[
  \begin{aligned}
  RSS &= e'e \\
      &= (y - Xb)'(y - Xb) \\
      &= y'y - b'X'y - y'Xb + b'X'Xb \\
      &= y'y - 2b'X'y + b'X'Xb
  \end{aligned}
  \]
the algebra of least squares (cont.)
- the first-order conditions for the minimization are

  \[
  \frac{\partial RSS}{\partial b} = -2X'y + 2X'Xb = 0
  \]

  giving the normal equations

  \[
  (X'X)b = X'y
  \]

- if y is replaced by Xb + e the result is

  \[
  (X'X)b = X'(Xb + e) = (X'X)b + X'e
  \]

  thus X'e = 0, which is another fundamental least-squares result
- the first element in this equation gives ∑ e_t = 0, that is,

  \[
  \bar{e} = \bar{Y} - b_1 - b_2\bar{X}_2 - \cdots - b_k\bar{X}_k = 0
  \]

  (a numerical sketch of solving the normal equations follows below)
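A minimal numpy sketch of the least-squares algebra on synthetic data (the data-generating values are arbitrary assumptions): it solves the normal equations (X'X)b = X'y and checks that X'e = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # intercept column + 2 regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations (X'X)b = X'y
e = y - X @ b                           # residual vector

print(b)                                # least-squares coefficients b1, ..., bk
print(np.allclose(X.T @ e, 0))          # True: X'e = 0, so the residuals sum to zero
```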
the algebra of least squares (cont.)

- ⇒ the residuals have zero mean, and the regression plane passes through the point of means in k-dimensional space
- the remaining elements are of the form

  \[
  \sum_t X_{it} e_t = 0, \qquad i = 2, \ldots, k
  \]

  which implies that each regressor has zero sample correlation with the residuals
- this, in turn, implies that ŷ (= Xb), the vector of the regression values for Y, is uncorrelated with e, for

  \[
  \hat{y}'e = (Xb)'e = b'X'e = 0
  \]
the algebra of least squares (cont.)

normal equations for the two-variable case

- here, k = 2 and the model of interest is Y = β1 + β2 X + u
- the X matrix is

  \[
  X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}
  \]

  thus,

  \[
  X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix}
        \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}
      = \begin{pmatrix} n & \sum X \\ \sum X & \sum X^2 \end{pmatrix}
  \]
the algebra of least squares (cont.)

and

\[
X'y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix}
      \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}
    = \begin{pmatrix} \sum Y \\ \sum XY \end{pmatrix}
\]

giving

\[
\begin{pmatrix} n & \sum X \\ \sum X & \sum X^2 \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
= \begin{pmatrix} \sum Y \\ \sum XY \end{pmatrix}
\]

or

\[
\begin{aligned}
n b_1 + b_2 \sum X &= \sum Y \\
b_1 \sum X + b_2 \sum X^2 &= \sum XY
\end{aligned}
\]
the algebra of least squares (cont.)

normal equations for the three-variable case

- in a similar way, it may be shown that the normal equations for fitting a three-variable equation by least squares are

  \[
  \begin{aligned}
  n b_1 + b_2 \sum X_2 + b_3 \sum X_3 &= \sum Y \\
  b_1 \sum X_2 + b_2 \sum X_2^2 + b_3 \sum X_2 X_3 &= \sum X_2 Y \\
  b_1 \sum X_3 + b_2 \sum X_2 X_3 + b_3 \sum X_3^2 &= \sum X_3 Y
  \end{aligned}
  \]
decomposition of the sum of squares

- the zero covariances between the regressors and the residuals underlie the decomposition of the sum of squares
- decomposing the y vector into the part explained by the regression and the unexplained part, we have

  y = ŷ + e = Xb + e

  from which it follows that

  \[
  y'y = (\hat{y} + e)'(\hat{y} + e) = \hat{y}'\hat{y} + e'e = b'X'Xb + e'e
  \]

- however, y'y = ∑ Y_t² is the sum of squares of the actual Y values; interest normally centers on analyzing the variation in Y, measured by the sum of the squared deviations from the sample mean, namely,

  \[
  \sum_t (Y_t - \bar{Y})^2 = \sum_t Y_t^2 - n\bar{Y}^2
  \]
decomposition of the sum of squares (cont.)

- subtracting nȲ² from each side of the previous decomposition gives

  \[
  (y'y - n\bar{Y}^2) = (b'X'Xb - n\bar{Y}^2) + e'e
  \]
  \[
  TSS = ESS + RSS
  \]

  where TSS denotes the total sum of squares in Y, and ESS and RSS the explained and residual (unexplained) sums of squares (a numerical check follows below)
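A short numpy check of the decomposition on synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

TSS = y @ y - n * y.mean() ** 2                  # y'y - n*Ybar^2
ESS = b @ (X.T @ X) @ b - n * y.mean() ** 2      # b'X'Xb - n*Ybar^2
RSS = e @ e                                      # e'e
print(np.isclose(TSS, ESS + RSS))                # True: TSS = ESS + RSS
```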
equation in deviation form
- alternatively, express all the data in the form of deviations from the sample mean
- the least-squares equation is

  \[
  Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + \cdots + b_k X_{kt} + e_t, \qquad t = 1, \ldots, n
  \]

- averaging over the sample observations gives

  \[
  \bar{Y} = b_1 + b_2 \bar{X}_2 + b_3 \bar{X}_3 + \cdots + b_k \bar{X}_k
  \]

  which contains no term in e, since ē is zero
- subtracting the second equation from the first gives

  \[
  y_t = b_2 x_{2t} + b_3 x_{3t} + \cdots + b_k x_{kt} + e_t, \qquad t = 1, \ldots, n
  \]

  where lowercase y_t and x_it denote deviations from the sample means
- the intercept b1 disappears, but it may be recovered from

  \[
  b_1 = \bar{Y} - b_2 \bar{X}_2 - \cdots - b_k \bar{X}_k
  \]
equation in deviation form (cont.)
- nb: the least-squares slope coefficients b2, . . . , bk are identical in both forms of the regression equation, and so are the residuals
- collecting all n observations, the deviation form of the equation may be written compactly using a transformation matrix

  \[
  A = I_n - \frac{1}{n} i i'
  \]

  where i is a column vector of n ones
- it follows that Ae = e and Ai = 0
- write the least-squares equation as

  \[
  y = Xb + e = \begin{pmatrix} i & X_2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + e
  \]

  where X_2 is the n × (k − 1) matrix of observations on the regressors and b_2 is the (k − 1)-element vector containing the coefficients b2, b3, . . . , bk
equation in deviation form (cont.)
- premultiplying by A gives

  \[
  Ay = \begin{pmatrix} 0 & AX_2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + Ae = (AX_2) b_2 + e
  \]

  or

  \[
  y_* = X_* b_2 + e
  \]

  where y_* = Ay and X_* = AX_2 give the data in deviation form. Since X'e = 0, it follows that X_*'e = 0
- premultiplying the previous equation by X_*' gives

  \[
  X_*' y_* = (X_*' X_*) b_2
  \]

  which are the familiar normal equations, except that now the data have all been expressed in deviation form and the b_2 vector contains the k − 1 slope coefficients and excludes the intercept term (a numerical check follows below)
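A numpy sketch of the deviation-form result on synthetic data (illustrative only): demeaning with A = I − (1/n)ii' and regressing Ay on AX_2 reproduces the slope coefficients of the full regression, and the intercept is recovered from the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

i = np.ones((n, 1))
A = np.eye(n) - (i @ i.T) / n             # demeaning (deviation-form) matrix

X2 = X[:, 1:]                             # regressors other than the column of ones
y_star, X_star = A @ y, A @ X2            # data in deviation form

b_full  = np.linalg.solve(X.T @ X, X.T @ y)                      # full regression (with intercept)
b_slope = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)  # deviation-form regression

print(np.allclose(b_full[1:], b_slope))   # True: identical slope coefficients
print(b_full[0], y.mean() - X2.mean(axis=0) @ b_slope)           # intercept recovered from the means
```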
equation in deviation form (cont.)
- the decomposition of the sum of squares may be expressed as

  \[
  y_*' y_* = b_2' X_*' X_* b_2 + e'e
  \]
  \[
  TSS = ESS + RSS
  \]

- the coefficient of multiple correlation R is defined as the positive square root of

  \[
  R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}
  \]

- the adjusted R² is defined as

  \[
  \bar{R}^2 = 1 - \frac{RSS/(n-k)}{TSS/(n-1)}
  \]

- the numerator and the denominator of the ratio on the RHS are unbiased estimators of the disturbance variance and of the variance of Y, respectively
equation in deviation form (cont.)
- the relation between the adjusted and unadjusted coefficients is

  \[
  \bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2) = \frac{1-k}{n-k} + \frac{n-1}{n-k} R^2
  \]

- two alternative criteria for comparing the fit of specifications are the Schwarz criterion

  \[
  SC = \ln\frac{e'e}{n} + \frac{k}{n}\ln n
  \]

  and the Akaike information criterion

  \[
  AIC = \ln\frac{e'e}{n} + \frac{2k}{n}
  \]

  (a computational sketch of these fit measures follows below)
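A numpy sketch computing R², the adjusted R², and the two criteria exactly as defined above, on synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

RSS = e @ e
TSS = np.sum((y - y.mean()) ** 2)

R2     = 1 - RSS / TSS
R2_adj = 1 - (RSS / (n - k)) / (TSS / (n - 1))
SC     = np.log(RSS / n) + k * np.log(n) / n     # Schwarz criterion, in the slide's form
AIC    = np.log(RSS / n) + 2 * k / n             # Akaike criterion, in the slide's form
print(R2, R2_adj, SC, AIC)
```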
generalizing partial correlation
- the normal equations solve for b = (X'X)^{-1} X'y
- the residuals from the LS regression may be expressed as

  \[
  e = y - Xb = y - X(X'X)^{-1}X'y = My
  \]

  where

  \[
  M = I - X(X'X)^{-1}X'
  \]

- M is a symmetric, idempotent matrix; it also has the properties MX = 0 and Me = e
- now write the general regression in partitioned form as

  \[
  y = \begin{pmatrix} x_2 & X_* \end{pmatrix} \begin{pmatrix} b_2 \\ b_{(2)} \end{pmatrix} + e
  \]

- in this partitioning x_2 is the n × 1 vector of observations on X2, with coefficient b2, and X_* is the n × (k − 1) matrix of all the other variables (including the column of ones) with coefficient vector b_(2)
generalizing partial correlation (cont.)
- the normal equations for this setup are

  \[
  \begin{pmatrix} x_2'x_2 & x_2'X_* \\ X_*'x_2 & X_*'X_* \end{pmatrix}
  \begin{pmatrix} b_2 \\ b_{(2)} \end{pmatrix}
  = \begin{pmatrix} x_2'y \\ X_*'y \end{pmatrix}
  \]

- the solution for b2 is

  \[
  b_2 = (x_2' M_* x_2)^{-1} (x_2' M_* y)
  \]

  where

  \[
  M_* = I - X_*(X_*'X_*)^{-1}X_*'
  \]

  M_* is a symmetric, idempotent matrix with the properties M_* X_* = 0 and M_* e = e
- we have that
  - M_* y is the vector of residuals when y is regressed on X_*
  - M_* x_2 is the vector of residuals when x_2 is regressed on X_*
generalizing partial correlation (cont.)
- regressing the first vector on the second gives a slope coefficient which, using the symmetry and idempotency of M_*, equals the b2 coefficient defined above
- a simpler way to prove the same result is as follows: write the partitioned regression as

  \[
  y = x_2 b_2 + X_* b_{(2)} + e
  \]

- premultiplying by M_*, obtain

  \[
  M_* y = (M_* x_2) b_2 + e
  \]

- finally, premultiply by x_2', which gives

  \[
  x_2' M_* y = (x_2' M_* x_2) b_2
  \]

  (a numerical check of this partialling-out result follows below)
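A numpy sketch of this partialling-out result on synthetic data (illustrative only): the coefficient from regressing M_*y on M_*x_2 equals the b2 obtained from the full regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

x2     = X[:, 1]                          # the regressor of interest
X_rest = np.delete(X, 1, axis=1)          # X_*: all other columns, including the ones

# M_* = I - X_*(X_*'X_*)^{-1}X_*'
M_rest = np.eye(n) - X_rest @ np.linalg.solve(X_rest.T @ X_rest, X_rest.T)

b2_partial = (x2 @ M_rest @ y) / (x2 @ M_rest @ x2)   # (x2'M_*x2)^{-1} x2'M_*y
b_full     = np.linalg.solve(X.T @ X, X.T @ y)
print(np.isclose(b2_partial, b_full[1]))              # True: same coefficient
```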
inference in the k-variables equation
assumptions

1. X is nonstochastic and has full rank k
2. the errors have the properties

   E(u) = 0

   and

   var(u) = E(uu') = σ²I

- since the expected value operator is applied to every element of a vector or matrix, we have

  \[
  E(u) = E\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
       = \begin{pmatrix} E(u_1) \\ E(u_2) \\ \vdots \\ E(u_n) \end{pmatrix}
       = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = 0
  \]
inference in the k-variables equation (cont.)
\[
E(uu') = E\left[\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
\begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}\right]
= \begin{pmatrix}
E(u_1^2) & E(u_1 u_2) & \cdots & E(u_1 u_n) \\
E(u_2 u_1) & E(u_2^2) & \cdots & E(u_2 u_n) \\
\vdots & \vdots & \ddots & \vdots \\
E(u_n u_1) & E(u_n u_2) & \cdots & E(u_n^2)
\end{pmatrix}
\]
\[
= \begin{pmatrix}
\mathrm{var}(u_1) & \mathrm{cov}(u_1, u_2) & \cdots & \mathrm{cov}(u_1, u_n) \\
\mathrm{cov}(u_2, u_1) & \mathrm{var}(u_2) & \cdots & \mathrm{cov}(u_2, u_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(u_n, u_1) & \mathrm{cov}(u_n, u_2) & \cdots & \mathrm{var}(u_n)
\end{pmatrix}
= \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
= \sigma^2 I
\]
inference in the k-variables equation (cont.)

- the previous matrix is the variance-covariance matrix of the error term
- this matrix embodies two strong assumptions: homoskedasticity and no serial correlation
inference in the k-variables equation (cont.)
Mean and Variance of b

- write the normal equations as

  \[
  b = (X'X)^{-1}X'y
  \]

- substitute for y to get

  \[
  b = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u
  \]

  from which

  \[
  b - \beta = (X'X)^{-1}X'u
  \]

- take expectations (moving the expectation operator to the right past nonstochastic terms such as X):

  \[
  E(b - \beta) = (X'X)^{-1}X'E(u) = 0
  \]

  giving E(b) = β
inference in the k-variables equation (cont.)
- under the assumptions of the model, the LS estimators are unbiased estimators of the β parameters
- to obtain the variance-covariance matrix of the LS estimators, consider

  \[
  \mathrm{var}(b) = E[(b - \beta)(b - \beta)']
  \]

  and, substituting for b − β, get

  \[
  \begin{aligned}
  E[(b - \beta)(b - \beta)'] &= E[(X'X)^{-1}X'uu'X(X'X)^{-1}] \\
  &= (X'X)^{-1}X'E[uu']X(X'X)^{-1} \\
  &= \sigma^2 (X'X)^{-1}
  \end{aligned}
  \]

  thus

  \[
  \mathrm{var}(b) = \sigma^2 (X'X)^{-1}
  \]
inference in the k-variables equation (cont.)
Estimation of σ²

- the variance-covariance matrix of the LS estimators involves the error variance σ², which is unknown
- it is reasonable to base an estimate on the residual sum of squares from the fitted regression
- write e = My = M(Xβ + u) = Mu, since MX = 0, so that

  \[
  E(e'e) = E(u'M'Mu) = E(u'Mu)
  \]

- exploiting the fact that the trace of a scalar is the scalar itself, write

  \[
  \begin{aligned}
  E(u'Mu) &= E[\mathrm{tr}(u'Mu)] \\
  &= E[\mathrm{tr}(uu'M)] \\
  &= \sigma^2 \mathrm{tr}(M) \\
  &= \sigma^2 \mathrm{tr}(I) - \sigma^2 \mathrm{tr}[X(X'X)^{-1}X'] \\
  &= \sigma^2 \mathrm{tr}(I) - \sigma^2 \mathrm{tr}[(X'X)^{-1}(X'X)] \\
  &= \sigma^2 (n - k)
  \end{aligned}
  \]
inference in the k-variables equation (cont.)

- thus

  \[
  s^2 = \frac{e'e}{n-k}
  \]

  defines an unbiased estimator of σ²
- the square root s measures the standard deviation of the Y values about the regression plane; it is referred to as the standard error of the estimate or the standard error of the regression (SER) (see the sketch below)
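A numpy sketch, on synthetic data (illustrative only), of s² = e'e/(n − k), the SER, and the implied coefficient standard errors, the square roots of the diagonal of s²(X'X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

s2   = e @ e / (n - k)                   # unbiased estimator of sigma^2
ser  = np.sqrt(s2)                       # standard error of the regression (SER)
se_b = np.sqrt(s2 * np.diag(XtX_inv))    # standard errors of b1, ..., bk
print(ser, se_b)
```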
inference in the k-variables equation (cont.)
Gauss-Markov theorem

- this is the fundamental LS theorem
- it states that, conditional on the assumptions made, no other linear, unbiased estimator of the β coefficients can have smaller sampling variances than the least-squares estimator
- in particular:
  1. each LS estimator b_i is a best linear unbiased estimator (BLUE) of the population parameter β_i
  2. the BLUE of any linear combination of the β's is that same linear combination of the b's
  3. the BLUE of E(Y_s) is

     \[
     \hat{Y}_s = b_1 + b_2 X_{2s} + b_3 X_{3s} + \cdots + b_k X_{ks}
     \]

     which is the value found by inserting the relevant vector of X values into the regression equation
testing linear hypotheses about β

- we have established the properties of the LS estimators of β
- now we show how to test hypotheses about β
- consider, for example,
  (i)   H0: β_i = 0
  (ii)  H0: β_i = β_i0
  (iii) H0: β_2 + β_3 = 1
  (iv)  H0: β_3 = β_4, or β_3 − β_4 = 0
  (v)   H0: β_2 = β_3 = · · · = β_k = 0, i.e.

        \[
        \begin{pmatrix} \beta_2 \\ \beta_3 \\ \vdots \\ \beta_k \end{pmatrix}
        = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
        \]

  (vi)  H0: β_2 = 0, where β_2 is a subvector of coefficients
testing linear hypotheses about β (cont.)
- all these examples fit into the general linear framework

  \[
  R\beta = r
  \]

  where R is a q × k matrix of known constants, with q < k, and r is a q-vector of known constants. Each null hypothesis determines the relevant elements in R and r
- for the previous examples we have
  (i)   R = [0 · · · 0 1 0 · · · 0], with 1 in the ith position;  r = 0;  q = 1
  (ii)  R = [0 · · · 0 1 0 · · · 0], with 1 in the ith position;  r = β_i0;  q = 1
  (iii) R = [0 1 1 0 · · · 0];  r = 1;  q = 1
  (iv)  R = [0 0 1 −1 0 · · · 0];  r = 0;  q = 1
  (v)   R = [0  I_{k−1}], where 0 is a vector of k − 1 zeros;  r = 0;  q = k − 1
  (vi)  R = [0_{k2×k1}  I_{k2}];  r = 0;  q = k2
testing linear hypotheses about β (cont.)
- we now derive a general testing procedure for the general linear hypothesis

  H0: Rβ − r = 0

- given the LS estimator, we can compute the vector (Rb − r), which measures the discrepancy between expectation and observation
- if this vector is "large", it casts doubt on the null hypothesis
- the distinction between "large" and "small" is determined from the sampling distribution under the null, in this case the distribution of Rb when Rβ = r
- from the unbiasedness result, it follows that E(Rb) = Rβ
- therefore

  \[
  \begin{aligned}
  \mathrm{var}(Rb) &= E[R(b - \beta)(b - \beta)'R'] \\
  &= R\,\mathrm{var}(b)\,R' \\
  &= \sigma^2 R(X'X)^{-1}R'
  \end{aligned}
  \]
testing linear hypotheses about β (cont.)
- we know the mean and the variance of the vector Rb
- we need a further assumption to determine the form of the sampling distribution: since b is a function of the u vector, the sampling distribution of Rb is determined by the distribution of u
- assume that the u_i are normally distributed, so that

  \[
  u \sim N(0, \sigma^2 I)
  \]

- it follows that

  \[
  b \sim N[\beta, \sigma^2(X'X)^{-1}]
  \]

  then

  \[
  Rb \sim N[R\beta, \sigma^2 R(X'X)^{-1}R']
  \]

  and so

  \[
  R(b - \beta) \sim N[0, \sigma^2 R(X'X)^{-1}R']
  \]

- if the null hypothesis Rβ = r is true, then

  \[
  (Rb - r) \sim N[0, \sigma^2 R(X'X)^{-1}R']
  \]
testing linear hypotheses about β (cont.)
- this equation gives the sampling distribution of Rb, and from it we may derive a χ² variable, namely

  \[
  (Rb - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2(q)
  \]

- σ² is unknown, but it can be shown that

  \[
  \frac{e'e}{\sigma^2} \sim \chi^2(n - k)
  \]

  and that this statistic is distributed independently of b
- a computable test statistic, which has an F distribution under the null, is

  \[
  \frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/q}{e'e/(n-k)} \sim F(q, n - k)
  \]

- the test procedure is to reject Rβ = r if the computed F value exceeds the relevant critical value (a computational sketch follows below)
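A numpy/scipy sketch of the general F statistic on synthetic data (illustrative only); the restriction tested, β2 + β3 = 1, is example (iii), so R = [0 1 1] and r = 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(scale=0.5, size=n)   # true beta2 + beta3 = 1

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)

R = np.array([[0.0, 1.0, 1.0]])           # q x k restriction matrix
r = np.array([1.0])                       # q-vector
q = R.shape[0]

d = R @ b - r                             # discrepancy Rb - r
F = (d @ np.linalg.solve(R @ XtX_inv @ R.T, d)) / q / s2
p_value = stats.f.sf(F, q, n - k)         # upper-tail probability of F(q, n - k)
print(F, p_value)
```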
testing linear hypotheses about β (cont.)

- it is helpful to rewrite the statistic as

  \[
  (Rb - r)'[s^2 R(X'X)^{-1}R']^{-1}(Rb - r)/q \sim F(q, n - k)
  \]

  thus, s²(X'X)^{-1} is the estimated variance-covariance matrix of b
- if we let c_ij denote the (i, j)th element of (X'X)^{-1}, then

  \[
  s^2 c_{ii} = \mathrm{var}(b_i) \qquad \text{and} \qquad s^2 c_{ij} = \mathrm{cov}(b_i, b_j), \qquad i, j = 1, 2, \ldots, k
  \]
testing linear hypotheses about β (cont.)
- going back to the previous examples . . .
  (i) H0: β_i = 0: Rb picks out b_i and R(X'X)^{-1}R' picks out c_ii, the ith diagonal element of (X'X)^{-1}. Thus we have

      \[
      F = \frac{b_i^2}{s^2 c_{ii}} = \frac{b_i^2}{\mathrm{var}(b_i)} \sim F(1, n - k)
      \]

      or, taking the square root,

      \[
      t = \frac{b_i}{s\sqrt{c_{ii}}} = \frac{b_i}{\mathrm{s.e.}(b_i)} \sim t(n - k)
      \]

  (ii) H0: β_i = β_i0: this hypothesis is tested by

      \[
      t = \frac{b_i - \beta_{i0}}{\mathrm{s.e.}(b_i)} \sim t(n - k)
      \]

      One may also compute a 95% confidence interval for β_i:

      \[
      b_i \pm t_{0.025}\, \mathrm{s.e.}(b_i)
      \]

      (a computational sketch follows below)
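A numpy/scipy sketch of cases (i)–(ii) on synthetic data (illustrative only): the t statistic for a single coefficient and its 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)
se_b = np.sqrt(s2 * np.diag(XtX_inv))

i, beta_i0 = 1, 0.0                        # test H0: beta_2 = 0 (index 1 in 0-based numpy)
t = (b[i] - beta_i0) / se_b[i]
p_value = 2 * stats.t.sf(abs(t), n - k)

t_crit = stats.t.ppf(0.975, n - k)         # t_{0.025}(n - k)
ci = (b[i] - t_crit * se_b[i], b[i] + t_crit * se_b[i])
print(t, p_value, ci)
```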
testing linear hypotheses about β (cont.)
  (iii) H0: β_2 + β_3 = 1: Rb gives the sum of the two estimated coefficients, b_2 + b_3. Premultiplying (X'X)^{-1} by R gives a row vector whose elements are the sums of the corresponding elements in the second and third rows of (X'X)^{-1}. Forming the inner product with R' gives the sum of the second and third elements of that row vector, that is, c_22 + 2c_23 + c_33, noting that c_23 = c_32. Thus

      \[
      \begin{aligned}
      s^2 R(X'X)^{-1}R' &= s^2(c_{22} + 2c_{23} + c_{33}) \\
      &= \mathrm{var}(b_2) + 2\,\mathrm{cov}(b_2, b_3) + \mathrm{var}(b_3) \\
      &= \mathrm{var}(b_2 + b_3)
      \end{aligned}
      \]

      The test statistic is then

      \[
      t = \frac{b_2 + b_3 - 1}{\sqrt{\mathrm{var}(b_2 + b_3)}} \sim t(n - k)
      \]

      Alternatively, one may compute, say, a 95% confidence interval for the sum (β_2 + β_3) as

      \[
      (b_2 + b_3) \pm t_{0.025}\sqrt{\mathrm{var}(b_2 + b_3)}
      \]
testing linear hypotheses about β (cont.)

  (iv) H0: β_3 = β_4: the test statistic here is

      \[
      t = \frac{b_3 - b_4}{\sqrt{\mathrm{var}(b_3 - b_4)}} \sim t(n - k)
      \]

  (v) H0: β_2 = β_3 = · · · = β_k = 0: this case involves a composite hypothesis about all k − 1 regressor coefficients. The F statistic for testing the joint significance of the complete set of regressors is

      \[
      F = \frac{ESS/(k-1)}{RSS/(n-k)} \sim F(k - 1, n - k)
      \]

      This statistic may also be expressed as

      \[
      F = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)} \sim F(k - 1, n - k)
      \]
testing linear hypotheses about β (cont.)

  (vi) H0: β_2 = 0: this hypothesis postulates that a subset of coefficients is a zero vector. Partition the regression equation as follows:

      \[
      y = \begin{pmatrix} X_1 & X_2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + e = X_1 b_1 + X_2 b_2 + e
      \]

      where X_1 has k_1 columns, including a column of ones, X_2 has k_2 = k − k_1 columns, and b_1 and b_2 are the corresponding subvectors of regression coefficients. The hypothesis may be tested by running two separate regressions. First regress y on X_1 and denote the RSS by e_*'e_*. Then run the regression on all the X's, obtaining the RSS denoted by e'e. The test statistic is

      \[
      F = \frac{(e_*'e_* - e'e)/k_2}{e'e/(n-k)} \sim F(k_2, n - k)
      \]

      (a computational sketch of this two-regression test follows below)
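A numpy/scipy sketch of case (vi) on synthetic data (illustrative only): compare the restricted RSS (y regressed on X_1 alone) with the unrestricted RSS (y regressed on all the X's).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k1, k2 = 60, 2, 2                       # k = k1 + k2 regressors in total
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, k1 - 1))])   # includes the column of ones
X2 = rng.normal(size=(n, k2))                                      # the subset under test
X  = np.column_stack([X1, X2])
y  = X1 @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)     # true beta_2 block is zero

def rss(y, X):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rss_r, rss_u = rss(y, X1), rss(y, X)       # restricted and unrestricted RSS
k = k1 + k2
F = ((rss_r - rss_u) / k2) / (rss_u / (n - k))
print(F, stats.f.sf(F, k2, n - k))         # should not reject H0 here: beta_2 is truly zero
```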
restricted and unrestricted regressions

- examples (v) and (vi) may be interpreted as the outcome of two separate regressions
- recall that ESS may be expressed as ESS = y_*'y_* − e'e, where y_* = Ay
- it may be shown that y_*'y_* is the RSS when y_* is regressed on x_1 (= i)
- in both cases (v) and (vi) the first regression may be regarded as a restricted regression and the second as an unrestricted regression
- e_*'e_* is the restricted RSS and e'e is the unrestricted RSS
fitting the restricted regressions
- question: how to fit the restricted regression?
- answer: (1) either work out each specific case from first principles, or (2) derive a general formula into which specific cases can be fitted
- (1) as for the first approach, consider example (iii) with the regression in deviation form,

  \[
  y = b_2 x_2 + b_3 x_3 + e
  \]

- we want to impose the restriction b_2 + b_3 = 1. Substituting the restriction into the regression gives

  \[
  y = b_2 x_2 + (1 - b_2)x_3 + e_* \qquad \text{or} \qquad (y - x_3) = b_2(x_2 - x_3) + e_*
  \]

  so form two new variables, (y − x_3) and (x_2 − x_3): the simple regression of the first on the second (without a constant) gives the restricted estimate of b_2, and the RSS from this regression is the restricted RSS, e_*'e_*
fitting the restricted regressions (cont.)
- (2) the general approach requires a b_* vector that minimizes the RSS subject to the restrictions Rb_* = r. To do so, set up the function

  \[
  \phi = (y - Xb_*)'(y - Xb_*) - 2\lambda'(Rb_* - r)
  \]

  where λ is a q-vector of Lagrange multipliers
- the first-order conditions are

  \[
  \frac{\partial \phi}{\partial b_*} = -2X'y + 2(X'X)b_* - 2R'\lambda = 0
  \]
  \[
  \frac{\partial \phi}{\partial \lambda} = -2(Rb_* - r) = 0
  \]

- the solution for b_* is

  \[
  b_* = b + (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(r - Rb)
  \]

  where b is the unrestricted LS estimator (X'X)^{-1}X'y
fitting the restricted regressions (cont.)
- the residuals from the restricted regression are

  \[
  e_* = y - Xb_* = y - Xb - X(b_* - b) = e - X(b_* - b)
  \]

- transposing and multiplying, we obtain

  \[
  e_*'e_* = e'e + (b_* - b)'X'X(b_* - b)
  \]

- substituting for (b_* − b) and simplifying gives

  \[
  e_*'e_* - e'e = (r - Rb)'[R(X'X)^{-1}R']^{-1}(r - Rb)
  \]

  where, apart from q, the expression on the RHS is the same as the numerator in the F statistic
- thus an alternative expression of the test statistic for H0: Rβ = r is

  \[
  F = \frac{(e_*'e_* - e'e)/q}{e'e/(n-k)} \sim F(q, n - k)
  \]

  (a computational sketch follows below)
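A numpy sketch of the general restricted estimator b_* on synthetic data (illustrative only), checking that Rb_* = r holds exactly and computing the F statistic from the increase in RSS.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.6, 0.4]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

R = np.array([[0.0, 1.0, 1.0]])            # hypothetical restriction: beta2 + beta3 = 1
r = np.array([1.0])
q = R.shape[0]

middle = np.linalg.inv(R @ XtX_inv @ R.T)
b_star = b + XtX_inv @ R.T @ middle @ (r - R @ b)    # restricted LS estimator
e_star = y - X @ b_star

print(R @ b_star)                          # equals r: the restriction holds exactly
F = ((e_star @ e_star - e @ e) / q) / (e @ e / (n - k))
print(F)                                   # same value as the Wald form given earlier
```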
prediction

- suppose that we have fitted a regression model, and we now consider some specific vector of regressor values,

  \[
  c' = \begin{pmatrix} 1 & X_{2f} & \cdots & X_{kf} \end{pmatrix}
  \]

- we wish to predict the value of Y conditional on c
- a point prediction is obtained by inserting the given X values into the regression equation, giving

  \[
  \hat{Y}_f = b_1 + b_2 X_{2f} + \cdots + b_k X_{kf} = c'b
  \]

- the Gauss-Markov theorem shows that c'b is the BLUE of c'β; here c'β = E(Y_f), so that Ŷ_f is an optimal predictor of E(Y_f)
- as var(Rb) = R var(b) R', replacing R by c' gives

  \[
  \mathrm{var}(c'b) = c'\,\mathrm{var}(b)\,c
  \]
prediction (cont.)
- if we assume normality for the error term, it follows that

  \[
  \frac{c'b - c'\beta}{\sqrt{\mathrm{var}(c'b)}} \sim N(0, 1)
  \]

- when the unknown σ² in var(b) is replaced by s², we have

  \[
  \frac{\hat{Y}_f - E(Y_f)}{s\sqrt{c'(X'X)^{-1}c}} \sim t(n - k)
  \]

  from which a 95% confidence interval for E(Y_f) is

  \[
  \hat{Y}_f \pm t_{0.025}\, s\sqrt{c'(X'X)^{-1}c}
  \]

- to obtain a confidence interval for Y_f rather than E(Y_f), note that the two differ only by the error u_f that appears in the prediction period
- the point prediction is the same as before, but the uncertainty of the prediction increases
prediction (cont.)

- we have Ŷ_f = c'b as before, and now Y_f = c'β + u_f, so that the prediction error is

  \[
  e_f = Y_f - \hat{Y}_f = u_f - c'(b - \beta)
  \]

- squaring both sides and taking expectations gives the variance of the prediction error

  \[
  \mathrm{var}(e_f) = \sigma^2 + c'\,\mathrm{var}(b)\,c = \sigma^2\left(1 + c'(X'X)^{-1}c\right)
  \]

  from which we derive a t statistic

  \[
  \frac{\hat{Y}_f - Y_f}{s\sqrt{1 + c'(X'X)^{-1}c}} \sim t(n - k)
  \]

  (a computational sketch of both prediction intervals follows below)
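A numpy/scipy sketch of the prediction formulas on synthetic data (illustrative only; the regressor vector c is an arbitrary choice): the point prediction c'b, the 95% interval for E(Y_f), and the wider 95% interval for Y_f itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s = np.sqrt(e @ e / (n - k))

c = np.array([1.0, 0.5, -1.0])             # hypothetical regressor values [1, X2f, ..., Xkf]
y_hat = c @ b                              # point prediction c'b
t_crit = stats.t.ppf(0.975, n - k)

h = c @ XtX_inv @ c                        # c'(X'X)^{-1}c
ci_mean = (y_hat - t_crit * s * np.sqrt(h),     y_hat + t_crit * s * np.sqrt(h))      # for E(Yf)
ci_pred = (y_hat - t_crit * s * np.sqrt(1 + h), y_hat + t_crit * s * np.sqrt(1 + h))  # for Yf itself
print(y_hat, ci_mean, ci_pred)
```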
