Econometrics Chap - 2

The document provides an overview of linear regression and ordinary least squares (OLS) estimation. It discusses that OLS finds the linear combination of variables that best predicts the variable of interest by minimizing the sum of squared errors. It describes how the OLS estimator is obtained by solving the normal equations, and that the OLS estimator exists if the variables are not perfectly collinear. It also explains that OLS residuals are orthogonal to the predictor variables.


H:/Lehre/Econometrics Master PO 2021/lecture slides/chap 2.tex (October 7, 2021)

Linear Regression

2. Linear Regression 1
What is the chapter about?

Principle of ordinary least squares (OLS)

The linear regression model

Model assumptions

Small-sample properties of the OLS estimator

Goodness of fit

Hypothesis testing

Multicollinearity

Illustration: Capital asset pricing model

2 Linear Regression 2
Reading

Obligatory reading

Verbeek: Sections 2.1-2.5, 2.8, 2.10.

Additional reading:

Wooldridge: Chapters 2-4, Sections 6.3, 6.4.

Greene: Chapter 2, Sections 3.1-3.3, 3.5, 4.1-4.3, 5.1-5.3.

Hayashi: Sections 1.1-1.4.

2 Linear Regression 3
OLS problem

Consider having N observations on a variable of interest, y, and a set of


other variables x2 , . . . , xK that are believed to be related to y.

The question at hand is what linear combination of a constant and


x2 , . . . , xK of the form

β̃1 + β̃2 x2 + · · · + β̃K xK

provides a good approximation to y.

Assign to each observation an index i, denoting our sample

{yi , xik }, i = 1, . . . , N, k = 2, . . . , K.

Put all x values and β̃ coefficients into K × 1 vectors

xi = (1, xi2 , xi3 , . . . , xiK )′ , β̃ = (β̃1 , . . . , β̃K )′ .

2.1 Ordinary least squares as an algebraic instrument 4


OLS problem

Then we can write the approximation error for the ith observation as

ei = yi − x′i β̃.

We want this error to be on average as small as possible.

One (but not the only) possibility to achieve that is to pick β̃ so as to


minimize
S(β̃) ≡ ∑_{i=1}^{N} (yi − x′i β̃)2 .

The vector β̃ that minimizes this criterion is called the ordinary least
squares (OLS) vector.

2.1 Ordinary least squares as an algebraic instrument 5


OLS vector

To minimize S(β̃) we have to set its first derivative w.r.t. β̃ equal to zero:

∂S(β̃)/∂ β̃ = −2 ∑_{i=1}^{N} xi (yi − x′i β̃) = 0,

(or) (∑_{i=1}^{N} xi x′i ) β̃ = ∑_{i=1}^{N} xi yi .

Those are the first-order conditions (focs) for the OLS-minimization


problem and are known as the normal equations.
They define a system of K linear equations in K unknowns.

Solving these equations and denoting the solution as b gives

b = (∑_{i=1}^{N} xi x′i )−1 ∑_{i=1}^{N} xi yi (OLS vector).

2.1 Ordinary least squares as an algebraic instrument 6
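
A minimal Python/numpy sketch (not part of the slides; the data are simulated and all names and values are illustrative) of computing the OLS vector by solving the normal equations numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 3                                  # hypothetical sample size and number of regressors
    X = np.column_stack([np.ones(N),               # constant term
                         rng.normal(size=(N, K - 1))])
    beta_true = np.array([1.0, 0.5, -2.0])         # assumed 'true' coefficients for the simulation
    y = X @ beta_true + rng.normal(size=N)

    XtX = X.T @ X                                  # corresponds to sum_i x_i x_i'
    Xty = X.T @ y                                  # corresponds to sum_i x_i y_i
    b = np.linalg.solve(XtX, Xty)                  # OLS vector: solves the normal equations
    e = y - X @ b                                  # OLS residuals
    print(b)                                       # close to beta_true
    print(X.T @ e)                                 # normal equations imply X'e = 0 (up to rounding)

Solving the normal equations with np.linalg.solve avoids forming the inverse explicitly, which is numerically preferable to computing (X′X)−1 directly.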


Existence of the OLS vector

The existence of the OLS vector requires that the K × K matrix ∑_{i=1}^{N} xi x′i is
invertible.
This in turn requires that the K x-variables are linearly independent
(no-multicollinearity assumption).

The second-order condition for the OLS-minimization problem requires

that the Hessian matrix of second-order derivatives,

∂ 2 S(β̃)/(∂ β̃ ∂ β̃ ′ ) = 2 ∑_{i=1}^{N} xi x′i ,

is positive definite. In this case b is indeed a minimizer of S(β̃).


This is the case as long as the no-multicollinearity assumption is satisfied.

2.1 Ordinary least squares as an algebraic instrument 7


Residual sum of squares

With b we can now approximate yi in terms of the xi variables as follows:

ŷi = x′i b (best linear approximation).

The difference between the observed and the approximated value, yi − ŷi ,
is the OLS residual ei and we can write

yi = ŷi + ei = x′i b + ei .

As a result, our minimized criterion function can be written as the sum of

squared residuals, which is known as the residual sum of squares:

S(β̃)|β̃=b = S(b) = ∑_{i=1}^{N} (yi − x′i b)2 = ∑_{i=1}^{N} e2i .

2.1 Ordinary least squares as an algebraic instrument 8


Properties of the OLS residuals

The normal equations (for β̃ = b) imply that

∑_{i=1}^{N} xi (yi − x′i b) = ∑_{i=1}^{N} xi ei = 0, i.e. ∑_{i=1}^{N} xik ei = 0, k = 1, . . . , K.

This means that the residuals have the following properties:

The residual vector is orthogonal to each vector of observations on xk ,
i.e.

(x1k , . . . , xN k ) (e1 , . . . , eN )′ = 0, k = 1, . . . , K;

If xi1 = 1 for all i (constant) then

∑_{i=1}^{N} ei = 0.

2.1 Ordinary least squares as an algebraic instrument 9


OLS in matrix notation

The use of matrix notation is convenient because it simplifies an otherwise


cumbersome notation with a lot of sums and indices.

Let

X = (x1 , . . . , xN )′ (N × K), with ith row x′i = (1, xi2 , . . . , xiK ),

and

y = (y1 , . . . , yN )′ (N × 1).

2.1 Ordinary least squares as an algebraic instrument 10


OLS in matrix notation

The OLS objective function can then be written as

S(β̃) = (y − X β̃)′ (y − X β̃) = y ′ y − 2y ′ X β̃ + β̃ ′ X ′ X β̃.

Taking the first derivative w.r.t. β̃ and setting it equal to zero,

∂S(β̃)/∂ β̃ = −2(X ′ y − X ′ X β̃) = 0,
gives the solution

b = (X ′ X)−1 X ′ y (OLS vector).

This allows us to decompose the observed y-vector as

y = Xb + e = ŷ + e.

2.1 Ordinary least squares as an algebraic instrument 11


Geometric interpretation of OLS

The predicted y-vector and the residual vector can be written as

ŷ = Xb = X(X ′ X)−1 X ′ y = P y, with P ≡ X(X ′ X)−1 X ′ ,

e = y − Xb = y − P y = (I − P ) y = M y, with M ≡ I − P.

P and M are projection matrices with the following special properties:

P = P ′ , M = M ′ (symmetric)

P 2 = P , M 2 = M (idempotent)

P M = P ′ M = M ′ P = M P = 0 (orthogonal to each other).

P projects y into the column space of X, and it holds that P X = X;


M (‘residual-maker matrix’) projects y into the space which is orthogonal
to the column space of X so that ŷ ′ e = 0 and X ′ e = 0.

2.1 Ordinary least squares as an algebraic instrument 12
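
A short numpy sketch (simulated data, purely illustrative) that checks the stated properties of the projection matrices P and M numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    N, K = 50, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = rng.normal(size=N)

    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projects onto the column space of X
    M = np.eye(N) - P                      # residual-maker matrix

    yhat, e = P @ y, M @ y
    print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric, idempotent
    print(np.allclose(P @ M, 0))                         # P and M orthogonal to each other
    print(np.allclose(P @ X, X), np.allclose(M @ X, 0))  # P X = X, M X = 0
    print(np.isclose(yhat @ e, 0))                       # yhat'e = 0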


The linear regression model

So far we have been concerned with how to get the best linear
approximation of yi by x′i β̃ for some observed sample values {yi , xi }_{i=1}^{N} .

However, economists are typically interested in a relationship that is more


generally valid, i.e. for the whole population of interest and not just for the
incidentally observed sample {yi , xi }_{i=1}^{N} .

For this we need a statistical model, which

(1) ideally reflects a fundamental economic relationship between the y-


and x-variables and

(2) which is supposed to be valid for all possible y- and x-observations


(not just for the sample incidentally observed).

2.2 The linear regression model 13


The linear regression model

To begin with, we restrict our attention to linear models and specify our
model as

yi = β1 + β2 xi2 + · · · + βK xiK + εi , i = 1, ..., N,

(or) yi = x′i β + εi ,

(or) y = Xβ + ε.

yi is the dependent variable and the variables in xi are the regressors


or explanatory variables or covariates,

β is the vector of regression coefficients,

εi is the error term or disturbance term.


It captures the influences of all variables that are relevant for yi but
not explicitly included in the model.

2.2 The linear regression model 14


The linear regression model

The linear regression model (LRM)

yi = x′i β + εi ,

is supposed to hold for the whole (y, x)-population, of which we observe a


random sample of size N .

Thus yi , εi and possibly even xi are random variables.

The elements in β are unknown population parameters that need to be


estimated based on the realization of the random sample.

2.2 The linear regression model 15


Deterministic versus random regressors

In the LRM,
yi = x′i β + εi ,

xi may be deterministic (fixed in repeated samples).


Then the only source of randomness is εi leading to different yi -values
across repeated samples.
This is the case for experimental data in laboratory settings where the
xi -values are determined by the researcher.

In empirical economic studies, we typically rely on observational data.


In this case, we commonly consider yi as well as xi to be random, so both
yi and xi have different values in repeated samples.
In this context we need certain assumptions about the joint distribution of
(xi , εi ) in order to give the LRM a meaning.

2.2 The linear regression model 16


Exogeneity assumption

In this context, a common assumption is

E{εi |xi } = 0 (cond. mean independence),

which is known as the assumption that the x variables are exogenous.

Then it holds that E{E{εi |xi }} = E{εi } = 0 and cov{xi , εi } = 0.

It follows that the regression line x′i β gives the conditional expectation of y
given x, i.e.
E{yi |xi } = x′i β + E{εi |xi } = x′i β.

The coefficients βk measure by how much E{yi |xi } changes given a one-unit
change in xik , holding all other variables in xi constant (known as the
ceteris paribus (c.p.) condition).

2.2 The linear regression model 17


Exogeneity assumption

Often we would like to interpret the model as describing a causal


relationship between yi and one or more regressors in xi .

A prerequisite for βk to measure the causal effect of xik on yi (and not just
the correlation or the causal effect overlaid by additional correlation) is
that we can reasonably assume that E{εi |xi } = 0,
to the effect that

E{yi |xi } = x′i β, with ∂E{yi |xi }/∂xik = βk (if xik is continuous).

2.2 The linear regression model 18


OLS estimator

After we have specified a LRM and have a sample of N observations, we


would like to make some statements about the values of the unknown β.

A rule by which we compute a certain value from our sample data is called
an estimator, which gives us an estimate.

We already saw the OLS estimator,


b = (∑_{i=1}^{N} xi x′i )−1 ∑_{i=1}^{N} xi yi = (X ′ X)−1 X ′ y,

which is the most frequently used estimator for β.

Here b is now a random variable (with specific realized values for each
sample) and we are interested in how well b approximates the true value of
the population parameters β.

2.3 The OLS estimator and its small sample properties 19


Gauss-Markov assumptions

How good OLS works for estimating β depends on the assumptions we are
willing to make on the behavior of εi and xi .
Here we start to consider the so-called Gauss-Markov (GM) assumptions,
under which OLS has some desirable properties. (Later we will relax some
of these assumptions.)

The GM assumptions are:

· The regression line holds on average:

E{εi } = 0, i = 1, . . . , N. (A1)

· {ε1 , . . . , εN } and {x1 , . . . , xN } are independent. (A2)

2.3 The OLS estimator and its small sample properties 20


Gauss-Markov assumptions

· All the errors have the same variance, known as homoscedasticity:

V {εi } = σ 2 , i = 1, . . . , N. (A3)

· The errors are mutually uncorrelated, thereby excluding


autocorrelation:

cov{εi , εj } = 0, i, j = 1, . . . , N, i ̸= j. (A4)

The GM assumptions imply that


E{ε|X} = E{ε} = 0 (by (A2) and (A1)),

V {ε|X} = V {ε} = σ 2 IN (by (A2), (A3) and (A4)).

Hence, the regressors in X do not convey any information about the


expected values of the εi ’s and their (co)variances.

2.3 The OLS estimator and its small sample properties 21


Gauss-Markov assumptions

To derive the properties of the OLS estimator under the GM assumptions,


for the sake of simplicity we assume that X is a deterministic
(non-stochastic) matrix.
This implies that we can take X as given without affecting the εi ’s so that
(A2) is satisfied.
Alternatively, one could derive all properties conditional upon X.

2.3 The OLS estimator and its small sample properties 22


The OLS estimator under the GM assumptions

The OLS estimator b is unbiased for β, i.e. E{b} = β.


This means that in repeated sampling the OLS estimates are on average
equal to the true value β.

This can be shown as follows:

E{b} = E{(X ′ X)−1 X ′ y} = E{β + (X ′ X)−1 X ′ ε}

= β + E{(X ′ X)−1 X ′ ε}

= β + (X ′ X)−1 X ′ E{ε} (by (A2))

= β (by (A1)).

Note that we only used (A1) and (A2).


Hence, unbiasedness even holds under heteroscedastic and correlated errors!

2.3 The OLS estimator and its small sample properties 23
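
A small Monte Carlo sketch (Python; the fixed design, true β and number of replications are arbitrary choices for illustration) of the unbiasedness result E{b} = β under repeated sampling:

    import numpy as np

    rng = np.random.default_rng(2)
    N, R = 100, 5000                                 # sample size and number of replications
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # X held fixed across samples
    beta = np.array([1.0, 0.5, -2.0])                # assumed true parameter vector

    b_draws = np.empty((R, 3))
    for r in range(R):
        eps = rng.normal(scale=2.0, size=N)          # errors satisfying (A1)-(A4)
        y = X @ beta + eps
        b_draws[r] = np.linalg.solve(X.T @ X, X.T @ y)

    print(b_draws.mean(axis=0))                      # close to beta = (1.0, 0.5, -2.0)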


The OLS estimator under the GM assumptions

The covariance matrix of b is:

V {b} = E{(b − β)(b − β)′ } = E{(X ′ X)−1 X ′ εε′ X(X ′ X)−1 }


= (X ′ X)−1 X ′ (σ 2 IN )X(X ′ X)−1 = σ 2 (X ′ X)−1 .

The variance of each element bj of b can be found on the main diagonal


and the covariance of bj and bk is the corresponding off-diagonal element.
The smaller the variances, the smaller the probability that the estimate b
is far from β.

This covariance matrix is unknown since it depends on the unknown σ 2 .

2.3 The OLS estimator and its small sample properties 24


The OLS estimator under the GM assumptions

Gauss-Markov Theorem: Under the assumptions (A1)-(A4), the OLS estima-


tor b is the best linear unbiased estimator (BLUE) for β.

Linear means that we consider estimators of the form b̃ = Ay with a


K × N matrix A.

Unbiasedness means that E{Ay} = β.

Best means that there is no other linear unbiased estimator that has a
smaller variance than b – stated mathematically,

V {d′ b̃} ≥ V {d′ b} for any vector d,

or, equivalently, V {b̃} − V {b} is a positive semi-definite matrix.

b is thus the most accurate in the class of linear unbiased estimators.

2.3 The OLS estimator and its small sample properties 25


Estimating the variance of b

In order to estimate V {b} = σ 2 (X ′ X)−1 we need to estimate the error


variance E{ε2i } = σ 2 .

Under the GM assumptions an unbiased estimator for σ 2 is the sample


variance of the OLS residuals ei = yi − x′i b (which have a sample mean
equal to 0):

s2 = 1/(N − K) ∑_{i=1}^{N} e2i = e′ e/(N − K), e = y − Xb.

2.3 The OLS estimator and its small sample properties 26


Estimating the variance of b

In order to show that E{s2 } = σ 2 we use the residual maker matrix,

M = I − X(X ′ X)−1 X ′ , with M M = M, M X = 0.

With M we can write the OLS residuals as

e = M y = M (Xβ + ε) = M ε.

It follows that

E{e′ e} = E{ε′ M ε} = E{tr(ε′ M ε)}

= E{tr(M εε′ )} = tr(M E{εε′ })

= tr(σ 2 M ) = σ 2 tr(I − X(X ′ X)−1 X ′ ) = σ 2 (N − K),

so that

E{s2 } = E{e′ e}/(N − K) = σ 2 .

2.3 The OLS estimator and its small sample properties 27


Estimating the variance of b

With s2 the variance of b, i.e. V {b} = σ 2 (X ′ X)−1 , can be estimated by

V̂ {b} = s2 (X ′ X)−1 = s2 (∑_{i=1}^{N} xi x′i )−1 .

The estimated variances of each bk are on the main diagonal of V̂ {b}:

V̂ {bk } = s2 ckk , ckk : kth diagonal element of (X ′ X)−1 .

The square root of V̂ {bk } is the standard error of bk :

se(bk ) = s √ckk .

2.3 The OLS estimator and its small sample properties 28
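
A numpy sketch (simulated data, illustrative values) of computing s2, the estimated covariance matrix V̂{b} and the standard errors se(bk):

    import numpy as np

    rng = np.random.default_rng(3)
    N, K = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=N)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (N - K)                       # s^2 = e'e / (N - K)
    V_hat = s2 * np.linalg.inv(X.T @ X)        # estimated covariance matrix of b
    se = np.sqrt(np.diag(V_hat))               # standard errors se(b_k) = s * sqrt(c_kk)
    print(b, se)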


Distribution of b

If we assume a law of distribution for ε we get a corresponding law of


distribution for b = β + (X ′ X)−1 X ′ ε.

The most common distributional assumption is:

ε ∼ N (0, σ 2 IN ), i.e. εi ∼ N ID(0, σ 2 ), (A5)

where N ID stands for ‘normally and independently distributed’.

From (A5) together with (A2) it follows that

b ∼ N (β, σ 2 (X ′ X)−1 ), and bk ∼ N (βk , σ 2 ckk ), k = 1, . . . , K.

This result provides the basis of statistical tests w.r.t. β based upon b.

2.3 The OLS estimator and its small sample properties 29


Distribution of s2

Under (A2) and (A5) the unbiased estimator for σ 2 ,

s2 = e′ e/(N − K), where e = M y = M ε,

follows a scaled χ2 -distribution, i.e.

((N − K)/σ 2 ) s2 ∼ χ2N −K .

This follows from the fact that we can rewrite this variable as

((N − K)/σ 2 ) s2 = ε′ M ε/σ 2 = (ε/σ)′ M (ε/σ),

so that it is a quadratic form of a vector of N ID(0, 1) variates (ε/σ).

Finally it can be shown that b and s2 are independent random variables.

2.3 The OLS estimator and its small sample properties 30


How well does the estimated regression line fit the data?

A goodness-of-fit measure is the proportion of variation in the yi ’s that is


explained by the regression line represented by the ŷi ’s.
This proportion defines the R2 statistic,

R2 = V̂ {ŷi }/V̂ {yi } (V̂ {·} denotes the sample variance).

Since yi = ŷi + ei , it holds in LRMs with intercept that

V̂ {yi } = V̂ {ŷi } + V̂ {ei } (as ŷ and e are orthogonal).

Hence, we can write the R2 as

R2 = (V̂ {yi } − V̂ {ei })/V̂ {yi } = 1 − [1/(N − 1) ∑_{i=1}^{N} e2i ] / [1/(N − 1) ∑_{i=1}^{N} (yi − ȳ)2 ],

where ȳ = (1/N ) ∑_i yi .

2.4 Goodness-of-fit 31
Interpretation of the R2

Note that 0 ≤ R2 ≤ 1.
If R2 = 0 this implies that

∑_{i=1}^{N} e2i = ∑_{i=1}^{N} (yi − ȳ)2 ,

so the LRM explains none of the variation in the yi ’s.

If R2 = 1 then

∑_{i=1}^{N} e2i = 0 ⇒ ei = 0 for all i,

and so there is a perfect fit, i.e. yi = x′i b for all i.

In practice there is no general rule for which values of the R2 are ‘good’.
In particular, a small R2 does not automatically imply that the LRM is
incorrect or useless: it just indicates that there is a lot of heterogeneity in
y not captured by x.

2.4 Goodness-of-fit 32
The adjusted R2

A drawback of the R2 is that when adding a variable to the LRM the R2


will never decrease, even if this variable is unrelated to y.

The reason is that ∑_i e2i in the R2 can only decrease (increasing R2 ) when
adding a variable.

This issue is addressed by the adjusted R2 :

R̄2 = 1 − [1/(N − K) ∑_{i=1}^{N} e2i ] / [1/(N − 1) ∑_{i=1}^{N} (yi − ȳ)2 ],

which involves a ‘degrees-of-freedom correction’ when estimating the


variance of ei .
So adding a new variable may increase that variance estimate and
therefore decrease R̄2 .

2.4 Goodness-of-fit 33
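
A sketch (Python, simulated data) of computing R2 and the adjusted R̄2 from the residual and total sums of squares; the model contains an intercept, so the variance decomposition used above applies:

    import numpy as np

    rng = np.random.default_rng(4)
    N, K = 120, 4
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(size=N)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    ss_res = e @ e                           # sum of squared residuals
    ss_tot = ((y - y.mean()) ** 2).sum()     # total sum of squares around the mean

    r2     = 1 - ss_res / ss_tot
    r2_adj = 1 - (ss_res / (N - K)) / (ss_tot / (N - 1))   # degrees-of-freedom correction
    print(r2, r2_adj)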
Hypothesis testing: t-test

Under the GM condition (A2) and normality (A5) we have


z = (bk − βk )/(σ √ckk ) ∼ N (0, 1), (ckk : diag. element of (X ′ X)−1 ).

Furthermore, it holds for the estimator s2 of σ 2 that

(N − K)s2 /σ 2 ∼ χ2N −K , and b ⊥⊥ s2 .

Consequently, the random variable

tk = (bk − βk )/(s √ckk ) = (bk − βk )/se(bk )
is the ratio of a standard normal variable and the root of an independent
χ2 -variable divided by its degrees of freedom.
So tk follows a t-distribution with N − K degrees of freedom (tk ∼ tN −K ).

2.5 Hypothesis testing: t and F -tests 34


Hypothesis testing: t-test

This result can be used to test hypotheses about the regression coefficients.
Consider the case that we want to test

H0 : βk = βk0 (null hypothesis),


against H1 : βk ̸= βk0 (alternative hypothesis).

For this we would use the t-statistic


tk = (bk − βk0 )/se(bk ) ∼ tN −K under H0 .

The decision rule is: Reject H0 in favor of H1 at significance level α, if

|tk | > tN −K;α/2 (tN −K;α/2 : critical value),

where tN −K;α/2 is the (1 − α/2)-quantile of a tN −K -distribution.

2.5 Hypothesis testing: t and F -tests 35


Hypothesis testing: t-test

[Figure: density of tk under H0 , with the central non-rejection region and the two rejection regions (area α/2 each) beyond ±tN −K;α/2 .]

pdf of the t-statistic tk under H0 : βk = βk0 ; Non-rejection and rejection


regions are those for the two-sided test with H1 : βk ̸= βk0 .

The critical value for a given level α is implicitly defined by

P {|tk | > tN −K;α/2 } = α.

A typically selected value for α is 5%.

2.5 Hypothesis testing: t and F -tests 36


Hypothesis testing: t-test

As N − K → ∞, we have tN −K → N (0, 1).


Thus, for a large N − K, H0 is rejected at the 5% level, if |tk | > 1.96.

A common hypothesis to test is

H0 : βk = 0 against H1 : βk ̸= 0,

for which the t-statistic (then also called t-ratio) is simply


tk = bk /se(bk ).

Regression software typically reports this t-statistic, because it tests

whether xik has a statistically significant impact on yi .

2.5 Hypothesis testing: t and F -tests 37
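
A sketch (Python with scipy assumed available; simulated data) of computing the t-ratios and their two-sided p-values from the tN −K distribution:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    N, K = 150, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N)   # third coefficient truly zero

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (N - K)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    t_ratios = b / se                                        # tests H0: beta_k = 0
    p_values = 2 * stats.t.sf(np.abs(t_ratios), df=N - K)    # two-sided p-values
    print(t_ratios, p_values)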


Testing the joint significance of regression coefficients

Often it is important to test whether a subset (or all) of the regression


coefficients are zero.
If this test is for the last J coefficients, H0 becomes

H0 : βK−J+1 = · · · = βK = 0,

and H1 is that at least one of these coefficients is not equal to zero.

2.5 Hypothesis testing: t and F -tests 38


Testing the joint significance of regression coefficients

A simple test approach compares the fitted restricted LRM under H0 ,

yi = b1 + b2 x2i + · · · + bK−J xK−Ji + ẽi ,

with the fitted unrestricted full LRM,

yi = b1 + b2 x2i + · · · + bK−J xK−Ji


+ bK−J+1 xK−J+1i + · · · + bK xKi + ei ,

by using their respective sum of squared OLS residuals,

S0 = ∑_{i=1}^{N} ẽ2i and S1 = ∑_{i=1}^{N} e2i .

Under H0 : βK−J+1 = · · · = βK = 0, we expect that S0 is not much larger


than S1 .
A point to note: S0 ≥ S1 .

2.5 Hypothesis testing: t and F -tests 39


Testing the joint significance of regression coefficients

For a formal test we can exploit that under (A2) and (A5) it holds that

ξ1 = (S0 − S1 )/σ 2 ∼ χ2J under H0 .

But σ 2 is unknown and so ξ1 cannot be used for testing.
However, we know that the scaled estimator s2 for σ 2 in the unrestricted
LRM is χ2 -distributed, i.e.

ξ2 = (N − K)s2 /σ 2 = S1 /σ 2 ∼ χ2N −K under H0 and H1 .

So combining this with the result that the ratio of two independent
χ2 -variables scaled by their degrees of freedom is F -distributed, we find
that

F = (ξ1 /J)/(ξ2 /(N − K)) = [(S0 − S1 )/J]/[S1 /(N − K)] ∼ FJ,N −K under H0 .

2.5 Hypothesis testing: t and F -tests 40


Testing the joint significance of regression coefficients

The decision rule is: Reject H0 at a significance level α, if

F > FJ,N −K;α (FJ,N −K;α : critical value),

where FJ,N −K;α is the (1 − α)-quantile of an FJ,N −K -distribution.

A common use of this F -test is to test the joint significance of all


regressors except the intercept, i.e.

H0 : β2 = · · · = βK = 0.

Regression software typically report the F -statistic for this test.

2.5 Hypothesis testing: t and F -tests 41
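
A sketch (Python/scipy, simulated data) of the F-test of joint significance of all regressors except the intercept, computed from the restricted and unrestricted residual sums of squares S0 and S1:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    N, K = 150, 4
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=N)

    b = np.linalg.solve(X.T @ X, X.T @ y)
    S1 = ((y - X @ b) ** 2).sum()          # unrestricted residual sum of squares
    S0 = ((y - y.mean()) ** 2).sum()       # restricted RSS (intercept-only model)

    J = K - 1                              # number of restrictions under H0
    F = ((S0 - S1) / J) / (S1 / (N - K))
    p_value = stats.f.sf(F, J, N - K)
    print(F, p_value)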


Testing the joint significance of regression coefficients

Points to note:

The significance of the regressors can be tested individually by a


sequence of corresponding t-tests.
However, this says nothing about their joint significance.

The conclusions from an F -test of joint significance and t-tests of


individual significance can differ since the explanatory power of the
regressors overlaps.

2.5 Hypothesis testing: t and F -tests 42


The general F -test

The F -test can be generalized to jointly test J linear restrictions on β of


the form
H0 : Rβ = q against H1 : Rβ ̸= q,
where R is a J × K full-rank matrix and q is a J × 1 vector.

For example suppose we wish to test

β2 + · · · + βK = 1 and β2 = β3 .

In this case we have

R: the 2 × K matrix with rows (0, 1, 1, . . . , 1) and (0, 1, −1, 0, . . . , 0), and q = (1, 0)′ .

2.5 Hypothesis testing: t and F -tests 43


The general F -test

For testing
H0 : Rβ = q, against H1 : Rβ ̸= q,

we can

fit the LRM, both with and without imposing the restrictions to be


tested,

and then use an F -test which compares the sum of squared residuals
of the restricted model (S0 ) and sum of squared residuals of the
unrestricted model (S1 ) (see above).

2.5 Hypothesis testing: t and F -tests 44


The general F -test

An example for imposing linear restrictions:

Consider the LRM

yi = β1 + β2 x2i + β3 x3i + εi ,

with H0 : β2 + β3 = 1 ⇒ H0 : β3 = 1 − β2 .

Then the restricted model is

yi = β1 + β2 x2i + (1 − β2 )x3i + εi
⇒ yi − x3i = β1 + β2 (x2i − x3i ) + εi .

There is an alternative formulation of this F -test based on the sum of


squared residuals that does not require the estimation of the restricted
model.

2.5 Hypothesis testing: t and F -tests 45
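
A small sketch (Python, simulated data with illustrative coefficient values) of imposing the restriction β2 + β3 = 1 by the variable transformation shown above and recovering β3 from the restriction:

    import numpy as np

    rng = np.random.default_rng(10)
    N = 200
    x2, x3 = rng.normal(size=N), rng.normal(size=N)
    y = 1.0 + 0.7 * x2 + 0.3 * x3 + rng.normal(size=N)   # true coefficients satisfy beta_2 + beta_3 = 1

    # Restricted regression: (y - x3) on a constant and (x2 - x3)
    Zr = np.column_stack([np.ones(N), x2 - x3])
    br = np.linalg.solve(Zr.T @ Zr, Zr.T @ (y - x3))
    beta2_r = br[1]
    beta3_r = 1 - beta2_r                                # recovered from the restriction
    print(br[0], beta2_r, beta3_r)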


The general F -test

This alternative exploits that under (A2) and (A5)

b ∼ N (β, σ 2 (X ′ X)−1 )
⇒ Rb − q ∼ N (Rβ − q, σ 2 R(X ′ X)−1 R′ ).

So under H0 : Rβ − q = 0,

Rb − q ∼ N (0, σ 2 R(X ′ X)−1 R′ )
⇒ ξ3 = (Rb − q)′ [σ 2 R(X ′ X)−1 R′ ]−1 (Rb − q) ∼ χ2J .

But σ 2 is unknown and so ξ3 cannot be used for testing.

However, we know that under H0 and H1

ξ2 = (N − K)s2 /σ 2 ∼ χ2N −K .

2.5 Hypothesis testing: t and F -tests 46


The general F -test

So we find that under H0

F = (ξ3 /J)/(ξ2 /(N − K)) = (Rb − q)′ [R(X ′ X)−1 R′ ]−1 (Rb − q)/(Js2 ) ∼ FJ,N −K .

This F -statistic is algebraically identical to the one which is formulated in


terms of the sum of squared residuals of the restricted model (S0 ) and sum
of squared residuals of the unrestricted model (S1 )!

2.5 Hypothesis testing: t and F -tests 47
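
A sketch (Python/scipy, simulated data generated so that H0 holds) of the Wald form of the F-test for the earlier example restrictions β2 + · · · + βK = 1 and β2 = β3:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    N, K = 200, 4
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    y = X @ np.array([0.2, 0.5, 0.5, 0.0]) + rng.normal(size=N)   # restrictions hold for these values

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (N - K)

    R = np.array([[0.0, 1.0,  1.0, 1.0],     # beta_2 + beta_3 + beta_4 = 1
                  [0.0, 1.0, -1.0, 0.0]])    # beta_2 - beta_3 = 0
    q = np.array([1.0, 0.0])
    J = R.shape[0]

    middle = R @ np.linalg.inv(X.T @ X) @ R.T
    F = (R @ b - q) @ np.linalg.solve(middle, R @ b - q) / (J * s2)
    print(F, stats.f.sf(F, J, N - K))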


Size and Power of a test

There are four possible (random) events that can happen when testing a
hypothesis:

1. Not rejecting H0 when it is true : correct decision

2. Rejecting H0 when it is not true : correct decision

3. Rejecting H0 when it is true : Type I error

4. Not rejecting H0 when it is not true : Type II error

Ideally we want a test to make no type I and no type II errors.


But since we are dealing with randomness this is impossible.

2.5 Hypothesis testing: t and F -tests 48


Size and Power of a test

However, by selecting the significance level α we can control the

probability of a type I error.
So for a test at significance level α, we have that P (type I error) = α.
The significance level is also called the (nominal) size of a test.

The probability that a test correctly rejects H0 when H1 is true is the


power of a test which is given by 1 − P (type II error).
It tells us how powerful a test is at detecting deviations from H0 and
depends upon the true parameter value.

Reducing the size of the test will typically also reduce its power, so that
there is a tradeoff between type I and type II errors.

2.5 Hypothesis testing: t and F -tests 49
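
The size/power trade-off can be illustrated by simulation. The following sketch (Python/scipy; the design and the alternative value β2 = 0.3 are arbitrary choices) approximates the rejection frequency of the two-sided t-test under H0 and under one alternative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    N, R, alpha = 100, 2000, 0.05
    X = np.column_stack([np.ones(N), rng.normal(size=N)])   # fixed design
    crit = stats.t.ppf(1 - alpha / 2, df=N - 2)              # two-sided critical value

    def rejection_rate(beta2):
        rejections = 0
        for _ in range(R):
            y = 1.0 + beta2 * X[:, 1] + rng.normal(size=N)
            b = np.linalg.solve(X.T @ X, X.T @ y)
            e = y - X @ b
            s2 = e @ e / (N - 2)
            se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
            rejections += abs(b[1] / se) > crit
        return rejections / R

    print(rejection_rate(0.0))   # close to the nominal size alpha = 0.05
    print(rejection_rate(0.3))   # approximate power against beta_2 = 0.3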


p-values

A useful number when doing hypothesis tests is the p-value.


It is the smallest (marginal) significance level at which we can reject H0 .

Conceptually, the p-value is the probability of observing a value of the test


statistic as extreme or more extreme than the one observed, assuming that
H0 holds.

Thus in practice, a p-value smaller than α means that we reject H0 at the


α significance level.

2.5 Hypothesis testing: t and F -tests 50


Multicollinearity

Regressors in LRMs are typically correlated.


E.g. in a LRM for workers’ earnings that uses their age and experience as
regressors, we have two regressors which are highly correlated.

If regressors are highly correlated, X ′ X can be near singular so that:

X ′ X may be numerically difficult to invert;

Estimates for β tend to be inaccurate with large se’s so that regressors


may (falsely) appear to be insignificant.

Essentially, the empirical identification of the individual effects of


highly correlated regressors on y is difficult.

This problem is known as multicollinearity.

2.6 Multicollinearity 51
Multicollinearity

To illustrate the problem, consider the variance of the OLS estimator bk in a

LRM,

V {bk } = (σ 2 /N ) (1/(1 − Rk2 )) [ (1/N ) ∑_{i=1}^{N} (xik − x̄k )2 ]−1 ,

where Rk2 is the R2 obtained from regressing xk on the remaining


regressors and an intercept.
So the larger Rk2 , the larger the variance.

The term
1
V IF (bk ) =
1 − Rk2
is the so-called variance inflation factor.
It tells us by how much the variance is inflated compared to the
hypothetical situation that the regressors are uncorrelated.

2.6 Multicollinearity 52
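
A sketch (Python, simulated data with two highly correlated regressors) of computing VIF(bk) via the auxiliary regression of xk on the remaining regressors and an intercept:

    import numpy as np

    rng = np.random.default_rng(9)
    N = 200
    x2 = rng.normal(size=N)
    x3 = 0.95 * x2 + 0.3 * rng.normal(size=N)      # x3 highly correlated with x2
    X = np.column_stack([np.ones(N), x2, x3])

    def vif(X, k):
        xk = X[:, k]
        Z = np.delete(X, k, axis=1)                # remaining regressors (incl. intercept)
        g = np.linalg.solve(Z.T @ Z, Z.T @ xk)     # auxiliary OLS regression of x_k on Z
        resid = xk - Z @ g
        r2_k = 1 - resid @ resid / ((xk - xk.mean()) ** 2).sum()
        return 1 / (1 - r2_k)                      # variance inflation factor

    print(vif(X, 1), vif(X, 2))                    # both well above 1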
Multicollinearity

In the extreme case (Rk2 → 1) when there is a perfect linear relationship


between regressors, we have exact multicollinearity.
In this case a regressor is a linear combination of one or more other
regressors (e.g. price in cents and price in euros) and the OLS estimator is no
longer uniquely defined.
This problem can easily be solved by omitting redundant regressors.

2.6 Multicollinearity 53
Multicollinearity

In the general case (Rk2 < 1 but large) there is no easy solution; what to do
depends on the situation.

Here we have highly inaccurate estimates, meaning that our sample does
not provide enough information about the parameters.
Hence we would need to use more information, either by extending the
sample (how?) or by imposing a priori restrictions on the parameters.
The latter commonly means excluding regressors from the model (which
ones?).

2.6 Multicollinearity 54
Application: The capital asset pricing model

The CAPM states that

E{rjt } − rf = βj (E{rmt } − rf ),

where rjt is the risky return on asset j in period t, rmt is the return on the
market portfolio and rf is the riskless return (which usually also depends on t
but is deterministic).

The so-called beta-factor


βj = cov{rjt , rmt }/V {rmt }
measures the systematic (or market) risk of asset j, i.e. how strong
fluctuations in the returns are related to market movements.

The model tells us that a larger βj leads to a higher risk premium.

2.7 Illustration: Capital asset pricing model 55


Application: The capital asset pricing model

Assume rational expectations and define the unexpected returns for asset j
as
ujt = rjt − E{rjt } ⇔ E{rjt } = rjt − ujt ,
and likewise for the market portfolio.

Then we can rewrite the CAPM as

rjt − rf = βj (rmt − rf ) + εjt ,

with
εjt = ujt − βj umt .

The error can be shown to satisfy the requirements for a regression error
term and we can estimate the beta-factor by OLS.

2.7 Illustration: Capital asset pricing model 56


Application: The capital asset pricing model

For the empirical analysis of the CAPM with gretl


· use the data set from the 2nd edition of Verbeek’s textbook (available
through gretl’s website in the file capm2.gdt).
It contains monthly returns from 1960/1 to 2002/12 (N = 516) of
three US-industry portfolios (food, consumer durables, construction
industry) and of a value-weighted US stock index (as proxy for the
market portfolio); returns are measured in deviations from the
risk-free rate.

· Estimate the CAPM without intercept.

· Test the theoretical implication of the CAPM that there is no


intercept.

· Test for an anomaly, the January effect.

2.7 Illustration: Capital asset pricing model 57
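
The gretl exercise could be reproduced along the following lines in Python. This is only a sketch: the file name 'capm2.csv' and the column names are hypothetical stand-ins for the variables in Verbeek's capm2.gdt, which would first have to be exported to CSV.

    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("capm2.csv")              # assumed: monthly excess returns (hypothetical file)
    y = df["food"].to_numpy()                  # excess return of the food portfolio (assumed column)
    rm = df["rmrf"].to_numpy()                 # excess return of the market proxy (assumed column)
    N = len(y)

    # CAPM without intercept: the slope is the beta-factor
    beta_hat = (rm @ y) / (rm @ rm)

    # CAPM with intercept, to test the theoretical implication of a zero intercept
    X = np.column_stack([np.ones(N), rm])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (N - 2)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    t_intercept = b[0] / se[0]                 # t-ratio for H0: intercept = 0
    print(beta_hat, b, t_intercept, 2 * stats.t.sf(abs(t_intercept), df=N - 2))

A January dummy could be added to X in the same way to test for the January effect mentioned in the exercise.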
