
EC501 Econometric Methods

2. Linear Regression: Statistical Properties

Marcus Chambers

Department of Economics
University of Essex

19 October 2023

Outline

Review

The linear regression model: assumptions

Statistical properties of OLS: small N

Statistical properties of OLS: large N

Reference: Verbeek, chapter 2.

Review
We motivated the ordinary least squares (OLS) estimator by
choosing a linear combination of the regressors that provides a
‘good’ approximation of the dependent variable.
Our measure of ‘good’ was in terms of the sum of squared
residuals, where the residual for observation i is

ei = yi − β̃1 − β̃2 xi2 − . . . − β̃K xiK , i = 1, . . . , N.

The OLS estimator is obtained as b = arg minβ̃ S(β̃) where

S(β̃) = ∑ᵢ₌₁ᴺ eᵢ² = e′e = (y − Xβ̃)′(y − Xβ̃).

The result is
b = (X ′ X)−1 X ′ y.
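As an illustration, the following sketch (using simulated data, so all variable names and parameter values are hypothetical) computes b by the matrix formula and checks that it matches the coefficients reported by lm():

set.seed(1)
N <- 200
x2 <- rnorm(N); x3 <- rnorm(N)
y  <- 1 + 0.5 * x2 - 2 * x3 + rnorm(N)   # simulated population relationship

X <- cbind(1, x2, x3)                    # N x K regressor matrix (K = 3)
b <- solve(t(X) %*% X, t(X) %*% y)       # b = (X'X)^{-1} X'y

b                                        # matrix-formula OLS
coef(lm(y ~ x2 + x3))                    # the same coefficients from lm()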

Properties of b?

But: what are the (statistical) properties of b?


To answer this question we need to move beyond thinking of
OLS in a purely algebraic sense.
Instead of describing the properties of a given sample we shall
think in terms of a statistical model relating y to x2 , . . . , xK .
We try to learn something about this relationship from our
observed sample.
The statistical properties (assumptions) of the model then
determine the statistical properties of b.

The linear regression model
The linear regression model takes the form

yi = β1 + β2 xi2 + . . . + βK xiK + ϵi or yi = xi′ β + ϵi ,

where ϵi is an error term or disturbance.


This is a population relationship between y and x and is
assumed to hold for any possible observation.
Our goal is to estimate the population parameters, β1 , . . . , βK ,
based on our sample, (yi , xi ; i = 1, . . . , N).
We regard yi and ϵi (and usually xi ) as random variables that
are part of a sample derived from the population.
Recall that we can write the model in matrix form as
y = Xβ + ϵ, (1)
where the dimensions are y (N × 1), X (N × K), β (K × 1) and ϵ (N × 1).

Random sampling
The origins of the linear regression model lie in the sciences
where the xi variables are determined in a laboratory setting.
The xi variables are fixed in repeated samples so that the only
source of randomness is ϵi leading to different values for yi
across samples.
This can be hard to justify in Economics where it is more
common to regard both xi and ϵi as changing across samples.
This leads to different observed values of yi and xi each time a
new sample is drawn.
A random sample implies that each observation, (yi , xi ), is an
independent drawing from the population.
We will use this idea as a basis for a set of statistical
assumptions.

Assumptions
Our assumptions concern the linear model

yi = xi′ β + ϵi , i = 1, . . . , N.

The Gauss-Markov conditions are:


E{ϵi } = 0, i = 1, . . . , N; (A1)
{ϵ1 , . . . , ϵN } and {x1 , . . . , xN } are independent; (A2)
V{ϵi } = σ 2 , i = 1, . . . , N; (A3)
cov{ϵi , ϵj } = 0, i, j = 1, . . . , N, i ̸= j. (A4)

Note that we also need N > K and X ′ X to be invertible – here we need X to have rank K, i.e. the columns of X are linearly independent (M4).
What do these conditions imply?
Assumptions A1, A3 and A4
Assumption (A1) suggests that the regression line holds on
average (more on this shortly).
Assumption (A3) states that all disturbances have the same
variance - this is known as homoskedasticity (which rules out
heteroskedasticity, or non-constant variances, which we shall
deal with later).
Assumption (A4) tells us that all pairs, ϵi and ϵj , are
uncorrelated (this is essentially just random sampling), thereby
ruling out autocorrelation.
In terms of the N × 1 vector ϵ, these assumptions imply (see
S12) that

E{ϵ} = 0 (N × 1) and V{ϵ} = σ 2 IN (N × N),

where IN is the N × N identity matrix.

Assumption A2
Under assumption (A2) the matrix X and vector ϵ are
independent.
This means that knowledge of X tells us nothing about the
distribution of ϵ (and vice versa).
It implies that

E{ϵ|X} = E{ϵ} = 0 and V{ϵ|X} = V{ϵ} = σ 2 IN .

Under (A1) and (A2) the linear regression model is a model for
the conditional mean of yi , because

E{yi |xi } = E{xi′ β + ϵi |xi } = xi′ β + E{ϵi |xi } = xi′ β

in view of E{ϵi |xi } = 0.


Assumptions (A1)–(A4) jointly determine the properties of b.

Small N

We shall begin by taking the sample size, N, to be a finite number (but recall N > K).
First, note that the OLS vector b is a linear function of y:

b = (X ′ X)−1 X ′ y

= (X ′ X)−1 X ′ (Xβ + ϵ) (using y = Xβ + ϵ)

= (X ′ X)−1 X ′ Xβ + (X ′ X)−1 X ′ ϵ

= β + (X ′ X)−1 X ′ ϵ (because (X ′ X)−1 X ′ X = IK ).

It is, therefore, also a linear function of the unobservable random vector ϵ.

E{b}

The expected value of b is

E{b} = E{β + (X ′ X)−1 X ′ ϵ} = β + E{(X ′ X)−1 X ′ ϵ}.

But
E{(X ′ X)−1 X ′ ϵ} = E{(X ′ X)−1 X ′ }E{ϵ} = 0
by (A2) and E{ϵ} = 0 by (A1).
Hence E{b} = β and the OLS estimator b is said to be an
unbiased estimator of β.
In repeated sampling the OLS estimator will be equal to β ‘on
average.’
Note that unbiasedness does not require (A3) and (A4).
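A small Monte Carlo sketch of this idea (simulated data and illustrative parameter values, not part of the original slides): holding X fixed and redrawing ϵ many times, the average of b across replications should be close to β.

set.seed(2)
N <- 100; beta <- c(1, 0.5)
x <- rnorm(N)
X <- cbind(1, x)                          # regressors held fixed across samples

b_store <- replicate(5000, {
  eps <- rnorm(N)                         # fresh disturbances each replication
  y   <- X %*% beta + eps
  drop(solve(t(X) %*% X, t(X) %*% y))     # OLS estimate for this sample
})
rowMeans(b_store)                         # close to c(1, 0.5): b is unbiased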

V{b|X}
The conditional covariance matrix of b is:
V{b|X} = E{(b − β)(b − β)′ |X}

= E{(X ′ X)−1 X ′ ϵϵ′ X(X ′ X)−1 |X}

= (X ′ X)−1 X ′ E{ϵϵ′ |X}X(X ′ X)−1

= σ 2 (X ′ X)−1 X ′ X(X ′ X)−1 as E{ϵϵ′ |X} = σ 2 IN

= σ 2 (X ′ X)−1 .

We will denote this as V{b} = σ 2 (X ′ X)−1 for convenience.


The unconditional variance matrix is actually V{b} = σ²E{(X ′ X)⁻¹}, which is rather more complicated!
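Continuing in the same illustrative spirit (simulated design and hypothetical values), the covariance matrix of b across Monte Carlo replications should be close to σ²(X′X)⁻¹:

set.seed(2)
N <- 100; beta <- c(1, 0.5); sigma2 <- 4
x <- rnorm(N)
X <- cbind(1, x)

sigma2 * solve(t(X) %*% X)                # theoretical V{b|X} = sigma^2 (X'X)^{-1}

b_store <- replicate(5000, {
  y <- X %*% beta + rnorm(N, sd = sqrt(sigma2))
  drop(solve(t(X) %*% X, t(X) %*% y))     # OLS estimate for this sample
})
cov(t(b_store))                           # Monte Carlo covariance: close to the above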


Gauss-Markov Theorem
Clearly OLS is a Linear Unbiased Estimator (LUE).
But how does OLS compare to other LUEs?

Gauss-Markov Theorem
Under Assumptions (A1)–(A4), the OLS estimator b of β is the
Best Linear Unbiased Estimator (BLUE) in the sense that it has
minimum variance within the class of LUEs.
What does this mean?
Take any other LUE, call it b̃; then

V{b̃|X} ≥ V{b|X}

in the sense that the matrix V{b̃|X} − V{b|X} is positive semi-definite; see (M10).

Normality of ϵ

Sometimes it is appropriate to actually specify the distribution of the random disturbance vector ϵ.
A common assumption, that incorporates (A1), (A3) and (A4),
is:
ϵ ∼ N(0, σ 2 IN ). (A5)
This is equivalent to

ϵi ∼ NID(0, σ 2 ), (A5′ )

where NID denotes ‘normally and independently distributed.’


This also implies that yi ∼ NID(xi′ β, σ 2 ) (conditional on X) which
is not always appropriate.

Normality of b

Under (A2) and (A5) it follows that

b ∼ N(β, σ 2 (X ′ X)−1 )

because b is linear in ϵ.
Each element of b is also normally distributed:

bk ∼ N(βk , σ 2 ckk ), k = 1, . . . , K,

where ckk denotes the (k, k) (diagonal) element of (X ′ X)−1 .
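A sketch of this result (simulated data with normal errors; the numbers are illustrative only): the Monte Carlo distribution of the slope estimate b2 has mean β2, standard deviation σ√c22 and a normal shape.

set.seed(3)
N <- 50; beta <- c(1, 0.5); sigma <- 2
x <- rnorm(N)
X <- cbind(1, x)
c22 <- solve(t(X) %*% X)[2, 2]            # (2,2) element of (X'X)^{-1}

b2 <- replicate(5000, {
  y <- X %*% beta + rnorm(N, sd = sigma)  # normal disturbances, as in (A5)
  coef(lm(y ~ x))[2]
})
c(mean(b2), sd(b2))                       # compare with beta_2 and sigma*sqrt(c22):
c(beta[2], sigma * sqrt(c22))
qqnorm(b2); qqline(b2)                    # points close to the line: normal shape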


These results motivate statistical tests based on b but, in
practice, we don’t know σ 2 .
We therefore estimate σ 2 using the data – how do we do this?

Estimation of σ 2 = V{ϵi }
We usually estimate variances by sample averages but ϵi is
unobserved.
Instead we can base an estimator on the residuals:
s² = (1/(N − K)) ∑ᵢ₌₁ᴺ eᵢ².

This estimator is unbiased (i.e. E{s2 } = σ 2 ).


Note the degrees of freedom adjustment – the denominator is
N − K rather than N − 1.
This is because we have estimated K parameters in order to
obtain the residuals (ei = yi − xi′ b).
The estimated variance matrix of b is then

V̂{b} = s2 (X ′ X)−1 .

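A sketch of this calculation (simulated data; names are illustrative): form s² from the lm() residuals with the N − K divisor and check that s²(X′X)⁻¹ reproduces vcov() and the reported standard errors.

set.seed(4)
N <- 200; K <- 2
x <- rnorm(N)
y <- 1 + 0.5 * x + rnorm(N)
fit <- lm(y ~ x)

e  <- resid(fit)
s2 <- sum(e^2) / (N - K)                  # unbiased estimator of sigma^2
X  <- cbind(1, x)
Vb <- s2 * solve(t(X) %*% X)              # estimated variance matrix of b

sqrt(diag(Vb))                            # standard errors of b1 and b2
sqrt(diag(vcov(fit)))                     # the same values from lm()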
Returning to the R output for a regression of individuals’ wages
on years of education from last week:
> fit1 <- lm(lwage~educ, data=wage1)
> summary(fit1)

Call:
lm(formula = lwage ~ educ, data = wage1)

Residuals:
Min 1Q Median 3Q Max
-2.21158 -0.36393 -0.07263 0.29712 1.52339

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.583773 0.097336 5.998 3.74e-09 ***
educ 0.082744 0.007567 10.935 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4801 on 524 degrees of freedom


Multiple R-squared: 0.1858,  Adjusted R-squared: 0.1843
F-statistic: 119.6 on 1 and 524 DF, p-value: < 2.2e-16

Here, s = 0.4801 (implying s² = 0.2305), while the standard errors (√(s²ckk)) are 0.0973 and 0.0076 for b1 and b2, respectively.

Large N

The Gauss-Markov assumptions ensure that exact finite sample results hold for b (e.g. unbiasedness and, with (A5), normality).
If we wish to relax some of these assumptions then exact finite
sample results are typically not available.
For example, if (A2) doesn’t hold, then b will generally be biased.
We therefore use results for large N to find out the asymptotic
properties as N → ∞.
For large enough N we treat the asymptotic results as holding
approximately.

Convergence

Consider a sequence of numbers indexed by N, e.g.

{xN = e⁻ᴺ} = {1/e, 1/e², 1/e³, . . . , 1/eᴺ, . . .}.

We can define the limit of this sequence as N → ∞:

limN→∞ xN = limN→∞ e⁻ᴺ = 0.

The sequence {xN } is said to converge to zero.


But what happens if the elements of the sequence are random
variables?

Convergence of random variables

The sequence of random variables {xN } is said to converge in probability to a constant c if

limN→∞ P{|xN − c| > δ} = 0 for all δ > 0;

(see, for example, (2.69) on p.34 of Verbeek).


This is written xN →ᵖ c or plim xN = c.
In words: for any positive number δ, as N gets larger and larger, the probability that the distance between xN and c is larger than δ converges to zero.
Note that δ can be arbitrarily small.
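A quick numerical sketch of the definition (simulated draws, illustrative only): for the sample mean of χ²(1) variables, the probability of being more than δ = 0.1 away from the population mean of 1 shrinks towards zero as N grows.

set.seed(5)
delta <- 0.1
for (N in c(10, 100, 1000, 10000)) {
  xbar <- replicate(2000, mean(rchisq(N, df = 1)))   # population mean is 1
  cat("N =", N, " P(|xbar - 1| > 0.1) approx.",
      mean(abs(xbar - 1) > delta), "\n")
}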

Slutsky’s Theorem
If plim b = β then b is a consistent estimator of β.
Consistency can be thought of as a large sample version of
unbiasedness and is a minimum requirement for an estimator.
A useful property of the plim operator is:

Slutsky’s Theorem
If g(·) is a continuous function and plim xN = c, then

plim g(xN ) = g(plim xN ) = g(c);

(see, for example, (2.71) on p.34 of Verbeek).

This is not a property shared by the expectations operator; in general, E{g(x)} ≠ g{E(x)} for a random variable x.
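A numerical illustration of this last point (simulated values, hypothetical), taking g(x) = x²: for a χ²(1) variable E{g(x)} = 3 while g(E{x}) = 1, yet by Slutsky the square of a sample mean still converges to the square of the population mean.

set.seed(6)
x <- rchisq(1e6, df = 1)                  # E{x} = 1, V{x} = 2
mean(x^2)                                 # E{g(x)}, roughly 3
mean(x)^2                                 # g(E{x}), roughly 1: not equal

# Slutsky: g(sample mean) converges to g(plim sample mean) = 1^2 = 1
sapply(c(10, 1000, 100000), function(N) mean(rchisq(N, df = 1))^2)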
Convergence to a constant
[Figure: densities of an estimator for N = 10, N = 100 and N = 1000; horizontal axis ‘Estimator’, vertical axis ‘Density’.]
Convergence to a constant c (here, c = 0) is illustrated above by the variance of the distribution becoming smaller as N increases.
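A figure of this kind can be reproduced with the sketch below (simulated sampling densities of a sample mean, so the details are illustrative rather than those behind the original plot):

set.seed(7)
dens <- lapply(c(10, 100, 1000), function(N)
  density(replicate(5000, mean(rnorm(N)))))            # sampling density of the mean

plot(dens[[3]], xlim = c(-1, 1), xlab = "Estimator", ylab = "Density",
     main = "Convergence to a constant (c = 0)")
lines(dens[[2]], lty = 2)
lines(dens[[1]], lty = 3)
legend("topright", legend = c("N=1000", "N=100", "N=10"), lty = 1:3)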

Large N assumptions

What can we say about b in large samples? Is it consistent?


It is convenient to make the following assumptions:
(1/N) X ′ X = (1/N) ∑ᵢ₌₁ᴺ xi xi′ →ᵖ Σxx (finite, nonsingular); (A6)
E{xi ϵi } = 0, i = 1, . . . , N. (A7)

In (A6) the matrix Σxx can be regarded as E(xi xi′ ).


Assumption (A7) states that xi and ϵi are uncorrelated.
What do these conditions imply for b?

Properties of b

We begin by writing

b = β + ((1/N) X ′ X)⁻¹ (1/N) X ′ ϵ = β + ((1/N) ∑ᵢ₌₁ᴺ xi xi′)⁻¹ (1/N) ∑ᵢ₌₁ᴺ xi ϵi.

Applying the plim operator and using Slutsky we find

plim(b − β) = (plim (1/N) ∑ᵢ₌₁ᴺ xi xi′)⁻¹ plim (1/N) ∑ᵢ₌₁ᴺ xi ϵi.

The first term converges to Σxx⁻¹ using (A6).

Large sample results
It is reasonable to assume that sample averages converge to
their population values and so
plim (1/N) ∑ᵢ₌₁ᴺ xi ϵi = E{xi ϵi }.

But E{xi ϵi } = 0 under (A7) and so

plim(b − β) = Σxx⁻¹ E{xi ϵi } = 0.

Hence b is a consistent estimator of β.


It is also possible to show that, as N → ∞,

√N (b − β) → N(0, σ²Σxx⁻¹),

where → means ‘is asymptotically distributed as’.
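A sketch of consistency at work (simulated data, illustrative values): the OLS slope estimate settles down around the true value 0.5 as N grows.

set.seed(8)
for (N in c(50, 500, 5000, 50000)) {
  x <- rnorm(N)
  y <- 1 + 0.5 * x + rnorm(N)
  cat("N =", N, " slope estimate:", round(coef(lm(y ~ x))[2], 4), "\n")
}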

Large sample approximation

For a large but finite sample size we can use this result to
approximate the distribution of b as
b ∼ᵃ N(β, σ²Σxx⁻¹/N),

where ∼ᵃ means ‘is approximately distributed as.’
Our best estimate of Σxx is X ′ X/N and we estimate σ 2 using s2 .
Hence we have the familiar result
b ∼ᵃ N(β, s²(X ′ X)⁻¹).

But note this is only approximate as it is based on weaker assumptions than Gauss-Markov.

Summary

• Gauss-Markov assumptions
• statistical properties of OLS: small N and large N

• Next week:
• goodness-of-fit
• hypothesis testing (t and F tests)

