Lecture 3 - Econometria I

The document discusses the assumptions and properties of simple and multiple linear regression models. It outlines the assumptions of the simple linear regression model, including that the model is linear in parameters, random sampling is used, there is variation in the explanatory variable, the error term has a zero conditional mean, and the error term is homoskedastic. It then discusses interpreting the OLS regression equation and outlines two key assumptions of multiple linear regression models.

Econometrics I
Prof. Dr. Daniel Roland
Federal University of ABC - 1/2024

Simple/Multiple linear regression: assumptions and estimator properties
Simple linear regression model

 We explored the simple linear regression model and derived the OLS estimator previously. Now we will discuss the model's assumptions and implications before moving on to multiple linear regression models. Here are the simple linear regression (SLR) assumptions.

SLR.1 (Linear in parameters) – In the population model, the dependent variable, y, is related to the independent variable, x, and the error (or disturbance), u, as

$$y = \beta_0 + \beta_1 x + u$$

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.
Simple linear regression model

SLR.2 (Random Sampling) – We have a random sample of size n, $\{(x_i, y_i) : i = 1, 2, \dots, n\}$, following the population model shown previously. Therefore, the random sample is given by

$$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \dots, n$$

Samples are meant to be a representation of the population with all its characteristics. If our sample is not random, it might contain bias which would ultimately invalidate any results. E.g. a survey interested in finding the population's opinion on a particular football team would be very misleading if it interviewed only the supporters of that football team.
Simple linear regression model

SLR.3 (Sample variation in the explanatory variable) – The sample outcomes on x, namely $\{x_i : i = 1, 2, \dots, n\}$, are not all the same value.

If there is no variation in the explanatory variable, there is no way to estimate the OLS regression line. A simple inspection of summary statistics can check this assumption: if the standard deviation of x is not zero, the assumption holds.
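As a quick illustration of how one might check SLR.3 in practice, here is a minimal Python sketch; the data array `x` is hypothetical.

```python
import numpy as np

# Hypothetical sample of the explanatory variable (e.g. years of education)
x = np.array([8, 11, 12, 12, 15, 16, 16, 18], dtype=float)

# SLR.3 holds as long as the sample values of x are not all identical,
# i.e. the sample standard deviation is strictly positive.
std_x = x.std(ddof=1)
print(f"sample std of x: {std_x:.3f}")
if std_x > 0:
    print("SLR.3 satisfied: there is variation in x.")
else:
    print("SLR.3 violated: x is constant, the OLS slope cannot be computed.")
```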
Simple linear regression model

SLR.4 (Zero Conditional Mean) – The error u has an expected value of zero given any value of the explanatory variable. In other words,

$$E(u \mid x) = 0$$

This assumption can be violated if there are important unobserved factors, measurement error, reverse causality or non-linearity. For now, it is sufficient to know that the simple linear regression model relies on this assumption in order to produce unbiased estimators.
Simple linear regression model

 If assumptions SLR.1-SLR.4 hold, we know that the OLS estimators are unbiased. That is, across repeated samples their expected value equals the true parameters of the population regression function (PRF). But is that sufficient?

[Figure omitted; source: Encyclopedia of Social Measurement, 2005.]


Simple linear regression model

 If we add one more assumption, we ensure that the OLS estimators are not only unbiased, but also have the smallest variance in the class of all linear unbiased estimators.

SLR.5 (Homoskedasticity) – The error, u, has the same variance given any value of the explanatory variable. In other words,

$$Var(u \mid x) = \sigma^2$$
Simple linear regression model

 Recall that the homoskedasticity assumption, also called the "constant variance" assumption, i.e. $Var(u \mid x) = \sigma^2$, is not the same as the zero conditional mean assumption, i.e. $E(u \mid x) = 0$. The former refers to the variance of u given x while the latter refers to its expected value.
 Essentially, the homoskedasticity assumption says that the variance of the error term does not vary with x.
Simple linear regression model

 The assumptions presented, SLR.1-SLR.5, are frequently called the Gauss-Markov assumptions, after the mathematicians Carl Friedrich Gauss and Andrey Markov.
 The Gauss-Markov theorem states that if SLR.1-SLR.5 hold, then the OLS estimator is the best linear unbiased estimator (BLUE). The results generalise to the multiple linear regression model with a few minor modifications.
Multiple linear regression model

 Multiple linear regression models allow us more freedom in our attempts to explore social and economic problems.
 The concept of ceteris paribus becomes operational, as we can explicitly control for other factors that might affect the outcome variable y.
 The assumption that all other factors affecting y are uncorrelated with x, which is often unrealistic, can be relaxed.
Multiple linear regression model

Suppose we have a simple linear regression model looking at the effect of education on wage:

$$wage = \beta_0 + \beta_1 educ + u$$

We would have to assume that years of experience, which is in the error term, is uncorrelated with education – a not so reasonable assumption.

With a multiple linear regression model, we can write:

$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + u$$


Multiple linear regression model

Now suppose you want to explore the effect of public spending per student (expend) on their average standardized test score (avgscore). However, in the USA, public school expenditure per student is funded by property and local income taxes, so it tends to be correlated with family income; thus we could include average family income in our regression:

$$avgscore = \beta_0 + \beta_1 expend + \beta_2 avgfinc + u$$


Multiple linear regression model

A multiple linear regression model can be written in the population as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + u$$

where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and so on.
Multiple linear regression model

A key assumption for the general multiple regression model is very similar to the one we saw in the simple linear regression model, and it can be written in terms of a conditional expectation:

$$E(u \mid x_1, x_2, \dots, x_k) = 0$$

In other words, all factors in the unobserved error term must be uncorrelated with the explanatory variables.
Multiple linear regression model

The estimated OLS equation can be written in a form similar to the simple regression case. If we have only two explanatory variables, that would be:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$

where $\hat{\beta}_0$ is the estimate of $\beta_0$, $\hat{\beta}_1$ is the estimate of $\beta_1$ and $\hat{\beta}_2$ is the estimate of $\beta_2$.
Multiple linear regression model

We use the method of OLS to choose the values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ that minimise the sum of squared residuals. That is, given n observations on y, $x_1$ and $x_2$, $\{(x_{i1}, x_{i2}, y_i) : i = 1, 2, \dots, n\}$, the estimates of $\beta_0$, $\beta_1$ and $\beta_2$ are chosen simultaneously to make the following expression as small as possible:

$$\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} \right)^2$$

The same principles can be generalised to a model with k explanatory variables.
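The objective being minimised can be written directly as a function of candidate coefficients. Below is a minimal Python sketch with made-up data; the arrays and the candidate values passed to `ssr` are purely illustrative.

```python
import numpy as np

def ssr(b0, b1, b2, y, x1, x2):
    """Sum of squared residuals for candidate coefficients (b0, b1, b2)."""
    residuals = y - b0 - b1 * x1 - b2 * x2
    return np.sum(residuals ** 2)

# Hypothetical data with n = 5 observations
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 4.2, 7.9, 8.1, 11.0])

# OLS chooses the (b0, b1, b2) that make this value as small as possible
print(ssr(0.5, 1.0, 1.0, y, x1, x2))
```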
Multiple linear regression model

Let us assume that $x_1, x_2, \dots, x_k$ are independent variables and $y$ is the dependent variable. Given a sample of n observations,

$$(x_{i1}, x_{i2}, \dots, x_{ik}, y_i), \quad i = 1, 2, \dots, n$$

the population linear regression model is given by:

$$E(y_i \mid x_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$$


Multiple linear regression model

And the sample linear regression function (or OLS regression line) is given by

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_k x_{ik}$$

Thus, the sum of squared residuals to be minimised is:

$$\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_k x_{ik} \right)^2$$
Deriving the OLS estimator

We have k+1 linear equations and k+1 unknowns $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$:

$$\frac{\partial}{\partial \hat{\beta}_0} S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k) = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik}) = 0$$

$$\frac{\partial}{\partial \hat{\beta}_1} S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k) = -2 \sum_{i=1}^{n} x_{i1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik}) = 0$$

$$\frac{\partial}{\partial \hat{\beta}_2} S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k) = -2 \sum_{i=1}^{n} x_{i2}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik}) = 0$$

$$\vdots$$

$$\frac{\partial}{\partial \hat{\beta}_k} S(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k) = -2 \sum_{i=1}^{n} x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik}) = 0$$
Deriving the OLS estimator

 The previous system of equations can be solved:
 Manually (which would demand an incredible amount of time)
 Using a matrix algebra approach (which is beyond the scope of this course)
 Through the use of econometric software (which will be our approach by the end of this course). A numerical sketch of this route follows below.
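For readers curious about what the software route looks like, here is a minimal sketch using NumPy's least-squares solver on hypothetical data; in practice a dedicated econometrics package would be used, and the data below are invented for illustration only.

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 2 regressors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([5.0, 3.0, 6.0, 2.0, 7.0, 4.0])
y  = np.array([4.0, 5.5, 9.0, 8.5, 13.0, 11.0])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solving the k+1 first-order conditions is equivalent to solving the
# normal equations X'X b = X'y; lstsq does this numerically.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta_hat (intercept, b1, b2):", beta_hat)
```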
Interpreting the OLS regression equation

Let's start with two independent variables, such that

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$

The intercept $\hat{\beta}_0$ is the predicted value of y when both $x_1$ and $x_2$ equal zero. The estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ have partial effect, or ceteris paribus, interpretations:

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2$$


Interpreting the OLS regression equation

We can generalise this to k explanatory variables in a similar fashion:

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2 + \dots + \hat{\beta}_k \Delta x_k$$

 With (plenty of!) caveats, multiple linear regression allows economists to mimic a laboratory environment where it is easy to fix each factor, even though the data collected did not explicitly hold each factor constant.
 Once we obtain the partial effects, we can also explore how y is affected by simultaneous changes in two or more independent variables, as in the sketch below.
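The sketch below illustrates the partial-effect interpretation: it computes the predicted change in y for simultaneous changes in two regressors, using made-up coefficient estimates (`b1_hat`, `b2_hat` are assumptions, not results from any actual regression).

```python
# Hypothetical OLS estimates from a fitted model with two regressors
b1_hat, b2_hat = 0.8, -1.5   # partial effects of x1 and x2

# Predicted change in y when x1 rises by 2 units and x2 rises by 1 unit,
# with all other factors held fixed (ceteris paribus)
dx1, dx2 = 2.0, 1.0
dy_hat = b1_hat * dx1 + b2_hat * dx2
print(f"predicted change in y: {dy_hat:.2f}")  # 0.8*2 - 1.5*1 = 0.10
```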
Multiple linear regression assumptions

The first two multiple linear regression (MLR) assumptions are very similar to the SLR assumptions:

MLR.1 (Linear in Parameters) – The model in the population can be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$$

where $\beta_0, \beta_1, \dots, \beta_k$ are the unknown parameters (constants) of interest and u is an unobservable random error or disturbance term. Note that this does not exclude non-linearity in the independent variables.
Multiple linear regression assumptions

MLR.2 (Random sampling) – We have a random sample of n observations, $\{(x_{i1}, x_{i2}, \dots, x_{ik}, y_i) : i = 1, 2, \dots, n\}$, following the population model in assumption MLR.1.

We cannot compute our estimates without a sample from the population, and we need to ensure that the sample we have is representative, i.e. that it represents the population. Just as in SLR.2, if our sample is biased then our estimates will also be biased.
Multiple linear regression assumptions

MLR.3 (No Perfect Collinearity) – In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

If an independent variable is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity and it cannot be estimated by OLS. The assumption does allow the independent variables to be correlated, and they often are, but it rules out perfect correlation. A small sketch of this failure follows below.
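As a hedged illustration with invented data, the sketch below constructs a regressor that is an exact linear function of another, so the cross-product matrix X'X is singular and the normal equations have no unique solution.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1 + 3.0          # exact linear function of x1: perfect collinearity
y  = np.array([2.0, 4.1, 5.9, 8.2])

X = np.column_stack([np.ones_like(x1), x1, x2])

# X'X is singular because one column of X is a linear combination of the others
try:
    np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError as err:
    print("OLS fails under perfect collinearity:", err)
```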
Multiple linear regression assumptions

The MLR.3 assumption does not exclude non-linear combinations of independent variables, such as the use of a squared variable.

MLR.4 (Zero conditional mean) – The error u has an expected value of zero given any values of the independent variables. In other words,

$$E(u \mid x_1, x_2, \dots, x_k) = 0$$

There are several ways in which this assumption might fail to hold.
Multiple linear regression assumptions

 The functional form is not specified properly.
 An omitted variable in our functional relationship is correlated with one (or more) of the independent variables.
 Measurement error.
 Simultaneous effects between regressor and regressand.

For now, we only need to worry about the first two. When MLR.4 holds, it is often said that we have exogenous explanatory variables.
Unbiasedness of OLS

If assumptions MLR.1 through MLR.4 hold, the OLS estimators are unbiased.

Theorem 3.1 (Unbiasedness of OLS) – Under assumptions MLR.1 through MLR.4, for any value of the population parameter $\beta_j$, the expected value of the OLS estimate equals that parameter. That is,

$$E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, \dots, k$$

This theorem also applies to the simple linear regression model; essentially, it formalises the unbiasedness of OLS.
Multiple linear regression assumptions

MLR.5 (Homoskedasticity) – The error u has the same variance given any values of the explanatory variables. In other words,

$$Var(u \mid x_1, x_2, \dots, x_k) = \sigma^2$$

Just as in the simple linear regression model, the variance of the unobserved error is not affected by the values of the explanatory variables. This holds for each explanatory variable in the model.
Multiple linear regression assumptions

If we consider a model explaining the wage of school teachers, we could write

$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u$$

Assumption MLR.5 (homoskedasticity) would state that

$$Var(u \mid educ, exper, tenure) = \sigma^2$$

If the variance of the error changes with any of the three explanatory variables, then heteroskedasticity is present.
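As an illustration (simulated data, not a formal test), the sketch below generates errors whose variance grows with educ, violating MLR.5, and compares the error variance across low and high values of educ; all numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

educ = rng.uniform(8, 20, size=n)
# Heteroskedastic errors: the standard deviation of u grows with educ,
# so Var(u | educ) is not constant and MLR.5 fails.
u = rng.normal(0.0, 0.5 * educ, size=n)

low  = u[educ < 12]
high = u[educ >= 16]
print("Var(u | educ < 12): ", low.var().round(2))
print("Var(u | educ >= 16):", high.var().round(2))
# Under homoskedasticity the two variances would be approximately equal.
```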
The Gauss-Markov Theorem (BLUE)

Theorem 3.4 (Gauss-Markov Theorem) – Under Assumptions MLR.1 through MLR.5, $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ are the best linear unbiased estimators (BLUE) of $\beta_0, \beta_1, \dots, \beta_k$, respectively.

Best – It has the smallest variance.
Linear – It is within the class of linear estimators.
Unbiased – $E(\hat{\beta}_j) = \beta_j$.
Estimator – A rule that can be applied to any sample to produce an estimate.
Inclusion/exclusion of regressors

When we add or remove an explanatory variable from our model, the effect (or lack thereof) depends on the relevance of the variable being included or excluded. So we have two situations of interest:

1) Inclusion of an irrelevant variable in a regression model.
2) Exclusion of a relevant variable from a regression model.


Including an irrelevant variable in a regression model

The first case, in which an irrelevant variable is included in the model, is also called overspecification of the model. It can look like this:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

In this example, $x_3$ has no effect on y and the model is overspecified; in this case, $\beta_3 = 0$. But we do not know this and estimate the model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$$
Including an irrelevant variable in a regression model

 Thankfully, the inclusion of an irrelevant regressor does not affect the unbiasedness of OLS. That is to say, if assumptions MLR.1 through MLR.4 hold, the OLS estimators of each $\beta_j$ are unbiased, i.e. $E(\hat{\beta}_j) = \beta_j$ for any value of $\beta_j$.
 In any particular sample, the estimate $\hat{\beta}_3$ will most likely not be exactly zero, but it should be close to zero and, on average across all samples, it will equal zero.
 While the OLS estimators remain unbiased, the inclusion of an irrelevant variable does affect the variance of the OLS estimators (to be seen). A simulation sketch follows below.
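A Monte Carlo sketch of this point, under assumed true parameters (the vector `beta` below is an assumption, with the coefficient on the irrelevant x3 set to zero): individual estimates of that coefficient vary around zero, but their average over many simulated samples is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([1.0, 0.5, -0.3, 0.0])   # assumed true parameters; beta_3 = 0 (x3 irrelevant)
n, reps = 200, 2_000

b3_estimates = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    y = X @ beta + rng.normal(size=n)
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    b3_estimates.append(b_hat[3])

# Individual estimates are rarely exactly zero, but their average is close to it
print("average estimate of beta_3:", np.mean(b3_estimates).round(4))
```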
Exclusion of a relevant variable in a regression model

 The second case, in which we exclude a relevant variable from the regression model, is more complex.
 We saw previously that the exclusion of a relevant variable can be problematic, especially if that variable is correlated with one of the other regressors, as this violates MLR.4 and the unbiasedness of the OLS estimators. Now we take a closer look at the process. Suppose we want to estimate the following:

$$wage = \beta_0 + \beta_1 educ + \beta_2 abil + u$$


Exclusion of a relevant variable in a regression model

 Instead of the previous model, we estimate the following:

$$wage = \beta_0 + \beta_1 educ + v$$

where $v = \beta_2 abil + u$.

The estimated equation is

$$\widetilde{wage} = \tilde{\beta}_0 + \tilde{\beta}_1 educ$$
Exclusion of a relevant variable in a regression model

 Notice the difference between "~" and "^" in the model. We use the tilde to show explicitly that we are estimating an underspecified model.
 In this case, the underspecified coefficient $\tilde{\beta}_1$ can be written as

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$$

where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the estimates from the regression that includes both educ and abil (the true model), while $\tilde{\delta}_1$ is the slope of the simple linear regression of abil on educ, i.e. ability as a function of education.
Exclusion of a relevant variable in a regression model

Since we can treat $\tilde{\delta}_1$ as non-random (it depends only on the sample values of the regressors, on which we condition) and assumptions MLR.1-MLR.4 are met, we can write:

$$E(\tilde{\beta}_1) = E(\hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1) = E(\hat{\beta}_1) + E(\hat{\beta}_2)\tilde{\delta}_1 = \beta_1 + \beta_2 \tilde{\delta}_1$$

From this, we can calculate the bias in the estimation of $\beta_1$:

$$Bias(\tilde{\beta}_1) = E(\tilde{\beta}_1) - \beta_1 = \beta_2 \tilde{\delta}_1$$
Exclusion of a relevant variable in a regression model

$$Bias(\tilde{\beta}_1) = \beta_2 \tilde{\delta}_1$$

The term on the right side of the equation is called the omitted variable bias, and we can infer the direction of the bias from the signs of both $\beta_2$ and $\tilde{\delta}_1$. Of course, if either of them equals zero, then the OLS estimate of $\beta_1$ is unbiased.

The sign of $\tilde{\delta}_1$ will depend on the correlation between the regressors educ and abil or, in general terms, between $x_1$ and $x_2$, given by $Corr(x_1, x_2)$. A simulation sketch of this bias follows below.
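Below is a simulation sketch of the omitted variable bias formula, with assumed true parameters and a correlated pair of regressors (beta0, beta1, beta2 and delta1 are assumptions chosen for illustration): the average of the underspecified estimate is close to $\beta_1 + \beta_2 \delta_1$ rather than $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, beta2 = 1.0, 0.6, 0.4   # assumed true parameters
delta1 = 0.5                          # population slope of x2 on x1
n, reps = 500, 2_000

b1_tilde = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = delta1 * x1 + rng.normal(scale=0.5, size=n)  # x2 correlated with x1
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # Underspecified regression: omit x2 and regress y on x1 only
    X = np.column_stack([np.ones(n), x1])
    b_tilde, *_ = np.linalg.lstsq(X, y, rcond=None)
    b1_tilde.append(b_tilde[1])

print("average of beta_1 tilde:   ", np.mean(b1_tilde).round(3))
print("beta_1 + beta_2 * delta_1: ", beta1 + beta2 * delta1)  # approx. 0.8
```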
Exclusion of a relevant variable in a regression model

$$Bias(\tilde{\beta}_1) = \beta_2 \tilde{\delta}_1$$

Summary of the bias in $\tilde{\beta}_1$ when $x_2$ is omitted in an underspecified model:

                   Corr(x1, x2) > 0     Corr(x1, x2) < 0
beta_2 > 0         Positive bias        Negative bias
beta_2 < 0         Negative bias        Positive bias

Note that we can tell the direction of the bias in this way, but not its magnitude.
Exclusion of a relevant variable in a regression model

$$Bias(\tilde{\beta}_1) = \beta_2 \tilde{\delta}_1$$

To summarise, if the excluded variable is not correlated with any other regressor, then there is no bias in the OLS estimates. However, if there is correlation, the direction of the bias will depend on the signs of both the true parameter and the correlation between the regressors.

Moving on to equations with three or more regressors, in general, if a relevant variable is excluded then the OLS estimators will be biased, even if the excluded variable is correlated with only one of the regressors. The direction of the bias is not as straightforward as in the case with two explanatory variables.
To do list

 Read sections 3.1-3.3 and 3.5 of Wooldridge if you haven't already.
 Our next lecture will cover goodness-of-fit from chapters 2 and 3.
 Problem set #1 has been released on Moodle. You can submit your answers up until 15th March at noon. Late submissions will not be accepted.
