
Lecture 3: Multiple Regression Model
Applied Econometrics
Dr. Le Anh Tuan

1
Introducing Multiple Regression
► The model explains variable y in terms of variables x1, x2, …, xk:

    y = β0 + β1 x1 + β2 x2 + ⋯ + βk xk + u

► Terminology:
  ► y: dependent variable, explained variable, response variable, …
  ► x1, …, xk: independent variables, explanatory variables, regressors, …
  ► β0: intercept, constant
  ► β1, …, βk: slope parameters, coefficients
  ► u: error term, disturbance, unobservables, residuals, …
2
Why Do We Need Multiple Regression?
► Control for many other factors that simultaneously affect the dependent variable
► Reduce omitted variable bias
► Explicitly hold fixed other factors that would otherwise end up in the error term u
► Once we control for a factor, the ceteris paribus condition with respect to this factor is automatically fulfilled
► Note that in a multiple regression model, our primary interest is usually still the coefficient on the main independent variable; the other regressors serve as control variables.
3
Why Do We Need Multiple Regression?
► Useful for modeling flexible functional relationships between variables
► Quadratic function: the model has two explanatory variables, income and income squared, so consumption is explained as a quadratic function of income:

    cons = β0 + β1 inc + β2 inc² + u

► Interaction terms:

    y = β0 + β1 x1 + β2 x2 + β3 (x1 · x2) + u
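A minimal sketch of how such specifications can be fitted in Python with statsmodels; the data, variable names, and coefficient values below are hypothetical and only illustrate the quadratic and interaction terms.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
inc = rng.uniform(10, 100, size=n)                       # hypothetical income data
kids = rng.integers(0, 4, size=n)                        # hypothetical second regressor
cons = 5 + 0.8 * inc - 0.003 * inc**2 + rng.normal(0, 5, size=n)
df = pd.DataFrame({"cons": cons, "inc": inc, "kids": kids})

# Quadratic specification: cons = b0 + b1*inc + b2*inc^2 + u
quad = smf.ols("cons ~ inc + I(inc**2)", data=df).fit()

# Interaction specification: cons = b0 + b1*inc + b2*kids + b3*inc*kids + u
inter = smf.ols("cons ~ inc * kids", data=df).fit()      # '*' expands to inc + kids + inc:kids

print(quad.params)
print(inter.params)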

4
Simple Regression vs. Multiple Regression
► Most of the properties of the simple regression model
directly extend to the multiple regression case
→ same principles
► Population regression model
    yi = β0 + β1 xi1 + β2 xi2 + ⋯ + βk xik + ui

► Sample regression model
    yi = β̂0 + β̂1 xi1 + β̂2 xi2 + ⋯ + β̂k xik + ûi

► Fitted values of y:
    ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + ⋯ + β̂k xik

► Residuals:
    ûi = yi − ŷi = yi − β̂0 − β̂1 xi1 − ⋯ − β̂k xik
5
OLS model

► Choose the values β̂0, β̂1, β̂2, …, β̂k such that the sum of squared residuals (SSR) is minimized:

    SSR = Σi ûi² = Σi ( yi − β̂0 − β̂1 xi1 − ⋯ − β̂k xik )²
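A minimal numpy sketch of this computation; the design matrix and data are hypothetical. The least-squares solution returned below is exactly the coefficient vector that minimizes the SSR.

import numpy as np

def ols(X, y):
    """OLS coefficients that minimize SSR = sum((y - X @ b)**2).
    X is an n x (k+1) design matrix whose first column is all ones (the intercept)."""
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution; satisfies (X'X) b = X'y
    return b_hat

# Hypothetical example data
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
print(ols(X, y))                                    # approximately [1.0, 2.0, -0.5]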

6
Properties of OLS

► Algebraic properties of OLS regression

► Deviations from the regression line (the residuals) sum up to zero
► Covariance between the residuals and the regressors is zero
► The sample averages of y and the x's lie on the regression line
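These algebraic properties follow mechanically from the OLS first-order conditions and can be checked numerically; a small sketch with hypothetical data:

import numpy as np

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b_hat
u_hat = y - y_hat

print(u_hat.sum())                       # ~0: residuals sum to zero
print(X[:, 1] @ u_hat, X[:, 2] @ u_hat)  # ~0: residuals are orthogonal to each regressor
print(y.mean(), b_hat @ X.mean(axis=0))  # equal: the point of sample averages lies on the regression line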

7
Interpreting the coefficients
► Interpretation of the multiple linear regression model: the coefficient βj tells us by how much the dependent variable changes if the j-th independent variable is increased by one unit, holding all other independent variables and the error term constant.

► The multiple linear regression model manages to hold the values of the other explanatory variables fixed even if, in reality, they are correlated with the explanatory variable under consideration.
► "Ceteris paribus" interpretation
► It still has to be assumed that unobserved factors do not change if the explanatory variables are changed.
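In symbols, the ceteris paribus interpretation can be stated compactly as follows (a standard restatement, not taken verbatim from the slides):

% Partial (ceteris paribus) effect of x_j in the multiple regression model
\Delta \hat{y} \;=\; \hat{\beta}_j \, \Delta x_j
\qquad \text{holding } x_1,\dots,x_{j-1},x_{j+1},\dots,x_k \text{ fixed,}
\qquad \text{i.e.} \qquad
\hat{\beta}_j \;=\; \frac{\partial \hat{y}}{\partial x_j}.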

8
Example
► Determinants of college GPA (fitted regression):

    predicted colGPA = β̂0 + 0.453 hsGPA + 0.0094 ACT

  where colGPA is the grade point average at college, hsGPA is the high school grade point average, and ACT is the achievement test score.

► The coefficient on hsGPA: holding ACT fixed, another point of high school GPA is associated with an increase in college GPA of 0.453 points.
► Or: if we compare two students with the same ACT score, but student A's hsGPA is one point higher, we predict student A's colGPA to be 0.453 points higher than student B's.
► Holding high school GPA fixed, one more point on the ACT is associated with an increase in college GPA of 0.0094 points.
9
Goodness of Fit

10
Goodness of Fit
► Decomposition of total variation: SST = SSE + SSR
  ► SST: total sum of squares, represents the total variation in the dependent variable
  ► SSE: explained sum of squares, represents the variation explained by the regression (the independent variables)
  ► SSR: residual sum of squares, represents the variation not explained by the regression
11
Goodness of Fit

► R-squared measures the fraction of the total variation that is explained by the regression:

    R² = SSE / SST = 1 − SSR / SST

► R² is the fraction of the sample variation in y that is explained by the independent variables (x1, x2, x3, …, xk).

► R² never decreases (and usually increases) when we include additional variables.

► Additional regressors can cause trouble, especially when we have few observations.

12
Goodness of Fit
► Econometricians came up with the adjusted R-squared, which enables a comparison of models:

    adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k − 1)

► Adjusted R² < R², which implies that as the number of variables increases, the adjusted R² increases by less than R².

► It is good practice to use the adjusted R² rather than R², because R² tends to give an overly optimistic picture of the fit of the regression, particularly when the number of explanatory variables is large compared with the number of observations.
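A small numpy sketch of these two measures; the inputs (y, fitted values, and the number of slope coefficients k) are assumed to come from an OLS fit such as the earlier sketches.

import numpy as np

def goodness_of_fit(y, y_hat, k):
    """R-squared and adjusted R-squared for a regression with k slope coefficients."""
    n = len(y)
    ssr = np.sum((y - y_hat) ** 2)            # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2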
13
Assumptions of
Multiple Linear Regression (MLR)

14
Assumptions of MLR
► Assumption MLR.1 (Linear population model). In the population, the relationship between the dependent and independent variables is linear:

    y = β0 + β1 x1 + β2 x2 + ⋯ + βk xk + u

  where β0 is the population intercept and β1, β2, …, βk are the population slope parameters.

► Assumption MLR.2 (Random sampling). The data are a random sample drawn from the population, so each data point follows the population equation. We have a random sample of size n: {(xi1, xi2, …, xik, yi) : i = 1, 2, …, n}.

15
Assumptions of MLR
► Assumption MLR.3 (no perfect collinearity). In the
sample (and therefore in the population), none of the
independent variables is constant, and there are no exact
linear relationships among the independent variables.

► Note that SLR.3 told us that there must be sample variation in x.
► Now we not only need variation in each explanatory variable, we also need that no explanatory variable is an exact linear function of the others.
► The assumption only rules out perfect collinearity/correlation between explanatory variables; imperfect correlation is allowed.

16
MLR3. no perfect collinearity
► Example of perfect collinearity: an exact relationship between regressors.
► Either shareA or shareB has to be dropped from the regression, because there is an exact linear relationship between them: shareA + shareB = 1.
  ► shareA: percentage of campaign expenditures of candidate A
  ► shareB: percentage of campaign expenditures of candidate B

► Dropping some independent variables may reduce multicollinearity (but this may lead to omitted variable bias).
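A quick numerical illustration of why perfect collinearity is a problem (hypothetical data): with shareA + shareB = 1 the design matrix loses full column rank, so the OLS coefficients are not separately identified.

import numpy as np

rng = np.random.default_rng(0)
n = 100
shareA = rng.uniform(0, 1, size=n)
shareB = 1 - shareA                          # exact linear relationship: shareA + shareB = 1

X = np.column_stack([np.ones(n), shareA, shareB])
print(np.linalg.matrix_rank(X))              # 2 instead of 3: X has linearly dependent columns
# np.linalg.inv(X.T @ X) would fail (singular matrix); dropping shareA or shareB restores full rank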

17
Assumptions of MLR
► Assumption MLR.4 (zero conditional mean of u). The error u has an expected value of zero given any values of the explanatory variables:

    E[u | x1, …, xk] = 0

► The values of the explanatory variables must contain no information about the mean of the unobserved factors.

► Note that for our random sample, MLR.4 implies E[ui | xi1, …, xik] = 0.

18
Endogenous vs Exogenous

► If an independent variable and the error term
  ► are correlated => that variable is endogenous. Endogeneity is a violation of assumption MLR.4.
  ► are uncorrelated => that variable is exogenous. MLR.4 holds if all independent variables are exogenous.

► Exogeneity is the key assumption for a causal interpretation of the regression, and for unbiasedness of the OLS estimators.

19
Unbiasedness of OLS
► Theorem 3.1 (Unbiasedness of OLS)

    Under MLR.1 – MLR.4:   E(β̂j) = βj,   j = 0, 1, …, k

► Under assumptions MLR.1 through MLR.4, the OLS estimators are unbiased.

► If we collected many random samples, OLS would not systematically overestimate or underestimate the true values: on average across samples, the estimated coefficients equal the true population values.
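A Monte Carlo sketch of what unbiasedness means in practice, with hypothetical true coefficients: averaging the OLS estimates over many simulated random samples recovers the true values.

import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 0.5, -2.0])            # hypothetical true (intercept, beta1, beta2)
n, reps = 200, 5000
estimates = np.empty((reps, 3))

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)       # regressors may be correlated (not perfectly)
    u = rng.normal(size=n)                   # E[u | x1, x2] = 0, so MLR.4 holds
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + u
    X = np.column_stack([np.ones(n), x1, x2])
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[r] = b_hat

print(estimates.mean(axis=0))                # approximately [1.0, 0.5, -2.0]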

20
Including Irrelevant Variables in a
Regression Model

► No problem for unbiasedness, because E(β̂j) = βj = 0 when xj is irrelevant in the population.

► All OLS estimates remain unbiased for any values of βj, including 0.

► Not including an important variable, in contrast, may cause a bias (omitted variable bias).

► However, including irrelevant variables reduces the accuracy of the estimated coefficients: it increases their sampling variance.
21
Omitted variable bias

► True model:        y = β0 + β1 x1 + β2 x2 + u

► Estimated (short) model:   y = β̃0 + β̃1 x1 + v   (x2 omitted)

► Assume x1 and x2 are correlated, with x2 = δ0 + δ1 x1 + e

► Substituting:      y = β0 + β1 x1 + β2 (δ0 + δ1 x1 + e) + u

► Collecting terms:  y = (β0 + β2 δ0) + (β1 + β2 δ1) x1 + (β2 e + u)

→ The estimated coefficients will be biased: the short regression estimates β1 + β2 δ1 rather than β1.

► If the omitted variable is irrelevant (β2 = 0) or uncorrelated with x1 (δ1 = 0), there is no omitted variable bias.
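A Monte Carlo sketch of the bias formula above (all parameter values hypothetical): the short regression of y on x1 alone centers on β1 + β2·δ1 rather than on β1.

import numpy as np

rng = np.random.default_rng(42)
beta1, beta2, delta1 = 1.0, 2.0, 0.5         # hypothetical true parameters
n, reps = 500, 2000
short_slope = np.empty(reps)

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = delta1 * x1 + rng.normal(size=n)    # x2 is correlated with x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # "Short" regression of y on x1 only (x2 omitted)
    short_slope[r] = np.cov(x1, y, bias=True)[0, 1] / np.var(x1)

print(short_slope.mean())                    # ~ beta1 + beta2 * delta1 = 2.0, not 1.0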

22
Variances of the OLS estimators

► Assumption MLR.5 (Homoskedasticity): the variance of the error u does not vary with the explanatory variables. More precisely,

    Var(u | x1, …, xk) = σ²

► The values of the explanatory variables must contain no information about the variability of the unobserved factors.
► Homoskedasticity = constant error variance
► Heteroskedasticity = the error variance is not constant (it changes with the explanatory variables)
23
Variances of the OLS estimators
► Theorem 3.2 (Variances of the OLS estimators)
► Under assumptions MLR.1 – MLR.5, the variance of β̂j (conditional on the explanatory variables) is

    Var(β̂j) = σ² / [ SSTj (1 − Rj²) ],   j = 1, …, k

  where SSTj is the total sample variation in explanatory variable xj, and Rj² is the R-squared from a regression of xj on all other independent variables (including a constant).
24
Variances of the OLS estimators

► The more total sample variation SSTj there is in xj, the more accurate the OLS estimate of βj will be. What can help: adding more observations increases SSTj.
► Remember that σ² is the variance of the error term u. If the variance of u falls, the accuracy of the OLS estimators increases. What can help: adding explanatory variables (taking some factors out of u).
► If xj is uncorrelated with the other independent variables, Rj² is zero. With increasing correlation between the x's, the accuracy of the OLS estimators falls. A (near-)linear relationship between the x's is called multicollinearity.
25
How to detect multicollinearity?
► Use the correlation coefficients between the explanatory variables
  ► As a rule of thumb, |Corr(x1, x2)| should not be larger than 0.7

► Multicollinearity may also be detected through "variance inflation factors" (VIF):

    VIFj = 1 / (1 − Rj²)

  ► As a rule of thumb, the variance inflation factor should not be larger than 10
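A short sketch of the VIF computation with statsmodels; the data and variable names are hypothetical, while variance_inflation_factor and add_constant are the statsmodels helpers assumed here.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_j = 1 / (1 - R_j^2); values above roughly 10 flag strong multicollinearity
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))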

26
Example: Multicollinearity
► Regress the average standardized test score of a school on expenditures for teachers, expenditures for instructional materials, and other expenditures of the school.

► The different expenditure categories will be strongly correlated, because if a school has a lot of resources it will spend a lot on everything.
► It will be hard to estimate the differential effects of different expenditure categories, because all expenditures are either high or low. For precise estimates of the differential effects, one would need information about situations where expenditure categories change differentially.
► As a consequence, the sampling variance of the estimated effects will be large.
27
Variances in misspecified models
► The choice of whether to include a particular variable in a
regression can be made by analyzing the tradeoff between
bias and variance
► True population model:                        y = β0 + β1 x1 + β2 x2 + u

► Estimated model 1 (both regressors included): ŷ = β̂0 + β̂1 x1 + β̂2 x2

► Estimated model 2 (x2 omitted):               ỹ = β̃0 + β̃1 x1
28
Variances in misspecified models
► Case 1: β2 = 0, i.e. x2 is irrelevant. Omitting x2 causes no bias, and β̃1 has a smaller variance than β̂1 (when x1 and x2 are correlated).
  Conclusion: do not include irrelevant regressors.

► Case 2: β2 ≠ 0, i.e. x2 is relevant. Omitting x2 biases β̃1, but β̃1 has a smaller variance than β̂1.
  Trade off bias and variance. Caution: the bias will not disappear even in large samples.
29
Variances of the OLS estimators
► Estimating the error variance:

    σ̂² = SSR / (n − k − 1) = ( Σi ûi² ) / (n − k − 1)

► An unbiased estimate of the error variance σ² is obtained by dividing the sum of squared residuals by n − k − 1, i.e. the number of observations minus the number of estimated parameters. This quantity is also called the degrees of freedom. The n estimated squared residuals in the sum are not completely independent: they are related through the k + 1 equations that define the first-order conditions of the minimization problem.

► Theorem 3.3 (Unbiased estimator of the error variance): under assumptions MLR.1 – MLR.5, E(σ̂²) = σ².
30
Theorem 2.3
► Calculation of standard errors for regression coefficients:

    sd(β̂j) = σ / sqrt( SSTj (1 − Rj²) )      — the true sampling variation of the estimated β̂j
    se(β̂j) = σ̂ / sqrt( SSTj (1 − Rj²) )      — the estimated sampling variation of β̂j, plugging in σ̂ for the unknown σ

► Note that these formulas are only valid under assumptions MLR.1 – MLR.5 (in particular, there has to be homoskedasticity).
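A numpy sketch of the standard-error computation; it uses the matrix form Var(β̂) = σ̂²(X'X)⁻¹, which for the slope coefficients is equivalent to the formula above.

import numpy as np

def ols_standard_errors(X, y):
    """OLS coefficients and their (homoskedasticity-only) standard errors.
    X: n x (k+1) design matrix with a leading column of ones."""
    n, kp1 = X.shape
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    u_hat = y - X @ b_hat
    sigma2_hat = u_hat @ u_hat / (n - kp1)            # SSR / (n - k - 1)
    cov_b = sigma2_hat * np.linalg.inv(X.T @ X)       # Var(b_hat) = sigma2_hat * (X'X)^-1
    return b_hat, np.sqrt(np.diag(cov_b))             # se(b_hat_j) for each coefficient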

31
Gauss-Markov Theorem
► The Gauss–Markov theorem shows that, within the class of linear unbiased estimators, OLS is the one with the smallest variance.

► Theorem (Gauss–Markov). Under assumptions MLR.1 through MLR.5, the OLS estimator is the best linear unbiased estimator (BLUE) of the regression coefficients.
32
Gauss-Markov Theorem
On the meaning of BLUE:
► Best: "best" means the one with the lowest variance
► Linear: the estimator can be written as a linear function of the data on the dependent variable; OLS can be shown to be a linear estimator
► Unbiased: the expected value of the estimator β̂j equals the true value βj
► Estimator: a rule that can be applied to any sample of data to produce estimates

33
Exercise

► Reading Chapter 4 – Hypothesis Testing

► Propose 3-5 initial ideas.

► Prepare Research Proposal.

34
