
Review: Multiple Regression: Holding The Other Explanatory Variables Constant or Fixed

The document discusses multiple regression analysis. Multiple regression allows modeling of a dependent variable as a function of more than one independent variable. It defines the general multiple regression model and explains how to interpret the slope coefficients in multiple regression compared to single regression. It also discusses omitted variable bias, which can occur when an important predictor variable is omitted from the regression model.


Review: multiple regression

• Definition: a regression with more than one independent variable


• The general multiple regression model with K independent variables is:

Yi = β0 + β1X1i + β2X2i + ... + βKXKi + εi

(Lecture 7. The lecture is based on teaching material from So Yoon Ahn.)

How to interpret β?

• A big difference between the multiple and single regression models is in the interpretation of the slope coefficients
• A slope coefficient now indicates the change in the average of the dependent variable associated with a one-unit increase in that explanatory variable, holding the other explanatory variables constant or fixed
• Example: LifeExpectancy = b0 + b1GDP + b2Population + e
• b1: conditional on population, a one-unit increase in GDP is associated with a b1-unit change in life expectancy
• b2: conditional on GDP, a one-unit increase in population is associated with a b2-unit change in life expectancy

Review: omitted variable bias

• The error e arises because of factors that influence Y but are not included in the regression function; so there are always omitted variables
• Sometimes, the omission of those variables can lead to bias in the OLS estimator
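To make the "holding the others fixed" interpretation concrete, here is a minimal simulation sketch. The numbers (true coefficients 40, 2, 0.1 and the variable distributions) are made up for illustration; the point is that OLS with both regressors recovers each slope as a conditional effect.

```python
# Hypothetical data: LifeExpectancy = 40 + 2*GDP + 0.1*Population + noise.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gdp = rng.normal(10, 2, n)
pop = rng.normal(50, 5, n)
life = 40 + 2.0 * gdp + 0.1 * pop + rng.normal(0, 1, n)

# OLS via least squares: columns are constant, GDP, Population.
X = np.column_stack([np.ones(n), gdp, pop])
b = np.linalg.lstsq(X, life, rcond=None)[0]
print(b)  # b[1] ≈ 2.0: effect of GDP conditional on Population
```

Each estimated slope is close to its true value because the regression controls for the other variable rather than absorbing its influence.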
Review: omitted variable bias

• The bias in the OLS estimator that occurs as a result of an omitted factor is called omitted variable bias. For omitted variable bias to occur, the omitted factor "Z" must be:
1. A determinant of Y (i.e. Z is part of e); and
2. Correlated with the regressor X (i.e. corr(Z, X) ≠ 0)
• Both conditions must hold for the omission of Z to result in omitted variable bias.

Omitted variable bias

• Suppose our "correct" model is: Yi = β0 + β1X1i + β2X2i + εi
• But the researcher mistakenly ran the regression: Yi = α0 + α1X1i + εi
• We hope that E(α̂1) = β1
• But E(α̂1) = β1 + β2 · Cov(X1, X2) / Var(X1)
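The bias formula can be checked numerically. This sketch uses made-up coefficients (β0 = 1, β1 = 2, β2 = 3) and a made-up dependence of X2 on X1; the slope from the "short" regression of Y on X1 alone lands on β1 + β2·Cov(X1, X2)/Var(X1), not on β1.

```python
# Simulated check of E(alpha1_hat) = beta1 + beta2 * Cov(X1,X2)/Var(X1).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.normal(0, 1, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)          # X2 correlated with X1
beta0, beta1, beta2 = 1.0, 2.0, 3.0
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(0, 1, n)

# "Short" regression that mistakenly omits X2: slope of Y on X1 alone.
alpha1 = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)
print(alpha1, predicted)   # both ≈ 2 + 3 * 0.5 = 3.5, not 2
```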

Omitted variable bias: direction

• Suppose you regress Y on X, but omit variable Z. How would the coefficient estimate on X be biased under the following scenarios?

• 𝑍 is negatively correlated with both 𝑋 and 𝑌

• 𝑍 is positively correlated with 𝑋, but negatively correlated with 𝑌

• 𝑍 is not correlated with 𝑋, but positively correlated with 𝑌
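One way to reason about these scenarios is through the bias formula: the bias is βZ · Cov(X, Z)/Var(X). The sketch below simulates each case, assuming (as the scenarios suggest) that the sign of Z's coefficient in the true model matches the sign of its correlation with Y; the true coefficient on X is set to 1, so any gap from 1 is bias.

```python
# Sign of omitted variable bias in the three scenarios (simulated).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def short_slope(corr_xz, beta_z):
    """Slope from regressing Y on X alone when Z (coefficient beta_z) is omitted."""
    x = rng.normal(0, 1, n)
    z = corr_xz * x + rng.normal(0, 1, n)
    y = 1.0 * x + beta_z * z + rng.normal(0, 1, n)   # true beta_x = 1
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(short_slope(-0.5, -1.0))  # Z neg-corr with X and Y: biased upward (≈ 1.5)
print(short_slope(+0.5, -1.0))  # Z pos-corr with X, neg with Y: biased downward (≈ 0.5)
print(short_slope(0.0, +1.0))   # Z uncorrelated with X: no bias (≈ 1.0)
```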


Dummy variable

• A dummy variable is a variable that takes on the value 1 or 0
• Examples: male (= 1 if male, 0 otherwise), south (= 1 if in the south, 0 otherwise), marital status (= 1 if married, 0 otherwise), etc.
• Dummy variables are also called binary variables

A dummy independent variable

• Consider a simple model with one continuous variable (x) and one dummy (d)
• y = β0 + β1d + β2x + u
• This can be interpreted as an intercept shift
• If d = 0, then y = β0 + β2x + u
• If d = 1, then y = (β0 + β1) + β2x + u
• The case of d = 0 is the base group
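The intercept-shift reading can be seen in a quick sketch (the true values 2, 3, 0.5 are made up): the fitted line for d = 1 sits β1 above the line for d = 0, with the same slope.

```python
# Fit y = b0 + b1*d + b2*x on simulated data with true (b0, b1, b2) = (2, 3, 0.5).
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n)            # dummy: 0 or 1
y = 2.0 + 3.0 * d + 0.5 * x + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), d, x])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b0, b1, b2)    # intercept for d=0 is b0; for d=1 it is b0 + b1
```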

Example of β1 > 0

[Figure: two parallel lines with common slope β2. The d = 1 line, y = β0 + β1 + β2x, lies β1 above the d = 0 line, y = β0 + β2x; the intercepts are β0 + β1 and β0.]

Dummies for Multiple Categories

• We can use dummy variables to control for something with multiple categories
• Suppose everyone in your data is either a HS dropout, HS grad only, or college grad
Multiple Categories (cont)

• Any categorical variable can be turned into a set of dummy variables
• Education: 0 if HS dropout, 1 if HS grad, and 2 if college grad. Recode it into three dummy variables:
• less_HS: 1 if education = 0, 0 otherwise
• HS_grad: 1 if education = 1, 0 otherwise
• college_or_above: 1 if education = 2, 0 otherwise
• If there are a lot of categories, it may make sense to group some together
• Example: top 10 ranking, 11-25, etc.

Dummy variable trap

• If we run Health = β0 + β1Less_HS + β2HS_grad + β3College_or_above + e
• The regression will not work
• This is called perfect multicollinearity

Constant  less_HS  HS_grad  college  health status
1         1        0        0        3
1         0        1        0        4
1         0        1        0        4
1         0        0        1        3
1         0        0        1        2
1         1        0        0        1
1         0        1        0        4

• Less_HS + HS_grad + college_or_above = constant = 1 in every row, so we cannot estimate the regression!
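A sketch of why the regression "will not work": with all three education dummies plus a constant, the design matrix (the seven rows from the table above) is rank-deficient, so OLS has no unique solution.

```python
# Perfect multicollinearity: the dummy columns sum to the constant column.
import numpy as np

X = np.array([
    # const, less_HS, HS_grad, college
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
])
# less_HS + HS_grad + college equals the constant column in every row:
print((X[:, 1] + X[:, 2] + X[:, 3] == X[:, 0]).all())  # True
print(np.linalg.matrix_rank(X))  # 3, not 4: the four columns are not independent
```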

How to get out of the dummy variable trap (1)?

• There are two ways of getting out of the dummy variable trap
• Way #1: omit one category of the dummy variables
• e.g. if we run the regression:
• Health = β0 + β1HS_grad + β2College_or_above + e
• We omit the HS-dropout dummy and treat it as the baseline. The categories we include are compared to the category we exclude
• How do we interpret β1?
• The average health status of HS grads relative to HS dropouts.
• It is the difference in average health between HS grads and HS dropouts
• Exercise: interpret β0, β2

How to get out of the dummy variable trap (2)?

• Way #2: omit the constant term
• Health = β1Less_HS + β2HS_grad + β3College_or_above + e
• How does this regression differ from the previous one?
• How do we interpret β1?
• The average health status of HS dropouts.
• Exercise: interpret β2 and β3
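Way #1 can be verified directly on the seven observations from the earlier table: with the less_HS dummy omitted and a constant kept, the intercept equals the dropout group's mean health and each slope equals that group's mean minus the dropout mean.

```python
# Coefficients from Way #1 equal group-mean differences (data from the table).
import numpy as np

less_hs = np.array([1, 0, 0, 0, 0, 1, 0])
hs_grad = np.array([0, 1, 1, 0, 0, 0, 1])
college = np.array([0, 0, 0, 1, 1, 0, 0])
health  = np.array([3, 4, 4, 3, 2, 1, 4], dtype=float)

# Omit less_HS, keep the constant.
X = np.column_stack([np.ones(7), hs_grad, college])
b0, b1, b2 = np.linalg.lstsq(X, health, rcond=None)[0]

base_mean = health[less_hs == 1].mean()   # dropout mean = (3 + 1)/2 = 2
print(b0, b1, b2)  # b0 = dropout mean; b1 = HS-grad mean minus dropout mean
```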
Exercise

• Now we are interested in knowing how GDP is related to different quarters. We have the following data:

Constant  Quarter1  Quarter2  Quarter3  Quarter4  GDP
1         1         0         0         0         1342
1         0         1         0         0         1654
1         0         0         1         0         1565
1         0         0         0         1         1807

• Design a regression to accomplish the task and interpret the βs. You can use either one of the two methods to get out of the dummy variable trap
• GDP = β0 + β1Quarter2 + β2Quarter3 + β3Quarter4 + e
• GDP = β1Quarter1 + β2Quarter2 + β3Quarter3 + β4Quarter4 + e

Review

• (1) When there are only dummy independent variables
• Differences in means of the dependent variable across the different groups
• (2) When there are dummy and continuous independent variables
• Allowing different intercepts for different groups
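A sketch of the second specification (omit the constant) on the four rows above: each coefficient is then simply the average GDP of its quarter, which with one observation per quarter is that quarter's GDP value.

```python
# GDP on all four quarter dummies, no constant: coefficients = quarter means.
import numpy as np

X = np.eye(4)                                   # Quarter1..Quarter4 dummies
gdp = np.array([1342, 1654, 1565, 1807], dtype=float)

b = np.linalg.lstsq(X, gdp, rcond=None)[0]
print(b)  # [1342. 1654. 1565. 1807.]
```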

F-test

• With a multiple regression
Yi = β0 + β1X1i + ... + βkXki + ei,
we can still perform statistical inference using a p-value or confidence interval to determine whether a specific β is statistically different from 0 (or from another number)
• Now, if we want to test whether a group of variables jointly has any effect on Y, we use the F-test.
• H0: β1 = β2 = ... = βk = 0
• H1: at least one of the βs is not zero

F-test

• Let STR = student-teacher ratio, Expn = expenditures per pupil, and PctEL = percent of English learners
• Consider the population regression model:
TestScorei = β0 + β1STRi + β2Expni + β3PctELi + ui
• The null hypothesis that "school resources don't matter," and the alternative that they do, correspond to:
• H0: β1 = 0 and β2 = 0
• H1: either β1 ≠ 0 or β2 ≠ 0, or both

F-statistic

• The F-statistic tests all parts of a joint hypothesis at once: it is a test of a joint hypothesis
• A joint hypothesis specifies a value for two or more coefficients; that is, it imposes a restriction on two or more coefficients.
• In general, a joint hypothesis will involve q restrictions. In the example above, q = 2, and the two restrictions are β1 = 0 and β2 = 0.

The "restricted" and "unrestricted" regressions

• Example: are the coefficients on STR and Expn zero?
• Unrestricted population regression (under H1):
TestScorei = β0 + β1STRi + β2Expni + β3PctELi + ui
• Restricted population regression (that is, under H0):
TestScorei = β0 + β3PctELi + ui (why?)
• The number of restrictions under H0 is q = 2 (why?).
• The fit will be better (R2 will be higher) in the unrestricted regression (why?)
• By how much must the R2 increase for the coefficients on Expn and PctEL to be judged statistically significant?

Simple formula for the F-statistic:

F = [(R2unrestricted − R2restricted) / q] / [(1 − R2unrestricted) / (n − kunrestricted − 1)]

where:
• R2restricted = the R2 for the restricted regression
• R2unrestricted = the R2 for the unrestricted regression
• q = the number of restrictions under the null
• kunrestricted = the number of regressors in the unrestricted regression
• The bigger the difference between the restricted and unrestricted R2s (that is, the greater the improvement in fit from adding the variables in question), the larger is the F.
Example: F-statistic

Restricted regression:
Test score = 644.7 − 0.671 PctEL, R2restricted = 0.4149
            (1.0)    (0.032)

Unrestricted regression:
Test score = 649.6 − 0.29 STR + 3.87 Expn − 0.656 PctEL
            (15.5)   (0.48)     (1.59)      (0.032)
R2unrestricted = 0.4366, kunrestricted = 3, q = 2

so F = [(0.4366 − 0.4149) / 2] / [(1 − 0.4366) / (420 − 3 − 1)] = 8.01

F-statistic: summary

• The F-statistic rejects when adding the two variables increases the R2 by "enough", that is, when adding the two variables improves the fit of the regression by "enough"
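The arithmetic in the example above can be reproduced in a few lines:

```python
# F-statistic from the restricted and unrestricted R-squared values.
r2_u, r2_r = 0.4366, 0.4149   # unrestricted and restricted R2
q, n, k = 2, 420, 3           # restrictions, sample size, unrestricted regressors

F = ((r2_u - r2_r) / q) / ((1 - r2_u) / (n - k - 1))
print(round(F, 2))  # 8.01
```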

Example of F-test

• Healthi = β0 + β1Educi + β2Agei + β3malei + ei
• I want to know whether education, age and gender jointly affect health
• H0: β1 = β2 = β3 = 0
• What is H1?
• Where is the F-test in STATA output?
• Top right panel
• Like the t-stat, the F-stat follows a specific distribution; a bigger F-stat means stronger evidence against H0
• "Prob > F" is like the p-value in a t-test. We can compare "Prob > F" with α to decide whether to reject H0
