Review: Multiple Regression: Holding The Other Explanatory Variables Constant or Fixed
Multiple Categories (cont)
• Any categorical variable can be turned into a set of dummy variables
• Education: 0 if HS dropout, 1 if HS grad, and 2 if college grad. Recode it into three dummy variables:
• less_HS: 1 if education=0, 0 otherwise
• HS_grad: 1 if education=1, 0 otherwise
• college_or_above: 1 if education=2, 0 otherwise
• If there are a lot of categories, it may make sense to group some together
• Example: top 10 ranking, 11 – 25, etc.

Dummy variable trap
• If we run Health = β0 + β1 Less_HS + β2 HS_grad + β3 College_or_above + e
• The regression will not work
• This is called perfect multicollinearity
Constant  less_HS  HS_grad  college  health status
1         1        0        0        3
1         0        1        0        4
1         0        1        0        4
1         0        0        1        3
1         0        0        1        2
1         1        0        0        1
1         0        1        0        4
Less_HS + HS_grad + college_or_above = constant = 1
Cannot estimate the regression!
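A quick check of the trap, as a minimal sketch in Python. The rows are copied from the table on the slide; the `matrix_rank` helper is a hypothetical illustration written for this note, not part of the course material. It shows that the four columns (constant plus all three dummies) only have rank 3, which is why the four coefficients cannot be estimated separately.

```python
from fractions import Fraction

# Rows from the slide: constant, less_HS, HS_grad, college
X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact fractions (illustrative helper)."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Look for a non-zero pivot at or below the current rank row
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new information
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# The three dummies add up to the constant column in every row
dummies_sum_to_constant = all(r[1] + r[2] + r[3] == r[0] for r in X)

rank_full = matrix_rank(X)                                   # constant + 3 dummies
rank_drop_one = matrix_rank([[r[0], r[2], r[3]] for r in X])  # less_HS omitted

print(dummies_sum_to_constant, rank_full, rank_drop_one)
```

With all four columns the rank is below the number of coefficients; once one dummy is dropped, the remaining three columns have full rank.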
How to get out of dummy variable trap (1)?
• There are two ways of getting out of the dummy variable trap
• Way #1: omit one category of the dummy variables
• e.g. if we run a regression:
• Health = β0 + β1 HS_grad + β2 College_or_above + e
• We omit the HS dropouts dummy and treat it as the baseline. The categories we include are compared to the category we exclude
• How do we interpret β1?
• The average health status of HS grads relative to HS dropouts.
• It is the difference in average health between HS grads and HS dropouts
• Exercise: Interpret β0, β2

How to get out of dummy variable trap (2)?
• Way #2: omit the constant term
• Health = β1 Less_HS + β2 HS_grad + β3 College_or_above + e
• How does this regression differ from the previous?
• How do we interpret β1?
• The average health status of HS dropouts.
• Exercise: Interpret β2 and β3
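Both ways out can be tried on the seven health observations from the dummy variable trap table. This is a minimal sketch, not the course's own code: the `ols` helper is a hypothetical function that solves the normal equations exactly, and it assumes the regressors have full column rank.

```python
from fractions import Fraction

# (less_HS, HS_grad, college, health) copied from the dummy variable trap table
data = [
    (1, 0, 0, 3), (0, 1, 0, 4), (0, 1, 0, 4), (0, 0, 1, 3),
    (0, 0, 1, 2), (1, 0, 0, 1), (0, 1, 0, 4),
]
y = [row[3] for row in data]

def ols(X, y):
    """Solve (X'X) b = X'y by Gauss-Jordan elimination (illustrative helper)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[Fraction(sum(r[i] * r[j] for r in X)) for j in range(k)]
         + [Fraction(sum(r[i] * yi for r, yi in zip(X, y)))]
         for i in range(k)]
    for c in range(k):
        p = next(r for r in range(c, k) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]          # scale pivot row to 1
        for r in range(k):
            if r != c:
                A[r] = [a - A[r][c] * b for a, b in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Way #1: constant + HS_grad + college (HS dropouts are the baseline)
way1 = ols([[1, hs, col] for _, hs, col, _ in data], y)

# Way #2: no constant, all three dummies
way2 = ols([[less, hs, col] for less, hs, col, _ in data], y)

print([float(b) for b in way1])  # baseline mean, then differences from baseline
print([float(b) for b in way2])  # the three group means
```

In Way #1 the constant equals the average health of HS dropouts and each slope is a difference from that baseline; in Way #2 each coefficient is simply its group's average.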
Exercise
• Now we are interested in knowing how GDP is related to different quarters. We have the following data:
Constant  Quarter1  Quarter2  Quarter3  Quarter4  GDP
1         1         0         0         0         1342
1         0         1         0         0         1654
1         0         0         1         0         1565
1         0         0         0         1         1807
• Design a regression to accomplish the task and interpret the βs. You can use either one of the two methods to get out of the dummy variable trap
• GDP = β0 + β1 Quarter2 + β2 Quarter3 + β3 Quarter4 + e
• GDP = β0 Quarter1 + β1 Quarter2 + β2 Quarter3 + β3 Quarter4 + e

Review
• (1) When there are only dummy independent variables
• Differences in the mean of the dependent variable for different groups
• (2) When there are dummy and continuous independent variables
• Allowing different intercepts for different groups
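For the quarterly GDP exercise, the two regressions can be checked without running a regression at all, because with only dummy regressors the OLS coefficients are group means and differences in group means. The sketch below uses the four GDP values from the table; the dictionary names are made up for illustration.

```python
# GDP observations from the slide, grouped by quarter
gdp_by_quarter = {
    "Quarter1": [1342],
    "Quarter2": [1654],
    "Quarter3": [1565],
    "Quarter4": [1807],
}

# Mean GDP within each quarter
means = {q: sum(v) / len(v) for q, v in gdp_by_quarter.items()}

# No-constant regression: each coefficient is that quarter's mean GDP
no_constant = dict(means)

# Regression with a constant and Quarter1 omitted:
# the constant is the Quarter1 mean, the other coefficients are differences from it
baseline = means["Quarter1"]
with_constant = {"constant": baseline}
with_constant.update(
    {q: m - baseline for q, m in means.items() if q != "Quarter1"}
)

print(with_constant)
print(no_constant)
```

The two specifications carry the same information: adding the constant to a quarter's coefficient in the first regression recovers that quarter's coefficient in the second.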
F-test
• With a multiple regression
Yi = β0 + β1X1i + ⋯ + βkXki + ei,
we can still perform statistical inference using a p-value or confidence interval to determine if a specific β is statistically different from 0 (or another number)
• Now if we want to test whether a group of variables jointly have any effect on Y, we will use the F-test.
• H0: β1 = β2 = ... = βk = 0
• H1: at least one of the βs is not zero

F-test
• Let STR = student teacher ratio, Expn = expenditures per pupil, and PctEL = percent of English learners
• Consider the population regression model:
• TestScorei = b0 + b1STRi + b2Expni + b3PctELi + ui
• The null hypothesis that “school resources don’t matter,” and the alternative that they do, correspond to:
• H0: b1 = 0 and b2 = 0
• H1: either b1 ≠ 0 or b2 ≠ 0 or both
F-test
• H0: b1 = 0 and b2 = 0
• H1: either b1 ≠ 0 or b2 ≠ 0 or both
• A joint hypothesis specifies a value for two or more coefficients, that is, it imposes a restriction on two or more coefficients.
• In general, a joint hypothesis will involve q restrictions. In the example above, q = 2, and the two restrictions are b1 = 0 and b2 = 0.

F-statistic
• The F-statistic tests all parts of a joint hypothesis at once: test of joint hypothesis
The “restricted” and “unrestricted” regressions
Example: are the coefficients on STR and Expn zero?
Unrestricted population regression (under H1):
TestScorei = b0 + b1STRi + b2Expni + b3PctELi + ui
Restricted population regression (that is, under H0):
TestScorei = b0 + b3PctELi + ui (why?)
• The number of restrictions under H0 is q = 2 (why?).
• The fit will be better (R2 will be higher) in the unrestricted regression (why?)
By how much must the R2 increase for the coefficients on STR and Expn to be judged statistically significant?

Simple formula for the F-statistic:
$$F = \frac{(R^2_{\text{unrestricted}} - R^2_{\text{restricted}})/q}{(1 - R^2_{\text{unrestricted}})/(n - k_{\text{unrestricted}} - 1)}$$
where:
$R^2_{\text{restricted}}$ = the R2 for the restricted regression
$R^2_{\text{unrestricted}}$ = the R2 for the unrestricted regression
q = the number of restrictions under the null
$k_{\text{unrestricted}}$ = the number of regressors in the unrestricted regression.
• The bigger the difference between the restricted and unrestricted R2’s (that is, the greater the improvement in fit from adding the variables in question), the larger is the F.
Example:
Restricted regression:
Test score = 644.7 – 0.671PctEL,   $R^2_{\text{restricted}}$ = 0.4149
             (1.0)   (0.032)
Unrestricted regression:
Test score = 649.6 – 0.29STR + 3.87Expn – 0.656PctEL
             (15.5)  (0.48)    (1.59)     (0.032)
$R^2_{\text{unrestricted}}$ = 0.4366, $k_{\text{unrestricted}}$ = 3, q = 2
so
$$F = \frac{(R^2_{\text{unrestricted}} - R^2_{\text{restricted}})/q}{(1 - R^2_{\text{unrestricted}})/(n - k_{\text{unrestricted}} - 1)} = \frac{(0.4366 - 0.4149)/2}{(1 - 0.4366)/(420 - 3 - 1)} = 8.01$$

F-statistic – summary
$$F = \frac{(R^2_{\text{unrestricted}} - R^2_{\text{restricted}})/q}{(1 - R^2_{\text{unrestricted}})/(n - k_{\text{unrestricted}} - 1)}$$
• The F-statistic rejects when adding the two variables increases the R2 by “enough”, that is, when adding the two variables improves the fit of the regression by “enough”
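The test score example (R2 of 0.4149 and 0.4366, n = 420, k = 3, q = 2) can be reproduced in a few lines of Python. This is a sketch under stated assumptions: `f_statistic` is a hypothetical helper name, and the p-value line is not from the slides. It relies on the large-sample approximation that q·F is chi-squared with q degrees of freedom, which for q = 2 gives the closed form exp(−F); it does not apply to other values of q.

```python
import math

def f_statistic(r2_unrestricted, r2_restricted, q, n, k_unrestricted):
    """F-statistic from the R-squared of the restricted and unrestricted regressions."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator

# Numbers from the test score example on the slide
F = f_statistic(0.4366, 0.4149, q=2, n=420, k_unrestricted=3)

# Assumption: large-sample approximation, valid only for q = 2,
# where P(chi-squared with 2 df > 2F) = exp(-F)
p_approx = math.exp(-F)

print(round(F, 2))   # 8.01
print(p_approx)      # a very small probability, so H0 is rejected at usual levels
```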
Example of F-test
• Healthi = β0 + β1Educi + β2Agei + β3malei + ei
• I want to know whether education, age and gender jointly affect health
• H0: β1 = β2 = β3 = 0
• What is H1?
• Where is the F-test in Stata output?
• Top right panel
• Like the t-stat, the F-stat follows a specific distribution; a bigger F-stat means stronger evidence against H0
• Prob>F is like the p-value in a t-test. We can look at “Prob>F” and compare it with α to decide whether to reject H0
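As a worked connection to the R2 formula for the F-statistic given earlier: under this H0 all three slopes are zero, so the restricted regression contains only the constant and its R2 is 0. Assuming the same R2-based (homoskedasticity-only) formula applies, with q = k = 3 it reduces to the expression below, where $R^2$ is the R2 of the full health regression and n is the sample size.

```latex
F = \frac{(R^2 - 0)/3}{(1 - R^2)/(n - 3 - 1)}
  = \frac{R^2/3}{(1 - R^2)/(n - 4)}
```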