Lecture Set 4
The elements of the model stand, in the model, for the objects. The model consists in the fact that its
elements are combined with one another in a definite way.
- Choy, Keen Meng (2020) Tractatus Modellus-Philosophicus, mimeo
Outline
1. Hypothesis tests and confidence intervals for one coefficient
2. Joint hypothesis tests on multiple coefficients
3. Other types of hypotheses involving multiple coefficients
4. Confidence sets for multiple coefficients
5. Model specification: how to decide which variables to include
in a regression model
Hypothesis Tests and Confidence Intervals for a
Single Coefficient (SW Section 7.1)
• Hypothesis tests and confidence intervals for a single coefficient
in multiple regression follow the same logic and recipe as for
the slope coefficient in a single-regressor model.
• $\dfrac{\hat\beta_1 - E(\hat\beta_1)}{\sqrt{\operatorname{var}(\hat\beta_1)}}$ is approximately distributed N(0,1) (CLT).
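As a rough illustration of this recipe, the sketch below computes the t-statistic, a two-sided p-value from the standard normal approximation, and a 95% confidence interval for one coefficient. The function names are hypothetical; the inputs are the STR estimate and robust standard error from the California regression output reported later in this lecture. Because the sketch uses the N(0,1) critical value 1.96, its numbers can differ slightly from the STATA output, which is based on a t distribution.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def single_coef_inference(beta_hat, se, beta_null=0.0, crit=1.96):
    """Large-sample t-test and confidence interval for one coefficient.

    beta_null is the hypothesized value; crit is the two-sided 5% normal
    critical value (an assumption of this sketch, not a STATA default).
    """
    t = (beta_hat - beta_null) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(t)))
    ci = (beta_hat - crit * se, beta_hat + crit * se)
    return t, p_value, ci

# STR coefficient and robust SE from the regression output in this lecture.
t_str, p_str, ci_str = single_coef_inference(-0.2863992, 0.4820728)
print(round(t_str, 2), round(p_str, 3), tuple(round(x, 3) for x in ci_str))
```

The interval contains zero, so the hypothesis that the coefficient on STR is zero is not rejected at the 5% level in this specification.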
The null hypothesis that “school resources don’t matter,” and the
alternative that they do, corresponds to:
H0: β1 = 0 and β2 = 0
Formula for the special case of the joint hypothesis β1 = β1,0 and β2 = β2,0 in a
regression with two regressors:
$$F = \frac{1}{2}\left(\frac{t_1^2 + t_2^2 - 2\hat\rho_{t_1,t_2}\, t_1 t_2}{1 - \hat\rho_{t_1,t_2}^2}\right), \qquad F \sim F_{2,\infty} \text{ distribution}$$
where $\hat\rho_{t_1,t_2}$ estimates the correlation between $t_1$ and $t_2$.
In general, the t-statistics are correlated so the formula adjusts for this
correlation.
This adjustment is made so that under the null hypothesis, the F-statistic has an
F2, ∞ distribution whether or not the t-statistics are correlated.
Note: The formula for more than two β’s is nasty unless you use matrix algebra.
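To see how the correlation adjustment works, here is a minimal sketch of the two-restriction formula. The function name and the t-statistic values are hypothetical, chosen only for illustration; they are not taken from the test score data.

```python
def f_stat_two_restrictions(t1, t2, rho):
    """F-statistic for a joint test of two restrictions.

    t1, t2: the two individual t-statistics.
    rho: estimated correlation between t1 and t2.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 0.5 * (t1**2 + t2**2 - 2.0 * rho * t1 * t2) / (1.0 - rho**2)

# Hypothetical t-statistics, used only to show the role of the correlation.
f_uncorrelated = f_stat_two_restrictions(2.0, 2.0, 0.0)  # average of the squared t's
f_correlated = f_stat_two_restrictions(2.0, 2.0, 0.5)    # same t's, positive correlation
print(f_uncorrelated, f_correlated)
```

When the correlation is zero, the F-statistic reduces to the average of the two squared t-statistics. With positively correlated t-statistics of the same sign, the same pair of t values yields a smaller F.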
If the F-statistic is computed using the general heteroskedasticity-robust SE
formula, its large-n distribution under the null hypothesis is the F2, ∞
distribution, regardless of whether the errors are homoskedastic or
heteroskedastic.
The p-value of the F-statistic can be computed using the large-sample $F_{q,\infty}$
approximation to its distribution, where q is the number of restrictions (q = 2 in
the example above).
The p-value can be evaluated using a table of the $F_{q,\infty}$ distribution or a table of the
$\chi^2_q$ distribution, because a $\chi^2_q$-distributed random variable is q times an
$F_{q,\infty}$-distributed random variable.
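The link between the two tables can be sketched in code: the large-sample p-value of an F-statistic with q restrictions is the probability that a chi-squared variable with q degrees of freedom exceeds q times F. The function names below are hypothetical, and the chi-squared tail uses the standard closed-form sums for integer degrees of freedom so that only the Python standard library is needed.

```python
import math

def chi2_sf(x, q):
    """P(chi-squared with q degrees of freedom > x), for a positive integer q."""
    if q < 1 or int(q) != q:
        raise ValueError("q must be a positive integer")
    if x <= 0:
        return 1.0
    if q % 2 == 0:
        # Even q: exp(-x/2) * sum_{j=0}^{q/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, q // 2):
            term *= (x / 2.0) / j
            total += term
        return math.exp(-x / 2.0) * total
    # Odd q: two-sided normal tail plus a finite correction sum.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))  # equals 2 * (1 - Phi(sqrt(x)))
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (q - 1) // 2 + 1):
        total += term            # term = x^(r - 1/2) / (1 * 3 * ... * (2r - 1))
        term *= x / (2 * r + 1)
    return tail + 2.0 * density * total

def p_value_f_inf(f_stat, q):
    """Large-sample p-value of an F-statistic with q restrictions."""
    return chi2_sf(q * f_stat, q)

print(p_value_f_inf(3.00, 2))
```

For q = 2 the tail probability simplifies to exp(-F), so an F-statistic of 3.00 has a p-value just under 0.05, which is why 3.00 serves as the 5% critical value of the F2, ∞ distribution.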
Implementation in STATA
Use the “test” command after the regression
Example: Test the joint hypothesis that the population coefficients on STR and
expenditures per pupil (expn_stu) are both zero, against the alternative that at least
one of the population coefficients is nonzero.
F-test example, California class size data:
reg testscr str expn_stu pctel, r;
Regression with robust standard errors                 Number of obs =     420
                                                       F(  3,   416) =  147.20
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4366
                                                       Root MSE      =  14.353
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -.2863992   .4820728    -0.59   0.553    -1.234001     .661203
    expn_stu |   .0038679   .0015807     2.45   0.015     .0007607    .0069751
       pctel |  -.6560227   .0317844   -20.64   0.000    -.7185008   -.5935446
       _cons |   649.5779   15.45834    42.02   0.000     619.1917    679.9641
------------------------------------------------------------------------------
test str expn_stu;
NOTE: The test command follows the regression.
• A high $R^2$ (or $\bar{R}^2$) does not mean that you have eliminated omitted
variable bias.
• A high $R^2$ (or $\bar{R}^2$) does not mean that you have an unbiased estimator
of a causal effect ($\beta_1$).
• A high $R^2$ (or $\bar{R}^2$) does not mean that the included variables are
statistically significant; this must be determined using hypothesis tests.
Analysis of the Test Score Data Set
(SW Section 7.6) (1 of 3)
1. Identify the variable of interest:
STR
2. Think of the omitted causal effects that could result in omitted
variable bias
Whether the students know English; outside learning
opportunities; parental involvement; teacher quality (if teacher
salary is correlated with district wealth) – there is a long list!
Analysis of the Test Score Data Set
(SW Section 7.6) (2 of 3)
3. Include those omitted causal effects if you can or, if you can’t,
include variables correlated with them that serve as control
variables.
The control variables are effective if the conditional mean
independence (CMI) assumption plausibly holds (if u is
uncorrelated with STR once the control variables are included).
This results in a “base” or “benchmark” model.
Many of the omitted causal variables are hard to measure, so
we need to find control variables. These include PctEL (both a
control variable and an omitted causal factor) and measures of
district wealth.
Analysis of the Test Score Data Set
(SW Section 7.6) (3 of 3)
4. Also specify a range of plausible alternative models, which
include additional candidate variables.
It isn’t clear which of the income-related variables will best
control for the many omitted causal factors such as outside
learning opportunities, so the alternative specifications include
regressions with different income variables. The alternative
specifications considered here are just a starting point, not the
final word!
5. Estimate your base model and plausible alternative
specifications (“sensitivity checks”).
Presentation of regression results (1 of 2)
• We have a number of regressions and we want to report them.
It is awkward and difficult to read regressions written out in
equation form, so instead it is conventional to report them in a
table.
• A table of regression results should include:
– estimated regression coefficients
– standard errors
– measures of fit
– number of observations
– relevant F-statistics, if any
– any other pertinent information, such as confidence intervals for the
causal effect of interest
• Find this information in the following table!
Presentation of regression results (2 of 2)
Summary: testing joint hypotheses
• The “one at a time” approach of rejecting if either of the t-
statistics exceeds 1.96 rejects more than 5% of the time under the
null (the size exceeds the desired significance level)
• The heteroskedasticity-robust F-statistic is built into STATA
(“test” command); this tests all q restrictions at once.
• For q not too big and n ≥ 100, the $F_{q,n-k-1}$ distribution and the $\chi^2_q/q$
distribution are essentially identical.
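The first point of this summary can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the lecture's STATA workflow: it treats the two t-statistics as exactly standard normal under the null, and the function names, the correlation value 0.5, and the number of simulation draws are all hypothetical choices.

```python
import math
import random

def size_independent(q, alpha=0.05):
    """Rejection probability under the null when q independent t-tests
    are each run at level alpha and the null is rejected if any one rejects."""
    return 1.0 - (1.0 - alpha) ** q

def size_simulated(rho, draws=100_000, crit=1.96, seed=0):
    """Monte Carlo rejection rate of the one-at-a-time rule for two
    standard normal t-statistics with correlation rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho**2)
    rejections = 0
    for _ in range(draws):
        t1 = rng.gauss(0.0, 1.0)
        t2 = rho * t1 + scale * rng.gauss(0.0, 1.0)  # correlation rho with t1
        if abs(t1) > crit or abs(t2) > crit:
            rejections += 1
    return rejections / draws

exact_two = size_independent(2)
sim_uncorrelated = size_simulated(0.0)
sim_correlated = size_simulated(0.5)
print(exact_two, sim_uncorrelated, sim_correlated)
```

With two independent t-statistics the rule rejects about 9.75% of the time rather than 5%. In the simulated correlated case the rejection rate is still above 5%, which is the motivation for using the F-statistic instead.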
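$$F = \frac{1}{2\left(1 - \hat\rho_{t_1,t_2}^2\right)} \left[ \left(\frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)}\right)^2 + \left(\frac{\hat\beta_2 - \beta_{2,0}}{SE(\hat\beta_2)}\right)^2 - 2\hat\rho_{t_1,t_2} \left(\frac{\hat\beta_1 - \beta_{1,0}}{SE(\hat\beta_1)}\right) \left(\frac{\hat\beta_2 - \beta_{2,0}}{SE(\hat\beta_2)}\right) \right]$$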
This is a quadratic form in β1,0 and β2,0 – thus the boundary of the set F = 3.00 is
an ellipse.
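Inverting the F-statistic can be sketched as a membership check: a pair (β1,0, β2,0) belongs to the 95% confidence set if the F-statistic evaluated at that pair does not exceed 3.00. The function names are hypothetical. The coefficient estimates and standard errors come from the STATA output in this lecture, but the correlation between the t-statistics is not reported in the text, so the value rho = 0.0 below is a placeholder assumption and the results will not match the actual confidence set.

```python
def f_stat_at(b1, se1, b2, se2, rho, b1_null, b2_null):
    """F-statistic for the joint hypothesis beta1 = b1_null and beta2 = b2_null."""
    t1 = (b1 - b1_null) / se1
    t2 = (b2 - b2_null) / se2
    return 0.5 * (t1**2 + t2**2 - 2.0 * rho * t1 * t2) / (1.0 - rho**2)

def in_confidence_set(b1, se1, b2, se2, rho, b1_null, b2_null, crit=3.00):
    """True if (b1_null, b2_null) lies inside or on the ellipse F = crit."""
    return f_stat_at(b1, se1, b2, se2, rho, b1_null, b2_null) <= crit

# Estimates for str and expn_stu from the regression output in this lecture.
B1, SE1 = -0.2863992, 0.4820728
B2, SE2 = 0.0038679, 0.0015807
RHO = 0.0  # ASSUMPTION: placeholder, the true correlation is not given in the text

center_inside = in_confidence_set(B1, SE1, B2, SE2, RHO, B1, B2)
origin_inside = in_confidence_set(B1, SE1, B2, SE2, RHO, 0.0, 0.0)
print(center_inside, origin_inside)
```

The point estimates always sit at the center of the ellipse, where F = 0. Whether the origin (0, 0) falls inside depends on the correlation, which is why the check should be rerun with the estimated value before drawing conclusions.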
Confidence set based on inverting the
F-statistic
Figure 7.1: 95% Confidence Set for Coefficients on STR and Expn
from Equation (7.6)