Regression with One Regressor: Hypothesis Tests and Confidence Intervals


 Hypothesis Testing
 Confidence intervals
 Regression when X is Binary
 Heteroskedasticity and Homoskedasticity
 Weighted Least Squares
 Summary and Assessment
Hypothesis Testing and the Standard Error of β̂1
Suppose a skeptic suggests that reducing the number of students
in a class has no effect on learning or, specifically, on test scores.
The skeptic thus asserts the hypothesis

H0: β1 = 0.

We wish to test this hypothesis using data and reach a tentative
conclusion about whether it is correct or not.
Null hypothesis and two-sided alternative:

H0: β1 = 0  vs.  H1: β1 ≠ 0

or, more generally,

H0: β1 = β1,0  vs.  H1: β1 ≠ β1,0

Null hypothesis and one-sided alternative:

H0: β1 = β1,0  vs.  H1: β1 < β1,0

In economics, it is almost always possible to come up with
stories in which an effect could “go either way,” so it is
standard to focus on two-sided alternatives.
In general,

t = (estimator − hypothesized value) / standard error of the estimator,

where the standard error of the estimator is the square root of an
estimator of the variance of the estimator.

Applied to a hypothesis about β1:

t = (β̂1 − β1,0) / SE(β̂1),

where β1,0 is the hypothesized value of β1.


Formula for SE(β̂1)

Recall the expression for the variance of β̂1 (large n):

σ²β̂1 = (1/n) × var[(Xi − μX)ui] / [var(Xi)]² = (1/n) × var(vi) / (σ²X)²,

where vi = (Xi − μX)ui.

The estimator of the variance of β̂1 is

σ̂²β̂1 = (1/n) × [ (1/(n−2)) Σⁿᵢ₌₁ (Xi − X̄)² ûi² ] / [ (1/n) Σⁿᵢ₌₁ (Xi − X̄)² ]²,

where ûi is the OLS residual, and SE(β̂1) = √σ̂²β̂1.

 There is no reason to memorize this.
 It is computed automatically by regression software.
 SE(β̂1) is reported by regression software.
 It is less complicated than it seems. The numerator estimates
Var(vi), the denominator estimates Var(X).
The calculation of the t-statistic:

t = (β̂1 − β1,0) / SE(β̂1)

 Reject at the 5% significance level if |t| > 1.96.
 The p-value is p = Pr(|Z| > |t^act|), the probability in the tails of
the standard normal distribution outside |t^act|, the computed value
of the t-statistic.
 Both of the previous statements are based on the large-n
approximation; typically n = 50 is large enough for the
approximation to be excellent.
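To make the formulas concrete, here is a minimal numpy sketch (not from the slides; the data, sample size, and coefficients are simulated and purely hypothetical) that computes β̂1, the heteroskedasticity-robust SE(β̂1) from the estimator above, the t-statistic for H0: β1 = 0, and the large-n two-sided p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 420                                   # hypothetical sample size
x = rng.normal(20, 2, n)                  # e.g., student-teacher ratios
u = rng.normal(0, 1, n) * x               # heteroskedastic errors
y = 700 - 2.3 * x + u                     # hypothetical "true" line

xbar = x.mean()
b1 = np.sum((x - xbar) * (y - y.mean())) / np.sum((x - xbar) ** 2)
b0 = y.mean() - b1 * xbar
uhat = y - (b0 + b1 * x)                  # OLS residuals

# Heteroskedasticity-robust estimator of var(b1) (the formula above)
num = np.sum((x - xbar) ** 2 * uhat ** 2) / (n - 2)
den = (np.sum((x - xbar) ** 2) / n) ** 2
se_b1 = np.sqrt(num / den / n)

t = (b1 - 0) / se_b1                      # test H0: beta1 = 0
p = 2 * stats.norm.sf(abs(t))             # large-n two-sided p-value
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t:.2f}, p = {p:.4f}")
```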
Example: Test Scores and STR, California data

Estimated regression line:

TestScore-hat = 698.9 − 2.28 × STR

Regression software reports the standard errors:

SE(β̂0) = 10.4,  SE(β̂1) = 0.52

t-statistic testing H0: β1 = 0:

t = (β̂1 − β1,0) / SE(β̂1) = (−2.28 − 0) / 0.52 = −4.38

 The 1% two-sided critical value is 2.58, and |−4.38| > 2.58, so we
reject the null at the 1% significance level.
 Alternatively, we can compute the p-value.
The p-value based on the large-n standard normal approximation to
the t-statistic is 0.00001.
Confidence intervals for β1

In general, if the sampling distribution of an estimator is normal for
large n, then a 95% confidence interval can be constructed as the
estimator ± 1.96 standard errors, that is,

estimator ± 1.96 × SE(estimator).
Example: Test Scores and STR, California data

Estimated regression line:

TestScore-hat = 698.9 − 2.28 × STR

Regression software reports the standard errors:

SE(β̂0) = 10.4,  SE(β̂1) = 0.52

95% confidence interval for β1:

β̂1 ± 1.96 × SE(β̂1) = −2.28 ± 1.96 × 0.52 = (−3.30, −1.26)

Equivalent statements:
 The 95% confidence interval does not include zero.
 The hypothesis β1 = 0 is rejected at the 5% level.
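A tiny sketch of the same arithmetic, using the point estimate and standard error from the slide:

```python
b1, se_b1 = -2.28, 0.52                       # estimate and SE from the slide
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)   # large-n 95% confidence interval
rejects_zero = not (ci[0] <= 0 <= ci[1])      # zero outside the CI <=> |t| > 1.96
print(ci, rejects_zero)                       # roughly (-3.30, -1.26), True
```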
A concise (and conventional) way to report
regressions:
Put standard errors in parentheses below the estimated
coefficients to which they apply.

TestScore-hat = 698.9 − 2.28 × STR,   R² = .05, SER = 18.6
                (10.4)   (0.52)

This expression gives a lot of information.

 The estimated regression line is TestScore-hat = 698.9 − 2.28 × STR.
 The standard error of β̂0 is 10.4.
 The standard error of β̂1 is 0.52.
 The R² is .05; the standard error of the regression (SER) is 18.6.
Regression when X is Binary
Sometimes a regressor is binary:
 X = 1 if female, = 0 if male
 X = 1 if treated (experimental drug), = 0 if not
 X = 1 if small class size, = 0 if not

Binary regressors are sometimes called “dummy” variables.


So far, β1 has been called a “slope,” but that doesn’t make much
sense if X is binary.
How do we interpret a regression with a binary regressor?
Interpreting regressions with a binary regressor

Yi = β0 + β1Xi + ui,

where X is binary (Xi = 0 or 1);

that is:

When Xi = 0:  E(Yi | Xi = 0) = β0
When Xi = 1:  E(Yi | Xi = 1) = β0 + β1

so:

β1 = E(Yi | Xi = 1) − E(Yi | Xi = 0),

which is the population difference in group means.


Example: TestScore and STR, California data
Let

Di = 1 if STR < 20, Di = 0 if STR ≥ 20.

The OLS estimate of the regression line relating TestScore to D
(with standard errors in parentheses) is:

TestScore-hat = 650.0 + 7.4 × D
                (1.3)   (1.8)

Difference in means between groups = 7.4; SE = 1.8.

Compare the regression results with the group means,
computed directly:

 Ȳsmall (D = 1) = 657.4;  Ȳlarge (D = 0) = 650.0
 Estimation: Ȳsmall − Ȳlarge = 657.4 − 650.0 = 7.4
 Test: t = 7.4 / 1.8 ≈ 4.1 > 2.58, so the difference is statistically
significant at the 1% level.

This is the same as in the regression.
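A short numpy sketch (simulated, hypothetical data) verifying that OLS with a binary regressor reproduces the two group means exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.integers(0, 2, 400)                    # binary regressor (0/1)
y = 650 + 7 * d + rng.normal(0, 19, 400)       # hypothetical test scores

dbar = d.mean()
b1 = np.sum((d - dbar) * (y - y.mean())) / np.sum((d - dbar) ** 2)
b0 = y.mean() - b1 * dbar

print(b0, y[d == 0].mean())         # intercept         = sample mean of the D = 0 group
print(b0 + b1, y[d == 1].mean())    # intercept + slope = sample mean of the D = 1 group
```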


Other Regression Statistics
A natural question is how well the regression line “fits” or
explains the data. There are two regression statistics that
provide complementary measures of the quality of fit:
 The regression R2 measures the fraction of the variance of Y
that is explained by X; it is unitless and ranges between zero
(no fit) and one (perfect fit)
 The standard error of the regression measures the fit – the
typical size of a regression residual – in the units of Y.

The R²
Write Yi as the sum of the OLS prediction and the OLS residual:

Yi = Ŷi + ûi

The R² is the fraction of the sample variance of Yi “explained” by
the regression, that is, by Ŷi:

R² = ESS / TSS,

where ESS = Σⁿᵢ₌₁ (Ŷi − Ȳ)² and TSS = Σⁿᵢ₌₁ (Yi − Ȳ)².
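As an illustration (a sketch; the SER formula used here is the usual degrees-of-freedom-corrected definition, which is not written out on this slide):

```python
import numpy as np

def r2_and_ser(y, yhat):
    """Return (R^2, SER) for a fitted one-regressor model.

    R^2 = ESS / TSS as defined above; SER = sqrt(sum(uhat^2) / (n - 2)),
    the typical size of a residual in the units of Y.
    """
    uhat = y - yhat
    ess = np.sum((yhat - y.mean()) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = ess / tss
    ser = np.sqrt(np.sum(uhat ** 2) / (len(y) - 2))
    return r2, ser
```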
Heteroskedasticity and Homoskedasticity
Heteroskedasticity, Homoskedasticity, and the Formula for
the Standard Errors of β̂0 and β̂1
 What do these two terms mean?
 Consequences of homoskedasticity.
 Implication for computing standard errors.
What do these two terms mean?
If Var(u|X = x) is constant, that is, if the variance of the
conditional distribution of u given X does not depend on X,
then u is said to be homoskedastic (constant variance).
Otherwise, u is said to be heteroskedastic (non-constant variance).
Homoskedasticity in a picture

 E(u|X = x) = 0, u satisfies Least Squares Assumption #1.


 The variance of u does not depend on x.
Heteroskedasticity in a picture

 E(u|X = x) = 0, u satisfies Least Squares Assumption #1.


 The variance of u depends on x.
 Heteroskedastic or homoskedastic?
Is heteroskedasticity present in the class size
data?
Recall the three least squares assumptions:
 The conditional distribution of u given X has mean zero, that
is, E(u|X = x) = 0.
 (Xi , Yi ), i = 1, … , n, are i.i.d.
 Large outliers are rare.
Heteroskedasticity and homoskedasticity concern Var(u|X = x).
Because we have not explicitly assumed homoskedastic errors,
we have implicitly allowed for heteroskedasticity.

So far we have implicitly allowed u to be heteroskedastic.

What if the errors are in fact homoskedastic?
 You can prove that OLS has the lowest variance among
estimators that are linear in Y – a result called the Gauss-
Markov theorem.
 The formula for the variance of β̂1 and the OLS standard error
simplify. If var(ui | Xi = x) = σ²u (homoskedasticity), then

var(β̂1) = σ²u / (n σ²X)   (large n).

Note: var(β̂1) is inversely proportional to var(X); more spread
in X means more information about β1.
Under homoskedasticity, the numerator of var(β̂1), var[(Xi − μX)ui],
simplifies to σ²X σ²u, which gives the formula above.
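In display form, the simplification can be restated as follows (standard large-n algebra, written out here for clarity):

\sigma^2_{\hat\beta_1}
  = \frac{1}{n}\,\frac{\mathrm{var}\!\left[(X_i-\mu_X)u_i\right]}{\left[\mathrm{var}(X_i)\right]^2}
  \;\overset{\text{homoskedasticity}}{=}\;
  \frac{1}{n}\,\frac{\sigma_X^2\,\sigma_u^2}{(\sigma_X^2)^2}
  = \frac{\sigma_u^2}{n\,\sigma_X^2}.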
Gauss-Markov conditions:

(i)   E(ui | X1, …, Xn) = 0
(ii)  var(ui | X1, …, Xn) = σ²u, with 0 < σ²u < ∞
(iii) E(ui uj | X1, …, Xn) = 0 for i ≠ j

Gauss-Markov Theorem:
Under the Gauss-Markov conditions, the OLS estimator β̂1 is
BLUE (Best Linear Unbiased Estimator). That is,
 var(β̂1 | X1, …, Xn) ≤ var(β̃1 | X1, …, Xn) for all linear
conditionally unbiased estimators β̃1,
 where β̃1 = Σⁿᵢ₌₁ ai Yi, with weights ai that may depend on the X’s
but not on the Y’s,
 and β̃1 is a linear conditionally unbiased estimator if, in addition,
E(β̃1 | X1, …, Xn) = β1.
Proof of the Gauss-Markov Theorem

β̃1 = Σⁿᵢ₌₁ ai Yi is a linear estimator of β1, where the weights ai may
depend on the X’s but not on the Y’s.

For β̃1 to be conditionally unbiased, substitute Yi = β0 + β1Xi + ui to get
E(β̃1 | X1, …, Xn) = β0 Σᵢ ai + β1 Σᵢ aiXi, so we need

Σᵢ ai = 0  and  Σᵢ aiXi = 1.

With the above two conditions for unbiasedness, and writing the OLS
weights as ŵi = (Xi − X̄) / Σⱼ (Xj − X̄)² (which also satisfy them):

Let di = ai − ŵi; then Σᵢ di = 0 and Σᵢ diXi = 0, so Σᵢ di ŵi = 0, and

var(β̃1 | X1, …, Xn) = σ²u Σᵢ ai² = σ²u Σᵢ (ŵi + di)²
                    = σ²u (Σᵢ ŵi² + Σᵢ di²) = var(β̂1 | X1, …, Xn) + σ²u Σᵢ di².

Therefore, β̃1 has a greater conditional variance than β̂1 if di ≠ 0 for any i = 1,
… , n.

If di = 0 for all i, then β̃1 = β̂1, which proves that OLS is BLUE.
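A quick numerical check of this variance decomposition (a numpy sketch, not part of the slides): perturb the OLS weights by any d with Σdi = 0 and ΣdiXi = 0, and the conditional variance can only go up.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
w = (x - x.mean()) / np.sum((x - x.mean()) ** 2)    # OLS weights: sum(w)=0, sum(w*x)=1

# Build a perturbation d that keeps the estimator linear and unbiased:
# project random noise off span{1, x}, so that sum(d)=0 and sum(d*x)=0.
A = np.column_stack([np.ones_like(x), x])
d = rng.normal(size=30)
d = d - A @ np.linalg.lstsq(A, d, rcond=None)[0]
a = w + d                                            # alternative unbiased weights

# Conditional variance is sigma_u^2 * sum(weights^2); compare the sums:
print(np.sum(w ** 2), np.sum(a ** 2))                # OLS sum is strictly smaller
print(np.isclose(np.sum(a ** 2), np.sum(w ** 2) + np.sum(d ** 2)))  # the decomposition
```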


General formula: the standard error of β̂1 is the square root of

σ̂²β̂1 = (1/n) × [ (1/(n−2)) Σⁿᵢ₌₁ (Xi − X̄)² ûi² ] / [ (1/n) Σⁿᵢ₌₁ (Xi − X̄)² ]².

The special case under homoskedasticity is

σ̃²β̂1 = s²û / Σⁿᵢ₌₁ (Xi − X̄)²,  where s²û = (1/(n−2)) Σⁿᵢ₌₁ ûi².

 The homoskedasticity-only formula for the standard error of β̂1
and the “heteroskedasticity-robust” formula (the formula that
is valid under heteroskedasticity) differ. In general, you get
different standard errors using the different formulas.
 Homoskedasticity-only standard errors are the default setting
in regression software – sometimes the only setting (e.g.,
Excel). To get the general “heteroskedasticity-robust” standard
errors you must override the default.
 If you don’t override the default and there is in fact
heteroskedasticity, you will get the wrong standard errors (and
wrong t-statistics and confidence intervals).
 If you use the “, robust” option, STATA computes
heteroskedasticity-robust standard errors.
 Otherwise, STATA computes homoskedasticity-only standard errors.
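A short comparison of the two standard errors on simulated data, assuming the statsmodels package is available (data and numbers are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(14, 26, 400)
y = 700 - 2.3 * x + rng.normal(0, 0.5 * x)     # error spread grows with x
X = sm.add_constant(x)

default = sm.OLS(y, X).fit()                   # homoskedasticity-only SEs (the default)
robust = sm.OLS(y, X).fit(cov_type="HC1")      # heteroskedasticity-robust SEs
print(default.bse[1], robust.bse[1])           # SE(beta1_hat) under each formula
```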
Weighted Least Squares
Since OLS under homoskedasticity is efficient, the traditional
approach is to transform a heteroskedastic model into a
homoskedastic one.
Suppose the conditional variance of ui is a function of Xi:

var(ui | Xi) = h(Xi) σ².

Then we can divide both sides of the single-variable regression
model by √h(Xi) to obtain

Yi/√h(Xi) = β0 × 1/√h(Xi) + β1 × Xi/√h(Xi) + ui/√h(Xi),

where var(ui/√h(Xi) | Xi) = σ², so the transformed error is homoskedastic.

The WLS estimator is the OLS estimator obtained by regressing
Yi/√h(Xi) on 1/√h(Xi) and Xi/√h(Xi); see the sketch after this list.
 However, h(Xi) is usually unknown, so we have to estimate
h(Xi) first to obtain ĥ(Xi), and then replace h(Xi) with ĥ(Xi).
This is called feasible WLS.
 More importantly, the functional form of h(·) is usually
unknown, so there is no way to systematically estimate
h(Xi). This is why, in practice, we usually just run OLS with
robust standard errors.
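A minimal WLS sketch, assuming statsmodels and assuming the skedastic function h(Xi) = Xi is known (purely illustrative; as noted above, in practice h(·) is unknown):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, np.sqrt(x))        # var(u|X) proportional to X, i.e. h(X) = X

X = sm.add_constant(x)
wls = sm.WLS(y, X, weights=1 / x).fit()          # weights = 1 / h(X)

# Equivalent "divide everything by sqrt(h(X))" transformation, estimated by OLS:
ols_t = sm.OLS(y / np.sqrt(x), X / np.sqrt(x)[:, None]).fit()
print(wls.params, ols_t.params)                  # identical coefficient estimates
```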
The critical points:
 If the errors are homoskedastic and you use the heteroskedastic
formula for standard errors (the one we derived), you are OK.
 If the errors are heteroskedastic and you use the
homoskedasticity-only formula for standard errors, the
standard errors are wrong.
 The two formulas coincide (when n is large) in the special case
of homoskedasticity.
 The bottom line: you should always use the heteroskedasticity-
robust formulas; these are conventionally called the
heteroskedasticity-robust standard errors.
The Extended OLS Assumptions (II)
These consist of the three LS assumptions, plus two more:
1. E(u | X = x) = 0.
2. (Xi, Yi), i = 1, …, n, are i.i.d.
3. Large outliers are rare (E(Y⁴) < ∞, E(X⁴) < ∞).
4. u is homoskedastic.
5. u is distributed N(0, σ²u).

 Assumptions 4 and 5 are more restrictive – so they apply to


fewer cases in practice. However, if you make these
assumptions, then certain mathematical calculations simplify
and you can prove strong results – results that hold if these
additional assumptions are true.
Efficiency of OLS, part II:

 Under all five extended LS assumptions – including normally
distributed errors – β̂1 has the smallest variance of all consistent
estimators (linear or nonlinear functions of Y1, …, Yn), as n → ∞.
 This is a pretty amazing result – it says that, if (in addition to LSA
1-3) the errors are homoskedastic and normally distributed, then
OLS is a better choice than any other consistent estimator. And
because an estimator that isn’t consistent is a poor choice, this
says that OLS really is the best you can do – if all five extended
LS assumptions hold. (The proof of this result is beyond the
scope of this course and isn’t in SW – it is typically done in
graduate courses.)
Some not-so-good things about OLS
The foregoing results are impressive, but these results – and the
OLS estimator – have important limitations.

1.The GM theorem really isn’t that compelling:


 The condition of homoskedasticity often doesn’t hold
(homoskedasticity is special)
 The result is only for linear estimators – only a small subset
of estimators (more on this in a moment)

2. The strongest optimality result (“part II” above) requires


homoskedastic normal errors – not plausible in applications
(think about the hourly earnings data!)
Limitations of OLS, ctd.
3. OLS is more sensitive to outliers than some other estimators. In the
case of estimating the population mean, if there are big outliers, then
the median is preferred to the mean because the median is less
sensitive to outliers – it has a smaller variance than OLS when there
are outliers. Similarly, in regression, OLS can be sensitive to outliers,
and if there are big outliers other estimators can be more efficient
(have a smaller variance). One such estimator is the least absolute
deviations (LAD) estimator:
min over b0, b1 of  Σⁿᵢ₌₁ |Yi − (b0 + b1Xi)|

In virtually all applied regression analysis, OLS is used – and that is what we will do
in this course too.
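One way to compute the LAD estimator in practice is median (0.5-quantile) regression; here is a sketch assuming statsmodels, with heavy-tailed simulated errors to mimic outliers:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.standard_t(df=2, size=200)   # heavy-tailed errors -> outliers

X = sm.add_constant(x)
lad = sm.QuantReg(y, X).fit(q=0.5)               # minimizes the sum of absolute deviations
ols = sm.OLS(y, X).fit()
print(lad.params, ols.params)                    # LAD is typically less affected by outliers
```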
Inference if u is Homoskedastic and Normal:
the Student t Distribution
Recall the five extended LS assumptions:
1. E(u | X = x) = 0.
2. (Xi, Yi), i = 1, …, n, are i.i.d.
3. Large outliers are rare (E(Y⁴) < ∞, E(X⁴) < ∞).
4. u is homoskedastic.
5. u is distributed N(0, σ²u).

If all five assumptions hold, then:

 β̂0 and β̂1 are normally distributed for all n (!)
 the t-statistic has a Student t distribution with n – 2 degrees of
freedom – this holds exactly for all n (!)
Normality of the sampling distribution of β̂1 under assumptions 1–5:

β̂1 − β1 = [ Σⁿᵢ₌₁ (Xi − X̄) ui ] / [ Σⁿᵢ₌₁ (Xi − X̄)² ]
        = (1/n) Σⁿᵢ₌₁ wi ui,  where wi = (Xi − X̄) / [ (1/n) Σⁿⱼ₌₁ (Xj − X̄)² ].

What is the distribution of a weighted average of normals?

Under assumptions 1–5:

β̂1 − β1 ~ N( 0, (1/n²) (Σⁿᵢ₌₁ wi²) σ²u )   (*)

Substituting wi into (*) yields the homoskedasticity-only variance formula.
In addition, under assumptions 1–5, under the null hypothesis the t-
statistic has a Student t distribution with n – 2 degrees of freedom.
 Why n – 2? Because we estimated 2 parameters, β0 and β1.
 For n < 30, the t critical values can be a fair bit larger than the N(0,1)
critical values.
 For n > 50 or so, the difference between the tn–2 and N(0,1) distributions is
negligible. Recall the Student t table:

degrees of freedom   5% t-distribution critical value
10                   2.23
20                   2.09
30                   2.04
60                   2.00
∞                    1.96
Practical implication:
 If n < 50 and you really believe that, for your application, u is
homoskedastic and normally distributed, then use the tn–2 instead of the
N (0,1) critical values for hypothesis tests and confidence intervals.
 In most econometric applications, there is no reason to believe that u is
homoskedastic and normal – usually, there is good reason to believe
that neither assumption holds.
 Fortunately, in modern applications, n > 50, so we can rely on the
large-n results presented earlier, based on the CLT, to perform
hypothesis tests and construct confidence intervals using the large-n
normal approximation.

Summary and Assessment
 The initial policy question:
Suppose new teachers are hired so the student-teacher ratio
falls by one student per class. What is the effect of this policy
intervention (this “treatment”) on test scores?
 Does our regression analysis give a convincing answer? Not
really – districts with low STR tend to be ones with lots of
other resources and higher-income families, which provide
kids with more learning opportunities outside school. This
suggests that corr(ui, STRi) ≠ 0, that is, E(ui | Xi = x) ≠ 0,
 so β̂1 is biased.
Digression on Causality
The original question (what is the quantitative effect of an
intervention that reduces class size?) is a question about a
causal effect: the effect on Y of applying a unit of the
treatment is β1.
 But what is, precisely, a causal effect?
 The common-sense definition of causality isn’t precise enough
for our purposes.
 In this course, we define a causal effect as the effect that is
measured in an ideal randomized controlled experiment.
Ideal Randomized Controlled Experiment
 Ideal: subjects all follow the treatment protocol – perfect
compliance, no errors in reporting, etc.!
 Randomized: subjects from the population of interest are
randomly assigned to a treatment or control group (so there are
no confounding factors)
 Controlled: having a control group permits measuring the
differential effect of the treatment
 Experiment: the treatment is assigned as part of the
experiment: the subjects have no choice, which means that
there is no “reverse causality” in which subjects choose the
treatment they think will work best.
Back to class size:
 What is an ideal randomized controlled experiment for measuring
the effect on Test Score of reducing STR?
 How does our regression analysis of observational data differ from
this ideal?
 The treatment is not randomly assigned
 In the US – in our observational data – districts with higher
family incomes are likely to have both smaller classes and higher
test scores.
 As a result it is plausible that E(ui | Xi = x) ≠ 0.
 If so, Least Squares Assumption #1 does not hold.
 If so, β̂1 is biased: does an omitted factor make class size seem
more important than it really is?
