Class 2
Order of Topics
I. Why Multiple Regression?
II. Multiple Regression and OLS Estimator
III. Classical Linear Assumptions
IV. Statistical Properties of OLS Estimators
V. Variance of Error and Slope Coefficients
I. WHY MULTIPLE REGRESSION?
Multiple Regression Reduces Bias
• Example 1: Test score regression with CA district data
• Is this causal?
• Probably not
• Why?
• Likely omitted variable bias – variables that we left out of the regression
• Whether there is bias depends on whether the left-out variables are correlated with both the average test score and the student-teacher ratio
Why is there bias?
• OLS is an unbiased estimator, so why is there bias?
• The unbiasedness of OLS relies on certain assumptions, including that the error term is uncorrelated with the independent variables – an assumption that omitted variables can violate
• We must control for – or hold constant – those variables if they
are correlated with both the dependent and independent
variable(s) in the model
• When we leave these variables out, they end up in the error term
and the estimator is biased
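A compact way to see when omitted variables bias the estimate (a standard result, not shown on the slide): suppose the true model is $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i$ but we estimate the short regression of Y on X alone, so Z sits in the error term. Then

$$E[\hat{\beta}_1] = \beta_1 + \beta_2\,\delta_1, \qquad \text{where } \delta_1 \text{ is the slope from regressing } Z \text{ on } X.$$

The bias is zero unless Z matters for Y ($\beta_2 \neq 0$) and Z is correlated with X ($\delta_1 \neq 0$); its sign is the product of the signs of $\beta_2$ and $\delta_1$. This is exactly the "correlated with both" condition from Example 1.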
Example 2: Earnings and years of education
• What is the likely relationship between earnings and years of
education?
• Positive
[Figure: scatterplot of earnings against years of education (Yrs Educ), showing a positive relationship]
Example 3: Omitted Variable Bias, actual study
Example 4: Omitted Variable Bias, actual study
Summary
• Examples 1 through 4 illustrate “omitted variable bias”
II. MULTIPLE REGRESSION AND OLS ESTIMATOR
Example: California School Districts
Original bivariate regression
Stata: reg testscr str
After adding district-level controls, how does the coefficient on str compare to the coefficient from the original bivariate regression? It is roughly half the size (-1.12 vs. -2.28).
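For readers who want to follow along in Python rather than Stata, a minimal sketch of the two regressions is below. The file name caschool.csv and the control-variable names pct_free_lunch and exp_per_pupil are placeholders for illustration, not the lecture's actual dataset or specification.

```python
# Sketch of the bivariate and multiple regressions in Python; the lecture runs
# `reg testscr str` in Stata. File and control-variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("caschool.csv")                # hypothetical CSV of CA districts
df = df.rename(columns={"str": "str_ratio"})    # avoid clashing with Python's built-in str

# Bivariate regression: average test score on student-teacher ratio
bivariate = smf.ols("testscr ~ str_ratio", data=df).fit()
print(bivariate.params["str_ratio"])            # about -2.28 in the full CA sample

# Multiple regression: add district-level controls (placeholder names)
multiple = smf.ols("testscr ~ str_ratio + pct_free_lunch + exp_per_pupil", data=df).fit()
print(multiple.params["str_ratio"])             # smaller in magnitude once controls enter
```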
Ceteris Paribus
• In the bivariate regression, we estimated $\hat{\beta}_1 = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) \,/\, \sum_i (X_i - \bar{X})^2$
• In the multiple regression, the slope on $X_k$ can be written as $\hat{\beta}_k = \sum_i \hat{r}_{ik} Y_i \,/\, \sum_i \hat{r}_{ik}^2$
• What is $\hat{r}_{ik}$?
• It is the residual you get from regressing $X_k$ on all the other X's
• This is called partialling out – it removes the part of each $X_k$ that is explained by the other independent variables, leaving only the part that is uncorrelated with them
Example of partialling
Model: $TestScore_i = \beta_0 + \beta_1 STR_i + \beta_2 PFL_i + \beta_3 PPE_i + u_i$
STR = student-teacher ratio, PFL = % free lunch, PPE = per-pupil expenditures
$\hat{r}_{i,STR} = STR_i - (\hat{\delta}_0 + \hat{\delta}_1 PFL_i + \hat{\delta}_2 PPE_i)$
$\hat{r}_{i,STR}$ is the portion of the variation in STR that is not explained by PFL or PPE
$\hat{r}_{i,PFL} = PFL_i - (\hat{\gamma}_0 + \hat{\gamma}_1 STR_i + \hat{\gamma}_2 PPE_i)$
$\hat{r}_{i,PFL}$ is the portion of the variation in PFL that is not explained by STR or PPE
*This is what Stata does in the background to estimate the coefficients for your model. In this case, it uses $\hat{r}_{i,STR}$ to estimate $\hat{\beta}_1$ and $\hat{r}_{i,PFL}$ to estimate $\hat{\beta}_2$, etc.
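The same partialling-out steps can be sketched in Python; the column names below are the same placeholders used in the earlier sketch, and this is an illustration of the mechanics rather than the lecture's own code.

```python
# Partialling out (Frisch-Waugh-Lovell) by hand: the coefficient on STR from the full
# multiple regression equals the coefficient from regressing testscr on residualized STR.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("caschool.csv").rename(columns={"str": "str_ratio"})  # hypothetical data

# Step 1: regress STR on the other regressors and keep the residual r_hat
aux = smf.ols("str_ratio ~ pct_free_lunch + exp_per_pupil", data=df).fit()
df["r_str"] = aux.resid                          # part of STR not explained by PFL or PPE

# Step 2: regress the outcome on that residual
partialled = smf.ols("testscr ~ r_str", data=df).fit()

# Step 3: compare with the full multiple regression -- the two slopes are identical
full = smf.ols("testscr ~ str_ratio + pct_free_lunch + exp_per_pupil", data=df).fit()
print(partialled.params["r_str"], full.params["str_ratio"])
```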
BREAK AND REVIEW QUESTIONS: Please answer and be ready to contribute answers to class when we return
1. Write the equation for the residual in a bivariate regression equation in terms of:
a. Yi and Yhat
b. Yi , B0hat, B1hat, Xi
2. How can adding more independent (control) variables to a regression model help improve the causal interpretation of a relationship?
a. You are now informed that the percent of the population age 16 to 30 is positively related to murders per year and negatively correlated with the % population with handguns.
Does this tell you anything about whether the coefficient on % population with handguns is biased?
If yes, does it tell you anything about the direction of the bias?
b. You are further informed that the percent of population who have hunting licenses is negatively correlated with % population with handguns and is unrelated to the murders per year. What does this tell you about bias?
c. You are yet further informed that the percent of population that is unemployed is positively related to the number of murders but that the percent of population that is unemployed is uncorrelated with the % population with handguns. What does this tell you about bias?
1. Write the equation for the residual in a bivariate regression equation in terms of:
a. Yi and Yhat
$e_i = Y_i - \hat{Y}_i$
b. Yi , B0hat, B1hat, Xi
$e_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$
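As a quick numerical check of 1(a) and 1(b), the snippet below (reusing the hypothetical caschool.csv data from the earlier sketches) computes the residual both ways and confirms it matches what the software reports.

```python
# Residuals two ways, matching answers 1(a) and 1(b).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("caschool.csv").rename(columns={"str": "str_ratio"})  # hypothetical data
fit = smf.ols("testscr ~ str_ratio", data=df).fit()

b0, b1 = fit.params["Intercept"], fit.params["str_ratio"]
e_1a = df["testscr"] - fit.fittedvalues                   # e_i = Y_i - Yhat_i
e_1b = df["testscr"] - b0 - b1 * df["str_ratio"]          # e_i = Y_i - b0hat - b1hat * X_i
print(np.allclose(e_1a, fit.resid), np.allclose(e_1b, fit.resid))  # True True
```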
2. How can adding more independent (control) variables to a regression model help improve the causal interpretation of a relationship?
a. You are now informed that the percent of the population age 16 to 30 is positively related to murders per year and negatively correlated with the % population with handguns.
Does this tell you anything about whether the coefficient on % population with handguns is biased? If yes, does it tell you anything about the direction of the bias?
Yes. This omitted variable is related to the dependent variable and correlated with the independent variable, so it ends up in the error term and the error term is correlated with X: the coefficient is biased. Because the omitted variable is positively related to murders and negatively correlated with % population with handguns, the bias is negative (the estimated coefficient is pushed downward).
b. You are further informed that the percent of population who have hunting licenses is negatively correlated with % population
with handguns and is unrelated to the murders per year. What does this tell you about bias?
Has no effect on other coefficients. It is unrelated to the dependent variable and thus is NOT in the error term. X and e are not
correlated. Does not matter if something unrelated to murders is correlated with the independent variable.
c. You are yet further informed that the percent of population that is unemployed is positively related to the number of murders but that the percent of population that is unemployed is uncorrelated with the % population with handguns. What does this tell you about bias?
No effect on bias, since it has no relationship to the independent variable. It is in the error term, but e and X are not correlated.
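The logic behind answers 2(a) through 2(c) can be illustrated with a small simulation. Everything below is synthetic data constructed for illustration; it is not the murders/handguns data referred to in the questions.

```python
# Synthetic illustration of 2(a)-(c): an omitted variable biases the slope on X only
# when it is related to BOTH the outcome and X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)                 # like % age 16-30: affects Y, negatively related to X
w = rng.normal(size=n)                 # like % unemployed: affects Y, unrelated to X
x = -0.5 * z + rng.normal(size=n)      # regressor of interest (e.g., % with handguns)
y = 1.0 * x + 2.0 * z + 2.0 * w + rng.normal(size=n)   # true coefficient on x is 1.0

short = sm.OLS(y, sm.add_constant(x)).fit()                        # omits both z and w
longer = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit() # controls for z only

print(short.params[1])    # biased downward: z affects y positively, correlates negatively with x
print(longer.params[1])   # close to 1.0: omitting w alone does not bias the slope on x
```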
III. CLASSICAL LINEAR MODEL (CLM) ASSUMPTIONS
Gauss-Markov Theorem: first go-around
• Under the classical assumptions, OLS is BLUE – the Best Linear Unbiased Estimator
Assumption 3: No Omitted Variables
• All explanatory variables are uncorrelated with the
error term
• Known as the Zero Conditional Mean Assumption
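Stated in symbols (standard notation, not reproduced from the slide):

$$E\big[\,u_i \mid X_{1i}, X_{2i}, \ldots, X_{ki}\,\big] = 0,$$

which implies that each regressor is uncorrelated with the error term, $\operatorname{Cov}(X_{ji}, u_i) = 0$ for every $j$.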
Assumption 4: No autocorrelation
• Observations of the error term are uncorrelated with
each other
• No autocorrelation/serial correlation
• Observations of the residual must be independent of
each other
• Relevant for time series and panel estimators
• Important for making statistical inferences about
confidence intervals
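In symbols (standard notation, not reproduced from the slide):

$$\operatorname{Corr}(u_i, u_j) = 0 \quad \text{for all } i \neq j,$$

i.e., knowing that one observation's error is large tells us nothing about any other observation's error.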
Assumption 5: Homoskedastic errors, no heteroskedasticity
• The error term has constant variance, or is
homoskedastic
• Formally: $\operatorname{Var}(u_i \mid X_{1i}, \ldots, X_{ki}) = \sigma^2$ for all observations $i$
IV. STATISTICAL PROPERTIES OF OLS ESTIMATORS
[Figure: scatterplot of district test scores (600–720) against the student-teacher ratio, str (14–26)]
Example: Sample 1
[Figure: scatterplot of test scores (600–720) against str (14–26) for random sample 1]
Sample 2
[Figure: scatterplot of test scores (600–720) against str (14–26) for random sample 2]
10 Samples
• We can continue this simulation as many times as we like and collect the coefficients
[Figure: histogram of the estimated coefficients from 10 simulations, with bins running from < -10 to 8+ and the mean marked]
100 Samples
• We can continue this simulation as many times as we like and collect the coefficients
[Figure: histogram of the estimated coefficients from 100 simulations, with bins running from < -10 to 8+ and the mean marked]
500 Samples
• We can continue this simulation as many times as we like and collect the coefficients
• As we include more random samples in our simulation:
1) The mean gets closer to the full-sample estimate of $\beta_1$ (i.e., -2.28)
2) The distribution looks more like a normal distribution
[Figure: histogram of the estimated coefficients from 500 simulations, with bins running from < -10 to 8+ and the mean marked]
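A sketch of the repeated-sampling exercise behind the last few slides, written in Python rather than Stata. The file name, column names, and sample size per draw are assumptions for illustration only.

```python
# Repeated-sampling simulation: draw many random samples of districts, estimate the
# bivariate slope on the student-teacher ratio in each, and inspect the histogram.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.read_csv("caschool.csv").rename(columns={"str": "str_ratio"})  # hypothetical data

def one_draw(data, n=100, seed=None):
    """Slope on the student-teacher ratio from one random sample of n districts."""
    sample = data.sample(n=n, random_state=seed)
    return smf.ols("testscr ~ str_ratio", data=sample).fit().params["str_ratio"]

n_samples = 500                                   # 10, 100, 500, ... as on the slides
slopes = [one_draw(df, n=100, seed=s) for s in range(n_samples)]

print(sum(slopes) / n_samples)                    # mean approaches the full-sample -2.28
plt.hist(slopes, bins=20)                         # shape approaches a normal distribution
plt.xlabel("estimated coefficient on str")
plt.show()
```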
Central Limit Theorem: Review
• What is it?
• The statistics (e.g., means or regression coefficients) computed from large samples of the same size (n=100) repeatedly drawn from the population will be approximately normally distributed
• This is true even if the variable being measured is not normally
distributed in the population
• Why is it important?
• It allows us to conduct hypothesis tests no matter the shape of a
variable’s underlying distribution
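A quick way to see the theorem in action with synthetic data (not from the lecture): draw repeated samples of size 100 from a clearly non-normal distribution and plot the distribution of the sample means.

```python
# CLT illustration: the underlying variable is exponential (heavily right-skewed),
# yet the means of repeated samples of size 100 are approximately normal.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
sample_means = [rng.exponential(scale=2.0, size=100).mean() for _ in range(5_000)]

plt.hist(sample_means, bins=40)        # roughly bell-shaped, centered near 2.0
plt.xlabel("sample mean (n = 100)")
plt.show()
```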
V. VARIANCE OF ERROR AND SLOPE COEFFICIENTS
Variance of Error and Slope Coefficients
• If estimated regression slopes have a sampling distribution,
they have a mean and a variance
$\operatorname{Var}(\hat{\beta}_j) = \dfrac{\sigma^2}{SST_j \,(1 - R_j^2)}$, where $SST_j = \sum_i (X_{ij} - \bar{X}_j)^2$ is the total variation in $X_j$ and $R_j^2$ is the $R^2$ from regressing $X_j$ on the other independent variables
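The formula can be checked numerically on synthetic data (an illustration, not the lecture's data): compute $\hat{\sigma}^2$, $SST_j$, and $R_j^2$ by hand and compare with the classical standard error that statsmodels reports for the same coefficient.

```python
# Verify Var(b_j) = sigma^2 / (SST_j * (1 - R_j^2)) on synthetic data, using the usual
# estimate sigma_hat^2 = SSR / (n - k - 1). It matches the classical OLS standard error.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)                  # correlated regressors
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

sigma2_hat = fit.ssr / (n - 2 - 1)                  # SSR / (n - k - 1), with k = 2 regressors
sst_1 = np.sum((x1 - x1.mean()) ** 2)               # total variation in x1
r2_1 = sm.OLS(x1, sm.add_constant(x2)).fit().rsquared  # R^2 from regressing x1 on x2

se_formula = np.sqrt(sigma2_hat / (sst_1 * (1 - r2_1)))
print(se_formula, fit.bse[1])                       # the two numbers agree
```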
4. What is attractive about the B and U qualities of the OLS BLUE estimator?
6. To which three assumptions of the OLS model does iid relate?
4. What is attractive about the B and U qualities of the OLS BLUE estimator?
These estimators are unbiased (on average, the sample estimated coefficients will equal the population coefficients) and these
estimators produce minimum variance coefficients, meaning that the distribution of sample coefficients around their population
value is the smallest possible (among linear unbiased estimators).
6. To which three assumptions of the OLS model does iid relate?
The sampling distribution is the distribution of a sample statistic (such as the sample mean or an estimated regression coefficient) that one would obtain from an infinite number of samples of the same size taken from one population.
Example: take an infinite number of samples of size 900 from the population of the US and calculate the relationship between weight and the number of fast-food meals per week. The sample coefficient on fast-food meals will differ in each sample, but there will be a distribution of the estimated coefficients, called the sampling distribution of the estimated coefficient.
With a large enough sample size (usually just 100), the sampling distribution (of the sample statistic) is approximately normally distributed even if the underlying distribution of the variable is not.