Lecture set 2
“An equation contains the possibility of the state of affairs in reality. The logical structure of the economy which the model
shows, mathematics shows in equations. (When I speak of mathematics here, I include its sub-branch of statistics)”
- Choy, Keen Meng (2020) Tractatus Modellus-Philosophicus, mimeo
Outline
1. The population linear regression model
2. The ordinary least squares (OLS) estimator and the sample
regression line
3. Measures of fit of the sample regression
4. The least squares assumptions for causal inference
5. The sampling distribution of the OLS estimator
6. The least squares assumptions for prediction
Linear regression lets us estimate the population regression line and its slope.
• The population regression line is the expected value of Y given X.
• The slope is the difference in the expected values of Y for two values of X that differ by one unit.
• The estimated regression can be used either for:
– causal inference (learning about the causal effect on Y of a change in X)
– prediction (predicting the value of Y given X, for an observation not in the data set)
The problem of statistical inference for linear regression is, at a general
level, the same as for estimation of the mean or of the differences between
two means. Statistical, or econometric, inference about the slope entails:
• Estimation:
– How should we draw a line through the data to estimate the population
slope?
Answer: ordinary least squares (OLS).
– What are the advantages and disadvantages of OLS?
• Hypothesis testing:
– How to test whether the slope is zero?
• Confidence intervals:
– How to construct a confidence interval for the slope?
The Linear Regression Model (SW Section 4.1)
The population regression line:
Test Score = β0 + β1STR
β1 = slope of population regression line
The OLS estimator solves:

min_{b0, b1} Σ_{i=1}^n [Yi − (b0 + b1 Xi)]²
Example:
The population regression line: E(Test Score|STR) = β0 + β1STR
Mechanics of OLS
Key Concept 4.2: The OLS Estimator,
Predicted Values, and Residuals
The OLS estimators of the slope β1 and the intercept β0 are

β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)² = sXY / sX²   (4.7)

β̂0 = Ȳ − β̂1 X̄   (4.8)

The estimated intercept (β̂0), slope (β̂1), and residuals (ûi) are computed from a sample of n observations of Xi and Yi, i = 1, ..., n. These are estimates of the unknown true population intercept (β0), slope (β1), and error term (ui).
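A minimal numerical sketch of equation (4.7) in numpy, using made-up class-size data (the numbers below are purely illustrative, not the California data):

```python
import numpy as np

# Illustrative data: class size (X) and test scores (Y); values are made up.
X = np.array([15.0, 17.0, 19.0, 20.0, 21.0, 22.5, 24.0, 25.0])
Y = np.array([680.0, 675.0, 672.0, 665.0, 661.0, 658.0, 650.0, 648.0])

# Equation (4.7): slope = sample covariance of (X, Y) / sample variance of X
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Intercept: beta0_hat = Ybar - beta1_hat * Xbar
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Predicted values and residuals
Y_hat = beta0_hat + beta1_hat * X
u_hat = Y - Y_hat

print(beta0_hat, beta1_hat)
# Cross-check against numpy's least-squares fit (returns [slope, intercept])
print(np.polyfit(X, Y, deg=1))
```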
Application to the California Test Score –
Class Size data
Definition of R²:

R² = ESS / TSS, where ESS = Σ_{i=1}^n (Ŷi − Ȳ)² is the explained sum of squares (the sample average of the Ŷi equals Ȳ) and TSS = Σ_{i=1}^n (Yi − Ȳ)² is the total sum of squares.

• R² = 0 means ESS = 0
• R² = 1 means ESS = TSS
• 0 ≤ R² ≤ 1
• For regression with a single X, R² = the square of the correlation coefficient between X and Y
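A quick check of these properties on simulated (hypothetical) data, verifying that R² = ESS/TSS equals the squared sample correlation when there is a single regressor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(20.0, 2.0, size=200)            # hypothetical regressor
Y = 700.0 - 2.3 * X + rng.normal(0, 10, 200)   # hypothetical outcome

# OLS fit
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

ESS = np.sum((Y_hat - Y_hat.mean()) ** 2)   # explained sum of squares
TSS = np.sum((Y - Y.mean()) ** 2)           # total sum of squares
R2 = ESS / TSS

# With a single regressor, R^2 equals the squared correlation of X and Y
print(R2, np.corrcoef(X, Y)[0, 1] ** 2)     # the two numbers should match
```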
The Standard Error of the Regression (SER)
The SER measures the spread of the distribution of u. The SER is
(almost) the sample standard deviation of the OLS residuals:
SER = √[ (1/(n−2)) Σ_{i=1}^n (ûi − û̄)² ] = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]

The second equality holds because û̄ = (1/n) Σ_{i=1}^n ûi = 0 (the OLS residuals always average to zero), so

SER = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]
The SER:
• has the units of u, which are the units of Y
• measures the average “size” of the OLS residual (the average “mistake” made
by the OLS regression line)
The root mean squared error (RMSE), √[ (1/n) Σ_{i=1}^n ûi² ], measures the same thing as the SER – the minor difference is division by n instead of n − 2.
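A small sketch comparing the two measures on simulated data (the data-generating process below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(20.0, 2.0, size=n)
Y = 700.0 - 2.3 * X + rng.normal(0, 10, n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)                       # OLS residuals (mean is ~0)

SER  = np.sqrt(np.sum(u_hat ** 2) / (n - 2))    # divides by n - 2
RMSE = np.sqrt(np.sum(u_hat ** 2) / n)          # divides by n
print(SER, RMSE)  # nearly identical when n is large
```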
Technical note: why divide by n − 2 instead of n − 1?

SER = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]

Division by n − 2 is a degrees-of-freedom correction: two coefficients (β̂0 and β̂1) were estimated to compute the residuals, just as division by n − 1 in the sample variance corrects for estimating the sample mean.
Note: The main place we will encounter non-i.i.d. sampling is when data are
recorded over time for the same entity (panel data and time series data).
* We will deal with that complication when we cover panel data.
Least squares assumption #3: Large outliers are rare
Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
This is because OLS can be sensitive to an outlier:
• A large outlier is an extreme value of X or Y.
Yi = β0 + β1 Xi + ui and Ȳ = β0 + β1 X̄ + ū,

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū).

Thus, substituting into β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²,

β̂1 = [ β1 Σ_{i=1}^n (Xi − X̄)² + Σ_{i=1}^n (Xi − X̄)(ui − ū) ] / Σ_{i=1}^n (Xi − X̄)²
The mean and variance of the sampling distribution of β̂1 (3 of 3)

After some algebraic manipulation (see Appendix at the end of the slides), we obtain the following:

β̂1 − β1 = Σ_{i=1}^n (Xi − X̄)ui / Σ_{i=1}^n (Xi − X̄)²
Now we can calculate E(β̂1) and var(β̂1):

E(β̂1) − β1 = E[ Σ_{i=1}^n (Xi − X̄)ui / Σ_{i=1}^n (Xi − X̄)² ]
           = E[ E( Σ_{i=1}^n (Xi − X̄)ui / Σ_{i=1}^n (Xi − X̄)² | X1, ..., Xn ) ]
           = 0

using the law of iterated expectations, which states that the expected value of a random variable equals the expected value of its conditional expectation given a second random variable. The inner conditional expectation is 0 because E(ui | Xi = x) = 0 by LSA #1.

• Thus LSA #1 implies that E(β̂1) = β1, that is, β̂1 is an unbiased estimator of β1.
Next we calculate var(β̂1). Write

β̂1 − β1 = [ (1/n) Σ_{i=1}^n vi ] / [ ((n−1)/n) sX² ],

where vi = (Xi − X̄)ui. If n is large, sX² ≈ σX² and (n−1)/n ≈ 1, so

β̂1 − β1 ≈ [ (1/n) Σ_{i=1}^n vi ] / σX².

Therefore,

var(β̂1 − β1) = var[ (1/n) Σ_{i=1}^n vi ] / (σX²)² = [ var(vi)/n ] / (σX²)²,

where the final equality uses assumption 2 (the vi are i.i.d. because (Xi, ui) are i.i.d.). Thus,

var(β̂1) = (1/n) · var[ (Xi − μX)ui ] / (σX²)².
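A Monte Carlo sketch of these two results, using a hypothetical data-generating process in which E(u|X) = 0 holds by construction: the average of β̂1 across simulated samples should be close to β1, and its variance close to (1/n)·var[(Xi − μX)ui]/(σX²)².

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 5.0, 2.0        # hypothetical population parameters
n, reps = 200, 5000

slopes = np.empty(reps)
for r in range(reps):
    X = rng.normal(10.0, 3.0, size=n)    # var(X) = 9
    u = rng.normal(0.0, 4.0, size=n)     # var(u) = 16, E(u | X) = 0 by construction
    Y = beta0 + beta1 * X + u
    slopes[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# var[(X - mu_X) u] = var(X) * var(u) here because X and u are independent
var_formula = (9.0 * 16.0) / (n * 9.0 ** 2)
print(slopes.mean(), beta1)          # mean of beta1_hat is close to beta1 (unbiased)
print(slopes.var(), var_formula)     # simulated variance close to the formula
```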
Summary so far
1. β̂1 is unbiased: under LSA #1, E(β̂1) = β1 (just like Ȳ)
2. var(β̂1) is inversely proportional to n (just like Ȳ)
What is the sampling distribution of β̂1?
The exact sampling distribution is complicated – it depends on the population distribution of (Y, X) – but when n is large we get some simple (and good) approximations:
1) Because var(β̂1) ∝ 1/n and E(β̂1) = β1, β̂1 →p β1
2) When n is large, the sampling distribution of β̂1 is well approximated by a normal distribution (CLT)
Large-n approximation to the distribution of β̂1:

β̂1 − β1 = [ (1/n) Σ_{i=1}^n vi ] / [ ((n−1)/n) sX² ] ≈ [ (1/n) Σ_{i=1}^n vi ] / σX², where vi = (Xi − X̄)ui

When n is large, (1/n) Σ_{i=1}^n vi is approximately normal by the CLT, so

β̂1 is approximately distributed N( β1 , σv² / (n(σX²)²) ), where vi = (Xi − X̄)ui
The larger the variance of X, the smaller the variance of β̂1
The math

var(β̂1 − β1) = (1/n) · var[ (Xi − μX)ui ] / (σX²)², where σX² = var(Xi)
The intuition
If there is more variation in X, then there is more information in
the data that you can use to fit the regression line. This is most
easily seen in a figure…
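Alongside the figure, here is a simulated version of the same point (the data-generating process is hypothetical): doubling the spread of X roughly quarters the sampling variance of β̂1.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulated_var_of_slope(sd_x, n=100, reps=4000):
    """Sampling variance of the OLS slope when X has standard deviation sd_x."""
    slopes = np.empty(reps)
    for r in range(reps):
        X = rng.normal(0.0, sd_x, size=n)
        Y = 1.0 + 0.5 * X + rng.normal(0.0, 1.0, size=n)
        slopes[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    return slopes.var()

# Doubling the spread of X roughly quarters the variance of beta1_hat
print(simulated_var_of_slope(sd_x=1.0))
print(simulated_var_of_slope(sd_x=2.0))
```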
Summary of the sampling distribution of β̂1:
If the three Least Squares Assumptions hold, then
• The exact (finite sample) sampling distribution of β̂1 has:
– E(β̂1) = β1 (that is, β̂1 is unbiased)
– var(β̂1) = (1/n) · var[ (Xi − μX)ui ] / (σX²)², which is proportional to 1/n
• Other than its mean and variance, the exact distribution of β̂1 is complicated and depends on the distribution of (X, u)
• β̂1 →p β1 (that is, β̂1 is consistent)
• When n is large, [β̂1 − E(β̂1)] / √var(β̂1) ~ N(0, 1) (CLT)
Note: This is similar to the sampling distribution of Ȳ.
Key Concept 4.4: Large-Sample Distributions of β̂0 and β̂1

If the least squares assumptions in Key Concept 4.3 hold, then in large samples β̂0 and β̂1 have a jointly normal sampling distribution.

The large-sample normal distribution of the slope β̂1 is N(β1, σ²_β̂1), where the variance of this distribution is

σ²_β̂1 = (1/n) · var[ (Xi − μX)ui ] / [var(Xi)]².   (4.21)

The large-sample normal distribution of the intercept β̂0 is N(β0, σ²_β̂0), where

σ²_β̂0 = (1/n) · var(Hi ui) / [E(Hi²)]², where Hi = 1 − [μX / E(Xi²)] Xi.   (4.22)
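As a sketch of how (4.21) can be used in practice, the sample-analogue (plug-in) estimator below replaces var[(Xi − μX)ui] and var(Xi) with sample moments to get an approximate standard error for β̂1. This illustrates the large-sample formula only; it is not necessarily the exact finite-sample correction used by any particular software package.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = rng.normal(20.0, 2.0, size=n)
Y = 700.0 - 2.3 * X + rng.normal(0.0, 10.0, size=n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)

# Sample analogue of (4.21): replace var[(X - mu_X)u] and var(X) by sample moments
var_b1 = (np.mean((X - X.mean()) ** 2 * u_hat ** 2) / np.var(X) ** 2) / n
se_b1 = np.sqrt(var_b1)
print(b1, se_b1)
```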
The Least Squares Assumptions for Prediction
(SW Appendix 4.4) (1 of 2)
• Prediction entails using an estimation sample to estimate a
prediction model, then using that model to predict the value
of Y for an observation not in the estimation sample.
– Prediction requires good out-of-sample performance.
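A minimal sketch of this estimation-sample / out-of-sample workflow on simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(20.0, 2.0, size=300)
Y = 700.0 - 2.3 * X + rng.normal(0.0, 10.0, size=300)

# Estimation sample (first 200 observations) and out-of-sample observations (last 100)
X_est, Y_est = X[:200], Y[:200]
X_oos, Y_oos = X[200:], Y[200:]

# Fit the prediction model on the estimation sample only
b1 = np.sum((X_est - X_est.mean()) * (Y_est - Y_est.mean())) / np.sum((X_est - X_est.mean()) ** 2)
b0 = Y_est.mean() - b1 * X_est.mean()

# Predict Y for observations not used in estimation and measure out-of-sample fit
Y_pred = b0 + b1 * X_oos
oos_rmse = np.sqrt(np.mean((Y_oos - Y_pred) ** 2))
print(oos_rmse)
```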
Appendix
The mean and variance of the sampling distribution of β̂1 (1 of 3)

Yi = β0 + β1 Xi + ui and Ȳ = β0 + β1 X̄ + ū,

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū).

Thus, substituting into β̂1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²,

β̂1 = [ β1 Σ_{i=1}^n (Xi − X̄)² + Σ_{i=1}^n (Xi − X̄)(ui − ū) ] / Σ_{i=1}^n (Xi − X̄)²
The mean and variance of the sampling distribution of β̂1 (2 of 3)

β̂1 = β1 · [ Σ_{i=1}^n (Xi − X̄)² / Σ_{i=1}^n (Xi − X̄)² ] + Σ_{i=1}^n (Xi − X̄)(ui − ū) / Σ_{i=1}^n (Xi − X̄)²,

so β̂1 − β1 = Σ_{i=1}^n (Xi − X̄)(ui − ū) / Σ_{i=1}^n (Xi − X̄)².

Now

Σ_{i=1}^n (Xi − X̄)(ui − ū) = Σ_{i=1}^n (Xi − X̄)ui − [ Σ_{i=1}^n (Xi − X̄) ] ū
                            = Σ_{i=1}^n (Xi − X̄)ui − [ Σ_{i=1}^n Xi − nX̄ ] ū
                            = Σ_{i=1}^n (Xi − X̄)ui

because Σ_{i=1}^n Xi − nX̄ = 0.
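A quick numerical check of this identity, Σ(Xi − X̄)(ui − ū) = Σ(Xi − X̄)ui, on arbitrary simulated numbers:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=50)
u = rng.normal(size=50)

lhs = np.sum((X - X.mean()) * (u - u.mean()))
rhs = np.sum((X - X.mean()) * u)
print(np.isclose(lhs, rhs))   # True: subtracting u-bar does not change the sum
```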
The mean and variance of the sampling distribution of β̂1 (3 of 3)

Substitute Σ_{i=1}^n (Xi − X̄)(ui − ū) = Σ_{i=1}^n (Xi − X̄)ui into the expression for β̂1 − β1:

β̂1 − β1 = Σ_{i=1}^n (Xi − X̄)(ui − ū) / Σ_{i=1}^n (Xi − X̄)²,

so

β̂1 − β1 = Σ_{i=1}^n (Xi − X̄)ui / Σ_{i=1}^n (Xi − X̄)².