
ECONS303: Applied Quantitative Research Methods

Lecture set 2: Linear Regression with One


Regressor

“An equation contains the possibility of the state of affairs in reality. The logical structure of the economy which the model
shows, mathematics shows in equations. (When I speak of mathematics here, I include its sub-branch of statistics)”
- Choy, Keen Meng (2020) Tractatus Modellus-Philosophicus, mimeo
Outline
1. The population linear regression model
2. The ordinary least squares (OLS) estimator and the sample
regression line
3. Measures of fit of the sample regression
4. The least squares assumptions for causal inference
5. The sampling distribution of the OLS estimator
6. The least squares assumptions for prediction
Linear regression lets us estimate the
population regression line and its slope.
• The population regression line is the expected value of Y given
X.
• The slope is the difference in the expected values of Y for two
values of X that differ by one unit.
• The estimated regression can be used either for:
– causal inference (learning about the causal effect on Y of a change in
X)
– prediction (predicting the value of Y given X, for an observation not in
the data set)
The problem of statistical inference for linear regression is, at a general
level, the same as for estimation of the mean or of the differences between
two means. Statistical, or econometric, inference about the slope entails:

• Estimation:
  – How should we draw a line through the data to estimate the population slope?
    ⇒ Answer: ordinary least squares (OLS).
  – What are the advantages and disadvantages of OLS?

• Hypothesis testing:
– How to test whether the slope is zero?

• Confidence intervals:
– How to construct a confidence interval for the slope?
The Linear Regression Model (SW Section 4.1)
The population regression line:
Test Score = β0 + β1STR
β1 = slope of population regression line

Why are β0 and β1 “population” parameters?


• We would like to know the population value of β1.
• We don’t know β1, so must estimate it using data.
The Population Linear Regression Model
Yi = β0 + β1Xi + ui, i = 1,…, n
• We have n observations, (Xi, Yi), i = 1,.., n.
• X is the independent variable or regressor
• Y is the dependent variable
• β0 = intercept
• β1 = slope (the coefficient on X)
• ui = the regression error
• The regression error consists of omitted factors. In general, these omitted
factors are other factors that influence Y, other than the variable X. The
regression error also includes error in the measurement of Y.
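As an illustration (a sketch, not from the slides), the population model can be simulated in Python with made-up parameter values; u plays the role of the omitted factors and measurement error:

# Illustrative sketch: simulate data from Yi = b0 + b1*Xi + ui with made-up
# parameter values; u collects the omitted factors and measurement error.
import numpy as np

rng = np.random.default_rng(0)
n = 420
beta0, beta1 = 700.0, -2.0            # hypothetical population intercept and slope
X = rng.uniform(14, 26, size=n)       # regressor, e.g. student-teacher ratios
u = rng.normal(0.0, 10.0, size=n)     # regression error, mean zero given X
Y = beta0 + beta1 * X + u             # population linear regression model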
The population regression model in a picture: Observations on Y and X
(n = 7); the population regression line; and the regression error (the
“error term”):
The Ordinary Least Squares Estimator
(SW Section 4.2)
How can we estimate β0 and β1 from data?

We use the least squares (“ordinary least squares” or “OLS”)


estimator of the unknown parameters β0 and β1. The OLS
estimator requires solving the following minimization problem:

$$\min_{b_0, b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2$$

The OLS estimator solves:

$$\min_{b_0, b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2$$

• The OLS estimator minimizes the average squared


difference between the actual values of Yi and the prediction
(“predicted value”) based on the estimated line.
• This minimization problem can be solved using calculus
(App. 4.2).
• The result is the OLS estimators of β0 and β1.

Example:
The population regression line: E(Test Score|STR) = β0 + β1 × STR
Mechanics of OLS
Key Concept 4.2: The OLS Estimator,
Predicted Values, and Residuals
The OLS estimators of the slope β1 and the intercept β0 are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2} \qquad (4.7)$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \qquad (4.8)$$

The OLS predicted values Ŷi and residuals ûi are

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, \quad i = 1, \ldots, n \qquad (4.9)$$

$$\hat{u}_i = Y_i - \hat{Y}_i, \quad i = 1, \ldots, n. \qquad (4.10)$$

The estimated intercept (β̂0), slope (β̂1), and residuals (ûi) are computed
from a sample of n observations of Xi and Yi, i = 1, …, n. These are estimates
of the unknown true population intercept (β0), slope (β1), and error term (ui).
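A minimal Python sketch of these formulas (assuming X and Y are numpy arrays, for instance the simulated data above):

# Sketch of equations (4.7)-(4.10): OLS slope, intercept, predicted values,
# and residuals, computed directly from the formulas.
import numpy as np

def ols_fit(X, Y):
    Xbar, Ybar = X.mean(), Y.mean()
    beta1_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)   # (4.7)
    beta0_hat = Ybar - beta1_hat * Xbar                                     # (4.8)
    Y_hat = beta0_hat + beta1_hat * X                                       # (4.9)
    u_hat = Y - Y_hat                                                       # (4.10)
    return beta0_hat, beta1_hat, Y_hat, u_hat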
Application to the California Test Score –
Class Size data

• Estimated slope: β̂1 = −2.28
• Estimated intercept: β̂0 = 698.9
• Estimated regression line: predicted TestScore = 698.9 − 2.28 × STR
Interpretation of the estimated slope and
intercept

• Predicted TestScore = 698.9 − 2.28 × STR

• Districts with one more student per teacher on average have


test scores that are 2.28 points lower.
• That is, ΔE(Test score|STR)/ΔSTR = −2.28
• The intercept (taken literally) means that, according to this estimated line,
districts with zero STR would have a (predicted) test score of 698.9. But this
interpretation of the intercept makes no sense – it extrapolates the line outside
the range of the data – here, the intercept is not economically meaningful.
Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for


which STR = 19.33 and Test Score = 657.8
predicted value: ŶAntelope = 698.9 − 2.28 × 19.33 = 654.8
residual: ûAntelope = 657.8 − 654.8 = 3.0
OLS regression: STATA output
regress testscr str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------

Predicted TestScore = 698.9 − 2.28 × STR


(We’ll discuss the rest of this output later.)
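For reference, a rough Python analogue of this STATA command (a sketch assuming the California data sit in a pandas DataFrame df with columns testscr and str; statsmodels' "HC1" covariance corresponds to STATA's , robust option):

# Sketch: Python analogue of "regress testscr str, robust" (assumes a pandas
# DataFrame `df` holding the California data with columns `testscr` and `str`).
import statsmodels.api as sm

X = sm.add_constant(df["str"])                      # regressor plus an intercept
ols = sm.OLS(df["testscr"], X).fit(cov_type="HC1")  # HC1 = STATA-style robust SEs
print(ols.summary())                                # coefficients, robust SEs, R-squared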
Measures of Fit (SW Section 4.3)
Two regression statistics provide complementary measures of
how well the regression line “fits” or explains the data:
• The regression R2 measures the fraction of the variance of Y
that is explained by X; it is unitless and ranges between zero
(no fit) and one (perfect fit)
• The standard error of the regression (SER) measures the
magnitude of a typical regression residual in the units of Y.
The regression R2 is the fraction of the sample
variance of Yi “explained” by the regression.
Yi = Ŷi + ûi = OLS prediction + OLS residual
⇒ sample var(Yi) = sample var(Ŷi) + sample var(ûi)
⇒ total sum of squares = “explained” SS + “residual” SS

Definition of R²:

$$R^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{\hat{Y}})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

• R2 = 0 means ESS = 0
• R2 = 1 means ESS = TSS
• 0 ≤ R2 ≤ 1
• For regression with a single X, R2 = the square of the correlation coefficient
between X and Y
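A quick numerical check of both facts (a sketch with simulated data and made-up parameter values):

# Sketch: R^2 computed as ESS/TSS equals the squared sample correlation of X
# and Y in the single-regressor case (simulated data, made-up values).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, 200)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 3.0, 200)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

ESS = np.sum((Y_hat - Y_hat.mean()) ** 2)
TSS = np.sum((Y - Y.mean()) ** 2)
print(ESS / TSS, np.corrcoef(X, Y)[0, 1] ** 2)   # the two numbers coincide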
The Standard Error of the Regression (SER)
The SER measures the spread of the distribution of u. The SER is
(almost) the sample standard deviation of the OLS residuals:

$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(\hat{u}_i - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}$$

The second equality holds because $\bar{\hat{u}} = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0$.

$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}$$

The SER:
• has the units of u, which are the units of Y
• measures the average “size” of the OLS residual (the average “mistake” made
by the OLS regression line)

The root mean squared error (RMSE) is closely related to the


SER:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2}$$

This measures the same thing as the SER – the minor difference
is division by n instead of n − 2.
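The two measures in code (a sketch; u_hat is the vector of OLS residuals from the fit):

# Sketch: SER divides the sum of squared residuals by n - 2, the RMSE by n;
# for n = 420 the difference is negligible.
import numpy as np

def ser(u_hat):
    return np.sqrt(np.sum(u_hat ** 2) / (len(u_hat) - 2))

def rmse(u_hat):
    return np.sqrt(np.sum(u_hat ** 2) / len(u_hat))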
Technical note: why divide by n – 2 instead
of n – 1?
$$SER = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2}$$

• Division by n − 2 is a “degrees of freedom” correction, just like division by n − 1 in
  the sample variance s²Y, except that for the SER two parameters have been
  estimated (β0 and β1, by β̂0 and β̂1), whereas in s²Y only one has been
  estimated (μY, by Ȳ).

• When n is large, it doesn’t matter whether n, n − 1, or n − 2 is used –
  although the conventional formula uses n − 2 when there is a single regressor.
• For details, see Section 18.4
The Least Squares Assumptions for
Causal Inference (SW Section 4.4)
• So far we have treated OLS as a way to draw a straight line
through the data on Y and X. Under what conditions does the
slope of this line have a causal interpretation? That is, when will
the OLS estimator be unbiased for the causal effect on Y of X?
• What is the variance of the OLS estimator over repeated
samples?
• To answer these questions, we need to make some assumptions
about how Y and X are related to each other, and about how they
are collected (the sampling scheme)
These assumptions – there are three – are known as the Least
Squares Assumptions for Causal Inference.
Definition of Causal Effect
• The causal effect on Y of a unit change in X is the expected
difference in Y as measured in a randomized controlled
experiment
– For a binary treatment, the causal effect is the expected difference in
means between the treatment and control groups, as discussed in Ch. 3.

• The least squares assumptions for causal inference generalize the


binary treatment case to regression.
The Least Squares Assumptions for Causal
Inference
Let β1 be the causal effect on Y of a change in X:
Yi = β0 + β1Xi + ui, i = 1,…, n
1. The conditional distribution of u given X has mean zero, that
   is, E(u|X = x) = 0.
   ⇒ This implies that β̂1 is unbiased for the causal effect β1
2. (Xi, Yi), i = 1,…, n, are i.i.d.
   – This is true if (X, Y) are collected by simple random sampling
   ⇒ This delivers the sampling distribution of β̂0 and β̂1
3. Large outliers in X and/or Y are rare.
   – Technically, X and Y have finite fourth moments
   ⇒ Outliers can result in meaningless values of β̂1
Least squares assumption #1: E(u|X = x) = 0. (1 of 2)
When β1 is the causal effect, for any given value of X, the mean of u is zero:
Example: Test Scorei = β0 + β1STRi + ui; ui = other factors


• What are some of these “other factors”?
Least squares assumption (LSA) #1:
E(u|X = x) = 0. (2 of 2)
• The benchmark for understanding this assumption is to consider an ideal
randomized controlled experiment:
• X is randomly assigned to people (students randomly assigned to different size
classes; patients randomly assigned to medical treatments). Randomization is
done by computer – using no information about the individual.
• Because X is assigned randomly, all other individual characteristics – the
things that make up u – are distributed independently of X, so u and X are
independent
• Thus, in an ideal randomized controlled experiment, E(u|X = x) = 0 (that is,
LSA #1 holds)
• In actual experiments, or with observational data, we will need to think hard
about whether E(u|X = x) = 0 holds.
Least squares assumption #2: (Xi,Yi),
i = 1,…,n are i.i.d.
This arises automatically if the entity (individual, district) is
sampled by simple random sampling:
• The entities are selected from the same population, so (Xi, Yi) are
identically distributed for all i = 1,…, n.
• The entities are selected at random, so the values of (X, Y) for different
entities are independently distributed.

Note: The main place we will encounter non-i.i.d. sampling is when data are
recorded over time for the same entity (panel data and time series data).
* We will deal with that complication when we cover panel data.
Least squares assumption #3: Large outliers are rare
Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
This is because OLS can be sensitive to an outlier:
• A large outlier is an extreme value of X or Y.

• Is the lone point an outlier in X or Y?


• In practice, outliers are often data glitches (coding or recording problems).
Sometimes they are observations that really shouldn’t be in your data set.
Plot your data!
Least squares assumption #3: Large outliers are rare
Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
• On a technical level, if X and Y are bounded, then they have
finite fourth moments. (Standardized test scores automatically
satisfy this; STR, family income, etc. satisfy this too.)
• The substance of this assumption is that a large outlier can
strongly influence the results – so we need to rule out large
outliers.
Before conducting any formal analysis, look at your data!
If you have a large outlier, is it a typo? Does it belong in your
data set? Why is it an outlier?
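A small illustration of why this matters (a sketch with simulated data; the outlier values are made up):

# Sketch: a single corrupted observation can drag the OLS slope well away from
# the true value, which is why LSA #3 rules out large outliers.
import numpy as np

def slope(X, Y):
    return np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, 100)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 1.0, 100)
print(slope(X, Y))                                        # close to the true slope of 2

X_bad, Y_bad = np.append(X, 10.0), np.append(Y, -50.0)    # one miscoded record
print(slope(X_bad, Y_bad))                                # pulled far from 2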
The Sampling Distribution of the OLS
Estimator (SW Section 4.5)
The OLS estimator is computed from a sample of data. A different sample
yields a different value of β̂1. This is the source of the “sampling
uncertainty” of β̂1. We want to:
• quantify the sampling uncertainty associated with β̂1
• use β̂1 to test hypotheses such as β1 = 0
• construct a confidence interval for β1


• All these require figuring out the sampling distribution of the
OLS estimator. Two steps to get there…
– Probability framework for linear regression
– Distribution of the OLS estimator
Probability Framework for Linear Regression
The probability framework for linear regression is summarized by the
three least squares assumptions.
Population
• The group of interest (e.g.: all possible school districts)
Random variables: Y, X
• E.g.: (Test Score, STR)
Joint distribution of (Y, X). We assume:
• The population regression function is linear
• E(u| X) = 0 (1st Least Squares Assumption (LSA))
• X, Y have nonzero finite fourth moments (3rd L.S.A.)
Data Collection by simple random sampling implies:
• {(Xi, Yi)}, i = 1,…, n, are i.i.d. (2nd L.S.A.)
The Sampling Distribution of β̂1
• Like Ȳ, β̂1 has a sampling distribution.
• What is E(β̂1)?
  ⇒ If E(β̂1) = β1, then OLS is unbiased – a good thing!
• What is var(β̂1)? (a measure of sampling uncertainty)
  ⇒ We need to derive a formula so we can compute the standard error of β̂1.
• What is the distribution of β̂1 in small samples?
  ⇒ It is very complicated in general
  ⇒ However, in large samples, β̂1 is normally distributed.
The mean and variance of the sampling distribution of β̂1

Yi = β0 + β1Xi + ui
Ȳ = β0 + β1X̄ + ū

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū)

Thus,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})\left[\beta_1(X_i - \bar{X}) + (u_i - \bar{u})\right]}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The mean and variance of the sampling distribution of β̂1 (continued)

After some algebraic manipulation (see the Appendix at the end of the slides),
we obtain

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

Now we can calculate E(β̂1) and var(β̂1):

$$E(\hat{\beta}_1) = \beta_1 + E\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]
= \beta_1 + E\left\{E\left[\left.\frac{\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right| X_1, \ldots, X_n\right]\right\} = \beta_1$$

The second equality uses the law of iterated expectations: the expected value of a
random variable equals the expectation of its conditional expectation given a second
random variable. The final equality holds because E(ui | Xi = x) = 0 by LSA #1.

• Thus LSA #1 implies that E(β̂1) = β1
• That is, β̂1 is an unbiased estimator of β1.
• For details see App. 4.3
Next calculate var(β̂1) (1 of 2)

Write

$$\hat{\beta}_1 - \beta_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\left(\frac{n-1}{n}\right) s_X^2}$$

where vi = (Xi − X̄)ui. If n is large, s²X ≈ σ²X and (n − 1)/n ≈ 1, so

$$\hat{\beta}_1 - \beta_1 \approx \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2},$$

where vi = (Xi − X̄)ui (see App. 4.3). Thus,


Next calculate var(β̂1) (2 of 2)

$$\hat{\beta}_1 - \beta_1 \approx \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2}$$

so

$$\mathrm{var}(\hat{\beta}_1 - \beta_1) = \mathrm{var}(\hat{\beta}_1) = \frac{\mathrm{var}\left(\frac{1}{n}\sum_{i=1}^{n} v_i\right)}{(\sigma_X^2)^2} = \frac{\mathrm{var}(v_i)/n}{(\sigma_X^2)^2}$$

where the final equality uses assumption 2 (i.i.d. sampling). Thus,

$$\mathrm{var}(\hat{\beta}_1) = \frac{1}{n}\cdot\frac{\mathrm{var}[(X_i - \mu_X)u_i]}{(\sigma_X^2)^2}.$$

Summary so far
1. β̂1 is unbiased: under LSA #1, E(β̂1) = β1 (just like Ȳ)
2. var(β̂1) is inversely proportional to n (just like Ȳ)
What is the sampling distribution of β̂1?
The exact sampling distribution is complicated – it depends on
the population distribution of (Y, X) – but when n is large we get
some simple (and good) approximations:
1) Because var(β̂1) ∝ 1/n and E(β̂1) = β1, β̂1 converges in probability to β1
2) When n is large, the sampling distribution of β̂1 is well
   approximated by a normal distribution (CLT)
Large-n approximation to the distribution of β̂1:

$$\hat{\beta}_1 - \beta_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\left(\frac{n-1}{n}\right) s_X^2} \approx \frac{\frac{1}{n}\sum_{i=1}^{n} v_i}{\sigma_X^2}, \quad \text{where } v_i = (X_i - \bar{X})u_i$$

• When n is large, vi = (Xi − X̄)ui ≈ (Xi − μX)ui, which is i.i.d. with finite variance
  (by LSA #3). So, by the CLT, (1/n)Σᵢ vi is approximately distributed N(0, σ²v/n).

• Thus, for n large, β̂1 is approximately distributed

$$\hat{\beta}_1 \sim N\!\left(\beta_1, \frac{\sigma_v^2}{n(\sigma_X^2)^2}\right), \quad \text{where } v_i = (X_i - \mu_X)u_i$$
The larger the variance of X, the smaller the variance of β̂1

The math

$$\mathrm{var}(\hat{\beta}_1 - \beta_1) = \frac{1}{n}\cdot\frac{\mathrm{var}[(X_i - \mu_X)u_i]}{(\sigma_X^2)^2}$$

where σ²X = var(Xi). The variance of X appears (squared) in the
denominator – so increasing the spread of X decreases the variance
of β̂1.

The intuition
If there is more variation in X, then there is more information in
the data that you can use to fit the regression line. This is most
easily seen in a figure…
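In place of the figure, a simulation sketch (made-up parameter values) comparing the sampling variance of β̂1 when X has little spread versus a lot of spread:

# Sketch: holding n and the error variance fixed, more variation in X gives a
# much smaller sampling variance of the OLS slope estimator.
import numpy as np

def var_of_beta1_hat(sd_x, reps=2000, n=100, beta0=1.0, beta1=2.0, seed=3):
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        X = rng.normal(0.0, sd_x, n)
        Y = beta0 + beta1 * X + rng.normal(0.0, 1.0, n)
        est[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    return est.var()

print(var_of_beta1_hat(sd_x=0.5))   # small spread in X -> larger var(beta1_hat)
print(var_of_beta1_hat(sd_x=2.0))   # large spread in X -> much smaller var(beta1_hat)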
Summary of the sampling distribution of β̂1:
If the three Least Squares Assumptions hold, then
• The exact (finite sample) sampling distribution of β̂1 has:
  – E(β̂1) = β1 (that is, β̂1 is unbiased)
  – $\mathrm{var}(\hat{\beta}_1) = \frac{1}{n}\cdot\frac{\mathrm{var}[(X_i - \mu_X)u_i]}{\sigma_X^4} \propto \frac{1}{n}$
• Other than its mean and variance, the exact distribution of β̂1 is
  complicated and depends on the distribution of (X, u)
• β̂1 converges in probability to β1 (that is, β̂1 is consistent)
• When n is large, $\frac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{\mathrm{var}(\hat{\beta}_1)}} \sim N(0, 1)$ (CLT)
Note: This is similar to the sampling distribution of Ȳ.
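A Monte Carlo sketch of these properties (made-up parameter values): the estimates are centered on β1, their variance falls roughly like 1/n, and a histogram of the draws would look approximately normal.

# Sketch: simulate many samples, re-estimate beta1 each time, and check the
# summary above: unbiasedness, variance proportional to 1/n, approximate normality.
import numpy as np

def beta1_draws(n, reps=5000, beta0=1.0, beta1=2.0, seed=4):
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        X = rng.normal(0.0, 1.0, n)
        Y = beta0 + beta1 * X + rng.normal(0.0, 1.0, n)
        est[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    return est

small, large = beta1_draws(n=50), beta1_draws(n=200)
print(small.mean(), large.mean())    # both close to the true beta1 = 2 (unbiased)
print(small.var() / large.var())     # roughly 4, i.e. var(beta1_hat) ~ 1/n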
Key Concept 4.4: Large-Sample Distributions of β̂0 and β̂1

If the least squares assumptions in Key Concept 4.3 hold, then in
large samples β̂0 and β̂1 have a jointly normal sampling distribution.

The large-sample normal distribution of the slope β̂1 is $N(\beta_1, \sigma^2_{\hat{\beta}_1})$, where the
variance of this distribution, $\sigma^2_{\hat{\beta}_1}$, is

$$\sigma_{\hat{\beta}_1}^2 = \frac{1}{n}\cdot\frac{\mathrm{var}[(X_i - \mu_X)u_i]}{[\mathrm{var}(X_i)]^2}. \qquad (4.21)$$

The large-sample normal distribution of the intercept β̂0 is $N(\beta_0, \sigma^2_{\hat{\beta}_0})$, where

$$\sigma_{\hat{\beta}_0}^2 = \frac{1}{n}\cdot\frac{\mathrm{var}(H_i u_i)}{[E(H_i^2)]^2}, \quad \text{where } H_i = 1 - \left[\frac{\mu_X}{E(X_i^2)}\right] X_i. \qquad (4.22)$$
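A sketch of the sample analogue of (4.21): replacing the population moments by sample moments gives, up to a degrees-of-freedom factor, the heteroskedasticity-robust standard error that packages such as STATA report. Here X is the regressor array and u_hat the OLS residuals.

# Sketch: plug-in (sample-analogue) version of equation (4.21) for the standard
# error of beta1_hat; X is the regressor array, u_hat the OLS residuals.
import numpy as np

def se_beta1(X, u_hat):
    n = len(X)
    num = np.mean((X - X.mean()) ** 2 * u_hat ** 2)   # estimates var[(Xi - muX) ui]
    den = np.var(X) ** 2                              # estimates [var(Xi)]^2
    return np.sqrt(num / (n * den))                   # square root of (4.21)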
The Least Squares Assumptions for Prediction
(SW Appendix 4.4) (1 of 2)
• Prediction entails using an estimation sample to estimate a
prediction model, then using that model to predict the value
of Y for an observation not in the estimation sample.
– Prediction requires good out-of-sample performance.

• For prediction, β1 is simply the slope of the population


regression line (the conditional expectation of Y given X),
which in general is not the causal effect.
• The critical LSA for Prediction is that the out-of-sample
(“OOS”) observation for which you want to predict Y comes
from the same distribution as the data used to estimate the
model.
– This replaces LSA#1 for Causal Inference
The Least Squares Assumptions for Prediction
(SW Appendix 4.4) (2 of 2)
1. The out-of-sample observation (X^OOS, Y^OOS) is drawn from
   the same distribution as the estimation sample (Xi, Yi),
   i = 1,…, n
– This ensures that the regression line fit using the estimation sample
also applies to the out-of-sample data to be predicted.

2. (Xi,Yi), i = 1,…, n are i.i.d.


– This is the same as LSA#2 for causal inference

3. Large outliers in X and/or Y are rare (X and Y have finite


fourth moments)
– This is the same as LSA#3 for causal inference
* In this book, the assumption that large outliers are unlikely is made mathematically
precise by assuming that X and Y have nonzero finite fourth moments.
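A minimal sketch of the prediction setting with simulated data (all numbers made up): the model is fit on an estimation sample, then used to predict Y for out-of-sample observations drawn from the same distribution.

# Sketch: fit on an estimation sample, predict for out-of-sample draws from the
# same distribution (the key least squares assumption for prediction).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(20.0, 2.0, 400)                         # estimation sample
Y = 700.0 - 2.0 * X + rng.normal(0.0, 10.0, 400)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

X_oos = rng.normal(20.0, 2.0, 50)                      # same distribution as above
Y_oos_pred = b0 + b1 * X_oos                           # out-of-sample predictions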
APPENDIX
The mean and variance of the sampling distribution of β̂1 (1 of 3)

Some algebra:

Yi = β0 + β1Xi + ui
Ȳ = β0 + β1X̄ + ū

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū)

Thus,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})\left[\beta_1(X_i - \bar{X}) + (u_i - \bar{u})\right]}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The mean and variance of the sampling distribution of β̂1 (2 of 3)

$$\hat{\beta}_1 = \beta_1\frac{\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} + \frac{\sum_{i=1}^{n}(X_i - \bar{X})(u_i - \bar{u})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

so

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(X_i - \bar{X})(u_i - \bar{u})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}.$$

Now

$$\sum_{i=1}^{n}(X_i - \bar{X})(u_i - \bar{u}) = \sum_{i=1}^{n}(X_i - \bar{X})u_i - \left[\sum_{i=1}^{n}(X_i - \bar{X})\right]\bar{u}$$

$$= \sum_{i=1}^{n}(X_i - \bar{X})u_i - \left[\left(\sum_{i=1}^{n}X_i\right) - n\bar{X}\right]\bar{u} = \sum_{i=1}^{n}(X_i - \bar{X})u_i$$
The mean and variance of the sampling distribution of β̂1 (3 of 3)

Substitute $\sum_{i=1}^{n}(X_i - \bar{X})(u_i - \bar{u}) = \sum_{i=1}^{n}(X_i - \bar{X})u_i$ into the expression for β̂1 − β1:

$$\hat{\beta}_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(u_i - \bar{u})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

so

$$\hat{\beta}_1 - \beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
