Regression 1: Framework

Instructor: Yuta Toyama

Last updated: 2021-06-16


Introduction

2 / 52
Observational Study (観察研究)
Researchers in social science cannot always conduct a randomized controlled trial.

Instead, we need to use observational data in which treatment assignment may not be
random.

An approach in this case is to control for observable characteristics that cause selection bias.

This approach essentially amounts to estimating a linear regression model (線形回帰モデル) by ordinary least squares (OLS, 最小二乗法).

3 / 52
Overview
Introduce the idea of the matching (マッチング) estimator.
Identification of treatment effects under the selection-on-observables assumption.
Linear regression is a special case of matching estimator.
Linear regression: framework, practical topics, inference

4 / 52
Selection on Observables, or Matching

5 / 52
Matching to eliminate selection bias
Idea: Compare individuals with the same observed characteristics X across the treatment
and control groups.

If treatment choice is driven by observed characteristics (such as age, income, gender, etc.),
controlling for such factors would eliminate the selection bias.

Two key assumptions in matching

6 / 52
Assumption 1: Selection on observables
Let Xi denote the observed characteristics (sometimes called covariates (共変量))
age, income, education, race, etc.

Assumption 1:

Di ⊥ (Y0i , Y1i ) |Xi

Conditional on Xi , treatment assignment is random.

This assumption is often referred to by other names:


Selection on observables
Ignorability
Unconfoundedness

7 / 52
Assumption 2: Overlapping assumption
Assumption 2:

P (Di = 1|Xi = x) ∈ (0, 1) ∀x

Given x, we should be able to observe people from both the control and treatment groups.

The probability P (Di = 1|Xi = x) is called propensity score (傾向スコア).

8 / 52
Identification of Treatment Effect Parameters
Assumption 1 (unconfoundedness) implies that

E[Y1i |Di = 1, Xi ] = E[Y1i |Di = 0, Xi ] = E[Y1i |Xi ]

E[Y0i |Di = 1, Xi ] = E[Y0i |Di = 0, Xi ] = E[Y0i |Xi ]

Once you condition on Xi , the argument is essentially the same as in an RCT.

9 / 52
The ATT conditional on Xi = x is given by

E[Y1i − Y0i |Di = 1, Xi ] = E[Y1i |Di = 1, Xi ] − E[Y0i |Di = 1, Xi ]

= E[Y1i |Di = 1, Xi ] − E[Y0i |Di = 0, Xi ]

Assumption 2 (overlapping) is needed to use the following

E[Ydi |Di = d, Xi ] = E[Yi |Di = d, Xi ] for d = 0, 1

Why? The overlapping assumption P (Di = 1|Xi = x) ∈ (0, 1) means that for each x, we
should have people in both the treatment and control groups.

If not, we cannot observe both E[Yi |Di = d, Xi ] for d = 0, 1.

With two assumptions,

E[Y1i − Y0i |Di = 1, Xi ] = E[Yi |Di = 1, Xi ] − E[Yi |Di = 0, Xi ]

where the first term is the average outcome with Xi in the treatment group and the second is the average outcome with Xi in the control group.

10 / 52
ATT: E[Y1i − Y0i |Di = 1]

The ATT is given by

ATT = E[Y1i − Y0i |Di = 1]

    = ∫ E[Y1i − Y0i |Di = 1, Xi = x] fXi(x|Di = 1) dx

    = E[Yi |Di = 1] − ∫ E[Yi |Di = 0, Xi = x] fXi(x|Di = 1) dx

11 / 52
ATE: E[Y1i − Y0i ]

The ATE is

ATE = E[Y1i − Y0i ]

    = ∫ E[Y1i − Y0i |Xi = x] fXi(x) dx

    = ∫ E[Y1i |Di = 1, Xi = x] fXi(x) dx − ∫ E[Y0i |Di = 0, Xi = x] fXi(x) dx

    = ∫ E[Yi |Di = 1, Xi = x] fXi(x) dx − ∫ E[Yi |Di = 0, Xi = x] fXi(x) dx

12 / 52
From Identification to Estimation
We need to estimate two conditional expectations E[Yi |Di = 1, Xi = x] and
E[Yi |Di = 0, Xi = x]

Several ways to implement this.

1. Regression: Nonparametric and Parametric
2. Nearest neighbor matching (最近傍マッチング)
3. Propensity Score Matching (傾向スコアマッチング)

Here, I only explain a parametric regression as a way to implement the matching method.

See Appendix and textbooks for the details of matching estimators.

13 / 52
From Matching to Linear Regression Model
Assume that

E[Yi |Di = 0, Xi = x] = β′xi

E[Yi |Di = 1, Xi = x] = β′xi + τ

Here, the treatment effect is given by τ.

You will have the linear regression model

yi = β′xi + τ Di + ϵi ,  E[ϵi |Di , xi ] = 0

Run a linear regression to obtain the treatment effect parameter τ.
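
As an illustration (not part of the original slides), here is a minimal Python sketch with simulated data in which treatment assignment depends only on observed covariates; in that case OLS of Y on D and X recovers τ. The variable names and the data-generating process are made up for this example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))                      # observed covariates X_i
p = 1 / (1 + np.exp(-(x[:, 0] + x[:, 1])))       # treatment depends only on X_i
d = rng.binomial(1, p)                           # treatment indicator D_i
tau = 2.0                                        # true treatment effect
y = x @ np.array([1.0, -0.5]) + tau * d + rng.normal(size=n)

# OLS of Y on a constant, D, and X recovers tau under selection on observables
X = sm.add_constant(np.column_stack([d, x]))
res = sm.OLS(y, X).fit()
print(res.params[1])   # estimate of tau, close to 2.0
```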

14 / 52
Linear Regression: Framework

15 / 52
Regression (回帰) framework
The linear regression model (線形回帰モデル) is defined as

Yi = β0 + β1 X1i + ⋯ + βK XKi + ϵi

i: index for observations. i = 1, ⋯ , N .


Yi : dependent variable (被説明変数)

Xki : explanatory variable (説明変数)


ϵi : error term (誤差項)
β : coefficients (係数)

Data (sample): {Yi , Xi1 , … , XiK } for i = 1, … , N

We want to estimate coefficients β.

16 / 52
Ordinary Least Squares (最小二乗法、OLS)
OLS estimators are the minimizers of the sum of squared residuals:

min_{β0 ,⋯,βK} (1/N) ∑_{i=1}^N ( Yi − (β0 + β1 Xi1 + ⋯ + βK XiK) )²

First order conditions characterize the OLS estimator. Denote it by β̂.
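
As a quick illustration (simulated data; names are illustrative only), the OLS estimator can be computed directly from the normal equations, (X′X)⁻¹X′Y, and matches what statsmodels returns:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 200, 3
X = sm.add_constant(rng.normal(size=(n, k)))      # [1, X_1, ..., X_K]
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # closed-form OLS: (X'X)^{-1} X'y
print(beta_hat)
print(sm.OLS(y, X).fit().params)                  # same numbers from statsmodels
```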

17 / 52
Residual Regression (残差回帰)
Consider the model

Yi = β0 + αDi + β1 X1i + ⋯ + βK XKi + ϵi

Suppose that you are interested in α (say treatment effect parameter).

Residual regression characterizes the OLS estimator α̂ in the following way.

18 / 52
Frisch–Waugh–Lovell Theorem
1. Run OLS regression of Di on all other explanatory variables 1, X1i , ⋯ , XKi . Obtain the residual û_i^D.

2. Run OLS regression of Yi on all other explanatory variables 1, X1i , ⋯ , XKi . Obtain the residual û_i^Y.

3. Run OLS regression of û_i^Y on û_i^D without a constant term. The OLS estimator α̂ is

α̂ = ( ∑_i û_i^Y û_i^D ) / ( ∑_i (û_i^D)² )
19 / 52
How to use FWL theorem
1. Computational advantage if you are interested in a particular coefficient. We use this idea in the estimation of panel data models.

2. Useful to see how the coefficient of interest is estimated. We will see this later in relation to multicollinearity (多重共線性).

3. Double machine learning (Chernozhukov et al. 2018): estimation of treatment effect parameters when many covariates are available.

20 / 52
Assumptions for OLS
1. Random sample (ランダムサンプル): {Yi , Xi1 , … , XiK } is an i.i.d. (independently and identically distributed) sample.

2. Mean independence: ϵi has zero conditional mean

E[ϵi |Xi1 , … , XiK ] = 0

3. Large outliers are unlikely: the random variables Yi and Xik have finite fourth moments.

4. No perfect multicollinearity (多重共線性): no exact linear relationship among the explanatory variables.

21 / 52
Theoretical Properties of OLS estimator
1. Unbiasedness: Conditional on the explanatory variables X, the expectation of the OLS estimator β̂ is equal to the true value β:

E[β̂|X] = β

2. Consistency: As the sample size N goes to infinity, the OLS estimator β̂ converges to β in probability:

β̂ →p β

3. Asymptotic normality (漸近正規性): discussed later.

22 / 52
Linear Regression: Practical Topics

23 / 52
Interpretation of Regression Coefficients
Remember that

Yi = β0 + β1 X1i + ⋯ + βK XKi + ϵi

The coefficient βk : the effect of Xk on Y ceteris paribus (all other things being equal).
Equivalently, if Xk is a continuous random variable,

∂Y/∂Xk = βk

If we can estimate βk without bias, we can obtain the causal effect of Xk on Y.

24 / 52
Common Specifications in Linear Regression Model
Several specifications are frequently used in empirical analysis:
1. Nonlinear term
2. log specification
3. dummy (categorical) variables
4. interaction terms (交差項)

25 / 52
Nonlinear term (非線形項)
Non-linear relationship between Y and X in a linearly additive form
Yi = β0 + β1 Xi + β2 Xi² + β3 Xi³ + ϵi

As long as the error term ϵi appears in an additively linear way, we can estimate the
coefficients by OLS.
Multicollinearity could be an issue if we include many polynomial terms (多項式).
You can use other non-linear variables such as log(x) and √x.

26 / 52
log specification
Using log changes the interpretation of the coefficient β in terms of scales.

Dependent Explanatory interpretation


Y X 1 unit increase in X causes β units change in Y
log Y X 1 unit increase in X causes 100β% change in Y
Y log X 1% increase in X causes β/100 unit change in Y
log Y log X 1% increase in X causes β% change in Y

27 / 52
Dummy variable (ダミー変数)
A dummy variable takes only the values 1 or 0. It is used to express qualitative information.
Example: Dummy variable for race

whitei = 1 if white, 0 otherwise

The coefficient on a dummy variable captures the difference of the outcome Y between
categories

Yi = β0 + β1 whitei + ϵi

The coefficient β1 captures the difference of Y between white and non-white people.

28 / 52
Interaction term (交差項)
You can add the interaction of two explanatory variables in the regression model.
For example:

wagei = β0 + β1 educi + β2 whitei + β3 educi × whitei + ϵi

where wagei is the earnings of person i and educi is the years of schooling for person i.
The effect of educi is

∂wagei /∂educi = β1 + β3 whitei ,

This allows for heterogeneous effects of education across races.
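
A small sketch of estimating such a specification (simulated data; the variable names follow the slide but the numbers are made up). The formula interface expands educ*white into the three terms above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "educ": rng.integers(8, 21, size=n),
    "white": rng.binomial(1, 0.6, size=n),
})
df["wage"] = (5 + 1.5 * df["educ"] + 2 * df["white"]
              + 0.5 * df["educ"] * df["white"]
              + rng.normal(scale=3, size=n))

# "wage ~ educ * white" expands to educ + white + educ:white
res = smf.ols("wage ~ educ * white", data=df).fit()
print(res.params)   # effect of educ is beta_educ + beta_interaction * white
```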

29 / 52
Measures of Fit
We often use R2 (決定係数) as a measure of the model fit.
Denote the fitted value as ŷi :

ŷi = β̂0 + β̂1 Xi1 + ⋯ + β̂K XiK

Also called prediction from the OLS regression.

30 / 52
R² is defined as

R² = SSE / TSS,

where

SSE = ∑_i (ŷi − ȳ)² ,  TSS = ∑_i (yi − ȳ)²

R² captures the fraction of the variation of Y explained by the regression model.

Adding variables always (weakly) increases R².

31 / 52
In a regression model with multiple explanatory variables, we often use the adjusted R², which adjusts for the number of explanatory variables:

R̄² = 1 − [ (N − 1) / (N − (K + 1)) ] × (SSR / TSS)

where

SSR = ∑_i (ŷi − yi)²  (= ∑_i ûi²)
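
A short sketch (simulated data) computing R² and the adjusted R² from these formulas and checking them against the values reported by statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, k = 300, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

res = sm.OLS(y, X).fit()
y_hat = res.fittedvalues
tss = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)

r2 = sse / tss
adj_r2 = 1 - (n - 1) / (n - (k + 1)) * ssr / tss
print(r2, res.rsquared)          # match
print(adj_r2, res.rsquared_adj)  # match
```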

32 / 52
Linear Regression: Inference

33 / 52
Statistical Inference of OLS Estimator
The OLS estimator is a random variable, as it depends on the drawn sample.

We need to conduct statistical inference to evaluate statistical uncertainty of the OLS


estimates.

Plan

Asymptotic distribution (漸近分布) of OLS estimator


Statistical inference:
Homoskedasticity (均一分散) vs Heteroskedasticity (不均一分散)

34 / 52
Asymptotic Normality (漸近正規性) of OLS Estimator
Under the OLS assumptions, the OLS estimator has asymptotic normality:

√N (β̂ − β) →d N(0, V)

V is called the asymptotic variance (matrix), given by

V = (E[xi xi′])⁻¹ E[xi xi′ ϵi²] (E[xi xi′])⁻¹

which is a (K + 1) × (K + 1) matrix, where xi = (1, Xi1 , ⋯ , XiK )′ is a (K + 1) × 1 vector.

35 / 52
We can approximate the distribution of β̂ by

β̂ ∼ N(β, V/N)

The individual coefficient β̂k follows

β̂k ∼ N(βk , Vkk /N)

36 / 52
Estimation of Asymptotic Variance (漸近分散)
V is an unknown object and needs to be estimated.
Consider the estimator V̂ for V using sample analogues:

V̂ = ( (1/N) ∑_{i=1}^N xi xi′ )⁻¹ ( (1/N) ∑_{i=1}^N xi xi′ ϵ̂i² ) ( (1/N) ∑_{i=1}^N xi xi′ )⁻¹

where ϵ̂i = yi − (β̂0 + β̂1 Xi1 + ⋯ + β̂K XiK ) is the residual.
Technically speaking, V̂ converges to V in probability.
We often use the (asymptotic) standard error SE(β̂k ) = √(V̂kk /N).

The standard error is an estimator for the standard deviation of the OLS estimator β̂k .
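
A sketch of this sample-analogue (sandwich) variance estimator on simulated heteroskedastic data; the resulting standard errors match statsmodels' HC0 robust standard errors. The data-generating process is made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))  # heteroskedastic errors

res = sm.OLS(y, X).fit()
e = res.resid

# Sandwich: V_hat = A^{-1} B A^{-1} with A = (1/N) sum x_i x_i',  B = (1/N) sum x_i x_i' e_i^2
A = X.T @ X / n
B = (X * e[:, None] ** 2).T @ X / n
V_hat = np.linalg.inv(A) @ B @ np.linalg.inv(A)
se_manual = np.sqrt(np.diag(V_hat) / n)

print(se_manual)
print(res.HC0_se)   # statsmodels' heteroskedasticity-robust (HC0) standard errors match
```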

37 / 52
Hypothesis testing
You might want to test a particular hypothesis regarding those coefficients.
Does x really affect y?
Does the production technology exhibit constant returns to scale?

38 / 52
3 Steps in Hypothesis Testing
Step 1: Consider the null hypothesis H0 and the alternative hypothesis H1

H0 : β1 = k, H1 : β1 ≠ k

where k is a known number that you choose.

Step 2: Define the t-statistic by

tn = (β̂1 − k) / SE(β̂1 )

Step 3: We reject H0 at the α-percent significance level if

|tn | > Cα/2

where Cα/2 is the upper α/2 critical value (the 1 − α/2 quantile) of the standard normal distribution. We say we fail to reject H0 if the above does not hold.
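
A small worked example of the three steps on simulated data (names and numbers are illustrative), using the normal critical value 1.96 for a 5% two-sided test:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(500, 1)))
y = X @ np.array([1.0, 0.3]) + rng.normal(size=500)
res = sm.OLS(y, X).fit(cov_type="HC1")            # robust standard errors

k = 0.0                                           # null hypothesis H0: beta_1 = k
t_stat = (res.params[1] - k) / res.bse[1]
p_value = 2 * (1 - stats.norm.cdf(abs(t_stat)))   # two-sided p-value from N(0,1)
reject = abs(t_stat) > stats.norm.ppf(0.975)      # compare with the 5% critical value 1.96
print(t_stat, p_value, reject)
```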
39 / 52
Caveats on Hypothesis Testing
We often say β̂ is statistically significant (統計的有意) at the 5% level if |tn | > 1.96 when we
set k = 0.

You should also discuss economic significance (経済的有意) of the coefficient in analysis.

Case 1: Small but statistically significant coefficient.


As the sample size N gets large, the SE decreases.

Case 2: Large but statistically insignificant coefficient.


The variable might have an important (economically meaningful) effect.
But you may not be able to estimate the effect precisely with the sample at your hand.

40 / 52
F test
We often test a composite hypothesis that involves multiple parameters such as

H0 : β1 + β2 = 0, H1 : β1 + β2 ≠ 0

We use F test in such a case.

41 / 52
Confidence interval (信頼区間)
95% confidence interval

CIn = { k : |(β̂1 − k) / SE(β̂1 )| ≤ 1.96 }

    = [ β̂1 − 1.96 × SE(β̂1 ),  β̂1 + 1.96 × SE(β̂1 ) ]

Interpretation: If you draw many samples (dataset) and construct the 95% CI for each
sample, 95% of those CIs will include the true parameter.
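
A short sketch constructing the 95% CI from β̂1 ± 1.96 × SE(β̂1) on simulated data and comparing it with the interval statsmodels reports (nearly identical in large samples):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(500, 1)))
y = X @ np.array([1.0, 0.3]) + rng.normal(size=500)
res = sm.OLS(y, X).fit()

beta1, se1 = res.params[1], res.bse[1]
ci_manual = (beta1 - 1.96 * se1, beta1 + 1.96 * se1)   # 95% CI from the normal approximation
print(ci_manual)
print(res.conf_int(alpha=0.05)[1])                     # statsmodels' CI for beta_1 (very close)
```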

42 / 52
Homoskedasticity vs Heteroskedasticity
The error term ϵi is heteroskedastic (不均一分散) if Var(ϵi |Xi ) depends on Xi . The
asymptotic variance is

V = (E[xi xi′])⁻¹ E[xi xi′ ϵi²] (E[xi xi′])⁻¹

If not, we say ϵi is homoskedastic (均一分散). In this case,

V = (E[xi xi′])⁻¹ σ²

where σ² = Var(ϵi ).

43 / 52
Standard Errors in Practice
Standard errors computed under the heteroskedasticity assumption are called heteroskedasticity-robust
standard errors (不均一分散に頑健な標準誤差).

In many statistical packages (including R and Stata), the standard errors for the OLS
estimators are calculated under the homoskedasticity assumption by default.

However, if the error is heteroskedastic, the standard errors computed under the homoskedasticity
assumption will be underestimated.

In OLS, we should always use heteroskedasticity-robust standard errors.
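
A minimal sketch contrasting the two kinds of standard errors in statsmodels on simulated heteroskedastic data (the HC1 option requests heteroskedasticity-robust standard errors; the data are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = sm.add_constant(rng.normal(size=(1000, 1)))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=1000) * (1 + X[:, 1] ** 2)   # heteroskedastic error

homo = sm.OLS(y, X).fit()                       # default: homoskedasticity-only SEs
robust = sm.OLS(y, X).fit(cov_type="HC1")       # heteroskedasticity-robust (HC1) SEs
print(homo.bse)
print(robust.bse)                               # typically larger here
```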

44 / 52
Appendix: Matching Estimator

45 / 52
Estimation Methods
We need to estimate E[Yi |Di = 1, Xi = x] and E[Yi |Di = 0, Xi = x]

Several ways to implement the above idea

1. Regression: Nonparametric and Parametric
2. Nearest neighbor matching
3. Propensity Score Matching

46 / 52
Approach 1: Regression, or Analogue Approach
Let μ̂k (x) be an estimator of μk (x) = E[Yi |Di = k, Xi = x] for k ∈ {0, 1}.

The analog estimator of the ATE is

(1/N) ∑_{i=1}^N ( μ̂1 (Xi ) − μ̂0 (Xi ) )

and the analog estimator of the ATT is

[ N⁻¹ ∑_{i=1}^N Di (Yi − μ̂0 (Xi )) ] / [ N⁻¹ ∑_{i=1}^N Di ]

How to estimate μk (x) = E[Yi |Di = k, Xi = x] ?

47 / 52
Nonparametric Estimation
Suppose that Xi ∈ {x1 , ⋯ , xK } is discrete with small K.
Ex: two demographic characteristics (male/female, white/non-white), so K = 4.
Then, a nonparametric binning estimator is

μ̂k (x) = ∑_{i=1}^N 1{Di = k, Xi = x} Yi / ∑_{i=1}^N 1{Di = k, Xi = x}

Here, I do not put any parametric assumption on μk (x) = E[Yi |Di = k, Xi = x].
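
A small sketch of the binning estimator with two binary covariates (simulated data; column names are illustrative): μ̂k(x) is just the cell mean of Y for each (treatment, X) combination.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n = 2000
df = pd.DataFrame({
    "male":  rng.binomial(1, 0.5, size=n),
    "white": rng.binomial(1, 0.6, size=n),
    "d":     rng.binomial(1, 0.5, size=n),
})
df["y"] = 1 + 2 * df["d"] + df["male"] - df["white"] + rng.normal(size=n)

# Binning estimator: the cell mean of Y for each (treatment, X) combination
mu_hat = df.groupby(["d", "male", "white"])["y"].mean()
print(mu_hat)
```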

48 / 52
Curse of dimensionality
Issue: Poor performance if K is large due to many covariates.
So many potential groups, too few observations for each group.
With K variables, each of which takes L values, there are L^K possible groups (bins) in total.
This is known as the curse of dimensionality.
Relatedly, if X is a continuous random variable, we can use kernel regression.

49 / 52
Parametric Estimation, or going back to linear regression
If you put a parametric assumption such as

E[Yi |Di = 0, Xi = x] = β′xi

E[Yi |Di = 1, Xi = x] = β′xi + τ

then you will have the model

yi = β′xi + τ Di + ϵi

You can think of the matching estimator as controlling for omitted variable bias by adding
(many) covariates (control variables) xi .
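
A sketch of the analog estimators from the earlier appendix slide, with μ̂0 and μ̂1 fitted by separate OLS regressions on the control and treated subsamples (simulated data; the true effect is set to 2.0 for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 4000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-x[:, 0]))
d = rng.binomial(1, p)
y = x @ np.array([1.0, -0.5]) + 2.0 * d + rng.normal(size=n)

# Fit mu_0 and mu_1 by separate OLS regressions on the control and treated subsamples
X = sm.add_constant(x)
mu0 = sm.OLS(y[d == 0], X[d == 0]).fit()
mu1 = sm.OLS(y[d == 1], X[d == 1]).fit()

mu0_hat = mu0.predict(X)                           # mu_0(X_i) for every i
mu1_hat = mu1.predict(X)
ate_hat = np.mean(mu1_hat - mu0_hat)               # analog ATE estimator
att_hat = np.mean(d * (y - mu0_hat)) / np.mean(d)  # analog ATT estimator
print(ate_hat, att_hat)                            # both near the true effect 2.0
```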

50 / 52
Approach 2: M-Nearest Neighbor Matching
Idea: Find the counterpart in the other group that is close to me.
Define ŷi (0) and ŷi (1) as the estimators of the (hypothetical) outcomes without and with
treatment.

ŷi (0) = yi if Di = 0,  and  ŷi (0) = (1/M) ∑_{j∈LM(i)} yj if Di = 1

LM (i) is the set of M individuals in the opposite group who are "close" to individual i
Several ways to define the distance between Xi and Xj , such as

dist(Xi , Xj ) = ||Xi − Xj ||²

Need to choose (1) M and (2) the measure of distance
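
A minimal sketch of M = 1 nearest neighbor matching for the ATT using plain Euclidean distance (simulated data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000
x = rng.normal(size=(n, 2))
d = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = x @ np.array([1.0, -0.5]) + 2.0 * d + rng.normal(size=n)

x_t, y_t = x[d == 1], y[d == 1]     # treated
x_c, y_c = x[d == 0], y[d == 0]     # control

# For each treated unit, find the single (M = 1) closest control unit by Euclidean distance
dists = ((x_t[:, None, :] - x_c[None, :, :]) ** 2).sum(axis=2)
nearest = dists.argmin(axis=1)
y_hat_0 = y_c[nearest]              # imputed untreated outcome for each treated unit

att_hat = np.mean(y_t - y_hat_0)    # matching estimate of the ATT
print(att_hat)
```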

51 / 52
Approach 3: Propensity Score Matching
Use the propensity score P (Di = 1|Xi = x) as the distance measure to define who is closest.
Step 1: Estimate the propensity score function by logit or probit using a flexible function of Xi .
Step 2: Calculate the propensity score for each observation. Use it to define the pairs.
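
A sketch of these two steps (simulated data; a plain logit on Xi is used for Step 1, and each treated unit is paired with the control unit whose propensity score is closest):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 2000
x = rng.normal(size=(n, 2))
d = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y = x @ np.array([1.0, -0.5]) + 2.0 * d + rng.normal(size=n)

# Step 1: estimate the propensity score with a logit of D on X
X = sm.add_constant(x)
pscore = sm.Logit(d, X).fit(disp=0).predict(X)

# Step 2: match each treated unit to the control unit with the closest propensity score
ps_t, ps_c = pscore[d == 1], pscore[d == 0]
y_t, y_c = y[d == 1], y[d == 0]
nearest = np.abs(ps_t[:, None] - ps_c[None, :]).argmin(axis=1)
att_hat = np.mean(y_t - y_c[nearest])
print(att_hat)
```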

52 / 52
