Regression 1: Framework
Observational Study (観察研究)
Researchers in social science cannot always conduct a randomized controlled trial.
Instead, we need to use observational data, in which treatment assignment may not be random.
Overview
Introduce the idea of the matching (マッチング) estimator.
Identification of treatment effects under the selection-on-observables assumption.
Linear regression is a special case of the matching estimator.
Linear regression: framework, practical topics, inference.
Selection on Observables, or Matching
Matching to eliminate selection bias
Idea: Compare individuals with the same observed characteristics X across the treatment and control groups.
If treatment choice is driven by observed characteristics (such as age, income, and gender), controlling for such factors would eliminate the selection bias.
Assumption 1: Selection on observables
Let X_i denote the observed characteristics (sometimes called covariates (共変量)): age, income, education, race, etc.
Assumption 1 (unconfoundedness):
$$(Y_{0i}, Y_{1i}) \perp D_i \mid X_i$$
Conditional on the covariates X_i, treatment assignment is as good as random.
Assumption 2: Overlapping assumption
Assumption 2 (overlap):
$$P(D_i = 1 \mid X_i = x) \in (0, 1) \text{ for all } x$$
Given x, we should be able to observe people from both the control and treatment groups.
Identification of Treatment Effect Parameters
Assumption 1 (unconfoundedness) implies that
$$E[Y_{di} \mid D_i, X_i] = E[Y_{di} \mid X_i], \quad d \in \{0, 1\}$$
Once you condition on X_i, the argument is essentially the same as the one in an RCT.
The ATT conditional on X_i = x is given by
$$\tau(x) = E[Y_{1i} - Y_{0i} \mid D_i = 1, X_i = x] = E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]$$
Why? The overlapping assumption P(D_i = 1 | X_i = x) ∈ (0, 1) means that for each x, we should have people in both the treatment and control groups.
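Written out step by step, the identification argument combines the two assumptions:
$$\begin{aligned}
E[Y_{1i} - Y_{0i} \mid D_i = 1, X_i = x]
  &= E[Y_{1i} \mid D_i = 1, X_i = x] - E[Y_{0i} \mid D_i = 1, X_i = x] \\
  &= E[Y_{1i} \mid D_i = 1, X_i = x] - E[Y_{0i} \mid D_i = 0, X_i = x] \quad \text{(Assumption 1)} \\
  &= E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]
\end{aligned}$$
Assumption 2 guarantees that both conditional expectations on the last line are well defined for every x.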
ATT: E[Y_{1i} − Y_{0i} | D_i = 1]
The ATT is given by
$$ATT = E\big[\, E[Y_i \mid D_i = 1, X_i] - E[Y_i \mid D_i = 0, X_i] \,\big|\, D_i = 1 \big]$$
ATE: E[Y_{1i} − Y_{0i}]
The ATE is
$$ATE = E[Y_{1i} - Y_{0i}] = E\big[\, E[Y_i \mid D_i = 1, X_i] - E[Y_i \mid D_i = 0, X_i] \,\big]$$
From Identification to Estimation
We need to estimate the two conditional expectations E[Y_i | D_i = 1, X_i = x] and E[Y_i | D_i = 0, X_i = x].
Here, I only explain parametric regression as a way to implement the matching method.
From Matching to Linear Regression Model
Assume that
$$E[Y_i \mid D_i = 0, X_i = x] = \beta' x$$
$$E[Y_i \mid D_i = 1, X_i = x] = \beta' x + \tau$$
Combining the two cases gives E[Y_i | D_i, X_i] = β'X_i + τD_i, so τ can be estimated as the coefficient on D_i in an OLS regression of Y_i on X_i and D_i, as sketched below.
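A minimal sketch of this regression implementation (assuming Python with numpy and statsmodels; the data-generating process below is hypothetical, simulated for illustration):

```python
# Sketch: estimating tau by OLS of Y on (1, D, X), with treatment
# depending on observables so that the naive comparison is biased.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))                 # observed covariates
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # selection on observables
tau = 2.0                                   # true treatment effect
Y = X @ np.array([1.0, -0.5]) + tau * D + rng.normal(size=n)

# Naive difference in means is biased because D depends on X...
print(Y[D == 1].mean() - Y[D == 0].mean())
# ...but regressing Y on a constant, D, and X recovers tau.
W = sm.add_constant(np.column_stack([D, X]))
print(sm.OLS(Y, W).fit().params[1])         # close to 2.0
```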
Linear Regression: Framework
Regression (回帰) framework
The linear regression model (線形回帰モデル) is defined as
Yi = β0 + β1 X1i + ⋯ + βK XKi + ϵi
Ordinary Least Squares (最小二乗法、OLS)
The OLS estimators are the minimizers of the sum of squared residuals:
$$\min_{\beta_0, \ldots, \beta_K} \frac{1}{N} \sum_{i=1}^{N} \big( Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK}) \big)^2$$
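The first-order conditions of this minimization are the normal equations (X'X)β = X'Y. A minimal numpy sketch on simulated (hypothetical) data:

```python
# Sketch: OLS via the normal equations (X'X) beta = X'Y, numpy only.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])  # constant + 3 regressors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # minimizes the sum of squared residuals
print(beta_hat)                               # close to beta_true
```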
Residual Regression (残差回帰)
Consider the model
$$Y_i = \alpha D_i + \beta_0 + \beta_1 X_{1i} + \cdots + \beta_K X_{Ki} + u_i,$$
where D_i is the explanatory variable of interest.
Frisch–Waugh–Lovell Theorem
1. Run an OLS regression of D_i on all other explanatory variables 1, X_{1i}, ⋯, X_{Ki}. Obtain the residual û_i^D.
2. Run an OLS regression of Y_i on all other explanatory variables 1, X_{1i}, ⋯, X_{Ki}. Obtain the residual û_i^Y.
3. The OLS regression of û_i^Y on û_i^D gives the same coefficient on D_i as the full regression:
$$\hat{\alpha} = \frac{\sum_i \hat{u}_i^Y \hat{u}_i^D}{\sum_i (\hat{u}_i^D)^2}$$
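A numerical check of the theorem (a sketch on simulated data; numpy assumed):

```python
# Sketch: verifying the FWL theorem numerically with simulated data.
import numpy as np

rng = np.random.default_rng(2)
N = 500
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # 1, X1, X2
D = 0.5 * Z[:, 1] + rng.normal(size=N)
Y = 1.0 + 2.0 * D + Z[:, 1] - Z[:, 2] + rng.normal(size=N)

def ols(y, W):
    return np.linalg.solve(W.T @ W, W.T @ y)

alpha_full = ols(Y, np.column_stack([Z, D]))[-1]  # coefficient on D, full model

u_D = D - Z @ ols(D, Z)   # residual from regressing D on 1, X1, X2
u_Y = Y - Z @ ols(Y, Z)   # residual from regressing Y on 1, X1, X2
alpha_fwl = (u_Y @ u_D) / (u_D @ u_D)

print(np.isclose(alpha_full, alpha_fwl))  # True
```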
How to use the FWL theorem
1. Computational advantage if you are interested in a particular coefficient. We will use this idea in the estimation of panel data models.
2. Useful for seeing how the coefficient of interest is estimated. We will see this later in relation to multicollinearity (多重共線性).
Assumptions for OLS
1. Random sample (ランダムサンプル): {Y_i, X_{i1}, …, X_{iK}} is an i.i.d. (independently and identically distributed) sample.
2. Exogeneity: E[ϵ_i | X_{i1}, …, X_{iK}] = 0.
3. Large outliers are unlikely: the random variables Y_i and X_{ik} have finite fourth moments.
Theoretical Properties of the OLS Estimator
1. Unbiasedness: Conditional on the explanatory variables X, the expectation of the OLS estimator β̂ is equal to the true value β:
$$E[\hat{\beta} \mid X] = \beta$$
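A sketch of why this holds, in matrix form and using the exogeneity assumption E[ϵ | X] = 0:
$$\hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + \epsilon) = \beta + (X'X)^{-1}X'\epsilon$$
$$\Rightarrow\quad E[\hat{\beta} \mid X] = \beta + (X'X)^{-1}X'\,E[\epsilon \mid X] = \beta$$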
Linear Regression: Practical Topics
Interpretation of Regression Coefficients
Remember that
Yi = β0 + β1 X1i + ⋯ + βK XKi + ϵi
The coefficient β_k: the effect of X_k on Y ceteris paribus (other things being equal).
Equivalently, if X_k is a continuous random variable,
$$\frac{\partial Y}{\partial X_k} = \beta_k$$
Common Specifications in Linear Regression Model
Several specifications are frequently used in empirical analysis:
1. Nonlinear terms
2. Log specification
3. Dummy (categorical) variables
4. Interaction terms (交差項)
Nonlinear term (非線形項)
A nonlinear relationship between Y and X can be captured in a linearly additive form:
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \epsilon_i$$
As long as the error term ϵ_i appears in an additively linear way, we can estimate the coefficients by OLS.
Multicollinearity could be an issue if we include many polynomial terms (多項式).
You can also use other nonlinear transformations such as log(x) and √x.
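A minimal sketch of fitting such a cubic specification by OLS (numpy, simulated data):

```python
# Sketch: a cubic specification is nonlinear in X but linear in the
# coefficients, so plain OLS applies to the expanded design matrix.
import numpy as np

rng = np.random.default_rng(3)
N = 300
x = rng.uniform(-2, 2, size=N)
y = 1.0 + 0.5 * x - 1.0 * x**2 + 0.3 * x**3 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x, x**2, x**3])  # polynomial terms
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to (1.0, 0.5, -1.0, 0.3)
```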
log specification
Using logs changes the interpretation of the coefficient β: effects are measured in relative (percentage) terms rather than in levels.
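The standard cases (approximations valid for small changes):
level-level, Y = β0 + β1 X + ϵ: a one-unit change in X changes Y by β1 units.
log-level, log(Y) = β0 + β1 X + ϵ: a one-unit change in X changes Y by about 100·β1 percent.
level-log, Y = β0 + β1 log(X) + ϵ: a 1% increase in X changes Y by about β1/100 units.
log-log, log(Y) = β0 + β1 log(X) + ϵ: a 1% increase in X changes Y by about β1 percent (an elasticity).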
Dummy variable (ダミー変数)
A dummy variable takes only the values 1 or 0. It is used to express qualitative information.
Example: a dummy variable for race,
$$white_i = \begin{cases} 1 & \text{if white} \\ 0 & \text{otherwise} \end{cases}$$
The coefficient on a dummy variable captures the difference in the outcome Y between categories:
$$Y_i = \beta_0 + \beta_1 white_i + \epsilon_i$$
The coefficient β_1 captures the difference in Y between white and non-white people.
Interaction term (交差項)
You can add the interaction of two explanatory variables to the regression model. For example,
$$wage_i = \beta_0 + \beta_1 educ_i + \beta_2 white_i + \beta_3 educ_i \times white_i + \epsilon_i,$$
where wage_i is the earnings of person i and educ_i is the years of schooling of person i.
The effect of educ_i is
$$\frac{\partial wage_i}{\partial educ_i} = \beta_1 + \beta_3 white_i,$$
so the return to schooling is allowed to differ between white and non-white people.
Measures of Fit
We often use R2 (決定係数) as a measure of the model fit.
Denote the fitted value as ŷ_i:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_K X_{iK}$$
R² is defined as
$$R^2 = \frac{SSE}{TSS},$$
where
$$SSE = \sum_i (\hat{y}_i - \bar{y})^2, \qquad TSS = \sum_i (y_i - \bar{y})^2$$
In a regression model with multiple explanatory variables, we often use the adjusted R², which adjusts for the number of explanatory variables:
$$\bar{R}^2 = 1 - \frac{N - 1}{N - (K + 1)} \cdot \frac{SSR}{TSS},$$
where
$$SSR = \sum_i (\hat{y}_i - y_i)^2 \ \Big( = \sum_i \hat{u}_i^2 \Big)$$
Linear Regression: Inference
Statistical Inference of the OLS Estimator
The OLS estimator is a random variable, as it depends on the drawn sample.
Plan: asymptotic normality of the OLS estimator, estimation of its asymptotic variance, hypothesis testing, and confidence intervals.
Asymptotic Normality (漸近正規性) of OLS Estimator
Under the OLS assumptions, the OLS estimator is asymptotically normal:
$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} N(0, V),$$
where x_i = (1, X_{i1}, ⋯, X_{iK})' is a (K + 1) × 1 vector.
We can approximate the distribution of β̂ by
$$\hat{\beta} \sim N(\beta, V/N)$$
In particular, for each coefficient,
$$\hat{\beta}_k \sim N(\beta_k, V_{kk}/N)$$
Estimation of Asymptotic Variance (漸近分散)
V is an unknown object and needs to be estimated.
Consider the estimator V̂ for V using sample analogues:
$$\hat{V} = \left( \frac{1}{N} \sum_{i=1}^{N} x_i x_i' \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} x_i x_i' \hat{\epsilon}_i^2 \right) \left( \frac{1}{N} \sum_{i=1}^{N} x_i x_i' \right)^{-1},$$
where $\hat{\epsilon}_i = y_i - (\hat{\beta}_0 + \cdots + \hat{\beta}_K X_{iK})$ is the residual.
Technically speaking, V̂ converges to V in probability.
We often use the (asymptotic) standard error $SE(\hat{\beta}_k) = \sqrt{\hat{V}_{kk}/N}$.
The standard error is an estimator for the standard deviation of the OLS estimator β̂_k.
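A direct numpy implementation of this sandwich estimator (a sketch; the design matrix X is assumed to include the constant column):

```python
# Sketch: sample-analogue (sandwich) estimator of V and the implied
# standard errors SE(beta_k) = sqrt(V_kk / N), using numpy only.
import numpy as np

def ols_with_robust_se(X, Y):
    N = X.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    eps_hat = Y - X @ beta_hat            # residuals
    A = (X.T @ X) / N                     # (1/N) sum x_i x_i'
    B = (X.T * eps_hat**2) @ X / N        # (1/N) sum x_i x_i' eps_i^2
    A_inv = np.linalg.inv(A)
    V_hat = A_inv @ B @ A_inv             # sandwich formula
    se = np.sqrt(np.diag(V_hat) / N)      # SE(beta_k) = sqrt(V_kk / N)
    return beta_hat, se

# Usage: beta_hat, se = ols_with_robust_se(X, Y)
```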
Hypothesis testing
You might want to test a particular hypothesis regarding these coefficients.
Does x really affect y?
Is the production technology constant returns to scale?
3 Steps in Hypothesis Testing
Step 1: Specify the null hypothesis H_0 and the alternative hypothesis H_1:
$$H_0: \beta_1 = k, \qquad H_1: \beta_1 \neq k$$
Step 2: Compute the t-statistic
$$t_n = \frac{\hat{\beta}_1 - k}{SE(\hat{\beta}_1)}$$
Step 3: Reject H_0 at significance level α if |t_n| > C_{α/2}, where C_{α/2} is the upper α/2 critical value (the 100(1 − α/2) percentile) of the standard normal distribution. We say we fail to reject H_0 if the above does not hold.
Caveats on Hypothesis Testing
We often say β̂ is statistically significant (統計的有意) at the 5% level if |t_n| > 1.96 when we set k = 0.
You should also discuss the economic significance (経済的有意) of the coefficient in your analysis.
F test
We often test a composite hypothesis that involves multiple parameters, such as
$$H_0: \beta_1 + \beta_2 = 0, \qquad H_1: \beta_1 + \beta_2 \neq 0$$
Confidence interval (信頼区間)
The 95% confidence interval is
$$CI_n = \left\{ k : \left| \frac{\hat{\beta}_1 - k}{SE(\hat{\beta}_1)} \right| \le 1.96 \right\} = \Big[ \hat{\beta}_1 - 1.96 \times SE(\hat{\beta}_1),\ \hat{\beta}_1 + 1.96 \times SE(\hat{\beta}_1) \Big]$$
Interpretation: if you draw many samples (datasets) and construct the 95% CI for each sample, 95% of those CIs will include the true parameter.
Homoskedasticity vs Heteroskedasticity
The error term ϵ_i is heteroskedastic (不均一分散) if Var(ϵ_i | X_i) depends on X_i. In that case, the asymptotic variance takes the sandwich form
$$V = E[x_i x_i']^{-1} E[x_i x_i' \epsilon_i^2] E[x_i x_i']^{-1}$$
If the error is homoskedastic, i.e., Var(ϵ_i | X_i) is constant, the variance simplifies to
$$V = E[x_i x_i']^{-1} \sigma^2,$$
where σ² = Var(ϵ_i).
Standard Errors in Practice
Standard errors computed under the heteroskedasticity assumption are called heteroskedasticity-robust standard errors (不均一分散に頑健な標準誤差).
In many statistical packages (including R and Stata), the standard errors for the OLS estimators are calculated under the homoskedasticity assumption by default.
However, if the error is heteroskedastic, the standard errors computed under the homoskedasticity assumption are generally incorrect, and often underestimated.
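A small illustration (Python with statsmodels assumed; the heteroskedastic errors are simulated):

```python
# Sketch: comparing classical and heteroskedasticity-robust standard
# errors in statsmodels, on data with error variance depending on x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
eps = rng.normal(size=n) * (1 + np.abs(x))    # heteroskedastic errors
y = 1.0 + 2.0 * x + eps

W = sm.add_constant(x)
print(sm.OLS(y, W).fit().bse)                 # classical (homoskedastic) SEs
print(sm.OLS(y, W).fit(cov_type="HC1").bse)   # robust SEs, typically larger here
```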
Appendix: Matching Estimator
Estimation Methods
We need to estimate E[Y_i | D_i = 1, X_i = x] and E[Y_i | D_i = 0, X_i = x].
Approach 1: Regression, or Analogue Approach
Let μ̂_k(x) be an estimator of μ_k(x) = E[Y_i | D_i = k, X_i = x] for k ∈ {0, 1}. Then
$$\widehat{ATE} = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \big)$$
$$\widehat{ATT} = \frac{N^{-1} \sum_{i=1}^{N} D_i \big( Y_i - \hat{\mu}_0(X_i) \big)}{N^{-1} \sum_{i=1}^{N} D_i}$$
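A minimal sketch of these plug-in estimators with linear μ̂_k (numpy, simulated data):

```python
# Sketch: analogue estimators of ATE and ATT, fitting mu_k(x) by OLS
# separately on the treated and control subsamples.
import numpy as np

rng = np.random.default_rng(5)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 1])))
Y = X @ np.array([1.0, 1.0]) + 2.0 * D + rng.normal(size=N)

def ols(y, W):
    return np.linalg.solve(W.T @ W, W.T @ y)

mu1 = X @ ols(Y[D == 1], X[D == 1])   # fitted E[Y | D = 1, X] for all i
mu0 = X @ ols(Y[D == 0], X[D == 0])   # fitted E[Y | D = 0, X] for all i

ate_hat = np.mean(mu1 - mu0)
att_hat = np.mean(D * (Y - mu0)) / np.mean(D)
print(ate_hat, att_hat)               # both close to the true effect 2.0
```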
Nonparametric Estimation
Suppose that X_i ∈ {x_1, ⋯, x_K} is discrete with small K.
Example: two binary demographic characteristics (male/female, white/non-white), so K = 4.
Then the nonparametric binning estimator is
$$\hat{\mu}_k(x) = \frac{\sum_{i=1}^{N} 1\{D_i = k, X_i = x\} Y_i}{\sum_{i=1}^{N} 1\{D_i = k, X_i = x\}}$$
Here, I do not put any parametric assumption on μ_k(x) = E[Y_i | D_i = k, X_i = x].
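A sketch of the binning estimator for the two-characteristic example (numpy, simulated data):

```python
# Sketch: nonparametric binning estimator with two binary covariates,
# so K = 4 cells per treatment arm.
import numpy as np

rng = np.random.default_rng(6)
N = 5000
male = rng.binomial(1, 0.5, size=N)
white = rng.binomial(1, 0.5, size=N)
D = rng.binomial(1, 0.3 + 0.4 * male)        # treatment depends on X
Y = 1.0 + male + white + 2.0 * D + rng.normal(size=N)

def mu_hat(k, m, w):
    # Sample mean of Y in the cell {D = k, male = m, white = w}
    cell = (D == k) & (male == m) & (white == w)
    return Y[cell].mean()

for m in (0, 1):
    for w in (0, 1):
        print(m, w, mu_hat(1, m, w) - mu_hat(0, m, w))  # each close to 2.0
```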
Curse of dimensionality
Issue: poor performance if K is large due to many covariates.
With so many potential groups, there are too few observations in each group.
With K variables, each of which takes L values, there are L^K possible groups (bins) in total.
This is known as the curse of dimensionality.
Relatedly, if X is a continuous random variable, we can use kernel regression.
Parametric Estimation, or Going Back to Linear Regression
If you put a parametric assumption on the conditional means, such as
$$E[Y_i \mid D_i = 0, X_i = x] = \beta' x$$
$$E[Y_i \mid D_i = 1, X_i = x] = \beta' x + \tau_0,$$
you can think of the matching estimator as controlling for omitted variable bias by adding (many) covariates (control variables) x_i.
Approach 2: M-Nearest Neighbor Matching
Idea: find counterparts in the other group that are close to individual i.
Let ŷ_i(0) and ŷ_i(1) be the estimators for the (hypothetical) outcomes without and with treatment:
$$\hat{y}_i(0) = \begin{cases} y_i & \text{if } D_i = 0 \\ \frac{1}{M} \sum_{j \in L_M(i)} y_j & \text{if } D_i = 1 \end{cases}$$
L_M(i) is the set of M individuals in the opposite group who are "close" to individual i.
There are several ways to define the distance between X_i and X_j, such as
$$dist(X_i, X_j) = \| X_i - X_j \|^2$$
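A sketch with M = 1 and a scalar covariate (numpy, simulated data):

```python
# Sketch: 1-nearest-neighbor matching on a scalar X to impute the
# missing untreated outcome for each treated unit.
import numpy as np

rng = np.random.default_rng(7)
N = 400
X = rng.normal(size=N)
D = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = X + 2.0 * D + rng.normal(size=N)

treated = np.where(D == 1)[0]
control = np.where(D == 0)[0]

def y_hat_0(i):
    # Imputed untreated outcome: observed Y if untreated; otherwise the
    # outcome of the closest control unit in |X_i - X_j| distance.
    if D[i] == 0:
        return Y[i]
    j = control[np.argmin(np.abs(X[control] - X[i]))]
    return Y[j]

att_hat = np.mean([Y[i] - y_hat_0(i) for i in treated])
print(att_hat)  # close to 2.0 (up to matching bias and noise)
```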
Approach 3: Propensity Score Matching
Use the propensity score P(D_i = 1 | X_i = x) as a distance to define who is closest.
Step 1: Estimate the propensity score function by logit or probit using a flexible function of X_i.
Step 2: Calculate the propensity score for each observation, and use it to define the matched pairs.
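A sketch of Steps 1 and 2 (statsmodels assumed; the flexible specification here, with levels, squares, and an interaction, is illustrative):

```python
# Sketch: estimating the propensity score P(D = 1 | X) by logit and
# computing the score for each observation, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
N = 1000
X = rng.normal(size=(N, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))

W = sm.add_constant(np.column_stack([X, X**2, X[:, 0] * X[:, 1]]))
pscore = sm.Logit(D, W).fit(disp=0).predict(W)   # one score per observation

# The estimated score is a scalar distance: match each treated unit to
# the control unit(s) with the closest pscore.
print(pscore[:5])
```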