
Chapter 3: Asymptotic Statistics

Jonathan Roth

Mathematical Econometrics I
Brown University
Fall 2023
Outline

1. Overview

2. LLN, CLT, and CMT

3. Putting Asymptotics into Practice

1
Motivation

We’ve seen how we can test hypotheses about population means using
information from the sample mean µ̂ when it is normally distributed
with a known variance

This situation arises when we know that Yi ∼ N(µ, σ²) with known σ

But this situation is rare... how do we “do inference” more generally?

Fortunately, the assumption of normally distributed sample means
turns out to be a good approximation when samples are large

What we mean by a “good approximation” is formalized by asymptotic
statistics, which considers the distribution of µ̂ in the limit as N → ∞

2
Overview of Important Results

The Law of Large Numbers (LLN) says that when N is large, µ̂ is
close to µ with very high probability

The Central Limit Theorem (CLT) says that when N is large, the
distribution of µ̂ is approximately normally distributed with mean µ
and variance σ²/N

The Continuous Mapping Theorem says that when N is large,
continuous functions of µ̂, say g(µ̂), are also close to g(µ)

3
Outline

1. Overview X

2. LLN, CLT, and CMT

3. Putting Asymptotics into Practice

4
Convergence in Probability

Intuitively, a random variable XN converges in probability to x if the
probability that XN is “close to” x is almost 1 when N is large

Formally, we say XN converges in probability to x, written XN →p x or
plim XN = x, if for all ε > 0,

P(|XN − x| > ε) → 0

If XN →p x for a constant x, we say XN is consistent for x

Typically x is a constant, although we will sometimes also say
XN →p X for X a random variable (using the same definition as above)

5
Convergence in Probability (Cont.)

Useful fact: if E[(XN − x)²] → 0, then XN →p x

Proof (you won’t be responsible for this):

By the law of iterated expectations,

E[(XN − x)²] = P(|XN − x| > ε) E[(XN − x)² | |XN − x| > ε]
             + P(|XN − x| ≤ ε) E[(XN − x)² | |XN − x| ≤ ε]
             ≥ P(|XN − x| > ε) ε² + 0

This implies that

P(|XN − x| > ε) ≤ E[(XN − x)²]/ε²   (Chebyshev’s Inequality)

Hence, E[(XN − x)²] → 0 implies P(|XN − x| > ε) → 0

6
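As a check on the proof above, here is a minimal simulation sketch (assuming Python with numpy; taking XN to be the mean of N uniform draws is an illustrative choice, not from the slides) comparing the empirical tail probability with the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, reps = 100, 0.05, 100_000

# X_N is the mean of N Uniform(0,1) draws; its plim is x = 0.5
X_N = rng.uniform(0, 1, size=(reps, N)).mean(axis=1)

lhs = np.mean(np.abs(X_N - 0.5) > eps)          # P(|X_N - x| > eps), estimated
rhs = np.mean((X_N - 0.5) ** 2) / eps ** 2      # E[(X_N - x)^2] / eps^2, estimated

print(f"P(|X_N - x| > eps) = {lhs:.4f} <= Chebyshev bound {rhs:.4f}")
```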
Law of Large Numbers

Law of Large Numbers. Suppose that Y1 , ..., YN are drawn iid from
a distribution with Var(Yi ) = σ² < ∞. Then

µ̂N = (1/N) ∑_{i=1}^N Yi →p µ = E[Yi ]

In words: as the sample gets large, the sample mean will be close to
the population mean with high probability.

Proof: We saw last chapter that E[µ̂N ] = µ and Var(µ̂N ) = σ²/N.
Thus,
Var(µ̂N ) = E[(µ̂N − µ)²] = σ²/N → 0
Hence, µ̂N →p µ by our “useful fact”.

7
Laws of Large Numbers Illustration

Distribution and mean of (1/N) ∑ᵢ Zi when Zi ∼ U(0, 1), N = 1

8
Laws of Large Numbers Illustration

Distribution and mean of (1/N) ∑ᵢ Zi when Zi ∼ U(0, 1), N = 10

9
Laws of Large Numbers Illustration

Distribution and mean of (1/N) ∑ᵢ Zi when Zi ∼ U(0, 1), N = 100

10
Laws of Large Numbers Illustration

Distribution and mean of (1/N) ∑ᵢ Zi when Zi ∼ U(0, 1), N = 1000

11
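The figures above come from a simulation; a minimal sketch of the same exercise (assuming Python with numpy) is:

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 10_000  # number of simulated samples per N

for N in (1, 10, 100, 1000):
    # each row is one sample of size N; take its mean
    means = rng.uniform(0, 1, size=(reps, N)).mean(axis=1)
    # fraction of sample means within 0.05 of the population mean 0.5
    close = np.mean(np.abs(means - 0.5) < 0.05)
    print(f"N={N:5d}: P(|mu_hat - 0.5| < 0.05) ~ {close:.3f}")
```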
Convergence in Distribution

You might have noticed that the distribution of µ̂ in the simulations
looks close to a normal distribution as N gets large

The notion of convergence in distribution formalizes what it means
for one distribution to be close to another distribution

Definition: We say that XN converges in distribution to a continuously
distributed variable X, denoted XN →d X or XN ⇒ X, if the CDF of
XN converges (pointwise) to the CDF of X,

FXN (x) → FX (x) for all x

12
Central Limit Theorem

The Central Limit Theorem (CLT) formalizes the sense in which
sample means are approximately normally distributed in large samples

Theorem: Suppose that Y1 , ..., YN are drawn iid from a distribution
with mean µ = E[Yi ] and variance Var(Yi ) = σ² < ∞. Then the
sample mean µ̂ = (1/N) ∑_{i=1}^N Yi satisfies

√N(µ̂ − µ) →d N(0, σ²)

In words, the theorem says the following:

1. We can start with any distribution Yi , possibly non-normal
2. If we take the average of Y1 , ..., YN in a sufficiently large sample,
   the distribution of µ̂ = (1/N) ∑ᵢ Yi is (approximately) normal!

13
CLT Illustration

Distributions of µ̂ = (1/N) ∑ᵢ Xi vs. N(E[µ̂], Var(µ̂)): Xi ∼ U(0, 1), N = 1

14
CLT Illustration

Distributions of µ̂ = (1/N) ∑ᵢ Xi vs. N(E[µ̂], Var(µ̂)): Xi ∼ U(0, 1), N = 2

15
CLT Illustration

Distributions of µ̂ = (1/N) ∑ᵢ Xi vs. N(E[µ̂], Var(µ̂)): Xi ∼ U(0, 1), N = 5

16
CLT Illustration

Distributions of µ̂ = (1/N) ∑ᵢ Xi vs. N(E[µ̂], Var(µ̂)): Xi ∼ U(0, 1), N = 10

17
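A minimal sketch of the comparison in these figures (assuming Python with numpy and scipy; the evaluation points are illustrative): the empirical CDF of µ̂ gets close to the CDF of N(E[µ̂], Var(µ̂)) as N grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
reps = 100_000

for N in (1, 2, 5, 10):
    mu_hat = rng.uniform(0, 1, size=(reps, N)).mean(axis=1)
    # normal approximation N(E[mu_hat], Var(mu_hat)) = N(0.5, (1/12)/N)
    sd = np.sqrt((1 / 12) / N)
    # compare P(mu_hat <= q) with the normal CDF at a few points
    for q in (0.4, 0.5, 0.6):
        emp = np.mean(mu_hat <= q)
        print(f"N={N:2d}, q={q}: empirical {emp:.3f} vs normal {norm.cdf(q, 0.5, sd):.3f}")
```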
CLT Illustration II

https://www.youtube.com/watch?v=EvHiee7gs9Y
18
Multivariate Versions

The results we’ve discussed extend naturally to the multivariate case

For a vector XN ∈ Rᵏ, we say XN →p x if each component of XN
converges in probability to each component of x.

LLN: For µ̂N , the sample mean of iid vectors Y1 , ..., YN with mean µ
and finite variance, µ̂N →p µ

For a vector XN ∈ Rᵏ, we say XN →d X for X continuously distributed
if FXN (x) → FX (x) for all x ∈ Rᵏ.

CLT: For µ̂N , the sample mean of iid vectors Y1 , ..., YN with mean µ
and finite variance Σ, √N(µ̂N − µ) →d N(0, Σ)

20
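A minimal sketch of the multivariate CLT (assuming Python with numpy; the bivariate distribution built from uniforms is an illustrative choice): the scaled sample mean √N(µ̂N − µ) should have covariance close to Σ when N is large.

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 2_000, 5_000

# Non-normal iid 2-vectors: Y_i = (U_i, U_i + V_i) with U_i, V_i ~ U(0,1)
mu = np.array([0.5, 1.0])
Sigma = np.array([[1/12, 1/12],
                  [1/12, 2/12]])

scaled = np.empty((reps, 2))
for r in range(reps):
    U, V = rng.uniform(size=N), rng.uniform(size=N)
    Y = np.column_stack([U, U + V])
    scaled[r] = np.sqrt(N) * (Y.mean(axis=0) - mu)

print("Covariance of sqrt(N)(mu_hat - mu):\n", np.cov(scaled, rowvar=False))
print("Sigma:\n", Sigma)
```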
Continuous Mapping Theorem

Sometimes we are interested in functions of sample means (e.g., the
t-statistic is a function of µ̂ and σ̂).

The continuous mapping theorem (CMT) tells us about continuous
functions of random variables that converge in distribution/probability

Theorem: suppose g(·) is a continuous function

If XN →p X, then g(XN ) →p g(X)
If XN →d X, then g(XN ) →d g(X)
Multivariate versions hold here too: If XN →p X, then g(XN ) →p g(X) and
if XN →d X, then g(XN ) →d g(X)

21
Convergence of Sample Variance

One useful application of the CMT is to show convergence in
probability of the sample variance

Let σ̂² = (1/N) ∑_{i=1}^N (Yi − µ̂)² be the sample variance of Yi .

Claim: if Y1 , ..., YN are iid and Var(Yi²) is finite, then
σ̂² →p σ² = Var(Yi ).

Proof:
We can write the sample variance as σ̂² = (1/N) ∑_{i=1}^N Yi² − µ̂².
First term: by the LLN, (1/N) ∑_{i=1}^N Yi² →p E[Yi²].
Second term: by the LLN, µ̂ →p µ = E[Yi ]. Thus, by the CMT, µ̂² →p E[Yi ]².
Thus, by the CMT again, (1/N) ∑_{i=1}^N Yi² − µ̂² →p E[Yi²] − E[Yi ]² = σ².

22
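A minimal numerical check of the claim (assuming Python with numpy; U(0,1) data, so σ² = 1/12, is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

for N in (10, 100, 10_000, 1_000_000):
    Y = rng.uniform(0, 1, size=N)
    sigma2_hat = np.mean(Y**2) - np.mean(Y)**2   # (1/N) sum Y_i^2 - mu_hat^2
    print(f"N={N:>9d}: sigma2_hat = {sigma2_hat:.5f} (sigma2 = {1/12:.5f})")
```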
Slutsky’s Lemma

Slutsky’s lemma (sometimes Slutsky’s theorem) summarizes a few
special cases of the CMT that are very useful.

Suppose that XN →p c for a constant c, and YN →d Y . Then:

XN + YN →d c + Y .

XN YN →d cY .

If c ≠ 0, then YN /XN →d Y /c.

Analogous versions apply for vector-valued random variables.

23
Asymptotic Hypothesis Testing
Recall that when Yi ∼ N(µ, σ²), we showed that the t-statistic
t̂ = (µ̂ − µ0 )/(σ/√N) ∼ N(0, 1) under H0 : µ = µ0 .

Thus, when Yi ∼ N(µ, σ²), we had that Pr(|t̂| > 1.96) = 0.05 under
the null.

Now, suppose that Yi is not normally distributed and we don’t know
its variance.

By the CLT, √N(µ̂ − µ0 ) →d N(0, σ²) under the null.
By the CMT and LLN (as shown above), σ̂ →p σ .

Thus, by Slutsky’s lemma, t̂ = (µ̂ − µ0 )/(σ̂/√N) →d N(0, 1).

Hence, asymptotically Pr(|t̂| > 1.96) → 0.05, even though Yi is not
normal and σ̂ is estimated! We can hypothesis test just like before.
24
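A minimal Monte Carlo sketch of this result (assuming Python with numpy; exponential Yi with mean µ0 = 1 is an illustrative non-normal choice): the rejection rate of the |t̂| > 1.96 test approaches 5% as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, reps = 1.0, 20_000

for N in (10, 50, 500):
    Y = rng.exponential(scale=mu0, size=(reps, N))   # true mean equals mu0 (null holds)
    mu_hat = Y.mean(axis=1)
    sigma_hat = Y.std(axis=1)                        # sqrt of (1/N) sum (Y_i - mu_hat)^2
    t = (mu_hat - mu0) / (sigma_hat / np.sqrt(N))
    print(f"N={N:4d}: rejection rate of |t| > 1.96 = {np.mean(np.abs(t) > 1.96):.3f}")
```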
Asymptotic Confidence Intervals

Similarly, when Yi was normal w/ σ known, we showed the confidence
interval µ̂ ± 1.96σ/√N contained the true µ 95% of the time

Analogously, when Yi is non-normal with unknown variance,
µ̂ ± 1.96σ̂/√N contains the true µ with probability approaching 95%
as N grows large.

25
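A companion sketch (same illustrative assumptions as the previous block) checking coverage of µ̂ ± 1.96 σ̂/√N:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, reps = 1.0, 20_000

for N in (10, 50, 500):
    Y = rng.exponential(scale=mu, size=(reps, N))
    mu_hat, sigma_hat = Y.mean(axis=1), Y.std(axis=1)
    half = 1.96 * sigma_hat / np.sqrt(N)
    covered = np.mean((mu_hat - half <= mu) & (mu <= mu_hat + half))
    print(f"N={N:4d}: coverage = {covered:.3f}")   # should approach 0.95
```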
Outline

1. Overview X

2. LLN, CLT, and CMT X

3. Putting Asymptotics into Practice

26
Example – Oregon Health Insurance Experiment

27
Sample Means for Depression Outcome
        Control Group   Treated Group
Mean    0.329           0.306
SD      0.470           0.461
N       10426           13315

Say we want a CI for the population mean in the control group

We have
µ̂ ± 1.96 × σ̂/√N = 0.329 ± 1.96 × 0.470/√10426 = [0.319, 0.338]

What about for the treated group?

µ̂ ± 1.96 × σ̂/√N = 0.306 ± 1.96 × 0.461/√13315 = [0.298, 0.313]

28
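A minimal sketch (assuming Python) reproducing these intervals from the summary statistics in the table; small differences from the slide are due to rounding of the reported mean and SD.

```python
import math

def mean_ci(mean, sd, n, z=1.96):
    """95% CI for a population mean: mean +/- z * sd / sqrt(n)."""
    half = z * sd / math.sqrt(n)
    return mean - half, mean + half

print("Control:", mean_ci(0.329, 0.470, 10426))   # roughly (0.320, 0.338)
print("Treated:", mean_ci(0.306, 0.461, 13315))   # roughly (0.298, 0.314)
```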
CIs for Treatment Effects in Experiments

We showed previously that in an experiment, the average treatment
effect is given by

τ = E[Yi (1) − Yi (0)] = E[Yi |Di = 1] − E[Yi |Di = 0],

i.e. the difference in population means between the treated and
control groups.

How can we form confidence intervals (or test hypotheses) about the
treatment effect?

29
Mean and variance of the difference-in-means
Let Ȳ1 = (1/N1 ) ∑_{i:Di =1} Yi be the sample mean for the treated group.
Let Ȳ0 = (1/N0 ) ∑_{i:Di =0} Yi be the sample mean for the control group.

Since Ȳ1 , Ȳ0 are each sample means, we have that

E[Ȳ1 ] = µ1 , Var(Ȳ1 ) = σ1²/N1
E[Ȳ0 ] = µ0 , Var(Ȳ0 ) = σ0²/N0

where µd = E[Yi | Di = d] and σd² = Var(Yi | Di = d).

Let τ̂ = Ȳ1 − Ȳ0 . It follows that E[τ̂] = µ1 − µ0 = τ and

Var(τ̂) = σ1²/N1 + σ0²/N0 − 2Cov(Ȳ1 , Ȳ0 )
        = σ1²/N1 + σ0²/N0

where the fact that the samples are independent implies that
Cov(Ȳ1 , Ȳ0 ) = 0.
30
We just showed that in an experiment

E[τ̂] = τ and Var(τ̂) = σ1²/N1 + σ0²/N0

where τ̂ is the difference in sample means btwn the treated/control
groups

If we knew that τ̂ was normally distributed (and we knew σ1 , σ0 ), then
we could construct CIs of the form

τ̂ ± 1.96 √(σ1²/N1 + σ0²/N0 )

As with sample means, we do not know that τ̂ is normally distributed,
but we can show that for N large, it is approximately normally
distributed, which allows us to use CIs of the form

τ̂ ± 1.96 √(σ̂1²/N1 + σ̂0²/N0 ),

for σ̂d² the estimated conditional variance.


31
Showing Asymptotic Normality

By the CLT, we have that √N1 (Ȳ1 − µ1 ) →d N(0, σ1²).

Note that N1 /N = (1/N) ∑ᵢ Di →p E[Di ] by the LLN.

Hence, applying the continuous mapping theorem,

√N(Ȳ1 − E[Yi (1)]) = (1/√(N1 /N)) · √N1 (Ȳ1 − E[Yi (1)])
                   →d (1/√E[Di ]) · N(0, Var(Yi (1)))
                   = N(0, (1/E[Di ]) · Var(Yi (1)))

Applying similar steps for Ȳ0 , we obtain that

√N (Ȳ1 − E[Yi (1)], Ȳ0 − E[Yi (0)])′ →d N(0, V),

where V is the diagonal matrix with entries (1/E[Di ]) · Var(Yi (1)) and
(1/(1 − E[Di ])) · Var(Yi (0)).

32
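A minimal simulation sketch of the scaling step above (assuming Python with numpy; the treatment probability and outcome distribution are illustrative): the variance of √N(Ȳ1 − E[Yi (1)]) should be close to Var(Yi (1))/E[Di ].

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps, p = 5_000, 4_000, 0.6          # p = E[D_i], the treatment probability

draws = np.empty(reps)
for r in range(reps):
    D = rng.binomial(1, p, size=N)
    Y1 = rng.exponential(scale=2.0, size=N)   # potential outcome Y_i(1): mean 2, variance 4
    Ybar1 = Y1[D == 1].mean()                 # treated-group sample mean
    draws[r] = np.sqrt(N) * (Ybar1 - 2.0)

print("Var of sqrt(N)(Ybar1 - E[Yi(1)]):", draws.var())
print("Var(Yi(1)) / E[Di]:", 4.0 / p)
```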
Hypothesis Testing for Experiments (continued)
We just showed that

√N (Ȳ1 − E[Yi (1)], Ȳ0 − E[Yi (0)])′ →d N(0, V),

where V is the diagonal matrix with entries (1/E[Di ]) · Var(Yi (1)) and
(1/(1 − E[Di ])) · Var(Yi (0)).

Applying the CMT,

√N(Ȳ1 − Ȳ0 − E[Yi (1) − Yi (0)]) →d N(0, σ²),

where σ² = (1/E[Di ]) · Var(Yi (1)) + (1/(1 − E[Di ])) · Var(Yi (0))

We can thus form a 95% confidence interval for τ = E[Yi (1) − Yi (0)],

Ȳ1 − Ȳ0 ± 1.96 σ̂/√N,

where σ̂² = (N/N1 ) σ̂1² + (N/N0 ) σ̂0² , with σ̂d² the sample variance for
treatment group d ∈ {0, 1}

33
Sample Means for Depression Outcome (Again)
        Control Group   Treated Group
Mean    0.329           0.306
SD      0.470           0.461
N       10426           13315

Our point estimate of the treatment effect is
τ̂ = 0.306 − 0.329 = −0.023.

Our CI for the treatment effect is:

τ̂ ± 1.96 × √(σ̂1²/N1 + σ̂0²/N0 )
  = −0.023 ± 1.96 × √(0.461²/13315 + 0.470²/10426)
  = [−0.035, −0.011]

34
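A minimal sketch (assuming Python) reproducing this point estimate and interval from the table:

```python
import math

def diff_in_means_ci(m1, s1, n1, m0, s0, n0, z=1.96):
    """Point estimate and 95% CI for tau = mu1 - mu0 from group summary stats."""
    tau_hat = m1 - m0
    se = math.sqrt(s1**2 / n1 + s0**2 / n0)
    return tau_hat, (tau_hat - z * se, tau_hat + z * se)

tau_hat, ci = diff_in_means_ci(0.306, 0.461, 13315, 0.329, 0.470, 10426)
print(tau_hat, ci)   # roughly -0.023, (-0.035, -0.011)
```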
Hypothesis Testing under Unconfoundedness
Recall that under unconfoundedness, Di ⊥⊥ (Yi (1), Yi (0)) | Xi , we have

CATE(x) = E[Yi (1) − Yi (0)|Xi = x] = E[Yi |Di = 1, Xi = x] − E[Yi |Di = 0, Xi = x]

That is, within each value of Xi , it’s as if we have an experiment.

By the same logic as for experiments, we have that

√Nx (Ȳ1,x − Ȳ0,x − E[Yi (1) − Yi (0)|Xi = x]) →d N(0, σx²),

where Nx = |{i : Xi = x}| and
σx² = (1/E[Di |Xi = x]) · Var(Yi (1)|Xi = x) + (1/(1 − E[Di |Xi = x])) · Var(Yi (0)|Xi = x).

So we can also do hypothesis testing on CATE(x) when Nx is large.

By averaging CATE(x), we can do hypothesis testing / form CIs for
the ATE.
35
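A minimal sketch of within-cell estimation of CATE(x) with asymptotic 95% CIs (assuming Python with numpy; the simulated dataset and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data: discrete covariate X, treatment D (probability depends on X),
# outcome Y with a treatment effect that varies with X (true CATE(x) = 0.4 + 0.3x)
X = rng.integers(0, 3, size=n)                 # X takes values 0, 1, 2
D = rng.binomial(1, 0.3 + 0.2 * X / 2)         # unconfounded given X
Y = 1.0 + 0.5 * X + (0.4 + 0.3 * X) * D + rng.normal(size=n)

for x in np.unique(X):
    cell = X == x
    y1, y0 = Y[cell & (D == 1)], Y[cell & (D == 0)]
    cate_hat = y1.mean() - y0.mean()
    se = np.sqrt(y1.var() / len(y1) + y0.var() / len(y0))
    print(f"x={x}: CATE_hat = {cate_hat:.3f}, "
          f"95% CI = [{cate_hat - 1.96*se:.3f}, {cate_hat + 1.96*se:.3f}]")
```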
The Challenge of Continuous x
We’ve shown thus far how we can estimate CATE(x) when the
number of observations with Xi = x is large.

This works great when Xi is binary (e.g. an indicator for college) or
takes on a small number of discrete values (e.g. 50 states).

But what about when Xi is continuous?

For example, if Xi is income, then to estimate CATE(50,351), the
theory we have says we need a large number of treated and control
units both with income $50,351. In most datasets, we won’t have very
many people with exactly this income.

We thus need a different way of estimating conditional means when Xi
is continuously distributed.

The next part of the course will focus on achieving this task using
linear regression as an approximation to the CEF.
36
