
Confidence

ECON 226 - Jonathan L. Graves


In [1]:
library(tidyverse)
library(haven)

census_data <- read_dta("data/01_census2016.dta")
census_data <- as_factor(census_data)
census_data <- census_data %>%
    filter(!is.na(mrkinc)) %>%
    filter(!is.na(wages))

Warning message:
"package 'tidyverse' was built under R version 3.6.3"
(similar "built under R version 3.6.3" warnings repeated for ggplot2, tibble, tidyr, readr, purrr, dplyr, forcats, and haven)

-- Attaching packages --------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.5
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Are we confident in our estimates?

The key idea of inference is that we want to estimate population parameters (e.g. μ_X) with their sample counterparts (e.g. x̄)
The Law of Large Numbers tells us that when n is large, x̄ ≈ μ_X - they are similar in value
However, how similar are they? In other words, how precise is our estimate x̄ of μ_X?

The way to evaluate this is by using the fact that x̄ has a known sampling distribution which is related to the value of μ_X → this means we can use it to calculate probabilities and put a number on our certainty.


Confidence Intervals

There are many ways to do this statistically, but one of the most intuitive and popular is to construct a confidence interval for a random variable.
This is a range of values (an interval) into which we are fairly certain (to some pre-defined level) that, given some true value, the random variable will fall
This can be a little bit tricky to define very precisely
In particular, we're going to describe how to construct these objects → an algorithm, rather than just defining what they are.
This also has the advantage of combining ideas about variation and central tendency into a complete, cohesive package.
Plan of Attack

We're going to do this in steps, building up to more complicated (but more useful) examples
Step 1: Learn how to construct confidence intervals for generic random variables
Step 2: Use this knowledge to construct confidence intervals for the sample mean under certain conditions
Step 3: Relax those conditions to make this more realistic
Step 4: Explain how we can interpret or use these in general

We're going to see that Step 4 will require some careful interpretation.
Step 1: Confidence Intervals for Random Variables

We will start off by learning how to construct confidence intervals (CIs) for an arbitrary random variable
The key here is that we will assume that we know what the CDF/PDF of this variable is
This is unrealistic (in general) because we do not know CDFs or PDFs for most variables
We will also (for simplicity) restrict attention to quantitative variables with a continuous set of possible (real) values.
Foreshadowing: but we do know these for the sampling distribution of x̄
Defining a Confidence Interval

Suppose Y is a random variable with a known PDF (CDF), and 0 ≤ β ≤ 1 (a percentage). Then, a β-% confidence interval for Y is an interval

I = [a, b]

such that:

The probability that a random draw of Y lies inside I is β
In other words, there is a β-% chance that a ≤ Y ≤ b.

There are a number of "standard" confidence intervals, like 99%, 95%, and 90%
However, any possible value can be used!
Describing Confidence Intervals

So, what condition must be true in order for I to be a β-% confidence interval? This is actually fairly easy, since we know the distribution of Y!

We need the interval between a and b to have a probability of β.
This implies that:

F_Y(b) − F_Y(a) = β

Why? Well:
F_Y(b) = P(Y ≤ b) and F_Y(a) = P(Y ≤ a).
Therefore F_Y(b) − F_Y(a) = P(a ≤ Y ≤ b).
In [4]:
# Example: an arbitrary distribution (Gamma(2,2))
# In this example, we will use a distribution called "Gamma"
# This takes two parameters, called "shape" and "scale"
# You don't need to know about this distribution; it's just an example
pdf_g <- function(x){dgamma(x, shape = 2, scale = 2)}
cdf_g <- function(x){pgamma(x, shape = 2, scale = 2)}
f <- ggplot(data = data.frame(range = c(0,20)), aes(x = range)) + stat_function(fun = pdf_g)
f <- f + labs(x = "X values", y = "Density")

In [5]:
# Our goal is to find two points (a, b) such that the shaded region is 95%
a <- 2
b <- 12
g <- f + stat_function(fun = pdf_g, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)

In [6]:
# This is CDF(b) - CDF(a)
cdf_g(b) - cdf_g(a)
# is this a 95% CI?
# adjust it so it is!

0.71840761710622
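
To make the shaded region an exact 95% interval, one approach (a minimal sketch) is to let the quantile function qgamma place 2.5% of the probability in each tail:

In [ ]:
# One way to "adjust it so it is": put 2.5% in each tail using the quantile function
a <- qgamma(0.025, shape = 2, scale = 2)
b <- qgamma(0.975, shape = 2, scale = 2)
c(a, b)
cdf_g(b) - cdf_g(a) # this is now exactly 0.95
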
Many Possible Intervals

Unfortunately, this implies that there are actually many different confidence intervals possible for a given β → too many, in some sense! We usually restrict attention to one of three types:

Centered CI, in which both sides have the same probability
This means that P(Y ≤ a) = 1 − P(Y ≤ b) = (1 − β)/2
Right-Handed CI, which are one-sided, and look like [a, ∞)
So, F_Y(a) = 1 − β ⟺ a = F_Y⁻¹(1 − β)
Left-Handed CI, which are one-sided, and look like (−∞, b]
So, F_Y(b) = β ⟺ b = F_Y⁻¹(β)

There's nothing special about these in their construction.


In [7]:
## Example: 95%-Right-Handed Gamma(2,2)
a <- qgamma(0.05, shape = 2, scale = 2)
b <- 20 # limit of diagram; really should be Inf
g <- f + stat_function(fun = pdf_g, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)
a
b

0.710723021397324

20
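
For symmetry, here is a minimal sketch of the matching 95% left-handed interval (−∞, b]; since the Gamma has no mass below zero, the diagram can start at 0:

In [ ]:
## Example: 95%-Left-Handed Gamma(2,2)
a <- 0 # lower limit of diagram; the Gamma has no mass below 0
b <- qgamma(0.95, shape = 2, scale = 2) # b = F^(-1)(0.95)
g <- f + stat_function(fun = pdf_g, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)
a
b
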
Example 2: Standard Normal

Let's construct a 95% centered confidence interval for Z, which is standard normal. This implies that we want:

P(Z ≤ a) = Φ(a) = 0.025
P(Z ≤ b) = Φ(b) = 0.975

So, we can just find this by inverting the normal CDF, which requires a calculator. Good news! R has one for us.
In [8]:
## Example 1
# the function we want is qnorm(value, mean, sd)
a <- qnorm(0.025, mean = 0, sd = 1) # standard normal
b <- qnorm(0.975, mean = 0, sd = 1)
c(a,b)
# so, our 95% confidence interval is [-1.96,1.96]

1. -1.95996398454005
2. 1.95996398454005
In [9]:
# plotting the picture
pdf_n <- function(x){dnorm(x, mean = 0, sd = 1)}
cdf_n <- function(x){pnorm(x, mean = 0, sd = 1)}
f <- ggplot(data = data.frame(range = c(-5,5)), aes(x = range)) + stat_function(fun = pdf_n)
f <- f + labs(x = "X values", y = "Density")
f <- f + stat_function(fun = pdf_n, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)

Standard Normal CI

You will notice that this also gives us a convenient formula we can derive. If Z has a standard normal distribution, then a β-% CI centered about μ_Z = 0 is:

[Φ⁻¹(α/2), −Φ⁻¹(α/2)]

where α = 1 − β. You will also notice that this is the same as writing

Z_{a,b} = μ_Z ± Φ⁻¹(α/2)·σ_Z

We sometimes refer to these values of Φ⁻¹ as critical values: Z_{α/2}
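
As a quick sketch, we can tabulate the critical values Z_{α/2} for the standard confidence levels with qnorm:

In [ ]:
# Critical values Phi^(-1)(alpha/2) for the standard confidence levels
beta <- c(0.90, 0.95, 0.99)
alpha <- 1 - beta
qnorm(alpha/2) # the CI is [Phi^(-1)(alpha/2), -Phi^(-1)(alpha/2)]
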
Any Normal Variable

Okay, one last step - this is an example, but it's a very important one, since it highlights an important connection.

Suppose X is normal with mean μ_X and standard deviation σ_X. By the same logic as before, we would have a CI of:

[Φ⁻¹(α/2 | μ_X, σ_X), Φ⁻¹(1 − α/2 | μ_X, σ_X)]

However, this is ugly and not convenient! It also doesn't have a nice formula.
We can get a much more useful expression using the idea of standardization
Recall: Standardization

If you remember, a property of the normal was that:

If X has mean μ_X and standard deviation σ_X, then Z ≡ (X − μ_X)/σ_X is a standard normal
This also implies that if Z is standard normal, then X ≡ σ_X·Z + μ_X is normal with mean μ_X and standard deviation σ_X

But wait: when we construct a CI for the standard normal, we have specific values for the limits!

Z_a = +Φ⁻¹(α/2)
Z_b = −Φ⁻¹(α/2)

And these are values of Z, a standard normal random variable!

So, this implies that if we (1) multiply by σ_X then (2) add μ_X, the resulting X_a and X_b must be limits for a normal variable with mean μ_X and standard deviation σ_X → and this is a β-% CI!

X_a = σ_X·Z_a + μ_X = μ_X + Φ⁻¹(α/2)·σ_X
X_b = σ_X·Z_b + μ_X = μ_X − Φ⁻¹(α/2)·σ_X

⟹ X_{a,b} = μ_X ± Z_{α/2}·σ_X
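
Here is a minimal sketch of this result in R, using hypothetical values μ_X = 10 and σ_X = 2: the interval built from the standard normal critical value matches the one found by inverting X's own CDF.

In [ ]:
# Hypothetical example: X is normal with mu_X = 10, sigma_X = 2
mu_x <- 10
sigma_x <- 2
z <- qnorm(0.025) # Phi^(-1)(alpha/2) for a 95% centered CI
c(mu_x + z*sigma_x, mu_x - z*sigma_x) # X_(a,b) = mu_X +/- Z_(alpha/2)*sigma_X
c(qnorm(0.025, mean = mu_x, sd = sigma_x), qnorm(0.975, mean = mu_x, sd = sigma_x)) # same interval, directly
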
Conclusions

In other words, there is a useful (and powerful) relationship between a CI for a standard normal and one for any normal random variable
This is useful, because it means we don't have to keep figuring out this expression
We can just plug in the appropriate values of μ_X and σ_X
We also get a convenient formula to help us remember this relationship

However, the key is always to remember what we're doing → not just to memorize the formula. This is a process!
General Approach

1. Figure out what parameter we need to know about
2. Determine the associated statistic (variable) we are interested in
3. Determine their sampling distributions and necessary assumptions
4. Calibrate those assumptions using estimates
5. Construct a CI for the statistic under the calibrated assumptions
Step 2: Confidence Intervals for the Sample Mean

A very useful example is constructing CIs for the sample mean:

1. We are interested in learning about μ_X
2. We need to learn about X̄
3. The sampling distribution of x̄ is known when n → ∞:
x̄ is normally distributed...
... with mean μ_X
... and standard deviation σ_X/√n

So, if we know what μ_X and σ_X are, then we can easily construct a CI for x̄
We don't know those parameters, but let's start by imagining what we would do if we did
Sampling Distribution of x̄ (known μ_X, σ_X)

Given what we learned earlier about standardization, this actually makes things super simple: just plug in the appropriate values for μ_X̄ and σ_X̄:

⟹ X̄_{a,b} = μ_X̄ ± Z_{α/2}·σ_X̄

⟹ X̄_{a,b} = μ_X ± Z_{α/2}·σ_X/√n
Interpretation

This gives us a (1 − α)-% confidence interval for x̄
This tells us that if we drew another sample, there is a (1 − α)-% chance that the associated x̄ would lie between x̄_a and x̄_b
Alternatively, there is an α-% chance it doesn't lie between X̄_a and X̄_b

For example, suppose we know a variable X had μ_X = 15 and σ_X = 10. If we draw a sample of size n, what would be the 95% likely range for the sample mean to fall into?

In [10]:
# Visually
mean <- 15
std <- 10
n <- 10
pdf_n2 <- function(x){dnorm(x, mean = mean, sd = std/sqrt(n))}
cdf_n2 <- function(x){pnorm(x, mean = mean, sd = std/sqrt(n))}
f <- ggplot(data = data.frame(range = c(5,25)), aes(x = range)) + stat_function(fun = pdf_n2)
f <- f + labs(x = "X-bar values", y = "Density")
In [11]:
a <- mean + qnorm(0.025)*std/sqrt(n)
b <- mean - qnorm(0.025)*std/sqrt(n)
c(a,b)
g <- f + stat_function(fun = pdf_n2, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)
cdf_n2(b) - cdf_n2(a)

1. 8.80204967695439
2. 21.1979503230456

0.95
Step 2a: Relaxing Assumptions about σ_X

The problem with this is that constructing such an interval requires us to know what μ_X and σ_X are.
These are population parameters, so unless we were working under an assumption about them, we probably won't know them
Let's focus on relaxing the assumption on σ_X: what could we use to estimate σ_X that we can actually observe in the sample?

s_X

No surprise! We should use the sample standard deviation instead!
By the LLN this is going to be a very good approximation of σ_X, as the sketch below illustrates.
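
A small simulated sketch of that LLN claim (the population values here are made up for illustration):

In [ ]:
# Sketch: s_X settles near sigma_X (= 10 here) as the sample grows
set.seed(226)
for (n in c(10, 100, 10000)) {
  x <- rnorm(n, mean = 15, sd = 10) # a simulated sample
  print(c(n, sd(x))) # sample size and its sample standard deviation
}
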
However, there is a problem!
Problem: What's the Distribution?

Is it really OK to just plug in the value s_X instead of σ_X? What about something like:

Z̄ ≡ (X̄_n − μ_X)/(s_X/√n)

When s_X = σ_X this is a standard normal, but what if it's not? What is this distribution? And since s_X is a sample parameter, how does it behave when n is large? Isn't that a problem?
A Very Big Problem

Yes, it is a problem, and a very serious one. For a long time, no one really paid too much attention
This was because when n was pretty big, just pretending s_X = σ_X didn't seem to matter
However, for small values of n, the estimates from the model didn't match the data

This led a statistician by the name of William Gosset to figure it out. The answer was that this new statistic has a known distribution → the t-distribution.
The t Distribution

This is a complicated distribution which is like the standard normal, except that it takes a parameter ν = n − 1, the degrees of freedom.

You can think of it as a "correction" of the normal distribution when n is small:
It's still symmetrical about the mean
It has a mean of zero
As n → ∞ it becomes normal

This last property is why we didn't notice for a long time!
In fact, for n > 120 there is basically no (computable) difference
In [40]:
## Comparisons
n <- 1000 # change me!
pdf_t <- function(x){dt(x, n-1)}
data <- data.frame(range = c(-5,5))
f <- ggplot(data = data, aes(x = range)) + stat_function(fun = pdf_n, color = "blue")
f <- f + stat_function(fun = pdf_t, color = "red", alpha = 1)
f <- f + labs(x = "Values", y = "Density")
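
We can also see the convergence in the critical values themselves; a quick sketch comparing qt with qnorm:

In [ ]:
# t critical values approach the normal critical value as n grows
for (n in c(5, 30, 120, 1000)) {
  print(c(n, qt(0.025, df = n - 1)))
}
qnorm(0.025) # the normal benchmark, about -1.96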

Bottom Line

When n is small (less than 120), we change our equation for the CI slightly to be:

⟹ X̄_{a,b} = μ_X ± t_{n−1}(α/2)·s_X/√n

where t_{n−1}(a) is the critical value associated with a (i.e. the inverse of the CDF of the t-distribution)
In [8]:
# Example
mean <- 0
std <- 4
n <- 1000
s <- sd(rnorm(n, mean, std)) # a sample standard deviation from one simulated draw (exact values vary by draw)
v <- s/sqrt(n)
pdf_n2 <- function(x){dnorm(x/v, mean = mean/v, sd = 1)}
cdf_n2 <- function(x){pnorm(x/v, mean = mean/v, sd = 1)}
pdf_t <- function(x){dt(x/v, n-1)}
cdf_t <- function(x){pt(x/v, n-1)}
f <- ggplot(data = data.frame(range = c(-5,5)), aes(x = range)) + stat_function(fun = pdf_t)
f <- f + stat_function(fun = pdf_n2, color = "blue", alpha = 0.4)
f <- f + labs(x = "X-bar values", y = "Density")

In [9]:
a <- mean + qnorm(0.025)*s/sqrt(n)
b <- mean - qnorm(0.025)*s/sqrt(n)
c(a,b)
cdf_t(b) - cdf_t(a)
g <- f + stat_function(fun = pdf_t, xlim = c(a, b), geom = "area", fill = "blue", alpha = 0.1)

1. -0.235522112275733
2. 0.235522112275733

0.949722321182881
In [10]:
a2 <- mean + qnorm(0.025)*s/sqrt(n)
b2 <- mean - qnorm(0.025)*s/sqrt(n)
a <- mean + qt(0.025,n-1)*s/sqrt(n)
b <- mean - qt(0.025,n-1)*s/sqrt(n)
c(a,b)
cdf_t(b) - cdf_t(a)
# normal values
c(a2,b2)
cdf_t(b2) - cdf_t(a2)

1. -0.23580780543825
2. 0.23580780543825

0.95

1. -0.235522112275733
2. 0.235522112275733

0.949722321182881
Conclusions

As we can see, when n is large, there's really no difference here
You will also notice that the t-based intervals are larger than the normal-based ones
This is because they reflect the increased uncertainty caused by using s instead of σ
Intuitively, they should be larger because we are less sure about the sampling distribution, since it relies on an estimated value

We should keep in mind, in general, that when n is a small number we need to correct the critical values, but when n is large it doesn't really make a difference
Last Step: Dealing with μ_X

Fortunately, this doesn't require any new statistical knowledge - just some careful thinking about what a confidence interval means.

What we have done so far is to say "if we know μ_X, we can construct a CI for x̄":

X̄_{a,b} = μ_X ± t_{n−1}(α/2)·s_X/√n

This says there's an α-% chance of a random x̄ not landing inside this interval

Okay, but what should we use for μ_X?
As we did before, let's use x̄, the sample version:

X̄_{a,b} = x̄ ± t_{n−1}(α/2)·s_X/√n

Fortunately, because the difference between x̄ and μ_X is just an additive shift to this interval (right or left), this doesn't change the distribution
It should, however, change our interpretation
Intuitive Explanation

The idea is as follows:

Let's call ME = t_{n−1}(α/2)·s_X/√n the "margin of error" (ME)
If the actual x̄ is within one ME of μ_X, then it lies inside the CI about μ_X
But similarly, μ_X must then lie within one ME of x̄, i.e. within a CI about x̄
This is because the ME for each CI is the same size

In other words, if x̄ lies inside a CI about μ_X, then μ_X must lie inside the same-sized CI about x̄
This implies that if we repeatedly created CIs about many samples' worth of x̄-s, μ_X would lie inside them (1 − α)-% of the time
Important Note

This is a statement about the sampling distribution
We are not saying that μ_X has a (e.g.) 95% chance of lying inside the CI about x̄
It either does, or does not → we just can't tell which
What we are saying is that 95% of the possible CIs created like this, about the possible x̄-s, contain μ_X

In a sense, you can think of a CI as a "big statistic" → what we are saying is that in the sampling distribution of these CIs of this form, 95% of them will contain μ_X
Simulation

This is a little bit confusing, so in order to explore this idea further, let's consider the following simulation.

We will be simulating the average wage from a population.
In particular, let's suppose (for the purposes of our simulation) that wages have a population mean μ_w = 50,000 and population standard deviation σ_w = 20,000
Now, we know by the CLT that when n is large, the sampling distribution of w̄ is normal with mean μ_w and standard deviation σ_w/√n
Let's set n = 20,000

So, we will simulate drawing 10,000 w̄-s from this sampling distribution, and construct a 95% CI about each of them.
Then, we will evaluate those CIs to illustrate the conclusion we drew above
In [23]:
set.seed(143) # random simulation seed
n <- 20000 # sample size of the samples being drawn
k <- 10000 # number of simulations to run
mu <- 50000
se <- 20000/sqrt(n)
me <- qnorm(0.975)*se # margin of error for a 95% CI
wbars <- mu + rnorm(k)*se # draw 10,000 sample means (rnorm generates standard normal draws)
contains <- (mu > wbars - me)&(mu < wbars + me) # check if mu is within 1 ME of each wbar
mean(contains) # proportion of intervals containing mu

0.9527
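
As an extension (a sketch under the same assumed population), we can draw actual samples, estimate s_w from each, and use t critical values; coverage should again be close to 95%. The smaller n here is hypothetical, chosen so that s_w varies noticeably.

In [ ]:
# Extension (sketch): estimate s_w from each simulated sample and use t critical values
set.seed(143)
n <- 100 # a smaller (hypothetical) sample size, so s_w visibly varies
k <- 10000
mu <- 50000
sigma <- 20000
contains <- logical(k)
for (i in 1:k) {
  w <- rnorm(n, mean = mu, sd = sigma) # a simulated sample of wages
  me <- qt(0.975, n - 1)*sd(w)/sqrt(n) # estimated margin of error
  contains[i] <- (mu > mean(w) - me) & (mu < mean(w) + me)
}
mean(contains) # should be close to 0.95
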
Worked Example: CI for Average Wages

To demonstrate the process, let's put all of this work together to see how we construct a CI for average wages in the 2016 Census data.

We want to learn about the population average of wages, μ_w
We know, since the sample size is large, that w̄ is approximately normal, with a sampling distribution that has mean μ_w and standard deviation σ_w/√n
However, we don't know what μ_w or σ_w are, so we need to estimate them instead.

Let's construct a 99% centered confidence interval for μ_w (in the sense described above)
In [11]:
# Step 1: Compute means and standard deviations
n <- nrow(census_data)
n
w_bar <- mean(census_data$wages)
w_bar
s <- sd(census_data$wages)
s
se <- s/sqrt(n)
se # notice how much smaller se is in comparison to s!

343063

54482.5211637513

64275.2748093883

109.737990091199
In [9]:
# Step 2a: Find critical values using normal approximation directly and build CI
# (only feasible with large n)
a <- qnorm(0.005, mean = w_bar, sd = se)
b <- qnorm(0.995, mean = w_bar, sd = se)
CI <- c(a,b)
CI

1. 54199.8548331618
2. 54765.1874943407
In [10]:
# Step 2b: Find critical values by normalizing and using formula
# (appropriate for all n)
a <- w_bar + qt(0.005, n-1)*se
b <- w_bar - qt(0.005, n-1)*se
CI <- c(a,b)
CI

1. 54199.8532604581
2. 54765.1890670444
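
As a sanity check, R's built-in t.test() computes this same style of interval in one line; a minimal sketch:

In [ ]:
# Cross-check: the built-in t.test() reports the same 99% interval
t.test(census_data$wages, conf.level = 0.99)$conf.int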
