Notes ch4 Interval Estimation


Confidence Interval Estimation

Lecture Notes

Erkin Diyarbakirlioglu
IAE Gustave Eiffel
February 6, 2022

Table of contents

1 Point estimation

2 Interval estimation

3 Interval estimation for the population proportion

4 Interval estimation for the population mean

1 Point estimation

1. Let 𝜃 be the parameter of interest, whose value is unknown and must be estimated using data.
Given a random sample 𝑋1 , … , 𝑋𝑛 drawn from the population, the point estimate of 𝜃 can be
represented generically as,

𝜃̂ = 𝑓(𝑋1 , … , 𝑋𝑛 ) (1)

where the function 𝑓(⋅) is called an estimator of 𝜃. For example, assume 𝜃 is the population mean,
𝐸(𝑋𝑖 ) = 𝜇. Then, we may choose the estimator as the sample average, i.e.
$\hat{\theta} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$.

2. There are infinitely many possible estimators for 𝜃. The question is how to make sure that the
chosen estimator is a good one. Several concepts are useful for assessing the desirable
properties of an estimator. When one chooses an estimator and applies the function to the data, the
estimator yields a single number. For example, if one rolls a six-sided die 10 times and calculates
the arithmetic average of the outcomes, the output, that is the sample average, will be a point
estimate of the true mean. Literally, a point estimate is a single value that one hopes is as close
as possible to the value, known or not, of the parameter of interest.

2 Interval estimation

3. We have noted that a point estimate 𝜃̂ is a single value associated with the population parameter
𝜃. An issue with 𝜃̂ is that it does not provide much information about the parameter for the simple
reason that it is a single value subject to random variation in the data. In other words, without
additional information, we cannot tell how close 𝜃̂ is to 𝜃. Interval estimation thus aims
to provide extra information about the parameter. Loosely speaking, an interval estimate is
a range of plausible values for the parameter 𝜃. Thus, instead of estimating 𝜃 as, say,
100, we state lower and upper values 𝜃̂𝐿 and 𝜃̂𝑈 which we expect to include the true value of
the parameter at a predetermined confidence level.

4. Given a population parameter of interest, let 𝜃̂ be a sample estimate of it using a sample of size 𝑛.
The standard error of the estimate is 𝑆𝐸(𝜃̂). Since a confidence interval is a range of plausible
values for 𝜃, it must have a lower and an upper bound around the point estimate 𝜃̂. The width of
these bounds will be proportional to the uncertainty linked to the estimate 𝜃̂. The generic formula
for a confidence interval with 1 − 𝛼 confidence level can be written as,

$$ (1-\alpha)\%\ \text{CI for } \theta = \hat{\theta} \pm \text{margin of error} \tag{2} $$

At this stage, there are two elements that must be developed. First, the confidence level is the
probability with which we expect the resulting interval to include 𝜃. Most of the time, one calculates
confidence intervals at 90, 95 or 99%, so that 𝛼 is 10, 5 or 1%. A confidence interval is
labeled by the confidence level (as a percentage) associated with it.

The second element is the margin of error. It reflects the uncertainty associated with
the estimate 𝜃̂. The margin of error is added to and subtracted from the point estimate so that
we obtain the interval estimate. To calculate the margin of error we need to know the confidence
level and the sampling distribution of the point estimate. Specifically, we will see that the
standard error of 𝜃̂ plays a central role in determining the margin of error.

5. To understand how the interval estimation works, consider the following probability,

𝑃(𝑥𝐿 ≤ 𝑋 ≤ 𝑥𝑈 ) = 1 − 𝛼

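where 𝑥𝐿 and 𝑥𝑈 denote lower and upper values on the domain of the random variable 𝑋, chosen so
that the two tails outside the interval carry equal probability (for a symmetric distribution, 𝑥𝐿
and 𝑥𝑈 are then equidistant from the center). That is,

$$ P(X \le x_L) = \frac{\alpha}{2} \quad \text{and} \quad P(X \ge x_U) = \frac{\alpha}{2} $$

or, equivalently,

$$ F_X(x_L) = \frac{\alpha}{2} \quad \text{and} \quad F_X(x_U) = 1 - \frac{\alpha}{2} $$

Then, 𝑥𝐿 and 𝑥𝑈 can be written in terms of the quantile function as

$$ x_L = F_X^{-1}\left(\frac{\alpha}{2}\right) \quad \text{and} \quad x_U = F_X^{-1}\left(1 - \frac{\alpha}{2}\right) $$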

We call the interval [𝑥𝐿 , 𝑥𝑈 ] a (1 − 𝛼) interval for 𝑋. The figure below shows a generic interval
using the pdf and cdf of an arbitrary continuous random variable 𝑋.

Figure 1. [𝒙𝑳 , 𝒙𝑼 ] as a (𝟏 − 𝜶) interval for 𝑿

One should then notice that for a (1 − 𝛼) confidence interval, we must find the 𝑥𝐿 and 𝑥𝑈 quantiles
of the underlying sampling distribution of the sample statistic, i.e. the point estimate. Thus, the
(1 − 𝛼) confidence interval for the parameter 𝜃 can be written in a generic fashion as follows:

$$ (1-\alpha)\%\ \text{CI for } \theta = \hat{\theta} \pm F^{-1}\left(1 - \frac{\alpha}{2}\right) \times SE(\hat{\theta}) \tag{3} $$

So, the calculation of a confidence interval requires one to determine the (1 − 𝛼 ⁄2) quantile of the
sampling distribution of 𝜃̂.
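
The quantile construction above can be illustrated with a short sketch. It assumes, purely for illustration, that the distribution of interest is standard normal; the helper name `central_interval` is hypothetical and not part of these notes.

```python
from statistics import NormalDist


def central_interval(dist, alpha):
    """Return (x_L, x_U) so that each tail outside the interval has probability alpha/2."""
    x_lower = dist.inv_cdf(alpha / 2)        # F^{-1}(alpha / 2)
    x_upper = dist.inv_cdf(1 - alpha / 2)    # F^{-1}(1 - alpha / 2)
    return x_lower, x_upper


# Assumption for illustration: a standard normal distribution.
std_normal = NormalDist(mu=0.0, sigma=1.0)
for alpha in (0.10, 0.05, 0.01):
    x_l, x_u = central_interval(std_normal, alpha)
    print(f"alpha = {alpha:.2f}: [{x_l:.4f}, {x_u:.4f}]")
```

For a symmetric distribution the two bounds are mirror images of each other, which is why equation (3) only needs the upper quantile.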

3 Interval estimation for the population proportion

6. Let 𝑝 be the population proportion and 𝑝̂ a sample estimate of 𝑝 calculated using a random sample
of size 𝑛. Using the CLT, it can be shown that the sampling distribution of 𝑝̂ is approximately
normal with mean 𝑝 and standard deviation $\sqrt{p(1-p)/n}$,

$$ \hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \tag{4} $$

If the population proportion is not given, the standard error of 𝑝̂ can be calculated using the
sample proportion.1 Therefore, a (1 − 𝛼) CI for the population proportion is given by,

$$ (1-\alpha)\%\ \text{CI for } p = \hat{p} \pm \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \tag{5} $$

where Φ−1 (⋅) is the standard normal quantile function.

1 The normal model can be applied if (1) the sample observations are independent, and (2) the
success-failure condition holds, i.e. 𝑛𝑝 ≥ 15 and 𝑛(1 − 𝑝) ≥ 15.

Example. In a public opinion survey, 1,500 participants are asked whether or not they approve of the
government’s actions. 660 respond “approve”. Let’s calculate a 95% confidence interval for the
overall approval rating. We have 𝑝̂ = 660⁄1500 = 0.44 and 𝑆𝐸(𝑝̂ ) = √(0.44(1 − 0.44)⁄1500) =
0.0128. A 95% CI is then given by 0.44 ± 1.96 × 0.0128 = [0.4149, 0.4651].
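
The arithmetic of this example can be reproduced with a minimal sketch based on equation (5). The function name `proportion_ci` is a hypothetical helper introduced here for illustration, and the normal quantile comes from Python's standard library rather than a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, conf_level=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n                          # sample proportion
    se = sqrt(p_hat * (1 - p_hat) / n)             # standard error of p_hat
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * se                                # margin of error
    return p_hat, se, (p_hat - margin, p_hat + margin)


# Survey example: 660 approvals among 1,500 participants.
p_hat, se, (lower, upper) = proportion_ci(660, 1500, conf_level=0.95)
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```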

7. Calculating the required sample size. Increasing the number of observations narrows the
confidence interval without reducing the confidence level, because the
precision of the estimate improves as 𝑛 gets larger. In practice, however, adding more
observations to the data is costly. Since the confidence level reflects the success rate of the method,
a desirable property is to have a narrower interval while keeping the confidence level
reasonably high.

Recall the margin of error for a population proportion using the normal model:

$$ \text{Margin of error} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \tag{6} $$

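Given a desired margin of error 𝐸 and a predetermined confidence level (1 − 𝛼), the required
sample size is obtained by one of two methods, known as the educated guess method and the
conservative method. In the educated guess method, the required sample size 𝑛 is given by,

$$ n = \frac{\left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)^2 \hat{p}_G (1 - \hat{p}_G)}{E^2} \tag{7} $$

where 𝑝̂𝐺 is an educated guess for 𝑝. If one uses the conservative method, 𝑛 is given by

$$ n = \frac{\left(\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)^2 (0.5)^2}{E^2} \tag{8} $$

The conservative method sets 𝑝 = 0.5 because the product 𝑝(1 − 𝑝) is largest at that value, so the
resulting 𝑛 is sufficient whatever the true proportion is. In both formulas,
$\Phi^{-1}(1 - \frac{\alpha}{2})$ is the $(1 - \frac{\alpha}{2})$th quantile of the standard normal distribution.2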

2 A practically relevant question is how to choose between the educated guess and the conservative
approach. To decide, one should compare the cost of sampling extra units with the set-up cost of
running the sampling process once more. If the set-up cost of repeating the sampling procedure
(which may be needed if an educated guess is used) is high compared to the cost of sampling extra
units, then one will prefer the conservative approach. Source:
https://newonlinecourses.science.psu.edu/stat506/node/11/

Example. Suppose the same company in charge of the public opinion poll would like to run another
survey. How many participants must be surveyed to estimate 𝑝 with a 2% margin of error at 95%
confidence? Remember that the first survey revealed an approval rate of 44%. Using the educated
guess method, we get 𝑛 = (1.96² × 0.44(1 − 0.44))⁄0.02² = 2366.42. Rounding up to the next integer,
𝑛 = 2367. The conservative method yields 𝑛 = (1.96² × 0.5²)⁄0.02² = 2401.
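
Equations (7) and (8) can be sketched in a few lines. The function name `sample_size_proportion` is hypothetical; the sketch takes the normal quantile from the standard library, so the unrounded values differ slightly from those obtained with 1.96, while the rounded-up sample sizes agree with the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_proportion(margin, conf_level=0.95, p_guess=0.5):
    """Required n for a proportion; p_guess = 0.5 gives the conservative method."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    n_exact = z ** 2 * p_guess * (1 - p_guess) / margin ** 2
    return ceil(n_exact)          # always round up to the next integer


# Educated guess method, using the earlier approval rate of 44%.
n_guess = sample_size_proportion(margin=0.02, conf_level=0.95, p_guess=0.44)
# Conservative method, using p = 0.5.
n_conservative = sample_size_proportion(margin=0.02, conf_level=0.95)
print(n_guess, n_conservative)
```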

4 Interval estimation for the population mean

8. Case 1: σ known. Assume 𝑋1 , … , 𝑋𝑛 is a sequence of i.i.d. normal random variables with
mean 𝜇 and standard deviation 𝜎. We have seen that the sampling distribution of the sample mean
is also normal with mean 𝜇 and standard deviation given by the standard error of the
estimate, i.e. 𝑆𝐸(𝑥̅ ) = 𝜎⁄√𝑛. Then, the general formula of a (1 − 𝛼) confidence interval for 𝜇 is

$$ (1-\alpha)\%\ \text{CI for } \mu = \bar{x} \pm \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \times \frac{\sigma}{\sqrt{n}} \tag{9} $$

where $\Phi^{-1}(1 - \frac{\alpha}{2})$ is the $(1 - \frac{\alpha}{2})$th quantile of the standard normal random variable.

Example. In a quality control check, an inspector draws 22 big bags of construction sand and measures
the weight of each one. The sample average is 999.84 kg. If the standards are met, the distribution
of the weights of the big bags is normal with 𝜇 = 1,000 kg and 𝜎 = 1.2 kg. A 99% confidence interval
for the mean weight is 99% 𝐶𝐼 = 999.84 ± Φ−1 (0.995) × (1.2⁄√22) = 999.84 ± 2.5758 × 0.2558 =
[999.18, 1000.50].
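
As a check on this example, here is a minimal sketch of equation (9). The function name `mean_ci_known_sigma` is a hypothetical helper. With the stated inputs (𝑛 = 22, 𝜎 = 1.2, 99% confidence) the margin of error it computes is about 0.66 kg.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_known_sigma(x_bar, sigma, n, conf_level):
    """z-based confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * sigma / sqrt(n)     # quantile times standard error
    return x_bar - margin, x_bar + margin


# Quality control example: 22 bags, sample average 999.84 kg, sigma = 1.2 kg.
lower, upper = mean_ci_known_sigma(x_bar=999.84, sigma=1.2, n=22, conf_level=0.99)
print(f"99% CI = [{lower:.2f}, {upper:.2f}]")
```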

9. Case 2: σ unknown. Most of the time, the population standard deviation 𝜎 is unknown and is
instead estimated from the data by the sample standard deviation 𝑠. Then, it can be shown that the
following ratio,

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \tag{10} $$

follows a Student’s 𝑡-distribution with 𝑛 − 1 degrees of freedom. This ratio is just like the z-ratio
developed in the discussion of the CLT, i.e. $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, with the only
difference that the standard error of the sample mean is no longer calculated using the population
sigma but its sample counterpart. To construct a confidence interval for 𝜇 using the 𝑡-distribution,
one of the following conditions must be met:3

1. If the sample 𝑋1 , … , 𝑋𝑛 comes from a normal population, then 𝑥̅ will also follow a normal
distribution and the ratio $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ will follow a 𝑡-distribution with
𝑛 − 1 degrees of freedom.

2. If we do not know whether the sample comes from a normal distribution but the sample size is
larger than 30, the CLT suggests that 𝑥̅ is nearly normal. Again, the ratio
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ will approximately follow $t \sim T_{k=n-1}$.

3 In the case when we do not know whether the sample 𝑋1 , … , 𝑋𝑛 comes from a normal population and
the sample size is small, we can first run graphical analysis (e.g. normal plots) on the data to
examine visually if there are apparent deviations from normality. Another option consists in
running nonparametric statistical methods like the one-sample Wilcoxon procedure.

The generic formula for a (1 − 𝛼) confidence interval of the population mean is

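$$ (1-\alpha)\%\ \text{CI for } \mu = \bar{x} \pm T_k^{-1}\left(1 - \frac{\alpha}{2}\right) \times \frac{s}{\sqrt{n}} \tag{11} $$

where $T_k^{-1}(1 - \frac{\alpha}{2})$ is the $(1 - \frac{\alpha}{2})$th quantile of a 𝑡-distribution with 𝑘 = 𝑛 − 1 degrees of freedom.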

Example. A hospital manager examines the average emergency room (ER) wait time at a local
hospital. She takes a random sample of 50 patients who visited the ER over the past week. The
sample mean wait time is 30 minutes and the standard deviation is 20 minutes. A 95% confidence
interval for the mean ER wait time in this hospital can be calculated as
$95\%\ \text{CI for } \mu = 30 \pm T_{49}^{-1}(0.975) \times \frac{20}{\sqrt{50}} = [24.32, 35.68]$,
where we use 2.009575 as the 97.5th percentile of the 𝑡-distribution with 𝑘 = 49 df.
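
The ER example can be verified with a small sketch of equation (11). Python's standard library has no 𝑡 quantile function, so the sketch takes the critical value 2.009575 quoted in the example as an input; the function name `t_interval` is hypothetical.

```python
from math import sqrt


def t_interval(x_bar, s, n, t_crit):
    """t-based confidence interval; t_crit is the t quantile at n - 1 degrees of freedom."""
    se = s / sqrt(n)                 # standard error of the sample mean
    margin = t_crit * se
    return x_bar - margin, x_bar + margin


# ER example: n = 50, mean 30 minutes, s = 20 minutes, critical value taken from the text.
lower, upper = t_interval(x_bar=30, s=20, n=50, t_crit=2.009575)
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```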

Example. It is very time consuming to find rattlesnakes and nerve-racking to measure them, for
obvious reasons. A scientist randomly finds 12 snakes in the central Pennsylvania area and
measures their length. The following twelve measurements in inches are obtained: 40.2, 43.1,
45.5, 44.5, 39.5, 38.5, 40.2, 41.0, 41.6, 43.1, 44.9, 42.8. There is no prior indication that the
data come from a normal distribution and, in addition, the sample size is small. We can run a
graphical check using the normal probability plot. See the figure below.

Since the points fall along the trendline, it is reasonable to suggest that the data come from a
normal distribution. Then, a 90% confidence interval for the mean length of rattlesnakes is
calculated as $42.075 \pm T_{k=11}^{-1}(0.95) \times 0.652 = [40.905, 43.245]$. Note that 𝑥̅ = 42.075,
𝑠 = 2.257 and 𝑆𝐸(𝑥̅ ) = 0.652. The critical value of the 𝑡 random variable is 1.7959.
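
The summary statistics of the rattlesnake example can be recomputed from the raw data. This is a sketch using the standard library only; the critical value 1.7959 is taken from the text rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# The twelve rattlesnake lengths (inches) listed in the example.
lengths = [40.2, 43.1, 45.5, 44.5, 39.5, 38.5, 40.2, 41.0, 41.6, 43.1, 44.9, 42.8]

n = len(lengths)
x_bar = mean(lengths)       # sample mean
s = stdev(lengths)          # sample standard deviation (n - 1 in the denominator)
se = s / sqrt(n)            # standard error of the mean

t_crit = 1.7959             # 95th percentile of t with 11 df, as quoted in the text
lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
print(f"mean = {x_bar:.3f}, s = {s:.3f}, SE = {se:.3f}")
print(f"90% CI = [{lower:.3f}, {upper:.3f}]")
```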

10. Calculating the required sample size. The margin of error 𝐸 for a 𝑡 confidence interval is
$T_k^{-1}(1 - \frac{\alpha}{2}) \times \frac{s}{\sqrt{n}}$. To determine the required sample size, we have
two possibilities. The first one implements the following crude formula to find the sample size,

$$ n = \left(\frac{\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \times \sigma}{E}\right)^2 \tag{12} $$

where $\Phi^{-1}(\cdot)$ is the standard normal quantile function. Note that although the margin of
error of a 𝑡 confidence interval uses a 𝑡 distribution, the crude formula for the sample size uses
the standard normal quantile times the population standard deviation. In other words, this method
assumes that 𝜎 is known. In some applications, the standard deviation may not be given but the
researcher may have prior information about a plausible range between the minimum and the maximum
values that can be taken by the variable of interest. Then, a crude estimator of 𝜎 is given by the
following ratio:

$$ \sigma \approx \frac{\max\{x_i\} - \min\{x_i\}}{4} = \frac{\text{range}}{4} \tag{13} $$

Example. A marketing research firm wants to estimate the average amount a student spends
during the Spring break. They want to determine it with at most a $120 margin of error at 90%
confidence. One can roughly say that the monthly spending ranges from $100 to $1700. How many
students should they sample? In this example, the standard deviation of the data is not given.
Therefore, it can be approximated by the crude estimator that consists in dividing the range by
four. We get 𝜎 ≈ (1700 − 100)⁄4 = 400. The required sample size is 𝑛 = (1.645 × 400⁄120)² =
30.07, which, rounding up to the next integer, yields 𝑛 = 31.
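
The crude method of equations (12) and (13) can be sketched as follows. The helper names `crude_sigma` and `crude_sample_size` are hypothetical, and the normal quantile is computed rather than read from a table.

```python
from math import ceil
from statistics import NormalDist


def crude_sigma(minimum, maximum):
    """Range-over-four rule of thumb for the standard deviation."""
    return (maximum - minimum) / 4


def crude_sample_size(sigma, margin, conf_level):
    """Sample size from the normal quantile, treating sigma as known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sigma / margin) ** 2)   # round up to the next integer


# Spring break example: spending between $100 and $1700, margin $120, 90% confidence.
sigma = crude_sigma(100, 1700)
n = crude_sample_size(sigma, margin=120, conf_level=0.90)
print(sigma, n)
```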

The second and more accurate method is to solve for 𝑛 iteratively since the 𝑡 quantile also depends
on 𝑛. Let’s rewrite the number of observations one needs to have for a given margin of error 𝐸 at
a prespecified confidence level using the 𝑡 model:

$$ n = \left(\frac{T_{k=n-1}^{-1}\left(1 - \frac{\alpha}{2}\right) \times s}{E}\right)^2 \tag{14} $$

Using numerical methods, the above formula can be iteratively solved for 𝑛.
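
One possible way to carry out this iteration is sketched below. It is not the spreadsheet procedure referred to in the exercises: the sketch searches upward from the normal-based sample size until 𝑛 is at least the right-hand side of equation (14). Because the standard library has no 𝑡 quantile, the functions `t_pdf`, `t_cdf` and `t_quantile` are hand-rolled illustrative stand-ins (Simpson's rule plus bisection); a statistical library would normally be used instead. Using the crude value 400 from the Spring break example in place of 𝑠 is an assumption made only to have numbers to run.

```python
import math
from statistics import NormalDist


def t_pdf(x, k):
    """Density of Student's t with k degrees of freedom."""
    log_const = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                 - 0.5 * math.log(k * math.pi))
    return math.exp(log_const - (k + 1) / 2 * math.log1p(x * x / k))


def t_cdf(x, k, steps=400):
    """CDF via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, k) + t_pdf(a, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, k)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, k):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, k) < p:          # expand until the root is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_size_t(s, margin, alpha):
    """Smallest n at or above the normal-based start with n >= (t_{n-1} * s / margin)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = max(2, math.ceil((z * s / margin) ** 2))   # normal-based starting point
    while n < (t_quantile(1 - alpha / 2, n - 1) * s / margin) ** 2:
        n += 1
    return n


# Assumed inputs: s = 400, margin of error 120, 90% confidence.
n_required = sample_size_t(s=400, margin=120, alpha=0.10)
print(n_required)
```

Starting from the normal-based value is a design choice: the 𝑡 quantile is larger than the normal quantile, so the crude answer is a lower bound and only a few upward steps are usually needed.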

End-of-chapter exercises for Confidence interval estimation

Exercise 1. Interval estimation for 𝒑

Suppose you flip a coin 100 times. It lands 58 times on tails and 42 times on heads. (1) Calculate the
sample proportion of tails. (2) Calculate a 95% confidence interval for the population proportion
of tails using this sample.4

Exercise 2. Interval estimation for 𝒑

A die shows a six seven times in fifty rolls. (1) How many times would you expect to
observe a six in this experiment if the die is fair? (2) Calculate a 90% confidence interval for the
population proportion of sixes one would observe in such experiments where we roll the die 50
times.5

Exercise 3. Interval estimation for 𝒑

In 2013, the Pew Research Foundation reported that “45% of U.S. adults report that they live with
one or more chronic conditions”. However, this value was based on a sample, so it may not be a
perfect estimate for the population parameter of interest on its own. The study reported a
standard error of about 1.2%, and a normal model may reasonably be used in this setting. Create
a 95% confidence interval for the proportion of U.S. adults who live with one or more chronic
conditions.6

Exercise 4. Interval estimation for 𝒑

Refer to the previous exercise that reports the results of a survey on U.S. adults living with one or
more chronic conditions. Identify each of the following statements as true or false. Provide a brief
explanation to justify each of your answers. (1) We can say with certainty that the confidence
interval from the previous exercise will contain the true percentage of U.S. adults who suffer from
a chronic illness. (2) If we repeated this study 1,000 times and constructed a 95% confidence
interval for each study, then approximately 950 of those confidence intervals would contain the
true proportion of U.S. adults who suffer from chronic illnesses. (3) The poll provides statistically
significant evidence at the 5% level that the percentage of U.S. adults who suffer from chronic
illness is below 50%. (4) Since the standard error is 1.2%, only 1.2% of people in the study
communicated uncertainty about their answer.7

4 The sample proportion is 𝑝̂ 𝑇𝐴𝐼𝐿𝑆 = 58%. A 95% CI for the population proportion of tails is given by
𝐶0.95 = 0.58 ± 1.96 × √(0.5 × (1 − 0.5)⁄100) = [0.482, 0.678].
5 The sample proportion of sixes is 𝑝̂ = 7⁄50 = 14%. The population proportion of sixes is
𝑝 = 1⁄6 ≈ 16.67%. So, a 90% confidence interval for 𝑝 is
𝐶0.9 = 14% ± 1.645 × √(16.67%(1 − 16.67%)⁄50) = [5.33%, 22.67%].
6 Source: Diez et al. (2015, 206). 𝐶1−5% = 0.45 ± 1.96 × 0.012 = [0.42648, 0.47352].

Exercise 5. Interval estimation for 𝒑

A poll run among Twitter users found that 52% of the users get at least some news on Twitter. 8
The standard error for this estimate was 2.4%, and a normal distribution may be used to model
the sample proportion. Construct a 99% confidence interval for the proportion of Twitter users
who get some news on Twitter and interpret the confidence interval.9

Exercise 6. Interval estimation for 𝒑

The CBS News on January 21st, 2009 reported that President Bush’s final approval rating was 22%
according to a poll conducted by telephone interview among 1,112 adults in the United States. 10
Calculate a 95% confidence interval for the true population proportion for the President’s
approval rate.11

Exercise 7. Interval estimation for 𝒑

Popularity surveys are closely monitored by policy-makers. In a survey run in
France in the aftermath of the first lockdown imposed because of the Covid-19 health crisis, 495
of the 883 participants surveyed stated that they generally approved of the restrictions set by the
government to face the crisis. Calculate a 99% confidence interval for the approval rate
of the restrictions associated with the lockdown among the population.12

Exercise 8. Interval estimation for 𝒑

A librarian draws a random sample of size 𝑛 = 100 members from the members’ database to see
how many members pay a penalty for returning books late. It turns out that 39 members in the
sample had to pay a penalty. Using this sample, construct a 99% confidence interval for the
proportion of members who return books late.13

7 Source: Diez et al. (2015, 206). (1) False. This is a 95% confidence interval, so it is not certain
to contain the true value. (2) True. (3) True. The entire confidence interval lies below 50%.
(4) False. The standard error does not carry such a message about the data.
8 Source: Diez et al. (2015, 206).
9 𝐶0.99 = 0.52 ± 2.58 × 0.024 = [45.81%, 58.19%].
10 Source: https://newonlinecourses.science.psu.edu/stat506/node/10/.
11 95% 𝐶𝐼 = 0.22 ± Φ−1 (0.975) × √(0.22 × 0.78⁄1112) = [0.1957, 0.2443].
12 𝑝̂ = 495⁄883 = 0.5606. 𝐶0.99 = 0.5606 ± 2.58 × √(0.5606 × 0.4394⁄883) = [0.5175, 0.6037].
13 𝑝̂ = 0.39. The normal model gives 𝐶0.99 = 0.39 ± Φ−1 (0.995) × √(0.39 × 0.61⁄100) =
[0.2644, 0.5156]. In R, the code “binom.test(x = 39, n = 100, conf.level = 0.99)” computes an exact
binomial interval, reported as [0.2673, 0.5235], which is close to but not identical with the normal
interval.

Exercise 9. Interval estimation for 𝒑

To estimate the survival chances of infants born prematurely, a group of researchers surveyed the
records of all premature babies born in a hospital over a three-year period. They found 39 babies
who were born at 25 weeks gestation, 31 of whom survived at least 6 months. What is the point
estimate of the percentage of all babies born at 25 weeks gestation who would survive at least 6
months? Calculate a 95% confidence interval for this proportion.14

Exercise 10. Interval estimation for 𝒑

On January 6th, 2021, supporters of President D. Trump overtook the US Capitol. Following these
events that marked the global agenda, an ABC News / Ipsos poll found that nearly 56% of
Americans wanted President Trump out of office before inauguration day and that the majority
faulted Trump for the rioting in Washington D.C. The related page on the Ipsos website notes that the
poll is based on a nationally representative sample of 570 general population adults age 18 or
older. (1) Using this information, calculate a 95% confidence interval for the proportion of
Americans who think D. Trump should be removed from office before inauguration day. (2) Calculate
the same interval assuming the sample size doubles, i.e. 1140 participants.15

Exercise 11. Interval estimation for 𝒑 and sample size calculation

Assume there are nearly 20,000 students enrolled in a university. As part of their project, a group
of students carry out a survey on smoking habits among their peers and find that nearly 42% of
the sampled students reported that they smoke regularly. (1) The project team calculates a 95%
normal confidence interval for the mean proportion of smokers as [35.1936% 48.8064%].
Calculate then the number of students they sampled. (2) Calculate then a 90% confidence interval
for the mean proportion of smokers among the students using the normal model.16

Exercise 12. Calculating the sample size

In line with the previous exercise on George W. Bush’s final approval rating, assume we want to
estimate the next president’s final approval rating. How many people should be sampled so that
the margin of error will be 3% with 95% confidence? Use both the educated guess and
conservative approaches.17

14 The point estimate of the survival rate is 31⁄39 = 79.5%. The standard error is
√((0.795 × 0.205)⁄39) = 0.0647. A 95% normal confidence interval is then
0.795 ± 1.96 × 0.0647 = [66.8%, 92.2%]. Note that the normal approach works quite well because the
binomial distribution (the data is actually binomial) looks a lot like the normal distribution when
𝑛 is large and 𝑝 isn’t close to 0 or 1.
15 Source: https://www.ipsos.com/en-us/news-polls/abc-news-rioting-democracy-011021
16 (1) Using one of the bounds of the confidence interval, we solve for the standard error of the
sample proportion, i.e. 0.351936 = 0.42 − 1.96 × 𝑆𝐸(𝑝̂ ) → 𝑆𝐸(𝑝̂ ) = 0.034726. Then,
$SE(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n} \rightarrow n = \frac{0.42 \times 0.58}{0.034726^2} \rightarrow n = 202.0076$,
i.e. about 202 students. (2) A 90% confidence interval is
𝐶0.9 = 0.42 ± 1.65 × 0.034726 = [0.3627, 0.4773].

Exercise 13. Calculating the sample size

In a study on the inhabitants of a large town, members of a family and social life association want
to find out how many households serve breakfast in the mornings. How large a sample should
they survey to estimate the proportion of families who serve a breakfast in the morning within a
2% margin of error at 5% significance level?18

Exercise 14. Calculating the sample size

To estimate the proportion of voters who plan to vote for Candidate A in an election, a random
sample of size 𝑛 from the voters is chosen. The sampling is done with replacement. Let 𝑝 be the
proportion of voters who plan to vote for Candidate A among all voters. How large does 𝑛 need to
be so that we can obtain a 90% confidence interval for 𝑝 with 3% margin of error?19

Exercise 15. Interval estimation for 𝝁

A hospital administrator hoping to improve wait times decides to estimate the average emergency
room (ER) waiting time at her hospital.20 She collects a simple random sample of 64 patients and
determines the time (in minutes) between when they checked in to the ER until they were first
seen by a doctor. A 95% confidence interval based on this sample is [128 minutes, 147 minutes],
which is based on the normal model for the mean. Determine whether the following statements
are true or false and explain your reasoning. (1) This confidence interval is not valid since we do
not know if the population distribution of the ER wait times is nearly normal. (2) We are 95%
confident that the average wait time of these 64 emergency room patients is between 128 and 147
minutes. (3) We are 95% confident that the average waiting time of all patients at this hospital’s
emergency room is between 128 and 147 minutes. (4) 95% of random samples have a sample
mean between 128 and 147 minutes. (5) A 99% confidence interval would be narrower than the
95% confidence interval since we need to be more sure of our estimate. (6) The margin of error
is 9.5 and the sample mean is 137.5. (7) To decrease the margin of error of a 95% confidence
interval to half of what it is now, we would need to double the sample size.21

17 Source: https://newonlinecourses.science.psu.edu/stat506/node/11/. Using the educated guess
approach, we take the previous approval rating of 0.22 as 𝑝̂ . Then the necessary sample size is
𝑛 = 1.96² × 0.22 × 0.78⁄0.03² = 732.47, i.e. 733 after rounding up. The conservative approach takes
𝑝 = 0.5. The required sample size is 𝑛 = 1067.11, i.e. 1,068 after rounding up.
18 There is no prior about the population proportion, so we should use the conservative method and
take 𝑝0 = 0.5. Plugging the inputs into the formula, we obtain 𝑛 = 2400.91. So, the association must
survey 2,401 people.
19 With no prior about the population proportion, we use the conservative method. So,
$n = \frac{(\Phi^{-1}(0.95))^2 (0.5 \times 0.5)}{0.03^2} \approx 752$.
20 Source: Diez et al. (2015, 207).

Exercise 16. Interval estimation for 𝝁

The National Survey of Family Growth conducted by the Centers for Disease Control gathers
information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and
men’s and women’s health. One of the variables collected on this survey is the age at first marriage.
The histogram below shows the distribution of ages at first marriage of 5,534 randomly sampled
women between 2006 and 2010. The average age at first marriage among these women is 23.44
with a standard deviation of 4.72.

[Figure: histogram of age at first marriage]


Estimate the average age at first marriage of women using a 95% confidence interval and interpret
this interval in context. Discuss any relevant assumptions.22

Exercise 17. Interval estimation for 𝝁

A researcher records the age of a sample of 100 participants in a public running competition.
The sample mean is 35.05 years and the standard deviation of the runners’ ages
is 8.97. More than 5,000 participants are enrolled in this competition each time. (1) Is it possible
to consider this sample as random? (2) What is the standard error of the sample mean? (3) The
researcher has also measured the time it takes each runner to complete the race. The
average time is 𝑥̅ = 92 minutes with a standard deviation of 𝜎 = 15.8 minutes.
Construct a 95% confidence interval for the average time all runners take to finish the race.23

21 (1) False. Provided the data distribution is not very strongly skewed, the sample mean for a
sample size of 64 observations will be nearly normal. (2) False. Inference is made on the population
parameter, not the point estimate. The point estimate is always in the confidence interval. (3) True.
(4) False. The confidence interval is not about a sample mean. (5) False. To be more confident that
we capture the parameter, we need a wider interval. Think about needing a bigger net to be more sure
of catching a fish in a murky lake. (6) True. The normal model was used to model the sample mean.
The margin of error is half the width of the interval. The sample mean is the midpoint of the
interval. (7) False. In the calculation of the standard error, we divide the standard deviation by
the square root of the sample size. To cut the standard error in half, we would need to sample
2² = 4 times the number of people in the initial sample.
22 Source: Diez et al. (2015, 209). 95% 𝐶𝐼 = 23.44 ± 𝐹 −1 (0.975) × (4.72⁄√5534). Given the large
sample size and the weak skewness of the distribution, there will be negligible difference between a
normal and t model. Thus, choosing the critical value as 1.96, we get
95% 𝐶𝐼 𝑓𝑜𝑟 𝜇 = [23.3156, 23.5644].

Exercise 18. Interval estimation for 𝝁

A company sells car batteries. The management claims the average battery life before a charge is
necessary is at least 100 hours. In a quality control process, inspectors examine 48 batteries
randomly selected from the production. The sample mean is 103.81 hours. The sample standard
deviation is found as 23.96 hours. Construct a 99% confidence interval of the population mean.24

Exercise 19. Interval estimation for 𝝁

The manager of a paint supply store wants to estimate the actual amount of paint contained in 10
liters cans purchased from a nationally known manufacturer. The manufacturer’s specifications
state that the standard deviation of the amount of paint for such cans is equal to 0.02 liters. The
manager takes a random sample of 30 cans and the sample amount of paint per 10 liter can is
found to be 9.995 liters.

Exercise 20. Interval estimation for 𝝁

A random sample of size 𝑛 = 100 is drawn from a distribution with known variance 𝑉𝑎𝑟(𝑋) =
16. The sample average is calculated as 𝑋̅ = 23.5. Find a 95% confidence interval for the
population mean 𝜇.25

Exercise 21. Confidence intervals

A study focuses on the total cholesterol concentration among the elderly. Using a random sample
with 100 individuals, a 95% confidence interval using a 𝑡-distribution for the mean concentration
level is found as [1.504 2.046]. Using this confidence interval, calculate the sample average and
standard deviation of the cholesterol concentration for these 100 individuals. Hint: The 97.5th
percentile of the 𝑡-distribution at 99 degrees of freedom is 1.9842.26

23 The ratio of the sample size to the population size is 𝑛⁄𝑁 = 100⁄5,000 < 0.1. It is possible to
consider a sample that represents less than 10% of the population as random.
𝑆𝐸(𝑥̅ ) = 𝑠⁄√𝑛 = 8.97⁄√100 = 0.897. The conditions for applying the normal model are thus met. Using
the generic confidence interval formula, i.e. 𝑥̅ ± 𝑧1−𝛼⁄2 × 𝑆𝐸(𝑥̅ ), where 𝑧1−𝛼⁄2 is the (1 − 𝛼⁄2)
quantile of the standard normal distribution, we calculate
95% 𝐶𝐼 = 92 ± 1.96 × (15.8⁄√100) = [88.9033, 95.0967].
24 𝑥̅ = 103.81 and 𝑠 = 23.96. Therefore, 𝑆𝐸(𝑥̅ ) = 23.96⁄√48 = 3.4583. Assuming a normal model is
right, the 99% confidence interval for the population mean is
103.81 ± 2.5758 × 3.4583 = [94.9, 112.72].
25 A 95% confidence interval is given by 23.5 ± 1.96 × (√16⁄√100) = [22.7, 24.3].
26 We need to solve for 𝑥̅ and 𝑠 in eq. (1): 𝑥̅ − 1.9842 × (𝑠⁄√100) = 1.504 and eq. (2):
𝑥̅ + 1.9842 × (𝑠⁄√100) = 2.046. We get 𝑥̅ = 1.775 and 𝑠 = 1.3658.

Exercise 22. Confidence intervals

In an online survey, participants were asked “How big of a problem is police corruption in the
country where you live?” The scores range from 0 that shows that the police corruption is not a
problem at all, to 10 that suggests that corruption is an extremely severe problem. Data collected
among the 1158 participants from Brazil have an average score of 7.63. The 95% (normal)
confidence interval is [7.51, 7.75]. What is the sample standard deviation of the scores?27

Exercise 23. Calculating the sample size

Consider again the previous context on corruption perceptions. Assume another survey where we
would like to ask participants about the same problem. How large a sample should one survey to
estimate the true score with a 0.2 margin of error at the 5% significance level? Use first the crude
method to figure out the standard deviation and then implement the iterative method on a
spreadsheet. For the iterative method, you can use the sample standard deviation you’ve
calculated above.28

Exercise 24. Confidence intervals

Using a random sample with 48 observations from a random variable 𝑋, the 95% confidence
interval for the population mean is calculated as [116, 124]. Assume the sampling distribution
of the sample average follows a normal distribution. Calculate the mean 𝜇 and the
standard deviation 𝜎 of 𝑋.

Exercise 25. Confidence intervals

Students attending a statistics class ask the professor about last year’s grades. The professor’s
answer is as follows: “23 students attended this class last year and a 95% confidence interval of
the mean grade was [9.1981 12.5019]”. What was the average grade of the previous class? What
was its standard deviation?

27 Source: https://www.indexmundi.com/surveys/results/1/table. The margin of error at 𝛼 = 5% is 𝑚 =
|7.51 − 7.63| = 7.75 − 7.63 = 0.12. Since we use the normal model, the critical value is the 97.5th
percentile of the standard normal distribution, which is equal to 1.96. Therefore, we need to solve for 𝑠 in the equation
0.12 = 1.96 × (𝑠⁄√1158). We find 𝑠 = 2.0834.
28 Given the research setup, the minimum and maximum scores are 0 and 10 respectively. Thus, the crude estimator of
the standard deviation is 2.5. The necessary sample size to guarantee a 0.2 margin of error at 5% significance is 𝑛 ≈
(1.96 × 2.5⁄0.2)² = 600.25. If one implements the iterative method in Excel, one gets 𝑛 = 417.92.

Exercise 26. Confidence intervals

A statistics professor thinks that, on average, 10% of all students do not comply with at least one
of the instructions during an exam (like answering the questions in order of their appearance,
reporting proportions as percentage values and not as fractions, etc.). There are 60 students who
will attend his next class on statistics. Calculate a 95% confidence interval for the number of
students who would not comply with at least one of the instructions of the final exam.

Exercise 27. Confidence intervals

A survey carried out among 809 participants by the University of Monmouth in the aftermath of
Joe Biden's election as 46th president of the USA reported that 54% of Americans supported
the new president, while 30% did not and 16% had no opinion. Using the data, calculate
a 95% confidence interval for the proportion of those who support the new president and for
the proportion of those who do not.

Exercise 28. Confidence intervals

Anton Chigurh would like to discover whether his "lucky 50 cent coin" he'd been keeping in his
pocket for a long time is actually a fair one or not. If it is so, the coin must land on heads with
probability 50% or, equivalently, on tails with probability 50%. (1) How many times should he
flip the coin to estimate the proportion of heads with 95% confidence with less than 1% margin
of error? (2) Anton is an uneasy and busy man who doesn't like running short of time. So, he decides
to flip his lucky coin only 100 times. He gets 43 heads and 57 tails. Calculate a 95% confidence
interval for the true population proportion of heads. Is it reasonable to suggest that his lucky coin
is a fair one? How should Anton interpret this interval?29

Exercise 29. Confidence intervals

A prospective finance student would like to evaluate her chances to succeed in IAE Gustave Eiffel’s
Portfolio Management program. To this aim, she contacts a few graduates through social media
and collects data on their average at graduation. 19 people answer the student. The average of the
grades is 11.8 with a sample standard deviation of 2.25. Using the appropriate statistical model,
calculate the 90% confidence interval of the mean graduation score.30

29 (1) $n = \dfrac{1.96^2 \times 0.5^2}{0.01^2} = 9604$ times. (2) $C_{1-0.05} = \hat{p} \pm \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\sqrt{\frac{p(1-p)}{n}} = 0.43 \pm 1.96\sqrt{\frac{0.5^2}{100}}$ = [0.332 0.528].
Based on this interval, it is not reasonable to conclude that the coin is unfair: the proportion of heads for a fair coin,
0.5, lies within the interval. Anton should then think, "if I keep repeating the experiment of flipping the coin 100 times,
about 95% of the intervals constructed this way will contain the true population proportion". At the 95% confidence
level, his data are therefore consistent with a fair coin. In addition, the margin of error is nearly 10 times as large as
the margin of error targeted in the first part of the exercise.
30 $C_{0.9} = 11.8 \pm T^{-1}_{k=18}(0.95) \times (2.25/\sqrt{19}) = 11.8 \pm 1.73 \times (2.25/\sqrt{19})$ = [10.9 12.7].

Exercise 30. Confidence intervals

An international Ipsos poll run across 15 countries over the period Dec. 2020 – January 2021
revealed that only 40% of French people were willing to get vaccinated against the Covid-19
disease. (1) Calculate a 95% confidence interval for the proportion of French people who are
willing to be vaccinated if the sample size is 𝑛 = 1126. (2) Suppose we would like to design a new
study to see whether, during the next months, the public communication strategy succeeded in
overcoming hesitancy. How large a sample should one select for this new survey to estimate the
proportion of those who accept to be vaccinated with less than 2% margin of error at the 5%
significance level? Use first the prior guess method, and then the conservative method, to
calculate the required sample size 𝑛.
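
The two sample size approaches named in part (2) differ only in the value of 𝑝 plugged into 𝑛 = 𝑧² 𝑝(1 − 𝑝)⁄𝑚². The sketch below illustrates this, assuming that the prior guess method uses the earlier estimate 𝑝 = 0.40 and that the conservative method uses 𝑝 = 0.5 (the value that maximises 𝑝(1 − 𝑝)); the function name is a hypothetical helper.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, alpha=0.05, p_guess=0.5):
    """Smallest n whose normal-based margin of error for a proportion is <= margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)            # about 1.96 at alpha = 5%
    return ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


n_prior = sample_size_for_proportion(0.02, p_guess=0.40)   # prior guess: p = 0.40
n_conservative = sample_size_for_proportion(0.02)          # conservative: p = 0.5

print(n_prior, n_conservative)
```

The conservative choice never yields a smaller sample than the prior guess, since 𝑝(1 − 𝑝) is largest at 𝑝 = 0.5. The same function with a 1% margin and 𝑝 = 0.5 gives the number of coin flips asked for in Exercise 28.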

Interval Estimation

Lecture outline

Intro: 𝜃 the parameter, value unknown, must be
estimated. Let 𝑋1 , … , 𝑋𝑛 be a random sample, then 𝜃
can be estimated using 𝜃̂ = 𝑓(𝑋1 , … , 𝑋𝑛 ) where 𝑓(⋅) is
an estimator. Ex. 𝑋̅ = 𝑛⁻¹ ∑𝑋𝑖 an estimator of 𝜇.

Remarks: (1) 𝜃̂ alone does not provide much information
about 𝜃 because of sampling variation; (2) Q: How
close is 𝜃̂ to 𝜃?

Def. An interval estimate of 𝜃 is a range of plausible
values for 𝜃. Generically,

$CI_{1-\alpha} \text{ for } \theta = \hat{\theta} \pm \text{Margin of error}$

Remarks: (1) In general 𝛼 chosen as 1, 5 or 10%, so
that confidence level 99, 95 or 90%; (2) Margin of
error proportional to the amount of uncertainty
around the estimate.

Interval estimation for 𝑝: The sampling distribution of
𝑝̂ ∼ 𝑁(𝑝, √𝑝(1 − 𝑝)⁄𝑛). If 𝑝 unknown, 𝑆𝐸(𝑝̂ )
calculated by 𝑝̂ . A 𝐶𝐼1−𝛼 for 𝑝 is,

$CI_{1-\alpha} \text{ for } p = \hat{p} \pm \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
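
The interval for 𝑝 translates into a few lines of code. The sketch below is a minimal illustration, assuming the normal approximation above with the standard error computed from 𝑝̂; the function name `proportion_ci` is hypothetical, and the inputs (43 heads in 100 flips) are borrowed from the coin exercise.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, alpha=0.05):
    """Normal-approximation CI for a proportion, with SE computed from p_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # Phi^{-1}(1 - alpha/2)
    se = sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se


lower, upper = proportion_ci(0.43, 100)        # 43 heads out of 100 flips
print(f"[{lower:.3f} {upper:.3f}]")
```

Plugging 𝑝 = 0.5 into the standard error instead of 𝑝̂ = 0.43 makes the interval marginally wider, because 𝑝(1 − 𝑝) is largest at 0.5.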

Dropped

Let 𝑋 be a random variable with pdf 𝑓𝑋 (𝑥; 𝜽) where 𝜽
denotes the parameter vector that describes the
distribution of 𝑋. For simplicity, we assume there is
only one parameter and let 𝜽 = 𝜃. Note that 𝜃 is a fixed
but unobserved value. To get an idea about its
potential value, it must be estimated from data. Let
𝑆𝑖 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 } be a sample with 𝑛 observations
drawn from 𝑋. Using this sample, we can calculate an
estimate of 𝜃. The mathematical formula used to
produce this estimate is called an estimator. We
denote by 𝜃̂ a point estimate of 𝜃. Since 𝜃̂ is a quantity
calculated from the sample and there are as many 𝜃̂’s
as there are random samples of size 𝑛 drawn
from 𝑋, 𝜃̂ must be treated as a random variable.

Concretely, assume 𝑋 is the random variable defined
previously, i.e. the sum of two fair dice. The mean of 𝑋 is
7. Assume now we throw two dice 100 times and
write down the average of the outcomes, 𝑥̅1 . Clearly,
𝑥̅1 need not be equal to 𝜇𝑋 = 7. It will most likely be
close to 7 but can fall on either side of it.
A first question is then as follows: How
sure are we that the estimate 𝑥̅1 is close to
the true value 𝜇𝑋 ? Now, we can also repeat the same
experiment, that is, throw two dice 100 times, again
and again. At the end of each experiment, we
obtain another value of the estimate, 𝑥̅1 , 𝑥̅2 , … and so
on. Here is a second question that follows: How can we
describe the random behavior of these estimates? The
answer will show a surprising fact about the
distribution of 𝑥̅ .
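
The repeated experiment described above is easy to simulate. The sketch below is only an illustration: the seed and the number of repetitions (2000) are arbitrary choices that do not come from the notes, and the function name is hypothetical.

```python
import random
from statistics import mean, stdev


def average_of_two_dice(n_throws, rng):
    """Throw two fair dice n_throws times and return the average of the sums."""
    return mean(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_throws))


rng = random.Random(42)                        # fixed seed so the run is reproducible
# Each entry is one realisation of the sample average (one experiment of 100 throws).
sample_means = [average_of_two_dice(100, rng) for _ in range(2000)]

print(f"average of the sample means: {mean(sample_means):.3f}")
print(f"spread of the sample means:  {stdev(sample_means):.3f}")
```

The individual sample averages scatter on both sides of 7, while their overall average sits very close to it. Their spread is much smaller than that of a single throw, whose standard deviation is about 2.42.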
