3 Module Notes

The document discusses the Central Limit Theorem (CLT) and confidence intervals in the context of estimating population means from sample data. It emphasizes that the sample mean is an unbiased estimator of the population mean and that the distribution of the sample mean approaches normality as sample size increases. Additionally, it explains how to construct confidence intervals for both population means and proportions, highlighting the use of the t-distribution when the population standard deviation is unknown.

MAST 6474 Introduction to Data Analysis I

Central Limit Theorem and Confidence Intervals

Sampling: Using Data to Estimate the Population Mean

Until now, we have been given complete information about each random variable. We have been told
its distribution along with the “true” mean and variance of the random variable, even if some
calculations were required.

In almost every problem that we encounter in business analysis, however, we do not know the
distribution of the random variable(s) of interest. Yet we can draw a sample from the random
variable(s), then use that sample data to learn about the variable’s true expected value or mean. We
will illustrate this approach with an example.

Example: Hail — Dents from Above! (source: Jon Danklefs, SMU MBA Class 45D). In March of
2000, a hailstorm ripped through North Texas and did serious damage to a number of homes and
businesses. Kia Motor’s Midlothian distribution center was particularly hard hit. Nearly 5000 vehicles
were hail-damaged.

The distributor immediately authorized a local dent-removal service to repair up to 200 damaged
vehicles. Each car was fixed panel by panel using a paintless dent removal process, and a detailed
invoice was prepared with documented repair costs. After fixing 180 vehicles to the distributor’s
satisfaction, it was mutually agreed that the remaining cars should be repaired using a fixed rate per
car.

Jon Danklefs was responsible for negotiating the repair rate per car. He already had a sample of 180
cars whose repair costs were known. The repair costs for all of these vehicles are contained in the file
Module 3 Notes Dataset 1 found in your Student Resources folder.

Using information from the sample of 180 repaired vehicles, come up with an estimate of the true
expected cost for damaged vehicles.

Solution: We begin by noting that we know nothing about the distribution of repair costs for the
nearly 5000 cars that were damaged. Taken together, these cars represent the population of
interest. We gather information that will help us learn about the true expected cost by selecting a
random sample of damaged cars. Specifically, we choose cars sequentially (one-by-one) for the
sample, with every car not yet included in the sample having an equal probability of being chosen.
Copyright Edward Fox and John Semple 2019

Assuming that 180 cars were selected this way, we have a random sample “without replacement”
from a finite population. If each one of the damaged vehicles had an equal probability of selection,
even if it was already selected for the sample, we would have a random sample “with replacement.”
When the random sample is small compared to the population (less than 5%), this distinction is not
critical. Assuming that we are drawing a relatively small sample from a virtually infinite (i.e., large)
population simplifies the formulas we use.¹ Choosing 180 damaged cars from a population of nearly
5000 justifies this assumption. It will be our standard assumption throughout the course.

The second part of our solution requires that we convert the repair costs of the sample of damaged
cars into an estimate of repair costs for the population. Assuming that every damaged vehicle has the
same probability of appearing in the sample, we can use the average repair cost of vehicles in the
sample:

x̄ = (x_1 + x_2 + x_3 + ⋯ + x_178 + x_179 + x_180) / 180 = $215.67
The computed value x̄ is called the sample mean; we will use the sample mean to estimate the true mean cost, μcost, of repair for the entire population of nearly 5000 damaged cars.
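As a quick illustration, the sample-mean calculation can be reproduced in a few lines of Python. The repair costs below are made-up placeholder values, not figures from the actual Module 3 Notes Dataset 1:

```python
import statistics

# Hypothetical repair costs (illustrative only; NOT the real 180-car dataset)
repair_costs = [180.00, 250.50, 199.99, 310.25, 145.75, 220.00]

# The sample mean is just the sum divided by the sample size n
x_bar = sum(repair_costs) / len(repair_costs)

# statistics.mean computes the same quantity
assert abs(x_bar - statistics.mean(repair_costs)) < 1e-9
print(f"sample mean = ${x_bar:.2f}")
```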

How good an estimate of μcost is $215.67? Based on the sample mean, the Kia distributor could
refuse to pay a fixed price higher than $215.67 per car. But μcost could actually be higher than
$215.67, in which case $215.67 might be a good deal for Kia. On the other hand, μcost is just as likely
to be lower than $215.67, in which case $215.67 might be a bad deal for Kia. We will never know for
certain how close $215.67 is to μcost because the true cost of repairs for the remaining vehicles will
never be documented. But we can determine, probabilistically, how close we are.

To understand how this is done, imagine we were to draw another random sample of size n = 180 and compute another sample estimate of the true mean repair cost. How likely is this second estimate to be exactly $215.67? What if we took a third random sample of size n = 180? We will think of our original estimate x̄ = $215.67 as a single draw from a random variable that is the average repair cost for any sample of 180 cars. In the language of statistics, we are interested in the distribution of the estimator

X̄ = (X_1 + X_2 + X_3 + ⋯ + X_178 + X_179 + X_180) / 180

Capital X̄ will be used to represent this estimator, a random variable for the mean of a sample of size 180; lower case x̄ will be used to represent an estimate, the computed mean of a particular sample.

¹ More complicated formulas are needed for the case of small finite populations from which samples are drawn without replacement.

It turns out that we know quite a lot about the distribution of the estimator X̄, which we will call the sampling distribution. In fact, we know enough to make precise statements about how close the computed sample mean x̄ = $215.67 is to the true (but unknown) population mean μcost. Specifically, we know that

1. The expected value of X̄ is E(X̄) = μcost. In simple terms, this means the mean of the sampling distribution is exactly the same as the true (unknown) mean of the population, the value that the sample was drawn to estimate. Estimators that have this property are said to be unbiased. On the other hand, this is like knowing that a manufacturing machine makes parts that are "correct on average" without knowing what that average actually is.

2. The distribution of X̄ is approximately Normal with a mean of μcost and a variance of σ²/n, where σ² is the true (but unknown) variance of the population we are sampling from. This is not at all obvious and is a consequence of the Central Limit Theorem (CLT).

We will focus on the second fact. The reason that this fact is such a profound result is that it doesn’t
even depend on the distribution of the underlying population we are sampling from.

The Central Limit Theorem

The Central Limit Theorem (CLT) holds that, if the sample size n is not too small, the distribution of the estimator X̄, the sample mean, is approximately Normal. Note that the live session exercise highlights two other important points: (1) that the sample mean is unbiased and (2) that the dispersion of the sample mean decreases as the sample size n increases.

Let (X_1, X_2, ⋯, X_n) be a random sample from any infinite population with mean μ and variance σ². As n becomes large, the distribution of

X̄ = (X_1 + X_2 + ⋯ + X_n) / n

is approximately Normal with mean μ and variance σ²/n; in our notation, X̄ ∼ N(μ, σ²/n), approximately. The square root of the variance, σ/√n, is known as the standard error of the mean.

Remarkably, the Central Limit Theorem approximation doesn't require the population of the random variable X to have any particular distribution. If the underlying random variable is Normally distributed, however, then the distribution of X̄ is exactly Normal (for any sample size n), and we can dispense with the

word “approximately.” This follows directly from the combination rule for independent Normal random
variables. How big does n need to be for this approximation to be very precise? There is general
agreement that n = 100 is big enough, although (as the exercise showed) n = 30 provides a pretty
good approximation. Nevertheless, the bigger the sample size, the better the approximation.

Using the Central Limit Theorem often means converting X̄ to a standard Normal. The standardized version of the Central Limit Theorem is

(X̄ − μ) / (σ/√n) ∼ N(0, 1), approximately,

for large n.
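To see the theorem in action, the short simulation below (a sketch, not part of the original notes) draws repeated samples from a decidedly non-Normal population, an exponential distribution, and checks that the sample means behave as the CLT predicts:

```python
import random
import statistics

random.seed(42)  # for reproducibility

# Exponential population with mean mu = 1; for an exponential, sigma = mu,
# so the CLT predicts X-bar ~ N(1, 1/n), i.e. a standard error of 1/sqrt(n).
mu, n, trials = 1.0, 100, 2000
means = [statistics.mean(random.expovariate(1 / mu) for _ in range(n))
         for _ in range(trials)]

print(round(statistics.mean(means), 3))   # should be close to mu = 1.0
print(round(statistics.stdev(means), 3))  # should be close to 1/sqrt(100) = 0.1
```

Histogramming `means` would show the familiar bell shape even though the underlying exponential population is heavily skewed.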

General Confidence Interval

In statistics, it is often useful to construct a confidence interval for a population parameter (we will focus on the mean, μ): an interval that, with some stated level of confidence, contains the true population parameter. Unfortunately, even after calculating the confidence interval, we never really know whether the true population parameter is within the interval or not.

Based on the Central Limit Theorem and the standard Normal distribution (Z), the following probability
statement is approximately true for a large enough sample n:

Pr( −z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2} ) = 1 − α

where the "cutoff" value z_{α/2} from the standard Normal distribution is chosen so that the area under the curve above z_{α/2} in the upper tail is α/2 and the area under the curve below −z_{α/2} in the lower tail is also α/2.


[Figure: the standard Normal density, with the area above z_{α/2} and the area below −z_{α/2} each exactly α/2.]

Q: How do you compute the value of z_{α/2} needed here?
A: Take the given α and compute either z_{α/2} = −NORM.INV(α/2, 0, 1) or, equivalently, z_{α/2} = NORM.INV(1−α/2, 0, 1).

The equation above can be rewritten as

Pr( X̄ − z_{α/2} · σ/√n < μ < X̄ + z_{α/2} · σ/√n ) = 1 − α

a probability statement about the true population mean, μ. This probability statement shows that the interval X̄ ± z_{α/2} · σ/√n contains the true population mean, μ, 100(1−α)% of the time. If we replace X̄ with an observed sample mean x̄, we get the interval

x̄ ± z_{α/2} · σ/√n

which is called a 100(1−α)% confidence interval for the mean, where confidence is expressed as a percentage. If the underlying population that we draw from is Normally distributed, the confidence interval is said to be an exact 100(1−α)% confidence interval. Otherwise we say it is an approximate 100(1−α)% confidence interval. α is the probability that the true mean falls outside the confidence interval; α is always small, usually .05 or less.
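The algebra above translates directly into code. The sketch below uses Python's standard-library `statistics.NormalDist` (its `inv_cdf` plays the role of Excel's NORM.INV); the numbers plugged in at the bottom, including the population standard deviation σ = 80, are hypothetical:

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI for the mean when the population sigma is (unrealistically) known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}; about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)        # z_{alpha/2} * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical numbers: x-bar = $215.67, assumed sigma = $80, n = 180
lo, hi = z_confidence_interval(215.67, 80.0, 180)
print(f"95% CI: (${lo:.2f}, ${hi:.2f})")
```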

In practice, we almost never know the true population standard deviation, σ (or equivalently σ 2). We
address this missing parameter differently when calculating confidence intervals for proportions than
when calculating confidence intervals for other parameters. We will discuss proportions first, then the
more general case.


Confidence Interval for a Population Proportion

We constantly see TV, online, and newspaper polls that report sample statistics about the percentage of people who believe something, suffer from some condition, would vote for some candidate, and so forth. The next time you see a poll on TV, glance at the bottom of the screen. You will typically see the poll's "margin of error" reported; this is the sampling error of the estimate. More specifically, this is the uncertainty about the true population proportion because it is being estimated from a sample, not determined from the entire population.

Suppose you want to determine the percentage of American households that own firearms. If you take a random sample of n people, a certain sample percentage will own guns. How does this sample percentage compare to the true population percentage? Let p denote the true population percentage and let p̂ denote the percentage estimated from the sample. Each individual in the sample represents a random draw from a population with a Bernoulli distribution:

X                      Probability
1 (Own gun)            p
0 (Do not own gun)     1 − p

From our discussion of the Bernoulli distribution, recall that the expected value is p and the variance is p(1−p). Substituting these values into the confidence interval formula, we find that a 100(1−α)% confidence interval for p is

p̂ ± z_{α/2} · √( p(1−p)/n )

The quantity that is added to / subtracted from p̂ is known as the margin of error. Observe the unknown population proportion p under the square root sign, a reflection of the fact that we don't know the true population variance. We replace the population proportion p under the square root sign with the sample proportion p̂. This approximation is acceptable if np̂ ≥ 5 and n(1−p̂) ≥ 5.

Confidence Interval for a Population Proportion

A 100(1−α)% confidence interval for the true population proportion p is given by

p̂ ± z_{α/2} · √( p̂(1−p̂)/n )

where p̂ is the sample proportion.
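A sketch of this calculation in Python (the 320-out-of-1000 poll result below is invented for illustration):

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    p_hat = successes / n
    # The Normal approximation is acceptable only if n*p_hat >= 5 and n*(1 - p_hat) >= 5
    assert n * p_hat >= 5 and n * (1 - p_hat) >= 5, "sample too small"
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # the poll's "margin of error"
    return p_hat - margin, p_hat + margin

# Made-up poll: 320 of 1000 sampled households report owning a firearm
lo, hi = proportion_ci(320, 1000)
print(f"95% CI for p: ({lo:.3f}, {hi:.3f})")
```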


Confidence Intervals for a Population Mean (Not a Proportion)

When not dealing with proportions, we use a more general approximation for the unknown population standard deviation σ (or equivalently, the population variance σ²). Given a random sample of size n from the population, (X_1, X_2, ⋯, X_n), we will use the sample variance as an estimate for the population variance. The sample variance is denoted s² and given by the formula

s² = (1/(n−1)) · Σ_{i=1}^{n} (x_i − x̄)²

Note that the sample variance calculation divides the sum of squared differences from the sample mean by n − 1; it is therefore not the average squared distance from the sample mean, x̄.² When the sample size is small (e.g., n = 3, 4, or 5), the result is very different from the average; however, when the sample size is large (e.g., n > 100), the result is very close to the average. The sample variance can be computed in Excel with the function VAR.S(number1, number2, …). The sample standard deviation is denoted by s and can be computed with the function STDEV.S(number1, number2, …). Of course, s can also be calculated by taking the square root of the sample variance.
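The n − 1 divisor is easy to verify in code. This sketch mirrors Excel's VAR.S / VAR.P pair using Python's standard library:

```python
import statistics

data = [4.0, 7.0, 13.0, 16.0]   # tiny illustrative sample
n = len(data)
x_bar = sum(data) / n            # sample mean = 10.0

# Sample variance: divide the sum of squared deviations by n - 1 (like VAR.S)
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
assert s2 == statistics.variance(data)   # 90 / 3 = 30.0

# Dividing by n instead gives the population variance (like VAR.P)
assert statistics.pvariance(data) == s2 * (n - 1) / n   # 90 / 4 = 22.5

# Sample standard deviation (like STDEV.S) is the square root of s2
assert abs(statistics.stdev(data) - s2 ** 0.5) < 1e-12
```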

t distribution

Remember that our initial expression for the confidence interval assumed that the population
standard deviation σ is known. Usually, however, σ is unknown. It is tempting to simply insert the
sample standard deviation s in its place. We can do this, but it requires us to use a variation of the
standard Normal distribution called Student’s t, or just the t distribution.

If a random sample of size n is drawn from a Normal distribution, then

(X̄ − μ) / (s/√n) follows a t distribution with n − 1 degrees of freedom (df).

Note that the denominator includes an s instead of a σ . “Degrees of freedom,” or df, does not need to
be estimated—it depends on the sample size n. The t distribution is symmetric about 0 and looks a lot
like the standard Normal distribution (especially as the degrees of freedom become large) as shown
in the applet: http://www.stat.uiowa.edu/~mbognar/applets/t.html.

² Dividing by n − 1 rather than n is known as Bessel's correction.


Confidence Interval using the t Distribution

If we continue to assume that the sampling distribution is Normal (or approximately Normal), then we can construct a 100(1−α)% confidence interval for the mean using the following formula:

x̄ ± t_{α/2,df} · s/√n

Recall that the right-most quantity in this formula, s/√n (equivalently written √(s²/n)), is known as the standard error; we will use this term frequently throughout the remainder of the course. The expression on the right side of the ± sign, t_{α/2,df} · s/√n, is known as the margin of error.
We can use Excel to calculate confidence intervals. The function CONFIDENCE.T(α, s, n) computes the margin of error, where 100(1 − α)% is the confidence level, s is the sample standard deviation, and n is the sample size. To calculate the confidence interval, we simply add/subtract the margin of error to/from the sample mean.
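In code, the only new ingredient is the t critical value t_{α/2,df}. Real work would use Excel's CONFIDENCE.T (or a statistics library such as scipy.stats.t); the sketch below instead builds the quantile from scratch by numerically integrating the t density and bisecting, just to make the mechanics visible. The numbers at the bottom (x̄ = 100, s = 10, n = 81) are hypothetical:

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=4000):
    """CDF via trapezoidal integration from 0 to |x|, using symmetry about 0."""
    if x < 0:
        return 1.0 - t_cdf(-x, df, steps)
    h = x / steps
    area = sum(0.5 * h * (t_pdf(i * h, df) + t_pdf((i + 1) * h, df))
               for i in range(steps))
    return 0.5 + area

def t_quantile(p, df):
    """Bisection for the value q with t_cdf(q, df) = p (requires p >= 0.5)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_confidence_interval(x_bar, s, n, confidence=0.95):
    t = t_quantile(1 - (1 - confidence) / 2, n - 1)  # t_{alpha/2, df}
    margin = t * s / math.sqrt(n)  # this margin is what CONFIDENCE.T returns
    return x_bar - margin, x_bar + margin

# Hypothetical numbers: x-bar = 100, s = 10, n = 81 (so df = 80)
lo, hi = t_confidence_interval(100.0, 10.0, 81)
```

Notice that t_{0.025,80} comes out near 1.99, slightly wider than the z value of 1.96; the t distribution's fatter tails are the price paid for estimating σ with s.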

Example: How Rough Is Right? (Brent Pope, SMU MBA Class 46P). Airco has inspected 81
armature hubs and measured their roughness. The data is provided in the file Module 3 Notes
Dataset. Calculate a 95% confidence interval for the mean roughness. Calculate a 98% confidence
interval for the mean roughness.

