Applications of Central Limit Theorem


Chapter 8 Sampling Distribution of the Mean

sampling distribution is related to the population standard deviation $\sigma$, there is less variability
among the sample means than there is among individual observations. Even if a
particular sample contains one or two extreme values, it is likely that these values will be
offset by the other measurements in the group. Thus, as long as n is greater than 1, the
standard error of the mean is always smaller than the standard deviation of the popula-
tion. In addition, as n increases, the amount of sampling variation decreases. Finally, if n
is large enough, the distribution of sample means is approximately normal. This remark-
able result is known as the central limit theorem; it applies to any population with a finite
standard deviation, regardless of the shape of the underlying distribution [2]. The farther
the underlying population departs from being normally distributed, however, the larger
the value of n that is necessary to ensure the normality of the sampling distribution. If the
underlying population is itself normal, samples of size 1 are large enough. Even if the
population is bimodal or noticeably skewed, a sample of size 30 is often sufficient.
The central limit theorem is very powerful. It holds true not only for serum cho-
lesterol levels, but for almost any other type of measurement as well. It even applies to
discrete random variables. The central limit theorem allows us to quantify the uncer-
tainty inherent in statistical inference without having to make a great many assumptions
that cannot be verified. Regardless of the distribution of X, because the distribution of
the sample means is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$,
we know that if n is large enough,

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is normally distributed with mean 0 and standard deviation 1. We have simply standardized
the normal random variable $\bar{X}$ in the usual way. As a result, we can use tables
of the standard normal distribution, such as Table A.3 in Appendix A, to make inference
about the value of a population mean.
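
The behavior described above can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not in the text: it uses an exponential population with mean 1 and standard deviation 1 (a noticeably skewed distribution), a fixed random seed, and the hypothetical helper names `simulate_sample_means` and `summarize`. For each sample size it compares the mean and standard deviation of the simulated sample means with the values the theorem predicts, 1 and $1/\sqrt{n}$.

```python
import random
import statistics

def simulate_sample_means(n, reps=10_000, seed=0):
    """Draw `reps` samples of size n from an exponential population
    (mean 1, standard deviation 1; an assumed example population)
    and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

def summarize(n):
    """Return (mean, standard deviation) of the simulated sample means."""
    means = simulate_sample_means(n)
    return statistics.fmean(means), statistics.stdev(means)

for n in (1, 5, 30):
    mean_of_means, sd_of_means = summarize(n)
    predicted_se = 1 / n ** 0.5  # sigma / sqrt(n) with sigma = 1
    print(n, round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_se, 3))
```

With this setup the simulated standard deviation of the means should track $1/\sqrt{n}$ closely, shrinking as n grows, even though the individual observations are far from normal.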

8.3 Applications of the Central Limit Theorem

Consider the distribution of serum cholesterol levels for all 20- to 74-year-old males
living in the United States. The mean of this population is $\mu = 211$ mg/100 ml, and the
standard deviation is $\sigma = 46$ mg/100 ml [3]. If we select repeated samples of size 25
from the population, what proportion of the samples will have a mean value of 230 mg/
100 ml or above?
Assuming that a sample of size 25 is large enough, the central limit theorem states
that the distribution of means of samples of size 25 is approximately normal with mean
$\mu = 211$ mg/100 ml and standard error $\sigma/\sqrt{n} = 46/\sqrt{25} = 9.2$ mg/100 ml. This sampling
distribution and the underlying population distribution are shown in Figure 8.1.

FIGURE 8.1
Distributions of individual values and means of samples of size 25 for the serum
cholesterol levels of 20- to 74-year-old males, United States, 1976-1980
(horizontal axis: serum cholesterol level, mg/100 ml)

Note that

$$Z = \frac{\bar{X} - 211}{9.2}$$

is a standard normal random variable. If $\bar{x} = 230$, then

$$z = \frac{230 - 211}{9.2} = 2.07.$$

Consulting Table A.3, we find that the area to the right of z = 2.07 is 0.019. Only about
1.9% of the samples will have a mean greater than 230 mg/100 ml. Equivalently, if we
select a single sample of size 25 from the population of 20- to 74-year-old males, the
probability that the mean serum cholesterol level for this sample is 230 mg/100 ml or
higher is 0.019.
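
The table lookup above can also be reproduced numerically. The following is a minimal sketch that uses Python's standard-library `statistics.NormalDist` in place of Table A.3; the variable names are illustrative and not from the text.

```python
from statistics import NormalDist

mu, sigma, n = 211, 46, 25           # population mean, sd, and sample size from the text
standard_error = sigma / n ** 0.5    # 46 / 5 = 9.2

# Standardize the sample mean of 230 and take the area to its right.
z = (230 - mu) / standard_error
upper_tail = 1 - NormalDist().cdf(z)

# z is about 2.07 and the tail area about 0.019, in line with the text.
print(round(z, 2), round(upper_tail, 3))
```

Because the code keeps z unrounded, the tail area can differ from the table value in the fourth decimal place; to three decimals the two agree.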
What mean value of serum cholesterol level cuts off the lower 10% of the sampling
distribution of means? Locating 0.100 in the body of Table A.3, we see that it corresponds
to the value z = -1.28. Solving for $\bar{x}$,

$$z = -1.28 = \frac{\bar{x} - 211}{9.2}$$

and

$$\bar{x} = 211 + (-1.28)(9.2) = 199.2.$$

Therefore, approximately 10% of the samples of size 25 have means that are less than
or equal to 199.2 mg/100 ml.
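
Going from a probability back to a cutoff is an inverse-CDF (quantile) calculation. As a hedged sketch, the same 10% cutoff can be obtained with the standard library's `NormalDist.inv_cdf`, treating the sampling distribution as exactly normal with mean 211 and standard error 9.2; the names `sampling_dist` and `cutoff` are illustrative.

```python
from statistics import NormalDist

# Sampling distribution of the mean for samples of size 25 (values from the text).
sampling_dist = NormalDist(mu=211, sigma=46 / 25 ** 0.5)

# The value below which 10% of the sample means fall.
cutoff = sampling_dist.inv_cdf(0.10)
print(round(cutoff, 1))
```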

Let us now calculate the upper and lower limits that enclose 95% of the means of
samples of size 25 drawn from the population. Since 2.5% of the area under the standard
normal curve lies above z = 1.96 and another 2.5% lies below z = -1.96,

$$P(-1.96 \le Z \le 1.96) = 0.95.$$

Thus, we are interested in outcomes of Z for which

$$-1.96 \le Z \le 1.96.$$

We would like to transform this inequality into a statement about $\bar{X}$. Substituting
$(\bar{X} - 211)/9.2$ for Z,

$$-1.96 \le \frac{\bar{X} - 211}{9.2} \le 1.96.$$

Multiplying all three terms of the inequality by 9.2 and adding 211 results in

$$211 - 1.96(9.2) \le \bar{X} \le 211 + 1.96(9.2),$$

or

$$193.0 \le \bar{X} \le 229.0.$$

This tells us that approximately 95% of the means of samples of size 25 lie between
193.0 mg/100 ml and 229.0 mg/100 ml. Consequently, if we select a random sample of
size 25 that is reported to be from the population of serum cholesterol levels for all 20-
to 74-year-old males, and the sample has a mean that is either greater than 229.0 or less
than 193.0 mg/100 ml, we should be suspicious of this claim. Either the random sam-
ple was actually drawn from a different population or a rare event has taken place. For
the purposes of this discussion, a "rare event" is defined as an outcome that occurs less
than 5% of the time.
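
The "rare event" rule just described can be sketched as a small check. This is only an illustration: the function names `central_interval` and `is_rare` are hypothetical, and the code assumes the sampling distribution is exactly normal with the population values given in the text.

```python
from statistics import NormalDist

def central_interval(mu, sigma, n, coverage=0.95):
    """Symmetric limits that enclose `coverage` of the means of samples of size n."""
    se = sigma / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95%
    return mu - z * se, mu + z * se

def is_rare(sample_mean, mu=211, sigma=46, n=25, coverage=0.95):
    """True if the sample mean falls outside the central interval."""
    lower, upper = central_interval(mu, sigma, n, coverage)
    return not (lower <= sample_mean <= upper)

lower, upper = central_interval(211, 46, 25)
print(round(lower, 1), round(upper, 1))
print(is_rare(232), is_rare(220))
```

A sample of size 25 with a mean of 232 mg/100 ml would be flagged, while one with a mean of 220 mg/100 ml would not.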
Suppose we had selected samples of size 10 from the population rather than samples
of size 25. In this case, the standard error of $\bar{X}$ would be $46/\sqrt{10} = 14.5$ mg/100 ml,
and we would construct the inequality

$$-1.96 \le \frac{\bar{X} - 211}{14.5} \le 1.96.$$

The upper and lower limits that enclose 95% of the means would be

$$182.5 \le \bar{X} \le 239.5.$$
Note that this interval is wider than the one calculated for samples of size 25. We expect
the amount of sampling variation to increase as the sample size decreases. Drawing
samples of size 50 would result in upper and lower limits

$$198.2 \le \bar{X} \le 223.8;$$

not surprisingly, this interval is narrower than the one constructed for samples of size
25. Samples of size 100 produce the limits

$$202.0 \le \bar{X} \le 220.0.$$

In summary, if we include the case for which n = 1, we have the following results:

  n     $\sigma/\sqrt{n}$    Interval Enclosing 95% of the Means    Length of Interval

  1     46.0                 $120.8 \le \bar{X} \le 301.2$          180.4
  10    14.5                 $182.5 \le \bar{X} \le 239.5$           57.0
  25     9.2                 $193.0 \le \bar{X} \le 229.0$           36.0
  50     6.5                 $198.2 \le \bar{X} \le 223.8$           25.6
  100    4.6                 $202.0 \le \bar{X} \le 220.0$           18.0
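
The summary table can be regenerated with a short loop. The sketch below is an assumption-laden illustration: it uses the unrounded normal quantile (about 1.96) from `statistics.NormalDist`, and the helper name `interval_row` is hypothetical. Because the lengths are computed from unrounded limits, they can differ from the table in the last digit.

```python
from statistics import NormalDist

MU, SIGMA = 211, 46                   # population mean and sd from the text
Z_975 = NormalDist().inv_cdf(0.975)   # about 1.96

def interval_row(n):
    """Return (n, standard error, lower limit, upper limit, length)."""
    se = SIGMA / n ** 0.5
    lower, upper = MU - Z_975 * se, MU + Z_975 * se
    return n, se, lower, upper, upper - lower

rows = [interval_row(n) for n in (1, 10, 25, 50, 100)]
for n, se, lower, upper, length in rows:
    print(f"{n:>4} {se:6.1f} {lower:7.1f} {upper:7.1f} {length:7.1f}")
```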

As the size of the samples increases, the amount of variability among the sample
means, quantified by the standard error $\sigma/\sqrt{n}$, decreases; consequently, the limits
encompassing 95% of these means move closer together. The length of an interval is simply
the upper limit minus the lower limit.
Note that all the intervals we have constructed have been symmetric about the
population mean 211 mg/100 ml. Clearly, there are other intervals that would also capture
the appropriate proportion of the sample means. Suppose that we again wish to construct
an interval that contains 95% of the means of samples of size 25. Since 1% of the
area under the standard normal curve lies above z = 2.32 and 4% lies below z = -1.75,
we know that

$$P(-1.75 \le Z \le 2.32) = 0.95.$$

As a result, we are interested in the outcomes of Z for which

$$-1.75 \le Z \le 2.32.$$

Substituting $(\bar{X} - 211)/9.2$ for Z, we find the interval to be

$$194.9 \le \bar{X} \le 232.3.$$

Therefore, we are able to say that approximately 95% of the means of samples of size 25
lie between 194.9 mg/100 ml and 232.3 mg/100 ml. It is usually preferable to construct
a symmetric interval, however, primarily because it is the shortest interval that captures
the appropriate proportion of the means. (An exception to this rule is the one-sided
interval; we return to this special case below.) In this example, the asymmetrical interval
has length 232.3 - 194.9 = 37.4 mg/100 ml; the length of the symmetric interval is
229.0 - 193.0 = 36.0 mg/100 ml.
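
The claim that the symmetric interval is the shortest can be explored numerically. The following sketch is illustrative only: the helper `interval_with_lower_tail` is a hypothetical name, and it uses unrounded normal quantiles, so its limits differ slightly from those obtained with the table values -1.75 and 2.32. It builds 95% intervals that leave different amounts of area in the lower tail and compares their lengths.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
MU, SE = 211, 9.2   # mean and standard error for samples of size 25 (from the text)

def interval_with_lower_tail(lower_tail, coverage=0.95):
    """Interval enclosing `coverage` of the means, leaving `lower_tail` area below it."""
    z_low = STD_NORMAL.inv_cdf(lower_tail)
    z_high = STD_NORMAL.inv_cdf(lower_tail + coverage)
    return MU + z_low * SE, MU + z_high * SE

# Lower-tail areas from 0.5% to 4.5%; 2.5% gives the symmetric interval.
tails = [0.005, 0.01, 0.02, 0.025, 0.03, 0.04, 0.045]
lengths = {}
for tail in tails:
    low, high = interval_with_lower_tail(tail)
    lengths[tail] = high - low
    print(tail, round(low, 1), round(high, 1), round(lengths[tail], 1))

shortest_tail = min(lengths, key=lengths.get)
print(shortest_tail)
```

Among the lower-tail areas tried here, the 2.5% split gives the shortest interval, in line with the statement above.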
We now move on to a slightly more complicated question: How large would the
samples need to be for 95% of their means to lie within $\pm 5$ mg/100 ml of the population
mean $\mu$? To answer this, it is not necessary to know the value of the parameter $\mu$. We simply
find the sample size n for which

$$P(\mu - 5 \le \bar{X} \le \mu + 5) = 0.95,$$



or

$$P(-5 \le \bar{X} - \mu \le 5) = 0.95.$$

To begin, we divide all three terms of the inequality by the standard error
$\sigma/\sqrt{n} = 46/\sqrt{n}$; this results in

$$P\left(\frac{-5}{46/\sqrt{n}} \le \frac{\bar{X} - \mu}{46/\sqrt{n}} \le \frac{5}{46/\sqrt{n}}\right) = 0.95.$$

Since Z is equal to $(\bar{X} - \mu)/(46/\sqrt{n})$,

$$P\left(\frac{-5}{46/\sqrt{n}} \le Z \le \frac{5}{46/\sqrt{n}}\right) = 0.95.$$

Recall that 95% of the area under the standard normal curve lies between z = -1.96
and z = 1.96. Therefore, to find the sample size n, we could use the upper bound of the
interval and solve the equation

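$$z = 1.96 = \frac{5}{46/\sqrt{n}};$$

equivalently, we could use the lower bound and solve

$$z = -1.96 = \frac{-5}{46/\sqrt{n}}.$$

Taking

$$1.96 = \frac{5}{46/\sqrt{n}} = \frac{5\sqrt{n}}{46}$$

and multiplying both sides of the equation by 46/5, we find that

$$\sqrt{n} = \frac{1.96(46)}{5}$$

and

$$n = \left[\frac{1.96(46)}{5}\right]^2 = 325.2.$$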
When we deal with sample sizes, it is conventional to round up. Therefore, samples of
size 326 would be required for 95% of the sample means to lie within $\pm 5$ mg/100 ml
of the population mean $\mu$. Another way to state this is that if we select a sample of size
326 from the population and calculate its mean, the probability that the sample mean is
within $\pm 5$ mg/100 ml of the true population mean $\mu$ is 0.95.
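
The sample-size derivation above reduces to $n = (z\sigma/d)^2$, rounded up, where d is the desired margin. A minimal sketch follows; `required_sample_size` and `coverage_within` are hypothetical helper names, and the code uses the unrounded quantile rather than 1.96.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin, coverage=0.95):
    """Smallest n for which `coverage` of the sample means lie within
    +/- `margin` of the population mean, assuming a normal sampling distribution."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)   # about 1.96 for 95%
    return math.ceil((z * sigma / margin) ** 2)

def coverage_within(sigma, margin, n):
    """Probability that the sample mean lies within +/- `margin` of the population mean."""
    z = margin / (sigma / n ** 0.5)
    return NormalDist().cdf(z) - NormalDist().cdf(-z)

n_needed = required_sample_size(sigma=46, margin=5)
print(n_needed)
print(round(coverage_within(46, 5, n_needed), 4))
```

Rounding up matters here: one observation fewer leaves the coverage just short of 0.95.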

Up to this point, we have focused on two-sided intervals: we have found the upper
and lower limits that enclose a specified proportion of the sample means. More specif-
ically, we have focused on symmetric intervals. In some situations, however, we are in-
terested in a one-sided interval instead. For example, we might wish to find the upper
bound for 95% of the mean serum cholesterol levels of samples of size 25. Since 95%
of the area under the standard normal curve lies below z = 1.645,

$$P(Z \le 1.645) = 0.95.$$

Consequently, we are interested in outcomes of Z for which

$$Z \le 1.645.$$

Substituting $(\bar{X} - 211)/9.2$ for Z produces

$$\frac{\bar{X} - 211}{9.2} \le 1.645,$$

or

$$\bar{X} \le 226.1.$$

Approximately 95% of the means of samples of size 25 lie below 226.1 mg/100 ml.
If we want to construct a lower bound for 95% of the mean serum cholesterol levels,
we focus on values of Z that lie above -1.645; in this case, we solve

$$\frac{\bar{X} - 211}{9.2} \ge -1.645$$

to find

$$\bar{X} \ge 195.9.$$

Approximately 95% of the means of samples of size 25 lie above 195.9 mg/100 ml.
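
Both one-sided bounds are single quantiles of the sampling distribution. As a brief sketch, assuming the sampling distribution is exactly normal with mean 211 and standard error 9.2 (the variable names are illustrative):

```python
from statistics import NormalDist

# Sampling distribution of the mean for samples of size 25 (values from the text).
sampling_dist = NormalDist(mu=211, sigma=9.2)

upper_bound = sampling_dist.inv_cdf(0.95)   # 95% of the means lie below this value
lower_bound = sampling_dist.inv_cdf(0.05)   # 95% of the means lie above this value
print(round(upper_bound, 1), round(lower_bound, 1))
```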
Always keep in mind that we must be cautious when making multiple statements
about the sampling distribution of the means. For samples of serum cholesterol levels
of size 25, we found that the probability is 0.95 that a sample mean lies within the
interval

(193.0, 229.0).

We also said that the probability is 0.95 that the mean lies below 226.1 mg/100 ml, and
0.95 that it is above 195.9 mg/100 ml. Although these three statements are correct
individually, they are not true simultaneously. The three events are not independent.
For all of them to occur at the same time, the sample mean would have to lie in the
interval

(195.9, 226.1);

the probability that this happens is not equal to 0.95.
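
The probability of that joint event can be computed directly. The sketch below, again assuming an exactly normal sampling distribution with mean 211 and standard error 9.2, evaluates the chance that a sample mean falls in (195.9, 226.1); the variable names are illustrative.

```python
from statistics import NormalDist

sampling_dist = NormalDist(mu=211, sigma=9.2)

# Probability that the sample mean lies between the two one-sided bounds.
p_joint = sampling_dist.cdf(226.1) - sampling_dist.cdf(195.9)
print(round(p_joint, 2))
```

Each one-sided bound excludes about 5% of the means in its own tail, so the interval between them holds roughly 90% of the means rather than 95%.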
