Chapter 5 - Random Sampling
CHAPTER 5
RANDOM SAMPLING AND
SAMPLING DISTRIBUTIONS
TABLE OF CONTENTS
1 Selection Bias and Sampling
Key Definitions, Representative Samples, Simple Random Samples
2 The 1970 Draft Lottery Example and More Sampling Concerns
Problems with Mechanical Selection Procedures, Miscellaneous Sampling Issues
3 Disastrous Sampling Stories: The 1936 Literary Digest
4 Disastrous Sampling Stories: The New Hite Report
5 Disastrous Sampling Stories: A Meaningless Poll
6 Sampling Distributions
Sampling Distribution of Sample Mean from a Normal Population, Probable Error
7 The Central Limit Theorem
Sampling from ANY population, How Large a Sample Do We Need?, Examples
8 Cents and the Central Limit Theorem – Ages of 800 Pennies
9 CLT through Simulation
10 Four (Straightforward) Questions on Sampling Distributions
11 Two Exercises in Sampling Distributions
Answer Sketches
• The population (or universe) is the entire collection of units (individuals, objects, or the
list of measurements) about which we would like information.
• The sample is the collection of units we will actually measure or the collection of
measurements we will actually obtain.
• The sampling frame is a list of units from which the sample is chosen. Ideally, it includes
the whole population.
• In a sample survey, measurements are taken on a sample from the population.
• A census is a survey in which the entire population is measured.
How a sample is selected from a population is of vital importance. Since statistics vary from
sample to sample, any inferences based on them will necessarily be subject to some uncertainty.
How, then, do we judge the reliability of a sample statistic as a tool in making an inference about
the corresponding population parameter?
First, some basic notions of sampling. We’d like the sample to be representative of the
population. If we select n elements as a sample from a population, how can we tell whether the
sample is representative? Sometimes it is hard to judge whether a sample is representative.
What other criteria could be used to select valid samples from a population of interest?
One of the simplest and most frequently used sampling procedures produces what is known as
a random sample. Suppose we take a sample of n elements from a population. If some
samples are more likely to be selected than others, then the sampling is not random. If our
method of sampling ensures that every possible combination of n elements in the population has
an equal chance of being selected, the n elements are a random sample. Strictly speaking, this
is a simple random sample.
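The definition above can be checked on a toy case. The sketch below assumes a hypothetical population of five labelled units and samples of size n = 2 (neither comes from the text) and counts how often each of the 10 possible combinations is drawn by Python's `random.sample`. If the method is a simple random sample, each combination should turn up about one tenth of the time.

```python
import itertools
import random
from collections import Counter

# Hypothetical population of 5 labelled units; samples of size n = 2.
population = ["A", "B", "C", "D", "E"]
n = 2

# Every possible combination of n units: C(5, 2) = 10 of them.
all_samples = list(itertools.combinations(population, n))

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = 100_000
counts = Counter(
    tuple(sorted(rng.sample(population, n))) for _ in range(draws)
)

# Under simple random sampling each combination should appear
# in roughly 1/10 of the draws.
for combo in all_samples:
    print(combo, round(counts[combo] / draws, 3))
```

A method that favored some combinations (for example, always pairing neighbors in the list) would show clearly unequal frequencies in the same tally.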
From 1948 through about 1969, men were drafted by age -- oldest first, starting with 25-year-olds.
On December 1, 1969, the Selective Service System conducted a lottery to determine
the order of selection for 1970. The 366 possible days in a year were written on slips of paper and
placed in egg-shaped capsules. The capsules were then drawn one by one to obtain the order
of induction. So the lottery assigned a rank to each of the 366 birthdays. Does this seem like a
fair method of determining the order of selection? Below is the summary of the data for the
resulting order. Does it appear to provide a fair order of selection?
[Figure: Time Series (Overlay) Plot of 1970 Lottery Data; x-axis: Order (0 to 350), y-axis: 0 to 350.]
1
Adapted from McClave & Benson, case study 3.2.
[Figure: Monthly Averages; x-axis: month (1 to 12), y-axis: average selection number, with a reference line at 183.5.]
If the lottery were truly random, then one would expect that the monthly means would be fairly
close to the average value of the selection order =
(1 + 2 + 3 + 4 + … +365+366)/366 = 183.5
and they would show no pattern from month to month.
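To see what "fairly close to 183.5 with no pattern" might look like, the sketch below computes the expected average rank and then simulates one fair lottery by shuffling the ranks 1 to 366 thoroughly. The month lengths, the seed and the function name are illustrative choices, not part of the original study.

```python
import random
from statistics import mean

# Expected average rank if every birthday is equally likely to get any rank.
expected_rank = sum(range(1, 367)) / 366   # 183.5

# Days per month when all 366 possible birthdays are included.
days_in_month = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def fair_lottery_monthly_means(rng):
    """Assign ranks 1..366 to birthdays at random; return the 12 monthly mean ranks."""
    ranks = list(range(1, 367))
    rng.shuffle(ranks)
    means, start = [], 0
    for d in days_in_month:
        means.append(mean(ranks[start:start + d]))
        start += d
    return means

rng = random.Random(1970)  # arbitrary seed for repeatability
monthly = fair_lottery_monthly_means(rng)
print(expected_rank)
print([round(m, 1) for m in monthly])
```

Running this with several seeds gives monthly means that scatter around 183.5 without a steady drift across the year, which is the benchmark against which the actual 1970 results can be compared.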
Why did a method that seemed so fair produce such a bias? The problem was with the bowl.
The January capsules had all been put in first, the February capsules were added next, and so
on. The capsules were then mixed but not thoroughly enough. There was still a tendency for
dates early in the year to be near the bottom of the bowl. Compounding the problem was that
the bowl’s compactness made it physically difficult to reach in and select a capsule from beneath
the topmost layers. As a result, capsules were more or less chosen from the top down. Since
the composition of the layers was not random because of the earlier inadequate mixing, the final
sequence contained the bias evident in the picture above.
Why Randomize?
Results from polls and other statistical studies reported in newspapers or magazines often
emphasize that the samples were randomly selected. Why the emphasis on randomization?
Couldn’t a good investigator do better by carefully choosing respondents to a poll so that
various interest groups were represented? Perhaps, but samples selected without objective
randomization tend to favor one part of the population over another. For example, polls
conducted by sports writers tend to favor the opinions of sports fans. This leaning toward one
side of an issue is called sampling or selection bias. In the long run, random samples seem to
do a good job of producing samples that fairly represent the population. In other words,
randomization reduces sampling bias. Note that random sampling can be tough. A random
sample is not a casual or haphazard sample. The target population must be carefully identified,
and an appropriate sampling frame must be selected.
Sample Results: Landon 57%, FDR 43%
Actual Election: FDR 62%, Landon 38%
What happened?
• Bad sampling frame -- Literary Digest subscribers were not representative of the
American voting population.
• Nonresponse bias -- the 76% of subscribers who didn’t return the survey were much more
likely to support FDR.
• Also, Landon’s supporters, desiring change, tended to be more vocal in their support of
their candidate than did FDR’s supporters.
Disasters:
Example 5.3. The New Hite Report -- Controversy over the Numbers
[Reference: Streitfeld, D. (1988). “Shere Hite and the trouble with numbers.” Chance, 1, 26-31.]
In 1968, researcher Shere Hite shocked conservative America with her now-famous “Hite
Report” on the permissive sexual attitudes of American men and women. Twenty years later,
Hite was surrounded by controversy again with her book, Women and Love: A Cultural
Revolution in Progress (Knopf Press, 1988). In this new Hite report, she reveals some
startling statistics describing how women feel about contemporary relationships:
Hite conducted the survey by mailing out 100,000 questionnaires to women across the country
over a 7-year period. Each questionnaire consisted of 127 open-ended questions, many with
numerous subquestions and follow-ups. Hite’s instructions read: “It is not necessary to answer
every question! Feel free to skip around and answer those questions you choose.”
Approximately 4,500 completed questionnaires were returned for a response rate of 4.5%, and
they form the data set from which these percentages were determined. Hite claims that these
4,500 women are a representative sample of all women in the United States, and therefore, the
survey results imply that vast numbers of women are “suffering a lot of pain in their love
relationships with men.” Many people disagree, however, saying that only unhappy women are
likely to take the time to answer Hite’s 127 essay questions, and thus her sample is
representative only of the discontented.
The views of several statisticians and expert survey researchers on the validity of Hite’s
“numbers” were presented in an article in Chance magazine (Summer 1988). A few of the
more critical comments follow.
• Hite used a combination of haphazard sampling and volunteer respondents to collect her
[data]. First, Hite sent questionnaires to a wide variety of organizations and asked them to
circulate the questionnaires to their members. She mentions that they included church groups,
women’s voting and political groups, women’s rights organizations and counseling and walk-in
centers for women. These groups seem not representative of women in general; there is an
over-representation of feminist groups and of women in troubled circumstances.... The use of
groups to distribute the questionnaires meant that gatekeepers had the power of assuring a
zero response rate by not distributing the questionnaire, or conversely of greatly stimulating
returns by endorsing the study in some fashion. Second, Hite also relied on volunteer
respondents who wrote in for copies of the questionnaire. These volunteers seem to have
been recruited from readers of her past books and those who saw interviews on television and
On February 18th, 1993, shortly after Bill Clinton became President of the United States, a
television station in Sacramento, California asked viewers to respond to the question: “Do you
support the President’s economic plan?” The next day the results of a properly conducted
study asking the same question were published in the newspaper.
As you can see, those who responded to the television poll were more likely to be those who
were dissatisfied with the President’s plan. Trying to extend those results to the general
Haphazard/Convenience Sampling
A few years ago, the student newspaper at a California university announced as a front page
headline: “Students ignorant, survey says.” The article explained that a “random survey”
indicated that American students were less aware of current events than international students
were. However, the article quoted the undergraduate researchers, who were international
students themselves, as saying that “the students were randomly sampled on the quad.” The
quad is an open air area where students relax, play frisbee, eat lunch, and so on. There is
simply no proper way to collect a random sample of students by selecting them in an area like
that. In such situations, the researchers are likely to approach people who they think will
support the results they intended for their survey. Or, they are likely to approach friendly
looking people who look like they will cooperate. This is called a haphazard sample, and it
cannot be expected to be representative.
Nonprobability Sampling
• Judgment Sample: Asking various newspaper food critics about which restaurants they
would recommend
• Quota Sample: If a school has 30% off-campus and 70% on-campus students, attempting
to get 30 off- and 70 on-campus students when doing a survey of 100 students
• Chunk Sample: If I want to know what students think of a textbook quickly, I could use
this class to represent all students using the book.
• The only way to make correct statistical inferences from a sample to a population is to use
probability sampling...
Wainer, Palmer and Bradlow [WPB] (1998) presented a series of situations in which the use of
nonrandom samples led or could lead to seriously incorrect conclusions.
2
This material is drawn from Wainer, H., Palmer, S. and Bradlow, E. (1998). “A selection of selection
anomalies.” Chance, 11(2): 3-7.
Asking Questions
Suppose we want an honest reading on the number of American citizens who cheat on their
taxes. How do we ask people if they cheat?
How you ask the question can seriously affect the answer…
1. Do you agree that unions can cause inconvenience and bad labor-management
relations?
2. Do you agree that unions have been important in securing employee rights and decent
pay?
A sample statistic is a random variable, and thus must be compared and judged on the basis of
its probability distribution. The probability distribution of a sample statistic is called its sampling
distribution.
The sampling distribution of a sample statistic (based on n observations) is the relative frequency
distribution of the values of the statistic theoretically generated by taking repeated random
samples of size n and computing the value of the statistic for each. To illustrate the difference
between the distribution of a random variable and the sampling distribution of a statistic of that
random variable, let’s consider the following example.
Example 5.5
Consider the population created by tossing a fair coin infinitely many times. Let the random variable X
take the value 1 if the coin shows tails and 2 if the coin shows heads. The probability distribution of X is
x 1 2
p(x) 0.5 0.5
[Figure: bar chart of the probability distribution of X; x-axis: value of X (1, 2), y-axis: probability.]
After reading Chapter 3 of this bulkpack you should be able to compute that the expected value of
X is 1.5 and the variance of X is 0.25.
Now suppose that you did not know the mean value of X. You could draw a sample (say of
size n = 4) from the population (by flipping the coin 4 times) and use the sample mean to estimate the
population mean. The following table summarizes all possible samples.
Each of these samples has an equal chance of coming up. Therefore, the probability that any one of
them occurs is 1/16. Note, however, that some of the sample mean values appear more than
once. For example, the sample mean 1.25 appears 4 times in this table. Hence, it is 4 times more
likely to occur than the sample mean 1, which appears only once. Therefore, the probability
that a sample mean of 1.25 will occur is 4 × 1/16 = 1/4 = 0.25. Repeating this argument for the other
values of the sample mean, we obtain the following probability distribution of the sample mean.
Sample mean 1 1.25 1.5 1.75 2
Probability 0.0625 0.25 0.375 0.25 0.0625
[Figure: bar chart of the probability distribution of the sample mean; x-axis: value of mean (1, 1.25, 1.5, 1.75, 2), y-axis: probability.]
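The counting argument for Example 5.5 can be reproduced by brute force. The sketch below lists all 2^4 = 16 equally likely ordered samples of four coin tosses (coded 1 for tails, 2 for heads, as in the example), computes each sample mean, and tallies how often each value occurs. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

n = 4
# Each of the 16 ordered samples of four tosses is equally likely.
samples = list(product([1, 2], repeat=n))

# Count how many samples give each value of the sample mean.
mean_counts = Counter(Fraction(sum(s), n) for s in samples)
sampling_dist = {
    float(m): Fraction(c, len(samples)) for m, c in sorted(mean_counts.items())
}

for m, p in sampling_dist.items():
    print(m, float(p))
```

Exact fractions are used for the tally so that equal sample means are grouped reliably before being converted to decimals for display.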
Note that, first of all, the sample mean is itself a random variable. Second, the distribution of the sample
mean is very different from the distribution of X. Third, note the bell shape of the sample mean
distribution. In fact, the larger the sample size n, the more the sample mean distribution will
resemble a Normal distribution (this is the essence of the Central Limit Theorem that we will
discuss later in the chapter).
Having the distribution of the sample mean, we can calculate the expected value and variance of the
sample mean.
So, E(sample mean) = E(X), and Var(sample mean) < Var(X). Note that the variance of the sample
mean of X is less than the variance of X itself. This is not a coincidence; in fact, we will see
later that for a sample of size n, Var(sample mean) = Var(X)/n.
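As a check on these two facts, the sketch below computes the expected value and variance directly from the two probability tables in Example 5.5, using the usual definitions E = Σ x·p(x) and Var = Σ (x − E)²·p(x). The helper names are illustrative.

```python
# Distribution of X (one coin toss) and of the sample mean for n = 4,
# both copied from the tables in Example 5.5.
dist_x = {1: 0.5, 2: 0.5}
dist_mean = {1.0: 0.0625, 1.25: 0.25, 1.5: 0.375, 1.75: 0.25, 2.0: 0.0625}

def expected_value(dist):
    """Sum of value * probability over the distribution."""
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    """Probability-weighted squared deviation from the expected value."""
    mu = expected_value(dist)
    return sum((value - mu) ** 2 * prob for value, prob in dist.items())

n = 4
print(expected_value(dist_x), variance(dist_x))        # E(X), Var(X)
print(expected_value(dist_mean), variance(dist_mean))  # E(mean), Var(mean)
print(variance(dist_x) / n)                            # Var(X) / n
```

The last two lines should agree on the variance: 0.25 / 4 = 0.0625 for the sample mean, while both expected values are 1.5.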
A telephone company knows that during non-holidays the number of calls that pass through the
main branch office each hour has a Normal distribution with µ = 80,000 and σ = 35,000.
Describe the mean, standard deviation and shape of the sampling distribution of X , the mean
number of incoming calls per hour for a random sample of 60 non-holiday hours.
What else might we want to know about the sampling distribution? What can we say about the
shape of the sampling distribution of the sample mean?
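A minimal sketch of the arithmetic for the telephone example, assuming the rule stated earlier that the variance of the sample mean is the population variance divided by n (so the standard error is σ/√n). Because the population is stated to be Normal, the sampling distribution of the mean is Normal as well.

```python
from math import sqrt

mu = 80_000      # population mean calls per hour (from the example)
sigma = 35_000   # population standard deviation (from the example)
n = 60           # number of sampled non-holiday hours

mean_of_xbar = mu               # centre of the sampling distribution
se_of_xbar = sigma / sqrt(n)    # standard error of the sample mean

print(mean_of_xbar)
print(round(se_of_xbar))
```

The standard error comes out near 4,500 calls, far smaller than the hour-to-hour standard deviation of 35,000.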
3
Adapted from Sincich, exercise 7.24.
Suppose that a supermarket manager is interested in estimating the mean checkout time for the
non-express checkout lanes. An assistant manager obtains a random sample of 25 checkout
times. If previous data suggest that the population standard deviation is 1.10 minutes, describe
the probable deviation of Y from the unknown population mean µ.
Solution to Example 5.7. The Empirical Rule indicates that approximately 95% of the time $\bar{Y}$
is within two standard errors ($2\sigma_{\bar{Y}}$) of the population mean µ. For n = 25,
$$2\sigma_{\bar{Y}} = \frac{2\sigma_{pop}}{\sqrt{n}} = \frac{2(1.10)}{5} = .44$$
The probable error for $\bar{Y}$ is no more than .44 minute.
The probable accuracy of a sample mean, as measured by its standard error, is affected by the
sample size. Because the standard error of the sample mean is the population standard
deviation divided by the square root of the sample size, the standard error decreases as the
sample size increases. For example, if the sample size had been either 50 or 100 instead of 25
in Example 5.8., the probable errors ($2\sigma_{\bar{Y}}$) would have been, respectively, .31 or .22.
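The effect of sample size on the probable error can be tabulated in a few lines. This sketch uses the population standard deviation of 1.10 minutes from the checkout example; the function name is illustrative.

```python
from math import sqrt

sigma_pop = 1.10  # population standard deviation in minutes (from the example)

def probable_error(n, sigma=sigma_pop):
    """Two standard errors of the sample mean for a sample of size n."""
    return 2 * sigma / sqrt(n)

for n in (25, 50, 100):
    print(n, round(probable_error(n), 2))
```

Note that quadrupling the sample size from 25 to 100 only halves the probable error, because n enters through its square root.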
The average demand for rental skis on winter Saturdays at a particular area is 148 pairs, which
has been quite stable over time. There is variation due to weather conditions and competing
areas; the standard deviation is 21 pairs. The demand distribution seems to be roughly Normal.
a. The rental shop stocks 170 pairs of skis. What is the probability that demand will exceed
this supply on any one winter Saturday?
b. The shop manager will change the stock of skis for the next year if the average demand over
the 12 winter Saturdays in a season (considered as a random sample) is over 155 or under
135. These limits aren’t equidistant from the long-run process mean of 148 because the
costs of oversupply and undersupply are different. If the population mean stays at 148,
what is the probability that the manager will change the stock?
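One way to work parts (a) and (b) numerically is sketched below, assuming (as the example states) that demand is roughly Normal and that the 12 Saturdays behave like a random sample. It uses `statistics.NormalDist` from the Python standard library in place of a printed Normal table, so the results may differ slightly from table lookups with rounded z values.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 148, 21   # long-run mean and standard deviation of Saturday demand
n = 12                # winter Saturdays in a season

# (a) One Saturday: probability that demand exceeds the 170 pairs in stock.
single_day = NormalDist(mu, sigma)
p_exceed = 1 - single_day.cdf(170)

# (b) Season average: the standard error is sigma / sqrt(n).
season_mean = NormalDist(mu, sigma / sqrt(n))
p_change = (1 - season_mean.cdf(155)) + season_mean.cdf(135)

print(round(p_exceed, 3))
print(round(p_change, 3))
```

Part (a) uses the distribution of a single day's demand, while part (b) uses the much narrower sampling distribution of the 12-day average; mixing the two up is the most common mistake in problems of this kind.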
Basically, the CLT states that if we have a large enough sample taken from any population, not
just a Normally distributed one, then the sampling distribution of the sample mean will be
approximately Normal.
More formally, if a random sample of n observations is selected from essentially any population,
then, when n is sufficiently large, the sampling distribution of X will be approximately Normal.
The quality of the Normal approximation to the sampling distribution of X depends on two
things: the size of the sample, and the shape of the underlying distribution of the population.
• The larger the sample size n, the better the Normal approximation to the sampling
distribution of X
• Generally speaking, the greater the skewness of the underlying population, the larger the
sample size must be to obtain an adequate Normal approximation.
• If a plot of the sample data shows severe skewness, it is reasonable to assume that the
underlying population is severely skewed, and the Normal approximation to the sample
mean’s sampling distribution is not appropriate unless n is at least 100.
• For mild skewness, n = 30 should generally be sufficient to make the Normal
approximation to the sampling distribution appropriate.
• For symmetric but outlier-prone data, a sample of n = 15 should be enough for the CLT effect
to make the Normal approximation reasonable.
• For normalish data, n = 1 is sufficient.
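The guidelines above can be explored with a small simulation. The sketch below assumes an exponential population (a strongly right-skewed distribution chosen only for illustration; it does not appear in the text) and measures the skewness of the simulated sample means for several sample sizes. The seed, the number of repetitions and the function names are arbitrary choices.

```python
import random
from statistics import pstdev

def skewness(values):
    """Moment-based skewness: average cubed standardized deviation."""
    m = sum(values) / len(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_mean_skewness(n, reps, rng):
    """Skewness of `reps` sample means, each based on n exponential draws."""
    means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)
    ]
    return skewness(means)

rng = random.Random(5)  # arbitrary seed for repeatability
results = {n: sample_mean_skewness(n, 5_000, rng) for n in (1, 5, 30, 100)}
for n, sk in results.items():
    print(n, round(sk, 2))
```

The skewness of the sample means should shrink toward 0 (the value for a Normal distribution) as n grows, but for a population this skewed it is still noticeable at n = 30.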
In the supermarket checkout time situation of Example 5.8., the following actual times in minutes
were observed (n = 25): .4, .4, .5, .5, .5, .6, .6, .7, .8, .9, 1.1, 1.2, 1.4, 1.5, 1.8, 2.0, 2.3, 2.6,
2.9, 3.4, 4.2, 5.0, 6.6, 9.2, 16.3 ($\bar{y}$ = 2.70, median = 1.40, s = 3.56). Does it appear that a
Normal approximation to the sampling distribution of $\bar{Y}$ (for future samples of size n = 25, for
instance) would be satisfactory?
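The summary statistics quoted for the 25 checkout times can be recomputed directly from the listed data. The sketch below uses the standard library's `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, stdev

checkout_times = [0.4, 0.4, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 0.9,
                  1.1, 1.2, 1.4, 1.5, 1.8, 2.0, 2.3, 2.6, 2.9, 3.4,
                  4.2, 5.0, 6.6, 9.2, 16.3]

y_bar = mean(checkout_times)
med = median(checkout_times)
s = stdev(checkout_times)      # sample standard deviation (n - 1 divisor)

print(round(y_bar, 2), med, round(s, 2))
# A mean far above the median is one simple sign of right skewness.
print(y_bar > med)
```

Here the mean is nearly twice the median and the standard deviation exceeds the mean, both of which point to a long right tail in times that cannot be negative.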
[Figure: Checkout histogram; x-axis: Bin (0.4, 3.58, 6.76, 9.94, 13.12, More), y-axis: Frequency.]
Solution to Example 5.9. The sample data suggest that the population distribution of checkout
times is likely to be highly skewed. Most times are quite brief, but there are a few people who
really slow things up. (See histogram) A sample of 25 is not enough to deskew the sampling
distribution. Therefore, the Empirical Rule probabilities (which are based on the Normal
approximation) in Example 5.8. are most likely inaccurate for n = 25.
As a consistent example, I collected a sample of 800 pennies in the summer of 1996, and
stored the data in the CENTS file available on the Course Volume. From the original 800
measurements, samples of size 5, then of size 10, 25 and 40 were drawn (in files
CENTS1..CENTS4), and descriptive statistics calculated:
[Figure: distribution of the individual penny ages; x-axis from 0 to 45.]
Below are histograms and outlier boxplots of the sample means of sizes 5, 10, 25 and 40. Note
that they have been drawn on the same x-axis as the individual ages were above. Note the
CLT effect on the shape, the center and the spread of the distribution.
[Figures: histograms and outlier boxplots of MEAN5, MEAN10, MEAN25 and MEAN40, each drawn on an x-axis from 0 to 45.]
A manufacturer of automobile batteries claims that the distribution of the lifetimes of its best
battery has a mean of 54 months and a standard deviation of 6 months. A consumer group
purchases a sample of 50 batteries and determines their lifetimes.
a. Assuming the manufacturer’s claim is true, describe the sampling distribution of the
mean lifetime of a sample of 50 batteries.
A sample of 50 batteries will produce a sampling distribution of the sample mean which follows
• a Normal distribution (assuming that the population of batteries is not grossly
skewed), with
• Expected value $E(\bar{X}) = \mu_{pop} = 54$
• and Standard error of the mean $SE(\bar{X}) = \frac{\sigma_{pop}}{\sqrt{n}} = \frac{6}{\sqrt{50}} = 0.85$
b. Assuming the manufacturer’s claim is true, what is the probability that the group’s
sample has a mean lifetime of 52 months or less?
We need to find the probability that the sample mean of 50 batteries, X , is 52 or less. But,
assuming the manufacturer’s claim is true, we know that the sample mean X follows a Normal
distribution with mean E( X ) = 54 and standard error SE( X ) = 0.85. So we have
$$\Pr(\bar{X} < 52) = \Pr\left(\frac{\bar{X} - E[\bar{X}]}{SE[\bar{X}]} < \frac{52 - E[\bar{X}]}{SE[\bar{X}]}\right)$$
$$= \Pr\left(Z < \frac{52 - 54}{0.85}\right)$$
$$= \Pr(Z < -2.35)$$
Now Pr(Z < -2.35) can be found using Table 2, and is equal to .0094, which is the probability
that the group’s sample has a mean lifetime of 52 months or less, assuming the manufacturer’s
claim is true.
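The same calculation can be sketched in code, with `statistics.NormalDist` standing in for Table 2. Because the code keeps the standard error unrounded (about 0.8485 rather than 0.85), the z value and the probability differ slightly in the last digits from the table-based answer of .0094.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 54, 6, 50          # claimed mean, standard deviation, sample size
se = sigma / sqrt(n)              # standard error of the sample mean

z = (52 - mu) / se                # standardized value of a 52-month sample mean
p = NormalDist().cdf(z)           # Pr(Z < z) for a standard Normal

print(round(se, 2), round(z, 2), round(p, 4))
```

Either way the probability is a little under 1 in 100, so a sample mean of 52 months or less would be surprising if the manufacturer's claim were true.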
Versions of a CLT also apply to sample statistics other than sums and means, but the Normal
distribution does not always apply. Mere largeness does not imply Normality, unless a sum or
average is involved. For a sufficiently large sample, the CLT guarantees that the distribution of
the sample mean will be approximately Normal, even though the individual values may be quite
skewed. The best advice is to draw pictures...
3. The manufacturer of cans of salmon that are supposed to have a net weight of 6 ounces tells
you that the net weight is actually a random variable with mean 6.05 ounces and standard
deviation of 0.18 ounce. Suppose you take a random sample of 36 cans.
a. Find the probability that the sample mean will be less than 5.97 ounces.
b. Suppose your random sample of 36 cans produces a mean of 5.95 ounces. Comment on
the statement made by the manufacturer.
4. The sign on the elevator in a large skyscraper states “Maximum capacity 2,500 pounds, or
16 persons.” A professor of statistics wonders what the probability is that 16 people would
weigh more than 2,500 pounds. If the weights of the people who use the elevator are
Normally distributed, with a mean of 150 pounds and a standard deviation of 20 pounds, what
is the probability that the professor seeks?
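For readers who want to check their work on questions 3(a) and 4, here is one possible setup, offered as a sketch rather than the official answer. It assumes the sample mean of the 36 cans is approximately Normal (by the CLT) and treats the 16 riders as a random sample, so that the total weight has mean 16 × 150 and standard deviation 20 × √16.

```python
from math import sqrt
from statistics import NormalDist

# Question 3(a): mean of 36 cans, population mean 6.05 oz, sd 0.18 oz.
salmon_mean = NormalDist(6.05, 0.18 / sqrt(36))
p_salmon = salmon_mean.cdf(5.97)

# Question 4: total weight of 16 people, each Normal(150, 20).
# A sum of n independent values has sd equal to sigma * sqrt(n).
total_weight = NormalDist(16 * 150, 20 * sqrt(16))
p_overload = 1 - total_weight.cdf(2500)

print(round(p_salmon, 4), round(p_overload, 4))
```

Question 4 can equally be phrased in terms of the mean: the total exceeds 2,500 pounds exactly when the average weight exceeds 2,500 / 16 = 156.25 pounds, and both routes give the same probability.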
Suppose the telephone company wishes to determine whether the true mean number of
incoming calls per hour during holidays is the same as for non-holidays. To accomplish this, the
company randomly selects 60 hours during a holiday period, monitors the incoming phone calls
each hour, and computes y , the sample mean number of incoming phone calls. If the sample
mean is computed to be y = 91,970 calls per hour, do you believe that the true mean for
holidays is µ = 80,000 (the same as for non-holidays)? Assume that the standard deviation of
the number of incoming calls per hour for holidays is 35,000.
Telephone Company
To assess whether 80,000 seems reasonable, we need to calculate the probability of obtaining a
sample mean of 91,970 or larger when µ = 80,000. So we assume that µpop = 80,000 and that
σpop = 35,000 and take a sample of n = 60 hours. Again, with a sample of size n = 60, we
need only to assume that the population of holiday hour incoming calls is not grossly skewed
in order to allow the CLT effect to occur.
If this is true, then the sampling distribution of $\bar{y}$ is Normal, with Expected value
$E(\bar{y}) = \mu_{pop} = 80,000$ calls and Standard error
$SE(\bar{y}) = \frac{\sigma_{pop}}{\sqrt{n}} = \frac{35,000}{\sqrt{60}} = 4,518$. Now, we want to find the
probability of obtaining a sample mean of 91,970 or larger:
$$\Pr(\bar{y} \ge 91,970) = \Pr\left(\frac{\bar{y} - E[\bar{y}]}{SE[\bar{y}]} \ge \frac{91,970 - E[\bar{y}]}{SE[\bar{y}]}\right)$$
$$= \Pr\left(Z \ge \frac{91,970 - 80,000}{4,518}\right)$$
$$= \Pr(Z \ge 2.65)$$
and, from Table 2, Pr(Z ≥ 2.65) = 1 - .9960 = .0040, which seems awfully small, and thus
suggests that the mean number of calls on holiday hours is higher than 80,000.
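The steps of this answer sketch translate directly into code. The version below uses `statistics.NormalDist` in place of Table 2 and keeps intermediate values unrounded, so the final probability may differ from the table value in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 80_000, 35_000, 60   # assumed mean, standard deviation, sample size
y_bar = 91_970                       # observed holiday sample mean

se = sigma / sqrt(n)                 # standard error of the sample mean
z = (y_bar - mu0) / se               # how many standard errors above 80,000
p_upper = 1 - NormalDist().cdf(z)    # Pr(sample mean >= 91,970) if mu = 80,000

print(round(se), round(z, 2), round(p_upper, 4))
```

A sample mean more than two and a half standard errors above 80,000 would occur only about 4 times in 1,000 if the holiday mean really were 80,000.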
4
Adapted from Sincich, exercise 7.18.
and, from Table 2, Pr(-1.18 < Z < 0.59) = .8810 - .2776 = .6034
c.
$$\Pr(\bar{y} \le 67) = \Pr\left(\frac{\bar{y} - E[\bar{y}]}{SE[\bar{y}]} \le \frac{67 - E[\bar{y}]}{SE[\bar{y}]}\right)$$
$$= \Pr\left(Z \le \frac{67 - 71}{1.7}\right)$$
$$= \Pr(Z \le -2.35)$$
and, from Table 2, Pr(Z ≤ -2.35) = .0094. Thus, we’d expect a sample mean CDQ
score of 67 or lower to occur less often than 1 in 100 times. It would be a real surprise
to observe such a result.