Chapter 5 - Sampling and Sampling Distribution
Statistics in Practice
Selecting a Sample
– Sampling from a finite population
• A finite population is a population of finite size N (i.e., N is not too large
to list its elements)
• SIMPLE RANDOM SAMPLE (FINITE POPULATION)
– A simple random sample of size n from a finite population of size N is a
sample selected such that each possible sample of size n has the same
probability of being selected.
• Sampling from a finite population can be done with or without replacement
– Sampling from an infinite population
• RANDOM SAMPLE (INFINITE POPULATION)
– A random sample of size n from an infinite population is a sample selected
such that the following conditions are satisfied.
– 1. Each element selected comes from the same population.
– 2. Each element is selected independently.
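The two ways of sampling from a finite population can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `simple_random_sample` and the population of ID numbers 1 to 2500 are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw n elements at random from a finite population (hypothetical helper)."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same element may be drawn more than once.
        return rng.choices(population, k=n)
    # Without replacement: every subset of size n is equally likely.
    return rng.sample(population, n)

# Hypothetical finite population: ID numbers 1..2500.
population = list(range(1, 2501))
without = simple_random_sample(population, 30, replace=False, seed=1)
with_repl = simple_random_sample(population, 30, replace=True, seed=1)
print(sorted(without))
print(sorted(with_repl))
```

Sampling without replacement never repeats an element, while sampling with replacement may.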
[Figure: a sample X1, X2, ..., Xn is drawn from a population. The population parameter is the mean μ (e.g., mean height); the corresponding sample statistic is the sample mean.]
Sample Mean
– Imagine we take a sample of size n from a population and define random variables X1,
X2, ..., Xn representing the values that could be obtained. Then X1, X2, ..., Xn are
independent with the same mean (μ) and variance (σ²), and their mean (the sample
mean) is
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
For Y = cX:
– E(Y) = c E(X)
– Var(Y) = c² Var(X)
For Y = X1 + X2 + ... + Xn:
– If X1, X2, ..., Xn are independent with the same mean (μ) and variance (σ²), then
– E(Y) = nμ
– Var(Y) = nσ²
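Applying the two rules above to the sample mean, written here as $\bar{X} = Y/n$ with $Y = X_1 + \dots + X_n$, gives its expected value and variance. The following is a short derivation sketch that relies on the stated assumption that the $X_i$ are independent with common mean μ and variance σ²:

```latex
\begin{align*}
E(\bar{X}) &= E\!\left(\tfrac{1}{n} Y\right) = \tfrac{1}{n}\, E(Y) = \tfrac{1}{n}\, n\mu = \mu, \\
\operatorname{Var}(\bar{X}) &= \operatorname{Var}\!\left(\tfrac{1}{n} Y\right)
  = \tfrac{1}{n^{2}} \operatorname{Var}(Y) = \tfrac{1}{n^{2}}\, n\sigma^{2} = \frac{\sigma^{2}}{n}, \\
\sigma_{\bar{X}} &= \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```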
– Let us begin by citing two examples in which sampling was used to answer a
research question about a population.
– 1. Members of a political party in Texas were considering supporting a
particular candidate for election to the U.S. Senate, and party leaders wanted
to estimate the proportion of registered voters in the state favoring the
candidate. A sample of 400 registered voters in Texas was selected and 160 of
the 400 voters indicated a preference for the candidate. Thus, an estimate of
the proportion of the population of registered voters favoring the candidate is
160/400 = 0.40.
– 2. A tire manufacturer is considering producing a new tire designed to provide
an increase in mileage over the firm’s current line of tires. To estimate the
mean useful life of the new tires, the manufacturer produced a sample of 120
tires for testing. The test results provided a sample mean of 36,500 miles.
Hence, an estimate of the mean useful life for the population of new tires was
36,500 miles.
Real Example:
– The director of personnel for Electronics Associates, Inc. (EAI), has been
assigned the task of developing a profile of the company’s 2500 managers. The
characteristics to be identified include the mean annual salary for the
managers and the proportion of managers having completed the company’s
management training program.
– Using the 2500 managers as the population for this study, we can find the
annual salary and the training program status for each individual by referring
to the firm’s personnel records. The data set containing this information for all
2500 managers in the population is in the file named EAI. 1500 of the 2500
managers completed the training program.
– Numerical characteristics of a population are called parameters
– Population mean: μ = $51,800
– Population standard deviation: σ = $4000
– Proportion of the population that completed the training program: p = 0.60
– Now, suppose that the necessary information on all the EAI managers was not
readily available in the company’s database. The question we now consider is
how the firm’s director of personnel can obtain estimates of the population
parameters by using a sample of managers rather than all 2500 managers in
the population.
– Suppose that a sample of 30 managers will be used. Clearly, the time and the
cost of developing a profile would be substantially less for 30 managers than
for the entire population. If the personnel director could be assured that a
sample of 30 managers would provide adequate information about the
population of 2500 managers, working with a sample would be preferable to
working with the entire population.
Point Estimation
– To estimate the value of a population parameter, we compute a
corresponding characteristic of the sample, referred to as a sample statistic.
– A simple random sample of 30 managers and the corresponding data on
annual salary and management training program participation are as shown
below:
Sampling Distribution
– In the preceding section we said that the sample mean x̄ is the point estimator
of the population mean μ, and the sample proportion p̄ is the point estimator of
the population proportion p.
– For the simple random sample of 30 EAI managers, the point estimate of μ is x̄
= $51,814 and the point estimate of p is p̄ = 0.63. Suppose we select another
simple random sample of 30 EAI managers and obtain the following point
estimates:
• Sample mean: x̄ = $52,670 and sample proportion: p̄ = 0.70
– Now, suppose we repeat the process of selecting a simple random sample of
30 EAI managers over and over again, each time computing the values of x̄
and p̄.
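The repeated-sampling process described above can be imitated with a small simulation. This is only a sketch under loud assumptions: the real EAI data file is not used, so the salaries are synthetic values generated from a normal distribution with mean 51,800 and standard deviation 4,000, and the choice of 500 repetitions and all variable names are illustrative.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in for the 2500 EAI managers (the real EAI file is NOT used here).
N = 2500
salaries = [rng.gauss(51_800, 4_000) for _ in range(N)]
trained = [True] * 1500 + [False] * 1000   # 1500 of 2500 completed the program
rng.shuffle(trained)
population = list(zip(salaries, trained))

def point_estimates(sample):
    """Return (x_bar, p_bar) for a sample of (salary, trained) pairs."""
    x_bar = statistics.mean(s for s, _ in sample)
    p_bar = sum(t for _, t in sample) / len(sample)
    return x_bar, p_bar

# Repeat the sampling process 500 times with n = 30 (without replacement).
estimates = [point_estimates(rng.sample(population, 30)) for _ in range(500)]
x_bars = [x for x, _ in estimates]
p_bars = [p for _, p in estimates]

pop_mean = statistics.mean(salaries)
print(f"synthetic population mean: {pop_mean:,.0f}")
print(f"mean of 500 x-bars       : {statistics.mean(x_bars):,.0f}")
print(f"std dev of 500 x-bars    : {statistics.stdev(x_bars):,.0f}")
print(f"mean of 500 p-bars       : {statistics.mean(p_bars):.3f}")
```

Each run of the loop produces a different x̄ and p̄, but their averages should sit close to the population values, which is the behaviour the sampling distribution describes.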
Sampling distribution
– The sample mean x̄ is a random variable.
– x̄ has a mean or expected value, a standard deviation, and a probability
distribution.
– The various possible values of x̄ are the result of different simple random
samples.
– The probability distribution of x̄ is called the sampling distribution of x̄.
– Knowledge of this sampling distribution and its properties will enable us to
make probability statements about how close the sample mean is to the
population mean μ.
– For the x̄ values obtained from 500 simple random samples, we note that the
largest concentration of the values, and the mean of the 500 values, is near the
population mean μ = $51,800.
– The sampling distribution of x̄ is the probability distribution of all possible
values of the sample mean x̄.
Sampling Distribution of x̄
1. Population has a normal distribution.
• In many situations it is reasonable to assume that the population from which we
are selecting a random sample has a normal, or nearly normal, distribution. When
the population has a normal distribution, the sampling distribution of x̄ is normally
distributed for any sample size.
2. Population does not have a normal distribution.
• When the population from which we are selecting a random sample does not have
a normal distribution, the central limit theorem is helpful in identifying the shape
of the sampling distribution of x̄: for large sample sizes, x̄ is approximately normally distributed.
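The effect of the central limit theorem can be seen in a quick simulation. The sketch below is illustrative only: it assumes an exponential population with mean 1 (a clearly non-normal, right-skewed choice that does not come from the slides) and uses sample skewness as a rough measure of how far the distribution of sample means is from a symmetric, bell-shaped curve.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: average cubed z-score (near 0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, reps, rng):
    """Means of `reps` samples of size n from an exponential population with mean 1."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

rng = random.Random(0)
for n in (1, 5, 30):
    means = sample_means(n, 5000, rng)
    print(f"n={n:>2}  mean={statistics.fmean(means):.3f}  "
          f"sd={statistics.pstdev(means):.3f}  skewness={skewness(means):.2f}")
```

As n grows, the mean of the sample means stays near 1, their standard deviation shrinks roughly like 1/√n, and the skewness moves toward 0.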
[Figure: x̄ is the mean height of a sample of 5 students.]
2. X is non-normal
Expected value of x̄: $E(\bar{x}) = \mu$
Standard deviation of x̄ (standard error of the mean): $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
– With a population mean of $51,800, the personnel director wants to know the
probability that x̄ is between $51,300 and $52,300. This probability is given by
the darkly shaded area of the sampling distribution shown in the figure.
– Because the sampling distribution is normally distributed, with mean 51,800
and standard error of the mean 730.3 (σ/√n = 4000/√30), we can use the standard
normal probability table to find the area or probability.
– We first calculate the z value at the upper and lower endpoints of the interval
(52,300 and 51,300, respectively).
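The calculation described above can be sketched with Python's standard library in place of a printed normal table. The numbers μ = 51,800, σ = 4,000 and n = 30 come from the EAI example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 51_800, 4_000, 30
std_error = sigma / sqrt(n)               # standard error of the mean, about 730.3

# z values at the upper and lower endpoints of the interval
z_upper = (52_300 - mu) / std_error
z_lower = (51_300 - mu) / std_error

std_normal = NormalDist()                 # mean 0, standard deviation 1
prob = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)

print(f"standard error = {std_error:.1f}")
print(f"z_lower = {z_lower:.2f}, z_upper = {z_upper:.2f}")
print(f"P(51,300 <= x_bar <= 52,300) = {prob:.4f}")
```

The result is roughly 0.51. A printed table with z rounded to ±0.68 gives 0.5034, so small differences from the code output are due to rounding.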
– What are the estimates of the mean and standard deviation of the transit travel
time?
– The sample mean of transit time is 19.7 min and the observed standard
deviation is 4.91 min. If GRT would like to use the sample mean (19.7 min) as
an estimate of the population mean (the true but unknown mean), how good is
this estimate?
– What is the true mean? How can we find it?
– If we survey another 20 runs, what would the estimate be? Would it be very
different from 19.7 min?
– What are the factors influencing the quality of the estimate?
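One common way to quantify how good the estimate is uses the estimated standard error of the sample mean, s/√n. The sketch below assumes the original survey covered n = 20 runs; that value is inferred from the phrase "another 20 runs" and is not stated explicitly.

```python
from math import sqrt

n = 20          # ASSUMED sample size (inferred from "another 20 runs")
x_bar = 19.7    # sample mean, minutes
s = 4.91        # sample standard deviation, minutes

# Estimated standard error of the sample mean: s / sqrt(n)
std_error = s / sqrt(n)
print(f"sample mean = {x_bar} min, estimated standard error = {std_error:.2f} min")
```

The standard error shrinks when the spread s is smaller or the sample size n is larger, which are the two main factors that influence the quality of the estimate.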
Frequency of each possible sample mean for samples of size n = 1, 2, 3 from a population with the values 1 to 6 (Excel pivot tables):

Sample mean   Count      Sample mean   Count      Sample mean   Count
(n = 1)                  (n = 2)                  (n = 3)
1             1          1             1          1.00          1
2             1          1.5           2          1.33          3
3             1          2             3          1.67          6
4             1          2.5           4          2.00          10
5             1          3             5          2.33          15
6             1          3.5           6          2.67          21
Grand Total   6          4             5          3.00          25
                         4.5           4          3.33          27
                         5             3          3.67          27
                         5.5           2          4.00          25
                         6             1          4.33          21
                         Grand Total   36         4.67          15
                                                  5.00          10
                                                  5.33          6
                                                  5.67          3
                                                  6.00          1
                                                  Grand Total   216

[Figure: Excel bar charts of the three frequency distributions above.]
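The counts in the pivot tables can be reproduced by enumerating every ordered sample. The sketch below assumes that the population values 1 to 6 are equally likely (as with a fair die) and that the samples are ordered and drawn with replacement, which is consistent with the grand totals 6, 36 and 216. The function name `mean_counts` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def mean_counts(n, values=range(1, 7)):
    """Count how often each sample mean occurs over all ordered samples of size n."""
    return Counter(Fraction(sum(s), n) for s in product(values, repeat=n))

for n in (1, 2, 3):
    counts = mean_counts(n)
    print(f"n = {n} (grand total {sum(counts.values())})")
    for mean in sorted(counts):
        print(f"  {float(mean):.2f}  {counts[mean]}")
```

Exact fractions are used for the means so that values such as 4/3 and 10/3 are grouped without rounding problems. The distribution is flat for n = 1, triangular for n = 2, and already bell-shaped for n = 3.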
Solution of Example 3
– (1)
– (2) The CLT does not apply to a small sample (n < 30), so strictly speaking the
answer to this question is that “the distribution of the sample mean is
unknown”. However, for a uniformly distributed population, simulation shows
(see Notes about CLT) that the mean of even a small sample (n > 5) is
approximately normally distributed. Therefore, in this case the sample mean can
be treated as approximately normal.
– (3)
(a) < 0.4
Example 4
The customer waiting times at a certain post office are assumed to be normally
distributed. A co-op student will be monitoring 30 noontime customers, timing
their arrivals and service with a watch. As closely as you can, find an interval
bracketing the probability that she will observe a deviation from the mean in
waiting time exceeding 2 minutes considering the following:
(a) σ = 6.3 minutes
(b) s = 6.3 minutes (unknown standard deviation)
P = 0.6467
P = 0.6453
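[Flowchart: choosing the statistic for the sample mean. Decision questions: Do we know σ²? Is n > 30? Outcomes:]
– σ known: $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
– σ unknown: $t_{n-1} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$, with n − 1 degrees of freedom
– σ unknown and n > 30: $z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
– One branch of the flowchart is marked “?”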
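[Figure: a sample X1, X2, ..., Xn is drawn from a population. The population parameter is the proportion of students liking cokes (p); the corresponding sample statistic is the sample proportion p*.]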
Recall
Bernoulli Distribution (Expectation and Variance of X)
E(X) = p
Var(X) = p(1-p)
Example 5:
– For the EAI managers, μ is $51,800 and p is 0.60. Suppose we select
another simple random sample of 30 EAI managers and obtain the point
estimate p̄ = 0.70. What is the probability of getting a sample proportion this
large or smaller, P(p̄ ≤ 0.70)?
P = 0.868
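The value 0.868 is consistent with a normal approximation to the sampling distribution of the sample proportion, reading the question as P(p̄ ≤ 0.70). The sketch below works under that reading, which is an interpretation rather than something the slide states, and uses the Bernoulli variance p(1 − p) to obtain the standard error.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.60, 30
p_bar = 0.70

# Standard error of the sample proportion, from Var(X) = p(1 - p) per Bernoulli trial.
std_error = sqrt(p * (1 - p) / n)
z = (p_bar - p) / std_error

# Normal approximation to the sampling distribution of the sample proportion.
prob = NormalDist().cdf(z)
print(f"standard error = {std_error:.4f}, z = {z:.2f}, P(p_bar <= 0.70) = {prob:.3f}")
```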
Theorem on χ² Distribution
– If the data come from (at least approximately) a normal population,
then the pivotal statistic $(n-1)s^2/\sigma^2$ follows a chi-squared
distribution with n − 1 degrees of freedom:
$\chi^2_{n-1} = \frac{(n-1)s^2}{\sigma^2}$
Example 6:
– The measurement accuracy of a strain gage is 2 mm (standard deviation). It is
used to measure the deformation of 30 concrete slabs. What is the probability
that the standard deviation of these measurements is at most 2.2 mm?
$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{29 \times 2.2^2}{2^2} = 35.09$
P = 0.798
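The probability 0.798 corresponds to the chi-squared cumulative probability P(χ² ≤ 35.09) with 29 degrees of freedom. A minimal sketch of that calculation using only the standard library follows. The helper `chi2_cdf` is a hypothetical name, and it evaluates the regularized lower incomplete gamma function by its power series, which is one of several possible approaches.

```python
import math

def chi2_cdf(x, k):
    """P(chi-squared with k degrees of freedom <= x), via the incomplete-gamma series."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a          # first series term: 1 / a
    total = term
    i = 0
    while term > 1e-15 * total:
        i += 1
        term *= z / (a + i)  # next term: z^i / (a (a+1) ... (a+i))
        total += term
    # Multiply by z^a * exp(-z) / Gamma(a), computed in log form for stability.
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

n, sigma, s = 30, 2.0, 2.2
chi2_stat = (n - 1) * s**2 / sigma**2
prob = chi2_cdf(chi2_stat, n - 1)
print(f"chi-squared statistic = {chi2_stat:.2f}, P(s <= 2.2) = {prob:.3f}")
```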
Review Exercise
1) A quality control process accepts or rejects each batch of 0.5” steel rods based
on the test results of a random sample of 100. A batch is acceptable if the mean
diameter of a sample from it falls between 0.4995” and 0.5005”; otherwise, it is
rejected. Previous evaluations have established that the standard deviation for
individual rod diameters is 0.003”.
• (a) What is the probability that a batch of steel rods that has a mean diameter of
0.5003” will be accepted?
• (b) What is the probability that a near-perfect batch having μ = 0.4999” (which
means it should have been accepted) will be rejected?
2) A civil engineer has computed the following results for the strength of a certain
material from 20 specimens: x̄ = 31.4 MPa, s = 2.85 MPa. Determine the
approximate probability of getting a result this rare or rarer (this large or larger)
if the true mean strength is 29 MPa.
3) A civil engineer claims that the true mean strength of a given material is 29
MPa. To check this claim, he tested 20 specimens and computed the following
results: x̄ = 31.4 MPa, s = 2.85 MPa. If he accepts his claim whenever the computed
t-value from this sample is between −t0.05 and t0.05, what would be his
conclusion?
5) If each strand in a rope has a breaking strength, with mean 100 N and standard
deviation 10 N, and the breaking strength of a rope is the sum of the
(independent) breaking strengths of all the strands, what is the probability that a
rope made up of 64 strands will support a weight of 6300 N?
7) The following are the transportation modes that students use to travel between
their residence and the AUT campus in summer and winter. Answer the following
questions for the different modes.
(a) How many samples should you take to be able to use the CLT?
(b) Use the number you find in part (a) to calculate the Z value.
(c) How can you interpret this Z value?
(d) Compare the results for summer and winter.
Summer
Walk Walk Walk Bike Walk Walk Walk Walk Bike Walk Drive Drive Walk Bus
Walk Walk Walk Bike Bus Bike Walk Walk Walk Walk Bike Bus Drive Bike
Walk Walk Walk Walk Walk Walk Walk Walk Walk Walk Bus Walk Walk Walk
Bus Bus Bike Bus Bike Walk Bus Bike Walk Bike Bike Walk Walk Walk
Walk Walk Walk Bike Walk Bike Walk Bus Walk Drive Bus Walk Drive Bike
Walk Bus Walk Walk Walk Walk Walk Walk Walk Bus Bus Walk Walk Bus
Bike Walk Walk Drive Walk Bus Bus Walk
Winter
Walk Bus Drive Bus Walk Bus Walk Bus Bike Bus Drive Walk Bus Bus
Walk Bus Bus Bus Walk Walk Walk Walk Walk Bike Bus Walk Bus Walk
Bus Drive Others Drive Bus Walk Walk Walk Bus Walk Bus Bus Bus Bus
Bus Walk Walk Bus Walk Walk Bus Bus Walk Bus Walk Bus Walk Walk
Bike Bus Walk Bus Walk Drive Bus Walk Drive Bus Bus Bus Walk Walk
Bus Walk Bus Walk Bus Bus Walk Walk Bus Bus Bus Walk Drive Walk
Walk Walk Bus Bus Bus Bus Bus Walk
8) A student took 2 concrete samples and ran compressive strength tests. The
results are presented below in MPa. Her supervisor insisted that the tests be done
at a high level of accuracy and precision and restricted her to a standard deviation
of no more than 2 MPa for the results. What is the probability that the samples
cannot meet this restriction?
26.98 33.52 31.56 34.04 32.82 34.43 28.78 27.55
10) A water distribution subsystem consists of pipes AB, BC, and AC as shown in
the figure below. Because of differences in elevation and in hydraulic head loss in
the pipes and associated uncertainties, the capacity of each pipe (which is
defined as the maximum rate of flow) is given as follows, in cfs (cubic feet per
second):
AB: capacity is normal with mean 5 and coefficient of variation 0.1 (Coefficient of
variation = standard deviation/mean)
BC: capacity is uniformly distributed between 2 and 8
AC: capacity equal to 8 or 9 with equal likelihood
(1) Determine the probability that the capacity of the branch ABC will exceed 4 cfs.
(Hint: Define this event as a combination of the events related to the capacity of AB
and BC.)
(2) Determine the probability that the total capacity of the subsystem shown above
will exceed 13 cfs. (Hint: Use conditional probability.)
[Figure: water distribution subsystem with nodes A, B, and C connected by pipes AB, BC, and AC.]
11) Consider the class (or all students who participated in the survey) as a population, and
suppose you are interested in the students’ average height. Let X = the height of a randomly
selected student in cm. Use the survey data to answer the following questions:
a) Determine the (population) mean, standard deviation, and probability distribution of X.
b) Following a), with the known population parameters (mean and variance of X), imagine
you pick a sample of 5 students at random (N = 5) and let Y = the average height of these
sampled students. Determine the mean and standard deviation of Y. What distribution do
you expect Y to follow?
c) Repeat b) for N = 10 and 30. What patterns do you observe (that is, how do the mean,
standard deviation, and distribution of Y change with the sample size N)?
d) With the information about the population, if N = 30, what is the probability that the
difference between Y (sample mean) and the true population mean is less than 2 cm?
e) With the information about the population, if N = 5, what is the probability that the
difference between Y (sample mean) and the true population mean is less than 2 cm? You
may assume that the population is normally distributed.
f) With the information about the population, if we want the probability that the
difference between Y (sample mean) and the true population mean is less than 1 cm
to be over 95%, how many students should we sample?