Statistics and Their Distributions
TABLE OF CONTENTS
Section No. and Heading
Learning Objectives
1 Sample Statistic
1.1 Sampling With and Without Replacement
1.2 Sample Statistic and Sampling Distribution
2 Sampling Techniques
2.1 Simple Random Sampling
2.2 Stratified Random Sampling
2.3 Two-Stage or Multi-Stage Random Sampling
2.4 Systematic Random Sampling
2.5 Purposive Sampling
2.6 Cluster Sampling
3 Sampling and Non-sampling Errors
4 Deriving a Sampling Distribution
5 Analytical Methods for Deriving a Sampling Distribution
5.1 Using Probability Rules
5.2 Simulation Experiments
6 Distribution of the Sample Mean
7 Distribution of Sample Means when Population is Normally Distributed
8 Central Limit Theorem
9 Distribution of Sample Means when the Population is Non-Normal
10 Distribution of the Sum and Difference of Sample Means
Practice Questions
Learning Objectives:
In this chapter you will learn what a sample statistic is and what is meant by its
sampling distribution. You will learn how to derive the probability distribution of a
sample statistic and the alternative methods that can be used for this purpose.
The first method is based on selecting samples from the population. You will learn
about different methods for selecting a representative sample and the difference
between sampling and non-sampling errors. You will study in depth the
probability distribution of sample means and the significance of the Central Limit
Theorem in this context. The chapter ends with the study of the distribution of
combinations of two sample means, and is followed by practice questions so that
you can test your understanding of the chapter contents.
Chapter Outline
1. Sample Statistic
2. Sampling Techniques
3. Sampling and Non-sampling Errors
4. Deriving a Sampling Distribution
5. Analytical Methods for Deriving a Sampling Distribution
6. Distribution of the Sample Mean
7. Distribution of Sample Means when Population is Normally Distributed
8. Central Limit Theorem
9. Distribution of Sample Means when the Population is Non-Normal
10. Distribution of the Sum and Difference of Sample Means
1 SAMPLE STATISTIC
Measures which describe some characteristic of the population are known as
parameters. Examples of population parameters are the population mean μ, the
population standard deviation σ, the population proportion p, the population
median μ̃, etc. These are constants for a population and remain unknown in the
absence of complete population census data.
From a population of size N, the first unit can be drawn in N ways, the second
unit in N − 1 ways, and so on, when sampling is without replacement. Since the
order in which the sample units are selected is not relevant, the total number of
possible samples of size n from a population of size N is

C(N, n) = N! / (n!(N − n)!)

When the population from which the sample is drawn is very large in relation to
the sample size, i.e., when n < 0.05N, then for all practical purposes we can
consider the population to be infinitely large. If the population is infinitely large,
then the number of samples that can be drawn from the population is also
infinitely large, irrespective of whether sampling is with or without replacement.
It is only when the population is finite, the sample size n > 0.05N, and sampling
is without replacement that the total number of samples will be C(N, n) and the
probability of selection of any one of the equally likely samples is 1/C(N, n).
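The count C(N, n) can be checked directly; a minimal Python sketch, where the population size N = 10 and sample size n = 3 are made-up illustration values:

```python
import math

N, n = 10, 3                      # hypothetical population and sample sizes
num_samples = math.comb(N, n)     # C(N, n) = N! / (n! * (N - n)!)
prob_each = 1 / num_samples       # each sample is equally likely
print(num_samples, prob_each)     # 120 samples, each with probability 1/120
```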
Several different functions of sample values could be used to obtain the estimate
of the parameter value. Examples of estimators for the population mean μ are the
sample mean X̄, the trimmed mean X̄tr, the median X̃, or some weighted
average of sample values.
Consider selecting two different samples of the same size n, (x1, x2, …, xn), from
the same population. It is very unlikely that all the sample values of the first
sample will be repeated in the second sample. Since a sample is only a small
subset of the population, and a large number of samples of the same size can be
drawn from the same population, the sample values are likely to differ from one
sample to another.
Before we obtain sample data, there is uncertainty about the value of each Xi,
since each sample element can be any one of the population units. Because of
this uncertainty, each observation is a random variable Xi before the data
becomes available.
Since sample observations are random variables, the value of any function of the
sample observations (e.g., the sample mean X̄, the sample variance S², etc.) is
also a random variable which varies from sample to sample. There is uncertainty
about the value of X̄, the value of S², and so on, prior to obtaining the sample
observations. The value of the sample statistic will depend on which sample was
selected, and the parameter estimate would differ accordingly.
2 SAMPLING TECHNIQUES
Samples selected from a population must be representative of the population. If a
sample is unrepresentative of the population and the sample statistic is used to
estimate the corresponding parameter value, then this will result in an inaccurate
estimate. If the selected sample contains a disproportionately large number of
units from one end of the population distribution then the sample statistic will
provide an underestimate or overestimate of the parameter value. For example, if
the sample observations are atypically larger than most of the population values,
then the sample mean will be an overestimate of the population mean.
The technique used in the collection of sample data should be such that it
minimizes the possibility of such errors. There exist a number of alternative
sampling techniques.

2.1 Simple Random Sampling
Simple random sampling ensures that each unit of the population gets an equal
chance of being selected in the sample. Since several different samples can be
selected from any population, this method ensures that each sample of the same
size has the same probability of being selected. This method is useful for
homogeneous populations where there are no extreme values. In case of
homogeneous populations atypical observations are unlikely in the selected
sample and the estimate is unlikely to be biased.
The random variables X1, X2,…….Xn are said to form a simple random sample of
size n if the following two conditions are satisfied:
1. The Xi’s are independent random variables
2. Every Xi has the same probability distribution.
In a simple random sample (SRS), each unit of the sample is then said to be
independently and identically distributed (iid).
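The two conditions can be illustrated with a small standard-library sketch (the population of 100 numbered units is a made-up illustration). Sampling with replacement gives exactly iid draws; an SRS without replacement from a population that is large relative to n satisfies the conditions approximately:

```python
import random

rng = random.Random(0)                      # fixed seed for reproducibility
population = list(range(1, 101))            # hypothetical population of 100 units

with_repl = rng.choices(population, k=5)    # iid draws (with replacement)
without_repl = rng.sample(population, 5)    # SRS without replacement: no repeats
print(with_repl, without_repl)
```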
2.2 Stratified Random Sampling
In stratified random sampling, the population is divided into subgroups (strata)
and units are selected from each subgroup so that the composition of the sample
is the same as that found in the population. The proportions in the sample from
each subgroup conform to the proportions in the population. However, more
information is required about the population in this method than in SRS.
2.4 Systematic Random Sampling
This method is often used in estimating the timber available in a forest. A tree is
selected at random and then a direction is selected at random. Every i-th tree in
the selected direction, starting from the first tree, is then examined.
We see from the listed sampling techniques that there is no unique method for
obtaining a representative sample from a given population. Other sampling
methods exist which combine features of more than one of the above methods.
One such example is Stratified Cluster Sampling. The method adopted when
selecting a sample will depend on the nature of the population, purpose of study,
along with time and expenditure constraints.
3 SAMPLING AND NON-SAMPLING ERRORS
It is a matter of pure chance which sample is selected. Hence sampling errors are
due to chance factors. Sampling errors are observed only in a sample survey;
they are completely absent in the census method. Factors which contribute to
sampling errors are:
1. Heterogeneity or variability of the population.
2. Bias in the estimation method, if an incorrect formula is used for the statistic.
3. Sometimes, in a properly selected sample, some of the sample units
cannot be observed and are substituted by other units, and the
substitution introduces error.
4 DERIVING A SAMPLING DISTRIBUTION
Consider samples of size n = 2 drawn with replacement from a population
consisting of the three values 2, 6, and 10. The total number of equally likely
samples is 3² = 9. Following is the list of samples with their respective means
and variances, where

x̄ = Σxi/n and s² = Σ(xi − x̄)²/(n − 1)

(x1, x2): (2,2) (2,6) (2,10) (6,2) (6,6) (6,10) (10,2) (10,6) (10,10)
x̄:        2     4     6      4     6     8      6      8      10
s²:        0     8     32     8     0     8      32     8      0

Sampling distribution of x̄:
x̄:      2    4    6    8    10
P(x̄):   1/9  2/9  3/9  2/9  1/9

E(x̄) = 2(1/9) + 4(2/9) + 6(3/9) + 8(2/9) + 10(1/9) = 54/9 = 6
Var(x̄) = [4(1/9) + 16(2/9) + 36(3/9) + 64(2/9) + 100(1/9)] − 6²
        = 372/9 − 36 = 5.33 = 10.667/2 = σ²/n
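The tabulation above can be reproduced by brute-force enumeration. This sketch assumes, as in the example, a population of the values 2, 6 and 10, with samples of size 2 drawn with replacement:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

population = [2, 6, 10]
n = 2

# Enumerate all 3**2 = 9 equally likely samples and tally the sample means.
dist = Counter()
for sample in product(population, repeat=n):
    dist[Fraction(sum(sample), n)] += Fraction(1, len(population) ** n)

e_xbar = sum(x * p for x, p in dist.items())                   # E(x-bar)
var_xbar = sum(x * x * p for x, p in dist.items()) - e_xbar ** 2
print(e_xbar, var_xbar)   # 6 and 16/3, i.e. approximately 5.33
```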
We can similarly derive the distribution of sample variances and obtain the mean
and variance of the sampling distribution.

Sampling distribution of s²:
s²:      0    8    32
P(s²):   3/9  4/9  2/9

E(s²) = 0(3/9) + 8(4/9) + 32(2/9) = 96/9 = 10.667 = σ²
V(s²) = [0²(3/9) + 8²(4/9) + 32²(2/9)] − (96/9)²
      = 2304/9 − 9216/81 = 142.22
If the form of the population distribution is known, then the probability of
selection of a sample unit is the probability of its occurrence in the population.
Using this information, the probability of selection of a particular sample is the
joint probability of all the sample units; if the sample units are assumed to be
independent, this joint probability is the product of the probabilities of the
individual sample units. The probability associated with a particular sample is
also the probability of its mean and its variance: the probability of a sample
mean is the same as the probability of selecting the sample for which the mean
is computed, and similarly the probability of a sample variance is equal to the
probability of selecting the sample for which the variance is computed.
The value of the mean and variance of each of the 9 samples is equally likely,
with probability 1/9. Now we have three samples, (2,10), (6,6), and (10,2),
which have the same mean value 6. The probability of obtaining a mean value of
6 is the same as the probability of selecting any one of the three samples: (2,10)
or (6,6) or (10,2). The sum of the probabilities is 3/9, which is the probability of
x̄ = 6.
Thus the sampling distribution of means and variances can be derived from the
probability of selecting a sample that results in specific values of sample mean
and variance.
5.2 Simulation Experiments
Suppose the form of the population distribution and the sample size n are
specified. Then use a computer to obtain k different random samples, each of
size n, from the population distribution. For each such sample, calculate the
value of the statistic. From k replications we get k samples and k calculated
values of the statistic. Now construct a histogram for the k values. The
histogram gives the approximate sampling distribution of the statistic.

The larger the number of replications (k), the better will be the approximation of
the sampling distribution. In practice, k = 500 or 1000 is usually enough. The
actual sampling distribution is obtained as k → ∞.
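The recipe above can be sketched in a few lines of Python. The exponential population and the choice of the sample mean as the statistic are illustrative assumptions, not from the text:

```python
import random
import statistics

def simulated_sampling_distribution(draw, n, k, statistic, seed=0):
    """Return k replicated values of `statistic`, each computed from a sample of size n."""
    rng = random.Random(seed)
    return [statistic([draw(rng) for _ in range(n)]) for _ in range(k)]

# k = 1000 replications of the mean of samples of size n = 30
# drawn from an exponential population with mean 1.
values = simulated_sampling_distribution(
    lambda rng: rng.expovariate(1.0), n=30, k=1000, statistic=statistics.mean)
print(statistics.mean(values))   # centred near the population mean 1
```

A histogram of `values` approximates the sampling distribution; increasing k sharpens the approximation, as described above.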
6 DISTRIBUTION OF THE SAMPLE MEAN
The sample mean X̄ is useful as it can be used to draw conclusions about the
population mean μ. Some of the most frequently used inferential procedures are
based on the sample mean.
Let X1, X2,….Xn be a random sample from a population with mean μ and standard
deviation σ. Since this is a random sample, the Xi’s are independently and
identically distributed.
Since the sample units are drawn at random from the same parent population
distribution, with mean μ and variance σ2, each observation Xi is independently
and identically distributed with mean μ and variance σ2.
Since any unit of the population could have been selected, and any unit could
take any of the population values with their respective probabilities, E(Xi) = μ
and V(Xi) = σ². In other words, because any population unit could be selected in
the sample, every sample unit could be any one of the population values. Given
that the population has a distribution with different population values associated
with some probability, every sample unit has the same probability distribution as
the population.
X̄ = (1/n) Σ Xi is a linear combination of n independent random variables Xi,
each with mean μ and variance σ². Hence,

E(X̄) = E[(1/n) Σ Xi] = (1/n) Σ E(Xi) = (1/n)·n·μ = μ, i.e., μ_X̄ = μ

and

V(X̄) = V[(1/n) Σ Xi] = (1/n²) Σ V(Xi) = (1/n²)·n·σ² = σ²/n, i.e., σ²_X̄ = σ²/n

Thus, the mean of the sampling distribution is independent of sample size, but
the variance of the sampling distribution decreases as the sample size increases.
As the sample size increases we obtain more information from the sample, and
we can expect the value of the sample mean to be closer to the population mean
value. As the sample size increases, the sampling distribution becomes narrower
and the sample means become clustered closer to μ. In the limit, as the sample
size increases indefinitely, the sampling distribution collapses to a single point
and each and every sample mean will be equal to the population mean; that is,
σ_X̄ = σ/√n → 0 as n → ∞.

As long as n > 1, σ_X̄ < σ_X. The reason is that, for each sample, the sample
mean X̄ averages out the variability of the observations within the sample.
The sample mean is a central value for each sample. Although the value of the
sample mean is affected by all sample values, by its very nature the mean value
must lie somewhere in the middle of the range of sample values. This is true for
each and every possible sample drawn from the population. Thus the variability in
values of sample means must be less than the variability in the population values
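A short simulation makes the shrinkage of V(X̄) concrete. The normal population with σ = 2 and samples of size n = 25 are illustrative values; the variance of the simulated sample means should settle near σ²/n = 0.16:

```python
import random
import statistics

rng = random.Random(1)
sigma, n, k = 2.0, 25, 20000

# Variance of 20,000 sample means, each computed from a sample of size 25
# drawn from a normal population with mean 0 and standard deviation sigma.
means = [statistics.fmean(rng.gauss(0, sigma) for _ in range(n)) for _ in range(k)]
print(statistics.variance(means))   # close to sigma**2 / n = 0.16
```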
If sampling is without replacement from a finite population of size N, the sample
mean is distributed with mean μ and variance (σ²/n)·(N − n)/(N − 1). However,
if the sample size is very small relative to the population size, i.e., when
n < 0.05N, then (N − n)/(N − 1) → 1 and σ²_X̄ → σ²/n.
P(X̄ < 10) = P(Z ≤ (10 − 12.2)/1.2) = P(Z ≤ −1.83) = 0.0336
If sampling is (a) with replacement, then σ_X̄ = 3/√25 = 0.6, and if
(b) without replacement, then σ_X̄ = (3/√25)·√((3000 − 25)/(3000 − 1)) ≈ 0.6.

[We obtain similar results for (a) and (b) since n = 0.0083N, well below 0.05N.]
P(X̄ > 69) = 1 − P(X̄ < 69) = 1 − P(Z ≤ (69 − 68)/0.6) = 1 − P(Z ≤ 1.67)
          = 1 − 0.9525 = 0.0475

Number of samples with X̄ > 69 = (80)(0.0475) = 3.8 ≈ 4
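Tail probabilities like the one above can be computed without tables. A minimal standard-library sketch, expressing the normal CDF through the error function:

```python
import math

def normal_cdf(z):
    """Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(Xbar > 69) with mu = 68 and sigma_xbar = 0.6, as in the example above.
z = (69 - 68) / 0.6
p = 1 - normal_cdf(z)
print(z, p)   # z ≈ 1.67; p ≈ 0.048 (tables give 0.0475 at z = 1.67)
```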
8 CENTRAL LIMIT THEOREM
The central limit theorem states that when an infinite number of successive
random samples of the same size n are taken from a given population with
mean μ and variance σ², the distribution of sample means X̄ will be
approximately normally distributed with mean μ and standard deviation σ/√n,
provided n is sufficiently large, irrespective of the shape of the population
distribution.
The larger the value of n, the better is the approximation. Even when the
population distribution is highly non-normal, the averaging of sample values
while computing X̄ produces a distribution more bell-shaped than the population
itself. If n is large, a suitable normal curve will approximate the actual
distribution of X̄. That is why the sampling distribution of X̄ is said to be
asymptotically normal. This is illustrated in figure 2.
The red curve in figure 2 is the positively skewed population distribution. The
green and blue curves depict two distributions of sample means for different
sample sizes where n1 < n2. The distribution of X with sample size n1 is less
skewed than the distribution of the rv X. As sample size is increased suitably to
n2, the distribution of X is approximately normal.
How large the sample size must be depends on how far the shape of the
population being sampled departs from a normal distribution. In many cases the
sampling distribution quickly approaches a normal distribution, as in the case of
a population with a uniform distribution, where a sample size of 12 is sufficient.
In some other cases a sample size of 60 or more may be required. There is no
hard and fast rule about the sample size required for the sampling distribution of
means to be normally distributed. In practice, quite satisfactory approximations
can be obtained for n > 30, provided N > 2n, where N is the population size.
CLT plays an important role in estimation and tests of hypotheses about the mean
as the probability distribution of the population being sampled is often not known.
The central limit theorem enables us to use the normal distribution as an
approximation of the distribution of sample means.
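A simulation illustrates the theorem. The strongly right-skewed exponential population and the choices n = 36 and k = 10000 are illustrative assumptions, not from the text:

```python
import random
import statistics

rng = random.Random(42)
n, k = 36, 10000

# Exponential population: mean 1, variance 1, strongly right-skewed.
means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(k)]

# By the CLT, Xbar is approximately N(1, 1/36), so sd(Xbar) should be near 1/6.
print(statistics.fmean(means), statistics.stdev(means))
```

A histogram of `means` is close to bell-shaped even though the population itself is highly skewed.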
σ = √48 = 6.928. Based on the CLT, X̄ will be approximately
normal with E(X̄) = 36 and σ_X̄ = 0.49.
10 DISTRIBUTION OF THE SUM AND DIFFERENCE OF SAMPLE MEANS
Let X̄1 be the mean of a random sample of size n1 drawn from a population with
mean μ1 and variance σ1², and let X̄2 be the mean of a random sample of size
n2 drawn from a population with mean μ2 and variance σ2². Then
E(X̄1 ± X̄2) = μ1 ± μ2 and V(X̄1 ± X̄2) = σ1²/n1 + σ2²/n2, irrespective of the
sample sizes.

If the populations are not known to be normal and the sample sizes are
sufficiently large (n1 > 30, n2 > 30), then by the CLT the distribution of
(X̄1 + X̄2) will be approximately normal.

If the population variances σ1² and σ2² are not known and are estimated by the
sample variances S1² and S2² respectively, then by the CLT, if n1 > 40,
n2 > 40, the distribution will be approximately normal with standard deviation
√(S1²/n1 + S2²/n2).
If the populations are finite, of sizes N1 and N2, and the samples from the two
populations are drawn without replacement, then the finite population correction
factor must be applied to the variance. If the population variances are known and
the populations are normally distributed, the distribution of the sum or difference
of sample means is normal for any sample sizes. If the populations are not
normal, or the population variances are unknown and estimated from sample
data, then the sample sizes must be large enough.
If both random samples are independent and from the same population, so that
μ1 = μ2 = μ and σ1² = σ2² = σ², then E(X̄1 + X̄2) = 2μ and E(X̄1 − X̄2) = 0.
The standard deviation of both the sum and the difference is
√(σ²/n1 + σ²/n2).

σ²_X̄ = 1.8²/100 = 0.0324 and σ²_Ȳ = 2²/100 = 0.04, so
σ_(X̄ − Ȳ) = √(0.0324 + 0.04) = √0.0724 = 0.269
Since the sample sizes are large, by the CLT the distribution of (X̄ − Ȳ) is
approximately normal with mean μ_(X̄ − Ȳ) = μ_X − μ_Y = 0.4 hours, and

P(X̄ − Ȳ < 0) = P(Z ≤ (0 − 0.4)/0.269) = P(Z ≤ −1.49)
             = 1 − P(Z ≤ 1.49) = 1 − 0.9319 = 0.0681
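The arithmetic of this example can be checked with an error-function-based normal CDF (a sketch; the inputs σ_X = 1.8, σ_Y = 2, samples of size 100, and the 0.4-hour mean difference are taken from the example above):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sd_diff = math.sqrt(1.8**2 / 100 + 2**2 / 100)   # sqrt(0.0724) ≈ 0.269
z = (0 - 0.4) / sd_diff                          # ≈ -1.49
p = normal_cdf(z)                                # P(Xbar - Ybar < 0)
print(sd_diff, p)   # ≈ 0.269 and ≈ 0.069 (tables give 0.0681 at z = 1.49)
```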
References:
1. Jay L. Devore, Probability and Statistics for Engineering and the Sciences,
8th Edition, Cengage Learning.
2. Irwin Miller and Marylees Miller, Mathematical Statistics, Seventh Edition,
Pearson.
3. A. L. Nagar and R. K. Das, Basic Statistics, Second Edition, Oxford University
Press.
PRACTICE QUESTIONS
4. What properties of the SRS help in deriving the sampling distribution of means?
OR
Why are random samples commonly used to obtain estimates of unknown
parameters?
10. A random sample of size 81 is taken from an infinite population with the
mean = 128 and the standard deviation σ = 6.3. With what probability
can we assert that the value we obtain for the sample mean X will not fall
between 126.6 and 129.4?
11. The mean production level at a firm is assumed to be 47.3 units per day
with a standard deviation of 12.7. The manager takes a sample of output
for 25 days. If the sample mean exceeds 49 units then the workers are
promised a Diwali bonus. How likely are the employees to get the bonus?
What assumption did you make, if any?
12. Independent random samples of size 400 are taken from each of two
populations having equal means and standard deviations σ1 = 20 and σ2 = 30.
What can we assert with a probability of 0.99 about the value of the
difference in sample means?
14. Given that test scores in an entrance examination are normally distributed
with a mean of 30 and a standard deviation of 6:
(i) What is the probability that a single score drawn at random will be
greater than 34?
(ii) What is the probability that a sample of 9 scores will have a mean
greater than 34?
(iii) Explain the difference in the results obtained in parts (i) and (ii)