Sampling & Sampling Distribution
Sampling
Sampling is the process of selecting individual units from a population. It
starts with the formulation of the objective of the study and ends with the
collection of individual units using an appropriate technique.
The intermediate steps include selecting the target population and the sampling
units, designing a sampling frame, and determining an appropriate sample
size. A sampling frame is the complete list of sampling units from which
the sample is to be selected. Examples of sampling frames are telephone
directories, electoral rolls, the list of books in a library, the list of students enrolled
in a university, the list of schools and colleges in a country, the list of employees
working in a firm, the list of workers in a garments factory, etc. Sometimes
these lists already exist and can be readily obtained from the respective
authority. Sometimes they have to be prepared at extra cost before
units are selected. The effectiveness of sampling depends mainly on the
construction of an appropriate sampling frame.
The term sampling refers to the process of collecting a sample from a
population. The term is sometimes used as a synonym for sample survey,
which means studying the characteristics of a population through a sample.
Sample: A sample is a representative part of a population.
Sampling frame: A sampling frame is the complete list of all sampling units
of the target population. A sampling frame must be prepared before
sampling is carried out.
Sampling: Sampling is defined as the whole process involved in collecting a
sample from a target population for a particular study.
Purpose of sampling: A sample is not studied for its own sake. The basic
objective of its study is to draw inferences about the population. In other
words, sampling is only a tool which helps us to know the characteristics
of the universe or population by examining only a small portion of it. The
values or characteristics obtained from the study of a sample are called
statistics (statistic in the singular), while their counterparts in the
population are called parameters.
Principles of Sampling: The following are two important principles which
determine the possibility of arriving at a valid statistical inference about the
features of a population or process:
i) Principle of statistical regularity: The principle is based on the
mathematical theory of probability. According to King, ‘The law of statistical
regularity lays down that a moderately large number of items chosen at
random from a large group are almost sure on the average to possess the
characteristics of the large group’. This principle implies that a sample
taken at random from a population of interest is likely to possess the features of the
parent population. Thus a random sample is one in which items are
chosen from a population in such a way that each item has an equal
probability of being selected in the sample. When the term random sample
is used without any specification, it usually refers to a simple random
sample. Such a sample would be representative of the population, and only
Methods of Sampling
When a sample is to be selected from a population, it is necessary
to decide which method of sampling should be applied. The various
methods of sampling, or sampling designs, can be grouped under the heads
there is no distinct pattern. These numbers are computer generated and are
truly random.
In random number sampling, each element of the population is assigned a
number. For example, for a population of size 400, the numbers 001, 002,
..., 400 are usually assigned. Once this has been done, one can use
tables of random numbers to select the sample. Although such tables are
available in most books on statistics and sampling theory, for the sake
of explanation, a sample of random numbers is provided below:
3905 9796
0946 9133
0106 6465
1840 9779
7056 3015
9736 5661
9915 5686
5614 7123
5477 6629
5701 8733
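The numbering-and-selection procedure described above can also be carried out in software rather than with a printed table. The following is a minimal Python sketch, assuming units labelled 1 to 400 and a sample size of 10; the function name `draw_simple_random_sample`, the sample size and the seed are illustrative assumptions, not part of the text.

```python
import random


def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct unit labels drawn from 1..population_size."""
    rng = random.Random(seed)  # seeded generator so a draw can be reproduced
    labels = range(1, population_size + 1)
    # sample() draws without replacement, so no label can appear twice
    return sorted(rng.sample(labels, sample_size))


if __name__ == "__main__":
    chosen = draw_simple_random_sample(400, 10, seed=1)
    # Print labels in the three-digit form 001, 002, ..., 400
    print([f"{unit:03d}" for unit in chosen])
```

Every label has the same chance of entering the sample, which is the defining property of a simple random sample.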
of size n is to be drawn, the total number of samples will be $N^n$. However,
the number of distinct samples of size n that can be drawn from the N units
is given by the combinatorial formula:
$$\binom{N+n-1}{n} = \frac{(N + n - 1)!}{n!\,(N - 1)!}$$
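As a quick numerical check of the two counts above, the sketch below evaluates $N^n$ and the combinatorial formula for a small case. The function name `count_samples_with_replacement` and the values N = 4, n = 2 are illustrative assumptions.

```python
from math import comb


def count_samples_with_replacement(N, n):
    """Return (ordered samples, distinct unordered samples) of size n from N units."""
    ordered = N ** n                # every draw has N possibilities
    distinct = comb(N + n - 1, n)   # (N + n - 1)! / (n! (N - 1)!)
    return ordered, distinct


if __name__ == "__main__":
    print(count_samples_with_replacement(4, 2))
```

For N = 4 and n = 2 there are 16 ordered samples but only 10 distinct ones, because pairs such as (0, 1) and (1, 0) contain the same units.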
random, it would be much simpler if we could select only the first unit
randomly with the help of random numbers, and let the rest of the units be
selected automatically or systematically according to some pre-designed
pattern. This type of sampling is known as systematic sampling. In
this case, the sample is selected at regular intervals from an ordered list of
sampling units. To select a sample of size n from a population of
size N, let N = nk, where n is the number of groups and k is the number of
items in each group. In this method the first unit is selected randomly from
the first k units listed in the sampling frame. Let this be the rth unit of
the first group of k units; then the rth unit is selected from each of the
subsequent n-1 groups, so that a sample of size n is obtained. That is, the (k + r)th
item will be the second member of the sample, the (2k + r)th item will be the third
member of the sample, and so on.
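The selection rule above (a random start r among the first k units, then positions r, k + r, 2k + r, ...) can be sketched in Python as follows. This is a minimal illustration that assumes N = nk exactly; the function name `systematic_sample` and the example frame of 20 units are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Return (r, sample): the random start r and the n systematically chosen units."""
    N = len(frame)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    r = rng.randint(1, k)           # random start among the first k units (1-based)
    positions = [r + j * k for j in range(n)]   # r, k + r, 2k + r, ...
    return r, [frame[p - 1] for p in positions]


if __name__ == "__main__":
    frame = list(range(1, 21))      # hypothetical frame of N = 20 units
    start, chosen = systematic_sample(frame, n=5, seed=0)
    print(start, chosen)
```

Only one random number is needed, which is why the method is operationally simpler than drawing every unit at random.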
The procedure of systematic selection is easier and more convenient than
simple random sampling. It provides a more even spread of the sample over
the population list and hence often leads to greater precision. The dependence
or linkage of one member of the sample on the previous one makes the
process different from the simple random sampling method, in which the selection
of every member is independent of the others. That is why the method is
sometimes termed quasi-random sampling or mixed sampling.
This method of sampling is appropriate when the population is too large for
a simple random sample, or when a quick sample is to be selected and an equal chance
of selection for all units is not a concern. It is especially
useful for populations with a more or less definite periodic trend, for
example, weekly sales, 12-monthly rainfall, quarterly remittance, etc.
The main advantage of this sampling is its simplicity of selection, operational
convenience and even spread of the sample over the population. The second
advantage is that, because the sample is so simple to draw, it is very
useful for large samples.
The serious disadvantage of systematic sampling lies in its use with
populations having unforeseen periodicity, which may contribute substantial
bias to the estimate of the parameter. Likewise, if the list itself is biased,
serious error may arise in estimation. Moreover, it does not provide a truly
random sample; the sample is random only if the ordered list of the population is itself in
random order.
Quota Sampling: The chief characteristic of simple, stratified and
systematic sampling is that a known probability is associated with the
selection of every individual of the sample; that is, the sample is
random or quasi-random. Sometimes non-random sampling methods are
also used when it is not possible to use random sampling, particularly
when the whole population is not known.
Quota sampling is an example of non-probability sampling. It involves the
selection of sample units within each group or quota on the basis of the
judgment of interviewers rather than on a calculable chance of being included
in the sample. The interviewer is given considerable freedom in choosing the individual
cases. Quota sampling is a method in which an interviewer is instructed to
interview a certain number of respondents with specific characteristics. The
quotas are fixed before sampling takes place, and they are chosen so that
they reflect the known population characteristics. Age, sex and social class
are the three most commonly used quota controls.
It is useful when the number of sampling units is fixed in advance for groups of
the population with the same characteristics.
This method is extensively used in opinion surveys, for example product
satisfaction surveys, opinion polls, etc. Suppose a company wants to
know the customers’ opinion regarding the quality of its product, and
decides to take opinions about quality from 100 female and male consumers
of apparently young and old age as per the following table:
Age Group                          Sex       Number
Young (Age group 20-40 years)      Male      25
                                   Female    10
Old (Age group above 40 years)     Male      40
                                   Female    25
Non-Sampling Error: An error which may arise at any stage of
an investigation, either in a census or in sampling, is termed a non-sampling
error. This type of error arises from a faulty questionnaire, from
non-response, from a faulty tabulation method, etc.
However, the non-sampling error tends to increase with the sample size,
while the sampling error decreases as the sample size increases. In the case of
complete enumeration, non-sampling errors, and in the case of a sample survey,
both sampling and non-sampling errors, need to be controlled and
reduced to a level at which their presence does not distort the final results.
Sampling Distribution
Parameter: Any unknown constant, or any function of such constants, that appears in
the mathematical specification of a population is known as a parameter.
Any numerical quantity calculated from the population data is also called a
parameter.
It can be seen that the possible values of the sample mean tend to be close to
the population mean, and according to the central limit theorem, the
distribution of these sample means tends to be approximately normal; as a
common rule of thumb, this holds for a sample size larger than 30.
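The behaviour described above can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a clearly non-normal choice made here, not in the text), the sample size is n = 36, and the function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics


def simulate_sample_means(n, replications, seed=0):
    """Draw `replications` samples of size n from an exponential(1) population
    and return the list of their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


if __name__ == "__main__":
    means = simulate_sample_means(n=36, replications=5000)
    # Theory: the means centre on the population mean (1.0) with
    # standard error sigma / sqrt(n) = 1 / 6
    print(round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The simulated means cluster around the population mean with a spread close to σ/√n, even though the population itself is strongly skewed.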
Remarks.
1. The central limit theorem holds only if the mean and variance of the
distribution from which the random sample is drawn exist.
2. If the random sample has been drawn from a normal population,
then the sampling distribution of $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is exactly N(0,1) for
any sample size n.
However, if X1, X2, ..., Xn are n independently and identically distributed
random variables, each of which is normally distributed with mean µ and
variance $\sigma^2$, then
$$\chi^2_n = \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^2$$
is distributed as $\chi^2$ with n df.
The probability density function of $\chi^2$ with n degrees of freedom is
$$f(\chi^2) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, e^{-\chi^2/2}\, (\chi^2)^{n/2 - 1}; \quad \chi^2 > 0$$
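The density above can be checked numerically. The following sketch evaluates the $\chi^2$ density with n degrees of freedom and integrates it with a simple trapezoidal rule to confirm that the total area is 1, the mean is n and the variance is 2n. The helper names `chi2_pdf` and `integrate`, the choice n = 5 and the upper integration limit of 100 are assumptions made for illustration.

```python
import math


def chi2_pdf(x, n):
    """Density of the chi-square distribution with n degrees of freedom."""
    if x <= 0:
        return 0.0
    return math.exp(-x / 2) * x ** (n / 2 - 1) / (2 ** (n / 2) * math.gamma(n / 2))


def integrate(func, lower, upper, steps=100_000):
    """Approximate the integral of func over [lower, upper] by the trapezoidal rule."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * h)
    return total * h


if __name__ == "__main__":
    n = 5
    area = integrate(lambda x: chi2_pdf(x, n), 0.0, 100.0)
    mean = integrate(lambda x: x * chi2_pdf(x, n), 0.0, 100.0)
    second = integrate(lambda x: x * x * chi2_pdf(x, n), 0.0, 100.0)
    print(round(area, 4), round(mean, 4), round(second - mean ** 2, 4))
```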
Important Properties of the $\chi^2$ distribution
i) The distribution contains only one parameter, which is the degrees
of freedom of the distribution.
ii) The mean of the distribution is n and the variance is 2n.
iii) The mode of the distribution is n-2 (for n ≥ 2).
iv) It is a positively skewed distribution for smaller values of n; the
distribution becomes symmetrical as n tends to infinity.
Applications of Chi-square distribution: The $\chi^2$-distribution has a large
number of applications in statistics, some of which are enumerated below:
i) To test the goodness of fit.
ii) To test the population variance.
iii) To test the independence of two attributes in a contingency table.
iv) To test the homogeneity of several variances.
v) To test the equality of several population correlation coefficients.
vi) To test the equality of several proportions.
Student’s t-Distribution: Let X1, X2, ..., Xn be a random sample from a
normal distribution with mean µ and variance $\sigma^2$; then $\bar{x}$ is
normally distributed with mean µ and variance $\sigma^2/n$. Now, if the
estimators of µ and $\sigma^2$ are given by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
respectively, then the statistic t is defined as
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
which follows Student’s t distribution with n-1 degrees of freedom (df).
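The computation of the t statistic from a sample can be sketched as follows. The data values and the hypothesised mean µ = 12 are made up for illustration, and the function name `t_statistic` is an assumption; `statistics.stdev` uses the n − 1 divisor, matching the definition of s above.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Return (t, df) for a sample and a hypothesised population mean mu."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample standard deviation, divisor n - 1
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    data = [10, 12, 14, 16, 18]             # hypothetical observations
    t, df = t_statistic(data, mu=12)
    print(round(t, 4), df)
```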
A continuous random variable t is said to have a t-distribution with n df if
its probability density function is given by
$$f(t) = \frac{1}{\sqrt{n}\, B(1/2,\, n/2)} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}; \quad -\infty < t < \infty$$
Properties of t distribution.
i) The distribution has only one parameter, which is the degrees of
freedom of the distribution.
ii) The distribution is symmetric about its mean of zero, its variance is
n/(n − 2) for n > 2, and all odd-order moments of the t-distribution are zero.
iii) Since the distribution is symmetric about t = 0, the mean,
median and mode are all zero.
iv) As the degrees of freedom increase, the t-distribution tends to the normal
distribution. In practice, the t-distribution is well approximated by the normal distribution
when n > 30.
Application of t-distribution: The t-distribution is used in the following
cases:
i. To test the mean of a population when the variance is unknown and
the sample size is less than 30.
ii. To test the equality of two population means when the variances are
equal but unknown and the sample sizes are small.
iii. To test whether the population correlation coefficient ρ = 0.
iv. To test whether the population regression coefficient β = 0 or β = β0.
v. To test the equality of two independent regression coefficients.
Thus, $\chi_1^2 = \frac{(n_1 - 1)s_1^2}{\sigma_1^2}$ is a $\chi^2$-variate with (n1 - 1) df
and $\chi_2^2 = \frac{(n_2 - 1)s_2^2}{\sigma_2^2}$ is a $\chi^2$-variate with (n2 - 1) df.
Since the two samples are independent, these $\chi^2$-variates are also
independent. The ratio of two independent chi-squares, each divided by
its degrees of freedom, is called an F-variate, and it is defined as:
$$F = \frac{\chi_1^2/(n_1 - 1)}{\chi_2^2/(n_2 - 1)} = \frac{s_1^2}{s_2^2},$$
where the last equality holds when $\sigma_1^2 = \sigma_2^2$.
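A short sketch of the variance ratio follows, assuming two independent samples and equal population variances so that F reduces to the ratio of the two sample variances. The data and the function name `f_ratio` are hypothetical; `statistics.variance` uses the n − 1 divisor.

```python
import statistics


def f_ratio(sample1, sample2):
    """Return (F, df1, df2), where F is the ratio of the two sample variances."""
    s1_sq = statistics.variance(sample1)    # divisor n1 - 1
    s2_sq = statistics.variance(sample2)    # divisor n2 - 1
    return s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1


if __name__ == "__main__":
    first = [2, 4, 6, 8, 10]                # hypothetical sample 1
    second = [1, 2, 3, 4, 5]                # hypothetical sample 2
    print(f_ratio(first, second))
```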
Properties of F-distribution
i) The distribution contains two parameters, which are the degrees of
freedom $\nu_1$ and $\nu_2$ of the distribution.
ii) The mean and variance of the F-distribution are
$$\text{Mean} = \frac{\nu_2}{\nu_2 - 2}; \quad \nu_2 > 2$$
and
$$\text{Variance} = \text{var}(F) = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}; \quad \nu_2 > 4.$$
iii) The mode of the distribution is $\text{Mode} = \frac{\nu_2(\nu_1 - 2)}{\nu_1(\nu_2 + 2)}$, which is always less than
unity. The mode of the distribution exists if $\nu_1 > 2$.
iv) The distribution is positively skewed.
And the corresponding standard error of the mean is given by $se(\bar{X}) = \frac{\sigma}{\sqrt{n}}$,
which shows that the variance or standard error of the mean decreases as the
sample size n increases. The above results are also true for all possible
samples of size n drawn with replacement from a finite population of size
N.
Theorem 1. If all possible random samples of size n are drawn with
replacement from a finite population of size N with mean µ and standard
deviation σ, then the sampling distribution of the mean $\bar{X}$ follows a
distribution with mean µ and standard deviation $\sigma/\sqrt{n}$.
Remarks. According to the central limit theorem, $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows the standard
normal distribution, which has mean zero and variance one. We now verify
the theorem with the help of an example.
Example 1. Suppose a population consists of the four values 0, 1, 2, 3. Draw
all possible samples of size 2 with replacement and show that the sample mean
satisfies the above theorem.
Solution. The population mean is $\mu = \frac{0+1+2+3}{4} = 1.5$.
Population variance $= \sigma^2 = \frac{1}{N}\sum (x - \mu)^2 = \frac{1}{4}\left[(0 - 1.5)^2 + (1 - 1.5)^2 + (2 - 1.5)^2 + (3 - 1.5)^2\right] = \frac{5}{4}$.
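Mean of $\bar{X}$ is given by
$$E(\bar{X}) = \sum \bar{x}_i\, p(\bar{x}_i) = 0 \times \tfrac{1}{16} + 0.5 \times \tfrac{2}{16} + 1 \times \tfrac{3}{16} + 1.5 \times \tfrac{4}{16} + 2 \times \tfrac{3}{16} + 2.5 \times \tfrac{2}{16} + 3 \times \tfrac{1}{16}$$
$$= \frac{1+3+6+6+5+3}{16} = \frac{24}{16} = 1.5 = \mu = \text{population mean}.$$
Variance of $\bar{X}$ is given by
$$\sigma^2(\bar{X}) = \sum (\bar{x} - \mu)^2\, p(\bar{x})$$
$$= (0 - 1.5)^2 \tfrac{1}{16} + (0.5 - 1.5)^2 \tfrac{2}{16} + (1 - 1.5)^2 \tfrac{3}{16} + (1.5 - 1.5)^2 \tfrac{4}{16} + (2 - 1.5)^2 \tfrac{3}{16} + (2.5 - 1.5)^2 \tfrac{2}{16} + (3 - 1.5)^2 \tfrac{1}{16}$$
$$= \frac{1}{16}(2.25 + 2 + 0.75 + 0 + 0.75 + 2 + 2.25) = \frac{10}{16} = \frac{5}{8} = \frac{5/4}{2} = \frac{\sigma^2}{n}.$$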
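The enumeration in Example 1 can be reproduced exactly with a few lines of Python. The sketch below lists all 16 ordered samples of size 2 drawn with replacement from the population 0, 1, 2, 3 and uses exact fractions; the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def sampling_distribution_with_replacement(population, n):
    """Map each possible sample mean to its probability (ordered samples, equally likely)."""
    samples = list(product(population, repeat=n))
    weight = Fraction(1, len(samples))
    dist = {}
    for sample in samples:
        mean = Fraction(sum(sample), n)
        dist[mean] = dist.get(mean, 0) + weight
    return dist


def mean_and_variance(dist):
    """Return the mean and variance of a discrete distribution {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var


if __name__ == "__main__":
    dist = sampling_distribution_with_replacement([0, 1, 2, 3], 2)
    print(mean_and_variance(dist))  # compare with mu = 3/2 and sigma^2 / n = 5/8
```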
Theorem 2. If all possible random samples of size n are drawn without
replacement from a finite population of size N with mean µ and standard
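$$= (0.5 - 1.5)^2 \tfrac{1}{6} + (1 - 1.5)^2 \tfrac{1}{6} + (1.5 - 1.5)^2 \tfrac{1}{3} + (2 - 1.5)^2 \tfrac{1}{6} + (2.5 - 1.5)^2 \tfrac{1}{6}$$
$$= \frac{1}{6}(1 + 0.25 + 0 + 0.25 + 1) = \frac{2.5}{6} = \frac{5}{12}$$
$$= \frac{5/4}{2} \cdot \frac{4 - 2}{4 - 1} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}.$$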
$\bar{x}$      f      $p(\bar{x}) = f/k$
0.5    1      1/6
1.0    1      1/6
1.5    2      1/3
2.0    1      1/6
2.5    1      1/6
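The table and the variance calculation for sampling without replacement can be checked by enumeration. The sketch below assumes the same population 0, 1, 2, 3 and sample size 2 as in Example 1 (the frequencies in the table are consistent with that population); the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def mean_distribution_without_replacement(population, n):
    """Map each sample mean to its probability over all unordered samples of size n."""
    samples = list(combinations(population, n))
    weight = Fraction(1, len(samples))
    dist = {}
    for sample in samples:
        mean = Fraction(sum(sample), n)
        dist[mean] = dist.get(mean, 0) + weight
    return dist


if __name__ == "__main__":
    population, n = [0, 1, 2, 3], 2
    N = len(population)
    dist = mean_distribution_without_replacement(population, n)
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    sigma_sq = Fraction(sum((x - mu) ** 2 for x in population), N)
    # The variance should equal (sigma^2 / n) * (N - n) / (N - 1)
    print(var, sigma_sq / n * Fraction(N - n, N - 1))
```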
Example 3. An MBA class has a total of 60 students. Their average score in
statistics after the final term was 70 with a standard deviation of 8. A sample of
36 students is taken at random from this class. Calculate the standard error
of the mean for this sample.
Solution. Here N = 60, n = 36 and σ = 8. Since the sample size is not a small
portion of the population size, the standard error of the mean is given by
$$\sigma(\bar{X}) = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}} = \frac{8}{\sqrt{36}} \cdot \sqrt{\frac{60 - 36}{60 - 1}} = 0.85.$$
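The arithmetic in Example 3 can be reproduced as follows. This is a small sketch of the standard error with the finite population correction; the function name `standard_error_finite` is an assumption.

```python
import math


def standard_error_finite(sigma, n, N):
    """Standard error of the mean with the finite population correction."""
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))


if __name__ == "__main__":
    # Example 3: sigma = 8, n = 36, N = 60
    print(round(standard_error_finite(8, 36, 60), 2))
```

Without the correction factor the standard error would be 8/6 ≈ 1.33, so ignoring the finite population would noticeably overstate the uncertainty here.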
Example 4. A large bag contains some counters, 60% of the counters have
the number 0 on them and 40% have the number 1. A random sample of 3
counters is taken from the bag, find the sampling distribution of sample
mean and sample mode.
Solution. Let X be the number on a counter, which can take the values 0 and 1.
The distribution of the population is given by
Values of X: x : 0 1
p(x) : 0.6 0.4
And the population mean is E(X) = 0 × 0.6 + 1 × 0.4 = 0.4.
The possible samples of size 3 are
(0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1), (1,1,1)
Values of $\bar{X}$: $\bar{x}$ : 0 1/3 2/3 1
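The probabilities attached to these values can be obtained by enumerating the eight samples. The sketch below assumes the bag is large enough for the three draws to be treated as independent, each with P(0) = 0.6 and P(1) = 0.4; the function name `mean_and_mode_distributions` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Population distribution of the number on a counter
P = {0: Fraction(3, 5), 1: Fraction(2, 5)}


def mean_and_mode_distributions(n=3):
    """Return the sampling distributions of the sample mean and the sample mode."""
    mean_dist, mode_dist = {}, {}
    for sample in product(P, repeat=n):
        prob = Fraction(1)
        for value in sample:
            prob *= P[value]                    # independent draws
        mean = Fraction(sum(sample), n)
        mode = 1 if 2 * sum(sample) > n else 0  # n is odd, so ties cannot occur
        mean_dist[mean] = mean_dist.get(mean, 0) + prob
        mode_dist[mode] = mode_dist.get(mode, 0) + prob
    return mean_dist, mode_dist


if __name__ == "__main__":
    mean_dist, mode_dist = mean_and_mode_distributions()
    for value in sorted(mean_dist):
        print(value, float(mean_dist[value]))
    print({m: float(p) for m, p in mode_dist.items()})
```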
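Sample mean ($\bar{x}$)   3     4     5     6     7     8     9
$p(\bar{x})$          0.1   0.1   0.2   0.2   0.2   0.1   0.1
Variance $= \sigma^2 = \frac{1}{5}\left[(2 - 6)^2 + (4 - 6)^2 + (6 - 6)^2 + (8 - 6)^2 + (10 - 6)^2\right] = \frac{16 + 4 + 0 + 4 + 16}{5} = \frac{40}{5} = 8$
The mean of the sample mean is
$$E[\bar{X}] = 3 \times .1 + 4 \times .1 + 5 \times .2 + 6 \times .2 + 7 \times .2 + 8 \times .1 + 9 \times .1 = .3 + .4 + 1.0 + 1.2 + 1.4 + .8 + .9 = 6.0.$$
$$\sigma^2_{\bar{X}} = \sum (\bar{x} - \mu)^2\, p(\bar{x}) = 9 \times .1 + 4 \times .1 + 1 \times .2 + 0 + 1 \times .2 + 4 \times .1 + 9 \times .1 = 3.0 = \frac{8}{2} \cdot \frac{5 - 2}{5 - 1} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}.$$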
4.0 2/15
4.5 1/15
5.0 3/15
5.5 1/15
6.0 2/15
6.5 2/15
7.0 2/15
7.5 1/15
In this case too, it can easily be shown that the mean of the sampling
distribution is exactly the same as the population mean µ = 5.5.
Example 7. The time between two arrivals in a queuing process of cars on a
busy road is normally distributed with a mean of 2 minutes and a standard
deviation of 0.25 minutes. If a random sample of 36 such cars is taken, what
is the probability that the sample mean will be greater than 2.1 minutes?
Solution. Since the population is normally distributed, the
sampling distribution of the sample mean will follow a normal distribution
with mean $\mu_{\bar{x}} = 2$ and standard error $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.25}{\sqrt{36}} = 0.042$.
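The remaining step of Example 7, converting 2.1 minutes to a Z value and reading the upper-tail probability, can be sketched as below. The standard normal tail is computed from `math.erfc` instead of a printed table, and the function names are illustrative assumptions.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def prob_sample_mean_exceeds(threshold, mu, sigma, n):
    """Return (z, P(sample mean > threshold)) for a normal population."""
    se = sigma / math.sqrt(n)               # standard error of the mean
    z = (threshold - mu) / se
    return z, normal_upper_tail(z)


if __name__ == "__main__":
    z, p = prob_sample_mean_exceeds(2.1, mu=2, sigma=0.25, n=36)
    print(round(z, 2), round(p, 4))
```

With the unrounded standard error 0.25/6 the Z value is 2.4, so the probability is a little under 1 percent.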
(ii) Here we have to find $P(120 < \bar{X} < 125)$, which means we have to use the
standardized normal distribution for the sampling distribution of the sample mean,
defined as
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
Example 10. A bank calculates that its individual savings deposits are
normally distributed with mean TK. 2000 and a standard deviation of Tk.
600. If the bank takes a random sample of 100 accounts, what is the
probability that mean deposit will lie between TK. 1900 and TK. 2050?
Solution. Let X be the individual savings deposit, given that $X \sim N(2000, 600^2)$.
We have to find $P(1900 < \bar{X} < 2050)$. Using the standardized normal
distribution for the sampling distribution of the sample mean, we have
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
Here, the standard error of $\bar{X}$ is given by $\frac{\sigma}{\sqrt{n}} = \frac{600}{\sqrt{100}} = 60$,
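The probability asked for in Example 10 can be evaluated numerically from the quantities above. The sketch below computes the standard normal distribution function from `math.erf` rather than a table; the function names are illustrative assumptions.

```python
import math


def normal_cdf(z):
    """P(Z <= z) for a standard normal variable Z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_sample_mean_between(low, high, mu, sigma, n):
    """P(low < sample mean < high) when the sample mean is normally distributed."""
    se = sigma / math.sqrt(n)               # standard error, 60 in Example 10
    return normal_cdf((high - mu) / se) - normal_cdf((low - mu) / se)


if __name__ == "__main__":
    p = prob_sample_mean_between(1900, 2050, mu=2000, sigma=600, n=100)
    print(round(p, 4))
```

The two limits correspond to Z values of about −1.67 and 0.83, and the probability between them is close to 0.75.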
$$\bar{X}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} X_{2i}.$$
(a) When the population variances are known, the sampling distribution of
$(\bar{X}_1 - \bar{X}_2)$ follows exactly a normal distribution with
(b) When $\sigma_1$ and $\sigma_2$ are not known, but n1 > 29 and n2 > 29 are sufficiently
large, the standard error of the difference between two sample means can
be estimated by $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, where $s_1^2$ and $s_2^2$ are the variances obtained
from the two samples respectively. In this case, the sampling distribution of
$(\bar{X}_1 - \bar{X}_2)$ follows approximately a normal distribution with
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
In all of the above cases, we have assumed that the parent distribution is
normal. However, if the parent distribution is not normal and the sample
size is large, then by virtue of the central limit theorem, the distribution of
the difference between two means approximately follows a normal distribution with the
respective mean and variance.
Example 12. Strength of wire produced by company A has a mean of 4500
kg and a standard deviation of 200 kg. Company B has a mean of 4000 kg
and a standard deviation of 300 kg. If 50 wires of company A and 100 wires
of company B are selected at random and tested for strength, what is the
probability that the sample mean strength of A will be at least 600 kg more
than that of B?
Solution. For the sampling distribution of the difference between two
means, we know that the mean value of the difference between two sample
means is given by (µ1 - µ2) = 4500 – 4000 = 500.
And the standard error is $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{200^2}{50} + \frac{300^2}{100}} = 41.23$.
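$$P(\bar{X}_1 - \bar{X}_2 > 600) = P\left(\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} > \frac{600 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right)$$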
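The standardisation for Example 12 can be carried through numerically. The sketch below uses the figures stated in the example (means 4500 and 4000 kg, standard deviations 200 and 300 kg, sample sizes 50 and 100) and computes the normal tail from `math.erfc`; the function name `prob_difference_exceeds` is an assumption.

```python
import math


def prob_difference_exceeds(d, mu1, mu2, sigma1, sigma2, n1, n2):
    """Return (se, z, P(X1bar - X2bar > d)) for two independent samples."""
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)   # standard error of the difference
    z = (d - (mu1 - mu2)) / se
    p = 0.5 * math.erfc(z / math.sqrt(2))                 # upper tail of N(0, 1)
    return se, z, p


if __name__ == "__main__":
    se, z, p = prob_difference_exceeds(600, 4500, 4000, 200, 300, 50, 100)
    print(round(se, 2), round(z, 2), round(p, 4))
```

The Z value is (600 − 500)/41.23, roughly 2.43, so a difference of at least 600 kg is a rare outcome, with probability below 1 percent.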
$$= P\left(Z > \frac{0.50 - 0.43}{0.055}\right) = P(Z > 1.27) = 0.1020.$$