Chapter 2: Sampling — Simple Random Sampling
Notations:
The following notations will be used in further notes:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \;:\; \text{sample mean}$$

$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i \;:\; \text{population mean}$$

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N-1}\Big(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\Big)$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{N}\Big(\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\Big)$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\Big(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\Big)$$
Let $u_i$ be the $i$th unit selected in the sample. This unit can be selected in the sample either at the first draw, the second draw, …, or the $n$th draw. Let $P_j(i)$ denote the probability of selection of $u_i$ at the $j$th draw, $j = 1, 2, \ldots, n$. Then the probability that $u_i$ is selected in the sample is
$$P(i) = P_1(i) + P_2(i) + \cdots + P_n(i) = \underbrace{\frac{1}{N} + \frac{1}{N} + \cdots + \frac{1}{N}}_{n \text{ times}} = \frac{n}{N}.$$
Now if $u_1, u_2, \ldots, u_n$ are the $n$ units selected in the sample, then the probability of their selection is found as follows. Note that when the second unit is to be selected, there are $(n - 1)$ units left to be selected in the sample from a population of $(N - 1)$ units. Similarly, when the third unit is to be selected, there are $(n - 2)$ units left to be selected in the sample from a population of $(N - 2)$ units, and so on.

If $P(u_1) = \dfrac{n}{N}$, then
$$P(u_2) = \frac{n-1}{N-1}, \;\ldots,\; P(u_n) = \frac{1}{N-n+1}.$$
Thus
$$P(u_1, u_2, \ldots, u_n) = \frac{n}{N}\cdot\frac{n-1}{N-1}\cdot\frac{n-2}{N-2}\cdots\frac{1}{N-n+1} = \frac{1}{\binom{N}{n}}.$$
Alternative approach:
The probability of drawing a sample in SRSWOR can alternatively be found as follows.
Let $u_{i(k)}$ denote the $i$th unit drawn at the $k$th draw. Note that the $i$th unit can be any unit out of the $N$ units. Then $s_o = (u_{i(1)}, u_{i(2)}, \ldots, u_{i(n)})$ is an ordered sample in which the order of the units in which they are drawn, i.e., $u_{i(1)}$ drawn at the first draw, $u_{i(2)}$ drawn at the second draw, and so on, is also considered. The probability of selection at the $k$th draw, given that $u_{i(1)}, u_{i(2)}, \ldots, u_{i(k-1)}$ have already been drawn in the first $(k - 1)$ draws, is $\dfrac{1}{N-k+1}$. Hence
$$\text{Probability of drawing a sample in a given order} = \frac{(N-n)!}{N!}.$$
So the probability of drawing a sample in which the order of the units in which they are drawn is irrelevant is
$$n!\,\frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$
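Both routes to the sample probability can be checked numerically. The sketch below, with small illustrative values of N and n (assumed, not from the notes), multiplies the draw-by-draw probabilities in exact rational arithmetic and confirms the product equals $1/\binom{N}{n}$.

```python
from fractions import Fraction
from math import comb

N, n = 10, 4  # small illustrative sizes (assumed, not from the notes)

# Probability of one fixed unordered sample: n/N * (n-1)/(N-1) * ... * 1/(N-n+1)
p = Fraction(1)
for k in range(n):
    p *= Fraction(n - k, N - k)

# Agrees with the count-based answer 1 / C(N, n)
assert p == Fraction(1, comb(N, n))
print(p)  # 1/210
```

Using `Fraction` keeps the arithmetic exact, so the equality holds without any floating-point tolerance.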
2. SRSWR
When $n$ units are selected with SRSWR, the total number of possible samples is $N^n$. The probability of drawing a sample is $\dfrac{1}{N^n}$.

Alternatively, let $u_i$ be the $i$th unit selected in the sample. This unit can be selected in the sample either at the first draw, the second draw, …, or the $n$th draw. At any stage, there are always $N$ units in the population in the case of SRSWR, so the probability of selection of $u_i$ at any stage is $1/N$ for all $i = 1, 2, \ldots, n$. Then
$$P[\text{selection of } u_i \text{ at the } k\text{th draw}] = \frac{1}{N}.$$
SRSWOR
Let $t_i = \sum_{j=1}^{n} y_j$ denote the total of the $i$th sample, $i = 1, 2, \ldots, \binom{N}{n}$. Then
$$E(\bar y) = \frac{1}{n} E\Big(\sum_{j=1}^{n} y_j\Big) = \frac{1}{n} E(t_i) = \frac{1}{n}\cdot\frac{1}{\binom{N}{n}} \sum_{i=1}^{\binom{N}{n}} t_i.$$
When $n$ units are sampled from $N$ units without replacement, each unit of the population can occur with the other units selected out of the remaining $(N - 1)$ units in the population, and each unit occurs in $\binom{N-1}{n-1}$ of the $\binom{N}{n}$ possible samples. So
$$\sum_{i=1}^{\binom{N}{n}} t_i = \binom{N-1}{n-1} \sum_{i=1}^{N} y_i.$$
Now
$$E(\bar y) = \frac{(N-1)!}{(n-1)!\,(N-n)!} \cdot \frac{n!\,(N-n)!}{n\,N!} \sum_{i=1}^{N} y_i = \frac{1}{N} \sum_{i=1}^{N} y_i = \bar Y.$$
Alternatively,
$$E(\bar y) = \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{N} Y_i\, P_j(i) = \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{N} Y_i \cdot \frac{1}{N} = \frac{1}{n} \sum_{j=1}^{n} \bar Y = \bar Y.$$
SRSWR
$$E(\bar y) = \frac{1}{n} E\Big(\sum_{i=1}^{n} y_i\Big) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{1}{n}\sum_{i=1}^{n} \big(Y_1 P_1 + Y_2 P_2 + \cdots + Y_N P_N\big) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{N}\big(Y_1 + Y_2 + \cdots + Y_N\big) = \frac{1}{n}\sum_{i=1}^{n} \bar Y = \bar Y,$$
where $P_i = \dfrac{1}{N}$ for all $i = 1, 2, \ldots, N$ is the probability of selection of a unit. Thus $\bar y$ is an unbiased estimator of the population mean under SRSWR also.
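Unbiasedness under both schemes can be verified by exact enumeration. The sketch below uses a small assumed population (not from the notes) and averages $\bar y$ over all $\binom{N}{n}$ SRSWOR samples and over all $N^n$ ordered SRSWR samples.

```python
from itertools import combinations, product
from statistics import mean

Y = [3, 7, 11, 15, 24]  # small illustrative population (assumed)
n = 3
Ybar = mean(Y)          # population mean = 12

# Average of ybar over all C(N, n) SRSWOR samples
e_wor = mean(mean(s) for s in combinations(Y, n))
# Average of ybar over all N^n ordered SRSWR samples
e_wr = mean(mean(s) for s in product(Y, repeat=n))

# Both equal the population mean, confirming E(ybar) = Ybar
assert abs(e_wor - Ybar) < 1e-9 and abs(e_wr - Ybar) < 1e-9
```

Enumeration (rather than simulation) shows the expectations are exactly equal, not merely close.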
SRSWOR
Consider
$$K = \sum_{i \ne j}^{n} E\big[(y_i - \bar Y)(y_j - \bar Y)\big]$$
with
$$E\big[(y_i - \bar Y)(y_j - \bar Y)\big] = \frac{1}{N(N-1)}\sum_{k \ne l}^{N} (y_k - \bar Y)(y_l - \bar Y).$$
Since
$$\Big[\sum_{k=1}^{N}(y_k - \bar Y)\Big]^2 = \sum_{k=1}^{N}(y_k - \bar Y)^2 + \sum_{k \ne l}^{N}(y_k - \bar Y)(y_l - \bar Y) = 0,$$
we have
$$0 = (N-1)S^2 + \sum_{k \ne l}^{N}(y_k - \bar Y)(y_l - \bar Y),$$
and hence
$$\frac{1}{N(N-1)}\sum_{k \ne l}^{N}(y_k - \bar Y)(y_l - \bar Y) = \frac{1}{N(N-1)}\big[-(N-1)S^2\big] = -\frac{S^2}{N}.$$
SRSWR
$$K = \sum_{i \ne j}^{n} E\big[(y_i - \bar Y)(y_j - \bar Y)\big] = \sum_{i \ne j}^{n} E(y_i - \bar Y)\,E(y_j - \bar Y) = 0$$
because the $i$th and $j$th draws ($i \ne j$) are independent. Thus the variance of $\bar y$ under SRSWR is
$$V(\bar y_{WR}) = \frac{N-1}{Nn} S^2.$$
It is to be noted that if $N$ is infinite (large enough), then
$$V(\bar y) = \frac{S^2}{n}$$
in both the cases of SRSWOR and SRSWR. So the factor $\dfrac{N-n}{N}$ is responsible for changing the variance of $\bar y$ when the sample is drawn from a finite population in comparison to an infinite population. This is why $\dfrac{N-n}{N}$ is called the finite population correction (fpc). It may be noted that $\dfrac{N-n}{N} = 1 - \dfrac{n}{N}$, so $\dfrac{N-n}{N}$ is close to 1 if the ratio of sample size to population size, $\dfrac{n}{N}$, is very small or negligible. The term $\dfrac{n}{N}$ is called the sampling fraction. In practice, the fpc can be ignored whenever $\dfrac{n}{N} \le 5\%$, and for many purposes even if it is as high as 10%. Ignoring the fpc will result in an overestimation of the variance of $\bar y$.
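As a quick numerical illustration (all values assumed), the sketch below compares the SRSWOR variance of the sample mean with the infinite-population value $S^2/n$ at a 5% sampling fraction.

```python
N, n, S2 = 1000, 50, 4.0    # illustrative values (assumed): n/N = 5%

fpc = (N - n) / N           # finite population correction, here 0.95
var_wor = fpc * S2 / n      # Var(ybar) under SRSWOR
var_inf = S2 / n            # fpc ignored (infinite-population formula)

# Ignoring the fpc overestimates the variance of ybar
assert var_wor < var_inf
print(fpc, var_wor, var_inf)  # 0.95 0.076 0.08
```

Even at the 5% threshold mentioned above, the difference (0.076 vs. 0.08) is only about 5% of the variance, which is why the fpc is often ignored.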
A natural estimator of $S^2$ (or $\sigma^2$) is the sample variance $s^2$, and we investigate the biasedness of $s^2$ in the cases of SRSWOR and SRSWR.
Consider
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big[(y_i - \bar Y) - (\bar y - \bar Y)\big]^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n}(y_i - \bar Y)^2 - n(\bar y - \bar Y)^2\Big].$$
Then
$$E(s^2) = \frac{1}{n-1}\Big[\sum_{i=1}^{n}E(y_i - \bar Y)^2 - n\,E(\bar y - \bar Y)^2\Big] = \frac{1}{n-1}\Big[\sum_{i=1}^{n}\mathrm{Var}(y_i) - n\,\mathrm{Var}(\bar y)\Big] = \frac{1}{n-1}\big[n\sigma^2 - n\,\mathrm{Var}(\bar y)\big].$$
In the case of SRSWR, $\mathrm{Var}(\bar y) = \dfrac{N-1}{Nn}S^2$, so
$$E(s^2) = \frac{n}{n-1}\Big[\sigma^2 - \frac{N-1}{Nn}S^2\Big] = \frac{n}{n-1}\Big[\frac{N-1}{N}S^2 - \frac{N-1}{Nn}S^2\Big] = \frac{N-1}{N}S^2 = \sigma^2.$$
In the case of SRSWOR, $\mathrm{Var}(\bar y) = \dfrac{N-n}{Nn}S^2$, and the same steps give $E(s^2) = S^2$.
Hence
$$E(s^2) = \begin{cases} S^2 & \text{in SRSWOR} \\ \sigma^2 & \text{in SRSWR.} \end{cases}$$
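This result can be confirmed by exact enumeration on a small assumed population: averaging $s^2$ over all SRSWOR samples recovers $S^2$, while averaging over all ordered SRSWR samples recovers $\sigma^2$.

```python
from itertools import combinations, product
from statistics import mean, variance, pvariance

Y = [2, 4, 6, 10]      # small illustrative population (assumed)
n = 2
S2 = variance(Y)       # divisor N-1: here 35/3
sigma2 = pvariance(Y)  # divisor N:   here 35/4

es2_wor = mean(variance(s) for s in combinations(Y, n))
es2_wr = mean(variance(s) for s in product(Y, repeat=n))

assert abs(es2_wor - S2) < 1e-9      # E(s^2) = S^2 under SRSWOR
assert abs(es2_wr - sigma2) < 1e-9   # E(s^2) = sigma^2 under SRSWR
```

Note that the SRSWR enumeration includes samples with repeated units (whose sample variance is 0), which is exactly what pulls $E(s^2)$ down from $S^2$ to $\sigma^2$.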
In order to estimate the standard error, one simple option is to consider the square root of the estimate of the variance of the sample mean:
• under SRSWOR, a possible estimator is $\hat\sigma(\bar y) = \sqrt{\dfrac{N-n}{Nn}}\, s$;
• under SRSWR, a possible estimator is $\hat\sigma(\bar y) = \sqrt{\dfrac{N-1}{Nn}}\, s$.
It is to be noted that this estimator does not possess the same properties as those of $\mathrm{Var}(\bar y)$.
Consider $s$ as an estimator of $S$. Let
$$s^2 = S^2 + \varepsilon \quad\text{with}\quad E(\varepsilon) = 0,\; E(\varepsilon^2) = \mathrm{Var}(s^2).$$
Write
$$s = (S^2 + \varepsilon)^{1/2} = S\Big(1 + \frac{\varepsilon}{S^2}\Big)^{1/2} = S\Big(1 + \frac{\varepsilon}{2S^2} - \frac{\varepsilon^2}{8S^4} + \cdots\Big),$$
assuming $\varepsilon$ will be small as compared to $S^2$; as $n$ becomes large, the probability of such an event approaches one. Neglecting the powers of $\varepsilon$ higher than two and taking expectation, we have
$$\mathrm{Var}(s) = \frac{S^2}{2(n-1)}.$$
Both $\mathrm{Var}(s)$ and $\mathrm{Var}(s^2)$ are inflated due to nonnormality to the same extent, by the inflation factor
$$1 + \frac{n-1}{2n}(\beta_2 - 3),$$
where $\beta_2$ is the coefficient of kurtosis; this does not depend on the coefficient of skewness. This is an important result to be kept in mind while determining the sample size in which it is assumed that $S^2$ is known. If the inflation factor is ignored and the population is non-normal, then the reliability of $s^2$ may be misleading.
(i) SRSWOR
With the $i$th unit of the population, we associate a random variable $a_i$ defined as follows:
$$a_i = \begin{cases} 1 & \text{if the } i\text{th unit occurs in the sample} \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots, N.$$
Then,
$$E(a_i) = 1 \cdot P(\text{the } i\text{th unit is included in the sample}) = \frac{n}{N}, \quad i = 1, 2, \ldots, N,$$
$$E(a_i^2) = 1 \cdot P(\text{the } i\text{th unit is included in the sample}) = \frac{n}{N}, \quad i = 1, 2, \ldots, N,$$
$$E(a_i a_j) = 1 \cdot P(\text{the } i\text{th and } j\text{th units are both included in the sample}) = \frac{n(n-1)}{N(N-1)}, \quad i \ne j = 1, 2, \ldots, N.$$
From these results, we can obtain
$$\mathrm{Var}(a_i) = E(a_i^2) - \big(E(a_i)\big)^2 = \frac{n(N-n)}{N^2}, \quad i = 1, 2, \ldots, N,$$
$$\mathrm{Cov}(a_i, a_j) = E(a_i a_j) - E(a_i)E(a_j) = -\frac{n(N-n)}{N^2(N-1)}, \quad i \ne j = 1, 2, \ldots, N.$$
We can rewrite the sample mean as
$$\bar y = \frac{1}{n}\sum_{i=1}^{N} a_i y_i.$$
Then
$$E(\bar y) = \frac{1}{n}\sum_{i=1}^{N} E(a_i)\, y_i = \bar Y$$
and
$$\mathrm{Var}(\bar y) = \frac{1}{n^2}\,\mathrm{Var}\Big(\sum_{i=1}^{N} a_i y_i\Big) = \frac{1}{n^2}\Big[\sum_{i=1}^{N} \mathrm{Var}(a_i)\, y_i^2 + \sum_{i \ne j}^{N} \mathrm{Cov}(a_i, a_j)\, y_i y_j\Big].$$
Substituting the values of $\mathrm{Var}(a_i)$ and $\mathrm{Cov}(a_i, a_j)$ and simplifying, we get
$$\mathrm{Var}(\bar y) = \frac{N-n}{Nn} S^2.$$
To show that $E(s^2) = S^2$, consider
$$s^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n} y_i^2 - n\bar y^2\Big] = \frac{1}{n-1}\Big[\sum_{i=1}^{N} a_i y_i^2 - n\bar y^2\Big].$$
Hence, taking expectation, we get
$$E(s^2) = \frac{1}{n-1}\Big[\sum_{i=1}^{N} E(a_i)\, y_i^2 - n\big(\mathrm{Var}(\bar y) + \bar Y^2\big)\Big].$$
Substituting the values of $E(a_i)$ and $\mathrm{Var}(\bar y)$ in this expression and simplifying, we get $E(s^2) = S^2$.
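The two SRSWOR results above can be verified exactly on a small assumed population: the variance of $\bar y$ over all $\binom{N}{n}$ samples matches $\frac{N-n}{Nn}S^2$, and the average of $s^2$ matches $S^2$.

```python
from itertools import combinations
from statistics import mean, variance

Y = [1, 3, 8, 12, 16]   # illustrative population (assumed)
N, n = len(Y), 3
Ybar, S2 = mean(Y), variance(Y)

samples = list(combinations(Y, n))

# Exact Var(ybar): mean squared deviation of ybar over all C(N, n) samples
var_exact = mean((mean(s) - Ybar) ** 2 for s in samples)
assert abs(var_exact - (N - n) / (N * n) * S2) < 1e-9   # (N-n)/(Nn) * S^2

# Exact E(s^2) over all samples equals S^2
assert abs(mean(variance(s) for s in samples) - S2) < 1e-9
```

Here the enumeration plays the role of the indicator-variable argument: each unit appears in exactly $\binom{N-1}{n-1}$ of the samples, which is what makes the formulas come out exactly.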
(ii) SRSWR
Let a random variable $a_i$ associated with the $i$th unit of the population denote the number of times the $i$th unit occurs in the sample, $i = 1, 2, \ldots, N$. So $a_i$ assumes the values 0, 1, 2, …, n. The joint distribution of $(a_1, a_2, \ldots, a_N)$ is multinomial:
$$P(a_1, a_2, \ldots, a_N) = \frac{n!}{\prod_{i=1}^{N} a_i!}\cdot\frac{1}{N^n},$$
where $\sum_{i=1}^{N} a_i = n$. For this multinomial distribution, we have
$$E(a_i) = \frac{n}{N}, \qquad \mathrm{Var}(a_i) = \frac{n(N-1)}{N^2}, \quad i = 1, 2, \ldots, N,$$
$$\mathrm{Cov}(a_i, a_j) = -\frac{n}{N^2}, \quad i \ne j = 1, 2, \ldots, N.$$
We rewrite the sample mean as
$$\bar y = \frac{1}{n}\sum_{i=1}^{N} a_i y_i.$$
Hence, taking the expectation of $\bar y$ and substituting the value of $E(a_i) = n/N$, we obtain $E(\bar y) = \bar Y$. Similarly, substituting $\mathrm{Var}(a_i)$ and $\mathrm{Cov}(a_i, a_j)$ and simplifying,
$$\mathrm{Var}(\bar y) = \frac{N-1}{Nn} S^2.$$
To prove that $E(s^2) = \dfrac{N-1}{N}S^2 = \sigma^2$ in SRSWR, consider
$$(n-1)s^2 = \sum_{i=1}^{n} y_i^2 - n\bar y^2 = \sum_{i=1}^{N} a_i y_i^2 - n\bar y^2.$$
Taking expectation,
$$(n-1)E(s^2) = \frac{n}{N}\sum_{i=1}^{N} y_i^2 - n\Big[\frac{N-1}{Nn}S^2 + \bar Y^2\Big] = \frac{(n-1)(N-1)}{N} S^2,$$
so
$$E(s^2) = \frac{N-1}{N} S^2 = \sigma^2.$$
The population total $Y_T = N\bar Y$ can be estimated by
$$\hat Y_T = N\hat{\bar Y} = N\bar y.$$
Obviously,
$$E(\hat Y_T) = N\,E(\bar y) = N\bar Y,$$
and
$$\mathrm{Var}(\hat Y_T) = N^2\,\mathrm{Var}(\bar y) = \begin{cases} \dfrac{N(N-n)}{n}\, S^2 & \text{for SRSWOR} \\[2mm] \dfrac{N(N-1)}{n}\, S^2 & \text{for SRSWR.} \end{cases}$$
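A quick enumeration check on a small assumed population confirms that $N\bar y$ is unbiased for the total and that its exact variance matches $N(N-n)S^2/n$ under SRSWOR.

```python
from itertools import combinations
from statistics import mean, variance

Y = [5, 9, 14, 20]   # illustrative population (assumed)
N, n = len(Y), 2
total = sum(Y)       # 48
S2 = variance(Y)

# Yhat_T = N * ybar over all C(N, n) SRSWOR samples
ests = [N * mean(s) for s in combinations(Y, n)]

assert abs(mean(ests) - total) < 1e-9               # unbiased for the total
var_t = mean((t - total) ** 2 for t in ests)
assert abs(var_t - N * (N - n) / n * S2) < 1e-9     # N(N-n)S^2/n
```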
Assuming $\bar y$ to be normally distributed, $\dfrac{\bar y - \bar Y}{\sqrt{\mathrm{Var}(\bar y)}} \sim N(0,1)$ when $\sigma^2$ is known. If $\sigma^2$ is unknown and is estimated from the sample, then $\dfrac{\bar y - \bar Y}{\sqrt{\widehat{\mathrm{Var}}(\bar y)}}$ follows a $t$-distribution with $(n - 1)$ degrees of freedom. When $\sigma^2$ is known, the $100(1-\alpha)\%$ confidence interval is given by
$$P\Big[-Z_{\alpha/2} \le \frac{\bar y - \bar Y}{\sqrt{\mathrm{Var}(\bar y)}} \le Z_{\alpha/2}\Big] = 1 - \alpha$$
or
$$P\Big[\bar y - Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)} \le \bar Y \le \bar y + Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)}\Big] = 1 - \alpha,$$
and the confidence limits are
$$\Big(\bar y - Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)},\;\; \bar y + Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)}\Big),$$
where $Z_{\alpha/2}$ denotes the upper $\frac{\alpha}{2}\%$ point of the $N(0,1)$ distribution. Similarly, when $\sigma^2$ is unknown, the limits are obtained by replacing $Z_{\alpha/2}$ with $t_{\alpha/2,\,n-1}$ and $\mathrm{Var}(\bar y)$ by its estimate.
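With $\sigma^2$ known, the limits $\bar y \pm Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)}$ can be computed directly. A sketch with assumed numbers, using the standard normal quantile from the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative values (assumed): ybar, S^2, N, n, and 95% confidence
ybar, S2, N, n, alpha = 52.0, 30.0, 500, 40, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.96
se = sqrt((N - n) / (N * n) * S2)         # sqrt(Var(ybar)) under SRSWOR
lo, hi = ybar - z * se, ybar + z * se

assert lo < ybar < hi
print(round(lo, 2), round(hi, 2))
```

Note that the fpc enters through the standard error, so the interval is narrower than the infinite-population one.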
An important requirement for determining the sample size under these criteria is that information regarding the population standard deviation $S$ should be known. The reason and need for this will be clear when we derive the sample size in the next section. A question arises: how can information about $S$ be obtained beforehand? One possible solution is to conduct a pilot survey, collect a preliminary sample of small size, estimate $S$, and use this estimate as the known value of $S$. Alternatively, such information can also be collected from past data, past experience, the long association of the experimenter with the experiment, prior information, etc.
Now we find the sample size under different criteria assuming that the samples have been drawn
using SRSWOR. The case for SRSWR can be derived similarly.
1. Pre-specified variance
The sample size is to be determined so that the variance of $\bar y$ does not exceed a given value $V$, i.e., $\mathrm{Var}(\bar y) \le V$:
$$\frac{N-n}{Nn} S^2 \le V$$
$$\text{or}\quad \frac{1}{n} - \frac{1}{N} \le \frac{V}{S^2}$$
$$\text{or}\quad \frac{1}{n} \le \frac{1}{N} + \frac{1}{n_e}$$
$$\text{or}\quad n \ge \frac{n_e}{1 + \dfrac{n_e}{N}},$$
where $n_e = \dfrac{S^2}{V}$.
It may be noted here that $n_e$ can be known only when $S^2$ is known. This compels us to assume that $S$ should be known. The same reasoning will also be seen in other cases.
The smallest sample size needed in this case is
$$n_{\text{smallest}} = \frac{n_e}{1 + \dfrac{n_e}{N}}.$$
If $N$ is large, then the required $n$ is $n \approx n_e$ and $n_{\text{smallest}} = n_e$.
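A short numeric sketch of these two formulas (with assumed $S^2$, $V$, and $N$); note that the finite-population term makes the required $n$ smaller than $n_e$.

```python
from math import ceil

S2, V, N = 16.0, 0.1, 500   # illustrative values (assumed)

ne = S2 / V                  # sample size for an infinite population: 160
n = ne / (1 + ne / N)        # SRSWOR requirement with the fpc

assert n < ne                # the fpc reduces the required sample size
print(ceil(n))               # 122
```

In practice the result is rounded up, since the criterion is an inequality on $n$.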
2. Pre-specified estimation error
It may be required that $\bar y$ should not differ from $\bar Y$ by more than a pre-specified amount of estimation error $e$ with a given probability, i.e.,
$$P\big[\,|\bar y - \bar Y| \le e\,\big] = 1 - \alpha$$
$$\text{or}\quad Z^2_{\alpha/2}\,\mathrm{Var}(\bar y) = e^2$$
$$\text{or}\quad Z^2_{\alpha/2}\,\frac{N-n}{Nn}\, S^2 = e^2$$
$$\text{or}\quad n = \frac{\left(\dfrac{Z_{\alpha/2}\, S}{e}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{Z_{\alpha/2}\, S}{e}\right)^2},$$
which is the required sample size. If $N$ is large, then
$$n = \left(\frac{Z_{\alpha/2}\, S}{e}\right)^2.$$
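For example (all values assumed): $S = 12$, margin $e = 1.5$ at 95% confidence, and $N = 3000$.

```python
from statistics import NormalDist

S, e, N, alpha = 12.0, 1.5, 3000, 0.05   # illustrative values (assumed)

z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
n0 = (z * S / e) ** 2                    # large-N sample size
n = n0 / (1 + n0 / N)                    # finite-N correction

assert n < n0
print(round(n0), round(n))
```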
3. Pre-specified width of the confidence interval
If it is desired that the width of the confidence interval of $\bar Y$ should not exceed a pre-specified value $W$, then
$$2 Z_{\alpha/2}\sqrt{\mathrm{Var}(\bar y)} \le W$$
$$\text{or}\quad 2 Z_{\alpha/2}\sqrt{\frac{N-n}{Nn}}\, S \le W$$
$$\text{or}\quad 4 Z^2_{\alpha/2}\Big(\frac{1}{n} - \frac{1}{N}\Big) S^2 \le W^2$$
Sampling Theory| Chapter 2 | Simple Random Sampling | Shalabh, IIT Kanpur
$$\text{or}\quad \frac{1}{n} \le \frac{1}{N} + \frac{W^2}{4 Z^2_{\alpha/2} S^2}$$
$$\text{or}\quad n \ge \frac{\dfrac{4 Z^2_{\alpha/2} S^2}{W^2}}{1 + \dfrac{4 Z^2_{\alpha/2} S^2}{N W^2}}.$$
The smallest sample size needed in this case is
$$n_{\text{smallest}} = \frac{\dfrac{4 Z^2_{\alpha/2} S^2}{W^2}}{1 + \dfrac{4 Z^2_{\alpha/2} S^2}{N W^2}}.$$
If $N$ is large, then
$$n \approx \frac{4 Z^2_{\alpha/2} S^2}{W^2}$$
and the minimum sample size needed is
$$n_{\text{smallest}} = \frac{4 Z^2_{\alpha/2} S^2}{W^2}.$$
4. Pre-specified coefficient of variation
If it is desired that the coefficient of variation of $\bar y$ should not exceed a given or pre-specified value of the coefficient of variation, say $C_0$, then the required sample size $n$ is to be determined such that
$$CV(\bar y) = \frac{\sqrt{\mathrm{Var}(\bar y)}}{\bar Y} \le C_0$$
$$\text{or}\quad \frac{\dfrac{N-n}{Nn}\, S^2}{\bar Y^2} \le C_0^2$$
$$\text{or}\quad \frac{1}{n} - \frac{1}{N} \le \frac{C_0^2}{C^2}$$
$$\text{or}\quad n \ge \frac{\dfrac{C^2}{C_0^2}}{1 + \dfrac{C^2}{N C_0^2}}$$
is the required sample size, where $C = \dfrac{S}{\bar Y}$ is the population coefficient of variation.
The smallest sample size needed in this case is
$$n_{\text{smallest}} = \frac{\dfrac{C^2}{C_0^2}}{1 + \dfrac{C^2}{N C_0^2}}.$$
If $N$ is large, then
$$n \approx \frac{C^2}{C_0^2} \quad\text{and}\quad n_{\text{smallest}} = \frac{C^2}{C_0^2}.$$
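For instance (assumed values), with a population coefficient of variation $C = 0.8$ and a target $C_0 = 0.05$:

```python
from math import ceil

C, C0, N = 0.8, 0.05, 10000   # illustrative values (assumed)

n0 = (C / C0) ** 2            # C^2 / C0^2 = 256
n = n0 / (1 + n0 / N)         # with the finite-population term

print(ceil(n))                # 250
```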
5. Pre-specified relative error
When $\bar y$ is used to estimate $\bar Y$, the relative estimation error is defined as $\dfrac{\bar y - \bar Y}{\bar Y}$. If it is required that such relative estimation error should not exceed a pre-specified value $R$ with probability $(1 - \alpha)$, then such a requirement can be satisfied by expressing it as
$$P\Big[\Big|\frac{\bar y - \bar Y}{\bar Y}\Big| \le R\Big] = P\big[\,|\bar y - \bar Y| \le R\bar Y\,\big] = 1 - \alpha.$$
Assuming the population to be normally distributed, $\bar y$ follows $N\Big(\bar Y,\, \dfrac{N-n}{Nn}S^2\Big)$. So
$$Z^2_{\alpha/2}\,\frac{N-n}{Nn}\, S^2 = R^2 \bar Y^2$$
$$\text{or}\quad \frac{1}{n} - \frac{1}{N} = \frac{R^2}{C^2 Z^2_{\alpha/2}}$$
$$\text{or}\quad n = \frac{\left(\dfrac{Z_{\alpha/2}\, C}{R}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{Z_{\alpha/2}\, C}{R}\right)^2},$$
where $C = \dfrac{S}{\bar Y}$ is the population coefficient of variation and should be known.
If $N$ is large, then
$$n = \left(\frac{Z_{\alpha/2}\, C}{R}\right)^2.$$
6. Pre-specified cost
Let an amount of money $C$ be designated for the sample survey to collect $n$ observations, $C_0$ be the overhead cost, and $C_1$ be the cost of collecting one unit in the sample. Then the total cost $C$ can be expressed as
$$C = C_0 + n C_1$$
or
$$n = \frac{C - C_0}{C_1}$$
is the required sample size.
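A minimal sketch with assumed costs:

```python
# Illustrative costs (assumed): total budget C, overhead C0, per-unit cost C1
C, C0, C1 = 5000.0, 800.0, 25.0

n = (C - C0) / C1   # units affordable within the budget
print(int(n))       # 168
```

In practice $n$ is rounded down, since exceeding the budget is not allowed.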