SAMPLING AND ESTIMATION Notes and Examples
Taking samples and interpreting them is an essential part of statistics. The populations in which
one may be interested are often so large that it is not possible to use every member of the
population when carrying out research.
Sample
A sample is a smaller part of the parent population selected at random from the parent
population. The parent population can be finite or infinite.
When carrying out research about any given population, the selected sample must be
representative.
A value derived from a sample, such as the mean ($\bar{x}$) or the variance ($s^2$), is called a sample statistic (or
statistic). When sample statistics are used to estimate population parameters, they are
called estimates.
SAMPLING
There are mainly two reasons why you might wish to take a sample:
An estimate of a parameter derived from sample data will generally differ from the true value of the parameter.
The difference is called the sampling error. To reduce the sampling error, the sample must be
as representative of the parent population as possible.
Here are a number of questions you should ask yourself when about to take a sample:
1. A flour company wants to know what proportion of households in Karachi bake some or all of
their own bread. A sample of 500 residential addresses in Karachi is taken, and interviewers
call at these addresses only during regular working hours on weekdays.
SAMPLING TECHNIQUES
When information about a population is required, one can carry out a census of every member.
The advantage of a census is that accurate and complete information is obtained. The
disadvantages include cost and time requirements.
Usually a sample is taken. The sample must be representative so bias must be avoided in the
selection of members of the sample. In a random sample of size n, each member of the
population has an equal chance of being selected. Also, all the subsets of the population of size
n have an equal chance of constituting the sample.
Assigning a number to each member of a population and then drawing the numbers out of a
hat is one method of obtaining a simple random sample. Another is to use random number
tables, in which each digit has an equal chance of occurring.
Random sampling from a very large population can be very laborious. It may be more
convenient to carry out systematic sampling. This involves selecting every kth member of the
population, for example every 10th item from a particular machine on a production line.
The method of stratified sampling is often used when the population can be divided into
distinguishable parts or strata such as students in each department in each faculty of a college.
In simple random sampling, each element in the population is given a number starting from 1.
You then select elements for the sample using random number tables or the random number
generator on a calculator or computer.
Suppose that you want to select a sample of 15 houses from a numbered list of 483 houses.
Using random number tables, you choose a random starting position and take the digits in
groups of three. If the first set of three digits is 247, you put house number 247 from the list
into your sample. If the next number is 832, you ignore it because it does not correspond to a
house in the list. You continue in this way until you have a sample of 15 houses. (If any number
appears more than once, it is only included once in the sample.)
In some circumstances, you might choose to assign random numbers in a less wasteful way. For
example, you could subtract 500 from any random number above 500, so instead of discarding
832 you would choose house (832-500) =332.
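The house example can be sketched in code. This is a minimal illustration rather than a prescribed method: the function name `sample_by_random_numbers` is hypothetical, and a seeded random number generator stands in for the three-digit groups read from random number tables.

```python
import random

def sample_by_random_numbers(population_size, sample_size, rng):
    """Choose distinct item numbers from 1..population_size using three-digit random numbers."""
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randint(0, 999)   # stands in for three digits read from a table
        if number < 1 or number > population_size:
            continue                   # e.g. 832 matches no house in the list, so ignore it
        if number in chosen:
            continue                   # a repeated number is only included once
        chosen.append(number)
    return chosen

rng = random.Random(1)                 # fixed seed (an assumption) so the run is repeatable
houses = sample_by_random_numbers(483, 15, rng)
print(houses)
```

Ignoring unusable numbers keeps every house equally likely to be chosen, at the cost of wasting some of the random digits.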
Task 2
Using the random numbers below, which items would you choose from a numbered list of the
17841 inhabitants of a town if you want a random sample of size 10? Start from the top left
random number and work along each row in order.
54 66 35 88 98 91 45 92 12 47
12 16 71 83 94 22 44 57 43 43
45 32 26 37 19 89 27 02 77 14
85 98 46 56 50 71 07 65 33 63
51 63 71 95 36 36 17 77 53 40
25 95 65 04 59 80 16 59 21 43
91 55 88 14 82 48 48 94 38 34
60 87 82 35 35 45 45 08 44 37
Solution
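Reading along the rows and taking five digits at a time, starting each time from the next digit that is 0 or 1 (no item number in the list begins with a larger digit) and ignoring any value above 17841, such as 19892, the following can be chosen for the sample: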
14592 12471 16718 02771 07107 16371 17775 02595 04598 01659
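The sample above can be reproduced with a short script. The scanning rule used here (skip digits until a 0 or 1 is met, then read five digits and reject anything outside 1 to 17841) is one reading of the solution, assumed rather than stated in the notes, and the function name `read_sample` is hypothetical.

```python
rows = [
    "54 66 35 88 98 91 45 92 12 47",
    "12 16 71 83 94 22 44 57 43 43",
    "45 32 26 37 19 89 27 02 77 14",
    "85 98 46 56 50 71 07 65 33 63",
    "51 63 71 95 36 36 17 77 53 40",
    "25 95 65 04 59 80 16 59 21 43",
    "91 55 88 14 82 48 48 94 38 34",
    "60 87 82 35 35 45 45 08 44 37",
]
digits = "".join(rows).replace(" ", "")   # one long string of digits, row by row

def read_sample(digits, largest, sample_size, width=5):
    """Scan the digits, reading a group of `width` digits whenever a 0 or 1 is met."""
    chosen = []
    i = 0
    while len(chosen) < sample_size and i + width <= len(digits):
        if digits[i] not in "01":
            i += 1                         # no usable number starts with this digit
            continue
        group = digits[i:i + width]
        i += width                         # these digits are used up either way
        if 1 <= int(group) <= largest and group not in chosen:
            chosen.append(group)
    return chosen

sample = read_sample(digits, 17841, 10)
print(" ".join(sample))
```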
PAST EXAM QUESTIONS
Ex 12A Question 1
Unbiased estimates
When a population parameter, such as the mean or the variance, is unknown, then it is sensible
to estimate it from the sample.
An unbiased estimate is one which, on average, gives the true value, i.e. E(estimate) = true value
of the parameter.
For the variance $\sigma^2$, the unbiased estimate is $\hat{\sigma}^2 = \frac{n}{n-1} s^2$, where $s^2$ is the sample variance.
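As a quick numerical check of the variance formula, the sketch below uses a small set of hypothetical observations (not taken from these notes). It computes $s^2$ with divisor n, scales it by n/(n − 1), and compares the result with Python's built-in divisor n − 1 variance.

```python
import statistics

data = [4.1, 5.3, 6.0, 4.7, 5.9]      # hypothetical observations, for illustration only
n = len(data)

s2 = statistics.pvariance(data)       # sample variance s^2, using divisor n
sigma2_hat = n / (n - 1) * s2         # unbiased estimate of the population variance

# statistics.variance uses divisor n - 1, so it should agree with sigma2_hat
print(s2, sigma2_hat, statistics.variance(data))
```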
Example
EXAMINATION TYPE QUESTIONS
SAMPLE STATISTICS
When trying to find out information about a population, it is sensible to take random samples
and then consider the values obtained from them. It is therefore important to know how these
sample values are distributed.
Group task
We found that
$E(\bar{X}) = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$
The standard deviation of the sampling distribution is $\sqrt{\frac{\sigma^2}{n}}$, usually written as $\frac{\sigma}{\sqrt{n}}$. This is known
as the standard error.
The mean of the sampling distribution of the means is the same as the mean of the population.
The standard deviation of the sampling distribution is smaller than that of the population by a factor of $\sqrt{n}$.
This implies that the sample means are more closely clustered around µ than the population
values are.
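These two results can be checked by simulation. The sketch below is illustrative only: the population (normal with mean 70 and standard deviation 5), the sample size 4 and the number of repetitions are all assumed values chosen for the demonstration.

```python
import random
import statistics
from math import sqrt

rng = random.Random(0)                 # fixed seed (an assumption) so the run is repeatable
mu, sigma, n = 70, 5, 4                # assumed population parameters and sample size

# Take many samples of size n and record the mean of each one
sample_means = [
    statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(20000)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = sigma / sqrt(n)       # theoretical value, sigma / sqrt(n) = 2.5

print(mean_of_means, sd_of_means, standard_error)
```

The simulated values should be close to µ and σ/√n, but not exactly equal, because only a finite number of samples is taken.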
THE DISTRIBUTION OF $\bar{X}$ WHEN THE POPULATION OF X IS NORMAL
If samples are taken from a normal population, the sampling distribution of the means is
normal for any sample size.
i.e. if $X \sim N(\mu, \sigma^2)$ then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
Example 1
At a college, the masses of the male students can be modelled by a normal distribution with mean
mass 70 kg and standard deviation 5 kg. Four male students are chosen at random. Find the
probability that their mean mass is less than 65 kg.
Solution
With $\mu = 70$, $\sigma = 5$ and $n = 4$, $\frac{\sigma^2}{n} = \frac{25}{4} = 6.25$.
So $\bar{X} \sim N(70, 6.25)$.
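The required probability can then be evaluated from this distribution. A minimal sketch using Python's `statistics.NormalDist` in place of normal tables (the variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 70, 5, 4

# Distribution of the sample mean: N(70, 6.25), i.e. standard deviation 2.5
xbar_dist = NormalDist(mu, sigma / sqrt(n))

# P(mean mass < 65) = P(Z < -2)
p = xbar_dist.cdf(65)
print(round(p, 4))   # 0.0228
```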
In the case where the population is not normally distributed, the central limit theorem is used.
When samples are taken from a population that is not normally distributed, the sampling
distribution of the means takes on the shape of a normal distribution as the sample size
increases. For large n (n≥30 say) the distribution of the sample mean is approximately normal.
This result is known as the central limit theorem. The theorem holds when the population is
either discrete or continuous.
For samples taken from a non-normal population with mean µ and variance $\sigma^2$, by the central
limit theorem, $\bar{X}$ is approximately normal for large n,
and $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately.
Thirty random observations are taken from each of the following distributions and the sample
mean calculated. Find, in each case, the probability that the sample mean exceeds 5.
(b) X is the number of heads obtained when an unbiased coin is tossed nine times.
Expected answers
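For part (b), the calculation can be sketched as follows. The sketch assumes that X is binomial with 9 trials and probability 0.5, so that its mean is 4.5 and its variance is 2.25, and that no continuity correction is applied to the sample mean. Both choices are assumptions made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

tosses, p_head, n = 9, 0.5, 30

mu = tosses * p_head                     # mean of X: 4.5
variance = tosses * p_head * (1 - p_head)  # variance of X: 2.25

# By the central limit theorem the sample mean is approximately N(mu, variance / n)
xbar_dist = NormalDist(mu, sqrt(variance / n))

prob = 1 - xbar_dist.cdf(5)              # P(sample mean > 5)
print(round(prob, 3))
```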
Suppose that you do not know the value of a particular parameter of a distribution, for example
the mean or variance or the proportion of successes. It seems sensible that you would take a
random sample from the distribution and use it in some way to make an estimate of the value
of your unknown parameter.
This estimate is unbiased if the average (or expectation) of a large number of values taken in
the same way is the true value of the parameter.
POINT ESTIMATES
❖ The best unbiased estimate of p, the proportion of successes in the population, is $\hat{p}$, where
$\hat{p} = p_s$ and $p_s$ is the proportion of successes in the sample.
❖ The best unbiased estimate of µ, the population mean, is $\hat{\mu}$, where
$\hat{\mu} = \bar{x} = \frac{\sum x}{n}$ and $\bar{x}$ is the mean of the sample.
❖ The best unbiased estimate of $\sigma^2$, the population variance, is $\hat{\sigma}^2$, where
$\hat{\sigma}^2 = \frac{n}{n-1} \times s^2$ and $s^2$ is the variance of the sample.
INTERVAL ESTIMATES
Another way of using a sample value to give a good idea of an unknown population parameter is to
construct an interval, known as a confidence interval.
In general, this is an interval with a specified probability of including the parameter. The
interval is usually written in the form (a, b), and the end values, a and b, are known as confidence
limits. The probabilities often used in confidence intervals are 90%, 95% and 99%.
Suppose you do not know the mean µ of a particular population and you want to work out a
95% confidence interval for it. You would need to construct an interval (a,b) so that
P(a<µ<b)=0.95.
In this case, the probability that the interval includes µ is 0.95 or 95%.
The interval that you construct uses the value of the mean of a random sample of size n taken
from the population. This mean is denoted by 𝑥̅ .
Before constructing your interval for µ, it is essential to ask the following questions.
▪ Using any sample size, n large or small
Let us first consider how to calculate the end values of the most commonly used interval, the
95% confidence interval. The method can then be adapted for other levels of confidence.
Note that it is useful to be able to follow the derivation of the end points, but in practice and in
examinations you will only need to be able to apply the formula.
We know that if $X \sim N(\mu, \sigma^2)$ then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.
Standardizing $\bar{X}$, we have $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$, where $Z \sim N(0, 1)$.
For a 95% confidence interval you need to find the values of z between which the central 95%
of the distribution lies. This means that each tail has probability 0.025, so the cumulative probability below the upper value z is 0.975.
$P(Z < z) = 0.975$
$z = \Phi^{-1}(0.975)$
$= 1.96$
By symmetry, the lower value is $-1.96$.
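If $\bar{x}$ is the mean of a random sample of any size n taken from a normal population with known
variance $\sigma^2$, then a 95% confidence interval for µ is $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$.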
Examples
1. The mass of vitamin E in a capsule manufactured by a certain drug company is normally
distributed with standard deviation 0.042 mg. A random sample of five capsules was analyzed
and the mean mass of vitamin E was found to be 5.12 mg. Calculate a symmetric 95% confidence
interval for the population mean mass of vitamin E per capsule. Give the values of the end
points of the interval correct to 3 s.f.
Solution
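$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ with n = 5.
The 95% confidence interval for µ is $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$.
$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}} = 5.12 \pm 1.96 \times \frac{0.042}{\sqrt{5}}$
$= 5.12 \pm 0.0368\ldots$
So the 95% confidence interval for µ, based on the sample mean, is (5.08 mg, 5.16 mg).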
NOTE: The probability that the above interval includes the mean µ is 0.95 or 95%.
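The vitamin E calculation can be reproduced with a few lines of code. This is a sketch for the known-variance, normal-population case only. The function name `mean_confidence_interval` is hypothetical, and the critical value is computed from the standard normal distribution rather than read from tables.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(xbar, sigma, n, level=0.95):
    """Symmetric confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 when level is 0.95
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

low, high = mean_confidence_interval(xbar=5.12, sigma=0.042, n=5)
print(round(low, 2), round(high, 2))            # 5.08 5.16
```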
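The z-value in the confidence interval is known as the critical value and is obtained for different
levels of confidence in the same way. For example, for a 90% confidence interval each tail has probability 0.05, so
$P(Z < z) = 0.95$
i.e. $\Phi(z) = 0.95$
$z = \Phi^{-1}(0.95)$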
=1.645
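The same steps work for any level of confidence: a central probability c leaves (1 − c)/2 in each tail, so the critical value is $\Phi^{-1}\left(\frac{1 + c}{2}\right)$. A minimal sketch, with the hypothetical helper `critical_value` standing in for normal tables:

```python
from statistics import NormalDist

def critical_value(level):
    """z such that the central `level` of the standard normal lies between -z and z."""
    return NormalDist().inv_cdf((1 + level) / 2)

for level in (0.90, 0.95):
    print(level, round(critical_value(level), 3))
```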
TASK
In pairs, use a similar approach to find the z value for 99% confidence interval.
Summary
Examples
1. The heights of men in a particular district are distributed with mean µ cm and standard
deviation σ cm.
On the basis of the results obtained from a random sample of 100 men from the district, the
95% confidence interval for µ was calculated and found to be (177.22cm,179.18cm).
Calculate
The width of a confidence interval is the distance between the two confidence limits. This is the
difference between the upper and lower confidence limits.
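As an illustration using the heights interval (177.22 cm, 179.18 cm) with n = 100, the width, the sample mean and the value of σ can be recovered from the confidence limits. The parts of the original question are not shown in these notes, so the quantities computed here are an assumption about what might be asked. The calculation also assumes σ is treated as known and that z = 1.96 was used.

```python
from math import sqrt

lower, upper = 177.22, 179.18     # 95% confidence limits from the example
n, z = 100, 1.96

width = upper - lower             # distance between the two confidence limits
xbar = (lower + upper) / 2        # the interval is symmetric about the sample mean
sigma = (width / 2) * sqrt(n) / z # from half-width = z * sigma / sqrt(n)

print(round(width, 2), round(xbar, 2), round(sigma, 2))
```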
=2.0608kg
Exercise 9e, p460 (J. Crawshaw)
6. A random sample of 6 items taken from a normal population with mean µ and variance
4.5 cm² gave the following data:
Sample values: 12.9 cm, 13.2 cm, 14.6 cm, 12.6 cm, 11.3 cm, 10.1 cm.
(c) Would your answers have been different if the population was not normal? Explain your
answer.
10. One hundred and fifty bags of flour are taken from a production line and found to have a
mean mass of 748g and standard deviation of 3.6g.
(a) Calculate an unbiased estimate of the standard deviation of a bag of flour produced on this
production line.
(b) Calculate a 98% confidence interval for the mean mass of a bag of flour produced on this
production line.
THE DISTRIBUTION OF THE SAMPLE PROPORTION
Suppose a random sample of n observations is taken from a population in which the proportion
of successes is p and that of failures is q=1-p.
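If X is the number of successes in the sample, then X follows a binomial distribution, i.e. $X \sim B(n, p)$,
and E(X) = np, Var(X) = npq.
The random variable for the proportion of successes in the sample is $\frac{X}{n}$.
This can be written as $P_s$, where $P_s = \frac{X}{n} = \frac{1}{n}X$.
$E(P_s) = E\left(\frac{1}{n}X\right) = \frac{1}{n}E(X) = \frac{1}{n} \times np = p$
$\mathrm{Var}(P_s) = \mathrm{Var}\left(\frac{1}{n}X\right) = \frac{1}{n^2}\mathrm{Var}(X) = \frac{1}{n^2} \times npq = \frac{pq}{n}$
The distribution of $P_s$ has mean p and variance $\frac{pq}{n}$.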
NOTE: When considering the normal approximation to the binomial distribution, a continuity
correction of $\pm\frac{1}{2}$ is needed.
Since $P_s = \frac{1}{n}X$, use a continuity correction of $\frac{1}{n} \times \left(\pm\frac{1}{2}\right)$, i.e. $\pm\frac{1}{2n}$.
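The mean and variance of the sample proportion can be verified exactly from the binomial probabilities. The values n = 20 and p = 0.3 below are hypothetical, chosen only to illustrate the check.

```python
from math import comb

n, p = 20, 0.3                     # hypothetical sample size and proportion
q = 1 - p

# Binomial probabilities P(X = x) for x = 0, 1, ..., n
pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

# Mean and variance of the sample proportion X / n, summed over all outcomes
mean_ps = sum((x / n) * pr for x, pr in enumerate(pmf))
var_ps = sum((x / n - mean_ps) ** 2 * pr for x, pr in enumerate(pmf))

print(mean_ps, p)                  # both should equal p
print(var_ps, p * q / n)           # both should equal pq / n
```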
Imagine that you want to find p, the proportion of successes in a particular population. To get
an idea of its value, you could take a random sample of size n and calculate $p_s$, the proportion of
successes in your sample. This would give the best unbiased estimate $\hat{p}$, where $\hat{p} = p_s$. You could
also use this value of $p_s$ to obtain an interval estimate of p, known as the confidence interval
for p.
The theory needed to derive the confidence interval for p is based on the sampling distribution
of proportions, Ps.
This states that, provided the sample size n is large (n ≥ 30),
the distribution of $P_s$ is approximately normal, so $P_s \sim N\left(p, \frac{pq}{n}\right)$, where q = 1 − p.
The standard deviation of the sampling distribution of proportions, $\sqrt{\frac{pq}{n}}$, is needed in the
calculation of the limits for the confidence interval. However, the difficulty is that its value is
unknown since p is not known.
To overcome this, we use $\hat{p} = p_s$. Writing $1 - p_s$ as $q_s$, the standard deviation of the sampling
distribution is approximately $\sqrt{\frac{p_s q_s}{n}}$.
Remember that the sample size n should be large, since the normal approximation to the
binomial distribution is used in obtaining the distribution of the sample proportions. Also, since
a continuous distribution has been used as an approximation to the discrete distribution,
continuity corrections should be used. These are usually omitted, however, when calculating
confidence intervals.
Example 1
A manufacturer wants to assess the proportion of defective items in a large batch produced by
a particular machine. He tests a random sample of 300 items and finds that 45 items are
defective.
(a) Calculate a 95% confidence interval for the proportion of defective items in the batch.
(b) If 200 such tests are performed and a 95% confidence interval calculated for each, how
many would you expect to include the proportion of defective items in the batch?
Solution
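(a) $p_s = \frac{45}{300} = 0.15$, $q_s = 1 - p_s = 0.85$, n = 300
The 95% confidence limits for p are
$p_s \pm 1.96\sqrt{\frac{p_s q_s}{n}} = 0.15 \pm 1.96 \times \sqrt{\frac{0.15 \times 0.85}{300}}$
$= 0.15 \pm 0.0404$
So the 95% confidence interval for p is (0.1096, 0.1904).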
(b) The expected number of tests that include the proportion of defective items in the
batch =200× 0.95 = 190.
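Both parts of this example can be checked numerically. The sketch below follows the method in the notes (normal approximation, no continuity correction). The function name `proportion_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level=0.95):
    """Approximate confidence interval for a proportion, without continuity correction."""
    ps = successes / n
    qs = 1 - ps
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 when level is 0.95
    half_width = z * sqrt(ps * qs / n)
    return ps - half_width, ps + half_width

# (a) 45 defective items in a sample of 300
low, high = proportion_confidence_interval(45, 300)
print(round(low, 4), round(high, 4))            # 0.1096 0.1904

# (b) expected number of intervals, out of 200, that include p
expected = 200 * 0.95
print(expected)                                 # 190.0
```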
Example 2 (pair discussion)
In a random sample of 400 carpet shops, it was discovered that 136 of them sold carpets at
below the list prices recommended by the manufacturer.
(a) Estimate the percentage of all carpet shops selling below list price.
(b) Calculate an approximate 90% confidence interval for the proportion of shops that sell
below the list price and explain briefly what this means.
(c) What size sample would have to be taken in order to estimate the percentage to within
±2%, with 90% confidence?
Worksheet 3 Confidence intervals for p
1. In a survey of a random sample of 250 households in a large city, 170 households owned at
least one pet.
(a) Find an approximate 95% confidence interval for the proportion of households in the city
that own at least one pet.
2. In order to assess the probability of a successful outcome, an experiment was performed 200
times. The number of successful outcomes was 72.
(a) Find a 95% confidence interval for p, the probability of a successful outcome.
Suggested answers