Unit 5 Estimation: Structure
5.1 Introduction
Objectives
5.2 Point Estimation
5.3 Criteria For a Good Estimator
5.4 Interval Estimation
Confidence Interval for Mean with Known Variance
Confidence Interval for Mean with Unknown Variance
Confidence Interval for Proportion
5.5 Summary
5.1 INTRODUCTION
In Units 2 and 3 you have seen that populations can be described by distributions which
are fully determined by their parameters. For example, in the case of a
binomial distribution, you need to know $n$ and $p$; in a Poisson distribution, you need to
know $\lambda$; and a normal distribution is determined by $\mu$ and $\sigma$. These quantities are called
parameters. The problem with these parameters is that in real-life situations they are
usually unknown. We have seen in Unit 4 that in such situations we take a random
sample from the population and compute a function of the sample values, called a
statistic. More precisely, we try to estimate the population parameters by functions of
sample values. In this unit we shall discuss certain methods by which we can estimate
the population parameters. These processes are called estimation. As we have already
stated, in estimation we expect that the sample value is 'reasonably close' to the
population value. How do you judge this? Here we discuss some criteria which tell us
how well the parameters can be estimated by sample values.
In this unit we discuss two methods of estimation - point estimation and interval
estimation. In Sec. 5.2 we discuss point estimation. Point estimation concerns choosing
a statistic, that is a single number calculated from the sample data. In contrast to this,
we sometimes obtain an interval in which we can expect the parameter to lie with some
degree of confidence. The method of constructing such intervals is called 'interval
estimation'. In Sec. 5.4, we illustrate construction of such an interval. There we first
illustrate how such an interval is constructed for the population mean. We do this in
different cases. First we consider the case when the population standard deviation is
known and the sample size is large (n > 30). Then we take up the case where the
standard deviation is unknown, both when the sample size is small and when it is large.
After that we shall illustrate how interval estimates are constructed for the population
proportion.
Objectives
After reading this unit, you should be able to
• choose an estimator corresponding to a particular situation under study,
• check whether an estimator is unbiased or efficient,
• construct confidence intervals for the population mean and proportion, using the
  appropriate sampling distribution.
Imagine that you need to find the mean life-time of the bulbs produced by a company.
Assume that the life of a bulb is normally distributed with mean $\theta$. Now to find the
life-time of a bulb, you have to keep it on till it burns out, and note the time. So, it is a
destructive process. If you do this for every bulb produced by the company, it will soon
have to close down! The way out in this situation is to take a sample of the bulbs, and
try to estimate the average life-time of the population on the basis of the life-time
observations obtained from the sample. Of course, we cannot hope to get the exact
value of the mean life-time. What we get from the sample is only an estimate. If
$x_1, x_2, \ldots, x_n$ are the life-times of the bulbs which were chosen in a sample of size $n$,
then we could take the sample mean, $\frac{x_1 + x_2 + \cdots + x_n}{n}$, as an estimate of the population
mean. Of course, this estimate will vary from sample to sample. You have already
come across this concept in the previous unit.
But apart from the sample mean, there could be other ways of estimating the population
mean from the sample. For example, we could take a single observation such as $x_1$ as an
estimate, or we could take $\frac{x_{\min} + x_{\max}}{2}$ as an estimate, where $x_{\min}$ is the minimum
value and $x_{\max}$ is the maximum value. (Note that $\frac{x_1 + x_2 + \cdots + x_n}{n}$ can be written
as $\frac{\sum_{i=1}^{n} x_i}{n}$.)
In any case, the estimate is always based on some or all of the sample values. That is to
say that we calculate some sample statistic and take it as an estimate of the population
parameter. This sample statistic is called an estimator. The value of this estimator for
our sample is the estimate.
Definition 1: An estimator is a function of the sample observations that is used to
estimate an unknown parameter. A point estimate is a single value of an estimator.
The process by which we choose an estimator and find the point estimate for estimating
an unknown parameter is called point estimation.
For example, if a sample mean is used to estimate a population mean, and if the sample
mean for a particular sample equals 10, then the estimator used is the sample mean,
whereas the point estimate is 10.
To cite another example, suppose we are interested in finding the proportion of
individuals in India preferring a given soft drink over another. Here the population
parameter is proportion. If the sample proportion is used to estimate the population
proportion and if the sample proportion for a particular sample equals 0.6, then the
estimator used is the sample proportion and the point estimate is 0.6.
Why don't you try an exercise now?
E1) Write the estimator and estimate used in the following two situations.
i) Suppose an organisation wants to have some information about the mileage
for a whole fleet of used taxis, and for that they calculate the mean odometer
reading (mileage) from a sample of used taxis and find it to be 98,000 miles.
ii) Suppose we want to find the proportion of teenagers who have a criminal
record and for that we take a sample of 50 teenagers and find that 2% (or
0.02) have a criminal record.
We can, in fact, have a number of estimators for a given parameter. Apart from the
sample mean, the sample median or the average of the smallest and the largest
observations in the sample could also be considered as estimators for the population
mean. Since we have a variety of estimators for a parameter $\theta$, we should choose the
best of the lot to get a really good estimate. But what do we mean by the best? We'll see
that in the next section.
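To make the distinction between an estimator and an estimate concrete, here is a minimal Python sketch. It treats each estimator as a function and each estimate as the number that function returns for one sample. The life-time values and the helper name `midrange` are hypothetical and chosen only for illustration.

```python
import statistics


def midrange(sample):
    """Average of the smallest and the largest observations."""
    return (min(sample) + max(sample)) / 2


# Hypothetical bulb life-times in hours (illustrative values, not from the unit).
lifetimes = [980, 1010, 1045, 990, 1030]

# Three different estimators of the population mean, applied to the same sample.
estimates = {
    "sample mean": statistics.mean(lifetimes),
    "sample median": statistics.median(lifetimes),
    "midrange": midrange(lifetimes),
}

for name, value in estimates.items():
    print(f"{name}: {value}")
```

The three estimators give three different estimates from the same data, which is why some criterion is needed to choose between them.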
= 0.51 seconds.
Then $\bar{x} = 0.51$ seconds. Therefore an unbiased estimate is 0.51 seconds, and 0.51
seconds is a point estimate for the mean reaction time of individuals to the stimulus.
E2) A law firm selects a random sample of 60 electronics stores in a particular area,
and asks each of them to repair a compact disc player. In each case the law firm
determines whether the store makes unnecessary repairs in order to inflate its bill.
The law firm finds that 8 of the stores are guilty of this practice. Obtain a point
estimate of the proportion of all such stores in the area that inflate bills in this way.
E3) A washing machine company chooses a random sample of 25 motors from those
it receives from one of its suppliers. It determines the length of life of each of the
motors. The results (expressed in thousands of hours) are as follows:
4.1 4.6 4.6 4.6 5.1
4.3 4.7 4.6 4.8 4.8
4.5 4.2 5.0 4.4 4.7
4.7 4.1 3.8 4.2 4.6
3.9 4.0 4.4 4.0 4.5
The firm's management is interested in estimating the mean length of life of the
motors received from the supplier. Provide a point estimate of this population
parameter.
We have seen that the sample mean and sample proportion are unbiased estimates for
population mean and population proportion respectively. Does this indicate that the
statistic or estimator corresponding to the population parameter is always unbiased? To
find an answer to this, let us consider the following example.
Suppose we consider the parameter 'standard deviation'. Then the sample statistic $S$
given by the formula
$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{1}$$
where $(x_1, x_2, \ldots, x_n)$ denote the sample observations, can be taken to be an estimator
of the population standard deviation $\sigma$. It has been proved that $S^2$ has an
expected value equal to $\frac{n-1}{n}\sigma^2$ and not $\sigma^2$; this means that $S$ is not an unbiased
estimator. Hence the divisor $n - 1$ is used instead, giving the expression
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{2}$$
whose square is an unbiased estimator of $\sigma^2$, instead of the expression in (1). For example,
the estimate of the population standard deviation obtained from (2) for the situation
given in Problem 1 is
$$\left[\,\ldots\,(0.49 - 0.51)^2 + (0.52 - 0.51)^2 + (0.53 - 0.51)^2\right]$$
∴ $S$ = 0.0006 seconds.
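The effect of the divisor can be checked numerically. The sketch below is an illustration only: the reaction times are hypothetical (they are not the data of Problem 1). It computes the standard deviation with divisor $n$ and with divisor $n - 1$ using Python's `statistics.pstdev` and `statistics.stdev`, and confirms that the two differ by the factor $\sqrt{(n-1)/n}$.

```python
import math
import statistics

# Hypothetical reaction times in seconds (not the data of Problem 1).
times = [0.49, 0.52, 0.53, 0.50, 0.51]
n = len(times)

s_divisor_n = statistics.pstdev(times)          # divisor n
s_divisor_n_minus_1 = statistics.stdev(times)   # divisor n - 1

# The two values differ exactly by the factor sqrt((n - 1) / n).
ratio = s_divisor_n / s_divisor_n_minus_1

print(f"divisor n:     {s_divisor_n:.5f}")
print(f"divisor n - 1: {s_divisor_n_minus_1:.5f}")
print(f"ratio:         {ratio:.5f}  (sqrt((n-1)/n) = {math.sqrt((n - 1) / n):.5f})")
```

The version with divisor $n$ is always the smaller of the two, so on average it understates the spread of the population.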
As we have seen in E1, in certain situations one can find more than one unbiased
estimator for an unknown parameter $\theta$. If we have to choose between two unbiased
estimators for a fixed sample size, then we find the standard deviation (or variance) of
the sampling distribution of these two estimators and choose the one with the smaller
standard deviation (or variance). An unbiased estimator $T_1$ of a parameter $\theta$ is said to
be more efficient than another unbiased estimator $T_2$ of $\theta$ if $\mathrm{Var}(T_1) < \mathrm{Var}(T_2)$, and in
such a case, the sampling distribution of $T_1$ has a smaller dispersion (spread) about $\theta$
than that of $T_2$ (see Fig. 1).
Fig. 1: Sampling distributions of a more-efficient estimator and a less-efficient estimator.
As an example, let us take a random sample of size $n$ from a normal population with
mean $\mu$ and standard deviation $\sigma$ and consider the sample mean and sample median as
two estimators of $\mu$. If we compare the sampling distributions of the mean and median
for random samples of size $n$, we find that these two sampling distributions have the
same mean but their variances differ. We have seen in Unit 4 that the variance of the
sampling distribution of the mean is $\sigma^2/n$, and it can be shown that for random samples
of the same size from a normal population, the variance of the sampling distribution of
the median is approximately $1.5708\,\sigma^2/n$.
Hence both the mean and the median are unbiased estimators, but for a
given sample size, the standard error of the mean is less than that of the median.
From what we have already observed now, we get that for random samples from normal
populations the mean is more efficient than the median as an estimator of $\mu$. This fact
will be clearer to you when you try E4. In fact it can be shown that in most practical
situations where we estimate a population mean $\mu$, the variance of the sampling
distribution of no other statistic is less than that of the sampling distribution of the
mean. In other words, in most practical situations the sample mean is the 'most
acceptable' statistic for estimating a population mean $\mu$.
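The comparison between the mean and the median can also be explored by simulation. The following Python sketch draws many samples from a normal population and compares the variances of the resulting sample means and sample medians. The sample size, number of trials, seed, and the function name `sampling_variances` are arbitrary choices for illustration.

```python
import random
import statistics


def sampling_variances(n=25, trials=4000, mu=0.0, sigma=1.0, seed=1):
    """Simulate the sampling distributions of the mean and the median.

    Returns (variance of the sample means, variance of the sample medians).
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
        medians.append(statistics.median(sample))
    return statistics.variance(means), statistics.variance(medians)


var_mean, var_median = sampling_variances()
print(f"variance of sample means:   {var_mean:.4f}")
print(f"variance of sample medians: {var_median:.4f}")
print(f"ratio median/mean:          {var_median / var_mean:.2f}")
```

For normal data the ratio should come out noticeably above 1, in line with the approximate factor 1.5708 quoted for large samples; the exact figure varies with the sample size and the random seed.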
There exist several other criteria for assessing the "goodness" of estimators, but we
shall not discuss them in this course.
Why don't you try this exercise now?
E4) To verify the claim that the mean is generally more efficient than the median, a
student conducted an experiment consisting of 12 tosses of three dice. The
following are his results: 2, 4 and 6; 5, 3 and 5; 4, 5 and 3; 5, 2 and 3; 6, 1 and 5;
2, 3 and 1; 3, 1 and 4; 5, 5 and 2; 3, 3 and 4; 1, 6 and 2; 3, 3 and 3; and 4, 5 and 3.
a) Calculate the 12 medians and the 12 means.
b) Group the medians and the means obtained in part (a) into separate
distributions having the classes 1.5-2.5, 2.5-3.5, 3.5-4.5 and 4.5-5.5.
c) Draw histograms of the two distributions obtained in part (b) and explain
how they illustrate the claim that the mean is generally more efficient than
the median.
Suppose you have been suspecting that the 1 litre pack of milk that is delivered to your
house every morning is not exactly 1 litre, but less. You feel that the filling machine
which is supposed to fill each polypack with 1 litre of milk is not working properly. Of
course, you are ready to admit that even though the machine is set for 1 litre, it has a
certain variability and so there could be some packs which are less than 1 litre full
while others are more.
To end your doubts, you need to find the average volume of milk filled by the machine.
Obviously, it would be impossible to do this except by taking a sample. Suppose you
measure the milk pack you get over a period of sixty days. That is, your sample size is
60. Suppose you find that the mean of your observations, which is the sample mean, is
950 ml. This is an estimate of the population mean. But you cannot immediately
conclude that the machine is set for 950 ml. You must account for the variability of the
sample means. For this you must also know the standard deviation, $\sigma$, or calculate it
from the sample. Suppose we assume that $\sigma = 50$ ml.
Now we shall construct an interval for the parameter $\mu$, the average amount of milk that
the machine gives. For that we make use of the central limit theorem discussed in Unit
4. According to this theorem, for sufficiently large sample size $n$ the sample mean $\bar{x}$ is
approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$. Then we
make use of the normal distribution table given in Appendix 2 at the end of this block
and note that
$$P[-1.96 < Z < 1.96] = 0.95 \tag{3}$$
where $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, i.e. $Z\,\frac{\sigma}{\sqrt{n}} = \bar{x} - \mu$.
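Now we rewrite Eqn. (3) using simple algebra as
$$P\left[-1.96\frac{\sigma}{\sqrt{n}} < \bar{x} - \mu < 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95$$
Now we subtract $\bar{x}$ from all the three terms inside the bracket. Then we get
$$P\left[-\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95$$
Now we multiply all the terms inside the bracket by $-1$ (and therefore the inequalities
get reversed) and we get
$$P\left[\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95 \tag{4}$$
(If we multiply the terms in the inequality $y \ge 1$ by $(-1)$, then the inequality gets
reversed and we get $-y \le -1$.)
Thus corresponding to each sample mean $\bar{x}$, we get an interval given by
$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) \tag{5}$$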
which satisfies Equation (4). Let us now see what Equation (4) implies. Let us, for
example, consider the sample value $\bar{x} = 950$ ml obtained for the problem regarding the
average volume of milk filled by the machine. Then Equation (4) corresponding to
$\bar{x} = 950$ ml is
$$P[937.35 < \mu < 962.65] = 0.95$$
We interpret it in the way that we are 95% confident that the interval (937.35, 962.65)
contains the true value $\mu$. This does not mean that "there is 95% probability that $\mu$ lies
in the interval (937.35, 962.65)". This is a very common misinterpretation of
Equation (4) and it is incorrect. This is because the population mean $\mu$ is a fixed
quantity and therefore $\mu$ either lies in the interval (937.35, 962.65) or it does not.
Therefore the probability that $\mu$ lies in the interval is either 0 or 1. The 95 percent
is assigned to our level of confidence that the interval contains $\mu$. It is not
assigned to the probability that $\mu$ lies in the interval.
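The end points 937.35 and 962.65 can be reproduced with a few lines of Python. This is a minimal sketch using the figures from the milk example (sample mean 950 ml, $\sigma = 50$ ml, $n = 60$); the function name `z_interval` is a hypothetical helper introduced here.

```python
import math


def z_interval(sample_mean, sigma, n, z=1.96):
    """Confidence interval for the mean when sigma is known.

    z is the standard normal table value (1.96 for 95% confidence).
    """
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Milk example from the text: mean 950 ml, sigma 50 ml, 60 observations.
lower, upper = z_interval(950, 50, 60)
print(f"({lower:.2f}, {upper:.2f})")  # (937.35, 962.65)
```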
Another interpretation of Equation (4) is based on the fact that we can construct a
confidence interval for each sample mean $\bar{x}$. We will get different intervals for different
values of sample means. So, in this case Equation (4) says that if all possible samples
of size $n$ are drawn, and the intervals are calculated
for each sample, then 95% of all such intervals are expected to contain the population
parameter $\mu$. This does not mean that for a particular sample value $\bar{x}$, we can be certain
that the interval $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$ will contain $\mu$.
(The confidence interval $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$ is also denoted as
$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$.)
That means if you select 100 samples and calculate the intervals about their
sample means, then about 95 of these can be expected to contain the population mean $\mu$.
Note that here we assume that $\sigma$ is known. In the following figure we have illustrated
this graphically, showing five such intervals.
Fig. 2: A number of intervals constructed around the population mean.
Only one of the intervals shown does not contain the population mean.
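The "95% of all such intervals" interpretation can be illustrated by simulation. In the sketch below, many samples are drawn from a normal population whose mean is known to the program, and we count how often the interval around the sample mean contains it. The population values, number of trials, seed, and the function name `coverage` are assumptions made for the illustration.

```python
import math
import random


def coverage(mu=1000.0, sigma=50.0, n=60, trials=2000, z=1.96, seed=7):
    """Fraction of simulated intervals (xbar +/- z*sigma/sqrt(n)) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials


print(f"proportion of intervals containing mu: {coverage():.3f}")
```

The proportion should come out close to 0.95, although any single interval either contains $\mu$ or does not.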
The interval given by (5) is called a confidence interval (C.I.).
The value 0.95 (or 95%) attached to the confidence interval is called the confidence
coefficient. The left end point of the confidence interval is called the lower confidence
limit (LCL) and the right end point of the confidence interval is called the upper
confidence limit (UCL). The difference between the UCL and LCL is the width of the
confidence interval. The width of the 95% confidence interval in the above example is
962.65 - 937.35 = 25.30 ml.
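Although 0.95 is frequently used as a confidence coefficient, we can have other values
such as 0.90 or 0.99 as confidence coefficients. Using the normal distribution table, we
can obtain the confidence interval for 0.90 (or 90%) as
$$\left(\bar{x} - 1.645\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.645\frac{\sigma}{\sqrt{n}}\right)$$
and for 0.99 (or 99%) as
$$\left(\bar{x} - 2.58\frac{\sigma}{\sqrt{n}},\; \bar{x} + 2.58\frac{\sigma}{\sqrt{n}}\right)$$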
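Instead of reading the normal table, the multiplier for any confidence coefficient can be computed. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the function name `z_value` is a hypothetical helper.

```python
from statistics import NormalDist


def z_value(confidence):
    """Standard normal value z such that P[-z < Z < z] equals the confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_value(level):.3f}")
```

The values printed for 90%, 95% and 99% agree, after rounding, with the table values 1.645, 1.96 and 2.58.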
This shows that the director can be 95% confident that the interval (834.01, 865.99)
contains the mean market value.
E5) For each of the values given below, calculate the 95% confidence interval for the
mean.
i) $\bar{x} = 0$, $\sigma = 10$, $n = 8$
ii) $\bar{x} = 550$, $\sigma = 40$, $n = 16$.
E6) If the mean length of hospitalisation of 140 patients was 11.4 days and the
standard deviation of patient days is assumed to be 2.5 days, what is the 99%
confidence interval for the average length of stay? Assume normality.
E7) Estimate the number of days between germination and the first pickable
cucumbers using the following sample.
Date of germination    First fruit
May 1                  June 17
May 4                  June 18
May 8                  June 21
May 5                  June 16
May 12                 June 28
May 18                 July 3
May 11                 June 25
May 9                  June 26
What is the 95% confidence interval assuming $\sigma = 2$ days?
In all the computations of the confidence interval for $\mu$ so far we have assumed that the
population variance is known. Each time, the normal distribution was the appropriate
sampling distribution used to determine the confidence intervals. However, the normal
distribution is not appropriate when the population variance is unknown and the sample
size is less than 30. In such situations we use the t-distribution. As indicated in the previous
unit, the sample standard deviation $s$ is generally used as an estimator of the
population standard deviation.
If the sample size is 30 or less and the population is normal (and large relative to the
sample), a confidence interval for the population mean can be constructed by using the
t-distribution in place of the standard normal distribution.
You are already familiar with the t-distribution from Unit 4. We now have to use the
t-distribution table given in the Appendix to construct the confidence intervals corresponding
to different levels of confidence, say 95% or 99%. Let us suppose that we want to find
the confidence interval at the 90% confidence level, i.e. $\alpha = 0.1$, with a sample size of
14, similar to the ones we have given in Equation (5). Note that we don't know $\sigma$ in this
case. Therefore, as indicated in Unit 4, the sample standard deviation $s$ is used as an
estimator of the population standard deviation. Thus if $s$ is known, then the 90% confidence
interval is given as
$$\left(\bar{x} - t_{0.05}\frac{s}{\sqrt{n}},\; \bar{x} + t_{0.05}\frac{s}{\sqrt{n}}\right)$$
where $t_{0.05}$ is the t-value corresponding to the value $\frac{\alpha}{2} = \frac{0.1}{2} = 0.05$ and for the
parameter $\nu = n - 1$, where $n$ is the sample size. Now to find the t-value we make use
of Table 1 in the Appendix. For example, suppose that $n = 14$; then $\nu = 13$ and,
from Table 1, we get that the t-value is $t_{\alpha/2} = 1.771$ (see Fig. 3).
Fig. 3: t-distribution for n = 14 (df = 13 degrees of freedom), with 0.05 of the area under
the curve in each tail.
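The small-sample procedure can be sketched in Python as follows. The t-value is taken from the table (1.771 for 13 degrees of freedom at the 90% level, as found above) rather than computed, and the 14 observations are hypothetical values made up for the illustration. The sketch assumes the sample standard deviation is computed with divisor $n - 1$, which is what `statistics.stdev` does.

```python
import math
import statistics


def t_interval(sample, t_value):
    """Confidence interval xbar +/- t * s / sqrt(n) for a small sample.

    t_value must be read from the t table for n - 1 degrees of freedom.
    """
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # assumption: divisor n - 1
    half_width = t_value * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical sample of 14 observations (illustrative values only).
sample = [43.1, 44.0, 42.5, 43.8, 44.2, 42.9, 43.5,
          43.3, 44.1, 42.7, 43.6, 43.9, 43.0, 43.4]

lower, upper = t_interval(sample, t_value=1.771)  # 90% level, 13 d.f.
print(f"90% C.I.: ({lower:.2f}, {upper:.2f})")
```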
So, we can be 95% confident that the true mean lies between 42.07 and 44.93.
The idea will be clearer to you when you do the following exercises.
Problem 3: A manufacturer of light bulbs wants to estimate the mean length of life of
a new type of bulb which is designed to be extremely durable. The firm's engineer tests
nine of these bulbs and finds that the length of life (in hours) of each is as follows:
Previous experience indicates that the lengths of life of individual bulbs of a particular
type are normally distributed. Construct a 90 percent confidence interval for the mean
length of life of all bulbs of this new type.
Solution: If $x_i$ is the length of life of the $i$th light bulb in the sample, we find that
E8) Given the following sample sizes and confidence levels, find the appropriate $t_{\alpha/2}$
values for constructing confidence intervals.
i) n = 10; 99%
ii) n = 28; 95%
iii) n = 13; 90%
iv) n = 25; 99%
E10) Five measurements of the reaction time of an individual to certain stimuli were
recorded as: 0.28, 0.30, 0.27, 0.33 and 0.31 second. Find the 95% confidence
interval for the actual reaction time.
E11) If you are given a sample of 20 candles from a large shipment of candles, and are
asked to give an interval estimate of their average burning life, how would you
proceed? What information would you need?
The above examples and exercises illustrate how we can use t-distribution to find the
confidence intervals. As we mentioned earlier, t-distribution can be used only if the
population variance is unknown and the sample size is small. Next we shall see how
to construct the confidence intervals for large samples when the population variance is
unknown.
Mathematicians have shown that if the sample size is large, we can simply substitute
the sample standard deviation for the population standard deviation in the results
obtained in the previous part of this section i.e. in Subsection 5.4.1. Thus, if we want to
construct a 95% confidence interval - that is, a confidence interval with a confidence
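coefficient of 95 percent - we can substitute $s$ for $\sigma$ in Equation (5), the result being
$$\left(\bar{x} - 1.96\frac{s}{\sqrt{n}},\; \bar{x} + 1.96\frac{s}{\sqrt{n}}\right) \tag{8}$$
Equation (8) is applicable only if the population is large relative to the sample.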
The following example should make the above discussion more clear.
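Problem 4: A random sample of 100 ball bearings made by a machine in 1 week was
taken. The mean diameter was found to be 8.24 mm with a standard deviation of 0.42
mm. Find the 95% and 99% confidence intervals for the mean diameter of ball bearings
produced by that machine.
Solution: Since the sample is large, from Equation (8), we get that the 95% confidence
interval for $\mu$ is
$$8.24 \pm 1.96\,\frac{0.42}{\sqrt{100}} = 8.24 \pm 0.0823, \quad \text{i.e. } (8.158,\ 8.322) \text{ mm}.$$
Similarly, replacing 1.96 by 2.58, the 99% confidence interval is $8.24 \pm 0.1084$, i.e.
(8.132, 8.348) mm.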
Next we shall illustrate how confidence intervals are calculated for population
proportions. We have talked at length about the estimation of the population mean $\mu$.
Another important population parameter that we need to estimate is the population
proportion, $\pi$. Let's see how to go about it.
standard deviation $\sqrt{\frac{\pi(1 - \pi)}{n}}$. We also know from Unit 4 that if the sample size is
sufficiently large and if $\pi$ is not very close to 0 or 1, the sampling distribution is
approximately a normal distribution. Then using the standard normal distribution
table, we can find confidence intervals. If we want to construct a 95% confidence
interval then that will be given by
$$P\left[-1.96 < \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} < 1.96\right] = 0.95$$
so that
$$P\left[p - 1.96\sqrt{\frac{\pi(1 - \pi)}{n}} < \pi < p + 1.96\sqrt{\frac{\pi(1 - \pi)}{n}}\right] = 0.95 \tag{9}$$
The interval given in (9) is called a 95% confidence interval for $\pi$. Similarly, we can
have 90% or 99% confidence intervals. The intervals given in Equation (9)
cannot be used as they stand, because they involve the unknown $\pi$.
However, if $n$ is large, then $\pi$ can be replaced by the sample proportion $p$ without
compromising accuracy. So for large samples, the 95% confidence interval for $\pi$ will be
$$\left(p - 1.96\sqrt{\frac{p(1 - p)}{n}},\; p + 1.96\sqrt{\frac{p(1 - p)}{n}}\right)$$
If we want to get a 99% confidence interval, we will have to replace 1.96 by 2.58, since
$$P[-2.58 < Z < 2.58] = 0.99$$
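The large-sample interval for a proportion can be sketched in Python as below. The function name `proportion_interval` is a hypothetical helper; the figures used are those of exercise E14 (24 defective items in a sample of 800, 99% level with multiplier 2.58).

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Large-sample confidence interval p +/- z * sqrt(p(1 - p)/n) for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Figures from exercise E14: 24 defectives out of 800, 99% confidence.
lower, upper = proportion_interval(24, 800, z=2.58)
print(f"({lower:.4f}, {upper:.4f})")  # (0.0144, 0.0456)
```

This approximation relies on $n$ being large and the proportion not being too close to 0 or 1, as noted in the text.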
E14) A random sample of 800 calculators contains 24 defective items. Compute a 99%
confidence interval for the proportion of defective calculators.
E15) Of 1000 randomly selected lung cancer cases, 699 resulted in death. Construct a
95% confidence interval for the death rate from lung cancer.
E16) A student in a university wanted to decide whether or not to contest the election for
the presidency of the students' union. Out of 50 students, 11 showed their
willingness to vote for her. Find a 99% confidence interval for the true proportion
of students voting for her.
We now summarise our discussion about interval estimation in the following table:
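Table 1
Parameter                        Point Estimator   Confidence Interval
$\mu$, $\sigma$ known                  $\bar{x}$               $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
$\mu$, $\sigma$ unknown, large $n$     $\bar{x}$               $\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
$\mu$, $\sigma$ unknown, small $n$     $\bar{x}$               $\left(\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right)$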
The above analysis is, in fact, exactly how ICI's statisticians proceeded. Despite the
fact that the sample consisted of only 10 observations, the evidence was very strong that
the chlorinating agent had a positive effect on abrasion resistance. After all, the 95
percent confidence interval was that the mean difference between abrasion resistance of
rubber with and without treatment was an increase of between 0.464 and 2.076. (For
that matter, the statisticians found that the 98 percent confidence interval was that the
mean difference was an increase of between 0.265 and 2.275). The best estimate was
that the chlorinating agent resulted in an increase of about 1.27 in abrasion resistance.
With this detailed example you have seen how several aspects covered in this unit
come together. As you reflect on this case study, you should check from the summary
below how many points are actually covered in it.
With that we come to the end of this unit.
5.5 SUMMARY
E1) For (i) the estimator is the mean mileage of the sample of used taxis. The value
98,000 miles is an estimate.
For (ii) the estimator is the sample proportion and the value 0.02 is an estimate.
E2) An unbiased estimate of the population proportion is obtained by
$p = \frac{8}{60} = \frac{2}{15}$
E3) A point estimator of the population mean is obtained by calculating the sample
mean.
The sample mean of 25 motors is 4.448 thousands of hours.
E4) a) The medians are 4, 5, 4, 3, 5, 2, 3, 5, 3, 2, 3 and 4; the means are 4, 4.3, 4, 3.3, 4, 2,
2.7, 4, 3.3, 3, 3 and 4.
b) The frequencies are 2, 4, 3 and 3 for the medians and 1, 5, 6 and 0 for the
means. Then obtain the frequency distribution.
c) The histograms of the two distributions show that the variance of the median is
more than that of the mean, which illustrates the claim that the mean is generally
more efficient than the median.
E5) The 95% confidence interval for the mean is
i) Here $\bar{x} = 0$, $\sigma = 10$ and $n = 8$. Therefore the interval is
$0 \pm 1.96\,\frac{10}{\sqrt{8}} = (-6.9296,\ 6.9296)$
99% C.I.: $0.738 \pm 3.11\,\frac{s}{\sqrt{n}} = (0.6267,\ 0.8493)$
E10) From the sample, $\bar{x} = 0.298$,
$s = 0.0213$ and d.f. = 4.
∴ the required value of $t$ is 2.78.
∴ 95% C.I. = 0.298 ± 2.78
E11) Light the candles and measure the amount of time (life-time) for which each
candle burns. This data will have 20 observations. Find the mean ($\bar{x}$) and the
standard deviation ($s$) of this data. The value of $\bar{x}$ is a point estimate. If we want a
95% C.I., we find $t = 2.09$ for 19 d.f., since the sample size is 20. Then the C.I. is
$\bar{x} \pm 2.09\,\frac{s}{\sqrt{20}}$.
E14) $p = \frac{24}{800} = 0.03$, $n = 800$
∴ C.I. = (0.0144, 0.0456)
E16) $p = \frac{11}{50} = 0.22$. Therefore the C.I. is (0.0688, 0.3711)