Unit 17
Structure
17.0 Objectives
17.1 Introduction
17.2 Concept of Statistical Inference
17.3 Statistical Estimation
17.4 Concept of Hypothesis Testing
17.5 Critical Regions and Types of Errors
17.6 Testing of Hypothesis for a Single Sample
17.7 Test for Difference between Two Samples
17.8 Contingency Table
17.9 Summary
17.10 Answers to Self Check Exercises
17.11 Keywords
17.12 References and Further Reading
17.0 OBJECTIVES
After going through this Unit you should be in a position to:
explain the concept of a hypothesis;
explain the concept of statistical inference;
test a hypothesis based on a single sample; and
test the difference between two samples.
17.1 INTRODUCTION
As mentioned in Unit 6 of this course we undertake a sample survey instead of complete
census of population because of certain constraints. These constraints could be availability
of money, manpower and time. After collection of data through questionnaire, interview
or participatory observation method we follow certain steps such as tabulation,
presentation and analysis of data. We have discussed these issues in the earlier Units of
this course. As you know, we can present data in the form of tables and graphs. Also
data can be put to various statistical analyses. Thus we can find out i) measures of
central tendency such as mean, median and mode, ii) measures of dispersion such as
variance and standard deviation, and iii) correlation and regression coefficients.
Recall that the objective of our study is to analyse the behaviour of the population or
the universe, not the sample. In order to make things feasible we are studying the
sample and hence whatever results we have got are based on sample information.
Naturally a question arises: Are the sample results valid for the population? In other
words, can we draw inferences on the basis of sample results?
Let us take a concrete example from our daily life. You must have noticed that before
election process starts or just before declaration of election results many newspapers
and news channels conduct exit polls. The purpose is to predict election results before
the actual results are declared. At that point of time, it is not possible for the surveyors
to ask all the voters about their preferences for electoral candidates - the time is too
short, resources are scarce, manpower is not available, and a complete census before
election defeats the very purpose of election!
The above is an example of statistical inference. The surveyor actually does not know
the result, which is the outcome of votes cast by all the voters. Here all the voters taken
together comprise the population. The surveyor has collected data from a representative
sample of the population, not all the voters. On the basis of the information contained in
the sample, (s)he is making forecast about the entire population.
i) Sample is a part of the population and there is no reason to expect that sample
mean is equal to population mean (if it is, it is a rare coincidence!). In that case,
what is the population mean?
ii) A number of samples can be drawn from the same population. Suppose we send
two researchers to Sambalpur University on different days and ask them to
administer the same questionnaire (on reading habits) on samples of 100 students
each. Obviously both researchers would come out with different results (say 9.25
hours and 10.5 hours) as the sampling units are different. Which result do we take
to be correct? Can we say that the difference between the studies is negligible?
Remember that population mean is not known to us. We know the sample mean only.
We have posed two types of questions above. First, what would be the value of the
population mean? The answer lies in making an informed guess about the population
mean. This aspect of statistical inference is called ‘estimation’. The second question
pertains to certain assertion made about the population mean. Suppose someone claims
that the average number of hours devoted to study by economics students in Sambalpur
University is 10 hours. On the basis of the sample information can we say that the
population mean is not equal to 10 hours? This aspect of statistical inference is called
hypothesis testing.
Thus statistical inference has two important aspects: statistical estimation and hypothesis
testing (see Fig. 17.1). Estimation could be of two types: point estimation and interval
estimation. In point estimation we estimate the value of population parameter as a single
point. On the other hand, in the case of interval estimation we estimate lower and upper
bounds around sample mean within which population mean is likely to remain.
Hypothesis as you know is an assertion or claim made about the population. It can be
in the form of a null hypothesis and its counterpart, alternative hypothesis. We will
explain these concepts along with examples below.
Research Process
When we subtract the population mean from the sample mean and divide the difference by the
standard error of the mean ($\sigma / \sqrt{n}$), we obtain the standard normal variable z, which is
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Estimation of Interval
Remember that z is defined in such a manner that $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$. Therefore, when sample
mean ($\bar{x}$) is equal to population mean ($\mu$), we find that $z = 0$. When $\bar{x}$ is greater than
$\mu$, we obtain a positive value for z. Similarly, when $\bar{x}$ is smaller than $\mu$ we obtain a
negative value for z. Thus, as the absolute value of z increases, the difference between sample
mean ($\bar{x}$) and population mean ($\mu$) increases.
In Fig. 17.2 we have shown that the area covered under the curve between z = -1.96 and
z = +1.96 is 95 percent. Therefore, if we add and subtract $1.96\,\sigma / \sqrt{n}$ from sample mean ($\bar{x}$) we obtain
a 95 percent confidence interval. In symbols, lower limit and upper limit of the interval
are $\bar{x} - 1.96\,\sigma / \sqrt{n}$ and $\bar{x} + 1.96\,\sigma / \sqrt{n}$ respectively.
Similarly we obtain the 99 percent confidence interval as $\bar{x} \pm 2.58\,\sigma / \sqrt{n}$,
since 99 percent area under the standard normal curve is covered between z = -2.58 and z = +2.58.
Confidence coefficient could take any value. We can ask for a confidence level of say
81 per cent or 97 per cent depending upon how precise our conclusions should be.
However, conventionally two confidence levels are frequently used, namely, 95 per
cent and 99 per cent.
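The construction of a confidence interval with known population standard deviation can be sketched in a few lines of Python. This is only an illustration: the function name `z_confidence_interval` and the sample values (mean 50, standard deviation 10, sample size 100) are hypothetical and are not taken from this Unit. The critical value of z is obtained from the standard normal distribution instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of the interval for the population mean.

    Assumes the population standard deviation `sigma` is known.
    """
    alpha = 1 - confidence
    # Two-sided critical value: about 1.96 for 95% and about 2.58 for 99%.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / sqrt(n)   # z times the standard error
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: mean 50, known sigma 10, sample size 100.
lower, upper = z_confidence_interval(50.0, 10.0, 100, confidence=0.95)
print(round(lower, 2), round(upper, 2))   # 48.04 51.96
```

Asking for a higher confidence level (say 0.99) widens the interval, because a larger critical value of z is multiplied by the same standard error.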
Self Check Exercise
1) Define the following concepts:
a) confidence coefficient
b) confidence interval
c) level of significance
d) sampling distribution
e) standard error
2) A sample of 50 employees was asked to provide the distance commuted by them
to reach office. The sample mean was found to be 4.5 km. Find the 95 percent confidence
interval for the population mean. Assume that population is normally distributed with a
variance of 0.36.
3) For a sample of 25 students in school the mean height was found to be 95 cm.
with a standard deviation of 4 cm. Find the 99 percent confidence interval.
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
$H_A: \mu \neq 0.51$ …(17.2)
We have to keep in mind that null hypothesis and alternative hypothesis cannot be true
simultaneously. Secondly, there cannot be a third possibility except for $H_0$ and $H_A$
about the statement we make. For example, in the case of female literacy in Orissa,
there are two possibilities - literacy rate is 51 per cent or it is not 51 per cent; a third
possibility is not there.
In most cases we find a difference between sample mean ($\bar{x}$) and population mean
($\mu$). Is the difference because of sampling fluctuation or is there a genuine difference
between the sample and the population? In order to answer this question we need a
test statistic to test the difference between the two. The result that we obtain by using
the test statistic needs to be interpreted and a decision needs to be taken regarding the
acceptance or rejection of the null hypothesis.
Let us go back to the standard normal curve given at Fig. 17.1. We mentioned that as
the absolute value of z increases, the difference between sample mean ($\bar{x}$) and population mean
($\mu$) increases. Moreover, higher the difference between $\bar{x}$ and $\mu$, higher is the absolute
value of z. Thus z-value measures the discrepancy between $\bar{x}$ and $\mu$, and therefore
can be used as a test statistic for hypothesis testing. Note that we are concerned with
the difference between $\bar{x}$ and $\mu$. Therefore, negative or positive sign of z does not
matter much.
Our task is to find out a critical value of z beyond which the difference between $\bar{x}$ and
$\mu$ is significant. Hence, we take the absolute value of z (denoted by $|z|$) and if it is less
than the critical value we should not reject the null hypothesis. If the absolute value of z
exceeds the critical value we should reject the null hypothesis and accept the alternative
hypothesis.
Therefore, in the case of large samples z can be considered as a test statistic for hypothesis
testing such that
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$ …(17.3)
The above procedure is often called z-test. By applying sample values in the formula
given at (17.3) above we obtain the observed value of z. We compare it with the critical
value of z (to be discussed below).
When the sample size is small, the sampling distribution does not follow normal
distribution. Hence, we cannot apply z-test. In the case of small samples, however, we
apply t-test, which is based on the t-distribution. This distribution is again bell-shaped, but
has a larger variance compared to normal distribution. The test statistic for t-test is given by
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$ …(17.4)
A problem here is that the critical value of t depends upon the ‘degrees of freedom’,
defined as ($n - 1$) where n is the sample size. For example, when sample size is 20,
degrees of freedom is 20 - 1 = 19. Thus the critical value of t varies according to two
factors: i) degrees of freedom, and ii) requisite level of significance.
In Fig. 17.3 we present the type of test to be applied in different situations. Some of
the factors that guide us in deciding on the test statistic to be used are: i) whether
population is normal or not, ii) whether sample size is small or large, and iii) whether
population variance ($\sigma^2$) is known to us or not. You may wonder that since population
mean is not known to us (our objective is to estimate it from the sample), how do we
know population variance! However, we begin with the simpler case of known variance
and later on consider the more realistic case of unknown population variance.
In the case of small samples we have to use t-test and thus critical values need to be
decided on the basis of t-distribution. Application of t-test is a bit complex as we have
to look for the i) degrees of freedom, and ii) the level of significance.
We will work out some examples based on z-test and t-test in the next Section.
As mentioned earlier the convention is to apply 5% or 1% level of significance. For
these two levels of significance we present the critical values of t-distribution for different
degrees of freedom in Table 17.3 at the end of this unit.
Self Check Exercise
4) Distinguish between the following:
a) Null hypothesis and Alternative hypothesis
b) Confidence level and Level of significance
c) Type I and Type II errors
5) Suppose a sample of 100 students has mean age of 12.5 years. Show through a
diagram the critical region at 5 per cent level of significance to test the hypothesis that
the sample mean is equal to the population mean. Assume that population mean and
standard deviation are 10 years and 2 years respectively.
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
17.6 TESTING OF HYPOTHESIS FOR A SINGLE SAMPLE
We have so far explained the concepts of null and alternative hypotheses. Also we
have learnt that in the case of large samples we apply z-test and in the case of small
samples we apply t-test. In many situations we are asked to judge whether a sample is
significantly different from a given population. For example, let us assume that we surveyed
a sample of 400 households of Raigarh district of Chhattisgarh state and calculated the
per capita income of these households. Subsequently, our task is to test the hypothesis
that per capita income calculated from the sample is not different from the per capita
income of the district.
In the above example we can have two different situations: i) population (in this case all
the households of the district) variance is known, ii) population variance is not known
to us. We explain the steps to be followed in each case below.
17.6.1 Population Variance is Known
The steps you should follow are:
1) Specify the null hypothesis.
2) Find out whether it requires one-tail or two-tail test. Accordingly identify your
critical region. This will help in specification of alternative hypothesis.
3) Apply sample values to z-statistic.
4) Find out from z-table the critical value according to level of significance.
5) If the absolute value of z you obtain is lower than the tabulated value do not reject the null hypothesis.
6) If the absolute value of z you obtain is greater than the tabulated value reject the null hypothesis and
accept the alternative hypothesis.
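The steps listed above can be sketched as a small Python function. It is a minimal sketch under stated assumptions, not a prescribed procedure: the name `z_test`, the `tail` argument and the sample values used in the call are hypothetical. The critical value is computed from the standard normal distribution, which gives the same figures as a z-table (about 1.96 at the 5 per cent level and about 2.58 at the 1 per cent level for a two-tail test).

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, pop_mean, sigma, n, alpha=0.05, tail="two"):
    """One-sample z-test with known population standard deviation.

    Returns (observed z, critical z, True if the null hypothesis is rejected).
    """
    z = (sample_mean - pop_mean) / (sigma / sqrt(n))    # step 3
    if tail == "two":                                   # steps 2 and 4
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z) > z_crit
    elif tail == "right":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z > z_crit
    elif tail == "left":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z < -z_crit
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return z, z_crit, reject                            # steps 5 and 6


# Hypothetical data: sample mean 52, claimed mean 50, sigma 10, n = 100.
z, z_crit, reject = z_test(52.0, 50.0, 10.0, 100, alpha=0.05)
print(round(z, 2), round(z_crit, 2), reject)   # 2.0 1.96 True
```

With the same hypothetical data the null hypothesis is not rejected at the 1 per cent level, since 2.0 is smaller than 2.58. The decision therefore depends on the level of significance chosen in advance.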
Example 1
Let us consider the case that we know the per capita income of Raigarh district of
Chhattisgarh as well as its variance. Suppose the data available in official records show
that per capita income of Raigarh district is Rs. 10,000 and standard deviation of per
capita income is Rs. 1,500. However, we did a sample survey of 400 households and
found that their per capita income is Rs. 10,500. Do we accept the data provided in
official records?
In this case $\mu$ = Rs. 10,000
$\sigma$ = Rs. 1,500
$\bar{x}$ = Rs. 10,500
n = 400
The sample size is large and variance of the population is known. As given in Fig.17.3
we apply z-test.
Our null hypothesis in this case is
$H_0: \bar{x} = \mu$
The null hypothesis suggests that sample mean is equal to population mean. In other
words, per capita income obtained from the sample is the same as the data provided in
official records.
Our alternative hypothesis is
$H_A: \bar{x} \neq \mu$
By substituting values in the z-statistic given at (17.3) we obtain
$$z = \frac{10500 - 10000}{1500 / \sqrt{400}} = \frac{500}{75} = 6.67$$
In the above case z = 6.67, which exceeds the critical value of 1.96 at 5 per cent level of
significance (and also 2.58 at 1 per cent level). Hence the sample lies in the critical region
and we reject the null hypothesis. Thus the per capita income obtained from the sample is
significantly different from the per capita income provided in official records.
Example 2
Suppose the voltage generated by certain brand of battery is normally distributed. A
random sample of 100 such batteries was tested and found to have a mean voltage of
1.4 volts. At 0.01 level of significance, does this indicate that these batteries have a
general average voltage that is different from 1.5 volts? Assume that population standard
deviation is 0.21 volts.
Here, $H_0: \mu = 1.5$
Since average voltage of the sample can be different from average voltage of the
population if it is either less than or more than 1.5 volts, our rejection region is on both
sides of the normal curve. Thus it is a case of two-tail test and alternative hypothesis is
$H_A: \mu \neq 1.5$
Substituting the sample values in the z-statistic we obtain
$$z = \frac{1.4 - 1.5}{0.21 / \sqrt{100}} = \frac{-0.1}{0.021} = -4.76$$
From the Table 17.2 we find that the critical value at 1 per cent level of significance is
2.58. Since the absolute value of z (4.76) is greater than 2.58 we reject the null hypothesis at 1%
level and accept the alternative hypothesis that the average voltage of batteries is different
from 1.5 volts.
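The arithmetic of Examples 1 and 2 can be checked with a short Python sketch. The helper name `z_statistic` is an assumption made for illustration; the figures passed to it are those given in the two examples, and the critical values 1.96 and 2.58 are the ones quoted in the text.

```python
from math import sqrt


def z_statistic(sample_mean, pop_mean, sigma, n):
    """Observed z: difference of means divided by the standard error."""
    return (sample_mean - pop_mean) / (sigma / sqrt(n))


# Example 1: per capita income of Raigarh district.
z1 = z_statistic(10500, 10000, 1500, 400)
# Example 2: battery voltage.
z2 = z_statistic(1.4, 1.5, 0.21, 100)

print(round(z1, 2), round(z2, 2))   # 6.67 -4.76

# Two-tail decisions: compare the absolute value with the critical value.
print(abs(z1) > 1.96)   # True, reject the null hypothesis
print(abs(z2) > 2.58)   # True, reject the null hypothesis
```

The negative sign of z in Example 2 only tells us that the sample mean lies below the hypothesised mean; for a two-tail test the decision rests on the absolute value.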
17.6.2 Population Variance not Known
The assumption that population standard deviation ($\sigma$) is known to us is unrealistic, as
we do not know the population mean itself. When $\sigma$ is unknown we have to estimate
it by sample standard deviation (s). In such situations there are two possibilities
depending upon the sample size. If the sample size is large ($n > 30$) we apply z-statistic,
that is,
$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$ …(17.5)
If the sample size is small ($n \leq 30$) we apply t-statistic, that is,
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$ …(17.6)
The steps you should follow are:
1) Specify the null hypothesis and alternative hypothesis.
2) Check whether sample size is large ($n > 30$) or small ($n \leq 30$).
3) In case $n > 30$, apply z-test (17.5).
4) Find out from z-table the critical value according to level of significance ($\alpha$).
5) In case $n \leq 30$, apply t-test (17.6).
6) Find out from t-table the critical value for $n - 1$ degrees of freedom and level of
significance ($\alpha$).
7) If you obtain a value lower than the tabulated value do not reject the null hypothesis.
8) If you obtain a value greater than the tabulated value reject the null hypothesis and
accept the alternative hypothesis.
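The choice between the z-test and the t-test in the steps above can be sketched as follows. The function name `choose_test` and the sample values are hypothetical. Note that the statistic is computed by the same expression in both cases; what changes is the table from which the critical value is read and, for the t-test, the degrees of freedom.

```python
from math import sqrt


def choose_test(sample_mean, pop_mean, s, n):
    """Return (test name, observed statistic, degrees of freedom or None).

    `s` is the sample standard deviation, used as an estimate of sigma.
    """
    statistic = (sample_mean - pop_mean) / (s / sqrt(n))
    if n > 30:
        return "z", statistic, None     # large sample: use the z-table
    return "t", statistic, n - 1        # small sample: t-table with n - 1 df


# Hypothetical samples with the same mean and standard deviation.
print(choose_test(52.0, 50.0, 5.0, 100))   # ('z', 4.0, None)
print(choose_test(52.0, 50.0, 5.0, 16))    # ('t', 1.6, 15)
```

The smaller sample gives a smaller observed statistic because its standard error is larger, and its critical value must be read from the t-table for 15 degrees of freedom.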
Example 3
A tablet is supposed to contain on an average 10 mg. of aspirin. A random sample of
100 tablets show a mean aspirin content of 10.2 mg. with a standard deviation of 1.4
mg. Can you conclude at the 0.05 level of significance that the mean aspirin content is
indeed 10 mg.?
Here, the null hypothesis is $H_0: \mu = 10$
The rejection region is on both sides of 10 mg. Thus it requires a two-tail test
and $H_A: \mu \neq 10$.
Also, the sample mean is $\bar{x} = 10.2$ and the sample size n = 100. Since population standard
deviation is not known we estimate it by sample standard deviation s and our test
statistic is
$$z = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{10.2 - 10}{1.4 / \sqrt{100}} = 1.43$$
At 5 per cent level of significance the critical value of z is 1.96. Since the z value that we
have obtained is less than 1.96, we do not reject the null hypothesis. Therefore the
sample does not give us sufficient evidence to conclude that the mean aspirin content is
different from 10 mg.
Example 4
The population of Haripura district has a mean life expectancy of 60 years. Certain
health care measures are undertaken in the district. Subsequently, a random sample of
25 persons shows an average life expectancy of 60.5 years with a standard deviation
of 2 years. Can we conclude at the 0.05 level of significance that the average life
expectancy in the district has remained the same?
Here, $H_0: \mu = 60$
We have to test for an increase in life expectancy. Thus it is a case of one-tail test and
the rejection region will be on the right-hand tail of the curve.
Hence our alternative hypothesis is
$H_A: \mu > 60$
Here population standard deviation $\sigma$ is not known and we estimate it by the sample
standard deviation s. Here the sample size is small hence we have to apply t-statistic
given at (17.6).
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{60.5 - 60}{2 / \sqrt{25}} = 1.25$$
Since sample size is 25, degrees of freedom are 25 - 1 = 24. From the t-table we find
that for 24 degrees of freedom the critical value of t at 5 per cent level of significance
(one-tail) is 1.71.
Since t-value obtained above is less than the tabulated value we do not reject the
null hypothesis. Therefore, on the basis of this sample we cannot conclude that life
expectancy has changed after the health care measures.
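Example 4 can be reworked in Python as a check on the arithmetic. This is a sketch under stated assumptions: the Python standard library has no t-distribution, so the one-tail 5 per cent critical values below are copied from a standard t-table for a few selected degrees of freedom only, and the names `T_CRIT_ONE_TAIL_5PCT` and `t_statistic` are hypothetical.

```python
from math import sqrt

# One-tail critical values of t at the 5 per cent level, taken from a
# standard t-table (selected degrees of freedom only, for illustration).
T_CRIT_ONE_TAIL_5PCT = {19: 1.729, 24: 1.711, 29: 1.699}


def t_statistic(sample_mean, pop_mean, s, n):
    """Observed t, with the sample standard deviation s estimating sigma."""
    return (sample_mean - pop_mean) / (s / sqrt(n))


# Example 4: life expectancy in Haripura district.
t = t_statistic(60.5, 60, 2, 25)
df = 25 - 1
t_crit = T_CRIT_ONE_TAIL_5PCT[df]

# Right-tail test: reject only if t exceeds the critical value.
reject = t > t_crit
print(round(t, 2), df, t_crit, reject)   # 1.25 24 1.711 False
```

Since the observed value 1.25 falls short of 1.711, the null hypothesis of no change in life expectancy is not rejected at the 5 per cent level.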
Self Check Exercise
6) A report claimed that in the ‘School Leaving Examination’, the average marks
scored in Mathematics were 78 with a standard deviation of 16. However, a
random sample of 37 students showed an average of 84 marks in Mathematics. In
the light of this evidence, can we conclude that the average has remained unchanged?
Use 0.05 level of significance.
7) A passenger car company claims that average fuel efficiency of cars is 35 kms per
litre of petrol. A random sample of 50 cars shows an average of 32 kms per litre
with a standard deviation of 1.2 km. Does this evidence falsify the claim of the
passenger car company at 0.01 level of significance?
8) A random sample of 200 tins of coconut oil gave an average weight of 4.95 kg per
tin with a standard deviation of 0.21 kg. Do we accept the hypothesis of net
weight of 5 kg per tin at 0.01 level of significance?
9) According to a report, the national average annual income of the government
employees during a recent year was Rs. 24,632 with a standard deviation of Rs.
1827. A random sample of 49 government employees during the same year showed
an average annual income of Rs. 25,415. On the evidence of this sample, at 0.05
level of significance, can we conclude that the national average annual income of
government employees was indeed Rs. 24,632?
Note: i) Write your answers in the space given below.
ii) Check your answers with the answers given at the end of the Unit.
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
The alternative hypothesis is the statement that both the population means are different.
In notations
$H_A: \mu_1 \neq \mu_2$ …(17.8)
Population Variance is known
When standard deviations (positive square root of variance) of both the populations
are known we apply z statistic specified as follows:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$ …(17.9)
In (17.9) above, subscript 1 refers to the first sample and subscript 2 refers to the
second sample. By applying relevant data in (17.9) we obtain the actual value of z and
compare it with the tabulated value for specified level of significance.
Example 5:
A bank wants to find out the average savings of its customers in Delhi and Kolkata. A
sample of 250 accounts in Delhi shows an average savings of Rs. 22500 while a sample
of 200 accounts in Kolkata shows an average savings of Rs. 21500. It is known that
standard deviation of savings in Delhi is Rs. 150 and that in Kolkata is Rs. 200. Can
we conclude at 1 percent level of significance that banking pattern of customers in
Delhi and Kolkata is the same?
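In this case the null hypothesis is $H_0: \mu_1 = \mu_2$
$\bar{x}_1 = 22500$, $\sigma_1 = 150$, $n_1 = 250$
$\bar{x}_2 = 21500$, $\sigma_2 = 200$, $n_2 = 200$
Under the null hypothesis $\mu_1 - \mu_2 = 0$, so that by (17.9)
$$z = \frac{22500 - 21500}{\sqrt{\frac{150^2}{250} + \frac{200^2}{200}}} = \frac{1000}{\sqrt{290}} = 58.72$$
We find that at 1 per cent level of significance the critical value obtained from Table
17.1 is 2.58.
Since the actual value is greater than the tabulated value the null hypothesis is rejected
and the alternative hypothesis is accepted. Thus the banking pattern of customers in
Delhi and Kolkata is different.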
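The two-sample z statistic of Example 5 can be worked out with a short Python sketch. The function name `two_sample_z` and its argument names are hypothetical; the data are those of the example, and the difference between population means is set to zero as the null hypothesis requires.

```python
from math import sqrt


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, hypothesised_diff=0.0):
    """Two-sample z statistic with known population standard deviations."""
    # Standard error of the difference between the two sample means.
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return ((mean1 - mean2) - hypothesised_diff) / se


# Example 5: savings of bank customers in Delhi (1) and Kolkata (2).
z = two_sample_z(22500, 21500, 150, 200, 250, 200)
print(round(z, 2))        # 58.72
print(abs(z) > 2.58)      # True, reject the null hypothesis at 1 per cent
```

The observed value is very large because the standard error of the difference (about Rs. 17) is tiny compared with the difference of Rs. 1000 between the two sample means.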
Population Variance is not known
When population variance ($\sigma^2$) is not known we estimate it by sample variance ($s^2$).
If both samples are large in size (n > 30) then we apply z statistic as follows:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$ …(17.10)
On the other hand, if samples are small in size ($n \leq 30$) then we apply t-statistic as
follows:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$ …(17.11)
Degrees of freedom for t-test = $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$
Example 6
A mathematics teacher wants to compare the performance of Class X students in two
sections. She administers the same set of questions to 25 students in Section A and 20
students in Section B. She finds that Section A students have a mean score of 78 marks
with standard deviation of 4 marks while Section B students have a mean score of 75
marks with standard deviation of 5 marks. Is the performance of students in both Sections
different at 1 per cent level of significance?
In this case the null hypothesis is $H_0: \mu_1 = \mu_2$
$\bar{x}_1 = 78$, $s_1 = 4$, $n_1 = 25$
$\bar{x}_2 = 75$, $s_2 = 5$, $n_2 = 20$
Since $\sigma_1$ and $\sigma_2$ are not known and sample sizes are small we apply t-test.
$$t = \frac{78 - 75}{\sqrt{\frac{4^2}{25} + \frac{5^2}{20}}} = \frac{3}{1.37} = 2.18$$
The degree of freedom in this case is 25 + 20 - 2 = 43.
We can find out from Table 17.3 that at the 1 per cent level of significance the t-value
for 43 degrees of freedom is 2.69.
Since the actual value of t (2.18) is less than the tabulated value of t (2.69) we do not
reject the null hypothesis. Therefore, at the 1 per cent level of significance we cannot
conclude that students in Section A and Section B are different with respect to their
performance in mathematics.
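The two-sample t statistic for the mathematics scores can be reproduced with the sketch below. It follows the formula used in this Unit, with each sample variance divided by its own sample size and $n_1 + n_2 - 2$ degrees of freedom; other textbooks use a pooled variance instead, which is not shown here. The function name `two_sample_t` is hypothetical, and the critical value 2.69 is the tabulated figure quoted in the text for 43 degrees of freedom.

```python
from math import sqrt


def two_sample_t(mean1, mean2, s1, s2, n1, n2):
    """Return (observed t, degrees of freedom) for two small samples."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # standard error of the difference
    t = (mean1 - mean2) / se                 # null hypothesis: mu1 - mu2 = 0
    df = n1 + n2 - 2
    return t, df


# Section A: mean 78, s = 4, n = 25. Section B: mean 75, s = 5, n = 20.
t, df = two_sample_t(78, 75, 4, 5, 25, 20)
t_crit = 2.69                                # tabulated value quoted in the text
reject = abs(t) > t_crit

print(round(t, 2), df, reject)   # 2.18 43 False
```

The observed t of 2.18 is below 2.69, so the comparison does not lead to rejection of the null hypothesis at the 1 per cent level.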
In Table 17.3 we have presented the observed frequency for each cell in the table.
What should be the expected frequency when there is no relationship between the
variables under consideration? We will answer this question below.
Expected frequency is calculated under the assumption that there is no relationship
between number of children and occupation of father. For each cell in Table 17.2 the
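expected frequency is obtained by
$$E_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{\text{grand total}}$$
where $E_{ij}$ is expected frequency for row ‘i’ and column ‘j’. For example, for row 2 and
column 2 the expected frequency is
The chi-square statistic is computed as
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$ …(17.3)
Table 17.5: Calculation of $\frac{(O_i - E_i)^2}{E_i}$ for each Cell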
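The calculation of expected frequencies and of the chi-square statistic for a contingency table can be sketched in Python as follows. The 2 x 2 table of observed frequencies used here is hypothetical, since the table of this Unit is not reproduced above, and the function names are assumptions. The sketch relies on the usual rule that the expected frequency of a cell equals its row total multiplied by its column total and divided by the grand total, and on (rows - 1) x (columns - 1) degrees of freedom.

```python
def expected_frequencies(observed):
    """Expected cell frequencies under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom)."""
    expected = expected_frequencies(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical observed frequencies (2 rows x 2 columns).
observed = [[20, 30],
            [30, 20]]

print(expected_frequencies(observed))   # [[25.0, 25.0], [25.0, 25.0]]
print(chi_square(observed))             # (4.0, 1)
```

The observed chi-square is then compared with the tabulated value for the given degrees of freedom and level of significance, exactly as with z and t.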
17.9 SUMMARY
Drawing conclusions about a population on the basis of sample information is called
statistical inference. Here we have basically two things to do: statistical estimation and
hypothesis testing.
An estimate of an unknown parameter could be either a point or an interval. Sample
mean is usually taken as a point estimate of population mean. On the other hand, in
interval estimation we construct two limits (upper and lower) around the sample mean.
We can say with stipulated level of confidence that the population mean, which we do
not know, is likely to remain within the confidence interval. In order to construct confidence
interval we need to know the population variance or its estimate. When we know
population variance, we apply normal distribution to construct the confidence interval.
In cases where population variance is not known, we use t distribution for the above
purpose. Remember that when sample size is large (n>30) t-distribution approximates
normal distribution. Thus for large samples, even if population variance is not known,
we can use normal distribution for estimation of confidence interval on the basis of
sample mean and sample variance.
Subsequently we discussed the methods of testing a hypothesis and drawing conclusions
about the population. Hypothesis is a simple statement (assertion or claim) about the
value assumed by the parameter. We test a hypothesis on the basis of sample information
available to us. In this Unit we considered two situations: i) description of a single
sample, and ii) comparison between two samples.
In the case of qualitative data we cannot have parametric values and hypothesis
testing on the basis of z statistic or t-statistic cannot be performed. Chi-square test is
applied to such situations. Chi-square test is a non-parametric test, where no assumption
about the distribution of the population is required. There are various types of non-parametric
tests besides chi-square test. Moreover, chi-square test can be applied to many situations. We learnt
about a particular application of chi-square test - contingency table. In contingency
table we test the null hypothesis that variables under consideration are independent
against the alternative hypothesis that variables are related. We compare expected
frequency with observed frequency and construct the chi-square statistic. If the
observed value of chi-square exceeds the critical (tabulated) value of chi-square we reject the
null hypothesis.
17.10 ANSWERS TO SELF CHECK EXERCISES
1) Go through the text and define these terms.
3) Since it is small sample and population variance is not given we apply t-statistics
with degrees of freedom 24. The tabulated value of t for a 99 per cent confidence
level (two-tail) is 2.80. The confidence interval is $95 \pm 2.80 \times \frac{4}{\sqrt{25}}$, that is, 92.76 cm to 97.24 cm.
6) Since it is large sample with known variance, we apply z-statistic. The alternative
hypothesis is $H_A: \mu \neq 78$. The observed value of z is 2.28 and critical value of z at
5% level of significance is 1.96. Since the observed value is greater than the
critical value we reject the null hypothesis. Therefore, we conclude that the average
marks were different from 78.
7) It is a large sample with unknown variance. It requires two-tail test. The observed
value of z is 17.68 and critical value of z at 1% level of significance is 2.58. Since
the observed value is greater than the critical value, the null hypothesis is rejected.
8) It is a large sample with unknown variance. We test the null hypothesis with z-
statistic. Observed value of z is 3.37. Null hypothesis is rejected.
The observed value of chi-square statistic is 2.98. Degrees of freedom is 2. The critical
value of chi-square at 5 per cent level of significance at 2 degrees of freedom is 5.99.
Hence null hypothesis is not rejected and soft drink consumption is independent of the
region.
17.11 KEYWORDS
Confidence Level : It gives the percentage (probability) of samples where
the population mean would remain within the confidence
interval around the sample mean. If $\alpha$ is the significance
level the confidence level is ($1 - \alpha$).
Contingency Table : A two-way table to present bivariate data. It is called
contingency table because we try to find whether one
variable is contingent upon the other variable.
Nominal Variable : Such a variable takes qualitative values and does not have
any ordering relationships among them. For example,
gender is a nominal variable taking only the qualitative
values, male and female; there is no ordering in ‘male’
and ‘female’ status. A nominal variable is also called an
attribute.