Unit 10 - Chi-Square Test
10.1 Introduction
In the previous unit, on testing of hypothesis, we discussed how to test
hypotheses concerning parameters such as the mean and the proportion, using data
from either one or two samples. We used one-sample tests to determine
whether a mean or a proportion was significantly different from a
hypothesised value. In the two-sample tests, we examined the difference
between either two means or two proportions, and we tried to learn whether
this difference was significant.
Suppose, however, that we have proportions from five populations instead of
only two. In such cases, the methods described for comparing proportions
from two samples do not apply; we must use the Chi-Square test
($\chi^2$ test). In this unit, we will discuss the Chi-Square tests, which
enable us to test whether more than two population proportions can be
considered equal. The Chi-Square test is a non-parametric test which can be
applied to categorical or qualitative data. This test can be applied when we
have few or no assumptions about the population parameter.
Actually, Chi-Square tests allow us to do a lot more than just test for the
equality of several proportions. If we classify a population into several
categories with respect to two attributes (such as age and job performance),
we can then use a Chi-Square test to determine whether the two attributes
are independent of each other. So, Chi-Square tests can be applied on a
contingency table.
Objectives:
After studying this unit, you should be able to:
describe the non-parametric method of testing hypotheses
describe the Chi-Square characteristics
identify the conditions required for applying Chi-Square test for a given
population distribution
recognise the applications of Chi-Square test
describe the steps in solving problems related to Chi-Square test
10.1.1 Relevance
Case-let
Women still earn less than men
On 27 February 2006, the Women and Work Commission (WWC) published
its report on the causes of the “gender pay gap”, or the difference between
men’s and women’s hourly pay. According to the report, British women
working full-time currently earn 17% less per hour than men. In February the
European Commission also brought out its own report on the pay gap across
the European Union. Its findings were similar in that, on an hourly basis,
women earn 15% less than men for the same work.
In the United States, the difference in median pay between men and women
is around 20%. According to the WWC report the gender pay gap opens
early. Boys and girls study different subjects in school, and boys’ subjects
lead to more lucrative careers. They then work in different sorts of jobs. As a
result, average hourly pay for a woman at the start of her working life is only
91% of a man’s, even though nowadays she is probably better qualified.
How do we compile this type of statistical information? We can use Chi-
Square testing for more than one type of population.
(Source: Derek L Waller Published by Elsevier Inc Ed 2008).
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \cdots + \frac{(O_n - E_n)^2}{E_n}$$
Where, O1, O2, O3….On are the observed frequencies and E1, E2, E3…En
are the corresponding expected or theoretical frequencies.
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where, ‘Oi’ is the observed frequency and ‘Ei’ is the expected frequency.
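The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_statistic` and the die-rolling frequencies are assumptions made for the example, not data from this unit.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, so 10 rolls are expected per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

stat = chi_square_statistic(observed, expected)
print(stat)
```

Each category contributes its squared deviation scaled by the expected frequency, so a large deviation in a category with a small expected frequency weighs more heavily than the same deviation in a category with a large one.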
Key Statistic
The observed frequencies are the frequencies obtained from the
observation, which are sample frequencies. The expected frequencies
are the calculated frequencies.
10.2.2 Steps in solving problems related to Chi-Square test
Figure 10.1 depicts the steps required for solving the problems related to
Chi-Square test.
Key Statistic
The results of the Chi-Square test may not be accurate if the expected cell
frequencies in a contingency table are less than 5.
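This rule of thumb is commonly checked against the expected frequencies before the test is run. A minimal sketch under that reading of the rule; the helper name `cells_below_threshold` and the table values are hypothetical:

```python
def cells_below_threshold(expected_table, threshold=5):
    """Return (row, column, value) for every expected frequency below threshold."""
    return [
        (r, c, value)
        for r, row in enumerate(expected_table)
        for c, value in enumerate(row)
        if value < threshold
    ]


# Hypothetical expected frequencies for a 2 x 3 contingency table.
expected_table = [
    [12.0, 4.5, 8.5],
    [18.0, 6.5, 3.0],
]

flagged = cells_below_threshold(expected_table)
print(flagged)
```

When cells are flagged, a usual remedy is to pool neighbouring categories until every expected frequency reaches the threshold.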
10.2.5 Practical applications of Chi-Square test
In inferential statistics, the Chi-Square test can also be applied to
discrete distributions. In using Chi-Square test, we need no assumptions
regarding the shape of sampling distributions. The applications of Chi-
Square test include testing:
Example 1
For example, if we are asked to write any four numbers, all four numbers
are of our own choice. If a restriction is imposed that the sum of these
numbers should be 50, then we are free to choose only three of them,
because the fourth is fixed by the sum. The degrees of freedom would
now be 3.
Key Statistic
The Chi-Square curve lies on the non-negative side of the x-axis because
Chi-Square values are never negative.
$$\text{d.f.} = (\text{Number of rows} - 1) \times (\text{Number of columns} - 1) = (3 - 1) \times (2 - 1) = 2$$
Hence, a contingency table with three rows and two columns has two
degrees of freedom.
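Both ideas of degrees of freedom used above, the restricted choice of numbers and the contingency table rule, can be checked with a short sketch. The function names and the three freely chosen numbers are assumptions made for illustration.

```python
def free_choices(count, constraints):
    """Degrees of freedom left when `constraints` restrictions bind `count` values."""
    return count - constraints


def contingency_df(rows, columns):
    """Degrees of freedom of a contingency table: (rows - 1) * (columns - 1)."""
    return (rows - 1) * (columns - 1)


# Four numbers whose sum must be 50: three are chosen freely
# (hypothetical choices), and the fourth is then forced.
chosen = [10, 15, 5]
fourth = 50 - sum(chosen)

print(free_choices(4, 1))    # four numbers, one restriction
print(fourth)                # the only value that keeps the sum at 50
print(contingency_df(3, 2))  # three rows and two columns
```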
Solved Problem 2
Table 10.1 depicts the production in three shifts and the number of defective
goods that turned out in three weeks. Test at 5% level of significance
whether weeks and shifts are independent.
Table 10.1: Production of Defective Goods in Three Shifts
Shift    Week 1    Week 2    Week 3    Total
I        15        5         20        40
II       20        10        20        50
III      25        15        20        60
Total    60        30        60        150
Solution: Table 10.1a depicts the observed and expected values required
to calculate $\chi^2$.
Table 10.1a: Observed and Expected Values
Expected value: $E_i = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$
$O_i$    $E_i$    $(O_i - E_i)^2$    $\frac{(O_i - E_i)^2}{E_i}$
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
$\chi^2_{cal} = 3.6459$
4. Conclusion: Since $\chi^2_{cal}$ (3.6459) < $\chi^2_{tab}$ (9.49), ‘Ho’ is accepted. Hence,
the attributes ‘week’ and ‘shifts’ are independent.
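The calculation for solved problem 2 can be reproduced with a short sketch. It takes the observed counts from Table 10.1, builds each expected value as (row total × column total) / grand total, and sums the contributions. The function name `independence_test` is an assumption, and the tabulated value 9.49 is the one quoted in the text rather than computed.

```python
def independence_test(table):
    """Return (chi-square statistic, degrees of freedom, expected table)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # E = (row total * column total) / grand total for every cell.
    expected = [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Observed defective goods from Table 10.1
# (rows: shifts I to III, columns: weeks 1 to 3).
observed = [
    [15, 5, 20],
    [20, 10, 20],
    [25, 15, 20],
]
chi2_cal, df, expected = independence_test(observed)
chi2_tab = 9.49  # tabulated value quoted in the text for 4 d.f. at the 5% level

print(expected)
print(round(chi2_cal, 4), df)
print("reject Ho" if chi2_cal > chi2_tab else "do not reject Ho")
```

Carried to full precision, the statistic differs from the value in the text only in the fourth decimal place, which does not affect the conclusion.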
Solved Problem 3
Out of 1000 people surveyed, 600 belonged to urban areas and rest to rural
areas. Among 500 who visited other states, 400 belonged to urban areas.
Test at 5% level of significance whether area and visiting other states are
dependent.
Solution: Table 10.2 depicts the information given in solved problem 3 in a
tabulated form.
Table 10.2: People Belonging to Urban and Rural Areas
Other States Urban Rural Total
Visited 400 100 500
Not Visited 200 300 500
Total 600 400 1000
Table 10.2a depicts the observed and expected values for the calculation of $\chi^2$.
Table 10.2a: Observed and Expected Values
Expected value: $E_i = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$
$O_i$    $E_i$    $(O_i - E_i)^2$    $\frac{(O_i - E_i)^2}{E_i}$
400      300      10000              33.33
200      300      10000              33.33
100      200      10000              50.00
300      200      10000              50.00
$\chi^2_{cal} = 166.66$
$\chi^2_{tab} = 3.84$
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
$\chi^2_{cal} = 166.66$
4. Conclusion: Since $\chi^2_{cal}$ (166.66) > $\chi^2_{tab}$ (3.84), ‘Ho’ is rejected. Hence, the
‘area’ and ‘visit’ are dependent.
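For a 2 × 2 table such as Table 10.2, the statistic can be cross-checked with a standard shortcut, $\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$, where a, b, c and d are the four cell counts and N is their sum. This shortcut is not part of the method shown in this unit; it is offered only as an independent check, and the function name is an assumption.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of the 2 x 2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Table 10.2: visited (urban 400, rural 100), not visited (urban 200, rural 300).
chi2_cal = chi_square_2x2(400, 100, 200, 300)
chi2_tab = 3.84  # tabulated value quoted in the text for 1 d.f. at the 5% level

print(round(chi2_cal, 2))
print("reject Ho" if chi2_cal > chi2_tab else "do not reject Ho")
```

The shortcut agrees with the cell-by-cell total in Table 10.2a up to the rounding of the individual terms.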
10.3.2 Test of goodness of fit
The test of goodness of fit of a statistical model measures how accurately
the model fits a set of observations. This test measures and summarises the
differences, if any, between the observed values and the values expected
under the considered statistical model. These test results help us to know
whether the sample is drawn from the hypothesised distribution or not. The
degrees of freedom are ‘n-1’, where ‘n’ is the number of categories. When the
hypothesised distribution is uniform, the expected value for each category
is equal to the average of the observed values.
Solved Problem 4
A personnel manager is interested in trying to determine whether
absenteeism is greater on one day of the week than on another day of the
week. The record for the past years is available. Table 10.3 depicts the
absenteeism for each working day over a week. Test whether absenteeism
is uniformly distributed over the week.
Table 10.3: Comparison of Data about Absenteeism
Days of Week           Monday    Tuesday    Wednesday    Thursday    Friday
Number of absentees    66        57         54           48          75
$$E_i = \frac{66 + 57 + 54 + 48 + 75}{5} = 60$$
Table 10.3a depicts the calculated expected values required for the
calculation of $\chi^2$ for the data related to problem 4.
Table 10.3a: Observed and Expected Values for Calculation of $\chi^2$
$\chi^2_{tab} = 9.49$
4. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
$\chi^2_{cal} = 7.50$
5. Conclusion: Since $\chi^2_{cal}$ (7.5) < $\chi^2_{tab}$ (9.49), ‘Ho’ is accepted. In other
words, we conclude at 5% level of significance that absenteeism is
uniformly distributed and is independent of the days of the week.
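The goodness-of-fit calculation for solved problem 4 can be sketched as follows, using the counts from Table 10.3. As an addition that is not in the text, the sketch also evaluates the upper-tail probability (p-value) with the closed form that holds for exactly 4 degrees of freedom, $P(\chi^2 > x) = e^{-x/2}(1 + x/2)$.

```python
import math

observed = [66, 57, 54, 48, 75]                 # Monday to Friday, from Table 10.3
expected_each = sum(observed) / len(observed)   # uniform hypothesis: equal share per day

chi2_cal = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = len(observed) - 1

# Upper-tail probability of a chi-square variable with 4 d.f.
# This closed form is valid for 4 d.f. only, not for other values of d.f.
p_value = math.exp(-chi2_cal / 2) * (1 + chi2_cal / 2)

print(expected_each, chi2_cal, df)
print("reject Ho" if p_value < 0.05 else "do not reject Ho")
```

Comparing the p-value with 0.05 leads to the same decision as comparing the calculated statistic with the tabulated value 9.49.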
Solved Problem 5
According to a theory in Genetics, the proportion of beans of A, B, C and D
types in a generation should be 9:3:3:1. In an experiment with 1600 beans,
the frequency of bean of A, B, C and D type was observed to be 882, 313,
287 and 118 respectively. Does the result support the theory?
Solution: The steps for calculation of Chi-Square are described as follows:
1. Null hypothesis ‘Ho’: The result supports theory
Alternate hypothesis ‘H1’: The result does not support theory
2. Level of significance is 5% and degrees of freedom (d.f.) = (4 – 1) = 3
$\chi^2_{tab} = 7.81$
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Table 10.4 depicts the observed and expected values for calculation of $\chi^2$
for solved problem 5.
Table 10.4: Observed and Expected Values for Calculation of $\chi^2$
$\chi^2_{cal} = 4.73$
4. Conclusion: Since $\chi^2_{cal}$ (4.73) < $\chi^2_{tab}$ (7.81), ‘Ho’ is accepted. Therefore,
the result supports the theory.
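When the theory specifies a ratio rather than equal shares, the expected frequencies are obtained by splitting the total in that ratio. A minimal sketch for solved problem 5, using exact fractions for the expected values; the variable names are assumptions, and 7.81 is the tabulated value quoted in the text.

```python
from fractions import Fraction

ratio = [9, 3, 3, 1]              # theoretical proportions of types A, B, C, D
observed = [882, 313, 287, 118]   # observed frequencies from the experiment
total = sum(observed)

# Expected frequency of each type = (its share of the ratio) * total beans.
expected = [Fraction(r, sum(ratio)) * total for r in ratio]

chi2_cal = float(sum((o - e) ** 2 / e for o, e in zip(observed, expected)))
chi2_tab = 7.81  # tabulated value quoted in the text for 3 d.f. at the 5% level

print([int(e) for e in expected])
print(round(chi2_cal, 4))
print("theory supported" if chi2_cal < chi2_tab else "theory not supported")
```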
Solved problem 6
The following table gives the classification of 100 workers according to
gender and the nature of work. Test whether nature of work is independent
of the gender of the worker.
Table 10.5
Skilled Unskilled Total
Males 40 20 60
Females 10 30 40
Total 50 50 100
$\chi^2_{tab} = 3.84$
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Table 10.5a depicts the observed and expected values for calculation of $\chi^2$
for solved problem 6.
Table 10.5a: Observed and Expected Values for Calculation of $\chi^2$
$\chi^2_{cal} = 16.666$
4. Conclusion: Since $\chi^2_{cal}$ (16.666) > $\chi^2_{tab}$ (3.84), ‘Ho’ is rejected. Therefore,
the null hypothesis that gender and nature of work are independent is
rejected.
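The figures for solved problem 6 can be reproduced from Table 10.5 as sketched below. The calculated statistic lies well above the tabulated value of 3.84, so the hypothesis of independence is rejected. As an addition not found in the text, the sketch also computes the p-value, using the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

observed = [
    [40, 20],  # males: skilled, unskilled
    [10, 30],  # females: skilled, unskilled
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2_cal = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
        chi2_cal += (o - e) ** 2 / e

# With 1 d.f., P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_cal / 2))

print(round(chi2_cal, 3))
print("reject Ho" if p_value < 0.05 else "do not reject Ho")
```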
10.3.3 Test for comparing variance
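When we have to use $\chi^2$ as a test of population variance, then,
$H_0: \sigma_s^2 = \sigma_p^2$ and $H_A: \sigma_s^2 \neq \sigma_p^2$
$$\chi^2 = \frac{\sigma_s^2}{\sigma_p^2}(n - 1)$$
where $\sigma_s^2$ is the variance of the sample, $\sigma_p^2$ is the specified
variance of the population and ‘n’ is the sample size.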
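A minimal sketch of the variance test statistic, reading the formula above as (n − 1) times the sample variance divided by the specified population variance, with n − 1 degrees of freedom. The function name and all figures are hypothetical, chosen only to show the arithmetic.

```python
def variance_chi_square(sample_variance, population_variance, n):
    """Chi-square statistic for a test of a specified variance, with n - 1 d.f."""
    statistic = (n - 1) * sample_variance / population_variance
    return statistic, n - 1


# Hypothetical figures: a sample of 25 items with variance 36,
# tested against a specified population variance of 25.
chi2_cal, df = variance_chi_square(36.0, 25.0, 25)
print(round(chi2_cal, 2), df)
```

The calculated value is then compared with the tabulated Chi-Square value for n − 1 degrees of freedom at the chosen level of significance.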
Activity
Objective Questions:
1. What is the appropriate test to use if you want to determine whether
there is evidence that the proportion of successes is higher in group 1
than in group 2 and we have obtained independent samples from the
two groups?
i) The Z test
ii) The Chi-Square test
iii) Both of the above
iv) None of the above
2. Which of the following values cannot occur in a Chi-Square
distribution?
i) 100.0
ii) 38.4
iii) 0.61
iv) -2.45
3. What test would you use to determine whether a set of observed
frequencies differ from their corresponding expected frequencies?
i) The t test for dependent samples
ii) The Chi-Square test
iii) The t test for independent samples
iv) The F test
4. When using the chi-square test for differences in two proportions with
a contingency table that has r rows and c columns, how many degrees
of freedom will the test statistic have?
i) n – 1
ii) n1 + n2 - 2
iii) (r - 1) x (c - 1)
iv) (r - 1) + (c – 1)
5. When testing for independence in a contingency table with 3 rows
and 4 columns, how many degrees of freedom will the test statistic
have?
i) 5
ii) 6
iii) 7
iv) 12
10.4 Summary
Let us recapitulate the important concepts discussed in this unit:
Chi-Square test is a non-parametric test. The important applications of
Chi-Square test are the tests for independence of attributes, the test of
goodness of fit and the test for specified variance.
$\chi^2$ describes the magnitude of discrepancy between the observed and the
expected frequencies. The value of $\chi^2$ is calculated as:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \cdots + \frac{(O_n - E_n)^2}{E_n}$$
Where, O1, O2, O3….On are the observed frequencies and E1, E2,
E3…En are the corresponding expected or theoretical frequencies.
An important criterion for applying the Chi-Square test is that the sample
size should be very large.
10.5 Glossary
Chi-Square test: It is a non-parametric test in which no rigid assumptions
about the population parameters are required.
Level of significance: The probability of rejecting the null hypothesis
when it is true (type I error), fixed before the test at a value such as
0.05 (5%). The null hypothesis is rejected in favour of the alternative if
the P-value of the test is less than the level of significance. The P-value
is the chance of getting a sample at least as extreme as the one being
analysed if the null hypothesis were true. A small P-value implies that
getting such a sample was highly unlikely, suggesting that the null
hypothesis is probably not true.
10.7 Answers
Terminal Questions
Discussion Questions:
i. Indicate the appropriate null and alternative hypothesis to test if the
make of automobile purchased is dependent on an individual’s
nationality?
ii. Using the critical value approach of the Chi-Square test at a 1%
significance level, does it appear that there is a relationship between
automobile purchase and nationality?
iii. Verify the result of question (ii) by using the p-value approach of the
Chi-Square test.
iv. What has to be the significance level in order that there appears a
breakeven situation between dependency of nationality and
automobile preference?
v. What is your comment about the results?