STRAND: STATISTICS
Unit 7 Chi-Squared
7 Chi-Squared
7.1 The Chi-Squared Test
The chi-squared test is a particularly useful technique for testing whether observed data
are representative of a particular distribution. It is widely used in biology, geography
and psychology.
Worked Example 1
(a) Write down 100 numbers 'at random' (taking values from 0 to 9). Do this without
the use of a calculator, computer or printed random number tables. Draw up a
frequency table to see how many times you wrote down each number. (These will
be called your observed frequencies.)
(b) If your numbers really are random, roughly how many of each do you think there
ought to be? (These are referred to as expected frequencies.)
(c) What model are you using for this distribution of expected frequencies?
Solution
(a) Here is a possible frequency table for this experiment.
Number               0   1   2   3   4   5   6   7   8   9
Observed frequency  11  12   8  14   7   9   9   8  14   8
For analysing data of the sort used in Worked Example 1, where you are comparing
observed with expected values, a chart as shown below is a useful way of writing down
the data.
Number   Observed frequency, O_i   Expected frequency, E_i
0
1
2
3
4
...
7.1 CMM Subject Support Strand: STATISTICS Unit 7 Chi-Squared: Text
Number   Observed, O_i   Expected, E_i   O_i − E_i   (O_i − E_i)²   (O_i − E_i)²/E_i
0        11              10               1            1              0.1
1        12              10               2            4              0.4
2         8              10              −2            4              0.4
3        14              10               4           16              1.6
4         7              10              −3            9              0.9
5         9              10              −1            1              0.1
6         9              10              −1            1              0.1
7         8              10              −2            4              0.4
8        14              10               4           16              1.6
9         8              10              −2            4              0.4
Total                                                                 6.0
For this set of 100 numbers $\sum_i \frac{(O_i - E_i)^2}{E_i} = 6.0$.
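The calculation in the table can be checked with a few lines of code. The following is a minimal sketch; the function name `chi_squared_statistic` is an illustrative choice and not part of the original text, and the expected frequencies are assumed to be equal for all ten digits.

```python
def chi_squared_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed frequencies for the digits 0 to 9 from the worked example.
observed = [11, 12, 8, 14, 7, 9, 9, 8, 14, 8]

# Assumption: every digit is equally likely, so each expected frequency is 100 / 10.
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_squared_statistic(observed, expected)
print(round(statistic, 1))  # 6.0
```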
But what does this measure tell you and how can you decide whether the observed
frequencies are close to the expected frequencies or really quite different from them?
Firstly, consider what might happen if you tried to test some true random numbers from a
random number table.
Would you actually get 10 for each number? The example worked out here did in fact
use 100 random numbers from a table and not a fictitious set made up by someone taking
part in the experiment.
Each time you take a sample of 100 random numbers you will get a slightly different
distribution and it would certainly be surprising to find one with all the observed
frequencies equal to 10. So, in fact, each different sample of 100 true random numbers
will give a different value for $\sum_i \frac{(O_i - E_i)^2}{E_i}$.
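This sample-to-sample variation can be seen in a small simulation. The sketch below uses Python's pseudo-random generator as a stand-in for a random number table; the seed, the number of repetitions and the function name are arbitrary illustrative choices. Over many samples the average of the statistic should settle near 9, the number of classes minus one.

```python
import random


def chi_squared_for_random_digits(rng, n=100):
    """Draw n pseudo-random digits and return sum of (O - E)^2 / E for the counts."""
    counts = [0] * 10
    for _ in range(n):
        counts[rng.randrange(10)] += 1
    expected = n / 10  # uniform model: each digit expected n / 10 times
    return sum((o - expected) ** 2 / expected for o in counts)


rng = random.Random(2024)  # arbitrary seed so that the run is repeatable
values = [chi_squared_for_random_digits(rng) for _ in range(2000)]
mean_value = sum(values) / len(values)

# Smallest, largest and average value of the statistic over the 2000 samples.
print(min(values), max(values), round(mean_value, 2))
```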
The distribution of $\sum_i \frac{(O_i - E_i)^2}{E_i}$ is very close to the theoretical distribution known as
the χ² (chi-squared) distribution.
For any χ² distribution, the number of degrees of freedom shows the number of
independent free choices which can be made in allocating values to the expected
frequencies. In these examples, there are ten expected frequencies (one for each of the
numbers 0 to 9). However, as the total frequency must equal 100, only nine of the
expected frequencies can vary independently and the tenth one must take whatever value
is required to fulfil that 'constraint'. To calculate the number of degrees of freedom, use
υ = number of classes or groups − number of constraints
Here there are ten classes and one constraint so
υ = 10 − 1 = 9
The shape of the χ² distribution is different for each value of υ and the function is very
complicated. The mean of χ²_υ is υ and the variance is 2υ. The distribution is positively
skewed except for large values of υ for which it becomes approximately symmetrical.
[Figure: χ² distribution curves for υ = 5, υ = 7 and υ = 9; horizontal axis from 0 to 18]
A high value of χ² implies a poor fit between the observed and expected frequencies, so
the right hand end of the distribution is used for most hypothesis testing.
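The area in the right hand tail beyond a given value can be computed directly when υ is a whole number. The sketch below is one possible way of doing this with the standard library only; it relies on the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(−z) / Γ(a + 1) for the upper incomplete gamma function, which is a standard result and not something taken from this text. The function name is hypothetical.

```python
import math


def chi_squared_upper_tail(x, df):
    """P(X > x) for a chi-squared variable with a positive whole number df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    z = x / 2.0
    # Closed-form starting values: df = 1 uses erfc, df = 2 uses exp.
    if df % 2 == 1:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    else:
        a, tail = 1.0, math.exp(-z)
    # Step a up to df / 2 with Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Gamma(a + 1).
    while a < df / 2.0:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# Tail area beyond 16.919 and beyond 6.0, both with 9 degrees of freedom.
p_critical = chi_squared_upper_tail(16.919, 9)
p_observed = chi_squared_upper_tail(6.0, 9)
print(round(p_critical, 3), round(p_observed, 2))
```

The first value should come out very close to 0.05, which is what makes 16.919 the 5% critical value for υ = 9; the second is much larger, so a statistic of 6.0 is not unusual for truly random digits.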
7.2
Hence with the data from Worked Example 1 in Section 7.1, χ² = 6; this is considerably
less than the 5% critical value, 16.919.
A summary of the critical values for χ² at 5% and 1% is given below for degrees of freedom
υ = 1, 2, ..., 10.
Degrees of freedom, υ    5%        1%
1                         3.841     6.635
2                         5.991     9.210
3                         7.815    11.345
4                         9.488    13.277
5                        11.070    15.086
6                        12.592    16.812
7                        14.067    18.475
8                        15.507    20.090
9                        16.919    21.666
10                       18.307    23.209
Worked Example 1
Nadir is testing an octahedral die to see if it is unbiased. The results are given in the
table below.
Score       1   2   3   4   5   6   7   8
Frequency   7  10  11   9  12  10  14   7
Solution
Using χ², the number of degrees of freedom is 8 − 1 = 7, so at the 5% significance level
the critical value of χ² is 14.07. As before, a table of values is drawn up, the expected
frequencies being based on a uniform distribution which gives
frequency for each result = $\frac{1}{8}(7 + 10 + 11 + 9 + 12 + 10 + 14 + 7) = 10$
O_i   E_i   O_i − E_i   (O_i − E_i)²   (O_i − E_i)²/E_i
 7    10    −3            9              0.9
10    10     0            0              0
11    10     1            1              0.1
 9    10    −1            1              0.1
12    10     2            4              0.4
10    10     0            0              0
14    10     4           16              1.6
 7    10    −3            9              0.9
Total                                    4.0
The calculated value of χ² is 4.0. This is well below the critical value, so Nadir could
conclude that there is no significant evidence of bias; the results are consistent with the
hypothesis that the die is fair.
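The whole procedure for testing against a uniform distribution can be put together as follows. This is a sketch under stated assumptions: the only constraint is the fixed total (so υ is the number of classes minus one), the 5% critical values are copied from the table in this section, and the function name `uniform_fit_test` is hypothetical.

```python
# 5% critical values copied from the table in this section, keyed by degrees of freedom.
CRITICAL_5_PERCENT = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
    6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307,
}


def uniform_fit_test(observed):
    """Test observed counts against a uniform model at the 5% level.

    Returns (statistic, degrees_of_freedom, critical_value, reject_uniform).
    """
    classes = len(observed)
    expected = sum(observed) / classes      # same expected count for every class
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = classes - 1                        # one constraint: the total is fixed
    critical = CRITICAL_5_PERCENT[df]
    return statistic, df, critical, statistic > critical


# Scores 1 to 8 on the octahedral die from the worked example.
statistic, df, critical, reject = uniform_fit_test([7, 10, 11, 9, 12, 10, 14, 7])
print(round(statistic, 1), df, critical, reject)  # 4.0 7 14.067 False
```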
Exercises
1. Nicki made a tetrahedral die using card and then tested it to see whether it was fair.
She got the following scores:
Score 1 2 3 4
Frequency 12 15 19 22
2. Joe has a die which has faces numbered from 1 to 6. He got the following scores:
Score 1 2 3 4 5 6
Frequency 17 20 29 20 18 16
3. The table below shows the number of pupils absent on particular days in the week.
Day M Tu W Th F
Number 125 88 85 94 108
4. Over a long period of time, a research team monitored the number of car accidents
which occurred in a particular county. The following table summarises the data
relating to the day of the week on which the accident occurred.
Day                   M   T   W   Th  F   Sa  Su
Number of accidents   60  54  48  53  53  75  77
Investigate the hypothesis that these data are a random sample from a uniform
distribution.
7.3
2 × 2 contingency tables
The method of approach is illustrated in the example below.
Worked Example 1
Some years ago a University decided to require all entrants to a science course to study a
non-science subject for one year. In the first year all of the scheme entrants were given
the choice of studying French or Russian. The numbers of students of each gender
choosing each language are shown in the following table.
French Russian
Male 39 16
Female 21 14
Use a χ² test (including Yates' correction*) at the 5% significance level to test whether
choice of language is independent of gender.
* Yates' correction uses $\chi^2 = \sum_i \frac{(|O_i - E_i| - 0.5)^2}{E_i}$ rather than $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$.
Solution
The observed frequencies are given in the 2 × 2 contingency table.
          French   Russian   Total
Male      39       16        55
Female    21       14        35
Total     60       30        90
Assuming the null hypothesis H0, that choice of language is independent of gender, you
need to calculate the expected frequencies. For example,
P (student is male) = $\frac{55}{90}$
P (student chooses French) = $\frac{60}{90}$
Since these two events are independent under H0,
P (student is male and chooses French) = $\frac{55}{90} \times \frac{60}{90}$
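and, since there are 90 students,
expected frequency (for male and French) = $\frac{55}{90} \times \frac{60}{90} \times 90 = \frac{55 \times 60}{90} = 36.67$
There is no need to go through this procedure each time since it can be calculated directly
from
expected frequency = (row total × column total) / grand total
The row and column totals can be used to find the other expected values. For example,
Expected frequency (for female and French) = 60 − 36.67 = 23.33
The expected frequencies are therefore
          French   Russian   Total
Male      36.67    18.33     55
Female    23.33    11.67     35
Total     60       30        90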
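The expected frequencies for every cell can be generated from the row and column totals in one step. A minimal sketch follows; the function name `expected_frequencies` and the list-of-rows layout are illustrative assumptions.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: Male, Female. Columns: French, Russian.
observed = [[39, 16], [21, 14]]
expected = expected_frequencies(observed)
rounded = [[round(e, 2) for e in row] for row in expected]
print(rounded)  # [[36.67, 18.33], [23.33, 11.67]]
```

Each row and column of the expected table has the same total as the observed table, which is why only one cell of a 2 × 2 table needs to be calculated from scratch.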
Since there is only one expected frequency needed in order to find the rest, the number of
degrees of freedom is υ = 1.
But, for υ = 1, you have to use Yates' continuity correction which evaluates
$\chi^2_{\text{calc}} = \sum_{i=1}^{4} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$
Now
O_i   E_i     |O_i − E_i|   (|O_i − E_i| − 0.5)²/E_i
39    36.67   2.33          0.091
16    18.33   2.33          0.183
21    23.33   2.33          0.144
14    11.67   2.33          0.287
Total                       0.705
So χ²_calc = 0.705 < 3.841, the critical χ² value for υ = 1 at the 5% level. Hence you can
conclude that there is no evidence to reject H0; i.e. the data are consistent with choice of
language and gender being independent.
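The Yates-corrected statistic for a 2 × 2 table can be sketched as below. The function name is hypothetical. Because the code works with unrounded expected frequencies, it gives about 0.707, whereas the four rounded contributions in the table add up to 0.705; either way the value is far below 3.841, the 5% critical value for υ = 1.

```python
def yates_chi_squared(table):
    """Yates-corrected chi-squared statistic for a 2 x 2 table of counts."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("Yates' correction is applied here to 2 x 2 tables only")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            # Reduce each absolute difference by 0.5 before squaring.
            statistic += (abs(o - e) - 0.5) ** 2 / e
    return statistic


# Rows: Male, Female. Columns: French, Russian.
statistic = yates_chi_squared([[39, 16], [21, 14]])
reject_h0 = statistic > 3.841  # 5% critical value for 1 degree of freedom
print(round(statistic, 3), reject_h0)
```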
Worked Example 2
Following the example above, the choice of non-science subjects has now been widened
and the current figures are as follows
French Poetry Russian Sculpture
Male 2 8 15 10
Female 10 17 21 37
Use a χ² test at the 5% significance level to test whether choice of subject is independent
of sex. In applying the test you should combine French with another subject. Explain
why this is necessary and the reasons for your choice of subject.
Solution
This is a 2 × 4 contingency table of observed values.
French Poetry Russian Sculpture Total
Male 2 8 15 10 35
Female 10 17 21 37 85
Total 12 25 36 47 120
The expected frequency for 'male' and 'French' is
$\frac{12 \times 35}{120} = 3.5$
Since this is less than 5, French should be combined with another subject, and the obvious
choice is Russian since this is also a language.
Combining the French and Russian together gives
Fr / Rus Poetry Sculpture Total
Male 17 8 10 35
Female 31 17 37 85
Total 48 25 47 120
The number of degrees of freedom is 2, since determining just 2 expected values will be
sufficient to find the rest.
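The check for small expected frequencies and the merging of two columns can be sketched as follows. The threshold of 5 is the one used in this example; the function names and the column indexing (French = 0, Poetry = 1, Russian = 2, Sculpture = 3) are illustrative assumptions.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def small_expected_cells(table, threshold=5):
    """Return (row, column) positions whose expected frequency is below threshold."""
    expected = expected_frequencies(table)
    return [(i, j) for i, row in enumerate(expected)
            for j, e in enumerate(row) if e < threshold]


def merge_columns(table, keep, drop):
    """Return a new table with column `drop` added into column `keep`."""
    merged = []
    for row in table:
        new_row = [v for k, v in enumerate(row) if k != drop]
        # Removing `drop` shifts later columns one place to the left.
        new_row[keep if keep < drop else keep - 1] += row[drop]
        merged.append(new_row)
    return merged


# Rows: Male, Female. Columns: French, Poetry, Russian, Sculpture.
observed = [[2, 8, 15, 10], [10, 17, 21, 37]]
print(small_expected_cells(observed))    # [(0, 0)]  male and French, expected 3.5

combined = merge_columns(observed, keep=0, drop=2)  # add Russian into French
print(combined)                          # [[17, 8, 10], [31, 17, 37]]
print(small_expected_cells(combined))    # []
```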
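Note that, in general, for an h × k contingency table
υ = (h − 1)(k − 1)
The expected frequency for 'male' and 'Fr / Rus' is
$\frac{35 \times 48}{120} = 14.00$
and for 'male' and 'poetry' is
$\frac{35 \times 25}{120} = 7.29$
The rest of the values can now be calculated from the row and column totals to give the
following expected frequencies
          Fr / Rus   Poetry   Sculpture   Total
Male      14.00       7.29    13.71        35
Female    34.00      17.71    33.29        85
Total     48         25       47          120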
O_i   E_i     |O_i − E_i|   (O_i − E_i)²/E_i
17    14.00   3.00          0.643
 8     7.29   0.71          0.069
10    13.71   3.71          1.004
31    34.00   3.00          0.265
17    17.71   0.71          0.028
37    33.29   3.71          0.413
Total                       2.422
χ²_calc = 2.422 < 5.99, the critical value. So you conclude again that there is no significant
evidence of dependence between sex and choice of subject.
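The full test for a larger table can be sketched in the same way. The code below assumes the standard rule υ = (h − 1)(k − 1) for an h × k table, applies no Yates' correction (the table is larger than 2 × 2), and uses 5% critical values for υ = 1 to 4 from the table in Section 7.2. The function name is hypothetical. With unrounded expected frequencies the statistic is about 2.421 rather than 2.422; the difference is rounding only.

```python
# 5% critical values for small degrees of freedom, from the table in Section 7.2.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def independence_test(table):
    """Chi-squared test of independence for an h x k table at the 5% level.

    Returns (statistic, degrees_of_freedom, critical_value, reject_independence).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(col_totals) - 1)
    critical = CRITICAL_5_PERCENT[df]
    return statistic, df, critical, statistic > critical


# Rows: Male, Female. Columns: Fr / Rus, Poetry, Sculpture.
statistic, df, critical, reject = independence_test([[17, 8, 10], [31, 17, 37]])
print(round(statistic, 3), df, critical, reject)
```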
Exercises
1. In a survey on transport, electors from three different areas of a large city
were asked whether they would prefer money to be spent on general road
improvement or on improving public transport. The replies are shown in the
following contingency table.
                               Area
                               A    B    C
Road improvement preferred     78   46   24
Use a χ² test at the 1% significance level to test whether the proportion favouring
expenditure on general road improvement is independent of the area.
As part of the same investigation, the following table was constructed showing the
reason for the patients' visits to the surgery, again categorised by social class.
Social Class
Reason I II III IV V
Minor physical 10 21 98 91 27
Major physical 7 17 49 40 15
Mental & other 11 25 41 42 6
Is there significant evidence to conclude that the reason for the patients' visits to the
surgery is independent of their social class?
Use a 5% level of significance.
Give an interpretation of your results.
3. (a) The numbers of books borrowed from a library during a certain week were
518 on Monday, 431 on Tuesday, 485 on Wednesday, 443 on Thursday and
523 on Friday.
Is there any evidence that the number of books borrowed varies between the
five days of the week? Use a 1% level of significance.
Interpret fully your conclusions.
(b) Analysis of the rate of turnover of employees by a personnel manager
produced the following table showing the length of stay of 200 people who
left the company for other employment.
              Length of employment (years)
Grade         0-2   2-5   >5
Managerial     4    11     6
Skilled       32    28    21
Unskilled     25    23    50
Using a 1% level of significance, analyse this information and state fully the
conclusions from your analysis.
Bad mark 4 11 3 12
(a) Test at the 5% significance level whether the mark obtained (by the students
who attempted the question) is associated with the type of question.
(b) Under some circumstances it is necessary to combine classes in order to
carry out a test. If it had been necessary to combine the binomial fit
question with any other question, which question would you have combined
it with and why?
(c) Given that a total of 30 students sat the paper, test, at the 5% significance
level, whether the number of students attempting a particular question is
associated with the type of question.
(d) Compare the difficulty and popularity of the different types of question in
the light of your answers to (a) and (b).