CMM Subject Support Strand: STATISTICS Unit 7 Chi-Squared: Text

STRAND: STATISTICS
Unit 7 Chi-Squared

TEXT

Contents

Section

7.1 The Chi-Squared Test

7.2 Significance Testing

7.3 Contingency Tables



7 Chi-Squared
7.1 The Chi-Squared Test
The chi-squared test is a particularly useful technique for testing whether observed data
are representative of a particular distribution. It is widely used in biology, geography
and psychology.

Can you make up your own table of random numbers?

Worked Example 1
(a) Write down 100 numbers 'at random' (taking values from 0 to 9). Do this without
the use of a calculator, computer or printed random number tables. Draw up a
frequency table to see how many times you wrote down each number. (These will
be called your observed frequencies.)
(b) If your numbers really are random, roughly how many of each do you think there
ought to be? (These are referred to as expected frequencies.)
(c) What model are you using for this distribution of expected frequencies?

Solution
(a) Here is a possible frequency table for this experiment.

Number               0   1   2   3   4   5   6   7   8   9
Observed frequency  11  12   8  14   7   9   9   8  14   8

(b) You would expect each digit to appear approximately 10 times.


(c) This is the 'uniform' model, in which every digit is equally likely and so all the
expected frequencies are equal.

For analysing data of the sort used in Worked Example 1, where you are comparing
observed with expected values, a chart as shown below is a useful way of writing down
the data.
                  Frequency
Number   Observed, $O_i$   Expected, $E_i$
0
1
2
3
4
...


For the data in Worked Example 1, try looking at the differences $O_i - E_i$.

Unfortunately the positive differences and negative differences will cancel each other out
and you always have a zero total, because the observed and expected frequencies both add
up to the same total.

To overcome this problem the differences $O_i - E_i$ can be squared. So $\sum (O_i - E_i)^2$
could form the basis of your 'difference measure'. In this particular example however,
each figure has an equal expected frequency, but this will not always be so (when you
come to test other models in other situations). The importance assigned to a difference
must be related to the size of the expected frequency. A difference of 10 must be more
significant if the expected frequency is 20 than if it is 100.
One way of allowing for this is to divide each squared difference by the expected
frequency for that category.
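
As a quick check of that idea, using only the two expected frequencies quoted above, the same difference of 10 contributes five times as much when the expected frequency is 20 as when it is 100:

```latex
\frac{10^2}{20} = 5 \qquad \text{whereas} \qquad \frac{10^2}{100} = 1
```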
Here is the example worked out for the data in Worked Example 1:

Number   Observed     Expected     $O_i - E_i$   $(O_i - E_i)^2$   $(O_i - E_i)^2 / E_i$
         frequency    frequency
         $O_i$        $E_i$
0        11           10            1             1                0.1
1        12           10            2             4                0.4
2         8           10           −2             4                0.4
3        14           10            4            16                1.6
4         7           10           −3             9                0.9
5         9           10           −1             1                0.1
6         9           10           −1             1                0.1
7         8           10           −2             4                0.4
8        14           10            4            16                1.6
9         8           10           −2             4                0.4
                                                  Total            6.0

For this set of 100 numbers $\sum \frac{(O_i - E_i)^2}{E_i} = 6.0$.
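
The calculation in the table can be reproduced with a few lines of Python. This is only a sketch: the function name `chi_squared_statistic` is an illustrative choice, not something defined in this unit, and the data are the observed frequencies from Worked Example 1.

```python
# Observed digit counts from Worked Example 1 (digits 0 to 9).
observed = [11, 12, 8, 14, 7, 9, 9, 8, 14, 8]


def chi_squared_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Uniform model: every digit has the same expected frequency (100 / 10 = 10).
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_squared_statistic(observed, expected)
print(round(statistic, 1))  # 6.0
```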
But what does this measure tell you and how can you decide whether the observed
frequencies are close to the expected frequencies or really quite different from them?
Firstly, consider what might happen if you tried to test some true random numbers from a
random number table.
Would you actually get 10 for each number? The example worked out here did in fact
use 100 random numbers from a table and not a fictitious set made up by someone taking
part in the experiment.
Each time you take a sample of 100 random numbers you will get a slightly different
distribution and it would certainly be surprising to find one with all the observed
frequencies equal to 10. So, in fact, each different sample of 100 true random numbers
will give a different value for $\sum \frac{(O_i - E_i)^2}{E_i}$.
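
One way to see this variation is to simulate it. The sketch below uses Python's pseudo-random generator as a stand-in for a printed random number table (an assumption made for illustration); the seed value and the helper name `simulate_statistic` are arbitrary choices.

```python
import random


def chi_squared_statistic(observed, expected):
    """Return the sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_statistic(rng, n_digits=100):
    """Draw n_digits pseudo-random digits and return their chi-squared statistic."""
    counts = [0] * 10
    for _ in range(n_digits):
        counts[rng.randrange(10)] += 1
    expected = [n_digits / 10] * 10  # uniform model
    return chi_squared_statistic(counts, expected)


rng = random.Random(1)  # fixed seed so that the run is repeatable
statistics = [simulate_statistic(rng) for _ in range(5)]

# Five samples of 100 digits, five values of the statistic.
print([round(s, 1) for s in statistics])
```

Running it with a different seed gives a different list, which is the point of the paragraph above.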

The distribution of $\sum \frac{(O_i - E_i)^2}{E_i}$ is very close to the theoretical
distribution known as $\chi^2$ (or chi-squared). In fact, there is a family of $\chi^2$
distributions, each with a different shape depending on the number of degrees of freedom,
denoted by $\nu$ (the Greek letter nu, pronounced 'new').

The distribution in this case is denoted by $\chi^2_\nu$.

For any $\chi^2$ distribution, the number of degrees of freedom shows the number of
independent free choices which can be made in allocating values to the expected
frequencies. In these examples, there are ten expected frequencies (one for each of the
numbers 0 to 9). However, as the total frequency must equal 100, only nine of the
expected frequencies can vary independently and the tenth one must take whatever value
is required to fulfil that 'constraint'. To calculate the number of degrees of freedom

$\nu$ = number of classes or groups − number of constraints

Here there are ten classes and one constraint so

$\nu = 10 - 1 = 9$

The shape of the $\chi^2_\nu$ distribution is different for each value of $\nu$ and the
function is very complicated. The mean of $\chi^2_\nu$ is $\nu$ and the variance is $2\nu$.
The distribution is positively skewed except for large values of $\nu$ for which it
becomes approximately symmetrical.

[Figure: $\chi^2_\nu$ distribution curves for $\nu = 5$, $\nu = 7$ and $\nu = 9$; horizontal axis from 0 to 18]
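
The stated mean and variance can be checked roughly by simulation. The sketch below relies on a standard characterisation that this unit does not state: a chi-squared variable with $\nu$ degrees of freedom has the same distribution as the sum of the squares of $\nu$ independent standard normal values. The sample size, the seed and the name `sample_chi_squared` are arbitrary illustrative choices.

```python
import random
import statistics


def sample_chi_squared(rng, dof):
    """One draw: the sum of squares of `dof` independent standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))


rng = random.Random(0)  # fixed seed so that the run is repeatable
dof = 9
draws = [sample_chi_squared(rng, dof) for _ in range(20000)]

sample_mean = statistics.fmean(draws)
sample_variance = statistics.variance(draws)

# With 9 degrees of freedom these should land near 9 and 18 respectively.
print(round(sample_mean, 2), round(sample_variance, 2))
```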

7.2 Significance Testing


The set of random numbers shown in the table in Section 7.1 generated a value of $\chi^2$
equal to 6. You can see where this value comes in the $\chi^2$ distribution with 9 degrees
of freedom.

A high value of $\chi^2$ implies a poor fit between the observed and expected frequencies, so
the right hand end of the distribution is used for most hypothesis testing.

From $\chi^2$ tables you find that only 5% of all samples of true random numbers will give
a value of $\chi^2_9$ greater than 16.92. This is called the critical value of $\chi^2$ at
5%. If the calculated value of $\chi^2$ from

$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$

is less than 16.92, it would support the view that the numbers are random. If not, you
would expect that the numbers are not truly random.

[Figure: $\chi^2$ distribution with $\nu = 9$, with the right-hand tail beyond 16.92 marked "Only 5% of samples of true random numbers give results here"]

Hence with the data from Worked Example 1 in Section 7.1, $\chi^2 = 6$; this is considerably
less than the 5% critical value, 16.919.

A summary of the critical values for $\chi^2$ at 5% and 1% is given below for degrees of
freedom $\nu = 1, 2, ..., 10$.

Degrees of
freedom, $\nu$      5%        1%
 1                3.841     6.635
 2                5.991     9.210
 3                7.815    11.345
 4                9.488    13.277
 5               11.070    15.086
 6               12.592    16.812
 7               14.067    18.475
 8               15.507    20.090
 9               16.919    21.666
10               18.307    23.209

Worked Example 1
Nadir is testing an octahedral die to see if it is unbiased. The results are given in the
table below.

Score       1   2   3   4   5   6   7   8
Frequency   7  10  11   9  12  10  14   7

Test the hypothesis that the die is fair.

Solution
Using $\chi^2$, the number of degrees of freedom is $8 - 1 = 7$, so at the 5% significance
level the critical value of $\chi^2$ is 14.07. As before, a table of values is drawn up, the
expected frequencies being based on a uniform distribution which gives

frequency for each result $= \frac{1}{8}(7 + 10 + 11 + 9 + 12 + 10 + 14 + 7) = 10$

$O_i$   $E_i$   $O_i - E_i$   $(O_i - E_i)^2$   $(O_i - E_i)^2 / E_i$
 7      10      −3             9                0.9
10      10       0             0                0
11      10       1             1                0.1
 9      10      −1             1                0.1
12      10       2             4                0.4
10      10       0             0                0
14      10       4            16                1.6
 7      10      −3             9                0.9
                               Total            4.0


The calculated value of $\chi^2$ is 4.0. This is well below the critical value of 14.07, so
Nadir could conclude that there is no significant evidence of bias; the results are
consistent with the hypothesis that the die is fair.
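
The whole procedure for a uniform model (expected frequencies, statistic, degrees of freedom, comparison with the critical value) can be sketched in Python as follows. The function name `uniform_fit_test` and its return format are illustrative assumptions; the critical values are copied from the table earlier in this section.

```python
# Critical values (5%, 1%) keyed by degrees of freedom, copied from the table in this section.
CRITICAL_VALUES = {
    1: (3.841, 6.635), 2: (5.991, 9.210), 3: (7.815, 11.345),
    4: (9.488, 13.277), 5: (11.070, 15.086), 6: (12.592, 16.812),
    7: (14.067, 18.475), 8: (15.507, 20.090), 9: (16.919, 21.666),
    10: (18.307, 23.209),
}


def uniform_fit_test(observed, level="5%"):
    """Test observed frequencies against a uniform model.

    Returns (statistic, degrees of freedom, critical value, reject?).
    """
    expected = sum(observed) / len(observed)  # same for every class
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = len(observed) - 1  # one constraint: the total is fixed
    critical = CRITICAL_VALUES[dof][0 if level == "5%" else 1]
    return statistic, dof, critical, statistic > critical


# Nadir's octahedral die from the worked example above.
statistic, dof, critical, reject = uniform_fit_test([7, 10, 11, 9, 12, 10, 14, 7])
print(round(statistic, 1), dof, critical, reject)  # 4.0 7 14.067 False
```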

Exercises
1. Nicki made a tetrahedral die using card and then tested it to see whether it was fair.
She got the following scores:
Score 1 2 3 4
Frequency 12 15 19 22

Does the die seem fair?

2. Joe has a die which has faces numbered from 1 to 6. He got the following scores:

Score 1 2 3 4 5 6
Frequency 17 20 29 20 18 16

He thinks that the die may be biased. What do you think?

3. The table below shows the number of pupils absent on particular days in the week.

Day       M     Tu    W     Th    F
Number    125   88    85    94    108

Find the expected frequencies if it is assumed that the number of absentees is
independent of the day of the week.
Test, at 5% level, whether the differences in observed and expected frequencies are
significant.

4. Over a long period of time, a research team monitored the number of car accidents
which occurred in a particular county. The following table summarises the data
relating to the day of the week on which the accident occurred.

Day                    M    T    W    Th   F    Sa   Su
Number of accidents    60   54   48   53   53   75   77

Investigate the hypothesis that these data are a random sample from a uniform
distribution.

7.3 Contingency Tables


In many situations, individuals are classified according to two sets of attributes, and you
may wish to investigate the dependency between these attributes. This is dealt with by
using a contingency table and the $\chi^2$ distribution.


2 × 2 contingency tables
The method of approach is illustrated in the example below.

Worked Example 1
Some years ago a University decided to require all entrants to a science course to study a
non-science subject for one year. In the first year all of the scheme entrants were given
the choice of studying French or Russian. The numbers of students of each gender
choosing each language are shown in the following table.

French Russian
Male 39 16
Female 21 14

Use a $\chi^2$ test (including Yates' correction*) at the 5% significance level to test
whether choice of language is independent of gender.

(* Yates' continuity correction for $2 \times 2$ contingency tables is that you use

$\chi^2 = \sum_i \frac{(|O_i - E_i| - 0.5)^2}{E_i}$ rather than $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ .)

Solution
The observed frequencies are given in the 2 × 2 contingency table.

French Russian Total

Male 39 16 55

Female 21 14 35

Total 60 30 90

The null hypothesis is, as usual,

$H_0$: there is no relationship between choice of language and gender

and so the alternative hypothesis is

$H_1$: the choice of language is dependent on gender

Assuming the null hypothesis, you need to calculate the expected frequency. For
example,

$P(\text{student is male}) = \frac{55}{90}$

$P(\text{student chooses French}) = \frac{60}{90}$

Since these two events are independent under $H_0$,

$P(\text{student is male and chooses French}) = \frac{55}{90} \times \frac{60}{90}$

and, since there are 90 students,

expected frequency (for male and French) $= \frac{55}{90} \times \frac{60}{90} \times 90 = \frac{55 \times 60}{90} = 36.67$

There is no need to go through this procedure each time since it can be calculated directly
from

Expected frequency $= \frac{(\text{row total}) \times (\text{column total})}{(\text{grand total})}$

In fact, for a 2 × 2 table only one of these calculations is needed.

The row and column totals can be used to find the other expected values. For example,
Expected frequency (for female and French) = 60 − 36.67
= 23.33

In this way, the table of expected frequency is as shown below.

French Russian Total

Male 36.67 18.33 55

Female 23.33 11.67 35

Total 60 30 90
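
The row-total and column-total rule can be applied to every cell at once. The following is a minimal Python sketch; the function name `expected_frequencies` is an illustrative choice, and the input is the observed table from this worked example.

```python
def expected_frequencies(table):
    """Expected frequency for each cell: (row total) x (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


# Rows: Male, Female. Columns: French, Russian.
observed = [[39, 16], [21, 14]]
expected = expected_frequencies(observed)

rounded = [[round(e, 2) for e in row] for row in expected]
print(rounded)  # [[36.67, 18.33], [23.33, 11.67]]
```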

Since there is only one expected frequency needed in order to find the rest, the number
of degrees of freedom is $\nu = 1$.

But, for $\nu = 1$, you have to use Yates' continuity correction which evaluates

$\chi^2_{\text{calc}} = \sum_{i=1}^{4} \frac{(|O_i - E_i| - 0.5)^2}{E_i}$

From tables, the critical $\chi^2$ at 5% level is given by 3.84. Hence $H_0$ is rejected if
$\chi^2_{\text{calc}} > 3.84$.


Now

$O_i$   $E_i$     $|O_i - E_i|$   $(|O_i - E_i| - 0.5)^2 / E_i$
39      36.67     2.33            0.091
16      18.33     2.33            0.183
21      23.33     2.33            0.144
14      11.67     2.33            0.287
                  Total           0.705

$\chi^2_{\text{calc}} = 0.705 < 3.84$, the critical $\chi^2$ value. Hence you can conclude
that there is no significant evidence to reject $H_0$; i.e. the data are consistent with
choice of language and gender being independent.
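
For comparison, here is a minimal Python sketch of the Yates-corrected calculation for a 2 × 2 table. The function name `yates_chi_squared` is an illustrative choice. Because the code keeps the expected frequencies unrounded, it gives about 0.707 rather than the 0.705 obtained above from values rounded to two decimal places; the conclusion is unchanged.

```python
def yates_chi_squared(table):
    """Chi-squared statistic for a 2 x 2 table using Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (abs(observed - expected) - 0.5) ** 2 / expected
    return statistic


# Rows: Male, Female. Columns: French, Russian.
statistic = yates_chi_squared([[39, 16], [21, 14]])
critical_5_percent = 3.841  # 1 degree of freedom, from the table in Section 7.2

print(round(statistic, 3), statistic > critical_5_percent)  # 0.707 False
```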

h × k contingency tables ( h rows, k columns)


This is illustrated with an extension to the previous question, which also illustrates the
convention that any entry with expected frequency of 5 or less should be eliminated by
combining classes together.

Worked Example 2
Following the example above, the choice of non-science subjects has now been widened
and the current figures are as follows
French Poetry Russian Sculpture

Male 2 8 15 10

Female 10 17 21 37

Use a $\chi^2$ test at the 5% significance level to test whether choice of subject is independent
of sex. In applying the test you should combine French with another subject. Explain
why this is necessary and the reasons for your choice of subject.

Solution
This is a 2 × 4 contingency table of observed values.
French Poetry Russian Sculpture Total

Male 2 8 15 10 35

Female 10 17 21 37 85

Total 12 25 36 47 120

The expected frequency for 'male' and 'French' is

$\frac{12 \times 35}{120} = 3.5$

Since this is less than 5, French should be combined with another subject, and the obvious
choice is Russian since this is also a language.
Combining the French and Russian together gives
Fr / Rus Poetry Sculpture Total

Male 17 8 10 35

Female 31 17 37 85

Total 48 25 47 120
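
The check for small expected frequencies and the merging of two columns can be sketched in Python as below. The helper names `expected_frequencies` and `merge_columns` are illustrative assumptions; the decision about which columns to merge is still made by hand, as in the text (French with Russian, because both are languages).

```python
def expected_frequencies(table):
    """Expected frequency for each cell: (row total) x (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def merge_columns(table, keep, absorb):
    """Return a new table in which column `absorb` has been added into column `keep`."""
    merged = []
    for row in table:
        new_row = list(row)
        new_row[keep] += new_row[absorb]
        del new_row[absorb]
        merged.append(new_row)
    return merged


# Rows: Male, Female. Columns: French, Poetry, Russian, Sculpture.
observed = [[2, 8, 15, 10], [10, 17, 21, 37]]
expected = expected_frequencies(observed)

# Cells whose expected frequency is 5 or less, as (row, column) indices.
small_cells = [(i, j) for i, row in enumerate(expected)
               for j, e in enumerate(row) if e <= 5]
print(small_cells)  # [(0, 0)]

# Combine French (column 0) with Russian (column 2).
merged = merge_columns(observed, keep=0, absorb=2)
print(merged)  # [[17, 8, 10], [31, 17, 37]]
```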

As before, $H_0$: sex and choice of subject are independent

$H_1$: sex and choice of subject are dependent

The number of degrees of freedom is 2, since determining just 2 expected values will be
sufficient to find the rest.
Note that, in general, for an $h \times k$ contingency table

No. of degrees of freedom $= (h - 1) \times (k - 1)$

(In the example above, $h = 2$, $k = 3$, giving the number of degrees of freedom as
$(2 - 1) \times (3 - 1) = 1 \times 2 = 2$.) Thus, the critical $\chi^2$ value is 5.99.

The expected frequency for 'male' and 'French and Russian' is

$\frac{35 \times 48}{120} = 14.00$

and for 'male' and 'poetry' is

$\frac{35 \times 25}{120} = 7.29$

The rest of the values can now be calculated from the row and column totals to give the
following expected frequencies

Fr / Rus Poetry Sculpture Total

Male 14.00 7.29 13.71 35

Female 34.00 17.71 33.29 85

Total 48 25 47 120

and the calculated $\chi^2$ is given by

$O_i$   $E_i$     $|O_i - E_i|$   $(O_i - E_i)^2 / E_i$
17      14.00     3.00            0.643
 8       7.29     0.71            0.069
10      13.71     3.71            1.004
31      34.00     3.00            0.265
17      17.71     0.71            0.028
37      33.29     3.71            0.413
                  Total           2.422

$\chi^2_{\text{calc}} = 2.422 < 5.99$, the critical value. So you conclude again that there
is no significant evidence of dependence between sex and choice of subject.
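
A general h × k version of the calculation, without Yates' correction, might look like the sketch below. The function name `chi_squared_independence` is an illustrative choice. Working with unrounded expected frequencies gives about 2.42, in line with the 2.422 found above from rounded values.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for an h x k contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)  # (h - 1) x (k - 1)
    return statistic, dof


# Rows: Male, Female. Columns: French/Russian combined, Poetry, Sculpture.
statistic, dof = chi_squared_independence([[17, 8, 10], [31, 17, 37]])
critical_5_percent = 5.991  # 2 degrees of freedom, from the table in Section 7.2

print(round(statistic, 2), dof, statistic > critical_5_percent)  # 2.42 2 False
```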

Exercises
1. In a survey on transport, electors from three different areas of a large city
were asked whether they would prefer money to be spent on general road
improvement or on improving public transport. The replies are shown in the
following contingency table.
                                    Area
                                A     B     C
Road improvement preferred      78    46    24
Public transport preferred      22    34    36

Use a $\chi^2$ test at the 1% significance level to test whether the proportion favouring
expenditure on general road improvement is independent of the area.

2. During an investigation into visits to a Health Centre, interest is focused on the
social class of those attending the surgery.
The table below shows the number of patients attending the surgery together with
the population of the whole area covered by the Health Centre, each categorised by
social class.

Social Class    I     II    III    IV     V
Patients        28    63    188    173    48
Population      200   500   1600   1200   500

Use a $\chi^2$ test, at the 5% level of significance, to decide whether or not these
results indicate that those attending the surgery are a representative sample of the
whole area with respect to social class.

As part of the same investigation, the following table was constructed showing the
reason for the patients' visits to the surgery, again categorised by social class.

                       Social Class
Reason            I     II    III   IV    V
Minor physical    10    21    98    91    27
Major physical    7     17    49    40    15
Mental & other    11    25    41    42    6

Is there significant evidence to conclude that the reason for the patients' visits to the
surgery is not independent of their social class?
Use a 5% level of significance.
Give an interpretation of your results.

3. (a) The numbers of books borrowed from a library during a certain week were
518 on Monday, 431 on Tuesday, 485 on Wednesday, 443 on Thursday and
523 on Friday.
Is there any evidence that the number of books borrowed varies between the
five days of the week? Use a 1% level of significance.
Interpret fully your conclusions.
(b) Analysis of the rate of turnover of employees by a personnel manager
produced the following table showing the length of stay of 200 people who
left the company for other employment.

                Length of employment (years)
Grade           0-2    2-5    >5
Managerial      4      11     6
Skilled         32     28     21
Unskilled       25     23     50

Using a 1% level of significance, analyse this information and state fully the
conclusions from your analysis.

4. A hospital employs a number of visiting surgeons to undertake particular operations.
If complications occur during or after the operation the patient has to be transferred
to a larger hospital nearby where the required back up facilities are available.
A hospital administrator, worried by the effects of this on costs, examines the
records of three surgeons. Surgeon A had 6 out of her last 47 patients transferred,
surgeon B, 4 out of his last 72 patients and surgeon C, 14 out of his last 41. Form
the data into a 2 × 3 contingency table and test, at the 5% significance level,
whether the proportion transferred is independent of the surgeon.
The administrator decides to offer as many operations as possible to surgeon B.
Explain why and suggest what further information you would need before deciding
whether the administrator's decision was based on valid evidence.


5. A group of students studying A-level Statistics was set a paper, to be attempted
under examination conditions, containing four questions requiring the use of the
$\chi^2$ distribution. The following table shows the type of question and the number of
students who obtained good (14 or more out of 20) and bad (fewer than 14 out of
20) marks.

                        Type of question
             Contingency   Binomial   Normal   Poisson
             table         fit        fit      fit
Good mark    25            12         12       11
Bad mark     4             11         3        12

(a) Test at the 5% significance level whether the mark obtained (by the students
who attempted the question) is associated with the type of question.
(b) Under some circumstances it is necessary to combine classes in order to
carry out a test. If it had been necessary to combine the binomial fit
question with any other question, which question would you have combined
it with and why?
(c) Given that a total of 30 students sat the paper, test, at the 5% significance
level, whether the number of students attempting a particular question is
associated with the type of question.
(d) Compare the difficulty and popularity of the different types of question in
the light of your answers to (a) and (c).
