Chi-Square Test

This document discusses using the chi-squared test to determine if two classifications or factors from the same sample are independent. It provides an example of classifying adults by gender and regular exercise to see if these factors are independent. The expected and observed values are calculated and chi-squared is used to determine there is very close agreement, indicating the factors are independent.

Uploaded by

Sport Man
Copyright: © All Rights Reserved

TWO VARIABLE STATISTICS (Chapter 18) 591

2 Amy and Lee then taste six white wines and their rankings were:

Wine          A  B  C  D  E  F
Amy’s order   1  2  4  3  5  6
Lee’s order   2  1  3  4  6  5

a Find Spearman’s rank order correlation coefficient for the wine tasting data.
b Comment on the degree of agreement between their rankings of the wine.
c What is the significance of the sign of t?
3 Find t for: a perfect agreement b completely opposite order.
4 Construct some examples of your own for the following cases:
a t being close to +1 b t being close to −1 c t being close to 0
d t being positive e t being negative
As a consequence of your investigation comment on these five categories.
5 Arrange some competitions of your own choosing and test for rank agreement between
the views of two friends. Record all data and show all calculations. You could examine
preferences in food tasting, sports watched on TV, etc.

D THE $\chi^2$ TEST OF INDEPENDENCE


The $\chi^2$ (chi-squared) test is the test we use to find if two classifications (or factors) from the same sample are independent, i.e., if the occurrence of one of them does not affect the occurrence of the other.
Examples of two classifications could include:
• income and voting intentions
• gender and money earning capacity
• school year groups and canteen improvements

The $\chi^2$ test examines the difference between the observed and expected values, and

$$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is an observed frequency and $f_e$ is an expected frequency.

Small differences between observed and expected frequencies are an indication of the independence between the two classifications.

CALCULATING $\chi^2$
This table shows the results of a sample of 400 randomly selected adults classified according to gender and regular exercise.

          Regular exercise   No regular exercise   sum
Male            112                  104            216
Female           96                   88            184
sum             208                  192            400

This is a 2 × 2 contingency table.

The question is: “Are regular exercise and gender independent?”



Consider a general 2 × 2 contingency table for classifications R and S.

        S1   S2   sum
R1      a    b    w
R2      c    d    x
sum     y    z    n

Here n = w + x = y + z.

Notice that for independence,

$$P(R_1 \cap S_1) = P(R_1)\,P(S_1) = \frac{w}{n} \times \frac{y}{n}$$

∴ the expected frequency for $R_1 \cap S_1$ is

$$n \times P(R_1 \cap S_1) = n \times \frac{w}{n} \times \frac{y}{n} = \frac{wy}{n}$$

Likewise, the expected frequency for $R_1 \cap S_2$ is $\frac{wz}{n}$, etc.

So, the expected value table is:

        S1     S2     sum
R1      wy/n   wz/n   w
R2      xy/n   xz/n   x
sum     y      z      n
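
As a rough illustration of the (row sum × column sum) / n rule, the following Python sketch builds an expected value table from a table of observed frequencies. It is not part of the original text: the function name `expected_table` and the list-of-rows layout are assumptions made for illustration. The data are the regular exercise and gender counts given earlier.

```python
def expected_table(observed):
    """Return expected frequencies (row sum * column sum / n) for a contingency table.

    `observed` is a list of rows, each row a list of observed counts.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


# Regular exercise and gender data (rows: Male, Female)
observed = [[112, 104], [96, 88]]
expected = expected_table(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

Each row of the result keeps its original row sum, and each column its original column sum, which is a quick check that the table has been built correctly.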

Using this result, the expected table for the regular exercise and gender data is:

          Regular exercise            No regular exercise         sum
Male      (216 × 208)/400 ≈ 112.3     (216 × 192)/400 ≈ 103.7     216
Female    (184 × 208)/400 ≈ 95.7      (184 × 192)/400 ≈ 88.3      184
sum       208                         192                         400

and the $\chi^2$ calculation is:

$f_o$    $f_e$     $f_o - f_e$    $(f_o - f_e)^2$    $(f_o - f_e)^2 / f_e$
112      112.3     −0.3           0.09               0.000801
104      103.7      0.3           0.09               0.000868
 96       95.7      0.3           0.09               0.000940
 88       88.3     −0.3           0.09               0.001019
                                  Total ≈            0.00363

So, $\chi^2_{calc} \approx 0.00363$

Since $\chi^2_{calc}$ is very small, there is very close agreement between observed and expected values. This indicates that regular exercise and gender are independent factors.
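
The calculation above can be sketched in Python as follows. This is an illustrative sketch, not part of the original text; the name `chi_squared` is an assumption. It evaluates the sum twice: once with the expected values rounded to one decimal place, as in the table, and once with the unrounded values (row sum × column sum / 400), to show how much the rounding matters.

```python
def chi_squared(observed, expected):
    """Sum (fo - fe)^2 / fe over all cells of two equally shaped tables."""
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )


observed = [[112, 104], [96, 88]]
rounded_expected = [[112.3, 103.7], [95.7, 88.3]]      # 1 decimal place, as in the table
exact_expected = [[112.32, 103.68], [95.68, 88.32]]    # row sum * column sum / 400

chi2_rounded = chi_squared(observed, rounded_expected)
chi2_exact = chi_squared(observed, exact_expected)
print(round(chi2_rounded, 5), round(chi2_exact, 5))
```

With rounded expected values the sum agrees with the hand calculation (about 0.00363); with unrounded values it comes out near 0.00413. Both are very small.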
Note: If observed and expected values differ considerably, the numerators of each fraction added are large and so $\chi^2_{calc}$ would be large.
The question now arises: In a problem like the one considered above, how large would $\chi^2_{calc}$ need to be in order for us to conclude that the two factors are not independent?

EXERCISE 18D.1
1 Find $\chi^2_{calc}$ for the following contingency tables:

a   Factor M
          M1   M2   sum
    N1    31   22    53
    N2    20   27    47
    sum   51   49   100

b   Factor S
          S1   S2
    R1    28   17
    R2    52   41

c   Factor A
          A1   A2
    B1    24   11
    B2    16   18
    B3    25   12

d   Factor T
          T1   T2   T3   T4
    D1    31   22   21   16
    D2    23   19   22   13

2 Now use a calculator to check your answers to question 1.

Note: For the regular exercise and gender example, a calculator screen dump is:

[Calculator screen dump not reproduced.]

The $\chi^2_{calc}$ answer above was ≈ 0.00363, which is not accurate to 3 significant figures due to the rounding errors in the expected values.

DEGREES OF FREEDOM
The $\chi^2$ distribution is dependent on the number of degrees of freedom (df) where

df = (r − 1)(c − 1) for a contingency table which is r × c in size.

Our original regular exercise and gender contingency table is 2 × 2.

So, df = (2 − 1) × (2 − 1) = 1.
For a 4 × 3 contingency table, df = (4 − 1)(3 − 1) = 6.
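
The df rule is simple enough to sketch in a few lines of Python. The function name `degrees_of_freedom` is an illustrative assumption, and the 4 × 3 table below is a placeholder filled with zeros, since only its shape matters here.

```python
def degrees_of_freedom(table):
    """Return (r - 1)(c - 1) for a contingency table given as a list of rows."""
    rows = len(table)
    cols = len(table[0])
    return (rows - 1) * (cols - 1)


# 2 x 2 regular exercise and gender table
print(degrees_of_freedom([[112, 104], [96, 88]]))

# A 4 x 3 table; the entries are placeholders, only the shape is used
print(degrees_of_freedom([[0, 0, 0] for _ in range(4)]))
```

Note that the sum row and sum column are not counted when working out r and c.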

A ‘rule of thumb’ explanation of degrees of freedom


Consider placing the numbers 5, 6 and 8 into a table with three positions (1st, 2nd, 3rd).
For the first position any one of the three numbers could be used, i.e., we have freedom to choose.
For the second position we have freedom to choose from the remaining two numbers. However, for the remaining position there is no freedom of choice, as the remaining number must go into the third position. So we have 2 degrees of freedom (of choice), which is 3 − 1.
Now see if you can show that in placing

4 6 5
1 2 8
3 9 7

into a 3 × 3 table there are (3 − 1) × (3 − 1) degrees of freedom.

3 Find the degrees of freedom (df) for the contingency tables of question 1.
4 Find the df of:

    Factor K
          K1   K2   K3   K4   K5   K6   K7
    L1     2    5    7    3    1    4    9
    L2     6    1    3    8    2    1    7
    L3     4    2    2    5    1    6    5
    L4     3    4    2    4    3    2    4

TABLE OF CRITICAL VALUES


The $\chi^2$ distribution graph is:

[Graph: $\chi^2$ distribution curve, with the area to the right of the critical value $\chi^2_\alpha$ marked as the area of rejection.]

Degrees of        Area right of table value
freedom (df)      0.10      0.05      0.01
 1                2.71      3.84      6.63
 2                4.61      5.99      9.21
 3                6.25      7.81     11.34
 4                7.78      9.49     13.28
 5                9.24     11.07     15.09
 6               10.64     12.59     16.81
 7               12.02     14.07     18.48
 8               13.36     15.51     20.09
 9               14.68     16.92     21.67
10               15.99     18.31     23.21

The values 0.10, 0.05, 0.01, i.e., 10%, 5%, 1%, are called significance levels and these are the ones which are commonly used. Tables of $\chi^2$ values exist for up to 100 degrees of freedom and for other significance levels than those given above.

Notice that, at a 5% significance level, with df = 1, $\chi^2_{0.05} = 3.84$.

This means that at a 5% significance level, the departure between observed and expected is too great if $\chi^2_{calc} > 3.84$.
Likewise, at a 1% significance level, with df = 7, the departure between observed and expected is too great if $\chi^2_{calc} > 18.48$ (from the table above).

Note: For large values of n, $\chi^2_{calc}$ has an approximate chi-squared distribution with (r − 1)(c − 1) degrees of freedom.
Generally, n is sufficiently large if all the expected values are 5 or more.
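
One way to use the table of critical values in code is to store it as a lookup, as in the Python sketch below. This is an illustration only: the names `CRITICAL_VALUES`, `critical_value`, `departure_too_great` and `sample_large_enough` are assumptions, and the numbers are copied directly from the table above (df 1 to 10, levels 0.10, 0.05 and 0.01).

```python
# Critical values copied from the table: df -> {significance level: critical value}
CRITICAL_VALUES = {
    1: {0.10: 2.71, 0.05: 3.84, 0.01: 6.63},
    2: {0.10: 4.61, 0.05: 5.99, 0.01: 9.21},
    3: {0.10: 6.25, 0.05: 7.81, 0.01: 11.34},
    4: {0.10: 7.78, 0.05: 9.49, 0.01: 13.28},
    5: {0.10: 9.24, 0.05: 11.07, 0.01: 15.09},
    6: {0.10: 10.64, 0.05: 12.59, 0.01: 16.81},
    7: {0.10: 12.02, 0.05: 14.07, 0.01: 18.48},
    8: {0.10: 13.36, 0.05: 15.51, 0.01: 20.09},
    9: {0.10: 14.68, 0.05: 16.92, 0.01: 21.67},
    10: {0.10: 15.99, 0.05: 18.31, 0.01: 23.21},
}


def critical_value(df, level):
    """Look up the tabulated critical value for the given df and significance level."""
    return CRITICAL_VALUES[df][level]


def departure_too_great(chi2_calc, df, level):
    """True if chi2_calc exceeds the critical value, i.e. the departure is too great."""
    return chi2_calc > critical_value(df, level)


def sample_large_enough(expected):
    """Rule of thumb from the note: every expected value should be 5 or more."""
    return all(fe >= 5 for row in expected for fe in row)


print(critical_value(1, 0.05))
print(departure_too_great(0.00363, df=1, level=0.05))
print(sample_large_enough([[112.3, 103.7], [95.7, 88.3]]))
```

A lookup like this only covers the df values and levels listed; anything outside the table raises a `KeyError` rather than guessing a value.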

FORMAL TEST FOR INDEPENDENCE


The formal test is structured as follows:
Step 1: We state $H_0$, called the null hypothesis. This is a statement that the two classifications being considered are independent.
        We state $H_1$, called the alternative hypothesis. This is a statement that the two classifications being considered are not independent.
Step 2: Calculate df according to df = (r − 1)(c − 1).
Step 3: We quote the significance level required, i.e., 10%, 5% or 1%.
Step 4: We state the rejection inequality $\chi^2_{calc} > k$ where k is obtained from the table of critical values.
Step 5: From the contingency table, find $\chi^2_{calc}$ using $\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$.
Step 6: We either accept $H_0$ or reject $H_0$, depending on the rejection inequality result.
Step 7: If operating at a 5% level, we could also use p-values to help us with our decision making. If p > 0.05, we accept $H_0$. If p < 0.05, we reject $H_0$.
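
The computational steps of the formal test (Steps 2, 5 and 6) can be strung together as in the Python sketch below. This is an illustration under stated assumptions, not the text's own method: the function name `independence_test` and its return format are hypothetical, and the critical value k is passed in by hand after being read from the table of critical values (Step 4). The example call uses the regular exercise and gender table with k = 3.84 (df = 1, 5% level).

```python
def independence_test(observed, k):
    """Carry out Steps 2, 5 and 6 for a table of observed frequencies.

    k is the critical value read from the table of critical values (Step 4).
    Returns (df, chi2_calc, decision).
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)

    # Step 2: df = (r - 1)(c - 1)
    df = (len(row_sums) - 1) * (len(col_sums) - 1)

    # Step 5: chi-squared from observed and expected (row sum * column sum / n)
    chi2_calc = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_sums[i] * col_sums[j] / n
            chi2_calc += (fo - fe) ** 2 / fe

    # Step 6: apply the rejection inequality chi2_calc > k
    decision = "reject H0" if chi2_calc > k else "accept H0"
    return df, chi2_calc, decision


df, chi2_calc, decision = independence_test([[112, 104], [96, 88]], k=3.84)
print(df, round(chi2_calc, 5), decision)
```

Stating the hypotheses and choosing the significance level (Steps 1 and 3) remain decisions for the person running the test; the code only does the arithmetic.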

Returning to our original regular exercise/gender example:


Step 1: $H_0$ is regular exercise and gender are independent.
        $H_1$ is regular exercise and gender are not independent.
Step 2: df = (2 − 1)(2 − 1) = 1
Step 3: Significance level is 5%.
Step 4: We reject $H_0$ if $\chi^2_{calc} > 3.84$.
Step 5: $\chi^2_{calc} \approx 0.00413$ (calculator value; the hand calculation with rounded expected values gave ≈ 0.00363)
Step 6: As $\chi^2_{calc} < 3.84$, we accept $H_0$, i.e., that regular exercise and gender are independent classifications.
Step 7: p ≈ 0.949 which is > 0.05, providing further evidence to accept $H_0$.
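
The text obtains the p-value from a calculator. For the special case df = 1 it can also be checked by hand in Python, using the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so the area to the right of x is erfc(√(x/2)). The sketch below relies on that identity; the function name `p_value_df1` is an illustrative assumption, and the formula does not apply for other values of df.

```python
import math


def p_value_df1(chi2_calc):
    """Area to the right of chi2_calc under the chi-squared curve with df = 1."""
    return math.erfc(math.sqrt(chi2_calc / 2))


p = p_value_df1(0.00413)
print(round(p, 3))

# At the 5% critical value for df = 1, the area to the right is about 0.05
print(round(p_value_df1(3.84), 3))
```

The second call is a consistency check against the table of critical values: the tabulated 3.84 should leave roughly 5% of the area to its right.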

Example 6
A survey was given to randomly chosen high school students from years 9 to 12 on possible changes to the school’s canteen. The contingency table shows the results.

                  Year group
               9    10    11    12
change         7     9    13    14
no change     14    12     9     7

At a 5% level, test whether there is a significant difference between the proportion of students wanting a change in the canteen across the four year groups.

$H_0$ is year group and change are independent (no significant departure).
$H_1$ is year group and change are not independent.
df = (2 − 1)(4 − 1) = 3. The significance level is 5% or 0.05.
We reject $H_0$ if $\chi^2_{calc} > 7.81$ {from critical values table}.

The 2 × 4 contingency table is:

              Year group
         9    10    11    12    sum
C        7     9    13    14     43
C′      14    12     9     7     42
sum     21    21    22    21     85

The expected frequency table is:

              Year group
         9      10      11      12
C       10.6    10.6    11.1    10.6
C′      10.4    10.4    10.9    10.4

$f_o$    $f_e$    $f_o - f_e$    $(f_o - f_e)^2$    $(f_o - f_e)^2 / f_e$
 7       10.6     −3.6           12.96              1.223
 9       10.6     −1.6            2.56              0.242
13       11.1      1.9            3.61              0.325
14       10.6      3.4           11.56              1.091
14       10.4      3.6           12.96              1.246
12       10.4      1.6            2.56              0.246
 9       10.9     −1.9            3.61              0.331
 7       10.4     −3.4           11.56              1.112
                                 Total              5.816

∴ $\chi^2_{calc} \approx 5.82$ which is not > 7.81.
Consequently, we accept $H_0$, that there is no significant difference between the proportions across the year groups.

Note: Using a calculator we obtain:

[Calculator screen dump not reproduced.]

Notice the small error in the above table due to rounding.
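
The size of the rounding error in Example 6 can be checked with a short Python sketch that keeps the expected frequencies unrounded. This is an illustration only, and the variable names are assumptions; the observed counts and the critical value 7.81 come from the example.

```python
# Observed counts from Example 6 (rows: change, no change; columns: years 9 to 12)
observed = [[7, 9, 13, 14], [14, 12, 9, 7]]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

# Expected frequency for each cell is row sum * column sum / n, kept unrounded
chi2_calc = sum(
    (fo - row_sums[i] * col_sums[j] / n) ** 2 / (row_sums[i] * col_sums[j] / n)
    for i, row in enumerate(observed)
    for j, fo in enumerate(row)
)

critical = 7.81  # df = 3 at the 5% level, from the table of critical values
print(round(chi2_calc, 2))
print("reject H0" if chi2_calc > critical else "accept H0")
```

With unrounded expected values the statistic comes out at about 5.81, slightly below the 5.82 found from the rounded table. The conclusion is unchanged, since both values are well under 7.81.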

EXERCISE 18D.2
1 A random sample of people is taken to find if there is a relationship between smoking
marijuana as a teenager and suffering schizophrenia within the next 15 years. The results
are given in the table below:
Schizophrenic Non-Schizophrenic
Smoker 58 73
Non-smoker 269 624
Test at a 5% level whether there is a relationship between smoking marijuana as a
teenager and suffering schizophrenia within the next 15 years.
2 Examine the following contingency tables for the independence of classifications P and Q. Use a $\chi^2$ test i at a 5% level of significance ii at a 10% level of significance.

a         Q1   Q2
    P1    11   17
    P2    21   23
    P3    28   19
    P4    17   28

b         Q1   Q2   Q3   Q4
    P1     6   11   14   18
    P2     9   12   21   17
    P3    13   24   16   10

3 The table shows the way in which a randomly chosen group intend to vote in the next election.

                 Age of voter
            18 to 35   36 to 59   60+
Party A        85         95      131
Party B       168        197      173

Test at a 5% level whether there is any association between the age of a voter and the party they wish to vote for.

4 The following table shows the results of a random sample where annual income and
cigarette smoking are being compared.
Income level
low average high very high
Smoker 82 167 74 31
Non-smoker 212 668 428 168

Test at a 10% level whether lower income people are more likely to be cigarette smokers.

5 This contingency table shows the responses of a randomly chosen sample of 50+ year-olds to a survey dealing with people’s weight and whether they have diabetes.
Weight
light medium heavy obese
Diabetic 11 19 21 38
Non-diabetic 79 68 74 53

Test at a 1% level whether there is a link between weight and suffering diabetes.

6 The following table is a result of a major investigation considering the two factors of
intelligence level (IQ) and cigarette smoking.
Intelligence level
low average high very high
Non-smoker 283 486 226 38
Medium level smoker 123 201 58 18
Heavy smoker 100 147 64 8

Test at a 1% level whether there is a link between intelligence level (IQ) and cigarette
smoking.

REVIEW SET 18A


1 Thomas rode for an hour each day for eleven days and recorded the number of kilometres
ridden against the temperature that day.

Temp t (°C)      32.9  33.9  35.2  37.1  38.9  30.3  32.5  31.7  35.7  36.3  34.7
d (km ridden)    26.5  26.7  24.4  19.8  18.5  32.6  28.7  29.4  23.8  21.2  29.7

a Using technology, construct a scatterplot of the data.


b Find and interpret Pearson’s correlation coefficient.
c Calculate the equation of the least squares line. How hot must it get before Thomas
does not ride at all?

2 The contingency table below shows the results of motor vehicle accidents in relation to
whether or not the traveller was wearing a seat belt.
Serious injury Permanent disablement Death
Wearing a belt 189 104 58
Not wearing a belt 83 67 46

Find $\chi^2$ and test at a 1% level whether the wearing of a seat belt and injury or death are independent factors.

3 A drinks vendor varies the price of Supa-fizz on a daily basis, and records the number
of sales of the drink (shown below).

Price (p)   $2.50  $1.90  $1.60  $2.10  $2.20  $1.40  $1.70  $1.85
Sales (s)   389    450    448    386    381    458    597    431

a Produce a scatterplot of the data. Do there appear to be any outliers? If so, should
they be included in the analysis?
b Calculate the least squares regression line. Could it give an accurate prediction of
sales if Supa-fizz was priced at 50 cents?

4 Eight identical flower beds (petunias) were watered a varying number of times each
week, and the number of flowers each bed produced is recorded in the table below:

Number of waterings (n) 0 1 2 3 4 5 6 7


Flowers produced (f ) 18 52 86 123 158 191 228 250

a Which is the independent variable?


b Calculate the equation of the least squares line.
c On a scatterplot of the data, plot the least squares line.
d Violet has two beds of petunias. One she waters five times a fortnight (2½ times a week), the other ten times a week.
i How many flowers can she expect from each bed?
ii Which is the more reliable estimate?

5 Examine the following contingency table for the independence of classifications P and Q. Use a $\chi^2$ test a at a 5% level of significance b at a 1% level of significance.

         Q1   Q2   Q3   Q4
P1       19   23   27   39
P2       11   20   27   35
P3       26   39   21   30

REVIEW SET 18B


1 The following table gives peptic ulcer rates per 100 of population for differing family
incomes in the year 1998.

Income (I thousand $)    10   15   20   25   30   40   50   60   80
Peptic ulcer rate R      8.3  7.7  6.9  7.3  5.9  4.7  3.6  2.6  1.2

a Define the role of each variable and produce an appropriate graph.


b Use the method of least squares determination to find the equation of the line of
best fit.
c Give an interpretation of the slope and y-intercept of the line.
d Use the equation of the least squares line to predict the peptic ulcer rate for families
with $45 000 incomes.
e What is the x-intercept of this line? Do you think predictions for incomes greater
than this will be accurate?


f Later it is realised that one of the figures was written incorrectly.
i Which is it likely to be? Why?
ii Repeat b and d having removed the incorrect data.

2 The table shows the responses to a survey as to whether the city speed limit should be increased.

                   Age of driver
               18 to 30   31 to 54   55+
Increase          234        169     134
No increase       156        191     233

Test at a 5% level whether there is any association between the age of a driver and increasing the speed limit.

3 The following table is a result of a major investigation considering the two factors of intelligence level and business success.
Intelligence level
low average high very high
No success 35 30 41 25
Low success 28 41 26 29
Success 35 24 41 56
High success 52 38 63 72

Test at a 1% level whether there is a link between intelligence level (IQ) and business
success.

4 Safety authorities advise drivers to travel 3 seconds behind


the car in front of them as this provides the driver with a
greater chance of avoiding a collision if the car in front has
to brake quickly or is itself involved in an accident. A test
was carried out to find out how long it would take a driver
to bring a car to rest from the time a red light was flashed.
(This is called stopping time, which includes reaction time
and braking time.) The following results are for one driver
in the same car under the same test conditions.

Speed (v km/h)            10    20    30    40    50    60    70    80    90
Stopping time (t secs)    1.23  1.54  1.88  2.20  2.52  2.83  3.15  3.45  3.83

a Produce a scatterplot of the data and indicate its most likely model type.
b Find the linear model which best fits the data. Give evidence as to why you have
chosen this model.
c Use the model to find the stopping time for a speed of:
i 55 km/h ii 110 km/h
d What is the interpretation of the vertical intercept?
e Why does this simple rule apply at all speeds, with a good safety margin?

5 Two supervillains, Silent Predator and the Furry Reaper, terrorise Metropolis by abducting fair maidens (most of whom happen to be journalists). Superman believes that they are collaborating, alternately abducting fair maidens so as not to compete with each other for ransom money. He plots their abduction rate below (in dozens of maidens).

Silent Predator (p) 4 6 5 9 3 5 8 11 3 7 7 4


Furry Reaper (r) 13 10 11 8 11 9 6 6 12 7 10 8

a Plot the data on a scatterplot, and find the least squares regression line (put Silent
Predator on the x-axis).
b Is there any evidence for Superman’s suspicions? (Calculate r and r² and describe the strength of Silent Predator and Furry Reaper’s relationship.)
c What is the estimated number of the Furry Reaper’s abductions given that Silent
Predator’s were 6 dozen?
d Why is the model inappropriate when the Furry Reaper abducts more than 20 dozen
maidens?
e Calculate the p- and r-intercepts. What do these values represent?
f If Superman is faced with a choice of capturing one supervillain but not the other, which should he choose? (Hint: Use e.)