TA Bivariate Data
Topic assessment
1.
For a random sample of 20 towns, a smoking index (x) and a cancer index (y) are
constructed. These data are illustrated in the diagram below. The associated
summary statistics are also given below.
$n = 20$
$\sum x^2 = 235724$
$\sum x = 2152$
$\sum y^2 = 278565$
$\sum y = 2327$
$\sum xy = 252811$
A community health officer claims that these data show that there is a positive
correlation between smoking and cancer.
(i) Show that the product moment correlation coefficient for the data is 0.425,
correct to 3 significant figures. Carry out a suitable hypothesis test at the 5%
significance level to check the officer's claim, stating your hypotheses and
conclusion carefully. Comment on the validity of the test in relation to the
scatter diagram.
[10]
(ii) A spokesman for a tobacco firm claims that the data do not show that there is
a connection between smoking and cancer. Discuss briefly whether or not his
claim can be justified statistically.
[2]
(iii) Explain the meaning of the term 'significance level', relating your answer to
the test carried out in part (i).
[2]
2.
Ten gymnasts take part in a competition which has two events, floor exercises
and parallel bars. The scores for each event are given in the table.
Competitor   Floor exercises   Parallel bars
A            9.5               9.2
B            8.8               9.5
C            6.7               5.8
D            6.0               3.7
E            5.9               5.4
F            5.7               6.3
G            5.2               5.9
H            4.2               4.5
I            4.1               5.6
J            3.9               4.7
[3]
13/11/13 MEI
3.
Age, x (years)   Remaining years of life, y
35               45.0
45               39.0
55               30.0
65               18.0
[2]
(ii) Calculate the equation of the regression line of y on x, and plot it on your
scatter diagram.
[5]
(iii) Use your regression equation to predict the remaining number of years of life
of a person aged
(A) 65 years,
(B) 90 years,
commenting on your second prediction.
[3]
(iv) Calculate the sum of the squares of the residuals. What is the relevance of
this value with respect to the regression line?
[5]
4.
$\sum x = 307$
$\sum y = 250$
$\sum y^2 = 3008$
$\sum xy = 3143$
(i) Calculate the product moment correlation coefficient for the data. Carry out a
suitable hypothesis test at the 5% significance level, using the null hypothesis
H0: $\rho = 0$. Define $\rho$ and state your alternative hypothesis and conclusions
carefully.
[9]
(ii) What must be assumed about the underlying distribution for the test to be
valid? Discuss, with reference to the scatter diagram above, whether this
assumption is reasonable in this case.
[3]
(iii) Another data point, x = 6, y = 29, was omitted from the data set. Explain
briefly the effect its inclusion would have on the product moment correlation
coefficient. Comment on the validity of the test if this point were included.
[3]
Total 60
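1. (i)
$\bar{x} = \frac{2152}{20} = 107.6$, $\bar{y} = \frac{2327}{20} = 116.35$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 235724 - 20 \times 107.6^2 = 4168.8$
$S_{yy} = \sum y^2 - n\bar{y}^2 = 278565 - 20 \times 116.35^2 = 7818.55$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 252811 - 20 \times 107.6 \times 116.35 = 2425.8$
$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{2425.8}{\sqrt{4168.8 \times 7818.55}} = 0.425$ (3 s.f.)
H0: $\rho = 0$
H1: $\rho > 0$
where $\rho$ is the population correlation coefficient.
At the 5% significance level with n = 20, the critical value is 0.3783.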
Since 0.425 > 0.3783, reject H0: there is sufficient
evidence at the 5% significance level to suggest that
there is a positive correlation between the smoking
index and the cancer index.
EITHER: Since the shape of the scatter diagram is
roughly elliptical, the data appear to come from a
bivariate Normal population, so the test is
appropriate.
OR: Since the shape of the scatter diagram does not
appear to be elliptical, the data do not come from a
bivariate Normal population, so the test is not
appropriate.
[10]
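The arithmetic in part (i) can be checked with a short Python sketch. The function name `pmcc_from_sums` is hypothetical, chosen only for this illustration; the summary statistics and the critical value 0.3783 are the ones quoted in the question and solution.

```python
import math

def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    s_xx = sum_x2 - sum_x ** 2 / n        # same as sum(x^2) - n * xbar^2
    s_yy = sum_y2 - sum_y ** 2 / n        # same as sum(y^2) - n * ybar^2
    s_xy = sum_xy - sum_x * sum_y / n     # same as sum(xy) - n * xbar * ybar
    return s_xy / math.sqrt(s_xx * s_yy)

# Summary statistics for the 20 towns in question 1.
r = pmcc_from_sums(n=20, sum_x=2152, sum_y=2327,
                   sum_x2=235724, sum_y2=278565, sum_xy=252811)

# One-tailed 5% critical value for n = 20, taken from the solution above.
critical_value = 0.3783
reject_h0 = r > critical_value

print(round(r, 3), reject_h0)
```

The comparison `r > critical_value` mirrors the one-tailed test only; it does not replace the check of the scatter diagram for bivariate Normality.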
(ii) If a different significance level were chosen, then the
test could result in the null hypothesis being
accepted. For example, at the 2.5% significance level
the critical value for n = 20 is 0.4438, so the
spokesman's claim would be justified statistically if
the test were carried out at this level.
[2]
(iii) The significance level is the probability of
rejecting the null hypothesis when it is in fact true.
If the population correlation coefficient is in fact zero,
then 5% of random bivariate samples of size 20 will
[3]
(ii)
Competitor   Floor exercises rank   Parallel bars rank   $d^2$
A            1                      2                    1
B            2                      1                    1
C            3                      5                    4
D            4                      10                   36
E            5                      7                    4
F            6                      3                    9
G            7                      4                    9
H            8                      9                    1
I            9                      6                    9
J            10                     8                    4
$\sum d_i^2 = 78$
$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 78}{10 \times 99} = 0.527$ (3 s.f.)
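As a check on the ranking and on the value of the rank correlation coefficient, here is a minimal Python sketch using the scores from the table in question 2. The helper `ranks_descending` is a hypothetical name, and it assumes there are no tied scores (true for these data), with rank 1 given to the highest score.

```python
def ranks_descending(scores):
    """Rank scores so the highest gets rank 1 (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

# Scores for competitors A to J, in order, from the question.
floor = [9.5, 8.8, 6.7, 6.0, 5.9, 5.7, 5.2, 4.2, 4.1, 3.9]
bars = [9.2, 9.5, 5.8, 3.7, 5.4, 6.3, 5.9, 4.5, 5.6, 4.7]

floor_ranks = ranks_descending(floor)
bars_ranks = ranks_descending(bars)

# Sum of squared rank differences, then Spearman's formula.
sum_d2 = sum((f - b) ** 2 for f, b in zip(floor_ranks, bars_ranks))
n = len(floor)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(bars_ranks, sum_d2, round(r_s, 3))
```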
[2]
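(ii)
$\sum x = 200$, $\bar{x} = \frac{200}{4} = 50$
$\sum y = 132$, $\bar{y} = \frac{132}{4} = 33$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 10500 - 10000 = 500$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 6150 - 6600 = -450$
For regression line $y = a + bx$:
$b = \frac{S_{xy}}{S_{xx}} = \frac{-450}{500} = -0.9$
$a = \bar{y} - b\bar{x} = 33 + 0.9 \times 50 = 78$
Regression line is
$y = 78 - 0.9x$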
[5]
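The regression coefficients can be reproduced with a few lines of Python. This is only a sketch: it assumes the four data pairs in the question's table are (age x, remaining years y) = (35, 45.0), (45, 39.0), (55, 30.0), (65, 18.0), and the variable names are chosen for this illustration.

```python
# Data pairs read from the table in question 3 (assumed pairing: age, years left).
ages = [35, 45, 55, 65]
years_left = [45.0, 39.0, 30.0, 18.0]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(years_left) / n

# S_xx and S_xy as defined in the solution.
s_xx = sum(x * x for x in ages) - n * x_bar ** 2
s_xy = sum(x * y for x, y in zip(ages, years_left)) - n * x_bar * y_bar

b = s_xy / s_xx          # gradient of the regression line of y on x
a = y_bar - b * x_bar    # intercept, so the line passes through (x_bar, y_bar)

def predict(x):
    """Predicted remaining years of life at age x."""
    return a + b * x

print(a, b, predict(65), predict(90))
```

The prediction at age 90 lies outside the range of ages in the data (35 to 65), so it is an extrapolation.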
(iii)
(A) $x = 65$: $y = 78 - 0.9 \times 65 = 19.5$
(B) $x = 90$: $y = 78 - 0.9 \times 90 = -3$
The second prediction is meaningless as the person is
predicted to have negative years to live.
[3]
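(iv)
The residuals $y - (78 - 0.9x)$ at $x = 35, 45, 55, 65$ are $-1.5$, $1.5$, $1.5$,
$-1.5$, so the sum of the squares of the residuals is $4 \times 1.5^2 = 9$.
Out of all possible straight lines passing through $(\bar{x}, \bar{y})$, the
regression line is the one which minimises the sum of the squares of the
residuals. So 9 is the smallest possible value of the sum of the squares of
the residuals.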
[5]
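A small Python sketch can illustrate the minimising property. It assumes the same four data pairs as the question's table and compares the regression line $y = 78 - 0.9x$ with two other lines through $(\bar{x}, \bar{y}) = (50, 33)$; the comparison lines are arbitrary choices made for this illustration.

```python
# Data pairs from question 3 (assumed pairing: age, remaining years of life).
ages = [35, 45, 55, 65]
years_left = [45.0, 39.0, 30.0, 18.0]

def sum_squared_residuals(a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(ages, years_left))

# Regression line of y on x.
sse_regression = sum_squared_residuals(78, -0.9)

# Two other lines that also pass through (50, 33).
sse_steeper = sum_squared_residuals(83, -1.0)
sse_shallower = sum_squared_residuals(73, -0.8)

print(sse_regression, sse_steeper, sse_shallower)
```

Both alternative lines give a larger sum of squares than the regression line, consistent with 9 being the minimum.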
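4. (i)
$\bar{x} = \frac{307}{25} = 12.28$
$S_{xx} = \sum x^2 - n\bar{x}^2 = 3853 - 25 \times 12.28^2 = 83.04$
$\bar{y} = \frac{250}{25} = 10$
$S_{yy} = \sum y^2 - n\bar{y}^2 = 3008 - 25 \times 10^2 = 508$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 3143 - 25 \times 12.28 \times 10 = 73$
$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{73}{\sqrt{83.04 \times 508}} = 0.355$
H0: $\rho = 0$
H1: $\rho \neq 0$
where $\rho$ is the population correlation coefficient.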
Critical value for two-tailed test at 5% significance
level for n = 25 is 0.3961.
0.355 < 0.3961 so accept H0: there is not sufficient
evidence at the 5% significance level to suggest that
there is a correlation between temperature and wind
speed.
[9]
(ii) The underlying distribution must be bivariate Normal.
The elliptical shape of the scatter diagram suggests
that this is the case.
[3]