C R Tut Solns
1. Sketch a scatter diagram that might be expected when x and y are related approximately as
given in each of the cases (A), (B) and (C) below. In each case your diagram should include 6
points, approximately equally spaced with respect to x, and with all x- and y-values positive.
The letters a, b, c, d, e and f represent constants.
(A) $y = a + bx^2$, where a is positive and b is negative,
(B) $y = c + d \ln x$, where c is positive and d is negative,
(C) $y = e + \frac{f}{x}$, where e is positive and f is negative. [GCE2013/II/10 (part)]
Solution:
Note: The curve must appear stationary at the y-axis (since the turning point for $y = a + bx^2$ occurs
at x = 0).
(C) $y = e + \frac{f}{x}$, where e is positive and f is negative.
[Figure: three scatter diagrams labelled (i), (ii) and (iii), each drawn on x- and y-axes with origin O and a marked point P.]
The equations of the regression lines for one of the sets of bivariate data have been incorrectly
obtained. Identify which of (i), (ii) and (iii) is the corresponding scatter diagram, and justify
your answer clearly.
For the other two sets of bivariate data, explain, in each case, what the diagram tells us about
the correlation between the variables x and y. What does the point P represent? Indicate in the
diagrams the y on x and x on y lines.
Solution:
The equations of the regression lines have been incorrectly obtained for (ii). This is because the
regression lines for a single set of bivariate data must be both upward sloping (positively correlated
data) or both downward sloping (negatively correlated data).
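One way to see why the two gradients must share a sign is to write both in terms of the same quantity. The shorthand $S_{xx}$, $S_{yy}$, $S_{xy}$ for the sums of squares and products about the means is assumed here; it does not appear in the question.

```latex
\[
y - \bar{y} = b\,(x - \bar{x}), \quad b = \frac{S_{xy}}{S_{xx}}
\qquad\text{and}\qquad
x - \bar{x} = d\,(y - \bar{y}), \quad d = \frac{S_{xy}}{S_{yy}} .
\]
```

Since $S_{xx} > 0$ and $S_{yy} > 0$, both b and d take the sign of $S_{xy}$, so on the same x-y axes the two lines (with gradients b and 1/d) slope in the same direction.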
For (i), since the regression lines are close to each other, the linear relationship between the
two variables is moderately strong. Also, since both regression lines are upward sloping, the
variables are positively correlated.
For (iii), since the regression lines are far apart, the linear relationship between the two
variables is weak. Also, since both regression lines are downward sloping, the variables are
negatively correlated.
The point P represents the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the sample means for x and y
respectively.
Since the y on x line minimises the (vertical) y-errors, and the x on y line minimises the (horizontal)
x-errors, the x on y line is always the steeper line.
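The claim about steepness can also be checked algebraically. This is a short sketch using the standard relations between the regression gradients, the correlation coefficient r and the sample standard deviations $s_x$, $s_y$; these symbols are assumed notation, not taken from the question.

```latex
\[
\text{gradient of } y \text{ on } x:\quad b = r\,\frac{s_y}{s_x},
\qquad
\text{gradient of } x \text{ on } y \text{ drawn on the same axes}:\quad \frac{1}{d} = \frac{1}{r}\,\frac{s_y}{s_x},
\]
\[
\left| \frac{b}{1/d} \right| = |bd| = r^2 \le 1 .
\]
```

So the x on y line is at least as steep as the y on x line, and strictly steeper unless $r = \pm 1$, in which case the two lines coincide.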
[Figure: diagrams (i) and (iii) redrawn on x- and y-axes, each with the point P marked and the y on x and x on y lines labelled.]
A student was absent from the theoretical test but obtained a mark of 6 in the practical test.
Use the appropriate regression line to estimate a mark in the theoretical test for this student.
Comment on the reliability of this estimate.
Solution:
Regression line of y on x is
y = 2.4081 + 0.61224x
y = 2.41 + 0.612x (to 3 s.f.)
r = 0.5532833352
= 0.553 (to 3 s.f.)
This estimate is NOT reliable, as neither the value of r nor the scatter plot indicates a strong linear
correlation between the practical and theoretical test marks.
1. An experiment with certain swimming animals was carried out in order to investigate how the
speed at which they swam depended on the angle through which their hind feet moved. The
angle θ degrees through which the hind feet moved was measured, together with the
swimming speed v m s⁻¹. The results are given in the table.
(i) State, giving a reason, which of the least squares regression lines, θ on v or v on θ,
should be used to express a possible linear relation between v and θ.
(ii) Calculate the equation of the line chosen in part (i), giving the values of the coefficients
to a suitable degree of accuracy.
(iii) Interpret, in context, the value of the gradient of the regression line in (ii). By
considering the value of the v-intercept of the regression line, comment on the
suitability of a linear model for the relationship between v and θ, for values of θ
beyond the given data range.
(iv) Find the product moment correlation coefficient for this set of data. If the swimming
speeds were inaccurately measured and each measurement of v is to increase by 0.05,
what is the effect on the product moment correlation coefficient? Justify your answer.
Solution:
(i) The least squares regression line v on θ should be used, as θ is the independent variable and v
is the dependent variable.
(iii) Every increase of one degree in the angle through which an animal’s hind feet move
corresponds to an approximate increase of 0.00995 m s⁻¹ in the animal’s swimming speed.
When the angle through which the hind feet moved is zero, the model gives a swimming speed
of approximately –0.565 m s⁻¹. Hence a linear model is not suitable for the relationship
between v and θ for values of θ beyond the given data range, because speed cannot be negative.
There is no change to the value of r if each value of v is increased by 0.05, as a translation of
the data preserves the strength of the linear relationship.
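The invariance of r under translation can be illustrated numerically. The sketch below uses made-up (angle, speed) pairs, since the tutorial's table is not reproduced here; the helper name `pmcc` is also just an illustrative choice.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for illustration only; NOT the values from the question.
theta = [60, 70, 80, 90, 100, 110]
v = [0.05, 0.12, 0.21, 0.34, 0.41, 0.55]

r_original = pmcc(theta, v)
# Add 0.05 to every speed: each deviation from the mean is unchanged.
r_shifted = pmcc(theta, [speed + 0.05 for speed in v])
print(r_original, r_shifted)
```

Adding a constant to every v shifts the mean of v by the same constant, so every deviation from the mean, and hence each of Sxy, Sxx and Syy, stays the same.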
i      1    2    3    4    5    6    7    8
x_i    10   11   12   11   17   14   19   x_8
y_i    9    8    7    6    5    4    1    y_8
(i) It is given that the regression lines y on x and x on y for this set of data have equations
$y = -\frac{7}{10}x + \frac{151}{10}$ and $x = -\frac{7}{6}y + 20$.
(ii) Let $Y_i$ be the value obtained by substituting $x_i$ into the equation of the regression line of
y on x, for i = 1, 2, …, 8, i.e. $Y_1 = -\frac{7}{10}x_1 + \frac{151}{10}$, $Y_2 = -\frac{7}{10}x_2 + \frac{151}{10}$, .... Find the value of
$\sum_{i=1}^{8} (y_i - Y_i)^2$.
(iii) Hence state an inequality that must be satisfied by $\sum_{i=1}^{8} (y_i - a - bx_i)^2$ for any real
constants a and b.
Solution:
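(i) $\bar{x} = \frac{10 + 11 + 12 + 11 + 17 + 14 + 19 + x_8}{8} = \frac{94 + x_8}{8}$;
$\bar{y} = \frac{9 + 8 + 7 + 6 + 5 + 4 + 1 + y_8}{8} = \frac{40 + y_8}{8}$.
Since $(\bar{x}, \bar{y})$ lies on both regression lines,
$\bar{y} = -\frac{7}{10}\bar{x} + \frac{151}{10}$ and $\bar{x} = -\frac{7}{6}\bar{y} + 20$.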
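The remaining algebra can be checked with exact fractions. This is a minimal sketch, assuming the two regression lines are $y = -\frac{7}{10}x + \frac{151}{10}$ and $x = -\frac{7}{6}y + 20$ and that both pass through the point of sample means; the variable names are illustrative.

```python
from fractions import Fraction as F

# y on x: y = a + b*x ;  x on y: x = c + d*y
a, b = F(151, 10), F(-7, 10)
c, d = F(20), F(-7, 6)

# Both lines pass through (x_bar, y_bar).
# Substituting y = a + b*x into x = c + d*y gives x*(1 - d*b) = c + d*a.
x_bar = (c + d * a) / (1 - d * b)
y_bar = a + b * x_bar

# x_bar = (94 + x8) / 8 and y_bar = (40 + y8) / 8
x8 = 8 * x_bar - 94
y8 = 8 * y_bar - 40
print(x_bar, y_bar, x8, y8)  # prints: 13 6 10 8
```

Using `Fraction` avoids rounding, so the means and the two unknown data values come out as exact integers.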
(iii) The least squares regression line of y on x (as its name suggests) minimises the sum of the
squares of the deviations of each value of y in the data set from the value of y attained by
the regression line at the same value of x (i.e. the y-errors). Hence the sum of squares of the
deviations for the same data set, measured with respect to any other straight line y = a + bx,
must be larger. Therefore for any real constants a and b,
$\sum_{i=1}^{8} (y_i - a - bx_i)^2 \ge 8.8$,
and equality is achieved if and only if $a = \frac{151}{10}$ and $b = -\frac{7}{10}$, i.e. when y = a + bx is the
equation of the regression line of y on x itself.
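The value 8.8 and the inequality can be checked numerically. The sketch below assumes $x_8 = 10$ and $y_8 = 8$, which are the values implied by the two given regression lines passing through the sample means (an inference, not a figure quoted in the text); the perturbed lines are arbitrary illustrative choices.

```python
from fractions import Fraction as F

# x8 = 10 and y8 = 8 are inferred from the given regression lines (assumption).
xs = [10, 11, 12, 11, 17, 14, 19, 10]
ys = [9, 8, 7, 6, 5, 4, 1, 8]

def sum_sq_y_errors(a, b):
    """Sum over the data set of (y_i - a - b*x_i)^2."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

a_hat, b_hat = F(151, 10), F(-7, 10)   # regression line of y on x
minimum = sum_sq_y_errors(a_hat, b_hat)
print(float(minimum))  # 8.8

# A few hypothetical alternative lines: each gives a strictly larger sum.
for da, db in [(F(1, 10), 0), (0, F(1, 100)), (F(-1, 2), F(1, 20))]:
    assert sum_sq_y_errors(a_hat + da, b_hat + db) > minimum
```

Checking a handful of other lines does not prove the inequality; it only illustrates the minimising property that the argument above relies on.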
Location  A     B     C     D     E     F     G     H     I     J     K     L
x         7.7   3.0   24.1  13.2  9.3   9.0   10.4  3.5   17.6  4.5   2.0   2.5
y         8.8   3.3   28.0  16.1  9.4   8.9   12.5  15.8  22.5  5.0   2.2   2.8
s         121   81    181   149   125   121   137   149   173   91    71    71
(i) By considering the values of x and y, explain why Location F should be omitted from
any further analysis. State, with a reason, another location that should be omitted.
(ii) Use a suitable regression line to give an estimate of the straight line distance when the
road distance is 20.0 km.
(iii) Draw a scatter diagram of s against y. State, with a reason, which of the following
models is more appropriate to describe the relationship between y and s:
Model I: $s = a + by^2$,
Model II: $s = a + b \ln y$.
(iv) Using the more appropriate model found in part (iii), calculate the equation of the
corresponding regression line.
(v) Estimate the road distance travelled if the bus fare is 170 cents. Comment on the
reliability of this estimate. [HCI/2010/Prelims/II/Q12 (modified)]
Solution:
(i) Location F should be omitted as the road distance cannot be smaller than the straight line
distance, indicating that it is an incorrect data entry.
From the scatter diagram, another location that should be omitted is location H.
2015 – 2016 / H2 Maths / Correlation and Regression Page 8 of 14
National Junior College Mathematics Department 2016
(ii) The suitable regression line is the regression line of x on y: x = 0.393655 + 0.817029y.
When y = 20.0, x = 0.393655 + 0.817029(20.0) ≈ 16.734.
Hence the estimated straight line distance is 16.7 km (to 3 s.f.).
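The coefficients above can be reproduced from the table without a graphing calculator. This is a minimal sketch using the textbook least squares formulas, assuming locations F and H are both omitted; the helper name `least_squares` is illustrative.

```python
# Straight line distance x and road distance y (km), locations F and H omitted.
x = [7.7, 3.0, 24.1, 13.2, 9.3, 10.4, 17.6, 4.5, 2.0, 2.5]
y = [8.8, 3.3, 28.0, 16.1, 9.4, 12.5, 22.5, 5.0, 2.2, 2.8]

def least_squares(u, w):
    """Return (intercept, gradient) of the regression line of w on u."""
    n = len(u)
    mu, mw = sum(u) / n, sum(w) / n
    s_uw = sum((ui - mu) * (wi - mw) for ui, wi in zip(u, w))
    s_uu = sum((ui - mu) ** 2 for ui in u)
    gradient = s_uw / s_uu
    return mw - gradient * mu, gradient

# Regression of x on y: y takes the role of the explanatory variable.
intercept, gradient = least_squares(y, x)
x_at_20 = intercept + gradient * 20.0
print(round(intercept, 6), round(gradient, 6), round(x_at_20, 1))
```

Passing `(y, x)` rather than `(x, y)` is what makes this the x on y line, i.e. the line that minimises the sum of squares of the x-errors.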
(iii)
Graphical perspective:
For model I, the turning (or stationary) point of the quadratic model must be observed to be
close to the s-axis, which is not satisfied by the trend exhibited in this scatter plot.
Since the points appear to follow a curve that is increasing at a decreasing rate (with
respect to y), a logarithmic model such as model II will be more suitable in this case.
Contextual perspective:
Based on the scatter diagram, a quadratic model would mean that after the turning point, the
bus fare will start to decrease as the road distances increase, which does not make sense. The
logarithmic model is more sensible in this aspect, as it is increasing over all values of y.
Since 170 lies within the range of values of s, [71, 173], and r = 0.992 has an absolute value
close to 1, suggesting a strong linear relationship between s and ln y, the estimate obtained
is reliable.
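The quoted value of r can be reproduced from the table. The sketch below assumes that locations F and H are omitted and that the line of s on ln y is the one fitted and then inverted at s = 170; the working for those steps is not shown above, so treat the fitted coefficients as a check rather than as the official answer.

```python
from math import exp, log, sqrt

# Road distance y (km) and bus fare s (cents), locations F and H omitted.
y = [8.8, 3.3, 28.0, 16.1, 9.4, 12.5, 22.5, 5.0, 2.2, 2.8]
s = [121, 81, 181, 149, 125, 137, 173, 91, 71, 71]

ln_y = [log(value) for value in y]
n = len(y)
mean_u, mean_s = sum(ln_y) / n, sum(s) / n
s_us = sum((u - mean_u) * (fare - mean_s) for u, fare in zip(ln_y, s))
s_uu = sum((u - mean_u) ** 2 for u in ln_y)
s_ss = sum((fare - mean_s) ** 2 for fare in s)

r = s_us / sqrt(s_uu * s_ss)   # correlation between ln y and s
b = s_us / s_uu                # Model II: s = a + b ln y
a = mean_s - b * mean_u

# Invert the fitted model at s = 170 to estimate the road distance (assumed approach).
y_at_170 = exp((170 - a) / b)
print(round(r, 3), round(a, 2), round(b, 2), round(y_at_170, 1))
```

Because 170 lies inside the observed range of fares, this inversion is an interpolation, which is the first of the two reliability conditions cited above.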
(ii) Identify a pair of values of s and t which should be regarded as an outlier. Give a
possible reason for the occurrence for this pair of data.
(iii) Omitting the outlier, find, correct to 4 decimal places, the value of the product moment
correlation coefficient between
(a) t and s, and
(b) $\frac{1}{t}$ and s.
(iv) Use your answers to parts (i) and (iii) to explain which of $s = a + bt$ or $s = c + \frac{d}{t}$ is the
better model, and find the equation of the appropriate regression line for the better
model. [NJC/2011/Promos/Q10(b) (modified)]
Solution:
(i)
(iv) For s on $\frac{1}{t}$, r = –0.963485031141555 ≈ –0.9635 (4 d.p.)
(v) As the r-value for s on $\frac{1}{t}$ has an absolute value which is closer to 1, and also the scatter plot
appears to follow a curve that is increasing and eventually plateauing close to a horizontal
line, the model $s = c + \frac{d}{t}$ is the better model. From GC, $s = 999.59 - \frac{25701}{t}$.
(i) Sketch the scatter diagram and determine the value of the product moment correlation
coefficient between y and x.
(ii) Determine which of the following is the best model for this set of data, justifying your
choice clearly.
(iii) Find the equation of the least-squares regression line of your selected best model in part
(ii). Use your equation to estimate the value of y when x = 3.8. Comment on the
reliability of the estimation. [NJC/2012/Prelims/II/Q6]
Solution:
(i)
Since x = 3.8 is within the range of values of x, [0.5, 5.0], and r (for this model) = –0.999,
which has an absolute value close to 1, indicating a strong negative linear correlation
between y and $x^2$, the estimate is reliable.
(ii) One of the values of h appears to be incorrect. Indicate the corresponding point on your
diagram by labelling it P.
(iii) Calculate the product moment correlation coefficient for this set of data. Use the
equation of an appropriate regression line to predict the value of s when h = 100,
justifying your choice of regression line.
Model (I): $h = a + bs^2$,
Model (II): $h = a + be^s$.
(iv) Determine which of the two models is a better choice, giving a reason for your answer.
(v) Suppose a new data pair $(\bar{s}, \bar{h})$ is added to the table above, where $\bar{s}$ and $\bar{h}$ are the
patient’s sample mean walking speed (in km/h) and his sample mean heart-beat rate (in
bpm) respectively, based on the data above. Without any calculations, explain whether
the equation of the regression line you have obtained in part (iii) would change.
[NJC/2015/Prelims/II/Q12]
Solution:
(i), (ii)
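(v) Since the regression line passes through $(\bar{s}, \bar{h})$, the added data point has zero h-error.
Its addition therefore does not increase the sum of squares of the h-errors between the
current regression line and the set of data points, hence the sum of squares of the h-errors
remains minimised by the same line and the equation does not change.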
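The same conclusion can be illustrated numerically. The sketch below uses made-up (speed, heart-rate) pairs because the question's table is not reproduced here; only the structure of the check matters.

```python
def least_squares(u, w):
    """Return (intercept, gradient) of the regression line of w on u."""
    n = len(u)
    mu, mw = sum(u) / n, sum(w) / n
    s_uw = sum((ui - mu) * (wi - mw) for ui, wi in zip(u, w))
    s_uu = sum((ui - mu) ** 2 for ui in u)
    gradient = s_uw / s_uu
    return mw - gradient * mu, gradient

# Hypothetical data for illustration only; NOT the values from the question.
s = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [72, 80, 85, 97, 104, 118]

before = least_squares(s, h)

# Append the point of sample means and refit.
s_bar, h_bar = sum(s) / len(s), sum(h) / len(h)
after = least_squares(s + [s_bar], h + [h_bar])
print(before, after)
```

Adding the point of means leaves both means unchanged and contributes zero to the sums of squares and products about the means, so the gradient and intercept are the same up to floating-point rounding.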
7. A scientist wishes to investigate the rate at which mould grows on a slice of expired bread. He
conducts an experiment to measure the area covered by mould on a slice of expired bread
over a span of 2 weeks and records his findings in the table below.
Day t                                0     2     6     10    13
Area covered by mould, x (in cm²)    1.5   18    75    94    99
(i) Calculate, correct to 4 decimal places, the product moment correlation coefficient for
this set of data.
(ii) Explain why the value you have obtained in part (i) does not necessarily imply that a
linear model is suitable for this set of data.
After carrying out some work, the scientist theorises that a model of the form
$\ln\left(\frac{A}{x} - 1\right) = a + bt$,
for some real constants A, a and b, may be a good fit for this set of data. He tests his theory by
calculating the product moment correlation coefficient (denoted by r) between t and
$\ln\left(\frac{A}{x} - 1\right)$ for a few possible values of A, and records his findings in the table below.
(iii) Calculate the value of r for A = 100, giving your answer correct to 6 decimal places.
(iv) Which of 100, 101, and 102 is the most appropriate value of A? Justify your answer.
(v) Using the most appropriate value of A in part (iv), find the values of a and b, and use
these values to estimate the least number of complete days needed for the mould to
cover an area of 50 cm².
(vi) Suggest what the value of A represents in the context of this question.
(ii) Even though the absolute value of r is close to 1, a linear relationship would suggest that the
area covered by the mould can grow indefinitely large as time passes, which is impossible as
the slice of bread has a finite area.
(iv) A = 100 is the most appropriate since the absolute value of r is closest to 1, out of those for
the 3 given values of A.
Hence the regression equation is $\ln\left(\frac{100}{x} - 1\right) = 3.36839 - 0.631816t$.
When x = 50, $\ln\left(\frac{100}{50} - 1\right) = 3.36839 - 0.631816t$
$0 = 3.36839 - 0.631816t$
$t = \frac{3.36839}{0.631816} \approx 5.3313$
Therefore at least 6 days are needed for the mould to cover an area of 50 cm².
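The regression coefficients and the day count can be reproduced from the data table. This is a minimal sketch with A = 100, fitting the transformed variable $\ln\left(\frac{A}{x} - 1\right)$ on t by the usual least squares formulas; the variable names are illustrative.

```python
from math import ceil, log, sqrt

t = [0, 2, 6, 10, 13]
x = [1.5, 18, 75, 94, 99]
A = 100

# Transformed response: z = ln(A/x - 1)
z = [log(A / xi - 1) for xi in x]

n = len(t)
mean_t, mean_z = sum(t) / n, sum(z) / n
s_tz = sum((ti - mean_t) * (zi - mean_z) for ti, zi in zip(t, z))
s_tt = sum((ti - mean_t) ** 2 for ti in t)
s_zz = sum((zi - mean_z) ** 2 for zi in z)

r = s_tz / sqrt(s_tt * s_zz)   # correlation between t and ln(A/x - 1)
b = s_tz / s_tt                # z = a + b t
a = mean_z - b * mean_t

# The area reaches 50 when ln(A/50 - 1) = a + b t.
t_50 = (log(A / 50 - 1) - a) / b
days = ceil(t_50)
print(round(a, 5), round(b, 6), round(t_50, 4), days)
```

Rounding up with `ceil` reflects the wording of the question, which asks for the least number of complete days.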
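(vi) $\ln\left(\frac{A}{x} - 1\right) = a + bt \implies \frac{A}{x} - 1 = e^{a + bt}$
$\frac{A}{x} = 1 + e^{a + bt}$
$x = \frac{A}{1 + e^{a + bt}}$
As $t \to \infty$, $a + bt \to -\infty$ (since b < 0), so $x = \frac{A}{1 + e^{a + bt}} \to \frac{A}{1 + 0} = A$.
In other words, the value of A represents the long-term maximum area covered by the
mould.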