C R Tut Solns

This document provides information about correlation and regression from the National Junior College Mathematics Department. It includes sample questions and solutions related to linear regression, finding equations of regression lines, and interpreting correlation coefficients. Scatter plots and tables of bivariate data are presented.


National Junior College Mathematics Department 2016

National Junior College


2015 – 2016 H2 Mathematics
Correlation and Regression [approx. 3 lessons] Tutorial Solutions

Basic Mastery Questions

1. Sketch a scatter diagram that might be expected when x and y are related approximately as
given in each of the cases (A), (B) and (C) below. In each case your diagram should include 6
points, approximately equally spaced with respect to x, and with all x- and y-values positive.
The letters a, b, c, d, e and f represent constants.
(A) $y = a + bx^2$, where a is positive and b is negative,
(B) $y = c + d \ln x$, where c is positive and d is negative,
(C) $y = e + \frac{f}{x}$, where e is positive and f is negative. [GCE2013/II/10 (part)]

Solution:

(A) $y = a + bx^2$, where a is positive and b is negative,

Note: The curve must appear stationary at the y-axis (since the turning point of $y = a + bx^2$
occurs at x = 0).

(B) $y = c + d \ln x$, where c is positive and d is negative,

Note: The vertical asymptote is at x = 0, the value that x cannot take.

(C) $y = e + \frac{f}{x}$, where e is positive and f is negative.

2. The diagrams below show the two regression lines (y on x and x on y) for three different sets
of bivariate data. The scales along the two axes are the same for each diagram.

[Diagrams (i), (ii) and (iii): each shows a pair of regression lines meeting at a point P, drawn on
axes x and y with origin O.]

The equations of the regression lines for one of the sets of bivariate data have been incorrectly
obtained. Identify which of (i), (ii) and (iii) is the corresponding scatter diagram, and justify
your answer clearly.

For the other two sets of bivariate data, explain, in each case, what the diagram tells us about
the correlation between the variables x and y? What does the point P represent? Indicate in the
diagrams the y on x and x on y lines.

Solution:

The equations of the regression lines have been incorrectly obtained for (ii). This is because the
regression lines for a single set of bivariate data must be both upward sloping (positively correlated
data) or both downward sloping (negatively correlated data).

For (i), since the regression lines are close to each other, the linear relationship between the
two variables is moderately strong. Also, since both regression lines are upward sloping, the
variables are positively correlated.

For (iii), since the regression lines are far apart, the linear relationship between the two
variables is weak. Also, since both regression lines are downward sloping, the variables are
negatively correlated.

The point P represents the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the sample
means of x and y respectively.

Since the y on x line minimises the (vertical) y-errors, and the x on y line minimises the (horizontal)
x-errors, the x on y line is always the steeper line.
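
The steepness claim can also be checked algebraically. The short derivation below is a sketch that assumes the usual notation $S_{xx}$, $S_{yy}$, $S_{xy}$ for the corrected sums of squares and products, and writes b and d for the regression coefficients of the two lines; none of these symbols are used in the solution itself.

```latex
\begin{align*}
\text{y on x:}\quad & y - \bar{y} = b\,(x - \bar{x}), \qquad b = \frac{S_{xy}}{S_{xx}},\\
\text{x on y:}\quad & x - \bar{x} = d\,(y - \bar{y}), \qquad d = \frac{S_{xy}}{S_{yy}}
  \;\Rightarrow\; \text{gradient in the } xy\text{-plane} = \frac{1}{d},\\
& \frac{b}{1/d} = bd = \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} = r^{2} \le 1
  \;\Rightarrow\; |b| \le \left|\frac{1}{d}\right|.
\end{align*}
```

Equality holds only when r = ±1, in which case the two lines coincide.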

[Diagrams (i) and (iii) with the lines labelled: in each diagram the steeper line through P is the
x on y line and the other line is the y on x line.]

3. Ten students sat for a practical test and a theoretical test for one of their subjects, Physics.
Their marks out of 10 are recorded in the following table.

Practical test (x)     8   6   10   8   5   6   8   10   7   7
Theoretical test (y)   6   7    8   6   7   4   9   10   5   8

Draw a scatter diagram for the pairs of marks.

Find, in any form, the equation of the regression line of


(i) y on x, and
(ii) x on y.

Calculate the product moment correlation coefficient for the data.

A student was absent from the theoretical test but obtained a mark of 6 in the practical test.
Use the appropriate regression line to estimate a mark in the theoretical test for this student.
Comment on the reliability of this estimate.

Solution:

Regression line of y on x is
y = 2.4081 + 0.61224x
y = 2.41 + 0.612x (to 3 s.f.)

Regression line of x on y is x = 4 + 0.5y.

r = 0.5532833352
= 0.553 (to 3 s.f.)

Using the regression line of y on x (to minimise y-errors)


y = 2.4081 + 0.61224(6)
= 6.0815
= 6 (rounded off to nearest integer)

This estimate is NOT reliable, as neither the value of r nor the scatter plot indicates a strong
linear correlation between the practical and theoretical test marks.
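
The values above come from a graphing calculator. As a cross-check, here is a minimal Python sketch that recomputes both regression lines and r directly from the table. The helper name `least_squares` is made up for this illustration and is not part of the tutorial.

```python
from math import sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

x = [8, 6, 10, 8, 5, 6, 8, 10, 7, 7]   # practical test marks
y = [6, 7, 8, 6, 7, 4, 9, 10, 5, 8]    # theoretical test marks

a, b, r = least_squares(x, y)   # y on x: y = a + b x
c, d, _ = least_squares(y, x)   # x on y: x = c + d y
estimate = a + b * 6            # estimated theoretical mark when x = 6

print(f"y on x: y = {a:.4f} + {b:.5f} x")
print(f"x on y: x = {c:.4f} + {d:.4f} y")
print(f"r = {r:.6f}, estimate at x = 6: {estimate:.4f}")
```

Working from unrounded coefficients, the intercept of the y on x line is 2.4082 to 5 s.f., so the last digit of intermediate values may differ slightly from the calculator working shown above. The rounded answers are unaffected.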

Practice Questions

1. An experiment with certain swimming animals was carried out in order to investigate how the
speed at which they swam depended on the angle through which their hind feet moved. The
angle θ degrees through which the hind feet moved was measured, together with the
swimming speed $v\ \mathrm{m\,s^{-1}}$. The results are given in the table.

θ   87    92    96    97    98    101   110   114   115   115   116   123   133
v   0.35  0.30  0.50  0.40  0.25  0.45  0.60  0.55  0.55  0.65  0.50  0.70  0.75

(i) State, giving a reason, which of the least squares regression lines, θ on v or v on θ,
should be used to express a possible linear relation between v and θ.

(ii) Calculate the equation of the line chosen in part (i), giving the values of the coefficients
to a suitable degree of accuracy.

(iii) Interpret, in context, the value of the gradient of the regression line in (ii). By
considering the value of the v-intercept of the regression line, comment on the
suitability of a linear model for the relationship between v and θ, for values of θ
beyond the given data range.

(iv) Find the product moment correlation coefficient for this set of data. If the swimming
speeds were inaccurately measured and each measurement of v is to increase by 0.05,
what is the effect on the product moment correlation coefficient? Justify your answer.

Solution:

(i) The least squares regression line v on θ should be used, as θ is the independent variable and v
is the dependent variable.

(ii) $v = 0.00994656\,\theta - 0.565027$
     $v = 0.00995\,\theta - 0.565$ (to 3 s.f.)

(iii) For every increase of one degree in the angle through which an animal's hind feet move,
the animal's swimming speed increases by approximately 0.00995 $\mathrm{m\,s^{-1}}$.

When the angle through which the hind feet moved is zero, the model gives a swimming speed
of approximately –0.565 $\mathrm{m\,s^{-1}}$. Hence a linear model is not suitable for the
relationship between v and θ for values of θ beyond the given data range, because speed
cannot be negative.

(iv) r = 0.878 (3 s.f.)

There is no change to the value of r if each value of v is to increase by 0.05, as the strength of
linear relationship is preserved after translation of the data.
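
The invariance of r under translation can be checked numerically. The sketch below recomputes r from the table, then again after adding 0.05 to every v. The function name `pmcc` is chosen for this illustration only.

```python
from math import sqrt

def pmcc(u, w):
    """Product moment correlation coefficient of the paired lists u and w."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    return s_uw / sqrt(s_uu * s_ww)

theta = [87, 92, 96, 97, 98, 101, 110, 114, 115, 115, 116, 123, 133]
v = [0.35, 0.30, 0.50, 0.40, 0.25, 0.45, 0.60, 0.55, 0.55, 0.65, 0.50, 0.70, 0.75]

r_original = pmcc(theta, v)
# Translate every speed by 0.05; the deviations from the mean are unchanged.
r_shifted = pmcc(theta, [vi + 0.05 for vi in v])

print(f"r before shift: {r_original:.4f}")
print(f"r after shift:  {r_shifted:.4f}")
```

Adding a constant to every v shifts the mean of v by the same constant, so each deviation from the mean, and hence r, stays the same up to floating-point rounding.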

2. A random sample of eight pairs of values of x and y are given in the table below.

i     1    2    3    4    5    6    7    8
x_i   10   11   12   11   17   14   19   x_8
y_i   9    8    7    6    5    4    1    y_8

(i) It is given that the regression lines y on x and x on y for this set of data have equations

$y = -\frac{7}{10}x + \frac{151}{10}$ and $x = -\frac{7}{6}y + 20$

respectively. Find the values of $x_8$ and $y_8$.

(ii) Let $Y_i$ be the value obtained by substituting $x_i$ into the equation of the regression line of
y on x, for i = 1, 2, …, 8, i.e. $Y_1 = -\frac{7}{10}x_1 + \frac{151}{10}$,
$Y_2 = -\frac{7}{10}x_2 + \frac{151}{10}$, .... Find the value of

$\sum_{i=1}^{8} (y_i - Y_i)^2$.

(iii) Hence state an inequality that must be satisfied by $\sum_{i=1}^{8} (y_i - a - bx_i)^2$ for any
real constants a and b. Justify your answer clearly.

[Possible Extension Question:

Find the least value of $\sum_{i=1}^{8} (x_i - c - dy_i)^2$ for any real constants c and d.]

Solution:

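(i) $\bar{x} = \frac{10 + 11 + 12 + 11 + 17 + 14 + 19 + x_8}{8} = \frac{94 + x_8}{8}$;

$\bar{y} = \frac{9 + 8 + 7 + 6 + 5 + 4 + 1 + y_8}{8} = \frac{40 + y_8}{8}$

Since $(\bar{x}, \bar{y})$ lies on both the regression lines of y on x and x on y,

$\bar{y} = -\frac{7}{10}\bar{x} + \frac{151}{10}$ and $\bar{x} = -\frac{7}{6}\bar{y} + 20$.

Solving simultaneously, $\bar{x} = 13$, $\bar{y} = 6$.

$\frac{94 + x_8}{8} = 13$, $\frac{40 + y_8}{8} = 6$
$x_8 = 10$, $y_8 = 8$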

(ii) The graphing calculator can be used to evaluate $\sum_{i=1}^{8} (y_i - Y_i)^2$ quickly for this
part of the question, using its "sum" command.

[Calculator screenshots not reproduced.]

Therefore $\sum_{i=1}^{8} (y_i - Y_i)^2 = 8.8$.
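
For readers without the calculator screenshots, parts (i) and (ii) can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic, with variable names chosen for illustration.

```python
# The seven known data pairs from the table.
x = [10, 11, 12, 11, 17, 14, 19]
y = [9, 8, 7, 6, 5, 4, 1]

# (i) The two regression lines intersect at (x_bar, y_bar) = (13, 6),
# so the missing values follow from the sample means.
x_bar, y_bar = 13, 6
x8 = 8 * x_bar - sum(x)
y8 = 8 * y_bar - sum(y)
x.append(x8)
y.append(y8)

# (ii) Fitted values from the y on x line, Y = -7/10 x + 151/10,
# and the sum of squared y-errors.
fitted = [15.1 - 0.7 * xi for xi in x]
sse = sum((yi - Yi) ** 2 for yi, Yi in zip(y, fitted))

print(f"x8 = {x8}, y8 = {y8}")
print(f"sum of squared y-errors = {sse:.4f}")
```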

(iii) The least squares regression line of y on x (as its name suggests) minimises the sum of the
squares of the deviations of each value of y in the data set from the value of y given by
the regression line at the same value of x (i.e. the y-errors). Hence the sum of squares of the
deviations for the same data set, measured with respect to any other straight line y = a + bx,
must be larger. Therefore for any real constants a and b,

$\sum_{i=1}^{8} (y_i - a - bx_i)^2 \ge 8.8$

and equality is achieved if and only if $a = \frac{151}{10}$ and $b = -\frac{7}{10}$, i.e. when
y = a + bx is the equation of the regression line of y on x itself.
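
The inequality can be spot-checked numerically. The sketch below evaluates the sum of squared y-errors for 1000 randomly perturbed lines y = a + bx and confirms that none does better than the regression line. The perturbation ranges and the seed are arbitrary choices made for this illustration, and the check illustrates the inequality without proving it.

```python
import random

x = [10, 11, 12, 11, 17, 14, 19, 10]
y = [9, 8, 7, 6, 5, 4, 1, 8]

def sse(a, b):
    """Sum of squared y-errors of the data about the line y = a + b x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

best = sse(15.1, -0.7)   # the regression line of y on x

random.seed(0)           # fixed seed so the run is repeatable
trials = [
    (15.1 + random.uniform(-2, 2), -0.7 + random.uniform(-0.5, 0.5))
    for _ in range(1000)
]
never_smaller = all(sse(a, b) >= best for a, b in trials)

print(f"minimum sum of squares: {best:.4f}")
print(f"no perturbed line does better: {never_smaller}")
```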

3. With the implementation of a new bus fare system, Jasmine wanted to find out how the bus
fares were decided for different bus journeys. She identified 12 common locations and used a
map to measure the straight line distance, x km, of each location from her home. She also
measured the road distance, y km, of each location from her home and the corresponding bus
fare, s cents. The data are shown below.

Location   A     B     C     D     E     F     G     H     I     J     K     L
x          7.7   3.0   24.1  13.2  9.3   9.0   10.4  3.5   17.6  4.5   2.0   2.5
y          8.8   3.3   28.0  16.1  9.4   8.9   12.5  15.8  22.5  5.0   2.2   2.8
s          121   81    181   149   125   121   137   149   173   91    71    71

(i) By considering the values of x and y, explain why Location F should be omitted from
any further analysis. State, with a reason, another location that should be omitted.

Omit the data for the two locations in part (i).

(ii) Use a suitable regression line to give an estimate of the straight line distance when the
road distance is 20.0 km.

(iii) Draw a scatter diagram of s against y. State, with a reason, which of the following
models is more appropriate to describe the relationship between y and s:

Model I: $s = a + by^2$,
Model II: $s = a + b \ln y$

(iv) Using the more appropriate model found in part (iii), calculate the equation of the
corresponding regression line.

(v) Estimate the road distance travelled if the bus fare is 170 cents. Comment on the
reliability of this estimate. [HCI/2010/Prelims/II/Q12 (modified)]

Solution:

(i) Location F should be omitted as the road distance cannot be smaller than the straight line
distance, indicating that it is an incorrect data entry.

From the scatter diagram, another location that should be omitted is location H.
(ii) The suitable regression line is the regression line of x on y: $x = 0.393655 + 0.817029y$
When y = 20.0, $x = 0.393655 + 0.817029(20.0) = 16.734$
Hence the estimated straight line distance is 16.7 km (to 3 s.f.).
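
As a check on part (ii), the x on y line can be recomputed from the table with locations F and H left out. This is a minimal sketch, and the dictionary layout and names are assumptions made for this illustration.

```python
# (x, y) = (straight line distance, road distance) in km for each location.
locations = {
    "A": (7.7, 8.8), "B": (3.0, 3.3), "C": (24.1, 28.0), "D": (13.2, 16.1),
    "E": (9.3, 9.4), "F": (9.0, 8.9), "G": (10.4, 12.5), "H": (3.5, 15.8),
    "I": (17.6, 22.5), "J": (4.5, 5.0), "K": (2.0, 2.2), "L": (2.5, 2.8),
}
kept = [pair for name, pair in locations.items() if name not in ("F", "H")]
x = [p[0] for p in kept]
y = [p[1] for p in kept]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

d = s_xy / s_yy            # gradient of the x on y line
c = x_bar - d * y_bar      # intercept of the x on y line
x_at_20 = c + d * 20.0     # estimated straight line distance when y = 20.0

print(f"x = {c:.6f} + {d:.6f} y")
print(f"estimate at y = 20.0: {x_at_20:.3f} km")
```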

(iii)

Graphical perspective:
For model I, the turning point (or stationary point) of the quadratic model must be observed to be
close to the s-axis, which is not satisfied by the trend exhibited in this scatter plot.
Since the points appear to follow a curve that is increasing at a decreasing rate (with
respect to y), a logarithmic model such as model II will be more suitable in this case.

Contextual perspective:
Based on the scatter diagram, a quadratic model would mean that after the turning point, the
bus fare will start to decrease as the road distances increase, which does not make sense. The
logarithmic model is more sensible in this aspect, as it is increasing over all values of y.

(iv) The appropriate regression line, of s on ln y, is

$s = 25.9500 + 45.2443 \ln y$
$s = 26.0 + 45.2 \ln y$ (to 3 s.f.)

(v) Since s depends on y, the regression line of s on ln y is used. When s = 170,

$170 = 25.9500 + 45.2443 \ln y$
$\ln y = \frac{170 - 25.95}{45.2443}$
$y = e^{\frac{170 - 25.95}{45.2443}} = 24.1389$

Hence the estimated road distance is 24.1 km (to 3 s.f.).

Since
- 170 lies within the range of values of s, [71, 181], and
- r = 0.992, which has an absolute value close to 1, suggesting a strong linear relationship
  between s and ln y,
the estimate obtained is reliable.
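
Parts (iv) and (v) can be reproduced by linearising the model: regress s on ln y, then invert the fitted line at s = 170. The sketch below does this for the ten locations that remain after omitting F and H. The helper name `least_squares` is hypothetical.

```python
from math import exp, log, sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

# Road distance y (km) and fare s (cents), locations F and H omitted.
y = [8.8, 3.3, 28.0, 16.1, 9.4, 12.5, 22.5, 5.0, 2.2, 2.8]
s = [121, 81, 181, 149, 125, 137, 173, 91, 71, 71]

# Model II: s = a + b ln y, a straight line in the transformed variable ln y.
a, b, r = least_squares([log(yi) for yi in y], s)

# Invert the fitted line to estimate y when s = 170.
y_at_170 = exp((170 - a) / b)

print(f"s = {a:.4f} + {b:.4f} ln y, r = {r:.3f}")
print(f"estimated road distance at s = 170: {y_at_170:.4f} km")
```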

4. The table below shows the maximum temperature and the sale of cold soft drinks between
1130 hrs to 1430 hrs by a shop in a Central Business District for nine Tuesdays.
Temperature, t (°C)   29.4  30.5  36.6  31.1  32.5  33.4  33.8  34.8  35.1
Daily sales, s ($)    100   170   64    186   220   236   244   252   254
(i) Draw the scatter diagram for these values, labelling the axes clearly.

(ii) Identify a pair of values of s and t which should be regarded as an outlier. Give a
possible reason for the occurrence for this pair of data.

(iii) Omitting the outlier, find, correct to 4 decimal places, the value of the product moment
correlation coefficient between
(a) t and s, and
(b) $\frac{1}{t}$ and s.

(iv) Use your answers to parts (i) and (iii) to explain which of $s = a + bt$ or $s = c + \frac{d}{t}$
is the better model, and find the equation of the appropriate regression line for the better
model. [NJC/2011/Promos/Q10(b) (modified)]

Solution:

(i)

(ii) (36.6, 64) is an outlier.


Possible reasons:
- The customers stayed in the office as it was too warm.
- The particular Tuesday is a public holiday.

(iii) (a) For s on t, r = 0.9502698149 = 0.9503 (4 d.p.)

(b) For s on $\frac{1}{t}$, r = –0.963485031141555 = –0.9635 (4 d.p.)

(iv) As the r-value for s on $\frac{1}{t}$ has an absolute value which is closer to 1, and also the
scatter plot appears to follow a curve that is increasing and eventually plateauing close to a
horizontal line, the model $s = c + \frac{d}{t}$ is the better model. From GC,
$s = 999.59 - \frac{25701}{t}$.
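
The comparison of the two models can be reproduced by computing r for (t, s) and for (1/t, s) with the outlier removed. The sketch below does this and then fits the reciprocal model. The helper name `least_squares` is made up for this illustration.

```python
from math import sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

# Data with the outlier (36.6, 64) omitted.
t = [29.4, 30.5, 31.1, 32.5, 33.4, 33.8, 34.8, 35.1]
s = [100, 170, 186, 220, 236, 244, 252, 254]

_, _, r_linear = least_squares(t, s)                       # s = a + b t
c, d, r_reciprocal = least_squares([1 / ti for ti in t], s)  # s = c + d / t

print(f"r for s on t:   {r_linear:.4f}")
print(f"r for s on 1/t: {r_reciprocal:.4f}")
print(f"s = {c:.2f} + ({d:.0f}) / t")
```

The negative sign of r for s on 1/t is expected: as t increases, 1/t decreases while s increases.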
t t

5. The table below gives the values of a set of bivariate data comprising ten observations of x
and y.
x   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
y   13.1  12.9  12.5  11.7  10.9  9.9   8.9   7.7   6.0   4.1

(i) Sketch the scatter diagram and determine the value of the product moment correlation
coefficient between y and x.

(ii) Determine which of the following is the best model for this set of data, justifying your
choice clearly.

(A) $y = ax + b$ (B) $y = cx^2 + d$ (C) $y = e\sqrt{x} + f$

(iii) Find the equation of the least-squares regression line of your selected best model in part
(ii). Use your equation to estimate the value of y when x = 3.8. Comment on the
reliability of the estimation. [NJC/2012/Prelims/II/Q6]

Solution:

(i)

r-value = −0.9728201266 = −0.973 (3 s.f.)

(ii) (B) is the best model because


EITHER: y decreases at an increasing rate as x increases.
OR:
(A) $y = ax + b$: r-value = −0.973
(B) $y = cx^2 + d$: r-value = −0.999
(C) $y = e\sqrt{x} + f$: r-value = −0.930
Out of the 3 models, the r-value for model (B) has an absolute value that is closest to 1. Hence
(B) is the best model for this set of data.

(iii) Regression line of y on $x^2$ is $y = -0.358974x^2 + 13.2251$, i.e. $y = -0.359x^2 + 13.2$ (to 3 s.f.)

When x = 3.8, $y = -0.358974(3.8)^2 + 13.2251 = 8.0415 = 8.04$ (to 3 s.f.)

Since
- x = 3.8 is within the range of values of x, [0.5, 5.0], and
- r (for this model) = –0.999, which has an absolute value close to 1, indicating a strong
  negative linear correlation between y and $x^2$,
the estimate is reliable.
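
The three r-values and the estimate at x = 3.8 can be reproduced by regressing y on each transformed variable in turn. The following is a minimal sketch, with `least_squares` as a hypothetical helper name.

```python
from math import sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [13.1, 12.9, 12.5, 11.7, 10.9, 9.9, 8.9, 7.7, 6.0, 4.1]

# Each model is linear in a transformed version of x.
r_A = least_squares(x, y)[2]                        # (A) y against x
d, c, r_B = least_squares([xi ** 2 for xi in x], y)  # (B) y = c x^2 + d
r_C = least_squares([sqrt(xi) for xi in x], y)[2]   # (C) y against sqrt(x)

y_at_3_8 = c * 3.8 ** 2 + d   # estimate from model (B)

print(f"r: (A) {r_A:.3f}, (B) {r_B:.3f}, (C) {r_C:.3f}")
print(f"model (B): y = {c:.6f} x^2 + {d:.4f}, estimate at x = 3.8: {y_at_3_8:.4f}")
```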

6. A medical officer wishes to investigate a patient’s walking speed s km/h and his heart-beat
rate h beats per minute (bpm). The data is shown below:

s   1    1.5   2    2.5   3    3.5   4     4.5   5
h   60   63    66   75    86   99    150   110   130

(i) Sketch a scatter plot of the above data.

(ii) One of the values of h appears to be incorrect. Indicate the corresponding point on your
diagram by labelling it P.

Omit P for the remainder of this question.

(iii) Calculate the product moment correlation coefficient for this set of data. Use the
equation of an appropriate regression line to predict the value of s when h = 100,
justifying your choice of regression line.

It is suggested to use one of the following two models instead:

Model (I): $h = a + bs^2$,
Model (II): $h = a + be^s$,

where a and b are real constants.

(iv) Determine which of the two models is a better choice, giving a reason for your answer.

(v) Suppose a new data pair $(\bar{s}, \bar{h})$ is added to the table above, where $\bar{s}$ and
$\bar{h}$ are the patient's sample mean walking speed (in km/h) and his sample mean heart-beat
rate (in bpm) respectively, based on the data above. Without any calculations, explain whether
the equation of the regression line you have obtained in part (iii) would change.
[NJC/2015/Prelims/II/Q12]

Solution:

(i), (ii)

(iii) r-value = 0.981 (to 3 s.f.)
We use the regression line of h on s, since h is dependent on s. Its equation is
h = 35.851 + 17.486s.
When h = 100, s = 3.669 = 3.67 (to 3 s.f.)

(iv) For model (I), r-value = 0.990 (to 3 s.f.)
For model (II), r-value = 0.938 (to 3 s.f.)
Hence model (I) is more suitable since the absolute value of r is nearer to 1, which suggests a
stronger linear relationship between h and $s^2$ than that between h and $e^s$.

(v) The regression line passes through $(\bar{s}, \bar{h})$, so the addition of the data point
$(\bar{s}, \bar{h})$ does not increase the sum of squares of the h-errors between the current
regression line and the set of data points. Hence the h-errors remain minimised with the same
line, and the equation of the regression line would not change.
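
Parts (iii) to (v) can be verified numerically, including the claim that adding the mean point leaves the line unchanged. The sketch below uses the data with P = (4, 150) omitted. The helper name `least_squares` is an assumption made for this illustration.

```python
from math import exp, sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

# Data with the suspect point P = (4, 150) omitted.
s = [1, 1.5, 2, 2.5, 3, 3.5, 4.5, 5]
h = [60, 63, 66, 75, 86, 99, 110, 130]

# (iii) h on s line, then solve for s when h = 100.
a, b, r_linear = least_squares(s, h)
s_at_100 = (100 - a) / b

# (iv) r for each suggested model, using the transformed variable.
r_model_1 = least_squares([si ** 2 for si in s], h)[2]   # h = a + b s^2
r_model_2 = least_squares([exp(si) for si in s], h)[2]   # h = a + b e^s

# (v) Append the mean point and refit; the line should be unchanged.
s_bar, h_bar = sum(s) / len(s), sum(h) / len(h)
a_new, b_new, _ = least_squares(s + [s_bar], h + [h_bar])

print(f"h = {a:.3f} + {b:.3f} s, r = {r_linear:.3f}, s at h = 100: {s_at_100:.3f}")
print(f"r model (I): {r_model_1:.3f}, r model (II): {r_model_2:.3f}")
print(f"after adding the mean point: h = {a_new:.3f} + {b_new:.3f} s")
```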

7. A scientist wishes to investigate the rate at which mould grows on a slice of expired bread. He
conducts an experiment to measure the area covered by mould on a slice of expired bread
over a span of 2 weeks and records his findings in the table below.

Day t                                0     2    6    10   13
Area covered by mould, x (in cm²)    1.5   18   75   94   99

(i) Calculate, correct to 4 decimal places, the product moment correlation coefficient for
this set of data.

(ii) Explain why the value you have obtained in part (i) does not necessarily imply that a
linear model is suitable for this set of data.

After carrying out some work, the scientist theorises that a model of the form

$\ln\left(\frac{A}{x} - 1\right) = a + bt$,

for some real constants A, a and b, may be a good fit for this set of data. He tests his theory by
calculating the product moment correlation coefficient (denoted by r) between t and
$\ln\left(\frac{A}{x} - 1\right)$ for a few possible values of A, and records his findings in the
table below.

A   100         101         102
r   –0.983563   –0.975623   –0.969018

(iii) Calculate the value of r for A = 100, giving your answer correct to 6 decimal places.

(iv) Which of 100, 101, and 102 is the most appropriate value of A? Justify your answer.

(v) Using the most appropriate value of A in part (iv), find the values of a and b, and use
these values to estimate the least number of complete days needed for the mould to
cover an area of 50 cm².

(vi) Suggest what the value of A represents in the context of this question.

Solution:

(i) r = 0.9592 (to 4 d.p.)

(ii) Even though the absolute value of r is close to 1, a linear relationship would suggest that the
area covered by the mould can grow indefinitely large as time passes, which is impossible as
the slice of bread has a finite area.

(iii) r = –0.983563 (to 6 d.p.)

(iv) A = 100 is the most appropriate since the absolute value of r is closest to 1, out of those for
the 3 given values of A.

(v) a = 3.36839, b = –0.631816

Hence the regression equation is $\ln\left(\frac{100}{x} - 1\right) = 3.36839 - 0.631816t$.

When x = 50, $\ln\left(\frac{100}{50} - 1\right) = 3.36839 - 0.631816t$
$0 = 3.36839 - 0.631816t$
$t = \frac{3.36839}{0.631816} = 5.3313$

Therefore at least 6 days are needed for the mould to cover an area of 50 cm².
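
Parts (iii) to (v) can be reproduced by transforming x for each candidate A, computing r against t, and fitting the line for the best A. This is a sketch only, and the names `least_squares` and `transformed` are made up for this illustration.

```python
from math import ceil, log, sqrt

def least_squares(u, w):
    """Regression line of w on u: returns (intercept, slope, r)."""
    n = len(u)
    u_bar, w_bar = sum(u) / n, sum(w) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_ww = sum((wi - w_bar) ** 2 for wi in w)
    s_uw = sum((ui - u_bar) * (wi - w_bar) for ui, wi in zip(u, w))
    slope = s_uw / s_uu
    return w_bar - slope * u_bar, slope, s_uw / sqrt(s_uu * s_ww)

t = [0, 2, 6, 10, 13]
x = [1.5, 18, 75, 94, 99]

def transformed(A):
    """Values of ln(A/x - 1) for each observed area x."""
    return [log(A / xi - 1) for xi in x]

# r between t and ln(A/x - 1) for each candidate A.
r_values = {A: least_squares(t, transformed(A))[2] for A in (100, 101, 102)}
best_A = max(r_values, key=lambda A: abs(r_values[A]))

# Fit ln(A/x - 1) = a + b t for the best A, then solve for t when x = 50.
a, b, _ = least_squares(t, transformed(best_A))
t_at_50 = (log(best_A / 50 - 1) - a) / b
days = ceil(t_at_50)   # least number of complete days

for A, r in r_values.items():
    print(f"A = {A}: r = {r:.6f}")
print(f"a = {a:.5f}, b = {b:.6f}, t = {t_at_50:.4f}, days = {days}")
```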

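(vi) $\ln\left(\frac{A}{x} - 1\right) = a + bt \;\Rightarrow\; \frac{A}{x} - 1 = e^{a + bt}$
$\frac{A}{x} = 1 + e^{a + bt}$
$x = \frac{A}{1 + e^{a + bt}}$

As $t \to \infty$, $a + bt \to -\infty$ (since b < 0), so
$x = \frac{A}{1 + e^{a + bt}} \to \frac{A}{1 + 0} = A$.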

In other words, the value of A represents the long-term maximum area covered by the
mould.
