
C R Lect Notes

Uploaded by

lbwnb.68868
Copyright
© © All Rights Reserved
National Junior College Mathematics Department 2016

National Junior College


2015 – 2016 H2 Mathematics
Correlation and Regression (Lecture Notes)

Topic 22: Correlation & Regression

Key Questions to Answer:

What is bivariate data?


o What is meant by an independent and a dependent variable?
How do we plot a scatter diagram?
o How do we determine if there is a linear relationship between the two variables
from the scatter diagram?
What does the product moment correlation coefficient, r, measure?
o How do we calculate the product moment correlation coefficient for a given
set of bivariate data?
o How do we relate the value of the product moment correlation coefficient (in
particular, values close to –1, 0 and 1) to the appearance of the scatter
diagram?
o Does zero correlation necessarily mean that there is no relationship between
the two variables?
o Does a high correlation between two variables imply that one directly causes
the other?
What is meant by linear regression?
o What is a least squares regression line and how does it relate to the scatter
diagram?
o How do we determine the equation of a least squares regression line for a
given set of bivariate data?
o How do we interpret the values of the slope and intercept of a least squares
regression line in a practical situation?
How do we use a regression line to perform prediction or estimation of a value in a
practical situation?
o How is the choice of regression line used to perform estimation affected by
   - the existence of a dependence relationship between the two variables?
   - which variable we are estimating (given the value of the other variable)?
o How is the reliability of the estimate affected by
   - the strength of the linear relationship between the two variables (observed
     from the value of r and/or the scatter diagram)?
   - whether the given value (to input) falls inside or outside the data range
     for the variable (i.e. whether it is obtained through interpolation or
     extrapolation)?
How do we perform linearisation on a set of bivariate data to fit a non-linear model?
o Given different models (linear and/or non-linear), how do we determine which
model fits the data the best?

2015 – 2016 / H2 Maths / Correlation and Regression Page 1 of 27



§1 Introduction

1.1 Bivariate Data and Scatter Diagrams

A set of data comprising the values of two variables, say x and y, obtained from the
same sample, is known as bivariate data.

Some examples of bivariate data include the following:


(a) Amount of advertising time for a product and number of sales for that
product
(b) Heights of persons and their ages
(c) Mathematics test scores and English test scores

When a set of bivariate data is plotted in the Cartesian plane, a scatter diagram (or
scatter plot) is produced. Some examples of scatter diagrams are as follows:

Figure 1.1. Scatter diagram for the Senior High 2 Lecture Test percentage scores (y) for
a class against their Senior High 1 Promotional Examination percentage scores (x)

1.2 Independent & Dependent Variables

In a set of bivariate data, one of the two variables may be affected or influenced by
the value of the other variable, which is controlled. In this case, the variable whose
values have been controlled is called the independent variable, while the other
variable is called the dependent variable.


For example, in the set of data on the amount of advertising time for a product and
number of sales for that product, the amount of advertising time is the independent
variable, while the number of sales for that product is the dependent variable.

On the other hand, in a set of data comprising the heights and intelligence quotients
(IQ) of a group of people, neither variable is likely to depend on the other.

Example 1.1.

In a city, the number of outlets for a particular café, x, and the number of car
accidents, y, are recorded over a period of time. The set of data obtained is given as
follows.

x 25 45 60 75 90
y 88 72 57 44 23

(i) Sketch a scatter diagram for this set of data.


(ii) Referring to the scatter diagram, describe a possible relationship between the
two variables x and y.
(iii) Explain whether the relationship you have observed in part (ii) suggests that
one variable directly causes the other variable.
(iv) If x and y represent the following variables instead, state whether or not one
variable depends on the other, and identify the independent and dependent
variables when that happens.
(a) Time passed (x), concentration of a substance in a solution (y)
(b) Air temperature (x), Wind speed (y)
(c) Mathematics Test scores (x), English Test scores (y)

Solution.

(i) The graphic calculator (GC) can be used as a tool to sketch the scatter diagram,
as described in the following procedure:

No.  Keys to Press/Steps (key icons shown as [key]; screenshots omitted)
1.   Press [key].
2.   Press [key]. Key in the x and y values into columns L1 and L2 respectively.
3.   Exit to the main screen. Then press [key].
4.   Adjust the settings accordingly.
5.   Press [key]. Then press [key] (for “ZoomStat”).


(ii) As x increases, y decreases at a roughly constant rate, i.e. there is a negative
linear relationship between x and y.

(iii) No. The decrease in the number of car accidents could be due to a recent
campaign on road safety, while the increase in the number of outlets of the
café could be simply due to the café expanding its business at the same time.
It is not likely that increase in the number of outlets of the cafe has caused a
decrease in the number of car accidents.

(iv) (a) x is the independent variable; y is the dependent variable.


(b) y is the independent variable; x is the dependent variable.
(c) It is not evident whether x depends on y or vice versa.

Notes:

Care and caution are needed when interpreting a scatter diagram.

1. While there may appear to be a mathematical relationship between the two
variables, it does not mean that there is a relationship in reality.

2. The appearance of a mathematical relationship does not imply that there is a
causal relationship. An increase in one variable does not necessarily cause an
increase (or decrease) in the other variable.

Further qualitative analysis is needed to ascertain the true effect of one variable on
the other.

§2 Linear Regression

As in Example 1.1, we may be interested in finding a mathematical relationship
between the variables in the form y = f(x), so that we can estimate or predict the
value of y given a value of x which does not appear in the set of bivariate data, or
vice versa.

If it appears from the scatter diagram that a linear relationship is a sensible
interpretation, we may then attempt to find a model for the relationship in the form
of a regression line, i.e. f(x) = a + bx for some real constants a and b.


This is akin to finding a “best-fit” line by eye, where a line is drawn on the scatter
diagram such that there are roughly as many points above the line as below it (or as
many points to the left of the line as to the right of it). However, the choice of line
is subject to each individual's personal judgement of “closeness”. Hence, there can be
many possible “best-fit” lines for the same set of bivariate data.

In the following sub-section, we discuss how to work out the equation of a
regression line mathematically.

2.1 The Method of Least Squares

Consider the scatter diagram in Example 1.1. To find a regression line
mathematically, consider the vertical distances $e_1, e_2, e_3, e_4$ and $e_5$ drawn
from each point to a “best-fit” line which we have drawn for the data, as shown below.

Note that the values of $e_1, e_2, e_3, e_4$ and $e_5$ represent the “y-errors”, i.e.
the errors in using the chosen line to estimate the values of y for the values of x
given in the data set.

Logically speaking, the values of the errors should be as small as possible if the line
chosen is indeed the best-fit line. Hence we aim to minimise the values of the errors
when calculating the equation of the regression line.

However, since these errors will be positive or negative according to whether the
points are above or below the line, we work with the squares of these values instead
and consider their sum,

$$\sum e_k^2 = e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2.$$

By minimising the sum $\sum e_k^2$, one will obtain the least squares regression line
of y on x.

In contrast, if we were instead to consider the horizontal distances
$d_1, d_2, d_3, d_4$ and $d_5$ drawn from each point to a best-fit line, and minimise
the values of the “x-errors” by minimising the sum of squares

$$\sum d_k^2 = d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2,$$

we would obtain the least squares regression line of x on y.

For the purpose of the H2 Maths syllabus, you are not required to find the
equations of the regression lines analytically. However, you will need to know
how to use the graphic calculator to obtain the equations of the regression lines,
which is illustrated in the next example.
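As an illustration of the least squares idea, the following Python sketch computes the sum of squared y-errors for the data of Example 1.1 and checks that a few nearby lines do worse than the least squares line. The function names are made up for this sketch, and the slope formula b = S_xy / S_xx is the working formula derived in Appendix B; the syllabus itself only requires the GC.

```python
# Data from Example 1.1 (x: number of cafe outlets, y: number of car accidents).
xs = [25, 45, 60, 75, 90]
ys = [88, 72, 57, 44, 23]


def sum_squared_y_errors(a, b, xs, ys):
    """Sum of e_k^2, where e_k = y_k - (a + b * x_k) is the vertical error."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_y_on_x(xs, ys):
    """Intercept a and slope b of the y on x line, using b = S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_y_on_x(xs, ys)
best = sum_squared_y_errors(a, b, xs, ys)

# Shifting or tilting the line slightly should never reduce the sum of squares.
for da, db in [(1, 0), (-1, 0), (0, 0.05), (0, -0.05), (2, -0.03)]:
    assert sum_squared_y_errors(a + da, b + db, xs, ys) >= best

print(f"y = {a:.4f} + ({b:.4f})x, sum of squared y-errors = {best:.4f}")
```

Swapping the roles of `xs` and `ys` in `least_squares_y_on_x` gives the x on y line, since that line minimises the horizontal errors instead.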

Example 2.1.

Consider the following set of bivariate data.

x 25 50 60 80 90
y 80 90 50 44 10

(i) Find the equations of the regression lines of y on x and x on y, and sketch both
lines in a single scatter diagram.

(ii) Find the coordinates of the point of intersection between both lines found in
part (i). How do the values compare to the sample means of x and y in the set
of data?

Suppose the variables x and y are such that neither depends on the other.

(iii) Using a suitable regression line, estimate the value of


(a) y, when x = 70, and
(b) x, when y = 50.
Justify your choice of regression line for each of parts (iii)(a) and (iii)(b).

Suppose instead that x represents air temperature and y represents wind speed.

(iv) Will the choice of regression lines in parts (iii)(a) and (iii)(b) change? Why
or why not?

(v) Interpret the slope of the x on y line in the context of this question.


Solution.

(i) To find the equation of the regression line of y on x:

After entering the data into the GC (as illustrated in Example 1.1),

No.  Keys to Press/Steps (key icons shown as [key]; screenshots omitted)
1.   Press [key] to enter the ‘CALC’ sub-menu.
2.   Press [key] to select ‘8: LinReg(a + bx)’ and enter L1 (x) and L2 (y) as the
     ‘Xlist’ and ‘Ylist’ respectively.
3.   Press [key], then [key] to calculate the equation of the regression line of
     y on x.

Hence the equation of the regression line of y on x is y = 119.85 – 1.0664x.


To find the equation of the regression line of x on y:

After entering the data into the GC (as illustrated in Example 1.1),

No.  Keys to Press/Steps (key icons shown as [key]; screenshots omitted)
1.   Press [key] to enter the ‘CALC’ sub-menu.
2.   Press [key] to select ‘8: LinReg(a + bx)’ and enter L2 (y) and L1 (x) as the
     ‘Xlist’ and ‘Ylist’ respectively.
3.   Press [key], then [key] to calculate the equation of the regression line of
     x on y.

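Hence the equation of the regression line of x on y is x = 99.080 – 0.69489y.

To sketch this line on the same axes, make y the subject:

$$y = \frac{x - 99.080}{-0.69489}, \quad \text{i.e.} \quad y = 142.58 - 1.4391x.$$

Sketching using the GC (with the scatter plot turned on as illustrated in
Example 1.1),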


(ii) Rearranging the equations of the two regression lines, we have

1.066412214x + y = 119.851145
x + 0.694885897y = 99.07978512

Using the GC (PolySmlt2 App – see below), the point of intersection between
the y on x and x on y lines has coordinates (61.0, 54.8).

Using the “2-Var Stats” command in the GC (in the “STAT” “CALC” menu –
see below), sample mean of x = 61 and sample mean of y = 54.8, which
coincide with the coordinates of the point of intersection between the two
lines.


(iii) (a) Since we are estimating the value of y, we wish to minimise the y-errors.
Hence the appropriate regression line to use in this case is the y on x
line. Therefore, when x = 70,
Estimated value of y = 119.85 – 1.0664(70)
= 45.202
= 45.2 (to 3 significant figures)

Note: Premature rounding off in the equation of the line will lead to inaccuracy
in estimation, e.g. “Estimated value of y = 120 – 1.07(70) = 45.1 (to 3 s.f.)”.

(iii) (b) Since we are estimating the value of x, we wish to minimise the x-errors.
Hence the appropriate regression line to use in this case is the x on y
line. Therefore, when y = 50,
Estimated value of x = 99.080 – 0.69489(50)
= 64.3355
= 64.3 (to 3 significant figures)
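The effect of premature rounding in part (iii)(a) can be reproduced with a short Python sketch. The coefficients are the GC values quoted in part (ii); the variable names are made up for this illustration.

```python
# y on x coefficients as reported by the GC in part (ii).
a = 119.851145
b = -1.066412214
x_given = 70

# Keep full precision in the coefficients and round only the final answer.
estimate_full = a + b * x_given

# Round the coefficients to 3 s.f. first, as in the cautionary note.
estimate_rounded_early = 120 - 1.07 * x_given

print(f"full precision : {estimate_full:.3g}")
print(f"rounded early  : {estimate_rounded_early:.3g}")
```

The two answers differ in the third significant figure, which is why intermediate values are usually kept to at least 5 significant figures.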

(iv) The choice of line for part (iii)(a) should be changed to the x on y line, while
the choice of line for part (iii)(b) should remain the same. As y is the
controlled variable, there is no error to speak of for y, and therefore the x on
y line should be used to carry out estimation in both parts.

(v) Every unit increase in wind speed (y) will lead to an approximate decrease
of 0.695 units in the air temperature (x).

Notes:

1. The choice of regression line to perform estimation in any general scenario is
summarised in the following table:

Scenario        | Estimate y given value of x | Estimate x given value of y
y depends on x  | Use the y on x line.        | Use the y on x line.
x depends on y  | Use the x on y line.        | Use the x on y line.
no dependence   | Use the y on x line.        | Use the x on y line.
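The decision rule in this table can be written as a small Python function. This is only a sketch: the function name and the string labels are made up for illustration.

```python
def choose_regression_line(dependence, estimate):
    """
    dependence: "y depends on x", "x depends on y" or "no dependence"
    estimate:   "y" or "x" (the variable whose value is being estimated)
    Returns the name of the regression line to use.
    """
    if dependence == "y depends on x":
        return "y on x"          # used for both estimates
    if dependence == "x depends on y":
        return "x on y"          # used for both estimates
    if dependence == "no dependence":
        # Minimise the errors of the variable being estimated.
        return "y on x" if estimate == "y" else "x on y"
    raise ValueError(f"unknown dependence: {dependence!r}")


# Example 2.1 (iii): neither variable depends on the other.
print(choose_regression_line("no dependence", "y"))   # y on x
print(choose_regression_line("no dependence", "x"))   # x on y
# Example 2.1 (iv): x depends on y.
print(choose_regression_line("x depends on y", "y"))  # x on y
```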

2. If the sample means of the two variables are $\bar{x}$ and $\bar{y}$, then the point
$(\bar{x}, \bar{y})$ lies on both the regression lines of y on x and x on y, i.e. the
two lines must intersect at $(\bar{x}, \bar{y})$. (For the proof, refer to the appendix
on the derivation of the working formulae to find the equations of the y on x and
x on y lines.)

§3 The Product Moment Correlation Coefficient

In the previous section, we have discussed how to use regression lines to carry out
estimation/prediction of values for a variable given a set of bivariate data.

However, the validity of performing this procedure to carry out estimation depends
on the assumption that a linear model is a good fit for the data we are given, which
is not always true. For example, the following scatter diagrams illustrate how
different sets of bivariate data may demonstrate different possible relationships (or
equivalently types of correlation) between the two variables.


[Figure: four example scatter diagrams.
Positive correlation: Physics Marks (Y) against Maths Marks (X).
Negative correlation: Temperature (Y °C) against Altitude (X m).
Quadratic relation: Monthly Salary ($Y) against Age (X yrs).
No observable relation: Monthly Salary ($Y) against Weight (X kg).]

While the scatter diagram can show clearly if a linear model is a good fit for the
data, this may not always be a practical approach to determine the degree of linear
correlation between the two variables, especially when we are dealing with a large
set of data. In this case, a possible alternative to using the scatter diagram would be
to use a certain measure called the product moment correlation coefficient
(conventionally denoted by r), which is calculated based on the following formula:

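$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
= \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
\quad \text{(in MF15)}$$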

The value of r gives us an indication of the following:

Direction of correlation: If r is positive (negative), then the two variables
are positively (negatively) correlated, i.e. one generally increases (decreases)
in value as the other increases.

Strength of linear correlation: The closer the absolute value of r, |r|, is to 1,
the stronger the linear correlation, i.e. the better a linear model fits the set
of bivariate data.

Notes:

(1) For any set of data, –1 ≤ r ≤ 1.

(2) If r = 1, then the set of bivariate data demonstrates a perfect positive linear
correlation between the two variables.


If r = –1, then the set of bivariate data demonstrates a perfect negative linear
correlation between the two variables.

If a set of bivariate data demonstrates a perfect linear correlation, then ALL
the points in the scatter diagram are collinear. In this case, the y on x and
x on y lines are the same line, which passes through every single point in
the scatter plot.

The following diagrams illustrate how the two regression lines appear for
different values of r.

[Figure: seven scatter diagrams, each showing the y on x and x on y lines.
Perfect positive linear correlation, r = 1: the y on x and x on y lines coincide.
Strong positive linear correlation, e.g. r = 0.8.
Weak positive linear correlation, e.g. r = 0.4.
No linear correlation, r = 0.
Weak negative linear correlation, e.g. r = –0.4.
Strong negative linear correlation, e.g. r = –0.9.
Perfect negative linear correlation, r = –1: the y on x and x on y lines coincide.]


In summary, the closer the absolute value of r is to 1 i.e. the stronger the linear
relationship demonstrated, the closer the two regression lines are to each
other.

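Note that, with y plotted on the vertical axis, the x on y line is always at least as
STEEP as the y on x line; the two are equally steep only when the lines coincide
(r = ±1).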
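To compare the steepness of the two lines on the same axes, the x on y line x = a' + b'y has to be rewritten with y as the subject, which gives a gradient of 1/b'. The Python sketch below does this for the data of Example 2.1, using the working formulae from Appendix B; the variable names are made up for this illustration.

```python
# Data from Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = s_xy / s_xx        # y on x line: y = a + b x
b_prime = s_xy / s_yy  # x on y line: x = a' + b' y

gradient_y_on_x = b
gradient_x_on_y = 1 / b_prime  # dy/dx of the x on y line on the same axes

print(f"gradient of y on x line: {gradient_y_on_x:.4f}")
print(f"gradient of x on y line: {gradient_x_on_y:.4f}")
print("x on y is at least as steep:", abs(gradient_x_on_y) >= abs(gradient_y_on_x))
```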

(3) r is a dimensionless quantity, i.e. it has NO units, regardless of the units of
each variable.

(4) The value of r depends on the given set of bivariate data and hence it may
change in value if more data pairs are added to the bivariate data (after
conducting more trials of the experiment), or if a data pair that has been
discovered to be an outlier has been removed from the data.

(5) On the other hand, the value of r is unchanged by any linear transformation
of either variable that has a positive scale factor, since the strength and
direction of the linear relationship between two variables are preserved after
any translation or positive scaling. (A reflection, i.e. scaling one variable by
a negative factor, reverses the sign of r but leaves |r| unchanged.) For example,
the value of r for a set of bivariate data for two variables, say temperature and
amount of rainfall, will stay the same whether the temperatures are measured in
Celsius or Fahrenheit, since the conversion from one unit to the other can be
expressed as a linear relationship, $F = \frac{9}{5}C + 32$, where C and F are the
temperatures measured in Celsius and Fahrenheit respectively.
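The effect of a change of units on r can be checked numerically. In the Python sketch below, the temperature and rainfall figures are invented purely for illustration, and `pmcc` is a hypothetical helper that implements the definition of r given earlier in this section.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient from the definition of r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Invented data: temperature in Celsius and rainfall in mm.
celsius = [24.0, 26.5, 28.0, 30.5, 32.0, 33.5]
rainfall_mm = [210, 185, 160, 150, 120, 95]

fahrenheit = [9 / 5 * c + 32 for c in celsius]  # F = (9/5)C + 32
negated = [-c for c in celsius]                 # scaling by a negative factor

r_celsius = pmcc(celsius, rainfall_mm)
r_fahrenheit = pmcc(fahrenheit, rainfall_mm)
r_negated = pmcc(negated, rainfall_mm)

print(f"r with Celsius    : {r_celsius:.6f}")
print(f"r with Fahrenheit : {r_fahrenheit:.6f}")
print(f"r with -Celsius   : {r_negated:.6f}")
```

The first two values agree, while negating one variable reverses the sign of r without changing its size.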

Example 3.1.

A set of bivariate data between two variables x and y is given as follows. Calculate
the product-moment correlation coefficient, r, for this set of data.

x 2 1 3 3 4 4 4
y 1.5 2 3 3 3.5 4 4

Solution:

To find r given complete sample data

No.  Keys to Press/Steps (key icons shown as [key]; screenshots omitted)
1.   Press [key], and turn “STAT DIAGNOSTICS” on.
2.   Press [key].
3.   Press [key]. Key in the x and y values into columns L1 and L2 respectively.
4.   Exit to the main screen. Then press [key] to enter the ‘CALC’ sub-menu.
5.   Press [key] to select ‘8: LinReg(a + bx)’ and enter L1 (x) and L2 (y) as the
     ‘Xlist’ and ‘Ylist’ respectively.
6.   Press [key], then [key]. This time, both the coefficients of the y on x line
     and the value of r will appear.

Thus, r = 0.905 (3 s.f.)

Note: The value of r will appear only if STAT DIAGNOSTICS has been turned
on.

Example 3.2.

A set of bivariate data comprising 7 data pairs for two variables x and y is collected.
The data is summarised as follows.

$$\sum x = 21, \quad \sum x^2 = 71, \quad \sum y = 21, \quad \sum y^2 = 68.5, \quad \sum xy = 69.$$

Calculate the product-moment correlation coefficient, r.

Solution:

To find r given summarised sample data

From the formulae booklet (MF15),

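$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
= \frac{69 - \dfrac{(21)(21)}{7}}{\sqrt{\left(71 - \dfrac{21^2}{7}\right)\left(68.5 - \dfrac{21^2}{7}\right)}}
= 0.905 \text{ (to 3 s.f.)}$$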
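The summary statistics in Example 3.2 happen to be those of the raw data in Example 3.1, so the two examples can be cross-checked. The Python sketch below rebuilds the sums from the raw data and then applies the MF15 formula; the variable names are made up for this illustration.

```python
import math

# Raw data from Example 3.1.
xs = [2, 1, 3, 3, 4, 4, 4]
ys = [1.5, 2, 3, 3, 3.5, 4, 4]
n = len(xs)

# Summary statistics, as given in Example 3.2.
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# MF15 form of the product moment correlation coefficient.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)
print(f"r = {r:.3f}")
```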


Example 3.3.

The table below shows four sets of bivariate data.

x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89

(i) Find the values of the product moment correlation coefficient for each set of
data.

(ii) Sketch the scatter diagrams for each of the sets of the bivariate data.

(iii) Using your sketches in part (ii), comment on the strengths of the linear
relationships between the two variables for each set of data, and compare
between the effectiveness of using the r-values and that of using the scatter
diagrams to determine the strength of the linear relationship of a set of
bivariate data.

Solution:

(i) One can verify that the correlation coefficients for the four sets of data are all
equal to 0.816.

(ii) The following scatter diagrams are drawn based on the above data.

[Figure: four scatter diagrams, one for each data set: y1 against x1, y2 against x2,
y3 against x3, and y4 against x4.]

(iii) Based on the sketches in part (ii), the 3rd set of data demonstrates the strongest
linear relationship between the two variables, albeit with the presence of an
outlier.

Since all the r-values are the same, we cannot tell which data set has the
strongest linear relationship based on the r-value alone. Rather, we need to
look at the scatter diagram to help us decide.

In other words, the r-value alone is sometimes not enough to fully illustrate
the strength of the linear relationship between the two variables. The scatter
diagrams give a clearer and more complete picture in this aspect.
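The claim in part (i) that all four data sets share the same r-value can be verified with a short Python sketch. `pmcc` is a hypothetical helper implementing the definition of r; the data are copied from the table in this example.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient from the definition of r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x1, x2 and x3 are identical
data_sets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {k: pmcc(xs, ys) for k, (xs, ys) in data_sets.items()}
for k, r in r_values.items():
    print(f"Set {k}: r = {r:.3f}")
```

The four values agree to 3 significant figures even though the scatter diagrams look very different, which is the point of the example.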

Example 3.4.

(i) Sketch an example of a scatter diagram indicating the following:

“A linear (product-moment) coefficient close to zero but there is an
obvious relation between the variables.”

(ii) If the estimated product-moment correlation coefficient has a value close
to +1 or to –1, explain why this need not imply that there is a linear
relationship between the variables.

Solution

(i) [Figure: scatter diagram of Monthly Salary ($Y) against Age (X yrs).]

r is close to zero but there exists an obvious relationship (possibly
quadratic) between the variables.


(ii) A value of r close to +1 or –1 indicates a linear relationship only for values
within the sample range. It does not tell us the relation for data values
outside this range. For example, r may be calculated using only the last four
pairs of data in the above scatter diagram, which gives an almost linear
relation even though the relationship over the full range is not linear.

§4 Reliability of Estimates

From Example 2.1 (iii), we see how a regression line can be used to estimate the
value of one of the two variables for a set of bivariate data, given a value of the
other variable.

However, from Example 3.4 (ii), note that even if the bivariate data demonstrates a
linear relationship, adding more data points may show a different relationship
between the two variables altogether. Hence it is risky to use the regression line to
estimate the value of x (or y) given a value of y (or x) that lies outside the range of
values of y (or x) in the data set.

Therefore, to determine if an estimate (or predicted value) obtained from a
regression line is reliable, we consider the following.

1. Does the given value lie in the range of values of the data set?
   NO:  The estimate (or predicted value) is NOT reliable.
   YES: Go on to question 2.

2. Is the absolute value of r close to 1, i.e. is there a strong linear relationship
   between the two variables based on the given data set?
   NO:  The estimate (or predicted value) is NOT reliable.
   YES: The estimate (or predicted value) is reliable.

The process of carrying out estimation from within the given data range is known
as interpolation, while the process of carrying out estimation from beyond the data
range is known as extrapolation.
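The two reliability questions can be sketched as a Python function. The function name is made up, and the cut-off of 0.9 for "close to 1" is an assumption for illustration only; the notes do not fix a numerical threshold, so judgement is still needed in practice.

```python
def estimate_is_reliable(given_value, observed_values, r, r_threshold=0.9):
    """
    given_value:     the value substituted into the regression line
    observed_values: the data values of that same variable
    r:               product moment correlation coefficient of the data
    r_threshold:     assumed cut-off for "|r| close to 1" (illustrative only)
    """
    # Question 1: interpolation (inside the data range) or extrapolation?
    is_interpolation = min(observed_values) <= given_value <= max(observed_values)
    # Question 2: is the linear relationship strong?
    is_strong = abs(r) >= r_threshold
    return is_interpolation and is_strong


readings = [6.26, 5.47, 4.67, 3.91, 3.29, 2.28, 1.44]
print(estimate_is_reliable(3.5, readings, -0.999))  # inside range, strong
print(estimate_is_reliable(8.0, readings, -0.999))  # extrapolation
print(estimate_is_reliable(3.5, readings, 0.4))     # weak correlation
```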

Example 4.1.

An instrument is used to measure the amount of Vitamin C in a given volume of
liquid. It is standardised by using it on seven specimen solutions containing known
amounts of Vitamin C, x, in micrograms/ml. The reading on the instrument is
denoted by y. Corresponding values of x and y are given in the table:

x 100 200 300 400 500 600 700
y 6.26 5.47 4.67 3.91 3.29 2.28 1.44

(i) Find r, the product moment correlation coefficient between x and y.


(ii) Using a suitable regression line, estimate the value of x when y = 3.5.
(iii) Comment on the reliability of this estimate that you have obtained in part
(ii).


Solution
(i) By GC, r = –0.999.

(ii) Since the reading on the instrument depends on the amount of Vitamin C
in the solution, we should use the y on x line (even though we are
estimating x from a given value of y).

From the GC, the equation of the y on x regression line is given by:
y = –0.0079357143x + 7.077142857

When y = 3.5,
3.5 = –0.0079357x + 7.0771

$$x = \frac{3.5 - 7.0771}{-0.0079357} = 450.76 \approx 451 \text{ (to 3 s.f.)}$$

(iii) Since
the given value of y, i.e. 3.5, is in the data range for y, [1.44, 6.26], and
the value of r, –0.999, has an absolute value that is very close to 1,
this suggests that the estimated value is reliable.
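The working for this example can be reproduced in Python as a check on the GC values. This is a sketch using the working formulae from Appendix B; the variable names are made up for this illustration.

```python
import math

# Data from Example 4.1 (x: amount of Vitamin C, y: instrument reading).
xs = [100, 200, 300, 400, 500, 600, 700]
ys = [6.26, 5.47, 4.67, 3.91, 3.29, 2.28, 1.44]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)

# y depends on x, so the y on x line is used even when estimating x.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

y_given = 3.5
x_estimate = (y_given - intercept) / slope  # solve y = intercept + slope * x

print(f"r = {r:.3f}")
print(f"y = {slope:.10f}x + {intercept:.9f}")
print(f"estimated x when y = 3.5: {x_estimate:.2f}")
```

The negative value of r matches the downward trend in the table: the reading falls as the amount of Vitamin C rises.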

Example 4.2.
An anemometer is used to estimate wind speed by observing the rotational speed of
its vanes. This speed is converted to wind speed by means of an equation obtained
from calibrating the instrument in a wind tunnel. In this calibration process, the
wind speed is fixed precisely and the resulting anemometer speed noted. For a
particular anemometer, this process produced the following set of data:

Actual wind speed, s (m/s)       1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9
Anemometer speed, r (revs/min)   30   38   48   58   68   80   92   106  120  134

(a) Obtain the equation of the estimated least squares regression line of r on s
and the line of s on r.
(b) If the actual wind speed is 1.65m/s, use an appropriate regression line to
estimate the rotational speed of the anemometer.
(c) Demonstrate, using the above regression line as an example, that it is
unwise to extrapolate beyond the range of the data.

Solution

(a) By GC, r = –90.8 + 116s and s = 0.78772 + 0.0085566r.

(b) Since the wind speed is fixed precisely, it is the independent variable.
Hence we use the line of r on s.
Therefore, r = –90.8 + 116(1.65) = 100.6 revs/min.

(c) Using the r on s line: When s = 0, r = –90.8 revs/min. That is, the
anemometer speed is negative! Thus, it is unwise to extrapolate beyond the
data range.
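The danger of extrapolation in part (c) can be seen numerically. The Python sketch below fits the r on s line with the working formulae from Appendix B and evaluates it inside and outside the data range; the anemometer speed is stored as `revs` to avoid confusion with the correlation coefficient, and all names are made up for this illustration.

```python
# Data from Example 4.2.
speeds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]  # s, in m/s
revs = [30, 38, 48, 58, 68, 80, 92, 106, 120, 134]           # r, in revs/min

n = len(speeds)
s_bar = sum(speeds) / n
r_bar = sum(revs) / n
s_sr = sum((s - s_bar) * (r - r_bar) for s, r in zip(speeds, revs))
s_ss = sum((s - s_bar) ** 2 for s in speeds)

slope = s_sr / s_ss                # line of r on s: r = intercept + slope * s
intercept = r_bar - slope * s_bar


def predict_revs(s):
    """Anemometer speed predicted by the r on s line for wind speed s."""
    return intercept + slope * s


inside = predict_revs(1.65)   # interpolation: 1.65 lies within [1.0, 1.9]
outside = predict_revs(0.0)   # extrapolation: 0 lies far below the data range

print(f"r = {intercept:.1f} + {slope:.1f}s")
print(f"s = 1.65 -> {inside:.1f} revs/min")
print(f"s = 0    -> {outside:.1f} revs/min (negative, so not physically sensible)")
```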


§5 Linearisation of Data

Suppose a strong but non-linear relationship can be observed from the data, say one
of the form $y = \frac{a}{x} + b$. In this case, we can introduce another variable w,
where $w = \frac{1}{x}$, and carry out linear regression between w and y instead
(since y = aw + b).

The process of carrying out such a transformation to achieve linearity is known as
linearisation of data. The following are some examples of how linearisation of data
can be carried out for various non-linear models.

Non-Linear Equation         | Transformed Variables
(a) $y = \frac{a}{x} + b$   | y = aw + b, where $w = \frac{1}{x}$
(b) $y = ax^2 + b$          | y = au + b, where $u = x^2$
(c) $y = a + b \ln x$       | y = a + bv, where $v = \ln x$

Example 5.1.

An experiment is conducted to determine the relationship between the variables x
and y. The following table gives the experimental values.

x 1 2 3 4 5 10 30 50
y 5.145 4.139 3.809 3.640 3.542 3.341 3.212 3.181

Find, correct to 4 decimal places, the product moment correlation coefficient
between

(a) x and y,

(b) $\frac{1}{x}$ and y,

(c) ln x and y.

Use your answers to parts (a), (b) and (c) to explain which of $y = a + bx$,
$y = a + \frac{b}{x}$, $y = a + b \ln x$ is the best model for this set of data.

Solution:

(a) From GC, r-value between x and y = –0.5994 (to 4 d.p.)


(b) To obtain the r-value (or regression model) for transformed data

No.  Keys to Press/Steps (key icons shown as [key]; screenshots omitted)
1.   Press [key].
2.   Press [key]. Key in the x and y values into columns L1 and L2 respectively.
3.   Scroll towards the right and up into the header row to enter the cell with “L3”.
4.   Press [key] to define L3 as 1/L1, i.e. L3 is to comprise all the values of
     1/x in the data set.
5.   Press [key] to generate all the values of 1/x.
6.   Exit to the main screen, ensure that STAT Diagnostics is on, and press [key]
     to enter the “LinReg(ax + b)” command. Enter L3 and L2 as the X and Y lists
     respectively.
7.   Select “Calculate” to obtain the r-value between 1/x and y. The regression
     line for y on 1/x is also obtained (but not required for this question).

Therefore, r-value between $\frac{1}{x}$ and y = 1.0000 (to 4 d.p.)

(c) Following a procedure similar to part (b) (set L4 as ln L1, then set the X and Y
lists as L4 and L2 respectively in the “LinReg(ax + b)” command),
r-value between ln x and y = –0.8547 (to 4 d.p.)

Since |r| is closest to 1 for part (b), the best model for this set of data is
$y = a + \frac{b}{x}$.
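The comparison of the three models can also be sketched in Python, mirroring the GC steps of defining transformed lists. `pmcc` is a hypothetical helper implementing the definition of r; the transformed lists play the roles of L3 and L4.

```python
import math


def pmcc(us, vs):
    """Product moment correlation coefficient between two lists of values."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)


# Data from Example 5.1.
xs = [1, 2, 3, 4, 5, 10, 30, 50]
ys = [5.145, 4.139, 3.809, 3.640, 3.542, 3.341, 3.212, 3.181]

reciprocal_x = [1 / x for x in xs]   # plays the role of L3 = 1/L1
ln_x = [math.log(x) for x in xs]     # plays the role of L4 = ln(L1)

r_x = pmcc(xs, ys)                   # model y = a + bx
r_inv = pmcc(reciprocal_x, ys)       # model y = a + b/x
r_ln = pmcc(ln_x, ys)                # model y = a + b ln x

for label, r in [("x", r_x), ("1/x", r_inv), ("ln x", r_ln)]:
    print(f"r between {label} and y: {r:.4f}")
```

The model whose transformed variable gives |r| closest to 1 is taken as the best fit among the candidates.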


Example 5.2.

A car is placed in a wind tunnel and the drag force F for different wind speeds, v,
in appropriate units, is recorded. The results are shown in the table.

v 0 4 8 12 16 20 24 28 32 36
F 0 2.5 5.1 8.8 11.2 13.6 17.6 22.0 27.8 33.9

(i) Draw the scatter diagram for these values, labeling the axes clearly.

(ii) It is thought that the drag force F can be modelled by one of the formulae

$F = a + bv$ or $F = c + dv^2$

Use your answer to part (i) to explain which of $F = a + bv$ or $F = c + dv^2$ is
the better model. [2010/II/10 (modified)]

Solution:

(i)

(ii) Since the points appear to follow a curve (or trend) that is increasing at an
increasing rate (with respect to the variable v), the model $F = c + dv^2$ is the
better model in this case.
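The question asks for an argument based on the scatter diagram, but the conclusion can be cross-checked numerically in the spirit of Example 5.1 by comparing the r-values for F against v and F against v². The Python sketch below is only a supplementary check; `pmcc` is a hypothetical helper implementing the definition of r.

```python
import math


def pmcc(us, vs):
    """Product moment correlation coefficient between two lists of values."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / math.sqrt(s_uu * s_vv)


# Data from Example 5.2.
wind_speeds = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
drag_forces = [0, 2.5, 5.1, 8.8, 11.2, 13.6, 17.6, 22.0, 27.8, 33.9]

r_linear = pmcc(wind_speeds, drag_forces)                      # F = a + bv
r_square = pmcc([v ** 2 for v in wind_speeds], drag_forces)    # F = c + dv^2

print(f"r between v and F   : {r_linear:.4f}")
print(f"r between v^2 and F : {r_square:.4f}")
```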


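Appendix A: Equivalence of the 2 Formulae for the Product Moment Correlation Coefficient

To show that

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
\quad \text{and} \quad
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$

are equivalent, we need to show the following results:

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$\sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$\sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n}$$

Proof

$$\begin{aligned}
\sum (x - \bar{x})^2 &= \sum \left(x^2 - 2x\bar{x} + \bar{x}^2\right) \\
&= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
&= \sum x^2 - 2\left(\frac{\sum x}{n}\right)\sum x + n\left(\frac{\sum x}{n}\right)^2 \\
&= \sum x^2 - \frac{2}{n}\left(\sum x\right)^2 + \frac{1}{n}\left(\sum x\right)^2 \\
&= \sum x^2 - \frac{1}{n}\left(\sum x\right)^2
\end{aligned}$$

The proof for $\sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$ is similar.

$$\begin{aligned}
\sum (x - \bar{x})(y - \bar{y}) &= \sum \left(xy - \bar{x}y - x\bar{y} + \bar{x}\bar{y}\right) \\
&= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\bar{x}\bar{y} \\
&= \sum xy - \frac{\sum x}{n}\sum y - \frac{\sum y}{n}\sum x + n\left(\frac{\sum x}{n}\right)\left(\frac{\sum y}{n}\right) \\
&= \sum xy - \frac{\sum x \sum y}{n}
\end{aligned}$$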
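The three identities can be spot-checked numerically. The Python sketch below uses the data of Example 3.1; a numerical check on one data set is of course not a proof, only a sanity check on the algebra. The variable names are made up for this illustration.

```python
import math

# Data from Example 3.1.
xs = [2, 1, 3, 3, 4, 4, 4]
ys = [1.5, 2, 3, 3, 3.5, 4, 4]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Left-hand sides: sums of deviations from the means.
lhs_xx = sum((x - x_bar) ** 2 for x in xs)
lhs_yy = sum((y - y_bar) ** 2 for y in ys)
lhs_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Right-hand sides: the forms used in MF15.
rhs_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
rhs_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
rhs_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

r_definition = lhs_xy / math.sqrt(lhs_xx * lhs_yy)
r_mf15 = rhs_xy / math.sqrt(rhs_xx * rhs_yy)

print(lhs_xx, rhs_xx)
print(lhs_yy, rhs_yy)
print(lhs_xy, rhs_xy)
print(r_definition, r_mf15)
```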


Appendix B

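B1 Least-Squares Regression Line of y on x: y = a + bx

Now, $\sum e_i^2 = \sum (y_i - a - bx_i)^2$.

First, allow a to vary while keeping all others constant.

Differentiating wrt a, we get
$\frac{\partial \sum e_i^2}{\partial a} = -2\sum (y_i - a - bx_i)$.
(Here $\partial$ denotes a partial derivative.)

Let $\frac{\partial \sum e_i^2}{\partial a} = 0$, we get $\sum (y_i - a - bx_i) = 0$.

Hence, $\sum y_i = na + b\sum x_i$ --- Eqn (1)

Next, allow b to vary while keeping all others constant.

Differentiating wrt b, we get
$\frac{\partial \sum e_i^2}{\partial b} = 2\sum (y_i - a - bx_i)(-x_i)$.

Let $\frac{\partial \sum e_i^2}{\partial b} = 0$, we get
$-2\sum (x_i y_i - ax_i - bx_i^2) = 0$.

Hence, $\sum x_i y_i = a\sum x_i + b\sum x_i^2$ --- Eqn (2)

Equations (1) and (2) are called the normal equations of y on x.

[Eqn(1) × $\sum x_i$] – [Eqn(2) × n] gives
$\sum x_i \sum y_i - n\sum x_i y_i = b\left(\sum x_i\right)^2 - bn\sum x_i^2$.

Thus, it can be shown that
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}},$$
where $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ (or equivalently $\sum x^2 - n\bar{x}^2$) and
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ (or equivalently $\sum xy - n\bar{x}\bar{y}$).

[Eqn(1) ÷ n] gives $\bar{y} = a + b\bar{x} \Rightarrow a = \bar{y} - b\bar{x}$.

Thus, the equation of the regression line of y on x is given by:

$$y = (\bar{y} - b\bar{x}) + bx \;\Rightarrow\; y - \bar{y} = b(x - \bar{x}), \quad \text{where } b = \frac{S_{xy}}{S_{xx}}.$$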
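The elimination step on the normal equations can be followed numerically. The Python sketch below uses the data of Example 2.1, solves Eqn (1) and Eqn (2) for a and b, and compares the slope with S_xy / S_xx; the variable names are made up for this illustration.

```python
# Data from Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Normal equations of y on x:
#   Eqn (1): sum_y  = n * a     + b * sum_x
#   Eqn (2): sum_xy = a * sum_x + b * sum_x2
# [Eqn(1) * sum_x] - [Eqn(2) * n] eliminates a:
b = (sum_x * sum_y - n * sum_xy) / (sum_x ** 2 - n * sum_x2)
a = (sum_y - b * sum_x) / n  # from Eqn (1)

# Working formula for comparison.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

print(f"a = {a:.6f}, b = {b:.9f}")
print(f"S_xy / S_xx = {s_xy / s_xx:.9f}")
```

Both routes give the same slope, and the intercept agrees with the GC value quoted in Example 2.1.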


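B2 Least-Squares Regression Line of x on y: x = a' + b'y

Now, $\sum d_i^2 = \sum (x_i - a' - b'y_i)^2$.

By doing partial differentiation with respect to a' and b' as above, we get the
following normal equations:

$\sum x_i = na' + b'\sum y_i$ --- Eqn (3)
$\sum x_i y_i = a'\sum y_i + b'\sum y_i^2$ --- Eqn (4)

Solving equations (3) and (4), we get

$$b' = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum y_i^2 - n\bar{y}^2} = \frac{S_{xy}}{S_{yy}} \quad \text{and} \quad a' = \bar{x} - b'\bar{y},$$

where $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$ (or equivalently $\sum y^2 - n\bar{y}^2$) and
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ (or equivalently $\sum xy - n\bar{x}\bar{y}$).

Thus, the equation of the regression line of x on y is given by:

$$x = (\bar{x} - b'\bar{y}) + b'y \;\Rightarrow\; x - \bar{x} = b'(y - \bar{y}), \quad \text{where } b' = \frac{S_{xy}}{S_{yy}}.$$

Re-arranging this equation, we get

$$y - \bar{y} = \frac{1}{b'}(x - \bar{x}), \quad \text{where } b' = \frac{S_{xy}}{S_{yy}}.$$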
