C R Lect Notes
§1 Introduction
A set of data comprising the values of two variables, say x and y, obtained from the
same sample, is known as bivariate data.
When a set of bivariate data is plotted in the Cartesian plane, a scatter diagram (or
scatter plot) is produced. Some examples of scatter diagrams are as follows:
Figure 1.1. Scatter diagram for the Senior High 2 Lecture Test percentage scores (y) for
a class against their Senior High 1 Promotional Examination percentage scores (x)
In a set of bivariate data, one of the two variables may be affected or influenced by
the value of the other variable, which is controlled. In this case, the variable whose
values have been controlled is called the independent variable, while the other
variable is called the dependent variable.
For example, in the set of data on the amount of advertising time for a product and
number of sales for that product, the amount of advertising time is the independent
variable, while the number of sales for that product is the dependent variable.
On the other hand, in a set of data comprising the heights and intelligence quotients
(IQ) of a group of people, neither variable is likely to depend on the other.
Example 1.1.
In a city, the number of outlets for a particular café, x, and the number of car
accidents, y, are recorded over a period of time. The set of data obtained is given as
follows.
x 25 45 60 75 90
y 88 72 57 44 23
Solution.
(i) The graphic calculator can be used as a tool to sketch the scatter diagram,
as described in the following procedure:
2.
Press .
Then press .
5.
Press .
Then press
(for “ZoomStat”)
(iii) No. The decrease in the number of car accidents could be due to a recent
campaign on road safety, while the increase in the number of outlets of the
café could be simply due to the café expanding its business at the same time.
It is unlikely that the increase in the number of outlets of the café has caused
the decrease in the number of car accidents.
Notes:
Further qualitative analysis is needed to ascertain the true effect of one variable on
the other.
§2 Linear Regression
This is akin to finding a “best-fit” line, where a line is drawn on the scatter diagram
such that there are as many points above the line as below it (or as many points to
the left of the line as to the right of it). However, the choice of line is subject
to each individual's personal judgement of “closeness”. Hence, there can be
many possible “best-fit” lines for the same set of bivariate data.
Note that the values of e1, e2, e3, e4 and e5 represent the “y-errors”, i.e. the errors
in using the chosen line to estimate the values of y for the values of x given in the
data set.
Logically speaking, the values of the errors should be as small as possible if the line
chosen is indeed the best-fit line. Hence we aim to minimise the values of the errors
when calculating the equation of the regression line.
However, since these errors will be positive or negative according to whether the
points are above or below the line, we work with the squares of these values instead
and consider their sum, $\sum e_k^2 = e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2$.
By minimising the sum $\sum e_k^2$, one will obtain the least squares regression line of
y on x.
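The idea of minimising the sum of squared y-errors can be checked numerically. The sketch below is only an illustration: it assumes the five data pairs of Example 1.1 as a stand-in for the plotted points, uses the gradient formula $S_{xy}/S_{xx}$ derived in the appendix, and compares the result against a few hypothetical "by eye" lines (the names `sum_sq_y_errors` and `eyeballed` are made up for this example).

```python
# Data pairs from Example 1.1 (assumed here as the five plotted points).
xs = [25, 45, 60, 75, 90]
ys = [88, 72, 57, 44, 23]

def sum_sq_y_errors(a, b):
    """Sum of e_k^2 for the line y = a + b*x, where e_k = y_k - (a + b*x_k)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares gradient and intercept of the y on x line.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b_ls = s_xy / s_xx
a_ls = y_bar - b_ls * x_bar
best = sum_sq_y_errors(a_ls, b_ls)

# Hypothetical lines that someone might draw by eye, as (intercept, gradient).
eyeballed = [(115.0, -1.0), (112.0, -0.95), (a_ls + 1.0, b_ls)]

print("least squares line:", a_ls, b_ls, "sum of squares:", best)
for a, b in eyeballed:
    print("by-eye line:", a, b, "sum of squares:", sum_sq_y_errors(a, b))
```

Every by-eye line gives a larger sum of squared y-errors than the least squares line, which is what makes the least squares line a well-defined choice of "best fit".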
In contrast, if we instead consider the horizontal distances d1, d2, d3, d4
and d5 drawn from each point to a best-fit line, as shown below,
and minimise the values of the “x-errors” by minimising the sum of squares
$\sum d_k^2 = d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2$,
we will obtain the least squares regression line of x on y.
For the purpose of the H2 Maths syllabus, you are not required to find the
equations of the regression lines analytically. However, you will need to know
how to use the graphic calculator to obtain the equations of the regression lines,
which is illustrated in the next example.
Example 2.1.
x 25 50 60 80 90
y 80 90 50 44 10
(i) Find the equations of the regression lines of y on x and x on y, and sketch both
lines in a single scatter diagram.
(ii) Find the coordinates of the point of intersection between both lines found in
part (i). How do the values compare to the sample means of x and y in the set
of data?
Suppose the variables x and y are such that neither depends on the other.
Suppose instead that x represents air temperature and y represents wind speed.
(iv) Will the choice of regression lines in parts (iii)(a) and (iii)(b) change? Why
or why not?
(v) Interpret the slope of the x on y line in the context of this question.
Solution.
After entering the data into the GC (as illustrated in Example 1.1),
2.
Press to select
3.
Press , then
to calculate the
After entering the data into the GC (as illustrated in Example 1.1),
2.
Press to select
3.
Press , then
to calculate the
Regression line of y on x: y = –1.066412214x + 119.851145
Regression line of x on y: x = –0.6948865897y + 99.07978512
Using the GC (PolySmlt2 App – see below), the point of intersection between
the y on x and x on y lines has coordinates (61.0, 54.8).
Using the “2-Var Stats” command in the GC (in the “STAT” “CALC” menu –
see below), the sample mean of x is 61 and the sample mean of y is 54.8, which
coincide with the coordinates of the point of intersection between the two
lines.
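For readers without a GC at hand, the two regression lines and their point of intersection can be reproduced with a short script. This is a minimal sketch using the working formulae from the appendix ($S_{xy}/S_{xx}$ and $S_{xy}/S_{yy}$); the variable names are chosen for this example only.

```python
# Data from Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
s_yy = sum(y * y for y in ys) - n * y_bar ** 2
s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

# Regression line of y on x: y = a + b*x
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regression line of x on y: x = c + d*y
d = s_xy / s_yy
c = x_bar - d * y_bar

# Point of intersection: substitute x = c + d*y into y = a + b*x.
y_int = (a + b * c) / (1 - b * d)
x_int = c + d * y_int

print(f"y on x: y = {b:.9f}x + {a:.6f}")
print(f"x on y: x = {d:.10f}y + {c:.8f}")
print(f"intersection: ({x_int:.1f}, {y_int:.1f}); means: ({x_bar}, {y_bar})")
```

The intersection agrees with the sample means, as stated in part (ii).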
(iii) (a) Since we are estimating the value of y, we wish to minimise the y-errors.
Hence the appropriate regression line to use in this case is the y on x
line. Therefore, when x = 70,
Estimated value of y = 119.85 – 1.0664(70)
= 45.202
= 45.2 (to 3 significant figures)
(iii) (b) Since we are estimating the value of x, we wish to minimise the x-errors.
Hence the appropriate regression line to use in this case is the x on y
line. Therefore, when y = 50,
Estimated value of x = 99.079 – 0.69489(50)
= 64.3345
= 64.3 (to 3 significant figures)
Note: Premature rounding off in the equation of the line will lead to inaccuracy
in estimation, e.g. “Estimated value of y = 120 – 1.07(70) = 45.1 (to 3 s.f.)”.
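The effect of premature rounding can be seen by carrying the coefficients at full precision and rounding only at the end. The sketch below recomputes both lines from the Example 2.1 data; `fit_line` is a helper written for this illustration, not a GC command.

```python
# Data from Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]

def fit_line(us, vs):
    """Least squares line of v on u. Returns (intercept, gradient)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    grad = s_uv / s_uu
    return v_bar - grad * u_bar, grad

a_yx, b_yx = fit_line(xs, ys)   # y on x, used to estimate y from x
a_xy, b_xy = fit_line(ys, xs)   # x on y, used to estimate x from y

y_at_70 = a_yx + b_yx * 70      # part (iii)(a)
x_at_50 = a_xy + b_xy * 50      # part (iii)(b)

# Same estimate as (iii)(a), but with coefficients rounded to 3 s.f. first.
y_rounded_early = 120 - 1.07 * 70

print(f"{y_at_70:.3g} {x_at_50:.3g} {y_rounded_early:.3g}")
```

At full precision the estimates come out at about 45.2 and 64.3 (to 3 significant figures), whereas rounding the coefficients first shifts the estimate of y to 45.1.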
(iv) The choice of line for part (iii)(a) should be changed to the x on y line, while
the choice of line for part (iii)(b) should remain the same. As y is the
controlled variable, there is no error to speak of for y, and therefore the x on
y line should be used to carry out estimation in both parts.
(v) Every unit increase in wind speed (y) will lead to an approximate decrease
of 0.695 units in the air temperature (x).
Notes:
2. If the sample means of the two variables are $\bar{x}$ and $\bar{y}$, then the point
$(\bar{x}, \bar{y})$ lies on both the regression lines of y on x and x on y, i.e. the two lines
must intersect at $(\bar{x}, \bar{y})$. (For the proof, refer to the appendix on the derivation
of the working formulae to find the equations of the y on x and x on y lines.)
In the previous section, we have discussed how to use regression lines to carry out
estimation/prediction of values for a variable given a set of bivariate data.
However, the validity of performing this procedure to carry out estimation depends
on the assumption that a linear model is a good fit for the data we are given, which
is not always true. For example, the following scatter diagrams illustrate how
different sets of bivariate data may demonstrate different possible relationships (or
equivalently types of correlation) between the two variables.
[Figure: four scatter diagrams. Panels: Physics Marks (Y) against Maths Marks (X); Temperature (Y °C) against Altitude (X m); a third panel plotted against Age (X yrs); a fourth panel plotted against Weight (X kg).]
While the scatter diagram can show clearly if a linear model is a good fit for the
data, this may not always be a practical approach to determine the degree of linear
correlation between the two variables, especially when we are dealing with a large
set of data. In this case, a possible alternative to using the scatter diagram would be
to use a certain measure called the product moment correlation coefficient
(conventionally denoted by r), which is calculated based on the following formula:
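$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} \quad \text{(in MF15)}$$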
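The two forms of the formula can be written as two small functions and compared on the same data. This is a sketch only: the function names are invented for this example, and the Example 2.1 data is used purely as sample input.

```python
from math import sqrt

def r_from_deviations(xs, ys):
    """r computed from deviations about the sample means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sqrt(sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys))
    return num / den

def r_from_sums(xs, ys):
    """r computed from the raw sums, as in the MF15 form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = sum_xy - sum_x * sum_y / n
    den = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den

# Sample input: the data of Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]
r1 = r_from_deviations(xs, ys)
r2 = r_from_sums(xs, ys)

# Points lying exactly on a line with positive gradient (here y = 2x + 1).
r_perfect = r_from_sums([1, 2, 3, 4], [3, 5, 7, 9])

print(r1, r2, r_perfect)
```

Both functions return the same value up to floating-point error, and points lying exactly on a rising straight line give r = 1.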
Notes:
(2) If r = 1, then the set of bivariate data demonstrates a perfect positive linear
correlation between the two variables.
If r = –1, then the set of bivariate data demonstrates a perfect negative linear
correlation between the two variables.
The following diagrams illustrate how the two regression lines appear for
different values of r.
[Figure: panels showing the regression lines of y on x and x on y for different values of r; in two of the panels the y on x and x on y lines coincide.]
In summary, the closer the absolute value of r is to 1 (i.e. the stronger the linear
relationship demonstrated), the closer the two regression lines are to each
other.
(4) The value of r depends on the given set of bivariate data and hence it may
change in value if more data pairs are added to the bivariate data (after
conducting more trials of the experiment), or if a data pair that has been
discovered to be an outlier has been removed from the data.
Example 3.1.
A set of bivariate data between two variables x and y is given as follows. Calculate
the product-moment correlation coefficient, r, for this set of data.
x 2 1 3 3 4 4 4
y 1.5 2 3 3 3.5 4 4
Solution:
1
Press .
2.
Press .
3.
Press , then
Note: The value of r will appear only if STAT DIAGNOSTICS has been turned
on.
Example 3.2.
A set of bivariate data comprising 7 data pairs for two variables x and y is collected.
The data is summarised as follows.
Solution:
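$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} = \frac{69 - \dfrac{(21)(21)}{7}}{\sqrt{\left(71 - \dfrac{21^2}{7}\right)\left(68.5 - \dfrac{21^2}{7}\right)}} = 0.905$$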
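The substitution above can be checked in a few lines of code. As it happens, the raw data of Example 3.1 produces exactly the sums used here (n = 7, Σx = Σy = 21, Σx² = 71, Σy² = 68.5, Σxy = 69), so the sketch below starts from that data; the variable names are chosen for this illustration.

```python
from math import sqrt

# Raw data from Example 3.1; its sums match those substituted above.
xs = [2, 1, 3, 3, 4, 4, 4]
ys = [1.5, 2, 3, 3, 3.5, 4, 4]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# MF15 form of the product moment correlation coefficient.
r = (sum_xy - sum_x * sum_y / n) / sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3))  # 0.905
```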
Example 3.3.
x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
(i) Find the values of the product moment correlation coefficient for each set of
data.
(ii) Sketch the scatter diagrams for each of the sets of the bivariate data.
(iii) Using your sketches in part (ii), comment on the strengths of the linear
relationships between the two variables for each set of data, and compare
between the effectiveness of using the r-values and that of using the scatter
diagrams to determine the strength of the linear relationship of a set of
bivariate data.
Solution:
(i) One can verify that the correlation coefficients for the four sets of data are all
approximately equal to 0.816.
(ii) The following scatter diagrams are drawn based on the above data.
[Figure: four scatter diagrams, one for each data set: y1 against x1, y2 against x2, y3 against x3 and y4 against x4.]
(iii) Based on the sketches in part (ii), the 3rd set of data demonstrates the strongest
linear relationship between the two variables, albeit with the presence of an
outlier.
Since all the r-values are the same, we cannot tell which data set has the
strongest linear relationship based on the r-value alone. Rather, we need to
look at the scatter diagram to help us decide.
In other words, the r-value alone is sometimes not enough to fully illustrate
the strength of the linear relationship between the two variables. The scatter
diagrams give a clearer and more complete picture in this aspect.
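The claim in part (i) can be verified with a short script. This is a sketch only, with `pearson_r` written for this example; it computes r for each of the four data sets in the table of Example 3.3.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment correlation coefficient (deviations form)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# The first three data sets share the same x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {k: pearson_r(xs, ys) for k, (xs, ys) in datasets.items()}
for k, r in r_values.items():
    print(f"data set {k}: r = {r:.4f}")
```

The four values agree to within about 0.001 of 0.816 even though the scatter diagrams look very different, which is why the r-value should be read together with the scatter diagram.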
Example 3.4.
Solution
(i)
[Figure: scatter diagram of Monthly Salary ($Y) against Age (X yrs).]
§4 Reliability of Estimates
From Example 2.1 (iii), we see how a regression line can be used to estimate the
value of one of the two variables for a set of bivariate data, given a value of the
other variable.
However, from Example 3.4 (ii), note that even if the bivariate data demonstrates a
linear relationship, adding more data points may show a different relationship
between the two variables altogether. Hence it is risky to use the regression line to
estimate the value of x (or y) given a value of y (or x) that lies outside the range of
values of y (or x) in the data set.
The process of carrying out estimation from within the given data range is known
as interpolation, while the process of carrying out estimation from beyond the data
range is known as extrapolation.
Example 4.1.
Solution
(i) By GC, r = –0.999.
(ii) Since the reading on the instrument depends on the amount of Vitamin C
in the solution, we should use the y on x line (even though we are
estimating x from a given value of y).
From the GC, the equation of the y on x regression line is given by:
y = –0.0079357143x + 7.077142857
When y = 3.5,
$$3.5 = -0.0079357x + 7.0771$$
$$x = \frac{3.5 - 7.0771}{-0.0079357} = 450.76 \approx 451 \text{ (to 3 s.f.)}$$
(iii) Since
the given value of y, i.e. 3.5, lies within the data range for y, [1.44, 6.26], and
the value of r, –0.999, has an absolute value that is very close to 1,
the estimated value is reliable.
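The working in parts (ii) and (iii) can be sketched in code. The gradient, intercept and y-range below are taken from the solution above; the function names `estimate_x` and `is_interpolation` are hypothetical helpers for this illustration.

```python
# y on x regression line from Example 4.1, as reported by the GC.
GRADIENT = -0.0079357143
INTERCEPT = 7.077142857

# Range of y-values in the data set, as quoted in part (iii).
Y_MIN, Y_MAX = 1.44, 6.26

def estimate_x(y):
    """Solve y = GRADIENT * x + INTERCEPT for x (the y on x line is still used)."""
    return (y - INTERCEPT) / GRADIENT

def is_interpolation(y):
    """True if y lies within the data range, so the estimate is an interpolation."""
    return Y_MIN <= y <= Y_MAX

x_hat = estimate_x(3.5)
print(f"{x_hat:.3g}", is_interpolation(3.5))

# A value outside the data range would be an extrapolation and less reliable.
print(is_interpolation(7.0))
```

Note that the check on the data range says nothing about the strength of the linear relationship; the value of r still has to be considered separately, as in part (iii).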
Example 4.2.
An anemometer is used to estimate wind speed by observing the rotational speed of
its vanes. This speed is converted to wind speed by means of an equation obtained
from calibrating the instrument in a wind tunnel. In this calibration process, the
wind speed is fixed precisely and the resulting anemometer speed noted. For a
particular anemometer, this process produced the following set of data:
(a) Obtain the equation of the estimated least squares regression line of r on s
and the line of s on r.
(b) If the actual wind speed is 1.65m/s, use an appropriate regression line to
estimate the rotational speed of the anemometer.
(c) Demonstrate, using the above regression line as an example, that it is
unwise to extrapolate beyond the range of the data.
Solution
(b) Since the wind speed is fixed precisely, it is the independent variable.
Hence we use the line of r on s.
Therefore, r = –90.8 + 116(1.65) = 100.6 revs/min.
(c) Using the r on s line: When s = 0, r = –90.8 revs/min. That is, the
anemometer speed is negative! Thus, it is unwise to extrapolate beyond the
data range.
§5 Linearisation of Data
Suppose a strong but non-linear relationship can be observed from the data, say one
of the form $y = \frac{a}{x} + b$. In this case, we can introduce another variable w, where
$w = \frac{1}{x}$, and carry out linear regression between w and y instead (since y = aw + b).
Example 5.1.
x 1 2 3 4 5 10 30 50
y 5.145 4.139 3.809 3.640 3.542 3.341 3.212 3.181
(a) x and y,
(b) $\frac{1}{x}$ and y,
(c) ln x and y.
Use your answers to parts (a), (b) and (c) to explain which of $y = a + bx$, $y = a + \frac{b}{x}$,
$y = a + b\ln x$ is the best model for this set of data.
Solution:
(b) To obtain the r-value (or regression model) for transformed data
2.
Press .
4.
Press
5.
Press to generate all the values of $\frac{1}{x}$.
to enter the “LinReg(ax + b)” command. Enter L3 and L2 as the X and Y lists
respectively.
7. Select “Calculate” to obtain the r-value between $\frac{1}{x}$ and y. The regression line for y
on $\frac{1}{x}$ is also obtained (but not required for this question).
Therefore, r-value between $\frac{1}{x}$ and y = 1.0000 (to 4 d.p.)
(c) Following a procedure similar to part (b) (set L4 as ln L1, then set X and Y
lists as L4 and L2 respectively in the “LinReg(ax + b)” command).
r-value between ln x and y = –0.8547 (to 4 d.p.)
Since |r| is closest to 1 for part (b), the best model for this set of data is $y = a + \frac{b}{x}$.
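The same comparison can be carried out without a GC by transforming the x-values and computing r for each transformed data set. This is a minimal sketch; `pearson_r` is a helper defined for this example.

```python
from math import log, sqrt

# Data from Example 5.1.
xs = [1, 2, 3, 4, 5, 10, 30, 50]
ys = [5.145, 4.139, 3.809, 3.640, 3.542, 3.341, 3.212, 3.181]

def pearson_r(us, vs):
    """Product moment correlation coefficient (deviations form)."""
    n = len(us)
    u_bar, v_bar = sum(us) / n, sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / sqrt(s_uu * s_vv)

r_a = pearson_r(xs, ys)                        # (a) x and y
r_b = pearson_r([1 / x for x in xs], ys)       # (b) 1/x and y
r_c = pearson_r([log(x) for x in xs], ys)      # (c) ln x and y

for label, r in [("x", r_a), ("1/x", r_b), ("ln x", r_c)]:
    print(f"r between {label} and y = {r:.4f}")
```

The transformation whose |r| is closest to 1 indicates the model that is most nearly linear after transformation, which here is the one using 1/x.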
Example 5.2.
A car is placed in a wind tunnel and the drag force F for different wind speeds, v,
in appropriate units, is recorded. The results are shown in the table.
v 0 4 8 12 16 20 24 28 32 36
F 0 2.5 5.1 8.8 11.2 13.6 17.6 22.0 27.8 33.9
(i) Draw the scatter diagram for these values, labeling the axes clearly.
(ii) It is thought that the drag force F can be modeled by one of the formulae
$F = a + bv$ or $F = c + dv^2$.
Solution:
(i)
(ii) Since the points appear to follow a curve (or trend) that is increasing at an
increasing rate (with respect to the variable v), the model $F = c + dv^2$ is the
better model in this case.
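The argument above is based on the shape of the scatter diagram. As an additional check in the spirit of Example 5.1 (not part of the original solution), one could compare the r-value between v and F with the r-value between v² and F. The sketch below does this; `pearson_r` is a helper defined for this example.

```python
from math import sqrt

# Data from Example 5.2.
vs = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
fs = [0, 2.5, 5.1, 8.8, 11.2, 13.6, 17.6, 22.0, 27.8, 33.9]

def pearson_r(us, ws):
    """Product moment correlation coefficient (deviations form)."""
    n = len(us)
    u_bar, w_bar = sum(us) / n, sum(ws) / n
    s_uw = sum((u - u_bar) * (w - w_bar) for u, w in zip(us, ws))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_ww = sum((w - w_bar) ** 2 for w in ws)
    return s_uw / sqrt(s_uu * s_ww)

r_linear = pearson_r(vs, fs)                        # model F = a + bv
r_quadratic = pearson_r([v ** 2 for v in vs], fs)   # model F = c + dv^2

print(f"r between v and F:   {r_linear:.4f}")
print(f"r between v^2 and F: {r_quadratic:.4f}")
```

Both values are close to 1, but the r-value for v² and F is the larger of the two, which agrees with the conclusion drawn from the scatter diagram.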
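Appendix A
To show that
$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} \quad \text{and} \quad r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
are equivalent.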
Proof
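$$\begin{aligned}
\sum (x-\bar{x})^2 &= \sum \left(x^2 - 2x\bar{x} + \bar{x}^2\right) \\
&= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
&= \sum x^2 - 2\left(\frac{\sum x}{n}\right)\sum x + n\left(\frac{\sum x}{n}\right)^2 \\
&= \sum x^2 - \frac{2}{n}\left(\sum x\right)^2 + \frac{1}{n}\left(\sum x\right)^2 \\
&= \sum x^2 - \frac{1}{n}\left(\sum x\right)^2
\end{aligned}$$
The proof for $\sum (y-\bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$ is similar.
$$\begin{aligned}
\sum (x-\bar{x})(y-\bar{y}) &= \sum \left(xy - \bar{x}y - x\bar{y} + \bar{x}\bar{y}\right) \\
&= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\bar{x}\bar{y} \\
&= \sum xy - \frac{\sum x}{n}\sum y - \frac{\sum y}{n}\sum x + n\left(\frac{\sum x}{n}\right)\left(\frac{\sum y}{n}\right) \\
&= \sum xy - \frac{\sum x \sum y}{n}
\end{aligned}$$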
Appendix B
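Now, $\sum e_i^2 = \sum (y_i - a - bx_i)^2$.
Differentiating with respect to a, we get $\frac{\partial}{\partial a}\sum e_i^2 = -2\sum (y_i - a - bx_i)$. (Here $\partial$ denotes the partial derivative.)
Setting $\frac{\partial}{\partial a}\sum e_i^2 = 0$, we get $\sum (y_i - a - bx_i) = 0$.
Hence, $\sum y_i = na + b\sum x_i$ --- Eqn (1)
Differentiating with respect to b, we get $\frac{\partial}{\partial b}\sum e_i^2 = 2\sum (y_i - a - bx_i)(-x_i)$.
Setting $\frac{\partial}{\partial b}\sum e_i^2 = 0$, we get $-2\sum (x_i y_i - ax_i - bx_i^2) = 0$.
Hence, $\sum x_i y_i = a\sum x_i + b\sum x_i^2$ --- Eqn (2)
Thus, it can be shown that $b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}}$,
where $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ (or equivalently $\sum x^2 - n\bar{x}^2$) and
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ (or equivalently $\sum xy - n\bar{x}\bar{y}$).
[Eqn (1) ÷ n] gives $\bar{y} = a + b\bar{x} \Rightarrow a = \bar{y} - b\bar{x}$.
Hence the regression line of y on x is
$y = (\bar{y} - b\bar{x}) + bx \Rightarrow y - \bar{y} = b(x - \bar{x})$, where $b = \frac{S_{xy}}{S_{xx}}$.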
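As a numerical sanity check on the derivation, Eqn (1) and Eqn (2) can be solved directly as a pair of simultaneous equations in a and b and compared with the closed-form result $b = S_{xy}/S_{xx}$, $a = \bar{y} - b\bar{x}$. The sketch below uses the Example 2.1 data and Cramer's rule; the variable names are chosen for this illustration.

```python
# Data from Example 2.1.
xs = [25, 50, 60, 80, 90]
ys = [80, 90, 50, 44, 10]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Eqn (1): sum_y  = n * a     + sum_x  * b
# Eqn (2): sum_xy = sum_x * a + sum_x2 * b
# Solve the 2x2 system for a and b by Cramer's rule.
det = n * sum_x2 - sum_x * sum_x
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

# Closed-form result from the derivation.
x_bar, y_bar = sum_x / n, sum_y / n
b_closed = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a_closed = y_bar - b_closed * x_bar

print(a, b)
print(a_closed, b_closed)
```

Both routes give the same intercept and gradient, matching the y on x line found in Example 2.1.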
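Now, $\sum d_i^2 = \sum (x_i - a - by_i)^2$, where a and b here denote the intercept and gradient of the x on y line, x = a + by.
By a similar argument,
$\sum x_i = na + b\sum y_i$ --- Eqn (3)
$\sum x_i y_i = a\sum y_i + b\sum y_i^2$ --- Eqn (4)
Thus, $b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum y_i^2 - n\bar{y}^2} = \frac{S_{xy}}{S_{yy}}$ and $a = \bar{x} - b\bar{y}$,
where $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$ (or equivalently $\sum y^2 - n\bar{y}^2$) and
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ (or equivalently $\sum xy - n\bar{x}\bar{y}$).
Hence the regression line of x on y is
$x = (\bar{x} - b\bar{y}) + by \Rightarrow x - \bar{x} = b(y - \bar{y})$, where $b = \frac{S_{xy}}{S_{yy}}$,
i.e. $y - \bar{y} = \frac{1}{b}(x - \bar{x})$, where $b = \frac{S_{xy}}{S_{yy}}$.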