Chapter 23 Correlation and Linear Regression Lecture Notes
H2 Mathematics (9758)
Chapter 23 Correlation and Linear Regression
Lecture Notes
Success Criteria
Surface Learning
- Distinguish between an independent variable and a dependent variable from a bivariate data.
- Plot a scatter diagram for a set of bivariate data using GC.
- Calculate the value of the product moment correlation coefficient using GC and interpret it by relating the value (in particular values close to 1, 0, –1) to the appearance of the scatter diagram.
- Calculate the equation of the least squares regression line using GC and interpret its slope and intercept.
- Obtain a prediction or estimate of a value using a suitable regression line.

Deep Learning
- Use scatter diagram and product moment correlation coefficient to explain if there is a plausible linear relationship between two variables.
- Use an appropriate transformation (square, reciprocal, logarithmic) to linearise a set of bivariate data to fit the regression model.
- Use the concepts of interpolation and extrapolation to indicate the reliability of an estimate made using the regression line.
- Apply concepts of linear regression and the method of least squares to find the equation of the regression line.

Transfer Learning
- Explain and illustrate the method of least squares in finding the equation of the regression line.
- Explain that a high correlation between two variables does not necessarily imply one directly causes the other.
Introduction
In this chapter, we shall look at methods which investigate whether two quantitative variables X and
Y are related. In the event that they are linearly related, we then seek to find a "best-fit line" for the
observed values of X and Y.
Page 1 of 23
Chapter 23 Correlation and Linear Regression TMJC 2024
Example 1
A student was tasked to find a relationship between the square root of the length of a pendulum, √L, and
the period of the pendulum, T. The period of a pendulum is the time taken for the pendulum to complete one cycle.
To achieve this, he conducted an experiment with different lengths and recorded the corresponding periods
using a stopwatch.
Step 1
√L and T are a pair of bivariate data.
√L (or L) is the independent (controlled) variable while
T is the dependent (recorded) variable.
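Step 2
Bivariate data is represented by plotting them on a scatter diagram.
Step 3
A least squares regression (best-fit) line is drawn.
[Figure: scatter diagram with best-fit line; horizontal axis marked from 0.447 to 0.837, vertical axis T/s marked from 0.912 to 1.63.]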
Step 4
Use the regression line to make predictions.
Question: Estimate the period of the pendulum when the length of pendulum is 0.36.
Question: Can we use GC to draw the sketch and find the best-fit line?
GC Keystrokes
1. Press STAT → 1:Edit
Note:
(i) Use the TRACE function to read the data points on the screen. Use the right and left arrow keys
to move from point to point.
T=
§1 Bivariate Samples
Bivariate data is data for which there are two variables for each observation.
Examples of data associated with two variables:
We usually assume that we can set the values of the independent variables accurately but that our
observations of the values of the dependent variables will be subject to some level of error or natural
variation. Thus, for analysis, we make the dependent variable a function of the independent variable.
Note: On a graph, the independent variable is plotted on the horizontal axis and the dependent
variable on the vertical axis.
§2 Scatter Diagram
A scatter diagram is often used to see whether there is any relationship between two variables.
Examples:
A scatter diagram gives us an idea of how two variables are related. The closer the points are to a
straight line, the stronger the linear relationship between the two variables. Depending on whether the points
“slope” upwards or downwards, the relationship is positive or negative respectively.
Example 2
The following observations on X and Y have been reported.
Give a sketch of the scatter diagram for the observations. What do you observe?
Solution:
[Figure: GC scatter diagram of y against x, with x marked from 4 to 382 and y marked from 2 to 190.]
A general linear trend is observed but one of the data points, (291, 50), falls outside the overall pattern of the data.
It is known as an _______ (or anomaly or anomalous point).
Note: Be careful of points near the axes as they may not be clearly seen from GC.
§3 Correlation
In Example (a) on page 5, an increase of one variable is associated with an increase in the other
variable. This correlation is positive and we say that performances in Mathematics and Physics are
positively correlated.
In Example (b) on page 5, an increase in one variable is associated with a decrease in the other
variable. This correlation is negative and we say that sales of DVD and cinema tickets are negatively
correlated.
In Example (d) on page 5, no pattern can be seen. There is no correlation between a person’s weight
and the amount they earn every month.
The product moment correlation coefficient, r, is used to measure the degree of linear relationship
between two variables X and Y. It can only take values between −1 and 1 (both inclusive). Its sign
(positive or negative) depends on whether the relationship is positive or negative.
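\[
r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \, \sum (y-\bar{y})^2}}
  = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
  \quad \text{(in MF26)}
\]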
x 49 51 54 58 63 64 68 70 75 78
y 90 88 85 91 82 85 76 77 70 71
Example 3
(a) Given $n = 10$, $\sum x = 630$, $\sum x^2 = 40580$, $\sum y = 815$, $\sum y^2 = 66945$, $\sum xy = 50718$, calculate the
product moment correlation coefficient, r.
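Solution:
\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
  = \frac{50718 - \dfrac{(630)(815)}{10}}{\sqrt{\left(40580 - \dfrac{630^2}{10}\right)\left(66945 - \dfrac{815^2}{10}\right)}}
  = -0.919 \text{ (3 s.f.)}
\]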
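As a cross-check on the arithmetic in Example 3(a), the summation form of r can be evaluated directly from the raw x and y values in the table above. The following is a minimal Python sketch, not a replacement for the GC; the function name `pmcc` is our own label.

```python
from math import sqrt

# Raw data from the table given before Example 3.
x = [49, 51, 54, 58, 63, 64, 68, 70, 75, 78]
y = [90, 88, 85, 91, 82, 85, 76, 77, 70, 71]

def pmcc(xs, ys):
    """Product moment correlation coefficient using the summation form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(v * v for v in xs)
    sum_yy = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator

r = pmcc(x, y)
print(round(r, 3))  # -0.919
```

The intermediate sums reproduce the values quoted in the question (for instance, the x values add up to 630), so the result agrees with the worked solution.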
(1) Enter the x-data into L1, and the y-data into L2.
(b) From Example 1, calculate the product moment correlation coefficient between L and T.
Solution:
Using GC, r = 0.994 (3 s.f.) (we found this value in page 4 when finding the equation of the best-fit line)
Note: Refer to the scatter diagram below which was obtained previously in Example 1.
D / cm           20    30    40    50    60    70
√D / cm^(1/2)    4.47  5.48  6.33  7.07  7.75  8.37
Solution:
From GC,
Note:
Observe from the tables that D = 100L (so √D = 10√L). Since r is independent of the scale of measurement, the values
of r in Examples 3 and 4 are the same.
(To get a good sense of how to estimate correlation coefficients given a scatter diagram, go to the following website to
access the Guessing Correlation app: http://istics.net/Correlations/)
Learning Point:
Generally, r is independent of the scale of measurement,
i.e. if $u = ax + b$ and $v = cy + d$, where a, b, c, d are constants ($a > 0$ and $c > 0$, or $a < 0$ and
$c < 0$), then $r_{xy} = r_{uv}$.
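The learning point can be checked numerically. The sketch below reuses the x and y values from Example 3(a); the scaling constants are arbitrary choices of ours for illustration. It also shows the standard complementary fact that r changes sign (but not size) when a and c have opposite signs, which is why the condition on the signs is needed.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient using deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)

x = [49, 51, 54, 58, 63, 64, 68, 70, 75, 78]
y = [90, 88, 85, 91, 82, 85, 76, 77, 70, 71]

# Illustrative constants (our assumption): a = 100, b = 5, c = 0.5, d = -3.
u = [100 * v + 5 for v in x]
w = [0.5 * v - 3 for v in y]

# Same signs for a and c: r is unchanged.
r_xy = pmcc(x, y)
r_uv = pmcc(u, w)

# Opposite signs (a = -2, c = 0.5): r keeps its size but changes sign.
u_flipped = [-2 * v + 1 for v in x]
r_flipped = pmcc(u_flipped, w)

print(round(r_xy, 4), round(r_uv, 4), round(r_flipped, 4))
```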
[Figure: scatter diagrams illustrating different values of r. The surviving captions are listed below.]
No apparent linear correlation: r is close to 0.
Strong negative linear correlation: r is close to −1.
Perfect negative linear correlation: r = −1.
§4 Regression Lines
In this chapter, we are interested in finding the "best fit" line for bivariate data points where X and
Y are linearly correlated. The best fit line allows us to predict values of y from known values of x
and vice versa.
If all the points in a scatter diagram lie approximately along a straight line, we say there is a
linear correlation between X and Y, and we try to fit the best possible straight line to the data. Such a
line is called a regression line. We find it by the least squares method.
(2) The estimated regression line y = a + bx is drawn through these points in such a way that
the sum of squares of the deviations of the points from the line, $\sum_{i=1}^{n} e_i^2$, is a minimum,
where $e_i$ is the deviation of the observed value $y_i$ from the predicted value given by the line
y = a + bx at the point $x_i$, i.e. $e_i = y_i - (a + bx_i)$.
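Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and the line y = a + bx.
[Figure: scatter diagram of the points $(x_1, y_1), \ldots, (x_n, y_n)$ with the line y = a + bx. At each $x_i$, the vertical deviation $e_i$ is the gap between the observed value $y_i$ and the predicted value $a + bx_i$ on the line; $e_1$, $e_2$, $e_3$ and $e_n$ are marked.]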
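i | x_i | y_i | Predicted value a + bx_i | e_i = y_i − (a + bx_i) | e_i^2
… | … | … | … | … | …
n | x_n | y_n | a + bx_n | e_n = y_n − (a + bx_n) | e_n^2 = (y_n − (a + bx_n))^2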
Aim: Find the values of a and b such that the errors $e_i$ are as small as possible.
Hence, we minimise $\sum_{i=1}^{n} e_i^2$.
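For readers who want to see where the least squares values of a and b come from, the following is a brief sketch of the standard calculus argument. It is supplementary to these notes: the symbol S for the sum of squared deviations is our own label.

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (a + bx_i)\bigr)^2 \\[4pt]
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} \bigl(y_i - a - bx_i\bigr) = 0
  \;\Longrightarrow\; a = \bar{y} - b\bar{x} \\[4pt]
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} x_i\bigl(y_i - a - bx_i\bigr) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i\bigl((y_i - \bar{y}) - b(x_i - \bar{x})\bigr) = 0 \\[4pt]
b &= \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})}
   = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{aligned}
```

The last step uses $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$. Since $a = \bar{y} - b\bar{x}$, the line can be written as $y - \bar{y} = b(x - \bar{x})$, so it always passes through the point $(\bar{x}, \bar{y})$.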
(MF26)
\[
y - \bar{y} = b\,(x - \bar{x}), \quad \text{where the gradient } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}
\]
Note: For the “least squares” method to find the regression line of x on y : x = c + dy, refer to
Annex B.
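The MF26 formula can be applied by hand or checked in a few lines of code. The sketch below is a minimal illustration using the x and y values from Example 3(a); the function names are our own. It computes a and b, then confirms that nudging either value away from the least squares estimate increases the sum of squared deviations.

```python
x = [49, 51, 54, 58, 63, 64, 68, 70, 75, 78]
y = [90, 88, 85, 91, 82, 85, 76, 77, 70, 71]

def least_squares(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(xs, ys))
    s_xx = sum((xv - x_bar) ** 2 for xv in xs)
    b = s_xy / s_xx          # gradient, as in the MF26 formula
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    return a, b

def sum_sq_dev(a, b, xs, ys):
    """Sum of squared vertical deviations e_i = y_i - (a + b x_i)."""
    return sum((yv - (a + b * xv)) ** 2 for xv, yv in zip(xs, ys))

a, b = least_squares(x, y)
best = sum_sq_dev(a, b, x, y)

# Any small change to a or b should give a larger sum of squares.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.01), (0.0, -0.01)]
worse = [sum_sq_dev(a + da, b + db, x, y) for da, db in nudges]

print(round(a, 3), round(b, 4))
print(all(w > best for w in worse))
```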
For any bivariate set of data connecting variables X and Y, there are two uniquely defined regression
lines:
(a) Regression line of y on x ( y = a + bx )
X is the independent variable and Y is the dependent variable
y is expressed in terms of x
Remark: Before using regression lines to predict values, we need to know whether regression
line of y on x or x on y should be used.
E.g. The speed at which a certain sea animal swims (Y) depends on the angle through which the
hind feet move (X).
X is the independent variable, and Y, which depends on X, is the dependent variable.
Recall: We usually assume that the values of independent variables are set more accurately and
observed values of dependent variables will be subject to some level of error.
t 4 6 7 9 10 12 13 14
x 5.39 6.96 6.70 7.60 8.33 9.10 11.50 11.30
Solution:
Since the experiment investigates how x varies with t, t is the ________________ variable.
From GC,
(2) Press !.
Example 6
Eight pairs of observations on the variables x and y are given below:
x 32 12 21 50 45 60 15 56
y 40 86 52 42 50 8 75 15
Solution:
(1) Enter the x-data into L1, and the y-data into L2.
Alternatively
Note that the equation is actually $y = \dfrac{x - c}{d}$, where c = a and d = b in the GC. Hence, you may type
$y = \dfrac{x - a}{b}$, where a and b may be recalled by pressing VARS and 5: Statistics and moving to EQ.
2. For the regression line of y on x, the slope/gradient, b, gives the increase (b > 0) or decrease
(b < 0) in y for every unit increase in x.
Observations:
If $r \neq 0$, we either have two lines with positive gradient ($r > 0$) or two lines with negative gradient
($r < 0$).
[Figure: sketch on x-y axes showing the regression line of x on y.]
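The observation about the two regression lines can be illustrated with the data of Example 6. The sketch below is our own illustration: it computes the gradient b of the y on x line and the gradient d of the x on y line (x = c + dy), and checks two standard facts, namely that b and d share the sign of r (indeed b × d = r²) and that both lines pass through the mean point.

```python
from math import sqrt

# Data from Example 6.
x = [32, 12, 21, 50, 45, 60, 15, 56]
y = [40, 86, 52, 42, 50, 8, 75, 15]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(x, y))
s_xx = sum((xv - x_bar) ** 2 for xv in x)
s_yy = sum((yv - y_bar) ** 2 for yv in y)

# Regression line of y on x: y = a + b x
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regression line of x on y: x = c + d y
d = s_xy / s_yy
c = x_bar - d * y_bar

r = s_xy / sqrt(s_xx * s_yy)

print(f"y on x: y = {a:.3f} + ({b:.4f})x")
print(f"x on y: x = {c:.3f} + ({d:.4f})y")
print(f"r = {r:.4f}, b*d = {b * d:.4f}, r^2 = {r * r:.4f}")
```

On x-y axes the x on y line has gradient 1/d, which has the same sign as d, so the two plotted lines slope the same way whenever r is non-zero.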
Activity
x 34 57 43 65 72 46 51 54
y 21.6 58.1 32.8 59.3 60.6 50.2 56.8 49.5
(Adapted from 2007/NJC/Prelim/II/12)
Which of the following models is most appropriate? Explain why your chosen model is the most
appropriate.
(A) $y = a + bx$          (B) $y = a + bx^2$
(C) $y = a + b\ln x$      (D) $y = a + \dfrac{b}{x}$
Submit your answer through this QR code! Solutions will be discussed in the next lecture.
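One way to explore the Activity outside the GC is to compute r between y and each transformed version of x. The sketch below is a rough aid only, using the Activity data; the dictionary labels are ours. A value of |r| close to 1 supports a model, but the scatter diagram of the transformed data should still be inspected before choosing.

```python
from math import log, sqrt

# Data from the Activity.
x = [34, 57, 43, 65, 72, 46, 51, 54]
y = [21.6, 58.1, 32.8, 59.3, 60.6, 50.2, 56.8, 49.5]

def pmcc(xs, ys):
    """Product moment correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Each model is linear in y against a transformed x.
transformed_x = {
    "(A) y on x": x,
    "(B) y on x^2": [v ** 2 for v in x],
    "(C) y on ln x": [log(v) for v in x],
    "(D) y on 1/x": [1 / v for v in x],
}

results = {label: pmcc(tx, y) for label, tx in transformed_x.items()}
for label, r in results.items():
    print(f"{label}: r = {r:.4f}")
```

For model (D), r is negative because 1/x decreases as x increases; it is the size of r that is compared across models.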
§5 Estimation of Values
Once the regression lines are found, we can use them to estimate x and y for a given value of y and x
respectively.
Interpolation
When we work within the given range of values of the data, it is known as interpolation.
Extrapolation
When we work outside the given range of values of the data, it is known as extrapolation.
Extrapolation should be used with caution as the relationship between the two variables may not be
linear outside the range of the data.
An estimate is not reliable if it is outside the data range (even if the product moment correlation
coefficient is high i.e. close to 1) as the linear relation between the 2 variables may no longer
hold.
Example 7
A random sample of ten pairs of values of x and y is used to obtain the following regression line of
y on x, where x is the age in years ($5 \le x \le 16$) and y is the height, in cm, of 10 boys.
$y = 4.4617x + 87.431$.
(i) Use the regression line of y on x to estimate the height of a 9-year-old boy. Given that the
product moment correlation coefficient of the data is 0.903, comment on the reliability of
your answer.
(ii) Predict the value of y given by the regression line when x = 40 and comment on your
answer.
Solution:
(i) When x = 9, y = 4.4617(9) + 87.431 = 127.5863 ≈ 128 (3 s.f.)
Therefore, the estimated height of a 9-year old boy is 128 cm.
Since ____________________________________________, it indicates a ____________
linear correlation between x and y. Hence, the estimated height of a 9-year-old boy is
reliable.
(ii) When x = 40, y = 4.4617(40) + 87.431 = 265.899 ≈ 266 (3 s.f.)
Since __________________________________, the linear correlation between x and y may
not hold. Hence, the estimated value of y is unreliable.
Moreover, it is unlikely for a person to be 266 cm tall as the heights of people do not vary
linearly with their age. There is a limit for growth.
Note:
If raw data is given and the equation of the y on x line is found using GC, you may use Store RegEQ:
Y1 and input Y1(9) and Y1(40) to obtain the estimates.
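The same idea as Y1(9) and Y1(40) can be sketched in code, with an added flag that records whether the estimate is an interpolation. This is a minimal illustration: the function name is hypothetical, and treating the data range as ages 5 to 16 inclusive is our assumption.

```python
def estimate_height(age, age_min=5, age_max=16):
    """Return (estimated height in cm, True if age lies within the data range).

    Uses the regression line of Example 7, y = 4.4617x + 87.431.
    The inclusive range 5 to 16 is an assumption about the sampled ages.
    """
    height = 4.4617 * age + 87.431
    is_interpolation = age_min <= age <= age_max
    return height, is_interpolation

for age in (9, 40):
    height, inside = estimate_height(age)
    kind = "interpolation" if inside else "extrapolation (unreliable)"
    print(f"x = {age}: y = {height:.4f} cm, {kind}")
```

The flag only checks the data range. As in part (i), the reliability of an interpolated estimate also depends on how close r is to 1 or −1.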
Examples
(i) $y = p + qx^2$: the relationship between y and $x^2$ is linear.
(ii) $y = pe^{qx} \rightarrow \ln y = \ln p + qx$: the relationship between ln y and x is linear.
(iii) $y = px^q \rightarrow \ln y = \ln p + q\ln x$: the relationship between ln y and ln x is linear.
We can then use the data for the two ‘transformed variables’ to find the equation of the regression
line. In other words, we ‘linearise’ the set of bivariate data. (Refer to Annex C)
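To illustrate how linearising recovers the constants of a non-linear model, the sketch below generates hypothetical data that follow y = pe^{qx} exactly (with p = 2 and q = 0.3, values chosen by us and not taken from the notes), fits the regression line of ln y on x, and converts the intercept back to p.

```python
from math import exp, log

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(xs, ys))
    s_xx = sum((xv - x_bar) ** 2 for xv in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data generated from y = 2 e^(0.3 x); not from the notes.
p_true, q_true = 2.0, 0.3
xs = [1, 2, 3, 4, 5, 6]
ys = [p_true * exp(q_true * v) for v in xs]

# Transform: ln y = ln p + q x, so regress ln y on x.
ln_ys = [log(v) for v in ys]
a, b = least_squares(xs, ln_ys)

# The intercept a estimates ln p and the gradient b estimates q.
p_est, q_est = exp(a), b
print(round(p_est, 4), round(q_est, 4))
```

With real measurements the points would not lie exactly on the curve, so the recovered p and q would only be estimates.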
Solution:
From GC, r =
From GC,
least squares estimate of a = 99.787 ≈ 99.8 (3 s.f.)
least squares estimate of b = 47.073 ≈ 47.1 (3 s.f.)
For the transformed data, the product-moment correlation coefficient is 0.992 which is close
to 1.
Annex A
Learn the importance of scatter diagram in interpreting correlation and regression (Frank
Anscombe 1973)
Consider the four sets of bivariate data given below. Using GC, sketch the scatterplots, compare their
correlation coefficients and equations of y on x.
x1 y1 x2 y2 x3 y3 x4 y4
4 4.26 4 3.10 4 5.39 8 6.58
5 5.68 5 4.74 5 5.73 8 5.76
6 7.24 6 6.13 6 6.08 8 7.71
7 4.82 7 7.26 7 6.42 8 8.84
8 6.95 8 8.14 8 6.77 8 8.47
9 8.81 9 8.77 9 7.11 8 7.04
10 8.04 10 9.14 10 7.46 8 5.25
11 8.33 11 9.26 11 7.81 8 5.56
12 10.84 12 9.13 12 8.15 8 7.91
13 7.58 13 8.74 13 12.74 8 6.89
14 9.96 14 8.10 14 8.84 19 12.5
[Scatter diagrams of the four data sets omitted.]
Data set | Description | Regression line | Correlation coefficient
(x1, y1) | Moderate linear relationship | y = 0.5001x + 3.0001 | r = 0.8164 (r² = 0.6665)
(x2, y2) | Curvilinear relationship | y = 0.5x + 3.0009 | r = 0.8162 (r² = 0.6662)
(x3, y3) | Extreme observation much higher than the other points | y = 0.4997x + 3.0025 | r = 0.8163 (r² = 0.6663)
(x4, y4) | No line. All points at x = 8 except one point at x = 19 | y = 0.4999x + 3.0017 | r = 0.8165 (r² = 0.6667)
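The summary figures for the four data sets can be reproduced with a short script. The sketch below is our own check, using the table of data above. It gives near-identical regression lines for all four sets, with r ≈ 0.816 and r² ≈ 0.667, even though the scatter diagrams look completely different.

```python
from math import sqrt

x_common = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
datasets = {
    "set 1": (x_common, [4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96]),
    "set 2": (x_common, [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]),
    "set 3": (x_common, [5.39, 5.73, 6.08, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84]),
    "set 4": ([8] * 10 + [19], [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.5]),
}

def fit(xs, ys):
    """Return (a, b, r) for the y on x line y = a + b x and the PMCC r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(xs, ys))
    s_xx = sum((xv - x_bar) ** 2 for xv in xs)
    s_yy = sum((yv - y_bar) ** 2 for yv in ys)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / sqrt(s_xx * s_yy)

results = {name: fit(xs, ys) for name, (xs, ys) in datasets.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: y = {b:.4f}x + {a:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}")
```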
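Annex B
[Figure: scatter diagram of the points $(x_1, y_1), \ldots, (x_n, y_n)$ with the line x = c + dy. At each $y_i$, the horizontal deviation $h_i$ is the gap between the observed value of x, $x_i$, and the predicted value of x, $c + dy_i$; $h_1$, $h_2$ and $h_3$ are marked.]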
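i | y_i | x_i | Predicted value c + dy_i | h_i = x_i − (c + dy_i) | h_i^2
1 | y_1 | x_1 | c + dy_1 | h_1 = x_1 − (c + dy_1) | h_1^2 = (x_1 − (c + dy_1))^2
2 | y_2 | x_2 | c + dy_2 | h_2 = x_2 − (c + dy_2) | h_2^2 = (x_2 − (c + dy_2))^2
… | … | … | … | … | …
n | y_n | x_n | c + dy_n | h_n = x_n − (c + dy_n) | h_n^2 = (x_n − (c + dy_n))^2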
Aim: Find the values of c and d such that the errors $h_i$ are as small as possible.
Hence, we minimise $\sum_{i=1}^{n} h_i^2$.
\[
x - \bar{x} = d\,(y - \bar{y}), \quad \text{where the gradient } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}
\]
Annex C
To Clear Lists
Press 2ND +
Choose 4: ClrAllLists
Press ENTER
Transformations

(1) $y = p + qx^2$
Relationship between y and $x^2$ is linear: regress y on $x^2$.
Lists: x-list → L1, y-list → L2, $x^2$-list → L3 = (L1)^2
GC: y = a + bx is the least squares regression line of y on $x^2$, i.e. $y = p + qx^2$.

(2) $y = pe^{qx}$, so $\ln y = \ln p + qx$
Relationship between ln y and x is linear: regress ln y on x.
Lists: x-list → L1, y-list → L2, ln y-list → L3 = ln(L2)
GC: y = a + bx is the least squares regression line of ln y on x, i.e. $\ln y = \ln p + qx$.

(3) $y = px^q$, so $\ln y = \ln p + q\ln x$
Relationship between ln y and ln x is linear: regress ln y on ln x.
Lists: x-list → L1, y-list → L2, ln x-list → L3 = ln(L1), ln y-list → L4 = ln(L2)
GC: y = a + bx is the least squares regression line of ln y on ln x, i.e. $\ln y = \ln p + q\ln x$.
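The list operations above have a direct analogue outside the GC. The sketch below mirrors the first transformation, with Python lists named after the GC lists (L1, L2, L3). The data are hypothetical, generated by us from y = 1.5 + 0.25x² so that the expected answer is known.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((xv - x_bar) * (yv - y_bar) for xv, yv in zip(xs, ys))
    s_xx = sum((xv - x_bar) ** 2 for xv in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data from y = 1.5 + 0.25 x^2 (not from the notes).
L1 = [1, 2, 3, 4, 5, 6]                  # x-list
L2 = [1.5 + 0.25 * v ** 2 for v in L1]   # y-list

# Transformed list, as in L3 = (L1)^2 on the GC.
L3 = [v ** 2 for v in L1]

# Regress y (L2) on x^2 (L3): the GC's a and b play the roles of p and q.
p_est, q_est = least_squares(L3, L2)
print(round(p_est, 4), round(q_est, 4))
```

The other two transformations follow the same pattern, with the natural logarithm applied to the relevant lists before fitting the line.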