
Chapter 23 Correlation and Linear Regression Lecture Notes


Chapter 23 Correlation and Linear Regression TMJC 2024

H2 Mathematics (9758)
Chapter 23 Correlation and Linear Regression
Lecture Notes

Success Criteria
Surface Learning
- Distinguish between an independent variable and a dependent variable in a set of bivariate data.
- Plot a scatter diagram for a set of bivariate data using GC.
- Calculate the value of the product moment correlation coefficient using GC and interpret it by relating the value (in particular values close to 1, 0, –1) to the appearance of the scatter diagram.
- Calculate the equation of the least squares regression line using GC and interpret its slope and intercept.
- Obtain a prediction or estimate of a value using a suitable regression line.

Deep Learning
- Use a scatter diagram and the product moment correlation coefficient to explain if there is a plausible linear relationship between two variables.
- Use an appropriate transformation (square, reciprocal, logarithmic) to linearise a set of bivariate data to fit the regression model.
- Use the concepts of interpolation and extrapolation to indicate the reliability of an estimate made using the regression line.
- Apply concepts of linear regression and the method of least squares to find the equation of the regression line.

Transfer Learning
- Explain and illustrate the method of least squares in finding the equation of the regression line.
- Explain that a high correlation between two variables does not necessarily imply one directly causes the other.

Introduction
In this chapter, we shall look at methods which investigate whether two quantitative variables X and
Y are related. In the event that they are linearly related, we then seek to find a "best-fit line" for the
observed values of X and Y.


Example 1
A student was tasked to find a relationship between the period of a pendulum and the square root of
its length. The period of a pendulum is the time taken for the pendulum to complete one cycle.
To achieve this, he carried out an experiment with different lengths and recorded the period for each
length using a stopwatch.

Below is the recorded data:

Let L be the length of the pendulum (in m).
Let T be the period of the pendulum (in s).

L / m                   0.200   0.300   0.400   0.500   0.600   0.700
$\sqrt{L}$ / m^{1/2}    0.447   0.548   0.633   0.707   0.775   0.837
T / s                   0.912   1.11    1.23    1.37    1.44    1.63

Step 1
L and T are a pair of bivariate data.
L (or $\sqrt{L}$) is the independent (controlled) variable while
T is the dependent (recorded) variable.

Step 2
Bivariate data is represented by plotting them on a scatter diagram.

Step 3
A least squares regression (best-fit) line is drawn.

[Scatter diagram with best-fit line: T/s (from 0.912 to 1.63) against $\sqrt{L}$/m^{1/2} (from 0.447 to 0.837)]

Step 4
Use the regression line to make prediction.

Question: Estimate the period of the pendulum when the length of the pendulum is 0.36 m.

When L = 0.36 m, $\sqrt{L}$ = 0.6 m^{1/2},

T = 1.17 s

Question: Can we use GC to draw the sketch and find the best-fit line?


GC Keystrokes

Step 2: Plot a Scatter Diagram

1. Press [STAT] → 1:Edit

2. Enter the L-data into L1.
   Move the cursor to L2 and key in √(L1)
   (L2 consists of the values of $\sqrt{L}$),
   and enter the T-data into L3.

   Note: We will be using L2 ($\sqrt{L}$) and L3 (T) to plot the scatter diagram.

3. Press [2nd] [Y=] to access the STAT PLOT menu.
   (a) Select Plot1
   (b) Set On
   (c) Type: scatter-plot
   (d) Xlist: L2 Ylist: L3
   (e) Mark: + style
   (f) Press [ENTER].
   (Check: Press [2nd] [Y=] and ensure that Plot1 is highlighted.)
   Press [ZOOM]. Select 9: ZoomStat to obtain the full plot.

[GC scatter diagram: T/s (from 0.912 to 1.63) against $\sqrt{L}$/m^{1/2} (from 0.447 to 0.837)]

Note: The origin need not be shown if the axes are not drawn to scale.

Note:
(i) Use the [TRACE] function to read the data points on the screen. Use the right and left arrow keys
to move from point to point.

(ii) When plotting a scatter diagram, you must
(1) Label the axes.
(2) Indicate the minimum and maximum values on each of the axes.
(3) Show the relative positions of the points clearly.
(4) Check that all points are drawn.


Step 3: Find the best-fit (least squares regression) line

(1) Enter the respective data.
    *We will be using L2 ($\sqrt{L}$-data) and L3 (T-data).

(2) Press [MODE] and scroll down to select ‘ON’ for ‘STAT DIAGNOSTICS’ (needed the first time only).
    (It will remain ON even after memory is cleared.)

(3) Press [STAT] [►]. Select 8:LinReg(a+bx).

(4) Xlist: L2 Ylist: L3

(5) Press [ENTER] four times to display the equation of the least squares regression line.

Using GC, the best-fit (least squares regression) line is

(6) Store RegEQ: Y1 ([ALPHA] [TRACE]). Press [ENTER].

(7) Go to graph.

Step 4: For the estimation

When L = 0.36 m, $\sqrt{L}$ = 0.6 m^{1/2},

T =
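For readers without a GC at hand, the four steps of Example 1 can be sketched in plain Python. This is only an illustration, not the GC procedure itself: the variable names and the helper `predict_period` are assumptions made for this sketch, and the least squares formulas it uses are the ones stated later in §4. An estimate read off a hand-drawn line may differ slightly from the computed value.

```python
import math

# Data from Example 1: pendulum length L (in m) and period T (in s).
lengths = [0.200, 0.300, 0.400, 0.500, 0.600, 0.700]
periods = [0.912, 1.11, 1.23, 1.37, 1.44, 1.63]

# The regression is of T on sqrt(L), so transform the lengths first
# (this mirrors keying sqrt(L1) into L2 on the GC).
x = [math.sqrt(v) for v in lengths]
y = periods

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx                     # gradient of the line T = a + b*sqrt(L)
a = y_bar - b * x_bar               # intercept
r = s_xy / math.sqrt(s_xx * s_yy)   # product moment correlation coefficient


def predict_period(length_m):
    """Estimate T (in s) from the fitted line for a given length (in m)."""
    return a + b * math.sqrt(length_m)


print(f"T = {a:.3f} + {b:.3f} * sqrt(L),  r = {r:.3f}")
print(f"Estimated T at L = 0.36 m: {predict_period(0.36):.2f} s")
```

Since L = 0.36 m lies inside the range of the recorded lengths, the estimate is an interpolation.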

Question: How do we decide if the points lie close to a straight line?


§1 Bivariate Samples

Bivariate data is data for which there are two variables for each observation.
Examples of data associated with two variables:

(1) Amount of rain (X) and sales of umbrellas (Y)


(2) Performance in Mathematics (X) and Physics (Y)
(3) Sales of DVD (X) and cinema tickets (Y)
Two questions:
(1) Is there a relationship between the variables? Correlation
(2) If so, can we predict the value of one of the variables if we know the other? Regression
There are two types of variables: Independent (controlled) variable and dependent (to be predicted)
variable.

We usually assume that we can set the values of the independent variables accurately but that our
observations of the values of the dependent variables will be subject to some level of error or natural
variation. Thus, for analysis, we make the dependent variable a function of the independent variable.

Note: On a graph, the independent variable is always plotted on the horizontal axis and the dependent
variable on the vertical axis.

§2 Scatter Diagram

A scatter diagram is often used to see whether there is any relationship between two variables.

Examples:

(a) Physics mark against Math mark: positive linear relationship
(b) Sales of cinema tickets against sales of DVD: negative linear relationship
(c) Average monthly salary against age: curvilinear relationship
(d) Monthly salary against weight: no relationship

A scatter diagram gives us an idea of how two variables are related. The closer the points are to a
straight line, the stronger the linear relationship between the two variables. Depending on whether the points
“slope” upwards or downwards, the relationship is positive or negative respectively.


Example 2
The following observations on X and Y have been reported.

x 4 116 58 213 382 312 247 283 291 204


y 2 78 26 105 190 156 129 143 50 108

Give a sketch of the scatter diagram for the observations. What do you observe?

Solution:
[Scatter diagram of y against x: x ranges from 4 to 382, y ranges from 2 to 190]

A general linear trend is observed but one of the data points (291, 50) falls outside the overall
pattern of the data. It is known as an _______ (or anomaly or anomalous point).

Note: Be careful of points near the axes as they may not be clearly seen from GC.

§3 Correlation

Correlation refers to the relationship between two variables.

In Example (a) of §2, an increase in one variable is associated with an increase in the other
variable. This correlation is positive and we say that performances in Mathematics and Physics are
positively correlated.

In Example (b) of §2, an increase in one variable is associated with a decrease in the other
variable. This correlation is negative and we say that sales of DVD and cinema tickets are negatively
correlated.

In Example (d) of §2, no pattern can be seen. There is no correlation between a person’s weight
and the amount they earn every month.

Caution: Correlation does NOT necessarily imply cause and effect.

Watch the video Correlation vs. Causation at
(https://www.youtube.com/watch?v=Tg6e2Y3IEUk&ab_channel=CodyBaldwin)


3.1 Product Moment Correlation Coefficient, r

The product moment correlation coefficient, r, is used to measure the degree of linear relationship
between two variables X and Y. It can only take values between −1 and 1 (both inclusive). Its sign
(positive or negative) depends on whether the relationship is positive or negative.

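For a sample, the product moment correlation coefficient, r, is defined as

\[
r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \, \sum (y-\bar{y})^2}}
  = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
\qquad \text{(in MF26)}
\]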

where n is the number of ordered pairs in the sample.

Note: With the above definition, it is possible for r to be undefined.

3.2 Calculating the product moment correlation coefficient, r

Raw Data vs Summarised Data

x 49 51 54 58 63 64 68 70 75 78
y 90 88 85 91 82 85 76 77 70 71

Data given as above are known as raw data.

$n = 10,\ \sum x = 630,\ \sum x^2 = 40580,\ \sum y = 815,\ \sum y^2 = 66945,\ \sum xy = 50718$

Data given as above are known as summarised data.

Given a set of raw data, the summarised data may be found using the GC (refer to the GC steps in Example 3).

Example 3

(a) Given $n = 10,\ \sum x = 630,\ \sum x^2 = 40580,\ \sum y = 815,\ \sum y^2 = 66945,\ \sum xy = 50718$, calculate the
product moment correlation coefficient, r.

Solution:

\[
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}
  = \frac{50718 - \dfrac{(630)(815)}{10}}
         {\sqrt{\left(40580 - \dfrac{630^2}{10}\right)\left(66945 - \dfrac{815^2}{10}\right)}}
  = -0.919 \ \text{(3 s.f.)}
\]
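The substitution in Example 3(a) can be checked with a few lines of Python. This is a minimal sketch that uses only the summarised data given in the question; the variable names are chosen for this illustration.

```python
import math

# Summarised data from Example 3(a).
n = 10
sum_x, sum_x2 = 630, 40580
sum_y, sum_y2 = 815, 66945
sum_xy = 50718

# Numerator and denominator of the MF26 formula for r.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))

r = numerator / denominator
print(round(r, 3))  # -0.919
```

The numerator works out to −627, so the negative sign of r comes directly from $\sum xy$ being smaller than $\frac{\sum x \sum y}{n}$.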


Finding Summarised Data

(1) Enter the x-data into L1, and the y-data into L2.

(2) Press [STAT] [►]. Select 2: 2-Var Stats.

(3) Xlist: L1 Ylist: L2

(4) Press [ENTER] three times. The statistics for X are displayed.
    Press [▼] a few times to obtain the statistics for Y.

(b) From Example 1, calculate the product moment correlation coefficient between $\sqrt{L}$ and T.

L / m                   0.200   0.300   0.400   0.500   0.600   0.700
$\sqrt{L}$ / m^{1/2}    0.447   0.548   0.633   0.707   0.775   0.837
T / s                   0.912   1.11    1.23    1.37    1.44    1.63

(We are using the $\sqrt{L}$ and T rows as the bivariate data.)

Solution:
Using GC, r = 0.994 (3 s.f.) (we found this value in Step 3 of Example 1 when finding the equation of the best-fit line)

Note: Refer to the scatter diagram which was obtained previously in Example 1.

The points on the scatter diagram lie close to a straight line with _______________. This
agrees with the value of r (= 0.994) which indicates a ___________________________.


Example 4 (Let’s Investigate)


From Example 1, if the length of the pendulum is measured in centimetres (i.e. D = 100L),
calculate the product moment correlation coefficient, r, for the following pairs of values of $\sqrt{D}$
and T.

D / cm                   20      30      40      50      60      70
$\sqrt{D}$ / cm^{1/2}    4.47    5.48    6.33    7.07    7.75    8.37
T / s                    0.912   1.11    1.23    1.37    1.44    1.63

Solution:
From GC,

Recall from Example 3:

L / m                   0.200   0.300   0.400   0.500   0.600   0.700
$\sqrt{L}$ / m^{1/2}    0.447   0.548   0.633   0.707   0.775   0.837
T / s                   0.912   1.11    1.23    1.37    1.44    1.63

Note:
Observe from the tables that D = 100L, so $\sqrt{D} = 10\sqrt{L}$. Since r is independent of the scale of measurement, the values
of r in Examples 3 and 4 are the same.
(To get a good sense of how to estimate correlation coefficients given a scatter diagram, go to the following website to
access the Guessing Correlation app: http://istics.net/Correlations/)

Learning Point:
Generally, r is independent of the scale of measurement,
i.e. if u = ax + b and v = cy + d, where a, b, c, d are constants (a > 0 and c > 0, or a < 0 and
c < 0), then $r_{xy} = r_{uv}$.
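The Learning Point can be checked numerically. The sketch below reuses the raw data from §3.2; the helper `pmcc` and the particular constants in the linear transformations are arbitrary choices made for this illustration. The last case, where a and c have opposite signs, falls outside the condition stated above: there the sign of r flips while its magnitude stays the same.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient of paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Raw data from section 3.2.
x = [49, 51, 54, 58, 63, 64, 68, 70, 75, 78]
y = [90, 88, 85, 91, 82, 85, 76, 77, 70, 71]
r_xy = pmcc(x, y)

# Case a > 0 and c > 0 (constants chosen arbitrarily).
r_both_positive = pmcc([100 * xi + 5 for xi in x], [0.5 * yi - 3 for yi in y])

# Case a < 0 and c < 0.
r_both_negative = pmcc([-2 * xi + 1 for xi in x], [-3 * yi + 4 for yi in y])

# Case a > 0 and c < 0: not covered by the Learning Point, r changes sign.
r_opposite_signs = pmcc([2 * xi for xi in x], [-yi for yi in y])

print(r_xy, r_both_positive, r_both_negative, r_opposite_signs)
```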


3.3 Notes on Correlation Coefficient, r


Match the following scatter diagrams with their respective values of r.

- No apparent linear correlation, r is close to 0.
- Curvilinear, but little linear correlation, r is close to 0.
- Strong negative linear correlation, r is close to −1.
- Perfect positive linear correlation, r = 1.
- Strong positive linear correlation, r is close to 1.
- Perfect negative linear correlation, r = −1.

For these scatter diagrams, what are their values of r?

r is ________. r is ________.

Note: For any sample, −1 ≤ r ≤ 1 or r is undefined.


To see how scatter diagrams complement the correlation coefficient in the analysis of data, refer to
Annex A.


§4 Regression Lines

In this chapter, we are interested in finding the "best-fit" line for bivariate data points where X and
Y are linearly correlated. The best-fit line allows us to predict values of y from known values of x
and vice versa.

Suppose all the points in a scatter diagram lie approximately along a straight line, we say there is a
linear correlation between X and Y, and we try to fit the best possible straight line to the data. Such a
line is called a regression line. We do this by the least squares method.

Least squares regression line of y on x : y = a + bx


The "least squares" method is as follows:

(1) Plot the points $(x_i, y_i)$, i = 1, 2, 3, ..., n on a scatter diagram.

(2) The estimated regression line y = a + bx is drawn through these points in such a way that
the sum of squares of the deviations of the points from the line is a minimum,
i.e. a and b are chosen such that

\[ S = \sum_{i=1}^{n} e_i^2 \]

is minimised, where $e_i$ is the vertical deviation of the observed value $y_i$ from the predicted value given by the line
y = a + bx at the point $x_i$, i.e. $e_i = y_i - (a + bx_i)$.

[Diagram: the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and the line y = a + bx; each $e_i$ is the vertical distance between the observed value $y_i$ and the predicted value $a + bx_i$]


i    Independent    Dependent variable    Predicted    Vertical deviation        (Vertical deviation)²
     variable xi    (observed value) yi   value
1    x1             y1                    a + bx1      e1 = y1 − (a + bx1)       e1² = (y1 − (a + bx1))²
2    x2             y2                    a + bx2      e2 = y2 − (a + bx2)       e2² = (y2 − (a + bx2))²
⋮
n    xn             yn                    a + bxn      en = yn − (a + bxn)       en² = (yn − (a + bxn))²

Aim: Find the values of a and b such that the errors $e_i$ are as small as possible.
Hence, we minimise $\sum e_i^2$.

It may be proven that:

(i) $(\bar{x}, \bar{y})$ lies on the regression line of y on x.
(ii) $\sum e_i^2$ is a minimum when $b = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$.

From (i) and (ii), the estimated regression line of y on x is:

\[
y - \bar{y} = b\,(x - \bar{x}), \quad \text{where the gradient } b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} \qquad \text{(MF26)}
\]

Note: For the “least squares” method to find the regression line of x on y : x = c + dy, refer to
Annex B.
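Results (i) and (ii) can be illustrated numerically. The sketch below applies the gradient formula to the raw data from §3.2, then nudges a and b away from the fitted values to show that the sum of squared deviations S only gets larger. The helper name `sum_of_squares` and the sizes of the nudges are assumptions made for this illustration; this is a numerical check, not a proof.

```python
# Raw data from section 3.2.
x = [49, 51, 54, 58, 63, 64, 68, 70, 75, 78]
y = [90, 88, 85, 91, 82, 85, 76, 77, 70, 71]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Gradient b from result (ii); intercept a from result (i),
# since the line passes through (x_bar, y_bar).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar


def sum_of_squares(a_try, b_try):
    """S = sum of e_i^2 for the candidate line y = a_try + b_try * x."""
    return sum((yi - (a_try + b_try * xi)) ** 2 for xi, yi in zip(x, y))


s_best = sum_of_squares(a, b)

# Any small change to a or b should give a larger S than the fitted line.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05), (0.5, -0.05)]
nudged_all_worse = all(sum_of_squares(a + da, b + db) > s_best for da, db in nudges)

print(f"y = {a:.3f} + ({b:.4f})x, S = {s_best:.3f}, nudges worse: {nudged_all_worse}")
```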


4.1 Types of regression lines

For any bivariate set of data connecting variables X and Y, there are two uniquely defined regression
lines:
(a) Regression line of y on x ( y = a + bx )
X is the independent variable and Y is the dependent variable
y is expressed in terms of x

(b) Regression line of x on y ( x = c + dy )


Y is the independent variable and X is the dependent variable
x is expressed in terms of y

Remark: Before using regression lines to predict values, we need to know whether regression
line of y on x or x on y should be used.

Case 1: Independent variable is known

Independent variable   Given value   To find   Use regression line of
x                      x             y         y on x (y = a + bx)
x                      y             x         y on x (y = a + bx)
y                      x             y         x on y (x = c + dy)
y                      y             x         x on y (x = c + dy)

E.g. The speed at which a certain sea animal swims (Y) depends on the angle through which the
hind feet move (X).
X is the independent variable, and Y, which depends on X, is the dependent variable.

Recall: We usually assume that the values of independent variables are set more accurately and
observed values of dependent variables will be subject to some level of error.

Example 5 (2010 RVHS/I/10 Modified)


An experiment was conducted to investigate how the mass x grams of a chemical substance varies
with time t hours. The following data were obtained.

t 4 6 7 9 10 12 13 14
x 5.39 6.96 6.70 7.60 8.33 9.10 11.50 11.30

Find the equation of a suitable regression line.

Solution:
Since the experiment investigates how x varies with t, t is the ________________ variable.

From GC,


Finding the regression line of y on x

Refer to Step 3 of Example 1 for the GC steps.

(1) Store RegEQ: Y1 ([ALPHA] [TRACE]). Press [ENTER].

(2) Press [Y=].

Case 2: Independent variable is unknown


Given value To find Use Regression line of
x y y on x
y x x on y

Example 6
Eight pairs of observations on the variables x and y are given below:
x 32 12 21 50 45 60 15 56
y 40 86 52 42 50 8 75 15

(i) Find the values of $\bar{x}$ and $\bar{y}$.


(ii) Obtain the regression lines of y on x and x on y.
(iii) On a scatter diagram, sketch the two regression lines.
(iv) Find the coordinates of the intersection of the two regression lines. Comment on the values
of the coordinates, relating to the answers in part (i).

Solution:

(i) Using GC, $\bar{x}$ = 36.4 and $\bar{y}$ = 46.

(ii) Using GC, the equation of the regression line of y on x is y = 92.2 − 1.27x (3 s.f.).
The equation of the regression line of x on y is x = 66.0 − 0.644y (3 s.f.).

To draw the regression line of x on y on the scatter diagram using GC
Since we can only input y in terms of x in the GC, to draw the line x = c + dy in a graph of y
against x, we have to make y the subject.

\[ x = c + dy \implies y = \frac{1}{d}\,x - \frac{c}{d} \]

Hence, the gradient of the line x = c + dy is $\frac{1}{d}$ when drawn in a graph of y against x.


To draw the regression line of x on y

(1) Enter the x-data into L1, and the y-data into L2.

(2) Press [STAT] [►]. Select 8:LinReg(a+bx).

(3) Xlist: L2 Ylist: L1

(4) Press [ENTER] to display the values of a, b and r.

To draw the regression line of x on y on the scatter diagram, we have to
MAKE y THE SUBJECT and key the equation manually into the GC.

Equation of the regression line of x on y is
x = 65.99156566 − 0.6438383838y
⇒ y = −1.5531848x + 102.497

Alternatively
Note that the equation is actually $y = \frac{x - c}{d}$, where c = a and d = b in the GC. Hence, you may type
$y = \frac{x - a}{b}$, where a and b may be recalled by pressing [VARS], selecting 5: Statistics and moving to EQ.

(iii) To sketch the two regression lines:

(1) Enter the equations y = 102.497 − 1.5531848x and
    y = 92.1884058 − 1.269784352x.
(2) Press [GRAPH] to draw the two regression lines.

(iv) The intersection point is (36.4, 46).

Note that the coordinates of the intersection point correspond
to ________. (Important result)

[Students may use 2-Var Stats to verify]
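The GC working for Example 6 can be cross-checked in Python. This is a minimal sketch with variable names chosen for this illustration: it fits both regression lines from the formulas in §4 and Annex B, then solves the two equations simultaneously to locate the intersection point.

```python
# Data from Example 6.
x = [32, 12, 21, 50, 45, 60, 15, 56]
y = [40, 86, 52, 42, 50, 8, 75, 15]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Regression line of y on x: y = a + b*x.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Regression line of x on y: x = c + d*y.
d = s_xy / s_yy
c = x_bar - d * y_bar

# Intersection: substitute y = a + b*x into x = c + d*y,
# giving x * (1 - b*d) = c + d*a.
x_int = (c + d * a) / (1 - b * d)
y_int = a + b * x_int

print(f"y on x: y = {a:.4f} + ({b:.6f})x")
print(f"x on y: x = {c:.4f} + ({d:.6f})y, i.e. y = {1 / d:.6f}x + {-c / d:.3f}")
print(f"means: ({x_bar}, {y_bar}); intersection: ({x_int:.3f}, {y_int:.3f})")
```

The printed means and intersection point can be compared directly against parts (i) and (iv).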


4.3 Notes on Regression Lines

1. Both the regression lines of y on x and x on y pass through the point $(\bar{x}, \bar{y})$.

   Example 6: The coordinates of the point of intersection of the two regression lines are
   $(\bar{x}, \bar{y})$ = (36.4, 46.0).

2. For the regression line of y on x, the slope/gradient, b, gives the increase (b > 0) or decrease
   (b < 0) in y for every unit increase in x.

3. Relating regression lines with values of r:

   Perfect positive linear correlation, r = 1: the two regression lines coincide, and have positive gradient.
   Perfect negative linear correlation, r = −1: the two regression lines coincide, and have negative gradient.

Observations:

If r ≠ 0, we either have two lines with positive gradient (r > 0) or two lines with negative gradient
(r < 0).

The following diagram is thus not possible.

[Diagram: the regression lines of y on x and x on y drawn with gradients of opposite signs]


Activity

Eight pairs of observations on the variables x and y are given below.

x 34 57 43 65 72 46 51 54
y 21.6 58.1 32.8 59.3 60.6 50.2 56.8 49.5
(Adapted from 2007/NJC/Prelim/II/12)

Which of the following models is most appropriate? Explain why your chosen model is the most
appropriate.

(A) $y = a + bx$          (B) $y = a + bx^2$
(C) $y = a + b\ln x$      (D) $y = a + \frac{b}{x}$

Submit your answer through this QR code! Solutions will be discussed in the next lecture.


§5 Estimation of Values
Once the regression lines are found, we can use them to estimate x and y for a given value of y and x
respectively.
Interpolation
When we work within the given range of values of the data, it is known as interpolation.

Extrapolation
When we work outside the given range of values of the data, it is known as extrapolation.
Extrapolation should be used with caution as the relationship between the two variables may not be
linear outside the range of the data.

An estimate is not reliable if it is outside the data range (even if the product moment correlation
coefficient is high, i.e. close to ±1) as the linear relation between the 2 variables may no longer
hold.

An estimate is reliable when
1. it is within the data range; AND
2. the absolute value of the product moment correlation coefficient is close to 1 ($|r| \approx 1$).

Example 7
A random sample of ten pairs of values of x and y is used to obtain the following regression line of
y on x, where x is the age in years (5 ≤ x ≤ 16) and y is the height, in cm, of 10 boys.
y = 4.4617x + 87.431.
(i) Use the regression line of y on x to estimate the height of a 9-year-old boy. Given that the
product moment correlation coefficient of the data is 0.903, comment on the reliability of
your answer.
(ii) Predict the value of y given by the regression line when x = 40 and comment on your
answer.

Solution:
(i) When x = 9, y = 4.4617(9) + 87.431 = 127.5863 ≈ 128 (3 s.f.)
Therefore, the estimated height of a 9-year-old boy is 128 cm.
Since ____________________________________________, it indicates a ____________
linear correlation between x and y. Hence, the estimated height of a 9-year-old boy is
reliable.
(ii) When x = 40, y = 4.4617(40) + 87.431 = 265.899 ≈ 266 (3 s.f.)
Since __________________________________, the linear correlation between x and y may
not hold. Hence, the estimated value of y is unreliable.
Moreover, it is unlikely for a person to be 266 cm tall as the heights of people do not vary
linearly with their age. There is a limit for growth.
Note:
If raw data is given and the equation of the y on x line is found using GC, you may use Store RegEQ:
Y1 and input Y1(9) and Y1(40) to obtain the estimates.
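The distinction between interpolation and extrapolation in Example 7 can be made explicit in code. In this sketch the function `estimate_height` and its return format are assumptions made for illustration; only the regression line and the data range 5 ≤ x ≤ 16 come from the example. The check covers the data-range condition only, so the size of |r| still has to be judged separately.

```python
# Regression line and data range from Example 7.
GRADIENT = 4.4617
INTERCEPT = 87.431
AGE_MIN, AGE_MAX = 5, 16


def estimate_height(age):
    """Return (estimated height in cm, True if the estimate is an interpolation)."""
    height = GRADIENT * age + INTERCEPT
    is_interpolation = AGE_MIN <= age <= AGE_MAX
    return height, is_interpolation


for age in (9, 40):
    height, is_interpolation = estimate_height(age)
    kind = "interpolation" if is_interpolation else "extrapolation"
    print(f"age {age}: estimated height {height:.4f} cm ({kind})")
```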


§6 Transformation from a Non-Linear Model to a Linear Model


Given a set of bivariate data, the relationship between the two variables involved may not be linear.
However, there are some non-linear relationships which can be transformed to a linear form.

Examples
(i) $y = p + qx^2$: the relationship between y and $x^2$ is linear.
(ii) $y = pe^{qx} \Rightarrow \ln y = \ln p + qx$: the relationship between ln y and x is linear.
(iii) $y = px^q \Rightarrow \ln y = \ln p + q\ln x$: the relationship between ln y and ln x is linear.

We can then use the data for the two ‘transformed variables’ to find the equation of the regression
line. In other words, we ‘linearise’ the set of bivariate data. (Refer to Annex C)

Example 8 (H2 Maths/Specimen Paper/P2/7)


The daily rate charged by a car-hire firm varies with the length of the hire period. The firm’s brochure
gives the following data.
Hire period, x days    1     2     3     4     5     10    30    50
Daily rate, $y         149   119   115   112   109   105   103   101

Calculate the value of the product-moment correlation coefficient.
Give a sketch of the scatter diagram for the data, as shown on your calculator, and hence
(i) comment on the value of the product-moment correlation coefficient,
(ii) state, with a reason, which of the following models is most appropriate.
A: $y = a + bx$               B: $y = a + bx^2$
C: $y = a + \frac{b}{x}$      D: $y = a + b\ln x$
For the appropriate model, calculate least squares estimates of a and b, and verify that, for the
transformed data, the product-moment correlation coefficient is close to 1.

Solution:
From GC, r =

(i) ___________________________, suggesting a ___________ linear correlation. This


corresponds to the observation from the scatter diagram, where the points do not lie close to
a straight line.


(ii) For Model A: $y = a + bx$:
For Model B: $y = a + bx^2$:
For Model C: $y = a + \frac{b}{x}$:
For Model D: $y = a + b\ln x$:

As _______________________________________________________, and Model C has a
value of _______________________________. Therefore, Model C, $y = a + \frac{b}{x}$, seems most
appropriate.

From GC,
least squares estimate of a = 99.787 ≈ 99.8 (3 s.f.)
least squares estimate of b = 47.073 ≈ 47.1 (3 s.f.)
For the transformed data, the product-moment correlation coefficient is 0.992 which is close
to 1.
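The model comparison in part (ii) can be reproduced without a GC. The sketch below regresses y on each transformed variable in turn and picks the model whose |r| is largest; the helper `fit` and the dictionary layout are assumptions made for this illustration, not part of the original solution.

```python
import math

# Data from Example 8: hire period x (days) and daily rate y (dollars).
x = [1, 2, 3, 4, 5, 10, 30, 50]
y = [149, 119, 115, 112, 109, 105, 103, 101]


def fit(us, vs):
    """Return (a, b, r) for the least squares line v = a + b*u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    b = s_uv / s_uu
    return v_bar - b * u_bar, b, s_uv / math.sqrt(s_uu * s_vv)


# Each model is linear in a transformed version of x.
transformed = {
    "A: y = a + bx": x,
    "B: y = a + bx^2": [xi ** 2 for xi in x],
    "C: y = a + b/x": [1 / xi for xi in x],
    "D: y = a + b ln x": [math.log(xi) for xi in x],
}

results = {name: fit(us, y) for name, us in transformed.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: a = {a:.3f}, b = {b:.3f}, r = {r:.3f}")

# The most appropriate model is the one whose |r| is closest to 1.
best_model = max(results, key=lambda name: abs(results[name][2]))
print("Best:", best_model)
```

Comparing |r| alone is a shortcut; the shape of the scatter diagram should still be checked against each model, as the question asks.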


Annex A
Learn the importance of scatter diagram in interpreting correlation and regression (Frank
Anscombe 1973)

Consider the four sets of bivariate data given below. Using GC, sketch the scatterplots, compare their
correlation coefficients and equations of y on x.
x1 y1 x2 y2 x3 y3 x4 y4
4 4.26 4 3.10 4 5.39 8 6.58
5 5.68 5 4.74 5 5.73 8 5.76
6 7.24 6 6.13 6 6.08 8 7.71
7 4.82 7 7.26 7 6.42 8 8.84
8 6.95 8 8.14 8 6.77 8 8.47
9 8.81 9 8.77 9 7.11 8 7.04
10 8.04 10 9.14 10 7.46 8 5.25
11 8.33 11 9.26 11 7.81 8 5.56
12 10.84 12 9.13 12 8.15 8 7.91
13 7.58 13 8.74 13 12.74 8 6.89
14 9.96 14 8.10 14 8.84 19 12.5

Data set (x1, y1)
Description: Moderate linear relationship
Regression line: y = 0.5001x + 3.0001
Correlation coefficient: r = 0.8164 (r² = 0.6665)

Data set (x2, y2)
Description: Curvilinear relationship
Regression line: y = 0.5x + 3.0009
Correlation coefficient: r = 0.8162 (r² = 0.6662)

Data set (x3, y3)
Description: Extreme observation much higher than the other points.
Regression line: y = 0.4997x + 3.0025
Correlation coefficient: r = 0.8163 (r² = 0.6663)

Data set (x4, y4)
Description: No line. All points at x = 8 except one point at x = 19.
Regression line: y = 0.4999x + 3.0017
Correlation coefficient: r = 0.8165 (r² = 0.6667)


(a) What can you observe about the regression lines?


The equations of the regression lines are the same (to 3 s.f.).

(b) What conclusion can you draw?


A scatter diagram of bivariate data should be drawn before fitting a regression line or
calculating the linear correlation coefficient. This is because a scatter diagram shows the
relationship between the two variables, and helps in interpreting correlation and regression.
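The comparison in Annex A can be reproduced from the table of four data sets. This is a sketch under the assumption that the table values are used exactly as printed; the helper `fit` and the dictionary `datasets` are names chosen for this illustration. It shows that the four data sets give nearly identical regression lines and correlation coefficients, which is why the scatter diagrams are needed to tell them apart.

```python
import math

# The four data sets from the table in Annex A.
x123 = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
datasets = {
    "(x1, y1)": (x123, [4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96]),
    "(x2, y2)": (x123, [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]),
    "(x3, y3)": (x123, [5.39, 5.73, 6.08, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84]),
    "(x4, y4)": ([8] * 10 + [19], [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.5]),
}


def fit(xs, ys):
    """Return (a, b, r) for the least squares regression line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / math.sqrt(s_xx * s_yy)


results = {name: fit(xs, ys) for name, (xs, ys) in datasets.items()}
for name, (a, b, r) in results.items():
    print(f"{name}: y = {b:.4f}x + {a:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}")
```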

Annex B: [Least squares regression line of x on y : x = c + dy]


Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

[Diagram: the data points and the line x = c + dy; each $h_i$ is the horizontal distance between the observed value $x_i$ and the predicted value $c + dy_i$]

i    Independent    Dependent variable    Predicted    Horizontal deviation      (Horizontal deviation)²
     variable yi    (observed value) xi   value
1    y1             x1                    c + dy1      h1 = x1 − (c + dy1)       h1² = (x1 − (c + dy1))²
2    y2             x2                    c + dy2      h2 = x2 − (c + dy2)       h2² = (x2 − (c + dy2))²
⋮
n    yn             xn                    c + dyn      hn = xn − (c + dyn)       hn² = (xn − (c + dyn))²

Aim: Find the values of c and d such that the errors $h_i$ are as small as possible.
Hence, we minimise $\sum h_i^2$.

It may be proven that:

(i) $(\bar{x}, \bar{y})$ lies on the regression line of x on y.
(ii) $\sum h_i^2$ is a minimum when $d = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}$.

From (i) and (ii), the estimated regression line of x on y is:

\[
x - \bar{x} = d\,(y - \bar{y}), \quad \text{where the gradient } d = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2}
\]

Note: The formula above is NOT GIVEN IN MF 26.


Annex C
To Clear Lists
Press [2nd] [+] (MEM)
Choose 4: ClrAllLists
Press [ENTER]

Transformations

$y = p + qx^2$
Relationship between y and $x^2$ is linear ⇒ y on $x^2$
x-list → L1
y-list → L2
$x^2$-list → L3 = (L1)²
GC: y = a + bx gives the least squares regression line of y on $x^2$: $y = p + qx^2$

$y = pe^{qx}$
$\ln y = \ln p + qx$
Relationship between ln y and x is linear ⇒ ln y on x
x-list → L1
y-list → L2
ln y-list → L3 = ln(L2)
GC: y = a + bx gives the least squares regression line of ln y on x: $\ln y = \ln p + qx$

$y = px^q$
$\ln y = \ln p + q\ln x$
Relationship between ln y and ln x is linear ⇒ ln y on ln x
x-list → L1
y-list → L2
ln x-list → L3 = ln(L1)
ln y-list → L4 = ln(L2)
GC: y = a + bx gives the least squares regression line of ln y on ln x: $\ln y = \ln p + q\ln x$
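The list transformations above can be mirrored in Python. The sketch below linearises the model $y = px^q$ by regressing ln y on ln x, then recovers p and q from the intercept and gradient. The data are hypothetical: they are generated exactly from p = 3 and q = 1.5 (values chosen for this illustration), so the recovered parameters should match them.

```python
import math

# Hypothetical data generated from y = p * x**q with p = 3, q = 1.5.
P_TRUE, Q_TRUE = 3.0, 1.5
x = [1, 2, 3, 4, 5, 6]
y = [P_TRUE * xi ** Q_TRUE for xi in x]

# Transformed lists, as in "ln x-list -> L3" and "ln y-list -> L4".
ln_x = [math.log(xi) for xi in x]
ln_y = [math.log(yi) for yi in y]

# Least squares regression line of ln y on ln x: ln y = a + b * ln x.
n = len(ln_x)
u_bar = sum(ln_x) / n
v_bar = sum(ln_y) / n
b = sum((u - u_bar) * (v - v_bar) for u, v in zip(ln_x, ln_y)) / sum(
    (u - u_bar) ** 2 for u in ln_x
)
a = v_bar - b * u_bar

# Compare with ln y = ln p + q ln x: a = ln p and b = q.
p = math.exp(a)
q = b
print(f"p = {p:.4f}, q = {q:.4f}")
```

The same pattern applies to $y = pe^{qx}$: regress ln y on x instead, and again take p = e^a and q = b.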
