
Linear Regression and Correlation A Level Notes (Precision Academy)

The document discusses linear regression and correlation. It defines key terms like scatter diagrams, lines of best fit, regression equations, forecasting using regression lines, and Pearson's correlation coefficient. Methods for making scatter plots, finding regression lines, and calculating the correlation coefficient are presented.

Uploaded by

rudomposi
Copyright
© © All Rights Reserved

P R E C I S I O N A C A D E M Y +263775973880 created by MWEDZI SOLOMON

SEMINAR ON LINEAR REGRESSION & CORRELATION: GUMBONZVANDA HIGH SCHOOL

OBJECTIVES
After studying this topic, students should be able to:
 plot scatter diagrams
 draw lines of best fit
 find the equations of regression lines
 calculate Pearson's product moment correlation coefficient ($r$)
 compute the coefficient of determination ($r^2$)
 solve problems involving regression and correlation

THE SCATTER DIAGRAM


 The scatter diagram is also called a scatter plot chart or correlation chart.
 A scatter diagram is a two-dimensional graphical representation of a set of data. The
scatter diagram graphs pairs of numerical data with one variable on each axis to indicate
a relationship between them.
 A scatter diagram is a way of graphing bivariate data.

BIVARIATE DATA

 Bivariate data is data which is collected on two variables.

HOW TO MAKE A SCATTER DIAGRAM?

 The variable that can be controlled in the data collection is known as
the independent or explanatory variable and is plotted on the horizontal axis.
 The variable that is measured or discovered (outcome variable) in the data collection is
known as the dependent or response variable and is plotted on the vertical axis.

E.g. in the association between the cost of living and salary, the cost of living (the
dependent variable) depends on one's salary, whilst salary is the controlled variable (the
independent variable).


Example 1

Plot a scatter diagram for the data

x 8 2 11 6 5 4 12 9 6 1
y 3 10 3 6 8 12 1 4 9 14

Solution (x is the independent variable & y is the dependent variable)

[Figure: scatter diagram of y (vertical axis, 0 to 14) against x (horizontal axis, 0 to 14) for the data above.]


REGRESSION.

 Regression is a mathematical relation that describes how one variable depends on
another variable.
 OR Regression can be defined as a measure used to explain the relationship between two
separate variables.
 OR Regression is a technique for establishing the mathematical relation between two
variables, an independent one and a dependent one.

LEAST SQUARES REGRESSION LINE OR LINE OF BEST FIT


 This is a linear graph added to the scatter diagram that best approximates the
relationship between the two variables
 The data can be used to calculate the equation of the straight line that represents the best
fit of the relationship between the two variables
 The least squares regression line is the line of best fit that minimizes the sum of the
squares of the gaps between the line and each data value.
 The least squares regression line is usually called the regression line and can be
calculated by considering either the vertical or the horizontal distances between the line
and the data values.
 If the regression line is calculated by considering the vertical distances, it is called
the regression line of y on x.
 If the regression line is calculated by considering the horizontal distances, it is called
the regression line of x on y.
 After the regression line has been computed, the points that lie far from the line are called
outliers. Such points may indicate a poorly fitting regression line or erroneous data.
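
The minimising property can be checked numerically. The sketch below is only an illustration: it uses the data of Example 1, the standard least squares formulas for the line of y on x, and two hypothetical helper names (`fit_y_on_x` and `sum_squared_gaps`) that are not part of the original notes. It compares the sum of squared vertical gaps for the fitted line with the same sum for a few slightly nudged lines.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def sum_squared_gaps(xs, ys, a, b):
    """Sum of squared vertical gaps between each point and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Data of Example 1
xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]

a, b = fit_y_on_x(xs, ys)
best = sum_squared_gaps(xs, ys, a, b)

# Nudge the intercept and the gradient slightly (arbitrary amounts chosen
# for illustration); every nudged line should give a larger sum.
nudged = [
    sum_squared_gaps(xs, ys, a + da, b + db)
    for da in (-0.5, 0.5)
    for db in (-0.1, 0.1)
]
print(best, min(nudged))
```

Any line other than the fitted one gives a larger sum, which is what "least squares" means.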

REGRESSION LINE OF Y ON X.

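It is written in the form $y = a + bx$, where $a$ is the y-intercept and $b$ is the gradient.

This equation is obtained from the equation $y - \bar{y} = b(x - \bar{x})$, where

 $\bar{x} = \dfrac{\sum x}{n}$

 $\bar{y} = \dfrac{\sum y}{n}$

$b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$

NB: $n$ is the number of ordered pairs in the data.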
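
As a rough illustration of the calculation for the line of y on x, the following Python sketch computes the intercept and gradient from the sums. The function name `regression_y_on_x` is a hypothetical helper, not something from the original notes, and the data are those of Example 2.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the regression line of y on x, y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Gradient from the sums, then intercept from a = y_bar - b * x_bar
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Data of Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]

a, b = regression_y_on_x(xs, ys)
print(f"y = {a:.3f} + {b:.4f}x")  # prints: y = 4.504 + 0.6391x
```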

REGRESSION LINE OF X ON Y.

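It is written in the form $x = c + dy$, where $c$ is the x-intercept and $d$ is the gradient (the change in x per unit change in y).

This equation is obtained from the equation $x - \bar{x} = d(y - \bar{y})$, where

 $\bar{x} = \dfrac{\sum x}{n}$

 $\bar{y} = \dfrac{\sum y}{n}$

$d = \dfrac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2}$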
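
The line of x on y can be sketched in the same way, with the roles of the two variables swapped. In the code below the line is written as x = c + d*y, where the letters c and d are an assumed notation, `regression_x_on_y` is a hypothetical helper name, and the data are those of Example 2.

```python
def regression_x_on_y(xs, ys):
    """Return (c, d) for the regression line of x on y, x = c + d*y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_y2 = sum(y * y for y in ys)
    # Same numerator as for y on x, but the denominator uses the y sums
    d = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
    c = sum_x / n - d * sum_y / n
    return c, d


# Data of Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]

c, d = regression_x_on_y(xs, ys)
print(f"x = {c:.3f} + {d:.4f}y")  # prints: x = 4.422 + 0.7507y
```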

Example 2:
Use the least squares method to determine

a) the equation of the regression line of y on x

b) the equation of the regression line of x on y

Then plot the lines on the same axes.

x 3 7 9 11 14 14 15 21 22 23 26

y 5 12 5 12 10 17 23 16 10 20 25

Solution

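x 3 7 9 11 14 14 15 21 22 23 26 $\sum x = 165$

y 5 12 5 12 10 17 23 16 10 20 25 $\sum y = 155$

$\sum x^2 = 3007$

$\sum y^2 = 2637$

$\sum xy = 2665$

(There are 11 pairs in the data, so $n = 11$.) $\bar{x} = \dfrac{165}{11} = 15$ and $\bar{y} = \dfrac{155}{11} \approx 14.09$

a) y on x, in the form $y = a + bx$

$b = \dfrac{11(2665) - (165)(155)}{11(3007) - 165^2} = \dfrac{3740}{5852} \approx 0.6391$

$a = \bar{y} - b\bar{x} = 14.0909 - 0.6391(15) \approx 4.504$

Therefore $y = 4.504 + 0.6391x$

b) x on y, in the form $x = c + dy$

$d = \dfrac{11(2665) - (165)(155)}{11(2637) - 155^2} = \dfrac{3740}{4982} \approx 0.7507$

$c = \bar{x} - d\bar{y} = 15 - 0.7507(14.0909) \approx 4.422$

Therefore $x = 4.422 + 0.7507y$

The equations of the regression lines are $y = 4.504 + 0.6391x$ and $x = 4.422 + 0.7507y$.

NB: The regression lines of y on x and x on y always pass through the point $(\bar{x}, \bar{y})$.

On the scatter plot both lines pass through $(15, 14.09)$.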

HOW TO DRAW THE LINE OF REGRESSION

 When drawing the straight line, at least two points must be plotted. The first point is
$(\bar{x}, \bar{y})$ and the other point is found by substituting any of the given values of x into the
equation of the regression line.
 The other method is to use $(\bar{x}, \bar{y})$ and the y- or x-intercept as the two points
through which the line should pass.

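FOR THE LINE Y ON X:

The extreme values of x are 3 and 26.

When $x = 3$ then $y = 4.504 + 0.6391(3) \approx 6.42$: obtained point is $(3, 6.42)$

When $x = 26$ then $y = 4.504 + 0.6391(26) \approx 21.12$: obtained point is $(26, 21.12)$

Line y on x passes through: $(3, 6.42)$, $(15, 14.09)$ and $(26, 21.12)$

OR; Line y on x passes through: $(15, 14.09)$ AND the y-intercept $(0, 4.504)$ (the point where the line cuts the
y axis)

NB: to find the y intercept substitute $x = 0$ in the equation of the regression line of y on x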

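FOR THE LINE X ON Y:

The extreme values of y are 5 and 25.

When $y = 5$ then $x = 4.422 + 0.7507(5) \approx 8.18$: obtained point is $(8.18, 5)$

When $y = 25$ then $x = 4.422 + 0.7507(25) \approx 23.19$: obtained point is $(23.19, 25)$

Line x on y passes through: $(8.18, 5)$, $(15, 14.09)$ and $(23.19, 25)$

OR; Line x on y passes through: $(15, 14.09)$ AND the x-intercept $(4.422, 0)$ (the point where the line cuts
the x axis)

NB: to find the x intercept substitute $y = 0$ in the equation of the regression line of x on y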
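
The plotting points for both lines can also be generated by substitution in code. This is a minimal sketch using the data of Example 2; the helper `fit` and the names `y_on_x_points` and `x_on_y_points` are hypothetical, chosen only for this illustration.

```python
def fit(us, vs):
    """Least squares regression of v on u: returns (intercept, gradient)."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    gradient = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    intercept = sum_v / n - gradient * sum_u / n
    return intercept, gradient


# Data of Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

a, b = fit(xs, ys)  # line of y on x: y = a + b*x
c, d = fit(ys, xs)  # line of x on y: x = c + d*y

# Substitute the extreme x values into y on x; points are stored as (x, y).
y_on_x_points = [(x, a + b * x) for x in (min(xs), max(xs))]
# Substitute the extreme y values into x on y; points are stored as (x, y).
x_on_y_points = [(c + d * y, y) for y in (min(ys), max(ys))]

print("mean point:", (x_bar, round(y_bar, 2)))
print("y on x:", [(x, round(y, 2)) for x, y in y_on_x_points])
print("x on y:", [(round(x, 2), y) for x, y in x_on_y_points])
```

Both lines share the mean point, so plotting it together with one extreme point per line is enough to draw each line.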


FORECASTING OR PREDICTING USING THE REGRESSION LINE OR THE EQUATION OF THE REGRESSION LINE

The regression line can also be used to predict the value of a dependent variable from
an independent variable by using the regression equation or the graph.

 Predictions should only be made for values of the independent variable that are
within the range of the given data.
 Making a prediction within the range of the given data is called interpolation.
 Making a prediction outside the range of the given data is
called extrapolation; it is much less reliable and may lead to implausible results.
 The prediction will be more reliable if the number of data values in the original
sample set is bigger.
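
The distinction between interpolation and extrapolation can be made explicit when predicting. The sketch below assumes the data of Example 2 and a line of y on x; `fit_y_on_x` and `predict_y` are hypothetical helper names, and the two trial values of x (10 and 40) are chosen only for illustration.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


def predict_y(x, xs, a, b):
    """Predict y at x and label the prediction by where x lies in the data range."""
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return a + b * x, kind


# Data of Example 2: x ranges from 3 to 26
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
a, b = fit_y_on_x(xs, ys)

inside = predict_y(10, xs, a, b)   # x = 10 lies within the data range
outside = predict_y(40, xs, a, b)  # x = 40 lies beyond the largest x value
print(inside, outside)
```

A prediction labelled as extrapolation is still computed, but it should be treated with caution for the reasons given above.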


CORRELATION.
 Correlation is a statistical measure that expresses the extent to which two variables are
linearly connected.
 Correlation is the degree to which two variables move in coordination with one another.
 Correlation is the degree to which a change in one variable is reflected by a
corresponding change in the other variable.
CORRELATION AND CAUSALITY
 Causality or causation means that one event is the result of the occurrence of the other.
 Correlation doesn’t imply causality.

E.g. a survey of learners may show that there is a strong correlation between those with the
biggest right feet and those who are best at mental arithmetic.
However, in reality a big right foot doesn’t cause one to be good at mental arithmetic.
HOW IS CORRELATION MEASURED?
 Statistical correlation is measured by use of Pearson's product moment correlation
coefficient.
PEARSON'S PRODUCT MOMENT CORRELATION COEFFICIENT.
 Pearson's product moment correlation coefficient is a measure of the strength of a linear
association between two variables and is denoted $r$.
 The product moment correlation coefficient indicates the degree of scatter.
 Pearson's product moment correlation coefficient ($r$) ranges from -1 to 1. This can
be written as $-1 \le r \le 1$.
 If the gradient of the regression line is positive then the data set has positive correlation, and
if the gradient is negative then the data set has negative correlation.

Value of r                        Comment
$r = 0$                           No correlation between the variables
$r = 1$                           Perfect positive correlation between the variables
$r = -1$                          Perfect negative correlation between the variables
$r$ positive and close to 0       Some or weak positive correlation between the variables
$r$ negative and close to -1      High or strong negative correlation between the variables


[Figure: six example scatter diagrams. Top row: strong or high positive correlation; some or weak positive correlation; no correlation. Bottom row: high or strong negative correlation; some or weak negative correlation; no correlation.]
NB: When describing correlation say whether it is positive or negative and also say whether it is
strong or weak.

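$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$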
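
For readers who want to check a hand calculation, the product moment correlation coefficient can be sketched in Python as below. The function name `pearson_r` is a hypothetical helper and the formula used is the standard one based on the five sums; the data are those of Example 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Data of Example 3
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # prints: r = 0.693
```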

Example 3:
Calculate the product moment correlation coefficient for the data given in the table.

x 3 7 9 11 14 14 15 21 22 23 26

y 5 12 5 12 10 17 23 16 10 20 25


Solution

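x 3 7 9 11 14 14 15 21 22 23 26 $\sum x = 165$

y 5 12 5 12 10 17 23 16 10 20 25 $\sum y = 155$

$\sum x^2 = 3007$

$\sum y^2 = 2637$

$\sum xy = 2665$

$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$

$r = \dfrac{11(2665) - (165)(155)}{\sqrt{11(3007) - 165^2}\,\sqrt{11(2637) - 155^2}} = \dfrac{3740}{\sqrt{5852}\,\sqrt{4982}} \approx 0.693$

The association between x and y indicates a fairly strong positive correlation.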

COEFFICIENT OF DETERMINATION ($r^2$)
 The coefficient of determination ($r^2$) is used to analyze how differences in
one variable can be explained by differences in a second variable.
E.g., when a person gets pregnant has a direct relation to when they give birth.
FINDING THE COEFFICIENT OF DETERMINATION ($r^2$)

 Step 1: Find the correlation coefficient, $r$ (it may be given to you in the
question). Example: $r = 0.32$.
 Step 2: Square the correlation coefficient.
 Step 3: Convert the result to a percentage.
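
The three steps can be written out as a short calculation. The sketch below uses the example value r = 0.32 from Step 1 and, as an extra illustrative case not in the notes, r = -0.32 to show that the sign of r disappears on squaring; `determination_percent` is a hypothetical helper name.

```python
def determination_percent(r):
    """Square r (Step 2) and express the result as a percentage (Step 3)."""
    r_squared = r ** 2
    return r_squared, r_squared * 100


for r in (0.32, -0.32):
    r_squared, percent = determination_percent(r)
    print(f"r = {r}: r^2 = {r_squared:.4f} = {percent:.2f}%")
# prints:
# r = 0.32: r^2 = 0.1024 = 10.24%
# r = -0.32: r^2 = 0.1024 = 10.24%
```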

MEANING OF THE COEFFICIENT OF DETERMINATION

 The coefficient of determination gives the proportion of the variation in the dependent
variable that is explained by the regression equation.
 The higher the coefficient, the more closely the data points cluster around the line when
the data points and line are plotted.
 If the coefficient is 0.80, then 80% of the variation in y is explained by the variation in x.
Values of 1 or 0 would indicate that the regression line accounts for all or none of the variation,
respectively.
 A higher coefficient is an indicator of a better goodness of fit for the observations.
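
One standard way to read the coefficient of determination is as the fraction of the variation in y that the line of y on x accounts for. The sketch below, which uses the data of Example 3 and variable names chosen only for this illustration, computes that fraction directly from the gaps between the points and the line and compares it with the square of r.

```python
import math

# Data of Example 3
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Regression line of y on x, y = a + b*x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Product moment correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Total variation of y about its mean, and the part the line leaves unexplained
y_bar = sum_y / n
total_variation = sum((y - y_bar) ** 2 for y in ys)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / total_variation

print(round(r ** 2, 4), round(explained_fraction, 4))
```

The two printed values agree, which is why squaring r gives the proportion of variation explained.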


USEFULNESS OF $r^2$
 The usefulness of $r^2$ is that it indicates how reliably future outcomes can be
predicted from the regression equation.

EXAMPLE ON COEFFICIENT OF DETERMINATION ($r^2$)

CORRELATION VS REGRESSION
 Regression describes the effect that a change in x has on y, and vice-versa. With
correlation, x and y can be interchanged and the same result is obtained.
 Correlation is summarized by a single statistic ($r$), whereas regression is
expressed as an equation and is represented by a line, $y = a + bx$
or $x = c + dy$.
 Correlation helps to establish whether a relationship exists between two variables, and regression,
on the other hand, helps to find out how one variable affects another.
 Regression describes how the dependent variable responds when the independent variable
changes. Correlation only indicates whether the two variables change in the same direction or in
opposite directions.
 Prediction and optimization are only possible with the regression method and are not
viable in correlation analysis.
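
The difference can be seen numerically: swapping x and y leaves r unchanged but gives a different regression line. The sketch below uses the data of Example 2; the helpers `gradient` and `pearson_r` are hypothetical names introduced only for this illustration.

```python
import math


def gradient(us, vs):
    """Gradient of the least squares regression line of v on u."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    return (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)


def pearson_r(us, vs):
    """Product moment correlation coefficient of the paired data."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    sum_v2 = sum(v * v for v in vs)
    return (n * sum_uv - sum_u * sum_v) / math.sqrt(
        (n * sum_u2 - sum_u ** 2) * (n * sum_v2 - sum_v ** 2)
    )


# Data of Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)  # variables interchanged
b = gradient(xs, ys)      # gradient of y on x
d = gradient(ys, xs)      # gradient of x on y

print(r_xy, r_yx)  # the same value either way round
print(b, d)        # two different gradients
print(b * d, r_xy ** 2)
```

The last line illustrates a standard link between the two ideas: the product of the two gradients equals the square of r.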

PRACTICE QUESTIONS
1. ZIMSEC P4 2014
age 5 4 6 5 5 5 6 6 2 7 7
price 8.5 10.3 7.0 8.2 8.9 9.8 6.6 9.5 16.9 7.0 4.8

The table displays data on age and price for a sample of cars. Ages are in years whilst the prices
are in thousands of dollars.
a) Draw a scatter diagram of price against age.
b) Calculate the equation of the regression line of price on age of the car.
c) Draw the line of the equation on the scatter diagram in a) and use it to estimate the price of a 3
year old car.
d) Find the product-moment correlation coefficient and comment on it.

SOLUTION
a) and b)
Let the age be $x$ and the price be $y$.


[Figure: scatter diagram of price (y), vertical axis 0 to 20, against age (x), horizontal axis 0 to 8, with the regression line y = 19.547 - 2.0261x.]


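Age ($x$) 5 4 6 5 5 5 6 6 2 7 7 $\sum x = 58$

Price ($y$) 8.5 10.3 7 8.2 8.9 9.8 6.6 9.5 16.9 7 4.8 $\sum y = 97.5$

$xy$ 42.5 41.2 42 41 44.5 49 39.6 57 33.8 49 33.6 $\sum xy = 473.2$

$x^2$ 25 16 36 25 25 25 36 36 4 49 49 $\sum x^2 = 326$

$y^2$ 72.25 106.09 49 67.24 79.21 96.04 43.56 90.25 285.61 49 23.04 $\sum y^2 = 961.29$

$\bar{x} = \dfrac{58}{11} \approx 5.27$ and $\bar{y} = \dfrac{97.5}{11} \approx 8.86$

$b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \dfrac{11(473.2) - (58)(97.5)}{11(326) - 58^2} = \dfrac{-449.8}{222} \approx -2.0261$

$a = \bar{y} - b\bar{x} = 8.8636 + 2.0261(5.2727) \approx 19.547$

$y = 19.547 - 2.0261x$

The line passes through $(5.27, 8.86)$ and the y-intercept $(0, 19.547)$.

c) When $x = 3$, $y = 19.547 - 2.0261(3) \approx 13.47$ (thousand dollars), so the price of a 3 year old car is approximately 13 500 dollars.
d) $r = \dfrac{-449.8}{\sqrt{222}\,\sqrt{1067.94}} \approx -0.924$; there is a strong negative linear correlation
between price and age of the car.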
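
The working for this question can be cross-checked with a few lines of Python. This is only a checking sketch: the variable names are chosen for illustration, and the formulas are the standard ones for the line of y on x and for r.

```python
import math

# Question 1 data: age in years (x) and price in thousands of dollars (y)
ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
prices = [8.5, 10.3, 7.0, 8.2, 8.9, 9.8, 6.6, 9.5, 16.9, 7.0, 4.8]
n = len(ages)

sum_x, sum_y = sum(ages), sum(prices)
sum_xy = sum(x * y for x, y in zip(ages, prices))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in prices)

# Regression line of price on age: y = a + b*x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Estimated price (in thousands of dollars) of a 3 year old car
price_at_3 = a + b * 3

# Product moment correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"y = {a:.3f} + ({b:.4f})x")
print(f"price at age 3: {price_at_3:.2f}")
print(f"r = {r:.3f}")
```

The fitted equation agrees with the line y = 19.547 - 2.0261x shown on the scatter diagram.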

2. A biologist assumes that there is a linear relationship between the amount of fertilizer
supplied to tomato plants and the subsequent yield of tomatoes obtained.
Eight tomato plants of the same variety were selected at random and treated weekly with a
solution in which $x$ grams of fertilizer was dissolved in a fixed quantity of water. The
yield, $y$ kilograms, of tomatoes was recorded.

A B C D E F G H

a) Plot a scatter diagram of yield $y$ against amount of fertilizer $x$. [3]
b) Calculate the equation of the least squares regression line of $y$ on $x$. [6]
c) Estimate the yield of a plant treated weekly with 3.2 grams of fertilizer. [2]
d) Indicate why it may not be appropriate to use your equation to predict the yield of a plant
treated weekly with 20 grams of fertilizer. [1]


SOLUTION
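a) [Figure: scatter diagram of yield y (vertical axis) against amount of fertilizer x (horizontal axis, 0 to 5), with the regression line y = 1.081x + 3.2524.]

b) $y = 3.2524 + 1.081x$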


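c) When $x = 3.2$, $y = 3.2524 + 1.081(3.2) \approx 6.71$, so the estimated yield is about 6.71 kg.
d) 20 grams is beyond the range of the data, so the prediction would be an extrapolation and may give an implausible result.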
3. One end A of an elastic string was attached to a horizontal bar and a mass of $m$ grams was
attached to the other end B. The mass was suspended freely and allowed to settle
vertically below A. The length AB, $l$ mm, was recorded for various masses as follows.
m 100 200 300 400 500 600
l 228 236 256 278 285 301

a) Draw a scatter diagram to illustrate the above information. [3]


b) Calculate the least squares line of regression of $l$ on $m$ and plot this line on your scatter
diagram. [7]
c) Give, in context, an interpretation for:
i. The gradient of the line
ii. The intercept of the line on the $l$ axis [2]
d) Estimate the length of the string when a mass of 300 grams is attached at B. [1]
e) State the physical limitation that there might be in using your equation to estimate the
length of the string when a mass of 1200 grams is attached at B. [1]
f) Find the coefficient of determination and comment on it.
SOLUTION
a) and b)

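$m$ 100 200 300 400 500 600 $\sum m = 2100$

$l$ 228 236 256 278 285 301 $\sum l = 1584$

$m^2$ 10000 40000 90000 160000 250000 360000 $\sum m^2 = 910000$

$l^2$ 51984 55696 65536 77284 81225 90601 $\sum l^2 = 422326$

$ml$ 22800 47200 76800 111200 142500 180600 $\sum ml = 581100$

$\bar{m} = \dfrac{2100}{6} = 350$ and $\bar{l} = \dfrac{1584}{6} = 264$

$b = \dfrac{n\sum ml - \sum m \sum l}{n\sum m^2 - \left(\sum m\right)^2} = \dfrac{6(581100) - (2100)(1584)}{6(910000) - 2100^2} = \dfrac{160200}{1050000} \approx 0.1526$

$a = \bar{l} - b\bar{m} = 264 - 0.1526(350) \approx 210.6$

$l = 210.6 + 0.1526m$

c) i. The gradient is the increase in length, in mm, for every gram of mass added.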


[Figure: scatter diagram of l (mm), vertical axis 200 to 340, against m (grams), horizontal axis 100 to 700, with the regression line l = 0.1526m + 210.6.]

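ii. The intercept is the length of the string with no mass attached.

d) When $m = 300$, $l = 210.6 + 0.1526(300) \approx 256.4$ mm.
e) The string might break if too much weight is added.
f) $r = \dfrac{n\sum ml - \sum m \sum l}{\sqrt{n\sum m^2 - \left(\sum m\right)^2}\,\sqrt{n\sum l^2 - \left(\sum l\right)^2}} = \dfrac{160200}{\sqrt{1050000}\,\sqrt{24900}} \approx 0.9908$

Coefficient of determination: $r^2 = \dfrac{160200^2}{(1050000)(24900)} \approx 0.9816$

About 98.2% of the variation in $l$ is explained by the variation in $m$.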
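
The working for this question can also be cross-checked in code. The sketch below is a checking aid only, with variable names chosen for illustration; it recomputes the line of l on m, the estimate at 300 grams and the coefficient of determination from the data table.

```python
# Question 3 data: mass in grams (m) and length in mm (l)
masses = [100, 200, 300, 400, 500, 600]
lengths = [228, 236, 256, 278, 285, 301]
n = len(masses)

sum_m, sum_l = sum(masses), sum(lengths)
sum_ml = sum(m * l for m, l in zip(masses, lengths))
sum_m2 = sum(m * m for m in masses)
sum_l2 = sum(l * l for l in lengths)

# Regression line of l on m: l = a + b*m
b = (n * sum_ml - sum_m * sum_l) / (n * sum_m2 - sum_m ** 2)
a = sum_l / n - b * sum_m / n

# Estimated length when a mass of 300 grams is attached
length_at_300 = a + b * 300

# Coefficient of determination, computed as the square of r without taking roots
r_squared = (n * sum_ml - sum_m * sum_l) ** 2 / (
    (n * sum_m2 - sum_m ** 2) * (n * sum_l2 - sum_l ** 2)
)

print(f"l = {a:.1f} + {b:.4f}m")
print(f"length at 300 g: {length_at_300:.1f} mm")
print(f"r^2 = {r_squared:.4f}")
```

The fitted equation agrees with the line l = 0.1526m + 210.6 shown on the scatter diagram.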
USING THE CALCULATOR TO FIND THE EQUATION OF THE LINE OF LINEAR
REGRESSION.
The following stages are to be followed when using the commonly used SHARP D.A.L. calculator to enter
bivariate statistical data.

COMMAND
1. Press the [MODE] key.
2. Press the [1] key for STAT.
3. Press the [1] key for LINE.
4. Enter each pair. For example, if the first pair is (10, 1003):
 Enter 10 and press the [x;y or STO] key.
 Then enter 1003.
 Press the [DATA or M+] key.

Repeat the stage 4 process until all the pairs are entered.


Having entered the data, the values of a and b in the equation of the line of regression of y on x,
$y = a + bx$, can be found using the calculator. The ALPHA key identifies a and b
as extra functions on the bracket keys.

The product moment correlation coefficient can also be found using the calculator: the ALPHA key
identifies r as the product moment correlation coefficient.
Other statistical calculations that can be obtained on a calculator are: $\bar{x}$; $\bar{y}$; $\sum x$; $\sum y$; $\sum x^2$;
$\sum y^2$; $\sum xy$.
