Linear Regression and Correlation A Level Notes (Precision Academy)
OBJECTIVES
After studying this topic, students should be able to:
plot scatter diagrams
draw lines of best fit
find the equations of regression lines
calculate Pearson's product moment correlation coefficient (r)
compute the coefficient of determination ($r^2$)
solve problems involving regression and correlation
BIVARIATE DATA
Bivariate data consists of paired values of two variables.
E.g. the association between the cost of living and salary: the cost of living (the
dependent variable) depends on one's salary, whilst salary is the controlled variable (the
independent variable).
P R E C I S I O N A C A D E M Y +263775973880 created by MWEDZI SOLOMON
Example 1
x 8 2 11 6 5 4 12 9 6 1
y 3 10 3 6 8 12 1 4 9 14
[Scatter diagram of y against x for the data above]
REGRESSION.
REGRESSION LINE OF Y ON X.
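The regression line of y on x has equation $y = a + bx$, where
$\bar{x} = \frac{\sum x}{n}$
$\bar{y} = \frac{\sum y}{n}$
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
$a = \bar{y} - b\bar{x}$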
NB n is the number of ordered pairs in the data
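As a rough check on the least squares formulas for the line of y on x, the sketch below computes a and b in y = a + bx from raw data, using b = (nΣxy − ΣxΣy)/(nΣx² − (Σx)²) and a = ȳ − b x̄. The function name regression_y_on_x is hypothetical (it is not from these notes), and the data are the ten pairs from Example 1.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)  # number of ordered pairs
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # the line passes through (x-bar, y-bar)
    return a, b

# Data from Example 1
xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]
a, b = regression_y_on_x(xs, ys)
print(f"y = {a:.3f} + ({b:.3f})x")
```

For this data the gradient comes out negative (roughly y = 14.08 − 1.106x), which agrees with the downward trend of the points.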
REGRESSION LINE OF X ON Y.
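The regression line of x on y has equation $x = c + dy$, where
$\bar{x} = \frac{\sum x}{n}$
$\bar{y} = \frac{\sum y}{n}$
$d = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}$
$c = \bar{x} - d\bar{y}$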
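The line of x on y can be computed in the same way, with the roles of x and y swapped in the denominator. The sketch below writes the line as x = c + dy; the symbols c and d and the function name regression_x_on_y are assumptions made for illustration, and the data are again the pairs from Example 1.

```python
def regression_x_on_y(xs, ys):
    """Return (c, d) for the least squares line x = c + d*y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_y2 = sum(y * y for y in ys)
    d = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
    c = sum_x / n - d * sum_y / n  # the line passes through (x-bar, y-bar)
    return c, d

# Data from Example 1
xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]
c, d = regression_x_on_y(xs, ys)
print(f"x = {c:.3f} + ({d:.3f})y")
```

In general this is not the same line as y on x rearranged for x; the two lines coincide only when the correlation is perfect.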
Example 2:
Use the least squares method to determine the equations of the regression lines for the data below.
x 3 7 9 11 14 14 15 21 22 23 26
y 5 12 5 12 10 17 23 16 10 20 25
Solution
x 3 7 9 11 14 14 15 21 22 23 26 $\sum x = 165$
y 5 12 5 12 10 17 23 16 10 20 25 $\sum y = 155$
$\sum x^2 = 3007$
$\sum y^2 = 2637$
$\sum xy = 2665$
$n = 11$
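a) y on x
Gradient: $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{11(2665) - 165(155)}{11(3007) - 165^2} = \frac{3740}{5852} \approx 0.6391$
Intercept: $a = \bar{y} - b\bar{x} = \frac{155}{11} - 0.6391 \times 15 \approx 4.504$
Therefore $y = 4.504 + 0.639x$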
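b) x on y
Gradient: $d = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{11(2665) - 165(155)}{11(2637) - 155^2} = \frac{3740}{4982} \approx 0.7507$
Intercept: $c = \bar{x} - d\bar{y} = 15 - 0.7507 \times \frac{155}{11} \approx 4.422$
Therefore $x = 4.422 + 0.751y$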
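The arithmetic in Example 2 can be checked directly from the column totals rather than from the raw data. The sketch below is only a verification aid: it assumes the lines are written as y = a + bx and x = c + dy (c and d are labels chosen here), and it uses the sums from the solution table.

```python
# Totals from the Example 2 table
n = 11
sum_x, sum_y = 165, 155
sum_x2, sum_y2, sum_xy = 3007, 2637, 2665

s = n * sum_xy - sum_x * sum_y       # numerator shared by both gradients
b = s / (n * sum_x2 - sum_x ** 2)    # gradient of y on x
a = sum_y / n - b * sum_x / n        # intercept of y on x
d = s / (n * sum_y2 - sum_y ** 2)    # gradient of x on y
c = sum_x / n - d * sum_y / n        # intercept of x on y

print(f"y on x: y = {a:.3f} + {b:.3f}x")
print(f"x on y: x = {c:.3f} + {d:.3f}y")
```

Working from the totals mirrors what is done by hand in an exam, where the sums are usually tabulated first.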
When drawing the straight line, at least two points must be plotted. The first point is
$(\bar{x}, \bar{y})$ and the other point is found by substituting any of the given values of x into the
equation of the regression line.
Alternatively, use $(\bar{x}, \bar{y})$ and the y- or x-intercept as the second point through which
the line should pass.
Line y on x passes through $(\bar{x}, \bar{y})$ and the y-intercept (the point where the line cuts the
y-axis).
Line x on y passes through $(\bar{x}, \bar{y})$ and the x-intercept (the point where the line cuts
the x-axis).
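To illustrate the two-point method for the line of y on x, the sketch below returns the mean point and the y-intercept, which are enough to rule the line on a graph. The function name drawing_points_y_on_x is hypothetical, and the data are taken from Example 2.

```python
def drawing_points_y_on_x(xs, ys):
    """Return two points on the y on x line: the mean point and the y-intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum(xs) * sum(ys)) / (n * sum_x2 - sum(xs) ** 2)
    a = y_bar - b * x_bar  # y-intercept of the line
    return (x_bar, y_bar), (0.0, a)

# Data from Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
mean_point, intercept_point = drawing_points_y_on_x(xs, ys)
print(mean_point, intercept_point)
```

When the intercept falls outside the scale of the graph, substituting one of the given x values into the equation gives a more convenient second point.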
The regression line can also be used to predict the value of a dependent variable from
an independent variable, using either the regression equation or the graph.
Predictions should only be made for values of the independent variable that are
within the range of the given data.
Making a prediction within the range of the given data is called interpolation.
Making a prediction outside the range of the given data is
called extrapolation; it is much less reliable and may lead to implausible results.
The prediction will be more reliable if the number of data values in the original
sample set is bigger.
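The difference between interpolation and extrapolation can be made concrete with a small sketch. The helper predict_y below is hypothetical: it fits the line of y on x, returns the estimate, and also reports whether the chosen x lies inside the observed range. The data are from Example 2.

```python
def predict_y(x_new, xs, ys):
    """Estimate y at x_new from the y on x line.

    Returns (estimate, is_interpolation), where is_interpolation is True
    only if x_new lies within the range of the observed x values.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    is_interpolation = min(xs) <= x_new <= max(xs)
    return a + b * x_new, is_interpolation

# Data from Example 2 (x runs from 3 to 26)
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
y_at_10, inside_10 = predict_y(10, xs, ys)   # x = 10 is inside the range
y_at_40, inside_40 = predict_y(40, xs, ys)   # x = 40 is outside the range
print(round(y_at_10, 2), inside_10)
print(round(y_at_40, 2), inside_40)
```

The estimate at x = 40 is still produced by the equation, but the flag signals that it is an extrapolation and should be treated with caution.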
CORRELATION.
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related.
Correlation is the degree to which two variables move in coordination with one another.
Correlation is the degree to which a change in one of the variables is reflected by a
corresponding change in the other variable.
CORRELATION AND CAUSALITY
Causality or causation means that one event is the result of the occurrence of another.
Correlation doesn’t imply causality.
E.g. a survey of learners may show a strong correlation between those with the
biggest right feet and those who are best at mental arithmetic.
However, in reality having a big right foot doesn’t cause one to be good at mental arithmetic.
HOW IS CORRELATION MEASURED?
Statistical correlation is measured using Pearson's product moment correlation
coefficient.
PEARSON'S PRODUCT MOMENT CORRELATION COEFFICIENT.
Pearson's product moment correlation coefficient is a measure of the strength of the linear
association between two variables and is denoted r.
The product moment correlation coefficient indicates the degree of scatter.
Pearson's product moment correlation coefficient (r) takes values from -1 to 1. This can
be written as $-1 \le r \le 1$.
If the gradient of the regression line is positive then the data set has positive correlation, and
if the gradient is negative then the data set has negative correlation.
Value of r                   Comment
r = 0                        No correlation between the variables
r = 1                        Perfect positive correlation between the variables
r = -1                       Perfect negative correlation between the variables
r positive, close to 0       Some or weak positive correlation between the variables
r negative, close to -1      High or strong negative correlation between the variables
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
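A minimal sketch of this calculation is given below, assuming the formula r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) √(nΣy² − (Σy)²)). The function name pearson_r is an illustrative choice, and the data are the ten pairs from Example 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Data from Example 1
xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

For the Example 1 data r is about -0.93, a strong negative correlation. Swapping the two lists gives the same value, since the formula is symmetric in x and y.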
Example 3:
Calculate the product moment correlation coefficient from the data given in the table.
x 3 7 9 11 14 14 15 21 22 23 26
y 5 12 5 12 10 17 23 16 10 20 25
Solution
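x 3 7 9 11 14 14 15 21 22 23 26 $\sum x = 165$
y 5 12 5 12 10 17 23 16 10 20 25 $\sum y = 155$
$\sum x^2 = 3007$
$\sum y^2 = 2637$
$\sum xy = 2665$
$n = 11$
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
$r = \frac{11(2665) - 165(155)}{\sqrt{11(3007) - 165^2}\,\sqrt{11(2637) - 155^2}}$
$r = \frac{3740}{\sqrt{5852}\,\sqrt{4982}} \approx 0.693$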
The association between x and y indicates a strong positive correlation.
COEFFICIENT OF DETERMINATION ($r^2$)
The coefficient of determination ($r^2$) is used to analyze how differences in
one variable can be explained by differences in a second variable.
E.g. when a person gets pregnant has a direct relation to when they give birth.
FINDING THE COEFFICIENT OF DETERMINATION ($r^2$)
Step 1: Find the correlation coefficient, r (it may be given to you in the
question). Example: r = 0.32.
Step 2: Square the correlation coefficient: $0.32^2 = 0.1024$.
Step 3: Convert the squared value to a percentage: 10.24%.
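The three steps above amount to one squaring and one change of format, as this short sketch shows for the example value r = 0.32 (the variable names are illustrative only).

```python
r = 0.32                   # Step 1: correlation coefficient from the example
r_squared = r ** 2         # Step 2: square it
print(f"r^2 = {r_squared:.4f} = {r_squared:.2%}")  # Step 3: express as a percentage
```

Note that the sign of r is lost on squaring, so the coefficient of determination alone does not say whether the correlation is positive or negative.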
The coefficient of determination gives the proportion of the variation in the dependent
variable that is explained by the regression line.
The higher the coefficient, the closer the data points lie to the line when
the data points and line are plotted.
If the coefficient is 0.80, then 80% of the variation in y is explained by the linear relationship with x.
Values of 1 or 0 would indicate that the regression line explains all or none of the variation,
respectively.
A higher coefficient is an indicator of a better goodness of fit for the observations.
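One way to see what the coefficient of determination measures is to compute it as the share of the variation in y that the fitted line accounts for, that is 1 − (unexplained variation)/(total variation). The sketch below does this for the Example 2 data; the function name and the intermediate names s_xy and s_xx are assumptions introduced here, not notation from these notes.

```python
def r_squared_explained(xs, ys):
    """Proportion of the variation in y explained by the y on x line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # gradient of y on x
    a = y_bar - b * x_bar      # intercept of y on x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - unexplained / total

# Data from Example 2
xs = [3, 7, 9, 11, 14, 14, 15, 21, 22, 23, 26]
ys = [5, 12, 5, 12, 10, 17, 23, 16, 10, 20, 25]
r2 = r_squared_explained(xs, ys)
print(f"r^2 = {r2:.3f}")
```

The result, about 0.48, equals the square of the correlation coefficient for the same data, so roughly 48% of the variation in y is accounted for by the line.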
USEFULNESS OF $r^2$
The usefulness of $r^2$ is that it indicates how likely future observations are to fall close to the
predicted outcomes.
CORRELATION VS REGRESSION
Regression describes how a change in x affects y, and
vice versa. In correlation, x and y can be interchanged and the
same result is obtained.
Correlation is summarised by a single statistic (r), whereas regression produces
an equation, which is represented by a line: the line of y on x
or the line of x on y.
Correlation helps to establish and measure a relationship between two variables, and regression,
on the other hand, helps to find out how one variable affects another.
Regression models how the dependent variable responds when the independent variable
changes. Correlation only indicates whether the two variables move in the same direction
or in opposite directions.
Prediction and optimization only work with the regression method and are not
viable in correlation analysis.
PRACTICE QUESTIONS
1. ZIMSEC P4 2014
age 5 4 6 5 5 5 6 6 2 7 7
price 8.5 10.3 7.0 8.2 8.9 9.8 6.6 9.5 16.9 7.0 4.8
The table displays data on age and price for a sample of cars. Ages are in years whilst the prices
are in thousands of dollars.
a) Draw a scatter diagram of price against age.
b) Calculate the equation of the regression line of price on age of the car.
c) Draw the regression line on the scatter diagram in a) and use it to estimate the price of a 3
year old car.
d) Find the product-moment correlation coefficient and comment on it.
SOLUTION
a) and b)
Let the age be x and the price be y.
[Scatter diagram of price (y) against age (x), with regression line y = 19.547 - 2.0261x]
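Age (x) 5 4 6 5 5 5 6 6 2 7 7 $\sum x = 58$
$x^2$ 25 16 36 25 25 25 36 36 4 49 49 $\sum x^2 = 326$
$y^2$ 72.25 106.09 49 67.24 79.21 96.04 43.56 90.25 285.61 49 23.04 $\sum y^2 = 961.29$
$\sum y = 97.5$, $\sum xy = 473.2$, $n = 11$
$\bar{x} = \frac{58}{11} \approx 5.27$, $\bar{y} = \frac{97.5}{11} \approx 8.86$
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{11(473.2) - 58(97.5)}{11(326) - 58^2} = \frac{-449.8}{222} \approx -2.0261$
$a = \bar{y} - b\bar{x} \approx 19.547$, so $y = 19.547 - 2.0261x$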
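The remaining parts of this question can be checked numerically. The sketch below is a verification aid under the assumption that x is the age in years and y the price in thousands of dollars; it recomputes the regression line of price on age, the estimate for a 3 year old car and the product moment correlation coefficient from the table in the question.

```python
from math import sqrt

ages = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                                # x, years
prices = [8.5, 10.3, 7.0, 8.2, 8.9, 9.8, 6.6, 9.5, 16.9, 7.0, 4.8]      # y, $1000s

n = len(ages)
sum_x, sum_y = sum(ages), sum(prices)
sum_xy = sum(x * y for x, y in zip(ages, prices))
sum_x2 = sum(x * x for x in ages)
sum_y2 = sum(y * y for y in prices)

s = n * sum_xy - sum_x * sum_y
b = s / (n * sum_x2 - sum_x ** 2)            # gradient of price on age
a = sum_y / n - b * sum_x / n                # intercept
r = s / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
price_at_3 = a + b * 3                       # age 3 lies within the data range (2 to 7)

print(f"y = {a:.3f} + ({b:.4f})x")
print(f"estimated price of a 3 year old car: {price_at_3:.2f} thousand dollars")
print(f"r = {r:.3f}")
```

The line agrees with the one shown on the scatter diagram, the estimate is roughly 13.5 thousand dollars, and r is close to -0.92, which indicates a strong negative correlation: older cars tend to be cheaper.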
2. A biologist assumes that there is a linear relationship between the amount of fertilizer
supplied to tomato plants and the subsequent yield of tomatoes obtained.
Eight tomato plants of the same variety were selected at random and treated with a
solution in which x grams of fertilizer was dissolved in a fixed quantity of water. The
yield, y kilograms, of tomatoes was recorded.
A B C D E F G H
SOLUTION
a) [Scatter diagram with regression line y = 1.081x + 3.2524]
b) $\bar{x}$, $\bar{y}$
c)
d) 20 is beyond the range of the data, so the estimate is an extrapolation and may be unreliable.
3. One end A of an elastic string was attached to a horizontal bar and a mass m grams was
attached to the other end B. The mass was suspended freely and allowed to settle
vertically below A. The length AB, l mm, was recorded for various masses as follows.
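m 100 200 300 400 500 600
l 228 236 256 278 285 301
$\bar{m} = \frac{2100}{6} = 350$, $\bar{l} = \frac{1584}{6} = 264$
$b = \frac{n\sum ml - \sum m \sum l}{n\sum m^2 - (\sum m)^2} = \frac{6(581100) - 2100(1584)}{6(910000) - 2100^2} = \frac{160200}{1050000} \approx 0.1526$
$a = \bar{l} - b\bar{m} = 264 - 0.1526 \times 350 \approx 210.6$, so $l = 210.6 + 0.1526m$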
[Graph of l (mm) against m (grams), with regression line l = 0.1526m + 210.6]
coefficient of determination
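As a check on this question, the sketch below recomputes the regression line of l on m and the coefficient of determination from the table of masses and lengths. It is a verification aid only, and the variable names are illustrative.

```python
masses = [100, 200, 300, 400, 500, 600]    # m, grams
lengths = [228, 236, 256, 278, 285, 301]   # l, mm

n = len(masses)
sum_m, sum_l = sum(masses), sum(lengths)
sum_ml = sum(m * l for m, l in zip(masses, lengths))
sum_m2 = sum(m * m for m in masses)
sum_l2 = sum(l * l for l in lengths)

s = n * sum_ml - sum_m * sum_l
b = s / (n * sum_m2 - sum_m ** 2)          # gradient of l on m
a = sum_l / n - b * sum_m / n              # intercept: fitted length at m = 0
r_squared = s ** 2 / ((n * sum_m2 - sum_m ** 2) * (n * sum_l2 - sum_l ** 2))

print(f"l = {a:.1f} + {b:.4f}m")
print(f"r^2 = {r_squared:.4f}")
```

The fitted line matches the one on the graph, and r² is about 0.98, so nearly all of the variation in length is explained by the linear relationship with mass.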
COMMAND DISPLAY
1. Press the [MODE ] key
Repeat the stage 4 process until all the pairs are entered.
Having entered the data, the values of a and b in the equation of the line of regression of y on x,
y = a + bx, can be found using the calculator. The ALPHA key identifies a and b.
The product moment correlation coefficient can also be found using the calculator; the ALPHA key
identifies r as the product moment correlation coefficient.
Other statistical values that can be obtained on a calculator are: $\bar{x}$, $\bar{y}$, $\sum x$, $\sum y$, $\sum x^2$,
$\sum y^2$, $\sum xy$.