Lesson - Correlation Linear Regression
FC301
Correlation & Linear Regression
OBJECTIVES CORRELATION
Be able to show pairs of observations of two variables diagrammatically.
Be able to decide whether there is a relationship between the variables.
Put a numerical measure on the strength of this relationship.
Calculate the product moment correlation coefficient.
Calculate Spearman’s rank correlation coefficient.
OBJECTIVES LINEAR REGRESSION
Define and apply the concepts related to linear equations with one independent variable.
Explain the least-squares criterion.
Obtain and graph the regression equation for a set of data points, interpret the slope of the regression line, and use the regression equation to make predictions.
Define and use the terminology predictor variable and response variable.
Understand the concept of
Correlation
A statistical method used to determine whether a relationship exists between variables.
Two variables are related if changes in the value of one are associated with changes in the value of the other.
The association between two variables
may be seen by plotting a scatter
diagram.
Example 1: 10 students sat a Maths test and
a Physics test. Both tests were marked out of
20. Their marks are shown:
The points lie approximately on a straight line. The higher the Mathematics mark, the higher the Physics mark tends to be.
If both variables increase together they
are said to be positively correlated.
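[Figure: three scatter diagrams showing strong positive correlation, strong negative correlation, and no correlation]
[Figure: scale for $r$ running from $-1$ to $1$: $r = -1$ perfect negative, strong negative, $r = 0$ no correlation, weak positive, $r = 1$ perfect positive]
Rule of thumb: values of $r$ close to $1$ or $-1$ are considered ‘strong’ correlation; values of intermediate size are considered ‘moderate’ correlation.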
Example 4: Product Moment Correlation
Coefficient PMCC (r)
Example 4: Solution
a)
b) $r = \dfrac{-44.5}{\sqrt{32.9 \times 82.5}} = -0.85$
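The arithmetic in part b) can be checked with a short sketch. It assumes the three numbers on the slide are the summary statistics usually written S_xy, S_xx and S_yy; those names, and the function name, are assumptions and do not appear on the slide.

```python
import math

def pmcc_from_summaries(s_xy: float, s_xx: float, s_yy: float) -> float:
    """PMCC from summary statistics: r = S_xy / sqrt(S_xx * S_yy)."""
    return s_xy / math.sqrt(s_xx * s_yy)

# Numbers from part b). Which of 32.9 and 82.5 is S_xx and which is S_yy
# is an assumption; it does not affect r because only their product is used.
r = pmcc_from_summaries(-44.5, 32.9, 82.5)
print(round(r, 2))  # -0.85
```

The negative sign of r comes entirely from the numerator, since the square root in the denominator is always positive.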
Suppose we use a spreadsheet to randomly generate maths marks for students, and separately generate random English marks.
(This Excel demo accompanies this file; you can press F9 in Excel to generate a new set of random data.)
What is the observed PMCC between Maths and English marks in this first set of data?
0.219
But what is the true underlying PMCC between Maths and English?
0. It was stated above that the maths and English marks were generated independently of each other. Independent variables have no correlation. The observed PMCC may vary from the true PMCC because the data is randomly sampled, just as if we threw a fair die, we wouldn’t necessarily see equal counts of each outcome.
✏ $r$ denotes the PMCC of a sample.
✏ $\rho$ (Greek letter rho) is the PMCC for the whole population.
✏ $r$ is the test statistic, $\rho$ is the population parameter.
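The same experiment can be sketched outside Excel. This is an illustration only: the mark range (0 to 100), the sample sizes and the seed are assumptions, not values taken from the accompanying spreadsheet.

```python
import math
import random

def pmcc(xs, ys):
    """Sample product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def sample_r(n_students, rng):
    """Generate independent maths and English marks and return their PMCC."""
    maths = [rng.randint(0, 100) for _ in range(n_students)]
    english = [rng.randint(0, 100) for _ in range(n_students)]
    return pmcc(maths, english)

rng = random.Random(1)  # fixed seed so the run is repeatable
small_samples = [sample_r(10, rng) for _ in range(5)]
large_sample = sample_r(10_000, rng)

print([round(r, 3) for r in small_samples])  # varies noticeably around 0
print(round(large_sample, 3))                # much closer to 0
```

Each small sample plays the role of pressing F9 once: the observed r moves around, while the population value stays at 0.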
COMPLETE WORKSHEET 2
4 4
5 7
2 3
1 1
3 2
7 6
6 5
Example 7: solution
Step 2: Calculate the difference between the ranks (d).
Step 3: Square each difference (multiply it by itself) to get d².
Step 4: Add up all the squared differences to get the sum of d².
The formula…
$r_s = 1 - \dfrac{6 \sum d^2}{n^3 - n}$ (the denominator is often written $n(n^2 - 1)$)
There are 7 pieces of data, so $n = 7$, and the sum of $d^2$ is 8:
$r_s = 1 - \dfrac{6 \times 8}{7 \times 7 \times 7 - 7} = 1 - \dfrac{48}{336} \approx 0.857$
As $r_s$ is close to 1, we can conclude that the wider the stem, the higher the sunflower tends to grow.
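Steps 2 to 4 and the formula can be sketched in a few lines, using the seven pairs of ranks listed before the solution. The variable names are hypothetical; the slides do not say which column is stem width and which is height, and the result is the same either way.

```python
def spearman_from_ranks(ranks_x, ranks_y):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(d^2)/(n^3 - n)."""
    n = len(ranks_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n ** 3 - n)

# The seven rank pairs from the example (column labels are assumed).
ranks_first = [4, 5, 2, 1, 3, 7, 6]
ranks_second = [4, 7, 3, 1, 2, 6, 5]

sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_first, ranks_second))
r_s = spearman_from_ranks(ranks_first, ranks_second)

print(sum_d_squared)   # 8
print(round(r_s, 3))   # 0.857
```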
Example 8: See if there is any
correlation between number of public
houses and places of worship:
Solution: Rank each data set; Subtract
the rankings; Square the differences and
find the sum of squares.
There is a small positive correlation between the number of public houses and the number of places of worship.
Example 9: 8 students take tests in
Statistics and Maths:
Calculate Spearman’s rank correlation coefficient.
Example 10: Product Moment Correlation Coefficient vs Spearman’s rank correlation coefficient
11+ NVR Score    Avg AS point score
119              287
103              265
110              137
37               300
Example 10: Product Moment Correlation Coefficient & Spearman’s rank correlation coefficient
However, if we’re simply interested in how the rankings are correlated, we might discard the original data and use the rankings instead.
11+ NVR Score    Rank    Avg AS point score    Rank
119              1       287                   2
103              3       265                   3
110              2       137                   4
37               4       300                   1
$r_s = -0.4$
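As a rough check of this example, the sketch below ranks the four pairs of scores (rank 1 for the highest score, as in the example) and compares the PMCC of the raw data with Spearman's coefficient computed from the rank differences. The helper names are hypothetical and the ranking helper assumes no tied values.

```python
import math

def rank_descending(values):
    """Rank 1 = largest value. Assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Sample product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_no_ties(ranks_x, ranks_y):
    """1 - 6*sum(d^2) / (n(n^2 - 1)), valid when there are no tied ranks."""
    n = len(ranks_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

nvr = [119, 103, 110, 37]
as_points = [287, 265, 137, 300]

nvr_ranks = rank_descending(nvr)        # [1, 3, 2, 4]
as_ranks = rank_descending(as_points)   # [2, 3, 4, 1]

r = pmcc(nvr, as_points)                    # uses the actual scores
r_s = spearman_no_ties(nvr_ranks, as_ranks) # uses only the ordering

print(nvr_ranks, as_ranks)
print(round(r_s, 1))  # -0.4
```

The two coefficients need not agree: r depends on how far apart the scores are, whereas r_s depends only on their order.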
Example 10:
✏ If no tied ranks:
$r_s = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}$
where $d$ is the difference between each pair of ranks.
(If there are tied ranks, calculate the normal PMCC on the ranked data.)
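The tied-ranks case can be sketched as follows. The marks are made up for illustration (they are not from the slides), and giving tied values the mean of the ranks they occupy is the usual convention, assumed here rather than stated in the source.

```python
import math

def average_ranks(values):
    """Rank 1 = largest; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position taken by v
        count = ordered.count(v)                # how many values tie with v
        ranks.append(first + (count - 1) / 2)   # mean of first..first+count-1
    return ranks

def pmcc(xs, ys):
    """Sample product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical marks with one tie in each list.
test_a = [15, 12, 12, 9, 7]
test_b = [18, 14, 16, 10, 10]

ranks_a = average_ranks(test_a)   # [1.0, 2.5, 2.5, 4.0, 5.0]
ranks_b = average_ranks(test_b)   # [1.0, 3.0, 2.0, 4.5, 4.5]

r_s = pmcc(ranks_a, ranks_b)      # PMCC applied to the ranks
print(ranks_a, ranks_b)
print(round(r_s, 3))              # 0.947
```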
Interpreting $r_s$
$r_s = 1$: Rankings in perfect agreement.
$r_s = -1$: Ranks in reverse order.
$r_s = 0$: No correlation in rankings.
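Proof of $r_s$ and PMCC equivalence (Not in textbook/exam)
Let $x_i$ and $y_i$ be the two sets of ranks, with no ties, and let $d_i = x_i - y_i$.
Since we know each of the $x_i$ (and likewise the $y_i$) are the integers 1 to $n$:
$\sum x_i = \dfrac{n(n+1)}{2}$ and $\sum x_i^2 = \dfrac{n(n+1)(2n+1)}{6}$
so $S_{xx} = S_{yy} = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = \dfrac{n(n^2 - 1)}{12}$.
Expanding $\sum d_i^2 = \sum x_i^2 + \sum y_i^2 - 2\sum x_i y_i$ gives $\sum x_i y_i = \sum x_i^2 - \tfrac{1}{2}\sum d_i^2$, so $S_{xy} = S_{xx} - \tfrac{1}{2}\sum d_i^2$.
Therefore:
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = 1 - \dfrac{\sum d_i^2}{2 S_{xx}} = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)} = r_s$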
Example 11: Exam style question
Edexcel S3 June 2011 Q2
Differences between $r$ and $r_s$ (Bro Exam Tip: This can be tested!)
PMCC:
[Figure: scatter diagram of exam mark ($y$) against time spent revising ($x$), with a line of best fit]
I record people’s exam marks as well as the time they spent revising. I want to predict how well someone will do based on the time they spent revising. How would I do this?
The ‘regression’ is the act of setting the parameters of our model (here the gradient and y-intercept of the line of best fit) to best explain the data.
LINEAR EQUATIONS WITH
ONE INDEPENDENT VARIABLE
y = a + bx
y: dependent variable
x: independent variable
a, b: constants (fixed numbers)
The graph of a linear equation with one independent variable is a straight line, or simply a line.
Any non-vertical line can be represented by such an equation.
Examples of linear equations with
one independent variable and their
graphs:
Example 12: Air – Conditioning Repairs
A company charges $55 per hour plus
a $30 service charge. Let x denote the
number of hours required for a job,
and let y denote the total cost to the
customer. Find the equation that
expresses y in terms of x.
Solution:
Because the rate for air-conditioning repairs is $55 per hour, a job that takes x hours will cost $55x plus the $30 service charge. Hence the total cost, y, of a job that takes x hours is:
y = 30 + 55x
This equation gives us the exact cost for a job if we know the number of hours required. For instance, a job that takes 2 hours will cost y = 30 + 55×2 = $140. To obtain the graph of y = 30 + 55x, we first plot the points displayed in the table and then connect them with a line.
The graph is useful for quickly estimating cost.
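The cost equation y = 30 + 55x can be written as a small function. The hours chosen below are illustrative; the values in the slide's own table are not reproduced here.

```python
HOURLY_RATE = 55      # dollars per hour
SERVICE_CHARGE = 30   # fixed charge in dollars

def total_cost(hours: float) -> float:
    """Total cost y = 30 + 55x for a job that takes x hours."""
    return SERVICE_CHARGE + HOURLY_RATE * hours

# A few points that could be plotted and joined with a line.
for hours in (0, 1, 2, 3):
    print(hours, total_cost(hours))
```

At x = 0 the cost equals the service charge, and each extra hour adds the hourly rate.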
INTERCEPT AND SLOPE
For a linear equation y = a + bx:
y-intercept: a is the y-value of the point of intersection of the line and the y-axis.
slope: b measures the steepness of the line; b indicates how much the y-value changes when the x-value increases by 1 unit.
PREDICTOR VARIABLE AND
RESPONSE VARIABLE
In the context of regression analysis:
Response variable: the variable to be measured or observed.
Predictor or explanatory variable: a variable used to predict or explain the values of the response variable.
OUTLIERS AND INFLUENTIAL
OBSERVATIONS
In the context of regression analysis, an outlier is a data point that lies far from the regression line, relative to the other data points.
An influential observation is a data
point whose removal causes the
regression equation (and line) to
change considerably.
Regression lines with and without the
influential observation removed.
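The effect of an influential observation can be sketched by fitting a line with and without one point. The data below is made up for illustration, and the fit uses the standard least-squares formulas b = S_xy / S_xx and a = mean(y) - b × mean(x).

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: the first five points lie on y = 1 + x,
# the last point sits far from the others.
xs = [1, 2, 3, 4, 5, 12]
ys = [2, 3, 4, 5, 6, 1]

a_all, b_all = fit_line(xs, ys)
a_without, b_without = fit_line(xs[:-1], ys[:-1])

print(round(a_all, 3), round(b_all, 3))          # slope is negative
print(round(a_without, 3), round(b_without, 3))  # 1.0 1.0
```

Removing the single far-away point changes the sign of the slope, which is what makes it influential in this made-up data set.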
The rule y = a + bx connecting the
variables x and y allows the value of
y to be predicted for any given value
of x.
Equation of a straight line: y = a + bx
b (the gradient) is the amount by which y increases for an increase of 1 in x.
a (the y-intercept) is where the line cuts the y-axis (the line x = 0).
Example 13
Least-Squares Criterion
The least-squares criterion is that the
line that best fits a set of data points is
the one having the smallest possible sum
of squared errors.
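The criterion can be illustrated by computing the sum of squared errors for a few candidate lines. The data points and candidate lines are made up for this sketch; one candidate happens to be the least-squares line for these points.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 7]

# Candidate lines as (a, b) pairs.
candidates = {
    "y = 1 + 1.5x": (1, 1.5),
    "y = 0.5 + 1.6x": (0.5, 1.6),
    "y = 0 + 2x": (0, 2),
}

errors = {name: sum_squared_errors(xs, ys, a, b)
          for name, (a, b) in candidates.items()}
best = min(errors, key=errors.get)

for name, sse in errors.items():
    print(name, round(sse, 2))
print("smallest sum of squared errors:", best)
```

Among these three candidates the line with the smallest sum of squared errors is the one the criterion prefers.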
Regression Line and Regression Equation
Regression line: The line that best fits a
set of data points according to the least-
squares criterion.
Regression equation: The equation of
the regression line.
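The equation of the regression line of y on x is:
y = a + bx
where $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$, with $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.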
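A minimal sketch of obtaining a regression line of y on x and using it for a prediction, with the usual least-squares formulas b = S_xy / S_xx and a = mean(y) - b × mean(x). The revision-time data is made up for illustration and is not from the slides.

```python
def regression_line(xs, ys):
    """Least-squares regression line of y on x; returns (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: hours spent revising (predictor) and exam mark (response).
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

a, b = regression_line(hours, marks)
predicted = a + b * 3.5   # prediction inside the range of the observed hours

print(round(a, 1), round(b, 1))  # 46.2 5.6
print(round(predicted, 1))       # 65.8
```

Here b says that, in this made-up data, each extra hour of revision goes with about 5.6 more marks; the prediction is kept within the observed range of x.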
APPLYING AND INTERPRETING THE
REGRESSION EQUATION