
Lesson - Correlation Linear Regression

Uploaded by

lewaahaidar6
Copyright
© © All Rights Reserved

Statistics

FC301

Correlation & Linear Regression
OBJECTIVES
Correlation
Be able to show diagrammatically,
pairs of observations of variables
Be able to decide if there is a
relationship between the variables
Put a numerical measure on the
strength of this relationship
Calculate Product moment correlation
coefficient
Calculate Spearman’s rank correlation
coefficient
OBJECTIVES LINEAR REGRESSION
Define and apply the concepts related
to linear equations with one
independent variable.
Explain the least squares criterion.
Obtain and graph the regression
equation for a set of data points,
interpret the slope of the regression
line, and use the regression equation
to make predictions.
Define and use the terminology
predictor variable and response
variable.
Understand the concept of
Correlation
Statistical method used to
determine whether a relationship
exists between variables.
Two variables are related if:
Changes in the value of one are
related to changes in the value of the
other.
The association between two variables
may be seen by plotting a scatter
diagram.
Example 1: 10 students sat a Maths test and
a Physics test. Both tests were marked out of
20. Their marks are shown:
The points lie approximately on a straight line.
The higher the Mathematics mark, the …
If both variables increase together they
are said to be positively correlated.

If one variable increases as the other decreases they are said to be negatively correlated.

If no straight line (linear) pattern can be seen there is said to be no correlation.
(For example, there is no correlation between
a person’s height and how much they earn.)
For a positive correlation the
points on the scatter diagrams
increase as you go from left to
right. Most points lie in the first
and third quadrants.
For a negative correlation the
points on the scatter diagram
decrease as you go from left to
right. Most points lie in the second
and fourth quadrants.
For no correlation, the points lie in
all four quadrants.
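
The quadrant description above can be checked numerically. The sketch below assumes the quadrants are formed by drawing axes through the mean point of the data (the slides do not say where the axes are drawn), and the marks are hypothetical values, not the data from Example 1.

```python
from statistics import mean


def quadrant_counts(xs, ys):
    """Count points in each quadrant, with axes drawn through the mean point."""
    x_bar, y_bar = mean(xs), mean(ys)
    counts = {"first": 0, "second": 0, "third": 0, "fourth": 0}
    for x, y in zip(xs, ys):
        if x == x_bar or y == y_bar:
            continue  # a point on an axis belongs to no quadrant
        if x > x_bar and y > y_bar:
            counts["first"] += 1
        elif x < x_bar and y > y_bar:
            counts["second"] += 1
        elif x < x_bar and y < y_bar:
            counts["third"] += 1
        else:
            counts["fourth"] += 1
    return counts


# Hypothetical marks out of 20 (illustrative only).
maths = [4, 6, 9, 11, 13, 15, 16, 18]
physics = [5, 8, 8, 12, 11, 16, 15, 19]
counts = quadrant_counts(maths, physics)
print(counts)
```

Most of these points fall in the first and third quadrants, which is the pattern described for a positive correlation.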
Example 2:
Line of best fit
A line that follows the trend of the data collected.
Two sets of data: Plot data points on a scatter diagram
Calculate the mean for both sets of data
Draw a straight line which passes through the mean point (x̄, ȳ) and which has an approximately equal number of points above and below the line.
Various degrees of linear correlation!
Example 3:
Solution - Example 3:
Top row: Strong positive, Strong negative, No correlation
Bottom row: Moderate negative, Moderate positive, Weak positive
COMPLETE WORKSHEET 1

6.1.1- The line of best fit - WORKSHEET


Measuring Correlation
You’re used to using qualitative terms such as “positive correlation”, “negative correlation” and “no correlation” to describe the type of correlation, and terms such as “perfect”, “strong” and “weak” to describe the strength.
The Product Moment Correlation Coefficient is one way to quantify this:

✏ The product moment correlation coefficient (PMCC), denoted by r, describes the linear correlation between two variables. It can take values between -1 and 1.
Note that PMCC is only applicable for a linear correlation, i.e. closeness of fit to a linear regression line (i.e. a straight ‘line of best fit’). The data may exhibit strong correlation with respect to a different model (e.g. exponential) even when the PMCC is low.

Scale of r from -1 to 1: perfect negative (r = -1), strong negative, no correlation (r = 0), weak positive, perfect positive (r = 1).
Rule of thumb: … or … is considered to be ‘strong’ correlation.
Rule of thumb: … or … is considered to be ‘moderate’ correlation.
Example 4: Product Moment Correlation Coefficient PMCC (r)
Example 4: Solution
a)
b) $r = \dfrac{-44.5}{\sqrt{32.9 \times 82.5}} = -0.85$
c) There is negative correlation. The relatively older young people took less time to reach the required level.
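
The arithmetic in part b can be reproduced with a few lines of Python. This is a minimal sketch: the parameter names `s_xy`, `s_xx` and `s_yy` are assumed labels for the three summary values in the solution (the slide does not say which of 32.9 and 82.5 belongs to which variable, and the result does not depend on it).

```python
import math


def pmcc_from_summary(s_xy, s_xx, s_yy):
    """PMCC from summary statistics: r = S_xy / sqrt(S_xx * S_yy)."""
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary values taken from the Example 4 solution.
r = pmcc_from_summary(-44.5, 32.9, 82.5)
print(round(r, 2))  # -0.85
```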
Example 5: PMCC (r)
[Textbook] From the large data set, the daily mean windspeed and the daily maximum gust, both in knots, were recorded for the first 10 days in September in Hurn in 1987.
a. State the meaning of n/a in the table above.
b. Calculate the product moment correlation coefficient for the remaining 8 days.
c. With reference to your answer to part b, comment on the suitability of a linear regression model for these data.

a. Data on daily maximum gust is not available for these days.
b. r is close to 1 so there is a strong positive correlation between daily mean windspeed and daily maximum gust.
c. This means that the data points lie close to a straight line, so a linear regression model is suitable.

This is a common exam question. The important bit is evaluating the suitability of the chosen model (in this case a linear regression model, i.e. line of best fit). The closer r is to 1 or to -1, the more suitable this linear regression model.
Example 6: PMCC sample (r)

Suppose we use a spreadsheet to randomly generate maths marks for students, and separately generate random English marks.
(This Excel demo accompanies this file – you can press F9 in Excel to generate a new set of random data)
What is the observed PMCC between Maths and English marks in this first set of data?
0.219
But what is the true underlying PMCC between Maths and English?
0. It was stated above that the maths and English marks were generated independently of each other. Independent variables have zero correlation. The observed PMCC may vary from the true PMCC because the data is randomly sampled, just as if we threw a fair die, we wouldn’t necessarily see equal counts of each outcome.

✏ r denotes the PMCC of a sample.
✏ ρ (Greek letter rho) is the PMCC for the whole population.
✏ r is the test statistic, ρ is the population parameter.
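
The same experiment can be sketched in Python instead of Excel. The sample size of 30 students, the 0 to 100 mark range and the 200 repetitions are assumptions chosen for illustration; the helper `pmcc` is written out by hand so the block is self-contained.

```python
import random
from statistics import mean


def pmcc(xs, ys):
    """Sample product moment correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


def random_marks(n):
    """Hypothetical marks, generated independently of any other subject."""
    return [random.randint(0, 100) for _ in range(n)]


random.seed(1)  # fixed seed so the run is repeatable
sample_rs = [pmcc(random_marks(30), random_marks(30)) for _ in range(200)]
average_r = mean(sample_rs)

print("smallest r:", round(min(sample_rs), 3))
print("largest r: ", round(max(sample_rs), 3))
print("average r: ", round(average_r, 3))
```

Individual samples give values of r some distance from 0, but they scatter around the population value ρ = 0, which is the point the example makes.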
COMPLETE WORKSHEET 2

6.1.2 – PMCC WORKSHEET


SPEARMAN’S RANK CORRELATION
COEFFICIENT (rS)
Usually used for ordinal or non-numeric data.
The order or rank of the data is taken to see if there is a relationship between the variables.
To calculate rS we need to:
① Rank each data set if not ranked already
② Subtract the rankings
The formula for Spearman’s rank correlation coefficient (rS) is:

$r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$

d is the difference between the rankings of each item
n is the number of paired observations
Example 7: Spearman’s rank correlation
coefficient (rS)
Calculate the Spearman’s Rank Correlation Coefficient and
comment on the result.
Example 7: solution
Step 1: Rank the data

First rank   Second rank
4            4
5            7
2            3
1            1
3            2
7            6
6            5

Step 2: Calculate the difference between the ranks (d): 0, -2, -1, 0, 1, 1, 1
Step 3: Square (multiply by itself) each difference (d²): 0, 4, 1, 0, 1, 1, 1
Step 4: Add up all the squared differences: Σd² = 8
The formula…

$r_s = 1 - \dfrac{6\sum d^2}{n^3 - n}$

That is, 1 minus (6 × the sum of d²) divided by (the number of data ranked cubed − the number of data ranked).
Example: the sum of d² is 8, and there are 7 pieces of data, so n = 7.

$r_s = 1 - \dfrac{6 \times 8}{7 \times 7 \times 7 - 7} = 1 - \dfrac{48}{336} \approx 0.86$

As rS is close to 1 we can conclude that the wider the stem, the higher the sunflower grows.
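
The four steps of Example 7 can be sketched in Python using the rank pairs listed in Step 1. The variable names `first_ranks` and `second_ranks` are assumptions, since the slide does not label which column is stem width and which is height.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's rank coefficient for untied ranks: 1 - 6*sum(d^2) / (n^3 - n)."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Rank pairs from Step 1 of Example 7.
first_ranks = [4, 5, 2, 1, 3, 7, 6]
second_ranks = [4, 7, 3, 1, 2, 6, 5]

# Steps 2 to 4: differences, squared differences, and their sum.
differences = [a - b for a, b in zip(first_ranks, second_ranks)]
sum_d2 = sum(d ** 2 for d in differences)

r_s = spearman_from_ranks(first_ranks, second_ranks)
print(sum_d2, round(r_s, 3))
```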
Example 8: See if there is any
correlation between number of public
houses and places of worship:
Solution: Rank each data set; subtract the rankings; square the differences and find the sum of squares.
There is a small positive correlation between the number of public houses and the number of places of worship.
Example 9: 8 students take tests in Statistics and Maths:
Calculate Spearman’s rank correlation coefficient.
Example 10: Product Moment Correlation Coefficient vs Spearman’s rank correlation coefficient

11+ NVR Score   Rank   Avg AS point score   Rank
119             1      287                  2
103             3      265                  3
110             2      137                  4
37              4      300                  1

If we’re simply interested in how the rankings are correlated, we might discard the original data and use the rankings instead:

rS = -0.4

✏ Spearman’s rank correlation coefficient rS is the PMCC calculated after the data has been converted to rankings.
Example 10:

✏ If no tied ranks:

$r_s = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$

where d is the difference between each pair of ranks.
(If tied ranks, calculate normal PMCC on ranked data)
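
As a check on Example 10, the sketch below ranks the four pairs of scores and computes rS two ways: with the no-ties formula and as the PMCC of the ranks. The helper names are assumptions, and `rank_descending` only handles data without ties.

```python
from statistics import mean


def rank_descending(values):
    """Rank 1 = largest value. Assumes there are no tied values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def pmcc(xs, ys):
    """Product moment correlation coefficient."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


# Scores from Example 10.
nvr = [119, 103, 110, 37]
as_points = [287, 265, 137, 300]

nvr_ranks = rank_descending(nvr)
as_ranks = rank_descending(as_points)

n = len(nvr)
sum_d2 = sum((a - b) ** 2 for a, b in zip(nvr_ranks, as_ranks))
rs_formula = 1 - 6 * sum_d2 / (n * (n * n - 1))
rs_pmcc = pmcc(nvr_ranks, as_ranks)

print(nvr_ranks, as_ranks)
print(round(rs_formula, 3), round(rs_pmcc, 3))
```

Both routes give the same value here, which illustrates (but does not prove) that the formula is the PMCC applied to untied ranks.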
Interpreting rS
rS = 1: Rankings in perfect agreement.
rS = -1: Ranks in reverse order.
rS = 0: No correlation in rankings.
Proof of rS and PMCC equivalence (Not in textbook/exam)
Since we know the ranks are 1 to n: …
Therefore: …
Example 11: Exam style question
Edexcel S3 June 2011 Q2

Differences between r and rS (Bro Exam Tip: This can be tested!)

The PMCC (r) is calculated on the original data; Spearman’s rank (rS) is calculated on the ranked data.

Spearman’s Rank:
Makes no assumption about original data: original data need not be linear.

PMCC:
We can only do a hypothesis test if the variables are (jointly) normally distributed.
(We’ll do hypothesis testing in a sec)
DEALING WITH TIED RANKS (1)
Example 10: Suppose that in the dance competition Judge 1 gave both dances E and F a rank of 4 and Judge 2 gave dances A, B and H a rank of 2. Their marks are shown:

If E and F had been slightly different Judge 1 would have ranked them 4 and 5, so we award each of them the mean of those ranks: (4 + 5) / 2 = 4.5
DEALING WITH TIED RANKS (2)

If A, B and H had been slightly different Judge 2 would have ranked them 2, 3 and 4, so we award each of them the mean of those ranks: (2 + 3 + 4) / 3 = 3
The lowest maths mark is 10 (Rank 1)
The next mark is 14 (Rank 2)
The next two marks are both 15, so they are ranked: (3 + 4) / 2 = 3.5
The next two marks are both 16 so they are ranked: (5 + 6) / 2 = 5.5
The next two marks are both 17 so they are ranked: (7 + 8) / 2 = 7.5
The top three marks are all 18 so they are ranked: (9 + 10 + 11) / 3 = 10
Apply the same procedure for the Statistics marks!
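
The tied-rank procedure can be sketched as a small Python function. The list of maths marks below is reconstructed from the description above (one 10, one 14, two each of 15, 16 and 17, three 18s) and is given in sorted order, which is an assumption; the original table order is not shown.

```python
def average_ranks(values):
    """Rank ascending (1 = lowest); tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first_position = ordered.index(v) + 1  # lowest rank the tied group occupies
        group_size = ordered.count(v)
        rank_of[v] = first_position + (group_size - 1) / 2  # mean of the group's ranks
    return [rank_of[v] for v in values]


# Maths marks as described in the text (order assumed).
maths_marks = [10, 14, 15, 15, 16, 16, 17, 17, 18, 18, 18]
maths_ranks = average_ranks(maths_marks)
print(maths_ranks)
```

Once both sets of marks carry tied ranks, the coefficient would be computed as the PMCC of the two rank lists rather than with the no-ties formula.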
This result suggests some positive correlation between the mathematics and statistics results.

Spearman’s rank correlation coefficient is not used to find linear correlation but a more general correlation.
COMPLETE WORKSHEET 3

6.1.3 - Spearman’s Rank WORKSHEET


Linear Regression
I record people’s exam marks as well as the time they spent revising. I want to predict how well someone will do based on the time they spent revising. How would I do this?
[Scatter diagram: exam mark y against time spent revising x, with the line of best fit y = 20 + 3x]
What we’ve done here is come up with a model to explain the data, in this case, a line y = a + bx. We’ve then tried to set a and b such that the resulting y value matches the actual exam marks as closely as possible.
The ‘regression’ is the act of setting the parameters of our model (here the gradient and y-intercept of the line of best fit) to best explain the data.
LINEAR EQUATIONS WITH
ONE INDEPENDENT VARIABLE

y = a + bx
y is the dependent variable, x is the independent variable, and a and b are constants (fixed numbers).
The graph of a linear equation with one independent variable is a straight line, or simply a line.
Any non-vertical line can be represented by such an equation.
Examples of linear equations with
one independent variable and their
graphs:
Example 12: Air – Conditioning Repairs
A company charges $55 per hour plus
a $30 service charge. Let x denote the
number of hours required for a job,
and let y denote the total cost to the
customer. Find the equation that
expresses y in terms of x.
Solution:
Because the rate for air-conditioning repairs is $55 per hour, a job that takes x hours will cost $55x plus the $30 service charge. Hence the total cost, y, of a job that takes x hours is:
y = 30 + 55x
This equation gives us the exact cost for a job if we know the number of hours required. For instance, a job that takes 2 hours will cost y = 30 + 55×2 = $140. To obtain the graph of y = 30 + 55x we first plot the points displayed in the Table and then connect them with a line.
The graph is useful for quickly estimating cost.
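
The cost equation translates directly into code. In this sketch the function name `repair_cost` and the choice of 0 to 4 hours for the table of points are assumptions for illustration; the slide's own table is not reproduced in the text.

```python
HOURLY_RATE = 55      # dollars per hour (slope b)
SERVICE_CHARGE = 30   # dollars (y-intercept a)


def repair_cost(hours):
    """Total cost y = 30 + 55x for a job taking x hours."""
    return SERVICE_CHARGE + HOURLY_RATE * hours


# Points that could be plotted and joined with a line.
table = [(x, repair_cost(x)) for x in range(0, 5)]
for x, y in table:
    print(x, y)
```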
INTERCEPT AND SLOPE

For the line y = a + bx:
y-intercept: a is the y-value of the point of intersection of the line and the y-axis.
slope: b measures the steepness of the line; b indicates how much the y-value changes when x increases by 1.
PREDICTOR VARIABLE AND
RESPONSE VARIABLE
In the context of regression analysis:

Response variable: The variable to be measured or observed.
Predictor or explanatory variable: A variable used to predict or explain the values of the response variable.
OUTLIERS AND INFLUENTIAL
OBSERVATIONS
In the context of regression analysis, an outlier is a data point that lies far from the regression line, relative to the other data points.
An influential observation is a data
point whose removal causes the
regression equation (and line) to
change considerably.
Regression lines with and without the
influential observation removed.
The rule y = a + bx connecting the
variables x and y allows the value of
y to be predicted for any given value
of x.
Equation of a straight line: y = a + bx
b is the gradient: the amount by which y increases for an increase of 1 in x.
a is the y-intercept: where the line cuts the y-axis (the line x = 0).
Example 13

If the points on a scatter diagram follow a linear pattern a straight line can be used as a …
COMPLETE WORKSHEET 4

6.1.4 - Gradient and y-intercept WORKSHEET


INDEPENDENT AND DEPENDENT VARIABLES

An independent (or explanatory) variable is one that is set independently of the other variable. It is plotted along the x-axis.
A dependent (or response) variable is one whose values are determined by the values of the independent variable. It is plotted along the y-axis.
Example 14: A company wants to
predict sales of a new album by a pop
group. They know the yearly sales of a
number of existing albums by the
same group and the number of stores
that stock each album.
Which is the independent variable and
which is the dependent variable?
Solution: The yearly sales of each
album depend on the number of stores
that stock it. Therefore the
independent variable is the number of
stores and the dependent variable is
the sales.
For each point on a scatter diagram you can express y in terms of x as

$y_i = a + b x_i + e_i$

The values e1, e2, e3 etc. are known as residuals.
The line that minimises the sum of the squares of the residuals is called the least squares regression line. That is to say, Σe² is a minimum. The line is called the regression line of y on x.
The values for a and b that make the sum of the squares of the residuals a minimum can be calculated using the formulae:

$b = \dfrac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$

where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.
THE LEAST – SQUARES CRITERION
Scatter diagram for the data points in the Table:

Infinitely many lines can fit those four data points.
Two possible lines to fit the data points
Determining how well the data points are fit by (a) Line A and (b) Line B
KEY FACT

Least-Squares Criterion
The least-squares criterion is that the
line that best fits a set of data points is
the one having the smallest possible sum
of squared errors.
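
To make the criterion concrete, the sketch below compares two candidate lines by their sum of squared errors. The four data points and the two lines are hypothetical values chosen for illustration; they are not taken from the table in the slides.

```python
def sum_squared_errors(points, a, b):
    """Sum of squared errors for the line y = a + b*x over the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


# Hypothetical data points and candidate lines.
points = [(1, 1), (1, 2), (2, 2), (4, 6)]
sse_a = sum_squared_errors(points, 0.50, 1.25)   # Line A: y = 0.50 + 1.25x
sse_b = sum_squared_errors(points, -0.25, 1.50)  # Line B: y = -0.25 + 1.50x

print(sse_a, sse_b)
better = "Line B" if sse_b < sse_a else "Line A"
print(better, "fits better under the least-squares criterion")
```

Of the two, the line with the smaller sum of squared errors is preferred; the regression line is the one that makes this sum as small as possible over all lines.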
Regression Line and Regression Equation
Regression line: The line that best fits a
set of data points according to the least-
squares criterion.
Regression equation: The equation of
the regression line.
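The equation of the regression line of y on x is:

$y = a + bx$

where

$b = \dfrac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$

with $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.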
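
A minimal sketch of fitting the regression line of y on x, assuming the standard least-squares results b = S_xy / S_xx and a = ȳ − b·x̄. The data points are hypothetical and the function name `least_squares_line` is an assumption.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y = a + b*x of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx          # gradient
    a = y_bar - b * x_bar    # y-intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data points.
xs = [1, 1, 2, 4]
ys = [1, 2, 2, 6]
a, b = least_squares_line(xs, ys)
print(f"y = {a} + {b}x")
```

Because a is defined as ȳ − b·x̄, the fitted line always passes through the mean point of the data.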
APPLYING AND INTERPRETING THE
REGRESSION EQUATION

A regression line can be used to estimate the value of the dependent variable for any value of the independent variable.
Interpolation is when you estimate
the value of a dependent variable
within the range of the data.
Extrapolation is when you estimate a
value outside the range of the data.
Values estimated by extrapolation
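
The distinction between interpolation and extrapolation can be sketched as a check on the x-value before predicting. The line y = 20 + 3x is the revision example from earlier in the lesson; the observed range of 1 to 10 hours and the function name `predict` are assumptions for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Estimate y = a + b*x and say whether x lies inside the observed data range."""
    estimate = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return estimate, kind


# Line y = 20 + 3x; revision times assumed to have been observed between 1 and 10 hours.
inside = predict(20, 3, 5, x_min=1, x_max=10)
outside = predict(20, 3, 30, x_min=1, x_max=10)
print(inside)
print(outside)
```

The second estimate lies far outside the observed range, so the linear pattern it relies on has not been checked there and the value should be treated with caution.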
COMPLETE WORKSHEET

6.1.5 - Linear Regression WORKSHEET


THE COEFFICIENT OF DETERMINATION (r²)
It is a descriptive measure of the utility of the regression equation for making predictions.
It measures the proportion of the variation in y that is explained by x.

It is the square of Pearson’s product moment correlation coefficient.
Generally expressed as a percentage (%)
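
The two descriptions above (the square of r, and the share of variation in y explained by x) can be compared numerically. This sketch uses hypothetical data and computes r² both as the squared PMCC and as 1 − (sum of squared residuals) / S_yy for the least squares line.

```python
from statistics import mean


def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - SSE/S_yy) for the least squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / (s_xx * s_yy) ** 0.5
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    return r ** 2, 1 - sse / s_yy


# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
r2, explained = r_squared_two_ways(xs, ys)
print(f"{r2:.1%} of the variation in y is explained by x")
```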
EXAMPLE

About 80% of the variation in Blood Pressure is explained by Age.
The other 20% is unexplained and may be due to a variety of factors.
COMPLETE WORKSHEET

6.1.6 - Linear Regression WORKSHEET
