
Lecture 7

Uploaded by

sagynysh.akylbek
Lecture overview:

1. Positive, negative and no correlation


2. Variance and covariance
3. Product moment correlation coefficient
4. Applications of Regression
5. Linear regression model
6. Interpolation and extrapolation
7. Interpretation of linear regression equation

Introduction
When there are two variables in pairs (also
called bivariate data) there may or may not
be a relationship between them.

If two variables are related to any extent, then
changes in the value of one are related to
changes in the value of the other.

In this lecture, we will learn how to measure a
relationship between two variables.
Correlation
Correlation is a statistical measure of the degree to
which two variables (bivariate data) are related.
Positive correlation
Both variables increase together
e.g. (𝑥, 𝑦) = (study hours, exam marks)

Negative correlation
One variable increases as the other decreases
e.g. (𝑥, 𝑦) = (hours spent gaming, hours of sleep)

No correlation
No straight line (linear) pattern
e.g. (𝑥, 𝑦) = (person’s height, their income)
Example
In a study of a city, the population density,
in people/hectare, and the distance from the city centre,
in km, were investigated by picking a number of sample
areas, with the following results.
Solutions
Draw a vertical line through the mean 𝑥 value, $\bar{x}$, and
a horizontal line through the mean 𝑦 value, $\bar{y}$.

Most points in the 1st and 3rd quadrants with the new axes: positive correlation.
Most points in the 2nd and 4th quadrants with the new axes: negative correlation.
Points distributed in all four quadrants: no correlation.
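The quadrant check can also be done by counting points. The following is a minimal Python sketch, not part of the original slides: the helper name `quadrant_counts` is hypothetical, and the data are the studying-hours and exam-marks values used elsewhere in this lecture.

```python
def quadrant_counts(xs, ys):
    """Count points in each quadrant of new axes drawn through (x_bar, y_bar)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        if dx == 0 or dy == 0:
            continue  # point lies on one of the new axes, so it has no quadrant
        if dx > 0 and dy > 0:
            counts[1] += 1
        elif dx < 0 and dy > 0:
            counts[2] += 1
        elif dx < 0 and dy < 0:
            counts[3] += 1
        else:
            counts[4] += 1
    return counts


hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]
counts = quadrant_counts(hours, marks)
print(counts)
```

For these data every point falls in the 1st or 3rd quadrant, which matches the description of positive correlation.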
Self-study
The table shows studying hours (in hours) and
exam marks (in percent) of 5 students.

Studying hours (𝒙) 0 4 6 8 10


Exam marks (𝒚) 10 16 30 37 47

a) Plot a scatter diagram.


b) Describe the type of correlation.
c) Interpret the correlation.
Solutions

a) [Scatter diagram: exam marks (%) against studying hours (h)]
b) Positive correlation
c) The more hours a student studies, the higher the exam mark tends to be.
Variance ($\sigma^2$) is a measure of how far a set of
(random) numbers is spread out from its mean.
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$$
We define $S_{xx} = \sum (x - \bar{x})(x - \bar{x}) = \sum (x - \bar{x})^2$

Here, 𝑆 stands for SUM.

Similarly, we can define:

$S_{yy} = \sum (y - \bar{y})^2$

and

$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$
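Computational tips
Variance $= \frac{\sum (x - \bar{x})^2}{n} = \frac{S_{xx}}{n}$ can also be computed as $\frac{\sum x^2}{n} - \bar{x}^2$.

Therefore,
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
Similarly,
$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$$

Remark: $S_{xx}$, $S_{yy}$, and $S_{xy}$ have units!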
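The shortcut formula for $S_{xx}$ follows from expanding the square. This is a sketch of the standard expansion, using only the definition $\bar{x} = \sum x / n$:

```latex
\begin{aligned}
S_{xx} = \sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x^2 - n\bar{x}^2
   = \sum x^2 - \frac{(\sum x)^2}{n}
\end{aligned}
```

The same expansion applied to $\sum (x - \bar{x})(y - \bar{y})$ gives the shortcut formula for $S_{xy}$.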


Example
The head circumference in cm (𝑥) and gestation period in
weeks (𝑦) for new-born babies at a certain clinic over a
period of time were as follows.

Find $S_{xy}$ for these data.

Solution
$S_{xy}$ is measured in cm·weeks.
Example
Studying hours (𝒙) 0 4 6 8 10
Exam marks (𝒚) 10 16 30 37 47

Find $S_{xx}$ and $S_{yy}$.

Solution
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 216 - \frac{784}{5} = 59.2 \text{ hours}^2$$
$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 4834 - \frac{19600}{5} = 914 \text{ percent}^2$$
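These two values can be checked numerically. The sketch below is not from the slides: the helper names `s_definition` and `s_shortcut` are hypothetical, and it simply compares the definition $\sum (u - \bar{u})(v - \bar{v})$ with the shortcut $\sum uv - \sum u \sum v / n$ on the studying-hours data.

```python
hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]


def s_definition(us, vs):
    """S_uv from the definition: sum of (u - u_bar)(v - v_bar)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))


def s_shortcut(us, vs):
    """S_uv from the computational form: sum(uv) - sum(u) * sum(v) / n."""
    n = len(us)
    return sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n


s_xx = s_shortcut(hours, hours)  # passing the same list twice gives S_xx
s_yy = s_shortcut(marks, marks)
print(round(s_xx, 4), round(s_yy, 4))
```

Calling either helper with two different lists, for example `s_shortcut(hours, marks)`, gives $S_{xy}$.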
Covariance
Covariance provides a measure of the strength of the
correlation between two or more sets of random variables.

The covariance between the two random variables
𝑥 and 𝑦 is defined as

$$\sigma_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{S_{xy}}{n}$$
Covariance has units!
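$$\begin{aligned}
\sigma_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} &= \frac{S_{xy}}{n} \\
&= \frac{1}{n}\left(\sum xy - \frac{\sum x \sum y}{n}\right) \\
&= \frac{\sum xy}{n} - \frac{\sum x}{n} \cdot \frac{\sum y}{n} \\
&= \frac{\sum xy}{n} - \bar{x}\bar{y}
\end{aligned}$$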
Studying hours (𝒙) 0 4 6 8 10
Exam marks (𝒚) 10 16 30 37 47

Find the covariance for the studying hours and exam marks.
Solution
$\bar{x} = \frac{28}{5} = 5.6$ and $\bar{y} = 28$

$\sum xy = 1010$

$\sigma_{xy} = \frac{1010}{5} - 5.6 \times 28 = 45.2$ hours·percent
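The same covariance can be reproduced in a few lines of Python. This is an illustrative sketch only: the function name `covariance` is hypothetical, and it divides by $n$ (not $n - 1$) to match the definition used in this lecture.

```python
hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]


def covariance(xs, ys):
    """Covariance as sum(xy)/n - x_bar * y_bar, dividing by n."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar


cov_hours_marks = covariance(hours, marks)
print(round(cov_hours_marks, 4))
```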
The following table shows the amount of almonds consumed
in grams (g) and exam marks in percent.
Almonds (𝒛) 2 19 25 36 54
Exam marks (𝒚) 1 7 8 9 9

Find the covariance for the amount of almonds and
exam marks.
Solution
$\bar{z} = 27.2$ and $\bar{y} = 6.8$, $\sum zy = 1145$

$\sigma_{zy} = \frac{1145}{5} - 27.2 \times 6.8 = 44.04$ g·percent
Which one has the stronger impact
on exam marks?
Recall: $\sigma_{xy} = 45.2$ and $\sigma_{zy} = 44.04$
To improve your exam marks, should you study more
OR eat more almonds?

Cannot conclude! $\sigma_{xy}$ and $\sigma_{zy}$ have different units, so they cannot be compared directly.
Product moment correlation coefficient
(PMCC)
The product moment correlation coefficient 𝑟 is defined as
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of 𝑥 and 𝑦, respectively.

Unlike covariance, the product moment correlation
coefficient is unitless!
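One way to see why the units matter is to re-express the same data in a different unit. The sketch below is an illustration rather than part of the lecture: converting studying hours to minutes is an assumption made only for the demonstration, and the helper names `covariance` and `pmcc` are hypothetical.

```python
import math


def covariance(us, vs):
    """Covariance from the definition, dividing by n."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    return sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs)) / n


def pmcc(us, vs):
    """r = sigma_uv / (sigma_u * sigma_v)."""
    sigma_u = math.sqrt(covariance(us, us))  # standard deviation of u
    sigma_v = math.sqrt(covariance(vs, vs))  # standard deviation of v
    return covariance(us, vs) / (sigma_u * sigma_v)


hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]
minutes = [60 * h for h in hours]  # same study times, different unit

cov_hours = covariance(hours, marks)
cov_minutes = covariance(minutes, marks)
r_hours = pmcc(hours, marks)
r_minutes = pmcc(minutes, marks)
print(cov_hours, cov_minutes)  # covariance is scaled by the unit change
print(r_hours, r_minutes)      # r is not
```

The covariance grows by a factor of 60 when hours become minutes, while 𝑟 stays the same, which is why 𝑟 can be compared across data sets and covariance cannot.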
We can also use:

$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{S_{xy}/n}{\sqrt{S_{xx}/n}\,\sqrt{S_{yy}/n}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
Now, let’s compute the PMCC for (studying hours, exam
marks) and (amount of almonds, exam marks) from the
previous example.
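Studying hours: $r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{226}{\sqrt{59.2 \times 914}} = 0.972$

Almonds: $r_{zy} = \frac{S_{zy}}{\sqrt{S_{zz} S_{yy}}} = \frac{220.2}{\sqrt{1502.8 \times 44.8}} = 0.849$

Study more, or eat more almonds?

STUDY MORE!!
A higher 𝑟 (closer to 1) implies a stronger linear relationship between the two
variables.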
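Both coefficients can be recomputed from the raw tables. The following is a small Python sketch under the lecture's own formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$; the helper names `s` and `pmcc` are hypothetical and not from the slides.

```python
import math


def s(us, vs):
    """S_uv = sum(uv) - sum(u) * sum(v) / n."""
    n = len(us)
    return sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n


def pmcc(us, vs):
    """Product moment correlation coefficient r = S_uv / sqrt(S_uu * S_vv)."""
    return s(us, vs) / math.sqrt(s(us, us) * s(vs, vs))


# Studying hours against exam marks
hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]

# Almonds (g) against exam marks, from the almonds table
almonds = [2, 19, 25, 36, 54]
almond_marks = [1, 7, 8, 9, 9]

r_hours = pmcc(hours, marks)
r_almonds = pmcc(almonds, almond_marks)
print(round(r_hours, 3), round(r_almonds, 3))
```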
Example
The number of vehicles, 𝑥 millions, and the number of
accidents 𝑦 thousands in 15 different countries were:

Compute the product moment correlation coefficient for
the number of vehicles and the number of accidents.

Solution

Positively correlated: the greater the number
of vehicles, the higher the number of accidents.
Strength of the linear relationship
The value of 𝑟 varies between -1 and 1.

𝑟 = 1: perfect positive linear correlation. All points fit a straight line with positive gradient.
𝑟 = 0 or close to zero: no linear correlation.
𝑟 = -1: perfect negative linear correlation. All points fit a straight line with negative gradient.
Values of 𝒓 between 0 and 1 – Positive correlation
The closer to 1, the stronger the correlation; the closer to
0, the weaker the correlation.

Values of 𝒓 between -1 and 0 – Negative correlation
The closer to -1, the stronger the correlation; the closer
to 0, the weaker the correlation.
Example
The scatter diagrams show various degrees of correlation.

Match the diagrams with the product moment
correlation coefficients below.
Be careful!
Even if two variables are associated and have a linear
correlation, it does not necessarily mean that a change
in one of the variables causes a change in the other
variable.

That is, you need to understand the context of the
situation rather than depend solely on the numbers, e.g.
the product moment correlation coefficient.

Example

The number of cars on the road has increased, and the
number of DVD recorders bought has decreased. Is
there a correlation between these two variables?

Variables are often linked only through a third variable.
One such case is when both variables change over time.
Example
Over the past 10 years the memory capacity of personal
computers has increased, and so has the average life
expectancy of people in the western world. Is there a
correlation between these two variables?
Regression

Line of best fit

If the points on a scatter diagram follow a linear pattern, a
straight line can be used as a model for the relationship.
We call it the line of best fit.
Introduction

Why do you think lines of best fit are useful?

Answer: you can use them to predict how
one of the quantities will be affected by a
change in the other one.
Introduction
Excel gives you an easy ‘one-click’ option of plotting a
best fit line for your scattered data.

Have you ever thought about how Excel calculates this line?

What is linear regression?
Linear regression is a statistical MODEL that attempts to
show the relationship between two variables with a linear
equation.
[Figure: exam marks (%) against studying hours (h)]

Applications of Regression
Real life application: Analyzing the impact of profit changes
Applications of Regression
Finance Application: Market Model

• One of the most important applications of linear
regression is the market model.
• It is assumed that the rate of return on a stock (R) is linearly
related to the rate of return on the overall market ($R_m$).

$R = a + bR_m + e$

Here R is the rate of return on a particular stock and $R_m$ is the rate of return on some major stock index.

The beta coefficient, b, measures how sensitive the stock’s rate
of return is to changes in the level of the overall market.
Example 1: Hand-drawn line of best fit
A company would like to predict what the sales of their new
shops are going to be. The yearly sales of nine of their existing
shops are known and the sizes of the population of towns in
which each shop is sited are found. The size of the towns, (x) in
1000s, and the sales, (y) in £1000, are given in the table.

Draw the scatter diagram for these data, and draw by eye the
line of best fit through the points.
Example 1: Hand-drawn line of best fit
1. The two variables are
positively correlated.
2. The line is drawn so
that points lie fairly
evenly either side of it.
3. One of the points is
outside the trend and is
ignored.
Example 1: Hand-drawn line of best fit
4. You could find the
equation of this line by
determining the slope and
the intercept from the
graph.
5. The obtained equation
can be used as a model to
describe the relationship
between x and y.
Will this process produce an accurate model? What if you
have thousands of data points? We need a mathematical
formula to calculate the equation of the line of best fit.
Examples of scatter plots with large data
[Figure: scatter plots with many data points]
Positively correlated: where to draw the best fit line?
Is that line a best fit line? Does the model give reliable predictions?
Dependent and independent variables
Linear regression is used to model the relationship between
two variables.

You have to decide which of the variables is the
independent variable, and which is the dependent
variable.
Explanatory and response variables
The independent (or explanatory) variable is one that is set
independently of the other variable. In other words, it’s the
variable you can directly control, or the one that you think
is affecting the other.
This variable is plotted along the x-axis.

The dependent (or response) variable is one whose values are
determined by the values of the independent variable.
It is plotted along the y-axis.
Explanatory and response variables
Can you identify the independent and dependent variables in
Example 1?

‘Town size’ is the independent variable, ‘sales’ is the
dependent variable.
Regression lines
Equation of the best fit line
y = bx + a
The small vertical distance $e_i$ between each
data value and the best fit line
is referred to as a residual.
The residuals show the errors
in the model, i.e. they show
how the real-life observations
differ from what the model
predicts.
Least squares method
Parameters ‘b’ and ‘a’ of the best fit line are estimated by
minimising the sum of the squares of the residuals,
$\sum e_i^2$.

This means: the smaller the sum of squared deviations, the
better the fit of the line to the data.

This method of finding the equation of the best fit line is
referred to as the method of least squares.

We say ‘regression line of y on x’. (y is the dependent
variable, x is the independent variable)
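The least squares idea can be checked numerically. The sketch below assumes the closed-form estimates $b = S_{xy} / S_{xx}$ and $a = \bar{y} - b\bar{x}$ that this lecture uses in its worked solution, applies them to the studying-hours data, and confirms that nudging either parameter increases $\sum e_i^2$. The function names are hypothetical.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of e_i^2 for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b) with b = S_xy / S_xx and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]

a, b = least_squares_line(hours, marks)
best = sum_squared_residuals(hours, marks, a, b)

# Any other line should give a larger sum of squared residuals.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    other = sum_squared_residuals(hours, marks, a + da, b + db)
    print(f"a{da:+.1f}, b{db:+.1f}: {other:.3f} vs best {best:.3f}")
```

This only tests a few nearby lines, so it illustrates the minimum rather than proving it.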
Linear Regression Model
The line of best fit is the end product of regression.

The equation of the regression line of y on x is:
$$y = a + bx$$
where
$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
Example 3
The data below shows the load on a lorry, x (in tonnes), and
the fuel efficiency, y (in km per litre).

a) Find the equation of the regression line of y on x.
b) Plot your regression line on a scatter diagram.
c) Calculate the residuals for:
(i) x = 5.6
(ii) x = 6.3.
Solution
a) First, you need to find $S_{xy}$ and $S_{xx}$.
Start by working out the four summations
$\sum x$, $\sum y$, $\sum x^2$, $\sum xy$.

It's best to draw a table.

$\sum x = 72.3$
$\sum y = 66.8$
$\sum x^2 = 544.81$
$\sum xy = 465.05$
Solution
• Then $S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 465.05 - \frac{72.3 \times 66.8}{10} = -17.914$

• $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 544.81 - \frac{72.3^2}{10} = 22.081$
Can you give a qualitative interpretation of the correlation in the data?
Solution
• So the gradient of the regression line is b, where
$b = \frac{S_{xy}}{S_{xx}} = \frac{-17.914}{22.081} = -0.811$ (to 3 sig. fig.)
• And the intercept of the regression line is a, where:

$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\frac{\sum x}{n} = \frac{66.8}{10} - (-0.811) \times \frac{72.3}{10} = 12.5$ (to 3 sig. fig.)
• The regression line of y on x is: $y = 12.5 - 0.811x$
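The arithmetic in part a) can be reproduced directly from the four summations. This is a minimal sketch using only the totals quoted in the solution ($n = 10$, $\sum x = 72.3$, $\sum y = 66.8$, $\sum x^2 = 544.81$, $\sum xy = 465.05$); the variable names are illustrative.

```python
n = 10
sum_x = 72.3
sum_y = 66.8
sum_x2 = 544.81
sum_xy = 465.05

s_xy = sum_xy - sum_x * sum_y / n  # S_xy
s_xx = sum_x2 - sum_x ** 2 / n     # S_xx

b = s_xy / s_xx                    # gradient
a = sum_y / n - b * sum_x / n      # intercept: y_bar - b * x_bar

print(f"S_xy = {s_xy:.3f}, S_xx = {s_xx:.3f}")
print(f"b = {b:.3f}, a = {a:.1f}")
```

The code keeps the unrounded gradient when computing the intercept; to three significant figures the result agrees with the worked solution.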
Solution
b) Plot your regression line on a scatter diagram
• A regression line always goes through the point $(\bar{x}, \bar{y})$.
$\bar{x} = 7.23$, $\bar{y} = 6.68$ ⇒ the point is (7.23, 6.68)
• By putting x = 0 into the equation, you can see the
line must also go through the point (0, 12.5).

• Draw the regression line through these two points.

• You do not have to use the point $(\bar{x}, \bar{y})$.
• To plot a regression line you can choose any two points on the line.
• It’s a good idea to make the points you are plotting for the
regression line look different from your actual data points.
Solution
c) Calculate the residuals for (i) x = 5.6, (ii) x = 6.3.
The residual = observed y-value − estimated y-value
• (i) observed data point is (5.6, 7.5)
the residual = 7.5 − (12.5 − 0.811 × 5.6) = −0.458 (3 sig. fig.)

• (ii) observed data point is (6.3, 8.8)
the residual = 8.8 − (12.5 − 0.811 × 6.3) = 1.41 (3 sig. fig.)
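The residual calculation is easy to script. The sketch below uses the rounded regression line $y = 12.5 - 0.811x$ from part a) and the two observed points quoted in the solution; the function names `predict` and `residual` are hypothetical.

```python
A = 12.5    # intercept of the fitted line (3 sig. fig.)
B = -0.811  # gradient of the fitted line (3 sig. fig.)


def predict(x):
    """Estimated y-value on the regression line y = A + B*x."""
    return A + B * x


def residual(x_obs, y_obs):
    """Residual = observed y-value - estimated y-value."""
    return y_obs - predict(x_obs)


res_i = residual(5.6, 7.5)
res_ii = residual(6.3, 8.8)
print(round(res_i, 3), round(res_ii, 2))
```

A negative residual means the observed point lies below the regression line, and a positive residual means it lies above.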
Self-study
The results from an experiment in which different masses
were placed on a spring and the resulting length of the
spring measured, are shown below.

a) Calculate $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$.
b) Find the equation of the regression line of y on x
Solution
Interpolation and Extrapolation
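You can use a regression line to predict values of your
dependent variable.

Interpolation: estimating a value within the range of your data. The estimate should be fairly reliable.
Extrapolation: estimating a value outside the range of your data. The estimate needs to be treated with caution.

(This is because you don’t have any evidence that the relationship described
by your regression line is true outside the range.)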
Example 4
The length of a spring (y, in cm) when loaded with different
masses (m, in g) is shown in the table below.

a) Calculate the equation of the regression line of y on m.
b) Use your regression line to estimate the length of the
spring when loaded with a mass of (i) 370 g, (ii) 670 g.
c) Comment on the reliability of the estimates in part b).
Solution

a)
Solution
c) Comment on reliability of estimates in part b).

• m=370 falls within the range of the data, so this is an


interpolation. This means the result should be fairly reliable.
• m=670 falls outside the range of the data, so this is an
extrapolation. This means the regression line may not be
valid, and we need to treat this result with caution.
Interpreting the regression equation
You should be able to explain what the regression equation
means in context of the problem.
Example
In the previous problem we obtained the equation of the
regression line of y on m:

$y = 7.8 + 0.01043m$

Explain what the two constants 7.8 and 0.01043 mean
in this context.
• 7.8 is the length of the spring when m=0, i.e. when
no load is placed on it.
• 0.01043 is the amount by which the spring’s length
increases for every extra 1 g of load.
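The meaning of the two constants can be confirmed with a short calculation. The sketch below assumes the fitted line has the form $y = 7.8 + 0.01043m$, as the interpretation of the two constants implies; the function name `spring_length` is hypothetical.

```python
INTERCEPT = 7.8   # length in cm when no load is placed on the spring
SLOPE = 0.01043   # increase in length (cm) per extra gram of load


def spring_length(mass_g):
    """Estimated spring length in cm for a load of mass_g grams."""
    return INTERCEPT + SLOPE * mass_g


length_no_load = spring_length(0)
increase_per_gram = spring_length(101) - spring_length(100)

# The two estimates asked for in the spring example
length_370 = spring_length(370)
length_670 = spring_length(670)

print(length_no_load, round(increase_per_gram, 5))
print(round(length_370, 2), round(length_670, 2))
```

The code returns a number for any mass, including 670 g, so it cannot tell you whether an estimate is an extrapolation; that judgement still depends on the range of the original data.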
References:
1. Palin A., Park A., Whiteley C., (2012), A-level
mathematics for Edexcel Statistics 1, CGP, UK.
2. Attwood, G., Clegg, A., Dyer, G. and Dyer, J
(2008), Edexcel AS and A-Level Modular
Mathematics series S2, Pearson, Harlow, UK.
3. Lecture notes, Statistics and Math for Life
Sciences courses, NUFYP, Nazarbayev
University.
