Lecture 7
Introduction
When two variables are recorded in pairs (also
called bivariate data), there may or may not
be a relationship between them.
Negative correlation
One variable increases as the other decreases
e.g. (𝑥, 𝑦) = (Hours on game, Sleeping hours )
No correlation
No straight line (linear) pattern
e.g. (𝑥, 𝑦) = (person’s height, their income)
Example
In a study of a city, the population density,
in people/hectare, and the distance from the city centre,
in km, were investigated by picking a number of sample
areas, with the following results.
Solutions
Draw a vertical line through the mean 𝑥 value, 𝑥̅, and
a horizontal line through the mean 𝑦 value, 𝑦̅.
Positive correlation: most points in the 1st and 3rd quadrants with the new axes.
Negative correlation: most points in the 2nd and 4th quadrants with the new axes.
No correlation: points distributed in all four quadrants.
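The quadrant check can be sketched in Python. This is only an illustration: the function name `quadrant_counts` is hypothetical, and the data are borrowed from the studying-hours example that appears later in this lecture.

```python
from collections import Counter

# Data borrowed from the studying-hours example later in this lecture.
hours = [0, 4, 6, 8, 10]
marks = [10, 16, 30, 37, 47]

def quadrant_counts(xs, ys):
    """Count points in each quadrant of new axes drawn through (mean x, mean y)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        if dx == 0 or dy == 0:
            counts["on an axis"] += 1   # point lies on one of the new axes
        elif dx > 0 and dy > 0:
            counts["1st"] += 1
        elif dx < 0 and dy > 0:
            counts["2nd"] += 1
        elif dx < 0 and dy < 0:
            counts["3rd"] += 1
        else:
            counts["4th"] += 1
    return counts

counts = quadrant_counts(hours, marks)
print(dict(counts))
```

For these data every point falls in the 1st or 3rd quadrant, which suggests positive correlation.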
Self-study
The table shows studying hours (in hours) and
exam marks (in percent) of 5 students.
a) [Scatter diagram: exam marks (%) plotted against studying hours]
b) Positive correlation
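$$S_{xx} = \sum (x - \bar{x})^2, \qquad S_{yy} = \sum (y - \bar{y})^2$$
and
$$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$$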
Therefore,
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$
Similarly,
$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$$
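The algebra behind these shortcut forms can be sketched as follows. It assumes the definitions $S_{xx} = \sum (x - \bar{x})^2$ and $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and uses only $\bar{x} = \sum x / n$ and $\bar{y} = \sum y / n$.

```latex
\begin{aligned}
S_{xx} &= \sum (x - \bar{x})^2 \\
       &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
       &= \sum x^2 - \frac{2(\sum x)^2}{n} + \frac{(\sum x)^2}{n} \\
       &= \sum x^2 - \frac{(\sum x)^2}{n} \\[6pt]
S_{xy} &= \sum (x - \bar{x})(y - \bar{y}) \\
       &= \sum xy - \bar{y}\sum x - \bar{x}\sum y + n\bar{x}\bar{y} \\
       &= \sum xy - \frac{\sum x \sum y}{n}
\end{aligned}
```

The result for $S_{yy}$ follows by replacing $x$ with $y$ in the first derivation.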
Solution
in cm*weeks
Example
Studying hours (𝒙) 0 4 6 8 10
Exam marks (𝒚) 10 16 30 37 47
Solution
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 216 - \frac{784}{5} = 59.2 \text{ hours}^2$$
$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 4834 - \frac{19600}{5} = 914 \text{ percent}^2$$
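These sums can be checked with a few lines of Python. This is a minimal sketch using the shortcut forms only; the helper name `s_xy` is hypothetical and not part of the lecture.

```python
hours = [0, 4, 6, 8, 10]       # x, studying hours
marks = [10, 16, 30, 37, 47]   # y, exam marks (%)

def s_xy(xs, ys):
    """Shortcut form: sum(xy) - sum(x) * sum(y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

s_xx = s_xy(hours, hours)   # pairing x with itself gives S_xx
s_yy = s_xy(marks, marks)   # pairing y with itself gives S_yy
s_xy_value = s_xy(hours, marks)

print(round(s_xx, 1), round(s_yy, 1), round(s_xy_value, 1))
```

Passing the same list twice reuses one function for all three sums, since $S_{xx}$ is just $S_{xy}$ with $y$ replaced by $x$.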
Covariance
Covariance provides a measure of the strength of the
correlation between two or more sets of random variables.
$$\text{Covariance } \sigma_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{S_{xy}}{n}$$
Covariance has units!
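$$\begin{aligned}
\sigma_{xy} &= \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{S_{xy}}{n} \\
            &= \frac{1}{n}\left(\sum xy - \frac{\sum x \sum y}{n}\right) \\
            &= \frac{\sum xy}{n} - \frac{\sum x}{n}\cdot\frac{\sum y}{n} \\
            &= \frac{\sum xy}{n} - \bar{x}\,\bar{y}
\end{aligned}$$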
Studying hours (𝒙) 0 4 6 8 10
Exam marks (𝒚) 10 16 30 37 47
Find the covariance for the studying hours and exam marks.
Solution
$$\bar{x} = \frac{28}{5} = 5.6 \quad\text{and}\quad \bar{y} = 28$$
$$\sum xy = 1010$$
$$\sigma_{xy} = \frac{1010}{5} - 5.6 \times 28 = 45.2 \text{ hours*percent}$$
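The same calculation can be sketched in Python. The function names are hypothetical. Note that this is the divide-by-$n$ covariance used in this lecture; many library routines divide by $n - 1$ by default and would give a different number.

```python
hours = [0, 4, 6, 8, 10]       # x, studying hours
marks = [10, 16, 30, 37, 47]   # y, exam marks (%)

def covariance(xs, ys):
    """Divide-by-n covariance, shortcut form: sum(xy)/n - x_bar * y_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar

def covariance_from_definition(xs, ys):
    """Divide-by-n covariance from the definition: sum((x - x_bar)(y - y_bar)) / n."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

cov_shortcut = covariance(hours, marks)
cov_definition = covariance_from_definition(hours, marks)
print(round(cov_shortcut, 2), round(cov_definition, 2))
```

Both forms agree with the hand calculation, up to floating-point rounding.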
The following table shows the amount of almonds consumed
in grams (g) and exam marks in percent.
Almonds (𝒛)      2  19  25  36  54
Exam marks (𝒚)   1   7   8   9   9
$$\sigma_{zy} = \frac{1145}{5} - 27.2 \times 6.8 = 44.04 \text{ g*percent}$$
Which one has the stronger impact
on exam marks?
Recall: $\sigma_{xy} = 45.2$ and $\sigma_{zy} = 44.04$
To improve your exam marks, should you study for longer
OR eat more almonds?
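Because covariance has units, its size depends on the measurement scale, so two covariances in different units cannot be compared directly. A small sketch, using the almond data from the table above and a hypothetical helper `covariance`: re-expressing the almonds in kilograms instead of grams shrinks the covariance by a factor of 1000, although the data are the same.

```python
almonds_g = [2, 19, 25, 36, 54]   # z, almonds consumed in grams
marks = [1, 7, 8, 9, 9]           # y, exam marks from the almond table

def covariance(xs, ys):
    """Divide-by-n covariance: sum(xy)/n - x_bar * y_bar."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / n - (sum(xs) / n) * (sum(ys) / n)

# Same quantities, different unit: grams converted to kilograms.
almonds_kg = [z / 1000 for z in almonds_g]

cov_g = covariance(almonds_g, marks)     # in g*percent
cov_kg = covariance(almonds_kg, marks)   # in kg*percent
print(round(cov_g, 2), round(cov_kg, 5))
```

A measure of strength that does not change with the choice of units is therefore needed before the two relationships can be compared.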
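We can also use:
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{\dfrac{S_{xy}}{n}}{\sqrt{\dfrac{S_{xx}}{n}\cdot\dfrac{S_{yy}}{n}}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$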
Now, let’s compute the PMCC for (studying hours, exam
marks) and (amount of almonds, exam marks) from the
previous example.
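Studying hours:
$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{226}{\sqrt{59.2 \times 914}} = 0.972$$
Almonds:
$$r_{zy} = \frac{S_{zy}}{\sqrt{S_{zz} S_{yy}}} = \frac{220.2}{\sqrt{1502.8 \times 44.8}} = 0.849$$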
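Both coefficients can be reproduced with a short Python sketch. The function names `s` and `pmcc` are hypothetical; the data are the two tables from the previous examples, each with its own exam marks.

```python
import math

hours = [0, 4, 6, 8, 10]
marks_for_hours = [10, 16, 30, 37, 47]

almonds = [2, 19, 25, 36, 54]
marks_for_almonds = [1, 7, 8, 9, 9]

def s(xs, ys):
    """Shortcut form: sum(xy) - sum(x) * sum(y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    return s(xs, ys) / math.sqrt(s(xs, xs) * s(ys, ys))

r_hours = pmcc(hours, marks_for_hours)
r_almonds = pmcc(almonds, marks_for_almonds)
print(round(r_hours, 3), round(r_almonds, 3))
```

Unlike the covariance, $r$ has no units, so the two values can be compared directly.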
Solution
Positively correlated
The greater the number
of vehicles, the higher the
number of accidents.
Strength of the linear relationship
The value of 𝑟 varies between -1 and 1.
Example
The scatter diagrams show various degrees of correlation.
Example
Variables are often linked only through a third variable.
A common case is two quantities that both change over time.
Example
Over the past 10 years the memory capacity of personal
computers has increased, and so has the average life
expectancy of people in the western world. Is there a
correlation between these two variables?
Regression
$$R = a + bR_m + e$$
where $R$ is the rate of return on a particular stock, $R_m$ is the rate of return on some major stock index, and $e$ is the error term.
Draw the scatter diagram for these data, and draw by eye the
line of best fit through the points.
Example 1: Hand-drawn line of best fit
1. The two variables are
positively correlated.
2. The line is drawn so
that points lie fairly
evenly either side of it.
3. One of the points is
outside the trend and is
ignored.
4. You could find the
equation of this line by
determining the slope and
the intercept from the
graph.
5. The obtained equation
can be used as a model to
describe the relationship
between x and y.
Will this process produce an accurate model? What if you
have thousands of data points? We need a mathematical
formula to calculate the equation of the line of best fit.
Examples of scatter plots with large data
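The line of best fit (the regression line of y on x) has equation $y = a + bx$, where $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$.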
Example 3
The data below shows the load on a lorry, x (in tonnes), and
the fuel efficiency, y (in km per litre).
Summary statistics: $n = 10$, $\sum x = 72.3$, $\sum y = 66.8$, $\sum x^2 = 544.81$, $\sum xy = 465.05$
Solution
• Then
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 465.05 - \frac{72.3 \times 66.8}{10} = -17.914$$
• and
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 544.81 - \frac{72.3^2}{10} = 22.081$$
Can you give a qualitative interpretation of the correlation in the data?
Solution
• So the gradient of the regression line is b, where
$$b = \frac{S_{xy}}{S_{xx}} = \frac{-17.914}{22.081} = -0.811 \text{ (to 3 sig. fig.)}$$
• And the intercept of the regression line is a, where
$$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{66.8}{10} - (-0.811) \times \frac{72.3}{10} = 12.5 \text{ (to 3 sig. fig.)}$$
• The regression line of y on x is: 𝒚 = 𝟏𝟐. 𝟓 − 𝟎. 𝟖𝟏𝟏𝒙
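The gradient and intercept can be reproduced from the summary sums alone. This is a minimal sketch; the variable names are hypothetical, and the five numbers are the ones used in the solution above.

```python
# Summary statistics used in the solution (n = 10 lorry loads).
n = 10
sum_x = 72.3      # total load, tonnes
sum_y = 66.8      # total fuel efficiency, km per litre
sum_x2 = 544.81
sum_xy = 465.05

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

b = s_xy / s_xx                 # gradient of the regression line of y on x
a = sum_y / n - b * sum_x / n   # intercept: a = y_bar - b * x_bar

print(round(s_xy, 3), round(s_xx, 3), round(b, 3), round(a, 1))

# The fitted line passes through the mean point (x_bar, y_bar).
x_bar, y_bar = sum_x / n, sum_y / n
y_at_x_bar = a + b * x_bar
```

The negative gradient says that fuel efficiency falls by roughly 0.8 km per litre for each extra tonne of load.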
Solution
b) Plot your regression line on a scatter diagram
• A regression line always goes through the point $(\bar{x}, \bar{y})$.
$\bar{x} = 7.23$, $\bar{y} = 6.68$ ⇒ the point is (7.23, 6.68)
• By putting x = 0 into the equation, you can see the
line must also go through the point (0, 12.5).
a) Calculate $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and $S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$.
b) Find the equation of the regression line of y on x
Solution
Interpolation and Extrapolation
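You can use a regression line to predict values of your
dependent variable.
Interpolation is predicting within the range of the data; extrapolation is
predicting outside it, and should be treated with caution.
(This is because you don’t have any evidence that the relationship described
by your regression line is true outside the range.)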
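The distinction can be built into a prediction helper. The sketch below is an illustration only: it fits the regression line to the studying-hours data from earlier in the lecture (not to Example 4, whose table is not reproduced here), and the names `fit_line` and `predict` are hypothetical.

```python
hours = [0, 4, 6, 8, 10]       # x, studying hours
marks = [10, 16, 30, 37, 47]   # y, exam marks (%)

def fit_line(xs, ys):
    """Return (a, b) for the regression line y = a + b*x of y on x."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

a, b = fit_line(hours, marks)

def predict(x):
    """Return (estimate, kind); kind flags whether x lies inside the data range."""
    inside = min(hours) <= x <= max(hours)
    kind = "interpolation" if inside else "extrapolation"
    return a + b * x, kind

print(predict(5))    # inside the observed range 0 to 10 hours
print(predict(12))   # outside the range: treat the estimate with caution
```

Returning the label alongside the estimate keeps the warning attached to the number, rather than relying on the reader to remember the range of the data.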
Example 4
The length of a spring (y, in cm) when loaded with different
masses (m, in g) is shown in the table below.
a)
Solution
c) Comment on the reliability of the estimates in part b).