Lecture 9 simple-linear-regression-correlation updated
Regression
Topics Covered:
◼ Is there a relationship in bivariate data (between x and y)?
◼ What is the strength of this relationship?
Pearson’s r
◼ Can we describe this relationship and use this to predict y from
x?
Regression
◼ Is the relationship we have described statistically significant?
t-test
The relationship in bivariate data (between x and y)
◼ Correlation: is there a relationship between 2
variables?
◼ Regression: how well does a given independent
variable predict the dependent variable?
◼ CORRELATION ≠ CAUSATION
To infer causality: manipulate the independent
variable and observe the effect on the dependent variable
Scattergrams
[Figure: three scattergram panels, each plotting Y against X]
Variance:
• Gives information on the variability of a
single variable.
$$ s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
Covariance:
• Gives information on the degree to which
two variables vary together.
• Note how similar the covariance is to
variance: the equation simply multiplies x's
error scores by y's error scores as opposed
to squaring x's error scores.
$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
Covariance
$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
◼ When X and Y increase together: cov(x, y) is positive.
◼ When X increases as Y decreases: cov(x, y) is negative.
◼ When there is no consistent relationship: cov(x, y) = 0
Example Covariance
  x     y     x_i − x̄    y_i − ȳ    (x_i − x̄)(y_i − ȳ)
  0     3       −3          0              0
  2     2       −1         −1              1
  3     4        0          1              0
  4     0        1         −3             −3
  6     6        3          3              9
  x̄ = 3       ȳ = 3                      Σ = 7
$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{7}{4} = 1.75 $$
What does this number tell us?
A positive covariance indicates that the two variables tend to move in the same
direction, while a negative covariance indicates that they tend to move in opposite
directions.
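The worked example can be checked with a few lines of Python. This is a minimal sketch using the five (x, y) pairs from the table; the function name `covariance` is chosen here for illustration and is not part of the lecture.

```python
def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator used in the slides."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of the paired deviations from the means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
cov_xy = covariance(x, y)
print(cov_xy)  # 1.75
```

The positive result says that x and y tend to move in the same direction, but its size is hard to interpret on its own.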
Problem with Covariance:
◼ The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
Example of how covariance value
relies on variance
[Figure: two scatter panels, "High variance data" and "Low variance data"]
$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} $$
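To illustrate why dividing by the standard deviations helps, the sketch below rescales the five-point example data by a factor of 10. The covariance changes a hundredfold while r stays the same. The helper names `covariance` and `pearson_r` and the scaling factor are illustrative assumptions.

```python
from statistics import stdev

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the sample standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
x_big = [10 * v for v in x]  # same relationship, larger spread
y_big = [10 * v for v in y]

print(covariance(x, y), covariance(x_big, y_big))  # covariance grows 100-fold
print(pearson_r(x, y), pearson_r(x_big, y_big))    # r is the same up to rounding
```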
Pearson’s R continued
Sample Correlation Coefficient
Population Correlation Coefficient
Linear Correlation Coefficient
The sign of r indicates the direction of the linear relationship between the
variables, and its magnitude indicates the strength.
•If r is near 1, then the two variables have a strong positive linear relationship.
•If r is near 0, then the two variables have little or no linear relationship.
•If r is near -1, then the two variables have a strong negative linear
relationship.
Limitations of r
◼ When r = 1 or r = -1:
We can predict y from x with certainty
all data points are on a straight line: y = ax + b
◼ r is actually r̂
r = true r of whole population
r̂ = estimate of r based on data
Pearson’s R Example
Calculate the correlation coefficient of the given data.
Pearson’s R Example
Solution: X values
Here n = 5
Regression
◼ Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.
ŷ = predicted value
y_i = true value
ε = residual error
Least Squares Regression
◼ To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Model line: ŷ = ax + b (a = slope, b = intercept)
Residual (ε) = y − ŷ
Sum of squares of residuals = Σ (y − ŷ)²
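The regression line passes through the point of means $(\bar{x}, \bar{y})$, so $\bar{y} = a\bar{x} + b$ and $b = \bar{y} - a\bar{x}$
◼ We can put our equation for a, $a = r\,s_y / s_x$, into this giving:
$$ b = \bar{y} - \frac{r\,s_y}{s_x}\,\bar{x} $$
r = correlation coefficient of x and y
s_y = standard deviation of y
s_x = standard deviation of x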
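As a rough illustration, the slope and intercept can be computed directly from r and the two standard deviations. The sketch below reuses the five-point data from the covariance example; the function name `fit_line` is hypothetical.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Least-squares line y = a*x + b, with a = r * s_y / s_x and b = mean_y - a * mean_x."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = cov_xy / (s_x * s_y)
    a = r * s_y / s_x        # slope
    b = mean_y - a * mean_x  # intercept: the line passes through the point of means
    return a, b

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
a, b = fit_line(x, y)
print(round(a, 2), round(b, 2))  # 0.35 1.95
```

For this data the fitted line is roughly ŷ = 0.35x + 1.95.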
◼ We can calculate the regression line for any data, but the important
question is how well this line fits the data, or how good it is at
predicting y from x.
How good is our model?
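◼ Total variance of y:
$$ s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y} $$
$$ r^2 = \frac{s_{\hat{y}}^2}{s_y^2} $$
where $s_{\hat{y}}^2$ is the variance of the predicted values ŷ.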
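A quick numerical check of r² as the ratio of the variance of the predicted values to the variance of y, sketched on the five-point example data. The slope is computed as cov(x, y) / s_x², which is algebraically the same as r·s_y / s_x; all variable names are illustrative.

```python
from statistics import mean, variance

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
n = len(x)
mean_x, mean_y = mean(x), mean(y)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

a = cov_xy / variance(x)            # slope of the least-squares line
b = mean_y - a * mean_x             # intercept
y_hat = [a * xi + b for xi in x]    # predicted values

r = cov_xy / (variance(x) * variance(y)) ** 0.5
explained = variance(y_hat) / variance(y)  # variance of predictions over variance of y
print(round(r ** 2, 4), round(explained, 4))  # both are about 0.1225
```

Here only about 12% of the variance in y is accounted for by the line, so the fit is poor.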
◼ F-statistic (after complicated rearranging):
$$ F_{(df_{\hat{y}},\, df_{er})} = \frac{s_{\hat{y}}^2}{s_{er}^2} = \dots = \frac{r^2 (n - 2)}{1 - r^2} $$
◼ And it follows that (because $F = t^2$):
$$ t_{(n-2)} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$
So all we need to know are r and n.
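Since only r and n are needed, the test statistics are easy to compute. The sketch below uses F = r²(n − 2) / (1 − r²) and t = r√(n − 2) / √(1 − r²), with r = 0.35 and n = 5, the values that follow from the five-point covariance example (cov = 1.75, s_x² = s_y² = 5). The function names are illustrative, and looking up the p-value in a t table with n − 2 degrees of freedom is left out.

```python
import math

def f_statistic(r, n):
    """F with (1, n - 2) degrees of freedom, computed from r and n."""
    return r ** 2 * (n - 2) / (1 - r ** 2)

def t_statistic(r, n):
    """t with n - 2 degrees of freedom, computed from r and n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.35, 5
F = f_statistic(r, n)
t = t_statistic(r, n)
print(round(t, 3), round(F, 3), round(t ** 2, 3))  # t squared matches F
```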
General Linear Model
◼ Linear regression is actually a form of the
General Linear Model where the parameters
are a, the slope of the line, and b, the intercept.
y = ax + b +ε
◼ A General Linear Model is just any model that
describes the data in terms of a straight line
Multiple regression
◼ Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
◼ The different x variables are combined in a linear way and
each has its own regression coefficient:
y = a1x1 + a2x2 + … + anxn + b + ε
Calculate the regression coefficients and obtain the lines of regression for the following data
Solution:
Regression coefficient of X on Y
Regression equation of X on Y
Regression equation of Y on X
Y = 0.929X − 3.716 + 11
= 0.929X + 7.284
The regression equation of Y on X is Y = 0.929X + 7.284
Calculate the two regression equations of X on Y and Y on X from the data given
below, taking deviations from the actual means of X and Y.
Solution:
Regression equation of X on Y Regression Equation of Y on X
Coefficient of correlation r= 0.9. Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.
Solution:
When the advertisement expenditure is Rs. 10 crores, i.e. Y = 10, the regression equation of X on Y
gives sales X = 6(10) + 4 = 64, so the estimated sales figure is 64.
The two regression lines are 3X+2Y=26 and 6X+3Y=31. Find the correlation coefficient.
Solution:
Let the regression equation of Y on X be 3X+2Y = 26
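The remaining steps can be sketched as follows, using the standard result that r² equals the product of the two regression coefficients and that r takes their common sign. The sketch assumes, as above, that 3X + 2Y = 26 is the line of Y on X, so 6X + 3Y = 31 is the line of X on Y; the helper names are illustrative.

```python
import math

def slope_y_on_x(a, b):
    """Line aX + bY = c read as Y on X: Y = (c - aX) / b, so b_yx = -a / b."""
    return -a / b

def slope_x_on_y(a, b):
    """Line aX + bY = c read as X on Y: X = (c - bY) / a, so b_xy = -b / a."""
    return -b / a

b_yx = slope_y_on_x(3, 2)  # from 3X + 2Y = 26
b_xy = slope_x_on_y(6, 3)  # from 6X + 3Y = 31

product = b_yx * b_xy      # equals r squared, so it must lie between 0 and 1
assert 0 <= product <= 1, "swap the roles of the two lines"

r = math.copysign(math.sqrt(product), b_yx)  # r shares the sign of the coefficients
print(b_yx, b_xy, round(r, 3))  # -1.5 -0.5 -0.866
```

Had the roles of the two lines been swapped, the product of the coefficients would be 4/3, which exceeds 1, so that assignment is not admissible.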
In a laboratory experiment on correlation research study, the equations of the two regression
lines were found to be 2X–Y+1=0 and 3X–2Y+7=0. Find the means of X and Y. Also work out
the values of the regression coefficients and the correlation between the two variables X and Y.
Solution:
Solving the two regression equations we get mean values of X and Y
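A minimal sketch of the whole calculation is given below. Both regression lines pass through the point of means, so the means are the intersection of the two lines. To decide which line is Y on X, the sketch tries one assignment and swaps it if the product of the regression coefficients exceeds 1, since that product equals r². The function names and the tuple layout (a, b, c) for aX + bY = c are assumptions made for illustration.

```python
import math

# Each line is written as a*X + b*Y = c.
line1 = (2, -1, -1)   # 2X - Y + 1 = 0
line2 = (3, -2, -7)   # 3X - 2Y + 7 = 0

def intersection(l1, l2):
    """Solve the two linear equations by Cramer's rule."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

def coefficients(y_on_x, x_on_y):
    """Return (b_yx, b_xy) for a given assignment of the two lines."""
    b_yx = -y_on_x[0] / y_on_x[1]
    b_xy = -x_on_y[1] / x_on_y[0]
    return b_yx, b_xy

mean_x, mean_y = intersection(line1, line2)

b_yx, b_xy = coefficients(line1, line2)
if b_yx * b_xy > 1:                      # r squared cannot exceed 1: swap the roles
    b_yx, b_xy = coefficients(line2, line1)

r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(mean_x, mean_y)           # 5.0 11.0
print(b_yx, b_xy, round(r, 3))  # 1.5 0.5 0.866
```

With these equations, 3X − 2Y + 7 = 0 turns out to be the line of Y on X and 2X − Y + 1 = 0 the line of X on Y.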