Lecture 9: Simple Linear Regression and Correlation (CSE303)

Correlation and Regression
Topics Covered:
◼ Is there a relationship in bivariate data (between x and y)?
◼ What is the strength of this relationship?
 ▪ Pearson's r
◼ Can we describe this relationship and use it to predict y from x?
 ▪ Regression
◼ Is the relationship we have described statistically significant?
 ▪ t-test
The relationship between x and y
◼ Correlation: is there a relationship between two variables?
◼ Regression: how well does a certain independent variable predict the dependent variable?
◼ CORRELATION ≠ CAUSATION
 ▪ In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
Scattergrams

[Three scatter plots omitted: positive correlation, negative correlation, no correlation.]


Variance vs Covariance
◼ First, a note on your sample:
◼ If you wish to assume that your sample is representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n − 1) in your calculations of variance or covariance.
◼ But if you simply want to assess your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.
Variance vs Covariance
◼ Do two variables change together?

Variance:
• Gives information on the variability of a single variable.

$$ s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.

$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
Covariance

$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

◼ When X ↑ and Y ↑: cov(x, y) = pos.
◼ When X ↑ and Y ↓: cov(x, y) = neg.
◼ When no constant relationship: cov(x, y) = 0
Example Covariance

x      y      xᵢ − x̄    yᵢ − ȳ    (xᵢ − x̄)(yᵢ − ȳ)
0      3      −3         0          0
2      2      −1        −1          1
3      4       0         1          0
4      0       1        −3         −3
6      6       3         3          9
x̄ = 3  ȳ = 3                       Σ = 7

[Scatter plot of the five (x, y) points omitted.]

$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{7}{4} = 1.75 $$

What does this number tell us?
A positive covariance indicates that the two variables tend to move in the same
direction, while a negative covariance indicates that they tend to move in opposite
directions.
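As a quick illustration, here is a minimal Python sketch (plain Python, no libraries; the helper name `covariance` is our own) that reproduces the 1.75 from the worked example above:

```python
def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, over n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / (n - 1)

# The five (x, y) pairs from the example table above.
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
print(covariance(x, y))  # 1.75, matching cov(x, y) = 7 / 4
```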
Problem with Covariance:
◼ The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
Example of how the covariance value relies on variance

High variance data:
Subject    x      y      x error × y error
1          101    100    2500
2          81     80     900
3          61     60     100
4          51     50     0
5          41     40     100
6          21     20     900
7          1      0      2500
Mean       51     50
Sum of x error × y error: 7000
Covariance: 7000 / 6 = 1166.67

Low variance data:
Subject    x      y      x error × y error
1          54     53     9
2          53     52     4
3          52     51     1
4          51     50     0
5          50     49     1
6          49     48     4
7          48     47     9
Mean       51     50
Sum of x error × y error: 28
Covariance: 28 / 6 = 4.67


Solution: Pearson's r
◼ Covariance gives the direction of the relationship, but its size depends on the spread of the data, so it does not tell us the strength of the relationship.
 ▪ Solution: standardise this measure.
◼ Pearson's r standardises the covariance value.
◼ It divides the covariance by the multiplied standard deviations of X and Y:

$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y} $$
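A minimal Python sketch of this standardisation (the helper `pearson_r` is our own; for the earlier example data, the covariance of 1.75 is divided by √5 · √5 = 5):

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sx * sy)

# Same data as the covariance example: cov = 1.75 and sx = sy = sqrt(5).
print(pearson_r([0, 2, 3, 4, 6], [3, 2, 4, 0, 6]))  # ≈ 0.35
```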
Pearson's r continued
Sample correlation coefficient:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Population correlation coefficient:

$$ \rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} $$

Linear Correlation Coefficient

The sign of r indicates the direction of the linear relationship between the variables, and its magnitude indicates the strength.
• If r is near 1, then the two variables have a strong positive linear relationship.
• If r is near 0, then the two variables have no linear relationship.
• If r is near −1, then the two variables have a strong negative linear relationship.
Limitations of r
◼ When r = 1 or r = −1:
 ▪ We can predict y from x with certainty
 ▪ All data points are on a straight line: y = ax + b
◼ r is actually r̂:
 r = true r of the whole population
 r̂ = estimate of r based on sample data
◼ r is very sensitive to extreme values:

[Scatter plot omitted: a single extreme point dramatically changes r.]
Pearson's r Example
Calculate the correlation coefficient of the given data, where n = 5.
[Data table and worked solution omitted in source.]
Regression
◼ Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.

◼ To do this we need REGRESSION!


Best-fit Line
◼ The aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x.
◼ This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals.

 ŷ = ax + b, where a = slope and b = intercept

[Scatter plot with fitted line omitted. Legend: ŷ = predicted value, yᵢ = true value, ε = residual error.]
Least Squares Regression
◼ To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line).

 Model line: ŷ = ax + b (a = slope, b = intercept)
 Residual (ε) = y − ŷ
 Sum of squares of residuals = Σ(y − ŷ)²

◼ We must find the values of a and b that minimise Σ(y − ŷ)².
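To make the criterion concrete, here is a small sketch that scores two candidate lines on the earlier covariance-example data. The least-squares values (slope 0.35, intercept 1.95, obtained with the formulas derived later in this lecture) give a smaller sum of squared residuals than an arbitrary alternative:

```python
def sum_sq_residuals(x, y, a, b):
    """Sum of squares of residuals, Σ(y - ŷ)², for the line ŷ = a·x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]
print(sum_sq_residuals(x, y, 0.35, 1.95))  # least-squares line: 17.55
print(sum_sq_residuals(x, y, 0.5, 1.5))    # an arbitrary line: 18.0 (worse)
```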
Finding b
◼ First we find the value of b that gives the minimum sum of squares.
◼ Trying different values of b is equivalent to shifting the line up and down the scatter plot.

[Figure omitted: the same sloped line drawn at several intercepts b, each with its residuals ε.]
Finding a
◼ Now we find the value of a that gives the minimum sum of squares.
◼ Trying different values of a is equivalent to changing the slope of the line, while b stays constant.

[Figure omitted: lines of several slopes a drawn with the same intercept b.]
Minimising sums of squares
◼ Need to minimise Σ(y − ŷ)²
◼ ŷ = ax + b
◼ So we need to minimise the sum of squares (S):

 S = Σ(y − ax − b)²

◼ If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term.
◼ So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.

[Plot omitted: S against values of a and b; min S lies where gradient = 0.]
The maths bit
◼ The minimum sum of squares is at the bottom of the curve, where the gradient = 0.
◼ So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y − ax − b)² with respect to a and b separately.
◼ Then we set these derivatives equal to zero and solve them, to give the values of a and b that minimise the sum of squares.
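As a brief sketch of that algebra (the standard derivation, not shown step by step on the slides), setting each partial derivative to zero gives:

```latex
S(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2

% Partial derivative with respect to b:
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0
\quad\Rightarrow\quad b = \bar{y} - a\bar{x}

% Partial derivative with respect to a (after substituting b):
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0
\quad\Rightarrow\quad a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Dividing the numerator and denominator of the last expression by n − 1 gives a = cov(x, y) / s_x², which matches the r·s_y/s_x form on the next slide, since cov(x, y) = r·s_x·s_y.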
The solution
◼ Doing this gives the following equations for a and b:

$$ a = \frac{r s_y}{s_x} $$

 r = correlation coefficient of x and y
 sy = standard deviation of y
 sx = standard deviation of x

◼ From this you can see that:
 ▪ A low correlation coefficient gives a flatter slope (small value of a)
 ▪ A large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a)
 ▪ A large spread of x, i.e. a high standard deviation, results in a flatter slope (low value of a)
The solution cont.
◼ Our model equation is ŷ = ax + b.
◼ This line must pass through the mean, so:

 ȳ = ax̄ + b  ⇒  b = ȳ − ax̄

◼ We can put our equation for a into this, giving:

$$ b = \bar{y} - \frac{r s_y}{s_x} \bar{x} $$

 r = correlation coefficient of x and y
 sy = standard deviation of y
 sx = standard deviation of x

◼ The smaller the correlation, the closer the intercept is to the mean of y.
Back to the model
◼ Substituting a and b back into the model:

$$ \hat{y} = ax + b = \frac{r s_y}{s_x} x + \bar{y} - \frac{r s_y}{s_x} \bar{x} $$

which rearranges to:

$$ \hat{y} = \frac{r s_y}{s_x} (x - \bar{x}) + \bar{y} $$

◼ If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line at the height ȳ.
◼ But this isn't very useful.
◼ We can calculate the regression line for any data, but the important question is how well this line fits the data, or how good it is at predicting y from x.
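A minimal Python sketch of these formulas (the helper `fit_line` is our own; applied to the earlier covariance-example data it gives a = 0.35 and b = 1.95):

```python
import math

def fit_line(x, y):
    """Least-squares line via a = r·sy/sx and b = ȳ - a·x̄."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    r = cov / (sx * sy)
    a = r * sy / sx   # slope
    b = my - a * mx   # intercept: the line passes through (x̄, ȳ)
    return a, b

a, b = fit_line([0, 2, 3, 4, 6], [3, 2, 4, 0, 6])
print(a, b)       # ≈ 0.35 1.95
print(a * 5 + b)  # prediction ŷ at x = 5: ≈ 3.7
```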
How good is our model?
◼ Total variance of y:

$$ s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{SS_y}{df_y} $$

◼ Variance of the predicted y values (ŷ) — this is the variance explained by our regression model:

$$ s_{\hat{y}}^2 = \frac{\sum (\hat{y} - \bar{y})^2}{n - 1} = \frac{SS_{pred}}{df_{\hat{y}}} $$

◼ Error variance — this is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model:

$$ s_{error}^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{SS_{er}}{df_{er}} $$
How good is our model cont.
◼ Total variance = predicted variance + error variance:

 sy² = sŷ² + ser²

◼ Conveniently, via some complicated rearranging:

 sŷ² = r² sy²
 r² = sŷ² / sy²

◼ So r² is the proportion of the variance in y that is explained by our regression model.
How good is our model cont.
◼ Insert r² sy² into sy² = sŷ² + ser² and rearrange to get:

 ser² = sy² − r² sy²
      = sy² (1 − r²)

◼ From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction.
Is the model significant?
◼ i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean?
◼ F-statistic (via some complicated rearranging):

$$ F(df_{\hat{y}}, df_{er}) = \frac{s_{\hat{y}}^2}{s_{er}^2} = \dots = \frac{r^2 (n - 2)}{1 - r^2} $$

◼ And it follows (because F = t²) that:

$$ t_{(n-2)} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

 So all we need to know are r and n.
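A sketch of this test in Python, assuming SciPy is available for the p-value (the function name and the r = 0.9, n = 7 inputs are illustrative, not from the slides):

```python
import math
from scipy import stats

def correlation_t_test(r, n):
    """t-statistic with n - 2 degrees of freedom, computed from r and n alone."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p-value
    return t, p

print(correlation_t_test(0.9, 7))  # t ≈ 4.62, p ≈ 0.006
```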
General Linear Model
◼ Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

 y = ax + b + ε

◼ A General Linear Model is just any model that describes the data in terms of a straight line.
Multiple regression
◼ Multiple regression is used to determine the effect of a number of independent variables, x₁, x₂, x₃, etc., on a single dependent variable, y.
◼ The different x variables are combined in a linear way, and each has its own regression coefficient:

 y = a₁x₁ + a₂x₂ + … + aₙxₙ + b + ε

◼ The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y,
◼ i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
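A minimal sketch of multiple regression using NumPy's least-squares solver (the data here is illustrative, generated to lie close to y = 2x₁ + x₂ + 1, and is not taken from the slides):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y  = np.array([5.1, 5.9, 11.2, 11.8, 17.1])

# Design matrix: one column per predictor, plus a column of ones for the intercept b.
X = np.column_stack([x1, x2, np.ones_like(x1)])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coeffs
print(a1, a2, b)  # close to 2, 1 and 1 respectively
```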
Calculate the regression coefficient and obtain the lines of regression for the following data.
[Data table omitted in source.]

Solution:
Regression coefficient of X on Y and regression equation of X on Y:
[Worked steps omitted in source.]

Regression coefficient of Y on X and regression equation of Y on X:
 Y = 0.929X − 3.716 + 11
   = 0.929X + 7.284
The regression equation of Y on X is Y = 0.929X + 7.284.
Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations from the actual means of X and Y.
[Data table omitted in source.]
Estimate the likely demand when the price is Rs. 20.

Solution:
Regression equation of X on Y: [worked steps omitted in source.]
Regression equation of Y on X: [worked steps omitted in source.]

When X is 20, Y will be
 Y = −0.25(20) + 44.25
   = −5 + 44.25
   = 39.25
(When the price is Rs. 20, the likely demand is 39.25.)
The following table shows the sales and advertisement expenditure of a firm.
[Data table omitted in source.]
The coefficient of correlation is r = 0.9. Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.

Solution:
[Worked steps omitted in source.]
When advertisement expenditure is 10 crores, i.e. Y = 10, then sales X = 6(10) + 4 = 64, which implies that the likely sales figure is 64.
The two regression lines are 3X + 2Y = 26 and 6X + 3Y = 31. Find the correlation coefficient.
Solution:
Let the regression equation of Y on X be 3X + 2Y = 26, i.e. Y = 13 − (3/2)X, so bYX = −3/2.
Then the regression equation of X on Y is 6X + 3Y = 31, i.e. X = 31/6 − (1/2)Y, so bXY = −1/2.
Since r² = bYX × bXY = (−3/2)(−1/2) = 3/4 ≤ 1, this assignment is valid, and because both regression coefficients are negative, r = −√(3/4) = −0.866.
In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X − Y + 1 = 0 and 3X − 2Y + 7 = 0. Find the means of X and Y. Also work out the values of the regression coefficients and the correlation between the two variables X and Y.

Solution:
Both regression lines pass through the point (X̄, Ȳ), so solving the two regression equations simultaneously gives the mean values of X and Y:
 2X − Y + 1 = 0 and 3X − 2Y + 7 = 0  ⇒  X̄ = 5, Ȳ = 11
Taking 3X − 2Y + 7 = 0 as the regression line of Y on X gives Y = (3/2)X + 7/2, so bYX = 3/2; taking 2X − Y + 1 = 0 as the line of X on Y gives X = (1/2)Y − 1/2, so bXY = 1/2. (The opposite assignment would give bYX × bXY = 4/3 > 1, which is impossible.)
Then r² = bYX × bXY = 3/4, and since both regression coefficients are positive, r = +√(3/4) = 0.866.
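As a quick numerical check of this solution (a small sketch using NumPy, simply solving the same pair of simultaneous equations):

```python
import numpy as np

# 2X - Y = -1 and 3X - 2Y = -7, with the constants moved to the right-hand side.
A = np.array([[2.0, -1.0],
              [3.0, -2.0]])
rhs = np.array([-1.0, -7.0])
x_mean, y_mean = np.linalg.solve(A, rhs)
print(x_mean, y_mean)  # 5.0 11.0, matching X̄ = 5 and Ȳ = 11
```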
