Lecture 9: Simple Linear Regression and Correlation (CSE303)

Correlation and Regression
Topics Covered:
◼ Is there a relationship in bivariate data (between x and y)?
◼ What is the strength of this relationship?
 Pearson's r
◼ Can we describe this relationship and use it to predict y from x?
 Regression
◼ Is the relationship we have described statistically significant?
 t-test
The relationship in bivariate data (between x and y)
◼ Correlation: is there a relationship between 2 variables?
◼ Regression: how well does an independent variable predict the dependent variable?
◼ CORRELATION ≠ CAUSATION
 In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable
Scattergrams

[Three scatterplots of Y against X: positive correlation, negative correlation, no correlation]

Variance vs Covariance
◼ First, a note on your sample:
◼ If you wish to assume that your sample is representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n – 1) in your calculations of variance or covariance.
◼ But if you simply want to describe your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.
Variance vs Covariance
◼ Do two variables change together?

Variance:
• Gives information on the variability of a single variable:

  sx² = Σⁿᵢ₌₁ (xᵢ – x̄)² / (n – 1)

Covariance:
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores:

  cov(x, y) = Σⁿᵢ₌₁ (xᵢ – x̄)(yᵢ – ȳ) / (n – 1)
Covariance

  cov(x, y) = Σⁿᵢ₌₁ (xᵢ – x̄)(yᵢ – ȳ) / (n – 1)

◼ When X and Y increase together: cov(x, y) = pos.
◼ When X increases as Y decreases (or vice versa): cov(x, y) = neg.
◼ When there is no constant relationship: cov(x, y) = 0
Example Covariance

[Scatterplot of the five (x, y) points below]

  x    y    xᵢ – x̄   yᵢ – ȳ   (xᵢ – x̄)(yᵢ – ȳ)
  0    3    –3        0         0
  2    2    –1       –1         1
  3    4     0        1         0
  4    0     1       –3        –3
  6    6     3        3         9
  x̄ = 3   ȳ = 3               Σ = 7

  cov(x, y) = Σ(xᵢ – x̄)(yᵢ – ȳ) / (n – 1) = 7 / 4 = 1.75

What does this number tell us?

A positive covariance indicates that the two variables tend to move in the same
direction, while a negative covariance indicates that they tend to move in opposite
directions.
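
As a concrete check, here is a minimal Python sketch that reproduces the covariance of 1.75 from the example above (the variable names are mine, not from the lecture):

    # Sample covariance of the example data, using the n - 1 denominator
    x = [0, 2, 3, 4, 6]
    y = [3, 2, 4, 0, 6]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Multiply x's error scores by y's error scores, sum, and divide by n - 1
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    print(cov_xy)  # 1.75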
Problem with Covariance:
◼ The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
Example of how the covariance value relies on variance

High variance data                        Low variance data

Subject   x     y    x error × y error    x    y    x error × y error
1         101   100  2500                 54   53   9
2         81    80   900                  53   52   4
3         61    60   100                  52   51   1
4         51    50   0                    51   50   0
5         41    40   100                  50   49   1
6         21    20   900                  49   48   4
7         1     0    2500                 48   47   9
Mean      51    50                        51   50

Sum of x error × y error: 7000            Sum of x error × y error: 28

Covariance: 1166.67                       Covariance: 4.67
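
A short Python check of both datasets (the cov helper function is my own, not from any library):

    # Covariance depends on scale: same perfect relationship, very different values
    def cov(x, y):
        n = len(x)
        x_bar, y_bar = sum(x) / n, sum(y) / n
        return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

    high_x = [101, 81, 61, 51, 41, 21, 1]
    high_y = [100, 80, 60, 50, 40, 20, 0]
    low_x = [54, 53, 52, 51, 50, 49, 48]
    low_y = [53, 52, 51, 50, 49, 48, 47]

    print(round(cov(high_x, high_y), 2))  # 1166.67
    print(round(cov(low_x, low_y), 2))    # 4.67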


Solution: Pearson's r
◼ On its own, the covariance value does not tell us much, because it depends on the spread of the data

▪ Solution: standardise this measure

◼ Pearson's r standardises the covariance value.

◼ Divide the covariance by the product of the standard deviations of X and Y:

  rxy = cov(x, y) / (sx·sy)
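
Continuing the running example in Python, standardising the covariance of 1.75 gives r (statistics is in the standard library; its stdev uses the n – 1 denominator):

    import statistics

    x = [0, 2, 3, 4, 6]
    y = [3, 2, 4, 0, 6]
    n = len(x)

    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

    # Divide by the product of the standard deviations to standardise
    r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))
    print(r)  # ≈ 0.35 for this data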
Pearson's r continued
◼ Sample correlation coefficient:

  r = Σ(xᵢ – x̄)(yᵢ – ȳ) / √[ Σ(xᵢ – x̄)² · Σ(yᵢ – ȳ)² ]

◼ Population correlation coefficient:

  ρ = cov(X, Y) / (σx·σy)

Linear Correlation Coefficient

The sign of r indicates the direction of the linear relationship between the variables; its magnitude indicates the strength.
• If r is near 1, the two variables have a strong positive linear relationship.
• If r is near 0, the two variables have no linear relationship.
• If r is near –1, the two variables have a strong negative linear relationship.
Limitations of r
◼ When r = 1 or r = –1:
 We can predict y from x with certainty
 all data points are on a straight line: y = ax + b
◼ r is actually r̂
 r = true r of the whole population
 r̂ = estimate of r based on the sample data
◼ r is very sensitive to extreme values:

[Scatterplot: a single extreme point can strongly distort r]
Pearson's r Example
Calculate the correlation coefficient of the given data. Here n = 5.

[The data table and the worked solution were presented as figures and are not reproduced here]
Regression
◼ Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.

◼ To do this we need REGRESSION!


Best-fit Line
◼ The aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x

◼ This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals

[Diagram: the fitted line ŷ = ax + b, where a = slope and b = intercept; points are the true values yᵢ, the line gives the predicted values ŷ, and ε marks the residual error between them]
Least Squares Regression
◼ To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line)

  Model line: ŷ = ax + b        a = slope, b = intercept
  Residual (ε) = y – ŷ
  Sum of squares of residuals = Σ(y – ŷ)²

◼ We must find the values of a and b that minimise Σ(y – ŷ)²
Finding b
◼ First we find the value of b that gives the minimum sum of squares

[Diagram: candidate lines with the same slope but different intercepts b, each with residuals ε]

◼ Trying different values of b is equivalent to shifting the line up and down the scatter plot
Finding a
◼ Now we find the value of a that gives the minimum sum of squares

[Diagram: candidate lines with the same intercept b but different slopes a]

◼ Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
Minimising sums of squares
◼ Need to minimise Σ(y – ŷ)²
◼ ŷ = ax + b
◼ So we need to minimise the sum of squares (S):

  S = Σ(y – ax – b)²

◼ If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term

[Plot: S against values of a and b; the minimum S is where the gradient = 0]

◼ So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.
The maths bit
◼ The minimum sum of squares is at the bottom of the curve, where the gradient = 0

◼ So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y – ax – b)² with respect to a and b separately

◼ Then we set these derivatives to zero and solve for the values of a and b that give the minimum sum of squares
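
Filling in this step explicitly (a standard derivation; the slides jump straight to the result):

  ∂S/∂b = –2 Σ(y – ax – b) = 0   →   b = ȳ – a·x̄
  ∂S/∂a = –2 Σ x(y – ax – b) = 0   →   a = Σ(xᵢ – x̄)(yᵢ – ȳ) / Σ(xᵢ – x̄)²

Since Σ(xᵢ – x̄)(yᵢ – ȳ)/(n – 1) = cov(x, y) = r·sx·sy, this is the same as the a = r·sy/sx form given on the next slide.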
The solution
◼ Doing this gives the following equation for a:

  a = r·sy / sx        r = correlation coefficient of x and y
                       sy = standard deviation of y
                       sx = standard deviation of x

◼ From this you can see that:
▪ A low correlation coefficient gives a flatter slope (small value of a)
▪ A large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a)
▪ A large spread of x, i.e. a high standard deviation, results in a flatter slope (low value of a)
The solution cont.
◼ Our model equation is ŷ = ax + b
◼ This line must pass through the mean, so:

  ȳ = a·x̄ + b   →   b = ȳ – a·x̄

◼ We can put our equation for a into this, giving:

  b = ȳ – (r·sy / sx)·x̄       r = correlation coefficient of x and y
                              sy = standard deviation of y
                              sx = standard deviation of x

◼ The smaller the correlation, the closer the intercept is to the mean of y
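
A minimal Python sketch of these two formulas, again on the running example data (fit_line is my own helper name, not from the lecture):

    import statistics

    def fit_line(x, y):
        # Least-squares fit: slope a = r * sy / sx, intercept b = y_bar - a * x_bar
        n = len(x)
        x_bar, y_bar = statistics.mean(x), statistics.mean(y)
        sx, sy = statistics.stdev(x), statistics.stdev(y)
        cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
        r = cov_xy / (sx * sy)
        a = r * sy / sx           # slope
        b = y_bar - a * x_bar     # intercept: the line passes through (x_bar, y_bar)
        return a, b

    a, b = fit_line([0, 2, 3, 4, 6], [3, 2, 4, 0, 6])
    print(a, b)  # ≈ 0.35 and ≈ 1.95, so the fitted line is y_hat = 0.35x + 1.95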
Back to the model
◼ Substituting a and b into the model:

  ŷ = a·x + b = (r·sy/sx)·x + ȳ – (r·sy/sx)·x̄

  which rearranges to:   ŷ = (r·sy/sx)·(x – x̄) + ȳ

◼ If the correlation is zero, we will simply predict the mean of y for every value of x: our regression line is just a flat straight line at height ȳ

◼ But this isn't very useful.

◼ We can calculate the regression line for any data, but the important question is how well this line fits the data, i.e. how good it is at predicting y from x
How good is our model?
◼ Total variance of y:

  sy² = Σ(y – ȳ)² / (n – 1) = SSy / dfy

◼ Variance of the predicted y values (ŷ):

  sŷ² = Σ(ŷ – ȳ)² / (n – 1) = SSpred / dfŷ

  This is the variance explained by our regression model.

◼ Error variance:

  serror² = Σ(y – ŷ)² / (n – 2) = SSer / dfer

  This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model.
How good is our model? cont.
◼ Total variance = predicted variance + error variance

  sy² = sŷ² + ser²

◼ Conveniently, via some complicated rearranging:

  sŷ² = r²·sy²      so      r² = sŷ² / sy²

◼ So r² is the proportion of the variance in y that is explained by our regression model
How good is our model? cont.
◼ Insert r²·sy² into sy² = sŷ² + ser² and rearrange to get:

  ser² = sy² – r²·sy² = sy²·(1 – r²)

◼ From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction
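
A quick numerical check of this decomposition on the running example, in terms of the sums of squares from which the variances above are built (the exact identity SSy = SSpred + SSer is what underlies the variance version on the slide):

    import statistics

    x = [0, 2, 3, 4, 6]
    y = [3, 2, 4, 0, 6]
    y_bar = statistics.mean(y)

    # Fitted line from the earlier sketch: y_hat = 0.35x + 1.95
    a, b = 0.35, 1.95
    y_hat = [a * xi + b for xi in x]

    ss_total = sum((yi - y_bar) ** 2 for yi in y)               # SSy
    ss_pred = sum((yh - y_bar) ** 2 for yh in y_hat)            # SSpred
    ss_err = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # SSer

    print(round(ss_total, 2), round(ss_pred + ss_err, 2))  # 20.0 and 20.0
    print(round(ss_pred / ss_total, 4))  # 0.1225 = r**2, since r was 0.35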
Is the model significant?
◼ i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean?

◼ F-statistic (after some complicated rearranging):

  F(dfŷ, dfer) = sŷ² / ser² = … = r²·(n – 2) / (1 – r²)

◼ And it follows that (because F = t²):

  t(n–2) = r·√(n – 2) / √(1 – r²)

  So all we need to know are r and n.
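
A sketch of this test in Python (assuming SciPy is available for the t distribution; r and n are taken from the running example, where the relationship is clearly not significant):

    from math import sqrt
    from scipy import stats

    r, n = 0.35, 5
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t with n - 2 degrees of freedom

    # Two-tailed p-value from the t distribution
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(t, p)  # t ≈ 0.65, p ≈ 0.56: no significant linear relationship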
General Linear Model
◼ Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept:

  y = ax + b + ε

◼ A General Linear Model is just any model that describes the data in terms of a straight line
Multiple regression
◼ Multiple regression is used to determine the effect of a number of independent variables, x₁, x₂, x₃ etc., on a single dependent variable, y
◼ The different x variables are combined in a linear way and each has its own regression coefficient:

  y = a₁x₁ + a₂x₂ + … + aₙxₙ + b + ε

◼ The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y.
◼ i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for
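
A minimal NumPy sketch of fitting such a model by ordinary least squares (the data here is made up purely for illustration, constructed so that y = 2x₁ + 1x₂ + 1):

    import numpy as np

    # Two independent variables and one dependent variable (hypothetical data)
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
    y = np.array([5.0, 6.0, 11.0, 12.0, 17.0])

    # Design matrix: one column per x variable, plus a column of ones for the intercept b
    X = np.column_stack([x1, x2, np.ones_like(x1)])

    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    a1, a2, b = coeffs
    print(a1, a2, b)  # ≈ 2.0, 1.0, 1.0 for this constructed data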
Calculate the regression coefficient and obtain the lines of regression for the following data.

Solution:
[The data table (X, Y, XY columns) and the worked computation of the regression coefficient and regression equation of X on Y were presented as figures and are not reproduced here]
Calculate the regression coefficient and obtain the lines of regression for the following data.

Solution:
[The data table (X, Y, XY columns) and the worked computation of the regression coefficient of Y on X were presented as figures and are not reproduced here]

Regression equation of Y on X:

  Y = 0.929X – 3.716 + 11
    = 0.929X + 7.284

The regression equation of Y on X is Y = 0.929X + 7.284.
Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations from the actual means of X and Y.

Estimate the likely demand when the price is Rs. 20.

Solution:
[The data table and the worked derivations of the regression equation of X on Y and the regression equation of Y on X were presented as figures and are not reproduced here; the regression of Y on X comes out as Y = –0.25X + 44.25]

When X is 20, Y will be

  Y = –0.25(20) + 44.25
    = –5 + 44.25
    = 39.25

(When the price is Rs. 20, the likely demand is 39.25.)
The following table shows the sales and advertisement expenditure of a firm.

[The data table was presented as a figure and is not reproduced here]

Coefficient of correlation r = 0.9. Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.

Solution:
[The worked derivation was presented as a figure and is not reproduced here; the regression of X (sales) on Y (expenditure) comes out as X = 6Y + 4]

When advertisement expenditure is 10 crores, i.e. Y = 10, then sales X = 6(10) + 4 = 64, which implies the likely sales figure is 64.
The two regression lines are 3X+2Y=26 and 6X+3Y=31. Find the correlation coefficient.
Solution:
Let the regression equation of Y on X be 3X+2Y = 26
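The remainder of the worked solution appeared as a figure; reconstructing the standard steps:

  From 3X + 2Y = 26:  Y = 13 – (3/2)X,  so b_yx = –3/2
  The other line, 6X + 3Y = 31, is then X on Y:  X = 31/6 – (1/2)Y,  so b_xy = –1/2

  r² = b_yx · b_xy = (–3/2)(–1/2) = 3/4

  Both regression coefficients are negative, so r is negative:
  r = –√(3/4) ≈ –0.866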
In a laboratory experiment on correlation research study the equation of the two regression
lines were found to be 2X–Y+1=0 and 3X–2Y+7=0 . Find the means of X and Y. Also work out
the values of the regression coefficient and correlation between the two variables X and Y.
Solution:
Solving the two regression equations we get mean values of X and Y
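The worked figures appeared as images; reconstructing the standard steps:

  Solving 2X – Y + 1 = 0 and 3X – 2Y + 7 = 0 simultaneously:
  from the first, Y = 2X + 1; substituting into the second,
  3X – 2(2X + 1) + 7 = 0  →  –X + 5 = 0  →  X̄ = 5, Ȳ = 11

  Taking 3X – 2Y + 7 = 0 as the regression of Y on X:
  Y = (3/2)X + 7/2,  so b_yx = 3/2
  and 2X – Y + 1 = 0 as the regression of X on Y:
  X = (Y – 1)/2,  so b_xy = 1/2
  (The opposite assignment would give b_yx · b_xy = 4/3 > 1, which is impossible.)

  r² = b_yx · b_xy = 3/4, and since both coefficients are positive,
  r = +√(3/4) ≈ 0.866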
