Lecture 4

This document provides an overview of linear regression and correlation. It defines key concepts such as the regression line, correlation, R-squared, and hypothesis testing for linear relationships. It discusses estimating the regression line using the least squares method and interpreting the correlation coefficient and R-squared value. It also notes some potential problems with regression, including influential observations, non-linear relationships, inappropriate variable combinations, and extrapolation.

Linear regression and correlation

Amine Hadji

Leiden University

March 1, 2022
Outline

• Regression line for the sample

• Correlation

• Hypothesis testing of linear relationship

• Multiple regression (standard deviation and hypothesis testing)

• R²
[Figure slides: scatter plot; regression line in sample]
Relationship between variables

• Positive association: the two variables tend to increase/decrease together
• Negative association: the two variables tend to go in opposite directions
• Linear relationship: the pattern of the relationship between the variables resembles a straight line
• Outlier: a point in the scatterplot that has an unusual combination of data values
[Figure slides: “Alternative” facts; nonlinearity & outliers]
Regression line in sample
Prediction: Regression line can be used to predict the unknown value of y for any
individual given the individual’s x value.

The formula for the (sample) regression line:

ŷ = b0 + b1 x,
• ŷ : predicted (estimated) y

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for x = 0)

• b1 : slope of the straight line in the sample (i.e. how much ŷ changes for a one-unit increase of x). Its sign determines whether the line is increasing or decreasing.
[Figure slide: regression line in sample]
Residual error
Usually, the predicted value ŷ differs from the observed value y:
Residual / Prediction error: y − ŷ.
Least squares estimation
The residuals for the handspan data:
• For x1 = 71 we have y1 = 23.5 and ŷ1 = b0 + 71b1 . The residual is
e1 = 23.5 − b0 − 71b1 .
• For x2 = 69 we have y2 = 22 and ŷ2 = b0 + 69b1 . The residual is
e2 = 22 − b0 − 69b1 ...

The intercept b0 and slope b1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en² = Σ_{i=1}^n (yi − b0 − b1 xi)².

They are called the least squares estimators.


Least squares estimators

Least squares estimators:


b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,    b0 = ȳ − b1 x̄.

Example: Handspan data


ŷ = −3 + 0.35x
For instance, for height 60 the predicted average handspan is −3 + 0.35 × 60 = 18 cm.
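As an illustration, the least squares formulas above translate directly to code. The sketch below is a minimal example with made-up height/handspan values (not the actual course data); all variable names are our own.

```python
import numpy as np

# Hypothetical height (x) and handspan (y) measurements, not the course data
x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

# Least squares estimators: b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()          # b0 = ȳ - b1·x̄

y_hat = b0 + b1 * x                    # fitted values ŷ
residuals = y - y_hat                  # prediction errors e_i
sse = np.sum(residuals ** 2)           # sum of squared residuals

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```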
Goodness of fit
Correlation
Correlation: measure of the strength and direction of a linear relationship between two
quantitative variables
• strength - how close the points are to a straight line.

• direction - if one variable is increasing/decreasing as the other variable increases

Formula:

r = (1/(n−1)) · Σ_{i=1}^n ((xi − x̄)/sx) · ((yi − ȳ)/sy)

• xi , yi : the x (or y ) measurement for the ith observation.

• x̄, ȳ : the mean of the x (or y ) measurements.

• sx , sy : the standard deviation of the x (or y ) measurements.
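The formula maps directly to code: standardize both variables and average the products. A minimal sketch below, reusing the hypothetical measurements from the earlier snippet.

```python
import numpy as np

# Illustrative data (same hypothetical measurements as above)
x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

# r = 1/(n-1) * sum of products of standardized scores
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

print(f"r = {r:.3f}")                   # agrees with np.corrcoef(x, y)[0, 1]
```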


Correlation - Properties

• The correlation coefficient r is always between −1 and 1.

• Strength is indicated by the magnitude of the correlation.

• Direction is indicated by the sign of the correlation.

• r = 0 means that the best fitting line is a horizontal line.

• r is invariant to scaling of x or y . (e.g. from inch to cm)


[Figure slides: example correlations; strong vs. weak correlation]
Squared correlation

Interpretation:
• r is close to −1 or 1 implies ⇒ r 2 close to 1

• r 2 : proportion of variation of y explained by x.


Formula: r² = 1 − SSE/SSTO, where
• SSTO (Sum of Squares Total): Σ_{i=1}^n (yi − ȳ)².
• SSE (Sum of Squared Errors): Σ_{i=1}^n (yi − ŷi)².
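A quick sketch of this decomposition, again with the illustrative data: r² computed as 1 − SSE/SSTO matches the square of the correlation coefficient.

```python
import numpy as np

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

# Least squares fit (as derived earlier)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssto = np.sum((y - y.mean()) ** 2)   # total variation in y
sse = np.sum((y - y_hat) ** 2)       # variation left unexplained

r2 = 1 - sse / ssto
print(f"r^2 = {r2:.3f}")             # equals np.corrcoef(x, y)[0, 1] ** 2
```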
[Figure slide: squared correlation]
Problems in Regression

• Influential observations: observations with extreme values can have a big impact on correlation
• Inappropriately combining groups: two distinct groups may show misleading results
• Curvilinearity: linear regression for nonlinear data leads to bad predictions
• Extrapolation: no guarantee the linear relationship continues beyond the range of the observed data
[Figure slides: influential observations; non-linearity; inappropriate combination (two examples); extrapolation]
Interpretation of observed correlation

“Correlation does not prove causation.”


• Rule for Concluding Cause and Effect: cause-and-effect relationships can be
inferred from randomized experiments, not from observational studies.
• Confounding variables

• Other explanatory variables


[Figure slide: causation]
Estimation of Standard deviation

Standard deviation in Regression: a measure of the typical difference between the observed values y and the predicted values ŷ. It can be estimated as

s = √(SSE/(n−2)) = √( Σ_{i=1}^n (yi − ŷi)² / (n−2) ).

If we did not know anything about the xi, the standard deviation would be:

s = √( Σ_{i=1}^n (yi − ȳ)² / (n−1) ).
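In code, the two estimates differ only in what is subtracted from y and in the degrees of freedom; a minimal sketch with the same illustrative data:

```python
import numpy as np

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
n = len(y)

s_regression = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # uses the fit, df = n - 2
s_no_x = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))     # ignores x, df = n - 1

print(f"s (regression) = {s_regression:.3f}, s (ignoring x) = {s_no_x:.3f}")
```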
Statistical Significance - Linear Relationship

The regression line of the population is: y = b0 + b1 x.
Question: Is the slope zero, i.e. is there any relationship between the variables?

Hypothesis testing:

H0: b1 = 0,   H1: b1 ≠ 0.

Test statistic:

t = (sample statistic − null value) / standard error = (b1 − 0) / se(b1).

Remark: the p-value can be calculated using the t-distribution table (with degrees of freedom n − 2).
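A sketch of the test on the illustrative data; se(b1) is computed with the standard textbook formula s / √(Σ(xi − x̄)²), which the slides do not spell out.

```python
import numpy as np
from scipy import stats

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

n = len(y)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # regression std deviation
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))          # standard error of the slope

t = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t), df=n - 2)                # two-sided p-value

print(f"t = {t:.3f}, p = {p_value:.4f}")
```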
Multivariate regression

The formula for the multivariate regression line is

ŷ = b0 + b1 x1 + b2 x2 + ... + bp−1 xp−1

• x1 , ..., xp−1 : explanatory variables

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for all xj = 0)

• b1 , ..., bp−1 : slopes corresponding to x1 , ..., xp−1 respectively


Examples

Example 1: Wage dependence on education and experience.

log(wage) = b0 + b1 (education) + b2 (work experience).


Example 2: Connection between behavior variable and GPA.

GPA = b0 + b1 (study hours) + b2 (classes missed) + b3 (work hours).
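In practice the coefficients are fitted from data; below is a minimal sketch using numpy's least squares solver on made-up GPA-style data (all numbers and names are illustrative, not real study results).

```python
import numpy as np

# Illustrative data: study hours, classes missed, work hours -> GPA
X_vars = np.array([
    [20.0, 1.0, 10.0],
    [15.0, 3.0, 20.0],
    [25.0, 0.0,  5.0],
    [10.0, 5.0, 25.0],
    [18.0, 2.0, 12.0],
])
gpa = np.array([3.6, 3.0, 3.9, 2.4, 3.2])

# Prepend a column of ones so the intercept b0 is estimated too
X = np.column_stack([np.ones(len(gpa)), X_vars])

# Least squares solution for b0, b1, b2, b3
b, *_ = np.linalg.lstsq(X, gpa, rcond=None)
print("b0, b1, b2, b3 =", np.round(b, 3))
```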


Omitted variables

Problem:
the effect of an omitted relevant variable can be picked up by other explanatory variables.
Examples:
• work experience & education ⇒ wage

• area density & magnitude of earthquake ⇒ death tolls

• height of father & height of mother ⇒ height of child


Multivariate regression - Assumptions

• No outliers

• The errors ei are normally distributed

ei = yi − (b0 + b1 x1,i + b2 x2,i + ... + bp−1 xp−1,i )

• The errors ei do not depend on the explanatory variables (i.e. homoskedastic)

• The sample should be representative of the population


Sample Regression

The coefficients b0, ..., bp−1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en²

They are called the least squares estimators (LSE).
Hypothesis Testing

Question: Does the explanatory variable xk significantly influence the response?

Hypothesis testing:

H0: bk = 0,   Ha: bk ≠ 0.

Test statistic:

t = (bk − 0) / se(bk).

Remark: the p-value can be calculated using the t-distribution table (with degrees of freedom n − p).
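For multiple regression these t-tests are usually read off software output. A sketch using statsmodels (our choice of tool, not one prescribed by the lecture), reusing the made-up GPA data from above:

```python
import numpy as np
import statsmodels.api as sm

X_vars = np.array([
    [20.0, 1.0, 10.0],
    [15.0, 3.0, 20.0],
    [25.0, 0.0,  5.0],
    [10.0, 5.0, 25.0],
    [18.0, 2.0, 12.0],
])
gpa = np.array([3.6, 3.0, 3.9, 2.4, 3.2])

X = sm.add_constant(X_vars)        # adds the intercept column
model = sm.OLS(gpa, X).fit()       # least squares fit

# t statistic and p-value for each coefficient (H0: bk = 0)
print(model.tvalues)
print(model.pvalues)
```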
Estimating the Standard deviation and R²

s = √( Σ_{i=1}^n (yi − ŷi)² / (n − p) ),

where p is the number of parameters in the multiple regression model.

R² = 1 − SSE/SSTO, where
• SSTO: Σ_{i=1}^n (yi − ȳ)²,
• SSE: Σ_{i=1}^n (yi − ŷi)².
Problems with R²

• the more explanatory variables are added, the more R² increases (adding a variable can never decrease it)
• general phenomenon called overfitting (i.e. explaining the noise)
• Math problem: if the number of observations and the number of explanatory variables are the same ⇒ R² = 1.
Adjusted R²

If p is the number of explanatory variables:

R²_adj = 1 − (1 − R²) · (n−1)/(n−p−1) = R² − (1 − R²) · p/(n−p−1).

• increases only if the additional explanatory variable actually contributes (is not merely uncorrelated with the response)

• difficult to interpret
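A small sketch of the adjustment (illustrative numbers; statsmodels reports the same quantity as rsquared_adj on a fitted model):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n observations and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R² = 0.90 looks strong, but with n = 20 and p = 10 much of it
# may be overfitting; the adjustment penalizes the extra variables.
print(adjusted_r2(0.90, n=20, p=10))   # ≈ 0.789
```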
