Lecture 4

This document provides an overview of linear regression and correlation. It defines key concepts such as the regression line, correlation, R-squared, and hypothesis testing for linear relationships. It discusses estimating the regression line using the least squares method and interpreting the correlation coefficient and R-squared value. It also notes some potential problems with regression, including influential observations, non-linear relationships, inappropriate variable combinations, and extrapolation.

Linear regression and correlation

Amine Hadji

Leiden University

March 1, 2022
Outline

• Regression line for the sample

• Correlation

• Hypothesis testing of linear relationship

• Multiple regression (standard deviation and hypothesis testing)

• R²
[Figure slides: scatter plot; regression line in sample]
Relationship between variables

• Positive association: the two variables tend to increase/decrease together
• Negative association: the two variables tend to go in opposite directions
• Linear relationship: the pattern of the relationship between the variables resembles a straight line
• Outlier: a point in the scatterplot that has an unusual combination of data values
[Figure slides: “Alternative” facts; nonlinearity & outliers]
Regression line in sample
Prediction: Regression line can be used to predict the unknown value of y for any
individual given the individual’s x value.

The formula for the (sample) regression line:

ŷ = b0 + b1 x,
• ŷ : predicted (estimated) y

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for x = 0)

• b1 : slope of the straight line in the sample (i.e. how much ŷ changes for a one-unit increase of x). Its sign determines whether the line is increasing or decreasing.
[Figure slide: regression line in sample]
Residual error
Usually, the predicted value ŷ differs from the observed value y:
Residual / Prediction error: y − ŷ.
Least squares estimation
The residuals for the handspan data:
• For x1 = 71 we have y1 = 23.5 and ŷ1 = b0 + 71b1 . The residual is
e1 = 23.5 − b0 − 71b1 .
• For x2 = 69 we have y2 = 22 and ŷ2 = b0 + 69b1 . The residual is
e2 = 22 − b0 − 69b1 ...

The intercept b0 and slope b1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en² = Σ_{i=1}^n (yi − b0 − b1 xi)².

They are called the least squares estimators.


Least squares estimators

Least squares estimators:


b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,    b0 = ȳ − b1 x̄.

Example: Handspan data


ŷ = −3 + 0.35x
For instance, for height 60 the predicted average handspan is −3 + 0.35 × 60 = 18 cm.
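As an illustration, the least squares formulas above translate directly to code. The sketch below is a minimal example with made-up height/handspan values (not the actual course data); all variable names are our own.

```python
import numpy as np

# Hypothetical height (x) and handspan (y) measurements, not the course data
x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

# Least squares estimators: b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()          # b0 = ȳ - b1·x̄

y_hat = b0 + b1 * x                    # fitted values ŷ
residuals = y - y_hat                  # prediction errors e_i
sse = np.sum(residuals ** 2)           # sum of squared residuals

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```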
Goodness of fit
Correlation
Correlation: measure of the strength and direction of a linear relationship between two
quantitative variables
• strength - how close the points are to a straight line.

• direction - if one variable is increasing/decreasing as the other variable increases

Formula:

r = (1/(n−1)) · Σ_{i=1}^n ((xi − x̄)/sx) · ((yi − ȳ)/sy)

• xi , yi : the x (or y ) measurement for the ith observation.

• x̄, ȳ : the mean of the x (or y ) measurements.

• sx , sy : the standard deviation of the x (or y ) measurements.
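The formula maps directly to code: standardize both variables and average the products. A minimal sketch below, reusing the hypothetical measurements from the earlier snippet.

```python
import numpy as np

# Illustrative data (same hypothetical measurements as above)
x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations

# r = 1/(n-1) * sum of products of standardized scores
r = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

print(f"r = {r:.3f}")                   # agrees with np.corrcoef(x, y)[0, 1]
```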


Correlation - Properties

• The correlation coefficient r is always between −1 and 1.

• Strength is indicated by the magnitude of the correlation.

• Direction is indicated by the sign of the correlation.

• r = 0 means that the best fitting line is a horizontal line.

• r is invariant to scaling of x or y . (e.g. from inch to cm)


[Figure slides: example correlations; strong vs. weak correlation]
Squared correlation

Interpretation:
• r is close to −1 or 1 implies ⇒ r 2 close to 1

• r 2 : proportion of variation of y explained by x.


Formula: r² = 1 − SSE/SSTO, where
• SSTO (Sum of Squares Total): Σ_{i=1}^n (yi − ȳ)².
• SSE (Sum of Squared Errors): Σ_{i=1}^n (yi − ŷi)².
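A quick sketch of this decomposition, again with the illustrative data: r² computed as 1 − SSE/SSTO matches the square of the correlation coefficient.

```python
import numpy as np

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

# Least squares fit (as derived earlier)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssto = np.sum((y - y.mean()) ** 2)   # total variation in y
sse = np.sum((y - y_hat) ** 2)       # variation left unexplained

r2 = 1 - sse / ssto
print(f"r^2 = {r2:.3f}")             # equals np.corrcoef(x, y)[0, 1] ** 2
```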
[Figure slide: squared correlation]
Problems in Regression

• Influential observations: observations with extreme values can have a big impact on correlation
• Inappropriately combining groups: two distinct groups may show misleading results
• Curvilinearity: linear regression for nonlinear data leads to bad predictions
• Extrapolation: no guarantee the linear relationship continues beyond the range of the observed data
[Figure slides: influential observations; non-linearity; inappropriate combination (two examples); extrapolation]
Interpretation of observed correlation

“Correlation does not prove causation.”


• Rule for Concluding Cause and Effect: cause-and-effect relationships can be
inferred from randomized experiments, not from observational studies.
• Confounding variables

• Other explanatory variables


[Figure slide: causation]
Estimation of Standard deviation

Standard deviation in Regression: a measure of the typical difference between the observed values y and the predicted values ŷ. It can be estimated as

s = √(SSE/(n−2)) = √( Σ_{i=1}^n (yi − ŷi)² / (n−2) ).

If we did not know anything about the xi, the standard deviation would be:

s = √( Σ_{i=1}^n (yi − ȳ)² / (n−1) ).
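In code, the two estimates differ only in what is subtracted from y and in the degrees of freedom; a minimal sketch with the same illustrative data:

```python
import numpy as np

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
n = len(y)

s_regression = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # uses the fit, df = n - 2
s_no_x = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))     # ignores x, df = n - 1

print(f"s (regression) = {s_regression:.3f}, s (ignoring x) = {s_no_x:.3f}")
```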
Statistical Significance - Linear Relationship

The regression line of the population is: y = b0 + b1 x.
Question: Is the slope zero, i.e. is there any relationship between the variables?

Hypothesis testing:

H0: b1 = 0,   H1: b1 ≠ 0.

Test statistic:

t = (sample statistic − null value) / standard error = (b1 − 0) / se(b1).

Remark: the p-value can be calculated using the t-distribution table (with degrees of freedom n − 2).
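A sketch of the test on the illustrative data; se(b1) is computed with the standard textbook formula s / √(Σ(xi − x̄)²), which the slides do not spell out.

```python
import numpy as np
from scipy import stats

x = np.array([71.0, 69.0, 64.0, 68.0, 62.0, 66.0])
y = np.array([23.5, 22.0, 19.5, 21.0, 18.5, 20.5])

n = len(y)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # regression std deviation
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))          # standard error of the slope

t = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t), df=n - 2)                # two-sided p-value

print(f"t = {t:.3f}, p = {p_value:.4f}")
```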
Multivariate regression

The formula for the multivariate regression line is

ŷ = b0 + b1 x1 + b2 x2 + ... + bp−1 xp−1

• x1 , ..., xp−1 : explanatory variables

• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for all xj = 0)

• b1 , ..., bp−1 : slopes corresponding to x1 , ..., xp−1 respectively


Examples

Example 1: Wage dependence on education and experience.

log(wage) = b0 + b1 (education) + b2 (work experience).


Example 2: Connection between behavior variable and GPA.

GPA = b0 + b1 (study hours) + b2 (classes missed) + b3 (work hours).
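In practice the coefficients are fitted from data; below is a minimal sketch using numpy's least squares solver on made-up GPA-style data (all numbers and names are illustrative, not real study results).

```python
import numpy as np

# Illustrative data: study hours, classes missed, work hours -> GPA
X_vars = np.array([
    [20.0, 1.0, 10.0],
    [15.0, 3.0, 20.0],
    [25.0, 0.0,  5.0],
    [10.0, 5.0, 25.0],
    [18.0, 2.0, 12.0],
])
gpa = np.array([3.6, 3.0, 3.9, 2.4, 3.2])

# Prepend a column of ones so the intercept b0 is estimated too
X = np.column_stack([np.ones(len(gpa)), X_vars])

# Least squares solution for b0, b1, b2, b3
b, *_ = np.linalg.lstsq(X, gpa, rcond=None)
print("b0, b1, b2, b3 =", np.round(b, 3))
```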


Omitted variables

Problem:
the effect of an omitted relevant variable can be picked up by other explanatory variables.
Examples:
• work experience & education ⇒ wage

• area density & magnitude of earthquake ⇒ death tolls

• height of father & height of mother ⇒ height of child


Multivariate regression - Assumptions

• No outliers

• The errors ei are normally distributed

ei = yi − (b0 + b1 x1,i + b2 x2,i + ... + bp−1 xp−1,i )

• The errors ei do not depend on the explanatory variables (i.e. homoskedastic)

• The sample should be representative of the population


Sample Regression

The coefficients b0, ..., bp−1 are chosen to minimize the sum of the squared residuals

SSE = e1² + e2² + ... + en²

They are called the least squares estimators (LSE).
Hypothesis Testing

Question: Does the explanatory variable xk significantly influence the response?

Hypothesis testing:

H0: bk = 0,   Ha: bk ≠ 0.

Test statistic:

t = (bk − 0) / se(bk).

Remark: the p-value can be calculated using the t-distribution table (with degrees of freedom n − p).
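For multiple regression these t-tests are usually read off software output. A sketch using statsmodels (our choice of tool, not one prescribed by the lecture), reusing the made-up GPA data from above:

```python
import numpy as np
import statsmodels.api as sm

X_vars = np.array([
    [20.0, 1.0, 10.0],
    [15.0, 3.0, 20.0],
    [25.0, 0.0,  5.0],
    [10.0, 5.0, 25.0],
    [18.0, 2.0, 12.0],
])
gpa = np.array([3.6, 3.0, 3.9, 2.4, 3.2])

X = sm.add_constant(X_vars)        # adds the intercept column
model = sm.OLS(gpa, X).fit()       # least squares fit

# t statistic and p-value for each coefficient (H0: bk = 0)
print(model.tvalues)
print(model.pvalues)
```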
Estimating the Standard deviation and R²

s = √( Σ_{i=1}^n (yi − ŷi)² / (n − p) ),

where p is the number of parameters in the multiple regression model.

R² = 1 − SSE/SSTO, where
• SSTO: Σ_{i=1}^n (yi − ȳ)²,
• SSE: Σ_{i=1}^n (yi − ŷi)².
Problems with R²

• the more explanatory variables are added, the more R² increases (adding a variable can never decrease it)
• general phenomenon called overfitting (i.e. explaining the noise)
• Math problem: if the number of observations and the number of explanatory variables are the same ⇒ R² = 1.
Adjusted R²

If p is the number of explanatory variables:

R²_adj = 1 − (1 − R²) · (n−1)/(n−p−1) = R² − (1 − R²) · p/(n−p−1).

• increases only if the additional explanatory variable actually contributes (is not merely uncorrelated with the response)

• difficult to interpret
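A small sketch of the adjustment (illustrative numbers; statsmodels reports the same quantity as rsquared_adj on a fitted model):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n observations and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R² = 0.90 looks strong, but with n = 20 and p = 10 much of it
# may be overfitting; the adjustment penalizes the extra variables.
print(adjusted_r2(0.90, n=20, p=10))   # ≈ 0.789
```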
