The document discusses correlation and linear regression, emphasizing the importance of scatter plots in visualizing relationships between variables. It explains the calculation and interpretation of correlation coefficients, the assumptions of linear regression, and the least squares method for estimating regression parameters. Additionally, it covers the coefficient of determination (R²) and its significance in assessing the explanatory power of the regression model.


Lecture 8

Correlation and Linear Regression


Scatter Plots and Correlation
 Before trying to fit any model, it is better to examine the scatter plot of the data
 A scatter plot (or scatter diagram) is used to show the relationship between two variables
 If the scatter plot shows some sort of linear relationship, we can use correlation analysis to measure the strength of the linear relationship between the two variables
o Correlation is concerned only with the strength of the relationship and its direction
o The two variables are treated symmetrically; as a result, no causal effect is implied
Scatter Plot Examples

[Scatter plots illustrating linear relationships, curvilinear relationships, and no relationship at all between x and y]
Correlation Coefficient

 The population correlation coefficient ρ (rho) measures the strength of the association between the variables

 The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Features of ρ and r

 Unit free

 Range between -1 and 1

 The closer to -1, the stronger the negative linear relationship

 The closer to 1, the stronger the positive linear relationship

 The closer to 0, the weaker the linear relationship


Examples of Approximate r Values

[Scatter plots illustrating r = -1, r = -0.6, r = 0, r = +0.3 and r = +1]
Calculating the Correlation Coefficient
Sample correlation coefficient:

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²] = SSxy / √(SSxx SSyy)

or the algebraic equivalent:


r = [n Σxy - Σx Σy] / √{[n Σx² - (Σx)²][n Σy² - (Σy)²]}

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the ‘independent’ variable
y = Value of the ‘dependent’ variable
Example
Weight gained (y)   Diet (x)   Weight gained (y)   Diet (x)

0.4 0.65 0.86 1.1


0.46 0.66 0.89 1.12
0.55 0.63 0.91 1.20
0.56 0.73 0.93 1.32
0.65 0.78 0.96 1.33
0.67 0.76 0.98 1.35
0.78 0.72 1.02 1.42
0.79 0.84 1.04 1.1
0.80 0.87 1.08 1.5
0.83 0.97 1.11 1.3
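As a quick check, a minimal Python sketch (assuming NumPy and SciPy are installed; not part of the original slides) computes the sample correlation for the diet data above. It gives roughly r = 0.90, consistent with the R value reported in the SPSS model summary later in this lecture.

```python
# Minimal sketch: Pearson correlation for the diet example above
# (assumes numpy and scipy are available; not part of the original slides).
import numpy as np
from scipy import stats

diet = np.array([0.65, 0.66, 0.63, 0.73, 0.78, 0.76, 0.72, 0.84, 0.87, 0.97,
                 1.10, 1.12, 1.20, 1.32, 1.33, 1.35, 1.42, 1.10, 1.50, 1.30])
gain = np.array([0.40, 0.46, 0.55, 0.56, 0.65, 0.67, 0.78, 0.79, 0.80, 0.83,
                 0.86, 0.89, 0.91, 0.93, 0.96, 0.98, 1.02, 1.04, 1.08, 1.11])

r, p_value = stats.pearsonr(diet, gain)
print(f"r = {r:.3f}, p = {p_value:.4f}")   # r is roughly 0.90
```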
Calculation Example

[Scatter plot of child height (y) versus child weight (x) for the 8 sample children]

r = [n Σxy - Σx Σy] / √{[n Σx² - (Σx)²][n Σy² - (Σy)²]}
  = [8(3142) - (73)(321)] / √{[8(713) - (73)²][8(14111) - (321)²]}
  = 0.886

r = 0.886 → relatively strong positive linear association between x and y
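The same arithmetic can be scripted; a small sketch in plain Python (the summary sums are taken from the slide above) implements the algebraic formula for r.

```python
import math

def r_from_sums(n, sx, sy, sxy, sxx, syy):
    """Sample correlation from summary sums, using the algebraic formula."""
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    return num / den

# Sums from the child height/weight example: n = 8
print(r_from_sums(8, 73, 321, 3142, 713, 14111))   # ≈ 0.886
```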
SPSS Correlation Output
Analyze / Correlate / Bivariate / Pearson / OK

Correlations
                                        Child weight    Child height
Child weight    Pearson Correlation     1               0.886
                Sig. (2-tailed)                         0.003
                N                       8               8
Child height    Pearson Correlation     0.886           1
                Sig. (2-tailed)         0.003
                N                       8               8

Correlation between Child height and weight


Significance Test for Correlation
 Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
 Test statistic: t = r / √[(1 - r²)/(n - 2)], with n - 2 degrees of freedom

Here, the degrees of freedom are taken to be n - 2 because two parameters (the slope and the intercept) are estimated from the data; any two points can always be joined exactly by a straight line.
Example:
Is there evidence of a linear relationship between child
height and weight at the 0.05 level of significance?

H0: ρ = 0 (No correlation)


H1: ρ ≠ 0 (correlation exists)
α = 0.05, df = 8 - 2 = 6

t = r / √[(1 - r²)/(n - 2)] = 0.886 / √[(1 - 0.886²)/(8 - 2)] = 4.68
Correlation cont…
 Correlation coefficients below 0.35 show only a slight relationship between the variables.

 Correlations between 0.40 and 0.60 may have theoretical and/or practical value, depending on the context.

 Only when a correlation of 0.65 or higher is obtained can one reasonably assume an accurate prediction.

 Correlations over 0.85 indicate a very strong relationship between the variables correlated.
Linear Regression Analysis

 Dependent variable: the variable we wish to explain. In linear regression it is always a continuous variable

 Independent variable: the variable used to explain the dependent variable
Regression Models
 In a linear model the parameters enter linearly, but the predictors do not necessarily have to be linear. For instance, consider the following two functions:

[two example functions shown on the slide]

The first one is linear in the parameters, but the second one is not.
Simple Linear Regression Model

 Only one independent variable, x

 Relationship between x and y is described by a linear function

 Changes in y are assumed to be caused by changes in x
Population Linear Regression
The population regression model:

y = β0 + β1x + ε

where y is the dependent variable, β0 is the population y-intercept, β1 is the population slope coefficient, x is the independent variable, and ε is the random error term (residual). β0 + β1x is the linear component and ε is the random error component.
Linear Regression Assumptions
 The relationship between the two variables, x and y is Linear

 Independent observations

 Error values are Normally distributed for any given value of x

 The probability distribution of the errors has Equal variance

 Fixed independent variables (not random = non-stochastic = given values = deterministic); the only randomness in the values of Y comes from the error term ε

 No autocorrelation of the errors (has some similarities with the 2nd)

 No outlier distortion
Assumptions Viewed Pictorially
LINE (Linear, Independent, Normal and Equal variance) assumptions

[Figure: for each value of X, the errors follow identical normal distributions centered on the regression line μy|x = α + βx, i.e. Y|x ~ N(μy|x, σy|x)]
Population Linear Regression (graphically)

[Figure: the line y = β0 + β1x + ε with intercept β0 and slope β1; the observed value of y for xi differs from the predicted value on the line by the random error εi]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

ŷi = b0 + b1x

where ŷi is the estimated (or predicted) y value, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and x is the independent variable.

The individual random error terms ei have a mean of zero.
Least Squares Criterion

 b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

Σe² = Σ(y - ŷ)² = Σ(y - (b0 + b1x))²
The Least Squares Equation
 After applying calculus (taking derivatives and setting them to zero), we find:

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²  =  [Σxy - (Σx Σy)/n] / [Σx² - (Σx)²/n]

and b0 = ȳ - b1x̄
Interpretation of the Slope and the Intercept

 b0 is the estimated average value of y when the value of x is zero (provided that x = 0 is inside the data range considered)

 b1 is the estimated change in the average value of y as a result of a one-unit change in x
Example: Simple Linear Regression
 A researcher wishes to examine the relationship between the average amount of daily diet taken by a cohort of 20 sample children and the weight gained by them in one month (both measured in kg). The content of the food is the same for all of them.

 Dependent variable (y) = weight gained in one month, measured in kilograms

 Independent variable (x) = average weight of diet taken per day by a child, measured in kilograms
Sample Data for child weight Model
Weight gained (y) Diet (x) Weight gained (y) Diet (x)

0.4 0.65 0.86 1.1


0.46 0.66 0.89 1.12
0.55 0.63 0.91 1.20
0.56 0.73 0.93 1.32
0.65 0.78 0.96 1.33
0.67 0.76 0.98 1.35
0.78 0.72 1.02 1.42
0.79 0.84 1.04 1.1
0.80 0.87 1.08 1.5
0.83 0.97 1.11 1.3
Estimation using the computational formula

From the data we have:
Σx = 20.35, Σy = 16.27, Σxy = 17.58, Σx² = 22.30, n = 20

b1 = [Σxy - (Σx Σy)/n] / [Σx² - (Σx)²/n]
   = (17.58 - 20.35 × 16.27 / 20) / (22.30 - 414.12 / 20)
   = 0.643

b0 = ȳ - b1x̄ = 0.8135 - 0.643 × 1.0175 = 0.160


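For reference, a minimal plain-Python sketch of the same computation, using the sums from this slide:

```python
# Least squares estimates from the computational formula
# (sums taken from the slide; n = 20 children).
n = 20
sx, sy, sxy, sxx = 20.35, 16.27, 17.58, 22.30

b1 = (sxy - sx * sy / n) / (sxx - sx**2 / n)
b0 = sy / n - b1 * sx / n
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")   # b1 ≈ 0.643, b0 ≈ 0.16
```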
Regression Using SPSS
Analyze / Regression / Linear…

Coefficients
                Unstandardized Coefficients     Standardized Coefficients
Model           B         Std. Error            Beta            t        Sig.
(Constant)      0.160     0.077                                 2.065    0.054
foodweight      0.643     0.073                 0.900           8.772    0.000

Weight gained = 0.16 + 0.643 (food weight)


Interpretation of the Intercept, b0

Weight gained = 0.16 + 0.643(food weight)

 Here, no child had 0 kilograms of food per day, so for food amounts within the range observed, 0.16 kg is the portion of the weight gained not explained by food.

 b1 = 0.643 tells us that the weight gained by a child increases, on average, by 0.643 kg for each additional kilogram of food taken per day.
Least Squares Regression Properties

 The sum of the residuals from the least squares regression line is 0, i.e. Σ(y - ŷ) = 0

 The sum of the squared residuals is a minimum, i.e. Σ(y - ŷ)² is minimized

 The simple regression line always passes through the mean of the y variable and the mean of the x variable

 The least squares coefficients are unbiased estimates of β0 and β1
Explained and Unexplained Variation

Total variation is made up of two parts:

SST = SSR + SSE
(Total Sum of Squares = Sum of Squares Regression + Sum of Squares Error)

SST = Σ(y - ȳ)²    SSR = Σ(ŷ - ȳ)²    SSE = Σ(y - ŷ)²

where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value
Explained and Unexplained …

SST = Total Sum of Squares
  Measures the variation of the yi values around their mean ȳ
SSE = Sum of Squared Errors
  Variation attributable to factors other than the relationship between x and y
SSR = Sum of Squares Regression
  Explained variation attributable to the relationship between x and y
Explained and Unexplained … (graphically)

[Figure: for an observation (xi, yi), the total deviation (yi - ȳ) splits into the explained part (ŷi - ȳ) and the unexplained part (yi - ŷi), giving SST = Σ(yi - ȳ)², SSR = Σ(ŷi - ȳ)² and SSE = Σ(yi - ŷi)²]
Coefficient of Determination, R2

 The coefficient of determination is the portion of the


total variation in the dependent variable that is explained
by variation in the independent variable

 The coefficient of determination is also called R-squared


and is denoted as R2

R² = SSR / SST,  where 0 ≤ R² ≤ 1
Coefficient of Determination, R²

Coefficient of determination:

R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares)

Note: In the single independent variable case, the coefficient of determination is

R² = r²

where:
R² = Coefficient of determination
r = Simple correlation coefficient
Coefficient of Determination, R² cont…

 The F-test tests the statistical significance of the regression of the dependent variable on the independent variable: H0: β = 0
 However, the reliability of the regression equation is very commonly measured by the correlation coefficient R.
 Equivalently, one can check the statistical significance of R or R² using the F-test and will reach exactly the same F-value as the test of the model coefficients.
SPSS output
Model Summary
Model   R       R Square   Adjusted R Square
1       0.900   0.810      0.800

R² = SSR / SST = 0.658 / 0.812 = 0.81

81% of the variation in children's weight gain is explained by variation in the weight of the food they took.
ANOVA
Model         Sum of Squares   df   Mean Square   F        Sig.
Regression    0.658            1    0.658         76.948   .000
Residual      0.154            18   0.009
Total         0.812            19
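A small NumPy sketch (not from the slides) showing how SST, SSR, SSE and R² can be computed from observed and fitted values; with the diet example it reproduces the quantities in the tables above.

```python
import numpy as np

def variation_decomposition(y, y_hat):
    """Return SST, SSR, SSE and R-squared for observed y and fitted y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sst = np.sum((y - y.mean()) ** 2)      # total variation
    ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
    sse = np.sum((y - y_hat) ** 2)         # unexplained variation
    return sst, ssr, sse, ssr / sst

# For the diet example (y_hat = 0.16 + 0.643 * x) the slides report
# SST = 0.812, SSR = 0.658, SSE = 0.154, so R^2 = 0.658 / 0.812 ≈ 0.81
# and F = (SSR / 1) / (SSE / 18) ≈ 76.9 with (1, 18) degrees of freedom.
```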
Multiple linear regression

Multiple Linear Regression (MLR) is a statistical


method for estimating the relationship between a
dependent variable and two or more independent
(or predictor) variables.
Multiple Linear Regression

Simply, MLR is a method for studying the


relationship between a dependent variable
and two or more independent variables.
Purposes:
Prediction, explanation, theory building
Adjusting effects of confounders
Operation of MLR
 Uses the ordinary least squares solution (as does simple
linear)

 Describes a line for which the (sum of squared)


differences between the predicted and the actual values
of the dependent variable are at a minimum.

 Function: Ypred = a + b1X1 + b2X2 + … + bnXn


Operation?
 MLR produces a model that identifies the best weighted
combination of independent variables to predict the
dependent (or criterion) variable.

 That means MLR assesses the contribution of the combined variables to changes in the dependent variable.

 It estimates the relative importance of several


hypothesized predictors.
Variations

[Diagram: the total variation in Y splits into the variation predictable from the combination of independent variables and the unpredictable (residual) variation]
Assumptions of the Linear regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model
(no omitted variables)*
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual
variance)
7. No multicollinearity*
8. No autocorrelation of the errors
9. No outlier distortion

(Most of them, except the 4th and 7th, are mentioned in the simple linear regression model assumptions)
Explanation of some of the Assumptions
Specification of the model (no omitted variables)
 In multivariable models (including multiple
regression), to define the best prediction rule of
the outcome variable, one must select some
variables into the final (best explaining) model
from a set of candidate variables.

There are several strategies to identify such a


subset of variables:

Option 1: variable selection based on significance in univariate models (here simple linear regression):
 All variables that show a significant effect in univariate models are included. Usually the significance level is set to 0.15-0.25; a p-value of 0.2, the midpoint of this range, is commonly used.
Option 2: variable selection based on significance in the multivariable model: starting with a multivariable model including all candidate variables, one eliminates non-significant effects one by one until all effects in the model are significant (backward/stepwise/forward selection).
Explanation of some of the Assumptions
 Option 3: The ‘Purposeful Selection’ algorithm . This
variable selection procedure selects variables not only
based on their significance in a multivariable model, but
also if their omission from the multivariable model would
cause the regression coefficients of other variables in the
model to change by more than, say, 20%.
 Option 4: variable selection based on substance matter
knowledge: this is the best way to select variables, as it is
not data-driven and it is therefore considered as yielding
unbiased results.
 Among all 'automatic' selection procedures (options 1-3), the third one is currently state-of-the-art and should be applied.
Explanation of some of the Assumptions
 Multicollinearity prevents proper parameter estimation.

 It may also prevent computation of the parameter estimates

completely if it is serious enough.


 To assess multicollinearity, the correlation matrix of the independent variables, the condition index, or the condition number can be used
 Normality: in the population, the data on the dependent variable are normally distributed for each of the possible combinations of the levels of the X variables
Explanation of some of the Assumptions

 Homoskedasticity: In the population, the variances of the

dependent variable for each of the possible combinations of


the levels of the X variables are equal.

 Linearity: In the population, the relation between the

dependent variable and the independent variable is linear


when all the other independent variables are held constant.

Simple vs. Multiple Regression

Simple regression:
 One dependent variable Y predicted from one independent variable X
 One regression coefficient
 r²: proportion of variation in dependent variable Y predictable from X

Multiple regression:
 One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk)
 One regression coefficient for each independent variable
 R²: proportion of variation in dependent variable Y predictable from the set of independent variables (the X's)
Multiple Coefficient of Determination, R²
 In multiple regression, the corresponding correlation coefficient is called the Multiple Correlation Coefficient
 Since there is more than one independent variable, the multiple correlation coefficient R is the correlation between the observed and the predicted y values, whereas r (simple correlation) is the correlation between x and y
 Unlike the situation for simple correlation, 0 ≤ R ≤ 1, because it is impossible to have a negative correlation between the observed and the least-squares predicted values
 The square of the multiple correlation coefficient is, of course, the corresponding coefficient of determination
Adjusted R square

 R2 will increase when further explanatory variables are


included to the model even if they do not explain
variability in the population.

 The adjusted R² adjusts downward to compensate for the inflation in R² caused by the number of variables added.
Adjusted R square cont…

Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1)

where n = sample size
      p = number of parameters (predictors) in the model

As the sample size increases above 20 cases per variable, the adjustment is less needed (and vice versa).

In linear regression, the sample size must be at least 10 times as many cases as independent variables.
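As a quick check against the earlier SPSS model summary (R² = 0.810, n = 20, one predictor), a short Python computation:

```python
# Adjusted R-squared check using values from the simple regression example.
r2, n, p = 0.810, 20, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # ≈ 0.80, matching the SPSS "Adjusted R Square"
```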
B coefficient

 b coefficient measures the amount of increase or decrease


in the dependent variable for a one-unit difference in the
independent variable, controlling for the other independent
variable(s) in the equation.

 Ideally, the independent variables are uncorrelated.

 Consequently, controlling for one of them will not affect


the relationship between the other independent variable and
the dependent variable
Categorical Independent variables
In order to add categorical independent variables in a
linear regression, the following conditions should be
satisfied:
 If a variable is in the nominal scale, it should not have more
than two categories, e.g. sex

 If a categorical variable is in an ordinal scale, it should divide cases into classes of equal width; e.g. age categorized and coded as 1 for 0-10 years, 2 for 11-20 years, 3 for 21-30 years, 4 for 31-40 years, 5 for 41-50 years, where the numbers represent a quantitative feature
Categorical Independent variables
cont….
 Otherwise, if an independent variable has a nominal scale
with more than two categories, dummy variables must be
prepared and each dummy should be considered as an
independent variable

 A nominal scale variable with two categories has a linear relationship with other variables, and multiple regression assumes linear relationships. However, if it has more than two categories, it could have a non-linear relationship
Categorical Independent variables cont...
 Unlike other models like logistic regression, here in linear
regression we cannot have an overall effect of an
independent variable with more than two categories of
nominal scale. Rather we can determine each dummy’s
effect and put them in the regression equation.

Example:
 Y = α + β1X1 + β2X2 + …+ βkDk + βk+1Dk+1 ; here the kth
variable has a nominal scale with three categories, hence it
will have two dummies, Dk and Dk+1 that must be included in
the model independently.
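A brief sketch of how such dummy coding might be done in practice (assuming pandas; the variable name and categories below are hypothetical, not from the lecture):

```python
import pandas as pd

# Hypothetical nominal variable with three categories (illustrative only).
df = pd.DataFrame({"region": ["urban", "rural", "semi_urban", "urban", "rural"]})

# Create k - 1 = 2 dummy variables, dropping one reference category;
# each dummy then enters the regression as a separate independent variable.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
```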
Intercorrelation or collinearity
 If the two independent variables are uncorrelated,
we can uniquely partition the amount of variance
in Y due to X1 and X2 and bias is avoided.

 Small inter-correlations between the independent


variables will not greatly bias the b coefficients.

 However, large inter-correlations will bias the


b coefficients and for this reason other
mathematical procedures are needed
Intercorrelation cont…
Multicollinearity can cause a number of problems in
multiple regression which can include:
 It severely limits the size of R because the
explanatory variables primarily explain much of the
same variability in the response variable

 Difficult to determine the contribution of each


independent variable as it will be confounded by the
intercorrelation among them.
 The coefficient estimates will be unreliable, as multicollinearity increases the variance of the regression coefficients
Spotting multicollinearity
 Examining the correlation between each pair of explanatory variables may not be a sufficient approach, as there could also be multicollinearity involving more than two variables.

 The most commonly used methods are the Variance Inflation Factors (VIFs) or the tolerances of the explanatory variables. One is the inverse of the other, so the two measures carry exactly the same information:
Spotting multicollinearity cont…
 The "tolerance" of an independent variable is defined as the proportion of variance of the variable in question left unexplained by the regression on the remaining explanatory variables. Smaller values (rule of thumb: less than 0.1) indicate a greater concern of multicollinearity.

 The VIF of an explanatory variable measures the inflation of the variance of the variable's regression coefficient relative to a regression where all the explanatory variables are independent

 Multicollinearity will be a concern if the VIF is greater than 10 (the inverse of the 0.1 tolerance threshold)
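A sketch of how tolerance and VIF could be computed (assuming statsmodels and pandas; `X` here stands for a DataFrame of the explanatory variables, a name chosen for illustration):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """VIF and tolerance for each explanatory variable in X (constant added internally)."""
    exog = sm.add_constant(X)
    rows = []
    for i, name in enumerate(exog.columns):
        if name == "const":
            continue  # skip the intercept column
        vif = variance_inflation_factor(exog.values, i)
        rows.append({"variable": name, "VIF": vif, "tolerance": 1.0 / vif})
    # Rule of thumb: VIF > 10 (equivalently tolerance < 0.1) signals a concern.
    return pd.DataFrame(rows)
```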
Multiple regression

Example: Regress the percentage of fat relative to body weight on age and sex

%fat   age    sex
9.5    23.0   0.0
27.9   23.0   1.0
7.8    27.0   0.0
17.8   27.0   0.0
31.4   39.0   1.0
25.9   41.0   1.0
27.4   45.0   0.0
25.2   49.0   1.0
31.1   50.0   1.0
34.7   53.0   1.0
42.0   53.0   1.0
42.0   54.0   1.0
29.1   54.0   1.0
32.5   56.0   1.0
30.3   57.0   1.0
21.0   57.0   1.0
33.0   58.0   1.0
33.8   58.0   1.0
41.1   60.0   1.0
34.5   61.0   1.0

SPSS results on the next slide!
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .729a   .532       .506                6.5656                       .532              20.440     1     18    .000
2       .794b   .631       .587                5.9986                       .099              4.564      1     17    .047
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age
ANOVA
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   881.128          1    881.128       20.440   .000a
    Residual     775.932          18   43.107
    Total        1657.060         19
2   Regression   1045.346         2    522.673       14.525   .000b
    Residual     611.714          17   35.983
    Total        1657.060         19
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age; c. Dependent Variable: %age of body fat

Coefficients
Standardized 95% Confidence
Unstandardized Coefficients Coefficients Interval for B
Lower Upper
Model B Std. Error Beta t Sig. Bound Bound
1 (Constant) 15.625 3.283 4.760 .000 8.728 22.522
sex 16.594 3.670 .729 4.521 .000 8.883 24.305
2 (Constant) 6.209 5.331 1.165 .260 -5.039 17.457
sex 10.130 4.517 .445 2.243 .039 .600 19.659
age 0.309 .145 .424 2.136 .047 .004 .614
a. Dependent Variable: %age of body fat relative to body
Interpretations
 Keeping all other variables constant, females have a percentage of body fat (relative to body weight) that is 10.13 percentage points higher than males

 All other things being equal, for a one-unit increase in age we would expect a 0.309 increment in the percentage of body fat relative to body weight
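A sketch reproducing model 2 with statsmodels (assuming the 20 observations from the data slide; the array names are my own):

```python
import numpy as np
import statsmodels.api as sm

fat = np.array([9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7,
                42.0, 42.0, 29.1, 32.5, 30.3, 21.0, 33.0, 33.8, 41.1, 34.5])
age = np.array([23, 23, 27, 27, 39, 41, 45, 49, 50, 53,
                53, 54, 54, 56, 57, 57, 58, 58, 60, 61], dtype=float)
sex = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

X = sm.add_constant(np.column_stack([sex, age]))   # intercept, sex, age
model = sm.OLS(fat, X).fit()
print(model.params)    # expected roughly (6.21, 10.13, 0.309), as in the SPSS table
print(model.rsquared)  # roughly 0.63
```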
Thank You
