
community project

encouraging academics to share statistics support resources


All stcp resources are released under a Creative Commons licence

Statistical Methods
12. Multiple Linear
Regression and Analysis
of Covariance
Based on materials provided by Coventry University and Loughborough University under a National HE STEM Programme Practice Transfer Adopters grant

Peter Samuels, Birmingham City University
Reviewer: Ellen Marshall, University of Sheffield
www.statstutor.ac.uk
Workshop outline
q  Multiple Linear Regression:
Ø Two independent variables
Ø Multicollinearity: VIF and tolerance
Ø More than two independent variables:
o  Direct variable entry method
o  Backwards regression method
Ø Robustness
q  Analysis of Covariance

Please note
q  This workshop assumes knowledge of simple
linear regression – see Workshop 11
q  Some disciplines have a different culture in
applying multiple linear regression without
assumption checking – please seek guidance
from your faculty
q  Most people want to look for significance for
deciding which variables to include in the
model, not for the purpose of prediction

Multiple Linear Regression with
two independent variables
Model:
y = b0 + b1x + b2z
Where:
q  y is the dependent variable
q  x and z are the independent variables
q  b0 is the intercept coefficient
q  b1 and b2 are the slope coefficients
Goal: To minimise the sum of the squares of the
errors
http://cast.massey.ac.nz/core/index.html?book=general, Section 6.3.7
Least squares estimation of y against x and z

yi = b0 + b1xi + b2zi + ei

Choose b0, b1 and b2 such that ∑i=1..n ei² is minimised

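The least squares criterion above can be sketched numerically. The workshop itself uses SPSS; this is a minimal Python illustration with synthetic data standing in for the dependent and independent variables, using `numpy.linalg.lstsq` to minimise the sum of squared errors:

```python
import numpy as np

# Synthetic data for illustration only (the workshop uses CatalogueData.xlsx in SPSS)
rng = np.random.default_rng(0)
n = 120
x = rng.uniform(50, 150, n)       # stand-in for one independent variable
z = rng.uniform(1000, 5000, n)    # stand-in for a second independent variable
y = 5.0 + 2.0 * x + 0.3 * z + rng.normal(0, 10, n)  # invented "true" model plus noise

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(n), x, z])

# Least squares: choose b to minimise the sum of squared errors ||y - Xb||^2
b, residual_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [5.0, 2.0, 0.3]
```

With enough data and modest noise, the estimated coefficients land close to the invented true values, which is the sense in which least squares "fits the model to the data".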
Example 1: Monthly sales
figures for women’s clothing
q  120 monthly sales figures for a catalogue-based mail
ordering company from January 1989 to December 1998
q  Independent variables:
Ø  Number of phone lines open for ordering
Ø  Amount spent on print advertising
q  Open Sheet1 of the Excel file CatalogueData.xlsx
associated with this workshop
q  Turn it into an SPSS
data file – for the Date
field, use the data type
“Date” and the format
“dd-mmm-yy”
The regression analysis
process
q  Step 1: Get to know your data
q  Step 2: Formulate a model and check the
assumptions
q  Step 3: Fit the model to the data
q  Step 4: Report, interpret, and use the model

Peter  Samuels   Reviewer:  Ellen  Marshall  


www.statstutor.ac.uk   Birmingham  City  University   University  of  Sheffield  
Step 1A: Scatter plot of Sales against NoPhoneLines
q  Relationship appears to be linear
q  Variance in Sales appears to be constant for different values of NoPhoneLines
q  ‘Cigar shaped’ data set
[Chart edits: decimal places removed; font sizes increased to 10 and 12; minimum axis value changed to 0]

Step 1B: Scatter plot of Sales against PrintAdvertising
q  Relationship appears to be linear
q  Variance in Sales appears to be constant for different values of PrintAdvertising
q  ‘Cigar shaped’ data set

Step 2: Formulate a model

Sales = b0 + b1×NoPhoneLines + b2×PrintAdvertising

Note: A model is always an approximation to the data

Step 2: Assumptions of
Multiple Linear Regression
q  The observations of the dependent variable are
independent, e.g. they are not time or sequence dependent
q  The independent variables are normally distributed or
binary
q  The dependent variable is normally distributed for each
value of each predictor (independent) variable
q  The variability of the outcome variable is the same for each
value of the predictor variable
q  The dependent variable varies linearly as the independent
variables vary
q  All seem to be OK for our data set (as indicated by the
‘cigar shaped’ scatter plots)
Step 3: Fit the model to the data
q  Analyze > Regression > Linear
q  Add Sales as the dependent variable
q  Add NoPhoneLines and PrintAdvertising as the
independent variables
q  Select Statistics… and choose Collinearity diagnostics

q  Adjusted R Square = 0.383 ⇒ model explains 38.3% of the variation in Sales
q  b0 = -21366, b1 = 653.55, b2 = 1.374
q  b0, b1 and b2 all significantly different from 0
q  Tolerance coefficients slightly less than 1, VIF slightly more than 1 (see later)
Step 4: Fitted model
Sales = -21366 + 653.55×NoPhoneLines
+ 1.374×PrintAdvertising
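The fitted equation can be used directly to compute a prediction. A short sketch using the coefficients from the SPSS output above; the input values here are hypothetical, chosen only to illustrate the arithmetic:

```python
# Coefficients from the fitted model above (SPSS output)
b0, b1, b2 = -21366, 653.55, 1.374

def predict_sales(no_phone_lines, print_advertising):
    """Predicted monthly sales from the fitted two-variable model."""
    return b0 + b1 * no_phone_lines + b2 * print_advertising

# Hypothetical inputs, chosen only to illustrate the calculation
print(predict_sales(100, 20000))  # -21366 + 65355 + 27480 ≈ 71469
```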

Multicollinearity
q  Multicollinearity means there are high correlations
between the predictor (independent) variables
q  Thus two or more predictor variables carry similar
information about the outcome variable
q  With multicollinearity there is very high uncertainty for
the regression coefficients
q  Multicollinearity can be assessed informally by looking
at the bivariate correlations between the independent
variables (any r > 0.8 indicates a possible problem)
q  There are two formal measures of multicollinearity:
Ø  Variance Inflation Factor (VIF)
Ø  Tolerance
Variance Inflation Factor
q  Based on fitting predictor variables to other
predictor variables (i.e. NoPhoneLines to
PrintAdvertising for our example) and
calculating R2
q  Values of VIF (Variance Inflation Factor):
Ø  VIF < 5: don’t worry
Ø  5 < VIF < 10: multicollinearity may be a problem, be
cautious
Ø  VIF > 10: multicollinearity is definitely a problem and
will adversely affect results
Source: Myers, R. (1990) Classical and modern
regression with applications. 2nd ed. Boston,
MA: Duxbury
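The VIF calculation described above — regress each predictor on the other predictors, take R², and compute 1/(1−R²) — can be sketched in Python. The data here is synthetic, with `x2` deliberately built to correlate with `x1` so the VIF is visibly inflated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + rng.normal(0, 0.6, n)   # x2 deliberately correlated with x1

def vif(target, others):
    """VIF of one predictor: regress it on the remaining predictors and
    compute 1 / (1 - R^2). Tolerance is the reciprocal, i.e. 1 - R^2."""
    X = np.column_stack([np.ones(len(target))] + others)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(round(vif(x2, [x1]), 2))  # well above 1 because of the built-in correlation
```

With only two predictors the VIF is the same in both directions; with more predictors each variable gets its own VIF from its own auxiliary regression.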
Tolerance
q  The percentage of the variance in a predictor
variable that cannot be explained by the
other predictors
q  In our example the tolerance for each
variable was 0.992, meaning 99.2% of the
variance in both predictor variables cannot
be explained by the other variables
q  Tolerance is the reciprocal of VIF – so you
only need to look at VIF
q  For our example, VIF was 1.009 for both
variables so it was not a problem
Multiple Linear Regression with
>2 independent variables
Model:
    y = b0 + b1x1 + b2x2 + … + bmxm
Where:
q  y is the dependent variable
q  x1, x2, …, xm are the independent variables
q  b0 is the intercept coefficient
q  b1, b2, …, bm are the slope coefficients
Goal: To minimise the sum of the squares of the errors
Rule of thumb: For a sample size of n, use no more than √n independent variables
Issues with >2 variables
q  Much more likely to have multicollinearity, or
variables with non-significant coefficients
q  Impossible to know in advance which variables are
best removed
⇒  Use a systematic variable selection method in
SPSS to determine an adequate model
⇒  Justify any decision you make
q  We recommend backwards removal of predictor
variables:
Ø  All predictor variables initially included
Ø  Least significant variable removed
Ø  Repeat process until least significant variable is below a
threshold (default is 0.1)
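The backwards removal loop above can be sketched outside SPSS. This is a simplified illustration, not SPSS's exact implementation: p-values are computed from an ordinary least squares fit, and the least significant predictor is dropped until all remaining p-values clear the 0.1 threshold. All variable names and data are invented:

```python
import numpy as np
from scipy import stats

def p_values(X, y):
    """Two-sided p-values for each coefficient of an OLS fit of y on X."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - k)              # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(b / se), df=n - k)

def backward_select(X, y, names, threshold=0.1):
    """Backwards removal: repeatedly drop the least significant predictor
    until every remaining p-value is below the threshold (0.1 is the
    SPSS default). The intercept (column 0) is never dropped."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        p = p_values(X[:, cols], y)
        worst = 1 + int(np.argmax(p[1:]))         # position in cols, skipping intercept
        if p[worst] < threshold:
            break
        cols.pop(worst)
    return [names[c] for c in cols]

# Illustration: x3 is pure noise, so it is usually the variable removed
rng = np.random.default_rng(2)
n = 150
x1, x2, x3 = rng.normal(size=(3, n))
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1, x2, x3])
kept = backward_select(X, y, ["intercept", "x1", "x2", "x3"])
print(kept)
```

The genuinely predictive variables survive because their p-values are tiny; a noise variable is usually eliminated in the first pass.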
Example 2: Monthly sales
figures for women’s clothing
q  Independent variables:
Ø  Number of phone lines open for ordering
Ø  Amount spent on print advertising
Ø  Number of catalogues mailed
Ø  Number of pages in catalogue
Ø  Number of customer service representatives
q  Open Sheet2 of the Excel file
Catalogue2Data.xlsx associated with this
workshop
q  Turn it into an SPSS data file
Step 1A: Sales v. NoMailed
q Clear linear
relationship
q May be
heteroscedastic
– variance in
errors seems to
depend on
NoMailed
q This looks like
an outlier –
reason?

Activity
q  Create a new variable called NoMailedGroup by
recoding NoMailed into a new variable
q  Choose a suitable cut-off value
q  NoMailedGroup = 1 below the cut-off value
q  NoMailedGroup = 2 above the cut-off value
q  Run a linear regression of Sales against NoMailed
and choose Unstandardised residuals under
Save…
q  Run an independent samples t-test of the
unstandardised residuals against NoMailedGroup
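The steps of this activity can be sketched in Python rather than SPSS. Synthetic data stands in for the catalogue file, the cut-off of 12,500 is the one the workshop uses, and Levene's test (which SPSS reports alongside the t-test) checks whether the residual variance differs between the two groups:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the catalogue data (the workshop itself uses SPSS)
rng = np.random.default_rng(3)
n = 120
no_mailed = rng.uniform(5000, 15000, n)
sales = 2.0 * no_mailed + rng.normal(0, 3000, n)   # homoscedastic errors by construction

# Simple linear regression of Sales on NoMailed, then save the residuals
X = np.column_stack([np.ones(n), no_mailed])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
resid = sales - X @ b

# Split the residuals at the cut-off (12,500 in the workshop) and compare
group1 = resid[no_mailed < 12500]
group2 = resid[no_mailed >= 12500]
lev_stat, lev_p = stats.levene(group1, group2)     # equality of variances
t_stat, t_p = stats.ttest_ind(group1, group2)      # equality of means
print(lev_p, t_p)  # large p-values are consistent with homoscedasticity
```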

NoMailed cut-off = 12,500
q  Levene’s test for equality of variances returns a non-significant result
q  Only 8 data values in NoMailedGroup 2
q  Probably OK to assume homoscedasticity
q  There is no specific test for heteroscedasticity in SPSS

Step 1B: Sales v. NoPages
q  Seems to be a weak linear relationship
q  Variance seems to be OK

Step 1C: Sales v. NoServiceReps
q  Stronger linear relationship
q  Data seems to be in two distinct groups – month dependent?

Activity
q  Create a new scale variable called Month
q  Enter 1 for a date in January, 2 for a date in
February, etc.
q  Create a scatter
plot of Sales
against Month
q  Clearly lower in
January and
February

q  Recode Month into a
new variable called
MonthGroup with 1 =
Jan/Feb and 2 =
otherwise
q  Add values to
MonthGroup to
represent these
groups
q  Create a grouped
scatter plot of Sales v.
NoServiceReps with
MonthGroup as the
grouping variable

q  Most of the
lower Sales
were from
Jan/Feb
q  We could
add an
additional
variable to
the model
which is 1 for
Jan/Feb and
0 for other
months
q  For the moment the months Jan/Feb are excluded
q  Also see ANCOVA analysis later
Step 2: Formulate a model
For the months March – December:

Sales = b0 + b1×NoPhoneLines + b2×PrintAdvertising + b3×NoMailed + b4×NoPages + b5×NoServiceReps

Step 3: Fit the model
q  Select only the cases with MonthGroup = 2
q  Analyze > Regression > Linear
q  Select Sales as the dependent variable and the other 5 variables as the independent variables
q  Under Statistics… select Collinearity diagnostics
q  Select the method as Backward

q  Only one model
required
q  R2adj = 0.795 ⇒ model
accounts for 79.5% of
the variation in Sales

q All variables included in initial model (backwards method)


q No variable removed because all the probability values < 0.1

Activity
q  Repeat the analysis with all the months included
q  What effect does this have on the models?
q  The analysis now has 2 models
q  R2adj is slightly higher in the second model
q  Both are markedly lower than in the previous analysis
q NoPhoneLines was removed from Model 2 because its
probability value in Model 1 > 0.1 (the absolute value of its
standardised coefficient was low)
q Model 2 has very low probability values for all the variables
Robustness exceptions
q  Homoscedasticity is mandatory – otherwise use a
nonparametric technique
q  Linearity is mandatory – otherwise transform the
independent variable or use another model, e.g.
quadratic or polynomial
q  Normality is “not necessary for the least-squares
fitting of the regression model, but it is required in
general for inference making.” (e.g. calculating the p-
values and confidence intervals of the coefficients)
“…only extreme departures of the distribution of Y
from normality yield spurious results.”
Source: (Kleinbaum et al., 2008: 120)

Application of robustness
exceptions to activity
q  Sales v. NoServiceReps was not normally distributed
⇒ The probability values of the coefficients may not be reliable
q  However, the probability value was not borderline so the
model can still be used
q  Including January & February data just increases the ‘noise’

Example 3: Monthly catalogue
sales figures for jewellery
q  Independent variables:
Ø  Number of phone lines open for ordering
Ø  Amount spent on print advertising
Ø  Number of catalogues mailed
Ø  Number of pages in catalogue
Ø  Number of customer service representatives
q  Open Sheet3 of the Excel file
CatalogueData.xlsx associated with this
workshop
q  Turn it into an SPSS data file
Activity

q  Use JewellerySales as the dependent variable


q  Fit the model with all 5 independent or predictor
variables (direct or ‘enter’ method)
q  Is multicollinearity important?
q  Just re-fitting the model with only the variables
with significant coefficients may not be sufficient
q  Try also the backward variable selection
method

Five independent variable direct model

Three independent variable direct model

Results of backwards regression method
q  R2adj does not necessarily reduce when variables are removed from the model
q  The final model removes NoServiceReps
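The point that R²adj need not fall when a variable is removed follows from its formula, R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1). A small sketch with hypothetical figures (not taken from the jewellery output):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: 1 - (1 - R^2)(n - 1)/(n - p - 1),
    for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical figures: dropping a predictor lowers R^2 only slightly,
# so adjusted R^2 actually increases
n = 108
print(round(adjusted_r2(0.800, n, 5), 4))  # 0.7902 with 5 predictors
print(round(adjusted_r2(0.799, n, 4), 4))  # 0.7912 with 4 predictors
```

The penalty term (n − 1)/(n − p − 1) shrinks as p falls, so a near-useless variable costs more in penalty than it contributes to R².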

Analysis of Covariance
(ANCOVA)
q  Combines ANOVA and Linear Regression:
Ø  ANOVA: Explains outcome for different groups of the
data
Ø  Linear Regression: Explains outcome with
explanatory variables
Ø  ANCOVA: Does both simultaneously
q  Increases the precision of the analysis
q  Compares the means at the average values of
the predictor variables
q  The predictor (independent) variables must be
correlated to the outcome (dependent) variable
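With equal slopes assumed, ANCOVA is equivalent to a linear regression with a dummy (indicator) variable for group alongside the covariate. A hedged numerical sketch in Python — the variable names echo the Sales example but all the numbers are invented:

```python
import numpy as np

# ANCOVA with equal slopes as a linear model with a dummy variable:
#   outcome = b0 + b1*group + b2*covariate
# Synthetic data; names echo the Sales example but all numbers are invented
rng = np.random.default_rng(4)
n = 120
covariate = rng.uniform(20, 60, n)                    # e.g. NoServiceReps
group = (rng.uniform(size=n) < 0.2).astype(float)     # e.g. 1 = Jan/Feb, 0 = otherwise
y = 10 + 5 * covariate - 30 * group + rng.normal(0, 4, n)

X = np.column_stack([np.ones(n), group, covariate])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[1] is the difference of group means adjusted for the covariate, i.e. the
# vertical gap between the two parallel fitted lines
print(np.round(b, 2))  # close to [10, -30, 5]
```

This is exactly the comparison ANCOVA makes: the group coefficient is the difference of means at any fixed covariate value, rather than the raw difference of means.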
[Diagram: outcome variable plotted against the covariate for Groups 1 and 2, showing the raw difference of means and the difference of means adjusted to the average covariate value]

Assumptions for ANCOVA
q  Same as ANOVA and Linear Regression:
Ø  Independence of observations
Ø  Equality of variances
Ø  Normality of distribution
q  In addition:
Ø  Equal regression slopes – if this assumption is in
doubt, add an extra binary variable to the model to
represent the two groups
q  ANCOVA models can be very complex
q  Seek advice if you feel ANCOVA is appropriate
for your research
Example of ANCOVA
q  Open the SPSS file Catalogue2 you created
earlier
q  We want to compare Sales in January and
February against the rest of the year
q  The independent variable is NoServiceReps
q  Use Analyze > General Linear Model >
Univariate:
Ø  Dependent variable: Sales
Ø  Fixed Factor: MonthGroup
Ø  Covariate: NoServiceReps
Ø  Under Options… choose Parameter estimates
q  Model accounts for 71.4% of the total variance
q  The MonthGroup = 1 coefficient should be added to the intercept coefficient for this group

Add a fitted line to a grouped scatter plot
q  Open the chart editor
q  Select the Add a reference line from equation tool
q  Enter the Custom Equation from the ANCOVA output
q  Change the line colour using the Lines tab
q  Repeat for the other month group by subtracting the ANCOVA coefficient from the constant
Scatter plot with fitted lines
Clearly an improvement on Simple Linear Regression, but the best-fit slope for the two groups is slightly different

Activity
q  Open the file AlligatorData.xlsx associated with
this Workshop
q  The data file contains pelvic canal width,
snout-vent length and the gender of 35
alligators
q  Create a scatterplot for PelvicWidth against
SnoutLength for the different Gender groups
q  Run an ACOVA model for PelvicWidth against
SnoutLength for the different Gender groups
q  Plot the ANCOVA model on the scatterplot
Scatterplot of alligator data

ANCOVA for alligator data

Plot of ANCOVA on scatterplot

Recap
q  We have introduced Multiple Linear Regression by
adding one predictor variable then several predictor
variables to Simple Linear Regression
q  New diagnostics were required in order to address
the possibility of multicollinearity, namely VIF or
Tolerance
q  Better to use a systematic method for introducing or
removing variables, such as Backwards
q  Robustness arguments are similar to those in
Simple Linear Regression
q  Analysis of Covariance (ANCOVA) is a combination
of ANOVA and Linear Regression
Bibliography
Abraham, B. and Ledolter, J. (2006) Introduction to Regression Modeling. Belmont, CA: Thomson Brooks/Cole.
Field, A. (2013) Discovering Statistics using SPSS (And Sex and Drugs and Rock ’n’ Roll). 4th ed. London: SAGE, Sections 8.5–8.7 and Chapter 12.
Kleinbaum, D., Kupper, L., Muller, K. and Nizam, A. (2008) Applied Regression Analysis and Other Multivariable Methods. 4th ed. Belmont, CA: Thomson Brooks/Cole.
Kutner, M., Nachtsheim, C. and Neter, J. (2004) Applied Linear Regression Models. 4th ed. Boston, MA: McGraw-Hill/Irwin.
Myers, R. (1990) Classical and Modern Regression with Applications. 2nd ed. Boston, MA: Duxbury.
statstutor (n.d.) Multiple Regression resources. Available at: http://www.statstutor.ac.uk/topics/regression-and-model-building/multiple-regression/ [Accessed 8/01/14].
Stirling, W. D. (2013) Welcome to the General CAST e-book. Available at: http://cast.massey.ac.nz/core/index.html?book=general [Accessed 8/01/14], Section 6.3.7.