Both a histogram and a normal probability plot (a normal P-P plot) can be created in SPSS by selecting Analyze > Regression > Linear. After entering the outcome and predictor variables into the appropriate boxes on the main screen, clicking the Plots button at the right of the window will open a new window. It is here that the histogram and normal probability plot for the standardized residuals can be selected, as well as a residual scatterplot with the standardized residuals (ZRESID) on the y axis and the standardized predicted values (ZPRED) on the x axis (see Figure 1). Clicking Continue closes this window, and selecting OK on the main window will display the requested information in the output.
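For readers who want to perform the same check outside of SPSS, the following minimal sketch in Python (using the pandas and statsmodels libraries) fits a multiple regression and computes the standardized predicted values (ZPRED) and standardized residuals (ZRESID) on which the SPSS plots are based. The data file and variable names (scores.csv, outcome, predictor1, predictor2) are hypothetical and would be replaced with the researcher's own.

    # Illustrative sketch only (not the SPSS dialog described above); names are hypothetical.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("scores.csv")                         # hypothetical data file
    X = sm.add_constant(df[["predictor1", "predictor2"]])  # predictors plus an intercept
    model = sm.OLS(df["outcome"], X).fit()                 # fit the multiple regression

    # Standardized predicted values (ZPRED) and standardized residuals (ZRESID)
    zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
    zresid = (model.resid - model.resid.mean()) / model.resid.std()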
Visual inspection of the histogram will show whether variables are highly skewed (the distribution is not symmetrical, with scores concentrated away from the center), high or low in kurtosis (peakedness), or whether there are outliers (points that fall outside of +/-3 standard deviations of the mean, as approximately 99.7% of data should fall within this range given a normal distribution). Figure 2 illustrates data that follow a normal distribution. In cases where the distribution of data is asymmetrical, where the data are strongly peaked or flat, or where outliers exist, transformation of the data should be considered to better approximate normality (see Figure 3).
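As a numerical complement to this visual inspection, the sketch below (continuing the hypothetical example above) plots a histogram of the standardized residuals and reports their skewness, excess kurtosis, and the number of cases beyond +/-3 standard deviations.

    import matplotlib.pyplot as plt
    from scipy import stats

    plt.hist(zresid, bins=20)                          # histogram of the standardized residuals
    plt.xlabel("Standardized residual")
    plt.show()

    print("skewness:", stats.skew(zresid))             # 0 for a symmetrical distribution
    print("excess kurtosis:", stats.kurtosis(zresid))  # 0 for a normal distribution
    print("cases beyond +/-3 SD:", (abs(zresid) > 3).sum())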
In the P-P plot, the standardized residuals have been plotted against the expected values from the standard normal distribution. Data points should fall (approximately) on the diagonal, illustrated as a 45-degree line on the plot, when the residuals are normally distributed (see Figure 4). If the P-P plot shows data points that deviate from the line, or extreme cases at the ends of the distribution that fall away from the main cluster (as in Figure 5), the assumption of normally distributed residuals is less viable and a transformation should be considered to improve the normality of the data.
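A comparable normal P-P plot can be produced outside of SPSS; the sketch below (continuing the hypothetical example) uses the ProbPlot class from statsmodels to plot the standardized residuals against the standard normal distribution with a 45-degree reference line.

    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Points close to the 45-degree line support the normality assumption.
    sm.ProbPlot(zresid).ppplot(line="45")
    plt.show()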
Figure 1. Steps for creating standardized residual plots (a histogram and normal probability plot)
in SPSS.
Figure 3. When data do not follow a normal distribution, as in the displayed data, the assumption
of normality has been violated and a transformation of the data should be considered.
Figure 4. Checking the normality assumption in SPSS through a visual examination of the normal probability plot. The assumption is considered met when the data points follow the straight diagonal line (From Hindes, 2012b).
Figure 5. Data that do not follow the diagonal line violate the assumption of normality (From
Hindes, 2012b).
Linearity
As the name suggests, multiple linear regression assumes a linear relationship between the predictor and outcome variables, and between the observed and predicted values of the dependent variable; without this, the interrelationships cannot be estimated accurately. Though results are generally not affected by minor violations of this assumption, more severe violations can produce findings that underestimate the true relationship, thus increasing the probability of Type II error (Osborne & Waters, 2002).
The relationship between variables can be visually assessed using bivariate scatterplots
(scatterplots of pairs of variables). In SPSS, this can be achieved by selecting Graphs > Legacy
Dialogs > Scatter/Dot > Simple Scatter. Clicking on the Define button brings up a window
where you can enter a variable for each axis. Selecting OK will bring up the requested output
(see Figures 6 and 7, left). Linearity can also be assessed using the residual plots discussed previously in relation to the normality assumption, which are useful because curvilinear relationships are often easier to see in these plots than in scatterplots (Hindes, 2012b). Residual plots can be used to test this assumption because it is assumed that if the independent variables and the dependent variable are linearly related, so too will be the residuals and predicted scores
(Princeton, n.d.). When the linearity assumption is met, the residual plots will show a wide, even
distribution of data points around the horizontal line (see Figure 6, right). In cases where this
assumption has not been met (as in Figure 7, right), a decision must be made to select a nonlinear
model instead of MR, or to perform a transformation of the data so that it may be approximated
by a linear model (Stevens, 2009).
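For comparison, a bivariate scatterplot and a residual plot of the kind described above can be produced with a few lines of Python (matplotlib); the sketch below continues the hypothetical example and assumes df, zpred, and zresid from the earlier sketch.

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(df["predictor1"], df["outcome"])   # bivariate scatterplot of one variable pair
    ax1.set_xlabel("predictor1")
    ax1.set_ylabel("outcome")
    ax2.scatter(zpred, zresid)                     # residuals against predicted values
    ax2.axhline(0)                                 # residuals should spread evenly around this line
    ax2.set_xlabel("ZPRED")
    ax2.set_ylabel("ZRESID")
    plt.show()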
Figure 6. The distribution of data follows a straight line in the bivariate scatterplot, indicating a linear relationship between the variables (left). The wide, even distribution when the residuals are plotted against the predicted values (right) indicates that the linearity assumption has been met (From Hindes, 2012b).
Figure 7. The curvilinear distribution of data points in the bivariate scatterplot (left) and in the residual plot (right) indicates that the assumption of linearity has been violated (From Hindes, 2012b).
Homoscedasticity
The assumption of equal variances (homoscedasticity) requires that the variance of the residuals be (approximately) equal across all predicted scores of the dependent variable. This assumption can also be assessed using the residual plots discussed above. When the assumption is met, the spread of the data points will be approximately the same across all values of the predicted dependent variable (see Figure 8). Conversely, a violation of this assumption (referred to as heteroscedasticity) will appear as a fan or bow tie shaped distribution (see Figures 9 and 10, respectively). While slight violations of this assumption may have little effect, more severe violations can increase the probability of Type I errors (Osborne & Waters, 2002; Yang & Huck, 2010). When further information about heteroscedasticity is required, inferential tests (e.g., the Goldfeld-Quandt test or the Glejser test) can be used to determine whether a transformation of the variables should be considered so that the data conform more closely to the assumption of homoscedasticity.
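One of the inferential tests mentioned above, the Goldfeld-Quandt test, is implemented in the statsmodels library; the sketch below (continuing the hypothetical example) reports its F statistic and p value, where a small p value suggests heteroscedasticity.

    from statsmodels.stats.diagnostic import het_goldfeldquandt

    # Assumes the cases are ordered by the variable suspected of driving the unequal variance.
    f_stat, p_value, _ = het_goldfeldquandt(df["outcome"], X)
    print("Goldfeld-Quandt F =", round(f_stat, 2), "p =", round(p_value, 3))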
Figure 8. The even distribution of data points in the residual plot indicates that the assumption of
homoscedasticity has been met.
Figure 9. The fan-like distribution of data points in the residual plot indicates that the
assumption of homoscedasticity has not been met.
Figure 10. The bow tie distribution of data points in the residual plot indicates that the
assumption of homoscedasticity has not been met (From Hindes, 2012b).
Multicollinearity
The more predictor variables included in a model, the more likely it is that
multicollinearity, or high intercorrelations among predictor variables, will exist (Pedhazur, 1997
as cited in Nathans et al., 2012). Therefore, in MR analyses, where multiple independent
variables are included in the model, a correlation analysis should be used to determine whether
any such high correlations exist. Highly correlated predictor variables make it hard to determine
the importance of a given predictor variable, as the researcher cannot tell which variable is influencing the outcome variable (Hindes, 2012b). Moreover, multicollinearity limits the size of the multiple correlation (R) and increases the variances of the regression coefficients (resulting in more unstable regression equations; Hindes, 2012b).
From the main regression window (Analyze > Regression > Linear), the Statistics button can be clicked to allow the researcher to choose the statistics they would like to include in the analysis. The specific information required will depend on the situation, and as this is not the
focus of the current paper, these options will not be discussed here. However, the Collinearity
Diagnostics option must be checked in order to explore multicollinearity (see Figure 11). For a
full overview of multiple regression analyses in SPSS, please refer to
http://www.palgrave.com/pdfs/0333734718.pdf.
The second table provided in the regression output (Correlations, see Figure 12)
provides a correlation matrix, which shows the correlations between variables. For MR analyses,
it is desirable to have independent variables that are not strongly correlated with each other, but
that are highly correlated with the outcome variable, so that we can explain as much of the
variance as possible. In the correlation table, the first row provides the Pearson correlation coefficients (how strongly correlated each pair of variables is), while the significance of these values is found in the second row. Typically, a correlation coefficient of 0.80 or higher is indicative of
multicollinearity. Values of this magnitude can cause problems when drawing inferences about
the relative contribution of each variable to the model as the predictor variables are explaining
shared variance in the dependent variable (thus creating redundancy). With lower correlations between predictor variables, each predictor accounts for more unique (unshared) variance.
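A comparable correlation matrix can be computed outside of SPSS; the sketch below (continuing the hypothetical example) prints the Pearson correlations among the outcome and predictors, where predictor pairs with coefficients of 0.80 or higher would warrant closer inspection.

    # pandas computes Pearson correlations by default.
    corr = df[["outcome", "predictor1", "predictor2"]].corr()
    print(corr.round(2))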
Additionally, there are two collinearity statistics that are provided in the SPSS output to
determine whether the predictor variables are independent enough for use in the model. These
are the Tolerance and Variance Inflation Factors (VIFs) and are displayed in the Coefficients
table (see Figure 13). Tolerance values provide a measure of the correlation between predictor
variables and can range between zero and one. As Tolerance is the proportion of a variable's variance that is not accounted for by the other independent variables (Hindes, 2012a; Princeton, n.d.), the closer to zero it is, the stronger the relationship between this variable and the other
predictors. Thus, values as close to one as possible are desired, indicating that the predictors are
independent enough for use in the model.
A Variance Inflation Factor (VIF) is the reciprocal of tolerance (1/Tolerance) and
indicates whether there is a strong linear association between a predictor variable and the other
predictor variables. As it is hard to find variables that are not related to each other to some extent, a general rule is that VIF values below two are considered acceptable (Hindes, 2012b). Predictors with such values are considered independent enough for use in the regression model without raising concerns about multicollinearity, while values above two indicate multicollinearity and should be reviewed (Hindes, 2012b).
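Tolerance and VIF can also be computed directly, mirroring the values reported in the SPSS Coefficients table; the sketch below (continuing the hypothetical example) uses the variance_inflation_factor function from statsmodels and reports Tolerance as the reciprocal of each VIF.

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    predictors = [c for c in X.columns if c != "const"]   # skip the intercept column
    for name in predictors:
        vif = variance_inflation_factor(X.values, X.columns.get_loc(name))
        print(name, "VIF =", round(vif, 2), "Tolerance =", round(1 / vif, 2))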
Figure 12. The correlation matrix provided in SPSS output (From Hindes, 2012c). The first row
provides the correlation coefficients, while the second row indicates the significance of the
correlation values.
Figure 13. Checking for multicollinearity in SPSS using Tolerance and Variance Inflation Factors
(From Hindes, 2012c).
Conclusion
In summary, multiple linear regression is a statistical tool that, when used properly,
provides researchers with valuable information about the relative contribution of several predictor variables to a construct of study. The integrity of interpretations drawn from results is
dependent on the assumptions underlying multiple regression having been met. Normality,
linearity and homoscedasticity are three assumptions of multiple regression that must be verified
to ensure confidence can be placed in the interpretations made, and methods for checking these
assumptions using SPSS software were discussed. As high correlations between predictor variables can create redundancy in the model and make it difficult to draw inferences about the relative contribution of each predictor to the model, methods for assessing multicollinearity were also examined.
References
Brace, N., Kemp, R., & Snelgar, R. (2009). Multiple Regression. SPSS for Psychologists. Available
from http://www.palgrave.com/pdfs/0333734718.pdf
Gay, L. R., Mills, G. E., & Airasian, P. (2008). Educational research: Competencies for analysis and
applications (10th ed.). NJ: Pearson.
Hindes, Y. (2012a). Multiple Regression. [PowerPoint slides]. Retrieved from
https://blackboard.ucalgary.ca/webapps/portal/frameset.jsp?tab_id=_2_1&url=%2fwebapps
%2fblackboard%2fexecute%2flauncher%3ftype%3dCourse%26id%3d_119178_1%26url%3d
Hindes, Y. (2012b). Regression Continued. [PowerPoint slides]. Retrieved from https://blackboard
.ucalgary.ca/webapps/portal/frameset.jsp?tab_id=_2_1&url=%2fwebapps%2fblackboard
%2fexecute%2flauncher%3ftype%3dCourse%26id%3d_119178_1%26url%3d
Hindes, Y. (2012c). Pre-Recorded Session 4 Multiple Regression. [PowerPoint slides]. Retrieved
from https://blackboard.ucalgary.ca/webapps/portal/frameset.jsp?tab_id=_2_1&url=
%2fwebapps%2fblackboard%2fexecute%2flauncher%3ftype%3dCourse%26id
%3d_119178_1%26url%3d
Nathans, L. L., Oswald, F. L., & Nimon, K. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research & Evaluation, 17(9), 1-19. Retrieved from http://pareonline.net/getvn.asp?v=17&n=9
Osborne, J., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research & Evaluation, 8(2). Retrieved from http://PAREonline.net/getvn.asp?v=8&n=2
Princeton University. (n.d.). Introduction to regression. Retrieved June 20, 2012, from http://dss.princeton.edu/online_help/analysis/regression_intro.htm
SPSS. (2011). Statistical Package for the Social Sciences (Version 20.0) [Computer software]. Chicago, IL: SPSS Inc.
Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge.
Wells, C. S., & Hintze, J. M. (2007). Dealing with assumptions underlying statistical tests. Psychology in the Schools, 44, 495-502. doi:10.1002/pits.20241
Yang, H., & Huck, S. W. (2010). The importance of attending to underlying statistical assumptions. Newborn & Infant Nursing Reviews, 10, 44-49. doi:10.1053/j.nainr.2009.12.005