Multiple Linear Regression
(2nd Edition)
Mark Tranmer
Jen Murphy
Mark Elliot
Maria Pampaka
January 2020
1 THE BASICS – UNDERSTANDING LINEAR REGRESSION
Linear regression is a modelling technique for analysing data to make predictions. In simple
linear regression, a bivariate model is built to predict a response variable (𝑦) from an
explanatory variable (𝑥)1. In multiple linear regression the model is extended to include
more than one explanatory variable (x1, x2, …, xp), producing a multivariate model.
This primer presents the necessary theory and gives a practical outline of the technique for
bivariate and multivariate linear regression models. We discuss model building, assumptions
for regression modelling and interpreting the results to gain meaningful understanding from
data. Complex algebra is avoided as far as is possible and we have provided a reading list
for more in-depth learning and reference.
A simple linear regression estimates the relationship between a response variable 𝑦, and a
single explanatory variable 𝑥, given a set of data that includes observations for both of these
variables for a particular sample.
For example, we might be interested to know if exam performance at age 16 – the response
variable – can be predicted from exam results at age 11 – the explanatory variable.
1. The terms response and explanatory variable are the general terms used to describe predictive relationships. You will also see the terms dependent and independent used. Formally, this latter pair only applies to experimental designs, but they are sometimes used more generally. Some statistical software (e.g. SPSS) uses dependent/independent by default.
Table 1 Exam scores at ages 11 and 16
Exam11   Exam16
40       33
70       65
62       57
45       33
55       43
65       55
66       55
77       67
66       56
Table 1 contains exam results at ages 11 and 16 for a sample of 17 students. Before we use
linear regression to predict a student’s result at 16 from the age 11 score, we can plot the
data (Figure 1).
Figure 1 Scatterplot of age 16 exam scores (Exam16) against age 11 exam scores (Exam11)
We are interested in the relationship between age 11 and age 16 scores – or how they are
correlated. In this case, the correlation coefficient is 0.87 – demonstrating that the two
variables are indeed highly positively correlated.
To fit a straight line to the points on this scatterplot, we use linear regression – the equation
of this line is what we use to make predictions. The equation for the line in regression
modelling takes the form:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝑒𝑖
We refer to this as our model. For the mathematical theory underlying the estimation and
calculation of correlation coefficients, see Appendix A.
β0 is the intercept, also called the constant – this is where the line crosses the 𝑦 axis of the
graph. For this example, this would be the predicted age 16 score for someone who has
scored nil in their age 11 exam.
β1 is the slope of the line – this is how much the value of 𝑦 increases for a one-unit increase
in 𝑥, or, in this example, how much the predicted age 16 score increases for each additional
mark gained in the age 11 exam.
𝑒𝑖 is the error term for the ith student. The error is the amount by which the predicted
value differs from the actual value. In linear regression we assume that if we calculate the
error terms for every person in the sample and take the mean, the mean value will be zero.
The error term is also referred to as the residual (see 1.3 for more detail on residuals).
Our hypothesis is that the age 16 score can be predicted from the age 11 score – that is to
say, that there is an association between the two. We can write this out as null and
alternative hypotheses:
𝐻0 : 𝛽1 = 0
𝐻1 : 𝛽1 ≠ 0
The null hypothesis is that there is no association – it doesn’t matter what the age 11 score
is for a student when predicting their age 16 score, so the slope of the line, denoted 𝛽1 ,
would be zero.
If there is a relationship, then the slope is not zero – our alternative hypothesis.
The relationship between x and y is then estimated by carrying out a simple linear
regression analysis. SPSS estimates the equation of the line of best fit by minimising the
sum of the squares of the differences between the actual values and the values predicted
by the equation (the residuals) for each observation. This method is often referred to as the
ordinary least squares (OLS) approach; there are other methods for estimating the
parameters, but their technical details are beyond the scope of this primer.
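To make the least squares calculation concrete, here is a minimal sketch in Python (numpy is assumed; the arrays are illustrative values based on the scores shown in Table 1, so the estimates will not exactly reproduce the full-sample SPSS results quoted below):

    import numpy as np

    # Illustrative exam scores, in the spirit of Table 1
    exam11 = np.array([40, 70, 62, 45, 55, 65, 66, 77, 66], dtype=float)
    exam16 = np.array([33, 65, 57, 33, 43, 55, 55, 67, 56], dtype=float)

    # Ordinary least squares estimates for a simple linear regression:
    # slope = Cov(x, y) / Var(x), intercept = mean(y) - slope * mean(x)
    slope = np.cov(exam11, exam16)[0, 1] / np.var(exam11, ddof=1)
    intercept = exam16.mean() - slope * exam11.mean()
    print(intercept, slope)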
For the exam data, SPSS estimates these parameters as:
β0 = -3.984
β1 = 0.939
This gives us a regression equation of:
ŷi = -3.984 + 0.939xi
where xi is the value of EXAM11 for the ith student. The ^ symbol over the 𝑦𝑖 is used to
show that this is a predicted value.
So, if a student has an EXAM11 score of 55 we can predict the EXAM16 score as follows:
ŷ = -3.984 + 0.939 × 55 = 47.7
If we draw this line on the scatter plot, as shown in Figure 2, it is referred to as the line of
best fit of y on x, because we are trying to predict y using the information provided by x.
1.3 RESIDUALS
The predicted EXAM16 score of the student with an EXAM11 score of 55 is 47.7; however, if
we refer to the original data, we can see that the first student in the table scored 55 at age
11, but their actual score at age 16 was 45. The difference between the actual or observed
value and the predicted value is called the error or residual.
𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖
Remember that 𝑦̂ means predicted, and 𝑦 means actual or observed.
The residual for the first student is therefore 45 – 47.7 = -2.7. The residual is the distance of
each data point away from the regression line. In Figure 2 the prediction equation is plotted
on the scatter plot of exam scores. We can see that very few if any of the actual values fall
on the prediction line.
Figure 2 Plotting the regression line for age 11 and age 16 exam scores
If we calculate the predicted value using the regression equation for every student in the
sample, we can then calculate all the residuals. For a model which meets the assumptions
for linear regression, the mean of these residuals is zero. More about assumptions and
testing data to make sure they are suitable for modelling using linear regression later!
Our model has allowed us to predict the values of EXAM16; however, it is important to
distinguish between correlation and causation. The EXAM11 score has not caused the
EXAM16 score – they are simply correlated. There may be other variables through which
the relationship is mediated – base intellect, educational environment, parental support,
student effort and so on – and these could be driving the scores, rather than the
explanatory variable itself. To illustrate this further, statistically speaking, we would have
just as good a model if we used EXAM16 to predict the values of EXAM11. Clearly one would
not expect a student’s EXAM scores at age 16 to be causing in any sense their exam scores
at age 11! So a good model does not mean a causal relationship.
Our analysis has investigated how an explanatory variable is associated with a response
variable of interest, but the equation itself is not grounds for causal inference.
Multiple linear regression extends simple linear regression to include more than one
explanatory variable. In both cases, we still use the term ‘linear’ because we assume that
the response variable is directly related to a linear combination of the explanatory variables.
The equation for multiple linear regression has the same form as that for simple linear
regression but has more terms:
yi = β0 + β1x1i + β2x2i + … + βpxpi + ei
Again, the analysis does not allow us to make causal inferences, but it does allow us to
investigate how a set of explanatory variables is associated with a response variable of
interest.
2 BASIC ANALYSIS USING SPSS
Multiple linear regression is a widely used method within social sciences research and
practice. Examples of suitable problems to which this method could be applied include:
Prediction of the overall examination performance of pupils in ‘A’ levels, given the
values of a set of exam scores at age 16.
This section shows how to use the IBM program SPSS to build a multiple linear regression
model to investigate the variation between different areas in the percentage of residents
reporting a life limiting long-term illness.
The data are taken from the 2001 UK Census and are restricted to the council wards in the
North West of England (n = 1006).
Which explanatory variables make the most difference to the outcome variable?
Are there any areas that have higher or lower than expected values for the
outcome?
The first task in any data analysis is to explore and understand the data using descriptive
statistics and useful visualisations. This has two purposes:
1. It will help you to get a feel for the data you are working with;
2. It will inform decisions you make when you carry out more complex analyses (such
as regression modelling).
2. Here we are using SPSS version 23. If you are using a different version then the look and feel may be a little different.
This selection opens the following dialog box.
Clicking on OK at this dialog box will prompt SPSS to open an output window in which the
following output will appear (Table 3).3
3. Note that using the Paste button in a dialog box above allows the syntax to be pasted into a script window from which it can be directly edited, saved and run again later. There are numerous online sources for SPSS syntax and it is not intended that this primer covers the writing of syntax.
A variable that shows very little variation across the wards is unlikely to add value to a
model. In this case, the variables all look to have sufficient variability, with the possible
exception of the %female variable.
Here we plot the values for each variable. You can see in
Figure 3 that the distribution for each variable is quite different – for example, there are
much greater differences between the wards in the %social renters, than in %females. This
is in line with our expectations – we would expect most wards to have a similar gender split,
but that poorer areas would have a much higher incidence of social renting.
Figure 3 Box plot of univariate distributions
SPSS will calculate the Pearson correlation for all pairs of specified variables. Select Analyze
> Correlate > Bivariate to reach the dialogue box:
Table 4 shows the SPSS output where the five variables above are selected. The output
shows that N = 1006 for all correlations. This tells us that the data are complete and there
are no missing values – in a real-life data scenario it is likely that N will differ for each
calculated correlation, as not all cases will have complete values for every field. Missing
data is an area of research in itself and there are many methods for dealing with missing
data such that a sample remains representative and/or any results are unbiased.
For the purposes of this example, all cases with missing data have been excluded – a
somewhat heavy-handed approach, but one which works well for a worked example and may
indeed be appropriate in many analyses.
Table 4 Pearson Correlations among % llti, % female, % aged 60 and over, % unemp of econ act. and % social rented (N = 1006 for every pair)
In this case, the correlations of the explanatory variables with the response variable, apart
from % aged 60 and over, look good enough (according to the criteria set above). We will
leave this variable in for now, but will watch out for issues with it later. Similarly, the
correlation between social rented and unemployment is quite high, but not high enough for
rejection at this stage.
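For readers working outside SPSS, here is a minimal sketch in Python of the same correlation matrix (pandas is assumed; the file name and the column names llti, female, aged60, unemp and social_rented are hypothetical stand-ins for however the variables are named in practice):

    import pandas as pd

    # Load the ward-level data (file name is hypothetical)
    df = pd.read_csv("nw_wards_2001.csv")

    # Pairwise Pearson correlations for the response and candidate explanatory variables
    cols = ["llti", "female", "aged60", "unemp", "social_rented"]
    print(df[cols].corr(method="pearson"))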
The dialog box is shown below. Select Scatter/Dot and then the top left hand option (simple
scatter).
To generate the graph you need to drag the variable names from the list on the left onto the
pane on the right and then click OK:
The output should look like Figure 4.
Double clicking on the graph from the output page will open the graph editor and allow a
straight line to be fitted and plotted on the scatterplot as shown in Figure 5.
Choose – Elements, Fit line, Linear to fit a simple linear regression line of % LLTI on % social
rented.
Figure 5 Simple linear regression of %llti by % social rented using graph editor
The simple linear regression line plot in Figure 5 shows an R² value of 0.359 at the top right
hand side of the plot. This means that the variable % social rented explains 35.9% of the
ward level variation in % LLTI. This is a measure of how well our model fits the data – we
can use R² to compare models: the more variance a model explains, the higher the R²
value.
The linear regression line plotted in Figure 5 through the graph editor interface can be
specified as a model.
Our response variable is %llti and for a simple linear regression we specify one explanatory
variable, % social rented. These are selected using the Analyze > Regression > Linear menu
path.
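The same model specification can be written compactly in other software. Here is a minimal sketch in Python using statsmodels (the data frame and column names are the hypothetical ones introduced above):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("nw_wards_2001.csv")  # hypothetical file name

    # Simple linear regression of % llti on % social rented (ordinary least squares)
    model = smf.ols("llti ~ social_rented", data=df).fit()

    # summary() reports the same information as the SPSS output tables:
    # R squared, the ANOVA F test, and the coefficients with standard errors
    print(model.summary())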
2.3.1 REGRESSION OUTPUTS
The output for a model within SPSS contains four tables. These are shown as separate
Tables here with an explanation of the content for this example.
Table 5 Variables Entered/Removed (b)
Model   Variables Entered       Variables Removed   Method
1       % social rented (a)     .                   Enter
a. All requested variables entered.
b. Dependent Variable: % llti
Table 5 confirms that the response variable is % llti and the explanatory variable here is %
social rented. The model selection ‘method’ is stated as ‘Enter’. This is the default and is
most appropriate here. More about “methods” later!
Table 6 Model Summary
Table 6 is a summary of the model fit details. The adjusted R² figure4 is 0.359 – the same value
as we saw in Figure 5 – showing that the model explains 35.9% of the variance in the % of life
limiting illness reported at ward level.
Table 7 ANOVA (b)
Model            Sum of Squares   df     Mean Square   F         Sig.
1  Regression    6160.641         1      6160.641      563.240   .000 (a)
   Residual      10981.604        1004   10.938
   Total         17142.244        1005
a. Predictors: (Constant), % social rented
b. Dependent Variable: % llti
ANOVA stands for Analysis of Variance; SPSS produces an ANOVA table as part of the
regression output, as shown in Table 7. The variance in the data is divided into a set of
components. The technical background to an ANOVA table is beyond the scope of this
primer. We look mainly at the Sig. column, which gives the p-value for the F statistic – a test
of whether the model as a whole explains a statistically significant amount of the variance. If
this is greater than 0.05 then the whole model is not statistically significant and we need to
stop our analysis here. The value here is below 0.05 and so we can say that the fit of the
model as a whole is statistically significant.
4. In SPSS, both R² and "adjusted" R² are quoted. For large sample sizes, these two figures
are usually very close. For small values of n, the figure is adjusted to take account of the
small sample size and the number of explanatory variables, and so there may be a
difference. The technical details of the adjustment are beyond the scope of this primer.
The adjusted figure should be used in all instances.
Table 8 Model parameters (Coefficients a)
Model               Unstandardized B   Std. Error   Standardized Beta   t         Sig.
1  (Constant)       17.261             .157                             109.999   .000
   % social rented  .178               .008         .599                23.733    .000
a. Dependent Variable: % llti
The estimated model parameters are shown in the Coefficients table (Table 8). The B
column gives us the 𝛽 coefficients for the prediction equation.
To best understand this table it helps to write out the model equation. Remember:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝑒𝑖
Substituting the variables and results of our regression analysis gives:
%llti^ = β0 + β1 (% social rented)
So:
%llti^ = 17.261 + 0.178 (% social rented)
The ^ on %llti indicates that this is a predicted value rather than the actual value (and
therefore we don't need the error term).
In our example, for every 1% increase in the percentage of people living in social rented
housing in a ward, we expect a 0.178% increase in the percentage of people living with a life
limiting illness in that same ward. The relationship is positive – areas with more social
tenants have greater levels of long-term illness.
For a ward with no social tenants, we expect 17.261% illness as this is the intercept – where
the line of best fit crosses the y-axis.
Again, we must be careful to remember that this statistically significant model describes a
relationship but does not tell us that living in socially rented accommodation causes life
limiting illnesses. In fact, those people reporting illness in each ward may not even be the
same people who report living in social housing, as the data are held at ward rather than
person level. Instead, an increase in social tenants may indicate that a ward has higher
levels of people with lower incomes and higher levels of poverty. There is a significant body
of literature that links poverty with illness, so this does make substantive sense.
2.3.2 STANDARDISED COEFFICIENTS
The unstandardised coefficients shown in Table 8 can be substituted straight into the
theoretical model. The issue with these is that they are dependent on the scale of
measurement of the explanatory variables and therefore cannot be used for comparison –
bigger does not necessarily mean more important. The standardised coefficients get round
this problem and relate to a version of the model where the variables have been
standardised to have a mean of zero and a standard deviation of 1.
We interpret the standardised coefficients in terms of standard deviations.
For this model, for one standard deviation change in the % of social renters in a ward, there
is a 0.599 standard deviation change in the % of people reporting a life limiting illness.
The descriptives table we produced in SPSS (Table 3) tells us that the standard deviation of
social tenancy is 13.9% and the standard deviation of the outcome variable is 4.13%. So for
a 13.9% change in social tenancy, there is a (4.13*0.599) change in illness – 2.47%. This is
the same as a change of 0.178% for a 1% increase in social tenancy5.
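The link between the two sets of coefficients can be checked directly. A minimal sketch in Python – this is simply the arithmetic above written out, using the standard deviations quoted from the descriptives table:

    # Standardised (Beta) coefficient = unstandardised B * sd(x) / sd(y)
    b_social_rented = 0.178   # unstandardised coefficient from Table 8
    sd_social_rented = 13.9   # standard deviation of % social rented (Table 3)
    sd_llti = 4.13            # standard deviation of % llti (Table 3)

    beta = b_social_rented * sd_social_rented / sd_llti
    print(round(beta, 3))  # approximately 0.599, matching the Beta column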
The parameters are estimates and are subject to sampling uncertainty – if we had drawn a
different sample, we would have obtained somewhat different values, and the true value for
each parameter could fall anywhere within a range around the estimate. The standard error
of each estimate shows us the spread of this sampling distribution, and the Sig. column tells
us whether or not the estimates are statistically different from zero.
If these values are not statistically different from zero, then the true value sits within a
distribution which includes zero within the 95% confidence bounds. If the estimate for the
parameter could be zero, then it could be that there is in fact no relationship – a zero
coefficient and a flat line of best fit.
A value which is not statistically significant is indicated by a p-value greater than 0.05 (the
Sig. column). For this model, p <0.05 and so we can say that the estimates of the
parameters are statistically significant and we can infer that there is an association between
the variables.
2.4 MULTIPLE LINEAR REGRESSION ANALYSIS
Adding additional explanatory variables to a simple linear regression model builds a multiple
linear regression model. The process is identical within SPSS – including additional variables
in the specification stages. This example includes the percentage of females, the percentage
of over 60s and the percentage of unemployed economically active residents as additional
explanatory variables, over the simple regression using just the percentage of social tenants.
Here we are interested in the levels of life limiting illness in different areas. We have a
theory that poverty is linked with life limiting illnesses, and that differences in age and
gender may play a part. We have a dataset that contains variables which are related to this
theory and so we build a model that reflects our theory.
For the default method, 'Enter', the order of the explanatory variables is not important.
The method uses all the specified explanatory variables, regardless of whether or not they
turn out to be statistically significant.
Table 9 Variables Entered/Removed (b)
Model   Variables Entered                                                         Variables Removed   Method
1       % aged 60 and over, % female, % unemp of econ act., % social rented (a)   .                   Enter
a. All requested variables entered.
b. Dependent Variable: % llti
Model Summary
Table 11 ANOVA (b)
Model            Sum of Squares   df     Mean Square   F         Sig.
1  Regression    11598.023        4      2899.506      523.501   .000 (a)
   Residual      5544.221         1001   5.539
   Total         17142.244        1005
a. Predictors: (Constant), % aged 60 and over, % female, % unemp of econ act., % social rented
b. Dependent Variable: % llti
Table 12 Model parameters (Coefficients a)
Model                   Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1  (Constant)           -9.832             2.734                            -3.596   .000
   % unemp of econ act. .774               .035         .664                22.147   .000
   % female             .344               .056         .121                6.176    .000
   % social rented      .052               .009         .175                5.728    .000
   % aged 60 and over   .336               .017         .404                19.762   .000
a. Dependent Variable: % llti
From Table 12 we can see that all of the explanatory variables are statistically significant. So
our theory that these variables are related to long-term limiting illness rates is supported by
the evidence.
All the 𝛽 coefficients are positive – which tells us that an increase in the value of any of the
variables leads to an increase in long term limiting illness rates.
From the information in Table 12, we can now make a prediction of the long term limiting
illness rates for a hypothetical ward, where we know the values of the explanatory variables
but don’t know the long term limiting illness rate.
Say that in our hypothetical ward that the unemployment rate is 18%, females are 45% of
the population, social tenancy is at 20%, and 20% of the population are aged 60 and over.
The general form of the model is:
% llti = β0 + β1 (% unemployed)
         + β2 (% female)
         + β3 (% social rented)
         + β4 (% aged 60 and over)
         + εi
Substituting the values from Table 12 gives us:
%llti^ = -9.832 + 0.774 × (% unemployed)
         + 0.344 × (% female)
         + 0.052 × (% social rented)
         + 0.336 × (% aged 60 and over)
This would give a predicted value for our hypothetical ward of 27.3%:
%llti^ = -9.832 + 0.774 × 18 + 0.344 × 45 + 0.052 × 20 + 0.336 × 20 = 27.3
We can also use Table 12 to examine the impact of an older population in a ward as a single
variable. If we leave all other variables the same (sometimes called "holding all other
variables constant"), then we can see that an increase of 1% in the proportion of the
population that is over 60 leads to a 0.336% increase in the predicted value of the long term
limiting illness rate (i.e. the precise value of the B coefficient). Another way of saying this is:
"controlling for employment, gender and social tenancy rates, a 1 unit increase in the
percentage of people over sixty leads to a 0.336 unit increase in long term limiting illness
rates". This simple interpretability is one of the strengths of linear regression.
OK so we have just shown the basics of linear regression and how it is implemented in SPSS.
Now we are going to go a bit deeper. In this section we will consider some of the
assumptions of linear regression and how they affect the models that you might produce.
The residuals are normally distributed
There is no more than limited multicollinearity
There are no external variables – that is, variables that are not included in the model
that have strong relationships with the response variable (after controlling for the
variables that are in the model)
Independent errors
Independent observations.
For most of these assumptions, if they are violated then it does not necessarily mean we
cannot use a linear regression method, simply that we may need to acknowledge some
limitations, adapt the interpretation or transform the data to make it more suitable for
modelling.
The most basic assumption of a linear regression is that the response variable is continuous.
The normal definition of continuous is that it can take any value between its minimum and
its maximum. Two useful tests for continuity are:
In many cases these two tests are clear cut but there is a certain class of variables called
count variables which pass test 1 but the result of test 2 is ambiguous and depends in part
on the meaning of the variable. For example, number of cigarettes smoked is usually OK to
treat as continuous whereas number of cars in a household is not.
Binary variables are indicators of whether a feature is present or absent, or whether
something is true or false. They are usually coded as 1 – the feature is present/true – and 0 –
the feature is absent/false.
Variables which are not binary or continuous can be used in a regression model if they are
first converted into dummy variables (see section 4.1).
Linear regression modelling assumes that the relationship between outcome and each of
the explanatory variables is linear6, however this may not always be the case.
For example, there may be a curve in the data, which is better represented by a quadratic
rather than a linear relationship.
Figure 6 shows the log of hourly wage by age for a sample of respondents. In the left hand
plot a straight line of best fit is plotted. In the right hand plot, we can see that a curved line
looks to the naked eye to be a much more sensible fit. We, therefore, propose that there is
a quadratic relationship between the log of pay per hour, and age. This means that the log
of pay per hour and age squared are linearly related.
6. i.e. in the sense that it conforms to a straight line. It might seem slightly odd, as a curve is also a line, but when statisticians refer to "linear" they mean straight; everything else is "non-linear". See https://study.com/academy/lesson/how-to-recognize-linear-functions-vs-non-linear-functions.html for further discussion.
7. This may seem a little confusing; since we have added in non-linear predictors, why is the model still referred to as a linear regression model? The reason is that the linearity here refers to the model, not the data. The term linear regression denotes an equation in which the effect of each parameter in the model is simply additive (but the parameters themselves could represent non-linear relationships in the data). See: https://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-difference-between-linear-and-nonlinear-equations-in-regression-analysis for more details.
Figure 6 Scatterplots of log of hourly wage (£ per hour) by age, with a straight line of best fit in the left hand panel and a curved (quadratic) fit in the right hand panel
To account for this non-linear relationship in our linear model, we need to compute a new
variable – the square of age (here called agesq where agesq = age2). If there is a statistically
significant quadratic relationship between hourly wage and age, then the model should
contain a statistically significant linear coefficient for age squared which we can then use to
make better predictions.
The general form of the model for this relationship would be:
log(pay per hour)i = β0 + β1(age)i + β2(agesq)i + ei
Note that we have retained the linear component (age) in the model. This is generally regarded as
best practice regardless of the significance of the linear component. In this case the left
hand graph in Figure 6 does indicate that there is a linear component.
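A minimal sketch in Python of fitting such a model (statsmodels is assumed; the file and the pay and age column names are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("wages.csv")          # hypothetical file with 'pay' and 'age' columns
    df["log_pay"] = np.log(df["pay"])      # log of hourly pay
    df["agesq"] = df["age"] ** 2           # computed quadratic term

    # Linear regression that keeps both the linear and the quadratic age terms
    quad_model = smf.ols("log_pay ~ age + agesq", data=df).fit()
    print(quad_model.summary())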
The ordered values of the standardised residuals are plotted against the expected values
from the standard normal distribution. If the residuals are normally distributed, they should
lie, approximately, on the diagonal.
Figure 7 P-P plots for the simple linear regression (left – Table 8) and multiple linear
regression (right Table 12) examples
In Figure 7, the left hand example shows the plot for the simple linear regression and the
right hand plot shows the multiple linear regression. We can see that the line deviates from
the diagonal in the left plot, whereas in the right hand example the line stays much closer
to the diagonal.
This makes substantive sense – our multiple linear regression example explains much more
of the variance and therefore there are no substantively interesting patterns left within the
residuals and they are normally distributed. In our simple linear regression, we are missing
some important explanatory variables – there is unexplained variance and this shows in the
residuals where the distribution deviates from normal.9
If we plot the standardised residuals for our two regression examples as histograms, we
can see that both examples follow approximately a normal distribution (Figure 8). The left
hand example is our simple linear regression and the right hand example is the multiple
linear regression. The multiple linear regression example here has residuals that follow the
normal distribution more closely.
9. Note that the reverse is not necessarily true. Normally distributed residuals do not imply that you have no missing (or extraneous) variables.
We could use formal tests for normality, such as the Shapiro-Wilk or Kolmogorov-Smirnov
statistics; however, these are beyond the scope of this primer.10
Figure 8 Histogram of standardised residuals for simple regression (left, Table 8) and
multiple regression (right, Table 12)
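For completeness, here is a minimal sketch in Python of these checks on a set of residuals (scipy and matplotlib are assumed; simulated values stand in for the saved residuals so the sketch runs end to end):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # 'resid' stands in for the saved residuals from a fitted regression model;
    # simulated values are used here purely so the sketch is self-contained
    rng = np.random.default_rng(0)
    resid = rng.normal(loc=0.0, scale=1.0, size=1006)

    # Histogram of the residuals, as in Figure 8
    plt.hist(resid, bins=30)
    plt.xlabel("Standardised residual")
    plt.show()

    # Shapiro-Wilk test of normality: a small p-value suggests non-normal residuals
    statistic, p_value = stats.shapiro(resid)
    print(statistic, p_value)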
Figure 9 shows a plot of the standardised residuals against the standardised predicted
values for the response variable measuring long term illness from our ward level multiple
linear regression (right) and the simple linear regression (left) examples.
Also, we should plot the saved residuals against each of the other variables in the analysis, to
assess on a variable-by-variable basis whether there is any dependency of the residuals on
the variables in the analysis (there should not be).
Figure 9 Plotting residuals to check homoscedasticity for simple regression (left) and
multiple regression (right)
The left hand plot shows a clear cone shape typical of heteroscedasticity. The right hand
plot shows a more random noise type pattern, indicating homoscedastic residuals.
In this case, the left hand plot refers to a simple linear regression with only one explanatory
variable. There are still patterns in the variance which have not been explained and this is
seen in the residuals.11
The right hand plot includes more variables and there are no discernible patterns within the
variance: these residuals look to be meeting the assumption of homoscedasticity.
3.4.2 WHAT TO DO IF THE RESIDUALS ARE NOT HOMOSCEDASTIC AND WHY DOES IT MATTER
11. Another way to think about this is that the model is only addressing part of the distribution of the response variable.
Some models are more prone to displaying heteroscedasticity, for example if a data set has
extreme values. A model of data collected over a period of time can often have
heteroscedasticity if there is a significant change in the outcome variable from the
beginning to the end of the collection period.
Heteroscedasticity therefore arises in two forms. The model may be correct, but there is a
feature of the data that causes the error terms to have non-constant variance such as a
large range in values. Alternatively, the model may be incorrectly specified so there is some
unexplained variance due to the omission of an important explanatory variable and this
variance is being included in the error terms.
When the problem is the underlying data, the 𝛽 coefficients will be less precise as a result
and the model fit may be overstated.
For an incorrectly specified model, introducing additional explanatory variables may solve
the problem. For an underlying data issue, removing outliers may help, or it may be
appropriate to transform the outcome variable – possibly using a standardised form of the
variable to reduce the range of possible values.
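Where a formal check is wanted, a Breusch-Pagan style test (described further in the footnote to the outliers section) can be run outside SPSS. A minimal sketch in Python using statsmodels, with the hypothetical data frame and column names used in the earlier sketches:

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.diagnostic import het_breuschpagan

    df = pd.read_csv("nw_wards_2001.csv")  # hypothetical file name
    model = smf.ols("llti ~ unemp + female + social_rented + aged60", data=df).fit()

    # Breusch-Pagan test: the null hypothesis is homoscedastic residuals
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
    print(lm_pvalue)  # a small p-value indicates heteroscedasticity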
When two of the explanatory variables in a model are highly correlated (and could therefore
be used to predict one another), we say that they are collinear.
In our model, it may be that these variables are actually representing the same societal
factors which influence rates of illness - we can investigate this by removing one of the
variables and producing an alternative model.
When there are collinear variables, the model can become unstable – this is often indicated
by the standard errors around the estimates of the β coefficients being large, and by the
coefficients being subject to large changes when variables are added to or deleted from the
model. The model cannot distinguish between the strength of the different effects and one
of the assumptions of linear regression is violated.
High pairwise correlation between explanatory variables.
If we refer back to the Pearson correlations that we produced in Table 4, we note that the
unemployment and social tenancy variables were correlated with a Pearson coefficient of
0.797. What is meant by a "high level of correlation" is somewhat subjective; here we apply
a rule of thumb that any correlation over |0.7| is considered high. Where a pair of variables
are highly correlated, it may be worth considering removing one of them from the analysis.
We can remove one of the variables and investigate the effect. Using the same example, we
remove the unemployment variable and check the model fit.
Model Summary
Removing the unemployment variable produces a model that explains 51.5% of the variance
in illness rates. This is 16 percentage points less than when the variable is included, so we can
conclude that this variable is useful for the model – despite being highly correlated with social
tenancy. The parameters of the model are given in Table 13.
Table 13: Model parameters (Coefficients a)
Model                  Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1  (Constant)          -9.127             3.336                            -2.736   .006
   % female            .384               .068         .135                5.648    .000
   % social rented     .203               .007         .683                27.952   .000
   % aged 60 and over  .292               .021         .350                14.165   .000
a. Dependent Variable: % llti
The VIF values can be generated as part of the regression output in the coefficients table –
see section 3.6.2
From the regression dialogue box, select Plots. From here, requesting a scatter plot of the
standardised residuals against the standardised predicted values, together with the
histogram and normal probability plot of the residuals, will provide the three key
visualisations used to assess the assumptions of linear regression as part of the regression
output.
12. See for example Field (2017) or Hair et al. (2010) for discussion of this method.
3.6.2 CALCULATING VARIANCE INFLATION FACTORS
From the regression dialogue box select Statistics to open the dialogue for requesting VIFs.
Tick the Collinearity diagnostics checkbox and exit.
When the regression analysis is run, the VIFs form part of the output. The table below shows
part of our example multiple linear regression output with the additional Collinearity
Statistics columns. We can see in this example that there are no variables which cause concern.
Model                   Unstandardized B   Std. Error   Standardized Beta   t        Sig.   Tolerance   VIF
% aged 60 and over      .336               .017         .404                19.762   .000   .775        1.291
% unemp of econ act.    .774               .035         .664                22.147   .000   .360        2.778
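A minimal sketch in Python of the same calculation using statsmodels (the data frame and column names remain the hypothetical ones used earlier; one VIF is reported per column of the design matrix):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("nw_wards_2001.csv")  # hypothetical file name
    X = sm.add_constant(df[["unemp", "female", "social_rented", "aged60"]])

    # One VIF per column of the design matrix (the constant's VIF is not of interest)
    for i, name in enumerate(X.columns):
        print(name, variance_inflation_factor(X.values, i))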
Select Save from the regression dialogue box. Here we can request that predicted values
and residuals are saved as new variables to the dataset. We can also save the Cook’s
distance for each observation.
In this example, we have saved the unstandardised and standardised residuals and
predicted values, and the Cook’s distance.
New variables are added to the dataset:
pre_1 = unstandardised predicted
res_1 = unstandardised residual
zpr_1 = standardised predicted
zre_1 = standardised residual
coo_1 = Cook’s Distance
Further model specifications save as separate variables with the suffix _2 and so on.
A large residual means that the actual value and that predicted by the regression model are
very different.
Extreme values seen on a scatter plot of residuals suggest that there is a sample unit which
needs to be checked; as a rule of thumb, a standardised residual of magnitude 3 or greater
should be investigated.
When this occurs it is worth considering:
Is the data atypical of the general pattern for this sample unit?
Is there a data entry error?
Is there a substantive reason why this outlier occurs?
Has an important explanatory variable been omitted from the model?
Sometimes in a regression analysis it is sensible to remove such outliers from the data
before refining the model. An outlier will have a disproportionate effect on the estimates
of the β parameters because the least squares method minimises the squared error terms –
and this places more weight on minimising the distance of outliers from the line of best fit.
This in turn can move the line of best fit away from the general pattern of the data.
When an outlier has an influence like this, it is described as having leverage on the
regression line. In this example, in the simple model there are many residuals that have a
magnitude greater than 3. This is further evidence that important explanatory variables
have been omitted. In the multiple regression model there are very few points of concern,
and all of those are only just over the threshold, so there is no need to examine any of the
wards for removal from the analysis.
In our multiple regression model example, if we save the Cook's distances and visualise
them by ward (as in Figure 10) we can see that there are several values that breach the
threshold (which is typically 3 times the mean of the Cook's distances, in this case marked
as a horizontal line at around y = 0.004). Two cases in particular have very high Cook's
distances; these may be worth investigating as outliers.13
13. The Breusch-Pagan test is a further analysis where the outcome variable is the squared residual. The explanatory variables are the same as for the model in question. This regression generates a test statistic for a χ² test where the null hypothesis is homoscedasticity. This test is not available through the menu interface in SPSS but can be run using a readily available macro. The technical details of the test and the method for executing it through SPSS are beyond the scope of this primer. Note that a function exists within both Python and R for automating the test.
Figure 10 Cook's distance by ward code
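A minimal sketch in Python of obtaining Cook's distances and applying the three-times-the-mean rule of thumb described above (statsmodels is assumed; the data frame and column names are the hypothetical ones used earlier):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("nw_wards_2001.csv")  # hypothetical file name
    model = smf.ols("llti ~ unemp + female + social_rented + aged60", data=df).fit()

    # Cook's distance for every ward, plus the rule-of-thumb threshold used in the text
    cooks_d, _ = model.get_influence().cooks_distance
    threshold = 3 * cooks_d.mean()
    print(np.where(cooks_d > threshold)[0])  # row indices of wards worth a closer look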
Up to this point, our models have included only continuous variables. A continuous variable
is numeric, and can take any value. In our examples, the value has had a minimum of zero
but actually, mathematically, it wouldn’t have mattered if the values extended into negative
numbers – although this would not have made sense in the real world.
A nominal or unordered categorical variable is one where the possible values are separate
categories but are not in any order.
Consider a survey that asks for a participant’s gender, and codes the answers as follows:
1. Male
2. Female
3. Trans gender
4. Non binary
Each case within the data would have a numerical value for gender. If we were to use this
number within a linear regression model, it would treat the value for gender of a non-binary
respondent as four times the value for gender of a male. This doesn’t make sense and we
could have listed the answers in any order resulting in them being assigned a different
number within the dataset; the numerical codes are arbitrary.
The variable is not continuous but our theory may still be that the outcome variable is
affected by gender so we want to include it in the model. To do this we construct a series of
dummy variables. Dummy variables are binary variables constructed out of particular values
of a nominal variable.
We construct one dummy variable for each category except the reference category (here,
male): D_female, D_trans and D_nb. This means that when the value of all of the dummy
variables is zero, the prediction we make using the regression equation is for a male. Table 13
shows the values for the three new dummy variables against the original question for gender:
Gender            D_female   D_trans   D_nb
1. Male           0          0         0
2. Female         1          0         0
3. Trans gender   0          1         0
4. Non binary     0          0         1
If D_female = 1, and all other dummies are zero, then we are predicting for a female. If
D_trans = 1 and all other dummies are zero, we are predicting for a transgender person and
so on.
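A minimal sketch in Python of building these dummy variables with pandas (the gender column and its labels are hypothetical, mirroring the coding above; male is treated as the reference category):

    import pandas as pd

    # Hypothetical responses coded with the category labels used in the text
    df = pd.DataFrame({"gender": ["Male", "Female", "Trans gender", "Non binary", "Female"]})

    # One 0/1 column per category, then drop the reference category (Male)
    dummies = pd.get_dummies(df["gender"], prefix="D").drop(columns=["D_Male"])
    df = pd.concat([df, dummies], axis=1)
    print(df)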
Remembering the earlier model for exam results, if we had a theory that gender could also
be used to predict the age 16 results, we might include it as follows:
exam16i = β0 + β1D_femalei + β2D_transi + β3D_nbi + β4exam11i + εi
For a male, the equation reduces to:
exam16i = β0 + β4(exam11i) + εi
because all of the dummy variables take a value of zero.
For a female:
𝑒𝑥𝑎𝑚16𝑖 = 𝛽0 + 𝛽1 + 𝛽4 (𝑒𝑥𝑎𝑚11) + 𝜀𝑖
because D_nb and D_trans are equal to zero, and D_female is equal to 1.
Going back to our sample of exam results, let’s say that we know the sex of the students.
For this example, we will assume that sex is binary and we have only males and females in
the sample.
We are trying to predict the age 16 scores, using the age 11 scores and the sex of the
student. There are four possible outcomes for our modelling work.
Our model for scenario A is the same as in the earlier section on simple linear regression:
𝑒𝑥𝑎𝑚16𝑖 = 𝛽0 + 𝛽1 𝑒𝑥𝑎𝑚11𝑖 + 𝑒𝑖
There is no difference between boys and girls so there is no term for sex in the equation.
4.2.2 SCENARIO B: DIFFERENT INTERCEPT, SAME SLOPE
Here the relationship between exam16 and exam11 has a different intercept for boys than
girls but the nature of the relationship (the slope) is the same for boys and for girls. This
means that boys on average do differently to girls at age 11 and age 16, but the change in
the scores between the two ages is the same regardless of sex.
In scenario (b) the slopes are the same but there is an overall difference in the average
exam scores. We need a dummy variable to represent sex – let's say that Sex = 0 for a
male and Sex = 1 for a female.
𝑒𝑥𝑎𝑚16𝑖 = 𝛽0 + 𝛽1 𝑒𝑥𝑎𝑚11𝑖 + 𝛽2 𝑆𝑒𝑥𝑖 + 𝑒𝑖
There are two separate lines for girls and boys, but they are parallel.
4.2.3 SCENARIO C: DIFFERENT INTERCEPT, DIFFERENT SLOPES
In scenario (c) both the intercept and the slope differ between boys and girls. For every case,
we multiply the exam11 score by the Sex dummy variable and compute this into a new
variable, here called exam11Sex. The model then includes this interaction term:
𝑒𝑥𝑎𝑚16𝑖 = 𝛽0 + 𝛽1 𝑒𝑥𝑎𝑚11𝑖 + 𝛽2 𝑆𝑒𝑥𝑖 + 𝛽3 𝑒𝑥𝑎𝑚11𝑆𝑒𝑥𝑖 + 𝑒𝑖
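A minimal sketch in Python of computing the interaction variable and fitting this model (the file and the exam11, exam16 and sex column names are hypothetical; sex is coded 0 for a male and 1 for a female as above):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("exam_scores.csv")         # hypothetical file name
    df["exam11Sex"] = df["exam11"] * df["sex"]  # interaction: exam11 score multiplied by the sex dummy

    # Different intercept and different slope for the two groups (scenario C)
    model_c = smf.ols("exam16 ~ exam11 + sex + exam11Sex", data=df).fit()
    print(model_c.params)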
Figure 13 Different slope, different intercept
In scenario (d) we have different slopes, but the same intercept for the two sexes14. The
equation for the line is the same as scenario c, but 𝛽2 is zero so the model equation
collapses to:
𝑒𝑥𝑎𝑚16𝑖 = 𝛽0 + 𝛽1 𝑒𝑥𝑎𝑚11𝑖 + 𝛽3 𝑒𝑥𝑎𝑚11𝑆𝑒𝑥𝑖 + 𝑒𝑖
14. Note that this is a theoretical possibility. In practice, this will rarely happen and when building models one should by default include all the main effects for all of the variables in an interaction term, as this improves model stability.
4.3 TRANSFORMING A VARIABLE
The distribution of income is often subject to significant skew and is bounded at zero. This
is because a few people earn a very high salary and it is not possible to have a negative
wage. Figure 15 shows a histogram of hourly pay with significant positive skew on the left
hand side, and the result of taking the log of this variable as a histogram on the right hand
side. We can see that by taking the natural log of the hourly wage, the distribution becomes
closer to normal.
Figure 15 Histograms of hourly pay (left) and log of hourly pay (right)
The SPSS menu and dialogue boxes for transforming variables are shown in section 4.5.
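Outside SPSS the same transformation is a single line. A minimal sketch in Python (the file and the pay column are hypothetical; values must be positive for the log to be defined):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("wages.csv")        # hypothetical file with an hourly 'pay' column
    df["log_pay"] = np.log(df["pay"])    # natural log reduces the positive skew

    print(df[["pay", "log_pay"]].describe())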
The default method within SPSS linear regression is the enter method.
In the enter method, a substantive theory based model is built, including all explanatory
variables considered relevant based on the research question, previous research, real-world
understanding and the availability of data.
When there are a large number of explanatory variables, we might use statistical criteria to
decide which variables to include in the model and produce the “best” equation to predict
the response variable.
Two examples of such selection methods are discussed here: backwards elimination and
stepwise selection. Even with these automatic methods, inclusion of many variables without
a robust theory underlying why we think they may be related risks building spurious
relationships into our model. We may build a good predictive model, but if it is based
upon spurious correlations, we do not learn anything about the problem our research is
trying to address.
4.4.2 STEPWISE
This is more or less the reverse of backward elimination, in that we start with no
explanatory variables in the model, and then build the model up, step-by-step. We begin by
including the variable most highly correlated to the response variable in the model. Then
include the next most correlated variable, allowing for the first explanatory variable in the
model, and keep adding explanatory variables until no further variables are significant. In
this approach, it is possible to delete a variable that has been included at an earlier step but
is no longer significant, given the explanatory variables that were added later. If we ignore
this possibility, and do not allow any variables that have already been added to the model to
be deleted, this model building procedure is called forward selection.
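A minimal sketch in Python of forward selection as just described – adding the most significant remaining variable at each step and stopping when nothing left is significant (the data frame and column names are the hypothetical ones used earlier; this is an illustration, not the exact algorithm SPSS implements):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("nw_wards_2001.csv")  # hypothetical file name
    response = "llti"
    candidates = ["unemp", "female", "social_rented", "aged60"]
    selected = []

    while candidates:
        # p-value of each candidate when added to the variables already selected
        pvals = {}
        for var in candidates:
            formula = f"{response} ~ {' + '.join(selected + [var])}"
            pvals[var] = smf.ols(formula, data=df).fit().pvalues[var]

        best = min(pvals, key=pvals.get)
        if pvals[best] >= 0.05:   # stop when no remaining variable is significant
            break
        selected.append(best)
        candidates.remove(best)

    print(selected)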
Here you can select the variable to recode, and specify the name and label of the new
'output variable'. Then click on Change to see the variable within the Numeric Variable ->
Output Variable box.
Click on Old and new Values to open the next dialogue box for specifying the recode. In this
example, we have selected sex and are recoding into a dummy variable called “Female”.
The previous and new codings are shown in Table 15.
Specify each old and new value and then click Add to generate the list of recodings. In this
dataset, the variable sex was binary and so only a few lines of recoding are needed (see
below) but a variable with more categories would need many values recoding to zero, and
multiple dummies. Also adding a recode of System Missing to System Missing ensures that
values coded as missing within the data retain that coding.
4.5.2 COMPUTING A NEW VARIABLE
New variables can be computed via the Transform > Compute Variable… menu path.
To compute a quadratic term – here age squared:
To save standardised versions of a variable, go to Descriptives and select the check box.
The resulting dataset will look like this – we now have three original variables and four
computed variables displayed in the Variables viewer.
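A minimal sketch in Python of the equivalent computed variables (pandas is assumed; the file and the age column are hypothetical; the z-score corresponds to the standardised version of the variable that SPSS saves from the Descriptives dialog):

    import pandas as pd

    df = pd.read_csv("wages.csv")        # hypothetical file with an 'age' column

    df["agesq"] = df["age"] ** 2                                   # quadratic term
    df["Zage"] = (df["age"] - df["age"].mean()) / df["age"].std()  # standardised (z-score) version

    print(df[["age", "agesq", "Zage"]].head())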
5 FURTHER READING
A number of excellent texts have been written with significantly more technical detail and
worked examples, a selection of which are listed below. Field (2017) is available in an SPSS
version and also in a version for R (a free to use, open source data analysis program widely
used in academia and the public and private sectors).
Bryman, A., Cramer, D., 1994. Quantitative Data Analysis for Social Scientists. Routledge.
Dobson, A.J., 2010. An Introduction to Generalized Linear Models, Second Edition. Taylor &
Francis.
Field, A., 2017. Discovering Statistics Using IBM SPSS Statistics. SAGE.
Hair, J.F., Anderson, R.E., Babin, B.J. and Black, W.C., 2010. Multivariate data analysis: A
global perspective (Vol. 7).
Howell, D.C., 2012. Statistical Methods for Psychology. Cengage Learning.
Hutcheson, G.D., 1999. The Multivariate Social Scientist: Introductory Statistics Using
Generalized Linear Models. SAGE.
Linneman, T.J., 2011. Social Statistics: The Basics and Beyond. Taylor & Francis.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, Second Edition. CRC Press.
Plewis, I., Everitt, B., 1997. Statistics in Education. Arnold.
6 APPENDIX A: CORRELATION, COVARIANCE AND PARAMETER ESTIMATION
The variances of X and Y, and their covariance, are calculated from the sample as:

Var(X) = Σ (xi − x̄)² / (n − 1)

Var(Y) = Σ (yi − ȳ)² / (n − 1)

Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

Notice that the correlation coefficient is a function of the variances of the two variables of
interest, and their covariance:

r = Cov(X, Y) / √(Var(X) Var(Y))

In a simple linear regression analysis, we estimate the intercept, β0, and slope of the line, β1,
as:

β1 = Cov(X, Y) / Var(X)

β0 = ȳ − β1 x̄
7 GLOSSARY
collinear When one variable can be used to predict another. When two
variables are closely linearly associated.
explanatory variable The variables which we use to predict the outcome variable.
These variables are also referred to as independent or the X
variable(s).
homoscedastic One of the key assumptions for a linear regression model. If
residuals are homoscedastic, they have constant variance
regardless of any explanatory variables.
linear regression A method where a line of best fit is estimated by minimising
the sum of the square of the differences between the actual
and predicted observations.
multicollinearity When two or more variables are closely linearly associated or
can be used to predict each other.
multiple linear regression Linear regression with more than one explanatory variable.
negative correlation When an increase in one variable is associated with a decrease in
the other; the correlation coefficient is less than zero.
ordinal variable A variable where the responses are categories, which can be
put in an order. For example, the highest level of education
achieved by a respondent. Remember that the possible
responses may not be evenly spaced.