
5. STATISTICAL MODELS

This section will be dedicated to reviewing some of the most popular statistical
modelling techniques.

5.1. Regression Models

Regression is a technique that may be used to investigate the effect of one or more predictor (independent) variables on an outcome (dependent) variable, which in linear regression needs to be a continuous variable.

In this section we look into linear regression models. We start by discussing multiple linear regression models, in which all predictors are quantitative (covariates). We then consider two special adaptations of the multiple linear regression model, ANOVA and ANCOVA, where for the former all predictors are qualitative variables (factors), whereas the latter admits both qualitative (factors) and quantitative predictors (covariates) in the model.

5.1.1. Multiple Linear Regression

As its name implies, in a linear regression model the relation of the response
variable to the explanatory variables is assumed to be a linear function of some
parameters. As mentioned earlier, the response variable should be a continuous
variable and the explanatory variables should be quantitative variables: discrete
or continuous.

More specifically, in multiple linear regression analysis, for any sample of size n,
we assume a relation of the type:

Yi = E YiX i1,..., X ip  + εi = β0 + β1 X i1 + ⋯ + β p X ip + εi for i = 1,…, n

where Yi denotes the response or dependent variable for the ith entity (or ith observation in the sample) and the Xij's denote the explanatory or independent variables (predictors) for the ith entity. The βj's, j = 0, ..., p, are the unknown parameters to be estimated and the εi's are the error terms. A fundamental assumption in classical multiple linear regression theory is that the error terms {εi}, i = 1, ..., n, are independent and normally distributed with mean zero and unknown standard deviation σ.

By applying a regression estimator to a data set we obtain the parameter estimates (or regression estimates) β̂0, ..., β̂p, the fitted values ŷi and the residuals ri = yi − ŷi for i = 1, ..., n.

One of the most popular estimators is the ordinary least squares (OLS) estimator
which will be the protagonist of this section.

When conducting regression analysis it is important to investigate the presence of multicollinearity in the data using multicollinearity diagnostics. The data is said to be characterized by multicollinearity if dependencies exist between the explanatory variables. It has been well documented that the presence of multicollinearity exhibits undesirable effects on the OLS regression estimate such as:

• the OLS regression parameter estimates are not unique - infinite solutions
all yielding the same fitted values
• it tends to produce models that fit the sampled data perfectly but fail to
predict new data well, a phenomenon known as over-fitting
• the variance of the OLS regression estimates may be unacceptably large
• instability of the OLS estimates
• conflicting conclusions from usual significance tests and the possibility
that the OLS coefficient estimates exhibit different algebraic signs to what
is expected from theoretical considerations.

Multicollinearity is not the only characteristic that has a negative effect on the OLS estimator. Outliers present in the data can also adversely affect the OLS estimator. An essential step in regression analysis is therefore to identify outliers and assess their influence before deciding whether they should be removed from the data set.

In this section we shall see how one can fit a multiple linear regression model
using SPSS and R. We shall also discuss multicollinearity diagnostics and outlier
diagnostics.

In order to fit a multiple linear regression model, the data to be analysed needs to
satisfy the following assumptions:

1. Response variable and explanatory variables are covariates.

2. A linear relationship exists between the dependent variable and each of the
independent variables - this can be checked by visually inspecting
scatterplots of the dependent variable with each one of the independent
variables and by issuing correlation coefficients for each pair - confirmed
by running a correlation analysis as explained earlier in Chapter 4. In fact
regression analysis is always preceded by correlation analysis.

3. There must be no multicollinearity - presence of multicollinearity is detected by inspecting a number of multicollinearity diagnostics.

4. Residuals should be independent of each other - this is checked by verifying that the residuals resulting after fitting the model are uncorrelated, using the Durbin-Watson statistic.

5. There must be no influential outliers - identified by applying outlier diagnostics after fitting the model.

6. Residuals must follow a normal distribution – checked by means of the Shapiro-Wilk test on the residuals.

7. Residuals must be homoscedastic – the variance is constant across observations. This is checked using a scatter plot of residuals versus fitted values.

MULTICOLLINEARITY DIAGNOSTICS

There is no single diagnostic for identifying the presence of multicollinearity. In the literature it is usually suggested to consider a collection from the many existing diagnostics. Here we shall consider three of the most popular diagnostics, namely: pairwise correlations, condition indices and variance proportions. A brief description of each will be presented next. The main reference for this section is Myers (1990).

1. Pairwise Correlations: The first step in identifying the presence of collinearity in the data is to look at the pairwise correlations between the explanatory variables. Here we can use either the Pearson correlation or the Spearman correlation, depending on the nature of the joint distribution of the two variables (see Section 4.5.1).

2. Condition Indices (CIs): The condition indices are the ratios of the maximum eigenvalue of the sample correlation matrix of the explanatory (or independent) variables, which we shall denote by R, to each of the other eigenvalues¹. So there are as many condition indices as there are eigenvalues, and they are typically reported in ascending order. The jth condition index is defined by

CIj(R) = λmax / λj

where λmax and λj correspond to the largest and jth eigenvalues of R, respectively. The last condition index, which corresponds to the ratio of the largest to the smallest eigenvalue of the sample correlation matrix, is usually considered as a diagnostic on its own and is known as the condition number.

Belsley et al. (1980) observe that the number of large (> 5) condition indices corresponds to the number of near dependencies in the columns of the data matrix X. Values between 5 and 10 indicate the presence of weak dependencies while values greater than 30 indicate the presence of moderate or strong relations between the explanatory variables. A condition number which is much greater than 30 is an indication of serious multicollinearity, while values between 5 and 10 indicate that weak dependencies may induce some effect on the regression estimates.

¹ For interested users not familiar with such mathematical concepts, more details can be found in linear algebra textbooks.

3. Variance Proportions: The variance proportions of the explanatory variables measure the proportion of the variance of each regression coefficient explained by each of the eigenvalues of the sample correlation matrix R. A large variance proportion for a variable means that the variable has a linear dependency with other explanatory variables. As a general rule of thumb, Belsley et al. (2004) suggest that if the variance proportion is greater than 0.5 then the variables in question are characterised by dependency.

If multicollinearity is present one has two possibilities:

• From a set of correlated variables retain the one which has the largest
correlation with the dependent variable and remove the rest from the
model.

• Use an alternative estimator to the OLS such as Ridge, LASSO, PCR and
PLS to mention a few. (Such methods are only for more advanced students
and will not be covered here.)

OUTLIER DETECTION METHODS

Regression outlier diagnostics are statistics that are used to identify any
observations that might have a large influence on the estimated regression
coefficients. Observations that exert such influence are called influential points.

The field of regression diagnostics consists of a combination of numerical and graphical procedures based on an initial fit of the regression model. Outlier diagnostics should never be considered on their own but collectively: a point is considered to be an outlier if it is flagged as an outlier by more than one diagnostic (a compact R sketch implementing this rule follows the list below). Here we shall consider the four most commonly used diagnostic methods.

1. Leverage values: Used to detect leverage points, that is, outliers in the explanatory variables: cases for which the vector of observations on the explanatory variables lies far away from the bulk of the observed vectors of explanatory variables in the dataset. As a general rule, cases for which the leverage values are greater than 2p/n, where p is the number of parameters in the model and n is the number of observations, are considered to be potentially influential points.

2. Squared Mahalanobis distance: Like the leverage values, this measure is used to detect outliers in the explanatory variables. As a general rule, the squared Mahalanobis distances (MD²) are compared with the 95th percentile of the chi-squared distribution with p − 1 degrees of freedom, where p is the number of parameters in the model. Cases for which MDi² > χ²(0.95, p−1) are flagged as potential outliers.

3. Cook's distance: Cook's distance is a statistic which provides an indication of how much influence a single case has on the regression model. As a general rule, cases with a Cook's distance greater than one should be flagged as potential outliers and investigated further.

4. Studentized Residuals: The studentized residuals can be considered as measures of the vertical displacement of each data point from the line of best fit. Data points with a studentized residual beyond ±2 are considered to be outliers.
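As a compact illustration of the "flagged by more than one diagnostic" rule, the following R sketch (not taken from the text) computes all four cutoffs for a fitted model; it anticipates the mtcars model with wt and qsec as predictors that is analysed in the example below.

# Minimal sketch: flag observations picked up by more than one diagnostic
library(MASS)                                   # for studres()
fit <- lm(mpg ~ wt + qsec, data = mtcars)       # the model fitted later in this section
x   <- mtcars[, c("wt", "qsec")]                # explanatory variables
p   <- length(coef(fit))                        # number of parameters (incl. intercept)
n   <- nrow(mtcars)
flags <- cbind(
  leverage = hatvalues(fit) > 2 * p / n,
  mahal    = mahalanobis(x, colMeans(x), cov(x)) > qchisq(0.95, p - 1),
  cook     = cooks.distance(fit) > 1,
  studres  = abs(studres(fit)) > 2
)
which(rowSums(flags) > 1)                       # flagged by more than one diagnostic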

Example: Consider the data set "mtcars" available in the R environment which
was taken from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32
automobiles (1973–74 models). This data set consists of 32 observations on 11
variables. In the table below one finds more information about the variables in
this data set.

Column Number Variable Name Description


[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Data source: Henderson and Velleman (1981), Building multiple regression models
interactively. Biometrics, 37, 391–411.

The goal here shall be to apply regression analysis to establish if there is a relationship between "mpg" as a response variable and "disp", "hp", "drat", "wt" and "qsec" as predictor variables.

We start the analysis by checking if the first three of the assumptions underlying
the multiple regression model, listed earlier, are satisfied.

Figure 5.1.1 : Scatter plots

The response variable and all explanatory variables are clearly covariates. By looking at the scatter plots in Figure 5.1.1 we note that these plots suggest that a linear relationship exists between mpg and each of the predictor variables. To confirm this one needs to consider correlation coefficients (refer to Section 4.5.1).
To decide which correlation coefficient should be used, bivariate normality was tested in R (by means of the command mvnorm.etest from the package energy) for each possible pair of covariates. The bivariate normality assumption was never satisfied, hence Spearman correlation coefficients will be considered here.
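For illustration, a call of the following form could be used for one pair of variables (a sketch; the number of bootstrap replicates R is chosen by the user and is not stated in the text):

library(energy)
# E-statistic test of bivariate normality for the pair (mpg, wt)
mvnorm.etest(mtcars[, c("mpg", "wt")], R = 199)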

Table 5.1.1 : Spearman correlations between all pairs of covariates

From Table 5.1.1 we note that qsec is positively correlated with mpg while all
other predictor variables are strongly negatively correlated with mpg. Thus, we
can conclude that all the independent variables are linearly correlated with mpg
(our response variable).

Next, we shall check for the presence of multicollinearity. Pairwise correlation coefficients between the independent variables have already been issued in Table 5.1.1. Here we note that most of the independent variables are strongly correlated with each other, with the exception of qsec, which is not correlated with wt and drat. These values indicate that multicollinearity is present in the data. To further confirm the presence of multicollinearity we will next consider condition indices, the condition number and variance proportions.

In SPSS

In SPSS condition indices and the condition number are given as part of the
output produced when fitting a multiple linear regression (MLR) model. So go
to Analyze-Regression-Linear. Move mpg in the Dependent Variable field and
move disp, hp, drat, wt and qsec under the Block 1 of 1 field; as shown in Figure
5.1.2.

Figure 5.1.2

Then click on the tab Statistics and select Collinearity diagnostics.

Figure 5.1.3

Press Continue then OK. The following is part of the output obtained:

Table 5.1.2 : Multicollinearity Diagnostics

From Table 5.1.2 we note that the last four condition indices (9.946, 21.905, 32.009, 71.166) are greater than 5, indicating that there are 4 near dependencies in the columns of the data matrix. The condition number (the largest condition index) is equal to 71.166, which is greater than 30.

This indicates the presence of strong dependencies, which is not surprising since,
when inspecting the Spearman correlation coefficients, we saw that, for example,
disp is strongly correlated with drat and wt.

From Table 5.1.2 we note that the last three eigenvalues are also showing serious
or near serious dependencies, since their values are very close to 0. We then
consider the variance proportions corresponding to these three eigenvalues. By
applying the 0.5 bench mark suggested in the literature, we note that the 4th
eigenvalue (0.012) represents a dependency which is causing damage to the
parameter estimate for disp (0.77 > 0.5), while the 5th eigenvalue (0.005)
represents a dependency which is causing damage to the parameter estimate for
drat (0.61 > 0.5). The 6th eigenvalue (0.001) represents a dependency which is
causing damage to the parameter estimate for qsec (0.83 > 0.5) and the constant
term (0.97 > 0.5).

In R/RStudio

In R/ RStudio the values presented in Table 5.1.2 are computed by means of the
command colldiag in the package perturb. Full commands are given below:

rm(list=ls(all=TRUE))   # this clears all existing variables

data(mtcars)    # load data
names(mtcars)   # list the variables in my data

## Condition indices
## Multicollinearity diagnostic - condition number and condition
## indices: 5-10 weak dependencies, >30 severe multicollinearity

library(perturb)   # provides colldiag()
colldiag(as.matrix(mtcars[,3:7]))

The following output is obtained:

Condition
Index Variance Decomposition Proportions
intercept disp hp drat wt qsec
1 1.000 0.000 0.001 0.001 0.000 0.000 0.000
2 4.393 0.000 0.027 0.021 0.007 0.001 0.002
3 9.946 0.000 0.053 0.330 0.012 0.037 0.003
4 21.905 0.001 0.772 0.182 0.102 0.392 0.007
5 32.009 0.032 0.012 0.142 0.606 0.449 0.159
6 71.166 0.966 0.136 0.324 0.272 0.120 0.829

As you can note, the values are exactly the same as those shown in Table 5.1.2
and hence the same interpretation holds.

After considering all the multicollinearity diagnostics we can conclude that we cannot fit an MLR model containing all the proposed predictors. Since qsec is not correlated with wt and drat, one can fit two different MLR models: one with mpg as response and qsec and wt as predictors, and the other with mpg as response and qsec and drat as predictors.

Next we shall fit a MLR with mpg as response and wt and qsec as predictors.

In SPSS

Go to Analyze-Regression-Linear. Move mpg in the Dependent Variable field and move wt and qsec under the Block 1 of 1 field. From the tab labelled Statistics tick: Estimates, Model fit and Durbin-Watson. Press Continue.

Figure 5.1.4

From the tab labelled Save tick the boxes which are ticked in Figure 5.1.5 below.
Press Continue.

Figure 5.1.4

In the tab labelled Plots move the ZRESID (standardized residuals) in the Y-axis
and ZPRED (standardized predictors) in the X-axis. This will produce the scatter
plot needed to check if assumption 7 is satisfied.

Figure 5.1.5

Press Continue then OK.

The following outputs are obtained:

Table 5.1.3

In the column labelled “R” one finds the value of the multiple correlation
coefficient which provides a measure of the predictive ability of the model (that
is, how well it can estimate the dependent variable). The closer this value is to 1
the better the prediction. In this example R=0.909 which indicates a very good
level of prediction. In the column labelled “R Square” one finds the value of the
coefficient of determination, which measures the proportion of the variation in
the dependent variable that can be accounted for by the variables included in the
regression model. The closer this value is to 1 the better. You can see from our
value of 0.826 that our independent variables explain 82.6% of the variability of
our dependent variable.

As we have seen earlier on, one of the assumptions of regression is that the
residuals are independent (assumption 4). This assumption is tested by the
Durbin-Watson statistic. A rule of thumb is that test statistic values which lie
outside the range of 1.5 to 2.5 could be cause for concern. Some authors suggest
that values under 1 or more than 3 are a definite cause for concern. In this
example, the value of the Durbin-Watson statistic is 1.496 which is approximately
equal to 1.5 hence we can consider the assumption of independent residuals to be
satisfied.

The Adjusted R Square is typically used when comparing models having different numbers of predictor variables. In comparing the goodness of fit of models with different numbers of explanatory variables, the adjusted R square tries to "adjust" for the unequal number of variables in the different models. The model having the largest adjusted R square is considered to be the best from the available models. We shall give an example of the use of this statistic later on in this section.

The next part of the output contains an ANOVA (analysis of variance) table. This
table tests the following hypothesis

H0: Model with only a constant term (intercept only model) is a good fit for
the data
H1: Model fitted (which includes covariates) fits better than the model with
only the intercept term

Table 5.1.4

From Table 5.1.4 we note that the p-value (sig) is less than 0.05 which shows that
the model with qsec and wt as independent variables is an adequate model.

The output that follows is concerned with the estimation of the parameters of the
model.

Table 5.1.5

The general regression model being considered in this example is:

Y = β0 + β1X1 + β2X2 + ε

where Y = mpg, X1 = wt and X2 = qsec. In the column labelled “B” one finds the estimates for the β's. In the column labelled “Sig” one finds the p-values for the
following hypothesis tests:
H 0 : βi = 0
H1 : βi ≠ 0

Since all the p-values are less than 0.05 we reject H0 in all three cases and hence
the fitted model is the following:

Y = 19.746 − 5.048X1 + 0.929X2.

This regression model can be used to estimate unknown values of Y. For example, the expected mpg of a car with wt = 3 and qsec = 16 is Y = 19.746 − 5.048(3) + 0.929(16), which is approximately 19.47.

The next output contains information on outlier diagnostics.

Table 5.1.6

Table 5.1.6 provides some descriptive statistics for a series of outlier detection methods. We shall focus only on those discussed earlier on in this section. Starting with the leverage values (last row of the table), in this model the number of parameters is p = 3 and the sample size is n = 32, hence the cutoff value for identifying potential influential points is 2(3)/32 ≈ 0.19. From Table 5.1.6 we note that the maximum leverage value is 0.264 which is greater than the cutoff value, thus indicating the presence of potential influential points. To identify these points we need to go to the data file, where the leverage values have been saved, sort the column containing the leverages (labelled LEV_1) in ascending order (this will sort the entire data file) and identify those values which are greater than 0.19. As you can see from Figure 5.1.6 there is only one data point for which the leverage value is greater than 0.19. But before we conclude that this is an influential point we need to look at the other outlier diagnostics.

Figure 5.1.6

Next we consider the squared Mahalanobis distance whose descriptive statistics are provided in Table 5.1.6 in the row labelled Mahal. Distance. The cutoff value for this method is MDi² > χ²(0.95, 2) = 5.99. The maximum value is 8.177 which is greater
than 5.99 hence this indicates the presence of potential outliers in the sample.
From the column in the data view, labelled MAH_1, which contains the squared
Mahalanobis distances for our observations, it can be noted that only one sample
point has a Mahalanobis distance greater than 5.99 and it happens to be the same
point that was flagged as a potential outlier by the leverage method. The
descriptive statistics for Cook’s distances show that no data point has a Cook’s
distance greater than 1 hence this method is not identifying any outliers.

The descriptive statistics for the studentized residuals are found in Table 5.1.6 in
the row labelled Stud.Residual and we note that the maximum value is slightly
greater than 2 indicating the presence of potential outliers. Upon inspecting the
studentized residuals in the data view we note that there are three points for which
these residuals are slightly greater than 2.

The only point that has been identified as an outlier by more than one diagnostic is Merc 230 and hence this will be considered an outlier. One way of proceeding
is to remove this data point from the sample and repeat the regression analysis on
the new reduced data set. If the parameter estimates change considerably, then
the point is an influential point and should not be included in the analysis. On the
other hand, if the parameter estimates change only slightly, the point is not
influential and can be retained. This step will be left as an exercise.

The last output shown here is a scatter plot of the standardized residuals versus the standardized fitted (or predicted) values, which is inspected to check that the residuals are indeed homoscedastic, that is, that the variance of the residuals is constant across levels of the predicted values. If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values.

Figure 5.1.7
From Figure 5.1.7 we note that the points seem to be scattered randomly without
any pattern, hence we can assume that the residuals are homoscedastic.

One final goodness of fit check is to verify if the assumption that the residuals
follow a normal distribution is satisfied. This is done by applying the Shapiro-
Wilk test of normality on the standardized or studentized residuals. The p-value
obtained when this test is applied to the studentized residuals is 0.119 > 0.05
hence the assumption of normality is satisfied.

Earlier we had mentioned the Adjusted R square. Next we shall explain how this
statistic is used to choose between competing models.

Figure 5.1.8
Go back to Analyze-Regression-Linear. Previously, in the tab labelled Method, we retained the default selection, which is Enter. This time change it to Stepwise (see Figure 5.1.8).

Two regression models will be fitted; a simple regression model with only wt as
predictor and a multiple linear regression model with wt and qsec as predictors.
From Table 5.1.7 it may be noted that in the model summary table now we have
values for the two models. Note that the Adjusted R square value for the model
with two predictors is larger than that of the model with one predictor hence the
model with two predictors is considered to be a better model for estimating
unknown values of the response. The p-values shown in Table 5.1.5 in fact
showed that both predictors should be retained in the model.

Table 5.1.7

In R/RStudio

Here we will demonstrate how to conduct the regression analysis using R. We shall limit ourselves to discussing the commands needed to obtain outputs similar to those produced in SPSS. For the interpretation of the output we refer the reader to the SPSS section just explained.

Using R to obtain the Model Summary table, the ANOVA table and the parameter estimates table (as in Tables 5.1.3, 5.1.4 and 5.1.5):

Commands:
# Regression Analysis

rm(list=ls(all=TRUE))   # clear all existing variables

data(mtcars)      # load data
names(mtcars)     # list the variables in my data
y <- mtcars[,1]   # response variable: mpg
x <- mtcars[,6:7] # matrix of explanatory variables (wt and qsec only)
model1 <- lm(y ~ ., data = x)
summary(model1)   # gives the model summary and parameter estimates
aov(model1)       # gives the full ANOVA table

Output:

Call:
lm(formula = y ~ ., data = x)

Residuals:
Min 1Q Median 3Q Max
-4.3962 -2.1431 -0.2129 1.4915 5.7486

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.7462 5.2521 3.760 0.000765 ***
wt -5.0480 0.4840 -10.430 2.52e-11 ***
qsec 0.9292 0.2650 3.506 0.001500 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.596 on 29 degrees of freedom


Multiple R-squared: 0.8264, Adjusted R-squared: 0.8144
F-statistic: 69.03 on 2 and 29 DF, p-value: 9.395e-12

Call:
aov(formula = model1)

Terms:
wt qsec Residuals
Sum of Squares 847.7252 82.8583 195.4636
Deg. of Freedom 1 1 29

Residual standard error: 2.596175


Estimated effects may be unbalanced

The last block of the summary() output (Multiple R-squared, Adjusted R-squared and the F-statistic) contains the model summary information (see Table 5.1.3), the aov() output corresponds to the ANOVA table (see Table 5.1.4) and the coefficients table corresponds to the parameter estimates (see Table 5.1.5).

To obtain the predicted values and the residuals using the fitted model:

predict(model1)   # predicted values
residuals(model1) # residuals
library(MASS)
studres(model1)   # studentized residuals
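The fitted model can also be used to predict the response for new predictor values; the following short sketch (not in the original commands) reproduces the hand calculation, given in the SPSS section, for a hypothetical car with wt = 3 and qsec = 16:

# Predicted mpg for wt = 3 (thousand lbs) and qsec = 16 seconds
predict(model1, newdata = data.frame(wt = 3, qsec = 16))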

Goodness of fit tests:


1. Checking that the residuals are independent:

Here we will use the Durbin-Watson test for autocorrelations amongst the
residuals.

H 0 : There is no correlation between the residuals, that is, the residuals are
independent
H1 : There is correlation between the residuals, that is, the residuals are not
independent

The command used is dwtest from the package lmtest:

library(lmtest)
dwtest(model1)

Output :
Durbin-Watson test

data: model1
DW = 1.496, p-value = 0.04929
alternative hypothesis: true autocorrelation is greater than 0

See Table 5.1.3.

2. Checking that the residuals are normally distributed:

Apply Shapiro-Wilk normality test on residuals or studentized residuals.
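For instance, a minimal sketch consistent with the commands above:

library(MASS)                    # for studres()
shapiro.test(studres(model1))    # Shapiro-Wilk test on the studentized residuals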

3. Checking that the residuals have constant variance (homoscedasticity):

One way of investigating whether the residuals have constant variance is by looking at the plot of the residuals vs fitted values. If the residuals are homoscedastic you should see constant variation in the y-axis (residuals).

plot(predict(model1), residuals(model1), main="Plot of Fitted vs Residual")
The following plot is obtained:

Figure 5.1.9

From Figure 5.1.9 it would seem that the variance is constant in the y-axis. However, this kind of assessment is rather subjective, so we may also consider the Breusch-Pagan test. This test is available in R but not in SPSS.

H0: Residuals are homoscedastic (constant variance)
H1: Residuals are heteroscedastic (variance not constant)

In R this test is conducted using the command bptest from the package lmtest.

bptest(model1)
Output:
studentized Breusch-Pagan test
data: model1
BP = 3.0858, df = 2, p-value = 0.2138

Since the resulting p-value > 0.05, we cannot reject the null hypothesis. This means that the homoscedasticity assumption is satisfied, since the residuals can be considered homoscedastic.

OUTLIER DIAGNOSTICS

Mahalanobis Distances:

# Calculate the squared Mahalanobis distances
m_dist <- mahalanobis(x, colMeans(x), cov(x))
m_dist

Identifying the outliers:

p <- 3   # our model includes the intercept term and two predictors

cutof_mah <- qchisq(0.95, p-1, lower.tail = TRUE, log.p = FALSE)   # cutoff value
which(m_dist > cutof_mah)

Merc 230
       9

Only one point which is found in the 9th row of the data set and belongs to the
car model Merc 230 is flagged as a potential outlier.

Leverages:

# Outlier Detection
Leverages <- hatvalues(model1, type='rstandard')
# It is important to set the type of residuals as 'rstandard'
# for results to match those obtained in SPSS

Identifying the outliers:

# By convention, explanatory variables for which leverage > 2p/n,
# where p is the number of parameters in the model and n is the
# number of observations, are considered to be potentially
# influential points. For our model:

p <- 3   # model includes the intercept term and two predictors
n <- length(y)

# Cut off point for leverages is:
cutofflev <- (2*p)/n

which(Leverages > cutofflev)   # gives the outliers

This last command gives the following list of car models and respective row numbers.

Merc 230   Lincoln Continental
       9                    16

These are the points which are flagged as potential outliers.

Cook’s distances:

# Cook's distance
Cook <- cooks.distance(model1, type='rstandard')
# It is important to set the type of residuals as 'rstandard'
# for results to match those obtained in SPSS

Identifying the outliers:

which(Cook >= 1)   # identifying the outliers, cutoff = 1

named integer(0)

No outlier was identified by this method.
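Studentized residuals, the fourth diagnostic listed earlier, can be checked in the same way (a short sketch, not part of the original commands; it uses studres from MASS as above):

library(MASS)              # for studres()
stud <- studres(model1)    # studentized residuals
which(abs(stud) > 2)       # flag observations beyond +/- 2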

5.2 General Linear Models

In this section we shall look at ANOVA and ANCOVA models done through
both SPSS and R. This is an extension of multiple regression, which involves
having both factors and covariates as predictors. The important question which
arises at this stage is, how are factors represented within a regression equation?
Suppose we have one factor, say gender, with the two main levels being male and
female, in addition to another p covariates. Then we explain gender by what we
call a dummy variable regressor G where G = 1 if the gender is male and G = 0 if
female. Then the regression equation becomes as follows:

Yi = E Yi | X i1,..., X ip  + εi = β0 + β1 X i1 + ... + β p X ip + γG + εi for i = 1,…, n


Suppose now that the factor has more than two levels. To keep it simple, let us
assume a factor has 3 levels. Consider the factor ‘age groups’ which we divide
into 3 levels: 18-34, 35-59, 60+. Then we can explain this factor by two dummy
variables as follows:

          A1   A2
18-34      1    0
35-59      0    1
60+        0    0

and the regression equation now becomes:

Yi = E Yi | X i1,..., X ip  + εi = β0 + β1 X i1 + ... + β p X ip + γ1 A1 + γ2 A2 + εi for i = 1,…, n

In a similar manner, factors with k levels can be represented by k − 1 dummy variables. Furthermore, there is no limit to how many factors a regression equation can include, as long as the sample size allows it.
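As an illustration of this coding (a sketch with hypothetical data, not taken from the text), R constructs the k − 1 dummy regressors automatically when a factor enters a model formula:

# Hypothetical 3-level age-group factor; model.matrix() shows the
# k - 1 = 2 dummy regressors R would use, with "18-34" as reference.
agegrp <- factor(c("18-34", "35-59", "60+", "35-59", "18-34"))
model.matrix(~ agegrp)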

Apart from that, we can also include interactions of factors. The following is an
example of a regression equation which, apart from the covariates, includes both
gender and age groups as main effects, and the interactions between the two:

Yi = E Yi | X i1,..., X ip  + εi
= β0 + β1 X i1 + ... + β p X ip + γG G + γ1A A1 + γ1A A2 + γ1GAGA1 + γ2GAGA2 + εi for i = 1,…, n

When analysing data using ANOVA and ANCOVA, the following assumptions
must be made:

Assumption 1: The dependent variable is of a quantitative nature.

Assumption 2: The observations must be independent, both within groups and between groups – in other words, no two observations should come from the same participant, whether within the same group or in different groups.

Assumption 3: The residuals (εi) must be normally distributed and, furthermore, their variance must not be affected by either the covariate/s or the groupings of the factor/s.

Assumption 4: The covariates should be linearly related to the dependent variable at each level of each factor.

Assumption 5: The slope of the dependent variable versus any covariate must be homogeneous across all levels of all factors.

In SPSS

The file nachr.sav contains data for 25 individuals regarding brain density of
nicotinic acetylcholine receptors (nAChR) and age, the presence (or otherwise)
of schizophrenia and the presence (or otherwise) of a smoking habit.

We start by plotting a box plot of nAChR with respect to the factors, and a scatter
plot of age and nAChr (see Figure 5.2.1 and Figure 5.2.2)

Figure 5.2.1

In the box plot we see that there is a larger difference in the location of nAChR for smokers/non-smokers than there is for the schizophrenia category. The smokers have a higher median than the non-smokers, while people with schizophrenia have a slightly lower median than people who do not.

Figure 5.2.2

In the scatter plot, on the other hand, we see that nAChR and age are negatively correlated. However, it must be noted that here we are looking at these relationships individually and not simultaneously, so these conclusions are not definitive.

We next move on to look at multicollinearity diagnostics but first we need to create the dummy variables. Go to Transform, Create Dummy Variables and then put the ‘Schizophrenia’ and ‘Smoker’ variables in the ‘Create Dummy Variables’ box. Furthermore, we select ‘Create main-effect dummies’ and enter a variable name for each variable, then press OK. All of this is illustrated in Figure 5.2.3.

Figure 5.2.3

Correlations (Spearman's rho)
                                             Age     Schizophrenia=No   Smoker=No
Age               Correlation Coefficient    1.000   .157               .393
                  Sig. (2-tailed)            .       .463               .057
                  N                          24      24                 24
Schizophrenia=No  Correlation Coefficient    .157    1.000              .510*
                  Sig. (2-tailed)            .463    .                  .011
                  N                          24      24                 24
Smoker=No         Correlation Coefficient    .393    .510*              1.000
                  Sig. (2-tailed)            .057    .011               .
                  N                          24      24                 24
*. Correlation is significant at the 0.05 level (2-tailed).

Table 5.2.1

In a similar way to Section 5.1, we now need to perform the multicollinearity diagnostics. Extracting the table of pairwise Spearman correlations shows that the correlation between the two dummy variables is the highest, with a value of 0.51, and is significantly different from zero at the 0.05 level. The dummy variable for smoker and age also show a correlation of 0.393. Table 5.2.2 shows that the highest condition index is 10.56. In Section 5.1, it was discussed that a condition index between 5 and 10 indicates weak multicollinearity, while a condition index of more than 30 indicates a very strong one. In our case, the strength of the collinearity between the two dummy variables for schizophrenia and smoker is inconclusive; however, the 4th eigenvalue may be causing damage to the parameter estimates of the constant term (intercept) and age. In the following we shall keep all variables, but we shall later also check to see what happens if we remove the age variable.

Collinearity Diagnostics a
                                                      Variance Proportions
Model  Dimension  Eigenvalue  Condition Index  (Constant)  Schizophrenia=1.0  Smoker=1.0  Age
1      1          3.305       1.000            .00         .02                .02         .00
       2          .420        2.807            .04         .14                .31         .02
       3          .246        3.666            .00         .82                .54         .00
       4          .030        10.560           .96         .02                .12         .97
a. Dependent Variable: Brain density of nicotinic receptors

Table 5.2.2

To perform General Linear Models in SPSS we go to Analyse, General Linear Model, Univariate. We put ‘nAChR’ in ‘Dependent variable’, ‘schizophrenia’ and ‘smoker’ in ‘Fixed Factor(s)’ and ‘age’ in ‘Covariate(s)’ (see Figure 5.2.4), then press OK. This yields the output in Table 5.2.3. At this point, we see that ‘age’, ‘schizophrenia’ and ‘smoker’ are significant, but the interaction term between schizophrenia and smoker is not. Therefore, we need to remove this insignificant term, and keep removing any insignificant terms, until the p-values of all remaining terms are less than 0.05. In this case, only removing the interaction term will be necessary. This is done by going to Model, selecting Custom and then only putting in the main effects for the factors and covariates (see Figure 5.2.5). It is common practice that if the interaction term is significant, the corresponding main effects are still kept, even if they are not significant.

Figure 5.2.4

Tests of Between-Subjects Effects
Dependent Variable: Brain density of nicotinic receptors
Source                   Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model          414.499a                  4    103.625       4.863    .007
Intercept                783.257                   1    783.257       36.756   .000
Age                      103.850                   1    103.850       4.873    .040
Schizophrenia            109.968                   1    109.968       5.160    .035
Smoker                   103.000                   1    103.000       4.833    .041
Schizophrenia * Smoker   7.291                     1    7.291         .342     .565
Error                    404.888                   19   21.310
Total                    9672.892                  24
Corrected Total          819.387                   23
a. R Squared = .506 (Adjusted R Squared = .402)

Table 5.2.3

Figure 5.2.5
We then go to Save and select ‘Unstandardised Predicted Values’, ‘Studentised
Residuals’, ‘Cook’s Distance’ and ‘Leverage Values’ (see Figure 5.2.6). Press
Continue.

Figure 5.2.6

Figure 5.2.7

Furthermore, from Figure 5.2.7, we see that from Options we can choose
‘Descriptive statistics’, ‘Parameter estimates’ and ‘Homogeneity tests’. Pressing
Continue and OK yields the following outputs.

Descriptive Statistics
Dependent Variable: Brain density of nicotinic receptors
Schizophrenia   Smoker   Mean      Std. Deviation   N
No              No       17.6633   4.54381          9
                Yes      25.2275   4.08428          4
                Total    19.9908   5.58017          13
Yes             No       10.8550   .78489           2
                Yes      19.9300   6.05092          9
                Total    18.2800   6.54437          11
Total           No       16.4255   4.91566          11
                Yes      21.5600   5.92078          13
                Total    19.2067   5.96871          24

Table 5.2.4

In Table 5.2.4 we see the means for the levels of the schizophrenia factor (No and Yes), the levels of the smoking factor (No and Yes), and combinations of both. It can be seen that the mean is lower for the schizophrenia group in comparison with the group where schizophrenia is absent. Similarly, the mean is higher for the smoking group than it is for the non-smoking group. The following table will determine whether these differences are significant.

Tests of Between-Subjects Effects
Dependent Variable: Brain density of nicotinic receptors
Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model   407.208a                  3    135.736       6.586    .003
Intercept         958.086                   1    958.086       46.489   .000
Schizophrenia     139.732                   1    139.732       6.780    .017
Smoker            149.328                   1    149.328       7.246    .014
Age               98.907                    1    98.907        4.799    .040
Error             412.179                   20   20.609
Total             9672.892                  24
Corrected Total   819.387                   23
a. R Squared = .497 (Adjusted R Squared = .422)
Table 5.2.5

In Table 5.2.5 we see that both the schizophrenia and the smoking factor, and
the covariate age, are significant at 0.05 level of significance, so these variables
should all be retained in the model. It also gives the R squared for this model,
which is 0.497, and the adjusted R squared which is 0.422. The parameter
estimates for the resulting model are shown in Table 5.2.6.

Parameter Estimates
Dependent Variable: Brain density of nicotinic receptors
Parameter           B        Std. Error   t        Sig.   95% CI Lower Bound   95% CI Upper Bound
Intercept           27.387   3.762        7.279    .000   19.538               35.235
[Schizophrenia=1]   5.638    2.165        2.604    .017   1.121                10.154
[Schizophrenia=2]   0a       .            .        .      .                    .
[Smoker=1]          -6.258   2.325        -2.692   .014   -11.107              -1.408
[Smoker=2]          0a       .            .        .      .                    .
Age                 -.127    .058         -2.191   .040   -.247                -.006
a. This parameter is set to zero because it is redundant.

Table 5.2.6

Let S = 1 if schizophrenia is not present and S = 0 if it is. Similarly, let S* = 1 if a smoking habit is not present and S* = 0 otherwise. Represent age by A. Then, from Table 5.2.6 we obtain the following model:

E[Yi | Ai, Si, Si*] = 27.387 − 0.127Ai + 5.638Si − 6.258Si*.

The difference between the observed Yi and E[Yi | Ai , Si , Si *] is the residual εi .


Having fitted the model to the data, we now turn our attention to checking the assumptions and the fit of the model. Levene's test on equality of variances shows that there is no evidence of variance inhomogeneity between the factor levels (F = 1.835, df1 = 3, df2 = 20, p = 0.173). Furthermore, normality tests on the residuals show that the normality hypothesis is not rejected either (for the KS test, KS = 0.118, df = 24, p = 0.2; for the SW test, SW = 0.974, df = 24, p = 0.765). There is no point with a Cook's distance greater than 1, so there are no potential outliers, and none of the studentized residuals lie outside ±2. Also, in this case 2p/n = 1/3 and there are no points with leverage values higher than this. We therefore conclude that the model is good and there are no outliers.

In Table 5.2.7 one can find the parameter estimates obtained if the ‘age’ covariate is removed. It can be seen that the parameter for ‘schizophrenia’ remains similar but the parameter for ‘smoking’ increases in magnitude. One may check the usual assumptions on the model residuals in a similar way. The R squared in this case drops considerably to 0.376. Furthermore, when comparing the parameter estimates in Table 5.2.6 to those in Table 5.2.7 we note that there is a difference of 7.63 (27.387 − 19.757) for the intercept term, a difference of 0.221 (5.638 − 5.859) in the parameter estimate for [Schizophrenia=1] and a difference of 1.867 (−6.258 + 8.125) in the parameter estimate for [Smoker=1]. When observing a large drop in R squared but very little variability in the model parameter estimates it may be reasonable to keep the ‘age’ variable. However, the difference in the intercept term is not very small, so in this case one might prefer to remove age from the model. A better test for selecting between these two competing models would be to apply cross-validation, which would allow one to evaluate the predictive ability of each model, but this will not be discussed here.

Parameter Estimates
Dependent Variable: Brain density of nicotinic receptors
Parameter           B        Std. Error   t        Sig.   95% CI Lower Bound   95% CI Upper Bound
Intercept           19.757   1.548        12.766   .000   16.539               22.976
[Schizophrenia=1]   5.859    2.350        2.493    .021   .971                 10.747
[Schizophrenia=2]   0a       .            .        .      .                    .
[Smoker=1]          -8.125   2.350        -3.457   .002   -13.013              -3.237
[Smoker=2]          0a       .            .        .      .                    .
a. This parameter is set to zero because it is redundant.

Table 5.2.7

In R/RStudio

The file nachr.csv contains data for 25 individuals regarding brain density of
nicotinic acetylcholine receptors (nAChR) and age, the presence (or otherwise)
of schizophrenia and the presence (or otherwise) of a smoking habit.

Figure 5.2.8

Figure 5.2.9

We first start by turning the values for ‘Schizophrenia’ and ‘Smoker’ into
factors.

nachr <- read.csv("nachr.csv")   # read in the data (file name as given above)
nachr$Schizophrenia <- factor(nachr$Schizophrenia, levels=c(1,2), labels=c("No","Yes"))
nachr$Smoker <- factor(nachr$Smoker, levels=c(1,2), labels=c("No","Yes"))

We then move on to plotting a box plot of nAChR with respect to the factors,
and a scatter plot of age and nAChr in R (see Figure 5.2.8 and Figure 5.2.9). In
the box plot we see that there is a larger difference in the location of nAChR for
smokers/non-smokers than there is for the schizophrenia category. In the scatter
plot, on the other hand, we see that nAChR and age are negatively correlated.
However, it must be noted that here we are looking at these relationships
individually and not simultaneously, so these conclusions are not definitive.
We now perform the multicollinearity diagnostics. First we create the required
‘Schizophrenia’ and ‘Smoker’ dummy variables, and extract the ‘Age’ covariate
from the data set.
Note that it is up to the user to decide which of the categories, in a categorical
variable, should be taken as reference. By default, SPSS takes the last level as
the reference category. Here, for comparison purposes, we shall take the first
category as reference and hence we will see that the output changes slightly (for
example, the multicollinearity diagnostics) but overall the model fitted will be
the same.

> dummySchizophrenia<-nachr$Schizophrenia=="Yes"
> dummySmoker<-nachr$Smoker=="Yes"
> dummySchizophrenia<-1*dummySchizophrenia
> dummySmoker<-1*dummySmoker
> Age<-nachr$Age

We now bind these together in one matrix and obtain the correlations and
condition indices as follows.
> x <- cbind(dummySchizophrenia, dummySmoker, Age)
> cor(x, method = "spearman")

dummySchizophrenia dummySmoker Age


dummySchizophrenia 1.0000000 0.5104895 -0.1572874
dummySmoker 0.5104895 1.0000000 -0.3932185
Age -0.1572874 -0.3932185 1.0000000
> library(perturb)   # for colldiag()
> colldiag(x)
Condition
Index Variance Decomposition Proportions
intercept dummySchizophrenia dummySmoker Age
1 1.000 0.004 0.028 0.022 0.004
2 2.474 0.013 0.254 0.114 0.032
3 3.601 0.000 0.718 0.642 0.003
4 11.470 0.983 0.000 0.222 0.961

We see that the dummy variable for Schizophrenia is moderately correlated with
the dummy variable for smoker. Furthermore, the highest condition index is
11.47, which is above 10, indicating that collinearity is not weak but also not
serious (not above 30). The condition index in this case is slightly different from
the one for SPSS since R takes the dummy variables in the opposite way (1 if
smoking/schizophrenia is present and 0 otherwise). The 4th eigenvalue may be
causing damage to the parameter estimates of the constant term (intercept) and
the age. In the following we shall keep all variables, but we can also check to see
what happens if we remove the age variable.
We fit the linear model to the data and obtain the model summary by using the
lm function as follows. Note that the interaction between Schizophrenia and
Smoker is catered for in the model through the term

Schizophrenia:Smoker.

model1 <- lm(nAChR ~ Schizophrenia + Age + Smoker + Schizophrenia:Smoker, data = nachr)
summary(model1)

Call:
lm(formula = nAChR ~ Schizophrenia + Age + Smoker + Schizophr
enia:Smoker, data = nachr)

Residuals:
Min 1Q Median 3Q Max
-7.1394 -1.9762 -0.5345 2.8211 8.8811

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                 27.54190    4.73204   5.820 1.32e-05 ***
SchizophreniaYes            -3.78332    3.86011  -0.980   0.3393
Age                         -0.14180    0.06423  -2.208   0.0398 *
SmokerYes                    7.11514    2.78148   2.558   0.0192 *
SchizophreniaYes:SmokerYes  -2.90853    4.97249  -0.585   0.5655
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.616 on 19 degrees of freedom
Multiple R-squared: 0.5059, Adjusted R-squared: 0.4018
F-statistic: 4.863 on 4 and 19 DF, p-value: 0.007168

At this point, we see that ‘age’, ‘schizophrenia’ and ‘smoker’ are significant, but the interaction term between schizophrenia and smoker is not. Therefore, we need to remove this insignificant term, and keep removing any insignificant terms, until the p-values of all remaining terms are less than 0.05. In this case, only removing the interaction term will be necessary. It is common practice that if the interaction term is significant, the corresponding main effects are still kept, even if they are not significant. Hence we type:

model1<-lm(nAChR~Schizophrenia+Age+Smoker,data=nachr)
summary(model1)

This yields the following output:

Call:
lm(formula = nAChR ~ Schizophrenia + Age + Smoker, data = nachr)

Residuals:
Min 1Q Median 3Q Max
-7.6196 -2.0605 -0.1958 2.5154 8.5371

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.76664 4.46730 5.992 7.38e-06 ***
SchizophreniaYes -5.63790 2.16519 -2.604 0.0170 *
Age -0.12667 0.05782 -2.191 0.0405 *
SmokerYes 6.25783 2.32478 2.692 0.0140 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1

Residual standard error: 4.54 on 20 degrees of freedom


Multiple R-squared: 0.497, Adjusted R-squared: 0.4215
F-statistic: 6.586 on 3 and 20 DF, p-value: 0.002826

We see that R sets S = 1 if schizophrenia is present and S = 0 if it is not. Similarly, S* = 1 if a smoking habit is present and S* = 0 if it is not. Represent age by A. Then from this output we obtain the following model:

Y = 26.767 − 0.127A − 5.638S + 6.258S*

Furthermore, we see that the R squared is 0.497 and the adjusted R squared is 0.4215. It can be seen that the parameters for the dummy variables of this model are the negative of those in SPSS. This is due to the fact that the dummy variables are taken in the opposite way to SPSS. To obtain the ANOVA table we type aov(model1), which gives the following output:

Call:
aov(formula = model1)

Terms:
Schizophrenia Age Smoker Residuals
Sum of Squares 17.4384 240.4416 149.3278 412.1787
Deg. of Freedom 1 1 1 20

Residual standard error: 4.539707


Estimated effects may be unbalanced

This table tests the following hypothesis

H0: Model with only (intercept) constant term is a good fit for the data.
H1: Model fitted (which includes covariates) fits better than the model with only
the intercept term.

The predicted values E[Yi | Ai, Si, Si*] for each observation are obtained via predict(model1):
1 2 3 4 5 6 7 8 9
19.799708 16.252905 20.179723 17.392949 19.039679 25.930869 16.632919 16.126233 26.817570
10 11 12 13 14 15 16 17 18
22.004051 23.650781 21.193096 14.859517 9.601636 18.519571 20.039630 19.659615 22.066375
19 20 21 22 23 24
18.519571 18.646243 23.586434 18.519571 22.319718 9.601636

and the residuals εi = Yi − E[Yi | Ai, Si, Si*] are obtained via residuals(model1):

1 2 3 4 5 6
-1.2697083272 -4.5229045226 -1.1697230205 8.5370513973 2.6203210595 -0.3908693173
7 8 9 10 11 12
-5.3529192160 0.0937670418 3.8724297315 -0.9740508195 -0.0007811573 -3.9230955361
13 14 15 16 17 18
2.4804826863 1.8083642187 -7.6195711893 1.3403700373 -7.2096152693 5.1336250062
19 20 21 22 23 24
-1.4395711893 8.1237572462 -4.0264337672 -0.7895711893 3.9802818773 0.6983642187

On the other hand, studentised residuals are obtained from the package MASS via the following commands:

library(MASS)
studres(model1)

1 2 3 4 5 6
-0.2967740106 -1.0617616739 -0.2764701838 2.1531356993 0.6064154134 -0.0930772726
7 8 9 10 11 12
-1.2655954013 0.0214272758 0.9516296299 -0.2535918667 -0.0001904824 -0.9887767927
13 14 15 16 17 18
0.5893157960 0.4615806041 -1.8950297860 0.3039500900 -1.7578738617 1.2450055853
19 20 21 22 23 24
-0.3292774832 2.0428174020 -1.0230597320 -0.1802418023 0.9573748954 0.1774114381

To check for normality, one can apply the Shapiro-Wilk test on the residuals or studentised residuals in the usual way to show that normality is satisfied. For the studentised residuals, we can check whether there are any outliers by checking whether any of them lie outside the range [-2, 2]; here only two values (observations 4 and 20) lie marginally above 2, so there is no strong indication of outlying observations. We also look at the leverage values, where in this case we compare with 2p/n = 1/3. The output below shows that none of the leverage values exceeds 1/3.

leverages <- hatvalues(model1, type='rstandard')
leverages

1 2 3 4 5 6 7 8
0.15231839 0.11390249 0.17152170 0.09850924 0.12267214 0.18671666 0.10585128 0.11723514
9 10 11 12 13 14 15 16
0.20031148 0.31761718 0.22476175 0.23700672 0.16840690 0.28453630 0.11390249 0.09921842
17 18 19 20 21 22 23 24
0.09850924 0.15231839 0.11390249 0.11089429 0.24665241 0.11390249 0.16479614 0.28453630

We can also apply Cook’s distance, and take 1 as threshold.

cooks.distance(model1, type='rstandard')
1 2 3 4 5 6 7
4.145521e-03 3.599892e-02 4.147700e-03 1.071654e-01 1.327442e-02 5.231749e-04 4.601955e-02
8 9 10 11 12 13 14
1.604548e-05 5.697911e-02 7.850467e-03 2.768302e-09 7.600837e-02 1.817586e-02 2.205051e-02
15 16 17 18 19 20 21
1.021682e-01 2.664938e-03 7.642985e-02 6.776720e-02 3.646875e-03 1.123054e-01 8.547126e-02
22 23 24
1.097077e-03 4.540191e-02 3.288599e-03

Once again, the output shows that no point has a Cook’s distance greater than 1,
showing that there are no influential points in the data. Finally, to check for
variance homogeneity, we use the Breusch-Pagan test.

library(lmtest)   # for bptest()
bptest(model1)

studentized Breusch-Pagan test

data:  model1
BP = 1.0137, df = 3, p-value = 0.7979

The Breusch-Pagan test checks whether the variance of the residuals depends on the predictors. Under the null hypothesis the variance does not depend on the predictors, hence no heteroscedasticity is present (i.e., the variance of the error terms is homogeneous throughout); under the alternative hypothesis the variance homogeneity does not hold. The p-value of 0.798 means that we fail to reject the variance homogeneity/no heteroscedasticity hypothesis. We therefore conclude that the model is good and there are no outliers.

To fit the model without the ‘age’ covariate, use the command:

model1<-lm(nAChR~Schizophrenia+Smoker,data=nachr)

Results and outcomes will be similar to the SPSS case. One may check the usual
assumptions on the model residuals in a similar way as has been covered in this
section on the previous model.

5.3 Generalized Linear Models

The Multiple linear regression model, the ANOVA model and the ANCOVA
model, covered in the previous sections, are collectively referred to as general
linear models. In this section we shall look at a flexible generalization of such
models, namely the Generalized Linear Model (GLM). GLMs allow for response
variables that are categorical or discrete and/or for error distributions other than
the normal (Gaussian).

For a sample of size n, the generalized linear model is defined as:

g(E(Yi | Xi1 = xi1, ..., Xip = xip)) = β0 + β1Xi1 + ... + βpXip

where i = 1, ..., n, Y is the response variable, β0, ..., βp are unknown coefficients (model parameters) that need to be estimated and X1, ..., Xp are the explanatory variables, which can be covariates and/or categorical variables.

Also, g(·) is a link function which describes how the mean µi = E(Yi | Xi1 = xi1, ..., Xip = xip) depends on the linear predictor ηi = β0 + β1Xi1 + ... + βpXip, so that g(µi) = ηi. The link function relates the response variable to the linear model.

Note that for general linear models, the link function is the identity function,
g ( µi ) = µi . So a general linear model may be viewed as a special case of the
generalized linear modelling framework.
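As a small check of this statement (a sketch, not from the text), fitting a Gaussian GLM with the identity link to the mtcars model of Section 5.1 reproduces the ordinary least squares coefficients:

# A general linear model is a GLM with Gaussian errors and identity link
data(mtcars)
coef(lm(mpg ~ wt + qsec, data = mtcars))
coef(glm(mpg ~ wt + qsec, family = gaussian(link = "identity"), data = mtcars))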

Generalized linear models make the following assumptions:

Assumption 1: The observations obtained from different subjects/objects are assumed to be independent of each other.

Assumption 2: The variance of the response variable is assumed to be a function of its mean. The form of this function is determined by the distribution that is assumed.

A generalized linear model caters for response variables with distributions forming part of the exponential family of distributions².

In this chapter we shall consider one class of models from this family of models,
namely the Logistic regression models, which has two important members: the
Binary logistic regression model and the Multinomial logistic regression model.

5.3.1 The binary logistic regression model

The binary logistic regression model is used when we wish to model the influence of a set of explanatory variables $X_1, \ldots, X_p$ on a response variable $Y$, where the possible outcomes of $Y$ can only be one of two categories.

2 Most of the commonly used statistical distributions form part of the exponential family of distributions. The normal, exponential, chi-square, binomial, Bernoulli, Poisson and geometric distributions are all members of the exponential family of distributions.

For example, the possible outcomes may be pass or fail, yes or no, 0 or 1, true or
false. More specifically, for a sample of size n, the binary logistic regression
model is given by

$$g(\mu_i) = \operatorname{logit}(\mu_i) = \log\!\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}, \qquad i = 1, \ldots, n$$

where $\mu_i$ is, in this case, the probability that the outcome on the response variable will be in the category of interest. The first category is the default reference category in R, whilst the last category is the default reference category in SPSS.
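To make the link function concrete, the logit and its inverse are available in R as qlogis() and plogis(); a small illustration (the probability 0.75 is an arbitrary example value, not taken from the data):

p <- 0.75
qlogis(p)            ## logit(p) = log(p / (1 - p)) = log(3), approximately 1.099
plogis(qlogis(p))    ## the inverse logit recovers the probability, 0.75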

Example

Consider the dataset mtcars, which is available in the package datasets in R software. The data were obtained from the 1974 Motor Trend US magazine. There are various variables in the dataset, all related to fuel consumption
and aspects of automobile design and performance for 32 cars. More detail on
the data has been given in Section 5.1. In Section 5.1, focus was on modelling
the possible influence of the predictors "disp", "hp", "drat", "wt" and "qsec" on
the continuous variable "mpg". Our focus in this section will be directed towards
modelling the influence of a number of explanatory variables on the categorical
response variable "carb". For the purpose of the analysis at hand, the variable
"carb" will be reorganized into two different categories:

0: up to two carburettors,
1: more than two carburettors.

As for a general linear model, the explanatory variables used in a generalized linear model should not lead to any multicollinearity issues. Prior to fitting a
generalized linear model, multicollinearity diagnostics should thus be carried out
on both continuous and categorical variables. SPSS and R will now be used to
first reorganize the categorical variable "carb" into two categories,
multicollinearity diagnostics will then be conducted on the remaining variables
of the dataset mtcars and will then proceed to fit a binary logistic regression
model (response variable is binary) to the data.

We will start off by showing how a logistic regression model may be fitted in
SPSS. Most of the interpretation accompanying SPSS’ output also applies for the
outputs obtained using R.

In SPSS

Open the file ‘mtcars.sav’. Prior to proceeding with fitting the logistic regression
model we need to reorganize the variable carb so that its possible response
categories are only two. The new variable carbnew is created which takes a value
of 0 if the number of carburettors are 1 or 2 and it takes a value of 1 if the number
of carburettors is more than 2. Recall that recoding into a different variable in
SPSS is carried out by going to the menu Transform – Recode into Different
Variables, choosing the variable that you wish to recode, entering the name of the
new variable, entering the new coded values, press continue and OK. Details on
how to recode data may be found in Section 1.1.8.

Before proceeding to fit the logistic regression model to the data with response
variable carbnew, we can obtain descriptive statistics and plots of the variables
being considered in the analysis so as to get an initial feel of the data and an initial
insight into possible relationships that might exist between the variables. For
detail on how to obtain descriptives and plots according to the type of variables
considered refer to Sections 2 and 3. We should also check whether there is any
multicollinearity in the data.

Since in this case we have both covariates and categorical variables as explanatory variables, we first need to rewrite the categorical variables as dummy
variables and then move on to use collinearity diagnostics on the variables. To
create dummy variables in SPSS, we have to go to Transform – Create Dummy
Variables – take the categorical variables cyl, vs, am and gear underneath Create
Dummy Variables for as shown in Figure 5.3.1 and write down names for the
dummy variables underneath Root Names (One Per Selected Variable). Then
press Ok and the newly created dummy variables will be shown on the Data View
as on Figure 5.3.2.

Figure 5.3.1

The procedure used to detect multicollinearity in the explanatory variables of a generalized linear model is not straightforward. For the purpose of these
lecture notes, we will decide which variables should be removed from the data
based on the resulting correlation coefficients and a trial and error model fitting
approach. The Spearman correlation coefficient is used here due to the presence
of dummy variables in the data. Recall that to obtain the Spearman correlation
coefficient we need to go to Analyze – Correlate – Bivariate. Then move all the
variables underneath Variable(s) except for carb, carbnew and one dummy
variable for each of cyl, vs, am and gear. If you were to include all dummy variables in the analysis, you would get a serious multicollinearity problem. If you
wish to match the collinearity diagnostics with those obtained in the software R,
leave out the first dummy variable for each of cyl, vs, am and gear. So leave out
cyl=4, vs=0, am=0 and gear=3 as shown in Figure 5.3.3. Tick Spearman, untick
Pearson, press Continue and Ok.

Figure 5.3.3

The output of interest is the one shown in Table 5.3.1.

Table 5.3.1

Figure 5.3.2

We take note of any correlation coefficient that is larger than 0.5 as it is deemed
to be quite high in relation to the identification of variables which may lead to
multicollinearity. There are a number of correlation coefficients that stand out in
this example: -0.909 for disp and mpg, -0.886 for mpg and wt, 0.851 for disp and
hp, to mention a few. All of these correlations are much larger than 0.5 showing
that we cannot consider any of these pairs of explanatory variables together when
fitting a binary logistic regression model. Possible binary logistic regression
models with carbnew as response variable that might be considered at this point
are those with these pairs of explanatory variables:

1. mpg and qsec


2. disp and qsec
3. disp and gear
4. hp and am
5. drat and qsec
6. drat and vs
7. wt and qsec
8. qsec and am
9. vs and am
10. am and cyl.

We proceed to fit binary logistic regression model 1, that is using mpg and qsec
as explanatory variables. A similar procedure to what will be shown for this
particular binary logistic regression model may be used to fit binary logistic
regression models 2 - 10. A comparison between the fits provided by the ten
different models can then be entertained.

A binary logistic regression model may be obtained in three different ways in SPSS. The three procedures have different output options.

As will be explained in the section which follows, a multinomial logistic regression model is a direct extension of the binary logistic regression model as
it caters for a categorical response variable with more than two categories. One
way in which we can fit the binary logistic regression model is through the menu
used to fit a multinomial logistic regression model. To use this procedure, go to
Analyze – Regression – Multinomial Logistic. Then move carbnew as the
Dependent Variable.

Note that the reference category in SPSS is by default the last. So, the
probabilities µi estimated from the fitted model will be those corresponding to a
vehicle having up to two carburettors. Should we wish to change the reference
category to the first, we should click on Reference Category and choose First
Category as in Figure 5.3.4. Here we will proceed with the default setting of
SPSS.

Figure 5.3.4

Then move mpg and qsec underneath Covariate(s) as in Figure 5.3.5. Then click
on Save and choose estimated response probabilities. Press Continue and Ok.
Note that if we had a model involving categorical explanatory variables, we
would also press on Model and choose full factorial so that we can include both
main effects and interaction terms in the model.

Figure 5.3.5

Table 5.3.2 is part of the output obtained. From this table, we note that the variable qsec should be removed from the model as its corresponding p-value, 0.106, is greater than 0.05, showing that its model coefficient is not significantly different from zero.

Table 5.3.2

The output which results, once qsec is removed from the model, is the following:

Table 5.3.3

From Table 5.3.3, since the p-value for mpg (0.006) is less than 0.05 it shows that
mpg exerts an influence on the probability of whether a vehicle has up to 2
carburettors or more than 2 carburettors. The p-value shown in Table 5.3.4 (0 <
0.05) shows that the change in deviance of 17.417 between the two models is
significant. Thus, the model with mpg as explanatory variable provides a better
fit than an intercept only model.

Table 5.3.4
Since there is no agreement as to whether pseudo R2 values should be used in
conjunction with logistic regression models, information on pseudo R2 values will
not be given here.

The model being fitted in this case takes the form:

$$\log\!\left(\frac{\mu_i}{1-\mu_i}\right) = -8.135 + 0.431 \times \text{mpg}_i.$$

From the coefficient of the variable mpg, 0.431, we may say that mpg has a positive effect on $\mu_i$, the probability of a vehicle having an engine with up to two carburettors. More precisely, a one-unit increase in mpg increases the log odds $\log\left(\frac{\mu_i}{1-\mu_i}\right)$ of an engine having up to two carburettors by 0.431.

We can also focus on the probability $\mu_i$ directly, by making $\mu_i$ the subject of the formula to get the expression for the predicted probabilities:

$$\mu_i = \frac{e^{-8.135 + 0.431 \times \text{mpg}_i}}{1 + e^{-8.135 + 0.431 \times \text{mpg}_i}}.$$
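As an illustrative calculation (not part of the SPSS output), for a vehicle with mpg = 21 the fitted probability of having up to two carburettors is approximately

$$\hat{\mu}_i = \frac{e^{-8.135 + 0.431 \times 21}}{1 + e^{-8.135 + 0.431 \times 21}} = \frac{e^{0.916}}{1 + e^{0.916}} \approx 0.71.$$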

The estimated probabilities are presented in Table 5.3.5 with the first column
giving the estimated probabilities of a vehicle having an engine with up to two
carburettors and the second column giving the estimated probabilities of a vehicle
having an engine with more than two carburettors. These probabilities have also
been obtained through the menu Analyze – Regression – Multinomial Logistic:

Table 5.3.5

Residuals cannot be obtained through this fitting procedure. To be able to get residuals such as the Pearson residuals, a binary logistic regression model should
be fitted by means of a different procedure. Go to Analyze – Generalized Linear
Models - Generalized Linear Models. Then choose Binary logistic as type of
model, from the tab Response, take carbnew underneath Dependent Variable (as
in Figure 5.3.6), move mpg underneath Covariates in the tab Predictors (as in
Figure 5.3.7), move mpg underneath Model in the tab Model (as in Figure 5.3.8),
then go to the tab Save and you can choose any of the options available so that
you may conduct residual diagnostics.

Figure 5.3.6
Figure 5.3.7

Figure 5.3.8

Figure 5.3.9

Once different generalized linear models have been fitted to the data, their performance can be compared using goodness of fit measures such as the deviance (only if the models are nested) and the AIC. The lower the deviance, the better the fit of the model; similarly, the lower the value of the AIC, the better. These goodness of fit measures are obtained as part of the default output produced through the menu Analyze – Generalized Linear Models - Generalized Linear Models, as shown in Table 5.3.6.

Goodness of Fit(a)
                                          Value    df   Value/df
Deviance                                 21.274    23       .925
Scaled Deviance                          21.274    23
Pearson Chi-Square                       18.664    23       .811
Scaled Pearson Chi-Square                18.664    23
Log Likelihood(b)                       -12.024
Akaike's Information Criterion (AIC)     28.047
Finite Sample Corrected AIC (AICC)       28.461
Bayesian Information Criterion (BIC)     30.979
Consistent AIC (CAIC)                    32.979
Dependent Variable: carbnew
Model: (Intercept), mpg
a. Information criteria are in smaller-is-better form.
b. The full log likelihood function is displayed and used in computing information criteria.

Table 5.3.6

If interest lies in how many vehicles have been correctly classified, we can also
look at the classification table (see Table 5.3.7) obtained through the menu
Analyze – Regression – Binary Logistic. Group membership is selected through
the tab Save, as shown in Figure 5.3.10:

Figure 5.3.10
Table 5.3.7

Using the diagonal elements in the contingency table (see Table 5.3.7), the
number of correctly classified vehicles amounts to 24 (13 + 11) showing that only
75% of the vehicles were correctly classified by means of this model.

In R/RStudio

A GLM may be fitted in R using the glm command from the stats package. The
usage of glm() is similar to the function lm() which we have previously used to
fit linear models in Sections 5.1 and 5.2. However, in the glm command we also
have to specify a family argument which caters for different exponential family
distributions and different link functions.

The exponential family distributions available in the glm function in R, together with their default link functions, are:

binomial (link = 'logit')
gaussian (link = 'identity')
Gamma (link = 'inverse')
inverse.gaussian (link = '1/mu^2')
poisson (link = 'log')
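As a minimal sketch of how the family and link are specified (the model am ~ wt below is chosen purely for illustration and is not one of the models analysed in this section):

fit_logit  <- glm(am ~ wt, data = mtcars, family = binomial)                   ## default logit link
fit_probit <- glm(am ~ wt, data = mtcars, family = binomial(link = 'probit'))  ## alternative link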

Since we are going to use the glm function to fit a logistic regression model, we
will need to specify the family as binomial with logit link. The reason for
selecting such a family follows from the fact that a response variable that has only
two possible outcome categories may be considered to follow the Bernoulli
distribution and the Bernoulli distribution is a special case of the Binomial
distribution.

We use the following commands:

We start our analysis by opening the data file and specifying which of the
variables are categorical (factors). This is done as follows:
data <- mtcars
attach(data)

data$cyl <- as.factor(cyl) ## defining cyl to be a factor


data$vs <- as.factor(vs)
data$am <- as.factor(am)
data$gear <- as.factor(gear)
data$carb <- as.factor(carb)

Using the command revalue from the R package plyr to reorganize the categories
of the variable carb:

library(plyr)

## reorganizing the categories of the variables carb


carbnew <-revalue(data$carb, c("1"="0", "2"="0" ,"3"="1", "4"=
"1", "6"="1", "8"="1"))

summary(carbnew)
0 1
17 15

Before proceeding to fit the logistic regression model to the data with response
variable carbnew, we can obtain descriptive statistics and plots of the variables
being considered in the analysis so as to get an initial feel of the data and an initial
insight into possible relationships that might exist between the variables. For
detail on how to obtain descriptive statistics and plots according to the type of
variables considered refer to Sections 2 and 3.

We should also check whether there is any multicollinearity in the data. Since in
this case we have both covariates and categorical variables as explanatory
variables, we first need to rewrite the categorical variables as dummy variables
and then move on to use collinearity diagnostics on the variables. There are
various ways in which dummy variables may be created in R. One of these
methods proceeds as follows:

## Creating dummy variables to be able to perform


## multicollinearity diagnostics on categorical variables
dummiesvs = model.matrix(~data$vs)

head(dummiesvs)

(Intercept) data$vs1
1 1 0
2 1 0
3 1 1
4 1 1
5 1 0
6 1 1

## extracting solely the column of interest


dummiesvs = model.matrix(~data$vs)[,-1]
head(dummiesvs)

1 2 3 4 5 6
0 0 1 1 0 1

dummiesam = model.matrix(~data$am)[,-1]
head(dummiesam)

1 2 3 4 5 6
1 1 1 0 0 0

dummiesgear = model.matrix(~data$gear)[,-1]
head(dummiesgear)

data$gear4 data$gear5
1 1 0
2 1 0
3 1 0
4 0 0
5 0 0
6 0 0

dummiescyl = model.matrix(~data$cyl)[,-1]
head(dummiescyl)

data$cyl6 data$cyl8
1 1 0
2 1 0
3 0 0
4 1 0
5 0 1
6 1 0

## Creating the matrix to use for multicollinearity


## diagnostics
formultcol<-cbind(mpg, disp, hp, drat, wt, qsec, dummiesvs,
dummiesam, dummiesgear, dummiescyl)

head(formultcol)

mpg disp hp drat wt qsec dummiesvs dummiesam data$gear4 data$gear5
21.0 160 110 3.90 2.620 16.46 0 1 1 0
21.0 160 110 3.90 2.875 17.02 0 1 1 0
22.8 108 93 3.85 2.320 18.61 1 1 1 0
21.4 258 110 3.08 3.215 19.44 1 0 0 0
18.7 360 175 3.15 3.440 17.02 0 0 0 0
18.1 225 105 2.76 3.460 20.22 1 0 0 0

data$cyl6 data$cyl8
1 0
1 0
0 0
1 0
0 1
1 0

The procedure used to detect multicollinearity in the explanatory variables of a generalized linear model is not straightforward. For the purpose of these
lecture notes, we will decide which variables should be removed from the data
based on the resulting correlation coefficients and a trial and error model fitting
approach. The Spearman correlation coefficient is used here due to the presence
of dummy variables in the data. The command used is as follows:
round(cor(formultcol,method='spearman'),digits=2)
## finding the Spearman correlation coefficients and rounding
## the values to 2 decimal places

Using the same strategy as that used when working with SPSS software, any
correlation coefficient that is larger than 0.5 is deemed to be quite high in relation
to the identification of variables which may lead to multicollinearity. There are
a number of correlation coefficients that stand out in this example: -0.91 for disp
and mpg, -0.89 for mpg and wt, 0.85 for disp and hp, to mention a few. All of
these correlations are much larger than 0.5 showing that we cannot consider any
of these pairs of explanatory variables together when fitting a binary logistic
regression model.

Possible binary logistic regression models with carbnew as response variable that
might be considered at this point are those with these pairs of explanatory
variables:

1. mpg and qsec


2. disp and qsec
3. disp and gear
4. hp and am
5. drat and qsec
6. drat and vs
7. wt and qsec
8. qsec and am
9. vs and am
10. am and cyl.

We proceed to fit binary logistic regression model 1, that is using mpg and qsec
as explanatory variables. A similar procedure to what will be shown for this
particular binary logistic regression model may be used to fit binary logistic
regression models 2 - 10. A comparison between the fits provided by the ten
different models can then be entertained. We fit our logistic regression model
with carbnew as the response variable and mpg and qsec as the explanatory
variables as follows:

datanew <- cbind(data,carbnew)


## including the variable carbnew in the data so that we can
## use carbnew as the response variable and mpg and qsec as
## explanatory variables

Fitting the binary logistic model:

glm <- glm(carbnew ~ mpg + qsec, data = datanew, family = 'binomial')
summary(glm)

Call:
glm(formula = carbnew ~ mpg + qsec, family = "binomial", data
= datanew)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.94655 -0.48935 -0.06209 0.62801 1.41200

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.8230 7.8982 2.383 0.0172 *
mpg -0.3433 0.1540 -2.230 0.0258 *
qsec -0.6977 0.4316 -1.616 0.1060
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 44.236 on 31 degrees of freedom


Residual deviance: 23.291 on 29 degrees of freedom
AIC: 29.291

Number of Fisher Scoring iterations: 6

To interpret the output provided by the glm function we start by looking at the p-
values obtained through the Wald test (under the heading Pr(>|z|)). Note that
the p-value for the coefficient corresponding to qsec is greater than 0.05 (0.106 >
0.05) which means that this coefficient is not significantly different from zero.
So we can remove the variable qsec from the model. The output for the model
without qsec is obtained as follows:

glm2 <- glm(carbnew ~ mpg, data = datanew, family='binomial')

summary(glm2)

Call:
glm(formula = carbnew ~ mpg, family = "binomial", data =
datanew)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.88400 -0.62520 -0.06642 0.62837 1.57953

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.1350 2.9845 2.726 0.00642 **
mpg -0.4307 0.1577 -2.731 0.00631 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)


Null deviance: 44.236 on 31 degrees of freedom
Residual deviance: 26.820 on 30 degrees of freedom
AIC: 30.82

Number of Fisher Scoring iterations: 6

Since the p-value for mpg (0.00631) is less than 0.05 it shows that mpg exerts an
influence on the probability of whether a vehicle has up to 2 carburettors or more
than 2 carburettors.

The deviance is a measure of goodness of fit of a generalized linear model: the higher the deviance, the worse the fit. Two deviances are reported in the summary of a fitted generalized linear model. The null deviance corresponds to the null model, that is, the model in which all terms are excluded except for the intercept, while the residual deviance refers to the fitted model. By including mpg in the model, the deviance has decreased from 44.236 (intercept-only model) to 26.82.
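These two quantities can also be extracted directly from the fitted object (a small sketch; the values are those reported in the summary above):

glm2$null.deviance   ## 44.236, intercept-only model
glm2$deviance        ## 26.820, fitted model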

We can check whether the change in deviance is significant, and hence decide whether it is worth keeping the variable mpg in the model rather than the intercept only, by comparing the residual deviances using the function anova() with the option test='Chisq' as shown below.

glminterceptonly <- glm(carbnew ~ 1, data = datanew, family = 'binomial')
## fitting an intercept only model

summary(glminterceptonly)

Call:
glm(formula = carbnew ~ 1, family = "binomial", data = datanew
)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.125 -1.125 -1.125 1.231 1.231
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1252 0.3542 -0.353 0.724

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 44.236 on 31 degrees of freedom


Residual deviance: 44.236 on 31 degrees of freedom
AIC: 46.236

Number of Fisher Scoring iterations: 3

anova(glm2, glminterceptonly, test = 'Chisq')
## alternatively, the equivalent likelihood-ratio test is obtained with
## test = 'LRT' instead of test = 'Chisq'

Analysis of Deviance Table

Model 1: carbnew ~ mpg


Model 2: carbnew ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 30 26.820
2 31 44.236 -1 -17.417 3.002e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

With a p-value of 0.00003 < 0.05, the change in deviance is significant showing
that the introduction of mpg in the model led to an improvement in model fit.

The AIC is another goodness of fit measure that may be used to compare the performance of competing models: the lower the value of the AIC, the better. By retaining the variable mpg in the model, the AIC is 30.82, whereas the AIC for an intercept-only model is 46.236, showing that the inclusion of mpg in the model indeed leads to a better fit.

Note that the deviance (only if models are nested) and AIC may also be used if
we wish to compare the fit of a number of candidate models. The lower the
deviance, the better the fit of the model. Similarly, the lower the value of AIC
the better.
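For instance, the AIC values of the two models fitted above can be compared directly (a small sketch):

AIC(glminterceptonly, glm2)   ## lower AIC indicates the preferred model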

The model being fitted in this case takes the form:

$$\log\!\left(\frac{\mu_i}{1-\mu_i}\right) = 8.135 - 0.431 \times \text{mpg}_i.$$

From the coefficient of the variable mpg, -0.431, we may say that mpg has a negative effect on $\mu_i$, the probability of an engine having more than two carburettors. More precisely, a one-unit increase in mpg reduces the log odds $\log\left(\frac{\mu_i}{1-\mu_i}\right)$ of an engine having more than two carburettors by 0.431. We can also focus on the probability $\mu_i$ directly, by making $\mu_i$ the subject of the formula to get the expression for the predicted probabilities:

$$\mu_i = \frac{e^{8.135 - 0.431 \times \text{mpg}_i}}{1 + e^{8.135 - 0.431 \times \text{mpg}_i}}.$$

The resulting predicted (estimated) probabilities of having an engine with more than two carburettors may be obtained through the glm function in R using either one of the two commands which follow:

predicted <- plogis(predict(glm2, datanew))


head(predicted)

fitted<-glm2$fitted.values
head(fitted)
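As a rough counterpart to the SPSS classification table (an illustrative sketch only; the 0.5 cut-off and the object name predclass are choices made here, not part of the original analysis):

predclass <- ifelse(fitted(glm2) > 0.5, 1, 0)              ## classify using a 0.5 cut-off
table(observed = datanew$carbnew, predicted = predclass)   ## rows: observed, columns: predicted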

Once a generalized linear model is fitted to the data we also need to check the residuals. Different types of residuals may be obtained through the glm function in R. Here we focus on the Pearson residuals and plot them against the linear predictors $\eta_i = 8.135 - 0.431 \times \text{mpg}_i$ (the right-hand side of the model). We should expect to see a horizontal band with most of the residuals falling within $\pm 2$ or $\pm 3$.

Pearson<-residuals(glm2,type='pearson')
plot(glm2$linear.predictors, Pearson)
Figure 5.3.11: Plot of Pearson Residuals versus Linear Predictors

which(abs(Pearson)>3)
named integer(0)
## no absolute Pearson residual value is larger than 3

From Figure 5.3.11 we can see that the resulting Pearson residuals fall within the $\pm 3$ horizontal band as desired. However, there are signs of curvature in the points, which means that there might be misspecification in the mean model, that is, the relationship between the logit function $\log\left(\frac{\mu_i}{1-\mu_i}\right)$ and the explanatory variables might not be linear.
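One possible way to investigate this suspicion (an illustrative sketch only; the quadratic term and the object name glm2q are introduced here and are not part of the original analysis) is to add a squared mpg term and test whether it improves the fit:

glm2q <- glm(carbnew ~ mpg + I(mpg^2), data = datanew, family = 'binomial')
anova(glm2, glm2q, test = 'Chisq')   ## is the quadratic term worth keeping?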

Some final points on using the glm function in R

• Suppose that the model which follows is fitted using the glm function in R.
Note that in this case the two explanatory variables are both categorical. If a
glm is fitted to a dataset involving categorical explanatory variables, the glm
function by default removes the first category of each categorical variable, a
process known as aliasing (vs0 and am0 have been aliased in the model which
follows). Without aliasing, estimation problems tend to arise. Aliasing is also
carried out automatically when a general linear model (ANOVA/ANCOVA
model) is fitted to the data as seen in Section 5.2.

glm <- glm(carbnew ~ vs + am, data = datanew, family = 'binomial')
summary(glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0484 0.6172 1.699 0.08939 .
vs1 -2.7125 0.9324 -2.909 0.00362 **
am1 -0.2681 0.8936 -0.300 0.76419

• Interaction terms may also be included in a generalized linear model. Suppose that in our logistic regression model we considered two categorical explanatory
variables, say vs and am. Then to consider the main effects vs and am and the
interaction term vs:am in the model, we use the command:
glm <- glm(carbnew ~ vs*am, data = datanew, family =
'binomial')
summary(glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6931 0.6124 1.132 0.258
vs1 -1.6094 1.0368 -1.552 0.121
am1 0.9163 1.2550 0.730 0.465
vs1:am1 -18.5661 2465.3261 -0.008 0.994

Final Note on fitting a logistic regression model using R and SPSS (for the
Advanced User)

If a logistic regression model with categorical explanatory variables is fitted in both R and SPSS, the estimates of
the model coefficients obtained will not be the same. The estimated probabilities
obtained with R and the estimated probabilities obtained with SPSS, EST2_1 as
shown in Table 5.3.5, will however be the same. The reason for this is that
logistic regression models with different parametrizations will be fitted to the
data. Both R and SPSS are equipped with tools to change the parametrization
according to the needs of the user. By adjusting the contrasts settings in R and SPSS you can obtain the same estimates of the model coefficients. In R, the default setting for contrasts is contr.treatment:

Suppose that the categorical variable vs is being considered in the logistic regression model.

glm <- glm(carbnew ~ vs, data = datanew, contrasts = list(vs = "contr.treatment"),
           family = 'binomial')

glm

Coefficients:
(Intercept) vs1
0.9555 -2.7473

Degrees of Freedom: 31 Total (i.e. Null); 30 Residual


Null Deviance: 44.24
Residual Deviance: 32.75 AIC: 36.75

head(model.matrix(glm))
(Intercept) vs1
Mazda RX4 1 0
Mazda RX4 Wag 1 0
Datsun 710 1 1
Hornet 4 Drive 1 1
Hornet Sportabout 1 0
Valiant 1 1

By changing the contrasts to contr.sum you get the following:

glm <- glm(carbnew ~ vs, data = datanew, contrasts = list(vs = "contr.sum"),
           family = 'binomial')

glm

Coefficients:
(Intercept) vs1
-0.4181 1.3736

Degrees of Freedom: 31 Total (i.e. Null); 30 Residual


Null Deviance: 44.24
Residual Deviance: 32.75 AIC: 36.75

head(model.matrix(glm))
(Intercept) vs1
Mazda RX4 1 1
Mazda RX4 Wag 1 1
Datsun 710 1 -1
Hornet 4 Drive 1 -1
Hornet Sportabout 1 1
Valiant 1 -1

Other contrasts, available in R, are Helmert and polynomial.
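For example, Helmert contrasts could be requested in the same way (a brief sketch; the object name glmh is introduced here for illustration; the fitted probabilities are unchanged, only the parametrization of the coefficients differs):

glmh <- glm(carbnew ~ vs, data = datanew,
            contrasts = list(vs = 'contr.helmert'), family = 'binomial')
coef(glmh)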

Change of contrasts in SPSS may be carried out by using a different menu to obtain a binary logistic regression model. Go to Analyze – Regression – Binary
logistic. Then take carbnew underneath Dependent, vs underneath Covariates,
click on Categorical and move vs underneath Categorical Covariates as seen in
Figure 5.3.12. From Figure 5.3.12 it should be noted that by default, contrasts in
SPSS are set as Indicator. For the estimates of the model coefficients obtained
through SPSS to match those obtained in R, contrast should be changed to
Deviation. This change is done from the dropdown list next to Contrast as in
Figure 5.3.13. It is important to press Change after the contrast method is
selected. Then press Continue and Ok. The coefficients obtained using SPSS are
shown in Table 5.3.8.

Figure 5.3.12

Figure 5.3.13

Table 5.3.8

5.3.2 The multinomial logistic regression model

The multinomial logistic regression model is a generalization of the binary logistic regression model, in the sense that, like the binary logistic model, it is
used to model the influence of a set of explanatory variables on a response
variable which is nominal (a categorical variable whose categories do not follow
a specific order). In the case of the multinomial logistic regression the nominal
response variable has more than two categories. If the response variable of
interest is an ordinal variable (a categorical variable whose categories do follow
a specific order), modelling should be carried out using the proportional odds
model. For more detail on the distinction between ordinal and nominal variables
refer to the introductory part of these lecture notes.

The correspondence between the multinomial logistic regression model and its
binary counterpart may be viewed directly from the model specification. Suppose
that the response variable Y is made up of K categories and let the first category
of the response variable be the reference category3. Then for a sample of size n,
the multinomial logistic regression model is given by:

$$g(\mu_{2i}) = \operatorname{logit}(\mu_{2i}) = \log\!\left(\frac{\mu_{2i}}{\mu_{1i}}\right) = \beta_{20} + \beta_{21} X_{i1} + \cdots + \beta_{2p} X_{ip}$$
$$\vdots$$
$$g(\mu_{Ki}) = \operatorname{logit}(\mu_{Ki}) = \log\!\left(\frac{\mu_{Ki}}{\mu_{1i}}\right) = \beta_{K0} + \beta_{K1} X_{i1} + \cdots + \beta_{Kp} X_{ip}$$

where $i = 1, \ldots, n$ and, for $k = 1, \ldots, K$, $\mu_{ki}$ is the probability that the ith outcome obtained on the response variable will be in the kth category. So, if modelling is
carried out using a K-level response variable, then the multinomial logistic
regression model is actually made up of K – 1 different binary logistic regression
models. Each binary logistic regression model obtained provides information on
the effect of the explanatory variables on the probability that the response variable
is equal to a specific category, always in comparison to a reference category. To
cater for the fact that the explanatory variables may affect each outcome category
differently, the model coefficients of the binary logistic regression models will
also be different.

3 The first category is the default reference category in R, whilst the last category is the default reference category in SPSS.
Example

Consider the iris dataset available in the inbuilt package datasets in R software.
This dataset gives measurements (in cm) of the sepal length, sepal width, petal
length and petal width of 150 flowers. These flowers have been classified into
three different species: iris setosa, versicolor or virginica. Our focus in this
section will be directed towards modelling the influence of a number of
explanatory variables on the nominal response variable (with three categories)
"species".

As for the binary logistic regression model, the explanatory variables used in a
multinomial logistic regression model should not lead to any multicollinearity
issues. Prior to fitting the model, multicollinearity diagnostics should thus be
carried out.

SPSS and R will now be used to conduct multicollinearity diagnostics on the four
explanatory variables, sepal length, sepal width, petal length and petal width, and
will then proceed to fit a multinomial logistic regression model to the data.

We will start off by showing the procedure in SPSS. Most of the interpretation
accompanying SPSS’ output also applies for the outputs obtained using R.

In SPSS

Open the file ‘irisdata.sav’ and get familiar with the dataset by looking at
descriptive statistics and plots of the variables being considered in the analysis.
This will help us to get a feel of the data being modelled and an insight into
possible relationships that might exist between the response and the explanatory
variables. Detail on how to achieve descriptive statistics and plots in SPSS has
been given in Sections 2 and 3.

The boxplot of Sepal Length versus Species, in Figure 5.3.14, shows that there
seems to be a difference in the sepal length due to the iris species. So, it looks
like it might be worth investigating this relationship further.

The scatter plot of Petal Width versus Petal Length, in Figure 5.3.15, shows that
there seems to be a linear relationship between the two variables. The Spearman
correlation coefficient for the two variables is in fact 0.938 with p-value < 0.05,
as shown in Table 5.3.9, showing that there is indeed a significant positive linear
relationship between the two. Note that Spearman correlation coefficient has
been used due to non-normality of the variables Petal Length and Petal Width (p-
value 0 < 0.05 achieved for both variables when testing for normality using the
Shapiro-Wilk test).

Figure 5.3.14

Figure 5.3.15
Table 5.3.9

Now recall that when fitting a generalized linear model, such as the multinomial
logistic regression model, the explanatory variables should not be collinear. The
fact that there is a linear relationship between petal width and petal length shows
that we might need to remove some of the variables from the iris dataset for our
model fitting. With large correlation values of 0.938, 0.882 and 0.834, the
correlation matrix in Table 5.3.9 also shows that we might have a serious issue
with multicollinearity.

As for the binary logistic regression model, the procedure that is used to detect
multicollinearity in the explanatory variables of a generalized linear model, is not
straightforward. For the purpose of these lecture notes, we will decide which
variables should be removed from the data based on the resulting correlation
coefficients and a trial and error model fitting approach.

Any correlation coefficient that is larger than 0.5 is deemed to be quite high in
relation to the identification of variables which lead to multicollinearity. As
already mentioned, there are three correlation coefficients that stand out. These
correlation coefficients correspond to the relationships Petal.Width ~
Petal.Length, Sepal.Length ~ Petal.Length and Sepal.Length ~ Petal.Width. All
of these correlations are much larger than 0.5 showing that we cannot fit any of
these three multinomial logistic regression models:

1. a model with both Petal.Width and Petal.Length as explanatory variables
2. a model with both Sepal.Length and Petal.Length as explanatory variables
3. a model with both Sepal.Length and Petal.Width as explanatory variables.

Multinomial logistic regression models with Species as response variable that might be considered at this point are the following:

4. having Sepal.Length and Sepal.Width as explanatory variables


5. having Petal.Length and Sepal.Width as explanatory variables
6. having Petal.Width and Sepal.Width as explanatory variables.

We proceed to fit multinomial logistic regression model 4, that is, using Sepal.Width and Sepal.Length as explanatory variables. A similar procedure to
what will be shown for this particular multinomial logistic regression model may
be used to fit multinomial logistic regression models 5 and 6. A comparison
between the fits provided by the three different models can then be entertained.

To fit the multinomial logistic regression model using Sepal Length and Sepal
Width as explanatory variable and Species as the response variable, go to Analyze
– Regression – Multinomial Logistic. The response variable is then moved
underneath Dependent and the explanatory variables Sepal Width and Sepal
Length are moved underneath Covariate(s). Should you wish to match the
resulting output to that obtained in R, the reference category in SPSS needs to be
changed to first as shown in Figure 5.3.16. Then press Ok.

Figure 5.3.16

Table 5.3.10

The warnings should be the first thing to get our attention once the SPSS output
is obtained. These warnings are shown in Table 5.3.10. Such warnings show that
our model is not the ideal one to use for the particular dataset. At this stage we
can thus consider two separate multinomial logistic models, one with
Sepal.Length as an explanatory variable and the other model with Sepal.Width as
an explanatory variable. Once these two models have been fitted, identification
of the best overall fitting model can be conducted by comparing the fit of these
two models with the fit of models 5 and 6 defined earlier.

For the purpose of these lecture notes, we will proceed by fitting solely a
multinomial logistic regression model with Species as the response variable and
considering Sepal Length as the explanatory variable (see Figure 5.3.17). From
Figure 5.3.17 it may also be noted that the reference category of the response
variable Species is changed from the default, last, to first so that the results
obtained in SPSS match with those obtained in R.

Figure 5.3.17

Also go to Statistics and choose Information Criteria, Cell probabilities,
classification table and goodness-of-fit, Continue, click on Save and choose
Predicted category and Predicted category probability as shown in Figures
5.3.18 and 5.3.19. Then press Continue and Ok.

Figure 5.3.18

Figure 5.3.19

Likelihood Ratio Tests
                  Model Fitting Criteria                  Likelihood Ratio Tests
Effect            -2 Log Likelihood of Reduced Model      Chi-Square   df   Sig.
Intercept         224.975                                 145.680       2   .000
Sepal.Length      226.811                                 147.516       2   .000
The chi-square statistic is the difference in -2 log-likelihoods between the final
model and a reduced model. The reduced model is formed by omitting an effect
from the final model. The null hypothesis is that all parameters of that effect are 0.

Table 5.3.11

From Table 5.3.11, a p-value of 0 < 0.05 for Sepal Length shows that the
coefficient corresponding to Sepal Length is significantly different from zero.

Model Fitting Information
                   Model Fitting Criteria     Likelihood Ratio Tests
Model              -2 Log Likelihood          Chi-Square   df   Sig.
Intercept Only     226.811
Final               79.295                    147.516       2   .000

Table 5.3.12

The fact that sepal length should be retained in the model is also corroborated by
the p-value 0 < 0.05 shown in the final row of Table 5.3.12. The result in Table
5.3.12 shows that the model with sepal length provides a better fit than the
intercept only model.

Recall that the lower the value of information criteria such as the AIC and BIC, the better the model fit. Similarly, the information criteria (AIC and BIC) shown in Table 5.3.13 also show that the introduction of sepal length in the model leads to an improvement in model fit; an AIC of 87.295 with sepal length in the model versus an AIC of 230.811 without sepal length in the model.

Table 5.3.13

The goodness of fit tests shown in Table 5.3.14 are then used to assess how well the model fits the data. The resulting p-values (0.789 and 0.999, both > 0.05) show that the model fits the data well.

Goodness-of-Fit
Chi-Square df Sig.
Pearson 56.591 66 .789
Deviance 34.838 66 .999

Table 5.3.14

Since there is no agreement as to whether pseudo R2 values should be used in conjunction with logistic regression models, information on pseudo R2 values will not be given here.

Note that goodness of fit measures such as the deviance (for nested models) and the AIC can also be used if you need to compare the performance of different generalized linear models. The lower the deviance, the better the fit of the model; similarly, the lower the value of the AIC, the better.

Having shown that by retaining sepal length in the model, we have a well-fitting
model, we now turn our attention to the resulting estimates, model specification
and interpretation. The first set of estimates shown in Table 5.3.15 are for the
model obtained for the category versicolor compared to the baseline/reference
category setosa and, similarly, the second set of estimates shown in Table 5.3.15
are for the model obtained for the category virginica compared to the
baseline/reference category setosa.

Table 5.3.15

From the resulting coefficients, we get:

$$\log\!\left(\frac{\mu_{2i}}{\mu_{1i}}\right) = \log\!\left(\frac{P(\text{versicolor})_i}{P(\text{setosa})_i}\right) = -26.08 + 4.82 \times \text{Sepal.Length}_i$$
$$\log\!\left(\frac{\mu_{3i}}{\mu_{1i}}\right) = \log\!\left(\frac{P(\text{virginica})_i}{P(\text{setosa})_i}\right) = -38.77 + 6.85 \times \text{Sepal.Length}_i$$

Using the first equation, a one-unit increase in the variable Sepal Length is
associated with an increase of 4.82 in the log odds of an iris being of the species
versicolor versus setosa. Similarly, from the second equation, a one-unit increase
in the variable Sepal Length is associated with an increase of 6.85 in the log odds
of an iris being of the species virginica versus setosa. A similar interpretation
may also be given in terms of the odds rather than the log odds of an iris being
either a versicolor or virginica versus a setosa. Should such an interpretation be
required, we need to consider the values, 123.43 and 940.49, obtained underneath
the heading Exp(B).

For our fitted model, we can also calculate the predicted (estimated) probabilities
for each of the levels of the response variable Species. In general:

$$\mu_{1i} = \frac{1}{1 + \sum_{j=2}^{K} \exp\!\left(\beta_{j0} + \beta_{j1} X_{i1} + \cdots + \beta_{jp} X_{ip}\right)}$$

$$\mu_{ki} = \frac{\exp\!\left(\beta_{k0} + \beta_{k1} X_{i1} + \cdots + \beta_{kp} X_{ip}\right)}{1 + \sum_{j=2}^{K} \exp\!\left(\beta_{j0} + \beta_{j1} X_{i1} + \cdots + \beta_{jp} X_{ip}\right)} \quad \text{for } k \neq 1$$

Using the above equations, three predicted probabilities may be calculated for
each species category for each iris. For each iris, SPSS takes note of the species
category that has the largest predicted probability associated with it and thus
assigns the particular iris to that category. The data view window shown in Figure
5.3.20 shows the largest predicted probabilities obtained for each iris and the
corresponding species category to which each iris has been assigned.

Figure 5.3.20

If interest lies in how many iris flowers have been correctly classified, we can
also look at the classification table:

Classification
Predicted
Observed setosa versicolor virginica Percent Correct
setosa 45 5 0 90.0%
versicolor 6 30 14 60.0%
virginica 1 12 37 74.0%
Overall Percentage 34.7% 31.3% 34.0% 74.7%
Table 5.3.16

Using the diagonal elements in the contingency table in Table 5.3.16, the number
of correctly classified iris amounts to 112 (45 + 30 + 37) showing that only
74.67% of the iris were correctly classified by means of this model.

As a final note, once a generalized linear model is fitted to the data we also need
to check the residuals. To obtain Pearson residuals for a multinomial logistic
regression we will need to resort to fitting what is known as a loglinear model
using the menu Analyze – Generalized Linear Models - Generalized Linear
Models. From the tab Type of Model choose Poisson Loglinear. The Pearson
residuals obtained for such a model are the same as those for the multinomial
logistic regression model.

In R/RStudio

As a first step, load the iris dataset and get familiar with the dataset by looking at
descriptive statistics and plots of the variables being considered in the analysis.
This will help us to get a feel of the data being modelled and an insight into
possible relationships that might exist between the response and the explanatory
variables.
head(iris) ## to view the first six rows of the data

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

summary(iris)
## note that Species is already defined as a factor

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu:2.800 1st Qu:1.600 1st Qu:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median:1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

From the summary measures it should be noted that Species is by default defined
as a factor, so there is no need to include any commands to turn the variable into
a factor.

attach(iris)

boxplot(Sepal.Length ~ Species, data = iris, main = "Box Plot"


, xlab = "Species", ylab = "Sepal Length")
## plotting a boxplot with Species as factor and Sepal Length
## as the covariate of interest

Figure 5.3.21: Boxplot of Sepal Length by Species

The boxplot of Sepal Length versus Species, in Figure 5.3.21, shows that there
seems to be a difference in the sepal length due to the iris species. So, it looks
like it might be worth investigating this relationship further.

plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red",


"green3","blue")[unclass(iris$Species)], main = "Edgar
Anderson's Iris Data")

Figure 5.3.22: Scatter plot of Petal Width versus Petal Length (Edgar Anderson's Iris Data)

The scatter plot of Petal Width versus Petal Length, in Figure 5.3.22, shows that
there seems to be a linear relationship between the two variables. The Spearman
correlation coefficient for the two variables is in fact 0.938 with p-value < 0.05,
showing that there is indeed a significant positive linear relationship between the
two. Note that Spearman correlation coefficient has been used due to non-
normality of the variables Petal Length and Petal Width (p-value 0 < 0.05
achieved for both variables when testing for normality using the Shapiro-Wilk
test).

cor.test(iris$Petal.Length,iris$Petal.Width,method='spearman')

Spearman's rank correlation rho

data: iris$Petal.Length and iris$Petal.Width


S = 35061, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9376668

Now recall that when fitting a generalized linear model, such as the multinomial
logistic regression model, the explanatory variables should not be collinear. The
fact that there is a linear relationship between petal width and petal length shows
that we might need to remove some of the variables from the iris dataset for our
model fitting. With large Spearman correlation values of 0.938, 0.882 and 0.834,
the correlation matrix below also shows that we might have a serious issue with
multicollinearity.

cor(iris[,-ncol(iris)],method='spearman')
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1667777 0.8818981 0.8342888
Sepal.Width -0.1667777 1.0000000 -0.3096351 -0.2890317
Petal.Length 0.8818981 -0.3096351 1.0000000 0.9376668
Petal.Width 0.8342888 -0.2890317 0.9376668 1.0000000

As for the binary logistic regression model, the procedure that is used to detect
multicollinearity in the explanatory variables of a generalized linear model, is not
straightforward. For the purpose of these lecture notes, we will decide which
variables should be removed from the data based on the resulting correlation
coefficients and a trial and error model fitting approach.

Any correlation coefficient that is larger than 0.5 is deemed to be quite high in
relation to the identification of variables which lead to multicollinearity.

As already mentioned, there are three correlation coefficients that stand out.
These correlation coefficients correspond to the relationships Petal.Width ~
Petal.Length, Sepal.Length ~ Petal.Length and Sepal.Length ~ Petal.Width. All
of these correlations are much larger than 0.5 showing that we cannot fit any of
these three multinomial logistic regression models:

1. a model with both Petal.Width and Petal.Length as explanatory variables


2. a model with both Sepal.Length and Petal.Length as explanatory variables
3. a model with both Sepal.Length and Petal.Width as explanatory variables.

Multinomial logistic regression models with Species as response variable that


might be considered at this point are the following:

4. having Sepal.Length and Sepal.Width as explanatory variables


5. having Petal.Length and Sepal.Width as explanatory variables
6. having Petal.Width and Sepal.Width as explanatory variables.

We proceed to fit multinomial logistic regression model 4, that is, using Sepal Length and Sepal Width as explanatory variables. A similar procedure to what
will be shown for this particular multinomial logistic regression model may be
used to fit multinomial logistic regression models 5 and 6. A comparison between
the fits provided by the three different models can then be entertained.

The multinomial logistic regression model may be fitted using different functions
in R. Here the multinom function from the package nnet is used:

library(nnet)
mult <- multinom(Species ~ Sepal.Width + Sepal.Length, data =
iris)
## fitting a multinomial logistic regression model with Sepal
## Width and Sepal Length as explanatory variables
# weights: 12 (6 variable)
initial value 164.791843
iter 10 value 62.715967
iter 20 value 59.808291
⋮ ⋮
iter 90 value 55.230241
iter 100 value 55.212479
final value 55.212479
stopped after 100 iterations
## always check that the final line shows the word converged

mult <- multinom(Species ~ Sepal.Width + Sepal.Length, maxit =
1000, data=iris)
## fitting a multinomial logistic regression model with Sepal
## Width and Sepal Length as explanatory variables and
## increasing the number of iterations from the default of 100
## to 1000

# weights: 12 (6 variable)
initial value 164.791843
iter 10 value 62.715967
iter 20 value 59.808291
⋮ ⋮
iter 220 value 55.186513
iter 230 value 55.184474
final value 55.184137
converged

summary(mult)
Call:
multinom(formula = Species ~ Sepal.Width + Sepal.Length, data
= iris,
maxit = 1000)

Coefficients:
(Intercept) Sepal.Width Sepal.Length
versicolor -106.8890 -38.18227 42.24345
virginica -119.9312 -37.77782 44.14532

Std. Errors:
(Intercept) Sepal.Width Sepal.Length
versicolor 20.60839 43.18899 18.20115
virginica 20.65106 43.19554 18.19642

## note that the resulting standard errors corresponding to


## Sepal.Width are both greater than half the resulting
## estimates showing that this model might not be the ideal
## one to use with this dataset
Residual Deviance: 110.3683
AIC: 122.3683

The output obtained using the function multinom includes details on the iteration
history. Always check that you get the word converged in the final line of the
iteration history, otherwise the resulting estimates are not reliable. You might
need to increase the number of iterations to achieve convergence. The function
multinom calls the function nnet which sets the default number of iterations at
100. Including the command maxit=1000 has solved our problem of lack of
convergence. Otherwise, you might need to scrap the model being fitted and use
an alternative.

Also check that the resulting standard errors shown in the summary output are
less than half the estimates of the model coefficients. In this example, the
standard errors corresponding to Sepal.Width are both large. The fact that such
large standard errors were obtained shows that our model is not the ideal one to
use for the particular dataset. At this stage we can thus consider two separate
multinomial logistic models, one with Sepal.Length as an explanatory variable
and the other model with Sepal.Width as an explanatory variable. Once these two
models have been fitted, identification of the best overall fitting model can be
conducted by comparing the fit of these two models with the fit of models 5 and
6 defined earlier.

For the purpose of these lecture notes, we will proceed by fitting solely a
multinomial logistic regression model with Species as the response variable and
considering Sepal Length as the explanatory variable.

mult2 <- multinom(Species ~ Sepal.Length, data=iris)


## fitting a multinomial logistic regression model with Sepal
## Length as explanatory variable
# weights: 9 (4 variable)
initial value 164.791843
iter 10 value 91.337114
iter 20 value 91.035008
final value 91.033971
converged
## always check that the final line shows the word converged

summary(mult2)
Call:
multinom(formula = Species ~ Sepal.Length, data = iris)

Coefficients:
(Intercept) Sepal.Length
versicolor -26.08339 4.816072
virginica -38.76786 6.847957
Std. Errors:
(Intercept) Sepal.Length
versicolor 4.889635 0.9069211
virginica 5.691596 1.0223867
## note that in this case the standard errors are less than
## half the estimates of the model coefficients, as desired

Residual Deviance: 182.0679


AIC: 190.0679

Having reached convergence and since we have no issues with large standard
errors, we can proceed to interpret the resulting output. As part of the multinom
function output we get the final value (91.04).

If this value is multiplied by 2, it is equivalent to the residual deviance (182.07),
the latter shown in the summary of the model.

In the previous section, we have seen how the deviance is a measure of goodness
of fit of a generalized linear model. The residual deviance is a summary measure
for the fitted model. The higher the deviance, the worse the fit of the model.

The deviance may thus be used to compare nested models, for example the fitted
model versus the null model. The null model is the model in which all terms are
excluded except for the intercept. For this example, the residual deviance
obtained when fitting an intercept only model is 329.58, showing that by
including sepal length in the model, the deviance has decreased from 329.58 to
182.07.

mult3 <- multinom(Species ~ 1 , data=iris)


## fitting an intercept only model
# weights: 6 (2 variable)
initial value 164.791843
final value 164.791843
converged

summary(mult3)
Call:
multinom(formula = Species ~ 1, data = iris)

Coefficients:
(Intercept)
versicolor -2.122746e-16
virginica 4.973799e-16

Std. Errors:
(Intercept)
versicolor 0.2
virginica 0.2

Residual Deviance: 329.5837


AIC: 333.5837
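The two residual deviances can also be pulled straight from the fitted objects (a small sketch; mult2 and mult3 are the models fitted above):

c(intercept_only = mult3$deviance, with_sepal_length = mult2$deviance)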

We can check whether the change in deviance is significant, and hence decide whether it is worth keeping the variable sepal length in the model rather than only the intercept, by comparing the residual deviances using the function anova() with the option test='Chisq' as shown below:

anova(mult3, mult2, test = 'Chisq')

  Model          Resid. df   Resid. Dev   Test     Df   LR stat.   Pr(Chi)
1 1                    298     329.5837             NA         NA        NA
2 Sepal.Length         296     182.0679   1 vs 2     2   147.5157         0

With a p-value (Pr(Chi)) of 0 < 0.05, the change in deviance is significant, showing that the introduction of sepal length in the model led to an improvement
in model fit. The improvement in model fit with the introduction of sepal length
in the model may also be noticed by comparing the AIC values of the two fitted
models. The AIC values for the models with and without sepal length are 190.07
and 333.58 respectively. The lower the AIC value the better, favouring the
inclusion of sepal length in the model.
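
Rather than reading the AIC values off the two summaries, they can also be
obtained in a single call (a quick check using the standard AIC function on the
two fitted objects):

AIC(mult3, mult2)
## returns a small table with the degrees of freedom and AIC of each model;
## the AIC column should show 333.5837 for mult3 and 190.0679 for mult2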

The fact that sepal length should be retained in the model is also corroborated
by the p-values corresponding to the estimates of the model coefficients (both
approximately 0 and hence < 0.05). If the multinomial logistic regression model is fitted in R using the
function multinom, these p-values are not obtained automatically. Alternative
fitting functions do provide such an output but they are generally more tedious to
work with. When using multinom, p-values may be obtained by using a few extra
commands as follows:

z <-summary(mult2)$coefficients/summary(mult2)$standard.errors
## working out values for the z-statistic so that we can
## calculate p-values accordingly
z
(Intercept) Sepal.Length
versicolor -5.334424 5.310353
virginica -6.811422 6.698011

p <- (1 - pnorm(abs(z), 0, 1)) * 2


## obtaining p-values for the model coefficients

p
(Intercept) Sepal.Length
versicolor 9.584830e-08 1.094128e-07
virginica 9.663825e-12 2.112754e-11

Note that goodness of fit measures such as the deviance (for nested models) and
the AIC can also be used if you need to compare the performance of different
generalized linear models.

The lower (higher) the deviance, the better (worse) the fit of the model.
Similarly, the lower the value of the AIC, the better the model.

Having decided to retain sepal length in the model, we now turn our attention to
the resulting estimates, model specification and interpretation. The first line of
the resulting coefficients gives estimates which correspond to the category
versicolor being compared to the baseline/reference category setosa and
similarly, the estimates in the second line correspond to the category virginica
being compared to the baseline/reference category setosa.

From the resulting coefficients we get:

$$\log\left(\frac{\mu_{2i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{versicolor})_i}{P(\text{setosa})_i}\right) = -26.08 + 4.82 \times \text{Sepal.Length}_i$$

$$\log\left(\frac{\mu_{3i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{virginica})_i}{P(\text{setosa})_i}\right) = -38.77 + 6.85 \times \text{Sepal.Length}_i$$

Using the first equation, a one-unit increase in the variable Sepal Length is
associated with an increase of 4.82 in the log odds of an iris being of the species
versicolor versus setosa. Similarly, from the second equation, a one-unit increase
in the variable Sepal Length is associated with an increase of 6.85 in the log odds
of an iris being of the species virginica versus setosa. A similar interpretation
may also be given in terms of the odds rather than the log odds of an iris being
either a versicolor or virginica versus a setosa. Should such an interpretation be
required, we just need to exponentiate the resulting coefficients as follows:

exp(coef(mult2)) ## exponentiating the coefficients of the fitted model

(Intercept) Sepal.Length
versicolor 4.700338e-12 123.4791
virginica 1.456567e-17 941.9549

For our fitted model, we can also calculate the predicted (estimated) probabilities
for each of the levels of the response variable Species. In general:

$$\mu_{1i} = \frac{1}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)}$$

$$\mu_{ki} = \frac{\exp\left(\beta_{k0} + \beta_{k1} X_{i1} + \dots + \beta_{kp} X_{ip}\right)}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)} \quad \text{for } k \neq 1$$

In R, these estimates may be obtained using the command:

predicted <- predict(mult2, newdata = iris, type = "probs")
## predicted probability of each species, for every flower in the data
head(predicted)

setosa versicolor virginica
1 0.8065656 0.17615515 0.0172792091
2 0.9184406 0.07655755 0.0050018332
3 0.9678683 0.03079176 0.0013399493
4 0.9800536 0.01926233 0.0006840989
5 0.8728078 0.11776464 0.0094275696
6 0.4776903 0.44246619 0.0798435069
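
As a quick sanity check of the formulae above (a small sketch that is not part
of the standard output), the first row of predicted probabilities can be
reproduced manually from the fitted coefficients:

b <- coef(mult2) ## 2 x 2 matrix of estimates: rows versicolor, virginica
x1 <- iris$Sepal.Length[1] ## sepal length of the first flower (5.1)
eta <- b[, "(Intercept)"] + b[, "Sepal.Length"] * x1
denom <- 1 + sum(exp(eta))
c(setosa = 1/denom, exp(eta)/denom)
## should match the first row above: approximately 0.807, 0.176, 0.017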

The predict command may also be used to obtain the predicted category:

predicted <- predict(mult2, newdata = iris, type = "class")
head(predicted)

[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica

If interest lies in how many iris flowers have been correctly classified, we can
look at a contingency table showing the predicted categories versus the observed
categories:

conttable<-with(iris,table(predicted,Species))

conttable
Species
predicted setosa versicolor virginica
setosa 45 6 1
versicolor 5 30 12
virginica 0 14 37
To include the percentage of correctly classified flowers in the contingency table,
use the commands:
PercCorrect <- round(diag(conttable)/rowSums(conttable),2)
## rounding the resulting output to two decimal places

PercCorrect
setosa versicolor virginica
0.87 0.64 0.73

finaltable<-cbind(conttable,PercCorrect)
## including the final column in the table

finaltable
setosa versicolor virginica PercCorrect
setosa 45 6 1 0.87
versicolor 5 30 12 0.64
virginica 0 14 37 0.73

Using the diagonal elements of the contingency table conttable, the number of
correctly classified iris flowers amounts to 112 (45 + 30 + 37), showing that
only 74.67% (112 out of 150) of the flowers were correctly classified by this
model.
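
This overall accuracy can also be computed directly from the contingency table:

sum(diag(conttable)) / sum(conttable)
## [1] 0.7466667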

As a final note, once a generalized linear model is fitted to the data we also need
to check the residuals. To obtain Pearson residuals for a multinomial logistic
regression we will need to resort to fitting what is known as a loglinear model
using the glm function in R (the same function that has been used to fit the binary
logistic regression model; for this purpose specify family poisson instead of
binomial). The Pearson residuals obtained for such a model are the same as those
for the multinomial logistic regression model.
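
A minimal sketch of this approach is given below, assuming the data are first
expanded to one row per flower and species category; the objects longiris,
SL.versicolor and SL.virginica are introduced here purely for illustration. A
Poisson loglinear model with a flower-specific factor (id) and category-specific
terms corresponds to the multinomial logistic model fitted above, and its
Pearson residuals are the ones referred to in the text:

n <- nrow(iris)
longiris <- data.frame(
  id = factor(rep(1:n, times = 3)), ## one row per flower and species category
  Species = factor(rep(levels(iris$Species), each = n),
                   levels = levels(iris$Species)),
  Sepal.Length = rep(iris$Sepal.Length, times = 3)
)
## count = 1 for the observed species of each flower and 0 otherwise
longiris$count <- as.numeric(rep(iris$Species, times = 3) == longiris$Species)
## category-specific slopes for the two non-baseline species
longiris$SL.versicolor <- longiris$Sepal.Length * (longiris$Species == "versicolor")
longiris$SL.virginica <- longiris$Sepal.Length * (longiris$Species == "virginica")
## Poisson loglinear model with flower-specific intercepts
loglin <- glm(count ~ id + Species + SL.versicolor + SL.virginica,
              family = poisson, data = longiris)
pearson.res <- residuals(loglin, type = "pearson") ## Pearson residuals
head(pearson.res)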
