Statistical Analysis Using SPSS and R - Chapter 5
STATISTICAL MODELS
This section will be dedicated to reviewing some of the most popular statistical
modelling techniques.
Regression is a technique that may be used to investigate the effect of one or more
predictor (independent) variables on an outcome (dependent) variable, which
needs to be a continuous variable.
In this section we look into linear regression models. We start by discussing multiple
linear regression models in which all predictors are quantitative (covariates). We
then consider two special adaptations of the multiple linear regression model, ANOVA
and ANCOVA, where the former has only qualitative predictors (factors)
whereas the latter admits both qualitative (factors) and quantitative
predictors (covariates) in the model.
As its name implies, in a linear regression model the relation of the response
variable to the explanatory variables is assumed to be a linear function of some
parameters. As mentioned earlier, the response variable should be a continuous
variable and the explanatory variables should be quantitative variables: discrete
or continuous.
More specifically, in multiple linear regression analysis, for any sample of size n,
we assume a relation of the type:

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βp Xip + εi,   i = 1, ..., n,
where Yi denotes the response or dependent variable for the ith entity (or ith
observation in the sample) and the Xij's denote the explanatory or independent
variables (predictors) for the ith entity. The βj's, j = 0, ..., p, are the unknown
parameters to be estimated and the εi 's are the error terms. A fundamental
assumption in classical multiple linear regression theory is that the error terms
{εi }i=1,…,n are independent and normally distributed with mean zero and unknown
standard deviation σ .
The residuals of the fitted model are defined as ri = yi − ŷi, for i = 1, ..., n, where ŷi denotes the ith fitted value.
One of the most popular estimators is the ordinary least squares (OLS) estimator,
which will be the protagonist of this section. When the explanatory variables are
strongly correlated with one another (a problem known as multicollinearity), the
OLS estimator suffers from a number of drawbacks:
• the OLS regression parameter estimates are not unique - infinite solutions
all yielding the same fitted values
• it tends to produce models that fit the sampled data perfectly but fail to
predict new data well, a phenomenon known as over-fitting
• the variance of the OLS regression estimates may be unacceptably large
• instability of the OLS estimates
• conflicting conclusions from usual significance tests and the possibility
that the OLS coefficient estimates exhibit different algebraic signs to what
is expected from theoretical considerations.
Multicollinearity is not the only characteristic that has a negative effect on the
OLS estimator. If outliers are present in the data these can also have a negative
effect on the OLS estimator. An essential step in regression analysis therefore involves
identifying outliers and assessing whether they should be removed from the data set.
In this section we shall see how one can fit a multiple linear regression model
using SPSS and R. We shall also discuss multicollinearity diagnostics and outlier
diagnostics.
In order to fit a multiple linear regression model, the data to be analysed needs to
satisfy the following assumptions:
2. A linear relationship exists between the dependent variable and each of the
independent variables - this can be checked by visually inspecting
scatterplots of the dependent variable against each of the independent
variables and by computing correlation coefficients for each pair, as
explained earlier in Chapter 4. In fact, regression analysis should always
be preceded by correlation analysis.
MULTICOLLINEARITY DIAGNOSTICS
existing diagnostics. Here we shall consider three of the most popular diagnostics,
namely: pairwise correlations, condition indices and variance proportions. A
brief description of each will be presented next. The main reference for this
section is Myers (1990).
2. Condition Indices (CIs): The condition indices are the ratios of the
maximum eigenvalue of the sample correlation matrix of the explanatory (or
independent) variables, which we shall denote by R, to each of the other
eigenvalues¹. So there are as many condition indices as there are eigenvalues. The
eigenvalues are typically ordered in descending order, so that the condition indices
appear in ascending order. The jth condition index is defined by

CIj = λmax / λj

where λmax and λj correspond to the largest and jth eigenvalues of R, respectively.
(A small R sketch of this calculation is given after this list of diagnostics.)
The last condition number, which corresponds to the ratio of the largest to the
smallest eigenvalue of the sample correlation matrix is usually considered as a
diagnostic on its own and is known as the condition number.
Belsley et al. (1980) observe that the number of large (> 5) condition indices
corresponds to the number of near dependencies in the columns of the data matrix
X. Values between 5 and 10 indicate the presence of weak dependencies while
values greater than 30 indicate the presence of moderate or strong relations
between the explanatory variables. A condition number which is much greater
than 30 is an indication of serious multicollinearity while values between 5 and
10 indicate that weak dependencies may have some effect on the regression
estimates.
1 For interested users not familiar with such mathematical concepts, more details can be found in linear algebra text books.
3. Variance Proportions: The variance proportions of the explanatory
variables measure the proportion of the variance of each regression coefficient
explained by each of the eigenvalues of the sample correlation matrix R. A large
variance proportion for a variable means that the variable has a linear dependency
with other explanatory variables. As a general rule of thumb, Belsley et al. (2004)
suggest that if the variance proportion is greater than 0.5 then the variables in
question are characterised by a dependency.
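As a complement to the description of the condition indices above, the following is a minimal R sketch (using the formula stated earlier and the mtcars predictors analysed later in this section) of how they could be computed directly from the eigenvalues of the correlation matrix. Note that SPSS and the colldiag function used later compute their indices from the scaled design matrix including the constant term, so the values they report will differ from this direct textbook-formula calculation.

## Illustrative sketch only: condition indices from the eigenvalues of R
X <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]
lambda <- eigen(cor(X))$values     # eigenvalues of R, largest first
ci <- max(lambda) / lambda         # condition indices as defined above
ci
max(ci)                            # the condition number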
• From a set of correlated variables retain the one which has the largest
correlation with the dependent variable and remove the rest from the
model.
• Use an alternative estimator to the OLS such as Ridge, LASSO, PCR and
PLS to mention a few. (Such methods are only for more advanced students
and will not be covered here.)
Regression outlier diagnostics are statistics that are used to identify any
observations that might have a large influence on the estimated regression
coefficients. Observations that exert such influence are called influential points.
• Leverage values: observations whose leverage values are greater than 2p/n,
where p is the number of parameters in the model and n is the number of
observations, are considered to be potentially influential points.
Example: Consider the data set "mtcars" available in the R environment which
was taken from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32
automobiles (1973–74 models). This data set consists of 32 observations on 11
variables. In the table below one finds more information about the variables in
this data set.
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Data source: Henderson and Velleman (1981), Building multiple regression models
interactively. Biometrics, 37, 391–411.
We start the analysis by checking if the first three of the assumptions underlying
the multiple regression model, listed earlier, are satisfied.
The response variable and all explanatory variables are clearly covariates. By
looking at the scatter plots in Figure 5.1.1 we note that these plots suggest that a
linear relationship exists between mpg and each of the predictor variables. To
confirm this one needs to consider correlation coefficients (refer to section 4.5.1).
To decide which correlation coefficient should be used using R, bivariate
normality was tested (by means of the command mvnorm.etest from the
package energy) for each possible pair of covariates. The bivariate normality
assumption was never satisfied hence Spearman correlation coefficients will be
considered here.
From Table 5.1.1 we note that qsec is positively correlated with mpg while all
other predictor variables are strongly negatively correlated with mpg. Thus, we
can conclude that all the independent variables are linearly correlated with mpg
(our response variable).
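The commands used to produce Table 5.1.1 are not reproduced in this extract; a minimal sketch of how such a Spearman correlation matrix can be obtained in R is the following.

## Spearman correlations between mpg and the candidate predictors
round(cor(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")],
          method = "spearman"), 3)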
In SPSS
In SPSS condition indices and the condition number are given as part of the
output produced when fitting a multiple linear regression (MLR) model. So go
to Analyze-Regression-Linear. Move mpg into the Dependent field and move
disp, hp, drat, wt and qsec into the Independent(s) box under Block 1 of 1, as
shown in Figure 5.1.2.
Figure 5.1.2
Figure 5.1.3
Press Continue then OK. The following is part of the output obtained:
From Table 5.1.2 we note that the last four condition indices (9.946, 21.905,
32.009, 71.166) are greater than 5, indicating that there are four near dependencies
in the columns of the data matrix. The condition number (the largest condition
index) is equal to 71.166, which is greater than 30.
This indicates the presence of strong dependencies, which is not surprising since,
when inspecting the Spearman correlation coefficients, we saw that, for example,
disp is strongly correlated with drat and wt.
From Table 5.1.2 we note that the last three eigenvalues are also showing serious
or near serious dependencies, since their values are very close to 0. We then
consider the variance proportions corresponding to these three eigenvalues. By
applying the 0.5 bench mark suggested in the literature, we note that the 4th
eigenvalue (0.012) represents a dependency which is causing damage to the
parameter estimate for disp (0.77 > 0.5), while the 5th eigenvalue (0.005)
represents a dependency which is causing damage to the parameter estimate for
drat (0.61 > 0.5). The 6th eigenvalue (0.001) represents a dependency which is
causing damage to the parameter estimate for qsec (0.83 > 0.5) and the constant
term (0.97 > 0.5).
In R/RStudio
In R/ RStudio the values presented in Table 5.1.2 are computed by means of the
command colldiag in the package perturb. Full commands are given below:
## Condition indices
## Multicollinearity Diagnostic 3 - Condition Number and
## Condition Indices: 5-10 weak dependencies, >30 severe
## multicollinearity
colldiag(as.matrix(mtcars[,3:7]))
Condition
Index Variance Decomposition Proportions
intercept disp hp drat wt qsec
1 1.000 0.000 0.001 0.001 0.000 0.000 0.000
2 4.393 0.000 0.027 0.021 0.007 0.001 0.002
3 9.946 0.000 0.053 0.330 0.012 0.037 0.003
4 21.905 0.001 0.772 0.182 0.102 0.392 0.007
5 32.009 0.032 0.012 0.142 0.606 0.449 0.159
6 71.166 0.966 0.136 0.324 0.272 0.120 0.829
As you can note, the values are exactly the same as those shown in Table 5.1.2
and hence the same interpretation holds.
correlated with wt and drat, one can fit two different MLR models, one with mpg
as response and qsec and wt as predictors and the other model with mpg as
response and qsec and drat as predictors.
Next we shall fit a MLR with mpg as response and wt and qsec as predictors.
In SPSS
Figure 5.1.4
From the tab labelled Save tick the boxes which are ticked in Figure 5.1.5 below.
Press Continue.
Figure 5.1.4
In the tab labelled Plots move ZRESID (standardized residuals) into the Y-axis box
and ZPRED (standardized predicted values) into the X-axis box. This will produce the
scatter plot needed to check if assumption 7 is satisfied.
Figure 5.1.5
Table 5.1.3
In the column labelled “R” one finds the value of the multiple correlation
coefficient which provides a measure of the predictive ability of the model (that
is, how well it can estimate the dependent variable). The closer this value is to 1
the better the prediction. In this example R=0.909 which indicates a very good
level of prediction. In the column labelled “R Square” one finds the value of the
coefficient of determination, which measures the proportion of the variation in
the dependent variable that can be accounted for by the variables included in the
regression model. The closer this value is to 1 the better. You can see from our
value of 0.826 that our independent variables explain 82.6% of the variability of
our dependent variable.
As we have seen earlier on, one of the assumptions of regression is that the
residuals are independent (assumption 4). This assumption is tested by the
Durbin-Watson statistic. A rule of thumb is that test statistic values which lie
outside the range of 1.5 to 2.5 could be cause for concern. Some authors suggest
that values under 1 or more than 3 are a definite cause for concern. In this
example, the value of the Durbin-Watson statistic is 1.496 which is approximately
equal to 1.5 hence we can consider the assumption of independent residuals to be
satisfied.
The next part of the output contains an ANOVA (analysis of variance) table. This
table tests the following hypothesis
H0: Model with only a constant term (intercept only model) is a good fit for
the data
H1: Model fitted (which includes covariates) fits better than the model with
only the intercept term
Table 5.1.4
From Table 5.1.4 we note that the p-value (sig) is less than 0.05 which shows that
the model with qsec and wt as independent variables is an adequate model.
The output that follows is concerned with the estimation of the parameters of the
model.
Table 5.1.5
Y = β0 + β1 X1 + β2 X 2 + ε
where Y = mpg, X 1 = wt and X 2 = qsec . In the column labelled “B” one finds the
estimates for the β ' s . In the column labelled “Sig” one finds the p-values for the
following hypothesis tests:
H 0 : βi = 0
H1 : βi ≠ 0
Since all the p-values are less than 0.05 we reject H0 in all three cases and hence
the fitted model is the following:

mpg = 19.746 − 5.048 · wt + 0.929 · qsec
Table 5.1.6
Table 5.1.6 provides some descriptive statistics for a series of outlier detection
methods. We shall focus only on those discussed earlier on in this section.
Starting with the Leverage values (last row of the table), in this model the number
of parameters p = 3 and the sample size n = 32, hence the cutoff value for
identifying potential influential points is 2p/n = 2(3)/32 ≈ 0.19. From Table 5.1.6 we
note that the maximum leverage value is 0.264 which is greater than the cutoff
value, thus indicating the presence of potential influential points. To identify
these points we need to go to the data file, where the leverages values have been
saved, and sort the column containing the leverages (labelled LEV_1) in
ascending order (this will sort the entire data file) and identify those values which
are greater than 0.19. As you can see from Figure 5.1.9 there is only one data
point for which the leverage value is greater than 0.19. But before we conclude
that this is an influential point we need to look at the other outlier diagnostics.
Figure 5.1.6
The descriptive statistics for the studentized residuals are found in Table 5.1.6 in
the row labelled Stud.Residual and we note that the maximum value is slightly
greater than 2 indicating the presence of potential outliers. Upon inspecting the
studentized residuals in the data view we note that there are three points for which
these residuals are slightly greater than 2.
The only point that has been identified as a potential outlier by more than one diagnostic
is Merc 230 and hence this will be considered an outlier. One way of proceeding
is to remove this data point from the sample and repeat the regression analysis on
the new reduced data set. If the parameter estimates change considerably, then
the point is an influential point and should not be included in the analysis. On the
other hand, if the parameter estimates change only slightly, the point is not
influential and can be retained. This step will be left as an exercise.
The last output shown here is a scatter plot of the standardized residuals versus
the standardized fitted (or predicted) values, which is inspected to check that the
residuals are homoscedastic, that is, that the variance of the residuals is
homogeneous across levels of the predicted values. If the model is well fitted,
there should be no pattern in the residuals plotted against the fitted values.
Figure 5.1.7
From Figure 5.1.7 we note that the points seem to be scattered randomly without
any pattern, hence we can assume that the residuals are homoscedastic.
One final goodness of fit check is to verify whether the assumption that the residuals
follow a normal distribution is satisfied. This is done by applying the Shapiro-Wilk
test of normality to the standardized or studentized residuals. The p-value
obtained when this test is applied to the studentized residuals is 0.119 > 0.05,
hence the assumption of normality is satisfied.
Earlier we had mentioned the Adjusted R square. Next we shall explain how this
statistic is used to choose between competing models.
Figure 5.1.8
Go back to Analyze-Regression-Linear. Previously in the tab labelled Method we
retained the default selection which is Enter. This time change it to Stepwise. (See
Figure 5.1.7)
Two regression models will be fitted; a simple regression model with only wt as
predictor and a multiple linear regression model with wt and qsec as predictors.
From Table 5.1.7 it may be noted that in the model summary table now we have
values for the two models. Note that the Adjusted R square value for the model
with two predictors is larger than that of the model with one predictor hence the
model with two predictors is considered to be a better model for estimating
unknown values of the response. The p-values shown in Table 5.1.5 in fact
showed that both predictors should be retained in the model.
Table 5.1.7
In R/RStudio
Using R to obtain the Model Summary table, the ANOVA table and the Parameter
Estimates table (as in Tables 5.1.3, 5.1.4 and 5.1.5):
Commands:
#Regression Analysis
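The commands themselves have not survived in this extract; judging from the Call shown in the output below and the object names used later in this section, they were presumably along the following lines (x, y and model1 are the assumed object names):

x <- mtcars[, c("wt", "qsec")]   # explanatory variables
y <- mtcars$mpg                  # response variable
model1 <- lm(y ~ ., data = x)    # fit the multiple linear regression model
summary(model1)                  # model summary and parameter estimates
aov(model1)                      # ANOVA table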
Output:
Call:
lm(formula = y ~ ., data = x)
Residuals:
Min 1Q Median 3Q Max
-4.3962 -2.1431 -0.2129 1.4915 5.7486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.7462 5.2521 3.760 0.000765 ***
wt -5.0480 0.4840 -10.430 2.52e-11 ***
qsec 0.9292 0.2650 3.506 0.001500 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
aov(formula = model1)
Terms:
wt qsec Residuals
Sum of Squares 847.7252 82.8583 195.4636
Deg. of Freedom 1 1 29
The part of the output that summarises the model fit (residual summary and
R-squared) contains the model summary information (see Table 5.1.3), the aov()
output corresponds to the ANOVA test (see Table 5.1.4) and the Coefficients
output corresponds to the parameter estimates (see Table 5.1.5).
To obtain the predicted values and the residuals using the fitted model:
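The commands are not shown in this extract; a minimal sketch, assuming the fitted object is model1 as above, is:

fitted_vals <- predict(model1)   # predicted (fitted) values
res <- residuals(model1)         # raw residuals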
Here we will use the Durbin-Watson test for autocorrelations amongst the
residuals.
H 0 : There is no correlation between the residuals, that is, the residuals are
independent
H1 : There is correlation between the residuals, that is, the residuals are not
independent
library(lmtest)
dwtest(model1)
Output :
Durbin-Watson test
data: model1
DW = 1.496, p-value = 0.04929
alternative hypothesis: true autocorrelation is greater than 0
plot(predict(model1),residuals(model1),main="Plot of Fitted vs
Residual")
The following plot is obtained:
Figure 5.1.9
From Figure 5.1.9 it would seem that the variance of the residuals is roughly constant
across the fitted values. However, this kind of visual assessment is subjective, so we
may also consider the Breusch-Pagan test. This test is available in R but not in SPSS.
In R this test is conducted using the command bptest from the package lmtest.
bptest(model1)
Output:
studentized Breusch-Pagan test
data: model1
BP = 3.0858, df = 2, p-value = 0.2138
Since the resulting p-value > 0.05, we cannot reject the null hypothesis of
homoscedasticity. This means that the assumption of constant residual variance is satisfied.
OUTLIER DIAGNOSTICS
Mahalanobis Distances:
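The commands used to compute the Mahalanobis distances are not reproduced in this extract. A sketch of one way to obtain them in R is given below; the chi-square cutoff used for flagging points is an assumption, not necessarily the rule applied in the original analysis.

# Mahalanobis distances of the observations in the predictor space (wt, qsec)
X <- as.matrix(mtcars[, c("wt", "qsec")])
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))
# flag unusually distant points, e.g. against the 97.5% chi-square quantile
which(md > qchisq(0.975, df = 2))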
Only one point which is found in the 9th row of the data set and belongs to the
car model Merc 230 is flagged as a potential outlier.
Leverages:
# Outlier Detection
Leverages <- hatvalues(model1,type='rstandard')
Identifying the outliers:
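The command itself is not reproduced in the extract; given the output that follows, it was presumably of the form:

# observations whose leverage exceeds the 2p/n cutoff (2*3/32, approx. 0.19)
which(Leverages > 2 * 3 / 32)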
This last command gives the following list of car models and respective row
number.
Merc 230 Lincoln Continental
9 16
Cook’s distances:
# Cook's distance
Cook <- cooks.distance(model1,type='rstandard')
# It is important to set the type of residuals as 'rstandard' #
for results to match those obtained in SPSS
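The remainder of the Cook's distance check is not shown in this extract; a minimal way of flagging influential points against the commonly used cutoff of 1 is:

# observations with Cook's distance above the usual cutoff of 1
which(Cook > 1)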
5.2 General Linear Models
In this section we shall look at ANOVA and ANCOVA models done through
both SPSS and R. This is an extension of multiple regression, which involves
having both factors and covariates as predictors. The important question which
arises at this stage is, how are factors represented within a regression equation?
Suppose we have one factor, say gender, with the two main levels being male and
female, in addition to another p covariates. Then we explain gender by what we
call a dummy variable regressor G where G = 1 if the gender is male and G = 0 if
female. Then the regression equation becomes:

Yi = β0 + β1 Xi1 + ... + βp Xip + γG G + εi.

A factor with more than two levels is represented by more than one dummy regressor.
For example, an age-group factor with the three levels 18-34, 35-59 and 60+ can be
coded using two dummy regressors A1 and A2 as follows:

          A1   A2
18-34      1    0
35-59      0    1
60+        0    0
Apart from that, we can also include interactions of factors. The following is an
example of a regression equation which, apart from the covariates, includes both
gender and age groups as main effects, and the interactions between the two:
Yi = E[Yi | Xi1, ..., Xip] + εi
   = β0 + β1 Xi1 + ... + βp Xip + γG G + γ1A A1 + γ2A A2 + γ1GA G·A1 + γ2GA G·A2 + εi,   for i = 1, ..., n
When analysing data using ANOVA and ANCOVA, the following assumptions
must be made:
Assumption 5: The slope of the dependent variable versus any covariate must be
homogeneous across all levels of all factors.
In SPSS
The file nachr.sav contains data for 25 individuals regarding brain density of
nicotinic acetylcholine receptors (nAChR) and age, the presence (or otherwise)
of schizophrenia and the presence (or otherwise) of a smoking habit.
We start by plotting a box plot of nAChR with respect to the factors, and a scatter
plot of age and nAChr (see Figure 5.2.1 and Figure 5.2.2)
Figure 5.2.1
In the box plot we see that there is a larger difference in the location of nAChR
for smokers/non-smokers than there is for the schizophrenia category. The
smokers have a higher median than the non-smokers, while people with
schizophrenia have a slightly lower median than people who do not.
Figure 5.2.2
In the scatter plot, on the other hand, we see that nAChR and age are negatively
correlated. However, it must be noted that here we are looking at these
relationships individually and not simultaneously, so these conclusions are not
definitive.
Figure 5.2.3
Correlations (Spearman's rho)
                                            Age     Schizophrenia=No   Smoker=No
Age               Correlation Coefficient   1.000   .157               .393
                  Sig. (2-tailed)           .       .463               .057
                  N                         24      24                 24
Schizophrenia=No  Correlation Coefficient   .157    1.000              .510*
                  Sig. (2-tailed)           .463    .                  .011
                  N                         24      24                 24
Smoker=No         Correlation Coefficient   .393    .510*              1.000
                  Sig. (2-tailed)           .057    .011               .
                  N                         24      24                 24
*. Correlation is significant at the 0.05 level (2-tailed).
Table 5.2.1
Collinearity Diagnostics(a)
                                               Variance Proportions
Model  Dimension  Eigenvalue  Condition Index  (Constant)  Schizophrenia=1.0  Smoker=1.0  Age
1      1          3.305       1.000            .00         .02                .02         .00
       2          .420        2.807            .04         .14                .31         .02
       3          .246        3.666            .00         .82                .54         .00
       4          .030        10.560           .96         .02                .12         .97
a. Dependent Variable: Brain density of nicotinic receptors
Table 5.2.2
Figure 5.2.4
Table 5.2.3
Figure 5.2.5
We then go to Save and select ‘Unstandardised Predicted Values’, ‘Studentised
Residuals’, ‘Cook’s Distance’ and ‘Leverage Values’ (see Figure 5.2.6). Press
Continue.
Figure 5.2.6
Figure 5.2.7
Furthermore, from Figure 5.2.7, we see that from Options we can choose
‘Descriptive statistics’, ‘Parameter estimates’ and ‘Homogeneity tests’. Pressing
Continue and OK yields the following outputs.
Descriptive Statistics
Dependent Variable: Brain density of nicotinic receptors
Schizophrenia Smoker Mean Std. Deviation N
No No 17.6633 4.54381 9
Yes 25.2275 4.08428 4
Total 19.9908 5.58017 13
Yes No 10.8550 .78489 2
Yes 19.9300 6.05092 9
Total 18.2800 6.54437 11
Total No 16.4255 4.91566 11
Yes 21.5600 5.92078 13
Total 19.2067 5.96871 24
Table 5.2.4
In Table 5.2.4 we see the means for the levels of the schizophrenia factor (No and
Yes), the levels of the smoking factor (No and Yes), and combinations of both. It
can be seen that the mean is lower for the schizophrenia group than for the group
in which schizophrenia is absent. Similarly, the mean is higher for the smoking
group than for the non-smoking group. The following table will determine whether
these differences are significant.
In Table 5.2.5 we see that both the schizophrenia and the smoking factor, and
the covariate age, are significant at the 0.05 level of significance, so these variables
should all be retained in the model. The table also gives the R squared for this model,
which is 0.497, and the adjusted R squared, which is 0.422. The parameter
estimates for the resulting model are shown in Table 5.2.6.
Parameter Estimates
Dependent Variable: Brain density of nicotinic receptors
                                                           95% Confidence Interval
Parameter          B        Std. Error   t        Sig.     Lower Bound   Upper Bound
Intercept          27.387   3.762        7.279    .000     19.538        35.235
[Schizophrenia=1]  5.638    2.165        2.604    .017     1.121         10.154
[Schizophrenia=2]  0(a)     .            .        .        .             .
[Smoker=1]         -6.258   2.325        -2.692   .014     -11.107       -1.408
[Smoker=2]         0(a)     .            .        .        .             .
Age                -.127    .058         -2.191   .040     -.247         -.006
a. This parameter is set to zero because it is redundant.
Table 5.2.6
In Table 5.2.7 one can find the parameter estimates obtained if the ‘age’
covariate is removed. It can be seen that the parameter for ‘schizophrenia’
remains similar but the parameter for ‘smoking’ increases in magnitude. One
may check the usual assumptions on the model residuals in a similar way. The
R squared in this case drops considerably to 0.376. Furthermore, when comparing
the parameter estimates in Table 5.2.6 to those in Table 5.2.7, we note that there
is a difference of 7.63 (27.387 − 19.757) for the intercept term, a difference of
0.221 (5.859 − 5.638) in the parameter estimate for [Schizophrenia=1] and a
difference of 1.867 (−6.258 + 8.125) in the parameter estimate for [Smoker=1].
When observing a large drop in R squared but very little variability in the model
parameter estimates it may be reasonable to keep the ‘age’ variable. However,
the difference in the intercept term is not very small, so in this case one might
prefer to remove age from the model. A better test for selecting between these
two competing models would be to apply cross-validation, which would allow
one to evaluate the predictive ability of each model, but this will not be discussed
here.
Parameter Estimates
Dependent Variable: Brain density of nicotinic receptors
                                                           95% Confidence Interval
Parameter          B        Std. Error   t        Sig.     Lower Bound   Upper Bound
Intercept          19.757   1.548        12.766   .000     16.539        22.976
[Schizophrenia=1]  5.859    2.350        2.493    .021     .971          10.747
[Schizophrenia=2]  0(a)     .            .        .        .             .
[Smoker=1]         -8.125   2.350        -3.457   .002     -13.013       -3.237
[Smoker=2]         0(a)     .            .        .        .             .
a. This parameter is set to zero because it is redundant.
Table 5.2.7
In R/RStudio
The file nachr.csv contains data for 25 individuals regarding brain density of
nicotinic acetylcholine receptors (nAChR) and age, the presence (or otherwise)
of schizophrenia and the presence (or otherwise) of a smoking habit.
Figure 5.2.8
Figure 5.2.9
We first start by turning the values for ‘Schizophrenia’ and ‘Smoker’ into
factors.
nachr$Schizophrenia<-factor(nachr$Schizophrenia, levels=c(1,2
), labels=c("No","Yes"))
nachr$Smoker<-factor(nachr$Smoker, levels=c(1,2), labels=c("N
o","Yes"))
We then move on to plotting a box plot of nAChR with respect to the factors,
and a scatter plot of age and nAChr in R (see Figure 5.2.8 and Figure 5.2.9). In
the box plot we see that there is a larger difference in the location of nAChR for
smokers/non-smokers than there is for the schizophrenia category. In the scatter
plot, on the other hand, we see that nAChR and age are negatively correlated.
However, it must be noted that here we are looking at these relationships
individually and not simultaneously, so these conclusions are not definitive.
We now perform the multicollinearity diagnostics. First we create the required
‘Schizophrenia’ and ‘Smoker’ dummy variables, and extract the ‘Age’ covariate
from the data set.
Note that it is up to the user to decide which of the categories, in a categorical
variable, should be taken as reference. By default, SPSS takes the last level as
the reference category. Here, for comparison purposes, we shall take the first
category as reference and hence we will see that the output changes slightly (for
example, the multicollinearity diagnostics) but overall the model fitted will be
the same.
> dummySchizophrenia<-nachr$Schizophrenia=="Yes"
> dummySmoker<-nachr$Smoker=="Yes"
> dummySchizophrenia<-1*dummySchizophrenia
> dummySmoker<-1*dummySmoker
> Age<-nachr$Age
We now bind these together in one matrix and obtain the correlations and
condition indices as follows.
> x<-cbind(dummySchizophrenia,dummySmoker,Age)
> cor(x,method="spearman")
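The correlation matrix output and the condition-index command are not reproduced in this extract; the condition indices discussed below were presumably obtained with colldiag from the perturb package, as in Section 5.1:

library(perturb)
colldiag(x)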
4 11.470 0.983 0.000 0.222 0.961
We see that the dummy variable for schizophrenia is moderately correlated with
the dummy variable for smoker. Furthermore, the highest condition index is
11.47, which is above 10 but well below 30, indicating that the collinearity is not
negligible but not serious either. The condition indices in this case are slightly
different from those obtained in SPSS since R codes the dummy variables the
opposite way (1 if smoking/schizophrenia is present and 0 otherwise). The 4th
eigenvalue may be causing damage to the parameter estimates of the constant
term (intercept) and age. In what follows we shall keep all variables, but we can
also check what happens if we remove the age variable.
We fit the linear model to the data and obtain the model summary by using the
lm function as follows. Note that the interaction between Schizophrenia and
Smoker is catered for in the model through the term
Schizophrenia:Smoker.
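The fitting command itself is not reproduced in this extract; judging from the Call shown in the output below and the object name used later (for example in aov(glm) and predict(glm)), it was presumably:

glm <- lm(nAChR ~ Schizophrenia + Age + Smoker + Schizophrenia:Smoker,
          data = nachr)
summary(glm)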
Call:
lm(formula = nAChR ~ Schizophrenia + Age + Smoker + Schizophr
enia:Smoker, data = nachr)
Residuals:
Min 1Q Median 3Q Max
-7.1394 -1.9762 -0.5345 2.8211 8.8811
Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                  27.54190    4.73204   5.820 1.32e-05 ***
SchizophreniaYes             -3.78332    3.86011  -0.980   0.3393
Age                          -0.14180    0.06423  -2.208   0.0398 *
SmokerYes                     7.11514    2.78148   2.558   0.0192 *
SchizophreniaYes:SmokerYes   -2.90853    4.97249  -0.585   0.5655
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.5059, Adjusted R-squared: 0.4018
F-statistic: 4.863 on 4 and 19 DF, p-value: 0.007168
At this point, we see that ‘age’ and ‘smoker’ are significant, but ‘schizophrenia’
and, in particular, the interaction term between schizophrenia and smoker are not.
We therefore remove the least significant term, refit the model, and keep removing
insignificant terms until all remaining p-values are less than 0.05. In this case,
removing only the interaction term is necessary; once it is removed, the
schizophrenia main effect becomes significant. It is also common practice that, if an
interaction term is significant, the corresponding main effects are kept in the model
even if they are not significant themselves. Hence we type:
model1<-lm(nAChR~Schizophrenia+Age+Smoker,data=nachr)
summary(model1)
Call:
lm(formula=nAChR ~ Schizophrenia + Age + Smoker, data = nachr
)
Residuals:
Min 1Q Median 3Q Max
-7.6196 -2.0605 -0.1958 2.5154 8.5371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.76664 4.46730 5.992 7.38e-06 ***
SchizophreniaYes -5.63790 2.16519 -2.604 0.0170 *
Age -0.12667 0.05782 -2.191 0.0405 *
SmokerYes 6.25783 2.32478 2.692 0.0140 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Furthermore we see that the R squared is 0.497 and the adjusted R squared is
0.4215. It can be seen that the parameters for the dummy variables of this model
are the negative of those in SPSS. This is due to the fact that the dummy variables
are taken in an opposite manner to SPSS. To obtain the ANOVA table we type
aov(glm) to obtain the following output:
Call:
aov(formula = glm)
Terms:
Schizophrenia Age Smoker Residuals
Sum of Squares 17.4384 240.4416 149.3278 412.1787
Deg. of Freedom 1 1 1 20
H0: Model with only (intercept) constant term is a good fit for the data.
H1: Model fitted (which includes covariates) fits better than the model with only
the intercept term.
The predicted values (the estimates of E[Yi | Agei, Schizophreniai, Smokeri]) for each
observation are obtained via predict(glm):
1 2 3 4 5 6 7 8 9
19.799708 16.252905 20.179723 17.392949 19.039679 25.930869 16.632919 16.126233 26.817570
10 11 12 13 14 15 16 17 18
22.004051 23.650781 21.193096 14.859517 9.601636 18.519571 20.039630 19.659615 22.066375
19 20 21 22 23 24
18.519571 18.646243 23.586434 18.519571 22.319718 9.601636
The corresponding residuals (presumably obtained via residuals(glm)) are:
 1 2 3 4 5 6
-1.2697083272 -4.5229045226 -1.1697230205 8.5370513973 2.6203210595 -0.3908693173
7 8 9 10 11 12
-5.3529192160 0.0937670418 3.8724297315 -0.9740508195 -0.0007811573 -3.9230955361
13 14 15 16 17 18
2.4804826863 1.8083642187 -7.6195711893 1.3403700373 -7.2096152693 5.1336250062
19 20 21 22 23 24
-1.4395711893 8.1237572462 -4.0264337672 -0.7895711893 3.9802818773 0.6983642187
On the other hand, studentised residuals are obtained from the package MASS
via the following commands:
library(MASS)
studres(glm)
1 2 3 4 5 6
-0.2967740106 -1.0617616739 -0.2764701838 2.1531356993 0.6064154134 -0.0930772726
7 8 9 10 11 12
-1.2655954013 0.0214272758 0.9516296299 -0.2535918667 -0.0001904824 -0.9887767927
13 14 15 16 17 18
0.5893157960 0.4615806041 -1.8950297860 0.3039500900 -1.7578738617 1.2450055853
19 20 21 22 23 24
-0.3292774832 2.0428174020 -1.0230597320 -0.1802418023 0.9573748954 0.1774114381
To check for normality, one can apply the Shapiro-Wilk test on residuals or
studentised residuals in the usual way to show that normality is satisfied. For
studentised residuals, we can check whether there are any outliers by checking
whether any of them lie outside the range [-2, 2]. In this case, two observations
(4 and 20) are only marginally above 2, so there is no strong evidence of outlying
points. We also look at the leverage values, which in this case we compare with
2p/n = 2(4)/24 = 1/3. The output below shows that none of the leverage
values exceeds 1/3.
leverages<-hatvalues(glm,type='rstandard')
leverages
1 2 3 4 5 6 7 8
0.15231839 0.11390249 0.17152170 0.09850924 0.12267214 0.18671666 0.10585128 0.11723514
9 10 11 12 13 14 15 16
0.20031148 0.31761718 0.22476175 0.23700672 0.16840690 0.28453630 0.11390249 0.09921842
17 18 19 20 21 22 23 24
0.09850924 0.15231839 0.11390249 0.11089429 0.24665241 0.11390249 0.16479614 0.28453630
cooks.distance(glm,type='rstandard')
1 2 3 4 5 6 7
4.145521e-03 3.599892e-02 4.147700e-03 1.071654e-01 1.327442e-02 5.231749e-04 4.601955e-02
8 9 10 11 12 13 14
1.604548e-05 5.697911e-02 7.850467e-03 2.768302e-09 7.600837e-02 1.817586e-02 2.205051e-02
15 16 17 18 19 20 21
1.021682e-01 2.664938e-03 7.642985e-02 6.776720e-02 3.646875e-03 1.123054e-01 8.547126e-02
22 23 24
1.097077e-03 4.540191e-02 3.288599e-03
Once again, the output shows that no point has a Cook’s distance greater than 1,
showing that there are no influential points in the data. Finally, to check for
variance homogeneity, we use the Breusch-Pagan test.
bptest(glm)
data: glm
BP = 1.0137, df = 3, p-value = 0.7979
Since the p-value (0.7979) is greater than 0.05, we cannot reject the null hypothesis
of homoscedasticity, so the assumption of constant residual variance is satisfied.
To see what happens when the age covariate is removed (as was done in SPSS),
one can fit:
model1<-lm(nAChR~Schizophrenia+Smoker,data=nachr)
Results and outcomes will be similar to the SPSS case. One may check the usual
assumptions on the model residuals in a similar way as has been covered in this
section on the previous model.
5.3 Generalized Linear Models

The multiple linear regression model, the ANOVA model and the ANCOVA
model, covered in the previous sections, are collectively referred to as general
linear models. In this section we shall look at a flexible generalization of such
models, namely the Generalized Linear Model (GLM). GLMs allow for response
variables that are categorical or discrete and/or for error distributions other than
the normal (Gaussian).
g( E(Yi | Xi1 = xi1, ..., Xip = xip) ) = β0 + β1 Xi1 + ... + βp Xip
Also, g(·) is a link function which describes how the mean
µi = E(Yi | Xi1 = xi1, ..., Xip = xip) depends on the linear predictor
ηi = β0 + β1 Xi1 + ... + βp Xip, so that g(µi) = ηi. The link function relates the
response variable to the linear model.
Note that for general linear models, the link function is the identity function,
g ( µi ) = µi . So a general linear model may be viewed as a special case of the
generalized linear modelling framework.
In this chapter we shall consider one class of models from this family of models,
namely the Logistic regression models, which has two important members: the
Binary logistic regression model and the Multinomial logistic regression model.
5.3.1 The binary logistic regression model

The binary logistic regression model is used when we wish to model the influence
of a set of explanatory variables X1, ..., Xp on a response variable Y, where the
possible outcomes of Y can only be one of two categories.
2 Most of the commonly used statistical distributions form part of the exponential family of distributions. The normal distribution, exponential distribution, chi-square distribution, binomial distribution, Bernoulli distribution, Poisson distribution and geometric distribution are all members of the exponential family of distributions.
For example, possible outcomes may be, pass or fail, yes or no, 0 or 1, true or
false. More specifically, for a sample of size n, the binary logistic regression
model is given by
g(µi) = logit(µi) = log( µi / (1 − µi) ) = β0 + β1 Xi1 + ... + βp Xip,   i = 1, ..., n,

where µi in this case is the probability that the outcome on the response
variable will be in the category of interest. The first category is the default
reference category in R, whilst the last category is the default reference category
in SPSS.
Example

Consider again the mtcars data set. For this example the variable carb (number of
carburettors) is recoded into a binary response variable with categories:
0: up to two carburettors,
1: more than two carburettors.
We will start off by showing how a logistic regression model may be fitted in
SPSS. Most of the interpretation accompanying SPSS’ output also applies for the
outputs obtained using R.
In SPSS
Open the file ‘mtcars.sav’. Prior to proceeding with fitting the logistic regression
model we need to reorganize the variable carb so that its possible response
categories are only two. The new variable carbnew is created which takes a value
of 0 if the number of carburettors are 1 or 2 and it takes a value of 1 if the number
of carburettors is more than 2. Recall that recoding into a different variable in
SPSS is carried out by going to the menu Transform – Recode into Different
Variables, choosing the variable that you wish to recode, entering the name of the
new variable, entering the new coded values, press continue and OK. Details on
how to recode data may be found in Section 1.1.8.
Before proceeding to fit the logistic regression model to the data with response
variable carbnew, we can obtain descriptive statistics and plots of the variables
being considered in the analysis so as to get an initial feel of the data and an initial
insight into possible relationships that might exist between the variables. For
detail on how to obtain descriptives and plots according to the type of variables
considered refer to Sections 2 and 3. We should also check whether there is any
multicollinearity in the data.
Figure 5.3.1
Figure 5.3.3
Table 5.3.1
Figure 5.3.2
We take note of any correlation coefficient that is larger than 0.5 as it is deemed
to be quite high in relation to the identification of variables which may lead to
multicollinearity. There are a number of correlation coefficients that stand out in
this example: -0.909 for disp and mpg, -0.886 for mpg and wt, 0.851 for disp and
hp, to mention a few. All of these correlations are much larger than 0.5 showing
that we cannot consider any of these pairs of explanatory variables together when
fitting a binary logistic regression model. Possible binary logistic regression
models with carbnew as response variable that might be considered at this point
are those with these pairs of explanatory variables:
We proceed to fit binary logistic regression model 1, that is using mpg and qsec
as explanatory variables. A similar procedure to what will be shown for this
particular binary logistic regression model may be used to fit binary logistic
regression models 2 - 10. A comparison between the fits provided by the ten
different models can then be entertained.
Note that the reference category in SPSS is by default the last. So, the
probabilities µi estimated from the fitted model will be those corresponding to a
vehicle having up to two carburettors. Should we wish to change the reference
category to the first, we should click on Reference Category and choose First
Category as in Figure 5.3.4. Here we will proceed with the default setting of
SPSS.
Figure 5.3.4
Then move mpg and qsec underneath Covariate(s) as in Figure 5.3.5. Then click
on Save and choose estimated response probabilities. Press Continue and Ok.
Note that if we had a model involving categorical explanatory variables, we
would also press on Model and choose full factorial so that we can include both
main effects and interaction terms in the model.
Figure 5.3.5
Table 5.3.2 is part of the output obtained. From this table, we note that the
variable qsec should be removed from the model as its corresponding p-value,
0.106 is greater than 0.05 showing that the model coefficient is not significantly
different from zero.
Table 5.3.2
The output which results, once qsec is removed from the model, is the following:
Table 5.3.3
From Table 5.3.3, since the p-value for mpg (0.006) is less than 0.05 it shows that
mpg exerts an influence on the probability of whether a vehicle has up to 2
carburettors or more than 2 carburettors. The p-value shown in Table 5.3.4 (0 <
0.05) shows that the change in deviance of 17.417 between the two models is
significant. Thus, the model with mpg as explanatory variable provides a better
fit than an intercept only model.
Table 5.3.4
Since there is no agreement as to whether pseudo R2 values should be used in
conjunction with logistic regression models, information on pseudo R2 values will
not be given here.
The fitted model obtained in SPSS is therefore:

log( µi / (1 − µi) ) = −8.135 + 0.431 · mpgi.

From the coefficient of the variable mpg, 0.431, we may say that mpg has a
positive effect on µi, the probability of a vehicle having an engine with up to two
carburettors. More precisely, a one-unit increase in mpg increases the log odds
log( µi / (1 − µi) ) of an engine having up to two carburettors by 0.431. Making µi
the subject of the formula gives the expression for the predicted probabilities:

µi = e^(−8.135 + 0.431·mpgi) / (1 + e^(−8.135 + 0.431·mpgi)).
The estimated probabilities are presented in Table 5.3.5 with the first column
giving the estimated probabilities of a vehicle having an engine with up to two
carburettors and the second column giving the estimated probabilities of a vehicle
having an engine with more than two carburettors. These probabilities have also
been obtained through the menu Analyze – Regression – Multinomial Logistic:
Table 5.3.5
Figure 5.3.6
Figure 5.3.7
Figure 5.3.8
Figure 5.3.9
Once different generalized linear models have been fitted to the data, their
performance can be compared using goodness of fit measures such as the deviance
(only if the models are nested) and the AIC. The lower the deviance, the better the
fit of the model. Similarly, the lower the value of the AIC, the better. The goodness
of fit measures are
obtained as part of the default output obtained through the menu Analyze –
Generalized Linear Models - Generalized Linear Models as shown in Table
5.3.6.
Goodness of Fit(a)
                                         Value     df    Value/df
Deviance                                 21.274    23    .925
Scaled Deviance                          21.274    23
Pearson Chi-Square                       18.664    23    .811
Scaled Pearson Chi-Square                18.664    23
Log Likelihood(b)                        -12.024
Akaike's Information Criterion (AIC)     28.047
Finite Sample Corrected AIC (AICC)       28.461
Bayesian Information Criterion (BIC)     30.979
Consistent AIC (CAIC)                    32.979
Dependent Variable: carbnew
Model: (Intercept), mpg
a. Information criteria are in smaller-is-better form.
b. The full log likelihood function is displayed and used in computing information criteria.
Table 5.3.6
If interest lies in how many vehicles have been correctly classified, we can also
look at the classification table (see Table 5.3.7) obtained through the menu
Analyze – Regression – Binary Logistic. Group membership is selected through
the tab Save, as shown in Figure 5.3.10:
Figure 5.3.10
Table 5.3.7
Using the diagonal elements in the contingency table (see Table 5.3.7), the
number of correctly classified vehicles amounts to 24 (13 + 11) showing that only
75% of the vehicles were correctly classified by means of this model.
In R/RStudio
A GLM may be fitted in R using the glm command from the stats package. The
usage of glm() is similar to the function lm() which we have previously used to
fit linear models in Sections 5.1 and 5.2. However, in the glm command we also
have to specify a family argument which caters for different exponential family
distributions and different link functions.
Since we are going to use the glm function to fit a logistic regression model, we
will need to specify the family as binomial with logit link. The reason for
selecting such a family follows from the fact that a response variable that has only
two possible outcome categories may be considered to follow the Bernoulli
distribution and the Bernoulli distribution is a special case of the Binomial
distribution.
We use the following commands:
We start our analysis by opening the data file and specifying which of the
variables are categorical (factors). This is done as follows:
data <- mtcars
attach(data)
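The commands declaring the categorical variables as factors are not reproduced in this extract; given the dummy-variable output further down, they were presumably of the form:

data$vs   <- factor(data$vs)
data$am   <- factor(data$am)
data$gear <- factor(data$gear)
data$cyl  <- factor(data$cyl)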
Using the command revalue from the R package plyr to reorganize the categories
of the variable carb:
library(plyr)
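The recoding command itself is not shown in this extract; one possible reconstruction (an assumption) that yields the summary below is:

carbnew <- revalue(factor(data$carb),
                   c("1"="0", "2"="0", "3"="1", "4"="1", "6"="1", "8"="1"))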
summary(carbnew)
0 1
17 15
Before proceeding to fit the logistic regression model to the data with response
variable carbnew, we can obtain descriptive statistics and plots of the variables
being considered in the analysis so as to get an initial feel of the data and an initial
insight into possible relationships that might exist between the variables. For
detail on how to obtain descriptive statistics and plots according to the type of
variables considered refer to Sections 2 and 3.
We should also check whether there is any multicollinearity in the data. Since in
this case we have both covariates and categorical variables as explanatory
variables, we first need to rewrite the categorical variables as dummy variables
and then move on to use collinearity diagnostics on the variables. There are
various ways in which dummy variables may be created in R. One of these
methods proceeds as follows:
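The first command is missing from this extract; following the pattern used for the other factors below, it was presumably:

dummiesvs = model.matrix(~data$vs)   # the intercept column is later dropped with [,-1]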
head(dummiesvs)
(Intercept) data$vs1
1 1 0
2 1 0
3 1 1
4 1 1
5 1 0
6 1 1
1 2 3 4 5 6
0 0 1 1 0 1
dummiesam = model.matrix(~data$am)[,-1]
head(dummiesam)
1 2 3 4 5 6
1 1 1 0 0 0
dummiesgear = model.matrix(~data$gear)[,-1]
head(dummiesgear)
data$gear4 data$gear5
1 1 0
2 1 0
3 1 0
4 0 0
5 0 0
6 0 0
dummiescyl = model.matrix(~data$cyl)[,-1]
head(dummiescyl)
data$cyl6 data$cyl8
1 1 0
2 1 0
3 0 0
4 1 0
5 0 1
6 1 0
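The command assembling the matrix used for the multicollinearity check is not shown in this extract; based on the column names in the output below, it was presumably something like:

formultcol <- cbind(data[, c("mpg", "disp", "hp", "drat", "wt", "qsec")],
                    dummiesvs, dummiesam, dummiesgear, dummiescyl)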
head(formultcol)
mpg disp hp drat wt qsec dummiesvs dummiesam data$gear4 data$gear5
21.0 160 110 3.90 2.620 16.46 0 1 1 0
21.0 160 110 3.90 2.875 17.02 0 1 1 0
22.8 108 93 3.85 2.320 18.61 1 1 1 0
21.4 258 110 3.08 3.215 19.44 1 0 0 0
18.7 360 175 3.15 3.440 17.02 0 0 0 0
18.1 225 105 2.76 3.460 20.22 1 0 0 0
data$cyl6 data$cyl8
1 0
1 0
0 0
1 0
0 1
1 0
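The command producing the correlation values quoted below is likewise not shown; a sketch consistent with the reported figures (which match Spearman correlations) is:

round(cor(formultcol, method = "spearman"), 2)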
Using the same strategy as that used when working with SPSS software, any
correlation coefficient that is larger than 0.5 is deemed to be quite high in relation
to the identification of variables which may lead to multicollinearity. There are
a number of correlation coefficients that stand out in this example: -0.91 for disp
and mpg, -0.89 for mpg and wt, 0.85 for disp and hp, to mention a few. All of
these correlations are much larger than 0.5 showing that we cannot consider any
of these pairs of explanatory variables together when fitting a binary logistic
regression model.
Possible binary logistic regression models with carbnew as response variable that
might be considered at this point are those with these pairs of explanatory
variables:
We proceed to fit binary logistic regression model 1, that is using mpg and qsec
as explanatory variables. A similar procedure to what will be shown for this
particular binary logistic regression model may be used to fit binary logistic
regression models 2 - 10. A comparison between the fits provided by the ten
different models can then be entertained. We fit our logistic regression model
with carbnew as the response variable and mpg and qsec as the explanatory
variables as follows:
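The fitting commands are not reproduced in this extract; judging from the Call in the output, they were along these lines (here datanew is assumed to be the mtcars data with the carbnew factor added, and glm1 is an illustrative object name):

datanew <- data.frame(data, carbnew)
glm1 <- glm(carbnew ~ mpg + qsec, family = "binomial", data = datanew)
summary(glm1)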
Call:
glm(formula = carbnew ~ mpg + qsec, family = "binomial", data
= datanew)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.94655 -0.48935 -0.06209 0.62801 1.41200
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.8230 7.8982 2.383 0.0172 *
mpg -0.3433 0.1540 -2.230 0.0258 *
qsec -0.6977 0.4316 -1.616 0.1060
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To interpret the output provided by the glm function we start by looking at the p-
values obtained through the Wald test (under the heading Pr(>|z|)). Note that
the p-value for the coefficient corresponding to qsec is greater than 0.05 (0.106 >
0.05) which means that this coefficient is not significantly different from zero.
So we can remove the variable qsec from the model. The output for the model
without qsec is obtained as follows:
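The refitting command is not shown in the extract; presumably:

glm2 <- glm(carbnew ~ mpg, family = "binomial", data = datanew)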
summary(glm2)
Call:
glm(formula = carbnew ~ mpg, family = "binomial", data =
datanew)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.88400 -0.62520 -0.06642 0.62837 1.57953
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.1350 2.9845 2.726 0.00642 **
mpg -0.4307 0.1577 -2.731 0.00631 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since the p-value for mpg (0.00631) is less than 0.05 it shows that mpg exerts an
influence on the probability of whether a vehicle has up to 2 carburettors or more
than 2 carburettors.
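To judge whether mpg improves on a model with no predictors, an intercept-only model is also fitted. The fitting command is not shown in the extract, but was presumably:

glminterceptonly <- glm(carbnew ~ 1, family = "binomial", data = datanew)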
summary(glminterceptonly)
Call:
glm(formula = carbnew ~ 1, family = "binomial", data = datanew
)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.125 -1.125 -1.125 1.231 1.231
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1252 0.3542 -0.353 0.724
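The command producing the analysis-of-deviance comparison below is not reproduced in this extract; a standard way to obtain it is:

anova(glminterceptonly, glm2, test = "Chisq")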
Analysis of Deviance Table
With a p-value of 0.00003 < 0.05, the change in deviance is significant showing
that the introduction of mpg in the model led to an improvement in model fit.
The AIC is another measure of goodness of fit of the model that may be used to
compare the performance of competing models; the lower the value of the AIC, the
better. It may be noticed that by retaining the variable mpg in the model, the AIC
turned out to be 30.82. On the other hand, the AIC for an intercept-only model
turned out to be 46.236, showing that indeed the inclusion of mpg in the model led
to a better fit.
Note that the deviance (only if models are nested) and the AIC may also be used if
we wish to compare the fit of a number of candidate models. The lower the
deviance, the better the fit of the model; similarly, the lower the value of the AIC,
the better.
The fitted model obtained in R is:

log( µi / (1 − µi) ) = 8.135 − 0.431 · mpgi.

From the coefficient of the variable mpg, −0.431, we may say that mpg has a
negative effect on µi, the probability of an engine having more than two
carburettors. More precisely, a one-unit increase in mpg reduces the log odds
log( µi / (1 − µi) ) of an engine having more than two carburettors by 0.431. We can
also focus on the probability µi directly, by making µi subject of the formula to
get the expression for the predicted probabilities:

µi = e^(8.135 − 0.431·mpgi) / (1 + e^(8.135 − 0.431·mpgi)).
fitted<-glm2$fitted.values
head(fitted)
Once a generalized linear model is fitted to the data we also need to check the
residuals. Different types of residuals may be obtained through the glm function
in R. Here we focus on the Pearson residuals and we plot Pearson residuals versus
the linear predictors ηi where ηi = 8.135 − 0.431 ∗ mpg i (the term on the right
hand side of the model). We should expect to see a horizontal band with most of
the residuals falling within ±2 or ±3 .
Pearson<-residuals(glm2,type='pearson')
plot(glm2$linear.predictors, Pearson)
Figure 5.3.11: Scatter plot of the Pearson residuals against the linear predictors (glm2$linear.predictors).
which(abs(Pearson)>3)
named integer(0)
## no absolute Pearson residual value is larger than 3
From Figure 5.3.11 we can see that the resulting Pearson residuals fall within the
±3 horizontal band as desired. However, there are signs of a curvature in the
points, which means that there might be misspecification in the mean model, that
is, the relationship between the logit function log( µi / (1 − µi) ) and the explanatory
variables might not be linear.
• Suppose that the model which follows is fitted using the glm function in R.
Note that in this case the two explanatory variables are both categorical. If a
glm is fitted to a dataset involving categorical explanatory variables, the glm
function by default removes the first category of each categorical variable, a
process known as aliasing (vs0 and am0 have been aliased in the model which
follows). Without aliasing, estimation problems tend to arise. Aliasing is also
carried out automatically when a general linear model (ANOVA/ANCOVA
model) is fitted to the data as seen in Section 5.2.
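The fitting commands behind the two sets of coefficients shown below are not reproduced in this extract; assuming the response is again carbnew, they were presumably of the form (object names are illustrative):

glm_vs_am <- glm(carbnew ~ vs + am, family = "binomial", data = datanew)
summary(glm_vs_am)       # main effects only (first set of coefficients below)
glm_vs_am_int <- glm(carbnew ~ vs * am, family = "binomial", data = datanew)
summary(glm_vs_am_int)   # with the vs:am interaction (second set of coefficients)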
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0484 0.6172 1.699 0.08939 .
vs1 -2.7125 0.9324 -2.909 0.00362 **
am1 -0.2681 0.8936 -0.300 0.76419
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6931 0.6124 1.132 0.258
vs1 -1.6094 1.0368 -1.552 0.121
am1 0.9163 1.2550 0.730 0.465
vs1:am1 -18.5661 2465.3261 -0.008 0.994
Final Note on fitting a logistic regression model using R and SPSS (for the
Advanced User)

By default, R codes factors using treatment contrasts (contr.treatment), which is
why the coefficients and the design matrix below take the following form:

glm
Coefficients:
(Intercept) vs1
0.9555 -2.7473
head(model.matrix(glm))
(Intercept) vs1
Mazda RX4 1 0
Mazda RX4 Wag 1 0
Datsun 710 1 1
Hornet 4 Drive 1 1
Hornet Sportabout 1 0
Valiant 1 1
By changing the contrasts to contr.sum you get the following:
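The command used to switch the contrasts is not shown in this extract; one standard way of requesting sum-to-zero contrasts for the factor (an assumption) is:

glm <- glm(carbnew ~ vs, family = "binomial", data = datanew,
           contrasts = list(vs = "contr.sum"))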
glm
Coefficients:
(Intercept) vs1
-0.4181 1.3736
head(model.matrix(glm))
(Intercept) vs1
Mazda RX4 1 1
Mazda RX4 Wag 1 1
Datsun 710 1 -1
Hornet 4 Drive 1 -1
Hornet Sportabout 1 1
Valiant 1 -1
Figure 5.3.12
Figure 5.3.13
Table 5.3.8
5.3.2 The multinomial logistic regression model
The correspondence between the multinomial logistic regression model and its
binary counterpart may be viewed directly from the model specification. Suppose
that the response variable Y is made up of K categories and let the first category
of the response variable be the reference category3. Then for a sample of size n,
the multinomial logistic regression model is given by:
g(µ2i) = logit(µ2i) = log( µ2i / µ1i ) = β20 + β21 Xi1 + ... + β2p Xip
⋮
g(µKi) = logit(µKi) = log( µKi / µ1i ) = βK0 + βK1 Xi1 + ... + βKp Xip
where i = 1,..., n, and for k = 1,..., K , µ ki is the probability that the ith outcome
obtained on the response variable will be in the kth category. So, if modelling is
carried out using a K-level response variable, then the multinomial logistic
regression model is actually made up of K – 1 different binary logistic regression
models. Each binary logistic regression model obtained provides information on
the effect of the explanatory variables on the probability that the response variable
is equal to a specific category, always in comparison to a reference category. To
cater for the fact that the explanatory variables may affect each outcome category
differently, the model coefficients of the binary logistic regression models will
also be different.
3 The first category is the default reference category in R, whilst the last category is the default reference category in SPSS.
Example
Consider the iris dataset available in the inbuilt package datasets in R software.
This dataset gives measurements (in cm) of the sepal length, sepal width, petal
length and petal width of 150 flowers. These flowers have been classified into
three different species: iris setosa, versicolor or virginica. Our focus in this
section will be directed towards modelling the influence of a number of
explanatory variables on the nominal response variable (with three categories)
"species".
As for the binary logistic regression model, the explanatory variables used in a
multinomial logistic regression model should not lead to any multicollinearity
issues. Prior to fitting the model, multicollinearity diagnostics should thus be
carried out.
SPSS and R will now be used to conduct multicollinearity diagnostics on the four
explanatory variables, sepal length, sepal width, petal length and petal width, and
will then proceed to fit a multinomial logistic regression model to the data.
We will start off by showing the procedure in SPSS. Most of the interpretation
accompanying SPSS’ output also applies for the outputs obtained using R.
In SPSS
Open the file ‘irisdata.sav’ and get familiar with the dataset by looking at
descriptive statistics and plots of the variables being considered in the analysis.
This will help us to get a feel of the data being modelled and an insight into
possible relationships that might exist between the response and the explanatory
variables. Detail on how to achieve descriptive statistics and plots in SPSS has
been given in Sections 2 and 3.
The boxplot of Sepal Length versus Species, in Figure 5.3.14, shows that there
seems to be a difference in the sepal length due to the iris species. So, it looks
like it might be worth investigating this relationship further.
The scatter plot of Petal Width versus Petal Length, in Figure 5.3.15, shows that
there seems to be a strong, approximately linear, positive relationship between the
two variables. The Spearman correlation coefficient for the two variables is in fact
0.938 with p-value < 0.05, as shown in Table 5.3.9, confirming a significant, strong
positive (monotonic) association between the two. Note that the Spearman
correlation coefficient has been used due to non-normality of the variables Petal
Length and Petal Width (p-value 0 < 0.05 obtained for both variables when testing
for normality using the Shapiro-Wilk test).
Figure 5.3.14
Figure 5.3.15
Table 5.3.9
Now recall that when fitting a generalized linear model, such as the multinomial
logistic regression model, the explanatory variables should not be collinear. The
fact that there is a linear relationship between petal width and petal length shows
that we might need to remove some of the variables from the iris dataset for our
model fitting. With large correlation values of 0.938, 0.882 and 0.834, the
correlation matrix in Table 5.3.9 also shows that we might have a serious issue
with multicollinearity.
As for the binary logistic regression model, the procedure that is used to detect
multicollinearity in the explanatory variables of a generalized linear model, is not
straightforward. For the purpose of these lecture notes, we will decide which
variables should be removed from the data based on the resulting correlation
coefficients and a trial and error model fitting approach.
Any correlation coefficient larger than 0.5 in absolute value is deemed here to be
high enough to flag potential multicollinearity. As already mentioned, three
correlation coefficients stand out: those for the pairs Petal.Width ~ Petal.Length,
Sepal.Length ~ Petal.Length and Sepal.Length ~ Petal.Width. All of these
correlations are much larger than 0.5, showing that we cannot fit any of the
following three multinomial logistic regression models:
1. a model with both Petal.Width and Petal.Length as explanatory variables
2. a model with both Sepal.Length and Petal.Length as explanatory variables
3. a model with both Sepal.Length and Petal.Width as explanatory variables.
To fit the multinomial logistic regression model using Sepal Length and Sepal
Width as explanatory variables and Species as the response variable, go to Analyze
– Regression – Multinomial Logistic. The response variable is then moved
underneath Dependent and the explanatory variables Sepal Width and Sepal
Length are moved underneath Covariate(s). Should you wish to match the
resulting output to that obtained in R, the reference category in SPSS needs to be
changed to first as shown in Figure 5.3.16. Then press Ok.
Figure 5.3.16
Table 5.3.10
The warnings should be the first thing to get our attention once the SPSS output
is obtained. These warnings are shown in Table 5.3.10. Such warnings show that
our model is not the ideal one to use for the particular dataset. At this stage we
can thus consider two separate multinomial logistic models, one with
Sepal.Length as an explanatory variable and the other model with Sepal.Width as
an explanatory variable. Once these two models have been fitted, identification
of the best overall fitting model can be conducted by comparing the fit of these
two models with the fit of models 5 and 6 defined earlier.
For the purpose of these lecture notes, we will proceed by fitting solely a
multinomial logistic regression model with Species as the response variable and
considering Sepal Length as the explanatory variable (see Figure 5.3.17). From
Figure 5.3.17 it may also be noted that the reference category of the response
variable Species is changed from the default, last, to first so that the results
obtained in SPSS match with those obtained in R.
Figure 5.3.17
Also go to Statistics and choose Information Criteria, Cell probabilities,
classification table and goodness-of-fit, Continue, click on Save and choose
Predicted category and Predicted category probability as shown in Figures
5.3.18 and 5.3.19. Then press Continue and Ok.
Figure 5.3.18
Figure 5.3.19
Likelihood Ratio Tests

Effect          -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept       224.975                               145.680       2     .000
Sepal.Length    226.811                               147.516       2     .000

The chi-square statistic is the difference in -2 log-likelihoods between the final
model and a reduced model. The reduced model is formed by omitting an effect
from the final model. The null hypothesis is that all parameters of that effect are 0.
Table 5.3.11
From Table 5.3.11, a p-value of 0 < 0.05 for Sepal Length shows that the
coefficients corresponding to Sepal Length (one for each comparison with the
reference category) are not all equal to zero.
Table 5.3.12
The fact that sepal length should be retained in the model is also corroborated by
the p-value 0 < 0.05 shown in the final row of Table 5.3.12. The result in Table
5.3.12 shows that the model with sepal length provides a better fit than the
intercept only model.
Recall that the lower the value of information criteria such as the AIC and BIC,
the better the model fit. The information criteria (AIC and BIC) shown in Table
5.3.13 therefore also show that the introduction of sepal length in the model leads
to an improvement in model fit: an AIC of 87.295 with sepal length in the model
versus an AIC of 230.811 without it.
Table 5.3.13
The goodness-of-fit tests shown in Table 5.3.14 are then used to assess how well
the model fits the data. The resulting p-values (0.789 and 0.999, both > 0.05)
suggest that the model fits the data well.
Goodness-of-Fit
Chi-Square df Sig.
Pearson 56.591 66 .789
Deviance 34.838 66 .999
Table 5.3.14
Note that goodness-of-fit measures such as the deviance (for nested models) and
the AIC can also be used if you need to compare the performance of different
generalized linear models. The lower the deviance, the better the fit of the model;
similarly, the lower the value of the AIC, the better.
Having shown that by retaining sepal length in the model, we have a well-fitting
model, we now turn our attention to the resulting estimates, model specification
and interpretation. The first set of estimates shown in Table 5.3.15 is for the
model comparing the category versicolor with the baseline/reference category
setosa and, similarly, the second set of estimates shown in Table 5.3.15 is for the
model comparing the category virginica with the baseline/reference category
setosa.
Table 5.3.15
\[
\log\left(\frac{\mu_{2i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{versicolor})_i}{P(\text{setosa})_i}\right) = -26.08 + 4.82 \times \text{Sepal.Length}_i
\]
\[
\log\left(\frac{\mu_{3i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{virginica})_i}{P(\text{setosa})_i}\right) = -38.77 + 6.85 \times \text{Sepal.Length}_i
\]
Using the first equation, a one-unit increase in the variable Sepal Length is
associated with an increase of 4.82 in the log odds of an iris being of the species
versicolor versus setosa. Similarly, from the second equation, a one-unit increase
in the variable Sepal Length is associated with an increase of 6.85 in the log odds
of an iris being of the species virginica versus setosa. A similar interpretation
may also be given in terms of the odds rather than the log odds of an iris being
either a versicolor or virginica versus a setosa. Should such an interpretation be
required, we need to consider the values, 123.43 and 940.49, obtained underneath
the heading Exp(B).
For our fitted model, we can also calculate the predicted (estimated) probabilities
for each of the levels of the response variable Species. In general:
\[
\mu_{1i} = \frac{1}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)}
\]
\[
\mu_{ki} = \frac{\exp\left(\beta_{k0} + \beta_{k1} X_{i1} + \dots + \beta_{kp} X_{ip}\right)}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)} \qquad \text{for } k \neq 1
\]
Using the above equations, three predicted probabilities may be calculated for
each species category for each iris. For each iris, SPSS takes note of the species
category that has the largest predicted probability associated with it and thus
assigns the particular iris to that category. The data view window shown in Figure
5.3.20 shows the largest predicted probabilities obtained for each iris and the
corresponding species category to which each iris has been assigned.
Figure 5.3.20
If interest lies in how many iris flowers have been correctly classified, we can
also look at the classification table:
Classification
Predicted
Observed setosa versicolor virginica Percent Correct
setosa 45 5 0 90.0%
versicolor 6 30 14 60.0%
virginica 1 12 37 74.0%
Overall Percentage 34.7% 31.3% 34.0% 74.7%
Table 5.3.16
Using the diagonal elements in the contingency table in Table 5.3.16, the number
of correctly classified iris amounts to 112 (45 + 30 + 37) showing that only
74.67% of the iris were correctly classified by means of this model.
As a final note, once a generalized linear model is fitted to the data we also need
to check the residuals. To obtain Pearson residuals for a multinomial logistic
regression we will need to resort to fitting what is known as a loglinear model
using the menu Analyze – Generalized Linear Models - Generalized Linear
Models. From the tab Type of Model choose Poisson Loglinear. The Pearson
residuals obtained for such a model are the same as those for the multinomial
logistic regression model.
In R/RStudio
As a first step, load the iris dataset and get familiar with the dataset by looking at
descriptive statistics and plots of the variables being considered in the analysis.
This will help us to get a feel of the data being modelled and an insight into
possible relationships that might exist between the response and the explanatory
variables.
head(iris) ## to view the first six rows of the data
summary(iris)
## note that Species is already defined as a factor
From the summary measures it should be noted that Species is by default defined
as a factor, so there is no need to include any commands to turn the variable into
a factor.
attach(iris)
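The plotting command itself is not shown; a minimal sketch (not necessarily the
authors' exact command) that reproduces the box plot in Figure 5.3.21 is:
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Box Plot", ylab = "Sepal Length", xlab = "Species")
## box plot of sepal length for each of the three species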
[Box plot of Sepal Length versus Species]
Figure 5.3.21
The boxplot of Sepal Length versus Species, in Figure 5.3.21, shows that there
seems to be a difference in the sepal length due to the iris species. So, it looks
like it might be worth investigating this relationship further.
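The scatter plot in Figure 5.3.22 can likewise be reproduced with a command such
as the following minimal sketch (not necessarily the authors' exact command):
plot(iris$Petal.Length, iris$Petal.Width)
## scatter plot of petal width against petal length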
[Scatter plot of Petal Width versus Petal Length]
Figure 5.3.22
The scatter plot of Petal Width versus Petal Length, in Figure 5.3.22, shows that
there seems to be a strong, approximately linear, positive relationship between the
two variables. The Spearman correlation coefficient for the two variables is in fact
0.938 with p-value < 0.05, confirming a significant, strong positive (monotonic)
association between the two. Note that the Spearman correlation coefficient has
been used due to non-normality of the variables Petal Length and Petal Width
(p-value 0 < 0.05 obtained for both variables when testing for normality using the
Shapiro-Wilk test).
cor.test(iris$Petal.Length,iris$Petal.Width,method='spearman')
Now recall that when fitting a generalized linear model, such as the multinomial
logistic regression model, the explanatory variables should not be collinear. The
fact that there is a linear relationship between petal width and petal length shows
that we might need to remove some of the variables from the iris dataset for our
model fitting. With large Spearman correlation values of 0.938, 0.882 and 0.834,
the correlation matrix below also shows that we might have a serious issue with
multicollinearity.
cor(iris[,-ncol(iris)],method='spearman')
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1667777 0.8818981 0.8342888
Sepal.Width -0.1667777 1.0000000 -0.3096351 -0.2890317
Petal.Length 0.8818981 -0.3096351 1.0000000 0.9376668
Petal.Width 0.8342888 -0.2890317 0.9376668 1.0000000
As for the binary logistic regression model, the procedure that is used to detect
multicollinearity in the explanatory variables of a generalized linear model, is not
straightforward. For the purpose of these lecture notes, we will decide which
variables should be removed from the data based on the resulting correlation
coefficients and a trial and error model fitting approach.
Any correlation coefficient larger than 0.5 in absolute value is deemed here to be
high enough to flag potential multicollinearity.
As already mentioned, there are three correlation coefficients that stand out.
These correlation coefficients correspond to the relationships Petal.Width ~
Petal.Length, Sepal.Length ~ Petal.Length and Sepal.Length ~ Petal.Width. All
of these correlations are much larger than 0.5, showing that we cannot fit any of
the following three multinomial logistic regression models:
1. a model with both Petal.Width and Petal.Length as explanatory variables
2. a model with both Sepal.Length and Petal.Length as explanatory variables
3. a model with both Sepal.Length and Petal.Width as explanatory variables.
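As a minimal sketch (not part of the original analysis), the pairs of explanatory
variables exceeding the 0.5 cut-off can be listed directly from the correlation
matrix:
## flag pairs whose absolute Spearman correlation exceeds the 0.5 cut-off
cm <- cor(iris[, -ncol(iris)], method = 'spearman')
high <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           rho  = round(cm[high], 3))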
The multinomial logistic regression model may be fitted using different functions
in R. Here the multinom function from the package nnet is used:
library(nnet)
mult <- multinom(Species ~ Sepal.Width + Sepal.Length, data =
iris)
## fitting a multinomial logistic regression model with Sepal
## Width and Sepal Length as explanatory variables
# weights: 12 (6 variable)
initial value 164.791843
iter 10 value 62.715967
iter 20 value 59.808291
⋮ ⋮
iter 90 value 55.230241
iter 100 value 55.212479
final value 55.212479
stopped after 100 iterations
## always check that the final line shows the word converged
mult <- multinom(Species ~ Sepal.Width + Sepal.Length, maxit =
1000, data=iris)
## fitting a multinomial logistic regression model with Sepal
## Width and Sepal Length as explanatory variables and
## increasing the number of iterations from the default of 100
## to 1000
# weights: 12 (6 variable)
initial value 164.791843
iter 10 value 62.715967
iter 20 value 59.808291
⋮ ⋮
iter 220 value 55.186513
iter 230 value 55.184474
final value 55.184137
converged
summary(mult)
Call:
multinom(formula = Species ~ Sepal.Width + Sepal.Length, data
= iris,
maxit = 1000)
Coefficients:
(Intercept) Sepal.Width Sepal.Length
versicolor -106.8890 -38.18227 42.24345
virginica -119.9312 -37.77782 44.14532
Std. Errors:
(Intercept) Sepal.Width Sepal.Length
versicolor 20.60839 43.18899 18.20115
virginica 20.65106 43.19554 18.19642
The output obtained using the function multinom includes details on the iteration
history. Always check that you get the word converged in the final line of the
iteration history; otherwise the resulting estimates are not reliable. You might
need to increase the number of iterations to achieve convergence: the function
multinom calls the function nnet, which sets the default number of iterations at
100, and including the argument maxit = 1000 has solved our problem of lack of
convergence here. If convergence still cannot be achieved, you might need to
scrap the model being fitted and use an alternative.
Also check that the resulting standard errors shown in the summary output are
less than half the estimates of the model coefficients. In this example, the
standard errors corresponding to Sepal.Width are both large. The fact that such
large standard errors were obtained shows that our model is not the ideal one to
use for the particular dataset. At this stage we can thus consider two separate
multinomial logistic models, one with Sepal.Length as an explanatory variable
and the other model with Sepal.Width as an explanatory variable. Once these two
models have been fitted, identification of the best overall fitting model can be
conducted by comparing the fit of these two models with the fit of models 5 and
6 defined earlier.
For the purpose of these lecture notes, we will proceed by fitting solely a
multinomial logistic regression model with Species as the response variable and
considering Sepal Length as the explanatory variable.
mult2 <- multinom(Species ~ Sepal.Length, data = iris)
## fitting a multinomial logistic regression model with Sepal Length as the
## only explanatory variable (iteration history omitted here)
summary(mult2)
Call:
multinom(formula = Species ~ Sepal.Length, data = iris)
Coefficients:
(Intercept) Sepal.Length
versicolor -26.08339 4.816072
virginica -38.76786 6.847957
Std. Errors:
(Intercept) Sepal.Length
versicolor 4.889635 0.9069211
virginica 5.691596 1.0223867
## note that in this case the standard errors are less than
## half the estimates of the model coefficients, as desired
Having reached convergence and since we have no issues with large standard
errors, we can proceed to interpret the resulting output. As part of the multinom
function output we get the final value (91.04).
If this value is multiplied by 2, it is equivalent to the residual deviance (182.07),
the latter shown in the summary of the model.
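This relation can be checked directly in R (values are rounded):
logLik(mult2)      ## log-likelihood, approximately -91.04
mult2$deviance     ## residual deviance, approximately 182.07 = -2 x log-likelihood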
In the previous section, we have seen how the deviance is a measure of goodness
of fit of a generalized linear model. The residual deviance is a summary measure
for the fitted model. The higher the deviance, the worse the fit of the model.
The deviance may thus be used to compare nested models, for example the fitted
model versus the null model. The null model is the model in which all terms are
excluded except for the intercept. For this example, the residual deviance
obtained when fitting an intercept only model is 329.58, showing that by
including sepal length in the model, the deviance has decreased from 329.58 to
182.07.
mult3 <- multinom(Species ~ 1, data = iris)
## fitting the intercept-only (null) model
summary(mult3)
Call:
multinom(formula = Species ~ 1, data = iris)
Coefficients:
(Intercept)
versicolor -2.122746e-16
virginica 4.973799e-16
Std. Errors:
(Intercept)
versicolor 0.2
virginica 0.2
anova(mult3, mult2, test = 'Chisq')
## likelihood ratio test comparing the intercept-only model with the model
## containing sepal length
The fact that sepal length should be retained in the model is also corroborated by
the p-values corresponding to the estimates of the model coefficients (both
approximately 0, hence < 0.05). If the multinomial logistic regression model is
fitted in R using the function multinom, these p-values are not obtained
automatically. Alternative fitting functions do provide such an output but they are
generally more tedious to work with. When using multinom, p-values may be
obtained by using a few extra commands as follows:
z <-summary(mult2)$coefficients/summary(mult2)$standard.errors
## working out values for the z-statistic so that we can
## calculate p-values accordingly
z
(Intercept) Sepal.Length
versicolor -5.334424 5.310353
virginica -6.811422 6.698011
p <- 2 * (1 - pnorm(abs(z)))
## converting the z-statistics into two-sided p-values
p
(Intercept) Sepal.Length
versicolor 9.584830e-08 1.094128e-07
virginica 9.663825e-12 2.112754e-11
Note that goodness-of-fit measures such as the deviance (for nested models) and
the AIC can also be used if you need to compare the performance of different
generalized linear models. The lower the deviance, the better the fit of the model;
similarly, the lower the value of the AIC, the better.
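For example, the two fitted models can be compared directly (lower deviance and
lower AIC both indicate the better-fitting model):
deviance(mult3)      ## approximately 329.58 for the intercept-only model
deviance(mult2)      ## approximately 182.07 with sepal length included
AIC(mult3, mult2)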
Having decided to retain sepal length in the model, we now turn our attention to
the resulting estimates, model specification and interpretation. The first line of
the resulting coefficients gives estimates which correspond to the category
versicolor being compared to the baseline/reference category setosa and
similarly, the estimates in the second line correspond to the category virginica
being compared to the baseline/reference category setosa.
\[
\log\left(\frac{\mu_{2i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{versicolor})_i}{P(\text{setosa})_i}\right) = -26.08 + 4.82 \times \text{Sepal.Length}_i
\]
\[
\log\left(\frac{\mu_{3i}}{\mu_{1i}}\right) = \log\left(\frac{P(\text{virginica})_i}{P(\text{setosa})_i}\right) = -38.77 + 6.85 \times \text{Sepal.Length}_i
\]
Using the first equation, a one-unit increase in the variable Sepal Length is
associated with an increase of 4.82 in the log odds of an iris being of the species
versicolor versus setosa. Similarly, from the second equation, a one-unit increase
in the variable Sepal Length is associated with an increase of 6.85 in the log odds
of an iris being of the species virginica versus setosa. A similar interpretation
may also be given in terms of the odds rather than the log odds of an iris being
either a versicolor or virginica versus a setosa. Should such an interpretation be
required, we just need to exponentiate the resulting coefficients as follows:
exp(coef(mult2))
## one way to exponentiate the matrix of estimated coefficients
(Intercept) Sepal.Length
versicolor 4.700338e-12 123.4791
virginica 1.456567e-17 941.9549
For our fitted model, we can also calculate the predicted (estimated) probabilities
for each of the levels of the response variable Species. In general:
\[
\mu_{1i} = \frac{1}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)}
\]
\[
\mu_{ki} = \frac{\exp\left(\beta_{k0} + \beta_{k1} X_{i1} + \dots + \beta_{kp} X_{ip}\right)}{1 + \sum_{j=2}^{K} \exp\left(\beta_{j0} + \beta_{j1} X_{i1} + \dots + \beta_{jp} X_{ip}\right)} \qquad \text{for } k \neq 1
\]
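As an illustrative check (not part of the original notes), the three probabilities for
a single flower with a sepal length of, say, 5 cm can be computed by hand from
the estimated coefficients of mult2:
b <- coef(mult2)             ## 2 x 2 matrix: rows versicolor and virginica
x <- 5.0                     ## an example sepal length in cm
eta <- b[, 1] + b[, 2] * x   ## linear predictors for the two non-reference categories
denom <- 1 + sum(exp(eta))
c(setosa = 1 / denom, exp(eta) / denom)
For a flower whose sepal length is 5 cm, the same values appear in the
corresponding row of the output of the predict command used below.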
predicted <- predict(mult2, newdata = iris, type = "probs")
head(predicted)
The predict command may also be used to obtain the predicted category:
predicted <- predict(mult2, newdata = iris, type = "class")
head(predicted)
If interest lies in how many iris flowers have been correctly classified, we can
look at a contingency table showing the predicted categories versus the observed
categories:
conttable<-with(iris,table(predicted,Species))
conttable
Species
predicted setosa versicolor virginica
setosa 45 6 1
versicolor 5 30 12
virginica 0 14 37
To include the percentage of correctly classified flowers in the contingency table,
use the commands:
PercCorrect <- round(diag(conttable)/rowSums(conttable),2)
## proportion correctly classified within each predicted category (the rows of
## conttable), rounded to two decimal places
PercCorrect
setosa versicolor virginica
0.87 0.64 0.73
finaltable<-cbind(conttable,PercCorrect)
## including the final column in the table
finaltable
setosa versicolor virginica PercCorrect
setosa 45 6 1 0.87
versicolor 5 30 12 0.64
virginica 0 14 37 0.73
Using the diagonal elements of the contingency table conttable, the number of
correctly classified iris amounts to 112 (45 + 30 + 37) showing that only 74.67%
of the iris were correctly classified by means of this model.
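The overall proportion of correctly classified flowers can also be obtained directly
from the table:
sum(diag(conttable)) / sum(conttable)   ## 112/150 = 0.7467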
As a final note, once a generalized linear model is fitted to the data we also need
to check the residuals. To obtain Pearson residuals for a multinomial logistic
regression we will need to resort to fitting what is known as a loglinear model
using the glm function in R (the same function that has been used to fit the binary
logistic regression model; for this purpose specify family poisson instead of
binomial). The Pearson residuals obtained for such a model are the same as those
for the multinomial logistic regression model.
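As a hedged alternative to the loglinear route just described (and not the authors'
code), observation-level Pearson-type residuals can also be computed directly
from the fitted multinomial probabilities:
## 0/1 indicators of the observed species and the fitted probabilities from mult2
fitted.probs <- predict(mult2, newdata = iris, type = "probs")
observed <- model.matrix(~ Species - 1, data = iris)
pearson.res <- (observed - fitted.probs) / sqrt(fitted.probs * (1 - fitted.probs))
head(round(pearson.res, 3))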