Logistic Regression
I said, ‘you arrange it and I’ll bring my guitar.’ ‘No, you whelk,’ he said,
‘we want you to drum. Can you learn some of the songs on the CD I gave you
last year?’ I’d played his band’s CD and I liked it, but their songs were
ridiculously fast and there was no way on earth that I could play them. ‘Sure, no
problem’, I lied. I spent the next two weeks trying to become a much better
drummer than I was, playing along to this CD as if my life depended on it. I’d
love to report that when the rehearsal came I astounded them with my brilliance,
but I didn’t. I did, however, nearly have a heart attack and herniate everything in
my body that it’s possible to herniate. Still, we had another rehearsal, and then
another and, 7 years down the line, we’re still having them. The main difference
now is that I play the songs at a speed that makes their old drummer sound like a
sedated snail (www.myspace.com/fracture-pattern). It’s curious that I started off
playing guitar (which I can still play, incidentally), and then I chose drums.
Within famous bands, there are always assumptions about the personalities of
different musicians: the singers are egocentric, guitarists are perceived to be
cool, bassists introverted and happy to blend into the background, and drummers
are supposed to be crazy hedonists, autistic (enjoying counting helps) or both.
I’m definitely more autistic than hedonistic. If we wanted to test what
personality characteristics predict the instrument you choose to play, then we’d
have a categorical outcome (type of instrument) with several categories (drums,
guitar, bass, singing, keyboard, tuba, etc.) and continuous predictors
(neuroticism, extroversion, etc.). We’ve looked at how we can quantify
associations between purely categorical variables, but if we have continuous
predictors too then surely there’s no model on earth that can handle that kind of
complexity; should we just go to the pub and have a good time instead?
Actually, we can do logistic regression – bugger!
In the last chapter we started to look at how we fit models of the relationships
between categorical variables. We have also seen throughout the book how we
can use categorical variables to predict continuous outcomes. However, we
haven’t looked at the reverse process: predicting categorical outcomes from
continuous or categorical predictors. In a nutshell, logistic regression is multiple
regression but with an outcome variable that is categorical and predictor
variables that are continuous or categorical. In its simplest form, this means that
we can predict which of two categories a person is likely to belong to given
certain other information. A trivial example is to look at which variables predict
whether a person is male or female. We might measure laziness, pig-headedness,
alcohol consumption and daily flatulence. Using logistic regression, we might
find that all of these variables predict the gender of the person. More important,
the model we build will enable us to predict whether a new person is likely to be
male or female based on these variables. So, if we picked a random person and
discovered that they scored highly on laziness, pig-headedness, alcohol
consumption and flatulence, then our model might tell us that, based on this
information, this person is likely to be male. Logistic regression can have life-
saving applications. In medical research it is used to generate models from
which predictions can be made about the likelihood that a tumour is cancerous or
benign (for example). A database of patients is used to establish which variables
are influential in predicting the malignancy of a tumour. These variables can
then be measured for a new patient and their values placed in a logistic
regression model, from which a probability of malignancy could be estimated. If
the probability value of the tumour being malignant is low then the doctor may
decide not to carry out expensive and painful surgery that in all likelihood is
unnecessary. We might not face such life-threatening decisions, but logistic
regression can nevertheless be a very useful tool. When we are trying to predict
membership of only two categorical outcomes the analysis is known as binary
logistic regression, but when we want to predict membership of more than two
categories we use multinomial (or polychotomous) logistic regression.
Deviance = –2 × log-likelihood
The deviance is often referred to as −2LL because of the way it is calculated. It’s
actually rather convenient to (almost) always use the deviance rather than the
log-likelihood because it has a chi-square distribution (see Chapter 18 and the
Appendix), which makes it easy to calculate the significance of the value.
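To make that relationship concrete, here is a minimal sketch in Python (the outcomes and predicted probabilities are invented purely for illustration) that computes a log-likelihood from a model's predicted probabilities and then converts it to a deviance:

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Observed outcomes (1 = event occurred, 0 = it didn't) and the
# probabilities a fitted logistic model predicts for each case.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2, 0.7, 0.4, 0.5])

deviance = -2 * log_likelihood(y, p)   # deviance = -2 x log-likelihood
print(round(deviance, 2))
```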
Now, it’s possible to calculate a log-likelihood or deviance for different
models and to compare these models by looking at the difference between their
deviances. For example, it’s useful to compare a logistic regression model
against some kind of baseline state. The baseline state that’s usually used is the
model when only the constant is included. In multiple regression, the baseline
model we use is the mean of all scores (this is our best guess of the outcome
when we have no other information). With a categorical outcome it doesn’t make
sense to use the overall mean (all we know is whether an event happened or not),
so we use the frequency with which the outcome occurred instead. So, if the
outcome occurs 107 times, and doesn’t occur 72 times, then our best guess of the
outcome will be that it occurs (because it occurs more times than it doesn’t). As
such, like multiple regression, our baseline model is the model that gives us the
best prediction when we know nothing other than the values of the outcome: in
logistic regression this will be to predict the outcome that occurs most often.
This is the logistic regression model when only the constant is included. If we
then add one or more predictors to the model, we can compute the improvement
of the model as follows:
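χ² = (−2LL(baseline)) − (−2LL(new)), with df = k(new) − k(baseline)    (19.6)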
So, we merely take the deviance for the baseline model (the model when only
the constant is included) and subtract from it the deviance for the new model.
This difference is known as a likelihood ratio and has a chi-square distribution
with degrees of freedom equal to the number of parameters, k, in the new model
minus the number of parameters in the baseline model. The number of
parameters in the baseline model will always be 1 (the constant is the only
parameter to be estimated); any subsequent model will have a value of k equal
to the number of predictors plus 1 (i.e., the number of predictors plus one
parameter representing the constant).
If we build up models hierarchically (i.e., adding one predictor at a time) we
can also use equation (19.6) to compare these models. For example, if you have
a model (we’ll call it the ‘old’ model) and you add a predictor (the ‘new’ model)
to that model, you can see whether the new model has improved the fit using
equation (19.6) in which the baseline model is the ‘old’ model. The degrees of
freedom will again be the difference between the degrees of freedom of the two
models.
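If you want to see the arithmetic, here is a minimal sketch in Python (the deviances and parameter counts are invented for illustration) of how the improvement chi-square and its significance could be computed for two nested models:

```python
from scipy.stats import chi2

# Hypothetical deviances (-2LL) and parameter counts for two nested models.
deviance_baseline = 180.5   # constant-only model, k = 1 parameter
deviance_new = 165.2        # constant + 3 predictors, k = 4 parameters
k_baseline, k_new = 1, 4

chi_square = deviance_baseline - deviance_new   # improvement in fit
df = k_new - k_baseline                         # number of added predictors
p_value = chi2.sf(chi_square, df)               # significance of the improvement

print(f"chi-square({df}) = {chi_square:.2f}, p = {p_value:.4f}")
```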
19.3.3. Assessing the model: R and R2 ③
When we talked about linear regression, we saw that the multiple correlation
coefficient R and its squared value R2 were useful measures of how well the
model fits the data. We’ve also just seen that the likelihood ratio is similar
inasmuch as it is based on the level of correspondence between predicted and
actual values of the outcome. However, you can calculate a more literal version
of the multiple correlation in logistic regression known as the R-statistic. This R-
statistic is the partial correlation between the outcome variable and each of the
predictor variables and it can vary between −1 and 1. A positive value indicates
that as the predictor variable increases, so does the likelihood of the event
occurring. A negative value implies that as the predictor variable increases, the
likelihood of the outcome occurring decreases. If a variable has a small value of
R then it contributes only a small amount to the model.
The R-statistic is given by:
The −2LL term is the deviance for the original model, the Wald statistic (z) is
calculated as described in the next section, and the degrees of freedom can be
read from the summary table for the variables in the equation. However, because
this value of R is dependent upon the Wald statistic it is by no means an accurate
measure (we’ll see in the next section that the Wald statistic can be inaccurate
under certain circumstances). For this reason the value of R should be treated
with some caution, and it is invalid to square this value and interpret it as you
would in linear regression.
There is some controversy over what would make a good analogue to the R2
in logistic regression, but a measure described by Hosmer and Lemeshow (1989)
can be easily calculated. Hosmer and Lemeshow’s measure (denoted by RL2) is
calculated as:
However, SPSS doesn't use this measure; it uses Cox and Snell's (1989) measure,
which is based on the deviance of the model (−2LL(new)) and the deviance of
the original model (−2LL(baseline)), and the sample size, n:
Although all of these measures differ in their computation (and the answers you
get), conceptually they are somewhat the same. So, in terms of interpretation
they can be seen as similar to the R2 in linear regression in that they provide a
gauge of the substantive significance of the model.
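As an illustration only, here is a short Python sketch that computes the two measures from a pair of deviances and the sample size. The formulas used are the commonly cited forms of Hosmer and Lemeshow's and Cox and Snell's statistics rather than anything reproduced from this chapter, and the numbers are made up:

```python
import numpy as np

# Hypothetical deviances and sample size.
deviance_baseline = 180.5   # -2LL(baseline)
deviance_new = 165.2        # -2LL(new)
n = 150

# Hosmer and Lemeshow's R-squared: the proportional reduction in deviance.
r2_hl = (deviance_baseline - deviance_new) / deviance_baseline

# Cox and Snell's R-squared, based on the same two deviances and n.
r2_cs = 1 - np.exp((deviance_new - deviance_baseline) / n)

print(round(r2_hl, 3), round(r2_cs, 3))
```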
19.3.4. Assessing the contribution of predictors: the
Wald statistic ②
As in linear regression, we want to know not only how well the model overall
fits the data, but also the individual contribution of predictors. In linear
regression, we used the estimated regression coefficients (b) and their standard
errors to compute a t-statistic. In logistic regression there is an analogous
statistic, the z-statistic, which follows the normal distribution. Like the t-test in
linear regression, the z-statistic tells us whether the b coefficient for that
predictor is significantly different from zero. If the coefficient is significantly
different from zero then we can assume that the predictor is making a significant
contribution to the prediction of the outcome (Y):
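z = b / SE(b)    (19.11)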
Equation 19.11 shows how the z-statistic is calculated, and you can see it’s
basically identical to the t-statistic in linear regression (see equation (8.11)): it is
the value of the regression coefficient divided by its associated standard error.
The z-statistic was developed by Abraham Wald (Figure 19.2), and is known as
the Wald statistic. SPSS actually reports the Wald statistic as z², which
transforms it so that it has a chi-square distribution. The z-statistic is used to
ascertain whether a variable is a significant predictor of the outcome; however, it
should be used a little cautiously because, when the regression coefficient (b) is
large, the standard error tends to become inflated, resulting in the z-statistic
being underestimated (see Menard, 1995). The inflation of the standard error
increases the probability of rejecting a predictor as being significant when in
reality it is making a significant contribution to the model (i.e., a Type II error).
In general it is probably more accurate to enter predictors hierarchically and
examine the change in likelihood ratio statistics.
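To see the arithmetic, here is a minimal Python sketch (the coefficient and standard error are invented) of the z-statistic, its squared chi-square form as reported by SPSS, and the fact that the two give the same significance value:

```python
from scipy.stats import norm, chi2

b, se_b = 1.23, 0.45        # hypothetical coefficient and its standard error

z = b / se_b                # the Wald (z) statistic
wald = z ** 2               # the squared form, as reported by SPSS

# The two give identical p-values: z is compared against the normal
# distribution, z-squared against a chi-square with 1 degree of freedom.
p_from_z = 2 * norm.sf(abs(z))
p_from_wald = chi2.sf(wald, 1)

print(round(z, 3), round(wald, 3), round(p_from_z, 4), round(p_from_wald, 4))
```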
FIGURE 19.2 Abraham Wald writing ‘I must not devise test statistics prone to
having inflated standard errors’ on the blackboard 100 times
More crucial to the interpretation of logistic regression is the value of the odds
ratio, which is the exponential of B (i.e., eB or exp(B)) and is an indicator of the
change in odds resulting from a unit change in the predictor. As such, it is
similar to the b coefficient in logistic regression but easier to understand
(because it doesn’t require a logarithmic transformation). When the predictor
variable is categorical the odds ratio is easier to explain, so imagine we had a
simple example in which we were trying to predict whether or not someone got
pregnant from whether or not they used a condom last time they made love. The
odds of an event occurring are defined as the probability of an event occurring
divided by the probability of that event not occurring (see equation (19.12)) and
should not be confused with the more colloquial usage of the word to refer to
probability. So, for example, the odds of becoming pregnant are the probability
of becoming pregnant divided by the probability of not becoming pregnant:
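odds = P(event occurring) / P(event not occurring)    (19.12)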
To calculate the change in odds that results from a unit change in the
predictor, we must first calculate the odds of becoming pregnant given that a
condom wasn’t used. We then calculate the odds of becoming pregnant given
that a condom was used. Finally, we calculate the proportionate change in these
two odds.
To calculate the first set of odds, we use equation (19.3) to calculate the
probability of becoming pregnant given that a condom wasn’t used. If we had
more than one predictor we would use equation (19.4). There are three unknown
quantities in this equation: the coefficient of the constant (b0), the coefficient for
the predictor (b1) and the value of the predictor itself (X). We’ll know the value
of X from how we coded the condom use variable (chances are we would’ve
used 0 = condom wasn’t used and 1 = condom was used). The values of b1 and
b0 will be estimated for us. We can calculate the odds as in equation (19.12).
Next, we calculate the same thing after the predictor variable has changed by
one unit. In this case, because the predictor variable is dichotomous, we need to
calculate the odds of getting pregnant, given that a condom was used. So, the
value of X is now 1 (rather than 0).
We now know the odds before and after a unit change in the predictor
variable. It is a simple matter to calculate the proportionate change in odds by
dividing the odds after a unit change in the predictor by the odds before that
change:
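proportionate change in odds = odds after a unit change in the predictor / odds before that change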
This proportionate change in odds is the odds ratio, and we can interpret it in
terms of the change in odds: if the value is greater than 1 then it indicates that as
the predictor increases, the odds of the outcome occurring increase. Conversely,
a value less than 1 indicates that as the predictor increases, the odds of the
outcome occurring decrease. We’ll see how this works with a real example
shortly.
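Before that, here is a small Python sketch of the arithmetic (the values of b0 and b1 are invented for the condom example, and the function name is mine) showing that the proportionate change in odds comes out as exp(b1):

```python
import numpy as np

b0, b1 = -0.5, -1.8   # hypothetical constant and coefficient for condom use

def odds(x):
    """Odds of the event for a given predictor value x (0 or 1 here)."""
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))   # predicted probability of the event
    return p / (1 - p)

odds_no_condom = odds(0)   # odds when X = 0 (condom wasn't used)
odds_condom = odds(1)      # odds when X = 1 (condom was used)

odds_ratio = odds_condom / odds_no_condom
print(round(odds_ratio, 4), round(np.exp(b1), 4))   # the two values match
```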
19.3.6. Model building and parsimony ②
When you have more than one predictor, you can choose between the same
methods to build your model as described for ordinary regression (Section
8.5.1). As with ordinary regression, forced entry and hierarchical methods are
preferred. If you are undeterred by the criticisms of stepwise methods in the
previous chapter, then as with ordinary regression you can choose between a
forward or backward stepwise method. These methods work in the same way as
for ordinary regression, except that different statistics are used to determine
whether predictors are entered or removed from the model. For example, the
forward method enters predictors based on their score statistic, then assesses
removal based on the likelihood ratio statistic described in Section 18.3.3 (the
Forward: LR method), an arithmetically less intense version of the likelihood
ratio statistic called the conditional statistic (Forward: Conditional), or the Wald
statistic (Forward: Wald), in which case any predictors in the model that have
significance values of the Wald statistic (above the default removal criterion of
.1) will be removed. Of these methods the likelihood ratio method is the best
removal criterion because the Wald statistic can, at times, be unreliable (see
Section 19.3.4). The opposite of the forward method is the backward method,
which begins with all predictors included in the model and then removes
predictors if their removal is not detrimental to the fit of the model. Whether
removal is detrimental can be assessed using the same three methods as the
forward approach.
As with ordinary regression, stepwise methods are best avoided for theory
testing; however, they are used when no previous research exists on which to
base hypotheses for testing, and in situations where causality is not of interest
and you merely wish to find a model to fit your data (Agresti & Finlay, 1986;
Menard, 1995). As with ordinary regression, if you do use a stepwise method
then the backward method is preferable because forward methods are more
likely to exclude predictors involved in suppressor effects. In terms of the test
statistic used in stepwise methods, the Wald statistic, as we have seen, has a
tendency to be inaccurate in certain circumstances and so the likelihood ratio
method is best. As with ordinary regression, it is best to use hierarchical methods
and to build models in a systematic and theory-driven way. Although we didn’t
discuss this for ordinary regression (because things were getting complicated
enough already), when building a model we should strive for parsimony. In a
scientific context, parsimony refers to the idea that simpler explanations of a
phenomenon are preferable to complex ones. The statistical implication of using
a parsimony heuristic is that models be kept as simple as possible. In other
words, do not include predictors unless they have explanatory benefit. To
implement this strategy we need to first fit the model that includes all potential
predictors, and then systematically remove any that don't seem to contribute to
the model. This is a bit like a backward stepwise method, except that the
decision-making process is in the researcher’s hands: they make informed
decisions about what predictors should be removed. It’s also worth bearing in
mind that if you have interaction terms in your model then for that interaction
term to be valid you must retain the main effects involved in the interaction term
as well (even if they don’t appear to contribute much).
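A hand-driven version of that strategy might look like the following sketch, which uses Python and the statsmodels library rather than SPSS (the data are simulated and the variable names are mine): fit the model with all candidate predictors, then refit without one of them and check whether the loss of fit is trivial.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)            # a predictor that matters
x2 = rng.normal(size=n)            # a candidate for removal
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x1))))  # x2 has no real effect

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

# Likelihood ratio test for dropping x2: difference in deviances, df = 1.
lr = 2 * (full.llf - reduced.llf)
p = chi2.sf(lr, df=1)
print(f"LR = {lr:.3f}, p = {p:.3f}")   # if p is large, prefer the simpler model
```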
19.4.1. Assumptions ②
Logistic regression, like any linear model, is open to the sources of bias
discussed in Chapter 5 and Section 8.3, so look back at those parts of the book.
In the context of logistic regression, it’s worth noting a couple of points about
the assumptions of linearity and independence:
Logistic regression also has some unique problems. These are not sources of
bias so much as things that can go wrong. SPSS solves logistic regression
problems by an iterative procedure (SPSS Tip 19.1). Sometimes, instead of
pouncing on the correct solution quickly, you’ll notice nothing happening: SPSS
begins to move infinitely slowly, or appears to have just got fed up with you
asking it to do stuff and has gone on strike. If it can’t find a correct solution, then
sometimes it actually does give up, quietly offering you (without any apology) a
result which is completely incorrect. Usually this is revealed by implausibly
large standard errors. Two situations can provoke this problem, both of which
are related to the ratio of cases to variables: incomplete information and
complete separation.