Logistic Regression


‘OK,’ I said, ‘you arrange it and I’ll bring my guitar.’ ‘No, you whelk,’ he said,
‘we want you to drum. Can you learn some of the songs on the CD I gave you
last year?’ I’d played his band’s CD and I liked it, but their songs were
ridiculously fast and there was no way on earth that I could play them. ‘Sure, no
problem’, I lied. I spent the next two weeks trying to become a much better
drummer than I was, playing along to this CD as if my life depended on it. I’d
love to report that when the rehearsal came I astounded them with my brilliance,
but I didn’t. I did, however, nearly have a heart attack and herniate everything in
my body that it’s possible to herniate. Still, we had another rehearsal, and then
another and, 7 years down the line, we’re still having them. The main difference
now is that I play the songs at a speed that makes their old drummer sound like a
sedated snail (www.myspace.com/fracture-pattern). It’s curious that I started off
playing guitar (which I can still play, incidentally), and then I chose drums.
Within famous bands, there are always assumptions about the personalities of
different musicians: the singers are egocentric, guitarists are perceived to be
cool, bassists introverted and happy to blend into the background, and drummers
are supposed to be crazy hedonists, autistic (enjoying counting helps) or both.
I’m definitely more autistic than hedonistic. If we wanted to test what
personality characteristics predict the instrument you choose to play, then we’d
have a categorical outcome (type of instrument) with several categories (drums,
guitar, bass, singing, keyboard, tuba, etc.) and continuous predictors
(neuroticism, extroversion, etc.). We’ve looked at how we can quantify
associations between purely categorical variables, but if we have continuous
predictors too then surely there’s no model on earth that can handle that kind of
complexity; should we just go to the pub and have a good time instead?
Actually, we can do logistic regression – bugger!

19.2. Background to logistic regression ①

In the last chapter we started to look at how we fit models of the relationships
between categorical variables. We have also seen throughout the book how we
can use categorical variables to predict continuous outcomes. However, we
haven’t looked at the reverse process: predicting categorical outcomes from
continuous or categorical predictors. In a nutshell, logistic regression is multiple
regression but with an outcome variable that is categorical and predictor
variables that are continuous or categorical. In its simplest form, this means that
we can predict which of two categories a person is likely to belong to given
certain other information. A trivial example is to look at which variables predict
whether a person is male or female. We might measure laziness, pig-headedness,
alcohol consumption and daily flatulence. Using logistic regression, we might
find that all of these variables predict the gender of the person. More important,
the model we build will enable us to predict whether a new person is likely to be
male or female based on these variables. So, if we picked a random person and
discovered that they scored highly on laziness, pig-headedness, alcohol
consumption and flatulence, then our model might tell us that, based on this
information, this person is likely to be male. Logistic regression can have life-
saving applications. In medical research it is used to generate models from
which predictions can be made about the likelihood that a tumour is cancerous or
benign (for example). A database of patients is used to establish which variables
are influential in predicting the malignancy of a tumour. These variables can
then be measured for a new patient and their values placed in a logistic
regression model, from which a probability of malignancy could be estimated. If
the probability value of the tumour being malignant is low then the doctor may
decide not to carry out expensive and painful surgery that in all likelihood is
unnecessary. We might not face such life-threatening decisions, but logistic
regression can nevertheless be a very useful tool. When we are trying to predict
membership of only two categorical outcomes the analysis is known as binary
logistic regression, but when we want to predict membership of more than two
categories we use multinomial (or polychotomous) logistic regression.

19.3. What are the principles behind logistic regression? ③

I don’t wish to dwell on the underlying principles of logistic regression because they aren’t necessary to understand the test (I am living proof of this fact).
However, I do wish to draw a few parallels to ordinary regression so that you can
get the gist of what’s going on using a framework that will be familiar to you
already. To keep things simple I will explain binary logistic regression, but most
of the principles extend easily to when there are more than two outcome
categories. Now would be a good time for the equation-phobes to look away. In
simple linear regression, we saw that the outcome variable Y is predicted from
the equation:

Yi = b0 + b1X1i + εi     (19.1)

in which b0 is the Y intercept, b1 quantifies the relationship between the predictor and outcome, X1 is the value of the predictor variable and ε is an error term.
When there are several predictors, a similar model is used in which the outcome
(Y) is predicted from a combination of each predictor variable (X) multiplied by
its respective regression coefficient (b):

Yi = b0 + b1X1i + b2X2i + … + bnXni + εi     (19.2)

in which bn is the regression coefficient of the corresponding variable Xn.


However, there is a good reason why we cannot apply these linear models when
the outcome variable is categorical. One of the assumptions of the linear model
(i.e., regression) is that the relationship between variables is linear. In Section
8.3.2.1 we saw how important it is that the model accurately reflects the true
relationship that’s being modelled. Therefore, for linear regression to be a valid
model, the observed data should have a linear relationship. When the outcome
variable is categorical, this assumption is violated (Berry, 1993). One way
around this problem is to transform the data using the logarithmic transformation
(see Berry & Feldman, 1985; and Chapter 5 of this book). This transformation is
a way of expressing a non-linear relationship in a linear way. Logistic regression
is based on this principle: it expresses the multiple linear regression equation in
logarithmic terms (called the logit) and thus overcomes the problem of violating
the assumption of linearity. Let’s now look at the logistic regression model.

In logistic regression, instead of predicting the value of a variable Y from a


predictor variable X1 or several predictor variables (Xs), we predict the
probability of Y occurring given known values of X1 (or Xs). The logistic
regression equation bears many similarities to the regression equations just
described. In its simplest form, when there is only one predictor variable X1, the
logistic regression equation from which the probability of Y is predicted is given
by:

P(Y) = 1 / (1 + e^−(b0 + b1X1))     (19.3)

in which P(Y) is the probability of Y occurring, e is the base of natural
logarithms, and the other coefficients form a linear combination much the same
as in simple regression. In fact, you might notice that the bracketed portion of
the equation is identical to the linear regression equation: there is a constant (b0),
a predictor variable (X1) and a coefficient (or weight) attached to that predictor
(b1). Just like linear regression, it is possible to extend this equation so as to
include several predictors. When there are several predictors the equation
becomes:

P(Y) = 1 / (1 + e^−(b0 + b1X1 + b2X2 + … + bnXn))     (19.4)

Whereas the one-predictor version of the logistic regression equation contained the simple linear regression equation, the multiple-predictor version contains the
multiple regression equation.
The equation can be presented in several ways, but the version I have chosen
expresses the equation in terms of the probability of Y occurring (i.e., the
probability that a case belongs in a certain category). The resulting value from
the equation, therefore, varies between 0 and 1. A value close to 0 means that Y
is very unlikely to have occurred, and a value close to 1 means that Y is very
likely to have occurred. Also, just like linear regression, each predictor variable
in the logistic regression equation has its own parameter (b), which is estimated
from the sample data. Whereas in linear regression these parameters are
estimated using the method of least squares (Section 2.4.3), in logistic regression
maximum-likelihood estimation is used, which selects coefficients that make
the observed values most likely to have occurred. Essentially, parameters are
estimated by fitting models, based on the available predictors, to the observed
data. The chosen estimates of the bs will be ones that, when values of the
predictor variables are placed in it, result in values of Y closest to the observed
values.
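If you like to see equations in action, here is a short Python sketch (my own illustration rather than anything SPSS does; the data and variable names are made up) that computes P(Y) from equation (19.3) and then estimates b0 and b1 by maximum likelihood, i.e., by minimising the negative of the log-likelihood defined in the next section:

import numpy as np
from scipy.optimize import minimize

# A tiny made-up data set: one continuous predictor and a 0/1 outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def p_of_y(b, x):
    # Equation (19.3): P(Y) = 1 / (1 + e^-(b0 + b1*X1))
    return 1.0 / (1.0 + np.exp(-(b[0] + b[1] * x)))

def neg_log_likelihood(b):
    # The negative of the log-likelihood (see equation (19.5) in the next
    # section); minimising this is the same as maximising the likelihood
    p = np.clip(p_of_y(b, x), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_likelihood, x0=np.zeros(2))  # start from b0 = b1 = 0
b0_hat, b1_hat = fit.x
print(b0_hat, b1_hat)      # maximum-likelihood estimates of b0 and b1
print(p_of_y(fit.x, x))    # fitted probabilities, all between 0 and 1

The estimates are simply the values of b that make the observed pattern of 0s and 1s as probable as possible, which is exactly what maximum-likelihood estimation means.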

19.3.1. Assessing the model: the log-likelihood statistic ③

We’ve seen that the logistic regression model predicts the probability of an event
occurring for a given person (we denote the probability that Y occurs for the ith
person as P(Yi)), based on observations of whether or not the event did occur for
that person (we could denote the actual outcome for the ith person as Yi). So, for
a given person, Y will be either 0 (the outcome didn’t occur) or 1 (the outcome
did occur), and the predicted value, P(Y), will be a value between 0 (there is no
chance that the outcome will occur) and 1 (the outcome will certainly occur). We
saw in multiple regression that if we want to assess whether a model fits the data
we can compare the observed and predicted values of the outcome (if you
remember, we use R², the square of the Pearson correlation between observed values of the outcome and the values predicted by the regression model). Likewise, in
logistic regression, we use the observed and predicted values to assess the fit of
the model. The measure we use is the log-likelihood:

log-likelihood = Σ [ Yi ln(P(Yi)) + (1 − Yi) ln(1 − P(Yi)) ]     (19.5)

The log-likelihood is based on summing the probabilities associated with the predicted and actual outcomes (Tabachnick & Fidell, 2012). The log-likelihood
statistic is analogous to the residual sum of squares in multiple regression in the
sense that it is an indicator of how much unexplained information there is after
the model has been fitted. It follows, therefore, that large values of the log-
likelihood statistic indicate poorly fitting statistical models, because the larger
the value of the log-likelihood, the more unexplained observations there are.
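As a concrete (and entirely made-up) illustration of equation (19.5), the sketch below plugs some invented observed outcomes and predicted probabilities into the formula; none of this comes from SPSS output:

import numpy as np

# Made-up observed outcomes (1 = event occurred, 0 = it didn't) and the
# probabilities that a fitted model predicted for those same cases
y     = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.84, 0.23, 0.61, 0.92, 0.40])

# Equation (19.5): for each case take the log of the probability the model
# assigned to what actually happened, then sum over cases
log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_likelihood)  # a value nearer 0 indicates a better-fitting model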

19.3.2. Assessing the model: the deviance statistic ③

The deviance is very closely related to the log-likelihood: it’s given by


Deviance = –2 × log-likelihood

The deviance is often referred to as −2LL because of the way it is calculated. It’s
actually rather convenient to (almost) always use the deviance rather than the
log-likelihood because it has a chi-square distribution (see Chapter 18 and the
Appendix), which makes it easy to calculate the significance of the value.
Now, it’s possible to calculate a log-likelihood or deviance for different
models and to compare these models by looking at the difference between their
deviances. For example, it’s useful to compare a logistic regression model
against some kind of baseline state. The baseline state that’s usually used is the
model when only the constant is included. In multiple regression, the baseline
model we use is the mean of all scores (this is our best guess of the outcome
when we have no other information). With a categorical outcome it doesn’t make
sense to use the overall mean (all we know is whether an event happened or not),
so we use the frequency with which the outcome occurred instead. So, if the
outcome occurs 107 times, and doesn’t occur 72 times, then our best guess of the
outcome will be that it occurs (because it occurs more times than it doesn’t). As
such, like multiple regression, our baseline model is the model that gives us the
best prediction when we know nothing other than the values of the outcome: in
logistic regression this will be to predict the outcome that occurs most often.
This is the logistic regression model when only the constant is included. If we
then add one or more predictors to the model, we can compute the improvement
of the model as follows:

χ² = (−2LL(baseline)) − (−2LL(new))
df = k(new) − k(baseline)     (19.6)

So, we take the deviance for the new model and subtract it from the deviance for the baseline model (the model when only the constant is included). This
difference is known as a likelihood ratio and has a chi-square distribution with
degrees of freedom equal to the number of parameters, k, in the new model
minus the number of parameters in the baseline model. The number of
parameters in the baseline model will always be 1 (the constant is the only
parameter to be estimated); any subsequent model will have degrees of freedom
equal to the number of predictors plus 1 (i.e., the number of predictors plus one
parameter representing the constant).
If we build up models hierarchically (i.e., adding one predictor at a time) we
can also use equation (19.6) to compare these models. For example, if you have
a model (we’ll call it the ‘old’ model) and you add a predictor (the ‘new’ model)
to that model, you can see whether the new model has improved the fit using
equation (19.6) in which the baseline model is the ‘old’ model. The degrees of
freedom will again be the difference between the degrees of freedom of the two
models.
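To make the model-comparison arithmetic explicit, here is a small sketch (with invented log-likelihoods, not real output) that applies equation (19.6) and looks the result up against the chi-square distribution:

from scipy.stats import chi2

# Invented log-likelihoods: a constant-only model (1 parameter) and a model
# with three predictors plus the constant (4 parameters)
ll_baseline = -84.20
ll_new      = -71.35

deviance_baseline = -2 * ll_baseline
deviance_new      = -2 * ll_new

# Equation (19.6): the likelihood ratio (model chi-square) and its df
chi_square = deviance_baseline - deviance_new   # 25.7
df         = 4 - 1                              # parameters in new model minus baseline
p_value    = chi2.sf(chi_square, df)
print(chi_square, df, p_value)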
19.3.3. Assessing the model: R and R² ③

When we talked about linear regression, we saw that the multiple correlation
coefficient R and its squared value R2 were useful measures of how well the
model fits the data. We’ve also just seen that the likelihood ratio is similar
inasmuch as it is based on the level of correspondence between predicted and
actual values of the outcome. However, you can calculate a more literal version
of the multiple correlation in logistic regression known as the R-statistic. This R-
statistic is the partial correlation between the outcome variable and each of the
predictor variables and it can vary between −1 and 1. A positive value indicates
that as the predictor variable increases, so does the likelihood of the event
occurring. A negative value implies that as the predictor variable increases, the
likelihood of the outcome occurring decreases. If a variable has a small value of
R then it contributes only a small amount to the model.
The R-statistic is given by:

R = ±√[ (z² − 2df) / (−2LL(original)) ]     (19.7)

The −2LL term is the deviance for the original model, the Wald statistic (z) is
calculated as described in the next section, and the degrees of freedom can be
read from the summary table for the variables in the equation. However, because
this value of R is dependent upon the Wald statistic it is by no means an accurate
measure (we’ll see in the next section that the Wald statistic can be inaccurate
under certain circumstances). For this reason the value of R should be treated
with some caution, and it is invalid to square this value and interpret it as you
would in linear regression.
There is some controversy over what would make a good analogue to the R²
in logistic regression, but a measure described by Hosmer and Lemeshow (1989)
can be easily calculated. Hosmer and Lemeshow’s measure (denoted by R²L) is calculated as:

R²L = χ²model / (−2LL(baseline))     (19.8)

As such, R²L is calculated by dividing the model chi-square, which represents the change from the baseline (based on the log-likelihood), by the baseline chi-square (the deviance of the model before any predictors were entered). Given what the model chi-square represents (see Eq. 19.6), another way to express this is:

R²L = [ (−2LL(baseline)) − (−2LL(new)) ] / (−2LL(baseline))

R²L is the proportional reduction in the absolute value of the log-likelihood measure, and as such it is a measure of how much the badness of fit improves as
a result of the inclusion of the predictor variables. It can vary between 0
(indicating that the predictors are useless at predicting the outcome variable) and
1 (indicating that the model predicts the outcome variable perfectly).

However, SPSS doesn’t use this measure; it uses Cox and Snell’s (1989) R²CS, which is based on the deviance of the model (−2LL(new)), the deviance of the original model (−2LL(baseline)) and the sample size, n:

R²CS = 1 − e^[ ((−2LL(new)) − (−2LL(baseline))) / n ]     (19.9)

However, this statistic never reaches its theoretical maximum of 1. Therefore, Nagelkerke (1991) suggested the following amendment (Nagelkerke’s R²N):

R²N = R²CS / ( 1 − e^[ −(−2LL(baseline)) / n ] )     (19.10)

Although all of these measures differ in their computation (and the answers you
get), conceptually they are somewhat the same. So, in terms of interpretation
they can be seen as similar to the R² in linear regression in that they provide a
gauge of the substantive significance of the model.
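The following sketch (again with invented deviances and sample size) shows how the three R²-type measures described above are computed from −2LL(baseline), −2LL(new) and n:

import numpy as np

# Invented deviances for the baseline and fitted models, plus the sample size
deviance_baseline = 168.4   # -2LL(baseline)
deviance_new      = 142.7   # -2LL(new)
n                 = 130

model_chi_square = deviance_baseline - deviance_new

r2_hosmer_lemeshow = model_chi_square / deviance_baseline                 # equation (19.8)
r2_cox_snell       = 1 - np.exp((deviance_new - deviance_baseline) / n)   # equation (19.9)
r2_nagelkerke      = r2_cox_snell / (1 - np.exp(-deviance_baseline / n))  # equation (19.10)

print(r2_hosmer_lemeshow, r2_cox_snell, r2_nagelkerke)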
19.3.4. Assessing the contribution of predictors: the
Wald statistic ②

As in linear regression, we want to know not only how well the model overall
fits the data, but also the individual contribution of predictors. In linear
regression, we used the estimated regression coefficients (b) and their standard
errors to compute a t-statistic. In logistic regression there is an analogous
statistic, the z-statistic, which follows the normal distribution. Like the t-test in
linear regression, the z-statistic tells us whether the b coefficient for that
predictor is significantly different from zero. If the coefficient is significantly
different from zero then we can assume that the predictor is making a significant
contribution to the prediction of the outcome (Y):

z = b / SEb     (19.11)

Equation 19.11 shows how the z-statistic is calculated, and you can see it’s
basically identical to the t-statistic in linear regression (see equation (8.11)): it is
the value of the regression coefficient divided by its associated standard error.
The z-statistic was developed by Abraham Wald (Figure 19.2), and is known as
the Wald statistic. SPSS actually reports the Wald statistic as z², which
transforms it so that it has a chi-square distribution. The z-statistic is used to
ascertain whether a variable is a significant predictor of the outcome; however, it
should be used a little cautiously because, when the regression coefficient (b) is
large, the standard error tends to become inflated, resulting in the z-statistic
being underestimated (see Menard, 1995). The inflation of the standard error
increases the probability of rejecting a predictor as being significant when in
reality it is making a significant contribution to the model (i.e., a Type II error).
In general it is probably more accurate to enter predictors hierarchically and
examine the change in likelihood ratio statistics.
FIGURE 19.2 Abraham Wald writing ‘I must not devise test statistics prone to
having inflated standard errors’ on the blackboard 100 times
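In case it helps, here is the Wald calculation written out as a few lines of Python (the coefficient and standard error are invented); SPSS does this for you, so this is purely to show where the numbers come from:

from scipy.stats import norm

b    = 1.23   # invented regression coefficient
se_b = 0.48   # invented standard error of that coefficient

z    = b / se_b             # equation (19.11): the z- (Wald) statistic
wald = z ** 2               # the squared form SPSS reports; chi-square with 1 df
p    = 2 * norm.sf(abs(z))  # two-tailed p-value for z
print(z, wald, p)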

19.3.5. The odds ratio: exp(B) ③

More crucial to the interpretation of logistic regression is the value of the odds
ratio, which is the exponential of B (i.e., eB or exp(B)) and is an indicator of the
change in odds resulting from a unit change in the predictor. As such, it is
similar to the b coefficient in logistic regression but easier to understand
(because it doesn’t require a logarithmic transformation). When the predictor
variable is categorical the odds ratio is easier to explain, so imagine we had a
simple example in which we were trying to predict whether or not someone got
pregnant from whether or not they used a condom last time they made love. The
odds of an event occurring are defined as the probability of an event occurring
divided by the probability of that event not occurring (see equation (19.12)) and
should not be confused with the more colloquial usage of the word to refer to
probability. So, for example, the odds of becoming pregnant are the probability
of becoming pregnant divided by the probability of not becoming pregnant:

odds = P(event occurring) / P(event not occurring)     (19.12)

To calculate the change in odds that results from a unit change in the
predictor, we must first calculate the odds of becoming pregnant given that a
condom wasn’t used. We then calculate the odds of becoming pregnant given
that a condom was used. Finally, we calculate the proportionate change in these
two odds.
To calculate the first set of odds, we use equation (19.3) to calculate the
probability of becoming pregnant given that a condom wasn’t used. If we had
more than one predictor we would use equation (19.4). There are three unknown
quantities in this equation: the coefficient of the constant (b0), the coefficient for
the predictor (b1) and the value of the predictor itself (X). We’ll know the value
of X from how we coded the condom use variable (chances are we would’ve
used 0 = condom wasn’t used and 1 = condom was used). The values of b1 and
b0 will be estimated for us. We can calculate the odds as in equation (19.12).
Next, we calculate the same thing after the predictor variable has changed by
one unit. In this case, because the predictor variable is dichotomous, we need to
calculate the odds of getting pregnant, given that a condom was used. So, the
value of X is now 1 (rather than 0).
We now know the odds before and after a unit change in the predictor
variable. It is a simple matter to calculate the proportionate change in odds by
dividing the odds after a unit change in the predictor by the odds before that
change:

Δodds = odds after a unit change in the predictor / original odds

This proportionate change in odds is the odds ratio, and we can interpret it in
terms of the change in odds: if the value is greater than 1 then it indicates that as
the predictor increases, the odds of the outcome occurring increase. Conversely,
a value less than 1 indicates that as the predictor increases, the odds of the
outcome occurring decrease. We’ll see how this works with a real example
shortly.
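To see this with some (made-up) numbers, the sketch below calculates the odds of the outcome when X = 0 and when X = 1 using equations (19.3) and (19.12), takes their ratio, and confirms that this odds ratio is the same thing as exp(B); the coefficients are invented purely for illustration:

import numpy as np

# Invented coefficients for the condom example (X = 0: no condom, X = 1: condom)
b0 = -0.85
b1 = -2.30

def probability(x):
    # Equation (19.3)
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

def odds(x):
    # Equation (19.12): P(event occurring) / P(event not occurring)
    p = probability(x)
    return p / (1 - p)

odds_ratio = odds(1) / odds(0)   # proportionate change in odds for a unit change in X
print(odds_ratio, np.exp(b1))    # identical: the odds ratio is exp(B)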
19.3.6. Model building and parsimony ②

When you have more than one predictor, you can choose between the same
methods to build your model as described for ordinary regression (Section
8.5.1). As with ordinary regression, forced entry and hierarchical methods are
preferred. If you are undeterred by the criticisms of stepwise methods in the
previous chapter, then as with ordinary regression you can choose between a
forward or backward stepwise method. These methods work in the same way as
for ordinary regression, except that different statistics are used to determine
whether predictors are entered or removed from the model. For example, the
forward method enters predictors based on their score statistic, then assesses
removal based on the likelihood ratio statistic described in Section 18.3.3 (the
Forward: LR method), an arithmetically less intense version of the likelihood
ratio statistic called the conditional statistic (Forward: Conditional), or the Wald
statistic (Forward: Wald), in which case any predictors in the model that have
significance values of the Wald statistic (above the default removal criterion of
.1) will be removed. Of these methods the likelihood ratio method is the best
removal criterion because the Wald statistic can, at times, be unreliable (see
Section 19.3.4). The opposite of the forward method is the backward method,
which begins with all predictors included in the model and then removes
predictors if their removal is not detrimental to the fit of the model. Whether
removal is detrimental can be assessed using the same three methods as the
forward approach.

As with ordinary regression, stepwise methods are best avoided for theory
testing; however, they are used when no previous research exists on which to
base hypotheses for testing, and in situations where causality is not of interest
and you merely wish to find a model to fit your data (Agresti & Finlay, 1986;
Menard, 1995). As with ordinary regression, if you do use a stepwise method
then the backward method is preferable because forward methods are more
likely to exclude predictors involved in suppressor effects. In terms of the test
statistic used in stepwise methods, the Wald statistic, as we have seen, has a
tendency to be inaccurate in certain circumstances and so the likelihood ratio
method is best. As with ordinary regression, it is best to use hierarchical methods
and to build models in a systematic and theory-driven way. Although we didn’t
discuss this for ordinary regression (because things were getting complicated
enough already), when building a model we should strive for parsimony. In a
scientific context, parsimony refers to the idea that simpler explanations of a
phenomenon are preferable to complex ones. The statistical implication of using
a parsimony heuristic is that models be kept as simple as possible. In other
words, do not include predictors unless they have explanatory benefit. To
implement this strategy we first fit the model that includes all potential predictors, and then systematically remove any that don’t seem to contribute to
the model. This is a bit like a backward stepwise method, except that the
decision-making process is in the researcher’s hands: they make informed
decisions about what predictors should be removed. It’s also worth bearing in
mind that if you have interaction terms in your model then for that interaction
term to be valid you must retain the main effects involved in the interaction term
as well (even if they don’t appear to contribute much).
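If you wanted to do this kind of researcher-driven pruning outside of SPSS, the sketch below shows the general idea using Python’s statsmodels package with a hypothetical data frame (the column names, the random data and the decision to drop x3 are all invented): fit the full model, fit the model without one candidate predictor, and compare the two with a likelihood ratio test.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

# Hypothetical data frame with a 0/1 outcome and three candidate predictors
rng = np.random.default_rng(1)
df = pd.DataFrame({'outcome': rng.integers(0, 2, 200),
                   'x1': rng.normal(size=200),
                   'x2': rng.normal(size=200),
                   'x3': rng.normal(size=200)})

def fit(predictors):
    X = sm.add_constant(df[predictors])
    return sm.Logit(df['outcome'], X).fit(disp=0)

full    = fit(['x1', 'x2', 'x3'])
reduced = fit(['x1', 'x2'])   # candidate simplification: drop x3

# Likelihood ratio test (equation (19.6)) comparing the two models
lr_statistic = 2 * (full.llf - reduced.llf)
p_value      = chi2.sf(lr_statistic, df=1)
print(lr_statistic, p_value)  # a large p-value suggests x3 adds little and can go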

19.4. Sources of bias and common problems ④

19.4.1. Assumptions ②

Logistic regression, like any linear model, is open to the sources of bias
discussed in Chapter 5 and Section 8.3, so look back at those parts of the book.
In the context of logistic regression, it’s worth noting a couple of points about
the assumptions of linearity and independence:

Linearity: In ordinary regression we assumed that the outcome had linear relationships with the predictors. In logistic regression the outcome is
categorical and so this assumption is violated, and we use the log (or logit)
of the data. The assumption of linearity in logistic regression, therefore,
assumes that there is a linear relationship between any continuous
predictors and the logit of the outcome variable. This assumption can be
tested by looking at whether the interaction term between the predictor and
its log transformation is significant (Hosmer & Lemeshow, 1989); a brief sketch of the idea follows this list, and we will go through an example in Section 19.8.1.
Independence of errors: In logistic regression, violating this assumption
produces overdispersion, which we’ll discuss in Section 19.4.4.
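The linearity check mentioned above can be sketched as follows (this is my own statsmodels-based illustration with simulated data, not the SPSS procedure covered in Section 19.8.1): add the product of each continuous predictor and its natural log to the model and look at whether that term is significant.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: a positive continuous predictor and a 0/1 outcome generated
# from a model that really is linear in the logit
rng = np.random.default_rng(7)
x = rng.uniform(1, 50, 300)
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(-3 + 0.1 * x)))).astype(int)

# Add the interaction between the predictor and its log; if this term is
# significant, the assumption of linearity of the logit is in doubt
X = sm.add_constant(pd.DataFrame({'x': x, 'x_ln_x': x * np.log(x)}))
model = sm.Logit(y, X).fit(disp=0)
print(model.pvalues['x_ln_x'])  # non-significant is what we hope to see here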

Logistic regression also has some unique problems. These are not sources of
bias so much as things that can go wrong. SPSS solves logistic regression
problems by an iterative procedure (SPSS Tip 19.1). Sometimes, instead of
pouncing on the correct solution quickly, you’ll notice nothing happening: SPSS
begins to move infinitely slowly, or appears to have just got fed up with you
asking it to do stuff and has gone on strike. If it can’t find a correct solution, then
sometimes it actually does give up, quietly offering you (without any apology) a
result which is completely incorrect. Usually this is revealed by implausibly
large standard errors. Two situations can provoke this problem, both of which
are related to the ratio of cases to variables: incomplete information and
complete separation.

SPSS TIP 19.1 Error messages about ‘failure to converge’ ③


Many statistical procedures use an iterative process, which means that SPSS attempts to estimate the
parameters of the model by finding successive approximations of those parameters. Essentially, it starts
by estimating the parameters with a ‘best guess’. It then attempts to approximate them more accurately
(known as an iteration). It then tries again, and then again, and so on through many iterations. It stops
either when the approximations of parameters converge (i.e., at each new attempt the ‘approximations’
of parameters are the same or very similar to the previous attempt), or it reaches the maximum number
of attempts (iterations).
Sometimes you will get an error message in the output that says something like Maximum number of
iterations were exceeded, and the log-likelihood value and/or the parameter estimates cannot converge.
What this means is that SPSS has attempted to estimate the parameters the maximum number of times
(as specified in the options) but they are not converging (i.e., at each iteration SPSS is getting quite
different estimates). This certainly means that you should ignore any output that SPSS has produced,
and it might mean that your data are beyond help. You can try increasing the number of iterations that
SPSS attempts, or make the criteria that SPSS uses to assess ‘convergence’ less strict.
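The same issue arises in other software. As a rough analogue (a sketch using Python’s statsmodels with random made-up data; the option name is statsmodels’ own, not SPSS’s), you can raise the maximum number of iterations when fitting a logistic regression:

import numpy as np
import statsmodels.api as sm

# Made-up data: two predictors and a 0/1 outcome
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(150, 2)))
y = rng.integers(0, 2, 150)

# Allow more iterations than the default if the estimates fail to converge;
# leaving disp at its default prints information about convergence
model = sm.Logit(y, X).fit(maxiter=200)
print(model.params)  # the estimated coefficients (constant plus two predictors)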
