Introduction To Logistic Regression
by Simon Moss
Introduction
Logistic regression, also called binary logistic regression, is commonly utilized in many fields,
such as the health sciences. In essence, logistic regression is used either
to examine whether a set of variables, such as age, gender, and IQ, predicts one of two
outcomes, such as whether or not candidates will complete their PhD, or
to compare two conditions or groups on a set of variables.
A similar technique, called multinomial logistic regression, is used if you want to predict more
than two outcomes or compare more than two conditions. This document primarily introduces
logistic regression but also broaches multinomial logistic regression. This document does
not assume extensive knowledge of statistics, but may be easier to grasp if you are familiar with
linear regression, a technique that is discussed in another document.
A simple example
Example
To introduce you to logistic regression, consider this example. Suppose you want to predict
which research candidates are likely to complete their thesis on time. To investigate this topic, a
researcher administers a survey to 500 individuals who had enrolled in a PhD or Masters by Research
over 10 years ago. This survey includes questions that assess whether these individuals completed
on time as well as their self-esteem, IQ, age, and gender.
An extract of the data appears in the following screen. Like most data files, each row
corresponds to one person. Each column corresponds to a separate characteristic, called a variable.
In the column called completion, 0 represents did not complete on time, and 1 represents
completed on time. In the column called gender, 0 represents females, and 1 represents males.
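As a rough sketch of this layout in R, the following data frame uses the same coding scheme. The four rows are hypothetical values invented purely for illustration, and the column names are assumptions that match the R example later in this document.

```r
# Hypothetical rows, invented only to show the coding scheme
mydata <- data.frame(
  completion = c(0, 1, 1, 0),         # 0 = did not complete on time, 1 = completed on time
  selfesteem = c(4, 7, 6, 3),
  IQ         = c(105, 110, 98, 120),
  age        = c(31, 25, 42, 28),
  gender     = c(0, 1, 0, 1)          # 0 = female, 1 = male
)

# Each row is one person; each column is one variable
str(mydata)
table(mydata$completion)
```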
Logistic regression can be utilised to examine whether
self-esteem, IQ, age, and gender predict, or are associated with, whether research candidates
completed on time. Logistic regression can also be utilised to examine whether
self-esteem is related to whether candidates complete on time after controlling for IQ, age, and gender.
These aims will become clearer as you read.
Many software packages can be utilized to conduct logistic regression. This example utilises
SPSS. If you use another package, such as R or Stata, perhaps follow this example anyway. Later,
this document clarifies how to conduct logistic regression in R and Stata. In SPSS, to generate the
following screen, select the “Analyse” menu, and choose “Regression” and then “Binary Logistic”.
Designate “Completion” as the “Dependent” variable. That is, select “Completion” and then
press the top arrow.
Designate “Self-esteem”, “IQ”, “Age”, and “Gender” as the “Covariate” variables. These variables
are sometimes called predictors instead of covariates.
Press Continue and then OK.
You will receive several tables of output. Here is the most important table, called “Variables in
the equation”.
To utilize the output called “Variables in the equation”, first interpret the p values.
Specifically, proceed to the column called “Sig”, which presents the p values.
In this example, the p value associated with self-esteem is less than .05 and thus significant.
Consequently, we conclude that self-esteem is related to whether candidates complete on time
after controlling for IQ, age, and gender.
In contrast, the p value associated with IQ exceeds .05 and is thus not significant.
Consequently, we conclude that IQ is not significantly related to whether candidates complete
on time after controlling for self-esteem, age, and gender.
These principles will be clarified later.
Next, proceed to the column called “B”, which presents values called B coefficients.
In this example, the B coefficient associated with self-esteem is positive.
Consequently, we conclude that self-esteem is positively related to completing on time after
controlling for IQ, age, and gender. That is, self-esteem seems to facilitate completions.
The B coefficients also provide some insight into the extent to which the variables, such as
self-esteem or IQ, differentiate the groups. More specifically, the column labelled Exp(B) is especially
informative. Technically, Exp(B) represents $e^{B}$. The e is a constant, sometimes called Euler’s
number, that approximates 2.718. Therefore, this column equals $2.718^{B}$.
For example, for self-esteem, B is .441; the value in the column labelled Exp(B) is thus $2.718^{.441}$,
or approximately 1.555.
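The relationship between B and Exp(B) can be checked in R with the built-in exp function. This is only a minimal sketch that uses the rounded B of .441 reported above. The result, about 1.554, differs from the 1.555 in the table in the third decimal, presumably because the software calculates Exp(B) from an unrounded B.

```r
# B coefficient for self-esteem, as rounded in the output
B_selfesteem <- 0.441

# Exp(B) is e raised to the power of B
exp_B <- exp(B_selfesteem)
round(exp_B, 3)

# The same value, written with the approximate constant 2.718
round(2.718^B_selfesteem, 3)
```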
So, what does this number mean? How do you interpret this 1.555? To understand the answer,
you first need to appreciate the concept of odds. To clarify this concept,
suppose that 80%, or .80, of research candidates complete their PhD on time.
The odds equal the probability they complete their PhD on time divided by the probability they do not
complete their PhD on time.
In this instance, the odds they complete their PhD on time are thus .80/.20 = 4.
In other words, PhD candidates are 4 times as likely to complete on time as not to complete on
time.
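The conversion from a probability to odds is simple enough to sketch in R. The variable names below are illustrative only.

```r
# Probability that a candidate completes on time
p_complete <- 0.80

# Odds = probability of the outcome divided by probability of no outcome
odds <- p_complete / (1 - p_complete)
odds   # 4
```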
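So, how is this concept of odds related to the column Exp(B)? Roughly, Exp(B) indicates the
degree to which the covariate, such as self-esteem, affects the odds. Strictly speaking, an increase of
one unit on the covariate multiplies the odds by Exp(B), provided the other covariates remain
unchanged. To illustrate, if the odds of completing on time were 4, an increase of one unit in
self-esteem would multiply these odds by 1.555, and the odds would become about 6.22.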
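A short R sketch may help show how Exp(B) acts on the odds. The starting odds of 4 are borrowed from the earlier odds example and are an assumption made for illustration; they are not a value taken from the output.

```r
odds_before <- 4       # assumed starting odds of completing on time
exp_B <- 1.555         # Exp(B) for self-esteem

# A one-unit increase in self-esteem multiplies the odds by Exp(B)
odds_after <- odds_before * exp_B
odds_after             # 6.22

# The same change expressed as probabilities
p_before <- odds_before / (1 + odds_before)   # 0.80
p_after  <- odds_after / (1 + odds_after)     # about 0.86
c(p_before, p_after)
```

Note that the odds are multiplied by a constant, but the probability does not increase by a constant amount.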
Logistic regression can be utilized to generate equations that predict the likelihood of some
outcome, such as the probability of PhD completion, from a set of predictors or covariates, such as
self-esteem and IQ. These equations are not only useful but could also help you understand the
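rationale that underpins logistic regression. In particular, logistic regression assumes that
$\log_e(\text{odds that a person is in Group 1}) = B_1 \times \text{predictor}_1 + B_2 \times \text{predictor}_2 + \dots + \text{Constant}$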
Initially, this formula might seem meaningless. But the following steps illustrate how you could
utilize this equation.
To calculate the right side of this equation, multiply each value in the B column by the
corresponding predictor, and then sum these answers together with the constant.
In this example, the right side is $.441 \times \text{self-esteem} + .007 \times \text{IQ} - .002 \times \text{age} + .409 \times \text{gender} - 2.668$.
As this example shows, the word “Constant” can be omitted from the equation; only its value, -2.668, appears.
Therefore, in this example, the equation is
$\log_e(\text{odds that a person is in Group 1}) = .441 \times \text{self-esteem} + .007 \times \text{IQ} - .002 \times \text{age} + .409 \times \text{gender} - 2.668$
Suppose a person arrived with a self-esteem of 7, an IQ of 110, an age of 25, and a gender of 1,
representing males.
You would then substitute these values into the formula.
In particular, $\log_e(\text{odds the person will complete}) = .441 \times 7 + .007 \times 110 - .002 \times 25 + .409 \times 1 - 2.668 = 1.548$
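The substitution can be reproduced in R as a check on the arithmetic. This is a minimal sketch: the B values and the constant of -2.668 come from the equation in this example, and the vector names are illustrative.

```r
# B coefficients and constant from the "Variables in the equation" table
B <- c(selfesteem = 0.441, IQ = 0.007, age = -0.002, gender = 0.409)
constant <- -2.668

# The person described in the text
person <- c(selfesteem = 7, IQ = 110, age = 25, gender = 1)

# Multiply each B by the matching predictor, sum, and add the constant
log_odds <- sum(B * person[names(B)]) + constant
log_odds   # 1.548
```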
But what does this value of 1.548 mean? What does $\log_e(\text{odds the person will complete})$
imply? This expression does not seem intuitive at all. Fortunately, you can then utilize the following
formula
$\text{Probability (person is in Group 1)} = 1 / [1 + e^{-\log_e(\text{odds that a person is in Group 1})}]$
In this instance, the probability a person is in Group 1 = $1/(1 + e^{-1.548})$ = .825. Hence, the
probability this person will complete a thesis on time is about .825. This formula can thus be used to
predict the probability of an outcome, such as the probability a person will complete a thesis, from a
set of covariates, such as self-esteem, IQ, age, and gender.
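To convert a log odds value into a probability, you can exponentiate the log odds to recover the odds and then divide the odds by one plus the odds. The R sketch below applies these two steps to the value of 1.548 and compares the answer with plogis, the logistic function built into R.

```r
log_odds <- 1.548

# Step 1: undo the logarithm to recover the odds
odds <- exp(log_odds)              # about 4.70

# Step 2: convert the odds to a probability
probability <- odds / (1 + odds)   # about 0.825
probability

# plogis() performs both steps at once: 1 / (1 + exp(-log_odds))
plogis(log_odds)
```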
But, how does SPSS, or any software, generate the B values? Which formulas or procedures
does the computer need to complete? In essence, to estimate these B values the software utilizes
the previous formula to predict the likelihood each person is in Group 1—that is, the likelihood that
each person will complete the thesis on time. These values appear in the following spreadsheet, in
the column called Probability. In practice, these probabilities would not appear in the datasheet, but
are merely presented here to facilitate learning.
According to this formula, the probability that the first individual belongs to group 1, and thus
will complete the thesis on time, is 0.87.
In reality, this individual did not complete the thesis on time.
Hence, this estimated probability is inaccurate, and
the software will gradually adjust the B values to improve the equation.
Specifically, the software continues to adjust the B values until, as far as possible, all of the
individuals in group 0 yield low probabilities and all the individuals in group 1 yield high probabilities.
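The following R sketch shows one way this gradual adjustment could proceed. It is not a description of what SPSS does internally: it assumes the B values are chosen by maximum likelihood and adjusted with Newton-Raphson steps, which is one common approach. The eight data points are hypothetical.

```r
# Hypothetical toy data: one predictor x and a 0/1 outcome y
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
X <- cbind(constant = 1, x = x)   # the column of 1s carries the constant
b <- c(0, 0)                      # starting B values

for (step in 1:25) {
  # Probability of group 1 for each person under the current B values
  p <- as.vector(1 / (1 + exp(-(X %*% b))))
  # Direction in which each B should move to fit the data better
  gradient <- t(X) %*% (y - p)
  # Curvature of the log-likelihood; each row of X is weighted by p(1 - p)
  curvature <- t(X) %*% (X * (p * (1 - p)))
  # Newton-Raphson update of the B values
  b <- as.vector(b + solve(curvature, gradient))
}

b                                       # B values after repeated adjustment
coef(glm(y ~ x, family = "binomial"))   # B values from R's built-in routine
```

The two sets of B values should agree closely, which suggests that repeated adjustment of this kind reaches the same answer as the built-in function.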
Controlling variables
Spurious variables
The previous section showed that self-esteem is positively associated with the likelihood a
person will complete the thesis on time after controlling for IQ, age, and gender. So, logistic regression,
like linear regression, can be utilised to explore associations after controlling for other variables. But
what does controlling variables actually mean? And why would you want to control variables? To
illustrate, consider the following table, in which each row represents one person.
This table generates some interesting conclusions. If you scan the last two columns, you will
conclude that self-esteem seems to coincide with completion. That is, people with high scores on
self-esteem, in the final six rows, tend to complete their thesis. People with low self-esteem did not
tend to complete their thesis. And yet, another explanation is possible.
Perhaps age affects both self-esteem and the inclination of people to complete the thesis.
That is, as people age, their self-esteem and motivation to complete a thesis on time might both
tend to improve, as their life becomes more certain.
So, to assess whether a boost to self-esteem would really affect whether people complete their
thesis on time, the researcher needs to control age.
For example, the researcher could survey only people who are aged in their twenties.
Indeed, as the following table shows, if you examine only people aged in their twenties, the
association between self-esteem and whether a person completed a thesis is not as apparent. That is,
when you scan the second and third column now, the higher scores on self-esteem do not
necessarily correspond to the people who completed the thesis on time. In short, we should control
variables that could affect both the predictor and outcome, such as age. These variables are called
spurious variables. Otherwise, the apparent relationship could be ascribed to this spurious variable.
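A small simulation in R can illustrate this principle. In the sketch below, all the data are simulated under an assumption chosen for illustration: age affects both self-esteem and completion, but self-esteem has no effect of its own on completion. The variable names and effect sizes are hypothetical.

```r
set.seed(1)
n <- 2000
age_z <- rnorm(n)                                 # standardised age
selfesteem <- age_z + rnorm(n)                    # self-esteem rises with age
completion <- rbinom(n, 1, plogis(1.5 * age_z))   # completion depends on age only

# Without controlling age, self-esteem appears to predict completion
unadjusted <- glm(completion ~ selfesteem, family = "binomial")

# After controlling age, the B coefficient for self-esteem shrinks towards zero
adjusted <- glm(completion ~ selfesteem + age_z, family = "binomial")

coef(unadjusted)["selfesteem"]
coef(adjusted)["selfesteem"]
```

In this simulation, the first B coefficient should be clearly positive and the second should be close to zero, because the apparent relationship is entirely ascribed to age.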
Confounds
Besides spurious variables, researchers might also want to control variables for other
reasons. In particular, the measures are sometimes contaminated or confounded with other
variables. To illustrate, perhaps the measure of IQ is confounded with self-esteem. For example
To control a variable, you can examine only a subset of participants, such as only people who are 18.
Or you can utilize statistical tests to predict what the results would be if you had controlled
variables, such as if all the participants were average in age. Logistic regression is one of these
tests. That is, logistic regression can estimate what the association between whether a person
completed a thesis and self-esteem would have been had you controlled IQ and age.
So, when should you control variables? You should control variables whenever you have
collected information about a variable, such as age or IQ, that is likely to be strongly associated with
the outcome, in this instance, whether the person completed the thesis. IQ is likely to be
associated with completion, so IQ should be controlled if possible. Height is not as likely to be associated
with completion, so height might not need to be controlled.
Other techniques, such as MANOVA and discriminant function analyses, can also be used to
compare groups on multiple variables. Nevertheless, whenever you want to compare only two
groups, such as people who completed their thesis on time and people who did not complete their
thesis on time, logistic regression is preferable. In particular,
logistic regression is preferable when the sample size is reasonably large, such as more than 100
individuals or units.
The main reason is that, whenever the sample size is sufficiently large, the underlying
assumptions of logistic regression are likely to be fulfilled.
Multinomial regression
Logistic regression, or at least binary logistic regression, can compare only two groups, such as
people who completed their thesis on time and people who did not complete their thesis on time.
However, if you want to compare more than two groups—such as candidates who completed on
time, candidates who completed late, and candidates who never completed—you need to utilize a
variant of logistic regression called multinomial regression. In practice, multinomial regression is
very similar except
To illustrate, suppose that SPSS generates the following output. According to this output,
the B coefficient for self-esteem associated with group 0 is not significant; p = .258.
Thus, self-esteem does not differ significantly between group 0 and group 2, the reference category.
Parameter Estimates
Completion (a)   Predictor     B       Std. Error   Wald    df   Sig.   Exp(B)   95% Confidence Interval Lower Bound
.00              Intercept     7.167   7.825        .839    1    .360
                 Self_esteem   -.293   .259         1.282   1    .258   .746     .449
                 IQ            -.043   .068         .396    1    .529   .958     .839
                 Age           .010    .053         .033    1    .856   1.010    .910
1.00             Intercept     5.744   7.657        .563    1    .453
                 Self_esteem   .083    .229         .131    1    .717   1.087    .693
                 IQ            -.040   .067         .367    1    .545   .960     .843
                 Age           .000    .053         .000    1    .993   1.000    .902
a. The reference category is: 2.00.
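The columns in this output are related to one another, and a brief R sketch can show how. The sketch uses the self-esteem row for group 0 and assumes that Wald is the square of B divided by its standard error, that the p value is derived from a chi-square distribution with 1 degree of freedom, and that the confidence interval is calculated on the B scale and then exponentiated. Because B and the standard error are rounded in the output, the results match the table only approximately.

```r
# Self-esteem row for group 0, as rounded in the output
B  <- -0.293
SE <- 0.259

exp_B   <- exp(B)                                     # about .746
wald    <- (B / SE)^2                                 # about 1.28
p_value <- pchisq(wald, df = 1, lower.tail = FALSE)   # about .258
lower   <- exp(B - qnorm(0.975) * SE)                 # about .449

round(c(ExpB = exp_B, Wald = wald, Sig = p_value, Lower = lower), 3)
```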
Software
R
If you use R, logistic regression is simple. In essence, the code resembles
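# Fit a binary logistic regression; completion is coded 0 or 1
Model1 <- glm(completion ~ selfesteem + IQ + age + gender, data = mydata, family = "binomial")
# Display the B coefficients, standard errors, and p values
summary(Model1)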
To conduct multinomial regression, researchers tend to use a different package and function:
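One package often used for this purpose is nnet, which provides a function called multinom. The sketch below is offered on that assumption and is not necessarily the code the author intended. The data are simulated only so that the example runs, and the names mydata and Model2 are illustrative.

```r
library(nnet)   # assumed to be installed; it ships with most R distributions

# Hypothetical data, simulated only so that the example runs
set.seed(1)
n <- 150
mydata <- data.frame(
  completion = factor(sample(0:2, n, replace = TRUE)),   # three groups: 0, 1, 2
  selfesteem = rnorm(n, mean = 5, sd = 1.5),
  IQ         = rnorm(n, mean = 105, sd = 10),
  age        = rnorm(n, mean = 30, sd = 6)
)

# Make group 2 the reference category, as in the SPSS output
mydata$completion <- relevel(mydata$completion, ref = "2")

# trace = FALSE suppresses the iteration log
Model2 <- multinom(completion ~ selfesteem + IQ + age, data = mydata, trace = FALSE)
summary(Model2)      # B coefficients and standard errors for groups 0 and 1
exp(coef(Model2))    # the equivalent of the Exp(B) column
```

The output contains one row of B coefficients for each group other than the reference category.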
Stata
In Stata, to conduct logistic regression or multinomial logistic regression, you specify the
categorical variable and then the covariates, such as
Note that base(2) is optional, but can be used to specify which group should be assigned as the
reference category.