Live Session
Module 3
Logistic Regression
DS610
Advanced Applied
Statistics for Data Science
Instructor Name
Module 3 Overview
This module covers logistic regression. Logistic
regression is used for classification, where the
outcome can be binary (two classes) or have more than
two classes. You will use R to perform logistic
regression, estimate parameter values, and obtain
their confidence intervals.
Chapter 8 (Discovering Statistics Using R)
Module 3 Learning Outcomes
1. Understand when to perform logistic regression
and the underlying assumptions.
2. Perform binomial logistic regression using R.
3. Perform multinomial logistic regression using R.
Logistic Regression
Aims
• When and why do we use logistic
regression?
– Binary
– Multinomial
• Theory behind logistic regression
– Assessing the model
– Assessing predictors
– Things that can go wrong
• Interpreting logistic regression
When and Why
• To predict an outcome variable that is
categorical from one or more categorical or
continuous predictor variables.
• Used because having a categorical
outcome variable violates the assumption
of linearity in normal regression.
With One Predictor
• Outcome
– We predict the probability of the outcome
occurring
• b0 and b1
– Can be thought of in much the same way as in
multiple regression
– Note the normal regression equation forms part
of the logistic regression equation
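For reference, a standard way of writing the model with one predictor (using the b0 and b1 above) is:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$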
With Several Predictors
• Outcome
– We still predict the probability of the outcome
occurring
• Differences
– Note the multiple regression equation forms
part of the logistic regression equation
– This part of the equation expands to
accommodate additional predictors
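With several predictors the linear combination in the exponent simply expands:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n)}}$$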
Assessing the Model
$$\text{log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln\bigl(P(Y_i)\bigr) + (1 - Y_i)\ln\bigl(1 - P(Y_i)\bigr) \right]$$
• The log-likelihood statistic
– Analogous to the residual sum of squares in
multiple regression
– It is an indicator of how much unexplained
information there is after the model has been
fitted.
– Large values indicate poorly fitting statistical
models.
Assessing Changes in Models
• It’s possible to calculate a log-likelihood for
different models and to compare these
models by looking at the difference
between their log-likelihoods.
$$\chi^2 = 2\bigl[LL(\text{new}) - LL(\text{baseline})\bigr], \qquad df = k_{\text{new}} - k_{\text{baseline}}$$
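A minimal R sketch of this comparison, assuming two nested glm() fits stored in model.baseline and model.new (hypothetical names):
chi.change <- deviance(model.baseline) - deviance(model.new)        # 2[LL(new) - LL(baseline)]
df.change  <- df.residual(model.baseline) - df.residual(model.new)  # k_new - k_baseline
pchisq(chi.change, df.change, lower.tail = FALSE)                   # p-value for the improvement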
Assessing Predictors: The Wald
Statistic
• Similar to t-statistic in regression.
• Tests the null hypothesis that b = 0.
• Is biased when b is large.
• Better to look at likelihood ratio statistics.
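The Wald statistic is the estimate divided by its standard error (reported as the z value in glm() output); squaring it gives the Wald chi-square with 1 df used later in these slides:
$$z = \frac{b}{SE_b}, \qquad \text{Wald } \chi^2 = \left(\frac{b}{SE_b}\right)^2$$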
Assessing Predictors:
The Odds Ratio
• Indicates the change in odds resulting from
a unit change in the predictor.
– OR > 1: Predictor ↑, Probability of outcome
occurring ↑.
– OR < 1: Predictor ↑, Probability of outcome
occurring ↓.
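In terms of the model, the odds are P(event)/(1 − P(event)), and the odds ratio for a predictor is the exponential of its coefficient, which is why exp() is applied to the coefficients later in this module:
$$OR = e^{b}$$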
Methods of Regression
• Forced entry: all variables entered simultaneously.
• Hierarchical: variables entered in blocks.
– Blocks should be based on past research or the theory
being tested; this is generally a good method.
• Stepwise: variables entered on the basis of
statistical criteria (i.e. relative contribution to
predicting outcome).
– Should be used only for exploratory analysis.
Things That Can Go Wrong
• Assumptions from linear regression:
– Linearity
– Independence of errors
– Multicollinearity
• Unique problems
– Incomplete information
– Complete separation
– Overdispersion
Incomplete Information from the
Predictors
• Categorical predictors:
– Predicting cancer from smoking and eating tomatoes.
– We don’t know what happens when non-smokers eat
tomatoes because we have no data in this cell of the design.
• Continuous variables
– Will your sample include an 80-year-old, highly anxious,
Buddhist, left-handed lesbian?
Complete Separation
• When the outcome variable can be perfectly
predicted.
– E.g. predicting whether someone is a burglar, your
teenage son or your cat based on weight.
– Weight is a perfect predictor of cat/burglar unless
you have a very fat cat indeed!
Overdispersion
• Overdispersion is where the variance is
larger than expected from the model.
• This can be caused by violating the
assumption of independence.
• This problem makes the standard errors too
small!
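A rough check in R (a sketch, assuming a fitted binomial glm stored in model, a hypothetical name): the ratio of the residual deviance to its degrees of freedom should be close to 1, and values well above 1 suggest overdispersion.
model$deviance / model$df.residual   # dispersion estimate; well above 1 hints at overdispersion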
An Example
• Predictors of a treatment intervention.
• Participants
– 113 adults with a medical problem
• Outcome:
– Cured (1) or not cured (0).
• Predictors:
– Intervention: intervention or no treatment.
– Duration: the number of days before treatment that
the patient had the problem.
Basic Logistic Regression Analysis Using R Commander
[Figure: Reordering a factor in R Commander]
Basic Logistic Regression Analysis Using R Commander
[Figure: Dialog box for generalized linear models in R Commander]
Basic Logistic Regression Analysis Using R
newModel<-glm(outcome ~ predictor(s), data =
dataFrame, family = name of a distribution,
na.action = an action)
Hierarchical Regression Using R
• Model 1:
eelModel.1 <- glm(Cured ~ Intervention, data =
eelData, family = binomial())
• Model 2:
eelModel.2 <- glm(Cured ~ Intervention +
Duration, data = eelData, family = binomial())
summary(eelModel.1)
summary(eelModel.2)
Output Model 1: Intervention Only
Call:
glm(formula = Cured ~ Intervention, family = binomial(), data = eelData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5940 -1.0579 0.8118 0.8118 1.3018
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2877 0.2700 -1.065 0.28671
InterventionIntervention 1.2287 0.3998 3.074 0.00212 **
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 154.08 on 112 degrees of freedom
Residual deviance: 144.16 on 111 degrees of freedom
AIC: 148.16
Improvement: Model 1
• Find the improvement:
modelChi <- eelModel.1$null.deviance - eelModel.1$deviance
modelChi
[1] 9.926201
• Degrees of freedom:
chidf <- eelModel.1$df.null - eelModel.1$df.residual
chidf
[1] 1
• To calculate the probability associated with this chi-square statistic we
can use the pchisq() function.
chisq.prob <- 1 - pchisq(modelChi, chidf)
chisq.prob
[1] 0.001629425
Writing a Function to Compute R2
logisticPseudoR2s <- function(LogModel) {
dev <- LogModel$deviance                    # residual deviance of the fitted model
nullDev <- LogModel$null.deviance           # deviance of the intercept-only (baseline) model
modelN <- length(LogModel$fitted.values)    # sample size
R.l <- 1 - dev / nullDev                            # Hosmer and Lemeshow R^2
R.cs <- 1 - exp(-(nullDev - dev) / modelN)          # Cox and Snell R^2
R.n <- R.cs / (1 - (exp(-(nullDev / modelN))))      # Nagelkerke R^2
cat("Pseudo R^2 for logistic regression\n")
cat("Hosmer and Lemeshow R^2 ", round(R.l, 3), "\n")
cat("Cox and Snell R^2 ", round(R.cs, 3), "\n")
cat("Nagelkerke R^2 ", round(R.n, 3), "\n")
}
Writing a Function to Compute R2
• To use the function on our model, we simply
place the name of the logistic regression model
(in this case eelModel.1) in the function and
execute:
logisticPseudoR2s(eelModel.1)
• The output will be:
Pseudo R^2 for logistic regression
Hosmer and Lemeshow R^2 0.064
Cox and Snell R^2 0.084
Nagelkerke R^2 0.113
Calculating the Odds Ratio
• We can also calculate the odds ratio as the exponential of the
b coefficient for the predictor variables by executing:
exp(eelModel.1$coefficients)
(Intercept) InterventionIntervention
0.750000 3.416667
• To get the confidence intervals execute:
exp(confint(eelModel.1))
2.5 % 97.5 %
(Intercept) 0.4374531 1.268674
InterventionIntervention 1.5820127 7.625545
Output Model 2: Intervention and
Duration as Predictors
Call:
glm(formula = Cured ~ Intervention + Duration, family = binomial(),
data = eelData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6025 -1.0572 0.8107 0.8161 1.3095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.234660 1.220563 -0.192 0.84754
InterventionIntervention 1.233532 0.414565 2.975 0.00293 **
Duration -0.007835 0.175913 -0.045 0.96447
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 154.08 on 112 degrees of freedom
Residual deviance: 144.16 on 110 degrees of freedom
AIC: 150.16
Improvement: Model 2
• We can compare the models by finding the difference in
the deviance statistics as before.
• Or we can use the anova() function:
anova(eelModel.1, eelModel.2)
Analysis of Deviance Table
Model 1: Cured ~ Intervention
Model 2: Cured ~ Intervention + Duration
Resid. Df Resid. Dev Df Deviance
1 111 144.16
2 110 144.16 1 0.0019835
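To obtain the p-value for this comparison directly, anova() for glm models accepts a test argument; equivalently, the change in deviance shown above can be passed to pchisq():
anova(eelModel.1, eelModel.2, test = "Chisq")   # adds a Pr(>Chi) column to the table
pchisq(0.0019835, df = 1, lower.tail = FALSE)   # p-value for the change in deviance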
Summary
• The overall fit of the final model is shown by the deviance statistic and its
associated chi-square statistic.
– If the significance of the chi-square statistic is less than .05, then the model is a
significant fit to the data.
• Check the table labelled coefficients to see which variables significantly
predict the outcome.
• For each variable in the model, look at the z-statistic and its significance
(which again should be below .05).
• Use the odds ratio for interpretation. You can obtain this using
exp(model$coefficients), where model is the name of your model.
– If the value is greater than 1 then as the predictor increases, the odds of the
outcome occurring increase.
– A value less than 1 indicates that as the predictor increases, the odds of the
outcome occurring decrease.
– For the aforementioned interpretation to be reliable the confidence interval of
the odds ratio should not cross 1!
Reporting the Analysis
Multinomial Logistic Regression
• Logistic regression to predict membership of more than two
categories.
• It (basically) works in the same way as binary logistic regression.
• The analysis breaks the outcome variable down into a series of
comparisons between two categories.
– E.g., if you have three outcome categories (A, B and C), then the
analysis will consist of two comparisons that you choose:
• compare everything against your first category (e.g. A vs. B and A vs. C),
• or your last category (e.g. A vs. C and B vs. C),
• or a custom category (e.g. B vs. A and B vs. C).
• The important parts of the analysis and output are much the
same as we have just seen for binary logistic regression.
I May Not Be Fred Flintstone …
• How successful are chat-up lines?
• The chat-up lines used by 348 men and 672 women in a
nightclub were recorded.
• Outcome:
– Whether the chat-up line resulted in one of the following three events:
• the person got no response or the recipient walked away;
• the person obtained the recipient’s phone number;
• the person left the nightclub with the recipient.
• Predictors:
– The content of the chat-up lines was rated for:
• funniness (0 = not funny at all, 10 = the funniest thing that I have ever heard);
• sexuality (0 = no sexual content at all, 10 = very sexually direct);
• moral values (0 = the chat-up line does not reflect good characteristics, 10 =
the chat-up line is very indicative of good characteristics).
– Gender of recipient
Multinomial Logistic Regression in R
• We can use the mlogit.data() function to
convert our data into the correct format (this
function and mlogit() live in the mlogit package,
so load it first):
library(mlogit)
newDataframe <- mlogit.data(oldDataFrame,
choice = "outcome variable", shape =
"wide"/"long")
Restructuring the Data
• Therefore, to restructure the current data
we could execute:
mlChat <- mlogit.data(chatData, choice =
"Success", shape = "wide")
Running Multinomial Regression
• Now we are ready to run the multinomial logistic
regression, using the mlogit() function:
newModel<-mlogit(outcome ~ predictor(s), data =
dataFrame, na.action = an action, reflevel = a number
representing the baseline category for the outcome)
• We can, therefore, create the model by executing:
chatModel <- mlogit(Success ~ 1 | Good_Mate + Funny +
Gender + Sex + Gender:Sex + Funny:Gender, data =
mlChat, reflevel = 3)
summary(chatModel)
Interpretation
• To help with the interpretation we can
exponentiate the coefficients:
exp(chatModel$coefficients)
• We can make the output nicer by asking R
to print the variable as a dataframe:
data.frame(exp(chatModel$coefficients))
Exponentiated Coefficients
Confidence Intervals
• We can get confidence intervals for these
coefficients using the confint() function:
exp(confint(chatModel))
Confidence Intervals
Interpretation:
Phone Number vs. No Response
• Good_Mate: Whether the chat-up line showed signs of good moral fibre
significantly predicted whether you got a phone number or no response/walked
away, b = 0.13, Wald χ2(1) = 6.02, p < .05.
• Funny: Whether the chat-up line was funny did not significantly predict whether
you got a phone number or no response, b = 0.14, Wald χ2(1) = 1.60, p > .05.
• Gender: The gender of the person being chatted up significantly predicted
whether they gave out their phone number or gave no response, b = −1.65, Wald
χ2(1) = 4.27, p < .05.
• Sex: The sexual content of the chat-up line significantly predicted whether you got
a phone number or no response/walked away, b = 0.28, Wald χ2(1) = 9.59, p < .01.
• Funny × Gender: The success of funny chat-up lines depended on whether they
were delivered to a man or a woman because in interaction these variables
predicted whether or not you got a phone number, b = 0.49, Wald χ2(1) = 12.37, p
< .001.
• Sex × Gender: The success of chat-up lines with sexual content depended on
whether they were delivered to a man or a woman because in interaction these
variables predicted whether or not you got a phone number, b = −0.35, Wald χ2(1)
= 10.82, p < .01.
Interpretation:
Going Home vs. No Response
• Good_Mate: Whether the chat-up line showed signs of good moral fibre did not
significantly predict whether you went home with the date or got a slap in the face,
b = 0.13, Wald χ2(1) = 2.42, p > .05.
• Funny: Whether the chat-up line was funny significantly predicted whether you went
home with the date or no response, b = 0.32, Wald χ2(1) = 6.46, p < .05.
• Gender: The gender of the person being chatted up significantly predicted whether
they went home with the person or gave no response, b = −5.63, Wald χ2(1) = 17.93,
p < .001.
• Sex: The sexual content of the chat-up line significantly predicted whether you went
home with the date or got a slap in the face, b = 0.42, Wald χ2(1) = 11.68, p < .01.
• Funny × Gender: The success of funny chat-up lines depended on whether they were
delivered to a man or a woman because in interaction these variables predicted
whether or not you went home with the date, b = 1.17, Wald χ2(1) = 34.63, p < .001.
• Sex × Gender: The success of chat-up lines with sexual content depended on
whether they were delivered to a man or a woman because in interaction these
variables predicted whether or not you went home with the date, b = −0.48, Wald
χ2(1) = 8.51, p < .01.
Reporting the Results
References
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. London: SAGE Publications. Chapter 8.