
Live Session

Module 3
Logistic Regression

DS610
Advanced Applied
Statistics for Data Science

Instructor Name
Module 3 Overview
This module covers logistic regression. Logistic regression is used for classification,
where the outcome can be binary (two classes) or have more than two classes. You will
use R to perform logistic regression and to estimate parameter values and their
confidence intervals.

Chapter 8 (Discovering Statistics Using R)

Module 3 Learning Outcomes

1. Understand when to perform logistic regression and the underlying assumptions.
2. Perform binomial logistic regression using R.
3. Perform multinomial logistic regression using R.

Logistic Regression
Aims
• When and why do we use logistic
regression?
– Binary
– Multinomial
• Theory behind logistic regression
– Assessing the model
– Assessing predictors
– Things that can go wrong
• Interpreting logistic regression
When and Why

• To predict an outcome variable that is categorical from one or more categorical or continuous predictor variables.
• Used because having a categorical outcome variable violates the assumption of linearity in normal regression.

With One Predictor

• Outcome
– We predict the probability of the outcome
occurring
• b0 and b1
– Can be thought of in much the same way as
multiple regression
– Note the normal regression equation forms part
of the logistic regression equation
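The equation itself is not reproduced on this slide; as a sketch, the standard one-predictor form models the probability of the outcome as

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

where the linear part $b_0 + b_1 X_1$ is the familiar regression equation and the logistic function maps it onto a probability between 0 and 1.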

With Several Predictors

• Outcome
– We still predict the probability of the outcome
occurring
• Differences
– Note the multiple regression equation forms
part of the logistic regression equation
– This part of the equation expands to
accommodate additional predictors
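Again as a sketch, with several predictors the linear part of the equation simply expands:

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n)}}$$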

Assessing the Model
$$\text{log-likelihood} = \sum_{i=1}^{N}\Big[\,Y_i \ln\big(P(Y_i)\big) + (1 - Y_i)\ln\big(1 - P(Y_i)\big)\Big]$$

• The log-likelihood statistic


– Analogous to the residual sum of squares in
multiple regression
– It is an indicator of how much unexplained
information there is after the model has been
fitted.
– Large values indicate poorly fitting statistical
models.
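A note that connects this to the R output shown later in the module: for a binary outcome, the deviance that glm() reports is −2 times the log-likelihood,

$$\text{deviance} = -2 \times \text{log-likelihood}$$

so the null and residual deviances in the output play the role of the log-likelihood statistic described here, with larger values again indicating more unexplained information.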
Assessing Changes in Models
• It’s possible to calculate a log-likelihood for
different models and to compare these
models by looking at the difference
between their log-likelihoods.

$$\chi^2 = 2\big[LL(\text{new}) - LL(\text{baseline})\big], \qquad df = k_{\text{new}} - k_{\text{baseline}}$$
Assessing Predictors: The Wald
Statistic

• Similar to t-statistic in regression.


• Tests the null hypothesis that b = 0.
• Is biased when b is large.
• Better to look at likelihood ratio statistics.
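As a sketch of the computation (standard definition, not shown on the slide): the z value that R reports for each coefficient is the estimate divided by its standard error, and its square is the Wald statistic,

$$z = \frac{b}{SE_b}, \qquad \text{Wald} = z^2 \sim \chi^2(1).$$

The bias arises because a large b inflates its standard error, which shrinks z and can make a genuine effect look non-significant.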

Assessing Predictors:
The Odds Ratio

• Indicates the change in odds resulting from a unit change in the predictor.
– OR > 1: Predictor ↑, Probability of outcome occurring ↑.
– OR < 1: Predictor ↑, Probability of outcome occurring ↓.
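Concretely, the odds ratio for a predictor is the exponential of its b coefficient,

$$OR = e^{b},$$

so, for example, the intervention coefficient of b = 1.23 obtained later in this module corresponds to an odds ratio of about e^1.23 ≈ 3.42.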

Methods of Regression
• Forced entry: all variables entered simultaneously.
• Hierarchical: variables entered in blocks.
– Blocks should be based on past research, or theory
being tested. Good method.
• Stepwise: variables entered on the basis of
statistical criteria (i.e. relative contribution to
predicting outcome).
– Should be used only for exploratory analysis.

Things That Can Go Wrong
• Assumptions from linear regression:
– Linearity
– Independence of errors
– Multicollinearity
• Unique problems
– Incomplete information
– Complete separation
– Overdispersion
Incomplete Information from the
Predictors
• Categorical predictors:
– Predicting cancer from smoking and eating tomatoes.
– We don’t know what happens when non-smokers eat
tomatoes because we have no data in this cell of the design.
• Continuous variables
– Will your sample include an 80-year-old, highly anxious,
Buddhist, left-handed lesbian?
Complete Separation
• When the outcome variable can be perfectly
predicted.
– E.g. predicting whether someone is a burglar, your
teenage son or your cat based on weight.
– Weight is a perfect predictor of cat/burglar unless
you have a very fat cat indeed!
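A minimal sketch of what this looks like in R, using hypothetical toy data in the spirit of the cat/burglar example (the variable names and values are made up for illustration):

weight  <- c(1:10, 41:50)           # ten light cats, ten heavy burglars (hypothetical)
burglar <- rep(c(0, 1), each = 10)  # outcome is perfectly predicted by weight
sepModel <- glm(burglar ~ weight, family = binomial())
summary(sepModel)
# R typically warns "glm.fit: fitted probabilities numerically 0 or 1 occurred",
# and the coefficient and standard error for weight become absurdly large.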
Overdispersion
• Overdispersion is where the variance is
larger than expected from the model.
• This can be caused by violating the
assumption of independence.
• This problem makes the standard errors too
small!
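A quick diagnostic sketch in R (assuming a fitted binomial model such as the eel model built later in this module): compare the residual deviance with its degrees of freedom, and if the ratio is well above 1 consider refitting with a quasi-binomial family, which estimates a dispersion parameter and rescales the standard errors.

# Dispersion check: values much greater than 1 suggest overdispersion.
eelModel.2$deviance / eelModel.2$df.residual

# Possible remedy: refit with family = quasibinomial() (model name is illustrative).
eelModel.2.quasi <- glm(Cured ~ Intervention + Duration, data = eelData,
                        family = quasibinomial())
summary(eelModel.2.quasi)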
An Example
• Predictors of a treatment intervention.
• Participants
– 113 adults with a medical problem
• Outcome:
– Cured (1) or not cured (0).
• Predictors:
– Intervention: intervention or no treatment.
– Duration: the number of days before treatment that
the patient had the problem.

Basic Logistic Regression Analysis Using R
Commander

Reordering a factor in R Commander


Basic Logistic Regression Analysis Using R
Commander

Dialog box for generalized linear models in R Commander


Basic Logistic Regression Analysis Using R

newModel <- glm(outcome ~ predictor(s), data = dataFrame,
                family = name of a distribution, na.action = an action)
Hierarchical Regression Using R
• Model 1:
eelModel.1 <- glm(Cured ~ Intervention, data =
eelData, family = binomial())
• Model 2:
eelModel.2 <- glm(Cured ~ Intervention +
Duration, data = eelData, family = binomial())

summary(eelModel.1)
summary(eelModel.2)
Output Model 1: Intervention Only
Call:
glm(formula = Cured ~ Intervention, family = binomial(), data = eelData)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5940 -1.0579 0.8118 0.8118 1.3018

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2877 0.2700 -1.065 0.28671
InterventionIntervention 1.2287 0.3998 3.074 0.00212 **

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 154.08  on 112  degrees of freedom
Residual deviance: 144.16  on 111  degrees of freedom
AIC: 148.16
Improvement: Model 1
• Find the improvement:
modelChi <- eelModel.1$null.deviance - eelModel.1$deviance
modelChi

[1] 9.926201
• degrees of freedom :
chidf <- eelModel.1$df.null - eelModel.1$df.residual
chidf

[1] 1
• To calculate the probability associated with this chi-square statistic we
can use the pchisq() function.
chisq.prob <- 1 - pchisq(modelChi, chidf)
chisq.prob

[1] 0.001629425
Writing a Function to Compute R2
logisticPseudoR2s <- function(LogModel) {
  dev     <- LogModel$deviance              # residual deviance of the fitted model
  nullDev <- LogModel$null.deviance         # deviance of the intercept-only (baseline) model
  modelN  <- length(LogModel$fitted.values) # sample size
  R.l  <- 1 - dev / nullDev                       # Hosmer and Lemeshow R^2
  R.cs <- 1 - exp(-(nullDev - dev) / modelN)      # Cox and Snell R^2
  R.n  <- R.cs / (1 - (exp(-(nullDev / modelN)))) # Nagelkerke R^2
  cat("Pseudo R^2 for logistic regression\n")
  cat("Hosmer and Lemeshow R^2  ", round(R.l, 3), "\n")
  cat("Cox and Snell R^2        ", round(R.cs, 3), "\n")
  cat("Nagelkerke R^2           ", round(R.n, 3), "\n")
}
Writing a Function to Compute R2
• To use the function on our model, we simply
place the name of the logistic regression model
(in this case eelModel.1) in the function and
execute:
logisticPseudoR2s(eelModel.1)
• The output will be:
Pseudo R^2 for logistic regression
Hosmer and Lemeshow R^2 0.064
Cox and Snell R^2 0.084
Nagelkerke R^2 0.113
Calculating the Odds Ratio
• We can also calculate the odds ratio as the exponential of the
b coefficient for the predictor variables by executing:
exp(eelModel.1$coefficients)

(Intercept) InterventionIntervention
0.750000 3.416667

• To get the confidence intervals execute:


exp(confint(eelModel.1))

2.5 % 97.5 %
(Intercept) 0.4374531 1.268674
InterventionIntervention 1.5820127 7.625545
Output Model 2: Intervention and
Duration as Predictors
Call:
glm(formula = Cured ~ Intervention + Duration, family = binomial(),
data = eelData)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6025 -1.0572 0.8107 0.8161 1.3095

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.234660 1.220563 -0.192 0.84754
InterventionIntervention 1.233532 0.414565 2.975 0.00293 **
Duration -0.007835 0.175913 -0.045 0.96447

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 154.08  on 112  degrees of freedom
Residual deviance: 144.16  on 110  degrees of freedom
AIC: 150.16
Improvement: Model 2
• We can compare the models by finding the difference in
the deviance statistics as before.
• Or we can use the anova() function:
anova(eelModel.1, eelModel.2)

Analysis of Deviance Table

Model 1: Cured ~ Intervention
Model 2: Cured ~ Intervention + Duration
  Resid. Df Resid. Dev Df  Deviance
1       111     144.16
2       110     144.16  1 0.0019835
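The Analysis of Deviance table above does not print a p-value by default. A sketch of two equivalent ways to obtain one (using the models fitted earlier):

# Option 1: ask anova() for the likelihood-ratio (chi-square) test directly.
anova(eelModel.1, eelModel.2, test = "Chisq")

# Option 2: compute it by hand from the change in deviance, as before.
modelChi <- eelModel.1$deviance - eelModel.2$deviance
chidf <- eelModel.1$df.residual - eelModel.2$df.residual
1 - pchisq(modelChi, chidf)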
Summary
• The overall fit of the final model is shown by the deviance statistic and its
associated chi-square statistic.
– If the significance of the chi-square statistic is less than .05, then the model is a
significant fit to the data.
• Check the table labelled coefficients to see which variables significantly
predict the outcome.
• For each variable in the model, look at the z-statistic and its significance
(which again should be below .05).
• Use the odds ratio for interpretation. You can obtain this using
exp(model$coefficients), where model is the name of your model.
– If the value is greater than 1 then as the predictor increases, the odds of the
outcome occurring increase.
– A value less than 1 indicates that as the predictor increases, the odds of the
outcome occurring decrease.
– For the aforementioned interpretation to be reliable the confidence interval of
the odds ratio should not cross 1!
Reporting the Analysis
Multinomial Logistic Regression
• Logistic regression to predict membership of more than two
categories.
• It (basically) works in the same way as binary logistic regression.
• The analysis breaks the outcome variable down into a series of
comparisons between two categories.
– E.g., if you have three outcome categories (A, B and C), then the
analysis will consist of two comparisons that you choose:
• compare everything against your first category (e.g. A vs. B and A vs. C),
• or your last category (e.g. A vs. C and B vs. C),
• or a custom category (e.g. B vs. A and B vs. C).
• The important parts of the analysis and output are much the
same as we have just seen for binary logistic regression.
I May Not Be Fred Flintstone …
• How successful are chat-up lines?
• The chat-up lines used by 348 men and 672 women in a
nightclub were recorded.
• Outcome:
– Whether the chat-up line resulted in one of the following three events:
• the person got no response or the recipient walked away;
• the person obtained the recipient’s phone number;
• the person left the nightclub with the recipient.
• Predictors:
– The content of the chat-up lines was rated for:
• funniness (0 = not funny at all, 10 = the funniest thing that I have ever heard);
• sexuality (0 = no sexual content at all, 10 = very sexually direct);
• moral values (0 = the chat-up line does not reflect good characteristics, 10 =
the chat-up line is very indicative of good characteristics).
– Gender of recipient
Multinomial Logistic Regression in R

• We can use the mlogit.data() function to convert our data into the correct format:
newDataframe <- mlogit.data(oldDataFrame, choice = "outcome variable",
                            shape = "wide"/"long")
Restructuring the Data
• Therefore, to restructure the current data
we could execute:
mlChat <- mlogit.data(chatData, choice = "Success", shape = "wide")
Running Multinomial Regression
• Now we are ready to run the multinomial logistic
regression, using the mlogit() function:
newModel<-mlogit(outcome ~ predictor(s), data =
dataFrame, na.action = an action, reflevel = a number
representing the baseline category for the outcome)
• We can, therefore, create the model by executing:
chatModel <- mlogit(Success ~ 1 | Good_Mate + Funny +
Gender + Sex + Gender:Sex + Funny:Gender, data =
mlChat, reflevel = 3)

summary(chatModel)
Interpretation
• To help with the interpretation we can
exponentiate the coefficients:
exp(chatModel$coefficients)
• We can make the output nicer by asking R
to print the variable as a dataframe:
data.frame(exp(chatModel$coefficients))
Exponentiated Coefficients
Confidence Intervals
• We can get confidence intervals for these
coefficients using the confint() function:
exp(confint(chatModel))
Confidence Intervals
Interpretation:
Phone Number vs. No Response
• Good_Mate: Whether the chat-up line showed signs of good moral fibre
significantly predicted whether you got a phone number or no response/walked
away, b = 0.13, Wald χ2(1) = 6.02, p < .05.
• Funny: Whether the chat-up line was funny did not significantly predict whether
you got a phone number or no response, b = 0.14, Wald χ2(1) = 1.60, p > .05.
• Gender: The gender of the person being chatted up significantly predicted
whether they gave out their phone number or gave no response, b = −1.65, Wald
χ2(1) = 4.27, p < .05.
• Sex: The sexual content of the chat-up line significantly predicted whether you got
a phone number or no response/walked away, b = 0.28, Wald χ2(1) = 9.59, p < .01.
• Funny × Gender: The success of funny chat-up lines depended on whether they
were delivered to a man or a woman because in interaction these variables
predicted whether or not you got a phone number, b = 0.49, Wald χ2(1) = 12.37, p
< .001.
• Sex × Gender: The success of chat-up lines with sexual content depended on
whether they were delivered to a man or a woman because in interaction these
variables predicted whether or not you got a phone number, b = −0.35, Wald χ2(1)
= 10.82, p < .01.
Interpretation:
Going Home vs. No Response
• Good_Mate: Whether the chat-up line showed signs of good moral fibre did not
significantly predict whether you went home with the date or got a slap in the face,
b = 0.13, Wald χ2(1) = 2.42, p > .05.
• Funny: Whether the chat-up line was funny significantly predicted whether you went
home with the date or no response, b = 0.32, Wald χ2(1) = 6.46, p < .05.
• Gender: The gender of the person being chatted up significantly predicted whether
they went home with the person or gave no response, b = −5.63, Wald χ2(1) = 17.93,
p < .001.
• Sex: The sexual content of the chat-up line significantly predicted whether you went
home with the date or got a slap in the face, b = 0.42, Wald χ2(1) = 11.68, p < .01.
• Funny × Gender: The success of funny chat-up lines depended on whether they were
delivered to a man or a woman because in interaction these variables predicted
whether or not you went home with the date, b = 1.17, Wald χ2(1) = 34.63, p < .001.
• Sex × Gender: The success of chat-up lines with sexual content depended on
whether they were delivered to a man or a woman because in interaction these
variables predicted whether or not you went home with the date, b = −0.48, Wald
χ2(1) = 8.51, p < .01.
Reporting the Results
References

Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage. Chapter 8.

