Logistic Regression Essentials in R - Articles - STHDA
Logistic regression belongs to the family of Generalized Linear Models (GLM), developed to extend the linear regression model (Chapter @ref(linear-regression)) to other situations. It is also known as binary logistic regression, binomial logistic regression and the logit model.
Logistic regression does not directly return the class of observations. Instead, it estimates the probability (p) of class membership, which ranges between 0 and 1. You need to decide the threshold probability at which the category flips from one to the other. By default, this is set to p = 0.5, but in practice it should be chosen based on the purpose of the analysis.
In this chapter you will learn how to:
Define the logistic regression equation and key terms such as log-odds and logit
Perform logistic regression in R and interpret the results
Make predictions on new test data and evaluate the model accuracy
Contents:
Logistic function
Loading required R packages
Preparing the data
Computing logistic regression
Quick start R code
Simple logistic regression
Multiple logistic regression
Interpretation
Making predictions
Assessing model accuracy
Discussion
References
Logistic function
The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 + exp(y)] (James et al. 2014). This can also be written as p = 1/[1 + exp(-y)], where:
y = b0 + b1*x,
exp() is the exponential function and
p is the probability that the event occurs (event = 1) given x. Mathematically, this is written as p(event=1|x) and abbreviated as p(x), so p(x) = 1/[1 + exp(-(b0 + b1*x))]
By a bit of manipulation, it can be demonstrated that p/(1-p) = exp(b0 + b1*x). By taking the logarithm of both sides, the formula becomes a linear combination of
predictors: log[p/(1-p)] = b0 + b1*x.
When you have multiple predictor variables, the logistic function looks like: log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn
b0 and b1 are the regression beta coefficients. A positive b1 indicates that increasing x will be associated with increasing p. Conversely, a negative b1 indicates that increasing x will be associated with decreasing p.
The quantity log[p/(1-p)] is called the logarithm of the odds, also known as the log-odds or logit.
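As a quick numerical check of the algebra above, the following minimal sketch in base R defines the logistic function and the logit and confirms that they are inverses of each other. The coefficient values b0 and b1 and the helper names logistic() and logit() are illustrative assumptions, not estimates from the diabetes data.

```r
# Illustrative coefficients (assumed values, not fitted to any data)
b0 <- -6
b1 <- 0.05

# Logistic function: maps the linear predictor y to a probability in (0, 1)
logistic <- function(y) 1 / (1 + exp(-y))

# Logit (log-odds): maps a probability back to the linear predictor
logit <- function(p) log(p / (1 - p))

x <- c(50, 100, 150, 200)
y <- b0 + b1 * x   # linear predictor
p <- logistic(y)   # probabilities

# Both forms of the logistic function give the same result
all.equal(p, exp(y) / (1 + exp(y)))

# The log-odds of p recover the linear combination b0 + b1*x
all.equal(logit(p), y)

# Base R provides the same pair of functions as plogis() and qlogis()
all.equal(p, plogis(y))
```

Because b1 is positive in this sketch, p increases with x, which matches the interpretation of the sign of b1 given above.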
The odds reflect the likelihood that the event will occur. They can be seen as the ratio of "successes" to "non-successes". Technically, odds are the probability of an event divided by the probability that the event will not take place (P. Bruce and Bruce 2017). For example, if the probability of being diabetes-positive is 0.5, the probability of not being diabetes-positive is 1-0.5 = 0.5, and the odds are 1.0.
Note that the probability can be calculated from the odds as p = Odds/(1 + Odds).
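The conversion between probability and odds can be checked with a few lines of base R. This is a small sketch; the helper names prob_to_odds() and odds_to_prob() are hypothetical and introduced only for illustration.

```r
# Probability of the event, e.g. being diabetes-positive
p <- 0.5

# Odds: probability of the event divided by the probability of no event
odds <- p / (1 - p)
odds                # 1

# Converting the odds back to a probability
odds / (1 + odds)   # 0.5

# Hypothetical helper functions for the two conversions
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

prob_to_odds(0.8)   # 4 (up to floating-point rounding), i.e. "4 to 1" odds
odds_to_prob(4)     # 0.8
```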
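Loading required R packages
library(tidyverse)      # data manipulation and visualization
library(caret)          # data splitting and machine learning workflow
theme_set(theme_bw())   # default ggplot2 theme for the plots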
Preparing the data
Performing the following steps might improve the accuracy of your model
Here, we'll use the PimaIndiansDiabetes2 data set [in the mlbench package], introduced in Chapter @ref(classification-in-r), for predicting the probability of being diabetes-positive based on multiple clinical variables.
We'll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model). Make sure to set a seed for reproducibility.
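The data preparation described above could look like the sketch below. It assumes that the mlbench and caret packages are installed. Dropping incomplete rows with na.omit(), the seed value 123 and the object name training.samples are assumptions made for illustration; train.data and test.data are the names used in the rest of this chapter.

```r
library(caret)   # provides createDataPartition()

# Load the data from mlbench and drop rows with missing values (assumed step)
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# Split into training (80%) and test (20%) sets, stratified on the outcome
set.seed(123)   # arbitrary seed, for reproducibility
training.samples <- createDataPartition(PimaIndiansDiabetes2$diabetes,
                                        p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data  <- PimaIndiansDiabetes2[-training.samples, ]

# Check the size of each set
dim(train.data)
dim(test.data)
```

createDataPartition() samples within each outcome class, so the proportion of positive and negative cases is roughly preserved in both sets.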
Computing logistic regression
Quick start R code
Simple logistic regression
The following R code builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration:
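A minimal sketch of such a model fit is shown here. The setup lines repeat an assumed data preparation (complete cases, 80/20 split, arbitrary seed) so that the snippet runs on its own; the object names pima and in.train are hypothetical, and the estimates you obtain depend on the split.

```r
library(caret)   # provides createDataPartition()

# Setup (assumed): complete cases and an 80/20 stratified split
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
set.seed(123)   # arbitrary seed
in.train <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train.data <- pima[in.train, ]

# Simple logistic regression: diabetes status explained by glucose alone
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)

# Coefficient table: estimate, standard error, z value and p-value
summary(model)$coef
```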
The output above shows the estimates of the regression beta coefficients and their significance levels. The intercept (b0) is -6.32 and the coefficient of the glucose variable is 0.043.
The logistic equation can be written as p = exp(-6.32 + 0.043*glucose)/ [1 + exp(-6.32 + 0.043*glucose)]. Using this formula, for each new plasma glucose concentration value, you can predict the probability of an individual being diabetes-positive.
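To make the formula concrete, the sketch below plugs the rounded coefficients quoted above into the logistic equation. The glucose values are hypothetical inputs chosen for illustration.

```r
# Coefficients as reported in the text (rounded)
b0 <- -6.32
b1 <- 0.043

# Hypothetical glucose concentrations to score
glucose <- c(80, 120, 160, 200)

# Logistic equation written out explicitly
p <- exp(b0 + b1 * glucose) / (1 + exp(b0 + b1 * glucose))
round(p, 3)

# plogis() is the built-in logistic function and gives the same values
all.equal(p, plogis(b0 + b1 * glucose))
```

Since the glucose coefficient is positive, the predicted probability rises with the glucose concentration.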
Predictions can be easily made using the function predict(). Use the option type = "response" to directly obtain the probabilities.
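The following sketch illustrates predict() on a simple glucose model. The setup repeats an assumed data preparation so that the snippet is self-contained; the two glucose values in newdata and the object names pima and in.train are hypothetical.

```r
library(caret)   # provides createDataPartition()

# Setup (assumed): complete cases and an 80/20 stratified split
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
set.seed(123)   # arbitrary seed
in.train <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train.data <- pima[in.train, ]

model <- glm(diabetes ~ glucose, data = train.data, family = binomial)

# Hypothetical new observations
newdata <- data.frame(glucose = c(20, 180))

# type = "response" returns probabilities of being diabetes-positive
probabilities <- predict(model, newdata, type = "response")
probabilities

# The default, type = "link", returns log-odds; plogis() converts them
log.odds <- predict(model, newdata, type = "link")
all.equal(plogis(log.odds), probabilities)
```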
# Plot the observed outcome (0/1) against glucose with the fitted logistic curve
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-positive"
  )
Multiple logistic regression
Here, we want to include all the predictor variables available in the data set. This is done using ~.:
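A sketch of the full model fit, under the same assumed data preparation as before (repeated so that the snippet runs on its own; pima, in.train and coefs are hypothetical names):

```r
library(caret)   # provides createDataPartition()

# Setup (assumed): complete cases and an 80/20 stratified split
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
set.seed(123)   # arbitrary seed
in.train <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train.data <- pima[in.train, ]

# "~ ." uses every other column of train.data as a predictor
model <- glm(diabetes ~ ., data = train.data, family = binomial)

# Coefficient table: one row for the intercept and one per predictor
coefs <- summary(model)$coef
coefs

# Rows with a p-value below 0.05
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]
```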
From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:
Estimate: the intercept (b0) and the beta coefficient estimates associated with each predictor variable
Std. Error: the standard error of the coefficient estimates. This reflects the precision of the estimates. The larger the standard error, the less confident we are about the estimate.
z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)
Pr(>|z|): the p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate is.
Note that the functions coef() and summary() can be used to extract only the coefficients, as follows:
coef(model)
summary(model)$coef
Interpretation
It can be seen that only 5 out of the 8 predictors are significantly associated with the outcome. These include: pregnant, glucose, pressure, mass and pedigree.
The coefficient estimate of the variable glucose is b = 0.045, which is positive. This means that an increase in glucose is associated with an increase in the probability of being diabetes-positive. However, the coefficient for the variable pressure is b = -0.007, which is negative. This means that an increase in blood pressure will be associated with a decreased probability of being diabetes-positive.
An important concept for interpreting the logistic beta coefficients is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).
For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.
If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) than when x is absent (x = 0).
For example, the regression coefficient for glucose is 0.042. This indicates that a one-unit increase in the glucose concentration will multiply the odds of being diabetes-positive by exp(0.042) ≈ 1.04.
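The arithmetic behind this odds ratio can be reproduced in base R. The sketch uses the coefficient value quoted in the text; the 10-unit increase is an added illustration, and the object names are hypothetical.

```r
# Coefficient for glucose as quoted in the text (log of the odds ratio)
b.glucose <- 0.042

# Odds ratio for a one-unit increase in glucose
odds.ratio <- exp(b.glucose)
round(odds.ratio, 2)   # 1.04

# A 10-unit increase multiplies the odds by exp(10 * b), i.e. odds.ratio^10
exp(10 * b.glucose)

# For a fitted glm object, here hypothetically called `model`, the odds
# ratios of all predictors would be obtained with: exp(coef(model))
```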
From the logistic regression results, it can be noticed that some variables (triceps, insulin and age) are not statistically significant. Keeping them in the model may contribute to overfitting. Therefore, they should be eliminated. This can be done automatically using statistical techniques, including stepwise regression and penalized regression methods. These methods are described in the next section. Briefly, they consist of selecting an optimal model with a reduced set of variables, without compromising the model accuracy.
Here, as we have a small number of predictors (n = 8), we can manually select the most significant ones:
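A sketch of the reduced model, keeping the five predictors named as significant in the Interpretation section. The setup repeats an assumed data preparation so that the snippet is self-contained; pima and in.train are hypothetical names.

```r
library(caret)   # provides createDataPartition()

# Setup (assumed): complete cases and an 80/20 stratified split
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
set.seed(123)   # arbitrary seed
in.train <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train.data <- pima[in.train, ]

# Refit with the manually selected predictors only
model <- glm(diabetes ~ pregnant + glucose + pressure + mass + pedigree,
             data = train.data, family = binomial)
summary(model)$coef
```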
Making predictions
We’ll make predictions using the test data in order to evaluate the performance of our logistic regression model.
The R function predict() can be used to predict the probability of being diabetes-positive, given the predictor values.
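A sketch of this prediction step on the test set follows. It assumes the reduced model and a data preparation with an arbitrary seed, so the probabilities it returns depend on the split and need not match any values printed in this chapter; pima and in.train are hypothetical names.

```r
library(caret)   # provides createDataPartition()

# Setup (assumed): complete cases and an 80/20 stratified split
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
set.seed(123)   # arbitrary seed
in.train <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train.data <- pima[in.train, ]
test.data  <- pima[-in.train, ]

model <- glm(diabetes ~ pregnant + glucose + pressure + mass + pedigree,
             data = train.data, family = binomial)

# Predicted probability of being diabetes-positive for each test observation
probabilities <- predict(model, test.data, type = "response")
head(probabilities)
```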
## 21 25 28 29 32 36
## 0.3914 0.6706 0.0501 0.5735 0.6444 0.1494
Which classes do these probabilities refer to? In our example, the output is the probability that the diabetes test will be positive. We know that these values correspond to the probability of the test being positive, rather than negative, because the contrasts() function indicates that R has created a dummy variable with a 1 for "pos" and a 0 for "neg". The probabilities always refer to the class dummy-coded as "1".
contrasts(test.data$diabetes)
## pos
## neg 0
## pos 1
The following R code categorizes individuals into two groups based on their predicted probabilities (p) of being diabetes-positive. Individuals with p above 0.5, the default threshold, are classified as diabetes-positive.
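Using the six predicted probabilities printed above as input, the thresholding step can be sketched in base R as follows. The stricter threshold of 0.6 in the last line is a hypothetical value, added only to show how the cut-off changes the classification.

```r
# Predicted probabilities for the six test observations shown above
probabilities <- c("21" = 0.3914, "25" = 0.6706, "28" = 0.0501,
                   "29" = 0.5735, "32" = 0.6444, "36" = 0.1494)

# Classify as "pos" when the probability exceeds the 0.5 threshold
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes

# A stricter, hypothetical threshold changes some of the calls
ifelse(probabilities > 0.6, "pos", "neg")
```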
## 21 25 28 29 32 36
## "neg" "pos" "neg" "pos" "pos" "neg"
Assessing model accuracy
mean(predicted.classes == test.data$diabetes)
## [1] 0.756
The classification prediction accuracy is about 76%, which is good. The misclassification error rate is 24%.
Note that there are several metrics for evaluating the performance of a classification model (Chapter @ref(classification-model-evaluation)).
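As an illustration of accuracy, error rate and two of the other common metrics, the base R sketch below works on a small set of hypothetical observed and predicted classes. These six labels are made up for the example and are not taken from the test data.

```r
# Hypothetical observed and predicted classes for six individuals
observed  <- factor(c("neg", "pos", "neg", "neg", "pos", "neg"),
                    levels = c("neg", "pos"))
predicted <- factor(c("neg", "pos", "neg", "pos", "pos", "neg"),
                    levels = c("neg", "pos"))

# Accuracy: proportion of correctly classified observations
accuracy <- mean(predicted == observed)
accuracy

# Misclassification error rate
error.rate <- mean(predicted != observed)

# Confusion matrix: rows are predictions, columns are observed classes
conf.mat <- table(Predicted = predicted, Observed = observed)
conf.mat

# Sensitivity and specificity, treating "pos" as the positive class
sensitivity <- conf.mat["pos", "pos"] / sum(conf.mat[, "pos"])
specificity <- conf.mat["neg", "neg"] / sum(conf.mat[, "neg"])
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why the confusion matrix is worth inspecting alongside it.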
Discussion
In this chapter, we have described how logistic regression works and we have provided R code to compute logistic regression. Additionally, we demonstrated how to make predictions and to assess the model accuracy. Logistic regression model output is very easy to interpret compared to other classification methods. Additionally, because of its simplicity, it is less prone to overfitting than flexible methods such as decision trees.
Note that many concepts for linear regression hold true for logistic regression modeling. For example, you need to perform some diagnostics (Chapter @ref(logistic-regression-assumptions-and-diagnostics)) to make sure that the assumptions made by the model are met for your data.
Furthermore, you need to measure how good the model is at predicting the outcome of new test data observations. Here, we described how to compute the raw classification accuracy, but note that other important performance metrics exist (Chapter @ref(classification-model-evaluation)).
When you have many predictors, you can select, without compromising the prediction accuracy, a minimal list of predictor variables that contribute the most to the model using stepwise regression (Chapter @ref(stepwise-logistic-regression)) and lasso regression techniques (Chapter @ref(penalized-logistic-regression)).
Additionally, you can add interaction terms to the model, or include spline terms.
The same problems concerning confounding and correlated variables apply to logistic regression (see Chapter @ref(confounding-variables) and
@ref(multicollinearity)).
You can also fit generalized additive models (Chapter @ref(polynomial-and-spline-regression)) when linearity of the predictor cannot be assumed. This can be done using the mgcv package:
library("mgcv")
# Fit the model
gam.model <- gam(diabetes ~ s(glucose) + mass + pregnant,
                 data = train.data, family = "binomial")
# Summarize model
summary(gam.model)
# Make predictions
probabilities <- gam.model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
Logistic regression is limited to two-class classification problems. There is an extension, called multinomial logistic regression, for multiclass classification problems (Chapter @ref(multinomial-logistic-regression)).
Note that the most popular method for multiclass tasks is Linear Discriminant Analysis (Chapter @ref(discriminant-analysis)).
References
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated.