
Logistic Regression Playbook

1. Theory
2. Example
3. Interpretation

Author: Dr. Mathias Jesussek
©DATAtab e.U. | Graz | 2023
What is a regression?

A regression analysis is a method for modeling relationships between variables. It makes it possible to infer or predict a variable based on one or more other variables. The variable we want to infer or predict is called the dependent variable or criterion.

What is the difference between a linear regression and a logistic regression?

In a linear regression, the dependent variable is a metric variable, e.g. salary or electricity consumption. In a logistic regression, the dependent variable is a dichotomous variable.
What is a dichotomous variable?

Dichotomous variables are variables with only two values. For example: whether a person buys or does not buy a particular product, or whether a disease is present or not.

The variables we use for prediction are called independent variables or predictors.
How can logistic regression be used?

With the help of logistic regression, we can determine what has an influence on whether a certain disease is present or not. We could study the influence of age, gender and smoking status on that particular disease. Our data set might look like this:

Age | Gender | Smoker status | Disease
22  | female | Non-smoker    | 1
25  | female | Smoker        | 1
18  | male   | Smoker        | 0
45  | male   | Non-smoker    | 0
12  | female | Smoker        | 0
43  | male   | Smoker        | 1
23  | male   | Smoker        | 0
33  | male   | Smoker        | 1
…   | …      | …             | …

Here, age, gender and smoker status are the independent variables, and disease is the dependent variable, coded with 0 and 1. In this case 0 stands for not diseased and 1 for diseased.

We could now investigate what influence the independent variables have on the disease. If there is an influence, then we can predict how likely a person is to have a certain disease.
Now, of course, the question arises: why do we need logistic regression in this case? Why can't we just use linear regression?

A quick recap: in linear regression, this is our regression equation:

y = b0 + b1·x1 + b2·x2 + … + bk·xk

We have the dependent variable y, the independent variables x1 to xk, and the regression coefficients b0 to bk.

However, we now have a dependent variable that is either 0 or 1. No matter which value we have for the independent variables, only 0 or 1 results. A linear regression would now simply put a straight line through the points, and in the case of linear regression, values between plus and minus infinity can occur.

However, the goal of logistic regression is to estimate the probability of occurrence, i.e. the probability for the occurrence of the characteristic 1 (= characteristic present). The value range for the prediction should therefore be between 0 and 1.

So we need a function that only takes values between 0 and 1, and that is exactly what the logistic function does: no matter where we are on the x-axis, between minus and plus infinity, only values between 0 and 1 result. And that is exactly what we want!

The equation for the logistic function looks like this:

f(z) = 1 / (1 + e^(−z))

The logistic function is now used by the logistic regression: for z, the equation of the linear regression is simply inserted. Thus, the probability that the dependent variable is 1 is given by:

P(y = 1) = 1 / (1 + e^−(b0 + b1·x1 + … + bk·xk))

What does this look like for our example? In our example, the probability of having a certain disease is a function of age, gender and smoking status.
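As a small sketch of this transformation in Python (the coefficient values below are made-up placeholders, not estimates from the example data):

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(age, gender, smoker, b0=-2.0, b1=0.05, b2=0.5, b3=1.0):
    """P(disease = 1) with z = b0 + b1*age + b2*gender + b3*smoker.

    The coefficients b0..b3 are purely illustrative placeholders."""
    z = b0 + b1 * age + b2 * gender + b3 * smoker
    return logistic(z)

print(logistic(0))                 # 0.5: the curve crosses 1/2 at z = 0
print(predict_probability(55, 0, 1))
```

Whatever values the linear part z takes, the output always stays strictly between 0 and 1, which is exactly the property motivated above.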



Now we need to determine the coefficients so that our model best represents the given data. To solve this problem, the so-called maximum likelihood method is used. For this purpose, there are good numerical methods that can solve the problem efficiently.
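The maximum likelihood idea can be sketched with a toy gradient-ascent fit; this is only a stand-in for the efficient numerical optimizers mentioned above, and the data is made up:

```python
import math

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Fit coefficients by maximizing the log-likelihood with plain
    gradient ascent (a toy stand-in for production optimizers)."""
    n_features = len(X[0])
    b = [0.0] * (n_features + 1)           # intercept + one weight per feature
    for _ in range(steps):
        grad = [0.0] * len(b)
        for xi, yi in zip(X, y):
            z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p                    # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        b = [bj + lr * g / len(y) for bj, g in zip(b, grad)]
    return b

# Tiny made-up data: one feature that clearly separates the classes.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
b = fit_logistic(X, y)
p_low  = 1.0 / (1.0 + math.exp(-(b[0] + b[1] * 0.0)))
p_high = 1.0 / (1.0 + math.exp(-(b[0] + b[1] * 5.0)))
print(p_low, p_high)   # low probability at x = 0, high at x = 5
```

The fit pushes the coefficients so that observed 1s get high predicted probabilities and observed 0s get low ones, which is exactly what "best represents the given data" means here.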
But how do you interpret the results of a logistic regression?

Let's take a look at this fictitious example:

Age | Gender | Smoker status | Disease
22  | female | Non-smoker    | 1
25  | female | Smoker        | 1
18  | male   | Smoker        | 0
45  | male   | Non-smoker    | 0
12  | female | Smoker        | 0
43  | male   | Smoker        | 1
23  | male   | Smoker        | 0
33  | male   | Smoker        | 1
…   | …      | …             | …

If you like, you can download the example dataset for free and follow the steps in parallel. Please just use this link, or load it from the logistic regression tutorial. When you use the link, the data is automatically loaded.

We want to calculate a logistic regression, so we just click on Regression. When we copy our data in, the variables show up below.

Depending on how your dependent variable is scaled, DATAtab will calculate either a logistic or a linear regression under the tab Regression. We choose disease as the dependent variable and age, gender, and smoking status as the independent variables. DATAtab now calculates a logistic regression for us.
If you don't know how to interpret the results, you can click on the corresponding option in DATAtab. We will now go through all the tables slowly and understandably. Let's start at the top.

The first thing that is displayed is the results table. In the results table you can see that a total of 36 people were examined. With the help of the calculated regression model, 26 of 36 persons could be correctly assigned. That is 72.22%!

Then comes the classification table. Here you can see how often the categories not diseased and diseased were observed, and how often they were predicted.

In total, "not diseased" was observed 16 times. Of these 16 individuals, the regression model correctly scored 11 as not diseased and incorrectly scored 5 as diseased. Of the 20 diseased individuals, 15 were correctly scored as diseased and 5 were incorrectly scored as not diseased.

To be noted: for deciding whether a person is diseased or not, a threshold of 50% is used. If the regression model estimates a probability greater than 50%, the person is assigned "diseased", otherwise "not diseased".
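A classification table like this one can be sketched as follows; the observed outcomes and predicted probabilities below are made up for illustration and do not reproduce the 36-person example:

```python
def classification_table(observed, probs, threshold=0.5):
    """Cross-tabulate observed 0/1 outcomes against predictions made
    with the given probability threshold. Keys are (observed, predicted)."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for obs, p in zip(observed, probs):
        pred = 1 if p > threshold else 0
        table[(obs, pred)] += 1
    return table

# Made-up observations and predicted probabilities.
observed = [0, 0, 0, 1, 1, 1]
probs    = [0.2, 0.4, 0.7, 0.9, 0.6, 0.3]
t = classification_table(observed, probs)
correct = t[(0, 0)] + t[(1, 1)]
print(t)
print(f"accuracy: {correct / len(observed):.2%}")
```

The diagonal cells (0, 0) and (1, 1) are the correctly assigned persons; their share of the total is the percentage reported in the results table.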
Now comes the Chi² test. Here we can read whether the model as a whole is significant or not.

Two models are compared for this purpose: in one model all independent variables are used, and in the other model the independent variables are not used. With the help of the Chi² test we compare how good the prediction is when the independent variables are used and how good it is when they are not used, and the Chi² test "tells us" if there is a significant difference between these two results.

The null hypothesis is that both models are the same. If the p-value is less than 0.05, this null hypothesis is rejected. In our example, the p-value is less than 0.05 and we assume that there is a significant difference between the models. Thus, the model as a whole is significant.
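A minimal sketch of this model comparison, using hypothetical predicted probabilities for six people (computing the p-value itself would additionally require the Chi² distribution, e.g. `scipy.stats.chi2.sf`):

```python
import math

def log_likelihood(y, probs):
    """Log-likelihood of observed 0/1 outcomes under predicted probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, probs))

y = [0, 0, 0, 1, 1, 1]

# Null model: every person just gets the overall disease rate (3/6 = 0.5).
ll_null = log_likelihood(y, [0.5] * len(y))

# Full model: hypothetical fitted probabilities for the same six people.
probs_full = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ll_full = log_likelihood(y, probs_full)

# The Chi^2 statistic is the drop in -2 log-likelihood between the models.
chi2 = -2.0 * (ll_null - ll_full)
print(chi2)
```

The better the independent variables predict the outcome, the larger this statistic gets, and the smaller the resulting p-value.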
Next comes the model summary. In this table we see on the one hand the -2 log-likelihood value, and on the other hand we are given different coefficients of determination R².

R² is used to find out how well the regression model explains the dependent variable. In a linear regression, the R² indicates the proportion of the variance that can be explained by the independent variables. The more variance can be explained, the better the regression model.

In the case of logistic regression, however, the meaning is different and there are different ways to calculate the R². Unfortunately, there is also no agreement yet on which way is the "best" way. DATAtab gives you the R² according to Cox and Snell, according to Nagelkerke, and according to McFadden.
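The three pseudo-R² variants can be sketched from the log-likelihoods of the null and full model; the numbers below are hypothetical, not taken from the example output:

```python
import math

# Hypothetical log-likelihoods (LL) of the two models, for n = 6 observations.
n = 6
ll_null = -4.158883   # intercept-only model
ll_full = -1.370349   # model with all independent variables

# McFadden: share of the null log-likelihood "explained" by the predictors.
r2_mcfadden = 1.0 - ll_full / ll_null

# Cox & Snell: based on the likelihood ratio, but cannot reach 1.
r2_cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n)

# Nagelkerke: Cox & Snell rescaled so the maximum possible value is 1.
r2_nagelkerke = r2_cox_snell / (1.0 - math.exp(2.0 * ll_null / n))

print(r2_mcfadden, r2_cox_snell, r2_nagelkerke)
```

All three move toward 1 as the full model fits better than the intercept-only model, but their numeric values differ, which is why DATAtab reports them side by side.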
And now comes the most important table: the table with the model coefficients. The most important parameters are the coefficient B, the p-value and the odds ratio.

Coefficients B
In the first column we can read the calculated coefficients from our model. We can insert these into the regression equation; doing so gives us a concrete equation with which we can calculate the probability that a person is diseased.

Example:
We want to know how likely a person who is 55 years old, female, and a smoker is to be diseased. We insert: 55 for the age, 0 because the person is female, and 1 because the person is a smoker. This gives us 0.69, or 69%. Thus, it is 69% likely that a 55-year-old female smoker is diseased.

Based on this prediction, it could now be decided whether to do another extensive investigation. The example is purely fictitious; in reality, there would certainly be many other and different independent variables.
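The worked calculation above can be sketched as follows; since the fitted coefficients from the example are not reproduced here, the values of b0 to b3 are placeholders, and the resulting probability will not match the 69% from the text:

```python
import math

# Placeholder coefficients -- NOT the ones fitted in the example.
b0, b_age, b_gender, b_smoker = -3.0, 0.05, 0.4, 0.8

# 55 years old, female (gender = 0), smoker (smoker = 1):
z = b0 + b_age * 55 + b_gender * 0 + b_smoker * 1
p = 1.0 / (1.0 + math.exp(-z))
print(f"P(diseased) = {p:.2f}")
```

With the real coefficients from the output table inserted for b0 to b3, this computation reproduces the predicted probability DATAtab reports.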


But now back to the table!

p-value
In this column we can read whether the coefficient is significantly different from zero. The following null hypothesis is tested: the coefficient is zero in the population. So, if the p-value is smaller than 0.05, the respective coefficient has a significant influence. In our example, we see that none of the coefficients have a significant impact, as all p-values are greater than 0.05.

Odds ratio
In this column we can read the odds ratio. For example, an odds ratio of 1.04 for age means that a one-unit increase in age increases the odds that a person is diseased by a factor of 1.04.
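The odds ratio is just e raised to the coefficient B, as this small sketch shows (the coefficient value is hypothetical, chosen to match the 1.04 from the text):

```python
import math

b_age = 0.0392                 # hypothetical coefficient B for age
odds_ratio = math.exp(b_age)   # odds ratio = e^B
print(round(odds_ratio, 2))

# Conversely, an odds ratio of 1.04 corresponds to B = ln(1.04).
print(round(math.log(1.04), 4))
```

A coefficient B of 0 therefore corresponds to an odds ratio of exactly 1, i.e. no influence, which is why the p-value column tests whether B differs from zero.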
If you liked this Playbook, feel free to share it! Of course, we are also happy if you visit us on datatab.net.
