Logistic Regression Cheat Sheet
Logistic Regression:
Logistic regression is basically the version of regression we use when the response variable
is binary (usually coded 0 or 1). Usually, when we do linear regression (like we did in your
assignment), we are trying to predict the values of a continuous dependent variable (like age,
income, or something else) from several other variables, continuous or not. However, when
the dependent variable is binary, we cannot run our usual linear regression. There are many
mathematical reasons for not doing so. I don't really want to go into details (the book I've
sent does); suffice it to say that it violates the assumption of normality (values consisting
only of 1s and 0s will never be normally distributed) and the homogeneity of variance
assumption. The reason for the latter has to do with the distribution of the actual scores: if
you plot your 1s and 0s against a continuous variable, you basically get two rows of values,
one at zero and one at one, because there are no other values.
It also poses a problem that if you fit a straight line to such a dataset, the regression line will
predict probabilities of belonging to group 1 or group 0 that fall both below 0 and above 1.
This is clearly nonsensical, as probabilities can never be smaller than 0 or bigger than 1.
These details are not really necessary to know; what's important is that we simply cannot use
linear regression when the dependent variable is binary, end of story.
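If you want to see this for yourself, here is a minimal sketch in Python (the data are made up
for illustration, this is not the toy dataset from the SPSS example) that fits an ordinary
least-squares line to a binary outcome and shows the fitted line predicting "probabilities"
outside the 0-1 range:

import numpy as np

# Made-up data: a continuous predictor and a binary outcome.
# Low x values tend to be 0, high x values tend to be 1.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Fit an ordinary least-squares straight line: y = b0 + b1 * x.
# np.polyfit returns coefficients from highest degree down.
b1, b0 = np.polyfit(x, y, deg=1)

# Predict for x values a bit outside the observed range.
for x_new in (0.0, 5.5, 12.0):
    print(f"x = {x_new:5.1f} -> predicted 'probability' = {b0 + b1 * x_new:.3f}")

# The predictions at the extremes fall below 0 and above 1,
# which makes no sense for a probability.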
What we do instead is take the log-odds of our data and model the log-odds as a linear
function of our independent variables. This sounds complicated, so let's break it down a bit.
"Odds" is basically just a concept related to probability. If you flip a coin 10 times and get 4
heads, the probability of a head occurring is 4/10 = 0.4. The odds divide the number of
successes not by the number of trials, but by the number of failures. So the ODDS of getting
a head in the same example would be 4/6 = 0.6666. You can read this as "for every failure,
we have 0.6666 successes". Then we take the natural logarithm of the odds, and we get
something called the "logit". The logit has many advantages over using the raw data: it
basically linearizes the relationship, and removes the floor (0) and ceiling (1) we would have
if we used raw probabilities.
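As a quick numeric illustration (a Python sketch, not part of the SPSS workflow), here is the
coin example worked through the probability -> odds -> logit chain, plus a few other
probabilities to show how the logit removes the 0-1 bounds:

import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# The coin example: 4 heads out of 10 flips.
p = 4 / 10      # probability of heads = 0.4
odds = 4 / 6    # successes per failure = 0.6666...
print(f"probability = {p:.4f}, odds = {odds:.4f}, logit = {logit(p):.4f}")

# Probabilities are stuck between 0 and 1, but their logits range
# over the whole real line -- no floor at 0, no ceiling at 1.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p = {p:4.2f} -> odds = {p / (1 - p):7.4f} -> logit = {logit(p):7.4f}")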
I have made a small toy dataset to illustrate how to interpret stuff in SPSS. We are basically
looking at what factors influence whether someone is obese or not.
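If you ever want to cross-check the SPSS output, here is a minimal sketch of the same kind of
model fitted in Python with statsmodels. The variable names and data below are invented
stand-ins for illustration; they are not the actual toy dataset:

import numpy as np
import statsmodels.api as sm

# Made-up stand-in data: predicting obesity (1/0) from, say,
# weekly exercise hours and daily calorie intake.
rng = np.random.default_rng(42)
n = 200
exercise = rng.uniform(0, 10, n)        # hours per week
calories = rng.normal(2500, 400, n)     # kcal per day

# An invented "true" model on the log-odds scale, used only to
# simulate plausible 0/1 outcomes.
log_odds = -8 + 0.003 * calories - 0.4 * exercise
p = 1 / (1 + np.exp(-log_odds))
obese = rng.binomial(1, p)

# Fit: logit(P(obese)) = b0 + b1*exercise + b2*calories
X = sm.add_constant(np.column_stack([exercise, calories]))
model = sm.Logit(obese, X).fit()
print(model.summary())

# The coefficients are on the log-odds scale; np.exp(model.params)
# gives odds ratios, which is what SPSS reports in its Exp(B) column.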