Basic Concepts of Logistic Regression
The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

$$\ln\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$

where $\pi$ = P(y = 1).
Definition 1: Where p has a value 0 ≤ p ≤ 1 (i.e. p is a probability value), we can define the odds function as

$$\text{Odds}(p) = \frac{p}{1-p}$$

For an event E, we set p = P(E), and so Odds(E) = P(E)/(1 − P(E)).
Observation: For our purposes, the odds function has the advantage of transforming the probability function, which has values from 0 to 1, into an equivalent function with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from −∞ to +∞.
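To make the two transformations concrete, here is a minimal Python sketch (the code and its function names are illustrative, not part of the original exposition): it shows the odds function carrying (0, 1) to (0, ∞) and the log of the odds carrying it to (−∞, +∞).

```python
import math

def odds(p):
    """Odds function: maps a probability in (0, 1) to (0, infinity)."""
    return p / (1 - p)

def log_odds(p):
    """Log of the odds: maps a probability in (0, 1) to (-inf, +inf)."""
    return math.log(odds(p))

# As p sweeps from near 0 to near 1, odds(p) grows from near 0 toward
# infinity, while log_odds(p) is symmetric about p = 0.5.
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"p = {p:4.2f}   odds = {odds(p):7.3f}   ln odds = {log_odds(p):7.3f}")
```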
Definition 2: The logit function is the log of the odds function, namely logit(E) = ln Odds(E), or

$$\text{logit}(p) = \ln\frac{p}{1-p} = b_0 + b_1 x_1 + \cdots + b_k x_k$$

and so

$$p = \frac{e^{b_0+b_1x_1+\cdots+b_kx_k}}{1+e^{b_0+b_1x_1+\cdots+b_kx_k}} = \frac{1}{1+e^{-(b_0+b_1x_1+\cdots+b_kx_k)}}$$
Here we switch to the model based on the observed sample (and so the parameter π is replaced by its sample estimate p, the βj coefficients are replaced by the sample estimates bj, and the error term is dropped). For our purposes we take the event E to be that the dependent variable y has value 1. If y takes only the values 0 or 1, we can think of E as success and the complement E′ of E as failure, just as for the trials in a binomial distribution.
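Since the logit and the expression for p above are inverses of each other, solving logit(p) = b0 + b1x1 + ⋯ + bkxk for p yields the second formula of Definition 2. A small Python sketch of this relationship (the function names logit and logistic are ours):

```python
import math

def logit(p):
    """logit(p) = ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse of the logit: p = 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

z = 1.7                            # an arbitrary value of b0 + b1*x1 + ... + bk*xk
p = logistic(z)                    # the probability the model assigns
assert abs(logit(p) - z) < 1e-12   # logit undoes logistic
print(round(p, 4))                 # 0.8455
```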
Just as for the regression model studied in Regression and Multiple Regression, a sample consists of n data elements of the form (yi, xi1, …, xik), but for logistic regression each yi only takes the value 0 or 1. Now let Ei = the event that yi = 1 and pi = P(Ei).

Definition 3: Just as the regression line studied previously provides a way to predict the value of the dependent variable y from the values of the independent variables x1, …, xk, for logistic regression we have

$$p_i = \frac{1}{1+e^{-(b_0+b_1x_{i1}+\cdots+b_kx_{ik})}}$$

Note too that since the yi have a proportion distribution, by Property 2 of Proportion Distribution, var(yi) = pi(1 − pi).
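As an illustration, the following sketch computes pi and var(yi) for a single data element; the coefficients and x values here are hypothetical, chosen only to show the arithmetic.

```python
import math

def predict_p(b, x):
    """p_i = 1 / (1 + e^-(b0 + b1*xi1 + ... + bk*xik)) for one data element."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    return 1 / (1 + math.exp(-z))

b = [-1.0, 0.8, 0.3]    # hypothetical coefficients b0, b1, b2
x = [1.5, 2.0]          # one data element's independent variables xi1, xi2
p = predict_p(b, x)
var_y = p * (1 - p)     # var(yi) = pi(1 - pi) for a proportion distribution
print(round(p, 4), round(var_y, 4))
```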
Observation: In the case where k = 1, we have

$$p = \frac{e^{b_0+b_1x}}{1+e^{b_0+b_1x}} = \frac{1}{1+e^{-(b_0+b_1x)}}$$
Observation: Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:
1. The assumption of the linear regression model that the values of y are normally distributed
cannot be met since y only takes the values 0 and 1.
2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1 − p), when 50 percent of the sample consists of 1s the variance is .25, its maximum value. As we move to more extreme values, the variance decreases: when p = .10 or .90, the variance is (.1)(.9) = .09, and as p approaches 0 or 1, the variance approaches 0 (see the sketch after this list).
3. Using the linear regression model, the predicted values can become greater than one or less than zero if you move far enough along the x-axis. Such values are theoretically inadmissible for probabilities.
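The following quick numerical check illustrates points 2 and 3; the straight-line coefficients in the second part are hypothetical.

```python
# Point 2: the variance p(1 - p) peaks at p = .5 and shrinks toward 0 and 1.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p}  var = {p * (1 - p):.2f}")   # .09 .21 .25 .21 .09

# Point 3: a straight line fitted to probabilities eventually leaves [0, 1].
a, slope = 0.2, 0.05                             # hypothetical intercept and slope
for x in [0, 10, 20]:
    print(f"x = {x:2d}  linear prediction = {a + slope * x:.2f}")  # 0.20 0.70 1.20
```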
For the logistic model, the least squares approach to calculating the values of the coefficients bi cannot be used; instead, the maximum likelihood techniques described below are employed to find these values.
Definition 4: The odds ratio between two data elements in the sample is defined as follows:

$$OR = \frac{\text{Odds}(E_1)}{\text{Odds}(E_2)} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$

Using the notation px = P(x), i.e. the probability that y = 1 when the independent variable takes the value x, in the case where k = 1 the log odds ratio between two elements whose x values differ by 1 is

$$\ln OR = \text{logit}(p_{x+1}) - \text{logit}(p_x) = (b_0 + b_1(x+1)) - (b_0 + b_1x) = b_1$$

Thus,

$$OR = e^{b_1}$$
E.g. when x = 0 for male and x = 1 for female, $e^{b_1}$ represents the odds ratio between females and males. If, for example, b1 = 2 and we are measuring the probability of getting cancer under certain conditions, then $e^{b_1} = e^2 \approx 7.4$, which would mean that the odds of females getting cancer would be 7.4 times greater than the odds for males under the same conditions.
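The arithmetic of the example can be confirmed in one line (b1 = 2 is the value assumed above):

```python
import math

b1 = 2.0                     # the coefficient from the example above
odds_ratio = math.exp(b1)    # OR = e^b1
print(round(odds_ratio, 1))  # 7.4
```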
Observation: The model we will use is based on the binomial distribution, namely that the probability that the sample data occurs as it does is given by

$$P = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$$
Taking the natural log of both sides and simplifying we get the following definition.
Definition 5: The log-likelihood statistic is defined as follows:

$$LL = \ln P = \sum_{i=1}^{n} \left[\, y_i \ln p_i + (1-y_i)\ln(1-p_i) \,\right]$$

where the yi are the observed values and the pi are the corresponding theoretical values.
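Definition 5 translates directly into Python; the observed values y and theoretical probabilities p below are hypothetical, used only to exercise the formula.

```python
import math

def log_likelihood(y, p):
    """LL = sum over the sample of yi*ln(pi) + (1 - yi)*ln(1 - pi)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1, 0]             # hypothetical observed values
p = [0.8, 0.3, 0.6, 0.9, 0.2]   # hypothetical theoretical probabilities
print(log_likelihood(y, p))     # about -1.419
```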
Observation: Our objective is to find the maximum value of LL, assuming that the pi are as in Definition 3. This will enable us to find the values of the bi coefficients. It might be helpful to review Maximum Likelihood Function to better understand the rest of this topic.
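As a preview of what this maximization involves, here is a sketch using scipy.optimize.minimize on a small hypothetical sample (the data and starting values are ours; the rest of this section instead uses Excel's Solver and Newton's method). Since optimizers minimize, we maximize LL by minimizing −LL.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample: one independent variable x and binary outcomes y.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1  ])

def neg_log_likelihood(b):
    """-LL for coefficients b = (b0, b1), with pi as in Definition 3."""
    z = b[0] + b[1] * x
    p = 1 / (1 + np.exp(-z))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(result.x)     # estimated b0, b1
print(-result.fun)  # maximum value of LL
```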
Example 1: A study was made of a sample of 760 people who received doses of radiation between 0 and 1,000 rems following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0–100, 100–200, etc.).
Let Ei = the event that a person in the ith interval survived. The table also shows the probability P(Ei) and odds Odds(Ei) of survival for a person in each interval. Note that P(Ei) = the percentage of people in interval i who survived and

$$\text{Odds}(E_i) = \frac{P(E_i)}{1-P(E_i)}$$
In Figure 3 we plot the values of P(Ei) vs. i and ln Odds(Ei) vs. i. We see that the second of these
plots is reasonably linear.
Given that there is only one independent variable (namely x = # of rems), we can use the following model:

$$\ln\frac{p}{1-p} = a + bx$$

Here we use the coefficients a and b instead of b0 and b1 just to keep the notation simple.
We show two different methods for finding the values of the coefficients a and b. The first uses Excel's Solver tool and the second uses Newton's method. Before proceeding, it might be worthwhile to click on Goal Seeking and Solver to review how to use Excel's Solver tool, and on Newton's Method to review how to apply Newton's method. We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.
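For readers who prefer code to spreadsheets, here is a sketch of how Newton's method can be applied to this maximization. The data below are hypothetical stand-ins, not the Figure 2 radiation data, and the iteration is the standard Newton step beta ← beta − H⁻¹g, where g and H are the gradient and Hessian of LL for the model ln(p/(1−p)) = a + bx.

```python
import numpy as np

# Hypothetical data for the one-variable model ln(p/(1-p)) = a + b*x.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1  ])
X = np.column_stack([np.ones_like(x), x])       # design matrix: 1s column, then x

beta = np.zeros(2)                              # starting values a = b = 0
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))             # current fitted probabilities
    grad = X.T @ (y - p)                        # gradient of LL
    hess = -X.T @ (X * (p * (1 - p))[:, None])  # Hessian of LL
    step = np.linalg.solve(hess, grad)
    beta = beta - step                          # Newton update
    if np.max(np.abs(step)) < 1e-10:            # stop when the step is negligible
        break

a, b = beta
print(a, b)
```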
Sample Size: The recommended minimum sample size for logistic regression is given by 10k/q, where k = the number of independent variables and q = the smaller of the percentage of cases with y = 0 or y = 1, subject to an overall minimum of 100.
For Example 1, k = 1 and q = 302/760 = .397, and so 10k/q = 25.17. Thus a minimum sample of
size 100 is recommended.
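The rule of thumb translates directly into code (the function name is ours):

```python
import math

def min_sample_size(k, q):
    """Recommended minimum sample size: 10k/q, but never less than 100."""
    return max(100, math.ceil(10 * k / q))

# Example 1: k = 1 independent variable, q = 302/760 of cases with y = 0.
print(min_sample_size(1, 302 / 760))   # 100, since 10k/q is only about 25
```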