Basic Concepts of Logistic Regression

Logistic regression predicts the probability of an event occurring (such as a "yes/no" outcome) from one or more independent variables. It transforms the probability using the logit function (the log of the odds), whose range is unrestricted: the odds take values between 0 and infinity, and the log of the odds takes values from negative to positive infinity. The logistic regression model is fitted to the data using maximum likelihood estimation rather than ordinary least squares regression, since the assumptions of ordinary regression, such as linearity and equal variance, are violated by binary outcome data.

Basic Concepts of Logistic Regression

The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

$$\ln \text{Odds}(E) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$

where the odds function is as given in the following definition.


Definition 1: Odds(E) is the odds that event E occurs, namely

$$\text{Odds}(E) = \frac{P(E)}{P(E')} = \frac{P(E)}{1 - P(E)}$$

where E′ is the complement of E. Where p has a value 0 ≤ p ≤ 1 (i.e. p is a probability value), we can define the odds function as

$$\text{Odds}(p) = \frac{p}{1 - p}$$
Observation: For our purposes, the odds function has the advantage of transforming the probability function, which has values from 0 to 1, into an equivalent function with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from −∞ to +∞.
Definition 2: The logit function is the log of the odds function, namely logit(E) = ln Odds(E), or

$$\text{logit}(p) = \ln \frac{p}{1 - p}$$
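To make Definitions 1 and 2 concrete, here is a minimal Python sketch of the odds and logit functions (the function names are ours, chosen for illustration):

```python
import math

def odds(p: float) -> float:
    """Odds(p) = p / (1 - p); maps (0, 1) onto (0, infinity)."""
    return p / (1 - p)

def logit(p: float) -> float:
    """logit(p) = ln(p / (1 - p)); maps (0, 1) onto (-infinity, +infinity)."""
    return math.log(odds(p))

# For example, p = 0.75 gives odds of 3 (i.e. 3 to 1) and logit ln 3 ≈ 1.0986.
print(odds(0.75), logit(0.75))
```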
Definition 3: Based on the logistic model as described above, we have

$$\text{logit}(E) = \ln \frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

where π = P(E). It now follows that (see Exponentials and Logs):

$$\frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}$$

and so

$$\pi = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
$$
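As a minimal Python sketch of this last formula (the function names are illustrative, not from the referenced text):

```python
import math

def linear_predictor(b: list[float], x: list[float]) -> float:
    """b[0] is the intercept; x holds the k independent variable values."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def prob(b: list[float], x: list[float]) -> float:
    """pi = 1 / (1 + e^-(b0 + b1 x1 + ... + bk xk))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(b, x)))

# With b0 = -2 and b1 = 0.5, an observation with x = 4 has p = 0.5,
# since the linear predictor is exactly 0 there.
print(prob([-2.0, 0.5], [4.0]))  # 0.5
```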
Here we switch to the model based on the observed sample (and so the parameter π is replaced by its sample estimate p, the βj coefficients are replaced by the sample estimates bj, and the error term is dropped). For our purposes we take the event E to be that the dependent variable y has value 1. If y takes only the values 0 or 1, we can think of E as success and the complement E′ of E as failure, just as for the trials in a binomial distribution.
Just as for the regression model studied in Regression and Multiple Regression, a sample consists of n data elements of the form (yi, xi1, xi2, …, xik), but for logistic regression each yi only takes the value 0 or 1. Now let Ei = the event that yi = 1 and pi = P(Ei). Just as the regression line studied previously provides a way to predict the value of the dependent variable y from the values of the independent variables x1, …, xk, for logistic regression we have

$$p_i = \frac{1}{1 + e^{-(b_0 + b_1 x_{i1} + \cdots + b_k x_{ik})}}$$
Note too that since the yi have a proportion distribution, by Property 2 of Proportion Distribution, var(yi) = pi(1 − pi).
Observation: In the case where k = 1, we have

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
Such a curve has a sigmoid shape:

Figure 1 Sigmoid curve for p


The values of b0 and b1 determine the location, direction, and spread of the curve. The curve is symmetric about the point where x = −b0/b1; in fact, the value of p is 0.5 for this value of x.
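To see why p = 0.5 there, substitute x = −b0/b1 into the formula above:

$$p = \frac{1}{1 + e^{-(b_0 + b_1(-b_0/b_1))}} = \frac{1}{1 + e^{0}} = \frac{1}{2}$$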

Observation: Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:
1. The assumption of the linear regression model that the values of y are normally distributed cannot be met, since y only takes the values 0 and 1.
2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable, since the variance is p(1 − p). When 50 percent of the sample consists of 1s, the variance is .25, its maximum value. As we move to more extreme values, the variance decreases: when p = .10 or .90, the variance is (.1)(.9) = .09, and as p approaches 0 or 1, the variance approaches 0 (see the short check after this list).
3. Using the linear regression model, the predicted values will become greater than one and less than zero if you move far enough on the x-axis. Such values are theoretically inadmissible for probabilities.
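The variance values quoted in point 2 are easy to verify in Python:

```python
# var(y) = p(1 - p) for a binary variable with P(y = 1) = p.
for p in (0.5, 0.1, 0.9):
    print(p, p * (1 - p))   # 0.25, then 0.09 twice
```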
For the logistic model, the least squares approach to calculating the values of the coefficients bi cannot be used; instead, the maximum likelihood techniques described below are employed to find these values.
Definition 4: The odds ratio between two data elements in the sample is defined as follows:

$$OR(x_1, x_2) = \frac{\text{Odds}(x_1)}{\text{Odds}(x_2)}$$

Using the notation px = P(x), the log odds ratio of the estimates is defined as

$$\ln OR(x_1, x_2) = \text{logit}(p_{x_1}) - \text{logit}(p_{x_2}) = \ln \frac{p_{x_1}/(1 - p_{x_1})}{p_{x_2}/(1 - p_{x_2})}$$
Observation: In the case where k = 1,

$$\ln OR(x_1, x_2) = (b_0 + b_1 x_1) - (b_0 + b_1 x_2) = b_1(x_1 - x_2)$$

Thus,

$$OR(x_1, x_2) = e^{b_1(x_1 - x_2)}$$

Furthermore, for any value of d,

$$OR(x + d, x) = e^{b_1 d}$$

Note too that when x is a dichotomous variable,

$$OR(1, 0) = e^{b_1}$$
E.g. when x = 0 for male and x = 1 for female, then e^{b_1} represents the odds ratio of females to males. If, for example, b1 = 2, and we are measuring the probability of getting cancer under certain conditions, then e^{b_1} = e^2 ≈ 7.4, which would mean that the odds of females getting cancer would be 7.4 times greater than the odds for males under the same conditions.
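A quick Python check of this arithmetic:

```python
import math

b1 = 2.0
odds_ratio = math.exp(b1)      # OR(1, 0) = e^{b1}
print(round(odds_ratio, 1))    # 7.4
```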
Observation: The model we will use is based on the binomial distribution, namely the probability that the sample data occurs as it does is given by

$$\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Taking the natural log of both sides and simplifying, we get the following definition.
Definition 5: The log-likelihood statistic is defined as follows:

$$LL = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$$

where the yi are the observed values while the pi are the corresponding theoretical values.
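Here is a minimal Python sketch of the LL statistic (the example values are illustrative only):

```python
import math

def log_likelihood(y: list[int], p: list[float]) -> float:
    """LL = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Example: three observations with fitted probabilities .9, .2, .7.
print(log_likelihood([1, 0, 1], [0.9, 0.2, 0.7]))  # about -0.685
```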
Observation: Our objective is to find the maximum value of LL assuming that the pi are as in Definition 3. This will enable us to find the values of the bi coefficients. It might be helpful to review Maximum Likelihood Function to better understand the rest of this topic.
Example 1: A sample of 760 people who received doses of radiation between 0 and 1000 rems was taken following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0-100, 100-200, etc.).

Figure 2 Data for Example 1 plus probability and odds

Let Ei = the event that a person in the ith interval survived. The table also shows the probability P(Ei) and odds Odds(Ei) of survival for a person in each interval. Note that P(Ei) = the percentage of people in interval i who survived and

$$\text{Odds}(E_i) = \frac{P(E_i)}{1 - P(E_i)}$$
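As a hedged illustration of this per-interval computation (the counts below are hypothetical, not the Figure 2 values):

```python
import math

# Hypothetical counts for one dose interval: 80 people, 60 survived.
survived, total = 60, 80
p = survived / total          # P(Ei) = proportion who survived
odds = p / (1 - p)            # Odds(Ei) = P(Ei) / (1 - P(Ei))
log_odds = math.log(odds)     # ln Odds(Ei), the quantity plotted in Figure 3
print(p, odds, log_odds)      # 0.75, 3.0, ~1.099
```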
In Figure 3 we plot the values of P(Ei) vs. i and ln Odds(Ei) vs. i. We see that the second of these
plots is reasonably linear.

Figure 3 Plot of probability and ln odds

Given that there is only one independent variable (namely x = # of rems), we can use the following model:

$$\ln \frac{p}{1 - p} = a + bx$$
Here we use coefficients a and b instead of b0 and b1 just to keep the notation simple.
We show two different methods for finding the values of the coefficients a and b. The first uses Excel's Solver tool and the second uses Newton's method. Before proceeding, it might be worthwhile to click on Goal Seeking and Solver to review how to use Excel's Solver tool, and Newton's Method to review how to apply Newton's method. We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.
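The Excel steps are not reproduced here, but the following is a hedged Python sketch of the same maximization via Newton's method for the one-variable model ln(p/(1−p)) = a + bx; the tiny dataset at the end is hypothetical, not the Figure 2 data:

```python
import math

def newton_logistic(x, y, steps=25):
    a, b = 0.0, 0.0                           # starting guesses
    for _ in range(steps):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of LL with respect to (a, b)
        ga = sum(yi - pi for yi, pi in zip(y, p))
        gb = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Hessian entries; LL is concave, so these are negative
        w = [pi * (1 - pi) for pi in p]
        haa = -sum(w)
        hab = -sum(xi * wi for xi, wi in zip(x, w))
        hbb = -sum(xi * xi * wi for xi, wi in zip(x, w))
        det = haa * hbb - hab * hab
        # Newton step: (a, b) <- (a, b) - H^{-1} gradient
        a -= (hbb * ga - hab * gb) / det
        b -= (haa * gb - hab * ga) / det
    return a, b

# Hypothetical data: y tends toward 1 as x grows, with some overlap.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 0, 1, 1]
print(newton_logistic(x, y))
```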
Sample Size: The recommended minimum sample size for logistic regression is given by 10k/q
where k = the number of independent variables and q = the smaller of the percentage of cases with
y = 0 or y = 1, with a minimum of 100.
For Example 1, k = 1 and q = 302/760 = .397, and so 10k/q = 25.17. Thus a minimum sample of
size 100 is recommended.
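A quick check of this guideline in Python (the function name is ours):

```python
def min_sample_size(k: int, q: float) -> float:
    """Recommended minimum sample size: 10k/q, but never below 100."""
    return max(10 * k / q, 100)

print(min_sample_size(1, 302 / 760))   # 10(1)/.397 = 25.17, so 100
```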
