Logistic Regression Course Note
Lecture 1: LOGISTIC REGRESSION
Aims
To introduce the logistic regression model for modelling binary outcome data.
Objectives
By the end of this session you should be able to:
1. Understand the difference between a linear model and a logistic regression model;
2. Interpret the coefficients of variables in a logistic regression model;
3. Calculate the likelihood ratio test statistic for a variable, or group of variables, in a logistic
regression equation;
4. Write a SAS program to fit a logistic regression model and interpret the output.
Linear Models
The Mantel‐Haenszel method could be used to examine the association between two binary variables
after stratifying by another variable. That is, the method was used to examine the effect of a binary
study factor on a binary outcome variable, while adjusting for a single confounder or effect modifier.
If we want to adjust for more than one variable, it is simpler to use a regression model. We have seen in
this course that, when the outcome is a continuous variable, multiple linear regression is used. The
continuous outcome variable is linearly related to the explanatory variables, and the errors are assumed
to be Normally distributed. This type of model is known as a linear model. In multiple linear regression,
the expected value of the outcome variable (y) is modelled as a function of
multiple explanatory variables (x1, x2, x3, …), and the regression equation is represented as

y = α + β1 x1 + β2 x2 + ⋯ + βk xk + ε
Logistic Regression
If we apply a linear regression model when the outcome variable has two possible values (e.g. diseased
or not diseased) then there may be problems with the resulting model. When the outcome is binary it
would commonly be coded as 0 (outcome not present) and 1 (outcome present) and the purpose of
fitting a regression model is to identify which factors are associated with the probability (P) of having the
outcome of interest. This probability can only take values between 0 and 1. A linear regression model
fitted to such data would have the following form:
P = α + β1 x1 + β2 x2 + ⋯ + βk xk
Unfortunately such a model may produce estimates of the probability (P) that are outside the bounds of
0 and 1. To analyse data with a binary outcome we instead use the logistic regression model. The rest of
this course will focus on how to fit logistic regression models and interpret the results.
In order to apply the logistic regression model we need to make a couple of changes to the linear
regression model described above. Firstly, we require a transformation such that the left hand side of the
equation (the outcome) can take any value. Secondly, if the outcome is binary then the errors will not be
Normally distributed so we will need to specify a distribution for the errors that would be appropriate for
binary data.
Logit Transformation
The transformation that is applied to the outcome in a logistic regression model is known as the logit
function.
logit(P) = ln( P / (1 − P) )

The logit (or logistic) function transforms P as follows:

When P = 0,    logit(P) = ln( 0 / 1 ) = ln(0) = −∞
When P = 0.5,  logit(P) = ln( 0.5 / 0.5 ) = ln(1) = 0
When P = 1,    logit(P) = ln( 1 / 0 ) = +∞

Thus logit(P) can range from −∞ to +∞.
If logit(P) is linearly related to x, then the plot of P against x will be S‐shaped (sigmoidal).
[Figure: S‐shaped (sigmoidal) plot of Probability (P), vertical axis, against x, horizontal axis.]
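A curve like this can be generated directly from the inverse of the logit. The short SAS sketch below is purely illustrative (the values alpha = −4 and beta = 0.8, and the plotting range, are arbitrary choices, not taken from any model in these notes); it plots P against x for a model in which logit(P) is linear in x.

* Illustrative only: plot P against x when logit(P) = alpha + beta*x;
DATA sigmoid;
  alpha = -4;    * assumed intercept;
  beta  = 0.8;   * assumed slope;
  DO x = -2 TO 13 BY 0.1;
    p = EXP(alpha + beta*x) / (1 + EXP(alpha + beta*x));  * inverse logit;
    OUTPUT;
  END;
RUN;

PROC SGPLOT DATA=sigmoid;
  SERIES X=x Y=p;
RUN;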
Interpreting logits
For a binary outcome (positive vs negative), we can represent the probability of the “positive” outcome for individuals in group A by P_A, and the probability of the positive outcome for individuals in group B by P_B. The log odds of a positive outcome in each group is then given by:

logit(P_A) = ln( P_A / (1 − P_A) )   and   logit(P_B) = ln( P_B / (1 − P_B) )
The difference between the logits for the two groups A and B is

logit(P_B) − logit(P_A) = ln( P_B / (1 − P_B) ) − ln( P_A / (1 − P_A) )

                        = ln[ ( P_B / (1 − P_B) ) / ( P_A / (1 − P_A) ) ]

                        = ln(OR)                                        (3.1)

That is, the difference between the logit of P_B and the logit of P_A is the natural logarithm of the odds ratio for the event for group B relative to group A.
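As a quick numerical illustration (the probabilities here are invented): if P_B = 0.4 and P_A = 0.25, the odds in group B are 0.4/0.6 = 2/3 and the odds in group A are 0.25/0.75 = 1/3, so the odds ratio is 2 and the difference of the logits is ln(2/3) − ln(1/3) = ln(2) ≈ 0.69.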
Multiple logistic regression
The equation of a multiple logistic regression model can be represented as:
ln( P / (1 − P) ) = α + β1 x1 + β2 x2 + ⋯ + βk xk                        (3.2)
The error distribution chosen for the logistic regression model is the Binomial distribution as the
outcome is binary.
The inverse of the logit transformation can be used to express P (rather than logit P) as a function of the
explanatory variables. Using this inverse transformation, equation (3.2) may also be written in the form:
P = exp(α + β1 x1 + ⋯ + βk xk) / [ 1 + exp(α + β1 x1 + ⋯ + βk xk) ]       (3.3)
This form is useful for calculating the probability of an individual having the outcome, given the values of
their x's (explanatory variables) after having estimated the values of the coefficients (β's).
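As a small worked illustration (the coefficient values here are made up, not taken from any model in these notes), suppose a fitted model has α = −2 and β1 = 0.5, and an individual has x1 = 3. Then

logit(P) = −2 + 0.5 × 3 = −0.5
P = exp(−0.5) / (1 + exp(−0.5)) = 0.607 / 1.607 ≈ 0.38

so the estimated probability of the outcome for that individual is about 0.38.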
Interpretation of coefficients
In the multiple logistic regression equation (3.2), βi represents the additive increase in the loge(odds), also denoted as ln(odds), for an increase of 1 unit in xi, with all the other x's fixed.
As in multiple linear regression, the explanatory variables in a multiple logistic regression model may be
either continuous or categorical. Using a simple logistic regression model with only one covariate for the
purposes of illustration, we can consider the interpretation of the coefficient when the explanatory
variable is (i) dichotomous or (ii) continuous.
(i) dichotomous explanatory variable x (e.g. 0=male and 1=female)
For males:   x = 0 and hence logit(P) = α
For females: x = 1 and hence logit(P) = α + β

Taking the difference: (α + β) − α = β

Hence, β represents the log odds ratio for females vs males. The odds ratio for females vs males is given by exp(β) (i.e. e^β).
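As a quick illustration with an invented value: if the estimated coefficient were b = 0.7, the odds ratio for females vs males would be e^0.7 ≈ 2.0.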
SAS example
– GLOW dataset from Hosmer and Lemeshow
– Outcome: Fracture (coded 1=yes, 0 = no)
– Study factor: Prior fracture (coded 1=yes, 0 = no)
– Explanatory variable: Age (years)
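A program along the following lines could be used to fit this model (the dataset and variable names glow, fracture, priorfrac and age are assumed here; use whatever names appear in your copy of the data):

PROC LOGISTIC DATA=glow;
  MODEL fracture (EVENT='1') = priorfrac age;   * prior fracture and age as explanatory variables;
RUN;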
Analysis of Maximum Likelihood Estimates
Odds Ratio Estimates
Effect          Point Estimate          95% Wald Confidence Limits
Interpretation: Those with a prior fracture have 2.31 times the odds of a subsequent fracture
compared to those without a prior fracture, after adjusting for age.
(ii) Continuous explanatory variable x (e.g. age)
For a given age x = a:    logit(P) = α + βa
For age x = a + 1:        logit(P) = α + β(a + 1)

Taking the difference: [α + β(a + 1)] − [α + βa] = β

Hence, β represents the log odds ratio for a 1 year increase in age. The odds ratio for a 1 year increase in age is given by exp(β) (i.e. e^β).
SAS example continued:
Odds Ratio Estimates
Effect          Point Estimate          95% Wald Confidence Limits
Interpretation: For a one year increase in age the odds of a subsequent fracture are increased by a factor
of 1.04, after adjusting for prior fracture.
(iii) Categorical explanatory variables with more than two levels
As demonstrated above, for a dichotomous variable, xj, exp(βj) represents the odds ratio due to the
presence of the characteristic (assuming xj is coded so that 0=absence and 1=presence of the
characteristic). For categorical variables with more than 2 levels, dummy variables must be created as for
multiple linear regression. One level must be chosen as the reference category, with OR=1
(corresponding to β=0). Then exp(β) for the dummy variables gives the OR for the other levels relative to
the reference category.
For example, if we grouped countries of birth into three categories (Australia/NZ, Asia, Elsewhere) then
we would need to create 2 dummy variables. The choice of coding of these dummy variables then
determines which is the reference category.
Dummy variable 1 Dummy variable 2
Australia/NZ 0 0
Asia 1 0
Elsewhere 0 1
In the coding above, Australia/NZ would be the reference category and the resulting output would contain two odds ratios: the first compares the odds of the outcome for people born in Asia with those born in Australia/NZ, and the second compares the odds of the outcome for people born Elsewhere with those born in Australia/NZ.
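As a minimal sketch of how this coding might be set up (the dataset name mydata, a character variable cob holding the country-of-birth categories, and the outcome variable outcome are all assumptions for illustration), the two dummy variables in the table above could be created in a DATA step and then included in the model:

DATA mydata2;
  SET mydata;
  cob_asia = (cob = 'Asia');        * 1 if born in Asia, 0 otherwise;
  cob_else = (cob = 'Elsewhere');   * 1 if born elsewhere, 0 otherwise;
RUN;

PROC LOGISTIC DATA=mydata2;
  MODEL outcome (EVENT='1') = cob_asia cob_else;
RUN;

Alternatively, the CLASS statement described in the SAS notes later in this lecture can create the dummy variables automatically.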
Fitting a Logistic Regression Model using Maximum Likelihood Estimation
In linear regression we fit models using the method of least squares. Logistic regression models are fitted
using the method of maximum likelihood; a general method that is applied to all other types of
generalised linear models. The likelihood, L, of the observed data is a measure of the
fit of a model. It reflects the probability of observing our data, if the given model was true. Therefore, the
larger the likelihood, the better the fit of the model.
The logistic regression equation can be fitted to a set of data using the method of maximum likelihood.
The computations are even more complex than for a classical regression fitted by least squares, but are
readily carried out using a computer package.
As a very simple illustration, imagine we have a study with 10 participants of whom 4 have the disease of interest. Then in this study the odds of having disease is 4/6 = 0.667.
We could fit a logistic regression model to estimate the odds of having disease. The model would be
Logit(disease) = α
(Note: in this model we have no explanatory variables and so the intercept provides the estimate of the
log odds of the outcome).
The method of maximum likelihood postulates a value for α and then calculates how likely it is that this
model would generate the observed data we have (in this case 4 participants with disease and 6
participants without disease) and searches through all possible values of α to find the value that
maximises the likelihood.
In the graph below, the likelihood is plotted on the vertical axis (strictly, the log‐likelihood is plotted: when we apply the method of maximum likelihood we usually work with the log of the likelihood, as it is easier to handle mathematically). On the horizontal axis we have the postulated values of α (here labelled as _cons). We can see that the maximum of the graph is at approximately −0.4. This is the maximum likelihood estimate of α.
Hence our estimate of Logit(disease) = −0.4, and taking the exponential gives exp(−0.4) ≈ 0.67, as expected from the observed data.
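A plot like the one described can be produced with a few lines of SAS. The sketch below is a rough illustration of the idea (it evaluates the log-likelihood of the 4-out-of-10 data over a grid of postulated values of α) rather than how SAS actually maximises the likelihood:

DATA loglik;
  DO alpha = -2 TO 1 BY 0.01;
    p = EXP(alpha) / (1 + EXP(alpha));    * probability implied by this alpha;
    loglik = 4*LOG(p) + 6*LOG(1 - p);     * 4 with disease, 6 without;
    OUTPUT;
  END;
RUN;

PROC SGPLOT DATA=loglik;
  SERIES X=alpha Y=loglik;                * maximum is near alpha = -0.4;
RUN;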
The same principles apply when we extend the model to have explanatory variables. The method of
maximum likelihood searches through all possible estimates for the parameters and finds the values that
maximise the likelihood of observing the data we have. For example, if males are at greater risk of having
the disease than females then a model that postulates that males have only half the risk of females
would be very unlikely to produce the data we have observed.
Both classical regression and logistic regression are particular members of the wider class of models
called generalised linear models. The method of least squares is an example of the more general method
of maximum likelihood applied to a Normally distributed variable. Thus logistic regression is conceptually
similar to classical regression and the rationale of fitting different models to a data set is the same.
Hypothesis testing
There are two methods of constructing a hypothesis test:
1. The likelihood ratio test, based on the log likelihood, is used for testing any set of regression
coefficients. (This corresponds to an F test in linear regression.) This test uses the transformation
‐2Ln(L) (minus twice the log likelihood). Note that as the likelihood increases (i.e. the fit of the model gets better), ‐2Ln(L) becomes smaller. Changes in ‐2Ln(L) are test statistics, and can be
referred to as deviance differences, where the deviance is the analogue of residual sum of
squares in classical regression. When variables are added into a model the difference between
‐2Ln(L) of the two models represents the effects of the added variables. This reduction in ‐2Ln(L)
is distributed approximately as a chi‐squared statistic with degrees of freedom equal to the
number of variables added.
Consider two models (a) and (b), where model (b) contains q extra terms. The likelihood ratio test statistic is the change in ‐2Ln(L),

‐2Ln(L_a) − (‐2Ln(L_b))

which is referred to the chi‐squared distribution with q degrees of freedom.
2. The Wald z or chi‐squared test may be used to assess the evidence for a single variable. (This corresponds to the t test of a single regression coefficient in classical regression.) Squaring the Wald z (which has an approximate Normal distribution) gives the Wald chi‐squared statistic, which has an approximate chi‐squared distribution with 1 DF. Although the likelihood ratio test is
usually preferred, the Wald test is very convenient to use because it is included in the output
from many programs such as PROC LOGISTIC in SAS. The Wald test for a set of k dummy variables
or interaction terms can be obtained using PROC LOGISTIC; the test statistic has an approximate
chi‐squared distribution with k degrees of freedom. However, the likelihood ratio test is more
reliable than the Wald test particularly when the overall number of observations is small or the
number of observations in a variable category is small.
Confidence interval estimation
The maximum likelihood method gives the estimates, b_i, with their standard errors, SE(b_i), from which confidence intervals may be constructed. As the sampling error for each b_i is assumed to be Normally distributed, the 95% confidence interval for the coefficient is

b_i ± 1.96 × SE(b_i)

and the 95% confidence interval for the odds ratio is

exp[ b_i ± 1.96 × SE(b_i) ]
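As a small worked example with made-up numbers (b = 0.60 and SE(b) = 0.16 are illustrative values, not output from any model in these notes):

95% CI for the coefficient:  0.60 ± 1.96 × 0.16 = (0.286, 0.914)
95% CI for the odds ratio:   exp(0.286) to exp(0.914) = 1.33 to 2.49

so the estimated odds ratio would be reported as exp(0.60) = 1.82 (95% CI 1.33 to 2.49).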
Logistic regression vs Mantel‐Haenszel method
Logistic regression can be used as an alternative to the computationally simpler method of Mantel‐
Haenszel. Logistic regression has the advantage that it gives more information than the Mantel‐
Haenszel method. This point can be illustrated using the asbestos exposure and lung cancer case control
study. Fitting both asbestos and smoking variables in the model gave:
The resulting odds ratio for lung cancer due to asbestos allowing for smoking was 1.78 (95% CI: 1.30 to
2.45), and the odds ratio due to smoking allowing for asbestos was 3.34 (95% CI: 2.10 to 5.31). Hence,
the variables, asbestos and smoking, are represented on an equal basis in the model. By contrast, the
Mantel‐Haenszel analysis does not provide an estimate of the smoking effect allowing for asbestos,
because smoking is the stratification variable. To obtain an estimate of the effect of smoking, adjusted
for asbestos exposure, it would be necessary to repeat the Mantel‐Haenszel analysis after re‐arranging
the data into tables showing the association between smoking and lung cancer after stratifying for
asbestos.
If one factor is regarded as the study factor and the other as a confounder, the above feature confers no
advantage on logistic regression. Often, however, both factors may be regarded as study factors and
both as confounders with respect to the other. Then the logistic regression approach has the advantage
of estimating the effects of both factors simultaneously, each effect being estimated after allowing for
the other.
The real advantage of logistic regression occurs when there are many risk factors to be considered. Even
if all except one were regarded as confounders, the Mantel‐Haenszel method could only be applied after
stratifying for all these factors simultaneously. This would lead to a large number of strata and few
subjects in each, and the stratification would result in loss of information on the effect of the study
factor. Note, in particular, if a stratum contained only one subject it would contribute no information, or
if it contained 2 or 3 subjects, say, but they were all cases, or all controls, or all exposed to the study
factor, or all unexposed, there would be no information. Therefore, in situations where there are several
confounders, stratification will usually lead to a lower power than a logistic regression. Moreover, effect
modification can be explored more readily using logistic regression by including product terms of study
factor and confounder. (This will be discussed further in the next session.)
Another point is that the Mantel‐Haenszel method requires that all variables be categorical, whereas in
logistic regression the risk factors could be either continuous or categorical.
To summarise, logistic regression is a much more flexible method which can deal with more complex
situations.
Further reading: H&L Chapters 1.1 to 1.4, 2 and 3. SMMR §14.2. Applied linear models with SAS Chapter
8, Applied medical statistics using SAS Chapter 8, Chapter 9.
Lecture 1: SAS NOTES
Procedures
PROC LOGISTIC
Logistic regression models can be fitted using the LOGISTIC procedure. All that is required is a MODEL
statement with the appropriate syntax. The MODEL statement will be different depending on whether the records in your
dataset correspond to single observations or grouped observations.
If each record in your dataset corresponds to a single observation or person then the following syntax
would be used:
PROC LOGISTIC [DATA=dataset];
[CLASS variable(s) [/options];]
MODEL outcome_var (EVENT='value') = explanatory variables [/options];
outcome_var is the outcome variable and value is the value this variable takes for the outcome of
interest. For example, MODEL case (EVENT='1'). If you do not include (EVENT='value') in your
program, SAS, by default, takes the lowest value of the outcome_var as the outcome of interest. Note if
you have assigned a format to the outcome variable then the value in the quotes should be the
corresponding format label. For example, if you have assigned a format label of Yes to the value 1 and No
to the value 0 then the statement would read (EVENT='Yes').
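As a brief sketch of the situation just described (the format name yesno, the dataset mydata and the variables case, x1 and x2 are assumptions for illustration):

PROC FORMAT;
  VALUE yesno 1 = 'Yes' 0 = 'No';
RUN;

PROC LOGISTIC DATA=mydata;
  FORMAT case yesno.;
  MODEL case (EVENT='Yes') = x1 x2;   * EVENT= uses the formatted label, not the raw value;
RUN;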
If each record corresponds to grouped observations then the following syntax would be used:
PROC LOGISTIC [DATA=dataset];
[CLASS variable(s) [/options];]
MODEL events/trials = explanatory variables [/options];
events is the variable containing the count of the number of people with the outcome of interest, and
trials is the variable containing the total (those with and without the outcome of interest) number of
people.
Note that only one MODEL statement can be used for each LOGISTIC procedure.
For each term in the model, the following are shown in the output:
Under the heading "Analysis of Maximum Likelihood Estimates":
Estimate the coefficient of that term, b
Standard Error the standard error of the parameter estimate, SE(b)
Chi‐Square = z², where z = b/SE(b), with an approximate chi‐squared distribution with 1 DF
Pr > ChiSq the P‐value for the Wald chi‐square
Under the heading "Odds Ratio Estimates":
Point Estimate = exp(b) = e^b
The LOGISTIC procedure also prints the value of ‐2Ln(L) for the intercept only, for the intercept and
covariates, and the difference which gives an approximate chi‐squared test for the joint effect of all the
explanatory variables (covariates) in the model, with DF equal to the number of covariates. The effect of
adding a variable to a model may be tested using the likelihood ratio test by calculating the change in ‐
2Ln(L) (or equivalently, the change in the chi‐square for covariates) between models with and without
the additional variable. This likelihood ratio (LR) test is a preferable alternative to the Wald test. The
Wald test statistic for a group of variables (e.g. a set of dummy variables) can be obtained using the TEST
statement.
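As a rough sketch of the two-model approach (mydata, outcome, x1, x2 and x3 are placeholder names, not from any dataset used in these notes):

* model without the additional variable x3;
PROC LOGISTIC DATA=mydata;
  MODEL outcome (EVENT='1') = x1 x2;
RUN;

* model with x3 added - the likelihood ratio statistic is the change in
  -2 Log L between the two models, referred to chi-squared with 1 DF;
PROC LOGISTIC DATA=mydata;
  MODEL outcome (EVENT='1') = x1 x2 x3;
RUN;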
Options
Some of the options for MODEL are:
ALPHA=value sets the significance level for the confidence limits. The value must be between 0 and 1.
The default value 0.05 results in the calculation of a 95% CI.
PARMLABEL displays the labels (see the LABEL statement) of the parameters in the
"Analysis of Maximum Likelihood Estimates" table.
CLASS Statement
CLASS variable(s) [/options];
Or, to assign different options to variables, use
CLASS variable (options) variable (options) …;
The CLASS statement can be used to define explanatory variables to be categorical. SAS will automatically
create dummy variables for any variables included in the CLASS statement, like in PROC GLM. The default
in SAS is to use effect coding for the dummy variables (this will not be covered in the course). To use
coding consistent with that adopted in Multiple Regression and in this course, we use the following
options for each variable:
PARAM=keyword Specifies the coding for the dummy (design) variables. The keyword REF specifies
that a referent category be used.
REF=level Specifies which level (i.e. value of the variable) is to be the referent category e.g. REF=0.
Alternatively, REF=FIRST requests that the first level of an ordered set of categories be
used as the referent group. REF=LAST, which is the default, uses the last level as the
referent group.
For example:
* BASE MODEL;
PROC LOGISTIC DATA = cin_ca;
  CLASS smok npart agefi educ / PARAM=ref REF=first;
  MODEL status (EVENT = '1') = smok npart agefi educ age;
RUN;
Notes:
a) All of the explanatory variables, except age, are treated as categorical in the above model.
b) Dummy variables are created using the first level for each variable as the referent category.
c) If value labels have been assigned to a categorical variable, the order of the categories will be
determined by the alphabetical order of the labels.
d) The Wald chi‐square, its degrees of freedom and P‐value will be reported for all variables when the CLASS statement is used. This is very useful when screening variables for inclusion in the model.
Lecture 1: EXAMPLES IN SAS
Example 3: Association between Lung Cancer and Asbestos Exposure: Logistic Regression
Modelling
The case‐control data may be analysed for the asbestos effect on lung cancer, taking account of
smoking. The data are:
asb   smok   cases   controls
 1     0       17        98
 0     0        6       103
 1     1      131       274
 0     1       69       240
As shown earlier (equation 3.1), the difference of two logits is the log of an odds ratio. Since the odds ratio can be
estimated in a case‐control study, logistic regression can be used to analyse the data from studies such as
Example 3. The outcome (P) represents the proportion of (cases + controls) that are cases for each unit of
data. A SAS program is given below that shows how logistic regression models can be fitted for these
data. The corresponding output and a brief discussion of this output follows.
It should be noted that the parameter α has no useful meaning for a case‐control study, as it is
influenced by the sampling fractions of cases and controls. Thus the fitted model cannot be converted
back to give meaningful estimates of P. This corresponds to the well‐known fact that incidences cannot
be estimated from a case‐control study but the odds ratio can.
SAS Program:
* author : me
  date   : today
  purpose: logistic regression of asbestos and lung cancer data
           adjusted for smoking;
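A minimal sketch of a program that would fit the three models discussed in the edited output below (the dataset name asbestos and the column names cases and controls are assumptions; asb and smok match the effect names in the output) is:

DATA asbestos;
  INPUT asb smok cases controls;
  total = cases + controls;         * denominator for the events/trials syntax;
  DATALINES;
1 0 17 98
0 0 6 103
1 1 131 274
0 1 69 240
;
RUN;

* MODEL 1: asbestos only;
PROC LOGISTIC DATA=asbestos;
  MODEL cases/total = asb;
RUN;

* MODEL 2: smoking only;
PROC LOGISTIC DATA=asbestos;
  MODEL cases/total = smok;
RUN;

* MODEL 3: asbestos adjusted for smoking;
PROC LOGISTIC DATA=asbestos;
  MODEL cases/total = asb smok;
RUN;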
Edited Output
MODEL 1:
SC (Schwarz Criterion): Intercept Only = 1035.756; Intercept and Covariates = 1028.194

MODEL 2:
SC (Schwarz Criterion): Intercept Only = 1035.756; Intercept and Covariates = 1008.832

MODEL 3:
SC (Schwarz Criterion): Intercept Only = 1035.756; Intercept and Covariates = 1002.688
P = proportion of total that are cases.
(a) MODEL 1: can be used to estimate the unadjusted effect of asbestos exposure, ignoring smoking.
logit(P) = −1.5202 + 0.5986 asb

The “Estimate” for asb (0.5986) gives the estimated coefficient for asb in the model, i.e. the estimated log odds ratio. SAS uses this estimate in conjunction with its standard error to compute the odds ratio and its 95% confidence interval (1.82, 95% CI 1.33 to 2.49).
Notes:
(ii) the Wald chi‐square statistic is 13.94, 1DF, P=0.0002. This also indicates very strong evidence of
an association between asb and lung cancer.
(b) MODEL 2: can be used to estimate the effect of smoking exposure, not adjusted for asbestos
logit(P) = −2.1676 + 1.2237 smok

The “Estimate” for smoking (1.2237) gives the estimated coefficient for smoking in the model, i.e. the estimated log odds ratio. The odds ratio and its 95% confidence interval are 3.40 (95% CI 2.14 to 5.39).
Notes:
(i) Smoking is a risk factor for lung cancer
(ii) the Wald chi‐square statistic is 27.03, 1DF, P<0.0001 also indicates very strong evidence of an
association between smoking and lung cancer.
(c) MODEL 3: can be used to estimate the effect of asbestos exposure, adjusted for smoking.
logit(P) = −2.4968 + 0.5776 asb + 1.2070 smok
The −2 Log L for the intercept only is again 1028.912, but the −2 Log L for the intercept and covariates is 982.157. Since there are two covariates in the model, subtracting in the same way as we did above gives a chi‐square statistic of 1028.912 − 982.157 = 46.76 with 2 DF, a test of the joint effect of asb and smoking. To test for the effect of asbestos after allowing for smoking we must use the change in −2 Log L between MODEL 3 and MODEL 2, referred to a chi‐squared distribution with 1 DF.
Note: for this analysis to be complete, we would also need to test for effect modification by testing for
interaction between asb and smoking in the model (This will be covered in future lectures).
Lecture 1: PRACTICAL
(a) To obtain a feel of the logit transformation, consider the following data showing the relationship
between coronary heart disease (CHD) and age in a sample of 1000 men:
Frequency Table of Age Group by CHD
(i) Calculate the proportion of men with CHD in each age group, P, and plot this proportion against
age (HINT: use the mid‐point of each age group).
(ii) Calculate logit P and plot it against age.
(iii) Describe the effect of the logit transformation on these data.
(b) The first phase of the Blue Mountains Eye Study was a cross‐sectional study of people aged 49 years
or more living in the Blue Mountains. Subjects had photographs taken of their lenses, which were
then graded for the presence of different types of cataract. Information on possible risk factors for
posterior subcapsular cataract (PSC) was collected using an interviewer‐administered questionnaire.
These data included age and smoking status. A total of 3065 study participants had complete data for
these variables, and information on the coding of the variables is given below.
These data are recorded in Tutorial_2.xlsx. Using these data:
(i) Produce a frequency table for the variable PSC. How many of the 3065 participants had a
posterior subcapsular cataract?
(ii) Fit a logistic regression model to the data to determine whether age is associated with
posterior subcapsular cataract.
(iii) Interpret the estimate for age
(iv) What is the estimated risk of posterior subcapsular cataract for a person aged 60 years?
(v) Is smoking associated with posterior subcapsular cataract after adjusting for age?
(vi) Interpret the coefficients for smoking in this model.
Appendix: Percentage points of the χ2 distribution
From ARMITAGE P, BERRY G, MATTHEWS J.N.S. Statistical methods in medical research. 4th edition. Oxford: Blackwell Scientific Publications, 2001.