Introduction To Logistic Regression: Rachid Salmi, Jean-Claude Desenclos, Alain Moren, Thomas Grein

Content
• Simple and multiple linear regression

• Simple logistic regression
– The logistic function
– Estimation of parameters
– Interpretation of coefficients
• Multiple logistic regression
– Interpretation of coefficients
– Coding of variables
• Examples in Epi Info 2002
Simple linear regression
Table 1 Age (years) and systolic blood pressure (SBP, mm Hg) among 33 adult women
Age  SBP    Age  SBP    Age  SBP
22   131    41   139    52   128
23   128    41   171    54   105
24   116    46   137    56   145
27   106    47   111    57   141
28   114    48   115    58   153
29   123    49   133    59   157
30   117    49   128    63   155
32   122    50   183    67   176
33    99    51   130    71   172
35   121    51   133    77   178
40   147    51   144    81   217
[Figure: SBP (mm Hg) plotted against age (years)]
adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974

Simple linear regression
• Relation between 2 continuous variables (SBP and age)
$y = \alpha + \beta_1 x_1$ (slope: $\beta_1$)
• Regression coefficient $\beta_1$
– Measures association between y and x
– Amount by which y changes on average when x changes by one unit
– Least squares method
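The least squares method named above can be sketched in Python on the Table 1 data. This is a minimal illustration rather than the original analysis: the function name `least_squares` is hypothetical, and the (age, SBP) pairs are transcribed from Table 1 by reading each Age/SBP column pair from top to bottom.

```python
# (age in years, SBP in mm Hg) pairs transcribed from Table 1
DATA = [
    (22, 131), (23, 128), (24, 116), (27, 106), (28, 114), (29, 123),
    (30, 117), (32, 122), (33, 99), (35, 121), (40, 147),
    (41, 139), (41, 171), (46, 137), (47, 111), (48, 115), (49, 133),
    (49, 128), (50, 183), (51, 130), (51, 133), (51, 144),
    (52, 128), (54, 105), (56, 145), (57, 141), (58, 153), (59, 157),
    (63, 155), (67, 176), (71, 172), (77, 178), (81, 217),
]


def least_squares(pairs):
    """Return (alpha, beta) minimising the sum of squared residuals of y = alpha + beta * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    beta = s_xy / s_xx              # slope: average change in y per one-unit change in x
    alpha = mean_y - beta * mean_x  # intercept: the fitted line passes through the means
    return alpha, beta


alpha, beta = least_squares(DATA)
print(f"SBP = {alpha:.1f} + {beta:.2f} * age")
```

The slope printed here is the regression coefficient $\beta_1$: the average change in SBP (mm Hg) for each additional year of age.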
Multiple linear regression
• Relation between a continuous variable and a set of i continuous variables
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• Partial regression coefficients $\beta_i$
– Amount by which y changes on average when $x_i$ changes by one unit and all the other $x_i$ remain constant
– Measures association between $x_i$ and y adjusted for all other $x_i$
• Example
– SBP versus age, weight, height, etc
Multiple linear regression
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
Names for y            Names for $x_1, \dots, x_i$
Predicted              Predictor variables
Response variable      Explanatory variables
Outcome variable       Covariables
Dependent              Independent variables
General linear models
• Family of regression models
• Outcome variable determines choice of model
Outcome      Model
Continuous   Linear regression
Counts       Poisson regression
Survival     Cox model
Binomial     Logistic regression
• Uses
– Control of confounding
– Model building, risk prediction
Logistic regression
• Models relationship between set of variables $x_i$
– dichotomous (yes/no)
– categorical (social class, ...)
– continuous (age, ...)
and
– dichotomous (binary) variable Y
• Dichotomous outcome most common situation in biology and epidemiology
Logistic regression (1)
Table 2 Age and signs of coronary heart disease (CD)

How can we analyse these data?
• Compare mean age of diseased and non-diseased
– Non-diseased: 38.6 years
– Diseased: 58.7 years (p<0.0001)
• Linear regression?
Dot-plot: Data from Table 2
Logistic regression (2)
Table 3 Prevalence (%) of signs of CD according to age group

Dot-plot: Data from Table 3 (y-axis: diseased, %; x-axis: age group)
Logistic function (1)
[Figure: probability of disease (0.0 to 1.0) plotted against x]
Logistic transformation
$\ln\left[\dfrac{P(y|x)}{1 - P(y|x)}\right] = \alpha + \beta x$ (logit of P(y|x))
Advantages of Logit
• Properties of a linear regression model
• Logit between $-\infty$ and $+\infty$
• Probability (P) constrained between 0 and 1
• Directly related to notion of odds of disease
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta x \quad\Leftrightarrow\quad \dfrac{P}{1-P} = e^{\alpha + \beta x}$
Interpretation of coefficient $\beta$
$\dfrac{P}{1-P} = e^{\alpha + \beta x}$
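The link between the probability P and the linear predictor $\alpha + \beta x$ can be sketched as a pair of inverse functions. The helper names `logistic` and `logit` and the coefficient values are illustrative assumptions, not values from the slides.

```python
import math


def logistic(z):
    """Map a linear predictor z = alpha + beta * x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability p in (0, 1) to the log-odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only to show the S-shaped curve
alpha, beta = -2.0, 0.5
for x in (0, 2, 4, 6, 8):
    p = logistic(alpha + beta * x)
    print(f"x={x}  P={p:.3f}  logit(P)={logit(p):.3f}")
```

The logit column is linear in x while P stays between 0 and 1, which is the property the logistic transformation is used for.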
Interpretation of coefficient $\beta$
• $\beta$ = increase in the log-odds for a one unit increase in x, i.e. the logarithm of the odds ratio
• Test of the hypothesis that $\beta = 0$ (Wald test)
$\chi^2 = \dfrac{\beta^2}{\text{Variance}(\beta)}$ (1 df)
• Interval testing
Example
• Risk of developing coronary heart disease (CD) by age (<55 and 55+ years)
• Logistic Regression Model
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 \times \text{Age} = -0.841 + 2.094 \times \text{Age}$
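As a rough worked example, the fitted equation can be turned into an odds ratio and two predicted probabilities. This sketch assumes Age is coded 1 for 55+ years and 0 for <55 years; the slide does not state the coding, and the function name `probability` is hypothetical.

```python
import math

ALPHA = -0.841     # intercept from the fitted model
BETA_AGE = 2.094   # coefficient for Age from the fitted model


def probability(age_55_plus):
    """Predicted probability of CD; age_55_plus is 1 for 55+ years, 0 for <55 (assumed coding)."""
    log_odds = ALPHA + BETA_AGE * age_55_plus
    return 1.0 / (1.0 + math.exp(-log_odds))


odds_ratio = math.exp(BETA_AGE)  # odds of CD at 55+ relative to <55
p_under_55 = probability(0)
p_55_plus = probability(1)

print(f"OR = {odds_ratio:.2f}")
print(f"P(CD | <55) = {p_under_55:.3f}")
print(f"P(CD | 55+) = {p_55_plus:.3f}")
```

Under that coding the odds ratio is about 8.1, and the predicted probabilities are roughly 0.30 below 55 years and 0.78 at 55 years and above.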
Fitting equation to the data
• Linear regression: Least squares
• Logistic regression: Maximum likelihood
• Likelihood function
– Estimates parameters $\alpha$ and $\beta$ with property that likelihood (probability) of observed data is higher than for any other values
– Practically easier to work with log-likelihood
$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(x_i) + (1 - y_i) \ln\left[1 - \pi(x_i)\right] \right\}$
where $\pi(x_i)$ is the modelled probability of disease for observation i
Maximum likelihood
• Iterative computing
– Choice of an arbitrary value for the coefficients (usually 0)
– Computing of log-likelihood
– Variation of coefficients’ values
– Reiteration until maximisation (plateau)
• Results
– Maximum Likelihood Estimates (MLE) for $\alpha$ and $\beta$
– Estimates of P(y) for a given value of x
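The iterative steps above can be sketched in Python. The slides do not say how the coefficients are varied between iterations; this sketch assumes Newton-Raphson updates, one common choice. The data set (40 unexposed subjects with 10 diseased, 40 exposed subjects with 20 diseased) is hypothetical, and the function names are illustrative.

```python
import math


def probability(alpha, beta, x):
    """Modelled probability of disease for exposure value x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def log_likelihood(alpha, beta, xs, ys):
    """Sum of y*ln(p) + (1-y)*ln(1-p) over all observations."""
    return sum(
        y * math.log(probability(alpha, beta, x))
        + (1 - y) * math.log(1.0 - probability(alpha, beta, x))
        for x, y in zip(xs, ys)
    )


def fit_logistic(xs, ys, tolerance=1e-10, max_iterations=50):
    """Maximise the log-likelihood of a one-covariate model by Newton-Raphson."""
    alpha, beta = 0.0, 0.0                       # arbitrary starting values
    current = log_likelihood(alpha, beta, xs, ys)
    for _ in range(max_iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0          # gradient and information matrix
        for x, y in zip(xs, ys):
            p = probability(alpha, beta, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det     # step = inverse(information) * gradient
        beta += (h00 * g1 - h01 * g0) / det
        previous, current = current, log_likelihood(alpha, beta, xs, ys)
        if abs(current - previous) < tolerance:  # plateau reached
            break
    return alpha, beta, current


# Hypothetical data: x = 0 unexposed, x = 1 exposed; y = 1 diseased
xs = [0] * 40 + [1] * 40
ys = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20

alpha_hat, beta_hat, max_log_likelihood = fit_logistic(xs, ys)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}, log-likelihood = {max_log_likelihood:.3f}")
```

With a single binary exposure the estimates can be checked by hand: $\alpha$ should equal the log-odds among the unexposed, ln(10/30), and $\beta$ the log odds ratio, ln(3).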
Multiple logistic regression
• More than one independent variable
– Dichotomous, ordinal, nominal, continuous …
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• Interpretation of $\beta_i$
– Increase in log-odds for a one unit increase in $x_i$ with all the other $x_i$ constant
– Measures association between $x_i$ and log-odds adjusted for all other $x_i$
Effect modification
• Effect modification
– Can be modelled by including interaction terms
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2$
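A short derivation may help show why the product term captures effect modification. It assumes, for illustration only, that $x_1$ and $x_2$ are both coded 0/1; the slide does not specify the coding.

```latex
% Difference in log-odds between x_1 = 1 and x_1 = 0, with x_2 held fixed
\begin{align*}
\ln\frac{P}{1-P}\Big|_{x_1=1} - \ln\frac{P}{1-P}\Big|_{x_1=0}
  &= (\alpha + \beta_1 + \beta_2 x_2 + \beta_3 x_2) - (\alpha + \beta_2 x_2) \\
  &= \beta_1 + \beta_3 x_2 \\
\text{OR}_{x_1 \mid x_2 = 0} &= e^{\beta_1}, \qquad
\text{OR}_{x_1 \mid x_2 = 1} = e^{\beta_1 + \beta_3}
\end{align*}
```

The odds ratio for $x_1$ therefore differs between the strata of $x_2$ by a factor $e^{\beta_3}$, and $\beta_3 = 0$ corresponds to no effect modification on the odds ratio scale.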
Statistical testing
• Question
– Does model including given independent variable
provide more information about dependent variable than
model without this variable?
• Three tests
– Likelihood ratio statistic (LRS)
– Wald test
– Score test
Likelihood ratio statistic
• Compares two nested models
$\log(\text{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$ (model 1)
$\log(\text{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2$ (model 2)
• LR statistic
-2 log (likelihood model 2 / likelihood model 1) =
[-2 log (likelihood model 2)] minus [-2 log (likelihood model 1)]
LR statistic is a $\chi^2$ with DF = number of extra parameters in model
Example
P    Probability for cardiac arrest
Exc  1 = lack of exercise, 0 = exercise
Smk  1 = smokers, 0 = non-smokers
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1\,\text{Exc} + \beta_2\,\text{Smk}$
$= 0.7102 + 1.0047\,\text{Exc} + 0.7005\,\text{Smk}$
(SE 0.2614) (SE 0.2664)
adapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998
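The coefficients above can be converted into odds ratios, approximate 95% confidence intervals and Wald statistics. This sketch assumes that the first standard error (0.2614) belongs to Exc and the second (0.2664) to Smk, as their order on the slide suggests, and that the interval is computed as exp(β ± 1.96 × SE). The function name `summarise` is hypothetical.

```python
import math

# coefficient and standard error per variable (SE pairing is an assumption)
COEFFICIENTS = {"Exc": (1.0047, 0.2614), "Smk": (0.7005, 0.2664)}


def summarise(beta, se, z=1.96):
    """Return (odds ratio, lower 95% limit, upper 95% limit, Wald chi-square)."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    wald = (beta / se) ** 2          # beta^2 / Variance(beta), 1 df
    return odds_ratio, lower, upper, wald


results = {name: summarise(beta, se) for name, (beta, se) in COEFFICIENTS.items()}
for name, (odds_ratio, lower, upper, wald) in results.items():
    print(f"{name}: OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), Wald = {wald:.2f}")
```

Under these assumptions, lack of exercise has an odds ratio of about 2.7 and smoking about 2.0, each adjusted for the other.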

• Interaction between smoking and exercise?
$\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1\,\text{Exc} + \beta_2\,\text{Smk} + \beta_3\,\text{Smk} \times \text{Exc}$
• Product term $\beta_3$ = -0.4604 (SE 0.5332)
Wald test = 0.75 (1 df)
-2log(L) = 342.092 with interaction term
         = 342.836 without interaction term
⇒ LR statistic = 0.74 (1 df), p = 0.39
⇒ No evidence of any interaction
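The two test statistics for the interaction term can be reproduced from the figures quoted above. The helper `chi2_sf_1df` is a hypothetical name; it relies on the fact that a chi-square variable with 1 df is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Likelihood ratio statistic: difference in -2 log(L) between the nested models
minus2_log_l_with = 342.092      # model with the Smk x Exc term
minus2_log_l_without = 342.836   # model without it
lr_statistic = minus2_log_l_without - minus2_log_l_with
lr_p_value = chi2_sf_1df(lr_statistic)

# Wald statistic for the product term
beta3, se3 = -0.4604, 0.5332
wald_statistic = (beta3 / se3) ** 2
wald_p_value = chi2_sf_1df(wald_statistic)

print(f"LR statistic = {lr_statistic:.2f}, p = {lr_p_value:.2f}")
print(f"Wald statistic = {wald_statistic:.2f}, p = {wald_p_value:.2f}")
```

Both statistics are close to 0.75 on 1 df, so they lead to the same conclusion here.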
Coding of variables (1)
• Dichotomous variables: yes = 1, no = 0
• Continuous variables
– Increase in OR for a one unit change in exposure variable
– Logistic model is multiplicative ⇒ OR increases exponentially with x
» If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = $2^3$ = 8
– Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable
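The multiplicative behaviour can be checked with a few lines of Python. The function name `odds_ratio_for_change` is hypothetical; it only restates that the odds ratio for a change of k units is exp(β × k).

```python
import math


def odds_ratio_for_change(or_per_unit, x_from, x_to):
    """Odds ratio implied by the model when x moves from x_from to x_to."""
    beta = math.log(or_per_unit)             # coefficient on the log-odds scale
    return math.exp(beta * (x_to - x_from))  # equals or_per_unit ** (x_to - x_from)


# Slide example: OR = 2 per unit, x increases from 2 to 5
print(round(odds_ratio_for_change(2, 2, 5), 6))
```

A three-unit increase gives 2 × 2 × 2, in line with the example on the slide.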
Continuous variable?
• Relationship between SBP>160 mmHg and body weight
• Introduce BW as continuous variable?
– Code weight as single variable, e.g. 3 equal classes:
40-60 kg = 0, 60-80 kg = 1, 80-100 kg = 2
– Compatible with assumption of multiplicative model
– If not compatible, use indicator variables
Coding of variables (2)
• Nominal variables or ordinal with unequal classes:
– Tobacco smoked: no=0, grey=1, brown=2, blond=3
– Model assumes that OR for blond tobacco = (OR for grey tobacco)$^3$
– Use indicator variables (dummy variables)
Indicator variables: Type of tobacco
• Neutralises artificial hierarchy between classes in the variable "type of tobacco"
• No assumptions made
• 3 variables (3 df) in model using same reference
• OR for each type of tobacco adjusted for the others in reference to non-smoking
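Indicator coding for the tobacco variable can be sketched as follows. The codes come from the "Tobacco smoked" example (no=0, grey=1, brown=2, blond=3); the function name `indicator_variables` and the variable names it produces are hypothetical.

```python
# Codes from the slide: no=0, grey=1, brown=2, blond=3
TOBACCO_TYPES = {0: "none", 1: "grey", 2: "brown", 3: "blond"}


def indicator_variables(code, reference=0):
    """Return one 0/1 indicator per category other than the reference category."""
    return {
        f"tobacco_{label}": int(code == value)
        for value, label in TOBACCO_TYPES.items()
        if value != reference
    }


for code, label in TOBACCO_TYPES.items():
    print(label, indicator_variables(code))
```

Non-smokers get 0 on all three indicators, so each fitted coefficient compares one type of tobacco with non-smoking without imposing any ordering between the types.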
Low Birth Weight Study
• 189 observations
• Low birth weight (LBW)
yes = birth weight < 2500 g
no = birth weight > 2499 g
• Age of mother in years (Age)
• Weight of mother in pounds (Weight)
• Race (1, 2, 3) (Race)
• Number of doctor's visits in last trimester (Visits)
Risk of death from bacterial meningitis
according to treatment
• 161 observations
• Death (yes, no)
• Treatment
– 1=Chloramphenicol, 2=Ampicillin
• Delay before treatment (onset, in days)
• Convulsions (1,0)
• Level of consciousness (1-3)
• Severity of dehydration (1-3)
• Age in years
• Pathogen
– 1 Others, 2 HiB, 3 Streptococcus pneumoniae
Reference
• Hosmer DW, Lemeshow S. Applied logistic regression. New York: Wiley & Sons, 1989
