Introduction To Logistic Regression: Rachid Salmi, Jean-Claude Desenclos, Alain Moren, Thomas Grein




Content

• Simple and multiple linear regression
• Simple logistic regression
– The logistic function
– Estimation of parameters
– Interpretation of coefficients
• Multiple logistic regression
– Interpretation of coefficients
– Coding of variables
• Examples in Epiinfo 2002
Simple linear regression
Table 1 Age and systolic blood pressure (SBP) among 33 adult women

Age  SBP    Age  SBP    Age  SBP
 22  131     41  139     52  128
 23  128     41  171     54  105
 24  116     46  137     56  145
 27  106     47  111     57  141
 28  114     48  115     58  153
 29  123     49  133     59  157
 30  117     49  128     63  155
 32  122     50  183     67  176
 33   99     51  130     71  172
 35  121     51  133     77  178
 40  147     51  144     81  217
[Figure: SBP (mm Hg) plotted against age (years)]

adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974


Simple linear regression
• Relation between 2 continuous variables (SBP and age)

$$y = \alpha + \beta_1 x_1$$

where $\beta_1$ is the slope of the line

• Regression coefficient $\beta_1$
– Measures association between y and x
– Amount by which y changes on average when x changes by one unit
– Estimated by the least squares method
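
As a minimal sketch of the least squares fit, the Table 1 pairs can be fitted with the closed-form estimates for a single predictor. The function name `least_squares` and the variable names are illustrative only; the data are read from Table 1 one Age/SBP column pair at a time.

```python
# Age (years) and SBP (mm Hg) from Table 1, read one Age/SBP column pair at a time.
ages = [22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
        41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
        52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81]
sbps = [131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
        139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
        128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217]


def least_squares(xs, ys):
    """Return (alpha, beta1) that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope: average change in y per unit of x
    alpha = mean_y - beta1 * mean_x   # intercept: line passes through the means
    return alpha, beta1


alpha, beta1 = least_squares(ages, sbps)
residuals = [y - (alpha + beta1 * x) for x, y in zip(ages, sbps)]
print(f"SBP = {alpha:.1f} + {beta1:.2f} x Age")
```

The slope printed here is the average change in SBP (mm Hg) per additional year of age; with an intercept in the model the residuals sum to zero.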
Multiple linear regression

• Relation between a continuous variable and a set of i continuous variables

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

• Partial regression coefficients $\beta_i$
– Amount by which y changes on average when $x_i$ changes by one unit and all the other $x_i$ remain constant
– Measures association between $x_i$ and y adjusted for all other $x_i$

• Example
– SBP versus age, weight, height, etc
Multiple linear regression

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

Names for y            Names for the x_i
Predicted              Predictor variables
Response variable      Explanatory variables
Outcome variable       Covariables
Dependent              Independent variables
General linear models
• Family of regression models
• Outcome variable determines choice of model

Outcome        Model
Continuous     Linear regression
Counts         Poisson regression
Survival       Cox model
Binomial       Logistic regression

• Uses
– Control of confounding
– Model building, risk prediction
Logistic regression

• Models relationship between set of variables $x_i$
– dichotomous (yes/no)
– categorical (social class, ...)
– continuous (age, ...)

and

– dichotomous (binary) variable Y

• Dichotomous outcome most common situation in biology and epidemiology
Logistic regression (1)

Table 2 Age and signs of coronary heart disease (CD)


How can we analyse these data?

• Compare mean age of diseased and non-diseased
– Non-diseased: 38.6 years
– Diseased: 58.7 years (p<0.0001)

• Linear regression?
Dot-plot: Data from Table 2
Logistic regression (2)

Table 3 Prevalence (%) of signs of CD according to age group


Dot-plot: Data from Table 3
[Figure: % diseased (axis 0 to 100) plotted against age group]
Logistic function (1)
[Figure: probability of disease (axis 0.0 to 1.0) plotted against x]
Logistic transformation

$$\underbrace{\ln\left[\frac{P(y|x)}{1 - P(y|x)}\right]}_{\text{logit of } P(y|x)} = \alpha + \beta x$$
Advantages of Logit

• Properties of a linear regression model
• Logit between $-\infty$ and $+\infty$
• Probability (P) constrained between 0 and 1
• Directly related to notion of odds of disease

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta x \quad\Longleftrightarrow\quad \frac{P}{1-P} = e^{\alpha + \beta x}$$
Interpretation of coefficient $\beta$

$$\frac{P}{1-P} = e^{\alpha + \beta x}$$
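
The link between probability, odds and logit can be illustrated with a short sketch. The function names `logit`, `inverse_logit` and `odds` are illustrative choices, not taken from the slides; the formulas follow from solving ln(P/(1-P)) = z for P.

```python
import math


def odds(p):
    """Odds of disease for a probability p (0 < p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Logistic function: maps any real z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# The logit ranges over the whole real line while P stays between 0 and 1.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = inverse_logit(z)
    print(f"logit = {z:+.1f}  P = {p:.4f}  odds = {odds(p):.4f}")
```

A logit of 0 corresponds to P = 0.5 (odds of 1); large negative or positive logits push P towards 0 or 1 without ever reaching them.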
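Interpretation of coefficient $\beta$

• $\beta$ = increase in the logarithm of the odds for a one unit increase in x (so $e^{\beta}$ is the odds ratio)
• Test of the hypothesis that $\beta = 0$ (Wald test)

$$\chi^2 = \frac{\beta^2}{\text{Variance}(\beta)} \quad (1\ \text{df})$$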

• Interval testing
Example

• Risk of developing coronary heart disease (CD) by age (<55 and 55+ years)
• Logistic Regression Model

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 \times \text{Age} = -0.841 + 2.094 \times \text{Age}$$
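
A brief sketch of how the fitted coefficients can be read. It assumes Age is coded 1 for 55+ years and 0 for <55 years (the coding is implied by the slide but not stated), and the names `ALPHA`, `BETA_AGE` and `probability` are illustrative.

```python
import math

ALPHA = -0.841     # intercept from the fitted model above
BETA_AGE = 2.094   # coefficient for Age (assumed coding: 1 = 55+, 0 = <55)


def probability(age_group):
    """Predicted probability of CD for age_group 0 or 1."""
    z = ALPHA + BETA_AGE * age_group
    return 1.0 / (1.0 + math.exp(-z))


odds_ratio = math.exp(BETA_AGE)   # odds of CD in 55+ relative to <55
p_young = probability(0)
p_old = probability(1)

print(f"OR (55+ vs <55) = {odds_ratio:.1f}")
print(f"P(CD | <55) = {p_young:.2f}, P(CD | 55+) = {p_old:.2f}")
```

Under that coding the odds ratio is about 8.1, and the predicted probabilities are about 0.30 and 0.78 for the younger and older groups.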
Fitting equation to the data

• Linear regression: Least squares
• Logistic regression: Maximum likelihood
• Likelihood function
– Estimates parameters $\alpha$ and $\beta$ with property that likelihood (probability) of observed data is higher than for any other values
– Practically easier to work with log-likelihood

$$L(\beta) = \ln\left[l(\beta)\right] = \sum_{i=1}^{n} \left\{ y_i \ln\left[\pi(x_i)\right] + (1 - y_i) \ln\left[1 - \pi(x_i)\right] \right\}$$

where $\pi(x_i)$ is the modelled probability that $y_i = 1$ given $x_i$
Maximum likelihood

• Iterative computing
– Choice of an arbitrary value for the coefficients (usually 0)
– Computing of log-likelihood
– Variation of coefficients’ values
– Reiteration until maximisation (plateau)

• Results
– Maximum Likelihood Estimates (MLE) for $\alpha$ and $\beta$
– Estimates of P(y) for a given value of x
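
The iterative steps above can be sketched for a model with one predictor. This is only an illustration: the slides do not say which update rule is used, so Newton-Raphson is assumed here, and the data are a hypothetical toy set (3 of 10 unexposed and 7 of 10 exposed are diseased), not data from the slides.

```python
import math

# Hypothetical toy data, NOT from the slides: x = exposure (0/1), y = disease (0/1).
xs = [0] * 10 + [1] * 10
ys = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3


def prob(alpha, beta, x):
    """Modelled probability of disease for a given x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def log_likelihood(alpha, beta, xs, ys):
    """Sum of y*ln(pi) + (1 - y)*ln(1 - pi) over all observations."""
    return sum(
        y * math.log(prob(alpha, beta, x))
        + (1 - y) * math.log(1.0 - prob(alpha, beta, x))
        for x, y in zip(xs, ys)
    )


def fit(xs, ys, max_iter=50, tol=1e-12):
    alpha, beta = 0.0, 0.0                      # arbitrary starting values
    ll = log_likelihood(alpha, beta, xs, ys)
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(alpha, beta, x)
            w = p * (1.0 - p)
            g0 += y - p                         # gradient with respect to alpha
            g1 += (y - p) * x                   # gradient with respect to beta
            h00 += w                            # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton-Raphson step: coefficients += inverse(information) * gradient
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
        new_ll = log_likelihood(alpha, beta, xs, ys)
        converged = abs(new_ll - ll) < tol      # plateau reached
        ll = new_ll
        if converged:
            break
    return alpha, beta, ll


alpha_hat, beta_hat, max_ll = fit(xs, ys)
print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}, log-likelihood = {max_ll:.4f}")
```

For a single dichotomous predictor the estimates can be checked by hand: alpha is the log-odds in the unexposed group and beta is the log of the odds ratio.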
Multiple logistic regression

• More than one independent variable
– Dichotomous, ordinal, nominal, continuous …

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

• Interpretation of $\beta_i$
– Increase in log-odds for a one unit increase in $x_i$ with all the other $x_i$ constant
– Measures association between $x_i$ and log-odds adjusted for all other $x_i$
Effect modification

• Effect modification
– Can be modelled by including interaction terms

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2$$
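
To see why the product term captures effect modification, the model can be written out separately for each level of $x_2$. The sketch below assumes $x_2$ is dichotomous (coded 0/1), which the slide does not state.

```latex
\begin{align*}
x_2 = 0:&\quad \ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 \\
x_2 = 1:&\quad \ln\left(\frac{P}{1-P}\right) = (\alpha + \beta_2) + (\beta_1 + \beta_3)\, x_1 \\
\text{OR for a one unit increase in } x_1:&\quad e^{\beta_1} \text{ when } x_2 = 0,
  \qquad e^{\beta_1 + \beta_3} \text{ when } x_2 = 1
\end{align*}
```

If $\beta_3 = 0$ the two odds ratios coincide, so the effect of $x_1$ does not depend on $x_2$.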
Statistical testing

• Question
– Does model including given independent variable
provide more information about dependent variable than
model without this variable?
• Three tests
– Likelihood ratio statistic (LRS)
– Wald test
– Score test
Likelihood ratio statistic

• Compares two nested models

$$\log(\text{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 \quad \text{(model 1)}$$
$$\log(\text{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 \quad \text{(model 2)}$$

• LR statistic
-2 log (likelihood model 2 / likelihood model 1) =
-2 log (likelihood model 2) minus -2 log (likelihood model 1)

LR statistic is a $\chi^2$ with DF = number of extra parameters in the larger model
Example
P    Probability for cardiac arrest
Exc  1 = lack of exercise, 0 = exercise
Smk  1 = smokers, 0 = non-smokers

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1\,\text{Exc} + \beta_2\,\text{Smk} = 0.7102 + 1.0047\,\text{Exc} + 0.7005\,\text{Smk}$$

(SE 0.2614 for Exc, SE 0.2664 for Smk)

adapted from Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998
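
The coefficients of this example can be turned into odds ratios, as a rough sketch. The 95% interval uses the usual normal approximation exp(beta ± 1.96 × SE), which is an assumption added here (the slides only mention interval testing), and the helper name `odds_ratio_with_ci` is illustrative.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a coefficient and its standard error.

    The interval is the normal-approximation (Wald) interval on the log scale.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Coefficients and standard errors from the cardiac arrest example above.
or_exc, lo_exc, hi_exc = odds_ratio_with_ci(1.0047, 0.2614)
or_smk, lo_smk, hi_smk = odds_ratio_with_ci(0.7005, 0.2664)

print(f"Lack of exercise: OR = {or_exc:.2f} (95% CI {lo_exc:.2f} to {hi_exc:.2f})")
print(f"Smoking:          OR = {or_smk:.2f} (95% CI {lo_smk:.2f} to {hi_smk:.2f})")
```

Each odds ratio is adjusted for the other variable in the model: about 2.73 for lack of exercise and about 2.01 for smoking.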


• Interaction between smoking and exercise?

$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1\,\text{Exc} + \beta_2\,\text{Smk} + \beta_3\,\text{Smk} \times \text{Exc}$$

• Product term $\beta_3$ = -0.4604 (SE 0.5332)

Wald test = 0.75 (1df)

-2log(L) = 342.092 with interaction term
         = 342.836 without interaction term

⇒ LR statistic = 0.74 (1df), p = 0.39
⇒ No evidence of any interaction
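
The two test statistics quoted above can be reproduced from the reported numbers. This is a small sketch using only the standard library; the helper `chi2_1df_p_value` is an illustrative name and relies on the identity P(chi-square with 1 df > s) = erfc(sqrt(s / 2)).

```python
import math


def chi2_1df_p_value(statistic):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Wald test for the product term: beta squared over its variance (SE squared).
beta3, se3 = -0.4604, 0.5332
wald = (beta3 / se3) ** 2

# Likelihood ratio statistic: difference in -2 log(L) between the nested models.
lr = 342.836 - 342.092

p_wald = chi2_1df_p_value(wald)
p_lr = chi2_1df_p_value(lr)

print(f"Wald = {wald:.2f}, p = {p_wald:.2f}")
print(f"LR   = {lr:.2f}, p = {p_lr:.2f}")
```

Both statistics are close to 0.75 with p around 0.39, which matches the values quoted for this example.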
Coding of variables (1)

• Dichotomous variables: yes = 1, no = 0
• Continuous variables
– Increase in OR for a one unit change in exposure variable
– Logistic model is multiplicative ⇒ OR increases exponentially with x
» If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = $2^3$ = 8
– Verify that OR increases exponentially with x. When in doubt, treat as qualitative variable
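
The multiplicative behaviour can be checked numerically with a short sketch. The function name `odds_ratio_for_change` is illustrative; it simply raises the per-unit OR to the power of the change in x.

```python
import math


def odds_ratio_for_change(or_per_unit, x_from, x_to):
    """OR for moving x from x_from to x_to when each unit multiplies the odds."""
    beta = math.log(or_per_unit)            # coefficient on the log-odds scale
    return math.exp(beta * (x_to - x_from))


# Example from the slide: OR = 2 per unit, x increases from 2 to 5.
print(odds_ratio_for_change(2.0, 2, 5))
```

A three unit increase therefore multiplies the odds by 2 × 2 × 2, whatever the starting value of x.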
Continuous variable?
• Relationship between SBP>160 mmHg and body weight (BW)
• Introduce BW as continuous variable?
– Code weight as single variable, e.g. 3 equal classes: 40-60 kg = 0, 60-80 kg = 1, 80-100 kg = 2
– Check compatibility with assumption of multiplicative model
– If not compatible, use indicator variables
Coding of variables (2)

• Nominal variables or ordinal with unequal classes:
– Tobacco smoked: no=0, grey=1, brown=2, blond=3
– Model assumes that OR for blond tobacco = (OR for grey tobacco)$^3$
– Use indicator variables (dummy variables)
Indicator variables: Type of tobacco

• Neutralises artificial hierarchy between classes in the variable "type of tobacco"
• No assumptions made
• 3 variables (3 df) in model using same reference
• OR for each type of tobacco adjusted for the others in
reference to non-smoking
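
One possible way to build the indicator variables is sketched below. The function `indicator_variables` and the `is_...` variable names are hypothetical, chosen only for illustration; non-smoking is used as the reference class, as on the slide.

```python
TOBACCO_LEVELS = ["no", "grey", "brown", "blond"]


def indicator_variables(value, levels, reference):
    """Return one 0/1 indicator per non-reference level.

    The reference level is represented by all indicators being 0.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {
        f"is_{level}": int(value == level)
        for level in levels
        if level != reference
    }


for tobacco in TOBACCO_LEVELS:
    print(tobacco, indicator_variables(tobacco, TOBACCO_LEVELS, reference="no"))
```

Each of the three indicators gets its own coefficient, so the OR for each type of tobacco is estimated against non-smoking without imposing any ordering between the types.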
Low Birth Weight Study

• 189 observations
• Low Birth Weight                               LBW
  yes = birth weight < 2500g
  no = birth weight >2499g
• Age of mother in years                         Age
• Weight of mother in pounds                     Weight
• Race (1,2,3)                                   Race
• Number of doctor’s visits in last trimester    Visits
Risk of death from bacterial meningitis
according to treatment
• 161 observations
• Death (yes, no)
• Treatment
– 1=Chloramphenicol, 2=Ampicillin
• Delay before treatment (onset, in days)
• Convulsions (1,0)
• Level of consciousness (1-3)
• Severity of dehydration (1-3)
• Age in years
• Pathogen
– 1 Others, 2 HiB, 3 Streptococcus pneumoniae
Reference

• Hosmer DW, Lemeshow S. Applied logistic regression. Wiley & Sons, New York, 1989
