Lecture 4&5-Categorical Data Analysis

The document is a lecture on the analysis of categorical variables for MPH students, covering topics such as chi-square statistics, contingency tables, and statistical modeling techniques like binary and multinomial logistic regression. It discusses the categorization of continuous variables, advantages and disadvantages of such methods, and the use of contingency tables for analyzing relationships between categorical variables. Additionally, it provides formulas for calculating risk differences and risk ratios in epidemiological studies.


Yekatit 12 Hospital Medical College

School of Public Health


Department of Epidemiology and Biostatistics
Chapter 4&5: Analysis of Categorical Variables
For MPH Students
By
Dube Jara (PhD)
Assistant Professor of Epidemiology
Email: [email protected]

May 2025
Addis Ababa, Ethiopia
Contents
 Analysis of Categorical Data
 Chi square statistics
 R*C contingency tables
 Measures of association
 Statistical Modeling
 Introduction
 Binary Logistic regression
 Multinomial logistic regression
 Poisson regression
Categorical variables
 Variables that are measured using
nominal and ordinal scales.
A variable with only two levels (values) is called dichotomous (e.g., sex).
A variable with more than two levels (values) is called polytomous (e.g., blood pressure categories).
 Continuous (numeric) variables can be
changed to categorical variables.

Categorizing continuous variables
 Sometimes we require less detail than numerical data afford, so we occasionally transform them into categorical form.
 This is achieved by dividing the range of values that the numeric variable takes into intervals.
 Consider the variable "age". Age might be transformed into a new variable, say "age group", which consists of categories.
Continued, .....
 Categorizing continuous variables is common
practice in Epidemiology for both descriptive and
multivariate modelling.
 It is used for both outcome and explanatory variables
 Categorizing a numerical variable generally results
in an ordinal categorical variable
 An associated issue is, therefore, methods for
trend tests based on the ordinal variable

Why categorization?
 Advantages
 Easier to perform the analysis and easy to implement
and interpret results.
 Easier adjustment for confounding
 Avoids making assumptions about the shape of the dose-response curve relating exposure level to risk
 Avoids problems with influential observations (extreme measurements), which are an issue when dealing with numeric variables.
 Disadvantages
 Reduction of efficiency/power. The loss of power is greatest when the dose-response is non-linear
 Bias
 Cut-points may be biologically implausible
Contingency table
 When working with categorical variables, we often
arrange the counts in a tabular format called
contingency tables.
 If a contingency table involves two dichotomous variables, it is a 2x2 (two-way) table.
 A 2x2 contingency table is a table composed of two rows cross-classified by two columns.
 It can be generalized to an rxc contingency table (r rows and c columns).
Contingency table...
 2x2 tables can arise from each of several study designs in which two factors of interest define the four cells.
 The designs include:
• Cross-sectional,
• Prospective,
• Case-control or
• Matched pairs
 We will call the factor levels “exposed” (E) and
“unexposed” (NE) and “diseased” (D) and “not-
diseased” (ND).

Tests of associations for 2x2 tables
 If we have the frequencies in the four cells of a 2x2 table, a, b, c and d, then the appropriate test statistic is the chi-squared test, defined by:

   χ² = Σ(i=1..r) Σ(j=1..c) (Oij − Eij)² / Eij

 where Oij and Eij denote the observed and expected frequencies, i indicates the row number and j the column number.
 The chi-squared test measures the disparity between observed frequencies (data from the sample) and expected frequencies.
Assumptions of the χ²-test

· No observed frequency is zero
· No more than 20% of the expected frequencies should be <5.
· The guideline is that 80% of the cells in the table should have expected frequencies greater than 5, and all cells should have expected frequencies greater than 1.
· Observations should be independent of each other
· The sample should be large (n > 40)
· The test is based on a 'large sample' approximation.
The 2 × 2 frequency table – comparison of two proportions
 We are often interested in the exposure-disease relationship, as shown in the following table.

                      Disease
                      Yes      No       Total
 Exposure   Yes       n11      n12      n1 = n11 + n12
            No        n21      n22      n2 = n21 + n22
 Total                m1 = n11 + n21    m2 = n12 + n22    n = n1 + n2 = m1 + m2
Cont…
 For a 2 × 2 table the formula for the uncorrected χ² can be simplified:

   χ² = n(n11·n22 − n12·n21)² / (n1·n2·m1·m2) = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]

 For small sample sizes we have to use a continuity correction known as Yates' correction:

   χ²c = Σ(i=1..2) Σ(j=1..2) (|Oij − Eij| − 1/2)² / Eij
       = n(|n11·n22 − n12·n21| − n/2)² / (n1·n2·m1·m2)
       = n(|ad − bc| − n/2)² / [(a + b)(a + c)(b + d)(c + d)]
Expected Value
 The expected value is the product of the row total and the column total, divided by the grand total:

   Eij = (row i total × column j total) / n

 The expected numbers must be computed for each cell.
2 × 2 contingency table

Example: Consider the following sub-sampled data from an Infant Survival study, looking at differences in the proportion of low birth weight babies between urban and rural resident births.

 Residence    Low birth weight       Total
              Yes        No
 Rural        1732       222         1954
 Urban        1627       61          1688
 Total        3359       283         3642
SPSS output – Chi-squared test

                             Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                             (2-sided)     (2-sided)    (1-sided)
 Pearson Chi-Square          75.853(b)  1    0.000
 Continuity Correction(a)    74.776     1    0.000
 Likelihood Ratio            81.151     1    0.000
 Fisher's Exact Test                                        0.000        0.000
 Linear-by-Linear Assoc.     75.832     1    0.000
 N of Valid Cases            3642

 (a) Computed only for a 2x2 table
 (b) 0 cells (.0%) have expected count less than 5. The minimum expected count is 53.69.
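As a cross-check, the Pearson and continuity-corrected statistics in the output above can be reproduced from the 2x2 counts with a few lines of Python (an illustrative sketch; the lecture's own analysis uses SPSS):

```python
# Infant Survival data: rows = rural/urban, columns = LBW yes/no
obs = [[1732, 222], [1627, 61]]

row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]
n = sum(row_tot)

# Expected count for each cell: row total x column total / grand total
exp_ = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]

# Pearson chi-square: sum of (O - E)^2 / E over all cells
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))

# Yates continuity correction: subtract 1/2 from each |O - E|
chi2_yates = sum((abs(obs[i][j] - exp_[i][j]) - 0.5) ** 2 / exp_[i][j]
                 for i in range(2) for j in range(2))

print(round(chi2, 3), round(chi2_yates, 3))  # close to the SPSS values 75.853 and 74.776
```

Both values agree with the SPSS table up to rounding, and all four expected counts comfortably exceed 5, so the χ² assumptions hold.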
R x C Contingency Tables
 An RxC contingency table is a table with R rows
and C columns
 It displays the relationship between two variables,
where the variable in the rows has R categories
and the variable in the columns has C categories
 If both R and C are > 2, use an RxC contingency table.
 Continuity corrections are not used for contingency tables larger than 2x2.
 Degrees of freedom = (R − 1)(C − 1)

   χ² = Σ(i=1..R) Σ(j=1..C) (Oij − Eij)² / Eij
Chi-Square Test for Trend
 Suppose in a 2xC table that the column variable is
ordinal
 A question of interest will be whether there is a
trend in the proportions falling into the first or
second row across levels of the column variable.
 Examines whether there is a dose-response
relationship or not
 Works with df = 1
 Not used for nominal scale since there is no
specific ordering among the categories.

Consider the following data on SBP and BMI status

                Systolic Blood Pressure
                < 140 mmHg        140+ mmHg        Total
 BMI            n      %          n      %
 < 18           85     87.63      12     12.37      97
 18-24          172    86.43      27     13.57      199
 25+            32     80.00      8      20.00      40
 Total          289    86.01      47     13.99      336
SPSS Output

Chi-Square Tests
                                Value    df   Asymp. Sig. (2-sided)
 Pearson Chi-Square             1.442a   2    0.489
 Likelihood Ratio               1.331    2    0.514
 Linear-by-Linear Association   1.027    1    0.311
 N of Valid Cases               336
Odds and Logits
 Both of these terms describe how often something happens relative to its opposite happening, such as winning or losing, or dying or surviving.
 Thus, they deal with a special case of nominal measurement scales: dichotomous (binary) outcome measures.
Logits
 The logits are simply the natural log of the odds [ln(odds) or loge(odds)].
 Note that the two logits are always symmetrical (they sum to zero) and they range from minus infinity to plus infinity.
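A quick numeric illustration of these definitions (the probability 0.12 is hypothetical here, chosen because it matches the low-birth-weight proportion used later in the lecture):

```python
import math

p = 0.12                                   # probability of the event
odds = p / (1 - p)                         # odds in favor of the event
logit = math.log(odds)                     # logit = ln(odds), negative when p < 0.5
logit_complement = math.log((1 - p) / p)   # logit of the opposite outcome

# The two logits are symmetrical: they sum to zero
print(round(logit + logit_complement, 12))
```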
Measure of effects for Categorical
Data
 We would like to compare the frequency of
disease between exposed and unexposed
subjects.
We compare:
 Incidence rates in prospective studies, or

 Prevalence rates in cross-sectional studies

between exposed and unexposed individuals


Let
p1 = probability of developing disease for exposed individuals
p2 = probability of developing disease for unexposed individuals
The risk difference is defined as p1 - p2.
The risk ratio or relative risk is defined as p1/p2
1. The risk difference

 Suppose that p̂1 and p̂2 are the sample proportions with disease for exposed and unexposed subjects, based on two independent samples of size n1 and n2, respectively.
 Then a 100%×(1−α) confidence interval for the risk difference is given by:

   (p̂1 − p̂2) ± z(1−α/2) × √( p̂1q̂1/n1 + p̂2q̂2/n2 )
Cont…
 Example: A study was conducted to look at the effect of oral contraceptives (OC) on myocardial infarction (MI) in women 40-44 years of age over 3 years. The following data were obtained from the study. Compute a 95% CI for the risk difference in MI between users and non-users of OC.

 OC-use group     MI status over 3 years     Total            p̂
                  Yes        No
 OC-users         13         4,987           5,000 = n1       0.0026
 Non-OC-users     7          9,993           10,000 = n2      0.0007
 Total            20         14,980          15,000           0.0013
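The 95% CI for the risk difference in this example can be computed directly from the formula above (a sketch in Python rather than the lecture's SPSS/Stata workflow):

```python
import math

p1, n1 = 13 / 5000, 5000     # OC users
p2, n2 = 7 / 10000, 10000    # non-users
rd = p1 - p2                 # risk difference = 0.0019

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = rd - 1.96 * se, rd + 1.96 * se

print(round(lo, 4), round(hi, 4))  # roughly (0.0004, 0.0034)
```

Since the interval excludes 0, the excess risk among OC users is statistically significant at the 5% level.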
2. The risk ratio (Relative risk)

 A point estimate of the risk ratio (RR = p1/p2) is given by:


RR  pˆ pˆ
1 2

 To obtain an interval estimate, we will assume that the normal


approximation to the binomial distribution is valid.
 Under this assumption, it can be shown that the sampling

RR
distribution of ln( ) more closely follows a normal distribution
n12
than itself and its standard error is given by: n 22
se[(ln( RR )]  
n11 n1 n 21n 2
Dube Jara(PhD)
Cont…
 Therefore, a two-sided 100%×(1−α) CI for ln(RR) is given by:

   ln(R̂R) ± z(1−α/2) × √( n12/(n11·n1) + n22/(n21·n2) )

 The anti-log of each end of the interval above then provides a two-sided 100%×(1−α) CI for RR itself, given by:

   ( e^[ln(R̂R) − z(1−α/2)·se],  e^[ln(R̂R) + z(1−α/2)·se] ),  where se = √( n12/(n11·n1) + n22/(n21·n2) )
Cont…
Consider the previous example on the risk of MI among OC users compared with non-OC users.
 p̂1 = 13/5000 = 0.0026, n1 = 5000 and
 p̂2 = 7/10000 = 0.0007, n2 = 10000.
 Thus the point estimate of RR is 0.0026/0.0007 = 3.71.
 To compute a 95% CI, we have n11 = 13, n12 = 4,987, n21 = 7, and n22 = 9,993.

   ln(3.71) = 1.312 and se[ln(R̂R)] = √( 4,987/(13 × 5,000) + 9,993/(7 × 10,000) ) = 0.4685

 The 95% CI for ln(RR) will be
   1.312 ± 1.96 × 0.4685 = 1.312 ± 0.918 = (0.394, 2.230)
 Therefore, the 95% CI for RR is (e^0.394, e^2.230) = (1.5, 9.3)
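The same numbers fall out of a few lines of Python (illustrative; the results match the hand calculation above up to rounding):

```python
import math

n11, n12, n1 = 13, 4987, 5000     # exposed (OC users)
n21, n22, n2 = 7, 9993, 10000     # unexposed

rr = (n11 / n1) / (n21 / n2)                         # point estimate, about 3.71
se = math.sqrt(n12 / (n11 * n1) + n22 / (n21 * n2))  # se of ln(RR), about 0.4685

lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(lo, 1), round(hi, 1))  # roughly (1.5, 9.3)
```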
3. The Odds Ratio (OR)

 If the probability of success = p, then the odds in favor of


success isodds  p
1 p

 If two populations with proportions p1 and p2 are considered


and the odds in favor of success are computed for each
odds1 p1 q1 p1q2
population, OR  the 
then ratio ofodds or Odds Ratio (OR) is
odds2 p2 q2 p2 q1
defined as:
 ˆ 1qˆ 2
p
OR 
ˆ 2 qˆ1
p

Dube Jara(PhD)
 And is estimated by
Cont…

 From the 2 × 2 table with the four cells labeled by n11, n12, n21

and n22d as above, then

 [n11 /( n11  n21 )] [n22 /( n21  n22 )] n11 n22


OR  
[n21 /( n21  n22 )] [n12 /( n11  n12 )] n12n21

 Example: Using the OC-MI data above, estimate the odds


ratio in favor of MI for OC user compared with a non-user (i.e.
disease odds ratio).

0.0026 0.9993
OR  0.0007 0.9974 3.72
Dube Jara(PhD)
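Equivalently, the disease odds ratio can be computed from the raw cell counts as the cross-product ratio (a quick check in Python):

```python
# OC-MI table cell counts
n11, n12 = 13, 4987    # OC users: MI yes / no
n21, n22 = 7, 9993     # non-users: MI yes / no

or_hat = (n11 * n22) / (n12 * n21)   # cross-product ratio
print(round(or_hat, 2))  # about 3.72
```

Because MI is rare here, the OR (3.72) is very close to the RR (3.71) computed earlier.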
Interval estimates for the Odds Ratio

 Several methods exist for the purpose of estimating CI for the


OR. Of the most popular approaches is the Woolf Method.
Woolf showed that approximately
 1 1 1 1
var[ln( OR)]    
n11 n12 n21 n22

 If we assume approximate normality of ln(OR), then a


100%(1-) CI for ln(OR) is given by:
 1 1 1 1
ln(OR) z1  / 2   
n11 n12 n21 n22

Dube Jara(PhD)
Cont…

 If we take the anti-log of each end of the CI, then it follows


that a 100%(1-) CI for OR is given by


1 1 1 1
OR
ln( ) z1  / 2   
e n11 n1 2 n2 1 n2 2

1 1 1 1
 z1  / 2   
Lower OR e
 n11 n1 2 n2 1 n2 2
,
1 1 1 1
z1  / 2   
Upper OR e
 n11 n1 2 n2 1 n2 2

Dube Jara(PhD)
Cont…

 Example: Suppose all women with at least one birth in a


study of breast cancer are classified as either cases or
controls and with age at first birth as either ≤ 29 or ≥ 30 and
the data is displayed below:
Age at first Status
birth Total
Cases Controls

≥ 30 683 1498 2181


≤ 29 2537 8747 11284
Total 3120 10245 13465
 683 8,747
OR  1.57
1,498 2,537
Dube Jara(PhD)
Cont…

 Compute a 95% CI for the odds ratio relating age at first


birth to breast cancer incidence.
 The SE and a 95% CI for ln(OR) become:

 1 1 1 1
se[ln( OR)]     0.0514
682 2537 1498 8747

Therefore, the 95% CI for ln( OR ) becomes
ln(1.57) 1.96 0.0524 (0.352, 0.553)

 A 95% CI for OR is thus given by (e0.352, e0.553) = (1.42,


1.74)

Dube Jara(PhD)
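The Woolf interval for this example can be verified in a few lines (illustrative sketch):

```python
import math

# Age at first birth vs breast cancer status
n11, n12 = 683, 1498     # age >= 30: cases, controls
n21, n22 = 2537, 8747    # age <= 29: cases, controls

or_hat = (n11 * n22) / (n12 * n21)             # about 1.57
se = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)  # Woolf SE of ln(OR)

lo = math.exp(math.log(or_hat) - 1.96 * se)
hi = math.exp(math.log(or_hat) + 1.96 * se)
print(round(lo, 2), round(hi, 2))  # about (1.42, 1.74)
```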
Logistic Regression
 Outcome variable in many studies is the presence or
absence of some condition, such as:
 Survival status (alive or dead)

 Responding or not to a treatment

 Having a MI or not

 Birth weight status (normal or low)

Logistic Regression…
 Logistic Regression is a type of predictive model
that can be used when the response variable is a
categorical variable.
 Binary logistic regression is a form of regression which
is used when the dependent variable is dichotomous and
the independents are of any type.
 Multinomial logistic regression exists to handle the case of a dependent variable with more than two classes.
 When multiple classes of the dependent variable can be
ranked, then ordinal logistic regression is preferred to
multinomial logistic regression.
Logistic…
 Logistic regression can be used to predict a
dependent variable on the basis of continuous
and/or categorical independents and to:
 Determine the percent of variance in the dependent
variable explained by the independents;
 Rank the relative importance of independents
 Assess interaction effects; and
 Understand the impact of covariate control
variables

Logistic…
 Logistic regression applies maximum likelihood
estimation
 Transform dependent variable into probability (probability of the
occurrence of certain event)

 Transform probability into odds


 Transform odds into a logit variable (the natural logarithm of the
odds of the dependent occurring or not).
 Form the logistic model. The model is based on the log odds of the dependent variable, unlike OLS.
 Instead of assuming that the relationship between p and x is linear, we assume that the relationship between ln[p/(1−p)] and x is linear.
Logistic…

Example: Consider the birth weight data to predict factors associated


with low birth weight.

Let the outcome of interest be a dichotomous variable


Define a dichotomous variable


 By considering age of the mother as predictor variable,


construct a scatter plot of the outcome and predictor
variables.
Scatter plot of birth weight status by age of the mother

[Figure: scatter of low birth weight status (0/1) against maternal age, 25-45 years]
Logistic…

 We can see that this plot is less informative about the relationship between the outcome and the explanatory variables than when the outcome variable is continuous.
 The mean of the dichotomous random variable yi can be calculated as ȳ = Σyi / n.
 But the above quantity is the same as the proportion of low birth weight babies and can be denoted by p̂.


Logistic…

 For the data, 60 of 500 newborns had birth weight < 2500 gm.
 This gives an estimated low birth weight probability of p̂ = 60/500 = 0.12.
 Overall, 12% of the newborns had birth weight < 2500 gm.
Logistic...

 The statistical model that is generally preferred for the analysis of a binary response is the binary logistic regression model, stated in terms of the probability p = P(yi = 1), given the values of the predictors:

   p = 1 / (1 + e^−(β0 + β1X1 + β2X2 + ... + βkXk))
     = e^(β0 + β1X1 + β2X2 + ... + βkXk) / (1 + e^(β0 + β1X1 + β2X2 + ... + βkXk))

 The function above is called the logistic function.
 The function has an unlimited range for the x's, while p is restricted to range from 0 to 1.
The Logistic Function
 Thus, if a success occurs with probability

   p = e^(β0 + β1X1 + ... + βkXk) / (1 + e^(β0 + β1X1 + ... + βkXk))

 then failure will occur with probability

   1 − p = 1 / (1 + e^(β0 + β1X1 + ... + βkXk))
Logistic…
 Consequently, the odds in favor of success will be

   p/(1 − p) = [e^(β0 + β1X1 + ... + βkXk) / (1 + e^(β0 + β1X1 + ... + βkXk))] ÷ [1 / (1 + e^(β0 + β1X1 + ... + βkXk))]
             = e^(β0 + β1X1 + ... + βkXk)

 Taking the natural logarithm of each side of this equation:

   ln[p/(1 − p)] = ln( e^(β0 + β1X1 + ... + βkXk) ) = β0 + β1X1 + β2X2 + ... + βkXk
The fitted equation
 In order to use mother's age to help us predict the likelihood that her newborn is low birth weight, we fit the model:

   ln[p̂/(1 − p̂)] = b0 + b1x

 where b0 and b1 are the respective estimates of the population regression coefficients β0 and β1.
 We cannot apply the method of least squares, which assumes that the response is continuous and normally distributed, to fit the logistic model; instead, we use maximum likelihood estimation.
Example with SPSS

Fit a logistic regression model between low birth weight and age of the

mother:

95.0% C.I. for EXP(B)


Exp(B)
B S.E. Wald df Sig. (OR) Lower Upper
-
0.
0 .035 .307 1 .580 .981 .915 1.051
 pˆ 1
matageln  9 -1.332 - 0.019 X
 1  pˆ -
1.
3 1.196 1.240 1 .265 .264
Constan
From the model, 3the coefficient of age implies that for one year increase in
t 2
the age of the mother, the log odds that the newborn will have low birth
weight decreases by 0.019. When the log odds decreases, the probability p
decreases as well. Dube Jara(PhD)
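Given the fitted coefficients, the estimated probability at any maternal age follows from inverting the logit (a sketch using the coefficients from the SPSS output above):

```python
import math

b0, b1 = -1.332, -0.019   # intercept and age coefficient from the fitted model

def p_hat(age):
    """Estimated probability of low birth weight at a given maternal age."""
    return 1 / (1 + math.exp(-(b0 + b1 * age)))

# Probability declines slowly with age, staying near the overall 12% rate
print(round(p_hat(25), 3), round(p_hat(45), 3))
```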
Example with Stata (OR)

[Stata logistic regression output not reproduced]
Plot of estimated probability of LBW by mother's age

[Figure: estimated probability of low birth weight plotted against maternal age; the curve declines gradually over ages 25-45, staying roughly between 0.10 and 0.15]
Multiple Logistic Regression
We have seen that the age of the mother influences the probability that the infant will have weight < 2500 gm. What will be the fitted equation if gestation weeks is included in the model?

To model the probability p as a function of the two explanatory variables, we fit a model of the form:

   ln[p/(1 − p)] = β0 + β1x1 + β2x2

where x1 denotes age and x2 denotes gestational age of the newborn baby.
Both age and gestational age are in the model

            B        S.E.    Wald     df   Sig.    Exp(B)     95.0% C.I. for EXP(B)
                                                              Lower    Upper
 age        0.009    0.049   0.035    1    0.852   1.009      0.917    1.110
 gestwks    -0.897   0.108   68.535   1    0.000   0.408      0.330    0.504
 constant   31.553   4.338   52.919   1    0.000   5.05E+13
Categorical predictors

The logistic regression model can be generalized to include discrete or nominal variables in addition to continuous ones.

Suppose that in the above model we include, in addition to age and gestational age, maternal hypertension and sex of the baby:

   ln[p̂/(1 − p̂)] = b0 + b1x1 + b2x2 + b3x3 + b4x4

where x1, x2, x3, x4 represent, respectively, age, gestational age, maternal HTN and sex of the baby.
Age, GA, sex & hyp are all in the model

              B        S.E.    Wald     df   Sig.    Exp(B)     95.0% C.I. for EXP(B)
                                                                Lower    Upper
 Age          0.016    0.048   0.104    1    0.747   1.016      0.924    1.117
 gestwks      -0.895   0.111   65.275   1    0.000   0.409      0.329    0.508
 hyp(yes)     0.755    0.431   3.067    1    0.080   2.127      0.914    4.948
 sex(female)  0.445    0.379   1.377    1    0.241   1.561      0.742    3.282
 Constant     30.858   4.419   48.758   1    0.000   2.521E+13
Categorical predictors

 Choose one level of variable to be a baseline (better if it is


the largest group of subjects)

 Create DUMMY (INDICATOR) variables coded 0 or 1 for


each level of variable except baseline

 Include all dummy variables in the model

 Or use the categorical option in SPSS automatically


creates the dummy variables for you

Assumptions
 Logistic regression has less stringent requirements than OLS regression.
 Assumption
 The logit of the dependent variable is linearly related to the continuous independents
 Check using a probability graph
 Observations are independent
 Sample size should be adequate

Assumptions check
 Independence of observations / multicollinearity
 Symptoms of Multi-collinearity
 Imprecise estimates
 Implausible estimates
 Very low significance (possibly with very high R2)
 Big changes in estimates when the sample changes even
slightly
 Assessed through:
 Correlation matrices
 Variance inflation factors (>10)
 Tolerance test (<0.1)

Sample size requirements
 The minimum number of cases per independent
variable is 10, using a guideline provided by Hosmer
and Lemeshow, authors of Applied Logistic
Regression, one of the main resources for Logistic
Regression.

 For preferred case-to-variable ratios, we will


use 20 to 1 for simultaneous and hierarchical
logistic regression and 50 to 1 for stepwise
logistic regression.

Logistic Regression Model building
 The goal of logistic regression is to correctly predict the
category of outcome for individual cases using the most
parsimonious model.
 To accomplish this goal, a model is created that includes all
predictor variables that are useful in predicting the response
variable.

 Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or removed (stepwise selection).
Cont…
 Stepwise regression is used in the exploratory phase of
research but it is not recommended for theory testing
(Menard 1995).

 Theory testing is the testing of a-priori theories or


hypotheses of the relationships between variables.

 Exploratory testing makes no a-priori assumptions


regarding the relationships between the variables, thus the
goal is to discover relationships.
Cont…

 Backward stepwise regression appears to be the preferred


method of exploratory analyses, where the analysis begins
with a full or saturated model and variables are eliminated
from the model in an iterative process.

 The fit of the model is tested after the elimination of each


variable to ensure that the model still adequately fits the
data.
 When no more variables can be eliminated from the model, the
analysis has been completed.

Overall test of relationship/Assessing the fit of the Model
 The process by which coefficients are tested for significance for inclusion in or elimination from the model involves several different techniques.
 Each of these will be discussed below.
Classification table
 Shows correct and incorrect classifications of the dichotomous, ordinal, or polychotomous dependent variable.
Wald Test:
 A Wald test is used to test the statistical significance of each coefficient (β) in the model. The Wald test is computed as

   W = ( b / se(b) )²

 W has a chi-square distribution with 1 d.f.
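For instance, applying this to the gestational-age coefficient reported earlier (b = −0.895, se = 0.111) reproduces the tabulated Wald value up to rounding:

```python
# Wald statistic W = (b / se(b))^2 for the gestwks coefficient
b, se = -0.895, 0.111
W = (b / se) ** 2
print(round(W, 1))  # about 65.0; SPSS reports 65.275 from unrounded b and se
```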
Cont…
Likelihood-Ratio Test:
 The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0).
 The likelihood-ratio test statistic equals:

   χ² = −2LL0 − (−2LL1)

 The likelihood-ratio statistic has a chi-squared distribution with d.f. equal to the number of parameters being tested.
 This is the recommended test statistic to use when building a model through backward stepwise elimination.
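A small numeric sketch of the arithmetic (the −2 log-likelihood values here are hypothetical, chosen only for illustration; for 2 degrees of freedom the chi-square upper-tail probability has the simple closed form e^(−x/2)):

```python
import math

neg2LL0 = 1350.4   # hypothetical simpler model
neg2LL1 = 1342.1   # hypothetical full model with 2 extra parameters

G = neg2LL0 - neg2LL1   # likelihood-ratio statistic, df = 2

# For df = 2, the chi-square survival function is exp(-x/2)
p_value = math.exp(-G / 2)
print(round(G, 1), round(p_value, 4))
```

Here the drop of 8.3 in −2LL for 2 extra parameters gives p ≈ 0.016, so the fuller model fits significantly better.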
Extensions: Multinomial logit
 Multinomial logistic regression can be used when we have more than two categories of the dependent variable and when they cannot be ordered in a meaningful way.
 One category is chosen as the "comparison category", and the beta coefficient (b) represents the change in the odds of being in dependent-variable category i relative to the comparison category.
Multinomial Logistic Regression:
Assumptions
 Data are case specific; that is, each independent
variable has a single value for each case.
 The dependent variable cannot be perfectly predicted from the independent variables for any case.
 Collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impacts of several variables if they are highly correlated.
 No assumptions of normality, linearity and independence are required.
Approach to model fitting
• In multinomial regression we in principle fit several binary logistic models, depending on the category of the dependent variable
  – The last category is set as the reference group
  – For example, if we have 3 groups of the dependent variable, there are two sets of model outputs, as the third is set as the reference.
  – Thus, each independent variable appears twice, that is, once for each of the two groups
• To compare models, use the change in the likelihood-ratio test. A smaller LR statistic indicates a better model.
Approach to model fitting….
• Multinomial logistic regression computes correlation measures to estimate the strength of the relationship (using pseudo R-square measures, such as Nagelkerke's R², depending on the software used)
  – Correlations may not tell much about the accuracy or errors associated with the model
  – A more useful measure is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership
Multinomial logit…
 In the multinomial logit model, you estimate a set of coefficients, β(1), β(2), and β(3), one corresponding to each outcome category.
Example with stata
 Data on the type of health insurance with 616
psychologically depressed subjects
 The insurance is categorized as either an
 Indemnity plan (regular fee-for-service insurance,
which may have a deductible or coinsurance rate) or
 A prepaid plan (a fixed up-front payment allowing
subsequent unlimited use as provided)
 No insurance whatsoever
 One of the demographic factors in this data is the
race of the participant, coded as white or
nonwhite:

Practice
 tabulate insure nonwhite, chi2 col
 mlogit insure nonwhite
 mlogit insure nonwhite, base(2)
 mlogit insure nonwhite, base(2) rrr

Assignment
 Logistic Regression Models
Building steps

Count data Analysis

Poisson Regression Model
 Count data are very common in many
applications. Examples include:
 Number of patients visiting a certain hospital per day,
 CD4 counts,
 Number of live births in a given district per year, etc.
 Count data are commonly analyzed using
Poisson regression model.

Poisson Regression Model…
 Dependent variable is event count, usually of rather
rare events e.g. number of cases of cancer over a
defined period in a cohort of subjects
 Non negative integer
 Count variables can be modeled with linear
regression (OLS) but:
 Linear models can yield negative predicted values…
whereas counts are never negative
 Count variables are often highly skewed- violating OLS
assumptions

Poisson Regression Model…
 Analyzing large cohorts with hazard regression
models that have many time-dependent covariates
can require substantial computation time

 In this situation, converting the data to events per


person–year of observation can greatly simplify the
analysis

 If the event rate per unit of time is low, then an


excellent approach to consider is Poisson
regression

Poisson Regression Model…

The Poisson model assumes that yi, given the vector of covariates xi, is independently Poisson-distributed:

   P(Yi = yi) = e^(−λi) · λi^yi / yi!,   yi = 0, 1, 2, ...

 The mean is given by µ = λ and the variance var(Y) = λ.

Suppose Y1, . . . , YN is a set of independent count outcomes, and let x1, . . . , xN represent the corresponding p-dimensional vectors of covariate values.
Poisson Regression Model…
 In Poisson Regression, Y is typically
conceptualized as a rate…
 Positive coefficients indicate higher rate; negative =
lower rate

 Like logit, Poisson models are non-linear


 Coefficients don’t have a simple linear interpretation
 Like logit, the model has a log form; exponentiating a coefficient aids its interpretation as an incidence rate ratio: IRR = e^β
Poisson Regression: Exposure Time
 Poisson outcome variables are typically
conceptualized as rates
 Number of event/failure/outcome during a follow

up time
 Cases may vary in exposure to “risk” of a given

outcome
 To properly model rates, we must account for the

fact that some cases have greater exposure


than others.
 Some cases followed for longer time than

others…

Poisson Regression Assumptions
 Poisson regression makes a big assumption: that the variance equals λ ("equidispersion")
 The mean and variance are the same
 This assumption is often not met in real data
 Dispersion is often greater than λ: overdispersion
 Consequence of overdispersion: standard errors will be underestimated
 Potential for overconfidence in results; rejecting H0 when you shouldn't!
 Overdispersion is most often caused by highly skewed dependent variables
 Often due to variables with high numbers of zeros
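A quick equidispersion check on hypothetical count data (the counts below are made up purely for illustration; with many zeros and a few large values, the variance clearly exceeds the mean):

```python
from statistics import mean, pvariance

# Hypothetical death counts across 15 districts
counts = [0, 0, 1, 0, 2, 0, 0, 3, 1, 0, 0, 5, 0, 1, 0]

m = mean(counts)
v = pvariance(counts)   # population variance

# Under a Poisson model we expect v close to m; here v >> m suggests overdispersion
print(round(m, 2), round(v, 2), v > m)
```

In practice a negative binomial model (or robust standard errors) is the usual remedy when such overdispersion is found.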
Infant Data
 Outcome: total child deaths ('deaths').
 Factor 1: place of residence
   (1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
 Factor 2: monthly family income, 'faminc', in birr (continuous).
 Model: log(λ) = ξ0 + ξ1·SU + ξ2·R + ξ3·IN
 'IN' refers to monthly family income; 'SU' and 'R' are indicator (dummy) variables for semi-urban and rural residence.
 Code: poisson deaths i.place faminc
       poisson deaths ib(first).place faminc
log(λ) = −0.745 + 0.036SU + 0.230R − 0.001IN.

The option 'irr' displays the incidence rate ratio estimates:

   poisson deaths i.place faminc, irr
Results
 exp(ξi) measures the multiplicative change in the expected count, i.e., the incidence rate ratio.
 exp(ξ2) = exp(0.230) ≈ 1.26 implies that the incidence rate of child death in rural areas is about 26% higher than in urban areas.
 exp(ξ3) = exp(−0.0013) ≈ 0.9987 implies that as family income increases by one birr, the incidence rate of child death decreases by nearly 0.13%.
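The IRRs reported by Stata's 'irr' option are just the exponentiated coefficients, which can be checked directly (a sketch; the income coefficient −0.0013 is taken from the results slide rather than the rounded −0.001 in the fitted equation):

```python
import math

# Coefficients from log(lambda) = -0.745 + 0.036*SU + 0.230*R - 0.0013*IN
irr_semiurban = math.exp(0.036)    # about 1.04
irr_rural = math.exp(0.230)        # about 1.26: rural rate ~26% higher than urban
irr_income = math.exp(-0.0013)     # about 0.9987: ~0.13% lower rate per extra birr

print(round(irr_semiurban, 2), round(irr_rural, 2), round(irr_income, 4))
```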
