Lecture 4&5-Categorical Data Analysis
Lecture 4&5-Categorical Data Analysis
May, 2025
Addis Ababa,
Ethiopia Dube Jara(PhD)
Contents
Analysis of Categorical Data
Chi square statistics
R*C contingency tables
Measures of association
Statistical Modeling
Introduction
Binary Logistic regression
Multinomial logistic regression
Poisson regression
Dube Jara(PhD)
Categorical variables
Variables that are measured using
nominal and ordinal scales.
The variable may have only two levels
(values) called a dichotomous (e.g. Sex).
The variable may has more than two levels
(values) called polytomous (e.g. blood
pressure).
Continuous (numeric) variables can be
changed to categorical variables.
Dube Jara(PhD)
Categorizing continuous variables
Sometime, we might require lesser
degree detail than afforded by numerical
data, hence we occasionally transform
into categorical.
This is achieved by dividing the range of values,
which the numeric value takes into intervals.
Dube Jara(PhD)
Continued, .....
Categorizing continuous variables is common
practice in Epidemiology for both descriptive and
multivariate modelling.
It is used for both outcome and exploratory
variables
Categorizing a numerical variable generally results
in an ordinal categorical variable
An associated issue is, therefore, methods for
trend tests based on the ordinal variable
Dube Jara(PhD)
Why categorization?
Advantages
Easier to perform the analysis and easy to implement
and interpret results.
Easy when adjusting confounding
Avoid to make assumptions about the shape of the
dose-response curve relating to exposure level to risk
Avoid problems of influential observations(extreme
measurements) which are an issue when dealing with
numeric variables.
Disadvantages
Reduction of efficacy/power. The loss of power is
greatest when the dose-repose is non-leaner
Bias
Biologically implausible
Dube Jara(PhD)
Contingency table
When working with categorical variables, we often
arrange the counts in a tabular format called
contingency tables.
If a contingency table involves two dichotomous
variables when it is a 2x2 (two way table).
A 2x2 contingency table is a table composed of two
rows cross-classified by two columns
It can be generalized to accommodate into rxc
contingency table (r-rows and c-columns).
Dube Jara(PhD)
Contingency table...
2x2 tables can arise from each of the designs
where there are two factors of interest in each four
cells.
The designs include:
• Cross-sectional,
• Prospective,
• Case-control or
• Matched pairs
We will call the factor levels “exposed” (E) and
“unexposed” (NE) and “diseased” (D) and “not-
diseased” (ND).
Dube Jara(PhD)
Tests of associations for 2x2
tables
If we have the frequencies in the four cells of 2x2
table a, b, c and d then the table then, the
appropriate test statistic is the chi-squared test
defined by: r c (O E ) 2
2 ij ij
i 1 j 1 Eij
Disease
Yes No Total
Exposur Yes n11 n12 n1 n11 n12
e No n21 n22 n2 n21 n22
Total m1 n11 n21 m2 n12 n21 n n1 n2 m1 m2
Dube Jara(PhD)
Cont…
For a 2 × 2 table the formula for the uncorrected
X2 can be simplified.
2
n ( ad bc )
2n(n11 n22 n12n21 ) 2 2
n1n2 m1m2 (a c) x(b d ) x(c d ) x(a b)
For small sample sizes we have to use a
n( n11 n22 n12 n21 n2 ) 2
i 1 j 1 Eij n1n2 m1m2
n( ad bc n )2
c2 2
( a b)( a c )(b d )( c d )
Dube Jara(PhD)
Expected Value
Is the product of the row total multiplied
by the column total, divided by the grand
total
Dube Jara(PhD)
2 × 2 contingency table
Dube Jara(PhD)
SPSS output – Chi-squared test
Linear-by-Linear
75.832 0.000
Association 1
N of Valid Cases 3642
(a) Computed only for a 2x2 table
(b) 0 cells (.0%) have expected count less than 5. The minimum expected count
is 53.69.
Dube Jara(PhD)
R x C Contingency Tables
An RxC contingency table is a table with R rows
and C columns
It displays the relationship between two variables,
where the variable in the rows has R categories
and the variable in the columns has C categories
If both R and C are >2, use RxC
contingency table
Continuity corrections are not used for contingency
tables larger than 2x2.
Degree of freedom c = (R-1)(C-1) 2
r (Oij Eij )
2
i 1 j 1 Eij
Dube Jara(PhD)
Chi-Square Test for Trend
Suppose in a 2xC table that the column variable is
ordinal
A question of interest will be whether there is a
trend in the proportions falling into the first or
second row across levels of the column variable.
Examines whether there is a dose-response
relationship or not
Works with df = 1
Not used for nominal scale since there is no
specific ordering among the categories.
Dube Jara(PhD)
Consider the following data on SBP and BMI
status
Dube Jara(PhD)
SPSS Output
Chi-Square Test
Asymp. Sig.
Value df (2-sided)
Dube Jara(PhD)
Odds and Logits
All two of these terms describe how often something
happens relative to its opposite happening, such as winning
or losing, or dying or surviving.
pˆ qˆ pˆ qˆ
pˆ pˆ z 1 / 2
1
1 2 2
n n
1 2
1 2
Dube Jara(PhD)
Cont…
Example: A study was conducted to look at the effect of oral
contraceptives (OC) on Myocardial Infarction (MI) in women 40-
44 years of age over 3 years. The following data was obtained
from the study. Compute 95% CI for the risk difference in MI
between users and non-users of OC.
MI status over 3
OC-use group years Total p
Yes No
RR pˆ pˆ
1 2
RR
distribution of ln( ) more closely follows a normal distribution
n12
than itself and its standard error is given by: n 22
se[(ln( RR )]
n11 n1 n 21n 2
Dube Jara(PhD)
Cont…
Therefore, a two-sided 100%×(1-) CI for RR is given by:
n12 n
ln( RR ) Z1-/2 22
n11 n1 n21n2
Dube Jara(PhD)
Cont…
Consider the previous example on the risk of MI among OC
users compared with non-OC users.
p1 =13/5000 = 0.0026, n1 = 5000 and
p2 =7/10000 = 0.0007, n2=10000.
Dube
Therefore, the 95% CI for RR is e 0.394, eJara(PhD)
2.230
= (1.5, 9.3)
3. The Odds Ratio (OR)
Dube Jara(PhD)
And is estimated by
Cont…
From the 2 × 2 table with the four cells labeled by n11, n12, n21
Dube Jara(PhD)
Cont…
1 1 1 1
OR
ln( ) z1 / 2
e n11 n1 2 n2 1 n2 2
1 1 1 1
z1 / 2
Lower OR e
n11 n1 2 n2 1 n2 2
,
1 1 1 1
z1 / 2
Upper OR e
n11 n1 2 n2 1 n2 2
Dube Jara(PhD)
Cont…
1 1 1 1
se[ln( OR)] 0.0514
682 2537 1498 8747
Therefore, the 95% CI for ln( OR ) becomes
ln(1.57) 1.96 0.0524 (0.352, 0.553)
Dube Jara(PhD)
Logistic Regression
Outcome variable in many studies is the presence or
absence of some condition, such as:
Survival status (alive or dead)
Having a MI or not
Dube Jara(PhD)
Logistic Regression…
Logistic Regression is a type of predictive model
that can be used when the response variable is a
categorical variable.
Binary logistic regression is a form of regression which
is used when the dependent variable is dichotomous and
the independents are of any type.
Multinomial logistic regression exists to handle the
case of dependents with more classes than two.
When multiple classes of the dependent variable can be
ranked, then ordinal logistic regression is preferred to
multinomial logistic regression.
Dube Jara(PhD)
Logistic…
Logistic regression can be used to predict a
dependent variable on the basis of continuous
and/or categorical independents and to:
Determine the percent of variance in the dependent
variable explained by the independents;
Rank the relative importance of independents
Assess interaction effects; and
Understand the impact of covariate control
variables
Dube Jara(PhD)
Logistic…
Logistic regression applies maximum likelihood
estimation
Transform dependent variable into probability (probability of the
occurrence of certain event)
25 30 35 40 45
maternal age
Dube Jara(PhD)
Logistic…
We can see that this plot is less informative about the relationship
between the outcome and the explanatory variables than in the case
when the outcome variable is continuous.
But the above quantity is the same as the proportion of low birth
weight babies and can be denoted by p.
p=60/500 =0 . 12
Overall 12% of the newborns had BWt < 2500 gm
Dube Jara(PhD)
Logistic...
0 + 1X1 + 2X 2 + ...+ k X k
e
p 0 + 1X1 + 2 X 2 + ...+ k X k
1 e
Then the failure will occur with probability
1
1 p 0 + 1X1 + 2 X 2 + ...+ k X k
1 e
Dube Jara(PhD)
Logistic…
Consequently, the odds in favor of success will be
p e 0 1 X 1 2 X 2 ... k X k 1 e 0 1 X 1 2 X 2 ... k X k
1 p 1 1 e 0 1 X 1 2 X 2 ... k X k
0 1 X 1 2 X 2 ... k X k
e
p
ln ln e
0 1X1 2 X 2 ... k X k
0 1 X 1 2 X 2 ... k X k
1 p
Dube Jara(PhD)
The fitted equation
In order to use mother’s age to help us predict the likelihood
that her newborn is low birth weight, we fit the model:
p̂
ln b 0 b1x
1 p̂
where b0 and b1 are respective estimates of the population
Fit a logistic regression model between low birth weight and age of the
mother:
Dube Jara(PhD)
Plot of estimated probability of LBW by mothers
age
.15
0.16
0.14
.14
0.12
Estimated probability of LBW
0.1
.13
Pr(lowbw)
0.08
.12
0.06
0.04
.11
0.02
0
.1
10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
25 30 35 40 45
maternal age
Age of mother
Dube Jara(PhD)
Multiple Logistic Regression
We have seen that age of the mother influence the probability that the
infant will have weight < 2500. What will be the fitted equation if
gestation weeks is included in the model?
p
ln β 0 β1x1 β 2 x 2
1 p
baby
Dube Jara(PhD)
Both age and gestational weight are in the
model
95.0% C.I.for
EXP(B)
age
0.009 0.049 0.035 1 0.85 1.00 0.917 1.110
2 9
gestwks
- 0.108 68.53 1 0.00 0.40 0.330 0.504
0.897 5 0 8
p̂
ln b1 b1x1 b 2 x 2 b 3 x 3 b 4 x 4
1 p̂
95.0% C.I.for
EXP(B)
B S.E. Wald df Sig. Exp(B) Lower Upper
Age 0.016 0.048 0.104 1 0.747 1.016 0.924 1.117
gestwks -0.895 0.111 65.275 1 0.000 0.409 0.329 0.508
hyp(yes) 0.755 .431 3.067 1 0.080 2.127 0.914 4.948
sex(female) 0.445 0.379 1.377 1 0.241 1.561 0.742 3.282
Constant 30.858 4.419 48.758 1 0.000 2.521E1
Dube Jara(PhD)
Categorical predictors
Dube Jara(PhD)
Assumptions
Logistic regression has less stringent requirements
unlike OLS regression,
Assumption
The logit of the dependent variables is linearly related
to the continuous independent
Check using probability graph
Observations are independent
Sample size should be adequate
Dube Jara(PhD)
Assumptions check
Independency of observations /multi-collinearity/
Symptoms of Multi-collinearity
Imprecise estimates
Implausible estimates
Very low significance (possibly with very high R2)
Big changes in estimates when the sample changes even
slightly
Assessed through:
Correlation matrices
Variance inflation factors (>10)
Tolerance test (<0.1)
Dube Jara(PhD)
Sample size
requirements
The minimum number of cases per independent
variable is 10, using a guideline provided by Hosmer
and Lemeshow, authors of Applied Logistic
Regression, one of the main resources for Logistic
Regression.
Dube Jara(PhD)
Logistic Regression Model building
The goal of logistic regression is to correctly predict the
category of outcome for individual cases using the most
parsimonious model.
To accomplish this goal, a model is created that includes all
predictor variables that are useful in predicting the response
variable.
Dube Jara(PhD)
Overall test of relationship/Assessing the fit of the
Model
The process by which coefficients are tested for significance for
inclusion or elimination from the model involves several different
techniques.
Each of these will be discussed below.
Classification table
showing correct and incorrect classifications of the
Wald Test:
A Wald test is used to test the statistical significance of each coefficient
() in the model. The
2 Wald test is computed as
b
W
se(b )
Dube Jara(PhD)
Cont…
Likelihood-Ratio Test:
The likelihood-ratio test uses the ratio of the maximized value of the
likelihood function for the full model (L1) over the maximized value
Dube Jara(PhD)
Multinomial Logistic Regression:
Assumptions
Data are case specific; that is, each independent
variable has a single value for each case.
The dependent variable cannot be perfectly
predicted from the independent variables for any
case,
Collinearity is assumed to be relatively low, as it
becomes difficult to differentiate between the
impact of several variables if they are highly
correlated,
No assumption of normality, linearity and
independence
Dube Jara(PhD)
Approach to model
fitting
• In multinomial regression we in principle fit several binary
logistics depending on the category of the dependent variable
– The last category is set as reference group
– For example, if we have 3 groups of dependent variable, two
sets of model outputs, as the third is set as reference.
– Thus, each independent variable appears twice, that is for
each of the two groups
Dube Jara(PhD)
Approach to model
fitting….
• Multinomial logistic regression compute correlation measures
to estimate the strength of the relationship (using pseudo R
square measures, such as Nagelkerke's R² based on type of
software used),
Dube Jara(PhD)
Multinomial logit…
In the multinomial logit model, you
estimate a set of coefficients, β(1), β(2),
and β(3), corresponding
to each outcome:
Dube Jara(PhD)
Example with stata
Data on the type of health insurance with 616
psychologically depressed subjects
The insurance is categorized as either an
Indemnity plan (regular fee-for-service insurance,
which may have a deductible or coinsurance rate) or
A prepaid plan (a fixed up-front payment allowing
subsequent unlimited use as provided)
No insurance whatsoever
One of the demographic factors in this data is the
race of the participant, coded as white or
nonwhite:
Dube Jara(PhD)
Practice
tabulate insure nonwhite, chi2 col
mlogit insure nonwhite
mlogit insure nonwhite, base(2)
mlogit insure nonwhite, base(2) rrr
Dube Jara(PhD)
Dube Jara(PhD)
Dube Jara(PhD)
Dube Jara(PhD)
Assignment
Logistic Regression Models
Building steps
Dube Jara(PhD)
Count data Analysis
Dube Jara(PhD)
Poisson Regression
Model
Count data are very common in many
applications. Examples include:
Number of patients visiting a certain hospital per day,
CD4 counts,
Number of live births in a given district per year, etc.
Count data are commonly analyzed using
Poisson regression model.
Dube Jara(PhD)
Poisson Regression
Model…
Dependent variable is event count, usually of rather
rare events e.g. number of cases of cancer over a
defined period in a cohort of subjects
Non negative integer
Count variables can be modeled with linear
regression (OLS) but:
Linear models can yield negative predicted values…
whereas counts are never negative
Count variables are often highly skewed- violating OLS
assumptions
Dube Jara(PhD)
Poisson Regression
Model…
Analyzing large cohorts with hazard regression
models that have many time-dependent covariates
can require substantial computation time
Dube Jara(PhD)
Poisson Regression Model…
The Poisson model assumes that yi, given the
vector of covariates xi, is independently Poisson-
distributed with
Dube Jara(PhD)
Poisson Regression:
Exposure Time
Poisson outcome variables are typically
conceptualized as rates
Number of event/failure/outcome during a follow
up time
Cases may vary in exposure to “risk” of a given
outcome
To properly model rates, we must account for the
others…
Dube Jara(PhD)
Poisson Regression
Assumptions
Poisson regression makes a big assumption: That
variance of =λ (“equidisperson”)
The mean and variance are the same
This assumption is often not met in real data
Dispersion is often greater than λ. :overdispersion
Consequence of overdispersion: Standard errors will
be underestimated
Potential for overconfidence in results; rejecting H 0 when
you shouldn’t!
Overdispersion is most often caused by highly skewed
dependent variables
Often due to variables with high numbers of zeros
Dube Jara(PhD)
Infant Data
Outcome: Total child deaths (‘deaths’).
Factor1: place of residence
(1 = ‘urbant, 2 = ‘semi-urbant, 3 = ‘ruralt).
Dube Jara(PhD)
The option ‘irr’ displays the incidence rate ratio estimates.
poisson deaths i.place faminc. irr
Dube Jara(PhD)
Results
exp(ξi) measures the change in the expected log
counts (incidence rate).
exp(ξ2) = exp(0.230) = 1.25 implies that the
expected log counts of child death in rural is higher
by 1.25 as compared to that of urban, or the
incidence rate of child death in rural is higher by
25% as compared to that of urban.
exp(ξ3) = exp(−0.0013) = 0.9987 implies that as
family income increases by one Birr, the incidence
rate of child death decreases by nearly 0.13%.
Dube Jara(PhD)