Lecture 4&5-Categorical Data Analysis

The document is a lecture on the analysis of categorical variables for MPH students, covering topics such as chi-square statistics, contingency tables, and statistical modeling techniques like binary and multinomial logistic regression. It discusses the categorization of continuous variables, advantages and disadvantages of such methods, and the use of contingency tables for analyzing relationships between categorical variables. Additionally, it provides formulas for calculating risk differences and risk ratios in epidemiological studies.

Uploaded by

selgashae

Yekatit 12 Hospital Medical College

School of Public Health

Department of Epidemiology and Biostatistics
Chapter 4&5: Analysis of Categorical Variables
For MPH Students
By
Dube Jara (PhD)
Assistant Professor of Epidemiology
Email: [email protected]

May, 2025
Addis Ababa,
Ethiopia

Contents
• Analysis of Categorical Data
  – Chi-square statistics
  – R×C contingency tables
  – Measures of association
• Statistical Modeling
  – Introduction
  – Binary logistic regression
  – Multinomial logistic regression
  – Poisson regression

Categorical variables
• Variables that are measured using nominal or ordinal scales.
  – A variable with only two levels (values) is called dichotomous (e.g. sex).
  – A variable with more than two levels (values) is called polytomous (e.g. blood pressure category).
• Continuous (numeric) variables can be changed to categorical variables.

Categorizing continuous variables
• Sometimes we require less detail than numerical data afford, so we occasionally transform a numerical variable into a categorical one.
  – This is achieved by dividing the range of values that the numeric variable takes into intervals.
• Consider the variable "age". Age might be transformed to form a new variable, say "age group", which consists of categories.

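The interval-based transformation described above can be sketched in a few lines of Python. The cut points and labels below are hypothetical and chosen only for illustration; the lecture does not prescribe particular age groups.

```python
from bisect import bisect_right

# Hypothetical cut points (years) defining the groups <20, 20-29, 30-39, 40+
CUT_POINTS = [20, 30, 40]
LABELS = ["<20", "20-29", "30-39", "40+"]


def age_group(age):
    """Return the label of the interval that contains `age`."""
    # bisect_right counts how many cut points are <= age,
    # which is the index of the matching interval label.
    return LABELS[bisect_right(CUT_POINTS, age)]


ages = [17, 20, 29, 35, 44]
groups = [age_group(a) for a in ages]
print(groups)
```

Because the intervals are ordered, the resulting "age group" variable is ordinal, which is why trend tests become relevant later.
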
Continued…
• Categorizing continuous variables is common practice in epidemiology for both descriptive analysis and multivariate modelling.
• It is used for both outcome and explanatory variables.
• Categorizing a numerical variable generally results in an ordinal categorical variable.
• An associated issue is, therefore, methods for trend tests based on the ordinal variable.

Why categorization?
• Advantages
  – Easier to perform the analysis, and easy to implement and interpret the results.
  – Easier to adjust for confounding.
  – Avoids making assumptions about the shape of the dose-response curve relating exposure level to risk.
  – Avoids problems of influential observations (extreme measurements), which are an issue when dealing with numeric variables.
• Disadvantages
  – Reduction of efficiency/power. The loss of power is greatest when the dose-response relationship is non-linear.
  – Bias
  – Biologically implausible

Contingency table
• When working with categorical variables, we often arrange the counts in a tabular format called a contingency table.
• If a contingency table involves two dichotomous variables, then it is a 2×2 (two-way) table.
• A 2×2 contingency table is a table composed of two rows cross-classified by two columns.
• It can be generalized to an r×c contingency table (r rows and c columns).

Contingency table…
• 2×2 tables can arise from any design in which two factors of interest, each with two levels, define the four cells.
• The designs include:
  • Cross-sectional,
  • Prospective,
  • Case-control or
  • Matched pairs
• We will call the factor levels "exposed" (E) and "unexposed" (NE), and "diseased" (D) and "not-diseased" (ND).

Tests of association for 2×2 tables
• If the frequencies in the four cells of a 2×2 table are a, b, c and d, then the appropriate test statistic is the chi-squared statistic, defined by:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

• where O and E denote the observed and expected frequencies, i indicates the row number and j the column number.
• The chi-squared test measures the disparity between the observed frequencies (data from the sample) and the expected frequencies.

Assumptions of the χ² test
• No observed frequency is zero.
• No more than 20% of the expected frequencies should be < 5.
  – The guideline is that 80% of the cells in the table should have expected frequencies greater than 5, and all cells should have expected frequencies greater than 1.
• Observations should be independent of each other.
• There must be a large sample (> 40).
  – The test is based on a 'large sample' approximation.

The 2 × 2 frequency table – comparison of two proportions

We are often interested in an exposure-disease relationship, as shown in the following table.

                      Disease
                 Yes               No                Total
Exposure  Yes    n11               n12               n1 = n11 + n12
          No     n21               n22               n2 = n21 + n22
Total            m1 = n11 + n21    m2 = n12 + n22    n = n1 + n2 = m1 + m2

Cont…
• For a 2 × 2 table the formula for the uncorrected χ² can be simplified:

$$\chi^2 = \frac{n(n_{11}n_{22}-n_{12}n_{21})^2}{n_1 n_2 m_1 m_2} = \frac{n(ad-bc)^2}{(a+c)(b+d)(c+d)(a+b)}$$

• For small sample sizes we have to use a continuity correction known as Yates' correction:

$$\chi^2_c = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{\left(|O_{ij}-E_{ij}|-\tfrac{1}{2}\right)^2}{E_{ij}} = \frac{n\left(|n_{11}n_{22}-n_{12}n_{21}|-\tfrac{n}{2}\right)^2}{n_1 n_2 m_1 m_2}$$

$$\chi^2_c = \frac{n\left(|ad-bc|-\tfrac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}$$

Expected Value
• The expected count for a cell is the row total multiplied by the column total, divided by the grand total.
• The expected counts must be computed for each cell.

2 × 2 contingency table

Example: Consider the following sub-sampled data from an infant survival study, used to look into differences in the proportion of low birth weight babies between urban and rural resident births.

Residence    Low birth weight    Total
             Yes       No
Rural        1732      222       1954
Urban        1627      61        1688
Total        3359      283       3642

SPSS output – Chi-squared test

                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              75.853a   1    0.000
Continuity Correction (a)       74.776    1    0.000
Likelihood Ratio                81.151    1    0.000
Fisher's Exact Test                                          0.000        0.000
Linear-by-Linear Association    75.832    1    0.000
N of Valid Cases                3642
(a) Computed only for a 2x2 table
(b) 0 cells (.0%) have expected count less than 5. The minimum expected count is 53.69.

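As a cross-check on the output above, the Pearson and Yates-corrected statistics can be recomputed from the four cell counts with the simplified 2 × 2 formulas. This is a minimal sketch; the function names are illustrative and not part of the lecture or of SPSS.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-squared statistic for a 2x2 table with cells a, b (row 1), c, d (row 2)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction subtracts n/2 from |ad - bc|
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Residence by low birth weight table from the example
table = [[1732, 222], [1627, 61]]
expected = expected_counts(table)
chi2 = chi_square_2x2(1732, 222, 1627, 61)
chi2_yates = chi_square_2x2(1732, 222, 1627, 61, yates=True)
print(round(chi2, 3), round(chi2_yates, 3))
```

The two statistics agree with the "Pearson Chi-Square" and "Continuity Correction" rows to three decimals.
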
R×C Contingency Tables
• An R×C contingency table is a table with R rows and C columns.
• It displays the relationship between two variables, where the variable in the rows has R categories and the variable in the columns has C categories.
• If R or C is greater than 2, use the R×C contingency table approach.
• Continuity corrections are not used for contingency tables larger than 2×2.
• Degrees of freedom = (R−1)(C−1)

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

Chi-Square Test for Trend
• Suppose that in a 2×C table the column variable is ordinal.
• A question of interest is whether there is a trend in the proportions falling into the first or second row across levels of the column variable.
• The test examines whether or not there is a dose-response relationship.
• It works with df = 1.
• It is not used for nominal scales, since there is no specific ordering among the categories.

Consider the following data on SBP and BMI status

                 Systolic Blood Pressure
BMI        < 140 mmHg         140+ mmHg          Total
           n       %          n       %
< 18       85      87.63      12      12.37      97
18-24      172     86.43      27      13.57      199
25+        32      80.00      8       20.00      40
Total      289     86.01      47      13.99      336

SPSS Output

Chi-Square Test
                                Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square              1.442a    2    0.489
Likelihood Ratio                1.331     2    0.514
Linear-by-Linear Association    1.027     1    0.311
N of Valid Cases                336

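The Pearson and linear-by-linear rows above can be reproduced from the BMI by SBP counts. The sketch below assumes equally spaced scores (1, 2, 3) for the BMI categories and 0/1 for the SBP categories, and uses the statistic (N − 1)r², where r is the correlation between the two sets of scores. The scoring is an assumption for illustration, not something stated in the slides.

```python
def pearson_chi_square(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            total += (observed - expected) ** 2 / expected
    return total


def linear_by_linear(table, row_scores, col_scores):
    """Trend statistic (N - 1) * r^2 with 1 df, using the given category scores."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            x, y = row_scores[i], col_scores[j]
            n += count
            sx += count * x
            sy += count * y
            sxx += count * x * x
            syy += count * y * y
            sxy += count * x * y
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    r_squared = cov ** 2 / (var_x * var_y)
    return (n - 1) * r_squared


# Rows: BMI < 18, 18-24, 25+; columns: SBP < 140, SBP 140+
sbp_table = [[85, 12], [172, 27], [32, 8]]
pearson = pearson_chi_square(sbp_table)
trend = linear_by_linear(sbp_table, [1, 2, 3], [0, 1])
print(round(pearson, 3), round(trend, 3))
```

Both values match the SPSS rows to three decimals; with df = 1 the trend statistic is far from significant, in line with the reported p-value of 0.311.
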
Odds and Logits
• Both of these terms describe how often something happens relative to its opposite happening, such as winning or losing, or dying or surviving.
• Thus, they both deal with a special case of nominal measurement scales: dichotomous (binary) outcome measures.
Logits
• The logit is simply the natural log of the odds [Ln(odds) or Loge(odds)].
• Note that the logits of an outcome and of its opposite are always symmetrical (they sum to zero), and logits range from minus infinity to plus infinity.

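A short numerical illustration of odds and logits, and of the symmetry property, is given below. The probability 0.8 is an arbitrary value chosen for the example.

```python
import math


def odds(p):
    """Odds in favour of an event that has probability p."""
    return p / (1 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


p = 0.8  # arbitrary probability of "success"
print(odds(p), odds(1 - p))    # odds for the event and for its opposite
print(logit(p), logit(1 - p))  # the two logits have equal size and opposite sign
```
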
Measures of effect for Categorical Data
• We would like to compare the frequency of disease between exposed and unexposed subjects.
• We compare:
  – Incidence rates in prospective studies, or
  – Prevalence rates in cross-sectional studies
  between exposed and unexposed individuals.

Let
p1 = probability of developing disease for exposed individuals
p2 = probability of developing disease for unexposed individuals
The risk difference is defined as p1 − p2.
The risk ratio or relative risk is defined as p1/p2.

1. The risk difference
• Suppose that p̂1 and p̂2 are the sample proportions with disease for exposed and unexposed subjects, based on two independent samples of size n1 and n2, respectively.
• Then a 100%×(1−α) confidence interval for the risk difference is given by:

$$\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}+\frac{\hat{p}_2\hat{q}_2}{n_2}}, \qquad \hat{q}_k = 1-\hat{p}_k$$

Cont…
• Example: A study was conducted to look at the effect of oral contraceptives (OC) on myocardial infarction (MI) in women 40-44 years of age over 3 years. The following data were obtained from the study. Compute a 95% CI for the risk difference in MI between users and non-users of OC.

OC-use group     MI status over 3 years    Total             p̂
                 Yes       No
OC-users         13        4,987           5,000 (= n1)      0.0026
Non-OC-users     7         9,993           10,000 (= n2)     0.0007
Total            20        14,980          15,000            0.0013

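The slides pose the 95% CI for the risk difference as an exercise. One way to work it, applying the confidence interval formula for the risk difference directly to the OC-MI counts, is sketched below; the function name is illustrative and z = 1.96 is assumed for a 95% interval.

```python
import math


def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Risk difference p1 - p2 with a normal-approximation confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rd = p1 - p2
    return rd, rd - z * se, rd + z * se


# 13 MIs among 5,000 OC users; 7 MIs among 10,000 non-users
rd, rd_lower, rd_upper = risk_difference_ci(13, 5000, 7, 10000)
print(f"RD = {rd:.4f}, 95% CI = ({rd_lower:.4f}, {rd_upper:.4f})")
```

Under these assumptions the interval runs from roughly 0.0004 to 0.0034 and excludes zero.
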
2. The risk ratio (Relative risk)
• A point estimate of the risk ratio (RR = p1/p2) is given by:

$$\widehat{RR} = \frac{\hat{p}_1}{\hat{p}_2}$$

• To obtain an interval estimate, we will assume that the normal approximation to the binomial distribution is valid.
• Under this assumption, it can be shown that the sampling distribution of ln(RR) more closely follows a normal distribution than RR itself, and its standard error is given by:

$$se[\ln(\widehat{RR})] = \sqrt{\frac{n_{12}}{n_{11}n_1}+\frac{n_{22}}{n_{21}n_2}}$$

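Cont…
• Therefore, a two-sided 100%×(1−α) CI for ln(RR) is given by:

$$\ln(\widehat{RR}) \pm z_{1-\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_1}+\frac{n_{22}}{n_{21}n_2}}$$

• The anti-log of each end of the interval above then provides a two-sided 100%×(1−α) CI for RR itself, given by:

$$\left(e^{\ln(\widehat{RR}) - z_{1-\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_1}+\frac{n_{22}}{n_{21}n_2}}},\; e^{\ln(\widehat{RR}) + z_{1-\alpha/2}\sqrt{\frac{n_{12}}{n_{11}n_1}+\frac{n_{22}}{n_{21}n_2}}}\right)$$
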
Cont…
Consider the previous example on the risk of MI among OC users compared with non-OC users.
p̂1 = 13/5000 = 0.0026, n1 = 5000 and
p̂2 = 7/10000 = 0.0007, n2 = 10000.

Thus the point estimate of RR is 0.0026/0.0007 = 3.71.

To compute a 95% CI, we have n11 = 13, n12 = 4,987, n21 = 7 and n22 = 9,993.

$$\ln(3.71) = 1.312 \quad\text{and}\quad se[\ln(\widehat{RR})] = \sqrt{\frac{4{,}987}{13(5{,}000)}+\frac{9{,}993}{7(10{,}000)}} = 0.4685$$

The 95% CI for ln(RR) will be
1.312 ± 1.96×0.4685 = 1.312 ± 0.918 = (0.394, 2.230)

Therefore, the 95% CI for RR is (e^0.394, e^2.230) = (1.5, 9.3)

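The hand calculation above can be checked with a few lines of Python. This is a sketch of the same log-scale interval; the function name is illustrative.

```python
import math


def risk_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Risk ratio with a CI computed on the log scale and then exponentiated."""
    n1, n2 = n11 + n12, n21 + n22
    rr = (n11 / n1) / (n21 / n2)
    se_log = math.sqrt(n12 / (n11 * n1) + n22 / (n21 * n2))
    log_rr = math.log(rr)
    return rr, se_log, math.exp(log_rr - z * se_log), math.exp(log_rr + z * se_log)


rr, se_log_rr, rr_lower, rr_upper = risk_ratio_ci(13, 4987, 7, 9993)
print(f"RR = {rr:.2f}, se = {se_log_rr:.4f}, 95% CI = ({rr_lower:.1f}, {rr_upper:.1f})")
```
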
3. The Odds Ratio (OR)
• If the probability of success is p, then the odds in favor of success are

$$odds = \frac{p}{1-p}$$

• If two populations with proportions p1 and p2 are considered and the odds in favor of success are computed for each population, then the ratio of odds, or odds ratio (OR), is defined as:

$$OR = \frac{odds_1}{odds_2} = \frac{p_1/q_1}{p_2/q_2} = \frac{p_1 q_2}{p_2 q_1}$$

• and is estimated by

$$\widehat{OR} = \frac{\hat{p}_1\hat{q}_2}{\hat{p}_2\hat{q}_1}$$

Cont…
• From the 2 × 2 table with the four cells labelled n11, n12, n21 and n22 as above,

$$\widehat{OR} = \frac{[n_{11}/(n_{11}+n_{12})]\,[n_{22}/(n_{21}+n_{22})]}{[n_{21}/(n_{21}+n_{22})]\,[n_{12}/(n_{11}+n_{12})]} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$$

• Example: Using the OC-MI data above, estimate the odds ratio in favor of MI for an OC user compared with a non-user (i.e. the disease odds ratio).

$$\widehat{OR} = \frac{0.0026 \times 0.9993}{0.0007 \times 0.9974} = 3.72$$

Interval estimates for the Odds Ratio
• Several methods exist for estimating a CI for the OR. One of the most popular approaches is the Woolf method. Woolf showed that, approximately,

$$var[\ln(\widehat{OR})] \approx \frac{1}{n_{11}}+\frac{1}{n_{12}}+\frac{1}{n_{21}}+\frac{1}{n_{22}}$$

• If we assume approximate normality of ln(OR), then a 100%(1−α) CI for ln(OR) is given by:

$$\ln(\widehat{OR}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{n_{11}}+\frac{1}{n_{12}}+\frac{1}{n_{21}}+\frac{1}{n_{22}}}$$

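Cont…
• If we take the anti-log of each end of the CI, then it follows that a 100%(1−α) CI for OR is given by

$$e^{\ln(\widehat{OR}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{n_{11}}+\frac{1}{n_{12}}+\frac{1}{n_{21}}+\frac{1}{n_{22}}}}$$

$$\text{Lower} = \widehat{OR}\, e^{-z_{1-\alpha/2}\sqrt{\frac{1}{n_{11}}+\frac{1}{n_{12}}+\frac{1}{n_{21}}+\frac{1}{n_{22}}}}, \qquad \text{Upper} = \widehat{OR}\, e^{+z_{1-\alpha/2}\sqrt{\frac{1}{n_{11}}+\frac{1}{n_{12}}+\frac{1}{n_{21}}+\frac{1}{n_{22}}}}$$
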
Cont…
• Example: Suppose all women with at least one birth in a study of breast cancer are classified as either cases or controls, and with age at first birth as either ≤ 29 or ≥ 30, and the data are displayed below:

Age at first birth    Status                 Total
                      Cases     Controls
≥ 30                  683       1498         2181
≤ 29                  2537      8747         11284
Total                 3220      10245        13465

$$\widehat{OR} = \frac{683 \times 8{,}747}{1{,}498 \times 2{,}537} = 1.57$$

Cont…
• Compute a 95% CI for the odds ratio relating age at first birth to breast cancer incidence.
• The SE and a 95% CI for ln(OR) become:

$$se[\ln(\widehat{OR})] = \sqrt{\frac{1}{683}+\frac{1}{2537}+\frac{1}{1498}+\frac{1}{8747}} = 0.0514$$

Therefore, the 95% CI for ln(OR) becomes
ln(1.57) ± 1.96 × 0.0514 = (0.352, 0.553)

• A 95% CI for OR is thus given by (e^0.352, e^0.553) = (1.42, 1.74)

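The Woolf interval for the breast cancer example can be reproduced as follows. This is a minimal sketch with an illustrative function name; it uses the unrounded odds ratio, so intermediate values may differ slightly in the last decimal from the hand calculation.

```python
import math


def odds_ratio_woolf_ci(n11, n12, n21, n22, z=1.96):
    """Odds ratio with Woolf's log-scale confidence interval."""
    or_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_or = math.log(or_hat)
    return or_hat, se_log, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)


# Rows: age at first birth >= 30, <= 29; columns: cases, controls
or_hat, se_log_or, or_lower, or_upper = odds_ratio_woolf_ci(683, 1498, 2537, 8747)
print(f"OR = {or_hat:.2f}, se = {se_log_or:.4f}, 95% CI = ({or_lower:.2f}, {or_upper:.2f})")
```
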
Logistic Regression
• The outcome variable in many studies is the presence or absence of some condition, such as:
  – Survival status (alive or dead)
  – Responding or not to a treatment
  – Having an MI or not
  – Birth weight status (normal or low)

Logistic Regression…
• Logistic regression is a type of predictive model that can be used when the response variable is a categorical variable.
  – Binary logistic regression is a form of regression used when the dependent variable is dichotomous and the independent variables are of any type.
  – Multinomial logistic regression handles the case of a dependent variable with more than two classes.
  – When the classes of the dependent variable can be ranked, ordinal logistic regression is preferred to multinomial logistic regression.

Logistic…
• Logistic regression can be used to predict a dependent variable on the basis of continuous and/or categorical independent variables, and to:
  – Determine the percent of variance in the dependent variable explained by the independent variables;
  – Rank the relative importance of the independent variables;
  – Assess interaction effects; and
  – Understand the impact of covariate control variables

Logistic…
• Logistic regression applies maximum likelihood estimation.
  – Transform the dependent variable into a probability (the probability of the occurrence of a certain event).
  – Transform the probability into odds.
  – Transform the odds into a logit variable (the natural logarithm of the odds of the dependent event occurring or not).
• Form the logistic model. Unlike OLS, the model is based on the log odds of the dependent variable.
• Instead of assuming that the relationship between p and x is linear, we assume that the relationship between ln[p/(1-p)] and x is linear.

Logistic…
Example: Consider the birth weight data to predict factors associated with low birth weight.

Let the outcome of interest be a dichotomous variable, defined as

$$y_i = \begin{cases} 1 & \text{if newborn } i \text{ has low birth weight (< 2500 g)} \\ 0 & \text{otherwise} \end{cases}$$

• By considering age of the mother as the predictor variable, construct a scatter plot of the outcome and predictor variables.

[Figure: Scatter plot of birth weight status (low birth weight, 0 to 1) by age of the mother (maternal age, 25 to 45)]

Logistic…
• We can see that this plot is less informative about the relationship between the outcome and the explanatory variable than in the case when the outcome variable is continuous.
• The mean of the dichotomous random variable yi can be calculated as

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

• But the above quantity is the same as the proportion of low birth weight babies and can be denoted by p, i.e. $\bar{y} = \hat{p}$.

Logistic…
• For the data, 60 of 500 newborns had birth weight < 2500 g.
• This gives an estimated low birth weight probability of p̂ = 60/500 = 0.12.
• Overall, 12% of the newborns had birth weight < 2500 g.

Logistic...
• The statistical model that is generally preferred for the analysis of a binary response is the binary logistic regression model, stated in terms of the probability p = P(yi = 1), given the values of the predictors:

$$p = \frac{1}{1+e^{-(\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k)}}$$

• i.e.

$$p = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}}$$

• The function above is called the logistic function.
• The function has an unlimited range for the x's, while p is restricted to range from 0 to 1.

The Logistic Function
• Thus, if a success occurs with probability

$$p = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}}$$

• then a failure will occur with probability

$$1-p = \frac{1}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}}$$

Logistic…
• Consequently, the odds in favor of success will be

$$\frac{p}{1-p} = \frac{e^{\beta_0+\beta_1X_1+\dots+\beta_kX_k}\big/\left(1+e^{\beta_0+\beta_1X_1+\dots+\beta_kX_k}\right)}{1\big/\left(1+e^{\beta_0+\beta_1X_1+\dots+\beta_kX_k}\right)} = e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}$$

• Taking the natural logarithm of each side of this equation:

$$\ln\left(\frac{p}{1-p}\right) = \ln\left(e^{\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k}\right) = \beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_kX_k$$

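The chain from linear predictor to probability, odds and log odds can be verified numerically. In the sketch below the coefficients and covariate values are hypothetical and chosen only to illustrate the algebra.

```python
import math


def linear_predictor(betas, xs):
    """beta0 + beta1*x1 + ... + betak*xk; betas[0] is the intercept."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def logistic(eta):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients (beta0, beta1, beta2) and covariate values (x1, x2)
eta = linear_predictor([-2.0, 0.05, 0.8], [30, 1])
p = logistic(eta)
success_odds = p / (1 - p)

print(p)                                # probability of success
print(success_odds, math.exp(eta))      # the odds equal exp(linear predictor)
print(math.log(success_odds), eta)      # the log odds recover the linear predictor
```
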
The fitted equation
• In order to use the mother's age to help us predict the likelihood that her newborn is low birth weight, we fit the model:

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1x$$

• where b0 and b1 are the respective estimates of the population regression coefficients β0 and β1.
• We cannot apply the method of least squares, which assumes that the response is continuous and normally distributed, to fit the logistic model; instead, we use maximum likelihood estimation.

Example with SPSS

Fit a logistic regression model between low birth weight and age of the mother:

                                                   Exp(B)    95.0% C.I. for EXP(B)
            B        S.E.    Wald    df   Sig.     (OR)      Lower    Upper
matage      -0.019   .035    .307    1    .580     .981      .915     1.051
Constant    -1.332   1.196   1.240   1    .265     .264

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -1.332 - 0.019X$$

From the model, the coefficient of age implies that for a one-year increase in the age of the mother, the log odds that the newborn will have low birth weight decreases by 0.019. When the log odds decreases, the probability p decreases as well.

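To see what the fitted equation implies on the probability scale, the coefficients from the SPSS table can be plugged into the logistic function. The ages evaluated below are arbitrary illustrative values.

```python
import math

B0, B1 = -1.332, -0.019  # constant and maternal-age coefficient from the SPSS output


def predicted_lbw_probability(age):
    """Estimated probability of low birth weight at a given maternal age."""
    log_odds = B0 + B1 * age
    return 1 / (1 + math.exp(-log_odds))


odds_ratio_per_year = math.exp(B1)  # corresponds to the Exp(B) column
p30 = predicted_lbw_probability(30)
p40 = predicted_lbw_probability(40)
print(round(odds_ratio_per_year, 3), round(p30, 3), round(p40, 3))
```

The predicted probability falls only slightly between ages 30 and 40, consistent with an odds ratio close to 1 and a confidence interval that includes 1.
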
Example with Stata (OR)

[Figure: Plot of estimated probability of LBW, Pr(lowbw), by mother's age (maternal age)]

Multiple Logistic Regression
• We have seen how the age of the mother relates to the probability that the infant will weigh < 2500 g. What will be the fitted equation if gestation weeks is included in the model?
• To model the probability p as a function of the two explanatory variables, we fit a model of the form:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2$$

• where x1 denotes the age of the mother and x2 denotes the gestational age of the newborn baby.

Both age and gestational age (weeks) are in the model

                                                              95.0% C.I. for EXP(B)
            B        S.E.    Wald     df   Sig.    Exp(B)     Lower    Upper
age         0.009    0.049   0.035    1    0.852   1.009      0.917    1.110
gestwks     -0.897   0.108   68.535   1    0.000   0.408      0.330    0.504
constant    31.553   4.338   52.919   1    0.000   5.050E13

Categorical predictors
• The logistic regression model can be generalized to include discrete or nominal variables in addition to continuous ones.
• Suppose that, in addition to age and gestational age, we include maternal hypertension and sex of the baby in the above model:

$$\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4$$

• where x1, x2, x3 and x4 represent, respectively, age, gestational age, maternal HTN and sex of the baby.

Age, GA, sex & hyp are all in the model

                                                                95.0% C.I. for EXP(B)
              B        S.E.    Wald     df   Sig.    Exp(B)     Lower    Upper
Age           0.016    0.048   0.104    1    0.747   1.016      0.924    1.117
gestwks       -0.895   0.111   65.275   1    0.000   0.409      0.329    0.508
hyp(yes)      0.755    0.431   3.067    1    0.080   2.127      0.914    4.948
sex(female)   0.445    0.379   1.377    1    0.241   1.561      0.742    3.282
Constant      30.858   4.419   48.758   1    0.000   2.521E13

Categorical predictors
• Choose one level of the variable to be the baseline (better if it is the largest group of subjects).
• Create dummy (indicator) variables coded 0 or 1 for each level of the variable except the baseline.
• Include all dummy variables in the model.
• Or use the Categorical option in SPSS, which automatically creates the dummy variables for you.

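The dummy-coding steps above can be sketched in plain Python. The helper `dummy_code` is hypothetical (it is not an SPSS or Stata function), and the residence values are made up for illustration.

```python
def dummy_code(values, baseline):
    """Return one dict of 0/1 indicators per observation, omitting the baseline level."""
    levels = sorted(set(values) - {baseline})
    return [{level: int(value == level) for level in levels} for value in values]


# Hypothetical residence variable with "urban" chosen as the baseline
residence = ["urban", "rural", "semi-urban", "urban"]
dummies = dummy_code(residence, baseline="urban")
for value, row in zip(residence, dummies):
    print(value, row)
```

A variable with k levels yields k − 1 indicators; observations in the baseline level have all indicators equal to 0.
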
Assumptions
• Logistic regression has less stringent requirements than OLS regression.
• Assumptions
  – The logit of the dependent variable is linearly related to the continuous independent variables.
    – Check using a probability graph.
  – Observations are independent.
  – Sample size should be adequate.

Assumptions check
• Independence of observations / multi-collinearity
• Symptoms of multi-collinearity
  – Imprecise estimates
  – Implausible estimates
  – Very low significance (possibly with very high R²)
  – Big changes in estimates when the sample changes even slightly
• Assessed through:
  – Correlation matrices
  – Variance inflation factors (> 10)
  – Tolerance test (< 0.1)

Sample size requirements
• The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for logistic regression.
• For preferred case-to-variable ratios, we will use 20 to 1 for simultaneous and hierarchical logistic regression and 50 to 1 for stepwise logistic regression.

Logistic Regression Model building
• The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model.
• To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.
• Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or deleted (stepwise regression).

Cont…
• Stepwise regression is used in the exploratory phase of research, but it is not recommended for theory testing (Menard 1995).
• Theory testing is the testing of a priori theories or hypotheses about the relationships between variables.
• Exploratory testing makes no a priori assumptions regarding the relationships between the variables; thus the goal is to discover relationships.

Cont…
• Backward stepwise regression appears to be the preferred method of exploratory analysis, where the analysis begins with a full or saturated model and variables are eliminated from the model in an iterative process.
• The fit of the model is tested after the elimination of each variable to ensure that the model still adequately fits the data.
• When no more variables can be eliminated from the model, the analysis is complete.

Overall test of relationship / Assessing the fit of the Model
• The process by which coefficients are tested for significance, for inclusion in or elimination from the model, involves several different techniques.
• Each of these is discussed below.
Classification table
• Shows correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent variable.
Wald Test:
• A Wald test is used to test the statistical significance of each coefficient (β) in the model. The Wald statistic is computed as

$$W = \left(\frac{b}{se(b)}\right)^2$$

• W has a chi-square distribution with 1 d.f.

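As an illustration, the Wald statistic for maternal age can be recomputed from the rounded B and S.E. values in the single-predictor SPSS table. This is a sketch: because the inputs are rounded to three decimals, the result differs slightly from the Wald value SPSS reports (0.307). The p-value uses the fact that a chi-square variable with 1 d.f. is the square of a standard normal variable.

```python
import math


def wald_test(b, se):
    """Wald chi-square statistic (1 d.f.) and its two-sided p-value."""
    w = (b / se) ** 2
    # For 1 d.f., P(chi-square > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(w / 2))
    return w, p_value


w, p_value = wald_test(-0.019, 0.035)
print(round(w, 3), round(p_value, 3))
```
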
Cont…
Likelihood-Ratio Test:
• The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0).
• The likelihood-ratio test statistic equals:
  – −2LL0 − (−2LL1)
• The likelihood-ratio statistic has a chi-squared distribution with d.f. equal to the number of parameters being tested.
• This is the recommended test statistic to use when building a model through backward stepwise elimination.

Extensions: Multinomial logit
• Multinomial logistic regression can be used when the dependent variable has more than two categories and the categories cannot be ordered in a meaningful way.
• One category is chosen as the "comparison category", and the beta coefficient (b) represents the change in the log odds of being in dependent-variable category i relative to the comparison category.

Multinomial Logistic Regression: Assumptions
• Data are case specific; that is, each independent variable has a single value for each case.
• The dependent variable cannot be perfectly predicted from the independent variables for any case.
• Collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if they are highly correlated.
• No assumption of normality, linearity and independence.

Approach to model fitting
• In multinomial regression we in principle fit several binary logistic models, depending on the categories of the dependent variable.
  – The last category is set as the reference group.
  – For example, if the dependent variable has 3 groups, there are two sets of model outputs, as the third group is set as the reference.
  – Thus, each independent variable appears twice, that is, once for each of the two groups.
• To compare models, use the change in the likelihood ratio test statistic (−2 log likelihood). A smaller value indicates a better model.

Approach to model fitting…
• Multinomial logistic regression computes correlation measures to estimate the strength of the relationship (pseudo R-square measures, such as Nagelkerke's R², depending on the software used).
  – Correlations may not tell much about the accuracy or errors associated with the model.
  – A more useful measure is classification accuracy, which compares predicted group membership based on the logistic model with the actual, known group membership.

Multinomial logit…
• In the multinomial logit model, you estimate a set of coefficients, β(1), β(2), and β(3), corresponding to each outcome.

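How one set of coefficients per outcome turns into outcome probabilities can be sketched as follows. The code assumes the usual identification in which the base outcome's coefficients are fixed at zero, so its linear predictor is 0; the coefficient values are hypothetical and are not estimates from the lecture's data.

```python
import math


def multinomial_probabilities(linear_predictors):
    """Outcome probabilities from one linear predictor (x'beta) per outcome."""
    # Subtracting the maximum keeps exp() numerically stable and leaves the result unchanged
    m = max(linear_predictors)
    exps = [math.exp(v - m) for v in linear_predictors]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients (intercept, slope) for outcomes 1 (base), 2 and 3
coefficients = [(0.0, 0.0), (-0.19, 0.65), (-1.94, 0.38)]
x = 1  # e.g. an indicator covariate equal to 1
etas = [b0 + b1 * x for b0, b1 in coefficients]
probs = multinomial_probabilities(etas)
print([round(pr, 3) for pr in probs])

# The ratio of an outcome's probability to the base outcome's equals exp(x'beta)
print(probs[1] / probs[0], math.exp(etas[1]))
```
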
Example with Stata
• Data on the type of health insurance held by 616 psychologically depressed subjects.
• The insurance is categorized as either
  – an indemnity plan (regular fee-for-service insurance, which may have a deductible or coinsurance rate),
  – a prepaid plan (a fixed up-front payment allowing subsequent unlimited use as provided), or
  – no insurance whatsoever.
• One of the demographic factors in these data is the race of the participant, coded as white or nonwhite.

Practice
• tabulate insure nonwhite, chi2 col
• mlogit insure nonwhite
• mlogit insure nonwhite, base(2)
• mlogit insure nonwhite, base(2) rrr

Assignment
• Logistic regression model building steps

Count data Analysis

Poisson Regression Model
• Count data are very common in many applications. Examples include:
  – Number of patients visiting a certain hospital per day,
  – CD4 counts,
  – Number of live births in a given district per year, etc.
• Count data are commonly analyzed using the Poisson regression model.

Poisson Regression Model…
• The dependent variable is an event count, usually of rather rare events, e.g. the number of cases of cancer over a defined period in a cohort of subjects.
  – Non-negative integer
• Count variables can be modeled with linear regression (OLS), but:
  – Linear models can yield negative predicted values, whereas counts are never negative.
  – Count variables are often highly skewed, violating OLS assumptions.

Poisson Regression Model…
• Analyzing large cohorts with hazard regression models that have many time-dependent covariates can require substantial computation time.
• In this situation, converting the data to events per person-year of observation can greatly simplify the analysis.
• If the event rate per unit of time is low, then an excellent approach to consider is Poisson regression.

Poisson Regression Model…
• The Poisson model assumes that yi, given the vector of covariates xi, is independently Poisson-distributed with

$$P(Y_i = y_i \mid x_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \dots$$

• The mean is given by E(Y) = µ = λ and the variance by Var(Y) = λ.
• Suppose Y1, . . . , YN is a set of independent count outcomes, and let x1, . . . , xN represent the corresponding p-dimensional vectors of covariate values.

Poisson Regression Model…
• In Poisson regression, Y is typically conceptualized as a rate.
  – Positive coefficients indicate a higher rate; negative coefficients indicate a lower rate.
• Like logit, Poisson models are non-linear.
  – Coefficients don't have a simple linear interpretation.
• Like logit, the model has a log form; exponentiation aids interpretation as an incidence rate ratio: IRR = exp(β).

Poisson Regression: Exposure Time
• Poisson outcome variables are typically conceptualized as rates.
  – Number of events/failures/outcomes during a follow-up time
• Cases may vary in exposure to the "risk" of a given outcome.
• To properly model rates, we must account for the fact that some cases have greater exposure than others.
  – Some cases are followed for a longer time than others.

Poisson Regression Assumptions
• Poisson regression makes a big assumption: that the variance equals λ ("equidispersion").
  – The mean and variance are the same.
• This assumption is often not met in real data.
  – The dispersion is often greater than λ: overdispersion.
• Consequence of overdispersion: standard errors will be underestimated.
  – Potential for overconfidence in results; rejecting H0 when you shouldn't!
• Overdispersion is most often caused by highly skewed dependent variables.
  – Often due to variables with high numbers of zeros

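A quick informal check of equidispersion is to compare the sample mean and variance of the counts. The data below are hypothetical, with many zeros and a few large values, and the ratio shown is only a rough descriptive indicator, not a formal test and not a substitute for checking dispersion after fitting the model.

```python
from statistics import mean, variance

# Hypothetical counts of child deaths per family (many zeros, a few large values)
counts = [0, 0, 0, 0, 1, 0, 2, 0, 0, 7, 0, 1, 0, 0, 9]

sample_mean = mean(counts)
sample_variance = variance(counts)  # sample variance (n - 1 in the denominator)
dispersion_ratio = sample_variance / sample_mean

print(f"mean = {sample_mean:.2f}, variance = {sample_variance:.2f}, "
      f"variance/mean = {dispersion_ratio:.2f}")
```

A ratio well above 1, as here, suggests overdispersion, in which case Poisson standard errors are likely to be too small.
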
Infant Data
• Outcome: total child deaths ('deaths').
• Factor 1: place of residence, 'place' (1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
• Factor 2: monthly family income, 'faminc', in birr (continuous).
• log(λ) = ξ0 + ξ1SU + ξ2R + ξ3IN.
• 'IN' refers to monthly family income; 'SU' and 'R' are indicator variables for semi-urban and rural residence, with urban as the reference.
• Code: poisson deaths i.place faminc
        poisson deaths ib(first).place faminc

log(λ) = −0.745 + 0.036SU + 0.230R − 0.001IN.

The option 'irr' displays the incidence rate ratio estimates.
poisson deaths i.place faminc, irr

Results
• exp(ξi) measures the multiplicative change in the expected count (the incidence rate ratio).
• exp(ξ2) = exp(0.230) ≈ 1.26 implies that the expected number of child deaths in rural areas is about 1.26 times that in urban areas; that is, the incidence rate of child death in rural areas is about 26% higher than in urban areas.
• exp(ξ3) = exp(−0.0013) = 0.9987 implies that as family income increases by one birr, the incidence rate of child death decreases by nearly 0.13%.

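The incidence rate ratios can be recomputed from the fitted equation log(λ) = −0.745 + 0.036SU + 0.230R − 0.001IN. The sketch below uses the rounded coefficients as printed in that equation, so the income effect is slightly different from the value based on −0.0013; the income of 500 birr is an arbitrary illustrative value.

```python
import math

# Rounded coefficients from the fitted equation (urban is the reference category)
COEF = {"const": -0.745, "SU": 0.036, "R": 0.230, "IN": -0.001}


def expected_deaths(semi_urban, rural, income):
    """Expected count lambda = exp(linear predictor) for given covariate values."""
    log_lambda = (COEF["const"] + COEF["SU"] * semi_urban
                  + COEF["R"] * rural + COEF["IN"] * income)
    return math.exp(log_lambda)


irr_rural = math.exp(COEF["R"])    # rural vs urban
irr_income = math.exp(COEF["IN"])  # per one-birr increase in income

# The IRR is the ratio of expected counts with all other covariates held fixed
ratio = expected_deaths(0, 1, 500) / expected_deaths(0, 0, 500)
print(round(irr_rural, 3), round(irr_income, 4), round(ratio, 3))
```
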

62 pages
EE3211 Modelling Techniques
No ratings yet
EE3211 Modelling Techniques
47 pages
Probability and Statistics - Y2Phys
No ratings yet
Probability and Statistics - Y2Phys
108 pages
10measures of Association
No ratings yet
10measures of Association
249 pages
Lecture Notes 2
No ratings yet
Lecture Notes 2
40 pages
Further Maths ss2
No ratings yet
Further Maths ss2
2 pages
Goodness of Fit Tests Contingency Tables
No ratings yet
Goodness of Fit Tests Contingency Tables
49 pages
Lin Pinxiang - Master Tea Brewer
100% (1)
Lin Pinxiang - Master Tea Brewer
40 pages
Inferential Statistics II
No ratings yet
Inferential Statistics II
62 pages
Test of Significance
No ratings yet
Test of Significance
32 pages
Chapter 3 Bivariate Data Biostat
No ratings yet
Chapter 3 Bivariate Data Biostat
31 pages
PSM 201 The Chi Square Test
No ratings yet
PSM 201 The Chi Square Test
10 pages
Are Children Sexual
No ratings yet
Are Children Sexual
18 pages
Data Analysis:: Simple Statistical Tests
100% (1)
Data Analysis:: Simple Statistical Tests
45 pages
Class Lecture-3
No ratings yet
Class Lecture-3
10 pages
Multinomial & Ordinal LR Possion1
No ratings yet
Multinomial & Ordinal LR Possion1
63 pages
Chi Square
No ratings yet
Chi Square
18 pages
Applied Biostatistics Project
No ratings yet
Applied Biostatistics Project
27 pages
Lesson 11 CHI SQUARE TEST OF SIGNIFICANCE2 (Autosaved)
No ratings yet
Lesson 11 CHI SQUARE TEST OF SIGNIFICANCE2 (Autosaved)
17 pages
Better To Be in Agreement Than in Bad Company: A Critical Analysis of Many Kappa-Like Tests
No ratings yet
Better To Be in Agreement Than in Bad Company: A Critical Analysis of Many Kappa-Like Tests
22 pages
Daily Lesson LOG School: Grade Level:: Teacher: English Teaching Dates/Time: Quarter: Cot 1
100% (3)
Daily Lesson LOG School: Grade Level:: Teacher: English Teaching Dates/Time: Quarter: Cot 1
4 pages
Logistic Regression - 2021 ch-8
No ratings yet
Logistic Regression - 2021 ch-8
52 pages
Test of Association
No ratings yet
Test of Association
27 pages
Categorical Data Analysis
No ratings yet
Categorical Data Analysis
44 pages
Probability and Statistics - Lecture 4
No ratings yet
Probability and Statistics - Lecture 4
35 pages
Hypothesis Testing - Chi Squared Test
No ratings yet
Hypothesis Testing - Chi Squared Test
16 pages
Phân Tích Dữ Liệu Và Xác Định Phép Kiểm Thống Kê
No ratings yet
Phân Tích Dữ Liệu Và Xác Định Phép Kiểm Thống Kê
50 pages
Chi-Square (X2) Distribution
No ratings yet
Chi-Square (X2) Distribution
35 pages
Methods For Proportions
No ratings yet
Methods For Proportions
19 pages
Kami Export - Vihan Aggarwal - Chi-Square WS #3 - Homogeneity Test 2024.3
No ratings yet
Kami Export - Vihan Aggarwal - Chi-Square WS #3 - Homogeneity Test 2024.3
9 pages
Basic Concepts in Biostatistics 1-1
No ratings yet
Basic Concepts in Biostatistics 1-1
36 pages
Lecture 4: Contingency Table: This Example Is From Wikipedia
No ratings yet
Lecture 4: Contingency Table: This Example Is From Wikipedia
5 pages
Statistical Method of Categorical Variable
No ratings yet
Statistical Method of Categorical Variable
68 pages
Chi Square Test
100% (1)
Chi Square Test
14 pages
Chapter#8 Association
No ratings yet
Chapter#8 Association
59 pages
Lecture 08 Test For Independence
No ratings yet
Lecture 08 Test For Independence
18 pages
6.3 Chi-Square
No ratings yet
6.3 Chi-Square
35 pages
Chapter 1
No ratings yet
Chapter 1
21 pages
6 Contingency Tables
No ratings yet
6 Contingency Tables
72 pages
Analysis of Categorical Data
No ratings yet
Analysis of Categorical Data
75 pages
(Q1) MODULE 1 - The Nature of Matter PDF
No ratings yet
(Q1) MODULE 1 - The Nature of Matter PDF
26 pages
Chi - Square Test: PG Students: DR Amit Gujarathi DR Naresh Gill
No ratings yet
Chi - Square Test: PG Students: DR Amit Gujarathi DR Naresh Gill
32 pages
Ippd 2020
No ratings yet
Ippd 2020
3 pages
5 How To Analyze Your Data: Nonparametric Parametric
No ratings yet
5 How To Analyze Your Data: Nonparametric Parametric
18 pages
Biostatistics 541/699, Exam 2: Solutions
No ratings yet
Biostatistics 541/699, Exam 2: Solutions
4 pages
Practical Magnetic Design - Inductors and Coupled Inductors (Article)
No ratings yet
Practical Magnetic Design - Inductors and Coupled Inductors (Article)
23 pages
Mac Brochure-OM MacGuide - Compressed
No ratings yet
Mac Brochure-OM MacGuide - Compressed
14 pages
2025 Humphrey Reference Letter Forms
No ratings yet
2025 Humphrey Reference Letter Forms
4 pages
2023 05 Struktur Variaans-Kovarians
No ratings yet
2023 05 Struktur Variaans-Kovarians
42 pages
NNDL Notes
No ratings yet
NNDL Notes
73 pages
Form 4 Term 2 Geography Schemes @0743505350
No ratings yet
Form 4 Term 2 Geography Schemes @0743505350
9 pages
Art of Defining A Concept Paper
No ratings yet
Art of Defining A Concept Paper
22 pages
NFRC 302-2017 E0a1
No ratings yet
NFRC 302-2017 E0a1
20 pages
MODULE 9 Personal Relationships
No ratings yet
MODULE 9 Personal Relationships
91 pages
Fourier Series
No ratings yet
Fourier Series
12 pages
66-Article Text-89-2-10-20230228
No ratings yet
66-Article Text-89-2-10-20230228
7 pages
L1 11991 en Thermanit 13 04 Si de en 21
No ratings yet
L1 11991 en Thermanit 13 04 Si de en 21
1 page
ENG 2019 UAS Kalkulus 3 PDF
No ratings yet
ENG 2019 UAS Kalkulus 3 PDF
5 pages
Scan Plan Paut-Mc-03 PDF
No ratings yet
Scan Plan Paut-Mc-03 PDF
5 pages
Metalsa Supplier Manual Rev 4 1
No ratings yet
Metalsa Supplier Manual Rev 4 1
58 pages
The Body Language of Culture
No ratings yet
The Body Language of Culture
9 pages
Discourse Analysis Unit 4
No ratings yet
Discourse Analysis Unit 4
14 pages
Lecture - Water Reqmts Spreadsheet
No ratings yet
Lecture - Water Reqmts Spreadsheet
8 pages
AgXeed AgBot 5.115T2 Specifications-1
No ratings yet
AgXeed AgBot 5.115T2 Specifications-1
1 page
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet
Multiple Integrals, A Collection of Solved Problems
From Everand
Multiple Integrals, A Collection of Solved Problems
Steven Tan
No ratings yet
Digital Signal and Image Processing using MATLAB, Volume 3: Advances and Applications, The Stochastic Case
From Everand
Digital Signal and Image Processing using MATLAB, Volume 3: Advances and Applications, The Stochastic Case
Gérard Blanchet
3/5 (1)
Shortcuts to College Calculus Refreshment Kit
From Everand
Shortcuts to College Calculus Refreshment Kit
Juan Acevedo
No ratings yet