About Log Linear Validation
In clinical investigations we often have response and explanatory variables that are both
categorical. For example ill / not ill as response variable and immunised / not immunised as
explanatory variable. The categories here are nominal. There is no ordering between them.
Sometimes the categories could be ordered, and we say that the variable is ordinal. For
example survived; survived with deficits; died.
In the case of categorical data one is commonly looking for an association between two
variables. The χ² test is one example. Usually the χ² test is performed on a 2 × 2 contingency
table. Even though the test is still valid for larger tables, one can run into difficulties with
interpretation. All that a significant χ² test tells us is that the pattern of data depicted in the
table is unlikely to have arisen by chance. In a 2 × 2 contingency table the presence or absence
of association between the two variables is often clear from inspection alone. The formal
statistical test merely confirms (or refutes) it. For complicated contingency tables involving
several variables, a more suitable form of analysis is log-linear analysis.
Recall that the χ² test involves entering the frequency counts for the two categorical variables
in rows and columns together with the marginal totals (i.e. totals for each row and each
column), as well as the overall total. From these totals the expected frequency for each
cell is calculated. Then

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

(Recall also that the probability of the joint occurrence of two independent events is the
product of their separate probabilities.) A log-linear model is best thought of as a model for
the expected frequencies in a contingency table. But it is more than just an alternative form of
the χ² test. Its strength lies in that it can be extended to quite complicated contingency tables
involving several variables.
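The calculation just described can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pearson_chi_square` is a hypothetical helper, and the counts are the electrocardiogram by body mass index totals from the worked example later in this section (57 and 14 subjects with abnormal electrocardiograms, 40 and 65 with normal ones).

```python
def pearson_chi_square(table):
    """Return (expected, chi_square) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Under independence: expected = row total * column total / overall total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return expected, chi_square


# Rows: ECG abnormal, ECG normal. Columns: overweight, normal weight.
ecg_by_bmi = [[57, 14], [40, 65]]
expected, chi_square = pearson_chi_square(ecg_by_bmi)
print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2))
```

The statistic is then referred to a χ² distribution with (rows − 1) × (columns − 1) degrees of freedom, which is 1 for a 2 × 2 table.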
In a 2 × 2 contingency table in which the two variables are independent, the probability of an
individual occupying a given cell is the product of the corresponding marginal probabilities
(row total / overall total and column total / overall total), since these represent the respective
main-effect probabilities. Log-linear analysis is based on the fact that the logarithm of a
product is the sum of the logarithms of the individual terms in the product. In other words
log(p × q) = log p + log q. To put it in statistical jargon, the logarithm of the expected cell
frequency is a linear function of the logarithms of its components.
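Written out for a two-way table, the additive structure looks as follows. The notation is an assumption made for illustration and is not taken from the text: $m_{ij}$ is the expected frequency in row $i$ and column $j$, $n_{i+}$ and $n_{+j}$ are the row and column totals, and $N$ is the overall total.

```latex
\begin{aligned}
m_{ij} &= N \cdot \frac{n_{i+}}{N} \cdot \frac{n_{+j}}{N} = \frac{n_{i+}\, n_{+j}}{N} \\
\log m_{ij} &= \log n_{i+} + \log n_{+j} - \log N
\end{aligned}
```

Each term on the right depends on the row alone, on the column alone, or on neither, which is what makes the model linear on the log scale. An association between the two variables would require an extra term that depends on both $i$ and $j$.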
In log-linear analysis tables are formed that contain one-way, two-way, and higher order
associations. The logarithm of the cell frequency is estimated by means of a linear equation
(function in mathematical terminology). The log-linear model so developed starts with all the
one-way, two-way, and higher order associations. The aim is to construct a model such that
the cell frequencies in a contingency table are accounted for by the minimum number of
terms. This is done by a process of backward elimination. What this means is that one begins
with the maximum number of terms, and then drops a term in each round. Statisticians refer to
it as the backward hierarchical method.
In practice, one commences the analysis by including all the variables and all their
interactions. This is referred to as the saturated model. It reproduces the observed cell
frequencies exactly. Then the highest order interaction is removed, and its effect on how
closely the model can now predict the cell frequencies is noted. This process of progressive
elimination is continued. Each time a term is removed a statistical test is performed to
determine whether the accuracy of prediction falls to such an extent that the term most
recently eliminated should be retained in the final model. At each stage the assessment of
goodness-of-fit is made by means of a statistic known as the likelihood ratio. The final model
includes only the associations necessary to reproduce the observed frequencies.
The final model is evaluated by comparing the observed and expected frequencies for each
cell using the likelihood ratio. As with the χ² test, small expected frequencies can lead to
loss of power. It is recommended that all expected frequencies be greater than 1, and that
no more than 20% be less than 5.
We take the following example to illustrate how log-linear analysis works:
In a hospital accident and emergency service 176 subjects who attended
for acute chest pain were enrolled in a study. Of these 71 had abnormal
electrocardiograms and in the case of 105 it was normal. Of those with
abnormal electrocardiograms, 57 were overweight as judged by their body
mass index, and 14 were normal. By comparison out of the 105 subjects
with normal electrocardiograms 40 were overweight and 65 normal.
In the first group of 71 subjects with abnormal electrocardiograms, out of
the 57 overweight subjects 47 were smokers and 10 non-smokers.
Amongst the 14 with normal weights 8 were smokers and 6 non-smokers.
In the second group of 105 with normal electrocardiograms out of the 40
overweight subjects 25 were smokers and 15 non-smokers. Amongst the
65 with normal weights 35 were smokers and 30 non-smokers.
The investigators wish to assess the contribution that overweight and
smoking make to coronary artery disease.
BMI    ECG    SMOKE    COUNT
 1      1       1        47
 1      1       2        10
 2      1       1         8
 2      1       2         6
 1      2       1        25
 1      2       2        15
 2      2       1        35
 2      2       2        30

ECG:   1 = Abnormal     2 = Normal
BMI:   1 = Overweight   2 = Normal weight
Smoke: 1 = Smoker       2 = Non-smoker
We first perform a simple cross-tabulation to check whether the frequencies in each cell are
adequate to allow log-linear analysis.
Cross-tabulation
(each cell shows the observed count with the expected count beneath it)

Control: SMOKING = 1
Rows: ECG     Columns: BMI

             1          2        All
1           47          8         55
         34.43      20.57      55.00
2           25         35         60
         37.57      22.43      60.00
All         72         43        115
         72.00      43.00     115.00

Control: SMOKING = 2
Rows: ECG     Columns: BMI

             1          2        All
1           10          6         16
          6.56       9.44      16.00
2           15         30         45
         18.44      26.56      45.00
All         25         36         61
         25.00      36.00      61.00
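The expected counts in the cross-tabulation, and the adequacy check they are meant to support, can be reproduced with a short Python sketch. The helper names `expected_counts` and `adequate` are hypothetical, and the thresholds encode the rule of thumb quoted earlier (all expected frequencies greater than 1, no more than 20% below 5). The check is applied here to the expected counts under independence within each smoking stratum, which is an assumption about how the rule is meant to be used.

```python
def expected_counts(table):
    """Expected counts under independence for a two-way table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def adequate(expected_values, floor=1.0, small=5.0, max_small_share=0.20):
    """Rule of thumb: every expected count above `floor`, at most 20% below `small`."""
    n_small = sum(1 for e in expected_values if e < small)
    return min(expected_values) > floor and n_small / len(expected_values) <= max_small_share


# Rows: ECG abnormal, ECG normal. Columns: overweight, normal weight.
strata = {
    "smokers": [[47, 8], [25, 35]],
    "non-smokers": [[10, 6], [15, 30]],
}
all_expected = []
for name, table in strata.items():
    exp = expected_counts(table)
    all_expected.extend(e for row in exp for e in row)
    print(name, [[round(e, 2) for e in row] for row in exp])
print("adequate:", adequate(all_expected))
```

None of the eight expected counts falls below 5, so the data meet the rule of thumb.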
In both smokers and non-smokers the observed number of overweight subjects with an abnormal
electrocardiogram exceeds the expected number. We infer that in both groups there is an
association between being overweight and an abnormal electrocardiogram. How strong is the
interaction between an abnormal electrocardiogram, smoking and being overweight?
This question is better answered by log-linear analysis as shown below:
[In SPSS: Statistics > Loglinear, then click on Model Selection to open the Model Selection
Loglinear Analysis dialogue box.]
176 cases will be used in the analysis.

FACTOR Information
Factor      Level    Label
BMI           2
ECG           2
SMOKING       2
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* * * * * * * *   H I E R A R C H I C A L   L O G   L I N E A R   * * * * * * * *
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(1)

Factor         Code    OBS count    EXP count    Residual    Std Resid
BMI              1
 ECG             1
  SMOKING        1        47.5         47.5         .00         .00
  SMOKING        2        10.5         10.5         .00         .00
 ECG             2
  SMOKING        1        25.5         25.5         .00         .00
  SMOKING        2        15.5         15.5         .00         .00
BMI              2
 ECG             1
  SMOKING        1         8.5          8.5         .00         .00
  SMOKING        2         6.5          6.5         .00         .00
 ECG             2
  SMOKING        1        35.5         35.5         .00         .00
  SMOKING        2        30.5         30.5         .00         .00
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Likelihood ratio chi square =   .00000    DF = 0    P = 1.000
         Pearson chi square =   .00000    DF = 0    P = 1.000
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(2)
Tests that K-way and higher order effects are zero.

K    DF    L.R. Chisq     Prob    Pearson Chisq     Prob    Iteration
3     1       1.389      .2386        1.420        .2334        4
2     4      44.530      .0000       46.724        .0000        2
1     7      69.822      .0000       68.727        .0000        0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tests that K-way effects are zero.

K    DF    L.R. Chisq     Prob    Pearson Chisq     Prob    Iteration
1     3      25.292      .0000       22.004        .0001        0
2     3      43.142      .0000       45.304        .0000        0
3     1       1.389      .2386        1.420        .2334        0
(3)
Backward Elimination (p = .050) for DESIGN 1 with generating class
  BMI*ECG*SMOKING
Likelihood ratio chi square =   .00000    DF = 0    P = 1.000
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Effect deleted        DF    L.R. Chisq Change     Prob    Iter
BMI*ECG*SMOKING                   1.389          .2386

Step 1
The best model has generating class
  BMI*ECG
  BMI*SMOKING
  ECG*SMOKING
Likelihood ratio chi square =  1.38856    DF = 1    P = .239
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Effect deleted        DF    L.R. Chisq Change     Prob    Iter
BMI*ECG                1         27.631          .0000      2
BMI*SMOKING            1          3.080          .0792      2
ECG*SMOKING            1          3.505          .0612      2

Step 2
The best model has generating class
  BMI*ECG
  ECG*SMOKING
Likelihood ratio chi square =  4.46886    DF = 2    P = .107
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Effect deleted        DF    L.R. Chisq Change     Prob    Iter
BMI*ECG                1         32.094          .0000      2
ECG*SMOKING            1          7.968          .0048      2

Step 3
The best model has generating class
  BMI*ECG
  ECG*SMOKING
Likelihood ratio chi square =  4.46886    DF = 2    P = .107
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Observed and expected frequencies for the final model

Factor         Code    OBS count    EXP count    Residual    Std Resid
BMI              1
 ECG             1
  SMOKING        1        47.0         44.2        2.85         .43
  SMOKING        2        10.0         12.8       -2.85        -.79
 ECG             2
  SMOKING        1        25.0         22.9        2.14         .45
  SMOKING        2        15.0         17.1       -2.14        -.52
BMI              2
 ECG             1
  SMOKING        1         8.0         10.8       -2.85        -.86
  SMOKING        2         6.0          3.2        2.85        1.60
 ECG             2
  SMOKING        1        35.0         37.1       -2.14        -.35
  SMOKING        2        30.0         27.9        2.14         .41
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Likelihood ratio chi square =  4.46886    DF = 2    P = .107
         Pearson chi square =  4.88270    DF = 2    P = .087
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The backward elimination begins with the highest-order term, the three-way interaction
BMI * ECG * SMOKING. Its elimination produces a chi-square change of 1.389, which has
an associated significance level of 0.2386. Since this is greater than the criterion level of
0.05, the term is removed.
The procedure moves on to the next hierarchical level, described under Step 1. All two-way
interactions between the three variables are tested. Removal of BMI * ECG would
produce a large change of 27.631 in the likelihood ratio chi-square, which is highly
significant (Prob = .0000), so this term is retained. The smallest change (3.080,
Prob = .0792) is related to the BMI * SMOKING interaction, which is removed next. In Step 2
the removal of either remaining term would produce a significant change (32.094 for
BMI * ECG and 7.968 for ECG * SMOKING), so the procedure stops at the final model, which
contains the two-way interactions BMI * ECG and ECG * SMOKING.
Each successive estimate of the expected frequencies is called an iteration. Estimation stops
when the largest difference between successive estimates falls below the convergence
criterion.
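The changes reported under Step 1 can be checked with a short Python sketch. It fits each model by iterative proportional fitting, a standard algorithm for hierarchical log-linear models; that it matches the programme's internal procedure exactly is an assumption, and the names `margin`, `ipf` and `g2` are hypothetical helpers introduced here for illustration.

```python
import math

# Observed counts indexed by (BMI, ECG, SMOKING); 0 is the first level of each factor
counts = {
    (0, 0, 0): 47, (0, 0, 1): 10,
    (1, 0, 0): 8,  (1, 0, 1): 6,
    (0, 1, 0): 25, (0, 1, 1): 15,
    (1, 1, 0): 35, (1, 1, 1): 30,
}
BMI, ECG, SMOKING = 0, 1, 2


def margin(table, axes):
    """Sum a table over every axis not listed in `axes`."""
    out = {}
    for cell, n in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + n
    return out


def ipf(observed, terms, tol=1e-10, max_iter=1000):
    """Fit the hierarchical model whose generating class is `terms`."""
    fitted = {cell: 1.0 for cell in observed}
    targets = [(axes, margin(observed, axes)) for axes in terms]
    for _ in range(max_iter):
        biggest_change = 0.0
        for axes, target in targets:
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                # Rescale so the fitted margin matches the observed margin
                new = fitted[cell] * target[key] / current[key]
                biggest_change = max(biggest_change, abs(new - fitted[cell]))
                fitted[cell] = new
        if biggest_change < tol:
            break
    return fitted


def g2(observed, fitted):
    """Likelihood ratio chi-square comparing observed and fitted counts."""
    return 2.0 * sum(n * math.log(n / fitted[c]) for c, n in observed.items() if n > 0)


all_two_way = [(BMI, ECG), (BMI, SMOKING), (ECG, SMOKING)]
base = g2(counts, ipf(counts, all_two_way))

changes = {}
for dropped in all_two_way:
    reduced = [term for term in all_two_way if term != dropped]
    changes[dropped] = g2(counts, ipf(counts, reduced)) - base

print(round(base, 3))
for term, change in changes.items():
    print(term, round(change, 3))
```

Each entry of `changes` is the increase in the likelihood ratio chi-square when one two-way term is dropped from the model with all two-way interactions; the term with the smallest non-significant increase is the one removed.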
At the end the programme provides a table of observed and expected cell counts, the residuals
(Observed − Expected), and the standardised residuals (Residual / √Expected cell count). This
table helps us to assess how well the model fits the data. If the model fits the data well, the
residuals should be small and without any identifiable pattern. Also if the model fits well the
standardised residuals should be approximately normally distributed. Values of standardised
residuals greater than 1.96 or less than −1.96 suggest discrepancies.
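The goodness-of-fit test statistic checks how well the model fits the data. It is based on
Pearson's chi-square statistic

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

An alternative statistic is the likelihood ratio chi-square

$$G^2 = 2 \sum \text{Observed} \times \log_e \frac{\text{Observed}}{\text{Expected}}$$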
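Both statistics, together with the standardised residuals, can be recomputed for the final model in a few lines of Python. The sketch relies on one fact about this particular model: with generating class BMI*ECG and ECG*SMOKING, body mass index and smoking are treated as independent within each electrocardiogram category, so the expected counts have a closed form. The helper `total` and the variable names are hypothetical.

```python
import math

# Observed counts keyed by (BMI, ECG, SMOKING), coded 1 or 2 as in the data table
observed = {
    (1, 1, 1): 47, (1, 1, 2): 10,
    (1, 2, 1): 25, (1, 2, 2): 15,
    (2, 1, 1): 8,  (2, 1, 2): 6,
    (2, 2, 1): 35, (2, 2, 2): 30,
}
NAMES = ("bmi", "ecg", "smoking")


def total(**fixed):
    """Sum the observed counts over every cell that matches the fixed codes."""
    return sum(
        n for cell, n in observed.items()
        if all(cell[NAMES.index(name)] == code for name, code in fixed.items())
    )


pearson = 0.0
likelihood_ratio = 0.0
std_residuals = {}
for (bmi, ecg, smoking), obs in observed.items():
    # Expected count: (BMI by ECG total) * (ECG by SMOKING total) / (ECG total)
    exp = total(bmi=bmi, ecg=ecg) * total(ecg=ecg, smoking=smoking) / total(ecg=ecg)
    pearson += (obs - exp) ** 2 / exp
    likelihood_ratio += 2 * obs * math.log(obs / exp)
    std_residuals[(bmi, ecg, smoking)] = (obs - exp) / math.sqrt(exp)

print(round(likelihood_ratio, 5), round(pearson, 5))
print({cell: round(r, 2) for cell, r in std_residuals.items()})
```

No standardised residual lies outside ±1.96; the largest belongs to normal-weight non-smokers with an abnormal electrocardiogram, where 6 subjects were observed against about 3.2 expected.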
Smokers (Smoking 1)
                                Overweight (BMI 1)    Normal weight (BMI 2)
Cardiogram abnormal (ECG 1)            47                      8
Cardiogram normal (ECG 2)              25                     35
Odds ratio = (47 × 35) / (8 × 25) = 8.225

Non-smokers (Smoking 2)
                                Overweight (BMI 1)    Normal weight (BMI 2)
Cardiogram abnormal (ECG 1)            10                      6
Cardiogram normal (ECG 2)              15                     30
Odds ratio = (10 × 30) / (6 × 15) = 3.33
Comment
To perform a multi-way frequency analysis, tables are formed that contain the one-way,
two-way, three-way, and higher order associations. The log-linear model starts with all of the
one-, two-, three-, and higher-way associations, and then eliminates as many of them as
possible while still maintaining an adequate fit between expected and observed cell
frequencies. In log-linear modelling the full model that includes all possible main effects and
interactions fits the data exactly, with zero residual deviance. One then assesses whether a
less full model fits the data adequately by comparing its residual deviance with that of the
full model.
In our example, the three-way association between category of electrocardiogram, body mass
index, and smoking was tested first. It was eliminated because it was not significant. After
that the two-way associations were tested. The association between body mass index and
smoking was eliminated, whereas the associations of type of electrocardiogram with body
mass index and with smoking were found significant and were retained.
As we have seen the purpose of multi-way frequency analysis is to test for association among
discrete variables. Once a preliminary search for association is completed by simple 2 x 2
contingency tables a model is fitted that includes only the associations necessary to reproduce
the observed frequencies.
In the above example, we have a data set with a binary response variable
(Electrocardiogram abnormal/normal) and explanatory variables that are all
categorical. In such a situation one has a choice between using logistic regression and
log-linear modelling. For performing logistic regression rearrangement of the data is
needed so that for each variable we have a column of 1s and 0s.
Other differences from logistic regression are:
1. There is no clear demarcation between outcome and explanatory variables in
log-linear models.
2. Logistic regression allows continuous as well as categorical explanatory
variables to be included in the regression analysis.
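The rearrangement of the data mentioned above can be sketched in Python as follows. The column names `ecg_abnormal`, `overweight` and `smoker` are hypothetical, and coding the first category of each variable as 1 is an assumption, chosen because it agrees with the event count of 71 in the logistic regression output for these data.

```python
# (BMI, ECG, SMOKING, count), coded 1 or 2 as in the data table
grouped = [
    (1, 1, 1, 47), (1, 1, 2, 10), (2, 1, 1, 8), (2, 1, 2, 6),
    (1, 2, 1, 25), (1, 2, 2, 15), (2, 2, 1, 35), (2, 2, 2, 30),
]

rows = []
for bmi, ecg, smoking, count in grouped:
    record = {
        "ecg_abnormal": 1 if ecg == 1 else 0,  # response variable
        "overweight": 1 if bmi == 1 else 0,
        "smoker": 1 if smoking == 1 else 0,
    }
    # One row per subject, so each cell is repeated `count` times
    rows.extend(dict(record) for _ in range(count))

print(len(rows), sum(r["ecg_abnormal"] for r in rows))
```

Many packages also accept the grouped form directly, with the cell counts supplied as frequency weights, in which case the expansion is unnecessary.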
We now look at the result of logistic regression for the same data:

Binary Logistic Regression

Link Function: Logit

Response Information
Variable    Value    Count
ECG         1           71    (Event)
            0          105
            Total      176

Logistic Regression Table
                                                   Odds        95% CI
Predictor       Coef     StDev        Z        P   Ratio    Lower    Upper
Constant     -1.9523    0.3854    -5.07    0.000
BMI           1.7960    0.3641     4.93    0.000    6.03     2.95    12.30
SMOKING       0.6954    0.3754     1.85    0.064    2.00     0.96     4.18

Log-Likelihood = -100.890
Test that all slopes are zero: G = 35.599, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests
Method             Chi-Square    DF        P
Pearson               1.420       1    0.233
Deviance              1.389       1    0.239
Hosmer-Lemeshow       1.420       2    0.492

Observed and expected frequencies (Hosmer-Lemeshow groups)
                       Group
Value            1       2       3       4    Total
ECG = 1   Obs    6       8      10      47       71
          Exp  4.5     9.5    11.5    45.5
ECG = 0   Obs   30      35      15      25      105
          Exp 31.5    33.5    13.5    26.5
Total           36      43      25      72      176

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs         Number    Percent
Concordant      4650      62.4%
Discordant      1020      13.7%
Ties            1785      23.9%
Total           7455     100.0%

Summary Measures
Somers' D                0.49
Goodman-Kruskal Gamma    0.64
Kendall's Tau-a          0.24
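The summary measures follow from the pair counts in the table above. The Python sketch below uses the usual definitions of these three measures; that the programme applies exactly these definitions is an assumption, although the results agree with the printed values to two decimal places.

```python
concordant, discordant, ties = 4650, 1020, 1785
n_events, n_non_events = 71, 105

# Every subject with the event is paired with every subject without it
total_pairs = n_events * n_non_events

somers_d = (concordant - discordant) / total_pairs
gamma = (concordant - discordant) / (concordant + discordant)

# Kendall's tau-a divides by all possible pairs of subjects
n = n_events + n_non_events
tau_a = (concordant - discordant) / (n * (n - 1) / 2)

print(total_pairs, round(somers_d, 2), round(gamma, 2), round(tau_a, 2))
```

The large share of ties (23.9%) arises because two binary predictors give only four distinct predicted probabilities.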
Recall that in logistic regression we are interested in the odds of events. The logistic model is
about logarithms of odds, so that

$$\log_e \left\{ \frac{P}{1 - P} \right\} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots$$

Hence the coefficients are measures of change in log odds associated with a unit change in the
explanatory variable.
The programme calculates the odds ratios (actually these are the adjusted odds ratios) and lists
them under the heading of the same name. For BMI the adjusted odds ratio is 6.03 and for
smoking it is 2.00. These values are close to the crude odds ratios calculated from the
marginal 2 × 2 tables: (57 × 65) / (14 × 40) = 6.62 for BMI and (55 × 45) / (16 × 60) = 2.58 for
smoking.
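The odds ratios and their confidence limits are obtained by exponentiating the coefficients. The following Python sketch takes the coefficients and standard deviations from the output above; the helper name `odds_ratio_with_ci` is hypothetical, and the multiplier 1.96 assumes the usual normal approximation for a 95% interval.

```python
import math


def odds_ratio_with_ci(coef, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) from a log-odds coefficient."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# (Coef, StDev) as printed in the logistic regression table
estimates = {"BMI": (1.7960, 0.3641), "SMOKING": (0.6954, 0.3754)}
results = {name: odds_ratio_with_ci(*values) for name, values in estimates.items()}

for name, (ratio, lower, upper) in results.items():
    print(name, round(ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that includes 1, as the interval for smoking does, is consistent with the P value of 0.064 for that coefficient.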
The details of the rest of the output are not explained here. Suffice it to say that the model
fits well. We can conclude that in this group of subjects the odds of an abnormal cardiogram
were about 6 times higher in overweight subjects than in those of normal weight, and about
twice as high in smokers as in non-smokers, although the confidence interval for smoking
(0.96 to 4.18) includes 1.
Is there any interaction between being overweight and smoking? We can check for this by
introducing an interaction term in the regression.
The results of the analysis including an interaction term are shown below:
Logistic Regression Table
                                                   Odds        95% CI
Predictor       Coef     StDev        Z        P   Ratio    Lower    Upper
Constant     -1.6094    0.4472    -3.60    0.000
BMI           1.2040    0.6055     1.99    0.047    3.33     1.02    10.92
SMOKING       0.1335    0.5946     0.22    0.822    1.14     0.36     3.67
BMI*Smok      0.9032    0.7626     1.18    0.236    2.47     0.55    11.00

Log-Likelihood = -100.196
Test that all slopes are zero: G = 36.987, DF = 3, P-Value = 0.000
The results of the second regression do not require further explanation and are not discussed.
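For readers who wish to connect this output with the earlier analyses, a brief Python sketch is given here. It uses only the printed coefficients and log-likelihoods; the variable names are hypothetical, and reading SMOKING = 0 as non-smoker is an assumption based on the constant term, whose exponential (0.2) equals the odds of 6 to 30 among normal-weight non-smokers.

```python
import math

# Coefficients from the regression with the interaction term
coef = {"BMI": 1.2040, "SMOKING": 0.1335, "BMI*SMOKING": 0.9032}

# Odds ratio for overweight within each smoking stratum
or_bmi_non_smokers = math.exp(coef["BMI"])
or_bmi_smokers = math.exp(coef["BMI"] + coef["BMI*SMOKING"])

# Likelihood ratio test for the interaction: twice the gain in log-likelihood
loglik_main_effects = -100.890
loglik_interaction = -100.196
lr_statistic = 2 * (loglik_interaction - loglik_main_effects)

print(round(or_bmi_non_smokers, 2), round(or_bmi_smokers, 3), round(lr_statistic, 3))
```

The two stratum-specific odds ratios equal those obtained directly from the cross-tabulation, (10 × 30) / (6 × 15) among non-smokers and (47 × 35) / (8 × 25) among smokers. The likelihood ratio statistic of about 1.39 on 1 degree of freedom matches the chi-square change found for the three-way interaction in the log-linear analysis, so both approaches lead to the same conclusion that the interaction is not significant.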
Comment