Marginal Models For The Association Structure of Hierarchical Binary Responses

Journal of Applied Statistics

André G. F. C. Costa, Enrico A. Colosimo, Aline B. M. Vaz, José Luiz P. Silva &
Leila D. Amorim

To cite this article: André G. F. C. Costa, Enrico A. Colosimo, Aline B. M. Vaz, José Luiz P. Silva
& Leila D. Amorim (2017) Marginal models for the association structure of hierarchical binary
responses, Journal of Applied Statistics, 44:10, 1827-1838, DOI: 10.1080/02664763.2016.1238042

Published online: 04 Oct 2016.

André G. F. C. Costa (a), Enrico A. Colosimo (b), Aline B. M. Vaz (c), José Luiz P. Silva (b) and
Leila D. Amorim (d)
(a) ABG Consultoria, Belo Horizonte, Brazil; (b) Departamento de Estatística, Universidade Federal de Minas
Gerais, Belo Horizonte, MG, Brazil; (c) Universidade Estadual de Feira de Santana and Centro de Pesquisas René
Rachou, FIOCRUZ, Brazil; (d) Departamento de Estatística, Universidade Federal da Bahia, Salvador, BA, Brazil


Clustered binary responses are often found in ecological studies. Data analysis may include
modeling the marginal response probability. However, when the association is the main
scientific focus, modeling the correlation structure between pairs of responses is the key
part of the analysis. Second-order generalized estimating equations (GEE) are established in
the literature. Some of them are more efficient in computational terms, especially when
facing large clusters. Alternating logistic regression (ALR) and orthogonalized residual
(ORTH) GEE methods are presented and compared in this paper. Simulation results show a
slight superiority of ALR over ORTH. Marginal probabilities and odds ratios are also
estimated and compared in a real ecological study involving a three-level hierarchical
clustering. ALR and ORTH models are useful for modeling complex association structures with
large cluster sizes.

Received 26 May 2015; Accepted 14 September 2016

KEYWORDS: ALR; correlated binary responses; GEE; odds ratio; ORTH

1. Introduction
Clustered binary responses are often found in epidemiological and biological studies. It
is usually assumed that observations in the same cluster are correlated, whereas observations
from different clusters are not. Correlation can be induced by the design of the research,
such as in longitudinal or clustered cross-sectional studies including family or spatial
components. Hierarchically clustered data might arise from different levels within the same
cluster. Therefore, observations within the same level are expected to be correlated and,
additionally, observations at different levels are also expected to be correlated, since
they belong to the same cluster.
The objectives of statistical analysis of such data include (i) describing the dependence
of each binary response on explanatory variables, and (ii) characterizing the degree of
association between pairs of outcomes as well as the dependence of this association on
covariates [5]. When the association structure is not of interest, standard GEE, proposed

CONTACT Enrico A. Colosimo [email protected] Departamento de Estatística, Universidade Federal de

Minas Gerais, Belo Horizonte, MG31270-901, Brazil

© 2016 Informa UK Limited, trading as Taylor & Francis Group


by Liang and Zeger [10], provides a computationally fast approach for fitting marginal
models. Liang and Zeger’s original proposal estimates the regression parameters associated
with the expected value of an individual’s vector of binary responses and phrases
the working assumptions about the association between pairs of outcomes in terms of
marginal correlations. It is well known that if the assumed correlation structure is incorrect,
some efficiency is lost, although the main effect estimators remain consistent. However,
standard GEE is not adequate when scientific interest is placed on the association
structure.
When the association structure is the scientific focus, the usual GEE can be extended
through the introduction of a second estimating equation, which allows the association
between pairs of responses to depend on covariates. Depending on the construction of
the second estimating equation, the estimates of the mean and the associations may be
orthogonal or non-orthogonal [11,16]. The advantage of orthogonal estimation is that
misspecification of the association structure does not hamper consistency and asymptotic
normality of the marginal regression parameters. Among the first proposed extensions of GEE
was that of Prentice [15], who extended Liang and Zeger’s method to allow joint estimation
of probabilities and pairwise correlations. However, the correlation coefficient is not an
appropriate measure of association for binary responses, and some investigators may find the
odds ratio easier to interpret. Hence, Lipsitz et al. [12] and Liang et al. [11] modified the
Prentice method to allow modeling of the association through marginal odds ratios rather than
marginal correlations. They noted that their extended GEE version is nearly fully efficient
compared with a full likelihood approach.
Nevertheless, these methods are rarely used in practice because they can be computationally
infeasible if the number of measures within a cluster is large, which is very common
in problems involving hierarchical clustering structures. In order to solve these problems,
Alternating Logistic Regression (ALR) and Orthogonalized Residuals (ORTH) were proposed
by Carey et al. [5] and Zink [24], respectively. Both approaches specify within-cluster
association in terms of pairwise odds ratios. ALR is almost as efficient as the second-order
GEE of Liang et al. [11], and shares the computational ease of conventional GEE. In this
approach, the second estimating equation is defined in terms of conditional residuals. The
ORTH model is a newer approach to the second estimating equation, replacing the
conditional residuals with orthogonalized ones.
The outline of the paper is as follows. Section 2 describes the real data motivation. The GEE
methodology is carefully described in Section 3. A small Monte Carlo simulation study
is presented in Section 4, aiming to compare the ALR and ORTH GEE approaches with respect to
the association structure estimates. Real data results are presented and discussed in
Section 5. The paper ends with some final remarks in Section 6.

2. Real data motivation

This paper was motivated by a study of the distribution of fungal endophytes in which the
association structure is of primary research interest. Fungal endophytes inhabit healthy
plant tissues during at least one stage of their life cycle without causing any apparent
symptoms of disease or negative effects on the hosts [14]. In this work, the fungal endophytes
were isolated following a hierarchical nesting: fungal endophytes occur in leaf fragments
(level 3) of leaves (level 2), which are clustered within individual host trees (level 1) that,
in turn, occur in a collection site. See Figure 1 for a full picture of this hierarchical
structure.

Figure 1. Hierarchical structure of the ecological study.

Five sites were studied: three in the Andean Patagonian region of Patagonia, Argentina, and
two in the Atlantic rainforest area of Brazil. Five apparently healthy leaves
were collected from each of the 20 trees in each site. The trees were spaced approximately
5 m apart. All the leaves were stored in sterile plastic bags, and fungal isolation was
performed on the same day as the collection. After leaf surface sterilization, six fragments
(approximately 4 mm² each) were cut from each leaf: one from the base (C, near the petiole),
two from the middle vein (E and F), one from the left margin (D), one from the right margin
(B) and one from the tip (A) (6 leaf fragments/leaf; 30 leaf fragments/tree; 600 leaf
fragments/site; 3000 leaf fragments overall). The binary responses of the presence/absence
of a fungal endophyte were considered for statistical analysis. More information on this
study can be found in [22].
The objectives of this study were: (i) to estimate, using the dependence structure,
whether the fungal endophytes exhibit association within a leaf, individual host tree and
collection site; and (ii) to test the hypothesis that, as the distance between individual
host trees in the same collection site increases, the association of the fungal endophytes
decreases.

3. Marginal models
Suppose there are $N$ independent clusters, the $i$th one with $n_i$ observations. Let $i$
index the clusters, $i = 1, \ldots, N$, and let $j$ and $k$ identify two observations within
the same cluster, with $1 \le j < k \le n_i$. For the $i$th cluster, the response vector is
$Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{in_i})^{\top}$, where $Y_{ij}$ follows a Bernoulli
distribution with mean $\mu_{ij} = E(Y_{ij}) = P(Y_{ij} = 1)$. The GEE estimator proposed by
Liang and Zeger [10] for the marginal model is obtained by solving the following equation:

$$S(\beta) = \sum_{i=1}^{N} \left(\frac{\partial \mu_i}{\partial \beta}\right)^{\top} V_i^{-1} \left(Y_i - \mu_i(\beta)\right) = 0, \tag{1}$$

where $\mu_i = (\mu_{i1}, \ldots, \mu_{in_i})^{\top}$, $V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}$,
$R_i(\alpha)$ is an $n_i \times n_i$ working correlation matrix and $A_i$ is a diagonal matrix
with elements $\sigma_{ijj} = \mathrm{Var}(Y_{ij}) = \mu_{ij}(1 - \mu_{ij})$.
The mean function $\mu_i(\beta)$ depends on a $p$-vector of covariates $X_{ij}$ through a link
function $g(\cdot)$ as $\mu_{ij}(\beta) = g^{-1}(X_{ij}\beta)$. The parameters $\alpha$ are
treated as nuisance parameters and estimated by the method of moments [6]. Under mild
conditions and correct specification of the mean function $\mu_i(\beta)$, Liang and Zeger [10]
proved that the resulting estimates $\hat{\beta}$ are consistent for $\beta$ even when the
covariance structure is misspecified. Sandwich estimators based on
Equation (1) for the variance of $\hat{\beta}$ are available [10].
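To make the structure of Equation (1) concrete, the following is a minimal sketch of the left-hand side of the estimating equation, assuming a logit link and an exchangeable working correlation with a single parameter. The function names `expit` and `gee_score` and the argument `alpha` are illustrative choices, not part of the paper or of any package.

```python
import numpy as np

def expit(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + np.exp(-z))

def gee_score(beta, X_list, y_list, alpha=0.0):
    """Sum over clusters of D_i' V_i^{-1} (Y_i - mu_i), as in Equation (1).

    Assumptions of this sketch: logit link, exchangeable working
    correlation R_i(alpha) with off-diagonal entries equal to alpha.
    """
    score = np.zeros_like(beta, dtype=float)
    for X, y in zip(X_list, y_list):
        mu = expit(X @ beta)                  # marginal means mu_ij
        var = mu * (1.0 - mu)                 # sigma_ijj = mu_ij (1 - mu_ij)
        n = len(y)
        R = np.full((n, n), alpha)            # exchangeable working correlation
        np.fill_diagonal(R, 1.0)
        A_half = np.diag(np.sqrt(var))
        V = A_half @ R @ A_half               # V_i = A^{1/2} R A^{1/2}
        D = var[:, None] * X                  # d mu_i / d beta for the logit link
        score += D.T @ np.linalg.solve(V, y - mu)
    return score
```

With `alpha=0.0` the working correlation is the identity and the score reduces to the ordinary logistic regression score, which gives a simple sanity check for the sketch.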
In some applications, the within-cluster association may be the scientific focus. In
these cases, some proposals have been formulated that estimate the association vector α
through a second estimating equation, assumed independent of Equation (1). This assumption
has the advantage that misspecification of the association structure again does not
affect consistency and asymptotic normality of the marginal regression parameters.
Prentice [15] proposed the correlation coefficient, $\rho_{ijk} = \mathrm{Cor}(Y_{ij}, Y_{ik})$,
to represent the dependence or association between the pair $Y_{ij}$ and $Y_{ik}$. However,
for binary responses the correlation coefficient is not widely used as a measure of
association, mainly due to the difficulty in interpretation. Another issue is that the
correlation coefficient is restricted by the marginal means. Lipsitz et al. [12] and Liang
et al. [11] proposed modifications of the second estimating equation using the odds ratio to
account for the association between binary outcomes. They assumed a link function for
$\psi_{ijk} = \mathrm{OR}(Y_{ij}, Y_{ik})$ such that

$$\log(\psi_{ijk}) = \log\left\{\frac{\mu_{ijk}\,(1 - \mu_{ij} - \mu_{ik} + \mu_{ijk})}{(\mu_{ij} - \mu_{ijk})(\mu_{ik} - \mu_{ijk})}\right\} = X_{ijk}\alpha, \qquad 1 \le j < k \le n_i,$$

where $\mu_{ijk} = E(Y_{ij} Y_{ik})$. While estimating $\beta$ through Equation (1), their
method solves a second-order equation for $\alpha$ using $Z_{ijk} = Y_{ij} Y_{ik}$.
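The odds ratio definition above can be inverted to recover the joint probability μijk from the two marginal means and ψijk, by solving a quadratic in μijk. The sketch below illustrates this; the helper names `joint_prob` and `odds_ratio` are hypothetical, and the closed-form root is standard algebra rather than something stated in the paper.

```python
import math

def odds_ratio(mu_j, mu_k, mu_jk):
    """Pairwise odds ratio psi from the marginal means and the joint probability."""
    p00 = 1.0 - mu_j - mu_k + mu_jk
    return mu_jk * p00 / ((mu_j - mu_jk) * (mu_k - mu_jk))

def joint_prob(mu_j, mu_k, psi):
    """Joint probability E(Yj Yk) implied by the marginal means and odds ratio psi.

    Solves (psi - 1) m^2 - a m + psi mu_j mu_k = 0 with
    a = 1 + (mu_j + mu_k)(psi - 1), taking the smaller root,
    which is the one that yields a valid 2x2 table.
    """
    if abs(psi - 1.0) < 1e-12:
        return mu_j * mu_k                    # independence
    a = 1.0 + (mu_j + mu_k) * (psi - 1.0)
    disc = a * a - 4.0 * psi * (psi - 1.0) * mu_j * mu_k
    return (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))
```

A round trip such as `odds_ratio(0.3, 0.6, joint_prob(0.3, 0.6, 1.5))` should return the odds ratio that was supplied, up to floating-point error.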
In real applications involving complex hierarchical clustering structures, such as the one
described in Section 2, large clusters are expected. Due to the high computational
effort in these situations, the GEE extensions of Prentice [15] and Lipsitz et al. [12] are
not feasible.
An alternative that shares the computational ease of conventional GEE is ALR, proposed
by Carey et al. [5]. Let $\gamma_{ijk} = \log(\psi_{ijk})$; then

$$\operatorname{logit} P(Y_{ij} = 1 \mid Y_{ik} = y_{ik}) = \gamma_{ijk}\, y_{ik} + \log\left(\frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}}\right). \tag{2}$$

The pairwise log odds ratio is obtained as the regression coefficient in a logistic regression
of $Y_{ij}$ on $Y_{ik}$, as long as the second term on the right-hand side of Equation (2) is
used as an offset. Denote by $C_i$ the $m_i$-vector of conditional residuals with elements
$Y_{ij} - \xi_{ijk}$, where $\xi_{ijk} = E(Y_{ij} \mid Y_{ik} = y_{ik})$, and by $S_i$ the
diagonal matrix with elements $\xi_{ijk}(1 - \xi_{ijk})$. The ALR estimator for
$\theta = (\beta, \alpha)$ is the simultaneous solution of the first estimating equation
given in (1) and

$$S_{\alpha,\mathrm{ALR}} = \sum_{i=1}^{N} \left(\frac{\partial \xi_i}{\partial \alpha}\right)^{\top} S_i^{-1} C_i = 0. \tag{3}$$

On the other hand, the stochastic nature of $S_i$ and $\partial \xi_i / \partial \alpha$ does
not allow a theoretical investigation of Equation (3) through the standard theory of
estimating equations. Another drawback is that $S_{\alpha,\mathrm{ALR}}$ is invariant to
permutations of the vector $Y_i$ [9], whereas the robust variance estimator is not.
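The identity in Equation (2) can be checked numerically for a single pair: the logit of the conditional probability computed directly from the 2x2 table of (Yij, Yik) should equal the log odds ratio times yik plus the offset term. The following sketch does this; the function names are illustrative only.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def conditional_logit(mu_j, mu_k, mu_jk, y_k):
    """logit P(Yj = 1 | Yk = y_k), computed directly from the 2x2 table."""
    if y_k == 1:
        p = mu_jk / mu_k                       # P(Yj = 1 | Yk = 1)
    else:
        p = (mu_j - mu_jk) / (1.0 - mu_k)      # P(Yj = 1 | Yk = 0)
    return logit(p)

def alr_linear_predictor(mu_j, mu_k, mu_jk, y_k):
    """Right-hand side of Equation (2): gamma * y_k + offset."""
    p00 = 1.0 - mu_j - mu_k + mu_jk
    gamma = math.log(mu_jk * p00 / ((mu_j - mu_jk) * (mu_k - mu_jk)))
    offset = math.log((mu_j - mu_jk) / p00)
    return gamma * y_k + offset
```

For any valid table the two functions agree for both values of `y_k`, which is why the log odds ratio can be read off as the slope of a logistic regression with the offset held fixed.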
Zink [24] presented the ORTH model as an alternative to ALR that resolves the
dependence of the variance estimates on the observation order. The ORTH approach again keeps
Equation (1) to estimate the parameters of the mean. But unlike ALR, where
estimation of the association parameters $\alpha$ is based on conditional expectations
$E(Y_{ij} \mid Y_{ik})$, estimation of $\alpha$ is instead based on expectations of the
cross-products $Y_{ij} Y_{ik}$ conditional on $Y_{ij}$ and $Y_{ik}$, for
$1 \le j < k \le n_i$. An approximate covariance matrix is then built in a way that remains
computationally feasible for larger clusters [20].
Let $Z_{ijk}$ still be equal to $Y_{ij} Y_{ik}$. The orthogonalized residuals are defined as
the residuals from linear regressions of $Z_{ijk}$ on $Y_{ij}$ and $Y_{ik}$:

$$Q_{ijk} = Z_{ijk} - \left[\mu_{ijk} + b_{ijk:j}(Y_{ij} - \mu_{ij}) + b_{ijk:k}(Y_{ik} - \mu_{ik})\right], \tag{4}$$

such that

$$b_{ijk:j} = \frac{\mu_{ijk}(1 - \mu_{ik})(\mu_{ik} - \mu_{ijk})}{d_{ijk}}, \qquad b_{ijk:k} = \frac{\mu_{ijk}(1 - \mu_{ij})(\mu_{ij} - \mu_{ijk})}{d_{ijk}},$$

$$d_{ijk} = \sigma_{ijj}\sigma_{ikk} - \sigma_{ijk}^{2}, \qquad \sigma_{ijk} = \mathrm{Cov}(Y_{ij}, Y_{ik}) = \mu_{ijk} - \mu_{ij}\mu_{ik}.$$

After the definition of the orthogonalized residuals $Q_{ijk}$, the second estimating
equation is given as

$$S_{\alpha,\mathrm{ORTH}} = \sum_{i=1}^{N} E_i^{\top} P_i^{-1} Q_i, \tag{5}$$

where $P_i$ is an approximation for the covariance matrix of $Q_i = \{Q_{ijk}\}$, that is,

$$P_i = \mathrm{diag}(\nu_{ijk}^{1/2})\, R^{*}_{iQQ}(\lambda)\, \mathrm{diag}(\nu_{ijk}^{1/2}),$$

$$\nu_{ijk} = \mathrm{Var}(Q_{ijk}) = \frac{\mu_{ijk}(\mu_{ij} - \mu_{ijk})(\mu_{ik} - \mu_{ijk})(1 - \mu_{ij} - \mu_{ik} + \mu_{ijk})}{\mu_{ij}\mu_{ik}(1 - \mu_{ij} - \mu_{ik} + 2\mu_{ijk}) - \mu_{ijk}^{2}},$$

whose denominator equals $d_{ijk}$, and $R^{*}_{iQQ}(\lambda)$ is an exchangeable correlation
matrix, depending on the correlation parameter $\lambda$, that approximates
$R_{iQQ} = \mathrm{Corr}(Q_i)$. Zink [24] showed that
$S_{\alpha,\mathrm{ORTH}} = S_{\alpha,\mathrm{ALR}}$ when $\lambda = 0$.
On the other hand, taking $\lambda > 0$ may improve the efficiency of the ORTH estimates. The
formulation in Equation (5) offers the advantage that it follows a standard estimating
equation approach. Also, the associated robust variance estimator is invariant to
permutations of $Y_i$.
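The defining property of the residual in Equation (4) is that it has mean zero and is uncorrelated with both Yij and Yik. The sketch below checks this for one pair by enumerating the four outcomes of the 2x2 table, and compares the resulting variance with the closed form written as the product of the four cell probabilities divided by dijk. The function name `orth_moments` and the returned dictionary keys are illustrative; the closed form with denominator dijk is the algebraic identity used in this sketch, checked here numerically rather than taken from any package.

```python
def orth_moments(mu_j, mu_k, mu_jk):
    """Moments of the orthogonalized residual Q for one pair (j, k)."""
    s_jj = mu_j * (1.0 - mu_j)
    s_kk = mu_k * (1.0 - mu_k)
    s_jk = mu_jk - mu_j * mu_k
    d = s_jj * s_kk - s_jk ** 2
    b_j = mu_jk * (1.0 - mu_k) * (mu_k - mu_jk) / d
    b_k = mu_jk * (1.0 - mu_j) * (mu_j - mu_jk) / d

    # Cell probabilities of the 2x2 table of (Yj, Yk).
    cells = {
        (1, 1): mu_jk,
        (1, 0): mu_j - mu_jk,
        (0, 1): mu_k - mu_jk,
        (0, 0): 1.0 - mu_j - mu_k + mu_jk,
    }

    mean_q = cov_q_yj = cov_q_yk = var_q = 0.0
    for (yj, yk), p in cells.items():
        q = yj * yk - (mu_jk + b_j * (yj - mu_j) + b_k * (yk - mu_k))
        mean_q += p * q
        cov_q_yj += p * q * (yj - mu_j)
        cov_q_yk += p * q * (yk - mu_k)
        var_q += p * q * q                     # valid because E(Q) = 0

    nu = cells[(1, 1)] * cells[(1, 0)] * cells[(0, 1)] * cells[(0, 0)] / d
    return {"mean_q": mean_q, "cov_q_yj": cov_q_yj,
            "cov_q_yk": cov_q_yk, "var_q": var_q, "nu": nu}
```

For example, `orth_moments(0.3, 0.6, 0.2)` can be inspected to confirm that the two covariances vanish and that the enumerated variance agrees with the closed form.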

4. Monte Carlo simulation

A Monte Carlo simulation was performed to compare the performance of the ALR
and ORTH approaches. Simulations were designed to explore the estimators' properties, in
particular those related to the association structure, as well as their robust variance
estimators. The simulation design mirrors that of the real data motivation presented in
Section 2.
Qaqish [19] introduced a family of multivariate binary distributions that allows, in a
simple way, generating correlated binary variables for a specified mean vector and correla-
tion structure. Multivariate binary responses were obtained for the simulation scenarios by
using the methodology proposed by Qaqish [19]. It is implemented in software R, package
The mean model and the association structure were generated, respectively, as follows:

$$\operatorname{logit} \Pr(Y = 1) = \beta_0 + \beta_1 x + \beta_2 x^2,$$

$$\log \mathrm{OR}(Y_j, Y_k) =
\begin{cases}
\alpha_0 I(\text{within collection site}) + \alpha_3\, \text{Distance}_{jk}, & \text{if } j \text{ and } k \text{ are different individual host trees in the same collection site},\\
\alpha_0 I(\text{within collection site}) + \alpha_1 I(\text{within host tree}), & \text{if } j \text{ and } k \text{ are different leaves in the same individual host tree},\\
\alpha_0 I(\text{within collection site}) + \alpha_1 I(\text{within host tree}) + \alpha_2 I(\text{within leaf}), & \text{if } j \text{ and } k \text{ are different fragments in the same leaf},
\end{cases}$$

where β0 = −1.00, β1 = 0.10, β2 = 0.08, α0 = 1.073, α1 = 0.741, α2 = 0.792 and α3 =
−0.177. The regressor x is defined as the first-level variable and assumes the values x =
−5, −4, −3, −2, −1, 1, 2, 3, 4, 5. The design matrix was completely balanced. For instance, for
a sample of size 400, 10 different sites were considered (one for each value of x) in the mean
model. In each site, a 24 balanced design was taken for the complete four-level cluster
structure. Samples of sizes 400, 800, 1600, 3200 and 6400 were considered in the simulation,
obtained by multiplying the standard size of 400. N = 1000 Monte Carlo simulations were
performed for each sample size.
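As an illustration of how the association model maps a pair of observations to a log odds ratio, the sketch below encodes the three cases with the simulation values of α0 to α3. The function name `log_odds_ratio`, the boolean arguments and the dictionary keys are hypothetical conventions chosen for this example; all pairs are assumed to lie within the same collection site.

```python
# Simulation values of the association parameters, as stated in the text.
ALPHA = {"site": 1.073, "tree": 0.741, "leaf": 0.792, "distance": -0.177}

def log_odds_ratio(same_tree, same_leaf, distance=0.0, alpha=ALPHA):
    """Pairwise log odds ratio for two fragments in the same collection site.

    same_tree: both fragments come from the same host tree
    same_leaf: both fragments come from the same leaf (implies same_tree)
    distance:  distance between the two host trees (used only when the
               fragments come from different trees)
    """
    if same_leaf and not same_tree:
        raise ValueError("fragments on the same leaf must share a host tree")
    if not same_tree:
        return alpha["site"] + alpha["distance"] * distance
    if not same_leaf:
        return alpha["site"] + alpha["tree"]
    return alpha["site"] + alpha["tree"] + alpha["leaf"]
```

The nesting makes the log odds ratio accumulate as pairs become closer in the hierarchy, while the distance term only enters for fragments on different trees.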
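Properties of the correlation structure estimators were assessed under two situations:
(1) using the true mean structure and (2) misspecifying the mean structure
by assuming a straight line (β2 = 0).
Relative bias estimates of the association structure were obtained as

$$\frac{|\alpha_i - \bar{\hat{\alpha}}_i|}{|\alpha_i|},$$

where $\bar{\hat{\alpha}}_i = \sum_{j=1}^{N} \hat{\alpha}_{ij}/N$, for $i = 0, \ldots, 3$ and
$j = 1, \ldots, N$, with $\hat{\alpha}_{ij}$ the estimate of $\alpha_i$ in the $j$th Monte
Carlo replicate. The relative bias estimate of the robust variance was obtained as

$$\frac{|se(\hat{\alpha}_i) - \widehat{se}(\hat{\alpha}_i)|}{se(\hat{\alpha}_i)},$$

where

$$se(\hat{\alpha}_i) = \sqrt{\sum_{j=1}^{N} \frac{(\hat{\alpha}_{ij} - \bar{\hat{\alpha}}_i)^2}{N - 1}} \quad \text{and} \quad \widehat{se}(\hat{\alpha}_i) = \sum_{j=1}^{N} \frac{\widehat{se}(\hat{\alpha}_{ij})}{N - 1}.$$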
Simulation results are presented in Figures 2 and 3. The following conclusions can be
drawn from these figures: (1) estimates of α0 showed the largest bias in the misspecified
model, staying around 20% for all sample sizes. The relative bias of the other association
estimates under the misspecified model is small, but it seems to increase with the
sample size. As expected, the relative bias for the correct model gets smaller as the
sample size increases for both methods; (2) in Figure 3, the relative standard errors showed a
slight superiority of the ALR method over ORTH; and (3) in general, ALR and ORTH
behave very similarly.

Figure 2. Relative bias for the correct and misspecified models.
Table 1 presents a summary of the computational time for the simulation scenarios. ALR
is faster than ORTH for the smaller group size (40 observations per group), but ORTH is
faster on average for the larger one (200 observations per group).

Figure 3. Relative standard errors for the correct and misspecified models.

Table 1. Computation time (in seconds).

                                      Sample size in group
                                40                              200
Sample size  Model    Mean    STD    Min     Max      Mean     STD     Min      Max
400          ALR       2.4    0.2    2.1     2.9     151.3   233.2    73.9    839.6
             ORTH     13.4    5.3    7.1    32.6      29.5    10.4    20.1     70.8
800          ALR       4.7    0.4    3.9     5.9     271.6   442.4   116.3   1680.4
             ORTH     21.8    5.5   14.3    35.8      83.8    55.7    40.6    233.1
1600         ALR      10.3    0.6    9.1    11.3     261.4    30.3   237.6    310
             ORTH     41.6   17.2   28.6   126.8     213.9   150.5    82.6    618.4
3200         ALR      17.6    1.3   15.7    21.1     498.1    52.7   461.5    609.3
             ORTH    100.2   33.2   72.4   204.4     435.5   262.7   169.1   1262.7

5. Numerical results
We now return to the real data set described in Section 2. The following linear
predictors were used for the mean and association structures of the three-level fungal
endophytes study:

$$\operatorname{logit} \Pr(Y = 1) = \beta_0 + \beta_1 I(\text{Country} = \text{Brazil}) + \beta_2 I(\text{Fragment} = A) + \beta_3 I(\text{Fragment} = B) + \beta_4 I(\text{Fragment} = D) + \beta_5 I(\text{Fragment} = E) + \beta_6 I(\text{Fragment} = F),$$

$$\log \mathrm{OR}(Y_j, Y_k) =
\begin{cases}
\alpha_1 I(\text{within collection site}) + \alpha_4\, \text{Distance}_{jk}, & \text{if } j \text{ and } k \text{ are different individual host trees in the same collection site},\\
\alpha_1 I(\text{within collection site}) + \alpha_2 I(\text{within host tree}), & \text{if } j \text{ and } k \text{ are different leaves in the same individual host tree},\\
\alpha_1 I(\text{within collection site}) + \alpha_2 I(\text{within host tree}) + \alpha_3 I(\text{within leaf}), & \text{if } j \text{ and } k \text{ are different fragments in the same leaf},
\end{cases}$$

where distance is measured in decameters (minimum = 0, maximum = 11.4).

Table 2 presents the estimates of the ALR and ORTH models. Results from the ALR and ORTH
GEE methods are very similar and lead to the same conclusions. The estimate of λ for
ORTH is very close to zero, which might explain the similar results of the two methods. There
is a significant association of fungal endophytes at the individual host tree and leaf
levels. A tree with fungus on any leaf has 2.3 (95% CI 1.1, 5.1) times the odds of having
fungus on any other leaf. Fungi were also more likely to aggregate within leaves. The
endophytes of woody plants are horizontally transmitted by hyphal fragmentation and/or
spores from plant to plant [2,7] and may be released passively by herbivores or physical
agents such as wind or rain [17]. Thus, fungal endophyte colonization depends on the
availability and viability of fungal propagules in the surrounding environment [18]. This
mode of transmission may explain the association observed at the leaf and individual host
tree levels for fungal endophytes.
However, the effect of the distance between host trees within each collection site
was not statistically significant. This result corroborates that dispersal limitation
is an important factor in explaining the biogeographic pattern of fungal endophytes [13].

Table 2. Results of ALR and ORTH models for fungal endophytes study.

                                  ALR                              ORTH (λ = 0.006)
Mean                 β       se(β)   p-value   O.R        β       se(β)   p-value   O.R
Intercept          −0.953    0.120    0.000     –       −0.977    0.188    0.000     –
Country = Brazil   −0.090    0.518    0.862    0.914    −0.034    0.585    0.954    0.967
Fragment = A       −0.498    0.294    0.091    0.608    −0.502    0.294    0.087    0.605
Fragment = B       −0.926    0.213   <0.01     0.396    −0.935    0.212   <0.01     0.393
Fragment = D       −1.173    0.181   <0.01     0.309    −1.185    0.181   <0.01     0.306
Fragment = E       −0.718    0.156   <0.01     0.488    −0.725    0.156   <0.01     0.484
Fragment = F       −0.672    0.247    0.006    0.511    −0.678    0.246    0.006    0.507

Association          α       se(α)   p-value   P.O.R      α       se(α)   p-value   P.O.R
Within site         0.477    0.672    0.478    1.612     0.195    0.388    0.615    1.215
Within host tree    0.817    0.404    0.043    2.265     0.857    0.404    0.034    2.355
Within leaf         0.896    0.184    0.000    2.449     0.928    0.210    0.000    2.528
Distancejk         −0.003    0.005    0.609    0.997    −0.003    0.004    0.443    0.997
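The odds ratio columns of Table 2 are the exponentiated coefficients. The sketch below shows this conversion together with a Wald-type 95% interval built from the reported standard error. The function name `odds_ratio_ci` and the use of a normal quantile of 1.96 are assumptions of this example; an interval computed this way from the rounded table entries need not reproduce the interval quoted in the text exactly.

```python
import math

def odds_ratio_ci(estimate, se, z=1.96):
    """Odds ratio and Wald-type confidence limits from a log odds ratio.

    estimate: coefficient on the log odds (ratio) scale
    se:       its (robust) standard error
    z:        normal quantile, 1.96 for an approximate 95% interval
    """
    point = math.exp(estimate)
    lower = math.exp(estimate - z * se)
    upper = math.exp(estimate + z * se)
    return point, lower, upper

# ALR estimate for the within-host-tree association, taken from Table 2.
within_tree = odds_ratio_ci(0.817, 0.404)
```

The same function applies to the mean model coefficients, where the exponentiated value is an odds ratio relative to the reference category.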

In the mean structure, there is no significant difference between Brazil and Argentina
concerning the odds of presence of fungal endophytes. This result corroborates previous
studies, which have shown that most fungal endophytes belong to Sordariomycetes
[1,8]. However, the odds of presence of fungal endophytes in fragment C are significantly
higher than in the other fragments, except fragment A. There is a higher probability of
infection near the petiole than in more distal leaf fragments [3,4,23]. The expansion of the
petiole end is smaller than that of the more distal two-thirds of the leaf. Consequently,
infections will become more ‘diluted’ if they were established in the more distal leaf parts
before the leaf fully expanded, when compared with the petiole segments. The result is a leaf
with denser infections at the petiole end [23]. Based on our results, we suggest a preference
for leaf tissue colonization by Sordariomycetes. Probably, leaf development determines the
colonization pattern.

6. Final remarks
ALR and ORTH GEE methods were presented and compared in this paper. Simulation
results showed a slight superiority of the ALR method over the ORTH one. Marginal
probabilities and odds ratios were also estimated and compared in a real ecological study
involving a three-level hierarchical clustering. Results from the ALR and ORTH GEE methods
were very similar. A tree with fungus on any leaf had 2.3 (95% CI 1.1, 5.1) times the odds
of having fungus on any other leaf. Fungi were also more likely to aggregate within
leaves. However, the effect of the distance between host trees within each collection site
was not statistically significant.
Simulation and real data results were obtained by using some R packages [21]. In order
to fit the ALR and ORTH models, the geepack and orth packages were used, respectively. R
scripts can be obtained upon request from the first author.
An important issue related to GEE is its validity under missing data. This issue
was not explored in this paper because the ecological study is a complete data set. However,
it should be noted that the GEE methodology studied in this paper is only valid under
a missing completely at random (MCAR) pattern [6].

In terms of computational time, ALR is faster than ORTH for smaller cluster sizes, but
ORTH becomes faster for bigger clusters (more than 150 observations in each group). ALR and
ORTH models have proved to be useful for modeling a complex association structure in
the presence of large cluster sizes.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
Research partially supported by FAPEMIG, CAPES and CNPq grants (E.A.C). Research partially
supported by CAPES and CNPq grants (A.B.M.V). Research partially supported by CAPES and
CNPq grants (L.D.A).

