Marginal Models For The Association Structure of Hierarchical Binary Responses
André G. F. C. Costa, Enrico A. Colosimo, Aline B. M. Vaz, José Luiz P. Silva &
Leila D. Amorim
To cite this article: André G. F. C. Costa, Enrico A. Colosimo, Aline B. M. Vaz, José Luiz P. Silva
& Leila D. Amorim (2017) Marginal models for the association structure of hierarchical binary
responses, Journal of Applied Statistics, 44:10, 1827-1838, DOI: 10.1080/02664763.2016.1238042
1. Introduction
Clustered binary responses are often found in epidemiological and biological studies. It
is usually assumed that observations in the same cluster are correlated, whereas
observations from different clusters are not. Correlation can be induced by the design of
the research, as in longitudinal or clustered cross-sectional studies that include family or
spatial components. Hierarchically clustered data arise when observations are grouped at
different levels within the same cluster. Observations within the same level are then
expected to be correlated, and observations at different levels are also expected to be
correlated, since they belong to the same cluster.
The objectives of statistical analysis of such data include (i) describing the dependence
of each binary response on explanatory variables, and (ii) characterizing the degree of
association between pairs of outcomes as well as the dependence of this association on
covariates [5]. When the association structure is not of interest, standard GEE, proposed
by Liang and Zeger [10], provides a computationally fast approach for fitting marginal
models. Liang and Zeger's original proposal estimates the regression parameters associated
with the expected value of an individual's vector of binary responses and phrases
the working assumptions about the association between pairs of outcomes in terms of
marginal correlations. It is well known that if the assumed correlation structure is incorrect,
some efficiency is lost, although the main effect estimators remain consistent. However,
standard GEE is not adequate when scientific interest lies in the association
parameters.
When the association structure is the scientific focus, the usual GEE can be extended
through the introduction of a second estimating equation, which allows the association
between pairs of responses to depend on covariates. Depending on the construction of
the second estimating equation, the estimates of the mean and association parameters may
be orthogonal or non-orthogonal [11,16]. The advantage of orthogonal estimation is that
misspecification of the association structure does not hamper consistency and asymptotic
normality of the marginal regression parameters. Among the first proposed extensions of
GEE was that of Prentice [15], who extended Liang and Zeger's method to allow joint
estimation of probabilities and pairwise correlations. However, the correlation coefficient is
not an appropriate measure of association for binary responses, and some investigators may
find the odds ratio easier to interpret. Hence, Lipsitz et al. [12] and Liang et al. [11] modified
the Prentice method to allow modeling of the association through marginal odds ratios rather
than marginal correlations. They noted that their extended GEE is nearly fully efficient
compared to a full likelihood approach.
Nevertheless, these methods are rarely used in practice because they can be computationally
infeasible if the number of measures within a cluster is large, which is very common
in problems involving hierarchical clustering structures. To solve this problem,
Alternating Logistic Regression (ALR) and Orthogonalized Residuals (ORTH) were proposed
by Carey et al. [5] and Zink [24], respectively. Both approaches specify within-cluster
association in terms of pairwise odds ratios. ALR is almost as efficient as the second-order
GEE of Liang et al. [11], and shares the computational ease of conventional GEE. In this
approach, the second estimating equation is defined in terms of conditional residuals. The
ORTH model is a newer approach to the second estimating equation that replaces
conditional residuals with orthogonalized ones.
The outline of the paper is as follows. Section 2 describes the motivating real data. The GEE
methodology is described in detail in Section 3. A small Monte Carlo simulation study
comparing the ALR and ORTH GEE approaches for estimating the association structure
is presented in Section 4. Real data results are presented and discussed in Section 5. The
paper ends with some final remarks in Section 6.
(level 3) of leaves (level 2), which are clustered within individual host trees (level 1) that,
in turn, occur in a collection site. See Figure 1 for a full picture of this hierarchical
structure.
Five sites were studied: three in the Andean Patagonian region of Patagonia, Argentina, and
two in the Atlantic rainforest area of Brazil. Five apparently healthy leaves
were collected from each of the 20 trees in each site. The trees were spaced approximately
5 m apart. All the leaves were stored in sterile plastic bags, and fungal isolation was
performed on the day of collection. The leaves were surface-sterilized, and six
fragments (approximately 4 mm²) were then cut from each leaf: one
from the base (C, near the petiole), two from the middle vein (E and F), one from the left
margin (D), one from the right margin (B) and one from the tip (A) (6 leaf fragments/leaf;
30 leaf fragments/tree; 600 leaf fragments/site; 3000 leaf fragments overall). The binary
response, the presence or absence of a fungal endophyte in each fragment, was considered
for statistical analysis. More information on this study can be found in [22].
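The nesting implied by these counts can be enumerated directly. The following Python sketch is only an illustration of the sampling layout: the identifiers (`site`, `tree`, `leaf`, `fragment`) and the numeric labels are assumptions of the sketch, and the number of sites is taken as 3000/600 = 5.

```python
from itertools import product

# Counts taken from the sampling description; the labels are hypothetical.
N_SITES = 5                      # 3000 fragments overall / 600 fragments per site
TREES_PER_SITE = 20
LEAVES_PER_TREE = 5
FRAGMENTS = ["A", "B", "C", "D", "E", "F"]   # positions on the leaf


def build_design():
    """Return one record per leaf fragment with its site, tree and leaf identifiers."""
    rows = []
    for site, tree, leaf, frag in product(
        range(1, N_SITES + 1),
        range(1, TREES_PER_SITE + 1),
        range(1, LEAVES_PER_TREE + 1),
        FRAGMENTS,
    ):
        rows.append({"site": site, "tree": tree, "leaf": leaf, "fragment": frag})
    return rows


design = build_design()
per_tree = sum(1 for r in design if r["site"] == 1 and r["tree"] == 1)
per_site = sum(1 for r in design if r["site"] == 1)
print(per_tree, per_site, len(design))
```

Tree identifiers are reused across sites here, so a tree is identified by the pair (site, tree) and a leaf by the triple (site, tree, leaf).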
The objectives of this study were: (i) to estimate, using the dependence structure,
whether the fungal endophytes exhibit association within a leaf, an individual host tree and
a collection site; and (ii) to test the hypothesis that the association of the fungal endophytes
decreases as the distance between different individual host trees in the same collection site
increases.
3. Marginal models
Suppose there are N independent clusters, each with ni observations. Let the index i
identify the ith cluster, i = 1, . . . , N, and let j and k identify two observations within
the same cluster, with 1 ≤ j < k ≤ ni. For the ith cluster, the response vector is given
by Yi = (Yi1, Yi2, . . . , Yini)′, such that Yij follows a Bernoulli distribution with mean
μij = E(Yij) = P(Yij = 1). The GEE estimator proposed by Liang and Zeger [10] for the
marginal model is obtained by solving the following equation:
\[
S(\beta) = \sum_{i=1}^{N} \left(\frac{\partial \mu_i}{\partial \beta}\right)^{\top} V_i^{-1}\,\bigl(Y_i - \mu_i(\beta)\bigr) = 0, \tag{1}
\]
where $V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}$ is the working covariance matrix of Yi, Ri(α) is a working
correlation matrix of dimension ni × ni and Ai is a diagonal matrix with elements given by
σijj = Var(Yij) = μij(1 − μij). The mean function μi(β) depends on a p-vector of covariates
Xij through a link function g(·) as μij(β) = g⁻¹(Xij β). The α's are taken as nuisance
parameters and estimated by
the method of moments [6]. Under mild conditions and correct specification of the mean
function μi (β), Liang and Zeger [10] proved that the resulting estimates β̂ are consistent
for β even when the covariance structure is misspecified. Sandwich estimators based on
Equation (1) for the variance of β̂ are available [10].
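To make Equation (1) concrete, the following Python sketch evaluates and solves it for the logit link under a working independence structure, in which case the working covariance reduces to the diagonal matrix of Bernoulli variances. It is a minimal illustration, not the fitting routine used in this paper: the function names, the toy data and the choice of working independence are all assumptions of the sketch.

```python
import numpy as np


def expit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-eta))


def gee_score(beta, X_list, y_list):
    """Equation (1) with the logit link and V_i = A_i (working independence)."""
    score = np.zeros(len(beta))
    for X, y in zip(X_list, y_list):
        mu = expit(X @ beta)
        a = mu * (1.0 - mu)                 # diagonal of A_i
        D = X * a[:, None]                  # d mu_i / d beta for the logit link
        score += D.T @ ((y - mu) / a)       # D_i' V_i^{-1} (Y_i - mu_i)
    return score


def gee_fit(X_list, y_list, n_iter=25):
    """Fisher scoring iterations for the root of gee_score."""
    beta = np.zeros(X_list[0].shape[1])
    for _ in range(n_iter):
        info = np.zeros((len(beta), len(beta)))
        for X in X_list:
            mu = expit(X @ beta)
            a = mu * (1.0 - mu)
            info += (X * a[:, None]).T @ X  # D_i' V_i^{-1} D_i
        beta = beta + np.linalg.solve(info, gee_score(beta, X_list, y_list))
    return beta


# Hypothetical toy data: 50 clusters of 4 observations, intercept plus one covariate.
rng = np.random.default_rng(0)
X_list = [np.column_stack([np.ones(4), rng.normal(size=4)]) for _ in range(50)]
y_list = [(rng.random(4) < expit(X @ np.array([-0.5, 1.0]))).astype(float) for X in X_list]

beta_hat = gee_fit(X_list, y_list)
print(beta_hat, gee_score(beta_hat, X_list, y_list))
```

With another working correlation structure, only the two lines that apply the inverse working covariance would change; the estimating equation itself keeps the same form.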
In some applications, the within-cluster association may be the scientific focus. In
these cases, some proposals estimate the association vector α
through a second estimating equation, assumed independent of Equation (1). This assumption
has the advantage that misspecification of the association structure does not
affect consistency and asymptotic normality of the marginal regression parameters.
Prentice [15] proposed the correlation coefficient, ρijk = Cor(Yij , Yik ), to represent the
dependence or association between the pair Yij and Yik . However, for binary responses the
correlation coefficient as a measure of association is not widely used, mainly due to the
difficulty in interpretation. Another issue is that the correlation coefficient is restricted by
the marginal means. Lipsitz et al. [12] and Liang et al. [11] proposed modifications in the
second estimation equation using the odds ratio to account for the association between
binary outcomes. They assumed a link function for ψijk = OR(Yij , Yik ) such that
\[
\log(\psi_{ijk}) = \log\left[\frac{\mu_{ijk}\,(1 - \mu_{ij} - \mu_{ik} + \mu_{ijk})}{(\mu_{ij} - \mu_{ijk})(\mu_{ik} - \mu_{ijk})}\right] = X_{ijk}\alpha, \qquad 1 \le j < k \le n_i,
\]
where μijk = E(Yij Yik ). While estimating β through Equation (1), their method solves a
second order equation for α using Zijk = Yij Yik .
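The pairwise odds ratio and the joint probability μijk determine each other once the two marginal means are fixed. The Python sketch below computes ψijk from the means and inverts the relation by solving the resulting quadratic equation in μijk. The function names and the numeric values are illustrative assumptions only.

```python
import math


def pairwise_odds_ratio(mu_j, mu_k, mu_jk):
    """Odds ratio of (Y_j, Y_k) from the marginal means and mu_jk = E(Y_j Y_k)."""
    return mu_jk * (1.0 - mu_j - mu_k + mu_jk) / ((mu_j - mu_jk) * (mu_k - mu_jk))


def joint_prob_from_odds_ratio(mu_j, mu_k, psi):
    """Solve the odds ratio definition for mu_jk given the margins and psi."""
    if abs(psi - 1.0) < 1e-12:
        return mu_j * mu_k                  # psi = 1 corresponds to independence
    a = 1.0 + (mu_j + mu_k) * (psi - 1.0)
    disc = a * a - 4.0 * psi * (psi - 1.0) * mu_j * mu_k
    return (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))


# Hypothetical values: P(Y_j = 1) = 0.3, P(Y_k = 1) = 0.6, P(Y_j = 1, Y_k = 1) = 0.25.
psi = pairwise_odds_ratio(0.3, 0.6, 0.25)
mu_jk = joint_prob_from_odds_ratio(0.3, 0.6, psi)
print(psi, mu_jk)
```

The smaller root of the quadratic is taken because it is the one that respects the bounds max(0, μij + μik − 1) ≤ μijk ≤ min(μij, μik).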
In real applications involving complex hierarchical clustering structures, such as the one
described in Section 2, large clusters are expected. Because of the high computational
effort required in these situations, the GEE extensions of Prentice [15] and Lipsitz et al. [12]
are not feasible.
An alternative that shares the computational ease of conventional GEE is ALR, proposed
by Carey et al. [5]. Let γijk = log(ψijk); then
\[
\operatorname{logit} P(Y_{ij} = 1 \mid Y_{ik} = y_{ik}) = \gamma_{ijk}\, y_{ik} + \log\left(\frac{\mu_{ij} - \mu_{ijk}}{1 - \mu_{ij} - \mu_{ik} + \mu_{ijk}}\right). \tag{2}
\]
The pairwise log odds ratio is obtained as the regression coefficient in a logistic regression
of Yij on Yik as long as the second term on the right-hand side of Equation (2) is used as
an offset. Denoting the mi-vector of conditional residuals by Ci, with elements Yij − ξijk,
where ξijk = E(Yij | Yik = yik), and letting Si be a diagonal matrix with elements ξijk(1 − ξijk),
the ALR estimator for θ = (β, α) is the simultaneous solution of the first estimating equation
given in (1) and
\[
S_{\alpha,\mathrm{ALR}} = \sum_{i=1}^{N} \left(\frac{\partial \xi_i}{\partial \alpha}\right)^{\top} S_i^{-1} C_i = 0. \tag{3}
\]
However, the stochastic nature of Si and ∂ξi/∂α does not allow a theoretical
investigation of Equation (3) through the standard theory of estimating equations. Another
drawback is that Sα,ALR is invariant to permutations of the vector Yi [9], whereas the robust
variance estimator is not.
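The conditional mean that enters the ALR residuals follows directly from Equation (2). The Python sketch below evaluates it for one pair and forms the conditional residual. The function name and the numeric values are hypothetical; the sketch illustrates the formula only and is not an ALR fitting routine.

```python
import math


def alr_conditional_mean(mu_j, mu_k, mu_jk, y_k):
    """xi_jk = E(Y_j | Y_k = y_k) from Equation (2): expit(gamma * y_k + offset)."""
    p00 = 1.0 - mu_j - mu_k + mu_jk
    gamma = math.log(mu_jk * p00 / ((mu_j - mu_jk) * (mu_k - mu_jk)))  # log odds ratio
    offset = math.log((mu_j - mu_jk) / p00)
    eta = gamma * y_k + offset
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values: P(Y_j = 1) = 0.3, P(Y_k = 1) = 0.6, P(Y_j = 1, Y_k = 1) = 0.25.
xi_given_1 = alr_conditional_mean(0.3, 0.6, 0.25, y_k=1)
xi_given_0 = alr_conditional_mean(0.3, 0.6, 0.25, y_k=0)

# Conditional residual for an observed pair (y_j, y_k) = (1, 1).
residual = 1 - xi_given_1
print(xi_given_1, xi_given_0, residual)
```

The two values agree with the elementary conditional probabilities μijk/μik and (μij − μijk)/(1 − μik), which is a convenient check on the offset term.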
Zink [24] presented the ORTH model as an alternative to ALR that resolves the
dependence of the variance estimates on the order of the observations. The ORTH approach
keeps the same Equation (1) to estimate the mean parameters. Unlike ALR, where
estimation of the association parameters α is based on the conditional expectations
E(Yij | Yik), ORTH bases the estimation of α on expectations of the cross-products Yij Yik
conditional on Yij and Yik, for 1 ≤ j < k ≤ ni. An approximate covariance matrix is then
built in a way that remains computationally feasible for larger clusters [20].
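Let Zijk still be equal to Yij Yik. The orthogonalized residuals are defined as the residuals of
the linear regression of Zijk on Yij and Yik:
\[
Q_{ijk} = Z_{ijk} - \bigl[\mu_{ijk} + b_{ijk:j}(Y_{ij} - \mu_{ij}) + b_{ijk:k}(Y_{ik} - \mu_{ik})\bigr], \tag{4}
\]
such that
\[
b_{ijk:j} = \frac{\mu_{ijk}(1 - \mu_{ik})(\mu_{ik} - \mu_{ijk})}{d_{ijk}}, \qquad
b_{ijk:k} = \frac{\mu_{ijk}(1 - \mu_{ij})(\mu_{ij} - \mu_{ijk})}{d_{ijk}},
\]
\[
d_{ijk} = \sigma_{ijj}\sigma_{ikk} - \sigma_{ijk}^{2}, \qquad
\sigma_{ijk} = \mathrm{Cov}(Y_{ij}, Y_{ik}) = \mu_{ijk} - \mu_{ij}\mu_{ik}.
\]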
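The defining property of the residual in Equation (4) can be checked numerically for a single pair by enumerating the four possible outcomes. The Python sketch below does so. The function names and the numeric values are hypothetical, and the code is a check of the formulas rather than an implementation of the ORTH estimating equations.

```python
def orth_coefficients(mu_j, mu_k, mu_jk):
    """Return (b_j, b_k, d) as defined after Equation (4)."""
    s_jj = mu_j * (1.0 - mu_j)
    s_kk = mu_k * (1.0 - mu_k)
    s_jk = mu_jk - mu_j * mu_k
    d = s_jj * s_kk - s_jk ** 2
    b_j = mu_jk * (1.0 - mu_k) * (mu_k - mu_jk) / d
    b_k = mu_jk * (1.0 - mu_j) * (mu_j - mu_jk) / d
    return b_j, b_k, d


def orth_residual(y_j, y_k, mu_j, mu_k, mu_jk):
    """Q_jk of Equation (4) for one observed pair."""
    b_j, b_k, _ = orth_coefficients(mu_j, mu_k, mu_jk)
    return y_j * y_k - (mu_jk + b_j * (y_j - mu_j) + b_k * (y_k - mu_k))


def residual_moments(mu_j, mu_k, mu_jk):
    """Mean of Q, its covariances with Y_j and Y_k, and its variance, by enumeration."""
    cells = {
        (1, 1): mu_jk,
        (1, 0): mu_j - mu_jk,
        (0, 1): mu_k - mu_jk,
        (0, 0): 1.0 - mu_j - mu_k + mu_jk,
    }
    mean_q = cov_j = cov_k = second = 0.0
    for (y_j, y_k), p in cells.items():
        q = orth_residual(y_j, y_k, mu_j, mu_k, mu_jk)
        mean_q += p * q
        cov_j += p * q * (y_j - mu_j)
        cov_k += p * q * (y_k - mu_k)
        second += p * q * q
    return mean_q, cov_j, cov_k, second - mean_q ** 2


# Hypothetical values: P(Y_j = 1) = 0.3, P(Y_k = 1) = 0.6, P(Y_j = 1, Y_k = 1) = 0.25.
mean_q, cov_j, cov_k, var_q = residual_moments(0.3, 0.6, 0.25)
print(mean_q, cov_j, cov_k, var_q)
```

Up to rounding error, the residual has mean zero and is uncorrelated with both Yij and Yik, which is the sense in which it is orthogonalized.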
After the definition of ORTHs Qijk , the second estimation equation is given as
\[
S_{\alpha,\mathrm{ORTH}} = \sum_{i=1}^{N} E\left(\frac{-\partial Q_i}{\partial \alpha}\right)^{\top} P_i^{-1} Q_i, \tag{5}
\]
where
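\[
\nu_{ijk} = \mathrm{Var}(Q_{ijk}) = \frac{\mu_{ijk}(\mu_{ij} - \mu_{ijk})(\mu_{ik} - \mu_{ijk})(1 - \mu_{ij} - \mu_{ik} + \mu_{ijk})}{\mu_{ij}\mu_{ik}(1 - \mu_{ij} - \mu_{ik} + 2\mu_{ijk}) - \mu_{ijk}^{2}},
\]
where the denominator equals $\sigma_{ijj}\sigma_{ikk} - \sigma_{ijk}^{2}$,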
especially those related to the association structure, as well as their robust variance estimators.
The simulation design mirrors that of the motivating real data presented in Section 2.
Qaqish [19] introduced a family of multivariate binary distributions that allows correlated
binary variables to be generated in a simple way for a specified mean vector and correlation
structure. Multivariate binary responses for the simulation scenarios were generated with
this methodology, which is implemented in the R package
binarySimCLF.
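The idea of generating each binary response from a conditional mean that is linear in the previously generated responses can be sketched in a few lines of Python. This is an illustration in the spirit of the family described in [19], written from the general construction. It is not the binarySimCLF implementation, and the function name, the mean vector and the exchangeable correlation used below are assumptions of the sketch.

```python
import numpy as np


def simulate_correlated_binary(mu, corr, n_samples, rng):
    """Draw binary vectors whose conditional means are linear in earlier responses."""
    mu = np.asarray(mu, dtype=float)
    sd = np.sqrt(mu * (1.0 - mu))
    cov = np.asarray(corr, dtype=float) * np.outer(sd, sd)
    y = np.zeros((n_samples, len(mu)))
    y[:, 0] = rng.random(n_samples) < mu[0]
    for i in range(1, len(mu)):
        # Regression coefficients of Y_i on (Y_1, ..., Y_{i-1}).
        b = np.linalg.solve(cov[:i, :i], cov[:i, i])
        lam = mu[i] + (y[:, :i] - mu[:i]) @ b
        # The construction needs conditional means in [0, 1]; this is only
        # checked on the histories that were actually sampled.
        if lam.min() < 0.0 or lam.max() > 1.0:
            raise ValueError("mean and correlation are not compatible with this construction")
        y[:, i] = rng.random(n_samples) < lam
    return y


# Hypothetical scenario: three responses with an exchangeable correlation of 0.2.
rng = np.random.default_rng(1)
mu = [0.3, 0.4, 0.5]
corr = np.full((3, 3), 0.2)
np.fill_diagonal(corr, 1.0)
sample = simulate_correlated_binary(mu, corr, 200_000, rng)
print(sample.mean(axis=0))
print(np.corrcoef(sample, rowvar=False).round(3))
```

For large samples the empirical means and pairwise correlations should be close to the specified ones, because the linear conditional mean reproduces the target covariances one response at a time.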
The mean model and the association structure were specified, respectively, as follows.
Logit Pr(Y = 1) = β0 + β1 x + β2 x2 ,
\[
\mathrm{LogOR}(Y_j, Y_k) =
\begin{cases}
\alpha_0\, I(\text{within collection site}) + \alpha_3\, \mathrm{Distance}_{jk}, \\
\quad \text{if $j$ and $k$ are in different individual host trees in the same collection site,} \\[4pt]
\alpha_0\, I(\text{within collection site}) + \alpha_1\, I(\text{within host tree}), \\
\quad \text{if $j$ and $k$ are in different leaves of the same individual host tree,} \\[4pt]
\alpha_0\, I(\text{within collection site}) + \alpha_1\, I(\text{within host tree}) + \alpha_2\, I(\text{within leaf}), \\
\quad \text{if $j$ and $k$ are different fragments of the same leaf.}
\end{cases}
\]
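The three-case association model above maps each pair of fragments to a linear predictor for the log odds ratio. A small Python sketch of that mapping follows. The record layout, the function name and the α values are hypothetical, and the sketch assumes that collection sites are the independent clusters, so that pairs from different sites have a log odds ratio of zero.

```python
def log_odds_ratio(obs_j, obs_k, alpha, distance=0.0):
    """Log odds ratio for a pair of fragments under the three-level model.

    obs_j, obs_k: dicts with keys "site", "tree", "leaf" (tree ids nested in sites,
    leaf ids nested in trees). alpha: (alpha0, alpha1, alpha2, alpha3).
    distance: distance between the two host trees, used only for different trees.
    """
    a0, a1, a2, a3 = alpha
    if obs_j["site"] != obs_k["site"]:
        return 0.0                        # assumption: different sites are independent
    if obs_j["tree"] != obs_k["tree"]:
        return a0 + a3 * distance         # same site, different host trees
    if obs_j["leaf"] != obs_k["leaf"]:
        return a0 + a1                    # same host tree, different leaves
    return a0 + a1 + a2                   # same leaf, different fragments


# Hypothetical parameter values and fragments.
alpha = (0.2, 0.8, 0.9, -0.003)
frag_1 = {"site": 1, "tree": 1, "leaf": 1}
frag_2 = {"site": 1, "tree": 1, "leaf": 1}
frag_3 = {"site": 1, "tree": 1, "leaf": 2}
frag_4 = {"site": 1, "tree": 2, "leaf": 1}

same_leaf = log_odds_ratio(frag_1, frag_2, alpha)
same_tree = log_odds_ratio(frag_1, frag_3, alpha)
same_site = log_odds_ratio(frag_1, frag_4, alpha, distance=10.0)
print(same_leaf, same_tree, same_site)
```

Because the indicators are nested, the association for fragments of the same leaf accumulates the site, tree and leaf terms, while the distance term enters only for pairs from different trees.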
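\[
\frac{|\alpha_i - \bar{\hat{\alpha}}_i|}{\alpha_i},
\]
where $\bar{\hat{\alpha}}_i = \sum_{j=1}^{N} \hat{\alpha}_i^{(j)}/N$ and $\hat{\alpha}_i^{(j)}$ denotes the estimate of $\alpha_i$ in the
jth replicate, for i = 0, . . . , 3 and j = 1, . . . , N. The relative bias of the robust variance
estimate was obtained as
\[
\frac{|se(\hat{\alpha}_i) - \widehat{se}(\hat{\alpha}_i)|}{se(\hat{\alpha}_i)},
\]
where
\[
se(\hat{\alpha}_i) = \sqrt{\sum_{j=1}^{N} \frac{(\hat{\alpha}_i^{(j)} - \bar{\hat{\alpha}}_i)^{2}}{N - 1}}
\qquad \text{and} \qquad
\widehat{se}(\hat{\alpha}_i) = \sum_{j=1}^{N} \frac{\widehat{se}(\hat{\alpha}_i^{(j)})}{N - 1}.
\]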
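The two summary measures can be computed from the replicate-level output of a simulation as in the Python sketch below. It is a minimal illustration with made-up numbers. The function names are hypothetical, and two choices are assumptions of the sketch rather than statements about the study: the absolute value in the denominator of the relative bias (so that the measure stays non-negative for negative parameters) and the plain average of the model-based standard errors.

```python
import numpy as np


def relative_bias(true_value, estimates):
    """|alpha - mean of the replicate estimates| divided by |alpha|."""
    return abs(true_value - np.mean(estimates)) / abs(true_value)


def relative_se_bias(estimates, model_se):
    """Relative discrepancy between the empirical and the average estimated standard error."""
    empirical = np.std(estimates, ddof=1)   # standard deviation over replicates
    average = np.mean(model_se)             # assumption: plain average over replicates
    return abs(empirical - average) / empirical


# Hypothetical replicate output for one association parameter with true value 1.0.
estimates = np.array([0.9, 1.0, 1.1, 1.0])
model_se = np.array([0.08, 0.08, 0.08, 0.08])

rb = relative_bias(1.0, estimates)
rsb = relative_se_bias(estimates, model_se)
print(rb, rsb)
```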
Simulation results are presented in Figures 2 and 3. The following conclusions can be
drawn from these figures: (1) estimates of α0 showed the largest bias in the misspecified
model, staying around 20% for all sample sizes. The relative bias of the other association
estimates under the misspecified model is small, but it seems to increase with
sample size. As expected, the relative bias under the correct model gets smaller as
sample size increases for both estimates; (2) in Figure 3, relative standard errors show a
slight superiority of the ALR method over ORTH; and (3) in general, ALR and ORTH
behave very similarly.
Table 1 presents a summary of the computational time for simulation scenarios. ALR
is faster than ORTH for smaller sample sizes but ORTH gets faster for bigger ones (more
than 200 observations per group).
Figure 3. Relative standard errors for the correct and misspecified models.
Table 1. Summary of computational time for the simulation scenarios.
                        40 observations per group           200 observations per group
Sample size  Model       Mean      STD      Min      Max     Mean      STD      Min      Max
400          ALR          2.4      0.2      2.1      2.9    151.3    233.2     73.9    839.6
             ORTH        13.4      5.3      7.1     32.6     29.5     10.4     20.1     70.8
800          ALR          4.7      0.4      3.9      5.9    271.6    442.4    116.3   1680.4
             ORTH        21.8      5.5     14.3     35.8     83.8     55.7     40.6    233.1
1600         ALR         10.3      0.6      9.1     11.3    261.4     30.3    237.6    310
             ORTH        41.6     17.2     28.6    126.8    213.9    150.5     82.6    618.4
3200         ALR         17.6      1.3     15.7     21.1    498.1     52.7    461.5    609.3
             ORTH       100.2     33.2     72.4    204.4    435.5    262.7    169.1   1262.7
5. Numerical results
Let us return to the real data set described in Section 2. The following linear
predictors were used for the mean and association structures in the three-level fungal
endophytes study:
\[
\mathrm{LogOR}(Y_j, Y_k) =
\begin{cases}
\alpha_1\, I(\text{within collection site}) + \alpha_4\, \mathrm{Distance}_{jk}, \\
\quad \text{if $j$ and $k$ are in different individual host trees in the same collection site,} \\[4pt]
\alpha_1\, I(\text{within collection site}) + \alpha_2\, I(\text{within host tree}), \\
\quad \text{if $j$ and $k$ are in different leaves of the same individual host tree,} \\[4pt]
\alpha_1\, I(\text{within collection site}) + \alpha_2\, I(\text{within host tree}) + \alpha_3\, I(\text{within leaf}), \\
\quad \text{if $j$ and $k$ are different fragments of the same leaf.}
\end{cases}
\]
Table 2. Results of ALR and ORTH models for fungal endophytes study.
Models                             ALR                            ORTH (λ = 0.006)
Mean                       β    se(β)  p-value     O.R.        β    se(β)  p-value     O.R.
Intercept             −0.953    0.120    0.000        –   −0.977    0.188    0.000        –
Country = Brazil      −0.090    0.518    0.862    0.914   −0.034    0.585    0.954    0.967
Fragment = A          −0.498    0.294    0.091    0.608   −0.502    0.294    0.087    0.605
Fragment = B          −0.926    0.213    <0.01    0.396   −0.935    0.212    <0.01    0.393
Fragment = D          −1.173    0.181    <0.01    0.309   −1.185    0.181    <0.01    0.306
Fragment = E          −0.718    0.156    <0.01    0.488   −0.725    0.156    <0.01    0.484
Fragment = F          −0.672    0.247    0.006    0.511   −0.678    0.246    0.006    0.507
Association                α    se(α)  p-value   P.O.R.        α    se(α)  p-value   P.O.R.
Within site            0.477    0.672    0.478    1.612    0.195    0.388    0.615    1.215
Within host tree       0.817    0.404    0.043    2.265    0.857    0.404    0.034    2.355
Within leaf            0.896    0.184    0.000    2.449    0.928    0.210    0.000    2.528
Distancejk            −0.003    0.005    0.609    0.997   −0.003    0.004    0.443    0.997
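The odds ratio columns of Table 2 are the exponentiated coefficients, and the reported p-values are consistent with two-sided Wald tests. The Python sketch below reproduces these summaries from an estimate and its standard error, using the ALR within-host-tree association as input. The function name is hypothetical, and the Wald-type test and interval are assumptions of the sketch, since the text does not state how the p-values and intervals were computed.

```python
import math


def wald_summary(estimate, se, z_crit=1.959964):
    """Odds ratio, Wald confidence interval and two-sided Wald p-value."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return {
        "odds_ratio": math.exp(estimate),
        "ci": (math.exp(estimate - z_crit * se), math.exp(estimate + z_crit * se)),
        "p_value": p_value,
    }


# ALR estimate of the within-host-tree association taken from Table 2.
within_tree = wald_summary(0.817, 0.404)
print(within_tree)
```

Because the table reports rounded estimates and standard errors, summaries recomputed in this way can differ from published values in the last digit.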
In the mean structure, there is no significant difference between Brazil and Argentina
concerning the odds of presence of fungal endophytes. This result corroborates previous
studies, which have shown that most fungal endophytes belong to Sordariomycetes
[1,8]. However, the odds of presence of fungal endophytes in fragment C are significantly
higher than in the other fragments, except fragment A. There is a higher probability of
infection near the petiole than in more distal leaf fragments [3,4,23]. The expansion of the
petiole end is smaller than that of the more distal two-thirds of the leaf. Consequently,
infections become more 'diluted' if they are established in the more distal leaf parts before
the leaf fully expands, compared with the petiole segments. The result is a leaf with denser
infections at the petiole end [23]. Based on our results, we suggest a preference for
leaf tissue colonization by Sordariomycetes. Probably, leaf development determines the
colonization pattern.
6. Final remarks
ALR and ORTH GEE methods were presented and compared in this paper. Simulation
results showed a slight superiority of the ALR method over ORTH. Marginal
probabilities and odds ratios were also estimated and compared in a real ecological study
involving a three-level hierarchical clustering structure. Results from the ALR and ORTH
GEE methods were very similar. A tree with fungus on any leaf was 2.3 (95% CI 1.1, 5.1)
times as likely to have fungus on any other leaf. Fungi were also more likely to aggregate
within leaves. However, the effect of the distance between host trees within each collection
site was not statistically significant.
Simulation and real data results were obtained by using some R packages [21]. In order
to fit ALR and ORTH models, geepack and orth packages were used, respectively. R Scripts
can be obtained upon request from the first author.
An important issue related to GEE is its validity in the presence of missing data. This issue
was not explored in this paper because the ecological study has complete data. However,
it should be noted that the GEE methodology studied in this paper is only valid when data
are missing completely at random (MCAR) [6].
In terms of computational time, ALR is faster than ORTH for smaller sample sizes but
ORTH gets faster for bigger clusters (more than 150 observations in each group). ALR and
ORTH models have proved to be useful for modeling a complex association structure in
the presence of large cluster sizes.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
Research partially supported by FAPEMIG, CAPES and CNPq grants (E.A.C). Research partially
supported by CAPES and CNPq grants (A.B.M.V). Research partially supported by CAPES and
CNPq grants (L.D.A).
References
[1] A.E. Arnold and F. Lutzoni, Diversity and host range of foliar fungal endophytes: are tropical
leaves biodiversity hotspots? Ecology 88 (2007), pp. 541–549.
[2] A.E. Arnold, Z. Maynard, G.S. Gilbert, P.D. Coley and T.A. Kursar, Are tropical endophytes
hyperdiverse? Ecol. Lett. 3 (2000), pp. 267–274.
[3] M.E. Bernstein and G.C. Carroll, Internal fungi in old-growth Douglas fir foliage, Canad. J.
Botany 55 (1977), pp. 644–653.
[4] P.F. Cannon and C.M. Simmons, Diversity and host preference of leaf endophytic fungi in the
Iwokrama Forest Reserve, Guyana, Mycologia 94 (2002), pp. 210–220.
[5] V. Carey, S.L. Zeger and P. Diggle, Modelling multivariate binary data with alternating logistic
regressions, Biometrika 80 (1993), pp. 517–526.
[6] P.J. Diggle, P. Heagerty, K.Y. Liang and S.L. Zeger, Analysis of Longitudinal Data, 2nd ed., Oxford
University Press, New York, 2002.
[7] S.H. Faeth and K.E. Hammon, Fungal endophytes in oak trees: long-term patterns of abundance
and associations with leaf-miners, Ecology 78 (1997), pp. 810–819.
[8] K.L. Higgins, A.E. Arnold, J. Miadlikowska, S.D. Sarvate and F. Lutzoni, Phylogenetic relation-
ships, host affinity, and geographic structure of boreal and arctic endophytes from three major
plant lineages, Mol. Phylogenet. Evol. 42 (2007), pp. 543–555.
[9] A.Y.C. Kuk, Permutation invariance of alternating logistic regression for multivariate binary data,
Biometrika 91 (2004), pp. 758–761.
[10] K.Y. Liang and S.L. Zeger, Longitudinal data analysis using generalized linear models,
Biometrika 73 (1986), pp. 13–22.
[11] K.Y. Liang, S.L. Zeger and B. Qaqish, Multivariate regression analyses for categorical data, J. R.
Statist. Soc. B 54 (1992), pp. 3–40.
[12] S.R. Lipsitz, N.M. Laird and D.P. Harrington, Generalized estimating equations for correlated
binary data: Using the odds ratio as a measure of association, Biometrika 78 (1991), pp. 153–160.
[13] J.B.H. Martiny, J.A. Eisen, K. Penn, S.D. Allison and C. Horner-Devine, Drivers of bacterial
β-diversity depend on spatial scales, Proc. Natl. Acad. Sci. U.S.A. 108 (2011), pp. 7850–7854.
[14] O. Petrini, T.N. Sieber, L. Toti and O. Viret, Ecology, metabolite production, and substrate
utilization in endophytic fungi, Nat. Toxins 1 (1992), pp. 185–196.
[15] R.L. Prentice, Correlated binary regression with covariates specific to each binary observation,
Biometrics 44 (1988), pp. 1033–1048.
[16] R.L. Prentice and L.P. Zhao, Estimating equations for parameters in means and covariances of
multivariate discrete and continuous responses, Biometrics 47 (1991), pp. 825–839.
[17] R.J. Rodriguez, J.F. Jr. White, E.A. Arnold and R.S. Redman, Fungal endophytes: diversity and
functional roles, New Phytol. 182 (2009), pp. 314–330.
[18] B. Schulz and C. Boyle, The endophytic continuum, Mycol. Res. 109 (2005), pp. 661–686.
[19] B.F. Qaqish, A family of multivariate binary distributions for simulating correlated binary
variables with specified marginal means and correlations, Biometrika 90 (2003), pp. 455–463.
[20] B.F. Qaqish, R.C. Zink and J.S. Preisser, Orthogonalized residuals for estimation of marginally
specified association parameters in multivariate binary data, Scand. J. Stat. 39 (2012),
pp. 515–527.
[21] R Core Team. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria, 2015. Available at https://www.R-project.org/.
[22] A.B.M. Vaz, Costa Da and A. Góes-Neto, Fungal endophytes associated with three South Amer-
ican Myrtae (Myrtaceae) exhibit preferences in the colonization at leaf level, Fungal Biol. 118
(2014), pp. 277–286.
[23] D. Wilson and G.C. Carroll, Infection studies of Discula quercina, an endophyte of Quercus
garryana, Mycologia 86 (1994), pp. 635–647.
[24] R.C. Zink, Correlated binary regression using orthogonalized residuals, PhD thesis, University
of North Carolina, Chapel Hill, 2003.