
Sufficient dimension reduction for a novel class of zero-inflated graphical models

Eric Koplin1,a , Liliana Forzani1,a , Diego Tomassib , Ruth M. Pfeifferc,∗


a CONICET & Facultad de Ingenierı́a Quı́mica, Universidad Nacional del Litoral, Santiago del Estero 2829, Santa Fe, 3000, Santa Fe, Argentina
b Biofortis, 3 Route de la Chatterie, Saint-Herblain, 44800, Loire-Atlantique, France
c Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, 20892, Maryland, USA

Abstract
Graphical models allow modeling of complex dependencies among components of a random vector. In many applica-
tions of graphical models, however, for example microbiome data, the data have an excess number of zero values. We
present new pairwise graphical models with distributions in an exponential family, that accommodate excess numbers
of zeros in the random vector components. First we characterise these multivariate distributions in terms of univariate
conditional distributions. We then model predictors that arise from such a pairwise graphical model with excess zeros
as a function of an outcome, and derive the corresponding first order sufficient dimension reduction (SDR). That is,
we find linear combinations of the predictors that contain all the information for the regression of the outcome as a
function of the predictors. We estimate the SDR using a pseudo-likelihood with a hierarchical penalty that accounts for the graphical model structure and performs variable selection by allowing interactions only for variables that are also associated with the outcome through main effects. This method yields consistent estimators of the reduction and can be applied
to continuous or categorical outcomes. We then illustrate our methods by studying normal, Poisson and truncated
Poisson graphical models with excess zeros in simulations and by analyzing microbiome data from the American Gut
Project. Our models provided robust variable selection, and the predictive performance of the Poisson zero-inflated pairwise graphical model was equal to or better than that obtained from applying other available methods for the analysis of microbiome data.
Keywords: Count data, Hierarchical penalization, Hurdle model, Pairwise graphical models, Pseudo-likelihood,
Variable selection

1. Introduction

Statistical graphical models allow modeling of complex biological phenomena. These graphs consist of nodes
representing the variables in the model linked by edges representing the dependence relationships between them. A
special type of data generated in many fields of science are compositional data, i.e. vectors X = (X1 , . . . , Xk )T , with
Xl ≥ 0, for all l = 1, . . . , k, and a constant-sum constraint, i.e. X1 + . . . + Xk = m X . Early methods for the analysis
of compositional data were developed e.g., by Aitchison (1986). A recent example spurred by new high-throughput
technologies, is microbiome data that characterize microbial ecology measured in biological samples, e.g. stool or
saliva. Bacterial microbiota are classified into taxonomic levels (phylum, class, order, family, genus), and within
each level are categorized into taxa. Microbiome data thus can be represented by counts or proportions (“relative
abundances”) of DNA sequences that fall into various taxa within a taxonomic level.
Tomassi et al. (2019) developed likelihood based sufficient dimension reduction (SDR) methods to assess associ-
ations of count or compositional vectors X with an outcome Y. For each model, including Poisson graphical models
(Inouye et al., 2017), which are multivariate extensions of Poisson models, Tomassi et al. (2019) estimated linear com-
binations of X that contain sufficient information to model or predict Y, and presented methods for variable selection
to identify components of X that are truly associated with outcome.

∗ Corresponding author
Email addresses: [email protected] (Eric Koplin), [email protected] (Ruth M. Pfeiffer)

Preprint submitted to Computational Statistics and Data Analysis December 15, 2023
However, in many applications there is a larger number of zero components than standard exponential family
graphical models, e.g. Poisson models, can accommodate. For microbiome data the large number of zeros, i.e. spar-
sity, can arise because microbes are present in the sampled environment but not detected due to low sequencing depth
and sampling variation, or because some microbes are unable to live in that environment and are truly never present.
This zero inflation causes a poor fit of some models, e.g. Poisson models, to the data and leads to incorrect inference.
A model that can handle large numbers of zeros is the multinomial model, but its restrictive constraints on the corre-
lations between the X components are typically not met in real data. We therefore present a new family of pairwise
graphical models with distributions in the exponential family that accommodates excess numbers of zeros in X.
We derive the corresponding first order sufficient dimension reduction (SDR), i.e. linear combinations of X that
contain sufficient information to model Y. We characterize the zero-inflated pairwise graphical models in terms
of conditional distributions, which facilitates estimation via pseudo-likelihood and data simulations. We propose
a computationally efficient and robust estimation procedure and add hierarchical penalization that accounts for the
graphical model structure to identify components of X associated with Y.
The rest of the paper is organized as follows. Section 2 gives an overview of SDR and first moment SDR for
pairwise graphical models with some special cases relevant for compositional data. We introduce the zero-inflated
pairwise graphical models that extend the zero inflated Hurdle-normal model by McDavid et al. (2019) to general
exponential family distributions and derive the corresponding SDR (Section 3). Section 4 presents the estimation and
hierarchical penalization for variable selection. We assess the performance of our proposed approaches in simulations
(Section 5) and using data from the American Gut Project (Section 6). Section 7 closes with a discussion.

2. Sufficient dimension reduction for pairwise graphical models

We briefly summarize sufficient dimension reduction, denoting matrices and vectors by bold face letters.

2.1. Background on sufficient dimension reduction (SDR)


Modeling and visualizing the relationship between a response variable Y ∈ R and predictors X ∈ Rk is
challenging when k is large. Dimension reduction methods help to reduce the complexity of the problem. In particular,
SDR aims to find a function R : Rk → Rd with d ≪ k, such that F(Y|X) = F(Y|R(X)), where F(Y|.) is the conditional
distribution of Y given the second argument. Then R(X) contains the same information about Y as X and is a sufficient
dimension reduction for the regression of Y on X.
Estimation of R(X) was originally based on moments or functions of moments of the conditional distribution of the inverse regression (IR) X|Y, e.g. sliced inverse regression (SIR), proposed by Li (1991). These methods required continuous X and conditions on their moments, typically captured only part of the reduction, and the resulting reduction was always linear in the predictors, i.e., R(X) = αT X with α ∈ Rk×d.
Cook (2007) introduced model-based IR, which relies on the fact that F(Y|R(X)) = F(Y|X) if and only if R is also
sufficient for the IR of X as a function of Y, i.e. F(X|R(X), Y) = F(X|R(X)). Model-based IR avoids a limitation of
moment-based SDR methods, since the reduction is exhaustive for the regression of Y on X; that is, R contains all
the information in X about Y, and likelihood-based estimators of R can be obtained (Cook and Forzani, 2008). Bura
et al. (2016) showed that when the distribution of X|Y is in a general exponential family, then the minimal SDR is
linear not in X, but in the minimal sufficient statistic for the X|Y exponential family. Tomassi et al. (2019) extended
this approach and derived R(X) for Poisson graphical models for vectors X of dependent counts. Next we summarize
these results for pairwise graphical models with distributions in a general exponential family.

2.2. First moment SDR for pairwise graphical models (pGMs)


Probabilistic graphical models have been used extensively to model dependencies between random variables (see
e.g. Wainwright, Jordan et al., 2008). Our focus is on undirected graphical models. Briefly, an undirected graphical
model consists of a random vector X = (X1 , . . . , Xk )T and an undirected graph G = (V, E), where V = {1, . . . , k}
are nodes and E ⊂ V × V edges. Each component Xi of X represents a node in V and the graph structure captures
conditional independence of the components of X. In particular, Theorem 3.9 of Lauritzen (1996, p. 36) states that Xi and X j are conditionally independent given all the other components X\{i, j} of X if and only if $(i, j) \notin E$, i.e. the nodes i and j are not adjacent in G.

We focus on undirected pairwise graphical models (pGMs) with distributions of the form

$$P(x \mid \eta, \Theta) = \exp\Big\{ \eta^T T(x) + \tfrac{1}{2} T(x)^T \Theta T(x) + \mathbf{1}^T B(x) - A(\eta, \Theta) \Big\}, \quad (1)$$

where the functions $T(x) = (T(x_1), \ldots, T(x_k))^T \in \mathbb{R}^k$ and $T(x)T(x)^T$ are the sufficient statistics. The natural parameter $\eta^T = (\eta_1, \ldots, \eta_k) \in \mathbb{R}^k$ represents the main effects (also called “node potentials”), $\Theta \in \mathbb{R}^{k\times k}$ is a symmetric matrix of pairwise interaction terms between components of X, which capture the edges in the graph, and $B(x) = (B(x_1), \ldots, B(x_k))^T$ is the component-wise base measure. There is no edge between nodes i and j in the corresponding graph if and only if $\Theta_{ij} = 0$, and this implies that $X_i$ and $X_j$ are conditionally independent given $X_{\backslash\{i,j\}}$, which is called the pairwise Markov property (Lauritzen, 1996, p. 32). The log-partition function $A(\eta, \Theta) < \infty$ ensures that P is a proper probability function, i.e. integrates to one.
To obtain the first order moment SDR R(X) of Y as a function of X, we model the main effect parameters η in the inverse regression of X with distribution (1) as a function of Y,

$$\eta_y = \eta_0 + \Gamma f_y, \quad (2)$$

where $f_y = (f_1(y), \ldots, f_r(y))^T \in \mathbb{R}^r$ is a vector of known centered functions of Y, $\eta_0 \in \mathbb{R}^k$ and $\Gamma \in \mathbb{R}^{k\times r}$ is an unconstrained matrix with rank $d \le \min\{r, k\}$. For discrete Y with values in $\{0, \ldots, r\}$, we set $f_j(y) = I(y = j) - P(Y = j)$, $j = 1, \ldots, r$, where $I(\cdot)$ denotes the indicator function. For continuous Y, flexible functions of Y could be used, e.g. polynomials. As we focus on first moment SDRs, we assume only η depends on Y. However, in an extension one could additionally model the interactions $\Theta_y$ as functions of Y.
For distributions of X in the exponential family (1) with the natural parameter given by (2), Bura et al. (2016) proved that

$$R(X) = \alpha^T T(X), \quad (3)$$

where $\alpha \in \mathbb{R}^{k\times d}$ is a basis of $\mathrm{span}\{\eta_y - \eta_0, y \in \mathcal{Y}\} = \mathrm{span}\{\Gamma\}$ and $\mathcal{Y}$ is the sample space of Y. An example of (1) for binary predictors $X \in \{0, 1\}^k$ is the Ising model, with B(x) constant and T(x) = x (Bura et al., 2016).
When for count data the additional constraint $m_X = \sum_{j=1}^{k} X_j$ is imposed, e.g. for the multinomial model, then, as $X_k = m_X - \sum_{j=1}^{k-1} X_j$, a lower dimensional version of the SDR needs to be derived. This can be done using a one-to-one transformation $Z_{\backslash k} = (Z_1, \ldots, Z_{k-1})^T$ of X, that has a distribution of the form (1) but with parameters $\eta_{\backslash k} \in \mathbb{R}^{k-1}$ and $\Theta_{\backslash k,\backslash k} \in \mathbb{R}^{(k-1)\times(k-1)}$. Tomassi et al. (2019) showed that, letting $R(Z_{\backslash k}) = \alpha_{\backslash k}^T Z_{\backslash k}$ denote the SDR of the regression $Y \mid Z_{\backslash k}$, the SDR of Y|X is $R(X) = (\alpha_{\backslash k}^T Z_{\backslash k}; m_X)$. Next we present some special cases of pGM distributions (1) relevant for compositional data.

2.2.1. Normal pGM


Applying the log-ratio transformation (Aitchison, 1986) to counts X when all components of X are strictly positive (i.e. $X_j > 0$, $j = 1, \ldots, k$) yields $Z = \big(\log(X_1/X_k), \ldots, \log(X_{k-1}/X_k)\big)^T \in \mathbb{R}^{k-1}$. Assuming that $Z \mid Y = \mu_y + \epsilon$, $\epsilon \sim N_{k-1}(0, \Sigma)$, gives a model with a density of the form (1) with $T(x) = x$, B constant and $\Theta = -\Sigma^{-1} \in \mathbb{R}^{(k-1)\times(k-1)}$. Applying (3) with $\eta_y = \Sigma^{-1}\mu_y$ yields $R(Z) = \alpha^T Z$, where $\alpha \in \mathbb{R}^{(k-1)\times d}$ is a full rank matrix with $\mathrm{span}(\alpha) = \mathrm{span}\{\eta_y - \eta_0, y \in \mathcal{Y}\} = \Sigma^{-1}\mathrm{span}\{\mu_y - \mu_0, y \in \mathcal{Y}\}$, which corresponds to the Principal Fitted Components solution (Cook and Forzani, 2008).

2.2.2. Poisson pGMs


These models extend the Poisson distribution to multivariate count data. For the most general formulation of P in (1), $B(x) = -\log(x!)$, $T(x) = x$ and $\Theta \in \mathbb{R}^{k\times k}$ with $\Theta_{ij} \le 0$ for all $i \neq j = 1, \ldots, k$ and $\Theta_{jj} = 0$ for $j = 1, \ldots, k$, to ensure that $A(\eta, \Theta) < \infty$ and thus P is a proper distribution (Yang et al., 2013). An important modification that does not require that all $\Theta_{ij} \le 0$ is the truncated Poisson pGM (TPoisson-pGM) model (Yang et al., 2013), that restricts all components of X to $X_j \le T^*$, where $T^*$ is a known constant. Under this model, R(X) is the same as for the unconstrained, standard Poisson pGM (Tomassi et al., 2019) and given by (3).

3. SDR for zero-inflated pairwise graphical models

Here we present a novel family of pairwise graphical models that extends the pGM distribution in (1) to accommodate excess zeros in the X components, and we characterize its distribution in terms of component-wise conditional distributions to facilitate computations and estimation. First we state these results without considering dependencies
on Y. We then derive the first moment SDR for this novel family and propose a computationally efficient pseudo-
likelihood approach for estimation that allows incorporating variable selection via a novel hierarchical penalty.

3.1. Zero-inflated pairwise graphical model (zipGM)


We now augment X with a zero pattern vector $\nu = \nu(X) = (\nu(X_1), \ldots, \nu(X_k))^T$, where $\nu(X_i) = I(X_i \neq 0)$ and $I(\cdot)$ is the indicator function, and define the zero-inflated pairwise graphical model (zipGM) distribution

$$P(X = x \mid \omega) = \exp\Big\{ \eta^T T(x) + \xi^T \nu(x) + \tfrac{1}{2} T(x)^T \Theta T(x) + \tfrac{1}{2} \nu(x)^T \Lambda \nu(x) + \nu(x)^T \Phi T(x) + \mathbf{1}^T B(x) - A(\omega) \Big\}, \quad (4)$$
where $(\nu(x), T(x), \nu(x)\nu(x)^T, T(x)\nu(x)^T, T(x)T(x)^T)$ are the sufficient statistics, with $T(x) = (T(x_1), \ldots, T(x_k))^T$. The log-partition function $A(\omega)$ is needed to ensure that $P(X = x \mid \omega)$ is a proper distribution. We denote all parameters in the distribution in (4) by the vector

$$\omega = (\eta; \xi; \mathrm{vec}(\Lambda); \mathrm{vec}(\Phi); \mathrm{vec}(\Theta)) \in \mathbb{R}^{2k + 3k^2}, \quad (5)$$

where $\xi, \eta \in \mathbb{R}^k$ and $\Lambda, \Theta, \Phi \in \mathbb{R}^{k\times k}$. Λ and Θ are symmetric matrices, and Λ and Φ have zero diagonals. The vectorized matrices Λ, Φ, Θ in (5) are obtained by stacking their columns, and Ω is the space of all ω that ensure that (4) is a proper distribution.
The model in (4) extends the zero-inflated Hurdle-normal model by McDavid et al. (2019) to general exponential family distributions. As calculating the log-partition function A is computationally difficult when estimating the parameters, we characterize model (4) through a product of conditional distributions, each corresponding to a univariate zero-inflated model.
Theorem 1. Assume that the joint distribution of $X = (X_1, \cdots, X_k)^T$ satisfies $P(X = 0 \mid \omega) > 0$ and $P(X \mid \omega) = \prod_{i>j} G_{ij}(X_i, X_j)$, for some functions $G_{ij}$ and for all $\omega \in \Omega$; i.e., P can be written as the product of terms that each depend on at most two components of X. Then $P(X \mid \omega)$ has the form (4) if and only if for all $j = 1, \ldots, k$,

$$P(X_j = x_j \mid X_{\backslash j} = x_{\backslash j}, \omega) = \exp\Big\{ \xi_{j|\backslash j}\,\nu(x_j) + \eta_{j|\backslash j}\,T(x_j) + \tfrac{1}{2}\Theta_{jj}T(x_j)^2 + B(x_j) - A(\xi_{j|\backslash j}, \eta_{j|\backslash j}, \Theta_{jj}) \Big\}, \quad (6)$$

with sufficient statistic $(\nu(x_j), T(x_j), T(x_j)^2/2)$, natural parameters $(\xi_{j|\backslash j}, \eta_{j|\backslash j}, \Theta_{jj})$, base measure $B(x_j)$ and log partition function

$$A(\xi_{j|\backslash j}, \eta_{j|\backslash j}, \Theta_{jj}) = \log\Big[1 + \exp\Big\{\xi_{j|\backslash j} - \eta_{j|\backslash j}T(0) - \tfrac{1}{2}\Theta_{jj}T(0)^2 + A^+(\eta_{j|\backslash j}, \Theta_{jj})\Big\}\Big]. \quad (7)$$

Here $A^+(\eta_{j|\backslash j}, \Theta_{jj}) = \log \sum_{x \neq 0} \exp\big\{\eta_{j|\backslash j}T(x) + \tfrac{1}{2}\Theta_{jj}T(x)^2 + B(x)\big\}$ in (7) is the log partition function of the model given that $X \neq 0$.
The canonical parameters $\xi_{j|\backslash j}$ and $\eta_{j|\backslash j}$ depend on $X_{\backslash j}$ through

$$\xi_{j|\backslash j} = \xi_j + \Lambda_{j,\backslash j}\,\nu_{\backslash j} + \Phi_{j,\backslash j}\,T(x_{\backslash j}), \quad (8)$$
$$\eta_{j|\backslash j} = \eta_j + \nu_{\backslash j}^T\Phi_{\backslash j,j} + \Theta_{j,\backslash j}\,T(x_{\backslash j}), \quad (9)$$

where $\Lambda_{j,\backslash j}$ corresponds to the jth row of the matrix Λ after removing its jth column, and equivalent definitions apply to all other matrices.
The proof of Theorem 1 is given in Appendix B.
The distribution in (6) is a special case of a standard univariate zero-inflated distribution in the exponential family (Haslett et al., 2018, Section 3.5) with base measure $B(x)$, sufficient statistics $(\nu(x), t(x))$, with $t: \mathbb{R} \to \mathbb{R}^q$, and associated natural parameters $\tau + \tau_0 - A^+(\theta) \in \mathbb{R}$ and $\theta \in \mathbb{R}^q$,

$$P(X = x \mid \theta, \tau) = \exp\Big\{\theta^T t(x) + \nu(x)\big[\tau + \tau_0 - A^+(\theta)\big] + B(x) - A(\tau, \tau_0)\Big\}. \quad (10)$$

In (10), τ and θ are free parameters, $A^+(\theta) = \log \sum_{x \neq 0} \exp\{\theta^T t(x) + B(x)\}$ is the log-partition function of the model conditional on $X \neq 0$, $\tau_0 = \theta^T t(0) + B(0)$, $A(\tau, \tau_0) = \tau_0 + A(\tau)$, and $A(\tau) = \log\{1 + \exp(\tau)\}$. If $B(0) = 0$ and $t(0) = 0$, then $\tau_0 = 0$ and $A(\tau, \tau_0) = A(\tau)$.
Letting $t(x) = (T(x), T^2(x)/2)$, $\theta = (\eta_{j|\backslash j}, \Theta_{jj})$ and $\xi_{j|\backslash j} = \tau + \tau_0 - A^+(\theta)$ in (10), we recover (6).
The next Proposition (proved in Appendix B) states that the distribution in (10) can be factorized into the product
of a distribution in the exponential family with support excluding zero, and a Bernoulli distribution for the zero
component ν(X j ) = ν, where ν = 0, 1.
Proposition 1. If X has a distribution of the form (10), then ν(X) has a Bernoulli distribution with $p = 1/\{1 + \exp(-\tau)\}$, and the conditional distribution of X given ν(X) is

$$P(X = x \mid \nu(x), \theta) = \delta_{\nu,0}\,\delta_{x,0} + \delta_{\nu,1}\exp\{\theta^T t(x) + B(x) - A^+(\theta)\}, \quad (11)$$

where $\delta_{x,y} = I(x = y)$ is the Kronecker delta function.


Using the result in Proposition 1, we state the zipGM model (4) in terms of the corresponding conditional distributions (11), and the conditional distribution of ν(X),

$$P(\nu(X_j) = 1 \mid X_{\backslash j}, \omega) = \Big[1 + \exp\{-\xi_{j|\backslash j} - A^+(\eta_{j|\backslash j}, \Theta_{jj})\}\Big]^{-1}. \quad (12)$$

Some important special cases of the zipGM in (4) that we study further are discussed next.

Normal Zero-inflated pGM (Normal-zipGM). When $T(x) = x$, $\Theta_{jj} < 0$ and $B(x) = 0$ in (6), we obtain the Hurdle-normal model (McDavid et al., 2019), with conditional distribution

$$(X_j \mid \nu(X_j) = 1, X_{\backslash j}) \sim N\big(-\Theta_{jj}^{-1}\eta_{j|\backslash j},\ -\Theta_{jj}^{-1}\big). \quad (13)$$

The function $A^+$ in (12) is $A^+(\eta_{j|\backslash j}, \Theta_{jj}) = -\eta_{j|\backslash j}^2/(2\Theta_{jj}) - \tfrac{1}{2}\log(-\Theta_{jj}) + \tfrac{1}{2}\log(2\pi)$.
Remark 1. If the original X variables are count data, the log-ratio transformation used in Section 2.2.1 needs to be extended to preserve the zero pattern: $Z_j = \nu_j \log(X_j/X_k)$, $j = 1, \ldots, k-1$, with the convention $0 \times (-\infty) = 0$.
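As an illustration, the following is a minimal sketch of this zero-preserving transformation (Python/NumPy; the function name is ours, and the reference component $X_k$ is assumed positive):

```python
import numpy as np

def zero_preserving_logratio(x):
    """Sketch of Remark 1: Z_j = nu_j * log(X_j / X_k), j = 1, ..., k-1,
    with the convention 0 * (-inf) = 0, so zeros in X stay zeros in Z.
    Assumes the reference component X_k (last entry) is positive."""
    x = np.asarray(x, dtype=float)
    num, ref = x[:-1], x[-1]
    z = np.zeros(num.size)
    nz = num > 0                    # nu_j = 1 exactly where X_j != 0
    z[nz] = np.log(num[nz] / ref)
    return z

# e.g. zero_preserving_logratio([3, 0, 5, 2]) -> [log(3/2), 0, log(5/2)]
```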

Poisson Zero-inflated pGM (Poisson-zipGM). When $T(x) = x$, $B(x) = -\log(x!)$, $\Theta_{ij} \le 0$ for all $i \neq j$ and $\Theta_{jj} = 0$ in (6), the conditional distribution of $X_j$ is a zero-truncated Poisson distribution with natural parameter $\eta_{j|\backslash j}$ and support on the positive integers,

$$(X_j \mid \nu(X_j) = 1, X_{\backslash j}) \sim \mathrm{Poisson}^+\{\exp(\eta_{j|\backslash j})\}. \quad (14)$$

The log-normalizing constant is $A^+(\eta_{j|\backslash j}) = \log\big\{\sum_{t=1}^{\infty}\exp(\eta_{j|\backslash j}t)/t!\big\} = \log[\exp\{\exp(\eta_{j|\backslash j})\} - 1]$. To avoid numerical problems when $\eta_{j|\backslash j}$ is large, we use the equivalent formulation $A^+(\eta_{j|\backslash j}) = \exp(\eta_{j|\backslash j}) + \log\{P(\tilde{X} \ge 1)\} = \exp(\eta_{j|\backslash j}) + \log[1 - \exp\{-\exp(\eta_{j|\backslash j})\}]$, where the auxiliary variable $\tilde{X}$ has a Poisson distribution with rate $\exp(\eta_{j|\backslash j})$.
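A minimal numerical sketch of this stable evaluation (Python/NumPy; the function name is ours):

```python
import numpy as np

def a_plus_poisson(eta):
    """Numerically stable A^+(eta) = log(exp(exp(eta)) - 1) for the
    zero-truncated Poisson, written as exp(eta) + log(1 - exp(-exp(eta)))."""
    rate = np.exp(eta)
    return rate + np.log1p(-np.exp(-rate))   # = rate + log P(X~ >= 1)
```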

Truncated Poisson Zero-inflated pGM (TPoisson-zipGM). Here the support of the Poisson counts is restricted to $\{x : 1 \le x \le T^*\}$, which yields the log-normalizing constant

$$A^+(\eta_{j|\backslash j}) = \log\Bigg\{\sum_{t=1}^{T^*}\frac{\exp(\eta_{j|\backslash j}t)}{t!}\Bigg\} = \log\Bigg\{\exp\{\exp(\eta_{j|\backslash j})\} - \sum_{t=T^*+1}^{\infty}\frac{\exp(\eta_{j|\backslash j}t)}{t!} - \frac{\exp(\eta_{j|\backslash j}\cdot 0)}{0!}\Bigg\} = \exp(\eta_{j|\backslash j}) + \log\{P(1 \le \tilde{X} \le T^*)\}, \quad (15)$$

where the auxiliary variable $\tilde{X}$ has a Poisson distribution with rate $\exp(\eta_{j|\backslash j})$.
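A corresponding sketch for (15), assuming SciPy's Poisson distribution functions (the function name is ours):

```python
import numpy as np
from scipy.stats import poisson

def a_plus_tpoisson(eta, t_star):
    """Sketch of (15): A^+(eta) = exp(eta) + log P(1 <= X~ <= T*),
    where X~ ~ Poisson(exp(eta))."""
    rate = np.exp(eta)
    p = poisson.cdf(t_star, rate) - poisson.cdf(0, rate)   # P(1 <= X~ <= T*)
    return rate + np.log(p)
```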

3.2. First moment SDR for zero-inflated pairwise graphical models
To capture the relationship between X and Y and obtain the sufficient reduction R(X) for the regression of Y|X, we assume that the distribution of X|Y follows (4), where the parameters η and ξ depend linearly on known, centered functions $f_y \in \mathbb{R}^r$ of Y,

$$\eta_y = \eta_0 + \Gamma f_y, \quad (16)$$
$$\xi_y = \xi_0 + \Psi f_y, \quad (17)$$

where Γ, Ψ are two $k \times r$ matrices and $\eta_0, \xi_0 \in \mathbb{R}^k$. We propose two different approaches to obtain R(X). The first, the joint SDR, yields the minimal first moment reduction.
Proposition 2 (Joint SDR). Assume X|Y has a distribution of the form (4), with main effects parameters (16) and (17). If the $2k \times r$ concatenated matrix $(\Gamma; \Psi)$, obtained by joining the columns of Γ in (16) and Ψ in (17) (concatenation is formally defined in Appendix A), has rank $d_J \le \min\{r, 2k\}$, then a $d_J$-dimensional sufficient reduction for Y|X is

$$R(X) = \kappa^T\begin{bmatrix} T(X) \\ \nu(X) \end{bmatrix} \in \mathbb{R}^{d_J}, \quad (18)$$

where κ is a basis of $\mathrm{span}\{(\eta_y - \eta_0; \xi_y - \xi_0), y \in \mathcal{Y}\} = \mathrm{span}\{(\Gamma; \Psi)\}$.
In some settings, however, it may be desirable to obtain separate reductions for the zero and X components for
better interpretation of the results.
Proposition 3 (Stacked SDR). We assume the same conditions stated in Proposition 2 hold, and let $d_X = \mathrm{rank}(\Gamma) \le \min\{r, k\}$ and $d_\nu = \mathrm{rank}(\Psi) \le \min\{r, k\}$. A sufficient reduction for Y|X of dimension $d_X + d_\nu$ is given by

$$R_s(X) = \begin{bmatrix} \alpha & 0 \\ 0 & \zeta \end{bmatrix}^T\begin{bmatrix} T(X) \\ \nu(X) \end{bmatrix} \in \mathbb{R}^{d_X + d_\nu}, \quad (19)$$

where α is a basis of $\mathrm{span}\{\eta_y - \eta_0, y \in \mathcal{Y}\} = \mathrm{span}\{\Gamma\}$ and ζ is a basis of $\mathrm{span}\{\xi_y - \xi_0, y \in \mathcal{Y}\} = \mathrm{span}\{\Psi\}$.


The stacked SDR in (19) is numerically more stable when the variances of X and the variances of ν(X) have
very different magnitudes (data not shown), while the joint SDR in (18) provides a greater reduction of the data, as
rank(Γ; Ψ) ≤ rank(Γ) + rank(Ψ). Propositions 2 and 3 are proved in Appendix B.
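As an illustration of the joint SDR in (18), the following sketch (Python/NumPy; all names are ours) extracts a basis κ of span{(Γ; Ψ)} from fitted coefficient matrices via a singular value decomposition and evaluates R(X):

```python
import numpy as np

def joint_sdr(Gamma_hat, Psi_hat, T_X, nu_X, d_J):
    """Sketch of (18): kappa is an orthonormal basis of span{(Gamma; Psi)}
    from a rank-d_J SVD of the stacked (2k x r) matrix; the reduction is
    kappa^T (T(X); nu(X))."""
    stacked = np.vstack([Gamma_hat, Psi_hat])      # (2k x r) concatenated matrix
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    kappa = U[:, :d_J]                             # 2k x d_J basis of the column space
    features = np.hstack([T_X, nu_X])              # n x 2k rows (T(x)^T, nu(x)^T)
    return features @ kappa                        # n x d_J values of R(X)
```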

4. Estimation and variable selection

To avoid computing the partition function A in the joint distribution (4), we propose a pseudo-likelihood esti-
mation method based on the conditional probabilities (6). It is computationally efficient for most zipGM and pGM
distributions since each conditional probability can be computed in a closed form from (6) and (7). We incorporate
variable selection into the estimation via a novel hierarchical weighted penalty, building on approaches by Zhao et al.
(2009) and McDavid et al. (2019). We present an algorithm to solve the penalized estimation problem using a second
order optimization approach as in Lee et al. (2012), that iteratively builds and optimizes a quadratic function.

4.1. Pseudo-likelihood estimation


Let $Q_{X,Y}$ denote the true unknown distribution of the data (X, Y) with density Q(X, Y) = Q(X|Y)Q(Y), and let $P_{X,Y}(\omega)$ denote a parametric distribution with density $P(X, Y \mid \omega) = P(X \mid Y, \omega)P(Y)$. In our case $P(X \mid Y, \omega)$ is given by model (4), where the main effects parameters are modeled in equations (16) and (17). Following Varin et al. (2011), we define the composite Kullback-Leibler (CKL) divergence between $Q_{X,Y}$ and $P_{X,Y}$ as

$$\mathrm{CKL}\{Q_{X,Y} \| P_{X,Y}(\omega)\} = \sum_{j=1}^{k} E_{Q_{X,Y}}\big\{\log Q(X_j, Y \mid X_{\backslash j}) - \log P(X_j, Y \mid X_{\backslash j}, \omega)\big\}. \quad (20)$$
Using properties from Martens (2020),

$$\mathrm{CKL}\{Q_{X,Y} \| P_{X,Y}(\omega)\} = \sum_{j=1}^{k} E_{Q_Y}E_{Q_{X|Y}}\big\{\log Q(X_j \mid X_{\backslash j}, Y) - \log P(X_j \mid X_{\backslash j}, Y, \omega)\big\} = E_{Q_Y}\mathrm{CKL}\{Q_{X|Y} \| P_{X|Y}(\omega)\}. \quad (21)$$

Replacing the unknown distribution $Q_{X,Y}$ by the empirical distribution function $\hat{Q}_{X,Y}$ based on an i.i.d. sample $(X^s, Y^s)$, $s = 1, \ldots, n$, we show in Appendix B that

$$\mathrm{CKL}\{\hat{Q}_{X,Y} \| P_{X,Y}(\omega)\} \propto -\frac{1}{n}\sum_{s=1}^{n}\sum_{j=1}^{k}\log P(X_j^s \mid X_{\backslash j}^s, Y^s, \omega) =: \sum_{j=1}^{k}\hat{E}\,\ell_{j|\backslash j}(\omega). \quad (22)$$

Thus, minimizing the CKL divergence between the empirical distribution $\hat{Q}_{X,Y}$ and $P_{X,Y}(\omega)$ is equivalent to minimizing the corresponding pseudo-likelihood (Besag, 1975). An estimator of ω is thus

$$\hat{\omega} = \arg\min_{\omega\in\Omega}\sum_{j=1}^{k}\hat{E}\,\ell_{j|\backslash j}(\omega). \quad (23)$$

When $P_{X,Y}(\omega)$ is correctly specified, $\hat{\omega}$ is consistent for the parameter ω of the distribution that generates the data (Besag, 1975). The pseudo-likelihood for the zipGM distribution (4) is obtained by plugging the conditional probability (6) into (22) as

$$\sum_{j=1}^{k}\hat{E}\,\ell_{j|\backslash j}(\omega) = -\frac{1}{n}\sum_{s=1}^{n}\sum_{j=1}^{k}\Big\{\xi_{j|\backslash j}^{s}\nu_j^s + \eta_{j|\backslash j}^{s}T(x_j^s) + \tfrac{1}{2}\Theta_{jj}T^2(x_j^s) + B(x_j^s) - A(\xi_{j|\backslash j}^{s}, \eta_{j|\backslash j}^{s}, \Theta_{jj})\Big\}, \quad (24)$$

where, for ease of exposition, we let $\nu^s = \nu(X^s)$, $\xi^s = \xi_{Y^s}$ and $\eta^s = \eta_{Y^s}$.
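For concreteness, the following minimal sketch evaluates the summands in (24) for the Normal-zipGM, where (6) and (7) are available in closed form (Python/NumPy; all names are ours, and the conditional parameters from (8)-(9) are assumed precomputed):

```python
import numpy as np

def neg_pseudo_loglik_normal(x, xi_cond, eta_cond, theta_diag):
    """Sketch of (24) for the Normal-zipGM (T(x) = x, B = 0): x, xi_cond,
    eta_cond are (n x k) arrays of data and conditional parameters;
    theta_diag holds the diagonal entries Theta_jj < 0 (length k)."""
    nu = (x != 0).astype(float)
    # A^+ for the normal case (Section 3.1)
    a_plus = (-eta_cond**2 / (2.0 * theta_diag)
              - 0.5 * np.log(-theta_diag) + 0.5 * np.log(2.0 * np.pi))
    # log partition (7) with T(0) = 0: A = log(1 + exp(xi + A^+))
    a = np.logaddexp(0.0, xi_cond + a_plus)
    ll = xi_cond * nu + eta_cond * x + 0.5 * theta_diag * x**2 - a
    return -ll.mean(axis=0).sum()   # (1/n) sum over samples, summed over components
```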


To compute the corresponding information matrix $\mathcal{I}(\omega)$, we first expand (22) around ω, where $\bar{\omega}$ is a value from a small neighbourhood around the origin,

$$\mathrm{CKL}\{P(\omega + \bar{\omega}) \| P(\omega)\} = \tfrac{1}{2}\bar{\omega}^T\mathcal{I}(\omega)\bar{\omega} + O(\|\bar{\omega}\|_2^3), \quad (25)$$

from which we obtain

$$\mathcal{I}(\omega) := \frac{\partial^2}{\partial\omega\,\partial\omega^T}\mathrm{CKL}\{P(\omega + \bar{\omega}) \| P(\omega)\} = \frac{\partial^2}{\partial\omega\,\partial\omega^T}E_{Q_Y}E_{P_{X|Y}(\omega)}\sum_{j=1}^{k}\ell_{j|\backslash j}(\omega). \quad (26)$$

The inner expectation is taken with respect to $P_{X|Y}$ and not $\hat{Q}_{X|Y}$, which would yield the Hessian matrix. Positive semi-definiteness of $\mathcal{I}(\omega)$ is guaranteed as the pseudo-likelihood is a convex function of ω (Lee and Hastie, 2015). We use $\mathcal{I}(\omega)$ in the optimization procedure and in defining the penalty terms.
When the rank d J , dX or dν of the sufficient reduction R̂(X) is lower than the dimension r of the regression matrices
Γ or Ψ, we incorporate the rank restriction on Γ and Ψ in the parametric space Ω in (23). Alternatively, when r is
small, selecting d J , dX or dν can be incorporated into the cross validation procedure described in Section 4.2.2.
Note that Γ̂ and Ψ̂ obtained from (23) include all components of X and all possible interactions between them (see
also Supplementary Material Figure S1a). To improve interpretability of the resulting R(X) and facilitate replication,
we next incorporate variable selection via penalization into the pseudo-likelihood estimation to capture only the X
components truly related to Y.

4.2. Variable selection using hierarchical penalization


Here, in addition to penalizing the regression parameters Γ and Ψ of the main effects $\eta_y$ and $\xi_y$ in (16) and (17), respectively, that capture the relationship of X with Y, we also penalize the interaction parameters (Θ, Λ, Φ) in model (4). When there are no interactions between components $X_i$ and $X_j$ of X, i.e. $\Theta_{ij} = \Lambda_{ij} = \Phi_{ij} = \Phi_{ji} = 0$, then $P(X_i \mid X_{\backslash i}) = P(X_i \mid X_{\backslash\{i,j\}})$ and $X_i \perp\!\!\!\perp X_j \mid X_{\backslash\{i,j\}}$. This case is shown in Supplementary Material Figure S1b; there is no edge between $X_i$ and $X_j$, but both $X_i$ and $X_j$ are connected to $X_{\backslash\{i,j\}}$. Similarly to Zhao et al. (2009), we allow interactions between $X_i$ and $X_j$ only if, for both, the corresponding main effect terms are included in the model, i.e. we induce sparseness hierarchically. This leads to a better conditioned optimization problem, as it avoids instabilities in variable selection caused by high correlations. When two variables are highly correlated our penalization selects both, at the cost of less sparseness. However, further post-processing could be applied to select the minimal set of variables needed to predict the outcome. Note that a particular component $X_j$ can be associated with Y by contributing to R(X) either through $T(X_j)$ or $\nu(X_j)$ (Supplementary Material Figure S1c).

4.2.1. Penalized pseudo-likelihood


Building on earlier approaches (e.g. Lee and Hastie, 2015), we estimate ω by optimizing the pseudo-loglikelihood (23) with two weighted penalty terms, to capture only outcome related variables and their interactions,

$$\hat{\omega} = \arg\min_{\omega\in\Omega}\sum_{j=1}^{k}\hat{E}\,\ell_{j|\backslash j}(\omega) + \lambda_R\sum_{j=1}^{k}\big\|(\Gamma_j^T; \Psi_j^T; \Theta_{j,\backslash j}^T; \Phi_{\backslash j,j}; \Phi_{j,\backslash j}^T; \Lambda_{j,\backslash j}^T)\big\|_{R_j} + \lambda_C\sum_{j=1}^{k}\sum_{l\neq j}\big\|(\Theta_{jl}; \Phi_{lj}; \Phi_{jl}; \Lambda_{jl})\big\|_{C_{jl}}, \quad (27)$$

where we use the same notation as in (8) and (9), e.g. $\Lambda_{j,\backslash j}$ corresponds to the jth row of the matrix Λ after removing its jth column, and $\Phi_{\backslash j,j} = [\Phi^T]_{j,\backslash j}^T$. The vectors $\Gamma_j$ and $\Psi_j$ in (27) are the jth rows of the matrices Γ in (16) and Ψ in (17), corresponding to the regression coefficients of Y for the component $X_j$; $\|b\|_{R_j} = \sqrt{b^T R_j b}$ is a weighted norm for a positive definite $[2r + 4(k-1)] \times [2r + 4(k-1)]$ matrix $R_j$, and the same definition is used for $\|b\|_{C_{jl}}$, with the matrix $C_{jl} \in \mathbb{R}^{4\times 4}$.
The matrices $R_j$ and $C_{jl}$ used in (27) are square blocks from the Fisher information matrix $\mathcal{I}(\omega^0)$ in (26), computed assuming independence between X and Y and between all components of X, i.e. letting $\Gamma_{ij} = \Psi_{ij} = \Theta_{ij} = \Lambda_{ij} = \Phi_{ij} = \Phi_{ji} = 0$. The density of X under the independence zipGM with corresponding parameter vector $\omega^0$ is

$$P_X(\omega^0) = \prod_{j=1}^{k}P(X_j = x_j) = \prod_{j=1}^{k}\exp\Big\{\eta_j^0 X_j + \xi_j^0\nu(X_j) + \tfrac{1}{2}\Theta_{jj}^0 X_j^2 - A(\eta_j^0, \xi_j^0, \Theta_{jj}^0)\Big\}. \quad (28)$$

Thus for ω in a neighbourhood of $\omega^0$, the CKL can be approximated by $\|\omega\|^2_{\mathcal{I}(\hat{\omega}^0)}$, and the norms in (27) shrink the parameters towards the independence model $P_X(\omega^0)$ in (28). Details about computing $\mathcal{I}(\hat{\omega}^0)$ and assumptions on its structure to ease the computational burden are given in Appendix C. The novel penalty term with tuning parameter $\lambda_R$ in (27) induces zeros in the rows of the regression matrices Γ and Ψ and in the rows and columns of the interaction matrices, eliminating components of X unrelated to Y. The penalty term with tuning parameter $\lambda_C$ induces sparsity in the model by setting selected entries of the interaction matrices to zero, similar to McDavid et al. (2019).
Model selection consistency, i.e. the oracle property of the proposed procedure, follows from Lee et al. (2015).

4.2.2. Outline of the algorithm for the penalized estimation


Assuming that we have independent training and validation data sets, the estimation consists of the following steps:
1. Obtain $\hat{\omega}^0$ for the independence model (28) fitted to the training data using the pseudo-likelihood without penalties.
2. Compute $\mathcal{I}(\hat{\omega}^0)$ from (26), where the expectation $E_{P_{X|Y}(\hat{\omega}^0)} = E_{P_X(\hat{\omega}^0)}$ is computed by sampling from the distribution in (28). Then use the respective diagonal blocks of $\mathcal{I}(\hat{\omega}^0)$ to construct the norms $\|\cdot\|_{R_j}$ and $\|\cdot\|_{C_{jl}}$ in the penalty terms in (27).
3. Compute upper bounds $\lambda_{R,\max} \propto \max_j\|J_j\|_{R_j^{-1}}$ and $\lambda_{C,\max} \propto \max_{jl}\|J_{jl}\|_{C_{jl}^{-1}}$, where $J_j$, $J_{jl}$ are the corresponding blocks of the Jacobian of the pseudo-likelihood (24), according to the groups defined in the penalization, evaluated at $\hat{\omega}^0$. Then, for each pair $(\lambda_R, \lambda_C)$ with values on a grid defined on $[0, \lambda_{R,\max}] \times [0, \lambda_{C,\max}]$:
(a) Obtain $\hat{\omega}$ from (27) using the algorithm in Appendix D. In brief, $\mathcal{I}(\hat{\omega}^{(k)})$ in (26), evaluated at the estimate from the kth iteration, is used as the curvature matrix in a second order approximation of the pseudo-likelihood. When calculating $\mathcal{I}$, the expectation $E_{P_{X|Y}(\omega)}$ is computed numerically based on Gibbs sampling from the conditional distributions in (6), and $E_{Q_Y}$ is estimated by the sample mean. Martens (2020) shows that by using the damped matrix $\mathcal{I}(\omega) + \epsilon I$, linear and global convergence is ensured.
(b) Refit the model $P_{X|Y}(\omega)$ in (4) using only the components and interactions of X selected in step 3a to obtain $\hat{\Gamma}$ and $\hat{\Psi}$, and compute the reduction R(X) in (18).
(c) Compute a measure of the predictive performance of R(X) obtained in step 3b in the validation data. For binary Y we use the area under the receiver operating characteristic curve (AUC), with $\hat{y} = R(X)$ for dimension r = 1, and $\hat{y} = \hat{P}(Y = 1 \mid R(X))$ from a logistic model when r > 1. For continuous Y we use the mean squared error between y and $\hat{y}$ predicted based on a kernel for $\hat{R}(X)$ (Adragni and Cook, 2009), defined as $MSE = \frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$.
4. Using the tuning parameters $(\lambda_R^*, \lambda_C^*)$ that yielded the model with the best performance measure, e.g. the largest average AUC, from step 3c, repeat steps 3a and 3b for the combined training and validation data to obtain the final model.
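A schematic of the Gibbs step in 3a follows (Python; all names are ours, and the component-wise draw depends on the chosen zipGM, as in the Remarks below):

```python
def gibbs_sweep(x, draw_conditional, rng):
    """One Gibbs sweep over the k components, as used to approximate
    E_{P_{X|Y}(omega)} in step 3a; draw_conditional(j, x, rng) must draw X_j
    from the univariate zero-inflated conditional (6), e.g. via (13) or (14)."""
    for j in range(x.shape[0]):
        x[j] = draw_conditional(j, x, rng)
    return x
```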
Remark 2. To sample from the univariate zero-inflated exponential family (10) in steps 2 and 3a above, we use Proposition 1: we first sample $\nu_j$ from a Bernoulli distribution and, conditional on it being non-zero, obtain $X_j$ from the conditional distributions (13) or (14). The starting point for the Gibbs sampler is the independence distribution in (28).
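A minimal sketch of this two-stage draw for the Normal-zipGM (Python/NumPy; all names are ours):

```python
import numpy as np

def sample_component_normal(xi_cond, eta_cond, theta_jj, rng):
    """Sketch of Remark 2 for the Normal-zipGM: draw nu_j from the Bernoulli
    distribution (12); if nonzero, draw X_j from the conditional normal (13)
    with mean -eta/Theta_jj and variance -1/Theta_jj (Theta_jj < 0)."""
    a_plus = (-eta_cond**2 / (2.0 * theta_jj)
              - 0.5 * np.log(-theta_jj) + 0.5 * np.log(2.0 * np.pi))
    p_nonzero = 1.0 / (1.0 + np.exp(-xi_cond - a_plus))   # equation (12)
    if rng.uniform() >= p_nonzero:
        return 0.0
    return rng.normal(-eta_cond / theta_jj, np.sqrt(-1.0 / theta_jj))
```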
Remark 3. To simplify computations for the Poisson-zipGM, we use that the number of events of a rate-one Poisson process in the interval $(0, \exp(\eta_j))$ has a Poisson distribution with rate $\exp(\eta_j)$. The time $t_j$ until the first event, conditioned to fall into $(0, \exp(\eta_j))$, has a standard exponential distribution truncated to that interval. For $\tilde{X}_j \sim \mathrm{Poisson}(\exp(\eta_{j|\backslash j}) - t_j)$ we get $(X_j \mid \nu(X_j) = 1) = \tilde{X}_j + 1$. To sample from the TPoisson-zipGM, we evaluate the inverse probability distribution function using the algorithm in Giles (2016).
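A minimal sketch of the sampler in Remark 3 (Python/NumPy; names are ours):

```python
import numpy as np

def sample_zero_truncated_poisson(eta, rng):
    """Sketch of Remark 3: with lam = exp(eta), draw the first arrival time t
    of a rate-one Poisson process conditioned to fall in (0, lam) (a truncated
    Exp(1), via inverse CDF), then add Poisson(lam - t) further counts."""
    lam = np.exp(eta)
    u = rng.uniform()
    t = -np.log1p(-u * (1.0 - np.exp(-lam)))   # inverse CDF of truncated Exp(1)
    return 1 + rng.poisson(lam - t)

rng = np.random.default_rng(0)
draws = [sample_zero_truncated_poisson(np.log(3.0), rng) for _ in range(5)]
```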

5. Simulations

We assessed the performance of our procedures in simulations, all using the joint SDR, R(X) in (18), which gives
the minimal reduction of X for the regression on Y.
For Simulation 1 (Section 5.1.1), we generated X|Y for binary Y from selected zipGMs. We compared the pre-
dictive performance of the estimators and the ability to select variables for different proportions of zeros in X when
fitting the data generating model, the corresponding standard pGM and a special case of the standard pGM, the Ising
model for ν(X), that only uses the components of X categorized into zeros and non-zeros. All models were estimated
based on (27). In Simulation 2 (Section 5.1.2) we studied the accuracy of variable selection of the zipGMs for various
interaction settings and under misspecification of the zipGM distributional form. For Simulation 3 (Section 5.4), we
generated data from a forward model Y|X for binary and continuous Y and compared our methods to procedures from
the mixOmics package (Rohart et al., 2017).
The training data for model fitting had n = 200, 500 and 1000 observations, and the independent validation data set
used to select the final model had 1000 observations. The final model’s performance was assessed in an independent
test set with 1000 observations. For each parameter setting results are based on 100 repetitions.

5.1. Data generation for simulations 1 and 2


Given Y ∼ Bernoulli(0.5), we generated XT = (X1 , . . . , X100 ) from the Normal, Poisson and TPoisson-zipGMs.

5.1.1. Simulation 1: comparing pGMs and zipGMs for varying amounts of zero inflation
Here X1 , · · · , X15 were associated with Y through both ηy in (16) and ξy in (17). The interaction matrices Θ, Λ
and Ψ had a 15 × 15 block for X1 , · · · , X15 with all non-zero entries. All other elements of Λ, Ψ and the off-diagonal
elements of Θ were set to zero. For the Normal-zipGM model η0 j and Θ j j were generated from N(0, 0.01), and
N(−1, 0.01), respectively, for all j. For the Poisson-zipGM and the TPoisson-zipGM, η0 j ∼ N(log(3), 1). To vary
the proportion of zeros we used ξ0 j = −A+ (η0 j , Θ j j ) + ∆ξ for ∆ξ ∈ {−2, 0, 2}, with A+ (·) defined in Theorem 1. The
proportions of zeros ranged from 87-90% for ∆ξ = −2, 45-65% for ∆ξ = 0, and 10-28% for ∆ξ = 2. To generate
the interaction parameters (Θ jl , Φl j , Φ jl , Λ jl ), j, l = 1, . . . , 15, we first drew independent values (Θ̃ jl , Φ̃l j , Φ̃ jl , Λ̃ jl ) from
a mixture distribution, 0.5N(−3, 1) + 0.5N(3, 1). We scaled these numbers to have interactions with similar effect
sizes: for the Normal-zipGM, Θ jl = Θ̃ jl /(Θ j j Θll )1/2 and Φl j = Φ̃l j /(−Θ j j )1/2 . For the Poisson-zipGM, Θ jl =
−Θ̃ jl exp{(η0 j + η0l )/2}, and Φl j = Φ̃l j exp(η0l /2). We further scaled the interaction parameters by c = 0.01 for the
Normal-zipGM, and c = 0.005 for the Poisson and TPoisson-zipGMs to control their strength.
To create a challenging setting that at the same time increased the proportion of zeros and increased the rate of
the counts, we let Ψ̄ j = 1 and Γ̄ j = − exp(−η0 j /2) for the Poisson-zipGM and TPoisson-zipGM and Γ̄ j = −(−Θ j j )1/2
for the Normal-zipGM, for j = 1, . . . , 15, and Ψ̄ j = Γ̄ j = 0 for j = 16, . . . , 100, for all models. We set Γ = a1 Γ̄ in
(16) and Ψ = a2 Ψ̄ in (17) to control the strength of association between X and Y. The constant a1 was chosen so that
AUC ≈ 0.65 when αT T(X), where α is a basis of span{Γ}, was used as the predictor and there were few zeros (∆ξ = 5, a case not studied further), and a2 was chosen to yield AUC ≈ 0.65 when ∆ξ = 0 and ζ T ν(X), where ζ is a basis of span{Ψ}, was used as the predictor. Note that the AUC values were not fixed, and varied for different values of ∆ξ.

5.1.2. Simulation 2: assessing impact of interaction strengths and distributional mis-specification


Here X1 , . . . , X5 were associated with Y only through ηy , X6 , . . . , X10 through both ηy and ξy , and X11 , . . . , X15
only through ξy . For the Normal-zipGM, we generated η0 j and Θ j j as in Section 5.1.1. For the Poisson-zipGM and
TPoisson-zipGM η0 j ∼ N(log(100), 1) for all j. For all models ξ0 j ∼ N(−A+ (η0 j , Θ j j ), 1), resulting in ∼50% of zeros
among all X components under the independent model (28).
The interaction matrices Θ, Λ, and Ψ each had three blocks of interactions: between X1 , . . . , X15 (B1), between
X16 , . . . , X100 (B2) and between X1 , . . . , X15 and X16 , . . . , X100 (B3). The combinations we studied were: independence
in all blocks, block diagonal with 30 interactions (10 for each block of 5 variables) in B1, with independence or 15
randomly positioned interactions in each, B2 and B3. The interaction parameters were generated and scaled as for
Simulation 1. A larger scaling constant c yields stronger interactions.
For all models, we drew Γ̄ j and Ψ̄ j from the mixture 0.5N(−3, 1) + 0.5N(3, 1) and scaled each Γ j by stdev(X j )
under the independence model. We then let Γ j = a1 Γ̄ j and Ψ j = a2 Ψ̄ j and chose a1 and a2 such that AUC ≈ 0.65
when αT T(X) or ζ T ν(X) were used as predictors for Y, to get similar association strengths for the counts and binary
model components.

5.2. Performance measures


Since for binary Y the reduction R(X) is a scalar, we used the AUC to select the model (step 3c of the algorithm)
and to measure the predictive performance of R(X). For comparison, we also studied the performance of an “oracle”
model defined as in McDavid et al. (2019), which selects the variables that minimize the false positive rate (FPR) for
a bounded false negative rate (FNR) < 0.1 (results in the Supplementary Material).
We quantify the performance of our algorithm for variable selection (VS) and for testing conditional independence (CI) by FPR and FNR measures. FPR(VS) reports the proportion of variables incorrectly selected as outcome-associated by the first penalization term. FPR(CI) reports the proportion of times any two components Xi and X j are independent given the other components X\{i, j} but at least one of the corresponding interaction parameters (Θi j , Φ ji , Φi j , Λi j ) is estimated to be nonzero. Corresponding definitions are used for FNR.
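As an illustration, the VS error rates can be computed as in the following sketch (Python/NumPy; names are ours):

```python
import numpy as np

def vs_error_rates(selected, truth):
    """Sketch of FPR(VS) and FNR(VS): `selected` and `truth` are boolean
    vectors over the k components, True meaning the component is (declared)
    associated with the outcome."""
    selected, truth = np.asarray(selected), np.asarray(truth)
    fpr = selected[~truth].mean() if (~truth).any() else 0.0
    fnr = (~selected)[truth].mean() if truth.any() else 0.0
    return fpr, fnr
```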

5.3. Simulation results


5.3.1. Simulation 1
Figure E.1 shows the results when a Normal-pGM, Normal-zipGM and an Ising model, a pGM that only uses
the binary component of the data, were fit to data generated from a Normal-zipGM for various sample sizes n of the
training data. For all values of ∆ξ the Normal-zipGM had the highest AUC, and the difference was bigger for larger
n’s for ∆ξ = 0 and ∆ξ = 2. For ∆ξ = −2, i.e. proportions of zeros >45%, the AUCs of the Normal-zipGM and Ising
model were similar, and much higher than the AUC of the Normal-pGM. For ∆ξ = 2 (10-30% zeros) the Normal-pGM
model had a higher AUC than the Ising model. All models had very low FPR(VS), FNR(VS) and FPR(CI) values
for all choices of ∆ξ . However, the Ising model always had a high FNR(CI), confirming its difficulty in estimating
strong interactions and its sensitivity to model misspecification. The columns of Figure E.2 show the results for
different proportions of zeros (values of ∆ξ) when the Poisson-pGM, Poisson-zipGM and the Ising model were fit to data generated from a Poisson-zipGM. The differences between the Poisson-pGM and the Poisson-zipGM were more
pronounced than for the corresponding normal models. The AUC was largest for the Poisson-zipGM, followed by the
AUC of the Ising model, and it was much lower for the Poisson-pGM. For ∆ξ = 2 (10-28% zeros), the mean AUC
of the Poisson-pGM was less than 0.55 for all sample sizes, while it was around 0.7 for the Poisson-zipGM. For this
setting the Poisson-pGM also had a much higher FPR(VS) than all other models, while for a very large proportion
of zeros (∆ξ = −2) it had a very large FNR(VS). The FPR(CI) was low for all models for ∆ξ = −2. The FNR(CI)
decreased for the Poisson-zipGM and Poisson-pGM as the proportion of zeros decreased. Results for the truncated
Poisson models were similar to those from the Poisson models and are shown in Supplementary Material Figure S2.
We further discuss differences between the Poisson and Normal model performances under misspecification with
zero-inflated data in Supplementary Material S3. Supplementary Material Figures S3 and S4 show results for the
“oracle” selection criterion, which highlight that the standard pGMs performed well for variable selection when data
were generated under the zipGMs. This indicates that our hierarchical penalty robustly captures important variables
and is not very sensitive to mild misspecifications of the underlying probability model.
In Supplementary Material S6, we show results for simulations when we generated data from a Poisson-pGM
model with different proportions of zeros, and then fit the Poisson-zipGM and the Ising-pGM. When the data were
analyzed with the Poisson-zipGM, variables were selected with little error for all the settings, and we only observed
some loss in prediction performance when the proportion of zeros was very small. Thus, only in settings with a small proportion of zeros should analysts consider using the standard pGM, due to its better predictive efficiency.

5.3.2. Simulation 2
Figure E.3 shows results when X was generated from the Normal-zipGM under independence (first column) and
under increasingly stronger interactions (columns 2 and 3) for various sample sizes of the training data. When R(X)
was estimated using the Normal-zipGM model that generated the data, the AUC was ≈ 0.7; it was noticeably lower (AUC ≈ 0.6) for R(X) estimated from the Poisson-zipGM and TPoisson-zipGM for all choices of interaction parameters. Of
note, the Normal-zipGM gave results very close to a predictive model that used the true R(X), the “gold standard”
(corresponding to the dotted lines) for all settings.
The FPR(VS) was highest for the Normal-zipGM, with the exception of the case of strong interaction (column 3),
where for larger sample sizes the TPoisson-zipGM had a FPR(VS) of 60% or higher. The FPR(CI) was < 0.2 for all
models, except for the TPoisson-zipGM for n = 500 and n = 1000, when data were generated with strong interactions.
Under the independent setting the FNR(CI) was always zero, but for the moderate interaction setting all models had
FNR(CI)> 0.8. The FNR(CI) was lower for the strong interaction setting for the Normal-zipGM and Poisson-zipGM,
and lowest for the TPoisson-zipGM, confirming that the hierarchical penalization under stronger interactions reduces both FNR(VS) and FNR(CI).
When X was generated from the Poisson-zipGM (Figure E.4) under independence (first column), or under in-
creasingly stronger interactions (columns two and three), the AUC was similar for the Poisson-zipGM and TPoisson-
zipGM, and lower for the Normal-zipGM. In the strong interaction setting all models selected some false positive
variables that had interactions with (X1 , . . . , X15 ), i.e. variables that capture some second order information about
Y, which resulted in better predictions than the ones obtained from the gold standard model (dotted lines) for the
Poisson-zipGM and TPoisson-zipGM.
Under independence and for modest interactions, the FPR(VS) for the Normal-zipGM was much larger than for
the Poisson-zipGMs. For all zipGMs and interaction strengths the FPR(CI) was very low. The FNR(CI) was low
under independence and for strong interactions, but > 60% for the modest interaction setting for all zipGMs and
sample sizes.
Supplementary Material S4 shows results for the “oracle” selection criterion, where FPR(VS)< 0.1 (McDavid
et al., 2019). All models had low FNR(VS), except the Normal-zipGM, for which FNR(VS)∼ 0.8 for modest inter-
actions. It also presents results for additional settings, showing that the Normal-zipGM generally performed better
when there were strong correlations among the X components.

5.4. Simulation 3: Data generated from the forward model


For Simulations 1 and 2 we generated data from the inverse models of X|Y. Here we simulated data based on
a forward model Y|X, using the package MB-GAN (Rong et al., 2021), designed to generate counts X resembling realistic microbiome data. We combined the predictors based on the phylogenetic tree up to the genus level, which results in 88 variables. We considered continuous as well as binary Y. The results for binary Y are shown in Supplementary Material S5. For a sample s we simulated binary or continuous $Y^{(s)}$ as a function of $m_{10}^{(s)}$, the sum of the 10 most abundant components of X divided by $\sum_{j=1}^{k} x_j^s$, or of the sum of the 10 indicator components of ν(X) with the highest entropy. Continuous Ys were obtained by subtracting the mean across all samples from $m_{10}^{(s)}$ and dividing the difference by the empirical standard deviation.
difference by the empirical standard deviation.
We estimated the joint SDR, R(X), with fy = y and d J = r = 1 in Proposition 2 and obtained ŷ from using R̂ in
a Gaussian kernel estimator with the median of the pairwise distances as the bandwidth (Adragni and Cook, 2009).
The performance of the final model was assessed by the MSE. For comparison, we used Sparse Partial Least Squares
(SPLS) from the mixOmics package (Rohart et al., 2017) which also accommodates variable selection and prediction.
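A minimal sketch of this kernel prediction step, assuming a one-dimensional reduction (Python/NumPy; names are ours):

```python
import numpy as np

def kernel_predict(r_train, y_train, r_test):
    """Sketch of the prediction step: Nadaraya-Watson estimator with a
    Gaussian kernel on the estimated reduction R(X) (here d_J = 1), using
    the median pairwise distance of the training reductions as bandwidth."""
    r_train = np.asarray(r_train, dtype=float).reshape(-1, 1)
    r_test = np.asarray(r_test, dtype=float).reshape(-1, 1)
    d = np.abs(r_train - r_train.T)                     # n x n pairwise distances
    h = np.median(d[np.triu_indices_from(d, k=1)])      # median bandwidth
    w = np.exp(-0.5 * ((r_test - r_train.T) / h) ** 2)  # m x n kernel weights
    return (w @ np.asarray(y_train, dtype=float)) / w.sum(axis=1)
```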
The first column of Figure E.5 shows that when Y was generated using only ν(X), SPLS had much larger MSEs
than the zipGM models, which all had similar performance, for all sample sizes. When Y depended on X (second
column of Figure E.5), TPoisson-zipGM and SPLS had the lowest MSEs, and the Normal-zipGM performed poorly.
However, SPLS had the largest FNR(VS) for all sample sizes and settings. The oracle selection criterion led to similar
conclusions (Supplementary Material Figure S11).
In Supplementary Material S7, we compare the joint and stacked SDR and their ability to estimate the dimension
of the reduction. The results show that the joint SDR not only gives the smallest reduction, but its rank estimation is also robust. On the other hand, the stacked SDR achieves a better prediction error but usually overestimates the reduction rank.

6. Example: American Gut Project (AGP) data

We studied associations of the AGP gut microbiome taxonomic level 6 (Genus level) data with sex (http://americangut.org). 16S V4 region sequences were identified through the Greengenes 13.8 reference library. AGP
samples of individuals ages < 50 years with at least 1,000 reads were rarefied to 1,000 reads for analysis.
After excluding taxa present in < 10% of the entire dataset, we used k = 71 taxa measured in 937 samples (488
female, 449 male). We fit our zipGM and pGM models using the AUC as a predictive measure. We compared our
results to those obtained from sPLSDA in the mixOmics package (Le Cao et al., 2011; Rohart et al., 2017) and results
from two forward methods, selbal (Rivera-Pinto et al., 2018) and CoDA-lasso as implemented in coda4microbiome
(Calle et al., 2023). selbal and CoDA-lasso find contrasts of log-transformed relative abundances and accommodate
variable selection for the association of the compositional data with a binary outcome. We used 5-fold cross validation
within strata defined by sex to obtain unbiased model assessment and only considered models that selected ≤15
variables to facilitate interpretation and comparison of the results across methods.
Figure E.6a shows the AUC values for each model and the number of times each selected variable was identified by
a model. The Poisson-zipGM and TPoisson-zipGM selected the same variables in every fold of the cross-validation
procedure while the standard pGM models, mixOmics, coda4microbiome and selbal chose largely different vari-
ables for each fold. For the AUC computations using variables that were selected in at least one fold, mixOmics and
coda4microbiome thus used 23 and 24 variables, respectively, resulting in AUC=0.65 for both, while Poisson-zipGM
and TPoisson-zipGM used only 13 variables, resulting in slightly lower AUC=0.63 for both, similar to that of selbal.
The AUC for the Normal-zipGM was slightly lower, but the AUCs of the standard pGMs were much lower, including
the Ising model, which yielded AUC=0.57. These results suggest that modeling the zero patterns improves the pre-
dictive performance of standard pGMs, but they also caution that modeling only the zero patterns ignores too much
information in the data to yield optimal performance.
We then assessed the predictive ability of the n variables that each method selected in all five folds. Table E.1 shows the error rate, defined as $\frac{1}{n}\sum_i I(y_i \neq \hat{y}_i)$, estimated from a separate 10-fold cross validation using these n variables to predict sex based on various statistical models: logistic regression (LR, package glm), quadratic discriminant analysis (qda, package MASS), random forest (RF, package randomForest), and support vector classifiers with a polynomial kernel (svm-poly, package kernlab).
rates, 0.37 and 0.38 respectively, closely followed by using the variables selected by Poisson-zipGM and TPoisson-
zipGM, 0.39 for both. When svm was used as a predictive model, the error rate from using the variables selected by
Poisson-zipGM and TPoisson-zipGM as predictors yielded noticeably lower error rates, 0.34 for both, than all other approaches. Error rates using the variables selected by the Normal models, the standard Poisson pGMs and selbal
were substantially higher for all predictive models.
As neither mixOmics, selbal nor coda4microbiome yield easily interpretable estimates of interactions, we also
estimated sparse graphical models using the R packages Spring (Yoon et al., 2019) and SpiecEasi (Kurtz et al.,
2015) for the data. We did not assume the interactions in our models to depend on the outcome, thus allowing comparisons with results from Spring and SpiecEasi, which do not accommodate any dependency of the data on outcomes. Spring builds on a truncated Gaussian copula model, and estimates the correlations between latent variables by a rank statistic using a modified log-contrast transformation which conserves the zeros in the data and is normalized by the geometric mean of non-zero counts. SpiecEasi considers the log-transform normalized by the geometric mean of the data and fits a Normal-pGM. Each method induces conditional independence by penalizing the entries of the interaction matrix.
The corresponding graphs are shown in Figure E.6b, which also shows, in gray, the interaction network for each model estimated without variable selection, i.e. using λR = 0 in (27). The zero-inflated models selected interactions more robustly, with better agreement between them and with SpiecEasi and Spring. This is evident from the cluster formed by the nodes labeled {6, · · · , 10}, which was not selected by the Ising model or the Poisson-based pGMs, but by all the zipGMs. Supplementary Material Figure S12 shows the network and variable selection for the Normal-pGM from the hierarchical penalization when applied over a regular grid of penalization parameters, to further highlight the impact of the penalty terms on the selection of variables and interactions.
Of note, while the literature on associations of sex with gut microbiota is inconsistent, an association with
Bacteroides-Prevotella, also selected by our models, has been reported (Kim et al., 2019).

7. Discussion

In this paper we define novel pairwise graphical model distributions in the exponential family that accommodate
excess zeros in a random vector X. A special case of these models is the Hurdle-normal model proposed by McDavid
et al. (2019). We characterize this joint distribution in terms of conditional distributions of the components of X,
which facilitates estimation and data generation, particularly in high-dimensional settings. To assess associations of
X with an outcome Y, we model the main effects parameters as functions of Y and derive two reductions R(X) that
contain sufficient information in X about Y. While we focus on univariate Y here, our methods are easily extended to
multivariate outcomes Y ∈ Rm .
For estimation we propose a pseudo-likelihood approach, derived from a composite Kullback-Leibler divergence.
For variable selection, we incorporate a novel hierarchical penalty into the pseudo-likelihood that has two terms:
one that penalizes the association parameters between components of X and Y and removes interactions from non-Y-associated components of X. The second penalty term penalizes interactions between the X components to further
induce sparsity in the conditional independence graph of the selected variables.
McDavid et al. (2019) used separate regressions with a penalization corresponding to λC in (27) for interactions.
Our penalty additionally includes the relation between X and Y. In simulations, it tended to select variables strongly
correlated with the ones directly associated with Y, leading to even better predictions than those based on “oracle”
models. In addition to improving interpretability of the fitted model, the hierarchical penalization is particularly useful
for analyses with limited sample sizes (Zhao et al., 2009). Our weighted penalization extends that of Lee and Hastie (2015), who considered only a single weight for each group (under the independence model): it approximates a measure that translates the distance (divergence) between distributions into a (weighted-norm) distance between parameters with respect to the independence model. Of note, our approach can accommodate con-
tinuous and categorical outcomes. For categorical outcomes, variable selection extracts the most relevant predictors
for all categories jointly, leading to lower false positive rates than independent binary comparisons. This penalization
approach is general and can be used with other divergence measures, which we plan to study in future research.
An interesting avenue for comparison of our work is with kernel-based methods for zero-inflated graphical models. Kernel methods have the potential to provide alternative perspectives and enhancements. However, our proposed methodology does not extend easily to kernel methods, so further development is needed, which will be part of future research.

Acknowledgments
We thank William Wheeler, IMS, for help with computations. This work utilized the computational resources of
the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Appendix A. Definitions
We define the concatenation of two matrices $A \in \mathbb{R}^{k_1\times r}$ and $B \in \mathbb{R}^{k_2\times r}$ as the mapping $(\cdot\,;\cdot): \mathbb{R}^{k_1\times r} \times \mathbb{R}^{k_2\times r} \to \mathbb{R}^{(k_1+k_2)\times r}$ such that the j-th row satisfies

$$(A; B)_j = \begin{cases} A_j & \text{if } j \le k_1, \\ B_{j-k_1} & \text{if } k_1 < j \le k_1 + k_2. \end{cases}$$

Appendix B. Proofs
Proof of Theorem 1. (⇒) Using (4), the conditional probability of $X_j$ given all other components $X_{\backslash j}$ is

$$P(X_j = x_j \mid X_{\backslash j} = x_{\backslash j}) = \frac{P(X = (x_{\backslash j}; x_j))}{\sum_{m\in\mathcal{X}\cup\{0\}}P(X = (x_{\backslash j}; m))}, \quad (B.1)$$

where

$$P(X = (x_{\backslash j}; x_j)) = \exp\Big\{\xi^T(\nu_{\backslash j}; \nu_j) + \eta^T(T(x_{\backslash j}); T(x_j)) + \tfrac{1}{2}(\nu_{\backslash j}; \nu_j)^T\Lambda(\nu_{\backslash j}; \nu_j) + (\nu_{\backslash j}; \nu_j)^T\Phi(x_{\backslash j}; x_j) + \tfrac{1}{2}(x_{\backslash j}; x_j)^T\Theta(x_{\backslash j}; x_j) + \mathbf{1}^T B((x_{\backslash j}; x_j))\Big\} \quad (B.2)$$

and

$$P(X = (x_{\backslash j}; m)) = \exp\Big\{\xi^T(\nu_{\backslash j}; \nu(m)) + \eta^T(T(x_{\backslash j}); T(m)) + \tfrac{1}{2}(\nu_{\backslash j}; \nu(m))^T\Lambda(\nu_{\backslash j}; \nu(m)) + (\nu_{\backslash j}; \nu(m))^T\Phi(x_{\backslash j}; m) + \tfrac{1}{2}(x_{\backslash j}; m)^T\Theta(x_{\backslash j}; m) + \mathbf{1}^T B((x_{\backslash j}; m))\Big\}, \quad (B.3)$$

and $\mathcal{X}$ accounts for the support of the random variable $X_j \mid \nu(X_j) = 1$. For ease of exposition we abbreviate $\nu(x)$ by ν. Consider the block-wise representation

$$\Lambda = \begin{pmatrix}\Lambda_{\backslash j,\backslash j} & \Lambda_{\backslash j,j}\\ \Lambda_{j,\backslash j} & 0\end{pmatrix}, \qquad \Phi = \begin{pmatrix}\Phi_{\backslash j,\backslash j} & \Phi_{\backslash j,j}\\ \Phi_{j,\backslash j} & 0\end{pmatrix}, \qquad \Theta = \begin{pmatrix}\Theta_{\backslash j,\backslash j} & \Theta_{\backslash j,j}\\ \Theta_{j,\backslash j} & \Theta_{jj}\end{pmatrix},$$

where by symmetry $\Lambda_{\backslash j,j} = \Lambda_{j,\backslash j}^T$ and $\Theta_{\backslash j,j} = \Theta_{j,\backslash j}^T$. After some simplification,

$$P(X_j = x_j \mid X_{\backslash j} = x_{\backslash j}) = \frac{\exp\big\{\xi_{j|\backslash j}\nu_j + \eta_{j|\backslash j}T(x_j) + \tfrac{\Theta_{jj}}{2}T(x_j)^2 + B(x_j)\big\}}{\sum_{m\in\mathcal{X}\cup\{0\}}\exp\big\{\xi_{j|\backslash j}\nu(m) + \eta_{j|\backslash j}T(m) + \tfrac{\Theta_{jj}}{2}T(m)^2 + B(m)\big\}}, \quad (B.4)$$

where $\xi_{j|\backslash j} = \xi_j + \Lambda_{j,\backslash j}\nu_{\backslash j} + \Phi_{j,\backslash j}T(x_{\backslash j})$ and $\eta_{j|\backslash j} = \eta_j + \nu_{\backslash j}^T\Phi_{\backslash j,j} + \Theta_{j,\backslash j}T(x_{\backslash j})$, that is, (8) and (9) respectively.
(⇐) Let

$$Q(X \mid \omega) := \log\frac{P(X \mid \omega)}{P(0 \mid \omega)}, \quad (B.5)$$

for any X in the sample space $(\mathcal{X}\cup\{0\})^k$, where $P(0 \mid \omega)$ denotes the probability that all random variables take the value 0, with $P(0 \mid \omega) > 0$. As by assumption $P(X \mid \omega)$ factors into products of at most two components, and Q is homogeneous, we can write Q as

$$Q(X \mid \omega) = \sum_{j=1}^{k}X_j g_j(X_j) + \frac{1}{2}\sum_{j}\sum_{k\neq j}X_j X_k g_{jk}(X_j, X_k), \quad (B.6)$$

where $g_j(\cdot)$, $g_{jk}(\cdot) = g_{kj}(\cdot)$ are, up to a constant, defined so that $\log G_{jk}(X_j, X_k) = X_j g_j(X_j) + X_k g_k(X_k) + X_j X_k[g_{jk}(X_j, X_k) + g_{kj}(X_k, X_j)]/2$ for $j > k$. From Besag (1974),

$$
Q(X \mid \omega) - Q(X_{\backslash j} \mid \omega) = \log \frac{P(X_j \mid X_{\backslash j}, \omega)}{P(0 \mid X_{\backslash j}, \omega)}, \tag{B.7}
$$
where $X_{\backslash j} = (X_1, \cdots, X_{j-1}, 0, X_{j+1}, \cdots, X_k)$. Using (B.6) in (B.7) yields
$$
Q(X \mid \omega) - Q(X_{\backslash j} \mid \omega) = X_j g_j(X_j) + \frac{1}{2} \sum_{k \neq j} X_j X_k g_{jk}(X_j, X_k). \tag{B.8}
$$

From (6), the right-hand side of (B.7) simplifies to
$$
\log \frac{P(X_j \mid X_{\backslash j}, \omega)}{P(0 \mid X_{\backslash j}, \omega)} = \frac{1}{2}\Theta_{jj}[T^2(X_j) - T^2(0)] + \eta_{j|\backslash j}(X_{\backslash j})[T(X_j) - T(0)] + \xi_{j|\backslash j}(X_{\backslash j})\nu_j + [B(X_j) - B(0)], \tag{B.9}
$$
where we let $\nu_j = \nu(X_j)$, and we stress the dependence of $\eta_{j|\backslash j}$ and $\xi_{j|\backslash j}$ on $X_{\backslash j}$. Fixing $X_k = 0$ for all $k \neq j$, (B.8) and (B.9) together with (B.7) imply
$$
X_j g_j(X_j) = \frac{1}{2}\Theta_{jj}[T^2(X_j) - T^2(0)] + \eta_{j|\backslash j}(0)[T(X_j) - T(0)] + \xi_{j|\backslash j}(0)\nu_j + [B(X_j) - B(0)], \tag{B.10}
$$
and fixing $X_k = 0$ for all $k \notin \{j, l\}$,
$$
X_j g_j(X_j) + X_j X_l g_{jl}(X_j, X_l) = \frac{1}{2}\Theta_{jj}[T^2(X_j) - T^2(0)] + \eta_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots)[T(X_j) - T(0)] + \xi_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots)\nu_j + [B(X_j) - B(0)]. \tag{B.11}
$$
Subtracting (B.10) from (B.11), we get
$$
X_j X_l g_{jl}(X_j, X_l) = [\eta_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots) - \eta_{j|\backslash j}(0)][T(X_j) - T(0)] + [\xi_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots) - \xi_{j|\backslash j}(0)]\nu_j. \tag{B.12}
$$

Using (8) and (9),
$$
\begin{aligned}
\eta_{j|\backslash j}(0) &= \eta_j + \sum_{l \neq j} \Theta_{jl} T(0), \\
\xi_{j|\backslash j}(0) &= \xi_j + \sum_{l \neq j} \Phi_{jl} T(0), \\
\eta_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots) - \eta_{j|\backslash j}(0) &= \Theta_{jl}[T(X_l) - T(0)] + \Phi_{lj} \nu_l, \\
\xi_{j|\backslash j}(\cdots, 0, X_l, 0, \cdots) - \xi_{j|\backslash j}(0) &= \Phi_{jl}[T(X_l) - T(0)] + \Lambda_{jl} \nu_l,
\end{aligned} \tag{B.13}
$$
where $\eta_j$, $\xi_j$, $\Theta_{jl}$, $\Phi_{lj}$, $\Phi_{jl}$ and $\Lambda_{jl}$ are constants. Using (B.13) in (B.12), and rearranging terms, the joint distribution

in (B.6) reduces to
$$
\begin{aligned}
Q(X \mid \omega) = {} & \sum_{j=1}^k \Big\{ \tfrac{1}{2}\Theta_{jj}[T^2(X_j) - T^2(0)] + \big[\eta_j + \textstyle\sum_{l \neq j} \Theta_{jl} T(0)\big][T(X_j) - T(0)] + \big[\xi_j + \textstyle\sum_{l \neq j} \Phi_{jl} T(0)\big]\nu_j + [B(X_j) - B(0)] \Big\} \\
& + \frac{1}{2} \sum_{j=1}^k \sum_{l \neq j} \Big\{ \Theta_{jl}[T(X_j) - T(0)][T(X_l) - T(0)] + \Phi_{lj}[T(X_j) - T(0)]\nu_l + \Phi_{jl}\nu_j[T(X_l) - T(0)] + \Lambda_{jl}\nu_j\nu_l \Big\} \\
= {} & \sum_{j=1}^k \Big\{ \eta_j[T(X_j) - T(0)] + \xi_j\nu_j + \tfrac{1}{2}\Theta_{jj}[T^2(X_j) - T^2(0)] + [B(X_j) - B(0)] \Big\} \\
& - \sum_{j=1}^k \sum_{l \neq j} \Theta_{jl}T(0)^2 + \sum_{j=1}^k \sum_{l \neq j} \Theta_{jl}T(X_j)T(0) + \sum_{j=1}^k \sum_{l \neq j} \Phi_{jl}\nu_jT(0) \\
& + \frac{1}{2}\sum_{j=1}^k \sum_{l \neq j} \big\{ \Theta_{jl}T(X_j)T(X_l) + \Lambda_{jl}\nu_j\nu_l \big\} + \frac{1}{2}\sum_{j=1}^k \sum_{l \neq j} \Theta_{jl}T(0)^2 - \frac{1}{2}\sum_{j=1}^k \sum_{l \neq j} \big\{ \Theta_{jl}T(X_j)T(0) + \Theta_{jl}T(0)T(X_l) \big\} \\
& + \frac{1}{2}\sum_{j=1}^k \sum_{l \neq j} \big\{ \Phi_{lj}\nu_lT(X_j) + \Phi_{jl}\nu_jT(X_l) \big\} - \frac{1}{2}\sum_{j=1}^k \sum_{l \neq j} \big\{ \Phi_{lj}\nu_lT(0) + \Phi_{jl}\nu_jT(0) \big\} \\
= {} & \eta^T[T(x) - T(0)] + \xi^T\nu(x) + \tfrac{1}{2}T(x)^T\Theta T(x) - \tfrac{1}{2}T(0)^T\Theta T(0) + \tfrac{1}{2}\nu(x)^T\Lambda\nu(x) + \nu(x)^T\Phi T(x) + \mathbf{1}^T[B(x) - B(0)].
\end{aligned}
$$
To obtain the last equality we used that $\Theta_{jl} = \Theta_{lj}$ and $\sum_{j=1}^k \sum_{l \neq j} \Phi_{lj}\nu_lT(0) = \sum_{j=1}^k \sum_{l \neq j} \Phi_{jl}\nu_jT(0)$, which cancels several terms. Elements of a parameter matrix or vector are given by
$$
[\eta]_j = \eta_j, \quad [\xi]_j = \xi_j, \quad [\Theta]_{jl} = \begin{cases} \Theta_{jj} & \text{if } j = l \\ \Theta_{jl} & \text{otherwise} \end{cases}, \quad [\Phi]_{jl} = \begin{cases} 0 & \text{if } j = l \\ \Phi_{jl} & \text{otherwise} \end{cases}, \quad [\Lambda]_{jl} = \begin{cases} 0 & \text{if } j = l \\ \Lambda_{jl} & \text{otherwise} \end{cases}.
$$
Thus from (B.5), since $\nu(0) = 0$,
$$
\log P(X \mid \omega) = \eta^TT(x) + \xi^T\nu(x) + \tfrac{1}{2}T(x)^T\Theta T(x) + \tfrac{1}{2}\nu(x)^T\Lambda\nu(x) + \nu(x)^T\Phi T(x) + \mathbf{1}^TB(x) - A(\eta, \xi, \Theta, \Lambda, \Phi),
$$
which agrees with (4), with log partition function
$$
A(\eta, \xi, \Theta, \Lambda, \Phi) = \log \sum_{x \in (\mathcal{X} \cup \{0\})^k} \exp\Big\{\eta^TT(x) + \xi^T\nu(x) + \tfrac{1}{2}T(x)^T\Theta T(x) + \tfrac{1}{2}\nu(x)^T\Lambda\nu(x) + \nu(x)^T\Phi T(x) + \mathbf{1}^TB(x)\Big\}.
$$
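For intuition, the log partition function above can be evaluated by brute-force enumeration when $k$ and the support are tiny; a sketch for Poisson-type sufficient statistics (all inputs illustrative, not from the paper):

```python
import numpy as np
from itertools import product
from scipy.special import gammaln, logsumexp

def log_partition(eta, xi, Theta, Lam, Phi, support):
    """Brute-force A(eta, xi, Theta, Lam, Phi) by enumerating
    (support ∪ {0})^k; only feasible for tiny k, shown for intuition.
    Poisson-type statistics: T(x) = x, nu(x) = 1{x > 0}, B(x) = -log x!."""
    k = len(eta)
    logs = []
    for x in product([0] + list(support), repeat=k):
        x = np.array(x, dtype=float)
        t, nu = x, (x > 0).astype(float)
        logs.append(eta @ t + xi @ nu + 0.5 * t @ Theta @ t
                    + 0.5 * nu @ Lam @ nu + nu @ Phi @ t
                    - gammaln(x + 1).sum())
    return logsumexp(np.array(logs))
```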

Proof of Proposition 1. Using that $A(\tau) = \log\{1 + \exp(\tau)\}$ and $A(\tau, \tau_0) = A(\tau) + \tau_0$,
$$
\begin{aligned}
P(\nu(X) = 0) &= P(X = 0) = \exp\{\tau_0 - A(\tau, \tau_0)\} = \exp\{-A(\tau)\} = (1 + \exp\{\tau\})^{-1}, \\
P(\nu(X) = 1) &= 1 - P(\nu(X) = 0) = (1 + \exp\{-\tau\})^{-1},
\end{aligned}
$$
where we used the definition of $A^+(\theta)$. Moreover,
$$
P(X = x \mid \nu(X) = \nu) = \frac{P(X = x)}{P(\nu(X) = \nu(x))} = \frac{\exp\{\nu(x)[\tau + \tau_0 - A^+(\theta)] + \theta^T t(x) + B(x) - A(\tau, \tau_0)\}}{\exp\{\tau\nu(x) - A(\tau)\}} = \exp\{\nu(x)[\tau_0 - A^+(\theta)] + \theta^T t(x) + B(x) - \tau_0\},
$$
so that
$$
P(X = 0 \mid \nu(X) = 0) = 1, \qquad P(X = x \mid \nu(X) = 1) = \exp\{\theta^T t(x) + B(x) - A^+(\theta)\}.
$$
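Proposition 1 thus describes a hurdle mechanism: a Bernoulli gate for $\nu(X)$ followed by a zero-truncated exponential-family draw. A minimal sampler for the Poisson case (a sketch; the rejection loop is one simple way to draw from the truncated distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_zip_hurdle(tau, theta, size=1):
    """Hurdle sampler implied by Proposition 1 (Poisson case):
    nu(X) ~ Bernoulli(1 / (1 + exp(-tau))); given nu = 1, X is a
    zero-truncated Poisson with rate lambda = exp(theta)."""
    lam = np.exp(theta)
    gate = rng.random(size) < 1.0 / (1.0 + np.exp(-tau))
    x = np.zeros(size, dtype=int)
    draws = rng.poisson(lam, size=gate.sum())
    while np.any(draws == 0):             # reject zeros: truncated support
        zero = draws == 0
        draws[zero] = rng.poisson(lam, size=zero.sum())
    x[gate] = draws
    return x

print(np.mean(sample_zip_hurdle(tau=-1.0, theta=0.5, size=10_000) == 0))
```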

Proof of Proposition 2. The joint distribution given by (4), (16) and (17) can be factorized as
$$
P(X = x \mid Y = y) = g(R(x), y)\, c(x)\, l(y), \tag{B.14}
$$
where
$$
l(y) = \exp\{-A(\xi_y, \eta_y, \Theta, \Phi, \Lambda)\},
$$
$$
c(x) = \exp\Big\{ \tfrac{1}{2}\nu(x)^T(\Lambda + 2\,\mathrm{diag}(\xi_0))\nu(x) + \nu(x)^T[\Phi + \mathrm{diag}(\eta_0)]T(x) + \tfrac{1}{2}T(x)^T\Theta T(x) + \mathbf{1}^TB(x) \Big\}, \tag{B.15}
$$
$$
g(R(x), y) = \exp\{(\xi(y) - \xi_0)^T\nu(x) + (\eta(y) - \eta_0)^TT(x)\} = \exp\left\{ f_y^T \begin{pmatrix} \Gamma \\ \Psi \end{pmatrix}^{\!T} \begin{pmatrix} T(x) \\ \nu(x) \end{pmatrix} \right\} = \exp\left\{ (\beta f_y)^T \kappa^T \begin{pmatrix} T(x) \\ \nu(x) \end{pmatrix} \right\} = \exp\big\{ (\beta f_y)^T R(X) \big\}, \tag{B.16}
$$
where we used the reduced rank representation $\begin{pmatrix} \Gamma \\ \Psi \end{pmatrix} = \kappa\beta$, with $\kappa \in \mathbb{R}^{2p \times d}$ and $\beta \in \mathbb{R}^{d \times r}$, and $A(\xi_y, \eta_y, \Theta, \Phi, \Lambda)$ is the log partition function, which depends on all the model parameters. Applying the Factorization Theorem in Tomassi et al. (2019) yields the result in Proposition 2.
the result in Proposition 2.
Proof of Proposition 3. The proof follows from the above proof of Proposition 2, by noting that
$$
g(R(x), y) = \exp\big\{ f_y^T \Gamma^T T(x) + f_y^T \Psi^T \nu(x) \big\} = \exp\left\{ \begin{pmatrix} \beta f_y \\ \tau f_y \end{pmatrix}^{\!T} \begin{pmatrix} \alpha^T T(x) \\ \zeta^T \nu(x) \end{pmatrix} \right\} = \exp\left\{ \begin{pmatrix} \beta f_y \\ \tau f_y \end{pmatrix}^{\!T} R_s(X) \right\},
$$
where $\Gamma = \alpha\beta$ and $\Psi = \zeta\tau$ are low rank representations, with ranks $d, d_0 \le \min\{k, r\}$, respectively. Applying Proposition S1.1 of Tomassi et al. (2019) yields the result stated in Proposition 3.

Appendix C. Computation of the block-diagonal Hessian and Jacobian matrices

Appendix C.1. Block-diagonal Hessian

For the computation of all Fisher information matrices, we assume a block diagonal structure defined by the groups
$$
b_j = (\eta_j; \xi_j; \Theta_{jj}), \qquad r_j = (\Gamma_j; \Psi_j), \qquad w_{jl} = (\Theta_{jl}; \Phi_{lj}; \Phi_{jl}; \Lambda_{jl}), \qquad j \neq l \in \{1, \ldots, k\}. \tag{C.1}
$$
Using (C.1), the weights in (27) reduce to the block structures $R_{jl} = \mathrm{diag}(\mathcal{I}_{r_j}, (\mathcal{I}_{w_{jl}})_{l \neq j})$ and $C_{jl} = \mathcal{I}_{w_{jl}}$, respectively. Under the block-diagonal structure in (C.1), $\mathcal{I}$ has a closed form representation that we derive next. The matrix components corresponding to the independent parameters $b_j = (\eta_j; \xi_j; \Theta_{jj})$ are
$$
\mathcal{I}_{b_j} = E\, \frac{\partial^2}{\partial b_j \partial b_j^T} \sum_{j'=1}^k \ell_{j'|\backslash j'} = E\, \frac{\partial^2 \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})^T} = E\, \frac{\partial^2 A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})^T}. \tag{C.2}
$$
For the regression parameters $r_j = (\Gamma_j; \Psi_j)$ we obtain
$$
\mathcal{I}_{r_j} = E\, \frac{\partial^2}{\partial r_j \partial r_j^T} \sum_{j'=1}^k \ell_{j'|\backslash j'} = E\left[ \left(\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial r_j^T}\right)^{\!T} \frac{\partial^2 \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})^T}\, \frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial r_j^T} \right], \tag{C.3}
$$
where
$$
\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial r_j^T} = I_2 \otimes f_y^T, \qquad \frac{\partial^2 \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})^T} = \frac{\partial^2 A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})^T}, \tag{C.4}
$$

and $I_2$ is the identity matrix of dimension 2. Lastly, for the interaction parameters $w_{jl} = (\Theta_{jl}; \Phi_{lj}; \Phi_{jl}; \Lambda_{jl})$, we compute
$$
\mathcal{I}_{w_{jl}} = E\, \frac{\partial^2}{\partial w_{jl} \partial w_{jl}^T} \sum_{j'=1}^k \ell_{j'|\backslash j'} \tag{C.5}
$$
$$
= E\left[ \left(\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial^2 \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})^T}\, \frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T} \right] + E\left[ \left(\frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial^2 \ell_{l|\backslash l}}{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})\, \partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})^T}\, \frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T} \right] \tag{C.6}
$$
$$
= E\left[ \left(\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial^2 A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})^T}\, \frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T} \right] + E\left[ \left(\frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial^2 A(\eta_{l|\backslash l}, \xi_{l|\backslash l}, \Theta_{ll})}{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})\, \partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})^T}\, \frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T} \right], \tag{C.7}
$$
where
$$
\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T} = I_2 \otimes \begin{pmatrix} T(X_l) \\ \nu(X_l) \end{pmatrix}^{\!T}, \qquad \frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T} = \begin{pmatrix} T(X_j) \\ \nu(X_j) \end{pmatrix}^{\!T} \otimes I_2. \tag{C.8}
$$

Computing the Hessian of $A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})$. From (7), and using $T(0) = 0$, we have
$$
A(\xi_{j|\backslash j}, \eta_{j|\backslash j}, \Theta_{jj}) = \log\left(1 + \exp\big\{\xi_{j|\backslash j} + A^+(\eta_{j|\backslash j}, \Theta_{jj})\big\}\right). \tag{C.9}
$$
Then the first derivatives are
$$
\frac{\partial A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})} = h\big(\xi_{j|\backslash j} + A^+(\eta_{j|\backslash j}, \Theta_{jj})\big) \begin{pmatrix} \partial A^+(\eta_{j|\backslash j}, \Theta_{jj})/\partial\eta_{j|\backslash j} \\ 1 \\ \partial A^+(\eta_{j|\backslash j}, \Theta_{jj})/\partial\Theta_{jj} \end{pmatrix}, \tag{C.10}
$$
where $h(x) = (1 + e^{-x})^{-1}$ is the sigmoid function. The second order derivatives are
$$
\begin{aligned}
\frac{\partial^2 A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})\, \partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})^T} = {} & h\big(\xi_{j|\backslash j} + A^+(\eta_{j|\backslash j}, \Theta_{jj})\big)\Big(1 - h\big(\xi_{j|\backslash j} + A^+(\eta_{j|\backslash j}, \Theta_{jj})\big)\Big) \begin{pmatrix} \partial A^+/\partial\eta_{j|\backslash j} \\ 1 \\ \partial A^+/\partial\Theta_{jj} \end{pmatrix} \begin{pmatrix} \partial A^+/\partial\eta_{j|\backslash j} \\ 1 \\ \partial A^+/\partial\Theta_{jj} \end{pmatrix}^{\!T} \\
& + h\big(\xi_{j|\backslash j} + A^+(\eta_{j|\backslash j}, \Theta_{jj})\big) \begin{pmatrix} \partial^2 A^+/\partial\eta_{j|\backslash j}^2 & 0 & \partial^2 A^+/\partial\eta_{j|\backslash j}\partial\Theta_{jj} \\ 0 & 0 & 0 \\ \partial^2 A^+/\partial\Theta_{jj}\partial\eta_{j|\backslash j} & 0 & \partial^2 A^+/\partial\Theta_{jj}^2 \end{pmatrix}, \tag{C.11--C.13}
\end{aligned}
$$
since $h'(x) = h(x)\{1 - h(x)\}$.
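These formulas translate directly into code; a sketch that assembles (C.10) and (C.11)-(C.13) from user-supplied derivatives of $A^+$ (function names are ours):

```python
import numpy as np

def grad_hess_A(xi, Aplus, dAplus, d2Aplus):
    """Gradient (C.10) and Hessian (C.11)-(C.13) of
    A(eta, xi, Theta_jj) = log(1 + exp(xi + A+)).
    dAplus: length-2 vector (dA+/deta, dA+/dTheta_jj);
    d2Aplus: 2x2 Hessian of A+ in (eta, Theta_jj)."""
    h = 1.0 / (1.0 + np.exp(-(xi + Aplus)))        # sigmoid h(x)
    v = np.array([dAplus[0], 1.0, dAplus[1]])      # (dA+/deta; 1; dA+/dTheta)
    grad = h * v
    # h'(x) = h(x)(1 - h(x)) gives the rank-one term; the second term
    # embeds the Hessian of A+, with a zero row/column for xi.
    H2 = np.zeros((3, 3))
    H2[np.ix_([0, 2], [0, 2])] = d2Aplus
    hess = h * (1.0 - h) * np.outer(v, v) + h * H2
    return grad, hess
```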

Appendix C.2. Calculation of the Jacobian

We compute the Jacobians $J_{b_j}$, $J_{r_j}$ and $J_{w_{jl}}$ of the blocks $\omega_{b_j}$, $\omega_{r_j}$ and $\omega_{w_{jl}}$ by
$$
\begin{aligned}
J_{b_j} &= E\, \frac{\partial \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})} = E\, \frac{\partial A(\eta_{j|\backslash j}, \xi_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j}; \Theta_{jj})} - E\big(T(X_j); \nu(X_j); T(X_j)^2/2\big), \tag{C.14} \\
J_{r_j} &= E\left[ \left(\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial r_j^T}\right)^{\!T} \frac{\partial \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})} \right], \tag{C.15} \\
J_{w_{jl}} &= E\left[ \left(\frac{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial \ell_{j|\backslash j}}{\partial(\eta_{j|\backslash j}; \xi_{j|\backslash j})} + \left(\frac{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})}{\partial w_{jl}^T}\right)^{\!T} \frac{\partial \ell_{l|\backslash l}}{\partial(\eta_{l|\backslash l}; \xi_{l|\backslash l})} \right]. \tag{C.16}
\end{aligned}
$$

Appendix C.3. Special Cases

In the last step we compute the derivatives of $A^+(\eta_{j|\backslash j}, \Theta_{jj})$ for the distributions we study in the simulations.
Normal-zipGM. From (13),
$$
A^+(\eta_{j|\backslash j}, \Theta_{jj}) = -\frac{\eta_{j|\backslash j}^2}{2\Theta_{jj}} - \frac{1}{2}\log(-\Theta_{jj}) + \frac{1}{2}\log(2\pi).
$$
Then
$$
\frac{\partial A^+(\eta_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \Theta_{jj})} = \begin{pmatrix} -\dfrac{\eta_{j|\backslash j}}{\Theta_{jj}} \\[2mm] \dfrac{\eta_{j|\backslash j}^2}{2\Theta_{jj}^2} - \dfrac{1}{2\Theta_{jj}} \end{pmatrix}, \tag{C.17}
$$
$$
\frac{\partial^2 A^+(\eta_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \Theta_{jj})\, \partial(\eta_{j|\backslash j}; \Theta_{jj})^T} = \begin{pmatrix} -\dfrac{1}{\Theta_{jj}} & \dfrac{\eta_{j|\backslash j}}{\Theta_{jj}^2} \\[2mm] \dfrac{\eta_{j|\backslash j}}{\Theta_{jj}^2} & -\dfrac{\eta_{j|\backslash j}^2}{\Theta_{jj}^3} + \dfrac{1}{2\Theta_{jj}^2} \end{pmatrix}. \tag{C.18}
$$
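These closed forms are easily sanity-checked numerically; for instance, a central finite-difference check of (C.17) (test point and tolerance arbitrary):

```python
import numpy as np

def Aplus_normal(eta, theta):
    # A+(eta, Theta_jj) from (13), with Theta_jj < 0
    return -eta**2 / (2*theta) - 0.5*np.log(-theta) + 0.5*np.log(2*np.pi)

def grad_Aplus_normal(eta, theta):
    # closed form (C.17)
    return np.array([-eta/theta, eta**2/(2*theta**2) - 1/(2*theta)])

eta, theta, eps = 0.7, -1.3, 1e-6
num = np.array([
    (Aplus_normal(eta+eps, theta) - Aplus_normal(eta-eps, theta)) / (2*eps),
    (Aplus_normal(eta, theta+eps) - Aplus_normal(eta, theta-eps)) / (2*eps),
])
assert np.allclose(num, grad_Aplus_normal(eta, theta), atol=1e-6)
```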

Poisson-zipGM. From (14), $A^+(\eta_{j|\backslash j}) = \log\big(\exp(\exp(\eta_{j|\backslash j})) - 1\big)$. Defining an auxiliary variable $\widetilde{X}_{j|\backslash j} \sim \mathrm{Poisson}(\exp(\eta_{j|\backslash j}))$ and calling $\beta_{j|\backslash j} = P(\widetilde{X}_{j|\backslash j} > 1)/P(\widetilde{X}_{j|\backslash j} = 1)$, we have
$$
\frac{\partial A^+(\eta_{j|\backslash j})}{\partial \eta_{j|\backslash j}} = \exp(\eta_{j|\backslash j})\, \frac{P(\widetilde{X}_{j|\backslash j} = 0) + P(\widetilde{X}_{j|\backslash j} > 0)}{P(\widetilde{X}_{j|\backslash j} > 0)} = \exp(\eta_{j|\backslash j})\left(1 + \frac{P(\widetilde{X}_{j|\backslash j} = 0)}{P(\widetilde{X}_{j|\backslash j} > 0)}\right) = \exp(\eta_{j|\backslash j}) + \frac{1}{1 + \beta_{j|\backslash j}},
$$
$$
\frac{\partial^2 A^+(\eta_{j|\backslash j})}{\partial \eta_{j|\backslash j}^2} = \exp(\eta_{j|\backslash j})\left\{1 - \frac{1}{1 + \beta_{j|\backslash j}}\left[1 - \frac{1}{\exp(\eta_{j|\backslash j})}\, \frac{\beta_{j|\backslash j}}{1 + \beta_{j|\backslash j}}\right]\right\}.
$$
When $\eta_{j|\backslash j} \to \infty$, then $\beta_{j|\backslash j} \to \infty$ and $\partial A^+(\eta_{j|\backslash j})/\partial\eta_{j|\backslash j} \to \exp(\eta_{j|\backslash j})$; when $\eta_{j|\backslash j} \to -\infty$, then $\beta_{j|\backslash j} \to 0$ and $\partial A^+(\eta_{j|\backslash j})/\partial\eta_{j|\backslash j} \to \exp(\eta_{j|\backslash j}) + 1$. Moreover,
$$
\frac{\partial A^+(\eta_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \Theta_{jj})} = \begin{pmatrix} \dfrac{\exp(\eta_{j|\backslash j})}{1 - \exp(-\exp(\eta_{j|\backslash j}))} \\[2mm] 0 \end{pmatrix} \approx \begin{pmatrix} \exp(\eta_{j|\backslash j}) \\ 0 \end{pmatrix}, \tag{C.19}
$$
$$
\frac{\partial^2 A^+(\eta_{j|\backslash j}, \Theta_{jj})}{\partial(\eta_{j|\backslash j}; \Theta_{jj})\, \partial(\eta_{j|\backslash j}; \Theta_{jj})^T} = \begin{pmatrix} \dfrac{\exp\big(\eta_{j|\backslash j} + \exp(\eta_{j|\backslash j})\big)\big(\exp(\exp(\eta_{j|\backslash j})) - \exp(\eta_{j|\backslash j}) - 1\big)}{\big(\exp(\exp(\eta_{j|\backslash j})) - 1\big)^2} & 0 \\ 0 & 0 \end{pmatrix} \approx \begin{pmatrix} \exp(\eta_{j|\backslash j}) & 0 \\ 0 & 0 \end{pmatrix}, \tag{C.20}
$$
and the approximation is accurate for $\exp(\eta_{j|\backslash j}) \gg 1$.
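In floating point, $\exp(\exp(\eta_{j|\backslash j}))$ overflows for moderately large $\eta_{j|\backslash j}$; the first entry of (C.19) can be evaluated stably with expm1, switching to the stated approximation when $\exp(\eta_{j|\backslash j})$ is large (a sketch, ours):

```python
import numpy as np

def dAplus_deta_poisson(eta):
    """dA+/deta = exp(eta) / (1 - exp(-exp(eta))) as in (C.19),
    computed without forming exp(exp(eta))."""
    lam = np.exp(eta)
    if lam > 30.0:                 # 1 - exp(-lam) == 1 in double precision
        return lam                 # the approximation in (C.19)
    return lam / -np.expm1(-lam)   # expm1 keeps accuracy for small lam

for eta in (-8.0, 0.0, 4.0):
    print(eta, dAplus_deta_poisson(eta))
# for eta -> -inf the value approaches 1, consistent with the limit
# exp(eta) + 1/(1 + beta) -> 1 noted above
```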

TPoisson-zipGM. From (15), $A^+(\eta_{j|\backslash j}) = \log\big(\exp(\exp(\eta_{j|\backslash j})) - 1 - \sum_{i=T^*+1}^{\infty} \exp(i\eta_{j|\backslash j})/i!\big)$. Then
$$
\frac{\partial A^+(\eta_{j|\backslash j})}{\partial \eta_{j|\backslash j}} = \frac{\exp(\eta_{j|\backslash j})\exp(\exp(\eta_{j|\backslash j})) - \sum_{i=T^*+1}^{\infty} \exp(i\eta_{j|\backslash j})/(i-1)!}{\exp(\exp(\eta_{j|\backslash j})) - 1 - \sum_{i=T^*+1}^{\infty} \exp(i\eta_{j|\backslash j})/i!} = \lambda_{j|\backslash j} + \frac{1}{1 + \beta_{j|\backslash j}} - \left[\frac{1}{\lambda_{j|\backslash j}} + \frac{T^*(1 + \alpha_{j|\backslash j})}{\lambda_{j|\backslash j}^2}\right]^{-1},
$$
with
$$
\beta_{j|\backslash j} = \frac{P(2 \le \widetilde{X}_{j|\backslash j} \le T^*)}{P(\widetilde{X}_{j|\backslash j} = 1)}, \qquad \alpha_{j|\backslash j} = \frac{P(1 \le \widetilde{X}_{j|\backslash j} \le T^* - 2)}{P(\widetilde{X}_{j|\backslash j} = T^* - 1)},
$$
and where we defined the auxiliary variable $\widetilde{X}_{j|\backslash j} \sim \mathrm{Poisson}(\lambda_{j|\backslash j})$ with rate $\lambda_{j|\backslash j} = \exp(\eta_{j|\backslash j})$. Moreover,
$$
\frac{1}{\lambda_{j|\backslash j}}\, \frac{\partial^2 A^+(\eta_{j|\backslash j})}{\partial \eta_{j|\backslash j}^2} = 1 - \big(1 + \beta_{j|\backslash j}\big)^{-2} \frac{d\beta_{j|\backslash j}}{d\lambda_{j|\backslash j}} + \left[\frac{1}{\lambda_{j|\backslash j}} + \frac{T^*(1 + \alpha_{j|\backslash j})}{\lambda_{j|\backslash j}^2}\right]^{-2} \left( -\frac{1}{\lambda_{j|\backslash j}^2} - \frac{2T^*(1 + \alpha_{j|\backslash j})}{\lambda_{j|\backslash j}^3} + \frac{T^*}{\lambda_{j|\backslash j}^2}\, \frac{d\alpha_{j|\backslash j}}{d\lambda_{j|\backslash j}} \right),
$$
with
$$
\frac{d\beta_{j|\backslash j}}{d\lambda_{j|\backslash j}} = \frac{P(1 \le \widetilde{X}_{j|\backslash j} \le T^* - 1)}{P(\widetilde{X}_{j|\backslash j} = 1)} - \frac{P(2 \le \widetilde{X}_{j|\backslash j} \le T^*)}{P^2(\widetilde{X}_{j|\backslash j} = 1)}\, P(\widetilde{X}_{j|\backslash j} = 0) = 1 - \frac{\lambda_{j|\backslash j}^{T^*-1}}{T^*!} + \left(1 - \frac{1}{\lambda_{j|\backslash j}}\right)\beta_{j|\backslash j},
$$
$$
\frac{d\alpha_{j|\backslash j}}{d\lambda_{j|\backslash j}} = \frac{P(0 \le \widetilde{X}_{j|\backslash j} \le T^* - 3)}{P(\widetilde{X}_{j|\backslash j} = T^* - 1)} - \frac{P(1 \le \widetilde{X}_{j|\backslash j} \le T^* - 2)}{P^2(\widetilde{X}_{j|\backslash j} = T^* - 1)}\, P(\widetilde{X}_{j|\backslash j} = T^* - 2) = \frac{(T^* - 1)!}{\lambda_{j|\backslash j}^{T^*-1}} - \frac{T^* - 1}{\lambda_{j|\backslash j}} + \left(1 - \frac{T^* - 1}{\lambda_{j|\backslash j}}\right)\alpha_{j|\backslash j}.
$$
When $\eta_{j|\backslash j} \to -\infty$, then $\beta_{j|\backslash j} \to 0$, $\beta'_{j|\backslash j} \to 1/2$, and $\alpha_{j|\backslash j} \to \infty$, $\alpha'_{j|\backslash j} \to -\infty$. When $\eta_{j|\backslash j} \to \infty$, then $\beta_{j|\backslash j} \to \infty$, $\beta'_{j|\backslash j} \to \infty$, and $\alpha_{j|\backslash j} \to 0$, $\alpha'_{j|\backslash j} \to 0$.
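The quantities $\beta_{j|\backslash j}$ and $\alpha_{j|\backslash j}$ are ratios of Poisson probabilities and are directly computable with standard routines; the following sketch checks the closed-form first derivative above against direct evaluation of the probability ratio ($T^*$ and $\lambda$ are arbitrary test values):

```python
import numpy as np
from scipy.stats import poisson

lam, T = 2.0, 5                      # rate lambda = exp(eta), truncation T*
p = poisson(lam)

beta = (p.cdf(T) - p.cdf(1)) / p.pmf(1)          # P(2<=X<=T*)/P(X=1)
alpha = (p.cdf(T - 2) - p.pmf(0)) / p.pmf(T - 1) # P(1<=X<=T*-2)/P(X=T*-1)

# closed form given above:
closed = lam + 1/(1 + beta) - 1/(1/lam + T*(1 + alpha)/lam**2)

# direct form: dA+/deta = lam * P(0<=X<=T*-1) / P(1<=X<=T*)
direct = lam * p.cdf(T - 1) / (p.cdf(T) - p.pmf(0))
assert np.isclose(closed, direct)
```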

Proposition 4. Let $\widehat{Q}_{X,Y}$ denote the empirical distribution of $Q_{X,Y}$, defined by a training data set $(X^s, Y^s)$, $s = 1, \ldots, n$. The divergence $\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y})$ in (20), computed as the expectation over $Q_Y$ of $\widehat{Q}_{X|Y}$ from the parametric distribution $P(X|Y, \omega)$ as in (21), is equivalent, up to an additive constant, to the pseudo-likelihood cost function $-\frac{1}{n}\sum_{s=1}^n \sum_{j=1}^k \log P(X_j^s \mid X_{\backslash j}^s, Y^s, \omega)$.

Proof. First consider the empirical joint density of $Q_{X,Y}$ and its conditionals:
$$
\widehat{Q}(X, Y) = \frac{1}{n} \sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s), \tag{C.21}
$$
$$
\widehat{Q}(Y) = \frac{1}{n} \sum_{s=1}^n \delta(Y = Y^s), \tag{C.22}
$$
$$
\widehat{Q}(X \mid Y) = \begin{cases} \dfrac{\sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)}{\sum_{s=1}^n \delta(Y = Y^s)} & \text{if } Y \in \{Y^s\}_{s=1}^n \\ C_1 & \text{otherwise} \end{cases}, \tag{C.23}
$$
$$
\widehat{Q}(X_j \mid X_{\backslash j}, Y) = \begin{cases} \dfrac{\sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)}{\sum_{s=1}^n \delta(X_{\backslash j} = X^s_{\backslash j})\delta(Y = Y^s)} & \text{if } (X_{\backslash j}, Y) \in \{(X^s_{\backslash j}, Y^s)\}_{s=1}^n \\ C_2 & \text{otherwise} \end{cases}, \tag{C.24}
$$
with $X \in \mathcal{X}^k$ and $Y \in \mathcal{Y}$, and where $C_1, C_2$ are constants. Then, replacing $Q_{X,Y}$ by $\widehat{Q}_{X,Y}$ in (21) yields

$$
\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y}(\omega)) = \sum_{Y \in \mathcal{Y}} \widehat{Q}(Y) \sum_{j=1}^k \sum_{X \in \mathcal{X}^k} \widehat{Q}(X \mid Y) \Big[ \log \widehat{Q}(X_j \mid X_{\backslash j}, Y) - \log P(X_j \mid X_{\backslash j}, Y, \omega) \Big].
$$
Using (C.22), (C.23) and (C.24), $\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y}(\omega))$ equals
$$
\sum_{Y \in \mathcal{Y}} \frac{1}{n} \sum_{s=1}^n \delta(Y = Y^s) \sum_{j=1}^k \sum_{X \in \mathcal{X}^k} \frac{\sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)}{\sum_{s=1}^n \delta(Y = Y^s)} \left[ \log \frac{\sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)}{\sum_{s=1}^n \delta(X_{\backslash j} = X^s_{\backslash j})\delta(Y = Y^s)} - \log P(X_j \mid X_{\backslash j}, Y, \omega) \right],
$$
and after rearranging the terms,
$$
\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y}(\omega)) = \frac{1}{n} \sum_{j=1}^k \sum_{Y \in \mathcal{Y}} \sum_{X \in \mathcal{X}^k} \sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s) \left[ \log \frac{\sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)}{\sum_{s=1}^n \delta(X_{\backslash j} = X^s_{\backslash j})\delta(Y = Y^s)} - \log P(X_j \mid X_{\backslash j}, Y, \omega) \right].
$$
Defining the cardinalities of the events $(X, Y)$ and $(X_{\backslash j}, Y)$ in the sample by $|(X, Y)|_S = \sum_{s=1}^n \delta(X = X^s)\delta(Y = Y^s)$ and $|(X_{\backslash j}, Y)|_S = \sum_{s=1}^n \delta(X_{\backslash j} = X^s_{\backslash j})\delta(Y = Y^s)$, respectively, we get
$$
\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y}(\omega)) = \frac{1}{n} \sum_{j=1}^k \sum_{Y \in \mathcal{Y}} \sum_{X \in \mathcal{X}^k} |(X, Y)|_S \left[ \log \frac{|(X, Y)|_S}{|(X_{\backslash j}, Y)|_S} - \log P(X_j \mid X_{\backslash j}, Y, \omega) \right].
$$
Finally,
$$
\begin{aligned}
\mathrm{CKL}(\widehat{Q}_{X,Y} \| P_{X,Y}(\omega)) &= \frac{1}{n} \sum_{j=1}^k \sum_{Y \in \mathcal{Y}} \sum_{X \in \mathcal{X}^k} |(X, Y)|_S \log \frac{|(X, Y)|_S}{|(X_{\backslash j}, Y)|_S} - \frac{1}{n} \sum_{j=1}^k \sum_{s=1}^n \log P(X_j^s \mid X_{\backslash j}^s, Y^s, \omega) \\
&= C - \frac{1}{n} \sum_{s=1}^n \sum_{j=1}^k \log P(X_j^s \mid X_{\backslash j}^s, Y^s, \omega) = C - \widehat{E} \sum_{j=1}^k \ell_{j|\backslash j}(\omega),
\end{aligned}
$$
where $C$ is a constant independent of $\omega$.
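Computationally, Proposition 4 says that minimizing the conditional KL divergence over $\omega$ amounts to minimizing the familiar negative pseudo-log-likelihood; a sketch of that cost, with a user-supplied (hypothetical) node-conditional log-pmf interface:

```python
def neg_pseudo_loglik(log_cond_pmf, X, Y, omega):
    """-(1/n) sum_s sum_j log P(X_j^s | X_{-j}^s, Y^s, omega),
    the cost function of Proposition 4. log_cond_pmf(j, x, y, omega)
    must return log P(X_j = x[j] | X_{-j}, Y = y, omega); this
    interface is illustrative, not the paper's implementation."""
    n, k = X.shape
    total = sum(log_cond_pmf(j, X[s], Y[s], omega)
                for s in range(n) for j in range(k))
    return -total / n
```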

Appendix D. Details on the optimization

To optimize (27) we implemented a second-order proximal algorithm from Lee et al. (2012), which at each iteration moves in the direction that minimizes a quadratic approximation of the loss function:
$$
\mathrm{prox}_t^{\mathcal{I}}\big(\omega^k - t\mathcal{I}^{-1}\nabla\ell(\omega^k)\big) = \arg\min_{\omega}\; g_\ell(\mathcal{I}^{1/2}\omega) + g_R\big(\mathcal{I}(\hat{\omega}_0)^{1/2}\omega\big) + g_C\big(\mathcal{I}(\hat{\omega}_0)^{1/2}\omega\big) + \imath_\Omega(\omega), \tag{D.1}
$$
where $\mathcal{I} = \mathcal{I}(\omega^k)$ is the block-diagonal information matrix, and $\mathcal{I}(\hat{\omega}_0)$ is the Fisher information matrix under independence, which is used in the norms of the weighted penalization. Moreover,
$$
\begin{aligned}
g_\ell(\bar{\omega}) &= \frac{1}{2t} \big\| \bar{\omega} - \mathcal{I}^{1/2}\big(\omega^k - t\mathcal{I}^{-1}\nabla\ell(\omega^k)\big) \big\|_2^2, \tag{D.2} \\
g_R(\bar{\omega}) &= \lambda_R \sum_{j=1}^k \big\| \bar{\omega}_{r_j, (w_{jl})_{l \neq j}} \big\|_2, \tag{D.3} \\
g_C(\bar{\omega}) &= \lambda_C \sum_{j=1}^k \sum_{l \neq j} \big\| \bar{\omega}_{w_{jl}} \big\|_2, \tag{D.4} \\
\imath_\Omega(\omega) &= \begin{cases} 0 & \text{if } \omega \in \Omega \\ \infty & \text{otherwise} \end{cases}, \tag{D.5}
\end{aligned}
$$
where $\ell = \widehat{E} \sum_j \ell_{j|\backslash j}$ is the pseudo-likelihood function.
We solve problem (D.1) with the simultaneous-direction method of multipliers (SDMM) algorithm (Combettes and Pesquet, 2011). To apply this algorithm, we use the proximal operators of the functions $g_\ell$, $g_R$, $g_C$ and $\imath_\Omega$, given below in closed form (Parikh et al., 2014).
$$
\begin{aligned}
\underset{\gamma g_\ell}{\mathrm{prox}}(\bar{\omega}_0) &:= \arg\min_{\bar{\omega}}\; g_\ell(\bar{\omega}) + \frac{1}{2\gamma}\|\bar{\omega} - \bar{\omega}_0\|_2^2 \\
&= \arg\min_{\bar{\omega}}\; \frac{1}{2t}\big\|\bar{\omega} - \big(\mathcal{I}^{1/2}\omega^k - t\mathcal{I}^{-1/2}\nabla\ell(\omega^k)\big)\big\|_2^2 + \frac{1}{2\gamma}\|\bar{\omega} - \bar{\omega}_0\|_2^2 \\
&= \left(\frac{1}{t} + \frac{1}{\gamma}\right)^{-1}\left(\frac{1}{t}\mathcal{I}^{1/2}\omega^k + \frac{1}{\gamma}\bar{\omega}_0 - \mathcal{I}^{-1/2}\nabla\ell(\omega^k)\right), \tag{D.6} \\
\underset{\gamma g_{R_j}}{\mathrm{prox}}(\bar{\omega}_0) &:= \arg\min_{\bar{\omega}_{r_j,(w_{jl})_{l\neq j}}}\; g_R(\bar{\omega}) + \frac{1}{2\gamma}\big\|\bar{\omega}_{r_j,(w_{jl})_{l\neq j}} - \bar{\omega}_{0,r_j,(w_{jl})_{l\neq j}}\big\|_2^2 = \left(1 - \frac{\lambda_R\gamma}{\|\bar{\omega}_{0,r_j,(w_{jl})_{l\neq j}}\|_2}\right)_{\!+} \bar{\omega}_{0,r_j,(w_{jl})_{l\neq j}}, \tag{D.7} \\
\underset{\gamma g_{C_{jl}}}{\mathrm{prox}}(\bar{\omega}_0) &:= \arg\min_{\bar{\omega}_{w_{jl}}}\; g_C(\bar{\omega}) + \frac{1}{2\gamma}\big\|\bar{\omega}_{w_{jl}} - \bar{\omega}_{0,w_{jl}}\big\|_2^2 = \left(1 - \frac{\lambda_C\gamma}{\|\bar{\omega}_{0,w_{jl}}\|_2}\right)_{\!+} \bar{\omega}_{0,w_{jl}}, \tag{D.8} \\
\underset{\imath_\Omega}{\mathrm{prox}}(\bar{\omega}_0) &:= \arg\min_{\bar{\omega}}\; \imath_\Omega(\bar{\omega}) + \frac{1}{2\gamma}\|\bar{\omega} - \bar{\omega}_0\|_2^2 = \arg\min_{\bar{\omega} \in \Omega} \|\bar{\omega} - \bar{\omega}_0\|_2. \tag{D.9}
\end{aligned}
$$

Note that (D.6)-(D.9) are formulated in terms of $\bar{\omega} = \mathcal{I}(\hat{\omega}_0)^{1/2}\omega$, which corresponds to a preconditioned first-order method over $\omega$, with preconditioning matrix $\mathcal{I}(\hat{\omega}_0)^{1/2}$ (Yang et al., 2016). This formulation reduces the condition number of the sub-problem, resulting in better convergence and precision. Moreover, (D.7) and (D.8) are the block soft-thresholding operator evaluated group-wise, where $(x)_+$ equals $x$ if $x \ge 0$ and zero otherwise. Finally, (D.9) is the orthogonal projection onto the parameter space $\Omega$.
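The maps (D.7) and (D.8) reduce to the same block soft-thresholding operation applied to each group; a compact sketch (ours):

```python
import numpy as np

def block_soft_threshold(w, thresh):
    """prox of thresh * ||.||_2 on one block, as in (D.7)-(D.8):
    scales the block toward zero and kills it entirely when its
    norm falls below thresh; (x)_+ = max(x, 0)."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return w
    return max(1.0 - thresh / norm, 0.0) * w

w = np.array([0.3, -0.4])                  # one interaction block w_jl
print(block_soft_threshold(w, 0.2))        # shrunk by factor 1 - 0.2/0.5
print(block_soft_threshold(w, 0.8))        # zeroed: norm 0.5 < 0.8
```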

Appendix E. Supplementary Material

The Supplementary Material is available online at https://www.sciencedirect.com/journal/computational-statistics-and-data-analysis. The data and code for this paper are available at https://github.com/ekoplin/SDR-zipGM.

References
Adragni, K.P., Cook, R.D., 2009. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A:
Mathematical, Physical and Engineering Sciences 367, 4385–4405. doi:10.1098/rsta.2009.0110.
Aitchison, J., 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd., London, UK. doi:10.1111/j.2517-6161.1982.
tb01195.x.
Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological) 36, 192–236. doi:10.1111/j.2517-6161.1974.tb00999.x.
Besag, J., 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician) 24, 179–195. doi:10.
2307/2987782.
Bura, E., Duarte, S., Forzani, L., 2016. Sufficient reductions in regressions with exponential family inverse predictors. Journal of the American
Statistical Association 111, 1313–1329. doi:10.1080/01621459.2015.1093944.
Calle, M.L., Pujolassos, M., Susin, A., 2023. coda4microbiome: compositional data analysis for microbiome cross-sectional and longitudinal
studies. BMC bioinformatics 24, 82. doi:10.1186/s12859-023-05205-3.
Combettes, P.L., Pesquet, J.C., 2011. Proximal splitting methods in signal processing, in: Fixed-point algorithms for inverse problems in science
and engineering. Springer, pp. 185–212. doi:10.1007/978-1-4419-9569-8_10.
Cook, R.D., 2007. Fisher lecture: Dimension reduction in regression. Statistical Science 22, 1–26. doi:10.1214/088342306000000682.
Cook, R.D., Forzani, L., 2008. Principal fitted components for dimension reduction in regression. Statistical Science 23, 485–501. doi:10.1214/
08-STS275.
Giles, M.B., 2016. Algorithm 955: approximation of the inverse Poisson cumulative distribution function. ACM transactions on mathematical
software (TOMS) 42, 1–22. doi:10.1145/2699466.
Haslett, J., Parnell, A., Sweeney, J., 2018. A general framework for modelling zero inflation. arXiv preprint arXiv:1805.00555 .
Inouye, D.I., Yang, E., Allen, G.I., Ravikumar, P., 2017. A review of multivariate distributions for count data derived from the Poisson distribution.
Wiley Interdisciplinary Reviews: Computational Statistics 9, e1398. doi:10.1002/wics.1398.
Kim, Y., Unno, T., Kim, B., Park, M., 2019. Sex differences in gut microbiota. The World Journal of Men's Health 38, 48–60. doi:10.1007/978-981-19-0120-1_22.
Kurtz, Z.D., Müller, C.L., Miraldi, E.R., Littman, D.R., Blaser, M.J., Bonneau, R.A., 2015. Sparse and compositionally robust inference of
microbial ecological networks. PLoS Computational Biology 11, e1004226. doi:10.1371/journal.pcbi.1004226.

Lauritzen, S.L., 1996. Graphical Models. volume 17. Clarendon Press.
Le Cao, K.A., Boitard, S., Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for
multiclass problems. BMC Bioinformatics 12, 253. doi:10.1186/1471-2105-12-253.
Lee, J.D., Hastie, T.J., 2015. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics 24, 230–253.
doi:10.1080/10618600.2014.900500.
Lee, J.D., Sun, Y., Saunders, M., 2012. Proximal Newton-type methods for convex optimization, in: Advances in Neural Information Processing
Systems, pp. 827–835.
Lee, J.D., Sun, Y., Taylor, J.E., 2015. On model selection consistency of regularized M-estimators. Electronic Journal of Statistics 9, 608 – 642.
doi:10.1214/15-EJS1013.
Li, K.C., 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86, 316–327. doi:10.1080/
01621459.1991.10475035.
Martens, J., 2020. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research 21, 5776–5851.
McDavid, A., Gottardo, R., Simon, N., Drton, M., 2019. Graphical models for zero-inflated single cell gene expression. Annals of Applied Statistics 13, 848–873. doi:10.1214/18-AOAS1213.
Parikh, N., Boyd, S., et al., 2014. Proximal algorithms. Foundations and Trends® in Optimization 1, 127–239. doi:10.1561/2400000003.
Rivera-Pinto, J., Egozcue, J.J., Pawlowsky-Glahn, V., Paredes, R., Noguera-Julian, M., Calle, M.L., 2018. Balances: a new perspective for
microbiome analysis. MSystems 3, e00053–18. doi:10.1128/mSystems.00053-18.
Rohart, F., Gautier, B., Singh, A., Lê Cao, K.A., 2017. mixOmics: An R package for 'omics feature selection and multiple data integration. PLoS
Computational Biology 13, e1005752. doi:10.1371/journal.pcbi.1005752.
Rong, R., Jiang, S., Xu, L., Xiao, G., Xie, Y., Liu, D.J., Li, Q., Zhan, X., 2021. MB-GAN: microbiome simulation via generative adversarial network. GigaScience 10, giab005. doi:10.1093/gigascience/giab005.
Tomassi, D., Forzani, L., Duarte, S., Pfeiffer, R.M., 2019. Sufficient dimension reduction for compositional data. Biostatistics 22, 687–705.
doi:10.1093/biostatistics/kxz060.
Varin, C., Reid, N., Firth, D., 2011. An overview of composite likelihood methods. Statistica Sinica 21, 5–42.
Wainwright, M.J., Jordan, M.I., et al., 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in
Machine Learning 1, 1–305. doi:10.1561/2200000001.
Yang, E., Ravikumar, P.K., Allen, G.I., Liu, Z., 2013. On Poisson graphical models, in: Advances in Neural Information Processing Systems, pp.
1718–1726.
Yang, T., Jin, R., Zhu, S., Lin, Q., 2016. On data preconditioning for regularized loss minimization. Machine Learning 103, 57–79. doi:10.1007/
s10994-015-5536-6.
Yoon, G., Gaynanova, I., Müller, C.L., 2019. Microbial networks in SPRING: semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data. Frontiers in Genetics 10, 516. doi:10.3389/fgene.2019.00516.
Zhao, P., Rocha, G., Yu, B., 2009. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37,
3468–3497. doi:10.1214/07-AOS584.

[Figure E.1 appears here: panel rows AUC, FPR(VS), FNR(VS), FPR(CI) and FNR(CI) versus sample size n = 200, 500, 1000, in three columns (∆ξ = −2, ∆ξ = 0, ∆ξ = 2), for the models Ising-pGM, Normal-zipGM and Normal-pGM.]

Figure E.1: Performance measures for prediction and variable selection for the Ising model, Normal-zipGM and Normal-pGM when data were generated from a Normal-zipGM for sample sizes n = 200, 500, 1000. The proportions of zeros are 87–90% for ∆ξ = −2, 45–65% for ∆ξ = 0, and 10–28% for ∆ξ = 2.

[Figure E.2 appears here: same panel layout as Figure E.1, for the models Ising-pGM, Poisson-zipGM and Poisson-pGM.]

Figure E.2: Performance measures for prediction and variable selection for the Ising model, Poisson-zipGM and Poisson-pGM when data were generated from a Poisson-zipGM for sample sizes n = 200, 500, 1000. The proportions of zeros are 87–90% for ∆ξ = −2, 45–65% for ∆ξ = 0, and 10–28% for ∆ξ = 2.

[Figure E.3 appears here: panel rows AUC, FPR(VS), FNR(VS), FPR(CI) and FNR(CI) versus sample size n, in three columns labeled indep-indep-indep(0.0000), blockdiag-random-random(0.0010) and blockdiag-random-random(0.0100), for the models Normal-zipGM, Poisson-zipGM and TPoisson-zipGM.]

Figure E.3: Performance measures for prediction and variable selection when data were generated from Normal-zipGMs with increasing complexity of the interaction matrices for sample sizes n = 200, 500, 1000.

[Figure E.4 appears here: same panel layout as Figure E.3.]

Figure E.4: Performance measures for prediction and variable selection when data were generated from Poisson-zipGMs with increasing complexity of the interaction matrices for sample sizes n = 200, 500, 1000.

[Figure E.5 appears here: panel rows MSE, FPR(VS) and FNR(VS) versus sample size n, in two columns (Y|ν(X) and Y|X), for the models Normal-zipGM, Poisson-zipGM, TPoisson-zipGM and mixOmics.]

Figure E.5: Performance measures for prediction and variable selection for select zipGMs and mixOmics when data were generated from the forward models Y|ν(X) (column 1) or Y|X (column 2) for continuous Y for sample sizes n = 200, 500, 1000.

[Figure E.6 appears here, with two sub-panels: (a) AUC predictive performance and selected variables; (b) interaction networks for the different models among the subset of variables specified in (a) — in gray, the interaction network without variable selection; in blue, the variable selection and the interaction network selected by each model when hierarchical penalization is applied.]

Figure E.6: AGP dataset association analysis for the outcome "sex" using 5-fold cross-validation. Panel (a) shows the AUC values and the selected variables, and panel (b) shows the connectivity network.

Model used for variable selection   n selected   LR     qda    svm-poly   RF
coda4microbiome                     7            0.39   0.42   0.39       0.00
mixOmics                            10           0.37   0.38   0.37       0.00
selbal                              2            0.43   0.45   0.44       0.13
Ising-pGM                           2            0.45   0.47   0.44       0.14
Normal-pGM                          10           0.40   0.45   0.38       0.00
Normal-zipGM                        7            0.40   0.41   0.40       0.00
Poisson-pGM                         5            0.45   0.46   0.46       0.33
Poisson-zipGM                       13           0.39   0.39   0.34       0.00
TPoisson-pGM                        4            0.45   0.45   0.46       0.35
TPoisson-zipGM                      13           0.39   0.39   0.34       0.00

Table E.1: Error rates obtained from using the n variables selected in all five folds of the cross-validation in standard classifiers; LR: logistic regression, package glm; qda: quadratic discriminant analysis, R package MASS; RF: random forest, package randomForest; svm-poly: support vector classifier with polynomial kernel, package kernlab.
