bioRxiv preprint doi: https://doi.org/10.1101/2023.03.23.533996; this version posted March 25, 2023. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

MIDAS: a fast and simple simulator for realistic microbiome data

Mengyu He

Department of Biostatistics and Bioinformatics,

Emory University, Atlanta, GA 30329, USA

Glen Satten

Department of Gynecology and Obstetrics

Department of Biostatistics and Bioinformatics,

Emory University, Atlanta, GA 30329, USA

Ni Zhao∗

Department of Biostatistics,

Johns Hopkins University, Baltimore, MD 21205, USA

Abstract

Motivation: Advances in sequencing technology have led to the discovery of associations between the human microbiota and many diseases, conditions and traits. With the increasing availability of microbiome data, many statistical methods have been developed for studying these associations. The growing number of newly developed methods highlights the need for simple, rapid and reliable methods to simulate realistic microbiome data, which is essential for validating and evaluating the performance of these methods. However, generating realistic microbiome data is challenging due to the complex nature of microbiome data, which feature correlation between taxa, sparsity, overdispersion, and compositionality. Current methods for simulating microbiome data are deficient in their ability to capture these important features of microbiome data, or can require exorbitant computational time.

Results: We develop MIDAS (MIcrobiome DAta Simulator), a fast and simple approach for simulating realistic microbiome data that reproduces the distributional and correlation structure of a template microbiome dataset. We demonstrate improved performance of MIDAS relative to other existing methods using gut and vaginal data. MIDAS has three major advantages. First, MIDAS performs better in reproducing the distributional features of real data compared to other methods at both the presence-absence level and the relative-abundance level. MIDAS-simulated data are more similar to the template data than data from competing methods, as quantified using a variety of measures. Second, MIDAS makes no distributional assumption for the relative abundances, and thus can easily accommodate complex distributional features in real data. Third, MIDAS is computationally efficient and can be used to simulate large microbiome datasets.

Availability and implementation: The R package MIDAS is available on GitHub at https://github.com/mengyu-he/MIDAS

Contact: Ni Zhao, Department of Biostatistics, Johns Hopkins University

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

The human microbiota and its associated microbiome play a fundamental role in many diseases and conditions, including obesity (Sze and Schloss, 2016), inflammatory bowel disease (IBD) (Simren et al., 2013), preterm birth (Fettweis et al., 2019), autism (Gilbert et al., 2018) and cancers (Dejea et al., 2014; Kostic et al., 2012). Advances in sequencing technologies, especially 16S rRNA sequencing, now allow rapid and simultaneous measurement of the relative abundance of all taxa in a community. This has led to a growing number of epidemiological and clinical studies that measure the association between the microbiome and traits of interest, sometimes with complex study designs and research questions.

Although microbiome data are increasingly available, statistical analysis remains challenging. Microbiome data have special characteristics that are difficult to model analytically, including sparsity (the majority of taxa are not present in a sample), overdispersion (the variance of read counts is larger than what is assumed from the usual parametric models), and compositionality (the read counts in a sample sum to a constant). There is little consensus among researchers on how microbiome data should be analyzed, and new methods are being regularly developed, both for identifying individual taxa that associate with diseases (Paulson et al., 2013; Mandal et al., 2015; Lin and Peddada, 2020; Martin et al., 2020; Hu and Satten, 2020; Hu et al., 2021, 2022), and for understanding the community-level characteristics that relate to clinical conditions (Zhao et al., 2015; Wu et al., 2016; Jiang et al., 2022).

Simulating realistic microbiome data is essential for the development of novel methods. To establish the validity of a new method and prove it outperforms existing ones, researchers rely on simulated data in which the true microbiome/trait associations are known. Ideally, the simulated data should be similar to real microbiome data for the simulation studies to be trustworthy. However, simulating realistic microbiome data is made difficult by the same challenges as analyzing microbiome data: sparsity, overdispersion and compositionality. Further, the distributions of counts for each taxon are highly skewed and correlated in a complex way. For these reasons, most simulation methods are based on a template microbiome dataset, and generate simulated data that are ‘similar’ to the template data in some way.

Several approaches have been proposed for simulating microbiome data. Among them, some methods impose strong parametric assumptions so that the simulated microbiome data share a similar dispersion with real data. For example, the Dirichlet-Multinomial (D-M) distribution, in which the taxa counts are generated from a multinomial distribution with proportion parameters provided by a Dirichlet prior (Chen and Li, 2013), is frequently used in simulating microbiome data. The hyper-parameters of this D-M model are often estimated from real data so that the simulated data share similar dispersion. Another method, MetaSPARSim (Patuzzi et al., 2019), uses a gamma-multivariate hypergeometric (gamma-MHG) model, in which the gamma distribution models the biological variability of taxa counts, accounting for overdispersion, and the MHG distribution models technical variability originating from the sequencing process. Although the D-M model and the MetaSPARSim model address the compositional feature through either the multinomial or the hypergeometric distribution, they do not attempt to match the correlation structures in the simulated data with those found in the real data.

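To make the D-M construction concrete, the following is a minimal Python sketch of drawing counts from a Dirichlet-multinomial model. The hyper-parameter values, library sizes and the function name `simulate_dirichlet_multinomial` are assumptions chosen for illustration; they are not estimates from any dataset discussed here, and the sketch is not the implementation used by any of the cited packages.

```python
import numpy as np

def simulate_dirichlet_multinomial(alpha, library_sizes, rng):
    """Draw one count vector per sample from a Dirichlet-multinomial model."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.empty((len(library_sizes), alpha.size), dtype=np.int64)
    for i, n_reads in enumerate(library_sizes):
        p = rng.dirichlet(alpha)                  # sample-specific taxon proportions
        counts[i] = rng.multinomial(n_reads, p)   # reads allocated to taxa given p
    return counts

rng = np.random.default_rng(0)
alpha = [0.5, 0.2, 0.2, 0.1]      # hypothetical Dirichlet hyper-parameters
library_sizes = [1000, 2000, 1500]
dm_counts = simulate_dirichlet_multinomial(alpha, library_sizes, rng)
```

Each row sums to its library size, which is how the model respects compositionality. The only dependence between taxa comes from that sum constraint, which illustrates why such a model cannot be tuned to reproduce the between-taxa correlations of a template.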

One recently-developed approach that does attempt to model between-taxa correlations is SparseDOSSA (Sparse Data Observations for the Simulation of Synthetic Abundances) (Ma et al., 2021). This hierarchical model makes assumptions about both the marginal and joint distributions of the relative abundances of a set of taxa. For the marginal distribution, SparseDOSSA assumes a zero-inflated log-normal model for the relative abundance of each taxon and then imposes the compositional constraint. Parameters in the zero-inflated log-normal marginal are estimated through a penalized Expectation-Maximization (EM) algorithm from a template dataset. Unfortunately, the penalized EM algorithm for estimating hyper-parameters is computationally expensive, especially when a large number of taxa exist in the data. For example, fitting the SparseDOSSA model to a modest-sized dataset with 79 samples and 109 taxa takes more than a day (≈ 27.8 hours) on a single Intel “Cascade Lake” core (Ma et al., 2021). To partially compensate for this drawback, SparseDOSSA provides fitted models that were previously trained by the developers and that users can use directly, which is only useful if the developer-provided fits resemble the data users wish to generate. Moreover, SparseDOSSA removes rare taxa that appear in fewer than 4 samples by default, thus failing to accommodate the possibility that rare taxa are of interest in the simulation studies.

Considering the drawbacks of existing approaches, a method that can flexibly capture the distributional and correlation structure of microbiome data would greatly benefit the research community. In this paper, we develop a fast and simple MIcrobiome DAta Simulator (MIDAS) for generating realistic microbiome data that capture the correlation structure of taxa of a template microbiome dataset in both the presence-absence data and the relative abundances. MIDAS generates relative abundance data using a two-step approach. The first step generates the presence-absence of each taxon by simulating correlated binary data from a probit model with a shrinkage correlation structure, while the second step generates relative abundance and count data from a Gaussian copula model. MIDAS also allows the user to change the library sizes, taxon relative abundances or the proportion of non-zero cells to generate data in which these features may depend on covariates such as case/control status.

MIDAS uses the correlation structure of the template data both in generating correlated binary data for the first step, and in the Gaussian copula model for relative abundances. A rank-based approach is used to fit the copula model to the template data to handle zero count data. MIDAS makes no distributional assumption for the relative abundances, and thus can accommodate complex distributional features in real data. Using simulations, we show that MIDAS reproduces the distributional features at both the presence-absence level and the relative-abundance level, and in so doing generates data that are more similar in multiple metrics to the template data than competing methods. MIDAS is also computationally efficient and can be used to simulate large microbiome datasets in a fast and simple fashion.

2 Materials and methods

Suppose that we are interested in simulating microbiome data that are similar to a real dataset with $n$ samples and $J$ taxa; we refer to this dataset as the ‘template’ data. Assume that in this dataset all taxa are present in at least one sample, and $J_1$ taxa ($J_1 \le J$) are present in all samples. For sample $i$ and taxon $j$, let $C_{ij}$ denote the observed count, $N_i = \sum_{j=1}^{J} C_{ij}$ denote the observed library size, $\pi_{ij}$ denote the observed relative abundance ($\pi_{ij} = C_{ij}/N_i$), and let the presence-absence indicator be $Z_{ij} = I(C_{ij} > 0)$, where $I(S) = 1$ if $S$ is true and 0 otherwise. We let $C$, $Z$ and $\pi$ represent the $n \times J$ matrices of the read counts, presence-absence indicators and relative abundances of all taxa in the template data, respectively. Corresponding quantities for the simulated data are denoted by a tilde, e.g. $\tilde{Z}$ is the presence-absence indicator in the simulated data.
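
As a concrete illustration of this notation, the following minimal Python sketch derives the library sizes, relative abundances and presence-absence indicators from a small count matrix. The toy matrix and the variable names are assumptions made for illustration only and do not come from the MIDAS package.

```python
import numpy as np

# Hypothetical toy template: n = 3 samples (rows), J = 4 taxa (columns).
C = np.array([[10, 0, 5, 5],
              [0, 2, 0, 18],
              [7, 7, 0, 6]])

N = C.sum(axis=1)            # library sizes N_i
pi = C / N[:, None]          # relative abundances pi_ij = C_ij / N_i
Z = (C > 0).astype(int)      # presence-absence indicators Z_ij = I(C_ij > 0)

Z_row = Z.sum(axis=1)        # Z_i. : number of taxa present in sample i
Z_col = Z.sum(axis=0)        # Z_.j : number of samples in which taxon j is present
```

The row and column totals of $Z$ computed at the end are the marginal quantities that the first simulation step is designed to reproduce.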

We develop a two-step procedure for generating count and relative abundance data that share the characteristics of a template dataset. The first step is to generate the binary presence-absence indicators so that they have a correlation structure similar to that of the real presence-absence data $Z$. This step allocates zeros and non-zeros for the simulated data. The second step is to fill the non-zero cells from step 1 using a Gaussian copula model fitted to the observed values $\pi$. We next describe each step in detail; summaries of each step are found in Algorithm 1.

2.1 Step 1: generate presence-absence data

The goal of step 1 is to generate presence-absence data $\tilde{Z}_{ij}$ having correlation and marginal means that match the target data. To facilitate the generation of these data, we propose to generate multivariate normal data $D_{ij}$ having mean $\mu_j + \eta_i$ and variance-covariance matrix $\Sigma$ in such a way that $Z_{ij} = 1$ corresponds to $D_{ij} \ge 0$. To accomplish this, we choose $\mu_j$ and $\eta_i$ to jointly solve

$$\sum_{j=1}^{J} \Phi(\mu_j + \eta_i) = Z_{i\cdot} \tag{1}$$

and

$$\sum_{i=1}^{n} \Phi(\mu_j + \eta_i) = Z_{\cdot j}, \tag{2}$$

where $Z_{i\cdot}$ is the number of non-zero cells in the data from the $i$th observation, $Z_{\cdot j}$ is the number of non-zero cells for the $j$th taxon, and $\Phi(\cdot)$ and $\Phi^{-1}(\cdot)$ are the CDF and quantile function of the standard normal distribution, respectively. These equations are iterated alternately, starting from the initial values $\eta_i = 0$ and $\mu_j = \Phi^{-1}(Z_{\cdot j}/n)$, the value that solves (2) when every $\eta_i = 0$.

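The alternating iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MIDAS implementation: each scalar equation is solved by bisection (a choice made here for simplicity), the function names `solve_shift` and `fit_margins` are hypothetical, and the toy matrix is chosen so that no taxon is present in all or none of the samples and no sample contains all or none of the taxa.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()                   # standard normal distribution
phi_cdf = np.vectorize(STD_NORMAL.cdf)      # elementwise standard normal CDF

def solve_shift(offsets, target, lo=-10.0, hi=10.0, iters=60):
    """Bisection for the scalar x with sum(Phi(x + offsets)) == target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid + offsets).sum() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_margins(Z, sweeps=100):
    """Alternate between equation (1) for eta and equation (2) for mu."""
    n, J = Z.shape
    row_tot, col_tot = Z.sum(axis=1), Z.sum(axis=0)
    mu = np.array([STD_NORMAL.inv_cdf(c / n) for c in col_tot])  # start with eta = 0
    eta = np.zeros(n)
    for _ in range(sweeps):
        eta = np.array([solve_shift(mu, r) for r in row_tot])    # one sample at a time
        mu = np.array([solve_shift(eta, c) for c in col_tot])    # one taxon at a time
    return mu, eta

# Hypothetical 5 x 4 presence-absence template.
Z = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 1]])
mu, eta = fit_margins(Z)
fitted = phi_cdf(mu[None, :] + eta[:, None])   # fitted presence probabilities
```

After the sweeps, the row and column sums of `fitted` agree with the observed totals $Z_{i\cdot}$ and $Z_{\cdot j}$. The solution is only determined up to adding a constant to every $\mu_j$ and subtracting it from every $\eta_i$, which leaves all the sums $\mu_j + \eta_i$ unchanged.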

To estimate $\Sigma$ we first calculate the empirical correlation matrix for the observed values of $Z$, denoted $\zeta$. We next convert $\zeta$ into a matrix of tetrachoric correlations, denoted by $\rho$, using the approach of Bonnet and Price (2005). The correlation matrix is smoothed to be positive definite using the function cor.smooth() in the R package psych (Revelle, 2015). We then sample values $\tilde{D}_{ij} \sim \mathrm{MVN}(\mu_j + \eta_i, \rho)$ and take $\tilde{Z}_{ij} = I(\tilde{D}_{ij} > 0)$.

Because the number of zero cells in a sample is related to its library size, this procedure is designed so that the library sizes in the simulated data are the same as those in the template data. See Section 2.3 for discussion of the case where simulated data with library sizes different from those in the template data are required.
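
The thresholding step can be illustrated with a short Python sketch. It assumes that $\mu$, $\eta$ and a positive definite $\rho$ are already available; the numerical values below and the function name `simulate_presence_absence` are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def simulate_presence_absence(mu, eta, rho, rng):
    """Threshold correlated normal draws at zero: Z_ij = I(D_ij > 0)."""
    n, J = len(eta), len(mu)
    L = np.linalg.cholesky(rho)                  # requires rho to be positive definite
    noise = rng.standard_normal((n, J)) @ L.T    # each row is a draw from MVN(0, rho)
    D = noise + mu[None, :] + eta[:, None]       # D_ij has mean mu_j + eta_i
    return (D > 0).astype(int)

rng = np.random.default_rng(1)
rho = np.array([[1.0, 0.6, 0.0],
                [0.6, 1.0, 0.0],
                [0.0, 0.0, 1.0]])    # hypothetical tetrachoric correlations
mu = np.array([0.3, -0.2, -1.0])     # hypothetical taxon effects
eta = np.zeros(2000)                 # hypothetical sample effects, all zero here
Z_sim = simulate_presence_absence(mu, eta, rho, rng)
```

With all sample effects set to zero, the prevalence of taxon $j$ in `Z_sim` is close to $\Phi(\mu_j)$, and the first two taxa co-occur more often than independence would predict because their latent normals are positively correlated.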

2.2 Step 2: generate relative abundance and count data

The goal of Step 2 is to generate relative abundance data that mimic the relative abundance data of the non-zero cells in the target data. To accomplish this, we use a Gaussian copula model, which allows us to specify a marginal distribution for each taxon that matches the observed distribution of non-zero relative abundances for that taxon.

In order to allow for the possible generation of non-zero relative abundances for taxa that are observed to have zero counts, we must include the zero cells when we specify the correlation structure of the Gaussian copula. To accomplish this, we use a rank-based approach based on the relationship between the Pearson and Spearman correlations for normally-distributed data (Ruppert and Mattesson, 2015). This approach does not require us to know the values we would have obtained for an empty cell, had that cell not been empty; our only assumption is that the relative abundances of the zero cells are smaller than those of the cells having non-zero counts. In particular, to specify the correlation of the underlying Gaussian model, we calculate Spearman’s rank correlation $\phi$ for the observed relative abundance values. When calculating the rank correlation, we consider the zero cells to be tied, and then break these (and any other) ties by a random ordering. For the $k$th of $K$ such random orderings, after computing Spearman’s rank correlation $\phi^{(k)}$, we obtain the corresponding Pearson correlation $r^{(k)}$ using $r^{(k)}_{jj'} = 2\sin(\pi \phi^{(k)}_{jj'}/6)$, where $\pi$ in this formula is the mathematical constant rather than the relative abundance matrix. The correlation matrix $r^* = \sum_{k=1}^{K} r^{(k)}/K$ is corrected to be positive definite by setting negative eigenvalues to a small positive value and then renormalizing to preserve the trace of the smoothed correlation matrix. The default choice for MIDAS is $K = 100$. We then take the corrected correlation matrix as the final correlation matrix for the underlying Gaussian model.

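The rank-based estimate of the copula correlation can be sketched as follows. This is a minimal Python illustration, assuming that ties are broken by a uniform random key and that every eigenvalue below a small threshold `eps` is raised to `eps`; the function names and the value of `eps` are assumptions, not settings taken from MIDAS.

```python
import numpy as np

def random_tiebreak_ranks(x, rng):
    """Ranks 1..n of x, with tied values placed in a random order."""
    order = np.lexsort((rng.random(x.size), x))   # sort by x, then by a random key
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    return ranks

def copula_correlation(pi, K=100, eps=1e-6, rng=None):
    """Average Pearson-scale correlation over K random tie-breakings."""
    rng = np.random.default_rng() if rng is None else rng
    n, J = pi.shape
    r_sum = np.zeros((J, J))
    for _ in range(K):
        ranks = np.column_stack(
            [random_tiebreak_ranks(pi[:, j], rng) for j in range(J)])
        spearman = np.corrcoef(ranks, rowvar=False)      # Pearson correlation of ranks
        r_sum += 2.0 * np.sin(np.pi * spearman / 6.0)    # Spearman to Pearson
    r_star = r_sum / K
    # Raise eigenvalues below eps to eps, then rescale to keep the trace unchanged.
    vals, vecs = np.linalg.eigh(r_star)
    vals = np.clip(vals, eps, None)
    vals *= np.trace(r_star) / vals.sum()
    return (vecs * vals) @ vecs.T

rng = np.random.default_rng(2)
pi_toy = rng.random((30, 4))
pi_toy[pi_toy < 0.4] = 0.0           # hypothetical sparse relative abundances
r_corrected = copula_correlation(pi_toy, K=10, rng=rng)
```

Because a Spearman correlation of 1 maps to $2\sin(\pi/6) = 1$, the diagonal of $r^*$ equals one before the eigenvalue correction, so preserving the trace keeps the average diagonal entry at one.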

To simulate a new dataset with $n$ observations, we first generate $n$ independent multivariate normal variables $W \sim \mathrm{MVN}(0, r^*)$, i.e. with mean zero and variance-covariance matrix $r^*$. We then choose simulated relative abundances for the $j$th taxon, $\tilde{\pi}_{\cdot j}$, from the non-zero relative abundances in the template data in the following way. If $\tilde{Z}_{ij} = 0$ we choose $\tilde{\pi}_{ij} = 0$. For the remaining $\tilde{m}_j = \sum_{i=1}^{n} \tilde{Z}_{ij}$ values, if $\tilde{m}_j \le m_j$, where $m_j = \sum_{i=1}^{n} Z_{ij}$, we sample $\tilde{m}_j$ values from the non-zero values of $\pi_{\cdot j}$ without replacement; if $\tilde{m}_j > m_j$ then, in addition to the $m_j$ non-zero values of $\pi_{\cdot j}$, we sample an additional set of $\tilde{m}_j - m_j$ values from the non-zero values of $\pi_{\cdot j}$ with replacement. We then assign the non-zero values of $\tilde{\pi}_{ij}$ the values of $\pi_{ij}$ that we have sampled, in such a way that their rank agrees with the rank of the $w_{\cdot j}$ values corresponding to $\tilde{Z}_{ij} = 1$. The values of $\tilde{\pi}$ are then normalized to sum to one for each observation.

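The sampling and rank-matching rule for a single taxon can be sketched in Python as below. The function name `fill_taxon` and the toy inputs are hypothetical, and the sketch assumes the taxon has at least one non-zero value in the template, as required of the template data above.

```python
import numpy as np

def fill_taxon(w_j, z_sim_j, template_j, rng):
    """Fill the non-zero cells of one simulated taxon by rank-matching to w_j."""
    nonzero = template_j[template_j > 0]
    present = np.flatnonzero(z_sim_j)            # cells with simulated presence
    m_sim, m_obs = present.size, nonzero.size
    if m_sim <= m_obs:
        values = rng.choice(nonzero, size=m_sim, replace=False)
    else:
        extra = rng.choice(nonzero, size=m_sim - m_obs, replace=True)
        values = np.concatenate([nonzero, extra])
    out = np.zeros(z_sim_j.size)
    # The cell with the k-th smallest w receives the k-th smallest sampled value.
    out[present[np.argsort(w_j[present])]] = np.sort(values)
    return out

rng = np.random.default_rng(3)
w_j = np.array([0.5, -1.0, 2.0, 0.1, -0.3])            # one column of W
z_sim_j = np.array([1, 0, 1, 1, 0])                    # simulated presence-absence
template_j = np.array([0.0, 0.2, 0.05, 0.0, 0.4, 0.1]) # template relative abundances
pi_sim_j = fill_taxon(w_j, z_sim_j, template_j, rng)
```

Applying this to every taxon and then dividing each row by its sum gives the normalized matrix $\tilde{\pi}$. The marginal distribution of each taxon is taken directly from the template values, while the dependence between taxa enters only through the ranks of $W$.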

A count table $\tilde{C}$ is then calculated by multiplying the sampled relative abundances $\tilde{\pi}_{ij}$ by the library size $N_i$ for each observation. Any values so obtained that are between 0 and 1 are rounded up to 1 to keep the presence-absence structure; other values are rounded to the nearest integer. The library sizes for the simulated data are then calculated as $\tilde{N}_i = \sum_{j=1}^{J} \tilde{C}_{ij}$ and the final relative abundance is updated through $\tilde{\pi}_{ij} = \tilde{C}_{ij}/\tilde{N}_i$.
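
The rounding rule can be made concrete with a small Python sketch. The function name `to_counts` and the toy values are assumptions for illustration; note also that `np.rint` rounds exact halves to the nearest even integer, a detail the text does not specify.

```python
import numpy as np

def to_counts(pi_sim, library_sizes):
    """Scale relative abundances to counts, keeping every non-zero cell non-zero."""
    raw = pi_sim * np.asarray(library_sizes)[:, None]
    counts = np.rint(raw)                    # round to the nearest integer
    counts[(raw > 0) & (raw < 1)] = 1        # values between 0 and 1 become 1
    counts = counts.astype(np.int64)
    new_lib = counts.sum(axis=1)             # simulated library sizes
    return counts, new_lib, counts / new_lib[:, None]

pi_sim = np.array([[0.5, 0.0004, 0.4996],
                   [0.0, 0.25, 0.75]])
counts, new_lib, pi_final = to_counts(pi_sim, [1000, 200])
```

In the first row the middle taxon would receive 0.4 reads and is rounded up to 1, so the simulated library size becomes 1001 rather than 1000. This is why the library sizes and relative abundances are recomputed from the count table at the end.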

Algorithm 1 Steps to simulate one dataset similar to a template dataset

Input: Template data: presence-absence matrix $Z$ ($n \times J$), relative abundance matrix $\pi$ ($n \times J$), $K$ (number of Monte-Carlo replicates for $r^*$)
Output: $\tilde{Z}$, $\tilde{C}$, $\tilde{N}_i = \sum_{j=1}^{J} \tilde{C}_{ij}$, $\tilde{\pi}_{ij} = \tilde{C}_{ij}/\tilde{N}_i$, and $\tilde{m}_j = \sum_{i=1}^{n} \tilde{Z}_{ij}$

Initialize necessary quantities for presence-absence data
    a. Filter $Z$ to remove taxa that are present in all samples.
    b. Calculate the correlation matrix $\zeta$ using the filtered $Z$.
    c. Convert $\zeta$ to the tetrachoric correlation matrix (with shrinkage to remove negative eigenvalues) to obtain $\rho$.
    d. Estimate $\mu$ and $\eta$.

Initialize necessary quantities for relative abundance data
    a. For $k = 1, \cdots, K$:
        Calculate $\phi^{(k)}$ = Spearman’s rank correlation of $\pi$ by randomly breaking ties.
        Convert $\phi^{(k)}$ to Pearson correlation $r^{(k)}$.
    b. Calculate $r^* = \sum_{k=1}^{K} r^{(k)}/K$.

Step 1: Generate presence-absence data $\tilde{Z}$.
    a. Set $\tilde{Z} = 1$ for taxa present in every observation.
    b. For remaining taxa, generate multivariate normal data $D_{ij} \sim \mathrm{MVN}(\mu_j + \eta_i, \rho)$.
    c. Set $\tilde{Z}_{ij} = I(D_{ij} > 0)$ for taxa not present in all observations.

Step 2: Generate relative-abundance and count data.
    a. Generate $W \sim \mathrm{MVN}(0, r^*)$.
    b. For $j = 1, \cdots, J$:
        Set $\tilde{\pi}_{ij} = 0$ for the $n - \tilde{m}_j$ observations in taxon $j$ having $\tilde{Z}_{ij} = 0$.
        Sample $\tilde{m}_j$ elements from the non-zero elements of $\pi_{\cdot j}$.
        Assign sampled values to non-zero $\tilde{\pi}_{ij}$ in the order given by $W_{ij}$.
    c. Normalize the matrix with assigned relative abundances.
    d. Estimate counts by multiplying the matrix of assigned relative abundances by $N_i$ for each observation; round non-zero counts less than 1 up to 1; round all other counts to the nearest integer to yield count matrix $\tilde{C}$.
    e. Calculate library size $\tilde{N}_i = \sum_{j=1}^{J} \tilde{C}_{ij}$.
    f. Calculate final relative abundance matrix $\tilde{\pi}_{i\cdot} = \tilde{C}_{i\cdot}/\tilde{N}_i$.

2.3 Changing the parameters of the simulation

In the previous sections, we have shown how MIDAS will generate a dataset that replicates a template microbiome dataset. In some situations, however, we may want to change the parameters used in MIDAS to generate simulated data that are different from the template data in a controlled way. For example, we may want to generate data with different library sizes. Or, we may want to change the relative abundances or proportion of zero cells for a set of taxa, perhaps in a way that is related to external covariates. In this section, we discuss how this can be accomplished in the MIDAS framework.

The parameters that are available for adjustment in MIDAS are: the library sizes $N_i$, the taxon mean relative abundances $p_j$, the taxon proportion of non-zero cells $\delta_j = Z_{\cdot j}/n$ and the taxon mean relative abundance among non-zero cells $p_j^{(1)}$. Note that these last three are related through

$$p_j = \delta_j \times p_j^{(1)}. \tag{3}$$

We denote the new (desired) parameters as $\hat{N}_i$, $\hat{p}_j$, $\hat{\delta}_j$ and $\hat{p}_j^{(1)}$. Aside from the two correlation matrices (which do not change), the quantities that govern the MIDAS simulation are $Z_{i\cdot}$, $Z_{\cdot j}$ and the values of $\pi_{ij}$ for non-zero cells. Thus, we must describe how these quantities are obtained when a change in parameters is desired. MIDAS supports two approaches to changing parameters. In the first approach, the user specifies changes to at most one taxon-level quantity and optionally the library sizes, and MIDAS changes the other parameters to be consistent with the patterns seen in the target data. In the second approach, the user specifies all the quantities. The first should generate more realistic data, but the second generates data in which only a single quantity has been changed, which is useful when evaluating the performance of microbiome analysis methods. We first describe the implementation of the first strategy, which is used for the rest of this work; the second strategy is described briefly at the end of this section.


For realistic data when changing library sizes, it is important to account for the positive relationship between the library size and the number of taxa with non-zero counts for that sample (see e.g. Supplementary Figure S1). To accomplish this, MIDAS fits the Shape Constrained Additive Model (SCAM) (Pya and Wood, 2015)

$$\log_{10}[E(Z_{i\cdot})] = f(\log_{10}(N_i)) \tag{4}$$

to the template data using the R package scam, where $f$ is a monotone smoothing spline. Then, if we wish to generate data having library sizes $\hat{N}_i$, we can generate $\hat{Z}_{i\cdot}$ using (4). We would then replace the right-hand side of (1) by $\min(J, \hat{Z}_{i\cdot})$ and replace the right-hand side of (2) by $\hat{Z}_{\cdot j} = (Z_{\cdot j}/Z_{\cdot\cdot})\hat{Z}_{\cdot\cdot}$, where $\hat{Z}_{\cdot\cdot} = \sum_i \min(J, \hat{Z}_{i\cdot})$, to find the appropriate values of $\mu$ and $\eta$ to simulate presence-absence data $\tilde{Z}$.
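
The rescaling of the marginal targets can be illustrated with a short Python sketch. It does not fit the SCAM model: the predicted per-sample totals are passed in as a hypothetical input, standing in for predictions from a fitted monotone curve, and the function name `rescaled_targets` is an assumption.

```python
import numpy as np

def rescaled_targets(Z, predicted_row_totals):
    """Row and column targets for equations (1) and (2) after a library-size change."""
    n, J = Z.shape
    row_target = np.minimum(J, np.asarray(predicted_row_totals, dtype=float))
    col_share = Z.sum(axis=0) / Z.sum()          # Z_.j / Z_..
    col_target = col_share * row_target.sum()    # same grand total as the row targets
    return row_target, col_target

Z = np.array([[1, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 0]])
# Hypothetical predictions; the second value exceeds J = 4 and is capped.
row_target, col_target = rescaled_targets(Z, [2.5, 5.0, 1.0])
```

Distributing the new grand total across taxa in proportion to their template prevalence keeps the row and column targets consistent with each other, which equations (1) and (2) require in order to have a joint solution.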

Many simulations require groups of samples with different relative abundances. When changing the mean relative abundance of the taxa, we can expect the proportions of zero cells to change as well. To account for this, MIDAS fits the SCAM model

$$\mathrm{logit}[E(\delta_j)] = g(\log_{10}(p_j)) \tag{5}$$

to the template data. Then realistic values of $\hat{\delta}_j$ and hence $\hat{Z}_{\cdot j}$ can be generated for the specified $\hat{p}_j$ values. Alternatively, a simulation may specify a new value of $\hat{\delta}_j$. In this case MIDAS determines the resulting mean relative abundances using

$$\mathrm{logit}[E(p_j)] = h(\log_{10}(\delta_j)). \tag{6}$$

In either case, the constraint (3) then determines $\hat{p}_j^{(1)}$. MIDAS then replaces the right-hand side of (2) with $\hat{Z}_{\cdot j}$ and changes the right-hand side of (1) to $(Z_{i\cdot}/Z_{\cdot\cdot})\hat{Z}_{\cdot\cdot}$, where $\hat{Z}_{\cdot\cdot} = \sum_j \hat{Z}_{\cdot j}$, to find the appropriate values of $\mu$ and $\eta$.

Finally, if we wish to change both library sizes and taxon frequencies, we fit the SCAM model

$$\mathrm{logit}[E(Z_{ij})] = f(\log_{10}(N_i)) + g(\log_{10}(p_j)) \tag{7}$$

then use the marginal totals $\hat{Z}_{i\cdot}$ and $\hat{Z}_{\cdot j}$ predicted from (7) in (1) and (2). Simultaneous changes in both library sizes and the proportion of non-zero cells are not currently supported by MIDAS.

To allow the user more control over the simulation when parameter changes are made, MIDAS supports a second simulation strategy in which the user specifies $N_i$ and (at least) two of $\hat{p}_j$, $\hat{\delta}_j$ and $\hat{p}_j^{(1)}$. Then, the SCAM models described above are not fit; instead, $\hat{Z}_{\cdot j} = n\hat{\delta}_j$, after which MIDAS assigns $\hat{Z}_{i\cdot} = (Z_{i\cdot}/Z_{\cdot\cdot})\hat{Z}_{\cdot\cdot}$, where $\hat{Z}_{\cdot\cdot} = \sum_j \hat{Z}_{\cdot j}$.

Changes to $Z_{i\cdot}$ and $Z_{\cdot j}$ affect only generation of the binary data $\tilde{Z}$. To ensure generation of relative abundance data that agree with $\hat{p}_j^{(1)}$ for either strategy, MIDAS replaces $\pi_{ij}$ by $\pi_{ij}^{\alpha_j}$, where $\alpha_j$ is chosen so that the mean of the transformed values among non-zero cells is $\hat{p}_j^{(1)}$. Presence-absence data $\tilde{Z}$ and relative abundance data $\tilde{\pi}$ are then generated as described in Sections 2.1 and 2.2, and are scaled by $\hat{N}_i$ to generate counts $\tilde{C}$ as described in Section 2.2.
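
Choosing the exponent $\alpha_j$ is a one-dimensional root-finding problem, sketched here in Python. Bisection is an assumption made for this illustration (the text does not say how MIDAS solves for $\alpha_j$), the function name `solve_alpha` and the toy values are hypothetical, and the sketch assumes all non-zero relative abundances lie strictly between 0 and 1.

```python
import numpy as np

def solve_alpha(nonzero_values, target_mean, lo=1e-3, hi=50.0, iters=80):
    """Bisection for alpha with mean(nonzero_values ** alpha) == target_mean."""
    x = np.asarray(nonzero_values, dtype=float)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # For values in (0, 1), the mean of x ** alpha decreases as alpha grows.
        if np.mean(x ** mid) > target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.array([0.01, 0.05, 0.2, 0.1])   # hypothetical non-zero relative abundances
target = 2 * x.mean()                  # desired mean among non-zero cells
alpha = solve_alpha(x, target)
transformed = x ** alpha
```

An exponent below one raises the mean and an exponent above one lowers it, while the ordering of the values within the taxon is preserved, so the rank-based assignment of Section 2.2 is unaffected by the transformation.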

Finally, we may choose to use only a subset of the template data. For example, if we wish to compare ‘control’ and ‘case’ populations, we could use only disease-free observations from the template data, so that the proposed changes in taxon prevalences represent modifications to the disease-free group to create ‘case’ data.

237 2.4 Simulation studies and comparisons with existing methods

238 We compared MIDAS to three competing methods (the D-M method, MetaSPARSim and
239 SparseDOSSA) and evaluate how well the simulated data reporduce the characteristics of the tem-
240 plate data. We use two datasets from the Integrative Human Microbiome Project (HMP2) (Proctor
241 et al., 2019) as the template data: a vaginal microbiome dataset from Multi-Omic Microbiome
242 Study: Pregnancy Initiative (MOMS-PI) project, and a gut microbiome dataset from the Inflam-
243 matory Bowel Disease Multi-omics Database (IBDMDB) project (Lloyd-Price et al., 2019). These
244 two datasets represent microbiome from two body sites that are frequently studied in the litera-
245 ture. They are also distinct in their characteristics, and thus provide a comprehensive assessment
246 of the proposed method. For example, the vaginal data, containing 95.25% zeros, is more sparse
247 than the gut data, which comprises 85.09% zeros. Moreover, the coefficient of variation (CV) of

12
bioRxiv preprint doi: https://doi.org/10.1101/2023.03.23.533996; this version posted March 25, 2023. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

248 vaginal data is 40.77, while that of the gut data is 10.76, indicating that the vaginal data is more
249 over-dispersed.
250 We lightly filtered the two template datasets. For quality control, we removed samples with
251 library size < 3000. To allow comparison with SparseDOSSA, we removed taxa that were present
252 in fewer than 4 samples, a requirement of SparseDOSSA. The MOMSPI was a longitudinal study
253 with repeated vaginal samples; we kept only first-visit samples to avoid repeated measures. The
254 only filtering used for the IBD data was that required by SparseDOSSA. After filtering, 517 sam-
255 ples and 1146 taxa were preserved in the vaginal MOMS-PI dataset; the gut IBD dataset comprised
256 146 samples and 614 taxa. We ignored covariates such as gender or location of biopsy collection to
257 focus only on reproducing the microbiome datasets as closely as possible, the goal of all methods
258 considered here. In our simulations, the library sizes for datasets generated using the D-M method
259 and MetaSPARSim were the same as that in the original data. For SparseDOSSA, the library
260 sizes were generated from a log-normal distribution parameterized by mean and standard devia-
261 tion of log counts in the original data, as recommended in their original publication. To facilitate
262 comparison of the methods, all simulated counts were transformed to relative abundances.
We compared the simulated data from each method to the template data using several measures. First, we concatenated the template data with a simulated dataset from each method and defined a binary variable to differentiate the template and simulated data. We tested the significance of this variable using PERMANOVA (Anderson, 2001), which tests for shifts in the between-observation distances. Our PERMANOVA tests used the Jaccard distance as well as the Bray-Curtis distance, both of which are commonly used in microbiome data analyses. The Jaccard distance uses only presence-absence information in the data, and thus can assess how similar $\tilde{Z}$ and $Z$ are, while the Bray-Curtis distance accounts for both the presence-absence and relative abundance information and can be used to assess the simulation of $\tilde{\pi}$. We also compared the alpha diversity of the simulated data and template data. The simulated communities were compared to the template in terms of observed richness and Shannon Index, and the differences in diversity were tested by Kruskal-Wallis tests. The observed richness is simply the number of observed taxa, while the Shannon Index additionally considers evenness (the relative abundances of taxa) when quantifying diversity. To suppress random variability, we repeated the comparison of alpha-diversity and beta-diversity using 20 simulated datasets from each of the four methods. Finally, we compared the methods visually, using PCoA ordination as well as boxplots of alpha diversity values, using a single simulated dataset for each method.
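
To make the measures above concrete, the following sketch implements the two distances, the two alpha-diversity indices, and a one-way permutation test based on the PERMANOVA pseudo-F statistic. It is a minimal standard-library illustration, not the software used for the analyses in this paper; the toy abundance tables, function names, and the choice of 999 permutations are assumptions made for the example.

```python
import math
import random


def jaccard(x, y):
    """Jaccard distance on presence-absence: 1 - shared taxa / taxa present in either."""
    a = {j for j, v in enumerate(x) if v > 0}
    b = {j for j, v in enumerate(y) if v > 0}
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x - y| / sum(x + y)."""
    denom = sum(x) + sum(y)
    return sum(abs(p - q) for p, q in zip(x, y)) / denom if denom > 0 else 0.0


def richness(x):
    """Observed richness: number of taxa with a non-zero abundance."""
    return sum(1 for v in x if v > 0)


def shannon(x):
    """Shannon index with natural logarithm."""
    total = sum(x)
    return -sum((v / total) * math.log(v / total) for v in x if v > 0)


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F computed from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    # Assumes some within-group variation, so that ss_within is not zero.
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(samples, labels, metric, n_perm=999, seed=0):
    """Permutation p-value for the template-versus-simulated indicator."""
    rng = random.Random(seed)
    n = len(samples)
    dist = [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy relative-abundance tables (rows are samples, columns are taxa).
template = [[0.60, 0.30, 0.10, 0.00], [0.50, 0.40, 0.10, 0.00],
            [0.70, 0.20, 0.10, 0.00], [0.55, 0.35, 0.10, 0.00]]
simulated = [[0.10, 0.10, 0.30, 0.50], [0.00, 0.20, 0.30, 0.50],
             [0.10, 0.00, 0.40, 0.50], [0.05, 0.15, 0.30, 0.50]]
samples = template + simulated
labels = ["template"] * 4 + ["simulated"] * 4
f_obs, p_value = permanova(samples, labels, bray_curtis)
print(f_obs, p_value)
print([richness(s) for s in samples], [round(shannon(s), 3) for s in samples])
```

In this toy example the two groups are deliberately different, so the pseudo-F statistic is large; a simulator that reproduces its template well would instead give a statistic close to those obtained under permutation, and hence a large p-value.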
We next compared the simulation approaches in terms of their β-dispersion, by testing whether the distribution of distances from each observation to the sample centroid was the same in the simulated and template data. We calculated distances to the centroids using the betadisper function in the R package vegan (Dixon, 2003). We used the Kolmogorov-Smirnov (K-S) test to compare these empirical distributions. We again averaged results over 20 simulation replicates to suppress random variability. We also compared the alpha diversity of the template and simulated data, as measured by the species richness (number of observed taxa) and the Shannon entropy.
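
The dispersion comparison can be illustrated with a short sketch. It does not reproduce the betadisper function: it assumes centroid-type dispersion and uses the identity that, in the principal-coordinate space of a distance matrix, the squared distance from sample i to the centroid equals the mean of its squared distances to all samples minus the sum of squared distances over all unordered pairs divided by n squared. Only the K-S statistic is computed; a p-value would normally come from a statistics library. All names and the one-dimensional toy data are hypothetical.

```python
import math


def centroid_distances(dist):
    """Distance of each sample to the group centroid, from pairwise distances only."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    # Sum of squared distances over unordered pairs, divided by n^2.
    grand = sum(sq[j][k] for j in range(n) for k in range(j + 1, n)) / (n * n)
    # abs() guards against small negative values for non-Euclidean dissimilarities.
    return [math.sqrt(abs(sum(sq[i]) / n - grand)) for i in range(n)]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    gap = 0.0
    for x in list(a) + list(b):
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pairwise(points):
    """Toy distance matrix for points on a line."""
    return [[abs(p - q) for q in points] for p in points]


template_pts = [0.0, 2.0, 4.0]    # widely spread "template" samples
simulated_pts = [1.5, 2.0, 2.5]   # tightly clustered "simulated" samples
d_template = centroid_distances(pairwise(template_pts))
d_simulated = centroid_distances(pairwise(simulated_pts))
ks = ks_statistic(d_template, d_simulated)
print(d_template, d_simulated, ks)
```

Here the simulated points have the same centroid as the template points but sit closer to it, which is the kind of underdispersion that a location test would miss and a comparison of distances to the centroid picks up.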
Finally, we evaluated the performance of our approach to generating data with different library sizes by rarefying our template datasets, then using the approach described in Section 2.3 to increase the library size to that of the original template data. Thus, we can compare the resulting simulated data to the original template data. Specifically, for each template, the observed counts for each subject were rarefied (subsampled without replacement) to remove 10% of the observed counts. The rarefied data were then treated as the template data in MIDAS, and the target library size was the original library size.
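
The rarefaction step can be sketched as follows. This is a minimal illustration of subsampling without replacement, assuming that each read is equally likely to be removed; the function name and example counts are hypothetical, and the rounding of the number of removed reads is a choice made for the sketch.

```python
import random


def rarefy_fraction(counts, remove_fraction, rng):
    """Subsample one sample's reads without replacement, dropping a fraction of them."""
    total = sum(counts)
    keep = total - round(total * remove_fraction)
    # Expand the count vector into one taxon label per read, then draw `keep` reads.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, keep)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


rng = random.Random(42)
original = [50, 30, 20, 0]
rarefied = rarefy_fraction(original, remove_fraction=0.10, rng=rng)
print(rarefied, sum(original) / sum(rarefied))
```

Because 10% of the reads are removed, returning to the original library size means multiplying the rarefied library size by 1/0.9, or about 1.11.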

3 Results

3.1 Comparison Results

The PCoA plots in Figure 1 provide a simple visualization of the similarities between the original data and the data simulated by MIDAS, the D-M method, MetaSPARSim, and SparseDOSSA for the IBD data and MOMS-PI data. For both datasets, after ordination the data simulated by MIDAS looked similar to the template data, using either the (presence-absence-based) Jaccard distance (Figure 1 A,C) or the (relative abundance-based) Bray-Curtis distance (Figure 1 B,D). To allow visual comparison between the template data and multiple datasets simulated by MIDAS, in Figure 2 we also give a heatmap constructed using 20 simulated datasets. In contrast, for both data templates, data simulated by the D-M method, MetaSPARSim, and SparseDOSSA all appear to be underdispersed in the first two principal coordinates (Figure 1 E,G,I,K,M,O) using the Jaccard distance. For the IBD data, data simulated using D-M and MetaSPARSim were easily distinguished from the original data when the Bray-Curtis distance was used (Figure 1 F,J). For the MOMS-PI data, we also see clear underdispersion in data simulated using D-M (Figure 1 H).

Figure 1: Principal Coordinates Analysis (PCoA) plots of the simulated and original communities. Each row corresponds to one method. The left two columns are the plots for the IBD data, and the right two columns are the plots for the MOMS-PI data. Black points: samples from the original data. Colored points: samples from the simulated data, with red being MIDAS, blue being D-M, pink being MetaSPARSim, and green being SparseDOSSA.

The visual impressions of beta diversity in Figures 1 and 2 are confirmed in Table 1, where we test whether the template and simulated data are significantly different using PERMANOVA. For tests using the Jaccard distance, the p-values for MIDAS were consistently high (indicating no detected difference between simulated and template data); among the other methods, only SparseDOSSA showed a similar pattern, and only when applied to the IBD data. When using the Bray-Curtis distance, only MIDAS produced data that were not easily differentiated from the template data by PERMANOVA.
Figure 3 shows the empirical cumulative distribution functions (CDFs) of the distances between each sample and the group centroid in the simulated data and in the template data using Jaccard and Bray-Curtis distances, calculated using the betadisper function in the R package vegan. If the simulated data are similar to the template data, their CDF should resemble that of the template data. The CDFs of the template data and the datasets simulated by the D-M method, MetaSPARSim, and SparseDOSSA are noticeably dissimilar, which is confirmed by extremely small Kolmogorov-Smirnov test p-values. The range of distances to centroids in the data simulated by the D-M method and SparseDOSSA is smaller than in the real data in every scenario, indicating a smaller dispersion overall. For the IBD data, the MIDAS-simulated data follow the template data closely in dispersion for both Jaccard and Bray-Curtis distances. For the MOMS-PI data with Bray-Curtis distance, there is a significant difference in dispersion between the MIDAS-simulated data and the template data (Figure 3). However, panel D of Figure 3 shows that the MIDAS results are clearly closer to those of the template data than the other methods are.
Table 1 and Figure 4 also show comparisons of two measures of alpha diversity: species richness and Shannon index. To compare the richness and Shannon index between simulated and template data, we used the Welch t-test to compare the means and the Kolmogorov-Smirnov two-sample test to compare the full distributions. In Table 1 we report the average p-value obtained from 20 simulated datasets for each method. In Figure 4, we also plot the alpha diversities for a single dataset from each simulation method. For the IBD data, all methods successfully reproduced the mean richness, while for the MOMS-PI data only MIDAS and MetaSPARSim had mean richness that did not significantly differ from the template data. The situation is very different if we ask for the entire distribution of sample richness values to be the same as in the template data; here, only MIDAS generated data that met this criterion, and only when using the IBD data as the template. For the Shannon index, only MIDAS produced a non-significant t-test for the IBD data, while none of the methods met this criterion for the MOMS-PI data. The same pattern was found for the full distribution of Shannon values.

Figure 2: Principal Coordinates Analysis (PCoA) plots of the simulated and original communities. The heatmap is plotted based on 10 replicates of communities simulated by MIDAS, with darker coloring associated with a higher density of simulated values. Black points represent the original community.

Table 1: Average p-values (from 20 replicates) for tests comparing alpha and beta diversities of simulated data and template data.

                        Beta-Diversity*          Alpha-Diversity
Data      Method        Jaccard   Bray-Curtis    Richness t  Richness KS  Shannon t  Shannon KS
IBD       MIDAS         0.9993    1.000          0.6644      0.1829       0.2610     0.2637
IBD       D-M           0.0090    <0.0001        0.3303      <0.0001      <0.0001    <0.0001
IBD       MetaSPARSim   0.0340    <0.0001        0.3102      <0.0001      0.0078     <0.0001
IBD       SparseDOSSA   0.7972    <0.0001        0.0569      <0.0001      <0.0001    <0.0001
MOMS-PI   MIDAS         0.5793    0.8617         0.6252      0.0019       <0.0001    <0.0001
MOMS-PI   D-M           <0.0001   <0.0001        0.0028      <0.0001      <0.0001    <0.0001
MOMS-PI   MetaSPARSim   <0.0001   <0.0001        0.6341      <0.0001      <0.0001    <0.0001
MOMS-PI   SparseDOSSA   <0.0001   <0.0001        <0.0001     <0.0001      0.0002     0.0015
* P-values obtained from PERMANOVA.

We also compared the computational time that each method takes to fit its model to the template IBD and MOMS-PI datasets and to simulate one dataset of the same size; the results are summarized in Table 2. The computational time was evaluated on an Intel quad-core 2.7 GHz processor with 8 GB of memory. In terms of total time, MIDAS is one of the fastest methods, especially when fitting and simulating the larger MOMS-PI dataset. For model fitting, MetaSPARSim is the fastest, but it is very slow in generating new data. For generating new data with specified parameters, D-M is the fastest. The computation time of SparseDOSSA for fitting the model depends on the number of iterations in its EM algorithm. It took more than 3 hours to fit its model to our IBD and MOMS-PI datasets, making it hard to use in practice; the pre-trained models can be used if faster results are needed, but then a user-selected template dataset cannot be used. Discounting the time required for model fitting, MIDAS, D-M, and SparseDOSSA can all generate replicate datasets quickly; MetaSPARSim is the only outlier in this regard.

Figure 3: Empirical cumulative distribution functions of distances to centroids.

Finally, in the supplementary material, we show the success of our approach to changing the library size for simulated data. Using rarefied data as the template, the data we simulated with the original (unrarefied) library size agree well with the original template data, both by ordination (Supplementary Figure S3) and in terms of beta dispersion (as measured by the distance from each observation to the data centroid) (Supplementary Figure S4).

Figure 4: Alpha diversities (Richness and Shannon Index) of the original dataset and a single simulated dataset for each of the four simulation methods.

Table 2: Computation time (seconds) required to fit the template data, and to simulate a new dataset with the same library size. Simulating time is the average time over 20 replicates of generating datasets of the same size as the real data. Total time is the sum of fitting and simulating times.

               IBD                               MOMS-PI
Method         Fitting   Simulating  Total       Fitting   Simulating  Total
MIDAS          25.5      2.5         28.0        162.0     15.3        177.6
D-M            25.0      0.3         25.3        308.4     2.2         310.6
MetaSPARSim    7.4       144.9       152.3       41.3      469.4       510.7
SparseDOSSA    10812.6   0.8         10813.4     11792.5   5.2         11797.7

4 Discussion

Simulating realistic microbiome datasets is essential for methodology development in microbiome studies. However, this task is surprisingly difficult due to the complexity of microbiome abundance data. Here we propose a two-step MIcrobiome DAta Simulator (MIDAS) to generate microbiome datasets that resemble a real (template) microbiome dataset of the user’s choice. The first step of MIDAS involves generating correlated binary indicators that represent the presence-absence status of all taxa. The correlation matrix for these binary indicators is estimated from the real data with a shrinkage procedure. The second step of MIDAS generates relative abundances and counts for the taxa that are considered to be present in step 1, utilizing a Gaussian copula to account for the taxon-taxon correlations.
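
The two steps summarized above can be illustrated with a toy sketch for two taxa. This is not the MIDAS implementation: it assumes that presence-absence indicators are obtained by thresholding correlated latent normal variables, that abundances are obtained by passing a second pair of correlated normals through the empirical quantiles of each taxon's non-zero template abundances, and that each simulated sample is renormalized to sum to one. The prevalences, correlations, and function names are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def correlated_uniforms(rho, rng):
    """Two Uniform(0, 1) draws linked by a Gaussian copula with correlation rho."""
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z1 = e1
    z2 = rho * e1 + math.sqrt(1.0 - rho * rho) * e2
    return STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)


def empirical_quantile(values, u):
    """Value of the empirical quantile function of `values` at probability u."""
    ordered = sorted(values)
    return ordered[min(int(u * len(ordered)), len(ordered) - 1)]


def simulate_sample(prevalence, nonzero_abund, rho_presence, rho_abund, rng):
    """Simulate presence-absence, then relative abundances, for two taxa."""
    # Step 1: correlated binary indicators (taxon j present with probability prevalence[j]).
    u_presence = correlated_uniforms(rho_presence, rng)
    present = [u < p for u, p in zip(u_presence, prevalence)]
    # Step 2: correlated abundances for the taxa that are present.
    u_abund = correlated_uniforms(rho_abund, rng)
    raw = [empirical_quantile(vals, u) if here else 0.0
           for vals, u, here in zip(nonzero_abund, u_abund, present)]
    total = sum(raw)
    return present, [r / total if total > 0 else 0.0 for r in raw]


rng = random.Random(7)
prevalence = [0.3, 0.5]                               # made-up proportions of non-zero cells
nonzero_abund = [[0.1, 0.2, 0.4], [0.05, 0.3, 0.6]]   # made-up non-zero template abundances
draws = [simulate_sample(prevalence, nonzero_abund, 0.9, 0.5, rng) for _ in range(5000)]
freq_first = sum(d[0][0] for d in draws) / len(draws)
freq_both = sum(d[0][0] and d[0][1] for d in draws) / len(draws)
print(freq_first, freq_both)
```

With a strong positive latent correlation, the two taxa co-occur far more often than the product of their prevalences (0.15) would suggest, while each taxon's marginal prevalence is preserved.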
In simulation studies, it is frequently necessary to generate microbiome data in which relative abundances or presence-absence status are related to a trait or covariate, for example one or a few clinical characteristics. MIDAS supports this situation: taxon relative abundances, library sizes, and taxon proportions of zero cells are output by the MIDAS set-up function Midas.setup. These quantities can be modified using the function Midas.modify before passing them as inputs to the MIDAS simulation function Midas.sim. In the simple situation with cases and controls, it may be worthwhile to restrict the template data to unaffected controls.
We expect that modifying library sizes will perform best if the new library sizes are within the range of library sizes found in the template data, so that the number of zero cells for a sample can be reasonably predicted from its new library size. In our simulations, we found that MIDAS could successfully execute an 11% increase in library size (restoring the 10% of counts removed by rarefaction multiplies the library size by 1/0.9 ≈ 1.11) while remaining faithful to the template data.
MIDAS is designed to generate a new dataset that is as close as possible to the target data. In some cases, it may be desirable to introduce extra variability. For example, in Section 2.1 we used the number of non-zeroes $d_j$ found in the original data; we could instead sample these from a distribution, possibly while maintaining their rank order, or add random noise. In Section 2.2, we generated relative abundance values $\tilde{\pi}_{\cdot j}$ using a sampling-without-replacement scheme. We could instead sample from a smoothed version of $F_j$ (say, using a linear interpolation of the CDF) or even fit a Beta distribution to the non-zero relative abundances for each taxon. Finally, a taxon $j$ that appears in all samples will always appear in all simulated samples; this could easily be modified by allowing a zero cell in the $i$th sample with probability $\varepsilon_{1j}$. Similarly, we could allow a taxon $j'$ that has no counts in any sample to appear in a simulated sample with probability $\varepsilon_{0j'}$; we could provide a distribution of relative abundances for such a taxon by using the distribution of relative abundances from a randomly selected taxon having the same number of observed non-zero cells as taxon $j'$ has. We may consider these changes as we receive feedback from users.
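
Two of the possible extensions mentioned above, sampling from a linearly interpolated CDF and fitting a Beta distribution, can be sketched as follows. Neither is part of MIDAS as described here. The sketch assumes a piecewise-linear CDF through the sorted observed values (with the i-th of n values placed at probability (i-1)/(n-1)) and a method-of-moments Beta fit, which is only one of several ways to fit a Beta distribution; all names and numbers are hypothetical.

```python
import random
import statistics


def interpolate_quantile(ordered, u):
    """Invert a piecewise-linear CDF through the sorted values at probability u."""
    pos = u * (len(ordered) - 1)
    lo = min(int(pos), len(ordered) - 2)   # requires at least two values
    frac = pos - lo
    return ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])


def sample_smoothed(values, rng):
    """Draw one value from the smoothed empirical distribution of `values`."""
    return interpolate_quantile(sorted(values), rng.random())


def fit_beta_moments(values):
    """Method-of-moments Beta parameters; needs variance < mean * (1 - mean)."""
    m = statistics.mean(values)
    v = statistics.variance(values)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


rng = random.Random(3)
abund = [0.1, 0.2, 0.4]                   # made-up non-zero relative abundances
smooth_draw = sample_smoothed(abund, rng)
alpha, beta = fit_beta_moments([0.2, 0.4, 0.6])
beta_draw = rng.betavariate(alpha, beta)
print(smooth_draw, alpha, beta, beta_draw)
```

The interpolated sampler can return values that lie between the observed abundances but never outside their range, whereas the Beta fit can produce any value in (0, 1); which behavior is preferable depends on how much extra variability is wanted.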
To summarize, MIDAS is easy to implement, flexible, and suitable for most microbiome data simulation situations. For the two template datasets we considered, MIDAS showed superior performance when compared to existing competitors, both by PERMANOVA and in terms of beta dispersion.


Data Availability

All template datasets are publicly available and can be accessed through the R package HMP2Data. Details can be found in the vignette of the R package MIDAS.

Competing interests

The authors declare no competing interests.

Acknowledgements

Funding

Drs. Zhao and Satten’s work is supported, in part, by the National Institutes of Health (R01GM147162).

References

Anderson, M. J. (2001). A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26.

Bonett, D. G. and Price, R. M. (2005). Inferential methods for the tetrachoric correlation coefficient. Journal of Educational and Behavioral Statistics, 30(2), 213–225.

Chen, J. and Li, H. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Annals of Applied Statistics, 7.

Dejea, C. M., Wick, E. C., Hechenbleikner, E. M., White, J. R., Mark Welch, J. L., Rossetti, B. J., Peterson, S. N., Snesrud, E. C., Borisy, G. G., Lazarev, M., Stein, E., Vadivelu, J., Roslani, A. C., Malik, A. A., Wanyiri, J. W., Goh, K. L., Thevambiga, I., Fu, K., Wan, F., Llosa, N., Housseau, F., Romans, K., Wu, X., McAllister, F. M., Wu, S., Vogelstein, B., Kinzler, K. W., Pardoll, D. M., and Sears, C. L. (2014). Microbiota organization is a distinct feature of proximal colorectal cancers. Proc. Natl. Acad. Sci. U.S.A., 111(51), 18321–18326.

Dixon, P. (2003). VEGAN, a package of R functions for community ecology. Journal of Vegetation Science, 14.

Fettweis, J. M., Serrano, M. G., Brooks, J. P., Edwards, D. J., Girerd, P. H., Parikh, H. I., Huang, B., Arodz, T. J., Edupuganti, L., Glascock, A. L., Xu, J., Jimenez, N. R., Vivadelli, S. C., Fong, S. S., Sheth, N. U., Jean, S., Lee, V., Bokhari, Y. A., Lara, A. M., Mistry, S. D., Duckworth, R. A., Bradley, S. P., Koparde, V. N., Orenda, X. V., Milton, S. H., Rozycki, S. K., Matveyev, A. V., Wright, M. L., Huzurbazar, S. V., Jackson, E. M., Smirnova, E., Korlach, J., Tsai, Y. C., Dickinson, M. R., Brooks, J. L., Drake, J. I., Chaffin, D. O., Sexton, A. L., Gravett, M. G., Rubens, C. E., Wijesooriya, N. R., Hendricks-Muñoz, K. D., Jefferson, K. K., Strauss, J. F., and Buck, G. A. (2019). The vaginal microbiome and preterm birth. Nature Medicine, 25.

Gilbert, J. A., Blaser, M. J., Caporaso, J. G., Jansson, J. K., Lynch, S. V., and Knight, R. (2018). Current understanding of the human microbiome. Nature Medicine, 24.

Hu, Y., Satten, G. A., and Hu, Y.-J. (2022). LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control. Proceedings of the National Academy of Sciences.

Hu, Y. J. and Satten, G. A. (2020). Testing hypotheses about the microbiome using the linear decomposition model (LDM). Bioinformatics, 36.

Hu, Y. J., Lane, A., and Satten, G. A. (2021). A rarefaction-based extension of the LDM for testing presence-absence associations in the microbiome. Bioinformatics, 37.

Jiang, Z., He, M., Chen, J., Zhao, N., and Zhan, X. (2022). MiRKAT-MC: A Distance-Based Microbiome Kernel Association Test With Multi-Categorical Outcomes. Front Genet, 13, 841764.


Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., Ojesina, A. I., Jung, J., Bass, A. J., Tabernero, J., Baselga, J., Liu, C., Shivdasani, R. A., Ogino, S., Birren, B. W., Huttenhower, C., Garrett, W. S., and Meyerson, M. (2012). Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res., 22(2), 292–298.

Lin, H. and Peddada, S. D. (2020). Analysis of compositions of microbiomes with bias correction. Nature Communications, 11(1), 1–11.

Lloyd-Price, J., Arze, C., Ananthakrishnan, A. N., Schirmer, M., Avila-Pacheco, J., Poon, T. W., Andrews, E., Ajami, N. J., Bonham, K. S., Brislawn, C. J., Casero, D., Courtney, H., Gonzalez, A., Graeber, T. G., Hall, A. B., Lake, K., Landers, C. J., Mallick, H., Plichta, D. R., Prasad, M., Rahnavard, G., Sauk, J., Shungin, D., Vázquez-Baeza, Y., White, R. A., Bishai, J., Bullock, K., Deik, A., Dennis, C., Kaplan, J. L., Khalili, H., McIver, L. J., Moran, C. J., Nguyen, L., Pierce, K. A., Schwager, R., Sirota-Madi, A., Stevens, B. W., Tan, W., ten Hoeve, J. J., Weingart, G., Wilson, R. G., Yajnik, V., Braun, J., Denson, L. A., Jansson, J. K., Knight, R., Kugathasan, S., McGovern, D. P., Petrosino, J. F., Stappenbeck, T. S., Winter, H. S., Clish, C. B., Franzosa, E. A., Vlamakis, H., Xavier, R. J., and Huttenhower, C. (2019). Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature, 569.

Ma, S., Ren, B., Mallick, H., Moon, Y. S., Schwager, E., Maharjan, S., Tickle, T. L., Lu, Y., Carmody, R. N., Franzosa, E. A., Janson, L., and Huttenhower, C. (2021). A statistical model for describing and simulating microbial community profiles. PLoS Computational Biology, 17.

Mandal, S., Treuren, W. V., White, R. A., Eggesbø, M., Knight, R., and Peddada, S. D. (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health & Disease, 26.

Martin, B. D., Witten, D., and Willis, A. D. (2020). Modeling microbial abundances and dysbiosis with beta-binomial regression. Ann Appl Stat, 14(1), 94–115.


Patuzzi, I., Baruzzo, G., Losasso, C., Ricci, A., and Camillo, B. D. (2019). metaSPARSim: A 16S rRNA gene sequencing count data simulator. BMC Bioinformatics, 20.

Paulson, J. N., Stine, O. C., Bravo, H. C., and Pop, M. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods, 10.

Proctor, L. M., Creasy, H. H., Fettweis, J. M., Lloyd-Price, J., Mahurkar, A., Zhou, W., Buck, G. A., Snyder, M. P., Strauss, J. F., Weinstock, G. M., White, O., and Huttenhower, C. (2019). The integrative human microbiome project. Nature, 569.

Pya, N. and Wood, S. N. (2015). Shape constrained additive models. Statistics and Computing, 25, 543–559.

Revelle, W. (2015). Package ’psych’ - procedures for psychological, psychometric and personality research. R Package.

Ruppert, D. and Matteson, D. S. (2015). Statistics and Data Analysis for Financial Engineering, with R examples. Springer, New York, NY.

Simren, M., Barbara, G., Flint, H. J., Spiegel, B. M., Spiller, R. C., Vanner, S., Verdu, E. F., Whorwell, P. J., and Zoetendal, E. G. (2013). Intestinal microbiota in functional bowel disorders: a Rome foundation report. Gut, 62(1), 159–176.

Sze, M. A. and Schloss, P. D. (2016). Looking for a Signal in the Noise: Revisiting Obesity and the Microbiome. MBio, 7(4).

Wu, C., Chen, J., Kim, J., and Pan, W. (2016). An adaptive association test for microbiome data. Genome Medicine, 8.

Zhao, N., Chen, J., Carroll, I. M., Ringel-Kulka, T., Epstein, M. P., Zhou, H., Zhou, J. J., Ringel, Y., Li, H., and Wu, M. C. (2015). Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. American Journal of Human Genetics, 96.

6 pages
Ch2 Wiener Filters
No ratings yet
Ch2 Wiener Filters
80 pages
2007-2008 Jawaharlal Nehru Technological University Kukatpally, Hyderabad B.Tech Electrical and Electronics Engineering I Year Course Structure
No ratings yet
2007-2008 Jawaharlal Nehru Technological University Kukatpally, Hyderabad B.Tech Electrical and Electronics Engineering I Year Course Structure
91 pages
Question Paper - 2024 (Class - Xii Maths)
No ratings yet
Question Paper - 2024 (Class - Xii Maths)
5 pages
An Introduction To Transformers
No ratings yet
An Introduction To Transformers
10 pages
Demo 50 YCT 2025 JEE Main Mathematics Solved Papers English Medium
No ratings yet
Demo 50 YCT 2025 JEE Main Mathematics Solved Papers English Medium
50 pages
Quadratic Programming Solution of Dynamic Matrix Control (QDMC)
No ratings yet
Quadratic Programming Solution of Dynamic Matrix Control (QDMC)
16 pages
Cayley Table of D4
100% (1)
Cayley Table of D4
26 pages
List of Excel Formulas
No ratings yet
List of Excel Formulas
208 pages
Mathematical and Statistical Foundations
No ratings yet
Mathematical and Statistical Foundations
60 pages
Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Auto Regressive Models
No ratings yet
Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Auto Regressive Models
31 pages
Noncommutative Geometry From Strings and Branes: A, B A, B A1
No ratings yet
Noncommutative Geometry From Strings and Branes: A, B A, B A1
20 pages
Single-Channel and Multi-Channel Image Reconstruction in X-Ray CT
No ratings yet
Single-Channel and Multi-Channel Image Reconstruction in X-Ray CT
6 pages
Anna University-B.E EEE-Electrical and Electronics Engineering Syllabus
No ratings yet
Anna University-B.E EEE-Electrical and Electronics Engineering Syllabus
148 pages
Quaternion Conrad
No ratings yet
Quaternion Conrad
19 pages
Periyar University: Periyar Palkalai Nagar SALEM - 636011
No ratings yet
Periyar University: Periyar Palkalai Nagar SALEM - 636011
48 pages
cs239 Ejer1
No ratings yet
cs239 Ejer1
2 pages
Shear Locking: Shear Locking, Aspect Ratio Stiffening, and Qualitative Errors
No ratings yet
Shear Locking: Shear Locking, Aspect Ratio Stiffening, and Qualitative Errors
18 pages
Crash Course The Math of Quantum Mechanics PDF
No ratings yet
Crash Course The Math of Quantum Mechanics PDF
6 pages
Inverse of A Matrix.01
No ratings yet
Inverse of A Matrix.01
6 pages
Ifasd 074
No ratings yet
Ifasd 074
16 pages
Math g3 m1 Full Module
No ratings yet
Math g3 m1 Full Module
325 pages

Motivation: Advances in sequencing technology have led to the discovery of associations

microbiome data, or can require exorbitant computational time.

structure of a template microbiome dataset. We demonstrate improved performance of MI-

Availability and implementation: The R package MIDAS is available on GitHub at

Contact: Ni Zhao, Department of Biostatistics, Johns Hopkins University ([email protected])

Supplementary information: Supplementary data are available at Bioinformatics online.

is ‘similar’ to the template data in some way.

2 Materials and methods

2.1 Step 1: generate presence-absence data

2.2 Step 2: generate relative abundance and count data

sample $\tilde{m}_j > m_j$ then, in addition

$\tilde{m}_j - m_j$ values from the non-zero

updated through $\tilde{\pi}_{ij} = \tilde{C}_{ij}/N$

Algorithm 1: Steps to simulate one dataset similar to a template dataset

2.3 Changing the parameters of the simulation

$$\log_{10}[E(Z_{i\cdot})] = f(\log_{10}(N_i)), \tag{4}$$

$$\operatorname{logit}[E(Z_{ij})] = f(\log_{10}(N_i)) + g(\log_{10}(p_j)) \tag{7}$$

abundance data π

generate counts C

2.4 Simulation studies and comparisons with existing methods

3.1 Comparison Results

Figure 3: Empirical cumulative distribution function of distances to centroids

Data Availability

Competing interests

The authors declare no competing interests.
