Simulation Studies
Author Note
Björn S. Siepe, František Bartoš, and Samuel Pawel contributed equally.
Björn S. Siepe https://orcid.org/0000-0002-9558-4648
František Bartoš https://orcid.org/0000-0002-0018-5573
Tim P. Morris https://orcid.org/0000-0001-5850-3610
Anne-Laure Boulesteix https://orcid.org/0000-0002-2729-0947
Daniel W. Heck https://orcid.org/0000-0002-6302-9252
Samuel Pawel https://orcid.org/0000-0003-2779-320X
Preregistrations, Data, and R code for all analyses are available at the Open Science
Framework: https://osf.io/dfgvu/. This is the second version of this preprint (June
14th, 2024).
The authors made the following contributions: BSS: Conceptualization, Methodology,
Formal Analysis, Software, Investigation, Data Curation, Visualization, Writing – original
draft, Writing – review & editing; FB: Conceptualization, Methodology, Formal Analysis,
Software, Investigation, Data Curation, Visualization, Writing – original draft, Writing –
review & editing; TPM: Writing – review & editing; ALB: Writing – review & editing; DWH:
Writing – review & editing; SP: Conceptualization, Methodology, Formal Analysis, Software,
Investigation, Data Curation, Visualization, Writing – original draft, Writing – review &
editing.
We thank Phil Chalmers and Carolin Strobl for helpful comments on drafts of the
manuscript. We thank Adrianna Zielińska for help with code checking. We thank Felix
Schönbrodt and an anonymous referee for constructive and helpful comments that improved
the manuscript. Our acknowledgment of these individuals does not imply their endorsement
of this article.
Correspondence concerning this article should be addressed to Björn S. Siepe,
Department of Psychology, University of Marburg, Gutenbergstraße 18, Marburg, Germany.
E-mail: [email protected]
Abstract
Simulation studies are widely used for evaluating the performance of statistical methods in
psychology. However, the quality of simulation studies can vary widely in terms of their
design, execution, and reporting. In order to assess the quality of typical simulation studies in
psychology, we reviewed 321 articles published in Psychological Methods, Behavior
Research Methods, and Multivariate Behavioral Research in 2021 and 2022, among which
100/321 = 31.2% report a simulation study. We find that many articles do not provide
complete and transparent information about key aspects of the study, such as justifications for
the number of simulation repetitions, Monte Carlo uncertainty estimates, or code and data to
reproduce the simulation studies. To address this problem, we provide a summary of the
ADEMP (Aims, Data-generating mechanism, Estimands and other targets, Methods,
Performance measures) design and reporting framework from Morris, White, and Crowther
(2019) adapted to simulation studies in psychology. Based on this framework, we provide
ADEMP-PreReg, a step-by-step template for researchers to use when designing, potentially
preregistering, and reporting their simulation studies. We give formulae for estimating
common performance measures, their Monte Carlo standard errors, and for calculating the
number of simulation repetitions to achieve a desired Monte Carlo standard error. Finally, we
give a detailed tutorial on how to apply the ADEMP framework in practice using an example
simulation study on the evaluation of methods for the analysis of pre–post measurement
experiments.
Keywords: experimental design, Monte Carlo experiments, meta-research,
preregistration, reporting
Simulation studies are experiments and should be treated as such by authors and editors.
Hauck and Anderson (1984, p. 215)
Introduction
Simulation studies involve many researcher degrees of freedom, for example, in choosing
certain methods and data-generating mechanisms, and in deciding which results are reported.
Issues with the conduct and reporting of simulation studies were described almost half a
century ago (Hoaglin & Andrews, 1975). However, the attention afforded to researcher
degrees of freedom in psychology and other empirical sciences has recently led to more
critical reflection on the state of methodological research (Boulesteix, Binder, Abrahamowicz,
& Sauerbrei, 2018; Boulesteix, Hoffmann, Charlton, & Seibold, 2020; Boulesteix, Hoffmann,
et al., 2020; Friedrich & Friede, 2023; Heinze et al., 2024; Luijken et al., 2023; Pawel, Kook,
& Reeve, 2024; Strobl & Leisch, 2022).
Some may argue that simulation studies are often conducted at a more exploratory
stage of research and therefore do not require as much rigor and transparency (including
measures such as sample size planning, preparation and preregistration of a study protocol, or
code and data sharing) as other types of studies. However, many simulation studies are not
conducted and reported as exploratory, but rather with the explicit goal of deriving
recommendations for the use of methods. It is important to realize that such simulation studies
often have a large impact. For example, the simulation study by Hu and Bentler (1999) on
cut-off criteria for structural equation models has been cited over 100,000 times, presumably
justifying thousands of choices in structural equation modeling. It would be detrimental if the
results of such a study were flawed or reported suboptimally. Another example is the
simulation study that recommended the “1 variable per 10 events” heuristic as a sample size
criterion for logistic regression (Peduzzi, Concato, Kemper, Holford, & Feinstein, 1996). This
heuristic has been cited over 8,000 times and was widely adopted as a minimum sample size
criterion, but the influential simulation study advocating it was later found to be
non-replicable (van Smeden et al., 2016).
In non-methodological research, it has been repeatedly emphasized that research
results should be accompanied by measures of statistical uncertainty, such as p-values,
standard errors, or confidence intervals (Cumming, Fidler, Kalinowski, & Lai, 2012; van der
Bles et al., 2019). Clear guidelines are now available in most fields, for example, the APA
guidelines require that “when point estimates [...] are provided, always include an associated
measure of variability” (American Psychological Association, 2020, p. 88).
We give formulae for estimating common
performance measures, their Monte Carlo standard errors, and for calculating the number of
repetitions to achieve a desired Monte Carlo standard error. Finally, we illustrate ADEMP and
the template with an example simulation study on a typical application from psychological
research—a comparison of methods for the analysis of pre–post measurements.
Aims
The aim of a simulation study refers to the goal of the methodological research project
and shapes subsequent choices. Aims are typically related to evaluating the properties of a
method (or multiple methods) with respect to a particular statistical task. In psychological
simulation studies, common statistical tasks and exemplary aims (taken from the literature
review) can include:
Table 1
Summary of the ADEMP Planning and Reporting Structure for Simulation Studies.
Aims: What is the aim of the study? Example: To evaluate the hypothesis testing and estimation characteristics of different methods for estimating the treatment effect in pre–post measurement experiments.
Data-generating mechanism: How are data sets generated? Example: Pre–post measurements are simulated from a bivariate normal distribution, varying the true treatment effect and the pre–post correlation.
Estimands and other targets: What are the estimands and/or other targets of the study? Example: The null hypothesis of no effect between groups is the primary target; the treatment effect is the secondary estimand.
Methods: Which methods are evaluated? Example: ANCOVA, change score analysis, and post-score analysis.
Performance measures: Which performance measures are used? Example: Type I error rate, power, and bias.
• Hypothesis testing, e.g., comparing different tests of publication bias (Rodgers &
Pustejovsky, 2021).
• Model selection, e.g., comparing different fit indices for selecting the best structural
equation model (Shi, DiStefano, Maydeu-Olivares, & Lee, 2022).
• Design, e.g., comparing different methods for determining sample size in mixed-effect
modeling (Murayama, Usami, & Sakaki, 2022).
• Other aims, e.g., assessing tools for quantifying complexity (Moulder, Daniel,
Teachman, & Boker, 2022), clustering data sets into equivalent parts (Papenberg & Klau, 2021).
These statistical tasks are often closely related, for example, hypothesis testing and model
selection may be seen as the same task; the duality of p-values and confidence intervals
enables both to be used for estimation and hypothesis testing from a frequentist perspective;
model selection may be used for the purpose of description, prediction or estimation.
Data-generating mechanism
If multiple factors are varied, there are different possible ways to combine them: fully
factorial (all possible combinations), partially factorial (considering some combinations but
not all possible ones), one-at-a-time (varying one factor while holding the other/s constant), or
scattershot (creating a set of distinct conditions). The fully factorial approach is typically
preferred because it allows us to disentangle the individual effects of the factors and their
interactions, but it may not always be feasible computationally or because some combinations
of factors make no sense. For example, in a simulation study involving missing data, we may
wish to vary the proportion of missing data and the missing data mechanism. When the
proportion is zero, the mechanism is not applicable. Complex simulation designs can also
make the reporting and interpretation of results more difficult. To reduce the complexity of the
design, a partially factorial design may then be chosen (Morris et al., 2019, see, for example,
Skrondal (2000) for recommendations on “fractional factorial designs”).
Figure 1
Example of Different Ways to Combine Factors in the Design of a Simulation Study.
Figure 1 gives an example of how two factors, sample size and the number of
variables, could be combined for a simulation study comparing different regression methods:
The fully factorial approach would include all possible combinations (left panel). However,
this may not be possible because, for example, the regression methods under study may not be
able to handle situations where the number of variables is greater than the sample size. In this
case, these conditions may be excluded and a partially factorial design adopted (middle left
panel). With the one-at-a-time approach, one may fix the sample size to a value of 40 and then
vary the number of variables across all levels, and vice versa, fix the number of variables to 15
and vary the sample size across all levels (middle right panel). Finally, with the scattershot
approach, one may create distinct conditions of sample size and number of variables, for
example, inspired by actual data sets that feature these combinations (right panel). Depending
on the setup of this approach, higher-order interaction effects between simulation factors may
not be identifiable.
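As an illustration of how such designs can be set up in code, the sketch below builds a fully factorial grid of conditions with expand.grid() and then drops the combinations that the methods cannot handle to obtain a partially factorial design; the factor levels shown are illustrative and not taken from a specific study.

```r
# Fully factorial design: all combinations of sample size and number of variables
conditions <- expand.grid(
  n_obs  = c(20, 40, 60, 80, 100),  # sample size (illustrative levels)
  n_vars = c(5, 15, 30, 60)         # number of variables (illustrative levels)
)

# Partially factorial design: drop combinations the methods cannot handle,
# here conditions in which the number of variables exceeds the sample size
conditions_partial <- subset(conditions, n_vars <= n_obs)

nrow(conditions)          # 20 conditions in the fully factorial design
nrow(conditions_partial)  # remaining conditions after exclusion
```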
When resampling an existing data set, researchers rely on a (usually large) existing
data set to sample smaller data sets for the simulation. Alternatively, one may sample equally
large data sets with replacement from the existing data set. The data-generating mechanism is
thus implicitly determined by the data set while researchers only need to specify the
resampling mechanism.
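A minimal sketch of these two resampling variants, using a simulated stand-in for the existing data set, could look as follows.

```r
set.seed(42)
# Stand-in for an existing (usually large) real data set
big_data <- data.frame(x = rnorm(5000), y = rnorm(5000))

# Variant 1: draw a smaller data set (here 100 rows) without replacement
sub_sample <- big_data[sample(nrow(big_data), size = 100), ]

# Variant 2: draw an equally large data set with replacement
# (a nonparametric bootstrap of the rows)
boot_sample <- big_data[sample(nrow(big_data), size = nrow(big_data), replace = TRUE), ]
```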
Estimands and other targets
Estimands and other targets jointly refer to the practical aims of the compared
methods. Table 2 provides an overview of common targets of simulation studies. For example,
if a simulation study aims to compare different methods for estimating the effect of an
intervention versus the absence of that intervention, the estimand of interest is a contrast of
these groups rather than, say, a group mean. An estimand is a target quantity of a statistical
analysis (see ICH, 2019; Keene, Lynggaard, Englert, Lanius, & Wright, 2023; Lundberg,
Johnson, & Stewart, 2021, for accessible introductions to estimands). In simulation studies, an
estimand is typically, but not always, a parameter of the underlying data-generating model.
When it is not, care is needed to define and compute the true or ideal value of an estimand. If
the simulation study aims to compare different methods for hypothesis testing rather than
estimation, the true hypothesis is the target of interest.1 Again, care is needed to distinguish
between different ways of translating substantive hypotheses into statistical hypotheses (e.g.,
whether a null hypothesis of no effect is specified as a point null hypothesis or an average null
hypothesis in random-effects meta-analysis). Similarly, the targets appropriate for the
1 This may initially sound confusing to the reader, as some might expect the target of the simulation would be the outcome—Type I error rate of the hypothesis test. However, this would be the performance measure that indicates the methods’ performance for the given target—the null hypothesis.
statistical tasks of other simulation studies include the true model (when the statistical task is
model selection), the design characteristics (when the statistical task is design), or new data
(when the statistical task is prediction).
Methods
Ideally, methods and conditions are selected to enable a fair and neutral comparison, rather
than searching for specific conditions under which a method appears to be the “best” (Strobl
& Leisch, 2022).
Comparative simulation studies can benefit from approaches that decrease
over-optimism and allegiance bias used in other scientific fields such as experimental
psychology or clinical trials. These include blinding the data analysts to the method (Pawel et
al., 2024) or using separate research teams for data simulation and analysis (Kreutz et al.,
2020). Further, “adversarial collaboration”, the collaboration of researchers with different
theoretical or methodological views (Cowan et al., 2020; for an example see Binder,
Sauerbrei, & Royston, 2012), could be introduced to simulation studies to achieve useful
comparisons between different methods. Researchers can also build on previous research by
combining the conditions and methods of previous simulation studies into a single, large
simulation study, extending previous simulation designs when necessary, to assess the
robustness of their results to different experimental settings that have already been
investigated in isolation by others (see Bartoš, Maier, Wagenmakers, Doucouliagos, &
Stanley, 2023; Hong & Reed, 2021, for an example).
Performance measures
Performance measures are the summary statistics used for quantifying how well
methods can achieve their task for a given data-generating mechanism. For instance, a
performance measure may quantify how well a method can estimate an estimand. As such, the
estimated performance corresponds to the “inferences” of a simulation study that allow
researchers to draw conclusions about the methods. The selection of appropriate performance
measures depends on the aims of the simulation study, but also the estimands and other
targets. For example, bias, (root) mean square error, and confidence interval coverage can be
used to evaluate methods for estimating intervention effects, while power and Type I error rate
might be used to evaluate methods for testing hypotheses about publication bias. Table 2
shows typical performance measures for different simulation study aims.
The same statistical method may be applied for different statistical tasks and in
different contexts, such as estimation and prediction, for which different performance
measures can be used. Typically, multiple performance measures for a method should be
Table 2
Different Types of Statistical Tasks, their Target(s), and Typical Performance Measures.
interpreted together (Morris et al., 2019). For example, one may only consider comparing the
statistical power of different hypothesis testing methods if these methods have appropriate
Type I error rates (e.g., at or below 5%). When evaluating estimation performance, it is often
desirable to interpret the bias and variance of an estimator together, as there is typically a
trade-off between the two. In general, providing a rationale for the choice of performance
measure as well as defining it clearly (ideally, with a formula-based representation) avoids
ambiguity. This is especially important when less familiar performance measures are used,
and when performance is estimated conditional on some sample statistic (e.g., the bias of a
method given that it converged for a given simulated data set).
Performance measures used in simulation studies are typically aggregated across all
simulation repetitions. For example, the bias is estimated as the mean deviation between
parameter estimates and the true parameter across all repetitions. It can often be informative,
especially when building and reviewing a simulation study, to also look at other quantities
than the mean, for example, the median or other quantiles, or to visualize the distribution, for
example, with violin or box plots of parameter estimates, p-values, or Bayes factors. This
strategy may be useful for two reasons. First, it can help uncover errors in the simulation
design if the distribution of performance estimates violates expectations from theoretical work
Note to Table 3. $\hat{\theta}$ is an estimator of the estimand $\theta$, and $\hat{\theta}_i$ is the estimate obtained from simulation $i$. $\mathbb{1}(\mathrm{CI}_i \text{ includes } \theta)$ and $\mathbb{1}(\mathrm{Test}_i \text{ rejects } H_0)$ are 1 if the respective event occurred in simulation $i$ and 0 otherwise. $\widehat{\mathrm{MSE}}$, $\widehat{\mathrm{Cov}}$, and $\widehat{\mathrm{Pow}}$ denote the estimated MSE, coverage, and power, respectively. $\mathrm{MCSE}^{*}$ denotes the desired MCSE when calculating the number of repetitions $n_{\mathrm{sim}}$. The sample variance of the estimates is $S^2_{\hat{\theta}} = \sum_{i=1}^{n_{\mathrm{sim}}} \{\hat{\theta}_i - (\sum_{j=1}^{n_{\mathrm{sim}}} \hat{\theta}_j / n_{\mathrm{sim}})\}^2 / (n_{\mathrm{sim}} - 1)$. The sample variance of the squared errors is $S^2_{(\hat{\theta}-\theta)^2} = \sum_{i=1}^{n_{\mathrm{sim}}} [(\hat{\theta}_i - \theta)^2 - \{\sum_{j=1}^{n_{\mathrm{sim}}} (\hat{\theta}_j - \theta)^2 / n_{\mathrm{sim}}\}]^2 / (n_{\mathrm{sim}} - 1)$. The sample variance of the CI widths is $S^2_{W} = \sum_{i=1}^{n_{\mathrm{sim}}} [(\mathrm{CI}_{i,\mathrm{upper}} - \mathrm{CI}_{i,\mathrm{lower}}) - \{\sum_{j=1}^{n_{\mathrm{sim}}} (\mathrm{CI}_{j,\mathrm{upper}} - \mathrm{CI}_{j,\mathrm{lower}}) / n_{\mathrm{sim}}\}]^2 / (n_{\mathrm{sim}} - 1)$. The sample variance of a generic statistic $G$ is $S^2_{G} = \sum_{i=1}^{n_{\mathrm{sim}}} \{G_i - (\sum_{j=1}^{n_{\mathrm{sim}}} G_j / n_{\mathrm{sim}})\}^2 / (n_{\mathrm{sim}} - 1)$, with $G_i$ the statistic obtained from simulation $i$; for example, $G$ may be a measure of predictive performance.
approximate MCSEs associated with the estimated performance measures. All MCSEs are
based on the assumptions of independent simulations and approximate normality of the
estimated performance measures. More accurate jackknife-based MCSEs are available
through various R packages such as rsimsum (Gasparini, 2018) and simhelpers (Joshi &
Pustejovsky, 2022). The SimDesign R package (Chalmers & Adkins, 2020) can compute
confidence intervals for performance measures via bootstrapping.
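As a minimal base-R sketch of how such quantities can be computed by hand, the following toy example estimates bias and coverage together with their approximate MCSEs as described above; the data-generating setup is purely illustrative.

```r
# Toy setup: n_sim repetitions of estimating a mean with true value theta = 0
set.seed(1)
n_sim     <- 1000
theta     <- 0
samples   <- replicate(n_sim, rnorm(50, mean = theta), simplify = FALSE)
theta_hat <- sapply(samples, mean)
ci        <- t(sapply(samples, function(x) t.test(x)$conf.int))

# Bias and its Monte Carlo standard error
bias      <- mean(theta_hat) - theta
bias_mcse <- sd(theta_hat) / sqrt(n_sim)

# Coverage and its Monte Carlo standard error
covered       <- ci[, 1] <= theta & theta <= ci[, 2]
coverage      <- mean(covered)
coverage_mcse <- sqrt(coverage * (1 - coverage) / n_sim)

round(c(bias = bias, bias_mcse = bias_mcse,
        coverage = coverage, coverage_mcse = coverage_mcse), 4)
```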
MCSEs (or other measures of uncertainty, including visual representations) should be
provided alongside the estimates of performance to indicate the associated uncertainty. Failing
to calculate and report Monte Carlo uncertainty can lead to erroneous interpretations of results
and unsupported claims about the performance of different methods (see, e.g., the illustration
by Koehler et al., 2009). In situations where MCSEs are tiny relative to the estimated
performance and may distract, one could, for example, provide the maximum MCSE across
all conditions to give the reader reassurance about the worst case.
When planning a simulation study, researchers should choose a number of simulation
repetitions that ensures a desired precision for estimating the chosen performance measures.
The last column of Table 3 gives simple formulae for this purpose. Many of these depend on
quantities that are not known but have to be estimated from the simulated data. For example,
the MCSE of the estimated coverage depends on the coverage itself. In this case, one can
either assume a certain value for which the desired MCSE should be achieved (e.g., 95%),
take a “worst-case value” in the sense that the MCSE is maximal for a given number of
repetitions (this would be 50% for coverage), or estimate it from a small pilot study (e.g.,
taking the estimated coverage closest to 50% across all conditions and methods obtained in
the pilot study). The latter approach may be especially advisable for performance measures
where there is no conventional benchmark, such as 95% is for coverage.
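To illustrate the calculation, the sketch below computes the required number of repetitions from a desired MCSE, using the worst-case value of 50% for proportion-type measures (coverage, rejection rates) and a pilot-based standard deviation for bias; the pilot value shown is purely illustrative.

```r
# Repetitions needed so that the MCSE of a proportion-type performance measure
# (coverage, Type I error rate, power) does not exceed mcse_target;
# p = 0.5 is the worst case because p * (1 - p) is then maximal.
n_sim_proportion <- function(mcse_target, p = 0.5) {
  ceiling(p * (1 - p) / mcse_target^2)
}

# Repetitions needed for the MCSE of bias, given a (pilot-based) estimate of
# the standard deviation of the estimates across repetitions.
n_sim_bias <- function(mcse_target, sd_theta_hat) {
  ceiling((sd_theta_hat / mcse_target)^2)
}

n_sim_proportion(mcse_target = 0.005)               # 10000 repetitions
n_sim_bias(mcse_target = 0.01, sd_theta_hat = 0.5)  # 2500 repetitions (illustrative SD)
```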
In practice, it can be challenging to define what it means for an MCSE to be
“sufficiently small”. A. S. Cohen, Kane, and Kim (2001) provide some guidelines on how to
decide on the desired precision based on the size of the effects under study. Essentially, the
number of repetitions must be chosen large enough such that the MCSE is sufficiently small
compared to the relevant effect of interest (e.g., if a change in coverage of 1% is a relevant
effect, the MCSE for the estimated coverage should be less than that). However, what exactly
constitutes a relevant effect must be decided by researchers on a case-by-case basis, as to our
knowledge there are no standards. This parallels the challenges in traditional sample size
calculations, where researchers must also decide on a minimum effect size of interest.
Finally, during the design phase of a simulation study with clear expectations about the
performance of different methods, researchers may also wish to specify in advance what
constitutes a “relevant difference” in performance, or what constitutes “acceptable” and
“unacceptable” levels of performance, to avoid post-hoc interpretation of performance. Such
studies may be seen as “confirmatory” methodological research (Herrmann et al., 2024). For
example, it could be stated that a Type I error rate greater than 5% defines unacceptable
performance, or that a method X is considered to perform better than a method Y in a given
simulation condition if the estimated performance of method X minus its MCSE is greater
than the estimated performance of method Y plus its MCSE. Again, this is similar to
traditional sample size calculations where researchers need to decide on a minimum effect
size of interest they want to detect (Anvari & Lakens, 2021). While this can be difficult in
practice, it forces researchers to think thoroughly about the problem at hand, so investing this
time comes with the benefit of higher clarity of expectations and interpretation.
Reporting
As with any experiment, transparent reporting of study design, execution, and results is
essential to put the outcomes from a simulation study into context. The ADEMP structure is a
useful template for researchers to follow when reporting the design and results of their
simulation study. Furthermore, the results should be reported in a way that clearly answers the
main research questions and acknowledges the uncertainty associated with the estimated
performance. It is often difficult to find a balance between streamlining the results of
simulation studies for the reader and exhaustively reporting all conditions in detail. However,
it is important that researchers avoid selectively reporting only certain conditions that favor
their preferred method or are in line with their expectations, as this can lead to overoptimism
(Pawel et al., 2024).
Figures are often helpful for interpreting large quantities of results and identifying
general trends. However, for most plot types, there is a limit to how many factors can be
communicated visually (see section 7.2 in Morris et al., 2019, for some recommendations, see
also Rücker & Schwarzer, 2014). On the other hand, presenting results only with figures can
hinder the accurate interpretation of results and also make it more difficult for researchers
replicating the simulation study to verify whether they have been successful (Luijken et al.,
2023). Figures should therefore ideally be combined with quantitative summaries of results,
such as tables or graphical tables containing both numerical and graphic elements.2 For
complex simulation designs with a large number of conditions, the communication of results
can be improved using interactive tools such as R Shiny applications (Chang et al., 2023, see
e.g., Carter, Schönbrodt, Gervais, & Hilgard, 2019; Gasparini, Morris, & Crowther, 2021).
Computational aspects
2 See the documentation of the gtExtras R package (Mock, 2024) for examples.
Researchers can also share intermediate and final results of their simulation
studies, such as the simulated data or parameter estimates of computed models. This enables
independent reproduction and evaluation of the results by other researchers without the full
computational effort that large simulation studies require.
Information on the computational environment and operating system is relevant to
reproduce simulation studies. Different software packages or package versions can lead to
different results, even when the apparently same method is used (Hodges et al., 2023).
Operating systems can differ in a variety of aspects that may subtly influence the results of
analyses (Glatard et al., 2015). There are several helpful tools that facilitate sharing
information on the computational environment and operating system. For example, when
using R, the output of the sessionInfo() command includes information about the
operating system, R package versions, and auxiliary dependencies (e.g., the installed linear
algebra programs such as BLAS/LAPACK). Furthermore, Peikert and Brandmaier (2021) and
Epskamp (2019) provide accessible tips for reproducible workflows in R, which can serve as a
starting guide for other statistical software as well. For instance, in advanced workflows, a
snapshot of the current version of all software required to reproduce the analysis is stored
(e.g., via Docker or the R package renv, Ushey & Wickham, 2023).
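For R users, a low-effort version of such a workflow could look as follows; the file name is arbitrary and renv is only one of several options for pinning package versions.

```r
# Archive the computational environment alongside the simulation code:
# sessionInfo() reports the R version, operating system, loaded packages
# with their versions, and the linear algebra libraries (BLAS/LAPACK).
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Optionally, record exact package versions in a lockfile with renv, so that
# others can later restore the same library via renv::restore().
# install.packages("renv"); renv::init() sets up the project library on first use
renv::snapshot()
```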
An important computational aspect of simulation studies is the use of pseudo-random
numbers. It is important to initialize the random number generator with a seed and to store
this seed so that the same sequence of pseudo-random numbers can be reproduced in the
future (assuming other dependencies, such as operating system and software versions, remain
the same). The primary purpose of the seed is to ensure computational reproducibility and to
facilitate debugging. At the same time, the seed should not matter for simulation studies with
a sufficient number of repetitions, because the seed should have a negligible effect on the
results (estimated performance measures, patterns, and conclusions). Things become more
complicated when multiple cores, clusters, or computers are used for running the simulation
study since the seed has to be set for each parallelized instance to ensure reproducibility. One
solution is to use “streams” rather than seeds, which fixes the random number generator to the
actual starting position in the deterministic sequence of generated numbers (Morris et al.,
2019). Streams are available in Stata, SAS, and R.3 However, when using streams, one needs
to know how many pseudo-random numbers are required per instance, so that the streams can
be set to avoid overlap. This can be challenging, especially when the methods evaluated in the
simulation study also use pseudo-random numbers (e.g., Markov Chain Monte Carlo sampling
or bootstrap methods).
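In R, for example, reproducible parallel streams can be set up with the parallel package; the sketch below is a generic illustration in which the analysis of each simulated data set is replaced by a placeholder.

```r
library(parallel)

# Reproducible parallel simulation with L'Ecuyer-CMRG random number streams:
# each worker receives its own, non-overlapping stream derived from one seed.
# (Results are reproducible for a fixed number of workers and task split.)
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 2024)

results <- parLapply(cl, seq_len(1000), function(i) {
  x <- rnorm(50)            # placeholder for generating and analyzing one data set
  c(estimate = mean(x))
})
stopCluster(cl)

estimates <- vapply(results, function(r) r[["estimate"]], numeric(1))
```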
Literature review
In this section, we use the ADEMP structure to assess the current state of simulation
studies in psychology. For each ADEMP component, we summarize the findings, highlight
their relevance, and suggest improvements for future simulation studies. We compare some of
our results with the results of Morris et al. (2019) who reviewed 100 simulation studies
published in Statistics in Medicine. Visual summaries of the review are provided in Figure 2
and Figure 3. Table 4 summarizes the most common pitfalls we encountered during the
review. The preregistration, data, and code to reproduce the results are available at the Open
Science Framework (https://osf.io/dfgvu/).
We extracted 321 articles until we reached 100 articles containing at least one
simulation study. We extracted articles by going through the 2022 issues of the journals in
chronological order. After assessing the number of articles containing a simulation study from
each journal, we then continued chronologically in the 2021 issues, aiming for a roughly equal
split of simulation studies from the three journals.4 The proportion of articles containing a
simulation study (31.2%) was considerably lower than the 75.4% reported by Morris et al.
(2019) for articles published in Statistics in Medicine. The
lower proportion of simulation studies in our review is mainly due to articles in BRM, which
generally published the most articles, but only 15.6% of them contained a simulation study.
We extracted roughly equal numbers of articles containing a simulation study from the three
journals, with 32 from BRM and 35 each from PM and MBR. Of these articles, 63 contained
only a single simulation study, while the rest contained up to 6 simulation studies (see Panel A of Figure 3).
3 Random number streams are available in R through different packages, e.g., parallel (R Core Team, 2023), rstream (Leydold, 2022), or doRNG (Gaujoux, 2023).
4 Due to an oversight, we did not review the last issue of BRM in 2022 but rather continued with the first 2021 issues of the journals.
Figure 2
Common Issues of Simulation Studies in Psychology as Identified in the Literature Review.
Note. 100 articles were reviewed that included simulation studies and were published in Psychological Methods, Behavior Research Methods, and Multivariate Behavioral Research in 2021 and 2022.
Three authors (BS, FB, SP) each reviewed around one-third of all simulation studies
and assessed the overall confidence in their rating of each study as “low”, “medium”, or
“high”. To assess inter-rater agreement, each rater also reviewed six studies that were assigned
to the other raters and which had a “low” or “medium” confidence rating, thereby representing
the most challenging simulation studies that were reviewed. Nevertheless, an agreement larger
than 75% was achieved for the majority of questions (Median = 83.3%). The lowest
agreement was with respect to whether the estimands were stated and the number of estimands
(above 30%). All studies where we disagreed about the number of estimands were studies
about latent variable models, where it was often unclear which parameters were of interest and
how their number varied across conditions, with many studies even showing varying numbers
Figure 3
Descriptive Results from Literature Review of Simulation Studies in Psychology.
Note. 100 articles were reviewed that included simulation studies and were published in Psychological Methods, Behavior Research Methods, and Multivariate Behavioral Research in 2021 and 2022. In Panel J, absolute and relative bias are combined in the bias category. In
Panel E, partially factorial and one-at-a-time are combined. Within-panel totals are greater
than 100 in panels H, J, K, L due to the possibility of more than one category.
of parameters per condition. The results of the agreement analysis are shown in Figure 5 in the
Appendix.
Aims
In 94% of the reviewed articles, the aims of the study were defined in some form. We
did not quantify how specifically or vaguely the aims were defined, although they were often
rather vague (“We conducted a simulation study to evaluate the performance of
method X”). By far, most studies had estimation as one of their statistical tasks (68%),
followed by hypothesis testing (21%) and model selection (9%; Panel H in Figure 3). This
resembles the results of Morris et al., who also found these three tasks to be the most
prominent ones with similar frequencies.
Data-generating mechanism
In our review, the clear majority of simulation studies (83%) generated data based on
parametric models with parameters specified by researchers (‘parametric customized’), while
15% were directly based on parameter estimates from real data (Panel B of Figure 3). The
remaining 2% used resampling techniques. In almost all of the studies (95%), the
data-generating parameters were provided, which mirrors the results from Morris et al. (91%
of studies). Nevertheless, our view is that many of the reviewed papers could have benefited from
describing the data-generating mechanism in a more structured way to facilitate easy
comprehension and replication.
Researchers used between 1 and 6,000 simulation conditions (Median = 16; Panel C in
Figure 3). In these, they varied between 1 and 7 factors, with 1 and 3 being the most common
choices (Panel D in Figure 3). Of all designs, 58% were fully factorial, meaning that all
possible combinations of factor levels were investigated. Moreover, 37% of the studies were
either partially factorial or varied factors one-at-a-time (including studies with a single design
factor) and 5% used distinct scenarios in a scattershot design (Panel E in Figure 3). As in
experimental psychology, a fully factorial design enables the study of the main and interaction
effects of the varied factors. In our review, some studies made use of this fact by using analysis
of variance to assess the effects of simulation factors (see also Chipman & Bingham, 2022).
The number of repetitions per simulation condition ranged between 1 and 1,000,000
Table 4
Step: Pitfalls
Data-generating mechanism: Not summarizing simulation conditions and data-generating mechanism in a structured way (e.g., bullet points, tables); not providing justification and Monte Carlo uncertainty coupled with a small number of simulation repetitions.
Estimands and other targets: Not defining estimands/targets clearly, especially in models with many parameters.
Methods: Not clearly listing all of the compared methods and their specifications.
Computational aspects: Not reporting the computational environment (operating system, software, and package versions); not using persistent repositories for sharing code and data (e.g., publisher or university repositories).
Note. Pitfalls were not all coded explicitly, but summarized from the quantitative results of the literature review and discussions between the reviewing authors.
(Panel F in Figure 3). The median number was 900, whereas the most frequently selected
options were 1,000 repetitions followed by 500 repetitions, similar to the results from Morris
et al. However, in 17% of studies, at least some of the performance results were aggregated
across multiple parameters (such as the average bias across factor loadings), leading to higher
precision. Only 8% of the studies provided a justification for the specific number of
repetitions used, while only 3% of these actually performed a calculation of the required
number of repetitions (Panel A in Figure 2). This is very similar to the results from Morris et
al., who also found only 4% of studies presenting a justification for their choice of the number
of repetitions. This lack of justification is, unfortunately, consistent with the findings from
similar surveys of the methodological literature (Harwell et al., 2018; Hauck & Anderson,
1984; Hoaglin & Andrews, 1975; Koehler et al., 2009). Of course, this does not rule out the
possibility that the study authors chose their number of repetitions in some informed way
(e.g., by visually assessing whether Monte Carlo uncertainty was sufficiently small) without
explicitly reporting their rationale.
Estimands and other targets
In 20% of the studies, the estimands or targets of the simulation were either not
reported or unclear to us. Of those that were clear, most studies focused only on a single
estimand, while the median number of estimands was 4. In at least 17% of the studies,
estimated performance measures related to different estimands were later aggregated to
calculate average performance, while this was unclear in 4% of studies. We noticed that
especially when evaluating models with many parameters, such as latent variable models or
certain time series models, it can easily become unclear which parameters are of interest.
Clear definition and reporting of estimands and (potentially aggregated) performance
measures is particularly important in these situations.
Methods
While the number of methods evaluated in the simulation studies ranged from 1 to
192, more than half (65%) evaluated 3 or fewer methods, and 24% evaluated only a single
method (Panel G in Figure 3).
Performance measures
Presentation of results
Simulation results were most commonly reported in the text of an article and
accompanied by tables and figures (Panel K in Figure 3). The vast majority of studies (77%)
did not report the uncertainty of performance measures (Panel B in Figure 2), despite our
liberal approach of including visualizations such as box plots as indicative of Monte Carlo
uncertainty. This proportion is comparable to that of Morris et al. (2019), who, using a stricter
criterion, counted 93% of studies not reporting Monte Carlo standard errors. To cite two positive
examples from our review, J. Liu and Perera (2022) ran a pilot simulation study to obtain the
empirical standard errors for parameter estimates, which they then used to calculate the
needed number of repetitions to keep the MCSE below a desired level. Rodgers and
Pustejovsky (2021) provided the upper bound of the MCSE of their performance measures to
indicate their precision.
Computational aspects
R was the most commonly used statistical software to conduct simulation studies and
was used in 77% of the studies (Panel L in Figure 3). Notably, the software used was unclear
or not mentioned in 9% of the studies. In Morris et al. (2019), 38% of the studies did not
mention the software used for their simulation study. In around half of the studies we
reviewed, authors also indicated that they used some form of user-written commands, such as
custom model code, or packages for their simulations. To fully understand these simulations,
it would be crucial to share code alongside the manuscript. However, code was not available
for almost two-thirds of the simulation studies (64%; Panel C in Figure 2). This also includes cases in
which code was supposed to be provided, but the repository was not available, and cases in
which code was supposedly available “upon (reasonable) request”. In multiple cases, authors
supposedly provided code on the journal website or on a university homepage, but the code
was not available at the designated location. Our results are similar to the findings of
Kucharský, Houtkoop, and Visser (2020), who analyzed articles in three methodological
journals (including Psychological Methods and Behavior Research Methods) and found that
56% of studies that contained coded analyses did not share their code. Of the 36% of studies
in our review in which code was provided, 21% also provided a seed in their code. We did not
check if this seed and the supplied code would be sufficient to reproduce the reported results.
Beyond the code and software used, we reviewed whether articles contained
information on the computational environment and operating system used. We coded
information on the computational environment as “fully” when packages with versions and
auxiliary dependencies were provided, for example in a “sessionInfo” output from R or via a
Docker container. We rated the information as “partially” or “minimal” when the main
packages used were reported with or without versions, respectively. Full information on the
computational environment was only reported in 2% of the studies, while 24% did not report
on their computational environment at all (Panel D in Figure 2). Even more studies (93%) did
not provide any information whatsoever on their operating system. Full information (naming
the operating system and its version) was provided in 4% of the studies, while 3% at least
provided the operating system without stating its version. Papenberg and Klau (2021) are a
positive example: they included full information on the computational environment and
operating system they used, as well as code and data to reproduce the simulations.
5 We used version 0.1.0 of the template in this example as archived at https://doi.org/10.5281/zenodo.10057884.
Aims
The aim of the simulation study is to evaluate the hypothesis testing and estimation
characteristics of different methods for estimating the treatment effect in pre–post
measurement experiments. We compare three different methods (ANCOVA, change score
analysis, and post score analysis) in terms of power and Type I error rate related to the
hypothesis test of no effect, and bias related to the treatment effect estimate. We vary the true
treatment effect and the correlation of pre- and post-measurements.
Data-generating mechanism
We generate pre–post measurements for a control and an experimental group (indexed by g)
from a bivariate normal distribution,

$$\begin{pmatrix} Y_{g,1} \\ Y_{g,2} \end{pmatrix} \sim \mathcal{N}_2\left\{\begin{pmatrix} 0 \\ \mu_{g,2} \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right\}, \qquad (1)$$

where the first argument of the normal distribution in (1) is the mean vector and the second
argument the covariance matrix. The numerical subscript 1 indicates measurement time “pre”
and 2 indicates “post”. The parameter µg,2 denotes the post-treatment mean. It is fixed to zero
in the control group (µcontrol,2 = 0), whereas it is varied across simulation conditions in the
experimental group. The parameter ρ denotes the pre–post correlation and is also varied
across simulation conditions.
We use the following values for the manipulated parameters of the data-generating
mechanism:
• post-treatment mean in the experimental group: µexp,2 ∈ {0, 0.2, 0.5}
• pre–post measurement correlation: ρ ∈ {0, 0.5, 0.7}
We vary the conditions in a fully factorial manner which results in 3 (post-treatment mean in
experimental group) × 3 (pre–post measurement correlation) = 9 simulation conditions. We
select the specific values as they correspond to the conventions for no, small, and medium
standardized mean difference effect sizes in psychology (J. Cohen, 2013). The pre–post
measurement correlations correspond to no, one quarter, and approximately one half of shared
variance that, based on our experience, are both realistic and also allow us to observe
differences between the examined methods.
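To make the data-generating mechanism concrete, the following sketch generates one data set for a single simulation condition with the mvtnorm package; the per-group sample size is not part of the description above, and the value of 50 is used here only for illustration.

```r
library(mvtnorm)

# Generate one pre-post data set for a single simulation condition.
# mu_exp2: post-treatment mean in the experimental group (0, 0.2, or 0.5)
# rho:     pre-post correlation (0, 0.5, or 0.7)
# n:       per-group sample size (illustrative assumption)
generate_data <- function(n = 50, mu_exp2 = 0.5, rho = 0.5) {
  Sigma   <- matrix(c(1, rho, rho, 1), nrow = 2)        # unit variances, correlation rho
  control <- rmvnorm(n, mean = c(0, 0), sigma = Sigma)  # mu_control,2 = 0
  treated <- rmvnorm(n, mean = c(0, mu_exp2), sigma = Sigma)
  data.frame(
    pre       = c(control[, 1], treated[, 1]),
    post      = c(control[, 2], treated[, 2]),
    treatment = rep(c(0, 1), each = n)
  )
}

dat <- generate_data()
```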
Estimands and other targets
Our primary target is the null hypothesis of no difference between the outcomes of the
control and treatment groups. Our secondary estimand is the treatment effect size defined as
the expected difference between the experimental and the control group measurements at
time-point two,

$$\theta = \mathrm{E}(Y_{\mathrm{exp},2}) - \mathrm{E}(Y_{\mathrm{control},2}),$$

for which the true value is given by the parameter µexp,2 for the considered data-generating
mechanisms.
Methods
Both change score and post score analysis can be seen as special cases of ANCOVA. Change
score analysis fixes the pre coefficient to 1 (using the offset() function), and post score
analysis omits the pre variable from the model (effectively fixing its coefficient to 0).
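As a sketch (assuming a data set dat with columns pre, post, and a 0/1 treatment indicator, as in the data-generating sketch above), the three models can be fit with lm() as follows.

```r
# ANCOVA: adjust the post score for the pre score
fit_ancova <- lm(post ~ pre + treatment, data = dat)

# Change score analysis: fix the pre coefficient to 1 via offset()
# (equivalent to lm(I(post - pre) ~ treatment, data = dat))
fit_change <- lm(post ~ offset(pre) + treatment, data = dat)

# Post score analysis: omit pre, i.e., fix its coefficient to 0
fit_post <- lm(post ~ treatment, data = dat)

# Treatment effect estimate, standard error, t statistic, and p-value (ANCOVA)
summary(fit_ancova)$coefficients["treatment", ]
```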
Performance measures
Our primary performance measures are the Type I error rate (in conditions where the
true effect is zero) and the power (in conditions where the true effect is non-zero) to reject the
6 An alternative way of writing this model is lm(I(post - pre) ~ treatment).
null hypothesis of no difference between the control and treatment condition. The null
hypothesis is rejected if the two-sided t-test p-value for the null hypothesis of no effect is less
than or equal to the conventional threshold of 0.05. The rejection rate (the Type I error rate or
the power, depending on the data generating mechanism) is estimated by
$$\widehat{\mathrm{Rate}} = \frac{\sum_{i=1}^{n_{\mathrm{sim}}} \mathbb{1}(p_i \leq 0.05)}{n_{\mathrm{sim}}},$$
where 1(pi ≤ 0.05) is the indicator of whether the p-value in simulation i is equal to or less
than 0.05. We use the following formula to compute the MCSE of the estimated rejection rate
$$\mathrm{MCSE}_{\widehat{\mathrm{Rate}}} = \sqrt{\frac{\widehat{\mathrm{Rate}}\,(1 - \widehat{\mathrm{Rate}})}{n_{\mathrm{sim}}}}.$$
Our secondary performance measure is the bias of the treatment effect estimate. It is estimated
by
$$\widehat{\mathrm{Bias}} = \frac{\sum_{i=1}^{n_{\mathrm{sim}}} \hat{\theta}_i}{n_{\mathrm{sim}}} - \theta,$$
where θ is the true treatment effect and θ̂i is the effect estimate from simulation i. We compute
the MCSE of the estimated bias with
$$\mathrm{MCSE}_{\widehat{\mathrm{Bias}}} = \frac{S_{\hat{\theta}}}{\sqrt{n_{\mathrm{sim}}}},$$
where $S_{\hat{\theta}} = \sqrt{\frac{1}{n_{\mathrm{sim}}-1} \sum_{i=1}^{n_{\mathrm{sim}}} \left\{\hat{\theta}_i - \frac{1}{n_{\mathrm{sim}}} \sum_{j=1}^{n_{\mathrm{sim}}} \hat{\theta}_j\right\}^2}$ is the sample standard deviation of the effect estimates.
Based on these performance measures, we perform 10,000 repetitions per condition.
This number is determined by using the formulae from Table 3 in Siepe et al. (2023) aiming
for a 0.005 MCSE of the Type I error rate and power under the worst-case performance (50%
rejection rate: $0.50 \times (1 - 0.50) / 0.005^2 = 10{,}000$), which we deem to be sufficiently
accurate for estimating power and Type I error rate for all practical purposes. Our simulation
protocol also illustrates how to determine the number of repetitions for bias based on a small
pilot study.
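A compact base-R sketch of one simulation condition, combining the data-generating and model-fitting sketches above, could look as follows; the per-group sample size and the reduced number of repetitions are illustrative and differ from the values used in the actual study.

```r
set.seed(2024)
n_sim   <- 1000   # reduced from 10,000 here to keep the sketch fast
mu_exp2 <- 0.5    # one example condition with a non-zero effect
rho     <- 0.5

res <- t(replicate(n_sim, {
  dat   <- generate_data(n = 50, mu_exp2 = mu_exp2, rho = rho)
  fit   <- lm(post ~ pre + treatment, data = dat)   # ANCOVA
  coefs <- summary(fit)$coefficients
  c(estimate = unname(coefs["treatment", "Estimate"]),
    p_value  = unname(coefs["treatment", "Pr(>|t|)"]))
}))

# Rejection rate (power here, since the true effect is non-zero) and its MCSE
rate      <- mean(res[, "p_value"] <= 0.05)
rate_mcse <- sqrt(rate * (1 - rate) / n_sim)

# Bias of the treatment effect estimate and its MCSE
bias      <- mean(res[, "estimate"]) - mu_exp2
bias_mcse <- sd(res[, "estimate"]) / sqrt(n_sim)
```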
Computational aspects
The simulation study is performed using R version 4.3.1 (R Core Team, 2023) and the
following R packages: the mvtnorm package (Version 1.2-3, Genz & Bretz, 2009) to
generate data, the lm() function included in the stats package (Version 4.3.1, R Core
Team, 2023) to fit the different models, the SimDesign package (Version 2.13, Chalmers &
Adkins, 2020) to set up and run the simulation study, and the ggplot2 package (Version
3.4.4, Wickham, 2016) to create visualizations. We executed the simulation study on a system
running Ubuntu 22.04.4 LTS. A sessionInfo output with more information on the
computational environment, a Dockerfile to reproduce it, and code and data to reproduce the
study and its analysis are available at the Open Science Framework
(https://osf.io/dfgvu/).
Results
Figure 4 shows the results of the simulation study visually; Table 6 in the Appendix
shows the same results numerically. No missing/non-convergent values were observed. We
see from the Effect = 0 panel/rows that all methods maintain a Type I error rate close to 5%
irrespective of the correlation between the pre–post measurements. For non-zero effects, when
the pre–post measurement correlation is zero, ANCOVA and post-score analysis exhibit
similar levels of power and both surpass change-score analysis. However, when the pre–post
measure correlation increases to 0.7, change-score analysis shows higher power than
post-score analysis, yet ANCOVA shows higher power than both other methods.
The lower panels in Figure 4 show the estimated bias of the methods. We see that all
methods had essentially equivalent bias across all simulation conditions. Furthermore, the bias
of all methods in all conditions was close to zero and, given the very small MCSEs, can be
considered as negligible.
In sum, under the investigated scenarios, all methods produced unbiased effect
estimates while ANCOVA consistently showed the highest power among the three methods.
In almost all conditions, the Type I error rate was within one MCSE of the nominal rate of 5%,
and all differences from the nominal rate were smaller than 0.5%. For this simple setting and the
methods under study, there is a substantial amount of statistical theory that explains and
predicts our results (see, e.g., Senn, 2008), which is not often the case for simulation studies.
Our findings are also in line with previous simulation studies (Vickers, 2001).
Figure 4
Estimated Rejection Rate (Power / Type I Error Rate, Depending on the Data-Generating Mechanism) and Bias of ANCOVA, Change Score Analysis, and Post Score Analysis.
Note. Error bars correspond to ±1 Monte Carlo standard error. The y-axis in the bias plot is scaled only from −0.01 to 0.01, meaning that the bias can be considered negligible.
Table 5
Recommendations for Methodological Research Using Simulation Studies.
Recommendation
1. Provide a rationale for all relevant choices in design and analysis (e.g., justifications for data-generating mechanisms and the number of simulation repetitions).
2. Use a standardized structure for planning and reporting of simulation studies (e.g., ADEMP).
3. Report Monte Carlo uncertainty (e.g., Monte Carlo standard errors, uncertainty visualizations).
5. Write (and possibly preregister) a study protocol to guide simulation design and to disclose the state of knowledge, prior expectations, and evaluation criteria before seeing the results (e.g., using the ADEMP-PreReg template).
9. Upload code, data, results, and other supplements to a FAIR research data repository (e.g., OSF or Zenodo).
10. Journals/Editors/Reviewers: Promote higher reporting standards and open code/data (e.g., requiring code and data sharing or reproducibility checks).
The number of repetitions should be chosen such that a desired Monte Carlo standard error is
achieved. The formulae in Table 3 can be used for this purpose. The choice of
the desired MCSE, and hence the number of repetitions required, is embedded in the trade-off
between the generalizability and the precision of a simulation study. Researchers aiming for
high precision in their performance measure estimates will usually be able to study fewer
conditions, restricting the scope of their investigation and limiting the external validity and
generalizability of their results. Therefore, setting the number of simulation repetitions too
high can also waste computational resources that could be better spent investigating additional
settings. Choosing the number of simulation repetitions to achieve desired precision, as
explained in our article, can help researchers to make informed choices in this trade-off.
However, even when simulation studies are carefully designed in advance, their scope is often
narrow compared to all possible realistic settings. Researchers should avoid discrepancies
between the scope of their simulation and the generality with which their results are reported.
Preregistration of a study protocol helps to make a transparent distinction between
knowledge, decisions, and evaluation criteria that were present before or after the results were
observed. At the same time, preregistration does not mean that the researcher’s hands are tied
and that modifications to the study cannot be made, but rather that they should be
transparently disclosed through amendments to the protocol. Fortunately, the issue of
“double-dipping” on the same data to formulate and test hypotheses is less of a problem in
simulation studies as new data can typically be generated cheaply (with certain exceptions,
such as bootstrap or Monte Carlo methods). Rather, the purpose of preregistered protocols is
to guide the planning of rigorous simulation studies, to provide other researchers with a
transparent picture of the research process, and potentially receive peer feedback independent
of the results. This concerns especially the selection of methods, data-generating mechanisms,
conditions, and performance measures, which are often highly flexible in simulation studies.
Researchers can also obtain feedback on their protocols from other researchers, especially if
the protocol is publicly available (see, e.g., Kipruto & Sauerbrei, 2022). Moreover, fixing the
criteria for the evaluation of the results a priori protects researchers from cognitive biases in
the interpretation of results, such as hindsight bias, confirmation bias, or allegiance bias, that
can blur their interpretation of simulation study results. Our ADEMP-PreReg template
(https://zenodo.org/doi/10.5281/zenodo.10057883, development version:
https://github.com/bsiepe/ADEMP-PreReg) can be used for preparing a
(possibly preregistered) simulation study protocol, as a blueprint for the structured reporting
of a simulation study, or as guidance document when reviewing a simulation study. In future
work, this may be extended to a standardized reporting checklist created by a panel of experts
on simulation studies, similar to risk-of-bias assessment tools for randomized controlled trials
(Sterne et al., 2019) or reporting guidelines for prediction models in health care (Collins,
Reitsma, Altman, & Moons, 2015).
To foster computational reproducibility and enable other researchers to build on a
simulation study, we strongly recommend sharing code, data, and other supplementary
material. We recommend uploading files to a research data repository that accords with the
FAIR principles (Wilkinson et al., 2016), such as OSF or Zenodo, as we encountered various
dead links in our review, even from journals. Zenodo, in particular, offers great integration
with GitHub which facilitates developing simulation code on GitHub (using the git version
control system) and then archiving time-stamped versions or snapshots of the repository in a
FAIR way on Zenodo with one click. Moreover, we recommend reporting software versions
and the computational environment used to run the study in detail. For example, for R users
(the vast majority of researchers based on our review), we recommend at least reporting the
output of sessionInfo() in the supplementary material or code repository as a low-effort
step for reporting necessary software versions and the computational environment. Ideally,
data files containing the full output of a simulation study should be shared if possible.
Besides researchers conducting simulation studies themselves, other academic
stakeholders can help raise the standards of methodological research. For example, during the
peer-review process, reviewers and editors can encourage proper design and reporting of
simulation studies, for instance, by guiding authors to justify the number of repetitions or to
report Monte Carlo standard errors. Similarly, journals can promote higher standards for
simulation studies by requiring authors to share code and/or data for articles that include
simulation studies. This seems appropriate since conclusions from simulation studies heavily
depend on the validity of their underlying code, and since there are usually no ethical concerns
with publishing code and simulated data (with the exception of studies with data generating
mechanisms based on resampling, where sharing the resampled data could be problematic).
Mandatory code and data sharing, along with reproducibility checks and reproducibility
badges, have already been adopted by several journals, for example, Meta-Psychology or
Biometrical Journal which both have dedicated reproducibility teams that (partially) rerun
simulation studies of submitted articles (Hofner, Schmid, & Edler, 2015; Lindsay, 2023). In a
similar vein, journals could have specific calls for the replication and/or generalization of
influential simulation studies (Giordano & Waller, 2020; Lohmann, Astivia, Morris, &
Groenwold, 2022; Luijken et al., 2023).
Conclusions
References
Abrahamowicz, M., Beauchamp, M.-E., Boulesteix, A.-L., Morris, T. P., Sauerbrei, W.,
Kaufman, J. S., on behalf of the STRATOS Simulation Panel. (2024). Data-driven
simulations to assess the impact of study imperfections in time-to-event analyses.
American Journal of Epidemiology, kwae058. doi: 10.1093/aje/kwae058
American Psychological Association. (2020). Publication manual of the American
Psychological Association (7th ed.). American Psychological Association.
Anvari, F., & Lakens, D. (2021). Using anchor-based methods to determine the smallest effect
size of interest. Journal of Experimental Social Psychology, 96, 104159. doi:
10.1016/j.jesp.2021.104159
Bartoš, F., Maier, M., Wagenmakers, E.-J., Doucouliagos, H., & Stanley, T. (2023). Robust
Bayesian meta-analysis: Model-averaging across complementary publication bias
adjustment methods. Research Synthesis Methods, 14(1), 99–116. doi:
10.1002/jrsm.1594
Binder, H., Sauerbrei, W., & Royston, P. (2012). Comparison between splines and fractional
polynomials for multivariable model building with continuous covariates: a simulation
study with continuous response. Statistics in Medicine, 32(13), 2262–2277. doi:
10.1002/sim.5639
Boomsma, A. (2013). Reporting Monte Carlo studies in structural equation modeling.
Structural Equation Modeling: A Multidisciplinary Journal, 20(3), 518–540. doi:
10.1080/10705511.2013.797839
Boulesteix, A.-L. (2015). Ten simple rules for reducing overoptimistic reporting in
methodological computational research. PLoS Computational Biology, 11(4),
e1004191. doi: 10.1371/journal.pcbi.1004191
Boulesteix, A.-L., Binder, H., Abrahamowicz, M., & Sauerbrei, W. (2018). On the necessity
and design of studies comparing statistical methods. Biometrical Journal, 60(1),
216–218. doi: 10.1002/bimj.201700129
Boulesteix, A.-L., Groenwold, R. H., Abrahamowicz, M., Binder, H., Briel, M., Hornung, R.,
. . . Sauerbrei, W. (2020). Introduction to statistical simulations in health research. BMJ
Hoaglin, D. C., & Andrews, D. F. (1975). The reporting of computation-based results in
statistics. The American Statistician, 29(3), 122–126. doi:
10.1080/00031305.1975.10477393
Hodges, C. B., Stone, B. M., Johnson, P. K., Carter III, J. H., Sawyers, C. K., Roby, P. R., &
Lindsey, H. M. (2023). Researcher degrees of freedom in statistical software contribute
to unreliable results: A comparison of nonparametric analyses conducted in SPSS, SAS,
Stata, and R. Behavior Research Methods, 55(6), 2813–2837. doi:
10.3758/s13428-022-01932-2
Hofner, B., Schmid, M., & Edler, L. (2015). Reproducible research in statistics: A review and
guidelines for the Biometrical Journal. Biometrical Journal, 58(2), 416–427. doi:
10.1002/bimj.201500156
Hong, S., & Reed, W. R. (2021). Using Monte Carlo experiments to select meta-analytic
estimators. Research Synthesis Methods, 12(2), 192–215. doi: 10.1002/jrsm.1467
Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A
Multidisciplinary Journal, 6(1), 1–55. doi: 10.1080/10705519909540118
ICH. (2019). Addendum on estimands and sensitivity analyses in clinical trials to the
guideline on statistical principles for clinical trials, ICH E9(R1).
https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf.
Joshi, M., & Pustejovsky, J. (2022). simhelpers: Helper functions for simulation studies
[Computer software manual]. Retrieved from
https://CRAN.R-project.org/package=simhelpers (R package version 0.1.2)
Keene, O. N., Lynggaard, H., Englert, S., Lanius, V., & Wright, D. (2023). Why estimands
are needed to define treatment effects in clinical trials. BMC Medicine, 21(1). doi:
10.1186/s12916-023-02969-6
Kelter, R. (2023). The Bayesian simulation study (BASIS) framework for simulation studies
Liu, J., & Perera, R. A. (2022). Estimating knots and their association in parallel bilinear
spline growth curve models in the framework of individual measurement occasions.
Psychological Methods, 27(5), 703. doi: 10.1037/met0000309
Lohmann, A., Astivia, O. L. O., Morris, T. P., & Groenwold, R. H. H. (2022). It’s time! Ten
reasons to start replicating simulation studies. Frontiers in Epidemiology, 2. doi:
10.3389/fepid.2022.973470
Lüdtke, O., & Robitzsch, A. (2023). ANCOVA versus change score for the analysis of
two-wave data. The Journal of Experimental Education, Advance Online Publication.
doi: 10.1080/00220973.2023.2246187
Luijken, K., Lohmann, A., Alter, U., Gonzalez, J. C., Clouth, F. J., Fossum, J. L., . . .
Groenwold, R. H. H. (2023). Replicability of simulation studies for the investigation of
statistical methods: The RepliSims project. doi: 10.48550/arXiv.2307.02052
Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What is your estimand? Defining the
target quantity connects statistical evidence to theory. American Sociological Review,
86(3), 532–565. doi: 10.1177/00031224211004187
McNeish, D., & Bauer, D. J. (2022). Reducing incidence of nonpositive definite covariance
matrices in mixed effect models. Multivariate Behavioral Research, 57(2-3), 318–340.
doi: 10.1080/00273171.2020.1830019
McNeish, D., Lane, S., & Curran, P. (2018). Monte Carlo simulation methods. The
Reviewer’s Guide to Quantitative Methods in the Social Sciences, 269–276.
Mock, T. (2024). gtExtras: Extending 'gt' for beautiful HTML tables [Computer software
manual]. Retrieved from https://github.com/jthomasmock/gtExtras (R package
version 0.5.0.9005, https://jthomasmock.github.io/gtExtras/)
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate
statistical methods. Statistics in Medicine, 38(11), 2074–2102. doi: 10.1002/sim.8086
Moss, J., Wong, A. Y., Durriseau, J. A., & Bradshaw, G. L. (2022). Tracking strategy changes
using machine learning classifiers. Behavior Research Methods, 54, 1–23. doi:
10.3758/s13428-021-01720-4
Moulder, R. G., Daniel, K. E., Teachman, B. A., & Boker, S. M. (2022). Tangle: A metric for
quantifying complexity and erratic behavior in short time series. Psychological
Methods, 27(1), 82.
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., du Sert,
N. P., . . . Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature
Human Behaviour, 1(0021). doi: 10.1038/s41562-016-0021
Murayama, K., Usami, S., & Sakaki, M. (2022). Summary-statistics-based power analysis: A
new and practical method to determine sample size for mixed-effects modeling.
Psychological Methods, 27(6), 1014–1038. doi: 10.1037/met0000330
Nießl, C., Hoffmann, S., Ullmann, T., & Boulesteix, A.-L. (2023). Explaining the optimistic
performance evaluation of newly proposed methods: A cross-design validation
experiment. Biometrical Journal, 2200238. doi: 10.1002/bimj.202200238
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into
equivalent parts. Psychological Methods, 26(2), 161–174. doi: 10.1037/met0000301
Pawel, S., Kook, L., & Reeve, K. (2024). Pitfalls and potentials in simulation studies:
Questionable research practices in comparative simulation studies allow for spurious
claims of superiority of any method. Biometrical Journal, 66(1), 2200091. doi:
10.1002/bimj.202200091
Paxton, P., Curran, P. J., Bollen, K. A., Kirby, J., & Chen, F. (2001). Monte Carlo
experiments: Design and implementation. Structural Equation Modeling: A
Multidisciplinary Journal, 8(2), 287–312. doi: 10.1207/S15328007SEM0802_7
Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). A simulation
study of the number of events per variable in logistic regression analysis. Journal of
Clinical Epidemiology, 49(12), 1373–1379. doi: 10.1016/s0895-4356(96)00236-3
Peikert, A., & Brandmaier, A. M. (2021). A reproducible data analysis workflow.
Quantitative and Computational Methods in Behavioral Sciences, 1(Article e3763). doi:
10.5964/qcmb.3763
Psychometric Society. (1979). Publication policy regarding Monte Carlo studies.
Skrondal, A. (2000). Design and analysis of Monte Carlo experiments: Attacking the
conventional wisdom. Multivariate Behavioral Research, 35(2), 137–167. doi:
10.1207/s15327906mbr3502_1
Sterne, J. A., Savović, J., Page, M. J., Elbers, R. G., Blencowe, N. S., Boutron, I., . . . Higgins,
J. P. (2019). RoB 2: A revised tool for assessing risk of bias in randomised trials. BMJ,
366, l4898. doi: 10.1136/bmj.l4898
Stolte, M., Schreck, N., Slynko, A., Saadati, M., Benner, A., Rahnenführer, J., & Bommert, A.
(2024). Simulation study to evaluate when plasmode simulation is superior to
parametric simulation in estimating the mean squared error of the least squares estimator
in linear regression. PLOS ONE, 19(5), e0299989. doi: 10.1371/journal.pone.0299989
Strobl, C., & Leisch, F. (2022). Against the “one method fits all data sets” philosophy for
comparison studies in methodological research. Biometrical Journal. doi:
10.1002/bimj.202200104
Ushey, K., & Wickham, H. (2023). renv: Project environments [Computer software manual].
(https://rstudio.github.io/renv/, https://github.com/rstudio/renv)
Van Breukelen, G. J. (2013). ANCOVA versus CHANGE from baseline in nonrandomized
studies: The difference. Multivariate Behavioral Research, 48(6), 895–922. doi:
10.1080/00273171.2013.831743
van der Bles, A. M., Van Der Linden, S., Freeman, A. L., Mitchell, J., Galvao, A. B., Zaval,
L., & Spiegelhalter, D. J. (2019). Communicating uncertainty about facts, numbers and
science. Royal Society Open Science, 6(5), 181870. doi: 10.1098/rsos.181870
van Smeden, M., de Groot, J. A. H., Moons, K. G. M., Collins, G. S., Altman, D. G.,
Eijkemans, M. J. C., & Reitsma, J. B. (2016). No rationale for 1 variable per 10 events
criterion for binary logistic regression analysis. BMC Medical Research Methodology,
16, 163. doi: 10.1186/s12874-016-0267-3
Vickers, A. J. (2001). The use of percentage change from baseline as an outcome in a
controlled trial is statistically inefficient: A simulation study. BMC Medical Research
Methodology, 1(6). doi: 10.1186/1471-2288-1-6
White, I. R., Pham, T. M., Quartagno, M., & Morris, T. P. (2023). How to check a simulation
study. International Journal of Epidemiology, 53(1), dyad134. doi: 10.1093/ije/dyad134
Appendix
A summary of inter-rater agreement for the studies that were most difficult to code is
provided in Figure 5. The wording of the questions and their answer options is available in
the preregistration of our review (https://osf.io/8cbfd). Overall, the agreement between
raters appears to be acceptable. Agreement was lowest for questions about the estimands,
where ambiguous reporting combined with the complexity of some models often made it
difficult to determine the specific number of estimands. Because we assessed agreement on
the studies we had identified as most difficult to code, we consider the proportions reported
here a lower bound for the overall agreement across all studies. At the same time, the higher
rates of disagreement for some questions again indicate the need for clearer reporting of
simulation studies. In cases of disagreement, we kept the rating of the initial reviewer for the
analyses in the manuscript.
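To make the agreement measure concrete, the following R sketch illustrates how a proportion of full agreement across three raters can be computed for a set of coded items; the ratings shown are made up for illustration and this is not the code underlying Figure 5.

# Illustrative sketch of computing proportion agreement between three raters
# (hypothetical ratings, not the data from our review)
ratings <- data.frame(
  rater1 = c("yes", "no",  "yes", "partially", "no"),
  rater2 = c("yes", "no",  "no",  "partially", "no"),
  rater3 = c("yes", "yes", "no",  "partially", "no")
)

# an item counts as agreement only if all three raters gave the same answer
full_agreement <- apply(ratings, 1, function(x) length(unique(x)) == 1)
mean(full_agreement)   # proportion of items with full agreement (here 3/5 = 0.6)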
Numerical results of the example simulation study on methods for the analysis of
pre–post measurements are given in Table 6.
Figure 5
Agreement between Raters
Note. Proportion of agreement between the three raters for 15 papers with a low or medium
confidence rating. Two studies that were also rated for agreement were not assessed here, as
the raters chose different simulation studies therein.
Table 6
Numerical Results of the Example Simulation Study on Methods for the Analysis of Pre–Post
Measurements

                             Rejection rate                                       Bias
Correlation  Effect  ANCOVA           Change score     Post score       ANCOVA            Change score      Post score
0.00         0.00    0.0447 (0.0021)  0.0508 (0.0022)  0.0464 (0.0021)  −0.0003 (0.0020)  0.0006 (0.0028)   −0.0005 (0.0020)
0.00         0.20    0.1655 (0.0037)  0.1111 (0.0031)  0.1671 (0.0037)  −0.0023 (0.0020)  −0.0010 (0.0029)  −0.0024 (0.0020)
0.00         0.50    0.6907 (0.0046)  0.4137 (0.0049)  0.6940 (0.0046)  −0.0009 (0.0020)  −0.0004 (0.0028)  −0.0009 (0.0020)
0.50         0.00    0.0496 (0.0022)  0.0474 (0.0021)  0.0500 (0.0022)  −0.0008 (0.0017)  −0.0013 (0.0020)  −0.0008 (0.0020)
0.50         0.20    0.2108 (0.0041)  0.1715 (0.0038)  0.1646 (0.0037)  0.0015 (0.0017)   0.0004 (0.0020)   0.0022 (0.0020)
0.50         0.50    0.8130 (0.0039)  0.6978 (0.0046)  0.6973 (0.0046)  0.0021 (0.0018)   0.0020 (0.0020)   0.0018 (0.0020)
0.70         0.00    0.0487 (0.0022)  0.0503 (0.0022)  0.0500 (0.0022)  −0.0021 (0.0014)  −0.0032 (0.0015)  −0.0001 (0.0020)
0.70         0.20    0.2782 (0.0045)  0.2459 (0.0043)  0.1641 (0.0037)  −0.0023 (0.0014)  −0.0027 (0.0016)  −0.0008 (0.0020)
0.70         0.50    0.9296 (0.0026)  0.8913 (0.0031)  0.6944 (0.0046)  0.0013 (0.0014)   0.0012 (0.0016)   0.0011 (0.0020)

Note. The first three method columns show rejection rates and the last three show bias for the
ANCOVA, change score, and post score analyses; Monte Carlo standard errors are given in
parentheses.
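For readers who want to recompute entries of this kind, the following R sketch illustrates how a rejection rate and a bias estimate, together with their Monte Carlo standard errors, can be obtained from per-repetition results; the data-generating mechanism, sample size, and parameter values below are toy assumptions and do not reproduce the settings behind Table 6.

# Illustrative sketch of computing a rejection rate and bias with Monte Carlo standard errors
# (toy data-generating mechanism, not the one used for Table 6)
set.seed(1)
n_sim    <- 10000                      # number of simulation repetitions
true_eff <- 0.2                        # assumed true treatment effect
g        <- rep(c(0, 1), each = 50)    # group indicator (control vs. treatment)

est  <- numeric(n_sim)                 # estimated treatment effects
pval <- numeric(n_sim)                 # p-values for the treatment effect
for (i in seq_len(n_sim)) {
  pre  <- rnorm(100)                                  # toy pre-test scores
  post <- 0.5 * pre + true_eff * g + rnorm(100)       # toy post-test scores
  fit  <- lm(post ~ g + pre)                          # ANCOVA-type analysis
  est[i]  <- coef(fit)["g"]
  pval[i] <- summary(fit)$coefficients["g", "Pr(>|t|)"]
}

rejection_rate <- mean(pval < 0.05)
mcse_rejection <- sqrt(rejection_rate * (1 - rejection_rate) / n_sim)
bias           <- mean(est) - true_eff
mcse_bias      <- sd(est) / sqrt(n_sim)
round(c(rejection_rate, mcse_rejection, bias, mcse_bias), 4)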