
ANALYSIS

Power failure: why small sample size undermines the reliability of neuroscience
Katherine S. Button1,2, John P. A. Ioannidis3, Claire Mokrysz1, Brian A. Nosek4,
Jonathan Flint5, Emma S. J. Robinson6 and Marcus R. Munafò1
Abstract | A study with low statistical power has a reduced chance of detecting a true effect,
but it is less well appreciated that low power also reduces the likelihood that a statistically
significant result reflects a true effect. Here, we show that the average statistical power of
studies in the neurosciences is very low. The consequences of this include overestimates of
effect size and low reproducibility of results. There are also ethical dimensions to this
problem, as unreliable research is inefficient and wasteful. Improving reproducibility in
neuroscience is a key priority and requires attention to well-established but often ignored
methodological principles.

1School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK. 2School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK. 3Stanford University School of Medicine, Stanford, California 94305, USA. 4Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA. 5Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK. 6School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK.
Correspondence to M.R.M. e-mail: marcus.munafo@bristol.ac.uk
doi:10.1038/nrn3475
Published online 10 April 2013
Corrected online 15 April 2013

It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly ‘clean’ results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect the most prominent journals at least as much, if not more9,10.

Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is under-appreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, ‘positive’) results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We next empirically show that statistical power is typically low in the field of neuroscience by using evidence from a range of subfields within the neuroscience literature. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.

Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
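The third problem, exaggerated effect estimates among significant results in underpowered studies, can be illustrated with a short Monte Carlo sketch. This is a hypothetical example, not the paper's analysis: the true effect (d = 0.5), group size and simulation count are arbitrary choices, and a normal approximation to the two-sample t-test is used.

```python
import random
from statistics import NormalDist, mean

# Illustrative "winner's curse" simulation: with a true effect of
# d = 0.5 and n = 15 per group, power is low, and the subset of
# nominally significant studies systematically overestimates d.
random.seed(1)
true_d, n, alpha = 0.5, 15, 0.05
se = (2 / n) ** 0.5                              # SE of observed d (normal approx.)
crit = NormalDist().inv_cdf(1 - alpha / 2) * se  # critical |d| for p < 0.05

significant = []
for _ in range(20000):
    d_hat = random.gauss(true_d, se)  # one simulated study's estimate
    if abs(d_hat) > crit:             # study reaches nominal significance
        significant.append(d_hat)

print(mean(significant))  # well above the true effect of 0.5
```

Only around a quarter of the simulated studies reach significance, and their average estimate is substantially larger than the true effect, which is the inflation described above.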

NATURE REVIEWS | NEUROSCIENCE VOLUME 14 | MAY 2013 | 365

© 2013 Macmillan Publishers Limited. All rights reserved



Data were extracted from forest plots, tables and text. Some articles reported several meta-analyses. In those cases, we included multiple meta-analyses only if they contained distinct study samples. If several meta-analyses had overlapping study samples, we selected the most comprehensive (that is, the one containing the most studies) or, if the number of studies was equal, the first analysis presented in the article. Data extraction was independently performed by K.S.B. and either M.R.M. or C.M. and verified collaboratively.

The following data were extracted for each meta-analysis: first author and summary effect size estimate of the meta-analysis; and first author, publication year, sample size (by groups), number of events in the control group (for odds/risk ratios) and nominal significance (p < 0.05, ‘yes/no’) of the contributing studies. For five articles, nominal study significance was unavailable and was therefore obtained from the original studies if they were electronically available. Studies with missing data (for example, due to unclear reporting) were excluded from the analysis.

The main outcome measure of our analysis was the achieved power of each individual study to detect the estimated summary effect reported in the corresponding meta-analysis to which it contributed, assuming an α level of 5%. Power was calculated using G*Power software23. We then calculated the mean and median statistical power across all studies.

Results. Our search strategy identified 246 articles published in 2011, out of which 155 were excluded after an initial screening of either the abstract or the full text. Of the remaining 91 articles, 48 were eligible for inclusion in our analysis24–71, comprising data from 49 meta-analyses and 730 individual primary studies. A flow chart of the article selection process is shown in FIG. 2, and the characteristics of included meta-analyses are described in TABLE 1.

Our results indicate that the median statistical power in neuroscience is 21%. We also applied a test for an excess of statistical significance72. This test has recently been used to show that there is an excess significance bias in the literature of various fields, including in studies of brain volume abnormalities73, Alzheimer’s disease genetics70,74 and cancer biomarkers75. The test revealed that the actual number (349) of nominally significant studies in our analysis was significantly higher than the number expected (254; p < 0.0001). Importantly, these calculations assume that the summary effect size reported in each study is close to the true effect size, but it is likely that they are inflated owing to publication and other biases described above.

Interestingly, across the 49 meta-analyses included in our analysis, the average power demonstrated a clear bimodal distribution (FIG. 3). Most meta-analyses comprised studies with very low average power — almost 50% of studies had an average power lower than 20%. However, seven meta-analyses comprised studies with high (>90%) average power24,26,31,57,63,68,71. These seven meta-analyses were all broadly neurological in focus and were based on relatively small contributing studies — four out of the seven meta-analyses did not include any study with over 80 participants. If we exclude these ‘outlying’ meta-analyses, the median statistical power falls to 18%.

Small sample sizes are appropriate if the true effects being estimated are genuinely large enough to be reliably observed in such samples. However, as small studies are particularly susceptible to inflated effect size estimates and publication bias, it is difficult to be confident in the evidence for a large effect if small studies are the sole source of that evidence. Moreover, many meta-analyses show small-study effects on asymmetry tests (that is, smaller studies have larger effect sizes than larger ones) but nevertheless use random-effect calculations, and this is known to inflate the estimate of summary effects (and thus also the power estimates). Therefore, our power calculations are likely to be extremely optimistic76.

Empirical evidence from specific fields
One limitation of our analysis is the under-representation of meta-analyses in particular subfields of neuroscience, such as research using neuroimaging and animal models. We therefore sought additional representative meta-analyses from these fields outside our 2011 sampling frame to determine whether a similar pattern of low statistical power would be observed.

Neuroimaging studies. Most structural and volumetric MRI studies are very small and have minimal power to detect differences between compared groups (for example, healthy people versus those with mental health diseases). A clear excess significance bias has been demonstrated in studies of brain volume abnormalities73, and similar problems appear to exist in fMRI studies of the blood-oxygen-level-dependent response77. In order to establish the average statistical power of studies of brain volume abnormalities, we applied the same analysis as described above to data that had been previously extracted to assess the presence of an excess of significance bias73. Our results indicated that the median statistical power of these studies was 8% across 461 individual studies contributing to 41 separate meta-analyses, which were drawn from eight articles that were published between 2006 and 2009. Full methodological details describing how studies were identified and selected are available elsewhere73.

Animal model studies. Previous analyses of studies using animal models have shown that small studies consistently give more favourable (that is, ‘positive’) results than larger studies78 and that study quality is inversely related to effect size79–82. In order to examine the average power in neuroscience studies using animal models, we chose a representative meta-analysis that combined data from studies investigating sex differences in water maze performance (number of studies (k) = 19, summary effect size Cohen’s d = 0.49) and radial maze performance (k = 21, summary effect size d = 0.69)80. The summary effect sizes in the two meta-analyses provide evidence for medium to large effects, with the male and female performance differing by 0.49 to 0.69 standard deviations
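The excess-significance test applied above compares the observed number of nominally significant studies with the number expected given each study's power. The sketch below uses made-up powers rather than the paper's extracted data, and simplifies to a binomial test with equal powers (the actual test72 accommodates unequal powers across studies):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing at least
    k significant results among n studies if each has power p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical field: 30 studies, each with 21% power.
n_studies, power = 30, 0.21
expected = n_studies * power   # significant studies expected if unbiased
observed = 14                  # hypothetical observed count

p_excess = binom_sf(observed, n_studies, power)
print(expected, p_excess)  # far more significant results than power predicts
```

A small p_excess indicates that the literature contains more significant results than the studies' power can plausibly deliver, which is the signature of excess significance bias.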




Table 2 | Sample size required to detect sex differences in water maze and radial maze performance

              Total animals   Required N per study     Typical N per study   Detectable effect for typical N
              used            80% power   95% power    Mean    Median        80% power   95% power
Water maze    420             134         220          22      20            d = 1.26    d = 1.62
Radial maze   514             68          112          24      20            d = 1.20    d = 1.54

Meta-analysis indicated an effect size of Cohen’s d = 0.49 for water maze studies and d = 0.69 for radial maze studies.
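The required sample sizes in Table 2 can be reproduced approximately with a standard two-sample power calculation. The sketch below uses the normal approximation to the power function, so its results land slightly below the exact noncentral-t values (as computed by G*Power) that the table reports:

```python
from math import ceil
from statistics import NormalDist

def total_n(d, power=0.80, alpha=0.05):
    """Approximate total N (two equal groups) for a two-sided
    two-sample t-test, via the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for alpha = 0.05
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    per_group = ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)
    return 2 * per_group

# Water maze (d = 0.49) and radial maze (d = 0.69) at 80% and 95% power;
# compare with the table's 134/220 and 68/112.
for d in (0.49, 0.69):
    print(d, total_n(d, 0.80), total_n(d, 0.95))
```

The approximation gives 132/218 and 66/110 animals, within two of the exact figures, and makes the key scaling explicit: halving the detectable effect size roughly quadruples the required sample.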

80% power, and the average sample size of 24 animals for the radial maze experiments was only sufficient to detect an effect size of d = 1.20. In order to achieve 80% power to detect, in a single study, the most probable true effects as indicated by the meta-analysis, a sample size of 134 animals would be required for the water maze experiment (assuming an effect size of d = 0.49) and 68 animals for the radial maze experiment (assuming an effect size of d = 0.69); to achieve 95% power, these sample sizes would need to increase to 220 and 112, respectively. What is particularly striking, however, is the inefficiency of a continued reliance on small sample sizes. Despite the apparently large numbers of animals required to achieve acceptable statistical power in these experiments, the total numbers of animals actually used in the studies contributing to the meta-analyses were even larger: 420 for the water maze experiments and 514 for the radial maze experiments.

There is ongoing debate regarding the appropriate balance to strike between using as few animals as possible in experiments and the need to obtain robust, reliable findings. We argue that it is important to appreciate the waste associated with an underpowered study — even a study that achieves only 80% power still presents a 20% possibility that the animals have been sacrificed without the study detecting the underlying true effect. If the average power in neuroscience animal model studies is between 20% and 30%, as we observed in our analysis above, the ethical implications are clear.

Low power therefore has an ethical dimension — unreliable research is inefficient and wasteful. This applies to both human and animal research. The principles of the ‘three Rs’ in animal research (reduce, refine and replace)83 require appropriate experimental design and statistics — both too many and too few animals present an issue as they reduce the value of research outputs. A requirement for sample size and power calculation is included in the Animal Research: Reporting In Vivo Experiments (ARRIVE) guidelines84, but such calculations require a clear appreciation of the expected magnitude of effects being sought.

Of course, it is also wasteful to continue data collection once it is clear that the effect being sought does not exist or is too small to be of interest. That is, studies are not just wasteful when they stop too early, they are also wasteful when they stop too late. Planned, sequential analyses are sometimes used in large clinical trials when there is considerable expense or potential harm associated with testing participants. Clinical trials may be stopped prematurely in the case of serious adverse effects, clear beneficial effects (in which case it would be unethical to continue to allocate participants to a placebo condition) or if the interim effects are so unimpressive that any prospect of a positive result with the planned sample size is extremely unlikely85. Within a significance testing framework, such interim analyses — and the protocol for stopping — must be planned for the assumptions of significance testing to hold. Concerns have been raised as to whether stopping trials early is ever justified given the tendency for such a practice to produce inflated effect size estimates86. Furthermore, the decision process around stopping is not often fully disclosed, increasing the scope for researcher degrees of freedom86. Alternative approaches exist. For example, within a Bayesian framework, one can monitor the Bayes factor and simply stop testing when the evidence is conclusive or when resources

[Figure 4 plot: post-study probability (%) on the y axis against pre-study odds R on the x axis, with curves for 80%, 30% and 10% power.]
Figure 4 | Positive predictive value as a function of the pre-study odds of association for different levels of statistical power. The probability that a research finding reflects a true effect — also known as the positive predictive value (PPV) — depends on both the pre-study odds of the effect being true (the ratio R of ‘true effects’ over ‘null effects’ in the scientific field) and the study’s statistical power. The PPV can be calculated for given values of statistical power (1 − β), pre-study odds ratio (R) and type I error rate (α), using the formula PPV = ([1 − β] × R) / ([1 − β] × R + α). The median statistical power of studies in the neuroscience field is optimistically estimated to be between ~8% and ~31%. The figure illustrates how low statistical power consistent with this estimated range (that is, between 10% and 30%) detrimentally affects the association between the probability that a finding reflects a true effect (PPV) and pre-study odds, assuming α = 0.05. Compared with conditions of appropriate statistical power (that is, 80%), the probability that a research finding reflects a true effect is greatly reduced for 10% and 30% power, especially if pre-study odds are low. Notably, in an exploratory research field such as much of neuroscience, the pre-study odds are often low.
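The PPV formula in the Figure 4 caption can be evaluated directly; a minimal sketch reproducing the qualitative pattern in the figure:

```python
def ppv(power, r, alpha=0.05):
    """Positive predictive value: probability that a nominally
    significant finding reflects a true effect, given statistical
    power (1 - beta), pre-study odds r and type I error rate alpha."""
    return (power * r) / (power * r + alpha)

# At modest pre-study odds (r = 0.2), PPV collapses as power drops.
for power in (0.80, 0.30, 0.10):
    print(power, round(ppv(power, 0.2), 2))  # 0.76, 0.55, 0.29
```

At 80% power, roughly three out of four significant findings are true; at 10% power, fewer than one in three are, even before any of the biases discussed above are taken into account.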
