
ANALYSIS

Power failure: why small sample size undermines the reliability of neuroscience

Katherine S. Button1,2, John P. A. Ioannidis3, Claire Mokrysz1, Brian A. Nosek4, Jonathan Flint5, Emma S. J. Robinson6 and Marcus R. Munafò1
Abstract | A study with low statistical power has a reduced chance of detecting a true effect,
but it is less well appreciated that low power also reduces the likelihood that a statistically
significant result reflects a true effect. Here, we show that the average statistical power of
studies in the neurosciences is very low. The consequences of this include overestimates of
effect size and low reproducibility of results. There are also ethical dimensions to this
problem, as unreliable research is inefficient and wasteful. Improving reproducibility in
neuroscience is a key priority and requires attention to well-established but often ignored
methodological principles.

1 School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK.
2 School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK.
3 Stanford University School of Medicine, Stanford, California 94305, USA.
4 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA.
5 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK.
6 School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK.
Correspondence to M.R.M. e-mail: marcus.munafo@bristol.ac.uk
doi:10.1038/nrn3475
Published online 10 April 2013; corrected online 15 April 2013

It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly ‘clean’ results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect the most prominent journals at least as much, if not more9,10.

Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is under-appreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, ‘positive’) results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We then show empirically, using evidence from a range of subfields within the neuroscience literature, that statistical power is typically low in the field of neuroscience. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.

Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
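The dependence of the PPV on power can be made concrete with the standard formula PPV = ([1 − β] × R)/([1 − β] × R + α), where 1 − β is the statistical power, α the type I error rate and R the pre-study odds that a probed effect is truly non-null (see BOX 1). A minimal sketch, with an illustrative value of R chosen by us:

```python
# Positive predictive value (PPV) of a nominally significant claim:
# PPV = ((1 - beta) * R) / ((1 - beta) * R + alpha), where (1 - beta) is the
# statistical power, alpha the type I error rate and R the pre-study odds.
def ppv(power, R, alpha=0.05):
    return (power * R) / (power * R + alpha)

# Illustrative pre-study odds of 1:5 (R = 0.2): at 80% power a significant
# result reflects a true effect ~76% of the time; at 20% power, only ~44%.
for power in (0.80, 0.20):
    print(f"power = {power:.0%}: PPV = {ppv(power, R=0.2):.2f}")
```

Holding α and R fixed, reducing power alone drags a majority of significant claims towards, and below, the coin-flip mark as hypotheses become less likely a priori.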


Table 1 (cont.) | Characteristics of included meta-analyses

Study | k  | N, median (range) | Cohen’s d | OR    | Random or fixed effects | Power, median (range) | Refs
Yang  | 3  | 51 (18–205)       | 0.67      | –     | NA                      | 0.65 (0.27–1.00)      | 67
Yuan  | 14 | 116.5 (19–1178)   | –         | 4.98  | Fixed                   | 0.92 (0.33–1.00)      | 68
Zafar | 8  | 78.5 (46–483)     | –         | 1.07* | Random                  | 0.05 (0.00–0.06)      | 69
Zhang | 12 | 337.5 (39–901)    | –         | 1.27  | Random                  | 0.14 (0.01–0.30)      | 70
Zhu   | 8  | 110 (48–371)      | 0.84      | –     | Random                  | 0.97 (0.81–1.00)      | 71

The choice of fixed or random effects model was made by the original authors of the meta-analysis. k, number of studies; NA, not available; OR, odds ratio. * indicates the relative risk.
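The power column can be approximated from the summary effect size and each study's sample size. A minimal sketch — assuming a two-sided two-sample t-test at α = 0.05 with the median N split evenly between groups, which is our simplification rather than the authors' exact procedure — approximately recovers the median power of the Yang row:

```python
# Approximate per-study power from a summary effect size d and total N,
# assuming a two-sided two-sample t-test (alpha = 0.05) with N split evenly.
from statsmodels.stats.power import TTestIndPower

def study_power(d, n_total, alpha=0.05):
    n1 = n_total // 2
    return TTestIndPower().power(effect_size=d, nobs1=n1,
                                 ratio=(n_total - n1) / n1, alpha=alpha)

# Yang row: d = 0.67 at the median study size N = 51 gives power ~0.65,
# in line with the 0.65 (0.27-1.00) reported in the table.
print(round(study_power(0.67, 51), 2))
```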

for water maze and radial maze, respectively. Our results indicate that the median statistical power for the water maze studies and the radial maze studies to detect these medium to large effects was 18% and 31%, respectively (TABLE 2). The average sample size in these studies was 22 animals for the water maze and 24 for the radial maze experiments. Studies of this size can only detect very large effects (d = 1.26 for n = 22, and d = 1.20 for n = 24) with 80% power — far larger than those indicated by the meta-analyses. These animal model studies were therefore severely underpowered to detect the summary effects indicated by the meta-analyses. Furthermore, the summary effects are likely to be inflated estimates of the true effects, given the problems associated with small studies described above.

The results described in this section are based on only two meta-analyses, and we should be appropriately cautious in extrapolating from this limited evidence. Nevertheless, it is notable that the results are so consistent with those observed in other fields, such as the neuroimaging and neuroscience studies that we have described above.
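The minimum detectable effects quoted above follow from a standard power routine. A minimal sketch, assuming a two-sided two-sample t-test at α = 0.05 with the animals split evenly between two groups:

```python
# Smallest standardized effect detectable with 80% power at the animal-model
# sample sizes above, assuming a two-sided two-sample t-test (alpha = 0.05)
# with the total number of animals split evenly between two groups.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for n_total in (22, 24):
    d = solver.solve_power(effect_size=None, nobs1=n_total // 2,
                           ratio=1.0, alpha=0.05, power=0.80)
    print(f"N = {n_total}: minimum detectable d ~ {d:.2f}")
# -> d ~1.26 for N = 22 and d ~1.20 for N = 24, matching the text.
```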
Implications
Implications for the likelihood that a research finding reflects a true effect. Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuroscience. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications for the field. A major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small. As explained above, the probability that a research finding reflects a true effect (PPV) decreases as statistical power decreases for any given pre-study odds (R) and a fixed type I error level. It is easy to show the impact that this is likely to have on the reliability of findings. FIGURE 4 shows how the PPV changes for a range of values for R and for a range of values for the average power in a field. For effects that are genuinely non-null, FIG. 5 shows the degree to which an effect size estimate is likely to be inflated in initial studies — owing to the winner’s curse phenomenon — for a range of values for statistical power.
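The winner’s curse can also be demonstrated directly by simulation. A minimal sketch (our own illustration with arbitrary parameter choices, not the analysis behind FIG. 5): probe a modest true effect with underpowered samples and average only the estimates that reach p < 0.05:

```python
# Winner's curse by simulation: with low power, only overestimates of the
# true effect cross the significance threshold, so the significant (that is,
# 'publishable') effect size estimates are systematically inflated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n = 0.3, 20                      # modest true effect, 20 per group (~14% power)
significant_estimates = []
for _ in range(20_000):
    a = rng.normal(0.0, 1.0, n)          # control group
    b = rng.normal(true_d, 1.0, n)       # treatment group
    t, p = stats.ttest_ind(b, a)
    if p < 0.05 and t > 0:               # keep only significant, correct-sign results
        sd_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        significant_estimates.append((b.mean() - a.mean()) / sd_pooled)

print(f"true d = {true_d}; mean significant estimate ~ "
      f"{np.mean(significant_estimates):.2f}")   # roughly 0.8: inflated ~2.5-fold
```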


Figure 3 | Median power of studies included in neuroscience meta-analyses. The figure shows a histogram of median study power calculated for each of the n = 49 meta-analyses included in our analysis, with the number of meta-analyses (N) on the left axis and the percentage of meta-analyses (%) on the right axis; the horizontal axis groups median study power into 10-percentage-point bins, from 0–10% to 91–100%. There is a clear bimodal distribution: n = 15 (31%) of the meta-analyses comprised studies with median power of less than 11%, whereas n = 7 (14%) comprised studies with high average power in excess of 90%. Despite this bimodality, most meta-analyses comprised studies with low statistical power: n = 28 (57%) had median study power of less than 31%. The meta-analyses (n = 7) that comprised studies with high average power in excess of 90% had their broadly neurological subject matter in common.

The estimates shown in FIGS 4,5 are likely to be optimistic, however, because they assume that statistical power and R are the only considerations in determining the probability that a research finding reflects a true effect. As we have already discussed, several other biases are also likely to reduce the probability that a research finding reflects a true effect. Moreover, the summary effect size estimates that we used to determine the statistical power of individual studies are themselves likely to be inflated owing to bias — our excess of significance test provided clear evidence for this. Therefore, the average statistical power of studies in our analysis may in fact be even lower than the 8–31% range we observed.

Ethical implications. Low average power in neuroscience studies also has ethical implications. In our analysis of animal model studies, the average sample size of 22 animals for the water maze experiments was only sufficient to detect an effect size of d = 1.26 with 80% power.


Box 2 | Recommendations for researchers

Perform an a priori power calculation
Use the existing literature to estimate the size of effect you are looking for and design your study accordingly. If time or financial constraints mean your study is underpowered, make this clear and acknowledge this limitation (or limitations) in the interpretation of your results. (A worked example follows this box.)

Disclose methods and findings transparently
If the intended analyses produce null findings and you move on to explore your data in other ways, say so. Null findings locked in file drawers bias the literature, whereas exploratory analyses are only useful and valid if you acknowledge the caveats and limitations.

Pre-register your study protocol and analysis plan
Pre-registration clarifies whether analyses are confirmatory or exploratory, encourages well-powered studies and reduces opportunities for non-transparent data mining and selective reporting. Various mechanisms for this exist (for example, the Open Science Framework).

Make study materials and data available
Making research materials available will improve the quality of studies aimed at replicating and extending research findings. Making raw data available will enhance opportunities for data aggregation and meta-analysis, and allow external checking of analyses and results.

Work collaboratively to increase power and replicate findings
Combining data increases the total sample size (and therefore power) while minimizing the labour and resource impact on any one contributor. Large-scale collaborative consortia in fields such as human genetic epidemiology have transformed the reliability of findings in these fields.
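The box's first recommendation is mechanical to implement: the power routine is simply inverted to give a required sample size. A minimal sketch, where d_expected is a hypothetical literature-derived estimate and the two-sample t-test framing is one common choice:

```python
# A priori power calculation: invert the power analysis to get the sample
# size needed to detect a literature-based effect estimate with 80% power
# (two-sided two-sample t-test, alpha = 0.05, equal group sizes assumed).
from statsmodels.stats.power import TTestIndPower

d_expected = 0.5   # hypothetical estimate taken from prior literature
n_per_group = TTestIndPower().solve_power(effect_size=d_expected, nobs1=None,
                                          alpha=0.05, power=0.80, ratio=1.0)
print(f"required n per group ~ {n_per_group:.0f}")   # ~64 per group
```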
Registration of confirmatory analysis plan. Both exploratory and confirmatory research strategies are legitimate and useful. However, presenting the result of an exploratory analysis as if it arose from a confirmatory test inflates the chance that the result is a false positive. In particular, p‑values lose their diagnostic value if they are not the result of a pre-specified analysis plan for which all results are reported. Pre-registration — and, ultimately, full reporting of analysis plans — clarifies the distinction between confirmatory and exploratory analysis, encourages well-powered studies (at least in the case of confirmatory analyses) and reduces the file-drawer effect. These practices subsequently reduce the likelihood of false positive accumulation. The Open Science Framework (OSF) offers a registration mechanism for scientific research. For observational studies, it would be useful to register datasets in detail, so that one can be aware of how extensive the multiplicity and complexity of analyses can be94.

Improving availability of materials and data. Making research materials available will improve the quality of studies aimed at replicating and extending research findings. Making raw data available will improve data aggregation methods and confidence in reported results. There are multiple repositories for making data more widely available, such as The Dataverse Network Project and Dryad for data in general and others such as OpenfMRI, INDI and OASIS for neuroimaging data in particular. Also, commercial repositories (for example, figshare) offer means for sharing data and other research materials. Finally, the OSF offers infrastructure for documenting, archiving and sharing data within collaborative teams and also making some or all of those research materials publicly available. Leading journals are increasingly adopting policies for making data, protocols and analytical codes available, at least for some types of studies. However, these policies are uncommonly adhered to95, and thus the ability of independent experts to repeat published analyses remains low96.

Incentivizing replication. Weak incentives for conducting and publishing replications are a threat to identifying false positives and accumulating precise estimates of research findings. There are many ways to alter replication incentives97. For example, journals could offer a submission option for registered replications of important research results (see, for example, a possible new submission format for Cortex98). Groups of researchers can also collaborate on performing one or many replications to increase the total sample size (and therefore the statistical power) achieved while minimizing the labour and resource impact on any one contributor. Adoption of the gold standard of large-scale collaborative consortia and extensive replication in fields such as human genome epidemiology has transformed the reliability of the produced findings. Although previously almost all of the proposed candidate gene associations from small studies were false99 (with some exceptions100), collaborative consortia have substantially improved power, and the replicated results can be considered highly reliable. In another example, in the field of psychology, the Reproducibility Project is a collaboration of more than 100 researchers aiming to estimate the reproducibility of psychological science by replicating a large sample of studies published in 2008 in three psychology journals92. Each individual research study contributes just a small portion of time and effort, but the combined effect is substantial both for accumulating replications and for generating an empirical estimate of reproducibility.
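The power gain from this kind of pooling is easy to quantify. A minimal sketch with illustrative numbers of our own choosing (an assumed effect of d = 0.4 and five replications of 15 subjects per group, analysed as if the raw data were combined):

```python
# Power of one small replication versus five replications pooled into a
# single analysis (two-sided two-sample t-test, alpha = 0.05, d = 0.4 assumed).
from statsmodels.stats.power import TTestIndPower

d = 0.4                      # illustrative true effect size
analysis = TTestIndPower()
single = analysis.power(effect_size=d, nobs1=15, ratio=1.0, alpha=0.05)
pooled = analysis.power(effect_size=d, nobs1=5 * 15, ratio=1.0, alpha=0.05)
print(f"one study: {single:.0%}; five pooled: {pooled:.0%}")  # ~18% vs ~69%
```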
Concluding remarks. Small, low-powered studies are endemic in neuroscience. Nevertheless, there are reasons to be optimistic. Some fields are confronting the problem of the poor reliability of research findings that arises from low-powered studies. For example, in genetic epidemiology sample sizes increased dramatically with the widespread understanding that the effects being sought are likely to be extremely small. This, together with an increasing requirement for strong statistical evidence and independent replication, has resulted in far more reliable results. Moreover, the pressure for emphasizing significant results is not absolute. For example, the Proteus phenomenon101 suggests that refuting early results can be attractive in fields in which data can be produced rapidly. Nevertheless, we should not assume that science is effectively or efficiently self-correcting102. There is now substantial evidence that a large proportion of the evidence reported in the scientific literature may be unreliable. Acknowledging this challenge is the first step towards addressing the problematic aspects of current scientific practices and identifying effective solutions.
