Author affiliations: 1 School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK. 2 School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK. 3 Stanford University School of Medicine, Stanford, California 94305, USA. 4 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA. 5 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK. 6 School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK.
Correspondence to M.R.M. e-mail: marcus.munafo@bristol.ac.uk
doi:10.1038/nrn3475
Published online 10 April 2013
Corrected online 15 April 2013

It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly 'clean' results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals9,10.
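As a rough illustration of why such figures arise, the sketch below computes the chance of at least one false positive when many hypotheses are each tested at a threshold of 0.05. It assumes independent tests for simplicity; this is not the design of the cited genetic-association simulation6, only an illustration of the underlying arithmetic.

```python
# Probability of at least one false positive when k independent
# null hypotheses are each tested at significance level alpha.
# Illustrative assumption: independent tests; the cited simulation
# used its own, more realistic data structure.

def familywise_false_positive_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent null tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 10, 50, 70, 100):
    print(f"{k:>3} tests: {familywise_false_positive_rate(k):.2%}")
# With roughly 70 independent tests at alpha = 0.05, the chance of at
# least one spurious 'significant' result already exceeds 97%.
```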
Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is under-appreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, 'positive') results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We next empirically show that statistical power is typically low in the field of neuroscience by using evidence from a range of subfields within the neuroscience literature. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.

Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
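For the second of these problems, the PPV of a claimed finding follows from the standard relation PPV = ([1 - beta] x R) / ([1 - beta] x R + alpha), where 1 - beta is power, alpha is the significance threshold and R is the pre-study odds that a probed effect is truly non-null. The sketch below evaluates this relation; the particular prior odds and power values are illustrative assumptions, not estimates taken from this article.

```python
# Positive predictive value (PPV) of a nominally significant finding:
#   PPV = ((1 - beta) * R) / ((1 - beta) * R + alpha)
# where 1 - beta is statistical power, alpha is the significance
# threshold and R is the pre-study odds that the probed effect is real.
# The numbers below are illustrative assumptions only.

def ppv(power, alpha=0.05, prior_odds=0.25):
    return (power * prior_odds) / (power * prior_odds + alpha)

for power in (0.8, 0.5, 0.2):
    print(f"power = {power:.1f}  ->  PPV = {ppv(power):.2f}")
# Dropping power from 0.8 to 0.2 cuts the PPV of a significant result
# from about 0.80 to about 0.50 at these assumed pre-study odds.
```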
may simply report that only nine patients were studied. A manipulation affecting only three observations could change the odds ratio from 1.00 to 1.50 in a small study but might only change it from 1.00 to 1.01 in a very large study. When investigators select the most favourable, interesting, significant or promising results among a wide spectrum of estimates of effect magnitudes, this is inevitably a biased choice.
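The sketch below works through this point with hypothetical 2x2 tables (the group sizes and event counts are assumptions for illustration, not data from any study discussed here): reclassifying three outcomes moves the odds ratio substantially in a small trial but barely at all in a large one.

```python
# How much can reclassifying three outcomes move an odds ratio?
# Hypothetical counts; group sizes and events are illustrative only.

def odds_ratio(events_a, n_a, events_b, n_b):
    return (events_a / (n_a - events_a)) / (events_b / (n_b - events_b))

# Small study: 30 participants per arm, 10 events in each arm.
print(odds_ratio(10, 30, 10, 30))          # 1.00
print(odds_ratio(13, 30, 10, 30))          # ~1.53 after shifting 3 outcomes

# Large study: 3,000 participants per arm, 1,000 events in each arm.
print(odds_ratio(1000, 3000, 1000, 3000))  # 1.00
print(odds_ratio(1003, 3000, 1000, 3000))  # ~1.00 after the same shift
```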
Publication bias and selective reporting of outcomes and analyses are also more likely to affect smaller, underpowered studies17. Indeed, investigations into publication bias often examine whether small studies yield different results than larger ones18. Smaller studies more readily disappear into a file drawer than very large studies that are widely known and visible, and the results of which are eagerly anticipated (although this correlation is far from perfect). A 'negative' result in a high-powered study cannot be explained away as being due to low power19,20, and thus reviewers and editors may be more willing to publish it, whereas they more easily reject a small 'negative' study as being inconclusive or uninformative21. The protocols of large studies are also more likely to have been registered or otherwise made publicly available, so that deviations in the analysis plans and choice of outcomes may become obvious more easily. Small studies, conversely, are often subject to a higher level of exploration of their results and selective reporting thereof.

Third, smaller studies may have a worse design quality than larger studies. Several small studies may be opportunistic experiments, or the data collection and analysis may have been conducted with little planning. Conversely, large studies often require more funding and personnel resources. As a consequence, designs are examined more carefully before data collection, and analysis and reporting may be more structured. This relationship is not absolute: small studies are not always of low quality. Indeed, a bias in favour of small studies may occur if the small studies are meticulously designed and collect high-quality data (and therefore are forced to be small) and if large studies ignore or drop quality checks in an effort to include as large a sample as possible.
Empirical evidence from neuroscience
Any attempt to establish the average statistical power in neuroscience is hampered by the problem that the true effect sizes are not known. One solution to this problem is to use data from meta-analyses. Meta-analysis provides the best estimate of the true effect size, albeit with limitations, including the limitation that the individual studies that contribute to a meta-analysis are themselves subject to the problems described above. If anything, summary effects from meta-analyses, including power estimates calculated from meta-analysis results, may also be modestly inflated22.

Acknowledging this caveat, in order to estimate statistical power in neuroscience, we examined neuroscience meta-analyses published in 2011 that were retrieved using 'neuroscience' and 'meta-analysis' as search terms. Using the reported summary effects of the meta-analyses as the estimate of the true effects, we calculated the power of each individual study to detect the effect indicated by the corresponding meta-analysis.
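For studies whose meta-analysis reports a mean difference, a calculation along the following lines recovers the power of each contributing study to detect the meta-analytic effect expressed as a standardized effect size. This is a minimal sketch using statsmodels with assumed example numbers; it is not necessarily the exact procedure or software used for the analysis reported here, and odds/risk-ratio outcomes require a different calculation (see the sketch at the end of this section).

```python
# Power of an individual two-group study to detect the summary effect
# reported by its meta-analysis, expressed as a standardized mean
# difference (Cohen's d). Effect size and sample sizes below are
# illustrative assumptions, not values extracted from the dataset.
from statsmodels.stats.power import TTestIndPower

def study_power(meta_effect_d, n_group1, n_group2, alpha=0.05):
    """Two-sided power of an independent-samples t-test."""
    return TTestIndPower().power(
        effect_size=meta_effect_d,
        nobs1=n_group1,
        ratio=n_group2 / n_group1,
        alpha=alpha,
        alternative="two-sided",
    )

# Example: a meta-analytic effect of d = 0.5 and a study with 20
# participants per group gives power of roughly 0.34.
print(round(study_power(0.5, 20, 20), 2))
```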
Figure 2 | Flow diagram of articles selected for inclusion. [Flow diagram: records identified through database search (n = 246) and through other sources (n = 0); records after duplicates removed (n = 246); abstracts screened (n = 246), with 73 excluded; full-text articles screened (n = 173), with 82 excluded; full-text articles assessed for eligibility (n = 91), with 43 excluded; articles included in analysis (n = 48).] Computerized databases were searched on 2 February 2012 via Web of Science for papers published in 2011, using the key words 'neuroscience' and 'meta-analysis'. Two authors (K.S.B. and M.R.M.) independently screened all of the papers that were identified for suitability (n = 246). Articles were excluded if no abstract was electronically available (for example, conference proceedings and commentaries) or if both authors agreed, on the basis of the abstract, that a meta-analysis had not been conducted. Full texts were obtained for the remaining articles (n = 173) and again independently assessed for eligibility by K.S.B. and M.R.M. Articles were excluded (n = 82) if both authors agreed, on the basis of the full text, that a meta-analysis had not been conducted. The remaining articles (n = 91) were assessed in detail by K.S.B. and M.R.M. or C.M. Articles were excluded at this stage if they could not provide the following data for extraction for at least one meta-analysis: first author and summary effect size estimate of the meta-analysis; and first author, publication year, sample size (by groups) and number of events in the control group (for odds/risk ratios) of the contributing studies. Data extraction was performed independently by K.S.B. and M.R.M. or C.M. and verified collaboratively. In total, n = 48 articles were included in the analysis.

Methods. Included in our analysis were articles published in 2011 that described at least one meta-analysis of previously published studies in neuroscience with a summary effect estimate (mean difference or odds/risk ratio) as well as study level data on group sample size and, for odds/risk ratios, the number of events in the control group.

We searched computerized databases on 2 February 2012 via Web of Science for articles published in 2011, using the key words 'neuroscience' and 'meta-analysis'. All of the articles that were identified via this electronic search were screened independently for suitability by two authors (K.S.B. and M.R.M.). Articles were excluded if no abstract was electronically available (for example, conference proceedings and commentaries) or if both authors agreed, on the basis of the abstract, that a meta-analysis had not been conducted.
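For meta-analyses reporting odds or risk ratios, one common way to approximate a study's power from the quantities extracted above (group sample sizes and the control-group event count) is a two-sided Wald test on the log odds ratio. The sketch below is an assumed illustration of that approach with made-up inputs, not necessarily the method used in this analysis.

```python
# Approximate power of a single study to detect a meta-analytic odds
# ratio via a two-sided Wald test on log(OR). Inputs mirror the data
# extracted per study: group sizes and control-group events. This is
# an illustrative approximation with assumed example numbers.
import math
from statistics import NormalDist

def or_power(meta_or, n_treat, n_control, events_control, alpha=0.05):
    p0 = events_control / n_control          # control event rate
    odds1 = meta_or * p0 / (1 - p0)          # implied treatment odds
    p1 = odds1 / (1 + odds1)                 # implied treatment event rate
    # Expected cell counts under the meta-analytic effect.
    a, b = n_treat * p1, n_treat * (1 - p1)
    c, d = n_control * p0, n_control * (1 - p0)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(math.log(meta_or)) / se - z_crit)

# Example with assumed inputs: OR = 1.5, 50 participants per arm,
# 15 control-group events; prints a power of about 0.16.
print(round(or_power(1.5, 50, 50, 15), 2))
```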