The Problem with Science
The Reproducibility Crisis and What to Do About It
R. BARKER BAUSELL
Oxford University Press is a department of the University of Oxford. It furthers the University’s
objective of excellence in research, scholarship, and education by publishing worldwide. Oxford
is a registered trade mark of Oxford University Press in the UK and certain other countries.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, without the prior permission in writing of Oxford
University Press, or as expressly permitted by law, by license, or under terms agreed with the
appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope
of the above should be sent to the Rights Department, Oxford University Press, at the address
above.
You must not circulate this work in any other form and you must impose this same condition on
any acquirer.
Contents
A Brief Note
Acknowledgments
Introduction
1. Publication Bias
2. False-Positive Results and a Nontechnical Overview of Their
Modeling
3. Questionable Research Practices (QRPs) and Their Devastating
Scientific Effects
4. A Few Case Studies of QRP-Driven Irreproducible Results
5. The Return of Pathological Science Accompanied by a Pinch of
Replication
11. A (Very) Few Concluding Thoughts
Index
A Brief Note
This book was written and peer reviewed by Oxford University Press
before the news concerning the “problem” in Wuhan broke, hence no
mention of COVID-19 appears in the text. Relatedly, since one of my
earlier books, Snake Oil Science: The Truth About Complementary and
Alternative Medicine, had been published by Oxford more than a decade
ago, I had seen no need to pursue that line of inquiry further because the
bulk of the evidence indicated that alternative medical therapies were
little more than cleverly disguised placebos, with their positive scientific
results having been facilitated by substandard experimental design,
insufficient scientific training, questionable research practices, or worse.
So, for this book, I chose to concentrate almost exclusively on a set of
problems bedeviling mainstream science and the initiative based
thereupon, one that has come to be called “the reproducibility crisis.”
However, as everyone is painfully aware, in 2020, all hell broke loose.
The internet lit up advocating bogus therapies; the leaders of the two
most powerful countries in the world, Xi Jinping and Donald Trump,
advocated traditional Chinese herbals and a household cleaner,
respectively; and both disparaged or ignored actual scientific results that
did not support their agendas. Both world leaders also personally
employed (hence served as role models for many of their citizens)
unproved, preventive remedies for COVID-19: traditional Chinese herbal
compounds by Xi Jinping; hydroxychloroquine (which is accompanied
by dangerous side effects) by Donald Trump.
This may actually be more understandable in Xi’s case, since, of the
two countries, China is undoubtedly the more problematic from the
perspective of conducting and publishing its science. As only one
example, 20 years ago, Andrew Vickers’s systematic review team found
that 100% of the alternative medical trials (in this case acupuncture) and 99% of the conventional medical trials published in that country were positive. And unfortunately there is credible
evidence that the abysmal methodological quality of Chinese herbal
medical research itself (and not coincidentally the almost universally
positive results touting their efficacy) has continued to this day.
To be fair, however, science as an institution is far from blameless in
democracies such as the United States. Few scientists, including research
methodologists such as myself, view attempting to educate our elected
officials on scientific issues as part of their civic responsibility.
So while this book was written prior to the COVID-19 pandemic,
there is little in it that is not relevant to research addressing future health
crises such as this (e.g., the little-appreciated and somewhat
counterintuitive [but well documented] fact that early findings in a new
area of inquiry often tend to be either incorrect or to report significantly
greater effect sizes than follow-up studies). It is therefore my hope that
one of the ultimate effects of the reproducibility crisis (which again
constitutes the subject matter of this book) will be to increase the societal
utility of science as well as the public’s trust therein. It is an aspiration that
will not be realized without a substantial reduction in the prevalence of
the many questionable research behaviors that permit and facilitate the
inane tendency for scientists to manufacture (and publish) false-positive
results.
Acknowledgments
Introduction
This is a story about science. Not one describing great discoveries or the
geniuses who make them, but one that describes the labors of scientists
who are in the process of reforming the scientific enterprise itself. The
impetus for this initiative involves a long-festering problem that
potentially affects the usefulness and credibility of science itself.
The problem, which has come to be known as the reproducibility
crisis, affects almost all of science, not one or two individual disciplines.
As its name implies, the problem revolves around the emerging realization that
much—perhaps most—of the science being produced cannot be
reproduced. And scientific findings that do not replicate are highly
suspect if not worthless.
So, the four audiences for whom this book is primarily intended are
1. Practicing scientists who have not had the time or the opportunity to
understand the extent of this crisis or how they can personally avoid
producing (and potentially embarrassing) irreproducible results;
2. Aspiring scientists, such as graduate students and postdocs, for the same
reasons;
3. Academic and funding administrators who play (whether they realize it or
not) a key role in perpetuating the crisis; and
4. Members of the general public interested in scientific issues who are
barraged almost daily with media reports of outrageously counterintuitive
findings or ones that contradict previously published ones.
Some readers may find descriptors such as “crisis” for an institution as
sacrosanct as science a bit hyperbolic, but in truth this story has two
themes. One involves a plethora of wrongness; the other chronicles the labors of a growing cadre of scientists who have
recognized the seriousness of the problem and have accordingly
introduced evidence-based strategies for its amelioration.
However, regardless of semantic preferences, this book will present
overwhelming evidence that a scientific crisis does indeed exist. In so
doing it will not constitute a breathless exposé of disingenuous scientific
blunders or bad behavior resulting in worthless research at the public
expense. Certainly some such episodes compose an important part of the
story, but, in its totality, this book is intended to educate as many readers
as possible about a serious but addressable societal problem.
So, in a sense, this is an optimistic story representing the belief (and
hope) that the culture of science itself is in the process of being altered to
usher in an era in which (a) the social and behavioral sciences (hereafter
referred to simply as the social sciences) will make more substantive,
reproducible contributions to society; and (b) the health sciences will
become even more productive than they have been in past decades. Of
course the natural and physical sciences have their own set of problems,
but only a handful of reproducibility issues from these disciplines have
found their way into the present story since their methodologies tend to
be quite different from the experimental and correlational approaches
employed in the social and health sciences.
For the record, although hardly given to giddy optimism in many
things scientific, I consider this astonishing 21st-century reproducibility
awakening (or, in some cases, reawakening) to be deservedly labeled as a
paradigmatic shift in the Kuhnian sense (1962). Not from the perspective
of an earth-shattering change in scientific theories or worldviews such as those ushered in by Copernicus, Newton, or Einstein, but rather of a dramatic shift in the manner in which scientific research is conducted
and reported. These are behavioral and procedural changes that may also
redirect scientific priorities and goals from a cultural emphasis on
publishing as many professional articles as humanly possible to one of
ensuring that what is published is correct, reproducible, and hence has a
chance of being at least potentially useful.
However, change (whether paradigmatic or simply behavioral) cannot
be fully understood or appreciated without at least a brief mention of
what it replaces. So permit me the conceit of a very brief review of an
important methodological initiative that occurred in the previous century.
The Age of Internal and External Validity
For the social sciences, our story is perhaps best begun in 1962, when a
research methodologist (Donald T. Campbell) and a statistician (Julian
C. Stanley) wrote a chapter in a handbook dealing with, of all things, research on teaching. The chapter garnered considerable attention at the
time, and it soon became apparent that its precepts extended far beyond
educational research. Accordingly, it was issued as an 84-page paperback
monograph entitled Experimental and Quasi-Experimental Designs for
Research (1966) and was promptly adopted as a supplemental textbook
throughout the social sciences.
But while this little book’s influence arguably marked the
methodological coming of age for the social sciences, it was preceded
(and undoubtedly was greatly influenced) by previous methodology
textbooks such as Sir Ronald Fisher’s The Design of Experiments (1935),
written for agriculture researchers but influencing myriad other
disciplines as well, and Sir Austin Bradford Hill’s Principles of Medical
Statistics (1937), which had an equally profound effect upon medical
research.
The hallmark of Campbell and Stanley’s remarkable little book
involved the naming and explication of two constructs, internal and
external validity, accompanied by a list of the research designs (or
architecture) that addressed (or failed to address) the perceived
shortcomings of research conducted in that era. Internal validity was
defined in terms of whether or not an experimental outcome (generally
presumed to be positive) was indeed a function of the intervention rather
than extraneous events or procedural confounds. External validity
addressed the question of:
11
external validity (replication) also served as the same bottom line arbiter
for the reproducible–irreproducible dichotomy that constitutes the basic
subject matter of this book.
Campbell and Stanley’s basic precepts, along with Jacob Cohen’s
(1977, 1988) seminal (but far too often ignored) work on statistical power, were subsequently included and cited in hundreds of
research methods textbooks in just about every social science discipline.
And, not coincidentally, these precepts influenced much of the veritable
flood of methodological work occurring during the next several decades,
not only in the social sciences but in the health sciences as well.
Unfortunately, this emphasis on the avoidance of structural (i.e., experimental design) confounds at the expense of procedural (i.e., behavioral) ones proved to be insufficient, given the tacit assumption that if the
architectural design of an experiment was reasonably sound and the data
were properly analyzed, then any positive results accruing therefrom
could be considered correct 95% of the time (i.e., the complement of the
statistical significance criterion of p ≤ 0.05). And while a vast literature
did eventually accumulate around the avoidance of these procedural
confounds, less attention was paid to the possibility that a veritable host
of investigator-initiated questionable research practices might,
purposefully or naïvely, artifactually produce false-positive, hence
irreproducible, results.
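To make the insufficiency of that tacit assumption concrete, consider a minimal back-of-the-envelope sketch (my own illustration, in the spirit of the modeling discussed in Chapter 2, not a calculation taken from that chapter). The proportion of statistically significant findings that are false depends not only on the alpha level but also on statistical power and on the proportion of tested hypotheses that happen to be true; the prior and power values below are purely illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions only) of why "95% of positive
# results are correct" does not follow from setting alpha = 0.05.

def false_positive_share(prior_true=0.10, power=0.50, alpha=0.05):
    """Proportion of statistically significant results that are false positives,
    given the share of tested hypotheses that are actually true (prior_true),
    average statistical power, and the alpha level."""
    true_positives = prior_true * power          # real effects that reach p <= alpha
    false_positives = (1 - prior_true) * alpha   # null effects that reach p <= alpha
    return false_positives / (true_positives + false_positives)

if __name__ == "__main__":
    # With only 10% of tested hypotheses true and 50% power, roughly 47% of
    # "significant" findings are false positives -- nowhere near 5%.
    print(round(false_positive_share(0.10, 0.50, 0.05), 2))  # 0.47
```

Under these admittedly hypothetical numbers, nearly half of the positive results entering the literature would be false, even though every investigator faithfully employed the 0.05 criterion.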
From a scientific cultural perspective, this mindset was perhaps best
characterized by the writings of Robert Merton (1973), a sociologist of
science whose description of this culture would be taken as Pollyannaish
satire if written today. In his most famous essay (“The Normative Structure of Science”) he laid out “four sets of institutional imperatives—universalism,
communism [sharing of information not the political designation],
disinterestedness, and organized skepticism—[that] are taken to
comprise the ethos of modern science” (p. 270).
Scientific ethos was further described as
[t]he ethos of science is that affectively toned complex of values and norms
which is held to be binding on the man of science. The norms are expressed
in the form of prescriptions, proscriptions, preferences, and permissions.
They are legitimatized in terms of institutional values. These imperatives,
transmitted by precept and example and reinforced by sanctions, are in
varying degrees internalized by the scientist, thus fashioning his scientific
conscience or, if one prefers the latter-day phrase, his superego. (1973, p. 269, although the essay itself was first published in 1942)
While I am not fluent in Sociologese, I interpret this particular passage
as describing the once popular notion that scientists’ primary motivation
was to discover truth rather than to produce a publishable p-value ≤ 0.05.
Or that most scientists were so firmly enculturated into the “ethos” of
their calling that any irreproducible results that might accrue were of
little concern given the scientific process’s “self-correcting” nature.
To be fair, Merton’s essay was actually written in the 1930s and might
have been somewhat more characteristic of science then than in the latter
part of the 20th and early 21st centuries. But his vision of the cultural
aspect of science was prevalent (and actually taught) during the same
general period as were internal and external validity. Comforting
thoughts certainly, but misconceptions that may explain why early
warnings regarding irreproducibility were ignored.
Also in fairness, Merton’s view of science was not patently incorrect:
it was simply not sufficient. And the same can be said for Campbell and
Stanley’s focus on internal validity and the sound research designs that
they fostered. Theirs might even qualify as an actual methodological
paradigm for some disciplines, and it was certainly not incorrect. It was
in fact quite useful. It simply was not sufficient to address an as yet
unrecognized (or at least unappreciated) problem with the avalanche of
scientific results that were in the process of being produced.
So while we owe a professional debt of gratitude to the previous
generation of methodologists and their emphasis on the necessity of
randomization and the use of appropriate designs capable of negating
most experimental confounds, it is now past time to move on. For this
approach has proved impotent in assuring the reproducibility of research
findings. And although most researchers were aware that philosophers of
science from Francis Bacon to Karl Popper had argued that a
quintessential prerequisite for a scientific finding to be valid resides in its
reproducibility (i.e., the ability of other scientists to replicate it), this
crucial tenet was largely ignored in the social sciences (but taken much
more seriously by physical scientists—possibly because they weren’t
required to recruit research participants). Or perhaps it was simply due to
their several millennia experiential head start.
In any event, ignoring the reproducibility of a scientific finding is a
crucial failing because research that is not reproducible is worthless and,
even worse, is detrimental to its parent science by (a) impeding the
accumulation of knowledge, (b) squandering increasingly scarce societal
resources, and (c) wasting the most precious of other scientists’
resources—their time and ability to make their contributions to science.
All failings, incidentally, that the reproducibility initiative is designed to
ameliorate.
Almost everyone is aware of what Robert Burns and John Steinbeck had
to say about the plans of mice and men, but planning is still necessary
even if its objective is unattainable or nonexistent. So the book will
begin with the past and present conditions that facilitate the troubling
prevalence of irreproducible findings in the scientific literature
(primarily the odd fact that many disciplines almost exclusively publish
positive results in preference to negative ones). Next comes a very brief (and decidedly nontechnical) overview of the role that p-values and statistical power play in reproducibility and irreproducibility, along with one of the most iconic modeling exercises in the history of science. The next
several chapters delineate the behavioral causes (i.e., questionable
research practices [QRPs]) of irreproducibility (accompanied by
suggested solutions thereto) followed by a few examples of actual
scientific pathology which also contribute to the problem (although
hopefully not substantially). Only then will the replication process itself
(the ultimate arbiter of reproducibility) be discussed in detail along with
a growing number of very impressive initiatives dedicated to its
widespread implementation. This will be followed by almost equally impressive initiatives for improving the publishing process (which
include the enforcement of preregistration and data-sharing requirements
that directly impact the reproducibility of what is published). The final
chapter is basically a brief addendum positing alternate futures for the
reproducibility movement, along with a few thoughts on the role of
education in facilitating the production of reproducible results and the
avoidance of irreproducible ones.
But while some of the book’s content may seem irrelevant to
practitioners and students of the purely biological and physical sciences,
the majority of the key concepts discussed are relevant to almost all
empirically based disciplines. Most sciences possess their own problems
associated with publishing, the overproduction of positive results,
statistical analysis, unrecognized (or hidden) questionable research
practices, instrumental insensitivity, inadequate mentoring, and the sad
possibility that there are just too many practicing scientists who are
inadequately trained to ensure that their work is indeed reproducible.
Naturally, some of the content will be presented in more detail than some readers will prefer or require, but all adult readers have had ample practice in
skipping over content they’re either conversant with or uninterested in.
To facilitate that process, most chapters are relatively self-contained,
with cursory warnings of their content posted at the conclusion of their
immediately preceding chapter.
Also, while the story being presented almost exclusively involves the
published work of others, I cannot in good conscience avoid inserting
my own opinions regarding this work and the issues involved. I have,
however, attempted to clearly separate my opinions from those of others.
Otherwise, every topic discussed is supported by credible empirical
evidence, and every recommendation tendered is similarly supported by
either evidence or reasoned opinions by well-recognized reproducibility
thinkers. This strategy has unavoidably necessitated a plethora of citations, which nevertheless constitute a mere fraction of the literature reviewed.
For readability purposes, this winnowing process has admittedly resulted
in an unsystematic review of cited sources, although the intent was to
represent an overall consensus of those thinkers and researchers who
have contributed to this crucial scientific movement.
results. I even once published an annotated guide to 2,600 published
methodological sources encompassing 78 topics, 224 journals, and 125
publishers (Bausell, 1991). The tome was dedicated “to the three
generations of research methodologists whose work this book partially
represents” (p. viii).
In the three decades that followed, the methodological literature
virtually exploded with the publication of more articles, topic areas, and
journals (and, of course, blogs) than in the entire history of science prior
to that time. And while this work has been extremely beneficial to the
scientific enterprise, its main contribution may have been the facilitation
of the emergence of a new generation of methodologists studying (and
advocating for) the reproducibility of scientific results.
Naturally, as a chronicler, I could hardly avoid recognizing the
revolutionary importance of this latter work, not just to research
methodology but also to the entire scientific enterprise. My primary
motivation for telling this story is to hopefully help promulgate and
explain the importance of its message to the potential audiences
previously described. And, of course, I dedicate the book “to the present
generation of reproducibility methodologists it partially represents.”
I must also acknowledge my debt to three virtual mentors who have
guided me in interpreting and evaluating scientific evidence over the past
two decades, philosophers of science from the recent and distant past
whose best-known precepts I have struggled (not always successfully) to
apply to the subject matter of this book as well. In chronological order
these individuals are
And Finally an Affective Note
So Where to Begin?
Let’s begin with the odd phenomenon called publication bias, with which everyone is familiar, although many may not realize either the extent of
its astonishing prevalence or its virulence as a facilitator of
irreproducibility.
References
Cohen, J. (1977, 1988). Statistical power analysis for the behavioral sciences.
Hillsdale, NJ: Lawrence Erlbaum.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis.
Psychological Bulletin, 82, 1–20.
Hill, A. B. (1937). Principles of medical statistics. London: Lancet.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS
Medicine, 2, e124.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of
Chicago Press.
Merton, R. K. (1973). The sociology of science: Theoretical and empirical
investigations. (N. W. Storer, Ed.). Chicago: University of Chicago Press.
Park, R. (2000). Voodoo science: the road from foolishness to fraud. New York:
Oxford University Press.
PART I
1
Publication Bias
“But the worst inconvenience of all is yet to be mentioned, and that is, that
whilst this vanity of thinking men obliged to write either systems or nothing
is in request, many excellent notions or experiments are, by sober and
modest men, suppressed.” (p. 1386)
But, as we all know, over time some things improve while others get
even worse. And the latter appears to be the case with publication bias,
which has been germinating at least since the mid-20th century and
remains in full bloom at the end of the second decade of the 21st
century.
considered (or taught) in those days was the possibility that false-positive
results were clogging scientific literatures by unreported (and often
unconsidered) strategies perfectly designed to produce publishable
positive results—even when participants were randomly assigned to
conditions. After all, empirical results could always be replicated by
other researchers if they were sufficiently interested—which of course
very few then were (or even are today).
So, thus trained and acculturated, our long-forgotten (or hypothetical)
graduate student conducted many, many educational experiments and
obtained statistical significance with some. Those for which he failed to do so he didn’t bother to submit for publication once he had discovered for himself that his instructor had been right all along (i.e., that the rejection rate for studies cursed with the dreaded p > 0.05 was several times greater than for those blessed with p-values ≤ 0.05). After all,
why even bother to write up such “failures” when the time could be more
profitably spent conducting more “successful” studies?
Given his ambition and easy access to undergraduate and high school research participants, he might even have conducted a series of 22 experiments in a failed effort to produce a study skill capable of producing more learning from a prose passage than simply reading and rereading said passage for the same amount of time—a hypothetical debacle that might have resulted in 22 statistically nonsignificant differences (none of which was ever submitted for publication). And once, he might even have had the audacity to mention the quest itself at a
brown bag departmental luncheon, which resulted in vigorous criticisms
from all sides for wasting participants’ time conducting nonsignificant
studies.
As Krzyzanowska, Pintilie, and Tannock (2003) more succinctly (and bluntly) state,
“Nonpublication breaks the contract that investigators make with trial
participants, funding agencies, and ethics boards” (p. 496).
Second, regardless of whether or not human participants are
employed, publishing negative results
1. Permits other scientists to avoid dead-end paths that may be unproductive.
2. Potentially provides other scientists with an idea for a more effective
intervention or a more relevant outcome variable (in the hypothetical
scenario, this might have involved using engagement in the study activity
as the outcome variable, thereby allowing the participants to take as much
time as they needed to master the content), and
3. Encourages the creation of useful rather than spurious theories (the latter
of which are more likely to be corroborated by positive results in the
complete absence of negative ones).
interventions, but these studies require specialized analytic approaches
(see Bausell, 2015) and are relatively rare outside of clinical (most
commonly pharmaceutical) trials.
Some unacceptable (but perhaps understandable) reasons for
investigators’ not attempting to publish a negative finding could include
Undoubtedly the most ambitious effort to estimate the extent to which positive results dominate the scientific literature was undertaken by Daniele Fanelli, who contrasted entire sciences with respect to the acceptance or rejection of their studies’ stated hypotheses. In his first paper (2010), 2,434
studies published from 2000 to 2007 were selected from 10,837 journals
in order to compare 20 different scientific fields with respect to their rate
of positive findings.
The clear “winner” turned out to be psychology-psychiatry, with a 91.5% statistical significance rate, although perhaps the most shocking findings emanating from this study were that (a) all 20 of the sciences (which basically constitute the backbone of our species’ empirical, inferential scientific effort) reported published positive-result rates of
greater than 70%, (b) the average positive rate for the 2,434 studies was
84%, and (c) when the 20 sciences were collapsed into three commonly
employed categories, all obtained positive rates in excess of 80% (i.e., physical sciences = 81%, biological sciences = 84%, and social sciences
= 88%).
In his follow-up analysis, Fanelli (2011) added studies from 1990 to
1999 to the 2000 to 2007 sample just discussed in order to determine if
these positive rates were constant or if they changed over time. He found
that, as a collective, the 20 sciences had witnessed a 22% increase in
positive findings over this relatively brief time period. Eight disciplines
(clinical medicine, economics and business, geoscience, immunology,
molecular biology-genetics, neuroscience-behavior, psychology-
psychiatry, and pharmacology-toxicology) actually reported positive
results at least 90% of the time by 2007, followed by seven (agriculture,
microbiology, materials science, neuroscience-behavior, plants-animals,
physics, and the social sciences) enjoying positive rates of 80% to
90%. (Note that since the author did not report the exact percentages for
these disciplines, these values were estimated based on figure 2 of the
2011 report.)
Now, of course, neither of these analyses is completely free of potential flaws (nor are any of the other 35 or so studies cited later in this chapter), but they constitute the best evidence we have regarding the prevalence of publication bias. (For example, both studies relied on the presence of the key phrase “test* the hypothes*” in abstracts only, and some disciplines do not rely on p-values for their hypothesis tests.)
However, another investigator (Pautasso, 2010) provides a degree of
confirmatory evidence for the Fanelli results by finding similar (but
somewhat less dramatic) increases in the overall proportion of positive
results over time using (a) four different databases, (b) different key
search phrases (“no significant difference/s” or “no statistically
significant difference/s”), (c) different disciplinary breakdowns, and, for
some years, (d) only titles rather than abstracts.
two decades or so later than that. (This time around, Sterling and his co-
investigators searched eight psychology journals and found the
prevalence of positive results within one or two percentage points [96%]
of the previous two efforts.) Their rather anticlimactic conclusion:
“These results also indicate that practices leading to publication bias
have not changed over a period of 30 years” (p. 108). However, as the previous section illustrated, matters have changed for the worse more recently in some disciplines.
While surveys of journal articles published in specific journals
constitute the earliest approach to studying publication bias, more
recently, examinations of meta-analyses (both with respect to the
individual studies comprising them and the meta-analyses themselves)
have become more popular vehicles to explore the phenomenon.
However, perhaps the most methodologically sound approach involves
comparisons of the publication status of positive and negative
longitudinal trials based on (a) institutional review board (IRB) and
institutional animal care and use committee (IACUC) applications and
(b) conference abstracts. Two excellent interdisciplinary reviews of such
studies (Song, Parekh-Bhurke, Hooper, et al., 2009; Dwan, Altman,
Arnaiz, et al., 2013) found, perhaps unsurprisingly by now, that positive
studies were significantly more likely to be published than their negative
counterparts.
More entertainingly, there are even experimental documentations of the
phenomenon in which methodologically oriented investigators, with the
blessings of the journals involved, send out two almost identical versions
of the same bogus article to journal reviewers. “Almost identical”
because one version reports a statistically significant result while the
other reports no statistical significance. The positive version tended to be
significantly more likely to be (a) accepted for publication (Atkinson, Furlong, & Wampold, 1982), (b) rated more highly on various factors
such as methodological soundness (Mahoney, 1977), or (c) both
(Emerson, Warme, Wolf, et al., 2010).
A Quick Recap
with this particular bias. In support of this rather pejorative
generalization, the remainder of this chapter is given over to the presence
of publication bias in a sampling of (a) subdisciplines or research topics
within disciplines and (b) the methodological factors known to be subject
to (or associated with) publication bias.
‣ Psychotherapy for depression (Cuijpers, Smit, Bohlmeijer, et al., 2010; Flint, Cuijpers, & Horder, 2015)
‣ Pediatric research (Klassen, Wiebe, Russell, et al., 2002)
‣ Gastroenterology research (Timmer et al., 2002) and gastroenterological cancer risk research (Shaheen, Crosby, Bozymski, & Sandler, 2000)
‣ Antidepressant medications (Turner, Matthews, Linardatos, et al.,
2008)
‣ Alternative medicine (Vickers, Goyal, Harland, & Rees, 1998;
Pittler, Abbot, Harkness, & Ernst, 2000)
‣ Obesity research (Allison, Faith, & Gorman, 1996)
‣ Functional magnetic resonance imaging (fMRI) studies of emotion, personality, and social cognition (Vul, Harris, Winkielman, & Pashler, 2009) plus fMRI studies in general (Carp, 2012)
‣ Empirical sociology (Gerber & Malhotra, 2008)
‣ Anesthesiology (De Oliveira, Chang, Kendall, et al. 2012)
‣ Political behavior (Gerber, Malhotra, Dowling, & Doherty, 2010)
‣ Neuroimaging (Ioannidis, 2011; Jennings & Van Horn, 2012).
‣ Cancer prognostic markers (Kyzas, Denaxa-Kyza, & Ioannidis,
2007; Macleod, Michie, Roberts, et al., 2014)
‣ Education (Lipsey & Wilson, 1993; Hattie, 2009)
‣ Empirical economics (Doucouliagos, 2005)
‣ Brain volume abnormalities (Ioannidis, 2011)
‣ Reproductive medicine (Polyzos, Valachis, Patavoukas, et al.,
2011)
‣ Cognitive sciences (Ioannidis, Munafò, Fusar-Poli, et al., 2014)
‣ Orthodontics (Koletsi, Karagianni, Pandis, et al., 2009)
‣ Chinese genetic epidemiology (Pan, Trikalinos, Kavvoura, et al.,
2005)
‣ Drug addiction (Vecchi, Belleudi, Amato, et al., 2009)
‣ Biology (Csada, James, & Espie, 1996)
‣ Genetic epidemiology (Agema, Jukema, Zwinderman, & van der
Wall, 2002)
‣ Phase III cancer trials published in high-impact journals (Tang,
Pond, Welsh, & Chen, 2014)
‣ Multiple publications more so than single publication of the same
data (Tramèr, Reynolds, Moore, & McQuay, 1997; Schein &
Paladugu, 2001); although Melander, Ahlqvist-Rastad, Meijer,
and Beermann (2003) found the opposite relationship for a set of
Swedish studies
‣ The first hypothesis tested in multiple-hypothesis studies less so than in single-hypothesis studies (Fanelli, 2010)
‣ Higher impact journals more so than low-impact journals (Tang et
al., 2014, for cancer studies); but exceptions exist, such as the
Journal of the American Medical Association (JAMA) and the
New England Journal of Medicine (NEJM), for clinical trials
(Olson, Rennie, Cook, et al., 2002)
‣ Non-English more so than English-language publications (Vickers
et al., 1998; Jüni, Holenstein, Sterne, et al., 2003)
‣ RCTs with larger sample sizes less so than RCTs with smaller
ones (Easterbrook, Berlin, Gopalan, & Matthews, 1991)
‣ RCTs less often than observational studies (e.g., epidemiological
research), laboratory-based experimental studies, and
nonrandomized trials (Easterbrook et al., 1991; Tricco, Tetzaff,
Pham, et al., 2009)
‣ Research reported in complementary and alternative medicine
journals more so than most other types of journals (Ernst &
Pittler, 1997)
‣ Methodologically sound alternative medicine trials in high-impact journals less so than their methodologically unsound counterparts in the same journals (Bausell, 2009)
‣ Meta-analyses more so than subsequent large RCTs on the same topic (LeLorier, Gregoire, Benhaddad, et al., 1997)
‣ Earlier studies more so than later studies on the same topic (Jennings & Van Horn, 2012; Ioannidis, 2008)
‣ Preregistered trials less often than nonregistered trials (Kaplan & Irvin, 2015)
‣ Physical sciences (81%) less often than biological sciences (84%), which in turn less often than social sciences (88%) (Fanelli, 2010), although Fanelli and Ioannidis (2013) found that the United States may be a greater culprit than other countries in the increased rate of positive findings in the “soft” (e.g., social) sciences
‣ Investigators reporting no financial conflict of interest less often than those who do have such a conflict (Bekelman, Li, & Gross, 2003; Friedman & Richter, 2004; Perlis, Perlis, Wu, et al., 2005; Okike, Kocher, Mehlman, & Bhandari, 2007); all high-impact medical journals (as do most other medical journals) require a statement by all authors regarding conflicts of interest.
‣ Pulmonary and allergy trials funded by pharmaceutical companies
more so than similar trials funded by other sources (Liss, 2006)
‣ Fewer reports of harm in stroke research in published than non-
published studies, as well as publication bias in general
(Liebeskind, Kidwell, Sayre, & Saver, 2006)
‣ Superior results for prevention and criminology intervention trials
when evaluated by the program developers versus independent
evaluators (Eisner, 2009)
‣ And, of course, studies conducted by investigators known to have
committed fraud or misconduct more so than those not so
identified; it is a rare armed robbery that involves donating rather
than stealing money.
A Dissenting Voice
impossible for some areas whose percentage of positive results approach
100%).
Whether the large and growing preponderance of these positive results
in the published scientific literatures is a cause, a symptom, or simply a
facilitator of irreproducibility doesn’t particularly matter. What is
important is that the lopsided availability of positive results (at the
expense of negative ones) distorts our understanding of the world we live
in as well as retards the accumulation of the type of knowledge that
science is designed to provide.
This phenomenon, considering (a) the sheer number of scientific
publications now being produced (estimated to be in excess of 2 million
per year; National Science Board, 2018) and (b) the inconvenient fact
that most of these publications are positive, leads to the following rather
disturbing implications:
1. Even if the majority of these positive effects are not the product of
questionable research practices specifically designed to produce positive
findings, and
2. If an unknown number of correct negative effects are not published, then
3. Other investigative teams (unaware of these unpublished studies) will test these hypotheses (which in reality are false) until someone produces a positive result by chance alone or via some other artifact (see the brief sketch following this list), which
4. Will naturally be published (given that it is positive) even if far more
definitive contradictory evidence exists—therefore contributing to an
already error-prone scientific literature.
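As a back-of-the-envelope illustration of point 3 (my own sketch, not the author’s, and the numbers of repeated tests are purely illustrative assumptions), the probability that at least one of k independent tests of a hypothesis that is in fact false will reach p ≤ 0.05 by chance alone grows surprisingly quickly with k:

```python
# Chance that at least one of k independent tests of a true null hypothesis
# reaches p <= alpha purely by chance (illustrative k values only).

def chance_of_spurious_hit(k, alpha=0.05):
    """Probability of at least one statistically significant result among
    k independent tests of a hypothesis that is actually false."""
    return 1 - (1 - alpha) ** k

if __name__ == "__main__":
    for k in (1, 5, 14, 22):
        print(k, round(chance_of_spurious_hit(k), 2))
    # 1 -> 0.05, 5 -> 0.23, 14 -> 0.51, 22 -> 0.68
```

If, say, 22 different teams unknowingly tested our hypothetical graduate student’s ineffective study skill, the odds would favor at least one publishable “success” arising from chance alone.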
First, given the ubiquitous nature (and long history) of the problem,
perhaps some of the strategies that probably won’t be particularly helpful
to reduce publication bias should be listed. Leaning heavily (but not
entirely) on a paper entitled “Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth over Publishability” (Nosek, Spies, &
2012), some of these ineffective (or at least mostly ineffective) strategies
are
1. Investigators, such as our hypothetical graduate student who slipped his
22 negative studies into Robert Rosenthal’s allegorical “file drawer”
(1979), never to be translated into an actual research report or
accompanied by a sufficiently comprehensive workflow to do so in the
future. (See Chapter 9 for more details of the concept of keeping detailed
workflows, along with Phillip Bourne’s [2010] discussion thereof.) For
even though many of us (perhaps even our mythical graduate student)
may have good intentions to publish all of our negative studies in the
future, in time the intricate details of conducting even a simple experiment
fade in the face of constant competition for space in long-term memory.
And although we think we will, we never seem to have more time
available in the future than we do now in the present.
2. Journal editors who feel pressure (personal and/or corporate) to ensure a
competitive citation rate for their journals and firmly believe that
publishing negative studies will interfere with this goal as well as reduce
their readership. (To my knowledge there is little or no empirical
foundation for this belief.) We might attempt to convince editors of major
journals to impose a specified percentage annual limit on the publication
of positive results, perhaps beginning as high as 85% and gradually
decreasing it over time.
3. Peer reviewers with a bias against studies with p-values > 0.05. As
mentioned previously, this artifact has been documented experimentally
several times by randomly assigning journal reviewers to review one of
two identical versions of a fake manuscript, with the exception that one
reports statistical significance while the other reports nonsignificance.
Perhaps mandatory peer review seminars and/or checklists could be
developed for graduate students and postdocs to reduce this peer reviewer
bias in the future.
4. Research funders who would much rather report that they spent their
money demonstrating that something works or exists versus that it does
not. And since the vast majority of investigators’ funding proposals
hypothesize (hence promise) positive results, they tend to be in no hurry
to rush their negative results into publication—at least until their next
grant proposal is approved.
5. The public and the press that it serves are also human, with the same
proclivities, although with an added bias toward the “man bites dog”
phenomenon.
However, there are steps that all researchers can take to reduce the untoward effects of publication bias, such as
1. Immediately writing and quickly submitting negative studies for
publication. And, if rejected, quickly resubmitting them to another journal
until they are accepted (and they eventually will be, given sufficient persistence coupled with the absolute glut of journals in most fields). And,
as for any study, (a) transparently reporting any glitches in their conduct
(which peer reviewers will appreciate and many will actually reward) and,
(b) perhaps especially importantly for negative studies, explaining why
the study results are important contributions to science;
2. Presenting negative results at a conference (which does not preclude
subsequent journal publication). In the presence of insufficient funds to
attend one, perhaps a co-author or colleague could be persuaded to do so
at one that he or she plans to attend;
3. Serving as a peer reviewer (a time-consuming, underappreciated duty all
scientists must perform) or journal editor (even worse): evaluating studies
based on their design, conceptualization, and conduct rather than their
results;
4. And, perhaps most promising of all, utilizing preprint archives such as the
arXiv, bioRxiv, engrXiv, MedRxiv, MetaArXiv, PeerJ, PsyArXiv,
SocArXiv, and SSRN, which do not discriminate against negative results.
This process could become a major mechanism for increasing the
visibility and availability of nonsignificant results since no peer review is
required and there is no page limit, so manuscripts can be as long or as brief as their authors desire.
An Effective Vaccine
due to publication bias. The process, in the authors’ words, occurs as
follows:
And as yet another reason for publishing negative results (in the long
run they are more likely to be correct than their positive counterparts):
Poynard, Munteanu, Ratziu, and colleagues (2002) conducted a unique analysis (at least as far as I can ascertain, with the possible exception of a tangentially related book, Arbesman’s [2012] The Half-Life of Facts: Why Everything We Know Has an Expiration Date). The authors accomplished this by collecting cirrhosis
and hepatitis articles and meta-analyses conducted between 1945 and
1999 in order to determine which of the original conclusions were still
considered “true” by 2000. Their primary results were
So What’s Next?
things statistical. And even for those who do, the first few pages of
Chapter 2 may be instructive since statistical significance and power play
a much more important (perhaps the most important) role in
reproducibility than is commonly realized. The p-value, because it is so
easily gamed (as illustrated in Chapter 3), is, some would say, set too
high to begin with, while statistical power is so often ignored (and
consequently set too low).
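As a rough illustration of the power half of that statement (a normal-approximation sketch of my own, not the book’s; the standardized effect size of 0.5 and the group sizes are assumptions chosen purely for illustration), the following shows how quickly statistical power erodes as samples shrink:

```python
# Normal-approximation sketch of two-group statistical power
# (effect size and sample sizes are illustrative assumptions).
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison with
    standardized effect size d and n_per_group observations per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # e.g., 1.96 for alpha = 0.05
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - z.cdf(z_crit - noncentrality)

if __name__ == "__main__":
    for n in (20, 64, 100):
        print(n, round(approx_power(0.5, n), 2))
    # 20 -> ~0.35, 64 -> ~0.81, 100 -> ~0.94
```

At 20 participants per group, a study of a medium-sized effect would have only about a one-in-three chance of achieving statistical significance, which helps explain the author’s point that power is so often ignored and consequently set too low.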
References
Agema, W. R., Jukema, J. W., Zwinderman, A. H., & van der Wall, E. E. (2002).
A meta-analysis of the angiotensin-converting enzyme gene polymorphism
and restenosis after percutaneous transluminal coronary revascularization:
Evidence for publication bias. American Heart Journal, 144, 760–768.
Allison, D. B., Faith, M. S., & Gorman, B. S. (1996). Publication bias in obesity
treatment trials? International Journal of Obesity and Related Metabolic
Disorders, 20, 931–937.
Arbesman, S. (2012). The half-life of facts: Why everything we know has an
expiration date. New York: Penguin.
Atkinson, D. R., Furlong, M. J., & Wampold, B. E. (1982). Statistical
significance, reviewer evaluations, and the scientific process: Is there a
statistically significant relationship? Journal of Counseling Psychology, 29,
189–194.
Bausell, R. B. (2009). Are positive alternative medical therapy trials credible?
Evidence from four high-impact medical journals. Evaluation & the Health
Professions, 32, 349–369.
Bausell, R. B. (2015). The design and conduct of meaningful experiments
involving human participants: 25 scientific principles. New York: Oxford
University Press.
Bekelman, J. E., Li, Y., & Gross, C. P. (2003). Scope and impact of financial
conflicts of interest in biomedical research: A systematic review. Journal of
the American Medical Association, 289, 454–465.
Berlin, J. A., Begg, C. B., & Louis, T. A. (1989). An assessment of publication
bias using a sample of published clinical trials. Journal of the American
Statistical Association, 84, 381–392.
Bourne, P. E. (2010). What do I want from the publisher of the future? PLoS
Computational Biology, 6, e1000787.
Bozarth, J. D., & Roberts, R. R. (1972). Signifying significant significance.
American Psychologist, 27, 774–775.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental
designs for research. Chicago: Rand McNally.
Carp, J. (2012). The secret lives of experiments: Methods reporting in the fMRI
literature. Neuroimage, 63, 289–300.
Chalmers, I. (1990). Underreporting research is scientific misconduct. Journal of
the American Medical Association, 263, 1405–1408.
Cooper, H. M., DeNeve, K. M., & Charlton, K. (1997). Finding the missing
science: The fate of studies submitted for review by a human subjects
committee. Psychological Methods, 2, 447–452.
Csada, R. D., James, P. C., & Espie, R. H. M. (1996). The “file drawer problem”
of non-significant results: Does it apply to biological research? Oikos, 76,
591–593.
Cuijpers, P., Smit, F., Bohlmeijer, E., et al. (2010). Efficacy of cognitive-
behavioural therapy and other psychological treatments for adult depression:
Meta-analytic study of publication bias. British Journal of Psychiatry, 196,
173–178.
De Bellefeuille, C., Morrison, C. A., & Tannock, I. F. (1992). The fate of
abstracts submitted to a cancer meeting: Factors which influence presentation
and subsequent publication. Annals of Oncology, 3, 187–191.
De Oliveira, G. S., Jr., Chang, R., Kendall, M. C., et al. (2012). Publication bias
in the anesthesiology literature. Anesthesia & Analgesia, 114, 1042–1048.
Dickersin, K. (1991). The existence of publication bias and risk factors for its
occurrence. Journal of the American Medical Association, 263, 1385–1389.
Dickersin, K., Chan, S., Chalmers, T. C., et al. (1987). Publication bias and
clinical trials. Controlled Clinical Trials, 8, 343–353.
Doucouliagos, C. (2005). Publication bias in the economic freedom and
economic growth literature. Journal of Economic Surveys, 19, 367–387.
Dubben, H-H., & Beck-Bornholdt, H-P. (2005). Systematic review of publication
bias in studies on publication bias. British Medical Journal, 331, 433–434.
Dwan, K., Altman, D. G., Arnaiz, J. A., et al. (2013). Systematic review of the
empirical evidence of study publication bias and outcome reporting bias: An
updated review. PLoS ONE, 8, e66844.
Easterbrook, P. J., Berlin, J. A., Gopalan, R., & Matthews, D. R. (1991).
Publication bias in clinical research. Lancet, 337, 867–872.
Eisner, M. (2009). No effects in independent prevention trials: Can we reject the
cynical view? Journal of Experimental Criminology, 5, 163–183.
Emerson, G. B., Warme, W. J., Wolf, F. M., et al. (2010). Testing for the presence
of positive-outcome bias in peer review. Archives of Internal Medicine, 170,
1934–1939.
Ernst, E., & Pittler, M. H. (1997). Alternative therapy bias. Nature, 385, 480.
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences.
PLoS ONE, 5, e10068.
Fanelli, D. (2011). Negative results are disappearing from most disciplines and
countries. Scientometrics, 90, 891–904.
Fanelli, D., & Ioannidis, J. P. (2013). US studies may overestimate effect sizes in
softer research. Proceedings of the National Academy of Sciences, 110, 15031–15036.
Flint, J., Cuijpers, P., & Horder, J. (2015). Is there an excess of significant
findings in published studies of psychotherapy for depression? Psychological
Medicine, 45, 439–446.
Friedman, L. S., & Richter, E. D. (2004). Relationship between conflicts of
interest and research results. Journal of General Internal Medicine, 19, 51–56.
Gerber, A. S., & Malhotra, N. (2008). Publication bias in empirical sociological
research: Do arbitrary significance levels distort published results?
Sociological Methods and Research, 37, 3–30.
Gerber, A. S., Malhotra, N., Dowling, C. M., & Doherty, D. (2010). Publication
bias in two political behavior literatures. American Politics Research, 38, 591–
613.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis.
Psychological Bulletin, 82, 1–20.
Hardwicke, T., & Ioannidis, J. (2018). Mapping the universe of registered
reports. Nature Human Behaviour, 2, doi:10.1038/s41562-018-0444-y.
Hartling, L., Craig, W. R., & Russell, K. (2004). Factors influencing the
publication of randomized controlled trials in child health research. Archives of Pediatric and Adolescent Medicine, 158, 984–987.
Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating
to achievement. London: Routledge.
Ioannidis, J. P. (2011). Excess significance bias in the literature on brain volume
abnormalities. Archives of General Psychiatry, 68, 773–780.
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated.
Epidemiology 19, 640–648.
Ioannidis, J. P. A., Munafò, M. R., Fusar-Poli, P., et al. (2014). Publication and
other reporting biases in cognitive sciences: Detection, prevalence, and
prevention. Trends in Cognitive Sciences, 19, 235–241.
Jennings, R. G., & Van Horn, J. D. (2012). Publication bias in neuroimaging
research: Implications for meta-analyses. Neuroinformatics, 10, 67–80.
Jüni, P., Holenstein, F., Sterne, J., et al. (2003). Direction and impact of language
bias in meta-analysis of controlled trials: Empirical study. International
Journal of Epidemiology, 31, 115–123.
Kaplan, R. M., & Irvin, V. L. (2015). Likelihood of null effects of large NHLBI
clinical trials has increased over time. PLoS ONE, 10, e0132382.
Klassen, T. P., Wiebe, N., Russell, K., et al. (2002). Abstracts of randomized
controlled trials presented at the Society for Pediatric Research Meeting.
Archives of Pediatric and Adolescent Medicine, 156, 474–479.
Koletsi, D., Karagianni, A., Pandis, N., et al. (2009). Are studies reporting
significant results more likely to be published? American Journal of
Orthodontics and Dentofacial Orthopedics, 136, 632e1–632e5.
Korevaar, D. A., Hooft, L., & ter Riet, G. (2011). Systematic reviews and meta-
analyses of preclinical studies: Publication bias in laboratory animal
experiments. Laboratory Animals, 45, 225–230.
Krzyzanowska, M. K., Pintilie, M., & Tannock, I. F. (2003). Factors associated
with failure to publish large randomized trials presented at an oncology
meeting. Journal of the American Medical Association, 290, 495–501.
Kyzas, P. A., Denaxa-Kyza, D., & Ioannidis, J. P. (2007). Almost all articles on
cancer prognostic markers report statistically significant results. European
Journal of Cancer, 43, 2559–2579.
LeLorier, J., Gregoire, G., Benhaddad, A., et al. (1997). Discrepancies between
meta-analyses and subsequent large randomized, controlled trials. New
England Journal of Medicine, 337, 536–542.
Liebeskind, D. S., Kidwell, C. S., Sayre, J. W., & Saver, J. L. (2006). Evidence
of publication bias in reporting acute stroke clinical trials. Neurology, 67, 973–
979.
Lipsey, M. W., & Wilson, D. B. (1993). Educational and behavioral treatment:
Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Liss, H. (2006). Publication bias in the pulmonary/allergy literature: Effect of
pharmaceutical company sponsorship. Israeli Medical Association Journal, 8,
451–544.
Macleod, M. R., Michie, S., Roberts, I., et al. (2014). Increasing value and
reducing waste in biomedical research regulation and management. Lancet,
383, 176–185.
Mahoney, M. J. (1977). Publication prejudices: An experimental study of
confirmatory bias in the peer review system. Cognitive Therapy and Research,
1, 161–175.
Melander, H., Ahlqvist-Rastad, J., Meijer, G., & Beermann, B. (2003). Evidence
b(i)ased medicine-selective reporting from studies sponsored by
pharmaceutical industry: Review of studies in new drug applications. British
Medical Journal, 326, 1171–1173.
Menke, J., Roelandse, M., Ozyurt, B., et al. (2020). Rigor and Transparency
Index, a new metric of quality for assessing biological and medical science
methods. bioRxiv http://doi.org/dkg6;2020
National Science Board. (2018). Science and engineering indicators 2018. NSB-
2018-1. Alexandria, VA: National Science Foundation.
https://www.nsf.gov/statistics/indicators/.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The
preregistration revolution. Proceedings of the National Academy of Sciences,
115, 2600–2606.
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the
credibility of published results. Social Psychology, 45, 137–141.
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II.
Restructuring incentives and practices to promote truth over publishability.
Perspectives on Psychological Science, 7, 615–631.
Okike, K., Kocher, M. S., Mehlman, C. T., & Bhandari, M. (2007). Conflict of
interest in orthopaedic research: An association between findings and funding
in scientific presentations. Journal of Bone and Joint Surgery, 89, 608–613.
Olson, C. M., Rennie, D., Cook, D., et al. (2002). Publication bias in editorial
decision making. Journal of the American Medical Association, 287, 2825–
2828.
Pan, Z., Trikalinos, T. A., Kavvoura, F. K., et al. (2005). Local literature bias in
genetic epidemiology: An empirical evaluation of the Chinese literature [see
comment]. PLoS Medicine, 2, e334.
Pautasso, M. (2010). Worsening file-drawer problem in the abstracts of natural,
medical and social science databases. Scientometrics, 85, 193–202.
Perlis, R. H., Perlis, C. S., Wu, Y., et al. (2005). Industry sponsorship and
financial conflict of interest in the reporting of clinical trials in psychiatry.
American Journal of Psychiatry, 162, 1957–1960.
Pittler, M. H., Abbot, N. C., Harkness, E. F., & Ernst, E. (2000). Location bias in
controlled clinical trials of complementary/alternative therapies. Journal of
Clinical Epidemiology, 53, 485–489.
Polyzos, N. P., Valachis, A., Patavoukas, E., et al. (2011). Publication bias in
reproductive medicine: From the European Society of Human Reproduction
and Embryology annual meeting to publication. Human Reproduction, 26,
1371–1376.
Poynard, T., Munteanu, M., Ratziu, V., et al. (2002). Truth survival in clinical
research: An evidence-based requiem? Annals of Internal Medicine, 136,
888–895.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results.
Psychological Bulletin, 86, 638–641.
Schein, M., & Paladugu, R. (2001). Redundant surgical publications: Tip of the
iceberg? Surgery, 129, 655–661.
Sena, E. S., van der Worp, H. B., Bath, P. M., et al. (2010). Publication bias in
reports of animal stroke studies leads to major overstatement of efficacy. PLoS
Biology, 8, e1000344.
Shaheen, N. J., Crosby, M. A., Bozymski, E. M., & Sandler, R. S. (2000). Is there
publication bias in the reporting of cancer risk in Barrett’s esophagus?
Gastroenterology, 119, 333–338.
Song, F., Parekh-Bhurke, S., Hooper, L., et al. (2009). Extent of publication bias
in different categories of research cohorts: A meta-analysis of empirical
studies. BMC Medical Research Methodology, 9, 79.
Sterling, T. D. (1959). Publication decision and the possible effects on inferences
drawn from tests of significance-or vice versa. Journal of the American
Statistical Association, 54, 30–34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication
decisions revisited: The effect of the outcome of statistical tests on the
decision to publish and vice versa. American Statistician, 49, 108–112.
Tang, P. A., Pond, G. R., Welsh, S., & Chen, E. X. (2014). Factors associated
with publication of randomized phase III cancer trials in journals with a high
impact factor. Current Oncology, 21, e564–572.
43
ter Riet, G., Korevaar, D. A., Leenaars, M., et al. (2012). Publication bias in
laboratory animal research: A survey on magnitude, drivers, consequences and
potential solutions. PLoS ONE, 1(9), e43404,
Timmer, A., Hilsden, R. J., Cole, J., et al. (2002). Publication bias in
gastroenterological research: A retrospective cohort study based on abstracts
submitted to a scientific meeting. BMC Medical Research Methodology, 2, 7.
Tramèr, M. R., Reynolds, D. J., Moore, R. A., & McQuay, H. J. (1997). Impact
of covert duplicate publication on meta-analysis: A case study. British Medical
Journal, 315, 635–640.
Tricco, A. C., Tetzaff, J., Pham, B., et al. (2009). Non-Cochrane vs. Cochrane
reviews were twice as likely to have positive conclusion statements: Cross-
sectional study. Journal of Clinical Epidemiology, 62, 380–386.
Tsilidis, K. K., Panagiotou, O. A., Sena, E. S., et al. (2013). Evaluation of excess
significance bias in animal studies of neurological diseases. PLoS Biology, 11,
e1001609.
Turner, E. H., Matthews, A. M., Linardatos, E., et al. (2008). Selective
publication of antidepressant trials and its influence on apparent efficacy. New
England Journal of Medicine, 358, 252–260.
Vecchi, S., Belleudi, V., Amato, L., et al. (2009). Does direction of results of
abstracts submitted to scientific conferences on drug addiction predict full
publication? BMC Medical Research Methodology, 9, 23.
Vickers, A., Goyal, N., Harland, R., & Rees, R. (1998). Do certain countries
produce only positive results? A systematic review of controlled trials.
Controlled Clinical Trials, 19, 159–166.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high
correlations in fMRI studies of emotion, personality, and social cognition.
Perspectives on Psychological Science, 4, 274–290.
Weber, E. J., Callaham, M. L., & Wears, R. L. (1998). Unpublished research
from a medical specialty meeting: Why investigators fail to publish. Journal of
the American Medical Association, 280, 257–259.
44
2
False-Positive Results and a Nontechnical
Overview of Their Modeling
1. Statistical significance, defined by comparing the probability level generated by the statistical analysis performed on the study results (referred to here as the p-value) with the maximum probability level specified in advance by the investigator or set by disciplinary consensus or tradition (referred to as the alpha level). If the obtained p-value is less than or exactly equal to (≤) the hypothesized or disciplinary conventional alpha level (typically 0.05), then statistical significance is declared.
2. Statistical power is most succinctly (and rather cavalierly) defined as the
probability that a given study will result in statistical significance (the
minimum value of which is most often recommended to be set at 0.80).
Statistical power is a function of (a) the targeted alpha level; (b) the study
design; (c) the number of participants, animals, or other observations
employed; and (d) our third statistical construct, the effect size.
3. The effect size is possibly the simplest of the three constructs to conceptualize but without question the most difficult to predict prior to conducting a study. It is most often predicted
based on (a) a small-scale pilot study, (b) a review of the results of similar
studies (e.g., meta-analyses), or (c) a disciplinary convention, which, in
the social sciences, is often set at 0.50 based on Jacob Cohen’s decades-
old (1988) recommendation. Its prediction is also the most tenuous of the
triad regardless of how it is generated. If the effect size is overestimated,
even when a hypothesized effect actually exists, its attendant study will be
more difficult to replicate without adjustments such as an increased
sample size or the use of questionable research practices (QRPs). If the
effect size is underestimated, replication is more likely (even in the
absence of QRPs), and, if the true effect size under investigation is
sufficiently large, the attendant study will most likely be either trivial or
constitute a major scientific finding. Since this latter scenario occurs with
extreme rarity and the overestimation of effect sizes is far more common
—whether predicted a priori or based on study results—most of what follows will be based on the overestimation scenario.
All three constructs are based on a statistical model called the normal or
bell-shaped curve, which is often depicted as shown in Figure 2.1.
Figure 2.1 The bell-shaped, normal curve.
https://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg
All three are therefore integrally related to one another. For example,
all else being equal:
1. The lower (or more stringently) the alpha level is set (i.e., the manner in
which statistical significance is defined), the lower the power will be
unless the study design is properly adjusted (e.g., increasing the number
of participants or other observations [aka the sample size] to be employed)
and/or the larger the hypothesized effect size must be;
2. The higher the desired power, the larger the required sample size and/or
the larger the hypothesized effect size must be (if the alpha level is not
adjusted); and, obviously,
3. The smaller the hypothesized effect size, the less statistical power will be available (unless the sample size is increased and/or the alpha level is relaxed).
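To make these trade-offs concrete, here is a minimal sketch (my own illustration, not drawn from the text) that approximates power for a two-group, two-sided comparison using the standard normal approximation; the specific sample sizes and effect sizes plugged in below are merely assumed for demonstration.

```python
# Rough illustration of the alpha/power/effect size/sample size trade-offs
# described above, using the normal approximation for a two-group, two-sided
# test of a mean difference. (Exact calculations use the noncentral t
# distribution and will differ slightly.)
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha):
    """Approximate power for a two-group, two-sided comparison."""
    z_crit = norm.ppf(1 - alpha / 2)                     # critical value implied by alpha
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return 1 - norm.cdf(z_crit - noncentrality)

# 1. Tightening alpha lowers power unless n or the effect size increases.
for alpha in (0.05, 0.01, 0.005):
    print("alpha =", alpha, "power ~", round(approx_power(0.50, 64, alpha), 2))

# 2. Smaller hypothesized effect sizes leave less power at a fixed n and alpha.
for d in (0.80, 0.50, 0.20):
    print("d =", d, "power ~", round(approx_power(d, 64, 0.05), 2))
```

Run with these assumed values, the sketch shows power falling from roughly .80 to .50 as alpha is tightened from .05 to .005, and collapsing to about .20 when the hypothesized effect size drops to 0.20.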
Naturally, since all three of these statistical constructs are based on the normal curve, they, too, are subject to the same rather restrictive governing assumptions plus a few of their own. But the normal curve has proved to be a rather useful model that is surprisingly robust to minor violations of many of its assumptions.
So let’s now begin the modeling of false-positive results based on the
statistical models just discussed. The diagram depicted in Table 2.1 has confounded students in untold numbers of introductory statistics and research methods books for decades, sometimes with no warning that it represents a most tenuous statistical model with little or no practical applicability to actual scientific practice.
(hence a false-positive result) than in the presence of an appropriate
adjustment.
the probability that an experiment [or any type of empirical study for that
matter] will result in statistical significance if that significance level is
appropriate for the design employed [i.e., is properly adjusted], if the study
is properly conducted [i.e., in the absence of unavoidable glitches and
QRPs], and if its hypothesized effect size is correct. (Bausell & Li, 2002, p.
14)
will be remembered, suggested a 5% occurrence of false-positive results
in the absence of bias).
literature designed to locate statistically significant correlations
between various single-nucleotide polymorphisms (SNPs; of which
there are an estimated 10 million) and various psychological constructs
and diagnoses, such as general intelligence or schizophrenia (an actual,
if unfortunate, empirical example of which will be presented shortly).
SNPs, pronounced “snips,” constitute the primary source of genetic
variations in humans and basically involve postconception changes in
the sequence of a single DNA “building block.” The overwhelming
majority are benign mutations that have no known untoward or
beneficial effects on the individual, but, in the proper location, they have the capacity to affect a gene's function—one consequence of which can be increased susceptibility to a disease.
Ioannidis therefore chose this data mining arena as his example,
assuming that 100,000 gene polymorphisms might be a reasonable
estimate for the number of possible candidates for such an inquiry (i.e.,
the denominator of the required ratio), accompanied by a limited but
defensible guess at the likely number of SNPs that might actually play
a role (the numerator) in the specific psychological attribute of interest.
So, given these assumptions, let’s suspend judgment and see where this
exercise takes us.
For his imputed values, Ioannidis chose 0.05 for the significance
criterion (customarily employed in the genomic field at that time but
fortunately no longer), 0.60 for the amount of statistical power
available for the analysis, and 10 for the number of polymorphisms
likely to be associated with the attribute of interest, which Ioannidis
hypothetically chose to be schizophrenia. (Dividing the best guess
regarding the number of true relationships [10] by the number of
analyses [100,000] yields the proportion of “true effects” in this
hypothetical domain.)
Plugging these three values into the above-mentioned modeling
formula produced an estimated false-positive rate above the 50% level
and hence far above the 5% rate of false-positive results posited in
Table 2.1. (And this, in turn, indicated that any obtained statistically
significant relationship close to a p-value of 0.05 between a gene and
the development of schizophrenia would probably be false.)
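This calculation is easy to reproduce. The sketch below uses the standard false-positive report probability formula associated with the Wacholder/Ioannidis approach described above; the inputs are the ones just listed (alpha = .05, power = .60, 10 presumed true associations among 100,000 candidates), and the exact output should be read as illustrative rather than as a figure taken from the text.

```python
# Probability that a statistically significant finding is a false positive,
# given the prior proportion of true effects, the alpha level, and the power
# (a sketch of the kind of calculation described in the text).
def false_positive_probability(prior_true, alpha, power):
    true_positives = power * prior_true          # true effects that reach significance
    false_positives = alpha * (1 - prior_true)   # null effects that cross the alpha threshold
    return false_positives / (false_positives + true_positives)

prior = 10 / 100_000   # 10 presumed true SNP associations among 100,000 candidates
print(false_positive_probability(prior, alpha=0.05, power=0.60))
# ~0.999: far above the 50% level, and farther still above the nominal 5% rate
```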
Ioannidis then went on to add two different scenarios to his model,
both of which are known to increase the proportion of published false-positive results.
1. The rate of QRPs (i.e., systematic data analyses and/or investigator procedural practices that have been demonstrated to artifactually enhance the chances of producing statistically significant findings) and
2. A facet of publication bias in which 10 research teams are investigating
the same topic but only 1 of the 10 finds statistically significant results
(which of course means that this single positive finding would be
considerably more likely to be published—or even submitted for
publication—than would the other nine non-statistically significant
results).
1. The smaller the studies conducted in a scientific field, the less likely the
research findings are to be true. Studies with small sample sizes are
generally associated with less statistical power, but when such studies
happen to generate positive results they are considerably more likely to
be incorrect than their high-powered counterparts. However, since low-
powered studies are more common than high-powered ones in most
disciplines, this adds to these disciplines’ false-positive rates. (As
mentioned, low statistical power can also be a leading cause of false-
negative results, but this is less problematic because so few negative
studies are published.)
2. The smaller the effect sizes in a scientific field, the less likely the
research findings are to be true. As fields mature and the low hanging
fruit has already been harvested, their effects become smaller and thus
require increasingly larger samples to maintain acceptable statistical
power. If these sample size compensations are not made accordingly,
then power decreases and the rate of false positives increases. (And
when effects move toward the limits of our instruments’ capacity to
reliably detect them, erroneous findings increase accordingly.)
3. The greater the number and the lesser the selection of tested
relationships in a scientific field, the less likely the research findings are
to be true. This corollary might have been stated more clearly, but what
Ioannidis apparently means is that fields in which tested relationships have lower pre-study probabilities of being true (e.g., genetic association studies in which
thousands upon thousands of relationships are tested and only a few
true-positive effects exist) have a greater prevalence of false-positive
results in comparison to fields in which fewer hypotheses are tested
since said hypotheses must be informed by more and better preliminary,
supportive data (e.g., large medical randomized controlled trials [RCTs],
which are quite expensive to mount and must have preliminary data
supporting their hypotheses before they are funded). In addition, clinical
RCTs (perhaps with the exception of psychological and psychiatric
trials) tend to be more methodologically sophisticated (e.g., via the use
of double-blinded placebo designs) and regulated (e.g., the requirement
that detailed protocols be preregistered, thereby decreasing the
prevalence of a posteriori hypothesis changes). These conditions also
reduce the prevalence of false-positive results.
4. The greater the flexibility in designs, definitions, outcomes, and
analytical modes in a scientific field, the less likely the research findings
are to be true. Here, the difference between publishing practices in high-
impact, hypothesis-testing medical journals versus social science outlets
is even greater than for the previous corollary. Efficacy studies such as
those published in the Journal of the American Medical Association or
the New England Journal of Medicine customarily involve randomization of
patients; a detailed diagram of patient recruitment, including dropouts;
double blinding; veridical control groups (e.g., a placebo or an effective
alternative treatment); a recognized health outcome (as opposed to
idiosyncratic self-reported ones constructed by the investigators), pre-
registration of the study protocol, including data analytic procedures;
intent-to-treat analyses; and the other strategies listed in the
Consolidated Standards of Reporting Trials (CONSORT) Statement of
clinical medical trials (Schulz, Altman, & Moher for the CONSORT Group,
2010). Publication standards in the social sciences are far more
“flexible” in these regards, most notably perhaps in the sheer number of
self-reported idiosyncratic outcome variables that are quite distal from
any recognized veridical social or behavioral outcome. More
importantly the “greater flexibility” mentioned in this corollary also
entails a greater prevalence of QRPs in the design and conduct of studies
(see Chapter 3).
5. The greater the financial and other interests and prejudices in a
scientific field, the less likely the research findings are to be true. This one is rather self-explanatory, encompassing self-interest, bias, fraud, and misconduct—all of which will be discussed in later chapters. An
example of the biasing effects due to financial interests will be discussed
with respect to pharmaceutical research in Chapter 10.
6. The hotter a scientific field (with more scientific teams involved), the
less likely the research findings are to be true. If nothing else, "hotter" fields encourage publication bias, as in the author's genetic scenario in which the first team to achieve statistical significance is more likely to publish its results than a team that finds no statistically significant effect.
Let’s now consider a second modeling exercise that may have more
applicability for social and behavioral experimentation with actual
human participants. This model also requires an estimate regarding the
prior probability of true effects and basically employs the same formula
used by Ioannidis and proposed by Wacholder et al. However, since this
one targets an entire discipline’s experimentation rather than multiple
analyses on the same dataset, its prior probability estimate may be
somewhat of a greater stretch (but perhaps more applicable to
experimental research).
However, as previously mentioned, the primary advantage of
modeling resides in the ability to input as many different determinants of
the endpoint of interest as the modeler desires. So, in this case, I have
taken the liberty of expanding Pashler and Harris’s illustrative results by
adding a few additional values to the three input constructs in Table 2.2.
Namely:
Table 2.2
Column A: Proportion of discipline-wide studies assumed to have true (actual) effects
Column B: Statistical power
Column C: Alpha level
Column D: Proportion of false-positive results

A      B      C      D
.050 .800 .050 .54
.050 .500 .050 .66
.050 .350 .050 .73
.100 .800 .050 .36
.100 .500 .050 .47
.100 .350 .050 .56
.250 .800 .050 .16
.250 .500 .050 .23
.250 .350 .050 .30
.050 .800 .025 .37
.050 .500 .025 .49
.050 .350 .025 .58
.100 .800 .025 .22
.100 .500 .025 .31
.100 .350 .025 .39
.250 .800 .025 .09
.250 .500 .025 .13
.250 .350 .025 .18
.050 .800 .010 .19
.050 .500 .010 .28
.050 .350 .010 .35
.100 .800 .010 .10
.100 .500 .010 .15
.100 .350 .010 .20
.250 .800 .010 .04
.250 .500 .010 .06
.250 .350 .010 .08
.050 .800 .005 .11
.050 .500 .005 .16
.050 .350 .005 .21
.100 .800 .005 .05
.100 .500 .005 .08
.100 .350 .005 .11
.250 .800 .005 .02
.250 .500 .005 .03
.250 .350 .005 .04
literature—or whatever that literature happens to be to which the
inputted constructs might apply). That’s obviously a huge discrepancy so
let’s examine the different assumptive inputs that have gone into this
estimate (and it definitely is an estimate).
When the hypothesized proportion of true effects that scientists happen to be looking for ranges between 5% and 25% (the first column), the proportion of false-positive results (Column D) is powerfully affected by these hypothesized values. Thus, if the discovery potential (Column A) is as low as .05, which might occur (among other possibilities) when scientists in the discipline are operating under completely fallacious paradigms, the resulting proportion of false-positive results in the literature averages .39 and ranges from .11 to .73. When true-effect proportions of .10 and .25 are assumed, the estimated proportions of published positive effects that are false drop to averages of .25 and .11, respectively.
Similarly, as the average statistical power in a discipline increases from .35 to .80, the rate of published false-positive effects (irrespective of the alpha level and the assumed rate of true effects) drops from .31 to .19. However, it is the alpha level that is the most powerful independent determinant of false-positive results in this model. When the alpha is set at .05, the rate of false-positive results in the literature, averaged across the three levels of power and the three modeled levels of assumed true effects, is .46, or almost half of the published positive results in many scientific literatures. (And positive results, it will be recalled, comprise from .92 to .96 of psychology's published literature.)
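For readers who want to check these figures, the entries in Table 2.2 can be reproduced (to rounding) with a few lines of code; this is my own sketch, and it assumes that Column D was generated from the standard formula alpha(1 − A) / [alpha(1 − A) + B·A].

```python
# Reproduces the pattern of Table 2.2: the proportion of statistically
# significant results that are false positives for each combination of the
# assumed true-effect proportion (A), power (B), and alpha (C).
def false_positive_rate(prop_true, power, alpha):
    return alpha * (1 - prop_true) / (alpha * (1 - prop_true) + power * prop_true)

priors = (0.05, 0.10, 0.25)
powers = (0.80, 0.50, 0.35)
alphas = (0.05, 0.025, 0.01, 0.005)

for alpha in alphas:
    for prior in priors:
        for power in powers:
            print(f"{prior:.3f}  {power:.3f}  {alpha:.3f}  "
                  f"{false_positive_rate(prior, power, alpha):.2f}")

# Averaging the final column within each alpha level shows the same pattern
# discussed above: the alpha = .05 rows average in the mid-.40s, while the
# alpha = .005 rows average roughly .09.
```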
An alpha level of .01 or .005 yields much more acceptable false-
positive results (.16 and .09, respectively, averaged across the other two
inputs). An alpha of .005, in fact, produces possibly acceptable false-
positive results (i.e., < .20 or an average of .075) for all three projected
rates of true effects and power levels of .80 and .50. Not coincidentally, a
recent paper (Benjamin et al., 2017) in Nature Human Behaviour (co-authored by a veritable Who's Who of reproducibility experts) recommended that studies reporting new discoveries employ a p-value threshold of .005 rather than .05. Ironically, a similar modeling conclusion was
reached more than two decades ago in the classic article entitled “Effect
Sizes and p Values: What Should Be Reported and What Should Be
Replicated” (Greenwald, Gonzalez, Harris, & Guthrie, 1996).
For those interested in history, it could be argued that the first author,
Anthony Greenwald, one of the 73 authors of the Nature Human Behaviour paper just cited, foresaw the existence of the reproducibility crisis almost half a century ago in a classic paper, "Consequences of Prejudice Against the Null Hypothesis" (1975). This paper also detailed what surely must have been the first (or, if not the first, the most creative) modeling demonstration of false-positive results, so, just for the fun of it, let's briefly review that truly classic article.
1. The mean probability level deemed most appropriate for rejecting a null
hypothesis was .046 (quite close to the conventional .05 level).
2. The available statistical power deemed satisfactory for accepting the null
hypothesis was .726 (also quite close to the standard .80 power
recommendation for design purposes). Interestingly, however, only half
of the sample responded to this latter query, and, based on other
questions, the author concluded that only about 17% of the sample
typically considered statistical power or the possibility of producing
false-negative results prior to conducting their research. (In those days
the primary effects of low power on reproducibility were not widely
appreciated, and low power was considered problematic primarily in the
absence of statistical significance.)
3. After conducting an initial full-scale test of the primary hypothesis and
not achieving statistical significance, only 6% of the researchers said
that they would submit the study without further data collection. (Hence
publication bias is at least half a century old, as, possibly, is this QRP [i.e., presumably collecting additional data to achieve statistical
significance without adjusting the alpha level].) A total of 56% said they
would conduct a “modified” replication before deciding whether to
submit, and 28% said they would give up on the problem. Only 10%
said that they would conduct an exact replication.
system in which “there may be relatively few publications on problems
for which the null hypothesis is (at least to a reasonable approximation)
true, and of these, a high proportion will erroneously reject the null
hypothesis” (p. 1). It is worth repeating that this prescient statement
regarding reproducibility was issued almost a half-century ago and was,
of course, largely ignored.
Greenwald therefore went on to conclude that by the time the entire
process is completed the actual alpha level may have been raised from
.05 to as high as .30 and the researcher “because of his investment in
confirming his theory with a rejection of the null hypothesis, has
overlooked the possibility that the observed x-y relationship may be
dependent on a specific manipulation, measure, experimenter, setting,
or some combination of them" (p. 13). Then, under the heading "Some Epidemics of Type I Error" (aka false-positive results), he buttressed his case regarding the inflation of alpha levels with several well-received past studies that found their way into textbooks but were later discarded because they couldn't be replicated.
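Greenwald's own derivation is not reproduced in the excerpt above, but one simple way to see how a nominal .05 can balloon toward .30 is to treat the process as giving the researcher several effectively independent chances at significance (modified replications, alternate measures, and so on). The sketch below is a generic illustration under that independence assumption, not Greenwald's actual calculation.

```python
# Family-wise probability of at least one false-positive result when a true
# null hypothesis is effectively tested k times at a nominal alpha of .05
# (independent tests assumed; a generic illustration only).
alpha = 0.05
for k in range(1, 8):
    family_wise_alpha = 1 - (1 - alpha) ** k
    print(k, round(family_wise_alpha, 2))
# By k = 7 the effective alpha is about .30, the upper figure cited above.
```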
So not only did Greenwald recognize the existence of a
reproducibility crisis long before the announcement of the present one,
he also (a) warned his profession about the problems associated with
publication bias, (b) advocated the practice of replication at a time
when it was even less common than it is today, and (c) provided
guidelines for both the avoidance of publication bias and false-positive
results. And while he is recognized to some extent for these
accomplishments, he deserves an honored place in the pantheon of
science itself (if there is such a thing).
we've only started this story. We haven't even considered the calamitous effects of its primary villains: a veritable host of QRPs and
institutional impediments that foster the production of false-positive
results. So let’s consider some of these along with what appears to be an
oxymoron: a downright amusing model illustrating the effects of QRPs
on irreproducibility.
References
3
Questionable Research Practices (QRPs) and
Their Devastating Scientific Effects
The short answer is that we don’t know, and there is really no completely
accurate way of finding out. However, there have been a number of surveys estimating the prevalence of both QRPs and outright fraudulent behaviors, and, like just about everything else scientific, these efforts
have been the subject of at least one meta-analysis.
We owe this latter gift to Daniele Fanelli (2009), already introduced as
an important and frequent contributor to the reproducibility literature,
who has graced us with a meta-analysis involving 18 such surveys.
Unfortunately the response rates of most of them were unimpressive
(some would say inadequate), and, since they unavoidably involved self-reported or observed instances of antisocial behaviors, the results produced were undoubtedly underestimates of the prevalence of QRPs and/or fraud.
While the survey questions and sampling procedures in the studies
reviewed by Dr. Fanelli were quite varied, the meta-analysis’ general
conclusions were as follows:
However, with respect to this third point, a more recent and unusually
large survey (John, Loewenstein, & Prelec, 2012) casts some doubt
thereupon and may have moved psychology to the head of the QRP
class.
In this huge survey of almost 6,000 academic psychologists (of whom 2,155 responded), questionnaires were emailed soliciting self-reported performance of 10 contraindicated practices known to bias research results. An intervention was also embedded within the survey, designed to increase the validity of responses and permit modeling of the prevalence of the 10 targeted QRPs, although, for present purposes, only the raw self-admission rates of the 10 QRPs listed in Table 3.1 will be discussed.
Table 3.1
Questionable research practice                                              Prevalence
1. Failing to report all outcome variables                                    66.5%
2. Deciding whether to collect more data                                      58.0%
3. Selectively reporting studies that "worked"                                50.0%
4. Deciding whether to exclude data following an interim analysis             43.4%
5. Reporting an unexpected finding as a predicted one                         35.0%
6. Failing to report all study conditions                                     27.4%
7. Rounding off p-values (e.g., .054 to .05)                                  23.3%
8. Stopping study after desired results are obtained                          22.5%
9. Falsely claiming that results are unaffected by demographic variables       4.5%
10. Falsifying data                                                            1.7%
However, disciplinary differences in the prevalence of QRP practice
(and especially when confounded by self-reports vs. the observance of
others) are difficult to assess and not particularly important. For
example, in the year between the Fanelli and the John et al. publications,
Bedeian, Taylor, and Miller (2010) conducted a smaller survey of
graduate business school faculty’s observance of colleagues’ committing
11 QRPs during the year previous to their taking the survey. This
likewise produced a very high prevalence of several extremely serious QRPs, with the reported fabrication of data being higher in this survey (26.8%) than in any other I have yet encountered. Recall, however, that this and the
following behaviors are reports of others’ (not the respondents’)
behaviors:
particularly important, although one way to draw the line of demarcation
is between ignorance and willfulness. Everyone, for example, knows that
fabricating data is fraudulent, but not bothering to report all of one's variables might simply be due to substandard training and mentoring. Or it might be a cultural matter, rooted in the norms of the science in question.
However, while ignorance may have served as a passable excuse for
engaging in some detrimental practices in the past, it does nothing to
mitigate their deleterious effects on science and should no longer be
tolerated given the number of warnings promulgated in the past decade
or so. For it is worth repeating that the ultimate effect of QRPs is the
potential invalidation of the majority of some entire empirical literatures
in an unknown number of scientific disciplines. And that, going forward,
is simply not acceptable.
results to the point that a p-value < 0.05 is more likely to occur than a
p-value > 0.05.
Simulation 1: How to subvert the already generous alpha level of .05
and continue the decades-old process of constructing a trivial (but
entertaining) science.
This simulation is the more conventional of the two. Here 15,000
random samples were drawn from a normal distribution to assess the
impact of the following four QRPs: (a) choosing which of two
correlated outcome variables to report (plus an average of the two), (b)
not specifying sample sizes a priori but beginning with 20 observations
per cell and adding 10 more observations if statistical significance is
not yet obtained, (c) using three experimental conditions and choosing
whether to drop one (which produced four alternate analytic
approaches), and (d) employing a dichotomous variable and its
interaction with the four combinations of analytic alternatives detailed
in (c).
Letting the computer do the heavy lifting, the p-values actually obtained for the 15,000 samples were computed and contrasted to three titular alpha levels (< .10, < .05, and < .01). For the most commonly used level of .05, the percentages of false-positive results resulting from the four chosen scenarios were (recalling that a 5% rate would be expected to occur by chance alone) as follows:
(For example, the 14.4% and 60.7% range estimated for multiple QRP
sins reduces to 3.3% and 21.5%, respectively, for an alpha of .01). And,
as suggested by Benjamin, Berger, Johannesson, et al. (2017);
Greenwald, Gonzalez, Harris, and Guthrie (1996); and the simulations
in Table 2.2, the deleterious effects of the individual QRPs and their
combinations would be greatly reduced if the titular alpha level were to
be decreased to .005. However, since psychology and the vast majority
of other scientific disciplines (at least the social sciences) aren’t likely
to adopt an alpha of .005 anytime soon, the criterion of 0.05 was
employed in Simmons, Nelson, and Simonsohn’s other astonishing
simulation.
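Readers who want to see this kind of inflation first-hand can run a stripped-down Monte Carlo of their own. The sketch below is not the authors' simulation: it models only two of the four QRPs (reporting whichever of two correlated outcomes "worked" and adding 10 observations per cell when the first look is nonsignificant), with the sample sizes and the correlation between outcomes assumed for illustration, so its false-positive rate will not match the published figures exactly.

```python
# Stripped-down Monte Carlo in the spirit of Simulation 1: both groups are
# drawn from the same null distribution, so every "significant" result is a
# false positive. Two QRPs are modeled: choosing the better of two correlated
# outcomes, and topping up the sample once if the first look is nonsignificant.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def one_study(rho=0.5, n_start=20, n_add=10, alpha=0.05):
    def draw(n):
        # two outcome variables correlated at rho; identical process in both groups
        return rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)

    def significant(a, b):
        # QRP: report whichever of the two outcomes yields the smaller p-value
        return min(ttest_ind(a[:, j], b[:, j]).pvalue for j in (0, 1)) <= alpha

    a, b = draw(n_start), draw(n_start)
    if significant(a, b):
        return True
    # QRP: not significant yet, so add observations to each cell and re-test
    a, b = np.vstack([a, draw(n_add)]), np.vstack([b, draw(n_add)])
    return significant(a, b)

sims = 5_000
print(sum(one_study() for _ in range(sims)) / sims)   # noticeably above the nominal .05
```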
Simulation 2: Also, how to subvert the already generous titular alpha
level of .05 and continue the process of constructing an irreproducible
(but entertaining) science.
This one must surely be one of the first modeling strategies of its
kind. Two experiments were reported, the first apparently legitimate (if
trivial since it was based on Daryl Bem’s infamous study “proving”
that future events can influence past events) using a soft, single-item outcome
and 30 undergraduates (also a typical sample size for psychology
experiments) who were randomized to listen to one of two songs: “Hot
Potato” (a children’s tune that the undergraduates would most likely
remember from childhood as the experimental condition) versus a
rather blah instrumental control tune (“Kalimba”). Note that the
experimental children’s song was perfectly and purposefully selected to
create an immediate reactive response by asking the undergraduates if
they felt older immediately after listening to it. And, sure enough, employing the age of the participants' fathers as a covariate (which basically made no sense and was not justified), the experimental group
listening to “Hot Potato” reported that they had felt significantly older
(p = .033) on a 5-point scale (the study outcome) than the group
hearing the nonreactive control song.
For their second study the authors performed a “conceptual”
replication of the one just described, but this time embedding all four of
the computer-modeled QRPs from Simulation 1. However, the authors
first reported the study’s design, analytic procedure, and results without
mentioning any of these QRPs, which made it read like an abstract of a
typically “successful” psychology publication:
Using the same method as in Study 1, we asked 20 University of
Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by
The Beatles or “Kalimba.” Then, in an ostensibly unrelated task, they
indicated their birth date (mm/dd/yyyy) and their father’s age. We used
father’s age to control for variation in baseline age across participants. An
ANCOVA revealed the predicted effect: According to their birth dates,
people were nearly a year-and-a-half younger after listening to “When I’m
Sixty-Four" (adjusted M = 20.1 years) rather than to "Kalimba" (adjusted M = 21.5 years), p = .040. (p. 1360)
1. Authors must decide the rule for terminating data collection before data
collection begins and report this rule in the article. Any serious
institutional review board (IRB) requires such a statement in proposals submitted to it, along with a rationale for prematurely terminating a
study or adding more participants to the original sample size
justification if applicable. Submission of these documents should
probably be required by journals prior to publication if the same
information is not preregistered.
2. Authors must collect at least 20 observations per cell or else provide a
compelling cost-of-data-collection justification. This one is unclear
since an N of 68 per group is necessary to produce adequate statistical
power for a typical social science effect size of 0.50. Ironically, a
number of authors (e.g., Bakker, van Dijk, & Wicherts, 2012) have
lamented the fact that the typical power available for psychological
experiments can be as low as 0.35, and (also ironically) 20 participants
per cell doesn't even quite meet this low criterion for a two-group study (see the sketch following this list).
3. Authors must list all variables collected in a study. This one is quite
important and of course should be an integral part of the preregistration
process for relatively simple experiments such as the ones described
here. Large clinical randomized controlled trials (RCTs) (as well as
databases used for correlational studies) often collect a large number of
demographic, background, health, and even cost data, a simple list of
which might run several pages in length. Thus perhaps this suggestion
could be loosened a bit for some types of research. However, those used
as covariates, blocking variables, subgroup analyses, and, of course,
primary outcomes must be prespecified accordingly.
4. Authors must report all experimental conditions, including failed
manipulations. The failure to include an extra intervention or
comparison group should be considered censorable misconduct.
5. If observations are eliminated, authors must also report what the
statistical results are if those observations are included. And, of course,
a rationale for the inclusion-exclusion criteria and the treatment of
outliers (with definitions) should be provided a priori.
6. If an analysis includes a covariate, authors must report the statistical
results of the analysis without the covariate. This is an excellent point
and is seldom adhered to. Additionally, the actual covariate–outcome
correlation should be reported (which is almost never done). Covariates
always adjust the meanings of outcomes to a certain extent, so it is
extremely important to ensure that the adjusted outcome conceptually
remains a variable of interest. It is not immediately apparent, for
example, exactly what adjusting “feeling older” in the first experiment
or adjusting “participants’ ages” based upon “fathers’ ages” in the
second experiment winds up producing.
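As a quick check on the sample sizes discussed in point 2 of this list, the sketch below (my own, using a normal-approximation power formula; exact noncentral-t calculations run a point or two lower) shows that 20 participants per cell at the conventional effect size of 0.50 yields power in the neighborhood of one-third, whereas the 68 per group cited in the text clears the .80 benchmark.

```python
# Normal-approximation power for a two-group, two-sided comparison at d = 0.50
# (a rough check on the sample sizes discussed in point 2 above).
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - d * (n_per_group / 2) ** 0.5)

print(round(approx_power(0.50, 20), 2))   # ~0.35 here (a shade lower by exact methods)
print(round(approx_power(0.50, 68), 2))   # ~0.83: 68 per group comfortably exceeds .80
```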
The second list of guidelines is presented for peer reviewers. These
will be supplemented in Chapter 9, which discusses publishing
concerns.
Said another way, the Simmons et al. study demonstrated the effects that
QRPs could have on the artifactual achievement of statistically
significant results. So while the study may not have demonstrated (a) the
actual occurrence of artifactually significant results or (b) that their four
QRPs actually do result in artifactual statistical significance in the
published literature, surely it demonstrated their potential for doing so.
Of course, the authors’ first simulation involving the likely effects of
their four key QRPs provides strong evidence that such practices also
have the potential to dramatically inflate the actual alpha level and hence
produce false-positive results—evidence buttressed by the preceding
survey results demonstrating the high prevalence of these and other
QRPs in the actual conduct of scientific research.
But even more convincingly, a group of management investigators
fortuitously picked up where the Simmons et al. study left off and
demonstrated the actual effects of QRPs on the production of statistically
significant findings. In my personal opinion this particular study
provides one of the best empirical documentations of the untoward
effects of QRPs and their implicit relationship to both publication bias
and false-positive results.
The investigators accomplished this impressive feat by longitudinally
following a group of studies from their authors’ dissertation to their
subsequent publication in peer reviewed journals. And, as if this wasn’t
sufficiently impressive, the title of their article rivals the iconic entries of
both Ioannidis’s (“Why Most Published Research Findings Are False”)
and Simmons et al.’s (“False-Positive Psychology: Undisclosed
Flexibility in Data Collection and Analysis Allows Presenting Anything
as Significant”).
Thankfully the investigators persevered and were able to identify
142 dissertations where there was “overwhelming” evidence that the
studies had been subsequently published in a refereed journal. (The
average time to publication was 3.29 years.) Altogether (i.e., in both
dissertations and journal articles), there were 2,311 hypotheses, 1,978
of which were tested in the dissertations and 978 in the paired articles.
Overall differences between dissertation and journal article results
showed that of the 1,978 hypotheses contained in the dissertations, 889
(44.9%) were statistically significant while, of the 978 hypotheses
tested in the publications, 645 (65.9%) achieved that status. Or, in the
authors’ conceptualization of their results, “Our primary finding is that
from dissertation to journal article, the ratio of supported to
unsupported hypotheses more than doubled (0.82 to 1.00 versus 1.94 to
1.00)” (p. 376).
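The arithmetic behind those ratios can be verified directly from the counts just given; the few lines below merely check the reported figures and add no new data.

```python
# Checking the reported dissertation-versus-journal support ratios.
diss_total, diss_supported = 1978, 889
pub_total, pub_supported = 978, 645

print(diss_supported / diss_total)                      # ~0.449, the 44.9% figure
print(pub_supported / pub_total)                        # ~0.659, the 65.9% figure
print(diss_supported / (diss_total - diss_supported))   # ~0.82 supported per unsupported hypothesis
print(pub_supported / (pub_total - pub_supported))      # ~1.94 supported per unsupported hypothesis
```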
Another way to view these results is to consider only the 645
hypothesis tests which were common to both dissertations and journal
articles. Here, 56 of the 242 (20.6%) negative hypothesis tests in the
dissertations somehow changed into positive findings in the published
articles, while only 17 of the 323 (4.6%) positive dissertation findings
were changed to negative ones. That in turn reflects a greater than four-
fold negative to positive change as compared to a positive to negative
one.
As for results due to QRPs, perhaps the most creative aspect of this
seminal study involved estimating the effects of individual QRPs on the
bottom-line inferential changes occurring over time for the 142 paired
study versions. (Perhaps not coincidentally the five QRPs in this study
basically overlap the four modeled in the Simmons et al. study, which,
in all but one case, overlapped the preceding John et al. survey.)
QRP 1: Deletion or addition of data after hypothesis tests. Across
the 142 projects, 14 (9.9%) added subjects (as evidenced by increases
in sample size from dissertation to journal) and 29 (20.4%) dropped
subjects. Overall, both adding and deleting participants were associated with more changes from nonsignificance to statistical significance than the reverse (24.5% vs. 10.2%, respectively).
When broken down by adding versus deleting participants, 19% of
the effects changed from negative to positive when the sample size was
increased, while 8.9% (a two-fold reduction) changed in the opposite
direction. (Note that this contrast was not statistically significant
because of the relatively few studies that increased their sample size
over time.) Among the studies that dropped subjects, there was a 2.5-
fold difference favoring changes from non-significance to statistical
significance as compared to positive to negative changes (28.1% vs.
11.1%, respectively).
QRP 2: Altering the data after hypothesis testing. This potential
QRP was assessed in 77 studies in which the sample size did not
change over time. (There were 22 cases in which it was not possible to
determine whether data were altered.) The authors' rationale was that
“those studies that added or deleted data have a logical (but not
necessarily appropriate [emphasis added]) reason why descriptive
statistics would change from dissertations to their matched journal
publications” (p. 386). Of these 77 studies, 25 (32.5%) showed changes
in the means, standard deviations, or interrelations of the included
variables, which represented 47 nonsignificant hypothesis tests and 63
statistically significant ones. Following publication, 16 (34%) of the
negative studies became positive and 0% changed from positive to
negative.
QRP 3: Selective deletion or addition of variables. Deleting the same
22 studies as in QRP 2 (i.e., for which data alteration couldn’t be
ascertained) left 120 pairs (i.e., 142 − 22). Of these, 90 included
instances “where not all of the variables included in the dissertation
appeared in the publication and 63 (52.5%) instances where not all of
the variables found in the article were reported in the dissertation.”
(There were 59 studies which both added and dropped variables.) In the
dissertations, there were 84 negative tests and 136 positive ones.
Adding variables to the published studies resulted in a change from
negative to positive of 29.8% as compared to an 8.1% change in the
negative direction—a three-fold migration favoring negative to positive
change.
QRP 4: Reversing the direction or reframing hypotheses to support
data. The authors note that this artifact doesn’t necessarily include
changing a hypothesis from “the intervention will be efficacious” to “it
won’t work.” Instead it might involve adding a covariate (recall the
iconoclastic Simmons et al. simulations) or slightly changing a
predicted three-way interaction effect that might actually be significant
as hypothesized but not in the expected direction. Here, the good news
is that only eight studies representing 22 hypothesis tests were guilty of
substantively reframing the original hypothesis. But of course “bad”
news usually follows good news in this arena so, in this case, the bad
news is that none (0%) of these 22 dissertation hypotheses was
originally statistically significant while 17 (77.3%) of the p-values
“somehow” changed from p > 0.05 to p < 0.05 when published.
QRP 5: Post hoc dropping or adding of hypotheses. Of the 142
paired studies, 126 (87.7%) either dropped or added a hypothesis, with
80 doing both. This translates to (a) 1,333 dropped hypotheses of which
516 (38.7%) were statistically significant as opposed to (b) 333 added
hypotheses of which 233 (70.0%) were statistically significant. In other
words, the new hypotheses were almost twice as likely to be positive as
the ones that were dropped from the dissertation.
Qualifiers: The authors quite transparently describe a number of
alternative explanations to some of their findings. For example, they
note that
However, it should be remembered that neither the O'Boyle et al. nor the Simmons et al. design necessarily leads to definitive causal conclusions. With that said, I personally consider these studies to be
extremely creative and both their data (actual or simulated) and the
conclusions based on them quite persuasive, especially when considered
in the context of the other previously presented observational and
modeling studies coupled with actual replication results that will be
presented shortly.
Perhaps, then, the primary contribution of this and the previous
chapter’s simulations resides in the facts that
1. The disciplinary-accepted or prespecified alpha level for any given study is probably almost never the actual alpha level that winds up being tested at study's end—unless, of course, a study is properly designed, conducted, and analyzed;
2. Many, many investigators—while complying with some important and
obvious bias-reducing strategies—still conduct studies that are improperly
designed, conducted, and analyzed (hence biased in other ways).
Inane institutional scientific policies (IISPs) are not directly under the
personal control of individual investigators, but this does not imply that
they cannot be changed over time by individual scientists through their
own advocacy, professional behaviors, or collectively via group
pressures. QRPs, on the other hand, are almost exclusively under
individual investigators’ personal control, although adversely influenced
by IISPs and inadequate scientific mentorship.
So first consider the following list of the more common IISP culprits:
1. Publication bias, which has already been discussed in detail and is
partially due to institutional behaviors involving journal editors, funders,
peer reviewers, publishers, the press, and the public, in addition to
individual investigator behaviors. (So this one, like some that follow,
constitutes a combination QRP-IISP issue.) Researchers often bemoan the
fact that journal editors and peer reviewers make negative studies so
difficult to publish, but who, after all, are these nefarious and short-
sighted miscreants? Obviously the vast majority are researchers
themselves since they typically serve in these publishing and funding
capacities. It is therefore incumbent upon these individuals to not
discriminate against well-conducted nonsignificant studies and to so lobby
the institutions for which they work.
2. A flawed peer review system that encourages publication bias, ignores
certain QRPs, does not always enforce journal guidelines, and sometimes
engages in cronyism. But again, who are these peer reviewers? The
answer is that almost everyone reading this is (or will be) a peer reviewer
at some point in their career. And some, heaven forbid, may even become
journal editors which will provide them with an even greater opportunity
to influence attitudes and practices among both their peer reviewers and
their publishers. (Suggestions for reforming the peer review process will
be discussed in some detail in Chapter 9.)
3. Insufficient scientific mentoring and acculturation of new or prospective
investigators. This one is tricky because senior mentors have important
experiential advantages but some are completely “set in their ways,”
resistant to change, and may not even be aware of many of the issues
discussed in this book. However, one doesn’t have to be long of tooth to
adopt an informal mentoring role and guide new or prospective
researchers toward the conduct of methodologically sound research.
4. A concomitant lack of substantive disciplinary and methodological
knowledge on the part of many of these insufficiently mentored
investigators. Some of the onus here lies with these individuals to
supplement their own education via the many online or print sources
available. However, institutions also bear a very real responsibility for
providing ongoing educational opportunities for their new faculty
researchers—as well as inculcating the need for self-education, which is
freely and conveniently available online.
5. Institutional fiscal priorities resulting in untoward pressure to publish and
attract external research funds. These issues are largely outside any single
individual’s personal control but someone must at least attempt to educate
the perpetrators thereof to change their policies—perhaps via organized
group efforts.
6. Related to this is the academic administration’s seeming adoption of the
corporate model of perpetual expansion (physical and financial) resulting
in evaluating department heads based on the amount of external grant
funds their faculties manage to garner. Such pressures can force senior
scientists to spend too much of their time pursuing funding opportunities
at the expense of actually engaging in scientific activities or adequately
supervising their staff who do. And many of the latter in turn understand
that they will be evaluated on the number of publications their work
generates, which leads us back to publication bias and the multiple
geneses of the reproducibility crisis.
7. The institutionalization of too many disciplines possessing no useful or
truly unique knowledge base and thereby ensuring the conduct of
repetitive, obvious, and trivial research. (In the spirit of transparency, this
is one of my personally held, idiosyncratic opinions.) One option for
individuals stuck in such pseudoscientific professions is to seek
opportunities to work with research teams in other more propitious arenas.
Alternately (or additionally) they can try mightily to discover what might
be a useful and/or parsimonious theory to guide research in their field,
which in turn might eventually lead to a scientifically and societally
useful knowledge base.
8. And related to Number 7 is the institutional practice of never abandoning
(and seldom downsizing) one of these disciplines while forever creating
additional ones. Or recognizing when even some mature, previously
productive disciplines have exhausted their supply of “low hanging fruit”
and hence may be incapable of advancing beyond it. This, of course,
encourages trivial, repetitive, and obvious studies over actual discoveries
as well as possibly increasing the pressure to produce exciting,
counterintuitive (hence often false positive) findings. The only cure for
this state of affairs is for investigators to spend less time conducting
tautological studies and spend more time searching for more propitious
avenues of inquiry. It is worth noting, however, that publishing obviously
trivial studies whose positive effects are already known should result in
statistical significance and not contribute to a discipline’s false positive
rate.
9. And related to both of the previous IISPs is the reluctance of funding
agencies to grant awards to speculative or risky proposals. There was even
once an adage at the National Institutes of Health (NIH) to the effect that
the agency seldom funds a study for which the result isn't already known.
But of course the NIH is not a stand-alone organism nor do its employees
unilaterally decide what will be funded. Scientists are the ones who know
what is already known, and they have the most input in judging what is
innovative, what is not, what should be funded, and what shouldn’t be.
10. Using publication and citation counts as institutional requirements for
promotion, tenure, or salary increases. Both practices fuel the compulsion to publish as much and as often as humanly possible, keyed to whatever investigators believe will garner the most citations. We all employ numeric goals in our personal lives, such as for exercise, weight loss, and wealth (or the lack thereof), but excessive publication rates may actually decrease the probability of making a meaningful scientific contribution and almost certainly increase publication bias. One study (Ioannidis, Klavans, &
Boyack, 2018) reported that, between 2000 and 2016, 9,000 individuals
published one paper every 5 days. True, the majority of these were
published in high-energy and particle physics (86%) where the number of
co-authors sometimes exceeded a thousand, but papers in other disciplines
with 100 co-authors were not uncommon. (Ironically the lead author of
this paper [John Ioannidis, whom I obviously admire] has published more
than a thousand papers himself in which he was either first or last author.
But let’s give him a pass here.)
1. The use of soft, reactive, imprecise, self-reported, and easily manipulated
outcome variables that are often chosen, created, or honed by
investigators to differentially fit their interventions. This one is especially
endemic to those social science investigators who have the luxury of
choosing or constructing their own outcomes and tailoring them to better
match (or be more reactive to) one experimental group than the other.
From a social science perspective, however, if an outcome variable has no
social or scientific significance (such as how old participants feel), then
the experiment itself will most likely also have no significance. But it will
most likely add to the growing reservoir of false-positive results.
2. Failure to control for potential experimenter and participant expectancy
effects. In some disciplines this may be the most virulent QRP of all.
Naturally, double-blinded randomized designs involving sensible
control/comparison groups are crucial in psychology and medicine, given
demand and placebo effects, respectively, but, as will be demonstrated in
Chapter 5, they are equally important in the physical sciences as well. As
is the necessity of blinding investigators and research assistants to group
membership in animal studies or in any research that employs variables
scored by humans or that require human interpretation. (The
randomization of genetically identical rodents and then blinding research
assistants to group membership is a bit more complicated and labor
intensive in practice than it may appear. Also untoward effects can be
quite subtle and even counterintuitive, such as male laboratory rats
responding differentially to male research assistants.)
3. Failure to report study glitches and weaknesses. It is a rare study in which
no glitches occur during its commission. It is true, as Simmons et al.
suggest, that some reviewers punish investigators for revealing
imperfections in an experiment, but it has been my experience (as both a
journal editor-in-chief and investigator) that many reviewers appreciate
(and perhaps reward) transparency in a research report. But more
importantly, hiding a serious glitch in the conduct of a study may result in
a false-positive finding that of course can’t be replicated.
4. Selective reporting of results. This has probably been illustrated and
discussed in sufficient detail in the Simmons et al. paper in which selected
outcomes, covariates, and experimental conditions were deleted and not
reported in either the procedure or result sections. Another facet of this
QRP involves “the misreporting of true effect sizes in published studies . .
. that occurs when researchers try out several statistical analyses and/or
data eligibility specifications and then selectively report those that
produce significant results” (Head, Holman, Lanfear, et al., 2015, p. 1).
Suffice it to say that all of these practices are substantive contributors to
the prevalence of false-positive results.
5. Failure to adjust p-values based on the use of multiple outcomes,
subgroup analyses, secondary analyses, multiple “looks” at the data prior
to analysis, or similar practices resulting in artifactual statistical
significance. This is an especially egregious problem in studies involving
large longitudinal databases containing huge numbers of variables which
can easily yield thousands of potential associations among them. (Using nutritional epidemiology as an example, the European Prospective Investigation into Cancer and Nutrition and the Nurses' Health Study have each resulted in more than 1,000 articles [Ioannidis, 2018].) I personally
have no idea what a reasonable adjusted p-value should be in such
instances, although obtaining one close to 0.05 will obviously be
completely irrelevant. (Perhaps somewhere in the neighborhood of [but a bit more liberal than] the titular alpha levels adopted by the genomic and
particle physics fields, which will be discussed later.)
6. Sloppy statistical analyses and erroneous results in the reporting of p-
values. Errors such as these proliferate across the sciences, as illustrated
by David Vaux’s (2012) article in Nature (pejoratively titled “Know When
Your Numbers Are Significant”), in which he takes biological journals
and investigators to task for simple errors and sometimes absurd practices
such as employing complex statistical procedures involving Ns of 1 or 2.
As another very basic example, Michal Krawczyk (2008), using a dataset of more than 135,000 p-values, found (among other things) that 8% of them appeared to be inconsistent with the statistics upon which they were based (e.g., t or F) and, apropos of the next QRP, that authors appear "to round the p-values down more eagerly than up." Perhaps more
disturbing, Bakker and Wicherts (2011), in an examination of 281 articles,
found “that around 18% of statistical results in the psychological literature
are incorrectly reported . . . and around 15% of the articles contained at
least one statistical conclusion that proved, upon recalculation, to be
incorrect; that is, recalculation rendered the previously significant result
insignificant, or vice versa” (p. 666). And it should come as no surprise by
now that said errors were most often in line with researchers’
expectations, hence an example of confirmation bias (Nickerson, 1998).
7. Procedurally lowering p-values that are close to, but not quite below, 0.05. This
might include suspicious (and almost always unreported) machinations
such as (a) searching for covariates or alternate statistical procedures or
(b) deleting a participant or two who appears to be an outlier in order to
whittle down a p of, say, 0.07 a couple of notches. Several investigators in
several disciplines (e.g., Masicampo & Lalande, 2012; Gerber, Malhotra,
Dowling, & Doherty, 2010; Ridley, Kolm, Freckelton, & Gage, 2007)
have noted a large discrepancy between the proportions of p-values found
just below 0.05 (e.g., 0.025 to 0.049) as opposed to those just above it (e.g., 0.051 to 0.075).
8. Insufficient attention to statistical power issues (e.g., conducting
experiments with too few participants). While most past methodology
textbooks have emphasized the deleterious effects of low power on the
production of negative studies, as previously discussed, insufficient power has equally unfortunate (or even worse) effects on the production of false-
positive findings. The primary mechanism of action of low power
involves the increased likelihood of producing unusually large effect sizes
by chance, which in turn are more likely to be published than studies
producing more realistic results. And, as always, this QRP is magnified
when coupled with others, such as repeatedly analyzing interim results with
the goal of stopping the study as soon as an effect size large enough to
produce statistical significance emerges. (Adhering to the prespecified sample size
as determined by an appropriate power analysis would completely avoid
this artifact.)
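The "unusually large effect sizes by chance" mechanism is easy to demonstrate by simulation. The sketch below uses my own toy numbers (a modest true effect of d = 0.3 and only 20 participants per group); the experiments lucky enough to reach p < .05 report effects that are, on average, more than twice the true size:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_d, n, sims = 0.3, 20, 10_000
    significant_effects = []
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_d, 1.0, n)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
            significant_effects.append((treated.mean() - control.mean()) / pooled_sd)
    # Power is low here, and the "winning" studies badly overestimate d = 0.3.
    print(f"power ~ {len(significant_effects) / sims:.2f}; "
          f"mean significant effect ~ {np.mean(significant_effects):.2f}")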
9. Gaming the power analysis process, such as by hypothesizing an
unrealistically large effect size or not upwardly adjusting the required
sample size for specialized designs such as those involving hierarchical or
nested components.
10. Artifactually sculpting (or selecting) experimental or control procedures
to produce statistical significance during the design process, which might
include:
a. Selecting tautological controls, thereby guaranteeing positive results
if the experiment is conducted properly. The best examples of this
involve comparisons between interventions which have a recognized
mechanism of action (e.g., sufficient instruction delivered at an
appropriate developmental level in education or a medical
intervention known to elicit a placebo effect) versus “instruction or
treatment as usual.” (It could be argued that this is not a QRP if the
resulting positive effect is not interpreted as evidence of efficacy, but,
in the present author’s experience, it almost always is [as an example,
see Bausell & O’Connell, 2009]).
b. Increasing the fit between the intervention group and the outcome
variable or, conversely, decreasing the control–outcome match. (The
first, semi-legitimate experiment described in the Simmons et al.
paper provides a good example of this although the study was also
fatally underpowered from a reproducibility perspective.) While
speculative on the present author’s part, one wonders if psychology
undergraduates or Amazon Mechanical Turk participants familiar
with computer-administered experiments couldn't surmise that they
had been assigned to the experimental group when a song from their
childhood ("Hot Potato") was interrupted to ask them how old they
felt. Or if their randomly assigned counterparts
couldn’t guess that they were in the control group when a blah
instrumental song was similarly interrupted. (Relatedly, Chandler,
Mueller, and Paolacci [2014] found [a] that investigators tend to
underestimate the degree to which Amazon Turk workers participate
across multiple related experiments and [b] that they “overzealously”
exclude research participants based on the quality of their work.
Thirty-three percent of investigators employing crowdsourcing
participants appear to adopt this latter approach, thereby potentially
committing another QRP [i.e., see Number 11].)
11. Post hoc deletion of participants or animals for subjective reasons. As an
extreme example, in one of my previous positions I once witnessed an
alternative medicine researcher proudly explain his criterion for deciding
which observations were legitimate and which were not in his animal
studies. (The latter’s lab specialized in reputably demonstrating the pain-
relieving efficacy of acupuncture resulting from tiny needles being
inserted into tiny rat legs and comparing the results to a placebo.) His
criterion for deciding which animals to delete was proudly explained as
“sacrificing the non-acupuncture responding animals” with an
accompanying smile while drawing his forefinger across his neck. (No, I
am not making this up nor do I drink while writing. At least not at this
moment.)
12. Improper handling and reporting of missing data. Brief-duration
experiments normally do not have problematically high dropout rates, but
interventions whose effects must be studied over time can suffer
significant attrition. Preferred options for compensating for missing data
vary from discipline to discipline and include regression-based imputation
of missing values (available in most widely used statistical packages) and
intent-to-treat. (The latter tending to be more conservative than the
various kinds of imputation and certainly the analysis of complete data
only.) Naturally, the preregistration of protocols for such studies should
describe the specific procedures planned, and the final analyses should
comply with the original protocol, preferably presenting the results for
both the compensatory and unvarnished data. Most funding and regulatory
agencies for clinical trials require the prespecification of one or more of
these options in their grant proposals, as do some IRBs for their
submissions. Of course it should come as no surprise that one set of
investigators (Melander, Ahlqvist-Rastad, Meijer, & Beermann, 2003)
found that 24% of stand-alone studies neglected to include their
preregistered intent-to-treat analysis in the final analysis of a cohort of
antidepressant drug efficacy experiments—presumably because such
analyses produce more conservative (i.e., less positive) results than their
counterparts. (Because of the high financial stakes involved, positive
published pharmaceutical research results tend to be greatly facilitated by
the judicious use of QRPs [see Turner, Matthews, Linardatos, et al., 2008,
for an especially egregious example]).
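A toy illustration of why the choice matters (my own invented numbers for a single treated group measured before and after, not any particular trial): when dropout is concentrated among the poorest responders, a complete-case analysis flatters the treatment considerably more than a deliberately conservative intent-to-treat analysis that carries each dropout's baseline value forward:

    import numpy as np

    rng = np.random.default_rng(1)
    n, true_gain = 500, 2.0
    baseline = rng.normal(50, 10, n)                        # pre-treatment scores
    outcome = baseline + true_gain + rng.normal(0, 10, n)   # post-treatment scores
    dropped = outcome < np.percentile(outcome, 20)          # worst responders drop out
    complete_case = outcome[~dropped].mean() - baseline[~dropped].mean()
    itt_bocf = np.where(dropped, baseline, outcome).mean() - baseline.mean()
    print(f"true gain = {true_gain}; complete-case estimate ~ {complete_case:.1f}; "
          f"baseline-carried-forward ITT estimate ~ {itt_bocf:.1f}")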
13. Adding participants to a pilot study in the presence of a promising trend,
thereby making the pilot data part of the final study. Hopefully self-
explanatory, although this is another facet of performing interim analyses
until a desired p-value is obtained.
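A quick simulation (again mine, with an arbitrary choice of when to "peek") shows why folding a promising pilot into the final sample, or any other form of testing as the data accumulate, inflates the false-positive rate well beyond the nominal 5% even when nothing at all is going on:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    sims, false_positives = 5_000, 0
    for _ in range(sims):
        a, b = rng.normal(size=100), rng.normal(size=100)   # no true difference
        # Test after every 10 participants per group; stop at the first p < .05.
        for n in range(10, 101, 10):
            if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
                false_positives += 1
                break
    print(f"false-positive rate with repeated peeking ~ {false_positives / sims:.2f}")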
14. Abandoning a study prior to completion based on the realization that
statistical significance is highly unlikely to occur (or perhaps even that
the comparison group is outperforming the intervention). The mechanism
by which this behavior inflates the obtained p-value may not be
immediately apparent, but abandoning an ongoing experiment (i.e., not a
pilot study) before its completion based on (a) the perceived impotence of
the intervention, (b) the insensitivity of the outcome variable, or (c) a
control group that might be performing above expectations allows an
investigator to conserve resources and immediately initiate another study
until one is found that is sufficiently promising. Continually conducting
such studies until statistical significance is achieved ultimately increases
the prevalence of false, non-replicable positive results in a scientific
literature while possibly encouraging triviality at the same time.
15. Changing hypotheses based on the results obtained. Several facets of this
QRP have already been discussed, such as switching primary outcomes
and deleting experimental conditions, but there are many other
permutations such as (a) obtaining an unexpected result and presenting it
as the original hypothesis or (b) reporting a secondary finding as a
planned discovery. (All of which fit under the concepts of HARKing [for
Hypothesizing After the Results are Known, Kerr, 1991] or p-hacking
[Head et al., 2015], which basically encompasses a menu of strategies in
which “researchers collect or select data or statistical analyses until
nonsignificant results become significant.”) As an additional example,
sometimes a plethora of information is collected from participants for
multiple reasons, and occasionally one unexpectedly turns out to be
influenced by the intervention or related to another variable. Reporting
such a result (or writing another article based on it) without explicitly
stating that said finding resulted from an exploratory analysis constitutes a
QRP in its own right. Not to mention contributing to publication bias and
the prevalence of false-positive results.
16. An overly slavish adherence to a theory or worldview. Confirmation bias,
a tendency to search for evidence in support of one’s hypothesis and
ignore or rationalize anything that opposes it, is subsumable under this
QRP. A more extreme manifestation is the previously mentioned animal
lab investigator’s literal termination of rodents when they failed to
respond to his acupuncture intervention. But also, perhaps more
commonly, individuals who are completely intellectually committed to a
specific theory are sometimes capable of actually seeing phenomena that
aren't there (or failing to see disconfirmatory evidence that is present). The
history of science is replete with examples such as craniology (Gould,
1981), cold fusion (Taubes, 1993), and a number of other pathologies
which will be discussed in Chapter 5. Adequate controls and effective
blinding procedures are both simple and absolutely necessary strategies
for preventing this very troublesome (and irritating) QRP.
17. Failure to adhere to professional association research standards and
journal publishing “requirements,” such as the preregistration of
statistical approaches, primary hypotheses, and primary endpoints before
conducting studies. Again, as Simmons and colleagues state: “If reviewers
require authors to follow these requirements, they will” (p. 1363).
18. Failure to provide adequate supervision of research staff. Most
experimental procedures (even something as straightforward as the
randomization of participants to conditions or the strict adherence to a
standardized script) can easily be subverted by less than conscientious (or
eager to please) research staff, so a certain amount of supervision (e.g.,
via irregular spot checks) of research staff is required.
19. Outright fraud, of which there are myriad, well-publicized, and infamous
examples, with perhaps the most egregious genre being data fabrication
such as (a) painting patches on mice with permanent markers to mimic
skin grafts (Hixson, 1976), (b) Cyril Burt making a splendid career out of
pretending to administer IQ tests to phantom twins separated at birth to
“prove” the dominance of “nature over nurture” (Wade, 1976), or (c)
Yoshitaka Fujii’s epic publication of 172 fraudulent articles (Stroebe,
Postmes, & Spears, 2012). While data fabrications such as these are often
dismissed as a significant cause of scientific irreproducibility because of
their approximately 2% self-reported incidence (Fanelli, 2009), even this
probable underestimate is problematic when one considers the millions of
entries in published scientific databases.
20. Fishing, data dredging, data torturing (Mills, 1993), and data mining. All
of these terms describe practices designed to reduce a p-value below the
0.05 threshold (aka p-hacking) by analyzing large numbers of variables in
search of statistically significant relationships to report—but somehow
forgetting to mention the process by which these findings were obtained.
21. The combined effects of multiple QRPs, which greatly compound the
likelihood of false-positive effects since (a) a number of these practices
are independent of one another and therefore their effects on false-positive
results are cumulative and (b) individuals who knowingly commit one of
these practices will undoubtedly be inclined to combine it with others
when expedient.
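The compounding is easy to see in a small simulation of just two QRPs applied to pure noise (my own arbitrary example): testing two moderately correlated outcome measures and, if neither "works," adding ten more participants per group and trying again. Each practice alone inflates the 5% false-positive rate somewhat; in combination the inflation compounds:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    def significant(x, y):
        return stats.ttest_ind(x, y).pvalue < 0.05

    sims, hits = 5_000, 0
    for _ in range(sims):
        # Two groups, two correlated "outcomes," no true effect anywhere.
        a1, b1 = rng.normal(size=30), rng.normal(size=30)
        a2 = 0.5 * a1 + np.sqrt(0.75) * rng.normal(size=30)
        b2 = 0.5 * b1 + np.sqrt(0.75) * rng.normal(size=30)
        # QRP 1: report whichever outcome "works"; QRP 2: if neither does at
        # n = 20 per group, add 10 more participants per group and retest.
        if (significant(a1[:20], b1[:20]) or significant(a2[:20], b2[:20])
                or significant(a1, b1) or significant(a2, b2)):
            hits += 1
    print(f"false-positive rate with the two QRPs combined ~ {hits / sims:.2f}")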
22. Failing to preregister studies and to ensure that the preregistrations are
accessible to readers. It is difficult to overemphasize the importance of
preregistering study protocols since this simple strategy would avoid many
of the QRPs listed here if preregistrations were routinely compared to the
final research reports during the peer review process or, barring that, by
bloggers and other social media commentators.
23. Failure to adhere to the genre of established experimental design
standards discussed in classic research methods books and the myriad
sets of research guidelines discussed later. Common sense examples
include (a) randomization of participants (which should entail following a
strict, computer-generated procedure accompanied by steps to blind
experimenters, participants, and principal investigators); (b) the avoidance
of experimental confounds, the assurance of reliability, and the validity of
measuring instruments, taking Herculean steps (if necessary) to avoid
attrition; and (c) a plethora of others, all of which should be common
knowledge to anyone who has taken a research methods course. However,
far and away the most important of these (with the possible exception of
random assignment) is the blinding of experimenters (including animal
and preclinical researchers), research assistants, and participants
(including everyone who comes in contact with them) with respect to
group membership and study hypotheses/purposes. Of course this is not
possible in some genres of research, as when having research assistants
count handwashing episodes in public lavatories (Munger & Harris,
1989), employing confederates in obedience studies (Milgram, 1963), or
comparing the effects of actual knee surgery to placebo surgery (Moseley,
O’Malley, Petersen, et al., 2002), but in most research scenarios blinding
can and must be successfully instituted.
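By way of illustration, here is a minimal sketch (mine, not taken from any of the guidelines or textbooks just mentioned) of the sort of strict, computer-generated random assignment procedure referred to in (a) above. Permuted blocks keep the group sizes balanced while leaving the next assignment unpredictable to whoever enrolls participants:

    import random

    def permuted_block_schedule(n_participants, block_size=4, seed=2024):
        """Return a balanced, randomly ordered treatment/control schedule."""
        random.seed(seed)
        schedule = []
        while len(schedule) < n_participants:
            block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
            random.shuffle(block)          # randomize order within each block
            schedule.extend(block)
        return schedule[:n_participants]

    print(permuted_block_schedule(10))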
Imperfect blinding of participants and research staff (at least those who
come in contact with participants) may be among the most virulent QRPs
in experiments in which investigators tend to sculpt experimental or
control procedures to produce statistical significance (QRPs Numbers 2
and 10) and/or are themselves metaphorically blinded, given the degree to
which they are wedded to their theory or worldview (QRP Number 16).
If this is true, it follows that investigators should (and should be
required to) employ blinding checks to ascertain if the procedures they
put (or failed to put) into place were effective. Unfortunately a
considerable amount of evidence exists that this seemingly obvious
procedure is seldom employed. Much of this evidence comes from
individual trials, such as the classic embarrassment in which 311 NIH
employees were randomly assigned to take either a placebo or ascorbic
acid capsule three times a day for 9 months to ascertain the effectiveness
of vitamin C for the treatment and prevention of the common cold.
Unfortunately the investigators failed to construct a placebo that matched
the acidic taste of the intervention, and a blinding check revealed that
many of the NIH participants broke said blind by tasting the
capsules (Karlowski, Chalmers, Frenkel, et al., 1975).
A number of methodologists have also uncovered substandard blinding
efforts (most notably failures to evaluate their effectiveness) across many
different types of studies. Examples
include
1. Fergusson, Glass, Waring, and Shapiro (2004) found that only 8% of 191
general medical and psychiatric trials reported the success of blinding;
2. A larger study (Baethge, Assall, & Baldessarini, 2013) found even worse
results, with only 2.5% of 2,467 schizophrenia and affective disorder
RCTs reporting an assessment of participant, rater, or clinician blinding; and, not
to be outdone,
3. Hróbjartsson, Forfang, Haahr, and colleagues (2007) found an even lower
rate (2%) of blinding assessments for a sample of 1,599 interdisciplinary
blinded RCTs.
Colagiuri and Benedetti (2010), for example, quite succinctly sum up
the importance of universal blinding checks in a carefully crafted
criticism of the otherwise excellent CONSORT 2010 Statement's
updated guidelines for randomized trials, which inexplicably deleted the
provision to check blinding, downgrading it to a mere recommendation.
The Colagiuri and Benedetti team explained the rationale for their
criticism (via a British Medical Journal Rapid Response article):
Testing for blinding is the only [emphasis added] valid way to determine
whether a trial is blind. Trialists conducting RCTs should, therefore, report
on the success of blinding. In situations where blinding is successful,
trialists and readers can be confident that guesses about treatment allocation
have not biased the trial’s outcome. In situations where blinding fails,
trialists and readers will have to evaluate whether or not bias may have
influenced the trial’s outcomes. Importantly, however, in the second
situation, while trialists are unable to label their trial as blind, the failure of
blinding should not be taken as definitive evidence that bias occurred.
Instead, trialists should provide a rationale as to why the test of blinding
was unsuccessful and a statement on whether or not they consider the
differences between treatment arms to be valid. (2010, p. 340)
24. Failing to check and report experimental participants' knowledge (or
guesses) regarding the treatments they received. Again, this could be done
by administering a single forced-choice item (e.g., "To what condition
[experimental or control] do you believe you were assigned?”) at the end
of an experiment and then relating said answers not only to actual group
assignment but also ascertaining whether there was an interaction between
the guesses and actual assignment with respect to outcome scores. The
answers would not necessarily be causally definitive one way or another,
but a large portion of the participants correctly guessing their treatment
assignment by study’s end would be rather troublesome. And it would be
even more troublesome if individuals in the control group who suspected
they were in the intervention scored differentially higher or lower on the
outcome variable than their control counterparts who correctly guessed
their group assignment.
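In code, such a blinding check amounts to little more than a cross-tabulation and a look at the cell means. The sketch below runs on a made-up dataset with one row per participant (actual assignment, the participant's guess, and the outcome score):

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 200
    df = pd.DataFrame({
        "assigned": rng.choice(["treatment", "control"], n),
        "guess": rng.choice(["treatment", "control"], n),
        "outcome": rng.normal(50, 10, n),
    })
    # 1. Did guesses track actual assignment better than chance (i.e., did the
    #    blind fail)?
    table = pd.crosstab(df["assigned"], df["guess"])
    chi2, p, _, _ = stats.chi2_contingency(table)
    print(table)
    print(f"blinding-failure test: p = {p:.3f}")
    # 2. A crude look at the guess-by-assignment interaction on the outcome.
    print(df.groupby(["assigned", "guess"])["outcome"].mean())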
QRPs and Animal Studies
(Kilkenny, Browne, Cuthill, et al., 2010) which is closely modeled on the
CONSORT statement—yet another commonality between the two genres
of research. And, like its predecessor, ARRIVE also has its own
checklist (http://www.nc3rs.org.uk/ARRIVEchecklist/) and has likewise
been endorsed by an impressive number of journals.
So, as would be expected, the majority of the ARRIVE procedural
reporting guidelines (e.g., involving how randomization or blinding was
performed if instituted) are similar to their CONSORT counterparts
although others are obviously unique to animal research. (For a
somewhat more extensive list of methodological suggestions for this
important genre of research, see Henderson, Kimmelman, Fergusson, et
al., 2013.)
Unfortunately, while the ARRIVE initiative has been welcomed by
animal researchers and a wide swath of preclinical journals, enforcement
has been a recurring disappointment, as demonstrated in the following
study title “Two Years Later: Journals Are Not Yet Enforcing the
ARRIVE Guidelines on Reporting Standards for Pre-Clinical Animal
Studies” (Baker, Lidster, Sottomayor, & Amor, 2014). The investigators,
in examining a large number of such studies published in PLoS and
Nature journals (both of which officially endorsed the ARRIVE
reporting guidelines), found that
From one perspective, perhaps 2 years is not a great deal of time for
comprehensive guidelines such as these to be implemented. But from a
scientific perspective, this glass is neither half full nor half empty
because behaviors such as blinding, randomization, and power analyses
should not require guidelines in the 21st century. Rather their
commission should be ironclad prerequisites for publication.
Whether the primary etiology of this disappointing state of affairs
resides in journal policies, mentoring, or knowledge deficiencies is not
known. However, since hopefully 99% of practicing scientists know that
these three methodological procedures are absolute requirements for the
production of valid experimental findings, the major onus probably falls
on the journals themselves. One example is the suggestion by Simmons,
Nelson, and Simonsohn (2012) (based on their iconic 2011 article) that
investigators affix a 21-word statement to their published experiments
(i.e., “We report how we determined our sample size, all data exclusions
(if any), all manipulations, and all measures in the study”). This simple
innovation inspired both the PsychDisclosure initiative (LeBel,
Borsboom, Giner-Sorolla, et al., 2013) and the decision of the editor of
the most prestigious journal in the field (Psychological Science) to
require authors’ disclosure via a brief checklist. (Checklists, incidentally,
have been shown to be important peer review aids and are actually
associated with improved compliance with methodological guidelines
[Han, Olonisakin, Pribis, et al., 2017].)
Next Up
References
Boutron, I., Dutton, S., Ravaud, P., & Altman, D. G. (2010). Reporting and
interpretation of randomized controlled trials with statistically nonsignificant
results for primary outcomes. Journal of the American Medical Association,
303, 2058–2064.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaivete among Amazon
Mechanical Turk workers: Consequences and solutions for behavioral
researchers. Behavioral Research Methods, 46, 112–130.
Colagiuri, B., & Benedetti, F. (2010). Testing for blinding is the only way to
determine whether a trial is blind. British Medical Journal, 340, c332.
https://www.bmj.com/rapid-response/2011/11/02/testing-blinding-only-way-
determine-whether-trial-blind
Contopoulos-Ioannidis, D. G., Ntzani, E., & Ioannidis, J. P. (2003). Translation
of highly promising basic science research into clinical applications. American
Journal of Medicine, 114, 477–484.
Fanelli, D. (2009). How many scientists fabricate and falsify research? A
systematic review and meta-analysis of survey data. PLoS ONE, 4, e5738.
Fergusson, D., Glass, K. C., Waring, D., & Shapiro, S. (2004). Turning a blind
eye: The success of blinding reported in a random sample of randomized,
placebo controlled trials. British Medical Journal, 328, 432.
Fiedler, K., & Schwarz N. (2015). Questionable research practices revisited.
Social Psychological and Personality Science 7, 45–52.
Gerber, A. S., Malhotra, N., Dowling, C. M., & Doherty, D. (2010). Publication
bias in two political behavior literatures. American Politics Research, 38, 591–
613.
Gould, S. J. (1981). The mismeasure of man. New York: Norton.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes
and p values: what should be reported and what should be replicated?
Psychophysiology, 33, 175–183.
Han, S., Olonisakin, T. F., Pribis, J. P., et al. (2017). A checklist is associated
with increased quality of reporting preclinical biomedical research: A
systematic review. PLoS ONE, 12, e0183591.
Head, M. L., Holman, L., Lanfear, R., et al. (2015). The extent and consequences
of p-hacking in science. PLoS Biology, 13, e1002106.
Henderson, V. C., Kimmelman, J., Fergusson, D., et al. (2013). Threats to
validity in the design and conduct of preclinical efficacy studies: A systematic
review of guidelines for in vivo animal experiments. PLoS Medicine, 10,
e1001489.
Hixson, J. R. (1976). The patchwork mouse. Boston: Anchor Press.
Hróbjartsson, A., Forfang, E., Haahr, M. T., et al. (2007). Blinded trials taken to
the test: An analysis of randomized clinical trials that report tests for the
success of blinding. International Journal of Epidemiology, 36, 654–663.
Ioannidis, J. P. A. (2007). Limitations are not properly acknowledged in the
scientific literature. Journal of Clinical Epidemiology, 60, 324–329.
Ioannidis, J. P. A., Klavans, R., & Boyack, K. W. (2018). Thousands of scientists
publish a paper every five days. Nature, 561, 167–169.
Ioannidis, J. P. A. (2018). The challenge of reforming nutritional epidemiologic
research. JAMA, 320, 969–970.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of
questionable research practices with incentives for truth-telling. Psychological
Science, 23, 524–532.
Karlowski, T. R., Chalmers, T. C., Frenkel, L. D., et al. (1975). Ascorbic acid for
the common cold: A prophylactic and therapeutic trial. Journal of the
American Medical Association, 231, 1038–1042.
Kerr, N. L. (1991). HARKing: Hypothesizing after the results are known.
Personality and Social Psychology Review, 2, 196–217.
Kilkenny, C., Browne, W. J., Cuthill, I. C., et al. (2010). Improving bioscience
research reporting: The ARRIVE Guidelines for reporting animal research.
PLoS Biology, 8, e1000412.
Kilkenny, C., Parsons, N., Kadyszewski, E., et al. (2009). Survey of the quality
of experimental design, statistical analysis and reporting of research using
animals. PLoS ONE, 4, e7824.
Kola, I., & Landis, J. (2004). Can the pharmaceutical industry reduce attrition
rates? Nature Reviews Drug Discovery, 3, 711–715.
Krawczyk, M. (2008). Lies, Damned lies and statistics: The adverse incentive
effects of the publication bias. Working paper, University of Amsterdam.
http://dare.uva.nl/record/302534
LeBel, E. P., Borsboom, D., Giner-Sorolla, R., et al. (2013).
PsychDisclosure.org: Grassroots support for reforming reporting standards in
psychology. Perspectives on Psychological Science, 8, 424–432.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just
below .05. Quarterly Journal of Experimental Psychology, 65, 2271–2279.
Melander, H., Ahlqvist-Rastad, J., Meijer, G., & Beermann, B. (2003). Evidence
b(i)ased medicine-selective reporting from studies sponsored by
pharmaceutical industry: Review of studies in new drug applications. British
Medical Journal, 326, 1171–1173.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and
Social Psychology, 67, 371–378.
Mills, J. L. (1993). Data torturing. New England Journal of Medicine, 329,
1196–1199.
Moseley, J. B., O’Malley, K., Petersen, N. J., et al. (2002). A controlled trial of
arthroscopic surgery for osteoarthritis of the knee. New England Journal of
Medicine, 347, 82–89.
Munger, K., & Harris, S. J. (1989). Effects of an observer on hand washing in
public restroom. Perceptual and Motor Skills, 69, 733–735.
Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many
guises. Review of General Psychology, 2, 175–220.
O’Boyle, Jr., E. H., Banks, G. C., & Gonzalez-Mule, E. (2014). The Chrysalis
effect: How ugly initial results metamorphosize into beautiful articles. Journal
of Management, 43, 376–399.
Ridley, J., Kolm, N., Freckelton, R. P., & Gage, M. J. G. (2007). An unexpected
influence of widely used significance thresholds on the distribution of reported
P-values. Journal of Evolutionary Biology, 20, 1082–1089.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22, 1359–1366.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2012). A 21 word solution.
https://ssrn.com/abstract=2160588
Stroebe, W., Postmes, T., & Spears, R. (2012). Scientific misconduct and the
myth of self-correction in science. Perspectives on Psychological Science, 7,
670–688.
Taubes, G. (1993). Bad science: The short life and weird times of cold fusion.
New York: Random House.
Turner, E. H., Matthews, A. M., Linardatos, E., et al. (2008) Selective
publication of antidepressant trials and its influence on apparent efficacy. New
England Journal of Medicine, 358, 252–260.
Vaux, D. (2012). Know when your numbers are significant. Nature, 492, 180–
181.
Vinkers, C. H., Tijdink, J. K., & Otte, W. M. (2015). Use of positive and negative
words in scientific PubMed abstracts between 1974 and 2014: Retrospective
analysis. British Medical Journal, 351, h6467.
Wade, N. (1976). IQ and heredity: Suspicion of fraud beclouds classic
experiment. Science, 194, 916–919.
Watson, J. D., & Crick, F. (1953). A structure for deoxyribose nucleic acid.
Nature, 171, 737–738.
4
A Few Case Studies of QRP-Driven
Irreproducible Results
The title of our first entry unambiguously describes the article's bottom-
line conclusions, as well as those of several of the other case studies
presented in this chapter.
The authors go on to rather compassionately suggest how such
practices can occur without the conscious awareness of their
perpetrators.
Working scientists are also keenly aware of the risks of data dredging, and
they use confidence intervals and p-values as a tool to avoid getting fooled
by noise. Unfortunately, a by-product of all this struggle and care is that
when a statistically significant pattern does show up, it is natural to get
excited and believe it. The very fact that scientists generally don’t cheat,
generally don’t go fishing for statistical significance, makes them
vulnerable to drawing strong conclusions when they encounter a pattern
that is robust enough to cross the p < 0.05 threshold. (p. 464)
But now let’s begin our more detailed case studies by considering a
hybrid psychological-genetic area of study involving one of the former’s
founding constructs (general intelligence or the much reviled [and
reified] g). We won’t dwell on whether this construct even exists, but for
those doubters who want to consider the issue further, Stephen Jay
Gould’s The Mismeasure of Man (1981) or Howard Gardner’s (1983) Frames
of Mind: The Theory of Multiple Intelligences are definitely
recommended.
What makes the following study so relevant to reproducibility is (a)
its demonstration of how a single QRP can give rise to an entire field of
false-positive results and (b) its illustration of the power of the
replication process to identify these fallacious results.
traits” (p. 1315). (Apparently a large literature has indeed found
statistically significant correlations between various single-nucleotide
polymorphisms [SNPs; of which there are estimated to be ten million
or so] and g.)
The present example represented a replication of 13 of the SNPs
previously found to be related to g in an exhaustive review by Antony
Payton (2009) spanning the years from 1995 to 2009; these SNPs
happened to be located near 10 potentially propitious genes. The
authors’ replications employed three large, “well-characterized”
longitudinal databases containing yoked information on at least 10 of
these 13 SNPs: (a) a sample of Wisconsin high school students and a randomly
selected sibling of each (N = 5,571), (b) the initial and offspring cohorts of the
Framingham Heart Study (N = 1,759), and (c) a sample of recently
genotyped Swedish twins born between 1936 and 1958 (N = 2,441).
In all, this effort resulted in 32 regression analyses (controlling for
variables such as age, gender, and cohort) performed on the relationship
between the appropriate data on each SNP and IQ. Only one of these
analyses reached statistical significance at the 0.04 level (a very low
bar), and it was in the opposite direction to that occurring in one of the
three original studies. Given the available statistical power of these
replications and the number of tests computed, the authors estimate that
at least 10 of these 32 associations should have been significant at the
.05 level had the originally reported effects been real.
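For perspective, a back-of-the-envelope calculation (mine, not the authors') shows what chance alone would be expected to deliver across 32 independent tests of null associations at the .05 level, namely one or two significant results rather than ten:

    from scipy import stats

    k, alpha = 32, 0.05
    print(f"expected significant by chance: {k * alpha:.1f}")            # about 1.6
    print(f"P(one or fewer significant): {stats.binom.cdf(1, k, alpha):.2f}")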
Their conclusions, while addressed specifically to genetic social
science researchers, are unfortunately relevant to a much broader
scientific audience.
A Follow-Up
Partially due to replication failures such as this one (but also due to
technology and lowering costs of a superior alternative), candidate gene
analyses involving alpha levels of .05 have largely disappeared from the
current scientific literature. Now genome-wide association studies are in
vogue, and the genetics research community has reduced the significance
criterion by several orders of magnitude (i.e., p ≤ .00000005). Obviously,
as illustrated in the simulations discussed in Chapter 2, appropriate
reductions in titular alpha levels based on sensible criteria will greatly
reduce the prevalence of false-positive results, and perhaps, just perhaps,
psychology will eventually follow suit and reduce its recommended
alpha level for new “discoveries” to a more sensible level such as 0.005.
Unfortunately, psychological experiments would be considerably more
difficult to conduct if the rules were changed in this way. And much of
this science (and alas others as well) appears to be a game, as suggested
by the Bakker, van Dijk, and Wicherts (2012) title (“The Rules of the
Game Called Psychological Science”) along with many of the previously
discussed articles. But let’s turn our attention to yet another hybrid
psychological-physiological foray somewhere past the boundaries of
either irreproducibility or inanity, this time with a somewhat less
pejorative title.
myriad other psychosocial constructs (e.g., emotion, personality, and
“social cognition”) were in excess of 0.80. (Correlation coefficients
range from −1.00 to +1.00, with +/−1.00 representing a perfect
correspondence between two variables and zero indicating no
relationship whatever.) Then, given the “puzzling” (some would
unkindly say “highly suspicious”) size of these correlations, the team
delved into the etiology of these astonishingly large (and decidedly
suspect) values.
Suspect because veridical correlations (as opposed to those observed
by chance, data analysis errors, or fraud) of this size are basically
impossible in the social sciences because the reliability (i.e., stability or
consistency) of these disciplines’ measures typically falls below 0.80 (and
the reliability of neuroimaging measures, regardless of the disciplines
involved, usually falls a bit short of 0.70). Reliability, as the authors of
this study note, places an algebraic upper limit on the correlation that can
actually be observed between two variables, even when their true
correlation is perfect (i.e., 1.00), via the following very simple formula:
Formula 4.1: The maximum correlation possible between two variables
given their reliabilities: maximum observable correlation = true correlation
× √(reliability of variable 1 × reliability of variable 2)
and a trait measure” (p. 276), which if nothing else is a preconfirmation
of Ioannidis’s previously mentioned 2011 analysis concerning the
extent of publication bias in fMRI research in general.
The next step in teasing out the etiology of this phenomenon
involved contacting the authors of the 55 studies to obtain more
information on how they performed their correlational analyses.
(Details on the fMRI data points were not provided in the journal
publications.) Incredibly, at least some information was received from
the authors of 53 of the 55 articles, which is close to a record response
rate for such requests and probably indicative of those authors’ confidence
regarding the validity of their results.
Now to understand the etiology of these “puzzling” correlation
coefficients, a bit of background is needed on the social science
stampede into brain imaging as well as the analytics of how increased
blood flow is measured in fMRI studies in general.
First, Vul and colleagues marveled at the eagerness with which the
social sciences (primarily psychology) had jumped into the
neuroimaging fad only a few years prior to the present paper, as
witnessed by
Second, three such studies were briefly described by the authors (two
published in Science and one in a journal called NeuroImage) that
associated brain activity in various areas of the brain with self-reported
psychological scales while
Incredibly, the average correlation for the three studies was 0.77.
But, as explained earlier, this is statistically impossible given the
amount of error (i.e., 1− reliability) present in both the psychological
predictors and the fMRI scans. So even if the true (i.e., error-free)
relationship between blood flow in the studied regions and the
psychological scales was perfect (i.e., 1.00), the maximum numerically
possible correlation among these variables in our less than perfect
scientific world would be less than 0.77. Or, said another way, the
results obtained were basically statistically impossible since practically
no social science scales are measured with sufficient precision to
support such high correlations.
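Plugging illustrative reliabilities into Formula 4.1 makes the point concrete; the figures of 0.8 for a good psychological scale and 0.7 for an fMRI measure are my assumptions, in line with the ballpark values cited above:

    import math

    r_true = 1.0                                     # a perfect underlying relationship
    reliability_scale, reliability_fmri = 0.8, 0.7   # assumed reliabilities
    r_max = r_true * math.sqrt(reliability_scale * reliability_fmri)
    print(f"maximum observable correlation ~ {r_max:.2f}")   # about 0.75, below 0.77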
The authors also kindly and succinctly provided a scientific
explanation (as opposed to the psychometric one just tendered) for why
perfect correlations in studies such as these are almost impossible.
First, it is far-fetched to suppose that only one brain area influences any
behavioral trait. Second, even if the neural underpinnings of a trait were
confined to one particular region, it would seem to require an
extraordinarily favorable set of coincidences [emphasis added] for the
BOLD signal (basically a blood flow measure) assessed in one particular
stimulus or task contrast to capture all functions relevant to the behavioral
trait, which, after all, reflects the organization of complex neural circuitry
residing in that brain area. (p. 276)
And finally, to complete this brief tour of fMRI research, the authors
provide a very clear description of how brain imaging “works,”
which I will attempt to abstract without botching it too badly.
A functional scanning image is composed of multiple blood
flow/oxygenation signals from roughly cube-shaped regions of the
brain called voxels (volumetric pixels, which may be as small as 1 mm³
or as large as 125 mm³). The number of voxels in any given image
typically ranges from 40,000 to 500,000 of these tiny three-dimensional
pieces of the brain, and the blood flow within each of these can be
correlated with any other data collected on the individual (in this case,
psychosocial questionnaires involving self-reports).
Each voxel can then be analyzed separately with any variable of
interest available on (or administered to) the individuals scanned—
normally 20 or fewer participants are employed (sometimes
considerably fewer) given the expense and machine time required. As
mentioned, in the present case these dependent variables are
psychosocial measures of perhaps questionable validity, but in principle
they can reflect anything. The intervention can also encompass a wide
range of manipulations, such as contrasting behavioral scenarios or
gaming exercises.
Thus we have a situation in which there are potentially hundreds of
thousands of correlations that could be run between each voxel “score”
and a single independent variable (e.g., a digital game structured to
elicit an emotional response of some sort or even listening to “Hot
Potato” vs. “Kalimba,” although unfortunately, to my knowledge, that
particular study has yet to be conducted via the use of fMRI). Naturally,
reporting thousands of correlation coefficients would take up a good
deal of journal space so groups of voxels are selected in almost any
manner investigators choose in order to “simplify” matters. And therein
resides the solution to our investigators’ “puzzlement” because the
following mind-bending strategy was employed by a majority of the 53
responding authors:
genre of research, including (a) ensuring that whoever chooses the
voxels of interest be blinded to the voxel–behavioral measure
correlations and (b) not peeking at the behavioral results while
analyzing the fMRI output.
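For the statistically curious, the nonindependence problem can be reproduced with nothing but random numbers. In the sketch below (all values are pure noise, and the sizes are loosely based on the figures above: 20 participants and tens of thousands of voxels), simply selecting the voxels that happen to correlate best with the "behavioral" score and then reporting their average correlation manufactures an impressively large number out of nothing:

    import numpy as np

    rng = np.random.default_rng(5)
    n_subjects, n_voxels = 20, 50_000
    voxels = rng.normal(size=(n_subjects, n_voxels))     # noise "activations"
    behavior = rng.normal(size=n_subjects)               # noise "trait" scores
    # Pearson correlation of every voxel with the behavioral score.
    b = (behavior - behavior.mean()) / behavior.std()
    v = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
    correlations = b @ v / n_subjects
    # "Select" the 100 voxels that happen to correlate best, then report them.
    top = np.sort(correlations)[-100:]
    print(f"mean correlation of the selected voxels ~ {top.mean():.2f}")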
Closer inspection reveals that this problem [the voxel correlations] is only a
special symptom of a broader methodological problem that characterizes all
paradigmatic research [emphasis added] not just neuroscience. Researchers
not only select voxels to inflate effect size, they also select stimuli, task
settings, favorable boundary conditions, dependent variables and
independent variables, treatment levels, moderators, mediators, and
multiple parameter settings in such a way that empirical phenomena
become maximally visible and stable [emphasis added again because this
long sentence encapsulates such an important point]. In general, paradigms
can be understood as conventional setups for producing idealized, inflated
effects. Although the feasibility of representative designs is restricted, a
viable remedy lies in a reorientation of paradigmatic research from the
visibility of strong effect sizes to genuine validity and scientific scrutiny. (p.
163)
Fiedler’s methodological language can be a bit idiosyncratic at times,
such as his use of the term “metacognitive myopia” to characterize “a
tendency in sophisticated researchers, who only see the data but overlook
the sampling filters behind, [that] may be symptomatic of an industrious
period of empirical progress, accompanied by a lack of interest in
methodology and logic of science” (p. 167). However, he minces no
words in getting his basic message across via statements such as this:
Major components of this science involve (a) case control studies, in
which individuals with a disease are compared to those without the
disease, and (b) secondary analyses of large databases to tease out risk
factors and causes of diseases. (Excluded here are the roots of the
discipline’s name, tracking potential epidemics and devising strategies to
prevent or slow their progress—obviously a vital societal activity upon
which we all depend.)
Also, while case control studies have their shortcomings, it is the
secondary analysis wing of the discipline that concerns us here. The
fodder for these analyses is mostly comprised of surveys, longitudinal
studies (e.g., the Framingham Heart Study), and other large databases
(often constructed for other purposes but fortuitously tending to be
composed of large numbers of variables with the potential of providing
hundreds of secondary analyses and hence publications).
The most serious problems with such analyses include already
discussed QRPs such as (a) data involving multiple risk factors and
multiple conditions which permit huge fishing expeditions with no
adjustments to the alpha level, (b) multiple confounding variables which,
when (or if) identifiable, can only be partially controlled by statistical
machinations, and (c) reliance on self-reported data, which in turn relies
on faulty memories and under- or overreporting biases.
These and other problems are discussed in a very readable article titled
“Epidemiology Faces Its Limits,” written more than two decades ago
(Taubes, 1995) but which still has important scientific reproducibility
implications today.
Taubes begins his article by referencing conflicting evidence
emanating from analyses designed to identify cancer risk factors—
beginning with those garnering significant press coverage during the
year previous to the publication of his article (i.e., 1994).
1. Residential radon exposure caused lung cancer, although another study
found that it did not.
2. DDT exposure was not associated with breast cancer, which conflicted
with the findings of previous, smaller, positive studies.
3. Electromagnetic fields caused brain cancer, which conflicted with a
previous study.
Over the years, such studies have come up with a “mind-numbing array of
potential disease-causing agents, from hair dyes (lymphomas, myelomas,
and leukemia) to coffee (pancreatic cancer and heart disease) to oral
contraceptives and other hormone treatments (virtually every disorder
known to woman). (p. 164)
Coffee causes pancreatic cancer. Type A personality causes heart attacks.
Trans-fat is a killer. Women who eat breakfast cereal give birth to more
boys. [Note that women do not contribute the necessary Y chromosome for
producing male babies.] (p. 116)
There are 275 × 32 = 8800 potential endpoints for analysis. Using simple
linear regression for covariate adjustment, there are approximately 1000
potential models, including or not including each demographic variable
[there were 10]. Altogether the search space is about 9 million models and
endpoints. The authors remain convinced that their claim is valid. (p. 120)
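The arithmetic behind that "search space" is worth spelling out (my reconstruction of the quoted numbers): 275 × 32 gives the 8,800 endpoints, and the freedom to include or exclude each of the 10 demographic variables gives roughly 2^10 ≈ 1,000 candidate models per endpoint, hence about nine million ways to analyze a single dataset:

    endpoints = 275 * 32                 # 8,800 potential endpoints
    models_per_endpoint = 2 ** 10        # include/exclude each of 10 covariates
    print(endpoints, models_per_endpoint, endpoints * models_per_endpoint)  # 9,011,200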
A large number of nonexperimental studies have demonstrated that
improved cognition and educational attainment are associated with large
health benefits in adulthood [seven citations were provided]. However, the
short-term health effects of different schooling policies are largely
unknown, and long-term effects have never been evaluated using a
randomized trial. (p. 1468)
Between 1985 and 2007, there were 42 deaths among the 3,024 Project
STAR participants who attended small classes, 45 deaths among the 4,508
participants who attended regular classes, and 59 deaths among the 4,249
participants who attended regular classes with an aide. (p. 1468)
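For reference, the raw death rates and a crude chi-square test on this 3 × 2 table (my calculation, which ignores follow-up time and the survival analysis the investigators actually employed) look like this:

    from scipy.stats import chi2_contingency

    deaths = [42, 45, 59]                 # small, regular, regular-with-aide classes
    enrolled = [3024, 4508, 4249]
    for d, n in zip(deaths, enrolled):
        print(f"{d}/{n} = {d / n:.2%}")
    table = [[d, n - d] for d, n in zip(deaths, enrolled)]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square p = {p:.2f}")      # roughly 0.2 for this crude comparison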
Interestingly, the authors never discuss the possibility that this might
be a chance finding or that the relationship might be non-causal in
nature or that the entire study was an obvious fishing expedition.
Instead they came up with the following mechanism of action:
It is also tempting to speculate that this finding might have run
counter to the investigators’ original expectations, given their above-
quoted rationale for the study and the fact that the class size study in
question resulted in significant learning gains (i.e., “improved
cognition and educational attainment are associated with large health
benefits in adulthood” [p. 1468]).
plausible mechanism of action and are so methodologically sloppy that
they can be perfunctorily dismissed as not falling under the rubric of
legitimate science.
This latter category happens to constitute the subject matter of our
next chapter, which deals with a topic sometimes referred to as
“pathological” science. It is certainly several steps down the scientific
ladder from anything we’ve discussed so far, but it must be considered
since it is actually a key component of our story.
References
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called
psychological science. Perspectives on Psychological Science, 7, 543–554.
Beall, A. T., & Tracy, J. L. (2013). Women are more likely to wear red or pink at
peak fertility. Psychological Science, 24, 1837–1841.
Chabris, C., Herbert, B., Benjamin, D., et al. (2012). Most reported genetic
associations with general intelligence are probably false positives.
Psychological Science, 23, 1314–1323.
Durante, K., Rae, A., & Griskevicius, V. (2013). The fluctuating female vote:
Politics, religion, and the ovulatory cycle. Psychological Science, 24, 1007–
1016.
Eisenberger, N. I., Lieberman, M. D., & Williams, K. D. (2003). Does rejection
hurt? An FMRI study of social exclusion. Science, 302, 290–292.
Fiedler, K. (2011). Voodoo correlations are everywhere—not only in
neuroscience. Perspectives on Psychological Science, 6, 163–171.
Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. New
York: Basic Books.
Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent
analysis—a “garden of forking paths”—explains why many statistically
significant comparisons don’t hold up. American Scientist, 102, 460–465.
Giner-Sorolla, R. (2012). Science or art? How aesthetic standards grease the way
through the publication bottleneck but undermine science. Perspectives on
Psychological Science, 7, 562–571.
Gould, S. J. (1981). The mismeasure of man. New York: Norton.
Ioannidis, J. P. (2011). Excess significance bias in the literature on brain volume
abnormalities. Archives of General Psychiatry, 68, 773–780.
Kerr, N. L. (1991). HARKing: Hypothesizing after the results are known.
Personality and Social Psychology Review, 2, 196–217.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades.
The Future of Children, 5, 113–127.
Muennig, P., Johnson, G., & Wilde, E. T. (2011). The effect of small class sizes
on mortality through age 29 years: Evidence from a multicenter randomized
controlled trial. American Journal of Epidemiology, 173, 1468–1474.
Park, R. (2000). Voodoo science: The road from foolishness to fraud. New York:
Oxford University Press.
Payton, A. (2009). The impact of genetic research on our understanding of
normal cognitive ageing: 1995 to 2009. Neuropsychology Review, 19, 451–
477.
Sander, D., Grandjean, D., Pourtois, G., et al. (2005). Emotion and attention
interactions in social cognition: Brain regions involved in processing anger
prosody. NeuroImage, 28, 848–858.
Singer, T., Seymour, B., O’Doherty, J., et al. (2004). Empathy for pain involves
the affective but not sensory components of pain. Science, 303, 1157–1162.
Taubes, G. (1995). Epidemiology faces its limits. Science, 269, 164–169.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high
correlations in fMRI studies of emotion, personality, and social cognition.
Perspectives on Psychological Science, 4, 274–290.
Word, E., Johnston, J., Bain, H. P., et al. (1990). Student/teacher achievement
ratio (STAR): Tennessee’s K–3 class size study: Final summary report 1985–
1990. Nashville: Tennessee Department of Education.
Young, S. S., & Karr, A. (2011). Deming, data and observational studies: A
process out of control and needing fixing. Significance, 9, 122–126.
5
The Return of Pathological Science
Accompanied by a Pinch of Replication
here the term “failure to replicate” does not refer to a replication of an
original study not being undertaken but instead to a divergent result and
conclusion being reached from a performed replication.)
While the cold fusion episode will be discussed in a bit more detail
later, an additional psychological example proffered by Langmuir is
probably even more relevant to our story here, given this chemist par
excellence’s personal investigation into research purporting to prove the
existence of extrasensory perception (ESP) conducted by the Duke
psychologist Joseph Banks Rhine (1934).
At the time (1934), Langmuir was attending a meeting of the
American Chemical Society at Duke University and requested a meeting
with Rhine, who quite enthusiastically agreed due to Langmuir’s
scientific eminence. Interestingly, the visitor transparently revealed his
agenda at the beginning of the meeting by explaining his opinion about
“the characteristics of those things that aren’t so” and that he believed
these applied to Rhine’s findings.
Rhine laughed and said “I wish you’d publish that. I’d love to have
you publish it. That would stir up an awful lot of interest. I’d have more
graduate students. We ought to have more graduate students. This thing
is so important that we should have more people realize its importance.
This should be one of the biggest departments in the university.”
Fortunately for Duke, the athletic department was eventually awarded
that status but let’s return to our story. Basically, Rhine revealed that he
was investigating both clairvoyance, in which an experimental
participant was asked to guess the identity of facedown cards, and
telepathy, in which the participant was required to read the mind of
someone behind a screen who knew the identity of each card. Both
reside in the extrasensory perception realm, as did Daryl Bem’s
precognition studies that had such a large impact on the reproducibility
initiative.
As designed, these experiments were quite easy to conduct and the
results should have been quite straightforward since chance occurrence
was exactly five (20%) correct guesses with the 25-card deck that Rhine
employed. His experimental procedures appeared fine as well (at
least for that era), so, short of fraud or some completely unanticipated
artifact (such as occurred with the horse Clever Hans who exhibited
remarkable arithmetic talents
[https://en.wikipedia.org/wiki/Clever_Hans]), there was no obvious way
the results could have been biased. (Spoiler alert: all scientific results
can be biased, purposefully or accidentally.)
After conducting thousands of trials (Langmuir estimated hundreds of
thousands, and he was probably good with numbers), Rhine found that
his participants correctly guessed the identity of the hidden cards 28% of
the time. On the surface this may not sound earth-shattering but given
the number of experiments conducted, the probability associated with
these thousands upon thousands of trials would undoubtedly be
equivalent to a randomly chosen individual winning the Mega Millions
lottery twice in succession.
Needless to say this result met with a bit of skepticism on Langmuir’s
part, and it wasn’t greatly assuaged when Rhine mentioned that he had
(a) filled several filing cabinets with the results of experiments that had
produced only chance results or lower and (b) taken the precaution of
sealing each file and placing a code number on the outside because he
“Didn’t trust anybody to know that code. Nobody!”
When Langmuir impoliticly (after all he was a guest even if an
enthusiastically welcomed one) expressed some incredulity at Rhine’s
ignoring such a mountain of negative evidence locked away on the
theory that his detractors had deliberately guessed incorrectly just to
“spite” him, Rhine was not in the least nonplussed. After a bit more
probing on Langmuir’s part, Rhine did amend his reason for not at least
mentioning these negative results in his book (1934) on the topic to the
fact that he hadn’t had time to digest their significance and, furthermore,
didn’t want to mislead the public.
Naturally, Rhine’s work has been replicated, but his results have not
(Hines, 2003). As an interesting aside, in preparation for his meeting at
Duke, Langmuir even “commissioned” his own replication by
convincing his nephew, an employee of the Atomic Energy Commission
at the time, to recruit some of his friends to spend several of their
evenings attempting to replicate Rhine’s experiments. At first the group
became quite excited because their results (28% or 7 correct guesses out
of 25) almost perfectly reflected those of Rhine’s, but soon thereafter the
results regressed down to chance (i.e., 5 correct guesses out of 25 cards).
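The regression to chance is exactly what the binomial distribution predicts. A couple of lines of Python (the 100,000-trial figure is merely my stand-in for Langmuir's "hundreds of thousands") show why 7 correct out of 25 is unremarkable in a single evening, while a sustained 28% over an enormous number of trials would be astronomical:

    import math
    from scipy import stats

    # One 25-card run: chance is 5 correct (20%); 7 or more is commonplace.
    print(f"P(7 or more of 25 by chance) = {stats.binom.sf(6, 25, 0.2):.2f}")   # ~0.22
    # A sustained 28% hit rate: express the excess in standard deviations.
    n_trials = 100_000                                   # hypothetical stand-in
    z = (0.28 - 0.20) * math.sqrt(n_trials / (0.2 * 0.8))
    print(f"28% over {n_trials} trials sits roughly {z:.0f} SDs above chance")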
Langmuir concluded his fascinating talk as follows (remember this is a
transcription of a poorly recorded, informal lecture):
The characteristics of [the examples he discussed], they have things in
common. These are cases where there is no dishonesty involved but where
people are tricked into false results by a lack of understanding about what
human beings can do to themselves in the way of being led astray by
subjective effects, wishful thinking or threshold interactions. These are
examples of pathological science [emphasis added]. These are things that
attracted a great deal of attention. Usually hundreds of papers have been
published upon them. Sometimes they have lasted for fifteen or twenty
years and then they gradually die away. (p. 13 of Hall’s transcript of
Langmuir’s Colloquium on Pathological Science [1953].)
going one step farther, that the direction of causation is not limited to
past events influencing future ones, but can travel from the future to the
past.
Rather than describing all nine experiments, let’s allow Bem to
describe the basic methodology of his last two presented experiments,
which were very similar in nature and, of course, positive. (Authors of
multiexperiment studies often save what they consider to be the most
definitive studies for last and sometimes even present a negative study
first to emphasize later, more positive findings). His abbreviated
description of the eighth experiment’s objective follows:
And who could argue with such a venerable theoretician as the White
Queen? So to shorten the story a bit, naturally the hypothesis was
supported:
The results show that practicing a set of words after the recall test does, in
fact, reach back in time [emphasis added] to facilitate the recall of those
words. (p. 419)
After all, what else could it have been but “reaching back into time?”
Perhaps William of Occam (unquestionably my most demanding virtual
mentor) might have come up with some variant of his harebrained
parsimony principle, but the good man’s work is really outdated so let’s
not go there.
To be fair, Professor Bem does provide a few other theoretical
mechanisms of action related to quantum mechanics emanating from
“conversations” taking place at an “interdisciplinary conference of
physicists and psi researchers sponsored by the American Association
for the Advancement of Science” (Bem, 2011, p. 423). (Perhaps some of
which were centered around one such application advanced by
homeopathy advocates to explain how water’s memory of a substance—
the latter dissolved therein but then completely removed—can still be
there and be palliative even though the original substance elicited anti-
palliative symptoms.)
Perhaps due to the unusual nature of the experiments, or perhaps due
to Bem’s repeated quoting of Carroll’s White Queen (e.g., “memory
works both ways” or “It’s a poor sort of memory that only works
backwards”), some psychologists initially thought the article might be a
parody. But most didn’t since the study of psi has a venerable history
within psychological research, and one survey (Wagner & Monnet, 1979)
found that almost two-thirds of academic psychologists believed that psi
was at least possible. (Although in the discipline’s defense, this belief
was actually lower for psychologists than for other college professors.)
However, it soon became apparent that the article was presented in all
seriousness and consequently garnered considerable attention both in the
professional and public press—perhaps because (a) Bem was an
academically respectable psychological researcher (how “academic
respectability” is bestowed is not clear) housed in a respectable Ivy
League university (Cornell) and (b) the experiments’ methodology
seemed to adequately adhere to permissible psychological research
practice at the time. (Permissible at least in the pre-reproducibility crisis
era but not so much now, since, as mentioned by another Nobel Laureate,
“the times they are [or may be] a-changin’.”)
And therein lay the difficulties of simply ignoring the article since the
methodological quality of the study mirrored that of many other
published studies and few psychologists believed that Professor Bem
would have been untruthful about his experimental methods, much less
have fabricated his data. So, playing by the rules of the game at the time
and presenting considerable substantiating data, it could be argued that
Bem’s methods were marginally adequate. The problem was that by
2011 (the publication date of the article in question), (a) the rules of the
game were actually beginning to change and (b) everyone beyond the
lunatic fringe of the discipline knew that the positive results Bem
reported were somehow simply wrong.
Unfortunately for Bem, while scientific journals (especially those in the social sciences) have historically been reluctant to publish replications unless the original research was sufficiently controversial, interesting, or counterintuitive, this, too, was beginning to change. And his series of experiments definitely qualified on at least two (and for some all three) of those counts.
Two teams of researchers (Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012) quickly performed multiple replications of Bem's most impressive studies (the Galak team choosing Bem's eighth and ninth and the Ritchie, Wiseman, and French team his eighth) and promptly submitted their papers for
publication. The Galak group submitted to the same journal that had
published Bem’s original series (i.e., The Journal of Personality and
Social Psychology), and, incredibly to some (but not so much to others),
the editor of that journal promptly rejected the paper on the basis that it
was his journal’s policy not to publish replications. According to Ed
Yong (2012), the Ritchie team encountered the same resistance in
Science and Psychological Science, which both said that they did not
publish “straight replications.” A submission to the British Journal of
Psychology did result in the paper being sent out for peer review but it
was rejected (although Bem having been selected as one of the peer
reviewers surely couldn’t have influenced that decision) before PLoS
ONE finally published the paper.
Now, as previously mentioned, it is not known whether this episode
marked the birth of the reproducibility movement or was simply one of
its several inaugural episodes. And whether it will have an effect on the
course of scientific progress (or simply serve as another publishing
opportunity for academics), my Bronx mentor will not permit me to
guess. But surely the following article is one of the most impactful
replications in this very unusual and promising movement (with kudos
also to the less cited, but equally impressive, Ritchie et al. replication).
In all, seven replications were conducted, four for Bem’s eighth
experiment and three for the ninth. Undergraduates were used in three
studies and online participants in the other four. Additional
methodological advantages of the replications included:
1. They all used predetermined sample sizes and all employed considerably
more participants than Bem's two studies. As previously discussed, (a)
employing sufficiently large sample sizes to ensure adequate statistical
power and (b) deciding a priori how many participants to employ avoid
two of the most virulent QRPs.
2. The replications used both identical and different words (categories of
words) to Bem’s. This was apparently done to ensure both that (a) the
studies were direct replications of the originals and (b) there wasn’t
something unusual about Bem’s choice of words (a replicator’s version
of both “having one’s cake and eating it, too”).
3. There was less contact between research staff and participants in the
replications, which is important because the former can unconsciously
(or in some cases quite consciously) cue responses from the latter.
4. Post-experimental debriefing included the following question for online
samples: “Did you, at any point during this study, do something else
(e.g., check e-mail)?” Participants were assured that their answer would
not influence their payments for participating. This was done to help
ensure that respondents were taking their task seriously and following
the protocol. If they were not, then obviously the original findings
wouldn’t replicate.
5. At least two of the four relevant false-positive-producing QRPs modeled
by Simmons, Nelson, and Simonsohn (2011), and that may have
characterized Bem's experiments, were completely avoided in the seven
replications. One involved choosing when to stop running participants
and when to add more (the replicators did not look at their data until the
end of the experiments, whereas Bem apparently did, adjusting his
procedures as he went along based on how things were going). (This
obvious QRP was reported by a former research assistant in an excellent
article on the subject in Slate Magazine [Engber, 2017]; a brief simulation
of its statistical consequences appears after this list.) The other avoided
QRP involved selectively choosing which dependent variables to report.
(Bem reputedly conducted multiple analyses but emphasized only the
statistically significant ones.)
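To make the statistical consequences of the first of these QRPs concrete, here is a minimal simulation of the optional-stopping problem; the sample sizes, checking interval, and number of simulation runs are arbitrary choices of mine rather than anything drawn from Bem's experiments or from Simmons et al.'s actual code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_start=20, n_max=100, step=10, alpha=0.05, sims=5000):
    """Two-group comparison with NO true effect: test after every `step`
    additional participants per group and stop as soon as p < alpha."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(size=n_max)   # group 1, drawn from the same distribution
        b = rng.normal(size=n_max)   # group 2, so any "effect" is a false positive
        for n in range(n_start, n_max + 1, step):
            if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
                hits += 1            # "significant" -- stop collecting and report
                break
    return hits / sims

print(false_positive_rate())         # typically well above the nominal 0.05
```

The point is simply that repeatedly peeking at accumulating data and stopping at the first p < .05 can more than double the nominal 5% false-positive rate even when there is no effect at all.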
Yet even though the replicating teams had (a) followed Bem's procedures as closely as possible, (b) used more participants than he did, and (c) ensured that certain QRPs did not occur in their replications, who is to say that they themselves might not be wrong and Bem himself correct?
So, the replicating authors went one step further. Through an
exhaustive search of the published literature they located 10
independent replications of Bem's two most impressive experiments
(i.e., numbers 8 and 9) other than their own (of which, it will be
recalled, there were seven). Numerically, five of these were in a
positive direction (i.e., produced a differential favoring recalled words
that were reinforced after the recall test over words that were not), and
five favored the words that were not reinforced over those that were.
(In other words, the expected result from a coin-flipping exercise.)
Next our heroes combined all 19 studies (i.e., the 2 by Bem, the 7 by
the authors themselves, and the 10 conducted by other researchers) via
a standard meta-analytic technique. Somewhat surprisingly, the results
showed that, as a gestalt, there was no significant evidence favoring
reverse causation even while including Bem’s glowingly positive
results. When the analysis was repeated using only replications of
Bem’s work (i.e., without including his work), the evidence was even
more compelling against psi—reminiscent of Langmuir’s description of
those “things that aren’t so.”
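For readers curious about what "a standard meta-analytic technique" involves, the sketch below illustrates the common inverse-variance (fixed-effect) approach; the effect sizes and standard errors are invented for illustration and are not the values from the 19 psi studies:

```python
import numpy as np
from scipy import stats

# Hypothetical standardized effect sizes (d) and standard errors,
# for illustration only -- NOT the values from the psi replications.
d  = np.array([0.25, 0.02, -0.05, 0.10, -0.12, 0.03, 0.00, -0.08])
se = np.array([0.10, 0.08,  0.09, 0.11,  0.10, 0.07, 0.12,  0.09])

w       = 1 / se**2                      # inverse-variance weights
d_pool  = np.sum(w * d) / np.sum(w)      # pooled effect size
se_pool = np.sqrt(1 / np.sum(w))
z       = d_pool / se_pool
p       = 2 * stats.norm.sf(abs(z))      # two-sided p-value

print(f"pooled d = {d_pool:.3f}, z = {z:.2f}, p = {p:.3f}")
```

Each study is weighted by the inverse of its sampling variance, so large, precise studies dominate the pooled estimate.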
So, was the issue settled? Temporarily perhaps, but this and other such nonsense resurfaces every few years, and young people who are deluged with (and enjoy watching) screen adaptations of Marvel and DC comic books, devotees of conspiracy theories on YouTube, and adults who continue to frequent alternative medical therapists in order to access the placebo effect will probably continue to believe in the paranormal until the planet suffers a major meteor strike.
And, as a footnote, obviously Bem—like Rhine before him—remained
convinced that his original findings were correct. He even
1. Conducted his own meta-analysis (Bem, Tressoldi, Rabeyron, & Duggan,
2016), which, of course, unlike Galak et al.'s, was positive;
2. Preregistered the protocol for a self-replication in a parapsychology
registry (which I had no idea existed) but which theoretically prevented
some of the (probable) original QRPs (http://www.koestler-
parapsychology.psy.ed.ac.uk/Documents/KPU_registry_1016.pdf), which
it will be recalled involved (a) deep-sixing negative findings, (b) cherry-
picking the most propitious outcome variables, (c) tweaking the
experimental conditions as he went along, as Bem (in addition to his
former research assistant) admitted doing in the original studies, and (d)
abandoning “false starts” that interim analyses indicated were trending in
the “wrong” direction (which Bem also admitted doing although he could
not recall the number of times this occurred); and, finally,
3. Actually conducted said replication with Marilyn Schlitz and Arnaud
Delorme which, according to his preregistration, failed to replicate his
original finding. However, when the results were presented at a
parapsychological conference, Engber reports, there was a typically happy,
pathological-science ending for our intrepid investigator.
They presented their results last summer, at the most recent [2016] annual
meeting of the Parapsychological Association. According to their pre-
registered analysis, there was no evidence at all for ESP, nor was there any
correlation between the attitudes of the experimenters—whether they were
believers or skeptics when it came to psi—and the outcomes of the study. In
summary, their large-scale, multisite, pre-registered replication ended in a
failure. [That’s the bad news, but there’s always good news in pathological
science.] In their conference abstract, though, Bem and his co-authors found
a way to wring some droplets of confirmation from the data. After adding in
a set of new statistical tests, ex post facto, they concluded that the evidence
for ESP was indeed “highly significant.”
While the cold fusion debacle of some three decades ago may have
begun to fade from memory (or was never afforded any space therein by
some), it still holds some useful lessons for us today. Ultimately,
following a few initial bumps in the road, the scientific response to the
event should go down as a success story for the replication process in the
physical sciences just as the psi replications did for psychology. But first
a bit of background.
Nuclear fusion occurs when the nuclei of two atoms are forced into
close enough proximity to one another to form a completely different
nucleus. The most dramatic example of the phenomenon occurs in stars
(and our sun, of course) in the presence of astronomically (excuse the
pun) high temperatures and pressures.
Ironically, this celestial process turns out to be the only mechanism by
which the alchemists’ dreams of changing lighter (or less rare) elements
into gold can be realized since their crude laboratory apparatuses
couldn’t possibly supply the necessary energy to duplicate the fusion
process that occurs at the center of stars or the explosion of a nuclear
fusion (hydrogen) bomb. But as Irving Langmuir (pathological science),
Robert Park (voodoo science), and even Barker Bausell (snake oil
science) have illustrated, all of the laws of science can be subverted by a
sufficient amount of ambition, ignorance, and disingenuousness.
So it came to pass on March 23, 1989, that the University of Utah held a press conference in which it was breathlessly announced that two chemists, Stanley Pons and Martin Fleischmann, had invented a method that promised to fulfill the dream of eventually producing unlimited, nonpolluting, and cheap energy using a simple tabletop device that would have charmed any alchemist of centuries past. The apparatus that generated this earthshaking discovery was itself a remarkably simple and cheap electrolysis cell (Figure 5.1).
Figure 5.1 The remarkably simple and cheap device that ostensibly produced a
cold fusion reaction.
https://en.wikipedia.org/wiki/Cold_fusion#/media/File:Cold_fusion_electrolysis.svg
The press conference occurred before any of these results had been
submitted to a peer reviewed journal, apparently to preempt another cold
fusion researcher (Stephen Jones) at another Utah university (Brigham
Young). Jones’s version of inducing nuclear fusion actually had a
recognized scientific mechanism of action but was considered
completely impractical due to the inconvenient fact (which he apparently
recognized) that the tiny amount of energy apparently emanating from
his procedures was far exceeded by the amount of energy required to
produce it.
In any event, a double-helix style race commenced (although perhaps
even more vituperative between Pom and Fleischmann vs. Jones), with
Pom suspecting Jones of trying to steal his work (although there is no
evidence of this). Unlike the race to characterize the structure of DNA,
however, the financial stakes here were so high (potentially involving
trillions of dollars) that the University of Utah’s press conference was
characterized by exaggerated claims about their researchers’ actual
progress.
As a result, both the press and a large swath of the scientific
community appeared to lose their respective minds, with Pons and Fleischmann immediately receiving rock star status accompanied by
dozens of laboratories all over the world beginning the process of
attempting to replicate their results. Very shortly, aided by the simplicity
of the intervention, “confirmations” of the experiment were issued by
researchers at Georgia Tech and Texas A&M, but, before long,
laboratories at MIT, Caltech, Harwell, and others reported failures to do
so. To greatly oversimplify the entire drama, along with the back-and-
forth accusations and counter claims, Georgia Tech and A&M retracted
their confirmations upon re-examining their results.
However, exact replications can be a bit difficult without knowledge
of the actual methods employed in the original work, and Stanley Pons
(who basically became the spokesperson for the entire fiasco) wasn’t
about to share anything with anybody including high-impact peer
reviewed journals or scientific competitors. So although there were
dozens of attempted replications accompanied by an ever decreasing
number of positive results, Pons assured the pathologically gullible press (The Wall Street Journal being the primary advocate of cold fusion research, given the trillion-dollar industry it would potentially spawn) that the many failures to replicate could be explained by the simple fact that the labs producing them had not used the exact and proper Pons-Fleischmann procedures (which, of course, was probably true since Pons refused to share those details with them).
But soon, more and more conventional physicists and chemists became
increasingly critical (and indeed incredulous) that neither the original
Utah experiments nor their “successful” replications bothered to run
controls involving, say, plain water instead of its heavy counterpart. For
experimental controls are not only absolute prerequisites for findings to
be considered credible in any discipline (social, biological, or physical),
but also their absence constitutes an egregious QRP in and of itself. And
almost equally important, for some reason, Pom and Fleischmann failed
to secure sufficiently sensitive (and available) instrumentation to filter
out background environmental contaminants or other sources of
laboratory noise. A few sticklers were even concerned that the etiology
of the effect violated the first law of thermodynamics (i.e., while energy
can be transformed from one form to another, it cannot be created or
destroyed, even in a closed system such as our protagonists’ tabletop
device). But as the old saying goes, “laws are made to be broken.”
But counter-arguments such as these were impotent compared to those
of a consummate snake oil salesman such as Stanley Pons, who dismissed
anything with negative implications as unimportant or part of a witch
hunt by jealous East Coast physicists, especially those affiliated with
MIT and Yale. In fact, Pons was reported by Gary Taubes (1993) in his
splendid 473-page history of the incident (entitled Bad Science: The
Short Life and Weird Times of Cold Fusion) as saying, when confronted
by the ever-growing number of laboratories’ failure to find any effect,
“I’m not interested in negative effects!” (Perhaps he should have been a
journal editor.)
And so it went, press conference after press conference, professional
conference after professional conference. Positive results were
highlighted and negative results were summarily dismissed. Even the
advocates who occasionally produced small degrees of excess heat (a
supposed indicator of the fusion process but obviously of many other
more mundane processes as well) could only do so occasionally. But this
was enough to keep the process alive for a while and seemed to add to its
mysteriousness. For others, it was obvious that something was amiss,
and it wasn’t particularly difficult to guess what that was.
What was obvious to Pons, however, was what was really needed:
more research, far more funding, partnerships with corporations such as
General Electric, and, of course, faith. And, every so often, but with
increasing rarity, one of the process’s proponents would report an
extraordinary claim, such as a report from the Texas A&M laboratory
that their reaction had produced tritium, a ubiquitous byproduct of
fusion. And while this was undoubtedly the most promising finding
emanating from the entire episode, like everything else, it only occurred
once in a few devices in a single lab which eventually led to a general
consensus that the tritium had been spiked by a single individual.
So, gradually, as the failures to replicate continued to pile up and Dr.
Pons kept repeating the same polemics, the press moved on to more
newsworthy issues such as a man biting a dog somewhere in the
heartland. And even the least sensible scientists appeared to have
developed a modicum of herd immunity to the extravagant claims and
disingenuous pronouncements by the terminally infected. So, in only a
very few years, the fad had run its course, although during one of those
heady years cold fusion articles became the most frequently published
topic area in all of the physical sciences.
Today, research on “hot” (i.e., conventional) fusion continues, the
beneficiary of billions in funding, but cold fusion investigations have all
but disappeared in the mainstream scientific literature. Most legitimate
peer reviewed journals in fact now refuse to even have a cold fusion
study reviewed, much less publish one. However, a few unrepentant
investigators, like their paranormal and conspiracy theory compatriots,
still doggedly pursue the dream and most likely will continue to do so
until they die or become too infirm to continue the good fight.
Lessons Learned
Langmuir listed six characteristic symptoms of pathological science, each of which maps remarkably well onto the cold fusion episode:
1. “The maximum effect that is observed is produced by a causative agent of
barely detectable intensity, and the magnitude of the effect is substantially
independent of the intensity of the cause.” This one certainly applies to
cold fusion since astronomically large amounts of heat are required to
generate nuclear fusion while cold fusion required only a simple, low-
voltage electrical current flowing through a liquid medium at room
temperature.
2. “The effect is of a magnitude that remains close to the limit of
detectability, or many measurements are necessary because of the very
low statistical significance of the results.” Over and over again
responsible physicists unsuccessfully argued that the small amount of heat
generated in the tabletop device was millions of times less than that
generated by the smallest known fusion reactions. The same was true for
the tiny number of emitted neutrons (a byproduct of the process)
occasionally claimed to have been measured.
3. “There are claims of great accuracy.” As only two examples, the
occasional reports of tiny amounts of increased heat as measured by our
heroes apparently involved a substandard calorimeter and did not take into
account the facts that (a) different solutions result in different degrees of
conductivity (hence influencing measurable heat) or even that (b) the
amount of commercial electric current (that helped produce said heat) is
not constant and varies according to various conditions (e.g., the
performance of air conditioners on hot summer days).
4. “Fantastic theories contrary to experience are suggested.” This one is
obvious.
5. “Criticisms are met by ad hoc excuses.” As well as downright lies, a
paranoid belief that the individuals failing to replicate positive findings
had hidden or nefarious agendas and/or were not able to employ the
original procedures (since, in the case of cold fusion, these happened to be
closely guarded secrets). Furthermore, even the most avid supporters of
the process admitted that their positive results occurred only sporadically
(hence were not reproducible by any scientific definition of the term).
Various excuses were advanced to explain this latter inconvenient truth,
although only one excuse (admitted ignorance of what was going on) was
not disingenuous.
6. “The ratio of supporters to critics rises and then falls gradually to
oblivion.” As mentioned previously, the number of supporters quickly
approached almost epidemic levels, perhaps to a greater extent than for
any other pathologically irreproducible finding up to that point. However,
facilitated by a discipline-wide replication initiative, the epidemic
subsided relatively quickly. But a good guess is that Pons continues to believe in cold fusion as fervently as Bem still believes in psi.
But since Dr. Langmuir was not privy to the cold fusion epidemic or Daryl Bem's landmark discoveries, perhaps he wouldn't object too strenuously if three additional principles were added: one advanced by a philosopher of science centuries ago who would also have been a Nobel laureate had the award existed, one attributable to a number of scientific luminaries, and one to a completely unknown pundit:
7. “What is done with fewer assumptions is done in vain with more,” said
William of Occam, who counseled scientists choosing between alternative
theories or explanations to prefer the one that required the fewest
unproved assumptions.
8. “Extraordinary claims require extraordinary evidence,” which, according
to Wikipedia, was only popularized by Carl Sagan, having originally been
proposed in one form or another by David Hume, Pierre-Simon Laplace,
and perhaps some others as well. A more extraordinary claim is difficult
to concoct than the contention that a humble tabletop apparatus could
subvert the laws of physics and fuel the world's energy needs for
millennia to come. Or, for that matter, that future events can affect past
ones. (A back-of-the-envelope illustration of why such claims demand
such evidence follows this list.)
9. When a scientific finding sounds too good (or too unbelievable) to be
true, it most likely isn’t (pundit unknown).
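One way to see why item 8 is more than a slogan is a back-of-the-envelope Bayesian calculation; the prior probabilities and likelihood ratios below are purely illustrative and are not estimates for psi, cold fusion, or any actual study:

```python
# Illustrative numbers only: when the prior odds of a claim are very low,
# even moderately strong evidence leaves the claim improbable.
def posterior_prob(prior_prob, likelihood_ratio):
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# A mundane claim (prior 1 in 2) vs. an extraordinary one (prior 1 in a million),
# each supported by the same moderately strong evidence (likelihood ratio = 20).
print(posterior_prob(0.5, 20))     # ~0.95
print(posterior_prob(1e-6, 20))    # ~0.00002

# Only truly extraordinary evidence (likelihood ratio ~10 million) makes the
# extraordinary claim more likely than not.
print(posterior_prob(1e-6, 1e7))   # ~0.91
```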
However, Gary Taubes (1993) probably deserves the final word on the
lessons that the cold fusion fiasco has for science in general as well as its
direct applicability to the reproducibility crisis.
Of all the arguments spun forth in defense of cold fusion, the most often
heard was there must be something to it, otherwise the mainstream scientific
community would not have responded so vehemently to the announcement
of its discovery. What the champions of cold fusion never seemed to realize,
however, or were incapable of acknowledging, was that the vehemence was
aimed not at the science of cold fusion, but at the method [i.e., experimental
methodology]. Positive results in cold fusion were inevitably characterized
by sloppy and amateurish experimental techniques [we could substitute
QRPs here]. If these experiments, all hopelessly flawed, were given the
credibility for which the proponents of cold fusion argued, the science itself
would become an empty and meaningless endeavor. (p. 426)
(One final quote from Taubes, with apologies for quoting him so often, but his book truly is an exemplary exposition of the entire fiasco and should be read in its entirety):
When the cold fusion announcement was made, Robert Bazell, the science
reporter for NBC News, interviewed Robert Park. . . . He asked Park, off the
record, whether he thought cold fusion was fraud, and Park said, “No but
give it two months and it will be.” (p. 314)
And again according to Taubes, Bob was quite proud of this piece of
prognostication, even though he admitted that his timeline was off by
about 6 weeks.
But how about Daryl Bem and the ever-increasing plethora of
positive, often counterintuitive, findings published in today’s literature?
Is that fraud? Unfortunately, Professor Park is quite ill and not able to
grace us with his opinion. So let’s just label psi a QRP-generated
phenomenon and let it go at that. In science, being wrong for the wrong
reasons is bad enough.
In any case, it is always wise to confirm that one's own work is valid to avoid continuing down a dead-end street
and wasting precious time and resources. And regardless of whether
scientists replicate their findings or not, they should also (a) check and
recheck their research findings with an eye toward identifying any QRPs
that somehow might have crept into their work during its conduct and (b)
cooperate fully with colleagues who wish to independently replicate their
work (which, in his defense, Daryl Bem apparently did).
Of course, given the ever increasing glut of new studies being
published daily, everything obviously can’t be replicated, but when an
investigative team plans to conduct a study based on one of these new
findings, replication is a sensible strategy. For while a replication is time-
and resource-consuming, performing one that is directly relevant to a
future project may actually turn out to be a cost- and time-saving device.
Especially if the modeling results involving the prevalence of false-
positive results (e.g., Ioannidis, 2005; Pashler & Harris, 2012) previously
discussed have any validity.
However, some studies must be replicated if they are paradigmatically
relevant enough to potentially challenge the conventional knowledge
characterizing an entire field of study. So all scientists in all serious
disciplines should add the methodologies involved in performing
replications to their repertoires.
In the past, “hard” sciences such as physics and chemistry have had a
far better record for performing replications of potentially important
findings quickly and thoroughly than the “softer” social sciences, but
that difference is beginning to fade. As one example, ironically, in the
same year (2011) that Daryl Bem published his paradigm-shifting
finding, physics experienced a potentially qualitatively similar problem.
In what a social scientist might categorize as a post hoc analysis, the Oscillation Project with Emulsion-tRacking Apparatus (OPERA) recorded
neutrinos apparently traveling faster than the speed of light (Adam,
Agafonova, Aleksandrov, et al., 2011). If true (and not a false-positive
observation), this finding would have negated a “cornerstone of modern
physics” by questioning a key tenet of Einstein's special theory of relativity.
Needless to say, the physics community was more than a little
skeptical of the finding (as, in their defense, were most psychologists
regarding the existence of psi), especially given the fact that only seven
neutrinos were observed traveling at this breakneck speed. In response,
several labs promptly conducted exact replications of the finding within
a few months (similar to the speed at which the Galak and Ritchie et al.
teams replicated Bem’s work). Also not unlike these tangentially
analogous social science replications, all failed to produce any hint of
faster-than-light travel while concomitantly the original OPERA team
discovered the probable cause of the discrepancy in the form of a faulty
clock and a loose cable connection. The ensuing embarrassment,
partially due to the lab’s possibly premature announcement of the
original finding, reportedly resulted in several leaders of the project submitting their resignations (Grossman, 2012).
At present there is little question that the physical sciences are at least
more prestigious, successful, and gifted with a somewhat lower degree of
publication bias (and thus perhaps more likely to have a lower
prevalence of false-positive findings) than the social sciences. The
former’s cumulative success in generating knowledge, theoretical and
useful, is also far ahead of the social sciences although at least some of
that success may be due to the former’s head start of a few thousand
years.
Daniele Fanelli (2010), a leading meta-scientist who has studied many
of the differences between the hard and soft sciences, has argued that, to
the extent that the two scientific genres perform methodologically
comparable research, any differences between them in other respects
(e.g., subjectivity) are primarily only “a matter of degree.” If by
“methodological comparability” Dr. Fanelli means (a) the avoidance of
pathological science and (b) the exclusion of the extreme differences in
the sensitivity of the measurement instrumentation available to the two
genres, then he is undoubtedly correct. However, there seems to be a
huge historical affective and behavioral gap between their approaches to the replication process, one that favors the “hard” sciences.
As one example of this differential disciplinary embrace of the
replication process, a Nature online poll of 1,575 (primarily) physical
and life scientists (Baker, 2016) found that 70% of the respondents had
tried and failed to replicate someone else’s study, and, incredibly, almost
as many had tried and failed to replicate one of their own findings. Among this group, 24% reported having published a successful
replication while 13% had published a failure to replicate one, both an
interesting twist on the publication bias phenomenon as well as indirect
evidence that the replication process may not be as rare as previously
believed—especially outside the social sciences.
If surveys such as this are representative of the life sciences, their
social counterparts have a long way to go despite the Herculean
replication efforts about to be described here. However, it may be that if
the social sciences continue to make progress in replicating their findings
and begin to value being correct over being published, then perhaps the
“hierarchy of science” and terms such as “hard” and “soft” sciences will
eventually become archaic.
References
Adam, T., Agafonova, A., Aleksandrov, A., et al. (2011). Measurement of the
neutrino velocity with the OPERA detector in the CNGS beam. arXiv:1109.4897v1.
Baker, M. (2016). Is there a reproducibility crisis? Nature, 533, 452–454.
Bem D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and
Social Psychology, 100, 407–425.
Bem, D., Tressoldi, P., Rabeyron, T., & Duggan, J. (2016). Feeling the future: A
meta-analysis of 90 experiments on the anomalous anticipation of random
future events. F1000Research, 4, 1188.
Engber, D. (2017). Daryl Bem proved ESP is real: Which means science is
broken. Slate Magazine. https://slate.com/health-and-science/2017/06/daryl-
bem-proved-esp-is-real-showed-science-is-broken.html
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences.
PLoS ONE, 5, e10068.
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the
past: Failures to replicate psi. Journal of Personality and Social Psychology,
103, 933–948.
Grossman, L. (2012). Leaders of controversial neutrino experiment step down.
New Scientist. www.newscientist.com/article/dn21656-leaders-of-
controversial-neutrino-experiment-step-down/
Hines, T. (1983). Pseudoscience and the paranormal: A critical examination of
the evidence. Buffalo, NY: Prometheus Books.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS
Medicine, 2, e124.
Kaptchuk, T. (1999). Intentional ignorance: A history of blind assessment and
placebo controls in medicine. Bulletin of the History of Medicine, 72, 389–
433.
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three
arguments examined. Perspectives on Psychological Science, 7, 531–536.
Rhine, J. B. (1934). Extra-sensory perception. Boston: Bruce Humphries.
Ritchie, S., Wiseman, R., & French, C. (2012). Failing the future: Three
unsuccessful attempts to replicate Bem’s retroactive facilitation of recall
effect. PLoS ONE, 7, e33423.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22, 1359–1366.
Taubes, G. (1993). Bad science: The short life and weird times of cold fusion.
New York: Random House.
Wagner, M. W., & Monnet, M. (1979). Attitudes of college professors toward
extra-sensory perception. Zetetic Scholar, 5, 7–17.
Yong, E. (2012). Bad copy. Nature, 485, 298–300.
PART II
6
The Replication Process
Previous chapters have hinted at the key role that the replication process
plays in enhancing scientific progress and ameliorating the
reproducibility crisis. However, there is little about the scientific process
that is either easy or perfect—replication included—so two points should
probably be reviewed, one affective and one epistemological.
First, the affective perspective: those who replicate a study and fail to
confirm the original finding should not expect to be embraced with open
arms by the original investigator(s). No scientist wishes to be declared
incorrect to the rest of the scientific world and most will undoubtedly
continue to believe (or at least defend) the validity of their results. So
anyone performing a replication should be compulsively careful in its design and conduct, avoiding any potential glitches and any repetition of the original's mistakes. And from both a personal and a scientific
perspective, all modern replications should
1. Be preregistered on a publicly accessible website prior to data
collection. And as important as this dictum is for original research, it may
be even more important for a replication given the amount of blowback
likely to be generated by an offended original investigator whose results
failed to replicate—or by her or his passionate defenders—as often as not
via social media.
2. Be designed with considerably more statistical power than the original
(preferably 0.90). For, as mentioned in Chapter 2, if a replication employs
the same amount of power as a typical original study (0.50 for
psychological experiments), then it will have only a 50% chance of
obtaining statistical significance even if a true effect exists and the
original finding was correct. (A rough numerical sketch of what this
implies for sample size follows this list.)
3. Follow the original design as closely as possible (with the exception of
repeating any questionable research practices [QRPs] therein) since even
the slightest deviation constitutes fodder for a counterattack. Exceptions
exist here, but any design or procedural changes should be justified (and
justifiable) and preferably informed by pilot work.
4. Attempt to engage the original investigators in the replication process as
much as possible. There are at least two reasons for this. First, it will help
assure the original investigators that the replication is not meant to be an
attack on their integrity, hence some of the virulence of their objections to
a disconfirming replication can be deflected by requesting any feedback
they might have regarding the design of the proposed replication. (This is
not only a professional courtesy but good scientific practice since it may
result in an offer to share scientific materials and other key information
seldom available in a typical journal publication.)
5. And, perhaps most importantly of all, to take the time to examine the Open
Science Framework website (https://osf.io) and its abundance of highly
recommended instructions, examples, and information available for both
the preregistration and replication processes.
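As promised in item 2, here is a rough numerical sketch of what a 0.90-power replication demands, using the standard normal-approximation formula for a two-group comparison; the assumed effect size of d = 0.4 is purely illustrative:

```python
from scipy import stats

def n_per_group(d, power=0.90, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison
    (normal approximation): n = 2 * ((z_(1-a/2) + z_power) / d)**2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

d = 0.4  # assumed (illustrative) standardized effect size
print(round(n_per_group(d, power=0.50)))  # ~48 per group: a "typical" original study
print(round(n_per_group(d, power=0.90)))  # ~131 per group (exact t-test values run slightly higher)
```

Roughly speaking, moving from 0.50 to 0.90 power for the same effect size nearly triples the required sample.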
Second, the epistemological perspective: when a replication fails to confirm an original finding, at least four explanations are possible:
1. The original study's methods were flawed and its results were incorrect,
2. The replicating study’s approach was flawed and its results were false,
3. Both studies were incorrect, or
4. The methods or participants used in the second study were substantively
different from those used in the first study (hence the replication does not
match the original in terms of key conditions).
But to make things a bit murkier in social and behavioral research, it is
also always possible that the original study finding could have been
correct at the time but no longer reproducible because (to quote Roland,
the Stephen King character) “the world has moved on.” This possibility
is especially problematic in the scenario-type, often culture-related
studies of which psychology, political science, and economics are so
fond and in which undergraduates and/or Amazon’s Mechanical Turk
participants are almost universally employed.
Said another way, it may be that constantly evolving cultural changes
can influence responses to interventions. Or, given the number of brief
interventional studies being conducted, participants may become more
experienced, sophisticated, and therefore more difficult to blind to group
membership. Or become more susceptible to demand characteristics
purposefully or accidentally presented in the experimental instructions
provided to them. Or, if the original (or a very similar) study has
achieved a degree of notoriety by finding its way into the press or
disciplinary textbooks, participants may have been exposed to it,
recognize the replication’s intent, and consequently respond accordingly.
But setting such quibbles aside, replications remain the best indicators
we have for judging the validity of social scientific results. So let’s begin
by examining the types or genres of replications available to
investigators.
Exact Replications
available with adequate documentation, which, unfortunately, as will be
discussed in Chapter 10, appears to be surprisingly rare.
So let’s now concentrate on the replication of experimental findings,
which leads us to the most recommended genre by just about everyone
interested in reproducibility, although all forms of replication have their
special charms.
computerized administration of interventions (often using Amazon Turk
participants) as a substitute for laboratory presentations employing
college students. Since this change in the presentation of stimuli
sometimes results in subtle changes to the interventions themselves, it is
reasonable to question whether such studies can still be classified as
direct replications. However, it may be more reasonable to question
whether an effect has any scientific importance if it is so tenuous and
fragile that it can’t be replicated if respondents read their instructions
rather than listening to a research assistant recite them.
Many if not most of the 151 replications of psychological studies
conducted by the Open Science Collaboration and the three “Many
Labs” efforts discussed in the next chapter were conducted using
approaches and participants similar to those used in the original studies,
hence any such discrepancies were not likely to have contributed to the
disappointing failure-to-replicate rate in these initiatives—especially
since the first “Many Labs” study found that undergraduate and Amazon Mechanical Turk participants responded similarly to one another in its replications.
In addition, all 151 replications were highly powered and at least as
methodologically sound as the studies they replicated. The replications
also employed dozens of sites and investigators, thereby reducing the
possibility of systematic biases due to settings or individual researchers.
Of course it is always possible that subtle changes in the presentation or
timing of an intervention (which are often necessary in the translation
from laboratory to computer) might affect an outcome, but again, if a
finding is this fragile, how likely is it to be relevant to human behavior in
the noisy milieu of everyday life?
A registered replication report (more on that later) performed by the team of Alogna, Attaya, Aucoin, and colleagues (2014) and published in Perspectives on Psychological Science provides an interesting
perspective on both of these issues (i.e., fragility and “minor”
alternations in study procedures) in a replication of a somewhat
counterintuitive study on a concept (or theory) referred to as “verbal
overshadowing.” The original study (Schooler & Engstler-Schooler,
1990) involved a scenario in which all participants watched a video of a
simulated bank robbery. One group then verbally described the robber
while the other performed an irrelevant task listing US states and
capitals.
Attempting to commit something to memory normally facilitates later
recall but in this case the participants who verbally described the
appearance of the culprit were significantly less successful in identifying
said culprit from a mock lineup than the comparison group who
performed an irrelevant task instead.
The replication study initially failed to support the original finding, but the original study's first author (Jonathan Schooler) objected to a timing change between the two events employed in the replication. Accordingly, the
replication team repeated that aspect of the study and the original effect
reached statistical significance, although, as usual (Ioannidis, 2008), the
initial effect size was larger than the replicated one. (Hence, if nothing
else this represents a positive case study involving cooperation between
replicators and original investigators.)
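The Ioannidis (2008) point, that original effects tend to be larger than their replications, follows almost mechanically from selecting studies that cross the p < .05 threshold. A minimal simulation (with an arbitrary true effect, sample size, and number of runs of my own choosing) makes the point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, sims = 0.3, 30, 20000       # arbitrary illustrative values

observed, significant = [], []
for _ in range(sims):
    a = rng.normal(true_d, 1, n)       # "treatment" group
    b = rng.normal(0.0, 1, n)          # control group
    d_hat = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed.append(d_hat)
    significant.append(stats.ttest_ind(a, b).pvalue < 0.05)

observed, significant = np.array(observed), np.array(significant)
print(observed.mean())               # ~0.3: averaged over all studies, unbiased
print(observed[significant].mean())  # noticeably larger: conditioning on p < .05 inflates it
```

Averaged over every simulated study the estimate is unbiased, but averaged over only the "publishable" significant ones it is inflated, which is exactly why well-conducted replications tend to shrink the original effect.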
As for computer-based versus laboratory-based differences between
replications and original studies, the jury remains out. For example, in a
response to Hagger, Chatzisarantis, Alberts, and colleagues' (2016) failure to replicate something called the “ego depletion effect,” the
original investigators (Baumeister & Vohs, 2016) argued that “the
admirable ideal that all meaningful psychological phenomena can be
operationalized as typing on computer keyboards should perhaps be up
for debate” (p. 575)—an argument that Daryl Bem and others have also
used to suggest a reason for their results’ failures to replicate. (Recall
that a “failure to replicate” here is meant to represent a study that was
replicated but failed to reproduce the original study’s bottom-line result.)
Of course a truly unrepentant curmudgeon might suggest that the idea that any societally meaningful real-world phenomenon can be discovered in an academic psychological laboratory employing undergraduate psychology students “should perhaps be up for debate” as well. By way of example, returning to the successful cooperative “verbal overshadowing” example following the tweaking of the time interval separating the video from the pictorial lineup, one wonders how well this finding would translate to a real-life “operationalization” of the
construct? Say, to someone (a) actually witnessing a real armed bank
robbery in person (possibly accompanied by fear of being shot),
followed by (b) questions regarding the appearance of the robbers
anywhere from a few minutes to a few hours later by police arriving on
the scene, and then (c) followed by a live police lineup several days or
weeks later?
Conceptual Replications
This genre of replication is more common (at least in the social sciences)
than direct replications. Different authors have slightly different
definitions and names for this genre, but basically conceptual
replications usually involve purposefully changing the intervention or
the outcome measure employed in the original study in order to extend
the concept or theory guiding that study. (The experimental procedures
may also be changed as well, but the underlying purpose of this type of
study is normally not to validate the original finding since it is tacitly
assumed to be correct.)
Of course different investigators have different objectives in mind for
conducting a conceptual replication such as
1. “First, direct replications add data to increase precision of the effect size
estimate via meta-analysis” (p. 137).
2. “Second, direct replication can establish generalizability of effects. There
is no such thing as an exact replication. [Presumably the authors are
referring to psychological experiments involving human participants
here.] Any replication will differ in innumerable ways from the original. .
. . Successful replication bolsters evidence that all of the sample, setting,
and procedural differences presumed to be irrelevant are, in fact,
irrelevant.” It isn’t clear why a “successful,” methodologically sound
conceptual replication wouldn’t also “establish generalizability,” but I’ll
defer here.
3. “Third, direct replications that produce negative results facilitate the
identification of boundary conditions for real effects. If existing theory
anticipates the same result should occur and, with a high-powered test, it
does not, then something in the presumed irrelevant differences between
original and replication could be the basis for identifying constraints on
the effect” (p. 137). William of Occam might have countered by asking
“Wouldn’t an at least equally parsimonious conclusion be that the theory
was wrong?”
twist to avoid the study being considered a direct replication. If
“successful,” the resulting underpowered conceptual replication is
published; if not it is deep-sixed. (To go one step farther, it may be that
many researchers consider the ability to produce a statistically significant
study to be the primary indicator of scientific skill rather than
discovering something that can stand the test of time or actually be
useful—but this is definitely an unsupported supposition.)
Replication Extensions
A string of 10 statistically significant studies, each conducted with a power of 0.50, would be extremely improbable (p < .001). And, of course, if one or more of these experiments reported a p-value substantively less than 0.05, then these two probability levels (i.e., < .05 or < .001) would be even lower. So perhaps an astute reader (or peer reviewer) should assume
that something untoward might be operating here in these
multiexperiment scenarios? Perhaps suspecting the presence of a QRP or
two? For a more thorough and slightly more technical explication of
these issues, see Ulrich Schimmack’s 2012 article aptly entitled “The
Ironic Effect of Significant Results on the Credibility of Multiple-Study
Articles” or the Uri Simonsohn, Leif Nelson, and Joseph Simmons
(2014) prescient article, “P-Curve: A Key to the File-Drawer.”
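The arithmetic behind the improbability claim in the preceding paragraph is simple binomial reasoning, assuming each experiment is an independent test conducted at 0.50 power:

```python
from scipy import stats

power, k = 0.50, 10
print(power ** k)                    # ~0.00098: probability that all 10 come out significant
print(stats.binom.pmf(k, k, power))  # same value via the binomial pmf
print(stats.binom.sf(7, k, power))   # ~0.055: even 8 or more of 10 is unlikely
```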
Partial Replications
Although changes in societal attitudes can affect behavior, my findings
indicate that the same situational factors that affected obedience in
Milgram’s participants still operate today. (p. 9)
2010, 2017] based on student test scores of a decade or so ago which
later proved to be a chimera.)
Naturally any educational graduate student (then or now) would have
considered such a finding to be completely counterintuitive (if not
demented) since everyone knew (and knows) that schools of education
train teachers to produce superior student learning. However, since all
doctoral students must find a dissertation topic, let’s pretend that this
hypothetical one convinced his advisor that they should demonstrate
some of the perceived weaknesses of Popham’s experimental procedures.
Accordingly, a conceptual replication (also a term not yet introduced into
the professional lexicon) was performed comparing trained experienced
classroom teachers with entry-level undergraduate elementary education
students. (The latter participants were selected based on their self-
reported lack of teaching experience and were known to have no teacher
training.)
Alas, to what would have been the chagrin of any educational researcher, the replication produced the same inference as Popham's original experiments (i.e., no statistically significant learning differences between the experienced, trained classroom teachers and their inexperienced, untrained undergraduate counterparts).
However, being young, foolish, and exceedingly stubborn, our hero
might have decided to replicate his and his advisor’s own negative
finding using a different operationalization of teacher experience and
training involving a controlled comparison of tutoring versus classroom
instruction—an extremely rare self-replication of a replication of a
negative finding.
So, in order to replicate their conceptual replication of their negative
teacher experience and training study, they might have simply added that
comparison as a third factor in a three-way design producing a 2
(tutoring vs. classroom instruction) by 2 (undergraduate elementary
education majors who had completed no mathematics instructional
courses or teaching experience vs. those who had received both) by 3
(high, medium, and low levels of student ability based on previous
standardized math scores).
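For readers who prefer to see the layout, the hypothetical design just described crosses its three factors into 2 × 2 × 3 = 12 cells; the factor labels below are my own shorthand rather than anything from an actual study:

```python
from itertools import product

delivery = ["tutoring", "classroom instruction"]
teacher  = ["untrained undergraduate", "trained, experienced teacher"]
ability  = ["low", "medium", "high"]

cells = list(product(delivery, teacher, ability))
print(len(cells))   # 12 cells in the 2 x 2 x 3 factorial
for cell in cells:
    print(cell)     # one treatment combination per cell
```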
Naturally, like all hypothetical studies this one was completed without
a hitch (well, to be more realistic, let’s pretend that it did possess a minor
glitch involving a failure to counterbalance the order of instruction in
one of the several schools involved). Let’s also pretend that it produced
the following results for the replication factor (of course, everyone
knows [and had known for a couple of millennia] that tutoring was more
effective than classroom instruction):
The first two cells (tutoring vs. classroom instruction) involved a near
perfect direct replication of the factorial study’s tutoring versus
classroom instruction while the final two cells constituted an extension
(or, in today’s language, a conceptual replication) thereof. (The construct
underlying both replications was conceptualized as class size.) The “near
perfect” disclaimer was due to the hypothetical addition to the study
outcome of two instructional objectives accompanied by two items each
based on said objectives in the hope that they would increase the
sensitivity of the outcome measure.
For the results of the single-factor study to have been perfect (a) the
direct replication of tutoring versus classroom instruction would have
reproduced the investigators’ original finding (i.e., the tutored students
would have learned more than their classroom counterparts), and (b) the
conceptual replications would have proved statistically significant in an
incrementally ascending direction as well. To make this myth a bit more
realistic (but still heartening), let’s pretend that
not have replicated given the reality of teacher noncompliance, which
has long bedeviled educational research studies.
So the purpose of these hypothetical examples was to simply illustrate
some of the complexities in performing and interpreting different types
of replications.
A note on independent versus self-replication: As previously
mentioned there is no question that replications performed by
independent investigators are far more credible than those performed by
the original investigators—if nothing else because of the possibility of
fraud, self-interest, self-delusion, and/or the high prevalence of
unreported or unrecognized QRPs. Makel, Plucker, and Hegarty (2012),
for example, in a survey of more than a century of psychological
research found that replications by the same team resulted in a 27%
higher rate of confirmatory results than replications performed by an
independent team. And incredibly, “when at least one author was on both
the original and replicating articles, only three (out of 167) replications
[< 2%] failed to replicate any [emphasis added] of the initial findings”
(p. 539). Also, using a considerably smaller sample (67 replications) and
a different discipline (second-language research), Marsden, Morgan-
Short, Thompson, and Abugaber (2018) reported a similar finding for
self-replications (only 10% failed to provide any confirmation of the
original study results).
So the point regarding self-replications is? Self-replications are most
useful for allowing researchers to test the validity of their own work in
order to avoid (a) wasting their time pursuing an unprofitable line of inquiry and (b) the embarrassment of being the subject of a negative
replication conducted by someone else. But, unfortunately, the practice is
too fraught with past abuses (not necessarily fraudulent practices but
possibly unrecognized QRPs on the part of the original investigators) to
provide much confidence among peer reviewers or skeptical readers.
And perhaps self-replications (or any replications for that matter) should
not even be submitted for publication in the absence of a preregistered
protocol accompanied by an adequate sample size.
After all, myriad calls for more replications have historically been to no avail (e.g., Greenwald, 1975; Rosenthal, 1991; Schmidt, 2009). In fact, published replications comprise 2% or less of some literatures (e.g., Evanschitzky, Baumgarth, Hubbard, & Armstrong, 2007, in marketing research; Makel, Plucker, & Hegarty, 2012, in psychology; Makel & Plucker, 2014, in education)—the latter study being the standard-bearer for non-replication at 0.13% for the top 100 educational journals.
True, there have been a number of replications overturning highly cited (even classical) studies, examples of which were described decades ago by Greenwald (1975) and more recently in Richard Harris's excellent book apocalyptically titled Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions (2017). But despite
this attention, replications have remained more difficult to publish than
original work—hence less attractive to conduct.
However, to paraphrase our Nobel Laureate one final time, things do
appear to be changing, as witnessed by a recently unprecedented amount
of activity designed to actually conduct replications rather than bemoan
the lack thereof. The impetus for this movement appears to be a
multidisciplinary scientific anxiety regarding the validity of published
scientific results—one aspect of which involves the modeling efforts
discussed previously. This fear (or belief) that much of the scientific
literature is false has surfaced before (e.g., in the 1970s in psychology),
but this time a number of forward-looking, methodologically competent,
and very energetic individuals have made the decision to do something
about the situation, as chronicled by this and previous chapters' sampling of some very impressive individual initiatives: high-profile negative replications of highly questionable constructs such as psi and priming, of poorly conceived genetic studies, and of high-tech, extremely expensive fMRI studies.
All of which have “gifted” our scientific literatures with thousands
(perhaps tens of thousands) of false-positive results. But apropos of the
question of how scientists can be convinced to conduct more replications
given the difficulties of publishing their findings, let’s move on to a
discussion of some of the solutions.
One such solution, the registered report (in which publication decisions are based on a study's preregistered protocol rather than on its results), has been extended to the replication process in the form of a registered replication report (RRR), in which publication–nonpublication decisions are made based almost exclusively on the replication protocol.
A number of journals have now adopted some form of this innovation;
one of the first being Perspectives on Psychological Science. So let’s
briefly examine that journal’s enlightened version as described by
Simons, Holcombe, and Spellman (2014). Basically, the process involves
the following steps:
1. A query is submitted to the journal making “a case for the ‘replication
value’ of the original finding. Has the effect been highly influential? Is it
methodologically sound? Is the size of the effect uncertain due to
controversy in the published literature or a lack of published direct
replications?” (p. 552).
2. If accepted, “the proposing researchers complete a form detailing the
methodological and analysis details of the original study, suggesting how
those details will be implemented in the replication, and identifying any
discrepancies or missing information” (p. 553). Therein follows a back-
and-forth discussion between the replicating proposers, the original
author(s) of the research to be replicated, and the editors, which, if
promising, results in a formal proposal. How this actually plays out is
unclear but most likely some objections are raised by the original
investigator(s) who hopefully won’t have the last word in how the
replication study will be designed and conducted.
3. Since the RRR normally requires the participation of multiple laboratories
due to the necessity of recruiting large numbers of participants quickly (as
well as reducing the threat of replicator bias), the journal facilitates the
process by putting out a call for interested participants. This step is not
necessarily adopted by all journals or replicators who may prefer either to
select their own collaborators or perform the replication themselves using
a single site.
4. But, back to the Perspectives on Psychological Science approach, the
selected laboratories “document their implementation plan for the study
on OpenScienceFramework.org. The editor then verifies that their plan
meets all of the specified requirements, and the lab then creates a
registered version of their plan. The individual labs must conduct the
study by following the preregistered plan; their results are included in the
RRR regardless of the outcome” (p. 553).
5. All sites (if multiple ones are employed) employ identical methods, and the results accruing therefrom are combined via a meta-analytic approach in order to obtain an overall p-value (a minimal sketch of such pooling follows this list). Each participating site's data are registered on the Open Science Framework repository and freely available for other researchers to analyze for their own purposes.
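To make the pooling in step 5 concrete, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis across participating labs. The lab effect estimates and standard errors below are hypothetical, and actual registered replication reports typically use more elaborate (e.g., random-effects) models, but the basic logic of weighting each site by the precision of its estimate is the same.

```python
import numpy as np
from scipy import stats

# Hypothetical per-lab effect estimates (e.g., standardized mean differences)
# and their standard errors; in a real RRR these come from each site's data.
effects = np.array([0.21, 0.05, 0.14, -0.02, 0.11])
std_errors = np.array([0.10, 0.12, 0.09, 0.11, 0.10])

# Fixed-effect (inverse-variance) pooling: weight each lab by 1/SE^2.
weights = 1.0 / std_errors**2
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

# Overall z-test and two-sided p-value for the combined estimate.
z = pooled_effect / pooled_se
p_value = 2 * stats.norm.sf(abs(z))

print(f"Pooled effect = {pooled_effect:.3f} (SE = {pooled_se:.3f}), "
      f"z = {z:.2f}, p = {p_value:.4f}")
```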
However, as promising as this innovation is, another one exists that is even more impressive.
What’s Next?
Now that the basics of the replication process have been discussed, it is time to examine the most impressive step yet taken in the drive to increase the reproducibility of published research. For as impressive as the methodological contributions discussed to this point have been in informing us of the existence, extent, etiology, and amelioration of the crisis facing science, an even more ambitious undertaking deserves consideration. Given that the replication process is the ultimate arbiter of scientific reproducibility, a group of forward-thinking and energetic scientists has spearheaded the replication of large clusters of original findings. And it is these initiatives (one of which involved replicating 100 different experiments involving tens of thousands of participants) that constitute the primary subject of Chapter 7.
References
Psychological Science, 11, 546–573.
Harris, R. (2017). Rigor mortis: How sloppy science creates worthless cures,
crushes hope and wastes billions. New York: Basic Books.
Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated.
Epidemiology, 19, 640–648.
Lindsay, R. M., & Ehrenberg, A. S. C. (1993). The design of replicated studies.
American Statistician, 47, 217–228.
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7, 537–542.
Makel, M. C., & Plucker, J. A. (2014). Facts are more important than novelty:
Replication in the education sciences. Educational Researcher, 43, 304–316.
Marsden, E., Morgan-Short, K., Thompson, S., & Abugaber, D. (2018).
Replication in second language research: Narrative and systematic reviews and
recommendations for the field. Language Learning, 68, 321–391.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and
Social Psychology, 67, 371–378.
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the
credibility of published results. Social Psychology, 45, 137–141.
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531–536.
Popham, W. J. (1971). Performance tests of teaching proficiency: Rationale,
development, and validation. American Educational Research Journal, 8,
105–117.
Rosenthal, R. (1991). Replication in behavioral research. In J. W. Neuliep (Ed.),
Replication research in the social sciences (pp. 1–39). Newbury Park, CA:
Sage.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of
replication is neglected in the social sciences. Review of General Psychology,
13, 90–100.
Schimmack, U. (2012). The ironic effect of significant results on the credibility
of multiple-study articles. Psychological Methods, 17, 551–566.
Schooler, J. W., & Engstler-Schooler, T. Y. (1990). Verbal overshadowing of
visual memories: Some things are better left unsaid. Cognitive Psychology, 22,
36–71.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22, 1359–1366.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the
file drawer. Journal of Experimental Psychology: General, 143, 534–547.
Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to
registered replication reports. Perspectives on Psychological Science, 9, 552–
555.
Yong, E. (2012). Bad copy. Nature, 485, 298–300.
7
Multiple-Study Replication Initiatives
Many scientists may have been unaware of the fact that biotech
companies have been replicating promising published preclinical results
for some time, especially those published by university laboratories.
Anecdotally, an “unspoken rule” in the early venture capital industry is
reported to be that at least 50% of preclinical studies, even those
published in high-impact academic journals, “can’t be repeated with the
same conclusions by an industrial lab” (Osherovich, 2011, quoting Bruce
Booth, a venture capitalist).
The reproducibility of such studies is especially important for the
pharmaceutical industry because preclinical work with cells, tissues,
and/or animals forms the basis for the development of new clinical drugs.
And while this preliminary research is costly, it pales in comparison to
the costs of drug development and the large controlled efficacy trials
required by the US Food and Drug Administration (FDA) prior to
clinical use and marketing.
So if these positive preclinical results are wrong, the drugs based on them will almost surely not work, and the companies that develop and test them will lose a formidable amount of money. And we all know that pharmaceutical companies are far more interested in making money than losing it.
And thus here enters Glenn Begley into our drama, stage left (Begley & Ellis, 2012). As a senior researcher at Amgen, a major biotech firm, Dr.
Begley, like his counterparts in many such companies, constantly
monitored the published literature for preclinical results that might have
important clinical implications. And whenever an especially promising
one was found, Dr. Begley would either have the published results
replicated in his own company’s hematology and oncology department
labs to see if they were valid (which often they were not) or file the study
away for future consideration.
What makes Dr. Begley so central to this part of our story is that, prior
to leaving Amgen for an academic position, he decided to “clean up his
file cabinet” of 53 promising effects to see if any of them turned out to
be as promising as their published findings suggested. (The papers
themselves “were deliberately selected that described something
completely new, such as fresh approaches to targeting cancers or
alternative clinical uses for existing therapeutics” [p. 532].)
Of the 53 studies replicated, the results from only 6 (or 11%) were reproducible. This is obviously a finding with very disturbing implications for this genre of research, for medical treatment, and for science in general, as succinctly explained in the authors' own words:
with reports on its own futile attempts to replicate three studies in diabetes and neurodegenerative disease in the hope that other companies will follow suit." The purpose of this initiative, according to Sasha Kamb (Amgen's senior vice president for research), is to reduce wasted time and resources following up on flawed findings as well as "to help improve the self-correcting nature of science to benefit society as a whole, including those of us trying to create new medicines."
This analysis revealed that only in ~20–25% of the projects were the
relevant published data completely in line with our in-house findings. In
almost two-thirds of the projects, there were inconsistencies between
published data and in-house data that either considerably prolonged the
duration of the target validation process or, in most cases, resulted in
termination of the projects because the evidence that was generated for the
therapeutic hypothesis was insufficient to justify further investments into
these projects. (p. 713)
Perhaps in response to these rather disheartening studies, another ambitious initiative spearheaded by an amazing organization that will be described shortly (the Center for Open Science, headed by a whirling dervish of a psychologist named Brian Nosek) secured funding to replicate 50 high-impact cancer biology studies. Interestingly, Glenn Begley is reported (Harris, 2017) to have resigned from this particular project because, he argued, repeating the original studies' abysmally poor designs would produce meaningless results even if the findings did replicate.
Unfortunately, this project has run into a number of other setbacks as of this writing (January 2019). First, the costs were greater than anticipated (originally budgeted at $25,000 per study, the actual costs rose to more than $60,000 [Kaiser, 2018]). This, among other problems (e.g., the difficulty of reproducing some of the laboratory materials and the unexpected amount of time necessary to troubleshoot or optimize experiments to get meaningful results), has resulted in reducing the originally planned 50 replications to 37 in 2015 and then further down to 29 as of 2017.
The online open-access journal eLife has been keeping a running tab of the results, which are somewhat muddled because replicated versus non-replicated results aren't as clear-cut as they would be for a single social science finding involving one intervention and one outcome variable (or a physics experiment measuring the speed of neutrinos). Again, as of early 2019, the following 12 results were reported (https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology):
Table 7.1

Study                            # Replications   # Replication failures   % Failures
Amgen                            53               47                       89%
Bayer                            67               44 (a)                   66%
Preclinical Cancer Biology       6                2                        33%
Preclinical Materials            238              109 (b)                  46%
Psychology (Self-Reported) (c)   257              130                      51%
Open Science Collaboration       100              64                       64%
Many Labs I                      13               3                        23%
Many Labs II (In Press)          28               13                       46%
Many Labs III (d)                10               5                        50%
Experimental Economics           18               7                        39%
Social Science Research          21               8                        38%
Total                            811              432                      53.3%

(a) As mentioned in the text, this is an estimate, since the authors reported that the replication results were identical in only 20–25% of the cases and that in most of the "almost two-thirds" of cases with inconsistencies the results were not sufficiently close to merit further work. Many reports citing this study give a failure rate of 75–80%, but this seems to ignore the "almost two-thirds" estimate, whatever that means.
(b) Results were not reported in terms of the number of studies but in terms of the number of resources employed across all 238 studies. This figure (46%) was therefore applied to the number of studies, which adds a (probably relatively small) amount of imprecision to the estimate.
(c) Hartshorne and Schachner (2012), based on a survey of self-reported replications.
(d) This estimate is based on the nine direct and one conceptual replication, not on the added effects or interactions.
One of the lessons learned from the cancer biology project is that the replication of preclinical findings is more involved than in social science research due to the complexity of the experimental materials—a problem exacerbated by laboratories' failure to keep detailed workflows and by the passage of time, with its attendant fading memories and personnel changes.
This is perhaps best illustrated by Vasilevsky, Brush, Paddock, and colleagues (2013), who examined this very problematic source of error via a rather unique approach to determining reproducibility in laboratory research. While this study did not employ actual replications, its authors (as well as Freedman, Cockburn, & Simcoe [2015] before them) argue that without sufficient (or locatable) and specific laboratory materials (e.g., antibodies, cell lines, knockout reagents) a study cannot be definitively replicated (hence is effectively irreproducible). In fact, Freedman and colleagues advance an interesting definition of irreproducibility, one that is probably applicable to all research genres.
So, incorporating this definition, and based on 238 life science studies (e.g., biology, immunology, neuroscience), Vasilevsky et al. calculated the proportions of unique (or specific) experimental materials that could not be identified: 56% of antibodies, 57% of cell lines, 75% of constructs (such as DNA synthesized for a single RNA strand), 17% of knockout reagents, and 23% of the organisms employed.
category, the authors offer the following estimates of the causes of preclinical irreproducibility:
Real solutions, such as addressing errors in study design and using high
quality biological reagents and reference materials, will require time,
resources, and collaboration between diverse stakeholders that will be a
key precursor to change. Millions of patients are waiting for therapies and
cures that must first survive preclinical challenges. Although any effort to
improve reproducibility levels will require a measured investment in
capital and time, the long term benefits to society that are derived from
increased scientific fidelity will greatly exceed the upfront costs. (p. 7)
Experimental Psychology: The Seminal
“Reproducibility Project”
(which of course neither did the preclinical initiatives just discussed),
the project may have resulted in a reasonable estimate thereof for brief
psychological interventions.
But, these caveats aside, the design and methodology of the
replications were nevertheless both exemplary and remarkable. Equally
impressive, of the 153 eligible studies available, 111 articles were
selected for replication, and 100 of these were actually completed by
the prespecified deadline. Fidelity of the replications was facilitated by
using the actual study materials supplied by the original investigators.
These investigators were also consulted with respect to their opinions
regarding any divergences from the original design that might interfere
with replicability. The average statistical power available for the 100
replications was in excess of 0.90 based on the originally obtained
effect sizes. And, of course, all replications were preregistered.
Overall, the replication effect sizes were approximately 50% smaller than those of the original studies, and the proportion of statistically significant results decreased from 97% for the 100 original studies to 36% in the replications. (A drop in this proportion of less than 10 percentage points would have been expected by chance alone.) In addition, the large number of studies replicated provided the opportunity to identify correlates of replication successes and failures, which greatly expands our knowledge regarding the replication process itself. Examples include the following:
1. The larger the original effect size, the more likely the finding was to
replicate [recall that the larger the effect size, the lower the obtained p-
value when the sample size is held constant a la the simulations
discussed previously], so this finding also extends to the generalization
that, everything else being equal, the lower the obtained p-value the
more likely a finding is to replicate.
2. The more scientifically “surprising” the original finding was (in the a
priori opinion of the replication investigators, who were themselves
quite conversant with the psychological literature), the less likely the
finding was to replicate.
3. The more difficult the study procedures were to implement, the less
likely the finding was to replicate.
4. Studies involving cognitive psychology topics were more likely to
replicate than those involving social psychology.
1. “It is too easy to conclude that successful replication means that the
theoretical understanding of the original finding is correct. Direct
replication mainly provides evidence for the reliability of a result. If
there are alternative explanations for the original finding, those
alternatives could likewise account for the replication [emphasis added].
Understanding is achieved through multiple, diverse investigations that
provide converging support for a theoretical interpretation and rule out
alternative explanations” (p. aac4716-6).
2. “It is also too easy to conclude that a failure to replicate a result means
that the original evidence was a false positive. Replications can fail if
the replication methodology differs from the original in ways that
interfere with observing the effect” (p. aac4716-6). [As, of course, does
random error and the presence of one or more questionable research
practices (QRPs) performed by the replicators themselves.]
3. “How can we maximize the rate of research progress? Innovation points
out paths that are possible; replication points out paths that are likely;
progress relies on both” (p. aac4716-7).
Many Labs 1
failure to replicate an effect might be due to factors unique to the process itself. Thus, in this first project, differences in replicating sites and/or types of participants (e.g., online respondents vs. students) were explored to ascertain their relationship (if any) to replication/non-replication. Perhaps not surprisingly, the interventions themselves accounted for substantially more of the between-study variation than the sites or samples employed.
Many Labs 2
Many Labs 3
Both the Open Science and the Many Labs replications (the latter borrowing much of its infrastructure from the former) appear to have been designed and reported as definitively, transparently, and fairly as humanly possible. In addition, these initiatives' procedural strategies are relevant for any future replications of experiments and should be followed as closely as possible. A sampling follows:
1. First, the original investigators were contacted in order to (a) obtain study
materials if available and necessary, (b) apprise said investigators of the
replication protocol, and (c) obtain any feedback these investigators might
have regarding the project. (The latter is both a courtesy to the original
investigators and in some cases a necessary condition for conducting a
direct replication if the original materials are not otherwise accessible.)
The replicating team was not required to accept any suggested changes to
the planned protocol, but, generally speaking, the Open Science
investigators appeared to accept almost all such feedback when proffered
since (again) not everything occurring in a study tends to be reported in a
journal article.
2. Next, and most importantly, the design of the replication and the planned
analysis were preregistered and any necessary divergences from this plan
were detailed in the final report. Both of these steps should occur
whenever a replication (or any study for that matter) is published. Not to
do so constitutes a QRP (at least for replications occurring after 2016—
the publication date for this Open Science report—or perhaps more fairly
after 2018 as suggested in Chapter 10).
3. The Open Science replications employed considerably larger sample sizes than the original studies in order to ensure statistical power levels of at least 0.80 (usually ≥ 0.90). Sufficient power is essential for all replications since inadequate power greatly reduces the credibility of any research. (Recall that the typical statistical power of a psychological experiment is around 0.35 for an effect size of 0.50; hence a replication employing exactly the same sample size would have only about a 35% chance of replicating the original study even if the original positive result was valid. A brief sketch of this calculation follows this list.)
4. It is a rare psychological research publication that reports only one study
(76% of those replicated by the Open Science Collaboration reported two
or more) and an even rarer one that reports only a single p-value. To
overcome this potential problem the Open Science replicators typically
chose the final study along with the p-value reported therein which they
considered to be associated with the most important result in that study. (If
the original author disagreed and requested that a different effect be
selected instead, the replicating investigators typically complied with the
request.)
5. The replication results were reanalyzed by an Open Science-appointed statistician to ensure accuracy. This is not a bad idea if the analysis of a replication is performed by a non-statistician and should probably be universally copied—as should the other strategies just listed, for that matter.
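As promised in item 3, here is a minimal check of that power claim, assuming a two-sided independent-groups t-test with roughly 20 participants per group (a sample size in the neighborhood of many classic psychology experiments; the exact n is an assumption, not a figure taken from the Open Science report):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of the hypothetical original design: d = 0.50, ~20 participants per group.
small_n_power = analysis.power(effect_size=0.50, nobs1=20, alpha=0.05,
                               alternative='two-sided')
print(f"Power with n = 20 per group: {small_n_power:.2f}")   # roughly 0.33-0.35

# Per-group n a replicator would need for 0.90 power at the same effect size.
n_needed = analysis.solve_power(effect_size=0.50, power=0.90, alpha=0.05,
                                alternative='two-sided')
print(f"n per group needed for 0.90 power: {n_needed:.0f}")  # roughly 85-86
```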
decades due to the difficulty of recruiting sufficient numbers of patients
with certain rare diagnoses. Psychological studies are not commonly
affected by this problem but they do often require more participants than
a single site can supply, hence the discipline has invented its own terms
for the strategy such as “crowdsourcing science” or “horizontal versus
vertical approaches to science.”
However, regardless of terminology, multicenter trials unquestionably have a number of advantages over single-site research, as well as unique organizing and coordinating challenges of their own—some of which are specific to the discipline. The ultimate impact of this approach on the future of psychology is, of course, unknown, but if nothing else high-powered studies (both original studies and replications) are considerably more likely to be valid than their underpowered counterparts. And while the use of multiple sites introduces additional sources of variance, these can be handled statistically. (As well, this variance, systematic or random, is normally overwhelmed by the increased sample sizes that "crowdsourcing" makes possible.)
Advice for assembling, designing, and conducting all aspects of such studies from a psychological perspective is clearly detailed in an article entitled "Scientific Utopia III: Crowdsourcing Science" (Uhlmann, Ebersole, & Chartier, 2019) and need not be delineated here. (Utopias I and II will be discussed shortly.)
suggests that replications are considerably more common in psychology
than generally supposed.
Given the sampling procedure employed, no projections to a larger population are possible (e.g., 14 of the respondents were graduate students), but that caveat applies to one degree or another to all of the multiple replication initiatives presented in this chapter. With that said, some variant of this survey approach could constitute a promising method for tracking replications if the identity of the original studies were obtained and some information regarding the replication attempts were available (preferably with sharable data).
Experimental Economics
Science and Nature are among the highest profile, most often cited, and
most prestigious journals in science. They are also the most coveted
publishing outlets as perhaps illustrated by the average bounty (i.e., in
excess of $40,000) paid by China (Quan, Chen, & Shu, 2017) to any
home-grown scientists who are fortunate enough to garner an acceptance
email therefrom, although that bounty has apparently been terminated
recently.
As such, these journals have the pick of many litters on what to publish, and they appear (perhaps more than any other extremely high-impact journals) to favor innovative, potentially popular studies. However, their exclusiveness should also enable them to publish methodologically higher quality research than the average journal, so
these disparate characteristics make their studies an interesting choice for
a replication initiative.
Accordingly, in 2018, Colin Camerer, Anna Dreber, Felix
Holzmeister, and a team of 21 other investigators (several of whom were
involved in the previously discussed replication of 18 economics studies)
performed replications of 21 experimental social science studies
published in Science and Nature between 2010 and 2015. The selected
experiments were required to (a) report a p-value associated with at least
one hypothesis and (b) be replicable with easily accessible participants
(e.g., students or Amazon Mechanical Turk employees). The replicating
team followed the original studies’ procedures as closely as possible,
secured the cooperation of all but one of the original authors, and
ensured adequate statistical power via the following rather interesting
two-stage process:
In stage 1, we had 90% power to detect 75% of the original effect size at the
5% significance level in a two-sided test. If the original result replicated in
stage 1 (a two-sided P < 0.05 and an effect in the same direction as in the
original study), no further data collection was carried out. If the original
result did not replicate in stage 1, we carried out a second data collection in
stage 2 to have 90% power to detect 50% of the original effect size for the
first and second data collections pooled. (p. 2)
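The following sketch reproduces the logic of that two-stage rule for a single hypothetical study (the original effect size of d = 0.60 is an assumption for illustration, not a value from the Camerer et al. paper): stage 1 is powered at 90% to detect 75% of the original effect and, if the result is not significant, stage 2 tops the pooled sample up to 90% power for 50% of the original effect.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
original_d = 0.60  # hypothetical originally published effect size

# Stage 1: per-group n for 90% power to detect 75% of the original effect.
stage1_n = analysis.solve_power(effect_size=0.75 * original_d,
                                power=0.90, alpha=0.05,
                                alternative='two-sided')

# Stage 2 (only if stage 1 is not significant): pooled per-group n for
# 90% power to detect 50% of the original effect.
pooled_n = analysis.solve_power(effect_size=0.50 * original_d,
                                power=0.90, alpha=0.05,
                                alternative='two-sided')
stage2_extra = pooled_n - stage1_n

print(f"Stage 1 n per group: {stage1_n:.0f}")
print(f"Pooled n per group if stage 2 is needed: {pooled_n:.0f} "
      f"(an additional {stage2_extra:.0f} per group)")
```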
investigators' datasets https://www.3ieimpact.org/evidence-hub/publications/replication-papers/savings-revisited-replication-study-savings. Note also that the Chabris, Herbert, Benjamin, and colleagues' (2012) failure to replicate individual gene–intelligence associations in Chapter 4 was not included because (a) none of the reported associations replicated (hence their inclusion would skew the Table 7.1 results) and (b) these studies represent an approach that is no longer used in the discipline following its migration to genome-wide associations.)
Also not included in the preceding calculations is a "mass" (aka crowdsourced) replication initiative (Schweinsberg, Madan, Vianello, et al., 2016) involving the replication of a set of 10 unpublished psychological studies conducted by a single investigator (Eric Uhlmann and colleagues) and centered on a single theoretical topic (moral judgment). In one sense this replication effort is quite interesting because the investigator of the 10 studies reports that two of the key QRPs believed to be responsible for producing irreproducible results were not present in his original studies: (a) repeated analyses during the course of the study and (b) dropping participants for any reason. The authors consider these methodological steps to be major factors in the production of false-positive results, so avoiding them should presumably increase the replicability of the 10 studies.
In another sense, however, the original investigator's choosing of both the replicators and the studies to be replicated is somewhat problematic because it positions the process somewhere between self-replication and independent replication. The authors of the study, on the other hand, consider this to be a major strength in the sense that it helped (a) duplicate the original contexts of the 10 studies and (b) ensure that the replicating labs had experience conducting such studies. In any event, the resulting replications produced positive evidence for the reproducibility of 8 of the 10 originally positive studies (1 of the 2 originally negative studies proved to be statistically significant when replicated whereas the other did not).
All in all, it is difficult to classify the positive replications in this effort. They appeared to be well conducted, highly powered, preregistered, and methodologically sound (i.e., employing a variant of the "Many Labs" approach). So it may be unfair to exclude them from the Table 7.1 results simply because they were selected in part because they were expected to produce positive replications and because the replicating laboratories were personally selected by the original investigator. So, for the hopefully silent majority who wish to take issue with this decision, adding these eight out of eight positive replications of positive studies to Table 7.1 produces the overall results shown in Table 7.2.
Close but still an apples versus oranges comparison and weak support
for the Ioannidis and Pashler and Harris modeling results.
And, as always, there are probably additional multiple replication
initiatives that escaped my search, hence the list presented here is
undoubtedly incomplete. In addition, an impressive replication of 17
structural brain–behavior correlations (Boekel, Wagenmakers, Belay, et
al., 2015) was not included because it relied on a Bayesian approach
which employed different replication/non-replication criteria from the
previous 12 efforts. This study’s finding is as follows:
For all but one of the 17 findings under scrutiny, confirmatory Bayesian
hypothesis tests indicated evidence in favor of the null hypothesis [i.e., were
negative] ranging from anecdotal (Bayes factor < 3) to strong (Bayes factor
> 10). (p. 115)
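As a rough illustration of how a Bayes factor can quantify evidence in favor of a null hypothesis (rather than merely failing to reject it), here is a sketch using a common BIC-based approximation to the Bayes factor applied to simulated data with essentially no brain–behavior relationship. This is not necessarily the specific confirmatory Bayesian test Boekel and colleagues employed; it is only meant to show how the "anecdotal" and "strong" labels quoted above attach to a number.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: a brain-structure measure and a behavioral score with
# (essentially) no true relationship, mimicking a null brain-behavior finding.
n = 100
brain = rng.normal(size=n)
behavior = 0.02 * brain + rng.normal(size=n)

# Null model (intercept only) vs. alternative model (brain measure included).
null_fit = sm.OLS(behavior, np.ones((n, 1))).fit()
alt_fit = sm.OLS(behavior, sm.add_constant(brain)).fit()

# BIC approximation to the Bayes factor in favor of the null:
# BF01 is roughly exp((BIC_alternative - BIC_null) / 2).
bf01 = np.exp((alt_fit.bic - null_fit.bic) / 2)

print(f"BF01 = {bf01:.2f}")
# Conventional labels: 1-3 "anecdotal", 3-10 commonly called "moderate",
# and >10 "strong" evidence for the null, matching the ranges quoted above.
```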
of combining apples and oranges, the 11 initiatives as a whole involved a
total of 811 studies, of which 432 failed to replicate. This yielded a
53.3% failure-to-replicate rate. Not a particularly heartening finding but
surprisingly compatible with both Ioannidis’s and Pashler and Harris’s
Chapter 2 modeling estimates.
Third, from an overall scientific perspective, the importance of these
initiatives is that if 25% were to be considered an acceptable level for
irreproducible results, only 1 (the first “many labs” project) of the 11
initiatives reached this level. (And recall that an unspecified number of
the “Many Labs I” studies were reported to have been selected because
they had already been successfully replicated.)
Fourth, although I have reported what amounts to hearsay regarding
the details of Glenn Begley’s resignation from the Open Science cancer
biology initiative designed to replicate 50 high-impact cancer biology
studies, I do agree with Professor Begley’s reported objection. Namely,
that if a study’s design and conduct are sufficiently deficient, a high-
fidelity replication thereof employing the same QRPs (sans perhaps low
statistical power) is uninformative. And while psychological versus
preclinical experiments may differ with respect to the types and
prevalence of these artifacts, the end result of such failings will be the
same in any discipline: a deck carefully and successfully stacked to
increase the prevalence of false-positive results in both original research
and its replication.
And finally, all of these initiatives are basically exploratory
demonstration projects conducted for the betterment of science and for
the benefit of future scientists. Furthermore, none of the authors of these
papers made any pretense that their truly impressive approaches would
solve the reproducibility crisis or even that their results were
representative of their areas of endeavor. They have simply taken the
time and the effort to do what they could to alert the scientific
community to a serious problem for the betterment of their individual
sciences.
And an apology: there are undoubtedly more multiple-study
replications under way than have been discussed here. Engineering, for
example, which has here been given short shrift, has apparently
employed a form of replication for some time to ensure compatibility of
electronic and other parts in order to market them to different companies
and applications. Loosely based on this model, the Biological
Technologies Office of the US Defense Advanced Research Projects
Agency (DARPA) has actually initiated a randomized trial to evaluate
the effects of requiring (as a condition of funding) the primary awardees to cooperate with and facilitate (sometimes via in-person visits or video presentations) independent shadow teams of scientists in the replication and validation of their study results (Raphael, Sheehan, & Vora, 2020). The results of this initiative and its evaluation are not yet available as of this writing, but it is intriguing that the costs this replication requirement adds typically range between 3% and 8% of the original study's overall budget.
Next
The next chapter looks at two affective views of the replication process
based on (a) the reactions of scientists whose original studies have been
declared irreproducible and (b) professional (and even public) opinions
regarding the career effects thereupon (along with a few hints regarding
the best way to respond thereto).
References
Begley, C. G., & Ellis, L. M. (2012). Drug development: raise standards for
preclinical cancer research. Nature, 483, 531–533.
Boekel, W., Wagenmakers, E.-J., Belay, L., et al. (2015). A purely confirmatory
replication study of structural brain-behavior correlations. Cortex, 66, 115–
133.
Camerer, C., Dreber, A., Forsell, E., et al. (2016). Evaluating replicability of
laboratory experiments in economics. Science, 351, 1433–1436.
Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the
replicability of social science experiments in Nature and Science between
2010 and 2015. Nature Human Behaviour, 2, 637–644.
Chabris, C., Herbert, B., Benjamin, D., et al. (2012). Most reported genetic
associations with general intelligence are probably false positives.
Psychological Science, 23, 1314–1323.
Dreber, A., Pfeiffer, T., Almenberg, J., et al. (2015). Using prediction markets to
estimate the reproducibility of scientific research. Proceedings of the National
Academy of Sciences, 112, 15343–15347.
Ebersole, C. R., Atherton, A. E., Belanger, A. L., et al. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of
reproducibility in preclinical research. PLoS Biology, 13, e1002165.
Gertler, P., Baliani, S., & Romero, M. (2018). How to make replication the norm:
The publishing system builds in resistance to replication. Nature, 554, 417–
419.
Harris, R. (2017). Rigor mortis: How sloppy science creates worthless cures,
crushes hope and wastes billions. New York: Basic Books.
Hartshorne, J. K., & Schachner, A. (2012). Tracking replicability as a method of
post-publication open evaluation. Frontiers in Computational Neuroscience, 6,
8.
Hughes, P., Marshall, D., Reid, Y., et al. (2007). The costs of using unauthenticated, overpassaged cell lines: How much more data do we need? Biotechniques, 43, 575–582.
International Initiative for Impact Evaluation. (2018).
http://www.3ieimpact.org/about-us
Kaiser, J. (2016). If you fail to reproduce another scientist’s results, this journal
wants to know. https://www.sciencemag.org/news/2016/02/if-you-fail-
reproduce-another-scientist-s-results-journal-wants-know
Kaiser, J. (2018). Plan to replicate 50 high-impact cancer papers shrinks to just
18. Science. https://www.sciencemag.org/news/2018/07/plan-replicate-50-
high-impact-cancer-papers-shrinks-just-18
Klein, R., Ratliff, K. A., Vianello, M., et al. (2014). Investigating variation in
replicability: A “many labs” replication project. Social Psychology, 45, 142–
152.
Klein, R. A., Vianello, M., Hasselman, F., et al. (2018). Many labs 2:
Investigating variation in replicability across sample and setting. Advances in
Methods and Practices in Psychological Science, 1, 443–490.
Lorsch, J. R., Collins, F. S., & Lippincott-Schwartz, J. (2014). Cell biology:
Fixing problems with cell lines. Science, 346, 1452–1453.
Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657–660.
Open Science Collaboration. (2015). Estimating the reproducibility of
psychological science. Science, 349, aac4716–1–7.
Osherovich, L. (2011). Hedging against academic risk. SciBX, 4.
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can
we rely on published data on potential drug targets? Nature Reviews Drug
Discovery, 10, 712–713.
Quan, W., Chen, B., & Shu, F. (2017). Publish or impoverish: An investigation of
the monetary reward system of science in China (1999–2016). Aslib Journal of
Information Management, 69, 1–18.
Raphael, M. P., Sheehan, P. E., & Vora, G. J. (2020). A controlled trial for
reproducibility. Nature, 579, 190–192.
Reproducibility Project: Cancer Biology. (2012). Brian Nosek (correspondence
author representing the Open Science Collaboration).
https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-
biology
Schweinsberg, M., Madan, N., Vianello, M., et al. (2016). The pipeline project: Pre-publication independent replications of a single laboratory's research pipeline. Journal of Experimental Social Psychology, 66, 55–67.
Uhlmann, E. L., Ebersole, C. R., & Chartier, C. R. (2019). Scientific utopia III:
Crowdsourcing science. Perspectives on Psychological Science, 14, 711–733.
Vasilevsky, N. A., Brush, M. H., Paddock, H., et al. (2013). On the
reproducibility of science: Unique identification of research resources in the
biomedical literature. PeerJ, 1, e148.
8
Damage Control upon Learning That One’s
Study Failed to Replicate
So far we’ve established that replication is the best (if imperfect) means
we have for determining a study’s reproducibility. But what happens if
one’s study, perhaps conducted years ago, fails to replicate and its
carefully crafted conclusions are declared to be wrong?
A flippant answer might be that individuals should feel flattered that their study was considered important enough to replicate. Most scientists have never been, nor probably ever will be, awarded such a distinction.
And why should colleagues of someone whose study has failed to
replicate be surprised since they should know by now that most studies
aren’t reproducible? But of course we’re all human and emotion usually
trumps knowledge. And for scientists, their work constitutes a major
component of their self-worth.
So the purpose of this brief chapter is to explore the implications for
scientists who find themselves on the wrong side of the replication
process. And to begin, let’s consider two high-profile case studies of
investigators who found themselves in this particular situation, the
negative aspects of which were exacerbated by the ever increasing
dominance of social media—scientific and otherwise.
ready for this feat, many psychological journal editors encourage
investigators to yoke their experiments to the testing of an existing one.
One of the more popular of these theories is attributed to John Bargh
and designated as “behavioral priming”—which might be defined in
terms of exposure to one stimulus influencing response to another in the
absence of other competing stimuli. Literally hundreds of studies
(naturally most of them positive) have been conducted supporting the
phenomenon, probably the most famous of which was conducted by
John Bargh, Mark Chen, and Lara Burrows (1996), in which participants
were asked to create sentences from carefully constructed lists of
scrambled words.
The paper itself reported three separate experiments (all positive
confirmations of the effect), but the second garnered the most interest
(and certainly helped solidify the theory’s credence). It consisted of two
studies—one a replication of the other—with both asking 30 participants
to reconstruct 30 four-word sentences from 30 sets of five words. The
participants were randomly assigned to one of two different conditions,
one consisting of sets embedded with elderly stereotypic words, the other
with neutral words. Following the task, they were told to leave the laboratory by taking the elevator at the end of the hall, during which walk a research assistant unobtrusively recorded their walking speed with a stopwatch.
The results of both studies (the original and the self-replication)
showed a statistically significant difference between the two conditions.
Namely, that the students who had been exposed to the stereotypic aging
words walked more slowly than the students who had not—a scientific
slam dunk cited more than 5,000 times. Or was it a chimera?
Apparently some major concerns had surfaced regarding the entire concept over the ensuing years, which may have motivated a team of researchers to conduct a replication of the famous study during the magical 2011–2012 scientific window—accompanied, of course, by one of the era's seemingly mandatory flashy titles.
and control tasks (constructing brief sentences from scrambled word
sets) followed by the respondents’ timed walking speeds.
However, the replication instituted three methodological
improvements:
The results obtained by the research assistants using stopwatches (which the authors termed "subjective timing") were as follows:
1. This time around a variation of the Bargh et al. priming effect was
replicated in the sense that the primed participants of the five research
assistants who had been led to believe that the priming intervention
would induce participants to walk more slowly did indeed register
significantly slower walking times than their non-primed counterparts.
2. Interestingly, however, the five research assistants who had been led to
believe that their primed participants would walk faster actually did
record faster walking times than their five control counterparts who had
not been fictitiously “sensitized.”
http://blogs.discovermagazine.com/notrocketscience/failed-replication-
bargh-psychology-study-doyen/), Bargh called the authors of the
replication “incompetent or ill-informed,” defamed the journal PLoS One
as not receiving “the usual high scientific journal standards of peer
review scrutiny” (a damning criticism for which there is absolutely no
supporting evidence), and attacked Yong himself for “superficial online
science journalism”—all via his online blog. (Reactions, incidentally,
that Joseph Simmons suggested constituted a textbook case in how not to
respond to a disconfirming replication.)
Since the Doyen team’s failure to replicate Bargh’s famous study, the
priming construct has understandably lost a considerable amount of its
former luster. For example, in the first “Many Labs” replication project
(Klein, Ratliff, Vianello, et al., 2014, discussed in the previous chapter),
only two replications (of 13) provided no support for their original
studies’ findings and both of these involved this once popular construct
(i.e., flag priming [influencing conservatism] and currency priming
[influencing system justification]). And, according to an article in
Nature, Chivers (2019) maintained that dozens of priming replications
did not confirm the effect—perhaps leading Brian Nosek to state
concerning the effect that "I don't know a replicable finding. It's not that
there isn’t one, but I can’t name it” (p. 200).
Of course a meta-analysis (Weingarten, Chen, McAdams, et al., 2016) assessing the word priming effect found a small but statistically significant effect size of 0.33 across 352 effects (not a typo), but such results are customary in meta-analyses. However, the average number of participants employed was only 25 per condition, which implies that the average statistical power of the included studies was approximately 0.20 (which, it will be recalled, translates to only a 20% chance of obtaining statistical significance if a real effect of that size exists—an effect that, in the case of priming, an increasing percentage of scientists no longer believe exists).
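A quick back-of-the-envelope check of that power figure, using the standard normal approximation for a two-sided, two-sample test (the 0.33 effect size and 25-per-condition values come from the text; everything else is textbook arithmetic):

```python
from math import sqrt
from scipy.stats import norm

d = 0.33        # meta-analytic effect size reported by Weingarten et al.
n_per_group = 25
alpha = 0.05

# Noncentrality of the two-sample test: d * sqrt(n/2).
delta = d * sqrt(n_per_group / 2)

# Approximate power = P(|Z| > z_crit) when Z is centered at delta.
z_crit = norm.ppf(1 - alpha / 2)
power = norm.sf(z_crit - delta) + norm.cdf(-z_crit - delta)

print(f"Approximate power: {power:.2f}")  # about 0.21, i.e., roughly the 0.20 cited
```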
on the discipline comparable to Daryl Bem’s psi studies. Perhaps this
event served notice that counterintuitive, headline-catching studies could
constitute a double-edged sword for their authors by encouraging
methodologically oriented critics to mount a social media counterattack
—especially if the original investigators appeared to be profiting
(financially, publicly, or professionally) by conducting questionable
research practice (QRP)-laced, irreproducible science.
What this episode may also provide is an unneeded illustration of the growing use of the internet to espouse opinions and worldviews concerning findings published in traditional peer-reviewed journals, a phenomenon that might also reflect a growing tendency of an already broken peer review/editorial process to allow political and social biases to influence not just publication decisions but also the hyperbolic language with which those publications are described. (The latter is apparently on an upswing; see Vinkers, Tijdink, & Otte, 2015.)
But let’s begin by considering the study itself (Carney, Cuddy, & Yap,
2010) whose unimposing title (“Power Posing: Brief Nonverbal Displays
Affect Neuroendocrine Levels and Risk Tolerance”) hardly suggested the
controversy that would follow. And neither did its abstract portend
anything controversial, or at least not until its final sentence.
Humans and other animals express power through open, expansive postures,
and they express powerlessness through closed, contractive postures. But
can these postures actually cause power? The results of this study confirmed
our prediction that posing in high-power nonverbal displays (as opposed to
low-power nonverbal displays) would cause neuroendocrine and behavioral
changes for both male and female participants: High-power posers
experienced elevations in testosterone, decreases in cortisol, and increased
feelings of power and tolerance for risk; low-power posers exhibited the
opposite pattern. In short, posing in displays of power caused advantaged
and adaptive psychological, physiological, and behavioral changes, and
these findings suggest that embodiment extends beyond mere thinking and
feeling, to physiology and subsequent behavioral choices. That a person
can, by assuming two simple 1-min poses, embody power and instantly
become more powerful has real-world, actionable implications [emphasis
added]. (p. 1363)
of 2010, which is a year before Professor Bem’s magnum opus was
published]), it is quite possible that the vituperative reactions to the study
might have been avoided. That and Dr. Cuddy’s popularization of her
finding.
But alas no such addition (or disclaimer) was added, probably because its authors appeared to belong to the school of belief that just about anything to which a p < 0.05 can be affixed does translate to "real-world" behavior and scientific meaningfulness. Hence our little morality
play, which is eloquently presented in a New York Times Magazine piece
written by Susan Dominus and aptly titled “When the Revolution Came
for Amy Cuddy” (2017, Oct. 18).
Apparently (at least from her critics' perspectives), Dr. Cuddy parlayed her often-cited 2010 study (note that she was the second author on the study in question) into two extremely popular YouTube presentations (one of which constituted TED Talks' second most popular offering, with 43 million views and counting), a best-selling book (Presence), and myriad paid speaking engagements.
While the study in question was neither her first nor last on the topic,
the 2010 article—coupled with her new-found fame—engendered a
replication (Ranehill, Dreber, Johannesson, et al., 2015) employing a
sample five times as large as the original study’s. Unfortunately the
original study’s positive results for cortisol, testosterone, or risk-taking
did not replicate, although a quarter of a point difference was observed
on a 4-point rating scale soliciting the degree to which the participants
felt powerful following the brief set of “power” poses—an effect that
was also significant in the original article but which could be interpreted
as primarily an intervention “manipulation check” (i.e., evidence that
participants could perceive a difference between the “high-power
nonverbal poses” and the “low-power nonverbal poses” but little else).
Naturally Cuddy defended her study as almost any investigator would
and continued her belief in the revolutionary societal effects of
empowering the un-empowered (e.g., young women, female children,
and even black men) via her 1-minute poses. As part of this effort and in
response to their critics, she and her original team (note again that Dana
Carney was the first author on this paper as well) even accumulated a
group of 33 studies involving “the embodied effects of expansive (vs.
contractive) nonverbal displays” (Carney, Cuddy, & Yap, 2015), none of
which, other than theirs, found an effect for testosterone or cortisol. As a
group, however, the cumulative results were overwhelmingly positive,
which of course is typical of a science which deals in publishing only
positive results involving soft, self-reported, reactive outcomes.
Also, in addition to appealing to the authority of William James, the authors presented a list of rebuttals to the Ranehill et al. failure to replicate—one of which was that, for some reason, the latter announced
to their participants that their study was designed to test the effects of
physical position upon hormones and behavior. (It is unclear why the
replicating team did this, although its effect [if any] could have just as
easily increased any such difference due to its seeming potential to elicit
a demand effect.)
So far, all of this is rather typical of the replication and rebuttal
process since no researcher likes to hear or believe that his or her
findings (or interpretations thereof) are incorrect. (Recall that even Daryl
Bem produced a breathlessly positive meta-analysis of 90 experiments
on the anomalous anticipation of random future events in response to
Galak, LeBoeuf, Nelson, and Simmons’s failure to replicate psi.)
But while no comparison between Amy Cuddy and Daryl Bem is
intended, Drs. Galak, Nelson, and Simmons (along with Uri Simonsohn)
soon became key actors in our drama as well. This is perhaps due to the
fact that, prior to the publication of the 33-experiment rejoinder, Dana
Carney (again the original first author of both the review and the original
power posing study) sent the manuscript along with her version of a p-
curve analysis (Simonsohn, Nelson, & Simmons, 2014) performed on
these 33 studies to Leif Nelson (who promptly forwarded it on to Simmons and Simonsohn).
The p-curve is a statistical model designed to ascertain whether a related series of positive studies' p-values fits the distribution expected when a true effect is present, namely one skewed to the right, with significant p-values clustering toward the low end (e.g., more p-values near .01 than near .04).
It also doesn’t hurt to remember that the p-curve is a statistical model (or
diagnostic test) whose utility has not been firmly established. And while
it probably is useful for the purposes for which it was designed, its
results (like those of all models) do not reflect absolute mathematical
certainty. Or, as Stephan Bruns and John Ioannidis (2016) remind us, the
exact etiology of aberrant effects (skewness in the case of p-curves)
remains “unknown and uncertain.”
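As a rough illustration of the logic (not of the originators' actual implementation, which involves formal tests against the uniform and 33%-power benchmarks), the sketch below simulates many two-group experiments and tabulates the statistically significant p-values: with a genuine effect they pile up near .01 (a right-skewed curve), whereas under the null they spread evenly between 0 and .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def significant_p_values(true_d, n_per_group=30, n_studies=20000):
    """Simulate two-group studies and return the p-values that reach p < .05."""
    a = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    b = rng.normal(true_d, 1.0, size=(n_studies, n_per_group))
    p = stats.ttest_ind(b, a, axis=1).pvalue
    return p[p < 0.05]

bins = np.arange(0, 0.06, 0.01)  # .00-.01, .01-.02, ..., .04-.05

for label, d in [("True effect (d = 0.5)", 0.5), ("Null effect (d = 0)", 0.0)]:
    sig = significant_p_values(d)
    counts, _ = np.histogram(sig, bins=bins)
    shares = counts / counts.sum()
    print(label, " ".join(f"{s:.2f}" for s in shares))
# With d = 0.5 the shares fall off sharply after the .00-.01 bin (right skew);
# with d = 0 they are roughly equal (flat), as p-curve logic predicts.
```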
In any event, at this juncture our story begins to get a bit muddled to
the point that no one comes across as completely righteous, heroic, or
victimized. According to Susan Dominus, Simmons responded that he
and Simonsohn had conducted their own p-curve and came to a
completely different conclusion from Carney’s, whose version they
considered to be incorrect, and they suggested that “conceptual points
raised before that section [i.e., the “incorrect” p-curve analysis] are
useful and contribute to the debate,” but, according to Dominus they
advised Carney to delete her p-curve and then “everybody wins in that
case.”
Carney and Cuddy complied by deleting their p-curve, but they (especially Amy Cuddy) apparently weren't among the universal winners. Simonsohn and Simmons, after giving the original authors a chance to reply online, then published a decidedly negative blog post on their influential Data Colada site entitled "Reassessing the Evidence Behind the Most Popular TED Talk" (http://datacolada.org/37), accompanied by a picture of the 1970s television version of Wonder Woman. (The latter being a stark reminder of the difference between internet blogs and peer-reviewed scientific communication.)
The rest, as the saying goes, is history. Cuddy temporarily became,
even more so perhaps than Daryl Bem and John Bargh, the poster child
of the reproducibility crisis even though her research was conducted
prior to the 2011–2012 enlightenment, as were both Bem’s and Bargh’s.
Dominus’s New York Times Magazine article sympathetically detailed
the emotional toll inflicted upon Dr. Cuddy, who was portrayed as a
brain-injured survivor who had overcome great obstacles to become a
respected social psychologist, even to the point of surviving Andrew
Gelman's "dismissive" (Dominus's descriptor) blogs (http://andrewgelman.com/) along with his 2016 Slate Magazine article (with Kaiser Fung) critical of her work and the press's role in reporting it.
But perhaps the unkindest cut of all came when her friend and first
author (Dana Carney) completely disavowed the power pose studies and
even recommended that researchers abandon studying power poses in the
future.
In response to her colleague’s listing of methodological weaknesses
buttressing the conclusion that the study was not reproducible, Dr. Cuddy
complained that she had not been apprised by Dr. Carney of said
problems—which leads one to wonder why Dr. Cuddy had not learned
about statistical power and demand characteristics in her graduate
training.
So what are we to make of all of this? And why is this tempest in a teapot even worth considering? Perhaps the primary lesson here is that scientists should approach with caution Facebook, Twitter, and the myriad other platforms that encourage brief, spur-of-the-moment posts. Sitting alone in front of a computer makes it very easy to overstep one's scientific training when responding to something perceived as ridiculous or methodologically offensive.
However, online scientific commentaries and posts are not likely to go away anytime soon and, in the long run, may even turn out to be a powerful disincentive for conducting QRP-laden research. But with that said, perhaps it would be a good idea to sleep on one's more pejorative entries (or persuade a friend or two to serve as one's private peer reviewers). A bit of time tends to moderate our immediate virulent reactions to something that we disagree with, that offends our sensibilities, or that serves as the subject matter for an overdue blog post.
As for Amy Cuddy's situation, it is extremely difficult for some social scientists to formulate their hypotheses and interpret their results independently of their politico-social orientations—perhaps as difficult as passing the proverbial camel through the eye of a needle. And it
may be equally difficult to serve in the dual capacity of scientist and
entrepreneur while remaining completely unbiased in the interpretation
of one’s research. Or, even independent of both scenarios, not to feel a
decided sense of personal and professional indignation when one’s work
is subjected to a failed replication and subsequently labeled as not
reproducible.
As for articles in the public press dealing with scientific issues,
including Susan Dominus’s apparently factual entry, it is important for
readers to understand that these writers often do not possess a deep
understanding of the methodological issues or cultural mores underlying
the science they are writing about. And as a result, they may be more
prone to allow their personal biases to surface occasionally.
While I have no idea whether any of this applies to Susan Dominus,
she did appear to be unusually sympathetic to Amy Cuddy’s “plight,” to
appreciate Joseph Simmons’s and Uri Simonsohn’s apparent mea culpa
that they could have handled their role somewhat differently (which
probably would have had no effect on the ultimate outcome), and to not
extend any such appreciation to Andrew Gelman, whom she appeared to
consider too strident and “dismissive” of Dr. Cuddy.
So while treating Dr. Cuddy’s work “dismissively” is understandable,
it might also be charitable to always include a brief reminder that studies
such as this were conducted prior to 2011–2012—not as an excuse but as
a potentially mitigating circumstance. If nothing else, such a disclaimer
might constitute a subliminal advertisement for the many available
strategies for decreasing scientific irreproducibility.
This is the larger of the two surveys (4,786 US adults and 313 researchers) and the one with the more iconic title: "Scientists' Reputations Are Based on Getting It Right, Not Being Right." Employing brief scenarios in two separate surveys that were then combined, the more germane results of this effort were as follows:
kind enough to provide a case study of an exemplary response to a
disconfirming replication by Matthew Vees.
Participants were told to think about a specific finding, of their own, that
they were particularly proud of (self-focused). They then read about how an
independent lab had conducted a large-scale replication of that finding, but
failed to replicate it. Since the replicators were not successful, they tweaked
the methods and ran it again. Again, they were unable to find anything. The
participants were then told that the replicators published the failed
replication and blogged about it. The replicators’ conclusion was that the
effect was likely not true and probably the result of a Type 1 error.
Participants were then told to imagine that they posted on social media or a
blog one of the following comments: “in light of the evidence, it looks like I
was wrong about the effect” (admission) or “I am not sure about the
replication study. I still think the effect is real” (no admission). (p. 4)
(The other two scenarios were practically identical except the description
of the replication involved a well-known study published by a prominent
researcher to whom the two alternate comments were ascribed.)
The basic results were similar to the Ebersole et al. findings: namely, (a) that scientists tend to overestimate the untoward effects of negative replications and (b) that these effects are less severe if the original investigators “admit” that they may have been wrong. The authors accordingly speculate that such an admission might repair some of the reputational damage that is projected to occur in these scenarios.
These two surveys produced a plethora of additional results that are not discussed here, so, as always, those interested in these issues should access the full reports. The Ebersole et al. survey, for example, was able to contrast the general public and scientists with respect to several issues (e.g., researchers were considerably more tolerant than the general public of researchers who did not replicate their own research and were also more appreciative [at least in theory] of those who routinely performed “boring” rather than “exciting” studies). Similarly, Drs. Fetterman and Sassenberg were able to study the relationship between some of their scenarios and whether or not the respondents were in the reproducibility or business-as-usual research camps.
Both teams transparently mentioned (a) some of the weaknesses of employing scenarios to explain real-life behaviors, (b) the fact that radically different results could be produced by minor tweaks therein, and (c) that admitting such problems does not mitigate their very real potential for generating false-positive results of their own. One potential problem with the interpretation of these surveys, at least in my opinion, is the implication that real-life scientists would be better off admitting that they were (or might have been) wrong when they may have actually believed their original results were correct. (This is ironic in a sense, given Ebersole et al.’s nomination of Matthew Vees’s strategy as an exemplary alternative to going into “attack mode” following the failure of one of his studies to replicate, since Dr. Vees did not suggest or imply that his initial results might be wrong. Nor did he imply that his replicators were wrong either—which is not necessarily contradictory.)
Of course anyone can quibble about scenario wordings and suggest minor tweaks thereto (which is surely a weakness of scenarios as a scientific tool in general). In truth, no one knows whether the results of such studies reflect “real-life” behaviors and reactions, or whether the same results (or interpretations thereof) might change over time. But as imperfect as surveys and scenarios are, the two just discussed at least provide the best assurances available that replication failures, while disappointing, disheartening, and perhaps enraging, are not as bad as they seem to their “victims” at the time. Time may not heal all wounds, but it almost surely will blunt the pain from this one and have very little impact on a career—at least barring fraudulent behavior.
He was also reported by Forstmeier, Wagenmakers, and Parker (2017)
as somewhat iconoclastically suggesting
To this point, some derivation of the word “publish” has been mentioned more than 200 times in several different contexts—many suggesting biases attributable to the various actors involved in the publishing process or inherent to the process itself. It seems natural, therefore, that these factors should be examined in a bit more detail from the perspectives of (a) identifying some of their more salient characteristics that impede the replication (and the scientific) process, (b) understanding how they facilitate the prevalence of false-positive results in the scientific literature, and, of course, (c) examining a number of suggestions tendered to ameliorate these problems. All of which constitute the subject matter of Chapter 9.
References
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior:
Direct effects of trait construct and stereotype-activation on action. Journal of
Personality and Social Psychology, 71, 230–244.
Bruns, S. B., & Ioannidis, J. P. A. (2016). P-curve and p-hacking in observational research. PLoS One, 11, e0149144.
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2010). Power posing: Brief
nonverbal displays affect neuroendocrine levels and risk tolerance.
Psychological Science, 21, 1363–1368.
Carney, D. R., Cuddy, A. J. C., & Yap, A. J. (2015). Review and summary of
research on the embodied effects of expansive (vs. contractive) nonverbal
displays. Psychological Science, 26, 657–663.
Chivers, T. (2019). What’s next for psychology’s embattled field of social
priming. Nature, 576, 200–202.
Dominus, S. (2017). When the revolution came for Amy Cuddy. New York Times
Magazine, Oct. 18.
Doyen, S., Klein, O., Pichon, C., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLoS One, 7, e29081.
Ebersole, C. R., Axt, J. R., & Nosek, B. A. (2016). Scientists’ reputations are
based on getting it right, not being right. PLoS Biology, 14, e1002460.
Fetterman, A. K., & Sassenberg, K. (2015). The reputational consequences of
failed replications and wrongness admission among scientists. PLoS One, 10,
e0143723.
Forstmeier, F., Wagenmakers, E-J., & Parker, T. H. (2017). Detecting and
avoiding likely false-positive findings: a practical guide. Biological Reviews,
92, 1941–1968.
Gelman, A., & Fung, K. (2016). The power of the “power pose”: Amy Cuddy’s
famous finding is the latest example of scientific overreach.
https://slate.com/technology/2016/01/amy-cuddys-power-pose-research-is-the-latest-example-of-scientific-overreach.html
Hadfield, J. (2015). There’s madness in our methods: Improving inference in
ecology and evolution. https://methodsblog.com/2015/11/26/madness-in-our-
methods/
Ioannidis, J. P. A. (2018). The challenge of reforming nutritional epidemiologic
research. Journal of the American Medical Association, 320, 969–970.
Klein, R., Ratliff, K. A., Vianello, M., et al. (2014). Investigating variation in
replicability: A “many labs” replication project. Social Psychology, 45, 142–
152.
Ranehill, E., Dreber, A., Johannesson, M., et al. (2015). Assessing the robustness
of power posing: No effect on hormones and risk tolerance in a large sample
of men and women. Psychological Science, 26, 653–656.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the
file drawer. Journal of Experimental Psychology: General, 143, 534–547.
Vinkers, C. H., Tijdink, J. K., & Otte, W. M. (2015). Use of positive and negative
words in scientific PubMed abstracts between 1974 and 2014: Retrospective
analysis. British Medical Journal, 351, h6467.
Weingarten, E., Chen, Q., McAdams, M., et al. (2016). From primed concepts to
action: A meta-analysis of the behavioral effects of incidentally-presented
words. Psychological Bulletin, 142, 472–497.
Yong, E. (2012). Bad copy. Nature, 485, 298–300.
PART III
9
Publishing Issues and Their Impact on
Reproducibility
The vast majority of the studies and opinions cited to this point have been published in peer reviewed journals. Historically, scientific publishing has taken many forms and remains an evolving process. In past centuries, books were a primary means of communicating new findings, which is how James Lind announced his iconic discovery regarding the treatment and prevention of scurvy in 1753 (A Treatise of the Scurvy)—which, of course, was ignored, thereby delaying the adoption of a cure for almost half a century.
Gradually scientific journals began to proliferate, and as many as a thousand were created during Lind’s century alone. Peer reviewed journals now constitute the primary medium for formally presenting new findings to the scientific community, acting as repositories of past findings, and forming the foundation on which new knowledge is built. So, obviously, the issue of scientific reproducibility cannot be considered in any depth without examining the publishing process itself—especially since over half of published studies may be incorrect.
Publishing as Reinforcement
surprising number of individuals publish one paper every 5 days
(Ioannidis, Klavans, & Boyack, 2018)—some of which are never cited
and probably never even read by their co-authors.
However, with all of its deficiencies as a metric of scientific
accomplishment, publishing is an essential component of the scientific
enterprise. And, in an era in which traditional scientific book publishing
appears to be in decline, academic journal publishing has evolved
relatively quickly into a multibillion-dollar industry.
Whether most scientists approve of the direction in which publishing
is moving is unknown and probably irrelevant. Certainly nothing is
likely to change the practice of publishing one’s labors as a
reinforcement of behavior or the narcotic-like high that occurs when
investigators receive the news that a paper has been approved for
publication. (Of course a “paper,” while still in use as an antiquated
synonym for a research report, is now digitally produced and most
commonly read online or downloaded and read as a pdf—all sans the use
of “paper.”)
So let’s take a quick look at the publishing process through the lens of
scientific reproducibility. But first, a few facts that are quite relevant for
that purpose.
progress. (It should be mentioned that Björk, Roos, and Lauri [2009] take issue with this projection and argue that 1,275,000 is a more accurate figure, but both estimates are mind-boggling.)
As shown in Table 9.1 (abstracted from table 5-6 of the original NSF report), only 18% of these publications emanated from the United States, and the majority (54%) of the 2 million-plus total involved disciplines not covered, or only spottily covered, in this book (e.g., engineering, the physical sciences, mathematics, computer sciences, agriculture). However, this still leaves quite a bit of publishing activity in almost all recognized disciplines.
Psychology alone contributed more than 39,000 publications, with the other social sciences accounting for more than three times that number. How many of these included p-values is unknown, but, given the modeling studies discussed earlier, psychology alone undoubtedly produces thousands of false-positive results per year, with the other social sciences adding tens of thousands more based on their combined 2016 output of more than 120,000 publications. So let’s hope that Ioannidis (2005) was wrong about most disciplines being affected by his pessimistic modeling result. Otherwise, we’re talking about hundreds of thousands (perhaps more than a million) of false-positive scientific reports being produced each year worldwide.
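To make the arithmetic behind that claim concrete, here is a purely illustrative back-of-the-envelope calculation in Python. Every parameter below other than the publication counts cited above is an assumption chosen only to show the order of magnitude involved, not an estimate drawn from the literature:

    psych_pubs = 39_000          # annual psychology publications (the NSF-derived figure cited above)
    other_social_pubs = 120_000  # combined 2016 output of the other social sciences (cited above)
    share_positive = 0.5         # assumed fraction of papers reporting a "significant" finding
    false_positive_rate = 0.25   # assumed fraction of those positive findings that are false

    for label, n in [("psychology", psych_pubs), ("other social sciences", other_social_pubs)]:
        expected = n * share_positive * false_positive_rate
        print(f"{label}: roughly {expected:,.0f} false-positive reports per year under these assumptions")

Under these invented (and arguably conservative) assumptions, the tally comes to roughly 4,900 false-positive reports per year in psychology and 15,000 in the other social sciences combined.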
The Current Publication Model
While everyone reading this book probably already knows the basics of the subscription-based 20th-century journal publication process, let’s review a few of its main components in order to examine how it has changed in this increasingly digital age (and may even effectively disappear altogether in a few decades), what some of its limitations are, and some rather radical suggestions for its improvement (or at least its alteration). If readers will forgive a brief self-indulgence, I will begin by detailing some of my experiences as an editor-in-chief as an example of these relatively rapid changes.
In 1978, an unfortunately now-deceased faculty colleague and I decided to establish a peer reviewed program evaluation journal dedicated to the health sciences. In those pre-email days all correspondence was done by mail (usually typed on an IBM Selectric typewriter somewhere), hence communication with reviewers, the editorial board, and authors involved innumerable trips back and forth to the post office. After a few issues in which we two academics served as publishers, printers, mailers, promoters, editors, solicitors of articles (at least in the journal’s early days), selectors (and solicitors) of peer reviewers, and collators of the latter’s often disparate and contradictory reviews, all while maintaining additional back-and-forth correspondence with authors regarding revisions or reasons for rejections ad nauseam, we were quite happy to sell the journal for a whopping $6,000 each to Sage Publications (making it that publisher’s second peer reviewed professional journal).
Four decades later, Sage publishes more than a thousand journals and is dwarfed by other journal publishers such as Elsevier and Wolters Kluwer. All communications now employ the internet rather than the postal service, which pretty much encompasses the evolution of academic publishing prior to the advent of exclusively online publishers—that and the price changes, with the average library subscription for a scientific journal now running around $3,000 per year, plus the substantial (sometimes exorbitant) publishing fees levied by many journals on authors for the privilege of typesetting their labors and distributing them to (primarily) institutional subscribers. (The actual number of legitimate [i.e., non-predatory] peer reviewed scientific journals is not known, but it is undoubtedly well over 25,000.)
The Peer Review Process
up with a better solution, and it did seem to work reasonably well,
especially for the larger and more prestigious journals.
True, artificial intelligence aids have recently been developed that may slightly facilitate handling the increasing glut of manuscripts to be reviewed. These platforms can now perform cursory supplemental tasks, as succinctly described by Heaven (2018), including checking for problematic statistical or procedural anomalies (e.g., ScholarOne, StatReviewer), summarizing the actual subject matter of an article rather than relying on a quick perusal of its abstract (which one sage describes as “what authors come up with five minutes before submission” [courtesy of a marketing director quoted in the Heaven paper]), and employing automated plagiarism checks. But regardless of the sophistication achieved by these innovations, the heavy lifting will remain with scientists as long as the current system exists.
Like every other scientific topic, a substantial literature has grown up
around the shortcomings of peer review. While this literature can’t be
done justice here, one of the most thorough and insightful discussions of
the entire publishing process (accompanied by potential solutions) must
surely be Brian Nosek and Yoav Bar-Anan’s (2012) essay entitled
“Scientific Utopia: I. Opening Scientific Communication.”
However, prior to considering some of these quite prescient suggestions, let’s first examine the following study—either amusing or alarming, depending on one’s perspective.
The article was then submitted to 304 open-access journals and was accepted by more than half of them, all with no notice taken of the study’s fatal methodological flaws. As one example, the Journal of Natural Pharmaceuticals (published by an Indian company which owned 270 online journals at the time but has since been bought by Wolters Kluwer, a multinational Netherlands-based publishing behemoth with annual revenues of nearly $5 billion) accepted the article in 51 days with only minor formatting changes requested. Nothing was mentioned concerning the study’s flaws.
For the exercise as a whole, the author reported that
The paper was accepted by journals hosted by industry titans Sage and
Elsevier. The paper was accepted by journals published by prestigious
academic institutions such as Kobe University in Japan. It was accepted
by scholarly society journals. It was even accepted by journals for which
the paper’s topic was utterly inappropriate, such as the Journal of
Experimental & Clinical Assisted Reproduction. (p. 61)
Incredibly, only PLoS One (the flagship journal of the Public Library
of Science and much maligned by John Bargh) rejected the paper on
methodological grounds. One of Sage’s journals (Journal of
International Medical Research) accompanied its acceptance letter with
a bill for $3,100. (For many such acceptances, Bohannon sent an email
withdrawing the paper due to an “embarrassing mistake.”)
Perhaps the main conclusion that can be drawn from this iconoclastic
“survey” is that everything in science, as in all other human pursuits, can
be (and often is) gamed. Other examples designed to expose the
problems associated with peer review abound. For example, MIT
students used SCIgen, a computer program that automatically generates
gobbledygook papers, to submit papers that somehow got through the
peer review process; a recent batch of bizarre fake articles by fictitious authors was published in small peer reviewed journals, with titles such as “Human Reactions to Rape Culture and Queer Performativity at Urban Dog Parks in Portland, Oregon” (Wilson [Retracted, 2018], published in Gender, Place, & Culture: A Feminist Geography Journal) and “Who Are They to Judge? Overcoming Anthropometry Through Fat Bodybuilding” (Baldwin [Retracted, 2020], published in the journal Fat Studies [yes, this and the feminist geography journal are actual journals]); and, of course, the iconic hoax by physicist Alan D. Sokal, who published a completely nonsensical article allegedly linking the then-current post-modernism fad with quantum physics in the non-peer reviewed Social Text (1996) (https://en.wikipedia.org/wiki/Alan_Sokal).
One problem with both the publishing and peer review processes lies in the number of publishing outlets available to even borderline scientists, which means that just about anything can be published with enough perseverance (and money). Another lies in the requirement that journals (especially subscription-based ones) publish a given number of articles every quarter, month, or, in some cases, week. This dilutes the quality of published research as one goes down the scientific food chain and may even encourage fraudulent activity, such as the practice of authors nominating real scientists as peer reviewers while supplying fake email addresses for them (opened solely for that purpose and sometimes provided by for-profit companies dedicated to supporting the process), thereby allowing the authors to write their own reviews.
Springer, for example (the publisher of Nature along with more than
2,900 other journals and 250,000 books), was forced to retract 107
papers from Tumor Biology published between 2010 and 2016 due to
fake peer reviews. Similarly, Sage Publications was forced to retract 60
papers (almost all with the same author) due to a compromised peer
review system. And these are known and relatively easily identified
cases. We’ll probably never know the true extent of these problems.
On a lighter note, Ferguson, Marcus, and Oransky (2014) provide
several interesting and somewhat amusing examples of peer review fraud
along with potential solutions. (Cat Ferguson, Adam Marcus and Ivan
Oransky are the staff writer and two co-founders, respectively, of
Retraction Watch, an extremely important organization designed to track
retracted papers and whose website should be accessed regularly by all
scientists interested in reproducibility.)
Examples supplied by the group include:
1. The author asks to exclude some reviewers, then provides a list of almost
every scientist in the field.
2. The author recommends reviewers who are strangely difficult to find
online.
3. The author provides gmail, Yahoo, or other free e-mail addresses to
contact suggested reviewers, rather than email addresses from an
academic institution.
4. Within hours of being requested, the reviews come back. They are
glowing.
5. Even reviewer number three likes the paper (p. 481). [In my experience
three-for-three uncritically positive reviews are relatively uncommon.]
Predatory (Fake) Journals
• Check that the publisher provides full, verifiable contact
information, including an address, on the journal site. Be cautious
of those that provide only web contact forms.
• Check that a journal’s editorial board lists recognized experts with
full affiliations. Contact some of them and ask about their
experience with the journal or publisher since sometimes these
journals simply list prominent scientists without their knowledge.
• Check that the journal prominently displays its policy for author
fees.
• Be wary of email invitations to submit to journals or to become an
editorial board member. [This one is tricky since legitimate
journals sometimes use email correspondence to solicit
manuscripts for a special issue or to contact potential board
members based on recommendations from other scientists.]
• Read some of the journal’s published articles and assess their
quality. Contact past authors to ask about their experience. [This,
too, has a downside since some of the authors may be quite proud
of the fact that their articles were accepted with absolutely no
required revisions.]
• Check that a journal’s peer review process is clearly described,
and try to confirm that a claimed impact factor is correct.
• Find out whether the journal is a member of an industry
association that vets its members, such as the Directory of Open
Access Journals (www.doaj.org) or the Open Access Scholarly
Publishers Association (www.oaspa.org).
• Use common sense, as you would when shopping online: if
something looks fishy, proceed with caution. (p. 435)
Brian Nosek and Yoav Bar-Anan (2012)
This long, comprehensive article clearly delineates the problems bedeviling publishing in peer reviewed scientific journals, followed by proposed solutions for each. Ideally it should be read in its entirety by anyone interested in reforming the current system, but, for present purposes, what follows is an encapsulated version of its authors’ vision of some of the problems with the current system and their potential solutions.
First, the problems:
leisure. Paper copies of all but the most widely read journals are
disappearing from academic libraries so part of this suggestion is well
on its way to being implemented. As for journal issues, little would be
lost and precious time gained if the increasingly popular practice of
making the final version of papers available to subscribers online in
advance of the completed issue were to simply replace the issue system
itself. The PLoS model, for one, already does this by making papers
freely available as soon as they are accepted, suitably revised, and
copyedited.
Going to a totally open access model in which all scientists (whether
they work at universities, for private industry, or in their own
basements) can access everything without costs. Of course the obvious
problem with this is that someone has to pay the costs of copyediting
and other tasks, but going to an exclusively digital model (and possibly,
ultimately, a non-profit one) should greatly reduce costs. Nosek and
Bar-Anan suggest that most of these costs should be borne by
scientists, their funding agencies, or their institutions (perhaps
augmented by advertisers).
Again the PLoS open-access, purely digital model is presented as
one example of how this transition could be made. Another example is
the National Institutes of Health (NIH)’s PubMed Central
(http://www.ncbi.nlm.nih.gov/pmc/), which attempts to ensure open
access to all published reports of NIH-funded research. The greatest
barriers to the movement itself are individual scientists, and a number
of suggestions are made by the authors to encourage these scientists to
publish their work in open-access journals. The major disadvantage
resides in the inequitable difficulty that unfunded or underfunded
scientists will have in meeting publication costs (which are
considerable but are often also charged by subscription outlets)
although some open-access journals theoretically reduce or even waive
these fees if an investigator has no dedicated funding for this purpose.
Publishing prior to peer review. Citing the often absurd lag between
study completion and publication, the authors suggest that “Authors
prepare their manuscripts and decide themselves when it is published
by submitting it to a repository. The repository manages copyediting
and makes the articles available publicly” (p. 231).
Examples of existing mechanisms through which this is already
occurring are provided, the most notable being the previously
mentioned decades-old and quite successful arXiv preprint repository
(https://en.wikipedia.org/wiki/ArXiv), followed by a growing group of
siblings (some of which allow reviewer comments that can be almost as
useful as peer reviews to investigators). An important subsidiary
benefit for scientists is avoiding the necessity of contending with
journal- or reviewer-initiated publication bias.
In this model, preprints of manuscripts are posted without peer review, and under this system the number of submissions to arXiv alone had increased to more than 10,000 per month by 2016. Most of the submissions are probably published later in conventional outlets, but, published or not, the repository process has a number of advantages, including
example, the INA-Rxiv alone, which now receives more than 6,000
submissions per year, will be faced with $25,000 in annual fees and
accordingly has decided to leave the COS repository (Mallapaty, 2020).
Making peer review independent of the journal system. Here the
authors really hit their stride by suggesting the creation of general or
generic peer review systems independent of the journals themselves. In
this system “instead of submitting a manuscript for review by a
particular journal with a particular level of prestige, authors submit to a
review service for peer review . . . and journals become not the
publisher of articles but their ‘promotors’ ” (p. 232).
This process, the authors argue, would free journals from the peer
review process and spare investigators the necessity of going
through the entire exercise each time they submit a paper following a
rejection. Both the graded results of the reviews and the manuscript
itself would be available online, and journal editors could then sort
through reviewed articles according to their own quality metrics and
choose which they wished to publish. More controversially, the authors
also suggest that there would be no reason why the same article
couldn’t be published by multiple journals. (How this latter rather odd
strategy would play out is not at all clear, but it is a creative possibility
and the suggestion of employing a single peer review process rather
than forcing each journal to constitute its own, while not unique to
these authors, is in my opinion as a former editor-in-chief absolutely
brilliant.)
Publishing peer reviews. Making a case for the important (and
largely unrewarded) contribution made by peer reviewers, the authors
suggest that peer reviews not only be published but also not be
anonymous unless a reviewer so requests. This way reviewers could
receive credit for their scientific contributions since, as the authors
note, some individuals may not have the inclination, opportunity, or
talent for conducting science but may excel at evaluating it, identifying
experimental confounds, or suggesting alternative explanations for
findings. In this scenario, reviewers’ vitas could correspondingly
document this activity.
It might also be possible for “official” peer reviewers to be evaluated
on both the quality of their reviews and their evaluative tendencies.
(Many journal editors presently do this informally since some
reviewers inevitably reject or accept every manuscript they receive.)
Listed advantages of these suggestions include the avoidance of “quid
pro quo positive reviewing among friends” as well as the retaliatory
anonymous comments by someone whose work has been contradicted,
not cited, or found not to be replicable by the study under review.
Continuous, open peer review. Peer reviews are not perfect and even
salutary ones can change over time, as occurs when a critical confound
is identified by someone following initial review and publication or a
finding initially reviewed with disinterest is later found to be of much
greater import. The authors therefore suggest that the peer review
process be allowed to continue over time, much as book reviews or
product evaluations do on Amazon.com. To avoid politically motivated
reviews by nonscientists, a filter could be employed, such as the
requirement of an academic appointment or membership in a
professional organization, in order to post reviews. (This latter
suggestion could be buttressed by the creation of an interprofessional
organization—or special section within current professional
organizations—devoted to the peer review process.)
Based on the referees’ recommendations, and his or her own reading of the
manuscript, the editor makes the decision to accept or reject the manuscript.
If the editor accepts the manuscript (subject to normal copy editing), he or
she will inform the authors accordingly, enclosing the editorial comments
and comments made by the referees. It is up to the authors to decide
whether, and to what extent, they would like to incorporate these comments
when they work on their revision for eventual publication. As a condition of
acceptance, the authors are required to write a point-by-point response to
the comments. If they refuse to accept a comment, they have to clearly state
the reasons. The editor will pass on the response to the referees. In sum, the
fate of a submitted manuscript is determined by one round of review, and
authors of an accepted manuscript are required to make one round of
revision. (pp. 11–12)
There are possible variations on this, as well as for all the proposals
tendered in this chapter for reforming the publication and peer review
process. All have their advantages, disadvantages, and potential pitfalls,
but something has to be changed in this arena if we are to ever
substantively improve the reproducibility of empirical research.
A ridiculous example of this reluctance on the part of journals to
acknowledge the existence of such errors is provided by Allison, Brown,
George, and Kaiser (2016), who recount their experiences in alerting
journals to published errors in papers they were reviewing for other
purposes. They soon became disenchanted with the process, given that
“Some journals that acknowledged mistakes required a substantial fee to
publish our letters: we were asked to spend our research dollars on
correcting other people’s errors” (p. 28).
Of course, some retractions on the part of investigators reflect
innocent errors or oversights, but apparently most do not. Fang, Steen,
and Casadevall (2012), for example, in examining 2,047 retractions
indexed in PubMed as of 2012 found that 67.4% were due to
misconduct, fraud, or suspected fraud. And what is even more
problematic, according to the Retraction Watch website, some articles
are actually cited more frequently after they are retracted than before.
Exactly how problems such as this can be rectified is not immediately
clear, over and above Drs. Marcus and Oransky’s continuing Herculean
efforts with Retraction Watch. One possibility that could potentially put a
dent in the problem, however, is to send a corrective email to any
investigator citing a retracted article’s published results, perhaps even
suggesting that he or she retract the citation.
Undoubtedly some should, but who is to decide who and how much?
Many of the Utopia I article’s recommendations would probably result in
a significant increase in per scientist published outputs, and whether or
not this is desirable is open for debate.
Brian Martinson (2017) makes a persuasive case for some of
overpublication’s undesirable consequences, and few scientists would
probably disagree with it (at least in private).
The purpose of authorship has shifted. Once, its primary role was to share
knowledge. Now it is to get a publication [emphasis added]—“pubcoin,” if you will. Authorship has become a valuable commodity. And as with all
valuable commodities, it is bought, sold, traded and stolen. Marketplaces
allow unscrupulous researchers to purchase authorship on a paper they had
nothing to do with, or even to commission a paper on the topic of their
choice. “Predatory publishers” strive to collect fees without ensuring
quality. (p. 202)
Bad papers are easy to write, but in the current system they are at least
somewhat [emphasis added] difficult to publish. When we make it easier to
publish papers, we do not introduce good papers into the market (those are
already going to be out there); we introduce disproportionately more bad
papers. (p. 292)
Professors Nosek and Bar-Anan have a tacit answer for this question
along with just about everything else associated with publishing. Namely (and a variant of this has been suggested by others as well), that some
“scientists who do not have the resources or interest in doing original
research themselves can make substantial contributions to science by
reviewing, rather than waiting to be asked to review” (p. 237).
In a sense all scientists are peer reviewers, if not as publication
gatekeepers, at least for their own purposes every time they read an
article relevant to their work. So why not officially create a profession
given over to this activity, one accompanied by an official record of these
activities for promotion and tenure purposes? Or, barring that, an
increased and rewarded system of online reviews designed to discourage
methodologically unsound, unoriginal, or absurd publications.
Alternatively, there are presently multiple online sites on which one’s comments regarding the most egregious departures from good scientific practice can be shared with the profession as a whole and/or via one-on-one correspondence with the authors themselves. From a scientific perspective, if institutionalized as a legitimate academic discipline, the hoped-for result of such activities would be to improve reproducibility one study and one investigator at a time.
There are, of course, many other options and professional models
already proposed, such as Gary King’s 1995 recommendation that
scientists receive credit for the creation of datasets that facilitate the
replication process. In models such as this scientists would be judged
academically on their performance of duties designed to facilitate the
scientific process itself, which could include a wide range of activities in
addition to peer reviewing and the creation of databases. Already
existing examples, such as research design experts and statisticians, are
well established, but the list could be expanded to include checking
preregistered protocols (including addendums thereto) against published
or submitted final reports.
And, of course, given the number of publications being generated in
every discipline there are abundant opportunities for spotting
questionable research practices (QRPs) or errors in newly published
studies. Perhaps not a particularly endearing professional role or
profession, but letters to the offending journals’ editors and/or postings
on websites designed for the specific purpose of promulgating potential
problems could be counted as worthy, quantifiable professional
activities. And, of course, the relatively new field of meta-science is
presently open for candidates and undoubtedly has room for numerous
subspecialties including the development of software to facilitate all of
the just-mentioned activities. This is an activity which has already
produced some quite impressive results, as described next.
Already Existing Statistical Tools to Facilitate
These Roles and Purposes
1. An R program (statcheck) developed by Epskamp and Nuijten (2015) which recalculates p-values from the test statistics and degrees of freedom reported in an article in order to spot abnormalities, possible tampering, and simple reporting errors (a simplified illustration of this recalculation appears after this list). Using this approach, Nuijten, Hartgerink, van Assen, Epskamp, and Wicherts (2016) identified a disheartening number of incorrectly reported p-values in 16,695 published articles employing inferential statistics. As would be expected by now, substantially more of these errors favored statistical significance than the reverse.
2. Simulations such as bootstrapping approaches (Goldfarb & King, 2016)
for determining what would happen if a published research result were to
be repeated numerous times, with each repetition being done with a new
random draw of observations from the same underlying population.
3. A strategy (as described by Bergh, Sharp, Aguinis, & Li, 2017) offered by
most statistical packages (e.g., Stata, IBM SPSS, SAS, and R) for
checking the accuracy of statistical analyses involving descriptive and
correlational results when raw data are not available. Using linear
regression and structural equation modeling as examples, the authors
found that of those management studies for which sufficient data were
available and hence could be reanalyzed, “nearly one of three reported
hypotheses as statistically significant which were no longer so in
retesting, and far more significant results were found to be non-significant
in the reproductions than in the opposite direction” (p. 430).
4. The GRIM test (Brown & Heathers, 2016), which evaluates whether the summary statistics (e.g., means) reported in a publication are mathematically possible given the sample size and number of items for measures composed of whole (i.e., non-decimal) numbers, such as Likert scales (see the sketch following this list).
5. Ulrich Schimmack’s “test of insufficient variance” (2014) and “z-curve
analysis” (Schimmack & Brunner, 2017) designed to detect QRPs and
estimate replicability, respectively.
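To make the first and fourth of these tools a bit more concrete, the following minimal Python sketch illustrates the logic underlying each: recomputing a two-tailed p-value from a reported t statistic and its degrees of freedom (the statcheck idea), and the GRIM consistency check for means of integer-valued items. The function names and example values are hypothetical, and the actual packages handle many complications (one-tailed tests, rounding conventions, multiple statistic types) that are ignored here:

    from scipy import stats

    def recompute_p_from_t(t_value, df):
        """Two-tailed p-value implied by a reported t statistic and its df."""
        return 2 * stats.t.sf(abs(t_value), df)

    def grim_consistent(reported_mean, n, n_items=1, decimals=2):
        """Can a mean reported to `decimals` places actually be produced by
        n participants answering n_items integer-valued items?"""
        total = n * n_items
        closest_sum = round(reported_mean * total)   # nearest achievable raw sum
        possible_mean = closest_sum / total
        return round(possible_mean, decimals) == round(reported_mean, decimals)

    # A paper reporting t(28) = 2.10, p < .01 would be flagged: the implied
    # two-tailed p-value is approximately .045, not below .01.
    print(round(recompute_p_from_t(2.10, 28), 3))

    # A mean of 3.51 on a single integer-scored item with n = 28 is impossible:
    # the nearest achievable means are 98/28 = 3.50 and 99/28 = 3.54.
    print(grim_consistent(3.51, n=28))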
analyzed for suspicious values—the presence of which he has the temerity to report to the editors of those journals in which the offending articles appear: a process, incidentally, that has resulted in the identification of both fraudulent investigators and numerous retractions.
The study being described here involved (as its title suggests) more than 5,000 randomized controlled trials from six anesthesia and two general medical journals (the Journal of the American Medical Association and the New England Journal of Medicine). These latter two journals were
most likely added because anesthesia research appears to be the
medical analog to social psychology as far as suspicious activities are
concerned. Hence Dr. Carlisle may have wanted to ascertain if his
profession did indeed constitute a medical outlier in this respect.
(According to the Nature article, four anesthesia investigators
[Yoshitaka Fujii, Yuhji Saitoh, Joachim Boldt, and Yoshihiro Sato]
eventually had 392 articles retracted, which, according to Retraction
Watch, dwarfs psychologist Diederik Stapel’s 58 admitted data
fabrications: https://retractionwatch.com/2015/12/08/diederik-stapel-now-has-58-retractions/.)
Carlisle’s analysis involved 72,261 published arithmetic means of
29,789 variables in 5,087 trials. No significant difference occurred
between anesthesia and general medicine with respect to their baseline
value distributions, although the latter had a lower retraction rate than
the former. And in agreement with just about all of the authors of this
genre of research, Dr. Carlisle was quite explicit in stating that his
results could not be interpreted as evidence of misconduct since they
could also be functions of “unintentional error, correlation, stratified
allocation and poor methodology.”
He did implicitly suggest, however, that more investigators should
join him in this enterprise since, “It is likely that this work will lead to
the identification, correction and retraction of hitherto unretracted
randomised, controlled trials” (p. 944).
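For readers curious about what such screening looks like in practice, the following is a minimal sketch of the underlying idea only—not Dr. Carlisle’s actual (and considerably more sophisticated) method. Under genuine randomization, p-values comparing baseline means between trial arms should be roughly uniformly distributed, so a collection of baseline p-values computed from reported summary statistics can be compared against uniformity. The numbers below are invented, and, as Dr. Carlisle himself stresses, a departure from uniformity is a prompt for scrutiny, not evidence of misconduct:

    import numpy as np
    from scipy import stats

    def baseline_p_value(m1, sd1, n1, m2, sd2, n2):
        """Welch t-test p-value computed from reported baseline summary statistics."""
        var1, var2 = sd1**2 / n1, sd2**2 / n2
        t = (m1 - m2) / np.sqrt(var1 + var2)
        df = (var1 + var2) ** 2 / (var1**2 / (n1 - 1) + var2**2 / (n2 - 1))
        return 2 * stats.t.sf(abs(t), df)

    # Hypothetical baseline rows: (mean1, sd1, n1, mean2, sd2, n2) per variable.
    baseline_rows = [
        (54.1, 9.8, 60, 54.2, 10.1, 62),    # age
        (27.3, 4.1, 60, 27.3, 4.0, 62),     # body mass index
        (132.5, 15.2, 60, 132.4, 15.0, 62), # systolic blood pressure
    ]

    p_values = [baseline_p_value(*row) for row in baseline_rows]

    # Kolmogorov-Smirnov test of the observed p-values against Uniform(0, 1);
    # real screening exercises pool hundreds or thousands of baseline variables.
    ks_stat, ks_p = stats.kstest(p_values, "uniform")
    print([round(p, 3) for p in p_values], round(ks_p, 3))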
substantial comorbidity, recruited over very short periods”). The results of that analysis were that
[o]utcomes were remarkably positive, with very low mortality and study
withdrawals despite substantial comorbidity. There were very large
reductions in hip fracture incidence, regardless of intervention (relative risk
0.22, 95% confidence interval 0.15–0.31, p < 0.0001) . . . that greatly exceed
those reported in meta-analyses of other trials. There were multiple
examples of inconsistencies between and within trials, errors in reported
data, misleading text, duplicated data and text, and uncertainties about
ethical oversight. (p. 1)
It is past time, regardless of how the journal system evolves over the next
few decades. Hopefully the New England Journal of Medicine’s tentative
step in following John Carlisle’s lead is only a precursor to more hands-
on actions by journals to ensure the integrity of what they publish.
Perhaps the reproducibility crisis’s expanding profile will facilitate such actions, supplemented by the continuing work of scientists such as Dr. Carlisle and the determined efforts of the myriad contributors to this scientific initiative. For it is important to remember that scientific publishing is not solely in the hands of the publishing industry CEOs and CFOs or the editors-in-chief. Rather, it is a symbiotic process involving multiple other actors, the most important of whom are scientists themselves.
Methodological recommendations in these regards have been tendered
by a number of the reproducibility advocates mentioned in recent
chapters so there is no need to restate them here. However, given the
descriptions of statistical/empirical aids for ensuring the validity of
empirical results just mentioned, let’s revisit the aforementioned Bergh et
al.’s “Red Flags” article that very succinctly reminds us of the multiple
roles and responsibilities for ensuring reproducibility from a statistical
perspective in the initial phase of the publishing process:
First author’s responsibilities:
1. Include such values as coefficient estimates, standard errors, p-values in
decimals, and a correlation matrix that includes means, standard
deviations, correlations [including those between covariate and outcome],
and sample sizes.
2. “Describe all data-related decisions such as transformed variables and how missing values and outliers were handled.”
3. “Attest to the accuracy of the data and that the reporting of analytical
findings and conclusions.”
Reviewers’ responsibilities:
1. Do not accept a review assignment unless you can accomplish the task in
the requested timeframe—learn to say no.
2. Avoid conflict of interest [e.g., cronyism].
3. As a reviewer you are part of the authoring process [which means you
should strive to make whatever you review a better paper].
4. Spend your precious time on papers worthy of a good review.
5. Write clearly, succinctly, and in a neutral tone, but be decisive.
6. Make use of the “comments to editors” [i.e., comments designed to help
the editor but not to be shared with the author] (pp. 0973–0974).
A Final Publishing Vision that Should Indirectly
Improve Reproducibility
The following brief article would probably have been viewed as absurd by most social scientists dabbling in computational research when it was first published (and as completely irrelevant by those not involved therein). Today, however, in the context of the reproducibility crisis, it resonates as downright prescient for improving all genres of scientists’ day-to-day empirical practice.
After all our efforts at producing a paper, very few of us have asked the
question, is journal X presenting my work in a way that maximizes the
understanding of what has been done, providing the means to ensure
maximum reproducibility [emphasis added] of what has been done, and
maximizing the outreach of my work? (p. 1)
Authors are so happy to have their submission accepted that they blissfully
sign a copyright transfer form sent by the publishers. Then publishers
recoup their investment by closing [emphasis added] access to the articles
and then selling journal subscriptions to the scientists and their institutions
(individual articles can be purchased for $5 to $50 depending on the
journal). [Actually, in my experience it is a rare article that can be
obtained as cheaply as $5.] In other words, the funding public,
universities, and scientists who produced and pay for the research give
ownership of the results to publishers. Then, those with money left over
buy the results back from the publishers; the rest are in the dark. (p. 228)
1. The intellectual memory of my laboratory is in my e-mail folders,
themselves not perfectly organized. This creates a hub-and-spoke
environment where lab members and collaborators have to too often go
through me to connect to each other.
2. Much of our outreach is in the form of presentations made to each other
and at national and international forums. We do not have a good central
repository for this material; such a repository could enable us to have a
better understanding of what other researchers are doing.
3. While we endeavor to make all our software open source, there are always
useful bits of code that languish and disappear when the author leaves the
laboratory.
4. Important data get lost as students and postdoctoral fellows leave the
laboratory. (p. 2)
Suppose further that within a couple of decades he had realized the folly of his ways because, by then, his results appeared to have significant implications for a time-on-task theory he was working on—a theory, in fact, that not coincidentally explained his earlier failures to produce an efficacious method of study and could be validated by a relatively simple series of experiments.
However, there was no possibility that he could recall sufficient details concerning even a few of the studies, much less all of them (or even how many he had conducted). And, of course, practically no one in those days kept detailed paper-based records of procedures, protocols, or data for decades, especially following the institutional moves that often accompany such time intervals—and especially not for unpublished studies. (Hence, even the file drawer constitutes an insufficient metaphor for this problem.)
Today, however, after the Digital Revolution, there is really no excuse
for such behavior. There is also no excuse for not sharing all of the
information contributing to and surrounding a scientific finding,
published or unpublished. And that just happens to constitute the subject
of Chapter 10.
References
Bourne, P. E. (2010). What do I want from the publisher of the future? PLoS
Computational Biology, 6, e1000787.
Bourne P. E., & Korngreen, A. (2006). Ten simple rules for reviewers. PLoS
Computational Biology, 2, e110.
Brown, N. J., & Heathers, J. A. (2016). The GRIM test: A simple technique
detects numerous anomalies in the reporting of results in psychology. Social
Psychological and Personality Science, https://peerj.com/preprints/2064.pdf.
Butler, D. (2013). The dark side of publishing: The explosion in open-access publishing has enabled the rise of questionable operators. Nature, 495, 433–435.
Carlisle, J. B. (2017). Data fabrication and other reasons for non-random
sampling in 5087 randomised, controlled trials in anaesthetic and general
medical journals. Anaesthesia, 72, 944–952.
Epskamp, S., & Nuijten, M. B. (2015). Statcheck: Extract statistics from articles
and recompute p values. R package version 1.0.1. http://CRAN.R-project.org/package=statcheck
Fang, F. C., Steen, R. G., & Casadevall, A. (2012). Misconduct accounts for the
majority of retracted scientific publications. Proceedings of the National Academy of Sciences, 109, 17028–17033.
Ferguson, C., Marcus, A., & Oransky, I. (2014). Publishing: The peer review
scam. Nature, 515, 480–482.
Goldfarb, B. D., & King, A. A. (2016). Scientific apophenia in strategic
management research: Significance tests and mistaken inference. Strategic
Management Journal, 37, 167–176.
Heaven, D. (2018). AI peer reviewers unleashed to ease publishing grind.
Nature, 563, 609–610.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS
Medicine, 2, e124.
Ioannidis, J. P. A., Klavans, R., & Boyack, K. W. (2018). Thousands of scientists
publish a paper every five days. Nature, 561, 167–169.
King, G. (1995). Replication, replication. PS: Political Science and Politics, 28,
444–452.
Mallapaty, S. (2020). Popular preprint sites face closure because of money
troubles. Nature, 578, 349.
Martinson, B. C. (2017). Give researchers a lifetime word limit. Nature, 550,
202.
National Science Board. (2018). Science and engineering indicators 2018. NSB-
2018-1. Alexandria, VA: National Science Foundation.
(www.nsf.gov/statistics/indicators/)
Nelson, L. D., Simmons, J. P., & Simonsohn, U. (2012). Let’s publish fewer
papers. Psychological Inquiry, 23, 291–293.
Nosek, B. A., & Bar-Anan, Y. (2012). Scientific Utopia I: Opening scientific
communication. Psychological Inquiry, 23, 217–243.
Nuijten, M., Hartgerink, C. J., van Assen, M. L. M., Epskamp, S., & Wicherts, J.
(2016). The prevalence of statistical reporting errors in psychology (1985–
2013). Behavior Research Methods, 48, 1205–1226.
Schimmack, U. (2014). The test of insufficient variance (TIVA): A new tool for
the detection of questionable research practices.
https://replicationindex.wordpress.com/2014/12/30/the-test-ofinsufficientvariance-tiva-a-new-tool-for-the-detection-ofquestionableresearch-practices/
Schimmack, U., & Brunner, J. (2017). Z-curve: A method for estimating
replicability based on test statistics in original studies.
https://replicationindex.files.wordpress.com/2017/11/z-curve-submission-draft.pdf
Sokal, A. D. (1996). Transgressing the boundaries: Toward a transformative
hermeneutics of quantum gravity. Social Text, 46–47, 217–252.
Tsang, E. W., & Frey, B. S. (2007). The as-is journal review process: Let authors
own their ideas. Academy of Management Learning and Education, 6, 128–
136.
Wilson, H. (2018). Retracted Article: Human reactions to rape culture and queer
performativity at urban dog parks in Portland, Oregon. Gender, Place &
Culture, 27(2), 1–20.
10
Preregistration, Data Sharing, and Other
Salutary Behaviors
1. The preregistered protocols are not compared point-by-point to their
published counterparts (preferably involving a standardized
methodological and statistical checklist), and
2. The registered data are not reanalyzed (using the available code) as a
descriptive check on the demographics and as an inferential check on the
primary p-values reported.
But who is going to do all of this? The easy answer is the multi-billion
dollar publishing industry, but even though its flagship, the New England
Journal of Medicine, has joined Dr. Carlisle in checking distributions of
the baseline values of submitted randomized controlled trials (RCTs),
this does not mean that other journals will embrace this strategy (or the
considerably more effective and time-consuming suggestion just
tendered of reanalyzing the actual data).
However, since this industry is built on the free labor of scientists,
why not create a profession (as alluded to in Chapter 9) devoted to
checking preregistered protocols against the submitted research reports
and actually rerunning the key analyses based on the registered data as
part of the peer review process? These efforts could be acknowledged via
(a) a footnote in the published articles; (b) journal backmatter volume
lists, in which peer reviewers and authors are often listed; or (c) even rewarded as a new form of “confirming” authorship. (And, naturally,
these behaviors would be acknowledged and rewarded by the
participants’ institutions.)
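As a purely hypothetical illustration of what the first of these checks might look like in code, the sketch below compares a few preregistered fields against what a submitted manuscript reports and flags the discrepancies. The field names and example entries are invented, and in real use each flag would simply become a query for the authors, since departures from a preregistration can be perfectly legitimate when disclosed and justified:

    PREREGISTERED = {
        "primary_outcome": "Beck Depression Inventory score at 12 weeks",
        "planned_sample_size": 120,
        "primary_analysis": "ANCOVA adjusting for baseline score",
    }

    REPORTED = {
        "primary_outcome": "Beck Depression Inventory score at 6 weeks",
        "planned_sample_size": 84,
        "primary_analysis": "ANCOVA adjusting for baseline score",
    }

    def flag_discrepancies(prereg, report):
        """List the fields where the submitted report departs from the preregistration."""
        flags = []
        for field, planned in prereg.items():
            observed = report.get(field, "<not reported>")
            if observed != planned:
                flags.append(f"{field}: preregistered {planned!r}, reported {observed!r}")
        return flags

    for discrepancy in flag_discrepancies(PREREGISTERED, REPORTED):
        print("QUERY FOR AUTHORS -", discrepancy)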
And why is all of this necessary since replication is the reproducibility
gold standard? The easy answer is that the number of studies presently
being published precludes replicating even a small fraction thereof given
the resources required. (Not to mention the fact that some studies cannot
be replicated for various reasons.)
So, if nothing else, considerably more confidence can be had in the ultimate reproducibility of a study that passes these initial screenings involving the original protocol and the registered data (and that also passes the peer review process). This evidence, coupled with the perceived importance of a study, will also help inform whether a replication is called for or not. And if it is, the registered information will make the replication process considerably easier—for not only is replication the gold standard for reproducibility, but its increased practice also provides a powerful disincentive for conducting substandard research.
But if a study protocol is preregistered, why is it necessary to check it against the final published product? Partly because of scientists’ unique status in society and the conditions under which they work. Unlike members of many other professions, scientists, while they typically work in teams, are not individually supervised. In fact, most senior scientists do not actually perform any of the procedural behaviors involved in their experiments but instead rely on giving instructions to research assistants or postdocs, who are themselves rarely supervised on a day-to-day basis—at least following a training run or two.
Of course scientists, like other professionals such as surgeons, are expected to follow explicit, evidence-based guidelines. But while surgeons are rewarded for positive results, their operating procedures can also be investigated following an egregious and unexpected negative result (or a confluence thereof, via malpractice suits or facing the families of patients who have died under their watch)—unless they have rigorously followed evidence-based, standard operating (not a pun) procedures. And while scientists also have standard operating procedures, as documented in guidelines such as the CONSORT and ARRIVE statements, their carrots and sticks are quite different. The “sticks” have historically come in the form of little more than an occasional carrot reduction, with the exception of the commission of outright fraud. (And some countries even permit investigators who have committed egregious examples of fraud to continue to practice and publish.) The carrots come in the form of tenure, direct deposit increases, and the esteem (or jealousy) of their colleagues, which in turn requires numerous publications, significant external funding, and, of course, statistically significant research results.
All of which may have resulted in the social and life sciences waking
up to the reproducibility crisis after a prolonged nap during which
behaviors. Thus the most important message emanating from the reproducibility movement is that many (not all) of the traditional ways in which science has been conducted must change, and that the key mechanisms for driving this change involve required transparency in the conduct of the research practices which precede the collection of a study’s first data point. (Or, for investigations involving existing databases, transparent steps that precede inferential analyses.)
But barring constantly monitored surveillance equipment in the
laboratory, which isn’t a bad idea in some cases (as recommended by
Timothy Clark [2017]), what could accomplish such a task? Hopefully
the requisite behaviors are obvious by now, but let’s examine these
behaviors’ absolute essentials, their requisite policies, and their current
levels of implementation in a bit more detail.
1. The very threat of third-party preregistration evaluations broadcasting any
discrepancies on social media might be a significant deterrent in and of
itself;
2. The preregistration document could serve as a reference to conscientious
investigators in the preparation of their final manuscripts by reminding
them of exactly what they had originally proposed and what actually
transpired in the conduct of the study—especially when there is a
significant delay between study initiation and manuscript preparation;
3. Since changes and amendments to the document following study
commencement are often necessary, the preregistration process and the
highly recommended laboratory workflow can operate together synergistically to keep track of progress and refresh memories.
such as the pharmaceutical industry (second only to the federal
government as a source of medical research funding) from concealing
the presence of selected (presumably negative) trials that could
potentially “influence the thinking of patients, clinicians, other
researchers, and experts who write practice guidelines or decide on
insurance-coverage policy” (De Angelis et al., 2004, p. 1250).
The early requirements for registries were actually rather modest and
could be of the investigators’ choosing as long as they were (a) free of
charge and accessible to the public, (b) open to all prospective
registrants, and (c) managed by a not-for-profit organization. The
requirements for the preregistrations of the clinical trials themselves,
while far from onerous, suggested an acute awareness of many of the
QRPs listed in Chapters 3 and 4.
Mills (1993) in his classic article “Data Torturing” (“If the fishing
expedition catches a boot, the fishermen should throw it back, not claim
that they were fishing for boots” [p. 1198]) and Andrew Gelman and Eric
Loken’s (2014) previously discussed seminal article (in a “garden of
forking paths, whatever route you take seems predetermined” [p. 464]).
This genre of ex post facto behavior is surely one of the most frequently practiced QRPs (our number 15 in Chapter 3), one of the most insidious, and more often than not accompanied by several related undesirable practices. It is also a leading cause of irreproducibility and publication bias. And, like most QRPs, its end results (artifactually produced and incorrect p-values < 0.05) are reinforced by publication
and peer review policies.
However, while preregistration is an important preventive measure for ex post facto hypothesizing, the practice it is designed to prevent has a long history in the social sciences. So let's take a quick trip back in time and review a small piece
of the history of this particular QRP—and in the process perhaps help
explain why it is so difficult to eradicate.
There are two possible articles you can write: (1) the article you planned
to write when you designed your study or (2) the article that makes the
most sense now that you have seen the results [emphasis added]. They are
rarely the same, and the correct answer is (2) . . . the best journal articles
are informed by the actual empirical findings from the opening sentence.
(pp. 171–172)
Or,
The data may be strong enough to justify recentering your article around
the new findings and subordinating or even ignoring your original
hypotheses. . . . If your results suggest a compelling framework for their
presentation, adopt it and make the most instructive findings your
centerpiece. (p. 173)
Now it’s easy to bash Daryl Bem today, but this sort of attitude, approach, or orientation toward ensuring publication, whether through the writing of a research report or the conduct of the research itself, seemed to be generally acceptable a quarter of a century ago. This is better illustrated via an unpublished survey Professor Kerr conducted in 1991 (with S. E. Harris), in which 156 behavioral scientists were asked to estimate the frequency with which they suspected HARKing (somewhat broadly defined) was practiced in their discipline. A majority reported the belief that this set of behaviors was actually practiced more frequently than the classic approach to hypothesis testing. Finally, Professor Kerr advanced a number of untoward side effects of HARKing, including (in his words)
This absolute gem of an article is reminiscent of Anthony
Greenwald’s (1975) previously discussed classic. One almost feels as
though the social sciences in general (and the reproducibility crisis in particular) are in some sort of bizarre time loop where everything is
repeated every few decades with little more than a change of
terminology and an escalating number of publishing opportunities.
While Anthony Greenwald and Norbert Kerr are hard acts to follow,
someone must carry on the tradition since few practicing researchers
read or heed what anyone has written or advocated decades in the past.
And certainly Brian Nosek (and his numerous collaborators) is eminently
qualified to assume that mantle as witnessed by the establishment of the
Open Science Collaboration, which advocates preregistration and
provides a multidisciplinary registry of its own, and a number of
instructional articles on the topic (e.g., Nosek & Lakens, 2014) as well as
the following aptly titled article.
It is an example of circular reasoning—generating a hypothesis based on
observing data, and then evaluating the validity of the hypothesis based on
the same data. (p. 2600)
1. Changes to procedure during the conduct of the study. Probably the
most common example of this occurs in clinical trials when recruitment
turns out to be more difficult than anticipated. To paraphrase a shock
trauma investigator I once worked with, “The best prevention for spinal
cord injury is to conduct a clinical trial requiring the recruitment of
spinal cord–injured patients. Diving and motorcycle accidents will inevitably almost completely disappear as soon as study recruitment begins.” This, of course, sometimes unavoidably results in readjusting
the study’s originally proposed sample size. But this is only one of a
multitude of other unanticipated glitches or necessary changes to a
protocol that can occur after the study begins. As the authors suggest,
transparently reporting what these changes were and the reason that they
were necessitated will go a long way toward salvaging a study and
making it useful, unless said changes were made after looking at the
data (with the exception of the next “challenge”).
2. Discovery of assumption violations during analysis. This one is best
avoided by pilot work, but distribution violations may occur with a
larger and/or slightly different sample than was used in the pilot study
process. The authors suggest that a “decision tree” approach be specified in the preregistration regarding what analytic steps will be taken if the data do not fit the preregistered analytic plan (e.g., non-normality or missing values on a prespecified covariate); a minimal sketch of such a decision tree appears after this list. However, some unanticipated problems (e.g., a ceiling or floor effect on the outcome variable) can be fatal, so it should be noted that the options presented deal only with the violation of statistical
assumptions. By the same token, all experienced investigators have a
pretty thorough knowledge of what could possibly occur during the
course of a study in their fields, hence Lin and Green’s (2016)
suggestion that common genres of research adopt SOPs which can be
copied and pasted into a preregistration document to cover possible
discrepancies between published and prespecified analysis plans.
3. Analyses based upon preexisting data and (4) longitudinal studies and
large, multivariate databases. These two research genres involve uses of
preregistration that are seldom considered. For example, the utility of
blinding is well-established in experimental research but it can also
apply to longitudinal databases in the form of generating and registering
hypotheses prior to data analysis. When this isn’t practical, perhaps a
reasonable fallback position would be to include those relationships
which the investigators have already discovered and reported in a
preregistration document while transparently reporting them as such in
the published analysis, accompanied by an alpha adjustment of 0.005.
(Which isn’t as onerous for large databases as it is for experiments.)
And, of course, non-hypothesized relationships found to be of interest
should also be similarly reported. (These latter suggestions shouldn’t be
attributed to the Nosek team since they are my opinions.)
5. Running many experiments at the same time. Here, the authors described
a situation in which a “laboratory acquires data quickly, sometimes
running multiple experiments per week. The notion of pre-registering
every experiment seems highly burdensome for their efficient
workflow” (p. 2603). Some might find this scenario a bit troublesome
since it is highly unlikely that said laboratories publish all of these
“multiple experiments per week” hence the trashing of the non-
significant ones might constitute a QRP in and of itself. In their defense,
the authors suggest that this is normally done “in the context of a
methodological paradigm in which each experiment varies some key
aspects of a common procedure” (p. 2603). So, in this case, the authors
describe how a preregistration can be written for such a program of
research as a whole, and any promising findings can then be replicated.
However, for experiments conducted in this manner which do not
simply vary “some key aspects of a common procedure,” one wonders
what happens to all of the “negative” findings. Are they never
published? At the very least, they should be mentioned in the published article and recorded in the laboratory’s official workflow.
6. Conducting a program of research. This one is a bit like the previous
challenge but seems to involve a series of separate full-blown
experiments, each of which is preregistered and one eventually turns out
to be statistically significant at the 0.05 level. The authors make the
important point that such an investigator (reminiscent of our
hypothetical graduate student’s failed program of research) should report
the number of failures preceding his or her success since the latter is
more likely to be a chance finding than a stand-alone study. Few
investigators would either consider doing this or adjusting the positive
p-value based on the number of previous failures. But they probably
should and, perhaps in the future, will.
7. Conducting “discovery” research with no actual hypotheses. In this
scenario, similar in some ways to challenges (3) and (4), researchers
freely admit that their research is exploratory and hence may see no
need to preregister said studies. The authors conclude that this could be
quite reasonable, but it is a process fraught with dangers, one of which is
that it is quite possible for scientists to fool themselves and truly believe
that an exciting new finding was indeed suspected all along (i.e., “the
garden of forking paths”). Preregistration guards against this possibility
(or others’ suspicions thereof) and possesses a number of other
advantages as well—serving as an imperfect genre of workflow for
investigators who do not routinely keep one.
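To make the second of these challenges a bit more concrete, here is a minimal sketch (my own, not the Nosek team's) of what a prespecified analytic "decision tree" might look like in code. It assumes a simple two-group comparison with a preregistered Welch t-test and a Mann-Whitney U test as the prespecified fallback; the data, variable names, and thresholds are purely illustrative.

```python
# Minimal sketch of a prespecified analytic "decision tree" (challenge 2).
# Illustrative assumptions, not the Nosek team's procedure: a two-group
# comparison whose primary test is a Welch t-test, with the Mann-Whitney U
# test preregistered as the fallback if a Shapiro-Wilk screen flags
# non-normality in either group.
from scipy import stats

def preregistered_analysis(group_a, group_b, normality_alpha=0.05):
    """Run the primary test, or the prespecified fallback if the screen fails."""
    normal_a = stats.shapiro(group_a).pvalue > normality_alpha
    normal_b = stats.shapiro(group_b).pvalue > normality_alpha
    if normal_a and normal_b:
        label = "Welch t-test (preregistered primary analysis)"
        p = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue
    else:
        label = "Mann-Whitney U (preregistered fallback)"
        p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue
    return label, p

if __name__ == "__main__":
    a = [4.1, 3.9, 5.2, 4.8, 4.4, 5.0, 4.6, 4.2]   # invented illustrative data
    b = [3.5, 3.8, 4.0, 3.6, 4.1, 3.7, 3.9, 3.4]
    label, p = preregistered_analysis(a, b)
    print(f"{label}: p = {p:.3f}")
```

The point is not the particular tests chosen but that the branching rule itself is written down, and dated, before the first data point is collected.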
The authors conclude their truly splendid article by suggesting that
the preregistration movement appears to be accelerating, as illustrated
by the numbers of existing research registries across an impressive
number of disciplines and organizations. They also list resources
designed to facilitate the process, including online courses and
publishing incentives, while warning that the movement still has a long
way to go before it is a universal scientific norm.
So why not use this article to draw a red line for reproducibility? I
have suggested tolerance for investigators such as Carney and Cuddy
because they were simply doing what their colleagues had been doing for
decades. Some have even partially excused Daryl Bem by saying that his
work met minimum methodological standards for its time, but let’s not
go that far since there is really no excuse for pathological science.
But surely, at some point, enough becomes enough. Perhaps a year
after the publication date of the preceding article (2018) could constitute
such a red line. A zero-tolerance point, if you will, one beyond which it is no longer necessary to be kind or politically correct or civil in copious correspondence to offending editors and investigators, nor in posts on social media, regarding studies published beyond this point in time that ignore (or whose authors are ignorant of) the precepts laid down in Nosek et al.’s 2018 declaration of a “Preregistration
Revolution.” And this is not to mention the myriad publications that
surfaced around 2011–2012 or Anthony Greenwald’s 1975 classic.
But isn’t this a bit Draconian, given the progress we’re making (or is
something missing)? There is definitely something missing, even from
medicine’s inspired edicts listing the registration of clinical trials as a
publication requirement in its highest impact journals after 2004—and
even after this became a legal requirement for some types of clinical
trials in 2007, with the law being expanded in 2017.
The first problem, as documented by a number of studies, involves the
lack of compliance with preregistration edicts. The second reflects the far
too common mismatch between the preregistered protocol and what was
actually published.
Mathieu, Boutron, Moher, and colleagues (2009) illustrated the
necessity for ameliorating both of these problems in a study designed to
compare the key elements present in preregistrations with their published
counterparts. Locating 323 cardiology, rheumatology, and
gastroenterology trials published in 10 medical journals in 2008, these
investigators found that only 147 (46%) had been adequately registered
before the end of the trial despite the International Committee of Medical Journal Editors’ (ICMJE) 2004 edict. And almost equally disheartening, 46 (31%)
of these 147 compliant articles showed a discrepancy in the primary
outcome specifications between the registered and published outcomes.
And most problematic of all, 19 (83%) of the 23 studies for
which the direction of the discrepancies could be assessed were
associated with statistically significant results. These published versus
preregistration discrepancies were distributed as follows:
(Note that some of the 46 articles had more than one of these QRPs.)
In an interesting coincidence, in the same publication year as this
study, Ewart, Lausen, and Millian (2009) performed an analysis of 110
clinical trials and found the same percentage (31%) of primary outcomes
changed from preregistration to published article, while fully 70% of the
registered secondary outcomes had also been changed.
And, 2 years later, Huić, Marušić, and Marušić (2011) conducted a
similar study comparing a set of 152 published RCTs registered in
ClinicalTrials.gov with respect to both completeness and substantive
discrepancies between the registry entries and the published reports. As
would be expected by now, missing fields were found in the
preregistrations themselves as well as substantive changes in the primary
outcome (17%). Progress from 31% perhaps?
Now granted, this is a lot of number parsing, but for those whose eyes have glazed over, suffice it to say that while the ICMJE initiative was an exemplary, long overdue, and borderline revolutionary policy for a tradition-bound discipline such as medicine (and undeniably better than nothing), it was quite disappointing in the compliance it elicited several years after initiation.
Naturally all of the investigators just discussed had suggestions for
improving the preregistration process, most of which should sound
familiar by now. From Mathieu and colleagues (2009):
First, the sponsor and principal investigator should ensure that the trial
details are registered before [emphasis added] enrolling participants.
Second, the comprehensiveness of the registration should be routinely
checked by editors and readers, especially regarding the adequate reporting
of important items such as the primary outcome.
Third, editors and peer reviewers should systematically check the
consistency between the registered protocol and the submitted manuscript to
identify any discrepancies and, if necessary, require explanations from the
authors, and
Finally, the goal of trial registration could [should] be to make available
and visible information about the existence and design of any trial and give
full access to all trial protocols and the main trial results. (p. 984)
The conclusions of Huić et al., on the other hand, were a combination
of good and bad news.
ICMJE journals published RCTs with proper registration [the good news]
but the registration data were often not adequate, underwent substantial
changes in the registry over time and differed in registered and published
data [the very bad news]. Editors need to establish quality control
procedures in the journals so that they continue to contribute to the
increased transparency of clinical trials. (p. 1)
1. Since manuscripts are now submitted online, the first item on the
submission form should include a direct link to the dated preregistration
document, and the submission process should be summarily terminated if
that field is missing. Any major deviations from the preregistration
document should be mentioned as part of the submission process and the
relevant declaration thereof should be included in the manuscript.
(Perhaps as a subtitled section at the end of the methods section.)
2. At least one peer reviewer should be tasked with comparing the
preregistration with the final manuscript using a brief checklist that
perfectly matches the required checklist completed by the author in the
preregistration document (a minimal sketch of such a comparison follows
this list). This would include the specification of the primary outcome,
sample size justification, identity of the experimental conditions,
analytic approach, and inclusion/exclusion criteria.
3. Any unmentioned discrepancies between the authors’ rendition and the
completed manuscript should either result in a rejection of the manuscript
or their inclusion being made a condition of acceptance.
4. Readers should be brought into the process and rewarded with the
authorship of a no-charge, published “errata discovery” or “addendum” of
some sort since these could constitute fail-safe candidates for both
checking and enforcing this policy. (This might eventually become common enough to dispel the personal onus editors seem to associate with publishing errata or negative comments.)
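To make the second of these suggestions concrete, here is a minimal sketch (mine, not any journal's actual system) of the kind of automated comparison an editor or designated peer reviewer might run between the preregistration checklist and the checklist accompanying the submitted manuscript. The field names and example entries are illustrative assumptions.

```python
# Minimal sketch of suggestion 2: an automated comparison of the checklist
# completed at preregistration with the one accompanying the submitted
# manuscript. Field names and example entries are illustrative assumptions.
PREREG_FIELDS = (
    "primary_outcome",
    "sample_size_justification",
    "experimental_conditions",
    "analytic_approach",
    "inclusion_exclusion_criteria",
)

def find_discrepancies(prereg: dict, manuscript: dict) -> list:
    """Return the checklist fields on which the two documents disagree."""
    problems = []
    for field in PREREG_FIELDS:
        registered, reported = prereg.get(field), manuscript.get(field)
        if registered != reported:
            problems.append((field, registered, reported))
    return problems

if __name__ == "__main__":
    prereg = {"primary_outcome": "pain at 6 weeks (VAS)",
              "analytic_approach": "ANCOVA with baseline as covariate"}
    manuscript = {"primary_outcome": "pain at 12 weeks (VAS)",
                  "analytic_approach": "ANCOVA with baseline as covariate"}
    for field, was, now in find_discrepancies(prereg, manuscript):
        print(f"Discrepancy in {field}: registered '{was}' vs. published '{now}'")
```

Any field returned by such a comparison would then trigger the required explanation (or rejection) described in suggestion 3.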
Now, of course, these suggestions entail some extra expense from the
journals’ perspective, but academic publishers can definitely afford to
hire an extra staff member or, heaven forbid, even pay a peer reviewer an
honorarium when tasked with comparing the preregistered protocol with
the manuscript he or she is reviewing. (After all, publishers can always
fall back on one of their chief strengths, which is to pass on any
additional costs to authors and their institutions.)
On a positive note, there is some evidence that compliance with
preregistration edicts may have begun to improve in the past decade or
so—at least in some disciplines. Kaplan and Irvin (2015), for example,
conducted a natural experiment in which large (defined as requiring
more than $500,000 in direct costs) National Heart, Lung, and Blood
Institute-funded cardiovascular RCTs were compared before and after
preregistration in ClinicalTrials.gov was mandated.
Unlike preregistration requirements announced by journals or
professional organizations, this one has apparently been rigorously enforced, since Kaplan and Irvin found that 100% of the 55 located trials published after 2000 were registered, as compared to 0% prior thereto.
(Note the sharp contrast to other disciplines without this degree of
oversight, as witnessed by one disheartening study [Cybulski, Mayo-
Wilson, & Grant, 2016] which found that, of 165 health-related
psychological RCTs published in 2013, only 25 [15%] were
preregistered.)
Perhaps equally surprisingly (and equally heartening), the 2015
Kaplan and Irvin cardiovascular study also found a precipitous drop in
publication bias, with trials published prior to 2000 reporting “significant
benefit for their primary outcome” in 17 of 30 (57%) studies versus 8%
(or 2 of 25) after 2000 (p < 0.0005). The authors attributed this drop in positive findings to one key preregistration
requirement of the ClinicalTrials.gov initiative.
And lest it appear that social science experiments are being ignored here
with respect to changes from preregistration to publication, the National
Science Foundation (NSF)-sponsored Time-sharing Experiments for the
Social Sciences (TESS) provides an extremely rare opportunity for
comparing preregistered results with published results for a specialized
genre of social science experiments. The program itself involved
embedding “small” unobtrusive interventions (e.g., the addition of a
visual stimulus or changes in the wording of questions) into national
surveys conducted for other purposes. Franco, Malhotra, and Simonovits
(2014) were then able to compare the unpublished results of these
experiments with their published counterparts since the NSF required not only the experimental protocols and accruing data but also the study results to be archived prior to publication.
The authors were able to locate 32 of these studies that had been
subsequently published. They found that (a) 70% of the published
studies did not report all the outcome variables included in the protocol
and (b) 40% did not report all of the proposed experimental conditions.
Now while this could have been rationalized as editorial pressure to
shorten the published journal articles, another study by the Franco et al.
team discovered a relatively unique wrinkle to add to the huge
publication bias and QRP literatures: “Roughly two thirds of the reported
tests [were] significant at the 5% level compared to about one quarter of
the unreported tests” (p. 10). A similar result was reported for political
science studies drawn from the same archive (Franco, Malhotra, &
Simonovits, 2017).
And if anyone needs to be reminded that registry requirements alone
aren’t sufficient, one extant registry has even more teeth than
ClinicalTrials.gov. This particular registry is unique in the sense that it
potentially controls access to hundreds of billions in profit to powerful
corporations. It is also an example of a registry developed by a
government agency that closely examines and evaluates all submitted
preregistrations before the applicants can proceed with their studies, as
well as the results after the trials are completed.
That honor goes to the US Food and Drug Administration (FDA). The
FDA requires that positive evidence of efficacy (in the form of randomized placebo or active comparator trials) be deposited in its registry and that this evidence be evaluated by its staff before a specific drug can be approved for a specific medical condition.
To fulfill this responsibility, the agency’s registry requires a
prospective protocol, including the analysis plan, the actual RCT data
produced, and the results thereof in support of an application for either
marketing approval or a change in a drug’s labeling use(s). FDA
statisticians and researchers then review this information to decide
whether the evidence is strong enough to warrant approval of each
marketing application. Such a process, if implemented properly, should
preclude the presence of a number of the QRPs described in Chapter 3.
Alas, what the FDA does not review (or regulate) are the published
results of these trials that wind up in the peer reviewed scientific
literature. Nor does it check those results against what is reported in its
registry. But what if someone else did?
The methods reported in 11 journal articles appeared to depart from the
pre-specified methods reflected in the FDA reviews. . . . Although for
each of these studies the finding with respect to the protocol-specified
primary outcome was non-significant, each publication highlighted a positive result as if it were the primary outcome. [Sound familiar by
now?] The non-significant results of the pre-specified primary outcomes
were either subordinated to non-primary positive results (in two reports)
or omitted (in nine). (p. 255)
And,
Another team conducted a study around the same time period (Rising,
Bacchetti, & Bero, 2008) employing different FDA studies and reported
similar results—including the propitious changes from registry to journal
articles. However, in 2012, Erick Turner (with Knoepflmacher & Shapley) basically repeated his 2008 study employing antipsychotic trials
and found similar (but less dramatic) biases—hopefully due to Erick’s
alerting pharmaceutical companies that someone was watching them, but
more likely due to the greater efficacy of antipsychotic drugs. (Or
perhaps placebos are simply less effective for patients experiencing
psychotic symptoms than for those with depression.)
One final study involving an often overlooked preregistration registry:
we don’t conceptualize them in this way, but federally mandated
institutional review boards (IRBs) and institutional animal care and use
committees (IACUCs) are registries that also require protocols
containing much of the same basic information required in a
preregistration.
The huge advantage of these local regulatory “registries” is that it is
illegal for any institution (at least any that receive federal funding) to
allow the recruitment of participants (human or animal) for any research
purposes without first submitting such a protocol for approval by a
committee designated for this purpose. More importantly, most of these
institutions are quite conscientious in enforcing this requirement since
federal research funding may be cut off for violations. So, obviously,
such registries would constitute an excellent opportunity for comparing
regulatory protocols with their published counterparts, but for some
inexplicable reason IRB and IACUC records are considered proprietary.
However, occasionally investigators are provided access to selected
IRBs for research purposes. Chan, Hrobjartsson, Haahr, and colleagues
(2004), for example, were able to obtain permission from two Danish
IRBs to identify 102 experimental protocols submitted between 1994 and
1995 that had subsequently been published. Each proposed protocol was
then compared to its published counterpart to identify potential
discrepancies in the treatment or the specified primary outcome. The
identified changes from application to publication in the pre-specified
primary outcomes were that some (a) magically became secondary
outcomes, (b) were replaced by secondary outcomes, (c) disappeared
entirely, or (d) regardless of status, the outcomes used in the power
calculations required by the IRB protocols differed from those reported
in the published articles (which were necessitated by the previous three
changes).
Of the 102 trials, 82 specified a primary outcome (it is “puzzling” that
20 did not). Of these, 51 (62%) had made at least one of the four just-
mentioned changes. And, not surprisingly, the investigators found that
The odds of a particular outcome being fully reported were more than twice
as high if that outcome was statistically significant. Although the response
rate was relatively low, one of the most interesting facets of this study was a
survey sent to the studies’ authors. Of the 49 responses received, 42 (86%)
actually “denied the existence of unreported outcomes despite clear
evidence to the contrary.” (p. 2457)
After all, shouldn’t an IRB that is instituted for research participants’
protection also protect those participants from squandering their time and
effort on the incessant production of fallacious research results? Also,
since no participant identifiers are contained in an IRB application, no
privacy concerns can emanate from them except for study investigators.
And anonymity is the last thing investigators want since they gladly affix
their names to the resulting published articles.
While the following three potential objections may smack of once
again raising strawmen, they probably need to be addressed anyway.
1. Cost: Admittedly, in research-intensive institutions the IRB-IACUC
regulatory process would require at least one additional full-time staff
member to upload all approved proposals and attached amendments (or at
the very least the completed standardized checklist just suggested) to a
central registry. (Or perhaps someone could write a program to do so
automatically.) Just as obviously, some effort (preferably automated) will
be necessary to ensure that none of the required fields is empty. However,
research institutions receive very generous indirect costs (often exceeding
50% of the actual research budget itself) so there is adequate funding for
an integral research function such as this.
2. Release date: The proposal (or possibly simply the minimal information
containing hypotheses, primary outcomes, experimental conditions,
sample size, study design, and analytic approach), while already uploaded
and dated, could be released only upon submission of the final manuscript
for publication and only then following the journal’s decision to submit it
to the peer review process. This release process could be tightened by the
principal investigator granting access to the proposal (and its
amendments) only to the journal to which it is submitted. However, any
such restrictions would be lifted once the manuscript had been published
or after a specified period of time. (Both the submitted manuscript and the
published article would have to include a direct link to the archived IRB
proposal or checklist.)
3. Amendments to regulatory proposals: As mentioned, IRBs and IACUCs
differ significantly in their degree of oversight and conscientiousness. My
familiarity with IRBs extends only to those representing academic
medical centers, which are probably more rigorous than their liberal arts
counterparts. However, any serious IRB or IACUC should require dated
amendments to a proposal detailing changes in (a) sample size (increased
or decreased) along with justifications thereof, (b) experimental
conditions (including changes to existing ones, additions, and/or
deletions), (c) primary outcomes, and (d) analytic approaches. All such
amendments should be attached to the original protocol in the same file,
which would certainly “encourage” investigators to include any such
changes in their manuscripts submitted for publication since their
protocols would be open to professional scrutiny.
A final note: preregistration of protocols need not be an onerous or
time-consuming process. It could consist of a simple six- or seven-item checklist in which each item requires no more than a one- or two-sentence explication: for example, (a) the primary hypothesis and (b) a justification
for the number of participants to be recruited based on the hypothesized
effect size, the study design, and the resulting statistical power
emanating from them. The briefer and less onerous the information
required, the more likely the preregistration process and its checking will
be implemented.
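By way of illustration, here is a minimal sketch of such a checklist, including a sample size justification computed from a hypothesized effect size and target power. The entries and numbers are invented for illustration and are not a template endorsed by any registry; the power calculation assumes a simple two-group design.

```python
# Minimal sketch of a brief preregistration checklist of the kind described
# above, with the sample size justification computed from a hypothesized
# effect size and target power. All entries are invented illustrations and
# assume a simple two-group design.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.5   # hypothesized standardized difference (Cohen's d)
alpha, power = 0.05, 0.80
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power
)

checklist = {
    "primary_hypothesis": "Treatment improves the primary outcome vs. control.",
    "primary_outcome": "Symptom score at 8 weeks",
    "study_design": "Two-arm parallel randomized trial",
    "experimental_conditions": "treatment; placebo control",
    "planned_analysis": "Independent-groups t-test, two-sided",
    "sample_size_justification": (
        f"d = {effect_size}, alpha = {alpha}, power = {power}: "
        f"approximately {round(n_per_group)} participants per group"
    ),
}

for item, entry in checklist.items():
    print(f"{item}: {entry}")
```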
Data Sharing
by “a competition that challenged researchers to recreate each other’s
work.” As of this writing, the results are not in, but the very attempt
augurs well for the continuance and expansion of the reproducibility
movement both for relatively new avenues of inquiry and across the
myriad classic disciplines that comprise the scientific enterprise. Perhaps
efforts such as this provide some hope that the “multiple-study”
replication initiatives discussed in Chapter 7 will continue into the
future.
At first glance, data sharing may seem more like a generous professional gesture than a reproducibility necessity. However, when data are reanalyzed by separate parties as a formal analytic replication (aka analytic reproduction), a surprisingly high prevalence of errors favoring positive results emerges, as discussed in Chapter 3 under QRP 6
(sloppy statistical analyses and erroneous results in the reporting of p-
values).
So, if nothing else, investigators who are required to share their data
will be more likely to take steps to ensure their analyses are accurate.
Wicherts, Bakker, and Molenaar (2011), for example, found significantly
more erroneously reported p-values among investigators who refused to share their data than among those who did.
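For what such an analytic reproduction can involve at its simplest, here is a minimal sketch, in the spirit of the reanalyses just described (and of automated checkers such as statcheck), that recomputes the p-value implied by a reported t statistic and its degrees of freedom and flags any mismatch with the published p. The "reported" values are invented for illustration.

```python
# Minimal sketch, in the spirit of the reanalyses described above (and of
# automated checkers such as statcheck): recompute the two-sided p-value
# implied by a reported t statistic and its degrees of freedom, then flag
# any mismatch with the p-value given in the article. The "reported" numbers
# below are invented for illustration.
from scipy import stats

def check_reported_t(t_value, df, reported_p, tolerance=0.005):
    recomputed = 2 * stats.t.sf(abs(t_value), df)   # two-sided p from t and df
    return recomputed, abs(recomputed - reported_p) <= tolerance

if __name__ == "__main__":
    # e.g., an article reports "t(28) = 2.05, p = .03"
    recomputed, consistent = check_reported_t(t_value=2.05, df=28, reported_p=0.03)
    print(f"recomputed p = {recomputed:.3f}; consistent with the report: {consistent}")
```

In this invented example the recomputed two-sided p is approximately .05 rather than the reported .03, which is exactly the sort of discrepancy that access to the underlying data (or even just the test statistics) makes visible.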
So, the most likely reasons for those who refuse to share their data are
Perhaps this is one reason that Gary King (1995) suggested that data
creation should be recognized by promotion and tenure committees as a
significant scientific contribution in and of itself. In addition, as a
professional courtesy, the individual who created the data in the first
place could be offered an authorship on a publication if he or she
provides any additional assistance that warrants this step. But barring
that, data creators should definitely be acknowledged and cited in any
publication involving their data.
Second, depending on the discipline and the scope of the project,
many principal investigators turn their data entry, documentation, and
analysis over to someone else who may employ idiosyncratic coding and
labeling conventions. However, it is the principal investigator’s
responsibility to ensure that the data and their labeling are clear and
explicit. In addition, code for all analyses, variable transformations, and
annotated labels, along with the data themselves, should be included in a
downloadable file along with all data cleaning or outlier decisions. All of
these tasks are standard operating procedures for any competent
empirical study, but, as will be discussed shortly, compliance is far from
perfect.
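As one minimal sketch of what that documentation might look like (the file names, variables, and cleaning decisions are illustrative assumptions, not a prescribed format), the data could be deposited alongside a machine-readable codebook:

```python
# Minimal sketch of depositing a dataset together with a machine-readable
# codebook documenting labels, coding, transformations, and cleaning
# decisions. File names, variables, and decisions are illustrative.
import csv
import json

codebook = {
    "pid":         {"label": "Participant ID", "coding": "integer"},
    "group":       {"label": "Condition", "coding": "0 = control, 1 = treatment"},
    "score_w8":    {"label": "Symptom score at week 8", "coding": "0-100 scale"},
    "score_w8_ln": {"label": "Log-transformed week 8 score",
                    "coding": "ln(score_w8 + 1); transformation prespecified"},
}
cleaning_decisions = [
    "Scores above 100 treated as data-entry errors and set to missing.",
    "No outliers removed; all exclusions listed in the accompanying log.",
]

with open("codebook.json", "w") as f:
    json.dump({"variables": codebook, "cleaning": cleaning_decisions}, f, indent=2)

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(codebook.keys())
    writer.writerow([1, 0, 42, 3.76])   # one illustrative row: ln(42 + 1) = 3.76
print("Wrote codebook.json and data.csv")
```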
The good news is that the archiving and sharing of data in computational research has increased since the 2011–2012 awakening,
undoubtedly facilitated by an increasing tendency for journals to
delineate policies to facilitate the practice. However, Houtkoop,
Wagenmakers, Chambers, and colleagues (2018) concluded, based on a
survey of 600 psychologists, that “despite its potential to accelerate
progress in psychological science, public data sharing remains relatively
uncommon.” They consequently suggest that “strong encouragement
from institutions, journals, and funders will be particularly effective in
overcoming these barriers, in combination with educational materials
that demonstrate where and how data can be shared effectively” (p. 70).
There are, in fact, simply too many important advantages of the data
sharing process for it to be ignored. These include
But for those who are unmoved by altruistic motives, scientific norms
are changing and professional requests to share data will continue to
increase to the point where, in the very near future, failing to comply
with those requests will actually become injurious to a scientist’s
reputation. Or, failing that, it will at the very least be an increasingly
common publication requirement in most respectable empirical journals.
However, requirements and cultural expectations go only so far. As
Robert Burns put it, “The best laid schemes o’ mice an’ men gang aft
agley,” or, more mundanely translated to scientific practice by an
unknown pundit, “even the most excellent of standards and requirements
are close to worthless in the absence of strict enforcement.”
“Close,” but not completely worthless, as illustrated by a study
conducted by Alsheikh-Ali, Qureshi, Al-Mallah, and Ioannidis (2011), in
which the first 10 original research papers of 2009 published in 50 of the
highest impact scientific journals (almost all of which were medicine- or
life sciences-oriented) were reviewed with respect to their data sharing
behaviors. Of the 500 reviewed articles, 351 papers (70%) were subject
to a data availability policy of some sort. Of these, 208 (59%) were not
completely in compliance with their journal’s instructions. However,
“none of the 149 papers not subject to data availability policies made
their full primary data publicly available [emphasis added]” (p. 1).
So an official “requirement for data registration” will not be even moderately effective unless the archiving of said data (a) precedes publication and (b) is checked by a journal representative (preferably by a statistician) to ensure adequate
documentation and transparent code. And this common-sense
generalization holds for funding agency requirements as well, as is
disturbingly demonstrated by the following teeth-grating study.
data. . . . No data at all were received from 74% of the funded entities
(23% of whom did not reply to the request and 49% could not be
contacted).
Unfortunately, although the EVOSTC reported funding hundreds of
projects, “the success of this effort is unknown as the content of this
collection has since been lost [although surely not conveniently].” But,
conspiracy theories aside, the authors do make a case that a recovery
rate of 26% is not unheard of, and, while this may sound completely
implausible, unfortunately it appears to be supported by at least some
empirical evidence, as witnessed by the following studies reporting
data availability rates (which in no way constitutes a comprehensive or
systematic list).
1. Wollins (1962) reported that a graduate student requested the raw data
from 37 studies reported in psychology journals. All but 5 responded,
and 11 (30% of the total requests) complied. (Two of the 11
investigators who did comply demanded control of anything published
using their data, so 24% might be considered a more practical measure
of compliance.)
2. Wicherts, Borsboom, Kats, and Molenaar (2006) received 26% of their
requested 249 datasets from 141 articles published in American Psychological Association journals.
3. Using a very small sample, Savage and Vickers (2009) requested 10
datasets from articles published in PLoS Medicine and PLoS Clinical
Trials and received only 1 (10%). This even after reminding the original
investigators that both journals explicitly required data sharing by all
authors.
4. Vines, Albert, Andrew, and colleagues (2014) requested 516 datasets
from a very specialized area (morphological plant and animal data
analyzed via discriminant analysis) and received a response rate of 19%
(101 actual datasets). A unique facet of this study, in addition to its size,
was the wide time period (1991 to 2011) in which the studies were
published. This allowed the investigators to estimate the odds of a
dataset becoming unavailable over time, which turned out to be a
disappearance rate of 17% per year.
5. Chang and Li (2015) attempted to replicate 61 papers that did not
employ confidential data in “13 well-regarded economics journals using
author-provided replication files that include both data and code.” They
were able to obtain 40 (66%) of the requisite files. However, even with
the help of the original authors, the investigators were able to reproduce
the results of fewer than half of those obtained. Notably, data sharing was approximately twice as high for those journals that required it as for those that did not (83% vs. 42%).
6. Stodden, Seiler, and Ma (2018) randomly selected 204 articles in
Science to evaluate its 2011 data sharing policy, of which 24 provided
access information in the published article. Emails were sent to the
remaining authors, of which 65 provided some data and/or code,
resulting in a total of 89 (24 + 65) articles that shared at least some of
what was requested. This constituted a 44% retrieval rate, the highest compliance rate of any study reviewed here, and (hopefully not coincidentally) it happened to be the most recent one. From these 89 sets of data, the
investigators judged 56 papers to be “potentially computationally
reproducible,” and from this group they randomly selected 22 to actually
replicate. All but one appeared to replicate, hence the authors estimated
that 26% of the total sample may have been replicable. [Note the
somewhat eerie but completely coincidental recurrence of this 26%
figure.]
All six sets of authors provided suggestions for improvement,
especially around the adequacy of runnable source code. This is
underlined in a survey of 100 papers published in Bioinformatics
(Hothorn & Leisch, 2011), which found that adequate code for
simulation studies was “limited,” although what is most interesting about
this paper is that the first author serves (or served) as the “reproducible
research editor” for a biometric journal in which one of his tasks was to check the code to make sure it ran—a role and process which should be implemented by other journals. (Or perhaps, alternatively, a startup company could offer this service on a per-article fee basis.)
Despite differences in methodologies, however, all of these authors
would probably agree with Stodden and her colleagues’ conclusion
regarding Science’s data sharing guidelines (and perhaps other journal-
unenforced edicts as well).
Due to the gaps in compliance and the apparent author confusion regarding
the policy, we conclude that, although it is a step in the right direction, this
policy is insufficient to fully achieve the goal of computational
reproducibility. Instead, we recommend that the journal verify deposit of
relevant artifacts as a condition of publication. (p. 2588)
share their data upon request because too many untoward events (e.g.,
multiple job and computer changes, retirement, and dementia) can occur
over time to subvert that process, even for investigators with the most
altruistic of motives.
Journals should also not accept a paper until the archived data are
checked for completeness and usability. Furthermore, a significant
amount of money should be held in escrow by funders until the archived
data are also checked. In one or both cases, an independent statistician
should be designated to check the data, code, and other relevant aspects
of the process as well as personally sign off on the end product of said
examination. To facilitate this, the archived code should be written in a
commonly employed language (a free system such as R is probably
preferable, but software choices could be left up to investigators as long
as they are not too esoteric). Everything should also be set up in such a
way that all results can be run with a mouse click or two.
While short shrift has admittedly been given to purely computational
research and reproducibility, this field is an excellent resource for
suggestions concerning computational reproducibility. Sandve,
Nekrutenko, Taylor, and Hovig (2013), for example, begin by arguing
that ensuring the reproducibility of findings is as much in the self-interest of the original investigators as it is in the interest of others.
Rule 1: For every result, keep track of how it was produced. This basically reduces to maintaining a detailed analytic workflow. Or, in the authors’ words: “As a minimum, you should at least record sufficient details on programs, parameters, and manual procedures to allow yourself, in a year or so, to approximately reproduce the results.” (A minimal sketch illustrating this and the next two rules appears after this list.)
Rule 2: Avoid manual data manipulation steps. In other words, use
programs and codes to recode and combine variables rather than perform
even simple data manipulations manually.
Rule 3: Archive the exact versions of all external programs used. Some
programs change enough over time to make exact replication almost
impossible.
Rule 4: Version control all custom scripts. Quite frankly, this one is
beyond my expertise so for those interested, the original authors suggest
using a “version control system such as Subversion, Git, or Mercurial.” Or,
as a minimum, keep a record of the various states the code has taken during
its development.
Rule 5: Record all intermediate results, when possible in standardized
formats. Among other points, the authors note that “in practice, having
easily accessible intermediate results may be of great value. Quickly
browsing through intermediate results can reveal discrepancies toward what
is assumed, and can in this way uncover bugs or faulty interpretations that
are not apparent in the final results.”
Rule 7: Always store raw data behind plots. “As a minimum, one should
note which data formed the basis of a given plot and how this data could be
reconstructed.”
Rule 10: Provide public access to scripts, runs, and results.
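Here is a minimal sketch, assuming a trivially small analysis, of what Rules 1 through 3 can look like in practice: the result is produced entirely by code, and the parameters, platform, and software version used to produce it are written out next to the result itself. The file names and parameters are illustrative, not part of the authors' rules.

```python
# Minimal sketch of Rules 1-3 for a trivially small analysis: the result is
# produced entirely by a script, and the parameters, platform, and software
# version used to produce it are stored next to the result itself. The file
# names and parameters are illustrative assumptions.
import json
import platform
import statistics
import sys
from datetime import datetime, timezone

PARAMS = {"outcome_variable": "score_w8", "drop_missing": True}

def analyze(values, drop_missing):
    # All data handling is scripted rather than manual (Rule 2).
    cleaned = [v for v in values if v is not None] if drop_missing else values
    return {"n": len(cleaned), "mean": statistics.mean(cleaned)}

if __name__ == "__main__":
    result = analyze([42, 39, 51, None, 47], PARAMS["drop_missing"])
    provenance = {
        "produced_at": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],   # Rule 3: exact versions
        "platform": platform.platform(),
        "parameters": PARAMS,                       # Rule 1: how it was produced
        "result": result,
    }
    with open("result_with_provenance.json", "w") as f:
        json.dump(provenance, f, indent=2)
    print(provenance["result"])
```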
statisticians note, and both Steegen, Tuerlinckx, Gelman, and Vanpaemel
(2016) and our favorite team of Simonsohn, Simmons, and Nelson
(2015) empirically illustrate, different analytic decisions often result in
completely different inferential results. And while such decisions are
capable of being quite reasonably justified a posteriori, one of our
authors has previously reminded us that in a “garden of forking paths,
whatever route you take seems predetermined.”
Ironically, the Steegen et al. team illustrated this potential for
selectively choosing an analytic approach capable of producing a
statistically significant p-value by using the study employed by Andrew
Gelman to illustrate his garden of forking paths warning. (The study—
and a successful self-replication thereof—it will be recalled was
conducted by Durante, Rae, and Griskevicius (2013), who “found” that women’s fertility status influenced both their religiosity and their political attitudes.) The Steegen team’s approach involved
1. Employing the single statistical result reported in the Durante et al. study,
2. Constructing what they termed the “data multiverse,” which basically
comprised all of the reasonable coding and transformation decisions
possible (120 possibilities in the first study and 210 in the replication),
and then
3. Running all of these analyses and comparing the p-values obtained to
those in the published article.
One should reserve judgment and acknowledge that the data are not strong
enough to draw a conclusion on the effect of fertility. The real conclusion of
the multiverse analysis is that there is a gaping hole in theory or in
measurement, and that researchers interested in studying the effect of
fertility should work hard to deflate the multiverse. The multiverse analysis
gives useful directions in this regard. (p. 708)
The Simonsohn team (whose work actually preceded this study) arrived at the same basic conclusions and provided, as is their wont, a
statistical approach (“specification-curve analysis”) for evaluating the
multiple results obtained from these multiple, defensible, analytic
approaches.
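For readers who prefer to see the logic rather than the arithmetic, here is a minimal sketch of a multiverse-style analysis: the same two-group comparison is run under every combination of a few defensible processing choices, and the resulting p-values are inspected as a set. The data, exclusion rules, and transformations are invented for illustration and are not those employed by Steegen et al. (2016), Simonsohn et al. (2015), or Durante et al. (2013).

```python
# Minimal sketch of a multiverse-style analysis: run the same two-group
# comparison under every combination of a few defensible processing choices
# and inspect the spread of p-values. The data, exclusion rules, and
# transformations are invented and are not those of the studies cited above.
import math
from itertools import product
from scipy import stats

outcome = [4.1, 3.9, 5.2, 4.8, 4.4, 5.0, 4.6, 4.2, 3.5, 3.8, 4.0, 9.9]
group   = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

exclusion_rules = {"keep all": lambda y: True, "drop y > 8": lambda y: y <= 8}
transforms = {"raw": lambda y: y, "log": math.log}

p_values = {}
for (ex_name, keep), (tr_name, transform) in product(exclusion_rules.items(),
                                                     transforms.items()):
    a = [transform(y) for y, g in zip(outcome, group) if g == 1 and keep(y)]
    b = [transform(y) for y, g in zip(outcome, group) if g == 0 and keep(y)]
    p_values[(ex_name, tr_name)] = stats.ttest_ind(a, b, equal_var=False).pvalue

# A conclusion that appears under only some specifications is fragile.
for spec, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(f"{spec}: p = {p:.3f}")
```

If the conclusion survives every defensible specification, it is robust; if it appears under only one or two of them, the garden of forking paths is probably doing the work.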
Both the Steegen et al. (2016) and the Simonsohn et al. (2015) articles
demonstrate that different analytic approaches are in some cases capable
of producing both statistically significant and non-significant results.
And certainly some investigators may well analyze and reanalyze their
data in the hope that they will find an approach that gives them a p-value
< 0.05—thereby suggesting the need for a new QRP designation to add
to our list or simply providing yet another example of p-hacking.
However, the recommendation that investigators analyze and report all
of the “reasonable scenarios” (Steegen et al., 2016) or “multiple,
defensible, analytic approaches” (Simonsohn et al., 2015) is, in my
opinion, probably going a bridge too far. Especially since the first set of
authors found an average of 165 possible analyses in a relatively simple
study and its replication. So perhaps the Simonsohn team should have stuck with the advice given in their iconic 2011 article, which involved (a) the preregistration of study analysis plans and (b) reporting results without
the use of covariates.
Materials Sharing
In those instances where cooperation is not provided, the published
results should be (and are) viewed with the same suspicion by the
scientific community as afforded to unpublished discovery claims. The
cold fusion debacle constituted an unusual example of this in the sense
that the specifications for the apparatus were apparently shared but not
the procedures employed to generate the now infamous irreproducible
results.
Materials sharing 2.0: Timothy Clark (2017), an experimental biologist, suggests taking the sharing of materials and procedures a step further in a single-page Nature article entitled “Science, Lies and Video-
Taped Experiments.” Acknowledging the difficulties, he succinctly
presents the following analogy: “If extreme athletes can use self-
mounted cameras to record their wildest adventures during mountaintop
blizzards, scientists have little excuse not to record what goes on in lab
and field studies” (p. 139).
His suggestion that journals should require such evidence to be registered (and even used in the peer review process) may currently be unrealistic. But the process would certainly (a) facilitate replication,
(b) serve as an impressive and time-saving teaching strategy, and (c)
discourage scientific misconduct. And there is even a journal partly
designed to encourage the process (i.e., the Journal of Visualized Experiments).
1. Encourage other scientists to read the accompanying article since its
results promise to be more reproducible, citable, and perhaps even more
important;
2. Encourage colleagues and scientists interested in conducting secondary
analyses of data or performing replications to not only read and cite the
article but possibly provide its author(s) with collaborative opportunities.
Piwowar, Day, and Fridsma (2007), for example, found that the citation
rate for 85 cancer microarray clinical trial publications which shared
usable research data was significantly higher compared to similar studies
which did not do so;
3. Identify the badged authors as principled, careful, modern scientists; and
4. Potentially even increase the likelihood of future acceptances in the
journal in which the badged article appears.
1. From a baseline of 2.5%, Psychological Science reported open data
sharing increased to an average of 22.8% of articles by the first half of
2015 (i.e., in slightly over 1 year after the advent of badges);
2. The four comparison journals, on the other hand, while similar at baseline
to Psychological Science, averaged only 2.1% thereafter (i.e., as
compared to 22.8% in Psychological Science);
3. With respect to actual availability of usable data, the results were equally
(if not more) impressive, with the Psychological Science articles that
earned badges significantly outperforming the comparison journals.
Perhaps a more interesting effect, however, involved a comparison of the
availability of usable data of Psychological Science articles announcing
availability with badges versus the Psychological Science articles
announcing availability but without badges. For the 64 Psychological
Science articles reporting availability of data archived on a website or
repository, 46 had requested and been awarded a data sharing badge while
18 had not. Of those with a badge, 100% actually had datasets available,
82.6% of which were complete. For those who announced the availability
of their data but did not have a badge, 77.7% (N = 14) made their data
available but only 38.9% (N = 7) of these had complete data. And, finally,
4. The effects of badges on the sharing of materials were in the same
direction as data sharing, although not as dramatic.
Now granted, these numbers are relatively small and the evaluation itself
was comparative rather than a randomized experiment, but the authors
(who transparently noted their study’s limitations) were undoubtedly
justified in concluding that, “Badges are simple, effective signals to
promote open practices and improve preservation of data and materials
by using independent repositories” (p. 1).
And that, Dear Readers, concludes the substantive subject matter of
this book, although the final chapter will present a few concluding
thoughts.
References
Chan, A.-W., Hróbjartsson, A., Haahr, M. T., Gøtzsche, P. C., & Altman, D. G. (2004). Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. Journal of the American Medical Association, 291, 2457–2465.
Chang, A. C., & Li, P. (2015). Is economics research replicable? Sixty published
papers from thirteen journals say “usually not.” Finance and Economics
Discussion Series. http://dx.doi.org/10.17016/FEDS.2015.083
Clark, T. D. (2017). Science, lies and video-taped experiments. Nature, 542, 139.
Couture, J. L., Blake, R. E., McDonald, G., & Ward, C. L. (2018). A funder-
imposed data publication requirement seldom inspired data sharing. PLoS
ONE, 13, e0199789.
Cybulski, L., Mayo-Wilson, E., & Grant, S. (2016). Improving transparency and
reproducibility through registration: The status of intervention trials published
in clinical psychology journals. Journal of Consulting and Clinical
Psychology, 84, 753–767.
De Angelis, C. D., Drazen, J. M., Frizelle, F. A., et al. (2004). Clinical trial
registration: A statement from the International Committee of Medical Journal
Editors. New England Journal of Medicine, 351, 1250–1252.
Donoho, D. L., Maleki, A., Shahram, M., et al. (2009). Reproducible research in computational harmonic analysis. Computing in Science & Engineering, 11, 8–18.
Durante, K., Rae, A., & Griskevicius, V. (2013). The fluctuating female vote:
Politics, religion, and the ovulatory cycle. Psychological Science, 24, 1007–
1016.
Ewart, R., Lausen, H., & Millian, N. (2009). Undisclosed changes in outcomes in
randomized controlled trials: An observational study. Annals of Family
Medicine, 7, 542–546.
Franco, A., Malhotra, N., & Simonovits, G. (2014). Underreporting in
psychology experiments: Evidence from a study registry. Social Psychological
and Personality Science, 7, 8–12.
Franco, A., Malhotra, N., & Simonovits, G. (2017). Underreporting in political
science survey experiments: Comparing questionnaires to published results.
Political Analysis, 23, 306–312.
Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent
analysis—a “garden of forking paths”—explains why many statistically
significant comparisons don’t hold up. American Scientist, 102, 460–465.
Gibney, E. (2019). This AI researcher is trying to ward off a reproducibility
crisis. Nature, 577, 14.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis.
Psychological Bulletin, 82, 1–20.
Hothorn, T., & Leisch, F. (2011). Case studies in reproducibility. Briefings in
Bioinformatics, 12, 288–300.
Houtkoop, B. L., Wagenmakers, E.-J., Chambers, C., et al. (2018). Data sharing
in psychology: A survey on barriers and preconditions. Advances in Methods
and Practices in Psychological Science, 1, 70–85.
Huić, M., Marušić, M., & Marušić, A. (2011). Completeness and changes in
registered data and reporting bias of randomized controlled trials in ICMJE
journals after trial registration policy. PLoS ONE, 6, e25258.
Kaplan, R. M., & Irvin, V. L. (2015). Likelihood of null effects of large NHLBI
clinical trials has increased over time. PLoS ONE, 10, e0132382.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.
Kerr, N. L., & Harris, S. E. (1998). HARKing-hypothesizing after the results are
known: Views from three disciplines. Unpublished manuscript. Michigan State
University, East Lansing (not obtained).
Kidwell, M. C., Lazarević, L. B., et al. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology, 14, e1002456.
King, G. (1995). Replication, replication. PS: Political Science and Politics, 28,
443–499.
Lin, W., & Green, D. P. (2016). Standard operating procedures: A safety net for
pre-analysis plans. Political Science and Politics, 49, 495–500.
Mathieu, S., Boutron, I., Moher, D., et al. (2009). Comparison of registered and
published primary outcomes in randomized controlled trials. Journal of the
American Medical Association, 302, 977–984.
Mills, J. L. (1993). Data torturing. New England Journal of Medicine, 329,
1196–1199.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The
preregistration revolution. Proceedings of the National Academy of Sciences,
115, 2600–2606.
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the
credibility of published results. Social Psychology, 45, 137–141.
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research
data is associated with increased citation rate. PLoS ONE, 2, e308.
Rising, K., Bacchetti, P., & Bero, L. (2008). Reporting bias in drug trials
submitted to the Food and Drug Administration: Review of publication and
presentation. PLoS Medicine, 5, e217.
Savage, C. J., & Vickers, A. J. (2009). Empirical study of data sharing by authors
publishing in PLoS journals. PLoS ONE, 4, e7078.
Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules
for reproducible computational research. PLoS Computational Biology, 9,
e1003285.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22, 1359–1366.
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Specification curve:
Descriptive and inferential statistics on all reasonable specifications.
Manuscript available at http://ssrn.com/abstract=2694998
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing
transparency through a multiverse analysis. Perspectives on Psychological
Science 11, 702–712.
Stodden, V., Guo, P., & Ma, Z. (2013). Toward reproducible computational
research: An empirical analysis of data and code policy adoption by journals.
PLoS ONE, 8, e67111.
Stodden, V., Seiler, J., & Ma, Z. (2018). An empirical analysis of journal policy
effectiveness for computational reproducibility. Proceedings of the National
Academy of Sciences, 115, 2584–2589.
Turner, E. H., Knoepflmacher, D., & Shapley, L. (2012). Publication bias in
antipsychotic trials: An analysis of efficacy comparing the published literature
to the US Food and Drug Administration database. PLoS Medicine, 9,
e1001189.
Turner, E. H., Matthews, A. M., Linardatos, E., et al. (2008). Selective
publication of antidepressant trials and its influence on apparent efficacy. New
England Journal of Medicine, 358, 252–260.
Vines, T. H., Albert, A. Y. K., Andrew, R. L., et al. (2014). The availability of
research data declines rapidly with article age. Current Biology, 24, 94–97.
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share
research data is related to the strength of the evidence and the quality of
reporting of statistical results. PLoS ONE, 6, e26828.
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor
availability of psychological research data for reanalysis. American
Psychologist, 61, 726–728.
Wollins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657–
658.
11
A (Very) Few Concluding Thoughts
Educational Interventions
overviews of the reproducibility process and suggestions for teaching its
precepts to students and implementing them in practice.
With respect to the educational process, Munafò et al. suggest (in addition to a formal course, which is more common in the social sciences than in their life and physical science counterparts) that
The most effective solutions [for both students and faculty] may be to
develop educational resources that are accessible, easy-to-digest . . . web-
based modules for specific topics, and combinations of modules that are
customized for particular research applications). A modular approach
simplifies the process of iterative updating of those materials.
Demonstration software and hands-on examples may also make the lessons
and implications particularly tangible to researchers at any career stage . . .
[for example,] the Experimental Design Assistant (https://eda.nc3rs.org.uk)
supports research design for whole animal experiments, while P-hacker
(http://shinyapps.org/apps/p-hacker/) shows just how easy it is to generate
apparently statistically significant findings by exploiting analytic flexibility.
(p. 2)
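To make that final point concrete, the brief simulation below is a minimal sketch, not the P-hacker app itself nor any code supplied by Munafò et al., of how two common forms of analytic flexibility (measuring several outcomes and adding participants after an interim peek at the data) can push the false-positive rate well above the nominal 5% even when no true effect exists. The sample sizes, the number of outcome measures, and the use of Python with NumPy and SciPy are assumptions made purely for illustration.

```python
# Minimal sketch: how undisclosed analytic flexibility inflates false positives.
# Assumptions (for illustration only): two groups, three outcome measures,
# one interim look at the data followed by the addition of more participants.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)

def flexible_analysis(n_initial=20, n_added=10, n_outcomes=3, alpha=0.05):
    """Return True if any analysis path reaches p < alpha despite a null effect."""
    control = rng.normal(size=(n_initial, n_outcomes))
    treatment = rng.normal(size=(n_initial, n_outcomes))
    for second_look in (False, True):   # analyze now, then again after adding subjects
        if second_look:
            control = np.vstack([control, rng.normal(size=(n_added, n_outcomes))])
            treatment = np.vstack([treatment, rng.normal(size=(n_added, n_outcomes))])
        for k in range(n_outcomes):     # test each outcome separately
            if stats.ttest_ind(treatment[:, k], control[:, k]).pvalue < alpha:
                return True             # report whichever comparison "worked"
    return False

simulations = 5000
false_positives = sum(flexible_analysis() for _ in range(simulations))
print(f"Realized false-positive rate: {false_positives / simulations:.3f} (nominal 0.05)")
```

Even this modest amount of flexibility typically multiplies the nominal error rate several times over, which is precisely the behavior such teaching tools are designed to make visible.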
While both the Asendorpf et al. (2013) and the Munafò et al. (2017)
articles are too comprehensive to abstract here and their behavioral dicta
have been discussed previously, each deserves to be read in its entirety.
However, in buttressing the argument that methodological and statistical
resources should be sought after by (and available to) all investigators,
the latter recommends a model instituted by the CHDI Foundation
(which specializes in research on Huntington’s disease). Here, a
committee of independent statisticians and methodologists is available
to offer “a number of services, including (but not limited to) provision of
expert assistance in developing protocols and statistical analysis plans,
and evaluation of prepared study protocols” (p. 4).
Of course, students can and should be brought into such a process as
well. In the conduct of science, few would dispute that hands-on experience is one of the, if not the, most effective ways to learn how to do science. Hence the Munafò et al. paper describes a resource designed to facilitate this process in psychology, available under the Open Science Framework umbrella, called the Collaborative Replications
and Education Project (https://osf.io/wfc6u/), in which
A coordinating team identifies recently published research that could be
replicated in the context of a semester-long undergraduate course on
research methods. A central commons provides the materials and guidance
to incorporate the replications into projects or classes, and the data collected
across sites are aggregated into manuscripts for publication. (p. 2)
most adept at obtaining p < 0.05 driven irreproducible results. For, as
Andrew Gelman (2018) succinctly (as always) reminds us,
Theories in the “soft areas” of psychology have a tendency to go through
periods of initial enthusiasm leading to large amounts of empirical
investigation with ambiguous over-all results. This period of infatuation is
followed by various kinds of amendment and the proliferation of ad hoc
hypotheses. Finally, in the long run, experimenters lose interest rather than
deliberately discard a theory as clearly falsified. (p. 196)
If Las Vegas took bets on the two alternative futures, the smart money would undoubtedly favor the first. After all, conducting reproducible research is monetarily more expensive and entails extra effort and time; one mathematical biologist has estimated the cost at 30% more time, while nevertheless suggesting that achieving this valued (and valuable) scientific commodity is not an insurmountable task.
While 30% more time and effort is probably an overestimate, other reproducibility edicts and strategies will surely be suggested in the future that add further time and effort to the research process. Some of these, such as “multiverse analyses,” have already been proposed, and they will most likely fall by the wayside, since they require too much of both and actually violate some well-established practices, such as employing only the most defensible and discipline-accepted procedures.
Undoubtedly, if the reproducibility initiative persists, new facilitative
roles and the expansion of existing ones will also be adopted. And
almost certainly, if the prevalence of QRPs and misconduct persists (or
increases), some of these will gain traction, such as (a) required use of
institutionally centralized drives for storing all published data and
supporting documentation (hopefully including workflows) or (b)
institutions requiring the perusal of papers by an independent scientist
prior to submission, with an eye toward spotting abnormalities. Existing
examples of the latter include Catherine Winchester’s (2018) “relatively
new reproducibility role” at the Cancer Research UK Beatson Institute
and some institutions’ use of outside firms to conduct reproducibility
screening in response to one or more egregiously fraudulent incidents
(Abbott, 2019).
Perhaps even the increased awareness that a cadre of scientists is searching for, finding, and promulgating published examples of irreproducible results may encourage their colleagues to avoid the
deleterious effects of QRPs. Meta-researchers, even with the
considerable limitations of their approach, are already beginning to play
a growing role in both increasing the awareness of substandard
methodologies and tracking reproducibility progress over time.
And progress is being made. Iqbal, Wallach, Khoury, and colleagues (2016), for example, in analyzing a random sample of 441 biomedical journal articles published from 2000 to 2014, found a small but positive trend in the reporting of a number of reproducibility and transparency behaviors over this interval. However, as the authors noted, the continuance of such studies plays an important role in tracking the effects of the reproducibility initiative over time.
Similarly, a more recent meta-scientific tracking study (Menke, Roelandse, Ozyurt, et al., 2020) found small but positive gains from 1997 to either 2016 or 2019 involving six methodological indicators (e.g., randomization, blinding, power analysis) and six indicators related to the provision of sufficient information on the biological materials employed that are essential for replication purposes (e.g., antibodies and cell lines). Unfortunately, the results were not especially impressive for several of these key indicators.
The Open Science Framework website is an impressive example of the latter; in addition, as discussed by Menke et al., one of the most useful resources for bench researchers is undoubtedly the Resource Identification Portal (RRID), which facilitates the reporting of RRID identifiers in all published research employing in vivo resources. These identifiers are essential for the replication of many if not most such findings, and the RRID portal allows interested scientists to ascertain the specific commercial or other sources associated with each identifier, and thus provides the capacity to obtain the resources themselves.
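Because RRIDs follow a fixed, machine-readable syntax, even a very crude text check can flag whether a methods section cites its key resources in a resolvable form. The sketch below is only an illustration of that idea: the regular expression is a simplification of the real identifier syntax, the vendor, catalog number, and RRID in the sample text are invented placeholders, and none of this reproduces the actual text-mining pipeline used by Menke and colleagues.

```python
# Illustrative check for RRID-style identifiers in a methods section.
# The pattern is a simplification; real RRIDs take forms such as RRID:AB_<digits>
# for antibodies or RRID:CVCL_<code> for cell lines.
import re

RRID_PATTERN = re.compile(r"RRID:\s*[A-Z]+_[A-Za-z0-9_:-]+")

def rrid_report(methods_text: str) -> dict:
    """List the RRIDs found in a methods section and flag whether any were reported."""
    found = RRID_PATTERN.findall(methods_text)
    return {"rrids_found": found, "any_identifiable_resources": bool(found)}

# Hypothetical methods excerpt; the vendor, catalog number, and RRID are placeholders.
example_methods = (
    "Sections were stained with an anti-GFAP antibody (ExampleVendor Cat# 12-3456, "
    "RRID:AB_0000001) and with a second antibody cited without any identifier."
)
print(rrid_report(example_methods))
```

An automated screen of this general sort is what allows journal-level identification rates, like those reported by Menke et al., to be computed at scale.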
Using antibodies as an example, Menke and colleagues found that, by 2019, 14 of the 15 journals with the highest identification rates participated in the RRID initiative. The average antibody identification rate for this 14-journal cohort was 91.7%, as compared with 43.3% for the 682 journals that had published at least 11 antibody-containing articles. This is not proof-positive of a causal effect of the RRID initiative, but the journal impact factor of the 682 journals was unrelated to this indicator (numerically, the correlation was negative), and the case is certainly buttressed by a 2016 decree by the editor of Cell (Marcus et al., 2016), one of the most cited journals in all of science and definitely the most cited and prestigious journal in its area.
The decree itself was part of an innovation consisting of the
Structured, Transparent, Accessible Reporting (STAR) system, which not
only required the implementation of the RRID initiative but also required
that the information be provided in a mandated structured Key Resources
Table along with standardized section headings that “follow guidelines
from the NIH Rigor and Reproducibility Initiative and [are] aligned with
the ARRIVE guidelines on animal experimentation and the Center for
Open Science’s Guidelines for Transparency and Openness Promotion
(https://cos.io/top/)” (Marcus et al., 2016, p. 1059).
Not surprisingly, the compliance rate for reporting the requisite antibody information in the eight Cell journals was even higher than that of the other seven journals (93.6% vs. 91.7%) included in Menke et al.’s previously mentioned top 15 (see table 5 of the original article). If journal editors in other disciplines were this conscientious, there is little question which of the two posited alternative futures would result from the reproducibility initiative as a whole.
Perhaps it is equally likely that neither of the preceding alternative
futures will occur. Instead, it may be that something in between will be
realized or some completely unforeseen paradigmatic sea change will
manifest itself. But, as always, in deference to my virtual mentor, I will
defer here.
In all fairness, however, I cannot place the blame for my reticence
solely upon my mentor for the simple reason that the reproducibility
field as a whole is in the process of changing so rapidly (and involves so
many empirical disciplines with unique challenges and strategies to meet
those challenges) that no single book could cover them all in any detail.
So the story presented here is by necessity incomplete, and its ending
cannot be told for an indeterminate period of time.
However, I have no hesitation in proclaiming that the reproducibility
initiative represents an unquestionable present-day success story to be
celebrated regardless of what the future holds. And we all currently owe
a significant debt to the dedicated investigators, methodologists, and
statisticians chronicled here. Their contributions to improving the quality
and veracity of scientific inquiry over the past decade or so deserve a
place of honor in the history of science itself.
References
Kidwell, M. C., Lazarević, L. B., et al. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology, 14, e1002456.
Marcus, E., for the Cell team. (2016). A STAR is born. Cell, 166, 1059–1060.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of
Lakatosian defense and two principles that warrant using it. Psychological
Inquiry, 1, 108–141.
Menke, J., Roelandse, M., Ozyurt, B., et al. (2020). Rigor and Transparency
Index, a new metric of quality for assessing biological and medical science
methods. bioRxiv http://doi.org/dkg6;2020
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., et al. (2017). A manifesto for
reproducible science. Nature Human Behaviour, 1, 1–9.
Winchester, C. (2018). Give every paper a read for reproducibility. Nature, 557,
281.
Index
For the benefit of digital users, indexed terms that span two pages (e.g., 52–53)
may, on occasion, appear on only one of those pages.
Tables and figures are indicated by t and f following the page number
statistical, 76–77, 212–13
z-curve, 213
analytic replications, 135, 245
anesthesia research
non-random sampling in, 213–14
publication bias in, 24
animal magnetism, 112
Animal Research: Reporting of in Vivo Experiments (ARRIVE), 86–87, 224,
263
animal studies
blinding in, 87
post hoc deletion of participants, 78–79
publication bias in, 23
publishing guidelines for, 263
QRPs and, 86–88
antidepressant trials, 24, 240–41
a posteriori hypothesizing, 59
archives
preprint, 30, 205
rules for improving, 252
ARRIVE (Animal Research: Reporting of in Vivo Experiments), 86–87, 224,
263
artificial intelligence (AI) aids, 198
artificial intelligence (AI) research, 245
arXiv preprint repository, 30, 205
as is peer review, 207–8
authorship
confirmation of, 223
purpose of, 210
author’s responsibilities, 215
automated plagiarism checks, 198
badges, 256–57
Baker, Monya, 265
Banks, George, 67–70
Bar-Anan, Yoav, 198, 203–7, 211
Bargh, John, 173–78, 184, 199
Bausell, Barker, 119
Bausell, R. B., 9–10
Bayer Health Care, 154–55, 156t, 169
Bazell, Robert, 125
Begley, Glenn, 153–54, 155, 169–70
behavioral planning, 173–78
behavioral priming, 174–76
behavioral sciences, 2
bell-shaped, normal curve, 40f, 40
Bem, Daryl, 62, 112–18, 126, 138, 141, 180, 228–29, 234
Benjamin, Daniel, 93–94
bias
confirmation, 80–81, 100–1
implicit, 91–92
publication (see publication bias)
systematic, 41–42, 41t
Bioinformatics, 250
biological sciences
number of scientific publications per year, 195–96, 195t
publication bias in, 20, 24, 25
replication initiatives, 155, 156t
Biological Technologies Office (DARPA), 170
biomedical research, 266–67
bioRxiv, 30
Blake, Rachael, 248–50
blinding, 83–84, 109, 267
in animal studies, 87
in fMRI studies, 99
follow-up experiments, 175–76
methodological improvements, 175
Bohannon, John, 198–99
Boldt, Joachim, 213–14
bootstrapping, 212
Bourne, Philip, 217–18
Box, George, 43–44
Boyle, Robert, 15
brain volume abnormalities, 24
Brigham Young University, 120
British Journal of Psychology, 114–15
British Medical Journal, 83–84
Bruns, Stephan, 181
Burger, Jerry, 142
Burns, Robert, 7, 247
Burrows, Lara, 174
Burt, Cyril, 81
business, 20
Butler, Declan, 201–2
Camerer, Colin, 165–66
Campbell, Donald T., 3, 4, 5
cancer research
preclinical, 158
publication bias in, 23, 24, 25
replication initiatives, 155, 156t
Cancer Research UK Beatson Institute, 266
cardiovascular research, 238
Carlisle, John B., 213–14, 215
Carney, Dana, 180, 181–82
Carroll, Lewis, 113–14
case control studies, 101–2
CDC (Centers for Disease Control and Prevention), 103–4
Cell, 268
Center for Open Science (COS), 155, 206, 268. See also Open Science Framework (OSF)
Centers for Disease Control and Prevention (CDC), 103–4
Chabris, Christopher, 93–94
Chambers, Chris, 31, 140
CHDI Foundation, 261–64
checklists, 87–88
chemistry, 127, 195–96, 195t
Chen, Mark, 174
child health, 24
China, vii, 24, 195, 195t
chrysalis effect, 67–70
Claerbout, Jon, 245, 246
clairvoyance, 110
Clark, Timothy, 225, 255
class sizes, small, 104–6
Cleeremans, Axel, 174–76
Clever Hans (horse), 110
clinical research
misconduct in, 57
preregistration of trials, 226–28
publication bias in, 20, 25
published versus preregistration discrepancies, 235
RCTs (see randomized controlled trials [RCTs])
sample sizes, 238
clinicaltrials.com, 227
ClinicalTrials.gov, 238
close (direct) replications, 135–38
advantages of, 139–40
combined with extension (aka conceptual) replication, 144–45, 145f
hypothetical examples, 143–47
Cochrane Database of Systematic Reviews, 26
Cockburn, Iain, 157–58
code availability, 250, 251
cognitive science, 24, 160
Cohen, Jacob, 4
cold fusion, 80–81, 109–10, 119–22, 120f, 255
lessons learned, 123–25
Collaborative Replications and Education Project (OSF), 136–37, 156t, 159–61,
165–66, 227, 254–55, 262–63
communication
digital, 204
online journalism, 177, 182
scientific (see scientific journals)
complementary and alternative medicine, 25
computational research, 245
number of scientific publications per year, 195–96, 195t
rules for improving reproducibility, 252–53
conceptual (aka differentiated, systematic) replications, 138–40, 143–47, 145f
conference abstracts, 22
conference presentations, 30
confirmation bias, 80–81, 100–1
conflicts of interest, 25–26, 48
Consolidated Standards of Reporting Trials (CONSORT), 47, 83, 224, 263
continuous, open peer review, 207
control procedures, 75, 77–78, 82, 109
correlation coefficients
high, in fMRI studies, 95–99
maximum possible, 95–96
voodoo correlations, 100, 101
Cortex, 31
COS (Center for Open Science), 155, 206, 268. See also Open Science Framework (OSF)
costs, 157–58, 243
Couture, Jessica, 248–50
COVID-19, vii
craniology, 80–81
credit, due, 59
Crick, Francis, 85
criminology trials, 26
crowdsourcing, 164
Cuddy, Amy, 178–83
culture, 6–7
current publication model, 196–97
damage control, 173
case study 1, 173–78
case study 2, 178–83
exemplary, 184
DARPA (US Defense Advanced Research Projects Agency), 170
data
alteration of, 68–69
availability of, 248–50
fabrication of, 57, 213–14
missing, 79
registration requirements for, 248
rules for improving manipulation of, 252
selective reporting of, 59
data analysis, 60–66, 254
databases
large, multivariate, 232
secondary analyses of, 101–2
data cleaning teams, 103
Data Coda, 181
data collection
deletion or addition of data, 68
missing data, 79
rules for, 64
undisclosed flexibility in, 60–66
data-dependent analysis, 91–92
data dredging, 81
data mining, 21, 44–45
data multiverse, 253
data sharing, 244–57
advantages of, 247
funder-imposed data publication requirements and, 248–50
guidelines for, 250
incentive for, 256–57
most likely reasons for those who refuse, 246
suggestions for improvement, 250, 251
data storage, 252, 266
data torturing, 81
decision tree approach, 232
DeHaven, Alexander, 230–34
Delorme, Arnaud, 118
design standards, 82
diagnostic studies, 263
Dickersin, Kay, 15
differentiated, systematic (conceptual) replications, 138–40
digital communication, 204, 211–12
Directory of Open Access Journals, 202
direct (aka close) replications, 135–38
advantages of, 139–40
combined with extension (aka conceptual) replication, 144–45, 145f
hypothetical examples, 143–47
Discover Magazine, 177
discovery research, 233
disinterest, 4
documentation, 266
of laboratory workflows, 218–19
personal records, 219
Dominus, Susan, 179, 181–82, 183
Doyen, Stéphane, 174–76
Dreber, Anna, 165–66
drug addiction research, 24
due credit, 59
publication bias in, 24, 25
epistemology, 134
EQUATOR (Enhancing the Quality and Transparency of Health Research), 263
errors, 209
opportunities for spotting, 211–12
statistical tools for finding, 212–14
Type I (see false-positive results)
ESP (extrasensory perception), 110, 118
ethical concerns, 28–29
ethos of science, 4–5
European Prospective Investigation into Cancer, 76
European Union, 195, 195t
EVOSTC (Exxon Valdez Oil Spill Trustee Council), 248–49
expectancy effects, 75
experimental conditions, 65
Experimental Design Assistant, 262
experimental design standards, 82
experimental economics, 165–66
experimental procedures, 77–78, 82
external validity, 3–6, 139
extrasensory perception (ESP), 110, 118
Exxon Valdez Oil Spill Trustee Council (EVOSTC), 248–49
Facebook, 182
false-negative results, 39, 41–42, 41t
false-positive psychology, 60–66
false-positive results
arguments against, 49
definition of, 39
detection of, 28
epidemics of, 53
genetic associations with general intelligence, 93–94
modeling, 39, 41t, 50t
probabilities of, 41–42, 41t
reason for, 43–48
simulations, 61
statistical constructs that contribute to, 39–40
Fanelli, Daniele, 20, 56, 128
Fat Studies, 199–200
feedback, 208
Ferguson, Cat, 200
Fetterman and Sassenberg survey (2015), 185–86
financial conflicts of interest, 25–26, 48
findings
irreproducible (see irreproducible findings)
negative, 18–19 (see also negative results)
outrageous, 126–27
true, 46
fiscal priorities, 72–73
Fisher, Ronald, 3
fishing, 81
Fleischmann, Martin, 119, 120–21
flexibility in data collection and analysis, undisclosed, 60–66
Food and Drug Administration (FDA), 152, 239–40
footnotes, 223
Forsell, Eskil, 165–66
Framingham Heart Study, 101–2
fraud, 26, 81, 125–26, 199–201
Freedman, Leonard, 157–58
Frey, Bruno, 207–8
Fujii, Yoshitaka, 81, 213–14
functional magnetic resonance imaging (fMRI) studies, 99–101, 255
high correlations in, 95–99
publication bias in, 24
funding, 30, 74, 97, 248–50
Fung, Kaiser, 181–82
future directions, 215–20, 264–69
Gould, Jay, 92
Greenwald, Anthony, 10, 52–54, 230
GRIM test, 213
Guidelines for Transparency and Openness Promotion (COS), 268
INA-Rxiv repository, 206
Incredibility-Index, 141
independent replications, 143–47
independent scientists, 266
information sharing, 220. See also data sharing
institutional animal care and use committees (IACUCs), 22, 241–44
institutionalization, 73
institutional review boards (IRBs), 22, 241–44
institutions
fiscal priorities, 72–73
inane institutional scientific policies (IISPs), 71–74
requirements for promotion, tenure, or salary increases, 74
insufficient variance: test of, 213
intelligence-genetic associations, 93–94
internal validity, 3–6
International Clinical Trials Registry Platform (WHO), 227
International Committee of Medical Journal Editors (ICMJE), 226, 236
investigators, 29. See also scientists
acculturation of, 72
disciplinary and methodological knowledge of, 72
educational interventions for, 261–64
how to encourage replications by, 147–48
mentoring of, 72, 263
reputation of, 183–84
Ioannidis, John P.A., 10, 43–48, 95, 181, 186–87, 210
IRBs (institutional review boards), 22, 241–44
irreproducible findings See also reproducibility crisis
approaches for identifying, 131
behavioral causes of, 7
case studies, 91
costs of, 157–58
damage control, 173
effectively irreproducible, 156–57
publication and, 18–21
QRP-driven, 91
scientific, 18–21
strategies for lowering, 222
warnings, 10
Journal of Articles in Support of the Null Hypothesis, 28
Journal of Experimental & Clinical Assisted Reproduction, 199
Journal of Experimental Psychology: Learning, Memory, and Cognition, 159
Journal of International Medical Research, 199
Journal of Natural Pharmaceuticals, 199
Journal of Negative Observations in Genetic Oncology, 28
Journal of Personality and Social Psychology, 52, 112, 114–15, 159
Journal of Pharmaceutical Negative Results, 28
Journal of the American Medical Association (JAMA), 25, 47, 213–14, 226
Journal of Visualized Experiments, 255
journals
backmatter volume lists, 223
control procedures, 236
current publication model, 196–97
devoted to nonsignificant results, 28
editors, 29, 215–16
letters to the editor, 211–12
medical, 226
open-access, 198–99
peer-reviewed, 193, 203–7
predatory (fake), 198–99, 201–2
published versus preregistration discrepancies, 234–35
requirements for publication in, 81, 216–20, 226
scientific, 193
Utopia I recommendations for, 203–7, 210
Linardatos, Eftihia, 240–41
Lind, James, 193
Loken, Eric, 91–92, 228
Loladze, Irakli, 265
longitudinal studies, 101–2, 232
meta-science, 21
microbiology, 20
Milgram, Stanley, 142
Mills, James, 228
misconduct, 26
prevalence of, 57
statistical tools for finding, 212–14
missing data, 79
MIT, 120–22
modeling false-positive results
advantages of, 49
examples, 42
Ioannidis exercise, 43–48
nontechnical overview, 39, 41t
modeling questionable research practice effects, 60–66
molecular biology-genetics, 20
morality, 129, 178–83
mortality, 104–6
MRI (magnetic resonance imaging) studies See functional magnetic resonance
imaging (fMRI) studies
Muennig, Peter, 104–6
multicenter trials, 104–6
multiexperiment studies, 112–13, 232–33
multiple sites, 164
multiple-study replications, 152, 156t, 168t
multivariate databases, large, 232
multiverse analyses, 253, 266
false-negative results, 39, 41–42, 41t
publishing, 18, 30
reasons for not publishing, 19
unacceptable (but perhaps understandable) reasons for not attempting to
publish, 19
Negative Results in Biomedicine, 28
Nelson, Leif D., 60–66, 115–17, 141–42, 180, 210
NeuroImage, 97
neuroimaging, 24
neuroscience-behavior, 20
New England Journal of Medicine (NEJM), 25, 47, 213–14, 215, 223, 226
New Negatives in Plant Science, 28
New York Times Magazine, 179, 181–82
NIH See National Institutes of Health
non-English publications, 25
non-random sampling, 213–14
nonsignificant results, 85
normal, bell-shaped curve, 40f, 40
Nosek, Brian, 31, 140, 159, 177, 183–84, 185–86, 198, 202–7, 211, 228–29,
230–34
not reporting details or results, 59
NSF See National Science Foundation
nuclear fusion, 119
null hypothesis
prejudice against, 52–54
probability level deemed most appropriate for rejecting, 53
statistical power deemed satisfactory for accepting, 53
Nutrition, Nurses’ Health Study, 76
obesity research, 24
O’Boyle, Ernest, Jr., 67–70
observational studies
large-scale, 103
publication bias in, 25
publishing guidelines for, 263
rules for reporting, 65
online science journalism, 177, 182
open-access publishing, 198–99, 204
Open Access Scholarly Publishers Association, 202
open peer review, continuous, 207
Open Science Framework (OSF), 134, 149, 230, 256, 268
cancer biology initiative, 169–70
Collaborative Replications and Education Project, 136–37, 156t, 159–61, 165–
66, 227, 254–55, 262–63
methodologies, 163–65
publishing guidelines, 263
OPERA (Oscillation Project with Emulsion-tRacking Apparatus), 127
operationalization, 138
Oransky, Ivan, 200, 209
orthodontics, 24
Oscillation Project with Emulsion-tRacking Apparatus (OPERA), 127
OSF See Open Science Framework
outcome variables, 75
palladium, 119
paradigmatic shift, 2
Parapsychological Association, 118
parapsychology, 112, 117–18
Park, Robert (Bob), 10, 99–100, 101, 119, 125–26
parsimony principle, 9–10
partial replications, 142–43
Pashler, Harold, 49, 95–99
pathological science, 109
criteria for, 123–25
examples, 111–12
lessons learned, 123–25
Payton, Antony, 93
p-curves, 181, 212
pediatric research, 24
PeerJ, 30
peer review, 28, 29–30, 197–208
as is, 207–8
continuous, open, 207
dark side of, 198–99
fake, 200
fake articles that get through, 198–200
flawed systems, 72
fraudulent, 200–1
guidelines for, 65–66
independent, 206
publishing, 207
publishing prior to, 205–6
shortcomings, 198–99
peer review aids, 87–88, 198
peer-reviewed journals, 193
problems bedeviling, 203
Utopia I recommendations for, 203–7, 210
peer reviewers, 211, 216
personality studies, 95–99
personal records, 219
Perspective on Psychological Science, 137, 148–49
P-hacker, 262
p-hacking, 80
pharmaceutical research
misconduct in, 57
publication bias in, 20, 24, 26, 240–41
physical sciences, 20, 25, 128
physics
cold fusion, 80–81, 109–10, 119–22, 120f, 255
number of scientific publications per year, 195–96, 195t
positive publishing, 20
replications, 127
Pichon, Cora-Lise, 174–76
pilot studies, 79, 218
Pineau, Joelle, 245
plagiarism checks, automated, 198
plant and animal research
data availability, 249
positive publishing, 20
PLoS Clinical Trials, 249
PLoS Medicine, 249
PLoS ONE, 114–15, 177, 199
PLoS ONE’s Positively Negative Collection, 28, 87
political behavior, 24
Pons, Stanley, 119, 120–22
Popham, James (Jim), 143
positive publishing, 20. See also publication bias
positive results, 7, 20–21, 22–23. See also publication bias
false-positive results (see false-positive results)
postdiction, 228–29
post hoc analysis, 127
power, statistical, 33, 39–40, 41
power analysis, 77, 87, 267
Preclinical Reproducibility and Robustness Gateway (Amgen), 28, 153–54, 156t,
169
preclinical research
approaches to replication, 156–57
economics of reproducibility in, 157–58
publication bias in, 24
replication initiatives, 152–58, 156t
predatory (fake) journals, 201–2
open-access journals, 198–99
strategies for identifying, 201–2
predatory publishers, 210
prediction markets, 165–66
prediction principle, 10
prejudice against null hypothesis, 52–54
preprint archives, 30, 205
preprint repositories, 30, 205–6
preregistration
advantages of, 241–42
of analysis plans, 103
benefits of, 238
challenges associated with, 231–33
checklist process, 244
of clinical trials, 226–28
control procedures, 236
current state, 230–44
functions of, 228
incentive for, 256–57
initiation of, 226–44
via IRBs and IACUCs, 241–44
and publication bias, 25, 238
purposes of, 225–26
Registered Reports, 31
of replications, 133
requirements for, 222, 227, 236–37, 241
revolution, 230–34
in social sciences, 228
of study protocols, 82, 117–18, 165–66, 225–26
suggestions for improving, 236–37, 244
preregistration repositories, 241–44
Presence (Cuddy), 179
press relations, 30
prevention trials, 26
probability
mean level most appropriate for rejecting null hypothesis, 53
of possibly study outcomes, 41–42, 41t
professional associations, 81
Project STAR, 105
promotions, 74
pseudoscientific professions, 73
psi, 112–14, 115–17, 126
PsyArXiv, 30
PsychDisclosure initiative, 87–88
Psychological Science, 87–88, 114–15, 159, 256–57
psychology, 112–18
cognitive, 160
data availability, 249
experimental, 159–65
false-positive results, 49, 52, 60–66
number of scientific publications per year, 195–96, 195t
publication bias, 20, 23, 24, 264–65
questionable research practices (QRPs), 57
Registered Reports, 31
replication failures, 115–17
replication initiatives, 156t
reproducibility, 49, 159–61
significance rates, 20
social, 160
publication bias, 11, 15, 53, 56
definition of, 15
documentation of, 21–26
effects of, 18
evidence for, 26–27
factors associated with, 25–26
future, 264
implications, 27
inane institutional scientific policy (IISP), 71–72
preregistration and, 25, 238
steps to reduce untoward effects of, 30
strategies helpful to reduce, 28–29
systematic review of, 26
topic areas affected, 23–24
vaccine against, 31–33
what’s to be done about, 28–30
Public Library of Science (PLoS), 199, 204
public opinion, 30
public relations
case study 1, 173–78
case study 2, 178–83
damage control, 173
exemplary, 184
publishers, 217–18
publishing, 2, 188
20th-century parable, 16–18
author’s responsibilities, 215
chrysalis effect in, 67–70
current model, 196–97
editors, 29, 215–16
funding agency requirements for, 248–50
guidelines for, 263
initiatives for improving, 7
limits on, 210
number of scientific publications per year, 194–96, 195t
open-access, 204
peer reviewers’ responsibilities, 216
positive, 20 (see also publication bias)
prerequisites for journal publication, 226
prior to peer review, 205–6
as reinforcement, 193–94
and reproducibility, 18–21, 193
requirements for journal articles, 81, 216–20
retractions, 199–200, 209
scope of, 194–96
selective publication, 240–41
statistical tools for finding errors and misconduct in, 212–14
suggestions for opening scientific communication, 203–7
Utopia I recommendations for, 203–7, 210
value of, 211–12
vision for, 216–20
word limits, 210
publishing negative results, 30
benefits of, 18
dedicated space to, 28
reasons for not publishing, 19
in traditional journals, 28
unacceptable (but perhaps understandable) reasons for not attempting to
publish, 19
publishing nonsignificant results
educational campaigns for, 28
journals devoted to, 28
publishing peer reviews, 207
publishing positive results See also publication bias
prevalence of, 20–21, 22–23
“publish or perish” adage, 193
PubMed Central (NIH), 204, 209
pulmonary and allergy trials, 26
p-values, 33, 42, 160, 181
under 0.05, 77
adjusted, 76
definition of, 39
inflated, 43
QRPs regarding, 61, 76–77
rounding down, 76–77
R-program (statcheck) for recalculating, 212
randomisation, 87
randomized controlled trials (RCTs), 46–47, 238
control procedures, 236
guidelines for, 83
multicenter, 104–6
non-random sampling in, 213–14
publication bias in, 24, 25
small class sizes and mortality in, 104–6
survival of conclusions from, 32–33
recordkeeping
laboratory workflow, 218–19
personal records, 219
registered replication reports (RRRs), 148–50
Registered Reports, 31–33
registration See also preregistration
of data, 248
registry requirements, 227, 239–40
regulatory proposals, 243–44
reliability, 161
replicability crisis See reproducibility crisis
replication failure, 109–10
case study 1, 173–78
case study 2, 178–83
damage control, 173
effects of, 185
exemplary response to, 184
rates of, 169
reasons for, 134
and reputation, 183–84, 185
strategies for speeding healing after, 186–87
replication studies
analytic, 135
conceptual (aka differentiated, systematic), 138–40, 143–47, 145f
design of, 133
direct (aka close), 135–38, 139–40, 143–47, 145f
Ebersole, Axt, and Nosek (2016) survey, 183–84, 185–86
exact, 135
exemplary response to, 184
extensions, 141–42, 143–47, 145f
Fetterman and Sassenberg survey (2015), 185–86
how to encourage, 147–48
hypothetical examples, 143–47
independent, 143–47
multiple-site, 164
multiple-study initiatives, 152, 156t, 168t
need for, 126–27
of outrageous findings, 126–27
partial, 142–43
pathological science with, 109
preregistered, 133
process, 7, 133
recommendations for, 134
Registered Reports, 31, 148–50
requirements for, 133–34
results, 168, 168t
self-replications, 143–47
survey approach to tracking, 165
reporting not quite QRP practices, 85
repository(-ies)
preprint, 30, 205–6
preregistration, 241–44
reproducibility, 5–6, 265. See also irreproducible findings
arguments for ensuring, 251
author’s responsibilities for ensuring, 215
causes of, 157–58
economics of, 157–58
editor’s responsibilities for ensuring, 215–16
educational interventions for increasing, 261–64
estimation of, 159–61
future directions, 264–65
peer reviewers’ responsibilities for ensuring, 216
preclinical, 157–58
of psychological science, 159–61
publishing issues and, 193
rules for improving, 252–53
screening for, 266
strategies for increasing, 191
value of, 6
vision to improve, 216–20
z-curve analysis for, 213
reproducibility crisis, vii, 1, 261
arguments against, 49
background and facilitators, 13
strategies for decreasing, 191
reproduction See also replication studies
analytic, 245
reproductive medicine, 24
reputation
Ebersole, Axt, and Nosek (2016) survey, 183–84, 185–86
Fetterman and Sassenberg survey (2015), 185–86
“publish or perish” adage and, 193
replication failure and, 183–84, 185
research See also specific disciplines
20th-century parable, 16–18
chrysalis effect in, 67–70
design standards, 82
discovery, 233
educational interventions for, 261–64
funding, 30, 74, 97
glitches and weaknesses in, 75
longitudinal studies, 101–2, 232
misconduct, 57
multiexperiment studies, 112–13
negative studies, 18–19
not quite QRPs but definitely irritating, 85
pathological, 109
post hoc deletion of participants or animals, 78–79
preregistration of (see preregistration)
programs, 233
protocols, 117–18, 225–26
questionable practices (see questionable research practices [QRPs])
replication (see replication studies)
standards, 81
statistical tools for finding errors and misconduct in, 212–14
study limitations, 85
suggestions for improvement, 64–66
research findings
irreproducible (see irreproducible findings)
negative, 18–19 (see also negative results)
outrageous, 126–27
true, 46
research publishing See publishing
research results See results
Resource Identification Portal (RRID), 268
respectability, academic, 114
results
false-negative, 39, 41–42, 41t
false-positive (see false-positive results)
hyping, 85
irreproducible (see irreproducible findings)
negative (see negative results)
nonsignificant, 85
positive, 7, 20–21, 22–23 (see also publication bias)
reproducibility of (see reproducibility)
retractions of, 209, 213–14
rules for improving, 252
selective reporting of, 75–76
spinning, 85
retractions, 209, 213–14
Retraction Watch, 200, 209
Rhine, Joseph Banks, 110–11
Rosenthal, Robert, 29
R-program (statcheck), 212–13
RRID (Resource Identification Portal), 268
RRRs (registered replication reports), 148–50
SAS, 212–13
Sato, Yoshihiro, 213–14
Schachner, Adena, 165
Schimmack, Ulrich, 141–42, 213
Schlitz, Marilyn, 118
ScholarOne, 198, 212
scholarship advertising, 245, 246
Schooler, Jonathan, 137
science, 1, 7–8. See also specific fields
ethos of, 4–5
hierarchy of, 128
pathological, 109
snake oil, 119, 121–22
voodoo, 10, 100, 101, 119
Science, 97, 114–15, 159, 166, 198–99, 249–50
Science and Technology Indicators (NSF), 194–95
science journalism, 177, 182
scientific journals, 114, 193. See also specific journals
current publication model, 196–97
library subscription rates, 196–97
number of publications per year, 194–96, 195t
Utopia I recommendations for, 203–7, 210
scientific publishing See publishing
scientific results See results
scientists See also investigators
independent, 266
peer reviewers, 211
“publish or perish” adage and, 193
reputation of, 183–84
word limits, 210
SCIgen, 199–200
SciScore, 267
scurvy, 193
secondary analyses, 101–2
self-replications, 143–47
sensationalism, 97
sharing data, 244–57
sharing information, 220
sharing materials, 254–55
incentive for, 256–57
suggestions for, 255
significance, statistical, 33
Simcoe, Timothy, 157–58
Simmons, Joseph P., 60–66, 115–17, 141–42, 180, 183, 210
Simonsohn, Uri, 60–66, 141–42, 180, 183, 210
simulations, 212, 250
single-nucleotide polymorphisms (SNPs), 44–45, 93
skepticism, 4
Slate Magazine, 116, 181–82
snake oil science, 119, 121–22
SNPs (single-nucleotide polymorphisms), 44–45, 93
SocArXiv, 30
Social Cognitive and Affective Neuroscience, 97
social media, 182
Social Neuroscience, 97
social psychology, 160
social sciences, 2, 127
fMRI studies, 95–99
general studies, 166–67
number of scientific publications per year, 195–96, 195t
positive publishing, 20
preregistration, 228
publication bias, 24, 25
registered studies, 227
replication initiatives, 156t
Social Text, 199–200
soft sciences, 128
Sokal, Alan D., 199–200
specification-curve analysis, 254
spinning results, 85
Springer, 200
SSRN, 30
staff supervision, 81
standards, 81
Stanley, Julian C., 3, 4, 5
Stapel, Diederik, 213–14
STAR (Structured, Transparent, Accessible Reporting) system, 268
Stata, 212–13
statcheck (R-program), 212
statistical analysis, 76–77, 212–13
statistical power, 33, 39–40, 41
definition of, 39–40
QRPs regarding, 77
satisfactory for null hypothesis, 53
statistical significance, 33, 39
artifactual, 76
comparisons that don’t hold up, 91–92
definition of, 39
example, 42
QRPs regarding, 77–78, 79–80
statistical tools, 212–14
StatReviewer, 198, 212
Steinbach, John, 7
Sterling, T. D., 21–22
stroke research, 24, 26
Structured, Transparent, Accessible Reporting (STAR) system, 268
subjective timing, 176
supervision, staff, 81
surveys
comparisons between, 59
for tracking replications, 165
systematic, differentiated (conceptual) replications, 138–40
systematic bias, 41–42, 41t
systematic reviews, 263
publication bias in, 25
universalism, 4
universal requirements, 222
University of Utah, 119, 120
US Defense Advanced Research Projects Agency (DARPA), 170
Utah, 109–10
validity
external, 3–6, 139
internal, 3–6
variables, 69, 75
variance, insufficient, 213
Vaux, David, 76–77
Vees, Matthew, 184, 186
verbal overshadowing, 137
version control, 252
Vickers, Andrew, vii
vision for publishing, 216–20
volumetric pixels (voxels), 98
voodoo science, 10, 100, 101, 119
voxels (volumetric pixels), 98
Vul, Edward, 95–99
Xi Jinping, vii
Yong, Ed, 114–15, 140, 177
Young, Stanley, 103
YouTube, 179