Cochrane Handbook for Systematic Reviews of Interventions
Second Edition

Contents
Contributors xiii
Preface xxiii
1 Starting a review 3
1.1 Why do a systematic review? 3
1.2 What is the review question? 4
1.3 Who should do a systematic review? 5
1.4 The importance of reliability 7
1.5 Protocol development 8
1.6 Data management and quality assurance 11
1.7 Chapter information 12
1.8 References 12
2 Determining the scope of the review and the questions it will address 13
2.1 Rationale for well-formulated questions 13
2.2 Aims of reviews of interventions 15
2.3 Defining the scope of a review question 16
2.4 Ensuring the review addresses the right questions 21
2.5 Methods and tools for structuring the review 24
2.6 Chapter information 29
2.7 References 29
3 Defining the criteria for including studies and how they will be grouped for the synthesis 33
3.1 Introduction 33
3.2 Articulating the review and comparison PICO 35
3.3 Determining which study designs to include 51
3.4 Eligibility based on publication status and language 60
3.5 Chapter information 61
3.6 References 61
7 Considering bias and conflicts of interest among the included studies 177
7.1 Introduction 177
7.2 Empirical evidence of bias 180
7.3 General procedures for risk-of-bias assessment 185
7.4 Presentation of assessment of risk of bias 188
7.5 Summary assessments of risk of bias 188
7.6 Incorporating assessment of risk of bias into analyses 190
7.7 Considering risk of bias due to missing results 192
7.8 Considering source of funding and conflict of interest of authors of included studies 193
7.9 Chapter information 199
7.10 References 199
21.2 Designs for synthesizing and integrating qualitative evidence with intervention reviews 526
21.3 Defining qualitative evidence and studies 527
21.4 Planning a qualitative evidence synthesis linked to an intervention review 528
21.5 Question development 529
21.6 Questions exploring intervention implementation 530
21.7 Searching for qualitative evidence 531
21.8 Assessing methodological strengths and limitations of qualitative studies 532
21.9 Selecting studies to synthesize 533
21.10 Selecting a qualitative evidence synthesis and data extraction method 534
21.11 Data extraction 534
21.12 Assessing the confidence in qualitative synthesized findings 537
21.13 Methods for integrating the qualitative evidence synthesis with an intervention review 537
21.14 Reporting the protocol and qualitative evidence synthesis 538
21.15 Chapter information 539
21.16 References 539
Index 659
Contributors
Drummond, Michael
Centre for Health Economics
University of York
York
UK

Elbers, Roy G
Population Health Sciences
Bristol Medical School
University of Bristol
Bristol
UK

El Dib, Regina
Institute of Science and Technology
UNESP - Univ Estadual Paulista
São José dos Campos
Brazil

Eldridge, Sandra
Centre for Primary Care and Public Health
Blizard Institute
Barts and The London School of Medicine and Dentistry
Queen Mary University of London
London
UK

Garside, Ruth
European Centre for Environment and Human Health
University of Exeter Medical School
University of Exeter
Truro
UK

Ghersi, Davina
Research Policy and Translation
National Health and Medical Research Council
Canberra
Australia

Glanville, Julie
York Health Economics Consortium
York
UK

Glasziou, Paul
Institute for Evidence-Based Healthcare
Bond University
Queensland
Australia

Golder, Su
Department of Health Sciences
University of York
York
UK
Ostelo, Raymond W
Department of Epidemiology and Biostatistics
Amsterdam UMC
Vrije Universiteit Amsterdam
Amsterdam
The Netherlands

Page, Matthew J
School of Public Health and Preventive Medicine
Monash University
Melbourne
Australia

Pantoja, Tomás
Department of Family Medicine
Faculty of Medicine
Pontificia Universidad Católica de Chile
Santiago
Chile

Pardo Pardo, Jordi
Cochrane Musculoskeletal Group
University of Ottawa
Ottawa, Ontario
Canada

Patrick, Donald L
Health Services and Epidemiology
University of Washington
Seattle, WA
USA

Peryer, Guy
University of East Anglia
Norwich
UK

Petticrew, Mark
Faculty of Public Health and Policy
London School of Hygiene and Tropical Medicine
London
UK

Rader, Tamara
Evidence Standards
Canadian Agency for Drugs and Technologies in Health
Ottawa, Ontario
Canada

Reeves, Barnaby C
Translational Health Sciences
Bristol Medical School
University of Bristol
Bristol
UK

Rehfuess, Eva
Pettenkofer School of Public Health
Institute for Medical Information Processing, Biometry and Epidemiology
LMU Munich
Munich
Germany

Robalino, Shannon
Institute of Health and Society
Newcastle University
Newcastle upon Tyne
UK
Santesso, Nancy
Department of Health Research Methods, Evidence and Impact (HEI)
McMaster University
Hamilton, Ontario
Canada

Savović, Jelena
Population Health Sciences
Bristol Medical School
University of Bristol
Bristol
UK

Schünemann, Holger J
Departments of Health Research Methods, Evidence, and Impact (HEI) and of Medicine
McMaster University
Hamilton, Ontario
Canada

Shea, Beverley
Department of Medicine, Ottawa Hospital Research Institute;
School of Epidemiology and Public Health, University of Ottawa
Ottawa, Ontario
Canada

Simmonds, Mark
Centre for Reviews and Dissemination
University of York
York
UK

Singh, Jasvinder
School of Medicine
University of Alabama at Birmingham
Birmingham, AL
USA

Skoetz, Nicole
Department of Internal Medicine
University Hospital of Cologne
Cologne
Germany

Sterne, Jonathan AC
Population Health Sciences
Bristol Medical School
University of Bristol
Bristol
UK

Stewart, Lesley A
Centre for Reviews and Dissemination
University of York
York
UK
Wieland, L Susan
Center for Integrative Medicine
Department of Family and Community Medicine
University of Maryland School of Medicine
Baltimore, MD
USA

Williams, Katrina
Department of Paediatrics, Monash University;
Developmental Paediatrics, Monash Children’s Hospital;
Neurodisability and Rehabilitation, Murdoch Children’s Research Institute
Melbourne
Australia

Young, Camilla
Institute of Cardiovascular and Medical Sciences
University of Glasgow
Glasgow
UK
Preface
‘First, do no harm’ is a principle to which those who would intervene in the lives of other
people are often called to subscribe. However, in this era of data deluge, it is not possible
for individual decision makers to ensure that their decisions are informed by the latest
reliable research knowledge; and without reliable information to guide them, they can
cause harm, even though their intentions may be good. This is the core problem that
the founder of Cochrane, Sir Iain Chalmers, aimed to address through the provision of
systematic reviews of reliable research.
By synthesizing the results of individual studies, systematic reviews present a summary
of all the available evidence to answer a question, and in doing so can uncover
important knowledge about the effects of healthcare interventions. Systematic reviews
undertaken by Cochrane (Cochrane Reviews) present reliable syntheses of the results
of multiple studies, alongside an assessment of the possibility of bias in the results,
contextual factors influencing the interpretation and applicability of results, and other
elements that can affect certainty in decision making. They reduce the time wasted by
individuals searching for and appraising the same studies, and also aim to reduce
research waste by ensuring that future studies can build on the body of studies already
completed.
A systematic review attempts to collate all empirical evidence that fits pre-specified
eligibility criteria in order to answer a specific research question. It uses explicit, sys-
tematic methods that are selected with a view to minimizing bias, thus providing
more reliable findings from which conclusions can be drawn and decisions made.
The key characteristics of a systematic review are:
• a clearly stated set of objectives with pre-defined eligibility criteria for studies;
• an explicit, reproducible methodology;
• a systematic search that attempts to identify all studies that meet the eligibility
criteria;
• an assessment of the validity of the findings of the included studies, for example
through the assessment of risk of bias; and
• a systematic presentation, and synthesis, of the characteristics and findings of the
included studies.
For twenty-five years, Cochrane Reviews have supported people making healthcare
decisions, whether they are health professionals, managers, policy makers, or individuals
making choices for themselves and their families. The Cochrane Handbook for
About Cochrane
Cochrane is a global network of health practitioners, researchers, patient advocates
and others, with a mission to promote evidence-informed health decision making by
producing high quality, relevant, accessible systematic reviews and other synthesized
research evidence (www.cochrane.org). Founded as The Cochrane Collaboration in
1993, it is a not-for-profit organization whose members aim to produce credible, accessible
health information that is free from commercial sponsorship and other conflicts of
interest.
Cochrane works collaboratively with health professionals, policy makers and international
organizations such as the World Health Organization (WHO) to support the
development of evidence-informed guidelines and policy. WHO guidelines on critical
public health issues such as breastfeeding (2017) and malaria (2015), and the WHO
Essential Medicines List (2017) are underpinned by dozens of Cochrane Reviews.
There are many examples of the impact of Cochrane Reviews on health and health
care. Influential reviews of corticosteroids for women at risk of giving birth prematurely,
treatments for macular degeneration and tranexamic acid for trauma patients with
bleeding have demonstrated the effectiveness of these life-changing interventions
and influenced clinical practice around the world. Other reviews of anti-arrhythmic
drugs for atrial fibrillation and neuraminidase inhibitors for influenza have raised
important doubts about the effectiveness of interventions in common use.
Cochrane Reviews are published in full online in the Cochrane Database of Systematic
Reviews, which is a core component of the Cochrane Library (www.thecochranelibrary.com).
The Cochrane Library was first published in 1996, and is now an online collection
of multiple databases.
underpinning the methods we use. Just as there are consequences arising from the
choices we make about health and social care interventions, so too are there conse-
quences when we choose the methods to use in systematic reviews.” (McKenzie
et al, Cochrane Database of Systematic Reviews 2015; 7: ED00010)
With this in mind, the guidance in this Handbook has been written by authors who
are international leaders in their fields, many of whom are supported by the work of
Cochrane Methods Groups.
makers and healthcare professionals to understand from them the most important
uncertainties or information gaps. Since its inception, Cochrane has advocated for routine
updating of systematic reviews to take account of new evidence. In some fast-moving
topics, frequent updating is needed to ensure that review conclusions remain
relevant.
While some authors new to Cochrane Reviews have training and experience in conducting
other systematic reviews, many do not. Training for review authors is delivered
in many countries by regional Cochrane groups or by the Cochrane Methods Groups
responsible for researching and developing the methods used on Cochrane Reviews.
In addition, Cochrane produces an extensive range of online learning resources.
Detailed information is available via https://training.cochrane.org. Training materials
and opportunities for training are continually developed and updated to reflect the
evolving Cochrane methods and the needs of contributors.
In particular, this edition incorporates the following major new chapters and areas of
guidance:
• Expanded advice on assessing the risk of bias in included studies (Chapter 7), including
Version 2 of the Cochrane Risk of Bias tool (Chapter 8) and the ROBINS-I tool for
assessing risk of bias in non-randomized studies (Chapter 25).
• New guidance on summarizing study characteristics and preparing for synthesis
(Chapters 3 and 9).
• New guidance on network meta-analysis (Chapter 11).
• New guidance on synthesizing results using methods other than meta-analysis (Chapter 12).
• Updated guidance on assessing the risk of bias due to missing results (reporting
biases, Chapter 13).
• New guidance addressing intervention complexity (Chapter 17).
Acknowledgements
We thank all of our contributing authors and chapter editors for their patience and
responsiveness in preparing this Handbook. We are also indebted to all those who
have contributed to previous versions of the Handbook, and particularly to past
editors Rachel Churchill, Sally Green, Phil Alderson, Mike Clarke, Cynthia Mulrow and
Andy Oxman.
Many contributed constructive and timely peer review for this edition. We thank
Zhenggang Bai, Hilda Bastian, Jesse Berlin, Lisa Bero, Jane Blazeby, Jacob Burns, Chris
Cates, Nathorn Chaiyakunapruk, Kay Dickersin, Christopher Eccleston, Sam Egger,
Cindy Farquhar, Nicole Fusco, Hernando Guillermo Gaitán Duarte, Paul Garner, Claire
Glenton, Su Golder, Helen Handoll, Jamie Hartmann-Boyce, Joseph Lau, Simon Lewin,
Jane Marjoribanks, Evan Mayo-Wilson, Steve McDonald, Emma Mead, Richard Morley,
Sylivia Nalubega, Gerry Richardson, Richard Riley, Elham Shakibazadeh, Dayane
Silveira, Jonathan Sterne, Alex Sutton, Özge Tunçalp, Peter von Philipsborn, Evelyn
Whitlock, Jack Wilkinson. We thank Tamara Lotfi from the Secretariat for the Global
Evidence Synthesis Initiative (GESI) and GESI for assisting with identifying peer referees,
and Paul Garner and Taryn Young for their liaison with the Learning Initiative for
eXperienced Authors (LIXA).
Specific administrative support for this version of the Handbook was provided by
Laura Mellor, and we are deeply indebted to Laura for her many contributions. We
would also like to thank staff at Wiley for their patience, support and advice, including
Priyanka Gibbons (Commissioning Editor), Jennifer Seward (Senior Project Editor),
Deirdre Barry (Senior Editorial Assistant) and Tom Bates (Senior Production Editor).
We thank Ella Flemyng for her assistance and Elizabeth Royle and Jenny Bellorini at
Cochrane Copy Edit Support for their assistance in copy editing some chapters of
the Handbook. Finally, we thank Jan East for copy editing the whole volume, and
Nik Prowse for project management.
This Handbook would not have been possible without the generous support provided
to the editors by colleagues at the University of Bristol, University College London,
Johns Hopkins Bloomberg School of Public Health, Cochrane Australia at Monash
University, and the Cochrane Editorial and Methods Department at Cochrane Central
Executive. We particularly thank David Tovey (former Editor in Chief, The Cochrane
Library), and acknowledge Cochrane staff Madeleine Hill for editorial support, and
Jo Anthony and Holly Millward for contributing to the cover design.
Finally, the Editors would like to thank the thousands of Cochrane authors who
volunteer their time to collate evidence for people making decisions about health care,
and the methodologists, editors and trainers who support them.
James Thomas (Senior Editor) is Professor of Social Research & Policy, and Associate
Director of the EPPI-Centre at UCL, London, UK.
Vivian A. Welch (Associate Scientific Editor) is Editor in Chief of the Campbell Collaboration,
Scientist at Bruyère Research Institute, Ottawa, Canada; Associate Professor at
the School of Epidemiology and Public Health, University of Ottawa, Canada.
Part One
Core methods
1 Starting a review
Toby J Lasserson, James Thomas, Julian PT Higgins
KEY POINTS

• Systematic reviews address a need for health decision makers to be able to access high quality, relevant, accessible and up-to-date information.
• Systematic reviews aim to minimize bias through the use of pre-specified research questions and methods that are documented in protocols, and by basing their findings on reliable research.
• Systematic reviews should be conducted by a team that includes domain expertise and methodological expertise, who are free of potential conflicts of interest.
• People who might make – or be affected by – decisions around the use of interventions should be involved in important decisions about the review.
• Good data management, project management and quality assurance mechanisms are essential for the completion of a successful systematic review.
This chapter should be cited as: Lasserson TJ, Thomas J, Higgins JPT. Chapter 1: Starting a review.
In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook
for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 3–12.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
A systematic review attempts to collate all the empirical evidence that fits
pre-specified eligibility criteria in order to answer a specific research question. It uses
explicit, systematic methods that are selected with a view to minimizing bias, thus
providing more reliable findings from which conclusions can be drawn and decisions
made (Antman et al 1992, Oxman and Guyatt 1993). Systematic review methodology,
pioneered and developed by Cochrane, sets out a highly structured, transparent and
reproducible methodology (Chandler and Hopewell 2013). This involves: the a priori
specification of a research question; clarity on the scope of the review and which
studies are eligible for inclusion; making every effort to find all relevant research
and to ensure that issues of bias in included studies are accounted for; and analysing
the included studies in order to draw conclusions based on all the identified research in
an impartial and objective way.
This Handbook is about systematic reviews on the effects of interventions, and
specifically about methods used by Cochrane to undertake them. Cochrane Reviews
use primary research to generate new knowledge about the effects of an intervention
(or interventions) used in clinical, public health or policy settings. They aim to provide
users with a balanced summary of the potential benefits and harms of interventions
and give an indication of how certain they can be of the findings. They can also
compare the effectiveness of different interventions with one another and so help users
to choose the most appropriate intervention in particular situations. The primary
purpose of Cochrane Reviews is therefore to inform people making decisions about
health or health care.
Systematic reviews are important for other reasons. New research should be
designed or commissioned only if it does not unnecessarily duplicate existing research
(Chalmers et al 2014). Therefore, a systematic review should typically be undertaken
before embarking on new primary research. Such a review will identify current and
ongoing studies, as well as indicate where specific gaps in knowledge exist, or evidence
is lacking; for example, where existing studies have not used outcomes that are
important to users of research (Macleod et al 2014). A systematic review may also reveal
limitations in the conduct of previous studies that might be addressed in the new study
or studies.
Systematic reviews are important, often rewarding and, at times, exciting research
projects. They offer the opportunity for authors to make authoritative statements
about the extent of human knowledge in important areas and to identify priorities
for further research. They sometimes cover issues high on the political agenda and
receive attention from the media. Conducting research with these impacts is not
without its challenges, however, and completing a high-quality systematic review is
often demanding and time-consuming. In this chapter we introduce some of the key
considerations for review authors who are about to start a systematic review.
• The review team should co-ordinate the input of the advisory group to inform key
review decisions.
• The advisory group’s input should continue throughout the systematic review
process to ensure relevance of the review to end users is maintained.
• Advisory group membership should reflect the breadth of the review question, and
consideration should be given to involving vulnerable and marginalized people
(Steel 2004) to ensure that conclusions on the value of the interventions are
well-informed and applicable to all groups in society (see Chapter 16).
may be used as a basis for deciding not to publish a review in the Cochrane Database of
Systematic Reviews (CDSR). Items described as highly desirable should generally be
implemented, but there are reasonable exceptions and justifications are not required.
All MECIR expectations for the conduct of a review are presented in the relevant
chapters of this Handbook. Expectations for reporting of completed reviews (including
PLEACS) are described in online Chapter III. The recommendations provided in the
Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)
Statement have been incorporated into the Cochrane reporting expectations, ensuring
compliance with the PRISMA recommendations and summarizing attributes of
reporting that should allow a full assessment of the methods and findings of the review
(Moher et al 2009).
MECIR Box 1.5.b Relevant expectations for the conduct of intervention reviews

C20: Planning the assessment of risk of bias in included studies (Mandatory)
Plan in advance the methods to be used for assessing risk of bias in included studies, including the tool(s) to be used, how the tool(s) will be implemented, and the criteria used to assign studies, for example, to judgements of low risk, high risk and unclear risk of bias.
Rationale: Predefining the methods and criteria for assessing risk of bias is important since analysis or interpretation of the review findings may be affected by the judgements made during this process. For randomized trials, use of the Cochrane risk-of-bias tool is Mandatory, so it is sufficient (and easiest) simply to refer to the definitions of low risk, unclear risk and high risk of bias provided in the Handbook.
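To make the idea of pre-specification concrete, a risk-of-bias plan can be drafted as structured data in the protocol before any study is assessed. The Python sketch below is illustrative only: the field names and values describe a hypothetical review and are our own assumptions, not a Cochrane schema or tool.

# Illustrative sketch of a pre-specified risk-of-bias plan (cf. MECIR item C20).
# All keys and values are assumptions for a hypothetical review protocol.
risk_of_bias_plan = {
    "tool": "Cochrane risk-of-bias tool",
    "applied_to": "each included randomized trial",
    "independent_assessors": 2,
    "judgement_levels": ["low risk", "unclear risk", "high risk"],
    "disagreement_resolution": "discussion, then arbitration by a third author",
    "use_in_analysis": "sensitivity analysis restricted to studies at low risk of bias",
}

def missing_plan_items(plan):
    """Return any pre-specified elements absent from the plan (a simple QA check)."""
    required = ("tool", "applied_to", "judgement_levels")
    return [item for item in required if item not in plan]

assert missing_plan_items(risk_of_bias_plan) == []

Recording the plan in this form makes it easy to confirm, at the protocol stage, that nothing required has been left unspecified.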
Publication of a protocol for a review that is written without knowledge of the available
studies reduces the impact of review authors’ biases, promotes transparency of
methods and processes, reduces the potential for duplication, allows peer review of
the planned methods before they have been completed, and offers an opportunity
for the review team to plan resources and logistics for undertaking the review itself.
All chapters in the Handbook should be consulted when drafting the protocol. Since
systematic reviews are by their nature retrospective, an element of knowledge of
the evidence is often inevitable. This is one reason why non-content experts such as
methodologists should be part of the review team (see Section 1.3). Two exceptions
to the retrospective nature of a systematic review are a meta-analysis of a prospectively
planned series of trials and some living systematic reviews, as described in Chapter 22.
The review question should determine the methods used in the review, and not vice
versa. The question may concern a relatively straightforward comparison of one treatment
with another; or it may necessitate plans to compare different treatments as part
of a network meta-analysis, or assess differential effects of an intervention in different
populations or delivered in different ways.
The protocol sets out the context in which the review is being conducted. It presents
an opportunity to develop ideas that are foundational for the review. This concerns,
most explicitly, definition of the eligibility criteria such as the study participants and
the choice of comparators and outcomes. The eligibility criteria may also be defined
following the development of a logic model (or an articulation of the aspects of an
extant logic model that the review is addressing) to explain how the intervention might
work (see Chapter 2, Section 2.5.1).
A key purpose of the protocol is to make plans to minimize bias in the eventual findings
of the review. Reliable synthesis of available evidence requires a planned, systematic
approach. Threats to the validity of systematic reviews can come from the studies
they include or the process by which reviews are conducted. Biases within the studies
can arise from the method by which participants are allocated to the intervention
groups, awareness of intervention group assignment, and the collection, analysis
and reporting of data. Methods for examining these issues should be specified in
the protocol. Review processes can generate bias through a failure to identify an unbiased
(and preferably complete) set of studies, and poor quality assurance throughout
the review. The availability of research may be influenced by the nature of the results
(i.e. reporting bias). To reduce the impact of this form of bias, searching may need to
include unpublished sources of evidence (Dwan et al 2013) (MECIR Box 1.5.b).
Developing a protocol for a systematic review has benefits beyond reducing bias.
Investing effort in designing a systematic review will make the process more manageable
and help to inform key priorities for the review. Defining the question, referring to
it throughout, and using appropriate methods to address the question focuses the
analysis and reporting, ensuring the review is most likely to inform treatment decisions
for funders, policy makers, healthcare professionals and consumers. Details of the
planned analyses, including investigations of variability across studies, should be specified
in the protocol, along with methods for interpreting the results through the systematic
consideration of factors that affect confidence in estimates of intervention
effect (MECIR Box 1.5.c).
While the intention should be that a review will adhere to the published protocol,
changes in a review protocol are sometimes necessary. This is also the case for a
Funding: JT is supported by the National Institute for Health Research (NIHR) Collaboration
for Leadership in Applied Health Research and Care North Thames at Barts
Health NHS Trust. JPTH is a member of the NIHR Biomedical Research Centre at University
Hospitals Bristol NHS Foundation Trust and the University of Bristol. JPTH
received funding from National Institute for Health Research Senior Investigator award
NF-SI-0617-10145. The views expressed are those of the author(s) and not necessarily
those of the NHS, the NIHR or the Department of Health.
1.8 References
Antman E, Lau J, Kupelnick B, Mosteller F, Chalmers T. A comparison of results of meta-
analyses of randomized control trials and recommendations of clinical experts: treatment
for myocardial infarction. JAMA 1992; 268: 240–248.
Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gulmezoglu AM, Howells DW,
Ioannidis JP, Oliver S. How to increase value and reduce waste when research priorities
are set. Lancet 2014; 383: 156–165.
Chandler J, Hopewell S. Cochrane methods – twenty years experience in developing
systematic review methods. Systematic Reviews 2013; 2: 76.
Dwan K, Gamble C, Williamson PR, Kirkham JJ, Reporting Bias Group. Systematic review of
the empirical evidence of study publication bias and outcome reporting bias: an updated
review. PloS One 2013; 8: e66844.
Gøtzsche PC, Ioannidis JPA. Content area experts as authors: helpful or harmful for
systematic reviews and meta-analyses? BMJ 2012; 345.
Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JP, Al-Shahi Salman R,
Chan AW, Glasziou P. Biomedical research: increasing value, reducing waste. Lancet 2014;
383: 101–104.
Moher D, Liberati A, Tetzlaff J, Altman D, PRISMA Group. Preferred reporting items for
systematic reviews and meta-analyses: the PRISMA statement. PLoS Medicine 2009; 6:
e1000097.
Oxman A, Guyatt G. The science of reviewing research. Annals of the New York Academy of
Sciences 1993; 703: 125–133.
Rees R, Oliver S. Stakeholder perspectives and participation in reviews. In: Gough D, Oliver S,
Thomas J, editors. An Introduction to Systematic Reviews. 2nd ed. London: Sage; 2017.
p. 17–34.
Steel R. Involving marginalised and vulnerable people in research: a consultation document
(2nd revision). INVOLVE; 2004.
Thomas J, Harden A, Oakley A, Oliver S, Sutcliffe K, Rees R, Brunton G, Kavanagh J.
Integrating qualitative research with trials in systematic reviews. BMJ 2004; 328:
1010–1012.
2 Determining the scope of the review and the questions it will address
James Thomas, Dylan Kneale, Joanne E McKenzie, Sue E Brennan,
Soumyadeep Bhaumik
KEY POINTS

• Systematic reviews should address answerable questions and fill important gaps in knowledge.
• Developing good review questions takes time, expertise and engagement with intended users of the review.
• Cochrane Reviews can focus on broad questions, or be more narrowly defined. There are advantages and disadvantages of each.
• Logic models are a way of documenting how interventions, particularly complex interventions, are intended to ‘work’, and can be used to refine review questions and the broader scope of the review.
• Using priority-setting exercises, involving relevant stakeholders, and ensuring that the review takes account of issues relating to equity can be strategies for ensuring that the scope and focus of reviews address the right questions.
This chapter should be cited as: Thomas J, Kneale D, McKenzie JE, Brennan SE, Bhaumik S. Chapter 2:
Determining the scope of the review and the questions it will address. In: Higgins JPT, Thomas J, Chandler J,
Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions.
2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 13–32.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
studies, structuring the syntheses and presenting findings (Cooper 1984, Hedges 1994,
Oliver et al 2017). In Cochrane Reviews, questions are stated broadly as review ‘Objectives’,
and operationalized in terms of the studies that will be eligible to answer those
questions as ‘Criteria for considering studies for this review’. As well as focusing review
conduct, the contents of these sections are used by readers in their initial assessments
of whether the review is likely to be directly relevant to the issues they face.
The FINER criteria have been proposed as encapsulating the issues that should be
addressed when developing research questions. These state that questions should
be Feasible, Interesting, Novel, Ethical, and Relevant (Cummings et al 2007). All of these
criteria raise important issues for consideration at the outset of a review and should be
borne in mind when questions are formulated.
A feasible review is one that asks a question that the author team is capable of
addressing using the evidence available. Issues concerning the breadth of a review
are discussed in Section 2.3.1, but in terms of feasibility it is important not to ask a
the relationship between the size of an intervention effect and other characteristics,
such as aspects of the population, the intervention itself, how the outcome is measured,
or the methodology of the primary research studies included. Such approaches
might be used to investigate which components of multi-component interventions are
more or less important or essential (and when). While it is not always necessary to know
how an intervention achieves its effect for it to be useful, many reviews will aim to articulate
an intervention’s mechanisms of action (see Section 2.5.1), either by making this
an explicit aim of the review itself (see Chapters 17 and 21), or when describing the
scope of the review. Understanding how an intervention works (or is intended to work)
can be an important aid to decision makers in assessing the applicability of the review
to their situation. These investigations can be assisted by the incorporation of results
from process evaluations conducted alongside trials (see Chapter 21). Further, many
decisions in policy and practice are at least partially constrained by the resources available,
so review authors often need to consider the economic context of interventions
(see Chapter 20).
might be followed by one or more secondary objectives, for example relating to different
participant groups, different comparisons of interventions or different outcome
measures. The detailed specification of the review question(s) requires consideration
of several key components (Richardson et al 1995, Counsell 1997) which can often be
encapsulated by the ‘PICO’ mnemonic, an acronym for Population, Intervention,
Comparison(s) and Outcome. Equal emphasis in addressing, and equal precision in defining,
each PICO component is not necessary. For example, a review might concentrate on
competing interventions for a particular stage of breast cancer, with stage and severity of
the disease being defined very precisely; or alternatively focus on a particular drug for any
stage of breast cancer, with the treatment formulation being defined very precisely.
Throughout the Handbook we make a distinction between three different stages in
the review at which the PICO construct might be used. This division is helpful for understanding
the decisions that need to be made:

• The review PICO (planned at the protocol stage) is the PICO on which eligibility of studies is based (what will be included and what excluded from the review).
• The PICO for each synthesis (also planned at the protocol stage) defines the question that each specific synthesis aims to answer, determining how the synthesis will be structured, specifying planned comparisons (including intervention and comparator groups, any grouping of outcome and population subgroups).
• The PICO of the included studies (determined at the review stage) is what was actually investigated in the included studies.
Reaching the point where it is possible to articulate the review’s objectives in the above
form – the review PICO – requires time and detailed discussion between potential authors
and users of the review. It is important that those involved in developing the review’s scope
and questions have a good knowledge of the practical issues that the review will address as
well as the research field to be synthesized. Developing the questions is a critical part of the
research process. As such, there are methodological issues to bear in mind, including: how
to determine which questions are most important to answer; how to engage stakeholders
in question formulation; how to account for changes in focus as the review progresses; and
considerations about how broad (or narrow) a review should be.
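The distinction between the three levels of PICO can also be illustrated with a small data sketch. The Python fragment below is our own illustration (the class, field names and the low back pain example are assumptions, not Cochrane tooling); it records a review PICO, a PICO for one planned synthesis, and the PICO of one included study side by side, so the differences between the levels stay visible.

from dataclasses import dataclass
from typing import List

@dataclass
class PICO:
    population: str
    intervention: str
    comparator: str
    outcomes: List[str]

# Review PICO: governs eligibility (planned at the protocol stage).
review_pico = PICO(
    population="adults with chronic low back pain",
    intervention="any exercise therapy",
    comparator="no treatment or usual care",
    outcomes=["pain", "function", "adverse events"],
)

# PICO for one planned synthesis: a narrower comparison within the review.
synthesis_pico = PICO(
    population="adults with chronic low back pain",
    intervention="supervised aerobic exercise",
    comparator="usual care",
    outcomes=["pain at 12 weeks"],
)

# PICO of one included study, as actually investigated (review stage).
study_pico = PICO(
    population="adults aged 30 to 60 with low back pain for over 12 weeks",
    intervention="treadmill walking, three sessions per week for 8 weeks",
    comparator="waiting list",
    outcomes=["pain (visual analogue scale)", "disability score"],
)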
Table 2.3.a Some advantages and disadvantages of broad versus narrow reviews

Broad scope – Disadvantages: Searching, data collection, analysis and writing may require more resources. Interpretation may be difficult for readers if the review is large and lacks a clear rationale (such as examining consistency of findings) for including different modes of an intervention. Scope could be chosen by review authors to produce a desired result.
Narrow scope – Disadvantages: Evidence may be sparse. Unable to explore whether different modes of an intervention modify the intervention effects. Increased burden for decision makers if multiple reviews must be accessed (e.g. if evidence is sparse for a specific mode).

Choice of interventions and comparators, e.g. oxybutynin compared with desmopressin for preventing bed-wetting (narrow) or interventions for preventing bed-wetting (broad)
Broad scope – Advantages: Comprehensive summary of the evidence. Opportunity to compare the effectiveness of a range of different intervention options.
Narrow scope – Advantages: Manageability for review team. Relative simplicity of objectives and ease of reading.
Broad scope – Disadvantages: Searching, data collection, analysis and writing may require more resources. May be unwieldy, and more appropriate to present as an Overview of reviews (see online Chapter V).
Narrow scope – Disadvantages: Increased burden for decision makers if not included in an Overview since multiple reviews may need to be accessed.
actually been recruited into research studies. Likewise, heterogeneity can be a disadvantage
when the expectation is for homogeneity of effects between studies, but an advantage
when the review question seeks to understand differential effects (see Chapter 10).
A distinction should be drawn between the scope of a review and the precise questions
within, since it is possible to have a broad review that addresses quite narrow
questions. In the antiplatelet agents for preventing thrombotic events example, a systematic
review with a broad scope might include all available treatments. Rather than
combining all the studies into one comparison though, specific treatments would be
compared with one another in separate comparisons, thus breaking a heterogeneous
set of treatments into narrower, more homogeneous groups. This relates to the three
levels of PICO, outlined in Section 2.3. The review PICO defines the broad scope of
the review, and the PICO for comparison defines the specific treatments that will be
compared with one another; Chapter 3 elaborates on the use of PICOs.
In practice, a Cochrane Review may start (or have started) with a broad scope, and be
divided up into narrower reviews as evidence accumulates and the original review
becomes unwieldy. This may be done for practical and logistical reasons, for example
to make updating easier as well as to make it easier for readers to see which parts of the
evidence base are changing. Individual review authors must decide if there are
instances where splitting a broader focused review into a series of more narrowly
focused reviews is appropriate and implement appropriate methods to achieve this.
If a major change is to be undertaken, such as splitting a broad review into a series
of more narrowly focused reviews, a new protocol must be written for each of the component
reviews that documents the eligibility criteria for each one.
Ultimately, the selected breadth of a review depends upon multiple factors including
perspectives regarding a question’s relevance and potential impact; supporting theoretical,
biologic and epidemiological information; the potential generalizability and
validity of answers to the questions; and available resources. As outlined in
Section 2.4.2, authors should consider carefully the needs of users of the review and
the context(s) in which they expect the review to be used when determining the
optimal scope for their review.
the review question is meaningful for healthcare decision making. Two approaches are
discussed below:
• Using results from existing research priority-setting exercises to define the review
question.
• In the absence of, or in addition to, existing research priority-setting exercises, engaging
with stakeholders to define review questions and establish their relevance to policy
and practice.
Other sources of questions are often found in ‘implications for future research’ sections
of articles in journals and clinical practice guidelines. Some guideline developers
have prioritized questions identified through the guideline development process
(Sharma et al 2018), although these priorities will be influenced by the needs of health
systems in which different guideline development teams are working.
the interplay between core intervention components and their introduction into differing
school environments; different child-level effect modifiers; how the intervention
then had an impact on the knowledge of the child (and their family); the child’s
self-efficacy and adherence to their treatment regime; the severity of their asthma;
the number of days of restricted activity; how this affected their attendance at school;
and finally, the distal outcomes of education attainment and indicators of child health
and well-being (Kneale et al 2015).
Several specific tools can help authors to consider issues raised when defining review
questions and planning their review; these are also helpful when developing eligibility
criteria and classifying included studies. These include the following.
1) Taxonomies: hierarchical structures that can be used to categorize (or group)
related interventions, outcomes or populations.
2) Generic frameworks for examining and structuring the description of intervention
characteristics (e.g. TIDieR for the description of interventions (Hoffmann et al
2014), iCAT_SR for describing multiple aspects of complexity in systematic reviews
(Lewin et al 2017)).
3) Core outcome sets for identifying and defining agreed outcomes that should be
measured for specific health conditions (described in more detail in Chapter 3).
Unlike these tools, which focus on particular aspects of a review, logic models
provide a framework for planning and guiding synthesis at the review level (see
Section 2.5.1).
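As an illustration of the first of these tools, a taxonomy is simply a hierarchy that assigns specific interventions to the broader groups used in structuring a synthesis. The sketch below is a hypothetical example in Python (the taxonomy content and the helper function are our own, not a published classification):

# Illustrative only: a small intervention taxonomy used to group studies.
taxonomy = {
    "exercise therapy": {
        "aerobic": ["walking", "cycling", "swimming"],
        "resistance": ["free weights", "machine-based training"],
        "mind-body": ["yoga", "tai chi"],
    },
}

def group_for_synthesis(study_intervention, taxonomy):
    """Return the higher-level group that a specific intervention belongs to."""
    for top_level, groups in taxonomy.items():
        for group, members in groups.items():
            if study_intervention in members:
                return f"{top_level} / {group}"
    return "unclassified"

print(group_for_synthesis("tai chi", taxonomy))  # exercise therapy / mind-body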
Logic models can vary in their emphasis, with a distinction sometimes made between
system-based and process-oriented logic models (Rehfuess et al 2018). System-based
logic models have particular value in examining the complexity of the system (e.g. the
geographical, epidemiological, political, socio-cultural and socio-economic features of
a system), and the interactions between contextual features, participants and the intervention
(see Chapter 17). Process-oriented logic models aim to capture the complexity
of causal pathways by which the intervention leads to outcomes, and any factors that
may modify intervention effects. However, this is not a crisp distinction; the two types
are interrelated, with some logic models depicting elements of both systems and process
models simultaneously.
The way that logic models can be represented diagrammatically (see Chapter 17 for
an example) provides a valuable visual summary for readers and can be a communication
tool for decision makers and practitioners. They can aid initially in the development
of a shared understanding between different stakeholders of the scope of the
review and its PICO, helping to support decisions taken throughout the review process,
from developing the research question and setting the review parameters, to structuring
and interpreting the results. They can be used in planning the PICO elements of a
review as well as for determining how the synthesis will be structured (i.e. planned
comparisons, including intervention and comparator groups, and any grouping of outcome
and population subgroups). These models may help review authors specify the
link between the intervention, proximal and distal outcomes, and mediating factors. In
other words, they depict the intervention theory underpinning the synthesis plan.
Anderson and colleagues note the main value of logic models in systematic reviews as
(Anderson et al 2011):
Logic models can be useful in systematic reviews when considering whether failure to
find a beneficial effect of an intervention is due to a theory failure, an implementation
failure, or both (see Chapter 17 and Cargo et al 2018). Making a distinction between
implementation and intervention theory can help to determine whether and how
the intervention interacts with (and potentially changes) its context (see Chapters 3
and 17 for further discussion of context). This helps to elucidate situations in which
variations in how the intervention is implemented have the potential to affect the integrity
of the intervention and intended outcomes.
Given their potential value in conceptualizing and structuring a review, logic models
are increasingly published in review protocols. Logic models may be specified a priori
and remain unchanged throughout the review; it might be expected, however, that the
findings of reviews produce evidence and new understandings that could be used to
update the logic model in some way (Kneale et al 2015). Some reviews take a more
staged approach, pre-specifying points in the review process where the model may
be revised on the basis of (new) evidence (Rehfuess et al 2018) and a staged logic model
can provide an efficient way to report revisions to the synthesis plan. For example, in a
review of portion, package and tableware size for changing selection or consumption of
food and other products, the authors presented a logic model that clearly showed
changes to their original synthesis plan (Hollands et al 2015).
It is preferable to seek out existing logic models for the intervention and revise or
adapt these models in line with the review focus, although this may not always be possible.
More commonly, new models are developed starting with the identification of
outcomes and theorizing the necessary pre-conditions to reach those outcomes. This
process of theorizing and identifying the steps and necessary pre-conditions continues,
working backwards from the intended outcomes, until the intervention itself is represented.
As many mechanisms of action are invisible and can only be ‘known’ through
theory, this process is invaluable in exposing assumptions as to how interventions are
thought to work; assumptions that might then be tested in the review. Logic models
can be developed with stakeholders (see Section 2.5.2) and it is considered good practice
to obtain stakeholder input in their development.
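This backwards-working process can be pictured as traversing a directed graph from outcomes to the intervention. The sketch below is illustrative only: the nodes loosely echo the school-based asthma example earlier in the chapter, but the model and code are our own assumptions, not a published logic model.

# Illustrative only: a process-oriented logic model as a directed graph
# (assumed acyclic), with edges running from each step to its effects.
logic_model = {
    "self-management education": ["child knowledge", "family knowledge"],
    "child knowledge": ["adherence to treatment"],
    "family knowledge": ["adherence to treatment"],
    "adherence to treatment": ["asthma severity"],
    "asthma severity": ["days of restricted activity"],
    "days of restricted activity": ["school attendance"],
    "school attendance": ["educational attainment"],
}

def preconditions(outcome, model):
    """Work backwards from an intended outcome to every upstream step."""
    upstream = {node for node, effects in model.items() if outcome in effects}
    for node in list(upstream):
        upstream |= preconditions(node, model)
    return upstream

print(preconditions("educational attainment", logic_model))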
Logic models are representations of how interventions are intended to ‘work’, but
they can also provide a useful basis for thinking through the unintended consequences
of interventions and identifying potential adverse effects that may need to be captured
in the review (Bonell et al 2015). While logic models provide a guiding theory of how
interventions are intended to work, critiques exist around their use, including their
potential to oversimplify complex intervention processes (Rohwer et al 2017). Here,
contributions from different stakeholders to the development of a logic model may
help to articulate where complex processes occur, to theorize unintended intervention
impacts, and to represent explicitly any ambiguity within parts of the
causal chain where new theory or explanation is most valuable.
However, such broader questions may be useful for identifying important leads in areas
that lack effective interventions and for guiding future research. Changes in the grouping
may affect the assessment of the certainty of the evidence (see Chapter 14).
Funding: JT and DK are supported by the National Institute for Health Research (NIHR)
Collaboration for Leadership in Applied Health Research and Care North Thames at
Barts Health NHS Trust. JEM is supported by an Australian National Health and Medical
Research Council (NHMRC) Career Development Fellowship (1143429). SEB’s position is
supported by the NHMRC Cochrane Collaboration Funding Program. The views
expressed are those of the authors and not necessarily those of the NHS, the NIHR,
the Department of Health or the NHMRC.
2.7 References
Anderson L, Petticrew M, Rehfuess E, Armstrong R, Ueffing E, Baker P, Francis D, Tugwell P.
Using logic models to capture complexity in systematic reviews. Research Synthesis
Methods 2011; 2: 33–42.
Bailey JV, Murray E, Rait G, Mercer CH, Morris RW, Peacock R, Cassell J, Nazareth I.
Interactive computer-based interventions for sexual health promotion. Cochrane
Database of Systematic Reviews 2010; 9: CD006483.
Baxter SK, Blank L, Woods HB, Payne N, Rimmer M, Goyder E. Using logic model methods in
systematic review synthesis: describing complex pathways in referral management
interventions. BMC Medical Research Methodology 2014; 14: 62.
Bonell C, Jamal F, Melendez-Torres GJ, Cummins S. ‘Dark logic’: theorising the harmful
consequences of public health interventions. Journal of Epidemiology and Community
Health 2015; 69: 95–98.
Bryant J, Sanson-Fisher R, Walsh J, Stewart J. Health research priority setting in selected
high income countries: a narrative review of methods used and recommendations for
future practice. Cost Effectiveness and Resource Allocation 2014; 12: 23.
Caldwell DM, Welton NJ. Approaches for synthesising complex mental health interventions
in meta-analysis. Evidence-Based Mental Health 2016; 19: 16–21.
Cargo M, Harris J, Pantoja T, Booth A, Harden A, Hannes K, Thomas J, Flemming K, Garside
R, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series –
paper 4: methods for assessing evidence on intervention implementation. Journal of
Clinical Epidemiology 2018; 97: 59–69.
Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gülmezoglu AM, Howells DW,
Ioannidis JPA, Oliver S. How to increase value and reduce waste when research priorities
are set. Lancet 2014; 383: 156–165.
Chamberlain C, O’Mara-Eves A, Porter J, Coleman T, Perlen S, Thomas J, McKenzie J.
Psychosocial interventions for supporting women to stop smoking in pregnancy.
Cochrane Database of Systematic Reviews 2017; 2: CD001055.
Cooper H. The problem formulation stage. In: Cooper H, editor. Integrating Research:
A Guide for Literature Reviews. Newbury Park (CA) USA: Sage Publications; 1984.
Counsell C. Formulating questions and locating primary studies for inclusion in systematic
reviews. Annals of Internal Medicine 1997; 127: 380–387.
Cummings SR, Browner WS, Hulley SB. Conceiving the research question and developing the
study plan. In: Hulley SB, Cummings SR, Browner WS, editors. Designing Clinical Research:
An Epidemiological Approach. 4th ed. Philadelphia (PA): Lippincott Williams & Wilkins;
2007. p. 14–22.
Glass GV. Meta-analysis at middle age: a personal history. Research Synthesis Methods 2015;
6: 221–231.
Hedges LV. Statistical considerations. In: Cooper H, Hedges LV, editors. The Handbook of
Research Synthesis. New York (NY): USA: Russell Sage Foundation; 1994.
Hetrick SE, McKenzie JE, Cox GR, Simmons MB, Merry SN. Newer generation antidepressants
for depressive disorders in children and adolescents. Cochrane Database of Systematic
Reviews 2012; 11: CD004851.
Hoffmann T, Glasziou P, Boutron I. Better reporting of interventions: template for
intervention description and replication (TIDieR) checklist and guide. BMJ 2014;
348: g1687.
Hollands GJ, Shemilt I, Marteau TM, Jebb SA, Lewis HB, Wei Y, Higgins JPT, Ogilvie D.
Portion, package or tableware size for changing selection and consumption of food,
alcohol and tobacco. Cochrane Database of Systematic Reviews 2015; 9: CD011045.
Keown K, Van Eerd D, Irvin E. Stakeholder engagement opportunities in systematic reviews:
Knowledge transfer for policy and practice. Journal of Continuing Education in the Health
Professions 2008; 28: 67–72.
Kneale D, Thomas J, Harris K. Developing and optimising the use of logic models in
systematic reviews: exploring practice and good practice in the use of programme theory
in reviews. PloS One 2015; 10: e0142187.
Lewin S, Hendry M, Chandler J, Oxman AD, Michie S, Shepperd S, Reeves BC, Tugwell P,
Hannes K, Rehfuess EA, Welch V, McKenzie JE, Burford B, Petkovic J, Anderson LM, Harris
J, Noyes J. Assessing the complexity of interventions within systematic reviews:
development, content and use of a new tool (iCAT_SR). BMC Medical Research
Methodology 2017; 17: 76.
Lorenc T, Petticrew M, Welch V, Tugwell P. What types of interventions generate
inequalities? Evidence from systematic reviews. Journal of Epidemiology and Community
Health 2013; 67: 190–193.
Nasser M, Ueffing E, Welch V, Tugwell P. An equity lens can ensure an equity-oriented
approach to agenda setting and priority setting of Cochrane Reviews. Journal of Clinical
Epidemiology 2013; 66: 511–521.
Nasser M. Setting priorities for conducting and updating systematic reviews [PhD Thesis]:
University of Plymouth; 2018.
O’Neill J, Tabish H, Welch V, Petticrew M, Pottie K, Clarke M, Evans T, Pardo Pardo J, Waters
E, White H, Tugwell P. Applying an equity lens to interventions: using PROGRESS ensures
consideration of socially stratifying factors to illuminate inequities in health. Journal of
Clinical Epidemiology 2014; 67: 56–64.
Oliver S, Dickson K, Bangpan M, Newman M. Getting started with a review. In: Gough D,
Oliver S, Thomas J, editors. An Introduction to Systematic Reviews. London (UK): Sage
Publications Ltd.; 2017.
Petticrew M, Roberts H. Systematic Reviews in the Social Sciences: A Practical Guide. Oxford
(UK): Blackwell; 2006.
Pfadenhauer L, Gerhardus A, Mozygemba K, Lysdahl KB, Booth A, Hofmann B, Wahlster P,
Polus S, Burns J, Brereton L, Rehfuess E. Making sense of complexity in context and
implementation: the Context and Implementation of Complex Interventions (CICI)
framework. Implementation Science 2017; 12: 21.
Rehfuess EA, Booth A, Brereton L, Burns J, Gerhardus A, Mozygemba K, Oortwijn W,
Pfadenhauer LM, Tummers M, van der Wilt GJ, Rohwer A. Towards a taxonomy of logic
models in systematic reviews and health technology assessments: a priori, staged, and
iterative approaches. Research Synthesis Methods 2018; 9: 13–24.
Richardson WS, Wilson MC, Nishikawa J, Hayward RS. The well-built clinical question: a key
to evidence-based decisions. ACP Journal Club 1995; 123: A12–13.
Rohwer A, Pfadenhauer L, Burns J, Brereton L, Gerhardus A, Booth A, Oortwijn W, Rehfuess
E. Series: Clinical epidemiology in South Africa. Paper 3: Logic models help make sense of
complexity in systematic reviews and health technology assessments. Journal of Clinical
Epidemiology 2017; 83: 37–47.
Sharma T, Choudhury M, Rejón-Parrilla JC, Jonsson P, Garner S. Using HTA and guideline
development as a tool for research priority setting the NICE way: reducing research waste
by identifying the right research to fund. BMJ Open 2018; 8: e019777.
Squires J, Valentine J, Grimshaw J. Systematic reviews of complex interventions: framing
the review question. Journal of Clinical Epidemiology 2013; 66: 1215–1222.
Tong A, Chando S, Crowe S, Manns B, Winkelmayer WC, Hemmelgarn B, Craig JC. Research
priority setting in kidney disease: a systematic review. American Journal of Kidney
Diseases 2015; 65: 674–683.
Tong A, Sautenet B, Chapman JR, Harper C, MacDonald P, Shackel N, Crowe S, Hanson C, Hill
S, Synnot A, Craig JC. Research priority setting in organ transplantation: a systematic
review. Transplant International 2017; 30: 327–343.
Turley R, Saith R, Bhan N, Rehfuess E, Carter B. Slum upgrading strategies involving physical
environment and infrastructure interventions and their effects on health and socio-
economic outcomes. Cochrane Database of Systematic Reviews 2013; 1: CD010067.
van der Heijden I, Abrahams N, Sinclair D. Psychosocial group interventions to improve
psychological well-being in adults living with HIV. Cochrane Database of Systematic
Reviews 2017; 3: CD010806.
Viergever RF. Health Research Prioritization at WHO: An Overview of Methodology and High
Level Analysis of WHO Led Health Research Priority Setting Exercises. Geneva (Switzerland):
World Health Organization; 2010.
Viergever RF, Olifson S, Ghaffar A, Terry RF. A checklist for health research priority setting:
nine common themes of good practice. Health Research Policy and Systems 2010; 8: 36.
Whitehead M. The concepts and principles of equity and health. International Journal of
Health Services 1992; 22: 429–445.
3 Defining the criteria for including studies and how they will be grouped for the synthesis
Joanne E McKenzie, Sue E Brennan, Rebecca E Ryan, Hilary J Thomson,
Renea V Johnston, James Thomas
KEY POINTS

• The scope of a review is defined by the types of population (participants), types of interventions (and comparisons), and the types of outcomes that are of interest. The acronym PICO (population, intervention, comparison and outcome) helps to serve as a reminder of these.
• The population, intervention and comparison components of the question, with the additional specification of types of study that will be included, form the basis of the pre-specified eligibility criteria for the review. It is rare to use outcomes as eligibility criteria: studies should be included irrespective of whether they report outcome data, but may legitimately be excluded if they do not measure outcomes of interest, or if they explicitly aim to prevent a particular outcome.
• Cochrane Reviews should include all outcomes that are likely to be meaningful and not include trivial outcomes. Critical and important outcomes should be limited in number and include adverse as well as beneficial outcomes.
• Review authors should plan at the protocol stage how the different populations, interventions, outcomes and study designs within the scope of the review will be grouped for analysis.
3.1 Introduction
One of the features that distinguishes a systematic review from a narrative review is
that systematic review authors should pre-specify criteria for including and excluding
studies in the review (eligibility criteria, see MECIR Box 3.2.a).
When developing the protocol, one of the first steps is to determine the elements
of the review question (including the population, intervention(s), comparator(s) and
outcomes, or PICO elements) and how the intervention, in the specified population,
produces the expected outcomes (see Chapter 2, Section 2.5.1 and Chapter 17,
Section 17.2.1). Eligibility criteria are based on the PICO elements of the review ques-
tion plus a specification of the types of studies that have addressed these questions.
The population, interventions and comparators in the review question usually translate
directly into eligibility criteria for the review, though this is not always a straightforward
process and requires a thoughtful approach, as this chapter shows. Outcomes usually
are not part of the criteria for including studies, and a Cochrane Review would typically
seek all sufficiently rigorous studies (most commonly randomized trials) of a particular
comparison of interventions in a particular population of participants, irrespective of
the outcomes measured or reported. It should be noted that some reviews do legiti-
mately restrict eligibility to specific outcomes. For example, the same intervention
may be studied in the same population for different purposes; or a review may specif-
ically address the adverse effects of an intervention used for several conditions (see
Chapter 19).
Eligibility criteria do not exist in isolation, but should be specified with the synthesis
of the studies they describe in mind. This will involve making plans for how to group
variants of the PICO elements for synthesis. This chapter describes the processes by
which the structure of the synthesis can be mapped out at the beginning of the review,
and the interplay between the review question, considerations for the analysis and
their operationalization in terms of eligibility criteria. Decisions about which studies
to include (and exclude), and how they will be combined in the review’s synthesis,
should be documented and justified in the review protocol.
A distinction between three different stages in the review at which the PICO construct
might be used is helpful for understanding the decisions that need to be made. In
Chapter 2 (Section 2.3) we introduced the ideas of a review PICO (on which eligibility
of studies is based), the PICO for each synthesis (defining the question that each spe-
cific synthesis aims to answer) and the PICO of the included studies (what was actually
investigated in the included studies). In this chapter, we focus on the review PICO and
the PICO for each synthesis as a basis for specifying which studies should be included
in the review and planning its syntheses. These PICOs should relate clearly and directly
to the questions or hypotheses that are posed when the review is formulated (see
Chapter 2) and will involve specifying the population in question, and a set of compar-
isons between the intervention groups.
An integral part of the process of setting up the review is to specify which character-
istics of the interventions (e.g. individual compounds of a drug), populations (e.g. acute
and chronic conditions), outcomes (e.g. different depression measurement scales) and
study designs will be grouped together. Such decisions should be made independently
of knowing which studies will be included and the methods of synthesis that will be
used (e.g. meta-analysis). There may be a need to modify the comparisons and even
add new ones at the review stage in light of the data that are collected. For example,
important variations in the intervention may be discovered only after data are col-
lected, or modifying the comparison may facilitate the possibility of synthesis when
only one or few studies meet the comparison PICO. Planning for the latter scenario
at the protocol stage may lead to less post-hoc decision making (Chapter 2,
Section 2.5.3) and, of course, any changes made during the conduct of the review
should be recorded and documented in the final report.
3.2 Articulating the review and comparison PICO
location), since this has implications for grouping studies and for the method of syn-
thesis (Chapter 10, Section 10.11.5). It is often helpful to consider the types of people
that are of interest in three steps.
First, the diseases or conditions of interest should be defined using explicit criteria
for establishing their presence (or absence). Criteria that will force the unnecessary
exclusion of studies should be avoided. For example, diagnostic criteria that were
developed more recently – which may be viewed as the current gold standard for diag-
nosing the condition of interest – will not have been used in earlier studies. Expensive
or recent diagnostic tests may not be available in many countries or settings, and time-
consuming tests may not be practical in routine healthcare settings.
Second, the broad population and setting of interest should be defined. This
involves deciding whether a specific population group is within scope, determined by
factors such as age, sex, race, educational status or the presence of a particular condition
such as angina or shortness of breath. Interest may focus on a particular setting such as a
community, hospital, nursing home, chronic care institution, or outpatient setting.
Box 3.2.a outlines some factors to consider when developing population criteria.
Whichever criteria are used for defining the population and setting of interest, it is
common to encounter studies that only partially overlap with the review’s population.
For example, in a review focusing on children, a cut-point of less than 16 years might be
desirable, but studies may be identified with participants aged from 12 to 18. Unless the
study reports separate data from the eligible section of the population (in which case
data from the eligible participants can be included in the review), review authors will
need a strategy for dealing with these studies (see MECIR Box 3.2.a). This will involve
balancing concerns about reduced applicability by including participants who do not
meet the eligibility criteria, against the loss of data when studies are excluded. Arbitrary
rules (such as including a study if more than 80% of the participants are under 16) will
not be practical if detailed information is not available from the study. A less stringent
rule, such as ‘the majority of participants are under 16’ may be sufficient. Although
there is a risk of review authors’ biases affecting post-hoc inclusion decisions (which
is why many authors endeavour to pre-specify these rules), this may be outweighed
by a common-sense strategy in which eligibility decisions keep faith with the objectives
of the review rather than with arbitrary rules. Difficult decisions should be documented
in the review, checked with the advisory group (if available, see Chapter 1), and
sensitivity analyses can assess the impact of these decisions on the review’s findings
(see Chapter 10, Section 10.14 and MECIR Box 3.2.b).

Box 3.2.a Factors to consider when developing criteria for ‘Types of participants’
Third, there should be consideration of whether there are population characteris-
tics that might be expected to modify the size of the intervention effects (e.g. dif-
ferent severities of heart failure). Identifying subpopulations may be important for
implementation of the intervention. If relevant subpopulations are identified, two
courses of action are possible: limiting the scope of the review to exclude certain sub-
populations; or maintaining the breadth of the review and addressing subpopulations
in the analysis.
Restricting the review with respect to specific population characteristics or settings
should be based on a sound rationale. It is important that Cochrane Reviews are glob-
ally relevant, so the rationale for the exclusion of studies based on population charac-
teristics should be justified. For example, focusing a review of the effectiveness of
mammographic screening on women between 40 and 50 years old may be justified
based on biological plausibility, previously published systematic reviews and existing
controversy. On the other hand, focusing a review on a particular subgroup of people
on the basis of their age, sex or ethnicity simply because of personal interests, when
there is no underlying biologic or sociological justification for doing so, should be
avoided, as these reviews will be less useful to decision makers and readers of the
review.
Maintaining the breadth of the review may be best when it is uncertain whether there
are important differences in effects among various subgroups of people, since this
allows investigation of these differences (see Chapter 10, Section 10.11.5). Review
authors may combine the results from different subpopulations in the same synthesis,
examining whether a given subdivision explains variation (heterogeneity) among the
intervention effects. Alternatively, the results may be synthesized in separate compar-
isons representing different subpopulations. Splitting by subpopulation risks there
being too few studies to yield a useful synthesis (see Table 3.2.a and Chapter 2,
Section 2.3.2). Consideration needs to be given to the subgroup analysis method,
37
3 Defining criteria for including studies
Intended recipient Patient, carer, healthcare provider (general In a review of e-learning programmes for health professionals, a subgroup analysis
of intervention practitioners, nurses, allied health was planned to examine if the effects were modified by the type of healthcare provider
professionals), health system, policy maker, (doctors, nurses or physiotherapists). The authors hypothesized that e-learning
community programmes for doctors would be more effective than for other health professionals,
but did not provide a rationale (Vaona et al 2018).
Disease/condition Type and severity of a condition In a review of platelet-rich therapies for musculoskeletal soft tissue injuries, a
(to be treated subgroup analysis was undertaken to examine if the effects of platelet-rich therapies
or prevented) were modified by the type of condition (e.g. rotator cuff tear, anterior cruciate ligament
reconstruction, chronic Achilles tendinopathy) (Moraes et al 2014).
In planning a review of beta-blockers for heart failure, subgroup analyses were specified
to examine if the effects of beta-blockers are modified by the underlying cause of heart
failure (e.g. idiopathic dilated cardiomyopathy, ischaemic heart disease, valvular heart
disease, hypertension) and the severity of heart failure (‘reduced left ventricular ejection
fraction (LVEF)’ ≤ 40%, ‘mid-range LVEF’ > 40% and < 50%, ‘preserved LVEF’ ≥ 50%,
mixed, not-specified). Studies have shown that patient characteristics and
comorbidities differ by heart failure severity, and that therapies have been shown to
reduce morbidity in ‘reduced LVEF’ patients, but the benefits in the other groups are
uncertain (Safi et al 2017).
Participant Age (neonate, child, adolescent, adult, older In a review of newer-generation antidepressants for depressive disorders in children
characteristics adult) and adolescents, a subgroup analysis was undertaken to examine if the effects of the
Race/ethnicity antidepressants were modified by age. The rationale was based on the findings of
another review that suggested that children and adolescents may respond differently
Sex/gender to antidepressants. The age groups were defined as ‘children’ (aged approximately 6
to 12 years), ‘adolescents’ (aged approximately 13 to 18 years), and ‘children and
PROGRESS-Plus equity characteristics (e.g.
adolescents’ (when the study included both children and adolescents, and results
place of residence, socio-economic status,
could not be obtained separately by these subpopulations) (Hetrick et al 2012).
education) (O’Neill et al 2014)
Setting Setting of care (primary care, hospital, In a review of hip protectors for preventing hip fractures in older people, separate
community) comparisons were specified based on setting (institutional care or community-
Rurality (urban, rural, remote) dwelling) for the critical outcome of hip fracture (Santesso et al 2014).
38
3.2 Articulating the review and comparison PICO
particularly for population characteristics measured at the participant level (see Chap-
ters 10 and 26, Fisher et al 2017). All subgroup analyses should ideally be planned a
priori and stated as a secondary objective in the protocol, and not driven by the avail-
ability of data.
In practice, it may be difficult to assign included studies to defined subpopulations
because of missing information about the population characteristic, variability in how
the population characteristic is measured across studies (e.g. variation in the method
used to define the severity of heart failure), or because the study does not wholly fall
within (or report the results separately by) the defined subpopulation. The latter issue
mainly applies for participant characteristics but can also arise for settings or geo-
graphic locations where these vary within studies. Review authors should consider
planning for these scenarios (see example reviews Hetrick et al 2012, Safi et al 2017;
Table 3.2.b, column 3).
Table 3.2.b A process for planning intervention groups for synthesis

1. Identify intervention characteristics that may modify the effect of the intervention.
Considerations: Consider whether differences in intervention characteristics might modify
the size of the intervention effect importantly. Content-specific research literature and
expertise should inform this step. The TIDieR checklist – a tool for describing interventions –
outlines the characteristics across which an intervention might differ (Hoffmann et al 2014).
These include ‘what’ materials and procedures are used, ‘who’ provides the intervention,
and ‘when and how much’ intervention is delivered. The iCAT-SR tool provides equivalent
guidance for complex interventions (Lewin et al 2017).
Examples: Exercise interventions differ across multiple characteristics, which vary in
importance depending on the review. In a review of exercise for osteoporosis, whether the
exercise is weight-bearing or non-weight-bearing may be a key characteristic, since the
mechanism by which exercise is thought to work is by placing stress or mechanical load on
bones (Howe et al 2011). Different mechanisms apply in reviews of exercise for knee
osteoarthritis (muscle strengthening), falls prevention (gait and balance) and cognitive
function (cardiovascular fitness). The differing mechanisms might suggest different ways of
grouping interventions (e.g. by intensity, mode of delivery) according to potential modifiers
of the intervention effects.

2a. Label and define intervention groups to be considered in the synthesis.
Considerations: For each intervention group, provide a short label (e.g. supportive
psychotherapy) and describe the core characteristics (criteria) that will be used to assign
each intervention from an included study to a group. Groups are often defined by
intervention content (especially the active components), such as materials, procedures or
techniques (e.g. a specific drug, an information leaflet, a behaviour change technique).
Other characteristics may also be used, although some are more commonly used to define
subgroups (see Chapter 10, Section 10.11.5): the purpose or theoretical underpinning, mode
of delivery, provider, dose or intensity, duration or timing of the intervention (Hoffmann
et al 2014). In specifying groups:
• focus on ‘clinically’ meaningful groups that will inform selection and implementation of
an intervention in practice;
• use terminology that is understood by those using or implementing the intervention; and
• prefer grouping systems that are developed systematically and based on consensus,
preferably with stakeholders including clinicians, patients, policy makers and researchers,
and that have been validated through successful use in a range of applications (ideally,
including in systematic reviews).
Systems for grouping interventions may be generic, widely applicable across clinical areas,
or specific to a condition or intervention type. Some Cochrane Groups recommend specific
taxonomies.
Examples: In a review of psychological therapies for coronary heart disease, a single group
was specified for meta-analysis that included all types of therapy. Subgroups were defined
to examine whether intervention effects were modified by intervention components (e.g.
cognitive techniques, stress management) or mode of delivery (e.g. individual, group)
(Richards et al 2017). In a review of psychological therapies for panic disorder (Pompoli et al
2016), eight types of therapy were specified:
1) psychoeducation;
2) supportive psychotherapy (with or without a psychoeducational component);
3) physiological therapies;
4) behaviour therapy;
5) cognitive therapy;
6) cognitive behaviour therapy (CBT);
7) third-wave CBT; and
8) psychodynamic therapies.
Generic systems: the behaviour change wheel has been used to group interventions (or
components) by function (e.g. to educate, persuade, enable) (Michie et al 2011). This system
was used to describe the components of dietary advice interventions (Desroches et al 2013).
Specific systems: multiple reviews have used the consensus-based taxonomy developed by
the Prevention of Falls Network Europe (ProFaNE) (e.g. Verheyden et al 2013, Kendrick et al
2014). The taxonomy specifies broad groups (e.g. exercise, medication, environment/assistive
technology) within which are more specific groups (e.g. exercise: gait, balance and functional
training; flexibility; strength and resistance) (Lamb et al 2011).

4. Plan how the specified groups will be used in synthesis and reporting.
Considerations: Decide whether it is useful to pool all interventions in a single meta-analysis
(‘lumping’), within which specific characteristics can be explored as effect modifiers (e.g. in
subgroups). Alternatively, if pooling all interventions is unlikely to address a useful question,
separate synthesis of specific interventions may be more appropriate (‘splitting’).
Determining the right analytic approach is discussed further in Chapter 2, Section 2.3.2.
Example: In a review of exercise for knee osteoarthritis, the different categories of exercise
were combined in a single meta-analysis, addressing the question ‘what is the effect of
exercise on knee osteoarthritis?’. The categories were also analysed as subgroups within the
meta-analysis to explore whether the effect size varied by type of exercise (Fransen et al
2015). Other subgroup analyses examined mode of delivery and dose.

5. Decide how to group interventions with multiple components or co-interventions.
Considerations: Some interventions, especially those considered ‘complex’, include multiple
components that could also be implemented independently (Guise et al 2014, Lewin et al
2017). These components might be eligible for inclusion in the review alone, or eligible only
if used alongside an eligible intervention. Options for considering multi-component
interventions include the following.
• Identifying intervention components for meta-regression or a components-based network
meta-analysis (see Chapter 11 and Welton et al 2009, Caldwell and Welton 2016, Higgins
et al 2019).
• Grouping based on the ‘main’ intervention component (Caldwell and Welton 2016).
• Specifying a separate group (‘multi-component interventions’). ‘Lumping’ multi-component
interventions together may provide information about their effects in general; however,
this approach may lead to unexplained heterogeneity and/or an inability to identify which
components are effective (Caldwell and Welton 2016).
• Reporting results study by study: an option if components are expected to be so diverse
that synthesis will not be interpretable.
• Excluding multi-component interventions: an option if the effect of the intervention of
interest cannot be discerned. This approach may reduce the relevance of the review.
The first two approaches may be challenging but are likely to be most useful (Caldwell and
Welton 2016). See Section 3.2.3.1 for the special case of when a co-intervention is
administered in both treatment arms.
Examples: Grouping by main component: in a review of psychological therapies for panic
disorder, two of the eight eligible therapies (psychoeducation and supportive psychotherapy)
could be used alone or as part of a multi-component therapy. When accompanied by another
eligible therapy, the intervention was categorized as the other therapy (i.e. psychoeducation
plus cognitive behavioural therapy was categorized as cognitive behavioural therapy)
(Pompoli et al 2016). Separate group: in a review of psychosocial interventions for smoking
cessation in pregnancy, two approaches were used. All intervention types were included in a
single meta-analysis with subgroups for multi-component, single and tailored interventions.
Separate meta-analyses were also performed for each intervention type, with categorization
of multi-component interventions based on the ‘main’ component (Chamberlain et al 2017).

6. Build in contingencies by specifying both specific and broader intervention groups.
Considerations: Consider grouping interventions at more than one level, so that studies of a
broader group of interventions can be synthesized if too few studies are identified for
synthesis in more specific groups. This will provide flexibility where review authors anticipate
few studies contributing to specific groups (e.g. in reviews with diverse interventions,
additional diversity in other PICO elements, or few studies overall; see also Chapter 2,
Section 2.5.3).
Example: In a review of psychosocial interventions for smoking cessation, the authors
planned to group any psychosocial intervention in a single comparison (addressing the
higher-level question of whether, on average, psychosocial interventions are effective). Given
that sufficient data were available, they also presented separate meta-analyses to examine
the effects of specific types of psychosocial interventions (e.g. counselling, health education,
incentives, social support) (Chamberlain et al 2017).
Box 3.2.b Factors to consider when developing criteria for ‘Types of interventions’
• Have the different meanings of phrases such as ‘control’, ‘placebo’, ‘no intervention’
or ‘usual care’ been considered?

an overall synthesis can provide useful information for decision makers. Where differ-
ences in intervention characteristics are more substantial (such as delivery of brief alco-
hol counselling by nurses versus doctors), and are expected to have a substantial
impact on the size of intervention effects, these differences should be examined in
the synthesis. What constitutes an important difference requires judgement, but in gen-
eral differences that alter decisions about how an intervention is implemented, or
whether it is used at all, are likely to be important. In such circumstances, review
authors should consider specifying separate groups (or subgroups) to examine in their
synthesis.
Clearly defined intervention groups serve two main purposes in the synthesis. First,
the way in which interventions are grouped for synthesis (meta-analysis or other syn-
thesis) is likely to influence review findings. Careful planning of intervention groups
makes best use of the available data, avoids decisions that are influenced by study find-
ings (which may introduce bias), and produces a review focused on questions relevant
to decision makers. Second, the intervention groups specified in a protocol provide a
standardized terminology for describing the interventions throughout the review, over-
coming the varied descriptions used by study authors (e.g. where different labels are
used for the same intervention, or similar labels used for different techniques) (Michie
et al 2013). This standardization enables comparison and synthesis of information
about intervention characteristics across studies (common characteristics and differ-
ences) and provides a consistent language for reporting that supports interpretation
of review findings.
Table 3.2.b outlines a process for planning intervention groups as a basis for
synthesis, and the decision points and considerations at each step. The table is
intended to guide, rather than to be prescriptive and, although it is presented as a
sequence of steps, the process is likely to be iterative, and some steps may be done
concurrently or in a different sequence. The process aims to minimize data-driven
approaches that can arise once review authors have knowledge of the findings of
the included studies. It also includes principles for developing a flexible plan that
maximizes the potential to synthesize in circumstances where there are few studies,
many variants of an intervention, or where the variants are difficult to anticipate. In
all stages, review authors should consider how to categorize studies whose reports
contain insufficient detail.
• Intervention versus placebo (e.g. placebo drug, sham surgical procedure, psycholog-
ical placebo). Placebos are most commonly used in the evaluation of pharmacolog-
ical interventions, but may also be used in some non-pharmacological
evaluations. For example:
◦ newer generation antidepressants versus placebo (Hetrick et al 2012); and
◦ vertebroplasty for osteoporotic vertebral compression fractures versus placebo
(sham procedure) (Buchbinder et al 2018).
• Intervention versus control (e.g. no intervention, wait-list control, usual care). Both
intervention arms may also receive standard therapy. For example:
◦ chemotherapy or targeted therapy plus best supportive care (BSC) versus BSC for
palliative treatment of esophageal and gastroesophageal-junction carcinoma
(Janmaat et al 2017); and
◦ personalized care planning versus usual care for people with long-term conditions
(Coulter et al 2015).
• Intervention A versus intervention B. A comparison of active interventions may
include comparison of the same intervention delivered at different time points, for
different lengths of time or different doses, or two different interventions. For
example:
◦ early (commenced at less than two weeks of age) versus late (two weeks of age or
more) parenteral zinc supplementation in term and preterm infants (Taylor
et al 2017);
◦ high intensity versus low intensity physical activity or exercise in people with hip or
knee osteoarthritis (Regnaux et al 2015);
◦ multimedia education versus other education for consumers about prescribed and
over the counter medications (Ciciriello et al 2013).
The first two types of comparisons aim to establish the effectiveness of an interven-
tion, while the last aims to compare the effectiveness of two interventions. However,
the distinction between the placebo and control is often arbitrary, since any differences
in the care provided between trials with a control arm and those with a placebo arm
may be unimportant, especially where ‘usual care’ is provided to both. Therefore, pla-
cebo and control groups may be determined to be similar enough to be combined for
synthesis.
In reviews including multiple intervention groups, many comparisons are possible. In
some of these reviews, authors seek to synthesize evidence on the comparative effec-
tiveness of all their included interventions, including where there may be only indirect
comparison of some interventions across the included studies (Chapter 11,
Section 11.2.1). However, in many reviews including multiple intervention groups, a lim-
ited subset of the possible comparisons will be selected. The chosen subset of compar-
isons should address the most important clinical and research questions. For example,
if an established intervention (or dose of an intervention) is used in practice, then the
synthesis would ideally compare novel or alternative interventions to this established
intervention, and not, for example, to no intervention.
of interest alone (Squires et al 2013). While qualitative interactions are rare (where the
effect of the intervention is in the opposite direction when combined with the supple-
mentary intervention), it is possible that there will be more variation in the intervention
effects (heterogeneity) when supplementary interventions are involved, and it is impor-
tant to plan for this. Approaches for dealing with this in the statistical synthesis may
include fitting a random-effects meta-analysis model that encompasses heterogeneity
(Chapter 10, Section 10.10.4), or investigating whether the intervention effect is mod-
ified by the addition of the supplementary intervention through subgroup analysis
(Chapter 10, Section 10.11.2).
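To make the random-effects option concrete, the following is a minimal sketch of a
DerSimonian-Laird random-effects meta-analysis, one standard way of fitting a model
that encompasses heterogeneity. It is illustrative only: the effect estimates and
variances are invented, not taken from any review discussed in this chapter.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool study effects with DerSimonian-Laird random-effects weights.

    effects: per-study effect estimates (e.g. log odds ratios)
    variances: their within-study variances
    Returns (pooled effect, standard error, tau2), where tau2 is the
    estimated between-study variance that encompasses heterogeneity.
    """
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)  # DL estimate of tau^2
    w_re = [1 / (v + tau2) for v in variances]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, math.sqrt(1 / sum(w_re)), tau2

# Invented log odds ratios and variances for five hypothetical trials.
effects = [-0.4, -0.1, -0.6, 0.1, -0.3]
variances = [0.04, 0.09, 0.05, 0.12, 0.06]
pooled, se, tau2 = dersimonian_laird(effects, variances)
print(f"pooled = {pooled:.3f}, SE = {se:.3f}, tau^2 = {tau2:.3f}")
```

The same machinery, applied separately within subgroups, underpins the subgroup
analyses mentioned above; formal methods are described in Chapter 10.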
synthesis questions being addressed through the use of Overviews of reviews (see
online Chapter V).
Outcomes considered to be meaningful, and therefore addressed in a review, may
not have been reported in the primary studies. For example, quality of life is an impor-
tant outcome, perhaps the most important outcome, for people considering whether or
not to use chemotherapy for advanced cancer, even if the available studies are found to
report only survival (see Chapter 18). A further example arises with timing of the out-
come measurement, where time points determined as clinically meaningful in a review
are not measured in the primary studies. Including and discussing all important out-
comes in a review will highlight gaps in the primary research and encourage research-
ers to address these gaps in future studies.
Box 3.2.c Factors to consider when selecting and prioritizing review outcomes
Table 3.2.c A process for planning outcome groups for synthesis

1. Fully specify outcome domains.
Considerations: For each outcome domain, provide a short label (e.g. cognition, consumer
evaluation of care) and describe the domain in sufficient detail to enable eligible outcomes
from each included study to be categorized. The definition should be based on the concept
(or construct) measured, that is ‘what’ is measured. ‘When’ and ‘how’ the outcome is
measured will be considered in subsequent steps. Outcomes can be defined hierarchically,
starting with very broad groups (e.g. physiological/clinical outcomes, life impact, adverse
events), then outcome domains (e.g. functioning and perceived health status are domains
within ‘life impact’). Within these may be narrower domains (e.g. physical function, cognitive
function), and then specific outcome measures (Dodd et al 2018); a sketch of such a
hierarchy follows this table. The level at which outcomes are grouped for synthesis alters
the question addressed, and so decisions should be guided by the review objectives. In
specifying outcome domains:
• definitions should reflect existing systems if available, or relevant literature and
terminology understood by decision makers;
• where outcomes are likely to be inconsistently labelled and described, listing examples
may convey the scope of the domain; and
• consider the level at which domains will be defined (broad versus narrow) and the
implications for reporting and synthesis: combining diverse outcomes may lead to
unexplained heterogeneity, whereas narrowly specified outcomes may prevent synthesis
when few studies report specific measures.
Examples: In a review of computer-based interventions for sexual health promotion, three
broad outcome domains were defined (cognitions, behaviours, biological) based on a
conceptual model of how the intervention might work. Each domain comprised more
specific domains and outcomes (e.g. condom use, seeking health services such as STI
testing); listing these helped define the broad domains and guided categorization of the
diverse outcomes reported in included studies (Bailey et al 2010). In a protocol for a review
of social media interventions for improving health, the rationale for synthesizing broad
groupings of outcomes (e.g. health behaviours, physical health) was based on prediction of
a common underlying mechanism by which the intervention would work, and on the review
objective, which focused on overall health rather than specific outcomes (Welch et al 2018).

3. Specify outcome time points.
Considerations:
• consider whether short-term or long-term outcomes are important;
• consider whether there are agreed or accepted outcome time points (e.g. standards in a
clinical area, such as an NIH task force suggestion of at least 6 to 12 months’ follow-up for
chronic low back pain (Deyo et al 2014), or core outcome sets (Williamson et al 2017)); and
• consider carefully the width of the time frame (e.g. what constitutes ‘short term’ for this
review?). Narrow time frames may lead to few studies in the synthesis; broad time frames
may lead to multiplicity (see Step 5) and difficulties with interpretation if the timing is very
diverse across studies.
Example: where most trials are of short duration with short-term follow-up, an additional
important outcome of ‘depression (< 3 months)’ might also be specified.

4. Specify the measurement tool or measurement method.
Considerations: For each outcome domain, specify:
• measurement methods or tools that provide an appropriate assessment of the domain or
specific outcome (e.g. clinical assessment, laboratory tests, objective measures, and
patient-reported outcome measures (PROMs)); and
• whether different methods or tools are comparable measures of a domain, which has
implications for synthesis (Step 6).
Minimum criteria for inclusion of a measure may include adequate evidence of reliability
(e.g. consistent scores across time and raters when the outcome is unchanged) and validity
(e.g. comparable results to similar measures, including a gold standard if available).
Examples: In a review of interventions to support women to stop smoking, objective
(biochemically validated) and subjective (self-report) measures of smoking cessation were
specified separately to examine bias due to the method used to measure the outcome
(Step 6) (Chamberlain et al 2017). In a review of high-intensity versus low-intensity exercise
for osteoarthritis, measures of pain were selected based on relevance of the content and
properties of the measurement tool (i.e. evidence of validity and reliability) (Regnaux et al
2015).
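The hierarchical structure described in Step 1 (broad groups, then domains, then
narrower domains, then specific measures) can be pictured as a simple tree. The
sketch below uses hypothetical domain labels and measures assembled for
illustration, not drawn from any particular review:

```python
# Hypothetical outcome hierarchy: broad group -> domain -> narrower
# domain -> specific outcome measures reported by studies.
outcome_hierarchy = {
    "life impact": {
        "functioning": {
            "physical function": ["6-minute walk distance"],
            "cognitive function": ["MMSE score"],
        },
        "perceived health status": {
            "self-rated health": ["SF-36 general health subscale"],
        },
    },
}

def find_path(measure, tree, path=()):
    """Return the chain of domains under which a reported measure falls."""
    for label, subtree in tree.items():
        if isinstance(subtree, dict):
            hit = find_path(measure, subtree, path + (label,))
            if hit:
                return hit
        elif measure in subtree:
            return path + (label,)
    return None

print(find_path("MMSE score", outcome_hierarchy))
# -> ('life impact', 'functioning', 'cognitive function')
```

Categorizing each study’s reported measures against a pre-specified tree of this kind
makes the level chosen for synthesis (broad versus narrow) explicit before results
are known.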
3.3 Determining which study designs to include
suggests that, on average, non-randomized studies produce effect estimates that indi-
cate more extreme benefits of the effects of health care than randomized trials. How-
ever, the extent, and even the direction, of the bias is difficult to predict. These issues
are discussed at length in Chapter 24, which provides guidance on when it might be
appropriate to include non-randomized studies in a Cochrane Review.
Practical considerations also motivate the restriction of many Cochrane Reviews to
randomized trials. In recent decades there has been considerable investment interna-
tionally in establishing infrastructure to index and identify randomized trials. Cochrane
has contributed to these efforts, including building up and maintaining a database of
randomized trials, developing search filters to aid their identification, working with
MEDLINE to improve tagging and identification of randomized trials, and using machine
learning and crowdsourcing to reduce author workload in identifying randomized trials
(Chapter 4, Section 4.6.6.2). The same scale of organizational investment has not (yet)
been matched for the identification of other types of studies. Consequently, identifying
and including other types of studies may require additional efforts to identify studies
and to keep the review up to date, and might increase the risk that the result of the
review will be influenced by publication bias. This issue and other bias-related issues
that are important to consider when defining types of studies are discussed in detail in
Chapters 7 and 13.
Specific aspects of study design and conduct should be considered when defining
eligibility criteria, even if the review is restricted to randomized trials. For example,
review authors should decide whether cluster-randomized trials (Chapter 23,
Section 23.1) and crossover trials (Chapter 23, Section 23.2) are eligible, and whether
to impose other eligibility criteria such as use of a placebo comparison group,
evaluation of outcomes blinded to allocation sequence, or a minimum period of
follow-up.
restrictive study design criteria (which might result in the inclusion of studies that are at
low risk of bias, but very few in number) and more liberal design criteria (which might
result in the inclusion of more studies, but at a higher risk of bias). Furthermore, exces-
sively broad criteria might result in the inclusion of misleading evidence. If, for example,
interest focuses on whether a therapy improves survival in patients with a chronic con-
dition, it might be inappropriate to look at studies of very short duration, except to
make explicit the point that they cannot address the question of interest.
Eligibility criteria should be defined in terms of specific study design features, and not
the study labels applied by the primary researchers (e.g.
case-control, cohort), which are often used inconsistently (Reeves et al 2017; see
Chapter 24).
When non-randomized studies are included, review authors should consider how the
studies will be grouped and used in the synthesis. The Cochrane Non-randomized Stud-
ies Methods Group taxonomy of design features (see Chapter 24) may provide a basis
for grouping together studies that are expected to have similar inferential strength and
for providing a consistent language for describing the study design.
Once decisions have been made about grouping study designs, planning of how
these will be used in the synthesis is required. Review authors need to decide whether
it is useful to synthesize results from non-randomized studies and, if so, whether results
from randomized trials and non-randomized studies should be included in the same
synthesis (for the purpose of examining whether study design explains heterogeneity
among the intervention effects), or whether the effects should be synthesized in sep-
arate comparisons (Valentine and Thompson 2013). Decisions should be made for each
of the different types of non-randomized studies under consideration. Review authors
might anticipate increased heterogeneity when non-randomized studies are synthe-
sized, and adoption of a meta-analysis model that encompasses heterogeneity, such
as a random-effects model, is wise (Valentine and Thompson 2013; see Chapter 10,
Section 10.10.4). For further discussion of non-randomized studies, see Chapter 24.
3.4 Eligibility based on publication status and language
introduced by excluding all unpublished studies, given what is known about the impact
of reporting biases (see Chapter 13 on bias due to missing studies, and Chapter 4,
Section 4.3 for a more detailed discussion of searching for unpublished and grey
literature).
Likewise, while searching for, and analysing, studies in any language can be
extremely resource-intensive, review authors should consider carefully the implications
for bias (and equity, see Chapter 16) if they restrict eligible studies to those published in
one specific language (usually English). See Chapter 4 (Section 4.4.5) for further discus-
sion of language and other restrictions while searching.
3.6 References
Bailey JV, Murray E, Rait G, Mercer CH, Morris RW, Peacock R, Cassell J, Nazareth I.
Interactive computer-based interventions for sexual health promotion. Cochrane
Database of Systematic Reviews 2010; 9: CD006483.
Bender R, Bunce C, Clarke M, Gates S, Lange S, Pace NL, Thorlund K. Attention should be
given to multiplicity issues in systematic reviews. Journal of Clinical Epidemiology 2008;
61: 857–865.
Buchbinder R, Johnston RV, Rischin KJ, Homik J, Jones CA, Golmohammadi K, Kallmes DF.
Percutaneous vertebroplasty for osteoporotic vertebral compression fracture. Cochrane
Database of Systematic Reviews 2018; 4: CD006349.
Caldwell DM, Welton NJ. Approaches for synthesising complex mental health interventions
in meta-analysis. Evidence-Based Mental Health 2016; 19: 16–21.
Verheyden GSAF, Weerdesteyn V, Pickering RM, Kunkel D, Lennon S, Geurts ACH, Ashburn A.
Interventions for preventing falls in people after stroke. Cochrane Database of Systematic
Reviews 2013; 5: CD008728.
Weisz JR, Kuppens S, Ng MY, Eckshtain D, Ugueto AM, Vaughn-Coaxum R, Jensen-Doss A,
Hawley KM, Krumholz Marchette LS, Chu BC, Weersing VR, Fordwood SR. What five
decades of research tells us about the effects of youth psychological therapy: a multilevel
meta-analysis and implications for science and practice. American Psychologist 2017; 72:
79–117.
Welch V, Petkovic J, Simeon R, Presseau J, Gagnon D, Hossain A, Pardo Pardo J, Pottie K,
Rader T, Sokolovski A, Yoganathan M, Tugwell P, DesMeules M. Interactive social media
interventions for health behaviour change, health outcomes, and health equity in the
adult population. Cochrane Database of Systematic Reviews 2018; 2: CD012932.
Welton NJ, Caldwell DM, Adamopoulos E, Vedhara K. Mixed treatment comparison meta-
analysis of complex interventions: psychological interventions in coronary heart disease.
American Journal of Epidemiology 2009; 169: 1158–1165.
Williamson PR, Altman DG, Bagley H, Barnes KL, Blazeby JM, Brookes ST, Clarke M, Gargon E,
Gorst S, Harman N, Kirkham JJ, McNair A, Prinsen CAC, Schmitt J, Terwee CB, Young B.
The COMET Handbook: version 1.0. Trials 2017; 18: 280.
4
Searching for and selecting studies
Carol Lefebvre, Julie Glanville, Simon Briscoe, Anne Littlewood, Chris Marshall,
Maria-Inti Metzendorf, Anna Noel-Storr, Tamara Rader, Farhad Shokraneh, James
Thomas, L. Susan Wieland; on behalf of the Cochrane Information Retrieval
Methods Group
KEY POINTS
• Review authors should work closely, from the start of the protocol, with an experi-
enced medical/healthcare librarian or information specialist.
• Studies (not reports of studies) are included in Cochrane Reviews, but identifying
reports of studies is currently the most convenient approach to identifying the major-
ity of studies and obtaining information about them and their results.
• The Cochrane Central Register of Controlled Trials (CENTRAL) and MEDLINE, together
with Embase (if access to Embase is available to the review team), should be searched
for all Cochrane Reviews.
• Additionally, for all Cochrane Reviews, the Specialized Register of the relevant
Cochrane Review Group should be searched, either internally within the Review
Group or via CENTRAL.
• Trials registers should be searched for all Cochrane Reviews; other sources, such as
regulatory agencies and clinical study reports (CSRs), are an increasingly important
source of information for study results.
• Searches should aim for high sensitivity, which may result in relatively low precision.
• Search strategies should avoid using too many different search concepts, but a wide
variety of search terms should be combined with OR within each included concept.
• Both free-text terms and subject headings (e.g. Medical Subject Headings (MeSH) and
Emtree) should be used.
• Published, highly sensitive, validated search strategies (filters) to identify randomized
trials should be considered, such as the Cochrane Highly Sensitive Search Strategies
for identifying randomized trials in MEDLINE (but do not apply these randomized trial
or human filters in CENTRAL).
This chapter should be cited as: Lefebvre C, Glanville J, Briscoe S, Littlewood A, Marshall C, Metzendorf M-I,
Noel-Storr A, Rader T, Shokraneh F, Thomas J, Wieland LS. Chapter 4: Searching for and selecting studies.
In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook
for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 67–108.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
4.1 Introduction
Cochrane Reviews take a systematic and comprehensive approach to identifying stud-
ies that meet the eligibility criteria for the review. This chapter outlines some general
issues in searching for studies; describes the main sources of potential studies; and dis-
cusses how to plan the search process, design and carry out search strategies, manage
references found during the search process, correctly document the search process and
select studies from the search results.
This chapter aims to provide review authors with background information on all
aspects of searching for studies so that they can better understand the search process.
All authors of systematic reviews should, however, identify an experienced medical/
healthcare librarian or information specialist to provide support for the search process.
The chapter also aims to provide advice and guidance for medical/healthcare librarians
and information specialists (within and beyond Cochrane) involved in the search proc-
ess to identify studies for inclusion in systematic reviews.
This chapter focuses on searching for randomized trials. Many of the search princi-
ples discussed, however, will also apply to other study designs. Considerations for
searching for non-randomized studies are discussed in Chapter 24 (see also
Chapter 19 when these are specifically for adverse effects). Other discussion of search-
ing for specific types of evidence appears in chapters dedicated to these types of evi-
dence, such as Chapter 17 on complex and public health interventions, Chapter 20 on
economics evidence and Chapter 21 on qualitative research.
An online Technical Supplement to this chapter provides more detail on searching
methods and is available from Cochrane Training.
A key element of the Cochrane Review Group (CRG) information specialist’s role is the
maintenance of a Specialized Register for their Review
Group, containing reports of trials relating to the group’s scope. Within the limits of licen-
sing restrictions, the content of these group registers is shared with users worldwide via
the Cochrane Central Register of Controlled Trials (CENTRAL), part of the Cochrane
Library (see Section 4.3.3).
Most CRGs offer support to authors in study identification from the early planning
stage to the final write-up of the review, and the support available may include some
or all of the following:
Relying exclusively on a MEDLINE search may retrieve a set of reports unrepre-
sentative of all reports that would have been identified through a wider or more exten-
sive search of several sources.
Time and budget constraints require the review team to balance the thoroughness of
the search with efficiency in the use of time and funds. The best way of achieving this
balance is to be aware of, and try to minimize, the biases such as publication bias and
language bias that can result from restricting searches in different ways (see Chapters 8
and 13 for further guidance on assessing these biases). Unlike for tasks such as study
selection or data extraction, it is not considered necessary (or even desirable) for two
people to conduct independent searches in parallel. It is strongly recommended, how-
ever, that all search strategies should be peer reviewed by a suitably qualified and
experienced medical/healthcare librarian or information specialist (see Section 4.4.8).
Searching bibliographic databases is generally the most efficient way to identify an
initial set of relevant reports of studies
(EUnetHTA 2017). Database selection should be guided by the review topic (Suarez-
Almazor et al 2000, Stevinson and Lawlor 2004, Lorenzetti et al 2014). When topics
are specialized, cross-disciplinary, or involve emerging technologies (Rice et al 2016),
additional databases may need to be identified and searched (Wallace et al 1997,
Stevinson and Lawlor 2004).
The three bibliographic databases generally considered to be the most important
sources to search for reports of trials are CENTRAL, MEDLINE (Halladay et al 2015,
Sampson et al 2016) and Embase (Woods and Trewheellar 1998, Sampson et al 2003,
Bai et al 2007). These databases are described in more detail in Sections 4.3.1.2 and
4.3.1.3 and in the online Technical Supplement. For Cochrane Reviews, CENTRAL, MED-
LINE and Embase (if access to Embase is available to the review team) should be
searched (see MECIR Box 4.3.a). These searches may be undertaken specifically for
the review, or indirectly by searching the CRG’s Specialized Register.
Some bibliographic databases, such as MEDLINE and Embase, include abstracts for
the majority of recent records. A key advantage of such databases is that they can be
searched electronically both for words in the title or abstract and by using the standar-
dized indexing terms, or controlled vocabulary, assigned to each record (see
Section 4.3.1.2). Cochrane has developed a database of reports of randomized trials
called the Cochrane Central Register of Controlled Trials (CENTRAL), which is published
within the Cochrane Library (see Section 4.3.1.3).
Bibliographic databases are available to individuals for a fee (by subscription or on a
‘pay-as-you-go’ basis) or free at the point of use. They may be available through
national provisions, site-wide licences at institutions such as universities or hospitals,
through professional organizations as part of their membership packages or free-of-
charge on the internet. Some international initiatives provide free or low-cost online
access to databases (and full-text journals) over the internet. The Health InterNetwork
Access to Research Initiative (HINARI) programme, set up by the World Health Organ-
ization (WHO) together with major publishers, provides access to a wide range of data-
bases including the Cochrane Library for healthcare professionals in local, not-for-
profit institutions in more than 115 countries, areas and territories. The International
Network for the Availability of Scientific Publications (INASP) also provides access to a
wide range of databases (and journals) including the Cochrane Library. Electronic Infor-
mation for Libraries (EIFL) is a similar initiative based on library consortia to support
affordable licensing of journals and other sources in more than 60 low-income and
transition countries in central, eastern and south-east Europe, the former Soviet Union,
Africa, the Middle East and South-east Asia.
The online Technical Supplement provides more detailed information about how to
search these sources and other databases. It also provides a list of general healthcare
databases by region and healthcare databases by subject area. Further evidence-based
information about sources to search can be found on the SuRe Info portal, which is
updated twice per year.
Studies have found that clinical study reports (CSRs) prepared for regulatory
approval of drugs provided more complete information on study methods and results than
did trials register records or journal publications (Wieseler et al 2012) and that
conventional, publicly available sources (European Public Assessment Reports, jour-
nal publications, and trials register records) provide insufficient information on new
drugs, especially on patient relevant outcomes in approved subpopulations (Köhler
et al 2015).
A Cochrane Methodology Review examined studies assessing methods for obtain-
ing unpublished data and concluded that those carrying out systematic reviews
should continue to contact authors for missing data and that email contact was more
successful than other methods (Young and Hopewell 2011). An annotated bibliogra-
phy of published studies addressing searching for unpublished studies and obtaining
access to unpublished data is also available (Arber et al 2013). One particular study
focused on the contribution of unpublished studies, including dissertations, and
studies in languages other than English, to the results of meta-analyses in reviews
relevant to children (Hartling et al 2017). They found that, in their sample, unpub-
lished studies and studies in languages other than English rarely had any impact
on the results and conclusions of the review. They did, however, concede that inclu-
sion of these study types may have an impact in situations where there are few rel-
evant studies, or where there are ‘questionable vested interests’ in the published
literature.
Correspondence can be an important source of information about unpublished
studies. It is highly desirable for authors of Cochrane Reviews of interventions to con-
tact relevant individuals and organizations for information about unpublished or
ongoing studies (see MECIR Box 4.3.c). Letters of request for information can be used
to identify completed but unpublished studies. One way of doing this is to send a
comprehensive list of relevant articles along with the eligibility criteria for the review
to the first author of reports of included studies, asking if they know of any additional
studies (ongoing or completed; published or unpublished) that might be relevant.
This approach may be especially useful in areas where there are few trials or a limited
number of active research groups. It may also be desirable to send the same letter to
other experts and pharmaceutical companies or others with an interest in the area.
Some review teams set up websites for systematic review projects, listing the studies
identified to date and inviting submission of information on studies not already
listed.
Asking researchers for information about completed but never published studies has
not always been found to be fruitful (Hetherington et al 1989, Horton 1997) though
some researchers have reported that this is an important method for retrieving studies
for systematic reviews (Royle and Milne 2003, Greenhalgh and Peacock 2005, Reveiz
et al 2006). The RIAT (Restoring Invisible and Abandoned Trials) initiative (Doshi
et al 2013) aims to address these problems by offering a methodology that allows
others to re-publish mis-reported trials and to publish unreported trials. Anyone who can
access the trial data and document trial abandonment can use this methodology.
The RIAT Support Centre offers free-of-charge support and competitive funding to
researchers interested in this approach. It has been suggested that legislation such
as Freedom of Information Acts in various countries might be used to gain access to
information about unpublished trials (Bennett and Jull 2003, MacLean et al 2003).
• whether the review has specific eligibility criteria around study design to address
adverse effects (see Chapter 19), economic issues (see Chapter 20) or qualitative
research questions (see Chapter 21), in which case searches to address these criteria
should be undertaken (see MECIR Box 4.4.a).
comparator, for example if the comparator is explicitly placebo; in other cases the out-
comes may be particularly well defined and consistently reported in abstracts. The
advice on whether or not to search for outcomes for adverse effects differs from the
advice given earlier (see Chapter 19).
Some search strategies may not easily divide into the structure suggested, par-
ticularly for reviews addressing complex or unknown interventions, or diagnostic
tests (Huang et al 2006, Irvin and Hayden 2006, Petticrew and Roberts 2006, de
Vet et al 2008, Booth 2016). Cochrane Reviews of public health interventions
and of qualitative data may adopt very different search approaches to those
described here (Lorenc et al 2014, Booth 2016) (see Chapter 17 on complex
interventions and Chapter 21 on qualitative evidence). In these circumstances,
review authors might, for example:
• use a single concept such as searching for the intervention alone (European Food
Safety Authority 2010);
• break a concept into two or more subconcepts;
• use a multi-stranded or multi-faceted approach that uses a series of searches,
with different combinations of concepts, to capture a complex research question
(Lefebvre et al 2013) (a minimal sketch follows this list);
• use a variety of different search approaches to compensate for when a specific
concept is difficult to define (Shemilt et al 2014); or
• use citation searching on key papers in addition to a database search (Haddaway
et al 2015, Hinde and Spackman 2015) (see online Technical Supplement).
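To make the multi-stranded approach concrete, here is a minimal sketch; the run_search() helper is a hypothetical stand-in for a real database API, and the strategies and record identifiers are invented. The essential point is that each strand combines a different subset of concepts, and the set of records to screen is the deduplicated union across strands.

```python
def run_search(strategy: str) -> set[str]:
    # Placeholder standing in for a real database API call; returns
    # invented record IDs so the sketch runs end-to-end.
    return {f"rec-{abs(hash(strategy)) % 1000}-{i}" for i in range(3)}

# Each strand combines a different subset of concepts from the question.
strands = [
    '"smoking cessation" AND pregnan*',               # intervention + population
    '"smoking cessation" AND (midwife OR midwives)',  # intervention + provider
    'tobacco AND pregnan* AND counsel*',              # broader intervention terms
]

# The set to screen is the deduplicated union of all strands.
records_to_screen: set[str] = set()
for strategy in strands:
    records_to_screen |= run_search(strategy)
print(f"{len(records_to_screen)} unique records to screen")
```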
                      Retrieved                           Not retrieved
Relevant reports      Relevant reports retrieved (a)      Relevant reports not retrieved (b)
Irrelevant reports    Irrelevant reports retrieved (c)    Irrelevant reports not retrieved (d)

Sensitivity: fraction of relevant reports retrieved from all relevant reports (a/(a+b))
Precision: fraction of relevant reports retrieved from all reports retrieved (a/(a+c))
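As a worked illustration of these two definitions, the sketch below uses invented counts; only the formulas a/(a+b) and a/(a+c) come from the table above.

```python
def sensitivity(a: int, b: int) -> float:
    """Fraction of all relevant reports that the search retrieved: a/(a+b)."""
    return a / (a + b)

def precision(a: int, c: int) -> float:
    """Fraction of all retrieved reports that are relevant: a/(a+c)."""
    return a / (a + c)

# Invented counts: 40 of 50 relevant reports retrieved, alongside 1960 irrelevant ones.
print(sensitivity(a=40, b=10))   # 0.8
print(precision(a=40, c=1960))   # 0.02: sensitive searches often have low precision
```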
(see MECIR Box 4.4.d). For example, excluding letters is not recommended because let-
ters may contain important additional information relating to an earlier trial report or
new information about a trial not reported elsewhere (Iansavichene et al 2008).
In addition, articles indexed as ‘Comments’ should not be routinely excluded without
further examination as these may contain early warnings of suspected fraud
(see Section 4.4.6).
Evidence indicates that excluding non-English studies does not change the conclu-
sions of most systematic reviews (Morrison et al 2012, Jiao et al 2013, Hartling et al
2017), although exceptions have been observed for complementary and alternative
medicine (Moher et al 2003, Pham et al 2005, Wu et al 2013). There is, however, also
research related to language bias that supports the inclusion of non-English studies
in systematic reviews (Egger et al 1997). For further discussion of these issues see
Chapter 13.
Inclusion of non-English studies may also increase the precision of the result and the
generalizability and applicability of the findings. There may be differences in therapeu-
tic response to pharmaceutical agents according to ethnicity, either because of pheno-
type and pathogenesis of disease due to environmental factors or because of
population pharmacogenomics and pharmacogenetics (Brusselle and Blasi 2015).
The inclusion of non-English studies also makes it possible to perform sensitivity ana-
lyses to find out if there is any geographical bias in reporting the positive findings
(Vickers et al 1998, Kaptchuk 1999). Inclusion of non-English studies can also be an
indicator of the quality of a systematic review (Wang et al 2015).
Limiting searching to databases containing predominantly English-language records,
even if no language restrictions are applied, may result in missed relevant studies
(Pilkington et al 2005). Review authors should, therefore, attempt to identify and assess
for eligibility all possibly relevant reports of trials irrespective of language of publica-
tion. If a Cochrane Review team requires help with translation of and/or data extraction
from non-English language reports of studies, they should seek assistance to do so
(this is a common task for which volunteer assistance can be sought via Cochrane’s
TaskExchange platform, accessible to both Cochrane and non-Cochrane review teams).
Where it is not possible to extract the relevant information and data from non-English
language reports, the review team should file the study in ‘Studies Awaiting Classifica-
tion’ rather than ‘Excluded Studies’, to inform readers of the review of the availability
of other possibly relevant reports and reflect this information in the PRISMA flow
diagram (or, if there is no flow diagram, then in the text of the review) as ‘Studies
Awaiting Classification’.
Including data from studies that are fraudulent or studies that include errors can
have an impact on the overall estimates in systematic reviews. Details of how to identify
fraudulent studies, other retracted publications, errata and comments are described in
the online Technical Supplement.
4.4.8 Peer review of search strategies
Evidence shows that errors occur in the search strategies
underpinning systematic reviews (Sampson and McGowan 2006) and that search stra-
tegies are not always conducted or reported to a high standard (Mullins et al 2014,
Layton 2017). An evidence-based checklist such as the PRESS Evidence-Based
Checklist should be used to assess which elements are important in peer review of
electronic search strategies (McGowan et al 2016a, McGowan et al 2016b). The check-
list covers not only the technical accuracy of the strategy (line numbers, spellings,
etc), but also that the search strategy covers all relevant aspects of the protocol
and has interpreted the research question appropriately. Research has shown that
peer review using a specially designed checklist can improve the quality of searches
(Relevo and Paynter 2012, Spry et al 2013). The names, credentials and institutions of
the peer reviewers of the search strategies should be noted in the review (with their
permission) in the Acknowledgements section.
4.4.9 Alerts
Alerts, also called literature surveillance services, ‘push’ services or SDIs (selective
dissemination of information), are an excellent method of staying up to date with
the medical literature currently being published, as a supplement to designing and
running specific searches for specific reviews. In practice, alerts are based on a pre-
viously developed search strategy, which is saved in a personal account on the data-
base platform (e.g. ‘My EBSCOhost – search alerts’ on EBSCO, ‘My searches & alerts’
on Ovid and ‘MyNCBI – saved searches’ on PubMed). These saved strategies filter the
content as the database is being updated with new information. The account owner is
notified (usually via email) when new publications meeting their specified search
parameters are added to the database. In the case of PubMed, the alert can be set
up to be delivered weekly or monthly, or in real-time and can comprise email or
RSS feeds.
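A simple do-it-yourself alternative, sketched below under the assumption that the saved strategy can be expressed as a PubMed query, polls the public NCBI E-utilities esearch endpoint for records added in the last seven days; the query string is illustrative only.

```python
# Poll PubMed for records added in the last 7 days matching a saved strategy,
# using the public NCBI E-utilities esearch endpoint.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def new_pubmed_records(strategy: str, days: int = 7) -> list[str]:
    """Return PMIDs of records matching `strategy` added in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": strategy,
        "reldate": days,     # restrict to the last `days` days
        "datetype": "edat",  # Entrez (database entry) date
        "retmax": 500,
        "retmode": "json",
    }
    response = requests.get(ESEARCH, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# Illustrative strategy; a real alert would reuse the review's saved strategy.
pmids = new_pubmed_records('"smoking cessation"[tiab] AND pregnancy[mh]')
print(f"{len(pmids)} new records this week: {pmids[:5]}")
```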
For review authors, alerts are a useful tool to help monitor what is being published in
their review topic after the original search has been conducted. By following the alert,
authors can become aware of a new study that meets the review’s eligibility criteria,
and decide either to include it in the review immediately or mention it as a ‘study await-
ing assessment’ for inclusion during the next review update (see online Chapter IV).
Authors should consider setting up alerts so that the review can be as current as pos-
sible at the time of publication.
Another way of attempting to stay current with the literature as it emerges is by using
alerts based on journal tables of contents (TOCs). These usually cannot be specifically
tailored to the information needs in the same way as search strategies developed to
cover a specific topic. They can, however, be a good way of trying to keep up to date
on a more general level by monitoring what is currently being published in journals of
interest. Many journals, even those that are available by subscription only, offer TOC
alert services free of charge. In addition, a number of publishers and organizations offer
TOC services (see online Technical Supplement). Use of TOCs is proposed not as a
single alternative to the various other methods of study identification necessary for
undertaking systematic reviews, but rather as a supplementary method. (See also Chapter 22,
Section 22.2 for a discussion of new technologies to support evidence surveillance
in the context of ‘living’ systematic reviews.)
cut-off (Chilcott et al 2003). Stopping might also be appropriate when the removal of
terms or concepts results in missing relevant records. Another consideration is the
amount of evidence that has already accrued: in topics where evidence is scarce,
authors might need to be more cautious about deciding when to stop searching.
Although many methods have been described to assist with deciding when to stop
developing the search, there has been little formal evaluation of the approaches
(Booth 2010, Wood and Arber 2019).
At a basic level, review teams need to investigate whether a strategy is performing
adequately. One simple test is to check whether the search is finding the publications that
have been recommended as key publications or that have been included in other sim-
ilar reviews (EUnetHTA 2017). It is not enough, however, for the strategy to find only
those records; otherwise this might be a sign that the strategy is biased towards
known studies and that other relevant records are being missed. In addition, citation
searches and reference checking are useful checks of strategy performance. If those
additional methods are finding documents that the searches have already retrieved,
but that the team did not necessarily know about in advance, then this is one sign
that the strategy might be performing adequately. Also, an evidence-based checklist
such as the PRESS Evidence-Based Checklist (McGowan et al 2016b) should be used to
assess whether the search strategy is adequate (see Section 4.4.8). If some of the
PRESS dimensions seem to be missing without adequate explanation or arouse con-
cerns, then the search may not yet be complete.
Statistical techniques can be used to assess performance, such as capture-recapture
(Spoor et al 1996) (also known as capture-mark-recapture; Kastner et al 2009), or the
relative recall technique (Sampson et al 2006, Sampson and McGowan 2011). Kastner
suggests the capture-mark-recapture technique merits further investigation since it
could be used to estimate the number of studies in a literature prospectively and to
determine where to stop searches once suitable cut-off levels have been identified.
Kastner’s approach involves searching databases, conducting record selection, calcu-
lating capture-mark-recapture and then making decisions about whether further
searches are necessary. This would potentially entail an iterative search and selection
process. Capture-recapture needs results from at least two searches to estimate the
number of missed studies. Further investigation of published prospective techniques
seems warranted to learn more about the potential benefits.
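As a worked illustration of the underlying arithmetic, the sketch below applies the simple Lincoln-Petersen estimator (total ≈ n1 × n2 / overlap), treating two searches as two 'capture' occasions; the counts are invented, and a real application should consider the more refined variants discussed in the literature cited above.

```python
def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Estimated total number of relevant studies, from two searches and their overlap."""
    if overlap == 0:
        raise ValueError("No overlap between searches: the estimator is undefined.")
    return (n1 * n2) / overlap

# Invented counts: search A finds 60 relevant studies, search B finds 45,
# and 30 studies are found by both searches.
estimated_total = lincoln_petersen(60, 45, 30)
found_so_far = 60 + 45 - 30
print(f"Estimated total relevant studies: {estimated_total:.0f}")            # 90
print(f"Estimated studies still missed: {estimated_total - found_so_far:.0f}")  # 15
```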
Relative recall (Sampson et al 2006, Sampson and McGowan 2011) requires a range of
searches to have been conducted so that the relevant studies have been built up by a
set of sensitive searches. The performance of the individual searches can then be
assessed in each individual database by determining how many of the studies that
were deemed eligible for the evidence synthesis, and were indexed within that database,
were found by the database search used to populate the synthesis. If a search in a
database did not perform well and missed many studies, then that search strategy is
likely to have been suboptimal. If the search strategy found most of the studies that
were available to be found in the database then it was likely to have been a sensitive
strategy. Assessments of precision could also be made, but these mostly inform future
search approaches since they cannot affect the searches and record assessment
already undertaken. Relative recall may be most useful at the end of the search process
since it relies on the achievement of several searches to make judgements about the
overall performance of strategies.
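A minimal sketch of the relative recall calculation for a single database follows; the record identifiers are invented.

```python
def relative_recall(included_and_indexed: set[str], retrieved_by_search: set[str]) -> float:
    """Of the eligible studies indexed in a database, the fraction its search found."""
    found = included_and_indexed & retrieved_by_search
    return len(found) / len(included_and_indexed)

# Illustrative identifiers only.
indexed_eligible = {"s1", "s2", "s3", "s4", "s5"}  # eligible studies indexed in the database
search_hits = {"s1", "s2", "s3", "s9", "s12"}      # records the database strategy retrieved
print(f"Relative recall: {relative_recall(indexed_eligible, search_hits):.0%}")  # 60%
```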
In evidence synthesis involving qualitative data, searching is often more organic and
intertwined with the analysis such that the searching stops when new information
ceases to be identified (Booth 2016). The reasons for stopping need to be documented
and it is suggested that explanations or justifications for stopping may centre around
saturation (Booth 2016). Further information on searches for qualitative evidence can
be found in Chapter 21.
4.5 Documenting and reporting the search process
Search strategies should be copied and pasted exactly as run, and not re-typed,
because re-typing can introduce errors. The same process is also good practice for searches of trials registers and other
sources, where the interface used, such as introductory or advanced, should also be
specified. Creating a report of the search process can be accomplished through
methodical documentation of the steps taken by the searcher. This need not be oner-
ous if suitable record keeping is performed during the process of the search, but it can
be nearly impossible to recreate post hoc. Many database interfaces have facilities for
search strategies to be saved online or to be emailed; an offline copy in text format
should also be saved. For some databases, taking and saving a screenshot of the search
may be the most practical approach (Rader et al 2014).
Documenting the searching of sources other than databases, including the search
terms used, is also required if searches are to be reproducible (Atkinson et al 2015,
Chow 2015, Witkowski and Aldhouse 2015). Details about contacting experts or man-
ufacturers, searching reference lists, scanning websites, and decisions about search
iterations can be kept internally for future updates or external requests and can be
reproduced as an appendix in the final document. Since the purpose of search docu-
mentation is to support transparency, internal assessment, and reference for any
future update, it is important to plan how to record searching of sources other than
databases since some activities (contacting experts, reference list searching, and for-
ward citation searching) will occur later on in the review process after the database
results have been screened (Rader et al 2014). The searcher should record any corre-
spondence on key decisions and report a summary of this correspondence alongside
the search strategy. The narrative describes the major decisions that shaped the strat-
egy and can give a peer reviewer an insight into the rationale for the search approach
(Craven and Levay 2011).
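One possible shape for such a record, sketched below with illustrative field names rather than a prescribed standard, is a structured log entry appended to a file kept with the review.

```python
# One search event = one structured log entry, saved alongside the review files.
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class SearchLogEntry:
    source: str             # database, register, website, or 'correspondence'
    interface: str          # e.g. 'Ovid', 'advanced search'
    date_run: str
    strategy: str           # the full strategy, pasted exactly as run
    records_retrieved: int
    notes: str = ""         # key decisions, correspondence summaries

entry = SearchLogEntry(
    source="MEDLINE", interface="Ovid", date_run=str(date.today()),
    strategy="1. exp Smoking Cessation/ ...", records_retrieved=1243,
    notes="Strategy peer reviewed using PRESS before running.",
)
with open("search_log.jsonl", "a", encoding="utf-8") as log:
    log.write(json.dumps(asdict(entry)) + "\n")
```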
It is particularly important to save locally or file print copies of any information found
on the internet, such as information about ongoing and/or unpublished trials, as this
information may no longer be accessible at the time the review is written. Local copies
should be stored in a structured way to allow retrieval when needed. There are also
web-based tools which archive webpage content for future reference, such as WebCite
(Eysenbach and Trudel 2005). The results of web searches will not be reproducible to
the same extent as bibliographic database searches because web content and search
engine algorithms frequently change, and search results can differ between users due
to a general move towards localization and personalization. It is still important, how-
ever, to document the search process to ensure that the methods used can be trans-
parently reported (Briscoe 2018). In cases where a search engine retrieves more results
than it is practical to screen in full (it is rarely practical to search thousands of web
results, as the precision of web searches is likely to be relatively low), the number
of results that are documented and reported should be the number that were screened
rather than the total number (Dellavalle et al 2003, Bramer 2016).
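A minimal sketch of saving a dated local copy of a web page follows; the URL is illustrative, and the timestamped naming scheme is one possible convention rather than a standard.

```python
# Save the raw HTML of a page with a timestamped file name for later citation.
from datetime import datetime
from pathlib import Path
import requests

def archive_page(url: str, folder: str = "web_archive") -> Path:
    """Fetch `url` and write its HTML to a timestamped local file; return the path."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    safe_name = url.replace("://", "_").replace("/", "_")[:100]
    path = Path(folder) / f"{stamp}_{safe_name}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(response.text, encoding="utf-8")
    return path

saved = archive_page("https://www.who.int/clinical-trials-registry-platform")
print(f"Archived to {saved}")
```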
Decisions should be documented for all records identified by the search. Details of the
flow of studies from the number(s) of references identified in the search to the number of
studies included in the review will need to be reported in the final review, ideally using a
flow diagram such as that proposed by PRISMA (see online Chapter III); these can be gen-
erated using software including Covidence, DistillerSR, EPPI-Reviewer, the METAGEAR
package for R, the PRISMA Flow Diagram Generator, and RevMan. A table of ‘Character-
istics of excluded studies’ will also need to be presented (see Section 4.6.5). Numbers of
records are sufficient for exclusions based on initial screening of titles and abstracts.
Broad categorizations are sufficient for records classed as potentially eligible during
an initial screen of the full text. Authors will need to decide for each review when to
map records to studies (if multiple records refer to one study). The flow diagram first
records the total number of records retrieved from the various sources, and then the total
number of studies to which these records relate. Review authors need to match the various
records to the various studies in order to complete the flow diagram correctly. Lists of
included and excluded studies must be based on studies rather than records (see also
Section 4.6.1).
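A minimal sketch of the record-to-study mapping follows, with invented identifiers; it shows why the counts of records and of studies in the flow diagram differ.

```python
# Several records (reports) may describe one study; the flow diagram and the
# included/excluded lists must count studies, not records.
from collections import defaultdict

# record ID -> study ID, as resolved during linking (illustrative data).
record_to_study = {
    "rec1": "TrialA", "rec2": "TrialA",   # two reports of the same trial
    "rec3": "TrialB",
    "rec4": "TrialC", "rec5": "TrialC", "rec6": "TrialC",
}

studies = defaultdict(list)
for record, study in record_to_study.items():
    studies[study].append(record)

print(f"Records retrieved: {len(record_to_study)}")   # 6
print(f"Studies they relate to: {len(studies)}")      # 3
for study, records in studies.items():
    print(f"  {study}: {len(records)} report(s)")
```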
• records of ongoing trials for which results (either published or unpublished) are not
(yet) available; and
• records of studies which seem to be eligible but for which data are incomplete or the
publication related to the record could not be obtained.
8) Tag or record any ongoing trials which have not yet been reported so that they can
be added to the ongoing studies table.
Note that studies should not be omitted from a review solely on the basis of measured
outcome data not being reported (see MECIR Box 4.6.b and Chapter 13).
number of references and subsequently the number of studies so that a flow diagram
can be constructed. The decision and reasons for exclusion can be tracked using ref-
erence software, a simple document or spreadsheet, or using specialist systematic
review software (see Section 4.6.6.1).
• systems that support the study selection process, typically involving multiple
reviewers (see Section 4.6.6.1); and
• tools and techniques based on text mining and/or machine learning, which aim to
semi- or fully-automate the selection process (see Section 4.6.6.2).
Software to support the selection process, along with other stages of a systematic
review, including text mining tools, can be identified using the Systematic Review Tool-
box. The SR Toolbox is a community-driven, web-based catalogue of tools that provide
support for systematic reviews (Marshall and Brereton 2015).
• Abstrackr – a free web-based screening tool that can prioritize the screening of
records using machine learning techniques.
• Covidence – a web-based software platform for conducting systematic reviews,
which includes support for collaborative title and abstract screening, full-text review,
risk-of-bias assessment and data extraction. Full access to this system normally
requires a paid subscription but is free for authors of Cochrane Reviews. A free trial
for non-Cochrane review authors is also available.
• DistillerSR – a web-based software application for undertaking bibliographic record
screening and data extraction. It has a number of management features to track
progress, assess interrater reliability and export data for further analysis. Reduced
pricing for Cochrane and Campbell reviews is available.
• EPPI-Reviewer – web-based software designed to support all stages of the sys-
tematic review process, including reference management, screening, risk of bias
assessment, data extraction and synthesis. The system is free to use for Cochrane
and Campbell reviews, otherwise it requires a paid subscription. A free trial is
available.
• Rayyan – a web-based application for collaborative citation screening and full-text
selection. The system is currently available free of charge (June 2018).
Compatibility with other software tools used in the review process (such as
RevMan) may be a consideration when selecting a tool to support study selection.
Covidence and EPPI-Reviewer are Cochrane-preferred tools, and are likely to have the
strongest integration with RevMan.
• Abstrackr (http://abstrackr.cebm.brown.edu/);
• Colandr (https://www.colandrapp.com/);
• EPPI-Reviewer (http://eppi.ioe.ac.uk/);
• Rayyan (http://rayyan.qcri.org/);
• RobotAnalyst (http://nactem.ac.uk/robotanalyst/); and
• Swift-review (http://swift.sciome.com/swift-review/).
Finally, tools are available that use natural language processing to highlight sen-
tences and key phrases automatically (e.g. PICO elements, trial characteristics, details
of randomization) to support the reviewer whilst screening (Tsafnat et al 2014).
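To make the screening-prioritization idea concrete, the sketch below trains a simple text classifier on records already screened and ranks the remainder, in the spirit of tools such as Abstrackr; it assumes the scikit-learn library, uses invented titles and labels, and is not the algorithm of any particular tool.

```python
# Rank unscreened records by predicted relevance, learned from screened ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

screened_texts = [
    "randomized trial of nicotine patches in pregnancy",
    "cohort study of air pollution and asthma",
    "RCT of counselling for smoking cessation",
    "editorial on tobacco policy",
]
screened_labels = [1, 0, 1, 0]   # 1 = included at title/abstract stage

unscreened_texts = [
    "placebo-controlled trial of varenicline in pregnant smokers",
    "letter about conference logistics",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(screened_texts)
model = LogisticRegression().fit(X, screened_labels)

# Rank so that likely includes are screened first.
scores = model.predict_proba(vectorizer.transform(unscreened_texts))[:, 1]
for text, score in sorted(zip(unscreened_texts, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {text}")
```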
Acknowledgements: This chapter has been developed from sections of previous edi-
tions of the Cochrane Handbook co-authored since 1995 by Kay Dickersin, Julie
Glanville, Kristen Larson, Carol Lefebvre and Eric Manheimer. Many of the sources listed
in this chapter and the accompanying online Technical Supplement have been brought
to our attention by a variety of people over the years and we should like to acknowl-
edge this. We should like to acknowledge: Ruth Foxlee, (formerly) Information Special-
ist, Cochrane Editorial Unit; Miranda Cumpston, (formerly) Head of Learning & Support,
Cochrane Central Executive; Colleen Finley, Product Manager, John Wiley and Sons, for
checking sections relating to searching the Cochrane Library; the (UK) National Insti-
tute for Health and Care Excellence and the German Institute for Quality and Efficiency
in Health Care (IQWiG) for support in identifying some of the references; the (US) Agency
for Healthcare Research and Quality (AHRQ) Effective Healthcare Program Scientific
Resource Center Article Alert service; Tianjing Li, Co-Convenor, Comparing Multiple
Interventions Methods Group, for text and references that formed the basis of the
re-drafting of parts of Section 4.6 Selecting studies; Lesley Gillespie, Cochrane author
and former Editor and Trials Search Co-ordinator of the Cochrane Bone, Joint and
Muscle Trauma Group, for copy-editing an early draft; the Cochrane Information
Specialist Executive, the Cochrane Information Specialists’ Support Team, Cochrane
Information Specialists and members of the Cochrane Information Retrieval Methods
Group for comments on drafts; Su Golder, Co-Convenor, Adverse Effects Methods
Group and Steve McDonald, Co-Director, Cochrane Australia for peer review.
4.8 References
Agency for Healthcare Research and Quality. Methods guide for effectiveness and
comparative effectiveness reviews: AHRQ publication no. 10(14)-EHC063-EF. 2014.
https://effectivehealthcare.ahrq.gov/topics/cer-methods-guide/overview.
Arber M, Cikalo M, Glanville J, Lefebvre C, Varley D, Wood H. Annotated bibliography of
published studies addressing searching for unpublished studies and obtaining access to
unpublished data. 2013. https://methods.cochrane.org/sites/methods.cochrane.org.
irmg/files/public/uploads/Annotatedbibliographtifyingunpublishedstudies.pdf.
Atkinson KM, Koenka AC, Sanchez CE, Moshontz H, Cooper H. Reporting standards
for literature searches and report inclusion criteria: making research
syntheses more transparent and easy to replicate. Research Synthesis Methods 2015;
6: 87–95.
Bai Y, Gao J, Zou D, Li Z. Is MEDLINE alone enough for a meta-analysis? Alimentary
Pharmacology and Therapeutics 2007; 26: 125–126; author reply 126.
Doshi P, Dickersin K, Healy D, Vedula SS, Jefferson T. Restoring invisible and abandoned
trials: a call for people to publish the findings. BMJ 2013; 346: f2865.
Doust JA, Pietrzak E, Sanders S, Glasziou PP. Identifying studies for systematic reviews
of diagnostic tests was difficult due to the poor sensitivity and precision of methodologic
filters and the lack of information in the abstract. Journal of Clinical Epidemiology 2005;
58: 444–449.
Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research.
Lancet 1991; 337: 867–872.
Edwards P, Clarke M, DiGuiseppi C, Pratap S, Roberts I, Wentz R. Identification of
randomized controlled trials in systematic reviews: accuracy and reliability of screening
records. Statistics in Medicine 2002; 21: 1635–1640.
Egger M, Zellweger-Zähner T, Schneider M, Junker C, Lengeler C, Antes G. Language bias in
randomised controlled trials published in English and German. Lancet 1997; 350:
326–329.
Egger M, Smith GD. Bias in location and selection of studies. BMJ 1998; 316: 61–66.
Elsevier. Embase content 2016a. https://www.elsevier.com/solutions/embase-biomedical-
research/embase-coverage-and-content.
Elsevier. Embase classic fact sheet 2016b. https://www.elsevier.com/__data/assets/
pdf_file/0005/58982/R_D-Solutions_Embase_Fact-Sheet_Classic-DIGITAL.pdf.
EUnetHTA. Process of information retrieval for systematic reviews and health technology
assessments on clinical effectiveness (Version 1.2). Germany: European network for
Health Technology Assessment; 2017. https://www.eunethta.eu/wp-content/uploads/
2018/01/Guideline_Information_Retrieval_V1-2_2017.pdf.
European Food Safety Authority. Application of systematic review methodology to food
and feed safety assessments to support decision making. EFSA Journal 2010; 8: 1637.
doi:10.2903/j.efsa.2010.1637. www.efsa.europa.eu.
Eysenbach G, Trudel M. Going, going, still there: using the WebCite service to permanently
archive cited web pages. Journal of Medical Internet Research 2005; 7: e60.
Franco JVA, Garrote VL, Escobar Liquitay CM, Vietto V. Identification of problems in search
strategies in Cochrane Reviews. Research Synthesis Methods 2018; 9: 408–416.
Glanville J, Lefebvre C, Wright K, editors. ISSG Search Filter Resource. York (UK): The
InterTASC Information Specialists’ Sub-Group; 2019. https://sites.google.com/a/york.ac.
uk/issg-search-filters-resource/home.
Glanville JM, Duffy S, McCool R, Varley D. Searching ClinicalTrials.gov and the International
Clinical Trials Registry Platform to inform systematic reviews: what are the optimal search
approaches? Journal of the Medical Library Association 2014; 102: 177–183.
Goldacre B, Lane S, Mahtani KR, Heneghan C, Onakpoya I, Bushfield I, Smeeth L.
Pharmaceutical companies’ policies on access to trial data, results, and methods: audit
study. BMJ 2017; 358: j3334.
Greenhalgh T, Peacock R. Effectiveness and efficiency of search methods in systematic
reviews of complex evidence: audit of primary sources. BMJ 2005; 331: 1064–1065.
Haddaway NR, Collins AM, Coughlin D, Kirk S. The role of Google Scholar in evidence reviews
and its applicability to grey literature searching. PloS One 2015; 10: e0138237.
Halfpenny NJ, Quigley JM, Thompson JC, Scott DA. Value and usability of unpublished data
sources for systematic reviews and network meta-analyses. Evidence-Based Medicine
2016; 21: 208–213.
Halladay CW, Trikalinos TA, Schmid IT, Schmid CH, Dahabreh IJ. Using data sources beyond
PubMed has a modest impact on the results of systematic reviews of therapeutic
interventions. Journal of Clinical Epidemiology 2015; 68: 1076–1084.
Hartling L, Featherstone R, Nuspl M, Shave K, Dryden DM, Vandermeer B. The contribution of
databases to the results of systematic reviews: a cross-sectional study. BMC Medical
Research Methodology 2016; 16: 127.
Hartling L, Featherstone R, Nuspl M, Shave K, Dryden DM, Vandermeer B. Grey literature in
systematic reviews: a cross-sectional study of the contribution of non-English reports,
unpublished studies and dissertations to the results of meta-analyses in child-relevant
reviews. BMC Medical Research Methodology 2017; 17: 64.
Hetherington J, Dickersin K, Chalmers I, Meinert CL. Retrospective and prospective
identification of unpublished controlled trials: lessons from a survey of obstetricians and
pediatricians. Pediatrics 1989; 84: 374–380.
Hinde S, Spackman E. Bidirectional citation searching to completion: an exploration of
literature searching methods. Pharmacoeconomics 2015; 33: 5–11.
Horton R. Medical editors trial amnesty. Lancet 1997; 350: 756.
Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation
for clinical questions. AMIA Annual Symposium Proceedings/AMIA Symposium 2006:
359–363.
Hwang TJ, Carpenter D, Lauffenburger JC, Wang B, Franklin JM, Kesselheim AS. Failure of
investigational drugs in late-stage clinical development and publication of trial results.
JAMA Internal Medicine 2016; 176: 1826–1833.
Iansavichene AE, Sampson M, McGowan J, Ajiferuke IS. Should systematic reviewers search
for randomized, controlled trials published as letters? Annals of Internal Medicine 2008;
148: 714–715.
Institute of Medicine. Finding what works in health care: standards for systematic reviews.
Washington, DC: The National Academies Press; 2011. http://books.nap.edu/openbook.
php?record_id=13059
Irvin E, Hayden J. Developing and testing an optimal search strategy for identifying studies
of prognosis [Poster]. 14th Cochrane Colloquium; 2006 October 23–26; Dublin, Ireland;
2006. https://abstracts.cochrane.org/2006-dublin/developing-and-testing-optimal-
search-strategy-identifying-studies-prognosis.
Isojarvi J, Wood H, Lefebvre C, Glanville J. Challenges of identifying unpublished data from
clinical trials: getting the best out of clinical trials registers and other novel sources.
Research Synthesis Methods 2018; 9: 561–578.
Jefferson T, Doshi P, Boutron I, Golder S, Heneghan C, Hodkinson A, Jones M, Lefebvre C,
Stewart LA. When to include clinical study reports and regulatory documents in
systematic reviews. BMJ Evidence-Based Medicine 2018; 23: 210–217.
Jiao S, Tsutani K, Haga N. Review of Cochrane reviews on acupuncture: how Chinese
resources contribute to Cochrane reviews. Journal of Alternative and Complementary
Medicine 2013; 19: 613–621.
Kaptchuk T. Certain countries produce only positive trial results. Focus on Alternative and
Complementary Therapies 1999; 4: 86–87.
Kastner M, Straus SE, McKibbon KA, Goldsmith CH. The capture-mark-recapture technique
can be used as a stopping rule when searching in systematic reviews. Journal of Clinical
Epidemiology 2009; 62: 149–157.
Marshall I, Noel-Storr A, Kuiper J, Thomas J, Wallace BC. Machine learning for identifying
randomized controlled trials: an evaluation and practitioner’s guide. Research Synthesis
Methods 2018; 9: 602–614.
McGowan J, Sampson M, Salzwedel D, Cogo E, Foerster V, Lefebvre C. PRESS Peer Review of
Electronic Search Strategies: 2015 Guideline Explanation and Elaboration (PRESS E&E).
Ottawa: CADTH; 2016a. https://www.cadth.ca/sites/default/files/pdf/
CP0015_PRESS_Update_Report_2016.pdf.
McGowan J, Sampson M, Salzwedel DM, Cogo E, Foerster V, Lefebvre C. PRESS Peer Review
of Electronic Search Strategies: 2015 Guideline Statement. Journal of Clinical
Epidemiology 2016b; 75: 40–46.
Meert D, Torabi N, Costella J. Impact of librarians on reporting of the literature searching
component of pediatric systematic reviews. Journal of the Medical Library Association
2016; 104: 267–277.
Metzendorf M-I. Why medical information specialists should routinely form part of teams
producing high quality systematic reviews – a Cochrane perspective. Journal of the
European Association for Health Information and Libraries 2016; 12(4): 6–9.
Moher D, Pham B, Lawson ML, Klassen TP. The inclusion of reports of randomised trials
published in languages other than English in systematic reviews. Health Technology
Assessment 2003; 7: 1–90.
Morrison A, Polisena J, Husereau D, Moulton K, Clark M, Fiander M, Mierzwinski-Urban M,
Clifford T, Hutton B, Rabb D. The effect of English-language restriction on systematic
review-based meta-analyses: a systematic review of empirical studies. International
Journal of Technology Assessment in Health Care 2012; 28: 138–144.
Mullins MM, DeLuca JB, Crepaz N, Lyles CM. Reporting quality of search methods in
systematic reviews of HIV behavioral interventions (2000–2010): are the searches clearly
explained, systematic and reproducible? Research Synthesis Methods 2014; 5: 116–130.
Niederstadt C, Droste S. Reporting and presenting information retrieval processes: the need
for optimizing common practice in health technology assessment. International Journal
of Technology Assessment in Health Care 2010; 26: 450–457.
O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using text mining for study
identification in systematic reviews: a systematic review of current approaches.
Systematic Reviews 2015; 4: 5. (Erratum in: Systematic Reviews 2015; 4: 59).
Oxman AD, Guyatt GH. The science of reviewing research. Annals of the New York Academy of
Sciences 1993; 703: 125–133; discussion 133–134.
Pansieri C, Pandolfini C, Bonati M. Clinical trial registries: more international, converging
efforts are needed. Trials 2017; 18: 86.
Petticrew M, Roberts H. Systematic Reviews in the Social Sciences: a Practical Guide. Oxford
(UK): Blackwell; 2006.
Pham B, Klassen TP, Lawson ML, Moher D. Language of publication restrictions in
systematic reviews gave different results depending on whether the intervention was
conventional or complementary. Journal of Clinical Epidemiology 2005; 58: 769–776.
Pilkington K, Boshnakova A, Clarke M, Richardson J. ‘No language restrictions’ in database
searches: what does this really mean? Journal of Alternative and Complementary Medicine
2005; 11: 205–207.
Shemilt I, Simon A, Hollands GJ, Marteau TM, Ogilvie D, O’Mara-Eves A, Kelly MP, Thomas J.
Pinpointing needles in giant haystacks: use of text mining to reduce impractical screening
workload in extremely large scoping reviews. Research Synthesis Methods 2014; 5: 31–49.
Shemilt I, Khan N, Park S, Thomas J. Use of cost-effectiveness analysis to compare the
efficiency of study identification methods in systematic reviews. Systematic Reviews 2016;
5: 140.
Spoor P, Airey M, Bennett C, Greensill J, Williams R. Use of the capture-recapture technique
to evaluate the completeness of systematic literature searches. BMJ 1996; 313: 342–343.
Spry C, Mierzwinski-Urban M, Rabb D. Peer review of literature search strategies: does it
make a difference? 21st Cochrane Colloquium; 2013; Quebec City, Canada. https://
abstracts.cochrane.org/2013-québec-city/peer-review-literature-search-strategies-does-
it-make-difference.
Stevinson C, Lawlor DA. Searching multiple databases for systematic reviews: added value
or diminishing returns? Complementary Therapies in Medicine 2004; 12: 228–232.
Suarez-Almazor ME, Belseck E, Homik J, Dorgan M, Ramos-Remus C. Identifying clinical
trials in the medical literature with electronic databases: MEDLINE alone is not enough.
Controlled Clinical Trials 2000; 21: 476–487.
Thomas J, Noel-Storr A, Marshall I, Wallace B, McDonald S, Mavergames C, Glasziou P,
Shemilt I, Synnot A, Turner T, Elliott J; Living Systematic Review Network. Living
systematic reviews: 2. Combining human and machine effort. Journal of Clinical
Epidemiology 2017; 91: 31–37.
Tramèr MR, Reynolds DJ, Moore RA, McQuay HJ. Impact of covert duplicate publication on
meta-analysis: a case study. BMJ 1997; 315: 635–640.
Tsafnat G, Glasziou P, Choong MK, Dunn A, Galgani F, Coiera E. Systematic review
automation technologies. Systematic Reviews 2014; 3: 74.
US National Library of Medicine. PubMed. 2018. https://www.nlm.nih.gov/bsd/
pubmed.html.
US National Library of Medicine. MEDLINE®: Description of the Database. 2019. https://www.
nlm.nih.gov/bsd/medline.html.
US National Library of Medicine. Fact Sheet: MEDLINE, PubMed, and PMC (PubMed Central):
How are they different? no date. https://www.nlm.nih.gov/bsd/difference.html.
Vickers A, Goyal N, Harland R, Rees R. Do certain countries produce only positive results?
A systematic review of controlled trials. Controlled Clinical Trials 1998; 19: 159–166.
Viergever RF, Li K. Trends in global clinical trial registration: an analysis of numbers of
registered clinical trials in different parts of the world from 2004 to 2013. BMJ Open 2015;
5: e008932.
von Elm E, Poglia G, Walder B, Tramèr MR. Different patterns of duplicate publication: an
analysis of articles used in systematic reviews. JAMA 2004; 291: 974–980.
Wallace S, Daly C, Campbell M, Cody J, Grant A, Vale L, Donaldson C, Khan I, Lawrence P,
MacLeod A. After MEDLINE? Dividend from other potential sources of randomised
controlled trials. Second International Conference Scientific Basis of Health Services &
Fifth Annual Cochrane Colloquium; 1997; Amsterdam, The Netherlands.
Wang Z, Brito JP, Tsapas A, Griebeler ML, Alahdab F, Murad MH. Systematic reviews with
language restrictions and no author contact have lower overall credibility: a methodology
study. Clinical Epidemiology 2015; 7: 243–247.
Weber EJ, Callaham ML, Wears RL, Barton C, Young G. Unpublished research from a medical
specialty meeting: why investigators fail to publish. JAMA 1998; 280: 257–259.
5
Collecting data
Tianjing Li, Julian PT Higgins, Jonathan J Deeks
KEY POINTS
• Systematic reviews have studies, rather than reports, as the unit of interest, and so
multiple reports of the same study need to be identified and linked together before
or after data extraction.
• Because of the increasing availability of data sources (e.g. trials registers, regulatory
documents, clinical study reports), review authors should decide on which sources
may contain the most useful information for the review, and have a plan to resolve
discrepancies if information is inconsistent across sources.
• Review authors are encouraged to develop outlines of tables and figures that will
appear in the review to facilitate the design of data collection forms. The key to suc-
cessful data collection is to construct easy-to-use forms and collect sufficient and
unambiguous data that faithfully represent the source in a structured and organized
manner.
• Effort should be made to identify data needed for meta-analyses, which often need to
be calculated or converted from data reported in diverse formats.
• Data should be collected and archived in a form that allows future access and data
sharing.
5.1 Introduction
Systematic reviews aim to identify all studies that are relevant to their research ques-
tions and to synthesize data about the design, risk of bias, and results of those studies.
Consequently, the findings of a systematic review depend critically on decisions relat-
ing to which data from these studies are presented and analysed. Data collected for
systematic reviews should be accurate, complete, and accessible for future updates
of the review and for data sharing. Methods used for these decisions must be transpar-
ent; they should be chosen to minimize biases and human error. Here we describe
approaches that should be used in systematic reviews for collecting data, including
extraction of data directly from journal articles and other reports of studies.
This chapter should be cited as: Li T, Higgins JPT, Deeks JJ (editors). Chapter 5: Collecting data. In: Higgins
JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for
Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 109–142.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
5.2 Sources of data
authorities. Because CSRs also incorporate tables and figures, with appendices
containing the protocol, statistical analysis plan, sample case report forms, and patient
data listings (including narratives of all serious adverse events), they can be thousands
of pages in length. CSRs often contain more data about trial methods and results than
any other single data source (Mayo-Wilson et al 2018). CSRs are often difficult to access,
and are usually not publicly available. Review authors could request CSRs from the Euro-
pean Medicines Agency (Davis and Miller 2017). The US Food and Drug Administration
had historically avoided releasing CSRs but launched a pilot programme in 2018
whereby selected portions of CSRs for new drug applications were posted on the
agency’s website. Many CSRs are obtained through unsealed litigation documents, repo-
sitories (e.g. clinicalstudydatarequest.com), and other open data and data-sharing
channels (e.g. The Yale University Open Data Access Project) (Doshi et al 2013, Wieland
et al 2014, Mayo-Wilson et al 2018).
Regulatory reviews such as those available from the US Food and Drug Administra-
tion or European Medicines Agency provide useful information about trials of drugs,
biologics, and medical devices submitted by manufacturers for marketing approval
(Turner 2013). These documents are summaries of CSRs and related documents,
prepared by agency staff as part of the process of approving the products for
marketing, after reanalysing the original trial data. Regulatory reviews often are
available only for the first approved use of an intervention and not for later applications
(although review authors may request those documents, which are usually brief).
Using regulatory reviews from the US Food and Drug Administration as an example,
drug approval packages are available on the agency’s website for drugs approved since
1997 (Turner 2013); for drugs approved before 1997, information must be requested
through a freedom of information request. The drug approval packages contain various
documents: approval letter(s), medical review(s), chemistry review(s), clinical
pharmacology review(s), and statistical review(s).
Individual participant data (IPD) are usually sought directly from the researchers
responsible for the study, or may be identified from open data repositories (e.g.
www.clinicalstudydatarequest.com). These data typically include variables that
represent the characteristics of each participant, intervention (or exposure) group,
prognostic factors, and measurements of outcomes (Stewart et al 2015). Access to
IPD has the advantage of allowing review authors to reanalyse the data flexibly, in
accordance with the preferred analysis methods outlined in the protocol, and can
reduce the variation in analysis methods across studies included in the review. IPD
reviews are addressed in detail in Chapter 26.
report and then link together the collected data across reports. Either strategy may be
appropriate, depending on the nature of the reports at hand. It may not be clear that
two reports relate to the same study until data collection has commenced. Although
sometimes there is a single report for each study, it should never be assumed that this
is the case.
It can be difficult to link multiple reports from the same study, and review authors
may need to do some ‘detective work’. Multiple sources about the same trial may not
reference each other, may not share common authors (Gøtzsche 1989, Tramèr et al 1997),
and may report discrepant information about the study design, characteristics, outcomes,
and results (von Elm et al 2004, Mayo-Wilson et al 2017a).
Some of the most useful criteria for linking reports are:
• trial registration numbers;
• authors’ names;
• sponsor for the study and sponsor identifiers (e.g. grant or contract numbers);
• location and setting (particularly if institutions, such as hospitals, are named);
• specific details of the interventions (e.g. dose, frequency);
• numbers of participants and baseline data; and
• date and duration of the study.
Review authors should use as many trial characteristics as possible to link multiple
reports. When uncertainties remain after considering these and other factors, it may be
necessary to correspond with the study authors or sponsors for confirmation.
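A minimal sketch of such detective work follows, grouping reports by trial registration number where available and flagging unregistered reports that share authors and sample size for manual checking; the data structure, field names, and matching rule are illustrative only.

```python
# Group reports by registration number; flag unregistered near-matches for review.
from collections import defaultdict

reports = [
    {"id": "r1", "registration": "NCT00000001", "authors": {"Smith", "Lee"}, "n": 120},
    {"id": "r2", "registration": "NCT00000001", "authors": {"Smith"}, "n": 120},
    {"id": "r3", "registration": None, "authors": {"Smith", "Lee"}, "n": 120},
    {"id": "r4", "registration": None, "authors": {"Garcia"}, "n": 45},
]

by_registration = defaultdict(list)
unregistered = []
for report in reports:
    if report["registration"]:
        by_registration[report["registration"]].append(report["id"])
    else:
        unregistered.append(report)

print("Linked by registration number:", dict(by_registration))

# Unregistered reports sharing authors and sample size with another report are
# candidates for the same study; confirm manually or by correspondence.
for report in unregistered:
    for other in reports:
        if (other["id"] != report["id"]
                and other["authors"] & report["authors"]
                and other["n"] == report["n"]):
            print(f"{report['id']} may report the same study as {other['id']}")
```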
Table 5.2.a Strengths and limitations of different data sources for systematic reviews

Public sources

Journal articles
Strengths: found easily; data extracted quickly; include useful information about methods and results.
Limitations: available for some, but not all, studies (with a risk of reporting biases: see Chapters 7 and 13); contain limited study characteristics and methods; can omit outcomes, especially harms.

Conference abstracts
Strengths: identify unpublished studies.
Limitations: include little information about study design; include limited and unclear information for meta-analysis; may result in double-counting studies in meta-analysis if not correctly linked to other reports of the same study.

Trial registrations
Strengths: identify otherwise unpublished trials; may contain information about design, risk of bias, and results not included in other public sources; link multiple sources about the same trial using the unique registration number.
Limitations: limited to more recent studies that comply with registration requirements; often contain limited information about trial design and quantitative results; may report only harms (adverse events) occurring above a threshold (e.g. 5%); may be inaccurate or incomplete for trials whose methods have changed during the conduct of the study, or results not kept up to date.

Regulatory information
Strengths: identify studies not reported in other public sources; describe details of methods and results not found in other sources.
Limitations: available only for studies submitted to regulators; available for approved indications, but not ‘off-label’ uses; not always in a standard format; not often available for old products.

Non-public sources

Clinical study reports (CSRs)
Strengths: contain detailed information about study characteristics, methods, and results; can be particularly useful for identifying detailed information about harms; describe aggregate results, which are easy to analyse and sufficient for most reviews.
Limitations: do not exist or are difficult to obtain for most studies; require more time to obtain and analyse than public sources.

Individual participant data
Strengths: allow review authors to use contemporary statistical methods and to standardize analyses across studies; permit additional analyses that the review authors desire (e.g. subgroup analyses).
Limitations: require considerable expertise and time to obtain and analyse; may lead to the same results that can be found in the aggregate report; may not be necessary if one has a CSR.
5.3 What data to collect
Review authors should plan in advance what data will be required for their systematic
review, and develop a
strategy for obtaining them (see MECIR Box 5.3.a). The involvement of consumers and
other stakeholders can be helpful in ensuring that the categories of data collected are
sufficiently aligned with the needs of review users (Chapter 1, Section 1.3). The data to
be sought should be described in the protocol, with consideration wherever possible of
the issues raised in the rest of this chapter.
The data collected for a review should adequately describe the included studies, sup-
port the construction of tables and figures, facilitate the risk of bias assessment, and
enable syntheses and meta-analyses. Review authors should familiarize themselves
with reporting guidelines for systematic reviews (see online Chapter III and the PRISMA
statement; Liberati et al 2009) to ensure that relevant elements and sections are incor-
porated. The following sections review the types of information that should be sought,
and these are summarized in Table 5.3.a (Li et al 2015).
Methods
• Parallel, factorial, crossover, cluster aspects of design for randomized trials, and/or study design features for non-randomized studies
• Single or multicentre study; if multicentre, number of recruiting centres
• Recruitment and sampling procedures used (including at the level of individual participants and clusters/sites if relevant)
• Enrolment start and end dates; length of participant follow-up
• Details of random sequence generation, allocation sequence concealment, and masking for randomized trials, and methods used to prevent and control for confounding, selection biases, and information biases for non-randomized studies*
• Methods used to prevent and address missing data*
• Statistical analysis: unit of analysis (e.g. individual participant, clinic, village, body part); statistical methods used if computed effect estimates are extracted from reports, including any covariates included in the statistical model
• Likelihood of reporting and other biases*
• Source(s) of funding or other material support for the study
• Authors’ financial relationship and other potential conflicts of interest

Participants
• Setting
• Region(s) and country/countries from which study participants were recruited
• Study eligibility criteria, including diagnostic criteria
• Characteristics of participants at the beginning (or baseline) of the study (e.g. age, sex, comorbidity, socio-economic status)

Intervention
• Description of the intervention(s) and comparison intervention(s), ideally with sufficient detail for replication: components, routes of delivery, doses, timing, frequency, intervention protocols, length of intervention; factors relevant to implementation (e.g. staff qualifications, equipment requirements); integrity of interventions (i.e. the degree to which specified procedures or components of the intervention were implemented as planned); description of co-interventions
• Definition of ‘control’ groups (e.g. no intervention, placebo, minimally active comparator, or components of usual care): components, dose, timing, frequency
• For observational studies: description of how intervention status was assessed; length of exposure, cumulative exposure

Outcomes
For each pre-specified outcome domain (e.g. anxiety) in the systematic review:
• Whether there is evidence that the outcome domain was assessed (especially important if the outcome was assessed but the results not presented; see Chapter 13)
• Measurement tool or instrument (including definition of clinical outcomes or endpoints); for a scale, name of the scale (e.g. the Hamilton Anxiety Rating Scale), upper and lower limits, and whether a high or low score is favourable, definitions of any thresholds if appropriate
• Specific metric (e.g. post-intervention anxiety, or change in anxiety from baseline to a post-intervention time point, or post-intervention presence of anxiety (yes/no))
• Method of aggregation (e.g. mean and standard deviation of anxiety scores in each group, or proportion of people with anxiety)
• Timing of outcome measurements (e.g. assessments at end of eight-week intervention period, events occurring during the eight-week intervention period)
• Adverse outcomes need special attention depending on whether they are collected systematically or non-systematically (e.g. by voluntary report)

Results
• For each group, and for each outcome at each time point: number of participants randomly assigned and included in the analysis; and number of participants who withdrew, were lost to follow-up or were excluded (with reasons for each)
• Summary data for each group (e.g. 2×2 table for dichotomous data; means and standard deviations for continuous data)
• Between-group estimates that quantify the effect of the intervention, and their precision (e.g. risk ratio, odds ratio, mean difference)
• If subgroup analysis is planned, the same information would need to be extracted for each participant subgroup

Miscellaneous
• Key conclusions of the study authors
• Reference to other relevant studies
• Correspondence required
• Miscellaneous comments from the study authors or by the review authors

* Full description required for assessments of risk of bias (see Chapters 8, 23 and 25).
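To illustrate the between-group estimates listed under ‘Results’, the sketch below computes a risk ratio (with an approximate confidence interval), an odds ratio, and a mean difference from invented summary data; real syntheses should use established meta-analysis software.

```python
# Effect estimates from extracted summary data (illustrative numbers only).
import math

# 2x2 table: events / total in intervention and control groups.
e1, n1 = 15, 100   # intervention
e2, n2 = 30, 100   # control

risk_ratio = (e1 / n1) / (e2 / n2)
odds_ratio = (e1 * (n2 - e2)) / (e2 * (n1 - e1))

# Approximate 95% CI for the risk ratio via the log scale.
se_log_rr = math.sqrt(1/e1 - 1/n1 + 1/e2 - 1/n2)
ci = (math.exp(math.log(risk_ratio) - 1.96 * se_log_rr),
      math.exp(math.log(risk_ratio) + 1.96 * se_log_rr))
print(f"RR = {risk_ratio:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"OR = {odds_ratio:.2f}")

# Continuous outcome: mean difference between group means.
mean1, mean2 = 12.4, 15.1
print(f"MD = {mean1 - mean2:.1f}")
```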
assessments of risk of bias. For non-randomized studies, the most appropriate tool is
described in Chapter 25. A separate tool also covers bias due to missing results in
meta-analysis (see Chapter 13).
A particularly important piece of information is the funding source of the study and
potential conflicts of interest of the study authors.
Some review authors will wish to collect additional information on study character-
istics that bear on the quality of the study’s conduct but that may not lead directly to
risk of bias, such as whether ethical approval was obtained and whether a sample size
calculation was performed a priori.
allow assessment of how directly or completely the participants in the included studies
reflect the original review question.
Typically, aspects that should be collected are those that could (or are believed to) affect
presence or magnitude of an intervention effect and those that could help review users
assess applicability to populations beyond the review. For example, if the review authors
suspect important differences in intervention effect between different socio-economic
groups, this information should be collected. If intervention effects are thought constant
over such groups, and if such information would not be useful to help apply results, it
should not be collected. Participant characteristics that are often useful for assessing
applicability include age and sex. Summary information about these should always be col-
lected unless they are not obvious from the context. These characteristics are likely to be
presented in different formats (e.g. ages as means or medians, with standard deviations or
ranges; sex as percentages or counts for the whole study or for each intervention group
separately). Review authors should seek consistent quantities where possible, and decide
whether it is more relevant to summarize characteristics for the study as a whole or by
intervention group. It may not be possible to select the most consistent statistics until data
collection is complete across all or most included studies. Other characteristics that are
sometimes important include ethnicity, socio-demographic details (e.g. education level)
and the presence of comorbid conditions. Clinical characteristics relevant to the review
question (e.g. glucose level for reviews on diabetes) also are important for understanding
the severity or stage of the disease.
Diagnostic criteria that were used to define the condition of interest can be a
particularly important source of diversity across studies and should be collected.
For example, in a review of drug therapy for congestive heart failure, it is important
to know how the definition and severity of heart failure was determined in each study
(e.g. systolic or diastolic dysfunction, severe systolic dysfunction with ejection fractions
below 20%). Similarly, in a review of antihypertensive therapy, it is important to
describe baseline levels of blood pressure of participants.
If the settings of studies may influence intervention effects or applicability, then
information on these should be collected. Typical settings of healthcare intervention
studies include acute care hospitals, emergency facilities, general practice, and extended
care facilities such as nursing homes, offices, schools, and communities. Sometimes
studies are conducted in different geographical regions with important differences that
could affect delivery of an intervention and its outcomes, such as cultural characteristics,
economic context, or rural versus city settings. Timing of the study may be associated
with important technology differences or trends over time. If such information is impor-
tant for the interpretation of the review, it should be collected.
Important characteristics of the participants in each included study should be
summarized for the reader in the table of ‘Characteristics of included studies’.
5.3.4 Interventions
Details of all experimental and comparator interventions of relevance to the review
should be collected. Again, details are required for aspects that could affect the
presence or magnitude of an effect or that could help review users assess applicability
to their own circumstances. Where feasible, information should be sought (and
presented in the review) that is sufficient for replication of the interventions under
study. This includes any co-interventions administered as part of the study, and applies
similarly to comparators such as ‘usual care’. Review authors may need to request
missing information from study authors.
The Template for Intervention Description and Replication (TIDieR) provides a
comprehensive framework for full description of interventions and has been proposed
for use in systematic reviews as well as reports of primary studies (Hoffmann et al 2014).
The checklist includes descriptions of: the brief name of the intervention; why (the
rationale); what materials and what procedures were used; who provided the intervention;
how (mode of delivery); where (location); when and how much (dose and schedule);
tailoring; modifications; and how well the intervention was delivered (planned and
actual fidelity).
In addition, the growing field of implementation science has led to an increased awareness
of the impact of setting and context on delivery of interventions (Damschroder et al 2009).
(See Chapter 17, Section 17.1.2.1 for further information and discussion about how an
intervention may be tailored to local conditions in order to preserve its integrity.)
Information about integrity can help determine whether unpromising results are due
to a poorly conceptualized intervention or to an incomplete delivery of the prescribed
components. It can also reveal important information about the feasibility of
implementing a given intervention in real life settings. If it is difficult to achieve full
implementation in practice, the intervention will have low feasibility (Dusenbury
et al 2003).
Whether a lack of intervention integrity leads to a risk of bias in the estimate of its effect
depends on whether review authors and users are interested in the effect of assignment
to intervention or the effect of adhering to intervention, as discussed in more detail in
Chapter 8, Section 8.2.2. Assessment of deviations from intended interventions is impor-
tant for assessing risk of bias in the latter, but not the former (see Chapter 8, Section 8.4),
but both may be of interest to decision makers in different ways.
An example of a Cochrane Review evaluating intervention integrity is provided by a
review of smoking cessation in pregnancy (Chamberlain et al 2017). The authors found
that process evaluation of the intervention occurred in only some trials and that the
implementation was less than ideal in others, including some of the largest trials.
The review highlighted how the transfer of an intervention from one setting to another
may reduce its effectiveness when elements are changed, or aspects of the materials
are culturally inappropriate.
5.3.5 Outcomes
An outcome is an event or a measurement value observed or recorded for a particular
person or intervention unit in a study during or following an intervention, and that
is used to assess the efficacy and safety of the studied intervention (Meinert 2012).
Review authors should indicate in advance whether they plan to collect information
about all outcomes measured in a study or only those outcomes of (pre-specified)
interest in the review. Research has shown that trials addressing the same condition
and intervention seldom agree on which outcomes are the most important, and
consequently report on numerous different outcomes (Dwan et al 2014, Ismail et al 2014).
Further considerations for economics outcomes are discussed in Chapter 20, and for
patient-reported outcomes in Chapter 18.
which could lead to bias in the available data. Unfortunately, most adverse events are
collected non-systematically rather than systematically, creating a challenge for review
authors. The following pieces of information are useful and worth collecting (Nicole
Fusco, personal communication):
• any coding system or standard medical terminology used (e.g. COSTART, MedDRA), including version number;
• names of the adverse events (e.g. dizziness);
• reported intensity of the adverse event (e.g. mild, moderate, severe);
• whether the trial investigators categorized the adverse event as ‘serious’;
• whether the trial investigators identified the adverse event as being related to the intervention;
• time point (most commonly measured as a count over the duration of the study);
• any reported methods for how adverse events were selected for inclusion in the publication (e.g. ‘We reported all adverse events that occurred in at least 5% of participants’); and
• associated results.
Although quality of life measures are important and can be used to gauge overall
participant well-being, they should not be regarded as substitutes for a detailed
evaluation of safety and tolerability.
5.3.6 Results
Results data arise from the measurement or ascertainment of outcomes for individual
participants in an intervention study. Results data may be available for each individual
in a study (i.e. individual participant data; see Chapter 26), or summarized at arm level,
or summarized at study level into an intervention effect by comparing two intervention
arms. Results data should be collected only for the intervention groups and outcomes
specified to be of interest in the protocol (see MECIR Box 5.3.b). Results for other
outcomes should not be collected unless the protocol is modified to add them. Any
modification should be reported in the review. However, review authors should be alert
to the possibility of important, unexpected findings, particularly serious adverse
effects.
Reports of studies often include several results for the same outcome. For example,
different measurement scales might be used, results may be presented separately for
different subgroups, and outcomes may have been measured at different follow-up
time points. Variation in the results can be very large, depending on which data are
selected (Gøtzsche et al 2007, Mayo-Wilson et al 2017a). Review protocols should be
as specific as possible about which outcome domains, measurement tools, time points,
and summary statistics (e.g. final values versus change from baseline) are to be col-
lected (Mayo-Wilson et al 2017b). A framework should be pre-specified in the protocol
to facilitate making choices between multiple eligible measures or results. For example,
a hierarchy of preferred measures might be created, or plans articulated to select the
result with the median effect size, or to average across all eligible results for a particular
outcome domain (see also Chapter 9, Section 9.3.3). Any additional decisions or
changes to this framework made once the data are collected should be reported in
the review as changes to the protocol.
Section 5.6 describes the numbers that will be required to perform meta-analysis, if
appropriate. The unit of analysis (e.g. participant, cluster, body part, treatment period)
should be recorded for each result when it is not obvious (see Chapter 6, Section 6.2).
The type of outcome data determines the nature of the numbers that will be sought for
each outcome. For example, for a dichotomous (‘yes’ or ‘no’) outcome, the number of
participants and the number who experienced the outcome will be sought for each
group. It is important to collect the sample size relevant to each result, although this is
not always obvious. A flow diagram as recommended in the CONSORT Statement
(Moher et al 2001) can help to determine the flow of participants through a study. If
one is not available in a published report, review authors can consider drawing one
(available from www.consort-statement.org).
The numbers required for meta-analysis are not always available. Often, other
statistics can be collected and converted into the required format. For example, for
a continuous outcome, it is usually most convenient to seek the number of
participants, the mean and the standard deviation for each intervention group. These
are often not available directly, especially the standard deviation. Alternative statistics
enable calculation or estimation of the missing standard deviation (such as a standard
error, a confidence interval, a test statistic (e.g. from a t-test or F-test) or a P value).
These should be extracted if they provide potentially useful information (see MECIR
Box 5.3.c). Details of recalculation are provided in Section 5.6. Further considerations
for dealing with missing data are discussed in Chapter 10 (Section 10.12).
Reviews tend to collect similar types of information, so forms can be adapted from one
review to the next. Although we use the term ‘data collection form’ in the singular, in practice it
may be a series of forms used for different purposes: for example, a separate form
could be used to assess the eligibility of studies for inclusion in the review to assist
in the quick identification of studies to be excluded from or included in the review.
The advantages and disadvantages of paper forms, electronic forms and data systems are summarized below.

Paper forms
Advantages:
• Can record notes and explanations easily
• Require minimal software skills
Disadvantages:
• Inefficient and potentially unreliable because data must be entered into software for analysis and reporting
• Susceptible to errors
• Data collected by multiple authors must be manually collated
• Difficult to amend as the review progresses
• If the papers are lost, all data will need to be re-created

Electronic forms
Advantages:
• … editing and analysis
• Allow electronic data storage, sharing and collation
• Easy to expand or edit forms as required
• Can automate data comparison with additional programming
• Can copy data to analysis software without manual re-entry, reducing errors
Disadvantages:
• Require familiarity with software packages to design and use forms
• Susceptible to changes in software versions

Data systems
Advantages:
• Allow online data storage, linking, and sharing
• Easy to expand or edit forms as required
• Can be integrated with title/abstract, full-text screening and other functions
• Can link data items to locations in the report to facilitate checking
• Can readily automate data comparison between independent data collection for the same study
• Allow easy monitoring of progress and performance of the author team
• Facilitate coordination among data collectors, such as allocation of studies for collection and monitoring team progress
• Allow simultaneous data entry by multiple authors
• Can export data directly to analysis software
• In some cases, improve public accessibility through open data sharing
Disadvantages:
• Upfront investment of resources to set up the form and train data extractors
• Structured templates may not be as flexible as electronic forms
• Cost of commercial data systems
• Require familiarity with data systems
• Susceptible to changes in software versions
• Ask closed-ended questions (i.e. questions that define a list of permissible responses)
as much as possible. Closed-ended questions do not require post hoc coding and
provide better control over data quality than open-ended questions. When setting
up a closed-ended question, one must anticipate and structure possible responses
and include an ‘other, specify’ category because the anticipated list may not be
exhaustive. Avoid asking data extractors to summarize data into uncoded text, no
matter how short it is.
• Avoid asking a question in a way that the response may be left blank. Include ‘not
applicable’, ‘not reported’ and ‘cannot tell’ options as needed. The ‘cannot tell’
option tags uncertain items that may prompt review authors to contact study
authors for clarification, especially on data items critical to reach conclusions.
• Remember that the form will focus on what is reported in the article rather than what has
been done in the study. The study report may not fully reflect how the study was
actually conducted. For example, a question ‘Did the article report that the parti-
cipants were masked to the intervention?’ is more appropriate than ‘Were partici-
pants masked to the intervention?’
• Where a judgement is required, record the raw data (i.e. quote directly from the
source document) used to make the judgement. It is also important to record the
source of information collected, including where it was found in a report or whether
information was obtained from unpublished sources or personal communications.
As much as possible, questions should be asked in a way that minimizes subjective
interpretation and judgement to facilitate data comparison and adjudication.
• Incorporate flexibility to allow for variation in how data are reported. It is strongly recom-
mended that outcome data be collected in the format in which they were reported and
transformed in a subsequent step if required. Review authors also should consider the
software they will use for analysis and for publishing the review (e.g. RevMan).
Step 4. Develop and pilot-test data collection forms, ensuring that they provide data
in the right format and structure for subsequent analysis. In addition to data items
described in Step 2, data collection forms should record the title of the review as well
as the person who is completing the form and the date of completion. Forms occasion-
ally need revision; forms should therefore include the version number and version date
to reduce the chances of using an outdated form by mistake. Because a study may be
associated with multiple reports, it is important to record the study ID as well as the
report ID. Definitions and instructions helpful for answering a question should appear
next to the question to improve quality and consistency across data extractors (Stock
1994). Provide space for notes, regardless of whether paper or electronic forms are used.
All data collection forms and data systems should be thoroughly pilot-tested before
launch (see MECIR Box 5.4.a). Testing should involve several people extracting data
from at least a few articles. The initial testing focuses on the clarity and completeness
of questions. Users of the form may provide feedback that certain coding instructions
are confusing or incomplete (e.g. a list of options may not cover all situations). The
testing may identify data that are missing from the form, or likely to be superfluous.
After initial testing, accuracy of the extracted data should be checked against the
source document or verified data to identify problematic areas. It is wise to draft entries
for the table of ‘Characteristics of included studies’ and complete a risk of bias assess-
ment (Chapter 8) using these pilot reports to ensure all necessary information is col-
lected. A consensus between review authors may be required before the form is
modified to avoid any misunderstandings or later disagreements. It may be necessary
to repeat the pilot testing on a new set of reports if major changes are needed after the
first pilot test.
Problems with the data collection form may surface after pilot testing has been com-
pleted, and the form may need to be revised after data extraction has started. When
changes are made to the form or coding instructions, it may be necessary to return to
reports that have already undergone data extraction. In some situations, it may be neces-
sary to clarify only coding instructions without modifying the actual data collection form.
Results of the pilot testing of the form should prompt discussion among
review authors and extractors of ambiguous questions or responses to establish consist-
ency. Training should take place at the onset of the data extraction process and period-
ically over the course of the project (Li et al 2015). For example, when data related to a
single item on the form are present in multiple locations within a report (e.g. abstract,
main body of text, tables, and figures) or in several sources (e.g. publications, Clinical-
Trials.gov, or CSRs), the development and documentation of instructions to follow an
agreed algorithm are critical and should be reinforced during the training sessions.
It has been proposed that some information in a report, such as the names of its
authors, be blinded to the review author prior to data extraction and assessment of
risk of bias (Jadad et al 1996). However, blinding of review authors to aspects of study
reports generally is not recommended for Cochrane Reviews as there is little evidence
that it alters the decisions made (Berlin 1997).
Each report of a study merits careful examination, since it may contain valuable
information not included in the primary report. Review authors will need to decide
between two strategies:
• Extract data from each report separately, then combine information across multiple
data collection forms.
• Extract data from all reports directly into a single data collection form.
The choice of which strategy to use will depend on the nature of the reports and may
vary across studies and across reports. For example, when a full journal article and mul-
tiple conference abstracts are available, it is likely that the majority of information will
be obtained from the journal article; completing a new data collection form for each
conference abstract may be a waste of time. Conversely, when there are two or more
detailed journal articles, perhaps relating to different periods of follow-up, then it is
likely to be easier to perform data extraction separately for these articles and collate
information from the data collection forms afterwards. When data from all reports are
extracted into a single data collection form, review authors should identify the ‘main’
data source for each study when sources include conflicting data and these differences
cannot be resolved by contacting authors (Mayo-Wilson et al 2018). Flow diagrams such
as those modified from the PRISMA statement can be particularly helpful when collating
and documenting information from multiple reports (Mayo-Wilson et al 2018).
• Use one author’s (paper) data collection form and record changes after consensus in a different ink colour.
• Enter consensus data onto an electronic form.
• Record original data extracted and consensus data in separate forms (some online tools do this automatically).
Agreement of coded items before reaching consensus can be quantified, for example
using kappa statistics (Orwin 1994), although this is not routinely done in Cochrane
Reviews. If agreement is assessed, this should be done only for the most important data
(e.g. key risk of bias assessments, or availability of key outcomes).
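Where a review team does wish to quantify agreement, a kappa statistic can be computed with standard statistical software. The following is a minimal sketch in Python, not part of the Handbook’s guidance, assuming two extractors’ coded judgements are available as parallel lists and that scikit-learn is installed; the labels shown are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical coded judgements from two independent data extractors
# for the same six studies (e.g. key risk-of-bias ratings).
extractor_1 = ["low", "high", "unclear", "low", "low", "high"]
extractor_2 = ["low", "high", "low", "low", "unclear", "high"]

# Cohen's kappa: 1 = perfect agreement, 0 = agreement expected by chance.
print(cohen_kappa_score(extractor_1, extractor_2))
```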
Throughout the review process informal consideration should be given to the reliability
of data extraction. For example, if after reaching consensus on the first few studies, the
authors note frequent disagreement for specific data items, then coding instructions may need
modification. Furthermore, an author’s coding strategy may change over time, as the coding
rules are forgotten, indicating a need for retraining and, possibly, some recoding.
It may be difficult for the review authors to link the trial with other reports of the same trial
(Section 5.2.1).
Many of the documents downloaded from the US Food and Drug Administration’s
website for older drugs are scanned copies and are not searchable because of redac-
tion of confidential information (Turner 2013). Optical character recognition software
can convert most of the text. Reviews for newer drugs have been redacted electroni-
cally; documents remain searchable as a result.
Compared to CSRs, regulatory reviews contain less information about trial design,
execution, and results. They provide limited information for assessing the risk of bias.
In terms of extracting outcomes and results, review authors should follow the guidance
provided for CSRs (Section 5.5.6).
• At least 26 studies have tested various natural language processing and machine
learning approaches for facilitating data extraction for systematic reviews.
• Each tool focuses on only a limited number of data elements (ranging from one to
seven). Most of the existing tools focus on the PICO information (e.g. number of par-
ticipants, their age, sex, country, recruiting centres, intervention groups, outcomes,
and time points). A few are able to extract study design and results (e.g. objectives,
study duration, participant flow), and two extract risk of bias information (Marshall
et al 2016, Millard et al 2016). To date, well over half of the data elements needed for
systematic reviews have not been explored for automated extraction.
• Most tools highlight the sentence(s) that may contain the data elements as opposed
to directly recording these data elements into a data collection form or a data
system.
• There is no gold standard or common dataset to evaluate the performance of these
tools, limiting our ability to interpret the significance of the reported accuracy
measures.
At the time of writing, we cannot recommend a specific tool for automating data
extraction for routine systematic review production. There is a need for review authors
to work with experts in informatics to refine these tools and evaluate them rigorously.
Such investigations should address how the tool will fit into existing workflows. For
example, the automated or semi-automated data extraction approaches may first
act as checks for manual data extraction before they can replace it.
Because investigations may take time, and institutions may not always be respon-
sive (Wager 2011), articles suspected of being fraudulent should be classified as
‘awaiting assessment’. If a misconduct investigation indicates that the publication
is unreliable, or if a publication is retracted, it should not be included in the system-
atic review, and the reason should be noted in the ‘excluded studies’ section.
5.9 References
Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the
detection of data fabrication in clinical trials. BMJ 2005; 331: 267–270.
Allen EN, Mushi AK, Massawe IS, Vestergaard LS, Lemnge M, Staedke SG, Mehta U, Barnes KI,
Chandler CI. How experiences become data: the process of eliciting adverse event,
medical history and concomitant medication reports in antimalarial and antiretroviral
interaction trials. BMC Medical Research Methodology 2013; 13: 140.
Gøtzsche PC, Hróbjartsson A, Maric K, Tendal B. Data extraction errors in meta-analyses that
use standardized mean differences. JAMA 2007; 298: 430–437.
Gross A, Schirm S, Scholz M. Ycasd – a tool for capturing and scaling data from graphical
representations. BMC Bioinformatics 2014; 15: 219.
Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V,
Macdonald H, Johnston M, Lamb SE, Dixon-Woods M, McCulloch P, Wyatt JC, Chan AW,
Michie S. Better reporting of interventions: template for intervention description and
replication (TIDieR) checklist and guide. BMJ 2014; 348: g1687.
ICH. ICH Harmonised tripartite guideline: Structure and content of clinical study reports
E3. 1995. www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/
Efficacy/E3/E3_Guideline.pdf.
Ioannidis JPA, Mulrow CD, Goodman SN. Adverse events: the more you search, the more you
find. Annals of Internal Medicine 2006; 144: 298–300.
Ip S, Hadar N, Keefe S, Parkin C, Iovin R, Balk EM, Lau J. A web-based archive of systematic
review data. Systematic Reviews 2012; 1: 15.
Ismail R, Azuara-Blanco A, Ramsay CR. Variation of clinical outcomes used in glaucoma
randomised controlled trials: a systematic review. British Journal of Ophthalmology 2014;
98: 464–468.
Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay H.
Assessing the quality of reports of randomized clinical trials: is blinding necessary?
Controlled Clinical Trials 1996; 17: 1–12.
Jelicic Kadic A, Vucic K, Dosenovic S, Sapunar D, Puljak L. Extracting data from figures with
software was faster, with higher interrater reliability than manual extraction. Journal of
Clinical Epidemiology 2016; 74: 119–123.
Jones AP, Remmington T, Williamson PR, Ashby D, Smyth RL. High prevalence but low
impact of data extraction and reporting errors were found in Cochrane systematic
reviews. Journal of Clinical Epidemiology 2005; 58: 741–742.
Jones CW, Keil LG, Holland WC, Caughey MC, Platts-Mills TF. Comparison of registered and
published outcomes in randomized controlled trials: a systematic review. BMC Medicine
2015; 13: 282.
Jonnalagadda SR, Goyal P, Huffman MD. Automating data extraction in systematic reviews:
a systematic review. Systematic Reviews 2015; 4: 78.
Lewin S, Hendry M, Chandler J, Oxman AD, Michie S, Shepperd S, Reeves BC, Tugwell P,
Hannes K, Rehfuess EA, Welch V, McKenzie JE, Burford B, Petkovic J, Anderson LM, Harris
J, Noyes J. Assessing the complexity of interventions within systematic reviews:
development, content and use of a new tool (iCAT_SR). BMC Medical Research
Methodology 2017; 17: 76.
Li G, Abbade LPF, Nwosu I, Jin Y, Leenus A, Maaz M, Wang M, Bhatt M, Zielinski L, Sanger N,
Bantoto B, Luo C, Shams I, Shahid H, Chang Y, Sun G, Mbuagbaw L, Samaan Z, Levine
MAH, Adachi JD, Thabane L. A scoping review of comparisons between abstracts and full
reports in primary biomedical research. BMC Medical Research Methodology 2017; 17: 181.
Li TJ, Vedula SS, Hadar N, Parkin C, Lau J, Dickersin K. Innovations in data collection,
management, and archiving for systematic reviews. Annals of Internal Medicine 2015; 162:
287–294.
Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux
PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and
meta-analyses of studies that evaluate health care interventions: explanation and
elaboration. PLoS Medicine 2009; 6: e1000100.
6 Choosing effect measures and computing estimates of effect
Julian PT Higgins, Tianjing Li, Jonathan J Deeks
KEY POINTS
• The types of outcome data that review authors are likely to encounter are dichotomous data, continuous data, ordinal data, count or rate data and time-to-event data.
• There are several different ways of comparing outcome data between two intervention groups (‘effect measures’) for each data type. For example, dichotomous outcomes can be compared between intervention groups using a risk ratio, an odds ratio, a risk difference or a number needed to treat. Continuous outcomes can be compared between intervention groups using a mean difference or a standardized mean difference.
• Effect measures are either ratio measures (e.g. risk ratio, odds ratio) or difference measures (e.g. mean difference, risk difference). Ratio measures are typically analysed on a logarithmic scale.
• Results extracted from study reports may need to be converted to a consistent, or usable, format for analysis.
This chapter should be cited as: Higgins JPT, Li T, Deeks JJ (editors). Chapter 6: Choosing effect measures
and computing estimates of effect. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ,
Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK):
John Wiley & Sons, 2019: 143–176.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
The ways in which the effect of an intervention can be assessed depend on the nature
of the data being collected. In this chapter, for each of the above types of data, we
review definitions, properties and interpretation of standard measures of intervention
effect, and provide tips on how effect estimates may be computed from data likely to
be reported in sources such as journal articles. Formulae to estimate effects (and their
standard errors) for the commonly used effect measures are provided in a supplemen-
tary document Statistical algorithms in Review Manager, as well as other standard text-
books (Deeks et al 2001). Chapter 10 discusses issues in the selection of one of these
measures for a particular meta-analysis.
6.1.2.1 A note on ratio measures of intervention effect: the use of log scales
The values of ratio measures of intervention effect (such as the odds ratio, risk ratio,
rate ratio and hazard ratio) usually undergo log transformations before being analysed,
and they may occasionally be referred to in terms of their log transformed values (e.g.
log odds ratio). Typically the natural log transformation (log base e, written ‘ln’) is used.
Ratio summary statistics all have the common features that the lowest value that
they can take is 0, that the value 1 corresponds to no intervention effect, and that
the highest value that they can take is infinity. This number scale is not symmetric.
For example, whilst an odds ratio (OR) of 0.5 (a halving) and an OR of 2 (a doubling)
are opposites such that they should average to no effect, the average of 0.5 and 2 is not
an OR of 1 but an OR of 1.25. The log transformation makes the scale symmetric: the
log of 0 is minus infinity, the log of 1 is zero, and the log of infinity is infinity. In the
example, the log of the above OR of 0.5 is –0.69 and the log of the OR of 2 is 0.69.
The average of –0.69 and 0.69 is 0 which is the log transformed value of an OR of 1,
correctly implying no intervention effect on average.
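The effect of the log transformation can be checked directly. The short Python sketch below, which is illustrative only, averages the two odds ratios from the example on the natural scale and on the log scale.

```python
import math

ors = [0.5, 2.0]  # a halving and a doubling of the odds

naive_average = sum(ors) / len(ors)  # 1.25: wrongly suggests an effect
log_average = math.exp(sum(math.log(o) for o in ors) / len(ors))

print(naive_average)  # 1.25
print(log_average)    # 1.0: correctly implies no effect on average
```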
Graphical displays for meta-analyses performed on ratio scales usually use a log
scale. This has the effect of making the confidence intervals appear symmetric, for
the same reasons.
Review authors should consider the impact on the analysis of any such clustering,
matching or other non-standard design features of the included studies (see MECIR
Box 6.2.a). A more detailed list of situations in which unit-of-analysis issues commonly
arise follows, together with directions to relevant discussions elsewhere in this
Handbook.
3) Select a single time point and analyse only data at this time for studies in which it is
presented. Ideally this should be a clinically important time point. Sometimes it
might be chosen to maximize the data available, although authors should be aware
of the possibility of reporting biases.
4) Select the longest follow-up from each study. This may induce a lack of consistency
across studies, giving rise to heterogeneity.
6.2.7 Multiple body parts I: body parts receive the same intervention
In some studies, people are randomized, but multiple parts (or sites) of the body receive the
same intervention, a separate outcome judgement being made for each body part, and the
number of body parts is used as the denominator in the analysis. For example, eyes may be
mistakenly used as the denominator without adjustment for the non-independence
between eyes. This is similar to the situation in cluster-randomized studies, except that
participants are the ‘clusters’ (see methods described in Chapter 23, Section 23.1).
6.2.8 Multiple body parts II: body parts receive different interventions
A different situation is that in which different parts of the body are randomized to
different interventions. ‘Split-mouth’ designs in oral health are of this sort, in which
different areas of the mouth are assigned different interventions. These trials have
similarities to crossover trials: whereas in crossover studies individuals receive multiple
interventions at different times, in these trials they receive multiple interventions at
different sites. See methods described in Chapter 23 (Section 23.2). It is important
to distinguish these trials from those in which participants receive the same interven-
tion at multiple sites (Section 6.2.7).
A unit-of-analysis problem arises if the same group of participants is included twice in the same
meta-analysis (for example, if ‘Dose 1 vs Placebo’ and ‘Dose 2 vs Placebo’ are both
included in the same meta-analysis, with the same placebo patients in both compar-
isons). Review authors should approach multiple intervention groups in an appropriate
way that avoids arbitrary omission of relevant groups and double-counting of partici-
pants (see MECIR Box 6.2.b) (see Chapter 23, Section 23.3). One option is network
meta-analysis, as discussed in Chapter 11.
3) For time-to-event data: it may be most convenient to extract hazard ratios (see Section 6.8.2). Similarly, for ordinal data
and rate data it may be convenient to extract effect estimates (see Sections 6.6.2
and 6.7.2).
4) For non-randomized studies: when extracting data from non-randomized studies,
adjusted effect estimates may be available (e.g. adjusted odds ratios from logistic
regression analyses, or adjusted rate ratios from Poisson regression analyses).
These are generally preferable to analyses based on summary statistics, because
they usually reduce the impact of confounding. The variables that have been used
for adjustment should be recorded (see Chapter 24).
5) When summary data for each group are not available: on occasion, summary data for
each intervention group may be sought, but cannot be extracted. In such situations
it may still be possible to include the study in a meta-analysis (using the generic
inverse variance method) if an effect estimate is extracted directly from the study
report.
For continuous outcome data, note that confidence intervals should have been computed
using t distributions, especially when the sample sizes are small: see Section 6.5.2.3 for details.
Where exact P values are quoted alongside estimates of intervention effect, it is
possible to derive SEs. While all tests of statistical significance produce P values,
different tests use different mathematical approaches. The method here assumes
P values have been obtained through a particularly simple approach of dividing the
effect estimate by its SE and comparing the result (denoted Z) with a standard normal
distribution (statisticians often refer to this as a Wald test).
The first step is to obtain the Z value corresponding to the reported P value from a
table of the standard normal distribution. A SE may then be calculated as
SE = intervention effect estimate/Z
As an example, suppose a conference abstract presents an estimate of a risk difference
of 0.03 (P = 0.008). The Z value that corresponds to a P value of 0.008 is Z = 2.652. This
can be obtained from a table of the standard normal distribution or a computer pro-
gram (for example, by entering =abs(normsinv(0.008/2)) into any cell in a Microsoft
Excel spreadsheet). The SE of the risk difference is obtained by dividing the risk differ-
ence (0.03) by the Z value (2.652), which gives 0.011.
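This calculation is easy to automate. The sketch below is illustrative only, assuming a two-sided P value from a Wald-type test; it uses scipy’s inverse survival function for the standard normal distribution to obtain Z and reproduces the worked example.

```python
from scipy.stats import norm

def se_from_p(effect_estimate, p_value):
    """Approximate the SE of an effect estimate from a two-sided P value,
    assuming the P value came from a Wald test (Z = estimate / SE)."""
    z = norm.isf(p_value / 2)  # Z corresponding to the two-sided P value
    return effect_estimate / z

# Worked example from the text: risk difference 0.03 with P = 0.008
print(se_from_p(0.03, 0.008))  # approximately 0.011 (Z = 2.652)
```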
Where significance tests have used other mathematical approaches, the estimated
SEs may not coincide exactly with the true SEs. For P values that are obtained from
t-tests for continuous outcome data, refer instead to Section 6.5.2.3.
Dichotomous data arise when the outcome of interest has a binary form. The most commonly encountered
effect measures used in randomized trials with dichotomous data are:
• the risk ratio (RR, also called the relative risk);
• the odds ratio (OR);
• the risk difference (RD, also called the absolute risk reduction); and
• the number needed to treat (NNT).
Details of the calculations of the first three of these measures are given in Box 6.4.a.
Numbers needed to treat are discussed in detail in Chapter 15 (Section 15.4), as they
are primarily used for the communication and interpretation of results.
Methods for meta-analysis of dichotomous outcome data are covered in Chapter 10
(Section 10.4).
Aside: as events of interest may be desirable rather than undesirable, it would be pref-
erable to use a more neutral term than risk (such as probability), but for the sake of
convention we use the terms risk ratio and risk difference throughout. We also use
the term ‘risk ratio’ in preference to ‘relative risk’ for consistency with other terminol-
ogy. The two are interchangeable and both conveniently abbreviate to ‘RR’. Note also
that we have been careful with the use of the words ‘risk’ and ‘rates’. These words are
often treated synonymously. However, we have tried to reserve use of the word ‘rate’
for the data type ‘counts and rates’ where it describes the frequency of events in a
measured period of time.
Box 6.4.a Calculation of risk ratio (RR), odds ratio (OR) and risk difference (RD) from a 2×2 table

The results of a two-group randomized trial with a dichotomous outcome can be displayed as a 2×2 table:

                              Event (‘S’)    No event (‘F’)    Total
Experimental intervention     S_E            F_E               N_E
Comparator intervention       S_C            F_C               N_C

where S_E, S_C, F_E and F_C are the numbers of participants with each outcome (‘S’ or ‘F’) in each group (‘E’ or ‘C’), and N_E = S_E + F_E and N_C = S_C + F_C are the group sizes. The following summary statistics can be calculated:

$$\mathrm{RR} = \frac{\text{risk of event in experimental group}}{\text{risk of event in comparator group}} = \frac{S_E/N_E}{S_C/N_C}$$

$$\mathrm{OR} = \frac{\text{odds of event in experimental group}}{\text{odds of event in comparator group}} = \frac{S_E/F_E}{S_C/F_C} = \frac{S_E F_C}{F_E S_C}$$

$$\mathrm{RD} = \text{risk of event in experimental group} - \text{risk of event in comparator group} = \frac{S_E}{N_E} - \frac{S_C}{N_C}$$
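As a minimal illustration, not Handbook code, the formulae in Box 6.4.a translate directly into Python; the counts used below are hypothetical, and no continuity correction is applied.

```python
def dichotomous_effects(s_e, f_e, s_c, f_c):
    """RR, OR and RD from a 2x2 table, in the notation of Box 6.4.a
    (S = events, F = non-events, E = experimental, C = comparator)."""
    n_e, n_c = s_e + f_e, s_c + f_c
    rr = (s_e / n_e) / (s_c / n_c)
    odds_ratio = (s_e * f_c) / (f_e * s_c)
    rd = s_e / n_e - s_c / n_c
    return rr, odds_ratio, rd

# Hypothetical trial: 15/100 events vs 20/100 events
print(dichotomous_effects(15, 85, 20, 80))  # (0.75, 0.705..., -0.05)
```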
6.4.1.2 Measures of relative effect: the risk ratio and odds ratio
Measures of relative effect express the expected outcome in one group relative to that
in the other. The risk ratio (RR, or relative risk) is the ratio of the risk of an event in
the two groups, whereas the odds ratio (OR) is the ratio of the odds of an event
(see Box 6.4.a). For both measures a value of 1 indicates that the estimated effects are
the same for both interventions.
Neither the risk ratio nor the odds ratio can be calculated for a study if there are no
events in the comparator group. This is because, as can be seen from the formulae in
Box 6.4.a, we would be trying to divide by zero. The odds ratio also cannot be calculated
if everybody in the intervention group experiences an event. In these situations, and
others where SEs cannot be computed, it is customary to add ½ to each cell of the
2×2 table (for example, RevMan automatically makes this correction when necessary).
In the case where no events (or all events) are observed in both groups the study
provides no information about relative probability of the event and is omitted from
the meta-analysis. This is entirely appropriate. Zeros arise particularly when the event
of interest is rare, such as unintended adverse outcomes. For further discussion of
choice of effect measures for such sparse data (often with lots of zeros) see
Chapter 10 (Section 10.4.4).
Risk ratios describe the multiplication of the risk that occurs with use of the
experimental intervention. For example, a risk ratio of 3 for an intervention
implies that events with intervention are three times more likely than events with-
out intervention. Alternatively we can say that intervention increases the risk of
events by 100 × (RR – 1)% = 200%. Similarly, a risk ratio of 0.25 is interpreted as
the probability of an event with intervention being one-quarter of that without
intervention. This may be expressed alternatively by saying that intervention
decreases the risk of events by 100 × (1 – RR)% = 75%. This is known as the
relative risk reduction (see also Chapter 15, Section 15.4.1). The interpretation
of the clinical importance of a given risk ratio cannot be made without knowledge
of the typical risk of events without intervention: a risk ratio of 0.75 could corre-
spond to a clinically important reduction in events from 80% to 60%, or a small,
less clinically important reduction from 4% to 3%. What constitutes clinically
important will depend on the outcome and the values and preferences of the per-
son or population.
The numerical value of the observed risk ratio must always lie somewhere between
0 and 1/CGR, where CGR (abbreviation of ‘comparator group risk’, sometimes referred
to as the control group risk or the control event rate) is the observed risk of the event
in the comparator group expressed as a number between 0 and 1. This means that for
common events large values of risk ratio are impossible. For example, when the
observed risk of events in the comparator group is 0.66 (or 66%) then the observed
risk ratio cannot exceed 1.5. This boundary applies only for increases in risk, and can
cause problems when the results of an analysis are extrapolated to a different
population in which the comparator group risks are above those observed in
the study.
Odds ratios, like odds, are more difficult to interpret (Sinclair and Bracken 1994,
Sackett et al 1996). Odds ratios describe the multiplication of the odds of the outcome
that occur with use of the intervention. To understand what an odds ratio means in
terms of changes in numbers of events it is simplest to convert it first into a risk ratio,
and then interpret the risk ratio in the context of a typical comparator group risk, as
outlined here. The formula for converting an odds ratio to a risk ratio is provided in
Chapter 15 (Section 15.4.4). Sometimes it may be sensible to calculate the RR for more
than one assumed comparator group risk.
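As a hedged illustration of that conversion (the formula itself is given in Chapter 15, Section 15.4.4), the sketch below applies the standard identity RR = OR / (1 − CGR + CGR × OR) across several assumed comparator group risks, showing how the implied RR moves away from the OR as the CGR grows.

```python
def or_to_rr(odds_ratio, cgr):
    """Convert an odds ratio to a risk ratio for an assumed comparator
    group risk (CGR, expressed as a number between 0 and 1)."""
    return odds_ratio / (1 - cgr + cgr * odds_ratio)

# The same OR corresponds to different RRs at different comparator risks.
for cgr in (0.05, 0.20, 0.50):
    print(cgr, round(or_to_rr(0.6, cgr), 3))
# 0.05 -> 0.612; 0.20 -> 0.652; 0.50 -> 0.75
```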
If participants have particular symptoms at the start of the study, the event of interest is
usually recovery or cure. If participants are well or, alternatively, at risk of some adverse
outcome at the beginning of the study, then the event is the onset of disease or
occurrence of the adverse outcome.
It is possible to switch events and non-events and consider instead the proportion of
patients not recovering or not experiencing the event. For meta-analyses using risk
differences or odds ratios the impact of this switch is of no great consequence: the
switch simply changes the sign of a risk difference, indicating an identical effect size
in the opposite direction, whilst for odds ratios the new odds ratio is the reciprocal
(1/x) of the original odds ratio.
In contrast, switching the outcome can make a substantial difference for risk ratios,
affecting the effect estimate, its statistical significance, and the consistency of interven-
tion effects across studies. This is because the precision of a risk ratio estimate differs
markedly between those situations where risks are low and those where risks are high.
In a meta-analysis, the effect of this reversal cannot be predicted easily. The identifi-
cation, before data analysis, of which risk ratio is more likely to be the most relevant
summary statistic is therefore important. It is often convenient to choose to focus on
the event that represents a change in state. For example, in treatment studies where
everyone starts in an adverse state and the intention is to ‘cure’ this, it may be more
natural to focus on ‘cure’ as the event. Alternatively, in prevention studies where
everyone starts in a ‘healthy’ state and the intention is to prevent an adverse event,
it may be more natural to focus on ‘adverse event’ as the event. A general rule of thumb
is to focus on the less common state as the event of interest. This reduces the problems
associated with extrapolation (see Section 6.4.1.2) and may lead to less heterogeneity
across studies. Where interventions aim to reduce the incidence of an adverse event,
there is empirical evidence that risk ratios of the adverse event are more consistent
than risk ratios of the non-event (Deeks 2002).
Different variations on the SMD are available depending on exactly what choice of SD
is chosen for the denominator. The particular definition of SMD used in Cochrane
Reviews is the effect size known in social science as Hedges’ (adjusted) g. This uses
a pooled SD in the denominator, which is an estimate of the SD based on outcome data
from both intervention groups, assuming that the SDs in the two groups are similar. In
contrast, Glass’ delta (Δ) uses only the SD from the comparator group, on the basis that
if the experimental intervention affects between-person variation, then such an impact
of the intervention should not influence the effect estimate.
To overcome problems associated with estimating SDs within small studies, and
with real differences across studies in between-person variability, it may sometimes
be desirable to standardize using an external estimate of SD. External estimates
might be derived, for example, from a cross-sectional analysis of many individuals
assessed using the same continuous outcome measure (the sample of individuals
might be derived from a large cohort study). Typically the external estimate would
be assumed to be known without error, which is likely to be reasonable if it is based
on a large number of individuals. Under this assumption, the statistical methods used
for MDs would be used, with both the MD and its SE divided by the externally
derived SD.
Sometimes the numbers of participants, means and SDs are not available, but an
effect estimate such as a MD or SMD has been reported. Such data may be included
in meta-analyses using the generic inverse variance method only when they are accom-
panied by measures of uncertainty such as a SE, 95% confidence interval or an exact
P value. A suitable SE from a confidence interval for a MD should be obtained using the
early steps of the process described in Section 6.5.2.3. For SMDs, see Section 6.3.
If the sample size is small (say fewer than 60 participants in each group) then con-
fidence intervals should have been calculated using a value from a t distribution. The
numbers 3.92, 3.29 and 5.15 are replaced with slightly larger numbers specific to the t
distribution, which can be obtained from tables of the t distribution with degrees of
freedom equal to the group sample size minus 1. Relevant details of the t distribution
are available as appendices of many statistical textbooks or from standard computer
spreadsheet packages. For example the t statistic for a 95% confidence interval from a
sample size of 25 can be obtained by typing =tinv(1-0.95,25-1) in a cell in a Microsoft
Excel spreadsheet (the result is 2.0639). The divisor, 3.92, in the formula above would be
replaced by 2 × 2.0639 = 4.128.
For moderate sample sizes (say between 60 and 100 in each group), either a t distri-
bution or a standard normal distribution may have been used. Review authors should
look for evidence of which one, and use a t distribution when in doubt.
As an example, suppose the experimental intervention group (with sample size 25) has a
reported 95% confidence interval for the mean of 30.0 to 34.2, and the comparator group
has sample size 22. The confidence intervals should have been based on t distributions
with 24 and 21 degrees of freedom, respectively. The divisor for the experimental
intervention group is 4.128, from above. The SD for this group is
√25 × (34.2 – 30.0)/4.128 = 5.09. Calculations for the comparator group are performed
in a similar way.
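A short Python sketch, illustrative only, reproduces this calculation, using scipy’s t distribution to obtain the appropriate divisor for any group size and confidence level.

```python
import math
from scipy.stats import t

def sd_from_ci(n, lower, upper, level=0.95):
    """Estimate a group SD from a confidence interval for its mean,
    assuming the interval used a t distribution with n - 1 df."""
    divisor = 2 * t.ppf(1 - (1 - level) / 2, n - 1)
    return math.sqrt(n) * (upper - lower) / divisor

# Worked example from the text: n = 25, 95% CI from 30.0 to 34.2
print(sd_from_ci(25, 30.0, 34.2))  # approximately 5.09
```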
It is important to check that the confidence interval is symmetrical about the
mean (the distance between the lower limit and the mean is the same as the
distance between the mean and the upper limit). If this is not the case, the con-
fidence interval may have been calculated on transformed values (see
Section 6.5.2.4).
6.5.2.6 Ranges
Ranges are very unstable and, unlike other measures of variation, increase when the sam-
ple size increases. They describe the extremes of observed outcomes rather than the
average variation. One common approach has been to make use of the fact that, with nor-
mally distributed data, 95% of values will lie within 2 × SD either side of the mean. The SD
may therefore be estimated to be approximately one-quarter of the typical range of data
values. This method is not robust and we recommend that it not be used. Walter and Yao
based an imputation method on the minimum and maximum observed values. Their
enhancement of the ‘range’ method provided a lookup table, according to sample size,
of conversion factors from range to SD (Walter and Yao 2007). Alternative methods have
been proposed to estimate SDs from ranges and quantiles (Hozo et al 2005, Wan et al 2014,
Bland 2015), although to our knowledge these have not been evaluated using empirical
data. As a general rule, we recommend that ranges should not be used to estimate SDs.
Note that the mean change in each group can be obtained by subtracting the baseline
mean from the post-intervention mean, even if it has not been presented explicitly.
However, the information in this table does not allow us to calculate the SD of the changes.
We cannot know whether the changes were very consistent or very variable across indivi-
duals. Some other information in a paper may help us determine the SD of the changes.
When there is not enough information available in a paper to calculate the SDs for the
changes, they can be imputed, for example, by using change-from-baseline SDs for the
same outcome measure from other studies in the review. However, the appropriateness
of using a SD from another study depends on whether the studies used the same measure-
ment scale, had the same degree of measurement error, had the same time interval
between baseline and post-intervention measurement, and were conducted in similar populations.
When statistical analyses comparing the changes themselves are presented (e.g. con-
fidence intervals, SEs, t statistics, P values, F statistics) then the techniques described in
Section 6.5.2.3 may be used. Also note that an alternative to these methods is simply to
use a comparison of post-intervention measurements, which in a randomized trial in
theory estimates the same quantity as the comparison of changes from baseline.
The following alternative technique may be used for calculating or imputing missing
SDs for changes from baseline (Follmann et al 1992, Abrams et al 2005). A typically unre-
ported number known as the correlation coefficient describes how similar the baseline
and post-intervention measurements were across participants. Here we describe
(1) how to calculate the correlation coefficient from a study that is reported in consid-
erable detail and (2) how to impute a change-from-baseline SD in another study, mak-
ing use of a calculated or imputed correlation coefficient. Note that the methods in
(2) are applicable both to correlation coefficients obtained using (1) and to correlation
coefficients obtained in other ways (for example, by reasoned argument). Methods in
(2) should be used sparingly because one can never be sure that an imputed correlation
is appropriate. This is because correlations between baseline and post-intervention
values usually will, for example, decrease with increasing time between baseline
and post-intervention measurements, as well as depending on the outcomes, charac-
teristics of the participants and intervention effects.
                                            Baseline               Post-intervention      Change from baseline
Experimental intervention (sample size 129) mean = 15.2, SD = 6.4   mean = 16.2, SD = 7.1  mean = 1.0, SD = 4.5
Comparator intervention (sample size 135)   mean = 15.7, SD = 7.0   mean = 17.2, SD = 6.9  mean = 1.5, SD = 4.2
An analysis of change from baseline is available from this study, using only the data in
the final column. We can use other data in this study to calculate two correlation
coefficients, one for each intervention group. Let us write SD_{E,baseline}, SD_{E,final} and
SD_{E,change} for the baseline, post-intervention and change-from-baseline SDs in the
experimental intervention group, and similarly (with subscript C) for the comparator
intervention group.
The correlation coefficient in the experimental group, Corr_E, can be calculated as:

$$\mathrm{Corr}_E = \frac{SD_{E,\text{baseline}}^2 + SD_{E,\text{final}}^2 - SD_{E,\text{change}}^2}{2 \times SD_{E,\text{baseline}} \times SD_{E,\text{final}}}$$

and similarly for the comparator intervention, to obtain Corr_C. In the example, these turn out to be:

$$\mathrm{Corr}_E = \frac{6.4^2 + 7.1^2 - 4.5^2}{2 \times 6.4 \times 7.1} = 0.78, \qquad \mathrm{Corr}_C = \frac{7.0^2 + 6.9^2 - 4.2^2}{2 \times 7.0 \times 6.9} = 0.82$$
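The same calculation in Python, shown here as a sketch for illustration, reproduces both correlation coefficients from the example data.

```python
def corr_from_sds(sd_baseline, sd_final, sd_change):
    """Correlation between baseline and final measurements, recovered
    from the three SDs reported for one intervention group."""
    return (sd_baseline**2 + sd_final**2 - sd_change**2) / (
        2 * sd_baseline * sd_final)

print(round(corr_from_sds(6.4, 7.1, 4.5), 2))  # 0.78, experimental group
print(round(corr_from_sds(7.0, 6.9, 4.2), 2))  # 0.82, comparator group
```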
When either the baseline or post-intervention SD is unavailable, then it may be substi-
tuted by the other, providing it is reasonable to assume that the intervention does not alter
the variability of the outcome measure. Assuming the correlation coefficients from the two
intervention groups are reasonably similar to each other, a simple average can be taken as
a reasonable measure of the similarity of baseline and final measurements across all indi-
viduals in the study (in the example, the average of 0.78 and 0.82 is 0.80). It is recommended
that correlation coefficients be computed for many (if not all) studies in the meta-analysis
and examined for consistency. If the correlation coefficients differ, then either the sample
sizes are too small for reliable estimation, the intervention is affecting the variability in out-
come measures, or the intervention effect depends on baseline level, and the use of aver-
age is best avoided. In addition, if a value less than 0.5 is obtained (correlation coefficients
lie between –1 and 1), then there is little benefit in using change from baseline and an anal-
ysis of post-intervention measurements will be more precise.
The change-from-baseline SD in the experimental group can then be imputed as:

$$SD_{E,\text{change}} = \sqrt{SD_{E,\text{baseline}}^2 + SD_{E,\text{final}}^2 - 2 \times \mathrm{Corr} \times SD_{E,\text{baseline}} \times SD_{E,\text{final}}}$$
and similarly for the comparator intervention. Again, if either of the SDs (at baseline
and post-intervention) is unavailable, then one may be substituted by the other as long
as it is reasonable to assume that the intervention does not alter the variability of the
outcome measure.
As an example, consider a comparator intervention group with a baseline SD of 4.0 and a post-intervention SD of 4.4.
Using the correlation coefficient calculated in step 1 above of 0.80, we can impute the
change-from-baseline SD in the comparator group as:
$$SD_{C,\text{change}} = \sqrt{4.0^2 + 4.4^2 - 2 \times 0.80 \times 4.0 \times 4.4} = 2.68$$
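The imputation step can likewise be written as a one-line function; the sketch below, illustrative only, reproduces the example.

```python
import math

def impute_sd_change(sd_baseline, sd_final, corr):
    """Impute a change-from-baseline SD from the baseline and final SDs
    and a calculated or imputed correlation coefficient."""
    return math.sqrt(sd_baseline**2 + sd_final**2
                     - 2 * corr * sd_baseline * sd_final)

# Worked example from the text: comparator group with Corr = 0.80
print(impute_sd_change(4.0, 4.4, 0.80))  # approximately 2.68
```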
Table 6.5.a Formulae for combining summary statistics across two groups: Group 1 (with sample
size = N1, mean = M1 and SD = SD1) and Group 2 (with sample size = N2, mean = M2 and SD = SD2)

Combined sample size: $N_1 + N_2$

Combined mean: $\dfrac{N_1 M_1 + N_2 M_2}{N_1 + N_2}$

Combined SD: $\sqrt{\dfrac{(N_1 - 1)SD_1^2 + (N_2 - 1)SD_2^2 + \dfrac{N_1 N_2}{N_1 + N_2}\left(M_1^2 + M_2^2 - 2 M_1 M_2\right)}{N_1 + N_2 - 1}}$
These formulae are also appropriate for use in studies that compared three or more
interventions, two of which represent the same intervention category as defined for the
purposes of the review. In that case, it may be appropriate to combine these two
groups and consider them as a single intervention (see Chapter 23, Section 23.3).
For example, ‘Group 1’ and ‘Group 2’ may refer to two slightly different variants of
an intervention to which participants were randomized, such as different doses of
the same drug.
When there are more than two groups to combine, the simplest strategy is to
apply the above formula sequentially (i.e. combine Group 1 and Group 2 to create
Group ‘1+2’, then combine Group ‘1+2’ and Group 3 to create Group ‘1+2+3’, and so on).
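The formulae in Table 6.5.a, and the sequential strategy just described, can be checked with a small helper function; the following Python sketch, with made-up group summaries, is illustrative only. Note that M1² + M2² − 2M1M2 is simply (M1 − M2)².

```python
import math

def combine_groups(n1, m1, sd1, n2, m2, sd2):
    """Combine sample size, mean and SD of two groups (Table 6.5.a)."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1 - m2)**2) / (n - 1)
    return n, mean, math.sqrt(var)

# Combine three hypothetical groups sequentially: (1 + 2), then with 3.
g12 = combine_groups(50, 10.0, 2.0, 40, 12.0, 2.5)
g123 = combine_groups(*g12, 30, 11.0, 3.0)
print(g123)  # combined N, mean and SD for all three groups
```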
For example, on a three-category scale, category 1 might constitute a success and
categories 2 and 3 a failure; or categories 1 and 2 a success and category
both dichotomies of the data. Therefore, the odds ratio calculated from the propor-
tional odds model can be interpreted as the odds of success on the experimental inter-
vention relative to comparator, irrespective of how the ordered categories might be
divided into success or failure. Methods (specifically polychotomous logistic regression
models) are available for calculating study estimates of the log odds ratio and its SE.
Methods specific to ordinal data become unwieldy (and unnecessary) when the num-
ber of categories is large. In practice, longer ordinal scales acquire properties similar to
continuous outcomes, and are often analysed as such, whilst shorter ordinal scales are
often made into dichotomous data by combining adjacent categories together until
only two remain. The latter is especially appropriate if an established, defensible
cut-point is available. However, inappropriate choice of a cut-point can induce bias,
particularly if it is chosen to maximize the difference between two intervention arms
in a randomized trial.
Where ordinal scales are summarized using methods for dichotomous data, one of
the two sets of grouped categories is defined as the event and intervention effects are
described using risk ratios, odds ratios or risk differences (see Section 6.4.1). When ordi-
nal scales are summarized using methods for continuous data, the mean score is cal-
culated in each group and intervention effect is expressed as a MD or SMD, or possibly a
RoM (see Section 6.5.1). Difficulties will be encountered if studies have summarized
their results using medians (see Section 6.5.2.5). Methods for meta-analysis of ordinal
outcome data are covered in Chapter 10 (Section 10.7).
$$\text{SE of rate difference} = \sqrt{\frac{E_E}{T_E^2} + \frac{E_C}{T_C^2}}$$

where E denotes the number of events and T the total participant-time at risk in the experimental (E) and comparator (C) groups.
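For illustration, not Handbook code, the rate difference and its SE can be computed as below; the event counts and participant-time values are hypothetical.

```python
import math

def rate_difference(events_e, time_e, events_c, time_c):
    """Rate difference and its SE from event counts (E) and total
    participant-time at risk (T) in each group."""
    rd = events_e / time_e - events_c / time_c
    se = math.sqrt(events_e / time_e**2 + events_c / time_c**2)
    return rd, se

# Hypothetical data: 30 events over 1000 person-years vs 45 over 1100
print(rate_difference(30, 1000, 45, 1100))
```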
Counts of more common events, such as counts of decayed, missing or filled teeth, may
often be treated in the same way as continuous outcome data. The intervention effect
used will be the MD which will compare the difference in the mean number of events
(possibly standardized to a unit time period) experienced by participants in the inter-
vention group compared with participants in the comparator group.
Funding: JPTH is a member of the National Institute for Health Research (NIHR) Biomed-
ical Research Centre at University Hospitals Bristol NHS Foundation Trust and the Uni-
versity of Bristol. JJD received support from the NIHR Birmingham Biomedical Research
Centre at the University Hospitals Birmingham NHS Foundation Trust and the University
of Birmingham. JPTH received funding from National Institute for Health Research Senior
Investigator award NF-SI-0617-10145. The views expressed are those of the author(s) and
not necessarily those of the NHS, the NIHR or the Department of Health.
6.11 References
Abrams KR, Gillies CL, Lambert PC. Meta-analysis of heterogeneously reported trials
assessing change from baseline. Statistics in Medicine 2005; 24: 3823–3844.
Ades AE, Lu G, Dias S, Mayo-Wilson E, Kounali D. Simultaneous synthesis of treatment effects
and mapping to a common scale: an alternative to standardisation. Research Synthesis
Methods 2015; 6: 96–107.
Agresti A. An Introduction to Categorical Data Analysis. New York (NY): John Wiley &
Sons; 1996.
Anzures-Cabrera J, Sarpatwari A, Higgins JPT. Expressing findings from meta-analyses of
continuous outcomes in terms of risks. Statistics in Medicine 2011; 30: 2967–2985.
Bland M. Estimating mean and standard deviation from the sample size, three quartiles,
minimum, and maximum. International Journal of Statistics in Medical Research 2015; 4:
57–64.
Colantuoni E, Scharfstein DO, Wang C, Hashem MD, Leroux A, Needham DM, Girard TD.
Statistical methods to compare functional outcomes in randomized controlled trials with
high mortality. BMJ 2018; 360: j5748.
Collett D. Modelling Survival Data in Medical Research. London (UK): Chapman & Hall; 1994.
Deeks J. Are you sure that’s a standard deviation? (part 1). Cochrane News 1997a; 10: 11–12.
Deeks J. Are you sure that’s a standard deviation? (part 2). Cochrane News 1997b; 11: 11–12.
Deeks JJ, Altman DG, Bradburn MJ. Statistical methods for examining heterogeneity and
combining results from several studies in meta-analysis. In: Egger M, Davey Smith G,
Altman DG, editors. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd
edition. London (UK): BMJ Publication Group; 2001. pp. 285–312.
Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials
with binary outcomes. Statistics in Medicine 2002; 21: 1575–1600.
Dubey SD, Lehnhoff RW, Radike AW. A statistical confidence interval for true per cent
reduction in caries-incidence studies. Journal of Dental Research 1965; 44: 921–923.
Early Breast Cancer Trialists’ Collaborative Group. Treatment of Early Breast Cancer. Volume
1: Worldwide Evidence 1985–1990. Oxford (UK): Oxford University Press; 1990.
Follmann D, Elliott P, Suh I, Cutler J. Variance imputation for overviews of clinical trials with
continuous response. Journal of Clinical Epidemiology 1992; 45: 769–773.
Friedrich JO, Adhikari N, Herridge MS, Beyene J. Meta-analysis: low-dose dopamine
increases urine output but does not prevent renal dysfunction or death. Annals of Internal
Medicine 2005; 142: 510–524.
Friedrich JO, Adhikari NK, Beyene J. The ratio of means method as an alternative to mean
differences for analyzing continuous outcome variables in meta-analysis: a simulation
study. BMC Medical Research Methodology 2008; 8: 32.
Furukawa TA, Barbui C, Cipriani A, Brambilla P, Watanabe N. Imputing missing standard
deviations in meta-analyses can provide accurate results. Journal of Clinical Epidemiology
2006; 59: 7–10.
Guyot P, Ades AE, Ouwens MJ, Welton NJ. Enhanced secondary analysis of survival data:
reconstructing the data from published Kaplan-Meier survival curves. BMC Medical
Research Methodology 2012; 12: 9.
Higgins JPT, White IR, Anzures-Cabrera J. Meta-analysis of skewed data: combining results
reported on log-transformed or raw scales. Statistics in Medicine 2008; 27: 6072–6092.
Hozo SP, Djulbegovic B, Hozo I. Estimating the mean and variance from the median, range,
and the size of a sample. BMC Medical Research Methodology 2005; 5: 13.
Johnston BC, Thorlund K, Schünemann HJ, Xie F, Murad MH, Montori VM, Guyatt GH.
Improving the interpretation of quality of life evidence in meta-analyses: the
application of minimal important difference units. Health and Quality of Life Outcomes
2010; 8: 116.
Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the
consequences of treatment. New England Journal of Medicine 1988; 318: 1728–1733.
175
6 Choosing effect measures
MacLennan JM, Shackley F, Heath PT, Deeks JJ, Flamank C, Herbert M, Griffiths H,
Hatzmann E, Goilav C, Moxon ER. Safety, immunogenicity, and induction of immunologic
memory by a serogroup C meningococcal conjugate vaccine in infants: a randomized
controlled trial. JAMA 2000; 283: 2795–2801.
Marinho VCC, Higgins JPT, Logan S, Sheiham A. Fluoride toothpaste for preventing dental
caries in children and adolescents. Cochrane Database of Systematic Reviews 2003; 1:
CD002278.
Parmar MKB, Torri V, Stewart L. Extracting summary statistics to perform meta-analyses of
the published literature for survival endpoints. Statistics in Medicine 1998; 17: 2815–2834.
Sackett DL, Deeks JJ, Altman DG. Down with odds ratios! Evidence Based Medicine 1996; 1:
164–166.
Sackett DL, Richardson WS, Rosenberg W, Haynes BR. Evidence-Based Medicine: How to
Practice and Teach EBM. Edinburgh (UK): Churchill Livingstone; 1997.
Simmonds MC, Tierney J, Bowden J, Higgins JPT. Meta-analysis of time-to-event data: a
comparison of two-stage methods. Research Synthesis Methods 2011; 2: 139–149.
Sinclair JC, Bracken MB. Clinically useful measures of effect in binary analyses of
randomized trials. Journal of Clinical Epidemiology 1994; 47: 881–889.
Tierney JF, Stewart LA, Ghersi D, Burdett S, Sydes MR. Practical methods for incorporating
summary time-to-event data into meta-analysis. Trials 2007; 8.
Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is
statistically inefficient: a simulation study. BMC Medical Research Methodology 2001; 1: 6.
Walter SD, Yao X. Effect sizes can be calculated for studies reporting ranges for outcome
variables in systematic reviews. Journal of Clinical Epidemiology 2007; 60: 849–852.
Wan X, Wang W, Liu J, Tong T. Estimating the sample mean and standard deviation from the
sample size, median, range and/or interquartile range. BMC Medical Research
Methodology 2014; 14: 135.
Weir CJ, Butcher I, Assi V, Lewis SC, Murray GD, Langhorne P, Brady MC. Dealing with missing
standard deviation and mean values in meta-analysis of continuous outcomes: a
systematic review. BMC Medical Research Methodology 2018; 18: 25.
Williamson PR, Smith CT, Hutton JL, Marson AG. Aggregate data meta-analysis with time-to-
event outcomes. Statistics in Medicine 2002; 21: 3337–3351.
7
Considering bias and conflicts of interest
among the included studies
Isabelle Boutron, Matthew J Page, Julian PT Higgins, Douglas G Altman, Andreas
Lundh, Asbjørn Hróbjartsson; on behalf of the Cochrane Bias Methods Group
KEY POINTS
• Review authors should seek to minimize bias. We draw a distinction between two places in which bias should be considered. The first is in the results of the individual studies included in a systematic review. The second is in the result of the meta-analysis (or other synthesis) of findings from the included studies.
• Problems with the design and execution of individual studies of healthcare interventions raise questions about the internal validity of their findings; empirical evidence provides support for this concern.
• An assessment of the internal validity of studies included in a Cochrane Review should emphasize the risk of bias in their results, that is, the risk that they will over-estimate or under-estimate the true intervention effect.
• Results of meta-analyses (or other syntheses) across studies may additionally be affected by bias due to the absence of results from studies that should have been included in the synthesis.
• Review authors should consider source of funding and conflicts of interest of authors of the study, which may inform the exploration of directness and heterogeneity of study results, assessment of risk of bias within studies, and assessment of risk of bias in syntheses owing to missing results.
7.1 Introduction
Cochrane Reviews seek to minimize bias. We define bias as a systematic error, or devi-
ation from the truth, in results. Biases can lead to under-estimation or over-estimation
of the true intervention effect and can vary in magnitude: some are small (and trivial
compared with the observed effect) and some are substantial (so that an apparent
This chapter should be cited as: Boutron I, Page MJ, Higgins JPT, Altman DG, Lundh A, Hróbjartsson A.
Chapter 7: Considering bias and conflicts of interest among the included studies. In: Higgins JPT, Thomas J,
Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews
of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 177–204.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
finding may be due entirely to bias). A source of bias may even vary in direction across
studies. For example, bias due to a particular design flaw such as lack of allocation
sequence concealment may lead to under-estimation of an effect in one study but
over-estimation in another (Jüni et al 2001).
Bias can arise because of the actions of primary study investigators or because of
the actions of review authors, or may be unavoidable due to constraints on how
research can be undertaken in practice. Actions of authors can, in turn, be influenced
by conflicts of interest. In this chapter we introduce issues of bias in the context of a
Cochrane Review, covering both biases in the results of included studies and biases in
the results of a synthesis. We introduce the general principles of assessing the risk
that bias may be present, as well as the presentation of such assessments and their
incorporation into analyses. Finally, we address how source of funding and conflicts
of interest of study authors may impact on study design, conduct and reporting.
Conflicts of interest held by review authors are also of concern; these should
be addressed using editorial procedures and are not covered by this chapter (see
Chapter 1, Section 1.3).
We draw a distinction between two places in which bias should be considered.
The first is in the results of the individual studies included in a systematic review.
Since the conclusions drawn in a review depend on the results of the included
studies, if these results are biased, then a meta-analysis of the studies will produce
a misleading conclusion. Therefore, review authors should systematically take into
account risk of bias in results of included studies when interpreting the results of
their review.
The second place in which bias should be considered is the result of the meta-analysis
(or other synthesis) of findings from the included studies. This result will be affected by
biases in the included studies, and may additionally be affected by bias due to the
absence of results from studies that should have been included in the synthesis.
Specifically, the conclusions of the review may be compromised when decisions about
how, when and where to report results of eligible studies are influenced by the nature
and direction of the results. This is the problem of ‘non-reporting bias’ (also described
as ‘publication bias’ and ‘selective reporting bias’). There is convincing evidence that
results that are statistically non-significant and unfavourable to the experimental
intervention are less likely to be published than statistically significant results, and
hence are less easily identified by systematic reviews (see Section 7.2.3). This leads
to results being missing systematically from syntheses, which can lead to syntheses
over-estimating or under-estimating the effects of an intervention. For this reason,
the assessment of risk of bias due to missing results is another essential component
of a Cochrane Review.
Both the risk of bias in included studies and risk of bias due to missing results may be
influenced by conflicts of interest of study investigators or funders. For example,
investigators with a financial interest in showing that a particular drug works may
exclude participants who did not respond favourably to the drug from the analysis,
or fail to report unfavourable results of the drug in a manuscript.
Further discussion of assessing risk of bias in the results of an individual rando-
mized trial is available in Chapter 8, and of a non-randomized study in Chapter 25.
Further discussion of assessing risk of bias due to missing results is available in
Chapter 13.
In 2008, Cochrane released the Cochrane risk-of-bias (RoB) tool, which was slightly
revised in 2011 (Higgins et al 2011). The tool was built on the following key principles.
1) The tool focused on a single concept: risk of bias. It did not consider other concepts
such as the quality of reporting, precision (the extent to which results are free of
random errors), or external validity (directness, applicability or generalizability).
2) The tool was based on a domain-based (or component) approach, in which different
types of bias are considered in turn. Users were asked to assess seven domains: ran-
dom sequence generation, allocation sequence concealment, blinding of partici-
pants and personnel, blinding of outcome assessment, incomplete outcome
data, selective outcome reporting, and other sources of bias. There was no scoring
system in the tool.
3) The domains were selected to characterize mechanisms through which bias may be
introduced into a trial, based on a combination of theoretical considerations and
empirical evidence.
4) The assessment of risk of bias required judgement and should thus be completely
transparent. Review authors provided a judgement for each domain, rated as ‘low’,
‘high’ or ‘unclear’ risk of bias, and provided reasons to support their judgement.
This tool has been implemented widely both in Cochrane Reviews and non-Cochrane
reviews (Jørgensen et al 2016). However, user testing has raised some concerns related
to the modest inter-rater reliability of some domains (Hartling et al 2013), the need to
rethink the theoretical background of the ‘selective outcome reporting’ domain (Page
and Higgins 2016), the misuse of the ‘other sources of bias’ domain (Jørgensen et al
2016), and the lack of appropriate consideration of the risk-of-bias assessment in
the analyses and interpretation of results (Hopewell et al 2013).
To address these concerns, a new version of the Cochrane risk-of-bias tool, RoB 2,
has been developed. The tool, described in Chapter 8, includes important innovations
in the assessment of risk of bias in randomized trials. The structure of the tool is sim-
ilar to that of the ROBINS-I tool for non-randomized studies of interventions
(described in Chapter 25). Both tools include a fixed set of bias domains, which
are intended to cover all issues that might lead to a risk of bias. To help reach
risk-of-bias judgements, a series of ‘signalling questions’ are included within each
domain. Also, the assessment is typically specific to a particular result. This is because
the risk of bias may differ depending on how an outcome is measured and how the
data for the outcome are analysed. For example, if two analyses for a single outcome
are presented, one adjusted for baseline prognostic factors and the other not, then
the risk of bias in the two results may be different. The risk of bias in at least one
specific result for each included study should be assessed in all Cochrane Reviews
(MECIR Box 7.1.a).
Recent meta-epidemiologic studies have shown that effect estimates were
lower in prospectively registered trials compared with trials not registered or registered
retrospectively (Dechartres et al 2016b, Odutayo et al 2017). Others have shown an
association between sample size and effect estimates, with larger effects observed
in smaller trials (Dechartres et al 2013). Studies have also shown a consistent associ-
ation between intervention effect and single or multiple centre status, with single-
centre trials showing larger effect estimates, even after controlling for sample size
(Dechartres et al 2011).
In some of these cases, plausible bias mechanisms can be hypothesized. For exam-
ple, both the number of centres and sample size may be associated with intervention
effect estimates because of non-reporting bias (e.g. single-centre studies and small
studies may be more likely to be published when they have larger, statistically signif-
icant effects than when they have smaller, non-significant effects); or single-centre and
small studies may be subject to less stringent controls and checks. However, alternative
explanations are possible, such as differences in factors relating to external validity
(e.g. participants in small, single-centre trials may be more homogeneous than partici-
pants in other trials). Because of this, these factors are not directly captured by the risk-
of-bias tools recommended by Cochrane. Review authors should record these charac-
teristics systematically for each study included in the systematic review (e.g. in the
‘Characteristics of included studies’ table) where appropriate. For example, trial regis-
tration status should be recorded for all randomized trials identified.
The number of times a study report is cited appears to be influenced by the nature
and direction of its results (‘citation bias’). In a meta-analysis of 21 methodological
studies, Duyx and colleagues observed that articles with statistically significant results
were cited at 1.57 times the rate of articles with non-significant results (rate ratio 1.57;
95% CI 1.34 to 1.83) (Duyx et al 2017). They also found that articles with results in a
positive direction (regardless of their statistical significance) were cited at 2.14 times
the rate of articles with results in a negative direction (rate ratio 2.14; 95% CI 1.29
to 3.56) (Duyx et al 2017). In an analysis of 33,355 studies across all areas of science,
Fanelli and colleagues found that the number of citations received by a study was pos-
itively correlated with the magnitude of effects reported (Fanelli et al 2017). If positive
studies are more likely to be cited, they may be more likely to be located, and thus more
likely to be included in a systematic review.
Investigators may report the results of their study across multiple publications; for
example, Blümle and colleagues found that of 807 studies approved by a research eth-
ics committee in Germany from 2000 to 2002, 135 (17%) had more than one corre-
sponding publication (Blümle et al 2014). Evidence suggests that studies with
statistically significant results or larger treatment effects are more likely to lead to mul-
tiple publications (‘multiple (duplicate) publication bias’) (Easterbrook et al 1991,
Tramèr et al 1997), which makes it more likely that they will be located and included
in a meta-analysis.
Research suggests that the accessibility or level of indexing of journals is associated
with effect estimates in trials (‘location bias’). For example, a study of 61 meta-analyses
found that trials published in journals indexed in Embase but not MEDLINE yielded
smaller effect estimates than trials indexed in MEDLINE (ratio of odds ratios 0.71;
95% CI 0.56 to 0.90); however, the risk of bias due to not searching Embase may be
minor, given the lower prevalence of Embase-unique trials (Sampson et al 2003). Also,
Moher and colleagues estimate that 18,000 biomedical research studies are tucked
away in ‘predatory’ journals, which actively solicit manuscripts and charge publication
fees without providing robust editorial services (such as peer review and archiving or
indexing of articles) (Moher et al 2017). The direction of bias associated with non-
inclusion of studies published in predatory journals depends on whether they are pub-
lishing valid studies with null results or studies whose results are biased towards find-
ing an effect.
What investigators planned to report often differs from what is available in the final trial report. In two landmark studies, Chan and colleagues found that results were not reported for at least one benefit outcome in 71% of
randomized trials in one cohort (Chan et al 2004a) and 88% in another (Chan et al
2004b). Results were under-reported (e.g. stating only that “P > 0.05”) for at least
one benefit outcome in 92% of randomized trials in one cohort and 96% in another.
Statistically significant results for benefit outcomes were twice as likely as
non-significant results to be completely reported (range of odds ratios 2.4 to 2.7)
(Chan et al 2004a, Chan et al 2004b). Reviews of studies investigating selective
non-reporting and under-reporting of results suggest that it is more common for
outcomes defined by trialists as secondary rather than primary (Jones et al 2015,
Li et al 2018).
Selective non-reporting and under-reporting of results occurs for both benefit and
harm outcomes. Examining the studies included in a sample of 283 Cochrane Reviews,
Kirkham and colleagues suspected that 50% of 712 studies with results missing for the
primary benefit outcome of the review were missing because of the nature of the
results (Kirkham et al 2010). This estimate was higher (63%) in 393 studies with
results missing for the primary harm outcome of 322 systematic reviews (Saini
et al 2014).
Figure 7.4.a Forest plot displaying RoB 2 risk-of-bias judgements for each randomized trial included in a meta-analysis of mental health first aid (MHFA) knowledge scores. Adapted from Morgan et al (2018).
The tools recommended for assessing risk of bias in randomized trials and in non-randomized studies (RoB 2 and ROBINS-I) produce an overall judgement of risk of bias for the result
being assessed. These overall judgements are derived from assessments of individual
bias domains as described, for example, in Chapter 8 (Section 8.2).
To summarize risk of bias across study results in a synthesis, review authors should fol-
low guidance for assessing certainty in the body of evidence (e.g. using GRADE), as
described in Chapter 14 (Section 14.2.2). When a meta-analysis is dominated by study
results at high risk of bias, the certainty of the body of evidence may be rated as being lower
than if such studies were excluded from the meta-analysis. Section 7.6 discusses some pos-
sible courses of action that may be preferable to retaining such studies in the synthesis.
Review authors should not present analyses and interpretations while ignoring flaws identified during the assessment of risk
of bias. In this section we present suitable strategies for addressing risk of bias in results
from studies included in a meta-analysis, either in order to understand the impact of bias
or to determine a suitable estimate of intervention effect (Section 7.6.2). For the latter,
decisions often involve a trade-off between bias and precision. A meta-analysis that
includes all eligible studies may produce a result with high precision (narrow confidence
interval) but be seriously biased because of flaws in the conduct of some of the studies.
However, including only the studies at low risk of bias in all domains assessed may produce
a result that is unbiased but imprecise (if there are only a few studies at low risk of bias).
The choice between strategies (1) and (2) should be based to a large extent on the bal-
ance between the potential for bias and the loss of precision when studies at high or
unclear risk of bias are excluded.
Three problematic practices are distinguished: (1) selective non-reporting of results, where results for some outcomes or analyses are selectively omitted from a published report; (2) selective under-reporting of data,
where results for some outcomes are selectively reported with inadequate detail for the
data to be included in a meta-analysis; and (3) bias in selection of the reported result,
where a result has been selected for reporting by the study authors, on the basis of the
results, from multiple measurements or analyses that have been generated for the out-
come domain (Page and Higgins 2016).
The RoB 2 and ROBINS-I tools focus solely on risk of bias as it pertains to a specific
trial result. With respect to selective reporting, RoB 2 and ROBINS-I examine whether a
specific result from the trial is likely to have been selected from multiple possible
results on the basis of the findings (scenario 3 above). Guidance on assessing the risk
of bias in selection of the reported result is available in Chapter 8 (for randomized trials)
and Chapter 25 (for non-randomized studies of interventions).
If there is no result (i.e. it has been omitted selectively from the report or under-
reported), then a risk-of-bias assessment at the level of the study result is not applicable.
Selective non-reporting of results and selective under-reporting of data are therefore not
covered by the RoB 2 and ROBINS-I tools. Instead, selective non-reporting of results and
under-reporting of data should be assessed at the level of the synthesis across studies.
Both practices lead to a situation similar to that when an entire study report is unavail-
able because of the nature of the results (also known as publication bias). Regardless of
whether an entire study report or only a particular result of a study is unavailable, the
same consequence can arise: bias in a synthesis because available results differ system-
atically from missing results (Page et al 2018). Chapter 13 provides detailed guidance on
assessing risk of bias due to missing results in a systematic review.
At the time of writing, a formal Tool for Addressing Conflicts of Interest in Trials (TACIT)
is being developed under the auspices of the Cochrane Bias Methods Group. The TACIT
development process has informed the content of this section, and we encourage read-
ers to check http://tacit.one for more detailed guidance that will become available.
For example, funders may decide to discontinue trials with negative findings, and not to publish unfavourable results (Sterne 2013).
When relevant trial results are systematically missing from a meta-analysis because
of the nature of the findings, the synthesis is at risk of bias due to missing results.
Chapter 13 provides detailed guidance on assessing risk of bias due to missing results
in a systematic review.
Conflicts of interest may, for example, be less concerning in a trial comparing two treatments in general use with no connection to highly controversial scientific theories, ideology or professional groups. Mixing trivial conflicts of interest with important ones may mask the latter and will expand review authors' workload considerably.
Acknowledgements: We thank Gerd Antes, Peter Gøtzsche, Peter Jüni, Steff Lewis,
David Moher, Andrew Oxman, Ken Schulz, Jonathan Sterne and Simon Thompson
for their contributions to previous versions of this chapter.
7.10 References
Ahn R, Woodbridge A, Abraham A, Saba S, Korenstein D, Madden E, Boscardin WJ, Keyhani S.
Financial ties of principal investigators and randomized controlled trial outcomes: cross
sectional study. BMJ 2017; 356: i6770.
Als-Nielsen B, Chen W, Gluud C, Kjaergard LL. Association of funding and conclusions in
randomized drug trials: a reflection of treatment effect or adverse events? JAMA 2003;
290: 921–928.
Bero LA, Grundy Q. Why having a (nonfinancial) interest is not a conflict of interest. PLoS
Biology 2016; 14: e2001221.
Blümle A, Meerpohl JJ, Schumacher M, von Elm E. Fate of clinical research studies after ethical
approval – follow-up of study protocols until publication. PloS One 2014; 9: e87184.
Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized
controlled trials with statistically nonsignificant results for primary outcomes. JAMA 2010;
303: 2058–2064.
Chan A-W, Song F, Vickers A, Jefferson T, Dickersin K, Gøtzsche PC, Krumholz HM, Ghersi D,
van der Worp HB. Increasing value and reducing waste: addressing inaccessible research.
The Lancet 2014; 383: 257–266.
Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for
selective reporting of outcomes in randomized trials: comparison of protocols to
published articles. JAMA 2004a; 291: 2457–2465.
Chan AW, Krleža-Jeric K, Schmid I, Altman DG. Outcome reporting bias in randomized trials
funded by the Canadian Institutes of Health Research. Canadian Medical Association
Journal 2004b; 171: 735–740.
da Costa BR, Beckett B, Diaz A, Resta NM, Johnston BC, Egger M, Jüni P, Armijo-Olivo S.
Effect of standardized training on the reliability of the Cochrane risk of bias assessment
tool: a prospective study. Systematic Reviews 2017; 6: 44.
Dechartres A, Boutron I, Trinquart L, Charles P, Ravaud P. Single-center trials show larger
treatment effects than multicenter trials: evidence from a meta-epidemiologic study.
Annals of Internal Medicine 2011; 155: 39–51.
Dechartres A, Trinquart L, Boutron I, Ravaud P. Influence of trial sample size on treatment
effect estimates: meta-epidemiological study. BMJ 2013; 346: f2304.
Dechartres A, Trinquart L, Faber T, Ravaud P. Empirical evaluation of which trial
characteristics are associated with treatment effect estimates. Journal of Clinical
Epidemiology 2016a; 77: 24–37.
Dechartres A, Ravaud P, Atal I, Riveros C, Boutron I. Association between trial registration
and treatment effect estimates: a meta-epidemiological study. BMC Medicine 2016b;
14: 100.
Dechartres A, Trinquart L, Atal I, Moher D, Dickersin K, Boutron I, Perrodeau E, Altman DG,
Ravaud P. Evolution of poor reporting and inadequate methods over time in 20,920
randomised controlled trials included in Cochrane reviews: research on research study.
BMJ 2017; 357: j2490.
Dechartres A, Atal I, Riveros C, Meerpohl J, Ravaud P. Association between publication
characteristics and treatment effect estimates: a meta-epidemiologic study. Annals of
Internal Medicine 2018; 169: 385–393.
Drazen JM, de Leeuw PW, Laine C, Mulrow C, DeAngelis CD, Frizelle FA, Godlee F, Haug C,
Hébert PC, Horton R, Kotzin S, Marusic A, Reyes H, Rosenberg J, Sahni P, Van der Weyden
MB, Zhaori G. Towards more uniform conflict disclosures: the updated ICMJE conflict of
interest reporting form. BMJ 2010; 340: c3239.
Duyx B, Urlings MJE, Swaen GMH, Bouter LM, Zeegers MP. Scientific citations favor positive
results: a systematic review and meta-analysis. Journal of Clinical Epidemiology 2017; 88:
92–101.
Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. The
Lancet 1991; 337: 867–872.
Estellat C, Ravaud P. Lack of head-to-head trials and fair control arms: randomized
controlled trials of biologic treatment for rheumatoid arthritis. Archives of Internal
Medicine 2012; 172: 237–244.
Kirkham JJ, Dwan KM, Altman DG, Gamble C, Dodd S, Smyth R, Williamson PR. The impact of
outcome reporting bias in randomised controlled trials on a cohort of systematic reviews.
BMJ 2010; 340: c365.
Li G, Abbade LPF, Nwosu I, Jin Y, Leenus A, Maaz M, Wang M, Bhatt M, Zielinski L, Sanger N,
Bantoto B, Luo C, Shams I, Shahid H, Chang Y, Sun G, Mbuagbaw L, Samaan Z, Levine
MAH, Adachi JD, Thabane L. A systematic review of comparisons between protocols or
registrations and full reports in primary biomedical research. BMC Medical Research
Methodology 2018; 18: 9.
Lo B, Field MJ, Institute of Medicine (US) Committee on Conflict of Interest in Medical
Research Education and Practice. Conflict of Interest in Medical Research, Education,
and Practice. Washington, D.C.: National Academies Press (US); 2009.
Lundh A, Lexchin J, Mintzes B, Schroll JB, Bero L. Industry sponsorship and research
outcome. Cochrane Database of Systematic Reviews 2017; 2: MR000033.
Mann H, Djulbegovic B. Comparator bias: why comparisons must address genuine
uncertainties. Journal of the Royal Society of Medicine 2013; 106: 30–33.
Marret E, Elia N, Dahl JB, McQuay HJ, Møiniche S, Moore RA, Straube S, Tramèr MR.
Susceptibility to fraud in systematic reviews: lessons from the Reuben case.
Anesthesiology 2009; 111: 1279–1289.
Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically
assessing bias in clinical trials. Journal of the American Medical Informatics Association
2016; 23: 193–201.
McGauran N, Wieseler B, Kreis J, Schuler YB, Kolsch H, Kaiser T. Reporting bias in medical
research – a narrative review. Trials 2010; 11: 37.
Millard LA, Flach PA, Higgins JPT. Machine learning to assist risk-of-bias assessments in
systematic reviews. International Journal of Epidemiology 2016; 45: 266–277.
Moher D, Shamseer L, Cobey KD, Lalu MM, Galipeau J, Avey MT, Ahmadzai N, Alabousi M,
Barbeau P, Beck A, Daniel R, Frank R, Ghannad M, Hamel C, Hersi M, Hutton B, Isupov I,
McGrath TA, McInnes MDF, Page MJ, Pratt M, Pussegoda K, Shea B, Srivastava A, Stevens
A, Thavorn K, van Katwyk S, Ward R, Wolfe D, Yazdi F, Yu AM, Ziai H. Stop this waste of
people, animals and money. Nature 2017; 549: 23–25.
Montedori A, Bonacini MI, Casazza G, Luchetta ML, Duca P, Cozzolino F, Abraha I. Modified
versus standard intention-to-treat reporting: are there differences in methodological
quality, sponsorship, and findings in randomized trials? A cross-sectional study. Trials
2011; 12: 58.
Morgan AJ, Ross A, Reavley NJ. Systematic review and meta-analysis of Mental Health First Aid
training: effects on knowledge, stigma, and helping behaviour. PloS One 2018; 13: e0197102.
Morrison A, Polisena J, Husereau D, Moulton K, Clark M, Fiander M, Mierzwinski-Urban M,
Clifford T, Hutton B, Rabb D. The effect of English-language restriction on systematic
review-based meta-analyses: a systematic review of empirical studies. International
Journal of Technology Assessment in Health Care 2012; 28: 138–144.
Norris SL, Burda BU, Holmer HK, Ogden LA, Fu R, Bero L, Schunemann H, Deyo R. Author’s
specialty and conflicts of interest contribute to conflicting guidelines for screening
mammography. Journal of Clinical Epidemiology 2012; 65: 725–733.
Odutayo A, Emdin CA, Hsiao AJ, Shakir M, Copsey B, Dutton S, Chiocchia V, Schlussel M,
Dutton P, Roberts C, Altman DG, Hopewell S. Association between trial registration and
positive study findings: cross-sectional study (Epidemiological Study of Randomized
Trials-ESORT). BMJ 2017; 356: j917.
Page MJ, Higgins JPT. Rethinking the assessment of risk of bias due to selective reporting: a
cross-sectional study. Systematic Reviews 2016; 5: 108.
Page MJ, Higgins JPT, Clayton G, Sterne JAC, Hróbjartsson A, Savović J. Empirical evidence
of study design biases in randomized trials: systematic review of meta-epidemiological
studies. PloS One 2016; 11: 7.
Page MJ, McKenzie JE, Higgins JPT. Tools for assessing risk of reporting biases in studies
and syntheses of studies: a systematic review. BMJ Open 2018; 8: e019703.
Patel SV, Yu D, Elsolh B, Goldacre BM, Nash GM. Assessment of conflicts of interest in robotic
surgical studies: validating author’s declarations with the open payments database.
Annals of Surgery 2018; 268: 86–92.
Polanin JR, Tanner-Smith EE, Hennessy EA. Estimating the difference between published
and unpublished effect sizes: a meta-review. Review of Educational Research 2016; 86:
207–236.
Rasmussen K, Schroll J, Gøtzsche PC, Lundh A. Under-reporting of conflicts of interest among
trialists: a cross-sectional study. Journal of the Royal Society of Medicine 2015; 108: 101–107.
Riechelmann RP, Wang L, O’Carroll A, Krzyzanowska MK. Disclosure of conflicts of interest by
authors of clinical trials and editorials in oncology. Journal of Clinical Oncology 2007; 25:
4642–4647.
Rising K, Bacchetti P, Bero L. Reporting bias in drug trials submitted to the Food and Drug
Administration: review of publication and presentation. PLoS Medicine 2008; 5: e217.
Riveros C, Dechartres A, Perrodeau E, Haneef R, Boutron I, Ravaud P. Timing and
completeness of trial results posted at ClinicalTrials.gov and published in journals. PLoS
Medicine 2013; 10: e1001566.
Rothenstein JM, Tomlinson G, Tannock IF, Detsky AS. Company stock prices before and after
public announcements related to oncology drugs. Journal of the National Cancer Institute
2011; 103: 1507–1512.
Safer DJ. Design and reporting modifications in industry-sponsored comparative
psychopharmacology trials. Journal of Nervous and Mental Disease 2002; 190: 583–592.
Saini P, Loke YK, Gamble C, Altman DG, Williamson PR, Kirkham JJ. Selective reporting bias
of harm outcomes within studies: findings from a cohort of systematic reviews. BMJ 2014;
349: g6501.
Sampson M, Barrowman NJ, Moher D, Klassen TP, Pham B, Platt R, St John PD, Viola R,
Raina P. Should meta-analysts search Embase in addition to Medline? Journal of Clinical
Epidemiology 2003; 56: 943–955.
Savović J, Jones HE, Altman DG, Harris RJ, Jüni P, Pildal J, Als-Nielsen B, Balk EM, Gluud C,
Gluud LL, Ioannidis JPA, Schulz KF, Beynon R, Welton NJ, Wood L, Moher D, Deeks JJ,
Sterne JAC. Influence of reported study design characteristics on intervention effect
estimates from randomized, controlled trials. Annals of Internal Medicine 2012; 157:
429–438.
Scherer RW, Meerpohl JJ, Pfeifer N, Schmucker C, Schwarzer G, von Elm E. Full publication
of results initially presented in abstracts. Cochrane Database of Systematic Reviews 2018;
11: MR000005.
Schmid CH. Outcome reporting bias: a pervasive problem in published meta-analyses.
American Journal of Kidney Diseases 2016; 69: 172–174.
Schmucker C, Schell LK, Portalupi S, Oeller P, Cabrera L, Bassler D, Schwarzer G, Scherer RW,
Antes G, von Elm E, Meerpohl JJ. Extent of non-publication in cohorts of studies approved
by research ethics committees or included in trial registries. PloS One 2014; 9: e114023.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of
methodological quality associated with estimates of treatment effects in controlled trials.
JAMA 1995; 273: 408–412.
Shawwa K, Kallas R, Koujanian S, Agarwal A, Neumann I, Alexander P, Tikkinen KA, Guyatt G,
Akl EA. Requirements of clinical journals for authors’ disclosure of financial and non-
financial conflicts of interest: a cross-sectional study. PloS One 2016; 11: e0152301.
Sterne JAC. Why the Cochrane risk of bias tool should not include funding source as a
standard item [editorial]. Cochrane Database of Systematic Reviews 2013; 12: ED000076.
Tramèr MR, Reynolds DJ, Moore RA, McQuay HJ. Impact of covert duplicate publication on
meta-analysis: a case study. BMJ 1997; 315: 635–640.
Turner RM, Spiegelhalter DJ, Smith GC, Thompson SG. Bias modelling in evidence synthesis.
Journal of the Royal Statistical Society Series A (Statistics in Society) 2009; 172: 21–47.
Urrutia G, Ballesteros M, Djulbegovic B, Gich I, Roque M, Bonfill X. Cancer randomized trials
showed that dissemination bias is still a problem to be solved. Journal of Clinical
Epidemiology 2016; 77: 84–90.
Vedula SS, Li T, Dickersin K. Differences in reporting of analyses in internal company
documents versus published trial reports: comparisons in industry-sponsored trials in off-
label uses of gabapentin. PLoS Medicine 2013; 10: e1001378.
Viswanathan M, Carey TS, Belinson SE, Berliner E, Chang SM, Graham E, Guise JM, Ip S,
Maglione MA, McCrory DC, McPheeters M, Newberry SJ, Sista P, White CM. A proposed
approach may help systematic reviews retain needed expertise while minimizing bias
from nonfinancial conflicts of interest. Journal of Clinical Epidemiology 2014; 67:
1229–1238.
Welton NJ, Ades AE, Carlin JB, Altman DG, Sterne JAC. Models for potentially biased
evidence in meta-analysis using empirically based priors. Journal of the Royal Statistical
Society: Series A (Statistics in Society) 2009; 172: 119–136.
Wieland LS, Berman BM, Altman DG, Barth J, Bouter LM, D’Adamo CR, Linde K, Moher D,
Mullins CD, Treweek S, Tunis S, van der Windt DA, Zwarenstein M, Witt C. Rating of
included trials on the efficacy-effectiveness spectrum: development of a new tool for
systematic reviews. Journal of Clinical Epidemiology 2017; 84.
Wieseler B, Kerekes MF, Vervoelgyi V, McGauran N, Kaiser T. Impact of document type on
reporting quality of clinical drug trials: a comparison of registry reports, clinical study
reports, and journal publications. BMJ 2012; 344: d8141.
Wood L, Egger M, Gluud LL, Schulz K, Jüni P, Altman DG, Gluud C, Martin RM, Wood AJG,
Sterne JAC. Empirical evidence of bias in treatment effect estimates in controlled trials
with different interventions and outcomes: meta-epidemiological study. BMJ 2008; 336:
601–605.
Zarin DA, Tse T, Williams RJ, Carr S. Trial reporting in ClinicalTrials.gov – The Final Rule. New
England Journal of Medicine 2016; 375: 1998–2004.
8
Assessing risk of bias in a randomized trial
Julian PT Higgins, Jelena Savović, Matthew J Page, Roy G Elbers,
Jonathan AC Sterne
KEY POINTS
• This chapter details version 2 of the Cochrane risk-of-bias tool for randomized trials (RoB 2), the recommended tool for use in Cochrane Reviews.
• RoB 2 is structured into a fixed set of domains of bias, focusing on different aspects of trial design, conduct and reporting.
• Each assessment using the RoB 2 tool focuses on a specific result from a randomized trial.
• Within each domain, a series of questions (‘signalling questions’) aim to elicit information about features of the trial that are relevant to risk of bias.
• A judgement about the risk of bias arising from each domain is proposed by an algorithm, based on answers to the signalling questions. Judgements can be ‘Low’ or ‘High’ risk of bias, or can express ‘Some concerns’.
• Answers to signalling questions and judgements about risk of bias should be supported by written justifications.
• The overall risk of bias for the result is the least favourable assessment across the domains of bias. Both the proposed domain-level and overall risk-of-bias judgements can be overridden by the review authors, with justification.
8.1 Introduction
Cochrane Reviews include an assessment of the risk of bias in each included study (see
Chapter 7 for a general discussion of this topic). When randomized trials are included,
the recommended tool is the revised version of the Cochrane tool, known as RoB 2,
described in this chapter. The RoB 2 tool provides a framework for assessing the risk
of bias in a single result (an estimate of the effect of an experimental intervention compared with a comparator intervention on a particular outcome).
This chapter should be cited as: Higgins JPT, Savović J, Page MJ, Elbers RG, Sterne JAC. Chapter 8: Assessing
risk of bias in a randomized trial. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch
VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John
Wiley & Sons, 2019: 205–228.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
If some patients do not receive their assigned intervention or deviate from the
assigned intervention after baseline, these effects will differ, and will each be of inter-
est. For example, the estimated effect of assignment to intervention would be the most
appropriate to inform a health policy question about whether to recommend an inter-
vention in a particular health system (e.g. whether to instigate a screening programme,
or whether to prescribe a new cholesterol-lowering drug), whereas the estimated effect
of adhering to the intervention as specified in the trial protocol would be the most
appropriate to inform a care decision by an individual patient (e.g. whether to be
screened, or whether to take the new drug). Review authors should define the
intervention effect in which they are interested, and apply the risk-of-bias tool appro-
priately to this effect.
The effect of principal interest should be specified in the review protocol: most sys-
tematic reviews are likely to address the question of assignment rather than adherence
to intervention. On occasion, review authors may be interested in both effects.
The effect of assignment to intervention should be estimated by an intention-to-
treat (ITT) analysis that includes all randomized participants (Fergusson et al
2002). The principles of ITT analyses are (Piantadosi 2005, Meinert 2012): 1) analyse participants in the intervention groups to which they were randomized, regardless of the intervention they actually received; and 2) include all randomized participants in the analysis, which requires measuring all participants' outcomes.
An ITT analysis maintains the benefit of randomization: that, on average, the inter-
vention groups do not differ at baseline with respect to measured or unmeasured prog-
nostic factors. Note that the term ‘intention-to-treat’ does not have a consistent
definition and is used inconsistently in study reports (Hollis and Campbell 1999, Gravel
et al 2007, Bell et al 2014).
Patients and other stakeholders are often interested in the effect of adhering to the
intervention as described in the trial protocol (the ‘per-protocol effect’), because it
relates most closely to the implications of their choice between the interventions. How-
ever, two approaches to estimation of per-protocol effects that are commonly used in
randomized trials may be seriously biased. These are: 1) ‘as-treated’ analyses, in which participants are analysed according to the intervention they actually received, regardless of their randomized assignment; and 2) naïve ‘per-protocol’ analyses restricted to participants who received the intended intervention. The simulation sketched below illustrates the problem.
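This sketch is illustrative only (all parameters invented, not from the Handbook): adherence is driven by an unmeasured prognostic factor, so the naïve per-protocol contrast is biased away from the true null effect, while the ITT contrast is not.

```python
import numpy as np

# Adherence depends on prognosis, so restricting to adherent participants
# breaks the balance created by randomization. The intervention truly has
# no effect on the outcome.
rng = np.random.default_rng(0)
n = 100_000

prognosis = rng.normal(size=n)           # unmeasured prognostic factor
assigned = rng.integers(0, 2, size=n)    # 1 = experimental, 0 = comparator
# participants with better prognosis are more likely to adhere
adhered = rng.random(n) < 1 / (1 + np.exp(-prognosis))
outcome = prognosis + rng.normal(size=n)

itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
naive_pp = (outcome[(assigned == 1) & adhered].mean()
            - outcome[assigned == 0].mean())

print(f"ITT estimate:                {itt:+.3f} (close to the true value 0)")
print(f"naive per-protocol estimate: {naive_pp:+.3f} (biased)")
```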
RoB 2 is structured into a fixed set of bias domains. Each domain is required, and no additional domains should be added. Table 8.2.a summarizes the issues addressed within each bias domain.
For each domain, the tool comprises:
•• Yes;
Probably yes;
•• Probably no;
No;
• No information.
To maximize their simplicity and clarity, the signalling questions are phrased such
that a response of ‘Yes’ may indicate either a low or high risk of bias, depending on
the most natural way to ask the question. Responses of ‘Yes’ and ‘Probably yes’ have
the same implications for risk of bias, as do responses of ‘No’ and ‘Probably no’. The
definitive responses (‘Yes’ and ‘No’) would typically imply that firm evidence is available
in relation to the signalling question; the ‘Probably’ versions would typically imply that
a judgement has been made. Although not required, if review authors wish to calculate
measures of agreement (e.g. kappa statistics) for the answers to the signalling
questions, we recommend treating ‘Yes’ and ‘Probably yes’ as the same response, and ‘No’ and ‘Probably no’ as the same response (a sketch of this calculation follows Table 8.2.a).

Table 8.2.a Bias domains included in version 2 of the Cochrane risk-of-bias tool for randomized trials, with a summary of the issues addressed*

Bias arising from the randomization process. Whether:
• the allocation sequence was random;
• the allocation sequence was adequately concealed;
• baseline differences between intervention groups suggest a problem with the randomization process.

Bias due to deviations from intended interventions. Whether:
• participants were aware of their assigned intervention during the trial;
• carers and people delivering the interventions were aware of participants’ assigned intervention during the trial.
When the review authors’ interest is in the effect of assignment to intervention (see Section 8.2.2), whether:
• (if applicable) deviations from intended interventions arose because of the experimental context, were unbalanced between groups and likely to have affected the outcome;
• an appropriate analysis was used to estimate the effect of assignment to intervention; and, if not, whether there was potential for a substantial impact on the result.
When the review authors’ interest is in the effect of adhering to intervention (see Section 8.2.2), whether:
• (if applicable) important non-protocol interventions were balanced across intervention groups;
• (if applicable) failures in implementing the intervention could have affected the outcome;
• (if applicable) study participants adhered to the assigned intervention regimen;
• (if applicable) an appropriate analysis was used to estimate the effect of adhering to the intervention.

Bias due to missing outcome data. Whether:
• data for this outcome were available for all, or nearly all, participants randomized;
• (if applicable) there was evidence that the result was not biased by missing outcome data;
• (if applicable) missingness in the outcome was likely to depend on its true value (e.g. the proportions of missing outcome data, or reasons for missing outcome data, differ between intervention groups).

Bias in measurement of the outcome. Whether:
• the method of measuring the outcome was inappropriate;
• measurement or ascertainment of the outcome could have differed between intervention groups;
• outcome assessors were aware of the intervention received by study participants;
• (if applicable) assessment of the outcome was likely to have been influenced by knowledge of the intervention received.

Bias in selection of the reported result. Whether:
• the trial was analysed in accordance with a pre-specified plan that was finalized before unblinded outcome data were available for analysis;
• the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple outcome measurements within the outcome domain;
• the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple analyses of the data.

* For the precise wording of signalling questions and guidance for answering each one, see the full risk-of-bias tool at www.riskofbias.info.
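Returning to the agreement calculation suggested just before Table 8.2.a, the snippet below is a minimal sketch (not from the Handbook): it collapses two raters' hypothetical answers to a signalling question before computing Cohen's kappa, and assumes scikit-learn is available.

```python
from sklearn.metrics import cohen_kappa_score

# Collapse 'Yes'/'Probably yes' and 'No'/'Probably no' before computing
# agreement between two raters. All ratings are hypothetical.
COLLAPSE = {"Y": "Y", "PY": "Y", "N": "N", "PN": "N", "NI": "NI"}

rater1 = ["Y", "PY", "N", "NI", "PY", "N", "Y", "PN"]
rater2 = ["PY", "Y", "PN", "NI", "N", "N", "Y", "N"]

collapsed1 = [COLLAPSE[r] for r in rater1]
collapsed2 = [COLLAPSE[r] for r in rater2]
print(f"kappa = {cohen_kappa_score(collapsed1, collapsed2):.2f}")
```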
The ‘No information’ response should be used only when both (1) insufficient details
are reported to permit a response of ‘Yes’, ‘Probably yes’, ‘No’ or ‘Probably no’, and
(2) in the absence of these details it would be unreasonable to respond ‘Probably
yes’ or ‘Probably no’ given the circumstances of the trial. For example, in the context
of a large trial run by an experienced clinical trials unit for regulatory purposes, if spe-
cific information about the randomization methods is absent, it may still be reasonable
to respond ‘Probably yes’ rather than ‘No information’ to the signalling question about
allocation sequence concealment.
The implications of a ‘No information’ response to a signalling question differ accord-
ing to the purpose of the question. If the question seeks to identify evidence of a prob-
lem, then ‘No information’ corresponds to no evidence of that problem. If the question
relates to an item that is expected to be reported (such as whether any participants
were lost to follow-up), then the absence of information leads to concerns about there
being a problem.
A response option ‘Not applicable’ is available for signalling questions that are
answered only if the response to a previous question implies that they are required.
Signalling questions should be answered independently: the answer to one question
should not affect answers to other questions in the same or other domains other than
through determining which subsequent questions are answered.
Once the signalling questions are answered, the next step is to reach a risk-of-bias
judgement, and assign one of three levels to each domain:
• Low risk of bias;
• Some concerns;
• High risk of bias.
The RoB 2 tool includes algorithms that map responses to signalling questions to
a proposed risk-of-bias judgement for each domain (see the full documentation at
www.riskofbias.info for details). The algorithms include specific mappings of each pos-
sible combination of responses to the signalling questions (including responses of ‘No
information’) to judgements of low risk of bias, some concerns or high risk of bias.
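For orientation only, the sketch below shows the general shape of such a mapping; the signalling questions and thresholds here are simplified inventions, not the official RoB 2 algorithm, whose exact mappings are documented at www.riskofbias.info.

```python
# Simplified, invented mapping for the randomization domain -- NOT the
# official RoB 2 algorithm. 'Y'/'PY' and 'N'/'PN' responses are treated
# alike, as recommended in the text.
YES, NO = {"Y", "PY"}, {"N", "PN"}

def proposed_judgement(sequence_random: str, concealed: str,
                       no_baseline_imbalance: str) -> str:
    """Map responses for three signalling questions to a proposed judgement."""
    if concealed in NO:
        return "High risk of bias"
    if all(ans in YES for ans in (sequence_random, concealed,
                                  no_baseline_imbalance)):
        return "Low risk of bias"
    return "Some concerns"

print(proposed_judgement("PY", "Y", "NI"))  # -> Some concerns
```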
Use of the word ‘judgement’ is important for the risk-of-bias assessment. The algo-
rithms provide proposed judgements, but review authors should verify these and
change them if they feel this is appropriate. In reaching final judgements, review
authors should interpret ‘risk of bias’ as ‘risk of material bias’. That is, concerns should
be expressed only about issues that are likely to affect the ability to draw reliable con-
clusions from the study.
A free text box alongside the signalling questions and judgements provides space for
review authors to present supporting information for each response. In some instances,
when the same information is likely to be used to answer more than one question, one text
box covers more than one signalling question. Brief, direct quotations from the text of the
study report should be used whenever possible. It is important that reasons are provided
for any judgements that do not follow the algorithms. The tool also provides space to indi-
cate all the sources of information about the study obtained to inform the judgements (e.g.
published papers, trial registry entries, additional information from the study authors).
RoB 2 includes optional judgements of the direction of the bias for each domain and
overall. For some domains, the bias is most easily thought of as being towards or away
from the null. For example, high levels of switching of participants from their assigned
intervention to the other intervention may have the effect of reducing the observed dif-
ference between the groups, leading to the estimated effect of adhering to intervention
(see Section 8.2.2) being biased towards the null. For other domains, the bias is likely to
favour one of the interventions being compared, implying an increase or decrease in
the effect estimate depending on which intervention is favoured. Examples include
manipulation of the randomization process, awareness of interventions received influ-
encing the outcome assessment and selective reporting of results. If review authors do
not have a clear rationale for judging the likely direction of the bias, they should not
guess it and can leave this response blank.
Overall risk-of-bias judgement: criteria

Low risk of bias: the trial is judged to be at low risk of bias for all domains for this result.

Some concerns: the trial is judged to raise some concerns in at least one domain for this result, but not to be at high risk of bias for any domain.

High risk of bias: the trial is judged to be at high risk of bias in at least one domain for this result, or is judged to have some concerns for multiple domains in a way that substantially lowers confidence in the result.
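Since the criteria above are largely mechanical, they can be transcribed directly; the sketch below does so, leaving the discretionary second route to ‘High risk of bias’ (some concerns in multiple domains) as a comment because it requires review-author judgement.

```python
# Overall judgement as the least favourable assessment across domains,
# following the criteria in the table above.
def overall_risk_of_bias(domain_judgements):
    if "High risk of bias" in domain_judgements:
        return "High risk of bias"
    if "Some concerns" in domain_judgements:
        # Review authors may upgrade to 'High risk of bias' if concerns in
        # multiple domains substantially lower confidence in the result.
        return "Some concerns"
    return "Low risk of bias"

print(overall_risk_of_bias(
    ["Low risk of bias", "Some concerns", "Low risk of bias"]))
# -> Some concerns
```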
Once an overall judgement has been reached for an individual trial result, this infor-
mation will need to be presented in the review and reflected in the analysis and con-
clusions. For discussion of the presentation of risk-of-bias assessments and how they
can be incorporated into analyses, see Chapter 7. Risk-of-bias assessments also feed
into one domain of the GRADE approach for assessing certainty of a body of evidence,
as discussed in Chapter 14.
A trial protocol may not clearly distinguish changes to intervention that are consistent with the intentions of the investigators from those that should be
considered as deviations from the intended intervention. For example, a cancer trial
protocol may not define progression, or specify the second-line drug that should be
used in patients who progress (Hernán and Scharfstein 2018). It may therefore be nec-
essary for review authors to document changes that are and are not considered to be
deviations from intended intervention. Similarly, for trials in which the comparator
intervention is ‘usual care’, the protocol may not specify interventions consistent with
usual care or whether they are expected to be used alongside the experimental inter-
vention. Review authors may therefore need to document what departures from usual
care will be considered as deviations from intended intervention.
Bias can arise when deviations from the intended intervention occur because of the experimental context. For example, in an unblinded study participants may feel unlucky to have been
assigned to the comparator group and therefore seek the experimental intervention,
or other interventions that improve their prognosis. Similarly, monitoring patients ran-
domized to a novel intervention more frequently than those randomized to standard
care would increase the risk of bias, unless such monitoring was an intended part of the
novel intervention. Deviations from intervention that do not arise because of the
experimental context, such as a patient’s choice to stop taking their assigned med-
ication, do not lead to bias in the effect of assignment to intervention.
To examine the effect of adhering to the interventions as specified in the trial protocol,
it is important to specify what types of deviations from the intended intervention will be
examined. These will be one or more of: receipt of important non-protocol interventions (co-interventions); failures in implementing the intervention; and non-adherence of participants to their assigned intervention.
If such deviations are present, review authors should consider whether appropriate
statistical methods were used to adjust for their effects.
This domain addresses risk of bias due to missing outcome data, including biases
introduced by procedures used to impute, or otherwise account for, the missing
outcome data.
Some participants may be excluded from an analysis for reasons other than missing
outcome data. In particular, a naïve ‘per-protocol’ analysis is restricted to participants
who received the intended intervention. Potential bias introduced by such analyses, or
by other exclusions of eligible participants for whom outcome data are available, is
addressed in the domain ‘Bias due to deviations from intended interventions’ (see
Section 8.4).
The ITT principle of measuring outcome data on all participants (see Section 8.2.2) is
frequently difficult or impossible to achieve in practice. Therefore, it can often only be
followed by making assumptions about the missing outcome values. Even when an
analysis is described as ITT, it may exclude participants with missing outcome data
and be at risk of bias (such analyses may be described as ‘modified intention-to-treat’
(mITT) analyses). Therefore, assessments of risk of bias due to missing outcome data
should be based on the issues addressed in the signalling questions for this domain,
and not on the way that trial authors described the analysis.
Assessing bias due to missing outcome data involves two distinct concepts:
1) the true value of the outcome in participants with missing outcome data: this is the
value of the outcome that should have been measured but was not; and
2) the missingness mechanism, which is the process that led to outcome data being
missing.
Whether missing outcome data lead to bias in complete case analyses depends on
whether the missingness mechanism is related to the true value of the outcome.
Equivalently, we can consider whether the measured (non-missing) outcomes differ
systematically from the missing outcomes (the true values in participants with miss-
ing outcome data). For example, consider a trial of cognitive behavioural therapy
compared with usual care for depression. If participants who are more depressed
are less likely to return for follow-up, then whether a measurement of depression
is missing depends on its true value, which implies that the measured depression out-
comes will differ systematically from the true values of the missing depression
outcomes.
The specific situations in which a complete case analysis suffers from bias (when
there are missing data) are discussed in detail in the full guidance for the RoB 2 tool
at www.riskofbias.info. In brief:
1) missing outcome data will not lead to bias if missingness in the outcome is unre-
lated to its true value, within each intervention group;
2) missing outcome data will lead to bias if missingness in the outcome depends on
both the intervention group and the true value of the outcome; and
3) missing outcome data will often lead to bias if missingness is related to its true
value and, additionally, the effect of the experimental intervention differs from that
of the comparator intervention.
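To make scenario (2) concrete, the following minimal Python simulation (all numbers, names and the missingness model are invented for illustration, echoing the depression example above) shows how a complete case analysis is biased when missingness depends on both the intervention group and the true value of the outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # participants per group (illustrative)

# True depression scores, with a true treatment benefit of -2 points
control = rng.normal(20, 5, n)
treated = rng.normal(18, 5, n)

def observe(scores, base_rate, per_point):
    # Probability of being missing rises with the true (more depressed) score
    p_missing = np.clip(base_rate + per_point * (scores - scores.mean()), 0, 1)
    keep = rng.random(len(scores)) > p_missing
    return scores[keep]

# Scenario 2: missingness depends on the true value AND differs by group
obs_control = observe(control, base_rate=0.40, per_point=0.03)
obs_treated = observe(treated, base_rate=0.10, per_point=0.03)

true_effect = treated.mean() - control.mean()
complete_case = obs_treated.mean() - obs_control.mean()
print(f"true effect: {true_effect:.2f}, complete-case estimate: {complete_case:.2f}")
```

Running the sketch shows the complete-case mean difference drifting away from the true effect, because the more depressed participants are preferentially lost, and lost at different rates in the two groups.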
In practice, our ability to assess risk of bias will be limited by the extent to which trial
authors collected and reported reasons that outcome data were missing. The situation
most likely to lead to bias is when reasons for missing outcome data differ between the
intervention groups: for example if participants who became seriously unwell withdrew
from the comparator group while participants who recovered withdrew from the exper-
imental intervention group.
Trial authors may present statistical analyses (in addition to or instead of complete
case analyses) that attempt to address the potential for bias caused by missing out-
come data. Approaches include single imputation (e.g. assuming the participant had
no event; last observation carried forward), multiple imputation and likelihood-based
methods (see Chapter 10, Section 10.12.2). Imputation methods are unlikely to remove
or reduce the bias that occurs when missingness in the outcome depends on its true
value, unless they use information additional to intervention group assignment to pre-
dict the missing values. Review authors may attempt to address missing data using sen-
sitivity analyses, as discussed in Chapter 10 (Section 10.12.3).
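One common form of such sensitivity analysis for a dichotomous outcome is to bound the effect under best- and worst-case assumptions about participants with missing data. A sketch, with invented counts and a helper name of our choosing:

```python
# Best-/worst-case sensitivity bounds for a dichotomous outcome with missing data.
# All counts are invented for illustration.
events_t, n_obs_t, miss_t = 30, 90, 10   # experimental arm
events_c, n_obs_c, miss_c = 45, 92, 8    # comparator arm

def risk_ratio(e_t, n_t, e_c, n_c):
    return (e_t / n_t) / (e_c / n_c)

observed = risk_ratio(events_t, n_obs_t, events_c, n_obs_c)
# Worst case: all missing in the experimental arm had events, none in the comparator
worst = risk_ratio(events_t + miss_t, n_obs_t + miss_t, events_c, n_obs_c + miss_c)
# Best case: none of the experimental arm's missing had events, all of the comparator's did
best = risk_ratio(events_t, n_obs_t + miss_t, events_c + miss_c, n_obs_c + miss_c)
print(f"complete case RR = {observed:.2f}, bounds: {best:.2f} to {worst:.2f}")
```

If conclusions hold across such bounds, missing outcome data are unlikely to be driving the result; wide bounds signal that the risk of bias deserves serious attention.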
assigned to placebo. These lead to more MRI scans being done in the experimental inter-
vention group, and therefore to more diagnoses of symptomless brain tumours, even
though the drug does not increase the incidence of brain tumours. Even for a pre-specified
outcome measure, the nature of the intervention may lead to methods of measuring
the outcome that are not comparable across intervention groups. For example, an inter-
vention involving additional visits to a healthcare provider may lead to additional oppor-
tunities for outcome events to be identified, compared with the comparator intervention.
3. Who is the outcome assessor. The outcome assessor can be:
1) the participant, when the outcome is a participant-reported outcome such as pain,
quality of life, or self-completed questionnaire;
2) the intervention provider, when the outcome is the result of a clinical examination,
the occurrence of a clinical event or a therapeutic decision such as decision to offer a
surgical intervention; or
3) an observer not directly involved in the intervention provided to the participant,
such as an adjudication committee, or a health professional recording outcomes
for inclusion in disease registries.
4. Whether the outcome assessor is blinded to intervention assignment. Blinding of
outcome assessors is often possible even when blinding of participants and personnel
during the trial is not feasible. However, it is particularly difficult for participant-
reported outcomes: for example, in a trial comparing surgery with medical manage-
ment when the outcome is pain at 3 months. The potential for bias cannot be ignored
even if the outcome assessor cannot be blinded.
5. Whether the assessment of outcome is likely to be influenced by knowledge of
intervention received. For trials in which outcome assessors were not blinded, the risk
of bias will depend on whether the outcome assessment involves judgement, which
depends on the type of outcome. We describe most situations in Table 8.6.a.
Table 8.6.a Considerations of risk of bias in measurement of the outcome for different types of outcomes

Participant-reported outcomes
Definition: Reports coming directly from participants about how they function or feel in relation to a health condition or intervention, without interpretation by anyone else. They include any evaluation obtained directly from participants through interviews, self-completed questionnaires or hand-held devices.
Examples: Pain, nausea and health-related quality of life.
Outcome assessor: The participant, even if a blinded interviewer is questioning the participant and completing a questionnaire on their behalf.
Risk-of-bias considerations: The outcome assessment is potentially influenced by knowledge of intervention received, leading to a judgement of at least ‘Some concerns’. Review authors will need to judge whether it is likely that participants’ reporting of the outcome was influenced by knowledge of intervention received, in which case risk of bias is considered high.

Observer-reported outcomes not involving judgement
Definition: Outcomes reported by an external observer (e.g. an intervention provider, independent researcher, or radiologist) that do not involve any judgement from the observer.
Examples: All-cause mortality or the result of an automated test.
Outcome assessor: The observer.
Risk-of-bias considerations: The assessment of outcome is usually not likely to be influenced by knowledge of intervention received.

Observer-reported outcomes involving some judgement
Definition: Outcomes reported by an external observer (e.g. an intervention provider, independent researcher, or radiologist) that involve some judgement.
Examples: Assessment of an X-ray or other image, clinical examination and clinical events other than death (e.g. myocardial infarction) that require judgements on clinical definitions or medical records.
Outcome assessor: The observer.
Risk-of-bias considerations: The assessment of outcome is potentially influenced by knowledge of intervention received, leading to a judgement of at least ‘Some concerns’. Review authors will need to judge whether it is likely that assessment of the outcome was influenced by knowledge of intervention received, in which case risk of bias is considered high.

Outcomes that reflect decisions made by the intervention provider
Definition: Outcomes that reflect decisions made by the intervention provider, where recording of the decisions does not involve any judgement, but where the decision itself can be influenced by knowledge of intervention received.
Examples: Hospitalization, stopping treatment, referral to a different ward, performing a caesarean section, stopping ventilation and discharge of the participant.
Outcome assessor: The care provider making the decision.
Risk-of-bias considerations: Assessment of outcome is usually likely to be influenced by knowledge of intervention received, if the care provider is aware of this. This is particularly important when preferences or expectations regarding the effect of the experimental intervention are strong.

Composite outcomes
Definition: Combination of multiple end points into a single outcome. Typically, participants who have experienced any of a specified set of endpoints are considered to have experienced the composite outcome. Composite endpoints can also be constructed from continuous outcome measures.
Examples: Major adverse cardiac and cerebrovascular events.
Outcome assessor: Any of the above.
Risk-of-bias considerations: Assessment of risk of bias for composite outcomes should take into account the frequency or contribution of each component and the risk of bias due to the most influential components.
the effect estimate for mortality was not statistically significant. Such bias puts the
result of a synthesis at risk because results are omitted based on their direction, mag-
nitude or statistical significance. It should therefore be addressed at the review level, as
part of an integrated assessment of the risk of reporting bias (Page and Higgins 2016).
For further guidance, see Chapter 7 and Chapter 13.
Bias in selection of the reported result typically arises from a desire for findings to sup-
port vested interests or to be sufficiently noteworthy to merit publication. It can arise for
both harms and benefits, although the motivations may differ. For example, in trials com-
paring an experimental intervention with placebo, trialists who have a preconception or
vested interest in showing that the experimental intervention is beneficial and safe may
be inclined to be selective in reporting efficacy estimates that are statistically significant
and favourable to the experimental intervention, along with harm estimates that are not
significantly different between groups. In contrast, other trialists may selectively report
harm estimates that are statistically significant and unfavourable to the experimental
intervention if they believe that publicizing the existence of a harm will increase their
chances of publishing in a high impact journal.
This domain considers:
1. Whether the trial was analysed in accordance with a pre-specified plan that was
finalized before unblinded outcome data were available for analysis. We strongly
encourage review authors to attempt to retrieve the pre-specified analysis intentions
for each trial (see Chapter 7, Section 7.3.1). Doing so allows for the identification of any
outcome measures or analyses that have been omitted from, or added to, the results
report, post hoc. Review authors should ideally ask the study authors to supply the
study protocol and full statistical analysis plan if these are not publicly available. In
addition, if outcome measures and analyses mentioned in an article, protocol or trial
registration record are not reported, study authors could be asked to clarify whether
those outcome measures were in fact analysed and, if so, to supply the data.
Trial protocols should describe how unexpected adverse outcomes (that potentially
reflect unanticipated harms) will be collected and analysed. However, results based on
spontaneously reported adverse outcomes may lead to concerns that these were
selected based on the finding being noteworthy.
For some trials, the analysis intentions will not be readily available. It is still possible
to assess the risk of bias in selection of the reported result. For example, outcome mea-
sures and analyses listed in the methods section of an article can be compared with
those reported. Furthermore, outcome measures and analyses should be compared
across different papers describing the trial.
Examples of selective reporting from among multiple outcome measurements include:

• reporting only one or a subset of time points at which the outcome was measured;
• use of multiple measurement instruments (e.g. pain scales) and only reporting data for the instrument with the most favourable result;
• having multiple assessors measure an outcome domain (e.g. clinician-rated and patient-rated depression scales) and only reporting data for the measure with the most favourable result; and
• reporting only the most favourable subscale (or a subset of subscales) for an instrument when measurements for other subscales were available.
Examples of selective reporting from among multiple analyses include:

• carrying out analyses of both change scores and post-intervention scores adjusted for baseline and reporting only the more favourable analysis;
• multiple analyses of a particular outcome measurement with and without adjustment for prognostic factors (or with adjustment for different sets of prognostic factors);
• a continuously scaled outcome converted to categorical data on the basis of multiple cut-points; and
• effect estimates generated for multiple composite outcomes with full reporting of just one or a subset.
Either type of selective reporting will lead to bias if selection is based on the direction,
magnitude or statistical significance of the effect estimate.
Insufficient detail in some documents may preclude full assessment of the risk of bias
(e.g. trialists only state in the trial registry record that they will measure ‘pain’, without
specifying the measurement scale, time point or metric that will be used). Review
authors should indicate insufficient information alongside their responses to signalling
questions.
due to deviations from the intended intervention, rather than bias due to missing
outcome data;
6) the concept of selective reporting of a result is distinguished from that of selective
non-reporting of a result, with the latter concept removed from the tool so that it can
be addressed (more appropriately) at the level of the synthesis (see Chapter 13);
7) the option to add new domains has been removed;
8) an explicit process for reaching a judgement about the overall risk of bias in the
result has been introduced.
Because most Cochrane Reviews published before 2019 used the first version of
the tool, authors working on updating these reviews should refer to online
Chapter IV for guidance on considering whether to change methodology when updat-
ing a review.
Funding: Development of RoB 2 was supported by the Medical Research Council (MRC)
Network of Hubs for Trials Methodology Research (MR/L004933/2- N61) hosted by the
MRC ConDuCT-II Hub (Collaboration and innovation for Difficult and Complex rando-
mised controlled Trials In Invasive procedures – MR/K025643/1), by a Methods Innova-
tion Fund grant from Cochrane and by MRC grant MR/M025209/1. JPTH and JACS are
members of the National Institute for Health Research (NIHR) Biomedical Research
Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bris-
tol, and the MRC Integrative Epidemiology Unit at the University of Bristol. JPTH, JS and
JACS are members of the NIHR Collaboration for Leadership in Applied Health Research
and Care West (CLAHRC West) at University Hospitals Bristol NHS Foundation Trust.
JPTH and JACS received funding from NIHR Senior Investigator awards NF-SI-0617-
10145 and NF-SI-0611-10168, respectively. MJP received funding from an Australian
National Health and Medical Research Council (NHMRC) Early Career Fellowship
(1088535). The views expressed are those of the authors and not necessarily those
of the National Health Service, the NIHR, the UK Department of Health and Social Care,
the MRC or the Australian NHMRC.
8.10 References
Abraha I, Montedori A. Modified intention to treat reporting in randomised controlled trials:
systematic review. BMJ 2010; 340: c2697.
Bell ML, Fiero M, Horton NJ, Hsu CH. Handling missing data in RCTs: a review of the top
medical journals. BMC Medical Research Methodology 2014; 14: 118.
Bello S, Moustgaard H, Hróbjartsson A. Unreported formal assessment of unblinding
occurred in 4 of 10 randomized clinical trials, unreported loss of blinding in 1 of 10 trials.
Journal of Clinical Epidemiology 2017; 81: 42–50.
Berger VW. Quantifying the magnitude of baseline covariate imbalances resulting from
selection bias in randomized clinical trials. Biometrical Journal 2005; 47: 119–127.
Boutron I, Estellat C, Guittet L, Dechartres A, Sackett DL, Hróbjartsson A, Ravaud P. Methods
of blinding in reports of randomized controlled trials assessing pharmacologic
treatments: a systematic review. PLoS Medicine 2006; 3: e425.
Brown S, Thorpe H, Hawkins K, Brown J. Minimization – reducing predictability for
multi-centre trials whilst retaining balance within centre. Statistics in Medicine 2005; 24:
3715–3727.
Clark L, Fairhurst C, Torgerson DJ. Allocation concealment in randomised controlled trials:
are we getting better? BMJ 2016; 355: i5663.
Corbett MS, Higgins JPT, Woolacott NF. Assessing baseline imbalance in randomised trials:
implications for the Cochrane risk of bias tool. Research Synthesis Methods 2014; 5: 79–85.
Fergusson D, Aaron SD, Guyatt G, Hebert P. Post-randomisation exclusions: the intention to
treat principle and excluding patients from analysis. BMJ 2002; 325: 652–654.
Gravel J, Opatrny L, Shapiro S. The intention-to-treat approach in randomized controlled
trials: are authors saying what they do and doing what they say? Clinical Trials (London,
England) 2007; 4: 350–356.
Haahr MT, Hróbjartsson A. Who is blinded in randomized clinical trials? A study of 200 trials
and a survey of authors. Clinical Trials (London, England) 2006; 3: 360–365.
Hernán MA, Hernandez-Diaz S. Beyond the intention-to-treat in comparative effectiveness
research. Clinical Trials (London, England) 2012; 9: 48–55.
Hernán MA, Scharfstein D. Cautions as regulators move to end exclusive reliance on
intention to treat. Annals of Internal Medicine 2018; 168: 515–516.
Hernán MA, Robins JM. Per-protocol analyses of pragmatic trials. New England Journal of
Medicine 2017; 377: 1391–1398.
Higgins JPT, White IR, Wood AM. Imputation methods for missing outcome data in
meta-analysis of clinical trials. Clinical Trials 2008; 5: 225–239.
Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, Savović J, Schulz KF,
Weeks L, Sterne JAC. The Cochrane Collaboration’s tool for assessing risk of bias in
randomised trials. BMJ 2011; 343: d5928.
Hollis S, Campbell F. What is meant by intention to treat analysis? Survey of published
randomised controlled trials. BMJ 1999; 319: 670–674.
Jensen JS, Bielefeldt AO, Hróbjartsson A. Active placebo control groups of pharmacological
interventions were rarely used but merited serious consideration: a methodological
overview. Journal of Clinical Epidemiology 2017; 87: 35–46.
Jüni P, Altman DG, Egger M. Systematic reviews in health care: assessing the quality of
controlled clinical trials. BMJ 2001; 323: 42–46.
Kirkham JJ, Dwan KM, Altman DG, Gamble C, Dodd S, Smyth R, Williamson PR. The impact of
outcome reporting bias in randomised controlled trials on a cohort of systematic reviews.
BMJ 2010; 340: c365.
Mansournia MA, Higgins JPT, Sterne JAC, Hernán MA. Biases in randomized trials: a
conversation between trialists and epidemiologists. Epidemiology 2017; 28: 54–59.
Meinert CL. Clinical Trials: Design, Conduct, and Analysis. 2nd ed. Oxford (UK): Oxford
University Press; 2012.
National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials.
Panel on Handling Missing Data in Clinical Trials. Committee on National Statistics, Division
of Behavioral and Social Sciences and Education. Washington, DC: The National Academies
Press; 2010.
Page MJ, Higgins JPT. Rethinking the assessment of risk of bias due to selective reporting:
a cross-sectional study. Systematic Reviews 2016; 5: 108.
Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken (NJ): Wiley; 2005.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of
methodological quality associated with estimates of treatment effects in controlled trials.
JAMA 1995; 273: 408–412.
Schulz KF. Subverting randomization in controlled trials. JAMA 1995; 274: 1456–1458.
Schulz KF, Grimes DA. Generation of allocation sequences in randomised trials: chance, not
choice. Lancet 2002; 359: 515–519.
Schulz KF, Chalmers I, Altman DG. The landscape and lexicon of blinding in randomized
trials. Annals of Internal Medicine 2002; 136: 254–259.
Schulz KF, Grimes DA. The Lancet Handbook of Essential Concepts in Clinical Research.
Edinburgh (UK): Elsevier; 2006.
9
Summarizing study characteristics and
preparing for synthesis
Joanne E McKenzie, Sue E Brennan, Rebecca E Ryan, Hilary J Thomson,
Renea V Johnston
KEY POINTS

• Synthesis is a process of bringing together data from a set of included studies with the aim of drawing conclusions about a body of evidence. This will include synthesis of study characteristics and, potentially, statistical synthesis of study findings.
• A general framework for synthesis can be used to guide the process of planning the comparisons, preparing for synthesis, undertaking the synthesis, and interpreting and describing the results.
• Tabulation of study characteristics aids the examination and comparison of PICO elements across studies, and facilitates synthesis of these characteristics and grouping of studies for statistical synthesis.
• Tabulation of extracted data from studies allows assessment of the number of studies contributing to a particular meta-analysis, and helps determine what other statistical synthesis methods might be used if meta-analysis is not possible.
9.1 Introduction
Synthesis is a process of bringing together data from a set of included studies with the
aim of drawing conclusions about a body of evidence. Most Cochrane Reviews on the
effects of interventions will include some type of statistical synthesis. Most commonly
this is the statistical combination of results from two or more separate studies (hence-
forth referred to as meta-analysis) of effect estimates.
An examination of the included studies always precedes statistical synthesis in
Cochrane Reviews. For example, examination of the interventions studied is often
needed to itemize their content so as to determine which studies can be grouped in
a single synthesis. More broadly, synthesis of the PICO (Population, Intervention,
Comparator and Outcome) elements of the included studies underpins interpretation of review findings and is an important output of the review in its own right. This synthesis should encompass the characteristics of the interventions and comparators in included studies, the populations and settings in which the interventions were evaluated, the outcomes assessed, and the strengths and weaknesses of the body of evidence.

This chapter should be cited as: McKenzie JE, Brennan SE, Ryan RE, Thomson HJ, Johnston RV. Chapter 9: Summarizing study characteristics and preparing for synthesis. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 229–240.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Chapter 2 defined three types of PICO criteria that may be helpful in understanding
decisions that need to be made at different stages in the review.
• The review PICO (planned at the protocol stage) is the PICO on which eligibility of
studies is based (what will be included and what excluded from the review).
• The PICO for each synthesis (also planned at the protocol stage) defines the ques-
tion that the specific synthesis aims to answer, determining how the synthesis will be
structured, specifying planned comparisons (including intervention and comparator
groups, any grouping of outcome and population subgroups).
• The PICO of the included studies (determined at the review stage) is what was actu-
ally investigated in the included studies.
In this chapter, we focus on the PICO for each synthesis and the PICO of the
included studies, as the basis for determining which studies can be grouped for sta-
tistical synthesis and for synthesizing study characteristics. We describe the preliminary
steps undertaken before performing the statistical synthesis. Methods for the statistical
synthesis are described in Chapters 10, 11 and 12.
Box 9.2.a A general framework for synthesis that can be applied irrespective of the
methods used to synthesize results
described in reviews, yet can require many subjective decisions about the nature and
similarity of the PICO elements of the included studies. The examples described in this
section illustrate approaches for making this process more transparent.
9.3.2 Determine which studies are similar enough to be grouped within each
comparison (step 2.2)
Once the PICO of included studies have been coded using labels and descriptions spe-
cified in the PICO for each synthesis, it will be possible to compare PICO elements
across studies and determine which studies are similar enough to be grouped within
each comparison.
Tabulating study characteristics can help to explore and compare PICO elements
across studies, and is particularly important for reviews that are broad in scope, have
diversity across one or more PICO elements, or include large numbers of studies. Data
about study characteristics can be ordered in many different ways (e.g. by comparison
or by specific PICO elements), and tables may include information about one or more
PICO elements. Deciding on the best approach will depend on the purpose of the table
and the stage of the review. A close examination of study characteristics will require
detailed tables; for example, to identify differences in characteristics that were pre-
specified as potentially important modifiers of the intervention effects. As the review
progresses, this detail may be replaced by standardized description of PICO character-
istics (e.g. the coding of counselling interventions presented in Table 9.3.a).
Table 9.3.b illustrates one approach to tabulating study characteristics to enable
comparison and analysis across studies. This table presents a high-level summary of
the characteristics that are most important for determining which comparisons can
be made. The table was adapted from tables presented in a review of self-management
education programmes for osteoarthritis (Kroon et al 2014). The authors presented a
structured summary of intervention and comparator groups for each study, and then
categorized intervention components thought to be important for enabling patients to
manage their own condition. Table 9.3.b shows selected intervention components, the
Definition of (selected) intervention groups from the PICO for each synthesis

• Counselling: “provide[s] motivation to quit, support to increase problem solving and coping skills, and may incorporate ‘transtheoretical’ models of change. … includes … motivational interviewing, cognitive behaviour therapy, psychotherapy, relaxation, problem solving facilitation, and other strategies.”∗
• Incentives: “women receive a financial incentive, contingent on their smoking cessation; these incentives may be gift vouchers. … Interventions that provided a ‘chance’ of incentive (e.g. lottery tickets) combined with counselling were coded as counselling.”
• Social support: “interventions where the intervention explicitly included provision of support from a peer (including self-nominated peers, ‘lay’ peers trained by project staff, or support from healthcare professionals), or from partners” (Chamberlain et al 2017).

Study 1 (main intervention strategy: Counselling; other intervention components: Incentive)
• Assessment of smoking motivation and intention to quit.
• Bilingual health educators (Spanish and English) with bachelor’s degrees provided 15 minutes of individual counselling that included risk information and quit messages or reinforcement. Participants were asked to select a quit date and nominate a significant other as a ‘quit buddy’.
• Self-help guide ‘Time for a change’ with an explanation of how to use it and behavioural counselling.
• Explanation of how to win prizes ($100) by completing activity sheets.
• Booster postcard one month after study entry.

Study 2
• … facilitator (based on stages of change theory).
• Partners invited to be involved in the program.
• An information pack (developed in collaboration with a focus group of women), which included a self-help booklet.
• Invited to join a stop smoking support group.

Study 3 (main intervention strategy: Counselling; other intervention components: Nil)
Midwives received two and a half days of training on theory of transtheoretical model. Participants received a set of six stage-based self-help manuals ‘Pro-Change programme for a healthy pregnancy’. The midwife assessed each participant’s stage of change and pointed the woman to the appropriate manual. No more than 15 minutes was spent on the intervention.

∗ The definition also specified eligible modes of delivery, intervention duration and personnel.
comparator, and outcomes measured in a subset of studies (some details are ficti-
tious). Outcomes have been grouped by the outcome domains ‘Pain’ and ‘Function’
(column ‘Outcome measure’ Table 9.3.b). These pre-specified outcome domains are
the chosen level for the synthesis as specified in the PICO for each synthesis. Authors
will need to assess whether the measurement methods or tools used within each
study provide an appropriate assessment of the domains (Chapter 3, Section 3.2.4).
A next step is to group each measure into the pre-specified time points. In this
example, outcomes are grouped into short-term (< 6 weeks) and long-term follow-
up (≥ 6 weeks to 12 months) (column ‘Time points (time frame)’ Table 9.3.b).
Variations on the format shown in Table 9.3.b can be presented within a review
to summarize the characteristics of studies contributing to each synthesis, which is
important for interpreting findings (step 2.5).
9.3.3 Determine what data are available for synthesis (step 2.3)
Once the studies that are similar enough to be grouped together within each compar-
ison have been determined, a next step is to examine what data are available for syn-
thesis. Tabulating the measurement tools and time frames as shown in Table 9.3.b
allows assessment of the potential for multiplicity (i.e. when multiple outcomes
within a study and outcome domain are available for inclusion (Chapter 3,
Section 3.2.4.3)). In this example, multiplicity arises in two ways. First, from multiple
measurement instruments used to measure the same outcome domain within the
same time frame (e.g. ‘Short-term Pain’ is measured using the ‘Pain VAS’ and ‘Pain
on walking VAS’ scales in study 3). Second, from multiple time points measured within
the same time frame (e.g. ‘Short-term Pain’ is measured using ‘Pain VAS’ at both
2 weeks and 1 month in study 6). Pre-specified methods to deal with the multiplicity
can then be implemented (see Table 9.3.c for examples of approaches for dealing
with multiplicity). In this review, the authors pre-specified a set of decision rules
for selecting specific outcomes within the outcome domains. For example, for the
outcome domain ‘Pain’, the selected outcome was the highest on the following list:
global pain, pain on walking, WOMAC pain subscore, composite pain scores other
than WOMAC, pain on activities other than walking, rest pain or pain during the night.
The authors further specified that if there were multiple time points at which the out-
come was measured within a time frame, they would select the longest time point.
The selected outcomes from applying these rules to studies 3 and 6 are indicated by
an asterisk in Table 9.3.b.
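Such a decision-rule hierarchy can be applied mechanically. A minimal sketch (the function name is ours, and it assumes outcome labels have already been standardized across studies):

```python
# Select one pain outcome per study using the pre-specified hierarchy
# quoted above: the highest-listed outcome that the study reports wins.
PAIN_HIERARCHY = [
    "global pain",
    "pain on walking",
    "WOMAC pain subscore",
    "composite pain score other than WOMAC",
    "pain on activity other than walking",
    "rest pain or pain during the night",
]

def select_pain_outcome(reported_outcomes):
    """Return the highest-ranked reported outcome, or None if none match."""
    for preferred in PAIN_HIERARCHY:
        if preferred in reported_outcomes:
            return preferred
    return None

print(select_pain_outcome({"pain on walking", "rest pain or pain during the night"}))
# -> 'pain on walking'
```

The rule for multiple time points within a time frame (select the longest) can be implemented in the same style, e.g. with a `max()` over the eligible time points.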
Table 9.3.b also illustrates an approach to tabulating the extracted data. The avail-
able statistics are tabulated in the column labelled ‘Data’, from which an assessment
can be made as to whether the study contributes the required data for a meta-
analysis (column ‘Effect & SE’) (Chapter 10). For example, of the seven studies
comparing health-directed behaviour (BEH) with usual care, six measured ‘Short-
term Pain’, four of which contribute required data for meta-analysis. Reordering
the table by comparison, outcome and time frame, will more readily show the num-
ber of studies that will contribute to a particular meta-analysis, and help determine
what other synthesis methods might be used if the data available for meta-analysis
are limited.
234
Table 9.3.b Table of study characteristics illustrating similarity of PICO elements across studies

Study ID¹ | Comparator | Intervention components | Outcome domain | Outcome measure | Time points (time frame)² | Data³ | Effect & SE
1 | Attention control | BEH MON CON SKL NAV | Pain | Pain VAS | 1 mth (short), 8 mths (long) | Mean, N / group | Yes⁴
  |  |  | Function | HAQ disability subscale | 1 mth (short), 8 mths (long) | Median, IQR, N / group | Maybe⁴
2 | Acupuncture | BEH EMO CON SKL NAV | Pain | Pain on walking VAS | 1 mth (short), 12 mths (long) | MD from ANCOVA model, 95% CI | Yes
  |  |  | Function | Dutch AIMS-SF | 1 mth (short), 12 mths (long) | Median, range, N / group | Maybe⁴
4 | Information | BEH ENG EMO MON CON SKL NAV | Pain | Pain VAS | 1 mth (short) | MD, SE | Yes
  |  |  | Function | Dutch AIMS-SF | 1 mth (short) | Mean, SD, N / group | Yes
12 | Information | BEH SKL | Pain | WOMAC pain subscore | 12 mths (long) | MD from ANCOVA model, 95% CI | Yes
3 | Usual care | BEH EMO MON SKL NAV | Pain | Pain VAS∗; Pain on walking VAS | 1 mth (short) | Mean, SD, N / group | Yes
5 | Usual care | BEH ENG EMO MON CON SKL | Pain | Pain on walking VAS | 2 wks (short) | Mean, SD, N / group | Yes
6 | Usual care | BEH MON CON SKL NAV | Pain | Pain VAS | 2 wks (short), 1 mth (short)∗ | MD, t-value and P value for MD | Yes
  |  |  | Function | WOMAC disability subscore | 2 wks (short), 1 mth (short)∗ | Mean, N / group | Yes
7 | Usual care | BEH MON CON SKL NAV | Pain | WOMAC pain subscore | 1 mth (short) | Direction of effect | No
  |  |  | Function | WOMAC disability subscore | 1 mth (short) | Means, N / group; statistically significant difference | Yes⁴
8 | Usual care | MON | Pain | Pain VAS | 12 mths (long) | MD, 95% CI | Yes
9 | Usual care | BEH MON SKL | Function | Global disability | 12 mths (long) | Direction of effect, NS | No
10 | Usual care | BEH EMO MON CON SKL NAV | Pain | Pain VAS | 1 mth (short) | No information | No
  |  |  | Function | Global disability | 1 mth (short) | Direction of effect | No
11 | Usual care | BEH MON SKL | Pain | WOMAC pain subscore | 1 mth (short), 12 mths (long) | Mean, SD, N / group | Yes

BEH = health-directed behaviour; CON = constructive attitudes and approaches; EMO = emotional well-being; ENG = positive and active engagement in life; MON = self-monitoring and insight; NAV = health service navigation; SKL = skill and technique acquisition.
ANCOVA = analysis of covariance; CI = confidence interval; IQR = interquartile range; MD = mean difference; SD = standard deviation; SE = standard error; NS = non-significant.
Pain and function measures: Dutch AIMS-SF = Dutch short form of the Arthritis Impact Measurement Scales; HAQ = Health Assessment Questionnaire; VAS = visual analogue scale; WOMAC = Western Ontario and McMaster Universities Osteoarthritis Index.
¹ Ordered by type of comparator. ² Short-term (denoted ‘immediate’ in the review Kroon et al (2014)) follow-up is defined as < 6 weeks; long-term follow-up (denoted ‘intermediate’ in the review) is ≥ 6 weeks to 12 months. ³ For simplicity, in this example the available data are assumed to be the same for all outcomes within an outcome domain within a study. In practice, this is unlikely and the available data would likely vary by outcome. ⁴ Indicates that an effect estimate and its standard error may be computed through imputation of missing statistics, methods to convert between statistics (e.g. medians to means) or contact with study authors. ∗ Indicates the selected outcome when there was multiplicity in the outcome domain and time frame.
Table 9.3.c Examples of approaches for selecting one outcome (effect estimate) for inclusion in a synthesis. Adapted from López-López et al (2018)

Random selection
Description: Randomly select an outcome (effect estimate) when multiple are available for an outcome domain.
Comments: Assumes that the effect estimates are interchangeable measures of the domain and that random selection will yield a ‘representative’ effect for the meta-analysis.

Averaging of effect estimates
Description: Calculate the average of the intervention effects when multiple are available for a particular outcome domain.
Comments: Assumes that the effect estimates are interchangeable measures of the domain. The standard error of the average effect can be calculated using a simple method of averaging the variances of the effect estimates.

Median effect estimate
Description: Rank the effect estimates of outcomes within an outcome domain and select the outcome with the middle value.
Comments: An alternative to averaging effect estimates. Assumes that the effect estimates are interchangeable measures of the domain and that the median effect will yield a ‘representative’ effect for the meta-analysis. This approach is often adopted in Effective Practice and Organization of Care reviews that include broad outcome domains.

Decision rules
Description: Select the most relevant outcome from multiple that are available for an outcome domain using a decision rule.
Comments: Assumes that while the outcomes all provide a measure of the outcome domain, they are not completely interchangeable, with some being more relevant. The decision rules aim to select the most relevant. The rules may be based on clinical (e.g. content validity of measurement tools) or methodological (e.g. reliability of the measure) considerations. If multiple rules are specified, a hierarchy will need to be determined to specify the order in which they are applied.
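For the ‘averaging of effect estimates’ approach, the variance-averaging calculation can be sketched as follows (this choice is exact when the estimates are perfectly correlated with equal variances, and generally conservative otherwise; the function name and example numbers are ours):

```python
import math

def average_effect(effects, ses):
    """Average correlated effect estimates from one study and outcome domain.

    The SE is the square root of the mean of the variances: a simple,
    generally conservative choice for correlated estimates.
    """
    mean_effect = sum(effects) / len(effects)
    mean_variance = sum(se**2 for se in ses) / len(ses)
    return mean_effect, math.sqrt(mean_variance)

# e.g. two pain instruments from the same study (invented numbers)
print(average_effect([-0.45, -0.30], [0.12, 0.15]))
```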
Synthesis methods, the questions they address, and example plots:

Text/tabular description
Questions addressed: narrative summary of evidence presented in either text or tabular form.
Example plots: forest plot (plotting individual study effects without a combined effect estimate).

Vote counting
Questions addressed: is there any evidence of an effect?
Example plots: harvest plot; effect direction plot.

Combining P values
Questions addressed: is there evidence that there is an effect in at least one study?
Example plots: albatross plot.

Summary of effect estimates
Questions addressed: what is the range and distribution of observed effects?
Example plots: box and whisker plot; bubble plot.

Pairwise meta-analysis
Questions addressed: what is the common intervention effect? (fixed-effect model); what is the average intervention effect? (random-effects model).
Example plots: forest plot.

Network meta-analysis
Questions addressed: which of multiple interventions is most effective?
Example plots: forest plot; network diagram; rankogram plot.

Subgroup analysis/meta-regression
Questions addressed: what factors modify the magnitude of the intervention effects?
Example plots: forest plot; box and whisker plot; bubble plot.
9.7 References
Chamberlain C, O’Mara-Eves A, Porter J, Coleman T, Perlen SM, Thomas J, McKenzie JE.
Psychosocial interventions for supporting women to stop smoking in pregnancy.
Cochrane Database of Systematic Reviews 2017; 2: CD001055.
Hollands GJ, Shemilt I, Marteau TM, Jebb SA, Lewis HB, Wei Y, Higgins JPT, Ogilvie D.
Portion, package or tableware size for changing selection and consumption of food,
alcohol and tobacco. Cochrane Database of Systematic Reviews 2015; 9: CD011045.
Kroon FPB, van der Burg LRA, Buchbinder R, Osborne RH, Johnston RV, Pitt V. Self-
management education programmes for osteoarthritis. Cochrane Database of Systematic
Reviews 2014; 1: CD008963.
López-López JA, Page MJ, Lipsey MW, Higgins JPT. Dealing with effect size multiplicity in
systematic reviews and meta-analyses. Research Synthesis Methods 2018; 9: 336–351.
10
Analysing data and undertaking meta-analyses
Jonathan J Deeks, Julian PT Higgins, Douglas G Altman; on behalf of the Cochrane
Statistical Methods Group
KEY POINTS

• Meta-analysis is the statistical combination of results from two or more separate studies.
• Potential advantages of meta-analyses include an improvement in precision, the ability to answer questions not posed by individual studies, and the opportunity to settle controversies arising from conflicting claims. However, they also have the potential to mislead seriously, particularly if specific study designs, within-study biases, variation across studies, and reporting biases are not carefully considered.
• It is important to be familiar with the type of data (e.g. dichotomous, continuous) that result from measurement of an outcome in an individual study, and to choose suitable effect measures for comparing intervention groups.
• Most meta-analysis methods are variations on a weighted average of the effect estimates from the different studies.
• Studies with no events contribute no information about the risk ratio or odds ratio. For rare events, the Peto method has been observed to be less biased and more powerful than other methods.
• Variation across studies (heterogeneity) must be considered, although most Cochrane Reviews do not have enough studies to allow for the reliable investigation of its causes.
• Random-effects meta-analyses allow for heterogeneity by assuming that underlying effects follow a normal distribution, but they must be interpreted carefully. Prediction intervals from random-effects meta-analyses are a useful device for presenting the extent of between-study variation.
• Many judgements are required in the process of preparing a meta-analysis. Sensitivity analyses should be used to examine whether overall findings are robust to potentially influential decisions.
This chapter should be cited as: Deeks JJ, Higgins JPT, Altman DG (editors). Chapter 10: Analysing data and
undertaking meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA
(editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK):
John Wiley & Sons, 2019: 241–284.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
1) In the first stage, a summary statistic is calculated for each study, to describe the observed intervention effect in the same way for every study. For example, the summary statistic may be a risk ratio if the
data are dichotomous, or a difference between means if the data are continuous
(see Chapter 6).
2) In the second stage, a summary (combined) intervention effect estimate is calcu-
lated as a weighted average of the intervention effects estimated in the individual
studies. A weighted average is defined as

$$\text{weighted average} = \frac{\sum_i \hat{\theta}_i w_i}{\sum_i w_i},$$

where $\hat{\theta}_i$ is the intervention effect estimated in the $i$th study and $w_i$ is the weight given to that study.

Figure 10.2.a Example of a forest plot from a review of interventions to promote ownership of smoke alarms (DiGuiseppi and Higgins 2001). Reproduced with permission of John Wiley & Sons

10.3 A generic inverse-variance approach

In the generic inverse-variance method, each study is weighted by the inverse of the variance (the square of the standard error) of its effect estimate:

$$\text{generic inverse-variance weighted average} = \frac{\sum_i Y_i / SE_i^2}{\sum_i 1 / SE_i^2}$$
where Yi is the intervention effect estimated in the ith study, SEi is the standard error of
that estimate, and the summation is across all studies. The basic data required for the
analysis are therefore an estimate of the intervention effect and its standard error from
each study. A fixed-effect meta-analysis is valid under an assumption that all effect
estimates are estimating the same underlying intervention effect, which is referred
to variously as a ‘fixed-effect’ assumption, a ‘common-effect’ assumption or an
‘equal-effects’ assumption. However, the result of the meta-analysis can be interpreted
without making such an assumption (Rice et al 2018).
When the data are conveniently available as summary statistics from each interven-
tion group, the inverse-variance method can be implemented directly. For example,
estimates and their standard errors may be entered directly into RevMan under the
‘Generic inverse variance’ outcome type. For ratio measures of intervention effect,
the data must be entered into RevMan as natural logarithms (for example, as a log odds
ratio and the standard error of the log odds ratio). However, it is straightforward to
instruct the software to display results on the original (e.g. odds ratio) scale. It is pos-
sible to supplement or replace this with a column providing the sample sizes in the two
groups. Note that the ability to enter estimates and standard errors creates a
high degree of flexibility in meta-analysis. It facilitates the analysis of properly analysed
crossover trials, cluster-randomized trials and non-randomized trials (see Chapter 23),
as well as outcome data that are ordinal, time-to-event or rates (see Chapter 6).
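As an illustration of the two-stage calculation and of working on the log scale for ratio measures, a minimal Python sketch (the function name and study estimates are invented):

```python
import math

def fixed_effect_iv(estimates, ses):
    """Generic inverse-variance fixed-effect meta-analysis.

    `estimates` are effect estimates (on the log scale for ratio measures);
    `ses` are their standard errors.
    """
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Three studies reporting odds ratios (invented): enter as log odds ratios
log_ors = [math.log(0.80), math.log(0.65), math.log(0.95)]
ses = [0.20, 0.25, 0.15]
pooled, se = fixed_effect_iv(log_ors, ses)
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled OR {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

Exponentiating the pooled log odds ratio and its confidence limits mirrors the way software such as RevMan displays results on the original scale.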
expected number of events in the experimental intervention group of each study under
the null hypothesis of no intervention effect.
The approximation used in the computation of the log odds ratio works well when
intervention effects are small (odds ratios are close to 1), events are not particularly
common and the studies have similar numbers in experimental and comparator
groups. In other situations it has been shown to give biased answers. As these criteria
are not always fulfilled, Peto’s method is not recommended as a default approach for
meta-analysis.
Corrections for zero cell counts are not necessary when using Peto’s method. Per-
haps for this reason, this method performs well when events are very rare
(Bradburn et al 2007); see Section 10.4.4.1. Also, Peto’s method can be used to combine
studies with dichotomous outcome data with studies using time-to-event analyses
where log-rank tests have been used (see Section 10.9).
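A sketch of the Peto computation from 2×2 tables, using the standard hypergeometric expectation and variance (the helper name and the table counts are invented for illustration):

```python
import math

def peto_odds_ratio(tables):
    """Peto fixed-effect odds ratio from a list of (events_t, n_t, events_c, n_c)."""
    sum_o_minus_e, sum_v = 0.0, 0.0
    for a, n1, c, n2 in tables:
        n = n1 + n2
        m = a + c                      # total events in the study
        expected = n1 * m / n          # E[a] under the null of no effect
        v = n1 * n2 * m * (n - m) / (n**2 * (n - 1))
        sum_o_minus_e += a - expected
        sum_v += v
    log_or = sum_o_minus_e / sum_v
    return math.exp(log_or), 1 / math.sqrt(sum_v)  # OR and SE of the log OR

# Two small trials with rare events (invented); zero cells need no correction
print(peto_odds_ratio([(1, 120, 4, 118), (0, 250, 3, 245)]))
```

Note how the second (zero-event-arm) table still contributes to the sums, which is why no zero-cell correction is needed.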
Consistency Empirical evidence suggests that relative effect measures are, on average,
more consistent than absolute measures (Engels et al 2000, Deeks 2002, Rucker et al
2009). For this reason, it is wise to avoid performing meta-analyses of risk differences,
unless there is a clear reason to suspect that risk differences will be consistent in a par-
ticular clinical situation. On average there is little difference between the odds ratio and
risk ratio in terms of consistency (Deeks 2002). When the study aims to reduce the inci-
dence of an adverse event, there is empirical evidence that risk ratios of the adverse
event are more consistent than risk ratios of the non-event (Deeks 2002). Selecting
an effect measure based on what is the most consistent in a particular situation is
not a generally recommended strategy, since it may lead to a selection that spuriously
maximizes the precision of a meta-analysis estimate.
Ease of interpretation The odds ratio is the hardest summary statistic to understand
and to apply in practice, and many practising clinicians report difficulties in using it.
There are many published examples where authors have misinterpreted odds ratios
from meta-analyses as risk ratios. Although odds ratios can be re-expressed for inter-
pretation (as discussed here), there must be some concern that routine presentation of
the results of systematic reviews as odds ratios will lead to frequent over-estimation of
the benefits and harms of interventions when the results are applied in clinical practice.
Absolute measures of effect are thought to be more easily interpreted by clinicians than
relative effects (Sinclair and Bracken 1994), and allow trade-offs to be made between
likely benefits and likely harms of interventions. However, they are less likely to be
generalizable.
It is generally recommended that meta-analyses are undertaken using risk ratios
(taking care to make a sensible choice over which category of outcome is classified
as the event) or odds ratios. This is because it seems important to avoid using summary
statistics for which there is empirical evidence that they are unlikely to give consistent
estimates of intervention effects (the risk difference), and it is impossible to use statis-
tics for which meta-analysis cannot be performed (the number needed to treat for an
additional beneficial outcome). It may be wise to plan to undertake a sensitivity anal-
ysis to investigate whether choice of summary statistic (and selection of the event cat-
egory) is critical to the conclusions of the meta-analysis (see Section 10.14).
It is often sensible to use one statistic for meta-analysis and to re-express the results
using a second, more easily interpretable statistic. For example, often meta-analysis
may be best performed using relative effect measures (risk ratios or odds ratios)
and the results re-expressed using absolute effect measures (risk differences or num-
bers needed to treat for an additional beneficial outcome – see Chapter 15
(Section 15.4). This is one of the key motivations for ‘Summary of findings’ tables in
Cochrane Reviews: see Chapter 14). If odds ratios are used for meta-analysis they
can also be re-expressed as risk ratios (see Chapter 15, Section 15.4). In all cases the
same formulae can be used to convert upper and lower confidence limits. However,
all of these transformations require specification of a value of baseline risk that indi-
cates the likely risk of the outcome in the ‘control’ population to which the experimen-
tal intervention will be applied. Where the chosen value for this assumed comparator
group risk is close to the typical observed comparator group risks across the studies,
similar estimates of absolute effect will be obtained regardless of whether odds ratios
or risk ratios are used for meta-analysis. Where the assumed comparator risk differs
from the typical observed comparator group risk, the predictions of absolute benefit
will differ according to which summary statistic was used for meta-analysis.
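The conversion from an odds ratio to absolute effects for an assumed comparator group risk can be sketched as follows (standard conversion formula; the function name and numbers are illustrative):

```python
def absolute_effects_from_or(odds_ratio, assumed_comparator_risk):
    """Re-express an odds ratio as a risk difference and NNT for a given
    assumed comparator group risk (ACR)."""
    acr = assumed_comparator_risk
    risk_exp = odds_ratio * acr / (1 - acr + odds_ratio * acr)
    rd = risk_exp - acr
    nnt = abs(1 / rd) if rd != 0 else float("inf")
    return rd, nnt

# A pooled OR of 0.75 applied to comparator risks of 10% and 30% (illustrative)
for acr in (0.10, 0.30):
    rd, nnt = absolute_effects_from_or(0.75, acr)
    print(f"ACR {acr:.0%}: risk difference {rd:+.3f}, NNT {nnt:.0f}")
```

Running the sketch for several plausible comparator risks makes visible how strongly the predicted absolute benefit depends on the assumed baseline risk.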
which group is likely to have the higher risk, or on whether the risks are of the same or
different orders of magnitude (when risks are very low, they are compatible with very
large or very small ratios). Whilst one might be tempted to infer that the risk would be
lowest in the group with the larger sample size (as the upper limit of the confidence
interval would be lower), this is not justified as the sample size allocation was deter-
mined by the study investigators and is not a measure of the incidence of the event.
Risk difference methods superficially appear to have an advantage over odds ratio
methods in that the risk difference is defined (as zero) when no events occur in either
arm. Such studies are therefore included in the estimation process. Bradburn and col-
leagues undertook simulation studies which revealed that all risk difference methods
yield confidence intervals that are too wide when events are rare, and have associated
poor statistical power, which makes them unsuitable for meta-analysis of rare events
(Bradburn et al 2007). This is especially relevant when outcomes that focus on treat-
ment safety are being studied, as the ability to identify correctly (or attempt to refute)
serious adverse events is a key issue in drug development.
It is likely that outcomes for which no events occur in either arm may not be
mentioned in reports of many randomized trials, precluding their inclusion in a
meta-analysis. It is unclear, though, when working with published results, whether
failure to mention a particular adverse event means there were no such events, or
simply that such events were not included as a measured endpoint. Whilst the results
of risk difference meta-analyses will be affected by non-reporting of outcomes with no
events, odds and risk ratio based methods naturally exclude these data whether or not
they are published, and are therefore unaffected.
coverage, provided there was no substantial imbalance between treatment and com-
parator group sizes within studies, and treatment effects were not exceptionally large.
This finding was consistently observed across three different meta-analytical scenarios,
and was also observed by Sweeting and colleagues (Sweeting et al 2004).
This finding was noted despite the method producing only an approximation to the
odds ratio. For very large effects (e.g. risk ratio = 0.2) when the approximation is known
to be poor, treatment effects were under-estimated, but the Peto method still had the
best performance of all the methods considered for event risks of 1 in 1000, and the bias
was never more than 6% of the comparator group risk.
In other circumstances (i.e. event risks above 1%, very large effects at event risks
around 1%, and meta-analyses where many studies were substantially imbalanced)
the best performing methods were the Mantel-Haenszel odds ratio without zero-cell
corrections, logistic regression and an exact method. None of these methods is avail-
able in RevMan.
Methods that should be avoided with rare events are the inverse-variance methods
(including the DerSimonian and Laird random-effects method) (Efthimiou 2018). These
directly incorporate the study’s variance in the estimation of its contribution to the
meta-analysis, but these are usually based on a large-sample variance approximation,
which was not intended for use with rare events. We would suggest that incorporation
of heterogeneity into an estimate of a treatment effect should be a secondary consid-
eration when attempting to produce estimates of effects from sparse data – the pri-
mary concern is to discern whether there is any signal of an effect in the data.
For the mean difference approach, the SDs are used together with the sample sizes to
compute the weight given to each study. Studies with small SDs are given relatively
higher weight whilst studies with larger SDs are given relatively smaller weights. This
is appropriate if variation in SDs between studies reflects differences in the reliability of
outcome measurements, but is probably not appropriate if the differences in SD reflect
real differences in the variability of outcomes in the study populations.
For the standardized mean difference approach, the SDs are used to standardize the
mean differences to a single scale, as well as in the computation of study weights. Thus,
studies with small SDs lead to relatively higher estimates of SMD, whilst studies with
larger SDs lead to relatively smaller estimates of SMD. For this to be appropriate, it
must be assumed that between-study variation in SDs reflects only differences in meas-
urement scales and not differences in the reliability of outcome measures or variability
among study populations, as discussed in Chapter 6 (Section 6.5.1.2).
These assumptions of the methods should be borne in mind when unexpected var-
iation of SDs is observed across studies.
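To see concretely how the SDs enter both the estimates and their standard errors, a sketch using textbook formulas for the mean difference and the small-sample-corrected SMD (Hedges’ g); the example numbers are invented:

```python
import math

def mean_difference(m1, sd1, n1, m2, sd2, n2):
    md = m1 - m2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # smaller SDs -> smaller SE -> more weight
    return md, se

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    # The pooled SD standardizes the difference; larger SDs shrink the SMD
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    g = (m1 - m2) / sd_pooled * (1 - 3 / (4 * (n1 + n2) - 9))  # small-sample correction
    se = math.sqrt((n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2)))
    return g, se

print(mean_difference(22.1, 5.0, 60, 24.3, 5.5, 58))
print(hedges_g(22.1, 5.0, 60, 24.3, 5.5, 58))
```

Changing the SDs in this sketch, while holding the means fixed, shows the opposite behaviours described above: the MD is unchanged but re-weighted, whereas the SMD itself changes.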
weights in the analysis than they would have received if post-intervention values had
been used, as they will have smaller SDs.
When combining the data on the MD scale, authors must be careful to use the appro-
priate means and SDs (either of post-intervention measurements or of changes from
baseline) for each study. Since the mean values and SDs for the two types of outcome
may differ substantially, it may be advisable to place them in separate subgroups to
avoid confusion for the reader, but the results of the subgroups can legitimately be
pooled together.
In contrast, post-intervention value and change scores should not in principle be
combined using standard meta-analysis approaches when the effect measure is an
SMD. This is because the SDs used in the standardization reflect different things.
The SD when standardizing post-intervention values reflects between-person variabil-
ity at a single point in time. The SD when standardizing change scores reflects variation
in between-person changes over time, so will depend on both within-person and
between-person variability; within-person variability in turn is likely to depend on
the length of time between measurements. Nevertheless, an empirical study of
21 meta-analyses in osteoarthritis did not find a difference between combined SMDs
based on post-intervention values and combined SMDs based on change scores (da
Costa et al 2013). One option is to standardize SMDs using post-intervention SDs rather
than change score SDs. This would lead to valid synthesis of the two approaches, but
we are not aware that an appropriate standard error for this has been derived.
A common practical problem associated with including change-from-baseline mea-
sures is that the SD of changes is not reported. Imputation of SDs is discussed in
Chapter 6 (Section 6.4.2.8).
for outcomes such as weight, volume and blood concentrations, which have lowest
possible values of 0, or for scale outcomes with minimum or maximum scores, but
it may not be appropriate for change-from-baseline measures. The check involves cal-
culating the observed mean minus the lowest possible value (or the highest possible
value minus the observed mean), and dividing this by the SD. A ratio less than 2 sug-
gests skew (Altman and Bland 1996). If the ratio is less than 1, there is strong evidence
of a skewed distribution.
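The check is simple to apply; a sketch (the example values are invented):

```python
def skew_check(mean, sd, lowest_possible=0.0):
    """Ratio of (observed mean - lowest possible value) to SD.

    A ratio below 2 suggests skew; below 1 is strong evidence of skew
    (Altman and Bland 1996).
    """
    return (mean - lowest_possible) / sd

# e.g. length of hospital stay in days: mean 7.2, SD 6.5, minimum possible 0
print(skew_check(7.2, 6.5))  # ~1.1, suggesting a skewed distribution
```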
Transformation of the original outcome data may reduce skew substantially. Reports
of trials may present results on a transformed scale, usually a log scale. Collection of
appropriate data summaries from the trialists, or acquisition of individual patient data,
is currently the approach of choice. Appropriate data summaries and analysis strate-
gies for the individual patient data will depend on the situation. Consultation with a
knowledgeable statistician is advised.
Where data have been analysed on a log scale, results are commonly presented as
geometric means and ratios of geometric means. A meta-analysis may be then per-
formed on the scale of the log-transformed data; an example of the calculation of
the required means and SD is given in Chapter 6 (Section 6.5.2.4). This approach
depends on being able to obtain transformed data for all studies; methods for trans-
forming from one scale to the other are available (Higgins et al 2008b). Log-transformed
and untransformed data should not be mixed in a meta-analysis.
$$SMD = \frac{\sqrt{3}}{\pi}\,\ln OR$$
The standard error of the log odds ratio can be converted to the standard error of a SMD
by multiplying by the same constant (√3/π = 0.5513). Alternatively SMDs can be re-
expressed as log odds ratios by multiplying by π/√3 = 1.814. Once SMDs (or log odds ratios)
and their standard errors have been computed for all studies in the meta-analysis, they can
be combined using the generic inverse-variance method. Standard errors can be com-
puted for all studies by entering the data as dichotomous and continuous outcome type
data, as appropriate, and converting the confidence intervals for the resulting log odds
ratios and SMDs into standard errors (see Chapter 6, Section 6.3).
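A minimal Python sketch of these conversions; the odds ratio and its confidence interval are hypothetical:

```python
import math

C = math.sqrt(3) / math.pi  # ≈ 0.5513; its reciprocal pi/sqrt(3) ≈ 1.814

def lnor_to_smd(log_or, se_log_or):
    """Re-express a log odds ratio and its standard error as an SMD."""
    return C * log_or, C * se_log_or

def smd_to_lnor(smd, se_smd):
    """Re-express an SMD and its standard error as a log odds ratio."""
    return smd / C, se_smd / C

# Hypothetical study: OR = 2.5 with 95% CI 1.2 to 5.2
log_or = math.log(2.5)
se_log_or = (math.log(5.2) - math.log(1.2)) / (2 * 1.96)  # SE from the CI
print(lnor_to_smd(log_or, se_log_or))
```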
Count data may be analysed as dichotomous data if the counts are dichotomized for each
individual (see Section 10.4), as continuous data (see Section 10.5) or as time-to-event
data (see Section 10.9), as well as being analysed as rate data.
Rate data occur if counts are measured for each participant along with the time over
which they are observed. This is particularly appropriate when the events being
counted are rare. For example, a woman may experience two strokes during a fol-
low-up period of two years. Her rate of strokes is one per year of follow-up (or, equiv-
alently 0.083 per month of follow-up). Rates are conventionally summarized at the
group level. For example, participants in the comparator group of a clinical trial
may experience 85 strokes during a total of 2836 person-years of follow-up. An under-
lying assumption associated with the use of rates is that the risk of an event is constant
across participants and over time. This assumption should be carefully considered for
each situation. For example, in contraception studies, rates have been used (known as
Pearl indices) to describe the number of pregnancies per 100 women-years of follow-
up. This is now considered inappropriate since couples have different risks of concep-
tion, and the risk for each woman changes over time. Pregnancies are now analysed
more often using life tables or time-to-event methods that investigate the time elapsing
before the first pregnancy.
Analysing count data as rates is not always the most appropriate approach, and it is
uncommon in practice.
The results of a study may be expressed as a rate ratio, that is the ratio of the rate in
the experimental intervention group to the rate in the comparator group. The (natural)
logarithms of the rate ratios may be combined across studies using the generic inverse-
variance method (see Section 10.3.3). Alternatively, Poisson regression approaches can
be used (Spittal et al 2015).
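To make the calculation concrete, a minimal sketch for a single study, using the standard large-sample approximation in which the standard error of the log rate ratio is √(1/E1 + 1/E2) for event counts E1 and E2; the experimental-group numbers are hypothetical:

```python
import math

def log_rate_ratio(events_exp, time_exp, events_comp, time_comp):
    """Log rate ratio and its approximate standard error for one study
    (large-sample formula SE = sqrt(1/e1 + 1/e2)), ready for pooling
    with the generic inverse-variance method."""
    log_rr = math.log((events_exp / time_exp) / (events_comp / time_comp))
    se = math.sqrt(1 / events_exp + 1 / events_comp)
    return log_rr, se

# Hypothetical trial: 60 strokes over 2900 person-years versus
# 85 strokes over 2836 person-years in the comparator group
print(log_rate_ratio(60, 2900, 85, 2836))
```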
In a randomized trial, rate ratios may often be very similar to risk ratios obtained after
dichotomizing the participants, since the average period of follow-up should be similar
in all intervention groups. Rate ratios and risk ratios will differ, however, if an interven-
tion affects the likelihood of some participants experiencing multiple events.
It is possible also to focus attention on the rate difference (see Chapter 6,
Section 6.7.1). The analysis again can be performed using the generic inverse-variance
method (Hasselblad and McCrory 1995, Guevara et al 2004).
Time-to-event data can be analysed in RevMan using the ‘O – E and Variance’ outcome type. There are several ways to calculate these
‘O – E’ and ‘V’ statistics. Peto’s method applied to dichotomous data (Section 10.4.2)
gives rise to an odds ratio; a log-rank approach gives rise to a hazard ratio; and a var-
iation of the Peto method for analysing time-to-event data gives rise to something in
between (Simmonds et al 2011). The appropriate effect measure should be specified.
Only fixed-effect meta-analysis methods are available in RevMan for ‘O – E and Vari-
ance’ outcomes.
Alternatively, if estimates of log hazard ratios and standard errors have been
obtained from results of Cox proportional hazards regression models, study results
can be combined using generic inverse-variance methods (see Section 10.3.3).
If a mixture of log-rank and Cox model estimates are obtained from the studies, all
results can be combined using the generic inverse-variance method, as the log-rank
estimates can be converted into log hazard ratios and standard errors using the
approaches discussed in Chapter 6 (Section 6.8).
10.10 Heterogeneity
10.10.1 What is heterogeneity?
Inevitably, studies brought together in a systematic review will differ. Any kind of var-
iability among studies in a systematic review may be termed heterogeneity. It can be
helpful to distinguish between different types of heterogeneity. Variability in the parti-
cipants, interventions and outcomes studied may be described as clinical diversity
(sometimes called clinical heterogeneity), and variability in study design, outcome
measurement tools and risk of bias may be described as methodological diversity
(sometimes called methodological heterogeneity). Variability in the intervention effects
being evaluated in the different studies is known as statistical heterogeneity, and is a
consequence of clinical or methodological diversity, or both, among the studies.
Statistical heterogeneity manifests itself in the observed intervention effects being
more different from each other than one would expect due to random error (chance)
alone. We will follow convention and refer to statistical heterogeneity simply as
heterogeneity.
Clinical variation will lead to heterogeneity if the intervention effect is affected by the
factors that vary across studies; most obviously, the specific interventions or patient
characteristics. In other words, the true intervention effect will be different in different
studies.
Differences between studies in methodological factors, such as use of blinding and
concealment of the allocation sequence, or in the way outcomes are defined and
measured, may be expected to lead to differences in the observed intervention
effects. Significant statistical heterogene-
ity arising from methodological diversity or differences in outcome assessments sug-
gests that the studies are not all estimating the same quantity, but does not
necessarily suggest that the true intervention effect varies. In particular, heterogene-
ity associated solely with methodological diversity would indicate that the studies
suffer from different degrees of bias. Empirical evidence suggests that some aspects
of design can affect the result of clinical trials, although this is not always the case.
Further discussion appears in Chapters 7 and 8.
The scope of a review will largely determine the extent to which studies included in a
review are diverse. Sometimes a review will include studies addressing a variety of ques-
tions, for example when several different interventions for the same condition are of
interest (see also Chapter 11) or when the differential effects of an intervention in differ-
ent populations are of interest. Meta-analysis should only be considered when a group of
studies is sufficiently homogeneous in terms of participants, interventions and outcomes
to provide a meaningful summary. It is often appropriate to take a broader perspective in
a meta-analysis than in a single clinical trial. A common analogy is that systematic
reviews bring together apples and oranges, and that combining these can yield a mean-
ingless result. This is true if apples and oranges are of intrinsic interest on their own, but
may not be if they are used to contribute to a wider question about fruit. For example, a
meta-analysis may reasonably evaluate the average effect of a class of drugs by combin-
ing results from trials where each evaluates the effect of a different drug from the class.
There may be specific interest in a review in investigating how clinical and method-
ological aspects of studies relate to their results. Where possible these investigations
should be specified a priori (i.e. in the protocol for the systematic review). It is legiti-
mate for a systematic review to focus on examining the relationship between some clin-
ical characteristic(s) of the studies and the size of intervention effect, rather than on
obtaining a summary effect estimate across a series of studies (see Section 10.11).
Meta-regression may best be used for this purpose, although it is not implemented
in RevMan (see Section 10.11.4).
A statistical test for heterogeneity is available. This Chi2 (χ2, or chi-squared) test is included in the
forest plots in Cochrane Reviews. It assesses whether observed differences in results
are compatible with chance alone. A low P value (or a large Chi2 statistic relative to
its degrees of freedom) provides evidence of heterogeneity of intervention effects
(variation in effect estimates beyond chance).
Care must be taken in the interpretation of the Chi2 test, since it has low power in the
(common) situation of a meta-analysis when studies have small sample size or are few
in number. This means that while a statistically significant result may indicate a prob-
lem with heterogeneity, a non-significant result must not be taken as evidence of no
heterogeneity. This is also why a P value of 0.10, rather than the conventional level
of 0.05, is sometimes used to determine statistical significance. A further problem with
the test, which seldom occurs in Cochrane Reviews, is that when there are many studies
in a meta-analysis, the test has high power to detect a small amount of heterogeneity
that may be clinically unimportant.
Some argue that, since clinical and methodological diversity always occur in a meta-
analysis, statistical heterogeneity is inevitable (Higgins et al 2003). Thus, the test for
heterogeneity is irrelevant to the choice of analysis; heterogeneity will always exist
whether or not we happen to be able to detect it using a statistical test. Methods have
been developed for quantifying inconsistency across studies that move the focus
away from testing whether heterogeneity is present to assessing its impact on the
meta-analysis. A useful statistic for quantifying inconsistency is:
I2 = ((Q − df) / Q) × 100%
In this equation, Q is the Chi2 statistic and df is its degrees of freedom (Higgins and
Thompson 2002, Higgins et al 2003). I2 describes the percentage of the variability in
effect estimates that is due to heterogeneity rather than sampling error (chance).
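For concreteness, a minimal Python sketch that computes Q from an inverse-variance fixed-effect meta-analysis and derives I2, truncating negative values at zero; the effect estimates and standard errors are hypothetical:

```python
def q_and_i_squared(estimates, standard_errors):
    """Cochran's Q from an inverse-variance fixed-effect meta-analysis,
    and I2 = 100 * (Q - df) / Q (truncated at zero)."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, 100 * (q - df) / q) if q > 0 else 0.0
    return q, df, i2

# Hypothetical log risk ratios and standard errors from four studies
print(q_and_i_squared([-0.4, -0.1, -0.5, 0.1], [0.15, 0.2, 0.25, 0.3]))
```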
Thresholds for the interpretation of the I2 statistic can be misleading, since the importance
of inconsistency depends on several factors. A rough guide to interpretation in
the context of meta-analyses of randomized trials is as follows:

• 0% to 40%: might not be important;
• 30% to 60%: may represent moderate heterogeneity;
• 50% to 90%: may represent substantial heterogeneity;
• 75% to 100%: considerable heterogeneity.

A number of options are available if statistical heterogeneity is identified among a group
of studies:
1) Check again that the data are correct. Severe apparent heterogeneity can indicate
that data have been incorrectly extracted or entered into meta-analysis software.
For example, if standard errors have mistakenly been entered as SDs for continuous
outcomes, this could manifest itself in overly narrow confidence intervals with poor
overlap and hence substantial heterogeneity. Unit-of-analysis errors may also be
causes of heterogeneity (see Chapter 6, Section 6.2).
2) Do not do a meta-analysis. A systematic review need not contain any meta-analyses.
If there is considerable variation in results, and particularly if there is inconsistency
in the direction of effect, it may be misleading to quote an average value for the
intervention effect.
3) Explore heterogeneity. It is clearly of interest to determine the causes of heteroge-
neity among results of studies. This process is problematic since there are often
many characteristics that vary across studies from which one may choose. Hetero-
geneity may be explored by conducting subgroup analyses (see Section 10.11.3) or
meta-regression (see Section 10.11.4). Reliable conclusions can only be drawn from
analyses that are truly pre-specified before inspecting the studies’ results, and even
these conclusions should be interpreted with caution. Explorations of heterogeneity
that are devised after heterogeneity is identified can at best lead to the generation
of hypotheses. They should be interpreted with even more caution and should gen-
erally not be listed among the conclusions of a review. Also, investigations of het-
erogeneity when there are very few studies are of questionable value.
4) Ignore heterogeneity. Fixed-effect meta-analyses ignore heterogeneity. The summary
effect estimate from a fixed-effect meta-analysis is normally interpreted as being the
best estimate of the intervention effect. However, the existence of heterogeneity sug-
gests that there may not be a single intervention effect but a variety of intervention
effects. Thus, the summary fixed-effect estimate may be an intervention effect that
does not actually exist in any population, and therefore have a confidence interval
that is meaningless as well as being too narrow (see Section 10.10.4).
Several lines of argument bear on the choice between a fixed-effect and a random-effects
meta-analysis:
1) Many have argued that the decision should be based on an expectation of whether
the intervention effects are truly identical, preferring the fixed-effect model if this is
likely and a random-effects model if this is unlikely (Borenstein et al 2010). Since it is
generally considered to be implausible that intervention effects across studies are
identical (unless the intervention has no effect at all), this leads many to advocate
use of the random-effects model.
2) Others have argued that a fixed-effect analysis can be interpreted in the presence of
heterogeneity, and that it makes fewer assumptions than a random-effects meta-
analysis. They then refer to it as a ‘fixed-effects’ meta-analysis (Peto et al 1995, Rice
et al 2018).
3) Under any interpretation, a fixed-effect meta-analysis ignores heterogeneity. If the
method is used, it is therefore important to supplement it with a statistical inves-
tigation of the extent of heterogeneity (see Section 10.10.2).
4) In the presence of heterogeneity, a random-effects analysis gives relatively more
weight to smaller studies and relatively less weight to larger studies. If there is addi-
tionally some funnel plot asymmetry (i.e. a relationship between intervention effect
magnitude and study size), then this will push the results of the random-effects anal-
ysis towards the findings in the smaller studies. In the context of randomized trials,
this is generally regarded as an unfortunate consequence of the model.
5) A pragmatic approach is to plan to undertake both a fixed-effect and a random-effects
meta-analysis, with an intention to present the random-effects result if there is no
indication of funnel plot asymmetry. If there is an indication of funnel plot asymmetry,
then both methods are problematic. It may be reasonable to present both analyses or
neither, or to perform a sensitivity analysis in which small studies are excluded or
addressed directly using meta-regression (see Chapter 13, Section 13.3.5.6).
6) The choice between a fixed-effect and a random-effects meta-analysis should never
be made on the basis of a statistical test for heterogeneity.
The former estimates the between-study variation by comparing each study’s result
with a Mantel-Haenszel fixed-effect meta-analysis result, whereas the latter estimates
it by comparing each study’s result with an inverse-variance fixed-effect meta-analysis
result. In practice, the difference is likely to be trivial.
There are alternative methods for performing random-effects meta-analyses that
have better technical properties than the DerSimonian and Laird approach with a
moment-based estimate (Veroniki et al 2016). Most notable among these is an adjust-
ment to the confidence interval proposed by Hartung and Knapp and by Sidik and
Jonkman (Hartung and Knapp 2001, Sidik and Jonkman 2002). This adjustment widens
the confidence interval to reflect uncertainty in the estimation of between-study
heterogeneity, and review authors should use it where it is available. An alternative option to
encompass full uncertainty in the degree of heterogeneity is to take a Bayesian
approach (see Section 10.13).
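For readers who wish to see the mechanics, a minimal sketch of the DerSimonian and Laird approach with the moment-based estimate of the between-study variance follows; the data are hypothetical, and the Hartung-Knapp/Sidik-Jonkman adjustment discussed above is not implemented here:

```python
import math

def dersimonian_laird(estimates, standard_errors):
    """Random-effects meta-analysis using the moment-based
    (DerSimonian-Laird) estimate of the between-study variance tau2."""
    w = [1 / se ** 2 for se in standard_errors]
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # truncated at zero
    w_star = [1 / (se ** 2 + tau2) for se in standard_errors]
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, se_pooled, tau2

print(dersimonian_laird([-0.4, -0.1, -0.5, 0.1], [0.15, 0.2, 0.25, 0.3]))
```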
An empirical comparison of different ways to estimate between-study variation in
Cochrane meta-analyses has shown that they can lead to substantial differences in esti-
mates of heterogeneity, but seldom have major implications for estimating summary
effects (Langan et al 2015). Several simulation studies have concluded that an
approach proposed by Paule and Mandel should be recommended (Langan et al
2017), whereas a comprehensive recent simulation study recommended a restricted
maximum likelihood approach, although it noted that no single approach is universally
preferable (Langan et al 2019). Review authors are encouraged to select one of these
options if it is available to them.
Findings from multiple subgroup analyses may be misleading. Subgroup analyses are
observational by nature and are not based on randomized comparisons. False negative
and false positive significance tests increase in likelihood rapidly as more subgroup
analyses are performed. If their findings are presented as definitive conclusions there
is clearly a risk of people being denied an effective intervention or treated with an inef-
fective (or even harmful) intervention. Subgroup analyses can also generate misleading
recommendations about directions for future research that, if followed, would waste
scarce resources.
It is useful to distinguish between the notions of ‘qualitative interaction’ and ‘quan-
titative interaction’ (Yusuf et al 1991). Qualitative interaction exists if the direction of
effect is reversed, that is if an intervention is beneficial in one subgroup but is harmful
in another. Qualitative interaction is rare. This may be used as an argument that the
most appropriate result of a meta-analysis is the overall effect across all subgroups.
Quantitative interaction exists when the size of the effect varies but not the direction,
that is if an intervention is beneficial to different degrees in different subgroups.
Formal tests for subgroup differences should be used only when the data in the subgroups are independent (i.e.
they should not be used if the same study participants contribute to more than one of
the subgroups in the forest plot).
If fixed-effect models are used for the analysis within each subgroup, then these sta-
tistics relate to differences in typical effects across different subgroups. If random-
effects models are used for the analysis within each subgroup, then the statistics relate
to variation in the mean effects in the different subgroups.
An alternative method for testing for differences between subgroups is to use meta-
regression techniques, in which case a random-effects model is generally preferred (see
Section 10.11.4). Tests for subgroup differences based on random-effects models may
be regarded as preferable to those based on fixed-effect models, due to the high risk of
false-positive results when a fixed-effect model is used to compare subgroups (Higgins
and Thompson 2004).
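In the simplest case of two independent subgroups, the comparison reduces to a z-test on the difference between the two subgroup summary effects; a minimal sketch with hypothetical numbers follows (with more than two subgroups, a Q-based chi-squared test across subgroups is generally used instead):

```python
import math
from statistics import NormalDist

def subgroup_difference_test(effect_a, se_a, effect_b, se_b):
    """Two-sided z-test comparing two independent subgroup summary
    effects, each supplied with its standard error."""
    z = (effect_a - effect_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical subgroup summary log risk ratios
print(subgroup_difference_test(-0.35, 0.12, -0.05, 0.15))
```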
10.11.4 Meta-regression
If studies are divided into subgroups (see Section 10.11.2), this may be viewed as an
investigation of how a categorical study characteristic is associated with the intervention
effects in the meta-analysis. For example, studies in which allocation sequence conceal-
ment was adequate may yield different results from those in which it was inadequate.
Here, allocation sequence concealment, being either adequate or inadequate, is a cat-
egorical characteristic at the study level. Meta-regression is an extension to subgroup
analyses that allows the effect of continuous, as well as categorical, characteristics to
be investigated, and in principle allows the effects of multiple factors to be investigated
simultaneously (although this is rarely possible due to inadequate numbers of studies)
(Thompson and Higgins 2002). Meta-regression should generally not be considered
when there are fewer than ten studies in a meta-analysis.
Meta-regressions are similar in essence to simple regressions, in which an outcome
variable is predicted according to the values of one or more explanatory variables. In
meta-regression, the outcome variable is the effect estimate (for example, a mean dif-
ference, a risk difference, a log odds ratio or a log risk ratio). The explanatory variables
are characteristics of studies that might influence the size of intervention effect. These
are often called ‘potential effect modifiers’ or covariates. Meta-regressions usually dif-
fer from simple regressions in two ways. First, larger studies have more influence on the
relationship than smaller studies, since studies are weighted by the precision of their
respective effect estimate. Second, it is wise to allow for the residual heterogeneity
267
10 Analysing data and undertaking meta-analyses
among intervention effects not modelled by the explanatory variables. This gives rise to
the term ‘random-effects meta-regression’, since the extra variability is incorporated in
the same way as in a random-effects meta-analysis (Thompson and Sharp 1999).
The regression coefficient obtained from a meta-regression analysis will describe
how the outcome variable (the intervention effect) changes with a unit increase in
the explanatory variable (the potential effect modifier). The statistical significance of
the regression coefficient is a test of whether there is a linear relationship between
intervention effect and the explanatory variable. If the intervention effect is a ratio
measure, the log-transformed value of the intervention effect should always be used
in the regression model (see Chapter 6, Section 6.1.2.1), and the exponential of the
regression coefficient will give an estimate of the relative change in intervention effect
with a unit increase in the explanatory variable.
Meta-regression can also be used to investigate differences for categorical explana-
tory variables as done in subgroup analyses. If there are J subgroups, membership of
particular subgroups is indicated by using J minus 1 dummy variables (which can only
take values of zero or one) in the meta-regression model (as in standard linear regres-
sion modelling). The regression coefficients will estimate how the intervention effect in
each subgroup differs from a nominated reference subgroup. The P value of each
regression coefficient will indicate the strength of evidence against the null hypothesis
that the characteristic is not associated with the intervention effect.
Meta-regression may be performed using the ‘metareg’ macro available for the Stata
statistical package, or using the ‘metafor’ package for R, as well as other packages.
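To illustrate the calculation that such routines perform, here is a bare-bones sketch of a fixed-effect meta-regression with one covariate, fitted by inverse-variance weighted least squares; a random-effects meta-regression additionally estimates residual heterogeneity, so dedicated software should be used in practice. All data here are hypothetical.

```python
import math

def weighted_meta_regression(effects, standard_errors, covariate):
    """Fixed-effect meta-regression of effect estimates on a single
    covariate by inverse-variance weighted least squares. Returns the
    intercept, the slope and the slope's standard error."""
    w = [1 / se ** 2 for se in standard_errors]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, covariate)) / sw
    ybar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, covariate))
    sxy = sum(wi * (x - xbar) * (y - ybar)
              for wi, x, y in zip(w, covariate, effects))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    se_slope = math.sqrt(1 / sxx)  # assumes within-study variances known
    return intercept, slope, se_slope

# Hypothetical log odds ratios regressed on mean study dose (mg);
# for a ratio measure, exponentiate the slope to interpret it
print(weighted_meta_regression(
    effects=[-0.1, -0.3, -0.5, -0.6],
    standard_errors=[0.2, 0.15, 0.25, 0.2],
    covariate=[10, 20, 30, 40],
))
```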
10.11.5.1 Ensure that there are adequate studies to justify subgroup analyses
and meta-regressions
It is very unlikely that an investigation of heterogeneity will produce useful findings
unless there is a substantial number of studies. Typical advice for undertaking simple
regression analyses is that at least ten observations (i.e. ten studies in a meta-analysis)
should be available for each characteristic modelled. However, even this will be too few
when the covariates are unevenly distributed across studies.
Pre-specification of characteristics reduces the likelihood of spurious findings, first by limiting the number of subgroups investigated, and second by preventing knowledge of the studies’ results influencing which sub-
groups are analysed. True pre-specification is difficult in systematic reviews, because
the results of some of the relevant studies are often known when the protocol is
drafted. If a characteristic was overlooked in the protocol, but is clearly of major impor-
tance and justified by external evidence, then authors should not be reluctant to
explore it. However, such post-hoc analyses should be identified as such.
Characteristics for investigation might relate to the participants (e.g. age or severity of
condition), the interventions (e.g. dose of active intervention, choice of comparison
intervention), how the study was done (e.g. length of follow-up) or methodology (e.g.
design and quality).
In fact, the age of the recipient is probably a key factor and the subgroup finding
would simply be due to the strong association between the age of the recipient
and the age of their sibling.
2) Was the analysis pre-specified or post hoc? Authors should state whether subgroup
analyses were pre-specified or undertaken after the results of the studies had been
compiled (post hoc). More reliance may be placed on a subgroup analysis if it was
one of a small number of pre-specified analyses. Performing numerous post-hoc
subgroup analyses to explain heterogeneity is a form of data dredging. Data dred-
ging is condemned because it is usually possible to find an apparent, but false,
explanation for heterogeneity by considering lots of different characteristics.
3) Is there indirect evidence in support of the findings? Differences between subgroups
should be clinically plausible and supported by other external or indirect evidence, if
they are to be convincing.
4) Is the magnitude of the difference practically important? If the magnitude of a differ-
ence between subgroups will not result in different recommendations for different
subgroups, then it may be better to present only the overall analysis results.
5) Is there a statistically significant difference between subgroups? To establish whether
there is a different effect of an intervention in different situations, the magnitudes of
effects in different subgroups should be compared directly with each other. In par-
ticular, statistical significance of the results within separate subgroup analyses
should not be compared (see Section 10.11.3.1).
6) Are analyses looking at within-study or between-study relationships? For patient and
intervention characteristics, differences in subgroups that are observed within
studies are more reliable than analyses of subsets of studies. If such within-study
relationships are replicated across studies then this adds confidence to the
findings.
For example, an intervention with a constant risk ratio of 0.8 is more effective in a high-risk population in terms of absolute differences in risk, in the sense that it reduces a 50% stroke rate by
10 percentage points to 40% (number needed to treat = 10), but a 20% stroke rate by 4
percentage points to 16% (number needed to treat = 25).
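The arithmetic above can be reproduced directly; a minimal sketch, assuming a constant risk ratio of 0.8:

```python
def arr_and_nnt(comparator_risk, risk_ratio=0.8):
    """Absolute risk reduction and number needed to treat implied by
    applying a fixed risk ratio to a given comparator-group risk."""
    arr = comparator_risk * (1 - risk_ratio)  # absolute risk reduction
    return arr, 1 / arr                        # (ARR, NNT)

print(arr_and_nnt(0.50))  # ARR 0.10 -> NNT 10
print(arr_and_nnt(0.20))  # ARR 0.04 -> NNT 25
```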
Use of different summary statistics (risk ratio, odds ratio and risk difference) will dem-
onstrate different relationships with underlying risk. Summary statistics that show
close to no relationship with underlying risk are generally preferred for use in meta-
analysis (see Section 10.4.3).
Investigating any relationship between effect estimates and the comparator group
risk is also complicated by a technical phenomenon known as regression to the mean.
This arises because the comparator group risk forms an integral part of the effect esti-
mate. A high risk in a comparator group, observed entirely by chance, will on average
give rise to a higher than expected effect estimate, and vice versa. This phenomenon
results in a false correlation between effect estimates and comparator group risks.
There are methods, which require sophisticated software, that correct for regression
to the mean (McIntosh 1996, Thompson et al 1997). These should be used for such ana-
lyses, and statistical expertise is recommended.
Studies may be entirely missing from a review, despite a comprehensive search, if they
were never published or were published in sources not covered by the search (a manifestation
of publication bias). This problem is discussed at length in Chapter 13. Details of
comprehensive search methods are provided in Chapter 4.
Some studies might not report any information on outcomes of interest to the
review. For example, there may be no information on quality of life, or on serious
adverse effects. It is often difficult to determine whether this is because the outcome
was not measured or because the outcome was not reported. Furthermore, failure to
report that outcomes were measured may be dependent on the unreported results
(selective outcome reporting bias; see Chapter 7, Section 7.2.3.3). Similarly, summary
data for an outcome, in a form that can be included in a meta-analysis, may be missing.
A common example is missing standard deviations (SDs) for continuous outcomes. This
is often a problem when change-from-baseline outcomes are sought. We discuss impu-
tation of missing SDs in Chapter 6 (Section 6.5.2.8). Other examples of missing summary
data are missing sample sizes (particularly those for each intervention group sepa-
rately), numbers of events, standard errors, follow-up times for calculating rates,
and sufficient details of time-to-event outcomes. Inappropriate analyses of studies,
for example of cluster-randomized and crossover trials, can lead to missing summary
data. It is sometimes possible to approximate the correct analyses of such studies, for
example by imputing correlation coefficients or SDs, as discussed in Chapter 23
(Section 23.1) for cluster-randomized studies and Chapter 23 (Section 23.2) for cross-
over trials. As a general rule, most methodologists believe that missing summary data
(e.g. ‘no usable data’) should not be used as a reason to exclude a study from a sys-
tematic review. It is more appropriate to include the study in the review, and to discuss
the potential implications of its absence from a meta-analysis.
It is likely that in some, if not all, included studies, there will be individuals missing
from the reported results. Review authors are encouraged to consider this problem
carefully (see MECIR Box 10.12.a). We provide further discussion of this problem in
Section 10.12.3; see also Chapter 8 (Section 8.5).
Missing data can also affect subgroup analyses. If subgroup analyses or meta-
regressions are planned (see Section 10.11), they require details of the study-level
characteristics that distinguish studies from one another. If these are not available for
all studies, review authors should consider asking the study authors for more information.
Data are ‘not missing at random’ when the probability that they are missing depends on the missing values themselves; for example, participants who fare poorly on an intervention may be more likely to drop out and hence more likely to have missing outcome data. Such data are ‘non-ignorable’ in the sense
that an analysis of the available data alone will typically be biased. Publication bias and
selective reporting bias lead by definition to data that are ‘not missing at random’, and
attrition and exclusions of individuals within studies often do as well.
The principal options for dealing with missing data are:
1) analysing only the available data (i.e. ignoring the missing data);
2) imputing the missing data with replacement values, and treating these as if they
were observed (e.g. last observation carried forward, imputing an assumed outcome
such as assuming all were poor outcomes, imputing the mean, imputing based on
predicted values from a regression analysis);
3) imputing the missing data and accounting for the fact that these were imputed with
uncertainty (e.g. multiple imputation, simple imputation methods (as point 2) with
adjustment to the standard error); and
4) using statistical models to allow for missing data, making assumptions about their
relationships with the available data.
Option 2 is practical in most circumstances and very commonly used in systematic
reviews. However, it fails to acknowledge uncertainty in the imputed values and
typically results in confidence intervals that are too narrow. Options 3 and 4 would
require the involvement of a knowledgeable statistician.
Five general recommendations for dealing with missing data in Cochrane Reviews are
as follows.
since it can be unclear whether reported numbers of events in trial reports apply to the
full randomized sample or only to those who did not drop out (Akl et al 2016).
Although there is a tradition of implementing ‘worst case’ and ‘best case’ analyses
to clarify the extreme boundaries of what is theoretically possible, such analyses
may not be informative for the most plausible scenarios (Higgins et al 2008a).
A 95% credible interval from a Bayesian analysis can be interpreted as meaning that
there is a 95% probability that the true effect lies within the interval. This is how many
practitioners actually interpret a classical confidence interval, but strictly in the classical
framework the 95% refers to the long-term frequency with which 95% intervals
contain the true value. The Bayesian framework also allows a review author to calcu-
late the probability that the odds ratio has a particular range of values, which cannot be
done in the classical framework. For example, we can determine the probability that
the odds ratio is less than 1 (which might indicate a beneficial effect of an experimental
intervention), or that it is no larger than 0.8 (which might indicate a clinically important
effect). It should be noted that these probabilities are specific to the choice of the prior
distribution. Different meta-analysts may analyse the same data using different prior
distributions and obtain different results. It is therefore important to carry out sensi-
tivity analyses to investigate how the results depend on any assumptions made.
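As an illustration of such probability statements, suppose (purely hypothetically) that the posterior distribution of the log odds ratio has been summarized as approximately normal, as might be done from MCMC output; the probabilities described above can then be read off directly:

```python
import math
from statistics import NormalDist

# Hypothetical normal summary of the posterior for the log odds ratio
posterior = NormalDist(mu=math.log(0.75), sigma=0.15)

p_benefit = posterior.cdf(math.log(1.0))    # P(OR < 1)
p_important = posterior.cdf(math.log(0.8))  # P(OR <= 0.8)
print(p_benefit, p_important)
```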
In the context of a meta-analysis, prior distributions are needed for the particular
intervention effect being analysed (such as the odds ratio or the mean difference)
and – in the context of a random-effects meta-analysis – on the amount of heteroge-
neity among intervention effects across studies. Prior distributions may represent sub-
jective belief about the size of the effect, or may be derived from sources of evidence
not included in the meta-analysis, such as information from non-randomized studies of
the same intervention or from randomized trials of other interventions. The width of
the prior distribution reflects the degree of uncertainty about the quantity. When there
is little or no information, a ‘non-informative’ prior can be used, in which all values
across the possible range are equally likely.
Most Bayesian meta-analyses use non-informative (or very weakly informative) prior
distributions to represent beliefs about intervention effects, since many regard it as
controversial to combine objective trial data with subjective opinion. However, prior
distributions are increasingly used for the extent of among-study variation in a
random-effects analysis. This is particularly advantageous when the number of studies
in the meta-analysis is small, say fewer than five or ten. Libraries of data-based prior
distributions are available that have been derived from re-analyses of many thousands
of meta-analyses in the Cochrane Database of Systematic Reviews (Turner et al 2012).
Statistical expertise is strongly recommended for review authors who wish to carry
out Bayesian analyses. There are several good texts (Sutton et al 2000, Sutton and
Abrams 2001, Spiegelhalter et al 2004).
It is highly desirable to demonstrate that the findings from a systematic review are not
dependent on such arbitrary or unclear decisions by using sensitivity analysis (see
MECIR Box 10.14.a). A sensitivity analysis is a repeat of the primary analysis or
meta-analysis in which alternative decisions or ranges of values are substituted for
decisions that were arbitrary or unclear. For example, if the eligibility of some studies
in the meta-analysis is dubious because they do not contain full details, sensitivity anal-
ysis may involve undertaking the meta-analysis twice: the first time including all studies
and, second, including only those that are definitely known to be eligible. A sensitivity
analysis asks the question, ‘Are the findings robust to the decisions made in the process
of obtaining them?’
There are many decision nodes within the systematic review process that can gen-
erate a need for a sensitivity analysis. Examples include:
3) Ordinal scales: what cut-point should be used to dichotomize short ordinal scales
into two groups?
4) Cluster-randomized trials: what values of the intraclass correlation coefficient
should be used when trial analyses have not been adjusted for clustering?
5) Crossover trials: what values of the within-subject correlation coefficient should be
used when this is not available in primary reports?
6) All analyses: what assumptions should be made about missing outcomes? Should
adjusted or unadjusted estimates of intervention effects be used?
Analysis methods:
1) Should fixed-effect or random-effects methods be used for the analysis?
2) For dichotomous outcomes, should odds ratios, risk ratios or risk differences be used?
3) For continuous outcomes, where several scales have assessed the same dimension,
should results be analysed as a standardized mean difference across all scales or as
mean differences individually for each scale?
Some sensitivity analyses can be pre-specified in the study protocol, but many issues
suitable for sensitivity analysis are identified only during the review process, when the
individual peculiarities of the studies under investigation become apparent. When sensitivity ana-
lyses show that the overall result and conclusions are not affected by the different
decisions that could be made during the review process, the results of the review can
be regarded with a higher degree of certainty. Where sensitivity analyses identify particular
decisions or missing information that greatly influence the findings of the review, greater
resources can be deployed to try and resolve uncertainties and obtain extra information,
possibly through contacting trial authors and obtaining individual participant data. If this
cannot be achieved, the results must be interpreted with an appropriate degree of caution.
Such findings may generate proposals for further investigations and future research.
Reporting of sensitivity analyses in a systematic review may best be done by produ-
cing a summary table. Rarely is it informative to produce individual forest plots for each
sensitivity analysis undertaken.
Sensitivity analyses are sometimes confused with subgroup analysis. Although some
sensitivity analyses involve restricting the analysis to a subset of the totality of studies,
the two methods differ in two ways. First, sensitivity analyses do not attempt to esti-
mate the effect of the intervention in the group of studies removed from the analysis,
whereas in subgroup analyses, estimates are produced for each subgroup. Second, in
sensitivity analyses, informal comparisons are made between different ways of esti-
mating the same thing, whereas in subgroup analyses, formal statistical comparisons
are made across the subgroups.
Acknowledgements: We are grateful to Joseph Lau, Keith O’Rourke, Gerta Rücker, Rob
Scholten, Jonathan Sterne, Simon Thompson and Anne Whitehead for their contributions
to this chapter.
Funding: JJD received support from the National Institute for Health Research (NIHR)
Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS
Foundation Trust and the University of Birmingham. JPTH is a member of the NIHR
Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust
and the University of Bristol. JPTH received funding from National Institute for Health
Research Senior Investigator award NF-SI-0617-10145. The views expressed are those of
the author(s) and not necessarily those of the NHS, the NIHR or the Department of
Health.
10.16 References
Agresti A. An Introduction to Categorical Data Analysis. New York (NY): John Wiley &
Sons; 1996.
Akl EA, Kahale LA, Agoritsas T, Brignardello-Petersen R, Busse JW, Carrasco-Labra A,
Ebrahim S, Johnston BC, Neumann I, Sola I, Sun X, Vandvik P, Zhang Y, Alonso-Coello P,
Guyatt G. Handling trial participants with missing outcome data when conducting a
meta-analysis: a systematic survey of proposed approaches. Systematic Reviews 2015;
4: 98.
Akl EA, Kahale LA, Ebrahim S, Alonso-Coello P, Schünemann HJ, Guyatt GH. Three
challenges described for identifying participants with missing data in trials reports, and
potential solutions suggested to systematic reviewers. Journal of Clinical Epidemiology
2016; 76: 147–154.
Altman DG, Bland JM. Detecting skewness from summary information. BMJ 1996; 313: 1200.
Anzures-Cabrera J, Sarpatwari A, Higgins JPT. Expressing findings from meta-analyses of
continuous outcomes in terms of risks. Statistics in Medicine 2011; 30: 2967–2985.
Berlin JA, Longnecker MP, Greenland S. Meta-analysis of epidemiologic dose-response data.
Epidemiology 1993; 4: 218–228.
Berlin JA, Antman EM. Advantages and limitations of metaanalytic regressions of clinical
trials data. Online Journal of Current Clinical Trials 1994; Doc No 134.
Berlin JA, Santanna J, Schmid CH, Szczech LA, Feldman KA, Group A-LAITS. Individual patient-
versus group-level data meta-regressions for the investigation of treatment effect modifiers:
ecological bias rears its ugly head. Statistics in Medicine 2002; 21: 371–387.
Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect
and random-effects models for meta-analysis. Research Synthesis Methods 2010;
1: 97–111.
Borenstein M, Higgins JPT. Meta-analysis and subgroups. Prevention Science 2013; 14:
134–143.
Bradburn MJ, Deeks JJ, Berlin JA, Russell Localio A. Much ado about nothing: a comparison
of the performance of meta-analytical methods with rare events. Statistics in Medicine
2007; 26: 53–77.
Chinn S. A simple method for converting an odds ratio to effect size for use in meta-analysis.
Statistics in Medicine 2000; 19: 3127–3131.
da Costa BR, Nuesch E, Rutjes AW, Johnston BC, Reichenbach S, Trelle S, Guyatt GH, Jüni P.
Combining follow-up and change data is valid in meta-analyses of continuous outcomes:
a meta-epidemiological study. Journal of Clinical Epidemiology 2013; 66: 847–855.
Deeks JJ. Systematic reviews of published evidence: miracles or minefields? Annals of
Oncology 1998; 9: 703–709.
Deeks JJ, Altman DG, Bradburn MJ. Statistical methods for examining heterogeneity
and combining results from several studies in meta-analysis. In: Egger M, Davey Smith G,
Altman DG, editors. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd ed.
London (UK): BMJ Publication Group; 2001. p. 285–312.
Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials
with binary outcomes. Statistics in Medicine 2002; 21: 1575–1600.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986; 7:
177–188.
DiGuiseppi C, Higgins JPT. Interventions for promoting smoke alarm ownership and
function. Cochrane Database of Systematic Reviews 2001; 2: CD002246.
Ebrahim S, Akl EA, Mustafa RA, Sun X, Walter SD, Heels-Ansdell D, Alonso-Coello P, Johnston
BC, Guyatt GH. Addressing continuous data for participants excluded from trial analysis:
a guide for systematic reviewers. Journal of Clinical Epidemiology 2013; 66: 1014–
1021 e1011.
Ebrahim S, Johnston BC, Akl EA, Mustafa RA, Sun X, Walter SD, Heels-Ansdell D, Alonso-
Coello P, Guyatt GH. Addressing continuous data measured with different instruments for
participants excluded from trial analysis: a guide for systematic reviewers. Journal of
Clinical Epidemiology 2014; 67: 560–570.
Efthimiou O. Practical guide to the meta-analysis of rare events. Evidence-Based Mental
Health 2018; 21: 72–76.
Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple,
graphical test. BMJ 1997; 315: 629–634.
Engels EA, Schmid CH, Terrin N, Olkin I, Lau J. Heterogeneity and statistical significance in
meta-analysis: an empirical study of 125 meta-analyses. Statistics in Medicine 2000; 19:
1707–1728.
Greenland S, Robins JM. Estimation of a common effect parameter from sparse follow-up
data. Biometrics 1985; 41: 55–68.
Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiologic
Reviews 1987; 9: 1–30.
Greenland S, Longnecker MP. Methods for trend estimation from summarized dose-
response data, with applications to meta-analysis. American Journal of Epidemiology
1992; 135: 1301–1309.
Guevara JP, Berlin JA, Wolf FM. Meta-analytic methods for pooling rates when follow-up
duration varies: a case study. BMC Medical Research Methodology 2004; 4: 17.
Hartung J, Knapp G. A refined method for the meta-analysis of controlled clinical trials with
binary outcome. Statistics in Medicine 2001; 20: 3875–3889.
Hasselblad V, McCrory DC. Meta-analytic tools for medical decision making: a practical
guide. Medical Decision Making 1995; 15: 81–96.
Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in
Medicine 2002; 21: 1539–1558.
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-
analyses. BMJ 2003; 327: 557–560.
Higgins JPT, Thompson SG. Controlling the risk of spurious findings from meta-regression.
Statistics in Medicine 2004; 23: 1663–1682.
Higgins JPT, White IR, Wood AM. Imputation methods for missing outcome data in
meta-analysis of clinical trials. Clinical Trials 2008a; 5: 225–239.
Higgins JPT, White IR, Anzures-Cabrera J. Meta-analysis of skewed data: combining
results reported on log-transformed or raw scales. Statistics in Medicine 2008b; 27:
6072–6092.
Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-
analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 2009;
172: 137–159.
Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies
between large and small randomized trials in meta-analyses. Annals of Internal Medicine
2001; 135: 982–989.
Langan D, Higgins JPT, Simmonds M. An empirical comparison of heterogeneity variance
estimators in 12 894 meta-analyses. Research Synthesis Methods 2015; 6: 195–205.
Langan D, Higgins JPT, Simmonds M. Comparative performance of heterogeneity variance
estimators in meta-analysis: a review of simulation studies. Research Synthesis Methods
2017; 8: 181–198.
Langan D, Higgins JPT, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, Viechtbauer W,
Simmonds M. A comparison of heterogeneity variance estimators in simulated
random-effects meta-analyses. Research Synthesis Methods 2019; 10: 83–98.
Lewis S, Clarke M. Forest plots: trying to see the wood and the trees. BMJ 2001; 322:
1479–1480.
Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS – A Bayesian modelling framework:
concepts, structure, and extensibility. Statistics and Computing 2000; 10: 325–337.
Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies
of disease. Journal of the National Cancer Institute 1959; 22: 719–748.
McIntosh MW. The population risk as an explanatory variable in research synthesis of clinical
trials. Statistics in Medicine 1996; 15: 1713–1728.
Morgenstern H. Uses of ecologic analysis in epidemiologic research. American Journal of
Public Health 1982; 72: 1336–1344.
Oxman AD, Guyatt GH. A consumers guide to subgroup analyses. Annals of Internal Medicine
1992; 116: 78–84.
Peto R, Collins R, Gray R. Large-scale randomized evidence: large, simple trials and
overviews of trials. Journal of Clinical Epidemiology 1995; 48: 23–40.
Poole C, Greenland S. Random-effects meta-analyses are not always conservative. American
Journal of Epidemiology 1999; 150: 469–475.
Rhodes KM, Turner RM, White IR, Jackson D, Spiegelhalter DJ, Higgins JPT. Implementing
informative priors for heterogeneity in meta-analysis using meta-regression and pseudo
data. Statistics in Medicine 2016; 35: 5495–5511.
Rice K, Higgins JPT, Lumley T. A re-evaluation of fixed effect(s) meta-analysis. Journal of the
Royal Statistical Society Series A (Statistics in Society) 2018; 181: 205–227.
Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011;
342: d549.
Röver C. Bayesian random-effects meta-analysis using the bayesmeta R package 2017.
https://arxiv.org/abs/1711.08683.
Rücker G, Schwarzer G, Carpenter J, Olkin I. Why add anything to nothing? The arcsine
difference as a measure of treatment effect in meta-analysis with zero cells. Statistics in
Medicine 2009; 28: 721–738.
Sharp SJ. Analysing the relationship between treatment benefit and underlying risk:
precautions and practical recommendations. In: Egger M, Davey Smith G, Altman DG,
editors. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd ed. London (UK):
BMJ Publication Group; 2001: 176–188.
Sidik K, Jonkman JN. A simple confidence interval for meta-analysis. Statistics in Medicine
2002; 21: 3153–3159.
Simmonds MC, Tierney J, Bowden J, Higgins JPT. Meta-analysis of time-to-event data:
a comparison of two-stage methods. Research Synthesis Methods 2011; 2: 139–149.
Sinclair JC, Bracken MB. Clinically useful measures of effect in binary analyses of
randomized trials. Journal of Clinical Epidemiology 1994; 47: 881–889.
Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to random-effects meta-
analysis: a comparative study. Statistics in Medicine 1995; 14: 2685–2699.
Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-
Care Evaluation. Chichester (UK): John Wiley & Sons; 2004.
Spittal MJ, Pirkis J, Gurrin LC. Meta-analysis of incidence rate data in the presence of zero
events. BMC Medical Research Methodology 2015; 15: 42.
Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for Meta-analysis in Medical
Research. Chichester (UK): John Wiley & Sons; 2000.
Sutton AJ, Abrams KR. Bayesian methods in meta-analysis and evidence synthesis.
Statistical Methods in Medical Research 2001; 10: 277–303.
Sweeting MJ, Sutton AJ, Lambert PC. What to add to nothing? Use and avoidance of
continuity corrections in meta-analysis of sparse data. Statistics in Medicine 2004; 23:
1351–1375.
Thompson SG, Smith TC, Sharp SJ. Investigating underlying risk as a source of
heterogeneity in meta-analysis. Statistics in Medicine 1997; 16: 2741–2758.
Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of
methods. Statistics in Medicine 1999; 18: 2693–2708.
Thompson SG, Higgins JPT. How should meta-regression analyses be undertaken and
interpreted? Statistics in Medicine 2002; 21: 1559–1574.
Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of
heterogeneity in meta-analysis, using empirical data from the Cochrane Database of
Systematic Reviews. International Journal of Epidemiology 2012; 41: 818–827.
Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins JPT,
Langan D, Salanti G. Methods to estimate the between-study variance and its uncertainty
in meta-analysis. Research Synthesis Methods 2016; 7: 55–79.
11 Undertaking network meta-analyses
Anna Chaimani, Deborah M Caldwell, Tianjing Li, Julian PT Higgins, Georgia Salanti
KEY POINTS
• Network meta-analysis is a technique for comparing three or more interventions simultaneously in a single analysis by combining both direct and indirect evidence across a network of studies.
• Network meta-analysis produces estimates of the relative effects between any pair of interventions in the network, and usually yields more precise estimates than a single direct or indirect estimate. It also allows estimation of the ranking and hierarchy of interventions.
• A valid network meta-analysis relies on the assumption that the different sets of studies included in the analysis are similar, on average, in all important factors that may affect the relative effects.
• Incoherence (also called inconsistency) occurs when different sources of information (e.g. direct and indirect) about a particular intervention comparison disagree.
• Grading confidence in evidence from a network meta-analysis begins by evaluating confidence in each direct comparison. Domain-specific assessments are combined to determine the overall confidence in the evidence.
This chapter should be cited as: Chaimani A, Caldwell DM, Li T, Higgins JPT, Salanti G. Chapter 11:
Undertaking network meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ,
Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK):
John Wiley & Sons, 2019: 285–320.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Decisions among multiple treatment options are best informed by a single review that
includes all relevant interventions, and presents their comparative effectiveness and
potential for harm. Network meta-analysis provides an analysis option for such a review.
Any set of studies that links three or more interventions via direct comparisons forms
a network of interventions. In a network of interventions there can be multiple ways
to make indirect comparisons between the interventions. These are comparisons that
have not been made directly within studies, and they can be estimated using mathe-
matical combinations of the direct intervention effect estimates available. Network
meta-analysis combines direct and indirect estimates across a network of interven-
tions in a single analysis. Synonymous terms, less often used, are mixed treatment
comparisons and multiple treatments meta-analysis.
Figure 11.1.a Example of network diagram with four competing interventions and information on the
presence of multi-arm randomized trials
Figure 11.2.a Illustration of an indirect estimate that compares the effectiveness of ‘doctor’ (B) and
‘nurse’ (C) in providing dietary advice through a common comparator ‘dietitian’ (A)
Figure 11.2.b Graphical representation of the indirect comparison ‘doctor’ (B) versus ‘nurse’ (C) via
‘dietitian’ (A)
Suppose there are randomized trials directly comparing dietary advice from a dietitian
(A) with advice from a doctor (B), and trials comparing a dietitian with a nurse (C):
direct evidence for the comparisons A versus B (‘AB’) and A versus C (‘AC’), measured
as mean difference (MD) in weight reduction (see Chapter 6, Section 6.5.1.1).
The situation is illustrated in Figure 11.2.a, where the solid
straight lines depict available evidence. We wish to learn about the relative effect of
advice by a doctor versus a nurse (B versus C); the dashed line depicts this comparison,
for which there is no direct evidence.
One way to understand an indirect comparison is to think of the BC comparison (of B
versus C) as representing the benefit of B over C. All else being equal, the benefit of
B over C is equivalent to the benefit of B over A plus the benefit of A over C. Thus,
for example, the indirect comparison describing benefit of ‘doctor’ over ‘nurse’ may
be thought of as the benefit of ‘doctor’ over ‘dietitian’ plus the benefit of ‘dietitian’ over
‘nurse’ (these ‘benefits’ may be positive or negative; we do not intend to imply any par-
ticular superiority among these three types of people offering dietary advice). This is
represented graphically in Figure 11.2.b.
Mathematically, the sum can be written:

indirect MD(BvsC) = direct MD(BvsA) + direct MD(AvsC)

We usually write this in the form of subtraction:

indirect MD(BvsC) = direct MD(AvsC) − direct MD(AvsB)
such that the difference between the summary statistics of the intervention effect in the
direct A versus C and A versus B meta-analyses provides an indirect estimate of the B
versus C intervention effect.
For this simple case where we have two direct comparisons (three interventions) the
analysis can be conducted by performing subgroup analyses using standard meta-
analysis routines (including RevMan): studies addressing the two direct comparisons
(i.e. A versus B and A versus C) can be treated as two subgroups in the meta-analysis.
The difference between the summary effects from the two subgroups gives an estimate
for the indirect comparison.
Most software will provide a P value for the statistical significance of the difference
between the subgroups based on the estimated variance of the indirect effect estimate
(Bucher et al 1997):
variance[indirect MD(BvsC)] = variance[direct MD(AvsC)] + variance[direct MD(AvsB)]
where variance[direct MD(AvsC)] and variance[direct MD(AvsB)] are the variances of the
respective direct estimates (from the two subgroup analyses).
A 95% confidence interval for the indirect summary effect is constructed by the
formula:

indirect MD(BvsC) ± 1.96 × √variance[indirect MD(BvsC)]
This method uses the intervention effects from each group of randomized trials and
therefore preserves within-trial randomization. If we had instead pooled single arms
across the studies (e.g. all B arms and all C arms, ignoring the A arms) and then per-
formed a direct comparison between the pooled B and C arms (i.e. treating the data as
if they came from a single large randomized trial), then our analysis would discard the
benefits of within-trial randomization (Li and Dickersin 2013). This approach should not
be used.
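A minimal sketch of this adjusted indirect comparison (Bucher et al 1997), using hypothetical direct summary mean differences:

```python
import math

def adjusted_indirect(md_ac, se_ac, md_ab, se_ab):
    """Indirect B versus C estimate from direct A versus C and A versus B
    summary estimates, with its standard error and 95% CI."""
    md_bc = md_ac - md_ab
    se_bc = math.sqrt(se_ac ** 2 + se_ab ** 2)
    ci = (md_bc - 1.96 * se_bc, md_bc + 1.96 * se_bc)
    return md_bc, se_bc, ci

# Hypothetical direct summary MDs in weight reduction (kg)
print(adjusted_indirect(md_ac=-1.5, se_ac=0.6, md_ab=-2.4, se_ab=0.5))
```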
When four or more competing interventions are available, indirect estimates can be
derived via multiple routes. The only requirement is that two interventions are ‘con-
nected’ and not necessarily via a single common comparator. An example of this sit-
uation is provided in Figure 11.2.c. Here ‘doctor’ (B) and ‘pharmacist’ (D) do not have a
common comparator, but we can compare them indirectly via the route ‘doctor’ (B) –
‘dietitian’ (A) – ‘nurse’ (C) – ‘pharmacist (D) by an extension of the arguments set out
earlier.
11.2.2 Transitivity
11.2.2.1 Validity of an indirect comparison
The underlying assumption of indirect comparisons is that we can learn about the true
relative effect of B versus C via treatment A by combining the true relative effects
A versus B and A versus C. This relationship can be written mathematically as
effect of B versus C = effect of A versus C – effect of A versus B
In words, this means that we can compare interventions B and C via intervention A
(Figure 11.2.a).
Figure 11.2.c Example of deriving an indirect estimate that compares the effectiveness of ‘doctor’ (B) and
‘pharmacist’ (D) in providing dietary advice through a connected loop
One way to express the transitivity assumption is that, in any particular trial, the
‘missing’ interventions (those not included in that trial) may be considered to be
missing for reasons unrelated to their effects (Caldwell et al 2005, Salanti 2012).
Figure 11.2.d Example of valid and invalid indirect comparisons when the severity of disease acts as
an effect modifier and its distribution differs between the two direct comparisons. The shaded boxes
represent the treatment effect estimates from each source of evidence (striped box for A versus B and
checked box for A versus C). In the first row, randomized trials of A versus B and of A versus C are all
conducted in moderately obese populations; in the second row randomized trials are all conducted in
severely obese populations. In both of these the indirect comparisons of the treatment effect estimates
would be valid. In the last row, the A versus B and A versus C randomized trials are conducted in
different populations. As severity is an effect modifier, the indirect comparison based on these would
not be valid (Jansen et al 2014). Reproduced with permission of Elsevier
An estimate obtained by combining direct and indirect evidence is referred to as a combined or
mixed estimate of the intervention effect. We will use the former term in this chapter.
A combined estimate can be computed as an inverse variance weighted average (see
Chapter 10, Section 10.3) of the direct and indirect summary estimates.
Since combined estimates incorporate indirect comparisons, they rely on the transi-
tivity assumption. Violation of transitivity threatens the validity of both indirect and
combined estimates. Of course, biased direct intervention effects for any of the com-
parisons also challenge the validity of a combined effect (Madan et al 2011).
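As a minimal sketch of this calculation in R, assuming hypothetical direct and indirect summaries of the same comparison, the combined estimate and its confidence interval follow directly from the inverse-variance weights:

    # Inverse variance weighted combination of a direct and an indirect summary
    # estimate of the same comparison (all values hypothetical).
    md_dir <- -1.1; se_dir <- 0.45   # direct estimate of B vs C
    md_ind <- -1.3; se_ind <- 0.78   # indirect estimate of B vs C (e.g. via A)

    w_dir <- 1 / se_dir^2            # inverse-variance weights
    w_ind <- 1 / se_ind^2

    md_comb <- (w_dir * md_dir + w_ind * md_ind) / (w_dir + w_ind)
    se_comb <- sqrt(1 / (w_dir + w_ind))
    round(c(combined_MD = md_comb,
            lower = md_comb - 1.96 * se_comb,
            upper = md_comb + 1.96 * se_comb), 3)

Because the weights are inverse variances, the combined estimate is pulled towards the more precise source of evidence, here the direct estimate.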
The following example refers to a network that used two criteria to classify electronic
interventions for smoking cessation into five categories: “To be able to draw general-
izable conclusions on the different types of electronic interventions, we developed a
categorization system that brought similar interventions together in a limited number
of categories. We sought advice from experts in smoking cessation on the key dimen-
sions that would influence the effectiveness of smoking cessation programmes.
Through this process, two dimensions for evaluating interventions were identified.
The first dimension was related to whether the intervention offered generic advice
or tailored its feedback to information provided by the user in some way. The second
dimension related to whether the intervention used a single channel or multiple chan-
nels. From these dimensions, we developed a system with five categories…, ranging
from interventions that provide generic information through a single channel, e.g. a
static Web site or mass e-mail (category e1) to complex interventions with multiple
channels delivering tailored information, e.g. an interactive Web site plus an interactive
forum (category e5)” (Madan et al 2014).
Empirical evidence is currently lacking on whether more or less expanded networks
are more prone to important intransitivity or incoherence. Extended discussions of how
different dosages can be modelled in network meta-analysis are available (Del Giovane
et al 2013, Owen et al 2015, Mawdsley et al 2016).
Figure 11.4.a Network graph of four interventions for heavy menstrual bleeding (Middleton et al 2010).
The size of the nodes is proportional to the number of participants assigned to the intervention
and the thickness of the lines is proportional to the number of randomized trials that studied the
respective direct comparison. Reproduced with permission of BMJ Publishing Group
11.4 Synthesis of results
Table 11.4.a Intervention effects, measured as odds ratios of patient dissatisfaction at 12 months of
four interventions for heavy menstrual bleeding with 95% confidence intervals. Odds ratios lower than
1 favour the column-defining intervention for the network meta-analysis results (lower triangle) and
the row-defining intervention for the pair-wise meta-analysis results (upper triangle)
Pair-wise meta-analysis (upper triangle) / network meta-analysis (lower triangle):

Hysterectomy            –            –            0.38 (0.22 to 0.65)
0.45 (0.24 to 0.82)     Second generation non-hysteroscopic techniques     1.35 (0.45 to 4.08)     0.82 (0.60 to 1.12)
…
The probability that an intervention is at a specific rank (first, second, etc.) when compared
with the other interventions in the network is frequently used as a ranking measure. Ranking probabilities
may vary for different outcomes. As for any estimated quantity, ranking probabilities
are estimated with some variability. Therefore, inference based solely on the probabil-
ity of being ranked as the best, without accounting for the variability, is misleading and
should be avoided.
Ranking measures such as the mean ranks, median ranks and the cumulative rank-
ing probabilities summarize the estimated probabilities for all possible ranks and
account for uncertainty in relative ranking. Further discussion of ranking measures
is available elsewhere (Salanti et al 2011, Chaimani et al 2013, Tan et al 2014, Rücker
and Schwarzer 2015).
The estimated ranking probabilities for the heavy menstrual bleeding network (see
Section 11.4.3.2) are presented in Table 11.4.b. ‘Hysterectomy’ is the most effective
intervention according to mean rank.
Table 11.4.b Ranking probabilities and mean ranks for intervention effectiveness in heavy menstrual
bleeding. Lower mean rank values indicate interventions associated with less patient dissatisfaction.
Each column corresponds to one of the four interventions; hysterectomy appears in the first column

Rank 1       96%    1%     4%     0%
Rank 2       4%     46%    40%    9%
Rank 3       0%     46%    19%    35%
Rank 4       0%     7%     37%    56%
Mean rank    1      3      3      4
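The mean ranks in Table 11.4.b can be reproduced from the ranking probabilities: the mean rank of an intervention is the sum, over all possible ranks, of the rank multiplied by its estimated probability. A short R sketch using the probabilities in the table:

    # Recomputing the mean ranks in Table 11.4.b from the ranking probabilities.
    # Rows are ranks 1 to 4; columns are the four interventions, with
    # hysterectomy in the first column.
    probs <- matrix(c(0.96, 0.01, 0.04, 0.00,
                      0.04, 0.46, 0.40, 0.09,
                      0.00, 0.46, 0.19, 0.35,
                      0.00, 0.07, 0.37, 0.56),
                    nrow = 4, byrow = TRUE)
    mean_ranks <- colSums(probs * 1:4)   # mean rank = sum of rank x probability
    round(mean_ranks, 2)                 # about 1.0, 2.6, 2.9, 3.5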
IF measures the level of disagreement between the direct and indirect effect estimates.
The variance of the incoherence factor is obtained from

Var(IF) = Var[direct MD(B vs C)] + Var[indirect MD(B vs C)]

and its square root, the standard error SE(IF), can be used to construct a 95% confidence interval for the IF:

IF ± 1.96 × SE(IF)
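As a minimal sketch, assuming hypothetical direct and indirect mean differences for the B versus C comparison, the incoherence factor, its confidence interval and a z-test can be computed in R as follows:

    # Local incoherence check for a single loop, using hypothetical direct and
    # indirect estimates of the B vs C comparison.
    md_dir <- -1.1; se_dir <- 0.45
    md_ind <- -2.4; se_ind <- 0.78

    IF    <- md_dir - md_ind                # incoherence factor
    se_IF <- sqrt(se_dir^2 + se_ind^2)

    ci <- IF + c(-1, 1) * 1.96 * se_IF      # 95% confidence interval for the IF
    p  <- 2 * pnorm(-abs(IF / se_IF))       # z-test for incoherence
    round(c(IF = IF, lower = ci[1], upper = ci[2], P = p), 3)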
Several approaches have been suggested for evaluating incoherence in a network of
interventions with many loops (Donegan et al 2013, Veroniki et al 2013), broadly cate-
gorized as local and global approaches. Local approaches evaluate regions of the network
separately to detect possible ‘incoherence spots’, whereas global approaches evaluate
coherence in the entire network.
Table 11.4.c Results based on the SIDE approach to evaluating local incoherence. P values less than
0.05 suggest statistically significant incoherence
Table 11.5.a Steps to obtain the overall confidence ratings (across all GRADE domains) for every combined comparison of the dietary advice example. A ✓ or x
indicates whether a particular step is needed in order to proceed to the next step
In the approach of Salanti and colleagues, percentage weights quantify how much each direct
comparison contributes to every network comparison estimate and can be interpreted as the
contributions of the direct comparison estimates. Then, the confidence in an indirect or
combined comparison is estimated by combining the confidence assessment for the available
direct comparison estimates with their contribution to the combined (or network) comparison. This
approach is similar to the process of evaluating the likely impact of a high risk-of-bias
study by looking at its weight in a pair-wise meta-analysis to decide whether to down-
grade or not in a standard GRADE assessment.
As an example, in the dietary advice network (Figure 11.2.a) suppose that most of the
evidence involved in the indirect comparison (i.e. the trials including dietitians) is at low
risk of bias, and that there are studies of ‘doctor versus nurse’ that are mostly at high risk
of bias. If the direct evidence on ‘doctor versus nurse’ has a very large contribution to the
network meta-analysis estimate of the same comparison, then we would judge this
result to be at high risk of bias. If the direct evidence has a very low contribution, we might
judge the result to be at moderate, or possibly low, risk of bias. This approach might be
preferable when there are indirect or mixed comparisons informed by many loops within
a network, and for a specific comparison these loops lead to different risk-of-bias assess-
ments. The contributions of the direct comparisons and the risk-of-bias assessments
may be presented jointly in a bar graph, with bars proportional to the contributions
of direct comparisons and different colours representing the different judgements.
The bar graph for the heavy menstrual bleeding example is available in Figure 11.5.a,
Figure 11.5.a Bar graph illustrating the percentage of information for every comparison that comes
from low (dark grey), moderate (light grey) or high (blue) risk-of-bias (RoB) studies with respect to both
randomization and compliance to treatment for the heavy menstrual bleeding network (Middleton
et al 2010). The risk of bias of the direct comparisons was defined based on Appendix 3 of the original
paper. The intervention labels are: A, first generation hysteroscopic techniques; B, hysterectomy; C,
second generation non-hysteroscopic techniques; D, Mirena. Reproduced with permission of BMJ
Publishing Group
which suggests that there are two comparisons (‘First generation hysteroscopic techni-
ques versus Mirena’ and ‘Second generation non-hysteroscopic techniques versus Mir-
ena’) for which a substantial amount of information comes from studies at high risk
of bias.
Regardless of whether a review contains a network meta-analysis or a simple indirect
comparison, Puhan and colleagues propose to focus on so-called ‘most influential’
loops only. These are the connections between a pair of interventions of interest that
involve exactly one common comparator. This implies that the assessment for the
indirect comparison is dependent only on confidence in the two other direct compar-
isons in this loop. To illustrate, consider the dietary advice network described in
Section 11.2 (Figure 11.2.a), where we are interested in confidence in the evidence
for the indirect comparison ‘doctor versus nurse’. According to Puhan and colleagues,
the lower confidence rating between the two direct comparisons ‘dietitian versus doc-
tor’ and ‘dietitian versus nurse’ would be chosen to inform the confidence rating for the
indirect comparison. If there are also studies directly comparing doctor versus nurse,
the confidence in the combined comparison would be that of the higher rated of the two sources,
the direct evidence and the indirect evidence. The main rationale for this is that, in gen-
eral, the higher rated comparison is expected to be the more precise (and thus the
dominating) body of evidence. Also, in the absence of important incoherence, the lower
rated evidence is only supportive of the higher rated evidence; thus it is not very likely
to reduce the confidence in the estimated intervention effects. One disadvantage of this
approach is that investigators need to identify the most influential loop; this loop might
be relatively uninfluential when there are many loops in a network, which is often the
case when there are many interventions. In large networks, many loops with compa-
rable influence may exist and it is not clear how many of those equally influential loops
should be considered under this approach.
At the time of writing, no formal comparison has been performed to evaluate the
degree of agreement between these two methods. Thus, at this point we do not pre-
scribe using one approach or the other. However, when indirect comparisons are built
on existing pair-wise meta-analyses, which have already been rated with respect to
their confidence, it may be reasonable to follow the approach of Puhan and colleagues.
On the other hand, when the body of evidence is built from scratch, or when a large
number of interventions are involved, it may be preferable to consider the approach
of Salanti and colleagues, whose application is facilitated by the online tool CINeMA.
Since network meta-analysis produces estimates for several intervention effects, the
confidence in the evidence should be assessed for each intervention effect that is
reported in the results. In addition, network meta-analysis may also provide informa-
tion on the relative ranking of interventions, and review authors should consider also
assessing confidence in results for relative ranking when these are reported. Salanti
and colleagues address confidence in the ranking based on the contributions of the
direct comparisons to the entire network as well as on the use of measures and graphs
that aim to assess the different GRADE domains in the network as a whole (e.g. mea-
sures of global incoherence) (see Section 11.4.4).
The two approaches modify the standard GRADE domains to fit network meta-
analysis to varying degrees. These modifications are briefly described in Box 11.5.a;
more details and examples are available in the original articles (Puhan et al 2014,
Salanti et al 2014).
Box 11.5.a Modifications to the five domains of the standard GRADE system to fit
network meta-analysis.
Study limitations (i.e. classical risk-of-bias items) Salanti and colleagues suggest a bar
graph with bars proportional to the contributions of direct comparisons and different
colours representing the different confidence ratings (e.g. green, yellow, red for low,
moderate or high risk of bias) with respect to study limitations (Figure 11.5.a). The deci-
sion about downgrading or not is then formed by interpreting this graph. Such a graph
can be used to rate the confidence of evidence for each combined comparison and for
the relative ranking.
Indirectness The assessment of indirectness in the context of network meta-analysis
should consider two components: the similarity of the studies in the analysis to the tar-
get question (PICO); and the similarity of the studies in the analysis to each other. The
first addresses the extent to which the evidence at hand relates to the population,
intervention(s), comparators and outcomes of interest, and the second relates to the
evaluation of the transitivity assumption. A common view of the two approaches is that
they do not support the idea of downgrading indirect evidence by default. They suggest
that indirectness should be considered in conjunction with the risk of intransitivity.
Inconsistency Salanti and colleagues propose to create a common domain to consider
jointly both types of inconsistency that may occur: heterogeneity within direct comparisons
and incoherence. More specifically, they evaluate separately the presence of the two types
of variation and then consider them jointly to infer whether downgrading for inconsistency
is appropriate or not. It is usual in network meta-analysis to assume a common heteroge-
neity variance. They propose the use of prediction intervals to facilitate the assessment of
heterogeneity for each combined comparison. Prediction intervals are the intervals
expected to include the true intervention effects in future studies (Higgins et al 2009, Riley
et al 2011) and they incorporate the extent of between-study variation; in the presence of
important heterogeneity they are wide enough to include intervention effects with different
implications for practice. The potential for incoherence for a particular comparison can be
assessed using existing approaches for evaluating local and global incoherence (see
Section 11.4). We may downgrade by one or two levels due to the presence of heterogeneity
or incoherence, or both. The judgement for the relative ranking is based on the magnitude of
the common heterogeneity as well as the use of global incoherence tests (see Section 11.4).
Imprecision Both approaches suggest that imprecision of the combined comparisons
can be judged based on their 95% confidence intervals. Imprecision for relative treat-
ment ranking is the variability in the relative order of the interventions. This is reflected
by the overlap in the distributions of the ranking probabilities; i.e. when all or some of
the interventions have similar probabilities of being at a particular rank.
Publication bias The potential for publication bias in a network meta-analysis can be
difficult to judge. If a natural common comparator exists, a ‘comparison-adjusted funnel
plot’ can be employed to identify possible small-study effects in a network meta-analysis
(Chaimani and Salanti 2012, Chaimani et al 2013). This is a modified funnel plot that allows
all studies in the network to be plotted together, irrespective of the interventions they compare.
However, the primary considerations for both the combined comparisons and rel-
ative ranking should be non-statistical. Review authors should consider whether there
might be unpublished studies for every possible pair-wise comparison in the network.
11.6 Presenting network meta-analyses
Table 11.6.a Example of table presenting a network that compares seven interventions and placebo for controlling exacerbation of episodes in chronic
obstructive pulmonary disease (Baker et al 2009). Reproduced with permission of John Wiley & Sons
Number of studies Placebo Fluticasone Budesonide Salmeterol Formoterol Tiotropium Fluticasone + salmeterol Budesonide + formoterol
4 x x x x
4 x x
2 x x x x
2 x x x
2 x x x
8 x x
2 x x
10 x x
1 x x
1 x x
1 x x
1 x x
1 x x
(Lu and Ades 2006). Additional information, such as the number of participants in each
arm, may be presented in the non-empty cells.
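Network graphs such as Figure 11.4.a can be produced with the netmeta package for R (Rücker and Schwarzer 2013). The sketch below illustrates only the general workflow; the contrast-level data (log odds ratios and standard errors) are hypothetical and do not correspond to any of the worked examples.

    # Fitting a network meta-analysis and drawing the network graph with netmeta;
    # all data below are hypothetical.
    library(netmeta)

    dat <- data.frame(
      studlab = c("S1", "S2", "S3", "S4", "S5"),
      treat1  = c("A", "A", "A", "C", "A"),
      treat2  = c("B", "C", "D", "D", "C"),
      TE      = c(-0.80, -0.35, -0.10, 0.25, -0.42),  # log odds ratios
      seTE    = c(0.31, 0.24, 0.28, 0.35, 0.26)
    )

    net <- netmeta(TE, seTE, treat1, treat2, studlab, data = dat, sm = "OR")
    summary(net)    # network estimates for all pair-wise comparisons
    netgraph(net)   # graph of the interventions and their direct comparisons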
Columns (direct comparisons): A vs B, A vs C, A vs D, C vs D

Network meta-analysis
Mixed estimates
A–B                 100.0
Indirect estimates
B–C                 49.6     48.9     0.7      0.7
B–D                 38.5     23.0     15.5     23.0
Included studies    5        11       1        3
Figure 11.6.b Contribution matrix for the network on interventions for heavy menstrual bleeding
presented in Figure 11.4.a. Four direct comparisons in the network are presented in the columns, and
their contributions to the combined treatment effect are presented in the rows. The entries of the
matrix are the percentage weights attributed to each direct comparison. The intervention labels are: A,
first generation hysteroscopic techniques; B, hysterectomy; C, second generation non-hysteroscopic
techniques; D, Mirena
The direct comparison of first generation hysteroscopic techniques versus hysterectomy (A versus B) has the largest contribution to the indirect
comparisons hysterectomy versus second generation non-hysteroscopic techniques
(B versus C) (49.6%) and hysterectomy versus Mirena (B versus D) (38.5%), for both
of which no direct evidence exists.
Figure 11.6.c Forest plot for effectiveness in heavy menstrual bleeding between four interventions. FGHT,
first generation hysteroscopic techniques; SGNHT, second generation non-hysteroscopic techniques
In the presence of many competing interventions, the results across different out-
comes (e.g. efficacy and acceptability) might conflict with respect to which interven-
tions work best. To avoid drawing misleading conclusions, review authors may
consider the simultaneous presentation of results for both categories of outcomes.
Interpretation of the findings from network meta-analysis should always take account
of the characteristics of the evidence: risk of bias in included studies, heterogeneity,
incoherence and selection bias. Reporting results with respect to the evaluation of inco-
herence and heterogeneity (such as I2 statistic for incoherence) is important for draw-
ing meaningful conclusions.
Figure 11.6.d Ranking probabilities (rankograms) for the effectiveness of interventions in heavy menstrual bleeding. The horizontal axis shows the possible
ranks and the vertical axis the ranking probabilities. Each line connects the estimated probabilities of being at a particular rank for every intervention
11.7 Concluding remarks

When the assumptions underlying the methods hold, network meta-analysis combines direct and
indirect evidence and provides more precise estimates of relative effect than a single direct or indirect esti-
mate. Network meta-analysis can yield estimates between any pairs of interventions,
including those that have never been compared directly against each other. Network
meta-analysis also allows the estimation of the ranking and hierarchy of interventions.
Much care should be taken when interpreting the results and drawing conclusions from
network meta-analysis, especially in the presence of incoherence or other potential
biases.
Funding: This work was supported by the Methods Innovation Fund Program of the
Cochrane Collaboration (MIF1) under the project ‘Methods for comparing multiple
interventions in Intervention reviews and Overviews of reviews’.
11.9 References
Ades AE, Caldwell DM, Reken S, Welton NJ, Sutton AJ, Dias S. Evidence synthesis for decision
making 7: a reviewer’s checklist. Medical Decision Making 2013; 33: 679–691.
Baker WL, Baker EL, Coleman CI. Pharmacologic treatments for chronic obstructive
pulmonary disease: a mixed-treatment comparison meta-analysis. Pharmacotherapy
2009; 29: 891–905.
Bucher HC, Guyatt GH, Griffith LE, Walter SD. The results of direct and indirect treatment
comparisons in meta-analysis of randomized controlled trials. Journal of Clinical
Epidemiology 1997; 50: 683–691.
Caldwell DM, Ades AE, Higgins JPT. Simultaneous comparison of multiple treatments:
combining direct and indirect evidence. BMJ 2005; 331: 897–900.
Caldwell DM, Welton NJ, Ades AE. Mixed treatment comparison analysis provides internally
coherent treatment effect estimates based on overviews of reviews and can reveal
inconsistency. Journal of Clinical Epidemiology 2010; 63: 875–882.
Caldwell DM, Dias S, Welton NJ. Extending treatment networks in health technology
assessment: how far should we go? Value in Health 2015; 18: 673–681.
Chaimani A, Salanti G. Using network meta-analysis to evaluate the existence of small-study
effects in a network of interventions. Research Synthesis Methods 2012; 3: 161–176.
Chaimani A, Higgins JPT, Mavridis D, Spyridonos P, Salanti G. Graphical tools for network
meta-analysis in STATA. PloS One 2013; 8: e76654.
Owen RK, Tincello DG, Keith RA. Network meta-analysis: development of a three-level
hierarchical modeling approach incorporating dose-related constraints. Value in Health
2015; 18: 116–126.
Petropoulou M, Nikolakopoulou A, Veroniki AA, Rios P, Vafaei A, Zarin W, Giannatsi M,
Sullivan S, Tricco AC, Chaimani A, Egger M, Salanti G. Bibliographic study showed
improving statistical methodology of network meta-analyses published between 1999
and 2015. Journal of Clinical Epidemiology 2016; 82: 20–28.
Puhan MA, Schünemann HJ, Murad MH, Li T, Brignardello-Petersen R, Singh JA, Kessels
AG, Guyatt GH; GRADE Working Group. A GRADE Working Group approach for rating
the quality of treatment effect estimates from network meta-analysis. BMJ 2014;
349: g5630.
Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011;
342: d549.
Rücker G. Network meta-analysis, electrical networks and graph theory. Research Synthesis
Methods 2012; 3: 312–324.
Rücker G, Schwarzer G. netmeta: an R package for network meta-analysis 2013. http://www.r-project.org.
Rücker G, Schwarzer G. Ranking treatments in frequentist network meta-analysis works
without resampling methods. BMC Medical Research Methodology 2015; 15: 58.
Salanti G, Higgins JPT, Ades AE, Ioannidis JPA. Evaluation of networks of randomized trials.
Statistical Methods in Medical Research 2008; 17: 279–301.
Salanti G, Marinho V, Higgins JPT. A case study of multiple-treatments meta-analysis
demonstrates that covariates should be considered. Journal of Clinical Epidemiology
2009; 62: 857–864.
Salanti G, Ades AE, Ioannidis JPA. Graphical methods and numerical summaries for
presenting results from multiple-treatment meta-analysis: an overview and tutorial.
Journal of Clinical Epidemiology 2011; 64: 163–171.
Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments
meta-analysis: many names, many benefits, many concerns for the next generation
evidence synthesis tool. Research Synthesis Methods 2012; 3: 80–97.
Salanti G, Del Giovane C, Chaimani A, Caldwell DM, Higgins JPT. Evaluating the quality of
evidence from a network meta-analysis. PloS One 2014; 9: e99682.
Schmitz S, Adams R, Walsh C. Incorporating data from various trial designs into a mixed
treatment comparison model. Statistics in Medicine 2013; 32: 2935–2949.
Schwarzer G, Carpenter JR, Rücker G. Meta-analysis with R. Cham: Springer; 2015.
Soares MO, Dumville JC, Ades AE, Welton NJ. Treatment comparisons for decision making:
facing the problems of sparse and few data. Journal of the Royal Statistical Society Series
A (Statistics in Society) 2014; 177: 259–279.
Sobieraj DM, Cappelleri JC, Baker WL, Phung OJ, White CM, Coleman CI. Methods used to
conduct and report Bayesian mixed treatment comparisons published in the medical
literature: a systematic review. BMJ Open 2013; 3: pii.
Song F, Altman DG, Glenny AM, Deeks JJ. Validity of indirect comparison for estimating
efficacy of competing interventions: empirical evidence from published meta-analyses.
BMJ 2003; 326: 472.
Song F, Xiong T, Parekh-Bhurke S, Loke YK, Sutton AJ, Eastwood AJ, Holland R, Chen YF,
Glenny AM, Deeks JJ, Altman DG. Inconsistency between direct and indirect comparisons
of competing interventions: meta-epidemiological study. BMJ 2011; 343: d4909.
12 Synthesizing and presenting findings using other methods
Joanne E McKenzie, Sue E Brennan
KEY POINTS
• Meta-analysis of effect estimates has many advantages, but other synthesis methods may need to be considered in the circumstance where there is incompletely reported data in the primary studies.
• Alternative synthesis methods differ in the completeness of the data they require, the hypotheses they address, and the conclusions and recommendations that can be drawn from their findings.
• These methods provide more limited information for healthcare decision making than meta-analysis, but may be superior to a narrative description where some results are privileged above others without appropriate justification.
• Tabulation and visual display of the results should always be presented alongside any synthesis, and are especially important for transparent reporting in reviews without meta-analysis.
• Alternative synthesis and visual display methods should be planned and specified in the protocol. When writing the review, details of the synthesis methods should be described.
• Synthesis methods that involve vote counting based on statistical significance have serious limitations and are unacceptable.
This chapter should be cited as: McKenzie JE, Brennan SE. Chapter 12: Synthesizing and presenting findings
using other methods. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors).
Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons,
2019: 321–348.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Table 12.1.a Scenarios that may preclude meta-analysis, with possible solutions*

Limited evidence for a pre-specified comparison
Scenario: Meta-analysis is not possible with no studies, or only one study. This circumstance may reflect the infancy of research in a particular area, or that the specified PICO for the synthesis aims to address a narrow question.
Possible solutions: Build contingencies into the analysis plan to group one or more of the PICO elements at a broader level (Chapter 2, Section 2.5.3).

Incompletely reported outcome or effect estimate
Scenario: Within a study, the intervention effects may be incompletely reported (e.g. effect estimate with no measure of precision; direction of effect with P value or statement of statistical significance; only the direction of effect).
Possible solutions: Calculate the effect estimate and measure of precision from the available statistics if possible (Chapter 6). Impute missing statistics (e.g. standard deviations) where possible (Chapter 6, Section 6.4.2). Use other synthesis method(s) (Section 12.2), along with methods to display and present available effects visually (Section 12.3).

Different effect measures
Scenario: Across studies, the same outcome could be treated differently (e.g. a time-to-event outcome has been dichotomized in some studies) or analysed using different methods. Both scenarios could lead to different effect measures (e.g. hazard ratios and odds ratios).
Possible solutions: Calculate the effect estimate and measure of precision for the same effect measure from the available statistics if possible (Chapter 6). Transform effect measures (e.g. convert standardized mean difference to an odds ratio) where possible (Chapter 10, Section 10.6). Use other synthesis method(s) (Section 12.2), along with methods to display and present available effects visually (Section 12.3).

Bias in the evidence
Scenario: Concerns about missing studies, missing outcomes within the studies (Chapter 13), or bias in the studies (Chapters 8 and 25), are legitimate reasons for not undertaking a meta-analysis. These concerns similarly apply to other synthesis methods (Section 12.2). Incompletely reported outcomes/effects may bias meta-analyses, but not necessarily other synthesis methods.
Possible solutions: When there are major concerns about bias in the evidence, use structured reporting of the available effects using tables and visual displays (Section 12.3). For incompletely reported outcomes/effects, also consider other synthesis methods in addition to meta-analysis (Section 12.2).

Clinical and methodological diversity
Scenario: Concerns about diversity in the populations, interventions, outcomes and study designs are often cited reasons for not using meta-analysis (Ioannidis et al 2008). Arguments against using meta-analysis because of too much diversity equally apply to the other synthesis methods (Valentine et al 2010).
Possible solutions: Modify planned comparisons, providing rationale for post-hoc changes (Chapter 9).

Statistical heterogeneity
Scenario: Statistical heterogeneity is an often cited reason for not reporting the meta-analysis result (Ioannidis et al 2008). Presentation of an average combined effect in this circumstance can be misleading, particularly if the estimated effects across the studies are both harmful and beneficial.
Possible solutions: Attempt to reduce heterogeneity (e.g. checking the data, correcting an inappropriate choice of effect measure) (Chapter 10, Section 10.10). Attempt to explain heterogeneity (e.g. using subgroup analysis) (Chapter 10, Section 10.11). Consider (if possible) presenting a prediction interval, which provides a predicted range for the true intervention effect in an individual study (Riley et al 2011), thus clearly demonstrating the uncertainty in the intervention effects.

* Italicized text indicates possible solutions discussed in this chapter.
12.2 Statistical synthesis
Table 12.2.a Summary of statistical synthesis methods: questions addressed, minimum data required, purposes and limitations

Preferable

Meta-analysis of effect estimates and extensions (Chapters 10 and 11)
Questions answered: What is the common intervention effect? What is the average intervention effect? Which intervention, of multiple, is most effective? What factors modify the magnitude of the intervention effects?
Minimum data required: estimate of effect and its variance.
Purpose: Can be used to synthesize results when effect estimates and their variances are reported (or can be calculated). Provides a combined estimate of average intervention effect (random effects), and precision of this estimate (95% CI). Can be used to synthesize evidence from multiple interventions, with the ability to rank them (network meta-analysis). Can be used to detect, quantify and investigate heterogeneity (meta-regression/subgroup analysis). Associated plots: forest plot, funnel plot, network diagram, rankogram plot.
Limitations: Requires effect estimates and their variances. Extensions (network meta-analysis, meta-regression/subgroup analysis) require a reasonably large number of studies. Meta-regression/subgroup analysis involves observational comparisons and requires careful interpretation; high risk of false positive conclusions for sources of heterogeneity. Network meta-analysis is more complicated to undertake and requires careful assessment of the assumptions.

Acceptable

Summarizing effect estimates
Question answered: What is the range and distribution of observed effects?
Minimum data required: estimate of effect.
Purpose: Can be used to synthesize results when it is difficult to undertake a meta-analysis (e.g. missing variances of effects, unit of analysis errors).
Limitations: Does not account for differences in the relative sizes of the studies. Performance of these statistics applied in the context of summarizing effect estimates has not been evaluated.

Combining P values
Question answered: Is there evidence that there is an effect in at least one study?
Minimum data required: precise P value and direction of effect.
Purpose: Can be used to synthesize results when, for example: only P values and direction of effect are reported; results come from non-parametric analyses; results are from different types of outcomes and statistical tests; or outcomes are different across studies (e.g. different serious side effects). Associated plot: albatross plot.
Limitations: Provides no information on the magnitude of effect, and does not differentiate between evidence from large studies with small effects and small studies with large effects. Difficult to interpret the test results when statistically significant, since the null hypothesis can be rejected on the basis of an effect in only one study (Jones 1995). When combining P values from few, small studies, failure to reject the null hypotheses should not be interpreted as evidence of no effect in all studies.

Vote counting based on direction of effect
Question answered: Is there any evidence of an effect?
Minimum data required: direction of effect.
Purpose: Can be used to synthesize results when only direction of effect is reported, or there is inconsistency in the effect measures or data reported across studies. Associated plots: harvest plot, effect direction plot.
Limitations: Provides no information on the magnitude of effects (Borenstein et al 2009). Does not account for differences in the relative sizes of the studies (Borenstein et al 2009). Less powerful than methods used to combine P values.
Reporting of methods and results The statistics that will be used to summarize the effects
(e.g. median, interquartile range) should be reported. Box-and-whisker or bubble plots
will complement reporting of the summary statistics by providing a visual display of the
distribution of observed effects (Section 12.3.3). Tabulation of the available effect esti-
mates will provide transparency for readers by linking the effects to the studies
(Section 12.3.1). Limitations of the method should be acknowledged (Table 12.2.a).
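As a minimal sketch of this approach, assuming a set of hypothetical odds ratios extracted from the included studies, the summary statistics and a box-and-whisker plot can be produced in R (calculations are done on the log scale and back-transformed):

    # Summarizing effect estimates when meta-analysis is not possible;
    # the odds ratios below are hypothetical.
    or     <- c(1.32, 0.98, 1.51, 1.06, 1.78, 0.87, 1.24)
    log_or <- log(or)

    exp(median(log_or))                    # median odds ratio
    exp(quantile(log_or, c(0.25, 0.75)))   # interquartile range

    # Box-and-whisker plot of the distribution of observed effects
    boxplot(log_or, horizontal = TRUE, xlab = "Log odds ratio")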
One-sided P values are used, since these contain information about the direction of
effect. However, these P values must reflect the same directional hypothesis (e.g. all test-
ing if intervention A is more effective than intervention B). This is analogous to standar-
dizing the direction of effects before undertaking a meta-analysis. Two-sided P values,
which do not contain information about the direction, must first be converted to one-
sided P values. If the effect is consistent with the directional hypothesis (e.g. intervention
A is beneficial compared with B), then the one-sided P value is calculated as
P1-sided = P2-sided / 2

otherwise,

P1-sided = 1 − P2-sided / 2
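These conversions are straightforward to script. The R sketch below applies the two formulas to the two-sided P values of four studies from the worked example in Section 12.4 (Table 12.4.c); the indicator of whether each effect favours the intervention is taken from that table.

    # Converting two-sided to one-sided P values given the direction of effect
    # (TRUE = effect consistent with the directional hypothesis).
    p_two   <- c(0.135, 0.061, 0.368, 0.035)
    favours <- c(TRUE,  TRUE,  FALSE, TRUE)

    p_one <- ifelse(favours, p_two / 2, 1 - p_two / 2)
    round(p_one, 3)   # reproduces the one-sided P values in Table 12.4.c up to rounding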
In studies that do not report an exact P value but report a conventional level of signif-
icance (e.g. P < 0.05), a conservative option is to use the threshold (e.g. 0.05). The
P values must have been computed from statistical tests that appropriately account
for the features of the design, such as clustering or matching, otherwise they will likely
be incorrect.
The Chi2 statistic for Fisher's method is computed as minus twice the sum of the natural
logarithms of the one-sided P values, and will follow a chi-squared distribution with 2k
degrees of freedom (where k is the number of studies) if there is no effect in any study.
A large Chi2 statistic compared to the degrees of freedom (with a corresponding low
P value) provides evidence of an effect in at least one study (see Section 12.4.2.2 for
guidance on implementing Fisher's method for combining P values).
Reporting of methods and results There are several methods for combining P values
(Loughin 2004), so the chosen method should be reported, along with details of sen-
sitivity analyses that examine if the results are sensitive to the choice of method. The
results from the test should be reported alongside any available effect estimates (either
individual results or meta-analysis results of a subset of studies) using text, tabulation
and appropriate visual displays (Section 12.3). The albatross plot is likely to comple-
ment the analysis (Section 12.3.4). Limitations of the method should be acknowledged
(Table 12.2.a).
Reporting of methods and results The vote counting method should be reported in the
‘Data synthesis’ section of the review. Failure to recognize vote counting as a synthesis
method has led to it being applied informally (and perhaps unintentionally) to
summarize results (e.g. through the use of wording such as ‘3 of 10 studies showed
improvement in the outcome with intervention compared to control’; ‘most studies
found’; ‘the majority of studies’; ‘few studies’ etc). In such instances, the method is
rarely reported, and it may not be possible to determine whether an unacceptable
(invalid) rule has been used to define benefit and harm (Section 12.2.2). The results
from vote counting should be reported alongside any available effect estimates (either
individual results or meta-analysis results of a subset of studies) using text, tabulation
and appropriate visual displays (Section 12.3). The number of studies contributing to a
synthesis based on vote counting may be larger than a meta-analysis, because
only minimal statistical information (i.e. direction of effect) is required from each
study to vote count. Vote counting results are used to derive the harvest and effect
direction plots, although often using unacceptable methods of vote counting (see
Section 12.3.5). Limitations of the method should be acknowledged (Table 12.2.a).
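An acceptable form of vote counting treats the direction of effect in each study as a binary outcome and compares the proportion of effects favouring the intervention against the null value of 0.5 (see Bushman and Wang 2009). A minimal sketch in R, assuming hypothetical counts:

    # Vote counting based on direction of effect using a binomial (sign) test;
    # the counts below are hypothetical.
    n_favour <- 7    # studies whose observed effect favours the intervention
    n_total  <- 10   # studies contributing an unambiguous direction of effect

    binom.test(n_favour, n_total, p = 0.5)   # test and exact CI for the proportion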
Unacceptable vote counting methods often rest on idiosyncratic combinations of
elements, cut-offs and algorithms used to categorize effects, and while detailed
descriptions of the rules may provide a veneer of legitimacy, such rules have poor
validity (Ioannidis et al 2008).
A further problem occurs when the rules are not described in sufficient detail for the
results to be reproduced (e.g. ter Wee et al 2012, Thornicroft et al 2016). This lack of
transparency does not allow determination of whether an acceptable or unacceptable
vote counting method has been used (Valentine et al 2010).
If the purpose of the table is complete and transparent reporting of data, then order-
ing the studies to increase the prominence of the most relevant and trustworthy evi-
dence should be considered. Possibilities include:
Figure 12.4.a Possible graphical displays of different types of data. (A) Box-and-whisker plots of odds
ratios for all outcomes and separately by overall risk of bias. (B) Bubble plot of odds ratios for all outcomes
and separately by the model of care. The colours of the bubbles represent the overall risk of bias judgement
(dark grey = low risk of bias; light grey = some concerns; blue = high risk of bias). (C) Albatross plot of the
study sample size against P values (for the five continuous outcomes in Table 12.4.c, column 6). The effect
contours represent standardized mean differences. (D) Harvest plot (height depicts overall risk of bias
judgement (tall = low risk of bias; medium = some concerns; short = high risk of bias), shading depicts model
of care (light grey = caseload; dark grey = team), alphabet characters represent the studies)
The size of the bubbles reflected the number of study participants. However, different
formulations of the bubble plot can display other characteristics of the data (e.g. pre-
cision, risk-of-bias assessments).
The albatross plot presents each study's sample size against its P value, where the results are separated by the direction of effect. Superimposed on the plot are
‘effect size contours’ (inspiring the plot’s name). These contours are specific to the type
of data (e.g. continuous, binary) and statistical methods used to calculate the P values.
The contours allow interpretation of the approximate effect sizes of the studies, which
would otherwise not be possible due to the limited reporting of the results. Character-
istics of studies (e.g. type of study design) can be identified using different colours or
symbols, allowing informal comparison of subgroups.
The plot is likely to be more inclusive of the available studies than meta-analysis,
because of its minimal data requirements. However, the plot should complement
the results from a statistical synthesis, ideally a meta-analysis of available effects.
The first scenario contrasts a common approach to tabulation with alternative presentations
that may enhance the transparency of reporting and interpretation of findings. Subsequent scenarios show the application of
the synthesis approaches outlined in preceding sections of the chapter. Box 12.4.a sum-
marizes the review comparisons and outcomes, and decisions taken by the review authors
in planning their synthesis. While the example is loosely based on an actual review, the
review description, scenarios and data are fabricated for illustration.
• Studies are ordered by study ID, rather than grouped by characteristics that might
enhance interpretation (e.g. risk of bias, study size, validity of the measures, certainty
of the evidence (GRADE)).
• Data reported are as extracted from each study; effect estimates were not calculated
by the review authors and, where reported, were not standardized across studies
(although data were available to do both).
Table 12.4.b shows an improved presentation of the same results. In line with best
practice, here effect estimates have been calculated by the review authors for all out-
comes, and a common metric computed to aid interpretation (in this case an odds
ratio; see Chapter 6 for guidance on conversion of statistics to the desired format).
Redundant information has been removed (‘statistical test’ and ‘P value’ columns).
The studies have been re-ordered, first to group outcomes by period of care (intrapar-
tum outcomes are shown here), and then by risk of bias. This re-ordering serves two
purposes. Grouping by period of care aligns with the plan to consider outcomes for
each period separately and ensures the table structure matches the order in which
results are described in the text. Re-ordering by risk of bias increases the prominence
of studies at lowest risk of bias, focusing attention on the results that should most influ-
ence conclusions. Had the review authors determined that a synthesis would be infor-
mative, then ordering to facilitate comparison across studies would be appropriate; for
example, ordering by the type of satisfaction outcome (as pre-defined in the protocol,
starting with global measures of satisfaction), or the comparisons made in the studies.
The results may also be presented in a forest plot, as shown in Figure 12.4.b. In both
the table and figure, studies are grouped by risk of bias to focus attention on the most
trustworthy evidence. The pattern of effects across studies is immediately apparent in
Figure 12.4.b and can be described efficiently without having to interpret each estimate
(e.g. difference between studies at low and high risk of bias emerge), although these
results should be interpreted with caution in the absence of a formal test for subgroup
differences (see Chapter 10, Section 10.11). Only outcomes measured during the intra-
partum period are displayed, although outcomes from other periods could be added,
maximizing the information conveyed.
An example description of the results from Scenario 1 is provided in Box 12.4.b. It
shows that describing results study by study becomes unwieldy with more than a
Table 12.4.a Scenario 1: table ordered by study ID, data as reported by study authors
Outcome (scale details*) | Intervention | Control | Effect estimate (metric) | 95% CI | Statistical test | P value
Perception of care: labour/birth 260/344 192/287 1.13 (RR) 1.02 to 1.25 z = 2.36 0.018
Experience of antenatal care (0 to 24 points) 21.0 (5.6) 182 19.7 (7.3) 186 1.3 (MD) –0.1 to 2.7 t = 1.88 0.061
Experience of labour/birth (0 to 18 points) 9.8 (3.1) 182 9.3 (3.3) 186 0.5 (MD) –0.2 to 1.2 t = 1.50 0.135
Experience of postpartum care (0 to 18 points) 11.7 (2.9) 182 10.9 (4.2) 186 0.8 (MD) 0.1 to 1.5 t = 2.12 0.035
Care from staff during labour 240/275 208/256 1.07 (RR) 1.00 to 1.16 z = 1.89 0.059
Frances 2000
Labour & Delivery Satisfaction Index (37 to 222 points) 182 (14.2) 101 185 (30) 93 t = –0.90 for MD 0.369 for MD
Satisfaction with intrapartum care 605/1163 363/826 8.1% (RD) 3.6 to 12.5 <0.001
Birth satisfaction 849/1163 496/826 13.0% (RD) 8.8 to 17.2 z = 6.04 0.000
Parr 2002
Rowley 1995
Zhang 2011 N N
Perception of antenatal care 359 322 1.23 (POR) 0.68 to 2.21 z = 0.69 0.490
Perception of care: labour/birth 355 320 1.10 (POR) 0.91 to 1.34 z = 0.95 0.341
* All scales operate in the same direction; higher scores indicate greater satisfaction.
CI = confidence interval; MD = mean difference; OR = odds ratio; POR = proportional odds ratio; RD = risk difference; RR = risk ratio.
Table 12.4.b Scenario 1: intrapartum outcome table ordered by risk of bias, standardized effect estimates calculated for all studies
Outcome* (scale details) Intervention Control Mean difference (95% CI)** Odds ratio (95% CI)†
Figure 12.4.b Forest plot depicting standardized effect estimates (odds ratios) for satisfaction (odds ratios to the right of 1 favour midwife-led models; to the left, other models)
few studies, highlighting the importance of tables and plots. It also brings into focus the
risk of presenting results without any synthesis, since it seems likely that the reader will
try to make sense of the results by drawing inferences across studies. Since a synthesis
was considered inappropriate, GRADE was applied to individual studies and then used
to prioritize the reporting of results, focusing attention on the most relevant and trust-
worthy evidence. An alternative might be to report results at low risk of bias, an
approach analogous to limiting a meta-analysis to studies at low risk of bias. Where
possible, these and other approaches to prioritizing (or ordering) results from individ-
ual studies in text and tables should be pre-specified at the protocol stage.
Box 12.4.b How to describe the results from this structured summary
Scenario 1. Structured reporting of effects (no synthesis)
Table 12.4.b and Figure 12.4.b present results for the 12 included studies that reported a
measure of maternal satisfaction with care during labour and birth (hereafter ‘satisfac-
tion’). Results from these studies were not synthesized for the reasons reported in the
data synthesis methods. Here, we summarize results from studies providing high or
moderate certainty evidence (based on GRADE) for which results from a valid measure
of global satisfaction were available. Barry 2015 found a small increase in satisfaction
with midwife-led care compared to obstetrician-led care (4 more women per 100 were
satisfied with care; 95% CI 4 fewer to 15 more per 100 women; 469 participants, 1 study;
moderate certainty evidence). Harvey 1996 found a small possibly unimportant decrease
in satisfaction with midwife-led care compared with obstetrician-led care (3-point reduc-
tion on a 185-point LADSI scale, higher scores are more satisfied; 95% CI 10 points lower
to 4 higher; 367 participants, 1 study; moderate certainty evidence). The remaining
10 studies reported specific aspects of satisfaction (Frances 2000, Rowley 1995, …), used
tools with little or no evidence of validity and reliability (Parr 2002, …) or provided low or
very low certainty evidence (Turnbull 1996, …).
Note: While it is tempting to make statements about consistency of effects across studies
(…the majority of studies showed improvement in …, X of Y studies found …), be aware
that this may contradict claims that a synthesis is inappropriate and constitute uninten-
tional vote counting.
For studies that reported multiple satisfaction outcomes, one result is selected for syn-
thesis using the decision rules in Box 12.4.a (point 2).
Table 12.4.c Scenarios 2, 3 and 4: available data for the selected outcome from each study
Columns: Study ID | Outcome (scale details*) | Overall RoB judgement | Scenario 2. Summary statistics: available data**, standardized metric OR (SMD) | Scenario 3. Combining P values: available data (2-sided P value), standardized metric (1-sided P value) | Scenario 4. Vote counting: available data, standardized metric
Crowe 2010 Expectation of Some Intervention 9.8 (3.1); 1.3 (0.16) Favours 0.068 NS —
labour/birth concerns Control 9.3 (3.3) intervention,
(0 to 18 points) P = 0.135,
N = 368
Finn 1997 Experience of Some Intervention 21 (5.6); 1.4 (0.20) Favours 0.030 MD 1.3, NS 1
labour/birth concerns Control 19.7 (7.3) intervention,
(0 to 24 points) P = 0.061,
N = 351
Harvey 1996 Labour & Delivery Some Intervention 182 0.8 (–0.13) MD –3, 0.816 MD –3, NS 0
Satisfaction Index concerns (14.2); Control 185 P = 0.368,
(37 to 222 points) (30) N = 194
Kidman 2007 Control during High Intervention 11.7 1.5 (0.22) MD 0.8, 0.017 MD 0.8 (95% CI 1
labour/birth (0 to (2.9); Control 10.9 P = 0.035, 0.1 to 1.5)
18 points) (4.2) N = 368
Turnbull 1996 Intrapartum care High Intervention 1.2 2.3 (0.45) MD 0.27, 0.036 MD 0.27 (95% 1
rating (–2 to 2 (0.57); Control 0.93 P = 0.072, CI 0.03 to 0.57)
points) (0.62) N = 65
Binary
Flint 1989 Care from staff High Intervention 240/275; 1.58 Favours 0.029 RR 1.07 (95% CI 1
during labour Control 208/256 intervention, 1.00 to 1.16)
P = 0.059
Frances 2000 Communication: Low OR 0.90 0.90 Favours 0.697 Favours control, 0
labour/birth control, NS
P = 0.606
Johns 2004 Satisfaction with Some Intervention 605/ 1.38 Favours 0.0005 RD 8.1% (95% CI 1
intrapartum care concerns 1163; Control 363/ intervention, 3.6% to 12.5%)
826 P < 0.001
Mac Vicar Birth satisfaction High OR 1.80, P < 0.001 1.80 Favours 0.0005 RD 13.0% (95% CI 1
1993 intervention, 8.8% to 17.2%)
P < 0.001
Parr 2002 Experience of Some OR 0.85 0.85 OR 0.85, 0.658 NS —
childbirth concerns P = 0.685
Rowley 1995 Encouraged to ask Low OR 1.02, NS 1.02 P = 0.685 — NS —
questions
Ordinal
Waldenstrom Perception of Low POR 1.23, P = 0.490 1.23 POR 1.23, 0.245 POR 1.23, NS 1
2001 intrapartum care P = 0.490
Zhang 2011 Perception of care: Low POR 1.10, P > 0.05 1.10 POR 1.1, 0.170 Favours 1
labour/birth P = 0.341 intervention
* All scales operate in the same direction. Higher scores indicate greater satisfaction.
** For a particular scenario, the ‘available data’ column indicates the data that were directly reported, or were calculated from the reported statistics, in terms of: effect
estimate, direction of effect, confidence interval, precise P value, or statement regarding statistical significance (either statistically significant, or not).
CI = confidence interval; direction = direction of effect reported or can be calculated; MD = mean difference; NS = not statistically significant; OR = odds ratio; RD = risk
difference; RoB = risk of bias; RR = risk ratio; sig. = statistically significant; SMD = standardized mean difference; Stand. = standardized.
For Fisher's method, the test statistic is

Chi2 = −2 × [ln(P1) + ln(P2) + … + ln(Pk)]

where Pi is the one-sided P value from study i, and k is the total number of P values. This
formula can be implemented using a standard spreadsheet package. The statistic is
then compared against the chi-squared distribution with 26 ( = 2 × 13) degrees of freedom
to obtain the P value. Using a Microsoft Excel spreadsheet, this can be obtained by
typing =CHIDIST(82.3, 26) into any cell. In Stata or R, the packages (both named) metap
could be used. These packages include a range of methods for combining P values.
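For readers working in R, the same computation can be scripted directly, as sketched below with hypothetical one-sided P values; the sumlog() function in the metap package implements the identical test.

    # Fisher's method for combining one-sided P values (hypothetical values).
    p <- c(0.068, 0.030, 0.816, 0.017, 0.029, 0.0005)
    k <- length(p)

    X2     <- -2 * sum(log(p))                        # Fisher's chi-squared statistic
    p_comb <- pchisq(X2, df = 2 * k, lower.tail = FALSE)
    round(c(X2 = X2, df = 2 * k, P = p_comb), 4)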
The combination of P values suggests there is strong evidence of benefit of midwife-
led models of care in at least one study (P < 0.001 from a Chi2 test, 13 studies). Restrict-
ing this analysis to those studies judged to be at an overall low risk of bias (sensitivity
analysis), there is no longer evidence to reject the null hypothesis of no benefit of mid-
wife-led model of care in any studies (P = 0.314, 3 studies). For the five studies reporting
continuous satisfaction outcomes, sufficient data (precise P value, direction, total sam-
ple size) are reported to construct an albatross plot (Figure 12.4.a, Panel C). The loca-
tion of the points relative to the standardized mean difference contours indicate that
the likely effects of the intervention in these studies are small.
An example description of the results from the synthesis is provided in Box 12.4.d.
12.5 Chapter information
We are grateful to the following for commenting helpfully on earlier drafts: Miranda
Cumpston, Jamie Hartmann-Boyce, Tianjing Li, Rebecca Ryan and Hilary Thomson.
12.6 References
Achana F, Hubbard S, Sutton A, Kendrick D, Cooper N. An exploration of synthesis methods
in public health evaluations of interventions concludes that the use of modern statistical
methods would be beneficial. Journal of Clinical Epidemiology 2014; 67: 376–390.
Becker BJ. Combining significance levels. In: Cooper H, Hedges LV, editors. A handbook of
research synthesis. New York (NY): Russell Sage; 1994. pp. 215–235.
Boonyasai RT, Windish DM, Chakraborti C, Feldman LS, Rubin HR, Bass EB. Effectiveness of
teaching quality improvement to clinicians: a systematic review. JAMA 2007; 298:
1023–1037.
Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Meta-Analysis methods based on
direction and p-values. Introduction to Meta-Analysis. Chichester (UK): John Wiley & Sons,
Ltd; 2009. pp. 325–330.
Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Statistical
Science 2001; 16: 101–117.
Bushman BJ, Wang MC. Vote-counting procedures in meta-analysis. In: Cooper H, Hedges
LV, Valentine JC, editors. Handbook of Research Synthesis and Meta-Analysis. 2nd ed. New
York (NY): Russell Sage Foundation; 2009. pp. 207–220.
Crowther M, Avenell A, MacLennan G, Mowatt G. A further use for the Harvest plot: a novel
method for the presentation of data synthesis. Research Synthesis Methods 2011; 2: 79–83.
Friedman L. Why vote-count reviews don’t count. Biological Psychiatry 2001; 49: 161–162.
Grimshaw J, McAuley LM, Bero LA, Grilli R, Oxman AD, Ramsay C, Vale L, Zwarenstein M.
Systematic reviews of the effectiveness of quality improvement strategies and
programmes. Quality and Safety in Health Care 2003; 12: 298–303.
Harrison S, Jones HE, Martin RM, Lewis SJ, Higgins JPT. The albatross plot: a novel graphical
tool for presenting results of diversely reported studies in a systematic review. Research
Synthesis Methods 2017; 8: 281–289.
Hedges L, Vevea J. Fixed- and random-effects models in meta-analysis. Psychological
Methods 1998; 3: 486–504.
Ioannidis JP, Patsopoulos NA, Rothstein HR. Reasons or excuses for avoiding meta-analysis
in forest plots. BMJ 2008; 336: 1413–1415.
Ivers N, Jamtvedt G, Flottorp S, Young JM, Odgaard-Jensen J, French SD, O’Brien MA,
Johansen M, Grimshaw J, Oxman AD. Audit and feedback: effects on professional practice
and healthcare outcomes. Cochrane Database of Systematic Reviews 2012; 6: CD000259.
Jones DR. Meta-analysis: weighing the evidence. Statistics in Medicine 1995; 14: 137–149.
Loughin TM. A systematic comparison of methods for combining p-values from independent
tests. Computational Statistics & Data Analysis 2004; 47: 467–485.
McGill R, Tukey JW, Larsen WA. Variations of box plots. The American Statistician 1978; 32:
12–16.
McKenzie JE, Brennan SE. Complex reviews: methods and considerations for summarising
and synthesising results in systematic reviews with complexity. Report to the Australian
National Health and Medical Research Council. 2014.
O’Brien MA, Rogers S, Jamtvedt G, Oxman AD, Odgaard-Jensen J, Kristoffersen DT,
Forsetlund L, Bainbridge D, Freemantle N, Davis DA, Haynes RB, Harvey EL. Educational
outreach visits: effects on professional practice and health care outcomes. Cochrane
Database of Systematic Reviews 2007; 4: CD000409.
Ogilvie D, Fayter D, Petticrew M, Sowden A, Thomas S, Whitehead M, Worthy G. The harvest
plot: a method for synthesising evidence about the differential effects of interventions.
BMC Medical Research Methodology 2008; 8: 8.
Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011;
342: d549.
Schriger DL, Sinha R, Schroter S, Liu PY, Altman DG. From submission to publication: a
retrospective review of the tables and figures in a cohort of randomized controlled trials
submitted to the British Medical Journal. Annals of Emergency Medicine 2006; 48: 750–756,
756 e751–721.
Schriger DL, Altman DG, Vetter JA, Heafner T, Moher D. Forest plots in reports of systematic
reviews: a cross-sectional study reviewing current practice. International Journal of
Epidemiology 2010; 39: 421–429.
ter Wee MM, Lems WF, Usan H, Gulpen A, Boonen A. The effect of biological agents on work
participation in rheumatoid arthritis patients: a systematic review. Annals of the
Rheumatic Diseases 2012; 71: 161–171.
Thomson HJ, Thomas S. The effect direction plot: visual display of non-standardised effects
across multiple outcome domains. Research Synthesis Methods 2013; 4: 95–101.
Thornicroft G, Mehta N, Clement S, Evans-Lacko S, Doherty M, Rose D, Koschorke M,
Shidhaye R, O’Reilly C, Henderson C. Evidence for effective interventions to reduce
mental-health-related stigma and discrimination. Lancet 2016; 387: 1123–1132.
Valentine JC, Pigott TD, Rothstein HR. How many studies do you need?: a primer on
statistical power for meta-analysis. Journal of Educational and Behavioral Statistics 2010;
35: 215–247.
347
13
Assessing risk of bias due to missing
results in a synthesis
Matthew J Page, Julian PT Higgins, Jonathan AC Sterne
KEY POINTS
• Systematic reviews seek to identify all research that meets the eligibility criteria. However, this goal can be compromised by ‘non-reporting bias’: when decisions about how, when or where to report results of eligible studies are influenced by the P value, magnitude or direction of the results.
• There is convincing evidence for several types of non-reporting bias, reinforcing the need for review authors to search all possible sources where study reports and results may be located. It may be necessary to consult multiple bibliographic databases, trials registers, manufacturers, regulators and study authors or sponsors.
• Regardless of whether an entire study report or a particular study result is unavailable selectively (e.g. because the P value, magnitude or direction of the results were considered unfavourable by the investigators), the same consequence can arise: risk of bias in a synthesis because available results differ systematically from missing results.
• Several approaches for assessing risk of bias due to missing results have been suggested. A thorough assessment of selective non-reporting or under-reporting of results in the studies identified is likely to be the most valuable. Because the number of identified studies that have results missing for a given synthesis is known, the impact of selective non-reporting or under-reporting of results can be quantified more easily than the impact of selective non-publication of an unknown number of studies.
• Funnel plots (and the tests used for examining funnel plot asymmetry) may help review authors to identify evidence of non-reporting biases in cases where protocols or trials register records were unavailable for most studies. However, they have well-documented limitations.
• When there is evidence of funnel plot asymmetry, non-reporting biases should be considered as only one of a number of possible explanations. In these circumstances, review authors should attempt to understand the source(s) of the asymmetry, and consider their implications in the light of any qualitative signals that raise a suspicion of additional missing results, and other sensitivity analyses.
This chapter should be cited as: Page MJ, Higgins JPT, Sterne JAC. Chapter 13: Assessing risk of bias due to
missing results in a synthesis. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA
(editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John
Wiley & Sons, 2019: 349–374.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
13.1 Introduction
Systematic reviews seek to identify all research that meets pre-specified eligibility cri-
teria. This goal can be compromised if decisions about how, when or where to report
results of eligible studies are influenced by the P value, magnitude or direction of the
study’s results. For example, ‘statistically significant’ results that suggest an interven-
tion works are more likely than ‘statistically non-significant’ results to be available,
available rapidly, available in high impact journals and cited by others, and hence more
easily identifiable for systematic reviews. The term ‘reporting bias’ has often been used
to describe this problem, but we prefer the term non-reporting bias.
Non-reporting biases lead to bias due to missing results in a systematic review.
Syntheses such as meta-analyses are at risk of bias due to missing results when results
of some eligible studies are unavailable because of the P value, magnitude or direction
of the results. Bias due to missing results differs from a related source of bias – bias in
selection of the reported result – where study authors select a result for reporting
from among multiple measurements or analyses, on the basis of the P value, magni-
tude or direction of the results. In such cases, the study result that is available for inclu-
sion in the synthesis is at risk of bias. Bias in selection of the reported result is described
in more detail in Chapter 7, and addressed in the RoB 2 tool (Chapter 8) and ROBINS-I
tool (Chapter 25).
Failure to consider the potential impact of non-reporting biases on the results of the
review can lead to the uptake of ineffective and harmful interventions in clinical prac-
tice. For example, when unreported results were included in a systematic review of
oseltamivir (Tamiflu) for influenza, the drug was not shown to reduce hospital admis-
sions, had unclear effects on pneumonia and other complications of influenza, and
increased the risk of harms such as nausea, vomiting and psychiatric adverse events.
These findings were different from synthesized results based only on published study
results (Jefferson et al 2014).
We structure the chapter as follows. We start by discussing approaches for avoiding or
minimizing bias due to missing results in systematic reviews in Section 13.2, and provide
guidance for assessing the risk of bias due to missing results in Section 13.3. For the pur-
pose of discussing these biases, ‘statistically significant’ (P < 0.05) results are sometimes
denoted as ‘positive’ results and ‘statistically non-significant’ or null results as ‘negative’
results. As explained in Chapter 15, Cochrane Review authors should not use any of these
labels when reporting their review findings, since they are based on arbitrary thresholds
and may not reflect the clinical or policy significance of the findings.
In this chapter, we use the term result to describe the combination of a point
estimate (such as a mean difference or risk ratio) and a measure of its precision (such
as a confidence interval) for a particular study outcome. We use the term outcome to
refer to an event (such as mortality or a reduction in pain). When fully defined, an out-
come for an individual participant includes the following elements: an outcome
domain; a specific measure; a specific metric; and a time point (Zarin et al 2011). An
example of a fully defined outcome is ‘a 50% change from baseline to eight weeks
on the Montgomery-Asberg Depression Rating Scale total score’. A corresponding result
for this outcome additionally requires a method of aggregation across individuals: here
it might be a risk ratio with 95% confidence interval, which estimates the between-
group difference in the proportion of people with the outcome.
13.2 Minimizing risk of bias due to missing results
Figure 13.2.a Results of meta-analyses of reboxetine versus placebo for acute treatment of major
depression, with or without unpublished data (data from Eyding et al 2010). Reproduced with
permission of BMJ Publishing Group.
When unpublished data were added, the summary estimate suggested that patients on
reboxetine were more than twice as likely to withdraw (Eyding et al 2010).
Cases such as this illustrate how bias in a meta-analysis can be reduced by the inclu-
sion of missing results. In other situations, the bias reduction may not be so dramatic.
Schmucker and colleagues reviewed five methodological studies examining the differ-
ence in summary effect estimates of 173 meta-analyses that included or omitted results
from sources other than journal articles (e.g. conference abstracts, theses, government
reports, regulatory websites) (Schmucker et al 2017). They found that the direction and
magnitude of the differences in summary estimates varied. While inclusion of unre-
ported results may not change summary estimates markedly in all cases, doing so often
leads to an increase in precision of the summary estimates (Schmucker et al 2017).
Guidance on searching for unpublished sources is included in Chapter 4 (Section 4.3).
Clinical study reports (CSRs) and other regulatory documents have been found to contain
more complete information about trial methods and results than corresponding journal
publications (Wieseler et al 2013, Maund et al 2014, Schroll et al 2016). A few systematic
reviews have found that conclusions about the benefits and harms of interventions changed
after regulatory data were included in the review (Turner et al 2008, Rodgers et al 2013,
Jefferson et al 2014).
CSRs and other regulatory documents have great potential for improving the cred-
ibility of systematic reviews of regulated interventions, but substantial resources are
needed to access them and disentangle the data within them (Schroll et al 2015, Doshi
and Jefferson 2016). Only limited guidance is currently available for review authors con-
sidering embarking on a review including regulatory data. Jefferson and colleagues
provide criteria for assessing whether to include regulatory data for a drug or biologic
in a systematic review (Jefferson et al 2018). The RIAT (Restoring Invisible and Aban-
doned Trials) Support Center website provides useful information, including a taxon-
omy of regulatory documents, a glossary of relevant terms, guidance on how to request
CSRs from regulators and contact information for making requests (Doshi et al 2018).
Also, Ladanie and colleagues provide guidance on how to access and use FDA Drug
Approval Packages for evidence syntheses (Ladanie et al 2018).
Reliance on trials registers to assemble an inception cohort may not be ideal in all
instances. Prospective registration of trials started to increase only after 2004, when the
International Committee of Medical Journal Editors announced that they would no
longer publish trials that were not registered at inception (De Angelis et al 2004).
For this reason, review authors are unlikely to identify any prospectively registered
trials of interventions that were investigated only prior to this time. Also, until quite
recently there have been fewer incentives to register prospectively trials of non-
regulated interventions (Dal-Ré et al 2015), and unless registration rates increase, sys-
tematic reviews of such interventions are unlikely to identify many prospectively regis-
tered trials.
Restricting a synthesis to an inception cohort therefore involves a trade-off between
bias, precision and applicability. For example, limiting inclusion to prospectively regis-
tered trials will avoid risk of bias due to missing results if no results are missing from a
meta-analysis selectively. However, the precision of the meta-analysis may be low if
there are only a few, small, prospectively registered trials. Also, the summary estimate
from the meta-analysis may have limited applicability to the review question if the
questions asked in the prospectively registered trials are narrower in scope than the
questions asked in unregistered or retrospectively registered trials. Therefore, as with
any synthesis, review authors will need to consider precision and applicability when
interpreting the synthesis findings (methods for doing so are covered in Chapters 14
and 15).
13.3 A framework for assessing risk of bias due to missing results

The framework involves the following steps:

1) Select syntheses to assess for risk of bias due to missing results (Section 13.3.1).
2) Define which results are eligible for inclusion in each synthesis (Section 13.3.2).
3) Record whether any of the studies identified are missing from each synthesis
because results known (or presumed) to have been generated by study investigators
are unavailable: the ‘known unknowns’ (Section 13.3.3).
4) Consider whether each synthesis is likely to be biased because of the missing results
in the studies identified (Section 13.3.4).
5) Consider whether results from additional studies are likely to be missing from each
synthesis: the ‘unknown unknowns’ (Section 13.3.5).
6) Reach an overall judgement about risk of bias due to missing results in each syn-
thesis (Section 13.3.6).
13.3.3 Recording whether any of the studies identified are missing from each
synthesis because results known (or presumed) to have been generated by
study investigators are unavailable: the ‘known unknowns’
Once eligible results have been defined for each synthesis, review authors can inves-
tigate the availability of such results for all studies identified. Key questions to consider
are as follows.
1) Are the particular results I am seeking unavailable for any study?
2) If so, are the results unavailable because of the P value, magnitude or direction of
the results?
Review authors should try to identify results that are completely or partially unavail-
able because of the P value, magnitude or direction of the results (selective non-
reporting or under-reporting of results, respectively). By completely unavailable, we
mean that no information is available to estimate an intervention effect or to make
any other inference (including a qualitative conclusion about the direction of effect)
in any of the sources identified or from the study authors/sponsors. By partially una-
vailable, we mean that some, but not all, of the information necessary to include a
result in a meta-analysis is available (e.g. study authors report only that results were
‘non-significant’ rather than providing summary statistics, or they provide a point esti-
mate without any measure of precision) (Chan et al 2004).
There are several ways to detect selective non-reporting or under-reporting of
results, although a thorough assessment is likely to be labour intensive. It is helpful
to start by assembling all sources of information obtained about each study (see
Chapter 4, Section 4.6.2). This may include the trial’s register record, protocol, statis-
tical analysis plan (SAP), reports of the results of the study (e.g. journal articles, CSRs) or
any information obtained directly from the study authors or sponsor. The more sources
of information sought, the more reliable the assessment is likely to be. Studies should
be assessed regardless of whether a report of the results is available. For example, in
some cases review authors may only know about a study because there is a registration
record of it in ClinicalTrials.gov. If a long time has passed since the study was com-
pleted, it is possible that the results are not available because the investigators con-
sidered them unworthy of dissemination. Ignoring this registered study with no
results available could lead to less concern about the risk of bias due to missing results
than is warranted.
If study plans are available (e.g. in a trials register, protocol or statistical analysis
plan), details of outcomes that were assessed can be compared with those for which
results are available. Suspicion is raised if results are unavailable for any outcomes that
were pre-specified in these sources. However, outcomes pre-specified in a trials register
may differ from the outcomes pre-specified in a trial protocol (Chan et al 2017), and the
latest version of a trials register record may differ from the initial version. Such differ-
ences may be explained by legitimate, yet undeclared, changes to the study plans: pre-
specification of an outcome does not guarantee it was actually assessed. Further infor-
mation should be sought from study authors or sponsors to resolve any unexplained
discrepancies between sources.
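Where review authors track these comparisons electronically, the core check reduces to a set difference. The snippet below is a toy sketch with hypothetical outcome names, not a tool from this chapter:

```python
# Toy sketch: compare outcomes pre-specified in a trials register record
# with outcomes for which results are available in any identified source.
# All outcome names here are hypothetical.
prespecified = {"pain (>=30% reduction)", "fatigue", "depression", "adverse events"}
with_results = {"pain (>=30% reduction)", "fatigue", "adverse events"}

unreported = prespecified - with_results   # outcomes to query with authors/sponsors
print(sorted(unreported))                  # -> ['depression']
```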
If no study plans are available, then other approaches can be used to uncover missing
results. Abstracts of presentations about the study may contain information about out-
comes not subsequently mentioned in publications, or the methods section of a pub-
lished article may list outcomes not subsequently mentioned in the results section.
Missing information that seems certain to have been recorded is of particular inter-
est. For example, some measurements, such as systolic and diastolic blood pressure,
are expected to appear together, so that if only one is reported we should wonder why.
Williamson and Gamble give several examples, including a Cochrane Review in which
all nine trials reported the outcome ‘treatment failure’ but only five reported mortality
(Williamson and Gamble 2005). Since mortality was part of the definition of treatment
failure, those data must have been collected in the other four trials. Searches of the
Core Outcome Measures in Effectiveness Trials (COMET) database can help review
authors identify core sets of outcomes that are expected to have been measured in
all trials of particular conditions (Williamson and Clarke 2012), although review authors
should keep in mind that trials conducted before the publication of a relevant core out-
come set are less likely to have measured the relevant outcomes, and adoption of core
outcome sets may not be complete even after they have been published.
If the particular results that review authors seek are not reported in any of the sources
identified (e.g. journal article, trials results register, CSR), review authors should consider
requesting the required result from the study authors or sponsors. Authors or sponsors
may be able to calculate the result for the review authors or send the individual partic-
ipant data for review authors to analyse themselves. Failure to obtain the results
requested should be acknowledged when discussing the limitations of the review proc-
ess. In some cases, review authors might be able to compute or impute missing details
(e.g. imputing standard deviations; see Chapter 6, Section 6.5.2).
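For instance, two of the simplest such repairs, shown below as a sketch of standard algebraic identities rather than the full Chapter 6 guidance, recover a standard error from a reported 95% confidence interval, and a standard deviation from the standard error of a single group's mean:

```python
import math

def se_from_95ci(lower: float, upper: float) -> float:
    """Standard error of an estimate, from its 95% confidence interval."""
    return (upper - lower) / (2 * 1.96)

def sd_from_se_of_mean(se: float, n: int) -> float:
    """Standard deviation of a single group, from the SE of its mean."""
    return se * math.sqrt(n)

se = se_from_95ci(3.2, 7.8)                  # hypothetical group mean, 95% CI 3.2 to 7.8
print(round(sd_from_se_of_mean(se, 40), 2))  # SD for a group of n = 40 -> 7.42
```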
Once review authors have identified that a study result is unavailable, they must
consider whether this is because of the P value, magnitude or direction of the result.
The Outcome Reporting Bias In Trials (ORBIT) system for classifying reasons for missing
results (Kirkham et al 2018) can be used to do this. Examples of scenarios where it may
be reasonable to assume that a result is unavailable for reasons unrelated to the P value,
magnitude or direction of the result include:
• it is clear (or very likely) that the outcome of interest was not measured in the study
(based on examination of the study protocol or SAP, or correspondence with the
authors/sponsors);
• the instrument or equipment needed to measure the outcome of interest was not
available at the time the study was conducted; and
• the outcome of interest was measured but data were not analysed owing to a fault in
the measurement instrument, or substantial missing data.
Examples of scenarios where it may be reasonable to suspect that a result is missing
because of the P value, magnitude or direction of the result include:
• study authors claimed to have measured the outcome, but no results were available
and no explanation for this is provided;
• the between-group difference for the result of interest was reported only as being ‘non-
significant’, whereas summary statistics (e.g. means and standard deviations) per
intervention group were reported for other outcomes in the study for which the
difference was statistically significant;
• results are missing for an outcome that tends to be measured together with another
(e.g. results are available for cause-specific mortality and are favourable to the exper-
imental intervention, yet results for all-cause mortality are missing);
• summary statistics (number of events, or mean scores) are available only globally
across all groups (e.g. study authors claim that 10 of 100 participants in the trial expe-
rienced adverse events, but do not report the number of events by intervention
group); and
• the outcome is expected to have been measured, and the study is conducted by
authors or sponsored by an organization with a vested interest in the intervention
who may be inclined to withhold results that are unfavourable to the intervention
(guidance on assessing conflicts of interest is provided in Chapter 7).
Typically, selective non-reporting or under-reporting of results manifests as the sup-
pression of results that are statistically non-significant or unfavourable to the experi-
mental intervention. However, in some instances the opposite may occur. For example,
a trialist who believes that an intervention is ineffective may choose not to report
results indicating a difference in favour of the intervention over placebo. Therefore,
review authors should consider the interventions being compared when considering
reasons for missing results.
Review authors may find it useful to construct a matrix (with rows as studies and
columns as syntheses) indicating the availability of study results for each synthesis
to be assessed for risk of bias due to missing results. Table 13.3.a shows an example
of a matrix indicating the availability of results for three syntheses in a Cochrane Review
comparing selective serotonin reuptake inhibitors (SSRIs) with placebo for fibromyal-
gia (Walitt et al 2015). Results were available from all trials for the synthesis of ‘number
of patients with at least 30% pain reduction’. For the synthesis of ‘mean fatigue scores’,
results were unavailable for two trials, but for a reason unrelated to the P value, mag-
nitude or direction of the results (fatigue was not measured in these studies). For the
synthesis of ‘mean depression scores’, results were unavailable for one study, likely on
the basis of the P value (the trialists reported only that there was a ‘non-significant’
difference between groups, and review authors’ attempts to obtain the necessary data
for the synthesis were unsuccessful).
Table 13.3.a Matrix indicating availability of study results for three syntheses of trials comparing
selective serotonin reuptake inhibitors (SSRIs) with placebo for fibromyalgia (Walitt et al 2015).
Adapted from Kirkham et al (2018)
Study             SSRI (n)   Placebo (n)   ≥30% pain reduction   Mean fatigue score   Mean depression score
Anderberg 2000       17          18                 ✓                    ✓                     ✓
Arnold 2002          25          26                 ✓                    ✓                     ✓
Goldenberg 1996      22          19                 ✓                    ✓                     ✓
GSK 2005             26          26                 ✓                    –                     ✓
Norregaard 1995      20          21                 ✓                    ✓                     ✓
Patkar 2007          58          58                 ✓                    –                     X
Wolfe 1994           15           9                 ✓                    ✓                     ✓
Key:
✓ A study result is available for inclusion in the synthesis
X No study result is available for inclusion, (probably) because the P value, magnitude or direction of the
results generated were considered unfavourable by the study investigators
– No study result is available for inclusion, (probably) because the outcome was not assessed, or for a
reason unrelated to the P value, magnitude or direction of the results
? No study result is available for inclusion, and it is unclear if the outcome was assessed in the study
Kirkham and colleagues have developed template matrices that enable review authors to
classify the reporting of results of clinical trials more specifically for both benefit and harm
outcomes (Kirkham et al 2018).
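Such a matrix is also straightforward to maintain with ordinary data tools. The sketch below (ours, not a Cochrane utility) encodes Table 13.3.a with pandas, using the same codes as the table key:

```python
import pandas as pd

# Availability of results for the three syntheses in Table 13.3.a, coded as
# in the key: ✓ available; – missing for reasons unrelated to the results;
# X missing (probably) because of the nature of the results.
matrix = pd.DataFrame(
    {
        ">=30% pain reduction":  ["✓", "✓", "✓", "✓", "✓", "✓", "✓"],
        "Mean fatigue score":    ["✓", "✓", "✓", "–", "✓", "–", "✓"],
        "Mean depression score": ["✓", "✓", "✓", "✓", "✓", "X", "✓"],
    },
    index=["Anderberg 2000", "Arnold 2002", "Goldenberg 1996", "GSK 2005",
           "Norregaard 1995", "Patkar 2007", "Wolfe 1994"],
)
print("Syntheses with suspect missing results:",
      list(matrix.columns[(matrix == "X").any()]))
```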
Study                              Weight (%)   SMD (95% CI)              n1   n2
Available results
Anderberg 2000                        14.67     –0.31 (–0.98 to 0.36)     17   18
Arnold 2002                           20.18     –0.74 (–1.31 to –0.17)    25   26
Goldenberg 1996                       17.18     –0.26 (–0.88 to 0.36)     22   19
GSK 2005                              21.54     –0.44 (–0.99 to 0.11)     26   26
Norregaard 1995                       17.43      0.01 (–0.60 to 0.62)     20   21
Wolfe 1994                             9.00     –0.67 (–1.52 to 0.18)     15    9
Subtotal (I² = 0.0%, P = 0.584)      100.00     –0.39 (–0.65 to –0.14)
Missing results
Patkar 2007                            0.00     P > 0.05                  58   58

Figure 13.3.a Forest plot displaying available and missing results for a meta-analysis of depression
scores (data from Walitt et al 2015). Reproduced with permission of John Wiley and Sons.
In other cases, knowledge of the size of eligible studies may lead to reassurance that
a meta-analysis is unlikely to be biased due to missing results. For example, López-
López and colleagues performed a network meta-analysis of trials of oral anticoagu-
lants for prevention of stroke in atrial fibrillation (López-López et al 2017). Among
the five larger phase III trials comparing a direct acting oral anticoagulant with warfarin
(each of which included thousands or tens of thousands of participants), results were
fully available for important outcomes including stroke or systemic embolism, ischae-
mic stroke, myocardial infarction, all-cause mortality, major bleeding, intracranial
bleeding and gastrointestinal bleeding. The review authors felt that the inability to
include results for these outcomes from a few much smaller eligible trials (with at most
a few hundred participants) was unlikely to change the summary estimates of these
meta-analyses (López-López et al 2017).
Copas and colleagues have developed a more sophisticated model-based sensitivity
analysis that explores the robustness of the meta-analytic estimate to the definitely
missing results (Copas et al 2017). Its application requires that review authors use
the ORBIT classification system (see Section 13.3.3). Review authors applying this
method should always present the summary estimate from the sensitivity analysis
alongside the primary estimate. Consultation with a statistician is recommended for
its implementation.
When the amount of data missing from the synthesis due to selective non-reporting
or under-reporting of results is very high, review authors may decide not to report a
meta-analysis of the studies with results available, on the basis that such a synthesized
estimate could be seriously biased. In other cases, review authors may be uncertain
whether selective non-reporting or under-reporting of results occurred, because it
was unclear whether the outcome of interest was even assessed. This uncertainty
may arise when study plans (e.g. trials register record or protocol) were unavailable,
and studies in the field are known to vary in what they assess. If outcome assessment
was unclear for a large proportion of the studies identified, review authors might be
wary when drawing conclusions about the synthesis, and alert users to the possibility
that it could be missing additional results from these studies.
A funnel plot is a scatter plot of the effect estimates from individual studies against a
measure of each study’s size or precision. Because precision increases with study size,
effect estimates from smaller studies will typically scatter more widely at the bottom of the graph, with the
spread narrowing among larger studies. Ideally, the plot should approximately resem-
ble a symmetrical (inverted) funnel. This is illustrated in Panel A of Figure 13.3.b in
which the effect estimates in the larger studies are close to the true intervention odds
ratio of 0.4. If there is bias due to missing results, for example because smaller studies
without statistically significant effects (shown as open circles in Figure 13.3.b, Panel A)
remain unpublished, this will lead to an asymmetrical appearance of the funnel plot
with a gap at the bottom corner of the graph (Panel B). In this situation the summary
estimate calculated in a meta-analysis will tend to over-estimate the intervention effect
(Egger et al 1997). The more pronounced the asymmetry, the more likely it is that the
amount of bias in the meta-analysis will be substantial.
We recommend that when generating funnel plots, effect estimates be plotted against
the standard error of the effect estimate, rather than against the total sample size, on the
vertical axis (Sterne and Egger 2001). This is because the statistical power of a trial is deter-
mined by factors in addition to sample size, such as the number of participants experien-
cing the event for dichotomous outcomes, and the standard deviation of responses for
continuous outcomes. For example, a study with 100,000 participants and 10 events is less
likely to show a statistically significant intervention effect than a study with 1000 partici-
pants and 100 events. The standard error summarizes these other factors. Plotting stand-
ard errors on a reversed scale places the larger, or most powerful, studies towards the top
of the plot. Another advantage of using standard errors is that a simple triangular region
can be plotted, within which 95% of studies would be expected to lie in the absence of both
biases and heterogeneity. These regions are included in Figure 13.3.b. Funnel plots of effect
estimates against their standard errors (on a reversed scale) can be created using RevMan
and other statistical software. A triangular 95% confidence region based on a fixed-effect
meta-analysis can be included in the plot, and different plotting symbols can be used to
allow studies in different subgroups to be identified.
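To make this concrete, here is a minimal matplotlib sketch of such a funnel plot on hypothetical data; it is our illustration, not the RevMan implementation:

```python
import numpy as np
import matplotlib.pyplot as plt

log_or = np.array([-1.10, -0.65, -0.90, -0.40, -1.60, -0.20])  # hypothetical ln(OR)s
se = np.array([0.15, 0.25, 0.30, 0.40, 0.55, 0.60])            # their standard errors

# Fixed-effect (inverse-variance) summary estimate on the log scale
w = 1 / se**2
theta = np.sum(w * log_or) / np.sum(w)

# Triangular 95% region: theta +/- 1.96 * SE, drawn over a grid of SEs
se_grid = np.linspace(0.001, se.max() * 1.1, 100)
plt.plot(np.exp(theta - 1.96 * se_grid), se_grid, "k--")
plt.plot(np.exp(theta + 1.96 * se_grid), se_grid, "k--")
plt.axvline(np.exp(theta), color="k", linewidth=0.8)

plt.scatter(np.exp(log_or), se)
plt.xscale("log")            # ratio measures belong on a log scale
plt.gca().invert_yaxis()     # reversed axis: largest studies at the top
plt.xlabel("Odds ratio")
plt.ylabel("SE of lnOR")
plt.show()
```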
Ratio measures of intervention effect (such as odds ratios and risk ratios) should be
plotted on a logarithmic scale. This ensures that effects of the same magnitude but oppo-
site directions (e.g. odds ratios of 0.5 and 2) are equidistant from 1.0. For outcomes
measured on a continuous (numerical) scale (e.g. blood pressure, depression score)
intervention effects are measured as mean differences or standardized mean differences
(SMDs), which should therefore be used as the horizontal axis in funnel plots.
Some authors have argued that visual interpretation of funnel plots is too subjective
to be useful. In particular, Terrin and colleagues found that researchers had only a lim-
ited ability to identify correctly funnel plots for meta-analyses that were subject to bias
due to missing results (Terrin et al 2005).
[Figure 13.3.b Hypothetical funnel plots of SE of lnOR against odds ratio (log scale, 0.01 to 10).
Panel A: symmetrical plot in the absence of bias. Panel B: asymmetrical plot with a gap where
smaller studies with missing results would appear. Panel C: asymmetrical plot in which smaller,
lower-quality studies produce exaggerated intervention effect estimates.]
Table 13.3.b Possible sources of asymmetry in funnel plots. Adapted from Egger et al (1997)

1) Non-reporting biases
• Entire study reports, or particular results, of smaller studies are unavailable because of the nature of the findings (e.g. statistical significance, direction of effect).

2) Poor methodological quality leading to spuriously inflated effects in smaller studies
• Trials with less methodological rigour tend to show larger intervention effects (Page et al 2016a). Therefore, trials that would have been ‘negative’, if conducted and analysed properly, may become ‘positive’. Asymmetry can arise when some smaller studies are of lower methodological quality and therefore produce larger intervention effect estimates (Figure 13.3.b, Panel C).

3) True heterogeneity
• Substantial benefit may be seen only in patients at high risk for the outcome that is affected by the intervention, and usually these high-risk patients are more likely to be included in small, early studies (Davey Smith and Egger 1994).
• Some interventions may have been implemented less thoroughly in larger trials and may, therefore, have resulted in smaller estimates of the intervention effect (Stuck et al 1998).

4) Artefactual
• Some effect estimates (e.g. odds ratios and standardized mean differences) are naturally correlated with their standard errors, and this can produce spurious asymmetry in a funnel plot (Sterne et al 2011, Zwetsloot et al 2017).

5) Chance
• Asymmetry may arise by chance alone.
Funnel plots can be enhanced by adding contour lines corresponding to conventional
milestones of statistical significance (e.g. P = 0.01, 0.05, 0.1) (Peters et al 2008). This allows
the statistical significance of study estimates, and areas in which studies are
perceived to be missing, to be considered. Such contour-enhanced funnel plots may help
review authors to differentiate asymmetry that is due to non-reporting biases from that
due to other factors. For example, if studies appear to be missing in areas where results
would be statistically non-significant and unfavourable to the experimental intervention
(see Figure 13.3.c, Panel A) then this adds credence to the possibility that the asymmetry is
due to non-reporting biases. Conversely, if the supposed missing studies are in areas where
results would be statistically significant and favourable to the experimental intervention
(see Figure 13.3.c, Panel B), this would suggest the cause of the asymmetry is more likely to
be due to factors other than non-reporting biases (see Table 13.3.b).
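A contour-enhanced version can be sketched similarly. In the illustration below (hypothetical data again), the shaded bands mark where a study's two-sided P value would fall between the conventional milestones; note that these contours are centred on no effect, not on the summary estimate:

```python
import numpy as np
import matplotlib.pyplot as plt

se_grid = np.linspace(0.01, 1.0, 200)
# two-sided critical z values for P = 0.10, 0.05 and 0.01
for z_lo, z_hi, grey in [(1.645, 1.960, "0.90"), (1.960, 2.576, "0.75"),
                         (2.576, 8.000, "0.60")]:
    plt.fill_betweenx(se_grid, -z_hi * se_grid, -z_lo * se_grid, color=grey)
    plt.fill_betweenx(se_grid,  z_lo * se_grid,  z_hi * se_grid, color=grey)

effects = np.array([-0.8, -0.5, -0.9, -0.3, -1.2])  # hypothetical ln(OR)s
ses = np.array([0.20, 0.30, 0.40, 0.50, 0.60])
plt.scatter(effects, ses, zorder=3)
plt.gca().invert_yaxis()
plt.xlabel("ln(OR)")
plt.ylabel("SE of lnOR")
plt.show()
```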
• As a rule of thumb, tests for funnel plot asymmetry should be used only when there
are at least 10 studies included in the meta-analysis, because when there are fewer
studies the power of the tests is low. Only 24% of a random sample of Cochrane
Reviews indexed in 2014 included a meta-analysis with at least 10 studies
(Page et al 2016b), which implies that tests for funnel plot asymmetry are likely to be
applicable in a minority of meta-analyses.
[Figure 13.3.c Contour-enhanced funnel plots of study effect estimates against precision (1/SE),
with shaded contour regions marking 0.1 > P > 0.05, 0.05 > P > 0.01 and P < 0.01.]
• Tests should not be used if studies are of similar size (similar standard errors of inter-
vention effect estimates).
• Results of tests for funnel plot asymmetry should be interpreted in the light of visual
inspection of the funnel plot (see Sections 13.3.5.2 and 13.3.5.3). Examining a
contour-enhanced funnel plot may further aid interpretation (see Figure 13.3.c).
• When there is evidence of funnel plot asymmetry from a test, non-reporting biases
should be considered as one of several possible explanations, and review authors
should attempt to distinguish the different possible reasons for it (see Table 13.3.b); one such test is sketched below.
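Egger's regression test (Egger et al 1997) is perhaps the most widely used of these tests: standardized effects are regressed on precision, and a non-zero intercept suggests asymmetry. The sketch below applies it to hypothetical data and is an illustration, not a validated implementation:

```python
import numpy as np
from scipy import stats

# ten hypothetical studies: ln odds ratios and their standard errors
effects = np.array([-0.9, -0.6, -0.8, -0.4, -1.3, -0.2, -1.0, -0.5, -0.7, -1.1])
ses = np.array([0.15, 0.22, 0.28, 0.35, 0.42, 0.48, 0.55, 0.60, 0.70, 0.80])

# Egger's test: regress the standardized effect (effect/SE) on precision (1/SE);
# the intercept, not the slope, is the quantity of interest.
res = stats.linregress(1 / ses, effects / ses)
t_stat = res.intercept / res.intercept_stderr
p = 2 * stats.t.sf(abs(t_stat), df=len(effects) - 2)
print(f"Egger intercept = {res.intercept:.2f}, P = {p:.3f}")
```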
Selection models can be used in sensitivity analyses that make assumptions about the
severity of the selection bias (Copas 1999). These analyses show how the estimated
intervention effect (and confidence interval) changes as the assumed amount of selec-
tion bias increases. If the estimates are relatively stable regardless of the selection model
assumed, this suggests that the unadjusted estimate is unlikely to be influenced by non-
reporting biases. On the other hand, if the estimates vary considerably depending on the
selection model assumed, this suggests that non-reporting biases may well drive the
unadjusted estimate (McShane et al 2016).
A major problem with selection models is that they assume that mechanisms leading
to small-study effects other than non-reporting biases (see Table 13.3.b) are not oper-
ating, and may give misleading results if this assumption is not correct. Jin and collea-
gues summarize the advantages and disadvantages of eight selection models, indicate
circumstances in which each can be used, and describe software available to imple-
ment them (Jin et al 2015). Given the complexity of the models, consultation with a
statistician is recommended for their implementation.
However, if the search for studies was not comprehensive, or if a contour-enhanced funnel
plot or sensitivity analysis suggests results are missing systematically, then it would be
reasonable to conclude that the synthesis is at risk of bias due to missing results. On
the other hand, if the review is based on an inception cohort, such that all studies that have
been conducted are known, and these studies were fully reported in line with their analysis
plans, then there would be low risk of bias due to missing results in a synthesis. Indeed,
such a low risk-of-bias judgement would carry even in the presence of asymmetry in a fun-
nel plot; although it would be important to investigate the reason for this asymmetry (e.g. it
might be due to systematic differences in the PICOs of smaller and larger studies, or to
problems in the methodological conduct of the smaller studies).
13.4 Summary
There is clear evidence that selective dissemination of study reports and results leads
to an over-estimate of the benefits and under-estimate of the harms of interventions in
systematic reviews and meta-analyses. However, overcoming, detecting and correcting
for bias due to missing results is difficult. Comprehensive searches are important, but
are not on their own sufficient to prevent substantial potential biases. Review authors
should therefore consider the risk of bias due to missing results in syntheses included in
their review (see MECIR Box 13.4.a).
We have presented a framework for assessing risk of bias due to missing results in a
synthesis. Of the approaches described, a thorough assessment of selective non-
reporting or under-reporting of results in the studies identified (Section 13.3.3) is likely
to be the most labour intensive, yet the most valuable. Because the number of identified
studies with results missing for a given synthesis is known, the impact of selective non-
reporting or under-reporting of results can be quantified more easily (see Section 13.3.4)
than the impact of selective non-publication of an unknown number of studies.
The value of the other methods described in the framework will depend on the circum-
stances of the review. For example, if review authors suspect that a synthesis is biased
because results were missing selectively from a large proportion of the studies identified,
then the graphical and statistical methods outlined in Section 13.3.5 (e.g. funnel plots)
are unlikely to change their judgement. However, funnel plots, tests for funnel plot asym-
metry and other sensitivity analyses may be useful in cases where protocols or records
from trials registers were unavailable for most studies, making it difficult to assess selec-
tive non-reporting or under-reporting of results reliably. When there is evidence of funnel
plot asymmetry, non-reporting biases should be considered as only one of a number of
possible explanations: review authors should attempt to understand the sources of the
asymmetry, and consider their implications in the light of any qualitative signals that
raise a suspicion of additional missing results, and other sensitivity analyses.
13.6 References
Askie LM, Darlow BA, Finer N, Schmidt B, Stenson B, Tarnow-Mordi W, Davis PG, Carlo WA,
Brocklehurst P, Davies LC, Das A, Rich W, Gantz MG, Roberts RS, Whyte RK, Costantini L,
Poets C, Asztalos E, Battin M, Halliday HL, Marlow N, Tin W, King A, Juszczak E, Morley CJ,
Doyle LW, Gebski V, Hunter KE, Simes RJ. Association between oxygen saturation
targeting and death or disability in extremely preterm infants in the Neonatal
Oxygenation Prospective Meta-analysis Collaboration. JAMA 2018; 319: 2190–2201.
McShane BB, Bockenholt U, Hansen KT. Adjusting for publication bias in meta-analysis: an
evaluation of selection methods and some cautionary notes. Perspectives on
Psychological Science 2016; 11: 730–749.
Moreno SG, Sutton AJ, Turner EH, Abrams KR, Cooper NJ, Palmer TM, Ades AE. Novel
methods to deal with publication biases: secondary analysis of antidepressant trials in
the FDA trial registry database and related journal publications. BMJ 2009; 339: b2981.
Moreno SG, Sutton AJ, Thompson JR, Ades AE, Abrams KR, Cooper NJ. A generalized
weighting regression-derived meta-analysis estimator robust to small-study effects and
heterogeneity. Statistics in Medicine 2012; 31: 1407–1417.
Morrison A, Polisena J, Husereau D, Moulton K, Clark M, Fiander M, Mierzwinski-Urban M,
Clifford T, Hutton B, Rabb D. The effect of English-language restriction on systematic
review-based meta-analyses: a systematic review of empirical studies. International
Journal of Technology Assessment in Health Care 2012; 28: 138–144.
Mueller KF, Meerpohl JJ, Briel M, Antes G, von Elm E, Lang B, Motschall E, Schwarzer G,
Bassler D. Methods for detecting, quantifying and adjusting for dissemination bias in
meta-analysis are described. Journal of Clinical Epidemiology 2016; 80: 25–33.
Page MJ, Higgins JPT, Clayton G, Sterne JAC, Hróbjartsson A, Savović J. Empirical evidence
of study design biases in randomized trials: systematic review of meta-epidemiological
studies. PLoS One 2016a; 11: 7.
Page MJ, Shamseer L, Altman DG, Tetzlaff J, Sampson M, Tricco AC, Catala-Lopez F, Li L, Reid
EK, Sarkis-Onofre R, Moher D. Epidemiology and reporting characteristics of systematic
reviews of biomedical research: a cross-sectional study. PLoS Medicine 2016b; 13:
e1002028.
Page MJ, McKenzie JE, Higgins JPT. Tools for assessing risk of reporting biases in studies
and syntheses of studies: a systematic review. BMJ Open 2018; 8: e019703.
Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to
detect publication bias in meta-analysis. JAMA 2006; 295: 676–680.
Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Contour-enhanced meta-analysis
funnel plots help distinguish publication bias from other causes of asymmetry. Journal of
Clinical Epidemiology 2008; 61: 991–996.
Roberts I, Ker K, Edwards P, Beecher D, Manno D, Sydenham E. The knowledge system
underpinning healthcare is not fit for purpose and must change. BMJ 2015; 350: h2463.
Rodgers MA, Brown JV, Heirs MK, Higgins JPT, Mannion RJ, Simmonds MC, Stewart LA.
Reporting of industry funded study outcome data: comparison of confidential and
published data on the safety and effectiveness of rhBMP-2 for spinal fusion. BMJ 2013;
346: f3981.
Rücker G, Carpenter JR, Schwarzer G. Detecting and adjusting for small-study effects in
meta-analysis. Biometrical Journal 2011a; 53: 351–368.
Rücker G, Schwarzer G, Carpenter JR, Binder H, Schumacher M. Treatment-effect estimates
adjusted for small-study effects via a limit meta-analysis. Biostatistics 2011b; 12: 122–142.
Schmucker C, Schell LK, Portalupi S, Oeller P, Cabrera L, Bassler D, Schwarzer G, Scherer RW,
Antes G, von Elm E, Meerpohl JJ. Extent of non-publication in cohorts of studies approved
by research ethics committees or included in trial registries. PLoS One 2014; 9: e114023.
Schmucker CM, Blümle A, Schell LK, Schwarzer G, Oeller P, Cabrera L, von Elm E, Briel M,
Meerpohl JJ. Systematic review finds that study data not published in full text articles
have unclear impact on meta-analyses results in medical research. PLoS One 2017; 12:
e0176210.
Schroll JB, Abdel-Sattar M, Bero L. The Food and Drug Administration reports provided
more data but were more difficult to use than the European Medicines Agency reports.
Journal of Clinical Epidemiology 2015; 68: 102–107.
Schroll JB, Penninga EI, Gøtzsche PC. Assessment of adverse events in protocols, clinical
study reports, and published papers of trials of orlistat: a document analysis. PLoS
Medicine 2016; 13: e1002101.
Simmonds MC, Brown JV, Heirs MK, Higgins JPT, Mannion RJ, Rodgers MA, Stewart LA.
Safety and effectiveness of recombinant human bone morphogenetic protein-2 for spinal
fusion: a meta-analysis of individual-participant data. Annals of Internal Medicine 2013;
158: 877–889.
Sterne JAC, Egger M. Funnel plots for detecting bias in meta-analysis: guidelines on choice
of axis. Journal of Clinical Epidemiology 2001; 54: 1046–1055.
Sterne JAC, Sutton AJ, Ioannidis JPA, Terrin N, Jones DR, Lau J, Carpenter J, Rücker G,
Harbord RM, Schmid CH, Tetzlaff J, Deeks JJ, Peters J, Macaskill P, Schwarzer G, Duval S,
Altman DG, Moher D, Higgins JPT. Recommendations for examining and interpreting
funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ 2011;
343: d4002.
Stuck AE, Rubenstein LZ, Wieland D. Bias in meta-analysis detected by a simple, graphical
test. Asymmetry detected in funnel plot was probably due to true heterogeneity. Letter.
BMJ 1998; 316: 469–471.
Terrin N, Schmid CH, Lau J. In an empirical evaluation of the funnel plot, researchers could
not visually identify publication bias. Journal of Clinical Epidemiology 2005; 58: 894–901.
Toews I, Booth A, Berg RC, Lewin S, Glenton C, Munthe-Kaas HM, Noyes J, Schroter S,
Meerpohl JJ. Further exploration of dissemination bias in qualitative research required to
facilitate assessment within qualitative evidence syntheses. Journal of Clinical
Epidemiology 2017; 88: 133–139.
Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of
antidepressant trials and its influence on apparent efficacy. New England Journal of
Medicine 2008; 358: 252–260.
Walitt B, Urrutia G, Nishishinya MB, Cantrell SE, Hauser W. Selective serotonin reuptake
inhibitors for fibromyalgia syndrome. Cochrane Database of Systematic Reviews 2015; 6:
CD011735.
Wieseler B, Wolfram N, McGauran N, Kerekes MF, Vervolgyi V, Kohlepp P, Kamphuis M,
Grouven U. Completeness of reporting of patient-relevant clinical trial outcomes:
comparison of unpublished clinical study reports with publicly available data. PLoS
Medicine 2013; 10: e1001526.
Williamson P, Clarke M. The COMET (Core Outcome Measures in Effectiveness Trials)
Initiative: its role in improving Cochrane Reviews. Cochrane Database of Systematic
Reviews 2012; 5: ED000041.
Williamson PR, Gamble C. Identification and impact of outcome selection bias in meta-
analysis. Statistics in Medicine 2005; 24: 1547–1561.
Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database:
update and key issues. New England Journal of Medicine 2011; 364: 852–860.
Zwetsloot P-P, Van Der Naald M, Sena ES, Howells DW, IntHout J, De Groot JAH, Chamuleau
SAJ, MacLeod MR, Wever KE. Standardized mean differences cause funnel plot distortion
in publication bias assessments. eLife 2017; 6: e24260.
14
Completing ‘Summary of findings’ tables and
grading the certainty of the evidence
Holger J Schünemann, Julian PT Higgins, Gunn E Vist, Paul Glasziou, Elie A Akl, Nicole
Skoetz, Gordon H Guyatt; on behalf of the Cochrane GRADEing Methods Group
(formerly Applicability and Recommendations Methods Group) and the Cochrane
Statistical Methods Group
KEY POINTS
• ‘Summary of findings’ tables provide key information concerning the certainty of the evidence, the magnitude of the effects of the interventions examined, and the sum of available evidence.
• ‘Summary of findings’ tables include a row for each important outcome (up to a maximum of seven). Accepted formats of ‘Summary of findings’ tables and interactive ‘Summary of findings’ tables can be produced using GRADE’s software GRADEpro GDT.
• Cochrane has adopted the GRADE approach (Grading of Recommendations Assessment, Development and Evaluation) for assessing certainty (or quality) of a body of evidence.
• The GRADE approach specifies four levels of certainty for a body of evidence for a given outcome: high, moderate, low and very low.
• GRADE assessments of certainty are determined through consideration of five domains: risk of bias, inconsistency, indirectness, imprecision and publication bias. For evidence from non-randomized studies and rarely randomized studies, assessments can then be upgraded through consideration of three further domains.
This chapter should be cited as: Schünemann HJ, Higgins JPT, Vist GE, Glasziou P, Akl EA, Skoetz N, Guyatt
GH. Chapter 14: Completing ‘Summary of findings’ tables and grading the certainty of the evidence. In:
Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for
Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 375–402.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
14.1 ‘Summary of findings’ tables

‘Summary of findings’ tables present the main findings of a review in a transparent and
simple tabular format. In particular, they provide key information concerning the certainty
or quality of evidence (i.e. the confidence or certainty in the range of an effect estimate or
an association), the magnitude of effect of the interventions examined, and the sum of
available data on the main outcomes. Cochrane
Reviews should incorporate ‘Summary of findings’ tables during planning and publica-
tion, and should have at least one key ‘Summary of findings’ table representing the
most important comparisons. Some reviews may include more than one ‘Summary
of findings’ table, for example if the review addresses more than one major compari-
son, or includes substantially different populations that require separate tables (e.g.
because the effects differ or it is important to show results separately). In the Cochrane
Database of Systematic Reviews (CDSR), the principal ‘Summary of findings’ table of a
review appears at the beginning, before the Background section. Other ‘Summary of
findings’ tables appear between the Results and Discussion sections.
Review authors are encouraged to include non-
randomized studies to examine rare or long-term adverse effects that may not ade-
quately be studied in randomized trials. This raises the possibility that harm outcomes
may come from studies in which participants differ from those in studies used in the anal-
ysis of benefit. Review authors will then need to consider how much such differences are
likely to impact on the findings, and this will influence the certainty of evidence because
of concerns about indirectness related to the population (see Section 14.2.2).
Non-randomized studies can provide important information not only when rando-
mized trials do not report on an outcome or randomized trials suffer from indirectness,
but also when the evidence from randomized trials is rated as very low and non-
randomized studies provide evidence of higher certainty. Further discussion of these
issues appears also in Chapter 24.
Patients or population: anyone taking a long flight (lasting more than 6 hours)

Outcome | Assumed risk (without stockings) | Corresponding risk (with stockings) | Relative effect (95% CI) | No of participants (studies) | Certainty of the evidence (GRADE) | Comments
Symptomatic deep vein thrombosis (DVT) | See comment | See comment | Not estimable | 2821 (9 studies) | See comment | 0 participants developed symptomatic DVT in these studies
Pulmonary embolus | See comment | See comment | Not estimable | 2821 (9 studies) | See comment | 0 participants developed pulmonary embolus in these studies^e
Death | See comment | See comment | Not estimable | 2821 (9 studies) | See comment | 0 participants died in these studies
Adverse effects | See comment | See comment | Not estimable | 1182 (4 studies) | See comment | The tolerability of the stockings was described as very good, with no complaints of side effects in 4 studies^f
*The basis for the assumed risk is provided in footnotes. The corresponding risk (and its 95% confidence interval) is based on
the assumed risk in the comparison group and the relative effect of the intervention (and its 95% CI).
CI: confidence interval; RR: risk ratio; GRADE: GRADE Working Group grades of evidence (see explanations).
a All the stockings in the nine studies included in this review were below-knee compression stockings. In four studies the compression strength was 20 mmHg to
30 mmHg at the ankle. It was 10 mmHg to 20 mmHg in the other four studies. Stockings come in different sizes. If a stocking is too tight around the knee it can
prevent essential venous return causing the blood to pool around the knee. Compression stockings should be fitted properly. A stocking that is too tight could cut
into the skin on a long flight and potentially cause ulceration and increased risk of DVT. Some stockings can be slightly thicker than normal leg covering and can
be potentially restrictive with tight foot wear. It is a good idea to wear stockings around the house prior to travel to ensure a good, comfortable fit. Participants put
their stockings on two to three hours before the flight in most of the studies. The availability and cost of stockings can vary.
b Two studies recruited high risk participants defined as those with previous episodes of DVT, coagulation disorders, severe obesity, limited mobility due to bone
or joint problems, neoplastic disease within the previous two years, large varicose veins or, in one of the studies, participants taller than 190 cm and heavier than
90 kg. The incidence for the seven studies that excluded high risk participants was 1.45% and the incidence for the two studies that recruited high-risk
participants (with at least one risk factor) was 2.43%. We have used 10 and 30 per 1000 to express different risk strata, respectively.
c The confidence interval crosses no difference and does not rule out a small increase.
d The measurement of oedema was not validated (indirectness of the outcome) or blinded to the intervention (risk of bias).
e If there are very few or no events and the number of participants is large, judgement about the certainty of evidence (particularly judgements about imprecision)
may be based on the absolute effect. Here the certainty rating may be considered ‘high’ if the outcome was appropriately assessed and the event, in fact, did not
occur in 2821 studied participants.
f None of the other studies reported adverse effects, apart from four cases of superficial vein thrombosis in varicose veins in the knee region that were
compressed by the upper edge of the stocking in one study.
our ability to draw conclusions on the safety of the many probiotics agents and doses administered.
h Serious unexplained inconsistency (large heterogeneity: I² = 79%, P = 0.04; point estimates and confidence intervals vary
considerably).
i Serious imprecision. The upper bound of 0.02 fewer days of diarrhoea is not considered patient important.
j Serious unexplained inconsistency (large heterogeneity: I² = 78%, P = 0.05; point estimates and confidence intervals vary
considerably).
k Serious imprecision. The 95% confidence interval includes no effect and lower bound of 0.60 stools per day is of questionable
patient importance.
Evidence profiles provide a detailed record of the considerations feeding into the grading
of certainty and of the results of the studies (Guyatt et al 2011a). They ensure that a
structured approach is used to rate the certainty of evidence. Although they are rarely
published in Cochrane Reviews, evidence
profiles are often used, for example, by guideline developers in considering the cer-
tainty of the evidence to support guideline recommendations. Review authors will find
it easier to develop the ‘Summary of findings’ table by completing the rating of the cer-
tainty of evidence in the evidence profile first in GRADEpro GDT. They can then auto-
matically convert this to one of the ‘Summary of findings’ formats in GRADEpro GDT,
including an interactive ‘Summary of findings’ for publication.
As a measure of the magnitude of effect for dichotomous outcomes, the ‘Summary of
findings’ table should provide a relative measure of effect (e.g. risk ratio, odds ratio,
hazard ratio) and measures of absolute risk. For other types of data, an absolute measure
alone (such as a difference in means for continuous data) might be sufficient. It is
important that the magnitude of effect is presented in a meaningful way, which
may require some transformation of the result of a meta-analysis (see also
Chapter 15, Sections 15.4 and 15.5). Reviews with more than one main comparison
should include a separate ‘Summary of findings’ table for each comparison.
Figure 14.1.a provides an example of a ‘Summary of findings’ table. Figure 14.1.b pro-
vides an alternative format that may further facilitate users’ understanding and inter-
pretation of the review’s findings. Evidence evaluating different formats suggests that
the ‘Summary of findings’ table should include a risk difference as a measure of the
absolute effect and authors should preferably use a format that includes a risk differ-
ence (Carrasco-Labra et al 2016).
A detailed description of the contents of a ‘Summary of findings’ table appears in
Section 14.1.6.
Variation in comparator group risks (i.e. in the risk of the event occurring without the
intervention of interest, for example in different populations) makes it impossible
for more than one of these measures to be truly the same in every study.
It has long been assumed in epidemiology that relative measures of effect are more
consistent than absolute measures of effect from one scenario to another. There is
empirical evidence to support this assumption (Engels et al 2000, Deeks and Altman
2001, Furukawa et al 2002). For this reason, meta-analyses should generally use either
a risk ratio or an odds ratio as a measure of effect (see Chapter 10, Section 10.4.3). Cor-
respondingly, a single estimate of relative effect is likely to be a more appropriate sum-
mary than a single estimate of absolute effect. If a relative effect is indeed consistent
across studies, then different comparator group risks will have different implications
for absolute benefit. For instance, if the risk ratio is consistently 0.75, then the exper-
imental intervention would reduce a comparator group risk of 80% to 60% in the inter-
vention group (an absolute risk reduction of 20 percentage points), but would also
reduce a comparator group risk of 20% to 15% in the intervention group (an absolute
risk reduction of 5 percentage points).
‘Summary of findings’ tables are built around the assumption of a consistent relative
effect. It is therefore important to consider the implications of this effect for different
comparator group risks (these can be derived or estimated from a number of sources,
see Section 14.1.6.3), which may require an assessment of the certainty of evidence for
prognostic evidence (Spencer et al 2012, Iorio et al 2015). For any comparator group
risk, it is possible to estimate a corresponding intervention group risk (i.e. the absolute
risk with the intervention) from the meta-analytic risk ratio or odds ratio. Note that the
numbers provided in the ‘Corresponding risk’ column are specific to the ‘risks’ in the
adjacent column.
For the meta-analytic risk ratio (RR) and assumed comparator risk (ACR) the corre-
sponding intervention risk is obtained as:
Corresponding intervention risk per 1000 = 1000 × ACR × RR
As an example, in Figure 14.1.a, the meta-analytic risk ratio for symptomless deep vein
thrombosis (DVT) is RR = 0.10 (95% CI 0.04 to 0.26). Assuming a comparator risk of
ACR = 10 per 1000 = 0.01, we obtain:
Corresponding intervention risk per 1000 = 1000 × 0.01 × 0.10 = 1
For the meta-analytic odds ratio (OR) and assumed comparator risk, ACR, the corre-
sponding intervention risk is obtained as:
Corresponding intervention risk per 1000 = 1000 × (OR × ACR) / (1 − ACR + OR × ACR)
Upper and lower confidence limits for the corresponding intervention risk are obtained
by replacing RR or OR by their upper and lower confidence limits, respectively (e.g.
replacing 0.10 with 0.04, then with 0.26, in the example). Such confidence intervals
do not incorporate uncertainty in the assumed comparator risks.
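To illustrate the arithmetic, the following Python sketch (an illustration of ours; the function names are not from GRADEpro GDT or any Cochrane software) applies both formulae to the symptomless DVT example above, substituting the confidence limits of the relative effect to obtain the limits of the corresponding risk.

```python
def risk_per_1000_from_rr(acr, rr):
    """Corresponding intervention risk per 1000 from an assumed
    comparator risk (ACR, expressed as a proportion) and a risk ratio."""
    return 1000 * acr * rr

def risk_per_1000_from_or(acr, or_):
    """Corresponding intervention risk per 1000 from an assumed
    comparator risk (ACR, expressed as a proportion) and an odds ratio."""
    return 1000 * (or_ * acr) / (1 - acr + or_ * acr)

# Symptomless DVT: RR = 0.10 (95% CI 0.04 to 0.26), ACR = 10 per 1000 = 0.01.
for rr in (0.10, 0.04, 0.26):  # point estimate, then lower and upper CI limits
    print(f"RR = {rr:.2f}: {risk_per_1000_from_rr(0.01, rr):.1f} per 1000")
# -> 1.0, 0.4 and 2.6 per 1000; note that uncertainty in the ACR itself
#    is not reflected in these limits.
```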
When dealing with risk ratios, it is critical that the same definition of ‘event’ is used as
was used for the meta-analysis. For example, if the meta-analysis focused on ‘death’ (as
opposed to survival) as the event, then corresponding risks in the ‘Summary of findings’
table must also refer to ‘death’.
381
14 Completing ‘Summary of findings’ tables
(i) Absolute risk of event-free survival within a particular period of time. Event-free survival
(e.g. overall survival) is commonly reported by individual studies. To obtain absolute
effects for time-to-event outcomes measured as event-free survival, the summary
HR can be used in conjunction with an assumed proportion of patients who are
event-free in the comparator group (Tierney et al 2007). This proportion of patients will
be specific to a period of time of observation. However, it is not strictly necessary to
specify this period of time. For instance, a proportion of 50% of event-free patients
might apply to patients with a high event rate observed over 1 year, or to patients with
a low event rate observed over 2 years.
Corresponding intervention risk per 1000 = exp(ln(proportion of patients event-free) × HR) × 1000
As an example, suppose the meta-analytic hazard ratio is 0.42 (95% CI 0.25 to 0.72). Assuming a comparator group risk of event-free survival (e.g. for overall survival, people being alive) at 2 years of ACR = 900 per 1000 = 0.9, we obtain:
Corresponding intervention risk per 1000 = exp(ln(0.9) × 0.42) × 1000 ≈ 957
(iii) Median time to the event Instead of absolute numbers, the time to the event in the
intervention and comparison groups can be expressed as median survival time in
months or years. To obtain median survival time the pooled HR can be applied to
an assumed median survival time in the comparator group (Tierney et al 2007):
Corresponding median survival, in months = comparator group median survival time (in months) / HR
In the example, assuming a comparator group median survival time of 80 months, we
obtain:
Corresponding median survival, in months = 80 months / 0.42 ≈ 190 months
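As a small illustration of these conversions (the helper names below are ours, not the Handbook’s), the following sketch reproduces the worked examples above for a hazard ratio of 0.42:

```python
import math

def event_free_per_1000(comparator_event_free, hr):
    """Intervention-group event-free survival per 1000, from the comparator
    event-free proportion and the meta-analytic hazard ratio (HR)."""
    return math.exp(math.log(comparator_event_free) * hr) * 1000

def corresponding_median_survival(comparator_median, hr):
    """Corresponding median survival time, in the comparator's units."""
    return comparator_median / hr

# HR = 0.42 (95% CI 0.25 to 0.72), as in the examples above.
print(round(event_free_per_1000(0.9, 0.42)))           # -> 957 per 1000 event-free
print(round(corresponding_median_survival(80, 0.42)))  # -> 190 months
```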
For all three of these options for re-expressing results of time-to-event analyses,
upper and lower confidence limits for the corresponding intervention risk are obtained
by replacing HR by its upper and lower confidence limits, respectively (e.g. replacing
0.42 with 0.25, then with 0.72, in the example). Again, as for dichotomous outcomes,
Patients or population. This further clarifies the population (and possibly the subpopu-
lations) of interest and ideally the magnitude of risk of the most crucial adverse out-
come at which an intervention is directed. For instance, people on a long-haul flight
may be at different risks for DVT; those using selective serotonin reuptake inhibitors
(SSRIs) might be at different risk for side effects; while those with atrial fibrillation
may be at low (<1%), moderate (1% to 4%) or high (>4%) yearly risk of stroke.
Setting. This should state any specific characteristics of the settings of the healthcare
question that might limit the applicability of the summary of findings to other settings
(e.g. primary care in Europe and North America).
14.1.6.2 Outcomes
The rows of a ‘Summary of findings’ table should include all desirable and undesirable
health outcomes (listed in order of importance) that are essential for decision making,
up to a maximum of seven outcomes. If there are more outcomes in the review, review authors will need to omit the less important outcomes from the table; the decision about which outcomes are critical or important to the review should be made during protocol development (see Chapter 3). Review authors should provide time frames for
the measurement of the outcomes (e.g. 90 days or 12 months) and the type of instru-
ment scores (e.g. ranging from 0 to 100).
Note that review authors should include the pre-specified critical and important out-
comes in the table whether data are available or not. However, they should be alert to
the possibility that the importance of an outcome (e.g. a serious adverse effect) may
only become known after the protocol was written or the analysis was carried out, and
should take appropriate actions to include these in the ‘Summary of findings’ table.
The ‘Summary of findings’ table can include effects in subgroups of the population for
different comparator risks and effect sizes separately. For instance, in Figure 14.1.b effects
are presented for children younger and older than 5 years separately. Review authors may
also opt to produce separate ‘Summary of findings’ tables for different populations.
Review authors should include serious adverse events, but it might be possible to
combine minor adverse events as a single outcome, and describe this in an explanatory
footnote (note that it is not appropriate to add events together unless they are inde-
pendent, that is, a participant who has experienced one adverse event has an unaf-
fected chance of experiencing the other adverse event).
Outcomes measured at multiple time points represent a particular problem. In gen-
eral, to keep the table simple, review authors should present multiple time points only
for outcomes critical to decision making, where either the result or the decision made
are likely to vary over time. The remainder should be presented at a common time
point where possible.
Review authors can present continuous outcome measures in the ‘Summary of find-
ings’ table and should endeavour to make these interpretable to the target audience.
This requires that the units are clear and readily interpretable, for example, days of pain,
or frequency of headache, and the name and scale of any measurement tools used
should be stated (e.g. a Visual Analogue Scale, ranging from 0 to 100). However, many
measurement instruments are not readily interpretable by non-specialist clinicians or
patients, for example, points on a Beck Depression Inventory or quality of life score.
For these, a more interpretable presentation might involve converting a continuous to
a dichotomous outcome, such as > 50% improvement (see Chapter 15, Section 15.5).
result presented in the relative effect column (see Section 14.1.6.6). Formulae are pro-
vided in Section 14.1.5. Review authors should present the absolute effect in the same
format as the risks with comparator intervention (see Section 14.1.6.3), for example as
the number of people experiencing the event per 1000 people.
For continuous outcomes, a difference in means or standardized difference in means
should be presented with its confidence interval. These will typically be obtained
directly from a meta-analysis. Explanatory text should be used to clarify the meaning,
as in Figures 14.1.a and 14.1.b.
For example, the certainty would be ‘high’ if the summary were of several randomized trials
with low risk of bias, but the rating of certainty becomes lower if there are concerns
about risk of bias, inconsistency, indirectness, imprecision or publication bias. Judge-
ments other than of ‘high’ certainty should be made transparent using explanatory
footnotes or the ‘Comments’ column in the ‘Summary of findings’ table (see
Section 14.1.6.10).
14.1.6.9 Comments
The aim of the ‘Comments’ field is to help interpret the information or data identified in
the row. For example, this may be on the validity of the outcome measure or the pres-
ence of variables that are associated with the magnitude of effect. Important caveats
about the results should be flagged here. Not all rows will need comments, and it is best
to leave a blank if there is nothing warranting a comment.
14.1.6.10 Explanations
Detailed explanations should be included as footnotes to support the judgements in
the ‘Summary of findings’ table, such as the overall GRADE assessment. The explana-
tions should describe the rationale for important aspects of the content. Table 14.1.a
lists guidance for useful explanations. Explanations should be concise, informative, rel-
evant, easy to understand and accurate. If explanations cannot be sufficiently
described in footnotes, review authors should provide further details of the issues in
the Results and Discussion sections of the review.
Table 14.1.a Guidance for providing useful explanations in ‘Summary of findings’ (SoF) tables.
Adapted from Santesso et al (2016)
General guidance
1) Enter the information for readers directly into the table if possible (e.g. information about the
duration of follow-up or the scale used).
2) Generally, do not cite references in the explanations section, unless there are specific reasons,
for example, for providing information about sources of baseline risks (see point 3).
3) Provide the source of information about the baseline risks used to calculate absolute effects.
4) On completion of the table, review all explanations to determine if some could be referenced
multiple times if reworded or combined.
5) Provide reasons for upgrading and downgrading the evidence (see domain-specific guidance
below) and use GRADEpro GDT software to adhere to GRADE guidance.
6) The body of evidence for a particular outcome may be determined to have serious or very serious
issues for the affected domain (or critically serious for risk of bias when ROBINS-I is used). Thus,
it may be useful to indicate the number of levels for downgrading (e.g. downgraded by one level
for risk of bias), but avoid repetition of what is in the table (and the impression of formulaic or
algorithmic reporting). In evidence profiles, this information is already in the cells of the table.
7) Although explanations about the certainty in the evidence are primarily required when they alter
the certainty, consider adding an explanation when the certainty in the evidence has not been
altered but when this decision may be questioned by others. This will help with understanding
reasons for disagreement.
14.2 Assessing the certainty or quality of a body of evidence
For systematic reviews, the GRADE approach defines the certainty of a body of evi-
dence as the extent to which one can be confident that an estimate of effect or asso-
ciation is close to the quantity of specific interest. Assessing the certainty of a body of
evidence involves consideration of within- and across-study risk of bias (limitations in
study design and execution or methodological quality), inconsistency (or heterogene-
ity), indirectness of evidence, imprecision of the effect estimates and risk of publication
bias (see Section 14.2.2), as well as domains that may increase our confidence in the
effect estimate (as described in Section 14.2.3). The GRADE system entails an assess-
ment of the certainty of a body of evidence for each individual outcome. Judgements
about the domains that determine the certainty of evidence should be described in the
results or discussion section and as part of the ‘Summary of findings’ table.
The GRADE approach specifies four levels of certainty (Figure 14.2.a). For interven-
tions, including diagnostic and other tests that are evaluated as interventions
(Schünemann et al 2008b, Schünemann et al 2008a, Balshem et al 2011, Schünemann
et al 2012), the starting point for rating the certainty of evidence is categorized into
two types:
[Figure: step 1 establishes the initial level of certainty in an estimate of effect from the study design (randomized trials begin at high certainty; non-randomized studies of interventions begin at low certainty); step 2 considers reasons for lowering (‘Lower if’: risk of bias, inconsistency, indirectness, imprecision, publication bias) or raising (‘Higher if’: large effect, dose response, opposing plausible residual bias and confounding) that certainty; step 3 records the final level of certainty rating across those considerations.]
Figure 14.2.a Levels of the certainty of a body of evidence in the GRADE approach. *Upgrading criteria are usually applicable to non-randomized studies only (but exceptions exist).
14.2.2 Domains that can lead to decreasing the certainty level of a body
of evidence
We now describe in more detail the five reasons (or domains) for downgrading the certainty of a body of evidence for a specific outcome. For each domain, any concern should be classified as ‘no limitation’ (not important enough to warrant downgrading), ‘serious’ (downgrading the certainty rating by one level) or ‘very serious’ (downgrading the certainty rating by two levels). For non-randomized studies assessed with ROBINS-I, rating down by three levels should be classified as ‘extremely serious’.
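The bookkeeping implied by this classification can be sketched in a few lines of Python. This is an illustration of ours only (review authors should use GRADEpro GDT, and the judgements themselves remain qualitative); the names LEVELS and grade_certainty are invented for the example.

```python
LEVELS = ["very low", "low", "moderate", "high"]

# Illustrative mapping of judgements to level changes (not GRADE software):
CHANGE = {"no limitation": 0, "serious": -1,
          "very serious": -2, "extremely serious": -3}

def grade_certainty(start_level, downgrades, upgrades=0):
    """start_level: 'high' (randomized trials) or 'low' (NRSI);
    downgrades: one judgement per domain, e.g. ['serious', 'no limitation'];
    upgrades: levels added for large effects, dose response, opposing bias."""
    score = LEVELS.index(start_level)
    score += sum(CHANGE[d] for d in downgrades) + upgrades
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]

# Randomized trials with serious risk of bias and serious imprecision:
print(grade_certainty("high", ["serious", "serious"]))  # -> 'low'
```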
1) Risk of bias or limitations in the detailed design and implementation
Our confidence in an estimate of effect decreases if studies suffer from major limita-
tions that are likely to result in a biased assessment of the intervention effect. For ran-
domized trials, these methodological limitations include failure to generate a random
sequence, lack of allocation sequence concealment, lack of blinding (particularly with
subjective outcomes that are highly susceptible to biased assessment), a large loss to
Table 14.2.a Further guidelines for domain 1 (of 5) in a GRADE assessment: going from assessments of risk of bias in studies to judgements about study limitations for main outcomes across studies (columns: risk of bias; across studies; interpretation; considerations; GRADE assessment of risk of bias or study limitations).
be used for carotid endarterectomy in symptomatic patients with high grade stenosis
(70% to 99%) in which the intervention is, in the hands of the right surgeons, beneficial,
and another (if review authors considered it relevant) for asymptomatic patients with
low grade stenosis (less than 30%) in which surgery appears harmful (Orrapin and
Rerkasem 2017). When heterogeneity exists and affects the interpretation of results,
but review authors are unable to identify a plausible explanation with the data avail-
able, the certainty of the evidence decreases.
3) Indirectness of evidence
Two types of indirectness are relevant. First, a review comparing the effectiveness of
alternative interventions (say A and B) may find that randomized trials are available,
but they have compared A with placebo and B with placebo. Thus, the evidence is
restricted to indirect comparisons between A and B. Where indirect comparisons are
393
14 Completing ‘Summary of findings’ tables
[Table fragment, per outcome: the domain (original question asked) is set against a description of the evidence found and included, including evidence from other studies (considering study design and study limitations, inconsistency, imprecision and publication bias) and a judgement of whether the evidence is sufficiently direct.]
of risk of bias (see Chapter 8, Section 8.7), so for the studies contributing to the out-
come in the ‘Summary of findings’ table this is addressed by domain 1 above (limita-
tions in the design and implementation). If a large number of studies included in the
review do not contribute to an outcome, or if there is evidence of publication bias, the
certainty of the evidence may be downgraded. Chapter 13 provides a detailed discus-
sion of reporting biases, including publication bias, and how it may be tackled in a
Cochrane Review. A prototypical situation that may elicit suspicion of publication bias
is when published evidence includes a number of small studies, all of which are
industry-funded (Bhandari et al 2004). For example, 14 studies of flavonoids in patients
with haemorrhoids have shown apparent large benefits, but enrolled a total of only
1432 patients (i.e. each study enrolled relatively few patients) (Alonso-Coello et al
2006). The heavy involvement of sponsors in most of these studies raises questions
of whether unpublished studies that suggest no benefit exist (publication bias).
A particular body of evidence can suffer from problems associated with more than one
of the five factors listed here, and the greater the problems, the lower the certainty of
evidence rating that should result. One could imagine a situation in which randomized
trials were available, but all or virtually all of these limitations would be present, and in
serious form. A very low certainty of evidence rating would result.
14.2.3 Domains that may lead to increasing the certainty level of a body of
evidence
Although NRSI (non-randomized studies of interventions) and downgraded randomized trials will generally yield a low rating for
certainty of evidence, there will be unusual circumstances in which review authors
could ‘upgrade’ such evidence to moderate or even high certainty (Table 14.3.a).
1) Large effects. On rare occasions when methodologically well-done observational
studies yield large, consistent and precise estimates of the magnitude of an inter-
vention effect, one may be particularly confident in the results. A large estimated
effect (e.g. RR > 2 or RR < 0.5) in the absence of plausible confounders, or a very large
effect (e.g. RR > 5 or RR < 0.2) in studies with no major threats to validity, might
qualify for this. In these situations, while the NRSI may possibly have provided
an over-estimate of the true effect, the weak study design may not explain all of
the apparent observed benefit. Thus, despite reservations based on the observa-
tional study design, review authors are confident that the effect exists. The magni-
tude of the effect in these studies may move the assigned certainty of evidence from
low to moderate (if the effect is large in the absence of other methodological limita-
tions). For example, a meta-analysis of observational studies showed that bicycle
helmets reduce the risk of head injuries in cyclists by a large margin (odds ratio
(OR) 0.31, 95% CI 0.26 to 0.37) (Thompson et al 2000). This large effect, in the
absence of obvious bias that could create the association, suggests a rating of
moderate-certainty evidence.
Note: GRADE guidance suggests the possibility of rating up one level for a large
effect if the relative effect is greater than 2.0. However, if the point estimate of
the relative effect is greater than 2.0, but the confidence interval is appreciably
below 2.0, then some hesitation would be appropriate in the decision to rate up
for a large effect. Another situation allows inference of a strong association without
a formal comparative study. Consider the question of the impact of routine colon-
oscopy versus no screening for colon cancer on the rate of perforation associated
with colonoscopy. Here, a large series of representative patients undergoing colon-
oscopy may provide high certainty evidence about the risk of perforation associated
with colonoscopy. When the risk of the event among patients receiving the relevant
comparator is known to be near 0 (i.e. we are certain that the incidence of sponta-
neous colon perforation in patients not undergoing colonoscopy is extremely low),
case series or cohort studies of representative patients can provide high certainty
evidence of adverse effects associated with an intervention, thereby allowing us
to infer a strong association from even a limited number of events.
396
14.2 Assessing the certainty or quality of a body of evidence
Table 14.3.a Framework for describing the certainty of evidence and justifying downgrading or
upgrading
Large effects (upgrading): describe the magnitude of the effect and the widths of the associated confidence intervals. Example: ‘Upgraded because the RR is large: 0.3 (95% CI 0.2 to 0.4), with a sufficient number of events to be precise.’
Dose response (upgrading): the studies show a clear dose-response relation, with increases in the outcome (e.g. lung cancer) at higher exposure levels. Example: ‘Upgraded because the dose-response relation shows a relative risk increase of 10% in never smokers, 15% in smokers of 10 pack years and 20% in smokers of 15 pack years.’
Opposing plausible residual bias and confounding (upgrading): describe which opposing plausible biases and confounders may not have been considered. Example: ‘The estimate of effect is not controlled for the following possible confounders: smoking and degree of education; but the distribution of these factors in the studies is likely to lead to an under-estimate of the true effect. The certainty of the evidence was increased.’
Chapter 15 (Section 15.6) describes in more detail how the overall GRADE assessment
across all domains can be used to draw conclusions about the effects of the interven-
tion, as well as providing implications for future research.
Funding: This work was in part supported by funding from the Michael G DeGroote
Cochrane Canada Centre and the Ontario Ministry of Health.
14.5 References
Alonso-Coello P, Zhou Q, Martinez-Zapata MJ, Mills E, Heels-Ansdell D, Johanson JF, Guyatt
G. Meta-analysis of flavonoids for the treatment of haemorrhoids. British Journal of
Surgery 2006; 93: 909–920.
Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, Guyatt GH, Harbour RT, Haugh
MC, Henry D, Hill S, Jaeschke R, Leng G, Liberati A, Magrini N, Mason J, Middleton P,
Mrukowicz J, O’Connell D, Oxman AD, Phillips B, Schünemann HJ, Edejer TT, Varonen H,
Vist GE, Williams JW, Jr., Zaza S. Grading quality of evidence and strength of
recommendations. BMJ 2004; 328: 1490.
Balshem H, Helfand M, Schünemann HJ, Oxman AD, Kunz R, Brozek J, Vist GE, Falck-Ytter Y,
Meerpohl J, Norris S, Guyatt GH. GRADE guidelines: 3. Rating the quality of evidence.
Journal of Clinical Epidemiology 2011; 64: 401–406.
Bhandari M, Busse JW, Jackowski D, Montori VM, Schünemann H, Sprague S, Mears D,
Schemitsch EH, Heels-Ansdell D, Devereaux PJ. Association between industry funding and
statistically significant pro-industry findings in medical and surgical randomized trials.
Canadian Medical Association Journal 2004; 170: 477–480.
Brophy JM, Joseph L, Rouleau JL. Beta-blockers in congestive heart failure: a Bayesian
meta-analysis. Annals of Internal Medicine 2001; 134: 550–560.
Carrasco-Labra A, Brignardello-Petersen R, Santesso N, Neumann I, Mustafa RA, Mbuagbaw
L, Etxeandia Ikobaltzeta I, De Stio C, McCullagh LJ, Alonso-Coello P, Meerpohl JJ, Vandvik
PO, Brozek JL, Akl EA, Bossuyt P, Churchill R, Glenton C, Rosenbaum S, Tugwell P, Welch
V, Garner P, Guyatt G, Schünemann HJ. Improving GRADE evidence tables part 1: a
randomized trial shows improved understanding of content in summary of findings
tables with a new format. Journal of Clinical Epidemiology 2016; 74: 7–18.
Deeks JJ, Altman DG. Effect measures for meta-analysis of trials with binary outcomes.
In: Egger M, Davey Smith G, Altman DG, editors. Systematic Reviews in Health Care:
Meta-analysis in Context. 2nd ed. London (UK): BMJ Publication Group; 2001. pp. 313–335.
Devereaux PJ, Choi PT, Lacchetti C, Weaver B, Schünemann HJ, Haines T, Lavis JN, Grant BJ,
Haslam DR, Bhandari M, Sullivan T, Cook DJ, Walter SD, Meade M, Khan H, Bhatnagar N,
Guyatt GH. A systematic review and meta-analysis of studies comparing mortality rates of
private for-profit and private not-for-profit hospitals. Canadian Medical Association
Journal 2002; 166: 1399–1406.
Engels EA, Schmid CH, Terrin N, Olkin I, Lau J. Heterogeneity and statistical significance in
meta-analysis: an empirical study of 125 meta-analyses. Statistics in Medicine 2000; 19:
1707–1728.
Furukawa TA, Guyatt GH, Griffith LE. Can we individualize the ‘number needed to treat’? An
empirical study of summary effect measures in meta-analyses. International Journal of
Epidemiology 2002; 31: 72–76.
Gibson JN, Waddell G. Surgical interventions for lumbar disc prolapse: updated Cochrane
Review. Spine 2007; 32: 1735–1747.
Guyatt G, Oxman A, Vist G, Kunz R, Falck-Ytter Y, Alonso-Coello P, Schünemann H. GRADE: an
emerging consensus on rating quality of evidence and strength of recommendations. BMJ
2008; 336: 3.
Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, Falck-Ytter Y, Glasziou P,
DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schünemann HJ. GRADE guidelines: 1.
Introduction-GRADE evidence profiles and summary of findings tables. Journal of Clinical
Epidemiology 2011a; 64: 383–394.
Guyatt GH, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, Devereaux PJ, Montori VM,
Freyschuss B, Vist G, Jaeschke R, Williams JW, Jr., Murad MH, Sinclair D, Falck-Ytter Y, Meerpohl
J, Whittington C, Thorlund K, Andrews J, Schünemann HJ. GRADE guidelines 6. Rating the
quality of evidence–imprecision. Journal of Clinical Epidemiology 2011b; 64: 1283–1293.
Iorio A, Spencer FA, Falavigna M, Alba C, Lang E, Burnand B, McGinn T, Hayden J, Williams K,
Shea B, Wolff R, Kujpers T, Perel P, Vandvik PO, Glasziou P, Schünemann H, Guyatt G. Use
of GRADE for assessment of evidence about prognosis: rating confidence in estimates of
event rates in broad categories of patients. BMJ 2015; 350: h870.
Langendam M, Carrasco-Labra A, Santesso N, Mustafa RA, Brignardello-Petersen R,
Ventresca M, Heus P, Lasserson T, Moustgaard R, Brozek J, Schünemann HJ. Improving
GRADE evidence tables part 2: a systematic survey of explanatory notes shows more
guidance is needed. Journal of Clinical Epidemiology 2016; 74: 19–27.
Levine MN, Raskob G, Landefeld S, Kearon C, Schulman S. Hemorrhagic complications of
anticoagulant treatment: the Seventh ACCP Conference on Antithrombotic and
Thrombolytic Therapy. Chest 2004; 126: 287S–310S.
Orrapin S, Rerkasem K. Carotid endarterectomy for symptomatic carotid stenosis. Cochrane
Database of Systematic Reviews 2017; 6: CD001081.
Salpeter S, Greyber E, Pasternak G, Salpeter E. Risk of fatal and nonfatal lactic acidosis with
metformin use in type 2 diabetes mellitus. Cochrane Database of Systematic Reviews 2007;
4: CD002967.
Santesso N, Carrasco-Labra A, Langendam M, Brignardello-Petersen R, Mustafa RA, Heus P,
Lasserson T, Opiyo N, Kunnamo I, Sinclair D, Garner P, Treweek S, Tovey D, Akl EA, Tugwell
P, Brozek JL, Guyatt G, Schünemann HJ. Improving GRADE evidence tables part 3:
detailed guidance for explanatory footnotes supports creating and understanding GRADE
certainty in the evidence judgments. Journal of Clinical Epidemiology 2016; 74: 28–39.
Schünemann HJ, Best D, Vist G, Oxman AD; GRADE Working Group. Letters, numbers, symbols and
words: how to communicate grades of evidence and recommendations. Canadian
Medical Association Journal 2003; 169: 677–680.
Schünemann HJ, Jaeschke R, Cook DJ, Bria WF, El-Solh AA, Ernst A, Fahy BF, Gould MK,
Horan KL, Krishnan JA, Manthous CA, Maurer JR, McNicholas WT, Oxman AD, Rubenfeld G,
Turino GM, Guyatt G. An official ATS statement: grading the quality of evidence and
strength of recommendations in ATS guidelines and recommendations. American Journal
of Respiratory and Critical Care Medicine 2006; 174: 605–614.
Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, Williams JW Jr, Kunz
R, Craig J, Montori VM, Bossuyt P, Guyatt GH. Grading quality of evidence and strength of
recommendations for diagnostic tests and strategies. BMJ 2008a; 336: 1106–1110.
Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Bossuyt P, Chang S, Muti P, Jaeschke R,
Guyatt GH. GRADE: assessing the quality of evidence for diagnostic recommendations.
ACP Journal Club 2008b; 149: 2.
Schünemann HJ, Mustafa R, Brozek J. [Diagnostic accuracy and linked evidence–testing the
chain.] Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen 2012; 106:
153–160.
Schünemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G,
Helfand M. Non-randomized studies as a source of complementary, sequential or
replacement evidence for randomized controlled trials in systematic reviews on the
effects of interventions. Research Synthesis Methods 2013; 4: 49–62.
Schünemann HJ. Interpreting GRADE’s levels of certainty or quality of the evidence: GRADE
for statisticians, considering review information size or less emphasis on imprecision?
Journal of Clinical Epidemiology 2016; 75: 6–15.
Schünemann HJ, Cuello C, Akl EA, Mustafa RA, Meerpohl JJ, Thayer K, Morgan RL, Gartlehner
G, Kunz R, Katikireddi SV, Sterne J, Higgins JPT, Guyatt G; GRADE Working Group. GRADE
guidelines: 18. How ROBINS-I and other tools to assess risk of bias in nonrandomized
studies should be used to rate the certainty of a body of evidence. Journal of Clinical
Epidemiology 2018.
Spencer-Bonilla G, Quinones AR, Montori VM, International Minimally Disruptive Medicine
Workgroup. Assessing the burden of treatment. Journal of General Internal Medicine 2017;
32: 1141–1145.
Spencer FA, Iorio A, You J, Murad MH, Schünemann HJ, Vandvik PO, Crowther MA, Pottie K,
Lang ES, Meerpohl JJ, Falck-Ytter Y, Alonso-Coello P, Guyatt GH. Uncertainties in baseline
risk estimates and confidence in treatment effects. BMJ 2012; 345: e7401.
Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman
DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A,
Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L,
Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC,
Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing
risk of bias in non-randomised studies of interventions. BMJ 2016; 355: i4919.
Thompson DC, Rivara FP, Thompson R. Helmets for preventing head and facial injuries in
bicyclists. Cochrane Database of Systematic Reviews 2000; 2: CD001855.
Tierney JF, Stewart LA, Ghersi D, Burdett S, Sydes MR. Practical methods for incorporating
summary time-to-event data into meta-analysis. Trials 2007; 8: 16.
van Dalen EC, Tierney JF, Kremer LCM. Tips and tricks for understanding and using SR
results. No. 7: time-to-event data. Evidence-Based Child Health 2007; 2: 1089–1090.
15 Interpreting results and drawing conclusions
Holger J Schünemann, Gunn E Vist, Julian PT Higgins, Nancy Santesso,
Jonathan J Deeks, Paul Glasziou, Elie A Akl, Gordon H Guyatt; on behalf
of the Cochrane GRADEing Methods Group
KEY POINTS
• … communicate the conclusions of the review effectively.
• Methods are presented for computing, presenting and interpreting relative and absolute effects for dichotomous outcome data, including the number needed to treat (NNT).
• For continuous outcome measures, review authors can present summary results for studies using natural units of measurement or as minimal important differences when all studies use the same scale. When studies measure the same construct but with different scales, review authors will need to find a way to interpret the standardized mean difference, or to use an alternative effect measure for the meta-analysis such as the ratio of means.
• Review authors should not describe results as ‘statistically significant’, ‘not statistically significant’ or ‘non-significant’ or unduly rely on thresholds for P values, but report the confidence interval together with the exact P value.
• Review authors should not make recommendations about healthcare decisions, but they can – after describing the certainty of evidence and the balance of benefits and harms – highlight different actions that might be consistent with particular patterns of values and preferences and other factors that determine a decision such as cost.
15.1 Introduction
The purpose of Cochrane Reviews is to facilitate healthcare decisions by patients and
the general public, clinicians, guideline developers, administrators and policy makers.
They also inform future research. A clear statement of findings, a considered discussion
This chapter should be cited as: Schünemann HJ, Vist GE, Higgins JPT, Santesso N, Deeks JJ, Glasziou P,
Akl EA, Guyatt GH. Chapter 15: Interpreting results and drawing conclusions. In: Higgins JPT, Thomas J,
Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews
of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 403–432.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
and a clear presentation of the authors’ conclusions are, therefore, important parts of
the review. In particular, the following issues can help people make better informed
decisions and increase the usability of Cochrane Reviews:
In this chapter, we address first one of the key aspects of interpreting findings that is
also fundamental in completing a ‘Summary of findings’ table: the certainty of evidence
related to each of the outcomes. We then provide a more detailed consideration of
issues around applicability and around interpretation of numerical results, and provide
suggestions for presenting authors’ conclusions.
15.3 Interpreting results of statistical analyses
effect?’ A confidence interval may be reported for any level of confidence (although
they are most commonly reported for 95%, and sometimes 90% or 99%). For example,
the odds ratio of 0.80 could be reported with an 80% confidence interval of 0.73 to 0.88;
a 90% interval of 0.72 to 0.89; and a 95% interval of 0.70 to 0.92. As the confidence level
increases, the confidence interval widens.
There is a logical correspondence between the confidence interval and the P value (see
Section 15.3.3). The 95% confidence interval for an effect will exclude the null value
(such as an odds ratio of 1.0 or a risk difference of 0) if and only if the test of significance
yields a P value of less than 0.05. If the P value is exactly 0.05, then either the upper or
lower limit of the 95% confidence interval will be at the null value. Similarly, the 99%
confidence interval will exclude the null if and only if the test of significance yields a
P value of less than 0.01.
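This correspondence can be checked numerically. The sketch below is an illustration of ours (the standard error of 0.07 is chosen simply to reproduce, approximately, the odds ratio example above); it computes a Wald confidence interval and two-sided P value on the log odds scale.

```python
import math

def norm_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ci_and_p(log_or, se, z=1.96):
    """95% Wald confidence interval and two-sided P value for a log odds ratio."""
    lower, upper = math.exp(log_or - z * se), math.exp(log_or + z * se)
    p = 2 * (1 - norm_cdf(abs(log_or) / se))
    return (lower, upper), p

(lower, upper), p = ci_and_p(math.log(0.80), se=0.07)
print(f"OR 0.80, 95% CI {lower:.2f} to {upper:.2f}, P = {p:.4f}")
# The 95% CI excludes the null value (OR = 1.0) exactly when P < 0.05.
```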
Together, the point estimate and confidence interval provide information to assess
the effects of the intervention on the outcome. For example, suppose that we are eval-
uating an intervention that reduces the risk of an event and we decide that it would be
useful only if it reduced the risk of an event from 30% by at least 5 percentage points to
25% (these values will depend on the specific clinical scenario and outcomes, including
the anticipated harms). If the meta-analysis yielded an effect estimate of a reduction of
10 percentage points with a tight 95% confidence interval, say, from 7% to 13%, we
would be able to conclude that the intervention was useful since both the point esti-
mate and the entire range of the interval exceed our criterion of a reduction of 5% for
net health benefit. However, if the meta-analysis reported the same risk reduction of
10% but with a wider interval, say, from 2% to 18%, although we would still conclude
that our best estimate of the intervention effect is that it provides net benefit, we could
not be so confident as we still entertain the possibility that the effect could be between
2% and 5%. If the confidence interval was wider still, and included the null value of a
difference of 0%, we would still consider the possibility that the intervention has no
effect on the outcome whatsoever, and would need to be even more sceptical in
our conclusions.
Review authors may use the same general approach to conclude that an intervention
is not useful. Continuing with the above example where the criterion for an important
difference that should be achieved to provide more benefit than harm is a 5% risk dif-
ference, an effect estimate of 2% with a 95% confidence interval of 1% to 4% suggests
that the intervention does not provide net health benefit.
than particular threshold values. In particular, P values less than 0.05 are often
reported as ‘statistically significant’, and interpreted as being small enough to justify
rejection of the null hypothesis. However, the 0.05 threshold is an arbitrary one that
became commonly used in medical and psychological research largely because
P values were determined by comparing the test statistic against tabulations of spe-
cific percentage points of statistical distributions. If review authors decide to present
a P value with the results of a meta-analysis, they should report a precise P value
(as calculated by most statistical software), together with the 95% confidence inter-
val. Review authors should not describe results as ‘statistically significant’, ‘not
statistically significant’ or ‘non-significant’ or unduly rely on thresholds for
P values, but report the confidence interval together with the exact P value (see
MECIR Box 15.3.a).
We discuss interpretation of the test for heterogeneity in Chapter 10 (Section 10.10.2);
the remainder of this section refers mainly to tests for an overall effect. For tests of an
overall effect, the computation of P involves both the effect estimate and precision of
the effect estimate (driven largely by sample size). As precision increases, the range of
plausible effects that could occur by chance is reduced. Correspondingly, the statistical
significance of an effect of a particular magnitude will usually be greater (the P value
will be smaller) in a larger study than in a smaller study.
P values are commonly misinterpreted in two ways. First, a moderate or large
P value (e.g. greater than 0.05) may be misinterpreted as evidence that the interven-
tion has no effect on the outcome. There is an important difference between this
statement and the correct interpretation that there is a high probability that the
observed effect on the outcome is due to chance alone. To avoid such a misinterpre-
tation, review authors should always examine the effect estimate and its 95% confi-
dence interval.
The second misinterpretation is to assume that a result with a small P value for the
summary effect estimate implies that an experimental intervention has an important
benefit. Such a misinterpretation is more likely to occur in large studies and meta-
analyses that accumulate data over dozens of studies and thousands of participants.
The P value addresses the question of whether the experimental intervention effect is
precisely nil; it does not examine whether the effect is of a magnitude of importance to
potential recipients of the intervention. In a large study, a small P value may represent
the detection of a trivial effect that may not lead to net health benefit when compared
with the potential harms (i.e. harmful effects on other important outcomes). Again,
inspection of the point estimate and confidence interval helps correct interpretations
(see Section 15.3.1).
15.4 Interpreting results from dichotomous outcomes
As discussed in Chapter 6 (Section 6.4.1), there are several measures for comparing dichotomous outcomes in
two groups. Meta-analyses are usually undertaken using risk ratios (RR), odds ratios
(OR) or risk differences (RD), but there are several alternative ways of expressing
results.
Relative risk reduction (RRR) is a convenient way of re-expressing a risk ratio as a
percentage reduction:
RRR = 100 × (1 − RR)
For example, a risk ratio of 0.75 translates to a relative risk reduction of 25%, as in the
example above.
The risk difference is often referred to as the absolute risk reduction (ARR) or abso-
lute risk increase (ARI), and may be presented as a percentage (e.g. 1%), as a decimal
(e.g. 0.01), or as a count (e.g. 10 out of 1000). We consider different choices for present-
ing absolute effects in Section 15.4.3. We then describe computations for obtaining
these numbers from the results of individual studies and of meta-analyses in
Section 15.4.4.
1) since the NNT is derived from the risk difference, it is still a comparative measure of
effect (experimental versus a specific comparator) and not a general property of a
single intervention; and
2) the NNT gives an ‘expected value’. For example, NNT = 10 does not imply that one
additional event will occur in each and every group of 10 people.
NNTs can be computed for both beneficial and detrimental events, and for interven-
tions that cause both improvements and deteriorations in outcomes. In all instances
NNTs are expressed as positive whole numbers. Some authors use the term ‘number
needed to harm’ (NNH) when an intervention leads to an adverse outcome, or a
decrease in a positive outcome, rather than improvement. However, this phrase can
be misleading (most notably, it can easily be read to imply the number of people
who will experience a harmful outcome if given the intervention), and it is strongly
recommended that ‘number needed to harm’ and ‘NNH’ are avoided. The preferred
alternative is to use phrases such as ‘number needed to treat for an additional bene-
ficial outcome’ (NNTB) and ‘number needed to treat for an additional harmful out-
come’ (NNTH) to indicate direction of effect.
As NNTs refer to events, their interpretation needs to be worded carefully when the
binary outcome is a dichotomization of a scale-based outcome. For example, if the
outcome is pain measured on a ‘none, mild, moderate or severe’ scale it may have been
dichotomized as ‘none or mild’ versus ‘moderate or severe’. It would be inappropriate
for an NNT from these data to be referred to as an ‘NNT for pain’. It is an ‘NNT for mod-
erate or severe pain’.
Hart 2005). Among high-risk atrial fibrillation patients with prior stroke or transient
ischaemic attack who have stroke rates of about 12% (120 per 1000) per year, warfarin
prevents about 70 strokes yearly per 1000 patients, whereas for low-risk atrial fibrilla-
tion patients (with a stroke rate of about 2% per year or 20 per 1000), warfarin prevents
only 12 strokes. This presentation helps users to understand the important impact that
typical baseline risks have on the absolute benefit that they can expect.
15.4.4 Computations
Direct computation of risk difference (RD) or a number needed to treat (NNT) depends
on the summary statistic (odds ratio, risk ratio or risk differences) available from the
study or meta-analysis. When expressing results of meta-analyses, review authors
should use, in the computations, whatever statistic they determined to be the most
appropriate summary for meta-analysis (see Chapter 10, Section 10.4.3). Here we pres-
ent calculations to obtain RD as a reduction in the number of participants per 1000. For
example, a risk difference of –0.133 corresponds to 133 fewer participants with the
event per 1000.
RDs and NNTs should not be computed from the aggregated total numbers of parti-
cipants and events across the trials. This approach ignores the randomization within
studies, and may produce seriously misleading results if there is unbalanced random-
ization in any of the studies. Using the pooled result of a meta-analysis is more appro-
priate. When computing NNTs, the values obtained are by convention always rounded
up to the next whole number.
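As a minimal sketch of these computations (the helper names are ours; the RR- and OR-based conversions mirror the formulae in Chapter 14, Section 14.1.5), the following Python obtains a risk difference per 1000 from a meta-analytic summary statistic and an assumed comparator risk, and rounds the NNT up by convention:

```python
import math

def rd_per_1000_from_rr(acr, rr):
    """Risk difference per 1000 from a meta-analytic risk ratio and an assumed
    comparator risk (ACR, a proportion); negative means fewer events."""
    return 1000 * acr * (rr - 1)

def rd_per_1000_from_or(acr, or_):
    """Risk difference per 1000 from a meta-analytic odds ratio and ACR."""
    return 1000 * (or_ * acr / (1 - acr + or_ * acr) - acr)

def nnt(rd_per_1000_value):
    """Number needed to treat, rounded up to the next whole number."""
    return math.ceil(1000 / abs(rd_per_1000_value))

rd = rd_per_1000_from_rr(acr=0.30, rr=0.75)  # RR 0.75 at a 30% comparator risk
print(rd, nnt(rd))                            # -> -75.0 per 1000, NNT of 14
```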
interventions. However, the units of such outcomes may be difficult to interpret, particularly when they relate to rating scales (again, see the oedema row of Chapter 14, Figure 14.1.a). ‘Summary of findings’ tables should include the minimum and maximum
of the scale of measurement, and the direction. Knowledge of the smallest change in
instrument score that patients perceive is important – the minimal important differ-
ence (MID) – and can greatly facilitate the interpretation of results (Guyatt et al
1998, Schünemann and Guyatt 2005). Knowing the MID allows review authors and users
to place results in context. Review authors should state the MID – if known – in the Com-
ments column of their ‘Summary of findings’ table. For example, the chronic respiratory
questionnaire has possible scores in health-related quality of life ranging from 1 to 7
and 0.5 represents a well-established MID (Jaeschke et al 1989, Schünemann
et al 2005).
Table 15.5.a Approaches and their implications to presenting results of continuous variables when
primary studies have used different instruments to measure the same construct. Adapted from Guyatt
et al (2013b)
15.5.3.1 Presenting and interpreting SMDs using generic effect size estimates
The SMD expresses the intervention effect in standard units rather than the original
units of measurement. The SMD is the difference in mean effects between the exper-
imental and comparator groups divided by the pooled standard deviation of partici-
pants’ outcomes, or external SDs when studies are very small (see Chapter 6,
Section 6.5.1.2). The value of a SMD thus depends on both the size of the effect (the
difference between means) and the standard deviation of the outcomes (the inherent
variability among participants or based on an external SD).
If review authors use the SMD, they might choose to present the results directly as
SMDs (row 1a, Table 15.5.a and Table 15.5.b). However, absolute values of the interven-
tion and comparison groups are typically not useful because studies have used differ-
ent measurement instruments with different units. Guiding rules for interpreting SMDs
(or ‘Cohen’s effect sizes’) exist, and have arisen mainly from researchers in the social
sciences (Cohen 1988). One example is as follows: 0.2 represents a small effect, 0.5 a
moderate effect and 0.8 a large effect (Cohen 1988). Variations exist (e.g. < 0.40 = small,
0.40 to 0.70 = moderate, > 0.70 = large). Review authors might consider including such a
guiding rule in interpreting the SMD in the text of the review, and in summary versions
such as the Comments column of a ‘Summary of findings’ table. However, some meth-
odologists believe that such interpretations are problematic because patient impor-
tance of a finding is context-dependent and not amenable to generic statements.
Table 15.5.b Application of approaches when studies have used different measures: effects of dexamethasone for pain after laparoscopic cholecystectomy (Karanicolas et al 2008). Reproduced with permission of Wolters Kluwer

1a. Post-operative pain, standard deviation units (investigators measured pain using different instruments; lower scores mean less pain). Effect: the pain score in the dexamethasone groups was on average 0.79 SDs (95% CI 0.17 to 1.41) lower than in the placebo groups. Participants (studies): 539 (5). Certainty¹: low².³. Comment: as a rule of thumb, 0.2 SD represents a small difference, 0.5 a moderate and 0.8 a large one.

1b. Post-operative pain, measured on a scale from 0 (no pain) to 100 (worst pain imaginable). Comparator risk: the mean post-operative pain scores with placebo ranged from 43 to 54. Effect: the mean pain score in the intervention groups was on average 8.1 (95% CI 1.8 to 14.5) lower. Participants (studies): 539 (5). Certainty¹: low².³. Comment: scores calculated based on an SMD of –0.79 (95% CI –1.41 to –0.17) and rescaled to a 0 to 100 pain scale; the minimal important difference on the 0 to 100 pain scale is approximately 10.

1c. Substantial post-operative pain, dichotomized (investigators measured pain using different instruments). Comparator risk: 20 per 100⁴. Effect: 15 more (4 more to 18 more) per 100 patients in the dexamethasone group achieved important improvement in the pain score. Relative effect: RR = 0.25 (95% CI 0.05 to 0.75). Participants (studies): 539 (5). Certainty¹: low².³. Comment: scores estimated based on an SMD of –0.79 (95% CI –1.41 to –0.17).

2. Post-operative pain (investigators measured pain using different instruments; lower scores mean less pain). Comparator risk: the mean post-operative pain score with placebo was 28.1⁵. Effect: on average a 3.7 lower pain score (0.6 to 6.1 lower). Relative effect: ratio of means 0.87 (0.78 to 0.98). Participants (studies): 539 (5). Certainty¹: low².³. Comment: weighted average of the mean pain score in the dexamethasone group divided by the mean pain score in placebo.

3. Post-operative pain (investigators measured pain using different instruments). Effect: the pain score in the dexamethasone groups was on average 0.40 (95% CI 0.07 to 0.74) minimal important difference units less than in the control group. Participants (studies): 539 (5). Certainty¹: low².³. Comment: an effect less than half the minimal important difference suggests a small or very small effect.

¹ Certainty rated according to GRADE from very low to high certainty.
² Substantial unexplained heterogeneity in study results.
³ Imprecision due to wide confidence intervals.
⁴ The 20% comes from the proportion in the control group requiring rescue analgesia.
⁵ Crude (arithmetic) means of the post-operative pain mean responses across all five trials when transformed to a 100-point scale.
Table 15.5.c Risk difference derived for specific SMDs for various given ‘proportions improved’ in the comparator group (Furukawa 1999, Guyatt et al 2013b). Reproduced with permission of Elsevier

Comparator group response proportion: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9

Situations in which the event is undesirable, reduction (or increase if intervention harmful) in adverse events with the intervention:
SMD = −0.2: −3%, −5%, −7%, −8%, −8%, −8%, −7%, −6%, −4%
SMD = −0.5: −6%, −11%, −15%, −17%, −19%, −20%, −20%, −17%, −12%
SMD = −0.8: −8%, −15%, −21%, −25%, −29%, −31%, −31%, −28%, −22%
SMD = −1.0: −9%, −17%, −24%, −30%, −34%, −37%, −38%, −36%, −29%

Situations in which the event is desirable, increase (or decrease if intervention harmful) in positive responses to the intervention:
SMD = 0.2: 4%, 6%, 7%, 8%, 8%, 8%, 7%, 5%, 3%
SMD = 0.5: 12%, 17%, 19%, 20%, 19%, 17%, 15%, 11%, 6%
SMD = 0.8: 22%, 28%, 31%, 31%, 29%, 25%, 21%, 15%, 8%
SMD = 1.0: 29%, 36%, 38%, 38%, 34%, 30%, 24%, 17%, 9%
(or approximately 1.81 × SMD). The resulting odds ratio can then be presented as nor-
mal, and in a ‘Summary of findings’ table, combined with an assumed comparator
group risk to be expressed as an absolute risk difference. The comparator group risk
in this case would refer to the proportion of people who have achieved a specific value
of the continuous outcome. In randomized trials this can be interpreted as the propor-
tion who have improved by some (specified) amount (responders), for instance by 5
points on a 0 to 100 scale. Table 15.5.c shows some illustrative results from this
method. The risk differences can then be converted to NNTs or to people per thousand
using methods described in Section 15.4.4.
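A short sketch of both routes follows. This is our illustration only: it uses the logistic conversion ln(OR) = SMD × π/√3 (approximately 1.81 × SMD) described above, alongside an alternative that assumes normally distributed responses, which appears to underlie the values in Table 15.5.c; the two give similar but not identical answers.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Inverse normal CDF by bisection; adequate for illustration.
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

def rd_via_or(smd, p):
    """Risk difference via the logistic conversion ln(OR) = SMD * pi / sqrt(3)."""
    or_ = math.exp(smd * math.pi / math.sqrt(3))
    return or_ * p / (1 - p + or_ * p) - p

def rd_via_normal(smd, p):
    """Risk difference assuming normally distributed responses."""
    return norm_cdf(norm_ppf(p) + smd) - p

p = 0.4  # proportion responding in the comparator group
print(f"{rd_via_or(0.5, p):+.0%} vs {rd_via_normal(0.5, p):+.0%}")  # +22% vs +20%
```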
mean is very small, in which case even a modest difference from the intervention group
will yield a large and therefore misleading ratio of means. It also requires that separate
ratios of means be calculated for each included study, and then entered into a generic
inverse variance meta-analysis (see Chapter 10, Section 10.3).
The ratio of means approach illustrated in Table 15.5.b suggests a relative reduction
in pain of only 13%, meaning that those receiving steroids have a pain severity 87% of
those in the comparator group, an effect that might be considered modest.
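For illustration only, the sketch below computes per-study log ratios of means with a delta-method standard error (a common choice in the methods literature, though not prescribed here) and pools them by generic inverse variance; the study numbers are invented.

```python
import math

def log_rom_and_se(m1, sd1, n1, m2, sd2, n2):
    """Per-study log ratio of means (intervention/comparator) and a
    delta-method standard error."""
    return (math.log(m1 / m2),
            math.sqrt(sd1**2 / (n1 * m1**2) + sd2**2 / (n2 * m2**2)))

def inverse_variance_pool(effects_and_ses):
    """Fixed-effect inverse-variance pooling of (effect, SE) pairs."""
    weights = [1 / se**2 for _, se in effects_and_ses]
    pooled = sum(w * e for (e, _), w in zip(effects_and_ses, weights)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

# Two hypothetical trials of pain scores (lower is better):
studies = [log_rom_and_se(24.0, 10.0, 60, 28.0, 11.0, 62),
           log_rom_and_se(25.5, 9.0, 45, 29.0, 10.0, 44)]
pooled, se = inverse_variance_pool(studies)
print(f"Pooled ratio of means = {math.exp(pooled):.2f}")  # < 1 favours intervention
```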
15.6 Drawing conclusions
• P (Population): diagnosis, disease stage, comorbidity, risk factor, sex, age, ethnic group, specific inclusion or exclusion criteria, clinical setting;
• I (Intervention): type, frequency, dose, duration, prognostic factor;
• C (Comparison): placebo, routine care, alternative treatment/management;
• O (Outcome): which clinical or patient-related outcomes will the researcher need to measure, improve, influence or accomplish? Which methods of measurement should be used?
While Cochrane Review authors will find the PICO domains helpful, the domains of
the GRADE certainty framework further support understanding and describing what
additional research will improve the certainty in the available evidence. Note that as
the certainty of the evidence is likely to vary by outcome, these implications will be
specific to certain outcomes in the review. Table 15.6.a shows how review authors
may be aided in their interpretation of the body of evidence and drawing conclusions
about future research and practice.
The review of compression stockings for prevention of deep vein thrombosis (DVT) in
airline passengers described in Chapter 14 provides an example where there is some
convincing evidence of a benefit of the intervention: “This review shows that the ques-
tion of the effects on symptomless DVT of wearing versus not wearing compression
stockings in the types of people studied in these trials should now be regarded as
answered. Further research may be justified to investigate the relative effects of differ-
ent strengths of stockings or of stockings compared to other preventative strategies.
Further randomised trials to address the remaining uncertainty about the effects of
wearing versus not wearing compression stockings on outcomes such as death, pulmo-
nary embolism and symptomatic DVT would need to be large.” (Clarke et al 2016).
A review of therapeutic touch for anxiety disorder provides an example of the impli-
cations for research when no eligible studies had been found: “This review highlights
the need for randomized controlled trials to evaluate the effectiveness of therapeutic
touch in reducing anxiety symptoms in people diagnosed with anxiety disorders. Future
trials need to be rigorous in design and delivery, with subsequent reporting to include
high quality descriptions of all aspects of methodology to enable appraisal and inter-
pretation of results.” (Robinson et al 2007).
Table 15.6.a Implications for research and practice suggested by individual GRADE domains

Risk of bias. Implications for research: need for methodologically better designed and executed studies. Example research statement: ‘All studies suffered from lack of blinding of outcome assessors. Trials of this type are required.’ Implications for practice: the estimates of effect may be biased because of a lack of blinding of the assessors of the outcome.

Inconsistency. Implications for research: unexplained inconsistency: need for individual participant data meta-analysis; need for studies in relevant subgroups. Example research statement: ‘Studies in patients with small cell lung cancer are needed to understand if the effects differ from those in patients with pancreatic cancer.’ Implications for practice: unexplained inconsistency: consider and interpret overall effect estimates as for the overall certainty of a body of evidence; explained inconsistency (if results are not presented in strata): consider and interpret effect estimates by subgroup.

Indirectness. Implications for research: need for studies that better fit the PICO question of interest. Example research statement: ‘Studies in patients with early cancer are needed because the evidence is from studies in patients with advanced cancer.’ Implications for practice: it is uncertain if the results directly apply to the patients or the way that the intervention is applied in a particular setting.

Imprecision. Implications for research: need for more studies with more participants to reach optimal information size. Example research statement: ‘Studies with approximately 200 more events in the experimental intervention group and the comparator intervention group are required.’ Implications for practice: same uncertainty interpretation as for certainty of a body of evidence (e.g. the true effect may be substantially different).

Publication bias. Implications for research: need to investigate and identify unpublished data; large studies might help resolve this issue. Example research statement: ‘Large studies are required.’ Implications for practice: same uncertainty interpretation as for certainty of a body of evidence (e.g. the true effect may be substantially different).

Large effects. Implications for research: no direct implications. Implications for practice: the effect is large in the populations that were included in the studies and the true effect is likely to cross important thresholds.

Dose effects. Implications for research: no direct implications. Implications for practice: the greater the reduction in the exposure, the larger the expected harm (or benefit).

Opposing bias and confounding. Implications for research: studies controlling for the residual bias and confounding are needed. Example research statement: ‘Studies controlling for possible confounders such as smoking and degree of education are required.’ Implications for practice: the effect could be even larger or smaller (depending on the direction of the results) than the one observed in the studies presented here.
these is to consider the results blinded; that is, consider how the results would be pre-
sented and framed in the conclusions if the direction of the results was reversed. If the
confidence interval for the estimate of the difference in the effects of the interventions
overlaps with no effect, the analysis is compatible with both a true beneficial effect and
a true harmful effect. If one of the possibilities is mentioned in the conclusion, the other
possibility should be mentioned as well. Table 15.6.b suggests narrative statements for
drawing conclusions based on the effect estimate from the meta-analysis and the
certainty of the evidence.
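To make this symmetry concrete, here is a minimal sketch (in Python; the function name, wording templates and numbers are illustrative inventions, not Handbook or Cochrane conventions) that checks whether a ratio-scale confidence interval crosses the line of no effect and, if it does, frames both possibilities:

```python
# Illustrative only: framing conclusions symmetrically when a ratio-scale
# confidence interval (e.g. a risk ratio) crosses the line of no effect (1.0).

def frame_conclusion(estimate: float, ci_low: float, ci_high: float) -> str:
    """Return a narrative statement for a ratio measure (null value = 1.0)."""
    if ci_low < 1.0 < ci_high:
        # Compatible with both a true benefit and a true harm: mention both.
        return (f"The estimate ({estimate:.2f}, 95% CI {ci_low:.2f} to "
                f"{ci_high:.2f}) is compatible with both a reduction and an "
                f"increase in the outcome.")
    direction = "reduction" if ci_high < 1.0 else "increase"
    return (f"The 95% CI ({ci_low:.2f} to {ci_high:.2f}) suggests a "
            f"{direction} in the outcome.")

print(frame_conclusion(0.80, 0.55, 1.17))  # crosses 1.0: report both directions
print(frame_conclusion(0.80, 0.66, 0.97))  # excludes 1.0: a reduction
```

Table 15.6.b pairs such effect estimates with the certainty of the evidence to suggest actual narrative statements; the sketch only illustrates the requirement to mention both directions when the interval includes no effect.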
Another common mistake is to reach conclusions that go beyond the evidence. Often
this is done implicitly, without referring to the additional information or judgements
that are used in reaching conclusions about the implications of a review for practice.
Even when additional information and explicit judgements support conclusions about
the implications of a review for practice, review authors rarely conduct systematic
reviews of the additional information. Furthermore, implications for practice are often
dependent on specific circumstances and values that must be taken into consideration.
As we have noted, review authors should always be cautious when drawing conclusions
about implications for practice and they should not make recommendations.
Funding: This work was in part supported by funding from the Michael G DeGroote
Cochrane Canada Centre and the Ontario Ministry of Health. JJD receives support from
the National Institute for Health Research (NIHR) Birmingham Biomedical Research
Centre at the University Hospitals Birmingham NHS Foundation Trust and the Univer-
sity of Birmingham. JPTH receives support from the NIHR Biomedical Research Centre
at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. The
views expressed are those of the author(s) and not necessarily those of the NHS, the
NIHR or the Department of Health.
15.8 References
Aguilar MI, Hart R. Oral anticoagulants for preventing stroke in patients with non-valvular
atrial fibrillation and no previous history of stroke or transient ischemic attacks. Cochrane
Database of Systematic Reviews 2005; 3: CD001927.
Aguilar MI, Hart R, Pearce LA. Oral anticoagulants versus antiplatelet therapy for preventing
stroke in patients with non-valvular atrial fibrillation and no history of stroke or transient
ischemic attacks. Cochrane Database of Systematic Reviews 2007; 3: CD006186.
Akl EA, Gunukula S, Barba M, Yosuico VE, van Doormaal FF, Kuipers S, Middeldorp S,
Dickinson HO, Bryant A, Schünemann H. Parenteral anticoagulation in patients with
cancer who have no therapeutic or prophylactic indication for anticoagulation. Cochrane
Database of Systematic Reviews 2011a; 1: CD006652.
Akl EA, Oxman AD, Herrin J, Vist GE, Terrenato I, Sperati F, Costiniuk C, Blank D, Schünemann
H. Using alternative statistical formats for presenting risks and risk reductions. Cochrane
Database of Systematic Reviews 2011b; 3: CD006776.
Alonso-Coello P, Schünemann HJ, Moberg J, Brignardello-Petersen R, Akl EA, Davoli M,
Treweek S, Mustafa RA, Rada G, Rosenbaum S, Morelli A, Guyatt GH, Oxman AD, GRADE
Working Group. GRADE Evidence to Decision (EtD) frameworks: a systematic and
transparent approach to making well informed healthcare choices. 1: Introduction. BMJ
2016; 353: i2016.
Altman DG. Confidence intervals for the number needed to treat. BMJ 1998; 317: 1309–1312.
Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, Guyatt GH, Harbour RT, Haugh
MC, Henry D, Hill S, Jaeschke R, Leng G, Liberati A, Magrini N, Mason J, Middleton P,
Mrukowicz J, O’Connell D, Oxman AD, Phillips B, Schünemann HJ, Edejer TT, Varonen H,
Vist GE, Williams JW, Jr., Zaza S. Grading quality of evidence and strength of
recommendations. BMJ 2004; 328: 1490.
Brown P, Brunnhuber K, Chalkidou K, Chalmers I, Clarke M, Fenton M, Forbes C, Glanville J,
Hicks NJ, Moody J, Twaddle S, Timimi H, Young P. How to formulate research
recommendations. BMJ 2006; 333: 804–806.
Cates C. Confidence intervals for the number needed to treat: pooling numbers needed to
treat may not be reliable. BMJ 1999; 318: 1764–1765.
Clarke MJ, Broderick C, Hopewell S, Juszczak E, Eisinga A. Compression stockings for
preventing deep vein thrombosis in airline passengers. Cochrane Database of Systematic
Reviews 2016; 9: CD004002.
Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale (NJ):
Lawrence Erlbaum Associates, Inc.; 1988.
Coleman T, Chamberlain C, Davey MA, Cooper SE, Leonardi-Bee J. Pharmacological
interventions for promoting smoking cessation during pregnancy. Cochrane Database of
Systematic Reviews 2015; 12: CD010078.
Dans AM, Dans L, Oxman AD, Robinson V, Acuin J, Tugwell P, Dennis R, Kang D.
Assessing equity in clinical practice guidelines. Journal of Clinical Epidemiology 2007;
60: 540–546.
Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 2nd ed. Littleton (MA):
John Wright PSG, Inc.; 1985.
Friedrich JO, Adhikari NK, Beyene J. The ratio of means method as an alternative to mean
differences for analyzing continuous outcome variables in meta-analysis: a simulation
study. BMC Medical Research Methodology 2008; 8: 32.
Furukawa T. From effect size into number needed to treat. Lancet 1999; 353: 1680.
Graham R, Mancher M, Wolman DM, Greenfield S, Steinberg E. Committee on Standards for
Developing Trustworthy Clinical Practice Guidelines, Board on Health Care Services: Clinical
Practice Guidelines We Can Trust. Washington, DC: National Academies Press; 2011.
Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, Falck-Ytter Y, Glasziou P,
DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schünemann HJ. GRADE guidelines: 1.
Introduction-GRADE evidence profiles and summary of findings tables. Journal of Clinical
Epidemiology 2011a; 64: 383–394.
Guyatt GH, Juniper EF, Walter SD, Griffith LE, Goldstein RS. Interpreting treatment effects in
randomised trials. BMJ 1998; 316: 690–693.
Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, Schünemann HJ.
GRADE: an emerging consensus on rating quality of evidence and strength of
recommendations. BMJ 2008; 336: 924–926.
Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, Alonso-Coello P, Falck-
Ytter Y, Jaeschke R, Vist G, Akl EA, Post PN, Norris S, Meerpohl J, Shukla VK, Nasser M,
Schünemann HJ. GRADE guidelines: 8. Rating the quality of evidence–indirectness.
Journal of Clinical Epidemiology 2011b; 64: 1303–1310.
Guyatt GH, Oxman AD, Santesso N, Helfand M, Vist G, Kunz R, Brozek J, Norris S, Meerpohl J,
Djulbegovic B, Alonso-Coello P, Post PN, Busse JW, Glasziou P, Christensen R,
Schünemann HJ. GRADE guidelines: 12. Preparing summary of findings tables-binary
outcomes. Journal of Clinical Epidemiology 2013a; 66: 158–172.
Guyatt GH, Thorlund K, Oxman AD, Walter SD, Patrick D, Furukawa TA, Johnston BC,
Karanicolas P, Akl EA, Vist G, Kunz R, Brozek J, Kupper LL, Martin SL, Meerpohl JJ, Alonso-
Coello P, Christensen R, Schünemann HJ. GRADE guidelines: 13. Preparing summary of
findings tables and evidence profiles-continuous outcomes. Journal of Clinical
Epidemiology 2013b; 66: 173–183.
Hawe P, Shiell A, Riley T, Gold L. Methods for exploring implementation variation and local
context within a cluster randomised community intervention trial. Journal of
Epidemiology and Community Health 2004; 58: 788–793.
Hoffrage U, Lindsey S, Hertwig R, Gigerenzer G. Medicine. Communicating statistical
information. Science 2000; 290: 2261–2262.
Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal
clinically important difference. Controlled Clinical Trials 1989; 10: 407–415.
Johnston B, Thorlund K, Schünemann H, Xie F, Murad M, Montori V, Guyatt G. Improving the
interpretation of health-related quality of life evidence in meta-analysis: the application
of minimal important difference units. Health and Quality of Life Outcomes 2010; 8: 116.
Karanicolas PJ, Smith SE, Kanbur B, Davies E, Guyatt GH. The impact of prophylactic
dexamethasone on nausea and vomiting after laparoscopic cholecystectomy: a
systematic review and meta-analysis. Annals of Surgery 2008; 248: 751–762.
Lumley J, Oliver SS, Chamberlain C, Oakley L. Interventions for promoting smoking
cessation during pregnancy. Cochrane Database of Systematic Reviews 2004; 4: CD001055.
McQuay HJ, Moore RA. Using numerical results from systematic reviews in clinical practice.
Annals of Internal Medicine 1997; 126: 712–720.
Resnicow K, Cross D, Wynder E. The Know Your Body program: a review of evaluation
studies. Bulletin of the New York Academy of Medicine 1993; 70: 188–207.
Robinson J, Biley FC, Dolk H. Therapeutic touch for anxiety disorders. Cochrane Database of
Systematic Reviews 2007; 3: CD006240.
Rothwell PM. External validity of randomised controlled trials: “to whom do the results of
this trial apply?” Lancet 2005; 365: 82–93.
Santesso N, Carrasco-Labra A, Langendam M, Brignardello-Petersen R, Mustafa RA, Heus P,
Lasserson T, Opiyo N, Kunnamo I, Sinclair D, Garner P, Treweek S, Tovey D, Akl EA, Tugwell
P, Brozek JL, Guyatt G, Schünemann HJ. Improving GRADE evidence tables part 3:
detailed guidance for explanatory footnotes supports creating and understanding GRADE
certainty in the evidence judgments. Journal of Clinical Epidemiology 2016; 74: 28–39.
Schünemann HJ, Guyatt GH. Commentary–goodbye M(C)ID! Hello MID, where do you come
from? Health Services Research 2005; 40: 593–597.
Schünemann HJ, Puhan M, Goldstein R, Jaeschke R, Guyatt GH. Measurement properties
and interpretability of the Chronic respiratory disease questionnaire (CRQ). COPD: Journal
of Chronic Obstructive Pulmonary Disease 2005; 2: 81–89.
Schünemann HJ, Fretheim A, Oxman AD. Improving the use of research evidence in
guideline development: 13. Applicability, transferability and adaptation. Health Research
Policy and Systems 2006; 4: 25.
Schünemann HJ. Methodological idiosyncracies, frameworks and challenges of non-
pharmaceutical and non-technical treatment interventions. Zeitschrift für Evidenz,
Fortbildung und Qualität im Gesundheitswesen 2013; 107: 214–220.
Schünemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G,
Helfand M. Non-randomized studies as a source of complementary, sequential or
replacement evidence for randomized controlled trials in systematic reviews on the
effects of interventions. Research Synthesis Methods 2013; 4: 49–62.
Schünemann HJ, Wiercioch W, Etxeandia I, Falavigna M, Santesso N, Mustafa R, Ventresca M,
Brignardello-Petersen R, Laisaar KT, Kowalski S, Baldeh T, Zhang Y, Raid U, Neumann I,
Norris SL, Thornton J, Harbour R, Treweek S, Guyatt G, Alonso-Coello P, Reinap M, Brozek
J, Oxman A, Akl EA. Guidelines 2.0: systematic development of a comprehensive checklist
for a successful guideline enterprise. CMAJ: Canadian Medical Association Journal 2014;
186: E123–142.
Schünemann HJ. Interpreting GRADE’s levels of certainty or quality of the evidence: GRADE
for statisticians, considering review information size or less emphasis on imprecision?
Journal of Clinical Epidemiology 2016; 75: 6–15.
Smeeth L, Haines A, Ebrahim S. Numbers needed to treat derived from meta-analyses–
sometimes informative, usually misleading. BMJ 1999; 318: 1548–1551.
Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F, Bala MM, Bassler D, Mertz D, Diaz-Granados
N, Vandvik PO, Malaga G, Srinathan SK, Dahm P, Johnston B, Alonso-Coello P, Hassouneh
B, Walter SD, Heels-Ansdell D, Bhatnagar N, Altman DG, Guyatt GH. Credibility of claims of
subgroup effects in randomised controlled trials: systematic review. BMJ 2012;
344: e1553.
Zhang Y, Akl EA, Schünemann HJ. Using systematic reviews in guideline development: the
GRADE approach. Research Synthesis Methods 2018a; doi: 10.1002/jrsm.1313.
Zhang Y, Alonso-Coello P, Guyatt GH, Yepes-Nuñez JJ, Akl EA, Hazlewood G, Pardo-
Hernandez H, Etxeandia-Ikobaltzeta I, Qaseem A, Williams JW, Jr., Tugwell P, Flottorp S,
Chang Y, Zhang Y, Mustafa RA, Rojas MX, Schünemann HJ. GRADE Guidelines: 19.
Assessing the certainty of evidence in the importance of outcomes or values and
preferences-Risk of bias and indirectness. Journal of Clinical Epidemiology 2018b; doi:
10.1016/j.jclinepi.2018.01.013.
Zhang Y, Alonso Coello P, Guyatt G, Yepes-Nuñez JJ, Akl EA, Hazlewood G, Pardo-Hernandez
H, Etxeandia-Ikobaltzeta I, Qaseem A, Williams JW, Jr., Tugwell P, Flottorp S, Chang Y,
Zhang Y, Mustafa RA, Rojas MX, Xie F, Schünemann HJ. GRADE Guidelines: 20. Assessing
the certainty of evidence in the importance of outcomes or values and preferences –
Inconsistency, Imprecision, and other Domains. Journal of Clinical Epidemiology 2018c;
doi: 10.1016/j.jclinepi.2018.05.011.
Part Two
Specific perspectives in reviews
16
Equity and specific populations
Vivian A Welch, Jennifer Petkovic, Janet Jull, Lisa Hartling, Terry Klassen, Elizabeth
Kristjansson, Jordi Pardo Pardo, Mark Petticrew, David J Stott, Denise Thomson, Erin
Ueffing, Katrina Williams, Camilla Young, Peter Tugwell
KEY POINTS
• Health inequity may be experienced across characteristics summarized by the acronym
PROGRESS (Place of residence, Race/ethnicity/culture/language, Occupation, Gender/sex,
Religion, Education, Socio-economic status and Social capital) and other characteristics
(‘Plus’) such as sexual orientation, age and disability.
• Cochrane Reviews can inform decision making by considering the distribution of
effects in the population and implications for equity.
To address health equity in Cochrane Reviews, review authors may: consider health
equity at the question formulation stage, possibly using a logic model; decide what
methods will be used to identify and appraise evidence related to equity and specific
populations; consider implications for ‘Summary of findings’ tables (e.g. separate
tables for disadvantaged populations, separate rows for differences in risk of events);
and interpret findings related to health equity in the discussion.
This chapter should be cited as: Welch VA, Petkovic J, Jull J, Hartling L, Klassen T, Kristjansson E, Pardo
Pardo J, Petticrew M, Stott DJ, Thomson D, Ueffing E, Williams K, Young C, Tugwell P. Chapter 16: Equity and
specific populations. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors).
Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons,
2019: 435–450.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Trials often exclude populations that are disadvantaged or those above or below a
certain age. The exclusion of these populations may influence the applicability of
results beyond the trial settings. Review authors should report on the characteristics
of the populations according to relevant PROGRESS-Plus factors as well as whether
there are population subgroups with a higher risk of the condition or problem or if there
are differences in factors that influence access to care. Such factors include values, pre-
ferences, affordability and feasibility from the patient/public perspective and conscious
or unconscious bias by practitioners. Wait times for total joint arthroplasty provide an
example of practitioner bias and gender differences in access to care (Pederson and
Armstrong 2015). These factors may vary according to context.
It is usually not feasible to assess all PROGRESS-Plus characteristics. Thus, in choos-
ing characteristics to assess, review authors should consider the perspective of the
intended beneficiaries of the interventions and the intended users of the evidence.
16.2.2 Logic models and theories of change to articulate hypotheses about equity
Analytic frameworks such as logic models, causal chains and funnels of attrition are
increasingly being used in systematic reviews to identify key questions across the pop-
ulation, intervention, comparison group and outcomes (PICO) of interest (Chapter 2,
Section 2.4). Funnel-of-attrition or equity-effectiveness frameworks explain why effect
sizes decrease along the causal chain and allow for identification of the various factors
such as coverage and uptake that may impact the implementation of an intervention
(Tugwell et al 2008, White 2014). Logic models, which show the relationships between
inputs and results, can help identify the key questions that are relevant to assessing
effects on health equity by predicting likely differences in response, differences in base-
line risk, applicability and also factors that may mediate effects. These factors and dif-
ferences can guide the methods of the review. They can help scope the review question,
identify eligibility criteria, focus the search strategy, design a process evaluation and
consider relevance to policy and/or practice (Anderson et al 2011, O’Connor et al
2011). For example, a Cochrane Review of food supplementation for improving the
physical and psychosocial health of socio-economically disadvantaged children
included a logic model showing how socio-economic factors and family structure might
modify effectiveness of supplementary feeding (Kristjansson et al 2015).
Theories of change provide a comprehensive description and illustration of how and
why a desired change is expected to happen in a particular context (Mackinnon et al
2006, Kneale et al 2015). Pathways to change may be uncovered in the process of doing
the review; therefore, theories of change may need to be updated and revised during
the review process to incorporate discoveries about the processes and barriers and
facilitators to implementation.
findings’ tables (Chapter 14, Section 14.1.2). Context should be considered in rating
importance of outcomes (Section 16.2.5). Additionally, inconvenience, burden (e.g.
out-of-pocket costs, travel time) and stigma need to be considered as potential out-
comes even if they are not commonly reported in primary studies since they may be
of utmost importance to the intended recipients of the intervention.
potentially relevant studies may be found in a wider range of literature sources and
may be unreliably categorized. This may influence the databases and search terms cho-
sen. A Cochrane Review of interventions for promoting reintegration and reducing
harmful behaviour and lifestyles in street-connected children and young people
searched a broad range of websites and grey literature sources (Coren et al 2016).
• Use expert advice on planning and executing the search strategy, given the antici-
pated complexity of the searches (Chapter 4). Experts might know of unpublished,
non-indexed or hard-to-locate evidence.
• Identify validated filters, considering sensitivity and specificity, and trying to correct
known limitations (a worked example of these measures follows this list). If the filter is
not validated, consider carefully the risk of missing vital information.
• Look beyond traditional databases: small and specific databases addressing the
research topic may be more relevant (Ogilvie et al 2005, Augustincic Polec et al 2015).
• Develop logic models to make explicit the decisions on the search strategy.
• Conduct iterative searches: language changes over time and varies by place.
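As flagged in the bullet on validated filters above, sensitivity and precision are simple proportions. The sketch below computes them from hypothetical screening counts (all numbers are invented); ‘number needed to screen’ follows the usage in Prady et al (2018):

```python
# Hypothetical counts for evaluating a search filter against a reference
# standard of known relevant records; all numbers are invented.
relevant_retrieved = 45   # relevant records the filter retrieved
relevant_total = 50       # all relevant records in the reference standard
retrieved_total = 900     # all records the filter retrieved

sensitivity = relevant_retrieved / relevant_total               # 0.90
precision = relevant_retrieved / retrieved_total                # 0.05
number_needed_to_screen = retrieved_total / relevant_retrieved  # 20.0

print(f"sensitivity {sensitivity:.0%}, precision {precision:.1%}, "
      f"number needed to screen {number_needed_to_screen:.0f}")
```

A highly sensitive filter with low precision retrieves most of what matters at the cost of screening many irrelevant records; that trade-off should be weighed against the risk of missing vital information.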
Meta-regression (Chapter 10, Section 10.11.4) may also be feasible to assess the role
of explanatory variables such as population, context or process factors (Hollands
et al 2015).
Box 16.6.a Issues with interpretation for reviews including older adults
It is often difficult to determine applicability to all older people, including those who are
frail and dependent. Frailty is an important concept, but it is of limited use as there are no
widely adopted operational criteria. However, the following reported data can be useful:
• type of residence, for example the proportion of patients living long-term in a care
home (can be a proxy measure for those who are frail, disabled or have chronic cog-
nitive impairment or dementia);
• ability to perform basic activities of daily living (allows interpretation of whether
results are applicable to older people living with disability); and
• number and proportion of those with dementia, or whether dementia was a study
exclusion criterion (allows consideration of whether results are generalizable to older
people with major chronic cognitive impairment).
Funding: VAW holds an Early Researcher Award (2014–2019) from the Ontario Govern-
ment. PT holds a Canada Research Chair in Health Equity (Tier 1), 2016–2024.
16.9 References
Anderson LM, Petticrew M, Rehfuess E, Armstrong R, Ueffing E, Baker P, Francis D, Tugwell P.
Using logic models to capture complexity in systematic reviews. Research Synthesis
Methods 2011; 2: 33–42.
Augustincic Polec L, Petkovic J, Welch V, Ueffing E, Tanjong Ghogomu E, Pardo Pardo J,
Grabowsky M, Attaran A, Wells GA, Tugwell P. Strategies to increase the ownership and
use of insecticide-treated bednets to prevent malaria. Cochrane Database of Systematic
Reviews 2015; 3: CD009186.
Kneale D, Thomas J, Harris K. Developing and optimising the use of logic models in
systematic reviews: exploring practice and good practice in the use of programme theory
in reviews. PLoS ONE 2015; 10: e0142187.
Krieger N. Proximal, distal, and the politics of causation: what’s level got to do with it?
American Journal of Public Health 2008; 98: 221–230.
Kristjansson E, Francis DK, Liberato S, Benkhalti Jandu M, Welch V, Batal M, Greenhalgh T, Rader
T, Noonan E, Shea B, Janzen L, Wells GA, Petticrew M. Food supplementation for improving
the physical and psychosocial health of socio-economically disadvantaged children aged
three months to five years. Cochrane Database of Systematic Reviews 2015; 3: CD009924.
Kristjansson EA, Robinson V, Petticrew M, MacDonald B, Krasevec J, Janzen L, Greenhalgh T,
Wells G, MacGowan J, Farmer A, Shea BJ, Mayhew A, Tugwell P. School feeding for
improving the physical and psychosocial health of disadvantaged elementary school
children. Cochrane Database of Systematic Reviews 2007; 1: CD004676.
Kugley S, Wade A, Thomas J, Mahood Q, Jørgensen AMK, Hammerstrom K, Sathe N.
Searching for studies: a guide to information retrieval for Campbell systematic
reviews. Campbell Collaboration; 2017. Version 1.1 https://www.campbellcollaboration.
org/library/searching-for-studies-information-retrieval-guide-campbell-reviews.html
Lewin S, Hendry M, Chandler J, Oxman AD, Michie S, Shepperd S, Reeves BC, Tugwell P,
Hannes K, Rehfuess EA, Welch V, McKenzie JE, Burford B, Petkovic J, Anderson LM,
Harris J, Noyes J. Assessing the complexity of interventions within systematic
reviews: development, content and use of a new tool (iCAT_SR). BMC Medical
Research Methodology 2017; 17: 76.
Lorenc T, Petticrew M, Welch V, Tugwell P. What types of interventions generate
inequalities? Evidence from systematic reviews. Journal of Epidemiology and Community
Health 2013; 67: 190–193.
Lorenzetti DL, Lin Y. Locating sex- and gender-specific data in health promotion research:
evaluating the sensitivity and precision of published filters. Journal of the Medical Library
Association 2017; 105: 216–225.
Mackinnon A, Amott N, McGarvey C. Mapping change: using a theory of change to guide
planning and evaluation 2006. http://www.grantcraft.org/index.cfm.
Marmot M, Friel S, Bell R, Houweling T, Taylor S, Commission on Social Determinants of
Health. Closing the gap in a generation: health equity through action on the social
determinants of health. Lancet 2008; 372: 1661–1669.
Marmot M, Allen J, Bell R, Goldblatt P. Building of the global movement for health equity:
from Santiago to Rio and beyond. Lancet 2012; 379: 181–188.
McCalman J, Heyeres M, Campbell S, Bainbridge R, Chamberlain C, Strobel N, Ruben A.
Family-centred interventions by primary healthcare services for Indigenous early
childhood wellbeing in Australia, Canada, New Zealand and the United States: a
systematic scoping review. BMC Pregnancy and Childbirth 2017; 17: 71.
Noyes J, Gough D, Lewin S, Mayhew A, Michie S, Pantoja T, Petticrew M, Pottie K, Rehfuess E,
Shemilt I, Shepperd S, Sowden A, Tugwell P, Welch V. A research and development
agenda for systematic reviews that ask complex questions about complex interventions.
Journal of Clinical Epidemiology 2013; 66: 1262–1270.
O’Neill J, Tabish H, Welch V, Petticrew M, Pottie K, Clarke M, Evans T, Pardo Pardo J, Waters
E, White H, Tugwell P. Applying an Equity Lens to interventions: Using PROGRESS to
ensure consideration of socially stratifying factors to illuminate inequities in health.
Journal of Clinical Epidemiology 2013; 67: 56–64.
O’Connor D, Green S, Higgins JPT. Chapter 5: Defining the review question and
developing criteria for including studies. In: Higgins JPT, Green S, editors. Cochrane
Handbook for Systematic Reviews of Interventions. Version 5.1.0 (updated March 2011):
The Cochrane Collaboration; 2011.
Ogilvie D, Hamilton V, Egan M, Petticrew M. Systematic reviews of health effects of social
interventions: 1. Finding the evidence: how far should you go? Journal of Epidemiology
and Community Health 2005; 59: 804–808.
Oxman AD, Lavis JN, Lewin S, Fretheim A. SUPPORT Tools for evidence-informed health
Policymaking (STP) 10: Taking equity into consideration when assessing the findings of a
systematic review. Health Research Policy and Systems 2009; 7 Suppl 1: S10.
Pederson A, Armstrong P. Sex, gender and systematic reviews: the example of wait times for
hip and knee replacements. In: Armstrong P, Pederson A, editors. Women’s Health:
Intersections of Policy, Research and Practice. Toronto: Women’s Press; 2015. pp. 56–72.
Petticrew M, Whitehead M, Macintyre SJ, Graham H, Egan M. Evidence for public health
policy on inequalities: 1: The reality according to policymakers. Journal of Epidemiology
and Community Health 2004; 58: 811.
Pfadenhauer LM, Gerhardus A, Mozygemba K, Lysdahl KB, Booth A, Hofmann B, Wahlster P,
Polus S, Burns J, Brereton L, Rehfuess E. Making sense of complexity in context and
implementation: the Context and Implementation of Complex Interventions (CICI)
framework. Implementation Science 2017; 12: 21.
Pope C, Mays N, Popay J. Synthesising qualitative and quantitative health evidence: A guide
to methods. McGraw-Hill Education (UK); 2007.
Prady SL, Uphoff EP, Power M, Golder S. Development and validation of a search filter to
identify equity-focused studies: reducing the number needed to screen. BMC Medical
Research Methodology 2018; 18: 106.
Rader T, Pardo Pardo J, Stacey D, Ghogomu E, Maxwell LJ, Welch VA, Singh JA, Buchbinder
R, Legare F, Santesso N, Toupin April K, O’Connor AM, Wells GA, Winzenberg TM, Johnston
R, Tugwell P. Update of strategies to translate evidence from Cochrane Musculoskeletal
Group systematic reviews for use by various audiences. Journal of Rheumatology 2014; 41:
206–215.
Sinha IP, Altman DG, Beresford MW, Boers M, Clarke M, Craig J, Alberighi OD, Fernandes RM,
Hartling L, Johnston BC, Lux A, Plint A, Tugwell P, Turner M, van der Lee JH, Offringa M,
Williamson PR, Smyth RL. Standard 5: selection, measurement, and reporting of
outcomes in clinical trials in children. Pediatrics 2012; 129 Suppl 3: S146–152.
Tugwell P, Maxwell L, Welch V, Kristjansson E, Petticrew M, Wells G, Buchbinder R, Suarez-
Almazor ME, Nowlan MA, Ueffing E, Khan M, Shea B, Tsikata S. Is health equity considered
in systematic reviews of the Cochrane Musculoskeletal Group? Arthritis and Rheumatism
2008; 59: 1603–1610.
Tugwell P, Petticrew M, Kristjansson E, Welch V, Ueffing E, Waters E, Bonnefoy J, Morgan A,
Doohan E, Kelly MP. Assessing equity in systematic reviews: realising the recommendations
of the Commission on Social Determinants of Health. BMJ 2010; 341: c4739.
Ueffing E, Tugwell P, Welch V, Petticrew M, Kristjansson E. C1, C2 Equity Checklist for
Systematic Review Authors. Version 2009-05-28. 2009. http://equity.cochrane.org/sites/
equity.cochrane.org/files/uploads/equitychecklist.pdf
van de Glind EM, van Munster BC, Spijker R, Scholten RJ, Hooft L. Search filters to identify
geriatric medicine in Medline. Journal of the American Medical Informatics Association
2012; 19: 468–472.
von Philipsborn P, Stratil JM, Burns J, Busert LK, Pfadenhauer LM, Polus S, Holzapfel C,
Hauner H, Rehfuess E. Environmental interventions to reduce the consumption of
sugar-sweetened beverages and their effects on health. Cochrane Database of Systematic
Reviews 2016; 7: CD012292.
Welch V, Petticrew M, Tugwell P, Moher D, O’Neill J, Waters E, White H. PRISMA-Equity 2012
extension: reporting guidelines for systematic reviews with a focus on health equity. PLoS
Medicine 2012; 9: e1001333.
Welch V, Jull J, Petkovic J, Armstrong R, Boyer Y, Cuervo LG, Edwards S, Lydiatt A, Gough D,
Grimshaw J, Kristjansson E, Mbuagbaw L, McGowan J, Moher D, Pantoja T, Petticrew M,
Pottie K, Rader T, Shea B, Taljaard M, Waters E, Weijer C, Wells GA, White H, Whitehead M,
Tugwell P. Protocol for the development of a CONSORT-equity guideline to improve
reporting of health equity in randomized trials. Implementation Science 2015; 10: 146.
Welch VA, Norheim OF, Jull J, Cookson R, Sommerfelt H, Tugwell P. CONSORT-Equity 2017
extension and elaboration for better reporting of health equity in randomised trials. BMJ
2017a; 359: j5085.
Welch VA, Akl EA, Pottie K, Ansari MT, Briel M, Christensen R, Dans A, Dans L, Eslava-
Schmalbach J, Guyatt G, Hultcrantz M, Jull J, Katikireddi SV, Lang E, Matovinovic E,
Meerpohl JJ, Morton RL, Mosdol A, Murad MH, Petkovic J, Schünemann H, Sharaf R, Shea
B, Singh JA, Sola I, Stanev R, Stein A, Thabane L, Tonia T, Tristan M, Vitols S, Watine J,
Tugwell P. GRADE equity guidelines 3: considering health equity in GRADE guideline
development: rating the certainty of synthesized evidence. Journal of Clinical
Epidemiology 2017b; 90: 76–83.
White H. Current challenges in impact evaluation. European Journal of Development
Research 2014; 26: 18–30.
Whitehead M. The concepts and principles of equity and health. International Journal of
Health Services 1992; 22: 429–445.
Williams K, Thomson D, Seto I, Contopoulos-Ioannidis DG, Ioannidis JP, Curtis S, Constantin
E, Batmanabane G, Hartling L, Klassen T. Standard 6: age groups for pediatric trials.
Pediatrics 2012; 129 Suppl 3: S153–160.
Zoritch B, Roberts I, Oakley A. Day care for pre-school children. Cochrane Database of
Systematic Reviews 2000; 3: CD000564.
17
Intervention complexity
James Thomas, Mark Petticrew, Jane Noyes, Jacqueline Chandler, Eva Rehfuess,
Peter Tugwell, Vivian A Welch
KEY POINTS
• Cochrane Reviews may need to take account of intervention complexity.
• There are three ways of understanding intervention complexity:
i) in terms of the number of components in the intervention;
ii) in terms of interactions between intervention components or interactions between
the intervention and its context, or both; and
iii) in terms of the wider system within which the intervention is introduced.
• Of most relevance to Cochrane Review authors are (i) and (ii), and the chapter focuses
mainly on these understandings of intervention complexity.
17.1 Introduction
This chapter introduces how to conceptualize and consider intervention complexity
within systematic reviews. Advice available on this subject can appear contradictory
and there is a risk that accounting for intervention complexity can make the review
itself overly complex and less comprehensible to users. The key issue is how to identify
an approach that assists in a specific systematic review. The chapter aims to signpost
review authors to advice that helps them make decisions on when and in which circum-
stances to apply that advice. It does not aim to cover all aspects of complexity but
advises review authors on how to frame review questions to address issues of interven-
tion complexity and directs them to other sources for further reference. Other parts of
this Handbook have been expanded to support considerations of intervention complex-
ity, and this chapter provides cross-references where appropriate.
This chapter should be cited as: Thomas J, Petticrew M, Noyes J, Chandler J, Rehfuess E, Tugwell P, Welch
VA. Chapter 17: Intervention complexity. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ,
Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK):
John Wiley & Sons, 2019: 451–478.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Most of the methods discussed in this chapter have been thoroughly tested and published elsewhere. Some
are still relatively new and under development. These new and emerging methods are
flagged as such when they are discussed.
This chapter focuses mainly on addressing the first two perspectives of intervention
complexity, rather than the systems perspective, because these are most commonly
used in Cochrane Reviews. The next section introduces the first two aspects of complex-
ity in more detail, and the following section outlines some implications when the anal-
ysis is focused on the wider system.
• whether there are multiple components within the experimental and control inter-
ventions, and whether they may interact with one another;
• the range of behaviours required by those delivering or receiving the intervention,
and how difficult or variable they may be;
• whether the intervention, or its components, results in non-linear effects;
• the number of groups or organizational levels targeted by the intervention;
• the number and variability of outcomes; and
• the degree of flexibility or tailoring of the intervention permitted.
the circumstances that form the setting for an event, statement, or idea, and in
terms of which it can be fully understood.
When defined in these terms, knowing the context of an intervention, and thus, ‘fully
understanding’ how it gave rise to its outcomes, is both a highly desirable and an
extremely challenging objective for review authors.
A further challenge is that defining ‘context’ is itself a matter of judgement. The
ROBINS-I tool for appraisal of non-randomized studies (see Chapter 25) defines context
broadly as “characteristics of the healthcare setting (e.g. public outpatient versus hos-
pital outpatient), organizational service structure (e.g. managed care or publicly
funded program), geographical setting (e.g. rural vs urban), and cultural setting and
the legal environment where the intervention is implemented”.
Pfadenhauer and colleagues concur that the physical and social setting of the inter-
vention needs to be considered as part of the context but, in line with the guidance in
Section 17.1.1 on ‘conceptualizing intervention complexity’, expand this understanding
to acknowledge the potential for interactions between intervention, participants and
the setting within which the intervention is introduced:
an actively planned and deliberately initiated effort with the intention to bring a
given intervention into policy and practice within a particular setting. These
actions are undertaken by agents who either actively promote the use of the
intervention or adopt the newly appraised practices. Usually, a structured imple-
mentation process consisting of specific implementation strategies is used being
underpinned by an implementation theory.
(Pfadenhauer et al 2017)
some will not; it may even be possible to compare these different intervention adapta-
tions and their implementations within the systematic review. To understand what has
happened, it will be necessary to unpack the intended ‘function’ of the intervention
that underlies variations in form.
With most (simple) interventions, integrity is defined as having the ‘dose’ deliv-
ered at an optimal level and in the same way in each site. Complex intervention
thinking defines integrity of interventions differently. The issue is to allow the
form to be adapted while standardising the process and function.
(Hawe et al 2004).
• What do my review users want to know about? The intervention, the system, or both?
• At what level is the intervention delivered? Is the intervention likely to have
anticipated effects of interest to users at levels above the individual level? If the
implementation and effects spill over into the family, community, or beyond, then
taking a systems perspective may be helpful.
• Is the intervention: (i) a discrete, identifiable intervention, or package of interven-
tions; or (ii) a more diffuse change within an existing system?
Review authors should also take account of the resources available to conduct the
review. A large scale, theoretically informed review of an intervention within its wider
system may be time-consuming, expensive and require a large multidisciplinary team.
It may also produce complex answers that are beyond the needs of many users.
For further information on logic models and defining interventions see Chapter 2
(Section 2.5.1), Chapter 3 (Section 3.2) and Chapter 21 (Section 21.6.1). See the follow-
ing for key references on the topics discussed in this section. On understanding inter-
vention complexity: Campbell et al (2000), Craig et al (2008), Kelly et al (2017), Petticrew
et al (2019); on mechanisms of action: Howick et al (2010), Fletcher et al (2016), Noyes
et al (2016a); on context and implementation: Hawe et al (2009), Noyes et al (2013),
Moore et al (2015), Pfadenhauer et al (2017).
• Under what circumstances does the intervention work (Thomas et al 2004, Squires
et al 2013)?
• What is the relative importance of, and synergy between, different components of
multicomponent interventions?
• What are the mechanisms of action by which the intervention achieves an effect?
• What are the factors that impact on implementation and participant responses?
• What is the feasibility and acceptability of the intervention in different contexts?
• What are the dynamics of the wider system?
Broadly, therefore, systematic reviews can consider complexity in terms of the inter-
vention (e.g. how the components of the intervention interact), and also in terms of
how it is implemented. In this situation, systematic reviews can use the concept of com-
plexity to help develop theories of implementation, and inform strategies to improve
implementation (Nilsen 2015).
As Chapters 2 and 3 outline, addressing broader review questions has implications for
the search strategy, the types of evidence, the eligibility criteria, the evidence appraisal,
and the review design and synthesis methods (Squires et al 2013). Sometimes more
than one type of study design may be required to address the questions of interest,
the products of which might subsequently be integrated in a third synthesis (see
Chapter 21 and Glenton et al 2013, Harris et al 2018).
[Figure 17.2.a appears here: a logic model diagram. It concerns school-based self-management educational interventions for asthma in children and adolescents (of chronic diseases in children, asthma accounts for most school absences and emergency admissions, and disproportionately affects children from lower socio-economic backgrounds; the school offers an environment in which to develop self-care strategies). The diagram links intervention inputs (resources; theory and aims), modifiable design characteristics and core elements of the intervention (e.g. reinforcement of regular lung function monitoring; emphasis on self-management practice and behaviour; reinforcement of regular dialogue with health practitioners; instruction in inhaler techniques; reinforcement/provision of an asthma management plan; appropriate use of reliever and regular preventer therapies; non-pharmacological self-management strategies) to proximal outcomes (family knowledge; teachers’ knowledge and skills; asthma severity, symptoms, lung function and reliever use), intermediate outcomes (school attendance; emergency admissions and presentations for asthma; days of restricted activity; quality of life) and child-level and macro-level distal outcomes (indicators of improved educational outcomes and of improved health and mental well-being), with child-level moderators (severity of asthma; age/gender; comorbidity; socio-economic and socio-demographic factors) and process metrics (adherence/fidelity; dose; acceptability; relevance; quality and intensity of the intervention provided; attrition; recruitment and representativeness) shown alongside.]
Figure 17.2.a Logic model of school-based asthma interventions (Harris et al 2018). Reproduced with permission of John Wiley & Sons
circumstances (see Sections 17.1.1 and 17.1.3). For a detailed discussion of planning
comparisons for synthesis, see Chapters 3 and 9.
Outcomes of interest are likely to include a range of intended and unintended health
and non-health effects of interest to review users. The choice of outcomes to prioritize
is a matter of judgement and perspective, and the rationale for selection decisions
should be explicitly reported. Review authors should note that the prioritization of out-
comes varies culturally, and according to the perspective of those directly affected by
an intervention (e.g. patients, an at-risk population), those delivering the intervention
(e.g. clinicians, staff working for healthcare or public health institutions), or policy
makers or others deciding on or financing an intervention and the general public. How-
ever, the answer is not simply to include any plausible outcome: a plausible theoretical
case can probably be made for most outcomes, but that does not mean they are mean-
ingful. Worse, including a wide range of speculative outcomes raises the risk of data
dredging and vastly increases the complexity of the analysis and interpretation (see
Chapter 9, Section 9.3.3 on multiplicity of outcomes and Chapter 3, Section 3.4.4). Again,
an understanding of the intervention theory can help select the outcomes for which the
strongest plausible a priori case can be made for inclusion – perhaps those outcomes for
which there is prior evidence of an important association with the intervention. As the
illustrative logic model (Figure 17.2.a) shows, there can be numerous intermediate out-
comes between the intervention and the final outcome of interest. Guidance is available
on how to select the most important outcomes from the list of all plausible outcomes
(Chapter 3, Section 3.2.4 and Guyatt et al 2011). It will also be important to determine
the availability of core outcome sets within the review context (see www.comet-initiative.org).
Core outcome sets are now becoming available for more complex interventions
and may help to guide outcome selection (e.g. see Kaufman et al 2017).
although some systematic reviews may be undertaken for a specific setting (see
Pantoja et al 2017 for an example of an overview of reviews which examines specifically
issues from a low-income country perspective). When a review aims to inform decisions
in a specific situation, consideration should be given to the ‘directness’ of the evidence
(the extent to which the participants, interventions and outcome measures are similar
to those of interest); this is a core feature of GRADE assessment, discussed in Chapter 14
(GRADE Working Group 2004).
The TIDieR framework (Hoffmann et al 2014) refers to “the type(s) of location(s) where
the intervention occurred, including any necessary infrastructure or relevant features”,
and the iCAT_SR tool notes that “the effects of an intervention may be dependent on
the societal, political, economic, health systems or environmental context in which the
intervention is delivered” (Lewin et al 2017). Finally, the PRECIS-2 tool, while written
to support the design of trials, also contains useful information for review authors when
considering how to address issues relating to context and implementation (Loudon
et al 2015).
These are important considerations because for social and public health (and per-
haps any intervention), the political context is often an important determinant of
whether interventions can be implemented or not; regulatory interventions (e.g. alco-
hol or tobacco control policies) may be less politically acceptable within certain juris-
dictions, even if such interventions are likely to be effective. Historical and cultural
contexts are also often important moderators of the effects and acceptability of public
health interventions (Craig et al 2018). It is therefore impossible (and probably mislead-
ing) to attempt to specify what ‘is’ or ‘isn’t’ context, as this depends on the intervention
and the review question, as well as how the intervention and its effects are theorized
(implicitly or explicitly) by the review authors. Booth and colleagues suggest that a sup-
plementary framework (e.g. the Context and Implementation of Complex Interventions
(CICI) Framework (Pfadenhauer et al 2017); see Section 17.1.2.1) can help to understand
and explore contextual issues: for example, helping to decide whether to ‘lump’ or
‘split’ studies by context, and how to frame the review question and subsequent stages
of the review (Booth et al 2019a).
randomized and uncontrolled studies may mean excluding the few evaluations that
exist, and in some cases such designs can provide adequate evidence of effect
(Craig et al 2012). For example, when evaluating the impact of a smoking ban on hos-
pital admissions for coronary heart disease, Khuder and colleagues employed a quasi-
experimental design with interrupted time series (Khuder et al 2007).
As outlined in Section 17.2.2, the questions asked in systematic reviews that address
complexity often go beyond asking whether a given intervention works, to ask how it
might work, in which circumstances and for whom. Addressing these questions can
require the inclusion of a range of different research designs. In particular, when evi-
dence about the processes by which an intervention influences intermediate and final
outcomes, as well as evidence on intervention acceptability and implementation, qual-
itative evidence is often included. Qualitative evidence can also identify evidence of
unintended adverse effects which may not be reported in the main quantitative eval-
uation studies (Thomas and Harden 2008). Petticrew and colleagues’ Table 1 sum-
marizes each aspect of complexity and suggests which types of evidence might be
most useful to address each issue. For example, when aiming to understand interac-
tions between intervention and context, multicentre trials with stratified reporting,
observational studies which provide evidence of mediators and moderators, and qual-
itative studies which observe behaviours and ask people about their understandings
and experiences are suggested as being helpful study designs to include (Petticrew
et al 2019). See also Noyes et al (2019) and Rehfuess et al (2019) for further information
on matching study designs to research questions to address intervention complexity.
For example, Chapter 3 contains detailed information on specifying review and com-
parison PICOs that is essential reading for review authors addressing intervention com-
plexity. The illustration of a logic model in Figure 17.2.a should be read alongside the
introduction to logic models in Chapter 2, Section 2.5.1. See also Chapter 2, Section 2.3
for discussion about breadth and depth in review questions. See the following for key
references on the topics discussed in this section. On theory and logic models:
While this kind of search can inform the design and framing of the review, a compre-
hensive search is required to identify as much as possible of the body of evidence rel-
evant to the review (see Chapter 4). As for any review, the search should be led by the
review question, a detailed understanding of the PICO elements, and the review’s eli-
gibility criteria (Chapter 3).
For further information see Chapter 4 and also the supplementary information associ-
ated with Noyes et al (2019). Table 1 in Petticrew et al (2019) also describes the rela-
tionship between different types of review questions, and the sort of evidence that
might be sought to answer them. See the following for key references on the topics
discussed in this section: Booth et al (2013), Brunton et al (2017).
with one another; (ii) those that might be considered ‘standard’ methods of meta-
analysis – including meta-regression (see Chapter 10) – which enable review authors
to examine possible moderators of effect at the study level; and (iii) more advanced
methods, which include network meta-analysis (see Chapter 11), but go beyond this
and encompass methods that enable review authors to examine intervention compo-
nents, mechanisms of action, and complexities of the system into which the interven-
tion is introduced (Higgins et al 2019).
At the outset, even when a statistical synthesis is planned, it is usually useful to begin
the synthesis using non-quantitative methods, understanding the characteristics of the
populations and interventions included in the review, and reviewing the outcome data
from the available studies in a structured way. Informative tables and graphical tools
can play an important role in this regard, assisting review authors to visualize and
explore complexity. These include harvest plots, box-and-whisker plots, bubble plots,
network diagrams and forest plots. See Chapters 9 and 12 for further discussion of
these approaches.
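As one illustration of these graphical tools, the sketch below draws a rudimentary forest plot with matplotlib (study names and estimates are invented; in practice forest plots are usually produced by RevMan or dedicated meta-analysis software rather than coded by hand):

```python
# Minimal forest-plot sketch: study-level risk ratios with 95% CIs (invented).
import matplotlib.pyplot as plt
import numpy as np

studies = ["Study A", "Study B", "Study C", "Study D"]
effect = np.array([0.70, 0.95, 0.60, 1.10])   # risk ratios
ci_low = np.array([0.50, 0.70, 0.35, 0.80])
ci_high = np.array([0.98, 1.29, 1.03, 1.51])

y = np.arange(len(studies))[::-1]             # first study at the top
fig, ax = plt.subplots(figsize=(5, 2.5))
ax.errorbar(effect, y, xerr=[effect - ci_low, ci_high - effect],
            fmt="s", color="black", capsize=3)
ax.axvline(1.0, linestyle="--", color="grey") # line of no effect
ax.set_xscale("log")                          # ratio measures on a log scale
ax.set_yticks(y)
ax.set_yticklabels(studies)
ax.set_xlabel("Risk ratio (log scale)")
fig.tight_layout()
plt.show()
```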
Standard meta-analytic methods may not always be appropriate, since they
do depend on reasonable comparability of both interventions and comparators –
something that may not apply when synthesizing evidence with high heterogeneity.
Chapter 3 considers in detail how to think about the comparability of, and categories
within, interventions, populations and outcomes. However, where interventions and
populations are judged sufficiently similar to answer questions which aggregate the
findings from similar studies, then approaches such as standard meta-analysis,
meta-regression or network meta-analysis may be appropriate, particularly when
the mechanism of action is clearly understood (Viswanathan et al 2017).
Questions concerning the circumstances in which the intervention might work and
the relative importance of different components of interventions require methods that
explore between-study heterogeneity. Subgroup analysis and meta-regression enable
review authors to investigate effect moderators with the usual caveats that pertain to
such observational analyses (see Chapter 10). Caldwell and Welton describe alternative
quantitative approaches to synthesis, which include ‘component-based’ meta-analysis
where individual intervention components (or meaningful combinations of compo-
nents) are modelled explicitly, thus enabling review authors to identify those compo-
nents most (or least) associated with intervention success (Caldwell and Welton 2016).
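To indicate the flavour of such a moderator analysis, the sketch below fits a simple meta-regression of study effect sizes on a single moderator using inverse-variance weighted least squares (all data are invented). A real analysis would usually fit a random-effects meta-regression in dedicated software (e.g. the R package metafor), which also estimates between-study variance and appropriate standard errors:

```python
# Toy fixed-effect meta-regression via inverse-variance weighted least squares.
import numpy as np
import statsmodels.api as sm

log_or = np.array([-0.40, -0.15, -0.55, -0.05, -0.30])  # study log odds ratios
se = np.array([0.15, 0.20, 0.18, 0.25, 0.12])           # their standard errors
hours = np.array([4, 2, 6, 1, 3])                       # moderator: contact hours

X = sm.add_constant(hours)                       # intercept + moderator column
fit = sm.WLS(log_or, X, weights=1 / se**2).fit() # weights = 1 / variance
print(fit.params)  # [intercept, slope]: change in log OR per extra contact hour
# Caveat: WLS rescales standard errors by the residual variance, so inference
# here is only approximate compared with dedicated meta-regression routines.
```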
When the review questions ask review authors to consider how interventions achieve
their effect, other types of evidence, other than randomized trials, are vital to provide
theory that identifies causal connections between intervention(s) and outcome(s).
Logic models (see Section 17.2.1 and Chapter 2) can provide some rationale for the
selection of factors to include in analysis, but the review may require an additional syn-
thesis of qualitative evidence to elucidate the complexity adequately. This is especially
the case when understanding differential intervention effects that require review
authors to consider the perspectives and experiences of those receiving the interven-
tion. See Chapter 21 for a detailed exploration of the methods available. While logic
models aim to summarize how the interactions between intervention, participant
and context may produce outcomes, specific causal pathways may be identified for
testing. Causal chain analysis encompasses a range of methods that help review
authors to do this (Kneale et al 2018), including meta-analytic path analysis and struc-
tural equation modelling (Tanner-Smith and Grant 2018), and model-based meta-
analysis (Becker 2009). These types of analyses are rare in Cochrane Reviews, as meth-
ods are still developing and require relatively large datasets.
Integrating different types of data within the same analysis can be a challenging but
powerful approach, often enabling the theories generated in synthesis of qualitative
literature to be used to explore and explain heterogeneity between quantitative studies
(Thomas et al 2004). Reviews with multiple components and analyses can address dif-
ferent questions relating to complexity often in a sequential way, with each component
building on the findings of the previous one. Methods used include: mixed-methods
synthesis (involving qualitative thematic synthesis, meta-analysis and cross-study syn-
thesis); Bayesian synthesis (where qualitative studies are used to generate informative
priors); and qualitative comparative analysis (QCA: a set-based method which uses
Boolean algebra to juxtapose intervention components in configurational patterns;
see Chapter 21, Section 21.13, and Thomas et al (2014)). Such analyses are explanatory:
they aim both to identify differential intervention effects and to explain why they occur
(Cook et al 1994). The example review given in Box 17.1.a is a multi-component review,
which integrates different types of data in order better to understand differential inter-
vention effects. It uses qualitative data from process evaluations to identify which inter-
vention features were associated with successful implementation. It then uses the
inferences generated in this analysis to explore heterogeneity between the results of
randomized trials, using what might be considered ‘standard’ meta-analytic and
meta-regression methods. It is important to bear in mind that the review question
always comes first in these multi-component reviews: the decision to use process eval-
uation data in this way was driven by an understanding of the context within which
these interventions are implemented. A different mix of data will be appropriate in dif-
ferent situations.
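To illustrate the set-based logic of QCA mentioned above, the toy sketch below codes hypothetical studies on two binary intervention conditions and computes the standard crisp-set consistency and coverage measures for one candidate configuration (all codings are invented):

```python
# Toy crisp-set QCA: each study coded 1/0 on two conditions and on whether
# the intervention was judged effective; all codings are invented.
studies = [
    # (tailored, family_involved, effective)
    (1, 1, 1),
    (1, 1, 1),
    (1, 0, 0),
    (0, 1, 0),
    (1, 1, 0),
    (0, 0, 0),
]

config = [s for s in studies if s[0] == 1 and s[1] == 1]  # tailored AND family

# Consistency: how reliably the configuration is accompanied by effectiveness.
consistency = sum(s[2] for s in config) / len(config)              # 2/3 ≈ 0.67
# Coverage: the share of all effective cases the configuration accounts for.
coverage = sum(s[2] for s in config) / sum(s[2] for s in studies)  # 2/2 = 1.00
print(f"consistency = {consistency:.2f}, coverage = {coverage:.2f}")
```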
Finally, review authors may want to synthesize research to reach a better under-
standing of the dynamics of the wider system in which the intervention is introduced.
Analytical methods can include some of those already mentioned – for combining
diverse types of data – but may also include methods developed in systems science
such as systems dynamics models and agent-based modelling (Luke and Stamata-
kis 2012).
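To give a sense of what a systems dynamics model involves, the toy sketch below integrates a single stock (smoking prevalence) with quit and uptake flows linked by a simple social-norm feedback. Every parameter is invented and the model is purely illustrative of the approach, not of any published analysis:

```python
# Toy system-dynamics sketch (Euler integration over yearly steps).
smokers = 0.30                        # stock: smoking prevalence (fraction)
quit_base, uptake_rate = 0.04, 0.01   # hypothetical yearly flow rates

for year in range(1, 21):
    # Feedback: quitting becomes easier as prevalence (the social norm) falls.
    quit_rate = quit_base * (1 + (0.30 - smokers))
    outflow = smokers * quit_rate                   # smokers who quit
    inflow = (1 - smokers) * uptake_rate * smokers  # uptake scales with exposure
    smokers += inflow - outflow
    print(f"year {year:2d}: prevalence {smokers:.3f}")
```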
For further information about steps to follow before results are combined, review
authors should consider the guidance in Chapter 9 to summarize studies and prepare
for synthesis. Standard meta-analytical methods are outlined in Chapter 10, with
Section 10.10 on investigating heterogeneity particularly relevant. Methods for under-
taking network meta-analysis are outlined in Chapter 11.
the appropriate reporting criteria from existing quantitative and qualitative reporting
guidelines (see Chapter 21 for further details) (Flemming et al 2018). One of the chal-
lenges that review authors may meet when addressing complexity through incorporat-
ing a range of study designs beyond randomized trials is that GRADE assessments of
evidence can generally turn out to be ‘low’, offering little assistance to readers in terms
of understanding the relative confidence in the different studies included. See Mont-
gomery et al (2019) for practical advice in this situation.
Increasing the quantity and range of evidence synthesized in a systematic review can
make reports quite challenging (and lengthy) to read. Preparing a report that is suffi-
ciently clear in its conclusions can take many rounds of redrafting, and it is also useful
to obtain feedback from consumers and other stakeholders involved in the review
(Chapter 1, Section 1.3.1). Intervention complexity can thus increase the resources
needed at this phase of the review too, and it is essential to plan for this if the reporting
of the review is to be sufficiently clear for it to be used to inform decisions. (See also
Chapter 15 and online Chapter III.)
Acknowledgements: This chapter replaces Chapter 21 in the first edition of this Hand-
book (2008) and subsequent version 5.2. We would like to thank the previous chapter
authors Rebecca Armstrong, Jodie Doyle, Helen Morgan and Elizabeth Waters.
Funding: VAW holds an Early Researcher Award (2014–2019) from the Ontario Govern-
ment. JT is supported by the National Institute for Health Research (NIHR) Collabora-
tion for Leadership in Applied Health Research and Care North Thames at Barts Health
NHS Trust. The views expressed are those of the author(s) and not necessarily those of
the NHS, the NIHR or the Department of Health.
17.8 References
Anderson L, Petticrew M, Rehfuess E, Armstrong R, Ueffing E, Baker P, Francis D, Tugwell P.
Using logic models to capture complexity in systematic reviews. Research Synthesis
Methods 2011; 2: 33–42.
Becker B. Model-based meta-analysis. In: Cooper H, Hedges L, Valentine J, editors. The
Handbook of Research Synthesis and Meta-Analysis. New York (NY): Russell Sage
Foundation; 2009. pp. 377–395.
Blankenship KM, Friedman SR, Dworkin S, Mantell JE. Structural interventions: concepts,
challenges and opportunities for research. Journal of Urban Health 2006; 83: 59–72.
Booth A, Harris J, Croot E, Springett J, Campbell F, Wilkins E. Towards a methodology for
cluster searching to provide conceptual and contextual “richness” for systematic reviews
of complex interventions: case study (CLUSTER). BMC Medical Research Methodology
2013; 13: 118.
Booth A, Moore G, Flemming K, Garside R, Rollins N, Tuncalp Ö, Noyes J. Taking account of
context in systematic reviews and guidelines considering a complexity perspective. BMJ
Global Health 2019a; 4: e000840.
Booth A, Noyes J, Flemming K, Moore G, Tuncalp Ö, Shakibazadeh E. Formulating questions
to explore complex interventions within qualitative evidence synthesis. BMJ Global Health
2019b; 4: e001107.
Brunton G, Stansfield C, Caird J, Thomas J. Finding relevant studies. In: Gough D, Oliver S,
Thomas J, editors. An Introduction to Systematic Reviews. 2nd ed. London: Sage; 2017.
Caldwell D, Welton N. Approaches for synthesising complex mental health interventions in
meta-analysis. Evidence-Based Mental Health 2016; 19: 16–21.
Campbell M, Fitzpatrick R, Haines A, Kinmonth A, Sandercock P, Spiegelhalter D, Tyrer P.
Framework for design and evaluation of complex interventions to improve health. BMJ
2000; 321: 694–696.
Christakis N, Fowler J. The spread of obesity in a large social network over 32 years. New
England Journal of Medicine 2007; 357: 370–379.
Cook TD, Cooper H, Cordray DS, Hartmann H, Hedges LV, Light RJ, Louis TA, Mosteller F.
Meta-Analysis for Explanation: A Casebook. New York (NY): Russell Sage Foundation; 1994.
Craig P, Dieppe P, Macintyre S, Michie S, Petticrew M, Nazareth I. Developing and evaluating
complex interventions: the new Medical Research Council guidance. BMJ 2008; 337: a1655.
Craig P, Cooper C, Gunnell D, Haw S, Lawson K, Macintyre S, Ogilvie D, Petticrew M, Reeves B,
Sutton M, Thompson S. Using natural experiments to evaluate population health
interventions: new MRC guidance. Journal of Epidemiology and Community Health 2012; 66:
1182–1186.
Craig P, Di Ruggiero E, Frohlich K, Mykhalovskiy E, White M, on behalf of the Canadian
Institutes of Health Research (CIHR)–National Institute for Health Research (NIHR) Context
Guidance Authors Group. Taking account of context in population health intervention
research: guidance for producers, users and funders of research. Southampton; 2018.
Evans RE, Craig P, Hoddinott P, Littlecott H, Moore L, Murphy S, O’Cathain A, Pfadenhauer L,
Rehfuess E, Segrott J, Moore G. When and how do ‘effective’ interventions need to be
adapted and/or re-evaluated in new contexts? The need for guidance. Journal of
Epidemiology and Community Health 2019; 73: 481–482.
Flemming K, Booth A, Hannes K, Cargo M, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series-paper 6: reporting guidelines for qualitative, implementation, and process evaluation evidence syntheses. Journal of Clinical Epidemiology 2018; 97: 79–85.
Howick J, Glasziou P, Aronson JK. Problems with using mechanisms to solve the problem of
extrapolation. Theoretical Medicine and Bioethics 2013; 34: 275–291.
Kaufman J, Ryan R, Glenton C, Lewin S, Bosch-Capblanch X, Cartier Y, Cliff J, Oyo-Ita A, Ames
H, Muloliwa AM, Oku A, Rada G, Hill S. Childhood vaccination communication outcomes
unpacked and organized in a taxonomy to facilitate core outcome establishment. Journal
of Clinical Epidemiology 2017; 84: 173–184.
Kelly M, Noyes J, Kane R, Chang C, Uhl S, Robinson K, Springs S, Butler M, Guise J. AHRQ
series on complex intervention systematic reviews-paper 2: defining complexity,
formulating scope, and questions. Journal of Clinical Epidemiology 2017; 90: 11–18.
Khuder SA, Milz S, Jordan T, Price J, Silvestri K, Butler P. The impact of a smoking ban on
hospital admissions for coronary heart disease. Preventive Medicine 2007; 45: 3–8.
Kneale D, Thomas J, Harris K. Developing and optimising the use of logic models in
systematic reviews: exploring practice and good practice in the use of programme theory
in reviews. PloS One 2015; 10: e0142187.
Kneale D, Thomas J, Bangpan M, Waddington H, Gough D. Conceptualising causal pathways in
systematic reviews of international development interventions through adopting a causal
chain analysis approach. Journal of Development Effectiveness 2018; 10: 422–437.
Krieger N. Who and what is a ‘population’? Historical debates, current controversies, and
implications for understanding “Population Health” and rectifying health inequities.
Milbank Quarterly 2012; 90: 634–681.
Lewin S, Hendry M, Chandler J, Oxman A, Michie S, Shepperd S, Reeves B, Tugwell P, Hannes
K, Rehfuess E, Welch V, Mckenzie J, Burford B, Petkovic J, Anderson L, Harris J, Noyes J.
Assessing the complexity of interventions within systematic reviews: development,
content and use of a new tool (iCAT_SR). BMC Medical Research Methodology 2017; 17: 76.
Loudon K, Treweek S, Sullivan F, Donnan P, Thorpe KE, Zwarenstein M. The PRECIS-2 tool:
designing trials that are fit for purpose. BMJ 2015; 350: h2147.
Luke D, Stamatakis K. Systems science methods in public health: dynamics, networks, and
agents. Annual Review of Public Health 2012; 33: 357–376.
Montgomery M, Movsisyan A, Grant S. Considerations of complexity in rating certainty of
evidence in systematic reviews: a primer on using the GRADE approach in global health.
BMJ Global Health 2019; 4: e000848.
Moore G, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, Moore L, O’Cathain A, Tinati T,
Wight D, Baird J. Process evaluation of complex interventions: Medical Research Council
guidance. BMJ 2015; 350: h1258.
Nilsen P. Making sense of implementation theories, models and frameworks.
Implementation Science 2015; 10: 53.
Noyes J, Gough D, Lewin S, Mayhew A, Welch V. A research and development agenda for
systematic reviews that ask complex questions about complex interventions. Journal of
Clinical Epidemiology 2013; 66: 1262–1270.
Noyes J, Hendry M, Booth A, Chandler J, Lewin S, Glenton C, Garside R. Current use was
established and Cochrane guidance on selection of social theories for systematic reviews
of complex interventions was developed. Journal of Clinical Epidemiology 2016a; 75: 78–92.
Noyes J, Hendry M, Lewin S, Glenton C, Chandler J, Rashidian A. Qualitative ‘trial-sibling’
studies and ‘unrelated’ qualitative studies contributed to complex intervention reviews.
Journal of Clinical Epidemiology 2016b; 74: 133–143.
Noyes J, Booth A, Moore G, Flemming K, Tuncalp Ö, Shakibazadeh E. Synthesising
quantitative and qualitative evidence to inform guidelines on complex interventions:
clarifying the purposes, designs and outlining some methods. BMJ Global Health 2019;
4 (Suppl 1): e000893.
Oliver S, Dickson K, Bangpan M, Newman M. Getting started with a review. In: Gough D,
Oliver S, Thomas J, editors. An Introduction to Systematic Reviews. London: Sage
Publications Ltd.; 2017. pp. 71–92
Pantoja T, Opiyo N, Lewin S, Paulsen E, Ciapponi A, Wiysonge CS, Herrera CA, Rada G,
Peñaloza B, Dudley L, Gagnon MP, Garcia Marti S, Oxman AD. Implementation strategies
for health systems in low-income countries: an overview of systematic reviews. Cochrane
Database of Systematic Reviews 2017; 9: CD011086.
Petticrew M. Time to rethink the systematic review catechism. Systematic Reviews 2015; 4: 1.
Petticrew M, Knai C, Thomas J, Rehfuess E, Noyes J, Gerhardus A, Grimshaw J, Rutter H, McGill E.
Implications of a complex systems perspective for systematic reviews and
guideline development in health decision-making. BMJ Global Health 2019; 4 (Suppl 1):
e000899.
Pfadenhauer L, Gerhardus A, Mozygemba K, Bakke Lysdahl K, Booth A, Hofmann B, Wahlster
P, Polus S, Burns J, Brereton L, Rehfuess E. Making sense of complexity in context and
implementation: the Context and Implementation of Complex Interventions (CICI)
framework. Implementation Science 2017; 12: 21.
Pigott T, Noyes J, Umscheid CA, Myers E, Morton SC, Fu R, Sanders-Schmidler GD, Devine B,
Murad MH, Kelly MP, Fonnesbeck C, Kahwati L, Beretvas SN. AHRQ series on complex
intervention systematic reviews-paper 5: advanced analytic methods. Journal of Clinical
Epidemiology 2017; 90: 37–42.
Rehfuess EA, Stratil JM, Scheel IB, Portela A, Norris SL, Baltussen R. Integrating WHO norms
and values into guideline and other health decisions: the WHO-INTEGRATE evidence to
decision framework version 1.0. BMJ Global Health 2019; 4: e000844.
Rohwer A, Pfadenhauer L, Burns J, Brereton L, Gerhardus A, Booth A, Oortwijn W, Rehfuess
E. Series: Clinical Epidemiology in South Africa. Paper 3: Logic models help make sense of
complexity in systematic reviews and health technology assessments. Journal of Clinical
Epidemiology 2017; 83: 37–47.
Squires J, Valentine J, Grimshaw J. Systematic reviews of complex interventions: framing
the review question. Journal of Clinical Epidemiology 2013; 66: 1215–1222.
Tanner-Smith E, Grant S. Meta-analysis of complex interventions. Annual Review of Public
Health 2018; 39: 1–16.
Thomas J, Harden A, Oakley A, Sutcliffe K, Rees R, Brunton G, Kavanagh K. Integrating
qualitative research with trials in systematic reviews. BMJ 2004; 328: 1010–1012.
Thomas J, Harden A. Methods for the thematic synthesis of qualitative research in
systematic reviews. BMC Medical Research Methodology 2008; 8: 45.
Thomas J, O’Mara-Eves A, Brunton G. Using qualitative comparative analysis (QCA) in systematic
reviews of complex interventions: a worked example. Systematic Reviews 2014; 3: 67.
Tong A, Flemming K, McInnes E, Oliver S, Craig J. Enhancing transparency in reporting the
synthesis of qualitative research: ENTREQ. BMC Medical Research Methodology 2012; 12: 181.
Tugwell P, Petticrew M, Kristjansson E, Welch V, Ueffing E, Waters E, Bonnefoy J, Morgan A,
Doohan E, Kelly M. Assessing equity in systematic reviews: realising the recommendations
of the Commission on Social Determinants of Health. BMJ 2010; 341: c4739.
Viswanathan M, McPheeters M, Murad MH, Butler M, Devine E, Dyson M, Guise J, Kahwati L, Miles J, Morton S. AHRQ series on complex intervention systematic reviews-paper 4: selecting analytic approaches. Journal of Clinical Epidemiology 2017; 90: 28–36.
18
Patient-reported outcomes
Bradley C Johnston, Donald L Patrick, Tahira Devji, Lara J Maxwell, Clifton O
Bingham III, Dorcas E Beaton, Maarten Boers, Matthias Briel, Jason W Busse, Alonso
Carrasco-Labra, Robin Christensen, Bruno R da Costa, Regina El Dib, Anne Lyddiatt,
Raymond W Ostelo, Beverley Shea, Jasvinder Singh, Caroline B Terwee, Paula R
Williamson, Joel J Gagnier, Peter Tugwell, Gordon H Guyatt
KEY POINTS
• … care decision makers are informed about the outcomes most meaningful to patients.
• Authors of systematic reviews that include PROs should have a good understanding of how patient-reported outcome measures (PROMs) are developed, including the constructs they are intended to measure, their reliability, validity and responsiveness.
• Authors should pre-specify at the protocol stage a hierarchy of preferred PROMs to measure the outcomes of interest.
may address patient-relevant outcomes via proxy reports or observations from care-
givers, health professionals, or parents and guardians, these are not PROMs but rather
clinician-reported or observer-reported outcomes (Powers et al 2017).
PROs provide crucial information for patients and clinicians facing choices in health care.
Conducting systematic reviews and meta-analyses including PROMs and interpreting their
results is not straightforward, and guidance can help review authors address the challenges.
The objectives of this chapter are to: (i) describe the category of outcomes known as
PROs and their importance for healthcare decision making; (ii) illustrate the key issues
related to reliability, validity and responsiveness that systematic review authors should
consider when including PROs; and (iii) address the structure and content (domains,
items) of PROs and provide guidance for combining information from different PROs.
This chapter outlines a step-by-step approach to addressing each of these elements in
the systematic review process. The focus is on the use of PROs in randomized trials, and
what is crucial in this context when selecting PROs to include in a meta-analysis. The
principles also apply to systematic reviews of non-randomized studies addressing PROs
(e.g. dealing with adverse drug reactions).
A common term used in the health status measurement literature is construct. Con-
struct refers to what PROMs are trying to measure, the concept that defines the PROM
such as pain, physical function or depressive mood. Constructs are the postulated attri-
butes of the person that investigators hope to capture with the PROM (Cronbach and
Meehl 1955).
Many different ways exist to label and classify PROMs and the constructs they meas-
ure. For instance, reports from patients include signs (observable manifestations of a
condition), sensations (most commonly classified as symptoms that may be attribut-
able to disease and/or treatment), behaviours and abilities (commonly classified as
functional status), general perceptions or feelings of well-being, general health, satis-
faction with treatment, reports of adverse effects, adherence to treatment, participation in social or community events, and health-related quality of life (HRQoL).
Investigators can use different approaches to capture patient perspectives, including
interviews, self-completed questionnaires, diaries, and via different interfaces such as
hand-held devices or computers. Review authors must identify the postulated con-
structs that are important to patients, and then determine the extent to which the
PROMs used and reported in the trials address those constructs, the characteristics
(measurement properties) of the PROMs used, and communicate this information to
the reader (Calvert et al 2013).
Focusing now on HRQoL, an important PRO, some approaches attempt to cover the
full range of health-related patient experience – including, for instance, self-care, and
physical, emotional and social function – and thus enable comparisons between the
impact of treatments on HRQoL across diseases or conditions. Authors often call these
approaches generic instruments (Guyatt et al 1989, Patrick and Deyo 1989). These
include utility measures such as the EuroQol five dimensions questionnaire (EQ-5D)
or the Health Utilities Index (HUI). They also include health profiles such as the Short
Form 36-item (SF-36) or the SF-12; these have come to dominate the field of health pro-
files (Tarlov et al 1989, Ware et al 1995, Ware et al 1996). An alternative approach to
measuring PROs is to focus on much more specific constructs: PROMs may be specific
to function (e.g. sleep, sexual function), to a disease (e.g. asthma, heart failure), to a
population (e.g. the frail elderly) or to a symptom (pain, fatigue) (Guyatt et al 1989,
Patrick and Deyo 1989). Another domain-specific measurement system now receiving attention is the Patient-Reported Outcomes Measurement Information System (PROMIS). PROMIS is a National Institutes of Health funded PROM programme using computerized adaptive testing from large item banks for over 70 domains (e.g. anxiety, depression, pain, social function) relevant to a wide variety of chronic diseases (Cella et al 2007, Witter 2016, PROMIS 2018).
Authors often use the terms ‘quality of life’, ‘health status’, ‘functional status’,
‘HRQoL’ and ‘well-being’ loosely and interchangeably. Systematic review authors must
therefore consider carefully the constructs that the PROMs have actually measured. To
do so, they may need to examine the items or questions included in a PROM.
Another issue to consider is whether and how the individual items of instruments are
weighted. A number of approaches can be used to arrive at weights (Wainer 1976). Util-
ity instruments designed for economic analysis put greater emphasis on item weight-
ing, attempting ultimately to present HRQoL as a continuum anchored between death
and full health. Many PROMs weight items equally in the calculation of the overall score,
a reasonable approach. Readers can refer to a helpful overview of classical test theory and item response theory to understand better the merits and limitations of weighting (Cappelleri et al 2014).
Table 18.2.a Checklist for describing and assessing PROMs in clinical trials. Adapted from Guyatt et al (1997)
Table 18.2.a presents a framework for considering and reporting PROMs in clinical
trials, including their constructs and how they were measured. A good understanding
of the PROMs identified in the included studies for a review is essential to appropriate
analysis of outcomes across studies, and appraisal of the certainty of the evidence.
18.3.2 Reliability
Intuitively, many think of reliability as obtaining the same scores on repeated admin-
istration of an instrument in stable respondents. That stability (or lack of measurement
error) is important, but not sufficient. Satisfactory instruments must be able to distin-
guish between individuals despite measurement error.
Reliability statistics therefore look at the ratio of the variability between respondents
(typically the numerator of a reliability statistic) and the total variability (the variability
between respondents and the variability within respondents). The most commonly
used statistics to measure reliability are a kappa coefficient for categorical data, a
weighted kappa coefficient for ordered categorical data, and an intraclass correlation
coefficient for continuous data (de Vet et al 2011).
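Written out in standard notation (these are generic definitions rather than formulas reproduced from this chapter), the intraclass correlation coefficient (ICC) expresses this ratio directly, while kappa corrects observed agreement (p_o) for agreement expected by chance (p_e):

```latex
\mathrm{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
\qquad\qquad
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Values approaching 1 indicate that an instrument can distinguish between respondents despite measurement error.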
Limitations in reliability will be of most concern for the review author when rando-
mized trials have failed to establish the superiority of an experimental intervention over
a comparator intervention. The reason is that lack of reliability cannot create interven-
tion effects that are not present, but can obscure true intervention effects as a result of
random error. When a systematic review does not find evidence that an intervention
affects a PROM, review authors should consider whether this may be due to poor reli-
ability (e.g. if reliability coefficients are less than 0.7) rather than lack of an effect.
18.3.3 Validity
Validity has to do with whether the instrument is measuring what it is intended to
measure. Content validity assessment involves patient and clinician evaluation of
the relevance and comprehensiveness of the content contained in the measures, usu-
ally obtained through qualitative research with patients and families (Johnston et al
2012). Guidance is available on the assessment of content validity for PROMs used
in clinical trials (Patrick et al 2011a, Patrick et al 2011b).
Construct validity involves examining the logical relationships that should exist
between assessment measures. For example, in patients with COPD, we would expect
that patients with lower treadmill exercise capacity generally will have more dyspnoea
(shortness of breath) in daily life than those with higher exercise capacity, and we
would expect to see substantial correlations between a new measure of emotional
function and existing emotional function questionnaires.
When we are interested in evaluating change over time – that is, in the context of eval-
uation when measures are available both before and after an intervention – we examine
correlations of change scores. For example, patients with COPD who deteriorate in their
treadmill exercise capacity should, in general, show increases in dyspnoea, while those whose exercise capacity improves should experience less dyspnoea. Similarly, a new emotional function instrument should show concurrent improvement in patients who improve on existing measures of emotional function. The technical term for this process
is testing an instrument’s longitudinal construct validity. Review authors should look for
evidence of the validity of PROMs used in clinical studies. Unfortunately, reports of ran-
domized trials using PROMs seldom review or report evidence of the validity of the
instruments they use, but when these are available review authors can gain some reas-
surance from statements (backed by citations) that the questionnaires have been previ-
ously validated, or could seek additional published information on named PROMs.
Ideally, review authors should look for systematic reviews of the measurement properties
of the instruments in question. The Consensus-based standards for the selection of
health measurement instruments (COSMIN) website offers a database of such reviews
(COSMIN Database of Systematic Reviews). In addition, the Patient-Reported Outcomes
and Quality of Life Instruments Database (PROQOLID) provides documentation of the
measurement properties for over 1000 PROs.
If the validity of the PROMs used in a systematic review remains unclear, review
authors should consider whether the PROM is an appropriate measure of the review’s
planned outcomes, or whether it should be excluded (ideally, this would be considered
at the protocol stage), and any included results should be interpreted with appropriate
caution. For instance, in a review of flavonoids for haemorrhoids, authors of primary
trials used PROMs to ascertain patients’ experience with pain and bleeding (Alonso-
Coello et al 2006). Although the wording of these PROMs was simple and made intuitive
sense, the absence of formal validation raises concerns over whether these measures
can give meaningful data to distinguish between the intervention and its comparators.
A final concern about validity arises if the measurement instrument is used with a
different population, or in a culturally and linguistically different environment from
the one in which it was developed. Ideally, PROMs should be re-validated in each study,
but systematic review authors should be careful not to be too critical on this
basis alone.
18.3.4 Responsiveness
In the evaluative context, randomized trial participant measurements are typically
available before and after the intervention. PROMs must therefore be able to distin-
guish among patients who remain the same, improve or deteriorate over the course
of the trial (Guyatt et al 1987, Revicki et al 2008). Authors often refer to this measure-
ment property as responsiveness; alternatives are sensitivity to change or ability to
detect change.
As with reliability, responsiveness becomes an issue when a meta-analysis suggests
no evidence of a difference between an intervention and control. An instrument with a
poor ability to measure change can result in false-negative results, in which the inter-
vention improves how patients feel, yet the instrument fails to detect the improvement.
This problem may be particularly salient for generic questionnaires that have the
advantage of covering all relevant areas of HRQoL, but the disadvantage of covering
each area superficially or without the detail required for the particular context of
use (Wiebe et al 2003, Johnston et al 2016a). Thus, in studies that show no difference
in PROMs between intervention and control, lack of instrument responsiveness is one
possible reason. Review authors should look for published evidence of responsiveness.
If there is an absence of prior evidence of responsiveness, this represents a potential
reason for being less certain about evidence from a series of randomized trials. For
instance, a systematic review of respiratory muscle training in COPD found no effect
on patients’ function. However, two of the four studies that assessed a PROM used
instruments without established responsiveness (Smith et al 1992).
18.4 Synthesis and interpretation of evidence
In some clinical fields core outcome sets are available to guide the use of appropriate
PROs (COMET 2018). Only rarely do these include specific guidance on which PROMs are
preferable, although methods have been proposed for this (Prinsen et al 2016). Within
the field of rheumatology, the Outcome Measures in Rheumatology (OMERACT) initia-
tive has developed a conceptual framework known as OMERACT Filter 2.0 to identify
both core domain sets (what outcome should be measured) and core outcome meas-
urement sets (how the outcome should be measured, i.e. which PROM to use) (Boers
et al 2014). This is a generic framework and applicable to those developing core out-
come sets outside the field of rheumatology.
As an example of a pre-defined hierarchy, for knee osteoarthritis, OMERACT has used
a published hierarchy based on responsiveness for extraction of PROMs evaluating pain
and physical function for performing systematic reviews (Juhl et al 2012).
Authors should decide in advance whether to exclude PROMs not included in the hierar-
chy, or to include additional measures where none of the preferred measures are available.
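As a minimal sketch of how a pre-specified hierarchy operates at data extraction, the example below selects, for each study, the highest-ranked PROM it reports; the hierarchy and study data are hypothetical.

```python
# Minimal sketch: apply a pre-specified PROM hierarchy during data
# extraction by taking the highest-ranked PROM each study reports.
# The hierarchy and study data below are hypothetical.
HIERARCHY = ["WOMAC pain", "VAS pain", "other pain PROM"]  # most to least preferred

studies = {
    "Study 1": ["VAS pain", "other pain PROM"],
    "Study 2": ["WOMAC pain", "VAS pain"],
    "Study 3": ["other pain PROM"],
}

for study, reported in studies.items():
    # Pick the first (most preferred) hierarchy entry the study reports.
    chosen = next((p for p in HIERARCHY if p in reported), None)
    print(f"{study}: extract {chosen!r}")
```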
Ideally, the decision to combine scores from different PROMs would be based not only on their measuring similar constructs but also on their satisfactory validity and, depending on whether measurements were available both before and after intervention or only after intervention, on their responsiveness or reliability. For example,
extensive evidence of validity is available for both the Chronic Respiratory Questionnaire (CRQ) and the St George's Respiratory Questionnaire (SGRQ). The CRQ has,
however, proved more responsive than the SGRQ: in an investigation that included
15 studies using both instruments, standardized response means of the CRQ (median
0.51, interquartile range (IQR) 0.19 to 0.98) were significantly higher (P < 0.001) than
those associated with the SGRQ (median 0.26, IQR −0.03 to 0.40) (Puhan et al 2006).
As a result, pooling results from trials using these two instruments could lead to under-
estimates of intervention effect in studies using the SGRQ (Puhan et al 2006, Johnston
et al 2010). This can be tested using a sensitivity analysis of studies using the more
responsive versus less responsive instrument.
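For reference, the standardized response mean (SRM) underlying this comparison has a standard definition (stated here generically, not taken from this chapter): the mean change score divided by the standard deviation of the change scores.

```latex
\mathrm{SRM} = \frac{\bar{d}}{\mathrm{SD}(d)} , \qquad d_i = x_{i,\text{after}} - x_{i,\text{before}}
```

Larger values indicate a more responsive instrument, which is why pooling CRQ results (median SRM 0.51) with SGRQ results (median SRM 0.26) risks dilution of the pooled estimate by the less responsive instrument.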
Usually, detailed data such as those described above will be unavailable. Investigators
must then fall back on intuitive decisions about the extent to which different instruments
are measuring the same underlying concept. For example, the authors of a meta-analysis of
psychosocial interventions in the treatment of pre-menstrual syndrome faced a profusion
of outcome measures, with 25 PROMs used in their nine eligible studies (Busse et al 2009).
Table 18.4.a Examples of potentially combinable PROMs measuring similar constructs from a review
of psychosocial interventions in the treatment of pre-menstrual syndrome (Busse et al 2009).
Reproduced with permission of Karger
Anxiety
Beck Anxiety Inventory
Menstrual Symptom Diary-Anxiety domain
State and Trait Anxiety Scale-State Anxiety domain
Behavioural Changes
Menstrual Distress Questionnaire-Behavioural Changes domain
Pre-Menstrual Assessment Form-Social Withdrawal domain
Depression
Beck Depression Inventory
Depression Adjective Checklist State-Depression domain
General Contentment Scale-Depression and Well-being domain
Menstrual Symptom Diary-Depression domain
Menstrual Distress Questionnaire-Negative Affect domain
Interference
Global Rating of Interference
Daily Record of Menstrual Complaints-Interference domain
Sexual Relations
Marital Satisfaction Inventory-Sexual Dissatisfaction domain
Social Adjustment Scale-Sexual Relationship domain
Water Retention and Oedema
Menstrual Distress Questionnaire-Water Retention domain
Menstrual Symptom Diary-Oedema domain
They dealt with this problem by having two experienced clinical researchers, knowledgeable about the study area and not otherwise involved in the review, independently examine
each instrument – including all domains – and group 16 PROMs into six discrete conceptual
categories. Any discrepancies were resolved by discussion to achieve consensus.
Table 18.4.a details the categories and the included instruments within each category.
Authors should follow the guidance elsewhere in this Handbook on appropriate meth-
ods of synthesizing different outcome measures in a single analysis (Chapter 10) and
interpreting these results in a way that is most meaningful for decision makers
(Chapter 15).
Having decided which PROs and subsequently PROMs to include in a meta-analysis,
review authors face the challenge of ensuring the results they present are interpretable
to their target audiences. For instance, if told that the mean difference between reha-
bilitation and standard care in a series of randomized trials using the CRQ was 1.0
(95% CI 0.6 to 1.5), many readers would be uncertain whether this represents a trivial,
small but important, moderate, or large effect (Guyatt et al 1998, Brozek et al 2006,
Schünemann et al 2006). Similarly, the interpretation of a standardized mean differ-
ence is challenging for most (Johnston et al 2016b). Chapter 15 summarizes the various
statistical presentation approaches that can be used to improve the interpretability of
summary estimates. Further, for those interested in additional guidance, the GRADE
working group summarizes five presentation approaches to enhancing the interpreta-
bility of pooled estimates of PROs when preparing ‘Summary of findings’ tables
(Thorlund et al 2011, Guyatt et al 2013, Johnston et al 2013).
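Two of the quantities mentioned above can be stated compactly. These are standard forms; the worked arithmetic assumes the widely cited minimal important difference (MID) of 0.5 per item for the CRQ, which is an illustrative assumption rather than a value given in this chapter.

```latex
% Standardized mean difference across trials using different instruments
\mathrm{SMD} = \frac{\bar{x}_1 - \bar{x}_2}{\mathrm{SD}_{\text{pooled}}}
% Re-expressing a mean difference (MD) in MID units (cf. Johnston et al 2010)
\text{effect in MID units} = \frac{\mathrm{MD}}{\mathrm{MID}}
% e.g. for the CRQ example above, assuming MID = 0.5:
\frac{1.0}{0.5} = 2 \ \text{MID units}
```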
18.6 References
Alonso-Coello P, Zhou Q, Martinez-Zapata MJ, Mills E, Heels-Ansdell D, Johanson JF, Guyatt
G. Meta-analysis of flavonoids for the treatment of haemorrhoids. British Journal of
Surgery 2006; 93: 909–920.
Beaton D, Boers M, Tugwell P. Assessment of health outcomes. In: Firestein G, Budd R,
Gabriel SE, McInnes IB, O’Dell J, editors. Kelley and Firestein’s Textbook of Rheumatology.
10th ed. Philadelphia (PA): Elsevier; 2016. pp. 496–508.
Boers M, Brooks P, Strand CV, Tugwell P. The OMERACT filter for Outcome Measures in
Rheumatology. Journal of Rheumatology 1998; 25: 198–199.
Boers M, Kirwan JR, Wells G, Beaton D, Gossec L, d’Agostino MA, Conaghan PG, Bingham CO,
3rd, Brooks P, Landewe R, March L, Simon LS, Singh JA, Strand V, Tugwell P. Developing
core outcome measurement sets for clinical trials: OMERACT filter 2.0. Journal of Clinical
Epidemiology 2014; 67: 745–753.
Brozek JL, Guyatt GH, Schünemann HJ. How a well-grounded minimal important difference
can enhance transparency of labelling claims and improve interpretation of a patient
reported outcome measure. Health and Quality of Life Outcomes 2006; 4: 69.
Bucher HC, Cook DJ, Holbrook AM, Guyatt G. Chapter 13.4: Surrogate outcomes. In: Guyatt
G, Rennie D, Meade MO, Cook DJ, editors. Users’ Guides to the Medical Literature: A Manual
for Evidence-Based Clinical Practice. 3rd ed. New York: McGraw-Hill Education; 2014.
Busse JW, Montori VM, Krasnik C, Patelis-Siotis I, Guyatt GH. Psychological intervention for
premenstrual syndrome: a meta-analysis of randomized controlled trials. Psychotherapy
and Psychosomatics 2009; 78: 6–15.
Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD. Reporting of patient-
reported outcomes in randomized trials: the CONSORT PRO extension. JAMA 2013; 309:
814–822.
Cappelleri JC, Jason Lundy J, Hays RD. Overview of classical test theory and item response
theory for the quantitative assessment of items in developing patient-reported outcomes
measures. Clinical Therapeutics 2014; 36: 648–662.
Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B, Rose M.
The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of
an NIH Roadmap cooperative group during its first two years. Medical Care 2007; 45:
S3–S11.
Christensen R, Maxwell LJ, Jüni P, Tovey D, Williamson PR, Boers M, Goel N, Buchbinder R,
March L, Terwee CB, Singh JA, Tugwell P. Consensus on the need for a hierarchical list of
patient-reported pain outcomes for metaanalyses of knee osteoarthritis trials: an
OMERACT objective. Journal of Rheumatology 2015; 42: 1971–1975.
COMET. Core Outcome Measures in Effectiveness Trials 2018. http://www.comet-
initiative.org.
Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychological Bulletin 1955;
52: 281–302.
de Vet HCW, Terwee CB, Mokkink LB, Knol DL. Measurement in Medicine: A Practical Guide.
Cambridge: Cambridge University Press; 2011.
FDA. Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product
Development to Support Labeling Claims. Rockville, MD; 2009. http://www.fda.gov/
downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/
UCM193282.pdf
FDA. Clinical Outcome Assessment Program Silver Spring, MD: US Food and Drug
Administration; 2018. https://www.fda.gov/Drugs/DevelopmentApprovalProcess/
DrugDevelopmentToolsQualificationProgram/ucm284077.htm.
Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of
evaluative instruments. Journal of Chronic Diseases 1987; 40: 171–178.
Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, Norris S, Falck-Ytter Y, Glasziou P,
DeBeer H, Jaeschke R, Rind D, Meerpohl J, Dahm P, Schünemann HJ. GRADE guidelines: 1.
Introduction-GRADE evidence profiles and summary of findings tables. Journal of Clinical
Epidemiology 2011; 64: 383–394.
Guyatt GH, Veldhuyzen Van Zanten SJ, Feeny DH, Patrick DL. Measuring quality of life in
clinical trials: a taxonomy and review. CMAJ: Canadian Medical Association Journal 1989;
140: 1441–1448.
Guyatt GH, Naylor CD, Juniper E, Heyland DK, Jaeschke R, Cook DJ. Users’ guides to the
medical literature. XII. How to use articles about health-related quality of life. Evidence-
Based Medicine Working Group. JAMA 1997; 277: 1232–1237.
Guyatt GH, Juniper EF, Walter SD, Griffith LE, Goldstein RS. Interpreting treatment effects in
randomised trials. BMJ 1998; 316: 690–693.
Guyatt GH, Thorlund K, Oxman AD, Walter SD, Patrick D, Furukawa TA, Johnston BC,
Karanicolas P, Akl EA, Vist G, Kunz R, Brozek J, Kupper LL, Martin SL, Meerpohl JJ, Alonso-
Coello P, Christensen R, Schünemann HJ. GRADE guidelines: 13. Preparing summary of
findings tables and evidence profiles-continuous outcomes. Journal of Clinical
Epidemiology 2013; 66: 173–183.
Hannan MT, Felson DT, Pincus T. Analysis of the discordance between radiographic
changes and knee pain in osteoarthritis of the knee. Journal of Rheumatology 2000;
27: 1513–1517.
Johnston BC, Thorlund K, Schünemann HJ, Xie F, Murad MH, Montori VM, Guyatt GH.
Improving the interpretation of quality of life evidence in meta-analyses: the
application of minimal important difference units. Health and Quality of Life
Outcomes 2010; 8: 116.
Johnston BC, Thorlund K, da Costa BR, Furukawa TA, Guyatt GH. New methods can extend
the use of minimal important difference units in meta-analyses of continuous outcome
measures. Journal of Clinical Epidemiology 2012; 65: 817–826.
Johnston BC, Patrick DL, Thorlund K, Busse JW, da Costa BR, Schünemann HJ, Guyatt GH.
Patient-reported outcomes in meta-analyses-part 2: methods for improving
interpretability for decision-makers. Health and Quality of Life Outcomes 2013; 11: 211.
Johnston BC, Miller PA, Agarwal A, Mulla S, Khokhar R, De Oliveira K, Hitchcock CL,
Sadeghirad B, Mohiuddin M, Sekercioglu N, Seweryn M, Koperny M, Bala MM, Adams-
Webber T, Granados A, Hamed A, Crawford MW, van der Ploeg AT, Guyatt GH. Limited
responsiveness related to the minimal important difference of patient-reported
outcomes in rare diseases. Journal of Clinical Epidemiology 2016a; 79: 10–21.
Johnston BC, Alonso-Coello P, Friedrich JO, Mustafa RA, Tikkinen KA, Neumann I, Vandvik
PO, Akl EA, da Costa BR, Adhikari NK, Dalmau GM, Kosunen E, Mustonen J, Crawford MW,
Thabane L, Guyatt GH. Do clinicians understand the size of treatment effects?
A randomized survey across 8 countries. CMAJ: Canadian Medical Association Journal
2016b; 188: 25–32.
Jones PW. Health status measurement in chronic obstructive pulmonary disease. Thorax
2001; 56: 880–887.
Schünemann HJ, Goldstein R, Mador MJ, McKim D, Stahl E, Puhan M, Griffith LE, Grant B,
Austin P, Collins R, Guyatt GH. A randomised trial to evaluate the self-administered
standardised chronic respiratory questionnaire. European Respiratory Journal 2005; 25:
31–40.
Schünemann HJ, Akl EA, Guyatt GH. Interpreting the results of patient reported outcome
measures in clinical trials: the clinician’s perspective. Health and Quality of Life Outcomes
2006; 4: 62.
Singh SJ, Sodergren SC, Hyland ME, Williams J, Morgan MD. A comparison of three disease-
specific and two generic health-status measures to evaluate the outcome of pulmonary
rehabilitation in COPD. Respiratory Medicine 2001; 95: 71–77.
Smith K, Cook D, Guyatt GH, Madhavan J, Oxman AD. Respiratory muscle training in chronic
airflow limitation: a meta-analysis. American Review of Respiratory Disease 1992; 145:
533–539.
Tarlov AR, Ware JE, Jr., Greenfield S, Nelson EC, Perrin E, Zubkoff M. The Medical Outcomes
Study. An application of methods for monitoring the results of medical care. JAMA 1989;
262: 925–930.
Tendal B, Nuesch E, Higgins JP, Jüni P, Gøtzsche PC. Multiplicity of data in trial reports and
the reliability of meta-analyses: empirical study. BMJ 2011; 343: d4829.
Thorlund K, Walter SD, Johnston BC, Furukawa TA, Guyatt GH. Pooling health-related
quality of life outcomes in meta-analysis-a tutorial and review of methods for enhancing
interpretability. Research Synthesis Methods 2011; 2: 188–203.
Wainer H. Estimating coefficients in linear models: it don’t make no nevermind.
Psychological Bulletin 1976; 83: 213–217.
Ware J, Jr., Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of
scales and preliminary tests of reliability and validity. Medical Care 1996; 34: 220–233.
Ware JE, Jr., Kosinski M, Bayliss MS, McHorney CA, Rogers WH, Raczek A. Comparison of
methods for the scoring and statistical analysis of SF-36 health profile and summary
measures: summary of results from the Medical Outcomes Study. Medical Care 1995; 33:
AS264–AS279.
Wiebe S, Guyatt G, Weaver B, Matijevic S, Sidwell C. Comparative responsiveness of generic
and specific quality-of-life instruments. Journal of Clinical Epidemiology 2003; 56: 52–60.
Witter JP. Introduction: PROMIS a first look across diseases. Journal of Clinical Epidemiology
2016; 73: 87–88.
Yohannes AM, Roomi J, Waters K, Connolly MJ. Quality of life in elderly patients with COPD:
measurement and predictive factors. Respiratory Medicine 1998; 92: 1231–1236.
19
Adverse effects
Guy Peryer, Su Golder, Daniela R Junqueira, Sunita Vohra, Yoon Kong Loke; on behalf
of the Cochrane Adverse Effects Methods Group
KEY POINTS
• To achieve a balanced perspective, all reviews should try to consider adverse aspects of interventions.
• A detailed analysis of adverse effects is particularly relevant when evidence on the potential for harm has a major influence on treatment or policy decisions.
• There are major challenges in specifying relevant outcomes and study designs for systematic reviews evaluating adverse effects. This is due to high diversity in the number and type of possible adverse effects, as well as variation in their definition, methods of ascertainment, incidence and time-course.
• Review authors should pre-specify their approach to reviewing studies of adverse effects within the review protocol. The approach may be confirmatory (focused on particular adverse effects of interest), exploratory (opportunistic capture of any adverse effects that happen to be reported), or a hybrid (combination of both).
• Depending on the approach used and outcomes of interest to the review, identification of relevant adverse effects data may require a bespoke search process that includes a wider selection of sources than that required to identify data on beneficial outcomes.
• Because adverse effects data are often handled with less rigour than the primary beneficial outcomes of a study, review authors must recognize the possibility of poor case definition, inadequate monitoring and incomplete reporting when synthesizing data.
This chapter should be cited as: Peryer G, Golder S, Junqueira D, Vohra S, Loke YK. Chapter 19: Adverse
effects. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane
Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019:
493–506.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
and may make the intervention look more favourable than it should. All reviews should
try to consider the adverse aspects of interventions.
This chapter addresses special issues about adverse effects in Cochrane Reviews. It
focuses on methodological differences when assessing adverse effects compared with
other outcomes.
vaccines), spontaneously reported adverse events are usually coded, grouped and
categorized following established dictionaries for analysis and presentation.
Whichever monitoring method is used to collect information about adverse events,
study investigators may combine adverse events into global or composite measures,
which are often reported as total number of serious adverse events, or number of with-
drawals due to adverse events, or total number of adverse events in an anatomic or
organ system (e.g. gastrointestinal, cardiovascular). However, these composite mea-
sures do not give information on what exactly the events were, and so it is usually nec-
essary to drill down for details of distinct or individual adverse events, such as nausea
or rash.
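A small, hypothetical data sketch may help make the distinction concrete; the event terms and counts below are invented for illustration.

```python
# Hypothetical sketch: a composite count of adverse events conceals
# which distinct events occurred; event-level ('drill-down') data are
# needed to interpret it. All numbers are invented.
trial_arm = {
    "total_adverse_events": 12,                                  # composite
    "events_by_term": {"nausea": 5, "rash": 4, "dizziness": 3},  # drill-down
}

# The composite total alone cannot reveal whether the 12 events were
# 12 cases of nausea or a mixture of distinct events:
assert sum(trial_arm["events_by_term"].values()) == trial_arm["total_adverse_events"]
print(trial_arm["events_by_term"])
```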
Ideally, the definition and ascertainment of adverse events should be as uniform as
possible across the included studies in the review. The lack of systematic monitoring or
follow-up, coupled with divergent methods of seeking, verifying and classifying adverse
events, can introduce heterogeneity in effect estimates among studies. Review authors
will therefore need to pay close attention to outcome definition and method of mon-
itoring when interpreting or comparing frequencies, rates and risk estimates for
adverse effects.
When different eligibility criteria are used to address beneficial and adverse effects, it
will often be necessary to conduct a separate search for the two (or more) sets of stud-
ies (see Section 19.3), and it may be necessary to plan different methods in other
aspects such as assessing risk of bias (see Section 19.4).
• The attribution of reason(s) for discontinuation is complex and may be due to mild
but irritating side effects, toxicity, lack of efficacy, non-medical reasons, or a combi-
nation of causes.
• The pressures on patients and investigators under trial conditions to reduce number
of withdrawals and dropouts can result in rates that do not reflect the experience of
adverse events within the wider population.
• Unblinding of intervention assignment often precedes the decision to withdraw.
This can lead to an over-estimate of the intervention’s effect on patient withdrawal.
For example, symptoms of patients in the placebo arm are less likely to lead to
discontinuation. Conversely, patients in the active intervention group who com-
plained of symptoms suggesting adverse effects may have been more readily
withdrawn.
Index terms and free-text searching each have strengths and limitations, but the strengths are increased and limitations reduced when they are combined. It is therefore advisable to combine index terms and free-text
searching (where possible) to increase search sensitivity and reduce the possibility of
missing relevant material. More details are provided in the online Technical Supple-
ment to Chapter 4.
19.4.2 Recommended tools for assessing risk of bias in adverse effects data
Review authors should use the currently recommended risk-of-bias tools, the RoB 2
tool for randomized trials (see Chapter 8), and the ROBINS-I tool for non-randomized
studies (see Chapter 25). Although these tools are most easily directed at outcomes
that have been pre-specified by the review team, they are suitable for any type of quan-
titative outcome analysed in a review. Where adverse effects are extracted post hoc
from included trials in an exploratory approach, it may not be possible to list important
co-interventions or confounding variables in the review protocol, as would usually be
expected for using the ROBINS-I tool.
Particular issues in assessing risk of bias for adverse effects data include outcome
definition and methods of monitoring adverse effects. These warrant special attention
when there are significant concerns over bias towards the null stemming from poor
definition, ascertainment or reporting of harms. This is particularly important for
new or unexpected adverse events that have not been pre-specified as outcomes of
interest in the trials, and where monitoring and reporting may be potentially inade-
quate. Additional resources such as the McHarm tool (Chou et al 2010) and the Agency
for Healthcare Research and Quality (AHRQ) assessment tool (Chou et al 2007,
Viswanathan and Berkman 2012) provide further discussion of these issues.
Combining different adverse effects into composite measures may result in important differences between the interventions in individual adverse effects being obscured. Owing to dif-
ferences in coding and categorization of adverse effects between studies, review
authors should avoid trying to increase numbers of events available for analysis by con-
structing composite categories that have not been reported in the primary studies.
Conversely, review authors should be alert to situations in which the coding of adverse
effects splits data unnecessarily (e.g. pain in leg, pain in arm), which may dilute the
signal of a more global effect (e.g. all patients affected by pain).
Review authors should include at least one adverse effect outcome in the ‘Summary
of findings’ table. If the review did not focus on detailed evaluation of any adverse
effects, then the review authors should make an explicit statement that harms were
not assessed, rather than say (or imply) the intervention appears to be safe.
19.7 References
Chou R, Fu R, Carson S, Saha S, Helfand M. Methodological shortcomings predicted lower
harm estimates in one of two sets of studies of clinical interventions. Journal of Clinical
Epidemiology 2007; 60: 18–28.
Chou R, Aronson N, Atkins D, Ismaila AS, Santaguida P, Smith DH, Whitlock E, Wilt TJ, Moher
D. AHRQ series paper 4: assessing harms when comparing medical interventions: AHRQ
and the effective health-care program. Journal of Clinical Epidemiology 2010; 63: 502–512.
Golder S, Loke YK. The contribution of different information sources for adverse effects data.
International Journal of Technology Assessment in Health Care 2012; 28: 133–137.
Golder S, Loke YK, Zorzela L. Some improvements are apparent in identifying adverse effects
in systematic reviews from 1994 to 2011. Journal of Clinical Epidemiology 2013; 66:
253–260.
Golder S, Loke YK, Wright K, Norman G. Reporting of adverse events in published and
unpublished studies of health care interventions: a systematic review. PLoS Medicine
2016; 13: e1002127.
Kicinski M, Springate DA, Kontopantelis E. Publication bias in meta-analyses from the
Cochrane Database of Systematic Reviews. Statistics in Medicine 2015; 34: 2781–2793.
Loke YK, Mattishent K. If nothing happens, is everything all right? Distinguishing genuine
reassurance from a false sense of security. CMAJ: Canadian Medical Association Journal
2015; 187: 15–16.
Saini P, Loke YK, Gamble C, Altman DG, Williamson PR, Kirkham JJ. Selective reporting bias
of harm outcomes within studies: findings from a cohort of systematic reviews. BMJ 2014;
349: g6501.
Schroll JB, Penninga EI, Gøtzsche PC. Assessment of adverse events in protocols, clinical
study reports, and published papers of trials of orlistat: a document analysis. PLoS
Medicine 2016; 13: e1002101.
Smith PG, Morrow RH, Ross DA. Outcome measures and case definition. In: Field Trials of
Health Interventions: A Toolbox. Smith PG, Morrow RH, Ross DA, editors. Oxford (UK):
Oxford University Press; 2015.
Tang E, Ravaud P, Riveros C, Perrodeau E, Dechartres A. Comparison of serious adverse
events posted at ClinicalTrials.gov and published in corresponding journal articles. BMC
Medicine 2015; 13: 189.
Viswanathan M, Berkman ND. Development of the RTI item bank on risk of bias and
precision of observational studies. Journal of Clinical Epidemiology 2012; 65: 163–178.
Zorzela L, Loke YK, Ioannidis JP, Golder S, Santaguida P, Altman DG, Moher D, Vohra S,
PRISMA Harms Group. PRISMA harms checklist: improving harms reporting in systematic
reviews. BMJ 2016; 352: i157.
20
Economic evidence
Ian Shemilt, Patricia Aluko, Erin Graybill, Dawn Craig, Catherine Henderson, Michael
Drummond, Edward CF Wilson, Shannon Robalino, Luke Vale; on behalf of the
Campbell and Cochrane Economics Methods Group
KEY POINTS
• Economics is the study of the optimal allocation of limited resources for the production of benefit to society and is therefore relevant to any healthcare decision.
• Optimal decisions also require best evidence on cost-effectiveness.
• This chapter describes methods for incorporating an economics view on the review question and evidence into Cochrane Reviews.
• Incorporating an economics view on the review question and evidence into Cochrane Reviews can enhance their usefulness and applicability for healthcare decision-making and new economic analyses.
20.1 Introduction
Economics is the study of the optimal allocation of limited resources for the production
of benefit to society. Resources include human time and skills, equipment, buildings,
energy and any other inputs used to achieve a specified course of action. These courses
of action might relate, for example, to a clinical decision to refer a patient for a health-
care intervention (including management of complications and follow-up care), or a
policy decision to implement a public health intervention.
In the face of limited resource availability, decision makers often need to consider not only the beneficial and adverse health effects of interventions, but also the impacts on the use of healthcare resources, costs associated with use of those resources, and
ultimately their value – decision makers also need information on efficiency. The needs for evidence on both effectiveness and efficiency are closely aligned in healthcare decision making. For these reasons, incorporating economic perspectives and evidence into Cochrane Reviews – alongside (and informed by) the evidence for beneficial and adverse effects – can make the findings of the review more useful for decision making (MacLehose et al 2012, Niessen et al 2012).
This chapter should be cited as: Shemilt I, Aluko P, Graybill E, Craig D, Henderson C, Drummond M, Wilson ECF, Robalino S, Vale L; on behalf of the Campbell and Cochrane Economics Methods Group. Chapter 20: Economic evidence. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 507–524.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
The focus of this chapter is on methods to incorporate a health economics perspective into a Cochrane Review. Decisions about whether to include an economic perspective in a Cochrane Review should be made at the planning stage. Further
support with this stage is available from the Economics Methods Group and can be
found in other chapters of this Handbook.
A number of economics terms are used in this chapter but it is not expected that the
reader will be familiar with economics terminology. Where a brief definition is possible it is provided; where a fuller definition is needed, please see the glossary and supplementary material, available on the Campbell and Cochrane Economics Methods Group website.
The integrated full systematic review of economic evidence is covered only briefly in
this chapter. A detailed definition and description can be found on the Campbell and
Cochrane Economics Methods Group website. This approach is substantially more
resource intensive when implemented in full than the brief economic commentary. This
is because it requires additional ‘economic’ methods procedures to be integrated into
each stage of the main systematic review of intervention effects. Conducting an inte-
grated full systematic review of economic evidence will also require specialist input to
the author team from a health economist, with experience (or support from someone
with experience) of applying the framework, at all stages of the process.
The brief economic commentary framework is less intensive but also less rigorous,
and most of this chapter focuses on this approach. This framework is specifically
designed to support the inclusion of economic evidence in Cochrane Reviews without
requiring specialist input from health economists (beyond initial guidance and training
in the method and procedures), and without placing a major additional workload bur-
den on author teams or editorial bases. This framework can be viewed as a ‘minimal
framework’ for incorporating economic evidence, with inherent limitations that will
require appropriate caveats in the commentary.
20.1.2 Core principles for the methods for the review of economic evidence
Three core principles underpin both frameworks.
Full reviews or brief economic commentaries developed with the aim of summarizing
evidence on the costs and/or cost-effectiveness of interventions should not in general
be conducted as a standalone exercise. They must place the relevant economic evi-
dence (in this case the impacts on resource use, costs and/or cost-effectiveness) into
the context of reliable evidence for intervention effects on health and related out-
comes. Failure to do so can lead to a biased summary of the evidence and a distorted
assembly of data from primary studies, because data on the evidence of effects used in
identified economic evaluations are highly likely to be (at best) only a subset of the data
used to provide the summary of evidence of effects (including assessment of the quality
of that evidence). The evidence of effects produced by a Cochrane Review will be the
most up-to-date synthesis and any published economic evaluation can, at best, be
based on only a subset of the data that were available at some earlier time point.
Furthermore, economic evaluations may be susceptible to a specific source of pub-
lication bias (or indeed conduct bias). For example, audits of some clinical areas have
shown that clinical effect sizes in randomized trials published with a concurrent eco-
nomic evaluation are systematically larger than those in randomized trials without.
This may reflect the difficulty in publishing planned economic evaluations conducted
alongside ‘inconclusive’ trials. Also, decisions made whilst planning a trial may mean
that an economic evaluation is excluded (e.g. because it is felt implausible that an
effective intervention could be anything other than cost-saving). However, such rea-
soning may not be reflected in published trial protocols or final study reports. Both of
these issues compound the issue of reporting biases in randomized trials (see
Chapter 13).
Table 20.2.a Decision algorithm to help prioritize reviews for inclusion of economic evidence (reproduced from Frick et al (2012))
If the expected incremental beneficial effect is large, the expected incremental costs are small, and the economic evidence has a low probability of changing the decision, then the algorithm places a low priority on the incorporation of economic evidence into the review. This is because with a large beneficial effect on health (which is likely to translate into lower subsequent use of health services and lower associated healthcare costs) and small input costs, the intervention is likely to be cost-effective (possibly cost-saving) overall. It would, however, be important to state this reasoning in the Background section of a protocol and review.
Conversely, if the expected incremental beneficial effect is small, the expected incre-
mental costs are high, and the economic evidence has a high probability of changing
the decision, then this algorithm places a high priority on the incorporation of eco-
nomic evidence.
The other rows of Table 20.2.a represent six further scenarios that fall between these
two extremes. For example, the second row represents a scenario in which the incre-
mental beneficial effect is small, the incremental cost is low, and the economic evi-
dence has a high probability of changing the decision. This scenario may occur
when, for example, the expected cost impact of the intervention is small but the health
condition targeted by the intervention has a very high prevalence, such that the cumu-
lative impact of small changes in costs across a large number of treated patients adds
up to a large overall change in costs at the level of a region or a country, so affordability
may be very important to a decision maker.
The decision algorithm in Table 20.2.a excludes scenarios in which the intervention is
expected to be associated with negative incremental cost (i.e. net savings) and a pos-
itive incremental effect relative to the comparator (and vice versa); in other words,
situations in which decisions to adopt or reject are expected to be straightforward
because the intervention is clearly better or clearly worse than the comparator (i.e.
it dominates, or is dominated by the comparator).
It is important to understand that if the decision algorithm shown in Table 20.2.a sug-
gests that low (or very low) priority should be placed on incorporating economic evi-
dence, this does not necessarily imply that doing so would provide no useful
information for decision makers. Rather, it implies that a low (or very low) priority
might be assigned to devoting limited research time and resources to conducting
the economics component of a review.
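The logic of the algorithm can be sketched in code. The following minimal Python sketch assumes three binary signals (size of the incremental effect, size of the incremental cost, and the probability that economic evidence would change the decision); the two extreme scenarios follow the text above, while the single label returned for the six intermediate scenarios is an illustrative simplification rather than a reproduction of Frick et al (2012).

```python
# Illustrative sketch of the decision algorithm discussed above (after Frick
# et al 2012). The two extreme scenarios follow the text; collapsing the six
# intermediate rows into one label is an assumption made for brevity.

def economic_evidence_priority(effect_large: bool,
                               cost_high: bool,
                               may_change_decision: bool) -> str:
    """Suggest a priority for incorporating economic evidence into a review."""
    if effect_large and not cost_high and not may_change_decision:
        # Large health benefit with small input costs: likely cost-effective
        # (possibly cost-saving) overall, so low priority; state this
        # reasoning in the Background section of the protocol and review.
        return "low"
    if not effect_large and cost_high and may_change_decision:
        # Small benefit, high incremental costs, and economic evidence with a
        # high probability of changing the decision: high priority.
        return "high"
    # The six intermediate scenarios are judged case by case; affordability at
    # scale can matter even when the per-patient cost impact is small.
    return "intermediate"

print(economic_evidence_priority(effect_large=True, cost_high=False,
                                 may_change_decision=False))  # -> low
```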
To assess the effects of aspirin [intervention] versus placebo [comparator] for pri-
mary prevention of heart attacks [condition and primary health outcome] among
adults aged > 50 years [population].
The questions for a brief economic commentary need to be expressed in the form of an objective, usually a secondary objective for the review. The key objective in this case is to summarize the availability and principal findings of eligible economic evaluations in terms of costs and cost-effectiveness.
Box 20.2.c Example commentary on the general issue of intervention costs and
cost-effectiveness
Given the economic impact of acute and non-union fractures and their treatment, and
the need for economic decisions on the added value of adopting BMP in clinical practice,
it is also important to critically evaluate and summarize current evidence on the costs
(resource use) and estimated cost-effectiveness associated with use of BMP as an
adjunct to, or replacement for, current standard treatments (Garrison et al 2010).
∗ A definition of these terms can be found in the Glossary, and a fuller explanation is provided in the supplementary material on the Campbell and Cochrane Economics Methods Group website.
1) checking reference lists and conducting forward citation tracking from eligible studies of effects identified for inclusion in the main review;
2) conducting a search of the NHS Economic Evaluation Database (NHS EED) using keyword terms based on intervention (and possibly comparator) concepts; and
3) applying specialist search filters to sets of records retrieved by searches of one or two selected general electronic biomedical literature databases searched for the main review of intervention effects. Examples of relevant search filters can be obtained from the Economics Methods Group.
The primary rationale for using specialist search filters is the need to identify reports of eligible full economic evaluations published since NHS EED stopped being updated at the end of 2014. If a brief economic commentary is restricted to full economic evaluations only, then we recommend running specialist searches from 1 January 2014 onwards, as NHS EED was based on rigorous and comprehensive searches for full economic evaluations before that date.
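As an illustration of approach (3), the following minimal Python sketch applies a simple keyword filter, restricted to the recommended date range, to records already retrieved for the main review. The term list and record format are illustrative assumptions only; validated specialist search filters should be obtained from the Economics Methods Group.

```python
# Minimal sketch of approach (3): screening records retrieved for the main
# review with a simple economics keyword filter, limited to records published
# after NHS EED coverage ends. The terms below are illustrative only, not a
# validated filter.
import re

ECON_TERMS = re.compile(
    r"cost[- ]effectiveness|cost[- ]utility|cost[- ]benefit|economic evaluation",
    re.IGNORECASE)

def economics_candidates(records, from_year=2014):
    """Return records that may report full economic evaluations."""
    return [r for r in records
            if r["year"] >= from_year
            and ECON_TERMS.search(r["title"] + " " + r.get("abstract", ""))]

records = [
    {"title": "Cost-effectiveness of drug X versus placebo", "year": 2016},
    {"title": "Drug X for condition Z: a randomized trial", "year": 2016},
]
print(economics_candidates(records))  # keeps only the first record
```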
It is helpful to classify cost items into four categories: health sector costs, other sector
costs, patient and family costs, and productivity impacts (Drummond et al 2015)
(although not all economic evaluations will follow this structure). The categories included
will be driven primarily by the analytic perspective of the study. Health sector costs
include the cost to the system or insurers of care provided (excluding costs directly paid
by patients) and can include items such as primary care physician contacts (e.g. face-to-face visits, or formal contacts by phone or the internet), prescribed medications,
inpatient and outpatient hospital contacts, as well as any specialist tertiary care contacts.
Other sector costs include costs borne by social services, education, local authorities, or
police and criminal justice services. Patient and family costs could include any direct pay-
ment or co-payments for medications or care, or out of pocket expenses such as travel or
arranging child or adult care while attending appointments. Productivity losses are the
loss of output to the economy, and are usually measured in terms of time off work due to
accessing care as well as morbidity or premature mortality.
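For data extraction, these four categories can serve as the skeleton of a structured record, as in the following minimal Python sketch; the field names and example items are illustrative assumptions rather than a prescribed template, and not all studies will report every category.

```python
# Minimal sketch of a data-extraction record organized around the four cost
# categories described above (after Drummond et al 2015). Field names and the
# example items are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CostItems:
    health_sector: List[str] = field(default_factory=list)       # e.g. GP contacts
    other_sectors: List[str] = field(default_factory=list)       # e.g. social services
    patient_and_family: List[str] = field(default_factory=list)  # e.g. co-payments
    productivity: List[str] = field(default_factory=list)        # e.g. time off work

study_costs = CostItems(
    health_sector=["primary care contacts", "inpatient stays"],
    patient_and_family=["travel to appointments"],
    productivity=["absence from paid work"],
)
print(study_costs)
```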
For principal findings, the following data should be collected:
It is important to highlight that we did not subject any of the [N] identified eco-
nomic evaluations to critical appraisal and we do not attempt to draw any firm
or general conclusions regarding the relative costs or efficiency of [‘Intervention
X’] compared with [‘Comparator Y’].
The findings of the brief economic commentary should be incorporated into the Dis-
cussion (and not the Results) section of a Cochrane Review. The most appropriate place
for this material is where the results of the systematic review of effects are put into
context of other information and other reviews.
The overall aim of this element of the commentary is to summarize the availability
and principal findings of identified eligible economic evaluations, with appropriate
caveats, rather than to present the detailed results of a systematic search for evidence.
Box 20.5.b Example forms of words for concise discussion points in a brief economic
commentary
Lack of evidence
The apparent shortage of relevant economic evaluations indicates that economic evi-
dence regarding [‘Intervention X’] for [‘Health Condition Z’] is currently lacking.
20.7 References
Abdelhamid A, Shemilt I. Glossary of terms. In: Shemilt I, Mugford M, Vale L, Marsh K, Donaldson C,
editors. Evidence-based Decisions and Economics: Health Care, Social Welfare, Education
and Criminal Justice. Oxford: Wiley-Blackwell; 2010.
Brown SR, Wadhawan H, Nelson RL. Surgery for faecal incontinence in adults. Cochrane
Database of Systematic Reviews 2013; 7: CD001757.
Drummond M, Sculpher M, Claxton K, Stoddart G, Torrance G. Methods for Economic
Evaluation of Health Care Programmes. 4th ed. Oxford (UK): Oxford University Press; 2015.
Frick K, Niessen L, Bridges J, Walker D, Wilson R, Bass E. Usefulness of Economic Evaluation Data
in Systematic Reviews of Evidence. Rockville (MD): Agency for Healthcare Research and
Quality (US); 2012. 12(13)-EHC114-EF. https://www.ncbi.nlm.nih.gov/books/NBK114533/
Garrison KR, Shemilt I, Donell S, Ryder JJ, Mugford M, Harvey I, Song F, Alt V. Bone
morphogenetic protein (BMP) for fracture healing in adults. Cochrane Database of
Systematic Reviews 2010; 6: CD006950.
Gilbody S, Bower P, Sutton AJ. Randomized trials with concurrent economic evaluations
reported unrepresentatively large clinical effect sizes. Journal of Clinical Epidemiology
2007; 60: 781–786.
Jenkins M. Evaluation of methodological search filters: a review. Health Information and
Libraries Journal 2004; 21: 148–163.
Larg A, Moss JR. Cost-of-illness studies: a guide to critical evaluation. Pharmacoeconomics
2011; 29: 653–671.
Latour-Perez J, de Miguel Balsa E. Cost effectiveness of fondaparinux in non-ST-elevation
acute coronary syndrome. Pharmacoeconomics 2009; 27: 585–595.
MacLehose H, Hilton J, Tovey D. The Cochrane Library: Revolution or evolution? Shaping the
future of Cochrane content (background paper). The Cochrane Collaboration’s Strategic
Session; 2012; Paris, France.
Maxwell CB, Holdford DA, Crouch MA, Patel DA. Cost-effectiveness analysis of
anticoagulation strategies in non-ST-elevation acute coronary syndromes. Annals of
Pharmacotherapy 2009; 43: 586–595.
Mellgren A, Jensen LL, Zetterstrom JP, Wong WD, Hofmeister JH, Lowry AC. Long-term cost
of fecal incontinence secondary to obstetric injuries. Diseases of the Colon and Rectum
1999; 42: 857–865; discussion 865–867.
Niessen L, Bridges J, Lau B, Wilson R, Sharma R, Walker D, Frick K, Bass E. Assessing the Impact of Economic Evidence on Policymakers in Health Care: A Systematic Review. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012. Report No. 12(13)-EHC133-EF. https://effectivehealthcare.ahrq.gov/topics/economic-evidence/research
Petticrew M. Time to rethink the systematic review catechism? Moving from ‘what works’ to
‘what happens’. Systematic Reviews 2015; 4: 36.
Sculpher MJ, Lozano-Ortega G, Sambrook J, Palmer S, Ormanidhi O, Bakhai A, Flather M,
Steg PG, Mehta SR, Weintraub W. Fondaparinux versus Enoxaparin in non-ST-elevation
acute coronary syndromes: short-term cost and long-term cost-effectiveness using data
from the Fifth Organization to Assess Strategies in Acute Ischemic Syndromes
Investigators (OASIS-5) trial. American Heart Journal 2009; 157: 845–852.
Shemilt I, Mugford M, Vale L, Craig D, on behalf of the Campbell and Cochrane Economics
Methods Group. Searching NHS EED and HEED to inform development of economic
commentary for Cochrane intervention reviews. 2011. http://methods.cochrane.org/
economics/sites/methods.cochrane.org.economics/files/public/uploads/brief_
economic_commentaries_study_report.pdf
Yusuf S, Mehta SR, Chrolavicius S, Afzal R, Pogue J, Granger CB, Budaj A, Peters RJ, Bassand
JP, Wallentin L, Joyner C, Fox KA. Comparison of fondaparinux and enoxaparin in acute
coronary syndromes. New England Journal of Medicine 2006; 354: 1464–1476.
21 Qualitative evidence
Jane Noyes, Andrew Booth, Margaret Cargo, Kate Flemming, Angela Harden,
Janet Harris, Ruth Garside, Karin Hannes, Tomás Pantoja, James Thomas
KEY POINTS
• A qualitative evidence synthesis (commonly referred to as QES) can add value by providing decision makers with additional evidence to improve understanding of intervention complexity, contextual variations, implementation, and stakeholder preferences and experiences.
• A qualitative evidence synthesis can be undertaken and integrated with a corresponding intervention review; or
• undertaken using a mixed-method design that integrates a qualitative evidence synthesis with an intervention review in a single protocol.
• Methods for qualitative evidence synthesis are complex and continue to develop. Authors should always consult current methods guidance at methods.cochrane.org/qi.
21.1 Introduction
The potential contribution of qualitative evidence to decision making is well-established
(Glenton et al 2016, Booth 2017, Carroll 2017). A synthesis of qualitative evidence can
inform understanding of how interventions work by:
This chapter should be cited as: Noyes J, Booth A, Cargo M, Flemming K, Harden A, Harris J, Garside R,
Hannes K, Pantoja T, Thomas J. Chapter 21: Qualitative evidence. In: Higgins JPT, Thomas J, Chandler J,
Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions.
2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 525–546.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Figure 21.2.a Considering context and points of contextual integration with the intervention review or within a mixed-method review (the figure contrasts internal validity/relevance with external validity/relevance)
intervention effects, two types of qualitative study are available: those that collect data
from the same participants as the included trials, known as ‘trial siblings’; and those
that address relevant issues about the intervention, but as separate items of research –
not connected to any included trials. Both can provide useful information, with trial
sibling studies obviously closer in terms of their precise contexts to the included trials
(Moore et al 2015), and non-sibling studies possibly contributing perspectives not present
in the trials (Noyes et al 2016b).
21.5 Question development
Table 21.5.a PerSPecTIF Question formulation framework for qualitative evidence syntheses
(Booth et al 2019b). Reproduced with permission of BMJ Publishing Group
(Column headings: Perspective; Setting; Phenomenon of interest/Problem; Environment; Comparison (optional); Timing; Findings.)
1) Identify studies at the point of study selection rather than through tailored search
strategies. This involves conducting a sensitive topic search without any study
design filter (Harden et al 1999), and identifying all study designs of interest during
the screening process. This approach can be feasible when a review question
involves multiple publication types (e.g. randomized trials, qualitative research and economic evaluations), which then do not require separate searches.
and their assessments of ‘risk to rigour’ for each paper and how the study’s methodo-
logical limitations may affect review findings (Noyes et al 2019). We further advise that
qualitative ‘sensitivity analysis’, exploring the robustness of the synthesis and its vulner-
ability to methodologically limited studies, be routinely applied regardless of the review
authors’ overall confidence in synthesized findings (Carroll et al 2013). Evidence suggests
that qualitative sensitivity analysis is equally advisable for mixed methods studies from
which the qualitative component is extracted (Verhage and Boels 2017).
on a range of general and review-specific criteria that Noyes and colleagues (Noyes et al
2019) outline in detail. The number of qualitative studies selected needs to be consist-
ent with a manageable synthesis, and the contexts of the included studies should ena-
ble integration with the trials in the effectiveness analysis (see Figure 21.2.a). The
guiding principle is transparency in the reporting of all decisions and their rationale.
Table 21.10.a Recommended methods for undertaking a qualitative evidence synthesis for
subsequent integration with an intervention review, or as part of a mixed-method review (adapted
from an original source developed by convenors (Flemming et al 2019, Noyes et al 2019))
(Column headings: Methodology; Explanation.)
Table 21.11.a). The Template for Intervention Description and Replication (TIDieR) checklist (Hoffmann et al 2014) and the ICAT_SR tool (Lewin et al 2017) may help with specifying key information for extraction. Review authors must ensure that they preserve the context of the primary study data during the extraction and synthesis process to prevent misinterpretation of primary studies (Noyes et al 2019).
Table 21.11.a Contextual and methodological information for inclusion within a table of
‘Characteristics of included studies’. From Noyes et al (2019). Reproduced with permission of
BMJ Publishing Group
Context and participants: Important elements of study context, relevant to addressing the review question and locating the context of the primary study; for example, the study setting, population characteristics, participants and participant characteristics, the intervention delivered (if appropriate), etc.
Study design and methods used: Methodological design and approach taken by the study; methods for identifying the sample and recruitment; the specific data collection and analysis methods utilized; and any theoretical models used to interpret or contextualize the findings.
Noyes and colleagues (Noyes et al 2019) provide additional guidance and examples
of the various methods of data extraction. It is usual for review authors to select one
method. In summary, extraction methods can be grouped as follows.
• Using logic models or other types of conceptual framework. A logic model (Glenton et al 2013) or other type of conceptual framework, which represents the processes by which an intervention produces change, provides a common scaffold for integrating findings across different types of evidence (Booth and Carroll 2015). Frameworks can be specified a priori from the literature or through stakeholder engagement, or newly developed during the review. Findings from quantitative studies testing the effects of interventions and those from qualitative evidence are used to develop and/or further refine the model.
• Testing hypotheses derived from syntheses of qualitative evidence. Quantitative studies are grouped according to the presence or absence of the proposition specified by the hypotheses to be tested, and subgroup analysis is used to explore differential findings on the effects of interventions (Thomas et al 2004).
• Qualitative comparative analysis (QCA). Findings from a qualitative synthesis are used to identify the range of features that are important for successful interventions, and the mechanisms through which these features operate. A QCA then tests whether or not the features are associated with effective interventions (Kahwati et al 2016). The analysis unpicks multiple potential pathways to effectiveness, accommodating scenarios where the same intervention feature is associated with both effective and less effective interventions, depending on context. QCA offers potential for use in integration, although unlike the other methods and tools presented here it does not yet have sufficient methodological guidance available. However, exemplar reviews using QCA are available (Thomas et al 2014, Harris et al 2015, Kahwati et al 2016).
Review authors can use the above methods in combination (e.g. patterns observed through juxtaposing findings within a matrix can be tested using subgroup analysis or QCA). Analysing programme theory, using logic models and applying QCA all require members of the review team with specific skills in these methods. Subgroup analysis and QCA are not suitable when only limited evidence is available (Harden et al 2018, Noyes et al 2019). (See also Chapter 17 on intervention complexity.)
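To make the QCA step more concrete, the following minimal Python sketch builds the truth table that juxtaposes intervention features (as might be derived from a qualitative synthesis) with whether each trial's intervention was effective. The feature names and data are hypothetical; a real analysis would go on to minimize these configurations, typically with dedicated QCA software.

```python
# Minimal sketch of the first step of a crisp-set QCA: a truth table relating
# hypothetical intervention features to effectiveness. 'Consistency' is the
# proportion of trials with a given configuration that were effective.
from collections import defaultdict

trials = [
    {"tailored": 1, "peer_delivered": 1, "effective": 1},
    {"tailored": 1, "peer_delivered": 0, "effective": 1},
    {"tailored": 0, "peer_delivered": 1, "effective": 0},
    {"tailored": 0, "peer_delivered": 0, "effective": 0},
]

truth_table = defaultdict(list)
for t in trials:
    truth_table[(t["tailored"], t["peer_delivered"])].append(t["effective"])

for (tailored, peer), outcomes in sorted(truth_table.items()):
    consistency = sum(outcomes) / len(outcomes)
    print(f"tailored={tailored}, peer_delivered={peer}: "
          f"consistency = {consistency:.2f}")
```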
Figure 21.14.a Decision flowchart for choice of reporting approach for syntheses of qualitative, implementation or process evaluation evidence (Flemming et al 2018). Reproduced with permission of Elsevier. (Legible steps include: (ii) examine whether generic guidance may be suitable per se (if yes, use generic guidance); (iv) consider supplementing with generic guidance specific to stages of the synthesis of qualitative, implementation or process evaluation evidence; and (v) identify recent published examples of the review type and make a list of desirable features from several sources.)
Acknowledgements: This chapter replaces Chapter 20 in the first edition of this Hand-
book (2008) and subsequent Version 5.2. We would like to thank the previous
Chapter 20 authors Jennie Popay and Alan Pearson. Elements of this chapter draw
on previous supplemental guidance produced by the Cochrane Qualitative and Imple-
mentation Methods Group Convenors, to which Simon Lewin contributed.
Funding: JT is supported by the National Institute for Health Research (NIHR) Collab-
oration for Leadership in Applied Health Research and Care North Thames at Barts
Health NHS Trust. The views expressed are those of the author(s) and not necessarily
those of the NHS, the NIHR or the Department of Health.
21.16 References
Alvesson M, Sköldberg K. Reflexive Methodology: New Vistas for Qualitative Research. 2nd ed.
London, UK: Sage; 2009.
Ames HM, Glenton C, Lewin S. Parents’ and informal caregivers’ views and experiences of
communication about routine childhood vaccination: a synthesis of qualitative evidence.
Cochrane Database of Systematic Reviews 2017; 2: CD011787.
Anderson LM, Petticrew M, Rehfuess E, Armstrong R, Ueffing E, Baker P, Francis D, Tugwell P.
Using logic models to capture complexity in systematic reviews. Research Synthesis
Methods 2011; 2: 33–42.
Barnett-Page E, Thomas J. Methods for the synthesis of qualitative research: a critical
review. BMC Medical Research Methodology 2009; 9: 59.
Benoot C, Hannes K, Bilsen J. The use of purposeful sampling in a qualitative evidence
synthesis: a worked example on sexual adjustment to a cancer trajectory. BMC Medical
Research Methodology 2016; 16: 21.
Bonell C, Jamal F, Harden A, Wells H, Parry W, Fletcher A, Petticrew M, Thomas J, Whitehead M, Campbell R, Murphy S, Moore L. Systematic review of the effects of schools and school environment interventions on health: evidence mapping and synthesis. Public Health Research. Southampton (UK): NIHR Journals Library; 2013.
Booth A, Harris J, Croot E, Springett J, Campbell F, Wilkins E. Towards a methodology for
cluster searching to provide conceptual and contextual “richness” for systematic reviews
of complex interventions: case study (CLUSTER). BMC Medical Research Methodology
2013; 13: 118.
Booth A, Carroll C. How to build up the actionable knowledge base: the role of ‘best fit’
framework synthesis for studies of improvement in healthcare. BMJ Quality and Safety
2015; 24: 700–708.
Booth A, Noyes J, Flemming K, Gerhardus A, Wahlster P, van der Wilt GJ, Mozygemba K,
Refolo P, Sacchini D, Tummers M, Rehfuess E. Guidance on choosing qualitative evidence
synthesis methods for use in health technology assessment for complex interventions
2016. https://www.integrate-hta.eu/wp-content/uploads/2016/02/Guidance-on-
choosing-qualitative-evidence-synthesis-methods-for-use-in-HTA-of-complex-
interventions.pdf
Booth A. Qualitative evidence synthesis. In: Facey K, editor. Patient involvement in Health
Technology Assessment. Singapore: Springer; 2017. p. 187–199.
Booth A, Noyes J, Flemming K, Gehardus A, Wahlster P, Jan van der Wilt G, Mozygemba K,
Refolo P, Sacchini D, Tummers M, Rehfuess E. Structured methodology review identified
seven (RETREAT) criteria for selecting qualitative evidence synthesis approaches. Journal
of Clinical Epidemiology 2018; 99: 41–52.
Booth A, Moore G, Flemming K, Garside R, Rollins N, Tuncalp Ö, Noyes J. Taking account of
context in systematic reviews and guidelines considering a complexity perspective. BMJ
Global Health 2019a; 4: e000840.
Booth A, Noyes J, Flemming K, Moore G, Tuncalp O, Shakibazadeh E. Formulating questions
to address the acceptability and feasibility of complex interventions in qualitative
evidence synthesis. BMJ Global Health 2019b; 4: e001107.
Candy B, King M, Jones L, Oliver S. Using qualitative synthesis to explore heterogeneity of
complex interventions. BMC Medical Research Methodology 2011; 11: 124.
Cargo M, Harris J, Pantoja T, Booth A, Harden A, Hannes K, Thomas J, Flemming K, Garside
R, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series-
paper 4: methods for assessing evidence on intervention implementation. Journal of
Clinical Epidemiology 2018; 97: 59–69.
Glenton C, Colvin CJ, Carlsen B, Swartz A, Lewin S, Noyes J, Rashidian A. Barriers and
facilitators to the implementation of lay health worker programmes to improve access to
maternal and child health: qualitative evidence synthesis. Cochrane Database of
Systematic Reviews 2013; 10: CD010414.
Glenton C, Lewin S, Norris S. Chapter 15: Using evidence from qualitative research to
develop WHO guidelines. In: Norris S, editor. World Health Organization Handbook for
Guideline Development. 2nd ed. Geneva: WHO; 2016.
Grant MJ, Booth A. A typology of reviews: an analysis of 14 review types and associated
methodologies. Health Information and Libraries Journal 2009; 26: 91–108.
Greenhalgh T, Kristjansson E, Robinson V. Realist review to understand the efficacy of
school feeding programmes. BMJ 2007; 335: 858.
Harden A, Oakley A, Weston R. A review of the effectiveness and appropriateness of peer-
delivered health promotion for young people. London: Institute of Education, University of
London; 1999.
Harden A, Thomas J, Cargo M, Harris J, Pantoja T, Flemming K, Booth A, Garside R, Hannes
K, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series-
paper 5: methods for integrating qualitative and implementation evidence within
intervention effectiveness reviews. Journal of Clinical Epidemiology 2018; 97: 70–78.
Harris JL, Booth A, Cargo M, Hannes K, Harden A, Flemming K, Garside R, Pantoja T, Thomas
J, Noyes J. Cochrane Qualitative and Implementation Methods Group guidance series-
paper 2: methods for question formulation, searching, and protocol development for
qualitative evidence synthesis. Journal of Clinical Epidemiology 2018; 97: 39–48.
Harris KM, Kneale D, Lasserson TJ, McDonald VM, Grigg J, Thomas J. School-based self
management interventions for asthma in children and adolescents: a mixed methods
systematic review (Protocol). Cochrane Database of Systematic Reviews 2015; 4:
CD011651.
Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V,
Macdonald H, Johnston M, Lamb SE, Dixon-Woods M, McCulloch P, Wyatt JC, Chan AW,
Michie S. Better reporting of interventions: template for intervention description and
replication (TIDieR) checklist and guide. BMJ 2014; 348: g1687.
Houghton C, Murphy K, Meehan B, Thomas J, Brooker D, Casey D. From screening to
synthesis: using NVivo to enhance transparency in qualitative evidence synthesis. Journal
of Clinical Nursing 2017; 26: 873–881.
Hurley M, Dickson K, Hallett R, Grant R, Hauari H, Walsh N, Stansfield C, Oliver S. Exercise
interventions and patient beliefs for people with hip, knee or hip and knee osteoarthritis:
a mixed methods review. Cochrane Database of Systematic Reviews 2018; 4: CD010842.
Kahwati L, Jacobs S, Kane H, Lewis M, Viswanathan M, Golin CE. Using qualitative
comparative analysis in a systematic review of a complex intervention. Systematic
Reviews 2016; 5: 82.
Kelly MP, Noyes J, Kane RL, Chang C, Uhl S, Robinson KA, Springs S, Butler ME, Guise JM.
AHRQ series on complex intervention systematic reviews-paper 2: defining complexity,
formulating scope, and questions. Journal of Clinical Epidemiology 2017; 90: 11–18.
Kneale D, Thomas J, Harris K. Developing and optimising the use of logic models in
systematic reviews: exploring practice and good practice in the use of programme theory
in reviews. PloS One 2015; 10: e0142187.
Levac D, Colquhoun H, O’Brien KK. Scoping studies: advancing the methodology.
Implementation Science 2010; 5: 69.
SURE (Supporting the Use of Research Evidence) Collaboration. SURE Guides for Preparing
and Using Evidence-based Policy Briefs: 5 Identifying and Addressing Barriers to
Implementing the Policy Options. Version 2.1, updated November 2011. http://global.
evipnet.org/SURE-Guides/.
Suri H. Purposeful sampling in qualitative research synthesis. Qualitative Research Journal
2011; 11: 63–75.
Thomas J, Harden A, Oakley A, Oliver S, Sutcliffe K, Rees R, Brunton G, Kavanagh J.
Integrating qualitative research with trials in systematic reviews. BMJ 2004; 328:
1010–1012.
Thomas J, Harden A. Methods for the thematic synthesis of qualitative research in
systematic reviews. BMC Medical Research Methodology 2008; 8: 45.
Thomas J, Brunton J, Graziosi S. EPPI-Reviewer 4.0: software for research synthesis
[Software]. EPPI-Centre Software. Social Science Research Unit, Institute of Education,
University of London UK; 2010. https://eppi.ioe.ac.uk/CMS/Default.aspx?alias=eppi.ioe.
ac.uk/cms/er4&.
Thomas J, O’Mara-Eves A, Brunton G. Using qualitative comparative analysis (QCA) in
systematic reviews of complex interventions: a worked example. Systematic Reviews
2014; 3: 67.
Tong A, Flemming K, McInnes E, Oliver S, Craig J. Enhancing transparency in reporting the
synthesis of qualitative research: ENTREQ. BMC Medical Research Methodology 2012;
12: 181.
van Grootel L, van Wesel F, O’Mara-Eves A, Thomas J, Hox J, Boeije H. Using the realist
perspective to link theory from qualitative evidence synthesis to quantitative studies:
broadening the matrix approach. Research Synthesis Methods 2017; 8: 303–311.
Verhage A, Boels D. Critical appraisal of mixed methods research studies in a systematic
scoping review on plural policing: assessing the impact of excluding inadequately
reported studies by means of a sensitivity analysis. Quality and Quantity 2017; 51:
1449–1468.
Walker LO, Avant KC. Strategies for Theory Construction in Nursing. Upper Saddle River (NJ):
Pearson Prentice Hall; 2005.
Part Three
Further topics
22 Prospective approaches to accumulating evidence
James Thomas, Lisa M Askie, Jesse A Berlin, Julian H Elliott, Davina Ghersi,
Mark Simmonds, Yemisi Takwoingi, Jayne F Tierney, Julian PT Higgins
KEY POINTS
• Cochrane Reviews should reflect the state of current knowledge, but maintaining their currency is a challenge due to resource limitations. It is difficult to know when a given review might become out of date, but tools are available to assist in identifying when a review might need updating.
• Living systematic reviews are systematic reviews that are continually updated, with new evidence being incorporated as soon as it becomes available. They are useful in rapidly evolving fields where research is published frequently. New technologies and better processes for data storage and reuse are being developed to facilitate the rapid identification and synthesis of new evidence.
• A prospective meta-analysis is a meta-analysis of studies (usually randomized trials) that were identified or even collectively planned to be eligible for the meta-analysis before the results of the studies became known. They are usually undertaken by a collaborative group including authors of the studies to be included, and they usually collect and analyse individual participant data.
• Formal sequential statistical methods are discouraged for standard updated meta-analyses in most circumstances for Cochrane Reviews. They should not be used for the main analyses, or to draw main conclusions. Sequential methods may, however, be used in the context of a prospectively planned series of randomized trials.
22.1 Introduction
Iain Chalmers’ vision of “a library of trial overviews which will be updated when new
data become available” (Chalmers 1986) became the mission and founding purpose of
Cochrane. Thousands of systematic reviews are now published in the Cochrane Data-
base of Systematic Reviews, presenting critical summaries of the evidence. However,
This chapter should be cited as: Thomas J, Askie LM, Berlin JA, Elliott JH, Ghersi D, Simmonds M, Takwoingi
Y, Tierney JF, Higgins JPT. Chapter 22: Prospective approaches to accumulating evidence. In: Higgins JPT,
Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic
Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 549–568.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
maintaining the currency of these reviews through periodic updates, consistent with
Chalmers’ vision, has been a challenge. Moreover, as the global community of research-
ers has begun to see research in a cumulative way, rather than in terms of individual
studies, the idea of ‘prospective’ meta-analyses has emerged. A prospective meta-
analysis (PMA) begins with the idea that future studies will be integrated within a sys-
tematic review and works backwards to plan a programme of trials with the explicit
purpose of their future integration.
The first part of this chapter covers methods for keeping abreast of the accumulating
evidence to help a review team understand when a systematic review might need
updating (see Section 22.2). This includes the processes that can be put into place
to monitor relevant publications, and algorithms that have been proposed to deter-
mine whether or when it is appropriate to revisit the review to incorporate new find-
ings. We outline a vision for regularly updated reviews, known as ‘living’ systematic
reviews, which are continually updated, with new evidence being identified and incor-
porated as soon as it becomes available.
While evidence surveillance and living systematic reviews may require some modifi-
cations to review processes, and can dramatically improve the delivery time and cur-
rency of updates, they are still essentially following a retrospective model of reviewing
the existing evidence base. The retrospective nature of most systematic reviews poses
an inevitable challenge, in that the selection of what types of evidence to include may
be influenced by authors’ knowledge of the context and findings of the available stud-
ies. This might introduce bias into any aspect of the review’s eligibility criteria including
the selection of a target population, the nature of the intervention(s), choice of com-
parator and the outcomes to be assessed. The best way to overcome this problem is to
identify evidence entirely prospectively, that is before the results of the studies are
known. Section 22.3 describes such prospectively planned meta-analyses.
Finally, Section 22.4 addresses concerns about the regular repeating of statistical tests
in meta-analyses as they are updated over time. Cochrane actively discourages use of the
notion of statistical significance in favour of reporting estimates and confidence intervals,
so such concerns should not arise. Nevertheless, sequential approaches are an estab-
lished method in randomized trials, and may play a role in a prospectively planned series
of trials in a prospective meta-analysis.
Statistical methods have been proposed to assess the extent to which new evidence
might affect the findings of a systematic review. Sample size calculations can incorpo-
rate the result of a current meta-analysis, thus providing information about how addi-
tional studies of a particular sample size could have an impact on the results of an
updated meta-analysis (Sutton et al 2007, Roloff et al 2013). These methods demon-
strate in many cases that new evidence may have very little impact on a random-effects
meta-analysis if there is heterogeneity across studies, and they require assumptions
that the future studies will be similar to the existing studies. Their practical use in decid-
ing whether to update a systematic review may therefore be limited.
As part of their development of the aforementioned tool, Takwoingi and colleagues
created a prediction equation based on findings from a sample of 65 updated Cochrane
Reviews (Takwoingi et al 2013). They collated a list of numerical ‘signals’ as candidate
predictors of changing conclusions on updating (including, for example, heterogeneity
statistics in the original meta-analysis, presence of a large new study, and various
measures of the amount of information in the new studies versus the original meta-
analysis). Their prediction equation involved two of these signals: the ratio of statistical
information (inverse variance) in the new versus the original studies, and the number of
new studies. Further work is required to develop ways to operationalize this approach
efficiently, as it requires detailed knowledge of the new evidence; once this is in place,
much of the effort to perform the update has already been expended.
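The two signals themselves are straightforward to compute once the new studies are known, as the following minimal Python sketch shows, taking statistical information to be the sum of inverse-variance weights. The fitted coefficients of the prediction equation are not reproduced here, so the sketch stops at the candidate predictors; all numbers are illustrative.

```python
# Minimal sketch of the two 'signals' described above: the ratio of statistical
# information (inverse variance) in the new versus the original studies, and
# the number of new studies. Variances are hypothetical.

def statistical_information(variances):
    """Sum of inverse-variance weights for a set of study effect estimates."""
    return sum(1.0 / v for v in variances)

original_variances = [0.04, 0.09, 0.06]
new_variances = [0.05, 0.03]

info_ratio = (statistical_information(new_variances)
              / statistical_information(original_variances))
print(f"information ratio = {info_ratio:.2f}, "
      f"new studies = {len(new_variances)}")
```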
Efficient processes are required if living systematic reviews are not to consume more resources than traditional approaches. For-
tunately, new developments in information and computer science offer some potential
for reductions in manual effort through automation. (For an overview of a range of
these technologies see Chapter 4, Section 4.6.6.2.)
New systems (such as the Epistemonikos database, which contains the results of regular searches of multiple datasets) offer potential reductions in the number of databases that individuals need to search, as well as reducing duplication of effort across review teams. In addition, the growth of interest in open access publication has led to the creation of large datasets of open access bibliographic records, such as OpenCitations, CrossRef and Microsoft Academic. As these datasets continue to grow to contain all relevant records in their respective areas, they may further reduce the number of different sources that author teams need to search.
Undertaking regular searches also requires the regular screening of records retrieved
for eligibility. Once the review has been set up and initial searches screened, subse-
quent updates can reduce manual screening effort using automation tools that ‘learn’
the review’s eligibility criteria based on previous screening decisions by the review
authors. Automation tools that are built on large numbers of records for more generic
use are also available, such as Cochrane’s RCT Classifier, which can be used to filter
studies that are unlikely to be randomized trials from a set of records (Thomas et al
2017). Cochrane has also developed Cochrane Crowd, which crowdsources decisions
classifying studies as randomized trials (see Chapter 4, Section 4.6.6.2).
Later stages of the review process can also be assisted using new technologies. These
include risk-of-bias assessment, the extraction of structured data from tables in PDF files,
information extraction from reports (such as identifying the number of participants in a
study and characteristics of the intervention) and even the writing of review results.
These technologies are less well-advanced than those used for study identification.
These various tools aim to reduce manual effort at specific points in the standard
systematic review process. However, Cochrane is also setting up systems that aim
to change the study selection process quite substantially, as depicted in Figure 22.2.a.
These developments begin with the prospective identification of relevant evidence, out-
side of the context of any given review, including bibliographic and trial registry records,
through centralized routine searches of appropriate sources. These records flow through
a ‘pipeline’ which classifies the records in detail using a combination of machine learning
and human effort (including Cochrane Crowd). First, the type of study is determined and,
if it is likely to be a randomized trial, then the record proceeds to be classified in terms of
its review topic and its PICO elements using terms from the Cochrane Linked Data ontol-
ogy. Finally, relevant data are extracted from the full text report. The viability of such a
system depends upon its accuracy, which is contingent on human decisions being con-
sistent and correct. For this reason, the early focus on randomized trials is appropriate, as
a clear and widely understood definition exists for this type of study. Overall, the accu-
racy of Cochrane Crowd for identification of randomized trials exceeds 99%; and the
machine learning system is similarly calibrated to achieve over 99% recall (Wallace
et al 2017, Marshall et al 2018).
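The recall figure quoted above reflects a general technique: calibrating a probabilistic classifier so that it very rarely discards true randomized trials. The following minimal Python sketch chooses the score threshold that retains at least 99% of known randomized trials in a labelled validation set; the scores are simulated stand-ins for the output of a trained model, not Cochrane's actual classifier.

```python
# Minimal sketch of calibrating a score threshold for very high recall: keep
# the highest threshold that still retains >= 99% of known randomized trials.
# Labels and scores are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
is_rct = rng.random(10_000) < 0.1                      # hypothetical labels
scores = np.where(is_rct,
                  rng.beta(8, 2, 10_000),              # RCTs tend to score high
                  rng.beta(2, 8, 10_000))              # non-RCTs score low

def threshold_for_recall(scores, labels, target_recall=0.99):
    """Highest threshold whose recall on the labelled data is >= target."""
    rct_scores = np.sort(scores[labels])
    cutoff_index = int(np.floor((1 - target_recall) * len(rct_scores)))
    return rct_scores[cutoff_index]

t = threshold_for_recall(scores, is_rct)
recall = (scores[is_rct] >= t).mean()
retained = (scores >= t).mean()  # fraction of records still to be screened
print(f"threshold={t:.3f}, recall={recall:.3f}, records retained={retained:.2%}")
```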
Setting up such a system for centralized study discovery is yielding benefits through
economies of scale. For example, in the past the same decisions about the same studies
have been made multiple times across different reviews because previously there was
no way of sharing these decisions between reviews. Duplication in manual effort is
being reduced substantially by ensuring that decisions made about a given record (e.g. whether or not it describes a randomized trial) are only made once. These decisions are then reflected in the inclusion of studies in the Cochrane Register of Studies, which can then be searched more efficiently for future reviews. The system benefits further from its scale by learning that if a record is relevant for one review, it is unlikely to be relevant for reviews with quite different eligibility criteria. Ultimately, the aim is for randomized trials to be identified for reviews through a single search of their PICO classifications in the central database, with new studies for existing reviews being identified automatically.
Figure 22.2.a The Cochrane evidence pipeline: records pass through Cochrane Crowd and machine learning classifiers, are linked to the Cochrane Linked Data ontology, and feed the Cochrane Register of Studies (CRS) and related services and applications
A prospective meta-analysis (PMA) is a systematic review and meta-analysis of studies that are identified, evaluated
and determined to be eligible for the meta-analysis before the relevant results of any of
those studies become known. Most experience of PMA comes from their application to
randomized trials. In this section we focus on PMAs of trials, although most of the same
considerations will also apply to systematic reviews of other types of studies.
PMA can help to overcome some of the problems of retrospective meta-analyses of
individual participant data or of aggregate data by enabling:
1) hypotheses to be specified without prior knowledge of the results of individual trials
(including hypotheses underlying subgroup analyses);
2) selection criteria to be applied to trials prospectively; and
3) analysis methods to be chosen before the results of individual trials are known, avoid-
ing potential difficulties in interpretation arising from data-dependent decisions.
PMAs are usually initiated when trials have already started recruiting, and are carried
out by collaborative groups including representatives from each of the participating
trials. They have tended to involve collecting individual participant data (IPD), such that
they have many features in common with retrospective IPD meta-analyses (see also
Chapter 26).
If initiated early enough, PMA provides an opportunity for trial design, data collection
and other trial processes to be standardized across the eligible ongoing trials. For
example, the investigators may agree to use the same instrument to measure a partic-
ular outcome, and to measure the outcome at the same time-points in each trial. In a
Cochrane Review of interventions for preventing obesity in children, for example, the
diversity and unreliability of some of the outcome measures made it difficult to com-
bine data across trials (Summerbell et al 2005). A PMA of this question proposed a set of
shared standards so that some of the issues raised by lack of standardization could be
addressed (Steinbeck et al 2006).
PMAs based on IPD have been conducted by trialists in cardiovascular disease (Simes
1995, WHO-ISH Blood Pressure Lowering Treatment Trialists' Collaboration 1998), child-
hood leukaemia (Shuster and Gieser 1996, Valsecchi and Masera 1996), childhood and
adolescent obesity (Askie et al 2010, Steinbeck et al 2006) and neonatology (Askie et al
2018). There are areas such as infectious diseases, however, where the opportunity to
use PMA has largely been missed (Ioannidis and Lau 1999).
Where resources are limited, it may still be possible to undertake a prospective sys-
tematic review and meta-analysis based on aggregate data, rather than IPD, as we dis-
cuss in Section 22.3.6. In practice, these are often initiated at a later stage during the
course of the trials, so there is less opportunity to standardize conduct of the trials.
However, it is possible to harmonize data for inclusion in meta-analysis.
protocol appropriate to local circumstances, with the local protocol being aligned with
elements of a PMA protocol that are common to all included trials.
PMAs may be an attractive alternative when a single, adequately sized trial is infea-
sible for practical or political reasons (Simes 1987, Probstfield and Applegate 1998).
They may also be useful when two or more trials addressing the same question are
started with the investigators ignorant of the existence of the other trial(s): once these
similar trials are identified, investigators can plan prospectively to combine their
results in a meta-analysis.
Variety in the design of the included trials is a potentially desirable feature of PMA as
it may improve generalizability. For example, FICSIT (Frailty and Injuries: Cooperative
Studies of Intervention Techniques) was a pre-planned meta-analysis of eight trials of
exercise-based interventions in a frail elderly population (Schechtman and Ory 2001).
The eight FICSIT sites defined their own interventions using site-specific endpoints and
evaluations and differing entry criteria (except that all participants were elderly).
Objectives, eligibility and outcomes. As for any systematic review or meta-analysis, the
protocol for a PMA should specify its objectives and eligibility criteria for inclusion
of the trials (including trial design, participants, interventions and comparators). In
addition, it should specify which outcomes will be measured by all trials in the PMA,
and when and how these should be measured. Additionally, details of subgroup anal-
ysis variables should be specified.
Trial details. Details of trials already identified for inclusion should be listed in the pro-
tocol, including their trial registration identifiers, the anticipated number of partici-
pants and timelines for each participating trial. The protocol should state whether a
signed agreement to collaborate has been obtained from the appropriate representa-
tive of each trial (e.g. the sponsor or principal investigator). The protocol should include
a statement that, at the time of inclusion in the PMA, no trial results related to the PMA
research question were known to anyone outside each trial’s own data monitoring
committee. If eligible trials are identified but not included in the PMA because their
results related to the PMA research question are already known, the PMA protocol
should outline how these data will be dealt with. For example, sensitivity analyses
including data from these trials might be planned. The protocol should describe actions
to be taken if subsequent trials are located while the PMA is in progress.
Data collection and analysis. The protocol should outline the plans for the collection and
analyses of data in a similar manner to that of a standard, aggregate data meta-
analysis or an IPD meta-analysis. Details of overall sample size and power calculations,
interim analyses (if applicable) and subgroup analyses should be provided. For a
prospectively planned series of trials, a sequential approach to the meta-analysis
may be reasonable (see Section 22.4).
In an IPD-PMA, the protocol should describe what will happen if the investigators of
some trials within the PMA are unable (or unwilling) to provide participant-level data.
Would the PMA secretariat, for example, accept appropriate summary data? The pro-
tocol should specify whether there is an intention to update the PMA data at regular
intervals via ongoing cycles of data collection (e.g. five yearly). A detailed statistical
analysis plan should be agreed and made public before the receipt or analysis of
any data to be included in the PMA.
Management and co-ordination. The PMA protocol should outline details of project man-
agement structure (including any committees, see Section 22.3.1.2), the procedures for
data management (how data are to be collected, the format required, when data will be
required to be submitted, quality assurance procedures, etc; see Chapter 26,
Section 26.2), and who will be responsible for the statistical analyses.
Publication policy. It is important to have an authorship policy in place for the PMA (e.g.
specifying that publications will be in the group name, but also including a list of indi-
vidual authors), and a policy on manuscript preparation (e.g. formation of a writing
committee, opportunities to comment on draft papers).
A unique issue that arises within the context of the PMA (which would generally not
arise for a multicentre trial or a retrospective IPD meta-analysis) is whether or not indi-
vidual trials should publish their own results separately and, if so, the timing of those
publications. In addition to contributing to the PMA, it is likely that investigators will
prefer trial-specific publications to appear before the combined PMA results are pub-
lished. It is recommended that PMA publication(s) clearly indicate the sources of the
included data and refer to prior publications of the individual included trials.
conduct and analysis, bringing greater clarity to eligibility screening and accuracy to
risk-of-bias assessments (Vale et al 2013).
4) Predict if and when sufficient results will be available for reliable and robust
meta-analysis (typically using aggregate data)
The information from steps 2 and 3 about how results will emerge over time allows a
prospective assessment of the feasibility and timing of a reliable meta-analysis. A first
indicator of reliability is that the projected amount of participants or events that would
be available for the meta-analysis would constitute an ‘optimal information size’
(Pogue and Yusuf 1997). In other words they would provide sufficient power to detect
realistic effects of the intervention under investigation, on the basis of standard meth-
ods of sample size calculation. A second indicator of reliability is that the anticipated
participants or events would comprise a substantial proportion of the total eligible
(‘relative information size’). This serves to minimize the likelihood of reporting or other
data availability biases. Such predictions and decisions for FAME should be outlined in
the systematic review protocol.
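Both indicators can be computed prospectively from the information assembled in steps 2 and 3. The following minimal Python sketch derives an optimal information size from a standard two-arm sample size formula for a binary outcome, together with a relative information size; all numbers are illustrative.

```python
# Minimal sketch of the two reliability indicators described above: an optimal
# information size (OIS) from a standard two-arm sample size calculation for
# proportions, and a relative information size. Numbers are illustrative.
from math import ceil
from statistics import NormalDist

def optimal_information_size(p_control, p_treat, alpha=0.05, power=0.9):
    """Total participants (both arms) to detect p_control -> p_treat."""
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    p_bar = (p_control + p_treat) / 2
    n_per_arm = ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)
                 / (p_control - p_treat) ** 2)
    return ceil(2 * n_per_arm)

ois = optimal_information_size(p_control=0.30, p_treat=0.24)   # ~2300
anticipated = 4200     # participants expected to contribute results
total_eligible = 5000  # participants in all known eligible trials
print(f"OIS = {ois}; anticipated/OIS = {anticipated / ois:.2f}; "
      f"relative information size = {anticipated / total_eligible:.0%}")
```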
Interpretation should consider how representative the actual data obtained are, and
the potential impact of the results of unpublished or ongoing trials that were not
included. This is in addition to the direction and precision of the meta-analysis result
and consistency of effects across trials, as is standard.
6) Assess the value of updating the systematic review and meta-analysis in the
future
22.4 Statistical analysis of accumulating evidence
There are several choices to make when deciding on a sequential approach to meta-
analysis. Two particular sets of choices have been articulated in papers by Wetterslev,
Thorlund, Brok and colleagues, and by Whitehead, Higgins and colleagues.
The first group refer to their methods as ‘trials sequential analysis’ (TSA). They use the
principle of alpha spending and articulate the desirable total amount of information in
terms of sample size (Wetterslev et al 2008, Brok et al 2009, Thorlund et al 2009). This
sample size is calculated in the same way as if the meta-analysis was a single clinical trial,
by setting a desired type I error, an assumed effect size, and the desired statistical power
to detect that effect. They recommend that the sample size be adjusted for heteroge-
neity, using either some pre-specified estimate of heterogeneity or the best current esti-
mate of heterogeneity in the meta-analysis. The adjustment is generally made using a
statistic called D2, which produces a larger required sample size, although the more
widely used I2 statistic may be used instead (Wetterslev et al 2009).
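The heterogeneity adjustment amounts to inflating the single-trial required information size by a factor of 1/(1 - D2), or 1/(1 - I2) if that statistic is used instead. A minimal Python sketch, with illustrative numbers:

```python
# Minimal sketch of the diversity adjustment used in trial sequential analysis:
# inflate a required sample size by 1 / (1 - D2), where D2 (or I2) lies on a
# 0-1 scale (after Wetterslev et al 2009). Numbers are illustrative.

def adjusted_information_size(unadjusted_n: float, diversity: float) -> float:
    """Inflate a required sample size for between-trial heterogeneity."""
    if not 0 <= diversity < 1:
        raise ValueError("diversity (D2) must be in [0, 1)")
    return unadjusted_n / (1 - diversity)

print(adjusted_information_size(2302, diversity=0.25))  # -> 3069.33...
```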
Whitehead and Higgins implemented a boundaries approach and represented information using statistical information (specifically, the sum of the meta-analytic weights) (Whitehead 1997, Higgins et al 2011). As noted, this implicitly adjusts for heterogeneity
because as heterogeneity increases, the information contained in the meta-analysis
decreases. In this approach, the cumulative information can decrease between updates
as well as increase (i.e. the path can go backwards in relation to the boundary). These
authors propose a parallel Bayesian approach to updating the estimate of between-
study heterogeneity, starting with an informative prior distribution, to reduce the risk
that the path will go backwards (Higgins et al 2011). If the prior estimate of heteroge-
neity is suitably large, the method can account for underestimation of heterogeneity
early in the updating process.
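The implicit adjustment can be seen directly in the definition of statistical information as the sum of random-effects weights, 1/(vi + tau2): a larger between-study variance estimate shrinks every weight, so the cumulative information can fall between updates even as new studies are added. A minimal Python sketch, with illustrative numbers:

```python
# Minimal sketch of 'statistical information' as the sum of random-effects
# meta-analytic weights, 1 / (v_i + tau2). Raising the heterogeneity estimate
# tau2 shrinks every weight, so information can decrease between updates even
# when a study is added. Numbers are illustrative.

def cumulative_information(within_study_variances, tau2):
    return sum(1.0 / (v + tau2) for v in within_study_variances)

variances = [0.04, 0.09, 0.06]
print(cumulative_information(variances, tau2=0.00))          # ~52.8
print(cumulative_information(variances + [0.05], tau2=0.08)) # ~29.1
```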
1) The results of each meta-analysis, conducted at any point in time, indicate the cur-
rent best evidence of the estimated intervention effect and its accompanying uncer-
tainty. These results need to stand on their own merit. Decision makers should use
the currently available evidence, and their decisions should not be influenced by
previous meta-analyses or plans for future updates.
2) Cochrane Review authors should interpret evidence on the basis of the estimated
magnitude of the effect of intervention and its uncertainty (usually quantified using
a confidence interval) and not on the basis of statistical significance (see Chapter 15,
Section 15.3.1). In particular, Cochrane Review authors should not draw binary inter-
pretations of intervention effects as present or absent, based on defining results as
‘significant’ or ‘non-significant’ (see Chapter 15, Section 15.3.2).
3) There are important differences between the context of an individual trial and the
context of a meta-analysis. Whereas a trialist is in control of recruitment of further
participants, the meta-analyst (except in the context of a prospective meta-analysis)
has no control over designing or affecting trials that are eligible for the meta-analysis.
22.6 References
Akl EA, Meerpohl JJ, Elliott J, Kahale LA, Schunemann HJ; Living Systematic Review Network. Living
systematic reviews: 4. Living guideline recommendations. Journal of Clinical Epidemiology
2017; 91: 47–53.
Askie LM, Baur LA, Campbell K, Daniels L, Hesketh K, Magarey A, Mihrshahi S, Rissel C, Simes
RJ, Taylor B, Taylor R, Voysey M, Wen LM, on behalf of the EPOCH Collaboration. The Early
Prevention of Obesity in CHildren (EPOCH) Collaboration –an Individual Patient Data
Prospective Meta-Analysis [study protocol]. BMC Public Health 2010; 10: 728.
Askie LM, Brocklehurst P, Darlow BA, Finer N, Schmidt B, Tarnow-Mordi W. NeOProM:
Neonatal Oxygenation Prospective Meta-analysis Collaboration study protocol. BMC
Pediatrics 2011; 11: 6.
Askie LM, Darlow BA, Finer N, et al. Association between oxygen saturation targeting and
death or disability in extremely preterm infants in the neonatal oxygenation prospective
meta-analysis collaboration. JAMA 2018; 319: 2190–2201.
Berkey CS, Mosteller F, Lau J, Antman EM. Uncertainty of the time of first significance in
random effects cumulative meta-analysis. Controlled Clinical Trials 1996; 17: 357–371.
Brok J, Thorlund K, Wetterslev J, Gluud C. Apparently conclusive meta-analyses may be
inconclusive: trial sequential analysis adjustment of random error risk due to repetitive
testing of accumulating data in apparently conclusive neonatal meta-analyses.
International Journal of Epidemiology 2009; 38: 287–298.
Chalmers I. Electronic publications for updating controlled trial reviews. Lancet 1986;
328: 287.
Crequit P, Trinquart L, Yavchitz A, Ravaud P. Wasted research when systematic reviews fail
to provide a complete and up-to-date evidence synthesis: the example of lung cancer.
BMC Medicine 2016; 14: 8.
Elliott JH, Turner T, Clavisi O, Thomas J, Higgins JPT, Mavergames C, Gruen RL. Living
systematic reviews: an emerging opportunity to narrow the evidence-practice gap. PLoS
Medicine 2014; 11: e1001603.
Elliott JH, Synnot A, Turner T, Simmonds M, Akl EA, McDonald S, Salanti G, Meerpohl J,
MacLehose H, Hilton J, Tovey D, Shemilt I, Thomas J, Living Systematic Review Network. Living
systematic review: 1. Introduction-the why, what, when, and how. Journal of Clinical
Epidemiology 2017; 91: 23–30.
Garner P, Hopewell S, Chandler J, MacLehose H, Schünemann HJ, Akl EA, Beyene J, Chang S,
Churchill R, Dearness K, Guyatt G, Lefebvre C, Liles B, Marshall R, Martinez Garcia L,
Mavergames C, Nasser M, Qaseem A, Sampson M, Soares-Weiser K, Takwoingi Y, Thabane
L, Trivella M, Tugwell P, Welsh E, Wilson EC, Schünemann HJ, Panel for Updating
Guidance for Systematic Reviews (PUGs). When and how to update systematic reviews:
consensus and checklist. BMJ 2016; 354: i3507.
Higgins JPT, Whitehead A, Simmonds M. Sequential methods for random-effects meta-
analysis. Statistics in Medicine 2011; 30: 903–921.
Hillman DW, Louis TA. DSMB case study: decision making when a similar clinical trial is
stopped early. Controlled Clinical Trials 2003; 24: 85–91.
Ioannidis JPA, Lau J. State of the evidence: current status and prospects of meta-analysis in
infectious diseases. Clinical Infectious Diseases 1999; 29: 1178–1185.
Marshall IJ, Noel-Storr A, Kuiper J, Thomas J, Wallace BC. Machine learning for identifying
Randomized Controlled Trials: An evaluation and practitioner’s guide. Research Synthesis
Methods 2018; 9: 602–614.
Martínez García L, Pardo-Hernandez H, Superchi C, Niño de Guzman E, Ballesteros M,
Ibargoyen Roteta N, McFarlane E, Posso M, Roqué IFM, Rotaeche Del Campo R, Sanabria
AJ, Selva A, Solà I, Vernooij RWM, Alonso-Coello P. Methodological systematic review
identifies major limitations in prioritization processes for updating. Journal of Clinical
Epidemiology 2017; 86: 11–24.
Moher D, Shamseer L, Clarke M, Ghersi D, Liberati A, Petticrew M, Shekelle P, Stewart LA.
Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P)
2015 statement. Systematic Reviews 2015; 4: 1.
Nikolakopoulou A, Mavridis D, Furukawa TA, Cipriani A, Tricco AC, Straus SE, Siontis GCM,
Egger M, Salanti G. Living network meta-analysis compared with pairwise meta-analysis
in comparative effectiveness research: empirical study. BMJ 2018; 360: k585.
O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:
549–556.
Page MJ, Shamseer L, Altman DG, Tetzlaff J, Sampson M, Tricco AC, Catalá-López F, Li L, Reid
EK, Sarkis-Onofre R, Moher D. Epidemiology and reporting characteristics of systematic
reviews of biomedical research: a cross-sectional study. PLoS Medicine 2016; 13: e1002028.
Pogue JM, Yusuf S. Cumulating evidence from randomized trials: utilizing sequential
monitoring boundaries for cumulative meta-analysis. Controlled Clinical Trials 1997; 18:
580–593; discussion 661–666.
Probstfield J, Applegate WB. Prospective meta-analysis: Ahoy! A clinical trial? Journal of the
American Geriatrics Society 1998; 43: 452–453.
Roloff V, Higgins JPT, Sutton AJ. Planning future studies based on the conditional power of
a meta-analysis. Statistics in Medicine 2013; 32: 11–24.
Rydzewska LHM, Burdett S, Vale CL, Clarke NW, Fizazi K, Kheoh T, Mason MD, Miladinovic B,
James ND, Parmar MKB, Spears MR, Sweeney CJ, Sydes MR, Tran N, Tierney JF, STOPCaP
Abiraterone Collaborators. Adding abiraterone to androgen deprivation therapy in men
with metastatic hormone-sensitive prostate cancer: a systematic review and meta-
analysis. European Journal of Cancer 2017; 84: 88–101.
Schechtman K, Ory M. The effects of exercise on the quality of life of frail older adults: a
preplanned meta-analysis of the FICSIT trials. Annals of Behavioural Medicine 2001; 23: 186–197.
23
Including variants on randomized trials
Julian PT Higgins, Sandra Eldridge, Tianjing Li
KEY POINTS
• Cluster-randomized trials, crossover trials and studies with more than two intervention groups should be analysed using methods appropriate to the design.
• If the authors of studies included in the review fail to account for correlations among outcome data that arise because of the design, approximate methods can often be applied by review authors.
• A variant of the risk-of-bias assessment tool is available for cluster-randomized trials. Special attention should be paid to the potential for bias arising from how individual participants were identified and recruited within clusters.
• A variant of the risk-of-bias assessment tool is available for crossover trials. Special attention should be paid to the potential for bias arising from carry-over of effects from one period to the subsequent period of the trial, and to the possibility of ‘period effects’.
• To include a study with more than two intervention groups in a meta-analysis, a recommended approach is (i) to omit groups that are not relevant to the comparison being made, and (ii) to combine multiple groups that are eligible as the experimental or comparator intervention to create a single pair-wise comparison. Alternatively, multi-arm studies are dealt with appropriately by network meta-analysis.
This chapter should be cited as: Higgins JPT, Eldridge S, Li T (editors). Chapter 23: Including variants
on randomized trials. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA
(editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK):
John Wiley & Sons, 2019: 569–594.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
Table 23.1.a Issues addressed in the Cochrane risk-of-bias tool for cluster-randomized trials∗

Bias arising from the randomization process
• … motivation or knowledge to subvert randomization.
• The number of clusters can be relatively small, so chance imbalances are more common than in individually randomized trials. Such chance imbalances should not be interpreted as evidence of risk of bias.

Bias arising from the timing of identification and recruitment of participants
• This bias domain is specific to cluster-randomized trials.
• It is important to consider when individual participants were identified and recruited in relation to the timing of randomization.
• If identification or recruitment of any participants in the trial happened after randomization of the cluster, then their recruitment could have been affected by knowledge of the intervention, introducing bias.

Bias due to deviations from intended interventions
• … this domain.
• If participants, carers or people delivering interventions are aware of the assigned intervention, then the issues are the same as for individually randomized trials.

Bias due to missing outcome data
• Data may be missing for clusters or for individuals within clusters.
• Considerations when addressing either type of missing data are the same as for individually randomized trials, but review authors should ensure that they cover both.

Bias in measurement of the outcome
• If outcome assessors are not aware that a trial is taking place, then their assessments should not be affected by intervention assignment.
• If outcome assessors are aware of the assigned intervention, then the issues are the same as for individually randomized trials.

Bias in selection of the reported result
• The issues are the same as for individually randomized trials.

∗ For the precise wording of signalling questions and guidance for answering each one, see the full risk-of-bias tool at www.riskofbias.info.
23.1 Cluster-randomized trials
method used is appropriate. When the study authors have not conducted such an anal-
ysis, there are two approximate approaches that can be used by review authors to adjust
the results (see Sections 23.1.4 and 23.1.5).
Effect estimates and their standard errors from correct analyses of cluster-
randomized trials may be meta-analysed using the generic inverse-variance approach
(e.g. in RevMan).
• the number of clusters (or groups) randomized to each intervention group and the
total number of participants in the study; or the average (mean) size of each cluster;
• the outcome data ignoring the cluster design for the total number of individuals (e.g.
the number or proportion of individuals with events, or means and standard devia-
tions for continuous data); and
• an estimate of the intracluster (or intraclass) correlation coefficient (ICC).
The ICC is an estimate of the relative variability within and between clusters (Eldridge
and Kerry 2012). Alternatively it describes the ‘similarity’ of individuals within the same
cluster (Eldridge et al 2009b). In spite of recommendations to report the ICC in all trial
reports (Campbell et al 2012), ICC estimates are often not available in published
reports.
A common approach for review authors is to use external estimates obtained from
similar studies, and several resources are available that provide examples of ICCs
(Ukoumunne et al 1999, Campbell et al 2000, Health Services Research Unit 2004),
or to use an estimate based on known patterns in ICCs for particular types of cluster
or outcome. ICCs may appear small compared with other types of correlations: values
lower than 0.05 are typical. However, even small values can have a substantial impact
on confidence interval widths (and hence weights in a meta-analysis), particularly if
cluster sizes are large. Empirical research has observed that naturally larger clusters
tend to have smaller ICCs (Ukoumunne et al 1999). For example, for the same
outcome, regions are likely to have smaller ICCs than towns, which are likely to have
smaller ICCs than families.
An approximately correct analysis proceeds as follows. The idea is to reduce the size of
each trial to its ‘effective sample size’ (Rao and Scott 1992). The effective sample size of a
single intervention group in a cluster-randomized trial is its original sample size divided
by a quantity called the ‘design effect’. The design effect is approximately
1 + (M − 1) × ICC
where M is the average cluster size and ICC is the intracluster correlation coefficient.
When cluster sizes vary, M can be estimated more appropriately in other ways
(Eldridge et al 2006). A common design effect is usually assumed across intervention
groups. For dichotomous data, both the number of participants and the number experi-
encing the event should be divided by the same design effect. Since the resulting data
must be rounded to whole numbers for entry into meta-analysis software such as
RevMan, this approach may be unsuitable for small trials. For continuous data, only
the sample size need be reduced; means and standard deviations should remain
unchanged. Special considerations for analysis of standardized mean differences
from cluster-randomized trials are discussed by White and Thomas (White and
Thomas 2005).
For example, suppose a trial randomized 10 clusters (295 participants in total) to a treatment arm and 11 clusters (330 participants) to a control arm. The average cluster size is (295 + 330)/(10 + 11) = 29.8. With an ICC of 0.02, the design effect is 1 + (29.8 − 1) × 0.02 = 1.576, so the effective sample sizes are 295/1.576 = 187.2 and 330/1.576 = 209.4.
Applying the design effects also to the numbers of events (in this case, successes)
produces the following modified results:
Treatment: 40.0/187.2
Control: 53.3/209.4
Once trials have been reduced to their effective sample size, the data may be entered
into statistical software such as RevMan as, for example, dichotomous outcomes or
continuous outcomes. Rounding the results to whole numbers, the results from the
example trial may be entered as:
Treatment: 40/187
Control: 53/209
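A minimal sketch reproducing this calculation is given below. The unadjusted event counts (63 and 84) are back-calculated from the adjusted figures above and are not stated in this excerpt; the helper name is our own:

```python
# Hedged sketch: reducing a cluster-randomized trial to its 'effective sample
# size' before meta-analysis. The unadjusted counts (63/295, 84/330) are
# illustrative assumptions consistent with the adjusted figures above.

def effective_counts(events, n, mean_cluster_size, icc):
    """Divide events and sample size by the design effect 1 + (M - 1) * ICC."""
    design_effect = 1 + (mean_cluster_size - 1) * icc
    return events / design_effect, n / design_effect

m = (295 + 330) / (10 + 11)   # average cluster size, approximately 29.8

print(effective_counts(63, 295, m, icc=0.02))   # about (40.0, 187.3): treatment
print(effective_counts(84, 330, m, icc=0.02))   # about (53.3, 209.5): control
```

The small differences from the figures in the text arise only from rounding the average cluster size to 29.8 before computing the design effect.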
Crossover designs offer several possible advantages over parallel-group trials:
1) each participant acts as his or her own control, significantly reducing between-
participant variation;
2) consequently, fewer participants are usually required to obtain the same precision
in estimation of intervention effects; and
3) every participant receives every intervention, which allows the determination of the
best intervention or preference for an individual participant.
In some trials, randomization of interventions takes place within individuals, with dif-
ferent interventions being applied to different body parts (e.g. to the two eyes or to
teeth in the two sides of the mouth). If body parts are randomized and the analysis
is by the multiple parts within an individual (e.g. each eye or each side of the mouth)
then the analysis should account for the pairing (or matching) of parts within indivi-
duals in the same way that pairing of intervention periods is recognized in the analysis
of a crossover trial.
A readable introduction to crossover trials is given by Senn (Senn 2002). More
detailed discussion of meta-analyses involving crossover trials is provided by Elbourne
and colleagues (Elbourne et al 2002), and some empirical evidence on their inclusion in
systematic reviews by Lathyris and colleagues (Lathyris et al 2007). Evidence suggests
that many crossover trials have not been analysed appropriately when included in
Cochrane Reviews (Nolan et al 2016).
Table 23.2.a Issues addressed in version 2 of the Cochrane risk-of-bias tool for randomized crossover trials∗

Bias arising from the randomization process
• … process the same as for parallel-group trials.
• If an equal proportion of participants is randomized to each intervention sequence, then any period effects will cancel out in the analysis (providing there is not differential missing data).
• If unequal proportions of participants are randomized to the different intervention sequences, then period effects should be included in the analysis to avoid bias.
• When using baseline differences to infer a problem with the randomization process, this should be based on differences at the start of the first period only.

Bias due to deviations from intended interventions
• Carry-over is the key concern when assessing risk of bias in a crossover trial. For the results to be unbiased, carry-over effects should not affect outcomes measured in the second period. A long period of wash-out between periods can avoid this but is not essential. The important consideration is whether sufficient time passes before outcome measurement in the second period, such that any carry-over effects have disappeared.
• All other issues are the same as for parallel-group trials.

Bias due to missing outcome data
• The issues are the same as for parallel-group trials.
• Use of last observation carried forward imputation may be particularly problematic if the observations being carried forward were made before carry-over effects had disappeared.
• Some analyses of crossover trials will automatically exclude (for an AB/BA design) all patients with missing data in either period.

Bias in measurement of the outcome
• The issues are the same as for parallel-group trials.

Bias in selection of the reported result
• An additional concern is the selective reporting of first period data on the basis of a test for carry-over.

∗ For the precise wording of signalling questions and guidance for answering each one, see the full risk-of-bias tool at www.riskofbias.info.
impact of selective reporting if first-period data are reported only when carry-over is
detected by the trialists. Omission of trials reporting only paired analyses (i.e. not
reporting data for the first period separately) may lead to bias at the meta-analysis
level. The bias will not be picked up using study-level assessments of risk of bias.
A paired analysis is possible if any of the following are available:
• individual participant data from the paper or by correspondence with the trialist;
• the mean and standard deviation (or standard error) of the participant-level differences between experimental intervention (E) and comparator intervention (C) measurements;
• the mean difference and one of the following: (i) a t-statistic from a paired t-test; (ii) a P value from a paired t-test; (iii) a confidence interval from a paired analysis;
• a graph of measurements on experimental intervention (E) and comparator intervention (C) from which individual data values can be extracted, as long as matched measurements for each individual can be identified as such.
For details see Elbourne and colleagues (Elbourne et al 2002).
Crossover trials with dichotomous outcomes require more complicated methods and
consultation with a statistician is recommended (Elbourne et al 2002).
If results are available broken into subgroups by the particular sequence each par-
ticipant received, then analyses that adjust for period effects are straightforward (e.g.
as outlined in Chapter 3 of Senn (Senn 2002)).
of crossover trials in this way, the unit-of-analysis error might be regarded as less seri-
ous than some other types of unit-of-analysis error.
A second approach to incorporating crossover trials is to include only data from the
first period. This might be appropriate if carry-over is thought to be a problem, or if a
crossover design is considered inappropriate for other reasons. However, it is possible
that available data from first periods constitute a biased subset of all first period data.
This is because reporting of first period data may be dependent on the trialists having
found statistically significant carry-over.
A third approach to incorporating inappropriately reported crossover trials is to
attempt to approximate a paired analysis, by imputing missing standard deviations.
We address this approach in detail in Section 23.2.7.
Table 23.2.b Some possible data available from the report of a crossover trial
As noted in Section 23.2.5, the standard error can also be obtained directly from a confidence inter-
val for MD, from a paired t-statistic, or from the P value from a paired t-test. The quan-
tities MD and SE(MD) may be entered into a meta-analysis under the generic inverse-
variance outcome type (e.g. in RevMan).
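Where only one of these summaries is reported, SE(MD) can be recovered algebraically. A minimal sketch follows; the helper names are our own and are not Handbook code:

```python
# Hedged sketch: recovering SE(MD) for a paired (crossover) analysis from
# commonly reported statistics. Helper names are illustrative only.
from scipy.stats import t as t_dist

def se_from_t(md, t_stat):
    """SE(MD) from a paired t-statistic, since t = MD / SE(MD)."""
    return abs(md / t_stat)

def se_from_ci(lower, upper, n, level=0.95):
    """SE(MD) from a paired confidence interval, using t with n - 1 df."""
    t_crit = t_dist.ppf(1 - (1 - level) / 2, df=n - 1)
    return (upper - lower) / (2 * t_crit)

def se_from_p(md, p, n):
    """SE(MD) from a two-sided P value of a paired t-test."""
    t_stat = t_dist.ppf(1 - p / 2, df=n - 1)
    return abs(md / t_stat)
```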
When the standard error is not available directly and the standard deviation of the
differences is not presented, a simple approach is to impute the standard deviation, as
is commonly done for other missing standard deviations (see Chapter 6,
Section 6.5.2.7). Other studies in the meta-analysis may present standard deviations
of differences, and as long as the studies use the same measurement scale, it may
be reasonable to borrow these from one study to another. As with all imputations, sen-
sitivity analyses should be undertaken to assess the impact of the imputed data on the
findings of the meta-analysis (see Chapter 10, Section 10.14).
If no information is available from any study on the standard deviations of the within-
participant differences, imputation of standard deviations can be achieved by assum-
ing a particular correlation coefficient. The correlation coefficient describes how similar
the measurements on interventions E and C are within a participant, and is a number
between –1 and 1. It may be expected to lie between 0 and 1 in the context of a cross-
over trial, since a higher than average outcome for a participant while on E will tend to
be associated with a higher than average outcome while on C. If the correlation coef-
ficient is zero or negative, then there is no statistical benefit of using a crossover design
over using a parallel-group design.
A common way of presenting results of a crossover trial is as if the trial had been a
parallel-group trial, with standard deviations for each intervention separately (SDE and
SDC; see Table 23.2.b). The desired standard deviation of the differences can be
estimated using these intervention-specific standard deviations and an imputed
correlation coefficient (Corr):
SDdiff = √(SDE² + SDC² − (2 × Corr × SDE × SDC))

For a standardized mean difference (SMD), the standard error can be approximated using the imputed correlation as:

SE(SMD) = √((1/N + SMD²/(2N)) × 2(1 − Corr))
Alternatively, the SMD can be calculated from the MD and its standard error, using an
imputed correlation:

SMD = MD / (SE(MD) × √(N / (2(1 − Corr))))
In this case, the imputed correlation impacts on the magnitude of the SMD effect
estimate itself (rather than just on the standard error, as is the case for MD analyses
in Section 23.2.7.1). Imputed correlations should therefore be used with great caution
for estimation of SMDs.
23.2.7.4 Example
As an example, suppose a crossover trial reports the following data:

Intervention E (sample size 10): ME = 7.0, SDE = 2.38
Intervention C (sample size 10): MC = 6.5, SDC = 2.21

The mean difference is MD = 7.0 − 6.5 = 0.5.
If the standard deviation of the within-participant differences is reported (suppose here that SDdiff = 2), then the standard error of the mean difference is:

SE(MD) = SDdiff / √N = 2 / √10 = 0.632
The numbers 0.5 and 0.632 may be entered into RevMan as the estimate and standard
error of a mean difference, under a generic inverse-variance outcome.
If instead SDdiff must be imputed from the intervention-specific standard deviations using an imputed correlation coefficient of Corr = 0.68, then SDdiff = √(2.38² + 2.21² − (2 × 0.68 × 2.38 × 2.21)) = 1.8426, so that:

SE(MD) = SDdiff / √N = 1.8426 / √10 = 0.583
The numbers 0.5 and 0.583 may be entered into a meta-analysis as the estimate and
standard error of a mean difference, under a generic inverse-variance outcome. Corre-
lation coefficients other than 0.68 should be used as part of a sensitivity analysis.
Treating the data as if from a parallel-group trial gives SMD = 0.5/2.297 = 0.218, and the standard error of the SMD, using the imputed correlation, is:

SE(SMD) = √((1/N + SMD²/(2N)) × 2(1 − Corr)) = √((1/10 + 0.218²/20) × 2(1 − 0.68)) = 0.256
The numbers 0.218 and 0.256 may be entered into a meta-analysis as the estimate and
standard error of a standardized mean difference, under a generic inverse-variance
outcome.
We could also have obtained the SMD from the MD and its standard error:
SMD = MD / (SE(MD) × √(N / (2(1 − Corr)))) = 0.5 / (0.583 × √(10 / (2(1 − 0.68)))) = 0.217
The minor discrepancy arises due to the slightly different ways in which the two
formulae calculate a pooled standard deviation for the standardizing.
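The worked example above can be reproduced in a few lines. The sketch below is illustrative only, using the same inputs (N = 10, Corr = 0.68):

```python
# Hedged sketch reproducing the crossover worked example above.
from math import sqrt

n, corr = 10, 0.68
md = 7.0 - 6.5                 # mean difference between E and C = 0.5
sd_e, sd_c = 2.38, 2.21        # intervention-specific standard deviations

# Impute the SD of within-participant differences from the assumed correlation.
sd_diff = sqrt(sd_e**2 + sd_c**2 - 2 * corr * sd_e * sd_c)    # 1.8426
se_md = sd_diff / sqrt(n)                                     # 0.583

# Standardized mean difference and its standard error.
sd_pooled = sqrt((sd_e**2 + sd_c**2) / 2)                     # about 2.297
smd = md / sd_pooled                                          # 0.218
se_smd = sqrt((1 / n + smd**2 / (2 * n)) * 2 * (1 - corr))    # 0.256

# Alternative SMD from MD and SE(MD), matching the final formula above.
smd_alt = md / (se_md * sqrt(n / (2 * (1 - corr))))           # 0.217
print(se_md, smd, se_smd, smd_alt)
```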
23.3 Studies with more than two intervention groups
There are three separate issues to consider when faced with a study with more than two
intervention groups.
1) Determine which intervention groups are relevant to the systematic review.
2) Determine which intervention groups are relevant to a particular meta-analysis.
3) Determine how the study will be included in the meta-analysis if more than two
groups are relevant.
Conversely, all groups of the study of ‘acupuncture versus sham acupuncture versus no
intervention’ might be considered eligible for the same meta-analysis. This would be
the case if the meta-analysis would otherwise include both studies of ‘acupuncture ver-
sus sham acupuncture’ and studies of ‘acupuncture versus no intervention’, treating
sham acupuncture and no intervention both as relevant comparators. We describe
methods for dealing with the latter situation in Section 23.3.4.
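When multiple eligible arms are merged into a single pair-wise comparison, dichotomous data can simply be summed (events and totals across arms), while continuous data require pooling of means and standard deviations. The sketch below shows the standard pooling calculation for continuous outcomes; the formula itself is given in Chapter 6 rather than in this excerpt, and the helper name is ours:

```python
# Hedged sketch: merging two eligible arms of a multi-arm trial into a single
# group for a pair-wise comparison (continuous outcomes). The pooling follows
# the standard approach described in Chapter 6; names are illustrative.
from math import sqrt

def combine_arms(n1, m1, sd1, n2, m2, sd2):
    """Return combined sample size, mean and SD for two merged arms."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    # The pooled variance reflects within-arm variability plus the
    # spread between the two arm means.
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1 - m2)**2) / (n - 1)
    return n, m, sqrt(var)

# Hypothetical example: two experimental arms merged before comparison with control.
print(combine_arms(50, 10.0, 4.0, 45, 11.0, 4.4))
```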
The 2 × 2 factorial design can be displayed as a 2 × 2 table, with the rows indicating
one comparison (e.g. aspirin versus placebo) and the columns the other (e.g. beha-
vioural intervention versus standard care):
                                    Randomization of B
                                    Behavioural intervention    Standard care
Randomization of A    Aspirin       aspirin + behavioural       aspirin only
                      Placebo       behavioural only            neither
23.5 References
Arnup SJ, Forbes AB, Kahan BC, Morgan KE, McKenzie JE. Appropriate statistical
methods were infrequently used in cluster-randomized crossover trials. Journal of Clinical
Epidemiology 2016; 74: 40–50.
Arnup SJ, McKenzie JE, Hemming K, Pilcher D, Forbes AB. Understanding the cluster
randomised crossover design: a graphical illustration of the components of variation and
a sample size tutorial. Trials 2017; 18: 381.
Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-analysis.
Chichester (UK): John Wiley & Sons; 2008.
Campbell M, Grimshaw J, Steen N. Sample size calculations for cluster randomised trials.
Changing Professional Practice in Europe Group (EU BIOMED II Concerted Action). Journal
of Health Services Research and Policy 2000; 5: 12–16.
Campbell MJ, Walters SJ. How to Design, Analyse and Report Cluster Randomised Trials in
Medicine and Health Related Research. Chichester (UK): John Wiley & Sons; 2014.
Campbell MK, Piaggio G, Elbourne DR, Altman DG, CONSORT Group. CONSORT 2010 statement:
extension to cluster randomised trials. BMJ 2012; 345: e5661.
Chan AW, Altman DG. Epidemiology and reporting of randomised trials published in PubMed
journals. Lancet 2005; 365: 1159–1162.
Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research.
London (UK): Arnold; 2000.
Donner A, Piaggio G, Villar J. Statistical methods for the meta-analysis of cluster randomized
trials. Statistical Methods in Medical Research 2001; 10: 325–338.
Donner A, Klar N. Issues in the meta-analysis of cluster randomized trials. Statistics in
Medicine 2002; 21: 2971–2980.
Elbourne DR, Altman DG, Higgins JPT, Curtin F, Worthington HV, Vaillancourt JM. Meta-
analyses involving cross-over trials: methodological issues. International Journal of
Epidemiology 2002; 31: 140–149.
Eldridge S, Ashby D, Bennett C, Wakelin M, Feder G. Internal and external validity of cluster
randomised trials: systematic review of recent trials. BMJ 2008; 336: 876–880.
Eldridge S, Kerry S, Torgerson DJ. Bias in identifying and recruiting participants in cluster
randomised trials: what can be done? BMJ 2009a; 339: b4006.
Eldridge S, Kerry S. A Practical Guide to Cluster Randomised Trials in Health Services Research.
Chichester (UK): John Wiley & Sons; 2012.
Eldridge SM, Ashby D, Kerry S. Sample size for cluster randomized trials: effect of coefficient
of variation of cluster size and analysis method. International Journal of Epidemiology
2006; 35: 1292–1300.
Eldridge SM, Ukoumunne OC, Carlin JB. The intra-cluster correlation coefficient in cluster
randomized trials: a review of definitions. International Statistical Review 2009b; 77:
378–394.
Freeman PR. The performance of the two-stage analysis of two-treatment, two-period
cross-over trials. Statistics in Medicine 1989; 8: 1421–1432.
Hahn S, Puffer S, Torgerson DJ, Watson J. Methodological bias in cluster randomised trials.
BMC Medical Research Methodology 2005; 5: 10.
Hayes RJ, Moulton LH. Cluster Randomised Trials. Boca Raton (FL): CRC Press; 2017.
Health Services Research Unit. Database of ICCs: Spreadsheet (Empirical estimates of ICCs
from changing professional practice studies) [page last modified 11 Aug 2004] 2004.
http://www.abdn.ac.uk/hsru/epp/cluster.shtml.
Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ. The stepped wedge cluster
randomised trial: rationale, design, analysis, and reporting. BMJ 2015; 350: h391.
Juszczak E, Altman D, Chan AW. A review of the methodology and reporting of
multi-arm, parallel group, randomised clinical trials (RCTs). 3rd Joint Meeting of the
International Society for Clinical Biostatistics and Society for Clinical Trials; London
(UK) 2003.
Lathyris DN, Trikalinos TA, Ioannidis JP. Evidence from crossover trials: empirical evaluation
and comparison against parallel arm trials. International Journal of Epidemiology 2007;
36: 422–430.
Lee LJ, Thompson SG. Clustering by health professional in individually randomised trials.
BMJ 2005; 330: 142–144.
Li T, Yu T, Hawkins BS, Dickersin K. Design, analysis, and reporting of crossover trials for
inclusion in a meta-analysis. PloS One 2015; 10: e0133023.
McAlister FA, Straus SE, Sackett DL, Altman DG. Analysis and reporting of factorial trials: a
systematic review. JAMA 2003; 289: 2545–2553.
Murray DM, Short B. Intraclass correlation among measures related to alcohol-use by
young-adults – estimates, correlates and applications in intervention studies. Journal of
Studies on Alcohol 1995; 56: 681–694.
Nolan SJ, Hambleton I, Dwan K. The use and reporting of the cross-over study design in
clinical trials and systematic reviews: a systematic assessment. PloS One 2016; 11:
e0159014.
Puffer S, Torgerson D, Watson J. Evidence for risk of bias in cluster randomised trials:
review of recent trials published in three general medical journals. BMJ 2003; 327:
785–789.
Qizilbash N, Whitehead A, Higgins J, Wilcock G, Schneider L, Farlow M. Cholinesterase
inhibition for Alzheimer disease: a meta-analysis of the tacrine trials. JAMA 1998; 280:
1777–1782.
Rao JNK, Scott AJ. A simple method for the analysis of clustered binary data. Biometrics
1992; 48: 577–585.
Richardson M, Garner P, Donegan S. Cluster randomised trials in Cochrane Reviews:
evaluation of methodological and reporting practice. PloS One 2016; 11: e0151818.
Rietbergen C, Moerbeek M. The design of cluster randomized crossover trials. Journal of
Educational and Behavioral Statistics 2011; 36: 472–490.
Senn S. Cross-over Trials in Clinical Research. 2nd ed. Chichester (UK): John Wiley &
Sons; 2002.
Ukoumunne OC, Gulliford MC, Chinn S, Sterne JA, Burney PG. Methods for evaluating area-
wide and organisation-based interventions in health and health care: a systematic
review. Health Technology Assessment 1999; 3: 5.
Walwyn R, Roberts C. Meta-analysis of absolute mean differences from randomised trials
with treatment-related clustering associated with care providers. Statistics in Medicine
2015; 34: 966–983.
24
Including non-randomized studies on
intervention effects
Barnaby C Reeves, Jonathan J Deeks, Julian PT Higgins, Beverley Shea, Peter
Tugwell, George A Wells; on behalf of the Cochrane Non-Randomized Studies
of Interventions Methods Group
KEY POINTS
• For some Cochrane Reviews, the question of interest cannot be answered by randomized trials, and review authors may be justified in including non-randomized studies.
• Potential biases are likely to be greater for non-randomized studies compared with randomized trials when evaluating the effects of interventions, so results should always be interpreted with caution when they are included in reviews and meta-analyses.
• Non-randomized studies of interventions vary in their ability to estimate a causal effect; key design features of studies can distinguish ‘strong’ from ‘weak’ studies.
• Biases affecting non-randomized studies of interventions vary depending on the features of the studies.
• We recommend that eligibility criteria, data collection and assessment of included studies place an emphasis on specific features of study design (e.g. which parts of the study were prospectively designed) rather than ‘labels’ for study designs (such as case-control versus cohort).
• Review authors should consider how potential confounders, and the likelihood of increased heterogeneity resulting from residual confounding and from other biases that vary across studies, are addressed in meta-analyses of non-randomized studies.
24.1 Introduction
This chapter aims to support review authors who are considering including non-
randomized studies of interventions (NRSI) in a Cochrane Review. NRSI are defined here
as any quantitative study estimating the effectiveness of an intervention (harm or ben-
efit) that does not use randomization to allocate units (individuals or clusters of
This chapter should be cited as: Reeves BC, Deeks JJ, Higgins JPT, Shea B, Tugwell P, Wells GA. Chapter 24:
Including non-randomized studies on intervention effects. In: Higgins JPT, Thomas J, Chandler J, Cumpston
M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition.
Chichester (UK): John Wiley & Sons, 2019: 595–620.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
recommendations, we aim to set out the pros and cons of alternative actions and to
identify questions for further methodological research.
Review authors who are considering including NRSI in a Cochrane Review should not
start with this chapter unless they are already familiar with the process of preparing a
systematic review of randomized trials. The format and basic steps of a Cochrane
Review should be the same irrespective of the types of study included. The reader is
referred to Chapters 1 to 15 of the Handbook for a detailed description of these steps.
Every step in carrying out a systematic review is more difficult when NRSI are included
and the review team should include one or more people with expert knowledge of the
subject and of NRSI methods.
Figure 24.1.a Algorithm to decide whether a review should include non-randomized studies of an intervention or not
the study design that is least likely to be biased. All Cochrane Reviews must con-
sider the risk of bias in individual primary studies, whether randomized trials or
NRSI (see Chapters 7, 8 and 25). Some biases apply to both randomized trials
and NRSI. However, some biases are specific (or particularly important) to NRSI,
such as biases due to confounding or selection of participants into the study
(see Chapter 25). The key advantage of a high-quality randomized trial is its ability
to estimate the causal relationship between an experimental intervention (relative
to a comparator) and outcome. Review authors will need to consider (i) the
strengths of the design features of the NRSI that have been used (such as noting
their potential to estimate causality, in particular by inspecting the assumptions
that underpin such estimation); and (ii) the execution of the studies through a care-
ful assessment of their risk of bias. The review team should be constituted so that it
can judge suitability of the design features of included studies and implement a
careful assessment of risk of bias.
Potential biases are likely to be greater for NRSI compared with randomized trials
because some of the protections against bias that are available for randomized trials
are not established for NRSI. Randomization is an obvious example. Randomization
aims to balance prognostic factors across intervention groups, thus preventing con-
founding (which occurs when there are common causes of intervention group assign-
ment and outcome). Other protections include a detailed protocol and a pre-specified
statistical analysis plan which, for example, should define the primary and secondary
outcomes to be studied, their derivation from measured variables, methods for man-
aging protocol deviations and missing data, planned subgroup and sensitivity analyses
and their interpretation.
harm may outweigh any benefit from the intervention. This situation is more likely to
occur when there are competing interventions for a condition.
NRSI vary with respect to their intrinsic ability to estimate the causal effect of an
intervention (Reeves et al 2017, Tugwell et al 2017). Therefore, to reach reliable conclu-
sions, review authors should include only ‘strong’ NRSI that can estimate causality with
minimal risk of bias. It is not helpful to include primary studies in a review when the
results of the studies are highly likely to be biased even if there is no better evidence
(except for justification 3, i.e. to examine the case for performing a randomized trial by
describing the weakness of the NRSI evidence; see Section 24.1.1). This is because a
misleading effect estimate from a systematic review may be more harmful to future
patients than no estimate at all, particularly if the people using the evidence to make
decisions are unaware of its limitations (Doll 1993, Peto et al 1995). Systematic reviews
have a privileged status in the evidence base (Reeves et al 2013), typically sitting
between primary research studies and guidelines (which frequently cite them). There
may be long-term undesirable consequences of reviewing evidence when it is inade-
quate: an evidence synthesis may make it less likely that less biased research will
be carried out in the future, increasing the risk that more poorly informed decisions
will be made than would otherwise have been the case (Stampfer and Colditz 1991,
Siegfried et al 2005).
There is not currently a general framework for deciding which kinds of NRSI will be
used to answer a specific PICO question. One possible strategy is to limit included
NRSI to those that have used a strong design (NRSI with specified design features;
Reeves et al 2017, Tugwell et al 2017). This should give reasonably valid effect esti-
mates, subject to assessment of risk of bias. An alternative strategy is to include the
best available NRSI (i.e. those with the strongest design features among those that
have been carried out) to answer the PICO question. In this situation, we recommend
scoping available NRSI in advance of finalizing study eligibility for a specific review
question and defining eligibility with respect to study design features (Reeves et al
2017). Widespread adoption of the first strategy might result in reviews that consist-
ently include NRSI with the same design features, but some reviews would include
no studies at all. The second strategy would lead to different reviews including
NRSI with different study design features according to what is available. Whichever
strategy is adopted, it is important to explain the choice of included studies in the
protocol. For example, review authors might be justified in using different eligibility
criteria when reviewing the harms, compared with the benefits, of an intervention
(see Chapter 19, Section 19.2).
We advise caution in assessing NRSI according to existing ‘evidence hierarchies’
for studies of effectiveness (Eccles et al 1996, National Health and Medical Research
Council 1999, Oxford Centre for Evidence-based Medicine 2001). These appear to
have arisen largely by applying hierarchies for aetiological research questions to
effectiveness questions and refer to study design labels. NRSI used for studying
the effects of interventions are very diverse and complex (Shadish et al 2002)
and may not be easily assimilated into existing evidence hierarchies. NRSI with dif-
ferent study design features are susceptible to different biases, and it is often
unclear which biases have the greatest impact and how they vary between health-
care contexts. We recommend including at least one expert with knowledge of the
subject and NRSI methods (with previous experience of estimating an intervention
effect from NRSI similar to the ones of interest) on a review team to help to address
these complexities.
Box 24.2.a Checklist of study features. Responses to each item should be recorded as: yes, no, or can’t tell (Reeves et al 2017). Reproduced with permission of Elsevier

1) Was the intervention/comparator (answer ‘yes’ to more than one item, if applicable):
• allocated to (provided for/administered to/chosen by) individuals?a
• allocated to (provided for/administered to/chosen by) clusters of individuals?a
• clustered in the way it was provided (by practitioner or organizational unit)?b

2) Were outcome data available (answer ‘yes’ to only one item):
• after intervention/comparator only (same individuals)?
• after intervention/comparator only (not all same individuals)?
• before (once) AND after intervention/comparator (same individuals)?
• before (once) AND after intervention/comparator (not all same individuals)?
• multiple times before AND multiple times after intervention/comparator (same individuals)?
• multiple times before AND multiple times after intervention/comparator (not all same individuals)?

3) Was the intervention effect estimated by (answer ‘yes’ to only one item):
• change over time (same individuals at different time-points)?
• change over time (not all same individuals at different time-points)?
• difference between groups (of individuals or clusters receiving either intervention or comparator)?c

4) Did the researchers aim to control for confounding (design or analysis) (answer ‘yes’ to only one item):
• using methods that control in principle for any confounding?
• using methods that control in principle for time invariant unobserved confounding?
• using methods that control only for confounding by observed covariates?

5) Were groups of individuals or clusters formed by (answer ‘yes’ to more than one item, if applicable)d:
• randomization?
• quasi-randomization?
• explicit rule for allocation based on a threshold for a variable measured on a continuous or ordinal scale or boundary (in conjunction with identifying the variable dimension, below)?
• some other action of researchers?
• time differences?
• location differences?
• healthcare decision makers/practitioners?
• participants’ preferences?
• policy maker?
• on the basis of outcome?e
• some other process? (specify)

6) Were the following features of the study carried out after the study was designed (answer ‘yes’ to more than one item, if applicable):
• characterization of individuals/clusters before intervention?
• actions/choices leading to an individual/cluster becoming a member of a group?e
• assessment of outcomes?

7) Were the following variables measured before intervention (answer ‘yes’ to more than one item, if applicable):
• potential confounders?
• outcome variable(s)?

a This item describes ‘explicit’ clustering. In randomized controlled trials, participants can be allocated individually or by virtue of belonging to a cluster such as a primary care practice or a village.
b This item describes ‘implicit’ clustering. In randomized controlled trials, participants can be allocated individually but with the intervention being delivered in clusters (e.g. group cognitive therapy); similarly, in a cluster-randomized trial (by general practice), the provision of an intervention could also be clustered by therapist, with several therapists providing ‘group’ therapy.
c A study should be classified as ‘yes’ for this feature, even if it involves comparing the extent of change over time between groups.
d The distinction between these options is to do with the exogeneity of the allocation.
e For (nested) case-control studies, group refers to the case/control status of an individual. This option is not applicable when interventions are allocated to (provided for/administered to/chosen by) clusters.
Some Cochrane Reviews have limited inclusion of NRSI by study design labels, some-
times in combination with considerations of methodological quality. For example,
Cochrane Effective Practice and Organisation of Care accepts protocols that include
interrupted time series (ITS) and controlled before-and-after (CBA) studies, and speci-
fies some minimum criteria for these types of studies. The risks of using design labels
are highlighted by a recent review that showed that Cochrane Reviews inconsistently
labelled CBA and ITS studies, and included studies that used these labels in highly
inconsistent ways (Polus et al 2017). We believe that these issues will be addressed
by applying the study feature checklist.
Our proposal is that:
1) the review team decides which study design features are desirable in a NRSI to
address a specific PICO question;
2) scoping will indicate the study design features of the NRSI that are available; and
3) the review team sets eligibility criteria based on study design features that represent
an appropriate balance between the priority of the question and the likely strength
of the available evidence.
When both randomized trials and NRSI of an intervention exist in relation to a specific
PICO question and, for one or more of the reasons given in Section 24.1.1, both are
defined as eligible, the results for randomized trials and for NRSI should be presented
and analysed separately. Alternatively, if there is an adequate number of randomized
trials to inform the main analysis for a review question, comments about relevant NRSI
can be included in the Discussion section of a review although the reader needs to be
reassured that NRSI are not selectively cited.
be found in CENTRAL, and authors of Cochrane Reviews can search these registers
where they are likely to be relevant (e.g. the register of Cochrane Effective Practice
and Organisation of Care). There are no databases of NRSI similar to CENTRAL.
Some review authors have tried to develop and validate methodological filters for
NRSI (strategy 2) but with limited success because NRSI design labels are not reliably
indexed by bibliographic databases and are used inconsistently by authors of primary
studies (Wieland and Dickersin 2005, Fraser et al 2006, Furlan et al 2006). Furthermore,
study design features, which are the preferred approach to determining eligibility of
NRSI for a review, suffer from the same problems. Review authors have also sought
to optimize search strategies for adverse effects (see Chapter 19, Section 19.3)
(Golder et al 2006c, Golder et al 2006b). Because of the time-consuming nature of sys-
tematic reviews that include NRSI, attempts to develop search strategies for NRSI have
not investigated large numbers of review questions. Therefore, review authors should
be cautious about assuming that previous strategies can be applied to new topics.
Finally, although trials registers such as ClinicalTrials.gov do include some NRSI, their
coverage is very low so strategy 3 is unlikely to be very fruitful.
Searching using ‘snowballing’ methods may be helpful, if one or more publications of
relevance or importance are known (Wohlin 2014), although it is likely to identify other
evidence about the research question in general rather than studies with similar design
features.
If both unadjusted and adjusted intervention effects are reported, then adjusted
effects should be preferred. It is straightforward to extract an adjusted effect estimate
and its standard error for a meta-analysis if a single adjusted estimate is reported for a
particular outcome in a primary NRSI. However, some NRSI report multiple adjusted
estimates from analyses including different sets of covariates. If multiple adjusted
estimates of intervention effect are reported, the one that is judged to minimize
the risk of bias due to confounding should be chosen (see Chapter 25, Section 25.2.1).
(Simple numerators and denominators, or means and standard errors, for intervention
and control groups cannot control for confounding unless the groups have been
matched on all important confounding domains at the design stage.)
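For instance, an adjusted odds ratio reported with a 95% confidence interval can be converted to a log odds ratio and standard error for generic inverse-variance meta-analysis. A minimal sketch, assuming a Wald-type interval that is symmetric on the log scale (the helper name and the example numbers are ours):

```python
# Hedged sketch: converting a reported adjusted odds ratio and its 95% CI into
# a log odds ratio and standard error for generic inverse-variance
# meta-analysis. Assumes the interval is symmetric on the log scale.
from math import log

def log_or_and_se(or_adj, ci_lower, ci_upper):
    """Return (log OR, SE) from an adjusted OR and its 95% confidence interval."""
    se = (log(ci_upper) - log(ci_lower)) / (2 * 1.96)
    return log(or_adj), se

# Hypothetical adjusted estimate: OR 1.80 (95% CI 1.20 to 2.70).
print(log_or_and_se(1.80, 1.20, 2.70))   # about (0.588, 0.207)
```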
Anecdotally, the experience of review authors is that NRSI are poorly reported so that
the required information is difficult to find, and different review authors may extract
different information from the same paper. Data collection forms may need to be cus-
tomized to the research question being investigated. Restricting included studies to
those that share specific features can help to reduce their diversity and facilitate the
design of customized data collection forms.
As with randomized trials, results of NRSI may be presented using different measures
of effect and uncertainty or statistical significance. Before concluding that informa-
tion required to describe an intervention effect has not been reported, review
authors should seek statistical advice about whether reported information can
be transformed or used in other ways to provide a consistent effect measure
across studies so that this can be analysed using standard software (see
Chapter 6). Data collection sheets need to be able to handle the different kinds of infor-
mation about study findings that review authors may encounter.
The following types of information are likely to be needed:
1) Data about study design features to demonstrate the eligibility of included studies
against criteria specified in the review protocol. The study design feature checklist
can help to do this (see Section 24.2.2). When using this checklist, whether to decide
on eligibility or for data extraction, the intention should be to document what
researchers did in the primary studies, rather than what researchers called their
studies or think they did. Further guidance on using the checklist is included with
the description of the tool (Reeves et al 2017).
2) Variables measured in a study that characterize confounding domains of interest;
the ROBINS-I tool provides a template for collecting this information (see
Chapter 25, Section 25.3) (Sterne et al 2016).
3) The availability of data for experimental and comparator intervention groups, and
about the co-interventions; the ROBINS-I tool provides a template for collecting
information about co-interventions (see Chapter 25).
4) Data to characterize the directness with which the study addresses the review ques-
tion (i.e. the PICO elements of the study). We recommend that review authors record
this information, then apply a simple template that has been published for doing
this (Schünemann et al 2013, Wells et al 2013), judging the directness of each
element as ‘sufficient’ on a 4-point categorical scale. (This tool could be used for
scoping and can be applied to randomized trials as well as NRSI.)
5) Data describing the study results (see Section 24.6.1). Capturing these data is likely
to be challenging and data collection will almost certainly need to be customized to
the research question being investigated. Review authors are strongly advised to
pilot the methods they plan to use with studies that cover the expected diversity;
developing the data collection form may require several iterations. It is almost
impossible to finalize these forms in advance. Methods developed at the outset
(e.g. forms or database) may need to be amended to record additional important
information identified when appraising NRSI but overlooked at the outset. Review
authors should record when required data are not available due to poor reporting,
as well as data that are available. Data should be captured describing both unad-
justed and adjusted intervention effects.
The ROBINS-I tool involves some preliminary work when writing the protocol.
Notably, review authors will need to specify important confounding domains and
co-interventions. There is no established method for identifying a pre-specified set
of important confounding domains. The list of potential confounding domains should
not be generated solely on the basis of factors considered in primary studies included
in the review (at least, not without some form of independent validation), since the
number of suspected confounders is likely to increase over time (hence, older studies
may be out of date) and researchers themselves may simply choose to measure con-
founders considered in previous studies. Rather, the list should be based on evidence
(although undertaking a systematic review to identify all potential prognostic factors is
extreme) and expert opinion from members of the review team and advisors with con-
tent expertise.
The ROBINS-I assessment involves consideration of several bias domains. Each
domain is judged as low, moderate, serious or critical risk of bias. A judgement of
low risk of bias for a NRSI using ROBINS-I equates to a low risk-of-bias judgement
for a high-quality randomized trial. Few circumstances around a NRSI are likely to give
a similar level of protection against confounding as randomization, and few NRSI have
detailed statistical analysis plans in advance of carrying out analyses. We therefore con-
sider it very unlikely that any NRSI will be judged to be at low risk of bias overall.
Although the bias domains are common to all types of NRSI, specific issues can arise
for certain types of study, such as analyses of routinely collected data or pharmaco-
epidemiological studies. Review authors are advised to consider carefully whether a
methodologist with knowledge of the kinds of study to be included should be recruited
to the review team to help to identify key areas of weakness.
The odds ratio will commonly be used as it is the only effect measure for dichotomous
outcomes that can be estimated from case-control studies, and is estimated when
logistic regression is used to adjust for confounders.
One danger is that a very large NRSI of poor methodological quality (e.g. based on
routinely collected data) may dominate the findings of other smaller studies at less
risk of bias (perhaps carried out using customized data collection). Review authors
need to remember that the confidence intervals for effect estimates from larger NRSI
are less likely to represent the true uncertainty of the observed effect than are the
confidence intervals for smaller NRSI (Deeks et al 2003), although there is no way
of estimating or correcting for this. Review authors should exclude from analysis
any NRSI judged to be at critical risk of bias and may choose to include only studies
that are at moderate or low risk of bias, specifying this choice a priori in the review
protocol.
selective or emphasizing some findings over others. Ideally, authors should set out in
the review protocol how they plan to use narrative synthesis to report the findings of
primary studies.
24.7.1.2 Has the risk of bias to included studies been adequately assessed?
Interpretation of the results of a review of NRSI should include consideration of the
likely direction and magnitude of bias, although this can be challenging to do. Some
of the biases that affect randomized trials also affect NRSI but typically to a greater
extent. For example, attrition in NRSI is often worse (and poorly reported), intervention
or publication bias. In practice, the final rating for a body of evidence based on NRSI is
typically rated as ‘low’ or ‘very low’.
Application of the GRADE approach to systematic reviews of NRSI requires expertise
about the design of NRSI due to the nature of the biases that may arise. For example,
the strength of evidence for an association may be enhanced by a subset of primary
studies that have tested considerations about causality not usually applied to rando-
mized trial evidence (Bradford Hill 1965), or use of negative controls (Jackson et al
2006). In some contexts, little prognostic information may be known, limiting identifi-
cation of possible confounding (Jefferson et al 2005).
Whether the debate concludes that the evidence from NRSI is adequate for informed
decision making or that there is a need for randomized trials will depend on the value
placed on the uncertainty arising through use of potentially biased NRSI, and the col-
lective value of the observed effects. The GRADE approach interprets certainty as the
certainty that the effect of the intervention is large enough to reach a threshold for
action. This value may depend on the wider healthcare context. It may not be possible
to include assessments of the value within the review itself, and it may become evident
only as part of the wider debate following publication.
For example, is evidence from NRSI of a rare serious adverse effect adequate to
decide that an intervention should not be used? The evidence has low certainty
(due to a lack of randomized trials) but the value of knowing that there is the possibility
of a potentially serious harm is considerable, and may be judged sufficient to withdraw
the intervention. (It is worth noting that the judgement about withdrawing an interven-
tion may depend on whether equivalent benefits can be obtained from elsewhere with-
out such a risk; if not, the intervention may still be offered but with full disclosure of the
potential harm.) Where evidence of benefit is also uncertain, the value attached to a
systematic review of NRSI of harm may be even greater.
In contrast, evidence of a small benefit of a novel intervention from a systematic
review of NRSI may not be sufficient for decision makers to recommend widespread
implementation in the face of the uncertainty of the evidence and the costs arising from
provision of the intervention. In these circumstances, decision makers may conclude
that randomized trials should be undertaken to improve the certainty of the evidence
if practicable and if the investment in the trial is likely to be repaid in the future.
systematic review of randomized trials. Inclusion of NRSI to address some review ques-
tions will be invaluable in addressing the broad aims of a review; however, the conclu-
sions in relation to some review questions are likely to be much weaker and may make
a relatively small contribution to the topic. Therefore, review authors and Cochrane
Review Group editors need to decide at an early stage whether the investment of
resources is likely to be justified by the priority of the research question.
Bringing together the required team of healthcare professionals and methodologists
may be easier for systematic reviews of NRSI to estimate the effects of an intervention
on long-term and rare adverse outcomes, for example when considering the side
effects of drugs. A review of this kind is likely to provide important missing evidence
about the effects of an intervention in a priority area (i.e. adverse effects). However,
these reviews may require the input of additional specialist authors, for example with
relevant content pharmacological expertise. There is a pressing need in many health
conditions to supplement traditional systematic reviews of randomized trials of effec-
tiveness with systematic reviews of adverse (unintended) effects. It is likely that these
systematic reviews will usually need to include NRSI.
Funding: BCR is supported by the UK National Institute for Health Research Biomedical
Research Centre at University Hospitals Bristol NHS Foundation Trust and the Univer-
sity of Bristol. JJD receives support from the National Institute for Health Research
(NIHR) Birmingham Biomedical Research Centre at the University Hospitals Birming-
ham NHS Foundation Trust and the University of Birmingham. JPTH is a member of
the NIHR Biomedical Research Centre at University Hospitals Bristol NHS Foundation
Trust and the University of Bristol. The views expressed are those of the author(s) and
not necessarily those of the NHS, the NIHR or the Department of Health.
24.9 References
Audigé L, Bhandari M, Griffin D, Middleton P, Reeves BC. Systematic reviews of
nonrandomized clinical studies in the orthopaedic literature. Clinical Orthopaedics and
Related Research 2004: 249–257.
Bradford Hill A. The environment and disease: association or causation? Proceedings of the
Royal Society of Medicine 1965; 58: 295–300.
Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for
selective reporting of outcomes in randomized trials: comparison of protocols to
published articles. JAMA 2004; 291: 2457–2465.
Deeks JJ, Dinnes J, D’Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman DG.
Evaluating non-randomised intervention studies. Health Technology Assessment 2003;
7: 27.
Doll R. Doing more good than harm: the evaluation of health care interventions. Summation
of the conference. Annals of the New York Academy of Sciences 1993; 703: 310–313.
Eccles M, Clapp Z, Grimshaw J, Adams PC, Higgins B, Purves I, Russell I. North of England
evidence based guidelines development project: methods of guideline development. BMJ
1996; 312: 760–762.
Fraser C, Murray A, Burr J. Identifying observational studies of surgical interventions in
MEDLINE and EMBASE. BMC Medical Research Methodology 2006; 6: 41.
Furlan AD, Irvin E, Bombardier C. Limited search strategies were effective in finding relevant
nonrandomized studies. Journal of Clinical Epidemiology 2006; 59: 1303–1311.
Glasziou P, Chalmers I, Rawlins M, McCulloch P. When are randomised trials unnecessary?
Picking signal from noise. BMJ 2007; 334: 349–351.
Golder S, Loke Y, McIntosh HM. Room for improvement? A survey of the methods used in
systematic reviews of adverse effects. BMC Medical Research Methodology 2006a; 6: 3.
Golder S, McIntosh HM, Duffy S, Glanville J, Centre for Reviews and Dissemination and UK
Cochrane Centre Search Filters Design Group. Developing efficient search strategies to
identify reports of adverse effects in MEDLINE and EMBASE. Health Information and
Libraries Journal 2006b; 23: 3–12.
Golder S, McIntosh HM, Loke Y. Identifying systematic reviews of the adverse effects of
health care interventions. BMC Medical Research Methodology 2006c; 6: 22.
Henry D, Moxey A, O’Connell D. Agreement between randomized and non-randomized
studies: the effects of bias and confounding. 9th Cochrane Colloquium; 2001; Lyon
(France).
Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is
not available. American Journal of Epidemiology 2016; 183: 758–764.
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-
analyses. BMJ 2003; 327: 557–560.
Higgins JPT, Ramsay C, Reeves BC, Deeks JJ, Shea B, Valentine JC, Tugwell P, Wells G. Issues
relating to study design and risk of bias when including non-randomized studies in
systematic reviews on the effects of interventions. Research Synthesis Methods 2013; 4:
12–25.
Higgins JPT, Soares-Weiser K, López-López JA, Kakourou A, Chaplin K, Christensen H, Martin
NK, Sterne JA, Reingold AL. Association of BCG, DTP, and measles containing vaccines
with childhood mortality: systematic review. BMJ 2016; 355: i5170.
Jackson LA, Jackson ML, Nelson JC, Neuzil KM, Weiss NS. Evidence of bias in estimates of
influenza vaccine effectiveness in seniors. International Journal of Epidemiology 2006; 35:
337–344.
Jefferson T, Smith S, Demicheli V, Harnden A, Rivetti A, Di Pietrantonj C. Assessment of the
efficacy and effectiveness of influenza vaccines in healthy children: systematic review.
Lancet 2005; 365: 773–780.
Kwan J, Sandercock P. In-hospital care pathways for stroke. Cochrane Database of
Systematic Reviews 2004; 4: CD002924.
Stampfer MJ, Colditz GA. Estrogen replacement therapy and coronary heart disease: a
quantitative assessment of the epidemiologic evidence. Preventive Medicine 1991; 20:
47–63.
Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman
DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A,
Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L,
Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC,
Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing
risk of bias in non-randomized studies of interventions. BMJ 2016; 355: i4919.
Taggart DP, D’Amico R, Altman DG. Effect of arterial revascularisation on survival: a
systematic review of studies comparing bilateral and single internal mammary arteries.
Lancet 2001; 358: 870–875.
Tugwell P, Knottnerus JA, McGowan J, Tricco A. Big-5 Quasi-Experimental designs. Journal
of Clinical Epidemiology 2017; 89: 1–3.
Wells GA, Shea B, Higgins JPT, Sterne J, Tugwell P, Reeves BC. Checklists of methodological
issues for review authors to consider when including non-randomized studies in
systematic reviews. Research Synthesis Methods 2013; 4: 63–77.
Wieland S, Dickersin K. Selective exposure reporting and Medline indexing limited the search
sensitivity for observational studies of the adverse effects of oral contraceptives. Journal
of Clinical Epidemiology 2005; 58: 560–567.
Wohlin C. Guidelines for snowballing in systematic literature studies and a replication in
software engineering. EASE ’14 Proceedings of the 18th International Conference on
Evaluation and Assessment in Software Engineering; London, UK 2014.
25
Assessing risk of bias in a
non-randomized study
Jonathan AC Sterne, Miguel A Hernán, Alexandra McAleenan, Barnaby C Reeves,
Julian PT Higgins
KEY POINTS
• The ROBINS-I tool is recommended for assessing the risk of bias of non-randomized studies of interventions included in Cochrane Reviews.
• Review authors should specify important confounding domains and co-interventions of concern in their protocol.
• At the start of a ROBINS-I assessment of a study, review authors should describe a ‘target trial’, which is a hypothetical pragmatic randomized trial of the interventions compared in the study, conducted on the same participant group and without features putting it at risk of bias.
• Assessment of risk of bias in a non-randomized study should address pre-intervention, at-intervention, and post-intervention features of the study. The issues related to post-intervention features are similar to those in randomized trials.
• Many features of ROBINS-I are shared with the RoB 2 tool for assessing risk of bias in randomized trials. It focuses on a specific result, is structured into a fixed set of domains of bias, includes signalling questions that inform risk-of-bias judgements and leads to an overall risk-of-bias judgement.
• Based on answers to the signalling questions, judgements for each bias domain, and for overall risk of bias, can be ‘Low’, ‘Moderate’, ‘Serious’ or ‘Critical’ risk of bias.
• The full guidance documentation for the ROBINS-I tool, including the latest variants for different study designs, is available at www.riskofbias.info.
This chapter should be cited as: Sterne JAC, Hernán MA, McAleenan A, Reeves BC, Higgins JPT. Chapter 25:
Assessing risk of bias in a non-randomized study. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T,
Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions. 2nd Edition.
Chichester (UK): John Wiley & Sons, 2019: 621–642.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
25.1 Introduction
Cochrane Reviews often include non-randomized studies of interventions (NRSI), as
discussed in detail in Chapter 24. Risk of bias should be assessed for each included
study (see Chapter 7). The Risk Of Bias In Non-randomized Studies of Interventions
(ROBINS-I) tool (Sterne et al 2016) is recommended for assessing risk of bias in a NRSI:
it provides a framework for assessing the risk of bias in a single result (an estimate of
the effect of an experimental intervention compared with a comparator intervention on
a particular outcome). Many features of ROBINS-I are shared with the RoB 2 tool for
assessing risk of bias in randomized trials (see Chapter 8).
Evaluating risk of bias in results of NRSI requires both methodological and content
expertise. The process is more involved than for randomized trials, and the participa-
tion of both methodologists with experience in the relevant study designs or design
features, and health professionals with knowledge of prognostic factors that influence
intervention decisions for the target patient or population group, is recommended (see
Chapter 24). At the planning stage, the review question must be clearly articulated, and
important potential problems in NRSI relevant to the review should be identified. This
includes a preliminary specification of important confounders and co-interventions
(see Section 25.3.1). Each study should then be carefully examined, considering all
the ways in which its results might be put at risk of bias.
In this chapter we summarize the biases that can affect NRSI and describe the main
features of the ROBINS-I tool. Since the initial version of the tool was published in 2016
(Sterne et al 2016), developments to it have continued. At the time of writing, a new
version is under preparation, with variants for several types of NRSI design. The full
guidance documentation for the ROBINS-I tool, including the latest variants for dif-
ferent study designs, is available at www.riskofbias.info.
25.2.1 Confounding
Confounding occurs when there are common causes of the choice of intervention and
the outcome of interest. In the presence of confounding, the association between inter-
vention and outcome differs from its causal effect. This difference is known as con-
founding bias. A confounding domain (or, more loosely, a ‘confounder’) is a pre-
intervention prognostic factor (i.e. a variable that predicts the outcome of interest) that
also predicts whether an individual receives one or the other intervention of interest.
Some common examples are severity of pre-existing disease, presence of comorbid-
ities, healthcare use, physician prescribing practices, adiposity, and socio-economic
status.
Investigators measure specific variables (often also referred to as confounders) in an
attempt to control fully or partly for these confounding domains. For example, baseline
immune function and recent weight loss may be used to adjust for disease severity;
hospitalizations and number of medical encounters in the six months preceding base-
line may be used to adjust for healthcare use; geographic measures to adjust for phy-
sician prescribing practices; body mass index and waist-to-hip ratio to adjust for
adiposity; and income and education to adjust for socio-economic status.
The confounding domains that are important in the context of particular interven-
tions may vary across study settings. For example, socio-economic status might be
an important confounder in settings where cost or having insurance cover affects
access to health care, but might not introduce confounding in studies conducted in
countries in which access to the interventions of interest is universal and therefore
socio-economic status does not influence intervention received.
Confounding may be overcome, in principle, either by design (e.g. by restricting eligi-
bility to individuals who all have the same value of the baseline confounders) or – more
commonly – through statistical analyses that adjust (‘control’) for the confounder(s).
Adjusting for factors that are not confounders, and in particular adjusting for
variables that could be affected by intervention (‘post-intervention’ variables), may
introduce bias.
In practice, confounding is not fully overcome. First, residual confounding occurs
when a confounding domain is measured with error, or when the relationship between
the confounding domain and the outcome or exposure (depending on the analytic
approach being used) is imperfectly modelled. For example, in a NRSI comparing
two antihypertensive drugs, we would expect residual confounding if pre-intervention
blood pressure was measured three months before the start of intervention, but
the blood pressures used by clinicians to decide between the drugs at the point of
intervention were not available in our dataset. Second, unmeasured confounding
occurs when a confounding domain has not been measured at all, or is not controlled
for in the analysis. This would be the case if no pre-intervention blood pressure
measurements were available, or if the analysis failed to control for pre-intervention
blood pressure despite it being measured. Unmeasured confounding can usually
not be excluded, because we are seldom certain that we know all the confounding
domains.
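The consequences of residual and unmeasured confounding can be illustrated with a small simulation. The following sketch (illustrative only, and not part of formal risk-of-bias assessment; all variable names are hypothetical) generates data in which a confounding domain influences both treatment receipt and outcome, with a true treatment effect of zero, and compares a crude estimate, an estimate adjusted for a noisy measurement of the confounder, and an estimate adjusted for the confounding domain itself.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Underlying confounding domain (e.g. disease severity), standardized.
severity = rng.normal(size=n)
# Noisy measurement of the confounder: adjusting for this leaves
# residual confounding.
severity_measured = severity + rng.normal(size=n)
# Intervention receipt depends on severity (confounding by indication).
treated = rng.random(n) < 1 / (1 + np.exp(-severity))
# Outcome depends on severity but not on treatment (true effect = 0).
outcome = 0.5 * severity + rng.normal(size=n)

def treatment_coefficient(y, t, x=None):
    # OLS coefficient of treatment on outcome, optionally adjusting for x.
    cols = [np.ones(len(y)), t.astype(float)]
    if x is not None:
        cols.append(x)
    X = np.column_stack(cols)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

print("crude estimate:   %.3f" % treatment_coefficient(outcome, treated))
print("noisy adjustment: %.3f" % treatment_coefficient(outcome, treated, severity_measured))
print("full adjustment:  %.3f" % treatment_coefficient(outcome, treated, severity))

The crude estimate is biased away from zero, adjustment for the mismeasured confounder removes only part of the bias (residual confounding), and adjustment for the true confounding domain recovers the null effect.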
When NRSI are to be included in a review, review authors should attempt to pre-
specify important confounding domains in their protocol. The identification of poten-
tial confounding domains requires subject-matter knowledge. For example, experts on
surgery are best-placed to identify prognostic factors that are likely to be related to the
choice of a surgical strategy. We recommend that subject-matter experts be included in
the team writing the review protocol, and we encourage the listing of confounding
domains in the review protocol, based on initial discussions among the review authors
and existing knowledge of the literature.
For example, consider a study of the effect of folate supplementation in pregnancy on the risk of neural tube defects in offspring, in which the analysis is restricted to live births. Here the selection (exclusion of pregnancies for which outcome data were not available) is related to both the intervention (because
folate supplementation increases the chance of a live birth) and the outcome (because
the presence of neural tube defects makes a live birth less likely) (Velie and Shaw 1996,
Hernán et al 2002).
Selection bias can also occur when some follow-up time is excluded from the anal-
ysis. For example, there is potential for bias when prevalent users of an intervention
(those already receiving the intervention), rather than incident (new) users are included
in analyses comparing them with non-users. This is a type of selection bias that has also
been termed inception bias or lead time bias. If participants are not followed from
assignment of the intervention (inception), as they would be in a randomized trial, then
a period of follow-up has been excluded, and individuals who experienced the outcome
soon after starting the intervention will be missing from analyses.
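A new-user (incident user) design avoids this bias by aligning the start of follow-up with the start of intervention. As a minimal sketch of how such a restriction might be applied to drug dispensing records (file and column names are hypothetical):

import pandas as pd

rx = pd.read_csv("dispensings.csv",
                 parse_dates=["dispense_date", "enrollment_start"])

# The first observed dispensing per person is the candidate inception date.
first_rx = (rx.sort_values("dispense_date")
              .groupby("person_id", as_index=False)
              .first())

# New-user design: require at least 365 days of prior enrolment without a
# dispensing, so prevalent users are excluded and everyone's follow-up
# starts at the inception of treatment.
washout_ok = (first_rx["dispense_date"] - first_rx["enrollment_start"]) >= pd.Timedelta(days=365)
new_users = first_rx[washout_ok]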
Selection bias may also arise because of missing data due to, among other reasons,
attrition (loss to follow-up), missed appointments, incomplete data collection and by
participants being excluded from analysis by primary investigators. In NRSI, data may
be missing for baseline characteristics (including interventions received or baseline
confounders), for pre-specified co-interventions, for outcome measurements, for other
variables involved in the analysis or a combination of these. Specific considerations for
missing data broadly follow those established for randomized trials and described in
the RoB 2 tool for randomized trials (see Chapter 8).
Bias in classification of interventions is least likely when information on intervention status is collected at the time of the intervention and the information is
complete and accessible to those undertaking the NRSI.
Bias in measurement of the outcome is often referred to as detection bias. Exam-
ples of situations in which such bias can arise are if (i) outcome assessors are aware of
intervention status (particularly when assessment of the outcome is subjective);
(ii) different methods (or intensities of observation) are used to assess outcomes in
the different intervention groups; and (iii) measurement errors are related to interven-
tion status (or to a confounder of the intervention-outcome relationship). Blinding of
outcome assessors aims to prevent systematic differences in measurements between
intervention groups but is frequently not possible or not performed in NRSI.
Review authors should list in their protocol the important confounding domains: pre-intervention prognostic factors (predictors of the outcome) that also predict whether an individual receives one or the
other intervention of interest.
Review authors are also encouraged to list important co-interventions in their pro-
tocol. Relevant co-interventions are the interventions or exposures that individuals
might receive after or with initiation of the intervention of interest, which are related
to the intervention received and which are prognostic for the outcome of interest.
Therefore, co-interventions are a type of confounder, which we consider separately
to highlight their importance.
Important confounders and co-interventions are likely to be identified both through
the knowledge of subject-matter experts who are members of the review team, and
through initial (scoping) reviews of the literature. Discussions with health professionals
who make intervention decisions for the target patient or population groups may also
be helpful. Assessment of risk of bias may, for some domains, rely heavily on expert
opinion rather than empirical data: this means that consensus may not be reached
among experts with different opinions. Nonetheless use of ROBINS-I should help struc-
ture discussions about risk of bias and make disagreements explicit.
The signalling questions aim to elicit information relevant to the risk-of-bias judge-
ment for the domain, and work in the same way as for RoB 2 (see Chapter 8,
Section 8.2.3). The response options are:
• yes;
• probably yes;
• probably no;
• no;
• no information.
Table 25.3.a Bias domains included in the ROBINS-I tool

Pre-intervention domains

Bias due to confounding (category: confounding). Baseline confounding occurs when one or more prognostic variables (factors that predict the outcome of interest) also predict the intervention received at baseline. ROBINS-I can also address time-varying confounding, which occurs when post-baseline prognostic factors affect the intervention received after baseline.

Bias in selection of participants into the study (category: selection bias). When exclusion of some eligible participants, or the initial follow-up time of some participants, or some outcome events, is related to both intervention and outcome, there will be an association between interventions and outcome even if the effect of interest is truly null. This type of bias is distinct from confounding. A specific example is bias due to the inclusion of prevalent users, rather than new users, of an intervention.

At-intervention domain

Bias in classification of interventions (category: information bias). Bias introduced by either differential or non-differential misclassification of intervention status. Non-differential misclassification is unrelated to the outcome and will usually bias the estimated effect of intervention towards the null. Differential misclassification occurs when misclassification of intervention status is related to the outcome or the risk of the outcome.

Post-intervention domains

Bias due to deviations from intended interventions (category: confounding). Bias that arises when there are systematic differences between experimental intervention and comparator groups in the care provided, which represent a deviation from the intended intervention(s). Assessment of bias in this domain will depend on the effect of interest (either the effect of assignment to intervention or the effect of adhering to intervention).

Bias due to missing data (category: selection bias). Bias that arises when later follow-up is missing for individuals initially included and followed (e.g. differential loss to follow-up that is affected by prognostic factors); bias due to exclusion of individuals with missing information about intervention status or other variables such as confounders.

Bias in measurement of the outcome (category: information bias). Bias introduced by differential or non-differential errors in measurement of the outcome, for example when outcome assessors are aware of intervention status, when different methods are used to assess outcomes in the different intervention groups, or when measurement errors are related to intervention status or to a confounder of the intervention-outcome relationship.

Bias in selection of the reported result (category: reporting bias). Bias arising when the reported result is selected, on the basis of the results, from multiple outcome measurements within the outcome domain, from multiple analyses of the data, or from multiple subgroups of a larger cohort.
Table 25.3.b Interpretation of domain-level risk-of-bias judgements in ROBINS-I

Low risk of bias: The study is comparable to a well-performed randomized trial with regard to this domain.
Moderate risk of bias: The study is sound for a non-randomized study with regard to this domain but cannot be considered comparable to a well-performed randomized trial.
Serious risk of bias: The study has some important problems in this domain.
Critical risk of bias: The study is too problematic in this domain to provide any useful evidence on the effects of intervention.
No information: No information on which to base a judgement about risk of bias for this domain.
Based on these responses to the signalling questions, the options for a domain-level
risk-of-bias judgement are ‘Low’, ‘Moderate’, ‘Serious’ or ‘Critical’ risk of bias, with
an additional option of ‘No information’ (see Table 25.3.b). These differ from the
risk-of-bias judgements for the RoB 2 tool (Chapter 8, Section 8.2.3).
Note that a judgement of ‘Low risk of bias’ corresponds to the absence of bias in a
well-performed randomized trial, with regard to the domain being considered. This cat-
egory thus provides a reference for risk-of-bias assessment in NRSI in particular for the
‘pre-intervention’ and ‘at-intervention’ domains. Because of confounding, we antici-
pate that only rarely will design or analysis features of a non-randomized study lead
to a classification of low risk of bias when studying the intended effects of interventions
(on the other hand, confounding may be a less serious concern when studying unin-
tended effects of intervention (Institute of Medicine 2012)). By contrast, since random-
ization does not protect against post-intervention biases, we expect more overlap
between assessments of randomized trials and assessments of NRSI for the post-
intervention domains. Nonetheless other features of randomized trials that are usually
not feasible in NRSI, such as blinding of participants, health professionals or outcome
assessors, may make NRSI more at risk of post-intervention biases.
As for RoB 2, a free text box alongside the signalling questions and judgements pro-
vides space for review authors to present supporting information for each response.
Brief, direct quotations from the text of the study report should be used whenever
possible.
The tool includes an optional component to judge the direction of the bias for each
domain and overall. For some domains, the bias is most easily thought of as being
towards or away from the null. For example, suspicion of selective non-reporting of
statistically non-significant results would suggest bias away from the null. However,
for other domains (in particular confounding, selection bias and forms of measurement
bias such as differential misclassification), the bias needs to be thought of as an
increase or decrease in the effect estimate to favour either the experimental interven-
tion or comparator compared with the target trial, rather than towards or away from
the null. For example, confounding bias that decreases the effect estimate would be
towards the null if the true risk ratio were greater than 1, and away from the null if
the risk ratio were less than 1. If review authors do not have a clear rationale for judging
the likely direction of the bias, they should not attempt to guess it and should leave this
response blank.
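A small worked example of this distinction, assuming (purely for illustration) that confounding multiplies the estimated risk ratio by 0.8:

# Hypothetical multiplicative confounding that decreases the risk ratio by 20%.
bias_factor = 0.8
for true_rr in (2.0, 0.5):
    observed_rr = true_rr * bias_factor
    direction = "towards" if abs(observed_rr - 1) < abs(true_rr - 1) else "away from"
    print(f"true RR {true_rr}: observed RR {observed_rr:.2f} ({direction} the null)")

The same bias moves the estimate towards the null when the true risk ratio is 2.0 (observed 1.60) but away from the null when it is 0.5 (observed 0.40).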
Overall risk-of-bias judgements, interpretations and criteria

Low risk of bias. Interpretation: The study is comparable to a well-performed randomized trial. Criterion: The study is judged to be at low risk of bias for all domains for this result.
Moderate risk of bias. Interpretation: The study appears to provide sound evidence for a non-randomized study but cannot be considered comparable to a well-performed randomized trial. Criterion: The study is judged to be at low or moderate risk of bias for all domains.
Serious risk of bias. Interpretation: The study has one or more important problems. Criterion: The study is judged to be at serious risk of bias in at least one domain, but not at critical risk of bias in any domain.
Critical risk of bias. Interpretation: The study is too problematic to provide any useful evidence and should not be included in any synthesis. Criterion: The study is judged to be at critical risk of bias in at least one domain.
A judgement of ‘Serious’ risk of bias within any domain implies that the concerns iden-
tified have serious implications for the result overall, irrespective of which domain is
being assessed. In practice this means that if the answers to the signalling questions
yield a proposed judgement of ‘Serious’ or ‘Critical’ risk of bias, review authors should
consider whether any identified problems are of sufficient concern to warrant this
judgement for that result overall. If this is not the case, the appropriate action would
be to retain the answers to the signalling questions but override the proposed default
judgement and provide justification.
‘Moderate’ risk of bias in multiple domains may lead review authors to decide on an
overall judgement of ‘Serious’ risk of bias for that outcome or group of outcomes, and
‘Serious’ risk of bias in multiple domains may lead review authors to decide on an over-
all judgement of ‘Critical’ risk of bias.
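The default mapping from domain-level judgements to an overall judgement is mechanical enough to express as a short function. The sketch below (an illustration, not part of the ROBINS-I tool itself) encodes the criteria in the table above, leaving the optional escalation across multiple ‘Moderate’ or ‘Serious’ domains to the review authors.

def overall_robins_i(domain_judgements):
    # Default overall ROBINS-I judgement: the most severe judgement
    # reached in any domain. Review authors may still escalate when
    # several domains are at 'Moderate' or 'Serious' risk of bias.
    levels = set(domain_judgements)
    for level in ("Critical", "Serious", "Moderate"):
        if level in levels:
            return level
    return "Low"

# Example: a single domain at serious risk dominates the result.
print(overall_robins_i(["Low", "Moderate", "Serious"]))  # -> Serious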
Once an overall judgement has been reached for an individual study result, this infor-
mation should be presented in the review and reflected in the analysis and conclusions.
For discussion of the presentation of risk-of-bias assessments and how they can be
incorporated into analyses, see Chapter 7. Risk-of-bias assessments also feed into
one domain of the GRADE approach for assessing certainty of a body of evidence,
as discussed in Chapter 14.
Table 25.4.a Bias domains included in the ROBINS-I tool for follow-up studies, with a summary of the
issues addressed
Bias due to confounding. Whether:
• there was potential for confounding of the effect of intervention (i.e. prognostic factors could have influenced the intervention being received);
• all important confounding domains were controlled for;
• the confounding domains were measured validly and reliably by the variables available; and
• appropriate analysis methods were used to control for the confounding.

Bias in selection of participants into the study. Whether:
• selection of participants into the study (or into the analysis) was based on participant characteristics observed after the start of intervention;
• (if applicable) these characteristics were associated with intervention and influenced by outcome (or a cause of the outcome);
• start of follow-up and start of intervention were the same; and
• (if applicable) adjustment techniques were used to correct for the presence of selection biases.

Bias in classification of interventions. Whether:
• intervention status was classified correctly for all (or nearly all) participants;
• information used to classify intervention groups was recorded at the start of the intervention; and
• classification of intervention status could have been influenced by knowledge of the outcome or risk of the outcome.

Bias due to deviations from intended interventions. When the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3), whether:
• important co-interventions were balanced across intervention groups;
• failures in implementing the intervention could have affected the outcome and were unbalanced across intervention groups;
• adherence to the assigned intervention regimen was comparable across intervention groups; and
• (if applicable) an appropriate analysis was used to estimate the effect of adhering to the intervention.

Bias due to missing data. Whether:
• the number of participants omitted from the analysis due to missing outcome data was small;
• the number of participants omitted from the analysis due to missing data on intervention status was small;
• the number of participants omitted from the analysis due to missing data on other variables needed for the analysis was small;
• (if applicable) there was evidence that the result was not biased by missing outcome data; and
• (if applicable) missingness in the outcome was likely to depend on the true value of the outcome (e.g. because of different proportions of missing outcome data, or different reasons for missing outcome data, between intervention groups).

Bias in measurement of the outcome. Whether:
• the method of measuring the outcome was inappropriate;
• measurement or ascertainment of the outcome could have differed between intervention groups;
• outcome assessors were aware of the intervention received by study participants; and
• (if applicable) assessment of the outcome could have been influenced by knowledge of intervention received, and whether this was likely.

Bias in selection of the reported result. Whether:
• the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple outcome measurements within the outcome domain;
• the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple analyses of the data; and
• the numerical result being assessed is likely to have been selected, on the basis of the results, from multiple subgroups of a larger cohort.

* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.
If participants switch between the interventions being compared during follow-up, and post-baseline prognostic factors influence the interventions to which the participants switch, then this can lead to time-varying
confounding. For example, suppose a study of patients treated for HIV partitions
follow-up time into periods during which patients were receiving different antiretroviral
regimens and compares outcomes during these periods in the analysis. Post-baseline
CD4 cell counts might influence switches between the regimens of interest. When such
post-baseline prognostic variables are affected by the interventions themselves (e.g.
antiretroviral regimen may influence post-baseline CD4 count), we say that there is
treatment-confounder feedback. This implies that conventional adjustment (e.g.
Poisson or Cox regression models) is not appropriate as a means of controlling for
time-varying confounding. Other post-baseline prognostic factors, such as adverse
effects of an intervention, may also predict switches between interventions.
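Inverse probability of treatment weighting, as used in marginal structural models, is one approach that can handle treatment-confounder feedback: person-time is re-weighted by the inverse probability of the regimen actually received given history, rather than conditioning on the post-baseline confounder. A minimal, schematic sketch for the HIV example follows (file and column names are hypothetical; a real analysis would use stabilized, possibly truncated weights and careful model checking):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per patient per follow-up interval (long format), holding the
# current regimen indicator and the time-updated CD4 count.
df = pd.read_csv("hiv_intervals.csv")

# Model the probability of being on the experimental regimen in each
# interval, given the time-varying confounder and baseline history.
X = sm.add_constant(df[["cd4_prev", "baseline_cd4"]])
p = sm.Logit(df["on_experimental"], X).fit(disp=0).predict(X)

# Unstabilized inverse probability of treatment weights: person-time on a
# regimen is up-weighted when that regimen was unlikely given history.
df["iptw"] = np.where(df["on_experimental"] == 1, 1 / p, 1 / (1 - p))

# An outcome model (e.g. weighted pooled logistic, Poisson or Cox)
# fitted with these weights targets the intervention effect free of
# treatment-confounder feedback.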
Note that a change from the baseline intervention may result in switching to an inter-
vention other than the alternative of interest in the study (i.e. from experimental inter-
vention to something other than the comparator intervention, or from comparator
intervention to something other than the experimental intervention). If follow-up time
is re-allocated to the alternative intervention in the analysis that produced the result
being assessed for risk of bias, then there is a potential for bias arising from time-
varying confounding. If follow-up time was not allocated to the alternative interven-
tion, then the potential for bias is considered either (i) under the domain ‘Bias due
to deviations from intended interventions’ if interest is in the effect of adhering to inter-
vention and the follow-up time on the subsequent intervention is included in the analysis,
or (ii) under ‘Bias due to missing data’ if the follow-up time on the subsequent interven-
tion is excluded from the analysis.
25.5 Risk of bias in uncontrolled before-after studies
This category includes interrupted time series (ITS) studies, in which the outcome trajectory that would have occurred in the absence of the intervention is extrapolated based on patterns observed before the intervention. The intervention effect is estimated
by comparing the observed outcome trajectory after intervention with the assumed tra-
jectory had there been no intervention.
The category also includes studies in which multiple individuals are each measured
before and after receiving an intervention: there may be several pre- and post-
intervention measurements. These studies might be characterized as uncontrolled, lon-
gitudinal designs (alternatively they may be referred to as repeated measures studies,
before-after studies, pre-post studies or reflexive control studies). One special case is
a study with a single pre-intervention outcome measurement and a single post-
intervention outcome measurement for each of multiple participants. Such a study will
usually be judged to be at serious or critical risk of bias because it is impossible to deter-
mine whether pre-post changes are due to the intervention rather than other factors.
The main issues addressed in a ROBINS-I evaluation of an uncontrolled before-after
study are summarized below and in Table 25.5.a. We address issues only for the effect
of assignment to intervention, since we do not expect uncontrolled before-after studies
to examine the effect of starting and adhering to the intended intervention.
• There is a possibility that extraneous events or changes in context occur around the
time at which the intervention is introduced. Bias will be introduced if these external
forces influence the outcome. This issue is addressed under the first domain of
ROBINS-I (‘Bias due to confounding’).
• There should be sufficient data to extrapolate from outcomes before the intervention
into the future. ‘Sufficient’ means enough time points, over a sufficient period of
time, to characterize trends and patterns. This issue is also addressed under ‘Bias
due to confounding’.
• ITS analyses require specification of a specific time point (the ‘interruption’) before
which there was no intervention (pre-intervention period) and after which there
has been an intervention (the post-intervention period). However, interventions do
not happen instantaneously, so this time point may be before, or after, some impor-
tant features of the intervention were implemented. The time point could be selected
to maximize the apparent effect: this issue is covered primarily in the domain ‘Bias in
classification of the intervention’ but is also relevant to ‘Bias in selection of the
reported result’, since researchers could conduct analyses with different interruption
points and report the one that maximizes the support for their hypothesis (see the
segmented regression sketch after this list).
• The interruption time point might be before important features of the intervention
have been implemented, so that there is a delay before the intervention is fully effec-
tive. Such lagging of effects should not be regarded as bias, but is rather an issue of
applicability of some of the measurement times. Lagging effects can be accommo-
dated in analyses if sufficient post-intervention measurements are available, for
example by excluding data from a phase-in period of the intervention.
• The interruption time point might be after important features of the intervention
have been implemented: for example, if anticipation of a policy change alters
people’s behaviour so that there is early impact of the intervention before its main
implementation. Such effects will attenuate differences between pre- and post-
intervention outcomes. We address this issue as a type of contamination of the
pre-intervention period by aspects of the intervention and consider it under ‘Bias
due to deviations from the intended intervention’.
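For the simplest ITS design with a single interruption point, a segmented regression allowing both a level change and a trend change at the interruption reflects several of the issues above. A minimal sketch (the file name, column names and the interruption at month 24 are all hypothetical), using ordinary least squares and ignoring autocorrelation, which a real analysis would need to address:

import pandas as pd
import statsmodels.formula.api as smf

# Monthly outcome measurements; the intervention is introduced after
# month 24 (the interruption point).
df = pd.read_csv("outcome_series.csv")
df["post"] = (df["month"] > 24).astype(int)           # level (step) change
df["months_post"] = (df["month"] - 24).clip(lower=0)  # change in trend

# Segmented regression: pre-intervention trend, step change at the
# interruption, and change in slope afterwards.
model = smf.ols("outcome ~ month + post + months_post", data=df).fit()
print(model.summary())

Re-fitting with different interruption points shows how sensitive the estimated effect can be to that choice, which is the selective-reporting concern noted in the list above.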
Table 25.5.a Bias domains included in the ROBINS-I tool for (uncontrolled) before-after studies, with a
summary of the issues addressed
Bias due to confounding. Whether:
• there were sufficient time points to enable characterization of pre-intervention trends and patterns;
• there are extraneous events or changes in context around the time of the intervention that could have influenced the outcome; and
• the study authors used an appropriate analysis method that accounts for time trends and patterns, and controls for all the important confounding domains.

Bias in selection of participants into the study. The issues are similar to those for follow-up studies. For studies that prospectively follow a specific group of units from pre-intervention to post-intervention, selection bias is unlikely. For repeated cross-sectional surveys of a population, there is the potential for selection bias even if the study is prospective.

Bias in classification of interventions. Whether specification of the distinction between pre-intervention time points and post-intervention time points could have been influenced by the outcome data.

Bias due to deviations from intended interventions. Assuming the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3): whether contamination of the pre-intervention period by early impacts of the intervention (e.g. anticipation of a policy change) was accounted for.

Bias due to missing data. Whether outcome data were missing for whole clusters (units of multiple individuals) as well as for individual participants.

Bias in measurement of the outcome. Whether:
• methods of outcome assessment were comparable before and after the intervention; and
• there were changes in systematic errors in measurement of the outcome coincident with implementation of the intervention.

Bias in selection of the reported result. The issues are the same as for follow-up studies.

* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.
• The intervention might cause attrition from the framework or system used to meas-
ure outcomes. This is a bias due to selection out of the study, and is addressed in the
domain ‘Bias due to missing data’.
25.6 Risk of bias in controlled before-after studies
• The occurrence of extraneous events around the time of intervention may differ
between the intervention and comparator groups. This is addressed under ‘Bias
due to confounding’.
• Trends and patterns of the outcome over time may differ between the intervention
and comparator groups. The plausibility of this threat to validity can be assessed if
more than one pre-intervention measurement of the outcome is available: the more
measurements, the better the pre-intervention trends can be modelled and com-
pared between groups. This issue is also addressed under ‘Bias due to confounding’.
• If the definition of the intervention and comparator groups depends on pre-
intervention outcome measurements (e.g. if individuals with high values are selected
for intervention and those with low values for the comparator), regression to the
mean may be confused with a treatment effect. The plausibility of this threat can
be assessed by having more than one pre-intervention measurement. This is
addressed under ‘Bias due to confounding’.
Table 25.6.a Bias domains included in the ROBINS-I tool for controlled before-after studies, with a
summary of the issues addressed
Bias due to confounding. Whether:
• there were sufficient time points to characterize pre-intervention trends and patterns;
• any extraneous events or changes in context around the time of the intervention that could have influenced the outcome were experienced equally by both intervention groups; and
• pre-intervention trends and patterns in outcomes were analysed appropriately and found to be similar across the intervention and comparator groups.

Bias in selection of participants into the study. The issues are similar to those for follow-up studies. For repeated cross-sectional surveys of a population, there is the potential for selection bias if changes in the types of participants/units included in repeated surveys differ between intervention and comparator groups.

Bias in classification of interventions. Whether classification of time points as before versus after intervention could have been influenced by post-intervention outcome data.

Bias due to deviations from intended interventions. Assuming the review authors’ interest is in the effect of assignment to intervention (see Section 25.3.3): the issues are the same as for follow-up studies.

Bias due to missing data. Whether outcome data were missing for whole clusters as well as for individual participants.

Bias in measurement of the outcome. Whether:
• methods of outcome assessment were comparable across intervention groups and before and after the intervention; and
• there were changes in systematic errors in measurement of the outcome coincident with implementation of the intervention.

Bias in selection of the reported result. The issues are the same as for follow-up studies.

* For the precise wording of signalling questions and guidance for answering each one, see the full ROBINS-I tool at www.riskofbias.info.
• There is a risk of selection bias in repeated cross-sectional surveys if the types of par-
ticipants/units included in repeated surveys changes over time, and such changes
differ between intervention and comparator groups. Changes might occur contem-
poraneously with the intervention if it causes (or requires) attrition from the meas-
urement framework. These issues are addressed under ‘Bias due to selection of
participants into the study’ and ‘Bias due to missing data’.
• Outcome measurement methods might change between pre- and post-intervention
periods. This issue may complicate analyses if it occurs in the intervention and
comparator groups at the same time but is a threat to validity if it differs between
them. This is addressed under ‘Bias due to measurement of the outcome’.
• Poor specification of the time point before which there was no intervention and after
which there has been an intervention may introduce bias. This is addressed under
‘Bias in classification of interventions’.
25.8 References
Eccles M, Grimshaw J, Campbell M, Ramsay C. Research designs for studies evaluating the
effectiveness of change and improvement strategies. Quality and Safety in Health Care
2003; 12: 47–52.
Hernán MA, Hernandez-Diaz S, Werler MM, Mitchell AA. Causal knowledge as a prerequisite
for confounding evaluation: an application to birth defects epidemiology. American
Journal of Epidemiology 2002; 155: 176–184.
Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is
not available. American Journal of Epidemiology 2016; 183: 758–764.
Institute of Medicine. Ethical and Scientific Issues in Studying the Safety of Approved Drugs.
Washington (DC): The National Academies Press; 2012.
Kontopantelis E, Doran T, Springate DA, Buchan I, Reeves D. Regression based quasi-
experimental approach when randomisation is not an option: interrupted time series
analysis. BMJ 2015; 350: h2750.
Lopez Bernal J, Cummins S, Gasparrini A. The use of controls in interrupted time series
studies of public health interventions. International Journal of Epidemiology 2018; 47:
2082–2093.
Polus S, Pieper D, Burns J, Fretheim A, Ramsay C, Higgins JPT, Mathes T, Pfadenhauer LM,
Rehfuess EA. Heterogeneity in application, design, and analysis characteristics was found
for controlled before-after and interrupted time series studies included in Cochrane
reviews. Journal of Clinical Epidemiology 2017; 91: 56–69.
Schünemann HJ, Tugwell P, Reeves BC, Akl EA, Santesso N, Spencer FA, Shea B, Wells G,
Helfand M. Non-randomized studies as a source of complementary, sequential or
replacement evidence for randomized controlled trials in systematic reviews on the
effects of interventions. Research Synthesis Methods 2013; 4: 49–62.
Sterne JAC, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman
DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A,
Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L,
Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC,
Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing
risk of bias in non-randomized studies of interventions. BMJ 2016; 355: i4919.
Velie EM, Shaw GM. Impact of prenatal diagnosis and elective termination on prevalence
and risk estimates of neural tube defects in California, 1989–1991. American Journal of
Epidemiology 1996; 144: 473–479.
26
Individual participant data
Jayne F Tierney, Lesley A Stewart, Mike Clarke; on behalf of the Cochrane Individual
Participant Data Meta-analysis Methods Group
KEY POINTS
• Individual participant data (IPD) reviews are a specific type of systematic review that involve the collection, checking and re-analysis of the original data for each participant in each study. Data may be obtained either from study investigators or via data-sharing repositories or platforms.
• IPD reviews should be considered when the available published or other aggregate data do not permit a good quality review, or are insufficient for a thorough analysis. In certain situations, aggregate data synthesis might be an appropriate first step.
• The IPD approach can bring substantial improvements to the quality of data available and offset inadequate reporting of individual studies. Risk of bias can be assessed more thoroughly and IPD enables more detailed and flexible analysis than is possible in systematic reviews of aggregate data.
• Access to IPD offers scope to analyse data and report results in many different ways, so analytical methods should be pre-specified in detail and reporting should follow the PRISMA-IPD guideline.
• Most commonly, IPD reviews are carried out by a collaborative group, comprising a project management team, the researchers who contribute their study data, and an advisory group.
• An IPD review usually takes longer and costs more than a conventional systematic review of the same question, and requires a range of skills to obtain, manage and analyse data. Thus, they are difficult to do without dedicated time and funding.
26.1 Introduction
26.1.1 What is an IPD review?
Systematic reviews incorporating individual participant data (IPD) include the original
data from each eligible study. The IPD will usually contain de-identified demographic
This chapter should be cited as: Tierney JF, Stewart LA, Clarke M. Chapter 26: Individual participant data.
In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook
for Systematic Reviews of Interventions. 2nd Edition. Chichester (UK): John Wiley & Sons, 2019: 643–658.
© 2019 The Cochrane Collaboration. Published 2019 by John Wiley & Sons Ltd.
information for each participant such as age, sex, nature of their health condition, as well
as information about treatments or tests received and outcomes observed (Stewart et al
1995, Stewart and Tierney 2002). These data can then be checked and analysed centrally
and, if appropriate, combined in meta-analyses (Stewart et al 1995, Stewart and Tierney
2002). Most commonly, IPD are sought directly from the study investigators, but access
through data-sharing platforms and data repositories may increase in the coming years.
Advantages of an IPD approach are summarized in Table 26.1.a.
Table 26.1.a Advantages of the IPD approach to systematic review and meta-analysis. Adapted from Tierney et al (2015a). (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001855, licensed under CC BY 4.0.)

Study inclusion:
Asking the IPD collaborative group (of study investigators and other experts in the clinical field) to supplement the list of identified studies.*
Clarify study eligibility with trial investigators.*

Data quality:
Include studies that are unpublished or not reported in full.
Include unreported data (e.g. more outcomes per study, more complete information on those outcomes, and data on participants excluded from study analyses).
Check the integrity of study IPD and resolve any queries with investigators.
Derive standardized outcome definitions across studies or translate different definitions to a common scale.
Derive standardized classifications of participant characteristics or their disease/condition, or translate different definitions to a common scale.
Update follow-up of time-to-event or other outcomes beyond that reported.

Risk of bias:
Clarify study design, conduct and analysis methods with trial investigators.*
Check risk of bias of study IPD and obtain extra data where necessary.

Analysis:
Analyse all important outcomes.
Determine validity of analysis assumptions with IPD (e.g. proportionality of hazards for a Cox model).
Derive measures of effect directly from the IPD.
Use a consistent unit of analysis for each study.
Apply a consistent method of analysis for each study.
Conduct more detailed analysis of time-to-event outcomes (e.g. generating Kaplan-Meier curves).
Achieve greater power for assessing interactions between effects of interventions and participant or disease/condition characteristics.
Conduct more complex analyses not (usually) possible with aggregate data (e.g. simultaneous assessment of the relationship between multiple study and/or participant characteristics and effects of interventions).
Use non-standard models or measures of effect.
Account for missing data at the patient level (e.g. using multiple imputation).
Use IPD to address secondary clinical questions (e.g. to explore the natural history of disease, prognostic factors or surrogate outcomes).

Interpretation:
Discuss implications for clinical practice and research with a multidisciplinary group of collaborators including study investigators who supplied data.

* These may also be done for non-IPD reviews.
Compared with aggregate data, the collection of IPD can bring about substantial improvements to
the quantity and quality of data, for example, through the inclusion of more trials, par-
ticipants and outcomes (Debray et al 2015a, Tierney et al 2015a). A Cochrane Method-
ology Review of empirical research shows some of these advantages (Tudur Smith et al
2016). IPD also affords greater scope and flexibility in the analyses, including the ability
to investigate how participant-level covariates such as age or severity of disease might
alter the impact of the treatment, exposure or test under investigation (Debray et al
2015a, Debray et al 2015b, Tierney et al 2015a). With such better-quality data and anal-
ysis, IPD reviews can help to provide in-depth explorations and robust meta-analysis
results, which may differ from those based on aggregate data (Tudur Smith et al
2016). Not surprisingly then, IPD reviews have had a substantial impact on clinical prac-
tice and research, but could be better used to inform treatment guidelines (Vale et al
2015), and new studies (Tierney et al 2015b). However, IPD reviews can take longer than
other reviews; those evaluating the effects of therapeutic interventions typically taking
at least two years to complete. Also, they usually require a skilled team with dedicated
time and specific funding.
This chapter provides an overview of the IPD approach to systematic reviews, to help
authors decide whether collecting IPD might be useful and feasible for their review. As
most IPD reviews have assessed the efficacy of interventions, and have been based on
randomized trials, this is the focus of the chapter. However, the approach also offers par-
ticular advantages for the synthesis of diagnostic and prognostic studies (Debray et al
2015a) and many of the principles described will apply to these sorts of synthesis.
The chapter does not provide detailed guidance on practical or statistical methods,
which are summarized elsewhere (Stewart et al 1995, Stewart and Tierney 2002, Debray
et al 2015b, Tierney et al 2015a). Therefore, anyone contemplating carrying out their first
IPD meta-analysis as part of a Cochrane Review should seek appropriate advice and
guidance from experienced researchers through the IPD Meta-analysis Methods Group.
26.2.2 Obtaining data from sources other than the original researchers
A number of initiatives are helping to increase the availability of IPD from both
academic and industry-led studies, through generic data-sharing platforms such as
Yale Open Data, Clinical Study Data Request, DataSphere or Vivli. These have been in
response to calls from federal agencies (e.g. NIH), funders (e.g. MRC), journal editors,
the AllTrials campaign and Cochrane to make results and IPD from clinical studies more
readily available.
As the focus of these efforts is to make the data from individual studies available,
formatting and coding are not necessarily standard or consistent across the different
study datasets. Some platforms offer fully unrestricted access to IPD and others mod-
erated access, with release subject to approval of a project proposal. Also, while some
sources allow transfer of IPD directly to the research team conducting the review,
others limit the use of IPD to within a secure area within a platform. Therefore, for
any given review, the availability of study IPD from these platforms may be patchy,
the modes of access variable, and the usual process of re-formatting and re-coding
data in a consistent way will likely be required. Thus, although promising, as yet they
do not provide a viable alternative to the traditional collaborative IPD approach. As the
culture of data sharing gathers pace, the increased availability and accessibility of IPD
should benefit the production of IPD reviews.
Because studies may not have collected every variable of interest, before embarking on data collection it is worthwhile checking the study protocols and/
or with the original researchers to determine which data are actually available. In many
cases it will only be necessary to collect outcomes and participant characteristics as
defined in the individual studies. However, additional variables might be required to
provide greater granularity (e.g. subscales in quality of life instruments), or to allow
outcomes or other variables to be defined in a consistent way for each study. For exam-
ple, to redefine pre-eclampsia according to a common definition, data on systolic and
diastolic blood pressure and proteinuria are needed (Askie et al 2007).
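A sketch of such a re-derivation for the pre-eclampsia example, with hypothetical column names and thresholds shown purely for illustration (the review protocol would fix the actual common definition):

import pandas as pd

ipd = pd.read_csv("trial_ipd.csv")

# Re-derive pre-eclampsia using one common definition across trials:
# hypertension plus significant proteinuria (illustrative thresholds).
hypertension = (ipd["sbp"] >= 140) | (ipd["dbp"] >= 90)
ipd["pre_eclampsia_std"] = (hypertension & (ipd["proteinuria_g_24h"] >= 0.3)).astype(int)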
IPD provides the most practical way to synthesize data for time-to-event outcomes,
such as time to recovery, time free of seizures, or time to death. Therefore, it is impor-
tant to collect data on whether an event (e.g. death) has happened, the date of the
event (e.g. date of death) and the date of last follow-up for those not experiencing
an event. As a bare minimum, whether an event happened and the time that each indi-
vidual spent ‘event-free’ may suffice. IPD also allows follow-up to be updated, some-
times substantially, beyond the point of publication (Stewart et al 1995, Stewart and
Tierney 2002), which has been particularly important in evaluating the long-term
effects of therapies in the cancer field (Pan et al 2017).
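As a sketch of how the analysis variables for a time-to-event outcome can be derived from the minimal items described above (file and column names are hypothetical):

import pandas as pd

# Per-participant IPD: date of randomization, date of the event
# (missing if no event occurred) and date of last follow-up.
ipd = pd.read_csv("trial_ipd.csv",
                  parse_dates=["rand_date", "event_date", "last_fu_date"])

ipd["event"] = ipd["event_date"].notna().astype(int)
# Event-free time runs to the event if it occurred, otherwise to last
# follow-up (censoring).
end = ipd["event_date"].fillna(ipd["last_fu_date"])
ipd["time_days"] = (end - ipd["rand_date"]).dt.days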
It is usually easiest to accept data in whichever format is most convenient to those supplying it, and recode it as necessary. A copy of the data, as
supplied, should be archived before carrying out conversions or modifications to the
data, and it is vital that any alterations made are properly logged.
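Archiving and logging need not be elaborate. The following sketch (paths and the log format are hypothetical) keeps an untouched copy of each supplied file, records a checksum so any later modification is detectable, and appends each recoding step to a log:

import hashlib
import shutil
from pathlib import Path

supplied = Path("incoming/trial_07.csv")   # data exactly as supplied
archive = Path("archive") / supplied.name
archive.parent.mkdir(exist_ok=True)
shutil.copy2(supplied, archive)            # untouched archival copy

digest = hashlib.sha256(archive.read_bytes()).hexdigest()
with open("archive/log.txt", "a") as log:
    log.write(f"{supplied.name}\tsha256={digest}\t"
              "recoded outcome labels to common scheme\n")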
Collecting IPD for all randomized participants enables scrutiny of the reasons for their exclusion from study analyses (Tierney and Stewart 2005), and allows an intention-
to-treat analysis of all randomized participants, avoiding the potential bias of a per-
protocol analysis.
Re-analysis of the IPD need not be constrained by the original analyses. For example, it should be possible to carry out analyses according to intention-to-treat principles, even if the original/published trial analyses did not; to use more appropriate effect measures; and to perform sophisticated analyses to account for missing data.
As IPD offers the potential to analyse data in many different ways, it is particularly
important that all methods relating to analysis are pre-specified in detail in the review
protocol or analysis plan (Tierney et al 2015a) and are clearly reported in publications
(Stewart et al 2015). This should include: outcomes and their definitions; methods for
checking IPD and assessing risk of bias of included studies; methods for evaluating
treatment effects, risks or test accuracy (including those for exploring variations by
trial or patient characteristics) and methods for quantifying and accounting for heter-
ogeneity. Unplanned analyses can still play an important role in explaining or adding to
the results, but such exploratory analyses should be justified and clearly reported
as such.
Statistical methods for the analysis of IPD can be complex and are described in more
detail elsewhere (Debray et al 2015b). These methods are less well developed for prog-
nostic or diagnostic test accuracy reviews than for interventions reviews based on
randomized trials, so we outline some key principles for the re-analysis of IPD from
randomized trials.
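One such principle is the two-stage approach: each trial's IPD is analysed separately using the same model, preserving randomization within trials, and the resulting estimates are then combined using standard meta-analysis methods. A minimal fixed-effect sketch (the data layout is hypothetical; the per-trial model is simplified to a logistic regression of a binary outcome on treatment):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Columns (hypothetical): trial, treat (0/1), outcome (0/1).
ipd = pd.read_csv("all_trials_ipd.csv")

estimates, variances = [], []
for _, trial in ipd.groupby("trial"):
    # Stage 1: identical analysis within each randomized trial.
    X = sm.add_constant(trial["treat"])
    fit = sm.Logit(trial["outcome"], X).fit(disp=0)
    estimates.append(fit.params["treat"])
    variances.append(fit.bse["treat"] ** 2)

# Stage 2: inverse-variance pooling of the per-trial log odds ratios.
w = 1 / np.array(variances)
pooled_log_or = np.sum(w * np.array(estimates)) / np.sum(w)
print("pooled odds ratio:", np.exp(pooled_log_or))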
Funding: JFT and coordination of the IPD Meta-analysis Methods Group are funded by
the UK Medical Research Council (MC_UU_12023/24); Lesley A Stewart is funded by the
University of York and Mike Clarke is funded by Queen’s University Belfast.
26.8 References
Abo-Zaid G, Guo B, Deeks JJ, Debray TP, Steyerberg EW, Moons KG, Riley RD. Individual
participant data meta-analyses should not ignore clustering. Journal of Clinical
Epidemiology 2013; 66: 865–873 e864.
Ahmed I, Sutton AJ, Riley RD. Assessment of publication bias, selection bias, and
unavailable data in meta-analyses using individual participant data: a database survey.
BMJ 2012; 344: d7762.
Askie LM, Duley L, Henderson-Smart D, Stewart LA, on behalf of the PARIS Collaborative
Group. Antiplatelet agents for prevention of pre-eclampsia: a meta-analysis of individual
patient data. Lancet 2007; 369: 1791–1798.
Bowden J, Tierney JF, Simmonds M, Copas AJ. Individual patient data meta-analysis
of time-to-event outcomes: one-stage versus two-stage approaches for estimating
the hazard ratio under a random effects model. Research Synthesis Methods 2011; 2:
150–162.
Burdett S, Stewart LA. A comparison of the results of checked versus unchecked individual
patient data meta-analyses. International Journal of Technology Assessment in Health
Care 2002; 18: 619–624.
Burke DL, Ensor J, Riley RD. Meta-analysis using individual participant data: one-stage and
two-stage approaches, and why they may differ. Statistics in Medicine 2017; 36: 855–875.
Clarke M, Halsey J. DICE 2: a further investigation of the effects of chance in life, death and
subgroup analyses. International Journal of Clinical Practice 2001; 55: 240–242.
Clarke M, Stewart L, Pignon JP, Bijnens L. Individual patient data meta-analyses in cancer.
British Journal of Cancer 1998; 77: 2036–2044.
Debray TP, Riley RD, Rovers MM, Reitsma JB, Moons KG, Cochrane IPD Meta-analysis
Methods group. Individual participant data (IPD) meta-analyses of diagnostic and
prognostic modeling studies: guidance on their use. PLoS Medicine 2015a; 12: e1001886.
Debray TP, Moons KG, van Valkenhoef G, Efthimiou O, Hummel N, Groenwold RH, Reitsma
JB, GetReal Methods Review Group. Get real in individual participant data (IPD) meta-
analysis: a review of the methodology. Research Synthesis Methods 2015b; 6: 293–309.
Deeks JJ, Higgins JPT, Altman DG. Chapter 9: Analysing data and undertaking meta-
analyses. In: Higgins JPT, Green S, editors. Cochrane Handbook for Systematic Reviews of
Interventions Version 5.1.0: The Cochrane Collaboration; 2011.
Dwan K, Altman DG, Cresswell L, Blundell M, Gamble CL, Williamson PR. Comparison of
protocols and registry entries to published reports for randomised controlled trials.
Cochrane Database of Systematic Reviews 2011; 1: MR000031.
Ensor J, Burke DL, Snell KIE, Hemming K, Riley RD. Simulation-based power calculations for
planning a two-stage individual participant data meta-analysis. BMC Medical Research
Methodology 2018; 18.
Fisher DJ, Copas AJ, Tierney JF, Parmar MKB. A critical review of methods for the
assessment of patient-level interactions in individual patient data (IPD) meta-analysis of
randomised trials, and guidance for practitioners. Journal of Clinical Epidemiology 2011;
64: 949–967.
Fisher DJ. Two-stage individual participant data meta-analysis and generalized forest plots.
Stata Journal 2015; 15: 369–396.
Fisher DJ, Carpenter JR, Morris TP, Freeman SC, Tierney JF. Meta-analytical methods to
identify who benefits most from treatments: daft, deluded, or deft approach? BMJ 2017;
356: j573.
Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, Savović J, Schulz KF,
Weeks L, Sterne J, Cochrane Bias Methods Group, Cochrane Statistical Methods Group.
The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ
2011; 343: d5928.
Kirkham JJ, Dwan KM, Altman DG, Gamble C, Dodd S, Smyth R, Williamson PR. The impact of
outcome reporting bias in randomised controlled trials on a cohort of systematic reviews.
BMJ 2010; 340: c365.
Mhaskar R, Djulbegovic B, Magazin A, Soares HP, Kumar A. Published methodological
quality of randomized controlled trials does not reflect the actual quality assessed in
protocols. Journal of Clinical Epidemiology 2012; 65: 602–609.
Moher D, Liberati A, Tetzlaff J, Altman D, PRISMA Group. Preferred reporting items for
systematic reviews and meta-analyses: the PRISMA Statement. PLoS Medicine 2009; 6:
e1000097. doi:10.1371/journal.pmed.1000097.
Morris TP, Fisher DJ, Kenward MG, Carpenter JR. Meta-analysis of Gaussian individual
patient data: two-stage or not two-stage? Statistics in Medicine 2018; 37: 1419–1438.
Nevitt SJ, Sudell M, Weston J, Tudur Smith C, Marson AG. Antiepileptic drug monotherapy
for epilepsy: a network meta-analysis of individual participant data. Cochrane Database
of Systematic Reviews 2017; 12: CD011412.
Pan H, Gray R, Braybrooke J, Davies C, Taylor C, McGale P, Peto R, Pritchard KI, Bergh J,
Dowsett M, Hayes DF, EBCTCG. 20-Year risks of breast-cancer recurrence after
stopping endocrine therapy at 5 years. New England Journal of Medicine 2017; 377:
1836–1846.
Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, Boutitie F. Meta-analysis of
continuous outcomes combining individual patient data and aggregate data. Statistics in
Medicine 2008; 27: 1870–1893.
Riley RD, Lambert PC, Abo-Zaid G. Meta-analysis of individual participant data: rationale,
conduct, and reporting. BMJ 2010; 340: c221.
Sarcoma Meta-analysis Collaboration. Adjuvant chemotherapy for localised resectable soft
tissue sarcoma in adults: meta-analysis of individual patient data. Lancet 1997; 350:
1647–1654.
Sargent DJ, Patiyil S, Yothers G, Haller DG, Gray R, Benedetti J, Buyse M, Labianca R, Seitz JF,
O’Callaghan CJ, Francini G, Grothey A, O’Connell M, Catalano PJ, Kerr D, Green E, Wieand
HS, Goldberg RM, de Gramont A, ACCENT Group. End points for colon cancer adjuvant
trials: observations and recommendations based on individual patient data from 20,898
patients enrolled onto 18 randomized trials from the ACCENT Group. Journal of Clinical
Oncology 2007; 25: 4569–4574.
Simmonds M, Stewart G, Stewart L. A decade of individual participant data meta-analyses: a
review of current practice. Contemporary Clinical Trials 2015; 45: 76–83.
Simmonds MC, Higgins JPT, Stewart LA, Tierney JF, Clarke MJ, Thompson SG. Meta-analysis
of individual patient data from randomised trials: a review of methods used in practice.
Clinical Trials 2005; 2: 209–217.
Sinicrope FA, Foster NR, Yothers G, Benson A, Seitz JF, Labianca R, Goldberg RM, Degramont
A, O’Connell MJ, Sargent DJ, Adjuvant Colon Cancer Endpoints Group. Body mass index at
diagnosis and survival among colon cancer patients enrolled in clinical trials of adjuvant
chemotherapy. Cancer 2013; 119: 1528–1536.
Sterne JA, Hernan MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, Henry D, Altman
DG, Ansari MT, Boutron I, Carpenter JR, Chan AW, Churchill R, Deeks JJ, Hróbjartsson A,
Kirkham J, Jüni P, Loke YK, Pigott TD, Ramsay CR, Regidor D, Rothstein HR, Sandhu L,
Santaguida PL, Schünemann HJ, Shea B, Shrier I, Tugwell P, Turner L, Valentine JC,
Waddington H, Waters E, Wells GA, Whiting PF, Higgins JPT. ROBINS-I: a tool for assessing
risk of bias in non-randomised studies of interventions. BMJ 2016; 355: i4919.
Stewart GB, Altman DG, Askie LM, Duley L, Simmonds MC, Stewart LA. Statistical analysis of
individual participant data meta-analyses: a comparison of methods and
recommendations for practice. PloS One 2012; 7: e46042.
Stewart L, Tierney J, Burdett S. Do systematic reviews based on individual patient data offer
a means of circumventing biases associated with trial publications? In: Rothstein H,
Sutton A, Borenstein M, editors. Publication Bias in Meta-Analysis: Prevention, Assessment
and Adjustments. Chichester: John Wiley & Sons; 2005. p. 261–286.
Stewart LA, Clarke MJ, on behalf of the Cochrane Working Party Group on Meta-analysis
using Individual Patient Data. Practical methodology of meta-analyses (overviews) using
updated individual patient data. Statistics in Medicine 1995; 14: 2057–2079.
Stewart LA, Tierney JF. To IPD or Not to IPD? Advantages and disadvantages of systematic
reviews using individual patient data. Evaluation and the Health Professions 2002; 25:
76–97.
Stewart LA, Clarke M, Rovers M, Riley RD, Simmonds M, Stewart G, Tierney JF, PRISMA-IPD
Development Group. Preferred reporting items for a systematic review and meta-analysis
of individual participant data: the PRISMA-IPD statement. JAMA 2015; 313: 1657–1665.
Tierney JF, Stewart LA. Investigating patient exclusion bias in meta-analysis. International
Journal of Epidemiology 2005; 34: 79–87.
Tierney JF, Vale CL, Riley R, Tudur Smith C, Stewart LA, Clarke M, Rovers M. Individual
participant data (IPD) meta-analyses of randomised controlled trials: guidance on their
use. PLoS Medicine 2015a; 12: e1001855.
Tierney JF, Pignon J-P, Gueyffier F, Clarke M, Askie L, Vale CL, Burdett S. How individual
participant data meta-analyses can influence trial design and conduct. Journal of Clinical
Epidemiology 2015b; 68: 1325–1335.
Tudur Smith C, Clarke M, Marson T, Riley R, Stewart L, Tierney J, Vail A, Williamson P. A
framework for deciding if individual participant data are likely to be worthwhile (oral
session). 23rd Cochrane Colloquium; 2015; Vienna, Austria. http://2015.colloquium.cochrane.org/abstracts/framework-deciding-if-individual-participant-data-are-likely-be-worthwhile.
Tudur Smith C, Marcucci M, Nolan SJ, Iorio A, Sudell M, Riley R, Rovers MM, Williamson PR.
Individual participant data meta-analyses compared with meta-analyses based on
aggregate data. Cochrane Database of Systematic Reviews 2016; 9: MR000007.
Vale CL, Tierney JF, Burdett S. Can trial quality be reliably assessed from published reports
of cancer trials: evaluation of risk of bias assessments in systematic reviews. BMJ 2013;
346: f1798.
Vale CL, Rydzewska LHM, Rovers MM, Emberson JR, Gueyffier F, Stewart LA. Uptake of
systematic reviews and meta-analyses based on individual participant data in clinical
practice guidelines: descriptive study. BMJ 2015; 350: h1088.
Veroniki AA, Straus SE, Ashoor H, Stewart LA, Clarke M, Tricco AC. Contacting authors to
retrieve individual patient data: study protocol for a randomized controlled trial. Trials
2016; 17: 138.
Veroniki AA, Rios P, Le S, Mavridis D, Stewart L, Clarke M, Ashoor H, Straus S, Tricco A.
Obtaining individual patient data depends on study characteristics and can take longer
than a year after a positive response. Journal of Clinical Epidemiology 2019; in press.