
Steven Piantadosi
Curtis L. Meinert
Editors

Principles and Practice of Clinical Trials

With 241 Figures and 191 Tables


Editors

Steven Piantadosi
Department of Surgery
Division of Surgical Oncology
Brigham and Women’s Hospital
Harvard Medical School
Boston, MA, USA

Curtis L. Meinert
Department of Epidemiology
School of Public Health
Johns Hopkins University
Baltimore, MD, USA

ISBN 978-3-319-52635-5
ISBN 978-3-319-52636-2 (eBook)
https://doi.org/10.1007/978-3-319-52636-2
© Springer Nature Switzerland AG 2022
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims
in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
In memory of
Lulu, Champ, and Dudley
A Foreword to the Principles and Practice of Clinical Trials

Trying to identify the effects of treatments is not new. The Book of Daniel (verses
12–15) describes a test of the effects of King Nebuchadnezzar’s meat:

Prove thy servants, I beseech thee, ten days; and let them give us pulse to eat, and water to
drink. Then let our countenances be looked upon before thee, and the countenance of the
children that eat of the portion of the King’s meat: and as thou seest, deal with thy servants.
So he consented to them in this matter, and proved them ten days. And at the end of ten days
their countenances appeared fairer and fatter in flesh than all the children which did eat the
portion of the King’s meat.

The requirement of comparison in identifying treatment effects was recognized in
the tenth century by the Persian physician Abu Bakr Muhammad ibn Zakariya al-Razi:

When the dullness (thiqal) and the pain in the head and neck continue for three and four and
five days or more, and the vision shuns light, and watering of the eyes is abundant, yawning
and stretching are great, insomnia is severe, and extreme exhaustion occurs, then the patient
after that will progress to meningitis (sirsâm). . . If the dullness in the head is greater than the
pain, and there is no insomnia, but rather sleep, then the fever will abate, but the throbbing
will be immense but not frequent and he will progress into a stupor (lîthûrghas). So when you
see these symptoms, then proceed with bloodletting. For I once saved one group [of patients]
by it, while I intentionally neglected [to bleed] another group. By doing that, I wished to
reach a conclusion (ra’y). And so all of these [latter] contracted meningitis. (Tibi 2006)

But it was not until the beginning of the eighteenth century that the importance
of treatment comparisons was broadly acknowledged, for example, in comparing the
chances of contracting smallpox among people inoculated with smallpox lymph versus
those who caught smallpox naturally (Bird 2018).
By the middle of the eighteenth century there were examples of tests with
comparison groups, for example, as described by James Lind in relation to his
scurvy experiment on board the HMS Salisbury at sea:

On the 20th of May 1747, I took twelve patients in the scurvy, on board the Salisbury at sea.
Their cases were as similar as I could have them. They all in general had putrid gums, the
spots and lassitude, with weakness of their knees. They lay together in one place, being a
proper apartment for the sick in the fore-hold; and had one diet common to all, viz.,
watergruel sweetened with sugar in the morning; fresh mutton-broth often times for dinner;
at other times puddings, boiled biscuit with sugar, etc; and for supper, barley and raisins,
rice and currants, sago and wine, or the like. Two of these were ordered each a quart of
cyder a-day. Two others took twenty-five gutts of elixir vitriol three times a day, upon an
empty stomach; using a gargle strongly acidulated with it for their mouths. Two others took
two spoonfuls of vinegar three times a day, upon an empty stomach; having their gruels and
their other food well acidulated with it, as also the gargle for their mouth. Two of the worst
patients, with the tendons in the ham rigid, (a symptom none of the rest had), were put under
a course of seawater. Of this they drank half a pint every day, and sometimes more or less as
it operated, by way of gentle physic. Two others had each two oranges and one lemon given
them every day. These they eat with greediness, at different times, upon an empty stomach.
They continued but six days under this course, having consumed the quantity that could be
spared. The two remaining patients, took the bigness of a nutmeg three times a-day, of an
electuary recommended by an hospital surgeon, made of garlic, mustard-seed, rad raphan,
balsam of Peru, and gum myrrh; using for common drink, barley-water well acidulated with
tamarinds; by a decoction of which, with the addition of cremor tartar, they were gently
purged three or four times during the course.
***
The consequence was, that the most sudden and visible good effects were perceived from
the use of the oranges and lemons; one of those who had taken them, being at the end of six
days fit for duty. (Lind 1753)

Lind did not make clear how his 12 sailors were assigned to the treatments in his
experiment. During the late nineteenth and early twentieth centuries, alternation (and
sometimes randomization) came to be used to create study comparison groups that
differed only by chance (Chalmers et al. 2011).
In 1937 treatment assignment was discussed in Hill’s book, Principles of Medical
Statistics, in which he emphasized the importance of strictly observing the allocation
schedule. Implementation of this principle was reflected in concealment of allocation
schedules in two important clinical trials designed for the UK Medical Research Council
in the 1940s (Medical Research Council 1944, 1948). Sir Austin Bradford Hill’s
1937 book went into 12 editions, and his other writings, such as Statistical Methods
in Clinical and Preventive Medicine, helped propel methodological progress.
In 1962 the United States Congress passed the Kefauver-Harris Amendments to the
Food, Drug, and Cosmetic Act of 1938. The amendments revolutionized drug
development by requiring drug manufacturers to prove that a drug was safe and
effective. A feature of the amendments was language spelling out the nature of
scientific evidence required for a drug to be approved:

The term “substantial evidence” means evidence consisting of adequate and well-controlled
investigations, including clinical investigations, by experts qualified by scientific training
and experience to evaluate the effectiveness of the drug involved, on the basis of which it
could fairly and responsibly be concluded by such experts that the drug will have the effect it
purports or is represented to have under the conditions of its use prescribed, recommended,
or suggested in the labeling or proposed labeling thereof. (United States Congress 1962)

Post World War II prosperity brought sizeable increases in government funding
for training and research. The National Institutes of Health played a major role in
training biostatisticians in the 1960s and 1970s with its fellowship programs. By the
1980s clinical trial courses started showing up in syllabi of academic institutions. By
the 1990s academic institutions started offering PhDs focused on design and conduct
of trials, with a few now offering PhD training in clinical trials.
The clinical trial enterprise is huge. There were over 25,000 trials starting in 2019
registered on CT.gov. Assuming CT.gov registrations account for 70% of all
registered trials, that translates to some 38,000 trials. Assuming a median sample
size of 60 per trial, those trials will have studied about 2.3 million people
when finished.
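
For readers who want to check the arithmetic, a back-of-envelope sketch follows (Python; the registration count, the 70% share, and the median sample size are the assumptions quoted above, not measured quantities):

registered_on_ctgov = 25_000   # trials starting in 2019 registered on CT.gov ("over 25,000")
ctgov_share = 0.70             # assumed fraction of all registered trials captured by CT.gov
median_sample_size = 60        # assumed median number of participants per trial

total_trials = registered_on_ctgov / ctgov_share
# ~35,700 with exactly 25,000; the "over 25,000" count yields the ~38,000 quoted above

people_studied = total_trials * median_sample_size
# ~2.1-2.3 million participants across all trials

print(f"Estimated trials: {total_trials:,.0f}")
print(f"Estimated participants: {people_studied:,.0f}")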
Lind did his trial before IRBs and consents, before requirements for written
protocols, before investigator certifications, before the Health Insurance Portability
and Accountability Act (HIPAA), before data sharing, before data monitoring
committees, before site visiting, and before requirements for posting results within
1 year of completion. Trials have moved from the backroom of obscurity to front and
center and are now seen as forms of public trust.
The act of trying progressed from efforts involving a single investigator to efforts
involving cadres of investigators with training in medicine, biostatistics, epidemiology,
programming, data processing, and the regulations and ethics underlying trials.
The size of the research team increases with the size and complexity of the trial.
Multicenter trials may involve investigatorships numbering in the hundreds.
Enter trialists – persons with training and experience in the design, organization,
conduct, and analysis of trials. Presently trialists are scattered across various
departments in medical schools and schools of public health. They have no
academic home.
The scattering works to the disadvantage of the art and science of trials in that it
stymies communication and the development of curricula relevant to trials. One of our
motivations in undertaking this work is the hope of speeding the development of such
homes.
The blessing of online publication is that works can be updated at will. The curse
is that the work is never done. We hope to advance the science of trials by providing
the trials world with a comprehensive work from leaders in the field, covering the
waterfront of clinical trials and serving as a reference resource for novices and
experts alike in designing, conducting, and analyzing trials.

13 May 2020 Steven Piantadosi and Curtis L. Meinert


Editors

Postscript
When we started this effort, there was no COVID-19. Now we are living through a
pandemic caused by the virus, leading us to proclaim in regard to trials, as Charles
Dickens did in a different context in A Tale of Two Cities, “the best of times, the
worst of times.”
“The best of times” because never before has there been more interest and
attention directed to trials, even from the President. Everybody wants to know
when there will be a vaccine to protect us from COVID-19.

“The worst of times” because of the chaos caused by the pandemic in mounting
and doing trials and the impact of “social distancing” on the way trials are done now.
It is a given that the pandemic will change how we do trials, but whatever those
changes will be, trials will remain humankind’s best and most enduring answer to
addressing the conditions and maladies that affect us.

Acknowledgment
We are indebted to Sir Iain Chalmers for his critical review of this piece.
Dr. Chalmers is founder of the Cochrane Collaboration and was the first coordinator
of the James Lind Library.

Events in the Development of Clinical Trials

Date | Author/source | Event
1747 | Lind | Experiment with untreated control group (Lind 1753)
1799 | Haygarth | Use of sham procedure (Haggard 1932)
1800 | Waterhouse | Smallpox trial (Waterhouse 1800, 1802)
1863 | Gull | Use of placebo treatment (Sutton 1865)
1918 | — | First department of biostatistics; Johns Hopkins University, https://www.jhsph.edu/departments/biostatistics/about-us/history/
1923 | Fisher | Application of randomization to experimentation (Fisher and MacKenzie 1923)
1931 | — | Committee on clinical trials created by the Medical Research Council of Great Britain (Medical Research Council 1931)
1931 | Amberson | Random assignment of treatment to groups of patients (Amberson et al. 1931)
1937 | NIH | Start of NIH grant support with creation of the National Cancer Institute (National Institutes of Health 1981)
1944 | — | Publication of multicenter trial on treatment for common cold (Patulin Clinical Trials Committee 1944)
1946 | — | Nuremberg Code for Human Experimentation (Curran and Shapiro 1970), https://history.nih.gov/research/downloads/nuremberg.pdf
1948 | MRC | Streptomycin TB multicenter trial published; BMJ: 30 Oct 1948 (Medical Research Council 1948)
1962 | Hill | Book: Statistical Methods in Clinical and Preventive Medicine (Hill 1962)
1962 | Kefauver, Harris | Amendments to the Food, Drug, and Cosmetic Act of 1938 (United States Congress 1962)
1964 | NLM | MEDLARS® (MEDical Literature Analysis and Retrieval System) of the National Library of Medicine initiated
1966, 8 Feb | USPHS | Memo from Surgeon General of USPHS informing recipients of NIH funding of requirement for informed consent as condition for funding henceforth (Stewart 1966), https://history.nih.gov/research/downloads/surgeongeneraldirective1966.pdf
1966 | Levine | Publication of U.S. Public Health Service regulations leading to creation of Institutional Review Boards for research involving humans (Levine 1988)
1966, 6 Sep | US govt | Freedom of Information Act (FOIA) signed into law by Lyndon Johnson 6 September 1966 (Public Law 89-554, 80 Stat. 383); Act specifies US governmental agency records subject to disclosure under the Act; amended and extended in 1996, 2002, and 2007; https://www.justice.gov/oip/foia_guide09/foia-final.pdf; 5 September 2009
1967 | Tom Chalmers | Structure for separating the treatment monitoring and treatment administration process (Coronary Drug Project Research Group 1973)
1974, 12 July | US govt | Creation of U.S. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research; part of the National Research Act (Public Law No. 93-348, § 202, 88 Stat. 342)
1974 | US govt | US Code of Federal Regulations promulgated establishing Institutional Review Boards, https://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46
1979 | OPRR | Belmont Report (Ethical Principles and Guidelines for the Protection of Human Subjects of Research); product of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (Office for Protection from Research Risks 1979)
1979 | Gordon | NIH Clinical Trials Committee (chaired by Robert Gordon) recommends that “every clinical trial should have provisions for data and safety monitoring” (National Institutes of Health 1979)
1979 | — | Society for Clinical Trials established
1980 | — | First issue of Controlled Clinical Trials (Meinert and Tonascia 1998)
1981 | Friedman | Book: Fundamentals of Clinical Trials (Friedman et al. 1981)
1983 | Pocock | Book: Clinical Trials: A Practical Approach (Pocock 1983)
1986 | Meinert | Book: Clinical Trials: Design, Conduct, and Analysis (Meinert and Tonascia 1986)
1990 | ICH | International Conference on Harmonisation (ICH) formed (European Union, Japan, and the United States) (Vozeh 1995)
1990 | — | Initiation of PhD training program in clinical trials at Johns Hopkins University
1992 | FDA | Prescription Drug User Fee Act (PDUFA) enacted; allows FDA to collect fees for review of New Drug Applications (Public Law 102-571, 102nd Congress; https://www.fda.gov/ForIndustry/UserFees/PrescriptionDrugUserFee/ucm200361.htm; 2002)
1993 | US govt | Mandate regarding valid analysis for gender and ethnic origin treatment interactions (United States Congress 1993)
1993 | UK | Cochrane Collaboration founded under leadership of Iain Chalmers; developed in response to Archie Cochrane’s call for up-to-date, systematic reviews of all relevant trials in the healthcare field
1996 | HIPAA | Health Insurance Portability and Accountability Act (HIPAA) enacted (Public Law 104-191, 104th US Congress; https://aspe.hhs.gov/admnsimp/pL10419.htm)
1996 | NLM | PubMed (search engine for MEDLINE) made free to public
1996 | — | Consolidated Standards of Reporting Trials (CONSORT) (Begg et al. 1996)
1997 | US govt | US public law calling for registration of trials; Food and Drug Administration Modernization Act of 1997; Public Law 105-115; 21 Nov 1997 (https://www.govinfo.gov/content/pkg/PLAW-105publ115/pdf/PLAW-105publ115.pdf)
1997 | Piantadosi | Book: Clinical Trials: A Methodologic Perspective (Piantadosi 1997)
2000 | NIH | ClinicalTrials.gov registration website launched (Zarin et al. 2007)
2003 | NIH | NIH statement on data sharing (National Institutes of Health 2003)
2003 | UK | Launch of James Lind Library, marking 250th anniversary of the publication of James Lind’s Treatise of the Scurvy (https://www.jameslindlibrary.org/search/)
2004 | ICMJE | Requirement of registration of trials in public registries as condition for publication for trials starting enrollment after 1 July 2005 by member journals of the International Committee of Medical Journal Editors (ICMJE) (DeAngelis et al. 2004)
2004, 3 Sep | NIH | NIH notice NOT-OD-04-064 (Enhanced Public Access to NIH Research Information) required “its grantees and supported Principal Investigators provide the NIH with electronic copies of all final version manuscripts upon acceptance for publication if the research was supported in whole or in part by NIH funding” for deposit in PubMed Central within six months after publication
2006 | WHO | World Health Organization (WHO) launch of International Clinical Trials Registry Platform (ICTRP) (https://www.who.int/ictrp/en/)
2007 | FDA | Requirement for investigators to post tabular results of trials covered under FDA regulations on ClinicalTrials.gov within one year of completion [Food and Drug Administration Amendments Act of 2007 (FDAAA)]
2007 | Wiley | Encyclopedia of Clinical Trials (4 vols) (D’Agostino et al. 2007)
2013 | — | Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) (Chan et al. 2013)
2016 | NIH | Final NIH policy on single institutional review board for multi-site research (NOT-OD-16-094)
2017 | FDA | 2007 requirement for posting results extended to all trials, whether or not subject to FDA regulations (81 FR 64983)
2017 | ICMJE | ICMJE requirement for data sharing in clinical trials (Ann Intern Med, doi: 10.7326/M17-1028) (Taichman et al. 2017)

References
Amberson JB Jr, McMahon BT, Pinner M (1931) A clinical trial of sanocrysin in
pulmonary tuberculosis. Am Rev Tuberc 24:401–435
Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D,
Schulz KF, Simel D, Stroup DF (1996) Improving the quality of reporting of
randomized controlled trials. The CONSORT statement. JAMA 276(8):637–639
Bird A (2018) James Jurin and the avoidance of bias in collecting and assessing
evidence on the effects of variolation. JLL Bulletin: Commentaries on the history
of treatment evaluation. https://www.jameslindlibrary.org/articles/james-jurin-and-the-avoidance-of-bias-in-collecting-and-assessing-evidence-on-the-effects-of-variolation/
Chalmers I, Dukan E, Podolsky SH, Davey Smith G (2011) The advent of fair
treatment allocation schedules in clinical trials during the 19th and early 20th
centuries. JLL Bulletin: Commentaries on the history of treatment evaluation.
https://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/
Chan AW, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jeric K,
Hróbjartsson A, Mann H, Dickersin K, Berlin JA, Doré CJ, Parulekar WR,
Summerskill WSM, Groves T, Schulz KF, Sox HC, Rockhold FW,
Drummond R, Moher D (2013) SPIRIT 2013 statement: defining standard pro-
tocol items for clinical trials. Ann Intern Med 158(3):200–207
Coronary Drug Project Research Group (1973) The Coronary Drug Project: design,
methods, and baseline results. Circulation 47(Suppl I):I-1-I-50
Curran WJ, Shapiro ED (1970) Law, medicine, and forensic science, 2nd edn. Little,
Brown, Boston
D’Agostino R, Sullivan LM, Massaro J (eds) (2007) Wiley encyclopedia of clinical
trials, 4 vols. Wiley, New York
DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S,
Laine C, Marusic A, Overbeke AJPM, Schroeder TV, Sox HC, Van Der Weyden
MB (2004) Clinical Trial Registration: A statement from the International Com-
mittee of Medical Journal Editors. JAMA 292:1363–1364
Fisher RA, MacKenzie WA (1923) Studies in crop variation: II. The manurial
response of different potato varieties. J Agric Sci 13:311–320
Friedman LM, Furberg CD, DeMets DR (1981) Fundamentals of clinical trials.
Springer, New York (5th edn, 2015)
Haggard HW (1932) The Lame, the Halt, and the Blind: the vital role of medicine in
the history of civilization. Harper and Brothers, New York
Hill AB (1937) Principles of medical statistics. Lancet
Hill AB (1962) Statistical methods in clinical and preventive medicine. Oxford
University Press, New York
Levine RJ (1988) Ethics and regulation of clinical research, 2nd edn. Yale University
Press, New Haven
Lind J (1753) A treatise of the scurvy (reprinted in Lind’s treatise on scurvy, edited
by CP Stewart, D Guthrie, Edinburgh University Press, Edinburgh, 1953). Sands,
Murray, Cochran, Edinburgh
Medical Research Council (1931) Clinical trials of new remedies (annotations).
Lancet 2:304
Medical Research Council (1944) Clinical trial of patulin in the common cold.
Lancet 2:373–375
Medical Research Council (1948) Streptomycin treatment of pulmonary tuberculo-
sis: a Medical Research Council investigation. Br Med J 2:769–782
Meinert CL, Tonascia S (1986) Clinical trials: design, conduct, and analysis. Oxford
University Press, New York (2nd edn, 2012)
Meinert CL, Tonascia S (1998) Controlled Clinical Trials. Encyclopedia of biosta-
tistics, vol 1. Wiley, New York, pp 929–931
National Institutes of Health (1979) Clinical trials activity (NIH Clinical Trials
Committee; RS Gordon Jr, Chair). NIH Guide Grants Contracts 8 (# 8):29
National Institutes of Health (1981) NIH Almanac. Publ no 81-5. Division of Public
Information, Bethesda
National Institutes of Health (2003) NIH data sharing policy and implementation
guidance. http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
Office for Protection from Research Risks (1979) The Belmont Report. Ethical
principles and guidelines for the protection of human subjects of research,
18 April 1979
Patulin Clinical Trials Committee (of the Medical Research Council) (1944) Clinical
trial of Patulin in the common cold. Lancet 2:373–375
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley, Hoboken (3rd
edn, 2017)
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, New York
Stewart WH (1966) Surgeon general’s directives on human experimentation. https://history.nih.gov/research/downloads/surgeongeneraldirective1966.pdf
Sutton HG (1865) Cases of rheumatic fever. Guy’s Hosp Rep 11:392–428
Taichman DB, Sahni P, Pinborg A, Peiperl L, Laine C, James A, Hong ST,
Haileamlak A, Gollogly L, Godlee F, Frizelle FA, Florenzano F, Drazen JM,
Bauchner H, Baethge C, Backus J (2017) Data sharing statements for clinical
trials: a requirement of the International Committee of Medical Journal Editors.
Ann Intern Med 167(1):63–65
Tibi S (2006) Al-Razi and Islamic medicine in the 9th century. J R Soc Med 99(4):
206–207
United States Congress (103rd; 1st session): NIH Revitalization Act of 1993,
42 USC § 131 (1993); Clinical research equity regarding women and minorities;
part I: women and minorities as subjects in clinical research, 1993
United States Congress (87th): Drug Amendments of 1962, Public Law 87-781, S
1522. Washington, Oct 10, 1962
Vozeh S (1995) The International Conference on Harmonisation. Eur J Clin
Pharmacol 48:173–175
Waterhouse B (1800) A prospect of exterminating the small pox. Cambridge Press,
Cambridge
Waterhouse B (1802) A prospect of exterminating the small pox (part II). University
Press, Cambridge
Zarin DA, Ide NC, Tse T, Harlan WR, West JC, Lindberg DAB (2007) Issues in the
registration of clinical trials. JAMA 297:2112–2120
Preface

The two of us have spent our professional lives doing trials: writing textbooks on
how to do them, teaching about them, and sitting on advisory groups responsible for
trials. We are pleased to say that over our lifetimes trials have moved up the scale of
importance to the point where people feel cheated if denied enrollment.
Clinical trials are admixtures of disciplines: medicine, behavioral sciences,
biostatistics, epidemiology, ethics, quality control, and regulatory sciences, to name
the principal ones, making it difficult to cover the field in any one textbook.
This reality is the reason we campaigned (principally SP) for a collective work
designed to cover the waterfront of trials. We are pleased to have been able to do this
in conjunction with Springer Nature, in both print and electronic formats.
There has long been a need for a comprehensive clinical trials text written at a
level accessible to both technical and nontechnical readers. The perspective is the
same as that in many other fields where the scope of a “principles and practice”
textbook has been defining and instructive to those learning the discipline. Accord-
ingly, the intent of Principles and Practice of Clinical Trials has been to cover,
define, and explicate the field in ways that are approachable to trialists of all types.
The work is intended to be comprehensive, but not encyclopedic.

Boston, USA Steven Piantadosi


Baltimore, USA Curtis L. Meinert
April 2022 Editors

Acknowledgments

The work involved nine subject sections and appendices.

Section | Section editor | Affiliation
1 Perspectives on clinical trials | Steven N. Goodman | Stanford University; Professor
  | Karen A. Robinson | Johns Hopkins University; Professor
2 Conduct and management | Eleanor McFadden | Frontier Science (Scotland); Managing Director
3 Regulation and oversight | Winifred Werther | Amgen; Epidemiologist
4 Bias control and precision | O. Dale Williams | Florida International University; Retired
5 Basics of trial design | Christopher S. Coffey | University of Iowa; Professor
6 Advanced topics in trial design | Babak Choodari-Oskooei | University College London; Senior Research Associate
  | Mahesh K. B. Parmar | University College London; Professor
7 Analysis | Stephen L. George | Duke University; Professor Emeritus
8 Publication and related issues | Tianjing Li | University of Colorado; Associate Professor
9 Special topics | Lawrence Friedman | NIH:NHLBI; Retired
  | Nancy L. Geller | NIH:NHLBI; Director, Office of Biostatistics Research
10 Appendices | Gillian Gresham | Cedars-Sinai Medical Center (Los Angeles); Assistant Professor

We are most grateful to the section editors for their work in producing this volume.
Thanks to Springer Nature for making this work possible.
Thanks for the guidance and counsel provided by Alexa Steele, editor, Springer
Nature, and for the help and guidance provided by Rukmani Parameswaran and
Swetha Varadharajan in shepherding this work to completion.

A special thanks to Gillian Gresham for her production of the appendices and her
efforts as Senior Associate Editor.

Steven Piantadosi and Curtis L. Meinert


Editors
Contents

Volume 1

Part I Perspectives on Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . 1


1 Social and Scientific History of Randomized Controlled
Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Laura E. Bothwell, Wen-Hua Kuo, David S. Jones, and
Scott H. Podolsky

2 Evolution of Clinical Trials Science ....................... 21
Steven Piantadosi

3 Terminology: Conventions and Recommendations . . . . . . . . . . . . 35
Curtis L. Meinert

4 Clinical Trials, Ethics, and Human Protections Policies . . . . . . . . 55
Jonathan Kimmelman

5 History of the Society for Clinical Trials ................... 73
O. Dale Williams and Barbara S. Hawkins

Part II Conduct and Management . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6 Investigator Responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Bruce J. Giantonio

7 Centers Participating in Multicenter Trials . . . . . . . . . . . . . . . . . 97
Roberta W. Scherer and Barbara S. Hawkins

8 Qualifications of the Research Staff . . . . . . . . . . . . . . . . . . . . . . . 123
Catherine A. Meldrum

9 Multicenter and Network Trials . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Sheriza Baksh

10 Principles of Protocol Development . . . . . . . . . . . . . . . . . . . . . . . 151
Bingshu E. Chen, Alison Urton, Anna Sadura, and
Wendy R. Parulekar
11 Procurement and Distribution of Study Medicines . . . . . . . . . . . . 169
Eric Hardter, Julia Collins, Dikla Shmueli-Blumberg, and
Gillian Armstrong
12 Selection of Study Centers and Investigators . . . . . . . . . . . . . . . . 191
Dikla Shmueli-Blumberg, Maria Figueroa, and Carolyn Burke
13 Design and Development of the Study Data System . . . . . . . . . . . 209
Steve Canham
14 Implementing the Trial Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Jamie B. Oughton and Amanda Lilley-Kelly
15 Participant Recruitment, Screening, and Enrollment . . . . . . . . . . 257
Pascale Wermuth
16 Administration of Study Treatments and Participant
Follow-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Jennifer J. Gassman
17 Data Capture, Data Management, and Quality Control;
Single Versus Multicenter Trials . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Kristin Knust, Lauren Yesko, Ashley Case, and Kate Bickett
18 End of Trial and Close Out of Data Collection . . . . . . . . . . . . . . . 321
Gillian Booth
19 International Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Lynette Blacher and Linda Marillo
20 Documentation: Essential Documents and Standard Operating
Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Eleanor McFadden, Julie Jackson, and Jane Forrest
21 Consent Forms and Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Ann-Margret Ervin and Joan B. Cobb Pettit
22 Contracts and Budgets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Eric Riley and Eleanor McFadden
23 Long-Term Management of Data and Secondary Use . . . . . . . . . 427
Steve Canham

Part III Regulation and Oversight .......................... 457


24 Regulatory Requirements in Clinical Trials . . . . . . . . . . . . . . . . . 459
Michelle Pernice and Alan Colley
25 ClinicalTrials.gov ..................................... 479
Gillian Gresham
26 Funding Models and Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Matthew Westmore and Katie Meadmore
27 Financial Compliance in Clinical Trials . . . . . . . . . . . . . . . . . . . . 521
Barbara K. Martin
28 Financial Conflicts of Interest in Clinical Trials . . . . . . . . . . . . . . 541
Julie D. Gottlieb
29 Trial Organization and Governance . . . . . . . . . . . . . . . . . . . . . . . 559
O. Dale Williams and Katrina Epnere
30 Advocacy and Patient Involvement in Clinical Trials . . . . . . . . . . 569
Ellen Sigal, Mark Stewart, and Diana Merino
31 Training the Investigatorship . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
Claire Weber
32 Responsibilities and Management of the Clinical Coordinating
Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Trinidad Ajazi
33 Efficient Management of a Publicly Funded Cancer Clinical
Trials Portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
Catherine Tangen and Michael LeBlanc
34 Archiving Records and Materials . . . . . . . . . . . . . . . . . . . . . . . . . 637
Winifred Werther and Curtis L. Meinert
35 Good Clinical Practice ................................. 649
Claire Weber
36 Institutional Review Boards and Ethics Committees . . . . . . . . . . 657
Keren R. Dunn
37 Data and Safety Monitoring and Reporting . . . . . . . . . . . . . . . . . 679
Sheriza Baksh and Lijuan Zeng
38 Post-Approval Regulatory Requirements . . . . . . . . . . . . . . . . . . . 699
Winifred Werther and Anita M. Loughlin

Volume 2

Part IV Bias Control and Precision . . . . . . . . . . . . . . . . . . . . . . . . . . 727


39 Controlling for Multiplicity, Eligibility, and Exclusions . . . . . . . . 729
Amber Salter and J. Philip Miller
40 Principles of Clinical Trials: Bias and Precision Control . . . . . . . 739
Fan-fan Yu
41 Power and Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
Elizabeth Garrett-Mayer
42 Controlling Bias in Randomized Clinical Trials . . . . . . . . . . . . . . 787
Bruce A. Barton
43 Masking of Trial Investigators . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
George Howard and Jenifer H. Voeks
44 Masking Study Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
Lea Drye
45 Issues for Masked Data Monitoring . . . . . . . . . . . . . . . . . . . . . . . 823
O. Dale Williams and Katrina Epnere
46 Variance Control Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
Heidi L. Weiss, Jianrong Wu, Katrina Epnere, and O. Dale Williams
47 Ascertainment and Classification of Outcomes . . . . . . . . . . . . . . . 843
Wayne Rosamond and David Couper
48 Bias Control in Randomized Controlled Clinical Trials . . . . . . . . 855
Diane Uschner and William F. Rosenberger

Part V Basics of Trial Design .............................. 875


49 Use of Historical Data in Design . . . . . . . . . . . . . . . . . . . . . . . . . . 877
Christopher Kim, Victoria Chia, and Michael Kelsh
50 Outcomes in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891
Justin M. Leach, Inmaculada Aban, and Gary R. Cutter
51 Patient-Reported Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915
Gillian Gresham and Patricia A. Ganz
52 Translational Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939
Steven Piantadosi
53 Dose-Finding and Dose-Ranging Studies ................... 951
Mark R. Conaway and Gina R. Petroni
54 Inferential Frameworks for Clinical Trials . . . . . . . . . . . . . . . . . . 973
James P. Long and J. Jack Lee
55 Dose Finding for Drug Combinations . . . . . . . . . . . . . . . . . . . . . . 1003
Mourad Tighiouart
56 Middle Development Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031
Emine O. Bayman
57 Randomized Selection Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
Shing M. Lee, Bruce Levin, and Cheng-Shiun Leu
58 Futility Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1067
Sharon D. Yeatts and Yuko Y. Palesch
59 Interim Analysis in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . 1083
John A. Kairalla, Rachel Zahigian, and Samuel S. Wu

Part VI Advanced Topics in Trial Design . . . . . . . . . . . . . . . . . . . . . 1103


60 Bayesian Adaptive Designs for Phase I Trials . . . . . . . . . . . . . . . 1105
Michael J. Sweeting, Adrian P. Mander, and Graham M. Wheeler
61 Adaptive Phase II Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1133
Boris Freidlin and Edward L. Korn
62 Biomarker-Guided Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1145
L. C. Brown, A. L. Jorgensen, M. Antoniou, and J. Wason
63 Diagnostic Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1171
Madhu Mazumdar, Xiaobo Zhong, and Bart Ferket
64 Designs to Detect Disease Modification . . . . . . . . . . . . . . . . . . . . . 1199
Michael P. McDermott
65 Screening Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1219
Philip C. Prorok
66 Biosimilar Drug Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1237
Johanna Mielke and Byron Jones
67 Prevention Trials: Challenges in Design, Analysis, and
Interpretation of Prevention Trials . . . . . . . . . . . . . . . . . . . . . . . . 1261
Shu Jiang and Graham A. Colditz
68 N-of-1 Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1279
Reza D. Mirza, Sunita Vohra, Richard Kravitz, and
Gordon H. Guyatt
69 Noninferiority Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1297
Patrick P. J. Phillips and David V. Glidden
70 Cross-over Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1325
Byron Jones
71 Factorial Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1353
Steven Piantadosi and Susan Halabi
72 Within Person Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . 1377
Gui-Shuang Ying
73 Device Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1399
Heng Li, Pamela E. Scott, and Lilly Q. Yue
74 Complex Intervention Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1417
Linda Sharples and Olympia Papachristofi
75 Randomized Discontinuation Trials . . . . . . . . . . . . . . . . . . . . . . . 1439
Valerii V. Fedorov
76 Platform Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1455
Oleksandr Sverdlov, Ekkehard Glimm, and Peter Mesenbrink
77 Cluster Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1487
Lawrence H. Moulton and Richard J. Hayes
78 Multi-arm Multi-stage (MAMS) Platform Randomized
Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1507
Babak Choodari-Oskooei, Matthew R. Sydes, Patrick Royston, and
Mahesh K. B. Parmar
79 Sequential, Multiple Assignment, Randomized Trials
(SMART) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1543
Nicholas J. Seewald, Olivia Hackworth, and Daniel Almirall
80 Monte Carlo Simulation for Trial Design Tool . . . . . . . . . . . . . . . 1563
Suresh Ankolekar, Cyrus Mehta, Rajat Mukherjee, Sam Hsiao,
Jennifer Smith, and Tarek Haddad

Volume 3

Part VII Analysis ........................................ 1587


81 Preview of Counting and Analysis Principles . . . . . . . . . . . . . . . . 1589
Nancy L. Geller

82 Intention to Treat and Alternative Approaches . . . . . . . . . . . . . . 1597
Judith D. Goldberg

83 Estimation and Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . 1615
Pamela A. Shaw and Michael A. Proschan

84 Estimands and Sensitivity Analyses . . . . . . . . . . . . . . . . . . . . . . . 1631
Estelle Russek-Cohen and David Petullo

85 Confident Statistical Inference with Multiple Outcomes,
Subgroups, and Other Issues of Multiplicity . . . . . . . . . . . . . . . . 1659
Siyoen Kil, Eloise Kaizar, Szu-Yu Tang, and Jason C. Hsu

86 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1681
Guangyu Tong, Fan Li, and Andrew S. Allen

87 Essential Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1703
Gregory R. Pond and Samantha-Jo Caetano

88 Nonparametric Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 1717
Yuliya Lokhnygina

89 Survival Analysis II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1743
James J. Dignam

90 Prognostic Factor Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1771
Liang Li

91 Logistic Regression and Related Methods . . . . . . . . . . . . . . . . . . 1789
Márcio A. Diniz and Tiago M. Magalhães

92 Statistical Analysis of Patient-Reported Outcomes in
Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1813
Gina L. Mazza and Amylou C. Dueck

93 Adherence Adjusted Estimates in Randomized Clinical
Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1833
Sreelatha Meleth

94 Randomization and Permutation Tests . . . . . . . . . . . . . . . . . . . . . 1851
Vance W. Berger, Patrick Onghena, and J. Rosser Matthews

95 Generalized Pairwise Comparisons for Prioritized Outcomes . . . 1869
Marc Buyse and Julien Peron

96 Use of Resampling Procedures to Investigate Issues of Model
Building and Its Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1895
Willi Sauerbrei and Anne-Laure Boulesteix

97 Joint Analysis of Longitudinal and Time-to-Event Data . . . . . . . 1919
Zheng Lu, Emmanuel Chigutsa, and Xiao Tong

98 Pharmacokinetic and Pharmacodynamic Modeling . . . . . . . . . . . 1937
Shamir N. Kalaria, Hechuan Wang, and Jogarao V. Gobburu

99 Safety and Risk Benefit Analyses . . . . . . . . . . . . . . . . . . . . . . . . . 1961
Jeff Jianfei Guo

100 Causal Inference: Efficacy and Mechanism Evaluation . . . . . . . . 1981
Sabine Landau and Richard Emsley

101 Development and Validation of Risk Prediction Models . . . . . . . 2003
Damien Drubay, Ben Van Calster, and Stefan Michiels

Part VIII Publication and Related Issues . . . . . . . . . . . . . . . . . . . . . 2025


102 Paper Writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2027
Curtis L. Meinert

103 Reporting Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2045
S. Swaroop Vedula, Asbjørn Hróbjartsson, and Matthew J. Page

104 CONSORT and Its Extensions for Reporting Clinical Trials . . . . 2073
Sally Hopewell, Isabelle Boutron, and David Moher

105 Publications from Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . 2089
Barbara S. Hawkins

106 Study Name, Authorship, Titling, and Credits . . . . . . . . . . . . . . . 2103
Curtis L. Meinert

107 De-identifying Clinical Trial Data . . . . . . . . . . . . . . . . . . . . . . . . . 2115
Jimmy Le

108 Data Sharing and Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2137
Ida Sim

109 Introduction to Systematic Reviews . . . . . . . . . . . . . . . . . . . . . . . 2159
Tianjing Li, Ian J. Saldanha, and Karen A. Robinson

110 Introduction to Meta-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2179
Theodoros Evrenoglou, Silvia Metelli, and Anna Chaimani

111 Reading and Interpreting the Literature on Randomized
Controlled Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2197
Janet Wittes

112 Trials Can Inform or Misinform: “The Story of Vitamin
A Deficiency and Childhood Mortality” . . . . . . . . . . . . . . . . . . . . 2209
Alfred Sommer

Part IX Special Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2225


113 Issues in Generalizing Results from Clinical Trials . . . . . . . . . . . 2227
Steven Piantadosi

114 Leveraging “Big Data” for the Design and Execution of
Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2241
Stephen J. Greene, Marc D. Samsky, and Adrian F. Hernandez

115 Trials in Complementary and Integrative Health
Interventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2263
Catherine M. Meyers and Qilu Yu

116 Orphan Drugs and Rare Diseases . . . . . . . . . . . . . . . . . . . . . . . . . 2289
James E. Valentine and Frank J. Sasinowski

117 Pragmatic Randomized Trials Using Claims or Electronic
Health Record Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2307
Frank W. Rockhold and Benjamin A. Goldstein

118 Fraud in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2319
Stephen L. George, Marc Buyse, and Steven Piantadosi

119 Clinical Trials on Trial: Lawsuits Stemming from Clinical
Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2339
John J. DeBoy and Annie X. Wang

120 Biomarker-Driven Adaptive Phase III Clinical Trials . . . . . . . . . 2367
Richard Simon

121 Clinical Trials in Children . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2379
Gail D. Pearson, Kristin M. Burns, and Victoria L. Pemberton

122 Trials in Older Adults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2397
Sergei Romashkan and Laurie Ryan

123 Trials in Minority Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . 2417
Otis W. Brawley

124 Expanded Access to Drug and Device Products for
Clinical Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2431
Tracy Ziolek, Jessica L. Yoos, Inna Strakovsky, Praharsh Shah, and
Emily Robison

125 A Perspective on the Process of Designing and Conducting
Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2453
Curtis L. Meinert and Steven Piantadosi

Appendix 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2475
Appendix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2477
Appendix 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2481
Appendix 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2489
Appendix 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2493
Appendix 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2499
Appendix 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2503
Appendix 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2509
Appendix 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2513
Appendix 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2515
Appendix 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2523
Appendix 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2525
Appendix 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
Appendix 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2535
Appendix 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2557
Appendix 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2563
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2565
About the Editors

Steven Piantadosi, MD, PhD, is a clinical trialist with
40 years’ experience in research, teaching, and
healthcare leadership. He has worked on clinical trials
of all types, including multicenter and international tri-
als, academic portfolios, and regulatory trials. Most of
his work has been in cancer; he also works in other
disciplines such as neurodegenerative and cardiovascu-
lar diseases.
Dr. Piantadosi began his career in clinical trials early
during an intramural Staff Fellowship at the National
Cancer Institute’s Clinical and Diagnostic Trials
Section from 1982 to 1987. That group focused on
theory, methodology, and applications with the
NCI-sponsored Lung Cancer Study Group. Collabora-
tive work included studies of bias induced by missing
covariates, factorial clinical trials, and the ecological
fallacy. In the latter years, the Branch was focused on
Cancer Prevention, including design of the PLCO Trial,
which would conclude 30 years later.
In 1987, Dr. Piantadosi joined the Johns Hopkins
Oncology Center (now the Johns Hopkins Sidney Kim-
mel Comprehensive Cancer Center) as the first Director
of Biostatistics and the CC Shared Resource. He also
carried appointments in the Department of Biostatistics,
and in the Johns Hopkins Center for Clinical Trials in
the Department of Epidemiology in the School of Public
Health (now the Johns Hopkins Bloomberg School).
The division he founded became well diversified in
cancer research and peer reviewed support, including
the CCSG, 6 SPORE grants, PPGs, R01s, and many
other grants. A program in Bioinformatics was begun
jointly with the Biostatistics Department in Public
Health, which would eventually develop into its own

funded CCSG Shared Resource. The Biostatistics Division
also had key responsibilities in Cancer Center
teaching, the Protocol Review and Monitoring Commit-
tee, Clinical Research Office, Clinical Informatics, and
Research Data Systems and Informatics.
From 1987 onward Dr. Piantadosi’s work involved
nearly every type of cancer, but especially bone marrow
transplant, lung cancer, brain tumors, and drug develop-
ment. In 1994, he helped to found the New Approaches
to Brain Tumor Therapy Consortium (now the Adult
Brain Tumor Consortium, ABTC), focused on early
developmental trials of new agents. This group was
funded by NCI for 25 years, was one of the first to
accomplish multicenter phase I trials, and was an early
implementer of the Continual Reassessment Method
(CRM) for dose-finding.
Collaborations at Johns Hopkins extended well
beyond the Oncology Department and included Epide-
miology (Multi-Center AIDS Cohort Study), Biostatis-
tics, Surgery, Medicine, Anesthesiology, Urology, and
Neurosurgery. His work on design and analysis of brain
tumor trials through the Department of Neurosurgery
led to the FDA approval of BCNU-impregnated biode-
gradable polymers (Gliadel) for treatment of glioblas-
toma. He also maintained important external
collaborations such as with the Parkinson’s Study
Group, based at the University of Rochester. He ran
the Coordinating Center for the National Emphysema
Treatment Trial (NETT) sponsored by NHLBI and
CMS. Numerous important findings emerged from this
trial, not the least of which was sharpened indications
for risks, benefits, and efficacy of lung volume reduction
surgery for emphysema. Dr. Piantadosi also participated
actively in prevention trials such as the Alzheimer’s
Disease Anti-Inflammatory Prevention Trial (ADAPT)
and the Chemoprevention for Barrett’s Esophagus Trial,
both employing NSAIDs and concluding that they were
ineffective preventives. He worked with FDA, serving
on the Oncologic Drugs Advisory Committee, and after-
wards on various review panels, and as advisor to
industry.
From 2007 to 2017, Dr. Piantadosi was the inaugural
Director of the Samuel Oschin Cancer Institute at
Cedars Sinai, a UCLA teaching hospital, Professor of
Medicine, and Professor of Biomathematics and
Medicine at UCLA. Cedars is the largest hospital in the
western USA and treats over 5000 new cancer cases
each year, using full-time faculty, in-network oncolo-
gists, and private practitioners. Broadly applied work
continued with activities in the Long-Term Oxygen
Treatment Trial (LOTT), dose-finding designs for can-
cer drug combinations, neurodegenerative disease trial
design, and support of the UCLA multi-campus CTSA.
During this interval, numerous clinicians and
researchers were recruited. Peer-reviewed funding
increased from ~$1M to over $20M annually. A clinical
trialist is an unusual choice for a Cancer Center director,
but it represented an opportunity to improve cancer care
in Los Angeles, strengthen the academics at the institu-
tion using the NCI P30 model, and serve as a role model
for clinical trialists.
In 2018, Dr. Piantadosi joined the Division of Surgi-
cal Oncology at Brigham and Women’s Hospital, as
Professor in Residence, Harvard Medical School.
Work at BWH, HMS, includes roles on the Alliance
NCTN group Executive Committee as the Associate
Group Chair for Strategic Initiatives and Innovation, as
well as mentoring in the Alliance Statistics Office. He is
currently course Co-director for Methods in Clinical
Research at DFCI and Course Director for Advanced
Clinical Trials (CI 726) in the Master of Medical Sci-
ences in Clinical Investigation Program at Harvard
Medical School.
Teaching and Education: In 1988, while at Hopkins
Dr. Piantadosi began teaching Experimental Design
followed by advanced Clinical Trials. This work formed
the foundation for the textbook Clinical Trials:
A Methodologic Perspective, first published in 1997
and now in its 3rd edition. His course was a staple for
students in Biostatistics, Epidemiology, and the Gradu-
ate Training Program in Clinical Investigation, where he
also taught a research seminar. Subsequently, he
mentored numerous PhD graduate students and fellows
and served on many doctoral committees. At UCLA, he
continued to teach Clinical Trials in their Specialty
Training and Research Program.
Dr. Piantadosi has also taught extensively in national
workshops focused on training of clinical investigators
in cancer, biostatistics, and neurologic disease. This
began with the start of the well-known Vail Workshop,
and similar venues in Europe and Australia. He was also
the Director of several similarly structured courses
solely for biostatisticians sponsored by AACR. Indepen-
dent of those workshops, he taught extensively in Japan,
Holland, and Italy.

Curtis L. Meinert
Department of Epidemiology
School of Public Health
Johns Hopkins University
Baltimore, MD, USA

Professor Emeritus (Retired 30 June 2019)


I was born 30 June 1934 on a farm four miles west of
Sleepy Eye, Minnesota.
My birthday was the first day of a three-day rampage
orchestrated by Adolf Hitler known as the Night of the
Long Knives. Ominous foreboding of events to come.
My first 6 years of schooling were in a country school
located near the Chicago and Northwestern railroad line.
There was no studying when freight trains got stuck
making the grade past the school.
As was the custom of my parents, all four of us were
sent to St John’s Lutheran School in Sleepy Eye for our
seventh and eighth years of schooling for modicums of
religious training. After Lutheran School it was Sleepy
Eye Public School, and after that it was the University of
Minnesota.
Bachelor of Arts in psychology (1956)
Master of Science in biostatistics (1959)
Doctor of Philosophy in biostatistics (1964) (Disser-
tation: Quantitation of the isotope displacement insulin
immunoassay)
My sojourn in trials started when I was a graduate
student at the University of Minnesota. It started when I
signed on to work with Chris Klimt, who was looking for
someone to work with him in developing what was to
become the University Group Diabetes Program (UGDP).
Dr. Klimt decided to move to Baltimore in 1962 to
take an appointment in the University of Maryland
Medical School. He wanted me to move with him. I
did, albeit reluctantly because I wanted to stay and finish
my PhD dissertation.
Being Midwestern, Baltimore seemed foreign. People said we talked with an accent, but in our mind it was
they who had the accents. A few days after we unpacked
I told my wife we would stay a little while, but that I did
not want to wake up dead in Baltimore. That surely now
is my fate with all my daughters and grandchildren
living here.
The UGDP begat the Coronary Drug Project (CDP;
1966) and it begat others.
I moved across town in 1979 to accept an appoint-
ment in the Department of Epidemiology, School of
Public Health, Johns Hopkins University. The move
led to classroom teaching, mentoring passels of doctoral
students, several textbooks, and a blog site,
trialsmeinertsway.com.
It was Abe Lilienfeld, after I arrived at Hopkins, who
rekindled my “textbook fire.” I had taken a sabbatical a
few years back while at Maryland to write a text on
design and conduct of trials and produced nothing! The
good news was that the “textbook bug” was gone – that
is until Abe got a hold of me at Hopkins.
Trials became my life with the creation of the Center
for Clinical Trials (now the Center for Clinical Trials
and Evidence Synthesis) established in 1990 with the
urging and help of Al Sommer, then dean of the school.
The Center has done dozens and dozens of trials since its
creation.
I lost my wife 20 February 2015. I met her at a Tupperware party on Washington’s birthday in 1954. We married a year and a half later. She was born and raised in Sioux Falls, South Dakota. Being 5 feet 9 inches tall, she was happy to be able to wear her 3-inch heels when we went out on the town and still be 6 inches shorter than her escort. Height has its advantages, but not when you are in the middle seat flying sardine!
I came to know Steve Piantadosi after he arrived at
Hopkins in 1987. He started talking about a collective work such as the one we are now involved in long before it had a name. For years I ignored his talk, but the “smooth-talking North Carolinian” can be insidious and convincing.
So here I am, with Steve joined at the hip, trying to
shepherd this work to the finish line.
About the Section Editors

Gillian Gresham
Department of Medicine
Cedars-Sinai Medical Center
Los Angeles, CA, USA

Steven N. Goodman
Stanford University School of Medicine
Stanford, CA, USA


Eleanor McFadden
Frontier Science (Scotland) Ltd.
Kincraig, Scotland

O. Dale Williams
University of North Carolina at Chapel Hill
Chapel Hill, NC, USA
University of Alabama at Birmingham
Birmingham, AL, USA

Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL
Institute of Clinical Trials and Methodology, UCL
London, UK

Stephen L. George
Department of Biostatistics and Bioinformatics
Duke University School of Medicine
Durham, NC, USA

Tianjing Li
Department of Ophthalmology
School of Medicine
University of Colorado Anschutz Medical Campus
Colorado School of Public Health
Aurora, CO, USA

Karen A. Robinson
Johns Hopkins University
Baltimore, MD, USA

Nancy L. Geller
Office of Biostatistics Research
NHLBI
Bethesda, MD, USA

Winifred Werther
Amgen Inc.
South San Francisco, CA, USA

Christopher S. Coffey
University of Iowa
Iowa City, IA, USA

Mahesh K. B. Parmar
University College London
London, UK

Lawrence Friedman
Rockville, MD, USA
Contributors

Inmaculada Aban Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
Trinidad Ajazi Alliance for Clinical Trials in Oncology, University of Chicago,
Chicago, IL, USA
Andrew S. Allen Department of Biostatistics and Bioinformatics, Duke University,
School of Medicine, Durham, NC, USA
Daniel Almirall University of Michigan, Ann Arbor, MI, USA
Suresh Ankolekar Cytel Inc, Cambridge, MA, USA
Maastricht School of Management, Maastricht, Netherlands
M. Antoniou F. Hoffmann-La Roche Ltd, Basel, Switzerland
Gillian Armstrong GSK, Slaoui Center for Vaccines Research, Rockville, MD,
USA
Sheriza Baksh Johns Hopkins Bloomberg School of Public Health, Baltimore,
MD, USA
Bruce A. Barton Department of Population and Quantitative Health Sciences,
University of Massachusetts Medical School, Worcester, MA, USA
Emine O. Bayman University of Iowa, Iowa City, IA, USA
Vance W. Berger Biometry Research Group, National Cancer Institute, Rockville,
MD, USA
Kate Bickett Emmes, Rockville, MD, USA
Lynette Blacher Frontier Science Amherst, Amherst, NY, USA
Gillian Booth Leeds Institute of Clinical Trials Research, University of Leeds,
Leeds, UK
Laura E. Bothwell Worcester State University, Worcester, MA, USA

Anne-Laure Boulesteix Institute for Medical Information Processing, Biometry, and Epidemiology, LMU Munich, Munich, Germany
Isabelle Boutron Epidemiology and Biostatistics Research Center (CRESS),
Inserm UMR1153, Université de Paris, Paris, France
Otis W. Brawley Johns Hopkins School of Medicine, and Johns Hopkins
Bloomberg School of Public Health, Baltimore, MD, USA
L. C. Brown MRC Clinical Trials Unit, UCL Institute of Clinical Trials and
Methodology, London, UK
Carolyn Burke The Emmes Company, LLC, Rockville, MD, USA
Kristin M. Burns National Heart, Lung, and Blood Institute, National Institutes of
Health, Bethesda, MD, USA
Marc Buyse International Drug Development Institute (IDDI) Inc., San Francisco,
CA, USA
CluePoints S.A., Louvain-la-Neuve, Belgium and I-BioStat, University of Hasselt,
Louvain-la-Neuve, Belgium
Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat),
Hasselt University, Hasselt, Belgium
Samantha-Jo Caetano Department of Mathematics and Statistics, McMaster Uni-
versity, Hamilton, ON, Canada
Steve Canham European Clinical Research Infrastructure Network (ECRIN),
Paris, France
Ashley Case Emmes, Rockville, MD, USA
Anna Chaimani Université de Paris, Research Center of Epidemiology and Sta-
tistics (CRESS-U1153), INSERM, Paris, France
Cochrane France, Paris, France
Bingshu E. Chen Canadian Cancer Trials Group, Queen’s University, Kingston,
ON, Canada
Victoria Chia Amgen Inc., Thousand Oaks, CA, USA
Emmanuel Chigutsa Pharmacometrics, Eli Lilly and Company, Zionsville, IN,
USA
Babak Choodari-Oskooei MRC Clinical Trials Unit at UCL, Institute of Clinical
Trials and Methodology, London, UK
Joan B. Cobb Pettit Johns Hopkins Bloomberg School of Public Health, Balti-
more, MD, USA
Graham A. Colditz Division of Public Health Sciences, Department of Surgery,
Washington University School of Medicine, Saint Louis, MO, USA
Alan Colley Amgen, Ltd, Cambridge, UK
Julia Collins The Emmes Company, LLC, Rockville, MD, USA
Mark R. Conaway University of Virginia Health System, Charlottesville, VA,
USA
David Couper Department of Biostatistics, Gillings School of Global Public
Health, University of North Carolina, Chapel Hill, NC, USA
Gary R. Cutter Department of Biostatistics, University of Alabama at Birming-
ham, Birmingham, AL, USA
John J. DeBoy Covington & Burling LLP, Washington, DC, USA
James J. Dignam Department of Public Health Sciences, The University of Chi-
cago, Chicago, IL, USA
Márcio A. Diniz Biostatistics and Bioinfomatics Research Center, Samuel Oschin
Cancer Center, Cedars Sinai Medical Center, Los Angeles, CA, USA
Damien Drubay INSERM U1018, CESP, Paris-Saclay University, UVSQ,
Villejuif, France
Gustave Roussy, Service de Biostatistique et d’Epidémiologie, Villejuif, France
Lea Drye Office of Clinical Affairs, Blue Cross Blue Shield Association, Chicago,
IL, USA
Amylou C. Dueck Division of Biomedical Statistics and Informatics, Department
of Health Sciences Research, Mayo Clinic, Scottsdale, AZ, USA
Keren R. Dunn Office of Research Compliance and Quality Improvement, Cedars-
Sinai Medical Center, Los Angeles, CA, USA
Richard Emsley Department of Biostatistics and Health Informatics, King’s Col-
lege London, London, UK
Katrina Epnere WCG Statistics Collaborative, Washington, DC, USA
Ann-Margret Ervin Johns Hopkins Bloomberg School of Public Health, Balti-
more, MD, USA
The Johns Hopkins Center for Clinical Trials and Evidence Synthesis, Johns Hop-
kins University, Baltimore, MD, USA
Theodoros Evrenoglou Université de Paris, Research Center of Epidemiology and
Statistics (CRESS-U1153), INSERM, Paris, France
Valerii V. Fedorov ICON, North Wales, PA, USA
Bart Ferket Ichan School of Medicine at Mount Sinai, New York, NY, USA
Maria Figueroa The Emmes Company, LLC, Rockville, MD, USA
Jane Forrest Frontier Science (Scotland) Ltd, Grampian View, Kincraig, UK
Boris Freidlin Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA
Patricia A. Ganz Jonsson Comprehensive Cancer Center, University of California
at Los Angeles, Los Angeles, CA, USA
Elizabeth Garrett-Mayer American Society of Clinical Oncology, Alexandria,
VA, USA
Jennifer J. Gassman Department of Quantitative Health Sciences, Cleveland
Clinic, Cleveland, OH, USA
Nancy L. Geller National Heart, Lung and Blood Institute, National Institutes of
Health, Bethesda, MD, USA
Stephen L. George Department of Biostatistics and Bioinformatics, Basic Science
Division, Duke University School of Medicine, Durham, NC, USA
Bruce J. Giantonio The ECOG-ACRIN Cancer Research Group, Philadelphia,
PA, USA
Massachusetts General Hospital, Boston, MA, USA
Department of Medical Oncology, University of Pretoria, Pretoria, South Africa
David V. Glidden Department of Epidemiology and Biostatistics, University of
California San Francisco, San Francisco, CA, USA
Ekkehard Glimm Novartis Pharma AG, Basel, Switzerland
Jogarao V. Gobburu Center for Translational Medicine, University of Maryland
School of Pharmacy, Baltimore, MD, USA
Judith D. Goldberg Department of Population Health and Environmental Medi-
cine, New York University School of Medicine, New York, NY, USA
Benjamin A. Goldstein Department of Biostatistics and Bioinformatics, Duke
Clinical Research Institute, Duke University Medical Center, Durham, NC, USA
Julie D. Gottlieb Johns Hopkins University School of Medicine, Baltimore, MD,
USA
Stephen J. Greene Duke Clinical Research Institute, Durham, NC, USA
Division of Cardiology, Duke University School of Medicine, Durham, NC, USA
Gillian Gresham Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai
Medical Center, Los Angeles, CA, USA
Jeff Jianfei Guo Division of Pharmacy Practice and Administrative Sciences,
University of Cincinnati College of Pharmacy, Cincinnati, OH, USA
Gordon H. Guyatt McMaster University, Hamilton, ON, Canada
Olivia Hackworth University of Michigan, Ann Arbor, MI, USA
Tarek Haddad Medtronic Inc, Minneapolis, MN, USA
Susan Halabi Department of Biostatistics and Bioinformatics, Duke University
Medical Center, Durham, NC, USA
Eric Hardter The Emmes Company, LLC, Rockville, MD, USA
Barbara S. Hawkins Johns Hopkins School of Medicine and Bloomberg School of
Public Health, The Johns Hopkins University, Baltimore, MD, USA
Richard J. Hayes Faculty of Epidemiology and Population Health, London School
of Hygiene and Tropical Medicine, London, UK
Adrian F. Hernandez Duke Clinical Research Institute, Durham, NC, USA
Division of Cardiology, Duke University School of Medicine, Durham, NC, USA
Sally Hopewell Centre for Statistics in Medicine, Nuffield Department of Ortho-
paedics, Rheumatology and Musculoskeletal Sciences, University of Oxford,
Oxford, UK
George Howard Department of Biostatistics, University of Alabama at Birming-
ham, Birmingham, AL, USA
Asbjørn Hróbjartsson Cochrane Denmark and Centre for Evidence-Based Med-
icine Odense, University of Southern Denmark, Odense, Denmark
Sam Hsiao Cytel Inc, Cambridge, MA, USA
Jason C. Hsu Department of Statistics, The Ohio State University, Columbus, OH,
USA
Julie Jackson Frontier Science (Scotland) Ltd, Grampian View, Kincraig, UK
Shu Jiang Division of Public Health Sciences, Department of Surgery, Washington
University School of Medicine, Saint Louis, MO, USA
Byron Jones Novartis Pharma AG, Basel, Switzerland
David S. Jones Harvard University, Cambridge, MA, USA
A. L. Jorgensen Department of Health Data Science, University of Liverpool,
Liverpool, UK
John A. Kairalla University of Florida, Gainesville, FL, USA
Eloise Kaizar The Ohio State University, Columbus, OH, USA
Shamir N. Kalaria Center for Translational Medicine, University of Maryland
School of Pharmacy, Baltimore, MD, USA
Michael Kelsh Amgen Inc., Thousand Oaks, CA, USA
Siyoen Kil LSK Global Pharmaceutical Services, Seoul, Republic of Korea
Christopher Kim Amgen Inc., Thousand Oaks, CA, USA
Jonathan Kimmelman Biomedical Ethics Unit, McGill University, Montreal, QC, Canada
Kristin Knust Emmes, Rockville, MD, USA
Edward L. Korn Biometric Research Program, Division of Cancer Treatment and
Diagnosis, National Cancer Institute, Bethesda, MD, USA
Richard Kravitz University of California Davis, Davis, CA, USA
Wen-Hua Kuo National Yang-Ming University, Taipei City, Taiwan
Sabine Landau Department of Biostatistics and Health Informatics, King’s Col-
lege London, London, UK
Jimmy Le National Eye Institute, Bethesda, MD, USA
Justin M. Leach Department of Biostatistics, University of Alabama at Birming-
ham, Birmingham, AL, USA
Michael LeBlanc SWOG Statistical Center, Fred Hutchinson Cancer Research
Center, Seattle, WA, USA
J. Jack Lee Department of Biostatistics, University of Texas MD Anderson Cancer
Center, Houston, TX, USA
Shing M. Lee Department of Biostatistics, Mailman School of Public Health,
Columbia University, New York, NY, USA
Cheng-Shiun Leu Department of Biostatistics, Mailman School of Public Health,
Columbia University, New York, NY, USA
Bruce Levin Department of Biostatistics, Mailman School of Public Health,
Columbia University, New York, NY, USA
Fan Li Department of Biostatistics, Yale University, School of Public Health, New
Haven, CT, USA
Heng Li Center for Devices and Radiological Health, U.S. Food and Drug Admin-
istration, Silver Spring, MD, USA
Liang Li Department of Biostatistics, The University of Texas MD Anderson
Cancer Center, Houston, TX, USA
Tianjing Li Department of Ophthalmology, University of Colorado Anschutz
Medical Campus, Aurora, CO, USA
Amanda Lilley-Kelly Clinical Trials Research Unit, Leeds Institute of Clinical
Trials Research, University of Leeds, Leeds, UK
Yuliya Lokhnygina Department of Biostatistics and Bioinformatics, Duke Univer-
sity, Durham, NC, USA
James P. Long Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Anita M. Loughlin Corrona LLC, Waltham, MA, USA
Zheng Lu Clinical Pharmacology and Exploratory Development, Astellas Pharma,
Northbrook, IL, USA
Tiago M. Magalhães Department of Statistics, Institute of Exact Sciences, Federal
University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
Adrian P. Mander Centre for Trials Research, Cardiff University, Cardiff, UK
Linda Marillo Frontier Science Amherst, Amherst, NY, USA
Barbara K. Martin Administrative Director, Research Institute, Penn Medicine
Lancaster General Health, Lancaster, PA, USA
J. Rosser Matthews General Dynamics Health Solutions, Defense and Veterans
Brain Injury Center, Silver Spring, MD, USA
Madhu Mazumdar Director of Institute for Healthcare Delivery Science, Mount
Sinai Health System, NY, USA
Gina L. Mazza Division of Biomedical Statistics and Informatics, Department of
Health Sciences Research, Mayo Clinic, Scottsdale, AZ, USA
Michael P. McDermott Department of Biostatistics and Computational Biology,
University of Rochester Medical Center, Rochester, NY, USA
Eleanor McFadden Frontier Science (Scotland) Ltd., Kincraig, Scotland, UK
Katie Meadmore University of Southampton, Southampton, UK
Cyrus Mehta Cytel Inc, Cambridge, MA, USA
Harvard T.H. Chan School of Public Health, Boston, MA, USA
Curtis L. Meinert Department of Epidemiology, School of Public Health, Johns
Hopkins University, Baltimore, MD, USA
Catherine A. Meldrum University of Michigan, Ann Arbor, MI, USA
Sreelatha Meleth RTI International, Atlanta, GA, USA
Diana Merino Friends of Cancer Research, Washington, DC, USA
Peter Mesenbrink Novartis Pharmaceuticals Corporation, East Hannover, NJ,
USA
Silvia Metelli Université de Paris, Research Center of Epidemiology and Statistics
(CRESS-U1153), INSERM, Paris, France
Assistance Publique - Hôpitaux de Paris (APHP), Paris, France
Catherine M. Meyers Office of Clinical and Regulatory Affairs, National Institutes of Health, National Center for Complementary and Integrative Health, Bethesda, MD, USA
Stefan Michiels INSERM U1018, CESP, Paris-Saclay University, UVSQ,
Villejuif, France
Gustave Roussy, Service de Biostatistique et d’Epidémiologie, Villejuif, France
Johanna Mielke Novartis Pharma AG, Basel, Switzerland
J. Philip Miller Division of Biostatistics, Washington University School of Med-
icine in St. Louis, St. Louis, MO, USA
Reza D. Mirza Department of Medicine, McMaster University, Hamilton, ON,
Canada
David Moher Centre for Journaology, Clinical Epidemiology Program, Ottawa
Hospital Research Institute, Canadian EQUATOR centre, Ottawa, ON, Canada
Lawrence H. Moulton Departments of International Health and Biostatistics,
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
Rajat Mukherjee Cytel Inc, Cambridge, MA, USA
Patrick Onghena Faculty of Psychology and Educational Sciences, KU Leuven,
Leuven, Belgium
Jamie B. Oughton Clinical Trials Research Unit, Leeds Institute of Clinical Trials
Research, University of Leeds, Leeds, UK
Matthew J. Page School of Public Health and Preventive Medicine, Monash
University, Melbourne, VIC, Australia
Yuko Y. Palesch Data Coordination Unit, Department of Public Health Sciences,
Medical University of South Carolina, Charleston, SC, USA
Olympia Papachristofi London School of Hygiene and Tropical Medicine, Lon-
don, UK
Clinical Development and Analytics, Novartis Pharma AG, Basel, Switzerland
Mahesh K. B. Parmar MRC Clinical Trials Unit at UCL, Institute of Clinical
Trials and Methodology, London, UK
Wendy R. Parulekar Canadian Cancer Trials Group, Queen’s University, Kings-
ton, ON, Canada
Gail D. Pearson National Heart, Lung, and Blood Institute, National Institutes of
Health, Bethesda, MD, USA
Victoria L. Pemberton National Heart, Lung, and Blood Institute, National Insti-
tutes of Health, Bethesda, MD, USA
Michelle Pernice Dynavax Technologies Corporation, Emeryville, CA, USA
Julien Peron CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Université Lyon 1, France
Departments of Biostatistics and Medical Oncology, Centre Hospitalier Lyon-Sud,
Institut de Cancérologie des Hospices Civils de Lyon, Lyon, France
Gina R. Petroni Translational Research and Applied Statistics, Public Health
Sciences, University of Virginia Health System, Charlottesville, VA, USA
David Petullo Division of Biometrics II, Office of Biostatistics Office of Transla-
tional Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug
Administration, Silver Spring, MD, USA
Patrick P. J. Phillips UCSF Center for Tuberculosis, University of California San
Francisco, San Francisco, CA, USA
Department of Epidemiology and Biostatistics, University of California San
Francisco, San Francisco, CA, USA
Steven Piantadosi Department of Surgery, Division of Surgical Oncology,
Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Scott H. Podolsky Harvard Medical School, Boston, MA, USA
Gregory R. Pond Department of Oncology, McMaster University, Hamilton, ON,
Canada
Ontario Institute for Cancer Research, Toronto, ON, Canada
Philip C. Prorok Division of Cancer Prevention, National Cancer Institute,
Bethesda, MD, USA
Michael A. Proschan National Institute of Allergy and Infectious Diseases,
Bethesda, MD, USA
Eric Riley Frontier Science (Scotland) Ltd., Kincraig, Scotland, UK
Karen A. Robinson Department of Medicine, Johns Hopkins University, Balti-
more, MD, USA
Emily Robison Optum Labs, Las Vegas, NV, USA
Frank W. Rockhold Department of Biostatistics and Bioinformatics, Duke Clini-
cal Research Institute, Duke University Medical Center, Durham, NC, USA
Sergei Romashkan National Institutes of Health, National Institute on Aging,
Bethesda, MD, USA
Wayne Rosamond Department of Epidemiology, Gillings School of Global Public
Health, University of North Carolina, Chapel Hill, NC, USA
William F. Rosenberger Biostatistics Center, The George Washington University,
Rockville, MD, USA

Patrick Royston MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and
Methodology, London, UK
Estelle Russek-Cohen Office of Biostatistics, Center for Drug Evaluation and
Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
Laurie Ryan National Institutes of Health, National Institute on Aging, Bethesda,
MD, USA
Anna Sadura Canadian Cancer Trials Group, Queen’s University, Kingston, ON,
Canada
Ian J. Saldanha Department of Health Services, Policy, and Practice and Depart-
ment of Epidemiology, Brown University School of Public Health, Providence, RI,
USA
Amber Salter Division of Biostatistics, Washington University School of Medi-
cine in St. Louis, St. Louis, MO, USA
Marc D. Samsky Duke Clinical Research Institute, Durham, NC, USA
Division of Cardiology, Duke University School of Medicine, Durham, NC, USA
Frank J. Sasinowski University of Rochester School of Medicine, Department of
Neurology, Rochester, NY, USA
Willi Sauerbrei Institute of Medical Biometry and Statistics, Faculty of Medicine
and Medical Center - University of Freiburg, Freiburg, Germany
Roberta W. Scherer Department of Epidemiology, Johns Hopkins Bloomberg
School of Public Health, Baltimore, MD, USA
Pamela E. Scott Office of the Commissioner, U.S. Food and Drug Administration,
Silver Spring, MD, USA
Nicholas J. Seewald University of Michigan, Ann Arbor, MI, USA
Praharsh Shah University of Pennsylvania, Philadelphia, PA, USA
Linda Sharples London School of Hygiene and Tropical Medicine, London, UK
Pamela A. Shaw University of Pennsylvania Perelman School of Medicine, Phil-
adelphia, PA, USA
Dikla Shmueli-Blumberg The Emmes Company, LLC, Rockville, MD, USA
Ellen Sigal Friends of Cancer Research, Washington, DC, USA
Ida Sim Division of General Internal Medicine, University of California San
Francisco, San Francisco, CA, USA
Richard Simon R Simon Consulting, Potomac, MD, USA
Jennifer Smith Sunesis Pharmaceuticals Inc, San Francisco, CA, USA
Alfred Sommer Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
Mark Stewart Friends of Cancer Research, Washington, DC, USA
Inna Strakovsky University of Pennsylvania, Philadelphia, PA, USA
Oleksandr Sverdlov Novartis Pharmaceuticals Corporation, East Hannover, NJ,
USA
Michael J. Sweeting Department of Health Sciences, University of Leicester,
Leicester, UK
Department of Public Health and Primary Care, University of Cambridge, Cam-
bridge, UK
Matthew R. Sydes MRC Clinical Trials Unit at UCL, Institute of Clinical Trials
and Methodology, London, UK
Szu-Yu Tang Roche Tissue Diagnostics, Oro Valley, AZ, USA
Catherine Tangen SWOG Statistical Center, Fred Hutchinson Cancer Research
Center, Seattle, WA, USA
Mourad Tighiouart Cedars-Sinai Medical Center, Los Angeles, CA, USA
Guangyu Tong Department of Sociology, Duke University, Durham, NC, USA
Xiao Tong Clinical Pharmacology, Biogen, Boston, MA, USA
Alison Urton Canadian Cancer Trials Group, Queen’s University, Kingston, ON,
Canada
Diane Uschner Department of Statistics, George Mason University, Fairfax, VA,
USA
James E. Valentine University of Maryland Carey School of Law, Baltimore, MD,
USA
Ben Van Calster Department of Development and Regeneration, KU Leuven,
Leuven, Belgium
Department of Biomedical Data Sciences, Leiden University Medical Center, Lei-
den, The Netherlands
S. Swaroop Vedula Malone Center for Engineering in Healthcare, Whiting School
of Engineering, The Johns Hopkins University, Baltimore, MD, USA
Jenifer H. Voeks Department of Neurology, Medical University of South Carolina,
Charleston, SC, USA
Sunita Vohra University of Alberta, Edmonton, AB, Canada
Annie X. Wang Covington & Burling LLP, Washington, DC, USA
Hechuan Wang Center for Translational Medicine, University of Maryland School of Pharmacy, Baltimore, MD, USA
J. Wason Population Health Sciences Institute, Newcastle University, Newcastle
upon Tyne, UK
MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
Claire Weber Excellence Consulting, LLC, Moraga, CA, USA
Heidi L. Weiss Biostatistics and Bioinformatics Shared Resource Facility, Markey
Cancer Center, University of Kentucky, Lexington, KY, USA
Pascale Wermuth Basel, Switzerland
Winifred Werther Center for Observational Research, Amgen Inc, South San
Francisco, CA, USA
Matthew Westmore University of Southampton, Southampton, UK
Graham M. Wheeler Imperial Clinical Trials Unit, Imperial College London,
London, UK
Cancer Research UK & UCL Cancer Trials Centre, University College London,
London, UK
O. Dale Williams Department of Biostatistics, University of North Carolina,
Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL,
USA
Janet Wittes Statistics Collaborative, Inc, Washington, DC, USA
Jianrong Wu Biostatistics and Bioinformatics Shared Resource Facility, Markey
Cancer Center, University of Kentucky, Lexington, KY, USA
Samuel S. Wu University of Florida, Gainesville, FL, USA
Sharon D. Yeatts Data Coordination Unit, Department of Public Health Sciences,
Medical University of South Carolina, Charleston, SC, USA
Lauren Yesko Emmes, Rockville, MD, USA
Gui-Shuang Ying Center for Preventive Ophthalmology and Biostatistics, Depart-
ment of Ophthalmology, Perelman School of Medicine, University of Pennsylvania,
Philadelphia, PA, USA
Jessica L. Yoos University of Pennsylvania, Philadelphia, PA, USA
Qilu Yu Office of Clinical and Regulatory Affairs, National Institutes of Health,
National Center for Complementary and Integrative Health, Bethesda, MD, USA
Fan-fan Yu Statistics Collaborative, Inc., Washington, DC, USA

Lilly Q. Yue Center for Devices and Radiological Health, U.S. Food and Drug
Administration, Silver Spring, MD, USA
Rachel Zahigian Vertex Pharmaceuticals, Boston, MA, USA
Lijuan Zeng Statistics Collaborative, Inc., Washington, DC, USA
Xiaobo Zhong Ichan School of Medicine at Mount Sinai, New York, NY, USA
Tracy Ziolek University of Pennsylvania, Philadelphia, PA, USA
Part I
Perspectives on Clinical Trials
1 Social and Scientific History of Randomized Controlled Trials
Laura E. Bothwell, Wen-Hua Kuo, David S. Jones, and Scott H. Podolsky

Contents
Introduction
Early History of Clinical Trials
Refining Trial Methods in the Early Twentieth Century
The Role of Governments in the Institutionalization of Randomized Controlled Trials
Historical Trial Ethics
RCTs and Evidence-Based Medicine
The Globalization of RCTs and the Challenges of Similarities and Differences in Global Populations
Social and Scientific Challenges in Randomized Controlled Trials
Summary and Conclusion
Key Facts
Cross-References
References

L. E. Bothwell (*)
Worcester State University, Worcester, MA, USA
e-mail: [email protected]
W.-H. Kuo
National Yang-Ming University, Taipei City, Taiwan
e-mail: [email protected]
D. S. Jones
Harvard University, Cambridge, MA, USA
e-mail: [email protected]
S. H. Podolsky
Harvard Medical School, Boston, MA, USA
e-mail: [email protected]


Abstract
The practice and conceptual foundations of randomized controlled trials have
been changed both by societal forces and by generations of investigators com-
mitted to applying rigorous research methods to therapeutic evaluation. This
chapter briefly discusses the emergence of key trial elements such as control
groups, alternate allocation, blinding, placebos, and finally randomization. We
then explore how shifting intellectual, social, political, economic, regulatory,
ethical, and technological forces have shaped the ways that RCTs have taken
form, the types of therapies explored, the ethical standards that have been
prioritized, and the populations included in studies. This history has not been a
simple, linear march of progress. We also highlight key challenges in the histor-
ical use of RCTs and the more recent expansion of concerns regarding competing
commercial interests that can influence trial design. As investigators continue to
advance the rigor of controlled trials amid these challenges, exploring the influ-
ence of historical contexts on clinical trial development can help us to understand
the forces that may impact trials today.

Keywords
History · Randomized controlled trial · Control groups · Fair allocation · Policy ·
Regulations · Ethics · Globalization · Ethnicity · Clinical trial

Introduction

Since the mid-twentieth century, clinical researchers have increasingly deployed ran-
domized controlled trials (RCTs) in efforts to improve the reliability and objectivity of
medical knowledge. RCTs have come to serve as authoritative standards of evidence for
the evaluation of experimental drugs and therapies, the remuneration for medical
interventions by insurance companies and governmental payers, and the evaluation of
an increasingly diverse range of social and policy interventions, from educational
programs to injury prevention campaigns. Yet, as researchers have increasingly relied
on the RCT as an evidentiary “gold standard,” critics have also identified myriad
challenges. This chapter highlights and adds to historical explorations of how RCTs
have come to serve such prominent roles in modern scientific (particularly clinical)
knowledge, considering the intellectual, social, political, economic, regulatory, ethical,
and technological contexts of this history. We also examine the history of the enduring
social and scientific challenges in the conduct, interpretation, and application of RCTs.

Early History of Clinical Trials

Trials comparing intervention and control groups are as old as the historical record
itself, appearing in the Hebrew Bible and in texts from various societies around the
world, albeit sporadically, for centuries (Lilienfeld 1982). The tenth-century Persian physician, Abu Bakr Muhammad ibn Zakariyya al-Razi, has been celebrated for conducting empirical experiments with control and intervention groups
testing contemporaneous medical practices such as bloodletting as a prevention for
meningitis (Tibi 2006). In the eighteenth century, Scottish surgeon James Lind
demonstrated the efficacy of citrus fruits over five alternative treatments for scurvy
among groups of sailors by following their responses to the ingested substances
under controlled conditions (Milne 2012). Loosely controlled trials, often
conducted by skeptics, increasingly appeared in the eighteenth and nineteenth
centuries to test therapies ranging from mesmerism to homeopathy to venesection
(Tröhler 2000).
These trials remained relatively scattered and dwarfed in the literature by case
reports that doctors published of their experiences with individual patients. Early
controlled trials had little apparent impact on therapeutic practice. Indeed, medical
epistemology through the nineteenth century tended to privilege the belief that
patients should be treated on an individual basis and that disease experiences were
not easily comparable among different patients (Warner 1986).
However, major shifts in the social and scientific structure of medicine in the late
nineteenth and early twentieth centuries created new opportunities and demands for
more rigorous clinical research methods. Hospitals expanded, providing settings for
more clinical researchers to compare treatment effects among numerous patients
simultaneously. Germ theory and developments in physiology and chemistry pro-
vided the stimulus for researchers to produce new vaccines and drugs that had never
been tested in patients. Charlatans also sought to capitalize from this wave of
discovery and innovation by marketing a host of poorly tested proprietary drugs of
dubious effectiveness. All these factors motivated scrupulous or skeptical clinical
investigators to pursue more sophisticated approaches to evaluate experimental
therapies. Simultaneously, public health researchers expanded their use of statistics,
bolstering empiricism in health research overall (Bothwell and Podolsky 2016;
Bothwell et al. 2016).
Among those interested in empirically testing the efficacy of remedies, the
question of controlling for the bias or enthusiasm of the individual arose, along
with related concerns about basing scientific knowledge on clinicians’ reports of
experiences with individual patients. In response, by the end of the nineteenth
century, several medical societies launched “collective investigations” that amal-
gamated numerous practitioners’ experiences using remedies among different
patients. The method was employed, for example, by the American Pediatric Society
in its 1896 evaluation of diphtheria antiserum, which incorporated input from 613
clinicians in 114 cities and towns and 3384 cases. The study demonstrated a 13%
mortality rate among treated patients (4.9% when treated on the first day of symp-
toms), far below the expected mortality baseline. This contributed to the uptake of
the remedy (Marks 2006). Still, some within the medical profession critiqued
“collective investigation” as an insufficiently standardized research method, while
numerous practicing clinicians complained that the method was a potentially elitist
infringement upon their patient care prerogatives. This dynamic would prove to be
an enduring tension between the clinical art of individualized patient care and an
aspiration to a generalizable medical science (Warner 1991).

Refining Trial Methods in the Early Twentieth Century

As medical research overall continued to become more empirical and scientific in the
early twentieth century, some researchers began to test remedies in humans much as
they would in the laboratory. They began to employ “alternate allocation” studies,
treating every other patient with a novel remedy, withholding it from the others, and
comparing outcomes. Dozens of alternate allocation studies appeared in the medical
literature in the early twentieth century (Chalmers et al. 2012; Podolsky 2015).
Reflecting the major threat of infectious diseases during this era, the majority of
alternate allocation trials assessed anti-infective therapies. For example, Waldemar
Haffkine, Nasarwanji Hormusji Choksy, and their colleagues conducted investigations
of plague remedies in India in the 1900s, German Adolf Bingel performed a double-
blinded study of anti-diphtheria antiserum in the 1910s, and a series of American
researchers investigated anti-pneumococcal antiserum in the 1920s. The researchers
who conducted these alternate allocation trials also introduced varying degrees of
statistical sophistication in their assessments of outcomes, ranging from simple quan-
titative comparisons and impressionistic judgments to the far rarer use of complex
biometric evaluations and tests of statistical significance (Podolsky 2006, 2009).
Many researchers espoused ethical hesitations toward designating control groups
in trials, as they often had more faith in the experimental treatment than the control
treatment, and therefore felt that it was unethical to allocate patients to a control
group. This was exemplified by Rufus Cole and his colleagues at the Hospital of the
Rockefeller Institute during the development of anti-pneumococcal antiserum. After
convincing themselves of the utility of antiserum based on early case series, the
researchers “did not feel justified as physicians in withholding a remedy that in our
opinion definitely increased the patient’s chances of recovery” (Cole, as cited in
Podolsky 2006). Amid a culture in medicine that tended to give clinical experimen-
tation less emphasis than physicians’ beliefs, values, and individual experiences
regarding treatment efficacy, clinical trials remained overshadowed in the pre-World
War II era by research based on a priori mechanistic justifications and case series, as
well as laboratory and animal studies.
Despite the minimal uptake of clinical trials with control groups, however,
those who saw trials as the optimal means of adjudicating therapeutic efficacy
became more sophisticated in their attempts to minimize biased assessments and
ensure fair allocation of patients to active versus control groups. Regarding bias,
patient suggestibility had long been acknowledged, with numerous researchers
employing sham treatments in the assessments of eighteenth- and nineteenth-
century unorthodox interventions like mesmerism (in France) and homeopathy
(in America, as well as in Europe) (Kaptchuk 1998; Podolsky et al. 2016). By the
early decades of the twentieth century, investigators began to increasingly use
sham control groups in their assessments of conventional pharmaceuticals, with
Cornell’s Harry Gold and colleagues using the existing clinical term “placebo” to
describe such sham control remedies in their assessment of xanthines for the
chest pain characteristic of angina pectoris in the 1930s (Gabriel 2014; Podolsky
et al. 2016; Shapiro and Shapiro 1997).
Attempts to minimize researcher bias were grounded in the recognition that researchers were apt to see what they hoped or expected to find. Researchers in
the late nineteenth and early twentieth century often acknowledged the presence
of this “personal equation” in clinical research (Podolsky et al. 2016). Such
concerns led to periodic efforts to “blind” or mask research observers as to
whether or not a given subject had been exposed to an experimental agent,
with the term “blinding” first appearing in the medical literature in the 1910s
(Shapiro and Shapiro 1997). By the 1950s, Harry Gold formally placed concerns
over both research subject suggestibility and research observer enthusiasm into
the same phrase, coining the term “double-blind” and stating that “the whole
history of therapeutics, especially that having to do with the action of drugs on
subjective symptoms, demonstrates that the verdict of one study is frequently
reversed by another unless one takes measures to rule out the psychic effect of a
medication on the patient and the unconscious bias of the doctor” (Gold, as cited
in Podolsky et al. 2016).
Attempts to further ensure fair allocation of patients to active experimental
groups versus control groups with placebos or existing methods of care captured
international attention. Among investigators in the United States who had
conducted alternate allocation studies of anti-pneumococcal antiserum, some felt
that the bias of well-intended researchers could lead to their cheating the alternation
scheme (e.g., by assigning sicker patients to the active treatment group). Britain’s
Medical Research Council (MRC), aware of such US studies, conducted its own
assessment of anti-pneumococcal antiserum by the early 1930s. And when the
statistician Austin Bradford Hill was asked to evaluate this series of trials, he
likewise grew suspicious of such cheating. In designing a major, groundbreaking
RCT – the 1948 MRC assessment of streptomycin for tuberculosis – Bradford Hill
thus replaced alternate allocation with the strictly concealed randomization of
patients to treatment or control groups in order to prevent researchers from
interfering with the allocation of patients. This was not the first use of randomiza-
tion in a clinical trial, but it represented a turning point at which the RCT began to
emerge as a major method of clinical investigation (Bothwell and Podolsky 2016;
Chalmers 2005).
By the 1950s, Bradford Hill and his contemporaries implemented a number of
large-scale RCTs, particularly evaluating therapies for tuberculosis. Moreover, as the
post-World War II pharmaceutical industry began to produce and market a widening
array of remedies ranging from antibiotics and antipsychotics to steroids and minor
tranquilizers, pioneering clinical pharmacologists like Harry Gold, Henry Beecher,
and Louis Lasagna joined statisticians like Bradford Hill and Donald Mainland in
advocating for the need for clinical investigative rigor. They argued in multiple
settings that the emerging controlled clinical trial methodology provided the best
way to distinguish useful from useless novel drugs (Marks 1997, 2000). Throughout
the 1950s and 1960s, formal involvement of statisticians and statistical input into
design and analysis became a larger part of major pharmaceutical and clinical
investigations, a key component of the “triumph of statistics” in medicine (Porter
1996; Marks 1997).

The Role of Governments in the Institutionalization of Randomized Controlled Trials

In the 1950s and 1960s, the British and US governments alike spearheaded heavy
investments in academic medical research institutions. The British MRC, for
instance, played a key role in the institutionalization of the controlled clinical trial
(Lewontin 2008; Timmermann 2008). Academic clinical trials expanded substan-
tially in these countries in part through this support and a political culture of
investment in scientific research and institution-building in medicine (Bothwell
2014). Jonas Salk’s polio vaccine trial drew broad scientific interest, as did the
National Cancer Institute’s clinical trial expansion (Meldrum 1998; Keating and
Cambrosio 2012). As the 1950s also witnessed strong growth in industrial drug
research and development, some companies collaborated with public sector
researchers in devising clinical trials (Gaudilliere and Lowy 1998; Marks 1997).
Some surgeons also adopted the technique, initiating a series of randomized con-
trolled trials in the 1950s (Bothwell and Jones 2019).
Still, without a regulatory mandate to conduct rigorous trials, seemingly “well-
controlled” clinical studies remained a small proportion of clinical investigations in
the 1950s. For example, a 1951 study by Otho Ross of 100 articles entailing
therapeutic assessment in 5 leading American medical journals found that only
27% were “well controlled,” with 45% employing no controls at all (Ross 1951).
Within two decades of Ross’ evaluation, however, the US Food and Drug
Administration (FDA) established regulations that would dramatically shape the
subsequent history of RCTs (Carpenter 2010). The US federal government had
been gradually building and clarifying the FDA’s power to regulate drug safety
and efficacy since requiring accurate drug labeling with the Pure Food and
Drug Act of 1906. As the pharmaceutical industry burgeoned in the 1950s, the
scientific community and regulators observed with troubling frequency the use
of unproven, ineffective, and sometimes dangerous drugs that had not been
adequately tested before companies promoted their benefits. Yet the FDA
lacked the necessary statutory authority to strengthen testing requirements for
drug efficacy and safety. This changed following an international drug safety
crisis in 1961 in which the inadequately vetted sedative thalidomide was found
to cause stillbirths or devastating limb malformations among infants of women
who had taken the drug for morning sickness during pregnancy (Carpenter
2010). This took place just as Senator Estes Kefauver was in the midst of
extensive hearings and legislative negotiating regarding the excesses of phar-
maceutical marketing and the inability of the FDA to formally adjudicate drug
efficacy. Broad public concern galvanized the political support necessary in
1962 for the passage of the Kefauver-Harris amendments to the Federal Food,
Drug, and Cosmetic Act. These established a legal mandate for the FDA to
require drug producers to evaluate their products in “adequate and well-con-
trolled investigations, including clinical investigations, by experts qualified by
scientific training and experience to evaluate the effectiveness of the drug
involved” (FDA 1963).

By 1970, after prevailing in a legal battle with Upjohn Pharmaceuticals over the
methodological requirements of drug safety and efficacy studies, the FDA
established that RCTs (ideally, double-blinded, placebo-controlled) should be car-
ried out to fulfill the mandate of “adequate and well-controlled” drug studies. With
this decision, the FDA formally placed RCTs at the regulatory and conceptual center
of drug evaluation in America (Carpenter 2010; Podolsky 2015). While the FDA
seems to have spearheaded this regulatory specification for RCTs in part as a result of
a litigious culture in the American pharmaceutical industry, the global scientific and
regulatory community had come to a general consensus on the public health benefits
of high standards for drug trials. Regulators in Japan and the European Union soon
established similar trial requirements. As it worked to comply with these regulations,
the pharmaceutical industry, which had grown substantially since World War II,
became a major international sponsor of RCTs (Bothwell et al. 2016). By the 1990s,
industry replaced national governments as the leading funder of RCTs: governments
continued to fund substantial numbers of RCTs, but the sheer volume of pharma-
ceutical studies led to a larger proportion of overall published RCTs reporting drug
company funding than any other source. Pharmaceutical research grew more rigor-
ous in this process, but critics also raised concerns about conflicts of interest and the
shaping of biomedical knowledge through industry-sponsored trials, a problem that
has persisted in different manifestations in ensuing decades (as described later in this
chapter) (Bothwell 2014).

Historical Trial Ethics

New governmental policies also substantively influenced the ethical standards


commonly held for RCTs. The thalidomide crisis was among a series of debacles
drawing public ire over patient safety and protections in medical research. In the
early to mid-1960s, Maurice Pappworth in the United Kingdom and Henry Beecher
in the United States shed light on numerous ethically scandalous studies that had
apparently been conducted without the informed consent of research subjects. Broad
societal dismay about the lack of patient protections in clinical research prompted
legislators to empower regulatory, ethical, and scientific leaders to create new
policies to govern the ethics of research (Bothwell 2014; Jones et al. 2016).
For instance, in 1964, the World Medical Association, an international confeder-
ation of medical associations founded after World War II, established the
Declaration of Helsinki, including principles of informed consent and research
subject protections previously outlined in the 1948 Nuremberg Code. But this
Declaration lacked any enforcement mechanism. As the field of bioethics developed
in the late 1960s and 1970s, growing numbers of philosophers, social scientists, and
other non-clinical researchers drew further attention to ethical concerns in trials, such
as reliance on vulnerable populations in research, including children, the elderly and
infirm, racial minorities, prisoners, and people with disabilities. Ethicists argued that
informed consent was insufficient in protecting the rights of vulnerable groups
whose consent was often provided under social, economic, or physiological
constraints that impeded their ability to either fully understand or freely and inde-
pendently elect to participate in trials. They contended that external review of
research was thus crucial to ensure that study designs were fair and informed consent
would be meaningfully achieved (Bothwell 2014). Clinical research directors and
investigators themselves also increasingly recognized the legal and ethical need for
expanded policies on peer review of study protocol ethics (Stark 2011).
All of these concerns escalated in the early 1970s as more research scandals came
to public light. Scientists, ethicists, and the public reeled when news broke of the 40-
year Tuskegee study of untreated syphilis among African American men. Investiga-
tors deceived study participants and withheld treatment long after antibiotics had
become available to cure the disease. In response to outcry over this tragedy, the US Department of Health, Education, and Welfare issued Title 45 Code of Federal Regulations, Part 46, in 1974, clarifying new ethical guidelines, formalizing institutional
review boards, and expanding their use in clinical trials and other human subjects
research. Since the United States was a leading global sponsor of RCTs at this time,
these ethical requirements had a sizable impact on the conduct of RCTs overall
(Bothwell 2014).
As RCT use expanded, ethicists began to clarify core challenges specifically
related to randomized allocation of patients to treatments. Critics continued to
raise concerns that withholding a promising treatment from patients simply in the
name of methodological rigor prioritized scientific advancement over patient care.
They argued that RCTs were not necessarily in the best short-term interests of
patients, since patient allocation to control and intervention arms could prevent
clinicians from fulfilling their obligations to administer what they believed to be
promising experimental therapies to all patients (Bothwell and Podolsky 2016).
Proponents of RCTs countered that randomized allocation to experimental and
control groups was essential to determine whether promising experimental treat-
ments would live up to the hopes of their proponents, or whether they would prove
less effective or be accompanied by unacceptable adverse events (Bradford Hill
1963). Growing numbers of researchers favored the latter stance, and in subsequent
decades, the notion was formalized as the principle of equipoise. This principle
stipulated that in situations of genuine uncertainty over whether a new treatment is
superior to the existing treatment, it is ethically acceptable for physicians to ran-
domly assign patients to either control or intervention arms (Freedman 1987).
Critics of the principle of equipoise noted that investigators often did not possess
a state of genuine uncertainty regarding whether an experimental treatment was
preferable to an existing treatment. Rather, researchers often had a sense that the
experimental treatment was favorable based on early case series or pilot studies.
Responding to this ethical confusion, Benjamin Freedman proposed the more
specific principle of “clinical equipoise” in 1987, stipulating that investigators may
continue to randomly allocate patients to different arms in trials only when there is
genuine uncertainty or honest professional disagreement among a community of
expert practitioners as to which treatment in a trial is preferable. According to this
interpretation, carefully conducted RCTs often would be necessary to determine the
actual efficacy and adverse events associated with medical interventions (Freedman
1987). Most researchers now accept the rationale of clinical equipoise. Yet, among
each new generation of investigators there are those with ethical hesitations about
allocating trial participants to arms thought to be inferior.

RCTs and Evidence-Based Medicine

RCTs continued to expand in the closing decades of the twentieth century as part of
broader trends toward more quantitative and empirical research methods in medi-
cine. New technologies continuously broadened the evidentiary foundation from
which medical knowledge could be developed. The introduction of computers into
medical investigations in the 1960s and 1970s increased researchers’ efficiency in
collecting and processing large quantities of data from multiple study sites, facili-
tating the conduct of RCTs and the dissemination of trial results. Alongside the
growing availability of data, critics increasingly questioned medical epistemology
for relying too heavily on theory and expert opinion without evidence from con-
trolled clinical experiments on sufficiently large numbers of patients. In all
fields of medicine, critics deployed RCTs to assess both new, experimental treat-
ments and existing therapies that had become widespread despite never having been
rigorously tested. Numerous RCTs revealed that popular medical interventions were
ineffective or even harmful, leading to their discontinuation (Cochrane 1972).
By the early 1980s, scientists widely considered RCTs the “gold standard” of
clinical research (Jones and Podolsky 2015). Large-scale multi-site RCTs grew
exponentially in the published literature and were highly influential in medical
knowledge and clinical research methodologies (Bothwell 2014). Academic pro-
grams also developed to explore and critique empirical research methods. In the
1990s, Canadian medical researchers Gordon Guyatt and David Sackett coined the
term “evidence-based medicine” to refer to the application of current best evidence
to decisions about individual patient care. Advocates of evidence-based medicine
developed a pyramid illustrating a general hierarchy of research design quality, with
expert opinion and case reports at the lowest level, various observational designs at
intermediate levels, and RCTs at the pinnacle as the optimal study design. RCTs, in
turn, could be incorporated into meta-analyses and systematic reviews. In 1993, Iain Chalmers led the creation of the Cochrane Collaboration (now called Cochrane),
an international organization designed to conduct systematic reviews by synthesiz-
ing large quantities of medical research evidence in order to inform clinical decision-
making (Daly 2005). Internet expansion also facilitated wider access to information
on evidence-based medicine. In 2000, the NIH established ClinicalTrials.gov, a
publicly accessible online registry of clinical trials, with concomitant and now
legal requirements to register trials before initiation. Recent legislation also requires
the reporting of results from all registered trials on the site, regardless of outcome, so
that physicians, scientists, and patients can access more complete data from
unpublished trials. This has provided a counterpoint to publication bias while also
allowing valuable comparisons between predefined and published trial endpoints.
The database has been a powerful tool to shift the power imbalance between those
who conduct RCTs and those who use their results.

The Globalization of RCTs and the Challenges of Similarities and Differences
in Global Populations

The recent history of RCTs and evidence-based medicine has been characterized in
part by expanded trial globalization. This has often reflected commercial interests
and has raised new ethical, political, and regulatory questions. Pharmaceutical
companies want to complete increasingly demanding and complicated RCTs as
quickly as possible. They have realized that they can do so by recruiting sufficient
numbers of subjects on an international scale (Kuo 2008). Since the late 1970s,
contract research organizations (CROs), now a multibillion-dollar annual industry,
have grown to serve this demand. As for-profit entities, CROs have been broadly
critiqued for questionable practices such as offshoring growing numbers of trials to
middle-income countries, oftentimes studying fairly homogeneous demographic
groups, a move which has raised skepticism regarding their ability to measure
treatment effects among diverse patients. CROs have also targeted research settings
with looser regulatory oversights and weaker systems of institutional ethical review,
raising concerns about the rights of research subjects. Access to a tested treatment
after a trial ends has also been a critical ethical concern (Petryna 2009).
At the same time, health policymakers hope to create a harmonized regulatory
platform to reduce redundant clinical trials and broaden accessibility to the latest
medications for the people who need them. Policy initiatives arose in the early 1980s
to address this from different perspectives and on different levels. The World Health
Organization’s (WHO) International Conference of Drug Regulatory Authorities
(ICDRA) was the first attempt to establish common regulations to help drug regu-
latory authorities of WHO member states strengthen collaboration and exchange
information. Other regional and bilateral harmonization efforts were motivated by
commercial concerns. For example, pharmaceuticals were selected as a topic for
trade negotiation at the first US-Japan Market-Oriented Sector-Selective talks in 1986
because of the potential for higher sales of pharmaceuticals in Japan, then the second-largest
national market in the world. It was followed by expert meetings on the technical
requirements for drug approval, including RCT designs (Kuo 2005).
The International Conference on Harmonisation of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) was created in 1990. Initially
designed to incorporate Japan into global pharmaceutical markets with fewer regu-
latory hurdles, the ICH is a communication platform to accelerate the harmonization
of pharmaceutical regulations. It started with only members from the United States,
the European Union (EU), and Japan and carefully limited its working scope to
technical issues. The outcomes were guidelines for safety, efficacy, quality, and drug
labeling integrated into the regulations of each participating country/region after the
ICH reached consensus. ICH guidelines thus further established global recognition
for RCTs and helped to standardize approaches for the generation of medical
evidence (Kuo 2005).
Following Japan, other East Asian states quickly recognized the importance of
the ICH and began following ICH guidelines. Korea and Taiwan aggressively
established sizable regulatory agencies and national centers for clinical trials. Even
Japan made infrastructural changes to reform clinical trial protocols and use common
technical documents (Chikenkokusaikakenkyukai [study group on the globalization
of clinical trials] 2013). In 1999, the ICH founded the Global Cooperation Group
(GCG) to serve as a liaison to other countries affected by these guidelines, but it did
not permit policy contributions from non-ICH member regions. It was not until 2010
that the ICH opened technical working groups to active participation from experts in
non-ICH member regions and countries of the ICH GCG. In addition to the founding
members of Japan, the United States, and the EU, the ICH invited five additional
regulatory members – Brazil, China, Korea, Singapore, and Taiwan.
The process of incorporating East Asian RCTs into ICH standards raised issues
concerning the generalizability of research findings. Researchers and policymakers
hoped to establish clinical trial designs and good clinical practices (GCP) that would
clarify the influence of ethnic factors on any physiological and behavioral differ-
ences in trial test populations in East Asian RCTs. In the end, a technically vague
concept of “bridging” was created to make sense of how to extend the applicability
of clinical data to a different ethnic population by conducting additional trials with
fewer subjects than originally required by local authorities (Kuo 2009, 2012). It is
important to note that several ICH guidelines deal with additional differences among
subjects (such as age), but the extent of these guidelines varies. For example, the ICH
sets no independent guideline regarding inclusion or measurement of gender in
clinical trials.

Social and Scientific Challenges in Randomized Controlled Trials

While RCTs have continued to evolve and grow more standardized and globally
inclusive, social, economic, political, and internal scientific challenges have contin-
ued to complicate both the application of RCTs and the construction of evidence-
based medicine (Bothwell et al. 2016; Timmermans and Berg 2003). Additionally,
critics have identified the growing impact of commercial interests on the overall
ecology of medical evidence. As the pharmaceutical industry has sponsored growing
numbers of RCTs since the late 1960s, it has tended to sponsor trials of drugs with
substantial potential for use among wealthier populations rather than prioritizing
treatments that can transform global public health, such as antibiotics or vaccines for
infectious diseases endemic to low-income regions (Bothwell et al. 2016). Pharma-
ceutical sponsors also have expanded markets by strategically deploying RCTs to
establish new drug indications for existing products through trials that claim slightly
new therapeutic niches, rather than developing innovative original therapies
(Matheson 2017). Researchers conducting industry-sponsored trials have been cri-
tiqued for being more susceptible to bias, as comparative analyses have revealed that
industry-sponsored trials are more likely to reveal outcomes favoring the product
under investigation than publicly funded trials (Bourgeois et al. 2010). Critics have
noted that some industry-funded researchers have designed trials in ways that are
more likely to reveal treatment effects by selecting narrow patient populations likely
to demonstrate favorable results, rather than patients who represent a drug’s ultimate
target population (Petryna 2009).
Growing interest in assessing treatments in applied clinical settings has also given
rise to variations on RCTs. Pragmatic trials, which have been widely discussed and
debated, have been proposed more recently as tools to examine medical interven-
tions in the context of their application in clinical practice. Proponents have
suggested that pragmatic trials would be most useful during the implementation
stage of an intervention or in the post-marketing phase of drug evaluation, once
phase 3 trials have been completed (Ford and Norrie 2016). Similarly, as the quantity
of therapeutics has expanded over time, researchers also have increasingly
conducted randomized comparative effectiveness trials using existing treatments
rather than placebos in control arms of trials. The expansion of comparative effec-
tiveness RCTs has responded to a clinical demand for more detailed information not
just validating individual therapies but comparing different treatments in current use
to guide clinical decision-making (Fiore and Lavori 2016).
New challenges have also emerged in relation to establishing trial endpoints. For
example, trial sponsors have pursued surrogate endpoints – intermediate markers
anticipated to correlate with clinical outcomes – to achieve statistically significant
trial results more quickly. Such trials, however, do not generate comprehensive data
on the clinical outcomes experienced by patients. These approaches have had value,
such as expediting the evaluation of initial data on the effects of HIV treatments so
that more patients could access promising experimental therapies more quickly
(Epstein 1996). However, critics have also warned of the shortcomings of the partial
data that surrogate endpoints can yield (Bothwell et al. 2016), and there have been
important examples of drugs that “improved” the status of a biomarker while leading
to worsened clinical outcomes (e.g., torcetrapib raised HDL levels but also increased
the risk of mortality and morbidity of patients via unknown mechanisms) (Barter et
al. 2007). Some advocates of trial efficiency have also promoted adaptive methods
that alter trial design based on interim trial data. The US FDA has examined
methodological issues in adaptive designs with their Guidance for Industry on
adaptive trials, describing certain methodological challenges in adaptive designs
that remain unresolved (US FDA 2018).
While academic critics have identified limitations of RCTs, drug and device
industries have used these critiques for their own purposes. In recent years, some
industry representatives have embraced criticism of RCTs and evidence-based
medicine, seemingly with a goal of undermining major twentieth century attempts
to demand clinical trial rigor in the assessment of new therapies. Decriers of
regulation have contended that standards for clinical trials are unnecessarily narrow
and exacting, increasing research costs and slowing the delivery of new therapies to
the market. They make this argument even as other critics argue that regulators such
as the FDA have approved some experimental products too rapidly and without
sufficient evidence, resulting in poorer patient health outcomes (Kesselheim and
Avorn 2017; Ostroff 2015). It is likely that debates over trial design and evidence
standards will persist: competing interests continue to have stakes in how medical
therapies are regulated and tested. Additionally, academic researchers have compet-
ing interests to publish trials with significant results when such publications are
criteria for professional advancement (Calabrese and Roberts 2004).
Finally, researchers have continued to face challenges in conducting RCTs of
treatments that are less amenable to controlled experimentation. While investigators
have long conducted pharmaceutical RCTs comparing active pills and placebos, it
has been more challenging to conduct RCTs in certain other areas of medicine, such
as surgery. Surgeons, who had long espoused the goal of establishing a rational basis
for surgical practice, had recognized the value of control groups in the eighteenth
century and had implemented alternate allocation, and then randomization, in the
twentieth century. In 1953, for instance, surgeons in New York began a study of
surgical and medical management of upper GI bleeding. They began with alternate
allocation but switched to randomization for the majority of the study (1955–1963)
“to achieve statistically sound conclusions” (Enquist et al., as cited in Jones 2018).
By the late 1950s surgeons had randomized patients to tests of many different
surgical procedures (Bothwell and Jones 2019). However, RCTs have not become
as influential in surgery as they have in pharmaceutical research. Part of this has been
a matter of regulation: the FDA does not require RCTs before surgeons start using a
new procedure (unless the new procedure relies on a new device for which the FDA
deems an RCT appropriate). Surgical epistemology and methodology also pose
challenges for RCTs. Since the success of an operation can often seem self-evident,
surgeons have been reluctant to randomize patients between radically different
modes of therapy (e.g., to a medical vs. surgical treatment for a particular problem;
see Jones 2000). There are also few procedures for which surgeons can perform a
meaningful sham operation, forcing many surgical trials to be done without blinding.
Additionally, variations in practitioner skill can confound trial results: a surgical
RCT is not simply a test of the procedure per se, but a test of the procedure as done
by a specific group of surgeons whose skills and techniques might or might not
reflect those of other surgeons. These challenges have limited the use of RCTs within
surgery. When surgical RCTs have been performed historically, it often has not been
to validate and introduce a new operation, but rather to test an existing operation
which surgeons or nonsurgical physicians have begun to doubt. While RCTs remain
an important part of knowledge production in surgery, surgeons have continued to
rely extensively on other modes of knowledge production, including case series and
registry studies. These problems are not unique to surgery. Challenges have also
emerged for RCTs in other medical fields, such as psychotherapy, in which practi-
tioners may have significant degrees of variation in treatment approaches (Bothwell
et al. 2016; Jones 2018).

Summary and Conclusion

The foundations of modern RCTs run deep through centuries of thinkers, physicians,
scientists, and medical reformers committed to accurately measuring the effects of
medical interventions. Clinical trials have taken different forms in different historical
social contexts, growing from isolated, small controlled experiments to massive
multinational trials. The shifting burden of disease and the interests of trial sponsors
have influenced the types of questions investigated in trials – from infectious diseases
in the early twentieth century to chronic diseases, particularly those affecting wealthier
populations, in contemporary society. Shifts in trial funding and regulatory and ethical
policy landscapes have dramatically shaped the historical trajectory of RCTs such that
trial design, study location, ethical safeguards for research subjects, investigator
accountability, and even the likelihood of favorable trial results have all been
influenced by political and economic pressures and contexts. This has not been a
linear story of progress. Advances in trial rigor, ethics, and inclusiveness have occurred
alongside the emergence of new challenges related to the commercialization of
research and pressures to lower regulatory standards for evidence. Many RCTs today
have grown so complex and institutionalized that persistent challenges may seem
ingrained. However, the history of clinical trials offers numerous examples of how
science has been dramatically transformed through the work of individuals committed
to rigorous investigations over other competing interests.

Key Facts

1. The historical foundations of RCTs run deep – across time, different societies, and
different contexts, investigators have endeavored to create controlled experiments
of interventions to improve human health.
2. Social contexts of research – from physical trial settings to funding schemes and
regulatory requirements – have significantly impacted the design and scale of
trials, the types of questions asked, trial ethics, research subject demographics,
and the objectives of trial investigators.
3. The history of RCTs has involved both advances and setbacks: it has not been a
linear story of progress. Recent history has revealed persistent challenges for
RCTs as well as expanding concerns such as commercial interests in trials that
will need to be carefully considered moving forward.

Cross-References

▶ A Perspective on the Process of Designing and Conducting Clinical Trials
▶ Evolution of Clinical Trials Science
▶ Trials in Minority Populations

Permission Segments of this chapter are also published in Bothwell, L., and Podolsky, S.
“Controlled Clinical Trials and Evidence-Based Medicine,” in Oxford Handbook of American
Medical History, ed. J. Schafer, R. Mizelle, and H. Valier. Oxford: Oxford University Press,
forthcoming. With kind permission of Oxford University Press, date TBA. All Rights Reserved.

References
Barter PJ et al (2007) Effects of torcetrapib in patients at high risk for coronary events. N Engl J
Med 357:2109–2112
Bothwell LE (2014) The emergence of the randomized controlled trial: origins to 1980. Disserta-
tion, Columbia University
Bothwell LE, Jones DS (2019) Innovation and tribulation in the history of randomized controlled
trials in surgery. Ann Surg. https://doi.org/10.1097/SLA.0000000000003631
Bothwell LE, Podolsky SH (2016) The emergence of the randomized, controlled trial. N Engl J Med
375:501–504
Bothwell LE, Greene JA, Podolsky SH, Jones DS (2016) Assessing the gold standard – lessons
from the history of RCTs. N Engl J Med 374(22):2175–2181
Bourgeois FT, Murthy S, Mandl KD (2010) Outcome reporting among drug trials registered in
ClinicalTrials.gov. Ann Intern Med 153:158–166
Calabrese RL, Roberts B (2004) Self-interest and scholarly publication: the dilemma of researchers,
reviewers, and editors. Int J Educ Manag 18:335–341
Carpenter D (2010) Reputation and power. Princeton University Press, Princeton
Chalmers I (2005) Statistical theory was not the reason that randomisation was used in the
British Medical Research Council’s clinical trial of streptomycin for pulmonary tuberculo-
sis. In: Jorland G et al (eds) Body counts. McGill-Queen’s University Press, Montreal,
pp 309–334
Chalmers I, Dukan E, Podolsky S, Smith GD (2012) The advent of fair treatment allocation
schedules in clinical trials during the 19th and early 20th centuries. J R Soc Med 105(5). See
also JLL Bulletin: Commentaries on the history of treatment evaluation.
http://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/. Accessed 17 Mar 2019
Chikenkokusaikakenkyukai [Study group on the globalization of clinical trials] (2013) ICH-GCP
Nabigeita: Kokusaitekishitenkaranihonnochiken wo kangaeru. (ICH-GCP navigator: consider-
ations of clinical trials in Japan from an international perspective). Jiho, Tokyo
Cochrane AL (1972) Effectiveness and efficiency: random reflections on the health services.
Nuffield Provincial Hospitals Trust, London
Daly J (2005) Evidence-based medicine and the search for a science of clinical care. University of
California Press, Berkeley
Epstein S (1996) Impure science: AIDS, activism, and the politics of knowledge. University of
California Press, Berkeley
Fiore L, Lavori P (2016) Integrating randomized comparative effectiveness research with patient
care. N Engl J Med 374:2152–2158
Ford I, Norrie J (2016) Pragmatic trials. N Engl J Med 375:454–463
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317:141–145
Gabriel JM (2014) The testing of Sanocrysin: science, profit, and innovation in clinical trial design,
1926–1931. J Hist Med Allied Sci 69:604–632
Gaudilliere JP, Lowy I (1998) The invisible industrialist: manufactures and the production of
scientific knowledge. Macmillan, London
Hill AB (1963) Medical ethics and controlled trials. Br Med J 5337:1043–1049
Jones DS (2000) Visions of a cure: visualization, clinical trials, and controversies in cardiac
therapeutics, 1968–1998. Isis 91:504–541
Jones DS (2018) Surgery and clinical trials: the history and controversies of surgical evidence. In:
Schlich T (ed) The Palgrave handbook of the history of the surgery. Palgrave Macmillan,
London, pp 479–501
Jones DS, Podolsky SH (2015) The history and fate of the gold standard. Lancet 9977:1502–1503
Jones DS, Grady C, Lederer SE (2016) ‘Ethics and clinical research’ – the 50th anniversary of
Beecher’s bombshell. N Engl J Med 374:2393–2398
Kaptchuk TJ (1998) Intentional ignorance: a history of blind assessment and placebo controls in
medicine. Bull Hist Med 72:389–433
Keating P, Cambrosio A (2012) Cancer on trial. University of Chicago Press, Chicago
Kesselheim AS, Avorn J (2017) New ‘21st century cures’ legislation: speed and ease vs science. J
Am Med Assoc 317:581–582
Kuo W-H (2005) Japan and Taiwan in the wake of bio-globalization: drugs, race and standards.
Dissertation, MIT
Kuo W-H (2008) Understanding race at the frontier of pharmaceutical regulation: an analysis of the
racial difference debate at the ICH. J Law Med Ethics 36:498–505
Kuo W-H (2009) The voice on the bridge: Taiwan’s regulatory engagement with global pharma-
ceuticals. East Asian Science, Technology and Society: an International Journal 3:51–72
Kuo W-H (2012) Transforming states in the era of global pharmaceuticals: visioning clinical research in
Japan, Taiwan, and Singapore. In: Rajan KS (ed) Lively capital: biotechnologies, ethics, and
governance in global markets. Duke University Press, Durham, pp 279–305
Lewontin RC (2008) The socialization of research and the transformation of the academy. In:
Hannaway C (ed) Biomedicine in the twentieth century: practices, policies, and politics. IOS
Press, Amsterdam, pp 19–25
Lilienfeld AM (1982) The Fielding H. Garrison lecture: ceteris paribus: the evolution of the clinical
trial. Bull Hist Med 56:1–18
Marks HM (1997) The progress of experiment: science and therapeutic reform in the United States,
1900–1990. Cambridge University Press, Cambridge
Marks HM (2000) Trust and mistrust in the marketplace: statistics and clinical research, 1945–1960.
Hist Sci 38:343–355
Marks HM (2006) ‘Until the sun of science . . . the true Apollo of medicine has risen’: collective
investigation in Britain and America, 1880–1910. Med Hist 50:147–166
Matheson A (2017) Marketing trials, marketing tricks – how to spot them and how to stop them.
Trials 18:105
Meldrum ML (1998) A calculated risk: the Salk polio vaccine field trials of 1954. Br Med J
7167:1233–1236
Milne I (2012) Who was James Lind, and what exactly did he achieve? J R Soc Med 105:503–508.
See also JLL Bulletin: Commentaries on the history of treatment evaluation, (2011).
http://www.jameslindlibrary.org/articles/who-was-james-lind-and-what-exactly-did-he-achieve/. Accessed 30 Jan 2019
Ostroff SM (2015) ‘Responding to changing regulatory needs with care and due diligence’ –
remarks to the regulatory affairs professional society. United States Food and Drug Adminis-
tration, Baltimore
Petryna AP (2009) When experiments travel: clinical trials and the global search for human
subjects. Princeton University Press, Princeton
Podolsky SH (2006) Pneumonia before antibiotics: therapeutic evolution and evaluation in twen-
tieth-century America. Johns Hopkins University Press, Baltimore
Podolsky SH (2009) Jesse Bullowa, specific treatment for pneumonia, and the development of the
controlled clinical trial. J R Soc Med 102:203–207. See also JLL Bulletin: Commentaries on the
history of treatment evaluation, (2008).
http://www.jameslindlibrary.org/articles/jesse-bullowa-specific-treatment-for-pneumonia-and-the-development-of-the-controlled-clinical-trial/. Accessed 17 Mar 2019
Podolsky SH (2015) The antibiotic era: reform, resistance, and the pursuit of a rational therapeutics.
Johns Hopkins University Press, Baltimore
Podolsky SH, Jones DS, Kaptchuk TJ (2016) From trials to trials: blinding, medicine, and honest
adjudication. In: Robertson CT, Kesselheim AS (eds) Blinding as a solution to bias: strength-
ening biomedical science, forensic science, and law. Academic Press, London, pp 45–58
Porter TM (1996) Trust in numbers: the pursuit of objectivity in science and public life. Princeton
University Press, Ewing
Ross OB (1951) Use of controls in medical research. J Am Med Assoc 145:72–75
Shapiro AK, Shapiro E (1997) The powerful placebo: from ancient priest to modern physician.
Johns Hopkins University Press, Baltimore
Stark L (2011) Behind closed doors: IRBs and the making of ethical research. University of Chicago
Press, Chicago
Tibi S (2006) Al-Razi and Islamic medicine in the 9th century. J R Soc Med 99:206–207. See also
James Lind Library Bulletin: Commentaries on the History of Treatment Evaluation, (2005).
http://www.jameslindlibrary.org/articles/al-razi-and-islamic-medicine-in-the-9th-century/. Accessed 17 Mar 2019
Timmermann C (2008) Clinical research in post-war Britain: the role of the Medical Research
Council. In: Hannaway C (ed) Biomedicine in the twentieth century: practices, policies, and
politics. IOS Press, Amsterdam, pp 231–254
Timmermans S, Berg M (2003) The gold standard: the challenges of evidence-based medicine and
standardization in health care. Temple University Press, Philadelphia
Tröhler U (2000) To improve the evidence of medicine: the 18th century British origins of a critical
approach. Royal College of Physicians of Edinburgh, Edinburgh
United States Food and Drug Administration (1963) Proceedings of the FDA conference on the
Kefauver-Harris drug amendments and proposed regulations. United States Department of
Health, Education, and Welfare, Washington, DC
United States Food and Drug Administration, Center for Drug Evaluation and Research, Center for
Biologics Evaluation and Research (2018) Adaptive design clinical trials of drugs and biologics:
guidance for industry (draft guidance). United States Department of Health, Education, and
Welfare, Rockville
Warner JH (1986) The therapeutic perspective: medical practice, knowledge, and identity in
America, 1820–1885. Harvard University Press, Cambridge, MA
Warner JH (1991) Ideals of science and their discontents in late nineteenth-century American
medicine. Isis 82:454–478
2 Evolution of Clinical Trials Science

Steven Piantadosi

Contents
Introduction
The Scientific Method
Some Key Evolutionary Developments
  Ethics
  Governance Models
  Computerization
  Statistical Advances
A Likely Future
Final Comments
Key Facts
Cross-References
References

Abstract
The art of medicine took two millennia to establish the necessary groundwork for
clinical trials, which embody the scientific method for making fair comparisons of
treatments. This resulted from a synthesis of opposing approaches to the acqui-
sition of knowledge. Establishment of clinical trials in their basic form in the last
half of the twentieth century continues to be augmented by advances in disparate
fields such as research ethics, computerization, research administration and
governance, and statistics.

Keywords
Design · Design evolution

S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]


Introduction

Clinical trials have been with us for about 80 years, using the famous sanocrysin
tuberculosis trial as the dawn of the modern era (Amberson et al. 1931). Trials have
evolved in various applications and in response to pressures from regulation, ethics,
economics, technology, and the changing needs of therapeutic development. Clinical
trials are dynamic elements of scientific medicine and have never really been broken,
though nowadays everyone seems to know how to fix them.
Neither the science nor the art of trials is static. Perhaps they will eventually
be replaced by therapeutic inferences based on transactional records from the
point of care, as some people expect. However, most of us who do trials envision
only their relentless application. Evolution of clinical trials manifests in compo-
nents such as organization, technology, statistical methods, medical care, and
science (Table 1). Any of these topics is probably worthy of its own evolutionary
history.
Every trialist would have their own list of the most important developments that
have aided the scope and validity of modern clinical trials. In this discussion I take
for granted and omit three mature experiment design principles covered elsewhere in
this book: control of random variation using replication, bias control using random-
ization and masking, and control of extraneous effects using methods such as
placebos, blocking, and stratification. Statisticians have taken these as axioms to be
applied consistently from the beginning of experiment design, as in the early work of
Fisher (Fisher 1925).

Table 1  Evolutionary advancements that have led to improvements in how clinical
trials are performed

Area of advancement     Example improvements
Organization            Multicenter management; institutional review boards;
                        data and safety monitoring boards
Ethics and regulation   Ethics standards; evidentiary standards; international
                        harmonization
Technology              Computers; analysis software; data systems
Statistical methods     Randomization; survival analysis; missing data methods
Medical care            Imaging; diagnostics; targeted therapeutics
Biological science      Risk and prognosis; pharmacology and drug action;
                        genomic markers
Reporting               Registries; reporting guidelines; meta-analyses
Hybrid                  Wearable devices; point of care data; artificial intelligence
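Bias control by randomization, often combined with blocking, is concrete enough to
sketch. The minimal Python example below shows permuted-block assignment, which
balances arms within each block while keeping individual assignments unpredictable;
the function name, block size, arm labels, and seed are illustrative assumptions, not
any trial's actual scheme.

```python
# A minimal sketch of permuted-block randomization, assuming two arms and a
# block size of 4; production systems add stratification and audit trails.
import random

def permuted_block_assignments(n_subjects, block_size=4, arms=("A", "B"), seed=20):
    """Return treatment assignments generated in randomly permuted, balanced blocks."""
    assert block_size % len(arms) == 0, "block size must balance across arms"
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_subjects:
        block = list(arms) * (block_size // len(arms))  # balanced block
        rng.shuffle(block)                              # permute within the block
        assignments.extend(block)
    return assignments[:n_subjects]

print(permuted_block_assignments(10))  # e.g., ['B', 'A', 'A', 'B', ...]
```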
Maturation of the scientific method was an evolutionary prerequisite for clinical
trials. It is equivalent to reconciling dogmatism (rationalism) and empiricism, two
opposing philosophies from history. The first part of this chapter discusses why that
reconciliation is a bedrock of clinical trials. The second part focuses on some more
recent developments that have boosted our ability to conduct clinical trials atop the
infrastructure of the scientific method, including ethics, governance models, com-
puterization, multicenter collaborations, and statistical advances. The purpose is to
illustrate catalysts for advancement of the science, but not to describe them
comprehensively.
Improvements often arrive bundled with counterproductive ideas. Burdensome
privacy regulations may be an example, as were requirements for representation in trial
cohorts, complex adaptive designs, and centralized institutional review boards
(IRB). Scientists tend to view new as better, but even the best intentionally designed
improvements carry imperfections. Nearly every such improvement in recent
decades makes trials harder and more expensive to perform.
Prior incremental improvements have similarly been bundled, with good ideas
emerging in settings or from individuals also holding bad ones. History did not
design trials from first principles but gave single methodologic suggestions inter-
mittently. How else can you create something you don’t yet understand?
Synthesis of theory (established knowledge) with data is the scientific method.
Theory supports the construction of useful biological hypotheses and the paradigm
by which they are evaluated. Science relies equally on empirical knowledge derived
from data to provide evidence regarding the hypotheses. The ability of data designed
for the purpose to disprove hypotheses (falsifiability) is characteristic of the scientific
method according to Popper (1959). Clinical trials embody the required interplay
between theory, hypotheses, and data.
Cooperation between empirical data and theoretical reasoning synthesizes two
trends from history. Rationalism is one trend that itself evolved from dogmatism.
Empiricism, which disregarded theory in opposition to dogmatism, is the other trend.
Either approach to knowledge in isolation does not constitute a scientific method.
For example, both “empiric” and “dogmatic” continue to be used as pejorative
labels. The method of scientific medicine emerges when the contrasting philosophies
are joined. Clinical trials could not exist until these two modes of thought learned to
collaborate in the experimental method.

The Scientific Method

After Hippocrates, two rival schools of Greek medicine arose, both finding justifi-
cation in his writings and teachings. One was the Dogmatist (later the rationalist)
school, with philosophical perspectives strengthened by the teachings of Plato and
Aristotle. Medical doctrines of Diocles, Praxagoras, and Mnesitheus helped to form
it (Neuburger 1910). The Dogmatist view of diagnosis and therapeutics was based
on pathology and anatomy and sought causes for illness.
The empiric school of medicine arose between 270 and 220 B.C. largely as a
reaction to the rigid components of Dogmatist teachings, with underpinnings
founded in Skeptic philosophers. Empiric medical doctrines followed the teachings
of Philinus of Cos, Serapion of Alexandria, and Glaucias of Nicomedia (Neuburger
1910). Empirics used a “tripod” in their approach to treatment: 1) their own
experience (autopsia), 2) knowledge obtained from the experience of others (his-
tory), and 3) similarity with other conditions (analogy). A fourth leg was later added
to the tripod: 4) inference of previous conditions from present symptoms (epilogism)
(Neuburger 1910; Robinson 1931).
Empirics taught that the physician should reject theory, speculation, abstract
reasoning, and the search for causes. Physiology and pathology of the time were
held in low esteem, and books were written opposing rationalist anatomical doc-
trines. Thus, empirics were guided almost entirely by experience (King 1982;
Kutumbiah 1971). Regarding the search for causes, Celsus (25 B.C.–A.D. 50) stated
clearly the empiricist objections:

Those who are called “empirici” because they have experience, do indeed accept evident
causes as necessary; but they contend that inquiry about obscure causes and natural
actions is superfluous, because nature is not to be comprehended . . . Even in its
beginnings, they add, the art of medicine was not deduced from such questionings, but
from experience. . . . It was afterwards, . . . when the remedies had already been discov-
ered, that men began to discuss the reasons for them: the art of medicine was not a
discovery following upon reasoning, but after the discovery of the remedy, the reason for
it was sought out. (Celsus 1809)

The two components of modern scientific reasoning were separated forcefully in
these schools. The two schools of medicine coexisted into the second century when
dogmatism (rationalism), embodied by Galen (131–200 A.D.), became dominant.
Empiricism nearly died out in the third century, even becoming a disreputable term,
and Galen’s teachings formed the basis for most western medicine until the sixteenth
century. Empiricism was frowned upon in the Middle East as well. For example,
Avicenna (980–1037) wrote:

But truly every science has both a speculative and a practical side. So has medicine. . . .
When, in regard to medicine, we say that practice proceeds from theory, we do not mean that
there is one division of medicine by which we know, and another, distinct therefrom, by
which we act. We mean that these two aspects belong together - one deals with the basic
principles of knowledge; the other with the mode of operation of these principles. The
former is theory; the latter is applied knowledge. (Gruner 1930)

Maimonides wrote in the twelfth century:

The mere empiricists who do not think scientifically are greatly in error. . . . He who puts his
life in the hands of a physician skilled in his art but lacking scientific training is not unlike the
mariner who puts his trust in good luck, relying on the sea winds which know no science to
steer by. Sometimes they blow in the direction the seafarer wants them to blow, and then his
luck shines upon him; another time they may spell his doom. (Muntner 1963)

Rationalist ideas were adopted more broadly in science and the mariner metaphor
was a popular one. For example, Leonardo da Vinci (1452–1519) defended the value
of theory in scientific thinking by saying:

Those who are enamored of practice without science are like a pilot who goes into a ship
without rudder or compass and never has any certainty where he is going. Practice should
always be based on a sound knowledge of theory. (da Vinci 1510)

Even so, empiricism was not dead. Theophrastus Bombastus of Hohenheim
(Paracelsus) (1493–1541) challenged the existing medical dogma in the early
1500s, literally burning the writings of Galen and Avicenna, and taught that expe-
rience with treatments should be the source of knowledge regarding their application
(Pagel 1982). The experience of other practitioners was also a worthy source of
knowledge. Empiricism was strengthened by thinkers such as Francis Bacon
(1561–1626) and J.B. van Helmont (1578–1644). However, the empirics ignored
new knowledge or subordinated it to experience, thereby becoming less able to deal
with the emerging basic sciences. In contrast, seventeenth-century rationalists such
as Rene Descartes (1596–1650), H. Boerhaave (1668–1738), and others adopted
new knowledge in anatomy, physiology, and chemistry as a basis for causes of
disease.
The dialectic between empiric and rationalist thinking has continued since the
seventeenth century. Polemic essays of Thomas Percival (Percival 1767) provide an
interesting debate between both positions as an educational device for readers.
Percival was the originator of the first code of medical ethics (Editorial 1965;
Percival 1803). He appears to be a rationalist who overstated the criticisms of
rationalism to make it look better – praising with faint damns. In favor of empiricism,
he states:

It is evident that theory is absurd and fallacious, always useless and often in the highest
degree pernicious. The annals of medicine afford the most striking proof, that it hath in all
ages been the bane and disgrace of the healing art.

In the next essay favoring rationalism, Percival states:

And by thus treading occasionally in unbeaten tracks [the rationalist] enlarges the boundaries
of science in general and adds new discoveries to the art of medicine. In a word, the
rationalist has every advantage which the empiric can boast, from reading, observation
and practice, accompanied with superior knowledge, understanding, and judgment.

By the twentieth century, the scientific method had embraced rationalism and its
response to new knowledge in the physical and biological sciences. The develop-
ment of basic biological science both contributed to and was supported by the
rationalist tradition. Clinical medicine remained somewhat more empirical, but
increasingly influenced by the scientific method. Abraham Flexner (1866–1959),
who was influential in shaping medical education in the USA, recognized the
strengths of theory, the usefulness of the scientific method in medicine, and the
dangers of purely empiric thinking when he wrote:

The fact that disease is only in part accurately known does not invalidate the scientific method
in practice. In the twilight region probabilities are substituted for certainties. There the
physician may indeed only surmise, but, most important of all, he knows that he surmises.
His procedure is tentative, observant, heedful, responsive. Meanwhile the logic of the process
has not changed. The scientific physician still keeps his advantage over the empiric. He studies
the actual situation with keener attention; he is freer of prejudiced prepossession; he is more
conscious of liability to error. Whatever the patient may have to endure from a baffling disease,
he is not further handicapped by reckless medication. In the end the scientist alone draws the
line accurately between the known, the partly known, and the unknown. The empiricist fares
forth with an indiscriminate confidence which sharp lines do not disturb. Investigation and
practice are thus one in spirit, method, and object. (Flexner 1910)

and

Modern medicine deals, then, like empiricism, not only with certainties, but also with
probabilities, surmises, theories. It differs from empiricism, however, in actually knowing
at the moment the logical quality of the material which it handles. . . . The empiric and the
scientist both theorize, but logically to very different ends. The theories of the empiric set up
some unverifiable existence back of and independent of facts . . . the scientific theory is in the
facts, summing them up economically and suggesting practical measures by whose outcome
it stands or falls. (Flexner 1910)

This last quote may seem somewhat puzzling because it states that empirics do, in
fact, theorize. However, as the earlier quote from Celsus suggested, the theories of
the empiric are not useful devices for acquiring new knowledge.
There is no sharp demarcation when rationalist and empiricist viewpoints became
cooperative and balanced. The optimal mixture of these philosophies continues to
elude some applications even in clinical trials. R.A. Fisher’s work on experimental
design (Fisher 1925) might be taken as the beginning of the modern synthesis because
it placed statistics on a comparable footing with the maturing biological sciences,
providing for the first time the tools needed for the interoperability of theory and data.
But the modern form of clinical trials would take another 25 years to evolve.
Biological and inferential sciences have synergized and co-evolved since the
middle of the twentieth century. The modern synthesis has yielded great understand-
ing of disease and effective treatments, in parallel with appropriate methods of
evaluation. Modern understanding of disease and treatment are flexible enough to
accommodate such diverse contexts as molecular biology, chronic disease, psycho-
social components of illness, infectious organisms, quality of life, and acupuncture.
Modern inferential science, a.k.a. statistics, is applied universally in science. The
scientific method can reject theory based on empirical data but can also reject data
based on evidence of poor quality, bias, or inaccuracy. An interesting exception to
this implied order is the elaborate justification for homeopathy based on empiricism,
for example, by Coulter (1973, 1975, 1977). It illustrates the ways in which the
residual proponents of purely empirical practices justify them and why biological
theory is a problem for such practices.

Some Key Evolutionary Developments

In the remainder of this chapter, we look beyond the historical trends that converged
to allow scientific medicine to evolve. Clinical trials have both stimulated and
benefitted from key developments in the recent 80 years. These include ethics,
governance models, computerization, multicenter collaborations, and statistical
advances.

Ethics

Key landmarks in the history of ethics behind biomedical research are well known to
clinical trialists because this history is a required part of their research training. Historical
mistakes have led to great awareness and mechanisms for the protection of research
subjects.
With respect to clinical trials specifically, we might take the evolutionary steps in
ethics to be Nuremberg, Belmont, and data and safety monitoring boards (DSMB).
Some might take a more granular view of ethics landmarks, but this short list has
proved to be beneficial to clinical trials. The foundational importance of Nuremberg
and Belmont needs no further elaboration here. DSMBs are important because they
operationalize some of the responsibilities of institutional review boards (IRBs),
which could otherwise be overwhelmed without delegating the work of detailed
interim oversight.
Ethics principles can be in tension with one another. Resolution of those tensions
is part of the evolution of ethics in clinical trials. A good example in recent years has
been the debate over content and wording of the Helsinki Declaration regarding the
proper use of placebo control groups (Skierka and Michels 2018). Another evolu-
tionary step might be visible in the growing use of “central” IRBs. They are strongly
motivated by efficiency but essentially discard the “institutional” local spirit of IRBs
as originally chartered.
The modern platform for clinical trials would not exist without public trust
founded on ethics principles and review and the risk-benefit protections it affords
participants. Ethics is therefore as essential as the underlying biomedical knowledge
and clinical trials science.

Governance Models

Multicenter clinical trial collaborations are common today. Their complexity has
helped improve the management of all trials. They were not as feasible in the early
history of clinical trials because they depend on technologies like data systems, rapid
communications, and travel that have improved greatly in the last 80 years. Aside
from technologies, administrative improvements have also made them feasible. The
multicenter model of trial management is not monolithic. Some such collaborations
are relatively stable, such as the NCI National Clinical Trials Network (NCTN), which
has existed in similar form for decades, even following its “reorganization” in the
last decade. Other collaborations are constituted de novo with each major research
question. That model is used often by the NHLBI and many commercial entities.
Infrastructure costs are high, and a multicenter collaboration must have an
extensive portfolio to make the ongoing investment worthwhile. This has been
true of cancer clinical trials for many years. Multicenter collaborations overcome
the main shortcoming of single-institution trials which is low accrual. They add
broad investigator expertise at the same time. The increased costs associated with
them are in governance, infrastructure, and management.
The governance model for multicenter projects relies on committees rather than
individuals, aside from a Principal Investigator (PI) or Co-PIs. For example, there
may be Executive (small) and Steering (larger) Committees. Other efforts scale
similarly such as data management, pathology or other laboratory review, publica-
tion, and biostatistics. See Meinert (2013) for a concise listing of committee respon-
sibilities. Multicenter collaborations seem to function well even when the various
components are geographically separate largely due to advances in computerization.

Computerization

Several waves of computer technology have washed over clinical research in the last
50 years. The first might be described as the mainframe era, during which data
systems and powerful and accessible statistical analysis methods began, both of
which yielded great benefit to clinical trials. The idea that much could be measured
and stored in databases suggested to some in the 1960s and 1970s that designed
experiments might be replaced by recorded experience. In fact, data system tech-
nology probably did more to advance clinical trials than was appreciated at the time.
Even so, the seemingly huge volume and speed of data storage in the mainframe era
was trivial by today’s standards.
Microcomputers and their associated software comprised the next wave. They
created decentralized computing, put unprecedented power in individual hands, and
led to the Internet. These technologies allowed breakthroughs in data systems and
communication which also greatly facilitated clinical trials. In this period, many
commercial sponsors realized the expense of maintaining clinical trial support
infrastructure in-house. Those support services could be outsourced more econom-
ically to specialist contract research organizations (CROs). This model appears to
accomplish the dual aims of maintaining skilled support but paying only for what is
needed at the appropriate time.
A third era is occurring presently and might be described as big data or big
computation. Increasing speed, storage, computing power, and miniaturization
parallel a therapeutic focus on the individual patient. Rapid data capture and
transfer of images, video, and genomic data are the rule. Miniaturization is leading
to wearable or even ingestible sensors. A major problem now is how to store,
summarize, and analyze the huge amount of data available. These developments
are changing the course of clinical trials again. Designs can be flexible and their
performance tested by simulation. Outcomes can be measured directly rather than
reported or inferred. Trials can incorporate individual markers before, during, and
after treatment.
Some are expecting a fourth wave of computerization, which might be called true
interoperability of data systems. Lack of interoperability by design can be seen in the
ubiquitous need for human curation of data sources to meet research needs. As good
as they can be, case report forms (CRFs) for clinical trials illustrate the problem.
Unstructured data sources must be curated in CRFs to render them computable. Even
when data sources are in electronic form, most are not interoperable. This potential
wave of computerization will be described below in a brief discussion of the future.

Statistical Advances

Statistics, like all other fields of science, has made great progress in the last 80 years in
both theory and application. Statistics is not a collection of tricks and techniques but
is the science of making reliable inferences in the presence of uncertainty. It is our
only tool to do so. For that reason, it has found application in every discipline. One
might reasonably claim today that there is no science without the integration of
statistical methods. The history of statistics has been fleshed out, for example, by
Stigler (1980, 1986, 1999), Porter (1986), and Marks (1997).
Statistics is not universally viewed as a branch of mathematics, but probability,
upon which statistics is based, is. However, statistics uses the same methods of
deductive reasoning and proof as in mathematics. Statistics and mathematics live
together in a single academic department in many universities, for example.
Clinical trials and experimental design broadly have stimulated applied statistics
and are a major application area. Especially in the domain of data analyses, trials
have derived enormous benefit from advances in both statistical theory and methods.
The modern wave of “data scientists” may not realize that fundamental tools such as
censored data analysis and related nonparametric tests, proportional hazards and
similar nonlinear regression methods, pharmacokinetic modeling, bootstrapping,
missing data methods, feasible Bayesian methods, fully sequential and group
sequential methods, meta-analysis, and dozens of other major advances have come
during the era of the modern clinical trial.
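To give one concrete flavor of these tools, the sketch below implements the
Kaplan-Meier product-limit estimator for right-censored survival times in plain
Python. It is a teaching sketch with invented toy data, not a substitute for vetted
survival analysis software.

```python
# Minimal Kaplan-Meier product-limit estimator for right-censored data;
# illustrative only, with a hypothetical 8-subject data set.

def kaplan_meier(times, events):
    """Return a list of (time, survival probability) steps.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed, 0 if the subject was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk  # product-limit step
            curve.append((t, surv))
        n_at_risk -= removed  # censored subjects also leave the risk set
    return curve

times  = [2, 3, 3, 5, 7, 8, 11, 11]   # hypothetical follow-up times
events = [1, 1, 0, 1, 0, 1, 1, 0]     # 1 = event observed, 0 = censored
for t, s in kaplan_meier(times, events):
    print(f"t = {t:>2}: S(t) = {s:.3f}")
```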
Aside from the solid theoretical foundations for these and other methods, com-
puter software and hardware advances have further supported their universal imple-
mentation. Procedure-oriented languages evolved in this interval and have moved
from mainframes to personal computers while making the newest statistical methods
routine components of the language. Importantly all those languages have integrated
methods for data transformations aside from methods for data analyses. Clinical
trials could not have their present form without these technological advances. For
example, consider a single interim analysis on a large randomized trial as reviewed
by the DSMB. The data summaries of treatment effects, multiple outcomes, and
formal statistical analyses could not take place in the narrow time windows required
without the power of the methods and technologies listed.
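A small simulation can also illustrate why the group sequential methods behind such
interim analyses work. The sketch below, a minimal Monte Carlo check assuming two
equally spaced looks and textbook O'Brien-Fleming-type two-sided boundaries for an
overall 0.05 type I error, estimates how often a trial with no true treatment effect
would be declared positive; the boundary constants and trial structure are
illustrative assumptions, not any specific trial's monitoring plan.

```python
# Illustrative Monte Carlo check of a two-look group sequential design under
# the null hypothesis. The z boundaries 2.797 (interim) and 1.977 (final) are
# textbook O'Brien-Fleming-type values for two equally spaced looks at an
# overall two-sided alpha of 0.05.
import numpy as np

rng = np.random.default_rng(2022)
n_trials = 1_000_000                  # simulated null trials

z1 = rng.standard_normal(n_trials)    # interim z statistic under H0
w = rng.standard_normal(n_trials)     # independent second-stage increment
z2 = (z1 + w) / np.sqrt(2.0)          # final z pools both halves of the data

reject = (np.abs(z1) > 2.797) | (np.abs(z2) > 1.977)
print(f"Empirical overall type I error: {reject.mean():.4f}")  # close to 0.05
```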
A byproduct of both computerization and analysis methods is data integration.
This means the ability to assemble disparate sources of data, connect them to
facilitate analysis, and show the results of those analyses immediately. Although
impressive, clinical trials, like many other research applications, have minimal need
for up-to-the-minute data analysis. Snapshots that are days or weeks old are typically
acceptable for research purposes. Live surveys of trial data could be useful in review
of data quality or side effects of treatment.

A Likely Future

This collective work is intended to assist the science of clinical trials by defining its
scope and content. Much progress necessary in the field remains beyond what
scholars can write about. For example, clinical trials still do not have a universal
academic home. One must be found. The biostatistics elements of the field have been
found in Public Health for 100 years. Biostatisticians have contributed to Public
Health so well that most academic institutions function as though that discipline
should be present only in Public Health. Often the view is reciprocated by bio-
statisticians inside Public Health who think they must own the discipline every-
where. The result is that there is little biostatistics formally in therapeutic
environments despite the huge need there.
Aside from this, biostatistics is not the only component of clinical trials. Training,
management, infrastructure, and various clinical and allied sciences are also essen-
tial. Where does trial science belong in the academic setting? The answer is not
perfectly clear even after 80 years of clinical trials. Options might include depart-
ments with various names such as clinical investigation, quantitative science, or
medical statistics. In any case, clinical trial science cannot be kept at a distance
organizationally and still expect these collaborations to function effectively.
The immediate future of medicine emphasizes economic value. There can be no
conversation regarding value unless we also understand efficacy, which is the
domain of clinical trials. Organizations that either conduct or consume efficacy
evidence to provide high value medical care need internal expertise in clinical trials.
Without it, they cannot participate actively in understanding value. It seems likely
that this need will be as important in the future as any particular clinical or basic
science.
Despite historical progress in computerization, we lack true interoperability of
medical records. Electronic health records (EHR) seem to be optimistically misun-
derstood by the public and politicians alike, who view them as solutions to many
problems that realistically have not been solved yet. EHRs are essentially electronic
paper created to assist billing, and they have only parchment-like interoperability.
EHRs, like paper, can be sent and read by new caregivers, but this constitutes the
lowest form of interoperability. Much of the data they contain is unstructured and
useless for research without extensive and expensive human curation – hence the
need for CRFs. We also know that skilled humans curate EHRs imperfectly, so we
can’t expect augmented intelligence or natural language processing to fix this
problem until EHRs improve. Lack of data model standardization is a major hurdle
on that evolutionary path.
Despite the current research-crushing limitations of EHRs, many people have
begun to talk about “real-world” data, evidence, or trials which are derived from
those sources. This faddish term has a foothold but is bad for several reasons. It
implies that 80 years of clinical trials have somehow not reflected practical findings
or benefits, which is contrary to the evidence. It also implies that use of EHRs that
contain happenstance data, i.e., not designed for purpose, will suffice for therapeutic
inferences. This has not been true in the past and remains unproven for current
ambitions. In any case, a less catchy but more accurate term is "point of care,"
indicating data recorded in source documents when and where care is delivered, as
opposed to secondary or derived documents like CRFs.
When EHRs evolve to hold essential structured data, data models become
standardized, and point of care data become available in adequate volume and
quality to address research queries across entire healthcare systems, some questions
badly addressed by current clinical trials can be asked. These include what happens
in subgroups of the population not represented well in traditional clinical trials, how
frequent are rare events not seen in relatively small experimental cohorts, and what
outcomes are most likely in lengthy longitudinal disease and complex treatment
histories that require multiple changes in therapy. We cannot yet know if we will get
accurate answers to these and similar questions using point of care data simply because
the inferences will be based on uncontrolled or weakly controlled comparisons.

Final Comments

Clinical trials are quintessential science. They hardwire the scientific method in their
design and execution and demand cooperation between theory and data. Trials
evolved as science inside science only after rationalism and empiricism began to
cooperate and following necessary advances in both statistical theory and biological
understanding of disease.
A populist view of science and medicine is that investigative directions are
substantially determined by whoever is doing the research. This makes it subject
to personal or cultural biases and justifies guidance by political process. While some
science is investigator initiated (which does not escape either peer review, sponsor
oversight, or accountability), the truth is that science and medicine are obviously
guided mostly by economic concerns. Government sponsorship places funding in
topic areas that are pursued by scientists. Much research is sponsored by commercial
pharmaceutical and device companies, themselves guided nowadays almost solely by
marketing considerations. Government allocation of resources is subject to the
political process and advocacy and sometimes to the personal experiences of lawmakers.
Today, it is not possible to deny the rationalist or scientific nature of medicine or
the power of biological theory in understanding disease. We may choose not to use
the full power of this method because of constraints such as inefficiency, cost, lack of
humanism, or political correctness, but that does not refute the greater usefulness of
scientific compared with either dogmatic or empirical thinking. The failure to use
well-founded, coherent biological theory can also encourage fraudulent or question-
able treatment practices in medicine.
Purely empirical perspectives and their consequences for clinical trials are not
without supporters and apologists in the scientific community. History shows us that
this is likely to remain true in the future. Claude Bernard was not warm to the
application of statistical methods in his day, but he was likely correct when he said:

Medicine is destined to escape empiricism little by little, and it will escape in the same way
as all the other sciences, by the experimental method. (Bernard 1865)

Key Facts

Clinical trials are an evolving science. Their appearance as a method of scientific
medicine depended on a long reconciliation of two competing themes in science –
dogmatism and empiricism. The basic tools of designed data production that embody
clinical trials have been supplemented in modern times by codification of ethics
norms, governance models for complex investigations, multicenter collaborations,
advances in computer technology, and improved statistical methods. It is unlikely
that therapeutic comparisons provided by clinical trials in their foundational form
can be replaced by those based on transactional data.

Cross-References

▶ Leveraging “Big Data” for the Design and Execution of Clinical Trials
▶ Multicenter and Network Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Social and Scientific History of Randomized Controlled Trials

References
Amberson JB, McMahon BT, Pinner M (1931) A clinical trial of sanocrysin in pulmonary
tuberculosis. Am Rev Tuberc 24:401–435
Anon. Editorial (1965) Thomas Percival (1740-1804) codifier of medical ethics. JAMA 194(12):
1319–1320
Bernard C (1865) Introduction a l’Etude de la Medicine Experimentale. J. B. Bailliere et Fils, Paris
Celsus AC (1809) De Medicina. Blackwood and Bryce, Edinburgh. Section translated by WG
Spencer and quoted in Strauss MB (ed) (1968) Familiar medical quotations. Little, Brown and
Company, Boston
Coulter HL (1973) Divided legacy. A history of the schism in medical thought. Vol. III, science and
ethics in American medicine: 1800–1914. McGrath Publishing Co, Washington, DC
Coulter HL (1975) Divided legacy. A history of the schism in medical thought. Vol. I, the patterns
emerge: Hippocrates to Paracelsus. Wehawken Book Co, Washington, DC
Coulter HL (1977) Divided legacy. A history of the schism in medical thought. Vol. II, Progress and
regress: J. B. van Helmont to Claude Bernard. Wehawken Book Co, Washington, DC
da Vinci L (c.1510) Manuscript G, Library of the Institut de France (translated by Edward
MacCurdy in The Notebooks of Leonardo da Vinci, vol II, Chap XXIX)
Fisher RA (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh
Flexner A (1910) Medical education in the United States and Canada. Merrymount Press, Boston, p 53
Gruner OC (1930) A treatise on the canon of medicine of Avicenna. Luzac & Co, London
King LS (1982) Medical thinking, a historical preface. Princeton University Press, Princeton
Kluger J (2004) Splendid solution: Jonas Salk and the conquest of polio. G.P. Putnam’s Sons,
New York. ISBN: 0-399-15216-4
Kutumbiah P (1971) The evolution of scientific medicine. Orient Longman, Ltd, New Delhi
LCSG: Lung Cancer Study Group (1981) Surgical adjuvant intrapleural BCG treatment for stage I
non-small cell lung cancer. J Thorac Cardiovasc Surg 82(5):649–657
Marks HM (1997) The progress of experiment: science and therapeutic reform in the United States,
1900–1990. Cambridge University Press, New York
McKneally MF et al (1976) Regional immunotherapy of lung cancer with intrapleural
B.C.G. Lancet 1(7956):377–379
Meinert CL (2013) Clinical trials handbook. John Wiley & Sons, Hoboken
Meldrum M (1998) A calculated risk: the Salk polio vaccine field trials of 1954. BMJ 317
(7167):1233–1236
Muntner S (ed) (1963) The medical writings of Moses Maimonides: treatise on asthma. XI, 3. J. B.
Lippincott, Philadelphia
Neuburger M (1910) History of medicine, vol 1. Henry Frowde, London
Pagel W (1982) Paracelsus. Karger, Basel
Percival T (1767) Essays medical and experimental. J. Johnson, London
Percival T (1803) Medical ethics; or a code of institutes and precepts adapted to the professional
conduct of physicians and surgeons. S. Russell, Manchester
Popper K (1959) The logic of scientific discovery. Hutchinson, London
Porter TM (1986) The rise of statistical thinking, 1820–1900. Princeton University Press, Princeton
Robinson V (1931) The story of medicine. Albert & Charles Boni, New York
Skierka A-S, Michels KB (2018) Ethical principles and placebo-controlled trials – interpretation
and implementation of the declaration of Helsinki’s placebo paragraph in medical research.
BMC Med Ethics 19(1):24
Stigler SM (ed) (1980) American contributions to mathematical statistics in the nineteenth century
(2 vols). Arno Press, New York. ISBN 978-0-4051-2590-4
Stigler SM (1986) The history of statistics: the measurement of uncertainty before 1900. Harvard
University Press, Cambridge, MA. ISBN 978-0-6744-0341-3
Stigler SM (1999) Statistics on the table: the history of statistical concepts and methods. Harvard
University Press, Cambridge, MA. ISBN 978-0-6740-0979-0
3 Terminology: Conventions and Recommendations
Curtis L. Meinert

Contents
Introduction
Clinical Trial
Trial Versus Study
Pilot Study Versus Feasibility Study
Name of Trial
Name of the Experimental Variable: Treatment Versus Intervention
Name for Groups Represented by Experimental Variable: Study Group, Treatment Group, or Arm
Persons Studied: Subject, Patient, or Participant
Trial Protocol Versus Manual of Operations
Blocking Versus Stratification and Quotafication
Open
Controlled
Placebo
Consent
Randomization Versus Randomized
Registration Versus Enrollment
Single Center Trial Versus Multicenter Trial
Multicenter Versus Cooperative Versus Collaborative
Principal Investigator (PI) Versus Study Chair
Clinical Investigator Versus Investigator
Steering Committee Versus Executive Committee
Data Monitoring Versus Data Monitoring Committee
Random Versus Haphazard
Primary Versus Secondary Outcomes
Outcome Versus Endpoint
Treatment Failure Versus Treatment Cessation
Blind Versus Mask
Lost to Followup
Dropout
Withdrawals
Design Variable Versus Primary Outcome Measure
Baseline Versus Baseline Period
Screened Versus Enrolled
End of Followup Versus End of Trial
Analysis by Assigned Treatment Versus Per Protocol Analysis
Bias
Early Stop Versus Nominal Stop
Summary
References

C. L. Meinert (*)
Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore,
MD, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_200

Abstract
There are dozens of types of trials with specialized vocabularies, but the feature
common to all is that they are comparative and focused on differences. If the trial
involves just one treatment group, then the focus is on change from enrollment. If
the trial is controlled, then the focus is on differences between the treatment
groups in outcomes during and at the end of the trial.

Keywords
Language · Usage conventions · Clinical trials · Randomized trials

Introduction

The vocabulary of trials is an admixture of vocabularies from medicine, statistics,
epidemiology, and other fields. You need a collection of dictionaries to master the
language: medical, statistics (Upton and Cook 2014), epidemiology (Porta 2014),
and clinical trials dictionaries (Day 1999; Meinert 2012).
To be sure, terminology varies across trials. To be convinced, just read a
few publications of results from different investigators. There is nothing that
can be done about that variation, but you can standardize vocabulary within
your own trial by producing a glossary of accepted terms and sticking to
them.
Variation in language during a trial, even in one involving only a few investiga-
tors, can lead to confusion in the investigator group. Precision of language is a
necessity to avoid confusion on such basic issues as when a person is counted as
enrolled, when a visit is counted as missed, and the difference between a person
being “off treatment,” a dropout, or “lost to followup.”
This chapter is about the vocabulary of trials. Definitions, unless otherwise
indicated, are from or adapted from Clinical Trials Dictionary: Terminology and
Usage Recommendations (Meinert 2012).
Clinical Trial

The term “clinical trial” can mean any of the following: 1. The first use of a treatment
in human beings. 2. An uncontrolled trial involving treatment of people followed
over time. 3. An experiment done involving persons for the purpose of assessing the
safety and/or efficacy of a treatment, especially such an experiment involving a
clinical event as an outcome measure, done in a clinical setting, and involving
persons having a specific disease or health condition. 4. An experiment involving
the administration of different study treatments in a parallel treatment design to a
defined set of study subjects done to evaluate the efficacy and safety of a treatment in
ameliorating or curing a disease or health condition; any such trial, including those
involving healthy persons, undertaken to assess safety and/or efficacy of a treatment
or health care procedure (Meinert 2012). The National Library of Medicine
indexing system defines the publication type as: Pre-planned clinical study of the safety,
efficacy, or optimum dosage schedule of one or more diagnostic, therapeutic, or
prophylactic drugs, devices, or techniques in humans selected according to pre-
determined criteria of eligibility and observed for predefined evidence of favorable
and unfavorable effects (National Library of Medicine 1998).
“Clinical” as an adjective means related to the sickbed or to care given in a clinic.
The use of the term should be limited to trials involving people with medical
conditions and even then usually can be dropped except where necessary to avoid
confusion with other kinds of trials, like in vitro trials or trials involving animals.

Trial Versus Study

Trial, when done to test or assess treatments, should be used rather than the less
informative term study. Study can mean all kinds of things. Trial conveys the essence
of what is being done.
To qualify as a trial there should be a plan – a protocol. Trials may be referred to as
studies and studies as trials. For example, Ambroise Paré’s (Packard 1921) experi-
ence on the battlefield in 1537 in regard to use of a digestive medicament for
treatment of gunshot victims has been referred to as a clinical trial, but that is a misuse
of the term because Paré resorted to the medicament when his supply of boiling oil
ran out. No protocol.
Ironically, ClinicalTrials.gov (aka CT.gov), a registration site created specifically
for registration of trials, does not use the term, opting instead for “interventional study.”

Pilot Study Versus Feasibility Study

A pilot study is one performed as a prelude to a full-scale trial intended to provide
training and experience in carrying out the trial.
A feasibility study is one performed for the purpose of determining whether it is
possible to perform a full-scale trial.

Name of Trial

The most important words in any publication of results from a trial are the few
represented in the title of the manuscript. If it is your trial the words are your choice.
Choose wisely. The name you choose will be used hundreds of times. Steer clear of
names with special characters or symbols.
Avoid use of unnecessary or redundant terms like “controlled” in “randomized
controlled trial”; “randomized” is sufficient to convey “control.”
Include the term “trial.” Avoid using surrogate terms instead of “trial,” like
“study” or “project.”
Include currency terms like “randomized” and “masked” when appropriate.
Include terms to indicate the disease or condition being treated and the treatment
being used, for example, as in Alzheimer’s Disease Anti-inflammatory Prevention
Trial (ADAPT Research Group 2009).
If you are looking for publications of results from trials, do not expect to find
them by screening titles. A sizeable fraction of results publications do not have
“trial” in the title.

Name of the Experimental Variable: Treatment Versus Intervention

The most important variable in trials is the regimen or course of procedures applied
to persons to produce an effect. If you have to choose one name, what will it be?
Treatment or intervention?
“Treat” as a noun (Merriam Webster; online dictionary) means: 1a: the act or
manner or an instance of treating someone or something; b: the techniques or actions
customarily applied in a specified situation; 2a: a substance or technique used in
treating; b: an experimental condition.
“Intervene” as a verb (Merriam Webster; online dictionary) means: 1: to occur,
fall, or come between points of time or events; 2: to enter or appear as an irrelevant or
extraneous feature or circumstance; 3a: to come in or between by way of hindrance
or modification; b: to interfere with the outcome or course especially of a condition
or process (as to prevent harm or improve functioning); 4: to occur or lie between
two things; 5: to become a third party to a legal proceeding begun by others for the
protection of an alleged interest; b: to interfere usually by force or threat of force in
another nation’s internal affairs especially to compel or prevent an action.
There is no perfect choice, but “treatment” comes closer to what one wants to
communicate than intervention.
The downside with “treatment” is when it refers to nonmedical regimens like


counseling schemes to get people to stop smoking, or when the “treatment” involves
devices.
The trouble with “intervene” is that technically anything one does to another
is a form of intervention, whether or not related to administration of study
treatments.
“Intervention” is the term of choice for designating trials on CT.gov. An
“interventional study” on the website is defined as “a clinical study in which
participants are assigned to receive one or more interventions (or no interven-
tion) so that researchers can evaluate the effects of the interventions on
biomedical or health-related outcomes. The assignments are determined by
the study protocol. Participants may receive diagnostic, therapeutic, or other
types of interventions.”

Name for Groups Represented by Experimental Variable: Study Group, Treatment Group, or Arm

Any of the three works, and all are used; study group or treatment group is preferred,
though “arm” is often the term of choice in cancer trials.

Persons Studied: Subject, Patient, or Participant

A frequently used label for persons studied is “research subject,” “study subject,”
or simply “subject.” The advantage of the labels lies in their generic nature, but the
characterization lacks “warmth” as conveyed in a usage note for “subject” as taken
from Meinert: The primary difficulty with the term for persons being studied in the
setting of trials has to do with the implication that the persons are research
objects. The term carries the connotation of subjugation and, thus, is at odds
with the voluntary nature of the participation and requirements of consent. In
addition, it carries the connotation of use without benefit; a misleading connota-
tion in many trials and, assuredly, in treatment trials. Even if such a connotation is
correct, the term suggests a passive relationship with study investigators when, in
fact, the relationship is more akin to a partnership involving active cooperation.
Avoid by using more humanistic terms, such as person, patient, or participant
(Meinert 2012).
Patient versus subject? The terms imply different relationships and ethics under-
lying interactions. “Patient” implies a therapeutic doctor-patient relationship. “Sub-
ject” is devoid of that connotation.
Limit “patient” or “study patient” to settings involving persons with an illness or
disease and a doctor-patient relationship. Avoid in settings involving well people or
when there is a need to avoid connotations of illness or of medical care by using a
medically neutral term, such as study participant.
Trial Protocol Versus Manual of Operations

protocol n – [MF prothocole, fr ML protocollum, fr LGk prōtokollon first sheet of a
papyrus roll bearing date of manufacture, fr Gk prōt- prot- + kollon to glue together,
fr kolla glue; akin to MD helen to glue] 1. Specifications, rules, and procedures for
performing some activity or function. 2. Study protocol. 3. Data collection schedule.
4. Treatment plan. Usage note: Often used as a synonym for treatment, as in “on
protocol.”
study protocol n – [trials] A written document specifying eligibility require-
ments, treatments being tested, method of assigning treatment to treatment units, and
details of data collection and followup. 5. Treatment protocol. Usage note: May refer
to unwritten document when used loosely. Often used as a synonym for treatment, as
in “on protocol.” Assumed to refer to a written document in formal usage; in the
context of trials, a written document that is submitted to Institutional Review Board
(IRBs) for approval and followed by investigators in conduct of the trial.
manual of operations (MOO, MoO, MOP, MoP) n – 1. A document of instruc-
tional material used for performing operations in relation to some defined task or
function. 2. Study manual of operations.
study manual of operations n – 1. A document or collection of documents,
largely in narrative form, describing the procedures used in a center or set of centers
in a study (e.g., study clinics, coordinating center, or reading center) for performing
defined functions. 2. Study handbook. Usage note: Manual and handbook are
sometimes used interchangeably; however, there are differences between the two
types of documents. Use manual to characterize a document organized much like a
book with a series of chapters and written narrative. Use handbook for a collection of
tables, lists, charts, etc., largely devoid of written narrative.

Blocking Versus Stratification and Quotafication

Blocking in relation to treatment assignment is done to ensure that after a specified
number of assignments the assignment ratio is satisfied. For example, in a two
treatment group design with a 1:1 assignment ratio and blocks of size of 8, assign-
ments are constrained so that after the 8th, 16th, etc., there are the same number of
persons assigned to each of the two treatment groups. The purpose of blocking is to
ensure balance in the mix of the treatment assignments over enrollment. Time-
related shifts in the nature of persons enrolled over the course of a trial can be a
confounding variable for treatment comparisons if the mix of persons changes over
time and is different by treatment group.
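
To make the mechanics concrete, the following is a minimal sketch of permuted-block assignment in Python; the block size, 1:1 ratio, and arm labels follow the example above, while the function name and seed are illustrative assumptions rather than part of any particular trial's scheme.

import random

def permuted_block_schedule(n_blocks, block_size=8, arms=("A", "B"), seed=2022):
    """Generate a blocked 1:1 assignment schedule.

    Each block holds equal numbers of each arm in random order, so after
    every block_size assignments the assignment ratio is exactly satisfied.
    """
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for _ in range(n_blocks):
        block = list(arms) * (block_size // len(arms))  # e.g., 4 As and 4 Bs
        rng.shuffle(block)                              # permute within the block
        schedule.extend(block)
    return schedule

schedule = permuted_block_schedule(n_blocks=3)
# After the 8th, 16th, ... assignment the two groups are exactly balanced.
print(schedule[:8].count("A"), schedule[:8].count("B"))  # prints: 4 4

In practice such a schedule would be generated centrally and concealed from clinic staff; the sketch shows only the balancing property of blocks.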
Blocking should not be confused with stratification. Strata in trials are formed by
classifying persons to be enrolled into a trial using some baseline characteristic, for
example, gender, and randomizing within strata.
stratification n – 1. Broadly, the act or process of stratifying. 2. An active
ongoing process of stratifying as in placing persons in strata as a prelude to
randomization. 3. The act or process of classifying treatment units or observations
into strata after enrollment for a subgroup analysis. Avoid confusion when both
forms of stratification are used in a trial by referring to this form of stratification as
post-stratification.
The purpose of stratification is to ensure that treatment assignments are balanced
across strata. To be useful, the stratification variable has to be related to the outcome
of interest. If it is not, there is no statistical gain from stratification.
Stratification and blocking treatment assignments serve different purposes.
Blocking is done to ensure that the assignment ratio for the trial is satisfied at points
in time over the course of enrollment; stratification is done to ensure the compara-
bility of the treatment groups with regard to the stratification variable(s).
Likewise stratification and quotafication are different. Stratification merely
ensures the mix of people with regard to the stratification variable is the same across
treatment groups. The trialist may carry out treatment comparisons by the stratifica-
tion variable but is not under any obligation to do so.
quotafication n – The act or process of imposing a quota requirement on the mix
of persons enrolled in a trial. Not to be confused with stratification. The purpose of
stratification is to ensure that the different treatment groups in a trial have the same
proportionate mix of people with regard to the stratification variable(s).
Quotafication is to ensure a study population having a specified mix with regard to
the variables used for quotafication.
For example, quotafication for gender would involve enrolling a specified number
of males and females and randomizing by gender, that is, with gender also as a
stratification variable. The mix of persons enrolled in a trial is determined by the mix
of persons seen and ultimately judged eligible for enrollment. Hence, the numbers
ultimately represented in the various strata will be variables having values known
only after completion of enrollment. The imposition of a sample size requirement for
one or more of the strata by imposition of quota requirements will extend the time
required for recruitment and should not be imposed unless there are valid scientific
or practical reasons for doing so.
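
To make the distinction concrete, the sketch below (Python; the strata, block size, and arm labels are illustrative assumptions) randomizes within strata by keeping a separate permuted-block stream per stratum. Note that nothing in it fixes how many persons a stratum will ultimately contain; imposing such a fixed count is what quotafication would add.

import random
from collections import defaultdict

def stratified_assigner(block_size=4, arms=("A", "B"), seed=2022):
    """Return an assign(stratum) function using permuted blocks within strata.

    Each stratum (e.g., "male"/"female") gets its own blocked stream, so the
    treatment mix is the same within every stratum.
    """
    rng = random.Random(seed)
    pending = defaultdict(list)  # unissued assignments, one list per stratum

    def assign(stratum):
        if not pending[stratum]:  # current block exhausted; permute a new one
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
            pending[stratum] = block
        return pending[stratum].pop()

    return assign

assign = stratified_assigner()
print([assign("female") for _ in range(4)])  # 2 of each arm within the stratum
print([assign("male") for _ in range(4)])    # likewise, independently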

Open

Open has various meanings in the context of trials as seen below, including one
being a euphemism for unrandomized trials.
open trial n – 1. A trial in which the treating physician, some other person in a
clinic, or the study participant selects the treatment to be administered. 2. A trial in
which treatment assignments are known in advance to clinic personnel or patients,
e.g., schemes where the schedule of assignments is posted in the clinic or as in
systematic schemes, such as odd-even methods of treatment assignment, where the
scheme is known. 3. A trial in which treatments are not masked; nonmasked trial. 4.
A trial still enrolling. 5. A trial involving an open sequential design. Usage note:
Avoid by use of appropriate descriptors to make meaning clear. Use nonmasked in
the sense of defn 3. If used in the sense of defns 4 or 5 make certain the term is not
taken to denote conditions described in defns 1, 2, or 3.
open label adj – [trials] Of or relating to a trial in which study treatments are
administered in unmasked fashion. Usage note: Avoid; use unmasked.

Controlled

control n – 1. A standard of comparison for testing, verifying, or evaluating some
observation or result. 2. Something that controls. 3. A person (or larger observation
unit) used for comparison, e.g., a control in a case-control study; control patient;
control treatment.
controlled adj – 1. Restrained; constrained. 2. Monitored; watched. 3. Any system
of observation and data collection designed to provide a basis for comparing one
group with another, such as provided in a parallel treatment design with concurrent
enrollment to the different study groups represented in the design. 4. Data analysis
involving use of control variables. Usage note: Often unnecessary as a modifier,
especially in relation to design terms that themselves convey the notion of control, as
in randomized controlled trial (the modifier randomized indicates the nature of the
control implied). One assumes that the notion of “control” in the lay sense of usage
applies in all research settings. Hence, usage should be limited to those in the sense
of defns 1 and 2. However, it is conventional to use the term as a modifier of trial,
especially when not preceded or followed by the modifier randomized.
controlled clinical trial n – MEDLINE defn: A clinical trial involving one or
more test treatments, at least one control treatment, specified outcome measures for
evaluating the studied intervention, and a bias-free method for assigning patients to
the test treatment. The treatment may be drugs, devices, or procedures studied for
diagnostic, therapeutic, or prophylactic effectiveness. Control measures include
placebos, active medicine, no-treatment, dosage forms and regimens, historical
comparisons, etc. When randomization using mathematical techniques, such as
the use of a random numbers table, is employed to assign patients to test or control
treatments, the trial is characterized as a randomized controlled trial [publication
type]. However, trials employing treatment allocation methods such as coin flips,
odd-even numbers, patient social security numbers, days of the week, medical record
numbers, or other such pseudo- or quasi-random processes, are simply designated
as controlled clinical trials (National Library of Medicine 1998) (http://www.nlm.
nih.gov/archive/20060905/nichsr/ehta/chapter13.html).

Placebo

placebo adj – 1. Of or relating to the use or administration of a placebo. 2. Of or
relating to something considered to be useless or ineffective. Usage note: Limit use
to the sense of defn 1. Avoid nonsensical uses such as when the term serves as an
adjective for patient or group, as in “placebo patient” or “placebo group”; use
placebo-assigned or placebo-treated instead.
placebo n – [ME, fr L, I shall please, fr placēre to please; the first word of the first
antiphon of the service for the dead, I shall please the Lord in the land of the living, fr
Roman Catholic vespers] 1. A pharmacologically inactive substance given as a
substitute for an active substance, especially when the person taking or receiving it
is not informed whether it is an active or inactive substance. 2. Placebo treatment 3.
A sugar-coated pill made of lactose or some other pharmacologically inert substance.
4. Any medication considered to be useless, especially one administered in pill form.
5. Nil treatment 6. An ineffective treatment. Usage note: Subject to varying use.
Avoid in the sense of defns 4, 5, and 6; not to be used interchangeably with sham.
The use of a placebo should not be construed to imply the absence of treatment.
Virtually all trials involve care and investigators conducting them are obligated to
meet standards of care, regardless of treatment assignment and whether masked or
not. As a result, a control treatment involving use of a placebo is best thought of as a
care regimen with placebo substituting for one element of the care regimen. Labels
such as “placebo patient” or “placebo group” create the impression that patients
assigned to receive placebos are left untreated. The labels (in addition to being
wrong in the literal sense of usage) are misleading when placebo treatment is in
addition to other treatments, as usually the case.
placebo control n – 1. Placebo-control treatment 2. A treatment involving the use
of a placebo.
placebo effect n – 1. The effect produced by a placebo; assessed or measured
against the effect expected or observed in the absence of any treatment. 2. The effect
produced by an inactive control treatment. 3. The effect produced by a control
treatment considered to be nil. 4. An effect attributable to a placebo. rt: sham effect
Usage note: Limit usage to settings involving the actual use of a placebo. Avoid in
the sense of defns 2 and 3 when the control treatment does not involve a placebo.
placebo group n – 1. Placebo-assigned group 2. Placebo-treated group 3. A
group not receiving any treatment (avoid).

Consent

Usually the modifier “informed” is more an expression of hope than of fact. Its use is
best reserved for settings in which there are steps built into the consent process to
ensure an informed decision based on evidence of comprehension of what is
involved, or for settings in which the decision can be demonstrated to have been
informed; otherwise use consent.

Randomization Versus Randomized

random adj – [ME impetuosity, fr MF randon, fr OF, fr randir, to run, of Gmc
origin, akin to OHG rinnan to run] [general] 1. Having or appearing to have no
specific pattern or objective. 2. Of or designating a process in which the occurrence
of previous events is of no value in predicting future events. 3. Haphazard
[scientific]. 4. Of or relating to a sequence, observation, assignment, arrangement,
etc., that is the result of a chance process with known or knowable probabilities. 5.
Of or relating to a process that has the properties of one that is random. 6.
Pseudorandom. 7. Of or relating to a single value, observation, assignment, or
arrangement that is the result of randomization. Syn: casual, chance, haphazard
Usage note: Subject to misuse. Avoid in the absence of a probability base (e.g., as in
random blood sugar); use haphazard or some other term implying less rigor than
random. Misuse in the context of trials arises most commonly in relation to charac-
terizations of treatment assignment schemes as random that are systematic or
haphazard. In scientific discourse, reserve the descriptor for uses in the sense of
defns 4, 5, 6, and 7.
randomization n – 1. An act of assigning or ordering that is the result of a
random process such as that represented by a sequence of numbers in a table of
random numbers or a sequence of numbers produced by a random number generator,
e.g., the assignment of a patient to treatment using a random process. 2. The process
of deriving an order or sequence of items, specimens, records, or the like using a
random process. Usage note: Do not use as a characterization except in settings
where there is an explicit or implied mathematical basis for supporting the usage, as
discussed in the usage note for random adj. Use other terms implying less rigor than
implied by randomization, such as haphazardization, quasirandomization, or chance,
when that basis is not present or evident.
randomized n – [trials] The condition of having been assigned to a treatment via
a random process; normally considered to have occurred when the treatment assign-
ment is revealed to any member of the clinic staff, e.g., when an envelope containing
the treatment assignment is opened.

Registration Versus Enrollment

registration n – 1. Registering; as in entering name and other pertinent information
into a register. 2. Enrollment 3. A document certifying the act of registering. 4. The
granting of an application or license; in regard to a new drug, the approval of a new
drug application by a regulatory agency. 5. The act of registering a trial on CT.gov or
other similar registry. Usage note: In trials, registration may or may not correspond
to enrollment. Usually the act of registration is a necessary but not sufficient
condition for enrollment. Hence, registration and enrollment should not be used
interchangeably. Registration typically takes place at the first contact with a person
during screening; signaled by the act of entering the person’s name into a register or
log or issue of an identification number for the person. The act of enrollment takes
place when the treatment assignment is revealed or treatment is initiated; usually
after baseline evaluations have been completed and consent has been obtained.
enrollment n – 1. The act of enrolling a person in a research study. 2. The state of
having been enrolled. Usage note: Ambiguous when used in the absence of detail
indicating the point at which enrollment occurs. Generally, in the case of randomized
trials, that point when treatment assignment is revealed to clinic personnel. Not to be
confused with registration.

Single Center Trial Versus Multicenter Trial

A trial is single center if all activities involved in conducting the trial are housed
within the same institution. A trial is multicenter if it has two or more enrollment
sites.
single-center trial n – 1. A trial performed at or from a single site: (a) Such a trial,
even if performed in association with a coalition of clinics in which each clinic
performs its own trial, but in which all trials focus on the same disease or condition
(e.g., such a coalition formed to provide preliminary information on a series of
different approaches to the treatment of hypertension by control or reduction); (b) A
trial not having any clinical centers and a single resource center, e.g., the Physicians’
Health Study (Henneken and Eberlein 1985; Physicians’ Health Study Research
Group Steering Committee 2012). 2. A trial involving a single clinic; with or without
satellite clinics or resource centers. 3. A trial involving a single clinic and a center to
receive and process data. 4. A trial involving a single clinic and one or more resource
centers.
multicenter trial n – 1. A trial involving two or more clinical centers, a common
study protocol, and a data center, data coordinating center, or coordinating center to
receive, process, and analyze study data. 2. A trial involving at least one clinical
center or data collection site and one or more resource centers. 3. A trial involving
two or more clinics or data collection sites.
The usual line of demarcation between single and multicenter is determined by
whether or not there is more than one treatment or data collection site. Hence, a trial
having multiple centers may still be classified as a single-center trial if it has only one
treatment or data collection site.

Multicenter Versus Cooperative Versus Collaborative

Multicenter is preferred to collaborative or cooperative because cooperation and
collaboration are not unique to multicenter trials.

Principal Investigator (PI) Versus Study Chair

In research, “principal investigator” refers to the person having responsibility for
conduct of the research; the lead scientist on a research project. “Principal” means
first, highest, or foremost in rank, importance, or degree; chief. Hence, use of the
term to refer to multiple persons on the same research project is an oxymoron of
sorts. Confusion arises when the term refers to multiple persons in the same study,
as is often the case in some multicenter trials. In general, the term is best avoided in
favor of “study chair.”

Clinical Investigator Versus Investigator

Investigator is the generic name applied to anyone in a research setting who has a
key role in conducting the research or some aspect of the research.
In the context of trials, clinical investigator refers to persons with responsibilities
for enrolling and caring for persons enrolled in the trial. Avoid as a designation when
used to the exclusion of others having investigator status, for example, in settings
also involving nonclinical investigators, as in data coordinating centers.

Steering Committee Versus Executive Committee

steering committee (SC) n – In multicenter trials, the committee responsible for
conduct of the trial and to which other study committees report. Usually headed by
the study chair and consisting of persons designated or elected to represent study
centers, disciplines, or activities. Usage note: Sometimes executive committee.
executive committee (EC) n – A committee within multicenter leadership struc-
tures responsible for direction of the day-to-day affairs of the study and accountable
to the steering committee; usually consists of the officers of the study and others
selected from the steering committee; typically headed by the chair or vice-chair of
the steering committee.

Data Monitoring Versus Data Monitoring Committee

data monitoring v – 1. Monitoring relating to the process of data collection. 2.
Monitoring related to the detection of problems in the execution of a study (perfor-
mance monitoring) or assessing treatment effects (treatment monitoring).
data monitoring committee (DMC) n – A committee with defined responsibil-
ities for data monitoring, e.g., as required in performance or treatment effects
monitoring.
treatment effects monitoring n – The act of or an instance of reviewing
accumulated outcome data by treatment group to determine if the trial should
continue unaltered.

Random Versus Haphazard

random adj – [ME impetuosity, fr MF randon, fr OF, fr randir, to run, of Gmc
origin, akin to OHG rinnan to run] [general] 1. Having or appearing to have no
specific pattern or objective. 2. Of or designating a process in which the occurrence
of previous events is of no value in predicting future events. 3. Haphazard [scien-
tific]. 4. Of or relating to a sequence, observation, assignment, arrangement, etc., that
tific]. 4. Of or relating to a sequence, observation, assignment, arrangement, etc., that
is the result of a chance process with known or knowable probabilities. 5. Of or
relating to a process that has the properties of one that is random. 6. Pseudorandom.
7. Of or relating to a single value, observation, assignment, or arrangement that is the
result of randomization. Syn: casual, chance, haphazard Usage note: Subject to
misuse. Avoid in the absence of a probability base (e.g., as in random blood sugar);
use haphazard or some other term implying less rigor than random. Misuse in the
context of trials arises most commonly in relation to characterizations of treatment
assignment schemes as random that are systematic or haphazard. In scientific
discourse, reserve the descriptor for uses in the sense of defns 4, 5, 6, and 7.
pseudorandom adj – Being or involving entities that are generated, selected, or
ordered by a deterministic process that can be shown to generate orders that satisfy
traditional statistical tests for randomness. Usage note: Most random number gen-
erators are, in fact, pseudorandom number generators, though usually referred to as
random number generators. Typically they are built using deterministic computa-
tional procedures that rely on a user supplied seed to start the generation process; use
of the same seed on different occasions will generate the exact same sequence of
numbers.
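
The seed behavior described in the usage note is easy to demonstrate; a minimal sketch in Python (the seed value is arbitrary):

import random

g1 = random.Random(20220101)  # same user-supplied seed ...
g2 = random.Random(20220101)  # ... on a different occasion

seq1 = [g1.randint(0, 9) for _ in range(5)]
seq2 = [g2.randint(0, 9) for _ in range(5)]

# Deterministic generation: identical seeds yield the exact same sequence,
# even though the output satisfies conventional statistical tests for randomness.
assert seq1 == seq2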
haphazard adj – Occurring without any apparent order or pattern. Usage note:
Use when characterizing a process that is unordered but not meeting the scientific
definition of random, or where there is uncertainty as to whether that definition is
satisfied. Do not equate haphazard with random in scientific discourse. Distinct from
random, in that there is no mathematical basis for characterizing haphazard
processes.

Primary Versus Secondary Outcomes

primary outcome: 1. The event or condition a trial is designed to treat, ameliorate,
delay, or prevent. 2. The outcome of interest as specified in the primary objective. 3.
delay, or prevent. 2. The outcome of interest as specified in the primary objective. 3.
The foremost measure of success or failure of a treatment in a trial. 4. The actual
occurrence of a primary event in a study participant. 5. Primary endpoint (not
recommended). Usage note: Not to be used interchangeably with design variable.
The modifier, “primary,” should be used sparingly, since primariness depends on
perspective. Most trials involve observations of various outcomes, each with differ-
ent implications for well-being or life.
primary outcome measure: 1. That measure specifically designated as primary
in the study protocol. 2. That measure, among two or more in a trial, considered to be
of primary importance in its design (e.g., the one used for the sample size calcula-
tion) or analysis; may be continuous or an event; primary outcome variable. 3.
Design variable.
Subject to misuse and confusion. Without access to the study protocol or absent
explicit statements as to the primary outcome measure in a manuscript, readers may
be hard put to know if the outcome focused on in the analysis is “primary” as
represented in the definitions above.

Outcome Versus Endpoint

Use outcome instead of endpoint.
An outcome, broadly defined, is something that follows as a consequence of some
antecedent action or event. In the context of trials it is an event or measure observed
or recorded for a person during or following treatment in a trial. The term may refer
to primary or secondary outcome measures.
Endpoint, often used instead of or as a synonym for outcome, but best avoided.
Most “endpoints” noted over the course of followup in trials are not indicators of
“end” in regard to treatment or followup. Most protocols call for followup over a
defined period of time, even in the presence of or following intercurrent events.
Therefore, there are no endpoints in the operational sense of usage, except death. Use
of the term in protocols and manuals for trials can cause personnel at clinics to stop
treatment and followup on the occurrence of an “endpoint” if they mistakenly regard
the term as having operational meaning.

Treatment Failure Versus Treatment Cessation

treatment failure n – 1. The failure of a treatment, as used in or on a person, to
produce a desired effect or result. 2. Such a failure as observed, inferred, or declared
produce a desired effect or result. 2. Such a failure as observed, inferred, or declared
by a study physician or other study personnel from measurements, evaluations, or
observations on the person in question and resulting in cessation of the treatment or a
treatment switch. 3. A person in a trial no longer receiving the assigned treatment;
especially cessation of treatment occurring because of concerns regarding the safety
or efficacy of the treatment. Usage note: The term should be used with caution
because of the implied conclusion regarding the treatment and value-laden meaning.
Its use should be limited to settings where there is supporting evidence indicating a
failure. It should not be used simply as a synonym for treatment cessation.
treatment cessation n – 1. Cessation of treatment of a person, especially that due
to lack of benefit or intolerable or undesirable side effects associated with treatment.
2. Cessation of a designated treatment regimen in a trial because of lack of benefit,
especially such cessation arising from treatment effects monitoring. 3. Treatment
termination.

Blind Versus Mask

blind, blinded adj – Being unaware or not informed of treatment assignment; being
unaware or not informed of course of treatment.
mask, masked adj – Of, relating to, or being a procedure in which persons (e.g.,
patients, treaters, or readers in a trial) are not informed of certain items of informa-
tion, e.g., the treatment represented by a treatment assignment in a clinical trial.
Preferred to blind.
mask n – A condition imposed on an individual (or group of individuals) for the
purpose of keeping that individual (or group of individuals) from knowing or
learning of some condition, fact, or observation, such as treatment assignment, as
in single-masked or double-masked trials.
The term “blind,” as an adjective descriptor in relation to treatment administra-
tion, is more widely used in trials than its counterpart descriptor of “mask” or
“masked.” The shortcoming of “blind” as a descriptor in relation to treatment
administration has to do with unfortunate connotations (e.g., as in “blind stupidity”)
and the fact that the characterization can be confusing to study participants (e.g., in
vision trials where loss of vision or blindness is an outcome measure). For these
reasons, mask is preferred to blind.

Lost to Followup

Avoid as a generic label. Be specific as to what is lost. Generally used as a synonym
for a person who does not show up for followup visits but often such persons can be
for a person who does not show up for followup visits but often such persons can be
followed by telephone.

Dropout

Variously defined: 1. One who terminates involvement in an activity by decla-
ration or action; especially one who so terminates because of waning interest or
ration or action; especially one who so terminates because of waning interest or
for physical, practical, or philosophical reasons. 2. A person who withdraws
from a trial. 3. A person who fails to appear for a followup visit, e.g., a person
so classified after having failed to appear for three consecutive followup visits
as defined by specified visit time windows. 4. One who refuses or stops taking
the assigned treatment. 5. One who stops taking the assigned treatment and
whose reason for doing so is judged not to be related to the assigned
treatment.
Subject to varying usage. Most trials require continuing data collection
regardless of course of treatment. Hence, a “dropout” in the sense of defn 4
may continue to be an active participant in regard to scheduled data collection.
Persons meeting the requirements of defns 4 or 5 are better characterized in
relation to treatment adherence. Avoid uses in the sense of defn 5 because of
difficulty in making reliable judgments regarding the reason a person stops taking
the assigned treatment. The stated reason may not be the real reason. Defn 2
includes those who actively refuse, those who passively refuse, and those who
are simply unable to continue followup for physical or practical reasons. Further,
the definition allows for the possibility of a person returning for followup. Most
long-term trials will have provisions for reinstating persons classified as dropouts
if and when they return to a study clinic for required data collection. Avoid in the
sense of defn 3 in relation to a single visit or contact in the absence of other
reasons for regarding someone as a dropout. Use other language, such as missed
visit or missed procedure, to avoid the connotation of dropout. The term should
not be confused with lost to followup, noncompliant, withdrawal, or endpoint. A
dropout need not be lost to followup if one can determine outcome without
seeing or contacting the person (as in some forms of followup for survival) but
will be lost to followup if the outcome measure depends on data collected from
examinations of the person. Similarly, the act of dropping out need not affect
treatment compliance. A person will become noncompliant upon dropping out in
settings where doing so results in discontinuation of an active treatment process.
However, there may be no effect on treatment compliance in settings where the
assigned treatment is administered only once on enrollment and where that
treatment is not routinely available outside the trial. Similarly, the term should
not be confused with or used as a synonym for withdrawal, since its meaning is
different from that for dropout.

Withdrawals

withdrawal n – 1. The act of withdrawing from a trial. 2. The removal of a person
from a lifetable analysis at the cessation of followup or at the occurrence of the
event of interest; removal due to cessation of followup may occur as a conse-
quence of when the person was enrolled (e.g., calculation of a three-year event
rate based on data provided by those who were enrolled at least three years prior
to the date of the analysis) or because the person dropped out. 3. Dropout (not a
recommended synonym). 4. One who has been removed from treatment; treat-
ment withdrawal. 5. One who is not receiving or taking the assigned treatment
(not recommended usage).
The term should not be used as a synonym for dropout or lost to followup.
When used in the context of treatment, as in defns 4 and 5, use should be with
details indicating the nature of use. Use in the sense of defn 4 involves the act of
removing a person from treatment. Use in the sense of defn 5 is neutral with
regard to action.

Design Variable Versus Primary Outcome Measure

The design variable is the variable used for determining sample size in planning a
trial.
It is usually the same as the primary outcome measure but not always. For example, the design variable could be the difference in blood pressure after a specified period of treatment, while the outcome of primary interest could be cardiovascular death.
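
To illustrate how the design variable drives sample size, consider a minimal sketch using the familiar two-sample normal approximation; the function name and the numbers below are hypothetical, not taken from any particular trial.

```python
# Minimal sketch (hypothetical numbers): sample size per group when the
# design variable is a difference in mean blood pressure.
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Two-sample z-approximation: n per group to detect a mean difference
    `delta` on the design variable, with standard deviation `sd`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Detect a 5 mmHg difference, SD 15 mmHg, 90% power, two-sided alpha 0.05
print(n_per_group(delta=5, sd=15))  # -> 190 per group
```

The distinction is visible in the calculation itself: it is driven entirely by the design variable (blood pressure), even when the outcome of primary clinical interest is something else.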

Baseline Versus Baseline Period

baseline (Bl, BL) n – 1. An observation, set of observations, measurement, or series of measurements made or recorded on a person just prior to or in conjunction with
treatment assignment that serves as a basis for gauging change in relation to
treatment assignment. 2. An observation, series of observations, measurement, or
series of measurements made or recorded at some point after enrollment in relation to
some act or event that serves as a basis for gauging change (e.g., a blood pressure
measurement made in relation to an increase in dosage of an anti-hypertensive drug
to measure the effect of the increase). Usage note: Subject to varying uses. Typically,
in trials, unless otherwise indicated, reserved for characterizations that are consistent
with defn 1. Baseline observations in most trials arise from a series of baseline
examinations, separated in time by days or weeks. Hence, the time of observation for
one baseline variable, relative to another, may be different.
baseline period n – [general] A period of time that is used to perform procedures
needed to assess the suitability and eligibility of a study candidate for enrollment into
a study, to collect required baseline data, and to carry out consent processes. [trials]
1. For a study participant, the period defined by the first data collection visit and
ending with assignment to treatment. 2. Such a period ending shortly after assign-
ment to treatment. 3. A period of time during the course of treatment or followup of a
person, marked by some event, process, or procedure, in which new measurements
or observations are made to serve as a base for gauging subsequent change. 4.
Enrollment period. Usage note: Avoid in the sense of defn 2 or 4 without defining
qualifications. Provide qualifying detail for uses in the sense of defn 3. Traditionally,
the point defining the end of the baseline period in trials is assignment to or initiation
of treatment. The tendency to “stretch” the baseline period, as in defn 2, arises from a
desire to reduce missing baseline data. Clearly, the utility of a measure as a baseline
measure is diminished if there are possibilities of the observation being influenced
by treatment. Hence, the practice is not recommended, even if the time interval
following treatment assignment or initiation of treatment is small and even if the
likelihood of treatment having had an effect on the variable(s) being observed within
that interval is small.

Screened Versus Enrolled

screen, screened, screening, screens v – To assess or examine in some systematic way in order to separate persons into groups or to identify a subset eligible for further
evaluation or enrollment into some activity, e.g., the process of measuring blood
pressures of all persons appearing at a clinic for the purpose of identifying people
suitable for enrollment in a study of high blood pressure.
screening n – 1. A search for persons with an identifying marker or characteristic,
as determined by results from some test or observation, known or believed to be
associated with some disease (or adverse health condition). 2. The process of
evaluating study candidates for enrollment into a study. 3. Any of a variety of
procedures applied to data to identify outlier or questionable values. 4. A 100% inspection of items, such as in a manufacturing process, in which unacceptable items
are rejected.
enrollment n – 1. The act of enrolling a person in a research study. 2. The state of
having been enrolled. Usage note: Ambiguous when used in the absence of detail
indicating the point at which enrollment occurs. Generally, in the case of randomized trials, that point is when the treatment assignment is revealed to clinic personnel. Not to be
confused with registration.

End of Followup Versus End of Trial

Followup for persons enrolled in a trial may end at the same time regardless of when
they were enrolled (common closing date design) or may end on a per person basis
after a specified period of time after enrollment (anniversary closing date design).
End of trial is when all enrollment and data collection activities cease.
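
The two closing-date designs can be sketched as follows; the dates are hypothetical, and three years is approximated as 3 × 365 days.

```python
# Sketch of the two closing-date designs (hypothetical dates).
from datetime import date, timedelta

enrolled = [date(2020, 1, 15), date(2020, 9, 1), date(2021, 3, 10)]

# Common closing date design: followup ends on the same day for everyone.
common_close = [date(2023, 6, 30)] * len(enrolled)

# Anniversary closing date design: followup ends a fixed interval
# (here, roughly three years) after each person's own enrollment.
anniversary_close = [d + timedelta(days=3 * 365) for d in enrolled]

for d, c, a in zip(enrolled, common_close, anniversary_close):
    print(f"enrolled {d}: common close {c}, anniversary close {a}")
```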

Analysis by Assigned Treatment Versus Per Protocol Analysis

Analysis by assigned treatment (aka intention to treat analysis) is an analysis in which persons are counted in the treatment group to which they were assigned, even if they did
not receive any of the assigned treatment. The analysis is the sine qua non of analysis
in trials. It may be supplemented by other analyses, but the analysis by assigned
treatment is central to whatever conclusions are reached regarding the results.
An alternative analysis is per protocol analysis (PPA) in which persons are
grouped by treatment received rather than by assigned treatment. The analysis is sometimes considered to provide a more realistic estimate of the actual treatment effect, but it is subject to selection biases and hence should be presented only as a supplement to
analysis by assigned treatment.
Authors of papers are expected to label their analyses so readers know if the
results being summarized are per protocol or by treatment assignment.
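
As a toy illustration of the difference between the two labels, the sketch below groups the same hypothetical records both ways; crossovers migrate between groups under the per protocol arrangement, which is exactly what invites selection bias.

```python
# Toy sketch (hypothetical records) contrasting the two analyses.
import pandas as pd

df = pd.DataFrame({
    "assigned": ["A", "A", "A", "B", "B", "B"],  # randomized arm
    "received": ["A", "B", "A", "B", "B", "A"],  # treatment actually taken
    "event":    [0, 1, 0, 1, 0, 1],              # outcome (1 = event)
})

# Analysis by assigned treatment (intention to treat): persons are
# counted in the group to which they were randomized.
itt = df.groupby("assigned")["event"].mean()

# Per protocol analysis: persons are grouped by treatment received.
pp = df.groupby("received")["event"].mean()

print(itt, pp, sep="\n\n")
```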

Bias

bias n – 1. An inclination of temperament, state of mind, or action based on perception, opinion, or impression serving to reduce rational thought or action, or
the making of impartial judgments; a specified instance of such an inclination. 2. A
tendency toward certain measurements, outcomes, or conclusions over others as a
result of a conscious or subconscious mind set, temperament, or the like; a specific
expression of such a tendency. 3. Any behavior or performance that is differential
across groups in a trial; treatment-related bias. 4. Deviation of the expected value of
an estimate of a statistic from its true value. Usage note: Distinguish between uses in
which bias (defns 1 or 2) is being proposed in a speculative sense as opposed to an
actual instance of bias. Usages in the latter sense should be supported with evidence
or arguments to substantiate the claim. Usages in the former sense should be
preceded or followed by appropriate modifiers or statements to make clear that the
user is speculating. Similarly, since most undifferentiated uses (in the sense of defns
1 or 2) are in the speculative sense, prudent readers will treat all uses as being in that
sense, unless accompanied by data, evidence, or arguments to establish bias as a fact.
Not to be confused with systematic error. Systematic error can be removed from
finished data; bias is more elusive and not easily quantified.
selection bias n – 1. A systematic inclination or tendency for elements or units
selected for study (persons in trials) to differ from those not selected. 2. Treatment-related selection bias. Usage note: The bias defined by defn 1 is unavoidable in most
trials because of selective factors introduced as a result of eligibility requirements for
enrollment and because of the fact that individuals may decline enrollment. The
existence of the bias does not affect the validity of treatment comparisons so long as
it is the same across treatment groups, for example, as when treatment assignments
are randomized.
treatment-related selection bias n – Broadly, bias related to treatment assign-
ment introduced during the selection and enrollment of persons into a trial; often due
to knowing treatment assignments in advance of issue and using that information in
the selection process. The risk of the bias is greatest in unmasked trials involving
systematic assignment schemes (e.g., one in which assignments are based on order
or day of arrival of persons at a clinic). It is nil in trials involving simple
(unrestricted) randomization, but can arise in relation to blocked randomization if
the blocking scheme is known or deduced. For example, one would be able to
correctly predict one-half of the assignments before use in an unmasked trial of two
study treatments arranged in blocks of size two, if the blocking is known or deduced.
The chance of the bias operating, even if the blocking scheme is simple, is minimal
in double-masked trials.
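
The arithmetic behind the blocks-of-two example can be checked with a short simulation; the code is a hypothetical sketch, not taken from any actual allocation system.

```python
# Sketch: with permuted blocks of size two, an observer in an unmasked
# trial who has deduced the blocking can predict every second assignment.
import random

def blocked_sequence(n_blocks):
    """Permuted-block randomization, block size two, arms A and B."""
    seq = []
    for _ in range(n_blocks):
        block = ["A", "B"]
        random.shuffle(block)  # each block contains one A and one B
        seq.extend(block)
    return seq

seq = blocked_sequence(10_000)

# Predict the second assignment in each block as the complement of the
# first; the guess is always right, i.e., half of all assignments.
correct = sum(
    ("B" if seq[i] == "A" else "A") == seq[i + 1]
    for i in range(0, len(seq), 2)
)
print(correct / (len(seq) // 2))  # -> 1.0
```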

Early Stop Versus Nominal Stop

A nominal stop occurs when all treatment and data collection procedures in the trial cease or are stopped, usually because the trial has been completed as planned. Nominal stops can also be the result of loss of funding or
because of orders to stop by the funding agency or from a regulatory agency.
Early stops may pertain to all persons enrolled in the trial as in nominal stops or
only to a subset. The stop may be due to a clinical hold issued by a regulatory agency
or may be due to evidence that a treatment is harmful or ineffective. Typically, early
stops are the result of actions taken by investigators based on interim looks at
accumulating data over the course of the trial.

Summary

The vocabulary of trials is an admixture of vocabularies principally from medicine, statistics, and epidemiology. Students of trials need collections of dictionaries and
glossaries to be competent in the language of trials.
Whether you are a person simply interested in trials or a person designing and
conducting a trial, you have to know the language and jargon of trials. You have to be
familiar with the difference between “study” and “trial.” You have to know the
difference between “dropout” and “lost to followup,” “open trials” and “open label
trials,” “control” and “controlled,” “endpoint” and “outcome,” and “blocking” and
“stratification.”

4 Clinical Trials, Ethics, and Human Protections Policies
Jonathan Kimmelman

Contents
Origins of Research Ethics
Conception of Trials
Design of Trials
  Risk/Benefit
  Justification of Therapeutic Procedures: Clinical Equipoise
  Justification of Demarcated Research Procedures
  Riskless Research, High Risk Research, Comparative Effectiveness Trials, and Ethics
  Maximizing Efficiencies
Justice
  Fair Subject Selection
  Inclusion
Trial Inception
  Respect for Persons
  Elements of Valid Informed Consent
  Research Without Informed Consent
  Informed Consent Documents
  Independent Review
Conduct
  Trials and Ethics Across Time
Reporting
  Publication and Results Deposition
  Methods and Outcome Reporting
  The Afterlife of Trials
Synthesis
Cross-References
References

J. Kimmelman (*)
Biomedical Ethics Unit, McGill University, Montreal, QC, Canada
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_238

Abstract
Clinical trials raise two main sets of ethical challenges. The first concerns
protecting human beings when they are used in scientific experiments. The
second concerns protecting the welfare of downstream users of medical evidence
generated in trials. The present chapter reviews core ethical standards and prin-
ciples governing the conception, design, conduct, and reporting of clinical trials.
This review concludes by suggesting that even the most technical decisions about
design and reporting embed numerous moral judgments about how to serve the
interests of research subjects and downstream users of medical evidence.
Clinical trials are experiments on human beings. They involve two elements
that make them ethically sensitive undertakings. First, the very reagent used in the
experiment – the human being – has what philosophers call moral status. That is,
human beings are sentient and self-aware, they have preferences and plans, and
they have a capacity for suffering. Human beings are thus entitled to having their
interests respected and protected when they are themselves the research reagents.
Second, clinical trials are aimed at supporting decision-making in health care.
Life and death decisions are ultimately based on the evidence we generate from
human experiments. Human beings who use that evidence deserve protection
from scientific findings that are misleading, incomplete, or biased.
In what follows, I provide a very condensed overview of the ethics of clinical
trials. Most writings on the ethics of clinical trials have centered on the protection
of human volunteers, treating issues of research integrity as an afterthought, if at
all. Two core claims ground this review – claims that perhaps differentiate it from
similar overviews of human research ethics. The first is that scientific integrity is a
complement to human protections; even the most technical decisions about
design and analysis are laden with implicit ethical judgments. The second is
that clinical trials present ethical challenges across their full life cycle – not
merely in the brief window when a trial is open for enrollment, and where
human protection regulations are directed. This review is thus organized
according to the life cycle of a clinical trial.

Keywords
Clinical trials · Research ethics · Human protections · Medical evidence ·
Research oversight

Origins of Research Ethics

The moral and regulatory framework for protecting human subjects has its origins in
the aftermath of World War II. The US prosecution of 23 Nazis in the so-called Nazi
Doctors’ trial led to the first formalized policy on human research ethics, the
“Nuremberg Code” (Annas and Grodin 1995). At least in North America, however,
the Nuremberg Code went largely unheeded for two decades. Revelations of various
research abuses (Beecher 1966), including those surrounding the Tuskegee Syphilis
study (an observational study of African American men with middle to late stage
syphilis that had run continuously from 1932) led the US Congress to establish the
National Commission for the Protection of Human Subjects of Biomedical and
Behavioral Research. A key task for this committee was to articulate the basic moral
principles of human research. The product of this effort came to be known as the
Belmont Report, and regulations established in the USA from this effort include 45
CFR 46 and the Food and Drug Administration’s (FDA) equivalent, 21 CFR 50. The
first section of the former, which covers the general requirements of human pro-
tections, is sometimes called the “Common Rule”; it was revised in 2018 (Menikoff
et al. 2017). Various other jurisdictions and bodies have articulated their own
policies as well, including the World Health Organization (1982 with several
revisions since) (World Health Organization and Council for International Organi-
zations of Medical Sciences 2017), the World Medical Association (1964 with
numerous revisions since) (General Assembly of the World Medical Association
2014), and the Canadian Tri-council (1998 with one revision) (Canadian Institutes of
Health Research et al. 2018). Though policies around the world vary around the
edges, they share a core consensus in expressing principles and policies articulated
by the Belmont Report. These principles are respect for persons (implemented by
obtaining informed consent or restricting risk for persons lacking capacity); benef-
icence (implemented by independent establishment of a favorable balance of risk/
benefit); and justice (implemented by comparing a trial population to the target
population of the knowledge).
The Belmont Principles provide orientation points for thinking about the ethics of
a trial. But they are not exhaustive (Kimmelman 2020). Nor do regulations address
all duties attending to the conduct of clinical trials. I will return to these gaps
periodically.

Conception of Trials

The Belmont Report contains a suggestive and widely overlooked statement: “Rad-
ically new procedures of this description should, however, be made the object of
formal research at an early stage in order to determine whether they are safe and
effective.” Nevertheless, there are no regulations or policies that govern the choice of
what research questions to address in trials. As far as ethical oversight and drug
regulation is concerned, there is no moral distinction between a trial testing a me-too
drug for male pattern baldness and a trial testing a promising new treatment for
pediatric glioma.
Yet clearly, the resources available for research are finite, and some research
questions are more deserving of societal investment than other questions. This is
illustrated by the often quoted claim that 90% of the world’s resources are committed
to addressing health issues that afflict only 10% of the world population (the so-
called “10–90% gap”) (Flory and Kitcher 2004). It is also illustrated by the historic
and unfair exclusion of certain populations from medical research (such as children,
women, persons living in economically deprived settings, racial minorities, pregnant
women, and elderly populations), or the persistence of medical uncertainty surround-
ing widely deployed treatments (e.g., the value of PCI for treatment of angina) (Al-
Lamee et al. 2018).
Four general considerations might be offered for selecting research hypotheses.
First, researchers should direct their attention towards questions that are unresolved.
This may seem obvious. However, many trials address questions that have already
been adequately resolved. One particularly striking example was the persistence of
placebo-controlled trials testing the drug aprotinin, long after its efficacy had been
decisively established (Fergusson et al. 2005). Drug companies often run trials that
are primarily aimed at promoting a drug rather than testing an unresolved hypothesis
(Vedula et al. 2013). Second, researchers should only test hypotheses that are
sufficiently mature. For example, researchers generally ought not to initiate phase
1 trials unless there are compelling preclinical studies to motivate them; they should
generally not pursue phase 3 studies if there is insufficient grounds to settle on a dose
or treatment schedule for testing (again, there are many examples of trials that have
been launched absent compelling evidentiary grounds) (Kimmelman and Federico
2017). The best way to ground a claim that a medical hypothesis merits evaluation in
clinical trials is with a systematic review (Savulescu et al. 1996; Chalmers and
Nylenna 2014; Nasser et al. 2017); some jurisdictions require systematic review
before trial conduct (Goldbeck-Wood 1998).
Third, researchers should prioritize clinical questions that are likely to have the
greatest impact on health and well-being. To some extent researchers’ priorities are
constrained by their field, logistical considerations, and funding options. Neverthe-
less, they can exercise some discretion within these constraints. All else being equal,
researchers ought to favor trials involving conditions that cause greater morbidity or
mortality (whether because of prevalence or intensity of morbidity) and that afflict
unfairly disadvantaged or excluded populations.
Finally, researchers should not initiate trials unless there are reasonable pros-
pects for findings being incorporated into downstream decisions. For example,
phase 1 trials should generally meet a more demanding review standard if there is
no sponsor to carry encouraging findings forward in a phase 2 trial. Many
exploratory trials suggesting the promise of approved drugs in new indications
are never advanced into more rigorous clinical trials (Federico et al. 2019).
Research that is not embedded within a coordinated research program presents a
variety of problems, including a concern about disseminating potentially biased
research findings that are incorporated into clinical practice guidelines. The pre-
sent author has argued that abortive research programs raise questions about the
social value of many exploratory trials – especially in the post-approval research
context (Carlisle et al. 2018).

Design of Trials

Risk/Benefit

All major policies on research ethics require that risks are favorably balanced against
benefits to society in the form of improved knowledge and benefit to subjects (if
any). Benefits include direct medical benefits of receiving medical interventions
tested in trials (if any) and those expected by addressing a research hypothesis.
Inclusion benefit (the benefit patients might receive from extra medical attention they
receive when entering trials, regardless of treatment assignment) is generally con-
sidered irrelevant for establishing a favorable risk benefit (King 2000).
How, then, are researchers and oversight bodies to operationalize the notion of a
favorable risk/benefit balance? The Belmont Report urges an assessment of risk and
benefit that is systematic, evidence based, and explicit. One of the most useful
approaches is “component analysis” (Weijer and Miller 2004). Clinical trials typi-
cally involve a mix of potentially burdensome exposures, including treatment with
unproven drugs, venipuncture, imaging or diagnostic procedures, and/or tissue
biopsies. Component analysis involves dividing a study into its constituent pro-
cedures, and evaluating the risk/benefit for each individual procedure. Importantly,
benefits associated with one procedure cannot be used to “purchase” risk for other
procedures. For example, the burdens of a painful lumbar puncture cannot be
justified by appealing to the therapeutic advantages offered by access to a novel
treatment in a trial.
In performing component analysis, procedures can be sorted into two categories.
Some procedures, like withholding an established effective treatment, implicate care
obligations and are thus termed “therapeutic procedures.” Other procedures, like
venipunctures to monitor a metabolite, are performed solely to advance a research
objective; these are called “demarcated research procedures.” Each has a separate
process for justifying risk.

Justification of Therapeutic Procedures: Clinical Equipoise

Therapeutic procedures in trials are generally evaluated according to the principle of clinical equipoise. First defined in 1987, clinical equipoise refers to genuine uncer-
tainty within an expert community as to an intervention’s clinical value as compared
with standard of care (Freedman 1987). A trial that meets these conditions is said to
be ethical when, if executed properly, the trial is a necessary step in resolving that
expert uncertainty. According to the principle of clinical equipoise, patients should
not be assigned to treatments that are known in advance to fall below a standard of
care that a patient would receive outside a trial. For example, for Alzheimer’s
disease, there are currently no proven treatments for reducing progression. Accord-
ingly, assigning a patient to a placebo comparator in a trial testing an Alzheimer’s
treatment does not deprive that patient of a standard of care that patient would
otherwise receive. On the other hand, asking a patient with relapse remitting multiple
sclerosis to forgo disease-modifying treatment for a year would fall below standard
of care. Placebo-controlled trials of this duration in multiple sclerosis would gener-
ally be unethical (Polman et al. 2008).
In a single concept, clinical equipoise captures three imperatives for risk/benefit in
research. First, it preserves a physician’s duty of care when they participate in
research. As such, physicians can recruit patients without compromising their
fiduciary obligations to patients. Second, the principle of clinical equipoise estab-
lishes a standard for acceptable risk in clinical trials by benchmarking risk/benefit
in trials to generally accepted standards in medicine. Third, clinical equipoise
establishes a standard for scientific value. A trial is only ethical insofar as it is a
necessary step towards resolving uncertainty in the expert community. This has
subtle implications. For example, it means small, underpowered trials are ethically
questionable (unless they are a necessary step towards resolving medical uncer-
tainty, as in the case of phase 2 trials, or designed expressly to be incorporated in
future meta-analyses) (Halpern et al. 2002), since “positive” but underpowered
trials might be sufficient to encourage further trials, but will generally be insuffi-
cient to convince the expert medical community about a treatment’s advantage
over standard of care.
Though clinical equipoise was first articulated in the context of randomized trials
testing new interventions, it can logically be extended to single armed studies that
use historical controls as comparators. The concept of clinical equipoise is not
without critics (reviewed in London 2007), and its operationalization – like many
ethical concepts – can pose challenges (e.g., how much residual uncertainty is
necessary for a trial to be ethical). Just the same, no other concept comes close to
binding the moral dimensions of trials to their methodology and the obligations of
those who conduct them.

Justification of Demarcated Research Procedures

The justification of risks associated with demarcated research procedures proceeds in two steps. In the first, researchers should minimize burdens – for example, by using
state-of-the-art techniques for collecting tissue samples. Remaining risks then need
to be justified by appealing to the value of the knowledge such procedures enable. If
this seems incredibly vague, it is. At best, one can look to the precedent of other
studies to ask whether the risks of a research procedure have generally been deemed
to have been justified by the incremental gain in knowledge. At worst, this vagueness
reveals ongoing unresolved problems for research ethics. As a general rule, demar-
cated research risks can never exceed minimal risk (or minor increase over minimal
risk in the USA) for studies involving minors; for studies in patients that have
capacity, they should never involve risk of death or irreversible injury. The sum of
all research burdens in component analysis should still be favorably balanced by the
value of the trial.

Riskless Research, High Risk Research, Comparative Effectiveness Trials, and Ethics

Both extremes of risk in research pose challenges to the assessment and evaluation of
risk in research. Some trials, like early phase trials testing novel strategies, or trials of
aggressive treatments in pre-symptomatic patients, present high degrees of risk and
uncertainty. Many patients are willing to undertake extraordinary levels of risk, and
for patients who have exhausted treatment options, a “standard of care” may be
difficult to define for establishing clinical equipoise. Some might argue that, in such
circumstances, investigators and ethics review committees should defer to well-
informed preferences of research volunteers. However, risk in trials can impact
others outside of trials (e.g., third parties), or undermine public confidence (Hope
and McMillan 2004). For example, a major debacle or a series of negative trials in a
novel research arena can undermine support for parallel research efforts, as occurred
with gene therapy in the late 1990s. Though ethics policies and oversight systems do
not generally instruct investigators to consider how their trials might affect parallel
investigations, some commentators argue that researchers bear duties to steward
research programs and refrain from activities that might damage them (London et al.
2010).
At the other extreme are seemingly riskless studies. One category of riskless
studies is “seeding trials”: trials that involve well-characterized drugs and that are
aimed primarily at marketing by habituating doctors to prescribing them (Andersen
et al. 2006) or by generating a publication that can function legally as an advertise-
ment through reprint circulation (US Department of Health and Human Services,
Food and Drug Administration 2009; Federico et al. 2019) rather than resolving a
scientific question. Most human protections policies have little to say about seeding
trials, because their risks are so low that little to no scientific value is needed to
justify them. Seeding trials are nevertheless an obvious breach of scientific integrity
(Sox and Rennie 2008; London et al. 2012). Such studies not only sap scarce human
capital for research, but subvert the aims of science (which is aimed at belief change
through evidence, not habituation or attentional manipulation) and undermine the
credibility of the research enterprise.
Comparative effectiveness and usual care–randomized trials represent a second
category of seemingly riskless studies. In these studies, patients are randomly
assigned to standards of care in order to determine whether one standard of care is
better or noninferior (many such studies do not use any demarcated research pro-
cedures). Even when such studies use primary endpoints like mortality or major
morbidity, they are often viewed as “riskless” insofar as all patients are receiving the
same treatments within trials that they would receive outside of trials (Lantos and
Feudtner 2015). Whether “usual care” randomized trials are necessarily minimal risk
is hotly debated among research ethicists. The present author would argue that they
should not be understood as minimal risk (Kane et al. 2020). First, by using a morbid
primary endpoint, researchers are openly declaring they are uncertain as to whether
one standard is better than another on a clinically meaningful measure. Second, it is
impossible to exclude the possibility that a patient who opts to enter a usual care
randomized trial – by having their treatment determined using randomization – is directed toward a treatment trajectory that leaves them worse off against the coun-
terfactual of their not joining the trial. To say that a study fulfills clinical equipoise means that the probabilities of risk and benefit within the trial are very similar to those outside a
trial. There may, in fact, be little rational basis for declining trial participation.
However, these statements do not entail that the trial is necessarily minimal risk.

Maximizing Efficiencies

Human protections policies are overwhelmingly focused on protecting individual research subjects from undue risk. They are not set up to consider whether a given
research question might be addressed with fewer patients, or whether a design is
suboptimal. This means that human protections policies have very little to say
directly about inefficient research designs, including trials that are overpowered or
over-accrue, trials that use uneven randomization ratios, or that divide their alpha
excessively over many hypotheses. Yet clearly, if a medical question can be resolved
by burdening a smaller number of patients (even though, for each individual patient,
there is a favorable risk/benefit balance), that design ought to be preferred (Hey and
Kimmelman 2014). Similarly, many studies employ designs that – while riskless per
se – cloud the interpretation of results. For example, in one study, 43% of trials in
meta-analyses were deemed to show high risk of bias in at least one domain of the
risk of bias assessment tool; simple and low-cost design refinements could have
halved this figure (Yordanov et al. 2015).
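
To see why dividing alpha over many hypotheses is an efficiency (and hence an ethical) concern, a back-of-the-envelope calculation suffices; the sketch below uses a two-sample z-approximation with hypothetical numbers.

```python
# Sketch (hypothetical numbers): Bonferroni-splitting alpha across k
# hypotheses shrinks per-test power at a fixed sample size, so more
# subjects would be needed to answer each question reliably.
from scipy.stats import norm

def power(delta, sd, n, alpha):
    """Approximate power of a two-sided two-sample z-test, n per group."""
    z = norm.ppf(1 - alpha / 2)
    se = sd * (2 / n) ** 0.5
    return 1 - norm.cdf(z - delta / se)

for k in (1, 5, 10):
    print(k, round(power(delta=5, sd=15, n=190, alpha=0.05 / k), 2))
# -> roughly 0.90, 0.75, 0.67 as alpha is split more finely
```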

Justice

Fair Subject Selection

The principle of justice, as originally interpreted by the Belmont Report and US regulations, pertains to the relationship between vulnerable populations used in
clinical trials and those who benefit from the knowledge acquired in the trials. The
principle originated from a concern that disadvantaged research groups, like
prisoners, racial minorities, or children, were often enrolled in trials that were
aimed at addressing medical uncertainties more relevant for advantaged groups (e.g., racial majorities, adults, etc.). The issue of fair subject selection became a focus
of debate in the late 1990s, when questions were raised about the testing of short
course AZT for the prevention of perinatal mother to child transmission of HIV.
These trials, which were aimed at testing an AZT course that was more suited to
the infrastructure and economics of low-income settings, employed a placebo
comparator even though a longer (and more expensive and medically demanding)
course of AZT had been established as standard of care for high-income countries
(Crouch and Arras 1998). Some critics charged the study violated clinical equi-
poise, since placebo fell below the high-income standard of care. However, the
high-income standard of care was simply inaccessible in low-income countries, where the local standard of care was nontreatment.
Following this episode, international policies like the Declaration of Helsinki and
others were revised to articulate two core expectations for research funded by high-
income sponsors and conducted in vulnerable groups. The first is that trials must
make provisions for post-trial access for patients/participants in a trial in the event a
treatment shows benefit. The second is “responsiveness,” articulated in the Decla-
ration of Helsinki (to pick one definition) as “research [that] is responsive to the health needs or priorities of this group and [that] cannot be carried out in a non-vulnerable group; [in addition] this group should stand to benefit from the knowledge, practices or
interventions that result from the research.” In some cases, as in trials testing
economical approaches to treating tropical disease, responsiveness is easy to estab-
lish. In other cases, the connection of trials to health needs of disadvantaged
populations is more attenuated.
The high costs and regulatory demands of conducting trials have motivated many
drug companies to pursue trials in low- and middle-income countries (Glickman
et al. 2009). For instance, many pivotal cancer drug trials include recruitment sites in
former Eastern Bloc countries. Cancer treatments are among the most expensive
drugs in the modern pharmacopeia. While health care systems in countries like
Romania and Ukraine confront cancer, the extent to which they are likely to
absorb the costs of new cancer treatments is unclear. Of note, one of the most
influential regulatory documents for clinical trial research ethics, the International
Conference on Harmonization “Good Clinical Practice” policy (International Coun-
cil for Harmonisation 1996) omits language pertaining to such justice concerns;
this omission is inconsistent with nearly every other influential and contemporary
policy on human protections (Kimmelman et al. 2009).

Inclusion

A second major front for expansion of the justice principle in the 1990s was
inclusion. By the 1990s, it had become increasingly clear that certain populations
had been excluded – often systematically – from research, unfairly depriving these
populations of medical evidence for managing their conditions. These populations
have variously included gay men, African Americans, women, children, pregnant
women, and the elderly (Palmowski et al. 2018).
Major policy reforms in the USA at funding agencies and with drug regulation
have encouraged greater inclusion of (and analysis of subgroups for) children (US
Department of Health and Human Services, Food and Drug Administration 2005),
women (Elahi et al. 2016), and racial minorities (US Department of Health and
Human Services, Food and Drug Administration 2016). Though ethical review of
trial protocols does not typically focus on inclusion and representativeness, it is now
widely recognized that, absent a compelling scientific or policy rationale, clinical
trial investigators should strive to maximize the representativeness of the
populations they recruit into trials – particularly in later phase trials that are aimed
at directly informing regulatory approvals and/or health care. Even with these policy
reforms, there are suggestions that certain populations continue to be underrepre-
sented in clinical research relative to incidence of disease in these populations
(Dickmann and Schutzman 2017; Ghare et al. 2019). Studies that do enroll diverse
populations often do not report stratified analyses, potentially frustrating the aim of
broader inclusion.

Trial Inception

Respect for Persons

Only after a trial has been deemed to fulfill the above expectations is informed
consent relevant. All major policies require that investigators offer prospective
research subjects the opportunity to consider a study’s risks, burdens, and benefits
against their preferences, values, and goals. This consent, expressed at the outset of
screening and enrollment, must be ongoing for the duration of a clinical trial.

Elements of Valid Informed Consent

Valid informed consent is said to consist of three core elements (Faden and
Beauchamp 1986). The first is capacity. Prospective research participants must
have the cognitive and emotional resources to render informed judgments about
trial participation. Generally, capacity is a clinical judgment. In cases where there are
concerns, there are tools for assessing capacity to participate in research. Some
populations, like children, trauma victims, or persons with dementia, lack compe-
tence to provide informed consent. Under such circumstances, there are other pro-
visions for respecting persons (see below).
The second element of valid informed consent is understanding. Prospective
research subjects must receive, comprehend, and appreciate information that is
material to their decision to enroll in a trial. Information includes (but is not limited
to): risk/benefit, study procedures, study purpose, and alternatives to participation.
There is a very large literature showing that patients often report inability to recall
basic information about study features. In particular, many patients struggle to
accurately understand the probability of benefit (therapeutic overestimation)
(Horng and Grady 2003) or the way research participation may constrain ability to
pursue individualized care (therapeutic misconception) (Appelbaum et al. 1987).
The third element of a valid informed consent is voluntariness. Prospective research
participants should be free of controlling influences, such as coercion (i.e., threatening
to make an individual worse off, or threatening to withhold something that is owed to
the individual) or undue manipulation (i.e., alterations to choice architecture, disclosure
processes, or interactions) that encourage compliance. Some forms of manipulation are
considered ethical – at least for certain routine research settings. Healthy volunteer
phase 1 studies often use financial payment to manipulate an individual’s enrollment in
a trial. Key to judging whether a manipulation is “undue” is whether it involves an
offer that is disrespectful (Grant and Sugarman 2004) (i.e., offering to pay an individual
to override a moral commitment) or whether an offer is irresistible. Compensation, that
is, covering expenses associated with lost wages, parking, or travel, is different from
inducement and does not involve manipulation.

Research Without Informed Consent

There are several circumstances where human research can be ethically and legally
conducted without valid informed consent of research subjects. One is in studies
involving persons lacking decisional competence. Generally, three protections are
established for such populations. First, demarcated research risk is limited to min-
imal risk or minor increase over minimal risk (though policies vary). Second,
surrogate consent is sought from parents or guardians (in the case of children) or
from a designated agent (e.g., an individual designated as such in an advanced
directive, or a family member) for incapacitated adults. Third, where applicable,
assent (i.e., agreement and cooperation) is sought from the research subject.
A second circumstance where consent can be waived, at least in some jurisdic-
tions, is emergency research (e.g., trauma or resuscitation trials). As above,
demarcated research procedures cannot exceed minimal risk. Because surrogate
consent cannot typically be obtained in emergency research, there are provisions
for public disclosure and community consultation before such studies are launched
(Halperin et al. 2007).
Many proposals have circulated about expedited or waived informed consent,
particularly in the context of usual care trials. One such example is Zelen’s consent,
which pre-randomizes patients and bypasses informed consent for those assigned to
treatments that are identical to those they would receive had they not enrolled in a trial.
This particular design is generally subject to strong ethical criticisms, since patients who
are randomly assigned to standard treatments are denied the opportunity to consent to
research participation (Hawkins 2004). However, other similar trial designs have been
proposed that, according to some, correct these ethical deficiencies while preserving
expedience. The reader is directed elsewhere for discussions (Flory et al. 2016).

Informed Consent Documents

One of the main vehicles for informed consent is the informed consent document
(ICD). ICDs typically contain a description of key disclosure elements of a study.
ICDs are widely criticized for their readability, their length, and their ineffectiveness
in supporting understanding among research participants. However, ICDs can be
better understood as a supplement for face-to-face discussions, which are much more
effective at achieving understanding (Flory and Emanuel 2004). They also provide
Institutional Review Boards (IRBs, described in the next section) with a proxy for what will be covered in these discussions. Effort spent fussing over wording, if redirected
towards an appraisal of risk and benefit, would probably be a better investment for
members of IRBs.

Independent Review

As Henry Beecher noted in his 1966 exposé (Beecher 1966), physicians harbor
divided loyalties when they conduct clinical research. Judgments about risk and
benefit, informed consent, and fair subject selection are refereed prospectively by
submitting trial protocols to independent review bodies (in the USA, these commit-
tees are called Institutional Review Boards or IRBs; in Canada they are called
“Research Ethics Boards” or REBs). Various policies stipulate the composition of
REBs, as well as the range of issues they should (or should not) consider. Different
models of REB review have emerged in the past decade or so, including for profit
REB review, centralized review mechanisms, and specialized review mechanisms
for fields like gene therapy or cancer trials (Levine et al. 2004).
Before trial launch, REB approval must be obtained. All design elements and
planned analyses for the trial should be pre-specified in a trial protocol. The protocol
should have been reviewed for scientific merit. And main design details, hypotheses,
and planned analyses should be registered prospectively in a public database like
clinicaltrials.gov.

Conduct

Trials and Ethics Across Time

Trials occur over time. New information emerges from within a trial as it unfolds, or
from concurrent research or adverse events documented outside of trials. This
information can alter a study’s risk/benefit balance, necessitating an alteration of
study design, reconsenting, and sometimes halting a trial. Similarly, slow recruit-
ment can compromise a study’s risk/benefit balance, since under-accrual can stymie
the ability of a trial to achieve the quantum of social value that was projected during
ethical review, and the options available outside the trial can change. Accordingly,
risk/benefit must always be monitored as a study proceeds. In studies involving
higher levels of risk, this duty typically devolves to investigators and data safety
monitoring boards (DSMBs). DSMBs confront myriad policy, ethical, and statistical
challenges; the reader is directed elsewhere for further discussions of trial monitor-
ing (DeMets et al. 2005).

Reporting

Publication and Results Deposition

Once a trial is complete, the fulfillment of the risk/benefit established at trial outset
requires that results be disseminated to relevant knowledge users. Until recently,
there were few expectations and regulations on trial reporting. Numerous studies
have shown that many clinical trials are never published. For example, the present
author’s own work showed that only 37% of pre-license trials of drugs that stalled in
clinical development were published within 5 years of trial closure (Hakala et al.
2015). Many policies, like the Declaration of Helsinki or Canada’s Tri-Council Policy Statement, articulate a requirement of deposition of results for all clinical research.
The US FDA also requires deposition of clinical trial results in clinicaltrials.gov
within 12 months of completion of primary endpoint collection (US Department of
Health and Human Services 2016). However, many clinical trials are exempt from
this requirement, including phase 1 trials as well as trials testing nonregulated
products (e.g., surgeries, psychotherapies (Azar et al. 2019), or any research not
pursued as part of an IND). Some funders, institutions, and journals have policies
intended to address these gaps.

Methods and Outcome Reporting

The main way findings in trials are disseminated is through publication. Trial reports should provide a frank and transparent description of methods and results.
This entails at least three considerations. First, methods should be described in
sufficient detail to support valid inferences. Methods in the report should be consis-
tent with the study protocol. Second, results should be reported in full, and consistent
with planned analyses. For example, all planned subgroup analyses should be
reported; any new subgroup analyses should be labeled as post hoc analyses.
Third, study reports should explain limitations and what new results mean in the
context of existing evidence. There is a wealth of literature showing these three
aspirations are not always fulfilled in trials. Regarding the first, systematic reviews
show that many trials do not adequately describe methods such as how allocation
was concealed or how randomization sequences were generated (Turner et al. 2012),
or have reported primary outcomes that are inconsistent with those stated in trial
protocols (Mathieu et al. 2009). Regarding complete reporting, safety outcomes are
often not well reported in trials (Phillips et al. 2019). Lack of balance in reports is
suggested by the frequent use of “spin” in trial reports (Boutron et al. 2014), or by the
selective presentation of positive subgroup analyses in study abstracts (Kasenda
et al. 2014).

The Afterlife of Trials

Researchers and sponsors continue to have ethical obligations to research participants and evidence consumers. For instance, many commentators argue that there is
an obligation to share trial results with study participants (Partridge and Winer 2002;
Dixon-Woods et al. 2006). Response to researcher queries, sharing unpublished
analyses, or making data available can interact with the value and burden associated
with research (Bauchner et al. 2016). One area where these obligations are
contended is access to individual patient data. Recently, several medical journals
have proposed (and some endorsed) an expectation that researchers who publish in
their venues make provisions for sharing individual patient data (Loder and Groves
2015). Most patient participants favor sharing individual patient data provided
safeguards are in place (Mello et al. 2018). Model safeguards for patient privacy and
research integrity are described elsewhere (Mello et al. 2013).

Synthesis

The above review is, by necessity, cursory and leaves many dimensions of clinical
trial ethics unaddressed. These include questions about protecting third parties like
caregivers in research (Kimmelman 2005), the ethics of incidental findings (Wolf
et al. 2008), and ancillary care obligations (Richardson and Belsky 2004). New trial
methodologies like adaptive trial designs (Bothwell and Kesselheim 2017) or clus-
ter-randomized trials (Weijer et al. 2011) pose challenges for implementing the
ethical standards described above.
It is tempting to view human protections and research ethics as a set of considerations
that are only visited once a clinical trial has been designed and submitted for ethical
review. However, decisions about what hypotheses to test, how to test them, how the
trial is conducted, and how to report results are saturated with ethical judgments. Most of
these judgments occur absent clear regulatory guidance, or outside the gaze of research
ethics boards. In that way, every scientist participating in the conception, design,
reporting, and uptake of clinical research is practicing research ethics.

Cross-References

▶ ClinicalTrials.gov
▶ Clinical Trials in Children
▶ Consent Forms and Procedures
▶ International Trials
▶ Reporting Biases

References
Al-Lamee R, Thompson D, Dehbi H-M et al (2018) Percutaneous coronary intervention in stable
angina (ORBITA): a double-blind, randomised controlled trial. Lancet 391:31–40. https://doi.
org/10.1016/S0140-6736(17)32714-9
Andersen M, Kragstrup J, Søndergaard J (2006) How conducting a clinical trial affects physicians’
guideline adherence and drug preferences. JAMA 295:2759–2764. https://doi.org/10.1001/
jama.295.23.2759
Annas GJ, Grodin MA (eds) (1995) The Nazi doctors and the Nuremberg Code: human rights in
human experimentation, 1st edn. Oxford University Press, New York
Appelbaum PS, Roth LH, Lidz CW et al (1987) False hopes and best data: consent to research and
the therapeutic misconception. Hast Cent Rep 17:20–24
Azar M, Riehm KE, Saadat N et al (2019) Evaluation of journal registration policies and prospec-
tive registration of randomized clinical trials of nonregulated health care interventions. JAMA
Intern Med. https://doi.org/10.1001/jamainternmed.2018.8009
4 Clinical Trials, Ethics, and Human Protections Policies 69

Bauchner H, Golub RM, Fontanarosa PB (2016) Data sharing: an ethical and scientific imperative.
JAMA 315:1238–1240. https://doi.org/10.1001/jama.2016.2420
Beecher HK (1966) Ethics and clinical research. N Engl J Med 274:1354–1360. https://doi.org/10.
1056/NEJM196606162742405
Bothwell LE, Kesselheim AS (2017) Thereal-world ethics of adaptive-design clinical trials. Hast
Cent Rep 47:27–37. https://doi.org/10.1002/hast.783
Boutron I, Altman DG, Hopewell S et al (2014) Impact of spin in the abstracts of articles reporting
results of randomized controlled trials in the field of cancer: the SPIIN randomized controlled
trial. J Clin Oncol 32:4120–4126. https://doi.org/10.1200/JCO.2014.56.7503
Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of
Canada, Social Sciences and Humanities Research Council of Canada, Secretariat on Respon-
sible Conduct of Research (Canada) (2018) Tri-Council policy statement: ethical conduct for
research involving humans
Carlisle B, Federico CA, Kimmelman J (2018) Trials that say “maybe”: the disconnect between
exploratory and confirmatory testing after drug approval. BMJ 360:k959. https://doi.org/10.
1136/bmj.k959
Chalmers I, Nylenna M (2014) A new network to promote evidence-based research. Lancet
384:1903–1904. https://doi.org/10.1016/S0140-6736(14)62252-2
Crouch RA, Arras JD (1998) AZT trials and tribulations. Hast Cent Rep 28:26–34. https://doi.org/
10.2307/3528266
DeMets DL, Furberg CD, Friedman LM (eds) (2005) Data monitoring in clinical trials: a case
studies approach, 2006 edition. Springer, New York
Dickmann LJ, Schutzman JL (2017) Racial and ethnic composition of cancer clinical drug trials:
how diverse are we? Oncologist. https://doi.org/10.1634/theoncologist.2017-0237
Dixon-Woods M, Jackson C, Windridge KC, Kenyon S (2006) Receiving a summary of the results
of a trial: qualitative study of participants’ views. BMJ 332:206–210. https://doi.org/10.1136/
bmj.38675.677963.3A
Elahi M, Eshera N, Bambata N et al (2016) The Food and Drug Administration Office of Women’s
Health: impact of science on regulatory policy: an update. J Women’s Health 25:222–234.
https://doi.org/10.1089/jwh.2015.5671
Faden RR, Beauchamp TL (1986) A history and theory of informed consent. Oxford University
Press, New York
Federico CA, Wang T, Doussau A et al (2019) Assessment of pregabalin postapproval trials and the
suggestion of efficacy for new indications: a systematic review. JAMA Intern Med 179:90–97.
https://doi.org/10.1001/jamainternmed.2018.5705
Fergusson D, Glass KC, Hutton B, Shapiro S (2005) Randomized controlled trials of aprotinin in
cardiac surgery: could clinical equipoise have stopped the bleeding? Clin Trials 2:218–232.
https://doi.org/10.1191/1740774505cn085oa
Flory J, Emanuel E (2004) Interventions to improve research participants’ understanding in
informed consent for research: a systematic review. JAMA 292:1593–1601. https://doi.org/10.
1001/jama.292.13.1593
Flory JH, Kitcher P (2004) Global health and the scientific research agenda. Philos Public Aff
32:36–65. https://doi.org/10.1111/j.1467-6486.2004.00004.x
Flory JH, Mushlin AI, Goodman ZI (2016) Proposals to conduct randomized controlled trials
without informed consent: a narrative review. J Gen Intern Med 31:1511–1518. https://doi.org/
10.1007/s11606-016-3780-5
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317:141–145.
https://doi.org/10.1056/NEJM198707163170304
General Assembly of the World Medical Association (2014) World Medical Association Declara-
tion of Helsinki: ethical principles for medical research involving human subjects. J Am Coll
Dent 81:14–18
Ghare MI, Chandrasekhar J, Mehran R et al (2019) Sex disparities in cardiovascular device
evaluations: strategies for recruitment and retention of female patients in clinical device trials.
JACC Cardiovasc Interv 12:301–308. https://doi.org/10.1016/j.jcin.2018.10.048
Glickman SW, McHutchison JG, Peterson ED et al (2009) Ethical and scientific implications of the
globalization of clinical research. N Engl J Med 360:816–823. https://doi.org/10.1056/
NEJMsb0803929
Goldbeck-Wood S (1998) Denmark takes a lead on research ethics. BMJ 316:1185. https://doi.org/
10.1136/bmj.316.7139.1185j
Grant RW, Sugarman J (2004) Ethics in human subjects research: do incentives matter? J Med
Philos 29:717–738. https://doi.org/10.1080/03605310490883046
Hakala A, Kimmelman J, Carlisle B et al (2015) Accessibility of trial reports for drugs stalling in
development: a systematic assessment of registered trials. BMJ 350:h1116. https://doi.org/10.
1136/bmj.h1116
Halperin H, Paradis N, Mosesso V et al (2007) Recommendations for implementation of community
consultation and public disclosure under the Food and Drug Administration’s “exception from
informed consent requirements for emergency research”: a special report from the American
Heart Association Emergency Cardiovascular Care Committee and Council on Cardiopulmo-
nary, Perioperative and Critical Care: endorsed by the American College of Emergency Physi-
cians and the Society for Academic Emergency Medicine. Circulation 116:1855–1863. https://
doi.org/10.1161/CIRCULATIONAHA.107.186661
Halpern SD, Karlawish JHT, Berlin JA (2002) The continuing unethical conduct of underpowered
clinical trials. JAMA 288:358–362
Hawkins JS (2004) The ethics of Zelen consent. J Thromb Haemost 2:882–883. https://doi.org/10.
1111/j.1538-7836.2004.00782.x
Hey SP, Kimmelman J (2014) The questionable use of unequal allocation in confirmatory trials.
Neurology 82:77–79. https://doi.org/10.1212/01.wnl.0000438226.10353.1c
Hope T, McMillan J (2004) Challenge studies of human volunteers: ethical issues. J Med Ethics
30:110–116. https://doi.org/10.1136/jme.2003.004440
Horng S, Grady C (2003) Misunderstanding in clinical research: distinguishing therapeutic mis-
conception, therapeutic misestimation, and therapeutic optimism. IRB 25:11–16
International Council for Harmonisation (1996) Guideline for good clinical practice. https://
database.ich.org/sites/default/files/E6_R2_Addendum.pdf
Kane PB, Kim SYH, Kimmelman J (2020) What research ethics (often) gets wrong about minimal
risk. Am J Bioeth 20:42–44. https://doi.org/10.1080/15265161.2019.1687789
Kasenda B, Schandelmaier S, Sun X et al (2014) Subgroup analyses in randomised controlled trials:
cohort study on trial protocols and journal publications. BMJ 349:g4539. https://doi.org/10.
1136/bmj.g4539
Kimmelman J (2005) Medical research, risk, and bystanders. IRB Ethics Hum Res 27:1
Kimmelman J (2020) What is human research for? Reflections on the omission of scientific integrity
from the Belmont Report. Perspect Biol Med 62(2):251–261
Kimmelman J, Federico C (2017) Consider drug efficacy before first-in-human trials. Nature
542:25–27. https://doi.org/10.1038/542025a
Kimmelman J, Weijer C, Meslin EM (2009) Helsinki discords: FDA, ethics, and international drug
trials. Lancet 373:13–14. https://doi.org/10.1016/S0140-6736(08)61936-4
King NMP (2000) Defining and describing benefit appropriately in clinical trials. J Law Med Ethics
28:332–343. https://doi.org/10.1111/j.1748-720X.2000.tb00685.x
Lantos JD, Feudtner C (2015) SUPPORT and the ethics of study implementation: lessons for
comparative effectiveness research from the trial of oxygen therapy for premature babies.
Hast Cent Rep 45:30–40. https://doi.org/10.1002/hast.407
Levine C, Faden R, Grady C et al (2004) “Special scrutiny”: a targeted form of research protocol
review. Ann Intern Med 140:220–223
Loder E, Groves T (2015) The BMJ requires data sharing on request for all trials. BMJ 350:h2373.
https://doi.org/10.1136/bmj.h2373
London AJ (2007) Clinical equipoise: foundational requirement or fundamental error? In: The
Oxford handbook of bioethics. Oxford University Press, Oxford, pp 571–596
London AJ, Kimmelman J, Emborg ME (2010) Beyond access vs. protection in trials of innovative
therapies. Science 328:829–830. https://doi.org/10.1126/science.1189369
London AJ, Kimmelman J, Carlisle B (2012) Rethinking research ethics: the case of postmarketing
trials. Science 336:544–545. https://doi.org/10.1126/science.1216086
Mathieu S, Boutron I, Moher D et al (2009) Comparison of registered and published primary
outcomes in randomized controlled trials. JAMA 302:977–984. https://doi.org/10.1001/jama.
2009.1242
Mello MM, Francer JK, Wilenzick M et al (2013) Preparing for responsible sharing of clinical trial
data. N Engl J Med 369:1651–1658. https://doi.org/10.1056/NEJMhle1309073
Mello MM, Lieou V, Goodman SN (2018) Clinical trial participants’ views of the risks and benefits
of data sharing. N Engl J Med. https://doi.org/10.1056/NEJMsa1713258
Menikoff J, Kaneshiro J, Pritchard I (2017) The common rule, updated. N Engl J Med 376:613–615.
https://doi.org/10.1056/NEJMp1700736
Nasser M, Clarke M, Chalmers I et al (2017) What are funders doing to minimise waste in research?
Lancet 389:1006–1007. https://doi.org/10.1016/S0140-6736(17)30657-8
Palmowski A, Buttgereit T, Palmowski Y et al (2018) Applicability of trials in rheumatoid arthritis
and osteoarthritis: a systematic review and meta-analysis of trial populations showing adequate
proportion of women, but underrepresentation of elderly people. Semin Arthritis Rheum. https://
doi.org/10.1016/j.semarthrit.2018.10.017
Partridge AH, Winer EP (2002) Informing clinical trial participants about study results. JAMA
288:363–365. https://doi.org/10.1001/jama.288.3.363
Phillips R, Hazell L, Sauzet O, Cornelius V (2019) Analysis and reporting of adverse events in
randomised controlled trials: a review. BMJ Open 9:e024537. https://doi.org/10.1136/bmjopen-
2018-024537
Polman CH, Reingold SC, Barkhof F et al (2008) Ethics of placebo-controlled clinical trials in
multiple sclerosis: a reassessment. Neurology 70:1134–1140. https://doi.org/10.1212/01.wnl.
0000306410.84794.4d
Richardson HS, Belsky L (2004) The ancillary-care responsibilities of medical researchers. An
ethical framework for thinking about the clinical care that researchers owe their subjects. Hast
Cent Rep 34:25–33
Savulescu J, Chalmers I, Blunt J (1996) Are research ethics committees behaving unethically?
Some suggestions for improving performance and accountability. BMJ 313:1390–1393. https://
doi.org/10.1136/bmj.313.7069.1390
Sox HC, Rennie D (2008) Seeding trials: just say “no”. Ann Intern Med 149:279–280
Turner L, Shamseer L, Altman DG et al (2012) Consolidated standards of reporting trials (CON-
SORT) and the completeness of reporting of randomised controlled trials (RCTs) published in
medical journals. Cochrane Database Syst Rev 11:MR000030. https://doi.org/10.1002/
14651858.MR000030.pub2
U.S. Department of Health and Human Services (2016) 42 CFR 11: clinical trials registration and
results information submission
U.S. Department of Health and Human Services, Food and Drug Administration (2005) Guidance
for industry: how to comply with the Pediatric Research Equity Act
U.S. Department of Health and Human Services, Food and Drug Administration (2009) Good reprint
practices for the distribution of medical journal articles and medical or scientific reference
publications on unapproved new uses of approved drugs and approved or cleared medical devices
U.S. Department of Health and Human Services, Food and Drug Administration (2016) Collection
of race and ethnicity data in clinical trials
Vedula SS, Li T, Dickersin K (2013) Differences in reporting of analyses in internal company
documents versus published trial reports: comparisons in industry-sponsored trials in off-label
uses of gabapentin. PLoS Med 10:e1001378. https://doi.org/10.1371/journal.pmed.1001378
Weijer C, Miller PB (2004) When are research risks reasonable in relation to anticipated benefits?
Nat Med 10:570. https://doi.org/10.1038/nm0604-570
Weijer C, Grimshaw JM, Taljaard M et al (2011) Ethical issues posed by cluster randomized trials in
health research. Trials 12:100. https://doi.org/10.1186/1745-6215-12-100
Wolf SM, Paradise J, Caga-anan C (2008) The law of incidental findings in human subjects
research: establishing researchers’ duties. J Law Med Ethics 36:361–383, 214. https://doi.org/
10.1111/j.1748-720X.2008.00281.x
World Health Organization, Council for International Organizations of Medical Sciences (2017)
International ethical guidelines for health-related research involving humans. CIOMS, Geneva
Yordanov Y, Dechartres A, Porcher R et al (2015) Avoidable waste of research related to inadequate
methods in clinical trials. BMJ 350:h809. https://doi.org/10.1136/bmj.h809
5 History of the Society for Clinical Trials

O. Dale Williams and Barbara S. Hawkins

Contents
Introduction
Background: 1967–1972
Steps in the Creation of the Society for Clinical Trials
  Meetings of Coordinating Center Personnel
  Coordinating Center Models Project
  National Conference on Clinical Trials Methodology
  Journals of the Society for Clinical Trials
  International Participation
Summary and Conclusion
References

Abstract
This chapter provides a synopsis of the events leading to the creation of the
Society for Clinical Trials (SCT). The Society was officially incorporated in
September 1978 and celebrated its 40th anniversary during its annual meeting
in New Orleans May 19–22, 2019.

O. D. Williams (*)
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
B. S. Hawkins
Johns Hopkins School of Medicine and Bloomberg School of Public Health, The Johns Hopkins
University, Baltimore, MD, USA
e-mail: [email protected]


Keywords
Clinical trials · Greenberg report · Coordinating center · CCMP · Models Project ·
National Conference · Directors

Introduction

The Society for Clinical Trials, Inc. (SCT) is a professional society for advocates,
designers, and practitioners of clinical trials, regardless of medical specialty or area of
expertise. SCT was incorporated in September 1978. It was created with the following purposes:

• To promote the development and exchange of information for design and conduct
of clinical trials and research using similar methods
• To provide a forum for discussion of philosophical, ethical, legal, and procedural
issues involved in the design, organization, operations, and analysis of clinical
trials and other epidemiological studies that use similar methods (Society for
Clinical Trials Board of Directors 1980).

Background: 1967–1972

In the early 1970s, there was a growing awareness that clinical trials
were going to play a vital role in the development and implementation
of improved strategies for addressing important public health and medical issues.
At that time, only a few large multicenter trials had been undertaken, including the
University Group Diabetes Program (UGDP) and the Coronary Drug Project (CDP),
both sponsored by the National Institutes of Health (NIH). Experience in these
projects highlighted the theoretical, organizational, and operational challenges
such studies presented. It was clear that clinical trials were a valuable tool for
evaluating and comparing interventions; however, there was significant concern
among sponsors and practitioners as to their cost, management, and duration before
they reached a conclusion regarding the effectiveness and safety of interventions
under evaluation. Donald S. Fredrickson (1924–2002; Director, National Heart
Institute 1966–1974; NIH Director 1975–1981) summarized these issues eloquently
in an address to the New York Academy of Science on January 23, 1968. In this
address, for which the full text is available (Fredrickson 1968), he described field trials as indispensable
ordeals that are necessary for avoiding perpetual uncertainty.
In 1967, the National Heart Institute (later the National Heart, Lung, and Blood
Institute [NHLBI]) created the Heart Special Project Committee (chaired by Bernard
G. Greenberg, Chair of Biostatistics, University of North Carolina) to review the
approach to cooperative studies. The resulting report, known as the Greenberg
Report, was presented to the National Advisory Heart Council later in the same
year. The Greenberg Report (Heart Special Project Committee 1988) highlighted
three essential components:
• Organization of local units under good leadership
• Establishment of a coordinating center
• Critical “interrogation” of the data

It also provided guidelines for organizational structure and operations, many of which continue to be followed to this day.
Many of the technology elements now taken for granted were not available in the
early 1970s. Communication by telephone and postal mail was more difficult, fax
capability was not fully available, and desktop computers were still under develop-
ment. Word processing and other essential software applications, including those for
data management, also were not available. Further, telephone conferencing was in its
early phases and was expensive and less than fully reliable. It required an operator to
contact participants and add them to the call. Video conferencing was beyond the
imagination of most clinical trialists.
In spite of these and other constraints, it was clear that the volume of large-scale
multicenter trials was likely to grow and that they would require techniques and
operations not yet fully developed. During the early 1970s, the National Heart
Institute initiated three large-scale multicenter activities that included large clinical
trials: the Lipids Research Clinics (LRC) Program with its Coronary Primary
Prevention Trial, the Hypertension Detection and Follow-up Program (HDFP), and
the Multiple Risk Factor Intervention Trial (MRFIT). These were large, multicenter
long-term studies designed to address mechanisms of and interventions for critical
aspects of heart disease, the leading cause of death among men in the United States
(USA) at that time.

Steps in the Creation of the Society for Clinical Trials

So the stage was set for the creation of an organization of designers and others
engaged in multicenter clinical trials. Key steps in this process were:

1. Annual meetings of personnel from coordinating centers for trials already underway that began in 1973
2. Meeting of interested individuals with the NIH Clinical Trials Committee, 1976
3. National Conference on Clinical Trials Methodology, October 1977
4. SCT incorporation, September 1978
5. Publication of first issue of Controlled Clinical Trials, May 1980

It may be surprising that the first step listed is meetings of coordinating center
personnel. However, as noted above, the Greenberg Report indicated that coordi-
nating centers were essential components of cooperative studies, such as large-scale
multicenter clinical trials. The annual meetings were facilitated by NHLBI under the
leadership of Robert I. Levy. Levy understood that coordinating center capabilities
and the pool of relevant expertise needed to be expanded to meet current and future
needs for successful conduct of large studies. These meetings began in 1973 and
continued until 1981; they initiated a sequence of events that ultimately led to the
creation of the SCT.

Meetings of Coordinating Center Personnel

The initial 1.5-day meeting of coordinating center personnel was held in May 1973
in Columbia, Maryland. More than 60 people participated, including 9 from NIH;
participants represented 11 institutions. This initial meeting was mostly a “show and
tell” by investigators from coordinating centers for major NHLBI-sponsored multi-
center studies:

• Lipid Research Clinics Program
• Hypertension Detection and Follow-Up Program
• Multiple Risk Factor Intervention Trial
• Hypertension, Thrombolysis, and Exercise
• Coronary Drug Project

Organization of and participation in the initial meeting were facilitated by NHLBI primarily through existing funding awards to individual investigators.
The full sequence of meetings of coordinating center personnel, meeting
locations, and host institutions are listed by year in Table 1. The host institution
typically was the location of one or more of the NHLBI-supported coordinating
centers. Beginning in 1976, these meetings were sponsored by NHLBI through the
Coordinating Center Models Project (Curtis L. Meinert, principal investigator),
described below. The last two meetings were held in conjunction with the first and
second annual scientific meetings of the Society for Clinical Trials.
The breadth and depth of these meetings evolved over time. For example, by the time of the fourth meeting in Chapel Hill, the 166 attendees represented 40 multicenter studies. Presentations focused on data quality assurance, computer operations and cost, and survival analysis.

Table 1 Locations and host organizations for meetings of personnel from clinical trial coordinating centers, 1973–1981

Year   Meeting location     Host organization
1973   Columbia, MD         University of Maryland
1974   –                    –
1975   Plymouth, MN         University of Minnesota
1976   Houston, TX          University of Texas
1977   Chapel Hill, NC      University of North Carolina
1978   Washington, DC       George Washington University
1979   Boston, MA
1980   Philadelphia, PA     Society for Clinical Trials
1981   San Francisco, CA    Society for Clinical Trials

–, no meeting held

Also, the Coordinating Center Models Project was introduced. The guest speaker was Levy, who addressed “Decision Making in Large-Scale Clinical Trials.” Organizers of these meetings already had adopted the format of typical scientific society meetings.

Coordinating Center Models Project

Two important activities occurred in 1976. One was NHLBI funding of the
Coordinating Center Models Project (CCMP [1976–1979]). The CCMP purpose
was to study existing coordinating centers for large, multicenter trials with the aim of
establishing guidelines and standards for organization and operations of coordinat-
ing centers for future multicenter trials. The seven CCMP reports are available from
the National Technical Information Service (Coordinating Center Models Project
Research Group. Coordinating Center Models Project 1979a, b, c, d, e, f, 1980).
Also during 1976, Williams, Meinert, and others met with Robert S. Gordon, Jr., who was special assistant to the NIH Director (Fredrickson), and with other key leaders at NIH. Later that year, a group consisting of Fred Ederer (National Eye Institute), Meinert, Williams, and Harold P. Roth (National Institute of Arthritis, Metabolism, and Digestive Diseases) met with the NIH Clinical Trials Committee. This group
proposed that a professional society that addressed the general issues of clinical trials
be created and asked for the Committee’s support. The Committee members
expressed interest in the concept but indicated that evidence for widespread partic-
ipatory support was lacking. Instead, the Committee proposed holding a conference
to assess the level of support for such a society. As a result, a Planning Committee
was formed under the leadership of Roth.
Although thus far we have described activities in the USA sponsored primarily
by the NHLBI, other sponsors and practitioners of clinical trials were interested
in creating a forum for sharing experiences, methods, and related developments.
The National Cancer Institute (NCI) created the Clinical Trials Cooperative Group
Program in 1955; the National Cancer Act of 1971 enhanced the role of these
cooperative groups and their coordinating centers. In 1962, the U.S. Veterans
Administration (VA; now the U.S. Department of Veterans Affairs) established
four regional research support centers for the VA Cooperative Studies Program
(VA CSP) under the leadership of Lawrence Shaw (https://www.vascp.research.va.
gov/CSP/history.asp). In 1967 and 1970, findings from a major trial of antihyper-
tensive agents conducted by a VA Cooperative Studies Group were published
(Veterans Administration Cooperative Study Group on Antihypertensive Agents
1967, 1970). In 1972, two of the regional research support centers were designated to house CSP coordinating centers to support “multicenter clinical trials that evaluated novel therapies or new uses of standard treatments” (https://www.vascp.research.va.gov/CSP/history.asp). Two more CSP coordinating centers were established during the next 6 years. In the United Kingdom (UK), the Medical Research Council had sponsored randomized trials, following the landmark trial of streptomycin for tuberculosis (Streptomycin in Tuberculosis Trials Committee 1948). Thus, the group of sponsors and individuals potentially interested in a professional society for clinical trialists extended well beyond NHLBI and its awardees.

National Conference on Clinical Trials Methodology

The 1977 National Conference on Clinical Trials Methodology was held in Building
1 on the NIH campus on October 3 and 4, 1977. Somewhat to the surprise of almost
everyone involved in its planning, the conference attracted more than 700 partici-
pants from around the USA who represented much of NIH and other current and
potential sponsors of clinical trials. Attendees were welcomed by Gordon and
Fredrickson; the program included presentations within broad topics:

• When and how to stop a clinical trial?
• Who will be effective as a clinical trials investigator, and what are adequate
incentives?
• Patient recruitment: problems and solutions.
• Quality assurance of clinical data.
• Ethical considerations in clinical trials.

One of the more important sessions was on communications, which addressed the
question “Should mechanisms be established for sharing among clinical trial inves-
tigators experience in handling problems in design, execution and analysis?” The
discussion leaders were Roth and Genell Knatterud, with Louis Lasagna, Meinert,
and Barbara Hawkins. Harold Schoolman and Fred Mosteller also contributed. The
conference and this session on communications played a key role in the creation
of the SCT. The conference proceedings were published 2 years later (Roth and
Gordon 1979).
Soon after this conference, in September 1978, the Society for Clinical Trials was incorporated. The members of the initial board of directors are listed in Table 2.

Table 2 Members of the initial board of directors of the Society for Clinical Trials
Thomas C. Chalmers, MD, Chair, Mt. Sinai School of Medicine
Harold O. Conn, MD, Yale University School of Medicine and West Haven VA Hospital
Fred Ederer, MA, National Eye Institute, National Institutes of Health
Robert S. Gordon, Jr., MD, National Institutes of Health
Curtis L. Meinert, PhD, The Johns Hopkins University
Christian R. Klimt, MD, DrPH, University of Maryland and Maryland Medical Research Institute
Paul Meier, PhD, University of Chicago
Charles G. Moertel, MD, Mayo Clinic
Thaddeus E. Prout, MD, The Johns Hopkins University and Greater Baltimore Medical Institute
Harold P. Roth, MD, National Institute of Diabetes and Digestive Diseases, NIH
Maurice J. Staquet, MD, Representative of the International Society for Clinical Biostatistics
O. Dale Williams, PhD, University of North Carolina

One of the first acts of the Board was to develop plans for the first meeting of the Society.
A Program Committee (Williams, Chair) was created; as noted above, this meeting
was planned and undertaken in conjunction with the seventh annual meeting of
coordinating center personnel. The result was a four-day meeting May 5–8, 1980, in
Philadelphia. Sponsors included NHLBI, NEI, the National Institute of Neurological Disorders and Stroke (NINDS), the National Institute of Allergy and Infectious Diseases (NIAID), and the Maryland Medical Research Institute. This impor-
tant meeting included three key presentations (Combined Annual Scientific Sessions
Society for Clinical Trials 1980):

• Fredrickson, NIH Director: “Sorting out the doctor’s bag”
• Seymour Perry, Director National Center for Health Care Technology:
“Introduction to the National Center for Health Care Technology”
• Charles G. Moertel, Chair, Comprehensive Cancer Center, Mayo Clinic: “How
to succeed in clinical trials without really trying”

The second annual meeting, also with Williams as Program Committee Chair,
was held in conjunction with the eighth and final annual meeting of coordinating
center personnel.

Journals of the Society for Clinical Trials

In conjunction with the incorporation of the Society, negotiations began with a publisher to initiate a journal, Controlled Clinical Trials, with Meinert to serve as
Editor and Williams as Associate Editor. The editorial arrangement continued for
about 10 years until Williams stepped down from his role. The journal had been
adopted as the official journal of the Society by the time the first issue was published
in 1980. Janet Wittes (1995–1998) and James Neaton (1999–2003) each later served as editor of Controlled Clinical Trials.
Controlled Clinical Trials was unique when it originated because of its publica-
tion of articles on a variety of topics that represented the broad and varied interests
and expertise of the SCT membership. These spanned clinical trial design, conduct,
logistics, ethics, regulation, policy, analysis, and methodology. In particular, it was
prescient in being one of very few or perhaps the only journal at that time to publish
“design papers,” i.e., papers devoted solely to presenting the design and key protocol
elements of planned clinical trials. The idea that protocols are scholarly outputs that should be publicly described was enshrined in the later formation of
clinicaltrials.gov, and the idea has also been adopted by a wide variety of fields today
under the rubric of “Registered Reports.”
In 2004, the Society decided to part ways with Elsevier, the publisher of the
Society’s journal. Because the journal’s name was owned by Elsevier (and was subsequently changed by them to Contemporary Clinical Trials), the Society in effect launched a new journal, albeit with the same editorial board. This new journal was
named Clinical Trials: the Journal of the Society for Clinical Trials and is published
by Sage. Steven N. Goodman was the originating editor of the new journal from 2004 to 2013 and was succeeded by Colin Begg in 2013.

International Participation

The Society has been enhanced by membership and meeting attendees from outside
the USA, essentially from its outset. In fact, Maurice J. Staquet was a member of the
initial board of directors as a representative of the International Society for Clinical
Biostatistics. Also, meetings have been held in other countries. The first was the
7th meeting, held in Montreal, Canada, May 1986. The 11th meeting was held
in Toronto, Canada, May 1990; the 12th was the first joint meeting with the
International Society for Clinical Biostatistics, held in Brussels, Belgium, July
1991; the 18th was the second meeting of the Society and the International Society
for Clinical Biostatistics, held in Boston, July 1997; the 21st was held in Toronto,
Canada, April 2000; the 24th was the third joint meeting with the International Society for Clinical Biostatistics, held in London, England, July 2003; the 28th was held in Montreal, Canada, May 2007; the 32nd in Vancouver, Canada, May 2011;
the 37th was held in Montreal, Canada, May 2016; and the 38th was held in
Liverpool, England, May 2017.

Summary and Conclusion

By 1981, the Society for Clinical Trials was fully functional; Controlled Clinical Trials
was in publication. Beginning in 1980, scientific meetings of the Society have been
held annually. Programs of annual meetings and abstracts of contributed presentations,
along with other information, are online at https://www.sctweb.org; many have been
published in the official SCT journals. Although clinical procedures and information
technology have evolved since 1981, many of the practical issues persist; for example,
the methods for promoting and monitoring data quality have evolved but the need for
data of high quality remains. As in other areas of clinical trials methodology, the
practices to achieve a desired result may change over time, but the principles are
permanent. Thus, the goals of the SCT remain pertinent to today’s clinical trialists.
Many different individuals contributed importantly to the creation and early
operation of the SCT. Of those mentioned above, three are especially important.
Fredrickson saw the need for an entity such as the SCT and supported it from the
highest levels of NIH. Gordon was tireless in meeting with interested advocates for
a professional society for clinical trialists, eliciting and securing support across
NIH, and helping to lead the overall creation effort. Levy saw a compelling need
for enhanced coordinating center capability and made sure the meetings of coordi-
nating center personnel that led directly to the creation of the Society were supported
and well organized. Many other individuals deserve significant credit as well,
but these three are especially deserving of recognition for their important and
timely contributions.

References
Abstracts. Combined Annual Scientific Sessions Society for Clinical Trials and seventh annual
symposium for coordinating clinical trials. May 6–8, 1980, Marriott Inn, Philadelphia, PA
(1980) Control Clin Trials 1:165–178
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979a, March 1) A study of coordinating centers in multicenter clinical trials. Design
and methods, vols 1 and 2. NTIS Accession No. PB82-143730 and PB82-143744. National
Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979b, March 1) A study of coordinating centers in multicenter clinical trials. RFPs
for coordinating centers: a content evaluation. NTIS Accession No. PB82143702. National
Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979c, August 1) A study of coordinating centers in multicenter clinical trials. Terminology.
NTIS Accession No. PB82-143728. National Technical Information Services, Springfield.
Bibliographic resource for clinical trials. April 1, 1980. NTIS Accession No. PB87-??????
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979d, September 1) A study of coordinating centers in multicenter clinical trials. Phases of
a multicenter clinical trials. NTIS Accession No. PB82-143751. National Technical Information
Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979e, September 1) A study of coordinating centers in multicenter clinical trials. Enhancement
of methodological research in the field of clinical trials. NTIS Accession No. PB82-143710.
National Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979f, June 1) A study of coordinating centers in multicenter clinical trials. CCMP manuscripts
presented at the annual symposia on coordinating clinical trials. NTIS Accession No. PB82-
143694. National Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1980, September 1) A study of coordinating centers in multicenter clinical trials. Coordinating
centers and the contract process. NTIS Accession No. PB87-139101? National Technical
Information Services, Springfield
Fredrickson DS (1968) The field trial: some thoughts on the indispensable ordeal. Bull N Y Acad
Med 44(2):985–993
Heart Special Project Committee (1988) Organization, review, and administration of cooperative
studies (Greenberg report): a report from the Heart Special Project Committee to the National
Advisory Heart Council. Control Clin Trials 9:137–148. [Includes a list of members of the Heart
Special Project Committee]
https://www.vascp.research.va.gov/CSP/history.asp. Accessed 12 Aug 2019
Roth HP, Gordon RS (1979) Proceedings of the national conference on clinical trials methodology.
Clin Pharmacol Ther 25(5, pt 2):629–765
Society for Clinical Trials Board of Directors (1980) By-laws. Control Clin Trials 1(1):83–89
Streptomycin in Tuberculosis Trials Committee (1948) Streptomycin treatment of pulmonary
tuberculosis. Br Med J 2:769–782
Veterans Administration Cooperative Study Group on Antihypertensive Agents (1967) Effects of
treatment on morbidity in hypertension. Results in patients with diastolic blood pressures
averaging 115 through 129 mm Hg. JAMA 202(11):116–122
Veterans Administration Cooperative Study Group on Antihypertensive Agents (1970) Effects of
treatment on morbidity in hypertension. II. Results in patients with diastolic blood pressures
averaging 90 through 114 mm Hg. JAMA 213(7):1143–1152
Part II
Conduct and Management
6 Investigator Responsibilities

Bruce J. Giantonio

Contents
Introduction
Definitions
  Research
  Investigators
Overall Responsibilities of Investigators
Research Study Design and Conduct
  Design
  Conduct
Safeguards to Protect Research Participants
  Adverse Event Reporting
  Safety Oversight
  IRB Requirements
  Informed Consent Process
Protocol Noncompliance and Research Misconduct
  Protocol Noncompliance
  Research Misconduct
Summary and Conclusion
Key Facts
Regulations and Policies
Cross-References
References

B. J. Giantonio (*)
The ECOG-ACRIN Cancer Research Group, Philadelphia, PA, USA
Massachusetts General Hospital, Boston, MA, USA
Department of Medical Oncology, University of Pretoria, Pretoria, South Africa
e-mail: [email protected]


Abstract
The research atrocities committed during World War II using human subjects
prompted the development of a body of regulations, beginning with the Nurem-
berg Code, to ensure that human subjects’ research is safely conducted and
prioritizes the rights of the individual over the conduct of the research. The
resultant regulations guiding human subjects’ research affect protocol design,
the selection of participants, safety reporting and oversight, and the dissemination
of research results. The investigator conducting research on human subjects must
be familiar with those regulations to meet his/her responsibility to protect the
rights and welfare of research participants.

Keywords
Belmont Report · The Common Rule · Delegation of tasks · Drug accountability ·
Good clinical practice · Informed consent process · Institutional review board
(IRB) · Investigator · Noncompliance · Scientific misconduct

Introduction

Clinical research is the foundation for evidence-based, high-quality medical care. The individuals who participate in clinical research are central to advancing our
knowledge of disease, yet inherent in the conduct of clinical research is the risk of
harm to the research participant. Since the development of the Nuremberg Code in
1947, regulations and guidelines for performing clinical research that balance the
common good of scientific advancement with protecting the rights and welfare of
individuals provide the framework for the ethical conduct of clinical research with
human subjects. Building on the Nuremberg Code, the Belmont Report delineates
the boundary between practice and research and describes three basic principles
relevant to the ethics of research involving human subjects: the principles of
respect of persons, beneficence, and justice. These principles provide the founda-
tions for the “Common Rule” and “good clinical practice” (GCP). (Of note, the
“Common Rule” is used to describe subpart A of Title 45, part 46 (45 CFR 46) of
the Code of Federal Regulations. The CFR represents the codification of the
general and permanent rules of the departments and agencies of the US Federal
Government.)
Regulations and policies governing clinical research and defining the responsi-
bilities of the investigator will be found at the federal, state, and local level and can
vary by region of the world. Additional responsibilities may be required by
funding agencies, sponsors, and research institutions. In section “Regulations and Policies,”
we list the regulations and guidance documents with the greatest impact on the
conduct of clinical research and from which the responsibilities of the investigator
are derived.

Definitions

Research

Research is defined in 45 CFR 46 as a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge.

Investigators

The definitions of investigator used in the existing guidelines and policies vary but in
general describe any individual responsible for the conduct of research involving
human subjects. The conduct of clinical research can be limited to one research site
or performed across hundreds of sites, and the term investigator (or principal inves-
tigator) can apply to the person responsible either for the study as a whole, or for an
individual site. For purposes of clarity, and unless otherwise stated, we will use the
term investigator to encompass the investigator of single-site research, the lead
investigator of multi-site research, and site-specific investigators of multi-site research.

Overall Responsibilities of Investigators

Research that requires the participation of human subjects includes prospective clinical trials of drugs, devices, or other interventions; translational biologic and
imaging analyses; participant-based surveys and questionnaires; and retrospective
data analyses. Regardless of the type of investigation being conducted, a clinical
investigator’s primary responsibility is to protect the rights and welfare of the
participants (Bear et al. 2011).
In general, this requires that the clinical research be scientifically justifiable and use research methods and a trial design appropriate to the question being studied, and that safeguards for the participants be proportional to the level of risk of the research.
We will review the specific responsibilities of the investigator according to two broad, albeit overlapping, categories: research study design and conduct, and safeguards to protect research participants.

Research Study Design and Conduct

Design

The investigator of a single-site clinical research study, or the lead investigator of a multi-site study, has direct responsibility for the design of the proposed research.

Table 1 Seven requirements for the ethical conduct of clinical research

1. Value: Enhancements of health or knowledge must be derived from the research
2. Scientific validity: The research must be methodologically rigorous
3. Fair subject selection: Scientific objectives, not vulnerability or privilege, and the potential for and distribution of risks and benefits should determine communities selected as study sites and the inclusion criteria for individual subjects
4. Favorable risk-benefit ratio: Within the context of standard clinical practice and the research protocol, risks must be minimized and potential benefits enhanced, and the potential benefits to individuals and the knowledge gained for society must outweigh the risks
5. Independent review: Unaffiliated individuals must review the research and approve, amend, or terminate it
6. Informed consent: Individuals should be informed about the research and provide their voluntary consent
7. Respect for enrolled subjects: Subjects should have their privacy protected, the opportunity to withdraw, and their well-being monitored

Four of the seven requirements for the ethical conduct of clinical research
(Emanuel et al. 2000) relate to the design of the research: value, scientific validity,
fair subject selection, and a favorable risk-benefit ratio (Table 1).
To justify exposing participants to potential harm, the proposed research must
provide generalizable knowledge that contributes to the common good. There must
be uncertainty among the medical community, or “clinical equipoise,” for the
question being asked by the research, and the methodology for obtaining that
knowledge must be appropriate to the question and rigorously applied. Risks to
participants must be minimized, and subject selection must be done such that both
the risks and benefits of the research are fairly distributed with subjects excluded
only for valid safety or scientific reasons.

Conduct

Qualifications, Training, and Delegation of Tasks


The investigator of a single-site study, the lead investigator for a multi-site study, and
individual site-specific investigators are responsible for conducting or supervising
the research.
All investigators involved in the research must have the appropriate level of
education, training, and experience to conduct the research. This includes mainte-
nance of records of all training required by their institutions and research sponsors,
updated as necessary and relevant over time. In addition, the investigator must have
sufficient time and resources to properly conduct and supervise the research for
which they are responsible.
The rationale for the study and the requirements for its safe and appropriate
conduct must be fully understood by the investigator and the research conducted in
compliance with the specifics of the study. Investigators must have no conflicts of
interest, financial, or otherwise, that could influence their judgment for the inclusion
of subjects in the research or the interpretation of the findings.

The Clinical Research Team


The complexity of conducting clinical research has rendered it nearly impossible for
an investigator to meet his or her responsibilities to the research without the assistance
of others. It is common for certain types of clinical research to be supported by a team
of individuals who assist the investigator, including co-investigators, clinical research
associates, and research pharmacists. In these instances the investigator will serve as
the leader of the team with responsibility for its supervision.
The delegation of tasks related to the conduct of the research by the investigator to
other qualified individuals is accepted practice. Most institutions that conduct
clinical research have programs designed to support their research activities that
include offices staffed with those individuals who assist the investigator. The extent
to which research-related tasks are delegated, and to whom, is affected by and must
comply with federal, state, and local regulations.
It is the investigator’s responsibility to ensure that all members of the research team to whom tasks are delegated have appropriate education, training, and experience in both the conduct of clinical research in general and the conduct of the specific study. The
supervision of the staff performing the delegated tasks is also the responsibility of
the investigator, regardless of the employment arrangements of the staff. In addition,
the members of the research team must also be free of conflicts of interest.

Accountability of Investigational Agents


In clinical research that involves investigational agents, the investigator is responsi-
ble for drug accountability. This includes oversight of the proper handling and
storage of the agent, its correct administration, and the return or destruction of
unused agent. The chain of custody of the agent must be clearly documented from
the time of the arrival of the investigational agent until its return or destruction. It is
appropriate to delegate these tasks to an individual such as a pharmacist who has
been trained on the handling, storage, and use of investigational agents.

Data Collection, Research Record Maintenance, and Retention


Data is the lingua franca of clinical research. The accuracy of the data is essential to
the sound and safe conduct of the research, and all reported data must be verifiable in
the source documents and/or electronic record.
Investigators are ultimately responsible for the accuracy of research data, its
storage, and the necessary confidentiality protections according to the specifics of
the protocol and its consent form. For large multi-site studies, the storage of research
data is centralized and managed by a sponsor or an institution in accordance with
their policies. “Shadow charts” maintained at the research site are subject to the same
regulations.
All records and communications related to the research are to be securely
maintained by the investigator, or his or her representatives. This includes all
versions of the protocol and research plan, consent documents, and institutional
review board (IRB)/ethics committee correspondence. The research records must be


made available upon request by regulatory agencies and research sponsors.
The duration of retention of research data and records may vary by country and
local requirements and the specifics of the research. In general, the duration of
retention should be in compliance with the regulation requiring the longest duration
of record retention.

Reporting of Research Findings


The results of the research, regardless of outcome, are to be reported in a timely
manner. Failure to publish the results of clinical research violates the commitment made
to the participant that their involvement in the research will contribute to generaliz-
able knowledge.

Safeguards to Protect Research Participants

Participants in clinical research can be individuals with, or at risk for, a particular disease, or they can be entirely well. The acceptable level of risk that the participant
in a research study is subject to must consider the potential benefit to society, and the
design of the research study must include steps to protect the participants from
untoward harm. Risks to participants can include immediate and long-term physical
and emotional harm, financial damage, and privacy violations.
Some of the risk associated with participation in clinical research is mitigated by
appropriate study design and research methods, as described above.

Adverse Event Reporting

The recording and monitoring of adverse events encountered during research are essential to protecting the safety of research participants. The investigator must
have processes in place that allow for the timely review of adverse events. Any
serious or unanticipated adverse events must be promptly reported according to the
specifics of the study to the IRB of record for the research project, to any regulatory
agency as appropriate, and to the study sponsor.
The investigator is responsible for ensuring that reasonable medical care is
provided to the research participants for issues related to the study participation
and for the facilitation of care for other health issues that might arise during the
study.

Safety Oversight

In addition to an IRB, the oversight of participant safety for a specific research project may utilize a Data and Safety Monitoring Committee (DSMC). Such com-
mittees are comprised of individuals with experience in the conduct of clinical
research and expertise in the particular disease being studied, the majority of whom
are unaffiliated with the specific research project and are free from other conflicts of
interest.
Both IRBs and DSMCs are provided with safety data during the conduct of the
study. Any severe or unanticipated adverse event is to be reported in a timely manner
and according to requirements included in the protocol; all others are submitted
according to a preplanned review schedule. Data and Safety Monitoring Committees
review not only adverse effects but also outcomes data at preplanned intervals to
ensure that the continuation of the research is justified. The reports from Data and
Safety Monitoring Committees are usually submitted to the IRB of record for the
specific research project.

IRB Requirements

Unless exempt from review, investigators are responsible for interactions with the
IRB, including initial IRB approval, approval for any modifications to the research,
safety reporting, and, as required, continuing review of the research.
In order for the IRB to effectively evaluate and monitor a clinical research project,
investigators must provide the IRB of record with sufficient information to make
their determinations regarding the initiation of the clinical research and its ongoing
conduct. This includes new safety and outcomes information, deviations from study-
specified or IRB requirements (section “Protocol Noncompliance and Research
Misconduct” below), as well as unanticipated problems involving risks to subjects
or others.
Additionally, the investigator is responsible for informing the IRB of record of
any significant new finding that emerges during the conduct of the research that
might affect a participant’s willingness to continue to participate in the research.
Based on the significance of the new findings, the investigator may be responsible for providing them to the registered participants, and modifications to the study to account for the new findings may require a suspension in accrual.
It is important to note that the inclusion of vulnerable populations (such as
children, students, prisoners, and pregnant women) in clinical research may require
additional IRB-approved safeguards for their participation, and the investigator is
responsible for implementing those safeguards.

Informed Consent Process

The informed consent process and the consent form are required to ensure that a
potential participant in a clinical research project has enough information about the
specific project, and about clinical research in general, to make an informed and
autonomous decision about their participation.
For clinical research that requires the participant’s informed consent, the inves-
tigator is responsible for ensuring that consent is obtained and documented as
approved by the IRB and according to federal, state, and local requirements. In
addition, each participant is to be provided a copy of the informed consent document
when written consent is required.
Investigators are required to allow for monitoring and auditing of the research by
the IRB of record, sponsors, and any applicable regulatory agencies.

Protocol Noncompliance and Research Misconduct

The investigator is responsible for reporting to the IRB any instances of protocol
noncompliance and misconduct. Reporting to the sponsor and other regulatory
agencies may be required as well. In addition, the suspension or termination of an
IRB approval may also require reporting to the sponsor and to the appropriate
federal, state, and local regulatory agencies.

Protocol Noncompliance

As discussed, the protection of the rights and welfare of research participants
requires an appropriate trial design. The investigator is responsible for compliance
with all protocol-specified requirements during the conduct of the research. This is to
ensure both the safety of the participants and the scientific integrity of the research.
However, deviations and/or violations (the terms are used interchangeably by some)
to protocol-specified criteria can occur and in some instances may be justified. In
general protocol deviations are either intentional, or discovered, after they occur.
An intentional protocol deviation represents a change in the research that requires
prior IRB review and approval before the change is allowed. In some instances,
however, deviations from the protocol for safety reasons are allowed in advance of
IRB review and approval.
Research protocol deviations that are discovered after they occur may need to be
reported to the IRB. The IRB will determine if the violations represent serious
noncompliance or continuing noncompliance. Additional reporting to the sponsor
and to regulatory agencies may be required.

• Serious noncompliance: any deviation from protocol-specified requirements that may affect the rights and welfare of participants or adversely affect the scientific integrity of a study.
• Continuing noncompliance: a pattern of noncompliance that may affect the rights and welfare of participants or adversely affect the scientific integrity of a study.

Research Misconduct

Research misconduct is generally defined by regulatory agencies as fabrication,
falsification, or plagiarism in proposing, performing, or reviewing research, or in
reporting research results (From: https://ori.hhs.gov/definition-misconduct). In addi-
tion, misconduct is not limited to the researcher but includes sponsors, institutions,
and the journals publishing the research.
Scientific misconduct violates the investigator’s commitment to the ethical con-
duct of research and exposes research participants to unjustifiable risk. Consequences of misconduct can include retraction of the published findings, loss of research privileges and funding, monetary fines and criminal charges, and public distrust of the clinical research enterprise. The investigator has
a responsibility to report any findings of misconduct, intentional or unintentional, by
the research team to the sponsor, to the IRB of record, and to any federal, state, and
local regulatory authorities.

Summary and Conclusion

The responsibilities of the clinical research investigator can cover the lifespan of the
study and are intended to protect the rights and welfare of the research participants and
the integrity of the research itself. And while much of the work of clinical research is
delegated to others, the overall responsibility remains with the investigator. Non-
compliance with the requirements of the research protocol can compromise both
participant safety and study integrity.

Key Facts

• The clinical investigator’s primary responsibility is to protect the rights and
welfare of the participants.
• Research is a systematic investigation, including research development, testing,
and evaluation, designed to develop or contribute to generalizable knowledge.
• The Belmont Report identifies three principles for the ethical conduct of research
involving human subjects: respect of persons, beneficence, and justice.
• Risk of harm to research participants is mitigated by appropriate study design and
research methods.
• All investigators involved in clinical research must have the appropriate level of
education, training, and experience to conduct the research.
• The investigator is responsible for the accuracy of research data, its storage, and
the necessary confidentiality.
• Noncompliance with all protocol-specified requirements during the conduct of
the research can compromise participant safety and the scientific integrity of the
research.
• Intentional protocol deviation requires IRB review and approval before the
change is allowed. In some instances deviations from the protocol for safety
reasons are allowed in advance of IRB review and approval.
• Scientific misconduct violates the investigator’s commitment to the ethical con-
duct of research and exposes research participants to unjustifiable risk.
Regulations and Policies

Regulations and Policies from which the investigator responsibilities are derived:
a. The Belmont Report: US Department of Health and Human Services, Office for
Human Subjects Research: Belmont Report. Ethical Principles and Guidelines for
the Protection of Human Subjects of Research.
http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.html
b. US Code of Federal Regulations
eCFR – Code of Federal Regulations
The two relevant sections of the Code of Federal Regulations that apply to the
conduct of clinical research are Title 21 CFR Food and Drugs and 45 CFR Public
Welfare.
i. Title 21 CFR: Food and Drugs
https://gov.ecfr.io/cgi-bin/text-idx?SID=027d2d7bd97666fc9f896c580d5039dc
&mc=true&tpl=/ecfrbrowse/Title21/21tab_02.tpl
This section establishes many of the regulations concerning the investigation of
new agents and devices and forms the basis of the policies of the Food and Drug
Administration. Relevant sections of 21 CFR are:
21 CFR 312.50: General Responsibilities of Investigators
21 CFR 812.100: General Responsibilities of Investigators: Devices
21 CFR 812.110: Specific Responsibilities of Investigators: Devices
21 CFR 11: Electronic Records; Electronic Signatures
21 CFR 50: Protection of Human Subjects
21 CFR 54: Financial Disclosure by Clinical Investigators
21 CFR 56: Institutional Review Boards
ii. Title 45 CFR: Public Welfare
Section 45 CFR 46: “Regulation for the Protection of Human Subjects in
Research”
https://gov.ecfr.io/cgi-bin/text-idx?SID=6ddf773215b32fc68af87b6599529417
&mc=true&node=pt45.1.46&rgn=div5
45 CFR 46 applies to all research involving human subjects that is conducted or
funded by US Department of Health and Human Services (DHHS) and has been
widely adopted as guidance for clinical research outside of that funded by DHHS.
Subpart A: The Common Rule
Subpart B: Additional protections for pregnant women, human fetuses and
neonates
Subpart C: Additional protections for prisoners
Subpart D: Additional protections for children
c. Good Clinical Practice Guidelines (GCP ICH-E6R2)
The International Council for Harmonisation of Technical Requirements for Phar-
maceuticals for Human Use (ICH) was established in 1990 with the stated mission to
achieve greater harmonization worldwide to ensure that safe, effective, and high-
quality medicines are developed and registered in the most resource-efficient manner.
GCP ICH-E6R2 is one of the products created by this organization to define
“good clinical practices” for clinical research. These guidelines are commonly used
to guide clinical research around the world. Of note, the use of the term "good clinical
practice” in this context is distinct from that applied to day-to-day patient care.
https://www.ich.org/products/guidelines/efficacy/efficacy-single/article/integrated-
addendum-good-clinical-practice.html
d. CIOMS International Ethical Guidelines for Biomedical Research
The Council for International Organizations of Medical Sciences (CIOMS) is an
international nongovernmental organization in an official relationship with World
Health Organization (WHO). The guidelines focus primarily on rules and principles to protect human participants in research and to reliably safeguard their rights and welfare.
https://cioms.ch/shop/product/international-ethical-guidelines-for-health-related-
research-involving-humans/
e. Additional Guidance Documents
Attachment C: https://www.hhs.gov/ohrp/sachrp-committee/recommendations/
2013-january-10-letter-attachment-c/index.html
FDA Guidance for Industry on Investigator Responsibilities: https://www.fda.
gov/downloads/Drugs/.../Guidances/UCM187772.pdf

Cross-References

▶ Data and Safety Monitoring and Reporting


▶ Financial Conflicts of Interest in Clinical Trials
▶ Fraud in Clinical Trials
▶ Good Clinical Practice
▶ Institutional Review Boards and Ethics Committees
▶ Qualifications of the Research Staff
▶ Trial Organization and Governance

7 Centers Participating in Multicenter Trials

Roberta W. Scherer and Barbara S. Hawkins

R. W. Scherer (*)
Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]

B. S. Hawkins
Johns Hopkins School of Medicine and Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, MD, USA
e-mail: [email protected]

Contents
Introduction  98
Roles and Functions of Resource Centers  98
  Coordinating Centers  99
  Other Resource Centers  101
Clinical Centers  109
  The Implementation Phase  110
  Participant Enrollment Phase  113
  Treatment and Follow-Up Phase  114
  Participant Closeout Phase  116
  Post-Funding Phase  116
Summary and Conclusion  116
Key Facts  117
Cross-References  117
References  117

Abstract
Successful conduct of multicenter trials requires many different types of activities,
implemented by different types of centers. Resource centers are those involved in
planning the trial protocol, overseeing trial conduct, and analyzing and interpreting
trial data. They include clinical and data coordinating centers, reading centers, central
laboratories, and others. Clinical centers prepare for the trial at their setting and

accrue, treat, and follow up study participants. Each center has specific responsibil-
ities, which are tied to the trial phase and wax and wane over the course of the trial.
Activities during the planning phase are mostly the purview of the clinical and data
coordinating centers, which are responsible for obtaining funding and designing a
trial that will answer the specific research question being asked. The initial design
phase and the protocol development and implementation phase see both resource
centers and clinical centers making preparations for the trial to be conducted. The
main responsibilities of clinical centers during the participant recruitment, treatment,
and follow-up phases are to recruit, randomize, treat, and follow study participants
and collect and transmit study data to the data coordinating center. The resource
centers manage drug or device distribution, receive and manage data, and monitor
trial progress. Clinical centers complete closeout visits during the participant closeout
phase, while resource centers complete final data management activities, data anal-
ysis, and interpretation. The termination phase finds investigators from all centers
involved in manuscript writing activities. Collaboration among all centers during all
phases is essential for the successful completion of any multicenter trial.

Keywords
Resource center · Clinical center · Coordinating center · Reading center · Central
laboratory · Multicenter trial · Trial phase

Introduction

In this chapter, we discuss the types of centers that form the organizational units of
a multicenter trial. As implied by the term multicenter, different types of centers are
typically required to conduct a multicenter trial, each with specific responsibilities
but which together perform all required functions. Resource centers are those with
expertise and experience in performing specific tasks and include groups such as
coordinating centers, data management centers, central laboratories, reading centers,
and quality control centers, among others. Distinct from the resource centers are
clinical centers, whose main function is the accrual, treatment, and follow-up of
study participants, thus forming an integral part of any multicenter trial.

Roles and Functions of Resource Centers

Although all aspects of a single-center trial may be performed within a single
institution, responsibilities may be assigned to individual departments or divisions
of the institution that provide special expertise. In a multicenter trial, some respon-
sibilities may be assigned to a facility or institution that houses people with special
expertise and facilities that serve the entire trial, i.e., resource centers, sometimes
referred to as central units or support centers. The types of resource centers selected
for an individual trial vary with the design and goals of the trial. Resource centers
may be established to provide expertise to multiple clinical trials, and possibly other
types of research studies, or may be organized to serve a specific trial or group
of trials.

Coordinating Centers

All multicenter trials have at least one center with overall responsibility for
the scientific conduct of the trial. Two types of coordinating centers are common:
clinical coordinating centers and data or statistical coordinating centers. In multi-
national trials, there may be multiple regional or national coordinating centers
(Brennan 1983; Alamercery et al. 1986; Franzosi et al. 1996; Kyriakides et al.
2004; Larson et al. 2016).

Clinical Coordinating Centers


Clinical coordinating centers sometimes are known as the trial chair’s office, prin-
cipal investigator’s office, or treatment coordinating center, depending on the types
of responsibilities assigned to the center, sponsor, and medical setting of the trial.
Responsibilities in some trials have included:

• Identification of clinical and resource centers to collaborate in the conduct of the
trial
• Obtaining and maintaining funding for the trial
• Disbursement of funds to participating clinical centers and resource centers
• Training of clinical investigators to standardize diagnostic or treatment
procedures
• Responding to queries from clinical investigators regarding protocol
interpretation
• Scheduling and organizing meetings of the trial investigators and personnel
• Developing and disseminating information about the trial to aid participant
recruitment
• Marketing the trial to the medical community

In other trials, some of these responsibilities may be assigned to another
resource center, such as the data or statistical coordinating center. Typically,
the clinical coordinating center does not have responsibility for data collection,
storage, or analysis, except possibly information regarding clinical center perfor-
mance. Those responsibilities are usually assigned to a data coordinating center.
Regardless of the distribution of responsibilities, typically there is close collabora-
tion between personnel in the clinical coordinating center and those in the data
coordinating center.

Data Coordinating Centers


Data (or statistical) coordinating centers often are known as coordinating centers
without a modifier, particularly when some of the responsibilities described above
for clinical coordinating centers are assigned to the (data) coordinating center. The
senior statistician for the trial typically is located at the data coordinating center and
may be the principal investigator (or a co-investigator) for the funding award to this
center. In some cases, an epidemiologist or a person with other related expertise may
be the principal investigator.
Typically, the expected principal investigator and/or the senior trial statistician
participates in the design of the trial and preparation of the trial protocol (Williford
et al. 1995).
Coordinating centers often serve as the trial communications center and informa-
tion source for investigators, trial leadership, other trial personnel, and sponsor.
Personnel at these centers provide expertise regarding research design and methods
and, often, experience gained from participation in other clinical trials. They oversee
the treatment allocation (randomization) process and serve as the scientific con-
science of the investigative group. This trial resource center has primary responsi-
bility for assembling and maintaining an accurate, complete, and secure trial
database and for analysis of data accumulated for the trial.
Typical responsibilities of data coordinating centers include:

• Select and implement the information technology methods that will be used for
the trial (McBride and Singer 1995).
• Design and implement the methods for collecting and recording the data required
to address the goals of the trial (Hosking et al. 1995).
• Design and implement the randomization schema and the methods for assigning trial participants to treatment arms and communicating the assignment to participants and trial personnel as required (a minimal illustration follows this list).
• Develop and monitor methods for masking trial personnel and participants to
treatment assignment as required to preserve the integrity of the trial.
• Develop methods for storing, managing, securing, and reporting accumulating
trial data reported by clinical centers, other resource centers, and committees
assigned responsibility for coding events or other aspects of participant
follow-up.
• Provide for regular communication with personnel at clinical centers and other
resource centers regarding protocol issues and data anomalies.
• Develop methods for assessing and reporting data quality, including data pro-
vided by participants, clinical center personnel, and personnel at other resource
centers (Gassman et al. 1995).
• Develop methods for reporting accumulated data for groups assigned to monitor
the progress of the trial, data quality, and comparative safety and effectiveness of
trial treatments (McBride and Singer 1995).
• Cooperate with external monitors of the coordinating center (Canner et al. 1987).
• Participate in preparation of manuscripts to report trial methods and outcomes.
In particular, the trial statisticians typically are responsible for performing all
data analyses included in reports, including selecting and describing appropriate
methods of statistical analysis and verifying all data reported and their
interpretation.
• Retain all documentation regarding data analyses included in manuscripts and
reports for future reference as needed.
• Participate in development of procedures for closing trial participation and
debriefing of participants (McBride and Singer 1995; Dinnett et al. 2004).
• Archive the database when the trial is completed so that the data are available for
future exploratory analyses (Hawkins et al. 1988).
• Store minutes of committee meetings and reports.
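
To make the randomization item above concrete, here is a minimal sketch of how a data coordinating center might generate a permuted-block assignment schedule. It is an illustration only: the function name, the two-arm design, the block size, and the fixed seed are assumptions made for the example, not the method of any particular trial or software system.

import random

def permuted_block_schedule(n_participants, arms=("A", "B"), block_size=4, seed=2022):
    """Return a treatment assignment list balanced within consecutive blocks."""
    # The block size must allow equal representation of each arm within a block.
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible for auditing
    schedule = []
    while len(schedule) < n_participants:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # randomize the order of assignments within each block
        schedule.extend(block)
    return schedule[:n_participants]

if __name__ == "__main__":
    for pid, arm in enumerate(permuted_block_schedule(10), start=1):
        print(f"participant {pid:03d} -> arm {arm}")

In practice, stratification (for example, by clinical center) and concealment of upcoming assignments would be layered on top of such a schedule.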

To meet these responsibilities, staff of the data coordinating center must include
personnel with expertise in several areas. There is no formula that applies to every
trial. Besides statistical and information technology expertise at various levels, staff typically include clerical and other support personnel. It is essential that
coordinating center personnel be able to interact effectively with trial investigators
and personnel at other trial centers. When some of the trial data are collected
by coordinating center personnel, for example, through telephone interviews
for patient-reported outcomes or central long-term follow-up of participants for
outcomes (Peduzzi et al. 1987), the personnel may include telephone interviewers.
Regardless of the expertise or roles of individual coordinating center personnel,
all must be trained in the trial protocol and procedures.
The manuals of operations/procedures prepared for individual trials are useful
resources for identifying the many responsibilities assigned to data coordinating
centers, the ways in which personnel and investigators at those centers have met their
responsibilities, and the organizational structure of data coordinating centers.
In clinical trial networks, a single coordinating center may serve all trials,
or subgroups of personnel may be designated to participate in individual trials
(Blumenstein et al. 1995). A data coordinating center also may be created to
participate in a single multicenter trial. The organization of the coordinating center
depends on the trial setting and, often, the trial sponsor and funding source.
In the United Kingdom, clinical trials units (another name for coordinating centers)
currently undergo registration to assure that they meet requirements regarding (1)
expertise, continuity, and stability; (2) quality assurance; (3) information systems;
and (4) statistical input (McFadden et al. 2015). The importance of development and
documentation of standard operating procedures (SOPs) for these units is empha-
sized in the UK registration process (McFadden et al. 2015) and by others
(Krockenberger et al. 2008).
Because coordinating center responsibilities evolve during the course of a trial, it is
useful to consider changes in responsibilities by trial phase. Phases of a trial, adapted
from the Coordinating Center Models Project [CCMP], are defined in Table 1.
Common coordinating center responsibilities by trial phase are summarized in Table 2.

Table 1 Phases of a multicenter clinical trial. (Adapted from Coordinating Center Models Project Report No. VI)
Planning and pilot phase: Ends with submission of funding application(s) to sponsor
Initial design phase: Ends with funding for the trial
Protocol development and implementation phase: Ends with initiation of participant recruitment
Participant enrollment phase: Ends with completion of participant recruitment
Treatment and follow-up phase: Ends with initiation of participant closeout
Participant closeout phase: Ends with completion of participant closeout
Termination phase: Ends with termination of funding for the trial
Post-funding phase: Ends with completion of all trial activities and publications
Comment: Phases of an individual trial are not always clearly defined regarding beginning and ending dates or events. Activities of a phase may overlap with those of another; hence the end of each phase is defined above by an "event." Activities of the trial centers vary by trial phase; in fact, the coordinating center investigators must be constantly planning for the next phase.

Other Resource Centers

Other resource centers required for an individual trial depend on the goals of the trial and the need for standardization of trial procedures. Resource centers may be created to serve an individual trial or serve multiple trials that have similar needs. Some of the responsibilities assigned to resource centers have been:

• Establish pre-randomization eligibility or confirm post-randomization eligibility
of trial participants based on review of images or biologic specimens from trial
participants, possibly together with selected clinical data.
• Interpret images from participants made at baseline and/or follow-up
examinations.
• Prepare and distribute trial medications, and possibly other supplies, with appro-
priate masking of trial personnel and participants.
• Monitor accuracy of assays made at local laboratories.
• Monitor/confirm clinical center adherence to treatment protocol.
• Collect and analyze participant diaries, regarding, for example, nutrient intake,
adherence to study medication, or exercise program.
• Collect and store biospecimens for analysis for the current trial or future research.

As an example, resource centers that participated in the Collaborative Ocular Melanoma Study (COMS) and their responsibilities are described in Table 3. The COMS was a multicenter study with two randomized trials of radiotherapy versus enucleation (removal of the eye) for choroidal melanoma in adults. Of the resource centers listed in Table 3, three relocated with the center investigators during the 20 years that trial activities were underway, an important lesson for investigators who design trials with long expected accrual or follow-up periods.
Central laboratories have a long history in US multicenter trials, extending back to the University Group Diabetes Program and the Coronary Drug Project (Hainline et al. 1983) and possibly earlier in oncology trials for standard analysis of biochemical specimens. An extensive literature exists regarding central laboratories, particularly from the perspective of the pharmaceutical industry (Habig et al. 1983; Cooper et al. 1986; Rosier and Demoen 1990; Davey 1994; Davis 1994; Dijkman and Queron 1994; Harris 1994; Wilkinson 1994; Lee and Hill 2004; Sheintul et al. 2004; Strylewicz and Doctor 2010). The integration of local and central roles in an international multicenter trial has been described by Nesbitt et al. (2006).

Table 2 Common responsibilities of [data] coordinating centers and clinical centers by trial phase. (Adapted from the Coordinating Center Models Project and other sources)

[Data] coordinating centers | Clinical centers

Planning and pilot phase
Participate in literature review/meta-analyses to assess and document need for contemplated trial |
Participate in design and analysis of pilot studies to assess feasibility of trial design and methods | Conduct pilot studies
Meet with investigators expected to direct other resource centers |
Visit one or more clinical centers expected to participate in trial | Engage in discussions regarding possible participation in trial

Initial design phase
Estimate required sample size |
Outline the data collection schedule |
Outline quality assurance and monitoring procedures |
Outline data analysis plans |
Outline data intake and editing procedures |
Prepare funding proposal for the [data] coordinating center | Review proposed clinical center budget
Participate in preparation of qualifications of clinical centers and the selection process |
Work with the proposing study chair to coordinate the overall funding application package |
Coordinate development of a draft manual of procedures for the trial | Provide input on manual of procedures if asked

Protocol development and implementation phase
Register or assure registration of trial in a clinical trials registry | Determine feasibility of integrating trial into clinic setting
Develop patient treatment allocation (randomization) procedures | Constitute study team and select primary clinical coordinator
Develop computer software and related procedures for receiving, processing, editing, and analyzing study data | Complete Good Clinical Practice and ethics training
Develop and test study data forms and methods used for completion and submission by clinical center personnel |
Oversee development of interfaces for data transmission between individual resource centers and clinical centers | Organize infrastructure including telephone, computer and Internet, courier services
Coordinate drug or device distribution | Procure equipment, including refrigerators or freezers, lockable cabinets, and any special equipment
Train clinical center personnel in the data collection and transmission process | Attend training meeting
Implement training certification in the trial protocol at clinical centers | Complete certification requirements
Distribute study forms and related study material for use in next phases of the trial |
Designate one or more coordinating center investigators to serve on each trial committee | Institute interdepartmental communication pathways, including pharmacy, radiology, laboratory, etc.
Modify and refine manual of procedures and distribute to all trial centers | Organize filing system and binders, including those for essential documents
Develop and document internal procedures for coordinating center operations and responsibilities | Organize space, including interview or exam rooms, storage areas for confidential materials, and drugs or devices
Act as a repository for official records of the trial: minutes of meetings, committee reports, etc. |
When agreed with sponsor, reimburse clinical centers and other resource centers and others based on the funding award | Complete budgeting and contractual negotiations
Participate in creating the application for local and/or study-wide institutional review boards/ethics committees | Incorporate trial protocol and informed consent statement template into local ethics board template and submit for approval
Develop and implement dedicated website with trial information suitable for access by multiple stakeholders |

Participant recruitment phase
Develop templates for recruitment materials | Develop recruitment materials or use templates developed by coordinating center; submit to local ethics review board
| Implement recruitment activities
Administer treatment assignments. Periodically check (1) baseline comparability of treatment arms and (2) characteristics of participants versus target population and eligibility criteria | Screen potential study participants; complete eligibility testing on potential study participants
Implement editing procedures to detect data deficiencies | Complete all baseline data collection forms to determine eligibility of potential study participant
Develop monitoring procedures and prepare data reports to summarize performance of participating clinical centers with patient recruitment | Obtain formal informed consent from study participant and complete randomization process
Develop monitoring and reporting procedures to detect evidence of adverse or beneficial effects of trial treatments |
Respond to requests for reports and data analyses from within the trial organization |
Implement and lead quality assurance and monitoring program |
Schedule and participate in site visits to clinical centers and other resource centers | Participate in site visits, and respond to queries by site visitors
Prepare progress reports for trial sponsor |
Prepare, or collaborate in preparing, any requests for continued or supplemental funding by the sponsor |
Prepare a manuscript to describe the trial design |

Participant treatment and follow-up phase
Monitor drug or device distribution | Administer treatment as assigned by randomization, including accountability activities for study medications or devices
Monitor treatment adherence | Complete required documents, or provide materials related to treatment adherence
Prepare periodic reports of the data concerning adverse and beneficial effects of trial treatments |
Monitor and report adverse events to sponsors as required | Identify and report all serious adverse events as required by trial and federal agencies and local ethics committees
Prepare periodic reports of the performance of all trial centers | Identify and resolve all protocol deviations
| Schedule all study visits, including any logistical issues (e.g., travel for participant(s), scheduling radiology or surgery, etc.)
Evaluate data handling procedures and modify as necessary | Complete all study follow-up visits and associated data collection forms, and transmit data to data coordinating center
Analyze baseline and related data for publication, as appropriate | Respond to data queries from coordinating center
Prepare materials for investigator meetings | Attend all investigator group meetings
Prepare summary of trial results for individual participants for use in closeout discussion and final data collection |
Develop and test data forms for patient closeout phase |
Initiate/lead searches for participants lost to follow-up by clinical centers |
Work with trial leadership and investigators to develop a publication plan |
Coordinate participant closeout process |
| Complete annual progress reports for ethics review boards

Participant closeout phase
Collect participant closeout data | Complete closeout study visit
Coordinate and monitor progress with participant closeout |
Monitor adherence to closeout procedures |
Develop plans for final checks on completeness and accuracy of trial database | Complete all remaining data queries from coordinating center
Develop and test analysis programs for any additional data summaries or analyses |
Develop plan for final disposition of final trial database and accumulated materials |
Participate in reorganization of trial for final phases, including disengagement of clinical centers | Complete all center closeout activities, including final reports
Continue to participate in preparation of manuscripts to disseminate trial findings and methods |
Coordinate and monitor progress with trial manuscripts | Participate in manuscript writing, as required

Termination phase
Perform final quality checks of trial database |
Implement plans for documentation and disposition of final database and other trial records |
Advise clinical center personnel regarding disposition of local trial records | Archive or dispose of all study documents as required by sponsor
Continue activities regarding preparation and publication of manuscripts | Participate as co-author on study publications, if requested
Monitor collection and disposal of unused study medications and supplies | Complete accountability of study medications or devices and all study materials
Undertake final efforts to determine or confirm the vital status of all trial participants |
Provide writing teams with data analyses and summaries needed to complete manuscripts | Present study findings at a conference, as requested
Circulate manuscripts for review by trial investigators prior to submission for publication | Review final manuscripts
Table 3 Example of resource centers and selected responsibilities in the Collaborative Ocular
Melanoma Study
Study chairman’s office
Organize planning meetings of prospective center investigators
Identify potential clinical centers and resource centers
Interact with sponsor to assess feasibility of support
Prepare core study funding applications in collaboration with coordinating center investigators
Design and develop informational materials for prospective study participants
Design and disseminate materials to inform community oncologists and ophthalmologists
about the trials
Schedule meetings of investigators and committees; develop meeting agendas in collaboration
with coordinating center investigators and committee chairs
Monitor adherence of investigators to study policy for presentation and publication of study
data
Participate in preparation of manuscripts to disseminate trial findings
Advise the coordinating center investigators regarding issues outside their areas of expertise
Coordinating center
Coordinate preparation and submission of funding applications to sponsor
Coordinate study communications
Enroll and randomize eligible participants
Maintain the COMS Manual of Procedures
Develop methods for and coordinate data collection
Maintain the COMS database
Monitor data quality at other centers and internally
Analyze and report accumulating data to appropriate groups
Maintain study documents
Coordinate preparation of manuscripts to report trial findings
Archive the final database and documentation after study completion
Ophthalmic echography reading center
Train clinical center echographers in study methods
Confirm diagnosis of choroidal melanoma based on photographs from baseline echographic
examination
Monitor quality of echography
Measure tumor height to monitor changes after brachytherapy
Assess topographic features of tumors
Photograph reading center
Train clinical center photographers in study methods
Monitor quality of photography
Confirm diagnosis of choroidal melanoma based on characteristics observed on photographs
and fluorescein angiograms of tumor
Describe changes in boundaries of tumor base
Describe retinopathy following brachytherapy
Pathology center
Describe tumor characteristics based on external and microscopic examinations of enucleated
eyes
Provide technical processing of enucleated eyes sent from clinical centers
Coordinate activities of the Pathology Review Committee
Radiological physics center
Participate in development of radiotherapy protocols with radiation oncologist study co-chair
Assess accuracy of clinical center calculation of radiation dose, and notify clinical center in case
of disagreement
Disseminate instructive findings to clinical center personnel
Sponsor: National Eye Institute
Monitor overall study progress
Observe adherence to study goals

Image analysis and interpretation centers were used in early multicenter trials for standard interpretation of electrocardiographic tracings from participants (Prineas and Blackburn 1983; Rautaharju et al. 1986). Over time, manual review and coding
largely has been replaced with automated methods (Goodman 1993). Resource
centers with similar roles for other types of images have become common as new
imaging methods have been developed and implemented clinically to monitor the
effects of treatment. They have been widely used in oncology (Chauvie et al. 2014;
Gopal et al. 2016) and ophthalmology trials (Siegel and Milton 1989; Danis 2009;
Price et al. 2015; Toth et al. 2015; Domalpally et al. 2016; Rosenfeld et al. 2018) but
also in trials in other medical settings (Desiderio et al. 2006; Ahmad et al. 2016).
Central pharmacies/procurement and distribution centers also have a long
history in multicenter clinical trials. They have been established to aid with masking
of treatment assignments in pharmaceutical trials and with distribution of supplies
to clinical centers and participants (Fye et al. 2003; Martin et al. 2004; Peterson et al.
2004; Rogers et al. 2016).
Adjudication centers or committees have been created for many trials to confirm
or code outcomes reported by clinical center personnel or trial participants (Moy
et al. 2001; Pogue et al. 2009; Marcus et al. 2012; Barry et al. 2013). The need for
central adjudication of outcomes such as death has been debated (Granger et al.
2008; Meinert et al. 2008).
Other types of resource centers have been less common but have played impor-
tant roles in multicenter clinical trials to date (Glicksman et al. 1985; Kempson 1985;
Sievert et al. 1989; Carrick et al. 2005; Henning 2012; Shroyer et al. 2019).
A registry of resource centers with various types of expertise and experience with
participation in multicenter clinical trials would be a useful resource for designers
of future trials. Similarly, a registration system similar to that used in the United
Kingdom for clinical trials units could be modified to be applicable to other types
of resource centers.
Regardless of the role of a resource center, interactions with the (clinical and data)
coordinating centers and clinical center personnel are required to assure that trans-
mission of information and materials from clinical centers to the resource center is
timely and accurate and that the information transmitted to the trial database is linked
to the correct trial participant and examination.
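
One concrete device for keeping transmitted records linked to the correct participant is a check digit embedded in the participant identifier, so that most single-digit transcription errors are caught before a record enters the trial database. The mod-10 weighting below is a generic, hypothetical scheme offered purely as an illustration; it is not the identifier system of any specific trial.

def check_digit(base_id: str) -> str:
    """Compute one check digit from the numeric characters of an ID."""
    digits = [int(c) for c in base_id if c.isdigit()]
    # Position-dependent weights so that many transpositions are also caught.
    weighted = sum(d * w for d, w in zip(digits, range(2, 2 + len(digits))))
    return str(weighted % 10)

def make_id(clinic: int, sequence: int) -> str:
    base = f"{clinic:02d}-{sequence:04d}"
    return f"{base}-{check_digit(base)}"

def is_valid(full_id: str) -> bool:
    base, _, given = full_id.rpartition("-")
    return check_digit(base) == given

pid = make_id(clinic=7, sequence=123)   # "07-0123-9"
print(pid, is_valid(pid))               # True
print(is_valid("07-0124-" + pid[-1]))   # False: this one-digit slip fails the check

A resource center can apply such a check on receipt of each shipment or data record and query the clinical center immediately when an identifier fails.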
Resource centers may be funded independently of other trial centers or may
receive funding from the clinical or data coordinating center budgets. The expertise
and number of personnel at these centers depend on the type of collaboration
provided. As noted by Farrell (1998), expertise acquired at resource centers is
a valuable resource and should be recognized.
Typical responsibilities of resource centers, other than coordinating centers,
include:

• Maintain secure and accurate local records of trial participant identifiers.
• Log receipt of materials, track progress with processing, and store materials securely (a minimal tracking sketch follows this list).
• Establish the format and frequency of data to be transmitted to the trial database.
• Respond promptly in trials where eligibility of candidate participants depends on
reading or interpretation of baseline materials or specimens.
• Notify clinical center personnel of deficiencies in transmission/shipment or qual-
ity of specimens/materials.
• Provide periodic monitoring of internal data quality.
• Collaborate with external monitors of data quality.
• Provide progress reports to trial sponsor and committees at agreed intervals.
• Adhere to local institutional regulations and policies.
• Establish and monitor clinical trial budget for the center.
• Document internal procedures relative to the trial.
• Collaborate with site visitors on behalf of the trial.
• Prepare manuscripts to describe central methods used for the trial.
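
As a minimal sketch of the logging and tracking responsibilities above, a resource center's receipt log might be modeled as follows; the field names and status values are hypothetical and would be tailored to the individual trial.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class ShipmentRecord:
    participant_id: str
    clinic_id: str
    material: str                # e.g., "serum" or "fundus photographs"
    received: date
    status: str = "received"     # hypothetical flow: received -> processed -> reported
    notes: list = field(default_factory=list)

    def advance(self, new_status: str, note: str = "") -> None:
        # Record each processing step so progress can be tracked and audited.
        self.status = new_status
        if note:
            self.notes.append(note)

log = []
rec = ShipmentRecord("07-0123-9", "07", "serum", date(2022, 3, 14))
log.append(rec)
rec.advance("processed", "assay batch 12")
print(rec.participant_id, rec.status, rec.notes)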

Clinical Centers

Clinical centers are a unique type of “resource” center. Without a committed and fully
functioning clinical center, a multicenter trial is doomed to failure. Clinical centers
or sites are the engines that generate the data needed to answer the research question
posed by the trial. The historical model for a clinical center has been a single
academic center, but other workable models have emerged with the inclusion of
clinical practice sites (Submacular Surgery Trials Research Group 2004; Dording
et al. 2012), nontraditional sites such as nursing homes (Kiel et al. 2007), and
international centers (Perkovic et al. 2012). In some cases, with the organization
of the clinical trial networks, a network of dedicated clinical sites may contribute to
multiple related trials (Beck 2002; Sun and Jampol 2019). Even though clinical
centers may be located at any one of these types of sites, all are responsible for
administrative functions, patient interactions, data management functions, and inter-
actions with coordinating and other resource centers. Similarly, personnel within
a clinical center can vary depending on the complexity of the trial being conducted,
but all clinical centers have a principal investigator who is usually the clinician
actively involved in the trial and a clinical coordinator, who handles day-to-day
functions. Overall, clinical centers communicate with and are responsive to the
clinical coordinating or data coordinating center and often with other resource
centers. In addition to administrative functions, clinical center responsibilities
always include recruitment and accrual of study participants, including determina-
tion of eligibility, elicitation of an informed consent, administration of treatment, and
completion of follow-up visits. Secure and confidential data collection is ongoing
during these processes.
Clinical center responsibilities during each phase of the trial are summarized in
Table 2 and described in the following sections.

The Implementation Phase

The first responsibility of the clinical center is to learn about the incipient trial. As the
principal investigator, the clinical center director must have a thorough understand-
ing of the research question, the trial design, the operations, and the possible impact
on the clinic if the decision is made to join the investigative group. The primary
clinical coordinator optimally is selected at this phase of the trial, and the clinical
director and coordinator together perform many of the tasks necessary to integrate
the trial into the clinic setting.
Following clinical center selection, the clinical center enters into negotiations
for contractual or funding arrangements. Because operational failures are often based
on insufficient allocation of resources (Melnyk et al. 2018), it is important to ensure
sufficient funding for all phases of the trial, including both start-up and final visit
data collection. Certainly, given the time commitment required by a trial, either time
protected from clinical responsibilities or appropriate reimbursement for research
time is helpful, if not necessary, for the principal investigator (Herrick et al. 2012).
Different models for clinical center funding exist, ranging from fixed funding based on
personnel effort and anticipated costs to capitation with reimbursement based
on completion of all data entry for enrollment and study visits by participants.
A combination with some fixed up-front costs and further reimbursement based on
completed visits also has been used successfully (Jellen et al. 2008).
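
As a toy illustration of such a hybrid payment model, with entirely hypothetical amounts and counts:

# Hybrid clinical center payment: a fixed start-up amount plus per-event
# reimbursement for completed, fully entered visits. All figures are invented.
START_UP = 15_000        # fixed payment at site activation
PER_ENROLLMENT = 1_200   # paid when a participant is randomized and data are entered
PER_FOLLOW_UP = 400      # paid per completed follow-up visit with complete data

def site_payment(enrollments: int, follow_up_visits: int) -> int:
    return START_UP + enrollments * PER_ENROLLMENT + follow_up_visits * PER_FOLLOW_UP

print(site_payment(enrollments=25, follow_up_visits=180))  # 15000 + 30000 + 72000 = 117000

Tying most of the payment to completed, fully entered visits gives the site a direct incentive to finish data collection, while the fixed portion covers start-up costs that pure capitation would miss.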

Organizing the Site


To participate fully in a multicenter trial, the site must be organized to ensure
adequate space, telecommunications, and computer systems. Rooms for physical
examinations or private participant interviews may be necessary depending on the
specific trial. Lockable cabinets are also necessary to store study participant binders
and other confidential patient information. Special equipment or materials may
be required (e.g., specific charts or instrumentation). In some cases, trial-specific
equipment, such as laptop computers, may be supplied as part of the trial. If
biospecimens are to be obtained, then laboratory equipment, such as a centrifuge,
and refrigerators or freezers to store samples are required. The principal investigator
must ascertain whether the required equipment is available and make provisions
for maintenance and upkeep of the equipment for the duration of the trial.
Given the collaborative nature of a multicenter trial, it is necessary to have
clear communication pathways between the involved parties, so adequate computer
and Internet service is essential. Integration of the site with other facilities or
departments may also be required, such as radiology or the chemistry laboratories. If
biospecimens are required to be sent to a central laboratory, required supplies and
courier services need to be set up. Administration of a study drug mandates either an
appropriate locked cabinet or safe for storage of the study drug or involvement of the
local pharmacy.

Assembling the Study Team


Hand in hand with organizing the site is assembling the study team. Depending on
trial needs, staff in addition to the principal investigator and coordinator may
include specialty physicians, imaging technicians, laboratory personnel, psycho-
metricians, regulatory monitors, or other technical staff. In choosing trial staff, the
principal investigator should build a coalition of persons capable of sustained
group effort to maintain continuous communication among staff members
(Chung and Song 2010). Notwithstanding the size or attributes of the staff at a
clinical site, the principal investigator is ultimately responsible for the performance
and integrity of the site. On the other hand, all members of the clinical center team
should have a commitment to the trial and have a voice in the local decision-
making (Fiss et al. 2010).

Attributes of the Principal Investigator


The principal investigator should be an experienced clinical researcher with an
understanding of the structure and conduct of randomized trials, have completed
Good Clinical Practice training, be open to new ideas, and have impeccable integrity.
Any potential conflicts of interest must be disclosed both at the beginning and also
throughout the trial. The principal investigator should be a leader and have the ability
to delegate tasks and engender a spirit of teamwork and collaboration among the
clinical center staff.

Attributes of the Clinical Coordinator


The importance of a skilled coordinator cannot be overstated. This person is crucial
to the proper functioning of a clinical site. Many trials require clinic coordinators to
have specific credentials and/or certification (e.g., nursing degree). Responsibilities
reported by most oncology trial clinical coordinators included patient registration
and randomization, recruitment, follow-up, case report form completion, serious
adverse event reporting, managing study files, and preparing for, and attending,
audits (Rico-Villademoros et al. 2004). Empathy, an essential attribute for a coordi-
nator (Jellen et al. 2008), provides an incentive for study participants to remain in the
trial and complete all follow-up visits. An effective coordinator is flexible, works indepen-
dently, and has superlative organizational skills.
As Jellen stated (Jellen et al. 2008):

Responsibility for protocol adherence ultimately rests with the principal investigator, but
ensuring that it is achieved falls primarily to the clinic coordinator.

There may be a single coordinator at a clinical center or multiple coordinators
with each performing a specific task or serving as backup when necessary.
The coordinator or a separate regulatory monitor also coordinates all compliance
issues. A “regulatory monitor” is responsible for financial and contractual arrangements and all interactions with the ethics review board and federal drug agencies. A
monitor also may review the protocol for feasibility and develop and monitor the
clinical center budget together with the principal investigator. Other responsibilities
include all communications and interactions with the local ethics board and appro-
priate and timely reporting of adverse events.

Interactions with the Team


Once funding is secured, and the team assembled, the principal investigator
typically meets with the staff to orient them to the trial and individual responsibilities
delegated to each staff member. These meetings, coordinated by the study coordi-
nator, optimally continue on a regular basis, either as a group or singly, depending on
the required input.

Ethical Approval
Before participating in the trial, approval from the local ethics board or a commercial review board is required to conduct the trial at each institution. Often the coordinator or
regulatory monitor drafts the required materials for ethics review by using templates
of the study protocol and consent forms provided by the data or clinical coordinating
center. Continuing interactions with the ethics committee include obtaining approval
for any amendments to the protocol, approval for ancillary studies, notification
of any serious adverse events that occur, and submission of annual progress reports.
If there is a data monitoring committee for the trial as a whole, clinical centers also submit the report of this committee to the local ethics board following each meeting.

Organization Binders/Files
Prior to trial initiation, the coordinator or regulatory monitor typically sets up all
the requisite binders and trial files, whether paper-based or electronic, and organizes
all correspondence. Binders include those for essential documents as mandated
by Good Clinical Practice, a current manual of procedures or handbook, required
logbooks, ethics board correspondence, a scheduling book, and study participant
binders, among others. The coordinator maintains currency of these documents and
files as the trial progresses.

Certification and Training


Trial-specific training invariably is required for the principal investigator, coordina-
tor, and all other involved staff. Training typically takes place at a “kickoff” full
investigative group meeting over 1 or 2 days at a convenient location. There often
are breakout sessions where responsibilities specific to the coordinator and possibly
other clinic personnel are covered. Topics typically covered during training for
coordinators include research integrity, trial protocol, governance, timeline, and
trial-specific processes for randomization, treatment administration, data manage-
ment, and adverse event reporting. Associated with training is documentation
of basic knowledge about the trial and role-specific understanding and knowledge
through the process of certification to participate in the trial. Usually there are
requirements for certification that must be met to allow data collection or treatment
administration within the trial. The purpose of certification is to demonstrate
knowledge of the trial and competency in the role the staff member will hold.
Requirements may include reading the manual of procedures, demonstrating understanding of the trial objectives and design, and demonstrating skill in administration of tests or questionnaires. Certification may also require submission of "dummy" data
collection forms or data collected using trial materials (e.g., an audiotape for
counseling). For surgical procedures, documentation of experience or submission
of videotapes may be required. With staff replacements, training and certification
continue during the trial, especially for longer trials (Mobley et al. 2004). While
the trial is ongoing, staff may be trained by an existing certified staff member or
a member of the clinical coordinating center or data coordinating center who visits
the clinical center for this purpose. Other options include special training meetings,
webinars with didactic lectures and/or videotapes, or slides sets or videotapes from
previous training meetings.

Participant Enrollment Phase

Recruitment
Following ethics review board approval, accrual of study participants, or recruit-
ment, proceeds. Recruitment materials, such as educational brochures or poster
templates, are used to facilitate identification of potential study participants. These
materials are either prepared using templates provided by study resource centers or developed at the clinic and specifically aimed at the local population (Mullin
et al. 1984). Other recruitment activities include disseminating information through
grand rounds or local presentations or directly using letters or personal meetings with
colleagues or other persons who have access to the potential patient population.

Screening and Randomization


Screening of potential participants is the first step in determining eligibility for the trial and is when the informed consent process begins. Assignment of an identification number to each candidate often happens at screening to facilitate confidentiality.
Following a positive screening, clinical center staff further assess eligibility and
continue the informed consent process by explaining the trial to the potential
participant in greater detail. The coordinator and principal investigator review
collected data to verify eligibility prior to enrollment in the trial. A binder or file is
prepared for every person who undergoes screening and eligibility testing, while
information for those failing eligibility testing may be filed separately.
Following confirmation of eligibility, a designated staff member proceeds with
obtaining final informed consent from the study participant to enroll in the trial. The
staff member who obtains informed consent makes sure that the potential participant
understands the risks and benefits and his or her responsibilities within the trial.
Formal informed consent is documented by the participant signing the informed
consent statement; a copy is given to the participant, and a copy is kept for the clinic
and is stored in a binder or file designated for that purpose. Randomization and
treatment assignment typically occur only after informed consent has been obtained
and documented. Randomization is often achieved via an online system, although in
small or single-center studies, sealed opaque-coded envelopes may be issued to
provide the assigned treatment. Documentation related to randomization is stored in
the participant binder or file.
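
To make the mechanics concrete, the sketch below shows, in Python, how a permuted-block randomization list might be generated. It is a minimal illustration only, not the procedure of any particular trial or online system; the arm labels, block size, and seed are hypothetical, and a real trial would generate and safeguard such a list at the data or clinical coordinating center.

```python
import random

def permuted_block_list(n_participants, arms=("A", "B"), block_size=4, seed=20220101):
    """Generate a permuted-block randomization list.

    Every block contains an equal number of assignments to each arm,
    which keeps group sizes balanced as accrual proceeds.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the list is reproducible for audit
    assignments = []
    while len(assignments) < n_participants:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # permute assignments within the block
        assignments.extend(block)
    return assignments[:n_participants]

# Example: prepare 12 sequentially numbered assignments, e.g., for sealed envelopes
for i, arm in enumerate(permuted_block_list(12), start=1):
    print(f"Assignment {i:03d}: treatment {arm}")
```

Truncating the final block, as here, can leave a small imbalance; actual systems handle this, stratification, and treatment masking in protocol-specific ways.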

Treatment and Follow-Up Phase

The next step is to implement the assigned treatment. If a drug or pharmaceutical is assigned, the drug must be obtained from the pharmacy or from the storage cabinet
where study drug is held, carefully checked to ensure that the assigned code
matches that on the bottle or other drug container, and given to the participant
with instructions. Similarly, if the assignment involves a device, then the device
must be retrieved or ordered and may require matching codes. If the assignment is
a specific surgical procedure, the surgery and operating room must be scheduled
within the required time window for treatment. A counseling or other behavioral
intervention requires scheduling within the designated time frame. For all assign-
ments, it is imperative that the intervention given to the study participant matches
that assigned at randomization.

Scheduling and Communications


For the most part, the clinic coordinator becomes the face of the trial for the study
participant. The coordinator schedules all study visits and ensures that all required
procedures and data collection take place during the study visit. Preparations may
include ensuring examination or interview space is available and all imaging,
physical, or psychological examinations that are required for that visit are scheduled.
If study drug is to be distributed, then the drug must be obtained from the pharmacy
or storage area to give to the study participant. The clinic coordinator either arranges
travel or investigates transportation options when required for overnight visits or
for participants unable to drive themselves to the clinic. At the study visit, the
coordinator makes sure that the contact information for the study participant is up
to date and arranges for meals for long visits. Between visits, especially with long
intervals between visits, the coordinator may contact the study participant to make
sure everything is going well. The study coordinator serves as a patient advocate and
educator and can provide referrals to social services or other needed resources
(Larkin et al. 2012). Throughout trial interactions, the coordinator builds a relation-
ship with each study participant, establishing a rapport as they work together on the
trial as a team.
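
As an illustration of the scheduling arithmetic involved, the following Python sketch computes target dates and acceptable windows for follow-up visits. The visit schedule and window widths shown are hypothetical placeholders, not drawn from any protocol cited here.

```python
from datetime import date, timedelta

# Hypothetical schedule: (visit name, target day after randomization, window half-width in days)
SCHEDULE = [
    ("Month 3 follow-up", 90, 14),
    ("Month 6 follow-up", 180, 14),
    ("Month 12 follow-up", 365, 21),
]

def visit_windows(randomization_date):
    """Return each visit with its target date and earliest/latest acceptable dates."""
    rows = []
    for name, target_day, half_width in SCHEDULE:
        target = randomization_date + timedelta(days=target_day)
        window = timedelta(days=half_width)
        rows.append((name, target - window, target, target + window))
    return rows

for name, earliest, target, latest in visit_windows(date(2021, 3, 15)):
    print(f"{name}: target {target}, acceptable {earliest} through {latest}")
```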

Adverse Events
Adverse events that occur during a trial, particularly serious adverse events, must be
handled correctly and in a timely manner. Appropriate medical care for the study
participant overrides the trial protocol, especially for serious adverse events or
events that may be related to a trial intervention. The assigned treatment may need
to be discontinued and appropriate documentation prepared. All regulatory bodies, including trial sponsors, the local ethics board, and pertinent regulatory agencies, must
be notified within the time frame designated by the trial protocol.
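
Because these notification windows are deadline-driven, coordinators often track them explicitly. The short sketch below is purely illustrative; the event categories and time frames are hypothetical placeholders, since actual reporting clocks are set by the protocol, the sponsor, and the applicable regulators.

```python
from datetime import datetime, timedelta

# Hypothetical reporting clocks; real time frames come from the protocol
# and regulations, not from this sketch.
REPORTING_CLOCKS = {
    "serious, related, unexpected": timedelta(hours=24),
    "serious, other": timedelta(days=7),
}

def report_due(awareness_time, category):
    """Return the deadline for notifying sponsor, ethics board, and regulators."""
    return awareness_time + REPORTING_CLOCKS[category]

aware = datetime(2021, 3, 15, 9, 30)  # when the site learned of the event
print("Report due by:", report_due(aware, "serious, related, unexpected"))
```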

Data Collection
Data are the backbone of clinical research, mandating proper data collection and
management. Data may be collected using paper forms, directly onto electronic
media, or indirectly through review of medical charts, taped interviews, or other
methods. Data sources include participant self-report, scheduled interviews or exami-
nations, or imaging or laboratory findings. Collection of external data also may be
required, e.g., death certificates for determination of cause of death, hospital records, or
operative notes from surgical procedures. Data are collected at various points during a
participant’s sojourn in the trial and possibly in multiple ways. At screening and
eligibility assessment, eligibility criteria are verified, and baseline data used to assess
change over follow-up may be collected. Treatment administration requires data related
to treatment adherence or compliance and associated treatment-related adverse events.
Outcome data and adverse events are collected during follow-up. Clinical center staff
must ensure faithful and accurate data collection. Data forms should be checked for
completeness and accuracy prior to transmission to the data coordinating center.
Transmission of data may be implemented using various modes (e.g., paper forms,
online, submission of electronic files) but should be completed expeditiously and
accurately while maintaining confidentiality. In addition to collecting participant-related
data, the clinic also manages other study-related data, such as drug accountability or
biospecimen collection and shipment. There may also be specific data forms dealing
with study-related events, such as protocol deviations or adherence to treatment.

Data Management
Although the database may have checks to promote accurate entry and double-data entry may be employed, there still may be missing, out-of-range, or inconsistent data items.
Errors may occur within a single form, across forms, or across visits. Typically, the
data coordinating center routinely reviews the data and queries the clinical center about
perceived errors. Clinic staff then review each query and, if an error has occurred,
correct the paper forms or electronic record and transmit the corrected item to the data
coordinating center. These data management activities typically take more time at the
beginning of a trial or during slow recruitment when more errors occur (Taekman et al.
2010). The amount of effort for all data management activities requires a substantial
time commitment by the coordinator (Goldsborough et al. 1998).
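
The flavor of these edit checks can be sketched in a few lines of Python. The field names, plausibility limits, and record below are hypothetical; production data systems implement far more extensive within-form, cross-form, and cross-visit checks.

```python
def edit_checks(record):
    """Return query messages for one visit record: missing, out-of-range, or inconsistent items."""
    queries = []
    # Missing-item checks
    for field in ("participant_id", "visit", "systolic_bp", "diastolic_bp"):
        if record.get(field) is None:
            queries.append(f"{field}: missing value")
    sbp = record.get("systolic_bp")
    dbp = record.get("diastolic_bp")
    # Range checks (plausibility limits only, not clinical judgments)
    if sbp is not None and not 60 <= sbp <= 260:
        queries.append(f"systolic_bp: value {sbp} outside 60-260")
    if dbp is not None and not 30 <= dbp <= 150:
        queries.append(f"diastolic_bp: value {dbp} outside 30-150")
    # Within-form consistency check
    if sbp is not None and dbp is not None and sbp <= dbp:
        queries.append("systolic_bp should exceed diastolic_bp")
    return queries

record = {"participant_id": "C01-0042", "visit": "month 6",
          "systolic_bp": 118, "diastolic_bp": 121}
for query in edit_checks(record):
    print(f"Query to clinic re {record['participant_id']}: {query}")
```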

Site Visits
Site visits are a quality assurance measure that typically includes a review of the clinic
setting, implementation of the trial protocol, and routine audits of the data. Having a site
visit scheduled often provides an incentive for a clinical center to make sure all
documents are current and well-organized. During the site visit, the site visitors typically
will observe administration of a clinical procedure, determine that adequate space and
required facilities are available to conduct the trial, review that essential documents
(paper or electronic) are current and that signed informed consent documents are
available for each study participant, and conduct a data audit. During an audit, data in
the database are compared with those recorded on paper forms or in other ways.
Discrepancies are treated as data errors and require review and correction. Common
problems encountered during site visits are inadequate consent documentation, prob-
lems with drug accountability, and protocol nonadherence (Turner et al. 1987).
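
A minimal sketch of the comparison at the heart of such a data audit, assuming hypothetical field names and values, might look like this in Python:

```python
def audit(source_record, database_record):
    """Compare source-document values against the database; return discrepancies."""
    discrepancies = []
    for field, source_value in source_record.items():
        database_value = database_record.get(field)
        if database_value != source_value:
            discrepancies.append((field, source_value, database_value))
    return discrepancies

# Hypothetical source document (e.g., a paper form) and the corresponding database record
source = {"participant_id": "C01-0042", "weight_kg": 81.4, "visit_date": "2021-03-15"}
database = {"participant_id": "C01-0042", "weight_kg": 84.1, "visit_date": "2021-03-15"}

for field, src, db in audit(source, database):
    print(f"{field}: source={src!r}, database={db!r} -> review and correct")
```

Here the transposed digits in weight_kg illustrate the kind of transcription error an audit is designed to catch.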

Study Meetings
Goals of full investigative group meetings vary, but the meetings generally are designed to build collaboration between investigators and coordinators. Topics covered may include
trial progress; problem-solving, especially during the recruitment phase; and issues
related to implementing the protocol in the clinic setting. Protocol amendments may
be discussed and reviewed, as well as performance reports and accumulating base-
line data. Continuing education also may be provided, by either engaging outside
speakers or having a trial investigator provide updates on the scientific literature in
the trial topic area.

Participant Closeout Phase

Final Follow-Up Visit


Relationships that the study participant may have forged with the coordinator over a year or more of follow-up change at the last study visit. The principal investigator
should ensure that there are no untoward events associated with termination of
treatment. He or she should also provide for ongoing medical care for the study
participant and provide information from trial participation to facilitate future care.

Post-Funding Phase

Dissemination of Study Results


The recognition received for completing a well-conducted trial comes with dissem-
ination of the trial results; the clinical center principal investigators typically are key
to this process. Their role may entail presenting trial results at a conference, being on
a Writing Committee for a journal publication, or presenting results locally.

Summary and Conclusion

Different types of centers form the nodes in the organizational structure of a multicenter trial. Resource centers perform designated functions as required within
specific trials; they include coordinating centers, reading centers, central laborato-
ries, and others. Clinical center responsibilities encompass functions related to
accrual, treatment, and follow-up of study participants. Specific responsibilities
of both types of centers change depending on the phase of the trial. Collaboration
among all resource centers and clinical centers is essential as they aim toward the
common goal of successfully completing a multicenter trial.

Key Facts

• Centers participating in multicenter clinical trials may include clinical or data coordinating centers and clinical centers, along with others as required by trial
design.
• Specific responsibilities of centers change across trial phases.
• Clinical and data coordinating centers provide scientific oversight of the trial, and
oversee data management and analyses.
• Clinical centers are responsible for accrual, treatment, and follow-up of study
participants.

Cross-References

▶ Administration of Study Treatments and Participant Follow-Up
▶ Consent Forms and Procedures
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Design and Development of the Study Data System
▶ Funding Models and Proposals
▶ Implementing the Trial Protocol
▶ Investigator Responsibilities
▶ Multicenter and Network Trials
▶ Paper Writing
▶ Participant Recruitment, Screening, and Enrollment
▶ Procurement and Distribution of Study Medicines
▶ Qualifications of the Research Staff
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Selection of Study Centers and Investigators
▶ Training the Investigatorship
▶ Trial Organization and Governance

References
Ahmad HA, Gottlieb K, Hussain F (2016) The 2 + 1 paradigm: an efficient algorithm for central
reading of Mayo endoscopic subscores in global multicenter phase 3 ulcerative colitis clinical
trials. Gastroenterol Rep (Oxf) 4(1):35–38
Alamercery Y, Wilkins P, Karrison T (1986) Functional equality of coordinating centers in a
multicenter clinical trial. Experience of the International Mexiletine and Placebo Antiarrhythmic
Coronary Trial (IMPACT). Control Clin Trials 7(1):38–52
Barry MJ, Andriole GL, Culkin DJ, Fox SH, Jones KM, Carlyle MH, Wilt TJ (2013) Ascertaining
cause of death among men in the prostate cancer intervention versus observation trial. Clin
Trials 10(6):907–914
Beck RW (2002) Clinical research in pediatric ophthalmology: the Pediatric Eye Disease
Investigator Group. Curr Opin Ophthalmol 13(5):337–340
Blumenstein BA, James KE, Lind BK, Mitchell HE (1995) Functions and organization of
coordinating centers for multicenter studies. Control Clin Trials 16(2 Suppl):4s–29s
Brennan EC (1983) The Coronary Drug Project. Role and methods of the Drug Procurement and
Distribution Center. Control Clin Trials 4(4):409–417
Canner PL, Gatewood LC, White C, Lachin JM, Schoenfield LJ (1987) External monitoring of
a data coordinating center: experience of the National Cooperative Gallstone Study. Control
Clin Trials 8(1):1–11
Carrick B, Tennyson M, Lund B (2005) Managing a blood repository for use by multiple ancillary
studies in the Women’s Health Initiative. Clin Trials 2(Suppl 1):S73
Chauvie S, Biggi A, Stancu A, Cerello P, Cavallo A, Fallanca F, Ficola U, Gregianin M, Guerra UP,
Chiaravalloti A, Schillaci O, Gallamini A (2014) WIDEN: a tool for medical image manage-
ment in multicenter clinical trials. Clin Trials 11(3):355–361
Chung KC, Song JW (2010) A guide to organizing a multicenter clinical trial. Plast Reconstr Surg
126(2):515–523
Cooper GR, Haff AC, Widdowson GM, Bartsch GE, DuChene AG, Hulley SB (1986)
Quality control in the MRFIT local screening and clinic laboratory. Control Clin Trials 7(3
Suppl):158s–165s
Danis RP (2009) The clinical site-reading center partnership in clinical trials. Am J Ophthalmol 148
(6):815–817
Davey J (1994) Managing clinical laboratory data flow. Drug Inf J 28:397–402
Davis JM (1994) Current reporting methods for laboratory data at Zeneca Pharmaceuticals.
Drug Inf J 28:403–406
Desiderio LM, Jaramillo SA, Felton D, Andrews LA, Espeland MA, Tan JC, Bryan NR, Perry J,
Liu DF (2006) A multi-institutional imaging network: application to Women’s Health Initiative
Memory Study. Clin Trials 3(2):193–194
Dijkman JHM, Queron J (1994) Feasibility of Europe-wide specimen shipments. Drug Inf
J 28:385–389
Dinnett EM, Mungall MM, Kent JA, Ronald ES, Gaw A (2004) Closing out a large clinical trial:
lessons from the prospective study of pravastatin in the elderly at risk (PROSPER). Clin Trials 1
(6):545–552
Domalpally A, Danis R, Agron E, Blodi B, Clemons T, Chew E (2016) Evaluation of geographic
atrophy from color photographs and fundus autofluorescence images: Age-Related Eye Disease
Study 2 report number 11. Ophthalmology 123(11):2401–2407
Dording CM, Dalton ED, Pencina MJ, Fava M, Mischoulon D (2012) Comparison of academic and
nonacademic sites in multi-center clinical trials. J Clin Psychopharmacol 32(1):65–68
Farrell B (1998) Efficient management of randomised controlled trials: nature or nurture. BMJ 317
(7167):1236–1239
Fiss AL, McCoy SW, Bartlett DJ, Chiarello LA, Palisano RJ, Stoskopf B, Jeffries L, Yocum A,
Wood A (2010) Sharing of lessons learned from multisite research. Pediatr Phys Ther 22(4):
408–416
Franzosi MG, Bonfanti I, Garginale AP, Nicolis E, Santoro E, N. Investigators (1996) The role
of a regional data coordinating centre (RDCC) in a multi-national large phase-II trial. Control
Clin Trials 17(Suppl 2S):104S–105S
Fye CL, Gagne WH, Raisch DW, Jones MS, Sather MR, Buchanan SL, Chacon FR, Garg R, Yusuf
S, Williford WO (2003) The role of the pharmacy coordinating center in the DIG trial. Control
Clin Trials 24(6 Suppl):289s–297s
Gassman JJ, Owen WW, Kuntz TE, Martin JP, Amoroso WP (1995) Data quality assurance,
monitoring, and reporting. Control Clin Trials 16(2 Suppl):104s–136s
Glicksman AS, Reinstein LE, Laurie F (1985) Quality assurance of radiotherapy in clinical trials.
Cancer Treat Rep 69(10):1199–1205
Goldsborough IL, Church RY, Newhouse MM, Hawkins BS (1998) How clinic coordinators spend
their time. Appl Clin Trials 7(1):33–40
Goodman DB (1993) Standardized and centralized electrocardiographic data for clinical trials.
Appl Clin Trials 2(6):34, 36, 40–41
Gopal AK, Pro B, Connors JM, Younes A, Engert A, Shustov AR, Chi X, Larsen EK, Kennedy DA,
Sievers EL (2016) Response assessment in lymphoma: concordance between independent
central review and local evaluation in a clinical trial setting. Clin Trials 13(5):545–554
Granger CB, Vogel V, Cummings SR, Held P, Fiedorek F, Lawrence M, Neal B, Reidies H,
Santarelli L, Schroyer R, Stockbridge NL, Feng Z (2008) Do we need to adjudicate major
clinical events? Clin Trials 5(1):56–60
Habig RL, Thomas P, Lippel K, Anderson D, Lachin J (1983) Central laboratory quality control in
the National Cooperative Gallstone Study. Control Clin Trials 4(2):101–123
Hainline A Jr, Miller DT, Mather A (1983) The Coronary Drug Project. Role and methods of the
Central Laboratory. Control Clin Trials 4(4):377–387
Harris RAJ (1994) Clinical research aspects of sampling, storage, and shipment of blood samples.
Drug Inf J 28:377–379
Hawkins BS, Gannon C, Hosking JD, James KE, Markowitz JA, Mowery RL (1988) Report from
a workshop: archives for data and documents from completed clinical trials. Control Clin Trials
9(1):19–22
Henning AK (2012) Starting a genetic repository. Clin Trials 9(4):523
Herrick LM, Locke GR 3rd, Zinsmeister AR, Talley NJ (2012) Challenges and lessons learned in
conducting comparative-effectiveness trials. Am J Gastroenterol 107(5):644–649
Hosking JD, Newhouse MM, Bagniewska A, Hawkins BS (1995) Data collection and transcription.
Control Clin Trials 16(2 Suppl):66s–103s
Jellen PA, Brogan FL, Kuzma AM, Meldrum C, Meli YM, Grabianowski CL (2008) NETT
coordinators: researchers, caregivers, or both? Proc Am Thorac Soc 5(4):412–415
Kempson RL (1985) Pathology quality control in the cooperative clinical cancer trial programs.
Cancer Treat Rep 69(10):1207–1210
Kiel DP, Magaziner J, Zimmerman S, Ball L, Barton BA, Brown KM, Stone JP, Dewkett D,
Birge SJ (2007) Efficacy of a hip protector to prevent hip fracture in nursing home residents: the
HIP PRO randomized controlled trial. JAMA 298(4):413–422
Krockenberger K, Luntz SP, Knaup P (2008) Usage and usability of standard operating procedures
(SOPs) among the coordination centers for clinical trials (KKS). Methods Inf Med 47(6):
505–510
Kyriakides TC, Babiker A, Singer J, Piaseczny M, Russo J (2004) Study conduct, monitoring and
data management in a trinational trial: the OPTIMA model. Clin Trials 1(3):277–281
Larkin ME, Lorenzi GM, Bayless M, Cleary PA, Barnie A, Golden E, Hitt S, Genuth S (2012)
Evolution of the study coordinator role: the 28-year experience in Diabetes Control and
Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/
EDIC). Clin Trials 9(4):418–425
Larson GS, Carey C, Grarup J, Hudson F, Sachi K, Vjecha MJ, Gordin F (2016) Lessons learned:
infrastructure development and financial management for large, publicly funded, international
trials. Clin Trials 13(2):127–136
Lee JY, Hill A (2004) A multicenter lab sample tracking system. Clin Trials 2:252–253
Marcus P, Gareen IF, Doria-Rose P, Rosenbaum J, Clingan K, Brewer B, Miller AB (2012) Did
death certificates and a mortality review committee agree on lung cancer cause of death in The
National Lung Screening Trial? Clin Trials 9(4):464–465
Martin DE, Pan J-W, Martin JP, Beringer KC (2004) Pharmacy management for randomized
pharmacotherapy trials: the MATRIX web data management system. Clin Trials 1(2):248
McBride R, Singer SW (1995) Interim reports, participant closeout, and study archives. Control
Clin Trials 16(2 Suppl):137s–167s
McFadden E, Bashir S, Canham S, Darbyshire J, Davidson P, Day S, Emery S, Pater J, Rudkin S, Stead M, Brown J (2015) The impact of registration of clinical trials units: the UK experience.
Clin Trials 12(2):166–173
Meinert CL, Martin BK, McCaffrey LD, Breitner JC (2008) Do we need to adjudicate major clinical
events? Clin Trials 5(5):557; author reply 558
Melnyk H, Rosenfeld P, Glassman KS (2018) Participating in a multisite study exploring opera-
tional failures encountered by frontline nurses: lessons learned. J Nurs Adm 48(4):203–208
Mobley RY, Moy CS, Reynolds SM, Diener-West M, Newhouse MM, Kerman JS, Hawkins BS
(2004) Time trends in personnel certification and turnover in the Collaborative Ocular
Melanoma Study. Clin Trials 1(4):377–386
Moy CS, Albert DM, Diener-West M, McCaffrey LD, Scully RE, Willson JK (2001) Cause-specific
mortality coding: methods in the Collaborative Ocular Melanoma Study. COMS report no. 14.
Control Clin Trials 22(3):248–262
Mullin SM, Warwick S, Akers M, Beecher P, Helminger K, Moses B, Rigby PA, Taplin NE, Werner
W, Wettach R (1984) An acute intervention trial: the research nurse coordinator’s role. Control
Clin Trials 5(2):141–156
Nesbitt GS, Smye M, Sheridan B, Lappin TR, Trimble ER (2006) Integration of local and central
laboratory functions in a worldwide multicentre study: experience from the Hyperglycemia and
Adverse Pregnancy Outcome (HAPO) Study. Clin Trials 3(4):397–407
Peduzzi P, Hatch HT, Johnson G, Charboneau A, Pritchett J, Detre K (1987) Coordinating center
follow-up in the Veterans Administration Cooperative Study of Coronary Artery Bypass
Surgery. Control Clin Trials 8(3):190–201
Perkovic V, Patil V, Wei L, Lv J, Petersen M, Patel A (2012) Global randomized trials: the promise
of India and China. J Bone Joint Surg Am 94(Suppl 1E):92–96
Peterson M, Byrom B, Dowlman N, McEntegart D (2004) Optimizing clinical trial supply require-
ments: simulation of computer-controlled supply chain management. Clin Trials 1(4):399–412
Pogue J, Walter SD, Yusuf S (2009) Evaluating the benefit of event adjudication of cardiovascular
outcomes in large simple RCTs. Clin Trials 6(3):239–251
Price MO, Knight OJ, Benetz BA, Debanne SM, Verdier DD, Rosenwasser GO, Rosenwasser M,
Price FW Jr, Lass JH (2015) Randomized, prospective, single-masked clinical trial of endothe-
lial keratoplasty performance with 2 donor cornea 4 degrees C storage solutions and associated
chambers. Cornea 34(3):253–256
Prineas RJ, Blackburn H (1983) The Coronary Drug Project. Role and methods of the ECG Reading
Center. Control Clin Trials 4(4):389–407
Rautaharju PM, Broste SK, Prineas RJ, Eifler WJ, Crow RS, Furberg CD (1986) Quality control
procedures for the resting electrocardiogram in the Multiple Risk Factor Intervention Trial.
Control Clin Trials 7(3 Suppl):46s–65s
Rico-Villademoros F, Hernando T, Sanz JL, Lopez-Alonso A, Salamanca O, Camps C, Rosell R
(2004) The role of the clinical research coordinator – data manager – in oncology clinical trials.
BMC Med Res Methodol 4:6
Rogers A, Flynn RW, McDonnell P, Mackenzie IS, MacDonald TM (2016) A novel drug manage-
ment system in the Febuxostat versus Allopurinol Streamlined Trial: a description of a pharmacy
system designed to supply medications directly to patients within a prospective multicenter
randomised clinical trial. Clin Trials 13(6):665–670
Rosenfeld PJ, Dugel PU, Holz FG, Heier JS, Pearlman JA, Novack RL, Csaky KG, Koester JM,
Gregory JK, Kubota R (2018) Emixustat hydrochloride for geographic atrophy secondary
to age-related macular degeneration: a randomized clinical trial. Ophthalmology 125(10):
1556–1567
Rosier J, Demoen P (1990) Labeling of clinical trial samples: an overview of some regulatory
requirements. Drug Inf J 24:583–590
Sheintul M, Lun R, Cron-Fabio D, Krishtul R, Xue H, Lin K-H (2004) The challenges of designing
a clinical research laboratory database. Clin Trials 1(2):219–220
Shroyer ALW, Quin JA, Wagner TH, Carr BM, Collins JF, Almassi GH, Bishawi M, Grover FL,
Hattler B (2019) Off-pump versus on-pump impact: diabetic patient 5-year coronary artery
bypass clinical outcomes. Ann Thorac Surg 107(1):92–98
Siegel D, Milton RC (1989) Grading of images in a clinical trial. Stat Med 8(12):1433–1438
Sievert YA, Schakel SF, Buzzard IM (1989) Maintenance of a nutrient database for clinical trials.
Control Clin Trials 10(4):416–425
Strylewicz G, Doctor J (2010) Evaluation of an automated method to assist with error detection in
the ACCORD central laboratory. Clin Trials 7(4):380–389
Submacular Surgery Trials Research Group (2004) Clinical trial performance of community- vs
university-based practice in the Submacular Surgery Trials: SST report no. 2. Arch Ophthalmol
122:857–863
Sun JK, Jampol LM (2019) The Diabetic Retinopathy Clinical Research Network (DRCR.net) and
its contributions to the treatment of diabetic retinopathy. Ophthalmic Res 62:225–230
Taekman JM, Stafford-Smith M, Velazquez EJ, Wright MC, Phillips-Bute BG, Pfeffer MA,
Sellers MA, Pieper KS, Newman MF, Van de Werf F, Diaz R, Leimberger J, Califf RM
(2010) Departures from the protocol during conduct of a clinical trial: a pattern from the data
record consistent with a learning curve. Qual Saf Health Care 19(5):405–410
Toth CA, Decroos FC, Ying GS, Stinnett SS, Heydary CS, Burns R, Maguire M, Martin D, Jaffe GJ
(2015) Identification of fluid on optical coherence tomography by treating ophthalmologists
versus a reading center in the comparison of age-related macular degeneration treatments trials.
Retina 35(7):1303–1314
Turner G, Lisook AB, Delman DP (1987) FDA’s conduct, review, and evaluation of inspections of
clinical investigators. Drug Inf J 21(2):117–125
Wilkinson M (1994) Carrier requirements for laboratory samples. Drug Inf J 28:381–384
Williford WO, Krol WF, Bingham SF, Collins JF, Weiss DG (1995) The multicenter clinical trials
coordinating center statistician: more than a consultant. Am Stat 49(2):221–225
Qualifications of the Research Staff
8
Catherine A. Meldrum

Contents
Introduction
History of Clinical Research Staff
Staff Qualifications
Training
Credentialing Organizations in Clinical Research
Summary
Key Facts
Cross-References
References

Abstract
Through the use of clinical trials, the global research community has paved the way for new medical interventions and groundbreaking therapies for patients. In the past 60 years, we have embraced rigorous scientific standards for assessing and improving our therapeutic knowledge and practice. These scientific standards dictate that a research workforce possessing knowledge, skills, and abilities is crucial to the success of a study. The many challenging (and ever-changing) rules and regulations inherent in the clinical research arena also require that diligent, trained staff be available to conduct the study. While it is well known that the Principal Investigator is ultimately responsible for the conduct of the study, it is generally a team of individuals who conduct the daily operations of the study. This chapter discusses qualifications and training of other research staff members and why these are important for achieving success in clinical trials.

C. A. Meldrum (*)
University of Michigan, Ann Arbor, MI, USA
e-mail: [email protected]; [email protected]


Keywords
Clinical research staff · Qualifications · Education · Clinical research coordinator

Introduction

Development and subsequent licensing of new pharmaceutical agents and devices are highly regulated processes. Clinical trial protocols have become more complex. Regulations and guidelines for managing clinical research trials have also increased (Infectious Diseases Society of America 2009; NIH 2012). This translates into the need for research staff who are knowledgeable in research ethics and science and who have sufficient training and education in the areas where they work. For high-quality, large-scale clinical trials to be completed, the research staff involved should have the skills and knowledge to truly advance science. Study quality and integrity can be severely compromised if a
very inexperienced individual is managing the research study without assistance
from a more experienced individual. Meeting the needs of a research study means
that each staff member has the responsibility to understand and apply the ethical
tenets, principles, and regulatory requirements of Good Clinical Practice (GCP). That being said, the research community understands the benefits and challenges of finding a seasoned professional to hire.
Commonly in research, industry dictates that an individual have 2 years of professional experience upon hire, but how does one get that experience if no one hires them? Furthermore, although research studies require accuracy and diligence, it may be possible to hire novice staff alongside senior staff who provide oversight. Given the national shortage of clinical research staff for these positions, it is reasonable to consider staff who have completed core competencies and to be less stringent about experience (ACRP 2015). Organizations could develop training programs that include internal and external education and that meet the high expectations required in clinical research. Certainly this is a profession still under development, and we need to evaluate its history and future needs in the clinical research arena to best formalize and guide advancement of clinical research staff. Only then can it be assured that the protection and well-being of research participants are safeguarded and that research activities are conducted according to regulatory guidelines.

History of Clinical Research Staff

The Principal Investigator (PI) of a research study is the primary individual responsible for the overall research study, though we are keenly aware it takes a village to complete a successful research study. It is not unusual for many required study tasks to be delegated from the PI to members of the research team, but who are the members of the research team and, more importantly, how have they been trained? Do
the study members possess the experience and skill needed to assure compliance
with guidelines and regulations set forth by regulatory agencies and/or institutions?
Unlike many professions, the professional research staff role is still in its infancy in terms of development. While the nursing profession, widely recognized and licensed, flourished in 1860 when the first school of nursing was opened in Europe, the creation of research staff roles began less than 40 years ago. Due to the early-stage development of this career pathway, there has been less consensus on standardized job descriptions for this profession. There is no state licensure required for this type of work, nor is there an expectation that baseline experience is the same across institutions. The actual roles and titles of the research staff conducting a clinical trial are also widely varied. Titles may include Clinical Research Coordinator, Study Nurse, Clinical Research Nurse, Research Nurse Coordinator, Clinical Research Assistant, Study Coordinator, and many more.
In nursing, the use of research nurses in oncology trials was common, though their roles within the research project were not well defined. Many nurses worked with oncology patients clinically, though they lacked clear direction or training to work with patients in a research capacity. An oncology research nurse may have complemen-
tary functions and roles with oncology chemotherapy nurses, but there are unique
characteristics required of the research nurse that are not applicable to the oncology
chemotherapy nurse (Ocker and Pawlik Plank 2000). In 1982, the Oncology Nursing
Society sought to standardize job descriptions for oncology nurses who were
involved in research studies (Hubbard 1982). Subsequently, as clinical trials grew
in both numbers and complexity, there was a further push to define the scope of
practice in the role of the research nurse. In 2007, the Clinical Center Nursing at the
National Institute of Health launched an effort to help define Clinical Research
Nursing (CRN). Later, in 2009 the first professional organization, the International
Association of Clinical Research Nurses (IACRN) for research nurses was founded.
This organization is not specific to oncology nurses but supports the professional
development of nurses that specialize in any research domain thus, regardless of the
title for this occupation or the research domain, nurses engaged in clinical trials
develop, coordinate, and implement research and administrative strategies vital to
the successful management of research studies.
Though it was common to use nurses in the research enterprise in oncology, there
are many other areas outside of oncology that do not rely solely on a Clinical
Research Nurse/Research Nurse Coordinator. Many staff hired to complete a
research project are not trained as nurses. In fact, it is common to hire varied
ancillary personnel to get the research study done. Without industry job description standards, much of the initial workforce simply “fell into the role.” As noted earlier, it was quite common for an Investigator to utilize a nurse to assist in research. The nurse would begin to take on other responsibilities within research until the position grew into a dedicated research role. Eventually, other staff who may have worked in allied health or even in a clerical position began to take on additional duties, again eventually inheriting the role of study coordinator. Within the past decades there has been tremendous growth in clinical research and, with this, the need for more individuals conducting clinical research grew. No longer could just
on-the-job training be the answer, nor could pulling additional staff in at random times to assist in a research study be deemed sufficient, as those individuals may lack the proper training. A job title for this profession was clearly needed. This led
to several job titles and descriptions for research professionals as noted above but
probably most common are Study Coordinator or Clinical Research Coordinator
(CRC). Today many of the titles are interchangeable though for the purposes of this
chapter we will refer to the Research Professional as a Clinical Research Coordinator
(CRC).
A CRC does work under the direction of the Principal Investigator (PI); however,
the background of the CRC can be quite diverse. Traditionally, a high percentage of
CRCs did possess a nursing background (Davis et al. 2002; Spilsbury et al. 2008),
and given the nursing curriculum and their experience with patients, this could be
considered a well-suited career move for a nurse. Transitioning from a bedside nursing position to a CRC position, while not difficult, does require additional training, but the registered nurse (RN) already possesses some of the major key
attributes required in the CRC role. Certainly understanding medical terminology,
documentation skills, and good people skills help facilitate the transition. If an
individual has a medical background other than nursing, such as pharmacy, respira-
tory therapy, or another allied health field, they also understand medical terminology
and are likely skilled in working with patients. More recently, individuals with these
types of medical backgrounds are clearly more abundant in the research community
as they pursue advanced training to work in the field of research. Individuals who
would like to work as a CRC and have a nonmedical background may require even
more additional training to sufficiently work in the clinical research arena but this
can be achieved.
Although it is evident that many personnel are needed to effectively and efficiently carry out clinical research, the regulations require that the Principal Investigator (PI) ensure that all study staff are adequately trained and maintain up-to-date knowledge about the study (FDA 2000). Frequently, Co-Investigators (Co-I) are available to support the principal investigator in the management and leadership of the research project, but the actual day-to-day operations of the study are carried out by ancillary personnel other than a PI or Co-I. In most studies the PI (or a delegate) hires personnel to assist in carrying out the study, yet what really needs to be considered is: do the individuals hired have the appropriate training and skills necessary to fulfill the high demands of a clinical research study? This can be challenging, as many times it may not even be the PI doing the actual hiring of the candidate. The individual doing the hiring needs to be fully aware of the needs of the study and be able to assess whether the potential candidate can meet those needs.

Staff Qualifications

As with any job position you have to find the person who best suits your needs.
There are certainly times when an entry level candidate can assist in performing the
activities needed for the research study but usually a highly experienced staff
member is needed to oversee the study, especially if it is a complex study. The overall question facing the researcher (or PI) is: what type of individual do I need to complete the study successfully? Some broad topics that should be considered when hiring
staff are:

• Educational background (medical/science)
• Research experience
• Patient care experience (people oriented)
• Experience with databases and collection of data
• Communication skills
• Organization skills
• Detail oriented
• Flexibility

Hiring can also be complicated since the length of time a research study is
ongoing is quite varied; thus, the person you hire for that particular research study
may not be the same type of individual you need for the next research study. A PI
must consider the long term needs of their research enterprise and gauge what the
staff composition should look like to fulfill their goals.
For most health care professions, entry-level requirements include a focused didactic curriculum, usually from an academic institution, followed by some hands-on experience. This has not been the case with entry-level staff in clinical research. An individual cannot pursue this as a degree at an academic institution, as no academic institution has entry-level programs in this discipline.
There is also no license mandated to practice in clinical research, as is required in other medical disciplines, though research staff may obtain professional certification credentials within the field. Frequently in clinical research, entry-level people perform such tasks as data entry, data management, and basic patient care tasks such as
taking a blood pressure, measuring height and weight, or performing a manual count
for returned medications. As one achieves more experience, they may gradually
move up the clinical ladder with added responsibilities and commonly take on the
title of Clinical Research Coordinator (or some facsimile of this).
CRCs work in a variety of settings such as private and public institutions, device
pharmaceutical and biotechnology companies, private practice, Clinical Research
Organizations (CRO), Site Management Organizations (SMO), and varied indepen-
dent organizations involved in clinical research. Being a CRC is a multifaceted role
with many responsibilities. They are really at the center of the research enterprise
with multiple roles. Previous work has demonstrated that one research coordinator
may be responsible for between 78 and 128 different activities (Papke 1996). With
the growth in clinical research studies, there are still many more activities in the
future that may be required of the research coordinator. The additional complexities
of research ethics that have evolved over time and regulatory and economic pres-
sures that continue to mount create the need for a skilled individual in this role (NIH
2006). Though the duties of clinical research staff vary from each institution, they
likely include some or all of the following:

• Recruitment and enrollment of participants
• Protocol development
• Protocol implementation
• Assurance of participant safety
• Development of informed consent documents
• Development of case report forms
• Development of research budgets
• IRB submission
• Maintenance of drug accountability
• Accurate data collection
• Accurate data entry
• Data monitoring/data analysis
• Staff education

While the above list is not an exhaustive list of duties, it illuminates the need for
clinical research staff to have expert clinical skills and well-developed thinking
skills. To achieve the best possible outcomes for research participants and the overall
research process, they must be well versed in the regulatory, ethical and scientific
domains of clinical research. Thus, it is crucial that PI assure the study employ at
least one highly skilled individual to conduct a clinical trial. This leads us to discuss
“How does staff obtain training?”

Training

In traditional university settings, there are no adequate curricula in any discipline that prepare a student to graduate with a bachelor's degree and work in clinical research, though in the United States there are many clinical research administration programs that provide either a postgraduate certificate or a master's degree. Online
courses and online curricula are also available, but these can be costly. Conference symposiums are offered around the country, though these types of courses usually
provide a specialized topic (such as informed consent documentation, clinical trial
initiation, FDA forms and procedures, regulatory documents and binder mainte-
nance, source documents, study initiation and close-out visits, compliance and
retention, drug compliance/storage/documentation, IRB submissions and HIPAA,
adverse events and safety monitoring, quality assurance audits and monitor visits
and preparing for FDA audits). Though the above may seem fairly thorough, these
topics are not done in enough detail to provide adequate comprehensive training for
a novice staff member so generally attending a one-day conference symposium is not
considered sufficient training for entry level staff. These types of conference sym-
posiums are mean to be adjunct materials for a research coordinator. On-the-job
training and mentoring from more experienced staff are also training methods for
that are used by many employers.
Training for a particular research study is generally done by the Sponsor. Once training is complete, the individual signs a Delegation of Authority Log (DOA). This
log serves to ensure that the research staff member performing study-related tasks/
procedures has been appropriately trained and authorized by the investigator to
perform such tasks. Although a delegation log is not federally mandated, it may be
a Sponsor requirement and must be completed and maintained throughout the trial.
ICH GCP guidance (E6 4.1.5) requires that an investigator maintain a list of qualified staff to whom the investigator has delegated study-related activities.
In 2012, a Clinical & Translational Science Awards Program (CTSA) taskforce
found insufficient training and lack of support among CRCs employed at Clinical and Translational Science Institutes (CTSIs) (Speicher et al. 2012). They also observed low job satisfaction within this field. Recognizing the evolving demands of the
clinical research enterprise across the nation, the Task Force’s study reiterated the
need for support and educational development while recognizing that there are
insufficient numbers of adequately trained and educated staff for these roles.
One year after the CTSA study, Clinical Research was formally accepted as a profession by the Commission on Accreditation of Allied Health Education Programs (CAAHEP), though at the time of this writing the occupational description, job description, employment characteristics, and educational programs available for the role are still not available on the CAAHEP website (CAAHEP 1994).
Increasing clinical research studies and newer technologies create a demand for an even newer skillset in the research workforce, and this is where professional
competencies come into play. In 2014, the Joint Task Force (JTF) for Clinical Trial
Competency published a landmark piece on defining the standards for professional-
ism in the research industry (Sonstein et al. 2014). This universal Core Competency
Framework has undergone three revisions with the most recent publication in
October 2018 (Sonstein et al. 2018). It now incorporates three levels (Fundamental, Skilled, and Advanced), so that roles, assessments, and knowledge can be evaluated within the eight domains in a more standardized way. The domains include: Scientific Concepts
and Research Design, Ethical and Participant Safety Considerations, Investigational
Products Development and Regulation, Clinical Study Operations, Study and Site
Management, Data Management and Informatics, Leadership and Professionalism,
and Communication and Teamwork. A diagram of the framework is provided below.

Fig. 1 JTF core competency domains. (JTF, Joint Task Force for Clinical Trial Competency; Sonstein and Jones 2018)

Credentialing Organizations in Clinical Research

To date, there are two organizations that credential research staff: the Association of Clinical Research Professionals (ACRP) and the Society of Clinical Research Associates (SOCRA). Both of these organizations are international in scope. SOCRA currently has chapters in six countries outside the United States (Belgium, Brazil, Canada, Nigeria, Poland, Saudi Arabia), and certification testing can be done at PSI testing centers throughout the world. ACRP is located in more than 70 countries with about 600 testing centers available internationally.
Credentialing is achieved by way of an examination through both organizations. Interestingly, both organizations require at least 2 years of documented clinical research experience to take the examination (Association of Clinical Research Professionals 2018; Society of Clinical Research Associates 2018). The
oldest organization is the Association of Clinical Research Professionals (ACRP). Founded in 1976, it has over 32,000 certified clinical researchers. This organization has several certification programs available for those employed in the clinical research workforce. These include: Certified Clinical Research Associate (CCRA), Certified Clinical Research Coordinator (CCRC), Certified Principal Investigator (CPI), and Professional certification (ACRP-CP). A CRA works on behalf of the sponsor and is not involved in obtaining, changing, or manipulating research data. A CRA's duties largely revolve around independent monitoring of the data; thus, they are technically not part of the day-to-day operations of a clinical research study. A CRC is the individual largely tasked with the day-to-day operations of the study under the direction of the PI. Those seeking ACRP-CP certification are involved in planning, conducting, and overseeing the overall study.

To maintain certification, a CCRC must submit an application and pay a recertification fee. The applicant must demonstrate competency by retaking the
certification exam every 2 years or have accumulated 24 contact hours in continuing
education activities. At least 12 of the continuing education activities must be
research related. The remaining educational credit activities are obtained through
Continuing Involvement or Continuing Education activities of the applicants’
choice. This demonstrates continued understanding of new knowledge relevant to
clinical research study conduct.
SOCRA was incorporated in 1991 with a strong focus on providing education and
credentialing for oncology coordinators. Through its growth it has emerged into a
leading research organization providing education opportunities for clinical research
staff in all therapeutic areas supporting government, industry, and academic institu-
tions. Similar to ACRP the background of the research staff is varied and may
include nursing, pharmacy, biology, teaching, medical technology, business admin-
istration, and other areas. Eligibility for taking the certification exam requires the
applicant be working with Good Clinical Practice (GCP) guidelines with protocols
that have been approved by either an Institutional Review Board (IRB), Institutional
Ethics Committee (IEC), or a Research Ethics Board (REB). Additionally, the
applicant must quality under one of three categories with a combination of work
and/or educational experience. Upon successfully completion of the examination,
the individual may use the title of Certified Clinical Research Professional (CCRP).
This designation represents the individual has understanding, knowledge, and con-
duct application of clinical research that involves human subjects according to
International Conference on Harmonization Guideline for Good Clinical Practice
(E6) (ICH/GCP), ICH Clinical Safety Data Management: Definitions and Standards
for Expedited Reporting (E2A), the United States Code of Federal Regulations
(CFR) and the ethical principles of the Nuremberg Code, the Belmont Report. Duties
may include: data collection, preparation of reports, protocol development, devel-
opment or monitoring of case report forms, development of informed consent
documents, protection of subject and subjects’ rights, and reporting of adverse
events throughout the study.
Certification is maintained by completion of 45 continuing education credits
within a 3-year time period or retaking the examination. A minimum of 22 of the
credits must be related to clinical research policies, regulations, etc. The remaining continuing education credits are generally related to one's therapeutic area. Similar to ACRP, an application and a fee for recertification are required.
Competency guidelines for Clinical Research Coordinators were recently devel-
oped by the ACRP (ACRP 2015). These guidelines were developed with input from over forty organizations, including pharmaceutical companies, academic medical institutions, medical research corporations, and a variety of foundations. The guidelines are
intended to serve many purposes such as standardize CRC performance, develop
competency-based job descriptions, enhance CRC retention, increase CRC recruit-
ment, and professional development and improve clinical trial quality.
Both ACRP and SOCRA are overwhelmingly accepted by industry as professional organizations assisting in establishing professional identities for clinical research staff.

Summary

While there is still no standard education level for a position in clinical research, it is often associated with a bachelor's degree and some type of clinical trial research experience. Increased technology has essentially pushed the clinical research enterprise to expect that clinical research staff have a higher skillset than in previous decades. Certainly, having a trained workforce in clinical research ultimately impacts the integrity of clinical research. The research field has come a long way in education and job description roles for research staff, but there is still a long road ahead. The field is evolving toward somewhat more standardized job descriptions, even with the historical lack of clear, consistent definitions for research staff titles and their responsibilities. This evolution should allow transparency that standardizes job classifications and education expectations while providing new opportunities for advancement in the clinical research profession. This will ultimately allow the profession to mature, increasing workforce development and job satisfaction within the profession. If universal competencies, certifications, and accepted job descriptions are not adopted, the matter may end up in the hands of a government body that would license this professional group.

Key Facts

1. Historically, education and training of clinical research staff have been highly variable.
2. Responsibilities and duties of the clinical research staff have increased due to
increasing regulatory demands.
3. SOCRA and ACRP are international organizations that have assisted in the
creation of guidance for more standardized job descriptions within the field of
research.

Cross-References

▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Implementing the Trial Protocol
▶ Investigator Responsibilities
▶ Selection of Study Centers and Investigators

References
ACRP (2015) A New Approach to Developing the CRA Workforce. Retrieved January 30, 2019
from https://www.acrpnet.org/resources/new-approach-developing-cra-workforce/
Association of Clinical Research Professionals (2018) CRC Certification. Available at: https://
www.acrpnet.org/certifications/crc-certification/
CAAHEP (1994) Consortium of academic programs in clinical research. https://www.caahep.org/Students/Program-Info/Clinical-Research-Professional.aspx. Retrieved January 31, 2019
Davis AM, Hull SC, Grady C, Wilfond BS, Henderson GE (2002) The invisible hand in clinical
research: the study coordinator’s critical role in human subjects protection. J Law Med Ethics
30:411–419
Hubbard SM (1982) Cancer treatment research: the role of the nurse in clinical trials of cancer
therapy. Nurs Clin N Am 17(4):763–781
Infectious Diseases Society of America (2009) Grinding to a halt: the effects of the increasing
regulatory burden on research and quality improvement efforts. Clin Infect Dis 49:328–335
National Institutes of Health (2006) Regulations and ethical guidelines. Retrieved January 17, 2019
from https://grants.nih.gov/policy/humansubjects.htm
NIH Policies and Procedures for Promoting Scientific Integrity (2012) Retrieved January 31, 2019 from https://www.nih.gov/sites/default/files/about-nih/nih-director/testimonies/nih-policies-procedures-promoting-scientific-integrity-2012
Ocker BM, Pawlik Plank D (2000) The research nurse role in a clinic-based oncology research
setting. Cancer Nurs 23(4):286–292
Papke A (1996) The ACW national job analysis of the clinical research coordinator. Monitor 46:45–
53
Society of Clinical Research Associates (2018) Certification program overview. Available at:
https://www.socra.org/certification/certification-programoverview/introduction/
Sonstein SA, Jones CT (2018) Joint task force for clinical trial competency and clinical research professional workforce development. Front Pharmacol. https://doi.org/10.3389/fphar.2018.01148
Sonstein SA, Seltzer J, Li R, Jones CT, Silva H, Daemen E (2014) Moving from compliance to
competency: a harmonized core competency framework for the clinical research professional.
Clin Res 28(3):17–12
Sonstein SA, Namenek Brouwer RJ, Gluck W, Kolb HR, Aldinger C, Bierer BE, Jones CT (2018)
Leveling the joint task force core competencies for clinical research professionals. Ther Innov
Regul Sci 216847901879929
Speicher LA, Fromell G, Avery S, Brassil D, Carlson L, Stevens E, Toms M (2012) The critical
need for academic health centers to assess the training, support, and career development
requirements of clinical research coordinators: recommendations from the clinical and transla-
tional science award research coordinator taskforce. Clin Transl Sci 5:470–475
Spilsbury K, Petherick E, Cullum N, Nelson A, Nixon J, Mason S (2008) The role and potential
contribution of clinical research nurses to clinical trials. J Clin Nurs 17(4):549–557
United States Food and Drug Administration (2000) Retrieved February 5, 2019, from http://www.fda.gov
9 Multicenter and Network Trials
Sheriza Baksh
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

Contents
Introduction
Multicenter Clinical Trials
  Formation
  Trial Leadership
  Design Considerations
  Coordination of Study Activities and Logistics Between Sites
Clinical Trial Networks
Summary and Conclusions
Key Facts
Cross-References
References

Abstract
Multicenter clinical trial designs offer a unique opportunity to leverage the
diversity of patient populations in multiple geographic locations, share the burden
of resource acquisition, and collaborate in the development of research questions
and approaches. In a time of increasing globalization and rapid technological
advancement, investigators are better able to conduct such projects seamlessly,
benefiting investigators, sponsors, and patient populations. Regulatory agencies
have embraced this shift towards the use of multicenter clinical trials in product
development and have issued statements and guidance documents promoting
their utility and offering best practices. Some governmental health agencies
have even formed clinical trial networks to facilitate the use of multicenter
clinical trials to answer a broad range of clinical questions related to a disease
or disease area. This chapter will cover design considerations, data coordination,
regulatory requirements, and study monitoring of multicenter clinical trials, as well as how they can be conducted within a clinical trial network.

Keywords
Multicenter clinical trials · Clinical trial networks · Trial consortiums ·
Cooperative group clinical trials

Introduction

The conduct of clinical trials sometimes requires multiple clinical sites in order to
complete studies in a timely manner and maximize the external generalizability of
trial results. Increased globalization and inherent improvements in global coordina-
tion of data and research activities have made this study design the preferred option
when large study populations, generalizable study results, and fast turnaround are the
primary goals. Multicenter clinical trials allow for the streamlining of trial resources,
collaborative consensus for research decisions, greater precision in study results,
increased generalizability and external validity, and a wider range of population
characteristics. Multicenter clinical trials conducted in various regions of the world
may also bring clinical care options that would otherwise not be available to study
participants in lower- and middle-income countries. There are special considerations
for multicenter clinical trials that must be incorporated into protocols, statistical
analysis plans, and data monitoring plans. Additionally, data collection, data man-
agement, and treatment guidance must be coordinated across study sites to comply
with local standards.
Multicenter clinical trial designs have been in use for several decades, necessi-
tating additional guidance on trial conduct from national and multinational regula-
tory and funding agencies. Many jurisdictions, such as Brazil, China, the European
Union, and the United States, use consensus documents, such as those produced by the
International Conference on Harmonisation (ICH), as the basis for their own guidance
on the conduct of clinical trials. For instance, the South African Good Clinical
Practice Guidelines draw on source documents developed by ICH, the Council for
International Organisations of Medical Sciences, the World Medical Association, and
UNAIDS, but heavily emphasize the importance of incorporating the local South African
context into the design of multicenter clinical trials conducted in South Africa
(Department of Health 2006). In the United States, since the passage of the
Kefauver-Harris Amendments in 1962, multicenter clinical trials conducted in
foreign countries have been used for regulatory submissions. However, the United
States Food and Drug Administration (United States Food and Drug Administration
2006a, b, 2013) and the United States National Institutes of Health (NIH) (National
Institutes of Health 2017) have only recently developed guidelines for trialists who
are either submitting multicenter clinical trial data for regulatory approval or being
funded for a multicenter clinical trial through the federal government. Each of the
Institutes of the NIH have also developed specific guidelines for multicenter clinical
trial grants under their purview. These guidelines address the nuances of trial
conduct, coordination, data analysis, and ethical considerations across many trial
designs conducted in a multicenter setting. These guidelines are also heavily based
on those developed by ICH. Other countries implementing and tailoring ICH
guidelines include Brazil, Singapore, Canada, Korea, and others.
The ICH E17 Guideline for multi-regional clinical trials outlines principles for the
conduct of multicenter clinical trials intended for submission to multiple regulatory
agencies (The International Conference on Harmonisation of Technical Requirements for
Registration of Pharmaceuticals for Human Use 2017). The document discusses important
considerations in study design, such as regional variability in the measured treatment effect, choice of study population, dosing and comparators, allowable con-
comitant medications, and statistical analysis planning. Additionally, E17 highlights
the benefit of incorporating multi-regional clinical trials into the global product
development plan to decrease the need for replication in various regions for each
submission. Study investigators should consider the regulatory requirements of
different regions, outcome definitions, treatment allocation strategies, and subpopu-
lations of interest when designing multicenter clinical trials across different regions.
This may require consultation with multiple regulatory agencies in the design of the
trial. Safety reporting should also conform to the local requirements for all study
sites. By coordinating study activities to meet the regulatory requirements in differ-
ent regions, sponsors can efficiently leverage study results for timely reviews of
investigational products.
The ICH has created additional guidance on the potential impact of ethnic
differences across study sites that should be considered when conducting multicenter
clinical trials in different countries (The International Conference on Harmonisation
of Technical Requirements for Registration of Pharmaceuticals for Human Use
1998). The E5 Guideline discusses intrinsic and extrinsic factors that have the
potential to modify the association between treatment and safety, efficacy, and
dosing. Characterization of treatment effect may differ based on factors such as
genetic polymorphism, receptor sensitivity, socioeconomic factors, or study end-
points. The clinical data package should have sufficient documentation of pharma-
cokinetics, pharmacodynamics, safety, and efficacy in the study population for each
region in which the trial will be submitted for regulatory consideration. In the
absence of that, additional bridging data that assesses the sensitivity of the treatment
effect to specific ethnic factors unique to the target population in a particular region
can help regulators extrapolate from study results accordingly. While the ICH E5
Guideline was developed for multicenter clinical trials in an international context, it
can be applied to any multicenter clinical trial with heterogeneity in study
populations across clinical sites.
Multicenter clinical trials can also be conducted via clinical trial networks. This
mechanism allows for collaborative research and the alignment of research priorities
among a core group of investigators. Networks are typically organized around a
specific disease area and consist of investigators with common research initiatives.
Clinical trial networks can serve to advance innovation in research methodologies
for a particular clinical area, accelerate translational research, facilitate a measured
approach to researching multiple questions surrounding a poorly understood disease
or condition, and leverage existing patient populations for larger clinical trials in rare
diseases.
This chapter highlights the nuances of conducting a multicenter clinical trial, in
contrast to a single-center trial, and describes how such trials can be conducted
within a clinical trial network.

Multicenter Clinical Trials

Formation

This chapter will delve into the conduct of multicenter clinical trials within the
United States, with selected comparisons to contexts in other countries. The exam-
ples presented here are typical of an NIH-funded, investigator-initiated, multicenter
clinical trial. As there are other models for multicenter clinical trials such as industry
trials for regulatory approval, this chapter will highlight other notable design aspects
when applicable. For the purposes of this chapter, the principal investigator (PI) is
the lead clinical scientist who has received funds from government or private entities
for the conduct of a multicenter clinical trial. The funder is the provider of financial
support for the clinical trial. The sponsor is the responsible party for the clinical trial
and may or may not be the same party as the funder.
Multicenter clinical trials are conducted under a single protocol and use multiple
clinical sites in different geographical locations to recruit participants to answer
a specific clinical question. The clinical site principal investigators typically
contribute to the study leadership and collaborate in the development and refinement
of the study question and design. Communication among the sites, and between the
sites and the study PI, is typically coordinated by a data coordinating center and
directed by the PI.
Multicenter clinical trials understandably require a great deal of coordination, both
administratively and functionally. The choice of which clinical sites to include in a
clinical trial can begin during the study planning phase and be finalized after the
start of the trial. During this time, the PI may invite clinical sites to
apply to join the multicenter clinical trial. PIs can choose to invite individual clinical
trial sites within their professional networks, existing clinical trial networks, or sites
identified through clinical trial site directories or trial registries. During the applica-
tion process, potential clinical sites are asked to list their proposed study team,
clinical site resources, confirmation of ability to conduct clinical research activities,
and their ability to coordinate ethical approvals across sites. These applications can
be accompanied by a site assessment visit, where site monitors can visit the potential
clinical site to inspect the facilities and capabilities of the applicant site. These visits
can be useful in the design phase of the study, when the site monitors consider which
research activities may or may not be feasible for each clinical site and what tasks
might better be completed centrally. This is also an opportunity for the site monitors
to ascertain the types of patients a clinic typically receives, what proportion would
meet study eligibility criteria, and discuss appropriate recruitment goals with the
potential clinical site investigator. Site monitors may also use the site visit as an
opportunity to conduct a risk assessment of the site’s ability to complete study
recruitment and quality goals.
Once the PI and the clinical site decide to pursue collaboration on the clinical trial,
they enter into a contract that outlines the rights, roles, and responsibilities of each
party. The contract may also address payment schedules to the sites, any resource
transfers to or sharing with the individual sites, data storage and security liability,
and event reporting responsibilities. If there are specimen collections in the clinical
trial, the contract might also specify who holds ownership for those specimens and
material transfer agreement details.
Earlier collaboration and consensus between the clinical sites and the PI in the
development of the study and investment by the clinical sites in the clinical trial are
two benefits to engaging and onboarding potential sites earlier in the planning
phase. The Wrist and Radius Injury Surgical Trial (WRIST) group highlighted
three techniques they employed in consensus building during their trial planning
phase: focus group discussion, nominal group technique, and the Delphi method
(Chung et al. 2010; Van De Ven and Delbecq 1974). Each of these methods requires a
different level of structure in reaching consensus. The PI for the clinical trial must
assess whether the clinical site investigators have existing relationships
with each other and whether some voices might carry more weight than
others. For example, in a focus group discussion on increasing study recruitment,
dominant voices may decrease democratic decision-making through the discus-
sion, and less vocal investigators may have fewer opportunities to voice their ideas,
resulting in a net loss to innovation in problem-solving. Pre-existing relationships
may lend themselves to an established group dynamic that may or may not
accommodate the addition of new voices. The nominal group technique is better
suited for this type of situation in that it requires participation from all members
(Van de Ven and Delbecq 1972). In the study recruitment example, by offering
everyone an opportunity to share their ideas, innovation around recruitment strat-
egies can be readily shared and amplified through voting by others in the group. It
can be difficult to implement, however, since it involves face-to-face meetings to prioritize and vote on decisions. Additionally, PIs should consider whether
each investigator has an equal opportunity to voice his/her opinion in the course of
designing and implementing the clinical trial. This includes ensuring that each
investigator has his/her research interests considered for incorporation into the
study objectives and is given an equal chance at authorship for manuscripts
resulting from the trial. If the investigator group consists of researchers with
varying levels of experience and seniority, the ideas of those more junior may be
lost in the conversation. In this scenario, the Delphi method might be a more
appropriate means of reaching consensus, as this method uses anonymity to
minimize the effect of dominant voices (Dalkey 1969). Using the recruitment
strategy example, this technique might further allow for the amplification of
novel strategies, regardless of who presents the idea.

Trial Leadership

In contrast to a single-site clinical trial, where study leadership primarily consists
of the PI and a lead coordinator, sponsors may ask PIs to establish a steering
committee for larger multicenter clinical trials (Trial Governance 2015). As the name
suggests, the steering committee steers the direction of the day-to-day activities and
priorities of clinical and data coordinating centers (Daykin et al. 2016). The steering
committee typically consists of the PI, the head of the data coordinating center,
senior investigators from each clinical site, study statistician, and independent
researchers who advise the PI throughout the study. This is one of many possible
configurations of a steering committee; however, their key responsibility is to vote
on major study decisions regarding design and analysis issues, study procedures,
data sharing, allocation of study resources, and priorities for meeting competing
demands of the study, should they arise. Of note, members of the steering committee
may not all be voting members. The committee monitors the progress of the study,
considers outside research that could affect the interpretation of the trial results and
the appropriateness of the study design and analysis, and communicates study
progress to interested parties. The steering committee is usually blinded to treatment
assignment but privy to the recommendations of the data safety monitoring
board/committee so that it can effectively direct study activities.
In situations where major study decisions must be made on a schedule that makes
it difficult to convene the full steering committee, a smaller executive committee
may meet to resolve routine issues that arise. The executive committee may consist
of a handful of investigators, such as the study chair (usually the study PI), vice chair,
director of the data coordinating center, and key study personnel to communicate and
implement the decisions made. This smaller leadership group is tasked with resolving
day-to-day issues in study conduct, preparing policies and proposals for steering
committee review, addressing higher-level administrative issues, developing plans to
correct deficiencies in study conduct, reviewing publications and presentations of
study findings for steering committee approval, and reviewing any proposed ancillary
studies before approval by the steering committee. The executive committee may
meet more frequently than the steering committee to address such issues in a timely
manner and prepare for larger group review during the steering committee meetings.
They may also work directly with other study team members to execute their
assigned tasks.
There are several managerial and administrative duties that are often delegated in
a multicenter clinical trial. Study leadership may decide to set up a chairman’s office
to undertake some of these tasks, consisting of the PI, study account manager,
administrative coordinator, and potentially a clinical coordinating center. They
carry the responsibility of distributing funds to the various clinics, labs, specimen
repositories, and data coordinating center. They may also manage the contracts with
each of these entities. The chairman’s office may meet with the data coordinating
center in executive committee meetings to coordinate start-up activities, data collec-
tion procedures, and other study logistics. Scheduling steering committee, executive
committee, and data safety monitoring board/committee meetings for the study are
additional responsibilities of the chairman’s office. If the chairman’s office also
handles clinical coordinating activities, they may arrange for training and research
group meetings to ensure that all study personnel are kept abreast of changes to study
procedures and policies, as well as to address any issues in the day-to-day activities
of the study. The chairman’s office may also be in charge of communicating study-
wide changes to procedure, new recruitment initiatives, and concerns about study
progress.
There may be situations where some or all of these clinical coordinating duties are
delegated to the data coordinating center, depending on available resources, and
pre-determined study roles and responsibilities. Data coordinating centers also play
an integral role in disseminating information about data collection procedures, data
standardization, study monitoring, and data audits. The data coordinating center may
be heavily involved in collecting details about study adverse events and protocol
deviations for dissemination to sites for local reporting in addition to central
reporting to the single institutional review board (sIRB). They may also conduct
site visits to assess site performance and conduct quality checks. Reports from these
visits are usually shared with the steering committee. These reports can serve to
highlight unique approaches to study hurdles as well as identify areas for improve-
ment. By sharing these reports with the entire steering committee, local site PIs can
solicit feedback for improvement and share their successes with the group during the
discussions of the reports. Finally, the data coordinating center monitors and facil-
itates timely data entry for the purpose of required reporting to the data safety
monitoring board/committee, sIRB, study leadership, and study sponsor.

Design Considerations

There are a few design considerations unique to multicenter clinical trials. First,
when a study statistician develops the randomization scheme for a multicenter
clinical trial, he/she typically stratifies the randomization by clinical site. This
reduces potential bias due to measured and unmeasured differences across clinical
sites. By stratifying on clinical site, the investigators account for the potential
interaction between clinical site and the primary outcome measures (Senn
1998; Zelen 1974). Stratification by clinical site maximizes the probability of
balanced numbers of participants receiving each treatment arm in the study. Without
this balance, there is potential for bias if one clinical site experiences different
outcomes on average than other clinical sites. One can imagine a situation where
clinical sites might have catchment areas with different socioeconomic statuses,
patient demographics, and clinical characteristics. All of these could potentially
affect baseline risk for the primary outcome in the study population at each site.
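To make the stratification mechanics concrete, the sketch below implements permuted-block randomization within site strata. It is an illustration only, not a scheme from any trial or reference cited in this chapter; the site names, block size, arm labels, and seed are hypothetical, and in practice the randomization list would be generated and held centrally by the data coordinating center.

```python
import random

def stratified_block_randomization(sites, n_per_site, block_size=4,
                                   arms=("A", "B"), seed=2023):
    """Build a per-site assignment list using permuted blocks.

    Stratifying by site keeps the arms near-balanced within every
    clinical site (balance is exact at the end of each completed
    block). All names and parameters here are hypothetical.
    """
    assert block_size % len(arms) == 0, "block size must divide evenly among arms"
    rng = random.Random(seed)
    schedule = {}
    for site in sites:
        assignments = []
        while len(assignments) < n_per_site:
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)  # permute treatment order within the block
            assignments.extend(block)
        schedule[site] = assignments[:n_per_site]
    return schedule

if __name__ == "__main__":
    lists = stratified_block_randomization(["Site01", "Site02"], n_per_site=12)
    for site, seq in lists.items():
        print(site, "A:", seq.count("A"), "B:", seq.count("B"))  # 6 and 6 per site
```

Because each site draws from its own sequence of permuted blocks, an imbalance at one site cannot propagate to the overall allocation, which is precisely the protection against site-level differences described above.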
The second point of consideration is the target number of randomizations for each
site. After a site has agreed to participate, the data coordinating center typically
establishes recruitment goals for each site participating in the study to ensure that the
overall recruitment goal for the study is met. These site-specific recruitment goals
should consider what the clinical capacity is at each site, length of study visits and
contacts, full-time equivalents dedicated to the study at each site, recruitment goals
of other sites in the study, the timeline for completion of study recruitment, and the
flexibility around adding additional sites to the study. Different recruitment goals
across sites should not bias results or result in less precision, especially if random-
ization is stratified by clinical site (Senn 1998). Recruitment goals at each site
should be roughly similar, with allowances for faster and slower recruitment across
sites; the goals need not be strictly uniform, nor should they be driven by the
fastest or slowest sites.
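As a worked illustration of the arithmetic only (the chapter prescribes no formula, and the capacities below are invented), one simple way to translate an overall recruitment target into site-specific goals is to allocate in proportion to each site's assessed enrollment capacity and then adjust for rounding:

```python
def site_recruitment_goals(total_target, capacity):
    """Split an overall recruitment target across sites in proportion
    to assessed enrollment capacity (all numbers are hypothetical).

    Leftover slots from rounding go to the sites with the largest
    fractional shares so the goals sum exactly to the target.
    """
    total_capacity = sum(capacity.values())
    raw = {s: total_target * c / total_capacity for s, c in capacity.items()}
    goals = {s: int(share) for s, share in raw.items()}
    leftover = total_target - sum(goals.values())
    for s in sorted(raw, key=lambda s: raw[s] - goals[s], reverse=True)[:leftover]:
        goals[s] += 1
    return goals

# Hypothetical example: 300 participants across three sites whose
# assessed capacities are 6, 5, and 4 enrollments per month.
print(site_recruitment_goals(300, {"Site01": 6, "Site02": 5, "Site03": 4}))
# {'Site01': 120, 'Site02': 100, 'Site03': 80}
```

In practice, proportional targets like these would then be tempered by the other considerations listed above, such as visit length, dedicated staffing, and the overall recruitment timeline, rather than being set by capacity alone.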
Third, multicenter clinical trial designs can benefit from built-in flexibility for
adding sites at any point during the trial. To facilitate this process, PIs should
seriously consider the burden both on the data coordinating center and the prospec-
tive sites of the site start-up procedures. While multicenter trials inherently increase
power, the benefits of adding a site as a collaborator should outweigh the adminis-
trative and logistical hurdles of doing so. Having a start-up package of forms for
clinical sites to complete, a mini handbook of start-up activities, and a start-up
training can ease this burden and allow for transparency.

Coordination of Study Activities and Logistics Between Sites

Sites involved in a multicenter clinical trial should all have the resources or capacity
to acquire the resources necessary for the execution of all study procedures. This
includes key personnel, study materials, and regulatory infrastructure (if applicable).
PIs should bear in mind these potential limitations at each site as they design the
study. In extreme cases, this may mean that study budgets may have to account for
infrastructure support to ensure that each site has the minimum required resources to
conduct study activities.
Through the course of a multicenter clinical trial, the data coordinating center
works to ensure uniformity in study procedures, data collection, and adverse event
and protocol deviation reporting across sites. To accomplish this, they coordinate a
number of study logistics in an orchestrated manner. This begins when a site is
chosen to join the clinical trial and has agreed to participate. For example, in the
United States, all clinical sites are asked to join an sIRB designated for the multi-
center clinical trial. As of 2016, all National Institutes of Health (NIH) sponsored
multicenter clinical trials are required to use an sIRB of record for their ethical
review (National Institutes of Health 2016). This move was intended to streamline
the review of studies, promote consistency of reviews, and alleviate some of the
burdens to investigators (Ervin et al. 2016). There are situations when a site may be
unable to join an sIRB (e.g., a foreign jurisdiction or highly restrictive local
regulations). If a site agrees to rely on the sIRB for the study, it must complete a reliance
agreement documenting this arrangement between the sIRB of record and their site.
The letter of indemnification corresponding to this reliance outlines the scope of
reliance, claims, and governing laws. Australian regulatory agencies have endorsed a
similar approach through the National Mutual Acceptance (NMA) system. Through
this agreement, health departments across Australian states and territories agree to
recognize the ethical reviews conducted in member states for multicenter trials.
Similarly, the government of Ontario, Canada has supported Clinical Trials Ontario
to streamline the ethical review of study protocols across the province. While a
single ethical review may not be possible for protocols for all multicenter clinical
trials, streamlining these activities when possible has been endorsed by sponsors and
regulatory agencies.
After a site has received ethical approval from either the study’s sIRB or their
local institutional review board, then the data coordinating center can work with the
clinic staff to prepare for initiating the study at their site. The data coordinating
center may hold a training session for all certified clinic staff to orient them with the
study protocol and data entry system. In preparation for this training session, clinic
staff may be asked to review the protocol as well as any handbooks (e.g., a manual of
procedures or standard operating procedures). This smaller training session during
the onboarding process is an opportune time for clinic staff to clarify any technical
issues with the protocol or identify any difficulties with data entry. By conducting
this training with every site, the data coordinating center reinforces uniformity in
study activities across sites. The coordinators introduce the data collection
instruments at this time and familiarize clinic staff with the formatting requirements
of the data as well as any nuances of the data system. Data collection instruments are
standardized across the entire study and do not differ between clinical sites; however,
sites are allowed to maintain their own local records of study participants. As the
study proceeds, the data and/or clinical coordinating center may hold regular tele-
conferences or webinars with clinic staff to communicate important study changes,
assess and triage any challenges, and solicit feedback from sites. This is also an
opportunity for sites to learn from the experience of the other sites.
Given the potential for a large number of clinical sites, data coordinating centers
may utilize risk-based approaches to remote and on-site data monitoring. This is a
multipronged strategy for monitoring that prioritizes the most important aspects of
patient safety, study conduct, and data reporting (Organization for Economic
Co-operation and Development 2013; United States Food and Drug Administration
2013). Key features of risk-based monitoring include a statistical approach to central
monitoring, electronic access to source documents, timely identification of systemic
issues, and greater efficiency during on-site monitoring. This risk-based monitoring
plan is usually developed after a risk assessment of critical data and procedures to be
monitored throughout the trial both remotely and on-site. In contrast to regular visits
to all clinical sites with 100% data audits, this approach to monitoring allows study
sponsors to effectively use resources to centralize data quality checks and use site
visits as an opportunity to further investigate any data anomalies, observe clinic and
study activities, and gather feedback about ease of data procedures and the data
system. This may mean that monitors conduct source data verification on selected
data items, a random sample of data forms, or a hybrid approach of source data
verification of 100% of key data collection instruments and a sample of the
remaining forms. This is another opportunity for study monitors to reinforce unifor-
mity across sites in the conduct of study procedures. These on-site monitoring visits
are seen as particularly useful at the beginning of a trial, with supplementary
centralized monitoring through the duration of the trial. If clinics are found to be
“higher risk” with regards to errors, then additional on-site monitoring visits and
re-training can be arranged in a targeted manner.
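The hybrid verification strategy described above can be sketched in a few lines. This is a hypothetical illustration, not a procedure from the FDA or OECD guidance cited in this section; the form names, sampling rates, and risk threshold are all assumptions:

```python
import random

# Hypothetical "key" forms that receive 100% source data verification.
KEY_FORMS = {"informed_consent", "eligibility", "primary_outcome"}

def select_forms_for_sdv(forms, sample_rate=0.10, seed=42):
    """Hybrid selection for source data verification (SDV).

    `forms` is a list of (participant_id, form_name) tuples. Every
    key form is selected; the remaining forms are randomly sampled
    at `sample_rate`.
    """
    rng = random.Random(seed)
    key = [f for f in forms if f[1] in KEY_FORMS]
    other = [f for f in forms if f[1] not in KEY_FORMS]
    return key + rng.sample(other, round(len(other) * sample_rate))

def sdv_rate_for_site(query_rate, threshold=0.05):
    """Escalate sampling at 'higher risk' sites, e.g., those whose
    central-monitoring data-query rate exceeds a threshold."""
    return 0.25 if query_rate > threshold else 0.10

forms = [(101, "eligibility"), (101, "lab_panel"), (102, "primary_outcome"),
         (102, "concomitant_meds"), (103, "informed_consent"), (103, "lab_panel")]
print(select_forms_for_sdv(forms, sample_rate=sdv_rate_for_site(query_rate=0.08)))
```

A centralized routine of this kind lets monitors concentrate on-site effort where the central checks flag anomalies, which is the resource trade-off the risk-based approach is intended to capture.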

Clinical Trial Networks

Multicenter clinical trials can be conducted within clinical trial networks, also
referred to as trial consortiums or cooperative group clinical trials. Clinical trial
networks can be organized around a common clinical or disease area, can span
multiple countries, and can be publicly or privately funded. Table 1 lists several
clinical trial networks around the world, their sponsors, and their missions. The
networks range in the specificity of their missions, and in some cases their goals
have evolved since their establishment. Some focus on therapeutic testing for
neglected diseases or for diseases that have high mortality and few to no treatment
options. Many of the institutes within the United States National Institutes of Health
sponsor clinical trial networks to accelerate research in priority areas.

Table 1 Clinical trial network examples

Alzheimer's Clinical Trials Consortium
  Sponsor: National Institute on Aging
  Overview: Comprised of 35 sites in 24 states; mission to provide infrastructure, centralized resources, and shared expertise for the accelerated development of treatments for Alzheimer's disease and related disorders

Platform for European Preparedness for Re(emerging) Epidemics (PREPARE)
  Sponsor: European Commission
  Overview: Hundreds of sites across Europe and Western Australia; mission to establish a framework for harmonized clinical research on infectious disease, prepared for rapid response and real-time evidence

Oswaldo Cruz Foundation (FIOCRUZ) Clinical Research Network
  Sponsor: Oswaldo Cruz Foundation
  Overview: Network of clinical research groups, steering committee, executive secretary, and communities of practice across Brazil; mission to strengthen the role of clinical research at FIOCRUZ, overcome technological hurdles, and establish a national clinical research program

Korea National Enterprise for Clinical Trials (KoNECT) Collaboration Center
  Sponsor: Government of Korea
  Overview: Multiple clinical research sites across Korea; mission to foster a community of clinical research and networking for product development

East African Consortium for Clinical Research (EACCR2)
  Sponsor: European Union
  Overview: Multiple research nodes across East Africa supported by a network of European and African country governments; mission to conduct rigorous clinical trials on poverty-related and neglected diseases

Clinical trial networks in Europe may build on the existing relationships between
governments in the European Union (EU) and apply for European Research Infra-
structure Consortium (ERIC) designation. This allows for clinical trial networks to
be legally recognized across all EU member states, to fast-track the development of
an international organization, and to be exempt from Value Added Tax (VAT) and
excise duty. Countries outside of Europe are also allowed to join ERICs. Clinical
trial networks interested in this designation must provide evidence that they have the
infrastructure necessary to carry out the intended research, that the research is a
value-add to the European Research Area (ERA), and the venture is a joint European
initiative intended to disseminate research results to benefit the entire ERA (European Commission 2009).

Fig. 1 Clinical trial network structure examples. In each example, the centralized trial network management comprises an executive committee, a data coordinating center, an analytic core, a regulatory team, and a data and specimen repository. (a) Clinical trial network with one sponsor who works directly with the centralized trial network management to direct and coordinate research at multiple clinics. (b) Clinical trial network with multiple public and private sponsors who work directly with the centralized trial network management to direct and coordinate research nodes, focusing on different diseases, that then coordinate research at multiple clinical sites. (c) Clinical trial network with public sponsors from different governments who work directly with the centralized trial network management to direct and coordinate research at multiple clinics in one country; typically seen with one northern-country sponsoring government and one southern-country sponsoring government, with research conducted in the southern country
There are a variety of ways in which a clinical trial network can be organized.
Figure 1 depicts common structures. Despite their differences in design, there are
several unique elements common to most clinical trial networks. Due to their
inherent complexity, clinical trial networks tend to have centralized operations
management. This may consist of some executive body that directs the mission
and research programs for the network. They may include a representative from the
funding organization to inform the direction of the research agenda. This group may
task one or more working groups to coordinate between clinical investigators to
execute such initiatives. Clinical trial networks may also include a dedicated,
centralized, regulatory arm that handles the regulatory reporting responsibilities of
all studies conducted within the network. Data coordination can also be done
centrally for a number of studies within the network. Along with data coordination,
a clinical trial network may have a dedicated analytic core to perform all study
analyses. This group may also be charged with developing novel trial designs that
are best suited to answer questions related to the clinical area of interest for the
network. Because the network pulls participants from the same patient pool, inves-
tigators can recruit for multiple studies within the network at once, leading to better
enrollment for all network trials, particularly those that require highly specific patient
populations (Liu et al. 2013; McCrae et al. 2012). This often means that patient data
that feeds into a data repository can serve to inform several trials with minimal
associated administrative overhead (Massett et al. 2019). Finally, clinical trial
networks have a formal system for building consensus around research priorities,
resource allocation, and network leadership (Organization for Economic
Co-operation and Development 2013). Developing the protocol for reaching con-
sensus is essential when clinical trial networks are large and multinational with
differing regulatory oversight and clinical standards.
Clinical trial networks offer many benefits to various stakeholders. They provide
an opportunity for investigators with similar research interests to exchange ideas,
develop novel trial methodologies for their clinical area, share resources, and
leverage a large pool of potential participants to push their field forward (Bentley
et al. 2019; Davidson et al. 2006). In cases where there is national buy-in from local
governments, this becomes an important public-private partnership to accelerate
national research agendas in an efficient manner. Initiating trials within an
established infrastructure of research groups, with existing relationships among the
groups and with the sponsor, organizational competence, and administrative support,
contributes to this efficiency. The inherent structure of clinical trial networks lends itself to
comparative effectiveness research that can inform government reimbursement
decisions and potential guideline changes. Smaller clinical sites wishing to develop
relationships with certain sponsors can benefit from joining clinical trial networks as
well. Sub-studies are also easier to execute for clinical sites with limited resources by
leveraging existing infrastructure. Lastly, all clinical sites, regardless of capacity, can
benefit from the increased exchange of ideas through the frequent meetings.
Despite these benefits, clinical trial networks can carry limitations worth consid-
ering. There is a common perception that enrollment is the standard metric of success
within a network. As such, payment structures may be based on the number of
participants enrolled at each site, with little consideration of overhead costs. This
could be a deterrent for investigators from academic institutions with high overhead
costs or for study investigators with study protocols requiring high resource utiliza-
tion, as investigators are dependent on their institution’s cooperation and support of
the endeavor. Clinical trial networks can also fall victim to inadequate staffing, with
consequences more substantial than would be in a single clinical research group
(Baer et al. 2010). Additionally, participation in a clinical trial network may mean
involvement in multiple clinical trials; however, not all of these may lead to
significant credit or publications for every investigator (Bentley et al. 2019). The
decision for investigators to participate in a clinical trial network should weigh the
benefits and limitations of their home institution, existing patient pool, and potential
for professional growth in their group and contribution to science.

Summary and Conclusions

Clinical trial networks provide an efficient platform for investigators to conduct a


large number of multicenter clinical trials, simultaneously, in niche research areas.
They also allow for trial innovation in ways that are specifically catered to the needs
of a clinical area. While networks offer a highly structured way to conduct many
multicenter clinical trials, such trials can also be conducted on a single-trial basis
with a different group of investigators each time. Multicenter clinical trial designs
are a welcoming platform for smaller clinical centers wishing to engage in clinical
trials for which the patient population is hard to recruit and for which trial conduct
is resource intensive.
This approach then allows for engagement of investigators and patient pools that
would otherwise be excluded from such research. Multicenter clinical trials allow for
an efficient, timely approach to conduct a clinical trial with a representative patient
population. They are utilized by both private and public entities for evidence used in
product development and informing guidelines and reimbursement decisions.

Key Facts

1. Multicenter clinical trials are a resource-conservative approach to engaging investigators in large clinical trials, particularly trials for which patients are rare or hard to recruit.
2. As clinical research moves towards electronic records and virtual assessments, multicenter clinical trials are becoming more feasible in resource-poor locations, as data coordination, study monitoring, and regulatory submissions are managed centrally.
3. Government agencies can accelerate priority research by establishing clinical trial
networks around these areas to engage investigators in multiple, simultaneous
clinical trials for disease therapeutics and interventions.
4. While multicenter clinical trials offer an efficient way to conduct clinical trials,
they do have considerable levels of oversight from trial leadership and require site
investigators to cede some autonomy.

Cross-References

▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Institutional Review Boards and Ethics Committees
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance

References
Baer AR, Kelly CA, Bruinooge SS, Runowicz CD, Blayney DW (2010) Challenges to National
Cancer Institute-Supported Cooperative Group clinical trial participation: an ASCO survey of
cooperative group sites. J Oncol Pract 6(3):114–117. https://doi.org/10.1200/jop.200028
Bentley C, Cressman S, van der Hoek K, Arts K, Dancey J, Peacock S (2019) Conducting clinical
trials – costs, impacts, and the value of clinical trials networks: a scoping review. Clin Trials
16(2):183–193. https://doi.org/10.1177/1740774518820060
Chung KC, Song JW, WRIST Study Group (2010) A guide to organizing a multicenter clinical trial. Plast Reconstr Surg 126(2):515–523. https://doi.org/10.1097/PRS.0b013e3181df64fa
Dalkey NC (1969) The Delphi method: an experimental study of group opinion. RAND Corporation, Santa Monica. https://www.rand.org/pubs/research_memoranda/RM5888.html
Davidson RM, McNeer JF, Logan L, Higginbotham MB, Anderson J, Blackshear J, . . . Wagner GS (2006) A cooperative network of trained sites for the conduct of a complex clinical trial: a new concept in multicenter clinical research. Am Heart J 151(2):451–456. https://doi.org/10.1016/j.ahj.2005.04.013
Daykin A, Selman LE, Cramer H, McCann S, Shorter GW, Sydes MR, . . . Shaw A (2016) What are
the roles and valued attributes of a Trial Steering Committee? Ethnographic study of eight
clinical trials facing challenges. Trials 17(1):307. https://doi.org/10.1186/s13063-016-1425-y
Department of Health (2006) Guidelines for good practice in the conduct of clinical trials with human participants in South Africa. https://www.dst.gov.za/rdtax/index.php/guiding-documents/south-africangood-clinical-practice-guidelines/file
Ervin AM, Taylor HA, Ehrhardt S (2016) NIH policy on single-IRB review – a new era in multicenter studies. N Engl J Med 375(24):2315–2317. https://doi.org/10.1056/NEJMp1608766
European Commission (2009) Report from the Commission to the European Parliament and the
council on the application of Council Regulation (EC) No 723/2009 of 25 June 2009 on the
community legal framework for a European Research Infrastructure Consortium (ERIC). (COM
(2014) 460 final). European Commission, Brussels. Retrieved from https://ec.europa.eu/info/
sites/info/files/eric_report-2014.pdf
Liu G, Chen G, Sinoway LI, Berg A (2013) Assessing the impact of the NIH CTSA program on institutionally sponsored clinical trials. Clin Transl Sci 6(3):196–200. https://doi.org/10.1111/cts.12029
Massett HA, Mishkin G, Moscow JA, Gravell A, Steketee M, Kruhm M, . . . Ivy SP (2019)
Transforming the early drug development paradigm at the National Cancer Institute: the
formation of NCI’s Experimental Therapeutics Clinical Trials Network (ETCTN). Clin Cancer
Res 25(23):6925–6931. https://doi.org/10.1158/1078-0432.Ccr-19-1754
McCrae N, Douglas L, Banerjee S (2012) Contribution of research networks to a clinical trial of antidepressants in people with dementia. J Ment Health 21(5):439–447. https://doi.org/10.3109/09638237.2012.664298
National Institutes of Health (2016) Final NIH policy on the use of a single institutional review
board for multi-site research. Bethesda. Retrieved from http://grants.nih.gov/grants/guide/
notice-files/NOT-OD-16-094.html
National Institutes of Health (2017) Guidance on implementation of the NIH policy on the use of a
single institutional review board for multi-site research. Bethesda. Retrieved from https://grants.
nih.gov/grants/guide/notice-files/NOT-OD-18-004.html
Organization for Economic Co-operation and Development (2013) OECD recommendation on the governance of clinical trials. Retrieved from http://www.oecd.org/sti/inno/oecdrecommendationonthegovernanceofclinicaltrials.htm
Senn S (1998) Some controversies in planning and analysing multi-centre trials. Stat Med 17(15–16):1753–1765; discussion 1799–1800. https://doi.org/10.1002/(sici)1097-0258(19980815/30)17:15/16<1753::aid-sim977>3.0.co;2-x
The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (1998) ICH E5(R1) ethnic factors in the acceptability of foreign clinical data. Cited from European Medicines Agency. Available from: https://www.ema.europa.eu/en/documents/scientific-guideline/iche-5-r1-ethnic-factors-acceptability-foreign-clinical-data-step-5_en.pdf
The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (2017) ICH E17 general principles for planning and design of multi-regional clinical trials
Trial Governance (2015) In: Smith P, Morrow R, Ross D (eds) Field trials of health interventions: a toolbox, 3rd edn. OUP Oxford, Oxford, UK
United States Food and Drug Administration (2006a) Guidance for clinical trial sponsors –
establishment and operation of clinical trial Data Monitoring Committees. Rockville. Retrieved
from https://www.fda.gov/media/75398/download
United States Food and Drug Administration (2006b) Guidance for industry – using a centralized
IRB review process in multicenter clinical trials. Retrieved from https://www.fda.gov/
regulatory-information/search-fda-guidance-documents/using-centralized-irb-review-process-
multicenter-clinical-trials
United States Food and Drug Administration (2013) Guidance for industry – oversight of clinical investigations – a risk-based approach to monitoring. Silver Spring. Retrieved from https://www.fda.gov/media/116754/download
Van de Ven AH, Delbecq AL (1972) The nominal group as a research instrument for exploratory
health studies. Am J Public Health 62(3):337–342. https://doi.org/10.2105/ajph.62.3.337
Van De Ven AH, Delbecq AL (1974) The effectiveness of nominal, Delphi, and interacting group
decision making processes. Acad Manag J 17(4):605–621. https://doi.org/10.2307/255641
Zelen M (1974) The randomization and stratification of patients to clinical trials. J Chronic Dis
27(7–8):365–375. https://doi.org/10.1016/0021-9681(74)90015-0
10 Principles of Protocol Development
Bingshu E. Chen, Alison Urton, Anna Sadura, and Wendy R. Parulekar
Canadian Cancer Trials Group, Queen’s University, Kingston, ON, Canada

Contents
Introduction
  Administrative Information: SPIRIT Checklist Items 1–5d
  Background and Rationale and Objectives: SPIRIT Checklist Items 6–7
  Trial Design: SPIRIT Checklist Item 8
  Participants, Interventions, and Outcomes: SPIRIT Checklist Items 9–17b
  Participant Timeline, Sample Size Recruitment (Items 13–15)
  Assignment of Interventions (for Controlled Trials): SPIRIT Checklist Items 16–17b
  Data Collection/Management and Analysis: SPIRIT Checklist Items 18a–20c
  Monitoring: SPIRIT Checklist Items 21–23
  Quality Assurance (Monitoring/Auditing)
  Ethics and Dissemination: SPIRIT Checklist Items 24–31c
Conclusion
References

Abstract
Randomized clinical trials are essential to the advancement of clinical care by
providing an unbiased estimate of the efficacy of new therapies compared to
current standards of care. The protocol document plays a key role during the life
cycle of a trial and guides all aspects of trial organization and conduct, data
collection, analysis, and publication of results.
Several guidance documents are available to assist with protocol generation.
The SPIRIT (Standard Protocol Items: Recommendations for Interventional
Trials) Statement comprises a checklist of essential items for inclusion in a
protocol document. Other essential references include those generated by the International Conference on Harmonization and the Declaration of Helsinki
which inform the design and conduct of trials that meet the highest scientific,
ethical, and safety standards.

Keywords
SPIRIT Statement · International Conference on Harmonization · Declaration of
Helsinki

Introduction

The protocol serves as the reference document for the conduct, analysis, and reporting
of a clinical trial and must satisfy the requirements of all stakeholders involved in
clinical trial research, including trial participants, ethics committees, regulatory
and legal authorities, funders, sponsors, and public advocates, as well as the medical
and scientific communities that are the direct consumers of the research findings.
An inadequate or erroneous protocol has significant consequences. A deficient
protocol may result in delayed or denied regulatory or ethical approval, risks to the
safety of study subjects, investigator frustration and poor accrual, inconsistent
implementation across investigators, as well as increased workload burden and
financial costs due to unnecessary amendments. Ultimately, the trial results may
not be interpretable or publishable.
The purpose of this chapter is to outline the general principles of protocol
development with an emphasis on use of standard definitions and criteria for
protocol content where applicable. Essential reading for this chapter is the SPIRIT
2013 Statement (Chan et al. 2013a) and the accompanying Explanation and
Elaboration paper (Chan et al. 2013b). The SPIRIT Initiative was launched in 2007 to
address a critical gap in evidence-based guidance documents for protocol generation.
Using systematic reviews, a formal Delphi consensus process, and face-to-face
meetings of key stakeholders, a 33-item checklist relating to protocol content was
generated and subsequently field tested prior to publication. Although the SPIRIT
checklist was primarily developed as a guidance document for randomized clinical
trials, the principles and application extend to all clinical trials, regardless of design.
The reader is also directed to the SPIRIT-PRO extension which builds on the
methodology of the SPIRIT Statement and provides recommendations for protocol
development when a patient-reported outcome is a key primary or secondary
outcome (Calvert et al. 2018).
Key principles that underpin the content of high-quality clinical trial protocols
relate to the originality and relevance of the primary research hypothesis contained
therein, use of design elements to adequately test the hypothesis, and inclusion of
appropriate measures to protect the rights and safety of trial participants. Guidance
documents generated by the International Conference on Harmonization (ICH) are

useful references and address multiple topics of interest such as the E6 Good Clinical
Practice (GCP) and E8 General Considerations for Clinical Trials (https://www.ich.
org/products/guidelines/efficacy/efficacy-single/article/integrated-addendum-good-
clinical-practice.html). Another important reference document is the Declaration of
Helsinki which was developed by the World Medical Association and represents a
set of principles that guides the ethical conduct of research involving humans
(https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-
for-medical-research-involving-human-subjects).
What follows is a brief summary of key protocol content topics annotated with the
associated SPIRIT checklist items. Additional comments to assist with comprehen-
sion or use of SPIRIT protocol items are included as appropriate.

Administrative Information: SPIRIT Checklist Items 1–5d

The administrative information relates to protocol title, unique trial registry number,
amendment history, and contact information for trial conduct from scientific,
operational, and regulatory perspectives. The title should indicate the trial phase,
interventions under evaluation, and disease settings/trial population (Fig. 1).
The Declaration of Helsinki (revised 2008) mandates the registration of all
clinical trials in a publicly accessible database before recruitment of the first subject.
Trial registration in a primary register of the WHO International Clinical Trials
Registry Platform (ICTRP) or in ClinicalTrials.gov has been endorsed by the
International Committee of Medical Journal Editors (http://www.icmje.org/recommendations)
since both registries meet the criteria of access to the public at no
charge, oversight by a not-for-profit organization, inclusion of a mechanism to ensure

validity of registry information, and searchability by electronic means. Effective
April 18, 2017, trial registration for applicable clinical trials became mandatory
under the US Federal Food, Drug, and Cosmetic Act (FD&C Act).

Fig. 1 Sample protocol title page. Elements shown: Protocol Version Date; Research Organization;
Protocol Title (number/code); Trial Registration Number; Study Chair; Steering Committee;
Biostatistician; Collaborating Research Organizations; Regulatory Sponsor; Support Providers
(Grant Agencies, Pharmaceutical Companies)

Registration
promotes informed decision-making by reducing publication bias, ensures
researchers and potential study participants are aware of trial opportunities, avoids
duplication of trials, and can identify gaps in clinical research. Registries may also
require submission of basic study results.
A protocol is a dynamic document that is responsive to new information that
emerges during the life cycle of a trial from external or internal sources. Amendment
history is recorded in the protocol document using date and version control; a master
list of changes to the protocol must be maintained in the trial master file. The content
of the amendment will guide the rapidity with which the protocol is modified and
circulated to investigators. Amendments based on safety considerations are the
highest priority and are processed rapidly; administrative changes or clarifications
can be issued when there are sufficient cumulative changes in a trial to justify the
workload associated with approval of the amendment by regulators and ethics
committees. The designation of key roles and responsibilities to specific individuals
involved in trial design and conduct, together with their contact information, is an
essential resource for trial participants and the research community and is provided
in the protocol document.

Background and Rationale and Objectives: SPIRIT Checklist Items 6–7

The justification for a research study is the single most important component of
any clinical trial. A trial that will not contribute meaningfully to the advancement
of healthcare and research represents a waste of resources and is unethical,
regardless of adherence to checklists and standards for research involving human
subjects.
The background section should summarize the current literature about the
research topic and the hypothesis that will be addressed by the clinical trial. A
review of ongoing trials addressing the same or similar research questions will
demonstrate non-duplication of research efforts. Finally, explicit statements regard-
ing the anticipated impact of the trial results – either positive or negative – provide a
powerful justification for trial conduct. This section should be updated as required to
reflect important advances in knowledge as they relate to the research question,
especially if they result in changes to trial design or conduct.
The objectives of the trial enable the research hypothesis to be tested and are
listed in order of importance as primary and secondary objectives. The primary
objective links directly to the statistical design of the trial, which allows the results
to be interpreted within a pre-specified set of statistical parameters (see
Section Statistical Methods). Secondary objectives are selected to provide additional
information to support interpretation of the primary analysis data and typically focus
on additional measures of efficacy, safety, and tolerability associated with a given

intervention. Tertiary objectives are exploratory in nature and may address prelim-
inary research questions related to disease biology or response to treatment.

Trial Design: SPIRIT Checklist Item 8

The trial design is driven by the research hypothesis under evaluation. For example,
new therapeutic strategies with the potential for greater disease control compared to
standard of care may be tested in a parallel group superiority trial; a non-inferiority
trial may be suitable to test a therapy associated with less toxicity or greater ease of
delivery for which a small loss of efficacy may be acceptable. In addition to a
description of the trial framework, the protocol must clearly indicate the randomi-
zation allocation ratio. Deviation from the usual 1:1 allocation may be justified by
the desire for a more in-depth characterization of treatment-associated safety and
tolerability and may increase participant willingness to be enrolled in a specific trial
if there is a greater chance of receiving a new treatment compared to standard of care.
Crossover in treatment administration is another important aspect of trial design and
is used frequently in studies of chronic diseases which are relatively stable, and
therapeutic interventions result in amelioration but not cures of the condition, e.g.,
pain syndromes or asthma. Patients are randomly allocated to a predefined sequence
of treatments administered over successive time periods, and the outcome of interest
is measured at the end of each treatment period.
A design of increasing interest and use is the pilot study. This type of trial is
conducted using the same randomization scheme and interventions but on a smaller
scale. The goal of the pilot study is to gather information regarding trial conduct such
as the ability to randomize patients, administer the therapeutic interventions, or
measure the outcome measure(s) of interest but not to estimate relative treatment efficacy
between the interventions (Lancaster et al. 2004; Whitehead et al. 2014).

Participants, Interventions, and Outcomes: SPIRIT Checklist Items 9–17b

Participants
The population selected for trial participation must meet specific criteria to ensure
safety and enable the primary and secondary objectives to be met.
For trials testing drug interventions, adequate organ function is based on the
known pharmacokinetic and pharmacodynamic properties of the drug. Surgical or
radiotherapy trials may require additional tests of fitness for the required intervention
including lung function tests, adequacy of coagulation, and ability to tolerate an
anesthetic.
A patient is enrolled on a trial based on the assumption that he/she will contribute
meaningful information to the outcome measure(s) with a small loss of data due to
trial dropouts or withdrawals. The eligibility criteria should ensure that enrolled
patients can contribute data to enable the trial objectives to be met. For example, a

trial examining the impact of a therapeutic intervention on pain response must enroll
symptomatic patients with a specific pain threshold; trials evaluating the ability of
interventions to control or shrink disease must enroll patients with a quantifiable
disease burden, e.g., radiological evidence of cancer in a trial evaluating anticancer
activity of different therapies. Given the significant resources required for the
conduct of a randomized trial, a natural tendency is to include as many outcome
measures as possible to maximize the yield of the data generated by the trial. This
approach is not recommended since it increases the burden of study conduct and
participation and the risk of noncompliance with data submission and may
negatively impact accrual if enrollment is contingent on the ability to provide data
on multiple outcome measures. The overall false-positive rate is also inflated when
multiple hypotheses are tested, and proper adjustment for multiple tests is required.
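To see the scale of the problem, note that k independent outcome measures each tested at α = 0.05 give a family-wise false-positive probability of 1 − (1 − α)^k. The short sketch below (Python, with illustrative numbers only; the Bonferroni correction shown is one common remedy, not necessarily the protocol-specified method) makes the inflation concrete:

```python
alpha = 0.05  # nominal per-test significance level

for k in (1, 5, 10, 20):  # number of outcome measures tested
    fwer = 1 - (1 - alpha) ** k  # probability of at least one false positive
    print(f"{k:2d} outcomes: family-wise error = {fwer:.2f}, "
          f"Bonferroni per-test threshold = {alpha / k:.4f}")

# Five unadjusted outcomes already inflate the false-positive rate to ~0.23;
# twenty inflate it to ~0.64.
```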
Patient-reported outcomes (PROs) are frequently included in trials of therapeutic
interventions to provide a patient perspective on their health status during the course
of a trial. Specific eligibility criteria related to participation in PRO data collection,
e.g., language ability, comprehension of questionnaires, and access to electronic
devices for direct patient to database submissions, should be adequately described in
the eligibility criteria. Similarly, criteria related to other research objectives such as
submission of biological tissue for analyses related to disease prognosis or predictors
of response to therapeutic interventions or health utility questionnaires for economic
analyses are included in the eligibility criteria as appropriate. The mandatory versus
optional nature of the criteria for tissue submission and patient-reported outcome
must be stipulated.
A long-standing criticism of clinical trials is that the results may have limited real-
world applicability due to the highly selected patient population enrolled. Linked to
this concern is the issue of screen failures, i.e., patients who are appropriate candi-
dates to participate in the trial but cannot be enrolled due to inability to meet the
stringent eligibility criteria. In response to this concern, efforts are underway by
research and advocacy organizations to broaden criteria to allow greater participation
in clinical trials by removing barriers such as the presence of comorbidities, organ
dysfunction, prior history of malignancies, or minimum age (Gore et al. 2017; Kim
et al. 2017; Lichtman et al. 2017).

Interventions
The treatment strategies under evaluation must be clearly described in the protocol,
to allow participating centers to safely administer the intervention and the medical
community to reproducibly administer the intervention should it be adopted or used
on a wider basis. A basic trial schema included early in the protocol document
provides a visual illustration of the interventions (Fig. 2).
For drug trials, dose calculation and guidelines regarding administration and dose
modification are provided. A tabular format is a convenient method to illustrate the
dose modification requirements mandated by specific laboratory values and/or
adverse events. In addition, guidance regarding dose modifications should a patient
experience multiple adverse events with conflicting recommendations regarding
dose adjustments is essential information for the protocol. Nondrug interventions
such as surgery, radiotherapy, or use of other devices require additional protocol
guidance related to credentialing requirements of those administering the
intervention and the setting of administration.

Fig. 2 Sample trial schema. A patient population, stratified by predefined factors, is randomized
between Arm 1 and Arm 2; the primary outcome measure is assessed in both arms, and the target
sample size is indicated
The protocol should also include details regarding strategies to maintain
compliance with trial interventions and how these will be measured. Inclusion
of this information will optimize exposure of the participants to the interventions
of interest and interpretation of efficacy estimates and inform the uptake of the
intervention by the clinical community if the intervention is beneficial. Caution is
advised when choosing the instrument(s) to measure compliance. For example,
oral dosing of a drug may be monitored by patient diaries and/or pill counts at
the end of a predefined treatment period. Maintaining a medication diary may be
burdensome and inaccurate for patients who are treated over long periods of
time. In addition, using both diary entries and pill returns to measure compliance
may be problematic if these measures provide conflicting information about drug
exposure.
Concurrent therapies or care administered on a trial may impact the adverse event
experience and compliance with the intervention of interest and may have the
potential to alter the disease outcome, leading to biased estimates of efficacy for a
given intervention. To minimize this problem, permissible and non-permissible
therapies should be clearly listed in a protocol combined with guidance regarding
dose adjustments or discontinuation in the case of administration of a prohibited
therapy.

Outcomes
The outcome measures selected in a clinical trial are of paramount importance – they
form the basis for data collection, statistical analysis, and results reporting. An
appropriate outcome measure must have a biologically plausible and clinically
relevant link to the intervention(s) under evaluation and be objectively and reliably
measured and reported using appropriate nomenclature.
Standardization of outcome measures has been identified by the research com-
munity as a goal to improve the general interpretability of the results of individual
trials and to enhance the integration and analysis of results from multiple trials. The
COMET (Core Outcome Measures in Effectiveness Trials) initiative is an example
of a collaborative effort to define a minimum core set of outcomes to be measured
and reported in clinical trials (http://www.comet-initiative.org). In addition to
providing guidance regarding disease-specific outcome measures, the COMET
initiative represents a rich resource of relevant methodologies for interested
researchers.
Composite outcome measures are often used to evaluate the efficacy of therapeu-
tic interventions and deserve specific mention. As with single-item outcome mea-
sures, the individual components of a composite measure must be clearly defined
and evaluable. In addition, the hierarchy of importance of the individual components
of a composite outcome measure must be prospectively identified to assist with data
collection and reporting. For example, if disease worsening can be defined by
radiological investigations or measurement of a blood-based marker, guidance for
reporting must be included in the protocol should both outcome events occur
simultaneously.
Perhaps the most important and challenging criterion to satisfy when selecting an
outcome measure relates to clinical benefit or meaningfulness. If the ultimate goal of
a therapeutic intervention is to live longer or better, the outcome measure must be
correlated to clinical benefit. Overall survival is considered the gold standard
outcome measure for trials testing therapeutic interventions for life-threatening
diseases but may be challenging to measure and interpret if death occurs years
after enrollment in a trial or if multiple efficacious therapies are administered after
the intervention of interest has failed to control the disease. Use of an intermediate,
clinically meaningful outcome measure may be justified in circumstances when
overall survival measurement is not feasible, especially when the alternative out-
come measure is a validated surrogate for overall survival, e.g., metastasis-free
survival in early prostate cancer (Xie et al. 2017).

Participant Timeline, Sample Size, Recruitment (Items 13–15)

The schedule of investigations and interventions is included in all protocols to
enable meaningful participation of patients and researchers. Baseline and post-
randomization evaluations should be displayed in an easy-to-understand format
such as tables or schematic diagrams. Timing of evaluations is linked to safety and
efficacy oversight. The former is dictated by the schedule of treatment administration

and safety profile of the treatment intervention; the latter must be symmetric between
arms to avoid biased assessment of treatment efficacy. Classification of investiga-
tions by disease and treatment trajectory is a logical way to convey the information
to trial participants, i.e., prior to randomization, treatment phase, and follow-up
phase after the treatment has been completed or discontinued. Only essential inves-
tigations should be included in a protocol to minimize the burden of participation on
patients and healthcare facilities. An important principle guiding protocol develop-
ment relates to alignment of study assessments to usual care. Tests or interactions
within the healthcare systems that deviate from current practice will increase the risk
of noncompliance of participants with protocol-mandated assessments and may lead
to incomplete data collection and an impact on enrollment. To minimize this risk, the
protocol schedule of assessments and follow-up should be shared with prospective
participants for review prior to trial initiation.

Sample Size
The sample size justification is directly linked to the trial hypothesis and primary
objective. The statistical and clinical assumptions that inform the sample size
calculation must be clearly stated. The relevant information includes identification
of the primary outcome measure, expected primary outcome in the control group,
and the targeted difference in the primary outcome measure between treatment
groups, primary statistical test, type I and II error rates, and measures of precision.
In general, a minimal clinically important difference (MCID) should be used in the
sample size calculation. Sample size adjustments for missing data and/or interim analyses
should be detailed. Additional important details to include in this section relate to the
planned duration of accrual and follow-up required to compile sufficient data to
enable the primary analysis.
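As a worked illustration of how these assumptions combine, the sketch below computes an approximate per-arm sample size for a two-arm superiority trial with a binary primary outcome using the standard normal-approximation formula for comparing two proportions. The control rate and MCID shown are hypothetical, and the resulting number would still need inflation for dropout and any planned interim analyses:

```python
from scipy.stats import norm

def n_per_arm(p_control, p_experimental, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided comparison of two
    proportions with 1:1 allocation (normal-approximation formula)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the type I error rate
    z_beta = norm.ppf(power)           # critical value for power (1 - type II error)
    p_bar = (p_control + p_experimental) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_control * (1 - p_control)
                             + p_experimental * (1 - p_experimental)) ** 0.5) ** 2
    return numerator / (p_control - p_experimental) ** 2

# Hypothetical MCID: raising the response rate from 60% to 75%
print(round(n_per_arm(0.60, 0.75)))  # ~152 randomized patients per arm
```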

Recruitment
The success of a trial is directly related to its ability to meet the pre-specified accrual
target. Given the tremendous effort and resource required to conduct a randomized
trial, every effort must be made to ensure the enrollment of consenting patients in a
timely manner. Details of recruitment plans are included in the protocol and will vary
with the patient population and interventions of interest, participating research
networks, and duration of accrual. Oversight measures to ensure adequacy of accrual
are described in this section.

Assignment of Interventions (for Controlled Trials): SPIRIT Checklist Items 16–17b

The single most powerful design aspect of a controlled clinical trial is the process of
randomization or random assignment of enrolled subjects/patients to protocol treat-
ments. The purpose of randomization is to reduce the impact of bias from known and
unknown factors on treatment comparisons as a means of isolating the treatment
effect on patient outcome. Multiple methods of randomization exist. Blocked

randomization ensures balance of treatment assignment within a pre-specified
number of enrollments. For example, with a block size of eight and a 1:1 randomization,
four patients will have been randomized to each treatment after completion of a
block (Altman and Bland 1999). To reduce selection bias, random block sizes are
recommended in a randomized trial, and the block size should be concealed from the
trial investigators. Another technique of randomization is known as stratified ran-
domization. This method balances treatment allocations within pre-specified strata
defined by factors which may impact disease outcomes independent of treatment
assignment (Zelen 1974; Kernan et al. 1999). Minimization is another method
frequently used in clinical trial conduct that adaptively assigns patients to treatments
based on the treatment assignments of previously enrolled patients, taking into
account pre-specified stratification factors. This technique represents a rigorous
method to achieve balance of treatment assignment for predefined patient factors as
well as for the enrollment number for each treatment group (Pocock and Simon
1975). All stratification factors used at randomization (except for center) should be
taken into account in the statistical analysis.
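A minimal sketch of permuted-block randomization with randomly varying block sizes is shown below (Python; the arm labels, block sizes, and seed are illustrative assumptions). A production randomization system would add stratification, allocation concealment, and a full audit trail:

```python
import random

def permuted_block_schedule(n_patients, arms=("A", "B"),
                            block_sizes=(4, 8), seed=2024):
    """Generate a 1:1 allocation list using randomly varying permuted blocks."""
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_patients:
        size = rng.choice(block_sizes)             # vary block size to hinder guessing
        block = list(arms) * (size // len(arms))   # equal allocation within each block
        rng.shuffle(block)                         # random order within the block
        schedule.extend(block)
    return schedule[:n_patients]

print(permuted_block_schedule(12))  # e.g., ['B', 'A', 'A', 'B', ...]
```

After every completed block, the treatment groups are exactly balanced, which is the property described above for a block of eight.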
The protocol document must clearly outline the procedures for trial entry/enroll-
ment. This includes the requirements for activation of a participating center. Specific
requirements such as investigator or center credentialing for treatment delivery must
be outlined including reference to appropriate appendices for additional guidance.
A step-by-step description of the enrollment process should also be provided. This
includes instructions regarding the enrollment system in use, means of access, and
hours of operation as well as the data fields required to complete the enrollment.
Details on how a successful enrollment and treatment allocation will be communi-
cated to the participating center are also outlined in the protocol document.

Data Collection/Management and Analysis: SPIRIT Checklist Items 18a–20c

Data Collection
Data collection must align with the protocol specifications and thus not exceed what
has been approved by regulators, ethics boards, and consenting patients. Several
principles guide data collection during trial conduct: protection of identity and
confidentiality of trial participant data, adequacy of data to meet the primary and
secondary objectives of the trial, use of standard criteria to collect and report data,
and non-duplication of data collection unless justified and pre-specified in the
protocol document. The protocol must specify the data points of interest, methods
of collection, and frequency of reporting. Standard dictionaries for data collection
and reporting should be used where available, e.g., TNM (tumor, lymph node,
metastases) system for solid tumor cancer staging in oncology trials. Use of vali-
dated questionnaires or other instruments to enable accurate measurement of out-
come measures will ensure consistency of reporting and enhance the quality and
interpretation of the statistical analyses, e.g., EORTC QLQ-C30 questionnaire for
global quality of life evaluation in cancer patients (Aaronson et al. 1993) (Table 1).

Table 1 Sample patient evaluation flow sheet. Rows list the required investigations
(history and physical exam, hematology, coagulation, biochemistry, radiology, other
investigations, correlative studies, adverse events, quality of life, and health
economics); columns indicate when each is required (pre-study prior to
registration/randomization, during protocol treatment, and after protocol treatment),
with any required time windows (e.g., within x days of registration) noted in the
relevant cells

Data Management
To demonstrate adherence to guidelines and regulations for database compilation,
storage, and access, the protocol or associated documents must detail the infrastruc-
ture and oversight procedures for data management. This includes information
regarding how trial conduct will be monitored at participating sites to enable data
verification, ethics compliance, and review of pharmacy documentation for drug
trials.
Guidance documents for retention of essential documents at participating sites
should be cited as appropriate. For example, ICH GCP 4.9.5 guidance refers to the
number of years that essential documents must be retained at an investigative site;
GCP 4.9.7 outlines investigative site obligations to allow direct access to trial-related
documents by oversight bodies such as a regulatory authority, research ethics
board, or monitors/auditors (https://ichgcp.net/4-investigator/).
The integrity of a database is related to the quality of data contained therein. To
ensure the submission of high-quality data by trial participants, including accurate,
complete, and timely submission, data collection forms should include clear instruc-
tions and unambiguous data entry fields. Submitted data should be consistent with
source records. Data management guidebooks are useful tools to address topics such
as data entry and editing; methods to record unknown data; how to add comments
and how to respond to queries. Specific trial-related procedures can also be detailed

in protocol appendices or guides, e.g., collection and submission procedures for
biological samples.

Statistical Analysis
The statistical analysis must be described in sufficient detail to allow
replication of the analysis and interpretation of the trial results by the scientific and
clinical community. Inclusion of an experienced statistical member/team in the
protocol writing, trial conduct, and analysis phases is essential to meet these goals.
The parameters of interest include the outcome measure to be compared; the
population whose data will be included, e.g., all randomized versus eligible; and
the statistical methods used to analyze the data. Details regarding the use of
censoring and methods to deal with missing data should also be included. When
stratification is used at randomization, the statistical test for the primary hypothesis
should account for the stratification factors (e.g., stratified Cochran-Mantel-Haenszel
test for response rate and stratified log-rank test for a time-to-event outcome). For
example, a clinical trial comparing the impact of a new therapy compared to standard
of care on overall survival may utilize a time to event analysis. Appropriate statistical
methods to analyze the survival experience of all randomized patients grouped by
assigned treatment include graphical display using the Kaplan-Meier method and
comparison using an appropriate log-rank test (Rosner 1990) with additional explor-
atory comparisons adjusted for prognostic covariates (Cox 1972).
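For illustration only, such an analysis might be sketched with the open-source Python lifelines package as shown below; the tiny dataset and column names (time, event, arm) are hypothetical, and a real trial would apply the protocol-specified, stratified tests to the full analysis population:

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Hypothetical analysis dataset: one row per randomized patient
df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 14, 7, 11, 6, 10],  # months to event or censoring
    "event": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],      # 1 = event observed, 0 = censored
    "arm":   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],      # 0 = control, 1 = new therapy
})

# Kaplan-Meier estimate per assigned treatment (all randomized patients)
for arm, grp in df.groupby("arm"):
    KaplanMeierFitter().fit(grp["time"], grp["event"], label=f"Arm {arm}")

# Log-rank comparison of the two survival curves
ctrl, new = df[df["arm"] == 0], df[df["arm"] == 1]
result = logrank_test(ctrl["time"], new["time"],
                      event_observed_A=ctrl["event"],
                      event_observed_B=new["event"])
print(f"Log-rank p-value: {result.p_value:.3f}")

# Exploratory comparison adjusted for covariates via a Cox model (Cox 1972)
CoxPHFitter().fit(df, duration_col="time", event_col="event").print_summary()
```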
Subgroup analyses are of great interest to the clinical community to understand the
treatment effect of a given intervention on different populations defined by specific
covariates such as those related to disease burden, exposure to prior treatment, or
patient characteristics. Given the exploratory nature of subgroup analyses, these
should be prospectively justified and defined in the protocol with the appropriate
statistical tests to determine if there is an interaction between treatment and subgroup.
Analyses of secondary outcome measures should inform interpretation of the
primary analysis and, ultimately, the research hypothesis. Sufficient details regarding
these analyses to justify their inclusion in the protocol and the associated data
collection plans are required. Using quality of life as an example, the specific
questionnaire/domains of interest, definition of meaningful change in score(s),
time point of data collection for analysis, and methods to control the type I error
due to multiplicity of testing should be outlined in the statistical section (Calvert
et al. 2018).

Monitoring: SPIRIT Checklist Items 21–23

Monitoring activities of a trial relate to real-time oversight of accumulating data
related to safety and efficacy as well as trial conduct in participating centers.

Data Monitoring
Oversight of data is integral to the regulatory, safety, and ethical obligations for any
trial. It is expected that all phase III randomized trials will be monitored on a real-time
and ongoing basis by an independent Data and Safety Monitoring Committee/Board
(DSMC/DSMB). According to ICH GCP, the oversight committee responsibilities
include assessment of trial progress, review of safety and critical efficacy data, and
providing recommendations on trial continuation, modification, and/or termination
as appropriate.
The protocol should refer to the existence of this oversight body and the reporting
obligations of the trial sponsor to the DSMC/DSMB relating to the progress of the
clinical trial, safety, and critical efficacy analyses. Specific terms of reference or
charters for the DSMC/DSMB may be contained in non-protocol documents that are
available on demand.

Efficacy
Interim analyses of the primary outcome measure allow for early termination of a
clinical trial if extreme differences between the treatment arms are seen. Given the
potential for misleading results and interpretations due to multiple analyses of
accumulating data (Geller and Pocock 1987), all prospectively planned interim
analyses must be described in detail in the statistical section. The description will
include the timing or triggers for the interim analyses, the nominal critical p-values
for rejecting the null and alternative hypotheses that may lead to early disclosure of
results or termination of the trial, and required statistical adjustment to preserve the
overall type I error of the trial.
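The inflation that these adjustments guard against is easy to demonstrate by simulation. In the sketch below (Python; the number of looks and sample sizes are arbitrary assumptions), trials with no true treatment effect are repeatedly analyzed at four unadjusted looks, and the empirical type I error rises from the nominal 5% to roughly 13%; this is why group-sequential schemes such as O'Brien-Fleming or Haybittle-Peto boundaries assign much smaller nominal critical p-values to interim analyses:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 20_000             # simulated trials under the null hypothesis
looks = (50, 100, 150, 200)   # cumulative sample sizes at each analysis
z_crit = 1.96                 # unadjusted two-sided critical value (alpha = 0.05)

false_positives = 0
for _ in range(n_trials):
    data = rng.standard_normal(looks[-1])  # true treatment effect is zero
    for n in looks:
        z = data[:n].mean() * np.sqrt(n)   # z-statistic at this interim look
        if abs(z) > z_crit:                # "significant" at an unadjusted 0.05
            false_positives += 1
            break

print(f"Empirical type I error with {len(looks)} unadjusted looks: "
      f"{false_positives / n_trials:.3f}")  # ~0.13 instead of 0.05
```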

Harm
Safety monitoring is continuous during trial conduct and is multifaceted. It includes
adverse event reporting, laboratory and organ-specific surveillance testing such as
ECGs, as well as physical examinations of enrolled trial participants. An adverse
event is defined by the ICH E2A guideline as any untoward medical occurrence in a
patient or clinical investigation subject administered a pharmaceutical product and
which does not necessarily have to have a causal relationship with this treatment
(www.ich.org). Key components of this definition relate to the temporal association
of the untoward sign, symptom, or disease with the pharmaceutical product, regard-
less of causality. The ICH E2A guideline further defines the term serious adverse
event as any medical experience that results in death, is life-threatening, requires
inpatient hospitalization or prolongation of existing hospitalization, results in per-
sistent or significant disability/incapacity, or is a congenital anomaly/birth defect.
For protocol development, these definitions of adverse event and serious adverse
events apply to any medical procedure, not just pharmaceutical products.
To enable safety oversight in a given trial, the lexicon for adverse event classi-
fication and submission timelines by research personnel must be referred to in the
protocol and provided in a companion document or appendix. One example of such a
lexicon is the Common Terminology Criteria for Adverse Events (CTCAE) devel-
oped by the US National Cancer Institute and widely utilized in oncology and
non-oncology clinical trials (www.ctep.cancer.gov/protocoldevelopment). These
criteria provide standard wording and severity ratings for adverse events, grouped
by organ or system class. An essential component of safety reporting relates to

requirements for expedited or time-sensitive adverse event reporting by participants
and sponsor obligations for reporting adverse event data to other organizations
involved in trial conduct, such as national and international regulatory authorities,
pharmaceutical partners, and participating centers.
On a practical note, the safety elements embedded in any protocol should reflect
the developmental stage of a therapeutic agent and the research objectives of the
trial. For example, the adverse event reporting requirements for a drug that is
approved and used within indication in a trial may be streamlined to focus on
higher-grade events with expedited reporting mandated only for serious events that
are unexpected and related. If a stated research objective is to characterize late or
organ-specific side effects of a therapeutic intervention, the protocol should outline
the specific requirements for collection and reporting of adverse events of interest.

Quality Assurance (Monitoring/Auditing)

The quality management and quality assurance process is essential to the successful
conduct of clinical trials to ensure human subject protection and the integrity of trial
results. Systems should be in place to manage quality in all aspects of the trial
through all stages. The quality assurance process should be defined in the trial
protocol and be supported by standard operating procedures and plans. Details of
the plan must comply with applicable regulations and guidelines, health authority
expectations, and sponsor standard operating procedures. This includes details
regarding visit frequency, scope of review, and extent of compliance and source
data assessment. Risk factors to consider in development of the plan include but are
not limited to population, phase of trial, safety profile of agent, trial objectives and
complexity, accrual, performance history, and regulatory filing intent.
Quality assurance may include monitoring, either central or on-site, and auditing
activities. Per GCP these activities may be risk adapted. GCP 1.38 defines monitor-
ing as “the act of overseeing the progress of a clinical trial, and of ensuring that it is
conducted, recorded, and reported in accordance with the protocol, standard operat-
ing procedures, Good Clinical Practice, and the applicable regulatory requirement(s),”
whereas GCP 1.6 defines audit as “a systematic and independent examination of
trial related activities and documents to determine whether the evaluated trial
activities were conducted, and the data were recorded, analyzed, and accurately
reported according to the protocol, sponsor’s standard operating procedures, Good
Clinical Practice, and the applicable regulatory requirements.” Quality assurance
activities may include reviews at participating sites and vendors, as well as internal
reviews of sponsor procedures. The objectives are to verify patient safety, to verify
the accuracy and validity of reported data, and to assess compliance with
regulations/guidelines and standard operating procedures. In general, the components
of review relate to informed consent, protocol compliance and source data
verification, ethics and the essential documents forming part of the trial master file
(which includes standard operating procedures and training), and the handling of
investigational medicinal product as applicable.

Ethics and Dissemination: SPIRIT Checklist Items 24–31c

Ethics
An ethical trial is one that addresses an important research question while protecting
the safety, rights, and confidentiality of trial participants. The protocol must include
sufficient detail to reflect adherence to regulatory and guidance documents
pertaining to these principles. This includes adherence to the Declaration of Helsinki
and other reference documents such as the Tri-Council guidelines (Tri-Council
Policy Statement: Ethical Conduct for Research Involving Humans, December
2014. Retrieved from http://www.pre.ethics.gc.ca/pdf/eng/tcps2-2014/TCPS_2_
FINAL_Web.pdf) regarding research in vulnerable populations as defined by the
ability to make independent decisions or susceptibility to coercion. The protocol
should contain specific wording regarding enrollment of vulnerable individuals.
ICH GCP Section 4.8 provides guidance on the informed consent process. This
includes the requirement for an ethics committee-approved, signed informed consent
document prior to enrollment in the trial, the need to identify the most responsible
parties in a trial from compliance and liability perspectives, the use of a translator to
obtain informed consent, the methods to consent a participant who cannot read, and
the obligation to disclose new information to a trial participant during trial
conduct. Guidance regarding pregnancy reporting and follow-up is also required if
applicable to the trial population.
ICH GCP 4.8 also provides guidance regarding the explanations of the trial to be
included in the consent document. The explanations cover topics related to the
experimental nature of the research and the research question; the treatments under
evaluation and likelihood of assignment; trial-mandated interventions and categori-
zation of which are experimental versus nonexperimental; the risks and benefits of
trial participation including exposure of unborn embryos, fetuses, or nursing infants
to protocol therapies; the existence of alternative treatment options; the trial sample
size; and anticipated duration.
Topics relating to legal and ethical oversight of the trial must also be addressed in
the consent document including the roles of regulatory and ethics bodies in trial
conduct, the voluntary nature of trial participation, the rights of an enrolled partic-
ipant including the ability to withdraw consent to participate or submit data to the
sponsor, compensation for injuries should they occur, and the protection of confi-
dentiality, including which trial-related organizations will have direct access to
original patient data and how data will be stored. Specific contact information for all trial-related
questions or in the case of emergency is also provided.
Optional consents are utilized if there is a nonmandatory aspect of trial conduct in
which enrolled patients can participate. An example of an optional consent is one
that allows banking of tissue samples for future biomarker analyses related to disease
prognosis or predictors of response to the treatment strategies under investigation in
a trial.
Practically speaking, a consent must be written in clear, nontechnical language
aimed at a general readership rather than a research-savvy or legally trained partic-
ipant. Content-specific sections should be clearly identified using appropriate

headings and inclusive of the critical information required for an informed decision
regarding trial participation to be made. In reality, the process of ensuring that the
consent is informed extends beyond a written signature of a consent and the protocol
document. Adequate time and resources must be available prior to and after the
actual signature is obtained to respond to questions and provide information regard-
ing the clinical trial. The actual consent document is retained as a permanent part of
the healthcare record and is a useful resource for continued dialogue during the entire
trajectory of the trial including the analysis, publication, and dissemination
process (Resnick 2009; www.fda.gov/patients/clinical-trials-what-patients-need-
know/informed-consent-clinical-trials).

Dissemination
Dissemination of results of a trial is usually done via presentations at scientific
meetings and/or a peer-reviewed research manuscript published in a scientific
journal. The International Committee of Medical Journal Editors (ICMJE)
has established four general criteria for authorship in a medical journal that
must be met for all named individuals on a submitted manuscript (http://www.
icmje.org/):

• Substantial contributions to the conception or design of the work or the
  acquisition, analysis, or interpretation of data for the work
• Drafting the work or revising it critically for important intellectual content
• Final approval of the version to be published
• Agreement to be accountable for all aspects of the work in ensuring that questions
related to the accuracy or integrity of any part of the work are appropriately
investigated and resolved

The protocol should make reference to authorship guidelines as well as related,
specific policies of the trial sponsor. The mechanism of assigning and ensuring
accountability of author roles rests with the trial leadership and sponsor rather than
the journal editor/editorial staff.
An important part of the dissemination process includes direct communication of
results of the trial and resulting publication to participants, including the enrolled
patient or subject as well as the research staff involved in trial conduct. The process
of communication should be described in the protocol as well as the informed
consent document. Trials registered in a clinical trials registry may be subject to
results reporting requirements at a specified time point.
Data sharing is considered an integral part of the clinical trial process as a means
of optimizing a culture of transparency while enhancing scientific knowledge and
inquiry as well as resource utilization (http://www.iom.edu/activities/research/
sharingclinicaltrialdata.aspx, 2015). Plans and policies for making publicly available
the protocol, statistical analysis report, and/or individual patient data should be
outlined in the protocol if known at the time of study conduct. Associated timelines
for dissemination of this information and administrative requirements for access
should also be described.

Conclusion

The protocol is the pivotal guidance document for a clinical trial that communicates
the essential details of the research plan to trial participants and organizations
involved in research oversight. A well-written protocol has internal consistency,
a logical and clear organization of specific protocol sections, and unambiguous
wording. Guidance documents for protocol development and content
are useful resources for all stakeholders involved in clinical trial research.

References
Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H,
Fleishman SB, de Haes JC, Klee M, Osoba D, Razavi D, Rofe PB, Schraub S, Sneeuw K,
Sullivan M, Takeda F (1993) The European Organization for Research and Treatment of Cancer
QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl
Cancer Inst 85:365–376
Altman DG, Bland JM (1999) How to randomise. BMJ 319:703–704
Calvert M, Kyte D, Mercieca-Bebber R, Slade A, Chan AW, King MT, The SPIRIT-PRO Group,
Hunn A, Bottomley A, Regnault A, Chan AW, Ells C, O’Connor D, Revicki D, Patrick D,
Altman D, Basch E, Velikova G, Price G, Draper H, Blazeby J, Scott J, Coast J, Norquist J,
Brown J, Haywood K, Johnson LL, Campbell L, Frank L, von Hildebrand M, Brundage M,
Palmer M, Kluetz P, Stephens R, Golub RM, Mitchell S, Groves T (2018) Guidelines for
inclusion of patient-reported outcomes in clinical trial protocols: the SPIRIT-PRO extension.
JAMA 319(5):483–494
Chan AW, Tetzlaff JM, Altman DG (2013a) SPIRIT 2013 statement: defining standard protocol
items for clinical trials. Ann Intern Med 158:200–207
Chan AW, Tetzlaff JM, Gotzsche PC (2013b) SPIRIT 2013 explanation and elaboration: guidance
for protocols of clinical trials. BMJ 346:e7586. https://doi.org/10.1136/bmj.e7586
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc Ser B
34:187–220
Geller NL, Pocock SJ (1987) Interim analyses in randomized clinical trials: ramifications and
guidelines for practitioners. Biometrics 43(1):213–223
Gore L, Ivy SP, Balis FM, Rubin E, Thornton K, Donoghue M, Roberts S, Bruinooge S, Ersek J,
Goodman N, Schenkel C, Reaman G (2017) Modernizing clinical trial eligibility: recommen-
dations of the American Society of Clinical Oncology–friends of Cancer research minimum age
working group. J Clin Oncol 35:3781–3787
Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI (1999) Stratified randomization for
clinical trials. J Clin Epidemiol 52(1):19–26
Kim ES, Bruinooge SS, Roberts S, Ison G, Lin NU, Gore L, Uldrick TS, Lichtman SM, Roach N,
Beaver JA, Sridhara R, Hesketh PJ, Denicoff AM, Garrett-Mayer E, Rubin E, Multani P,
Prowell TM, Schenkel C, Kozak M, Allen J, Sigal E, Schilsky RL (2017) Broadening eligibility
criteria to make clinical trials more representative: American society of clinical oncology and
friends of cancer research joint research statement. J Clin Oncol 35:3737–3744
Lancaster GA, Dodd S, Williamson PR (2004) Design and analysis of pilot studies: recommenda-
tions for good practice. J Eval Clin Pract 10:307–312
Lichtman SM, Harvey RD, Smit MAD, Rahman A, Thompson MA, Roach N, Schenkel C,
Bruinooge SS, Cortazar P, Walker D, Fehrenbacher L (2017) Modernizing clinical trial eligi-
bility criteria: recommendations of the American Society of Clinical Oncology–friends of
Cancer research organ dysfunction, prior or concurrent malignancy, and comorbidities working
group. J Clin Oncol 35:3753–3759
Pocock SJ, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors
in the controlled clinical trial. Biometrics 31(1):103–115

Resnick DB (2009) Do informed consent documents matter? Contemp Clin Trials 30(2):114–115
Rosner B (1990) Fundamentals of biostatistics, 3rd edn. PWS-Kent, Boston
Whitehead AL, Sully BG, Campbell MJ (2014) Pilot and feasibility studies: is there a difference
from each other and from a randomised controlled trial? Contemp Clin Trials 38(1):130–133
Xie W, Regan MM, Buyse M, Halabi S, Kantoff PW, Sartor O, Soule H, Clarke NW, Collette L,
Dignam JJ, Fizazi K, Parulekar WP, Sandler HM, Sydes MR, Tombal B, Williams SG, Sweeney
CJ (2017) Metastasis-free survival is a strong surrogate of overall survival in localized prostate
cancer. J Clin Oncol 35(27):3097–3104
Zelen M (1974) The randomization and stratification of patients to clinical trials. J Chronic Dis
27:365–375
Procurement and Distribution of Study
Medicines 11
Eric Hardter, Julia Collins, Dikla Shmueli-Blumberg, and
Gillian Armstrong

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Procurement of Investigational Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Investigational Product Procurement Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Use of a Generic Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Considerations for IP Procurement/Manipulation in Blinded Trials . . . . . . . . . . . . . . . . . . . . . . . 173
Impact of IP-Related Factors: Controlled Substances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Identification of Qualified Vendors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Packaging Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Manufacturing and Packaging Considerations for Blinded Trials . . . . . . . . . . . . . . . . . . . . . . . . . . 176
International Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Distribution of Investigational Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Documents to Support Release of IP to Qualified Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Use of Controlled Substances in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
IP Inventory Management for Complex Study Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
IP Accountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Shipping and Receipt of IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

E. Hardter · J. Collins · D. Shmueli-Blumberg


The Emmes Company, LLC, Rockville, MD, USA
e-mail: [email protected]; [email protected]; [email protected]
G. Armstrong (*)
GSK, Slaoui Center for Vaccines Research, Rockville, MD, USA
e-mail: [email protected]


Abstract
When compared to clinical trials involving a new (unapproved for human use)
drug or biologic, utilizing an approved, commercially available medication in a
clinical trial can introduce a new set of variables surrounding procurement and
distribution, all of which are fundamental to successful trial implementation.
Numerous procurement factors must be considered, including the identification
of a suitable vendor, manufacturing of a matching placebo, and expiration dating,
all of which can become more intricate when the study increases in complexity by
involving factors like active comparators, drug tapering regimens, and research
sites in more than one country. Distribution is a similarly complex operation,
which involves adherence to regulatory requirements and consideration of
aspects such as blinded study designs or utilization of additional safeguards
with the use of controlled substances. This chapter will review the basic factors
to be taken into consideration during the planning and operational stages of a
clinical trial involving a marketed medication and provide examples of how to
manage these factors, all of which are aimed at ensuring compliance with both
applicable local and international laws and with guidance documents aimed at
protecting the rights, safety, and well-being of trial participants.

Keywords
Investigational product (IP) · Placebo · Current Good Manufacturing Practices
(cGMPs) · Blinded/blinding · Manipulation · Procurement · Controlled
substance · Vendor · Accountability · Distribution

Introduction

The International Council for Harmonisation of Technical Requirements for the
Registration of Pharmaceuticals for Human Use (ICH) Good Clinical Practice
(GCP) guidelines (▶ Chap. 35, “Good Clinical Practice”) define investigational
product (IP) as a pharmaceutical form of an active ingredient being tested or used as
a reference in a clinical trial or the corresponding placebo (ICH E6, 2016). As the
pharmaceutical form (tablet, liquid, sterile injectable, etc.) and type (active, active
comparator, or placebo) of IP or study medication required in a clinical trial
are driven by the study goals and protocol design, the procurement and distribution
of IP should be considered early during protocol development and be a key part of
early study planning activities. A lack of planning and anticipation of the impact of
study design on IP procurement and distribution may lead to insurmountable
roadblocks during study implementation, including excessive costs, extended time-
lines, compliance problems, and logistical issues at both the Sponsor and site level.
Although clinical trials are primarily conducted to examine the safety and efficacy
of a new active ingredient, studies are also often conducted using an IP with an
ingredient that is approved by a competent authority (CA), such as the US Food and

Drug Administration (FDA), for use in a specific indication (combination of a


disease and patient population). These latter types of clinical trials are often
performed to assess the safety and efficacy of an approved drug or biologic outside
of its initial indication or to gain further information about an approved use, such as
when given via a new route of administration, at a higher dose, or in combination
with another therapy. As it can be purchased “off the shelf,” using an approved IP
in a clinical trial may seem simpler than using an unapproved IP at first glance;
however, these trials come with their own set of challenges and potential pitfalls
surrounding procurement and distribution, which must be carefully considered
during study planning.
Although an IP may be available commercially via prescription from a pharmacy,
a Sponsor supporting a clinical trial involving IP may still need to file an
Investigational New Drug (IND) application (or regional equivalent) to the FDA
or local CA, depending on the indication to be studied and the goal of the clinical
trial. For example, in the USA, while some criteria to be exempt from filing an IND
as per 21 Code of Federal Regulations (CFR) 312.2 are straightforward (e.g., not
intending to use the data collected to support a new indication or change in
advertising for the IP), it can be challenging to present an argument that a clinical
trial does not “significantly increase the risk (or decrease the acceptability of the risk)
associated with the use of the drug product” (21 CFR 312.2(b)(iii)). Some typical
examples of IND trials using marketed medication are shown in Fig. 1. If ever in
doubt about whether an IND or equivalent is required, Sponsors should seek
guidance from the CA with oversight over the clinical trial (e.g., the FDA can
provide this guidance by way of a pre-IND meeting and a written request for IND
exemption or via formal and/or informal communications).
Regardless of the requirement to file an application with the local CA prior to trial
initiation, adherence to ICH GCP guidelines and all applicable in-country regulatory
requirements (e.g., federal and state laws in the USA) is paramount to protect the
rights and well-being of human subjects taking part in the clinical trial and to ensure
the integrity of the data collected. This includes adherence to applicable Current
Good Manufacturing Practices (cGMPs, 21 CFR 210 and 211 in the USA, and ICH
Q7, 2000) during the manufacture of blinded medication (such as manipulation of
commercially sourced medication for blinding purposes, manufacture of a matching
placebo, etc.), repackaging or relabeling IP, and appropriate tracking of inventories
at and shipments from the distributor/central pharmacy and clinical site. This chapter

• An increase in the daily dose or dosing duration of a medication.
• Utilization of a new combination of therapies that could increase the risks of use
  compared to either single therapy alone.
• A new participant population outside of the approved indication, e.g., children,
  pregnant women.
• An indication not currently approved.

Fig. 1 Examples of clinical trials using marketed medication usually requiring an IND or regional
equivalent
172 E. Hardter et al.

will outline the main components and points to consider for IP procurement (includ-
ing sourcing, manipulation of dosage forms, and compounding) and distribution
(tracking, restocking, and destruction) throughout the life of a clinical trial.

Procurement of Investigational Product

Investigational Product Procurement Planning

The extent to which a study drug must be manipulated for the clinical trial will
dictate selection of not only an initial source of the commercially available drug
but also the requirement for all other IP-related vendors or suppliers. Thus, it is
critically important for the Sponsor to decide upon all IP-related protocol aspects
during the planning phase, prior to selecting a supplier, and to make minimal
changes to the protocol that can impact IP during study conduct. For small open-
label clinical studies where IP is administered once, IP procurement may be as
simple as the on-site physician ordering the medication from an appropriate
commercial vendor or pharmacy and dispensing to participants, tracking lot
numbers as per institutional practices. However, IP procurement requirements
can quickly become more complicated for later phase trials (phase 2 and 3),
which can last longer, require blinded medication, and/or have many participat-
ing sites (national and international). This complexity can be compounded further
by IP-driven storage requirements (controlled, refrigerated, or frozen
medication).
Once the initial protocol design is finalized, the identification of a suitable
commercially approved drug or biologic is the first step in procurement planning.
This should be a dosage form (tablet, capsule, liquid, etc.), strength, and formula-
tion (oral, injectable, topical, etc.) suitable for use in the study, given the proposed
schedule and route of administration, with factors such as color, taste, shape, etc.,
as well as the availability of immediate-release or extended-release formulations,
taken into consideration (if appropriate). Pricing of each of the available options
for the study should then be performed to allow an initial check against the study
budget. If the proposed clinical trial is being conducted by an academic institution
or public health agency, it can be worthwhile to approach the pharmaceutical
company manufacturing the drug to ask about any programs through which the
IP needed to conduct the trial may be obtained for free or at a lower cost. In such
situations, the company donating the drug may dictate the packaging/labeling for
their IP and the process that must be utilized to supply medication to the study
sites. A detailed agreement should be in place regarding the provision of IP; the
requirement, if any, for clinical sites to return unused medication to the manufac-
turer; the ability of the trial Sponsor to cross-reference the manufacturer’s inves-
tigational or marketing application as required; and any specific safety reporting
related to product quality that the manufacturer requires for their post-marketing
obligations.

Use of a Generic Drug

In many countries around the world, innovative drugs are protected from generic
intrusion by a patent or a period of marketing exclusivity. In the USA, the former is a
legal protection obtained through and afforded by the US Patent and Trademark
Office, while the latter is provided to a manufacturer by FDA. Both have the ability
to prohibit competitors from seeking approval of a drug or biologic therapeutically
equivalent to the innovator drug, which often limits drug availability and can
keep prices high. International regulatory authorities, such as Health Canada and
the European Medicines Agency (EMA), have similar data exclusivity protections.
Having only a single source of IP can not only make IP prohibitively expensive but
may also delay or halt the study if there is a market shortage of the drug.
If neither a patent nor marketing exclusivity applies, purchasing options may
increase to potentially include generic versions of an IP. Prior to marketing
authorization, generic drugs must be considered therapeutically equivalent to the
innovator drug. The FDA considers drugs pharmaceutical equivalents if they contain
the same active ingredients, are of the same dosage form and route of administration,
are formulated to contain the same amount of active ingredient, and meet the
same compendial or other applicable standards (i.e., strength, quality, purity, and
identity). Generic drugs will differ in characteristics such as shape, scoring config-
uration, release mechanisms (for immediate- or extended-release formulations),
packaging, excipients (including colors, flavors, preservatives), expiration dating,
and, within certain limits, labeling, all important factors to take into consideration
when selecting a generic version of an approved drug to use as an IP. In the USA,
therapeutically equivalent generic drugs will receive an “A” rating in the FDA
Approved Drug Products with Therapeutic Equivalence Evaluations book (also
known as the Orange Book).

Considerations for IP Procurement/Manipulation in Blinded Trials

Blinded studies will require additional consideration as the drug or comparator will
be manipulated prior to being used in the trial, e.g., covered with another color to
obscure identifying markers (i.e., inking or debossing) to allow the manufacture of
a matching placebo (▶ Chaps. 43, “Masking of Trial Investigators” and ▶ 44,
“Masking Study Participants”). A blinded study design is used to reduce bias and
involves the study participants being unaware of which treatment assignment or
study group they are randomized to (single-blind) or all parties (the Sponsor,
investigator, and participant) being unaware of a participant’s treatment assignment
(double-blind). For placebo-controlled studies, a commercially sourced IP must be
disguised, and a matching placebo must be manufactured, if not already available
from the commercial IP manufacturer as part of their study support. In comparison,
open-label medication trials do not disguise the IP, as both the participants and
investigators are aware of the assigned treatment. IP can be obtained and managed in
a more straightforward fashion, both during drug procurement and throughout the
implementation of open-label clinical trials.
Generic medications, each of which varies in shape, color, and/or debossing/
imprinting (for tablets), can provide additional challenges or potential benefits
for blinding in a clinical trial, as certain shapes/sizes may be easier to insert
into a capsule, to replicate in a placebo, or to disguise for blinded use; e.g., a tablet
that has a letter or logo printed in ink on the surface can be easier to disguise by
overspraying than a tablet with a similar marking which is debossed, since the latter
contains a gap that must be filled.
Once a marketed product has been selected, securing a reliable IP source for the
entire duration of the study is the most important next step. Key parameters to
consider are the lead time for procuring the IP, quantity available, the time needed
to manipulate it (e.g., spray coating, repackaging), and the expiration date of the IP
available. Lead time and available quantity could be subject to change pending a
potential shortage of drug. The time needed to get the IP ready to ship to sites is
dependent on the degree of manipulation, whereas expiration date of the IP is
directly tied to its stability profile, with drug wholesale companies usually providing
their “oldest” stock for shipment over stock which can remain on their shelves
longer. While the use of generic medication may reduce initial costs, availability
over an extended period may still become an issue for longer trials, necessitating IP
restocking. Further, generic drugs can be removed from the market without warning,
affecting the entire supply chain for a clinical trial. Thus, it is important to consider
the longevity of generic manufacturing (i.e., the likelihood of continuation of IP
manufacture for the duration of the trial) prior to selecting a manufacturer.
In a blinded trial, the purchased IP (and active comparator, if one is available)
must be manipulated (and the matching placebo manufactured) prior to study start.
For example, an IP in tablet form may be obscured via recoloring (e.g., overspraying)
to cover identifying markers/debossing and to allow the manufacture of a
matching placebo. An injectable medication, however, may only require relabeling
if the color can be matched with a placebo. The extent to which purchased IP is
manipulated for the clinical trial will dictate the selection of an appropriate supplier
or vendor for this manipulation, set the timeline from procurement to shipping to the
clinical site, and also set expectations for IP-related data to be collected during the
study. For example, manipulation of a study drug may require release and ongoing
stability testing to ensure its continued identity, purity, and potency.

Impact of IP-Related Factors: Controlled Substances

If the IP is a controlled substance, there are additional steps and/or regulations
surrounding the preparation, testing, shipment, storage, dispensation, and return
for destruction of the drug that the Sponsor must consider during protocol design.
This includes the requirement for registration of parties handling the IP (site, central
pharmacy, manufacturing facility, etc.) with applicable local authorities, such as the
Drug Enforcement Administration (DEA) in the USA.

Identification of Qualified Vendors

By purchasing a commercially available drug or biologic, a Sponsor gains assurance of
the identity, potency, purity, sterility, and stability of the purchased IP, as continued cGMP
compliance is a condition of marketing authorization. Any IND, or similar filing to a
CA or Institutional Review Board (IRB), should therefore reference the marketing
application for the commercial product to be purchased for the trial. The manipula-
tion of the purchased product (active IP and active comparator, where applicable)
to make it suitable for use in the clinical trial must also be described in these filings,
including manufacture of a matching placebo, where applicable. It is the Sponsor’s
responsibility (Sponsor Requirements) to ensure that this manipulation is conducted
according to the applicable regulations and any potential impact on the product
attributes (identity, potency, purity, sterility, and stability) is identified and managed
appropriately. Planning activities involve the identification of tasks required to
support IP quality followed by the identification of reliable and qualified vendors
who can conduct their assigned tasks according to the applicable regulations, i.e.,
cGMPs, according to the timelines dictated by the study and within the budget
assigned.
Depending on vendor expertise and logistics (e.g., IP storage and transport),
study-related IP activities involving multiple steps (e.g., overcoating tablets,
manufacturing matching placebo, packaging/labeling, and release and stability
testing) can be carried out at a single facility or multiple facilities. A facility with the
broadest ability to do all required work is likely to be a licensed cGMP manufacturer
registered with a CA, such as FDA, with experience in the required activities, with
all applicable tasks occurring in a cGMP-compliant manner. However, full-service
facilities can be busy and therefore difficult to schedule, pushing up costs and
extending the timelines.
For IP that is produced on a very small scale, a pharmacist-led facility, such as a
503A compounding pharmacy in the USA, may be an option. Classically, both in
routine clinical care and for clinical trials, production of IP in one of these facilities is
done on a per-patient, per-prescription basis and is performed by a pharmacist to
tailor a medication to a specific use, e.g., to produce a topical formulation of an
active ingredient usually given orally. The constraint of requiring individual pre-
scriptions for each participant effectively precludes the utilization of these facilities
in larger clinical trials. For slightly larger studies, a licensed outsourcing facility
(termed a 503B outsourcing facility in the USA) would be a better option. Since
these facilities must adhere to cGMPs, assurance is provided to study Sponsors
regarding the identity, purity, and potency of the study IP. Common activities at such
facilities may include procedures such as re-encapsulation of drug capsules and
overcoating of drug tablets to obscure their appearance, as well as production of
matched placebo.
The more a commercial drug product is manipulated, the greater the likelihood
that these activities will have an impact on the quality attributes of the IP used in the
trial. These trials therefore require management of this risk, including release testing
and subsequent monitoring during the study.

Packaging Considerations

The IP should be packaged to prevent contamination and unacceptable deterioration
during transport and storage (ICH E6 5.13.3). The packaging configuration should
also take into consideration how the IP will be provided to the study participants.
Blister packages or IP kits are often useful for IP which needs to be tapered up and
down, ensuring the correct dose is taken; however, larger packs may be wasteful if a
participant drops out or a kit is lost.
For both open-label and blinded studies, IP repackaging and relabeling activities
are very likely required. In the USA, FDA describes repackaging as the act of taking
a finished drug product from the container in which it was distributed by the original
manufacturer and placing it into a different container without further manipulation
of the drug (FDA Guidance for Industry: Repackaging of Certain Human Drug
Products by Pharmacies and Outsourcing Facilities 2017). This activity can be
performed by a vendor/supplier or by a licensed study pharmacist. Alteration of
exterior packaging, such as a secondary box for medication contained in a blister
pack, would not be considered a repackaging activity; such handling carries a very
low risk of impact on the IP, as long as storage
instructions are followed during the repackaging activities. However, repackaging
activities during which tablets are transferred into a different bottle can impact the
stability profile; similarly, repackaging a sterile liquid into smaller single-use vials
can impact sterility. Therefore, the identification of the requirement for specialized
vendors (e.g., for sterile repackaging or handling medication which is refrigerated/
frozen) and for stability testing, including testing performed, time points, storage
conditions, and acceptance criteria, should be defined prior to IP manipulation.

Manufacturing and Packaging Considerations for Blinded Trials

If, for the purposes of blinding, a commercial drug product is inserted into a capsule,
sprayed to change color, etc., the impact of these changes should be taken into
consideration and testing performed to ensure that the quality attributes of the IP are
maintained, protecting both patient safety and study integrity. For example,
depending on its dissolution rate, placing a tablet into an opaque gelatin capsule
for blinding purposes may change the rate of drug release and therefore impact the
onset of drug action, which can be important if a drug has a narrow therapeutic
window; for a capsule product, re-encapsulation (emptying the current capsule and
transferring its content to a new capsule) may be a preferable option. An IP that is sensitive to light
should be repackaged under appropriate conditions, utilizing amber/opaque bottles or
suitable blister packages. While placebo has no expectation of potency, it typically still
needs to undergo testing for characteristics such as sterility, appearance, and odor
during stability studies. Neither drug nor placebo should be released for use in the
study until it meets all applicable release testing requirements, with testing usually
performed according to local Pharmacopeia monographs (standards for identity, qual-
ity, purity, strength, packaging, and labeling).

Certain commercially available drug products may contain active ingredients
with defining characteristics (e.g., odor, color, taste). For placebo-controlled studies,
Sponsors should ensure that a manufacturing vendor can mimic these characteristics
in the matched placebo as much as possible. As a resource, FDA maintains databases
of allowable inactive ingredients, excipients, and substances generally recognized
as safe (GRAS) for the development of a placebo, which manufacturers can utilize to
ensure the placebo mirrors the study drug in all facets, excepting potency.
Each time a commercial drug product is manipulated, e.g., emptied from a bottle
or sprayed to change its color, there is a possibility that some supply may be diverted
from the clinical trial due to loss during these procedures (including equipment and
user errors, tablet breakage, etc.). Additional supply may also be needed for process
development and in-process testing. Lastly, when required, additional supply of the
final IP may need to be diverted and placed into stability testing for the duration of
the clinical study. To account for these activities, the total amount of IP needed
should be increased by at least 10–20% during the Sponsor’s estimations.
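As a rough illustration of this supply arithmetic, the Python sketch below estimates a procurement total from per-participant dosing plus allowances for stability testing, process development, and manipulation losses; all quantities are hypothetical and would in practice come from the protocol and the vendor.

# Hypothetical IP supply estimate; all quantities are illustrative only.
participants = 200              # planned enrollment
tablets_per_participant = 180   # e.g., 2 tablets/day for 90 days
clinical_need = participants * tablets_per_participant

stability_reserve = 2000        # tablets diverted to ongoing stability testing
process_development = 1000      # process development and in-process testing

overage_rate = 0.15             # 10-20% allowance for losses during manipulation
total_to_procure = (clinical_need + stability_reserve + process_development) * (1 + overage_rate)

print(f"Clinical need: {clinical_need:,} tablets")
print(f"Total to procure (with {overage_rate:.0%} overage): {round(total_to_procure):,} tablets")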
Overall, given the breadth of possible IP manipulation activities that may take
place for a given clinical trial, the selection of the right vendors is critical, as they are
going to be key partners throughout the clinical trial. Indeed, if a facility is deter-
mined to be non-compliant after initial selection, there may be loss of study drug and
a delay in timelines while deficiencies are corrected. Expectations for the quality of the IP
and the timelines associated with key steps during IP manipulation must be clear
to both sides at project start. In the worst case, study start may be delayed, while
alternative vendors are identified and qualified. In 2015, for example, an FDA
inspection of the Pharmaceutical Development Section of the NIH Clinical Center
triggered the inactivation and premature closure of multiple NIH intramural studies,
with others delayed pending the identification of alternate vendors. This example
serves to highlight the overarching significance of the selection of the right vendor.
Managing the risk of such an adverse outcome requires a proactive approach,
such as checking public inspection databases, such as the FDA website, for records
of previous non-compliance (e.g., Warning Letters and Forms FDA
483) for a specific manufacturing facility and also ensuring the vendor is able to
conduct the tasks as assigned, usually via the vendor qualification process,
performed as per the Sponsor’s standard operating procedures (SOPs). Depending
on the scope of activities, an on-site audit can be part of this process, performed by
representatives with expertise in regulatory affairs, quality assurance, and cGMP
compliance. Timing of the audit should ideally be early during the planning stage if
the vendor is not already qualified, so that it allows time for appropriate follow-up
and corrections to procedures/practices and reauditing if necessary.

International Clinical Trials

As discussed in the “Investigational Product Procurement Planning” section,
multisite trials with international locations have the potential to add elements of
complexity (▶ Chap. 19, “International Trials”). The primary consideration for such
studies surrounds the regulatory status of the commercial product chosen (and the
active ingredient) in each country (e.g., approved for marketing, approved but no
longer available, etc.), as this influences the ability to import IP and the requirement
for a clinical trial application locally. For example, a multinational study utilizing IP
which is FDA-approved may only require IRB oversight in the USA, without an
IND application, provided the study meets all criteria in 21 CFR 312.2. However, if
the same drug does not have marketing approval in Canada, it will require
full reporting to Health Canada under a Clinical Trial Application (CTA), an
assessment by the study Research Ethics Board, and an environmental assessment
by Environment Canada. Such regulatory approvals, or lack thereof, influence drug
sourcing options. In the scenario described above, IP would need to be exported
from the US manufacturer and imported into Canada. This requires prior approval of
the CTA and appropriate labeling of the exported drug, along with sign-off from an
importing agent in Canada (who must be a Canadian resident); otherwise, the IP will
be seized at the border by the Canada Border Services Agency, which works in
conjunction with Health Canada.
Similarly, IP manufacturing requirements may differ between countries. While
ICH member states typically overlap in this regard, small differences may result in
additional compliance requirements. For example, an IP manufactured in the USA and
intended for import to a European Union member state (e.g., Germany and France),
for a clinical trial, will require release by a qualified person (QP). The QP is
responsible for verifying that the IP meets a certain degree of cGMP compliance
for import into the country and thus will likely require access to IP batch records to
determine cGMP adherence. In some instances, the QP may wish to assess the batch
manufacture in person, depending on the risk of the activities undertaken during the
manufacturing process. It should be noted that, while the above scenario remains
plausible, mutual recognition agreements often exist between CAs (typically
between ICH members). These agreements effectively state that the competent
regulatory authority from an importing country will defer to a cGMP inspection of
the competent regulatory authority from the exporting country (or from another ICH
country, if such an inspection has been performed), without necessitating additional
inspection. Therefore, choice of a commercially available drug manufactured by a
company that has already obtained marketing authorizations for it in the countries to
be used in the clinical trial may lead to a quicker study start-up, as the local CAs will
be familiar with the IP and only the manipulation for the clinical trial will need to be
described.

Distribution of Investigational Product

The tracking of IP inventory available at the distributor/central pharmacy and of
distribution to each clinical site is important not only for compliance with ICH E6
GCP but also to ensure that IP is available for each participant. Tracking can be
achieved in several ways, such as by utilizing a system provided by the central IP
distributor or vendor. Regardless of the distribution method, basic measures of drug
management and accountability must be followed. Particular IP characteristics and
study design parameters, such as the use of controlled substances or whether the
study is blinded, can also influence the process.

Documents to Support Release of IP to Qualified Sites

As per ICH E6 GCP, many documents must be generated and be on file prior to
study start, which is often considered the initial shipment of IP to a clinical site.
These documents include those relating to the release of IP by the manufacturer,
for example, a Certificate of Analysis, which ensures that the IP fulfills the quality
attributes set for it. A subset of documents is also collected from the site
and includes documents related to the investigator’s ability to conduct the trial
(documentation of relevant qualifications and training), the favorable review of the
study protocol and other documents by the IRB or Independent Ethics Committee
(IEC) (▶ Chap. 36, “Institutional Review Boards and Ethics Committees”), and an
agreement to follow the study protocol, including the requirements for the reporting
of adverse events. These documents should be reviewed for their accuracy and
suitability to support study conduct prior to authorizing IP shipment and will often
include ensuring that the site has the appropriate documents and training for han-
dling IP, including IP disposition logs. If the IP is a controlled medication, the
applicable local registrations for the site to receive and prescribe controlled sub-
stances, such as DEA registration in the USA, are particularly important and will
have to be provided to the central distributing facility to ensure they comply with the
facility’s SOPs.
In a clinical trial with a small number of sites, it may be feasible to collect and
manage these regulatory documents using a paper system; however, in a larger-scale,
multicenter clinical trial, the collection and management of regulatory documents
throughout the life of the study is more challenging and complex. When multiple
sites are involved, the use of an electronic trial master file (eTMF) and a linked
clinical trial management system (CTMS) can help facilitate this task and assist
the Sponsor in maintaining compliance with applicable regulations (Zhao et al.
2010). For example, some systems can automatically trigger alerts and notifications
for upcoming expiration dates of documents filed in the system, as well as generate
reports that can be used to quickly identify any missing documents. Only once
all required documents have been collected and the required national and local
approvals are in place can the Sponsor or designee supply an investigator or
institution with IP. Setting clear expectations early in study start-up for the number
and quality of documents to be collected from the site prior to IP shipment is key to
on-time study start. Also, routine, frequent monitoring of site documents, including
IP logs, will help ensure that the documents are maintained and are inspection
ready at the end of the study.
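As a minimal sketch of the expiration-alert logic such systems provide (this is not any particular eTMF or CTMS product's API; the document names and dates are hypothetical):

from datetime import date, timedelta

# Hypothetical site regulatory documents with expiration dates.
site_documents = [
    {"site": "Site 01", "doc": "Medical license (PI)", "expires": date(2024, 9, 30)},
    {"site": "Site 01", "doc": "IRB approval", "expires": date(2024, 7, 15)},
    {"site": "Site 02", "doc": "DEA registration", "expires": date(2025, 1, 31)},
]

def expiring_documents(docs, as_of, lookahead_days=60):
    """Return documents expiring within the lookahead window, soonest first."""
    cutoff = as_of + timedelta(days=lookahead_days)
    return sorted((d for d in docs if d["expires"] <= cutoff), key=lambda d: d["expires"])

for d in expiring_documents(site_documents, as_of=date(2024, 7, 1)):
    print(f'{d["site"]}: {d["doc"]} expires {d["expires"]:%d-%b-%Y}')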

Use of Controlled Substances in Clinical Trials

The World Health Organization (WHO) provides guidance on the scheduling
of substances that can be of potential harm due to their psychoactive and/or depen-
dence-producing properties, as well as providing technical expertise to the United
Nations (UN) on drugs of abuse under the United Nations Single Convention on
Narcotic Drugs (1961) and the United Nations Convention on Psychotropic Sub-
stances (1971). These two treaties, along with the United Nations Convention
against the Illicit Traffic in Narcotic Drugs and Psychotropic Substances (1988),
provide the legal basis for the international prevention of drug abuse (WHO 2018).
The UN system categorizes narcotics and psychotropic substances into four
classifications, based on their harmfulness, which includes the risk of abuse and
other health dangers. Many countries and regions have their own classification
systems for controlled substances. In one article that investigated classification
systems across 23 countries (including countries in North America, Western Europe,
the Middle East, and Asia), it was found that the range of controlled substance
schedules varied from 2 to 15 different schedules of drug (Dragic et al. 2015);
therefore, local regulations should be carefully reviewed to assess the local
requirements for clinical trials utilizing IP which could be considered a controlled
medication.
In the USA, the DEA classifies drugs, substances, and certain chemicals used
to make drugs into five categories or schedules depending upon the drug’s
acceptable medical use and abuse or dependency potential (21 CFR 1308.11–
1308.15). Schedule I drugs have a high potential for abuse and the potential to
create severe psychological and/or physical dependence (e.g., heroin), while
Schedule V drugs represent the least potential for abuse (e.g., cough preparations
containing not more than 200 mg of codeine per 100 ml or per 100 g, such as
Robitussin AC) (US DEA 2018; Fig. 2). In the USA, all clinical trials involving
controlled substances are regulated under the federal Controlled Substances Act
(CSA). In addition, many states have local regulations regarding the use of
controlled substances in clinical trials, and often these local regulations can be
more restrictive than federal guidelines. For example, California Law requires
that any clinical investigation involving either Schedule I or Schedule II medi-
cation as the main study drug be reviewed and approved by the Research
Advisory Panel of California in the Attorney General’s office (State of California
Department of Justice 2018).
In the USA, there are eight key control measures that directly impact all clinical
trials with controlled substances (Fig. 3). These include the scheduling of the drug (I
through V), registration and licensing of investigators, importation and exportation
controls, setting quotas for Schedule I and II substances, and on-site security
measures designed to restrict access to controlled substances. Additional control
measures involve strict and rigorous record keeping requirements for all controlled
substances, reporting requirements for theft or loss of these substances, and DEA
inspections (Woodworth 2011). Investigators outside the USA should review local
laws and look for a similar level of control.

Schedule I: No currently accepted medical use in the US, a lack of accepted safety for use
under medical supervision, and a high potential for abuse. Examples: heroin, lysergic acid
diethylamide (LSD), marijuana (cannabis).

Schedule II/IIN: High potential for abuse which may lead to severe psychological or physical
dependence. Examples: hydromorphone, oxycodone (Schedule II); amphetamine (Adderall®)
(Schedule IIN (stimulants)).

Schedule III/IIIN: Potential for abuse less than substances in Schedules I or II; abuse may lead to
moderate or low physical dependence or high psychological dependence. Examples: Tylenol with
Codeine®; buprenorphine; ketamine.

Schedule IV: Low potential for abuse relative to substances in Schedule III. Examples: alprazolam
(Xanax®); diazepam (Valium®).

Schedule V: Low potential for abuse relative to substances listed in Schedule IV; consists
primarily of preparations containing limited quantities of certain narcotics. Examples: cough
preparations containing not more than 200 milligrams of codeine per 100 milliliters or per
100 grams (Robitussin AC®).

Source: US DEA Diversion Control Division 2018.

Fig. 2 List of US DEA categories or “schedules.” Drugs are categorized into schedules based on
their acceptable medical use and abuse or dependency potential (United States Department of
Justice Drug Enforcement Administration, Diversion Control Division 2018)

1. Scheduling of the drug or substance: See Fig. 2. The parties involved in determination of drug
classification or scheduling are the US Department of Health and Human Services or HHS
(including FDA and NIDA) and DEA (Title 21, CSA, 811(b)).

2. Federal registration and state licensing of clinical investigators to prescribe, dispense,
administer, and conduct research with controlled substances:
Registration: A federal DEA practitioner registration is required for an investigator to handle
controlled substances in any manner. A DEA “Practitioner” registration is valid for three years,
and a separate registration is required for each principal place of business or professional
practice at one general physical location where controlled substances are manufactured,
distributed, imported, exported, or dispensed by a person (21 CFR §1301.12(a)).
DEA registrations for receipt of medication at a clinical research site are required when using
scheduled drugs as investigational products, subject to DEA regulations. There are several types
of DEA registrations (designations) that can be used for receipt of medication (e.g., Pharmacy,
Hospital/Clinic, etc.). An individual DEA registration (e.g., Practitioner, Researcher) may be used
for shipping and receipt of study medication as long as the address of record is the site address
or the local pharmacy where the drug is to be shipped and received.
Licensing: At the state level, licensing is generally accomplished via a license that covers clinical
practice and research; however, requirements may vary by state.

3. Importation and exportation controls: In general, the nation exporting controlled substances
must obtain a written permit or other form of permission in advance for each transnational
shipment from the country to which the substance is being shipped (Article 31, Single Convention
on Narcotic Drugs, 1961, and Article 12, Convention on Psychotropic Substances, 1971).

4. Setting quotas for Schedule I and Schedule II substances: The Controlled Substances Act
requires the DEA to determine the maximum amount of Schedule I and Schedule II controlled
substances that may be manufactured in the USA every calendar year (Title 21, United States
Code, Section 826).

5. Security measures that must be put in place at the clinical site to restrict access to controlled
substances: Investigators are required to store controlled substances in a “securely locked,
substantially constructed cabinet” (21 CFR 1301.75) and to segregate them (e.g., separate box,
separate shelf) by clinical study (if more than one) and by DEA registration number. “Double
lock” security, such as a locked cabinet inside a locked room, is recommended as a best practice.
Furthermore, Schedule I substances cannot be stored with Schedules II–V, and all controlled
substances must be separated from non-controlled medications.

6. Comprehensive, stringent recordkeeping requirements for all controlled substances: Once a
controlled substance is received on site, the PI is required to maintain a detailed inventory (i.e.,
physical count) of all controlled substances on hand at that site. Inventory must be taken at least
once every two years; however, it is recommended that investigators complete this much more
frequently (such as every time medication is dispensed or returned, or biweekly/monthly)
(21 CFR 1304.11). The records must include the name and address of the person to whom the
medication was dispensed, the date of dispensing, the number of units or volume dispensed, and
the written or typewritten name or initials of the individual who dispensed or administered the
substance (21 CFR 1304.22(c)).

7. Reporting requirements: Although there are several reporting requirements that those involved
in the drug supply chain must meet, investigators are only required to report “any theft or
significant loss of a controlled substance” to the DEA (Title 21, Code of Federal Regulations,
Section 1301.76(b)).

8. DEA inspections: The DEA typically only performs routine investigations of clinical sites when
a Schedule I controlled substance is being used in a trial.

Fig. 3 Key control measures impacting clinical trials with controlled substances (adapted from
Woodworth 2011)

IP Inventory Management for Complex Study Designs

Inventory management is driven primarily by study design, the specific drugs used
(special handling requirements, e.g., controlled substances, frozen/refrigerated), the
number of clinical sites, and the countries involved. For example, as mentioned,
IP for an open-label study conducted at a single site can be prescribed locally
and dispensed as needed to the participants following institutional procedures.
Conversely, more intricate study designs, particularly those involving blinded IP,
increase the complexity of the drug supply process. To accommodate these com-
plexities, electronic systems have evolved to include more streamlined, auditable,
and user-friendly drug assignment and distribution processes. Systems such as a
CTMS, IRT (interactive response technologies), and/or electronic data capture
(EDC) systems (▶ Chap. 13, “Design and Development of the Study Data System”)
are used alone or in conjunction with each other, interfacing and automatically
updating to assign IP to a participant, track site inventory, and request resupplies.
The key benefits to these systems are the ability to monitor the IP in real time, the
immediate allocation of IP (bottle, kit, or single dose) available at the site to a
participant according to an overall randomization scheme, tracking of expiration of
IP, and the possibility for automatic replenishment of IP once a predetermined
threshold is reached. In blinded studies, the EDC can eliminate the need for any
direct staff involvement in resupply. One component of these systems is careful
planning in setup of not only the system but also the IP itself, as the Sponsor must
ensure that any system-specific information, such as a bottle or kit identifying
number and corresponding barcode for electronic systems, is included to allow the
IP to be tracked. Overall, these systems can help minimize last-minute supply
requests, oversupply at a single site, and waste at clinical sites.
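To make the allocation step concrete, the sketch below mimics how an interactive response system might assign the next available kit at a site once a participant's arm is determined by the randomization scheme; the kit identifiers and arms are hypothetical, and real IRT systems layer blinding, stratification, and full audit trails over this logic.

import random

# Hypothetical on-site kit inventory: kit ID -> treatment arm (masked in a real blinded trial).
site_kits = {"K001": "A", "K002": "B", "K003": "A", "K004": "B"}
dispensed = set()

def assign_kit(arm):
    """Allocate the lowest-numbered available kit matching the randomized arm."""
    for kit_id in sorted(site_kits):
        if site_kits[kit_id] == arm and kit_id not in dispensed:
            dispensed.add(kit_id)
            return kit_id
    raise RuntimeError(f"No kit on site for arm {arm}; trigger automatic resupply")

arm = random.choice(["A", "B"])  # stand-in for the study's randomization scheme
print(f"Participant randomized to arm {arm}; dispense kit {assign_kit(arm)}")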

Site (report date)           Supply Name             Current Inventory   Threshold   Expiration Date
Research Site A (9-Aug-18)   XR-NTX Medication Kit   10                  5           8/31/2018
                             Suboxone 4mg strips     750                 400         12/31/2019
                             Suboxone 8mg strips     300                 350         11/30/2019
Research Site B (9-Aug-18)   XR-NTX Medication Kit   8                   5           8/31/2018
                             Suboxone 4mg strips     615                 400         12/31/2019
                             Suboxone 8mg strips     425                 350         11/30/2019
Research Site C (9-Aug-18)   XR-NTX Medication Kit   4                   5           5/31/2019
                             Suboxone 4mg strips     375                 400         12/31/2019
                             Suboxone 8mg strips     428                 350         11/30/2019

Fig. 4 Example Inventory Form for IP/study medication. IP levels at or below threshold or past
their expiration date appear in red

IP Accountability

One method for utilizing electronic IP management systems is to ask research staff
to report IP inventory periodically (e.g., weekly) directly in the EDC system. These
data are subsequently pulled into specifically programmed reports, which can
be reviewed to identify reorder needs based on predetermined thresholds and
usage at each site. Color-coding of these reports is useful for quick visual identifi-
cation of supplies that are nearing expiration or are below the desired threshold.
Of note, to maximize efficiency, inventory forms may also include study supplies
other than the IP (e.g., blood draw equipment; not reflected in Fig. 4).
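The flagging logic behind a report like Fig. 4 can be sketched as follows; the rows echo the hypothetical values in the figure, and a production report would render the flags in red.

from datetime import date

# (site, supply, on-hand count, threshold, expiration date) -- hypothetical values from Fig. 4.
rows = [
    ("Research Site A", "XR-NTX Medication Kit", 10, 5, date(2018, 8, 31)),
    ("Research Site C", "XR-NTX Medication Kit", 4, 5, date(2019, 5, 31)),
    ("Research Site C", "Suboxone 4mg strips", 375, 400, date(2019, 12, 31)),
]

def flags(on_hand, threshold, expires, as_of=date(2018, 8, 9)):
    """Return the reorder/expiration flags for one inventory row."""
    out = []
    if on_hand <= threshold:
        out.append("AT/BELOW THRESHOLD")
    if expires < as_of:
        out.append("EXPIRED")
    return out

for site, supply, on_hand, threshold, expires in rows:
    status = ", ".join(flags(on_hand, threshold, expires)) or "OK"
    print(f"{site:<16} {supply:<24} {status}")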
Shipments should be carefully timed, and the appropriate amount of IP (taking site
storage capacity and projected enrollment rate/target into account) should be distrib-
uted with each shipment to minimize the amount of IP and other supplies left unused
at the site while ensuring there are sufficient supplies on site for active study
participants. The latter can often be predicted based on participant enrollment rates
at the site and the IP distribution schedule delineated in the study protocol. The right
shipment size and timing will reduce shipping costs and waste.
Drug accountability is more than simply counting pills; it goes hand in hand
with inventory management and refers to the record keeping associated with the
receipt, storage, and dispensation of an investigational product. When done
correctly, it should provide a complete and accurate accounting of drug handling
from initial receipt on site to final disposition (e.g., utilization in the study, return,
or destruction). While the study Sponsor is responsible for procurement of
study medication (Sponsor Requirements) (including the processes delineated at
the start of this chapter), once at the sites, it is the responsibility of the principal
investigator (PI) to maintain adequate records of the product’s handling and dispen-
sation (ICH GCP 4.6.1) (▶ Chap. 6, “Investigator Responsibilities”). In accordance
with ICH GCP, the PI can choose to delegate responsibility of IP accountability to an
“appropriate pharmacist or another appropriate individual” of whom they have
oversight (ICH GCP 4.6.2). It should be ensured that this delegation and the
qualifications of the individual are documented appropriately.
By nature, working with human research subjects introduces participant-level
error with regard to study drug accountability, particularly when IP is “sent home”
with the participant. For example, participants might forget doses, misplace or lose
some or all of the study IP, share IP with others, or sell it illegally. To account for
these situations, clinical trial protocols often require participants to return any
unused IP at designated intervals throughout the study (e.g., weekly, monthly)
before providing them with more. Once IP is returned, qualified study staff
complete an inventory (e.g., number of remaining capsules/tablets) to evaluate
medication compliance and perform IP accountability (the latter is particularly
important when the IP is a controlled substance). To address the expected level of
participant error in IP management during a clinical trial, there are certain strate-
gies that study Sponsors can employ to attempt to increase medication adherence
and enhance subsequent drug accountability. These methods may range from basic
paper-and-pencil record keeping, such as providing participants with a small
calendar to mark the dates and times that they took their medication, to leveraging
technology-based options. One such option involves real-time medication adher-
ence monitoring by using a “smart” pill bottle, which may perform tasks such as
indicating or recording whether the patient took their medication on schedule (e.g.,
through a glowing light or via a counting tool built into the bottle cap), measuring
the time that elapses between doses using a stopwatch function, or even sending
automatic medication reminders to patients via text message on their smart phone
(Choudhry et al. 2017). Another technology-based option is providing patients
with a QR (quick response) code to access easy-to-understand pharmacist counsel-
ing videos, which guide patients through how to take the medication, any potential
side effects, etc. (Yeung et al. 2003). Although all materials provided to a patient
during a clinical trial typically require IRB review, an IRB may request additional
oversight of compliance and require that a participant take daily videos of them-
selves taking the study medication and securely share these with the study team for
compliance measurement purposes. Potential study participants will be informed
of such a mechanism during the informed consent process (▶ Chap. 21, “Consent
Forms and Procedures”).
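As a simple illustration of the pill-count arithmetic behind the compliance check described above (the dosing figures are hypothetical):

# Hypothetical pill-count adherence check at a weekly visit.
dispensed = 16            # tablets given at the prior visit
returned = 3              # tablets returned at this visit
days_elapsed = 7
daily_dose = 2            # tablets per day per protocol

taken = dispensed - returned             # 13 tablets presumed taken
expected = daily_dose * days_elapsed     # 14 tablets expected
adherence = taken / expected

print(f"Took {taken} of {expected} expected tablets: {adherence:.0%} adherence")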
Enhancing medication adherence is not only beneficial for study data reliability,
but it also plays a role in overall IP accountability. As such, it is the responsibility of
the study Sponsor to become familiar with the available options for participant IP
adherence and select a method (or methods) that is appropriate for the population
being studied and makes sense within the larger context of the clinical trial. The PI
must also ensure proper security and storage of the investigational drug while on site
(ICH GCP 4.6). Even when research pharmacists or vendors are involved, the PI
retains this responsibility (▶ Chap. 6, “Investigator Responsibilities”). Finally, drug
dispensation at the site must be documented in a clear and comprehensive way.
Often Sponsors will provide tools such as drug inventory and tracking logs to assist
the site staff, including pharmacists, in study-specific requirements for tracking drug
receipt and dispensation (e.g., see Fig. 5). However, if no tools are offered by the
Sponsor, the site should follow local institutional practices.

Shipping and Receipt of IP

In some clinical trials, IP is shipped to research sites directly from a manufacturer,
while other studies utilize the services of a central pharmacy, which receives study
drug from a manufacturer and distributes it to the study sites. Any such distributor
must be in full compliance with applicable regulations (e.g., International Air
Transport Association (IATA) or, in the USA, regulations set by the FDA, DEA,
and the US Environmental Protection Agency (EPA)) and possess the licenses and
permits required by federal, state, and local authorities for the safe
operation of a Drug Distribution Center. The distributor would need to have the
capability to store IP per cGMPs and labeled storage conditions and also maintain
packaging and shipping supplies (e.g., wet and dry ice, frozen cold packs, liquid
nitrogen, qualified Styrofoam shipping containers) needed for the uninterrupted
supply of study medication.
Proper storage during drug distribution (i.e., shipment) is critical to preserve the
quality of the investigational product. In their totality, good storage and distribution
practices should facilitate IP movement through a supply chain involving multiple
parties (i.e., supplier, manufacturer, pharmacy, and sites) in a manner that is controlled,
measured, and analyzed for continuous improvement and that maintains the integrity
of the IP (USP <1079>). Both during distribution and when stored on site, the IP(s) should
be stored as specified by the Sponsor in the label and in accordance with applicable
regulatory requirement(s) (ICH GCP 4.6.4), as well as the study protocol, investi-
gational brochure, or marketed medication information sheet (e.g., package insert, or
summary of product characteristics). To meet these specifications, Sponsors should
utilize both qualified, validated shipping materials and environment-tracking
devices that “alarm” when out of the specified range, such as temperature monitors
(e.g., TempTales) and humidity sensors. The Sponsor should determine the storage
requirements including the acceptable storage temperatures, storage conditions (e.g.,
protection from light), storage times, reconstitution fluids and procedures, and
devices for product infusion, if any. The Sponsor should inform all involved parties
(e.g., monitors, investigators, pharmacists, storage managers) of these determina-
tions via the study protocol, instructional manual, or direct study-specific training
(ICH GCP 5.13.2).
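The excursion check that such monitoring devices support can be sketched as follows; the 2–8 °C range and the readings are hypothetical values for a refrigerated product.

# Hypothetical temperature log (degrees C) from a refrigerated (2-8 C) shipment.
readings = [4.1, 4.3, 5.0, 8.9, 9.2, 6.7]
LOW, HIGH = 2.0, 8.0

excursions = [(i, t) for i, t in enumerate(readings) if not (LOW <= t <= HIGH)]
if excursions:
    print(f"ALARM: {len(excursions)} reading(s) out of range: {excursions}")
    print("Quarantine the shipment and contact the Sponsor before dispensing.")
else:
    print("All readings within labeled storage conditions.")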
In accordance with institutional practices and GCP requirements, the PI (or
designee, such as a qualified pharmacist) at the clinical site should maintain records
of the product’s delivery to the site (e.g., the packing slip denoting what was
included in the shipment and any required shipment confirmation documentation,
such as a confirmation form that must be emailed or faxed back to the supplier).
Site Medication Inventory Log (4mg sublingual film)
Site Name: _________________________________                    Page ____ of ____

RECEIVED INTO INVENTORY: Date Received (mm/dd/yyyy); Quantity Received (# 4mg films);
Received by (initials)
REMOVED FROM INVENTORY: Date Dispensed (mm/dd/yyyy); Dispensed by (initials); Amount
Dispensed (# films); Lot #; Assigned to (ppt ID - 4 digits)
DISPOSITION: Disposition (dispensed, returned, expired); Confirmed by (initials/date); Inventory
Balance (4mg films); Comments

Fig. 5 Example study drug inventory and tracking log for a clinical trial

These records should include dates, quantities, batch/serial numbers, expiration
dates, and the unique code numbers assigned to the investigational product(s) and
trial subjects, if applicable (ICH GCP 4.6.3). When study drug is received at the
clinical site, it is critical to inspect the shipment thoroughly and as quickly after
receipt as possible. The packing slip should be compared to both the order form and
the contents of the shipment to confirm that there are no discrepancies, and the items
themselves should be checked to confirm that they are intact. If a temperature and/or
moisture sensor is included in the shipment, ensure that it has not “alarmed” or been
activated; if so, the temperature and/or moisture levels may have surpassed the
acceptable threshold during shipment, potentially compromising IP integrity. Some
studies (often at the request of the manufacturer/distributor) may require that
the packing slip be signed, dated, and returned to the shipper to acknowledge receipt
of the shipment, while other studies may utilize an online portal or supply chain
management software to digitize this process. If it appears that the shipment is
incomplete or was damaged in transit, the Sponsor or designee should be contacted
immediately. Often, it is helpful to prepare comprehensive inventory logs for the
sites in advance, to ensure all key information is captured (see Fig. 5). Furthermore,
if the site does not already have a local SOP for the receipt of drug, it is
recommended that they create one prior to study start. Requiring documentation
within an SOP and study staff training on this SOP ensures that the process has been
thought out and that site staff is ready to properly receive and document the receipt of
drug.
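A minimal sketch of that reconciliation step, comparing the order, the packing slip, and the physical count (item names and quantities are hypothetical):

# Hypothetical receipt reconciliation at the clinical site.
order        = {"Kit A, lot 123": 12, "Kit B, lot 456": 12}
packing_slip = {"Kit A, lot 123": 12, "Kit B, lot 456": 10}
counted      = {"Kit A, lot 123": 12, "Kit B, lot 456": 10}

for item in sorted(set(order) | set(packing_slip) | set(counted)):
    o = order.get(item, 0)
    p = packing_slip.get(item, 0)
    c = counted.get(item, 0)
    status = "OK" if o == p == c else "DISCREPANCY - contact the Sponsor or designee"
    print(f"{item}: ordered {o}, listed {p}, counted {c} -> {status}")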

Quality Assurance

The Sponsor is responsible for ensuring that adequate quality assurance procedures
are followed throughout the drug procurement and distribution process (“Sponsor
Requirements”). Specific to IP receipt and accountability, ICH GCP guidelines
indicate that “the Sponsor should ensure that written procedures include instruc-
tions that the investigator/institution should follow for the handling and storage of
investigational product(s) for the trial and documentation thereof” (ICH GCP
5.14.3). Detailed procedures (e.g., in the study protocol or pharmacy manual)
should address the proper receipt, handling, storage, dispensing, and final disposi-
tion of the investigational product. Once IP is on site, the PI is required to follow all
applicable laws and regulations regarding IP administration and distribution to
participants, such as adhering to the dosing guidelines in the investigator’s bro-
chure or package insert and keeping detailed records of the IP (e.g., lot number,
amount) that is distributed to each study participant. Those tasked with observing
the research sites in person to ensure that the study procedures are being properly
followed, often termed clinical trial monitors, should also periodically ensure that
procedures related to IP, such as adequate IP storage, temperature monitoring
during shipment, and appropriate documentation for receiving and dispensing
drug throughout the trial, are in place and are being followed.

Summary and Conclusion

IP procurement and distribution can be a complex process, and the level of
complexity depends on variables such as study design, scope, the specific IP being
utilized (e.g., controlled substances), and the number of sites and countries involved.
Whatever the level of complexity, the process requires advanced planning, attention
to detail, and compliance with local and regional regulations. The process of IP
procurement and distribution is carried out by an assortment of organizations and
individuals (all of whom need careful oversight and management by the study
Sponsor or appropriate designee) who may be involved in the manufacturing,
testing, packaging, labeling, documentation and accountability, shipping, and stor-
age of study drug. Clear communication and efficient coordination of the groups and
procedures involved are crucial, as are well-defined and unambiguous instructions to
the clinical sites. Deviation from appropriate drug accountability measures and from
applicable federal, state, and local regulations may compromise participant safety as
well as the reliability of safety and efficacy outcome measures. Thoughtful planning in the pre-
implementation stage of a clinical trial can mitigate these possible risks and help
ensure consistent and dependable standards are used throughout the lifecycle of the
clinical trial.

Key Facts

• Factors to consider during IP procurement include identification of a suitable
vendor, manipulation of IP and manufacturing of matching placebo, and
expiration dating, all of which increase in complexity in studies with active
comparators, drug tapering regimens, and/or multiple research sites.
• To avoid insurmountable roadblocks during study implementation, it is important
to consider procurement and distribution (in the context of the overall study
design) early in the protocol development process as part of the study planning
activities.
• The local competent authority can provide guidance as to whether an IND
application (or regional equivalent) is required for a trial with a currently
marketed IP.
• To protect the safety of human research participants, clinical trials using IP must
adhere to cGMPs when manufacturing blinded medication, repackaging or
relabeling IP, shipping IP to sites, and storing IP on site or at the central distributor
or pharmacy.
• Basic measures of drug accountability and management need to be followed
when IP is distributed to clinical sites, and various tools (such as clinical trial
management systems) are available to streamline these processes.
• IP characteristics and study design parameters (such as studies involving con-
trolled substances or blinded trials) influence the drug distribution process and
should be carefully considered when planning and implementing a clinical trial.

References
Choudhry N, Krumme A, Ercole P et al (2017) Effect of reminder devices on medication adherence:
the REMIND randomized clinical trial. JAMA Intern Med 177(5):624–631
Dragic L, Lee E, Wertheimer A et al (2015) Classifications of controlled substances: insights from
23 countries. Innov Pharm 6(2):Article 201
State of California Department of Justice (2018) Research advisory panel. https://oag.ca.gov/
research. Accessed 06 Sep 2018
United Nations (1961) Single convention on narcotic drugs. https://www.unodc.org/pdf/conven
tion_1961_en.pdf. Accessed 04 Oct 2018
United Nations (1971) Convention on psychotropic substances. https://www.unodc.org/pdf/conven
tion_1971_en.pdf. Accessed 04 Oct 2018
United Nations (1988) Convention against illicit traffic in narcotic drugs and psychotropic sub-
stances. https://www.unodc.org/pdf/convention_1988_en.pdf. Accessed 04 Oct 2018
United States Department of Justice Drug Enforcement Administration, Diversion Control Division
(2018) Control substance schedules. https://www.deadiversion.usdoj.gov/schedules/. Accessed
04 Oct 2018
United States Drug Enforcement Administration (2018) Drug scheduling. https://www.dea.gov/
drug-scheduling. Accessed 04 Oct 2018
US Department of Health and Human Services Food and Drug Administration Center for Drug
Evaluation and Research (2017) Guidance for industry: repackaging of certain human drug
products by pharmacies and outsourcing facilities. https://www.fda.gov/downloads/Drugs/Guid
ances/UCM434174.pdf. Accessed 19 Oct 2018
Woodworth T (2011) How will DEA affect your clinical study? J Clin Res Best Pract 7(12):1–9
World Health Organization (WHO) (2018) Substances under international control. http://www.who.
int/medicines/areas/quality_safety/sub_Int_control/en/. Accessed 06 Sep 2018
Yeung D, Alvarez K, Quinones M et al (2003) Low-health literacy flashcards & mobile video
reinforcement to improve medication adherence in patients on oral diabetes, heart failure, and
hypertension medications. J Am Pharm Assoc 57(1):30–27
Zhao W, Durkalski V, Pauls K et al (2010) An electronic regulatory document management system
for a clinical trial network. Contemp Clin Trials 31:27–33
12 Selection of Study Centers and Investigators

Dikla Shmueli-Blumberg, Maria Figueroa, and Carolyn Burke
The Emmes Company, LLC, Rockville, MD, USA

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_35

Contents
Introduction
Site and Investigator Selection Plan
Site Selection
  Facility Resources
  Administrative Considerations
  Recruitment Potential
  Regulatory and Ethics Requirements
  Investigator Responsibilities
  Investigator Qualification
  Investigative Team Considerations
Site and Investigator Selection Methods
  Surveys and Questionnaires
  Site Qualification Visit
Summary
Key Facts
References

Abstract
Site and investigator selection has traditionally been the result of a comprehensive
process by which a study sponsor and/or designated representative, often a
contract research organization (CRO), evaluates prospective investigative teams
and associated clinical sites for clinical trial participation. A list of criteria is often

D. Shmueli-Blumberg (*) · M. Figueroa · C. Burke


The Emmes Company, LLC, Rockville, MD, USA
e-mail: [email protected]; mfi[email protected]; [email protected]

© Springer Nature Switzerland AG 2022


S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_35

compiled and used by the sponsor to grade site and investigator suitability for
study participation.
In implementing a study, sponsors and site teams become partners in achieving
study goals. For longer-term or complex studies, this partnership can become
extensive. Sponsors and site teams must mutually invest in respective stakeholder
perspectives and approaches to answer research questions. Site teams interested
in the research question, with sufficient and qualified staff, and with access to the
desired participant population, are key to conducting sound research. Sponsors
that appreciate site perspectives, consider site operations and logistics in protocol
design, support efforts to mitigate site challenges, communicate updates on
broader perspectives of study activities, and offer fair compensation for site
resources are key factors in implementing a successful study. Conversely, disconnected
relationships between investigative teams and study sponsors can disrupt the research
timetable, compromising morale, driving cost overruns, and increasing variability in
administering the protocol, which reduces the ability to detect treatment differences
and may result in an unsuccessful trial.
A site and investigator selection process should be designed to ensure that both
sponsors and site teams thoroughly evaluate whether the protocol design,
resources, sponsor/team relationships, general timelines, and site facilities are
compatible in achieving the goals of the study.

Keywords
Site · Investigator · Sponsors · Investigator selection plan · Site selection ·
Recruitment · Investigator qualification · Investigative team · Qualification visit

Introduction

Site and investigator selection is not a one-size-fits-all activity. It’s not unusual for
a site to be labeled a “good site” or a “bad site” in research, yet the criteria for such
conclusions are ill-defined. What qualities do those designations represent, and how
are those qualities best assessed? Perhaps more important is the recognition that
there is no universally “good” or “bad” site, but rather that the partnership or “fit”
between the site team and the sponsor in executing a specific protocol at a specific
site can be better, or worse, given the site resources and protocol requirements. A site
team that successfully contributed to achieving study goals in one study may not
necessarily have the same success in a subsequent, similar study. A sponsor that may
have been a positive partner to a study team on a previously successful study may not
be a positive partner under a subsequent study.
Site teams and sponsors with established relationships from prior partnerships can
use their experiences to evaluate potential future partnerships. These relationships
often are not devoid of subjective considerations, but objective measures should be
incorporated into evaluations to the extent possible.

This chapter is entitled ▶ “Selection of Study Centers and Investigators”, with
inherent connotations of the traditional hierarchical approach that sponsors enlist
centers (referred to as sites throughout this chapter) and investigators to contribute to
research. Readers are encouraged to consider the sponsor and study teams as partners
in identifying relationships that are conducive to achieving the goals of research.

Site and Investigator Selection Plan

As a study protocol evolves, sponsors consider the site facilities and teams that may
best complement the goals of the study. The protocol context can have a significant
impact on site characteristics and serve to narrow the field of potential sites quickly.
A documented site and investigator selection plan can be useful in defining the site
and investigator characteristics that are expected to complement the study
requirements.
Investing time developing a site and investigator selection plan encourages the
sponsor to review the protocol with perspective for anticipated needs and challenges
associated with subject recruitment, site and subject compensation, quantity and
location of sites (e.g., single country or international, rural or urban), site type (e.g.,
academic, commercial, private practice), study visit schedules, operations and
logistics, staffing experience and credentials, equipment, and applicable regulations.
Given the scope of potential factors to evaluate, a comprehensive plan assists
sponsors and site teams to objectively evaluate sponsor and site compatibility
more efficiently and possibly mitigate potential for subjective factors that can
introduce bias in selection. For example, an objective and transparent plan can
reduce friction when sites with pre-existing relationships are not selected. Plans should
include documented timepoints intended to evaluate the effect of the plan, once
applied, to identify any potential areas in the site and investigator selection approach
requiring modification.
Once a site and investigator selection plan has been developed, sponsor and site
teams should have common understanding and insight into potential study require-
ments before considering partnering in a research project. At minimum, a detailed
protocol synopsis, if not an initial draft of the protocol, should be available for site
teams to evaluate feasibility and potentially offer perspectives and experience that
could support further, more robust, protocol development. With access to detailed
information about a prospective study, site teams may be able to enhance the
integrity of a site application by offering objective and specific examples
of resources and past performance. In turn, this may create opportunities for more
candid review and discussion among both site and sponsor teams for evaluating
suitability for a study. Considerations that could be mutually applied to both the
sponsor and site team could include quantitative categories associated with level of
engagement, response time, adherence to timelines, and resolutions to action items/
troubleshooting initiatives. At the site level, considerations may include protocol
and regulatory compliance, subject recruitment and retention, number of queries
generated and the time to query resolution, and data completion. Even the best
possible site teams will “fail” if the protocol requirements are not suitable for their
site and the budget is inadequate to support their efforts.

Site Selection

The scope of potential qualities and characteristics to consider in evaluating site
compatibility for any research project is vast. Desired characteristics are specific to
the proposed project and defined in the selection plan. More general concepts are
described below (Fig. 1).

Facility Resources

Key considerations in assessing the suitability of sites for a research study include
evaluation of the research space, which may include exam room features, secure and
appropriate areas to store study drug or devices, specialized equipment needs,
availability of and access to a pharmacy, and laboratory and imaging capabilities.
Important factors to evaluate include the availability of adequate infrastructure,
staff availability (e.g., hours/days of week), staff depth (e.g., coverage for key staff
on leave, attrition management), and staff credentials and expertise such as clinicians
representing a disease specialty. Alternatively, if there is a possibility of supporting
capacity-building activities such as staff training at a site, then there could be more
flexibility regarding this criterion. Ideally, the facility is located in close proximity
to the subject population of interest, or is easily accessible to them via public
transportation. Other important facility-related considerations include accounting for insti-
tutional standards of care, attitudes and participation of the various departments
in the facility, the ability to support data management operations, Internet connec-
tivity, immediate and long-term record storage capabilities, and other study-specific
operational concerns.

Site and Process Flow


With study subjects as key partners in research, sites and sponsors should invest time
and resources to streamline the subject experience during study visits. Consideration

Fig. 1 Site selection considerations: facility resources, administrative considerations, recruitment
potential, and regulatory and ethics requirements

for who, what, when, where, and how study requirements will be completed will
help site teams evaluate their processes and either identify areas for modification to
accommodate the subject experience or determine that they will not be able
to satisfy the study requirements.
In addition to the subject experience, the same considerations should be applied to
improve overall efficiency at the site. For example, if a protocol requires laboratory
sample analysis within 2 h of a subject’s arrival in an Emergency Room, the site team
will need to identify means to collect that sample shortly after consent, get the
sample to the lab quickly, and place a stat order for analysis. The sample collection
and analysis are already challenging, but if the laboratory happens to be on the other
side of an academic campus, or site policies require the sample to be transported by
specific site personnel who may not be immediately available, then there is increased
risk of compromising protocol requirements in ensuring sample analysis within 2 h.
Sites that have a central laboratory facility and equipment resources consistent
with protocol requirements (e.g., an MRI machine of a specific Tesla rating) and can follow study-
wide standard operating procedures (SOPs) and/or guidelines may be more timely in
preparing for site initiation than sites that do not have such resources.

Administrative Considerations

Study costs may differ between research sites for a variety of reasons, including the
presence or absence of a national healthcare system in each country, regional
standards of care, or routine patient care costs that public programs (such as Medicare) or private
insurance carriers in that area will cover. Individual study budgets may not be known
during the site selection process, but investigating key budget-related questions (e.g.,
institutional overhead fees) might be informative even in the most preliminary phase
of the process.
Policies and legal requirements of both the sponsor and the site should be
explored. Time requirements for negotiating budgets and contract terms should
be considered. Sites with extensive legal contract and budget reviews may be less
favorable depending on study timelines. Additionally, there may be requirements
(e.g., protection from liability) either from the sponsor or at the site that cannot be
accommodated, instantly ruling out study participation. Alternatively, requirements
that can be accommodated but require negotiation may prolong time needed between
site selection and site initiation, ultimately impacting the overall project budget and
timelines. Such factors will also impact timelines for implementing protocol and
contract amendments after the study is initiated.

Recruitment Potential

Research questions cannot be answered without study subjects who are engaged in
the trial and eager to support activities to answer the research question. The more
collaborative the sponsor and site team are in encouraging and supporting subject
recruitment and retention, the better.

Many trials have failed to reach their enrollment goal within the
anticipated timeframe, or at all. A primary factor to consider in site selection is
whether the site team has access to a target population with the condition of interest
whose members are willing to participate in a trial. Recruitment potential is a necessary, but
insufficient, stand-alone criterion for selection. Sites may have a history of reaching
their recruitment goals, but a careful selection process must also assess whether
subjects were good research subjects, once enrolled, by reviewing compliance and
retention rates.
The investigator should be able to demonstrate that they have adequate recruit-
ment potential for obtaining the required number of subjects for that study. This
could be based on retrospective data, such as showing the number of patients who
have come in for a similar treatment at that facility over the past 12 months.
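
As a rough illustration, a simple feasibility calculation can translate such retrospective data into a projected enrollment rate. All of the numbers below are hypothetical, and the sketch (in Python) is purely illustrative:

# Hypothetical enrollment projection from retrospective clinic data.
patients_seen_per_year = 400  # similar patients seen at the facility in the past 12 months
eligible_fraction = 0.50      # assumed to meet the inclusion/exclusion criteria
consent_fraction = 0.30       # assumed to consent once approached

projected_per_year = patients_seen_per_year * eligible_fraction * consent_fraction
print(f"Projected enrollment: {projected_per_year:.0f}/year, "
      f"about {projected_per_year / 12:.1f}/month")
# Projected enrollment: 60/year, about 5.0/month

Even a crude projection of this kind helps both parties judge whether a site's patient flow is compatible with the study's recruitment target and timeline.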
Ideally, site and sponsor teams will have experience in compiling and executing
a recruitment and retention plan centered around the subject experience. Even
without such experience, it’s never too late to try a new approach. Considerations
for such a plan include streamlined study activity workflow, potential compensation,
subject access to transportation, availability of childcare, snack/meal options during
visits, flexible scheduling hours, personnel dispositions, general hospitality, and
even facility aesthetics that will better support recruitment and retention initiatives.

Regulatory and Ethics Requirements

The globalization of clinical research makes attention to regulatory requirements
particularly important. Considering relevant regulatory authorities, affiliation with
central, distributed, or local Institutional Review Boards (IRBs) or Independent
Ethics Committees (IECs), and issues related to drug procurement and distribution
may be important when assessing the fit of a site to a specific study. Regulatory
requirements between countries should be compared to identify potential regulations
that may only be required in select countries but need to be applied in all countries
to ensure compliance. Depending on the study, more granular questions regarding
assurances, certifications, and accreditations (such as local clinical laboratory licens-
ing) may be relevant as well.
Timelines to ensure compliance with regional regulatory requirements and
corresponding review can be significant in the overall project timeline. In some
countries, country-level regulatory review can take at least 90 days.

Investigator Selection
Investigators hold a key role in clinical trials, and the success or failure of a trial
may hinge in part on finding a suitable individual to fill this important position.
It is important for sponsors and prospective investigators to have a mutual under-
standing of the roles and responsibilities of a site investigator and the investigative
team in ensuring study subject privacy and safety and protocol and regulatory
compliance during a research project. Without appreciation for protocol and regula-
tory requirements prior to implementing a study, investigative teams can experience
unfortunate consequences. Most consequential, the rights and welfare of study
subjects, the true champions for research, can be compromised. Such consequences
can ultimately contribute to imposed study termination, resulting in an inability to
answer the research question, and lost money, time, and resources. For various
reasons, some sponsors may predetermine a principal investigator (PI) or several PIs to participate in
a trial. There is an inherent risk in finalizing the selection of these individuals to
the critical role of investigator without proper attention to their specific qualifications
based on objective measures. The section below offers a review of investigator
responsibilities and training requirements.

Investigator Responsibilities

The International Council for Harmonization (ICH) defines an investigator as the
individual responsible for the conduct of the clinical trial at a study site (ICH, GCP
1.34). If there is a larger team of people conducting the study at any site, then the
investigator is considered the responsible leader of that team and may be called the
principal investigator (PI). In multicenter trials, each clinical site has its own PI who
provides oversight and leads the research site staff at that site in implementing
the study. A successful PI has the time to engage in the study and demonstrates
a commitment to the research team, to study subjects, and to the integrity of the trial
itself. The PI’s actions set the tone for the research staff by conveying the importance
of the trial and by setting standards for work expectations. The responsibilities of the
investigator are varied and related to the different phases throughout the study life
cycle. Investigator responsibilities generally include the following:

• Knowledge of the study protocol, ensuring other research staff are informed about
the protocol, and conducting their roles in accordance with the processes and
procedures outlined in the current version of that document.
• Maintaining proper oversight of the study drug, device, or investigational product
including documenting product receipt, handling, administration, storage, and
destruction or return. An investigator has a responsibility to inform potential
study subjects when drugs or devices are being used for investigational purposes.
• Reporting safety events throughout the implementation of a clinical trial (21 CFR
312.64). Investigators should document adverse events (AEs), which are untoward
medical occurrences associated with the use of a drug in humans, that occur
during the study and, sometimes, even for a period after study closure. They
must carefully follow the protocol and federal guidelines (e.g., from OHRP and
the FDA) on the appropriate procedures for reporting AEs and serious adverse events
(SAEs) as needed.

To illustrate the scope of investigator responsibilities, below is an example of
responsibilities an investigator would be asked to accept for a study under an
investigational new drug (IND) application with the Food and Drug Administration
(FDA) in the United States (Fig. 2).

9. COMMITMENTS
I agree to conduct the study(ies) in accordance with the relevant, current protocol(s) and will only make changes in a protocol after
notifying the sponsor, except when necessary to protect the safety, rights, or welfare of subjects.

I agree to personally conduct or supervise the described investigation(s).

I agree to inform any patients, or any persons used as controls, that the drugs are being used for investigational purposes and I will
ensure that the requirements relating to obtaining informed consent in 21 CFR Part 50 and institutional review board (IRB) review
and approval in 21 CFR Part 56 are met.

I agree to report to the sponsor adverse experiences that occur in the course of the investigation(s) in accordance with 21 CFR
312.64. I have read and understand the information in the investigator’s brochure, including the potential risks and side effects of the
drug.

I agree to ensure that all associates, colleagues, and employees assisting in the conduct of the study(ies) are informed about their
obligations in meeting the above commitments.

I agree to maintain adequate and accurate records in accordance with 21 CFR 312.62 and to make those records available for
inspection in accordance with 21 CFR 312.68.

I will ensure that an IRB that complies with the requirements of 21 CFR Part 56 will be responsible for the initial and continuing
review and approval of the clinical investigation. I also agree to promptly report to the IRB all changes in the research activity and all
unanticipated problems involving risks to human subjects or others. Additionally, I will not make any changes in the research without
IRB approval, except where necessary to eliminate apparent immediate hazards to human subjects.

I agree to comply with all other requirements regarding the obligations of clinical investigators and all other pertinent requirements in
21 CFR Part 312.

Fig. 2 Excerpt from FDA Form 1572, Section 9: Investigator Commitments

This form represents a commitment by the PI to personally conduct or supervise
the trial, which involves appropriate delegation of activities to other investigators
and qualified research staff, allocating time to adequately interact and supervise
those staff members periodically, and oversight of any third parties involved at their
site. Other documentation, such as a protocol signature page or a signed
protocol document may also serve as a commitment that the PI and sub-investigators
will follow the procedures and requirements outlined in the protocol.
The consequences of not adhering to these obligations can be serious. Investiga-
tors could be barred from participating in future research, lose professional licensure,
and face legal action. Warning Letters from the FDA (which are posted on the FDA
website) commonly involve an investigator’s failure to appropriately follow
the protocol, informed consent violations, and not maintaining adequate records
(Anderson et al. 2011). Anderson et al. also state that determining whether a PI has
personally conducted or supervised the trial has been emphasized during FDA
clinical investigator inspections as well. Enforcement strategies include regulatory
warning letters, disqualifications, restrictions and debarments, criminal prosecutions,
imprisonment, and/or fines if warranted based on the circumstances of the violations.

Investigator Qualification

The US Code of Federal Regulations specifies that clinical trial investigators should
be qualified by training and experience as “appropriate experts to investigate the
drug” (21 CFR 312.53). The ICH guidelines include a similar assertion, stating that
investigators should be qualified by education, training, and experience to oversee
the study and provide evidence of meeting all qualifications and relevant regulatory
requirements (ICH, GCP 4.1). When the study intervention involves use of an
investigational product (IP) or devices, then part of being qualified involves being
familiar with that product. The investigator should be thoroughly familiar with
the use of the IP as described in the protocol as well as having reviewed the
Investigator’s Brochure (for an unapproved product) or Prescribing Information
(for approved products) which describe the pharmacological, chemical, clinical,
and other properties of the IP. Appropriate general and protocol-specific training
can help ensure that an investigator is adequately qualified to conduct the study.
The FDA does not specifically require that a lead investigator have a medical degree
(e.g., MD, DO), but often he or she does. If the PI is not a physician, a physician should
be listed as a sub-investigator to make trial-related medical decisions. This is
consistent with GCP standards that state that medical care given to trial subjects
(ICH GCP 2.7) and trial-related medical decisions (ICH GCP 4.3.1) should be the
responsibility of a qualified physician.

Investigative Team Considerations

Investigator Equipoise
In research there is an expectation of equipoise, that is, genuine uncertainty about the relative
effectiveness of the intervention groups, which is necessary to run the trial ethically. The nature
of the relationship between a patient and physician changes once a physician enrolls a
patient in a clinical trial, thereby creating a potential conflict of interest (Morin et al.
2002). For example, even if the physician believes that one of the treatment arms or
groups is more likely to be successful, all investigators have a responsibility to follow
the study protocol and most importantly the randomization plan. Physicians who have
equipoise do not have an inherent conflict of interest when suggesting study enrollment
and randomization to their patients. Those without equipoise may inject conscious or
subconscious bias in treatment or protocol administration.

Investigator Motivation
Other investigator qualifications are more subjective in nature and difficult to
quantify or document, such as motivation, leadership style, and investigator engage-
ment. Motivation may arise from the desire to work on cutting-edge research or
develop or test products that could ultimately improve the health of patients around
the world. Other motivating factors include financial benefits as well as prestige and
recognition in the professional and scientific community. Some have asserted that
enthusiasm and scientific interest in the research question are the most important
qualifications for potential principal investigators (Lader et al. 2004). A passionate
PI leading the study team is likely to evoke enthusiasm and determination which will
be valuable for successful implementation of the trial.

Site Team Dynamics


Sponsors and prospective investigative teams need to evaluate the dynamic charac-
teristics among the study investigative team. In addition to motivation and passion,
an investigator will typically need the assistance of several research staff members to
successfully run a clinical trial.

Most studies will have a research coordinator who supports the daily operations
of the study such as scheduling subjects for visits, interacting with subjects at the site
and possibly conducting some of the assessments, ensuring data is accurately
collected and reported, and maintaining the site regulatory files. Other site staff
may include physicians and other medical clinicians, pharmacists, phlebotomists,
counselors, and other support personnel. Investigators can increase the likelihood for
a strong and productive team by fostering an environment of cooperation based on
the site staff’s shared mission of implementing a successful trial. Establishing clear
roles and responsibilities for each staff member is also important, and investigators
are required to maintain a list of all staff and their delegated trial-related duties (ICH
GCP 4.1.5) in addition to ensuring that the activities are delegated appropriately to
trained and qualified site staff members. An engaged investigator who spends
sufficient time at the research site will be able to monitor and evaluate the group
dynamics throughout the trial and ensure that morale remains high during difficult
times (e.g., low subject recruitment rates) and that staff have sufficient time to
complete their work. Other strategies for maintaining a strong site team include
providing immediate feedback to staff about their performance, having backups for
key staff roles, and providing opportunities for staff to review and provide input on
the protocol and manual of operations prior to study start-up to ensure their input is
incorporated in final versions.

Research Experience and Past Performance


Other metrics for evaluating site staff include previous experience. Site teams with
little to no research experience may automatically be excluded from sponsor con-
sideration for participation in a clinical trial. Concerns may arise regarding their
abilities to recruit and retain subjects and understand fundamental ethical and
regulatory requirements for participating in clinical research. A sponsor may debate
whether research-naïve sites are worth the risk in financial and time investments.
Then again, the extent of a team’s previous experience may not always be a good
indicator of future performance. Moreover, extensive research experience may
promote more rigidity (“I’ve been doing it this way for 20 years”) and over-
confidence, whereas a less experienced team may be more open to different
approaches and perspectives.
Reputations and past performance of an investigative team can offer important
information, but past performance is not necessarily predictive of future perfor-
mance. There are many variables (e.g., medical condition, staffing resources and
dynamics, access to the desired study subjects, trial design, sponsor/site dynamics,
etc.) that can contribute to past and future site performance. It is important to take the
time to understand what factors contributed to study performance and evaluate
the advantages of site teams who achieved study goals versus the obstacles another
site team may have encountered contributing to an inability to fully satisfy study
goals. Even with this information, the dynamic nature of a site team can have
significant influence on future studies.
Any change, in any area of the facility or team, can positively or negatively
affect site team performance; no team is ever quite the same as it was before. If obstacles
can be mitigated or overcome, there can be many advantages of selecting a site team
with research history and experience. Such advantages include:

• Training from prior study participation may be transferable, ultimately requiring
less overall training time.
• Experienced researchers may have already established a flow and process for
basic study requirements such as facilitating the consent process, collecting lab
samples, or compiling a study research record. These may contribute to quicker,
more efficient study implementation.
• The site team may already have thorough training, understanding, and experience
in applying regulatory requirements within study processes that may result in
lower risk of compromises in regulatory compliance.
• Site teams with prior experience in subject recruitment may have already deter-
mined effective outreach initiatives to increase potential to recruit members of the
desired study population.

There may also be advantages in selecting site teams with limited or no previous
research experience; these sites should not be overlooked based on this criterion
alone. New research sites may:

• Be eager to learn and be compliant in following the study procedures because
they do not have pre-conceived ideas about how things must be done
• Have an untapped pool of subjects available in their practice or within the referral
regional area
• Have greater diversity within their populations

Site and Investigator Selection Methods

Once the ideal criteria and characteristics for a site and study team for a project have
been determined, sponsors will need to find efficient and effective means to share
information about the project and collect relevant information from site teams.
This is often accomplished via a combination of utilities and interpersonal
interaction.
There are several ways for sponsors to solicit site information about prospective
study site teams. Sponsors may elect to cast a broad net and open the selection
process to any study team interested in participation. They may also elect to
implement a more strategic approach using some of the following resources:

• Sponsor databases
• Participation in similar studies listed on ClinicalTrials.gov (see Fig. 3)
• Research network memberships/registries
• Prior research partnerships
• Professional organization databases
• Literature searches

Fig. 3 ClinicalTrials.gov image

• Requests for proposals (RFPs)
• Colleague referrals
• Investigator selection services
• Public clinical trial databases/libraries
• Professional society membership databases
• Research professionals’ societies

While a process that includes an “all are welcome” site recruitment approach may
be an easy way to achieve a specific number of desired sites, and expedite study start-
up, this may increase the risk of compromising longer-term goals of the study and
can result in ineffective resource allocation. The site and investigator selection plan
can help sponsors more clearly define the total number of sites, the type of site and
site team, and the corresponding qualifications and training expectations that would
ideally support study activities.
With specified ideal qualifications, sponsors and site teams can conduct prelim-
inary review of requirements to quickly determine “go/no go” conclusions for
pursuing a study partnership. Once a site team elects to pursue study participation,
they can highlight specific areas of their staffing, facilities, subject recruitment
population, and any prior training and experience to illustrate why they are a good
fit for the study. Sponsors can also compare the desired qualifications and charac-
teristics for the study with submitted site application information to evaluate the site
teams that are most compatible with study goals.

Surveys and Questionnaires

Common tools used to collect information about prospective study sites include
surveys, questionnaires, and interviews. Depending on the design of the chosen
method(s), the information collected can be integral in determining site team and
study compatibility, or it can expend staffing time and resources without adding
much value to the selection process. The tools intended to be used to collect
information from sites should be included as part of the sponsor’s site and investigator
selection plan.
The overall purpose of any method chosen needs to be clearly defined. This will
help ensure that information relevant to study participation and to proposed
outcome criteria will be offered by prospective study teams. For example, if a
sponsor site selection plan indicates interest primarily in site teams with research
experience in multiple sclerosis, but the method used to collect information from
sites doesn’t specifically address this level of detail, then the sponsor team may
receive information from site teams that reflect research experience, but without
specific reference to disease area expertise. Thus, the sponsor won’t have a
complete picture from which to determine whether a site team is compatible for
the study. Additional time and effort for further clarification may be needed, or, in
the interest of time, the sponsor may move on to other proposals and overlook a
potentially ideal site team simply because the tool chosen to solicit the information
did not elicit sufficient detail. The collection tool will likely gather a blend of both objective
and subjective information. The questions posed should reflect the four primary
areas discussed in section “Site Selection” of this chapter: facility resources,
administrative considerations, recruitment potential, and regulatory and ethics
requirements. The survey questions can be grouped by area for ease of completion
and can include specific study-relevant items beyond the four primary areas
described above.
A more objective design that pairs each type of response with a scoring system
could limit potential bias in the site selection process. At the least, there
should be predetermined agreement on the particularly important items so that the
sites can be appropriately scored or ranked on those criteria (Figs. 4 and 5); a minimal
sketch of such scoring follows the objective example items below.
Examples of objective items include:

• Is there a −80 °C freezer on site for blood sample storage?


• How many feet is the freezer from where the samples will be collected?
• Are at least two members of the prospective study team available 24 h a day, 7
days a week?
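
The following is a minimal sketch of how such weighted scoring might work; the criteria, weights, and per-site scores below are hypothetical and would in practice come from the predetermined selection plan:

# Hypothetical weighted scoring of objective site survey responses (Python).
# Weights reflect the predetermined importance of each criterion.
weights = {
    "facility_resources": 0.30,
    "administrative": 0.20,
    "recruitment_potential": 0.35,
    "regulatory_readiness": 0.15,
}

# Each site's responses, pre-scored on a common 0-5 scale.
site_scores = {
    "Site A": {"facility_resources": 4, "administrative": 3,
               "recruitment_potential": 5, "regulatory_readiness": 4},
    "Site B": {"facility_resources": 5, "administrative": 4,
               "recruitment_potential": 2, "regulatory_readiness": 5},
}

def weighted_total(scores):
    # Combine the per-criterion scores into a single weighted total.
    return sum(weights[c] * s for c, s in scores.items())

# Rank the sites from highest to lowest weighted total.
for site in sorted(site_scores, key=lambda s: weighted_total(site_scores[s]), reverse=True):
    print(f"{site}: {weighted_total(site_scores[site]):.2f}")
# Site A: 4.15
# Site B: 3.75

A ranking of this kind does not replace judgment, but it makes the basis for comparing sites explicit and repeatable.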

On the other hand, subjective responses usually offer more insight into the
dynamics of the site team, revealing the team’s demeanor
and motivation to participate in the study, empathy and compassion for the study
population, and any creative means the site team has used to improve efficiencies
and recruit and retain study subjects.
Examples of subjective items include:

• What challenges/risks do you anticipate should you participate in this study?
• Describe your thoughts on recommended subject recruitment and retention prac-
tices for this study at your site.

Fig. 4 Excerpt from Site Selection Survey: Site Characteristics

• Have any of the potential site staff worked together before? Share information
about methods of communication at the site to ensure study requirements and
updates are distributed.
• Describe a problem you encountered with a previous study and what approach
was taken to address it.

Prior to distributing the information collection tool to the site team, sponsor teams
should consider the process for how information from sites will be received and
reviewed for potential partnership. Consideration should be given to having sites
provide masked information to the sponsor to eliminate as much bias in the selection
process as possible. If this is clarified as part of the design process, the sponsor can
be more forthright in advising prospective site teams of what to expect once a site
proposal has been submitted. Some questions to consider include:

• Will the information collected be kept confidential solely with this sponsor team?
• How will the information be stored (e.g., paper/electronic files, a site database)?
• Will the information collected be considered for this study alone or for this study
and other future potential studies with this sponsor?
• What process will be used to select site teams for study participation?
• Are additional activities expected after review of preliminary information (e.g.,
interviews, follow-up on-site visits)?
• What process will be used to advise site teams of sponsor selections?
• What are the timelines for distribution and expected returned information from
site teams?
• Who will be available to respond to inquiries from site teams as they attempt to
complete the requested information?

Sponsors who approach prospective site teams as potential partners in their
research, valuing the time and effort site teams will use to compile the information

Fig. 5 Excerpt from Site Selection Survey: Site Resources

and describing the process and expectations for evaluating prospective study partner-
ships, may be more likely to collect more timely, thoughtful, and comprehensive
responses from site teams.

Site Qualification Visit

Despite all the effort that is required to compile a site information collection utility,
and all the technology available to help people connect via phone, email, chat,
social media, and video, in-person interactions offer the greatest opportunity to
evaluate whether the sponsor and site teams can establish an effective partnership
to conduct a study. On-site qualification visits are an invaluable way to gauge
sponsor and site team dynamics in a manner that surveys, question-
naires, or interviews cannot match. Each party can better evaluate whether they are compatible
and identify anticipated challenges in partnering together. An on-site visit also
offers the opportunity for the site team to share perspectives in operations and
logistics to which the sponsor team may be naïve. In-person site visits may be
especially important in international studies where countries may differ in norma-
tive standards (e.g., what constitutes adequate research space for conducting a
study). All partners can review and discuss anticipated areas of excellence and
potential risks and work together toward mitigating risks. Items that were
described in the “Site Selection” section above can be observed and evaluated
“first hand” including:

• Facility organization/flow – sites can demonstrate what the subject experience
may be like.
• Site team and subject access to the facility (i.e., public transportation, parking,
accessibility for assistance devices, navigation from the entrance to the research
site, and the location of labs, the pharmacy, and other resources relative to the
primary study location).
• Nonverbal cues/information.
• Storage locations/utilities.
• Equipment availability, calibration, and documentation.
• Subject recruitment and retention planning – site teams that devise a formal plan
for recruitment in advance of site initiation that also includes timepoints for
evaluation may be better equipped to evaluate the effect of planned initiatives
and recalibrate initiatives as needed.

Summary

As has been described throughout this chapter, the site selection process is
complex and dynamic. Sponsors and site teams who elect to conduct a trial
together are entering, at minimum, into a short-term partnership with each other
and must be interdependent to achieve the goals of the study. Sponsors must
consider a variety of factors and embark upon both remote and in-person means
to learn more about prospective study sites. Investigators who participate in a
study have significant responsibility for ensuring appropriate staffing and
corresponding training and qualifications, regulatory compliance, protocol com-
pliance, and, most importantly, protecting the safety and rights of the subjects who
participate in a study. Investigative teams must weigh their responsibilities with
the study requirements provided by the sponsor. Ultimately, both sponsors and site
teams must evaluate whether they can be compatible and can achieve the goals of
the research project.

Key Facts

1. The sponsor and study teams should be viewed as partners with a common goal of
identifying a good fit between a study and an investigator and site staff.
2. Some of the important considerations when selecting a site for a clinical trial
include facility resources, administrative considerations, recruitment potential,
and regulatory and ethics requirements.
3. Selection of an investigator may include objective quantifiable considerations
such as previous studies and publications and specific area of expertise, as well as
more subjective qualifications such as motivation and leadership style.
4. There are a variety of ways for sponsors to solicit site information about pro-
spective study site teams, such as through site surveys and questionnaires and site
qualification visits.

References
Anderson C, Young P, Berenbaum A (2011) Food and Drug Administration guidance: supervisory
responsibilities of investigators. J Diabetes Sci Technol 5(2):433–438. https://doi.org/10.1177/
193229681100500234
Lader MD, Cannon CP, Ohman EM et al (2004) The clinician as investigator: participating in
clinical trials in the practice setting. Circulation 109(21):2672–2679. https://doi.org/10.1161/01.
CIR.0000128702.16441.75
Morin K, Rakatansky H, Riddick F et al (2002) Managing conflicts of interest in the conduct of
clinical trials. JAMA 287(1):78–84. https://doi.org/10.1001/jama.287.1.78
Design and Development of the Study Data
System 13
Steve Canham

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Descriptive Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
System Components: The Naming of Parts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
The Rise of the eRDC Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Deployment Options and Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Constructing the Study Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Working from the Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Working Up the Functional Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
“User Acceptance Testing” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Final Approval of the Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Using Data Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Validating the Specification and Final Release . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Development and Testing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
The Study Data System in the Longer Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Change Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Exporting the Data for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

Abstract
The main components of a typical modern study data system are described,
together with a discussion of associated workflows and the options for deployment,
such as PaaS (platform as a service) and SaaS (software as a service), and their
implications for data management. A series of recommendations are made about
how to create a study specific system, by developing a specification from the study

S. Canham (*)
European Clinical Research Infrastructure Network (ECRIN), Paris, France
e-mail: [email protected]

© Springer Nature Switzerland AG 2022


S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_36

protocol within a multidisciplinary framework, obtaining formal approval of that
specification, building prototypes, and then carrying out a detailed and systematic
validation of the system before releasing it for use. The use of data standards is
described and strongly encouraged, and the need for distinct development, test, and
production environments is discussed. Longer term aspects of system management
are then considered, including change management of the study data system and
preparing the data for analysis, and managing data in the long term.

Keywords
Clinical Data Management System · CDMS · Study definition · Electronic remote
data capture · eRDC · Functional specification · Validation · Data standards ·
Change management · Data extraction

Introduction

The study data system is designed to collect, code, clean, and store a study’s data,
deliver it for analysis in an appropriate format, and support its long-term manage-
ment. In fact, a “study data system” is rarely a single system – it is normally a
collection of different hardware and software components, some relatively generic
and others study specific, together with the procedures that determine how those
components are used and the staff that operate the systems, all working together to
support the data management required within a study.
This chapter has three parts. The first provides a descriptive overview and looks at
the typical components of a study data system, the associated data processing
workflows, and the main options for deployment. The second is a more prescriptive
account of how a data system should be designed, constructed, and validated for an
individual study, while the third discusses two longer term aspects of system use:
managing change and delivering data for analysis.

Descriptive Overview

System Components: The Naming of Parts

Any study data system has to be study specific, collecting only the data items
required by a particular study, in the order specified by the assessment schedule.
But any such system also has to meet a more generic set of requirements, to
guarantee regulatory compliance and effective, safe, data management. The func-
tionality required includes:

• The provision of granular access control, to ensure users only see the data they are
entitled to see (in most cases, only the data of the participants from their own site).
• Automatic addition of audit trail data for each data entry or edit, with preservation
of previous values.
• The ability to put logic checks on questions, so that impossible, unusual, or
inconsistent values can be flagged to the user during data entry.
• Conditional “skipping” of data items that are not applicable for a particular
participant (as identified by data entered previously).
• Support for data cleaning – usually by built in dialogues that allow central and site
staff, and monitors, to easily exchange queries and responses within the system.
• The ability to accurately extract part or all of the data, in formats that can be
consumed by common statistical programs.
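
Two of the items above, logic checks and conditional skipping, can be made concrete with a short sketch. The Python below is purely illustrative, with hypothetical item names and limits; it is not the configuration syntax of any actual CDMS:

# Illustrative logic check and skip condition, of the kind a CDMS
# evaluates during data entry.

def range_check(item, value, low, high):
    # Flag an impossible or unusual value to the user as a data query.
    if not (low <= value <= high):
        return f"Query on '{item}': {value} is outside the expected range {low}-{high}"
    return None

def item_applicable(item, entered):
    # Skip items that do not apply, based on data entered previously.
    if item == "pregnancy_test_result":
        return entered.get("sex") == "female"
    return True

entered = {"sex": "male", "systolic_bp": 400}
print(range_check("systolic_bp", entered["systolic_bp"], 70, 250))  # raises a query
print(item_applicable("pregnancy_test_result", entered))            # False: item skipped

In a real CDMS such rules are declared in the study definition rather than programmed by hand, but the underlying logic is the same.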

These requirements mean that a study data system almost always has as its core a
specialist Clinical Data Management System (CDMS), a software package that
provides the generic functions listed above, but whose user interface can be adapted
to reflect the requirements of specific studies. Such systems are usually purchased on
a commercial basis and may be installed locally or hosted externally.
A single CDMS installation can and usually does support multiple studies. It makes
use of a database for storing the data and provides a set of front-end screens for
inputting and querying it. The database is normally one of the common relational
systems (e.g., SQL Server, Oracle, MySQL), but it will be automatically configured by
the CDMS to store both the study design details and the clinical, user, and audit data.
The user interface screens are normally web pages, which means that as far as end
users are concerned, the CDMS system is “zero-footprint”: it does not require any
local installation at the clinical sites or the use of a dedicated laptop. An end user at a
clinical site accesses the system, remotely and securely, by simply going to a pre-
specified web page. From there they can send the data immediately back to the central
CDMS, where it is transferred to the database. For security and performance reasons,
the database is normally on a different server, with tightly controlled access, whereas
the web server is necessarily “outward facing” and open to the web (see Fig. 1).
The study-specific part of the system is essentially a definition and is often
therefore referred to simply as the “study definition.” It is stored within and
referenced by the CDMS, and stipulates all the study-specific components, for
example, the sites, users, data items, code lists, logic checks, and skip conditions.
It also defines the order and placement of items on the data capture screens (usually
referred to as “eCRFs,” for electronic case report forms), and how the eCRFs are
themselves arranged within the “study events” (or “visits”), i.e., the distinct time
points at which data is collected. Even though a single CDMS installation usually
contains multiple study definitions, it controls access so that users only ever see the
study (or studies) they are working on, and the data from their own site.
Almost all systems also allow a study definition to be exported and imported as a
file. This allows the definition to be easily moved between different instances of the
same CDMS – for example, when transferring a study definition from a development
to a production environment. If the file is structured using an XML schema called the
Operational Data Model, or ODM (CDISC 2020a), an international standard devel-
oped by CDISC (the Clinical Data Interchange Standards Consortium), then it is

Fig. 1 The main components of a modern study data system. The CDMS stores one or more study
definitions and is usually installed on a web server, which presents each study’s screens to
authorized users via a secure internet link. The CDMS is also connected to a database, usually on
a separate server, for data storage

sometimes possible to transfer the study definition between different data collection
systems (e.g., between collaborators). “Sometimes” rather than “always” because,
unfortunately, not all CDMSs fully support ODM export/import, and there are some
elements of a study definition (such as automatic consistency checks on data) where
ODM still only provides partial support.
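
As a rough sketch of what an ODM file contains, the fragment below is a minimal, hand-written example; a real export must satisfy the full ODM schema and also carries global variables, item groups, code lists, and much more. Standard Python is enough to inspect the main definition elements:

import xml.etree.ElementTree as ET

# A minimal, hand-written ODM-style fragment (real exports are far larger).
odm_xml = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S.DEMO">
    <MetaDataVersion OID="MDV.1" Name="Demo study definition">
      <StudyEventDef OID="SE.BASELINE" Name="Baseline visit" Repeating="No" Type="Scheduled"/>
      <FormDef OID="F.VITALS" Name="Vital signs eCRF" Repeating="No"/>
      <ItemDef OID="I.SYSBP" Name="Systolic blood pressure" DataType="integer"/>
    </MetaDataVersion>
  </Study>
</ODM>"""

ns = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
root = ET.fromstring(odm_xml)
for tag in ("StudyEventDef", "FormDef", "ItemDef"):
    for element in root.findall(f".//odm:{tag}", ns):
        print(tag, element.get("OID"), "-", element.get("Name"))

The same hierarchy described above, study events containing eCRFs containing data items, is visible directly in the XML.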
In terms of features, almost all CDMSs are technically compliant with clinical
trial regulations, especially GCP and CFR21(11), e.g., they allow granular access
control, they provide automatic audit trail data, the internal timestamps are
guaranteed to be consistent, etc. Without this technical compliance, they would
stand little chance in the marketplace. What makes a system fully compliant,
however, is the way in which it is used: the set of standard operating procedures
and detailed work instructions that govern how the system is set up and functions
in practice, together with the assumed competence of the staff operating those
systems. For this reason, both “policies and procedures” and “central IT and DM
staff” are included as elements within Fig. 1.
The study-specific system that users interact with, the product of the study
definition and the underlying CDMS, is sometimes known as a Clinical Data
Management Application (CDMA). Although in practice “study definition,” or
even “study database,” is probably more common, in this chapter, the more accurate
“CDMA” is used to refer to the software systems supporting a specific study, and
“study definition” is restricted to the detailed specification that defines the CDMA’s
features.
Researchers are normally much more engaged with the details of the CDMA
rather than the underlying systems, but they should at least be satisfied that the
13 Design and Development of the Study Data System 213

systems used for their trial are based upon an appropriate CDMS and that there
is a mature set of procedures in place that govern its consistent and regulatory
compliant use. This is one of the many reasons why the operational manage-
ment of a trial is best delegated to a specialist trials unit, which might be a
department within a university, hospital, research institute or company, or an
independent commercial research organization, or CRO (for simplicity, in this
chapter, all of these are referred to as a “trials unit”). Not only will such a unit
already be running or managing one or more CDMSs, they will also be able to
provide the expertise to develop the study specific part of the system safely and
quickly.
CDMSs differ in their ease of use and setup, for instance, in creating study
definitions, extracting data or generating reports, and the additional features that
they may contain. The latter can include:

• Integrated treatment allocation systems, so sites obtain a randomization decision as soon as they recruit a participant into a trial
• Built-in support for data standards, e.g., the ability to import/export CDISC ODM
files
• Integrated coding modules (e.g., for MedDRA coding of adverse events)
• Supporting versions for tablets and mobile phones, especially for obtaining data
directly from participants (ePRO or “electronic participant-reported outcomes”)
• Built-in support for monitors and source data verification
• Integration with laboratory systems

The great majority of CDMSs in use are commercial systems, available from a
wide variety of vendors. In early 2020, a search on a software comparison site listed
58 different systems that offered both electronic data capture and 21 CFR Part 11
compliance (Capterra 2020), and that list was far from comprehensive. Vendors
range from large multinationals to small start-ups, and license costs vary over at least
an order of magnitude, from a few thousand to several hundred thousand dollars per
study, though as discussed below costs can also depend on the deployment models
used. There are also a few open source CDMSs: OpenClinica (2020) and REDCap
(2020) are the two best-known; both are available in free-to-install versions (as well
as commercial versions that provide additional support), and both have enthusiastic
user communities.
There are also some local CDMSs, built “in-house,” particularly in academic
units, although they are becoming less common. CDMSs are increasingly complex
and costly to build and validate, and effective ongoing support requires an invest-
ment in IT staff that is beyond the budget of most noncommercial units. Local
systems can also be over dependent on local programmers and become more difficult
to maintain if key staff leave. Although there is no question that home-grown
CDMSs can function well, there is an increased risk in using such systems. Sponsors
and researchers who find themselves relying on such systems need to be confident
that they are fully validated, and that they are likely to remain supported for the
length of the trial.

The Rise of the eRDC Workflow

The dominant web-based workflow for collecting clinical trial data, as depicted in
Fig. 1, is known as electronic remote data capture, or eRDC (EDC and RDC are
also used, and in most contexts mean the same thing). Since the early 2000s, eRDC
has slowly supplanted the traditional paper-based workflow, where paper CRFs were
sent through to the central trials unit or CRO, by post or courier, to be transcribed
manually into a central CDMS. Early papers extolling the benefits of eRDC were
often written by the CDMS vendors (e.g., Mitchel et al. 2001; Green 2003), who had
obvious vested interests. Despite this, the cost and time benefits of eRDC have
driven gradual adoption, especially for multi-site trials and in geographical areas
where reliable internet infrastructure is available. The advantages include:

• Removing the transcription step, and thus the time lag between the arrival of a
paper CRF and loading its data into the system, and eliminating transcription
errors. It therefore removes the need for expensive checks on data transcription,
such as double data entry.
• Speeding up data queries – the “dialogue” between site and central data manage-
ment can proceed securely on-line, rather than by sending queries and responses
manually. This can be especially important when chasing down queries in
preparation for an analysis.
• Allowing safety signals to be picked up more quickly. In addition, some systems
can generate emails if adverse events of a particular severity or type are recorded.
• Making it possible to reject “impossible” data (e.g., a recorded date later than the date of data entry) and thus force an immediate revision on data entry. In
a paper-based system, the need to reflect a paper CRF’s contents, however bizarre,
means that this type of data error must be allowed and then queried, or subject to
“self-evident correction” rules.
• Making it easier and clearer to tailor systems to the particular requirements of a
site, or a particular study participant (e.g., based on gender, treatment, or severity
of illness), by using skipping logic rather than sometimes complex instructions on
a paper CRF.
• Avoiding CRF printing costs and time.
• Allowing the data collection system to be more easily modified, for instance, in
the context of an adaptive trial.

By 2009, a Canadian study found 41% eRDC use for phase II–IV trials (El Emam
et al. 2009), and anecdotal evidence suggests eRDC use has continued to rise
considerably since then, with many units now only using eRDC for data collection.
Forty-two of the 49 (86%) UK noncommercial trials units that applied for registration
status in 2017, i.e., most of the university-based trials units in the country, explicitly
mentioned using eRDC based systems, even if they did not always indicate they
were using eRDC for every trial (personal communication, UKCRC, Leeds).
Furthermore, empirical studies have now confirmed some of the benefits claimed
for eRDC (Dillon et al. 2014; Blumenberg and Barros 2016; Fleischmann et al.
2017). Not all those benefits are relevant to single site studies, but even here the same
systems can be used, albeit normally within an intranet environment.
The main disadvantage of eRDC is that it requires a large group of staff, across the various clinical sites, to be trained to use both the CDMS and the specific CDMAs, and it brings a greater general user-management load. A user-initiated, automatic “forgotten password?” facility is therefore a nontrivial feature of any CDMS, avoiding an otherwise inordinate amount of time spent managing requests simply to reenter the system.
Where paper-based trials are still run, they use essentially the same system for
their data management, except that the CDMS’s end users will be in-house data entry
staff rather than clinical site staff. Paper-based trials are still used, for instance, in
areas where internet access is patchy or unreliable, but eRDC is now the default
workflow for collecting clinical site data. Participant questionnaires (e.g., on quality
of life measures) have traditionally been collected on paper and then input centrally,
though in recent years there has been much interest in replacing these with ePRO
(electronic patient reported outcomes) systems, e.g., using smart phones, that can
connect directly to a CDMS. A review is provided by Yeomans (2014), though some
potential problems with ePRO, from a regulatory compliance perspective, are
highlighted by Walker (2016).

Deployment Options and Implications

Traditionally, a CDMS would be installed and run directly by the trials unit or CRO,
with hardware in server rooms within the trials unit’s own premises or at least under
their direct control. That scenario allows the unit to have complete control over their
systems and infrastructure, making it much easier to ensure that everything is run
according to specified procedures and that all staff understand the specialized
requirements of clinical trials systems.
This can be a relatively expensive arrangement, however, and may not sit well
with the centralizing tendencies of some larger organizations. It is also sometimes
difficult to retain the specialist IT staff required. It has therefore become increasingly
common to find the CDMS hosted in the central IT department of a hospital,
university, or company. The trials unit staff still directly access the CDMS across
the local network and can develop study definitions as well as oversee data man-
agement. They often also access and manage the linked databases, but data security,
server updates, and other aspects of IT housekeeping are carried out by “central IT.”
The servers are provided as PaaS, or “platform as a service,” i.e., they are set up to
carry out designated functions, as database or web servers, and the customer, the
department managing the trials, manages those functions (see Fig. 2).

Fig. 2 A common PaaS eRDC architecture. The data is captured directly at the sites and transferred directly to a central CDMS. The trials unit (or CRO) manages that CDMS, including providing the study-specific definitions, though the system is embedded in an IT infrastructure managed by a central IT department, who provide the servers as “platforms as a service”

This arrangement may be more efficient, but it does require that all parties are
very clear about who does what and that clear communication channels are in place.
From the point of view of trial management, the central IT department is an
additional subcontractor supporting the trial. It shifts the day-to-day responsibility
of many IT tasks (backups, server updates, firewall configuration, maintaining anti-malware systems, user access control, etc.) out of the trials unit, but it does not
change the fundamental responsibility of the unit, acting on behalf of the sponsor, to
assure itself that those tasks are being carried out properly.
As stressed by the quality standards on data and IT management established by
ECRIN, the European Clinical Research Infrastructure Network (ECRIN 2020), this
oversight is not a “one-off” exercise – the requirement is for continuous monitoring
and transparent reporting of changes (Canham et al. 2018). For example, trials unit
staff do not need to know the details of how data is backed up but should receive
regular reports on the success or otherwise of backup procedures. They do not need
to know the details of how servers are kept up to date, or logical security maintained
through firewalls, but they do need to be satisfied that these processes are happening,
are controlled and documented, and that any issues (e.g., security breaches) are
reported and dealt with appropriately.
This problem, of “quality management in the supply chain,” becomes even more
acute when considering the increasingly popular option of external CDMS hosting.
In this scenario, the CDMS is managed by a completely different organization –
most often the CDMS vendor. The trials unit staff now access the system remotely to
carry out their study design and data management functions, with the external system
presenting the CDMS to the unit as “software as a service” or SaaS. This scenario is
popular with many system vendors, because it allows them to expand their business
model beyond simple licensing to include hosting services, and in many cases offer
additional consultancy services to help design and build study systems. It also means
that they only have to support a single version of their product at any one time, which
can reduce costs. In fact, some CDMS vendors now insist on this configuration, and
only make their system available as SaaS.
But in many cases, the delegation chain is extended still further, as shown in Fig. 3,
because the software vendor may not physically host the system on its own
infrastructure. Instead, it may make use of external IT infrastructure in a third-party data center, or within one of the large commercial “cloud” infrastructures.

Fig. 3 CDMS hosting by an external SaaS supplier. Three different organizations may be involved in supporting a study data system, with links mediated by the internet. Systems are physically located within a third-party data center, though the CDMS is managed by the software vendor. The trials unit, like the clinical sites, accesses the system via the internet

Using externally hosted systems has some advantages for trialists and trials units.
For instance:

• It provides a very good way for trials units and CROs to experiment with different
CDMSs, without the costs and demands of installing and validating them locally.
• The burden of validating the CDMS is transferred to the organization controlling
its installation, usually the software vendor. The trials unit/CRO still needs to
satisfy itself that such validation is adequate, but that is cheaper and quicker than
doing it themselves.
• It empowers sponsors, who have greater ability to insist that a particular CDMS
system is used, regardless of who is carrying out the data management.
• It removes any suspicion that the trials unit or CRO, and through them the
sponsor, can secretly manipulate the data – data management is always through
the CDMS’ user interface, where all actions can be audited, and never by direct
manipulation of the data in the database.

On the other hand, an externally hosted system can lead to difficulties in accessing and obtaining bulk data quickly (e.g., for analysis), and it can introduce substantial difficulties in maintaining quality control in what is now an extended supply chain. The trials unit is now dependent on the quality assurance processes and
transparency of the CDMS provider, to make sure that not only the CDMS itself but
also the IT provision is fit for purpose. A key point, if the CDMS provider uses a
“cloud” infrastructure, is the need for the trials unit to be fully aware of exactly
where the data, including the backup sets derived from it, is located. This will
determine the legal jurisdiction that applies and dictate whether additional safe-
guards (e.g., Privacy Shield compliance in the USA for data on European citizens)
need to be sought.
Contracts, information flows, and oversight processes need to be in place that
ensure all users not only know that the CDMS is validated and secure, and continues
to be through successive system changes, but are also aware of the underlying IT
infrastructure, and are happy that the CDMS provider is itself carrying out proper
oversight of that infrastructure. Too often, unfortunately, this is not happening. In
2018, the Inspectors Working Group of the European Medicines Agency listed a
wide range of issues they had discovered in respect of subcontracted services,
including:

• Missing or out-of-date contractual agreements
• Poor definition of the distribution of tasks
• Lack of understanding of the location of data
• A lack of understanding of GCP obligations by subcontractors
• Unwillingness to accept audits
• Poor understanding of reporting requirements
• Confusion over outputs and actions to be taken at the end of the trial

Sponsors, researchers, and trial teams, therefore, need to ensure that when
functionality is subcontracted, these issues have been dealt with, so that they are
confident that appropriate oversight is taking place all the way down the supply
chain. This need not involve detailed scrutiny, but it does mean selecting and
building up relationships and trust with a specialist trials unit or CRO, and being
confident that they have not just good technical systems available but also a
comprehensive quality management system in place.

Constructing the Study Definition

Overview

Whoever is managing and monitoring the CDMS and its underlying IT infrastruc-
ture, there is no doubt that the study-specific part of the study data system, the “study
definition” or CDMA, is the responsibility of the study management team. The team
aspect is important – even though the sponsor retains overall responsibility, success-
ful development of a CDMA requires expertise and input from a wide variety of
people: investigators, statisticians, study managers, data and IT staff, quality man-
agers and site-based end users.
The process of creating a CDMA is summarized in Fig. 4. It has two distinct and
clearly defined phases – development and validation – both of which involve
iterative loops. The development phase takes the protocol and the data management
plan as the main input documents and creates a full functional specification for the
CDMA. It does so by organizing input from the various users of the system and
consumers of the data, and iteratively developing the specification until all involved
are happy that it will meet their requirements. Very often prototype systems are built
against the developing specification to make the review process easier but, in any
case, a system built to match the approved specification must be available at the end
of the development phase. The validation phase takes that system and checks,
systematically and in detail, that it does indeed match the agreed study definition.
Once that check is complete, the CDMA can be released for use.

Fig. 4 The workflow for CDMA development. There are two iterative loops: the first, usually longer, results in the approval of a functional specification and the construction of a prototype system, and the second results in the approval of that system for production use once it has been validated. (Adapted from Canham et al. 2018)

Both phases should be terminated by a formal and clearly documented approval
process. At the end of the first, development, phase, all those involved in creating the
specification should sign to indicate that they are happy with it – the result should
therefore be a multidisciplinary and dated sign-off sheet for the specification. At the
end of the second, validation, phase, someone (often a data manager or operational
manager) needs to sign to indicate that the validation has been successfully com-
pleted and that the system can be released.

Working from the Protocol

CDMA development has to start with the study protocol, because that document
specifies the key outcome and safety measures to be captured, implying the individ-
ual data points that need to be collected, and the assessment schedule that determines
when the data should be collected.
One way to convert a protocol into a study definition would be to simply ask the
investigator and/or statistician to specify the data points needed to carry out the
required analyses, either by setting out a formal set of analysis data requirements, or
more simply by annotating the protocol document. Perhaps because the time avail-
able to both investigators and statisticians is often limited, neither approach seems
very common, though anecdotal evidence suggests that when it is used, it can be
very effective. What often happens, in practice, is that experienced data management
staff take the protocol and, sometimes using previous trials that have covered similar
topics, construct either a spreadsheet with the data items listed, or mock paper CRFs,
annotated with additional details such as the data item type or range limits for values,
or – very often – both, with the spreadsheet providing more details than can be easily
shown on an annotated paper form. These are then presented for review to the
multidisciplinary study management team.
The use of mock paper CRFs is undoubtedly effective, not least because most
people find it easier to review a paper CRF than a series of screens or a spreadsheet,
especially in the context of a meeting. It can, however, increase the danger of
collecting data that is not strictly required to answer the questions posed by a study,
but which is included only because it was part of a previous, similar study, or
because there is a vague feeling that it might be “possibly useful one day.”
Collecting too much data in this way runs counter to data minimization, an
important principle of good practice emphasized in the General Data Protection
Regulation (GDPR) of the EU: “Personal data must be adequate, relevant and
limited to what is necessary in relation to the purposes for which those data are
processed” (GDPR Rec.39; Art. 5(1)(c)) (Eur-Lex 2020). At least within the EU,
collecting unnecessary data may therefore be illegal as well as unethical.
There are circumstances where data can be legitimately collected for purposes
other than answering the immediate research question – for instance, to obtain a
disease-specific “core dataset,” to be integrated with similar datasets from other
sources in the future. But if that is the case, it should be explicitly mentioned within
study information sheets, so that a participant’s consent is fully informed and
encompasses the collection of such data.
One effective way of reducing the risk of collecting unused or unusable data is to
ensure the study statistician reviews the CDMA’s data points towards the end of the
development process. Far better for spurious data points to be removed before the
study begins, rather than being collected, checked, and queried, only for the statis-
tician – after they receive the extracted dataset – to protest that they would never
make use of that data.
A second document that feeds into CDMA design is the Data Management Plan,
or DMP. All trials should have such a plan, either as a section within the Trial Master
File (TMF) or as a separate document referenced from the TMF. Although a trials
unit or CRO would be expected to have a set of generic SOPs covering different
aspects of data management, there will almost always be study specific aspects of
data management that need planning and recording, and these should be described
within the DMP, which therefore forms part of the input to the design process.
A key aspect of study design is the balance between different methods of ensuring
data quality. Modern CDMS can include sophisticated mechanisms for checking
data, allowing complex and conditional comparisons between multiple data items on
different eCRFs and study events, for instance, to check for consistency between
visits, plausible changes in key values, and adherence to schedules. The fact that
these complex checks can be designed, however, does not necessarily mean that they
should always be implemented. The more complex a check, the more difficult it is to
implement and the harder it is to validate. There is also a possibility that data entry
may become over-interrupted, and take too long, if the system flags too many
possible queries during the input process.
The alternative to checking data as it is input is to check it afterwards, by
exporting the data and analyzing it using statistical scripts. For complex checks,
this has several advantages:

• It usually allows simpler, more transparent design of the checks than the often
convoluted syntax required within CDMS systems.
• It is easier for the checks to be reviewed, e.g., by another statistician, and
validated.
• It gives the statisticians, as the consumers of the data, greater knowledge of and
confidence in the checks that have been applied.
• Most importantly, it allows checks to be made across the study subject popula-
tion, for instance, when identifying outliers (CDMS based checks can only
usually be applied within a single individual’s data).

The final point, coupled with the need for statistical monitoring to compare site
performance (to help manage a risk based monitoring scheme), means that some
level of “central statistical monitoring” of data quality is almost always required (an
exploration of the use of central statistical monitoring is provided by Kirkwood et al.
2013). The question is how far it should be extended to include the checks that might
otherwise be designed into the study definition. Clearly the availability of statistical
resource to help design (if not necessarily run) the checks will influence the approach
taken. There is also the issue that queries discovered using statistical methods need
to be fed back into the CDMS system so that they can be transmitted to the sites, and
few CDMS allow this to be done automatically. Whatever the decision of the study
management team, it should be documented as part of the Data Management Plan
and then taken into account during the development of the functional specification.
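As a minimal sketch of what such an after-export, cross-participant check might look like (the file and column names are invented for illustration; a real script would implement whatever checks the DMP specifies):

```python
# A sketch of a central statistical check run across the whole study
# population after export, the kind of cross-participant comparison that
# CDMS-internal checks cannot usually perform. Names are illustrative.
import pandas as pd

df = pd.read_csv("lab_results_export.csv")  # one row per participant visit

# Flag values more than three standard deviations from the study-wide mean.
mean = df["haemoglobin"].mean()
sd = df["haemoglobin"].std()
outliers = df[(df["haemoglobin"] - mean).abs() > 3 * sd]

# The listing is then fed back to data management to be raised as queries
# in the CDMS, usually a manual step, as noted above.
print(outliers[["site_id", "participant_id", "visit", "haemoglobin"]])
```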

Working Up the Functional Specification

CDMA development needs to be a multidisciplinary, iterative process, as shown in Fig. 4. It will involve periodic reviews of the developing study definition, with data
management staff adding and editing items after each review and then sending out
updated documents. The staff involved in the process should always include, as a
minimum, the coordinating investigator and the study statistician, as well as the
trial’s project manager, but can often also involve sponsor’s representatives, quality
assurance staff, data managers, IT staff, and a subsample of the end users for the
system, the staff based at the clinical sites. Input from these different reviewers will
be focused on different things and sought at different times, and different trials units
will have different ways of coordinating the process and involving other groups
around the core study team. That is not important as long as the goal of the
development process – a full and approved functional specification for the CDMA
– is met.
Many CDMS systems can generate detailed metadata of the systems they contain,
including listing not just the study schedule and data items but also the details of
derivations, skipping, and range and consistency checks that have been programmed
into the study definition. Such data may be available as a set of standard reports, or it
may be drawn directly from the database where the study definition is stored. This
offers probably the most efficient way of developing a functional specification,
which is to prototype it. After the initial specification has been created, by annotating
the protocol or setting up mock paper CRFs, the IT and/or data management staff can
create a first prototype of the CDMA within the CDMS. The prototype can then
provide a detailed record of its own metadata, which is guaranteed to be an accurate
representation of the developing system. This allows everyone involved in the
review process to see the CDMA taking shape and inspect design elements like
layout, colors, and prompts, as well as examining the more detailed specification
generated by the system.
To be clear, this is not an “agile” development strategy, other than in the relatively
minor sense that the visual layout of elements can be more easily negotiated. The
user requirements for the system are fixed and are represented by the protocol and
the context in which the CDMA will be delivered. Gradually building the CDMA by
using a succession of prototypes simply offers an easier way for people to monitor
development and check the specification is being interpreted correctly. It is more
accurate, and takes less work to document, than trying to use a paper specification on
its own, but it should always be used in conjunction with review of the detailed
metadata documents. In this approach, the final formal specification can be generated
from the final version of the prototype.
If prototyping is not used, and the specification only exists on paper, then towards
the end of the development phase, the system will need to be built from that
specification, initially to allow end users to examine it (as part of the development
phase, see the section below) and then, after any final changes, in order that it can be
validated. Thus, even if working only from a document-based specification, the
development phase should still end with an initial CDMA build.

“User Acceptance Testing”

The phrase “User Acceptance Testing” has been put into quotes to emphasize that it
is such a misleading and potentially confusing phrase that it really should be
avoided. There are three significant problems with the term:

• Different people refer to different groups as “users.” IT staff often, but not
consistently, refer to data management staff as the users, but the data management
staff usually mean the end users of the system at the clinical sites.
• Whoever makes the final acceptance decision for a system, it is almost certainly
not the users – it is more likely to be a sponsor’s representative, the study project
manager, or the unit’s operational manager.
• Users, especially end users at the clinical site, rarely test anything. They may
inspect and return useful feedback – about eCRF design, misleading captions,
illogical ordering of data items, etc., but they can rarely be persuaded to system-
atically test a system, and one would not normally expect them to do so.

Having said all of that, “input from site-based users” can often be a useful thing to
factor into the end of the development phase. The system obviously needs to be up
and running and available to external staff, and system development should be in its
final stages – there is little point in asking end users to comment on anything other
than an almost completed system. Normally only a small subset of site-based staff
should be asked to comment, drawn from those that can be expected to provide
feedback in a timely fashion. Such feedback is best kept relatively informal – emails
listing queries and the issues found are usually sufficient.
The decision to use end-user feedback should be risk based. A simple CDMA that
is deployed only to sites that already have experience of very similar trials will have
little or no need for additional feedback from end users. But a CDMA that includes
novel features or patterns of data collection could benefit from site-user feedback. If
new sites are being used for a trial, especially if they are from a different country or
language group, then user feedback can be very informative in clarifying how the
eCRFs will be interpreted and in identifying potential problems.
The key point is that this is late input into the design and development phase
– it is not “testing.” It is obtaining feedback from the final group of stakeholders.
The difficulty that many trials units have is that they seek end-user feedback at
the same time as beginning the testing and validation phase. They sign off the
design as approved and start to test the system. Then feedback from end users
arrives and results in changes (most commonly over design issues but more
fundamental changes to data items may also be required) and then they have to
start the testing again. The work, and its documentation, expands and risks
becoming muddled. One of the basic principles of any validation exercise is
that it must be against a fixed target – hence the need to garner all comments,
from all stakeholders, and complete the entire design and development process
before validation begins.

Final Approval of the Specification

As a quality control mechanism (and as evidence for external auditors), it is important to have a final sign-off from major stakeholders stating that they are
happy with the final functional specification. This group should include as a mini-
mum the chief or coordinating investigator, the study statistician, and the study’s
project manager. For commercially sponsored studies, a representative of the spon-
sor is also often included, and others (the QA officer, IT staff, data management staff)
may be added according to local procedures.
Cross-disciplinary approval does not necessarily mean that all parties will check
the specification for the same things. It is probably unreasonable to expect the chief
investigator to look through every data item in detail, but they should at least be
satisfied that the main outcome and safety measures are properly covered. As
mentioned earlier, statisticians may be asked to check there is no unnecessary
data being collected, as well as confirming that the collected data will be fit for their
analysis. A trial manager will probably check the eCRFs in detail and confirm that
feedback has been received from end users, while the unit’s quality manager may
also check for adherence to unit policies on CDMA design, use of coding systems,
etc. Some of this assessment can be done by inspecting the system itself, but some
of it will require checking of more detailed documents. The outcome of the
approval process should be a suitably headed sheet bearing the required dated
signatures.

Using Data Standards

One of the ways of making a study data system easier and quicker to develop, and of
making the resulting system and the data exported from it easier to understand, is to
establish conventions for the naming and coding of data items, and to stipulate
particular “controlled vocabularies” in categorized responses. That can provide a
consistency to the data items that can be useful for end users and a consistency to the
data that can be useful for statisticians.
Consistency can be extended into a “house style” for the eCRFs, with a standard
approach to orientation, colors, fonts, graphics, positioning, etc. (so far as the CDMS
allows variation in these) and to the “headers” of the screens, that usually contain
administrative rather than clinical data (e.g., study/visit/form name). This simply
makes it easier for users to navigate through the system and more easily interpret
each screen, and to transfer experience gained in one study to the next.
Establishing conventions for data items can provide greater consistency within a
single trials unit, but at a time when there is increased pressure for clinical
researchers to make data (suitably de-identified) available to others, the real value
comes from making use of global, internationally recognized standards and con-
ventions, which allow data to be compared and/or aggregated much more easily
across studies.
Fortunately, a suite of such global standards already exists. These are the various
standards developed by CDISC, the Clinical Data Interchange Standards Consor-
tium. The key CDISC standards in this context are CDASH (CDISC 2020b), from
the Clinical Data Acquisition Standards Harmonization project, and the TA or
Therapeutic Area standards (CDISC 2020c). Both are currently used much more
within the pharmaceutical industry than the noncommercial sector. The FDA, in the
USA, and the PMDA, in Japan (though not yet the EMA in the EU), have stipulated
that data submitted in pursuance of a marketing authorization must use CDISC’s
Study Data Tabulation Model (SDTM), a standard designed to provide a consistent
structure to submission datasets. Creating SDTM structured data is far easier if the
original data has been collected using CDASH, which is designed to support and
map across to the submission standard.
Trials units in the noncommercial sector do not generally need to create and
document SDTM files, and consequently have been less interested in using CDASH,
although many academic units have experimented with using parts of the system.
The system is relatively simple conceptually, but it is comprehensive and growing,
and it does require an initial investment of time to appreciate the full breadth of data
items that are available and how they can be used. The nature and use of data
standards are treated in more detail in the chapter on the long-term management of
data and secondary use. The key takeaway for now is that an evaluation of CDASH
and its potential use within study designs is highly recommended.
Along with the CDISC standards and terminology, other “controlled vocabular-
ies” can also help to standardize trial data. For the coding of adverse events and
serious adverse events, the MedDRA system (MedDRA 2020) is a de facto standard.
Drugs can also be classified in various ways, though the ATC (Anatomical
Therapeutic Chemical) scheme is the best known (WHO 2020). Other systems
include the WHO’s ICD for disease classification, and MESH and SNOMED CT
for more general medical vocabularies, though in general, the larger the vocabulary
system, the more difficult it is to both integrate it with a CDMS and ensure that staff
can use it accurately.
MedDRA is the most widely used of all these systems and is, for example,
mandated for serious adverse event reporting in the EU. Its effective use requires
training, however, and a variety of study specific decisions need to be considered and
documented (in the DMP). For instance, what version of MedDRA should be used
(the system is updated twice a year) and how should upgrades, if applied, be
managed? How should composite adverse event reports (“vomiting and diarrhea,”
“head cold and coughing”) be coded? Probably most critically, which of the higher
order categories, used for summarizing and reporting the adverse events, should be
used when categorizing lower level terms? MedDRA is not a simple hierarchy, and a
lower level term can often be classified in different ways. A “hot flush” (or “flash”)
can be related to the menopause, hyperthyroidism, opioid withdrawal, anxiety, and
TB, among other things – so how should it be classified? The answer will normally
depend on the trial and the participant population, but where there is possible
ambiguity, a documented decision needs to be taken, so coding staff have the
required guidance. Such ambiguity is also the reason why MedDRA auto-coding
systems should be treated with caution, unless they can be configured or overridden
when necessary.
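The need for documented, study-specific coding decisions can be made concrete with a small sketch. Everything in it is hypothetical: the guidance table simply illustrates how a trial might record its agreed handling of ambiguous or composite terms, so that coding staff, and any auto-coder override, apply the same decisions consistently.

```python
# A sketch of study-specific coding guidance of the kind a DMP might
# document. All entries are hypothetical illustrations, not real MedDRA
# assignments for any particular trial.
STUDY_CODING_GUIDANCE = {
    # reported term -> agreed coding decision for this trial
    "hot flush": "code in the menopausal context (per trial population)",
    "hot flash": "synonym of 'hot flush'; code identically",
    "vomiting and diarrhea": "composite report; split and code as two events",
}

def coding_decision(reported_term: str):
    """Return the documented decision, or None to force manual review."""
    return STUDY_CODING_GUIDANCE.get(reported_term.strip().lower())

print(coding_decision("Hot Flash"))  # -> "synonym of 'hot flush'; code identically"
```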

Validating the Specification and Final Release

Once the functional specification has been approved, the prototype that has been
built upon it needs to be validated. During the build, or during successive pro-
totypes if that has been the approach taken, the basic functionality will have been
tested by the staff creating the system, but this is normally an informal process and
unlikely to have been documented. Validation, in contrast, requires a systematic,
detailed, and documented approach to testing all aspects of the system. It is
intended to verify that:

• The build has been implemented correctly – i.e., that it matches the specification,
and is therefore fit for purpose.
• The detailed logic built into the system, e.g., the consistency checks between data
items, or the production of derived values, works as expected.

Validation of a study definition is therefore essentially a test and debugging exercise. By default, validation should mean that all elements and all logic in the
system are tested. That includes ensuring data types, captions, tab order, and code
lists are all correct for each data item, and also systematically checking the skipping
logic, derivation logic, and each of the data validation (range and consistency)
checks.
Validation can therefore be a rather tedious, mechanical exercise, for example,
when “walking through” each range limit check (inputting values just under the
limit, at the limit and just above) to ensure that the system fires a warning message
when appropriate and accepts a valid value without complaint. It may therefore be
carried out by relatively junior staff, which is fine if the specification and any
additional instructions are clear and there is sufficient supervision. Validation should
not be carried out by anyone that constructed the system, because any misinterpre-
tations of the original requirements will simply be repeated. In some cases, the data
managers for the study are asked to carry out the validation exercise. This has the
advantage that if they were not very familiar with the details of the system before,
they certainly will be after the exercise is completed, but it may not be the best use of
skilled staff time.
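As a small illustration of the mechanics, the sketch below generates the boundary test values for a single numeric range check. The limits are invented, and inclusive limits are assumed (some CDMS treat range limits as exclusive, which changes the expected outcomes).

```python
# A sketch of boundary-value testing for one numeric range check.
# Limits are illustrative and assumed inclusive: values at or inside the
# limits should be accepted silently; the others should fire the warning.
def boundary_cases(lower: float, upper: float, step: float = 0.1):
    return [
        (lower - step, "expect warning"),
        (lower,        "expect acceptance"),
        (lower + step, "expect acceptance"),
        (upper - step, "expect acceptance"),
        (upper,        "expect acceptance"),
        (upper + step, "expect warning"),
    ]

# e.g., a weight-in-kg check with illustrative limits of 35 and 200
for value, expected in boundary_cases(35.0, 200.0):
    print(f"enter {value}: {expected}")
```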
The default validation strategy should be described in an SOP, but study-
specific decisions may alter that strategy in any particular case. For instance, it
may be decided to skip some checks if they have already been covered earlier. A
check that a date entered is not in the future is a condition commonly applied to
date data. If the date questions have been copied from a common precursor (most
CDMS allow data items to be copied and pasted during the design process) that
already included this test, it does not necessarily need to be checked for all the
derived date items. Similarly, an eCRF representing a questionnaire, imported
from another study where it had been used and tested previously, may not need
such a detailed checking as a completely novel eCRF. Conversely, some CDMAs
will require additional testing for functionality (like coding, or message trigger-
ing) that is specific to a study definition. The individual managing the validation
process, e.g., the trial manager, should make risk-based decisions about the
detailed strategy to be followed and document the justification for them (for
example in the DMP).
Another approach to CDMA validation involves completing dummy paper
CRFs, inputting them into the CDMA, and then exporting them again in a form
that is readily comparable with the original data. This has the advantage of testing
overall system usability as well as many of the functional components of the
system, and also means that the extraction/reporting functions are tested as well.
Unfortunately, unless a very large set of test data is prepared, not all components
of the system will be tested in a systematic way. If used, this method should
therefore probably be seen as an addition to the detailed testing of each component
described above.
The bugs found in the exercise and their resolution should be documented,
usually by the record of the relevant retests. At the end of the exercise, it should
be possible to show the system does indeed meet its functional specification, and the
validation should then be signed off. The signatures or initials of those who were
actually doing the validation should be embedded in the test documents themselves.
The sign off needs to be done by whoever is responsible for releasing the system into
production, usually the manager responsible for the validation process, such as the
trial’s project manager, or the unit’s quality or operations manager, who are in a
position to judge that the validation has been successfully completed.

Development and Testing Environments

Systems used for CDMA development and those used for CDMA production use
should be isolated from each other. Development and production environments have
very different user groups and (assuming there is no real data in the development/test
environment) different security requirements. The system should be developed on
machines specifically reserved for development and testing, and there should be no
possibility, however unlikely, of any problems in a developing CDMA spilling over
to affect any production system. Similarly, there should be no possibility of users,
including IT staff, inadvertently confusing development and production systems.
This allows the production servers to be kept in as simple and as “clean” a state as
possible, unencumbered by additional versions of the same study system, making
their management easier and providing additional reassurance that their validation
status is being maintained (see Fig. 5).

Fig. 5 An example of one arrangement for development and production environments. The DB/web server combination in A is used for most of the development process but never contains any real data. The test environment in B can be linked to external users and is used to complete development. It can also support validation and later testing when backfill with real data may be useful. C is the “clean” production environment. Both B and C have tightly controlled access

With the virtual machines that are now commonly used, “isolated from” means
logically isolated rather than necessarily on different physical hardware. That means
distinct URLs for the web-based components, distinct connection strings for data-
base servers, and different users and access control regimes on the different types of
server. It is also a good idea if development systems can be clearly marked as such on
screen (e.g., by the use of different colors and labels).
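A sketch of what that logical separation might look like in configuration terms follows; the URLs, connection strings, and banner text are invented placeholders.

```python
# A sketch of clearly separated environment configurations. All URLs and
# connection strings are invented placeholders.
ENVIRONMENTS = {
    "development": {
        "web_url": "https://cdms-dev.example-unit.org",  # distinct URL
        "db_conn": "Host=db-dev;Database=trial_x_dev",   # distinct database
        "banner": "DEVELOPMENT SYSTEM - NO REAL DATA",   # on-screen label
    },
    "production": {
        "web_url": "https://cdms.example-unit.org",
        "db_conn": "Host=db-prod;Database=trial_x",
        "banner": None,  # the live system carries no warning banner
    },
}
```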
Note that if end-user input is required, so also is remote access to the development
and test system. In some IT infrastructures, this may be problematic and necessitate a
separate, third, testing environment, specifically for external access to a copy of the
developing system. This environment could therefore be used for the final stage of
system development, but once this was complete and the specification approved, it
would also be the obvious place in which to carry out validation.
This “Final Development” environment can also be useful for training purposes –
giving external users access to the fully validated system so that they can familiarize
themselves with it. Site-based users may want to input real data into the system
during this training phase, one reason why it should be under the same tight access
control within the IT infrastructure as the production system. The other benefit of a
secure testing/training environment is that the system can also be used for “backfill”
of data during revalidation exercises, as described in more detail in the section
“Change Management.”
Figure 5 illustrates such a combination of systems, but it is stressed that it is
only an example of many possible ways of arranging development, test, and
production environments. The optimum will depend on the server, security, and
access options available. If the initial development environment can handle exter-
nal users, then the A and B environments can be merged into one, as long as the
possibility of that environment having sensitive personal data in it is fully
considered.
It is possible to support training on a production server, by setting up a dummy
site within that system and initially only giving site-based users access to that site.
This can be simpler to manage than using a separate system, and it ensures that the
training system will exactly match the production system’s definition. It has the
disadvantage, however, that all the data from the training “site” need to be excluded
from the analysis dataset (during or after the extraction process).

The Study Data System in the Longer Term

Change Management

Once the CDMA’s final specification has been approved, any further changes to that
specification will need to be considered within a formal change management pro-
cess, to ensure that all stakeholders are aware of proposed changes and can comment
on them, and that the changes are validated.
Any request for a change in the system should therefore be properly described and
authorized, so a paper or screen-based proforma needs to be available, to be
completed with the necessary specification and justification for the change. Changes
may be relatively trivial (a question caption clarified) or substantial (additional
eCRFs following a protocol amendment). Whoever initiates the change process,
staff should be delegated to assess the possible impacts of the change and identify
any risks that might be associated with it.
Risk-based assessment is the key to change management. The easiest way of
handling and documenting that process is to use a checklist that considers the
common types of potential impact. These can include:

• Impact on data currently in the database. Any change that dropped a data item or
a category from the system and “orphaned” existing data would not normally be
allowed and should be rejected. In fact, many CDMS would automatically block
such a change, though many do allow a field to be hidden or skipped within the
user interface.
Other changes may have less obvious consequences. For instance, if new options
are added to a drop-down to give a more accurate set of choices to the user, does
the existing data need to be reclassified? If so, how and by whom?
• Impact on validation checks and status. If a new consistency test is added, how
can the existing data be tested against it? If a new data item is added, does it need
new consistency checks to be run against other data? Detailed mechanisms are
likely to be system dependent but need to be considered and the resulting actions
planned.
• Impact on data extraction. In many cases, extraction will use built-in automatic
mechanisms, but if any processing/scripts are used within the extraction process,
will they be affected by the change? If additional fields are added, will that data
appear in the extracted datasets?
• Impact on metadata. A metadata file, or at least a “data dictionary,” should always
be available, for instance, to support analysis. Any change will render the current
metadata out-of-date and require the production of a new version.
• Impact on analysis. If the statistician has already rehearsed aspects of analysis and
has the relevant scripts prepared, how will any additional data items be included?
Will any hidden, unnecessary fields still be processed?
For example, changing an item’s data type could make existing data invalid and is
not normally allowed, or even possible in many CDMSs. If it transpires that an
integer field needs to hold fractional values, and thus must be changed into a real
number field, it may therefore be necessary to add a new real field and hide or skip the original integer one. The database ends up with two fields holding data for the same variable, meaning that the statistician needs to combine them during the analysis (a sketch of such a combination follows this list).
• Impact on site-based end users. The staff inputting the data need to be informed
of any change and its implications. When and how?
• Impacts on system documentation and training. For substantial changes, simply
informing end users is unlikely to be enough. Study-specific documentation and
training may also need changing.
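As a sketch of the combination step mentioned in the integer-to-real example above (the column names are invented; the real names would come from the study’s data dictionary):

```python
# A sketch of combining a hidden integer field with its real-number
# replacement at analysis time. Column names are illustrative.
import pandas as pd

df = pd.read_csv("dosing_export.csv")

# Prefer the new real-valued item; fall back to the original integer item
# for records entered before the change was made.
df["dose_mg"] = df["dose_mg_real"].fillna(df["dose_mg_int"])
```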

Considering possible risks in this systematic way provides a solid basis for
identifying and documenting the possible sequelae of a proposed change, deciding
if the change should be allowed and, if it is allowed, identifying the follow-on
actions that will be required. Those actions are likely to include testing of the revised
system, with the test results documented and retained. It also means that the change
management process needs to involve statisticians as well as trial managers, and the
IT and/or data management staff who usually implement the change. The key staff
involved should explicitly sign off the change.
Substantial system changes often result from protocol amendments and cannot
be released into the production version of the system until those amendments have
been fully approved. It can sometimes happen, though it is relatively rare, that a
requested change implies a change in the protocol, even when it has not been
presented or recognized as such. This is another reason for the change manage-
ment process to include review by experienced staff (usually the trial manager and
statistician), or even the whole trial management team, to ensure that any need for
protocol amendment is recognized and acted upon before the change is
implemented. In other words, change management should never be seen as a
purely technical process.
Implementation of any change should always occur in all the environments being
used – i.e., in the development environment, in any intermediate test and training
environment, and finally in the production system. The flow of changes should be
unidirectional, with, in each case, a revised study definition exported to the destination
system. It can be tempting, for a trivial change, to shortcut this process and just (for
example) change a caption in the production system. But this then risks being over-
written back to the previous version, when, following a more substantial change
elsewhere in the system, a new study definition is imported from the development
environment.
The testing required will occur in the development and any test system. For some
changes, it may be considered more realistic, and therefore safer, to test against the
whole volume of existing data, rather than just the small amount of dummy data that
usually exists within development environments. Backfilling the test/development
server, or at least one of them if there are multiple development environments, with
the current set of real data can therefore be a useful way of checking the impact of
changes on the current system. This does depend, however, on the test server having
a similar level of access control as the production system, otherwise there is a risk
that sensitive personal data is exposed more widely than it should be, and that
nonspecialist staff are unnecessarily exposed to sensitive data.
A coherent and consistent versioning system can help to support any change
management process. All versions of the study definition should be clearly labelled
and differentiated, for instance, by adapting the three part “semantic versioning”
system used for software (Semver 2020). In this scheme,

• The specification as finally approved should be version 1.0.0 (while versions in development are 0.x.y).
• Changes that involve a protocol amendment should increment the first number.
• Changes that are not protocol amendments, but which include changes to the data
in any way (including changes to the options available to categorized items),
should increment the second number.
• Changes that do not include changes to the data – e.g., changes in presentation or
to the logic checks used – should change only the third number.

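These rules can be sketched as follows; the convention of resetting the lower-order numbers on each higher-order change is carried over from semantic versioning, and is an assumption here rather than a stipulation of the text.

```python
# A sketch of the three-part study definition versioning scheme described
# above. Lower-order numbers reset on a higher-order change, following the
# semantic versioning convention.
def bump(version: str, change: str) -> str:
    """change is 'protocol' (amendment), 'data' (data items or options),
    or 'presentation' (layout, logic checks)."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "protocol":
        return f"{major + 1}.0.0"
    if change == "data":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

assert bump("1.0.0", "data") == "1.1.0"          # new drop-down option, say
assert bump("1.1.2", "protocol") == "2.0.0"      # new eCRFs after amendment
assert bump("1.1.0", "presentation") == "1.1.1"  # a caption clarified
```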
Changing a study definition, even within a well-managed system, is an expensive process that can carry risks. All stakeholders need to be aware of that and so
minimize the changes they request. The best way to do that is by the rigorous and
collective development of an initial specification that accurately meets the needs of
the study.

Exporting the Data for Analysis

At the end of a trial, the data needs to be extracted for analysis, usually in a generic
format (csv files, CDISC ODM) or one tailored to a particular statistical package (e.
g., SAS, Stata, SPSS or R). Because most statistics packages can read csv or similar
text files, the ability to generate such files accurately is the key requirement.
Data extractions can take place before this of course, e.g., for interim safety
analysis by a data monitoring committee, for central statistical monitoring, and to
support risk-based monitoring decisions. In the noncommercial sector, trials may
also be extended into long-term follow-up, so that data is periodically extracted and
analyzed long after the primary analysis has been done and the associated papers
published.
The extraction process, especially when supplying data for the main analysis,
needs to be controlled and documented. An SOP should be in place outlining roles,
responsibilities, and the records required, often supported by a checklist that can be
used to document the readiness of the database for extraction. The checklist should
confirm that:

• All data is complete, or explicitly marked as not available.
• Outstanding queries are resolved.
• All data coding is completed (if done within the CDMS).
• All planned monitoring and source data verification has been completed.
• All data has been signed off as correct by the principal investigators at sites.
• Serious adverse event data (transmitted via expedited reporting) has been recon-
ciled with the same data transmitted through standard data collection using eCRFs.

Any exceptions to any of the above should be documented. Most CDMS include
a “lock” facility which prevents data being added or edited, and this can be applied at
different levels of granularity, e.g., from an individual eCRF, to a whole participant,
to a clinical site, to the whole study. Once the issues listed above have been checked,
one would expect the entire database to be locked (with any later amendments to the
data rigorously controlled by an unlocking/relocking procedure that clearly
documents why the amendments were necessary).
The extraction process results in a series of files, with traditionally the data items in
each file matching a source eCRF, or a repeating question group within a CRF.
Although the data appears to be directly derived from the eCRFs, the extraction usually
requires a major transformation of the data, because in most cases, the data is stored
quite differently within the CDMS database. Internally most systems use what is called
an entity-attribute-value (EAV) model, with one data row for each data item, and often
with all the data, from all subjects, visits, and eCRFs, stored in the same table.
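To make the EAV idea concrete, the sketch below shows a handful of EAV-style rows and the kind of pivot an extraction performs to produce one wide table per eCRF. This is only an illustration: all field names and values are invented, and real CDMS tables carry many more columns (audit trail, status flags, visit identifiers).

```python
# Minimal illustration of restructuring EAV-style rows into wide,
# per-eCRF tables. Field names and values are invented for the example.

from collections import defaultdict

# One row per data item; all forms and subjects share the same "table".
eav_rows = [
    {"subject": "001", "form": "VITALS", "item": "sbp", "value": "124"},
    {"subject": "001", "form": "VITALS", "item": "dbp", "value": "78"},
    {"subject": "002", "form": "VITALS", "item": "sbp", "value": "131"},
    {"subject": "001", "form": "DEMOG",  "item": "sex", "value": "F"},
]

# Pivot: one output table per form, one row per subject, one column per item.
tables = defaultdict(lambda: defaultdict(dict))
for row in eav_rows:
    tables[row["form"]][row["subject"]][row["item"]] = row["value"]

for form, subjects in tables.items():
    print(form)
    for subject, items in subjects.items():
        print(" ", subject, items)
```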
The EAV structure is necessary to efficiently capture the audit data that is a
regulatory requirement, to more easily support various data management functions
like querying, and to provide the flexibility that enables a single system to store the
data from different studies, each with a wide variety of eCRF designs. It is almost
never evident to the end users, who instead see the data points neatly arranged within
each eCRF, the system consulting the relevant study definition to construct the
screen and place the data items within it as required.
When the data is extracted, the audit and status data for each item is usually left
behind, and the data is completely restructured as a table per eCRF or repeating
group as described above. This underlines the need for the validation of data
extraction, because not only is the output data central to the research, the process
by which it is created is complex. Extraction mechanisms will usually be tested
within the initial validation of the CDMS, but this often involves just a very small
data load from a simple test CDMA. Extractions from real CDMAs should undergo a
risk-based assessment of the need for additional, study specific, validation. The
validation does not usually need to be extensive or burdensome, but it is worth
checking (and documenting) that, for instance:

• The number of extracted study participants is correct.
• The data for the first and last participants appears correct (because extraction issues tend to affect the edges of data sets).
• Dates have retained their correct format.
• The correct version of (examples of) corrected data is extracted.
• Fields with any unusual (e.g., non-Latin) characters have been extracted properly.
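Some of these checks lend themselves to simple scripting. The Python sketch below illustrates how a few of them might be automated against an extracted csv file; the file name, column names, and expected participant count are placeholders for study-specific values, not part of any standard tool:

```python
# Hypothetical post-extraction spot checks on an exported csv file.
# File name, column names, and the expected participant count are
# placeholders for study-specific values.

import csv
from datetime import datetime

EXPECTED_PARTICIPANTS = 250          # from the CDMS participant count
EXTRACT_FILE = "vitals_extract.csv"  # one extracted eCRF table

with open(EXTRACT_FILE, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Check 1: the number of extracted participants is correct.
subjects = {r["subject_id"] for r in rows}
assert len(subjects) == EXPECTED_PARTICIPANTS, "participant count mismatch"

# Check 2: dates have retained the agreed format (here ISO 8601).
for r in rows:
    datetime.strptime(r["visit_date"], "%Y-%m-%d")  # raises if malformed

# Check 3: unusual characters survived the export (not mangled to '???').
suspicious = [r for r in rows if "???" in r.get("free_text", "")]
print(f"{len(rows)} rows checked; {len(suspicious)} suspicious free-text fields")
```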

As more extractions are performed and checked, the level of confidence in the
system will grow, and the need for validation can become less, especially if a trial is
similar in design to a previously extracted study. But if a CDMS update is applied,
the risk may increase again, and so should the inspection of the extracted data.
Once the data has been extracted, it is often combined with data from other
sources, for example:

• From collaborators: Although sometimes such data may be imported into the
CDMS, more often it will be imported by aggregating extracted records. Care
must be taken that the extractions are fully compatible.
• From treatment allocation records: Up to this point, this data may have been kept
separately to preserve blinding.
• From laboratories: It is usually simpler to add data from external laboratories at
this stage rather than trying to import it into the CDMS, but this is a study-specific
decision, and may depend on lab preference and the need to carry out range and
consistency checks on the data.
• From coding tools, because in some trials units and CROs, coding is done on the
extracted data rather than within the CDMS.

Exactly how this data is aggregated with that from the CDMS should be planned
and documented within the data management plan. It is important that a description
of the newly combined data is included within the metadata documents for the study,
so in these cases, if metadata is normally generated by the CDMS, it will need to be
supplemented by additional documents.
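As a rough illustration of such aggregation, the following pandas sketch merges a CDMS extract with separately held allocation and laboratory files on a participant identifier. All file and column names are invented for the example; the validate argument guards against unexpected duplicate or missing identifiers:

```python
# Illustrative aggregation of a CDMS extract with externally held data,
# using pandas. File and column names are invented for the example.

import pandas as pd

cdms = pd.read_csv("cdms_extract.csv")         # per-participant eCRF data
alloc = pd.read_csv("allocations.csv")         # kept separate to preserve blinding
labs = pd.read_csv("central_lab_results.csv")  # supplied by the laboratory

# Merge on the participant identifier; validate="one_to_one" raises if the
# allocation file contains duplicate or conflicting identifiers.
analysis = cdms.merge(alloc, on="subject_id", validate="one_to_one")
analysis = analysis.merge(labs, on="subject_id", how="left")

# Record what was combined, for the study's metadata documentation.
print(analysis.shape, list(analysis.columns))
analysis.to_csv("analysis_dataset_v1.csv", index=False)
```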
The final analysis dataset, comprising the data from the CDMS and any additional
material integrated with it, needs to be safely retained. This is partly for audit or
inspection purposes, and partly to allow the reconstruction of any analysis using the
same extracted data, if that is ever required. In practice, it can be done by adding the
analysis dataset, in a folder, clearly labelled and date stamped according to an agreed
convention, into a read-only area of the local file system. A group (usually the IT staff,
who are deemed to be uninterested in the data content) has to have write privileges on
this area for the data to be loaded, but all other users, including the statisticians who
need to analyze the data, must take copies of the files if they wish to work on them.
Though not part of the CDMS itself, the procedures and infrastructure that protect
the output data from accidental modification, and from any suspicion of intentional
editing, are an important part of the total study data system. They form the final link
in the chain that begins with the study
protocol, stretches through system design, definition, and testing, moves on to months
or years of data collection, with maximization of data quality, and finally ends with the
primary function of the system – the delivery of data for analysis.

Summary and Conclusion

A study data system is centered around a specialist software tool – the Clinical Data
Management System or CDMS – that provides the core functionality required to
guarantee the regulatory compliance of data collection, plus the flexibility needed to
support a wide range of different study designs and data requirements. CDMSs or,
increasingly, externally hosted CDMS services, are usually purchased from special-
ist vendors. The CDMS is the core component but by no means the only one:
supporting sub-systems, e.g., for coding, file storage, backup, and metadata produc-
tion, may also be involved. The “system” also includes the competences of the staff
that operate it and, crucially, the set of policies and procedures that govern workflow.
It is these policies, more than the technical infrastructure, which determine the
quality of any study data system.
Procedures are especially important for supporting the workflows around devel-
oping and then validating the systems constructed for individual studies, ensuring
that these activities are done in a consistent, clear, reliable, and well-documented
fashion. They are also key to the systematic consideration and application of (for
example) data standards, systems for managing data quality, procedures for change
management, import and aggregation of externally derived data, preparation for data
extraction, and the extraction process itself.
The data flow of modern study data systems is now dominated by a web-based
approach (eRDC) that removes the need to install anything at the clinical site or
provide additional hardware, as was the case in the past. Over the last 20 years,
eRDC has almost entirely supplanted traditional paper-based data transfer. There is
growing interest in extending this approach directly to the study participants,
capturing data directly from them using smart phones or portable monitoring devices. The
major current trend in study data systems, however, is the growing use of externally
hosted systems, so that the coordinating center or trials unit, as well as the clinical
sites, access the system through the internet. This approach can bring greater
flexibility and reduced costs, but it carries potential risks, for example, around
communication, responsiveness, and quality control. Developing the technical and
procedural mechanisms to better manage these risks is one of the biggest challenges
facing vendors and users of study data systems today.

Key Facts

1. The core software component of any study data system is a specialist tool known
as a Clinical Data Management System, or CDMS, usually purchased on a
commercial basis.
2. A web-based data management workflow known as eRDC, for electronic remote
data capture, is used in the great majority of clinical studies.
3. The data system consists not only of the CDMS software but also of the people
managing it and the policies and procedures that govern workflows and data
flows.
4. Increasingly, study data systems are provided remotely, as “software as a
service” or SaaS.
5. SaaS offers advantages (e.g., reduced system validation load) but can carry
risks. A range of communication problems have been identified in SaaS
environments.
6. A trials unit retains the overall responsibilities for safe, secure, and regulatory
compliant data management, as delegated from the sponsor, even when some of
the functions involved are subcontracted to other agencies. Its quality manage-
ment strategy therefore needs to include mechanisms for monitoring the work of
these subcontractors.
7. Developing a successful study data system for any specific study requires a clear
separation between the development of a detailed specification for the required
system, requiring input and agreement from all important stakeholders, and a
second validation step, requiring detailed, systematic testing of the completed
system.
8. The use of data standards can reduce system development time and increase the
potential scientific value of the data produced.
9. The production version of the study data system should be maintained separately
from the development and/or training versions of the same system, and be
accessed using different parameters.
10. Proposed changes in the study data system need to be managed using a clear and
consistent risk-based change management system.

Cross-References

▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Long-Term Management of Data and Secondary Use
▶ Patient-Reported Outcomes
▶ Responsibilities and Management of the Clinical Coordinating Center

References
Blumenberg C, Barros A (2016) Electronic data collection in epidemiological research, the use of REDCap in the Pelotas birth cohorts. Appl Clin Inform 7(3):672–681. https://doi.org/10.4338/ACI-2016-02-RA-0028. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052541/. Accessed 31 May 2020
Canham S, Bernalte Gasco A, Crocombe W et al (2018) Requirements for certification of ECRIN data centres, with explanation and elaboration of standards, version 4.0. https://zenodo.org/record/1240941#.Wzi3mPZFw-U. Accessed 31 May 2020
Capterra (2020) Clinical trial management software. https://www.capterra.com/clinical-trial-management-software. Accessed 31 May 2020
CDISC (2020a) The operational data model (ODM) – XML. https://www.cdisc.org/standards/data-exchange/odm. Accessed 31 May 2020
CDISC (2020b) Clinical data acquisition standards harmonization (CDASH). https://www.cdisc.org/standards/foundational/cdash. Accessed 31 May 2020
CDISC (2020c) Therapeutic area standards. https://www.cdisc.org/standards/therapeutic-areas. Accessed 31 May 2020
Dillon D, Pirie F, Rice S, Pomilla C, Sandhu M, Motala A, Young E, African Partnership for Chronic Disease Research (APCDR) (2014) Open-source electronic data capture system offered increased accuracy and cost-effectiveness compared with paper methods in Africa. J Clin Epidemiol 67(12):1358–1363. https://doi.org/10.1016/j.jclinepi.2014.06.012. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271740/. Accessed 31 May 2020
ECRIN (2020) The European Clinical Research Infrastructure Network. http://ecrin.org/. Accessed 31 May 2020
El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A (2009) The use of electronic data capture tools in clinical trials: web-survey of 259 Canadian trials. J Med Internet Res 11(1):e8. https://doi.org/10.2196/jmir.1120. http://www.jmir.org/2009/1/e8/. Accessed 31 May 2020
Eur-Lex (2020) The general data protection regulation. http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679. Accessed 31 May 2020
Fleischmann R, Decker A, Kraft A, Mai K, Schmidt S (2017) Mobile electronic versus paper case report forms in clinical trials: a randomized controlled trial. BMC Med Res Methodol 17:153. https://doi.org/10.1186/s12874-017-0429-y. Published online 1 Dec 2017. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5709849/. Accessed 31 May 2020
Green J (2003) Realising the value proposition of EDC. Innovations in clinical trials, September 2003, 12–15. http://www.iptonline.com/articles/public/ICTTWO12NoPrint.pdf. Accessed 31 May 2020
Kirkwood A, Cox T, Hackshaw A (2013) Application of methods for central statistical monitoring in clinical trials. Clin Trials 10:703–806. https://doi.org/10.1177/1740774513494504. https://journals.sagepub.com/doi/10.1177/1740774513494504. Accessed 31 May 2020
MedDRA (2020) Medical dictionary for regulatory activities. https://www.meddra.org/. Accessed 31 May 2020
Mitchel J, You J, Lau A, Kim YJ (2001) Paper vs web, a tale of three trials. Applied clinical trials, August 2001. https://www.targethealth.com/resources/paper-vs-web-a-tale-of-three-trials. Accessed 31 May 2020
OpenClinica (2020). https://www.openclinica.com/. Accessed 31 May 2020
RedCap (2020). https://www.project-redcap.org/. Accessed 31 May 2020
Semver (2020) Semantic versioning 2.0.0. https://semver.org/. Accessed 31 May 2020
Walker P (2016) ePRO – an inspector's perspective. MHRA Inspectorate blog, 7 July 2016. https://mhrainspectorate.blog.gov.uk/2016/07/07/epro-an-inspectors-perspective/. Accessed 31 May 2020
WHO (2020) The anatomical therapeutic chemical classification system, structure and principles. https://www.whocc.no/atc/structure_and_principles/. Accessed 31 May 2020
Yeomans A (2014) The future of ePRO platforms. Applied clinical trials, 28 Jan 2014. http://www.appliedclinicaltrialsonline.com/future-epro-platforms?pageID=1. Accessed 31 May 2020
14 Implementing the Trial Protocol

Jamie B. Oughton and Amanda Lilley-Kelly

Contents
Introduction
Protocol Development
Site Selection, Feasibility, and Set Up
  Site Characteristics
  Timing of Site Activation
Registration/Randomization System
Data Collection
Training
Risk and Monitoring
  Trial Risk Assessment
  Trial Monitoring Plan
Trial Oversight
  Project Team
  Trial Management Group (TMG)
  Independent Data Monitoring Committee (IDMC)
  Independent Trial Steering Committee (TSC)
Trial Promotion
  Trial Website
  Social Media
  Press
  Investigator Meeting
Summary and Conclusion
Key Facts
Cross-References
References

J. B. Oughton (*) · A. Lilley-Kelly
Clinical Trials Research Unit, Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
e-mail: [email protected]; [email protected]


Abstract
This chapter outlines the steps required to bring a protocol to life as a clinical trial.
Developing the protocol is a multidisciplinary effort which must be approached in
an ordered and logical way with clear leadership. Site feasibility assessment is a
common reason for trial failure when performed badly and, done well, is crucial to
capturing a generalizable population. The key site feasibility assessment issues are outlined.
The chapter goes on to give advice on data collection and how this should be
planned alongside developing the trial protocol. Trial processes must be ade-
quately described and trial staff trained well to maximize efficiency and minimize
error. Strategies to identify and mitigate risks to participant safety and trial
integrity are discussed along with techniques that can be implemented to monitor
the identified risks. Typical trial oversight groups and processes are provided to
reach a structure that is effective and proportionate to the level of trial risk.
Finally, suggestions for how to manage trial promotion to maximize engagement
with investigators, potential participants and other stakeholders are discussed.

Keywords
Protocol · Feasibility · Monitoring · Oversight · Publicity · Training program ·
Competency

Introduction

This chapter attempts to bridge the gap between the finalized protocol and patient
accrual. It outlines necessary considerations for identifying and selecting participat-
ing centers to optimize trial delivery. Similarly, it covers the vital task of risk
assessment and monitoring and the establishment of effective oversight bodies.
Finally, the topic of trial promotion is discussed, with suggestions based on the
needs and resources of a trial.

Protocol Development

The protocol is the most important document in a clinical trial and sufficient time and
expertise must be allocated to its development. The nature of the trial will dictate
which specialties will contribute but typically this will include: clinical, regulatory,
laboratory, statistical, operations, funder, and safety. A member of the coordinating
team, for example, the project manager, should take responsibility for making sure
each party has reviewed the protocol at the appropriate development stage.
Omissions or mistakes can be costly in the time taken to make amendments, and it is
therefore useful to obtain a final review from someone outside the immediate trial team.
The most appropriate protocol structure must be chosen at the start of develop-
ment. For example, is it reasonable to contain all the required information in one
protocol, or should there be separate sub-protocols underneath a master protocol? In a
platform trial (Park et al. 2019), for example, it is common to describe platform-wide
processes in a master protocol and processes that only exist in certain groups of
participants in a separate sub-protocol.

Site Selection, Feasibility, and Set Up

At an early stage of the development of a clinical trial, a decision must be made on
the scale of the trial. Single-site trials have the advantage of being easier and cheaper
to set up and a Chief Investigator (the person responsible for leading the clinical trial)
can effectively oversee the entire trial conduct. However, single-site trials may return
results that are less generalizable, are more susceptible to systematic bias, and limit the
number of available participants. Multisite trials reduce the impact of bias, improve
generalizability, and allow access to increased number of participants. The loss of
control in comparison with a single-site trial can be overcome with trial oversight
strategies. Multisite trials require significantly more resources: both financial and
staffing. International participation may also be desirable due to the rarity of the
target condition, scarcity of equipment or to accelerate recruitment.
Every clinical trial requires a legal entity to be responsible for trial conduct. This
legal entity is known as the sponsor. The sponsor is able to delegate responsibilities
to other institutions to carry out the trial; for example, a contract research organiza-
tion or academic coordinating center to develop the protocol or a participating
hospital to collect the data.

Site Characteristics

Investigators can overstate their capabilities to secure the opening of a desirable trial,
particularly with regard to recruitment projections. If there are more sites available
than required to conduct the trial, site selection criteria should be developed. If the
coordinating center has worked with the site before, suitability can be assessed from
past performance (e.g., set up time, recruitment, data quality). Alternatively,
self-assessment by the site using a questionnaire has the advantage of revealing an
investigator's commitment to the trial and collecting updated information. Often it
can be helpful to request objective data to support a recruitment target. For example,
by asking the site to review the last month’s clinic list and count the number of
patients with the target disease. The site should also provide the coordinating center
with evidence of staff who are qualified and who have any additional appropriate
training. This is usually achieved by reviewing copies of CV/resumes or training
certificates. It may be necessary to request the completion of a questionnaire to
address trial-specific issues; for example, where relevant, questions could include:

Is the investigator familiar with the proposed intervention?
Are there any competing trials open at that site?
Is there a facility to accommodate an overnight stay?
Is there a facility to prepare drugs under aseptic conditions?
Is it possible to review a CT scan and provide the report within 1 week?
Are there sufficient staff to facilitate the data collection?

The site must have sufficient staff (e.g., trial coordinators, research nurses, data
managers, trial pharmacists) to deliver the trial. For some trials, it may be appropriate
to recruit additional staff and time must be allowed for this. There must be sufficient
enthusiasm from management and the site investigators for the trial to succeed.
Sponsors should be sensitive to the motivations of a site to participate in research, be
they prestige, financial, or patient-driven. Where there are likely to be barriers to set
up or recruitment (for example, excess treatment costs in the UK), these should be
highlighted by the coordinating center at an early stage in set up and addressed with
appropriate guidance and mitigation.
For international trials, a range of issues may become relevant. Is the treatment
environment (e.g., political, financial, or environmental) likely to remain stable
during the lifetime of the trial? Are there factors that will limit the
delivery of a trial? For example, there are financial incentives towards hospi-
tal-based interventions in the United States or lack of refrigeration facilities in
some settings for a vaccine trial. International trials also bring significant
complexities due to multiple regulatory requirements; for example, in the United
States, each recruiting institution may require separate ethics approval, although
there has been a recent change for trials sponsored by the National Institutes of
Health (NIH) where single IRBs are encouraged/required. More detailed infor-
mation on setting up international trials is available elsewhere (Minisman et al.
2012; Croft 2020).

Timing of Site Activation

Once the sites have been selected, there needs to be a systematic approach to site set
up based on the resource available. It may be necessary to take a phased approach
rather than seeking to open all sites at the same time. This is often necessary because
of finite resources to complete all necessary tasks, e.g., on-site training, intervention
availability, or temporary low capacity at site.
It is important for sponsors and trial management teams to keep up to date with, and
engage with, national approval schemes to enable prompt set up. For example, in England, the Health
Research Authority (HRA) made changes to the way research is initiated in the
National Health Service (NHS). The approach now is that the HRA give permission
on behalf of the NHS, potentially making it faster for individual hospitals to
participate.
Trial-wide approvals from the ethics committee and/or regulatory authority must
have been received before a site is permitted to start recruiting participants. Sponsors
must have a robust process to ensure both site level and trial level approvals have all
been received.
Registration/Randomization System

For a randomized controlled trial, it is crucial to present a transparent, robust, and
reproducible technique to allocate participants to treatment groups. Most commonly
this is via an automated telephone- or web-based service that can be accessed
directly by participating sites. Access is restricted to sites that have been approved
for the trial and staff that have undergone the required training. Alternatively, a
paper-based randomization system could serve the same purpose.
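Many such services implement permuted-block randomization internally. The following is a minimal, illustrative Python sketch of that one technique; it is not a description of any particular service, and a production system would add stratification, allocation concealment, audit logging, and access control:

```python
# A minimal sketch of permuted-block randomization, one common technique
# behind automated allocation services. Illustrative only: a production
# system would add stratification, concealment, and audit logging.

import random

def permuted_block_schedule(n_participants: int, block_size: int = 4,
                            arms=("A", "B"), seed: int = 2024):
    """Generate an allocation list in randomly permuted balanced blocks."""
    assert block_size % len(arms) == 0, "block size must balance the arms"
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    schedule = []
    while len(schedule) < n_participants:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # permute within each block
        schedule.extend(block)
    return schedule[:n_participants]

print(permuted_block_schedule(10))  # e.g. ['B', 'A', 'A', 'B', ...]
```

Balanced blocks keep the arm sizes close throughout accrual, which is why the technique is widely used; the fixed seed here is only for reproducibility of the sketch.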

Data Collection

All trials require a robust system for collecting the data in a format that is transparent
and suitable for the analysis. Clinical trial data is almost always collated in an
electronic database nowadays, and many trials also use electronic data capture
systems to collect data from participating sites. It is necessary to have a firm idea
of the data items required before developing the database, and this usually begins
once the protocol has been finalized. The data collection process should be mapped
out from the source data to the final analysis. The source of the data will dictate the
format of data collection tools. For example, laboratory results can occasionally be
provided electronically as a table to the sponsor, whereas a physical examination
may require a paper/electronic form to report the result. The complexity of the
database will be dictated by the sources of data and risk assessment. The database
and data collection instruments should be finalized and fully tested before opening to
recruitment.

Training

A key part of ensuring successful delivery of a trial is a clear training program to
support the protocol. When developing a training program, there are many elements
to consider, with initial focus on the target of the training – who will implement the
protocol, and within which teams do they work?
It is essential to consider the structure of the team required to deliver the protocol,
what expertise they have, and which elements of the study they will support. To do
this, it is best to consider each element of trial delivery, usually summarized in a trial
flow diagram (Fig. 1). The trial flow will determine which team members will
support specific activities; for example, will a medic be required to identify potential
participants and support the consent process, or will a research nurse carry out the
screening and recruitment process? It is also important to consider the administration
involved in all trial activities, for example, will a member of the trial team produce
recruitment packs, request patient notes, screen participant admissions, and complete
general administration activities or will a research nurse complete these tasks?

Fig. 1 Flow diagram showing the stages of a clinical trial (Collett et al. 2017): Identification (screening of potential participants; consideration of eligibility); Recruitment (patient introduction via the patient information sheet; informed consent; eligibility assessment); Randomisation, only if applicable (patient registered/randomised to treatment); Treatment (commence treatment regime(s)); Follow-up (endpoint assessment).

It is also important to determine research-specific training requirements. Are the
team members delivering the project well versed in research, or are there basic
principles that need to be covered to ensure the trial is conducted appropriately? For
example, will the research team include clinical disciplines that are not often
involved in clinical research, such as allied health professionals (e.g., paramedics,
radiotherapists, or dieticians)? If so, should the training include an overview of Good
Clinical Practice (GCP) to underpin their participation in trial activities?
Dependent upon the trial, there may also be a need to consider additional training
requirements relevant to the trial population. For example, if participants will be
recruited from an older population that may have a high incidence of cognitive
impairment, it is important to provide an overview of any local legislation governing
participants with limited cognitive capacity and the process of informed consent
within a vulnerable population. Establishing expertise required is essential to the
development of the content of training; however, it is important to be mindful of
existing training programs available – balancing trial requirements with practical-
ities. It is often best practice to direct researchers to additional sources of information
that can cover complex topics in greater detail than required for specific trials.
Once the team delivering the trial and the expertise and knowledge required has
been defined, the content and structure of the training program can be developed. A
key consideration for the content of the training is the time available to support
training – how long will it take for each member of the team to be properly trained in
their elements of the trial? If the trial involves frontline clinical staff, there may be a
need to adapt training to accommodate other commitments. If there is a large team of
clinicians that would like to train together, what is an acceptable / practicable
duration? These considerations need to be balanced with the key components of
training, ensuring that topics are covered in order of priority, being mindful of the
logical flow of information and the burden of training on the target audience.
To establish a logical flow of information, it is often best to link back to
considerations of key trial elements and the relevant information around these
topics. An example of these broad topics, and associated subtopics to include, is
outlined in Fig. 2; the topics should be tailored to include trial specifics and time
allocated dependent on content. It is also important to consider the expertise required
within the training team to deliver these sessions, potentially using changes in trainer
as natural breaks to avoid audience burden. The content of training often evolves as
the program develops, and it is beneficial to gain input from the wider trial team to
review content as it develops.

Fig. 2 Overview of training schedule, covering: trial overview (background; trial design; inclusion/exclusion criteria; additional elements, i.e., process evaluation, economic analysis); participant recruitment (recruitment processes: screening, consent including types, eligibility assessments; data collection); registration/randomisation (process; systems); intervention (background to the development of the intervention; intervention schedule, i.e., number of contacts; type; process); and follow-up (timeline; type; process).
Methods of delivery for the training should be considered during development of
the training content, taking into account the audience and amenities available. Often a slideshow
is developed that can be adaptable for presentations where facilities exist or as
preprinted handouts if not. However, other options include online presentations
that could be delivered remotely (i.e., webinars/video blogs) and which also have
the added benefit of being reusable and readily available.
As part of the training program, materials need to be developed to support
training. For example, training slides, reference data collection instruments specific
to the trial, trial promotional materials, and site-specific documentation (i.e., Inves-
tigator Site File – ISF). Training development also often highlights procedures that
are more complex and require additional supporting information to ensure standard-
ized completion throughout the trial, in the form of a guidance manual or standard
operating procedure (SOP) with these materials provided as part of the training pack.
Once the training package is developed, it is important to consider how attendance
at training will be documented, and how robust this process needs to be. A list of
attendees should be generated if attending in person, or a system developed to monitor
access to online training materials for self-training. In some situations, local
investigator oversight of staff listed on an Approved Personnel List (APL) may be
sufficient. It is essential to have an audit trail of staff training and a clearly defined
and reproducible training package to support the trial delivery. These training records
should be retained for the duration of the trial, even if personnel change over time.
It is also essential to consider the on-going training and competency of site staff
delivering the trial – how are new staff sufficiently trained in a timely manner? Often
on-going training is trial specific and dependent upon things like trial design (i.e.,
number of sites/geographical location/duration), complexity of the trial (i.e., trial
design and intervention delivery), and the impact of poor performance (i.e., maxi-
mizing competency of site staff to deliver an intervention). It may therefore be
necessary to consider how to assess staff competency and identify additional training
requirements for key elements of the trial. An example would be monitoring
completion of key data collection instruments to highlight errors and disseminate
feedback to the site and wider trial team. Additional methods to support sharing of
best practice include regular teleconferences with trial teams, discussion boards,
newsletters, and social media. However, the team needs to be mindful of site burden
and consider the impact and benefits of these methods for site staff.

Risk and Monitoring

By their very nature, clinical trials contain an element of uncertainty. It is important
that investigators identify any significant risks before commencing a trial protocol
and develop effective strategies to mitigate such risks. The lowest risk trials contain
interventions that are already licensed and used as part of standard care and the
highest risk trials assess unlicensed interventions that are often earlier in the devel-
opment pathway.
In an attempt to stratify risks in noncommercial trials, Brosteanu et al. (2009)
developed the risk categories shown in Table 1.

Table 1 Risk stratification in noncommercial trials (adapted from Brosteanu et al. 2009)
Type A – No higher than the risk of standard medical care: trials involving licensed products, or off-label use if this use is established practice.
Type B – Somewhat higher than the risk of standard medical care: trials involving licensed products if they are used for a different indication, for a substantial dosage modification, or in combinations where interactions are suspected.
Type C – Markedly higher than the risk of standard medical care: trials involving unlicensed products.

The imminent European Union (EU) Clinical Trials Regulation No 536/2014
includes scope for central monitoring for low-intervention trials (EU Commission
2014). In the UK, the Medicines and Healthcare products Regulatory Agency
already permits different approaches based on the level of risk inherent in an
interventional drug trial (MHRA 2011). Risk adaptations can be made to the require-
ments for the original application and review process, drug labelling, drug account-
ability, and safety surveillance. The lowest risk trials can sometimes benefit from
expedited regulatory approval.

Trial Risk Assessment

For every trial, there must be an attempt to identify the potential hazards and an
assessment of the likelihood of those hazards occurring and resulting in harm. Risks
fall into two main categories: those that affect patient safety and those that affect the
integrity of the trial. Appropriate control measures should be documented for each risk.
It may be appropriate to include key elements of the risk assessment as part of the trial
protocol so that all stakeholders are fully informed. Key risks to participants should be
explained in the patient information or consent form. The risks described in the patient
information should be presented in the context of the disease and standard treatment.
Generally only risks that are common (between 1/1 and 1/100) or thought to be
particularly serious should be detailed in the patient information. Patient groups are
valuable to ensure patient information is appropriate and directed towards the needs of
patients.

Trial Monitoring Plan

The risk assessment should then be used to develop the trial monitoring plan. The
level of monitoring will depend upon the level of risk and the resources available.
Monitoring falls into two categories: on-site source data verification and central
monitoring. Pivotal trials that will be used as evidence to support a marketing
application and phase I trials will usually contain substantial on-site monitoring,
whereas an interventional trial evaluating two interventions already used in standard
care or where endpoints can be collected centrally from routine data may require
none. On-site monitoring should be complemented by centralized monitoring, where
the sponsor or delegate is provided with source data by the site (e.g., the laboratory
or imaging reports) in order to validate key endpoints.
On-site monitoring can be separated again into two categories: planned visits and
triggered visits.

Planned Monitoring Visits


Planned monitoring visits are usually scheduled to occur at key points of risk or vital
data collection time points. For example, the monitoring plan may require a visit
1 week after the first drug dose in order to evaluate the appropriate reporting of
adverse events or that the pre-dose assessments have been carried out correctly.
Visits can be weighted in favor of the interventions that have greater risk or sites with
the least experience.
Table 2 Example of triggered monitoring visit parameters

Parameter | Low (1) | Medium (2) | High (3)
Serious adverse event reporting | Within timelines | 1 SAE outside of timelines | >1 SAE outside of timelines
Overall case report form compliance | >90% | 70–90% | <70%
Recruitment | 80–100% of target | 50–79% of target | <50% of target
Problem data items | <1% | 1–20% | >20%
Protocol violations | No violations | 1 violation | >1 violation
Key personnel changes | No | Change to research nurse or other key staff in last 6 months | Change to investigator or high staff turnover in the last 6 months

Triggered Monitoring Visits


Before commencing a trial, it may be helpful to establish a list of parameters and a
severity score, which collectively can highlight sites for a triggered monitoring visit.
An example is presented in Table 2.
Each site is periodically scored and then ranked. The seriousness of any problems
is weighted within the score (1 point for low and 3 points for high). Sites with the
highest scores will pass the threshold for a triggered monitoring visit. Other options
are available to address issues, such as suspending recruitment, but this method is a
useful tool to identify sites that need extra support and monitoring. There is a risk of
high percentages due to low denominators, rather than risky data. To prevent
unnecessary action, consider setting a minimum threshold under which action will
not be taken.
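This scoring and ranking is straightforward to automate. Below is an illustrative Python sketch using a subset of the Table 2 parameters, including the minimum-denominator guard just mentioned; all thresholds and field names are example values, not a standard:

```python
# Illustrative scoring of sites for triggered monitoring visits, following
# the Table 2 pattern: each parameter scores 1 (low), 2 (medium), or
# 3 (high), and sites are ranked on the total. Example values only.

def score_site(site: dict, min_denominator: int = 20) -> int:
    score = 0

    # Overall CRF compliance (percentage), skipped if too few forms are
    # due - the low-denominator guard discussed above.
    if site["crf_forms_due"] >= min_denominator:
        pct = 100 * site["crf_forms_returned"] / site["crf_forms_due"]
        score += 1 if pct > 90 else (2 if pct >= 70 else 3)

    # SAEs reported outside expedited-reporting timelines.
    late = site["late_saes"]
    score += 1 if late == 0 else (2 if late == 1 else 3)

    # Protocol violations.
    viols = site["protocol_violations"]
    score += 1 if viols == 0 else (2 if viols == 1 else 3)

    return score

site = {"crf_forms_due": 40, "crf_forms_returned": 30,
        "late_saes": 1, "protocol_violations": 0}
print(score_site(site))  # higher totals flag the site for a visit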
A comprehensive procedure for developing the monitoring plan has been
published by others (Brosteanu et al. 2009), but recent evaluations suggest such
procedures should be supplemented by centralized checking (von Niederhausern
et al. 2017).

Trial Oversight

It is vital to establish appropriate oversight processes prior to opening to recruitment.


The standard oversight groups for supervising a clinical trial are outlined here. The
conventional oversight structure described below may be adapted in line with the
level of trial risk, as discussed earlier. A phase I trial would convene more frequent
oversight meetings than a phase IV observational trial. Responsibilities of the
oversight groups should be decided and outlined in the protocol and group terms
of reference. For a large research group, it may be efficient to review similar trials at
the same meeting (i.e., review multiple trials using the same committee).

Project Team

The staff responsible for carrying out the day-to-day management of the trial should
meet to review key performance indicators from accumulating data. The group
would usually include the data manager, on-site monitor, trial manager, statisti-
cian/methodologist, and team leader.

Trial Management Group (TMG)

The TMG consists of the project team with the addition of the clinical investigators,
patient representatives, and sub-study collaborators with the objective of reviewing a
higher level summary than the project team meetings. As the TMG is made up of the
staff running or leading the trial, it is not independent.

Independent Data Monitoring Committee (IDMC)

The IDMC periodically reviews accumulating summaries of data with the purpose of
intervening in the interests of trial participants. Unlike the project team or the TMG,
the IDMC may review data presented by study arm. The group consists of disease-
specific experts and statisticians experienced with the trial design, none of whom
have any involvement in delivering the actual trial.
To avoid any uncertainty, it is recommended that the trial team/sponsor prepare
explicit guidelines outlining how the IDMC should operate (Sydes et al. 2004b). The
use of an IDMC should be briefly described in the trial results; notably, only 18% of
662 RCTs in a review done in 2000 did so (Sydes et al. 2004a). However, this is likely
to have improved substantially since then with the advent of initiatives such as
CONSORT (Hopewell et al. 2008), which aim to standardize reporting.

Independent Trial Steering Committee (TSC)

The purpose of the TSC is to take advice from the IDMC and make key decisions for
the trial. The TSC usually has the power to terminate the trial or require other actions
to protect participants.
The Medical Research Council in the UK has published guidelines for appropriate
oversight structures (MRC 2017), and a recent survey suggested widespread
compliance among academic trials units in the UK (Conroy et al. 2015). Members
of the independent groups should generally not be involved with the trial in any
way, should be from outside the investigator's institution, and should ideally have
excellent knowledge of the relevant disease area. For large multisite trials, it may
be necessary to consider international colleagues or those recently retired from
practice (Fig. 3).

Fig. 3 Example trial oversight structure, comprising (from top): regulator and ethics committee; funder and sponsor; Independent Trial Steering Committee (TSC: clinical members, statistician, patient representative); Independent Data Monitoring Committee (IDMC: clinical members, statisticians); Trial Management Group (TMG: chief investigator, clinical co-investigators, trial/data manager, statistician, patient representative, sub-study leaders, e.g., health economics, quality of life); project team (trial/data manager, trial monitor, statistician).
Trial Promotion

At the start of a project, and throughout delivery, it is important to consider the end
impact of the results. Trial publicity and dissemination of information is essential to
support publication in high-impact journals and ensure future research for patient
benefit. The trial team, including any oversight committees, should develop a
promotional strategy, which could include a schedule of press releases to support
key milestones (i.e., launch/participant recruitment/analysis) disseminated by orga-
nizations associated with the trial (i.e., co-applicants/charitable organizations/dis-
ease-specific groups).
Large multisite trials often develop a brand identity. This starts with having a
trial name that is an accessible shorthand for everyone to refer to the research trial
quickly and easily. Trials with acronyms were more likely to be cited than those
without (Stanbrook et al. 2006). Convention dictates that the name is an acronym
using the letters from the trial’s full title, ideally with some link to the subject area,
though this is not essential. Others have written about what can be humorously
known as acronymogenesis (Fallowfield and Jenkins 2002; Cheng 2006) but
essentially avoid anything that could discourage potential patients (e.g.,
RAZOR) or that could be perceived as coercive (e.g., HOPE, LIFE, SAVED,
CURE, IMPROVED).
Following the trial acronym is often the trial logo. Those with access to a graphic
designer can have more elaborate designs but the trial acronym in a special font may
be sufficient emphasis. A couple of examples of trial logos are displayed below in
Figs. 4 and 5.

Fig. 4 Example of trial logo

Fig. 5 Example of trial logo (Oughton et al. 2017) (N.B. this was an antibody trial, hence the shape of the spacecraft)

Trial Website

A trial website is a helpful way for people to access trial information. This can be
targeted towards investigators, with password protection as required, and/or towards
patients, to encourage those interested to volunteer for the trial, and/or as an alternative
means of providing information. It is very common for patients who have been invited
into the trial to do their own internet research, so it is important that any publicly
available online information complements patient materials, such as the patient information
sheet. Websites have more flexibility in methods for presenting the information than a
paper information sheet. The website can aid the dissemination of the results to
participants by linking to published results or lay summaries of the findings. An
exemplar of good practice in this area can be found for the INTERVAL blood donation
frequency study (University of Cambridge 2017; Moore et al. 2014). Websites can be a
vital channel of communication for sites, patients, and the media. Contents of official
websites may need to have to be approved by an Ethics Committee/Institutional
Review Board depending on local requirements.
With appropriate access controls, there are potential gains to be made from having
trial documents accessible to investigators via the website. The website can provide a
link to remote data capture systems and online registration/randomization services.
Training videos can be hosted from the website for investigators as can participant
questionnaires.

Social Media

The use of social media has increased over the last two decades. Patients frequently
use the internet and social media as a primary source of information. Patient support
groups often have a significant online presence, with forums to facilitate discussions
about a wide variety of topics. Some patients even blog about their experience as trial
participants.
Some have expressed concern that there is potential to compromise the integrity
of clinical trials. However, a review of more than one million online posts found that
discussions of active clinical trials were rare and no discussions were identified that
risked unblinding of clinical trials (Merinopoulou et al. 2015). The authors go on to
recommend basic training for trial participants on the risks of social media discus-
sions and also that sponsors should consider periodic monitoring of social media
content.
There is the potential that participants may disclose adverse events on social
media that would otherwise be unreported. A systematic review (Golder et al. 2015)
found that adverse events are identifiable within social media and that mild
and symptom-related adverse events are overrepresented online when compared
with traditional data sources, or perhaps alternatively that the lower end of side
effects are underrepresented in trial reporting. Undoubtedly, pharmaceutical compa-
nies are working to develop tools to aggregate social media data, but at present, this
approach is still in its infancy. There is currently only a regulatory responsibility to
report events that are reported to a sponsor/Marketing Authorization Holder rather
than to actively seek out events online. An active approach would also be con-
founded by the difficulty in matching a report to a specific research participant and
there would be ethical issues of perceived intrusive monitoring.
Press

To accomplish wider reach, a press release for either regional or national news
outlets could be considered. To maximize the chances of the story being published,
it may be helpful to include a patient interest angle: for example, a trial patient who
has done well in the phase I trial and is now excited that the trial has expanded to
phase II. Photographs or willingness to be photographed/interviewed are key to
success. Permission for the press release must be obtained from all those involved
and an institution’s press office will often be able to provide support. If the press
release is directed towards recruiting patients, the relevant trial ethics committee
should give approval beforehand.

Investigator Meeting

Investigator meetings are useful to provide information about the trial, to foster a
collective commitment to the trial aims, and to recognize investigators' contributions.
They are often timed to occur before recruitment commences but can
also be valuable during the lifetime of the trial to provide updates or to help publicize
the results. The organization and resources required to host an investigator meeting
are significant, and it is therefore important to consider carefully what goals and
achievements are important for the meeting and to select an appropriate venue. Costs
can be minimized by holding investigator meetings alongside scientific conferences
where investigators are already likely to attend.

Summary and Conclusion

Effective implementation of a clinical trial protocol requires the execution of the
tasks described in this chapter. Careful attention to protocol implementation will
maximize the likelihood of delivering an efficient and successful trial.

Key Facts

• A structured approach to identifying and selecting participating sites will translate
to optimal participant accrual.
• The requirements of the trial must be communicated to participating investigators
and staff in a format that meets their needs.
• Risks in a trial must be identified and mitigated using a proportionate monitoring
program.
• A clear and effective oversight structure is necessary to ensure the interests of
participants and funders are protected.
• Promotional strategies can optimize participant recruitment and retention.
Cross-References

▶ Centers Participating in Multicenter Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data
▶ Principles of Protocol Development
▶ Qualifications of the Research Staff
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance

References
Brosteanu O et al (2009) Risk analysis and risk adapted on-site monitoring in noncommercial
clinical trials. Clin Trials 6(6):585–596
Cheng TO (2006) Some trial acronyms using famous artists’ names such as MICHELANGELO,
MATISSE, PICASSO, and REMBRANDT are not true acronyms at all. Am J Cardiol 98
(2):276–277
Collett L et al (2017) Assessment of ibrutinib plus rituximab in front-line CLL (FLAIR trial): study
protocol for a phase III randomised controlled trial. Trials 18(1):387
Conroy EJ et al (2015) Trial Steering Committees in randomised controlled trials: a survey of
registered clinical trials units to establish current practice and experiences. Clin Trials 12
(6):664–676
Croft J. International surgical trials toolkit. Cited 2020. Available from: https://internationaltrialstoolkit.co.uk/
European Commission (2014) Risk proportionate approaches in clinical trials – recommendations of the expert group on clinical trials for the implementation of regulation (EU) no 536/2014 on clinical trials on medicinal products for human use. Accessed 8 Aug 2017. Available from: http://ec.europa.eu/health/files/clinicaltrials/2016_06_pc_guidelines/gl_4_consult.pdf
Fallowfield L, Jenkins V (2002) Acronymic trials: the good, the bad, and the coercive. Lancet 360
(9346):1622
Golder S, Norman G, Loke YK (2015) Systematic review on the prevalence, frequency and
comparative value of adverse events data in social media. Br J Clin Pharmacol 80(4):878–888
Hopewell S et al (2008) CONSORT for reporting randomised trials in journal and conference
abstracts. Lancet 371(9609):281–283
Merinopoulou E et al (2015) Lets talk! Is chatter on social media amongst participants compromis-
ing clinical trials? Value Health 18(7):A724–A724
MHRA (2011) Risk-adapted approaches to the management of clinical trials of Investigational
Medicinal Products Ad-hoc Working Group and the Risk-Stratification Sub-Group. 2011. cited
2019. Available from: https://assets.publishing.service.gov.uk/government/uploads/system/
uploads/attachment_data/file/343677/Risk-adapted_approaches_to_the_management_of_clini
cal_trials_of_investigational_medicinal_products.pdf
Minisman G et al (2012) Implementing clinical trials on an international platform: challenges and
perspectives. J Neurol Sci 313(1–2):1–6
Moore C et al (2014) The INTERVAL trial to determine whether intervals between blood donations
can be safely and acceptably decreased to optimise blood supply: study protocol for a
randomised controlled trial. Trials 15:363
MRC Guidelines for Management of Global Health Trials Involving Clinical or Public Health
Interventions. Medical Research Council (2017) https://mrc.ukri.org/documents/pdf/guidelines-
for-management-of-global-health-trials/
14 Implementing the Trial Protocol 255

Oughton JB et al (2017) GA101 (obinutuzumab) monocLonal Antibody as Consolidation Therapy


in CLL (GALACTIC) trial: study protocol for a phase II/III randomised controlled trial. Trials
18:1–12
Park JJH et al (2019) Systematic review of basket trials, umbrella trials, and platform trials: a
landscapanalysis of master protocols. Trials 20(1):572
Stanbrook MB, Austin PC, Redelmeier DA (2006) Acronym-named randomized trials in medicine
– the ART in medicine study. N Engl J Med 355(1):101–102
Sydes MR et al (2004a) Systematic qualitative review of the literature on data monitoring commit-
tees for randomized controlled trials. Clin Trials 1(1):60–79
Sydes MR et al (2004b) Reported use of data monitoring committees in the main published reports
of randomized controlled trials: a cross-sectional study. Clin Trials 1(1):48–59
University of Cambridge (2017) Interval study website 2017. Available from: http://www.
intervalstudy.org.uk/
von Niederhausern B et al (2017) Generating evidence on a risk-based monitoring approach in the
academic setting – lessons learned. BMC Med Res Methodol 17(1):26
15 Participant Recruitment, Screening, and Enrollment

Pascale Wermuth

Contents
Introduction
Definitions
  Recruitment
  Consenting
  Screening
  Enrollment
Planning Recruitment
  The "Recruitment Funnel"
  The Recruitment Rate
  Recruitment Planning Tools
Identifying Trial Candidates
  Recruitment Material
Screening
  Planning the Screening End
  Screening Tools
Enrollment
  Enrollment Strategies
  Enrollment Procedures
  Monitoring Enrollment
Retention
  Retention Strategy
  Examples of Retention Support
Recruitment Issues and Their Impact
  Risk Mitigation
  Issue Management
Summary and Conclusion
Key Facts
Cross-References
References

P. Wermuth (*)
Basel, Switzerland
e-mail: [email protected]


Abstract
Participant recruitment and retention are key success factors in a clinical
trial. Failure to enroll the required number of participants in a timely manner
can have significant impact on trial budget and timelines, and data gaps due to
under-recruitment or early dropout of participants may lead to misinterpretation
or unreliability of trial results.
Prior to the start of a trial, a thorough recruitment plan should be set up,
considering the screen failure rate, the potential need to replace early dropouts,
and the geographical distribution of participants, timelines, and budgetary con-
straints. Understanding how to calculate the recruitment rate based on the number
of enrolled participants per site per month will help in assessing the probability of
successful recruitment. In addition, recruitment planning tools and services such
as comparison with benchmarking data from analogous historical or ongoing
trials, simulation tools, and specialist service agencies may support the setup of
a robust recruitment plan. Risk factors with the potential of leading to under-
recruitment, over-recruitment, or recruitment of unsuitable participants should be
identified upfront to allow risk mitigation as far as possible, e.g., through protocol
amendments or increasing the number of participating centers. Evaluation of
the most appropriate channels to identify and contact trial candidates will ensure
optimal turnout in relation to the financial and resource investments. Strategies
for participant screening, enrollment, and retention are reviewed in this chapter.

Keywords
Recruitment · Recruitment plan · Screening · Enrollment · Recruitment rate ·
Retention · Benchmarking · Simulation · Recruitment issues

Introduction

Once the design of a clinical trial has been defined and the final trial protocol has
received all required approvals, after participating centers have been set up and
trained appropriately, and when all trial supplies (including study drug) are available,
the trial is ready to start being populated with participants.
Studies conducted on metadata or on data obtained from local or national
registries do not require the active involvement of any individuals, and recruitment
as such will not be needed for these studies. However, most studies, interventional
and non-interventional, require the identification and enrollment of individual trial
participants, carefully selected according to trial-specific eligibility criteria, to allow
collection of relevant data that will address the objective of the trial. Failure to get
the right participants enrolled into a trial, or to retain participants in the trial, may
have significant impact on the quality of the trial results, as reliability and power of
the trial outcome may decrease (Little et al. 2012). Likewise, failure to enroll
within an appropriate time frame can have significant impact on the trial budget if
additional costs and resources are required to bring recruitment back on track and, in
the case of new therapies being investigated, can cause costly delays in time to
market for these therapies. Consequently, planning for successful recruitment and
participant retention will need to start at the very beginning of the conceptualization
of a clinical trial.
This chapter looks into the details of recruitment, including planning tools such
as benchmarking and simulation, highlights the various challenges that are so
often experienced in recruitment and how to avoid them, and describes how to identify
candidates for a trial. It also describes methods that can be applied to support the
accrual of participants and their retention in the trial until its completion. For further
reading, refer to Anderson (2001), Bachenheimer and Brescia (2017), and Friedman
et al. (2015).

Definitions

Recruitment

Describes the overall process from the point of identifying a candidate for a clinical
trial (either a volunteer or a person diagnosed with the disease or condition under
investigation), through the steps of obtaining their informed and written consent
and verifying their eligibility (“screening”), up to the candidate’s inclusion into the
clinical trial (including randomization and/or treatment assignment where applicable).

Consenting

Describes the process of informing a candidate of the specifics of the clinical trial
(including the objectives of the trial, potential benefits and risks, the assessments and
procedures involved and their impact on the participant, and the participant’s rights
and responsibilities) and of obtaining the candidate’s (or their legal representative’s)
consent to collect personal data and biological samples and to analyze and publish
the derived data. This process is mandatory according to ICH GCP E6(R1).

Screening

Describes the process of ensuring an identified candidate meets all trial inclusion and
exclusion criteria, including obtaining written consent and confirmation of the
candidate’s willingness and ability to adhere to the trial requirements. Screening
assessments requiring any type of intervention not considered routine procedure or
standard of care can only be performed after a participant’s consent has been
obtained. Candidates not meeting all of the eligibility criteria are considered screen
failures and cannot be enrolled into the trial.

• The terms pre-screening or pre-identification can be used to describe the process
of identifying potential candidates from sources such as registries or medical
records of investigational sites, without necessarily directly contacting the
candidates. This may be applicable in trials with small numbers of highly
selected participants, in which screening slots are assigned to participating sites.
• The term rescreening is used when trial candidates who fail one or
several eligibility criteria at a given time point are reconsidered for participation
at a later time point (e.g., after successfully addressing the eligibility criteria
previously not met), where and as allowed by the trial protocol.

Enrollment

Enrollment is the process of actually including an individual into the trial after
their identification and verification that all trial-specific eligibility criteria are met.
This can include randomization and/or treatment assignment where applicable and
usually marks the point from which data can be collected (both prospectively and
retrospectively). The number of enrolled participants typically does not include
screen-failed individuals but will include participants who drop out of the trial for
any reason at any time after enrollment (Fig. 1).

Fig. 1 Overview of the recruitment process: identification of trial candidates and pre-screening/pre-identification, screening (informed consent, assessment of inclusion/exclusion criteria), and, if eligibility is confirmed, enrollment (randomization and/or treatment assignment if applicable), followed by the conduct of the clinical trial; candidates not confirmed eligible are screen failures, with re-screening where allowed. For enrolled participants, documentation of reportable adverse events may be required from the beginning of screening (retrospectively) up to the end of the clinical trial

Planning Recruitment

During the conduct of clinical trials, recruitment has been shown to be a common
challenge, with estimates of up to 80% of trials facing recruitment issues. Therefore,
setting up a recruitment plan, including thorough investigation of the trial landscape
and detailed scenario and risk mitigation planning, is key for successful recruitment;
see, e.g., Thoma et al. (2010).
A recruitment plan should describe the assumptions used for calculations and
simulations, the planned budget, and, importantly, the risk mitigation and issue
management plans for the trial. These should include pre-defined trigger points for
escalation steps to be kicked off, and a detailed action and communication plan
for escalation, should recruitment issues occur during the trial. For multinational
and multicenter trials, setting up recruitment plans on country and/or site level
will help identify and address potential recruitment issues with high granularity,
allowing targeted and individualized mitigation or escalation. Also, recruitment
plans should be adapted on an ongoing basis as changes occur to the trial (e.g.,
protocol amendments), the environment (e.g., start of new competitive trials), or the
logistics (e.g., closedown of participating centers).
The following are the main points to be considered during recruitment planning:

• Numbers: The trial protocol will define the number of participants required
according to the statistical sample size calculations. In addition, the expected
screen failure rate will need to be established, as the number of candidates to be
screened in total will need to include the candidates not subsequently enrolled.
Also, the protocol may require replacement of enrolled participants who drop out
of the trial prior to reaching a specific milestone (e.g., the end of a certain
observation/treatment period or exposure duration). In this case, the expected
dropout rate should also be assessed, and the number of enrolled participants
increased accordingly (Fig. 2).
• Timelines: Restrictions on the duration of the recruitment period can be defined by
budget and/or resource limitations, by ethical and/or regulatory requirements
(e.g., post-approval commitments to health authorities), or by statistical require-
ments (e.g., occurrence of endpoints to be observed within a specific time frame).
Planning of timelines should also take into consideration the time needed to
identify and screen candidates (i.e., how often are candidates seen by trial
investigators; how long do trial-specific screening assessments take).
• Geographical distribution: Is the trial a single- or a multicenter trial, a local,
national, or international trial? Are there any specific geographic considerations
from an epidemiological standpoint? Are there any logistical restrictions
to the distribution of the trial, such as language constraints, limitations in
clinical research associate (CRA) monitoring resources, challenges in supply of
the investigational medicinal product, or differences in standard of care (e.g.,
availability of comparator treatment), that would influence the geographical
spread of the trial? Are there any regulatory requirements for acceptance of
data for licensing or marketing considerations (e.g., minimum number of
participants from a certain country required to allow filing)?

Example: Some countries tend to be strong and fast recruiters but may have long
start-up timelines, potentially precluding their involvement in trials where the
overall recruitment period is expected to be short. On the other hand, other countries
may be included due to short start-up timelines, despite a perhaps limited enrollment
potential.

• Budgeting: Recruitment contributes significantly to the costs of a clinical trial;
hence, the trial budget will often influence the recruitment strategy of a trial. The
decision on how many sites to include in which countries may be based not only
on the sites' potential to contribute but also on how much it will cost to
run the trial in specific countries. Also, while in general the more sites are
opened for a trial, the shorter the enrollment duration will be, opening up sites is
costly in resources as well as in pass-through costs (e.g., fees for contracts, IRB/
EC submissions, etc.). Therefore, a balance will need to be found between the
costs of adding more sites and the savings of a shorter enrollment duration.

Fig. 2 Calculation of the number of patients to be screened: required no. of enrolled participants + expected no. of screen-failed candidates + expected no. of dropouts = no. of candidates to be screened
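The arithmetic in Fig. 2 can be sketched in a few lines of code. This is a minimal illustration applying expected rates rather than absolute counts; all values shown are illustrative assumptions, not protocol figures.

```python
import math

def candidates_to_screen(target_enrolled: int,
                         screen_failure_rate: float,
                         dropout_rate: float = 0.0) -> int:
    """Estimate the number of candidates to screen (the Fig. 2 arithmetic).

    target_enrolled: participants required by the sample size calculation
    screen_failure_rate: expected fraction of screened candidates who fail
    dropout_rate: expected fraction of enrolled participants who must be
        replaced (0 if the protocol does not require replacement)
    """
    # First inflate the enrollment target for dropouts to be replaced,
    # then inflate again for candidates lost to screen failure.
    to_enroll = target_enrolled / (1 - dropout_rate)
    return math.ceil(to_enroll / (1 - screen_failure_rate))

# Illustrative assumptions: 68 participants required, 20% screen failure
# rate, 10% dropout rate -> roughly 95 candidates to screen.
print(candidates_to_screen(68, screen_failure_rate=0.20, dropout_rate=0.10))
```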



• Protocol feasibility outcome: Well-conducted protocol feasibility will
help establish the enrollment potential of investigational sites and will provide
information on how capable sites are of conducting the trial (e.g., logistical
infrastructure at a site, their experience in conducting clinical trials, resources
available to support the trial, commitment/interest of the investigator, competitive
trials run at the site). The information obtained through feasibility will indicate
whether the planned sites will be sufficient to conduct the trial or whether additional
or different sites will need to be approached.
• Recruitment challenges: Experience from comparable trials and feedback
received from investigators during protocol feasibility can help in identifying
potential recruitment challenges. Addressing these proactively as much as
possible will minimize the risk of missed recruitment goals. Early identification
might allow implementation of actions that are more difficult later on during
the trial such as amending the protocol (e.g., to relax eligibility criteria or to
reduce the frequency of trial assessments). Other preparatory steps might include
proactive tailored training of site staff and the preparation of supportive material
for sites (e.g., medical equipment, trial-specific “pocket summaries”).
• Advisory boards: Where available, the benefit of consulting with scientific advisory
boards or with individual medical experts should be considered, as input from
such groups or individuals may help ensure trial assessments are in accordance with
current standard of care and/or feasible for the participants and investigators.
• Patient involvement: Similarly, seeking input from patient representative groups
may support the development of a “patient-friendly” trial. Ensuring a trial is
manageable and relevant for the participants will go a long way in ensuring
successful enrollment and retention. This can include aspects of the clinical trial
design (e.g., frequency of blood sampling), terminology used (e.g., potentially
inappropriate use of the word patient versus participant or person), reimburse-
ment of expenses, and understandability of consent (i.e., ensure all participants
are informed in language and terms they are able to understand).

The “Recruitment Funnel”

Typically, it has to be expected (and planned for) that not all identified candidates
will end up being enrolled into a clinical trial, and similarly, not all enrolled
participants will complete the trial. The phenomenon of the number of candidates
decreasing between pre-screening and enrollment, and again after randomization, is
often referred to as the “recruitment funnel.”
The first sweep of candidates will fall off the radar during the pre-screening phase.
Often site staff overestimate the number of potential trial participants they
can contribute to a trial, not fully taking into consideration competitive trials
being conducted at their site on a similar patient population, or the site’s resource
constraints and the related limitations in their ability to oversee the often time-intensive
management of participants in a clinical trial. A further proportion of candidates
will drop out during the actual screening and consenting phase, as not
all candidates will meet all the eligibility criteria and/or are willing to enter a
trial. A reason for this is that during initial protocol review, investigators might
underestimate the stringency of the trial’s eligibility criteria. For potential enrollment
barriers, see Brintnall-Karabelas et al. (2011) and Lara et al. (2001). Thirdly, even after
randomization, the number of participants is likely to decrease over time, through early
dropouts due to various reasons such as adverse reactions experienced during the trial,
withdrawal of consent, or participants being lost to follow-up. The dropout rate
generally increases with the longer duration of a trial (Fig. 3).

Fig. 3 The recruitment funnel (indicated numbers are examples): sites' anticipated screening potential 100; pre-screened candidates 70; consented candidates 35; enrolled participants 20; participants completing the trial 17
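As a minimal sketch, the funnel can be projected stage by stage from assumed attrition fractions; the fractions below are illustrative and chosen to reproduce the example counts of Fig. 3.

```python
# Illustrative attrition fractions between consecutive funnel stages
# (assumptions chosen to reproduce the Fig. 3 example counts).
funnel = [
    ("Pre-screened candidates", 0.30),
    ("Consented candidates", 0.50),
    ("Enrolled participants", 0.43),
    ("Participants completing the trial", 0.15),
]

count = 100.0  # sites' anticipated screening potential
print(f"Sites' anticipated screening potential: {count:.0f}")
for stage, attrition in funnel:
    count *= (1 - attrition)
    print(f"{stage}: {count:.0f}")
# -> 70, 35, 20, 17
```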

The Recruitment Rate

The most often used metric in recruitment is the recruitment rate. The recruitment
rate of a clinical trial, both for single-center and for multicenter trials, is defined by
the following factors (Fig. 4):

• Total number of enrolled participants
• Number of contributing sites (will be 1 in single-center trials)
• Required enrollment duration (mostly indicated in months)

Fig. 4 Calculation of the recruitment rate: recruitment rate = total no. of participants / no. of contributing sites / time unit

Example: In a phase II trial, 80 participants were enrolled by 12 contributing
sites over the course of 8 months. Hence, the recruitment rate of this trial was 80
participants divided by 12 sites divided by 8 months = 0.83 participants per site per
month.
Breaking down recruitment into these factors allows a standardized quantification
of enrollment, irrespective of the trial specifics such as design or indication.
However, recruitment rates vary vastly between different disease areas, from a
two- (or more) digit rate for trials in common indications to rates lower than 0.1 in
rare diseases, and therefore recruitment rates should be interpreted carefully when
used to compare recruitment in different trials.
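In code, the Fig. 4 definition is a one-liner; the sketch below simply reproduces the worked example above.

```python
def recruitment_rate(enrolled: int, n_sites: int, months: float) -> float:
    """Recruitment rate: participants per site per month (Fig. 4)."""
    return enrolled / n_sites / months

# Worked example from the text: 80 participants, 12 sites, 8 months.
print(f"{recruitment_rate(80, 12, 8):.2f}")  # -> 0.83
```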

Fine-Tuning the Recruitment Rate


The recruitment rate provides an indication of the average recruitment speed
across all sites and over the full recruitment period. However, in reality, recruitment
is rarely linear over the course of the trial, and breaking it down to an individual site
level and/or to specific phases during the enrollment period will provide a more
realistic picture of what recruitment may actually look like.
One factor to consider is the ramp-up period at the beginning of a trial, taking
into account the time needed to have all sites activated and ready to start enrollment.
During this ramp-up period, fewer sites will be contributing to enrollment as not all
might be able to start recruitment at the same time.
Example: Estimation of recruitment during ramp-up period: x weeks with 25% of
sites activated and contributing, y weeks with 50% of sites enrolling, and z weeks
(the remaining recruitment period) with 90% of sites contributing to recruitment.
Calculating with 90% instead of 100% of planned sites allows for the possibility of
some sites not enrolling any participants (e.g., approvals not received in time, no
candidates available).
Another factor impacting the recruitment rate may be decreased site activity
during holiday seasons, and being aware of the various holiday customs within the
participating countries will allow planning for potential dips during recruitment.
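A ramp-up-adjusted projection can be sketched as follows; the per-site rate, the schedule, and the activation fractions are illustrative assumptions mirroring the x/y/z example above.

```python
def projected_enrollment(n_sites: int, rate_per_site_month: float,
                         schedule: list) -> float:
    """Expected enrollment under a ramp-up schedule.

    schedule: list of (months, fraction_of_sites_active) periods,
    e.g., the x/y/z periods described in the example above.
    """
    return sum(n_sites * fraction * rate_per_site_month * months
               for months, fraction in schedule)

# Illustrative assumptions: 12 sites, 0.83 participants/site/month;
# 1 month at 25% site activation, 1 month at 50%, then 6 months at 90%.
schedule = [(1, 0.25), (1, 0.50), (6, 0.90)]
print(round(projected_enrollment(12, 0.83, schedule)))  # about 61
```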

Recruitment Planning Tools

There are different methodologies and tools available that can be used to generate a
differentiated approximation of the estimated recruitment rate for a planned trial.
Ideally two or more of these are used in combination as this will allow the most
complete coverage of possible eventualities and the development of the best possible
recruitment strategy.

Benchmarking
Researching the trial landscape by identifying clinical trials (either historical or
currently running) that are analogous to the trial at hand will provide benchmarking
data on what can be expected with regard to recruitment metrics. Benchmarking data
of clinical trials can be obtained either from government-mandated reportable trial
registries (e.g., ClinicalTrials.gov, EudraCT, EU PAS) or from data collected
by specialized service providers from pharmaceutical companies through
anonymized methods. Some of these are publicly accessible; others require purchase
of licenses.
By extracting the key recruitment metrics for each identified analogous trial, i.e.,
number of enrolled patients and of participating sites and recruitment start
and end date, the recruitment rate of each trial can be calculated. This can then
be used as a starting point for planning one's own trial but will need to be
adjusted for any variables not matching well. For example, if the treatment under
investigation is considered promising by the community, the recruitment rate might
need to be increased. Likewise, a decrease of the recruitment rate might be applicable
if there is a high density of concurrently running competing trials.
The research of such data will also provide information on the number of trials
run in the field and how many and which countries participated in these trials,
indicating both recruitment potential and competitive pressure in a given field and
region. However, data will be limited to indications, treatment classes, etc. as
previously or currently investigated, and it might not always be possible to find
relevant historical matches.

Simulation
Another useful tool is the simulation of recruitment by feeding different sets of
variables into a model and thus imitating different potential scenarios. Modifying
factors such as the number of sites, recruitment rates, and ramp-up times for individual
sites will help visualize their impact on the recruitment duration and will allow
refinement of one's assumptions (DePuy 2017) (Fig. 5).

Fig. 5 Visualization of simulation outputs for two different scenarios: scenario A = 12 activated sites (target number of enrolled participants = 68), scenario B = 18 activated sites (target number of screened participants = 85); target sample size = 68; assumed screen failure rate = 20%
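As a minimal illustration of such a simulation (a sketch, not a validated planning tool), monthly enrollment per site can be modeled as a Poisson process under an assumed rate. The scenarios mirror Fig. 5; the rate of 0.8 candidates per site per month is an assumption.

```python
import numpy as np

def simulate_months_to_target(n_sites: int, rate_per_site_month: float,
                              target: int, n_sims: int = 1000,
                              seed: int = 1) -> np.ndarray:
    """Monte Carlo sketch: each site screens a Poisson number of candidates
    per month; returns the simulated months needed to reach the target."""
    rng = np.random.default_rng(seed)
    months_needed = np.empty(n_sims)
    for i in range(n_sims):
        total, month = 0, 0
        while total < target:
            month += 1
            total += rng.poisson(rate_per_site_month, n_sites).sum()
        months_needed[i] = month
    return months_needed

# Scenarios A and B as in Fig. 5, with 85 candidates to screen and an
# assumed rate of 0.8 candidates per site per month.
for label, sites in [("A", 12), ("B", 18)]:
    m = simulate_months_to_target(sites, 0.8, target=85)
    print(f"Scenario {label}: median {np.median(m):.0f} months, "
          f"90th percentile {np.percentile(m, 90):.0f} months")
```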

Patient Recruitment Service Agencies
There are a number of service providers in the industry specializing in the planning,
support, and conduct of recruitment into clinical trials. Services provided range from
provision of benchmarking data for recruitment planning, to simulation tools, to
the preparation of study material and tools supporting recruitment and retention.


Identifying Trial Candidates

Tailoring recruitment-related communication to the protocol-defined target
population, identifying the right sources, and being able to access the right
candidates will ensure optimal enrollment outcome in relation to cost and resource
investments (see, e.g., Graham et al. 2017; National Institute on Aging 2018). The
questions that should be asked when defining the recruitment communication
strategy are: where can candidates matching the protocol requirements be found
(sources), and with whom will the trial team be communicating?
Potential sources of candidates can include:

(a) Health-care professionals: e.g., hospitals, health centers, and general practitioners
(b) Real-world data sources: e.g., patient registries and genomic profiling databanks
(c) Participant community: e.g., patient advocacy groups and individuals of the target population directly (community reach)

With the advance and increasing dispersion of genomic profiling and personalized
health care, the concept of finding a trial for a patient, rather than finding patients for a
trial, will become more prevalent, with computational platforms allowing real-time
matching of patients to trials using genomic and clinical criteria (Fig. 6).

Fig. 6 Illustration of how personalized health care may affect trial recruitment: finding patients for a trial (matching the investigated population to a single trial) versus finding trials for a patient (matching an individual against Trials A, B, and C)

Recruitment Material

Once the source (or sources) and contact points have been established, communication
material to support recruitment can be developed accordingly. Content as well as
review and approval processes will differ depending on who the targeted recipients
are.
Note: any patient-facing material needs to be approved by IRBs/ECs.
Examples of written material for the promotion of a trial include flyers, cards,
posters, or ads with a trial overview and trial contact details. Target audiences can be
participating centers, referring physicians, the broad community, and/or preselected trial
candidates. Material supporting the informed consent process can be booklets or cards,
or short videos, with explanations and schemas of the scientific or medical background
of the trial, the underlying disease, or the treatments/medical procedures in question.
Other means of outreach include printed media (such as newspapers and
magazines), television, and radio broadcasts, as well as multimedia and online
platforms such as community web sites, forums, social media, apps, etc. (see, e.g.,
Katz et al. 2019) (Table 1).
The choice of formats used in a trial often depends on the budget and available
resources (e.g., consider the need for translations in multinational studies) but
should also reflect the specifics of the target audience to allow maximization of
the uptake of the information. To increase the success of the communication
campaign, it is often useful to use more than one means of communication (see, e.
g., Kye et al. 2009).
It should be noted that the choice of communication pathway could have a pre-
selective impact on the trial candidates identified (e.g., limited access to online media
for certain social or age groups).

Table 1 Overview of different communication formats

Printed material
  Examples: flyers, posters, cards, booklets
  Advantages: neither cost nor resource intensive; can be easily distributed
  For consideration: risk of getting lost or not receiving much attention

News media
  Examples: ads in newspapers, magazines, in public areas (incl. public transport), on TV, or radio
  Advantages: very broad outreach possible; often not costly
  For consideration: may be more useful for trials with broad eligibility criteria

Online media
  Examples: ads on dedicated web sites, forums, or chat rooms, in apps or search engines
  Advantages: can reach very specific target audience
  For consideration: legal restrictions to be adhered to when posting information online; IRB/EC approval required for patient-facing content prior to posting, which may lead to delays globally or the need for geofencing (i.e., blocking content for users with certain IP addresses); uptake limited to web-literate audience

Other
  Examples: mass letters, “cold calls,” e.g., to all GPs in a certain area
  Advantages: broad outreach possible
  For consideration: risk of relatively low turnout in relation to invested effort

Examples:

• For a single-center phase I study with healthy volunteers, a useful approach could
be to hand out study flyers and put up posters in nearby universities or sports
centers.
• For a global phase III study in hemophilia (a well-connected and well-informed
community), ads on dedicated web sites such as association portals or patient
forums could be useful.

Often, recruitment material will be the primary tool to promote the clinical trial,
and therefore the relevance of the trial should be included in the communication.
Where possible, a visual identity (e.g., a trial logo) can be used to help the trial
stand out from others. However, it is important that the material, texts, and
visuals strictly promote only the trial and not the therapy or treatment under
investigation. There are a set of legal regulations that need to be followed when
authoring communications around clinical trials. These regulations may vary locally
but in general include points such as:

• Must use language that can be understood by lay people
• Must not promise a cure or other benefit (including free treatment) beyond what is outlined in the protocol
• Must not be unduly coercive
• Should not include logos, branding, or phraseology used (or planned to be used) for marketing of the therapy

Screening

The objective of the screening process is to ensure only eligible candidates enter the
trial. Therefore, the assessments required to identify eligibility of candidates will
need to be defined, and the necessary processes and systems will need to be set up
accordingly. Questions to be addressed include the following:

• Are the required screening assessments standard of care, which can be expected to
be performed as part of routine clinical practice, or are they trial specific and will
require prior consenting by the candidates? Will eligibility tests be performed
locally at the sites or will they need to be performed centrally (e.g., due to limited
availability of technology or the need for standardization for comparability)?
• What is the time frame in which screening assessments need to be performed?
Are there any time constraints in having to get the candidate’s eligibility con-
firmed (e.g., is there a maximum time window between diagnosis and start of
therapy during which screening will need to be completed)?

Screening activities need to be documented for all candidates (e.g., by use of a
screening log to be maintained by the participating sites), both for candidates ending
up being enrolled and for screen failures (including the reason for screen failure).
Also, especially for multi-site trials, a method on how to track screening activities
should be implemented, allowing (ideally real-time) monitoring of progress against
projections.

Planning the Screening End

The aim is to hit the protocol-required number of enrolled patients as closely as
possible, for budgetary reasons, but also in order not to unnecessarily expose
participants to experimental therapy. Toward the end of the enrollment period for a
trial, screening should be monitored carefully, as, for ethical reasons, all informed,
screened, and eligible candidates should be allowed to enter the trial. Therefore,
screening activities should stop once a sufficient number of candidates are in the
screening pool to fill the remaining enrollment slots. Especially in fast-recruiting
trials, this requires extrapolation of the screen failure rate observed previously.
Example: If the screen failure rate during the last period of enrollment in a trial
was 10%, and 9 participants are still required to complete the trial, screening should
stop when 10 candidates are in screening.
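This stopping rule amounts to extrapolating the recently observed screen failure rate; a small sketch follows, assuming the observed rate carries forward.

```python
import math

def screening_pool_target(remaining_slots: int,
                          observed_screen_failure_rate: float) -> int:
    """Candidates that should be in screening to fill the remaining
    enrollment slots, given the observed screen failure rate."""
    return math.ceil(remaining_slots / (1 - observed_screen_failure_rate))

# Worked example from the text: 9 participants still needed and a 10%
# screen failure rate -> stop new screening at 10 candidates in screening.
print(screening_pool_target(9, 0.10))  # -> 10
```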

Screening Tools

• While for small trials screening progress can be tracked manually, larger
trials with several participating sites are often managed with the support of an
Interactive Voice or Web Response System (IxRS). This will allow sites to enter
into the system when a candidate has been identified (“screening call”) and then
again when the outcome of the screening assessments is established, either
leading to enrollment or to screen failure. Screen failure reasons, usually in
categories, can also be tracked this way. The use of an IxRS allows real-time
tracking of screening progress but can be costly and time-intensive to set up.
• A tool allowing close control of screening activities is the screening request form:
this is to be completed by a site once a candidate has been identified and to be
sent to the central trial team for approval. Screening assessments for a candidate
can only start once the screening request is approved by the central team.
Screening request forms are typically used in trials requiring only small numbers
of participants (e.g., phase I cohort studies with less than ten participants
per cohort) or when a certain lag time is required between enrollment of
individual participants (e.g., where tolerability of a treatment needs to be
established prior to enrollment of subsequent participants).
• The allocation of screening slots can also be useful in trials with small numbers
of participants. Here, the central trial team will allocate individual screening
slots to a small number of sites at a time, allowing only these sites to screen
one candidate at a time. Once a candidate is either enrolled or screen failed, a next
screening slot will be allocated to another site.

Enrollment

Enrollment Strategies

Enrollment is often straightforward, with participants being entered into a trial as
they are identified and screened. However, there might be considerations in a trial
that warrant the definition of a specific enrollment or recruitment strategy.

• Competitive versus allocated enrollment (applicable to multicenter studies only):
competitive enrollment means that all sites can enroll participants as and
when they are identified, until the overall sample size has been reached. However,
it may be necessary to restrict the number of enrolled participants by individual
sites or countries, even if these could contribute more and the overall sample size
has not yet been reached. This might be the case if a certain distribution of
enrolled participants is required (e.g., where a minimum number of participants
per site or country are needed).
• Enrollment in batches versus ongoing enrollment: enrolling in batches may be
required in trials where resource-intensive therapy is required that is best
applied to several participants at a time or if there has to be a time lag
after enrollment of a certain number of participants to allow observation prior
to subsequent enrollments (e.g., enrollment into cohorts).

Enrollment Procedures

• Where applicable, randomization (blinded or open, including stratification and
treatment assignment as needed) will occur during the enrollment process.
In smaller studies, these steps can be managed manually; however, in larger,
multicenter trials, it is common to use an interactive voice or web response system
(IxRS). Sites will access the system to report an enrollment and to request a
randomization or participant ID which the system will link with the assigned
treatment code and where applicable with a medication or biological sampling
material kit number.
• Enrollment approval: in early phase studies in which participants’ safety might be
of particular concern, a process may have to be set up to allow central trial team
review of a candidate’s screening data prior to enrollment. This will allow the
central trial team to confirm that the candidate meets the trial’s entry criteria.

Monitoring Enrollment

• Close monitoring of enrollment progress on study and site level, and comparison
with projections as defined at trial start, will allow early detection of any devia-
tions from the recruitment plans. The setup of appropriate reports (e.g., through
the IxRS if used) should be included in the trial preparation activities as they
should be available early on; a simple progress check against projections is sketched
at the end of this section. Such reports will also be useful for regular reporting
of enrollment progress to trial stakeholders, e.g., participating sites, sponsor
management, and regulatory authorities.
• Specific attention should be given when getting close to the end of recruitment.
As described under the screening procedures, over-enrollment should be avoided,
both for budgetary reasons and to avoid unnecessary exposure of participants to
the treatment under investigation.
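A minimal sketch of such a progress check follows; the tolerance threshold and the numbers are illustrative assumptions, and a real report would draw the actual and projected counts from the trial's tracking system (e.g., the IxRS).

```python
def enrollment_status(actual: int, projected: int,
                      tolerance: float = 0.15) -> str:
    """Flag cumulative enrollment deviating from projection by more than
    the chosen tolerance, in either direction."""
    deviation = (actual - projected) / projected
    if deviation < -tolerance:
        return f"under-recruiting ({deviation:+.0%} vs projection)"
    if deviation > tolerance:
        return f"over-recruiting ({deviation:+.0%} vs projection)"
    return f"on track ({deviation:+.0%} vs projection)"

# Example: 42 participants enrolled against a projection of 55 by month 6.
print(enrollment_status(42, 55))  # -> under-recruiting (-24% vs projection)
```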

Retention

In trials with single observation points or very short follow-up, participant
retention will not be an issue. However, for trials with longer observation or
follow-up, plans on how to limit the number of missed individual assessments
during the trial, and to keep participants in the trial for as long as required by the
protocol, will need to be put in place at study start-up. Participants have the right
to withdraw from a trial at any time; however, participants with a good under-
standing of the objectives of a trial are less likely to drop out. Therefore,
educating participants about their responsibilities in achieving these objectives,
including adherence to the schedule of assessments until the end of the trial,
should be part of the consent process.

• Partial withdrawal: Should a participant wish to discontinue the trial-specific
treatment under investigation, they should be encouraged to remain in the study
for follow-up and to complete all remaining visits and assessments (without
continuing with the treatment). This will allow collection of outcome data.
However, if a participant withdraws their consent for the entire study, no further
data can be collected.

Reasons for participants to drop out include:

• Participant has not met the eligibility criteria (protocol deviation/violation).
• Consent withdrawal from either study treatment only or from the entire study.
• Death and adverse events.
• Pregnancy.
• Perceived or real lack of efficacy.
• Physician decision.
• Unwillingness or inability to comply with the protocol-mandated assessments.
• Logistical reasons such as translocation and changes in personal circumstances.
• Participant is lost to follow-up.

If a participant is lost to follow-up, every effort should be undertaken to locate
the participant. It might be useful to set up a clear definition of actions to be taken by
a site before a participant is considered lost to follow-up (e.g., at least three attempts
to contact the participant through different communication means at different time
points throughout a minimum period of 3 weeks).

Retention Strategy

A retention strategy should aim at positively influencing the trial experience for
participants and at establishing a rapport between the participant and the trial and/
or the trial team. An integral part of the retention strategy will therefore include
communication with the trial participants beyond their enrollment into the trial, ideally
through different pathways and at various time points throughout the trial duration.
Since local site staff are the main trial contact point for participants, keeping them
engaged, well informed of the global trial progress, and fully trained on trial
processes is key.

Examples of Retention Support

Supporting participants directly:

• Visit reminders (e.g., phone calls, email notifications, or alerts through mobile apps)
• Educational material on the underlying disease or the treatment under
investigation
• Communication aiming to establish a sense of community, including layman
summaries of the trial results and thank you letters at the end of the trial
• Reimbursement of expenses incurred for travel to the trial site (note: local
regulations need to be considered for the management of travel reimbursement,
as some IRBs/IECs may require out-of-pocket expenses to be reimbursed to
participants, whereas others will prohibit it)
• Logistical support for participants (e.g., home nursing, possibility to report data
electronically rather than having to go to the site)
• Trial treatment supporting gadgets (e.g., pillows or blankets)

Note: any material distributed to participants will need prior approval by IRBs/
IECs. Also, local regulations will apply, including that the material cannot promote
the treatment under investigation, only the trial itself, and that the value of the
material cannot be seen as persuasive to participation in the trial. The material
should be strictly trial related and not exceed a certain financial value.
Supporting site staff:
• Participants’ visit reminders (e.g., phone calls or email notifications)
• Template forms to capture participants' contact details (including those of
close relatives where provided by the participants)
• Pocket summaries, schedules, and charts providing an overview or quick refer-
ence to the protocol procedures

Recruitment Issues and Their Impact

Recruitment has been one of the most common challenges in clinical trials in the past, and
the changing environment is not making it any easier. Factors contributing to the
increased complexity are an increasingly demanding regulatory environment and
competitive pressure caused by the increasing number of clinical trials being run. Also,
the tendency to tailor studies to more and more specific target populations (i.e., the trend
toward individualized health care), as well as the patients’ increased literacy and desire to
be involved in their treatment decisions, requires differentiated planning of recruitment.
Recruitment issues can be grouped into the following three categories:

• Slow/under-recruitment: if this happens, the estimated enrollment potential of the
contributing sites was too high, and participants are either enrolled at a lower rate
than expected (leading to longer recruitment and therefore longer trial duration) or
cannot be enrolled at all. Common reasons for lower than expected recruitment
rates are overly restrictive eligibility criteria and unanticipated or underestimated
concurrent trials run at the participating sites, affecting both the number of trial
candidates and site resources and engagement of site personnel. Other
possible reasons include logistical factors such as issues with access to comparator
treatment, unplanned unavailability of site staff (e.g., due to illness or transfer), and
changes in the regulatory environment (e.g., approval of a new competitive treatment
decreasing interest in the trial). Patient-related reasons might include limited
patient access to the trial sites and an excessive protocol burden.
• Over-recruitment: while this tends to be a less frequently experienced issue, it can
also have significant negative impact on the success of a clinical trial. If participants
are enrolled into a trial faster than anticipated (higher recruitment rate), this can
impact trial resources both at the site level and at the central trial team level,
potentially limiting the ability to ensure adequate medical oversight of the partic-
ipants (caused by delayed data reporting and review). Also, the quality of the
collected data may suffer through insufficient trial oversight, putting the trial
outcome at risk. Recruiting more participants into a trial than anticipated (increased
sample size) can lead to budgetary constraints and can be ethically problematic as
more participants than statistically needed are exposed to experimental treatment.
• Low quality of recruitment: in this case, recruitment may seem to be on track;
however, the participants enrolled into a trial may not necessarily be the
right ones. This can happen if, often due to lack of oversight, participants
not fully meeting the eligibility criteria are enrolled into a trial (resulting in the
so-called protocol violations or deviations). Once identified, continuation of
these participants in the trial will need to be assessed carefully, and they
may need to be excluded from the trial for safety and/or efficacy reasons
(risk due to exposure to investigational treatment may not be justified given
the potential lack of response). The trial team will then need to decide if
exclusion of these participants will impact the power of the trial outcome and
if they need to be replaced by additional participants.

Risk Mitigation

Some potential risks to the planned recruitment can already be identified prior
to the start of a trial: during comparison with analogous trials (e.g., spread of
competitive trials), during protocol feasibility (e.g., sites' capability to run the
trial), and during interaction with patient advocacy groups, scientific experts,
and advisory boards (e.g., trial alignment with current standard of care); see,
e.g., National Institute of Mental Health (2005). While not all risks can be
avoided fully, some mitigating actions can be implemented, either through
modifications of the protocol or by adding (or exchanging) countries and sites
in the trial.
Examples of risks that may require mitigation:

• Exclusion of patients with active, controlled hepatitis will significantly reduce the
participant potential in some countries.
• A high frequency of radiology assessments or too high a blood sample drawing
burden might not be approved by some IRBs/ECs.

The common need for a speedy study setup also bears risks that can lead to an
enrollment backlog from the very beginning of the trial, including starting trial
recruitment preparations with a nonfinal protocol version, underestimating the
time needed for the setup of sites (e.g., contract negotiations taking longer than
anticipated), and not having contingency plans in place early on. Also, ensuring all
involved stakeholders such as different departments of a hospital and referring
institutions are informed of the upcoming trial and are able to communicate with
each other will help in avoiding any delays.
Engaging participating investigators early on in trial planning can positively
impact recruitment. Key opinion leaders, by their reputation in the community,
may influence their colleagues to promote the study protocol and enrollment into
the trial. Also, additional motivation might be gained if there are opportunities for
authorship on study-related publications for participating investigators (in alignment
with applicable publication authorship guidelines).

Issue Management

Once recruitment is at risk of getting off track, or is off track already, different corrective
actions can be taken, often with a combination of several actions being the most
successful approach. Ideally, these actions (as well as the trigger points for their
implementation) are defined in the recruitment plan, but there should also be flexibility
to adapt the corrective actions to the current situation (Bachenheimer 2016).
Possible measures to address delays in recruitment include:

• Increase/change of communication (e.g., newsletters to sites, investigator
meetings/teleconferences, trial promotional material, and recruitment-supporting material)
• Engage key opinion leaders or trial steering committee members to help promote
the trial in their community
• Opening up of new sites (ideally already “pre-initiated” as far as possible as
part of the recruitment plan) or expansion into other countries

• Review of the protocol to identify areas that could hinder recruitment and assess
if these can be modified (e.g., less stringent eligibility criteria, simplified and/or
less frequent study assessments)
• Increase incentive for participating investigators where and as possible (payment,
authorship on planned publications)
• Facilitate study participation for participants (e.g., payment where allowed,
reimbursement of expenses, material to help understand the informed consent
form, and the study specifics; see Fiminska 2014; Kadam et al. 2016)

Possible measures to address too fast recruitment include:

• Implement the assignment of screening/enrollment slots to individual sites (i.e.,
restrict the number of participants that can be enrolled per site or per country).
• Implement a temporary recruitment hold for all sites (although sometimes it can
be difficult to get sites restarted at the same rate of recruitment as before).

Possible measures to address enrollment of wrong participants include:

• Retrain site staff on trial specifics.
• Investigate possibilities to increase resources at sites (e.g., hiring of temporary
contractors dedicated to the study).
• Implement recruitment stop at individual sites.

Summary and Conclusion

Recruitment has been the main challenge in the management of clinical trials in
the past and is expected to become even more difficult with an increasingly demanding
regulatory environment and more trials being run in highly specific populations. Therefore,
thorough upfront planning of recruitment is key to ensure enrollment of the
right participants in time and within budget. The factors to be considered include
evaluation of the quality and number of sites needed, feasibility of the protocol,
and mitigation of any potential risks as much as possible through communication
with the key stakeholders. The recruitment plan should also include a strategy to
access candidates, to monitor enrollment progress, and to retain participants in
the trial. The use of supporting tools such as benchmarking and simulation should
be factored in when setting up the trial budget.

Key Facts

Recruitment of trial participants consists of identification, assessment, and
enrollment of trial candidates. A basic metric to quantify enrollment is the
recruitment rate, defined as the number of participants enrolled per site per time
unit (usually months). Factors influencing recruitment are related to the availability
of appropriate sites, the protocol design, and the communication with key stake-
holders of the trial. Retention of trial participants for as long as mandated by the
protocol is important to minimize gaps in data collection.
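In code, the recruitment-rate metric reduces to a one-line calculation; the figures in this minimal sketch are hypothetical.

```python
def recruitment_rate(n_enrolled, n_sites, n_months):
    """Participants enrolled per site per month."""
    return n_enrolled / (n_sites * n_months)

# Hypothetical example: 189 participants across 7 sites over 18 months
print(recruitment_rate(189, 7, 18))  # -> 1.5 participants/site/month
```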

Cross-References

▶ Advocacy and Patient Involvement in Clinical Trials
▶ Consent Forms and Procedures
▶ International Trials
▶ Selection of Study Centers and Investigators

References
Anderson DL (2001) A guide to patient recruitment: today’s best practices and proven strategies.
CenterWatch, Boston
Bachenheimer JF (2016) Adaptive patient recruitment for 21st century clinical research. Available
via Applied Clinical Trials. http://www.appliedclinicaltrialsonline.com/adaptive-patient-recruit
ment-21st-century-clinical-research. Accessed 08 Jan 2019
Bachenheimer JF, Brescia BA (2017) Reinventing patient recruitment: revolutionary ideas for
clinical trial success. Taylor and Francis BBK Worldwide, Needham MA, USA
Brintnall-Karabelas J, Sung S, Cadman ME, Squires C, Whorton K, Pao M (2011) Improving
recruitment in clinical trials: why eligible participants decline. J Empir Res Hum Res Ethics 6
(1):69–74. https://doi.org/10.1525/jer.2011.6.1.69
DePuy V (2017) Enrollment simulation in clinical trials. SESUG Paper LS-213-2017. Available via
https://www.lexjansen.com/sesug/2017/LS-213.pdf. Accessed 08 Jan 2019
Fiminska Z (2014) 5 tips on how to facilitate clinical trial recruitment. EyeForPharma. Available at
https://social.eyeforpharma.com/clinical/5-tips-how-facilitate-clinical-trial-recruitment.
Accessed 08 Jan 2019
Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB (2015) Fundamentals of
clinical trials. Springer International Publishing, Cham
Graham LA, Ngwa J, Ntekim O, Ogunlana O, Wolday S, Johnson S, Johnson M, Castor C,
Fungwe TV, Obisesan TO (2017) Best strategies to recruit and enroll elderly blacks into clinical
and biomedical research. Clin Interv Aging 13:43–50
Kadam RA, Borde SU, Madas SA, Salvi SS, Limaye SS (2016) Challenges in recruitment and
retention of clinical trial subjects. Perspect Clin Res 7(3):137–143
Katz B, Eiken A, Misev V, Zibert JR (2019) Optimize clinical trial recruitment with digital
platforms. Dermatology Times. Available via https://www.dermatologytimes.com/business/opti
mize-clinical-trial-recruitment-digital-platforms. Accessed 08 Jan 2019
Kye SH, Tashkin DP, Roth MD, Adams B, Nie W-X, Mao JT (2009) Recruitment strategies for a
lung cancer chemoprevention trial involving ex-smokers. Contemp Clin Trials 30:464–472
Lara PN Jr, Higdon R, Lim N, Kwan K, Tanaka M, Lau DHM, Wun T, Welborn J, Meyers FJ,
Christensen S, O’Donnell R, Richman C, Scudder SA, Tuscana J, Gandara DR, Lam KS (2001)
Prospective evaluation of cancer clinical trial accrual patterns: identifying potential barriers to
enrollment. J Clin Oncol 19(6):1728–1733
Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW,
Molenberghs G, Murphy SA, Neaton JD, Rotnitzky A, Scharfstein D, Shih WJ, Siegel JP, Stern
H (2012) The prevention and treatment of missing data in clinical trials. N Engl J Med 367
(14):1355–1360
National Institute of Mental Health (2005) Points to consider about recruitment and retention
while preparing a clinical research study. Available via https://www.nimh.nih.gov/funding/
grant-writing-and-application-process/points-to-consider-about-recruitment-and-retention-whil
e-preparing-a-clinical-research-study.shtml. Accessed 08 Jan 2019
National Institute on Aging (2018) Together we make the difference – National Strategy for
recruitment and participation in Alzheimer’s and related dementias clinical research. Available
via U.S. Department of Health & Human Services. https://www.nia.nih.gov/sites/default/files/
2018-10/alzheimers-disease-recruitment-strategy-final.pdf. Accessed 08 Jan 2019
Thoma A, Farrokhyar F, McKnight L, Bhandari M (2010) How to optimize patient recruitment.
Can J Surg 53(3):205–210
16 Administration of Study Treatments and Participant Follow-Up

Jennifer J. Gassman

Contents
  Introduction
  Administration of Study Treatments
    Introduction to Administration of Study Treatment
    Training
    Verification of Site Readiness
    Inclusion and Exclusion Criteria Focused on Treatment Administration
    Eligibility Checking and Randomization
    Getting the Treatment to the Participant
    Promoting Treatment Adherence
    Monitoring Treatment Adherence
    Monitoring Early Treatment Discontinuation and Tracking Reasons for Discontinuation
    The Role of the Study Team in Enhancing Treatment Adherence
    The End of Treatment
  Participant Follow-Up
    Introduction
    Planning the Follow-Up Schedule During Trial Design
    Making Trial Data Collection as Easy as Possible for the Participant
    Training the Participating Site Staff on Follow-Up
    Retention Monitoring
    Factors Related to Predicting Retention
    The Role of the Study Team in Promoting Retention
  Interrelationship Between Treatment Discontinuation and Dropouts
  Summary and Conclusion
  Key Facts
  Cross-References
  References

J. J. Gassman (*)
Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA
e-mail: [email protected]

Abstract
After clinical trial participants have consented, provided baseline data, and been
randomized, each participant begins study treatment and follow-up. This chapter
covers administering a participant’s randomly assigned treatment regimen and
collecting the participant’s trial data through the end of their time in the study,
along with tracking and reporting data on timeliness and quality of treatment
administration and of follow-up visit attendance and trial data collection. Treat-
ment administration can include providing study medications or, in a lifestyle
intervention trial, teaching the participant to follow a diet, exercise, or smoking
cessation intervention. Trial data collection includes, for example, questionnaires
completed via smartphone, laboratory sample collection details and results of lab
analyses, imaging data, treatment adherence data, measurements taken at clinic
visits, and adverse event data. Monitoring participant follow-up and capturing
reasons why patients discontinue treatment or end follow-up early can aid in
interpretation of a trial’s results. Treatment administration, treatment adherence,
and participant follow-up metrics should be captured in real time, with the Data
Coordinating Center (DCC) providing continuous performance feedback to study
leadership and to the participating sites themselves. These aspects of trial conduct
are described in the context of a multicenter trial in which two or more clinical
sites enroll participants and study data are managed in a centrally administered
database. Timeliness and accuracy of study treatment administration is key to the
success of a trial. Participants providing required data according to the protocol-
defined schedule allow a trial to attain its goals.

Keywords
Run-in · Pill count · Treatment crossover · Unblinding · Treatment
discontinuation · Visit schedule · Visit window · Close out · Drop out (no longer
attending clinic visits) · Withdrawal of consent (no longer willing to provide data)

Introduction

The success of a multicenter randomized clinical trial depends on valid study design
and study conduct. Valid study design starts with the requirement that all specified
baseline data be captured prior to randomization/registration and that treatment
commence as soon as possible after randomization. Once trial randomization begins,
recruitment, adherence, and retention are the keys to study success. This chapter
addresses administration of study treatments and participant follow-up in the context
of a multicenter trial where two or more clinical sites enroll participants and study
data are managed in a database administered by a Data Coordinating Center (DCC)
at an academic site or a commercial site such as a CRO.
Study treatments in a trial may include, for example, active and placebo medica-
tion in the form of pills, liquids, injections, or IV drugs; methods of surgery; and
systems by which patients are taught or encouraged to follow diet, exercise, or
smoking cessation regimens. Observation (no intervention) can also be a study
“treatment” in a trial with multiple arms. Interventions chosen for a trial should be
considered acceptable by physicians treating patients with the condition under study;
this can be evaluated by survey prior to study initiation, as was done in the NIDDK
FSGS Clinical Trial (Ferris et al. 2013) or discussed during physician training.
Appropriate study conduct requires monitoring initial and subsequent treatment
administration, and treatment adherence data can be important for interpretation of
trial results.
Participant follow-up includes postrandomization/enrollment data collection
including patient-reported outcome questionnaires completed via smartphone, lab-
oratory sample collection details and results of lab analyses, imaging data, treat-
ment adherence data, and measurements taken at clinic visits. Monitoring
participant follow-up and capturing reasons why patients discontinue treatment
or end follow-up early are useful for assessing bias and for interpretation of a
trial’s results. A well-designed and well-conducted trial will have high rates of
participant follow-up. An appropriate study protocol is key here; the protocol’s
defined visit schedule must include contact with patients sufficiently frequently so
that patients are unlikely to move or change phone numbers between visits, and
sufficiently infrequently so that patients do not become burned out by having to
constantly fulfill study requirements for questionnaires, phone calls, lab sample
collections, and visits. The intent-to-treat data analyses that make trial conclusions
valid benefit from high rates of participant follow-up and data collection com-
pleteness (Byar et al. 1976).
Treatment administration, treatment adherence, and participant follow-up metrics
should be captured in real time, with the Data Coordinating Center (DCC) providing
continuous performance monitoring of these data to study leadership and to the
participating sites themselves. Methods for this feedback include frequently updated
routine reports posted on a study website and/or pushed to study leadership and
participating sites via e-mail as weekly reports.

Administration of Study Treatments

Introduction to Administration of Study Treatment

Immediate and per-protocol administration of study treatment is essential. Any steps
that can be taken that will facilitate a participant starting study treatment as soon as
possible after randomization and staying on treatment per protocol will enhance study
validity. If the treatment requires a drug, starter supplies of the medication should be
available at each clinical site. If this is not possible, the study should have a system in
place such that the participant can receive their treatment in an expedited way. The
study treatment administration goal is that all participants start treatment soon after
randomization, remain on treatment throughout their assigned study follow-up
period, and comply as described in the study protocol.

Training

During staff training, study coordinators and other team members should be taught
how treatment is administered and the importance of treatment adherence (as well as
study visit schedule implementation and the importance of retention, covered in
section “Participant Follow-Up” of this chapter). Study coordinators and investiga-
tors experienced with the treatment being studied should be invited to present
segments of training related to treatment administration including blinding and
treatment challenges.
Regarding blinding, the training should include (1) who will know what treatment
each participant receives, particularly with regard to whether the study is
unblinded/open label; single-blinded, where the participant does not know their
treatment but members of the study team do; or double-blinded, where neither the
participant nor any member of the study team knows the participant’s treatment
assignment; (2) managing unblinding (see ▶ Chaps. 43, “Masking of Trial
Investigators,” and ▶ 44, “Masking Study Participants”); and (3) a discussion of the
role of study blinding in preventing subconscious bias in outcome assessment.
Regarding challenges associated with the study treatment, study
coordinators who are less familiar with the trial treatment will benefit from
hearing from those with on-the-ground experience using this treatment or similar
treatments. For lifestyle intervention trials, the systems in place implementing the
experimental and control treatment arms should be covered in detail, including
the study coordinators’ role in encouraging adherence to the intervention. For
medication trials, coordinators should be trained to teach participants require-
ments for their assigned treatment, e.g., whether a study drug should be taken on
an empty stomach or with a meal as well as how and when dosage is to be
increased or reduced. At training, study coordinators should also be informed of
the study plans for monitoring and reporting treatment adherence. The Data
Coordinating Center (DCC) team should present templates of treatment admin-
istration and treatment adherence-related tables from planned feedback reports/
electronic weekly reports.
Once the trial is under way, coordinators at one site may be able to offer valuable
tips to those at other sites based on their experience with participants who have had
difficulties with treatment adherence. Study coordinators and other participating site
team members can learn additional strategies for improving treatment adherence
informally, as part of structured routine study coordinator web meetings or calls,
and at annual training/retraining meetings.
Study coordinators should be fully engaged in helping to ensure participants
receive their treatments on schedule and maintain the trial’s treatment blinding
as they work to enhance treatment adherence. Coordinators should also be
encouraged to engage and garner support from the participant’s family when
possible.
Training should also include a description of closeout in which a participant’s
treatment ends, remaining medications – if any – are returned, and study visits
cease.

Verification of Site Readiness

Before a participating site’s subjects are enrolled and randomized, site readiness
should be assessed. A site “Ready to Enroll” table should be included in trial
monitoring (e.g., in the weekly report) showing whether each site has met require-
ments to begin consenting participants. These requirements will vary from study to
study but should include IRB/Ethics Committee approval, completion of staff
training requirements, capture of needed site delivery addresses and staff contact
information, and (for medication trials that include a starter supply of medication)
receipt of the initial supply of medication or information on how study drug is
ordered, along with clear plans for appropriately secure storage of medication and
participant study documents as required by local regulations. Participants should not be
consented until a site is ready to begin randomization/recruitment. Site initiation
visits are sometimes conducted to ensure site readiness.

Inclusion and Exclusion Criteria Focused on Treatment Administration

The trial protocol should ensure that participants who consent and are random-
ized to study treatment are likely to be able to comply with study treatment.
Treatment-related inclusion and exclusion criteria can help ensure participants
who are not likely to be able to comply with study treatment requirements are
not randomized. A first step in this process is to ensure that the potential
participant can safely follow the treatment; participants who are allergic to or
have had side effects when on the study treatment will generally be excluded, as
will participants who require treatment with a concurrent medication that is
contraindicated for a participant on the active treatment arm. A second step
would be to ensure that the participant is likely to follow the treatment regimen;
participants who have a history of nonadherence to the type of treatment being
administered will generally be excluded. A third step would be to ensure that the
participant will be available to take the treatment; for many studies, participants
who spend several months each year away are problematic. Study investigators
should consider whether, for example, a participant who spends winter in Florida
or a participant who is away at school during the academic year would be
appropriate for randomization and should adjust exclusion criteria accordingly.
(Note that this third criterion is also important for complete participant follow-
up, i.e., retention, as described in section “Participant Follow-Up.”)

Eligibility Checking and Randomization

Informed consent is, of course, part of a participant’s time line in a trial. Study
procedures may not be performed and data may not be submitted to the DCC
until the participant has consented. The timing of informed consent is covered
in this book’s ▶ Chap. 21, “Consent Forms and Procedures.” A trial partici-
pant’s data collection time line generally consists of screening, a baseline visit
or visits, eligibility confirmation, randomization, and subsequent follow-up
visits. Ideally, the participating site’s team will have at least two contacts
with a potential enrollee prior to randomization (Meinert 2012). This will
allow time for the participant to be certain they understand the requirements
of the trial and are fully on board and have had all their trial-related questions
answered, and time for the study team to fully consider the participant’s
suitability for the trial.
Trials in which the protocol includes baseline visits allow an opportunity to
test the participant’s ability to follow trial requirements during a run-in, or test,
period, and many trials include such a run-in test of trial requirements, as
described in this book’s ▶ Chap. 42, “Controlling Bias in Randomized Clinical
Trials.” For example, in the Frequent Hemodialysis Network Daily Trial, the
experimental treatment arm featured 1 year of six shortened dialysis sessions per
week versus the usual care control arm (three standard sessions per week).
During baseline, participants were required to visit the dialysis unit daily for 6
consecutive days. Three participants dropped out or were excluded (FHN Trial
Group et al. 2010) because of unanticipated difficulties in getting to their dialysis
unit 6 days a week. In the Frequent Hemodialysis Network Nocturnal Trial, each
participant’s home water supply needed to be evaluated for appropriateness of
use of in-home hemodialysis (Rocco et al. 2011). In FONT II (Trachtman et al.
2011), each participant’s Angiotensin Converting Enzyme (ACE) inhibitor/
Angiotensin Receptor Blocker (ARB) was followed through a series of baseline
visits to ensure the regimen was effective and stable. In medication trials,
participants sometimes take placebo medication during Baseline. A run-in period
is particularly useful if the treatment may be unacceptable to some patients
because of pill/capsule size or the number of pills that must be taken each day.
Such a Baseline run-in period may include specified adherence criteria such as
“Pill count must show 80% adherence to treatment during run-in,” as was
required in the COMBINE Trial (Ix et al. 2019) or may include a check-in
with the participant and a record of whether the participant reported any diffi-
culties taking the study medication. Once the Baseline period is complete, the
participant will be randomized to treatment if study eligibility confirmation, e.g.,
a “Ready to Randomize” interactive program, has verified that all required
baseline data have been collected and the site has verified that, logistically, the
participant is available to start their randomized treatment, e.g., the participant is
not currently traveling or hospitalized. At this point, the participant is random-
ized and irrevocably part of the study, and the site is notified of the patient’s
treatment assignment; in a blinded medication trial, the site would receive the
bottle or bin number of the medication to be provided to the participant. In
studies where randomization is carried out online, the treatment assignment
should be e-mailed to the study coordinator in addition to being displayed on
the randomization screen to ensure that the study coordinator can easily confirm
the assignment.
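A “Ready to Randomize” check can be sketched as a simple completeness gate over the required baseline items. The field names, the 80% run-in adherence criterion, and the record below are hypothetical illustrations, not any specific study’s rules:

```python
REQUIRED_BASELINE_ITEMS = ["consent_date", "final_baseline_visit_date",
                           "baseline_lab_result", "run_in_pill_count_pct"]

def ready_to_randomize(record):
    """Return (ok, problems) for a candidate's baseline record: every
    required item must be present, and run-in adherence by pill count
    must be at least 80%."""
    problems = [item for item in REQUIRED_BASELINE_ITEMS
                if record.get(item) is None]
    pct = record.get("run_in_pill_count_pct")
    if pct is not None and pct < 80:
        problems.append("run-in pill count adherence below 80%")
    return (not problems), problems

ok, problems = ready_to_randomize({
    "consent_date": "2024-01-05",
    "final_baseline_visit_date": "2024-02-02",
    "baseline_lab_result": 4.9,
    "run_in_pill_count_pct": 86.7,
})
print(ok, problems)  # -> True []
```

In practice such a gate runs inside the study database before the treatment assignment is released, so that no participant can be randomized with incomplete baseline data.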

Getting the Treatment to the Participant

In studies with drug treatments, each participating site will provide an address for
shipment of drug; at many sites, this will be the address of a hospital’s research
pharmacy. Drugs will come from their manufacturer or a study’s central pharmacy
(as described in this book’s ▶ Chap. 7, “Centers Participating in Multicenter Trials”)
and may be provided to the sites in bulk (for local blinded distribution) or in coded,
numbered bins or numbered bottles. Details on how drugs come to the participating
sites are in this book’s ▶ Chap. 11, “Procurement and Distribution of Study
Medicines.” On-site options for getting the treatment to the participant include,
having the participant pick their medications up from the site’s pharmacy, or having
the study coordinator pick the medication up for the patient so the coordinator can
hand the medication to the participant. When possible, handoff by the study coor-
dinator is easier for the participant and ensures that the participant leaves the site
with study drug.
Participants should begin treatment as soon as possible after randomization. In
studies with a baseline period, final baseline eligibility data and baseline values for
study outcomes, including lab results or imaging, must be captured prior to random-
ization, so the participant timeline will include a final baseline visit shortly prior to
randomization. If all required results are expected to be available before the partic-
ipant leaves the clinic and arrangements can be made such that study drug is
available on-site at the clinic, it may be possible for a participant to be randomized
at the end of this last baseline visit and go home with medication that day. If
inclusion criteria include data resulting from images or lab tests done at the last
visit of baseline, eligibility cannot be determined until after the final baseline visit. In
such a case, procedures should include randomizing the participant as soon as
possible, but at a time when the participant can begin taking drug (i.e., when the
participant is in town and not in the hospital) and getting the treatment to the
participant as soon as possible after randomization. A participant’s appointment
schedule will be based on the date of randomization, not on the day he or she started
treatment, i.e., the target date for a 1-year follow-up visit should be 1 year from
randomization. The study protocol may include a visit held shortly post-
randomization in which the participant receives their study medication (often
referred to as the “Follow Up 0” or the “Week 0” visit as in the AASK Trial
(Gassman et al. 2003) or FSGS Trial (Gipson et al. 2011)). A face-to-face visit
will be required if the treatment must be delivered under medical supervision, i.e., an
IV infusion. Alternatively, the protocol may allow for the treatment being delivered
to the participant or the participant picking the drug up from the site’s research
pharmacy. It may be helpful for a participant to interact with a study team member
when a treatment is provided, particularly under protocols in which there is some
complexity to treatment administration, e.g., in the case of a double-dummy system
where the participant must take two different types of medication or in the case
where it is critical that the medication be taken under specific circumstances, such as
on an empty stomach or with a meal. Whichever method is used, the date the
participant begins taking medication should be captured in the study database.

Table 1 Time from randomization to initiation of treatment

Participating Site   Randomized   Initiated treatment   Median days from randomization to treatment initiation
1 California              20              20              2.2
2 Colorado                48              46              1.3
3 Connecticut             17              17              1.0
4 Delaware                34              33              1.0
5 Florida                 47              45              1.6
6 Georgia                 16              15              1.8
7 Illinois                 7               7              2.4
TOTAL                    189             183              1.5

Time from randomization to initiation of treatment (overall and by participating site)
is a useful metric to track. An example is shown in Table 1.
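Computing the Table 1 metric from per-participant dates is a small exercise with the standard library; the record layout and field names in this minimal sketch are hypothetical.

```python
from datetime import date
from statistics import median

def median_days_to_treatment(participants):
    """Median days from randomization to treatment initiation,
    restricted to participants who have initiated treatment."""
    gaps = [(p["treatment_start"] - p["randomized"]).days
            for p in participants if p.get("treatment_start")]
    return median(gaps) if gaps else None

demo = [
    {"randomized": date(2024, 3, 1), "treatment_start": date(2024, 3, 2)},
    {"randomized": date(2024, 3, 4), "treatment_start": date(2024, 3, 6)},
    {"randomized": date(2024, 3, 9), "treatment_start": None},  # not yet started
]
print(median_days_to_treatment(demo))  # -> 1.5
```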

Promoting Treatment Adherence

It may be shocking for those new to the conduct of clinical trials to learn that
sometimes participants who are randomized to a study treatment do not adhere to
their treatment assignment. A trial’s Coordinating Center and Steering Committee
should implement multiple systems to enhance treatment adherence. As a first step
for any medication trial, efforts should be made to ensure that the participant does
not simply forget to take their pills. This can be customized to the treatment
requirement. For example, if a pill is to be taken in the morning, the study coordi-
nator might review the participant’s morning routine and determine where the study
medication should be kept, e.g., next to the coffee pot. If a pill is to be taken multiple
times a day, it may be useful to provide the participant with a pillbox; 2 × 7 and
4 × 7 weekly pillboxes are readily available. Smartphone applications for reminders
are available (Dayer et al. 2013; Ahmed et al. 2018; Santo et al. 2016) and have been
used successfully in randomized clinical trial settings (Morawski et al. 2018). Some
medications have clear requirements for successful administration. For example,
phosphate binders must be taken with a meal containing phosphorus in order to
reduce the possibility of GI side effects, and all patients randomized to the COM-
BINE Study (Isakova et al. 2015) were reminded at each visit to make sure to take
their blinded study medication with a meal. Ensuring requirements such as this are
met can also prevent treatment unblinding; e.g., a participant who takes placebo
phosphate binders on an empty stomach will not experience GI side effects, whereas
a patient who takes active phosphate binders on an empty stomach likely will.
Study protocols and manuals of operations will include steps related to reducing or
temporarily stopping medication when a participant reports mild side effects poten-
tially related to treatment, seeing if the side effect goes away, and then up-titrating
back, possibly to a lower dose. Reducing or temporarily stopping medication may also
be helpful for a participant who suspects a symptom he or she is experiencing is caused
by the study medication, even if the study team sees no pathway by which the drug in
use could cause that symptom.
In a long-term study, investigators might consider having study coordinators
suggest a brief “pill holiday” to allow participants who are at risk of ending
participation to take a week or a month off their study drug. In long-term studies,
when participants have stopped medication (treatment discontinuation) for reasons
unrelated to the study drug and continue attending study visits, it is useful to ask the
participant at subsequent visits if he or she might now consider going back on the
medication, perhaps at a lower dose than previously.
Participants may refuse the treatment to which they have been randomized or
become a treatment crossover, i.e., a participant who has switched to another study
treatment arm. Such a participant is sometimes called a drop-in, defined by
Piantadosi (2017) as a participant who takes another treatment that is part of the
trial instead of the treatment he or she was randomized to, and can be followed for
study outcome. Drop-ins cause treatment effect dilution whereby the estimate of the
difference between the effect of experimental treatment and the control treatment is
reduced.
Finally, for many types of participants (e.g., adolescents, the elderly) and many
types of treatments (e.g., antihypertensives, antirejection drugs, retroviral drugs, and
dietary interventions), there is a full body of research on barriers to adherence. This
is beyond the scope of this chapter, but the DCC and the study leadership should be
aware of the literature on adherence related to the participant group and the treatment
under study.

Monitoring Treatment Adherence

Drugs do not work in participants who do not take them. – C. Everett Koop, M.D.,
US Surgeon General, 1985 (Osterberg and Blaschke 2005).
A variety of methods are available for treatment adherence monitoring, and the
method selected depends on the type of study and the type of information required
(Zulig and Phil Mendys 2017). Medication electronic monitors, also called MEMS
or “smart bottles,” are expensive but can provide precise information on when a pill
bottle or a particular section of a pillbox is opened, allowing investigators to assess
adherence to days and adherence to times of day pills are taken (Schwed et al. 1999).
Methods such as pill counts and weighing medication bottles require participants to
remember to return their “empty” bottles and are logistically cumbersome for the site
staff, and these methods are easily influenced by participants who know their
adherence will be monitored in this way (Meinert 2012), a phenomenon sometimes
referred to as “piles of pills in the parking lot,” i.e., participants who have not taken
their study meds and want to please their study team will discard pills before coming
in to a visit so their adherence by pill count is high. Medication diaries may also be
utilized (Farmer 1999). An example report of adherence by pill count by site is
shown in Table 2. Only those who brought in their pills for counting and had a pill
count done are generally included in pill count denominators, which could cause pill
counts to be biased upward, i.e., those who do not bring in pills for counting may
have taken fewer pills on average than those who did bring pills in for counting.
Rates of pill count completion and implications of missing pill counts should be
included in the discussion section of papers that include information on adherence.
Note that taking more than 100% of prescribed medication is another form of
nonadherence, and consideration should be given to how rates over 100% should
be handled in calculations of adherence; a participant who sometimes takes 80% of
required medications and sometimes takes 120% of required medications should not
be considered to be 100% compliant over time. Consider capping adherence esti-
mates at 100% when evaluating adherence, as was done in Table 2. The DCC should
also provide Participating Sites with feedback on numbers and proportions of
participants with adherence over a threshold, e.g., over 100%, so the sites can
work with their participants on this. Because adherence and treatment crossover
vary over time in a long-term study, adherence should be considered as a continuum
(Meinert 2012), so a participant has a percent compliance in a trial, not a dichoto-
mous classification as a complier or noncomplier.
When treatment and visits are both discontinued (e.g., because a participant has
become lost to follow-up or withdrawn consent for follow-up), this is reported as a
study discontinuation, and as with those who did not provide pills for counting, the
participant is removed from the denominator of reports of adherence to treatment.
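The capping rule and the restricted denominator described above translate directly into code. A minimal sketch with hypothetical field names:

```python
def adherence_pct(dispensed, returned, expected_doses):
    """Percent of expected doses taken by pill count, capped at 100%."""
    taken = dispensed - returned
    return min(100.0, 100.0 * taken / expected_doses)

def adherence_summary(participants, threshold=80.0):
    """Mean capped adherence and proportion at/above threshold among
    participants with a documented pill count; participants without a
    count are excluded from the denominator and reported separately."""
    pcts = [adherence_pct(p["dispensed"], p["returned"], p["expected"])
            for p in participants if p.get("returned") is not None]
    n_missing = sum(1 for p in participants if p.get("returned") is None)
    if not pcts:
        return {"n_with_count": 0, "n_missing_count": n_missing}
    return {"n_with_count": len(pcts),
            "n_missing_count": n_missing,
            "mean_pct": round(sum(pcts) / len(pcts), 1),
            "pct_80_plus": round(100 * sum(x >= threshold for x in pcts)
                                 / len(pcts), 1)}

print(adherence_summary([
    {"dispensed": 100, "returned": 5, "expected": 90},    # capped at 100%
    {"dispensed": 100, "returned": 40, "expected": 90},   # 66.7%
    {"dispensed": 100, "returned": None, "expected": 90}, # no pill count
]))
```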
Table 2 Adherence at 3 months by pill count

Participating Site   Randomized 3 months ago plus 14-day data entry lag   Number with 3-month visit documented   Number with 3-month pill count   3-month pill count: mean ± SD, min, max   % with 80%+ adherence (of those with pill counts)
1 California     19   16   15   89.9 ± 14.7, 56.7, 100.0   86.67
2 Colorado       52   49   40   87.5 ± 13.5, 48.6, 100.0   77.50
3 Connecticut    18   16   15   95.1 ± 9.4, 66.7, 100.0    93.33
4 Delaware       41   38   36   90.9 ± 10.0, 62.9, 100.0   83.33
5 Florida        49   45   39   88.3 ± 18.0, 21.9, 100.0   84.62
6 Georgia        18   16   13   86.8 ± 14.7, 50.0, 100.0   76.92
7 Illinois        7    7    5   82.1 ± 24.2, 45.7, 100.0   60.00
Total           204  187  163   89.2 ± 14.3, 21.9, 100.0   82.21

Medication interview questions are less cumbersome than pill counts for partic-
ipants and staff. Questions must be chosen with care; good interviews, or “medica-
tion interrogation,” can yield results similar to those obtained from pill counts.
Stewart (1987) found that a single question could identify 69.8% of compliers and
80% of noncompliers if pill count is used as a gold standard. Stewart’s question was
phrased as “How many doses might you have missed in the 10 days?” and was asked
as follow-up to an affirmative answer to a nonjudgmental question regarding
whether the participant might have missed some doses. It is important that medica-
tion interrogation be carried out in a nonjudgmental way. Kravitz, Hays and
Sherbourne (1993) reported that when a cohort of 1751 patients with diabetes
mellitus, hypertension, and heart disease was surveyed on their adherence to med-
ications, more than 87% reported they had taken their medications as instructed by
their doctors “all of the time.” A review by Garber et al. (2004) found a wide range of
level of agreement between participant-reported adherence by interview, diary, or
questionnaire and more objective measures of adherence such as pill count, canister
weight, plasma drug concentration, or electronic monitors.
In a pragmatic trial, a participating site may be able to track whether a participant
has picked up their treatment or refilled their prescription.
Direct signs of treatment adherence include laboratory measures of levels of the
drug itself or of one of its metabolites (Osterberg and Blaschke 2005). Biomarkers
of treatment response may also be useful as an indirect sign of treatment adherence.
When monitoring treatment adherence, it is traditional to report adherence on
those for whom the adherence was measured. That is, if 100 participants were
assigned to a treatment and pill counts are available for 50 of these participants,
treatment adherence is reported for the 50 who provided pill count data. The “zero”
pill count adherence for those who did not return their pills to the participating site,
or those whose pills were not counted for some other reason, is discussed in a
nonquantitative way rather than being “averaged in” as pill counts of zero.

Monitoring Early Treatment Discontinuation and Tracking Reasons for Discontinuation

Early treatment discontinuation and treatment crossover can dilute the estimate of a
true treatment effect. In early discontinuation, a participant stops taking their
assigned treatment. With early discontinuation of active treatment, the participant’s
observed treatment effect shifts to a different effect, which may be less than or equal to
the treatment effect observed in the study’s placebo group. Treatment crossover is
worrisome when a participant randomized to the placebo group seeks out and begins
taking the treatment being used in the active treatment group. Discontinuations and
treatment crossovers can lead to effective interventions being found ineffective and
should be carefully tracked to allow for sensitivity analyses and to inform investi-
gators planning future trials of similar treatments.
Sometimes, early treatment discontinuation is clearly related to an adverse event
(AE) or a serious adverse event. Examples include AEs related to lab values or
symptoms. Lab Value AEs requiring treatment discontinuation may be specified in
the study protocol. For example, in the AASK Study, participants (who may have
been randomized to Lisinopril) discontinued their ACE inhibitor if their serum
potassium was over 5.5 mEq/L (Weinberg et al. 2009). In situations such as this, a physician
may choose to stop study treatment if a participant is near the protocol-required cut-
point as well. Either way, the primary reason for such treatment discontinuations
should be tracked in the study database, with separate categories for treatment
stopped due to a lab value as defined by protocol and treatment stopped due to an
observed lab value per physician judgment. Similarly, treatments stopped due to the appearance
of specific symptoms or potential medication side effects could be categorized as
having been stopped due to symptoms with separate categories for protocol-defined
discontinuation, physician judgment, or participant preference.
Other reasons for treatment discontinuation include a participant becoming burnt
out by the medication requirements of a study in the absence of abnormal lab values
or side effects. Such discontinuations may be flagged as discontinuation due to “pill
burden.” The study database should explicitly track the Participating Site’s evalua-
tion of the primary and secondary reason a participant stopped taking study drug.
These data should be captured in real time.
When participants stop taking medications because they have stopped coming to
visits, this should also be tracked. Table 3 shows an example tabulation of the
primary reason a study participant was not on study medications at the final
study visit, including both cases thought to be related to study medication (stopped
drug due to lab adverse event, stopped drug due to patient-reported side effects,
stopped drug due to patient-reported pill burden) and cases where a patient
stopped attending visits early but did not withdraw consent (discontinued active
study participation, allowing for passive follow-up only), declared withdrawal of
consent (would no longer provide study data), or can no longer be located or
contacted.
Regarding the topic of adherence in statistical analyses of clinical trials, the reader
is referred to the ▶ Chap. 93, “Adherence Adjusted Estimates in Randomized
Clinical Trials” in the Analysis section of this book authored by Sreelatha Meleth.

The Role of the Study Team in Enhancing Treatment Adherence

Every member of the clinical trial team has a role in enhancing treatment adherence.
Treatment adherence issues should be discussed on study coordinator conference
calls; it is particularly useful for coordinators who have had success in adherence-
related issues to share their experiences with those who have had less success.
Principal investigators (PIs) should become personally involved in providing posi-
tive feedback for high adherence, as well as encouraging and strategizing with
participants who have had difficulties with adherence. PIs should routinely discuss
each participant’s adherence with the team. An adherence committee made up of
study coordinators, physicians, and data-coordinating center staff members may be
able to come up with suggestions as a brainstorming group. Treatment adherence
should be a topic on the agenda of every steering committee meeting. The data
coordinating center should ensure that the trial’s manual of operations includes
strategies that will assist with adherence for the treatments being studied, and the
DCC is responsible for providing feedback on every aspect of adherence and
treatment discontinuation in an easy-to-read manner. Finally, in their role of moni-
toring study conduct, a trial’s Data Safety and Monitoring Board (DSMB) should
track study treatment adherence throughout the study and should note issues related
to treatment adherence in its reports back to the study leadership.

Table 3 Primary reason a study participant surviving to final visit (end of study) was not on
randomized study medication at final visit

Participating Site   Participant ID   Reason not on randomized study medication
1 California     10003   Patient-reported side effect (skin rash)
1 California     10015   Lost to follow-up
2 Colorado       20023   Patient-reported pill burden
2 Colorado       20039   Discontinued active study participation
2 Colorado       20041   Lost to follow-up
3 Connecticut    30006   Lab-related adverse event
3 Connecticut    30014   Attending visits but quit study meds (noncompliant)
4 Delaware       40015   Side effect (GI symptoms)
4 Delaware       40024   Lost to follow-up
4 Delaware       40037   Patient report: pill burden
5 Florida        50005   Withdrawal of consent
5 Florida        50027   Patient-reported side effect (skin rash)
5 Florida        50040   Withdrawal of consent
6 Georgia        60009   Lost to follow-up
6 Georgia        60013   Attending visits but quit study meds (noncompliant)
7 Illinois       70004   Lost to follow-up

Number for each reason:
Attending visits but quit study meds (noncompliant): 2
Discontinued active study participation: 1
Lab-related adverse event: 1
Lost to follow-up: 5
Patient-reported side effects: 3
Patient-reported pill burden: 2
Withdrawal of consent: 2
Total not on randomized treatment at final visit: 16

More work is needed in the area of predictors of or antecedents to adherence. In
an early review, Sherbourne and Hays (1992) noted that interpersonal quality of care
(having a good social support system) and satisfaction with financial aspects of care
stood out as potential predictors of adherence. Different antecedents appear predic-
tive in different studies, and Dunbar-Jacob and Rohay (2016) noted that different
methods of measuring adherence yield different predictors of adherence. They
looked for predictors in two trials and observed indications of gender and race
being more associated with electronically monitored adherence and a participant’s
self-efficacy being more associated with self-reported adherence.

The End of Treatment

For each trial participant, the last on-treatment study visit marks the participant’s
closeout, and remaining medications should be collected at the participating site. If
the participant forgets to bring their medications in to their last visit, effort should be
made to collect these medications.
Many trials include an off-treatment observation period after treatment ends. If a lab
test is to be taken at the end of a specified observation period (a final off-treatment
visit) to check for the persistence of a biomarker after treatment ends, as in the BASE
Trial (Raphael et al. 2020), the target date for the final off-treatment visit may depend
on the date the last dose of study medication was taken. For example, a month 13 final
off-treatment visit may have a target date of 4 weeks after the month 12 visit, rather
than 13 months after randomization, to allow assessment of the reduction of biomarkers
or signs 4 weeks after treatment ends. This should be considered during protocol development and
incorporated into the participant appointment schedule described in section “Partici-
pant Follow-Up” below.

Participant Follow-Up

Introduction

Every trial has a goal of complete follow-up for all participants, with complete collection
of the primary outcome for the study intent-to-treat analysis. In order to
achieve this goal, the participant follow-up visit schedule must be well-defined and
reasonable from both the patient and the participating site team’s point of view, and the
team will need to focus on visit attendance and prevention of incomplete follow-up
throughout the course of the trial. For a full discussion of intention to treat analysis, see
this reference book’s Analysis Section’s chapter ▶ Chap. 82, “Intention to Treat and
Alternative Approaches” by J. Goldberg. Incomplete follow-up carries with it a risk of
bias in the primary outcome, particularly when the number lost is substantially different
between the two treatment groups, raising the question of whether the experimental
intervention influenced attrition (Fewtrell et al. 2008). Even when retention rates are
the same in two treatment groups, if retention is not high, study power is reduced and
generalizability can be harmed (Brueton 2014). The Special Topics section of this
reference book includes ▶ Chap. 113, “Issues in Generalizing
Results from Clinical Trials” by Steve Piantadosi. Statistical methods are available to
handle missing data and are discussed in this book’s Analysis Section’s ▶ Chap. 86,
“Missing Data” by A Allen, F Li, and G Tong, but clearly prevention of missing data is
the goal and the impact of some missing data in the middle of a patient’s follow-up is
less of an issue than the loss of a patient’s final data. Results from trials with retention
rates of 95% or greater will generally be considered to be valid, particularly when
retention is similar in all treatment groups. Retention rates of 80% or lower call into
question the validity of results (Sackett et al. 1997). During trial design, staff training,
and throughout participant follow-up, retention is key.

Planning the Follow-Up Schedule During Trial Design

The follow-up schedule requires careful consideration during study design. Visits
held prior to randomization include screening visits and baseline visits. Visits held
after randomization (when a randomized treatment is allocated to the patient from
the study’s randomization schedule) are designated as follow-up visits. If the treat-
ment is provided to the patient on the day of randomization, the visit may be referred
to as the “randomization visit.” The number of visits and other contacts included in a
trial’s visit schedule must be frequent enough to allow for treatment administration
and, where necessary, dose adjustment, as well as patient training and collection of
needed adherence, process, safety, and outcome data. Trials sometimes have more
visits early in follow-up as participants learn to follow their assigned intervention
and/or ramp up dosages.
The visits should therefore be often enough but not too often, and in cases where
no hands-on data collection is required, phone or electronic contact can be
substituted for visits. If participant contact or participant visits occur only once a
year, more participants will become difficult to locate or contact because they have
moved or changed phone numbers since their last contact. Infrequent visits also
make it difficult to capture complete and accurate information on adverse events
(AEs) and serious adverse events (SAEs). On the other hand, if participant contact or
participant visits occur frequently (e.g., weekly) throughout a trial, this may be too
much for a participant to bear. This is particularly true if a number of participants
have long travel times due to distance or traffic; it is useful to collect information on
participant travel time to the clinic so that if during study conduct, visit adherence
becomes an issue, the site can check on the relationship between travel time and visit
attendance and, if necessary, limit enrollment of participants who live in areas that
are problematic for follow-up visits. The schedule of follow-up visits will depend on
the complexity of the study intervention and the disease being studied.
All participants should have the same visit schedule; this will prevent bias in AE
and SAE detection [“The more one looks, the more one sees” (Meinert 2012)] and
will reduce follow-up bias associated with more time and attention spent on inter-
vention group participants. The duration of the visit schedule should be well-defined;
follow-up either continues until a common calendar date for all patients or continues
to a common time point for each patient, e.g., 24 months postrandomization. Follow-
up should continue according to protocol regardless of patient adherence to treat-
ment or patient attainment of a given outcome.

Making Trial Data Collection as Easy as Possible for the Participant

As noted, once a study begins, recruitment, retention, and adherence are key. The
plans for visits and data collection must be safe for the patient and should be kept as easy
as possible for the participant. When multicenter trials are being designed, steering
committees balance their hopes for pragmatism with special research interests, and a
study protocol may be full of visit requirements including questionnaires, lab tests,
physical function tests, imaging, and clinical measurements.
The desire to capture a large amount of data is sometimes addressed by having
shorter routine visits, collecting more data remotely, and concentrating additional data collection at
annual visits. Collecting more data remotely is helpful. Questionnaires can be
completed from home during the week before a visit, online via smartphone, tablet,
or computer, or can be sent and returned by mail. Collecting more data at annual
visits can cause annual visits to be overwhelmingly long. The steering committee
should think outside the box. In a trial with brief quarterly visits, it may be possible
to collect some of the “annual extra data” at months 0, 12, 24, and 36 and other data
at, say, months 0, 9, 21, and 33. If participants decide that their visits are too long,
they will be less likely to attend all of their visits.
Reminders can help with visit attendance. Coordinators can customize reminders
prior to visits to the participant; some participants will prefer a reminder via text
message rather than a phone call, for example. Reminders are also helpful when a
patient must bring along a sample (24-hour urine jug, for example) or their pill
bottle(s) for counting or weighing.
Consideration should be given to ensure that requirements are convenient and
that participants will be comfortable. The site should consider not only covering
parking expenses but also making sure the participant can park in a convenient area.
Childcare expenses could be covered. Holding evening and weekend visits will be
helpful for working participants. If, for example, a visit includes a test that requires a
12-hour fast, visits should be scheduled in the morning. If a trial requires going to a
distant part of the medical campus, the study coordinator should arrange for a shuttle
ride. Making visits as convenient as possible for participants will pay off in
retention.

Training the Participating Site Staff on Follow-Up

Initial site-staff training, and retraining at staff annual meetings, should include a
review of the trial’s visit schedule and the trial’s retention plan as well as training in
methods known to facilitate retention. The expectation should be that each
participant will attend all visits, with a recognition that of course some participants
will miss some visits. When a participant is randomized, the DCC should provide the
site with the participant’s appointment schedule showing both the target date for each
visit (e.g., the target date for the 12-month visit is 12 months postrandomization) and
the study-specified visit window (e.g., plus or minus 1 week of the target date). It is
also helpful to have a master schedule with the start of visit windows and target
appointment dates for each participating site.
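Generating an appointment schedule from the randomization date is a straightforward date calculation. A minimal sketch, assuming monthly target visits and a ±7-day window; the visit list and window width are hypothetical and would come from the protocol:

```python
from datetime import date, timedelta

def appointment_schedule(randomization_date, visit_months, window_days=7):
    """Target date and visit window for each protocol visit,
    anchored to the randomization date."""
    schedule = []
    for m in visit_months:
        target = randomization_date + timedelta(days=round(m * 365.25 / 12))
        schedule.append({
            "visit": f"Month {m}",
            "window_open": target - timedelta(days=window_days),
            "target": target,
            "window_close": target + timedelta(days=window_days),
        })
    return schedule

for visit in appointment_schedule(date(2024, 1, 15), [3, 6, 12, 24]):
    print(visit["visit"], visit["window_open"], visit["target"],
          visit["window_close"])
```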
Methods to facilitate retention that should be covered at training include, for example:

• Procedures should be established such that prior to randomization, participating
site teams discuss each eligible participant’s suitability for a trial. During
recruitment, sites may feel considerable pressure to enroll more participants.
Care must be taken that the site has thought about whether each participant they
consider is likely to comply with treatment and to attend required visits/to
provide required data.
• Participating site team members who recruit participants should try to collect, at
study start, data on alternative ways to contact a participant should he or she
become unreachable (e.g., changes their phone number).
• The study coordinator should set up systems for visit reminders that are appro-
priate for the participant. Automated text messaging is a good way to reach
smartphone users. Automated or coordinator-initiated phone calls may be better
for those who do not routinely text.
• The participating site team members should engage with the participant in ways
that make them feel valued. In studies where participants are not paid, the
participants may appreciate birthday cards. In long-term studies, sites might
consider annual incentives such as grocery store vouchers or small gifts such as
fleece blankets or tote bags. Thank you notes signed by the study team make a
participant feel appreciated at little cost to the study and should be considered in
cases where tangible incentives are not funded or are not permitted. Sometimes,
smaller rewards are provided for questionnaires or short visits and larger rewards
are provided for longer visits or for a study’s final visit.

In a long-term study in which patients may become burned out over time, training
in retention should include prioritization of outcome measures. The study will
always accept whatever data a not-fully-adherent participant is willing to provide,
and obtaining the primary outcome measure for each patient will always be top
priority. However, if there are multiple secondary measures, the study leadership
should provide guidance to the sites on the relative importance of each planned
measure or the value of a surrogate for the primary outcome if the primary outcome
is not available. Such prioritization will help the site team negotiate with patients
who reach a point in a long-term study where they will no longer comply with all of
the study requirements.
As an aside, trial leadership should also ensure that study coordinators and other
participating site personnel feel valued and appreciated.

Retention Monitoring

The DCC should design its retention monitoring feedback at the start of the study, provide an example retention table at training, and review how to read it. It may be helpful for the feedback to include missingness for a single visit and a summary of those who have missed their last two (or more) visits. The first is noteworthy but may be easily explained by the participant's circumstances (e.g., they were on vacation or hospitalized for much of the visit window, but the participating site team expects them back for their next visit). The second, a participant who has missed the past two or more visits, flags someone at risk of becoming lost to follow-up.
It is helpful to report on missed visits both by visit (month 6, month 12) and overall (across all visits). An example of a table monitoring missed visits by visit and by site is shown in Table 4. The first column lists the participating sites. The second column shows how many randomized participants would have been expected to have that visit, i.e., the number of randomized participants who have been in follow-up through the end of that visit window as of a data entry lag time such as 2 weeks before the report is run. The third column shows the number and percent of expected visits held. The fourth column shows the number of visits known to have been missed, based on site reporting; it is useful to have an item at the beginning of a study's visit form on the status of the visit (held or missed) to document cases where the visit window is past and the form is not pending because the site confirms that the visit was not held. A fifth column documents cases where the site has submitted a form for that visit but the visit was held so far outside the window that it is unlikely to be used in analyses, e.g., a visit intended for month 3 held at the beginning of the month 6 target window. The sixth column shows counts of participants with an unknown visit status, flagging cases where the visit form has not yet been submitted. This table is appropriate for the study-wide weekly report; site personnel will also need the details for columns 4-6, so they are reminded of which participants missed the visit (column 4), know the IDs of those whose visit was held so far outside the window that it cannot be used as data for that visit (column 5), and know who still has a pending visit form (column 6). If a study's visit windows are strict and the report shows a high proportion of visits as missed, it is useful for the weekly report to include two versions of these tables: one showing visits missing under the study's strict visit window limits and one showing visits missing under a broader window, indicating that the data are close enough to the target date to be used for some statistical analyses.
Table 4 3-month visit held and missed

| Participating site | Randomized 3 months ago plus 14-day lag time for data entry | Number with 3-month visit form documenting visit held | Number with 3-month visit form documenting visit missed | Number with 3-month visit form showing visit held outside window | Number with 3-month visit form not yet submitted |
|---|---|---|---|---|---|
| 1 California | 19 | 16 (84.2%) | 1 | 1 | 1 |
| 2 Colorado | 52 | 49 (94.2%) | 2 | 0 | 1 |
| 3 Connecticut | 18 | 16 (88.9%) | 1 | 0 | 1 |
| 4 Delaware | 41 | 38 (92.7%) | 1 | 0 | 2 |
| 5 Florida | 49 | 45 (91.8%) | 1 | 2 | 1 |
| 6 Georgia | 18 | 16 (88.9%) | 1 | 1 | 0 |
| 7 Illinois | 7 | 7 (100%) | 0 | 0 | 0 |
| Total | 204 | 187 (91.7%) | 7 | 4 | 6 |
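A minimal sketch of how the tallies in Table 4 might be computed, assuming each expected participant's visit form carries exactly one status ("held", "missed", "outside_window", or "pending"); the records below are invented for illustration.

```python
from collections import Counter

# Hypothetical 3-month visit-form records: (site, visit status).
records = [
    ("Colorado", "held"), ("Colorado", "held"), ("Colorado", "missed"),
    ("Illinois", "held"), ("Colorado", "pending"),
]

def missed_visit_report(records):
    """Tally visit status by site, mirroring columns 2-6 of Table 4."""
    by_site = {}
    for site, status in records:
        by_site.setdefault(site, Counter())[status] += 1
    for site, counts in sorted(by_site.items()):
        expected = sum(counts.values())  # one status per expected participant
        held = counts["held"]
        print(f"{site}: expected={expected}, held={held} ({100 * held / expected:.1f}%), "
              f"missed={counts['missed']}, outside window={counts['outside_window']}, "
              f"pending={counts['pending']}")

missed_visit_report(records)
```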

Early detection of participants at risk of becoming dropouts (randomized participants who have stopped attending study visits) is critical. It is helpful to report on those who have missed their last two visits. Table 5 shows an example tallying this by site, identifying which two visits were missed. The IDs of those who have missed the last two visits should be provided to each site. Participants will "fall off" Table 5 and the listing when they resume visit attendance. Participants who have died (or been censored) are reported separately rather than in these tables, which are focused on dropouts.

Table 5 Participants missing two most recent protocol visits

| Participating site | Randomized >73 days ago | Missing F1 and F2 visits | Randomized >104 days ago | Missing F2 and F3 visits | Randomized >196 days ago | Missing F3 and F6 visits |
|---|---|---|---|---|---|---|
| 1 California | 19 | 0 | 19 | 0 | 19 | 0 |
| 2 Colorado | 47 | 0 | 44 | 0 | 37 | 0 |
| 3 Connecticut | 16 | 0 | 16 | 0 | 15 | 1 |
| 4 Delaware | 33 | 0 | 32 | 1 | 22 | 1 |
| 5 Florida | 46 | 0 | 42 | 0 | 41 | 1 |
| 6 Georgia | 15 | 0 | 13 | 0 | 13 | 0 |
| 7 Illinois | 6 | 0 | 6 | 0 | 5 | 0 |
| Total | 182 | 0 | 172 | 1 | 152 | 3 |
It is helpful to have participating site personnel investigate and provide an explanation of why a participant has missed the last two visits. The process used to get this information from the site ensures that the site realizes that the participant has missed two visits and requires the site team to investigate. This can help detect cases where a participant has moved, had an extended hospitalization or rehabilitation stay, or died, and it focuses the site on retention at the individual participant level.
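A minimal sketch of the two-consecutive-missed-visits flag behind Table 5, assuming each participant's expected visits are recorded in order with an attendance indicator; the IDs and visit names are hypothetical.

```python
def at_risk_participants(visit_history):
    """Flag participants whose two most recent expected visits were both missed."""
    flagged = []
    for pid, visits in visit_history.items():
        # visits: chronologically ordered (visit_name, attended) pairs
        if len(visits) >= 2 and not visits[-1][1] and not visits[-2][1]:
            flagged.append((pid, visits[-2][0], visits[-1][0]))
    return flagged

history = {
    "CT-1004": [("F1", True), ("F2", False), ("F3", False)],  # at risk
    "CO-1010": [("F1", True), ("F2", True), ("F3", False)],   # only one missed visit
}
print(at_risk_participants(history))  # [('CT-1004', 'F2', 'F3')]
```

Participants who resume attendance, die, or are censored would be filtered out before this flag is applied, mirroring the reporting rules described above.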

Factors Related to Predicting Retention

Sites should publish the efforts they take to enhance retention. Retention methods are more likely than recruitment- and adherence-enhancement methods to be applicable to other trials in other populations or other disease areas (Fewtrell et al. 2008).

When reporting on retention, if the number of patients available through the end of a trial for full analysis of study safety and other outcomes (treatment adherence, quality of life) differs from the number of patients available for the trial's primary outcome, both should be reported, as was done in the BID Trial (Miskulin et al. 2018); in that study, fewer patients had data for the primary outcome of change in left ventricular (LV) mass than were available for other study measurements because of difficulties with scheduling and measurement of the baseline and month 12 MRIs. In studies with mortality outcomes, trialists may be able to capture the primary outcome in more patients than they can evaluate for other outcomes.

The Role of the Study Team in Promoting Retention

Each member of the study team has a role in promoting retention. At the participating site, study coordinators should engage the patient. As noted, the study coordinators and site investigators could set up a system such that reimbursement for expenses such as parking, along with payments, gift cards, and other incentives, is provided to the patient. The data coordinating center should provide retention feedback and facilitate discussion of retention on study coordinator calls and at Steering Committee meetings, and the study DSMB should highlight retention issues and emphasize the importance of retention in its recommendations back to the Steering Committee.

Interrelationship Between Treatment Discontinuation and Dropouts

The challenges in getting the participant to comply with their study treatment
(section “Administration of Study Treatments”) and getting the participant to attend
study visits and provide data (section “Participant Follow-Up”) are related. These are
directly related on a patient level in that a patient who misses visits may also be
noncompliant with treatment. Difficulties with these two may go hand in hand at the
study or site level as well. A trial or a participating site in a trial that is having
significant trouble with adherence may also have difficulties in retention and vice
versa. Of course, it is hoped that those who discontinue treatment remain available
for follow-up and obtaining the primary outcome variable, but studies with a higher number of patients who do not comply with treatment may also have more patients who stop attending visits and providing follow-up data. Site personnel should be reminded often that if a patient says they will no longer follow study treatment per protocol, they should be encouraged to follow any level of treatment. If the patient will no longer accept any study treatment, they should be counseled on the importance of continuing to provide follow-up data. As noted, it may be helpful for the study leadership to provide a prioritized list so patients who are reluctant to provide full data can be asked to provide as much as possible, in order of importance. Every study should follow up on dropouts in any way possible, even if all that can
be done is to check for vital status at the end of the study. Unless a patient has
withdrawn consent and refuses to allow the study to capture any information, it is
likely that at least some data will be available on most patients who drop out, and, as
noted, patients who stop attending visits may be quite willing to allow for passive
follow-up whereby their local medical charts are used to provide information on
blood pressure, lab measures, and hospitalizations, for example. The DCC should take care to report on these two types of protocol nonadherence as separate issues, so the study leadership and DSMB can consider why participants are not following treatment or why they have stopped attending visits. It should be clear which patients who have discontinued treatment remain available for continued follow-up of the primary outcome variable (and have the potential to return to adherence) and which patients are no longer willing to attend visits.

Summary and Conclusion

All of the steps in study design and organization leading up to the initiation of a
study are critical, yet once trial randomization begins, achieving the study’s planned
recruitment, adherence, and retention and accurately capturing data on these factors
are key to meaningful study results. Training sessions, meetings, and conference
calls of the Steering Committee, subcommittees, and study coordinators should
include agenda items focused on these areas of trial conduct. Treatment administra-
tion, treatment adherence, and participant follow-up data should be shared using
metrics and should be captured in real time, with the Data Coordinating Center
providing continuous performance monitoring of these data to study leadership and
to the participating sites themselves to optimize performance for valid study conduct
and complete and accurate data capture.

Key Facts

• Study coordinators and site investigators should be trained that it is important to obtain the primary outcome variable for intention-to-treat analyses in all patients, whether they adhere to treatment or not.
• The schedule for study visits and the system for collection of data should not be overly burdensome for patients, and sites should engage patients and provide incentives and rewards as permitted.
• If a participant drops out, he or she may still be agreeable to passive follow-up
through their chart; the trial leadership should consider which data would be
useful to capture for a patient who will no longer attend visits.
• The DCC should provide performance data on adherence, treatment discontinu-
ation, missed visits, and dropouts. These reports should be continuously updated
and should show data site by site.
• Categorization of reasons why patients discontinued treatment or dropped out
should be captured in real time and included in reports.

Cross-References

▶ Adherence Adjusted Estimates in Randomized Clinical Trials
▶ Intention to Treat and Alternative Approaches
▶ Missing Data
▶ Procurement and Distribution of Study Medicines

References
Ahmed I, Ahmad NS, Ali S, George A, Saleem-Danish H, Uppal E, Soo J, Mobasheri MH, King D,
Cox B, Darzi A (2018) Medication adherence apps: review and content analysis. JMIR Mhealth
Uhealth 6(3):e62. https://doi.org/10.2196/mhealth.6432
Booker C, Harding S, Benzeval M (2011) A systematic review of the effect of retention methods in
population-based cohort studies. BMC Public Health 11:249. https://doi.org/10.1186/1471-
2458-11-249
Brueton VC, Tierney JF, Stenning S, Meredith S, Harding S, Nazareth I, Rait G (2014) Strategies to
improve retention in randomised trials: a Cochrane systematic review and meta-analysis. BMJ
Open 4(2):e003821. https://doi.org/10.1136/bmjopen-2013-003821
Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware
JH (1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80. https://doi.org/10.1056/NEJM197607082950204
Dayer L, Heldenbrand S, Anderson P, Gubbins PO, Martin BC (2013) Smartphone medication
adherence apps: potential benefits to patients and providers. J Am Pharm Assoc (JAPhA) 53
(2):172–181. https://doi.org/10.1111/j.1547-5069.2002.00047.x
Dunbar-Jacob J, Rohay JM (2016) Predictors of medication adherence: fact or artifact. J Behav Med
39(6):957–968. https://doi.org/10.1007/s10865-016-9752-8
Farmer KC (1999) Methods for measuring and monitoring medication regimen adherence in
clinical trials and clinical practice. Clin Ther 21(6):1074–1090. https://doi.org/10.1016/
S0149-2918(99)80026-5
Ferris M, Norwood V, Radeva M, Gassman JJ, Al-Uzri A, Askenazi D, Matoo T, Pinsk M, Sharma
A, Smoyer W, Stults J, Vyas S, Weiss R, Gipson D, Kaskel F, Friedman A, Moxey-Mims M,
Trachtman H (2013) Patient recruitment into a multicenter randomized clinical trial for kidney
disease: report of the focal segmental glomerulosclerosis clinical trial (FSGS CT). Clin Transl
Sci 6(1):13–20. https://doi.org/10.1111/cts.12003
Fewtrell MS, Kennedy K, Singhal A, Martin RM, Ness A, Hadders-Algra M, Koletzko B, Lucas A
(2008) How much loss to follow-up is acceptable in long-term randomised trials and prospective
studies? Arch Dis Child 93(6):458–461. https://doi.org/10.1136/adc.2007.127316
FHN Trial Group, Chertow GM, Levin NW, Beck GJ, Depner TA, Eggers PW, Gassman JJ,
Gorodetskaya I, Greene T, James S, Larive B, Lindsay RM, Mehta RL, Miller B, Ornt DB,
Rajagopalan S, Rastogi A, Rocco MV, Schiller B, Sergeyeva O, Schulman G, Ting GO, Unruh
ML, Star RA, Kliger AS (2010) In-center hemodialysis six times per week versus three times per
week. N Engl J Med 363:2287–2300. https://doi.org/10.1056/NEJMoa1001593
Garber M, Nau D, Erickson S, Aikens J, Lawrence J (2004) The concordance of self-report with
other measures of medication adherence: a summary of the literature. Med Care 42(7):649–652.
https://doi.org/10.1097/01.mlr.0000129496.05898.02
Gassman J, Agodoa L, Bakris G, Beck G, Douglas J, Greene T, Jamerson K, Kutner M, Lewis J,
Randall OS, Wang S, Wright JT, the AASK Study Group (2003) Design and statistical aspects of
the African American Study of Kidney Disease and Hypertension (AASK). J Am Soc Nephrol
14:S154–S165. https://doi.org/10.1097/01.ASN.0000070080.21680.CB

Gipson DS, Trachtman H, Kaskel FJ, Greene TH, Radeva MK, Gassman JJ, Moxey-Mims MM,
Hogg RJ, Watkins SL, Fine RN, Hogan SL, Middleton JP, Vehaskari VM, Flynn PA, Powell
LM, Vento SM, McMahan JL, Siegel N, D’Agati VD, Friedman AL (2011) Clinical trial of focal
segmental glomerulosclerosis (FSGS) in children and young adults. Kidney Int 80(8):868–878.
https://doi.org/10.1038/ki.2011.195
Isakova T, Ix JH, Sprague SM, Raphael KL, Fried L, Gassman JJ, Raj D, Cheung AK, Kusek JW,
Flessner MF, Wolf M, Block GA (2015) Rationale and approaches to phosphate and fibroblast
growth factor 23 reduction in CKD. J Am Soc Nephrol 26(10):2328–2339. https://doi.org/10.
1681/ASN.2015020117
Ix JH, Isakova T, Larive B, Raphael KL, Raj D, Cheung AK, Sprague SM, Fried L, Gassman JJ,
Middleton J, Flessner MF, Block GA, Wolf M (2019) Effects of nicotinamide and
lanthanum carbonate on serum phosphate and fibroblast growth factor-23 in chronic kidney
disease: The COMBINE trial. J Am Soc Nephrol 30(6):1096–1108. https://doi.org/10.1681/
ASN.2018101058
Kravitz RL, Hays RD, Sherbourne CD (1993) Recall of recommendations and adherence to advice
among patients with chronic medical conditions. Arch Intern Med 153(16):1869–1878. https://
doi.org/10.1001/archinte.1993.00410160029002
Meinert CL (2012) Clinical trials: design, conduct and analysis, 2nd edn. Oxford University Press,
New York
Miskulin DC, Gassman J, Schrader R, Gul A, Jhamb M, Ploth DW, Negrea L, Kwong RY, Levey AS,
Singh AK, Harford A, Paine S, Kendrick C, Rahman M, Zager P (2018) BP in dialysis: results of a
pilot study. J Am Soc Nephrol 29(1):307–316. https://doi.org/10.1681/ASN.2017020135
Morawski K, Ghazinouri R, Krumme A et al (2018) Association of a smartphone application with
medication adherence and blood pressure control: the MedISAFE-BP randomized clinical trial.
JAMA Intern Med 178(6):802–809. https://doi.org/10.1001/jamainternmed.2018.0447
Osterberg L, Blaschke T (2005) Adherence to medication. N Engl J Med 353:487–497. https://doi.org/10.1056/NEJMra050100
Piantadosi S (2017) Clinical trials: a methodologic perspective. Wiley series in probability and statistics, 3rd edn. Wiley, New York
Raphael KL, Isakova T, Ix JH, Raj DS, Wolf M, Fried LF, Gassman JJ, Kendrick C, Larive B, Flessner MF, Mendley SR, Hostetter TH, Block GA, Li P, Middleton JP, Sprague SM, Wesson DE, Cheung AK (2020) A randomized trial comparing the safety, adherence, and pharmacodynamics profiles of two doses of sodium bicarbonate in CKD: the BASE pilot trial. J Am Soc Nephrol 31(1):161–174. https://doi.org/10.1681/ASN.2019030287
Rocco MV, Lockridge RS, Beck GJ, Eggers PW, Gassman JJ, Greene T, Larive B, Chan CT,
Chertow GM, Copland M, Hoy C, Lindsay RM, Levin NW, Ornt DB, Pierratos A, Pipkin M,
Rajagopalan S, Stokes JB, Unruh ML, Star RA, Kliger AS, the FHN Trial Group (2011) The
effects of nocturnal home hemodialysis: the frequent hemodialysis network nocturnal trial.
Kidney Int 80:1080–1091. https://doi.org/10.1038/ki.2011.213
Sackett DL, Richardson WS, Rosenberg W (1997) Evidence-based medicine: how to practice and
teach EBM. Churchill Livingstone, New York
Santo K, Richtering SS, Chalmers J, Thiagalingam A, Chow CK, Redfern J (2016) Mobile phone
apps to improve medication adherence: a systematic stepwise process to identify high-quality
apps. JMIR Mhealth Uhealth 4(4):e132. https://doi.org/10.2196/mhealth.6742
Schwed A, Fallab C-L, Burnier M, Waeber B, Kappenberger L, Burnand B, Darioli R (1999)
Electronic monitoring of adherence to lipid-lowering therapy in clinical practice. J Clin
Pharmacol 39(4):402–409. https://doi.org/10.1177/00912709922007976
Stewart M (1987) The validity of an interview to assess a patient’s drug taking. Am J Prev Med
3:95–100
Trachtman H, Vento S, Gipson D, Wickman L, Gassman J, Joy M, Savin V, Somers M, Pinsk M,
Greene T (2011) Novel therapies for resistant focal segmental glomerulosclerosis (FONT) phase
II clinical trial: study design. BMC Nephrol 12:8. https://doi.org/10.1186/1471-2369-12-8

Weinberg JM, Appel LJ, Bakris G, Gassman JJ, Greene T, Kendrick CA, Wang X, Lash J, Lewis JA, Pogue V, Thornley-Brown D, Phillips RA, African American Study of Kidney Disease and Hypertension Collaborative Research Group (2009) Risk of hyperkalemia in nondiabetic patients with chronic kidney disease receiving antihypertensive therapy. Arch Intern Med 169(17):1587–1594. https://doi.org/10.1001/archinternmed.2009.284
Zullig LL, Mendys P, Bosworth HB (2017) Medication adherence: a practical measurement selection guide using case studies. Patient Educ Couns 100(7):1410–1414. https://doi.org/10.1016/j.pec.2017.02.001
17 Data Capture, Data Management, and Quality Control; Single Versus Multicenter Trials

Kristin Knust, Lauren Yesko, Ashley Case, and Kate Bickett

Contents
Introduction
Data Capture
  Data Management Life Cycle
  Data Capture Methods
  Case Report Form (CRF) Development
Data Management and Quality Control
  Risk-Based Monitoring in Data Management
  Data Quality Control Tools
  Data Review
  Data Management Plan/Data Validation Plan
  CRF Completion Guidelines
  System User/Quick Reference Guides
  Training Site Staff
  Site and Sponsor Communication
  Data Management in Single Versus Multicenter Trials
  Future Data Management Considerations
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Data capture, data management, and quality control processes are instrumental to the
conduct of clinical trials. Obtaining quality data requires numerous considerations
throughout the life cycle of the trial. Case report form design and data capture
methodology are crucial components that ensure data are collected in a streamlined
and accurate manner. Robust data quality and validation strategies must be employed early on in data collection to identify potential systemic errors. Data management guidance documents provide an opportunity to set clear expectations for stakeholders
and establish communication pathways. These tools need to be supplemented with
adequate training and ongoing support of trial staff. Trials may be conducted in a
single or multicenter setting, which has implications for data management. Risk-
based monitoring is one approach that can help data managers target quality issues in
a multicenter setting. Evolving technologies such as electronic medical record and
electronic data capture system integration, artificial intelligence, and big data analyt-
ics are changing the landscape of data capture and management.

Keywords
Data management · Data collection · Data quality · Multicenter trial · Case report
form (CRF) · Risk-based monitoring (RBM) · Data review · Electronic medical
record (EMR) · Electronic data capture (EDC)

Introduction

One of the key components to a successful clinical trial is a strong foundation of quality
data. Data management and quality control measures ensure the accuracy and reliabil-
ity of the database used in analyses, which are imperative to the outcome of a clinical
trial. The goal of a data management program is to produce a clean dataset containing
no data entry errors or unexplained missing data points and assure that all the necessary
data to analyze the trial endpoints are collected consistently. The consequences of a
poorly designed or improperly implemented data management and quality control
program are manifested in additional burdens of time, resources, trial costs, and,
perhaps most importantly, in a failure to produce an accurate database for analysis.
Some important considerations for designing a data management program
include the method of data capture, design of case report forms and edit checks,
implementation of data management reporting tools to assess data quality, setting
clear expectations based on trial objectives and sponsor/investigator goals, develop-
ment of training and reference tools for trial stakeholders, and strategies for
conducting data review. In addition, the context of the trial must be considered,
such as whether it is conducted in a single center, multicenter, or network setting and
the platform, format, and methods for sharing trial data.

Data Capture

Data Management Life Cycle

Data management activities span the life cycle of a clinical trial. Ideally, data
management teams are integrated into the protocol development phase, during
which case report form (CRF) development and database design may begin. Review
and input from data managers (DMs) during this crucial stage may help identify and
reduce extraneous data collection and anticipate potential data management
challenges.
During implementation, data management documents are developed in conjunc-
tion with other trial management processes, and site training materials should be
developed. Data managers may be involved in the creation of system or CRF user
guides, in addition to establishing data management and data validation plans to
inform the collection and management of data throughout the trial.
At the time of activation and accrual, data management includes implementation
of data collection tools and early monitoring of data to identify potential trends or
issues. During trial maintenance, data management and quality control activities are
ongoing, and data managers typically utilize tools to detect anomalies, resolve
queries, retrieve missing data, and ensure the integrity of the trial database.
In preparation for trial analysis, data quality and cleaning activities may become
more targeted or focused on trial endpoint data to ensure the analysis can be
completed. During trial closure or in preparation for data lock, all remaining data
quality items are resolved or documented.

Data Capture Methods

The method of data capture and data storage should be considered during the design
of the trial and prior to the development of CRFs to ensure that information is
efficiently collected. Potential data capture methods include traditional paper-based
data collection, electronic data capture (EDC), electronic health record (EHR)
integration, and external data transfer/upload or any combination of these methods.
While clinical trials historically used paper CRFs, there has been an increasing trend
toward digital integration due to the enhanced quality control and real-time commu-
nication features available. EDC offers centralized data storage which will speed
analysis and distribution of results at the end of a trial. Additionally, the availability
of tablets and other portable devices has made this a cost-effective and practical
option.
An EDC system has become the gold standard for use in clinical trials, where site
staff or participants enter the data directly into the system, staff collect data on paper
CRFs and then enter data in the system, or data is transferred through an upload into
the system. Many EDC systems contain additional tools for managing data quality,
including features for real-time front-end data validation, shipment and specimen
tracking, transmission of non-form or imaging data (e.g., blood/tissue samples, X-
rays), report management and live data tracking, query resolution, adjudication of
trial outcomes, scheduling for participant visits, and other trial management tools
(Reboussin and Espeland 2005). Having a direct data entry system reduces the
potential for transcription errors when the data are recorded in EDC and provides
the ability to perform real-time data checks such as those for values outside the
expected ranges (e.g., a weight of 950 pounds). As worldwide availability of internet
capabilities and prevalence of EDC systems have expanded over the years,
traditional paper CRF data entry has diminished. In circumstances where EDC is the
preferred method of data capture but internet access is unreliable or limited, offline
data entry may be used to collect data and then transmit once an internet connection
is established.
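As a concrete illustration of front-end validation, here is a minimal sketch of the kind of range check an EDC system might run at entry time; the field name and limits are hypothetical.

```python
from typing import Optional

# Hypothetical plausibility limits for a weight field, in pounds.
WEIGHT_RANGE_LB = (50, 700)

def validate_weight(value_lb: float) -> Optional[str]:
    """Return a validation message if the entered weight is outside the range."""
    low, high = WEIGHT_RANGE_LB
    if not low <= value_lb <= high:
        return (f"Weight of {value_lb} lb is outside the expected range "
                f"({low}-{high} lb); please verify before saving.")
    return None  # value accepted

print(validate_weight(950))  # flags the implausible value at data entry
print(validate_weight(165))  # None: passes the check
```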
The technological enhancements available through EDC systems offer a significant advantage for multicenter trials in that they provide a mechanism for DMs to view the data across all sites in real time and perform quality control checks such as identifying missing values, performing contemporaneous data audits, or issuing queries. Specifically, they allow the DM to review current data across all sites to identify trends or potential process issues earlier in the data collection stream.
A variety of commercial EDC applications are available for use. Some are
available as “off the shelf” software and are free of charge, although features may
be limited. Other proprietary systems are available and may be customizable to the
needs of a client. Many factors go into choosing the appropriate application,
including the complexity of the trial and the number of participants and sites. It is often more appropriate for a multicenter trial to choose a customizable EDC system, as it offers additional flexibility and reliability in data collection, storage, review, retrieval, validation, analysis, and reporting as needed. A small, less complex, single-site, or limited-resource trial may use a free commercial off-the-shelf (COTS) solution, which still offers some of the important features such as front-end validation but may not have the capability for more complex reporting or customizations.
The design of the database will depend on the specific method of data capture and
encompasses a wide range of activities. However, it is important to evaluate whether the database structure is comprehensive enough to ensure all trial objectives are met while minimizing the amount of extraneous data that may be captured. The
volume and complexity of data collected for a given trial should be weighed against
the relative utility of the information. If the data point being collected is not essential
to the outcome of the trial, consider the cost of the data collection burden to site staff,
in addition to the cost of cleaning the data, before including it.

Case Report Form (CRF) Development

A CRF, also sometimes known as a data collection form, is designed to collect the
participant data in a clinical trial. The International Conference on Harmonization
Guideline for Good Clinical Practice defines the CRF as "a printed, optical or electronic document designed to record all of the protocol-required information to be reported to the sponsor on each trial participant" (ICH 2018). When implemented
in an EDC system, CRFs may be referred to as data entry screens or electronic CRFs
(eCRFs).
The thoughtful design of CRFs is fundamental to the success of the trial. Many
challenges in data management result from poor CRF design or implementation.
Designing a format for CRFs or data entry screens is important, and the basic
considerations are the same with paper or eCRFs. CRF development is ideally
performed concurrently with protocol development to ensure that trial endpoints are
captured and will yield analyzable data. Ideally, the statistical design section of the
protocol (or statistical analysis plan) will be consulted to map all data points to the
analysis to confirm the data required will be available.
At a minimum, CRFs should be designed to collect data for analysis of primary
and secondary outcomes and safety endpoints and verify or document inclusion and
exclusion criteria. When developing a schedule of assessments (i.e., visit schedule),
the feasibility of data collection time-points should be evaluated. The schedule
should include all critical time-points, while ensuring that the frequency of visits
and anticipated participant burden is considered. Furthermore, the impact of data
collection on site staff should be assessed. When possible, soliciting input on form
content from individuals responsible for entering data may identify problematic
questions and clarify expectations.
While the content is crucial for analysis, structure and setup of the CRF is vital to
collecting quality data. There are no universal best practices for form development,
although the Clinical Data Interchange Standards Consortium (CDISC) has made
significant progress toward creating tools and guidelines (Richesson and Nadkarni
2011). CDISC has also implemented the Clinical Data Acquisition Standards Har-
monization (CDASH) project, which utilizes common data elements and standard
code lists for different therapeutic areas to collect data in a standardized approach
(Gaddale 2015).
The primary objective of CRF design is to gather complete and accurate data.
This is achieved by avoiding duplication of data elements and facilitating transcrip-
tion of data from source documents onto the CRF. Ideally, it should be well
structured, easy to complete without much assistance, and should collect data of
the highest quality (Nahm et al. 2011).
Some basic principles in CRF development and design include the following:

• Identify the intended audience and data entry method (e.g., trial staff direct data
entry or electronic participant reported outcomes/ePRO) and style of the CRF
(interview, procedural, or retrospective). This will determine the question format
and reading level of the question.
• Standardize CRF design to address the needs of all users such as investigator, site
coordinator, trial monitor, data entry personnel, medical coder, and statistician
(Nahm et al. 2011). Review by all affected parties before finalization confirms
usability and ensures complete data elements.
• Organize data in a format that facilitates and simplifies data analysis (Nahm et al.
2011).
• Keep questions, prompts, and instructions clear and concise to assure that data
collection is consistent across all participants at various sites.
• Group related fields and questions together.
• Use consistent language across different CRFs in the same protocol and across
protocols. Avoid asking the same question in different ways (e.g., a field for “date
of birth” as well as a field for “participant age”) as data provided for these fields
may be inconsistent and creates additional work for the clinical sites.
• Avoid capturing the same information more than once (duplication).
• Include clear and concise instructions regarding skip patterns and the character-
istics of the expected data.
• Make the CRF easy to follow. Avoid information clutter.
• Version control all CRFs.
• Use consistent formatting across all CRFs.
• Specify the unit of measurement, including decimal places.
• Use standard date format throughout the CRFs.
• Provide CRFs with coded field responses (e.g., “Yes,” “No” checkboxes) rather
than open text fields when possible to aid in analysis. Codes should be consistent
throughout a CRF set (e.g., “Yes” should always be coded as “1”).
• Avoid the use of negatively phrased questions. For example, instead of “Did the
participant fail to sign the informed consent document?” use “Did the participant
sign the informed consent document?”
• Avoid phrasing statements in such a way that leads the participant to a specific
response.
• Avoid collecting extraneous or excessive data that are not described in the
protocol, as this may distract sites from considering only the data related to trial
outcomes.
• Consider the flow of the clinical setting in which data will be collected (e.g., vital
sign fields may not be appropriate on a urinalysis CRF).
• Avoid collection of derived data on the CRF to minimize calculation errors. For example, age can be calculated from date of birth, and body mass index can be calculated from the participant's height and weight; only the latter two should be captured (Nahm et al. 2011). A sketch of such derivations follows this list.
• Avoid creating fields designed to collect information in which a participant’s
identity can be determined in trial data (e.g., name, initials, phone number,
address). In such instances where this information is integral to the trial objec-
tives, appropriate security and privacy measures per regulatory guidelines must
be implemented.
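To illustrate the derived-data bullet above, here is a minimal sketch that computes age and body mass index from the collected fields rather than collecting them on the CRF; the function names are hypothetical.

```python
from datetime import date

def age_at_visit(date_of_birth: date, visit_date: date) -> int:
    """Age in whole years at the visit, derived from the collected date of birth."""
    years = visit_date.year - date_of_birth.year
    if (visit_date.month, visit_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1  # birthday not yet reached this year
    return years

def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index derived from the two collected fields."""
    return round(weight_kg / height_m ** 2, 1)

print(age_at_visit(date(1980, 6, 1), date(2024, 5, 31)))  # 43
print(bmi(70.0, 1.75))                                    # 22.9
```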

Data Management and Quality Control

Risk-Based Monitoring in Data Management

Once data collection begins, the management and quality control of data become the
primary focus. However, the volume and pace of data collection may require data
managers to target data quality efforts, particularly when a trial is conducted in a
multicenter or network setting.
In the Food and Drug Administration Guidance for Industry "Oversight of Clinical Investigations – A Risk-Based Approach to Monitoring" (August 2013), the agency acknowledges that some data have more impact on trial results; therefore, a risk-based approach can be used. These critical data points include informed
consent, eligibility for the trial, safety assessments, treatment adherence, and main-
tenance of the blind/masking (FDA 2013).
Risk-based monitoring (RBM) creates a framework for managing risks through
identification, classification, and appropriate mitigation to support improved partic-
ipant safety and data quality. Adopting a targeted RBM approach to data manage-
ment may be appropriate in some settings and can provide significant advantages,
including more efficient use of resources, without compromising the integrity of the
clinical trial. In this approach, a range of metrics known as key risk indicators (KRIs)
may be used in real time to identify areas of critical importance and are tracked to flag data that may need additional attention (at the participant, site, or trial level). KRIs may include protocol deviations, adverse events, missing values, missing CRFs, or other areas of concern. An example of a report for monitoring KRIs and identifying performance issues is shown in Fig. 1 below. In this table, values are programmatically compared to pre-determined standards and given a color code of green (indicating good performance), yellow (problem areas identified), or red (remedial action required). This allows for continuous monitoring in real time and increases the clinical team's responsiveness in identifying patterns or trends that may affect the risk assessment of a site or trial, as well as in quickly correcting and preventing further issues.
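A minimal sketch of that color-coding logic follows; the KRI names and thresholds are hypothetical, and in practice the limits would be pre-specified in the monitoring plan.

```python
def flag(value, yellow_at, red_at, higher_is_worse=True):
    """Map a KRI value to a green/yellow/red flag against pre-set limits."""
    if higher_is_worse:
        if value >= red_at:
            return "red"
        return "yellow" if value >= yellow_at else "green"
    # Lower values are worse, e.g., percent of recruitment target achieved.
    if value <= red_at:
        return "red"
    return "yellow" if value <= yellow_at else "green"

# Hypothetical limits: missing forms yellow at 1.0%, red at 2.5%;
# recruitment yellow below 80% of target, red below 60%.
print(flag(1.5, 1.0, 2.5))                      # yellow: missing forms creeping up
print(flag(63, 80, 60, higher_is_worse=False))  # yellow: recruitment at 63% of target
print(flag(0.0, 1.0, 2.5))                      # green
```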
A sample report provides a framework for the general timeline and major
milestones throughout the protocol life cycle, from the time of protocol approval
to the publication of the primary outcome manuscript. This high-level overview
compares data from each trial to the defined expectations of the sponsor to highlight
trial performance (i.e., column displaying initial proposed dates versus column for
actual milestone dates). A column for current projections shifts in relation to current
data such as number of participants enrolled. Significant deviations from this
timeline highlight performance issues and identify the need for additional monitor-
ing (Fig. 2).
While review of all KRIs is important in a risk-based approach, RBM has
significant implications for data management, as data quality metrics may be used
to identify higher-risk sites or data management trends that warrant more frequent
onsite monitoring.
These might include percent of missing CRFs, missing data fields, outstanding
data queries, availability of primary outcome, and number of protocol deviations. If
any of these metrics lie outside the normal range, more frequent review of data
should be performed to determine if there are additional issues.
In one example of how RBM may be used to identify issues, a DM noticed that a site had exceeded the KRI metric for missing CRFs. Upon further investigation of the site's existing CRFs, several other issues were identified. CRFs had been completed at incorrect visits, and the audit history showed that site staff had completed ePRO assessments that should have been entered directly into the ePRO system by the participant. The DM determined that site staff had been completing these CRFs on behalf of the participants, which was a protocol violation. By using KRIs to flag a high-risk site, the DM was able to identify and mitigate larger process issues, which could have had a significant impact during analysis.
Flags and Triggers: Overall and by Site
Monday, September 17, 2018 9:26 PM ET

| Site | Recruitment: overall | Recruitment: prior 3 months (1) | Missing forms | Audits | Regulatory issues | Primary outcome: overall | Primary outcome: prior 90 days (2) | Treatment exposure | Follow-up visit attendance |
|---|---|---|---|---|---|---|---|---|---|
| Site #1 | 90% | 78% | 0.0% | 0.13% | None | 66% | 73% | 66% | 71% |
| Site #2 | 100% | 180% | 0.0% | 0.49% | None | 61% | 58% | 70% | 72% |
| Site #3 | 70% | 63% | 0.1% | 0.60% | None | 65% | 78% | 73% | 61% |
| Site #4 | 84% | 42% | 0.0% | 0.29% | None | 41% | 50% | 64% | 59% |
| Site #5 | 100% | 69% | 0.2% | 0.31% | None | 79% | 71% | 71% | 70% |
| Site #6 | 90% | 115% | 0.0% | 0.50% | None | 73% | 89% | 77% | 82% |
| Site #7 | 60% | 43% | 1.5% | 0.27% | None | 85% | 91% | 95% | 81% |
| Site #8 | 150% | 100% | 0.0% | 0.93% | None | 71% | 78% | 60% | 45% |
| Overall | 88% | 79% | 0.5% | 0.45% | None | 69% | 72% | 71% | 68% |

(1) Updated on the 1st of the month to show percent of expected to actual randomizations over the previous 3 calendar months.
(2) Primary outcome availability for only the prior 90 days, calculated as the percentage of collected to expected UDS in the past 90 days.

Fig. 1 Example of report for monitoring key risk indicators

Study Number/Title
Basic Protocol Information and Timeline [Updated Monthly]
Tuesday, August 28, 2018 9:31 PM ET

| Milestone | Initial proposal (1) | Current projection (2) | Actual (3) |
|---|---|---|---|
| N (sample size) | 420 | 450 | 229 |
| Number of sites | 7 | | 8 |
| Number of nodes | 5 | | 5 |
| Concept approval date | | | 9/15/2015 |
| Protocol approval date | | | 3/16/2016 |
| Date first participant enrolled | | | 11/15/2016 |
| Publication plan submitted to Pub. Com. | 11/15/2017 | | 2/13/2018 |
| Date last participant enrolled | 9/11/2018 | 11/19/2018 | # |
| Date trial completed (last follow-up at last site) | 11/20/2018 | 1/28/2019 | # |
| Date of database lock | 1/21/2019 | 3/31/2019 | # |
| Date of final study report to sponsor | 5/23/2019 | 7/31/2019 | # |
| Date of submission of primary outcome paper | 7/24/2019 | 10/1/2019 | # |
| Date of acceptance of primary outcome paper | 9/24/2019 | 12/2/2019 | # |
| Date to data share | 7/23/2020 | 9/30/2020 | # |

# These cells will remain blank until actual occurrence
(1) "Initial proposal" column reflects the plan at the time of first randomization
(2) "Current projection" dates are based on the actual average randomization rate over the past 5 months
(3) "Actual" N (sample size) represents the total number of participants randomized as of last month, and "actual" number of sites includes both active sites and those closed for new enrollment

Fig. 2 Example of report for monitoring trial performance
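A minimal sketch of how the current-projection column for the enrollment milestone might be recomputed as data accrue, following the report's footnote about the recent average randomization rate; the numbers and the 30.44-day average month length are illustrative.

```python
from datetime import date, timedelta

def projected_last_enrollment(target_n: int, enrolled_n: int,
                              avg_per_month: float, as_of: date) -> date:
    """Project the last-enrollment date from the recent randomization rate."""
    months_remaining = (target_n - enrolled_n) / avg_per_month
    return as_of + timedelta(days=round(months_remaining * 30.44))

# E.g., 229 of 450 randomized, averaging 42 randomizations/month recently:
print(projected_last_enrollment(450, 229, 42.0, date(2018, 8, 28)))
```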

Data Quality Control Tools

In addition to using RBM, there are several different types of data management tools
that should be implemented to ensure the validity and integrity of data collected,
including a variety of reports and edit checks.
Reports are often developed to track trial metrics for data quality and assess
progress and can provide both real-time and summary information. These reports
can provide high-level data quality information to both the data management team
and site staff collecting the data. While the reports can be made available for review
at any time (e.g., via a website or EDC system), they should also be discussed or sent
to staff collecting the data points on a set schedule or at pre-determined time-points
so that site staff can address discrepancies identified and the data management team
can provide feedback and/or training for collection of data.
Reports used to track missing CRFs, missing data points, or numeric data entered
outside of an expected range (e.g., an unexpected date or a lab value that is not
compatible with life) should be implemented as standard tools to assist with data
management review (Baigent et al. 2008). These should be integrated within an EDC
system whenever possible to facilitate real-time review.
Edit checks (also called validation checks or queries) are another tool to look for
data discrepancies. These are employed as a systematic evaluation of the data
entered to flag potential issues and alert the user. Ideally, these checks should be
issued to site staff in real time or on a frequent basis, to facilitate timely resolution.
Data managers are instrumental in writing edit check messages, which should
include a clear description of the issue and the fields (variables) involved in the
check, the reason it was flagged as potentially inconsistent, and indicate the steps
required for resolution. An edit check program can look at a single data point within
a single assessment, multiple data points within a single assessment, or multiple data
points across multiple assessments. These checks should be run frequently (or on a
set schedule) to identify inconsistencies in the data or data that is in violation of the
protocol (Krishnankutt et al. 2012; Baigent et al. 2008).
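A minimal sketch of cross-field edit checks of this kind, with hypothetical field names and query wording; a production system would also record which fields each query references and track its resolution.

```python
from datetime import date

def edit_checks(form: dict) -> list:
    """Run cross-field checks on a submitted CRF; return queries for the site."""
    queries = []
    consent, visit = form.get("consent_date"), form.get("visit_date")
    if consent and visit and visit < consent:
        queries.append("Visit date precedes the informed consent date; please verify both dates.")
    if form.get("adverse_event") == "Yes" and not form.get("ae_onset_date"):
        queries.append("An adverse event is reported but the onset date is missing; please complete.")
    return queries

form = {"consent_date": date(2024, 3, 1), "visit_date": date(2024, 2, 15),
        "adverse_event": "Yes"}
for query in edit_checks(form):
    print(query)
```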
Edit checks are typically conceptualized during CRF design, as it is important to
identify potential areas for discrepant data and minimize duplicate or potentially
conflicting data collection. Implementing edit checks early on in active data collec-
tion phase will allow detection of trends that may warrant changes to data entry or
retraining of site staff. A high priority early in the trial is to develop edit checks to
query baseline assessments and enrollment data. These discrepancies may uncover
problems with CRFs, misunderstandings at clinical sites, or problems with the trial
protocol that are critical to address. During the conduct of any trial, new checks are
often identified and programmed due to protocol changes, CRF changes, findings
during a clinical monitoring visit, or anomalies noted in trial-related reports.

Data Review

Throughout the conduct of a trial, it is expected that there is ongoing review of data.
This may be conducted by different stakeholders, with the goal of monitoring and
evaluating data collected in a contemporaneous fashion to identify potential
concerns.
Initiating communication with site staff following the enrollment of the first
participant is a simple review strategy to increase the likelihood of accurate and timely
data entry. It is during the first participant enrollment that sites first enact the written
protocol and trial procedures at their institution. Despite discussions regarding imple-
mentation and training prior to activation, the first participant enrolled is often when
sites first experience any challenges with the integration of trial procedures into their
standard practice. This is a critical time for the data manager to be engaged with the
rest of the site team to be sure that any issues are resolved and that all necessary data
are captured. Follow-up contact with site staff in this timeframe provides an opportu-
nity to communicate parts of the enrollment process that went well and those that
would be helpful to adjust. Immediately following the first participant enrollment, the
site staff is more likely to recall issues and any missing or difficult information. This
feedback is extremely valuable to the data management team, particularly in multi-
center trials. The challenges encountered at one institution may be shared across
multiple sites. Discussing with sites allows for the trend to be identified and potentially
adjusted in real time so that other sites may avoid the same problems. Additionally,
touching base with the site at this early time-point provides the opportunity to
communicate reminders for upcoming assessments and trial requirements.
Another strategy for data review includes performing a data audit after a mile-
stone has been met, such as a percentage of accrual completed or a certain number of
participants reaching an endpoint. In this type of review, a subset of participants is
identified, and data cleaning procedures are performed to ensure all data submitted
through the desired time-point are complete and accurate. The subset of data may be
run through statistical programs or checks to verify the validity of the data collected
thus far. The goal of this type of review is to identify any systematic errors that may
be present. If any errors are identified, there is an opportunity to review the potential
impact and determine whether any changes to the CRFs or system are required.
Endpoint or data review committees may also be convened to provide an inde-
pendent assessment of trial endpoints or critical clinical or safety data. For some
trials where endpoints are particularly complex or subject to potential bias, an
independent review committee may provide additional assurance that trial results
are accurate and reliable. In the FDA’s Guidance for Industry: Clinical Trial End-
points for the Approval of Cancer Drugs and Biologics, it is noted that an indepen-
dent endpoint review committee (IRC) can minimize bias in the interpretation of
certain endpoints. If an endpoint committee is determined to be necessary for a trial,
a charter or guidance document should be in place prior to the start of the trial to
outline the data points that will be adjudicated by the committee, how the data will be
distributed for review, and when the data review will occur. In addition, the charter
should specify how “differences in interpretation and incorporation of clinical data in
the final interpretation of data and audit procedures” will be resolved (FDA 2018).
Depending on the duration of the trial, endpoint adjudication may occur on an
ongoing basis (e.g., as participants reach an endpoint that will be adjudicated) or may
be conducted at the end of the trial (e.g., once a predetermined number of partici-
pants reach an endpoint). The scope of the review is typically limited to the primary
or secondary endpoints of a trial but may include other clinically relevant data
points. A risk-based approach can also be taken for the endpoint review, with a
subset of cases reviewed and the committee adjourned if a certain concordance with
the reported data is met. For example, if independent review of an endpoint dem-
onstrates that committee review of data agrees with site-reported assessment in 95%
of cases, it may not be necessary to review data for every participant in the trial.
When this approach is used, the proposed concordance rate should be included in the
charter. To provide data for independent review, a data listing or similar format is
typically used to incorporate relevant information. Source documents (e.g., imaging,
clinical records) may be included as part of the review but must be appropriately de-
identified to protect participant information.
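A minimal sketch of the concordance computation behind such a stopping rule; the response categories and the 95% threshold are illustrative.

```python
def concordance_rate(site_reported, committee_adjudicated):
    """Fraction of adjudicated cases where the committee agrees with the site."""
    pairs = list(zip(site_reported, committee_adjudicated))
    return sum(site == committee for site, committee in pairs) / len(pairs)

site      = ["PD", "SD", "PR", "PD", "CR", "PD"]
committee = ["PD", "SD", "PR", "SD", "CR", "PD"]
print(f"{concordance_rate(site, committee):.0%}")  # 83%: below a 95% threshold,
                                                   # so full review would continue
```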
When used as a component of a data management program, independent endpoint review can ensure efficient and unbiased evaluation of key trial data.

Data Management Plan/Data Validation Plan

A comprehensive Data Management Plan (DMP) provides a blueprint for ensuring quality data throughout the life cycle of a trial. The DMP specifies the tools and
processes that will be utilized in the management of clinical, laboratory, and
pharmacovigilance databases prior to trial initiation through the final database lock
or clinical trial report. The plan outlines database management and implementation,
defines processes for training and certification of data entry staff, describes clinical
data review, monitoring, and validation guidelines, and sets expectations for review
and transfer of data at the completion of a trial. The DMP serves to combine the
various methods that will be employed for data management and details how data
will be entered and which stakeholder is responsible for entering data. Any relevant
standard operating procedures (SOPs) for data collection, upload, and transfer
process are specified. The consolidation of the strategy into one overarching docu-
ment ensures that all stakeholders are aware of the plan and new staff can be trained.
A data validation plan (DVP) may be used to supplement the DMP or integrated
as a component of the DMP. The DVP further outlines the processes for the quality
control measures that will be utilized in the trial (e.g., front-end validation of
electronic data capture systems, edit checks or manual queries, quality assurance
reports, medical or clinical review, reconciliation of external data sources). This plan
may also contain information about the criticality of the data collection points to
support a risk-based monitoring approach, as well as outline the frequency of data
monitoring.

CRF Completion Guidelines

Several resource documents should be created at the start of the clinical trial that will
provide guidance to trial staff on expectations of the data capture system, collection
of data within each of the forms/assessments and general guidelines not captured in
the trial protocol. These documents should be available to trial staff prior to the start
of the trial and should be maintained and updated throughout the trial using version
control to address frequently asked questions and guidance or decisions on how to
enter specific data as needed (Nahm et al. 2011; McFadden 2007).
Ideally, the CRF completion guidelines should contain a table reflecting the
expected list of assessments per the schedule specified in the protocol (Nahm et al.
2011). There should be a section that clarifies when each assessment is expected, the
source data for the assessment, and how data entry will be completed. These sections
should also describe the intricacies of the CRF that are not immediately obvious by
reviewing the questions. This includes explaining any fill/skip patterns or logic that
may be used. For example, if a date is expected for an event, but several dates could
be applicable, the CRF completion guidelines should identify clearly how the correct
date should be determined and reported (McFadden 2007).
Having a central resource for all trial staff will help ensure that data are collected
in a consistent manner across participants and sites.
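
As a sketch of how such a schedule-of-assessments table can be used operationally, the fragment below compares submitted forms against the protocol-specified expectation for a visit. The visit names and form names are hypothetical.

# Hypothetical schedule of assessments of the kind CRF completion guidelines
# tabulate from the protocol; visit and form names are illustrative.
EXPECTED_FORMS = {
    "Screening": {"Demographics", "Medical History", "Labs"},
    "Week 4": {"Vitals", "Labs", "Adverse Events"},
    "Week 8": {"Vitals", "Labs", "Adverse Events", "Quality of Life"},
}

def missing_forms(visit, submitted):
    """Return protocol-expected forms not yet submitted for a visit."""
    return EXPECTED_FORMS.get(visit, set()) - set(submitted)

print(missing_forms("Week 4", ["Vitals", "Labs"]))  # {'Adverse Events'}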

System User/Quick Reference Guides

When electronic systems are utilized in the conduct of a trial, a system user guide
should be provided to give staff guidance on how to access and navigate the system. The guide should include a high-level overview of any data
collection system or other tools that will be used and provide more in-depth detail
about specific aspects of the data collection system (e.g., enrolling a participant using
an EDC system) or how to administer a specific assessment tool.
Depending on the complexity of the system being used, or for multicenter trials
which may have many participating institutions, it may be necessary to provide
additional resources or “quick reference guides” to facilitate data submission. This is
typically a short document that provides specific guidance on one or a few specific
tools, assessments, or systems, for example, a quick reference guide on how to
upload files to an electronic data capture system. The guide should provide specific
details but supplements more in-depth documents like a user’s guide or a CRF
completion guide. The goal is to ensure that any system user can quickly understand
key system features and expedite the training process.

Training Site Staff

Another area of data management support includes training of staff and system users.
Data management training may include instruction on CRF completion, data entry or
system navigation, query resolution, and trial-specific guidance. Training is an
ongoing activity; initial training is typically conducted at site initiation visits,
investigator meetings, or through group training or recorded module/webcast train-
ing modalities. There are many aspects to data collection, including regulatory
documents, safety, the method of data collection, biological and other validated
assessments, and possibly trial drug/intervention. At the beginning of the trial,
stakeholders involved in the creation of the protocol, assessments, system, and
overall trial guidelines should set up a detailed training for all site staff that will be
collecting data and administering the assessments.
A training module for each area can provide structured training to staff before the trial starts (Williams 2006). It is recommended that comprehension
evaluations are completed for each module and question and answer sessions are
provided to allow trial staff time to review training and ask questions. Providing
trial staff with certification of completion on each module they are trained on
for their records helps ensure that trial staff are prepared for data collection
(Williams 2006).

As the trial progresses, it will be necessary to train additional site personnel and
perhaps provide retraining as data quality issues are identified or there are changes to
the trial due to protocol amendments or other updates. All initial training modules
and evaluations should be recorded and readily available for new staff or as a
refresher to existing staff throughout the protocol. Any supplemental training pro-
vided should also be recorded and readily available.
Training documentation includes the management of system access and mainte-
nance of user credentials, when electronic systems are used. It is important to ensure
that all users have appropriate access for their role and departing staff can no longer
access systems.
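
A simple way to operationalize this check is to reconcile system accounts against the current staff roster. The sketch below assumes a hypothetical delegation log and account list; real systems would drive this from their own user administration records.

# Hypothetical reconciliation of EDC user accounts against the current
# site delegation log, to confirm departing staff no longer hold access.
active_staff = {"jsmith", "mlee", "rpatel"}   # from the current delegation log
system_accounts = {
    "jsmith": "data_entry",
    "mlee": "investigator",
    "tgone": "data_entry",                    # left the site last month
}

for user, role in system_accounts.items():
    if user not in active_staff:
        print(f"REVOKE: {user} ({role}) is not on the current delegation log")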
Traditionally, clinical research associates (CRAs) are involved in training of site
staff and perform on-site reviews to compare source documents to entered data,
monitor data quality, and verify training and regulatory documentation. They serve
as a frontline resource for sites, helping to address any concerns early and prevent them from recurring through the life of the protocol. Errors in data collection or sample
storage are often identified and corrected during monitoring visits and should be
communicated to data management staff to determine whether any changes to CRFs
or guidance documents are warranted. Repeated issues may also lead to updates to
training or user materials. Data managers and CRAs provide ongoing training support
throughout the trial and work closely together to manage the overall integrity of data
collection at sites.

Site and Sponsor Communication

A successful data management program requires setting clear expectations with sponsors or investigators on the format for data collection, acceptable data quality
metrics, pathways for communication and escalation of data quality issues, and
alignment on trial objectives and goals. Having regular meetings with data manage-
ment and sponsor staff throughout the trial will help keep these goals in mind as the
trial progresses and questions are raised by trial staff. This also provides a mecha-
nism to determine if protocol-specified assessments are aligned with current practice
at sites. These types of issues should be discussed to decide whether additional
training of site staff is needed, modifications to assessments are needed, or if
clarification of data collection should be provided.
Data management staff are frequently a point of contact for trial personnel
seeking assistance. Formal communication with sites is crucial to developing pro-
ductive relationships with personnel and providing support for data quality or
management issues that may arise. Impromptu calls, regularly scheduled meetings,
or email communication are all potential avenues to provide ongoing data manage-
ment support. Expectations regarding frequency and types of communication should
be established during the training phase of the trial and should be frequently stated
throughout the life of the protocol.
In addition to regular calls with both the sponsor and trial staff, clear guidance on
the point of contact for the following areas should be established prior to the first
enrollment of a participant: data quality or management issues, safety events that
may arise, concerns or issues related to trial medication or intervention, or technical issues with the EDC system. Additionally, the sponsor and data management team
should evaluate the locations of trial centers and ensure that an escalation plan is in
place. If trial centers are in different time zones from data management staff or
sponsor, consider having a help line or a designated staff member available during
“off” hours to provide timely support to trial site personnel.
A potentially overlooked aspect of communication is the trial portfolio that is
ongoing at a site. Many sites participate in multiple trials at the same time, which
may stretch limited resources and lead to challenges with collecting quality and
timely data. It is important to recognize the environment in which the site staff are
operating and attempt to work within these constraints. When communicating with
sites on data quality items, clearly state priority items and set concrete deadlines. If
sites are unresponsive, enlist the support of CRAs or other team members to
understand the site challenges and identify potential solutions.

Data Management in Single Versus Multicenter Trials

Data management is structured differently depending on whether the trial is conducted in a single center or a multicenter setting, but the overall goal of collecting
quality data remains the same.
A single center trial generally relies on an individual institution or site to enroll
participants and collect data for a trial. There are some advantages to a trial run at a
single center, such as having personnel located in proximity and having more homo-
geneity in data collection techniques and participant population. However, multicenter
trials (utilizing more than one site and/or having a central coordinating center or other
shared resource to administer the trial) have the benefit of enrolling participants across multiple sites, yielding larger and potentially more diverse participant populations, which expedites accrual and may enhance the generalizability of trial
results (Meinert and Tonascia 1986). The multicenter trial is particularly important
when the therapeutic area is for a rare indication or small population, but there are
unique challenges as well.
From the data management perspective, a multicenter trial requires coordi-
nated efforts to ensure quality data are received from all participating sites and
that data management expectations are clearly communicated to all stakeholders.
Multicenter trials rely on shared resources, such as the protocol, guidance
documents, and standard operating procedures to ensure data are collected as
uniformly as possible.
Managing multiple sites requires an understanding of local site standards for data
collection. Sites may be academic institutions or participating hospitals which have
varying standards of care, may be subject to institutional restrictions and procedures,
and are utilizing laboratories or other facilities that have unique reporting techniques.
Whenever possible, this information should be gathered during site selection and
taken into consideration when implementing a CRF or process. For example, a CRF
collecting lab information on an international trial needs the flexibility to capture lab
values in varying units of measurement from different countries.
Managing data at a single center still requires a focus on completeness and accuracy, but it is also important to ensure there are no systemic errors in data
entry, as these may be more difficult to detect without a comparison. When manag-
ing data across sites, there is an increased ability to detect outlier data or differences
between sites, which can highlight issues for data management staff. Utilizing a data
review process may assist with this potential risk.
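
One minimal form of such cross-site review is central statistical monitoring of site-level summaries. The sketch below flags a site whose mean for a variable diverges from the others; the site values and the z-score threshold are purely illustrative.

import statistics

# Hypothetical site-level means for one lab variable; values are illustrative.
site_means = {"Site A": 7.1, "Site B": 7.3, "Site C": 7.0,
              "Site D": 9.8, "Site E": 7.2, "Site F": 6.9}

mean = statistics.mean(site_means.values())
sd = statistics.stdev(site_means.values())

# Flag sites whose mean lies far from the others -- a possible unit,
# calibration, or transcription issue worth a data management query.
for site, m in site_means.items():
    z = (m - mean) / sd
    if abs(z) > 1.5:   # threshold chosen for illustration only
        print(f"{site}: mean {m} (z = {z:+.2f}) -- review with site staff")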

Future Data Management Considerations

The advancement of technologies such as electronic medical record (EMR) integration, artificial intelligence, and cognitive computing has the potential to revolutionize how data are collected, shared, and understood, and will impact the landscape in which data management practices are deployed.
There are significant efforts to drive the integration of EMR and EDC, reducing or
eliminating the need for time-consuming and error-prone transcription between local
databases at the trial sites and central clinical databases. EMRs have become a
pervasive part of the everyday experience of medicine, and many organizations are
seeking to integrate EMR into their clinical trial collection processes. Data can be
imported into the trial database directly from the EMR, eliminating traditional transcription and interpretation errors. However, barriers continue
to exist to the widespread adoption of the automatic transfer of EMR to EDC,
including the lack of standardization for data format, privacy concerns, and difficulty
extracting data from the EMR (Goodman et al. 2012).
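
At its core, EMR-to-EDC transfer is a field-mapping and unit-conversion exercise. The sketch below illustrates the idea on a simplified record; the field names and mapping table are hypothetical, and production integrations would typically work through standards such as HL7 FHIR or CDISC ODM rather than ad hoc dictionaries.

# A simplified sketch of EMR-to-EDC field mapping; the record layout and
# mapping table here are hypothetical.
emr_record = {"pt_dob": "1956-04-02", "sbp_mmhg": 128, "hgb_g_dl": 13.2}

# Mapping table: EMR field -> (CRF field, optional unit conversion)
FIELD_MAP = {
    "sbp_mmhg": ("SYSBP", None),
    "hgb_g_dl": ("HGB", lambda v: round(v * 10, 1)),  # g/dL -> g/L, illustrative
}

crf_record = {}
for emr_field, (crf_field, convert) in FIELD_MAP.items():
    value = emr_record[emr_field]
    crf_record[crf_field] = convert(value) if convert else value

print(crf_record)  # {'SYSBP': 128, 'HGB': 132.0}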
Artificial intelligence, machine learning, and big data analytics provide opportu-
nities to analyze aggregated data to facilitate participant enrollment and site selec-
tion, identify potential fraud or data anomalies, and improve decision-making in
clinical trials (Johnson et al. 2018). The ability to collect, store, and analyze massive
amounts of data will intensify the need for a structured approach to data management
and data quality to ensure outcomes can be distinguished from the noise (Chen et al.
2016). These approaches are not without concern; the challenges to security and
privacy that face any electronic system are exacerbated in a rapidly evolving and
shifting field. Precautions for each step of the data lifecycle should be taken when
implementing an automated or integrated system (Khaloufi et al. 2018).
The impact of this digital transformation, combined with availability of enhanced
computing power and storage, is continually developing, but the principles of data
management and database development are centered in a planned and comprehen-
sive evaluation of available tools and careful selection of appropriate technologies
for a clinical trial.

Summary and Conclusion

The role of data management in clinical trials is essential. Setting up a trial to obtain
quality data begins early with the protocol, SAP, and data capture method selection
and design. Errors or poor judgment at this stage have a significant impact on the
process and may result in systematic errors that can compromise analysis and trial
results. Thoughtful case report form design is one of the most important aspects of
data management. Following commonly accepted principles of CRF creation and
utilizing standard CRFs whenever possible ensures streamlined data capture and
minimizes negative downstream effects.
Once the trial begins, the implementation of appropriate data quality tools such as
risk-based monitoring, reports, data validation checks, and data review are crucial
components to the data management program. These tools aid in identifying prob-
lematic data, provide information to support sites, ensure the consistency of data
collected, and provide an unbiased assessment of trial endpoints. Data quality reports
and checks should be updated frequently and implemented early. Additionally,
periodic data review is recommended as a method to ensure the integrity and validity
of data collected.
Setting clear expectations for all stakeholders is an important part of data
management. Data management and validation plans, as well as CRF and system
guidance documents, help provide valuable information regarding the flow of
data for the trial and how data should be entered correctly. These documents
should be supplemented by comprehensive training and a clear communication
plan to ensure understanding and agreement on data collection and quality
control measures.
Open and ongoing communication with sponsors and sites is necessary to
establish rapport and encourage collaboration for the duration of the trial. Scheduled
and ad hoc calls and meetings are helpful to build trust and facilitate discussion
regarding data quality issues. CRF and system training must be prioritized at the start
of a trial and is expected to continue throughout as new staff join or as there are
changes to CRFs or system features.
Clinical trials may be conducted in a variety of settings, depending on the nature
of the protocol and the trial objectives. There has been an increase in the number of
trials that are conducted through a multicenter approach to capitalize on centralized
resources and a larger participant population. Although the goal of a data manage-
ment program may remain the same, there are different considerations for single
center versus multicenter or network-conducted trials.
With advancements in technologies such as EMR-EDC integration, artificial
intelligence, and big data analytics, the landscape of data management is changing
rapidly. Utilizing principles of data quality assurance to manage new data sources
and applying understanding of data to large volumes of information will be imper-
ative to future data management programs.

Key Facts

• Quality data is critical
• Robust and reproducible quality control systems need to be developed
• The design step is very important to minimize changes after activation
• Aim for complete, timely, and consistent data

Cross-References

▶ Design and Development of the Study Data System
▶ Implementing the Trial Protocol
▶ International Trials
▶ Multicenter and Network Trials

References
Baigent C, Harrell FE, Buyse M, Emberson J, Altman D (2008) Ensuring trial validity by data quality assurance and diversification of monitoring methods. Clin Trials 5:49–55
Chen Y, Argentinis JD, Weber G (2016) IBM Watson: how cognitive computing can be applied to
big data challenges in life sciences research. Clin Ther 38(4):688–701
Food and Drug Administration (2013) FDA guidance oversight of clinical investigations – a risk-based
approach to monitoring. https://www.fda.gov/downloads/Drugs/Guidances/UCM269919.pdf
Food and Drug Administration (2018) FDA guidance clinical trial endpoints for the approval of
cancer drugs and biologics: guidance for industry. https://www.fda.gov/downloads/Drugs/Guid
ances/ucm071590.pdf
Gaddale JR (2015) Clinical data acquisition standards harmonization importance and benefits in
clinical data management. Perspect Clin Res 6(4):179–183
Goodman K, Krueger J, Crowley J (2012) The automatic clinical trial: leveraging the electronic
medical record in multi-site cancer clinical trials. Curr Oncol Rep 14(6):502–508
International Conference on Harmonisation (2018) Guideline for good clinical practice E6(R2)
good clinical practice: integrated addendum to ICH E6(R1) guidance for industry. https://www.
fda.gov/downloads/Drugs/Guidances/UCM464506.pdf
Johnson K, Soto JT, Glicksberg BS, Shameer K, Miotto R, Ali M, Ashley E, Dudley JT (2018)
Artificial intelligence in cardiology. J Am Coll Cardiol 71:2668–2679
Khaloufi H, Abouelmehdi K, Beni-Hssane A, Saadi M (2018) Security model for big healthcare
data lifecycle. Procedia Comput Sci 141:294–301
Krishnankutty B, Bellary S, Kumar NBR, Moodahadu LS (2012) Data management in clinical trial: an overview. Indian J Pharmacol 44(2):168–172
McFadden E (2007) Management of data in clinical trials, 2nd edn. Wiley-Interscience, Hoboken
Meinert CL, Tonascia S (1986) Clinical trials: design, conduct, and analysis. Oxford University Press, New York
Nahm M, Shepherd J, Buzenberg A, Rostami R, Corcoran A, McCall J et al (2011) Design and
implementation of an institutional case report form library. Clin Trials 8:94–102
Reboussin D, Espeland MA (2005) The science of web-based clinical trial management. Clin Trials
2:1–2
Richesson RL, Nadkarni P (2011) Data standards for clinical research data collection forms: current
status and challenges. J Am Med Inform Assoc 18:341–346
Williams G (2006) The other side of clinical trial monitoring; assuring data quality and procedural
adherence. Clin Trials 3:530–537
18 End of Trial and Close Out of Data Collection

Gillian Booth
Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
e-mail: [email protected]

Contents
Introduction
Planning for Trial Closure
  Stage 1: End of Recruitment
  Stage 2: End of Trial Intervention
  Stage 3: End of Trial
  Stage 4: Trial Reporting and Publishing
  Stage 5: Archiving
Trial Closure Activities
  Developing a Trial Closure Plan
  Communication
  Stage 1 End of Recruitment
  Stage 2 End of Trial Intervention
  Stage 3 End of Trial
  Stage 4 Trial Reporting and Publishing
  Stage 5 Archiving
Early Trial Closure
  Planning for Early Trial Closure
  Communicating with Trial Participants Following Early Trial Closure
Individual Site Closure
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Trial closure refers to the activities that take place in preparation for the cessation of
trial recruitment through to archiving of the trial. Trial closure can be notionally
divided into five stages: End of Recruitment; End of Trial Intervention; End of
Trial; Trial Reporting and Publishing; and Archiving. The length and scheduling of
each stage of trial closure is determined by the trial design and operating model. As
trial closure approaches there is an increased emphasis on monitoring and controls
to ensure the correct number of participants is recruited and data collection and
cleaning in preparation for final database lock and analysis is complete. The End of
Trial is a key ethical and regulatory milestone defined in the approved trial protocol
and has associated time-dependent notification and reporting requirements to the
independent ethics committee and, for regulated trials, the regulator(s). Key steps of
trial closure, reporting, publishing, and data sharing are important mechanisms to
support transparency in clinical trials. A Trial Closure Plan can be used to support
the activities that ensure appropriate control over the final stages of the trial.

Keywords
Analysis · Archiving · Close out · Database lock · Data sharing · End of Trial ·
Public registry · Publishing · Reporting · Transparency · Trial Closure Plan

Introduction

Trial closure refers to the activities that take place in preparation for the cessation of
trial recruitment through to archiving of the trial. During this period a range of
different tasks take place at each of the physical locations (institutions) where the
trial is being conducted in order to:

– Complete the protocol specified intervention and assessments


– Finalize the study data and documentation ready for analysis and reporting
– Close down investigator sites and notify trial participants, funders, regulators, and
ethics committees of the trial results

The specific trial closure activities at each institution will depend upon the trial
design and the role of each institution; however, the overarching aims remain the same.
Clinical trials each have different designs, participant pathways, and risk profiles; therefore, the trial closure activities undertaken can be adapted to ensure a risk
proportionate approach appropriate to the trial design and operating model (MRC
et al. 2012). Where risk proportionate approaches are utilized, these should be
documented with the associated decisions in the trial risk assessment. It is also
possible for many typical trial closure activities to be undertaken either remotely
rather than “on site,” thereby permitting the most appropriate method to be utilized
and ensuring the most efficient and effective use of resources.

Planning for Trial Closure

Conducting methodologically sound, safe, ethical, and regulatory compliant trials requires careful planning and control; it follows that the trial closure activities and the timing of these will be carefully preplanned. The monitoring of timelines in
the lead up to trial closure and the development of a Trial Closure Plan are critical to
ensure appropriate control over the final stages of the trial. A Trial Closure Plan will
include key milestones and deadlines relating to the responsibilities of each institu-
tion, with details of data items for on-site or central monitoring which can inform the
trial closure and analysis timelines.
However, it is not uncommon for trial timelines to change; for example, as a result
of slower than anticipated recruitment resulting in a delay to the recruitment closure
date. Some trial designs also include predefined stopping rules which allow the trial
to be stopped early; although the outcome of these prespecified reviews cannot be
predicted, planning for the various outcomes can and should still take place. It is less
common, but not unknown, for trials to be closed for unplanned reasons where the
opportunity for preplanning can be greatly reduced.
Trial closure activities (for all institutions involved in the trial) take place between
the cessation of trial recruitment and archiving and can be broadly divided into five
stages of activity (Fig. 1). While it is helpful to think about the stages of trial closure
in a linear, predictable way for planning purposes, in practice the design of the trial
may influence the length of each stage and whether the stages overlap. The period of
time between each stage will vary depending upon the length of time the trial is open
to recruitment and also the length of the individual participant intervention, data
collection, and follow-up periods. The design of the trial will also impact the overlap
between the different stages of trial closure; for example, in the case of an adaptive
platform multi-arm-multi-stage (MAMS) trial. A MAMS trial is a platform clinical
trial with a single master protocol where multiple interventions are evaluated at the
same time. Adaptive features enable one or more interventions to be “dropped” (e.g.,
due to futility) or added during the course of the trial. Different “arms” of the trial

Fig. 1 Stages of Trial Closure

Stage 1 (End of Recruitment). Completed: enrolment of trial participants. Continuing after this stage: protocol specified intervention and ordering of trial supplies; clinical assessments; safety monitoring; data collection; data cleaning; investigator site monitoring; statistical programming.

Stage 2 (End of Trial Intervention). Completed: all trial participants have completed the protocol specified intervention(s). Continuing after this stage: clinical assessments; safety monitoring; data collection; data cleaning; investigator site monitoring; statistical programming.

Stage 3 (End of Trial). Completed: data collection and clinical assessments; End of Trial Notification to regulator and ethics committee; substantial amendments no longer permitted. Continuing after this stage: data cleaning; statistical programming; statistical analyses.

Stage 4 (Trial Reporting and Publishing). Completed: trial analyses; End of Trial Report to ethics committee, regulator, and funder; results published in a peer-reviewed scientific journal; results made available to participants; results reported on a public registry. Continuing after this stage: publications and presentations arising from the trial.

Stage 5 (Archiving). Completed: essential documents and data prepared for archive; archive period agreed with Sponsor; investigator sites notified of end of archive period. Continuing after this stage: publications and presentations arising from the trial; hypothesis generation; data sharing for further research.



will remain open to recruitment and intervention, while others are closed and there
may be protocol defined analyses performed and reported prior to the End of Trial, thereby resulting in overlap of the different stages of trial closure.

Stage 1: End of Recruitment

The end of recruitment is the point at which all trial sites are no longer permitted to
enroll participants into the trial. The trial protocol will specify the sample size, the
number of participants to be enrolled to achieve the sample size, and will describe
the recruitment pathway and related processes to achieve this. In most trials there
will still be participants receiving the intervention and undergoing clinical assess-
ments, safety monitoring, and data collection after the trial has closed to recruitment.

Stage 2: End of Trial Intervention

The End of Trial intervention is the point at which all participants enrolled into the
trial have completed the trial intervention as specified by the approved trial protocol.
Depending upon the trial design, it is likely that trial participants will be undergoing
clinical assessments, safety monitoring, and data collection after this time.

Stage 3: End of Trial

The End of Trial is a key ethical and regulatory milestone with associated time-
dependent notification requirements to the independent ethics committee and, for
regulated trials, the regulator(s). There may also be specific contractual requirements
for notification to other bodies, such as funders, at this time.
The End of Trial will be defined in the approved trial protocol; typically this will
be the date of the last “visit” of the last participant or at the time the last data item is
collected for the trial, that is, the point at which all clinical assessments, safety
monitoring, and data collection stops, although there may be different regulatory
requirements in different regions or countries. Preparatory activities for the End of
Trial will therefore focus on monitoring key data associated with the countdown
toward the End of Trial in addition to completing the data collection and cleaning
required for final database lock and analysis.

Stage 4: Trial Reporting and Publishing

Trial reporting and publishing of the trial results follow completion of the protocol
specified trial analyses. These are two discrete activities:

– Trial reporting is a requirement of regulators and involves reporting summary results to the independent ethics committee, regulator(s), and/or trial registry –
typically within 12 months of the End of Trial. Where a trial is intended to support a
regulatory submission (e.g., in support of a manufacturer’s license for a drug or
medical device) the final report will take the form of a Clinical Study Report with
supporting documentation and detailed datasets as required by the regional/country
regulator.
– Publishing refers to publishing trial results, irrespective of the trial outcome, in a
peer-reviewed scientific journal and tends to be an activity primarily, although not
exclusively, associated with academic-led research.

Trial reporting and publishing of the trial results are two of the four key mechanisms
to support transparency in clinical trials (Box 1). The overall aim of transparency in
clinical trials is to ensure that the participants of trials, doctors, the scientific community,
and the public have access to information about which trials have been conducted, how
they have been conducted, and the outcomes of those trials. This builds trust with
patients and the public, informs clinical practice by allowing access to all of the
available evidence about a particular treatment, and minimizes research waste by
ensuring the same trials are not repeated. Transparency is fundamental in meeting the
expectations of research participants, regulators, and the wider scientific community.
Transparency in clinical trials is typically understood to mean registering,
reporting, publishing, and making data from the trial available for further analyses
or for the purpose of undertaking an independent analysis of the trial results (Box 1).
Transparency measures are a regulatory requirement for some trials and a prerequisite for publishing in many high-profile scientific journals; for example, to publish in some high impact journals (ICMJE 2019) the trial must be registered in a Primary Public Registry prior to the start of recruitment.

Box 1 Transparency in Clinical Trials

What does Transparency mean in practice?

– Registering a trial in a public registry and/or regulatory database

The World Health Organisation considers trial registration to be not only a scientific but also an ethical and moral responsibility and has set out the
International Standards for Clinical Trial Registries. Trial registration is
defined by the World Health Organisation as the publication of an internation-
ally agreed set of information about the design, conduct, and administration of
clinical trials (WHO 2012). These are published on a publicly accessible
website managed by a registry conforming to WHO standards. Registries
which meet the WHO Registry Criteria and have at least a national remit are
called WHO Primary Registries; an example of a Primary Registry is the EU
Clinical Trials Register.


– Reporting a trial in a public registry or regulatory database

Regulatory requirements specifying which registry or database to use when registering or reporting a trial differ by country; therefore, the relevant laws and
guidance for that country/region should be referred to.

– Publishing the trial results in a scholarly journal and presenting at professional society scientific conferences

Regardless of the outcome of the trial, it is important to make those results available to others. For regulators, doctors, patients, and participants of trials
this enables them to make decisions based on the most up-to-date information
about a treatment. For other researchers this avoids the same research ques-
tions being investigated again unnecessarily.

– Making data from the trial available for further research purposes
(Data Sharing)

The quality control and curation of clinical trial datasets typically means
these are valuable resources which can be used for further research, such as
meta-analyses. Any further use of clinical trial datasets must be in line with
participant expectations and always legally compliant; this typically means
taking steps to anonymize a dataset prior to releasing to a third party.
For trial integrity reasons, data is not usually made available for further
research purposes until the protocol specified analyses have been completed
and reported/published.
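
As a minimal illustration of the kind of anonymization step taken before release, the sketch below drops direct identifiers and replaces the participant ID with a one-way pseudonym. The column names and salt handling are hypothetical, and real anonymization must also consider indirect identifiers such as dates, rare values, and free text.

import csv, hashlib, io

# Sketch: drop direct identifiers and pseudonymize the participant ID
# before sharing. Column names are illustrative.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "postcode"}
SALT = b"study-specific-secret"   # hypothetical; held only by the data controller

raw = io.StringIO("participant_id,name,date_of_birth,postcode,outcome\n"
                  "001,Ann Example,1950-01-01,AB1 2CD,12.5\n")

for row in csv.DictReader(raw):
    pseudonym = hashlib.sha256(SALT + row["participant_id"].encode()).hexdigest()[:12]
    shared = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    shared["participant_id"] = pseudonym
    print(shared)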

Stage 5: Archiving

Archiving is the storage and retention of the trial essential documents (ICH 2016) and
data produced in the trial. Retention periods may vary and are dictated by regulators,
sponsors, funders, or institute policies. For regulated trials, the purpose of archiving is
to ensure the records which demonstrate how the trial was conducted and the compli-
ance of all individuals and institutions involved in the trial with good clinical practice
(ICH 2016) and all relevant laws are available for audit or inspection purposes.

Trial Closure Activities

Developing a Trial Closure Plan

There can be many different types of institutions, groups, and individuals involved in
the day-to-day conduct of a trial, for example, pharmaceutical companies, clinical
trials units (CTUs), contract research organizations (CROs), laboratories, investiga-
tors, investigator sites, suppliers/vendors, and independent oversight committees.
Each institution, group, or individual will have agreed role(s) and an agreed set of
responsibilities which are usually defined in contracts, the trial protocol, and other
working documents. As trial closure approaches the lead institution responsible for
trial conduct will instigate the development of a detailed Trial Closure Plan (Box 2)
to ensure appropriate planning and control over the final stages of the trial.
When developing a Trial Closure Plan it is important to consider:

• Who does the Plan need to include?


Consider who will use it. Depending upon the complexity of the trial, there
may need to be detailed sub-plans for certain institutions, groups, or individuals
which feed into a master plan controlled by the lead organization responsible for
trial conduct.
• What resources are needed to deliver the plan?
The workforce or other resources needed to deliver the trial closure activities
will need to be identified. When considering the workforce planning implications
at the end of the trial there may be contracts that will need to be terminated or the
movement of people onto other projects.
• Does the Plan include timelines and key milestones?
In a well-managed trial, the key milestones will have already been identified in
the master trial project plan, as dictated by the protocol, contracts, and regulators.
However, there may be milestones which can only be defined and agreed as the trial
closure period approaches, such as scheduling the analysis to be completed by a
specific date in order to present at a key professional society scientific conference.
• What communication is needed, to whom, and when?
A communication plan should consider how communication strategies will
need to change to ensure timely communication, which institutions, groups, and
individuals need to be communicated with at each stage of trial closure and the
best routes for communication.
• Are there specific legal, protocol, or contractual obligations which need to be
met? (Box 2)
Typically trials include many contracting partners. It is good practice to review
the contracts regularly to ensure ongoing compliance. Contracts will usually
include specific reporting and communication requirements relating to the end
of trial; any contractual obligations should be built into the Trial Closure Plan so
that they are not missed.

Box 2 Typical elements of a Trial Closure Plan to ensure legal, protocol, and contractual obligations

A typical Trial Closure Plan will include detailed procedures to ensure legal, protocol, and contractual obligations are met at each stage.

At Stage 1: Recruitment Closure:


• The appropriate number of trial participants is recruited, and access to the trial systems for recruitment and randomization is removed at the appropriate time.

At Stage 2: End of Trial Intervention:

• Trial supplies are appropriately controlled, accounted for, and destroyed, and access to the trial systems for managing trial supplies is removed at the appropriate time.
• Appropriate communication to trial participants about next steps and
options once their trial treatment ends.

At Stage 3: End of Trial

• End of Trial notification requirements of the independent ethics committee and regulator are met.
• Data collection and cleaning is timed appropriately to meet database lock,
analysis and reporting deadlines, and removal of access to the trial database
for data entry at the appropriate time.
• Where required by the trial protocol, key data are signed off by the Site
Investigator and independent adjudication of endpoint data are completed
to meet database lock, analysis, and reporting deadlines.
• Final investigator site monitoring activities are completed according to the
trial monitoring plan.
• Trial samples are destroyed or have the necessary approvals to be stored
beyond the end of the trial.
• All final substantial amendments to the protocol or supporting documents
are made.
• Final payments to investigator sites, vendors, suppliers, and contractors are
made.

At Stage 4: Trial Reporting and Publishing

• Ethical and regulatory reporting requirements and timelines are met.


• A publication plan is developed to ensure timely publication and presenta-
tion of the trial results.

At Stage 5: Archiving

• Completion of the Trial Master File and Investigator Site File essential
documents and preparation of these and any data files (paper and/or elec-
tronic) for archive.

Communication

Good communication between the different institutions involved in the trial is key to
a successful trial closure; as such it is typical to see the frequency and format of
communications between the different institutions, groups, and individuals increase
and change in the run up to trial closure (Box 3).

Box 3 Illustrative Communication at Trial Closure

Newsletters and Websites

In the months before the end of the recruitment period, newsletters, websites, and other publicity information for the trial which are aimed at
recruiting site staff and/or potential participants are updated with the con-
firmed or indicative recruitment closure date.

Project Planning Meetings

As trial closure approaches, to facilitate more detailed planning activities and timely decision making, the institutions responsible for the overall management of the trial agree to increase the frequency of project planning meetings.

Notification of Closure to Recruitment

The organization primarily responsible for trial conduct writes to all rele-
vant organizations/groups/individuals involved in the trial including the
funder, Sponsor, Independent Oversight Committees, Investigator Sites, and
suppliers to inform them that the trial has closed to recruitment. The letter
includes information such as:

– The date the trial closed to recruitment and the reason for the end of
recruitment
– A summary of overall recruitment for the trial and, where writing to an
individual investigator site, the recruitment summary for that individual site
– Key dates, such as the planned end of intervention period and the End of
Trial date
– A reminder of ongoing activities/obligations such as the management of
trial supplies and data collection

Institutions Involved in the Day-to-Day Conduct, Funding, and Independent Oversight of the Trial
Institutions, groups, and individuals involved in the day-to-day conduct, funding, and
independent oversight of the trial might include investigator sites, suppliers,

Fig. 2 Typical communication to institutions involved in the day-to-day conduct, funding, and independent oversight of the trial

Stage 1 (End of Recruitment):
• Date the trial closed to recruitment and the reason for the end of recruitment, in particular if the trial has closed early
• Summary of overall recruitment for the trial and, where writing to an individual investigator site, the recruitment summary for that investigator site
• Key future dates, such as the planned end of the intervention period and the End of Trial date
• Reminder of ongoing activities/obligations, such as the management of trial supplies and data collection
• Where the trial has closed early due to safety concerns, detailed instructions about how the treatment of participants should be stopped or changed, and how and when action should be taken and communicated to participants

Stage 2 (End of Trial Intervention):
• Key dates, such as the date the trial intervention delivery period ended, the End of Trial date, planned analyses, and final investigator site monitoring visits
• Reminder of ongoing activities/obligations, such as ongoing follow-up data collection; the reconciliation, return, or destruction of trial supplies/equipment; and maintenance of the Investigator Site File
• Chase for any outstanding essential documents required for the Trial Master File and to satisfy protocol and contractual requirements such as trial logs
• Making provisions to destroy or seek appropriate authorisation to store trial samples beyond the End of Trial
• Provision of information to trial participants about next steps and options once their trial treatment ends

Stage 3 (End of Trial):
• Notification of the official End of Trial date
• Notification of the cessation of data collection and all other trial procedures
• Reminder of ongoing activities/obligations, such as preparations for archive of trial documents
• For remote data entry trials (not paper), removal of access to the database and provision of a copy of the final investigator site dataset to the investigator site
• Seeking permission to acknowledge collaborators in trial publications
• Making financial payments

Stages 4 and 5 (Trial Reporting, Publishing, and Archiving):
• Notification of the trial results to the regulator/independent ethics committee/contract partners
• Publication of the trial results
• Updating the relevant Public Registry
• Provision of information to trial participants about the trial results
• Providing permission to archive trial documents and notification of the end of archive date

independent oversight committees, and funders. Good communication will ensure that
those institutions, groups, and individuals are aware of the planned and final timing of
each stage in the days, weeks, and months leading up to the event. It is good practice,
and indeed an essential part of the audit trail for regulated trials, to write to the
institutions, groups, and individuals involved in the day-to-day running, funding,
and independent oversight of the trial at each stage of trial closure to keep them
appraised of key dates, decisions, and as a reminder of any ongoing activities or
obligations (Fig. 2).

Organizations Involved in the Authorization of the Trial


The organizations involved in the authorization of the trial, such as regulators and
independent research ethics committees, will have their own trial authorization
systems, processes, and timelines which will need to be followed. It is not usually
necessary to notify the regulator(s) or independent ethics committee(s) of the End of
Recruitment or End of Intervention unless recruitment has been terminated early and
this was not prespecified in the trial protocol, for example, where a trial has been
closed early for participant safety reasons. However, it is prudent to check the
requirements of regulators and independent research ethics committees as these differ between regions and countries and can change over time.

Stage 1 End of Recruitment

Planning and Controlling the End of Recruitment


As recruitment approaches the protocol-specified total number of participants, the frequency of recruitment monitoring will increase so as to ensure compliance with
the approved trial protocol. Different systems for monitoring recruitment may be
used, for example, spreadsheets or electronic trackers. Whichever system is used,
close communication with the institutions and individuals involved in recruitment is
critical to ensure an appropriate level of control over the number of participants
recruited.
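
The sketch below shows one simple form such a tracker might take, flagging when enrolment approaches the protocol target so that screening can be throttled in time. The target and buffer values are hypothetical.

# Hypothetical recruitment tracker: flag when enrolment approaches the
# protocol-specified target so that screening can be throttled in time.
TARGET = 400
IN_SCREENING_BUFFER = 15   # participants consented but not yet enrolled

def recruitment_status(enrolled):
    remaining = TARGET - enrolled
    if remaining <= 0:
        return "CLOSE: target reached -- revoke enrolment access"
    if remaining <= IN_SCREENING_BUFFER:
        return f"RESTRICT: only {remaining} slots left; limit new approaches"
    return f"OPEN: {remaining} slots remaining"

for n in (350, 390, 400):
    print(n, "->", recruitment_status(n))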
Specific communication with the investigator(s)/recruiting team(s) will be required in order to prevent uncontrolled over-recruitment, and different approaches may be taken, such as restricting the number of participants being approached to consent to the trial as the end of recruitment nears. It may also be necessary for the
investigator(s)/recruiting team(s) to adjust the communication with potential partic-
ipants to explain that the trial is nearing the point of closure and how this might
impact on whether they can ultimately participate in the trial.

Interim Recruitment Stopping Rules


The overall total number of participants to be recruited, as detailed in the approved trial protocol, may not be achieved in cases where it becomes apparent that ongoing recruitment into the trial is not feasible and the target cannot be met within an acceptable period of time.
trial protocol may include a recruitment stopping rule, that is, a planned interim
review of the trial recruitment rate and total to determine the feasibility of continuing
to recruit into the trial. Trial closure planning activities would be expected to take
place in preparation for review of any protocol specified stopping rules, particularly
where early trial closure is considered likely (see “Early Trial Closure” Section).

Closing Enrolment Systems


Once the protocol defined recruitment period has ended, the trial enrolment systems
are closed. Trials commonly use phone or web-based systems for recruitment, that is,
where an Investigator phones or logs into a website to enroll the participant into the
trial; these are called interactive voice or web response systems (IVRS/IWRS).
IVRS/IWRS are also commonly used for other associated trial management activ-
ities such as randomization, completing participant diaries, ordering trial supplies
and in the case of blinded trials, un-blinding activities. Where an IVRS/IWRS is in
use, permission to enroll participants into the system will be physically revoked
while leaving permissions active for other associated ongoing trial activities where
such functionality is provided by the IVRS/IWRS, for example, in relation to
management of trial supplies.
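
Conceptually this is a selective permission change rather than a full system shutdown. The sketch below illustrates the idea with a hypothetical function-level permission model; real IVRS/IWRS products expose their own administration interfaces.

# Hypothetical IWRS permission model: at the end of recruitment, only the
# enrolment/randomization functions are revoked; other functions stay live.
site_permissions = {
    "enroll": True,
    "randomize": True,
    "order_supplies": True,
    "unblind": True,
}

def close_recruitment(perms):
    """Revoke enrolment functions while preserving other trial operations."""
    for fn in ("enroll", "randomize"):
        perms[fn] = False
    return perms

print(close_recruitment(site_permissions))
# {'enroll': False, 'randomize': False, 'order_supplies': True, 'unblind': True}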

Stage 2 End of Trial Intervention

Planning the End of the Trial Intervention Period


The activities associated with the end of trial intervention, once all participants have
completed the protocol specified intervention, will depend upon the nature of the
trial intervention(s). Common across all trials will be the careful monitoring of
timelines and the availability of the intervention as the final trial participants
complete the protocol specified intervention. In a well-managed trial, the impact of delays earlier in the trial, such as a slower recruitment rate, on the availability of trial supplies will have been identified, with the planned mitigating action taken in preparation for this stage, for example:

– For interventions which rely on the employment of certain healthcare professionals, for example, therapists: undertaking extensions to the contracts of employment
– For drug trials, where the delay has impacted expiry dates: undertaking authorized extensions to the expiry or sourcing additional supplies

In trials involving supplies, such as drugs and devices, planning during trial setup
will ensure there are sufficient trial supplies for each participant to receive the
intervention as specified in the protocol. Indeed, any risks to the trial supplies
could be classed as a reportable event to the independent ethics committee or
regulator particularly where participant safety or the trial integrity are compromised
as a result of poor trial supplies management (in the European Union and United
Kingdom such events which occur in regulated drug trials are called Serious
Breaches and require expedited reporting to the regulator (MHRA 2018)).
Although careful monitoring and management of each individual participant and trial supplies is important during trial recruitment, this becomes even more critical toward the end of the intervention period to ensure efficient use of the trial supplies and to avoid over-ordering and waste, particularly where trial supplies are limited.

Controlling the Ordering of Trial Supplies


IVRS/IWRS systems can be used to manage the ordering, receipt, accountability,
and ultimately the return or destruction of trial supplies but simpler risk proportion-
ate processes and systems are also used, particularly in the case of noncommercial or
single site trials using a low-risk intervention where the trial supplies are available as
part of routine clinical care.
For regulated trials with limited supplies towards the end of the trial intervention
period, it may be necessary to restrict the amount of supplies each Investigator Site
can order or have each supply authorized by the Sponsor or delegate prior to
distribution because the expectation of some country regulators is that the transfer
of trial supplies between trial research sites is not routinely permitted, other than in
exceptional circumstances. Once all participants have completed the protocol
defined intervention schedule, access to the systems for ordering new trial supplies
will be revoked.

Trial Supplies Accountability and Reconciliation


The protocol or contract should detail how surplus trial supplies or specialist trial
equipment left at the investigator site should be managed. It is not usually permitted
for trial supplies to enter routine supply chains, nor to be used for non-trial partic-
ipants. Typically contracts will specify one of the following scenarios:

– Ring fence and retain the remaining trial supplies until remote or on-site moni-
toring activities have been completed to confirm correct use and accounting of the
trial supplies (Box 4). After any monitoring activities have been successfully
completed and any arising issues resolved, the Sponsor (or delegate) will give
permission to the investigator site(s) to either destroy or return unused trial
supplies to the Sponsor or supplier for destruction.
– In the case of low-risk trials (i.e., where the intervention was of no higher risk
than standard of care) there may be no accountability logs to monitor in which
case the investigator site(s) will be instructed to destroy surplus supplies or return
them to the Sponsor or supplier for destruction.
– The return of any specialist equipment to the Sponsor or supplier.

Box 4 Typical monitoring activities to confirm correct use and accounting of trial supplies

Typical examples of monitoring activities to confirm the correct use and accounting of trial supplies include:

– Checking storage locations and temperature logs to verify that trial supplies
were stored and handled according to the manufacturers’ recommendations
and any deviations were notified to the Sponsor.
– Checking the records which detail the traceability and accountability of the
trial supplies, ensuring that trial supplies were not used for participants who
were not enrolled onto the trial.
– Checking logs and other records which verify that any equipment used was
appropriately calibrated and maintained.

Such activities may take place at the Investigator site or written confirmation
or evidence in the form of logs or other paperwork from the Investigator Site
Pharmacist may be requested for remote review by the Sponsor or delegate.
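
The arithmetic behind such accountability checks is simple; the sketch below reconciles hypothetical counts from a site's accountability log before return or destruction is authorized.

# A minimal drug accountability reconciliation, assuming hypothetical
# counts drawn from the site's accountability log.
log = {"received": 200, "dispensed": 172, "returned_by_participants": 18,
       "destroyed_on_site": 0, "on_shelf": 26}

expected_on_shelf = log["received"] - log["dispensed"]
discrepancy = log["on_shelf"] - expected_on_shelf

print(f"Expected on shelf: {expected_on_shelf}, counted: {log['on_shelf']}")
if discrepancy != 0:
    print(f"DISCREPANCY of {discrepancy} units -- raise with site before "
          "authorizing return or destruction")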

Provision of Information to Participants at End of Intervention


It should be made clear to participants at the point of consent whether access to the
intervention they receive during the trial will be made available to them after the trial
has closed, or how their treatment options may change as a result of participation in
the trial. Depending upon the nature of the intervention and the length of the
intervention period, it can be a long time between the point of consent and the end
of treatment for an individual participant; therefore, it is good practice to prepare
information at the end of intervention for individual participants to serve as a
reminder of what will happen to them in terms of changes to their treatment, ongoing
clinical monitoring and how they can find out about or opt out of receiving
information about the trial results. This is also a good point in time to thank
participants for their contribution to the research.

Stage 3 End of Trial

Planning for the End of Trial


Preparatory activities for the End of Trial are dictated by the End of Trial definition
specified in the approved trial protocol, the regulatory timeframe and activities
required after the End of Trial for analysis and reporting, and the fact that further
substantial amendments to the trial protocol are not permitted after the End of
Trial. The activities will therefore largely be focused on:

– Meeting Ethical and Regulatory Requirements: Monitoring the trial data or
other trial information which indicates that the End of Trial has been reached and
initiates the regulatory required timeframe for End of Trial Reporting.
– Retention or destruction of research samples: Ensuring appropriate authoriza-
tion is in place for the retention of research samples (e.g., blood and tissue)
collected within the trial which are intended to be held after the End of Trial.
– Preparation for database lock: Ensuring data collection, cleaning, monitoring,
and any independent arbitration of endpoint data is timed appropriately to meet
database lock, analysis, and reporting deadlines and removal of access to the trial
database for data entry at the appropriate time.
– Completing final contractual close-out activities, such as making final payments to
investigator sites, vendors, suppliers, and contractors.

Ethical and Regulatory Requirements


Although the definitions for End of Trial and the associated timelines and mecha-
nisms for notification differ by country and region, there are consistencies in the
general concept of defining the point in time at which a clinical trial ends (the “End
of Trial”) and the principle that the End of Trial date effectively “starts the clock” for
reporting of the final summary clinical trial results. Typically, in the United King-
dom and European Union, the definition of End of Trial will be the date of the last
“visit” of the last participant or at the time the last data item is collected for the trial,
that is, the point at which all clinical assessments, safety monitoring, and data
collection stops (Official Journal of the European Union 2010). In the United
States, two dates are used: the Primary Completion Date (pertaining to completion of
the intervention and clinical assessments for the purpose of data collection relating to
the primary outcome) and the Study Completion Date (pertaining to completion of the
intervention and clinical assessments for the purpose of data collection relating to
all protocol-specified outcomes), with the Study Completion Date being equivalent to
the UK/EU definition.
Depending upon the design of the trial, the End of Trial may occur many years
after the recruitment and intervention stages have been completed. As a key
regulatory milestone, the End of Trial must be notified to the relevant oversight
body(ies) (usually the independent research ethics committee and, for regulated
trials, the country-specific Regulator) within a defined period of time after it occurs.
It is therefore necessary to monitor the data and other trial information which
indicate that the End of Trial has been reached in the run-up to the End of Trial
definition being met. This may require a change to the frequency of data collection,
cleaning, or monitoring activities, for example:

– In trials where the timing of the analysis is linked to the event rate: as the target
approaches, the frequency of data collection or monitoring at the investigator sites
may need to increase, with a greater focus on individual research site completion
of the relevant data collection forms.
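
As a minimal sketch of how such a trigger might be tracked (Python; the event target, accrued count, and 90% alert threshold are hypothetical assumptions rather than any standard policy):

# Illustrative tracker for an event-driven analysis trigger.
target_events = 250     # events required before the final analysis (assumed)
accrued_events = 238    # events confirmed in the trial database (assumed)
alert_threshold = 0.90  # step up site monitoring at 90% of target (assumed)

progress = accrued_events / target_events
if progress >= alert_threshold:
    print(f"{progress:.0%} of target events accrued: "
          "increase data collection and monitoring frequency at sites.")
else:
    print(f"{progress:.0%} of target events accrued: routine schedule.")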

The mechanisms for reporting also differ by oversight body, region, and country;
this may be via a dedicated reporting system such as the European Union Portal/
EudraCT System, a registry or database such as ClinicalTrials.gov, or simply a
standard form completed and emailed to the oversight body.
In practice, the End of Trial notification does not mean that ethical and regulatory
oversight ends immediately at this point; there is an obligation for the Sponsor to
provide a written report within a fixed period of time to the independent ethics
committee and, for regulated trials, the regulator. In addition, active regulatory
inspection or contractual audit periods may extend many years after the End of Trial.
The End of Trial, regardless of definition, only occurs once; thus it follows that
for multicenter trials the End of Trial notification is made once the End of Trial has
occurred in all participating research sites and for multinational trials the notification
is made once the End of Trial has occurred in all participating countries.
Once the official End of Trial notification has been made, substantial amendments
are no longer permitted; therefore, any amendments to the protocol or other autho-
rized documents must be completed prior to the End of Trial being reached.
The End of Trial will be communicated in writing by the Sponsor or delegate to
all of the institutions and individuals that have been involved in the conduct of the
trial. There may also be specific contractual requirements for notification to other
bodies such as funders at this time.

Research Samples
Many trials include the collection of research samples, such as tissue, blood, or urine,
which will be used for protocol-defined analyses. Trial participants may also
be asked to consent to any samples collected being held in a tissue bank
and used for further research projects. The plan for the research samples after the End
of Trial will have been originally approved by the independent research ethics
committee, and this approval must be adhered to. Where the plans have changed,
an amendment to the original ethical approval or alternative authorization by the
appropriate Authority/Regulatory Body will be needed in order to continue holding
the samples after the End of Trial. Depending upon the specific authorizations in
place, samples may need to be physically moved within or between institutions, for
example, to an authorized tissue bank.

Close out of Data Collection and Preparation for Database Lock


Trial data are expected to be of high quality (in terms of accuracy and completeness),
with data entered into the trial database contemporaneously with the associated
protocol activity and rigorous processes in place to collect, clean, and monitor the
quality of the data; data cleaning is therefore not a one-off activity and takes place
throughout the trial. The activities undertaken to collect and clean trial data are
typically documented in a Data Management Plan, a comprehensive document
detailing all aspects of the data handling parts of the trial. The Plan provides a
reference for staff working on the trial and organizational memory, and is a
controlled document which supports the reconstruction of the trial for audit/
inspection purposes.
Database lock is the action taken to “freeze” or take a final copy of the trial dataset
in order that it can be used in the final trial analysis. Database lock ensures that a
copy of the final dataset and statistical code used in the analysis of the trial are
retained, so that analyses can be repeated and independently verified at a later date, if
necessary.
In a well-managed trial, defining and detailing the plans for the data collection
and cleaning period in the run-up to Database Lock will have taken place at the time
of developing the Data Management Plan and Trial Monitoring Plan. However, it is
good practice to revisit the Data Management Plan in preparation for trial closure
and database lock to ensure that the priorities are right and that resources are being
used in the most efficient way (Box 5), for example:

– In trials where there are multiple different analyses taking place, or the analyses
only involve subgroups of participants, developing specific data management
plans directed to each trial analysis may be necessary. This could involve
identifying the specific data items and case report forms required for each analysis
and directing the investigator sites to prioritize certain case report forms for
completion or responding to certain data queries.

Ultimately, Database Lock will be the culmination of many years' work in
collecting and cleaning the clinical trial data to ensure the quality of the trial data
for the trial analyses. If this is not well managed, the timeliness of availability and
quality of the data in the run-up to database lock can adversely impact the timing of
the trial analyses, reporting, and publication.

Box 5 Preparatory Steps for Database Lock

Typical checks in the run-up to database lock include ensuring that:

• Data items have been received and where they have not, there is a
documented reason why.
• Discrepant data have been queried with the investigator site and the queries
have been resolved.
• All on-site and remote monitoring activities have been completed as per the
Trial Monitoring Plan and any outstanding issues have been resolved.
• The essential documents held at the investigator site are complete in
case of future audit or inspection.
• The site investigator(s) have confirmed the accuracy and completeness of
key data from their site.
• Any data coding (e.g., of free-text fields and adverse events) has
been completed.
• The linkage, cleaning, and reconciliation of datasets generated by other
collaborators or parties (e.g., laboratories, routine data providers)
have been completed.
• Where required by the trial protocol, independent verification/adjudication
of outcome measures, for example, interpretation of clinical results has
been completed.
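
Several of the checks in Box 5 lend themselves to simple automated reports. The sketch below (Python with pandas; every column name and value is a hypothetical example, not a standard CRF layout) counts items missing without a documented reason and unresolved queries:

# Illustrative pre-lock completeness report; columns and data are
# hypothetical examples, not a standard CRF layout.
import pandas as pd

crf = pd.DataFrame({
    "participant_id": ["001", "002", "003"],
    "primary_outcome": [1.2, None, 3.4],   # None = item not yet received
    "missing_reason": [None, None, None],  # documented reason, if any
})
queries = pd.DataFrame({
    "participant_id": ["002", "003"],
    "status": ["open", "resolved"],
})

missing = crf[crf["primary_outcome"].isna() & crf["missing_reason"].isna()]
open_queries = queries[queries["status"] == "open"]

print(f"Items missing without a documented reason: {len(missing)}")
print(f"Unresolved data queries: {len(open_queries)}")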

Operationalizing Database Lock


How final “database lock” is practically implemented can differ between organi-
zations and will usually be specified in a Standard Operating Procedure. Practical
implementation can also differ depending upon the design of the trial; for example,
there may be interim or other analyses prior to the End of Trial which require the
database to be temporarily locked, a snapshot of the database to be taken at that
time, and then the database unlocked in order that data collection and cleaning can
continue. This is a feature of some adaptive trial designs (e.g., multi-arm, multi-
stage trials, platform trials) and one which is likely to require careful consideration
when determining the End of Trial definition in the protocol for those types of trial
design.
Database lock is usually carried out through the temporary or permanent
revocation of access rights to the trial database system by those individuals
who are responsible for data entry and cleaning activities. At the time of full
database lock, no further amendments are permitted to the database – the dataset
will be the final dataset used by the Statistician for the trial analysis. The
complexity of this exercise will depend upon the number of different individuals
and institutions involved and whether the trial uses paper or electronic data
capture technologies.
In common with all previous stages of trial closure, clear responsibilities and
communication are critical in achieving a successful database lock, and the institution
responsible for data management for the trial will take responsibility for this. For
trials with large and complex datasets it may be necessary to implement a step-wise
approach to final database lock, such as:

– Halting further data collection at investigator sites and focusing only on data
query responses
– Where remote data entry systems are in use, locking individual data collection
forms, participants, or investigator sites to prevent further data entry or cleaning
activities at the investigator site level, whilst continuing to permit time-limited
cleaning activities by the organization or individual responsible for data cleaning
(e.g., a Data Manager).

For data integrity reasons, it is important to ensure that, when locking the database
in a remote data entry system, each individual research site retains read-only
access to the database for their data (MHRA 2018).
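
One way to picture the step-wise approach is as a ladder of access states applied per form, participant, or site. The sketch below (Python; the state names and permission model are purely illustrative and do not describe any particular EDC system) ends with sites retaining read-only access to their own data, consistent with the data integrity expectation above:

# Illustrative step-wise lock as a ladder of access states; names are
# hypothetical and do not describe any particular EDC system.
LOCK_STEPS = [
    "open",              # data entry and cleaning at sites
    "queries_only",      # no new site data entry; query responses only
    "central_cleaning",  # site access locked; data managers finish cleaning
    "locked_read_only",  # full lock; final dataset handed to the Statistician
]

def permissions(state):
    return {
        "site_can_enter_data": state == "open",
        "site_can_answer_queries": state in ("open", "queries_only"),
        "data_manager_can_edit": state != "locked_read_only",
        "site_can_read_own_data": True,  # retained at every step (MHRA 2018)
    }

for state in LOCK_STEPS:
    print(state, permissions(state))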

Stage 4 Trial Reporting and Publishing

Trial Reporting
The End of Trial starts the clock for reporting the summary results of the trial; the
typical expectation being that these are reported onto the relevant public registry/
regulator portal within 12 months of the End of Trial date in the United Kingdom and
European Union, and within 12 months of the Primary Completion Date (onto
ClinicalTrials.gov) in the United States (ClinicalTrials.gov 2017). There may be
exemptions to the requirement or timeframe for reporting certain trials, for example,
to protect commercial interests. In some countries there are also financial penalties
for delays to reporting or for not submitting the report.
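
As a worked example of the reporting clock (Python; the dates are hypothetical, and the 12-month window simply reflects the typical expectation described above):

# Illustrative deadline calculation; dates are hypothetical examples.
from datetime import date

def add_months(d, months):
    month = d.month - 1 + months
    year = d.year + month // 12
    month = month % 12 + 1
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    days = [31, 29 if leap else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return date(year, month, min(d.day, days[month - 1]))

end_of_trial = date(2020, 6, 30)        # UK/EU anchor: last visit/last data item
primary_completion = date(2020, 3, 31)  # US anchor: Primary Completion Date

print("UK/EU summary results due by:", add_months(end_of_trial, 12))
print("US summary results due by:", add_months(primary_completion, 12))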
The content of the End of Trial Report (sometimes called the Clinical Trial
Summary Report) will take the form dictated by the regional/country regulator
or independent ethics committee and will typically include information such as:

– Title of the trial and key objective(s)
– Key dates and milestones
– Any substantial amendments
– Results including safety data

Reporting summary results via an End of Trial report onto a public registry is an
important mechanism to support clinical trials transparency (Fig. 1). Within the United
Kingdom and European Union, emphasis is also placed on the importance of providing
trial results in an appropriate format to the participants of the trial and the wider general
public via a lay summary (European Commission 2017; HRA 2014).

Publication and Dissemination


It is typical to see publication policies established at the beginning of the trial and
documented in the trial protocol and contracts. Toward the end of the trial the
Sponsor, Chief Investigator or other institution with responsibility for performing
the trial analyses will prepare a publication and dissemination plan. The publication
and dissemination plan will set out a schedule of key outputs (typically publications
in peer-reviewed scientific journals and presentations but may include other public-
ity outputs such as via social media or other media outlets). For each output, a
suitable individual will be identified to lead it. In some trials there may be only one
or two planned publications, whereas in others there could be significantly more;
arising over future years as a result of sub-protocol and further exploratory analyses.
One of the most important aspects of dissemination is ensuring that participants
on the trial are informed of the trial results and that these are also made available to
other patient groups and communities. Engaging with patient groups and communities
can also be helpful in further disseminating the results of the trial.

Data “Sharing” for Further Research


The data generated from clinical trials are typically high quality and valuable
datasets, with the potential to be used for further research and hypothesis generation.
For legal or philanthropic reasons, many organizations now provide access to data
generated in their clinical trials, making data available for other high-quality
research projects. In most cases a controlled access approach is used, where an
oversight committee reviews applications made to access the clinical trials data and
makes a decision, based on the organization’s data release policy (Box 6), as to
whether the data will be released to the applicant. In some cases the committee
members are independent of the organization that owns the clinical trial data; in this
case the approach would be called an “independent controlled access approach”
(MRC 2015). Some regulators, such as the European Medicines Agency, also
publish anonymized clinical data submitted by pharmaceutical companies to support
their regulatory applications for human medicines (EMA 2016).
Given the highly sensitive nature of clinical trials data it is essential that any data
sharing takes place:

– In accordance with the expectations of the trial participants; for example, it is
good practice to use the participant information leaflet to inform trial participants
that further data sharing will take place in the future.
– In accordance with all relevant data protection laws.

In most cases, data will be released only in an anonymized form and, where
released to another organization, further protected by a legally binding data release
agreement.
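
The general idea of an anonymized release copy can be sketched as follows (Python with pandas; the field names are hypothetical, and real anonymization requires a documented re-identification risk assessment rather than simply dropping columns):

# Illustrative release-copy preparation; field names are hypothetical and
# this is not a complete anonymization procedure.
import pandas as pd
import secrets

trial_data = pd.DataFrame({
    "participant_id": ["001", "002"],
    "date_of_birth": ["1955-02-01", "1960-09-12"],  # direct identifier
    "outcome_score": [12.5, 9.8],
})

release = trial_data.drop(columns=["date_of_birth"])
release["participant_id"] = [secrets.token_hex(4) for _ in range(len(release))]
print(release)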
Making clinical trials data available for further research and hypothesis genera-
tion is an important mechanism to support clinical trials transparency (Box 1). It is a
regulatory requirement in some countries for certain types of trials and an approach
also supported by the International Committee of Medical Journal Editors, which
requires a data sharing statement to be included at the point of registering the trial on
a public registry (ICMJE 2019).

Box 6 Key elements of a data release policy

Recommended elements of a clinical trial data sharing policy:

• The scope of data sharing
• The request and decision process including guiding principles or criteria to
be used in making the decision
• The process for data release including preparing the data and associated
data pack and signing an appropriate contract

Data release criteria should include:

• The data release is lawful and in line with participant expectations.
• The data release is in line with all contractual and licensing agreements
including periods of exclusivity.
• The timing of data release will not adversely interfere with the integrity of
the trial objectives as set out in the approved protocol.
• The proposed research project has clear objectives and will use appropriate
research methods.
• The proposed research project will be carried out by a reputable organiza-
tion that can demonstrate appropriate IT security standards and will comply
with all relevant legal and contractual arrangements.
• The resources are available to satisfy the request.

Stage 5 Archiving

The documents collected throughout the life of a clinical trial which individually and
collectively permit the evaluation of the clinical trial and the quality of the data
produced are defined as essential documents (ICH 2016). These essential documents
serve to demonstrate the compliance of the Chief Investigator, Investigators, Spon-
sor, and other organizations, groups, and individuals involved in the conduct of the
trial with the standards of Good Clinical Practice (GCP) and with all applicable
regulatory requirements. They are therefore required to be archived for a period
defined by law (for regulated trials) or by the Sponsor (for all other trials) once the
trial has ended.

Planning for Archive


Clinical trials can be very complex and involve many different institutions; this
means the essential documents generated during a trial could be held in different
institutions, which may be in different countries, and the documents may be held in
either paper or electronic form. In a well-managed trial, how and where essential
documents are stored will have been planned and documented up front and orga-
nized in line with standard operating procedures dictating paper or electronic file
structures or by using an online document management system with a standard file
structure. A standard approach, particularly when used across all contributing
institutions in the clinical trial, mitigates the risk of documents being lost or
unavailable for audit or inspection during or after the trial, or documents being
archived or destroyed too early. This approach also makes planning for archive
significantly easier because documents are able to be easily located, collated, and
organized prior to being put into the archive.
The Sponsor, or other organization, responsible for trial conduct will manage the
overall planning for archive, which will typically involve:

– Nominating a suitable archivist – an individual qualified by training and
experience to manage the archiving of trial documents, agree the start and end
of archive period, maintain a record of what is archived and where, control access
to the archive or documents held therein and authorize destruction at the end of
the archive period.
– Identifying a suitable archive – where a third-party specialist archive company
is used, it may be deemed necessary to undertake vendor selection to verify the
suitability of the vendor’s facilities, systems, and processes and that these are in
line with relevant contractual or regulatory requirements.
– Communicating with other Institutions – notifying all relevant institutions
involved in the conduct of the trial of the archive period and providing authori-
zation to archive documents held at the various different institutions. Not all
documents and data need to be archived directly by the Sponsoring organization;
indeed, in order to ensure the integrity of the trial dataset, regulators expect that
source data from an investigator site are retained in the control of that
investigator site.

Challenges Associated with Archiving


Archive periods differ by country, region, and trial type but are typically lengthy
periods requiring retention of essential documents over several years. This leads to a
number of challenges when planning for archive in relation to:

– The high costs associated with storing significant quantities of paper documents
over long periods of time
– Storage life, compatibility and accessibility of data, software, and information
technology in the future
– Factors outside of the Sponsor’s control which might lead to an increase in the
archive period such as the risk of future litigation arising from the trial, where the
results are contentious/controversial or may inform national policy and so will be
likely to undergo greater independent scrutiny over an extended period of time
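
Sponsors therefore sometimes track the earliest permissible destruction date per country or document set; a minimal sketch follows (Python; the retention periods are hypothetical examples, not legal requirements for any jurisdiction):

# Illustrative destruction-eligibility tracking; retention periods are
# hypothetical examples, not legal requirements.
from datetime import date

RETENTION_YEARS = {"Country A": 5, "Country B": 15, "Country C": 25}
end_of_trial = date(2020, 6, 30)  # hypothetical End of Trial date

for country, years in RETENTION_YEARS.items():
    eligible = date(end_of_trial.year + years, end_of_trial.month,
                    end_of_trial.day)
    print(f"{country}: retain essential documents until at least {eligible}")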

Early Trial Closure

Many trial protocols and designs will include provisions for monitoring the safety,
efficacy, or feasibility of the trial at pre-scheduled time points to inform whether it is
safe, ethical, and feasible to continue the trial beyond that point. Examples include:

– Dose escalation review in phase I trials
– Interim safety or futility monitoring by an Independent Data Monitoring
Committee
– Recruitment feasibility
– New adverse safety information about the treatment under investigation which
significantly alters the risk/benefit balance of the trial and the safety of current or
future participants of the trial

The Sponsor may decide to close a trial early for various reasons, including poor
recruitment, withdrawal of the intervention, or safety/ethical concerns. The decision
to close a trial will usually include some form of independent oversight such as that
provided by an Independent Data Monitoring Committee or even in some cases the
regulator, depending upon the reason for closing the trial early.

Planning for Early Trial Closure

Although it is not always possible to predict exactly if and when a trial will close
early, it is usually possible to undertake some planning activities in the run-up to
protocol-defined stop/go points, such as dose escalation or interim analyses, in case
the decision is to temporarily or permanently halt the trial at that time. Early planning
allows each likely scenario to be worked through in advance and
can be particularly helpful where a plan of action will require rapid operationa-
lization to protect the safety of participants on the trial (Box 7). The short timeframes
involved when managing early trial closure arising as a result of interim safety or
efficacy monitoring, or new adverse safety information which alters the risk/benefit
balance of the trial and the safety of current or future trial participants, mean careful
preplanning activities cannot always take place; however, the typical planning
activities detailed in Box 7 would still be applicable.

Communicating with Trial Participants Following Early Trial Closure

The most important consideration following early trial closure is to assess the impact
on previous, current, and future participants of the trial and understand how their
further participation in the trial will be affected by the decision and how quickly.
Where a trial closes early as a result of a safety concern, the communication to site
investigators and participants would be expedited, with careful consideration of the
most appropriate mechanism for communicating any new information to participants
in a clear and sensitive way. Depending upon the risk involved, immediate action may
need to be taken to withdraw the trial intervention; changes to the intervention and
communication may need to be made to site investigators and participants immedi-
ately, that is, before seeking ethical or regulatory approval for the changes to be
made. In the European Union and United Kingdom, such actions are reportable after
the event as an Urgent Safety Measure, where ethical or regulatory approval for the
actions is obtained within a defined period of time after the action has been taken.
In all cases where a trial is closed early, participants should be provided with clear
information about what is happening, why, further treatment options (where appro-
priate), and ongoing follow-up data collection.

Box 7 Checks when preparing for early trial closure

• Where multiple institutions, groups, or individuals need to be involved in
the decision, establish a communications plan between the key decision
makers to facilitate timely decision making.
• Establish a communication plan to disseminate the decisions made and
resulting actions. This might include:
– The independent ethics committee and/or regulator (be aware that where
a trial is closing early for participant safety reasons this may constitute
an Urgent Safety Measure or require shorter timelines for End of Trial
notification)
– The funder
– Other institutions, groups, or individuals involved in the trial (informa-
tion about how the decision impacts their activities and funding contract)
– Previous, current, and future participants
– Trial websites and public registry information
• Consider whether a substantial amendment to the trial protocol or
supporting documents will be required and when.
• Consider whether urgent action needs to be taken to protect the safety of
participants on the trial and whether this is reportable to the independent
ethics committee and/or regulator (e.g., as an Urgent Safety Measure or
End of Trial notification).
• For blinded trials, determine whether it may be necessary to un-blind some,
or all of the trial participants.

Individual Site Closure

Trials usually include more than one investigator site; these trials are called multi-
center trials. Prior to the End of Trial, an individual investigator site may choose to
close, or be closed following a decision by the Sponsor, independent ethics commit-
tee, or regulator; the reasons for this are varied. This is called individual site closure.

Site closure is achieved at an individual investigator site when (at that trial site):

• Participant enrolment has stopped
• The data collection and cleaning activities are complete
• Any trial supplies have been accounted for and returned or destroyed, as per the
protocol and contract
• All trial payments have been made
• For remote data entry trials, the site staff have been provided with a copy of the
data for that investigator site
• Any final monitoring visits have taken place and issues arising have been
closed out
• The Investigator Site File is complete

Summary and Conclusion

In a well-managed trial, preparation for trial closure is key to ensuring appropriate
control over the final stages of recruitment and data cleaning for database lock and
analysis. Preparation for trial closure can be supported through a comprehensive
Trial Closure Plan which considers communication, project management including
key milestones, and the roles and responsibilities of each organization, group, or
individual involved in the conduct of the trial.
There are a number of important trial closure milestones which require careful
monitoring and control, such as the end of recruitment, to prevent too many partic-
ipants from being recruited. From a regulatory perspective, possibly the most important
milestone is that of the End of Trial, which is a key ethical and regulatory milestone
defined in the approved trial protocol and has associated time-dependent notification
and reporting requirements to the independent ethics committee and, for regulated
trials, the regulator(s).
Notably, several of the key steps of trial closure – reporting, publishing, and data
sharing – are important mechanisms to support transparency in clinical trials.

Key Facts

– Trial closure refers to the activities which take place in preparation for the
cessation of trial recruitment through to archiving of the trial.
– Trial closure can be notionally divided into five stages: End of Recruitment; End
of Trial Intervention; End of Trial; Trial Reporting and Publishing; and
Archiving.
– The End of Trial is a key ethical and regulatory milestone defined in the approved
trial protocol and has associated time-dependent notification and reporting
requirements to the independent ethics committee and, for regulated trials, the
regulator(s).
– Registering, reporting in a public registry, publication of trial results, and making
clinical trial data available for further research purposes are important mecha-
nisms to support transparency in clinical trials.

Cross-References

▶ Administration of Study Treatments and Participant Follow-Up
▶ Archiving Records and Materials
▶ Data and Safety Monitoring and Reporting
▶ Data Capture, Data Management, and Quality Control
▶ Single Versus Multicenter Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Institutional Review Boards and Ethics Committees
▶ Interim Analysis in Clinical Trials
▶ Investigator Responsibilities
▶ Long-Term Management of Data and Secondary Use
▶ Participant Recruitment, Screening, and Enrollment
▶ Regulatory Requirements in Clinical Trials
▶ Responsibilities and Management of the Clinical Coordinating Center

References
ClinicalTrials.gov (2017) FDA 42 CFR Part 11 Final Rule for Clinical Trials Registration and Results Information Submission. Available via https://prsinfo.clinicaltrials.gov/. Accessed 17 October 2020
European Commission (2017) Summaries of Clinical Trial Results for Laypersons. Available via https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-10/2017_01_26_summaries_of_ct_results_for_laypersons.pdf. Accessed 17 October 2020
European Medicines Agency (2016) Clinical data publication. Available via https://www.ema.europa.eu/en/human-regulatory/marketing-authorisation/clinical-data-publication. Accessed 17 October 2020
International Committee of Medical Journal Editors (2019) Clinical Trials Registration and Data Sharing Policies. Available via http://www.icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.html. Accessed 17 October 2020
International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) (2016) Guideline for Good Clinical Practice. Available via https://www.ich.org/page/efficacy-guidelines. Accessed 17 October 2020
Medical Research Council/Department of Health/Medicines and Healthcare products Regulatory Agency (2012) Risk-adapted approaches to the management of clinical trials of investigational medicinal products. Available via https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/343677/Risk-adapted_approaches_to_the_management_of_clinical_trials_of_investigational_medicinal_products.pdf. Accessed 17 October 2020
Medicines and Healthcare products Regulatory Agency (2018) ‘GXP’ Data Integrity Guidance and Definitions. Available via https://mhrainspectorate.blog.gov.uk/2018/03/09/mhras-gxp-data-integrity-guide-published/. Accessed 17 October 2020
MRC Hubs for Trials Methodology Research (2015) Good Practice Principles for Sharing Individual Participant Data from Publicly Funded Clinical Trials. Available via https://www.methodologyhubs.mrc.ac.uk/files/7114/3682/3831/Datasharingguidance2015.pdf. Accessed 17 October 2020
Official Journal of the European Union (2010) Detailed guidance for the request for authorisation of a clinical trial on a medicinal product for human use to the competent authorities, notification of substantial amendments and declaration of the end of the trial. Available via https://ec.europa.eu/health/documents/eudralex/vol-10_en. Accessed 17 October 2020
UK Health Research Authority (2014) Information for participants at the end of a study: Guidance for Researchers. Available via https://www.hra.nhs.uk/media/documents/information-participants-end-study-guidance-researchers.pdf. Accessed 17 October 2020
World Health Organisation (2012) International Standards for Clinical Trial Registries. Available via https://apps.who.int/iris/bitstream/handle/10665/76705/9789241504294_eng.pdf?sequence=1. Accessed 17 October 2020
19 International Trials

Lynette Blacher and Linda Marillo
Frontier Science Amherst, Amherst, NY, USA
e-mail: [email protected]

Contents
Introduction
Background
Challenges of Conducting Trials Internationally
Trial Coordination
  Procedural Differences
  Regulatory Approval
  Investigational Medicinal Product Supply
  Bio-materials
  Monitoring/Auditing
Data Management
  Enrollment
  Trial Designs and Populations
  Data Collection Strategies
Mitigation of Issues
European Union General Data Protection Regulation
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
The number of international clinical trials being activated has increased greatly
over recent years. There are valid reasons for this global expansion, including a
need for greater numbers of subjects enrolled in as short a time as possible,
application to diverse populations, and the potential for cost reduction. However,
conducting trials internationally involves its own set of challenges related to
every aspect of trial conduct, from site activation to data management. Challenges
are primarily related to cultural, procedural, and regulatory differences between
countries. The new General Data Protection Regulation (GDPR) (eugdpr.org/) in
Europe is also affecting the conduct of these trials. This chapter details these
challenges, as well as offering possible mitigation strategies.

Keywords
International clinical trials · General Data Protection Regulation (GDPR) ·
Regulatory approval · ClinicalTrials.gov

Introduction

Conducting clinical trials internationally involves many of the same components as
conducting a trial in the United States, or in any single country. The trial requires
regulatory approval, the protocol and case report forms (CRFs) need to be devel-
oped, investigational drug needs to be acquired, sites need to be selected, data need
to be captured accurately, bio-specimens need to be obtained, and monitoring and
auditing need to be performed. The critical aspects of conducting trials internation-
ally are the cultural, procedural, and regulatory differences between countries. This
chapter addresses the rationale for conducting trials globally; the challenges posed
by these cultural, procedural, and regulatory differences, as well as the challenges
resulting from implementing the new European General Data Protection Regulation
(GDPR) in the international clinical trials arena; and possible mitigation strategies.

Background

Expanding the conduct of clinical trials to the international setting has increased
greatly over recent years. ClinicalTrials.gov, a database of privately and publicly
funded clinical trials conducted globally, lists 298,104 registered clinical trials
conducted in 209 countries, with almost 16,000 trials including both US and non-
US participants, and over 143,000 non-US trials (clinicaltrials.gov) (Fig. 1).
There are several benefits to conducting trials globally. Conducting trials in
multiple countries allows access to a greater number of potential trial participants.
This is particularly important now that research has become more targeted – in other
words, tailored to a population with specific characteristics (such as a certain gene
combination). Identifying participants with these characteristics may be challenging,
and expanding the potential pool to multiple countries helps to alleviate this issue.
Having access to a larger participant pool should speed up recruitment, which in
turn should lead to quicker realization of trial results, and ultimately benefit to the
greater population. Having participants from multiple countries allows greater diver-
sity in terms of ethnicity and disease characteristics and susceptibilities, allowing the
results to apply to a broader population. In some cases, participants that may not
have had access to a certain treatment in their country can benefit by trial participa-
tion (Minisman et al. 2013).

Fig. 1 Percentage of registered studies by location, as of February 22, 2019 (total of
298,104 studies): non-US only, 48%; US only, 35%; both US and non-US, 5%; location
not provided, 12%
Additionally, in theory, the cost of conducting the trial should decrease with
quicker recruitment. Bringing a new drug to market can cost between $161 million
and $2 billion (Sertkaya et al. 2014). A reduction in these expenses would make
resources available for additional or new research.

Challenges of Conducting Trials Internationally

Challenges can present themselves in all areas and stages of the trial. This chapter
will focus on the areas of Trial Coordination and Data Management.

Trial Coordination

Trial coordination refers to the oversight and coordination of the logistics of trial
activities. Challenges can arise in several areas.

Procedural Differences

Earlier in this chapter it was discussed that conducting trials internationally could
potentially reduce costs. However, costs can vary between different countries,
making it much more expensive to include certain countries rather than others.
This can be attributed to many factors, including research staff salaries, equipment
expenses, fees for submission to Ethics Committees, and many more.
Infrastructure can also vary between countries. Areas that can pose challenges
should be evaluated when considering sites from a country for participation (Garg
2016). For example, do they have access to the proper equipment; is the equipment
in good condition and of the needed standard? Do they have appropriate storage
procedures and facilities for the study drug, including a secured area, a reliable
freezer, and a tracking process for receipt, distribution, and destruction or return of
drug?
Quality standards must also be evaluated, as there can be variance between
countries. Do the normal SOPs and procedures in place meet the standard expected
for the trial? Do staff receive adequate training and guidance during trial conduct?
Are staff fully qualified and skilled to perform the necessary procedures and docu-
ment the research?
And most importantly, does the standard of care in the country lend itself to the
trial requirements (Bogin 2016)? If certain procedures dictated by the protocol are
not standard, will the principal investigator and research staff be able to perform
them? Will enough participants be willing to undergo nonstandard procedures?

Regulatory Approval

The area that presents one of the more time-consuming challenges is “activating” a
site to be able to participate in a trial. Activation involves many steps, including
obtaining approval of the protocol and its related documents from regulatory bodies
for each site. Various regulatory bodies may be involved including Ethics Commit-
tees (ECs)/Institutional Review Boards (IRBs), which are independent bodies that
review and approve/disapprove research proposals for human participants. For
certain countries, the trial protocol may also need to be submitted to/receive approval
from Competent Authorities (CA)/Health Authorities (HA) (authorities that review
submitted clinical data and those that conduct inspections), Data Protection Agen-
cies (authorities responsible for upholding the right of data privacy), and/or individ-
ual Hospital Management. Additional review may be needed depending on the trial
treatment or procedures (e.g., if the trial involves radioactive substances or trans-
plant) (campus.ecrin.org).
There may be multiple ECs/IRBs involved as well. Some countries (or states/
regions) have a Central or Lead EC/IRB that performs complete review of the
protocol, and the local ECs can adopt the decision of the Central/Lead EC. In
other cases, local EC approval may be required in addition to the approval of the
Lead EC, though this is usually a simplified review involving site-specific aspects.
Submission to ECs and CAs may occur in parallel or sequentially, depending on
the country. The timeline for review and the fees associated with submission also
vary. Even the submission platforms are different, from paper to CD to entry in an
official database. Table 1 demonstrates these variances between a subset of European
countries (campus.ecrin.org):
To illustrate the process in more detail, we will take Switzerland as an example.
The CA for Switzerland is Swissmedic. (An additional CA, Bundesamt für Gesund-
heit (BAG)/Federal Office of Public Health (FOPH), is involved for trials with
radioactive substances or transplant products.) The EC, Swissethics, is an associa-
tion of the 9 cantonal (i.e., regional) ECs within Switzerland. Some cantonal ECs are
Table 1 Country-specific clinical trials submission

Country            | HA/CA | National/central/lead EC | Local/regional EC | DPA | CA submission platform                                           | General EC timeline (days) | Review order
Austria            | X     | X                        | X                 |     | CD/USB with hard-copy cover letter                               | 60                         |
Belgium            | X     | X                        | Local             | X   | Electronic file with hard-copy cover letter                      | 28                         | Parallel or sequential
Denmark            | X     |                          | Regional          | X   | Online portal, or CD with hard-copy cover letter, or email       | 60                         | Parallel or sequential
Hungary            | X     | X                        |                   |     | 2 copies on CD                                                   | 60                         | Parallel
Italy              | X     | X                        | Local             |     | Online portal                                                    | 60                         | Parallel
Portugal           | X     | X                        |                   | X   | CD with hard-copy cover letter                                   | 30                         | Parallel or sequential
Serbia             | X     | X                        |                   |     | Hard-copy of all, or electronic file with hard-copy cover letter |                            | EC then CA
Switzerland        | X     | X                        | X                 |     | Hard-copy in binder and CD                                       | 45                         | Parallel
United Kingdom (a) | X     | X                        |                   |     | Online portal                                                    | 30                         | Parallel or sequential

(a) The National Health Service Research and Development Forum is also involved in the approval process

responsible for several cantons. The submission application is submitted electroni-
cally through a portal to the Lead EC (the EC responsible for the site of the
coordinating investigator) to check for completeness. At the same time the applica-
tion is submitted to the ECs concerned for the participating sites, which evaluate
local aspects. The Lead EC performs a complete review and informs the applicant,
local ECs, and CA of its approval/disapproval. Local ECs can agree/disagree with
the Lead EC’s decision and may also add minor site-specific additions (campus.
ecrin.org).

Table 2 Research groups and sites

Research group | Location | Sites
Breast cancer trials | Australia | Australia, New Zealand
EORTC: European Organization for Research and Treatment of Cancer | Belgium | Europe, Africa
IBCSG: International Breast Cancer Study Group | Switzerland | Europe, Australia, South America, India, New Zealand, Africa
GOIRC: Gruppo Oncologico Italiano di Ricerca Clinica/Italian Oncology Group of Clinical Research | Italy | Europe (Italy only)
JBCRG: Japan Breast Cancer Research Group | Japan | Japan
SOLTI: Grupo Español de Estudio, Tratamiento y Otras Estrategias Experimentales en Tumores Sólidos | Spain | Europe

One often thinks of international trials being conducted by pharma or industry,
but it is important to note that Research Groups are involved in these trials as well.
There are several Research Groups that act as a sponsor of clinical trials and/or
participate in these trials. Research Groups are responsible for protocol develop-
ment, trial management, and coordinating and overseeing the participation of several
sites, which may be in the country of the Research Group, or may be located within
multiple countries. Table 2 demonstrates these variances between a subset of these
groups in the breast cancer arena.
Each Research Group has its own procedures in place for trial conduct. Not only
can there be group-specific variances in logistics, but country-specific variances
within the groups as well. Responsibilities and the scope of work must be clearly
defined between the sponsor (if the Research Group is not sponsor), Research
Groups, and participating sites. Some items to consider: Will communication be
through the group or directly to the sites? Will the sites perform their own enroll-
ments, or will the group act on their behalf? Will the group monitor their sites, or will
this be the responsibility of the sponsor? If the sponsor is working with multiple
Research Groups, the scope of work could vary greatly between them.

Investigational Medicinal Product Supply

The process to provide an Investigational Medicinal Product (IMP) to the site, and
ultimately the participant, involves a series of steps, as well as several parties. This
process is referred to as the clinical supply chain and follows the IMP from the
manufacturer, through distribution center and local depots, to sites and participants.
It is estimated
that supply chain logistics account for 25% of pharmaceutical research and devel-
opment costs, in part due to the globalization of trials (Fisher Clinical Services).
Adding an international component can increase the complexity of the process, and
issues can develop at or between any of these locations (Arnum 2011).

Appropriate logistics are essential to ensure the timely delivery of the IMP and
any comparator product (current standard of care therapy). All parties involved must
have the knowledge, experience, and footprint to meet the needs of a global setting.
Any delay or shortage can cause delay of the start of a trial, or potentially even halt
an ongoing trial. This, in turn, can affect the well-being and safety of participants.
It is important to start planning early, but not too early. Logistics should
be discussed while the protocol is under development (Fisher Clinical Services).
However, implementation should begin once it is relatively certain there will not be
changes to the protocol and contracts with Research Groups or sites. For example, if
a country decides not to participate, or a new country is added, and you have already
planned the labels, distribution routes, and depots, these areas will need to be re-
planned. Labels that included IMP dose would also need to be changed if the IMP
dosage was changed in the protocol.
Logistical hurdles are many. Differing regulations between countries impact all
areas of the process, and consideration must be given to a variety of factors:
Availability of IMP – Sometimes a drug may have approval for the indication in
certain countries and not in others. This could limit the number of countries that
participate in the trial, as the patients already have access to the drug. This may result
in slower recruitment, or not enough potential patients to conduct the trial. In other
cases, the IMP may receive indication approval in one or more countries during the
conduct of the trial, and these sites may cease recruitment.
There can also be the special case when a trial has concluded and the participant is
still doing well, but drug is no longer supplied by the trial. In countries where it has
been approved for the indication, the participant will be able to receive drug through
standard mechanisms. In countries where the IMP does not have indication approval,
an avenue of compassionate use may have to be pursued for these participants.
Compassionate use allows individuals who are seriously ill and have no standard
treatment options available to be treated with an IMP.
Forecasting – Underestimating need for drug (e.g., in case of faster recruitment
or higher retention than expected) leads to participants without supply, whereas
overestimating (in case of slower recruitment or lower retention) leads to unused
material that is ultimately wasted, and again, is a cost issue. The differing infrastruc-
ture and working patterns within the countries can impact forecasting as well (Fisher
Clinical Services). If proper procedures are not in place, the sites may not relay the
necessary supply information to the sponsor and/or supplier to accurately calculate
the IMP need.
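
A toy forecasting calculation (Python; every figure is a hypothetical assumption) illustrates how recruitment and retention assumptions drive the supply estimate, and why errors in either direction are costly:

# Illustrative IMP demand forecast; all figures are hypothetical assumptions.
enrolment_per_month = 10      # expected new participants per month
recruitment_months = 12       # planned recruitment period
treatment_months = 6          # protocol-defined intervention period
packs_per_month = 1           # packs dispensed per participant per month
retention = 0.85              # assumed proportion completing treatment
overage = 0.20                # buffer for loss, damage, and re-supply lead times

participants = enrolment_per_month * recruitment_months
packs_needed = participants * retention * treatment_months * packs_per_month
order_quantity = round(packs_needed * (1 + overage))

print(f"Forecast demand: {packs_needed:.0f} packs; order {order_quantity}.")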
Package Labeling – When planning labeling, participating countries must be
selected early enough to ensure they are included on the IMP label in time for
printing. The proper languages for each country must be determined, and translations
validated. Additionally, authorities in each country may require specific terms to be
used (Miller 2010).
If there are multiple countries, booklet labels to hold the volume of information
may be useful, though they require additional time for production and printing.
“Back-up” countries should also be anticipated in case one or more countries decide
not to participate. Should this happen and no back-up countries have been planned,
the labels/booklets would have to be revised and reprinted. Alternatively, a separate
label could be created only for the new country(ies); however, this requires an
additional supply pool. Either avenue can be quite costly.
There are several country-specific regulations regarding the information that
needs to be included in the labeling. For example, in some countries, the local
representative must be listed on the label/booklet. Sometimes even the comparator
may need re-labeling (Weyermann 2006).
Packaging and Shipping – Proper packaging has become even more important
due to the growth of biologics in research. It is necessary to transport and store
biologics at proper cold temperatures (this is called cold-chain logistics). Specialized
packaging is required, including insulated boxes and temperature-controlled con-
tainers (Fisher Clinical Services). The packaging materials must be assessed in
relation to the weather conditions (e.g., temperature and humidity) of each country,
including seasonal changes. Transport delays can occur at the border, with couriers,
and at the site. Country infrastructure can affect the amount of time it takes to
transport the medication to the site. In cases of longer timelines, it will be necessary
to select a shipping vendor which is validated to maintain the required temperature
for longer periods. Shipments also need to be planned around site working patterns,
holidays, and religious observances, so as not to deliver product when responsible
staff are not available to accept the package (Bioclinica).
Distribution – Multiple countries may translate to multiple depots. Whereas in
the European Union one depot can function for all countries, in many other areas of
the world, a separate depot is required for each country. This in turn leads to
additional depot costs and a separate supply for each country.
Regulations and Documentation – Regulations are continuously being updated
by governments and industry. Goods and service taxes to import drugs are levied
across borders and differ between countries. Suppliers must be aware of these taxes,
as well as country-specific documentation requirements. Original documents (rather
than copies) may be required for some countries, and specialized documents or
specialized versions for others, such as import permits and licenses (e.g., for Russia,
an umbrella license from the Ministry of Health is necessary to ship supply, and a
Certificate of Analysis is needed to grant the license). Another variance between
countries relates to the Importer of Record, which is the legal body responsible for
ensuring imported goods comply with local laws and regulations. In some countries
only the sponsor can serve in this capacity; in others a Clinical Research Organiza-
tion, distributor, or site may fill this role (Fisher Clinical Services).
Additionally, if the process for Ethics Committee/Competent Authority approval
of the trial is extensive and time-consuming in some countries (e.g., it can take 12–
16 months for approval in China) (George Clinical 2016), the expiration date of the
IMP (and comparator, if applicable) must be considered. It must be ensured that the
drug will be viable for the treatment of the participants.
Destruction and Return – At the end of trial, and upon expiration of drug, the
IMP must be destroyed or returned to the sponsor. Which of these options is chosen
and how it is accomplished can vary greatly by country due to country-specific
regulations. Destruction can be done at the trial site or off-site. Some sites are not
allowed to destroy product locally. If drug must be returned to the sponsor or
dedicated vendor (e.g., pharma or supplier), an importation license into the country
of destruction is usually required (Global Health Trials). Additionally, a certified
carrier will need to be hired to transport the drug. Some countries, such as Serbia,
may mandate that drug is destroyed at a government-regulated location. Involving
additional specialized companies and government authorities makes the return
process more complex and costly (Mongan 2016).

Bio-materials

Collection of bio-materials (e.g., pathological tissue, blood serum, and plasma) is a
component of many international trials. These materials may be required for:

• Central review: A central laboratory designated by the sponsor reviews materials
to ensure the materials were categorized correctly by the local laboratory; this is
often a factor in confirming eligibility.
• Translational research: Often referred to as “bench to bedside,” this term refers to
using results of research performed in laboratories to develop new methods of
diagnostics and new therapies, and translating findings from clinical trials to
everyday practice.
• Future research: In addition to the primary use of bio-material for the trial,
material may be requested for future yet unknown research projects.

Challenges related to use of these bio-materials include:

• Obtaining materials: Each trial will have specific requirements for the materials.
However, each country has specific regulations regarding the type of materials
that can be provided to another entity, and specifications may differ even between
hospitals/medical institutions within the same country. For example, a block of
tumor may be required for review, analysis, and/or bio-banking for a trial, but
sites in some countries may be permitted to provide only slides of the tumor material. In
this case the material would not be sufficient to meet the requirements of the trial,
and patients from these countries would be excluded from participation.
• Shipping materials: Materials are required to be sent to a central laboratory or
biobank. As was seen with the IMP supply chain process, issues may arise due to
the shipping regulations for each country, including special requirements; for
example, the Ministry of Health must provide permission in Australia, Russia,
and Brazil (export.gov; Fisher Clinical Services). Some countries, such as China,
do not allow export of bio-materials. There can also be confusion regarding the
classification of the material; for example, a courier may mistakenly believe that
pathology material is hazardous.
• Retaining materials: Length of pathology material storage differs by trial. Mate-
rials may be needed only for central review (in which case the materials would be
returned after a specified period), or for future use (in which case materials would
be stored indefinitely). Some countries and institutions do not allow indefinite
storage and require materials to be returned within a specified timeframe.

Monitoring/Auditing

The goals of site monitoring and auditing are to ensure:

• Compliance with Good Clinical Practice (GCP) and regulatory requirements (ich.org)
• Compliance with the trial protocol and procedures
• Accurate and timely data collection
• Appropriate facilities, staff qualifications, and investigator oversight
• Communication between stakeholders
• Protection of patient safety and well-being

Although monitoring and auditing have the same goals, there is a key difference
between them, in that monitoring is a quality control function and auditing is a
quality assurance function. Monitoring refers to the performance of ongoing over-
sight and operational checks to verify processes are working as intended and in
accordance with the protocol, standard operating procedures (SOPs), GCP, and the
applicable regulatory requirements. Auditing refers to the systematic and indepen-
dent examination of all trial-related activities and documents, to determine if they
were conducted according to the protocol, SOPs, GCP, and the applicable regulatory
requirements. An audit is designed to improve the effectiveness of processes
(Ruppert 2007).
The international setting presents several challenges common to both monitoring
and auditing. Both processes rely heavily on communication, and language barriers
can be an issue. If the trial is not conducted in a single primary language, a
monitor/auditor may need to be fluent in multiple languages in order to cover
multiple countries; these qualified staff may be more costly and/or more difficult to
find. If multilingual staff are not available, alternatives include hiring several mon-
itors/auditors, each fluent in a different language, or hiring translators.
Scheduling visits may be problematic due to work patterns, holidays, and
religious observances. Additionally, the need to be sensitive to cultural differences
is even more important in the type of face-to-face interaction that takes place
during an on-site monitoring or audit visit. Monitors and auditors also need to be
aware of country-specific regulations regarding how GCP is interpreted and
implemented.
Site monitoring can account for up to 40% of the cost of a clinical trial (Sprosen
2017). The cost of on-site visits particularly can be exacerbated in the international
arena due to the extensive travel. There has been a shift to a risk-based monitoring
approach due to the increased number, complexity, and globalization of clinical trials
(Beauregard et al. 2018). This approach involves assessing the risks to the trial, their
impact, and their mitigation when defining the monitoring strategy. The goal is to focus on critical areas that relate to
patient well-being, safety, and privacy. A risk-based monitoring plan often employs
the use of more centralized monitoring (i.e., remote evaluation) where appropriate, to
reduce the frequency and cost of on-site visits.
Although there are several benefits to central monitoring, one potential drawback
can be that site personnel may believe they are not receiving the same level of
support as provided during an on-site visit. It is much easier to develop a rapport
through face-to-face interaction than through phone calls and email.
The concept of the remote “visit” has extended to auditing as well (Cobert 2017).
Several aspects require up-front preparation, including:

• Communications: Remote audits are generally conducted via teleconference;
therefore, local information technology support needs to be confirmed for
the length of the audit, and communication should be tested in advance.
Videoconferencing could also be employed to enhance the communication.
• Documentation: A means to access documentation (e.g., essential, regulatory,
and Investigator Site File documents, as well as SOPs and any other relevant
documents) needs to be determined. Possibilities include uploading documents to
a Trial Master File (TMF) structure within a web-based content management
system or, ideally, into an eTMF if available. This would be done in
advance of the audit. Scans of source data or other necessary paper documents
would be included as well.
• Facilities: In order to review facilities, creative measures, such as the use of
digital photos or a video tour of the work environment, could be employed.
• Systems and processes: Screen-sharing technology could be used to demonstrate
computer systems and processes.

Though the cost of travel is reduced with remote monitoring and auditing, there
are still inherent costs in arranging and conducting remote visits.

Data Management

Challenges and costs can arise during data management activities as well.

Enrollment

Time and Date


Given the time differences between the various countries, the sponsor or data
collection agency may not be located in the same time zone as the participating
site or sites. To accommodate these differences, a series of questions can be asked
in the enrollment checklist regarding the current time at the site; the answers are then put
through validation checks to ensure the enrollment is happening within 24 h of the
completion of the checklist.
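A minimal sketch of such a validation check follows, assuming the site's reported local time and the data center's receipt time are both converted to UTC before comparison; the field names and the 24-hour window implementation are hypothetical.

```python
# Minimal sketch: verify enrollment occurred within 24 hours of checklist
# completion, comparing timezone-aware timestamps in UTC.
from datetime import datetime, timedelta, timezone

def enrollment_within_window(completed_utc, enrolled_utc, window_hours=24):
    """True if enrollment happened after, and within window_hours of, completion."""
    return timedelta(0) <= enrolled_utc - completed_utc <= timedelta(hours=window_hours)

# A site at UTC+7 completes the checklist at 09:00 local time (02:00 UTC)...
site_tz = timezone(timedelta(hours=7))
completed = datetime(2024, 3, 1, 9, 0, tzinfo=site_tz).astimezone(timezone.utc)

# ...and the enrollment is recorded at 20:00 UTC the same day.
enrolled = datetime(2024, 3, 1, 20, 0, tzinfo=timezone.utc)

print(enrollment_within_window(completed, enrolled))  # True (18 hours elapsed)
```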
Race and Ethnicity


The collection of race and ethnicity data is mandated by United States (US)
standards as put forth by the Food and Drug Administration (FDA) for clinical trials
conducted domestically and abroad (fda.gov). These minimum standards exist for US
Federal reporting purposes. The categories are social-political constructs only and
are not scientific or anthropological in nature. They are intended to provide a common
framework for uniformity and consistency in the collection and use of data on race
and ethnicity by US Federal agencies.
Race and ethnicity should be self-reported to the extent possible
and never presumptively completed by site staff. The participant may identify with
more than one racial group or choose not to report race or ethnicity.
Ethnicity choices are Hispanic/Latino or Not Hispanic/Latino.
Race options are based on the following primary racial groups:

• American Indian or Alaska Native: having origins in any of the original
peoples of North, Central, and South America, and who maintains tribal affiliation
or community attachment.
• Asian: having origins in any of the original peoples of the Far East, Southeast
Asia, or the Indian subcontinent, including Cambodia, China, India, Japan,
Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam.
• Black or African American: having origins in any of the black racial groups of
Africa.
• Native Hawaiian or Other Pacific Islander: having origins in any of the original
peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
• White: having origins in any of the original peoples of Europe, the Middle East,
or North Africa.

If additional granularity or more detailed characterizations of race or ethnicity are
collected to enhance understanding of the trial participants, the FDA recommends
these characterizations be traceable to the five minimum designations for race and
the two designations for ethnicity as listed above (US Food and Drug Administration
2016).
A hurdle with trials conducted in Sub-Saharan Africa, Brazil, and India is that
these guidelines cause confusion for participating sites and their diverse populations.
To address this, in trials conducted by the International Maternal Pediatric Adolescent
AIDS Clinical Trials Network (IMPAACT) and the AIDS Clinical Trials Group,
each trial site provided a list of acceptable racial categories for its anticipated
participants (impaactnetwork.org).
Since data are collected at the time of enrollment, these choices are provided only
to those specific sites. On the backend, the selections are collapsed into the five
primary racial categories as well as the designations of multiracial or choosing not to
report. The specific selections are still kept in the study database and available to the
study team if requested.
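As a rough illustration of this backend step, site-specific selections can be mapped to the primary categories while the originally reported value is retained. The mapping below is hypothetical and shows only the mechanics; the primary category names follow the FDA minimum designations plus the multiracial and not-reported designations described above.

```python
# Minimal sketch: collapse hypothetical site-specific race selections into
# primary categories while retaining the original selection for the study team.
PRIMARY_CATEGORY = {
    "Zulu": "Black or African American",
    "Xhosa": "Black or African American",
    "Tamil": "Asian",
    "Pardo": "Multiracial",
    "Chose not to report": "Not reported",
}

def collapse_race(site_value):
    return {
        "race_reported": site_value,                       # kept in the study database
        "race_primary": PRIMARY_CATEGORY.get(site_value),  # used for US Federal reporting
    }

print(collapse_race("Pardo"))
# {'race_reported': 'Pardo', 'race_primary': 'Multiracial'}
```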
Trial Designs and Populations

For the protocol design populations listed below, it is imperative to develop a
tracking table that identifies the links between the individuals (a sketch follows the
list). In many of these cases, the same treatment or observational arm will need to be
assigned to the pairs or groupings.

• Discordant couples
• Index Case/Households
• Index Case/Caregiver
• Perinatal (Mother/Child)
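A minimal sketch of such a tracking table follows, using a simple in-memory structure in which linked participants share a group identifier and carry the same arm; all identifiers and field names are hypothetical.

```python
# Minimal sketch: tracking table linking paired/grouped participants so that all
# members of a group are assigned the same treatment or observational arm.
from collections import defaultdict

links = []  # one row per participant, connected through a shared group_id

def link_participants(group_id, relationship, participant_ids, arm):
    """Record every participant in a group with the group's single assigned arm."""
    for pid in participant_ids:
        links.append({"group_id": group_id, "relationship": relationship,
                      "participant_id": pid, "arm": arm})

link_participants("G001", "Perinatal (Mother/Child)", ["M-1001", "C-1001"], arm="A")
link_participants("G002", "Discordant couple", ["P-2001", "P-2002"], arm="B")

# Consistency check: each group must carry exactly one arm.
arms_by_group = defaultdict(set)
for row in links:
    arms_by_group[row["group_id"]].add(row["arm"])
assert all(len(arms) == 1 for arms in arms_by_group.values())
```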

Unique to the IMPAACT Network are trials involving perinatal populations in sub-
Saharan Africa, Brazil, India, and Thailand. To meet the rigors of enrollment
and follow-up, the mother and her fetus generally enroll between 28 and 35 weeks
gestation. Both the mother and fetus are assigned a unique participant ID that keeps
personally identifiable information to a minimum. The fetus is automatically assigned
the same race and ethnicity as the mother. The fetus is considered on study at the time
of enrollment, but the clock does not start for the baby until birth, whereas data
collection on the mother begins from time of enrollment. If the birth outcome is not
viable, only the date of miscarriage or stillbirth is collected for the baby; all other
information is collected as adverse events for the mother (impaactnetwork.org).

Data Collection Strategies

The development of well-designed case report forms (CRFs) or electronic CRFs
(eCRFs) is integral to the collection of quality study data. These include eligibility
and screening logs, participant questionnaires, clinic staff completed study forms,
laboratory results, and adverse event reports. Data should be collected in the
language of the sponsor, DMC, or CRO, usually English or French. Data for
NIH-sponsored or NIH-supported studies are collected in English; the expectation is
that study site staff will understand and write in English.
The technical capabilities of participating sites will determine the best method to
collect the study data. If Internet access is limited or nonexistent, paper CRFs should
be provided to the sites and mailed back to the data center for centralized entry. The
central data center reviews the records and communicates via mail with the site staff
regarding errors and other questionable items.
When Internet access is at least minimally reliable, sites can key their own data
into a dedicated electronic data capture (EDC) package. With
appropriate training, site staff are willing and able to manage their own data, thus
allowing them real-time or near real-time access to their records. They are able to take
ownership of the quality of the data and respond to data queries in a timely manner.
Study CRFs are designed to be completed by site staff and incorporate only
elements necessary to meet trial design questions. These should be developed in the
primary language of the protocol and the protocol team members, usually English. Care
should be taken to minimize repetitive questions on separate CRFs, thus avoiding
potential inconsistencies between responses. The placement of questions should also
be considered to provide a logical flow of responses and grouping of like data elements.
Participant questionnaires should be presented in the language of the enrolled
participant. There are qualified translation services available but mostly for Euro-
pean, Chinese, and Japanese languages. For other ethnic or tribal languages, there
would be reliance on local staff to provide the translation and back translation,
utilizing separate staff to perform these activities. This can be tedious, repetitive, and
time-consuming for both the sites and for the DMC staff required to verify the
translations, but aids in maintaining a consistent method of presenting study con-
cepts and questions, and eliciting responses from participants. One problem to avoid
is asking open-ended questions that require free-text replies, which would have to be
translated back into English before being entered into the study database. Avoiding
free text is particularly important for maintaining privacy when collecting sensitive
information the participant would not expect to be shared with site staff.
Questionnaires may be collected on paper and submitted to the data center via
mail or facsimile for data keying, or through an online Internet package that prompts
the user to provide responses to the questions – these responses are saved and
downloaded to the database. Since the completed local language form will be
submitted, the formatting of the questions and responses should align with the
English versions to ensure the data are entered into the study database appropriately.
Laboratory test data CRFs should be designed to capture the units of measurement
used in the local laboratory, allowing each site to report results as collected
and measured. Data fields for results, and for the upper and lower limits of normal,
should be large enough to capture abnormally high or low values.
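A lab-result record along these lines might store the value, the local unit, and the local reference limits together, so that each site reports in its own units; the sketch below is hypothetical, showing the same analyte arriving in two different unit systems.

```python
# Minimal sketch: a laboratory result captured with local units and local
# normal limits; field names are hypothetical.
def lab_result(test, value, unit, lln, uln):
    flag = "low" if value < lln else "high" if value > uln else "normal"
    return {"test": test, "value": value, "unit": unit,
            "lln": lln, "uln": uln, "flag": flag}

# The same analyte may be reported in different units by different countries:
print(lab_result("hemoglobin", 13.2, "g/dL", 12.0, 16.0))
print(lab_result("hemoglobin", 132, "g/L", 120, 160))
```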
Ultimately, careful thought should go into the design of the data collection
instruments to mitigate confusion about the goals of the study and minimize repet-
itive questions. The database should be developed in conjunction with the designing
of the CRFs. A robust EDC system will have built-in validity and QA/QC checks,
whereas data submitted via paper will be centrally checked further downstream in
the data submission process.
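For example, a built-in validity check might combine a simple range check with a cross-field consistency check, raising a query for the site when either fails. The rules below are hypothetical illustrations, not rules from any particular EDC system.

```python
# Minimal sketch: two common EDC edit checks, a range check and a
# cross-field date-consistency check (rules are hypothetical).
from datetime import date

def check_record(record):
    queries = []
    weight = record.get("weight_kg")
    if weight is None or not (30 <= weight <= 250):
        queries.append("weight_kg missing or outside expected range (30-250 kg)")
    if record.get("visit_date") and record.get("enroll_date"):
        if record["visit_date"] < record["enroll_date"]:
            queries.append("visit_date precedes enroll_date")
    return queries

rec = {"weight_kg": 12, "enroll_date": date(2024, 5, 1), "visit_date": date(2024, 4, 28)}
for q in check_record(rec):
    print("QUERY:", q)  # both checks fire for this record
```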

Mitigation of Issues

At the beginning of this chapter, we noted that the potential for reduced cost was
one of the factors leading to the globalization of clinical trials. However, the
numerous challenges of the global arena come with costs of their own. One needs
to employ various mitigation approaches to balance these costs.
It is important to begin with a proactive risk-based strategy for conduct and
monitoring of the trial. This allows any potential hurdles to be identified in advance,
and plans put in place to reduce or eliminate their impact. A risk-based strategy will
also lessen the chance of emergency or crisis situations arising; and if they do arise,
there will already be a plan for addressing them.
Methods to mitigate risk include, wherever possible, creating simplified pro-
tocols, employing user-friendly data collection tools, and developing streamlined
procedures for trial activities. It is extremely important to train – and re-train – the
research team in these areas. This will ensure everyone has the same interpretation of
the protocol and procedures. Requiring a primary language be used for the conduct
of the trial, and mandating that all sites have at least one person on the trial team who
speaks this language, can also reduce the potential for misunderstandings.
The proper selection of partners and collaborators is also important for risk
mitigation. A site feasibility evaluation should be conducted before accepting a
site for the trial, and vendors should be thoroughly researched and vetted. Partnering
with experienced and dedicated collaborators allows for more efficient trial conduct.
A quality management system should be put in place to monitor each area of trial
conduct. The use of technology can greatly aid in this oversight. For example,
tracking systems can be utilized for IMP and bio-materials, and metrics reports
can be created for areas in site performance, such as length of time to activation, data
submission timeliness, query resolution, critical data items, protocol deviations, etc.
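Such metrics reports can often be reduced to a few simple per-site calculations; the sketch below, with hypothetical field names and dates, computes time to activation and mean query-resolution time.

```python
# Minimal sketch: per-site performance metrics (all data hypothetical).
from datetime import date
from statistics import mean

sites = [
    {"site": "101", "selected": date(2024, 1, 5), "activated": date(2024, 3, 1),
     "query_resolution_days": [3, 10, 7]},
    {"site": "202", "selected": date(2024, 1, 5), "activated": date(2024, 5, 20),
     "query_resolution_days": [21, 30]},
]

for s in sites:
    activation_days = (s["activated"] - s["selected"]).days
    print(f"Site {s['site']}: activated in {activation_days} days, "
          f"mean query resolution {mean(s['query_resolution_days']):.1f} days")
```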
The most important facet of risk mitigation, which should be started at the very
beginning of the trial, involves communication. It is critical to develop a rapport and
understanding with all stakeholders. Establishing a clear communication pathway is
key to the conduct – and ultimately the success – of the trial.

European Union General Data Protection Regulation

A chapter on international clinical trials would not be complete without some discus-
sion of the General Data Protection Regulation (GDPR). The GDPR came into effect
for the European Union (EU) in May 2018, replacing the previous EU Directive 95/46/
EC regarding data protection (eur-lex.europa.eu). The primary purpose of the regula-
tion is to harmonize data protection and privacy laws across EU countries. The
regulation covers the protection of natural persons with regard to the processing of
personal data and the free movement of such data, and applies to all organizations
that process personal data of EU subjects, even if the organization is not in the EU.
The main principles of the regulation emphasize transparency of data processing,
legitimate use of data, minimization of data collected (e.g., minimum required for
legitimate use), accuracy of data, security of data, subject consent to use of data, and
limitation for data retention (e.g., retain data only for the length of time required for
purpose of use) (eugdpr.org).
The GDPR strives to protect subjects by outlining their rights in regards to the
processing of their data. Rights of data subjects include:

• Right to Access – right to a copy of their data
• Right to Rectification – right to correct their data
• Right to Information – right to know how their data are being used
362 L. Blacher and L. Marillo

• Right to be Forgotten – right to have the data erased/destroyed
• Right to Restriction – right to restrict data processing
• Right to Portability – right to take data from controller and/or transfer to another
entity
• Right to Object – right to object to data processing (eugdpr.org/)

The GDPR also stresses the accountability of the data controller (person or entity
which determines the purpose and manner of processing personal data, for example,
a sponsor) and the data processor (person or entity that processes data on behalf of
the controller, for example, an organization responsible for quality control or
statistical analysis of the data), including strengthening enforcement and penalties
(Yeomans and Abousahl 2017). Each EU member state must appoint an independent
supervisory authority to enforce GDPR compliance; these authorities cooperate with
each other and report to the European Data Protection Board. The data subject has
the right to lodge complaints against these authorities and/or data controllers and to
receive compensation. Fines may be levied against member states and/or controllers.
Noncompliance with the GDPR can lead to fines of up to 10,000,000 EUR (20,000,000 EUR
for the most serious infringements) or a percentage of an organization's annual turnover (gdpreu.org).
There are many challenges in interpreting and implementing the regulation. The
sponsor and any other data controllers or processors must determine how to uphold
this regulation in the context of clinical trials. A first step is defining personal data,
which is considered to be data that relate to an identified or identifiable individual
(Advarra Regulatory Team 2018). The regulation applies to the processing of said
data if it is either processed in an automated manner, or processed in a nonautomated
manner such that it becomes part of a filing system (which is considered to be a
system organized by specific criteria).
If there are no identifiers that can link or relate the data to an individual, the data
can then be considered anonymized. Anonymized data are not considered personal
data. On the other hand, pseudonymized data are personal data that can no longer be
attributed to a specific individual without the use of additional information, but are
still considered personal data (eugdpr.org/). In order to determine whether pseudonymized
data are personal, it must be determined whether there is information or means
available to identify the participant, and whether these means/information are readily
available. In terms of a clinical trial, participant data are usually coded (e.g., participant
identification number, randomization code, site identification number) and
would therefore be considered pseudonymized (Advarra Regulatory Team 2018).
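To make the distinction concrete, the sketch below contrasts pseudonymization, in which re-identification remains possible through a separately held key, with anonymization, in which no such key exists. It is a simplified, hypothetical illustration of the concept, not a GDPR-compliance recipe.

```python
# Minimal sketch: pseudonymized vs anonymized trial data (illustrative only).
key_table = {}  # held separately from the trial data, with restricted access

def pseudonymize(record, participant_code):
    key_table[participant_code] = record["name"]  # the "additional information"
    return {"participant_id": participant_code,
            "site_id": record["site_id"],
            "age": record["age"]}

def anonymize(record):
    # No code and no key table: the record can no longer be linked to the person.
    return {"age_band": "40-49" if 40 <= record["age"] < 50 else "other"}

source = {"name": "A. Example", "site_id": "EU-07", "age": 44}
print(pseudonymize(source, "P-0042"))  # still personal data under the GDPR
print(anonymize(source))               # no longer personal data
```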
The concept of personal data applies not only to participants in clinical trials, but
also to employees of the sponsor, site staff, and collaborators (Gogates 2018).
The collection of names and contact information from these individuals is necessary
for the conduct of the trial. The GDPR does state that the legitimate interests of the
controller may provide a legal basis for processing data, especially if there is a
relevant relationship between the controller and the subject, provided the rights of
the data subject are still upheld.
According to Article 89 of the regulation, there is also some allowance for
derogation regarding rights of data subjects (e.g., rectification, erasure, right to be
forgotten, restriction of processing, portability, objection) when data are processed
for scientific or historical research or statistical purposes, and Recital 156 refers to
clinical trials as such research. Note that derogation is allowed only where complying
with these provisions would make fulfillment of the purpose of the research
impossible or would significantly hinder it. Also, the research must still
comply with GCP and appropriate safeguards must be put in place.
The rights of clinical trial participants can be upheld by ensuring the informed
consent clearly states what data are being collected, why they are being collected, and by
whom it will be processed or used (including whether it will be transferred to a third
country) (Gogates 2018). Internally, sponsors, processors, and controllers should
ensure appropriate security measures (including technology, processes, and training)
are in place to maintain the privacy of the data. Data protection impact assessments
should be conducted for each data process (Gogates 2018), to determine its purpose,
management, and risks to rights of data subjects, as well as whether additional
safeguards need to be established.
A Data Protection Officer (DPO), who will serve as the point person to ensure
GDPR compliance, may also need to be appointed (HIPAA Journal 2018). A data
privacy notice should be created and readily available to data subjects and should
include contact information for the data controller (and DPO if applicable), catego-
ries of data that are collected, information regarding data transfer and retention, and
data subject rights as outlined in the GDPR. The means by which requests or
complaints can be made should be indicated (gdpr.eu).
Despite best efforts to ensure data protection, a data breach is still possible.
Should this occur, the controller must inform the authorities within 72 h, unless
the breach is unlikely to result in a risk to the rights and freedoms of natural persons
(data.europa.eu/eli/reg/2016/679/oj). The controller must keep a record of all
breaches and the resulting investigations, regardless of whether they were reported.
If the breach is likely to result in such risk, the controller must communicate the
breach to the subject, including the likely consequences of the breach and steps taken
to mitigate the effects (data.europa.eu/eli/reg/2016/679/oj). Of course, in the case of a
clinical trial, the controller usually does not have direct contact with the participants
or access to their information, so the communication would be handled by the site
investigator, based on information provided by the controller.
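As a simple operational aid, a breach log entry can carry the 72-hour notification deadline explicitly, alongside the record-keeping the regulation requires; the structure below is a hypothetical sketch, not a template from the regulation itself.

```python
# Minimal sketch: breach log entry carrying the GDPR 72-hour notification deadline.
from datetime import datetime, timedelta, timezone

def log_breach(detected_at_utc, risk_to_subjects):
    return {
        "detected_at": detected_at_utc,
        "notify_authority_by": detected_at_utc + timedelta(hours=72),
        "authority_notification_required": risk_to_subjects,  # unless risk is unlikely
        "investigation_notes": [],  # kept whether or not the breach is reported
    }

entry = log_breach(datetime(2024, 6, 1, 14, 30, tzinfo=timezone.utc),
                   risk_to_subjects=True)
print(entry["notify_authority_by"])  # 2024-06-04 14:30:00+00:00
```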
Upholding the principles of GDPR within the clinical trial arena involves and
impacts many stakeholders, including the sponsor and other controllers, data pro-
cessing organizations, investigators and site research staff, and data subjects. The
measures undertaken to understand the regulation and implement it often involve
more complex processes and more personnel, which is an added cost to the conduct
of the trial.

Summary and Conclusion

This chapter has provided information on the challenges involved in conducting
clinical trials internationally. When planning an international trial, every area of
conduct must be carefully assessed to determine how country-specific procedures
and regulations can impact each area. Extra care must be taken to vet and establish
communication pathways with all partners, from site staff to vendors. Risk assess-
ment and mitigation strategies must be put in place. Though implementing these
measures may be costly, the hoped-for benefit of conducting clinical trials globally is
the quicker realization of trial results and, ultimately, benefit to the greater
population.

Key Facts

The benefits to conducting trials globally include access to a greater number of trial
participants; greater diversity in terms of ethnicity, disease characteristics, and
susceptibilities; faster recruitment; and quicker realization of trial results.
The challenges involved in conducting international clinical trials are primarily
related to cultural, procedural, and regulatory differences between countries.
Regulatory bodies vary across countries and may include Ethics Committees/
Institutional Review Boards, Competent Authorities, Health Authorities, Data Pro-
tection Agencies, and/or individual Hospital Management.
Logistical variances between countries in the clinical (drug) supply chain include
approval for indication of the investigational medicinal product, impact of infrastruc-
ture on forecasting, languages and terminology on package labels, weather conditions
for cold-chain logistics (packaging), requirements for multiple depots, goods and
services taxes and permits, and local or off-site destruction of unused product.
Country-specific restrictions apply to the use of biomaterials in clinical trials,
such as the type of material that can be provided (if provided at all), export regulations,
and the length of time material can be retained.
The challenges in auditing and monitoring international sites include language
barriers; scheduling visits due to work patterns, holidays, and religious observances;
and the cost of the visit due to international travel.
Challenges in data management of global clinical trials include time differences,
interpretation of race and ethnicity, language issues, and technical capabilities of the
sites (which affect whether paper-based or electronic data capture is possible).
Several factors must be taken into consideration in Case Report Form design for
international trials, including minimal data collection, participant questionnaires in
the language of the participant, and laboratory units in local measurements.
The challenges of conducting international clinical trials can be mitigated by
developing a proactive risk-based strategy. Risks can be mitigated by simplified
protocols, user-friendly data collection tools, streamlined procedures, training, and
establishing a clear communication pathway.
The European General Data Protection Regulation (GDPR) strives to protect
subjects by outlining their rights in regard to the processing of their data. Rights of
data subjects include:

Right to Access – right to a copy of their data
Right to Rectification – right to correct their data
Right to Information – right to know how their data are being used
Right to be Forgotten – right to have the data erased/destroyed
Right to Restriction – right to restrict data processing
Right to Portability – right to take data from controller and/or transfer to another
entity
Right to Object – right to object to data processing

Cross-References

▶ ClinicalTrials.gov
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Implementing the Trial Protocol
▶ Institutional Review Boards and Ethics Committees
▶ Multicenter and Network Trials
▶ Procurement and Distribution of Study Medicines
▶ Qualifications of the Research Staff
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Selection of Study Centers and Investigators
▶ Training the Investigatorship

References
Advarra Regulatory Team (2018) The GDPR and its impact on the clinical research community
(including non-EU researchers). In: Advarra. Available via https://www.advarra.com/the-gdpr-
and-its-impact-on-the-clinical-research-community-including-non-eu-researchers/
Arnum P (2011) Managing the global clinical-trial material supply chain. In: Pharmtech. Available
via http://www.pharmtech.com/managing-global-clinical-trial-material-supply-chain
Beauregard A et al (2018) The basics of clinical trial monitoring. In: Applied clinical trials. Available
via http://www.appliedclinicaltrialsonline.com/basics-clinical-trial-centralized-monitoring
Bogin V (2016) Feasibility in the age of international clinical trials. In: Applied clinical trials.
Available via http://www.appliedclinicaltrialsonline.com/feasibility-age-international-clinical-
trials
Cobert B (2017) Remote PV Audits & Inspections. In: C3i Solutions. Available via https://www.
c3isolutions.com/blog/remote-pv-audits-inspections/
European Clinical Research Infrastructure Network. Available via http://campus.ecrin.org/
Export.gov Brazil – Import Requirements and Documentation. In: Brazil Country commercial guide.
Available via https://www.export.gov/article?id=Brazil-Import-Requirements-and-Documentation
Fisher Clinical Services Managing Complex Global Drug Distribution and Expiry. Available via
http://info.fisherclinicalservices.com/clinical-supply-optimization-global-distribution-case-
study-box
Fisher Clinical Services New Challenges for Global Clinical Trials: Managing Supply Logistics in
an Expanding Clinical Trial Universe. Available via http://info.fisherclinicalservices.com/white-
paper-global-clinical-trial-challenges
Fisher Clinical Services The Challenges of Cold Chain Management. Available via http://www.
fisherclinicalservices.com/content/dam/FisherClinicalServices/Learning%20Centre%20Images/
Latest%20Article%20Images/Latestarticlespdf/CTP012_Fisher%20Clinical_TRIM%20DPS.
PDF
Fisher Clinical Services What Clinical Teams Should Know About Changing Trial Logistics and
How they Will Affect Development. Available via http://info.fisherclinicalservices.com/log
Garg S (2016) An auditor’s view of compliance challenges in resource-limited clinical trial sites. In:
Applied clinical trials. Available via http://www.appliedclinicaltrialsonline.com/auditor-s-view-
compliance-challenges-resource-limited-clinical-trial-sites?pageID=4
Gogates G (2018) How does GDPR affect clinical trials? In: Applied clinical trials. Available via
http://www.appliedclinicaltrialsonline.com/how-does-gdpr-affect-clinical-trials
HIPAA Journal (2018) GDPR: what is the role of the Data Protection Officer. Available via https://
www.hipaajournal.com/gdpr-role-of-the-data-protection-officer/
International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human
Use (ICH). Available via https://www.ich.org/home.html
Miller J (2010) Complex clinical trials are posting new challenges across the clinical supply chain.
In: BioPharm. Available via http://www.biopharminternational.com/complex-clinical-trials-are-
posing-new-challenges-across-clinical-supply-chain
Minisman et al (2013) Implementing clinical trials on an international platform: challenges and
perspectives. J Neurol Sci. Available via https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3254780/
Mongan A (2016) Three factors impacting on the destruction of IMP material. In: Clinical trials
arena. Available via https://www.clinicaltrialsarena.com/uncategorized/clinical-trials-arena/
three-factors-impacting-on-the-destruction-of-imp-material-4839806-2/
NIH U.S. Library of Medicine ClinicalTrials.gov. Available via https://clinicaltrials.gov
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
protection of natural persons with regard to the processing of personal data and on the free
movement of such data, and repealing Directive 95/46/EC (General Data Protection Regula-
tion). Available via http://data.europa.eu/eli/reg/2016/679/oj
Ruppert M (2007) Defining the meaning of ‘auditing’ and ‘monitoring’ and clarifying the appropriate
use of the terms. Available via https://ahia.org/assets/Uploads/pdfUpload/WhitePapers/
DefiningAuditingAndMonitoring.pdf
Sertkaya A et al (2014) Examination of clinical trial costs and barriers for drug development.
Available via https://aspe.hhs.gov/report/examination-clinical-trial-costs-and-barriers-drug-
development
Sprosen T (2017) News: does cutting trial costs by reducing monitoring visits also reduce quality?
In: MoreTrials. Available via https://moretrials.net/news-cutting-trial-costs-reducing-monitor
ing-visits-also-reduce-quality/
The International Maternal Pediatric Adolescent AIDS Clinical Trials (IMPAACT). Available via
https://impaactnetwork.org/
U.S. Department of Health and Human Services Food and Drug Administration (2016)
Collection of race and ethnicity data in clinical trials. Available via https://www.fda.gov/
regulatory-information/search-fda-guidance-documents/collection-race-and-ethnicity-data-clini
cal-trials
Web learning resources for the EU General Data Protection Regulation; Fines and penalties.
Available via https://www.gdpreu.org/compliance/fines-and-penalties/
Weyermann A (2006) Labelling requirements for IMPs in multinational clinical trials: bureaucratic
cost driver or added value? Available via https://dgra.de/media/pdf/studium/masterthesis/mas
ter_weyermann_a.pdf
Yeomans A, Abousahl I (2017) Preparing for the EU GDPR in clinical and biomedical research.
Available via https://www.viedoc.com/site/assets/files/1323/preparing_for_the_eu_gdpr_in_
clinical_and_biomedical_research.pdf
Further Reading
About NIAID Division of AIDS (DAIDS). Available via https://www.niaid.nih.gov/about/daids
ASPE U.S. Department of Health and Human Services (2014) Examination of clinical trial costs
and barriers for drug development. Available via https://aspe.hhs.gov/report/examination-clini
cal-trial-costs-and-barriers-drug-development
Ayalew K (2015) FDA perspective on international clinical trials. Available via https://www.fda.
gov/downloads/Drugs/NewsEvents/UCM441250.pdf
Bioclinica (2017) Collaboration between clinical operations and the logistics and supply chain
teams is key to trial success. Available via https://www.bioclinica.com/blog/collaboration-
between-clinical-operations-and-logistics-and-supply-chain-teams-key-trial
Clinical Trials Guidance Documents. Available via https://www.fda.gov/RegulatoryInformation/
Guidances/ucm122046.htm
ClinRegs is an online database of country-specific clinical research regulatory information
designed to assist in planning and implementing international clinical research. Available via
https://clinregs.niaid.nih.gov/index.php
Collection of Race and Ethnicity Data in Clinical Trials. Available via https://www.fda.gov/down
loads/RegulatoryInformation/Guidances/UCM126396.pdf
DAIDS Regulatory Support Center (RSC) provides support for all NIAID/DAIDS-supported and/
or sponsored network and non-network clinical trials, both domestic and international. Available
via https://rsc.niaid.nih.gov/
Department of Health and Human Services Office of Inspector General (2001) The globalization of
clinical trials a growing challenge in protecting human subjects. Available via https://oig.hhs.
gov/oei/reports/oei-01-00-00190.pdf
Division of AIDS Clinical Research Policies and Standard Procedures Documents. Available via
https://www.niaid.nih.gov/research/daids-clinical-research-policies-standard-procedures
European Commission, Enterprise and Industry (2009) EU guidelines to good manufacturing
practice medicinal products for human and veterinary use. Available via http://www.gmp-
compliance.org/guidemgr/files/2009_06_ANNEX13.PDF
Foust M (2014) Strengthening the links in the clinical supply chain: aim for transparency through-
out the process. In: Applied clinical trials. Available via http://www.appliedclinicaltrialsonline.
com/strengthening-links-clinical-supply-chain-aim-transparency-throughout-process
George Clinical (2016) Regulatory timelines in the Asia-Pacific. Available via https://www.
georgeclinical.com/resources/research/regulatory-timelines-asia-pacific
Global Health Trials (2012) Destruction of investigational medical product following trial termi-
nation. Available via https://globalhealthtrials.tghn.org/community/groups/group/regulations-
and-guidelines/topics/172/
Henley P (2016) Monitoring clinical trials: a practical guide. In: Tropical medicine and international
health. Available via https://onlinelibrary.wiley.com/doi/full/10.1111/tmi.12781
Leyland-Jones B et al (2008) Recommendations for collection and handling of specimens from
group breast cancer clinical trials. J Clin Oncol. Available via https://www.ncbi.nlm.nih.gov/
pmc/articles/PMC2651095/
Mattuschka J (2016) Clinical supply chain: a four-dimensional mission. In: BioProcess interna-
tional. Available via https://bioprocessintl.com/manufacturing/supply-chain/clinical-supply-
chain-a-four-dimensional-mission/
Muts V (2018) International patient recruitment: the grass is not always greener abroad. In: Applied
clinical trials. Available via http://www.appliedclinicaltrialsonline.com/international-patient-
recruitment-grass-not-always-greener-abroad
National Cancer Institute Division of Cancer Treatment and Diagnosis (2018) Biorepositories and
Biospecimen Research branch best practices. Available via https://biospecimens.cancer.gov/
bestpractices/2016-NCIBestPractices.pdf
Pharmaceutical Engineering (2016) Clinical labeling of medicinal products: EU clinical trial regulation.
Available via http://www.pharmtech.com/managing-global-clinical-trial-material-supply-chain
Research Conducted in NIAID Labs. Available via https://www.niaid.nih.gov/research/research-
conducted-niaid
The Clinical Data Interchange Standards Consortium (CDISC) is an open, multidisciplinary,
neutral, 501(c)(3) non-profit standards developing organization. Available via https://www.
cdisc.org/
U.S. Department of Health and Human Services Food and Drug Administration (2013) Guidance
for industry oversight of clinical investigations – a risk-based approach to monitoring. Available
via https://www.fda.gov/downloads/Drugs/Guidances/UCM269919.pdf
World Courier (2015) Managing the myths. Available via https://www.worldcourier.com/insights/
managing-the-myths
World Health Organization (WHO). Available via https://www.who.int/topics/epidemiology/en/
20 Documentation: Essential Documents and Standard Operating Procedures
Eleanor McFadden, Julie Jackson, and Jane Forrest

Contents
Introduction ... 370
Terminology ... 371
Background ... 371
ICH Guidelines on Documentation ... 372
Sponsor ... 373
Participating Sites ... 375
Coordinating Center ... 377
Data Collection ... 377
Edit Checks ... 377
Trial Management ... 378
Statistics ... 378
Independent Statistical Center ... 378
Trial Monitors ... 379
Site Binder ... 379
Source Data Verification ... 379
Monitoring Reports ... 380
Standard Operating Procedures ... 380
Sponsor ... 382
Participating Site ... 382
Coordinating Center ... 382
Independent Statistical Center ... 382
Monitors ... 382
Document Management Systems ... 382
Document Creation ... 383
Maintenance and Storage ... 383
Quality Control of the TMF ... 385
Document Archiving ... 385
Destruction of Essential Documents ... 386
Summary ... 386
Key Facts ... 387
Cross-References ... 387
References ... 387

E. McFadden (*)
Frontier Science (Scotland) Ltd., Kincraig, Scotland, UK
e-mail: [email protected]

J. Jackson · J. Forrest
Frontier Science (Scotland) Ltd, Grampian View, Kincraig, UK
e-mail: [email protected]; [email protected]

© Springer Nature Switzerland AG 2022

S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_45

Abstract
Documentation is a critical component of clinical trials. There are requirements
not only to be able to verify that the data being analyzed is accurate but also that it
was collected and processed in a consistent way. Anyone involved in a trial has
to recognize the documentation requirements and ensure that they are met. The
International Conference on Harmonization (ICH) Guidelines on Good Clinical
Practice E6 provides details of standards to be met along with relevant defini-
tions. This chapter provides guidance on identifying essential documents for a
trial and also on how to develop and maintain systems for standard operating
procedures.

Keywords
Documentation · Standard operating procedures · Trial master file · Essential
documents

Introduction

Documentation is now a fact of life for everyone involved in the conduct of clinical
trials. The Sponsor, the Funder, the Trials Unit coordinating the trial, the investigator
at the site, and often the trial subjects themselves have a responsibility to ensure that
complete and accurate documentation is kept relating to their role in the trial. The
conduct of clinical trials is now a highly regulated industry, and there are many
people who are employed to maintain and oversee the quality of the trial documen-
tation. The basic rule of regulators and other auditors is that if it is not written down,
then it didn’t happen, and they expect to be able to reconstruct the exact conduct of
the trial from the documentation, including source documentation.
As well as creating and revising the documentation, there is a requirement to keep
all documentation and archive it securely for lengthy time periods, which can vary
depending on the type of trial and where it is being conducted. This chapter will
outline requirements for essential documentation for sites, for coordinating centers,
and for the Sponsor. It also includes some guidance for monitors and independent
statistical centers (ISC) if relevant to the study.
The chapter also addresses the need for standard operating procedures (SOPs),
which ensure that routine procedures are always carried out in the same way. We discuss
the types of procedural documentation that are needed and give suggestions for
systems for the preparation and maintenance of trial documentation. This chapter
will close by providing an overview on the use of document management systems.

Terminology

As there are several different models possible for running a clinical trial, we describe
the model which we will use in this text for explanations. The hypothetical trial is a
multicenter trial with several hospitals/clinics entering patients. The trial has over-
sight by a study Sponsor. Data is submitted to a coordinating center (on either paper
case report forms (CRFs) or electronically via a remote data capture system), and the
coordinating center is responsible for randomization/registration of patients, quality
control of the data, queries to sites, statistical design and statistical analysis, and the
management of those trial-related services. There is a separate independent statistical
group responsible for preparing reports for the independent data monitoring com-
mittee. Site monitoring, including source data verification (SDV), is done by a
separate organization contracted by the Sponsor. The Sponsor is responsible for
provision and distribution of study medications/devices, oversight of the trial and all
trial documentation. The collection of essential documentation required for this trial
will be referred to as the trial master file (TMF). Where relevant, we will describe
how other trial models would address some of the documentation requirements.

Background

In the early days of “modern” clinical trials, very little documentation was
maintained. Case report forms (CRFs) were relatively short and were all paper-based.
Investigators and their staff at the sites completed them manually, with the source
data being the patient’s medical record. Sometimes the CRFs were signed by the
investigator, sometimes by a member of the investigator’s staff or with a rubber-stamp
signature, and sometimes not signed at all, but it really didn’t matter: the data was
entered into the computer and used in analysis. This is just one example of how
things have changed over the last 30–40 years, and there are many others.
Why have things changed so substantially?
One reason was the detection of cases of fraudulent data in cancer trials being
submitted to a central trials office in the late 1970s. One result of the investigation into
the submission of fraudulent data within one of the US National Cancer Institute-funded
cancer trials groups, the Eastern Cooperative Oncology Group (ECOG), was the
establishment of an ECOG audit process in which all sites were visited at regular
intervals and CRFs were compared against source data for a random selection of cases.
The US cancer cooperative group program implemented this approach across the
board with involvement from the National Cancer Institute (NCI), and, while it has
been refined and strengthened over the years, this program is still very much in place
for NCI-sponsored clinical trials, and source data verification is now a routine
practice in clinical trials (Ben-Yehuda and Oliver-Lumerman 2017; Weiss 1998).
Another, completely different, dynamic was the difficulty of conducting clinical trials
across borders because of the differing regulations and practices around the world.
Sponsors of international trials in the 1970s and 1980s found it challenging to
deal with these variations, yet the urge to complete large trials more quickly meant an
increase in companies and researchers wanting to use this cross-border model and to
be able to apply the same standards when submitting new drug applications in separate
countries.
The biggest influence in addressing this has come from the International Conference
on Harmonization, which began with a 1990 meeting in Brussels of representatives
from the pharmaceutical industry and the regulatory authorities of Europe, Japan,
and the USA. This followed early work to harmonize procedures in the European
Union and strengthening of regulatory requirements by the US Food and Drug
Administration (FDA). This conference generated a set of guidelines for clinical
trials which have been widely adopted throughout the world as standards for the
conduct of clinical trials (Good Clinical Practice Guidelines, E6 (R2) 2016). The
ICH organization is still in place and constantly working to update and improve its
guidelines.
In parallel, we saw the growth of an industry of contract research organizations
(CROs) providing support services to the pharmaceutical industry in the conduct of
clinical trials, including on-site monitoring and source data verification. The rapid
and constant development of relevant technology has also had a huge impact as it is
now feasible to collect and manage data and documentation electronically rather
than on paper.

ICH Guidelines on Documentation

The version of ICH E6 Guidelines on Good Clinical Practice (GCP) published in
November 2016 defines Documentation as “All records in any form (including, but
not limited to, written, electronic, magnetic, and optical records, and scans, x-rays,
and electrocardiograms) that describe or record the methods, conduct, and/or
results of a trial, the factors affecting a trial, and the actions taken.” The definition
given for Essential Documents is “Documents which individually and collectively
permit evaluation of the conduct of a study and the quality of the data produced.”
These definitions are important as they are very comprehensive, cover all aspects
of the trial and, as mentioned, have been accepted in many (and certainly all major)
countries in the world involved in the conduct of clinical trials. There is a lot of
information in ICH E6 revision 2, and we recommend that anyone involved in
clinical trials become familiar with its contents. For this particular chapter, the key
section of ICH E6 is Sect. 8, Essential Documents for the Conduct of a Clinical Trial.
ICH E6, Sect. 8 groups its overview of documents by timing, before the trial
commences, during the clinical conduct of the trial and after completion or termi-
nation of the trial. We will not reproduce the list of essential documents but will refer
to some of the documentation types in our chapter.

One of the GCP principles, in ICH E6, Sect. 2, is that systems should be
implemented with procedures that assure the quality of every aspect of the trial.
This principle is expanded in ICH E6, Sect. 5.1 where it is stated that it is the
responsibility of the Sponsor to implement and maintain quality assurance and
quality control systems with written standard operating procedures (SOPs). We
will explore SOPs later in this chapter.
Another principle to follow has become known as ALCOA. Documents should
be:

• Attributable: can the data be traceable to the person responsible for recording a
patient visit/event, along with the relevant date and time?
• Legible: can the data/information be easily read?
• Contemporaneous: was the data recorded at the time (or close to the time) that it
happened, and not recorded a long time after?
• Original: is the source or first-captured data available for review?
• Accurate: are the details recorded complete and correct?

Attributes added to this list are:

• Enduring: is the information recorded in a way that is durable and long-lasting?
(Ink can fade and electronic media become obsolete!)
• Available and accessible: is the data easily available for review or retrievable
within a reasonable time frame?
• Complete and credible: is the data based on real and reliable facts and complete
to that point in time?
• Consistent: is the information recorded in a consistent and chronological manner,
with date and time stamps in the expected sequence?

We will refer to the complete list of attributes as ALCOA+.
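To make these attributes concrete, the sketch below shows one way an electronic
record might carry ALCOA+ metadata alongside a captured value. This is an
illustrative Python sketch under assumptions of our own (the class and field names
are invented), not a prescribed format.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)  # frozen: the record itself stays immutable
    class DataPoint:
        """A single recorded observation carrying ALCOA+ metadata."""
        value: str              # Accurate/Original: the first-captured value
        recorded_by: str        # Attributable: person responsible for recording
        recorded_at: datetime   # Contemporaneous: when the value was recorded
        event_at: datetime      # when the observation actually happened
        source: str             # Original: pointer to the source record

        def is_contemporaneous(self, max_delay_hours: float = 24.0) -> bool:
            """Was the value recorded at, or close to, the time of the event?"""
            delay = self.recorded_at - self.event_at
            return 0 <= delay.total_seconds() <= max_delay_hours * 3600

    dp = DataPoint(
        value="BP 120/80",
        recorded_by="J. Smith (research nurse)",
        recorded_at=datetime(2021, 3, 1, 10, 30, tzinfo=timezone.utc),
        event_at=datetime(2021, 3, 1, 10, 15, tzinfo=timezone.utc),
        source="clinic chart, visit 3",
    )
    assert dp.is_contemporaneous()

Freezing the record supports the Enduring attribute: a correction would be captured
as a new, audit-trailed entry rather than by altering the original.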


The above are general guidelines to follow so that all study documentation meets
the requirement that trial conduct can be reconstructed and compliance demonstrated.
Document quality can be judged by the degree to which documentation meets these
ALCOA+ attributes. Other good documentation practices include document naming and
versioning for ease of document identification and retrieval. This chapter considers
the kind of documentation that is needed for various constituencies in a clinical
trial. Documentation in general can be split into two primary types: instructional
documents (e.g., plans, procedures, specifications, agreements) and records/reports.

Sponsor

The Sponsor of a clinical trial has ultimate responsibility for ensuring that all
required documentation is available at the end of the trial. In addition, during the
trial conduct period, the Sponsor usually has primary responsibility for the trial
protocol, the Informed Consent Form and any patient information sheets, the
Investigator’s Brochure (if relevant), and some oversight plans for the conduct of
the trial, such as a communications plan which documents communication pathways
between all parties involved in the trial.
For the overall TMF contents, there are templates available which specify TMF
requirements by category, and many Sponsors (especially pharmaceutical companies)
will use a version of such a template to provide a standardized structure and content for
the TMF. One such commonly used reference model is published online by the Drug
Information Association (DIA Trial Master File Reference Model v3.1.0, 2018). Not
all documents will be required for all trials, but this provides an excellent starting point
for defining study-specific requirements. This model includes the minimum ICH E6
GCP essential document set and additional documentation commonly created to
support trial activities. While the Sponsor has overall responsibility for the entire
TMF, they will normally delegate responsibility for specific documents to other parties
involved in the trial. In our hypothetical model, the sites, coordinating center, ISC, and
the monitoring organization will all be responsible for aspects of the TMF.
During the trial or at the end of the trial (or phase of a trial), the Sponsor is
responsible for collection of all the documents from all the involved parties and for
integrating them into a manageable TMF. The responsibility can be delegated, but
any delegation of this responsibility should be clearly defined and documented.
For all parties involved in the trial, these requirements are relevant:

1. Training records have to be maintained and updated over time to show that all
staff involved in the trial have the appropriate training and qualifications to fulfil
their trial-related responsibilities. This is required to meet one of the ICH GCP
key principles. Training records for former staff involved in the trial should be
maintained.
2. Retention of trial-related records is critical, and guidance should be sought from
the Sponsor about the retention period. In many instances, this can be for a
minimum of 25 years or until 2 years after the “last” regulatory submission
involving trial data. As it is very difficult to assess whether a submission will
be the “last” one, this effectively means that the records should be retained until
the Sponsor has said they can be destroyed. Remember that records have to
remain legible and accessible, to ensure that ink is not fading on handwritten
documents and that electronic media can still be read. If copies are made of
original paper records, the copies must be certified as exact copies of the original.
3. There must always be a complete audit trail of any changes made to clinical trial
CRF data. With electronic remote data capture systems, this type of audit trail is
usually built in, and a record will be kept of the original value, the new value, the
name of the person making the change, and the date and time stamp of the change.
Any eCRF system which does not meet these criteria should probably not be used
for any trial where regulators may eventually review and adjudicate the data and the
trial conduct. With paper records, the same level of information should be recorded
on the paper record to ensure transparency, with a single line through an original
value so that the value is not obliterated; the new value clearly written beside the
original, along with initials/name of the person making the change; and the date and
time of the change being made. The change must be supported by source data (a
sketch of a minimal audit-trail record follows this list).
4. Electronic trial data handling systems used in support of clinical trial activities
must be validated, including those handling essential documents. A validation
document set confirming the system's fitness for purpose should be created; ICH
E6 GCP Sect. 5.5 provides detail on the documentation and SOPs required.
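As an illustration of point 3 above, the sketch below shows the kind of append-only
audit-trail entry an electronic data capture system typically stores for each change.
The structure and field names here are hypothetical and not drawn from any
particular EDC product.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import List

    @dataclass(frozen=True)
    class AuditEntry:
        """One immutable record of a change to a CRF field."""
        field_name: str
        original_value: str
        new_value: str
        changed_by: str        # name of the person making the change
        changed_at: datetime   # date and time stamp of the change
        reason: str            # e.g., reference to the supporting source data

    class AuditTrail:
        """Append-only log: entries can be added but never altered or removed."""
        def __init__(self) -> None:
            self._entries: List[AuditEntry] = []

        def record_change(self, entry: AuditEntry) -> None:
            self._entries.append(entry)

        def history(self, field_name: str) -> List[AuditEntry]:
            return [e for e in self._entries if e.field_name == field_name]

    trail = AuditTrail()
    trail.record_change(AuditEntry(
        field_name="systolic_bp",
        original_value="210",
        new_value="120",
        changed_by="A. Jones",
        changed_at=datetime.now(timezone.utc),
        reason="transcription error; corrected against clinic chart",
    ))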

The following sections describe the responsibilities of each component of the trial
management team, but it is the Sponsor who defines these study-specific responsi-
bilities and also the Sponsor who should have systems to ensure that the required
documentation is created and available for review and inspection.

Participating Sites

Relevant trial-related documents should be kept at each participating site in an
Investigator Site File. These site files are part of the TMF but are usually held
separately from the main Sponsor TMF. The staff at the participating sites have
primary responsibility for ensuring that the required trial data is collected, recorded,
and transcribed accurately on to the CRF. They are also responsible for ensuring that
the original Informed Consent Form signed and dated by the patient entered on the
trial is available. For many drug trials, sites are periodically visited by trial monitors
who check the availability, completeness, and accuracy of the trial documentation, as
well as performing source data verification (SDV) on trial data entered in CRFs and
on pharmacy records (where relevant). Good organization at the sites is key to
success in these monitoring visits. They can also be seen as good practice for any
Sponsor audit or regulatory inspection of the site! The following points are intended
to help sites with their organization and address issues commonly found at site
monitoring visits or audits:

1. Most “formal” medical records do not contain sufficient information to allow the
complete reconstruction of a patient’s journey through a clinical trial. In such a
situation, the site should maintain a “research record” containing documentation
for the trial which is not recorded in the medical record. Information recorded
should follow the ALCOA+ rules defined above.
2. The principal study investigator at the site is responsible for all the trial activities
at that site. However, responsibilities can be delegated to other staff at the site. It
is essential that all such delegations be written down in a log, showing the name
of the person to whom a responsibility is delegated, the delegated responsibility
and the relevant dates. Lack of a delegation log is a common finding at site
monitoring visits/audits. Figure 1 shows an example of a site delegation log.
3. As well as CRF data, the site will be responsible for ensuring that all required
documentation is available for the specific trial. This could include some or all of
the following:
(a) Approved informed consent forms and original signed patient consent forms
(b) Ethics/Institutional Review Board (IRB) approvals
(c) Contracts, agreements, indemnity, insurance, financial aspects of the trial
(d) Protocol and amendments


(e) Investigator’s brochure
(f) Sponsor correspondence
(g) Relevant emails about the study
(h) Any surveys/questionnaires completed by the site or by patients
(i) Minutes of any trial-related meetings
(j) Training materials used for staff training, training records, and curriculum
vitae (CVs)
(k) Delegation of tasks
(l) Site selection and closure documentation/correspondence
(m) Any monitoring visit reports or audit certificates
(n) Any serious adverse event (SAE) reports and follow-up
(o) Details of any protocol deviations
(p) Details of local review of adverse events (AEs/SAEs)
(q) Pharmacy records for drug trials of receipt, storage, dispensing, and destruc-
tion of trial supplies
(r) Laboratory accreditation and sample management procedures
(s) Records of receipt and subsequent disposition of any other trial materials
provided to the site

Site Signature and Delegation Log

Study Title:          Study Number/Short Name:          Site Number:
PI Name:        PI Signature:        Start Date:        End Date:

Name (Print) | Role in Study | Signature | Tasks Delegated* | Start Date | End Date | PI Signature | PI Date

*Fill in the code for each delegated task from the list below:
1. Obtain informed consent
2. Obtain medical history
3. Perform physical exam
4. Assess eligibility
5. Confirm eligibility
6. Medical oversight of trial patients
7. Collection of study-specific samples
8. Processing study samples
9. Completion of CRFs
10. Signature of CRF
11. Data QC check
12. Respond to data queries
13. Assessment of AE
14. Causality assessment for SAE
15. Reporting SAEs to sponsor
16. Report deviations/violations
17. Prescription of study treatments
18. Dispensing study treatment
19. Drug accountability
20. Maintenance of site file
21. Other (specify)

Fig. 1 Site delegation log

Very often, the Sponsor will provide a site binder for the participating sites to use
for documentation, with instructions on the documentation to be maintained and
stored in the binder. If none is provided, it would be beneficial for the site to create
their own prior to the start of the trial. There are also now electronic site binders
being used so that all documents are organized and stored electronically rather than
on paper.
Procedural documentation (or SOPs) is relevant to all parties involved in a trial
and is covered in a separate section in this chapter.

Coordinating Center

Our hypothetical trial has a coordinating center which is responsible for data
collection, quality control, randomization/registration, trial management, and statis-
tical design and analysis. As well as development of SOPs (see section in this
chapter), what other documents will be needed for the TMF? The following is a
summary of typical documentation needs by function.

Data Collection

To collect data, a case report form (CRF) must be designed, tested, and implemented.
If the CRF is electronic, validation documentation of its design and implementation
should be maintained. Key correspondence relating to the design and content of the
CRF should be kept in an organized way such that it is easily retrievable at any time.
If a CRF is updated after data collection begins, the CRF should be version
controlled with dates of implementation and a record kept of all changes made as
this could be important when using data for reporting and analysis. The statisticians
need to know which version was being used when data was collected. It would also
be important to document whether changes made were to be applied for all patients
(including those already entered) or applied prospectively for new data only. There
will also be a need for procedural documentation to provide guidance on CRF
completion. Much of this can be covered in a data management plan for a study.
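As a hedged illustration of such version control, the sketch below records, for each
CRF version, its effective date and whether it applies retrospectively, so that the
version in force on any collection date can be looked up. All names and dates are
invented.

    from dataclasses import dataclass
    from datetime import date
    from typing import List, Optional

    @dataclass(frozen=True)
    class CrfVersion:
        version: str
        effective_from: date
        changes: str
        applies_retrospectively: bool  # True: all patients; False: new data only

    CRF_HISTORY: List[CrfVersion] = [
        CrfVersion("1.0", date(2020, 1, 15), "initial release", True),
        CrfVersion("1.1", date(2020, 6, 1), "added smoking-history item", False),
    ]

    def version_in_force(collection_date: date) -> Optional[str]:
        """Return the CRF version in force on a given data-collection date."""
        in_force = [v for v in CRF_HISTORY if v.effective_from <= collection_date]
        if not in_force:
            return None
        return max(in_force, key=lambda v: v.effective_from).version

    assert version_in_force(date(2020, 3, 1)) == "1.0"
    assert version_in_force(date(2020, 7, 1)) == "1.1"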

Edit Checks

Whether data is collected via an electronic data capture (EDC) system or on paper,
the coordinating center will develop edit checks on the data as part of its quality
control process. The edit checks can be either electronic (programmed to run
automatically) or be manual checks implemented on review of the data or be a
combination of both. The suite of edit checks will almost certainly evolve over time,
and again, it is essential to maintain documentation on the edit checks which are
implemented (including the testing and validation process), when they were
implemented and on which subjects they apply (all or only prospectively entered).
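The fragment below sketches what two programmed edit checks might look like: a
range check and a cross-field consistency check, each carrying an identifier and a
version so that the documentation described above can tie query output back to a
specific check. The rule identifiers, thresholds, and field names are illustrative
only.

    from datetime import date
    from typing import Callable, List, Tuple

    # Each check: (rule id, version, human-readable message, predicate).
    EditCheck = Tuple[str, str, str, Callable[[dict], bool]]

    EDIT_CHECKS: List[EditCheck] = [
        ("EC001", "1.0", "systolic BP must be 60-260 mmHg",
         lambda r: 60 <= r["systolic_bp"] <= 260),
        ("EC002", "1.0", "visit date must not precede randomization date",
         lambda r: r["visit_date"] >= r["randomization_date"]),
    ]

    def run_edit_checks(record: dict) -> List[str]:
        """Return the messages of all failed checks for one CRF record."""
        return [f"{rule_id} v{version}: {msg}"
                for rule_id, version, msg, ok in EDIT_CHECKS if not ok(record)]

    queries = run_edit_checks({
        "systolic_bp": 300,
        "visit_date": date(2021, 5, 1),
        "randomization_date": date(2021, 4, 1),
    })
    print(queries)  # ['EC001 v1.0: systolic BP must be 60-260 mmHg']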

Trial Management

Much of the documentation around trial management will revolve around project
and quality process documents and can include topics such as risk, communication,
training, and TMF management and is covered later in the chapter. As with the sites,
the coordinating center will maintain contractual documents including documenta-
tion of responsibilities delegated by the Sponsor.

Statistics

The documentation required from the statistical team includes the following:

• Details of the sample size calculation (a worked sketch follows this list).
• All versions of the statistical analysis plans (SAPs).
• Statistical programs along with specifications, testing, and validation plans and
results. Statistical programs and associated validation documents should be
version controlled.
• Outputs from all protocol-defined analyses or analyses/simulations which led to a
change in protocol design.
• Details of any randomization system used, which can be covered in a randomi-
zation plan, along with any relevant system validation documents.
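As a worked sketch of the first item, the documentation of a sample size calculation
should capture the assumptions as well as the result, so that the computation can be
reproduced. The example below uses the standard normal-approximation formula for
comparing two independent proportions; the event rates, significance level, and
power are invented inputs, and real trials may use different methods.

    from math import ceil
    from statistics import NormalDist

    def n_per_group(p1: float, p2: float, alpha: float = 0.05,
                    power: float = 0.90) -> int:
        """Two-sided test of two independent proportions, 1:1 allocation:
        n = (z_{1-alpha/2} + z_{1-beta})^2 * (p1(1-p1) + p2(1-p2)) / (p1-p2)^2
        per group, by the normal approximation."""
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        numerator = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
        return ceil(numerator / (p1 - p2) ** 2)

    # Assumed control event rate 30%, treatment 20%, alpha 0.05, power 90%.
    print(n_per_group(0.30, 0.20))  # 389 per group under these assumptions

Keeping such a function, its inputs, and its output under version control satisfies
the requirement that the calculation itself be documented and reproducible.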

Independent Statistical Center

Most Phase III trials have an independent data monitoring committee (IDMC), also
referred to as a data and safety monitoring committee, to review study data and
progress periodically. This committee is independent from the Sponsor. The role and
responsibilities of the IDMC are well described in other publications (Ellenberg et al.
2003). The IDMC meets periodically and reviews reports prepared by one or more
statisticians who are not involved in the conduct of the study. The reports for the
IDMC are confidential and not seen by the Sponsor and are usually prepared by
statistician(s) who are independent of the Sponsor and independent of the protocol
statisticians. We refer to these independent statisticians as the independent statistical
center (ISC) in this chapter.
The IDMC has a very important role in the trial and can make recommendations
to modify or to stop a trial based on the information with which they are presented.
They make these decisions based on the data prepared for them by the ISC. The ISC
therefore has to maintain documentation of all the reports prepared for the IDMC
along with corresponding statistical programs and datasets. Quality control records
and validation reports on the programs should also be maintained for each version.
Minutes of the closed and open sessions of the IDMC meetings would also need to
be saved and all documentation available for inclusion in the TMF. The role and
responsibilities of the ISC and the IDMC will often be documented in an IDMC
Charter. The documents maintained by the ISC are usually kept confidential from the
Sponsor until the end of the trial.

Trial Monitors

In our hypothetical trial, monitors visit sites to complete source data verification (or
carry out remote monitoring visits, depending on the model in place for the trial) and
check essential documents. Monitors may also visit sites during the site selection
process for a new trial to see whether a site is suitable for trial participation, for a site
initiation visit to ensure that all necessary documentation and processes are in place,
and for a site close-out visit to close down the site at the end of trial participation.
The role of the monitor and monitoring activities in a trial is usually defined in a
study-specific monitoring plan.

Site Binder

As mentioned, it is common practice in drug trials for the trial Sponsor to create a site
binder for each participating site to assist them in maintaining and organizing the
necessary trial documentation which is required for the TMF (see above under
Participating Sites). On a visit to the site, the monitors will routinely check the site
binders to ensure that they are complete. The site binder would usually contain
copies of all versions of the protocol used at the site, all relevant Ethics and
Regulatory approvals, plus any other required local approvals, site staff curriculum
vitae, training records, delegation logs, and study procedure documentation. It may
also contain copies of signed patient consent forms, depending on the local pro-
cedures. Monitors will use this documentation prior to a site activation to verify that
a site can be activated to accrue patients and then, during the trial, to ensure that the
site is compliant with all study requirements.

Source Data Verification

One of the primary responsibilities of the monitors is to verify that the data entered
into the CRF is accurate and complete. This is done by comparing the CRF entries to
the data at its source. This could be the patient’s medical or clinic record, a separate
file with original documents relating to the patient’s participation in the trial but not
part of the official medical record, and ancillary records such as pharmacy inventory
and dispensation records. A log of each monitor visit (on-site or remote) should be
maintained along with details of what was checked and any findings. Monitors
would also follow up to ensure that any deficiencies were appropriately corrected.

Monitoring Reports

Monitoring reports are written after each monitoring visit and submitted to the
Sponsor or the coordinating center. The reports will also be maintained as part of the
TMF.

Standard Operating Procedures

From the previous section about documentation requirements, it is clear that all
parties maintain trial-specific plans and associated records/reports. In this section of
the chapter, we consider instructional/process documents in the form of SOPs.
As mentioned previously, there is a requirement to document not only what was
done but how it was done and to show that a task was done in a consistent way
throughout the trial, no matter who was doing it.
All of the entities involved in trials will follow SOPs. Procedures form the founda-
tion of a good quality system and describe how to perform a repetitive trial activity.
Procedures are often based on a standard policy, which is a high-level statement
of intention to satisfy requirements; operating procedures set out the details of the
standard process for complying with that policy, defining what needs to be done and why.
SOPs are essential documents and, like those covered in previous sections, should
include records to demonstrate that compliance with the procedure has been mea-
sured and the procedure has been followed. The protocol document itself can be
considered a procedure document for the selection, entry, and treatment of a patient
entered on the trial. It may also contain instructions on trial related activities such as
ordering trial medication, submitting materials for central review, randomization of
patients, and other essential trial activities.
The philosophy behind SOPs is that they should be general enough to apply to all
clinical trials being done by an entity and not be trial-specific. Depending on the
entity organization, SOPs may be at a global or local level. If there are certain
activities that are specific only to one trial, then a separate document should be
created to describe that procedure. A common approach for this type of document is
to call it a trial-specific work instruction rather than an SOP.
Clinical trial staff should be trained in applicable SOPs. Sometimes an entity will
follow Sponsor or other collaborator procedures, and there is a need to ensure at the
start of a trial that all parties are aware of the SOPs that are being followed. If
external/Sponsor SOPs are used, it is key to ensure good communication between
parties about procedure distribution and training. The entity that owns the SOPs
should provide training on the procedures. As procedures set out the standard to
work to, it is advisable that they are subject to periodic or scheduled review to ensure
they reflect the current practice. The quality system should include a process on how
to handle deviations from procedures.
How do you decide what procedure documentation is needed? The UK Clinical
Research Collaboration has developed a process for assessing competency of clinical
trials units (CTUs) as part of a CTU registration process. They have developed a list
of areas of expertise which they consider essential for a CTU (McFadden et al. 2015)
and recommend that SOPs be developed to cover these areas. The names of the SOPs
can vary, but this list describes the basic topics which should be covered in SOPs in a
coordinating center.
We have updated this list to reflect recent changes in legislation and present it as a
table showing recommendations for documentation for each of our entities involved
in a trial. Table 1 shows recommended topics for procedure documentation by entity.

Table 1 Recommended topics for standard operating procedures by entity. Checkmarks
indicate the entities to which each topic applies; the columns are, in order:
Sponsor, Site, Coordinating center, Independent statistical center, Monitors.

SOP on SOPs √ √ √ √ √
Protocol development √ √
Risk assessment and monitoring √ √ √ √ √
Trial master file/site file (investigator and pharmacy) √ √ √ √ √
Regulatory approvals √ √
Trial initiation and site set up √ √ √ √ √
Data management √ √ √
Trial supplies √ √ √ √
Safety reporting/pharmacovigilance (if IMPs) √ √ √ √
Quality management systems √ √ √ √ √
Informed consent/patient information √ √
Training √ √ √ √ √
Registration/randomization (if randomized trials) √ √ √
Statistics √ √ √
IT systems/databases √ √ √
Trial closure √ √ √ √ √
End of trial reporting √ √ √
Archiving √ √ √ √ √
Deviations, misconduct, and serious breaches of GCP and/or the protocol √ √ √ √ √
Sponsorship, contracts/agreements, and indemnity √ √ √ √ √
Data protection and confidentiality √ √ √ √ √
Document control √ √ √ √ √

Sponsor

The Sponsor has ultimate responsibility for ensuring that all necessary procedures
and documentation are in place and maintained throughout the course of the trial.
The Sponsor can audit sites, coordinating centers, the ISC, and monitors to obtain
that assurance. The Sponsor should ensure that each participating party or entity
understands their responsibilities for preparation and maintenance of trial documents
and development and implementation of study procedures. It is also important to
establish which SOPs are to be used for the conduct of the study – is it the Sponsor
SOPs regardless of who is doing the specific task, or is it the SOPs of the entity to
which the task has been delegated?

Participating Site

The site should have their own hospital/clinic SOPs for trial-related activities or
follow those provided by the Sponsor (or a combination of both).

Coordinating Center

Depending on the responsibilities delegated to the coordinating center, there will be
a variety of procedure documentation required, much of it relevant to any trial
coordinated by the same unit.

Independent Statistical Center

The ISC will require procedure documentation for preparation of reports, interac-
tions with the IDMC members, secure transfer of reports, minute taking, and
archiving. There is usually also a study-specific charter for the IDMC activities
developed as part of the overall study governance documents.

Monitors

The monitoring organization will require procedures on planning, conducting,
reporting, and following up on monitoring visits. The monitors usually follow a
trial-specific monitoring plan which documents how much monitoring will be done
and the extent of the monitoring. This document is usually developed by the Sponsor
but could be delegated.

Document Management Systems

As the reader can see, the requirements for documents are substantial, and it is
beneficial to develop a system for creation, maintenance, and storage of these
documents.

The Sponsor and all other parties involved in the conduct of the trial (e.g., site,
coordinating center, ISC, and monitors) should aim to maintain an inspection-ready
TMF at all times. In other words, a regulatory inspector should be able to reconstruct
the conduct of the trial using only the documents and metadata present in the TMF.
While the concept of an inspection-ready TMF sounds simple, it is not easily
achieved, and the quality of the TMF is a growing area of risk for the clinical
research industry. There are many aspects to maintenance of the TMF, and some
aspects are more challenging than others, such as managing electronic correspon-
dence including emails in the TMF.
TMF quality is determined by both the quality of the individual records held
therein and the quality of the systems and processes in place to maintain the TMF.
The following section gives some suggestions for such a system.

Document Creation

Having a standard template for the development of a plan or an SOP simplifies the
process when a new one needs to be developed. When a standard format is used for
all documents, it also makes it easier for users to find something within the document
as they are familiar with the layout of the sections. Figure 2 shows a sample layout
for an SOP template. The document is normally prepared by someone with knowl-
edge of the particular process (a subject matter expert). Once drafted, the document
should be routed for appropriate review and, once all changes have been incorpo-
rated, final approval. This entire process should be documented. Signature is usually
with wet ink, but there is a growing trend toward the use of digital signatures to approve
essential documents. The implementation of electronic signatures should comply
with international electronic signature requirements. Once approved, the document
should be circulated to all individuals who are required to follow the process.
Training in the contents should be documented in each individual training record.

Maintenance and Storage

There is a requirement for SOPs and other procedural documents to be routinely
reviewed and updated when relevant. Part of the document management process is
therefore to build in a timeline for re-review at a regular interval. This can be
anything from 1 year to 3 years depending on the policy of the organization
developing the SOP. Updated versions should be version-controlled with a clear
record of what changes were made at what time. All approved versions of an SOP
should be held in the TMF.
In general, the TMF held by the investigator (at sites) must be stored and archived
separately from other documents that are part of the central TMF (e.g., from Sponsor,
coordinating center, monitors). Confidential documents such as those held by the
ISC and IDMC are usually transferred to the Sponsor once the primary analysis is
completed. While site files are part of the TMF, they may continue to be maintained
at the site rather than submitted to a central TMF. This is due partly to subject
confidentiality issues and partly to the requirement that the investigator must
maintain control of the source documentation held at site.

Sample SOP template layout:

[Document Title / Document ID]        [Document Number]        Version No.

Version:
Date of Approval:
Effective Date:
Next Review Date:
Security Level:

Approval table: Name | Role (Approver / Reviewer) | Signature | Date

Page 2: VERSION HISTORY (Version | Summary of Changes Made, for previous versions)

TABLE OF CONTENTS
1. PURPOSE
2. SCOPE
3. INTENDED USERS
4. PRE-REQUISITES
5. ROLES AND RESPONSIBILITIES
6. PROCEDURES
7. REFERENCES
LIST OF TABLES
LIST OF FIGURES

Footer: Template Version No. | Page X of Y

Fig. 2 Sample SOP template
TMFs will usually be a combination of both paper and electronic files. It should
be noted that the main requirements for storage and archival are the same for both.
The TMF itself should be well structured with records filed in an organized and
timely manner. This enables ease of identification and retrieval of both the documents
and the associated metadata – a key requirement for regulatory inspections. Metadata
attributes give context and meaning to the records. An audit trail is a form of metadata,
providing information on actions relating to the documents and records. Access to the
TMF must be appropriately controlled to ensure no unauthorized disclosure of informa-
tion, and the TMF itself protected from damage, unauthorized changes, and records
loss. TMF documents are subject to review at audit/inspection, and any party involved
in the trial could expect requests for direct access to the TMF during an audit/inspection.
Individual documents within the TMF must be version controlled, with an audit
trail for changes made. Finally, the contents of the TMF must be clearly indexed with
“signposts” to the relevant TMF repositories and systems.
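To illustrate what such metadata and signposting can look like in practice, the
sketch below shows a hypothetical index entry for a single TMF artifact. The field
names are loosely inspired by the structure of reference models such as the DIA
model but are not taken from any of them.

    # Hypothetical eTMF index entry for a single artifact (illustrative only).
    tmf_entry = {
        "artifact_id": "TMF-000123",
        "zone": "Trial Management",       # grouping within the TMF structure
        "title": "Monitoring Plan",
        "version": "2.0",
        "status": "final",                # draft / final / superseded
        "effective_date": "2021-02-01",
        "supersedes": "TMF-000098",       # version-control linkage
        "location": "etmf://study-xyz/zone-01/monitoring-plan-v2.pdf",
        "audit_trail": [                  # metadata about actions on the record
            {"action": "created",  "by": "M. Lee",  "at": "2021-01-20T09:12:00Z"},
            {"action": "approved", "by": "R. Diaz", "at": "2021-01-28T15:40:00Z"},
        ],
    }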
Using a specialized eTMF system can greatly assist with the challenge of meeting
the above requirements and streamline the processes of managing the active eTMF
and of archiving the eTMF at study close-out. An eTMF system can also solve some
of the issues faced by the Sponsor when the TMF documentation is being generated
by a number of collaborating partners.

Quality Control of the TMF

The TMF should be monitored throughout the course of the trial for quality in terms
of completeness, timeliness, document quality, and ease of retrieval.

Document Archiving

The Sponsor is responsible for archiving the TMF at the end of the study. An archive
is a physical facility or an electronic system designated for the secure long-term
retention and maintenance of archived materials. In the UK, such a system must be
under the control of one or more named archivists, with access limited to those
individuals.
Responsibilities for archiving the TMF should be agreed between the Sponsor,
the sites and all other parties involved in the trial such as the coordinating center,
ISC, and monitors. The Sponsor should ensure that all parties involved are capable
of archiving their TMFs in a manner that meets GCP requirements. If files are
transferred from one entity to another, then the transfer process should be tested
and validated to ensure that all files are transferred completely and correctly.
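One common way to test such a transfer, sketched below on the assumption that both
locations are accessible as file trees, is to compare per-file cryptographic
checksums before and after the move. This is an illustration of the principle rather
than a mandated procedure.

    import hashlib
    from pathlib import Path
    from typing import Dict

    def checksums(root: Path) -> Dict[str, str]:
        """Map each file path (relative to root) to its SHA-256 digest."""
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()
        }

    def verify_transfer(source: Path, destination: Path) -> bool:
        """True only if both trees hold identical files with identical content."""
        return checksums(source) == checksums(destination)

    # Example: verify_transfer(Path("/archive/outgoing"), Path("/archive/incoming"))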
The required retention period for archiving the TMF will depend on the applica-
ble regulatory requirements. The principles for archiving paper and electronic TMFs
are the same:

(a) All archived materials must be stored in a way that ensures their integrity and
continued access throughout the required period of retention.
(b) Procedures should be established for making archived materials available for
inspection, e.g., by regulatory authorities.
(c) Any alteration to archived records shall be traceable.
(d) Any transfer of materials within the archive from one location, media, or file
format to another must be documented and validated if appropriate.
(e) The Sponsor shall maintain contact with any external organization that archives
its materials throughout the period of retention.

The challenge of meeting requirement (a) above, for an electronic archive
retained over a period of many years, can be significant due to the fast-changing
nature of computer software and hardware. Archived digital files can become
irretrievable unless consistent effort is made to maintain digital continuity through-
out the retention period.

Destruction of Essential Documents

The requirements for long-term storage of essential trial-related documents vary in
different regulatory jurisdictions. For example, in the EU the retention period is
about to become 25 years. In the USA, it is often defined as “until two years after the
last filing of data with regulatory authorities.” As this is obviously vague and hard to
define, it is recommended that documents be stored until the Sponsor gives approval
for their destruction. Given the huge volumes of information that must be stored, it is
easy to see why electronic archives are becoming the preferred way of storing
documents.
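The practical consequence, sketched below with invented parameters, is that a
destruction decision reduces to two conditions: the jurisdictional minimum retention
period has elapsed and the Sponsor has given explicit approval.

    from datetime import date
    from typing import Optional

    def may_destroy(today: date,
                    sponsor_approval: Optional[date],
                    jurisdiction_minimum_years: int,
                    trial_end: date) -> bool:
        """Destruction requires BOTH Sponsor approval and expiry of the
        jurisdictional minimum (naive year arithmetic; leap-day edge cases
        are ignored in this sketch)."""
        minimum_until = trial_end.replace(
            year=trial_end.year + jurisdiction_minimum_years)
        return sponsor_approval is not None and today >= minimum_until

    # EU-style 25-year minimum, trial ended 2022-06-30, no approval yet.
    print(may_destroy(date(2040, 1, 1), None, 25, date(2022, 6, 30)))  # False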

Summary

This chapter has provided information on the documentation requirements for
clinical trials. Clinical trials are increasingly governed by stringent rules and regu-
lations which require development and maintenance of standardized ways of doing
things and detailed documentation of those actions. While some of the regulations
and requirements apply specifically to trials of medicinal products, it is
recommended that clinical trialists use the same procedures for all types of clinical
research. First of all, it is easier to have one set of rules for all trials, and secondly, it
ensures that all clinical research is done to the same standards. However, it is
acknowledged that compliance with all the requirements adds a considerable admin-
istrative overhead to clinical trials conduct. This overhead is not always recognized
by those funding the trials – yet it is a requirement that all the standards be met. This
is another reason to try to streamline procedures so that SOPs cover all clinical trial
activity within an organization and that study-specific documentation is kept to a
minimum.

The requirement to be able to “reproduce” the conduct of a trial from the TMF leads
to an increasing volume of documentation required for each trial, particularly if it is to
be used in a regulatory submission. While there are agreed international standards for
trial conduct defined by the International Conference on Harmonization, there are
national variations on these around the world. It is important for the Sponsor of the
trial to ensure that all parties involved in the trial are fully aware of their responsibilities
in terms of documentation and that all national requirements are met.
Use of standard templates and document management systems and following
ALCOA+ principles will make it easier to develop, maintain, and implement
essential documents and SOPs.

Key Facts

• Accurate and complete documentation is essential
• Conduct and results should be reproducible
• All parties have responsibility for compliance with procedures, for documenting
key decisions and contributing to a complete and accurate TMF
• TMF must be retrievable throughout the active phase of a trial and the archiving/
retention period

Cross-References

▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Design and Development of the Study Data System
▶ Good Clinical Practice
▶ International Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Trial Organization and Governance

References
Ben-Yehuda N, Oliver-Lumerman A (2017) Fraud and misconduct in clinical research: detection,
investigation and organizational response. University of Michigan Press. ISBN – 0472130552,
9780472130559
DIA TMF (2018) Reference Model. Retrieved from: https://tmfrefmodel.com/resources/
Ellenberg S, Fleming T, DeMets D (2003) Data monitoring committees in clinical trials. Wiley,
New York
Good Clinical Practice Guidelines, E6 (R2) (2016). Retrieved from: http://www.ich.org/products/
guidelines/efficacy/efficacy-single/article/integrated-addendum-good-clinical-practice.html
McFadden E et al (2015) The impact of registration of clinical trials units: the UK experience. Clin
Trials 12(2):166–173
Weiss RB (1998) Systems of protocol review, quality assurance and data audit. Cancer Chemother
Pharmacol 42(Suppl 1):S88
21 Consent Forms and Procedures

Ann-Margret Ervin and Joan B. Cobb Pettit

Contents
Introduction ....................................................................... 390
Who May Obtain Informed Consent? ................................................... 391
Who May Provide Informed Consent or Assent? ........................................ 391
  Lack of Capacity ................................................................. 392
  Other Vulnerabilities ............................................................ 392
  Nonnative Speaker ................................................................ 392
  Parental Permission .............................................................. 392
  Assent ........................................................................... 393
  Community Approval ............................................................... 393
What Must the Documentation of Informed Consent Include? .......................... 394
Consent Materials .................................................................. 401
Consent Discussion ................................................................. 401
  Understandable Language .......................................................... 402
  Context .......................................................................... 402
  Assessing Comprehension .......................................................... 403
Re-consent ......................................................................... 403
Termination of Consent ............................................................. 404
Regulatory Requirements for Informed Consent in Canada and the United Kingdom ..... 404
  Canada ........................................................................... 404
  United Kingdom ................................................................... 406
Summary and Conclusion ............................................................. 407
Key Facts .......................................................................... 407
Cross-References ................................................................... 409
References ......................................................................... 409

A.-M. Ervin (*)
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
The Johns Hopkins Center for Clinical Trials and Evidence Synthesis, Johns Hopkins University,
Baltimore, MD, USA
e-mail: [email protected]
J. B. Cobb Pettit
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]



Abstract
Obtaining the informed consent of a participant is a prerequisite for enrollment in
a clinical trial. In the United States, federal regulations provide the framework for
establishing informed consent with additional protections for persons considered
vulnerable due to incarceration, illiteracy, or other condition. Investigators are
tasked with providing sufficient information about the research to satisfy the
ethical and regulatory requirements while communicating it in a manner that
maximizes the participant’s ability to make an informed decision regarding study
enrollment. There are clinical trial design features that are essential to include in
the consent form with care to describe topics such as randomization, allocation
ratio, and masking in a manner understood by the lay public. The informed
consent discussion should continue throughout the course of the trial, as infor-
mally reaffirming the participant’s willingness to continue participation and
reconsenting them when there are significant changes to the study protocol are
important considerations for providing truly informed consent.

Keywords
Informed consent · Assent · Consent forms · Institutional review board

Introduction

The guidance in this chapter primarily pertains to clinical trials conducted in the
United States, but the general principles may apply more broadly to trials conducted
elsewhere.
Before enrolling a potential participant in a clinical trial, investigators must obtain
the individual’s informed consent. While people may think they know what “informed
consent” is, there is no set formula as to how to best achieve it. Informed consent is a
conversation between the participant and investigator that begins at recruitment and
ends with study exit. Presenting information about the trial to the target population,
taking into consideration the common characteristics of that population, e.g., age, sex,
common disease, or condition, can be challenging. The ethical, legal, and procedural
components of informed consent are intertwined; and while the ethical objective is
static, regulatory/legal authorities will modify requirements over time, causing
changes to procedural mechanisms. Investigators must have operational systems in
place to ensure compliance with regulatory changes that may affect their studies.
The Office of Human Research Protections (OHRP), the US government entity
that oversees federally funded human subjects research for the Department of Health
and Human Services (DHHS), describes the investigator’s obligation in the Code of
Federal Regulations, Title 45, Part 46 (45 CFR 46.116):

Except as provided elsewhere in this policy, before involving a human subject in research
covered by this policy, an investigator shall obtain the legally effective informed consent
of the subject or the subject’s legally authorized representative. An investigator shall seek
informed consent only under circumstances that provide the prospective subject or the
legally authorized representative sufficient opportunity to discuss and consider whether or
not to participate and that minimize the possibility of coercion or undue influence.
The information that is given to the subject or the legally authorized representative
shall be in language understandable to the subject or the legally authorized representa-
tive. The prospective subject or the legally authorized representative must be provided
with the information that a reasonable person would want to have in order to make an
informed decision about whether to participate, and an opportunity to discuss that
information (Department of Health and Human Services, Office for Human Research
Protections n.d.).

This chapter summarizes the components that contribute to a successful informed
consent process.

Who May Obtain Informed Consent?

Under guidance from the OHRP, obtaining informed consent is a study procedure
that, alone, is enough to establish that a research institution is engaged in human
subjects research and must have oversight by an Institutional Review Board (IRB)
(also known as Ethics Committee or Research Ethics Committee (REC) in many
parts of the world) (Department of Health and Human Services, Office for Human
Research Protections 2008). The IRB has the obligation of ensuring that the inves-
tigators performing research activities are qualified and trained to do so. Thus, the
consent designees for a trial must be identified, and their qualifications and training
submitted to the IRB for review and approval. The investigator must identify consent
designees who are appropriately credentialed to obtain consent for clinical
procedures. Qualified study team members may also work with the investigators
and consent designees to support the informed consent process.
If the person who obtains informed consent has a clinical care relationship with
the potential participant, the issue of “therapeutic misconception” arises (Sisk and
Kodish 2018). Thus, when the clinician introduces a research study, the patient may
interpret that introduction as a recommendation from the clinician, and may improp-
erly attribute the possibility of direct personal benefit to study participation. The
clinician must clearly explain that the trial is separate from clinical care and the
patient’s decision about participation will not affect the care that the clinician
provides.

Who May Provide Informed Consent or Assent?

Participants, or their legal agents, must provide their voluntary informed consent in
order to enroll in a research study. Adults, as defined by the locale’s law on the age of
majority, may provide informed consent for themselves unless they lack capacity to
do so.

Lack of Capacity

If an adult lacks capacity to provide informed consent, a legally authorized represen-
tative (LAR), as defined by local law, may provide consent on their behalf. Lacking the
capacity to provide informed consent is not the same as being cognitively impaired.
The Alzheimer’s Association (2004) defines the capacity to consent as “the ability to
comprehend a research protocol, the meaning of personal participation in this protocol,
including risks and benefits, as well as the ability to make and communicate a choice
about participation.” If a trial is likely to include adults who lack capacity to provide
consent, the protocol must explain how study personnel will assess capacity. Studies
that enroll adults who may lose capacity during the study may ask participants to
identify an individual (a Research Agent) who may provide continuing consent on
their behalf. A participant with limited capacity should provide assent, if able, and the
study team should respect their dissent from participation.

Other Vulnerabilities

Some adults may not lack capacity to consent, but may be otherwise vulnerable due
to incarceration, immigrant status, pregnancy, illiteracy, or another similar situation.
Each participant must be considered as a unique individual, and investigators must
consider and address any personal characteristics that may obstruct the possibility of
obtaining legally effective informed consent.

Nonnative Speaker

If the target population is anticipated to be non-English speaking, the investigators
should provide translations of the English consent document prepared by qualified translators. If
an unexpected non-English speaker who is otherwise eligible for the study is interested
in participating and there is no time to fully prepare a translated consent form, both the
Food and Drug Administration (FDA) and OHRP permit the use of generic short-form
consents, translated into the potential participant’s language, in conjunction with an
oral translation of the official English version of the form by someone fluent in the
participant’s language. The process involves several components, including the par-
ticipation of a nonstudy affiliated witness who is fluent in both English and the
participant’s language, and the procedures must receive IRB approval prior to imple-
mentation (Department of Health and Human Services, U.S. Food and Drug Admin-
istration 2014).

Parental Permission

Parents (biological or adoptive) and legal guardians must provide parental
permission for minors (under the age of majority) to participate unless the study falls
under a specific exception to this rule. The exceptions are:

• The study addresses a topic that is protected by a statute that allows minors to
provide consent for themselves.
• The study meets the standards for waiver of informed consent and the IRB agrees
to waive parental permission.
• The study includes a population of children for whom requiring parental/guardian
permission is not appropriate given the study topic or the special characteristics of
their relationship, and the regulations governing the study permit the IRB to
approve a substituted mechanism to protect the minor participants. The DHHS
provides this exception at 45 CFR 46.408(c), with the example being a population
of neglected or abused children.

Depending upon the risk associated with the trial and the prospect of direct
personal benefit, permission may be required from one or both parents. In the
Randomized Trial of Peanut Consumption in Infants at Risk for Peanut Allergy
(LEAP Study), infants as young as 4 months old at high risk for a peanut allergy
were randomly assigned to avoid or consume peanuts in their diet until 5 years of
age. Because the IRB determined that the study offered the possibility of direct
personal benefit to the minor participants, consent was obtained from one parent or
guardian (Du Toit et al. 2015).

Assent

Minors must have the opportunity to provide assent to study participation. Assent
means “a child’s affirmative agreement to participate in research. Mere failure to
object should not, absent affirmative agreement, be construed as assent” (45 CFR
46.402(b)). The IRB must determine whether children have the capacity to assent to
trial participation. The IRB must consider the “ages, maturity, and psychological
state of the children involved” (21 CFR 50.55(b), 45 CFR 46.408(a)). This deter-
mination may be for all children participating, for subgroups, or for individual
minors. Assent may be waived for certain minimal risk studies, and for an FDA-
regulated study which “holds out a prospect of direct benefit that is important to the
health or well-being of the children and is available only in the context of the clinical
investigation” (21 CFR 50.55(c)(2)).

Community Approval

For clinical trials that occur in community settings and in some international
locations, investigators may obtain the consent or approval from local leaders.
Community acceptance of the proposed trial may be important to successful recruit-
ment and enrollment of study participants. It is also important to establish a com-
munication mechanism with the community to facilitate the dissemination of results
once the trial is completed. The Surveillance and Azithromycin Treatment for
Newcomers and Travelers Evaluation (ASANTE) Trial recruited 52 communities
in the Kongwa district, Tanzania, to receive annual mass drug administration (MDA)
or annual MDA plus a surveillance and treatment program for newcomers and
travelers to determine if the surveillance program would reduce infection with
Chlamydia trachomatis (Ervin et al. 2016). In the ASANTE Trial, community
leaders provided verbal consent for the participation of the community. Guardians
provided consent to enroll children, and children aged 7 years and older provided
assent to participate.

What Must the Documentation of Informed Consent Include?

The consent form for a clinical trial must include specific regulatory elements
that may change over time, as well as provisions required by the institution where
the research will take place. The US regulations governing federally funded
human subjects research from 1991 until the 2018 revisions required eight
basic elements and provided six “additional elements” to be added to consent
documents when appropriate (Department of Health and Human Services, Office
for Human Research Protections n.d.). The revised regulations, effective January
21, 2019, include new basic and additional elements and new format require-
ments. Some of these changes increase the focus on the collection and use of
identifiable data and biospecimens.
While the consent form is the primary documentary evidence that a study
interaction with a participant occurred, other notes that the study team records
contemporaneously about that interaction may help record the context of the
discussion. The consent form may also reference other IRB-approved tools, such
as brochures, videos, and patient information sheets that the study team may use
to help explain the study. Table 1 presents the current US regulatory requirements
for informed consent. HIPAA Authorization is also included as many clinical
trials in the United States must also comply with the HIPAA mandate.
Consent forms for clinical trials will also include a description of the study arms,
including the use of placebos and sham procedures. The method of assigning
participants to different study arms should be discussed along with a lay description
of the allocation ratio. Additional important elements include:

• Persons masked/blinded during the course of the trial


• How and when study products (drugs, devices, etc.) are administered with
detailed descriptions of all testing, procedures, and medications
• Details about the follow-up period, including the time points and methods of
contact with participants
• Indications for discontinuing the study product during the course of the trial
• Description of the examinations, surveys or other assessments conducted at each
study visit and any interim telephone (or other) contacts
• The availability of study products to all participants (or those receiving the
placebo or other control treatment) once the trial has concluded

Table 1 US federal regulatory requirements for informed consent


Requirement Description
Common rule consent elements:
Informed consent required elements, 45 CFR 46.116 (1991)
1. A statement that the study involves research,
an explanation of the purposes of the research
and the expected duration of the subject’s
participation, a description of the procedures to
be followed, and identification of any
procedures which are experimental;
2. A description of any reasonably foreseeable
risks or discomforts to the subject;
3. A description of any benefits to the subject
or to others which may reasonably be expected
from the research;
4. A disclosure of appropriate alternative
procedures or courses of treatment, if any, that
might be advantageous to the subject;
5. A statement describing the extent, if any, to
which confidentiality of records identifying the
subject will be maintained; if the study is FDA
regulated, noting the possibility that the Food
and Drug Administration may inspect the
records (21 CFR 50.25(a)(5));
6. For research involving more than minimal
risk, an explanation as to whether any
compensation and an explanation as to whether
any medical treatments are available if injury
occurs and, if so, what they consist of, or where
further information may be obtained;
7. An explanation of whom to contact for
answers to pertinent questions about the
research and research subjects’ rights, and
whom to contact in the event of a research-
related injury to the subject; and
8. A statement that participation is voluntary,
refusal to participate will involve no penalty or
loss of benefits to which the subject is
otherwise entitled, and the subject may
discontinue participation at any time without
penalty or loss of benefits to which the subject
is otherwise entitled.
Additional elements
1. A statement that the particular treatment or
procedure may involve risks to the subject (or
to the embryo or fetus, if the subject is or may
become pregnant) which are currently
unforeseeable;
2. Anticipated circumstances under which the
subject’s participation may be terminated by
the investigator without regard to the subject’s
consent;
3. Any additional costs to the subject that may
result from participation in the research;
4. The consequences of a subject’s decision to
withdraw from the research and procedures for
orderly termination of participation by the
subject;
5. A statement that significant new findings
developed during the course of the research
which may relate to the subject’s willingness to
continue participation will be provided to the
subject; and
6. The approximate number of subjects
involved in the study.
January 2018 revisions to 45 CFR 46.116; effective January 21, 2019:
Expanded general requirements, 46.116(a)
(1) Before involving a human subject in research covered by this policy, an investigator shall obtain the legally effective informed consent of the subject or the subject’s legally authorized representative.
(2) An investigator shall seek informed consent
only under circumstances that provide the
prospective subject or the legally authorized
representative sufficient opportunity to discuss
and consider whether or not to participate and
that minimize the possibility of coercion or
undue influence.
(3) The information that is given to the subject
or the legally authorized representative shall be
in language understandable to the subject or
the legally authorized representative.
(4) The prospective subject or the legally
authorized representative must be provided
with the information that a reasonable person
would want to have in order to make an
informed decision about whether to participate
and an opportunity to discuss that information.
(5) Except for broad consent obtained in
accordance with paragraph (d) of this section:
(i) Informed consent must begin with a
concise and focused presentation of the key
information that is most likely to assist a
prospective subject or legally authorized
representative in understanding the reasons
why one might or might not want to participate
in the research. This part of the informed
consent must be organized and presented in a
way that facilitates comprehension.
(ii) Informed consent as a whole must
present information in sufficient detail relating
to the research, and must be organized and
presented in a way that does not merely
provide lists of isolated facts, but rather
facilitates the prospective subject’s or legally
authorized representative’s understanding of
the reasons why one might or might not want
to participate.
(6) No informed consent may include any
exculpatory language through which the
subject or the legally authorized representative
is made to waive or appear to waive any of the
subject’s legal rights, or releases or appears to
release the investigator, the sponsor, the
institution, or its agents from liability for
negligence
New basic elements: 46.116(b)
(b)(9)
(i) A statement that identifiers might be
removed from the identifiable private
information or identifiable biospecimens and
that, after such removal, the information or
biospecimens could be used for future research
studies or distributed to another investigator
for future research studies without additional
informed consent from the subject or the
legally authorized representative, if this might
be a possibility; or
(ii) A statement that the subject’s
information or biospecimens collected as part
of the research, even if identifiers are removed,
will not be used or distributed for future
research studies.
New Additional Elements: 46.116(c)
These require that a subject be informed of the
following, when appropriate:
(7) That the subject’s biospecimens (even if
identifiers are removed) may be used for
commercial profit and whether the subject will
or will not share in this commercial profit;
(8) Whether clinically relevant research results,
including individual research results, will be
disclosed to subjects, and if so, under what
conditions.
(9) For research involving biospecimens,
whether the research will (if known) or might
include whole genome sequencing (i.e.,
sequencing of a human germline or somatic
specimen with the intent to generate the
genome or exome sequence of that specimen).
For the purposes of this chapter, the concept of “...broad consent for the storage, maintenance, and secondary research use of
identifiable private information or identifiable
biospecimens” introduced by the DHHS in the
2018 revisions to the Common Rule, will not
be addressed.
HIPAA authorization requirements if the study involves using or disclosing protected health information (PHI) from a U.S. covered entity:
HIPAA authorization core elements (see Privacy Rule, 45 C.F.R. §164.508(c)(1))
Description of protected health information (PHI) to be used or disclosed (identifying the information in a specific and meaningful manner).
The name(s) or other specific identification of
person(s) or class of persons authorized to
make the requested use or disclosure.
The name(s) or other specific identification of
the person(s) or class of persons who may use
the PHI or to whom the covered entity may
make the requested disclosure.
Description of each purpose of the requested
use or disclosure. Researchers should note that
this element must be research study specific,
not for future unspecified research.
Authorization expiration date or event that
relates to the individual or to the purpose of the
use or disclosure (the terms “end of the
research study” or “none” may be used for
research, including for the creation and
maintenance of a research database or
repository).
Signature of the individual and date. If the
Authorization is signed by an individual’s
personal representative, a description of the
representative’s authority to act for the
individual.
Authorization required statements (see
Privacy Rule, 45 C.F.R. § 164.508(c)(2))
The individual’s right to revoke his/her
Authorization in writing and either (1) the
exceptions to the right to revoke and a
description of how the individual may revoke
Authorization or (2) reference to the
corresponding section(s) of the covered
entity’s Notice of Privacy Practices.
Notice of the covered entity’s ability or
inability to condition treatment, payment,
enrollment, or eligibility for benefits on the
Authorization, including research-related
treatment, and, if applicable, consequences of
refusing to sign the Authorization.
The potential for the PHI to be re-disclosed by
the recipient and no longer protected by the
Privacy Rule. This statement does not require
an analysis of risk for re-disclosure but may be
a general statement that the Privacy Rule may no
longer protect health information.
NIH certificate of confidentiality (suggested language):
“This research is covered by a Certificate of Confidentiality from the National Institutes of Health. The researchers with this Certificate
may not disclose or use information,
documents, or biospecimens that may identify
you in any federal, state, or local civil,
criminal, administrative, legislative, or other
action, suit, or proceeding, or be used as
evidence, for example, if there is a court
subpoena, unless you have consented for this
use. Information, documents, or biospecimens
protected by this Certificate cannot be
disclosed to anyone else who is not connected
with the research except, if there is a federal,
state, or local law that requires disclosure (such
as to report child abuse or communicable
diseases but not for federal, state, or local civil,
criminal, administrative, legislative, or other
proceedings, see below); if you have consented
to the disclosure, including for your medical
treatment; or if it is used for other scientific
research, as allowed by federal regulations
protecting research subjects.”
NIH guidance on consent for future research use and broad sharing of human genomic and phenotypic data subject to the NIH genomic data sharing policy (2015):
In order to meet the expectations for future research use and broad sharing under the GDS Policy, the consent should capture and convey, in language understandable to prospective participants, information along the following lines:
Genomic and phenotypic data, and any other
data relevant for the study (such as exposure or
disease status), will be generated and may be
used for future research on any topic and
shared broadly in a manner consistent with the
consent and all applicable federal and state
laws and regulations.
Prior to submitting the data to an NIH-
designated data repository, data will be
stripped of identifiers such as name, address,
account, and other identification numbers and
will be de-identified by standards consistent
with the Common Rule. Safeguards to protect
the data according to Federal standards for
information protection will be implemented.
Access to de-identified participant data will be
controlled, unless participants explicitly
consent to allow unrestricted access to and use
of their data for any purpose.
Because it may be possible to re-identify de-
identified genomic data, even if access to data
is controlled and data security standards are
met, confidentiality cannot be guaranteed, and
re-identified data could potentially be used to
discriminate against or stigmatize participants,
their families, or groups. In addition, there may
be unknown risks.
No direct benefits to participants are expected
from any secondary research that may be
conducted.
Participants may withdraw consent for
research use of genomic or phenotypic data at
any time without penalty or loss of benefits to
which the participant is otherwise entitled. In
this event, data will be withdrawn from any
repository, if possible, but data already
distributed for research use will not be
retrieved.
The name and contact information of an
individual who is affiliated with the institution
and familiar with the research and will be
available to address participant questions.
GINA (if appropriate):
“A Federal law, called the Genetic Information
Nondiscrimination Act (GINA), generally
makes it illegal for health insurance
companies, group health plans, and most
employers to discriminate against you based
on your genetic information. This law
generally will protect you in the following
ways:
Health insurance companies and group health
plans may not request your genetic information
that we get from this research.
Health insurance companies and group health
plans may not use your genetic information
when making decisions regarding your
eligibility or premiums.
Employers with 15 or more employees may not
use your genetic information that we get from
this research when making a decision to hire,
promote, or fire you or when setting the terms
of your employment.”
Conflict of interest:
A statement that one or more investigators have a financial or other conflict of interest with the study and how it has been managed.
FDA “applicable clinical trials”:
Under 21 CFR 50.25(c), the following statement must be reproduced word-for-word in informed consent documents for applicable clinical trials:
“A description of this clinical trial will be
available on http://www.ClinicalTrials.gov, as
required by U.S. Law. This Web site will not
include information that can identify you. At
most, the Web site will include a summary of
the results. You can search this Web site at any
time.”
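
Study teams sometimes audit a draft consent form against element lists such as those in Table 1 before IRB submission. The sketch below (in Python) is a minimal illustration of that idea, assuming a plain-text draft; the element names and keyword lists are invented for the example and are no substitute for the regulatory text itself.

# Minimal sketch: screen a draft consent form for the basic elements of
# 45 CFR 46.116. Keyword lists are illustrative only, not regulatory language.
BASIC_ELEMENTS = {
    "research statement and procedures": ["research", "purpose", "duration", "procedure"],
    "foreseeable risks": ["risk", "discomfort"],
    "benefits": ["benefit"],
    "alternative procedures": ["alternative"],
    "confidentiality": ["confidential"],
    "compensation and treatment for injury": ["compensation", "injury"],
    "contacts for questions": ["contact", "question"],
    "voluntary participation": ["voluntary", "withdraw", "penalty"],
}

def audit_consent_form(text):
    """Return the basic elements with no matching keyword in the draft."""
    lowered = text.lower()
    return [name for name, keywords in BASIC_ELEMENTS.items()
            if not any(keyword in lowered for keyword in keywords)]

draft = ("This study involves research on a new eye drop. Participation is "
         "voluntary, and you may withdraw at any time without penalty.")
print(audit_consent_form(draft))  # lists elements the draft appears to omit

A keyword screen of this kind can only flag likely omissions for human review; it cannot confirm that an element is adequately described.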

Consent Materials

The consent process begins as soon as an informational exchange between the investigator and potential participant about the trial yields personal information from the participant, and it continues throughout study participation. Recruitment
materials are used as part of the consent process and must be IRB approved.
No materials should overstate the potential benefits of the study by using persuasive words like “exciting” or “important” to describe the research. The traditional model
of obtaining informed consent involves the use of an IRB-approved consent docu-
ment that includes all the required elements of consent, potentially the elements of a
HIPAA Authorization, plus any institutional provisions required by the local site.
While the informed consent document is the standard mechanism for introducing a
study and explaining its purpose, procedures, risks, etc., other supplemental tools
may also be used. Videos, flipcharts, props, and other media can be very helpful to
improve participant access to the study information. IRBs must approve these
supplemental materials prior to use, and they must be consistent with the study’s
informed consent document.

Consent Discussion

The consent process must precede any study-related activities, including screening
for eligibility, whether the discussion takes place in-person, by phone, or other
remote method. Typically, a study team member approved to obtain informed
consent reviews the consent form with the potential participant and answers any
questions. The study team member must be cognizant of anything that might
interfere with a participant’s ability to make an informed decision (e.g., illiteracy, language barriers, or hearing, visual, or cognitive impairment). The study team member
must allow time for the prospective participant to consider whether to participate and
to ask questions about the study. In certain circumstances, the initial consent
discussion may extend over time to allow the prospective participant to consult
with his or her physician and/or family members. Study team members should not ask for
consent when the potential participant feels exposed or vulnerable, for example,
when lying on a gurney approaching the operating theater, or when his/her deliber-
ative faculties may be compromised by severe pain, anxiety, or the influence of
medication, etc. When the participant is satisfied with the discussion and agrees to
participate, the participant, and when applicable, the person obtaining consent, sign
and date the consent document. If the study involves clinical procedures for which
only credentialed clinicians may obtain consent, the process may be bifurcated such
that a trained study team member discusses the consent form with the participant,
and then the clinician reviews the consent form with the participant and answers
questions. Then, the participant, research staff member, and clinical research staff
member all sign and date the consent document. Some trials may include more than
one consent form, as in the Randomized Trial of Achieving Healthy
Lifestyles in Psychiatric Rehabilitation (ACHIEVE). The aim of ACHIEVE was to
assess the efficacy of a behavioral weight loss intervention among persons diagnosed
with a serious mental illness who participate in a psychiatric rehabilitation pro-
gram (Casagrande et al. 2010). Persons at participating rehabilitation centers were
orally consented prior to screening for ACHIEVE in order to measure their weight
and height. Persons expressing an interest in ACHIEVE were asked to sign a written
consent form for procedures related to eligibility screening and a second consent
form before randomization.

Understandable Language

The language of the consent document, and of the consent discussion, must be understandable to the potential participant. Understandable encompasses many things: the language used in the discussion (English, Spanish, etc.); how well the consent
language conveys accurately the information a participant needs to know; and
the sophistication of the language, including how well the consent language
explains scientific terms to a non-scientist. While IRBs may focus on reading
levels and ask investigators to reduce the reading level of consent documents to a
certain level (e.g., eighth grade or lower), this effort fails if the result is an over-
simplified form that is deficient in conveying the information that a participant
needs to know. Translating complex scientific ideas into simpler language acces-
sible to the target population is a challenging task, but is essential to the objective
of obtaining legally effective informed consent.
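
Reading level is commonly estimated with formulas such as the Flesch-Kincaid grade level, computed as 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59. The sketch below shows one way such a check might be run on draft consent language; the syllable counter is a crude heuristic, and validated readability tools should be preferred in practice.

import re

def count_syllables(word):
    """Crude heuristic: count runs of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

sample = ("You are being asked to join a research study. "
          "Taking part is your choice. You may stop at any time.")
print(round(fk_grade(sample), 1))  # a low grade level for this simple wording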

Context

The consent discussion cannot take place under circumstances that introduce a threat
that might make prospective participants feel that they must participate (coercion), or
that impose undue influence over the decision, such that participants decide to join or remain in a study in which they otherwise would not elect to participate. These conditions could undermine the voluntary nature of
the decision to participate. Investigators must consider the participant’s situation and
respect their privacy.

Assessing Comprehension

Some studies involve populations for which an assessment of comprehension is appropriate to ensure that the participant really understands what is being agreed
to. It also may be appropriate to provide a process for ascertaining comprehen-
sion for complex studies. These assessments may take the form of a test follow-
ing the consent discussion, or could involve the researcher pausing at the end of
each section of the consent form to ask the participant questions about the
content. It is difficult to know what information each participant absorbs, and what each retains. The incorpo-
ration of tools such as audio-visual aids during the consent discussion as well as
pre-study workshops to improve the communication skills of persons obtaining
consent may aid comprehension (Kao et al. 2017). Investigators have considered
reducing the complexity of consent forms to improve comprehension.
The Strategic Timing of AntiRetroviral Treatment (START) trial compared par-
ticipant comprehension after receiving a standard versus a concise consent form
using a cluster randomized non-inferiority design. The overall comprehension
and comprehension of randomization scores did not differ for participants at
START trial sites that received the concise consent form when compared to those
who received the standard consent form, and the investigators concluded that shorter consent forms do not appear to impair the participant’s ability to understand the study design and other basic features (Grady et al. 2017). It was, however, difficult to assess how ancillary information, such as discussions with
research team members, might confound the assessment of comprehension
among participants.

Re-consent

Re-consent may be required in certain circumstances. For studies that involve multiple interactions over time, additional tools like information sheets or other
communications may be advisable to improve the quality of the informed consent.
New information that could affect a participant’s willingness to continue participa-
tion should be communicated to participants, with possible re-consent. Trials that
undergo substantive amendments, including stopping or changing the number of
arms of the trial, could affect a person’s willingness to continue participation, and
investigators must provide participants an opportunity to re-consent. In the Evalu-
ating the Effectiveness of Prednisone, Azathioprine, and N-acetylcysteine in Patients
with Idiopathic Pulmonary Fibrosis (PANTHER-IPF) randomized controlled trial,
participants with IPF and lung function impairment were assigned to receive a
combination therapy of prednisone, azathioprine, and N-acetylcysteine (triple ther-
apy), N-acetylcysteine (NAC) alone, or placebo (National Institutes of Health,
National Heart, Lung, and Blood Institute 2011; The Idiopathic Pulmonary Fibrosis
Clinical Research Network 2012). The primary outcome of the trial was change in
forced vital capacity (lung function) from baseline to 60 weeks. The Data Safety and
Monitoring Board for PANTHER-IPF identified safety concerns among participants
enrolled in the triple therapy arm and recommended stopping the administration of
the triple therapy and halting further enrollment in the trial while continuing the
NAC and placebo arms. Participants in all three arms were notified of the decision to halt the triple therapy arm, and participants in the NAC and placebo
arms were re-consented if they chose to continue participation in the study. Studies
that enroll minors using parental permission and assent must consent those minors
who reach adulthood during study participation. Although a single interaction and a
single form may be preferred, legally effective informed consent requires more.
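
Coordinating centers often track re-consent triggers in their data systems. The sketch below illustrates two triggers drawn from this section, a minor reaching adulthood and consent given under a superseded protocol version; the age of majority, the version logic, and the function name are all assumptions for the example, and in practice only substantive amendments require re-consent.

from datetime import date

AGE_OF_MAJORITY = 18  # jurisdiction-dependent; illustrative only

def needs_reconsent(birth_date, enrolled_as_minor,
                    consented_version, current_version, today=None):
    """Flag a participant for re-consent if enrolled as a minor and now an
    adult, or if consent predates the current protocol version (simplified:
    real systems would flag only substantive amendments)."""
    today = today or date.today()
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day))
    reached_majority = enrolled_as_minor and age >= AGE_OF_MAJORITY
    outdated_consent = consented_version < current_version
    return reached_majority or outdated_consent

# Enrolled at 15 with parental permission and assent; now an adult:
print(needs_reconsent(date(2006, 5, 1), True, 3, 3, today=date(2025, 6, 1)))  # True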

Termination of Consent

A study participant retains the right to leave a study at any time; the consent process
must explicitly communicate that right to participants, and if appropriate, the
consequences of that decision. It should be clear to the participant whether follow-
up is necessary for their own safety and well-being or if there are any other pro-
cedures that should occur as a result of that decision.

Regulatory Requirements for Informed Consent in Canada and the United Kingdom

Canada

The Interagency Advisory Panel on Research Ethics developed the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (2nd Edition,
2018; TCPS 2). The TCPS 2 is the mandate for the ethical conduct of human
participant research in Canada. The TCPS 2 includes the guiding principles and
required elements for informed consent (Table 2).
The TCPS 2 includes many of the same informed consent tenets as the US Federal
and other policies throughout the world addressing research involving humans.
Consent should precede data collection, is voluntary, and can be withdrawn at any
time. Information regarding the study procedures and any risks and benefits of
participation should be described in detail to facilitate informed decision making.
Ensuring a participant’s informed consent is a continuous process: events that might alter the risk/benefit ratio during the course of the study should be disclosed, as these may affect the participant’s willingness to continue participation.
Table 2 Canadian regulatory requirements for informed consent


Requirement Description
Consent elements
1. A statement that the individual is being asked to participate in a research study;
2. An explanation of the purpose of the research, study procedures, duration of
participation, and participant responsibilities;
3. Disclosure of the researcher(s) and study sponsor(s);
4. A description of any reasonably foreseeable risks and benefits to the
participants and others;
5. A statement confirming that participation is voluntary and the participant
may withdraw from the research study without penalty or loss of benefits to
which the participant is entitled. Information that may affect the participant’s
decision to continue with the research study will be provided in a timely
manner;
6. A statement that the participant may withdraw access to data or biologic
materials with information on the limitations related to the request to withdraw
data or materials;
7. A statement informing the participant of any conflicts of interest
(researchers, institutions, sponsors). Participants should also be informed if the
research findings will be used for commercial purposes;
8. A description of the plan for disseminating study results including whether
participants will be identified;
9. Disclosure of the person(s) to contact for research-related queries and
unaffiliated person(s) who can discuss ethical concerns with participants;
10. Disclosure of the types of data that will be collected and the purpose of this
data collection;
11. Notification of the person(s) who will have access to the data collected and
how these data will be used including information on confidentiality and
requirements to disclose data collected to other entities;
12. Description of any payments, incentives, reimbursements and
compensations for injury that will be provided to participants;
13. A statement noting that participants do not forfeit their right to legal
recourse if the participant experiences a research-related injury;
14. Clinical trial investigators should also include information on the rules for
stopping the trial and the conditions under which participants are removed from
the trial.

Consent may be written, and a signed consent is required for research regulated under the Health Canada Food and Drugs Act. The TCPS 2 also acknowledges that oral
consent, field notes, exchange of gifts, and other methods may be warranted for
documenting consent as cultural norms and research settings vary.
The TCPS 2 addresses the accommodations provided when persons lack the
capacity to consent. In this instance, the investigator must ensure that the research has a direct benefit to the participant or persons who are similar to the participant. If the investigator is unable to show a direct benefit, then the research must pose minimal risk and low burden to the participant. A third party, who is not the
investigator or a member of the research staff, will be asked to provide consent
on behalf of the participant. If during the course of the trial, the participant regains
the capacity to consent, informed consent will be obtained. Assent may be obtained
if the participant has some capacity to comprehend the aims of the research. The
TCPS 2 further advises investigators and persons who may be asked to provide
consent on behalf of a participant to review research directives for guidance on the
participant’s preference regarding participation in research activities. A research
directive does not, however, modify the Tri-Council’s requirements for informed
consent.

United Kingdom

Guidance on informed consent for clinical trials in the United Kingdom (UK) is
provided in the Medicines for Human Use Clinical Trials Regulations (MHCTR)
and Guidelines for Good Clinical Practice (European Medicines Agency Interna-
tional Conference (n.d.); The Medicines for Human Use (Clinical Trials) Amend-
ment (No. 2) Regulations 2006). The underlying ethical principles and general
content requirements do not differ from those of the US and Canada. A partic-
ipant information sheet (PIS) is prepared to support the consent process. The PIS
provides a summary of the trial, including the background and objectives, the
expectations for volunteers participating in the trial, risks and benefits, what data
will be used and who has access to these data, information on withdrawing from
the study, and how the results of the trial will be disseminated while maintaining
participant confidentiality. The style and length of a PIS are often tailored to
inform the persons providing consent or advice on study participation, including
children, legal representatives, and relatives.
There are special protections for vulnerable populations, including adults who lack the capacity to consent, children, pregnant women, and patients participating in
emergency research. The requirements for the consent for vulnerable populations
may depend on the location of the research in the UK (England and Wales, Scotland,
or Northern Ireland) and the study type. For clinical trials of investigational drugs or
devices, a legal representative may provide consent for adults who are unable to
consent for themselves in England, Wales, Scotland, and Northern Ireland. For all
UK nations the representative may be a person who has a relationship with the adult
but is not involved in trial conduct (personal representative) or a professional
representative such as a treating physician who is not involved in the study.
Scottish regulations further specify that a personal legal representative could be a welfare guardian or attorney or, if one is not appointed for the adult, the closest relative. For greater than minimal risk research in England, Wales, and
Northern Ireland that does not include investigational products, a person who
cares for or has an interest in the adult’s well-being (a personal consultee) or a
nominated consultee (a person independent of the study) can provide their opinion
on whether the adult would be willing to participate in the study. This opinion is
recorded on a Consultee Declaration Form. In Scotland, a legal representative is
asked to provide consent for research that does not include investigational products.
Specific requirements for children and emergency research in the UK are outlined in
Tables 3 and 4.
Table 3 Requirements for the consent of children in the UK (England, Wales, Northern Ireland, and Scotland)

Clinical trials of investigational products
Consent on behalf of a child under 16 years old:
1. Parent or person with parental responsibility
2. Personal legal representative (if parent cannot be contacted)
3. Professional legal representative (if personal legal representative is unavailable)
Assent of the child should be sought when appropriate.
Consent on behalf of a child over 16 years old:
Children over 16 may provide consent on their own. If the child lacks the capacity to consent, then the regulations for adults that lack capacity to consent for clinical trials of investigational products will apply.

Greater than minimal risk research without investigational products
There are no specific legal provisions for a child’s consent to participate in research that does not include investigational products.

Summary and Conclusion

While there are regulatory and institutional requirements for obtaining consent for
clinical trial participation, investigators must also take steps to ensure that the
process maximizes the potential participant’s ability to make an informed decision.
Additional protections are necessary for vulnerable populations. Discussions should
occur in the appropriate context and supplemental materials may be important to
illustrate specific procedures and expected contacts during the course of the trial.
Assessing the potential participant’s comprehension of specific elements of the trial
should be considered particularly when the methods are complex and participation is
expected over an extended period. Informed consent discussions should be contin-
uous and written or other communications should be distributed to update partici-
pants during the course of the trial. Re-consent should be considered when
modifications may affect the participant’s willingness to participate in the trial.

Key Facts

• Voluntary informed consent is an essential prerequisite for clinical trial participation.
• There are additional legal protections for vulnerable persons to facilitate informed
consent.
Table 4 Requirements for emergency research consent in the UK

Clinical trials of investigational products
Adults lacking capacity to give consent (England, Wales, Northern Ireland, and Scotland): May be included without consent if 1) there is an urgency to administer treatment, 2) there is an urgency to administer the investigational drug in the trial setting, 3) obtaining consent from a legal representative is not practical, 4) the trial has been approved by the National Health Service’s research ethics committee, and 5) the consent of the legal representative is secured as soon as possible.
Children lacking capacity to give consent (England, Wales, Northern Ireland, and Scotland): Under 16 years old: may be included without consent if 1) there is an urgency to administer treatment, 2) there is an urgency to administer the investigational drug in the trial setting, 3) obtaining consent from a legal representative is not practical, 4) the trial has been approved by the National Health Service’s research ethics committee, and 5) the consent of the parent, guardian, or legal representative is secured as soon as possible. 16 years and older: refer to the guidance for adults.

Greater than minimal risk research without investigational products
Adults lacking capacity to give consent:
England, Wales, and Northern Ireland: May be included without consent if 1) there is an urgency to administer treatment, 2) obtaining advice from a consultee is not practical, 3) the research has been approved by the National Health Service’s research ethics committee, and 4) advice from a consultee is secured as soon as possible.
Scotland: Consent should be secured from a welfare attorney, welfare guardian, or the nearest relative before the adult is included in the research.
Children lacking capacity to give consent (England, Wales, Northern Ireland, and Scotland): May be included without consent if 1) the research has potential benefits to the child, 2) the research has been approved by the National Health Service’s research ethics committee, 3) the research cannot be addressed in a nonemergent setting, 4) a parent (or guardian) is notified as soon as possible, 5) consent and assent when appropriate are obtained as soon as possible, and 6) the child and/or the parent or guardian are informed that the child can withdraw at any time.

• Consent forms must include specific regulatory elements, institutional language (where applicable), and lay descriptions of study design features that are unique to
clinical trials.
• Re-consent should be considered when there are significant changes that may
affect the participant’s willingness to continue study participation.

Cross-References

▶ Institutional Review Boards and Ethics Committees

References
Alzheimer’s Association (2004) Research consent for cognitively impaired adults: recommenda-
tions for institutional review boards and investigators. Alzheimer Dis Assoc Disord 18
(3):171–175. https://doi.org/10.1097/01.wad.0000137520.23370.56
Casagrande SS, Jerome GJ, Dalcin AT, Dickerson FB, Anderson CA, Appel LJ, Charleston J,
Crum RM, Young DR, Guallar E, Frick KD, Goldberg RW, Oefinger M, Finkelstein J,
Gennusa JV, Fred-Omojole O, Campbell LM, Wang N-Y, Daumit GL (2010) Randomized
trial of achieving healthy lifestyles in psychiatric rehabilitation: the ACHIEVE trial.
BMC Psychiatry 10:108. https://doi.org/10.1186/1471-244X-10-108
Department of Health and Human Services, Office for Human Research Protections (2008)
Engagement of institutions in human subjects research. Available at https://www.hhs.gov/
ohrp/regulations-and-policy/guidance/guidance-on-engagement-of-institutions/index.html.
Accessed 23 June 2020
410 A.-M. Ervin and J. B. Cobb Pettit

Department of Health and Human Services, Office for Human Research Protections (n.d.) Protec-
tion of Human Subjects 45 CFR §46.116 (a) and (b). Available at https://www.hhs.gov/ohrp/
regulations-and-policy/regulations/45-cfr-46/index.html. Accessed 23 June 2020
Department of Health and Human Services, US Food and Drug Administration (2014) Informed
consent information sheet: guidance for IRBs, clinical investigators, and sponsors. Available at
https://www.fda.gov/RegulatoryInformation/Guidances/ucm404975.htm#genrequirments
Accessed 23 June 2020
Du Toit G, Roberts G, Sayre PH, Bahnson HT, Radulovic S, Santos AF, Brough HA, Phippard D,
Basting M, Feeney M, Turcanu V, Sever ML, Lorenzo MG, Plaut M, Lack G for the LEAP
Study Team (2015) Randomized trial of peanut consumption in infants at risk for peanut allergy.
N Engl J Med 372:803–813
Ervin AM, Mkocha H, Munoz B, Dreger K, Dize L, Gaydos C, Quinn TC, West SK (2016)
Surveillance and azithromycin treatment for newcomers and travelers evaluation (ASANTE)
trial: design and baseline characteristics. Ophthalmic Epidemiol 23(6):347–353. https://doi.org/
10.1080/09286586.2016.1238947
European Medicines Agency International Conference on Harmonisation Guideline for Good
Clinical Practice ICH GCP E6(R2) Step 5. Available at https://www.ema.europa.eu/en/docu
ments/scientific-guideline/ich-e-6-r2-guideline-good-clinical-practice-step-5_en.pdf. Accessed
23 June 2020
Grady C, Toulomi G, Walker AS, Smolskis M, Sharma S, Babiker AG, Pantazis N, Tavel J,
Florence E, Sanchez A, Hudson F, Papadopoulos A, Emanuel E, Clewett M, Munroe D,
Denning E, The INSIGHT START Informed Consent Substudy Group (2017) A randomized
trial comparing concise and standard consent forms in the START trial. PLoS One 12(4):
e0172607
Kao CY, Aranda S, Krishnasamy M, Hamilton B (2017) Interventions to improve patient
understanding of cancer clinical trial participation: a systematic review. Eur J Cancer Care 26:
e12424. https://doi.org/10.1111/ecc.12424
National Institutes of Health, National Heart, Lung, and Blood Institute (2011) Questions and
answers: PANTHER-IPF study. Available at https://www.nhlbi.nih.gov/node-general/questions-
and-answers-panther-ipf-study. Accessed 23 June 2020
Sisk BA, Kodish E (2018) Therapeutic misperceptions in early-phase cancer trials: from categorical
to continuous. IRB Ethics Hum Res 40(4):13–20
The Idiopathic Pulmonary Fibrosis Clinical Research Network (2012) Prednisone, azathioprine,
and N-acetylcysteine for pulmonary fibrosis. N Engl J Med 366:1968–1977
The Medicines for Human Use (Clinical Trials) Amendment (No. 2) Regulations (2006).
Available at http://www.legislation.gov.uk/uksi/2006/2984/pdfs/uksi_20062984_en.pdf.
Accessed 23 June 2020
Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans – TCPS2 (2018).
Available at http://www.ethics.gc.ca/eng/policy-politique_tcps2-eptc2_2018.html. Accessed 23
June 2020
22 Contracts and Budgets
Eric Riley and Eleanor McFadden

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Funding Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Types of Clinical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Key Funding Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Key Differences in Funding Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Distribution of Funds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Request for Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
Preparation of Proposal and Budget . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Budget Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Budget Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
Preparing the Response to a Request for Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Selection of Relevant Partners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Negotiation of Contract Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Contract Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
Clinical Trial Agreement Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
Scope of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
Budget Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
Signature of Contract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Activation of Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Other Legal Documents/Contracts/Contract Amendments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Contract Amendments and Budget Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

E. Riley (*) · E. McFadden
Frontier Science (Scotland) Ltd., Kincraig, Scotland, UK
e-mail: [email protected]; [email protected]


Abstract
The clinical research landscape in the twenty-first century continues to evolve.
Over the last three decades, the clinical trials landscape has changed dramatically
with increased regulations, worldwide standards (ICH E6 Guidelines), and intense
scrutiny. Most recently, there has been substantial impact from changes to legisla-
tion on data protection (US Health Insurance Portability and Accountability Act,
EU General Data Protection Regulations) rather than legislation directed specifi-
cally at clinical trials. There is an increase in large multicenter clinical trials, many
of them international in scope. Trials are becoming more automated and more complex in their design, management, and implementation.
The complexity of the clinical trials environment and the increase in regula-
tions require all those involved, whatever their level of contribution, to adapt to
the changes and to be very clear on the costs associated with carrying out research
(Mashatole, Conducting clinical trials in the 21st century- adapting to new ways
and new methods. Retrieved from https://www.clinicaltrialsarena.com/news/
conducting-clinical-trials-in-the-21st-century-adapting-to-new-wasy-and-new-
methods-4835722-2/, 2016). Equally important, the terms of agreement and
division of responsibility between relevant parties involved in a trial, including
the Sponsor and/or funder, must be clearly specified in advance in the form of a
legally binding agreement.
This chapter provides guidelines for preparing Clinical Trial Agreements/
Contracts and for developing budgets/funding requests for the work involved,
both essential activities during the start-up phase of a trial.

Keywords
Contract · Clinical Trial Agreement · Data Transfer Agreement · Budget ·
Sponsor

Introduction

A “Signed Agreement between Involved Parties” and “Financial Aspects of the Trial” are listed as essential documents in ICH E6 Good Clinical Practice Guidelines
(ICH 2016). This chapter covers key considerations in the preparation of these
documents. For clarity, the “Signed Agreement” is referred to as a contract in this
document, and the “Financial Aspects” as a budget. The budget is usually an
appendix to the contract. In this chapter, the term Sponsor can mean the legal
Sponsor of the trial or the funder. The two are sometimes, but not always, the same entity, and for the purpose of this text the terms are effectively interchangeable.
In broad terms, the steps followed to develop partnerships and a structure for a new clinical trial are typically as follows:

1. Request for proposal from Sponsor/funder/research organization/investigator, including specifications for the conduct of the trial
2. Preparation of submissions, including cost estimates
3. Selection of relevant partners


4. Negotiation of contract terms, including finalization of budget
5. Signature of contract
6. Activation of trial activities covered by the contract

In this chapter, the focus is primarily on steps 2 and 4 in the above list, but the
other steps will be touched on briefly. Firstly, it is important to provide an overview
of the funding landscape of clinical research and a background to the key issues,
which ultimately inform the budget development process and more specifically the
steps above.

Funding Landscape

There have been many changes in the approaches, regulations, and funding models of clinical research in the past 30 years, ultimately affecting the source and level of investment. In 1991, 80% of US clinical trials were funded by government or philanthropic organizations, including one of the largest sponsors of biomedical research in the world, the US federal government. However, this share has been in steady decline as the pharmaceutical industry’s contributions continue to grow. By 2005, industry funded an estimated 70% of US clinical trials. Because commercial research now assumes a larger share of the market, academic research that was once funded mainly through public grants must rely more on industry funding. As a result, academic research groups need to adjust how they operate to align with the goal of industry, which is to bring drugs to the market in a timely and efficient manner (Pfeiffer and Russo 2016).
Much of the reason for the increased cost of doing research is the need for
compliance with legal and regulatory requirements. For example, the initial EU
Directive on Clinical Trials (Directive 2001/20/EC) stipulated that the same standard
of conduct had to be maintained for all drug trials, regardless of whether there was an
investigational agent involved or not. These regulations have been relaxed slightly,
but there are still extensive standards to follow.
In several countries, private nonprofit organizations and state/local governments
have been increasing their stake in research and now compete with the traditional
key stakeholders. As all parties battle the increasing costs and complexities of
clinical trials, which has been accelerated by global recessions and fluctuations in
funding, research organizations in every sector have to strategically position them-
selves and find ways to work with each other for sustainability and advancement of
their missions (Hind et al. 2017).

Types of Clinical Research

The contract and budget process is connected to the funding model for a particular
trial, which from a high-level perspective is dependent on the various types of
research. Understanding the type of research being conducted helps one appreciate the key elements that define these processes. Regardless of the type, however, the information generated by any clinical research ultimately leads to
the commercialization of a new drug or device, advancement of scientific knowl-
edge, and/or changes in health policy and legislation (Camps et al. 2017).
Broadly speaking, there are two types of clinical research: noncommercial and commercial. In noncommercial research, the aim of the research project is to
generate knowledge that benefits the good of the wider public. It usually involves
government or nonprofit organizations such as academic institutions or foundations,
but it can also involve financial support from commercial entities such as a pharma-
ceutical or a biotech company, often in the form of an educational grant.
Commercial research is mainly sponsored by private industry. While there is a
genuine interest in advancing the development of scientific knowledge, there are
inevitably financial interests in this type of research specifically related to bringing a
new product to market or expanding the use of an existing product. This type of
research typically involves one organization which funds, designs, and carries out a
clinical trial either entirely under one roof or with delegated tasks outsourced to other
organizations that provide research services, such as a Clinical Research Organisation (CRO). CROs can provide a range of services from project management,
database design and build, clinical trial data management, statistical analysis, and
administration of Independent Data Monitoring Committees (IDMC)/Data Safety
Monitoring Boards (DSMB). Due to this diversity and scope, they are becoming a
major force in drug development and clinical trial recruitment (Carroll 2005). Other
organizations such as AROs (Academic Research Organisations) and CCOs (Con-
tract Commercial Organisations) can also be involved in providing a range of
specialist services to support clinical research such as recruitment support, clinical
knowledge and expertise, marketing and regulatory filing support, and drug launch.

Key Funding Sources

Within each type of research, there are various funding sources and mechanisms. As
part of their drug development programs, private industry is naturally a major funder
of clinical research. Another significant source of investment is the US federal
government, namely, through the National Institutes of Health (NIH). NIH is considered the largest funder among the world’s top public and philanthropic organizations, investing $26.1 billion in health research. The European Commission and the UK Medical Research Council follow in second and third places for research investment, with recent investments of $3.7 billion and $1.3 billion, respectively (Viergever and Hendriks 2016).
Charities like Cancer Research UK and the Wellcome Trust make significant investments in research and development in the UK and worldwide (Cooksey
2006). Private donors or foundations such as the Bill and Melinda Gates Founda-
tion and The Michael J. Fox Foundation for Parkinson’s Research also fund
clinical research. Additionally, there are global philanthropic groups such as The
European Organisation for Research and Treatment of Cancer (EORTC) or the
Breast International Group (BIG) that support and fund research in specific
disease areas. An organization such as BIG has expansive global reach and
influence. As a key stakeholder in breast cancer research, BIG represents a network
of collaborative groups connecting over 59 academic research groups on over 30
clinical trials or research programs at a given time and affecting over 95,000
patients since its inception (BIG 2018).

Key Differences in Funding Models

Overall, the motivation and incentives for each funding model vary. Often the
motivation for noncommercial research does not align with the current needs of
the pharmaceutical industry. In these cases where industry is sponsoring the
research, there is a need to ensure independence and impartiality to address the
scientific hypothesis. This has implications for budgeting, given that there could be extra costs involved in ensuring the scientific integrity of the trial, such as the need for Data Safety Monitoring Boards (DSMBs) or for proper firewalls between the relevant partners.
There are other potential differences between commercial and noncommercial
models, which relate to timelines and trial procedures. In academic settings, time-
lines are typically more relaxed than for industry trials. The rigor of trial procedures
can also vary. In commercial research, or any trial involving investigational drugs,
the infrastructure and standards are usually more resource intensive and held to a very high standard. This rigor is necessary to satisfy regulatory bodies, as results of these
trials are used to ensure products can make it to the consumer market safely and
effectively.

Distribution of Funds

The distribution of funds will vary depending on the individual trial, the main
stakeholder(s), and how the contracts are written. As stated, there may be several
parties involved in ensuring a trial is carried out successfully (e.g., research sites,
CRO, data management services, drug distribution services, sample processing
labs), and the main funder (which may be the Sponsor) is responsible for paying
the partners either directly or indirectly.
In academic trials the Sponsor/funder, which could be a government body,
nonprofit organization, or a commercial partner, typically pays a participating
university or affiliated medical center directly. In another model, the Sponsor pays
the Coordinating Center, which subcontracts to each participating site. Usually at
major academic centers, research groups are able to access the necessary support
resources through the institution’s research infrastructure sometimes called a CTU or
clinical trial unit. This may include central personnel, laboratory resources,
additional medical services like imaging or equipment, biosample storage, and
Institutional Review Boards (IRBs)/Ethics Committees (ECs).

In contrast, a large global trial being sponsored by a major pharmaceutical
company may have a multitude of partners and individual contracts with each
party or possibly even a setup with a main CRO and only a few select partners.
Another model is a large collaborative group, which manages and handles all
contracts and payments across services. In these cases, the collaborative group
would likely have access to a network of global organizations who would pool
their resources to identify the best sites to recruit the required study population.
Overall, the distribution of funds for a clinical trial is dependent upon the type of
clinical research being implemented, the source of the funds, and the funding model
being applied. Prior to these steps though, the clinical trial process starts with a
request for proposal.

Request for Proposal

A clinical trial originates with a scientific concept for testing one or more treatments
for a particular condition. This concept can then follow one of many different
pathways. It could, for example, be an investigator-initiated concept, a concept
from a pharmaceutical company, or a concept from a funding agency. At a high
level, the concept is developed into a draft protocol, and a trial Sponsor and funder
are agreed. The Sponsor and the funder can be the same entity or two separate entities.
The next step is to decide how the trial will be conducted and by whom. Quite often,
a formal request for proposal (RFP) is developed and circulated to any interested
party to allow them to submit a proposal outlining their plan for the specific role that
they would play in the trial. This RFP is usually particularly relevant for the Clinical
Trials Coordinating Center, and both this model and the Coordinating Center role are
the primary focus of this chapter.
Sometimes it is predetermined that a specific Coordinating Center will be respon-
sible for the conduct of the trial, perhaps because of a direct association with an
investigator, expertise in the specific condition being tested, or existing contractual
relationships. For other trials, there is a competitive process where any interested
parties (or sometimes preselected parties) are invited to participate.
Regardless of the model, this step is when certain details of the trial conduct are
first defined so that those responding (or those preselected) can develop a proposal
for their role in the conduct of the trial. The proposal will include logistical details
about proposed procedures and scope of work but will also include a preliminary
budget. Examples of things that may be defined, and that impact the required
resources (and therefore the budget), are:

1. Required accrual
2. Number of participating sites
3. Number of countries
4. Volume of data
5. Method of data collection (paper/electronic)
6. Any specific software requirements
7. Requirements for central reviews
8. Expected dropout/ineligibility rate
9. Extent of on-site monitoring required
10. Whether the trial involves investigational product
11. Whether there is intent to file with regulatory authorities
12. Any other criteria which could impact the scope of work and budget

The level of detail included in the request for proposal can vary. In some
instances, all of the above will be well defined making it more straightforward to
develop a budget and proposal. In other instances, the specifications can be vague
and poorly defined, and some interaction will be required with the party requesting
proposals so that a reasonable estimate can be made.

Preparation of Proposal and Budget

This step is the first opportunity for a potential applicant to draft a budget to submit
as part of the proposal. This step is critical, as there needs to be a balance between
being competitive with a proposal and ensuring that the budget is realistic and would
cover actual costs of doing the trial.
The escalating and complex costs of clinical research are forcing the main funders,
including pharmaceutical and biotechnology companies, CROs, and governments,
to tightly manage and control the financial particulars of their trials. This has a
large impact on organizations bidding for work, requiring them to understand the
sponsor’s position and the wider funding landscape in order to create a proposal
that will stand out. Rising costs also force them to adapt their budget models so that
they can remain competitive while still delivering high-quality, on-target work.
All clinical research studies require a budget, regardless of the funding sources,
size of research project, or parties involved in the research activities (Fine and
Albertson 2006). The budget can be seen as a planning document, which covers the
financial life of the study and supports the functions necessary for its success
(Floore 2019). Along with other key trial documentation such as the protocol,
schedule of events, and informed consent, the budget should represent the best
attempt at evaluating and planning for the resources and costs needed to implement
the study and achieve its scientific goals.

Budget Considerations

There are several things to consider when developing a research budget. The primary
financial consideration is ensuring that the research can be effectively carried out
with the available funds. Inefficient planning and insufficient budget forecasting
are notable areas where many organizations fall short, often jeopardizing the
financial sustainability of a clinical trial (Grygiel 2016).

Other challenges for the bottom line of a clinical trial are the therapeutic
area being studied and the protocol design. There is evidence to suggest that the
complexity of the trial protocol is associated with higher study costs, lower levels of
data quality, and longer study durations (Friedman et al. 2010). The protocol defines
much of how the study will be implemented and what components are required.
Additional specifications as defined above will also factor into budget calculations,
such as how many research sites are needed, estimated duration of the trial, relevant
regulations, and scope of responsibilities. All of these areas have a significant impact
on a research budget and should be carefully examined to ensure the costs are
properly considered.
The costs of increasing and complex regulations and compliance are becoming a
more common part of research budgets (Matula 2012). This refers to both the
personnel responsible for fulfilling these obligations and the costs associated with
fulfilling IRB/ethics requirements and local, national, or even international
regulations, including the General Data Protection Regulation (GDPR), a European
Union regulation with global reach and impact on data protection. There are also costs
attached to ensuring that research personnel are adequately trained in the appropriate
areas of clinical research such as Good Clinical Practice (GCP). While these costs
may not be directly listed as budget line items, the costs may be reflected in travel or
training costs or by specific compliance roles such as a Quality Assurance Officer.
Rising costs in running clinical trials are also stemming from the fact that per
patient costs are increasing at astronomical rates. In the USA, the average cost per
patient in a clinical trial increased 88% between 2008 and 2011 (Hargreaves 2016).
Some reasons include poor patient recruitment and retention, which can result in
large cost overruns, missed deadlines, and, in some cases, premature closure of the
study.

Budget Format

Any party involved in a trial will have to develop a budget relevant to its roles and
responsibilities and may have different formats to use. The parties involved could
include a Coordinating Center, participating sites, Contract Research Organisations,
central laboratories, and drug distribution centers.
The format of a budget submission for a specific proposal is usually predefined by
the Sponsor/funder for the trial and can be in many different forms. It is important to
prepare and submit the budget proposal in the required format. Some budget
requests are based on estimates of effort over time for relevant personnel, some on
hourly rates and an estimated total number of hours per position, and some on a
fixed rate per task. It is important to ensure that any relevant overhead (add-on/
indirect) costs are incorporated into the budget proposal.
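
To make the difference between these models concrete, the following sketch (a minimal illustration in Python; the positions, hours, rates, and 25% overhead rate are all invented, and a real submission must follow the funder's prescribed format) costs a draft budget under an hourly-rate model and then adds an indirect-cost rate:

# A minimal sketch, assuming an hourly-rate budget model. The positions,
# hours, rates, and the 25% overhead (indirect cost) rate are invented;
# real funders prescribe their own formats and overhead rules.

HOURLY_MODEL = {
    "Trial manager": {"hours": 1200, "rate": 55.0},
    "Data manager":  {"hours": 900,  "rate": 45.0},
    "Statistician":  {"hours": 400,  "rate": 60.0},
}
OVERHEAD_RATE = 0.25  # add-on/indirect costs as a fraction of direct costs

def draft_budget(positions, overhead_rate):
    # Direct costs: estimated hours multiplied by the rate for each position
    direct = sum(p["hours"] * p["rate"] for p in positions.values())
    overhead = direct * overhead_rate
    return {"direct": direct, "overhead": overhead, "total": direct + overhead}

print(draft_budget(HOURLY_MODEL, OVERHEAD_RATE))
# {'direct': 130500.0, 'overhead': 32625.0, 'total': 163125.0}

An effort-over-time model would replace the hours estimate with a percentage of full-time effort per year, and a fixed-rate model would price each task directly, but the principle of separating direct and indirect costs applies in all three cases.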
The process of developing the research budget can be streamlined by having
standard tools available for the initial costing process. Research institutions could
have a “budget toolbox” in place to use when developing a draft budget, regardless of
the required format for a submission (Appelman-Eszczuk 2016). The diversity of these
tools depends upon the funding portfolios and overall experience with previous
applications or bids. All toolboxes should support the understanding and analysis of
several key areas including the funding model being applied, the funding source, the
therapeutic area being researched, the essential components of an effective trial
budget, and, of course, an understanding of the tasks which will be the responsibility
of the applicant. An organization with this type of “budget toolbox” in place when
responding to a request for proposal will be able to develop and submit a budget more
quickly and more effectively than those who do not have these tools at hand.

Preparing the Response to a Request for Proposal

The previous sections show the complexity of budget preparation and the
importance of knowing the relevant factors that contribute to the drafting of a budget
proposal. The budget should be prepared in the required format and according to
specifications provided by those making the proposal request. The proposal also
needs to demonstrate an understanding of the regulatory requirements for the
specific project under consideration. If any assumptions are made in the preparation
of the draft budget, they should be well documented so that if those assumptions are
incorrect, the budget can be adjusted accordingly. Adequate justification for all
budget line items should also be provided.
In addition to the budget, there will be text to be added to the proposal, and it is
important to follow all instructions in preparing the proposal. There may be a
questionnaire to complete or free text to write to summarize the plan to meet the
requirements for the trial and to justify the budget request. The written component of
the proposal should be clear and concise and cover all relevant information. It should
be clear from the text which responsibilities are being included in the proposal and
the budget and text should match. Finally, it is important to submit the proposal by
any stipulated deadline and to include all information that was requested. Late or
incomplete proposals may be rejected.

Selection of Relevant Partners

Once the party requesting submissions has received proposals from all interested
parties, there is a process of selection. There may be a requirement for the applicant
to give a presentation to the requester or to answer some additional questions. A final
selection will be made, and the successful applicant will then move on to negotiating
a legal contract with the Sponsor/funder.

Negotiation of Contract Terms

Once a proposal has been accepted, the two (or more) parties involved have to
negotiate a legal agreement outlining the terms under which the work will be done
and incorporating the final accepted budget, which may differ from the budget in the
proposal as more details of the project are fully defined. It is important to ensure that
appropriate legal cover is in place prior to starting to work on a project and to
understand that the contract negotiations can be a lengthy process, especially for a
complex trial.

Contract Content

There are standard sections that would routinely be incorporated into a contract
for the conduct of a clinical trial, often referred to as a Clinical Trial Agreement or
CTA. These sections include:

1. Names and addresses of parties involved in the contract
2. Definitions of any key terms
3. Period/duration covered by the contract
4. Description of responsibilities of each of the parties (may be detailed in an
Appendix)
5. Financial provisions including details of how and when payments will be made
and the agreed budget (usually in an Appendix)
6. Contract termination/early termination conditions/rules
7. Governance structure for the trial and responsibilities of those involved (e.g., a
Trial Steering Committee may have the ultimate decision-making power)
8. Ownership of study data and materials
9. Intellectual property rights
10. Liability and indemnity
11. Data protection
12. Rules for future amendments to contract terms

Other sections that may be relevant, depending on the trial and the roles and
responsibilities of the contracting parties, could include:

1. Access to data while the trial is ongoing (or restrictions to access)
2. Collection of biological samples
3. Drug/device distribution system
4. Rules for publication and presentation of data once results are available
5. Interactions with regulatory agencies
6. Conflict of interest
7. Entities excluded by debarment
8. Assignment of responsibilities
9. Ability to subcontract
10. Governing law and procedures for any disputes
11. Commitment to accrual (for site contracts)
12. Permission for monitoring/audits/inspections

As the contract is a legally binding document and would hold any party to
account, it is essential that this document has detailed legal input by representatives
of each party prior to agreement and signature. Quite often, the legal counsel for
involved parties will negotiate terms among themselves once the assignment of
responsibilities and general structure have been agreed. While legal advice can be
expensive, it is much less costly than the alternative, which is finding out that you are
not covered if something goes wrong.

Clinical Trial Agreement Guidance

There are templates available online that provide a starting point for a Clinical Trial
Agreement document. In the UK, the UK Clinical Research Collaboration
(UKCRC) has developed model agreements in several areas. Their website has
links to several model templates, including ones relevant to clinical investigation,
CROs, primary care and commercial trials, and one for site agreements (UKCRC
website – https://www.ukcrc.org/regulation-governance/model-agreements/). These
nationally approved model agreements have been developed and published to help to
speed up the trial development process and simplify negotiations. National Health
Service Trusts in England and the devolved nations (Scotland, Wales, and Northern
Ireland) are expected to use them for relevant contracts.
Other guidance can be found in the NIHR Clinical Trials Toolkit, “an interactive
color-coded route map to help navigate through the legal and good practice arrange-
ments surrounding setting up and managing a Clinical Trial of an Investigational
Medicinal Product (CTIMP)” (www.ct-toolkit.ac.uk).
A clinical trial podcast (Kunal 2017) details nine essential components of a CTA
and provides insight into pitfalls in their formulation.

Scope of Work

As mentioned above, one of the key components of the contract should be a detailed
summary of roles and responsibilities for each party. It is essential that this is well
documented and understood so that there are no misunderstandings or omissions
once the trial gets under way. The list of tasks can be extensive and may be best
included as a detailed Appendix to the contract.
Table 1 shows a sample list of high-level topics for a scope of work to be
considered in a contract between a Sponsor and a Coordinating Center. Each of
these high-level topics can be broken down into more detailed activities. For
example, under the Statistics header, it can be documented which party is
responsible for developing the statistical analysis plan; under Interaction with
Authorities and IRBs/ECs, it can be documented which party interacts with
regulatory authorities and which is responsible for ensuring materials are prepared
for, submitted to, and approved by Institutional Review Boards and Ethics
Committees; under Clinical Data Management, it can be defined which party is to
hold the clinical database, which does quality control, and which interacts with
sites. These are just examples, but it is recommended that each high-level category
be broken down into such detailed subcategories so that the division of
responsibility for each task is predefined.

Table 1 High-level scope of work topics

#   Trial activities                                                Sponsor   CTU   Trial sites
1   Trial protocol and Informed Consent Form (ICF) development
2   Selection of trial sites
3   Interaction with authorities and IRBs/ECs
4   Other trial documents preparation, printing, and distribution
5   Management of trial drug and supplies
6   Management of central patient randomization
7   Management of investigator/team meetings
8   Project management and administration, including committees, contracts, and budget
9   Monitoring (coordination and execution)
10  Drug safety/SAE reporting
11  Clinical data management (including systems setup)
12  Clinical data review
13  Statistics (development, programming, and analyses)
14  Clinical study report writing
15  Quality control and assurance
16  Protocol-defined sample management

Instructions: Enter L in the column for the party taking the lead; X for a party involved but not in the lead
This example is for the division of responsibilities between a Sponsor and a
Coordinating Center, but similar lists can be created for other kinds of contracts, for
example, between a Sponsor and a drug distribution center or between a Coordinat-
ing Center and a randomization provider. Similarly, there needs to be a legal
agreement between all related parties in the conduct of a trial.
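
The lead/involved convention of Table 1 can also be captured in machine-readable form, so that a completed matrix can be checked automatically. The sketch below is illustrative only (the task names are abbreviated and the assignments invented); it simply verifies that each task has exactly one lead party:

# Sketch of the Table 1 convention: "L" = party taking the lead, "X" = party
# involved but not in the lead. Task names are abbreviated and the
# assignments are invented for illustration.

scope_of_work = {
    "Protocol and ICF development": {"Sponsor": "L", "CTU": "X", "Trial sites": "X"},
    "Clinical data management":     {"Sponsor": "X", "CTU": "L"},
    "Drug safety/SAE reporting":    {"Sponsor": "L", "CTU": "X", "Trial sites": "X"},
}

def tasks_without_single_lead(matrix):
    # Every task should have exactly one "L" entry across the parties
    return [task for task, roles in matrix.items()
            if sum(1 for flag in roles.values() if flag == "L") != 1]

assert tasks_without_single_lead(scope_of_work) == []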

Budget Evaluation

At this stage in the start-up process and before the contract is signed, there should
be a thorough evaluation of the initial budget proposal which was submitted as part
of the response to the proposal request. It is highly likely that additional detail about
the trial and its conduct has become evident during the intervening period between
the initial submission and the contract signature. There may be additional responsi-
bilities that have been added to the scope of work since the RFP was issued, and any
additional tasks or increase in responsibility can impact the initial budget.
All assumptions made in preparing the initial budget should be reexamined to see
if they are still relevant, and revised budget calculations made and negotiated with
the funder. It is also advisable to add language to the contract saying that there will be
new negotiations if the scope of the contract changes and that no such changes can be
made without agreement of both parties.

Signature of Contract

Once the terms of the contract, budget, and scope of work have been agreed, legal
representatives of each party should sign the document. Someone senior within an
organization would normally do this. The primary researcher would not normally be
authorized to sign such documents on behalf of an organization. Signatures can be
wet-ink, with a document being circulated to all parties to add their signature(s).
Sometimes multiple copies are signed so that each party receives a fully signed/
executed copy with original wet-ink signatures, and sometimes each party retains
their own wet-ink signature on site, and a scanned copy is sent to other parties. More
recently, electronic signatures have become more common with document signature
software that is compliant with relevant regulations. Often the method is dependent
on the laws within the relevant countries involved.

Activation of Trial

Once the contract is signed, the study can be activated and work commence. It is not
advisable to start work on the trial until such a contract is in place, as an organization
would have no legal basis for doing work before the document is signed.

Other Legal Documents/Contracts/Contract Amendments

There are other legal agreements that may be needed for a specific trial. Some
examples of these are as follows.

Confidentiality Agreements/Nondisclosure Agreements

These agreements are usually issued at the beginning of the proposal process to
protect any proprietary information about a study or company before being released
to a tendering site or organization.

Data Transfer/Data Use/Data Specification Agreement

This agreement would cover the transfer of data between parties. For example, if
two parties are each contracted directly by the Sponsor, but have no contract with
each other, and one must pass data to the other for statistical analysis, a Data
Transfer Agreement would be needed.
This agreement should clearly specify which data is being transferred, the mecha-
nism for transfer (e.g., secure portal), the timing of transfers, how the data can be
used once transferred, and specifications of file formats for the transfer. The two
parties involved should sign this agreement.

Vendor Agreements

If specific software/services are contracted by a party involved in the trial and used
for fulfilling their responsibilities in the trial, there should be agreements signed with
each vendor. Examples would be software support services, software providers, and
database/electronic data capture (EDC) hosts.

Contract Amendments and Budget Review

During the course of the trial, if changes are made to the scope of work or to the
operation of the trial, it is essential that these changes be reflected in an updated
contract and budget
amendment. Examples of changes are:
1. Modified accrual goals
2. Change in study design
3. Additional recruitment sites
4. Changes to scope of monitoring requirements

These are some examples, but any of these changes would impact the work scope
and the budget, and an updated contract should be negotiated.

Summary and Conclusion

The preparation of contracts and budgets to cover activities in a clinical trial is
essential. While it is important to submit a competitive budget if
responding to an RFP, it is also important to ensure that the budget request covers
all relevant costs, particularly given the escalating cost and complexity of running a
clinical trial today. Review of all available documents about the study will help to
ensure that items are not missed. Budgets can be formulated in many different ways
and each funder will have their own rules for submission. Guidelines should be
followed, and questions asked if these are unclear, and there should be a good
understanding of the roles and responsibilities of each party involved.
The negotiation of contracts and budgets is usually a lengthy process and this
should be factored into study planning. A key part of this planning is ensuring the
organization’s contract and budget development process is regularly reviewed in
order to streamline its efficiency and effectiveness. Additionally, it is essential for an
organization to make sure that the budgets being developed accurately reflect
the work being carried out. These reviews can be a result of contract
amendments, which can take several forms including changes in accrual goals or
study design or as part of an organization’s regular improvement process.

Key Facts

• Rules provided by the funder should be followed in preparing a budget.
• Ensure that the budget meets the requirements of the party doing the work.
• Allow scope for budget amendments in the contract.
• Ensure that all contracts have had appropriate legal review before signing.
• Ensure contracts are signed before work starts on the project.

Cross-References

▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Funding Models and Proposals
▶ Multicenter and Network Trials
▶ Responsibilities and Management of the Clinical Coordinating Center

References
Appelman-Eszczuk S (2016) Clinical research site budgeting for clinical trials. J Clin Res Excell
87:15–21
BIG (2018) Annual report 2018: spreading hope, advancing breast cancer research. Belgium. [Last
accessed: 24 November 2020] Available at: https://www.bigagainstbreastcancer.org/news/
annual-report-2018
Camps I, Rodriguez A, Agusti A (2017) Non-commercial vs. commercial clinical trials: a retro-
spective study of the applications submitted to a research ethics committee. Br J Clin Pharmacol
84:1384–1388
Carroll J (2005) CRO crowing about their growth. Biotechnol Healthc 2(6):46–50. https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC3571008
Cooksey D (2006) A review of UK health research funding. HM Treasury, Norwich
Fine and Albertson PC (2006) Budget development and staffing. In: Penson DF, Wei JT (eds)
Clinical research methods for surgeons. Humana Press, Totowa
Floore T (2019) Balancing the clinical trial budget. J Clin Res Excell 101:16–21
Friedman L, Furberg C, DeMets D (2010) Data collection and quality control. In: Fundamentals
of clinical trials, chapter 11. Springer Science and Business Media, Dordrecht, pp
199–214
Grygiel A (2016) The struggles with clinical study budgeting. Contract Pharma. http://www.
contractpharma.com/issues/2011-10/view_features/the-struggle-with-clinical-study-
budgeting/
Hargreaves B (2016) Clinical trials and their patients: the rising costs and how to stem the loss.
Pharmafile (Online). Available at: http://www.pharmafile.com/news/511225/clinical-trials-and-
their-patients-rising-costs-and-how-stem-loss
Hind D et al (2017) Comparative costs and activity from a sample of UK clinical trials units. Trials
18:203
International Council for Harmonisation E6 (R2) (2016) Good Clinical Practice [Last accessed on
2020 November 24]. Available from https://www.ema.europa.eu/en/ich-e6-r2-good-clinical-
practice
Kunal S (2017) 9 essential components of a clinical trial agreement. Clinical Trials Arena.
https://www.clinicaltrialsarena.com/news/9-essential-components-of-a-clinical-trial-agreement-
5885280-2/
Matula M (2012) Evaluating a protocol budget. In: Gallin J, Ognibene F (eds) Principles and
practices of clinical research, 3rd edn. Elsevier/Academic, Amsterdam/Boston, pp 491–500
Pfeiffer J, Russo H (2016) Academic institutions and industry funding: is there hope? J Clin Res
Excell 90:23–27
Viergever R, Hendriks T (2016) The 10 largest public and philanthropic funders of health research
in the world: what they fund and how they distribute their funds. Health Res Policy Syst 14:1
23  Long-Term Management of Data and Secondary Use

Steve Canham
European Clinical Research Infrastructure Network (ECRIN), Paris, France
e-mail: [email protected]

Contents
Introduction 428
Regulatory Obligations for Data Retention 428
Regulatory Obligations and Long-Term Management 431
Data for Secondary Use 435
The Push for Secondary Data Use 435
Barriers and Issues with Secondary Use of Data 437
Appropriate Preparation of Data for Re-use 438
Maximizing Scientific Value with Data Standards 443
Managing Data and Data Repositories 446
Trials Unit Systems for Managing Secondary Re-use of Individual Participant Data 448
Conclusion 451
Key Facts 452
Cross-References 453
References 453

Abstract
The reasons for retaining data after a study is finished are reviewed. The nature
and implications of the legal obligations to keep data are explored, with a brief
discussion around each of the main questions that need to be considered. The
increased pressure to make individual-level data available to others is then
examined. Some of the barriers to such secondary use, or “data sharing,” are
described as well as some of the ways data re-use can be anticipated and thus
facilitated. Practical issues such as data de-identification and data use agreements
are discussed. The importance of promoting data inter-operability using standards
and common vocabularies is stressed, followed by a brief discussion about data
repositories and the selection of a suitable long-term home for data. Processes and
systems to support the secondary re-use of data, from the point of view of a trials
unit, are suggested. A recurrent theme is the need to consider and plan the long-
term management of data from the very beginning of the study, because plans to
store and, especially, to share data may have profound implications for data
design and study costs.

Keywords
Data retention · Good clinical practice · Metadata · Secondary use · Data sharing ·
HIPAA · Data standards · Data repositories · Data use agreements

Introduction

Trials eventually reach a point when all data entry is complete, all the analyses have
been performed, and all the associated papers and result summaries are written. Direct
access to the trial data for its primary research purpose is either no longer required or
limited to occasional read-only access. The data cannot, however, be destroyed – there
is a regulatory and legal obligation to retain it, at least for a defined minimum period.
In addition, there is increasing recognition that the data has potential scientific value to
others and that – suitably de-identified and usually with controlled access – it could
and should be made available for possible re-use. For both of these reasons, therefore,
the data will require management in the long term.

Regulatory Obligations for Data Retention

The regulatory requirement for data retention stems from the possible need to re-
examine data, in the context of assessing, or re-assessing, either the safety of a
product or the general conduct and regulatory compliance of the study. There may be
a suspicion that the data has been interpreted wrongly, or that a particular safety-
related signal was missed, or even deliberately suppressed or mis-classified in the
original trial summaries. There may be a need – fortunately rare – to investigate
alleged fraud by individual investigators, or there may be actions for compensation
from individual participants. For all these reasons, the sponsor is responsible for
ensuring that the data is retained, enabling it, if necessary, to be examined within
legal or regulatory processes or institutional or professional disciplinary procedures.
Data retention also allows the completion of analyses originally abandoned at an
early stage and thus never published, as well as the re-analysis of results where
misreporting is suspected. Promoting such analyses was the aim of the RIAT (restor-
ing invisible and abandoned trials) initiative (Doshi et al. 2013; RIAT Support
Center 2020). GSK’s “study 329,” which looked at the effects of paroxetine (Paxil
or Seroxat) and imipramine (Tofranil) in the treatment of depression in adolescence,
is an example of a high-profile trial that was re-published as part of RIAT.
This study had been originally published in 2001 (Keller et al. 2001) and had
claimed that the drugs were “generally well tolerated and effective” in the target
population. During litigation, brought by New York State against GSK in 2004 after
it appeared that paroxetine in fact increased suicidal behavior among adolescents, it
emerged that the underlying data had never really supported the 2001 assertion. The
study had never been republished with a corrected analysis, however, and the
original paper had never been retracted (in 2020 it is still not retracted) so it was
re-examined within RIAT. The re-analysis (Le Noury et al. 2015) confirmed that the
drugs were not “statistically or clinically significantly different from placebo for any
prespecified primary or secondary efficacy outcome” but that there were “clinically
significant increases in harms, including suicidal ideation and behavior and other
serious adverse events in the paroxetine group and cardiovascular problems in the
imipramine group.”
Le Noury and colleagues had used not only the clinical study report (CSR) for
their analysis but also individual patient data (as SAS datasets) and about 77,000
pages of de-identified individual CRFs. Importantly, the authors noted that “Our
analysis indicates that although CSRs are useful, and in this case all that was needed
to reanalyze efficacy, analysis of adverse events requires access to individual patient
level data in case report forms.”
The principle of data retention is set out in the Good Clinical Practice guidelines,
or GCP (ICH 2016). Although strictly speaking these only apply to investigations
involving medicinal products, the principles of GCP are usually seen as applicable
to, and are followed by, all types of trials. GCP uses the concept of “Essential
Documents” which are defined (section 8.1) as “those documents which individually
and collectively permit evaluation of the conduct of a trial and the quality of the data
produced.” The “Essential Documents,” often collectively referred to as the “Trial
Master File” or TMF, include:

8.3.14 SIGNED, DATED AND COMPLETED CASE REPORT FORMS (CRF) To docu-
ment that the investigator or authorized member of the investigator’s staff confirms the
observations recorded Investigator/ Institution.
8.3.15 DOCUMENTATION OF CRF CORRECTIONS To document all changes/addi-
tions or corrections made to CRF after initial data were recorded.

Essential documents therefore include the data, specifically in the form in which it
was collected and including amendments to that data, reflecting the fact that the
purpose of data retention is essentially to provide an audit trail for possible later
inspection. The retention period is given by section 5.5.11 of the GCP guidance:

The sponsor specific essential documents should be retained until at least 2 years after the
last approval of a marketing application in an ICH region and until there are no pending or
contemplated marketing applications in an ICH region or at least 2 years have elapsed since
the formal discontinuation of clinical development of the investigational product. These
documents should be retained for a longer period however if required by the applicable
regulatory requirement(s) or if needed by the sponsor.

The final sentence is important. The relatively short retention period demanded by
GCP may be extended by other regulations, applied at international, national, state,
or institutional level, and those regulations are subject to change. In the USA, the
retention period specified in Title 21 of the Code of Federal Regulations broadly
follows the GCP requirement (US Code of Federal Regulations 2020), but in Europe
the period has been considerably
longer. The Clinical Trials Directive amendment of 2003 required:

. . . at least 15 years after completion or discontinuation of the trial,


— or for at least two years after the granting of the last marketing authorization in the
European Community . . . (European Commission 2003),

but the Clinical Trials Regulation of 2014 extended this period to 25 years:

Unless other Union law requires archiving for a longer period, the sponsor and the inves-
tigator shall archive the content of the clinical trial master file for at least 25 years after the
end of the clinical trial. However, the medical files of subjects shall be archived in
accordance with national law. (European Commission 2014, Article 58)

To complicate things further, the required retention period may also depend on
the type of trial and the population under study. For pediatric studies (because any
statute of limitation on claims may not come into effect until the child is 18, and
then last for a further period, e.g., 4 years), retention may be necessary until the
youngest participant reaches a certain age, e.g., 23 or 25. Similar considerations
may need to apply to studies that allow pregnant women, or the partners of
pregnant women, to participate. In such cases the retention period will depend
on the relevant law within the applicable legal jurisdiction. Some treatment types –
especially if they are relatively new and untested – may also demand longer
retention periods.
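
As an illustration of how such rules combine, the sketch below computes a notional retention end date as the latest date required by any applicable rule. The figures simply mirror the examples quoted in this section (25 years under the EU Clinical Trials Regulation, and a pediatric rule keeping data until the youngest participant reaches age 23); real retention decisions require legal advice for the relevant jurisdiction:

# A minimal sketch, assuming two rules: a base retention period after trial
# end (here 25 years, mirroring the EU CTR) and an optional pediatric rule
# (here, until the youngest participant turns 23). The numbers are
# illustrative only; actual periods depend on the applicable jurisdiction.

from datetime import date
from typing import Optional

def retention_end(trial_end: date,
                  youngest_dob: Optional[date] = None,
                  base_years: int = 25,
                  pediatric_until_age: int = 23) -> date:
    # The data must be kept until the latest date required by any rule
    candidates = [trial_end.replace(year=trial_end.year + base_years)]
    if youngest_dob is not None:
        candidates.append(
            youngest_dob.replace(year=youngest_dob.year + pediatric_until_age))
    return max(candidates)

print(retention_end(date(2022, 6, 30)))                     # 2047-06-30
print(retention_end(date(2022, 6, 30), date(2010, 1, 15)))  # still 2047-06-30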
A further aspect of the GCP guidance on data retention is that it requires data to be
kept both centrally by the sponsor, and/or the trial’s operational managers (i.e., a
trials unit or CRO) on the sponsor’s behalf, and at each clinical site. Useful detailed
guidance on what records should be kept where, for the TMF as a whole, is provided
by the European GCP Inspectors Working Group (EMA 2018). Although originally
written for a European context, the points made in this document should be relevant
to most other environments.
For data, GCP makes it clear that the sponsor should retain the original CRFs
while each clinical site should keep a copy of the data that they generated. In the days
of three-part carbon paper CRFs, this was automatic; the sites simply retained a copy
of whatever data they sent to the sponsor’s central facility. Nowadays, with almost
universal use of electronic remote data capture, this means a copy of the site’s data,
as collected by a clinical data management system (CDMS) and usually incorporat-
ing the investigator’s signature in electronic form, is returned back to the site in a
human readable form at the end of the trial.
A potential complication was introduced into this process with the advent of the
latest version of GCP (E6(R2)). Here an addendum to section 8.1 makes the point
explicitly that:

The sponsor should ensure that the investigator has control of and continuous access to the
CRF data reported to the sponsor. The sponsor should not have exclusive control of those
data. . . . . . .
The investigator/institution should have control of all essential documents and records
generated by the investigator/institution before, during, and after the trial.

This seems reasonable – if a sponsor has exclusive control of the data, it could, in
theory, make unaudited changes to the data before it was returned to the site. Unless
the investigator had the time and inclination to check the returned data (e.g., against
the source documents), he or she would likely be unaware of any changes made. But
what this addendum means in practice is unclear. Does the use of a CRO, as a third
party managing the data, ensure that the sponsor does not have exclusive control?
Probably, though some have suggested a CRO, paid by the sponsor, may not be
independent enough. Academic- or hospital-based trials units rarely use CROs,
though increasingly they use hosted CDMS solutions. But if they directly control
and can access the CDMS, and thus the data, and are also the sponsor, how can they
show that the site, rather than themselves, has “control of all essential documents and
records generated by the investigator/institution”? If they cannot, are they then in
breach of this addendum to GCP? Further discussion would seem to be necessary to
clarify exactly how this addendum should be interpreted.

Regulatory Obligations and Long-Term Management

Organizing the long-term retention of the data equates to answering a set of
questions:

• What material needs to be retained?
• Who should keep the data and where?
• What format(s) should be used?
• What metadata is also required?
• How long should the data be retained for?
• What data should the sites retain?
• How should final data destruction (if it ever occurs) be managed?

The increasingly common additional question, of how data could also be made
available for possible re-use by others, is discussed in a later section. The final
responsibility for resolving the questions listed above rests with the sponsor, but they
will normally be discussed with investigators and the trial’s operational managers, i.
e., a CRO or trials unit. That discussion will usually encompass all the essential
documents, i.e., the whole of the TMF, although here only the data is considered.
It is clearly better to consider these questions as part of the initial study planning,
so that everyone is clear about what will happen to the data and essential documents
at the end of the study, and what their role and responsibilities will be, from the very
beginning. It also allows necessary resources to be identified and costed (and
included in bids for funding). The issues will probably be revisited when the end
of the study arrives, but in truth the focus and energy of those involved are usually
elsewhere by that point. This is especially the case if the trial was managed by a
collaboration of some kind, which may have dissolved by study end. There is
therefore a danger that data (or some parts or versions of the data) are simply left
where they are, with little active management, unless arrangements for the long term
have already been settled. The issues that need to be discussed are considered in
more detail below.
What material needs to be retained? The essential documents include all the data
as collected, i.e., as in the (e)CRFs. But should it be in exactly the same format as
when it was collected, i.e., as a database file? Or in the format in which it was
extracted from the clinical data management system (CDMS) for analysis, which
will usually be as a collection of flat files, e.g., CSV or SAS transport files? Or in
some form of read-only archive format (e.g., as pdf files)? The latter two may not
include all the data amendments, but would normally be easier to access.
How should additional data that was never in the CDMS (e.g., treatment allocation
lists or image or lab data) be retained? What about the analysis datasets themselves –
which again might not be exactly the same as the data as extracted (they might be
MedDRA coded, for instance, or include reconciled SAE data)? And data for analysis
may also exist in different versions – interim datasets taken at different times, subsets
representing sub-studies, and datasets for different populations (e.g., for intent-to-treat
versus safety analysis). Which of these should be retained?
Who should keep the data and where? In some cases a sponsor and the CRO or
trials unit that is managing the data have a close, long-term relationship (or may be
the same organization). In this scenario, it is usually easier to retain the data within
the infrastructure in which it was collected. Given that storage capacity is so cheap,
and the datasets from a clinical trial are not, in terms of modern storage devices, very
large, it is quite possible to keep all the different versions of the data, but they will
need careful organization into a clearly labelled set of folders and files, including a
“read me” file explaining the contents.
If the sponsor has a more temporary, contractual relationship with the CRO or
trials unit, they will normally want the data as collected returned at the end of the
study, often to carry out the analysis themselves as well to meet their obligation to
retain the data. But what should happen to the copy of the data that remains on the
servers which collected it? This question becomes more acute when the servers are
not directly managed by the CRO/trials unit but are part of a remote SaaS (software
as a service) clinical data management system, which may in turn use a separate
“cloud” infrastructure. If the decision is taken that the data, once extracted, should be
removed from the data collection infrastructure (for a combination of security,
commercial, and financial reasons), then some thought will be required as to how
that can be managed. An absolute guarantee of data destruction, when an infrastruc-
ture is outside an organization’s direct control, is very difficult (Ramokapane et al.
2016), but some form of assurance that the data has been removed should be sought.
This should also cover the scenario when a virtual machine or a collection of data is
restored onto an infrastructure from a backup – in such a case data marked as deleted
will need to be re-deleted.
What format(s) should be used? The proprietary structures used by many clinical
data management systems, both for database storage and for a “data export,” do not
lend themselves to long-term storage. Even if systems remain in existence, they will
evolve, and a file created by one version may soon become unusable by later
versions. Twenty-five years is a very long time in technology. On the other hand,
such files often provide the most complete picture of the data as collected, with
previous values and audit trails included. Data is much easier to re-access if it is
stored in simpler non-proprietary formats, e.g., CSV (comma separated values), or
using a global XML schema, although such schemas are also likely to evolve over
time. But it may take additional work to ensure that the full set of data required,
including previous values, is included when using such formats.
The best answer is probably to use both format types. The lifetime of proprietary
files can be extended by using a virtual machine (VM) or, increasingly these days, a
server “Container” to preserve not just the data but also the context, e.g., the CDMS,
database, and operating system, in which it was housed. When a CDMS is updated
or replaced, old studies could be transferred to the new system, but it may be simpler
to split off the old system and the studies completed on it as a separate VM or
Container and “freeze” that in long-term storage. This has resource implications,
however, hence the need to consider this option at an early stage.
Complementing the data retained in this “native” format are the datasets in a non-
proprietary flat file format, i.e., the data as extracted and the data as analyzed, if
different. These should already exist because they have been required for the
analysis process. Organizing and retaining them should therefore be a relatively
straightforward exercise.
What metadata is also required? Data in any format quickly becomes useless
unless its meaning is clear, which means that metadata, for all types of retained data,
is essential. This should include, for each data item, its code, name, type, description,
and possible values. Most CDMSs can generate metadata for each study they
support, so including the metadata for data in its original format should be straight-
forward. Flat files created from the data may require additional, specific data
dictionaries, however, although again these should have already been created to
support the analysis process. Care needs to be taken that this descriptive metadata is
present or generated for all the data files, and included in the final data package.
As well as the descriptive metadata, the read me or “contents” file should include –
along with a general listing of the files included and the provenance and purpose of
each – any technical details about the files, e.g., the versions of systems used to
generate them. Text-based files like CSVs can all look the same to humans, but
machines can get confused by the different coding schemes used to generate them
(e.g., UTF-8 versus UTF-16) or the presence of technical marks in the file (e.g., byte
order marks). These may be unimportant at the time the data is generated because all
systems are set up to work with a specific configuration, but a few years later they
could cause problems if restoration is attempted. These details should therefore be
documented.
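
A minimal sketch of such packaging is shown below: a variable-level data dictionary is written alongside the data files in an explicitly chosen encoding, and that choice is then recorded in the read me file. The variable definitions and file names are invented for the example:

# Minimal sketch: write a data dictionary alongside the flat data files,
# using an explicit encoding so that the read me can document it accurately.
# The variable definitions and file names are invented for illustration.

import csv

DICTIONARY = [
    # code, name, type, description, allowed values
    ("SUBJID", "Subject ID", "string", "Unique participant identifier", ""),
    ("VISITDT", "Visit date", "date (ISO 8601)", "Date of scheduled visit", ""),
    ("SBP", "Systolic BP", "integer", "Systolic blood pressure", "60-250 mmHg"),
]

ENCODING = "utf-8"  # no byte order mark; record this choice in the read me file

with open("data_dictionary.csv", "w", newline="", encoding=ENCODING) as f:
    writer = csv.writer(f)
    writer.writerow(["code", "name", "type", "description", "values"])
    writer.writerows(DICTIONARY)

with open("README.txt", "w", encoding=ENCODING) as f:
    f.write("data_dictionary.csv: variable-level metadata for the extracted "
            "datasets.\nAll CSV files are encoded as UTF-8 without a byte "
            "order mark.\n")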

How long should the data be retained for? As indicated in the previous section,
there may not always be a simple answer to this question. Most sponsors, trials units,
and CROs become familiar with a particular regulatory regime and its data retention
guidelines, but it is worth checking to see if any exceptions apply or if the regulations
seem likely to change soon. For relatively new trialists (e.g., a new biotech startup or
an inexperienced investigator), it may be worth obtaining professional advice from a
CRO or trials unit, to ensure the regulations are well understood.
Some sponsors will stipulate a longer period for data retention than the strict
minimum (e.g., 30 years), often making it a blanket rule for all types of studies,
interventional and observational, to keep everything relatively simple. Exceptions
may still occur, but they should be less common. Some sponsors may even decide to
simply “keep everything” indefinitely. This is easy to say for a relatively new
sponsor with little if any data in long-term management, but decisions about how
and where the data needs to be stored still need to be resolved. It also raises an ethical
issue – if data is relatively identifiable, the longer it sits in an IT infrastructure, the
greater the risk of it being lost or hacked. Once the period required by regulation is over
therefore (admittedly a long time in Europe), there is a good argument that says such
data should be destroyed (or anonymized) and not simply left indefinitely in its
original state.
What data should the sites retain? The sites should end up with a copy of the data
they provided as input to the study. They are unlikely to have the systems to read the
data in the way it is stored centrally, so a database file is not appropriate. The data
will therefore need conversion to some more readable format – e.g., pdf, csv files,
and spreadsheets. This may need to be negotiated at the beginning of the study, and it
certainly needs to be planned as – especially for a large study with many sites – it
could be a time-consuming and costly exercise. Different CDMS have different
capabilities in this respect – some have an ‘archive’ function which will generate the
required data in a suitable format, while for others it will be more of a manual
exercise. The use of optical storage is a common way of transferring the data, i.e.,
sending the site a CD-ROM. If not copied to local systems, however, the disk may
not be accessible after several years – CD-ROMs have a finite lifetime – so
arrangements should be in place to ensure that the copying takes place, even though
this is the site’s responsibility.
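
Where a CDMS has no built-in archive function, generating the site copies can be scripted from a central flat-file extract. The sketch below (the file and column names are invented) splits one extract into a separate CSV per site; a real export would also need to cover the audit trail and any electronic signature records:

# Sketch of returning each site its own data: split a central flat-file
# extract into one CSV per site. The file and column names are invented
# for illustration.

import csv
from collections import defaultdict

rows_by_site = defaultdict(list)
with open("central_extract.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fields = reader.fieldnames
    for row in reader:
        rows_by_site[row["SITEID"]].append(row)

for site, rows in rows_by_site.items():
    with open(f"site_{site}_data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)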
How should final data destruction (if it ever occurs) be managed? If and when
some or all of the data is to be destroyed, then this should be explicitly authorized
and then documented. This can be quite difficult if the infrastructure where the data
sits is not under the direct control of the sponsor or the sponsor’s agents, but some
form of certification or assurance should be sought to show that the sponsor has done
their best to ensure full destruction.
A related issue is that elements of the underlying infrastructure will be periodi-
cally replaced. Machines and storage devices have a finite lifetime – indeed the
physics of solid-state storage devices (SSDs) means that each memory cell can only
be written a finite number of times. There will therefore inevitably come a time when these
devices need replacing, with their data being transferred to new systems. It is
important that the infrastructure’s users are aware of this and are satisfied that device
removal also renders the data on that device completely inaccessible, usually
through physical destruction of the device. If the IT infrastructure is “in-house,”
this is relatively straightforward – procedures can be established to ensure it hap-
pens. When the infrastructure is external, it becomes more difficult, requiring
explicit recognition of this issue with suitable assurances sought and provided.
To ensure that all the questions listed above are considered, even if only in the
form of a checklist that needs to be worked through, it is important that both the
sponsor and the operational managers of a trial, the CRO or trials unit, have a
standard operational procedure (SOP) in place covering long-term data management.
It should be integrated with the other SOPs covering trial setup and design, and the
decisions taken as a result of working through it should be documented in the trial’s
data management plan (DMP). Much later, when the study ends, the same DMP can
be used to document the actions taken as a result of the plan.

Data for Secondary Use

Over and beyond simply keeping the data because it is a regulatory requirement,
which at base is a rather passive and defensive exercise, there is a growing recog-
nition that the data from a clinical trial has potential scientific value to other
researchers and through them to society as a whole. Over recent decades therefore,
there has been a steadily growing acceptance that a study’s individual participant
data (IPD) should be actively prepared for possible secondary use (so called because
it is outside the primary use of the original research study) and then be openly
advertised as available, albeit usually under controlled access.

The Push for Secondary Data Use

Making data available in this way has been driven by the convergence of a number of
different arguments and trends. Among the arguments advanced in favor of data re-
use are:

• It allows the conclusions from trials to be re-examined and verified or corrected,
although naturally enough it is the corrections that tend to generate the most
coverage. The example of GSK’s study 329 has already been quoted. Another
controversial case (where researchers were compelled to give up their data under
a Freedom of Information Act) was the re-analysis of the PACE trial on treatments
for myalgic encephalomyelitis (White et al. 2011; Geraghty 2016; Torjesen
2018).
• IPD re-analysis can trigger a debate over methodology and analytic methods that
may stimulate further work, as well as clarifying the value of the original research.
The re-analysis of an influential de-worming trial in Kenya, where the researchers
made their data available voluntarily, provides one example (Miguel and Kremer
2004; Davey et al. 2015; Özler 2015), as does the re-analysis of the FEAST trial,
looking at fluid resuscitation in African children with shock and severe infection
(Maitland et al. 2011; Levin et al. 2019; Maitland et al. 2019).
• In times of global pandemics like Ebola or COVID-19, the availability of IPD can
be critical in allowing investigators to properly evaluate the often hastily prepared
reports, as well as allowing the possible pooling of data from different sources. In
this context, calls for data sharing have been issued by the WHO (2015), the
Wellcome Trust (2020), and the Research Data Alliance (2020).
• Data availability makes it possible to compare or combine the data from different
studies. An example is the cross-study “data platforms” that have been
established in some specialist disease areas, for example, the Ebola Data Platform
(IDDO 2020). It also allows data aggregation for participant-level meta-analysis,
where, despite the potential advantages of such analyses, data has often been
difficult to obtain (Riley et al. 2010).
• Secondary use can reduce unnecessary duplication of work and make it easier to
build upon a trial with additional ancillary studies. An early example was the
Diabetes Control and Complications Trial (DCCT) of 1993, which made its data
available to other investigators. By 2015 over 220 ancillary studies had been
carried out using or building upon DCCT data, i.e., with the same cohort of
participants (Henry and Fitzpatrick 2015; EDIC 2020).
• Secondary use can lead to novel analyses and/or tool generation. In an experiment
in 2016, the New England Journal of Medicine hosted the SPRINT data analysis
challenge. People were invited to analyze the IPD from the NIH-sponsored
SPRINT trial (SPRINT Research Group 2015), “to identify a novel scientific
or clinical finding that advances medical science.” A total of 143 different
entries were received, each representing a new application of the data
(NEJM 2016).
• Economically, because data sharing can increase the quality and efficiency of
clinical trials through the mechanisms described above, it can help to reduce the
wastage in research (Chan et al. 2014). Not surprisingly, funders are often strong
supporters of data sharing and mandate it in the studies they support. The
Wellcome Trust, the UK’s Medical Research Council, Cancer Research UK,
and the Bill and Melinda Gates Foundation all require that data be made available
for re-use. In a joint declaration, they concluded “It is simply unacceptable that
the data from published clinical trials are not made available to researchers and
used to their fullest potential to improve health” (Kiley et al. 2017).
• Ethically, IPD sharing has been framed as a way of better respecting the gener-
osity of clinical trial participants, as it increases the utility of the data they provide
and thus the value of their contribution. It has also been argued that, if access to
health and healthcare is a basic human right, access to data that can improve
health is similarly a fundamental right (Lemmens 2013). Those involved in
research therefore have an obligation to respect and promote that right by making
their data available (Lemmens and Telfer 2012).
• Socially, the substantial public investment in science, including clinical research,
demands a similarly public response: “publicly funded research data are a public
good, produced in the public interest, which should be made openly available
with as few restrictions as possible in a timely and responsible manner” (UKRI
2020). Because clinical research has a key role in promoting and maintaining
health, and in determining regulatory and safety decisions, it has been further
argued that clinical trial data should be shared and treated as a public good
whoever generates it, i.e., whether it is created by publicly funded or commercial
research (Reichman 2009).
• Culturally, making IPD from clinical research available is part of a wider shift in
science as a whole, toward making data FAIR (findable, accessible, interoperable,
and reusable), itself part of a more general move toward “open” science (Wilkin-
son et al. 2016). Clinical research is increasingly aligning itself with the more
open data sharing already practiced in many disciplines, including basic biolog-
ical sciences as well as physics, astronomy, geology, etc. The fact that clinical
research IPD is sensitive personal data certainly makes data sharing more com-
plex, but not impossible.

For all of the reasons listed above, the idea of data sharing in clinical research has
become much more acceptable, to the extent that Vickers was able to claim a
“tectonic shift in attitudes” over 10 years (Vickers 2016). In addition, many trial
registries now include sections for trialists to describe their plans for data sharing,
and there is strong encouragement from many journals for authors to indicate how
they will make the underlying data for a paper available to others. BMJ, for
example, stipulates the following for most of its major titles (BMJ 2020):

• “We strongly encourage that data generated by your research that supports your article be
made available as soon as possible, wherever legally and ethically possible.
• We require data from clinical trials to be made available upon reasonable request.
• We require that a data sharing plan must be included with trial registration for clinical
trials that begin enrolling participants on or after 1 January 2019. . . .”

The last requirement is in line with the data sharing recommendations of the
influential International Committee of Medical Journal Editors, which also stipulate
that clinical trials must include a data sharing plan in the trial’s registration, from 2019
onward. The ICMJE currently stops short, however, of requiring the availability of data
for secondary use. Not making data or documents available is still listed as an example
of a valid data sharing plan (ICMJE 2020). This may be a recognition that, despite the
various “top-down” pressures to make data available, e.g., from publishers and
funders, and the growing cultural acceptance of data sharing, from a “bottom-up”
perspective there are several potential risks that can make it less appealing.

Barriers and Issues with Secondary Use of Data

Some investigators fear that others could “mine” their data for insights and results
that would otherwise be available only to them. The value of published
papers for career progression can make this concern, that others might pre-empt
“their” papers using “their” data, a critical factor in deciding when data should be
made more generally available. It also influences the debate about whether the whole
dataset produced by a study should be made available, or just the data used to
support the conclusions of published papers, which may be subsets of the whole.
There have also been claims that researchers in low- and middle-income countries
could be particularly disadvantaged if their data is made available to those with more
developed capacities for analysis (Tangcharoensathien et al. 2010), a case of FAIR
being potentially unfair. The call has therefore been made for data sharing in such
contexts to be approached as a partnership and a mutual learning exercise, rather
than the simple appropriation of one group’s data by another.
Some authors are also reluctant to allow their data and analyses to be
examined and possibly misunderstood, misused, or simply criticized, with a conse-
quent need to enter into a public debate that could endanger their reputation, as well
as demanding time and effort. This was one of the major reasons quoted by
researchers as a barrier to data sharing in a survey conducted by the Wellcome
Trust (Van den Eyndon et al. 2016) across a range of researchers in the biological,
medical, and social sciences. The same survey, however, also found that in reality –
despite some of the high-profile cases reported above – very few researchers reported
these types of negative experiences from data sharing. In fact, the side effects of
sharing, when they did occur, were almost always positive, including more
collaboration opportunities and an increase in citations.
The lack of conventional academic reward for “simply” sharing data has also
been recognized as a barrier to data sharing. At the technical level, there needs to be
consensus on the best ways in which re-used data should be cited to ensure that the
original data generators are properly recognized (including tackling the issue of
different versions of data being available), but more importantly those citations then
need to be included in the evaluation of a researcher’s work, for example, when
considering grant applications or career progression. One scheme (Bierer et al. 2017)
proposes the use of the term “data author” in the literature, to clearly distinguish the
contribution of the data generators to the research, in not only initially collecting but
also managing, cleaning, curating, and preparing the data for re-use.
It seems likely that over time experience will show that there is less to fear from
data sharing than some researchers believe. Greater clarity should emerge from
publishers and others about their expectations for making data available, and the
time periods when that should occur; and systems will evolve to give greater
recognition to the researchers who provided the data. There will remain, however,
some very practical issues that contribute to the complexities and costs of supporting
secondary data use, which are often a cause of concern to researchers. An overview
of some of the main issues is provided below.

Appropriate Preparation of Data for Re-use

To preserve participant privacy, as well as the reputation of the investigators and
their institutions or companies, data must almost always be modified before it can be
released for secondary use, and it must conform to the relevant data protection and
privacy regulations. The difficulty is that those regulations will vary across both
space and time. For example, currently (mid-2020) there appears to be a contrast
between a relatively pragmatic approach to secondary use of IPD in the USA, based
on de-identification of the data, and the more complex situation in the EU, where the
General Data Protection Regulation (GDPR) has brought several contentious issues
to the fore, including the exact characterization of pseudonymous data and the
potential role of consent (Peloquin et al. 2020). In addition, while the GDPR was
supposed to harmonize regulations relating to personal data use across the EU, it
returned some of the powers to regulate personal research data back to the member
states, so that a simple, single European regime has not been realized. The result is
that the regulations around secondary use of IPD in the EU continue to evolve.
No attempt is made here to survey the different and developing privacy
and data protection requirements that apply around the world. It will always be
necessary for sponsors and investigators to familiarize themselves with those
requirements and comply with them, which for multi-national trials may involve
multiple jurisdictions. There are, however, some common components to data
preparation that will need to be considered.

The Need for De-identification

The expectation is that data will need to be de-identified before it can be shared. The
de-identification should be sufficient to render the dataset anonymous in practical
terms – i.e., the effort required to identify individuals (for instance, by collecting
and collating additional information from external sources such as social media)
should outweigh any potential benefit to anyone attempting re-identification.
De-identification techniques are described in the literature. For instance, Appendix B
of the Institute of Medicine’s (2015) paper on data sharing, “Concepts and Methods
for De-identifying Clinical Trial Data,” provides an overview of both the assessment
of risks and strategies to mitigate them, focused on but not restricted to the US
context. Also in the USA, the Health Insurance Portability and Accountability Act
(HIPAA) Privacy Rule gives detailed explicit guidance on de-identification tech-
niques. One option is to use a documented “expert determination” that a dataset does
not contain personally identifiable information; the other (the “safe harbor”
method) is to remove all items on a checklist of direct identifiers. The main identifiers
are listed below, but the HHS website should be consulted for details of the full
compliance required (HHS 2020).

• Names
• Geographic locations smaller than a state, including street address, city, county,
post code
• All elements of dates (except year) for dates that are directly related to an
individual, including birth date, admission date, discharge date, and death date
• All ages over 89 and all elements of dates (including year) indicative of such age,
aggregated into a single category of age ≥90
• Fax numbers, telephone numbers
• Device identifiers and serial numbers
• Email addresses, Web Universal Resource Locators (URLs), and IP addresses
• Social security numbers, medical record numbers, and health plan beneficiary
numbers
• Vehicle identifiers and serial numbers, including license plate numbers
• Full-face photographs and comparable images; biometric identifiers
• Account numbers, certificate/license numbers
• Any other unique identifying number, characteristic, or code

In the USA at least, following the fairly straightforward and public rules should
normally allow secondary use. The loss of dates (apart from the year element) is
something that could seriously impact the scientific usefulness of a study dataset,
given the central importance of the timing of events. One way around this is to
“rebase” dates to numbers of days after a fixed point (e.g., randomization), so that
they become integers with no relationship to the calendar.
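
As a minimal sketch of this rebasing step (using the Python pandas library), the fragment below drops a direct identifier and converts an event date to a study day; the column names (SUBJID, NAME, RANDDT, AESTDT, AESTDY) are hypothetical and would follow whatever convention the study database uses.

```python
import pandas as pd

# Hypothetical extract: one adverse event per participant. NAME is a
# direct identifier; the two date columns are calendar dates.
df = pd.DataFrame({
    "SUBJID": ["001", "002", "003"],
    "NAME": ["A. Smith", "B. Jones", "C. Brown"],
    "RANDDT": pd.to_datetime(["2019-03-01", "2019-03-15", "2019-04-02"]),
    "AESTDT": pd.to_datetime(["2019-03-20", "2019-04-01", "2019-04-30"]),
})

# Drop the direct identifier, rebase the event date to days since
# randomization, then drop the calendar dates themselves, so that only
# integers with no relationship to the calendar are released.
df["AESTDY"] = (df["AESTDT"] - df["RANDDT"]).dt.days
released = df.drop(columns=["NAME", "RANDDT", "AESTDT"])
print(released)
#   SUBJID  AESTDY
# 0    001      19
# 1    002      17
# 2    003      28
```

The same rebasing can be applied to any event date in the dataset, with the calendar dates never leaving the secure environment.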
Other de-identification techniques take the obfuscation of data further. One
approach ensures that no collection of data values is unique to a single individual,
instead being shared by at least k individuals (“k-anonymization”). This can include
techniques such as:

• Aggregating categories to reduce unique combinations (for instance, birth years
become age ranges)
• Data perturbation, where the distribution of data is preserved but the actual values
are changed
• Removal of detailed text fields, such as reports of serious adverse events
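
As a rough illustration of the first two techniques, the sketch below aggregates birth years into age bands and then reports the size of the smallest group sharing a combination of quasi-identifiers; the column names are invented, and real k-anonymization work would rely on dedicated tooling and a documented risk assessment.

```python
import pandas as pd

# Hypothetical quasi-identifiers for a handful of participants.
df = pd.DataFrame({
    "birth_year": [1948, 1952, 1967, 1969, 1971],
    "sex": ["F", "F", "M", "M", "F"],
})

# Aggregate birth years into ten-year age bands (assuming a 2020 data
# cut), then drop the more identifying raw value.
age = 2020 - df["birth_year"]
df["age_band"] = pd.cut(age, bins=[40, 50, 60, 70, 80],
                        labels=["41-50", "51-60", "61-70", "71-80"])
df = df.drop(columns=["birth_year"])

# k-anonymity check: every combination of quasi-identifiers should be
# shared by at least k participants. A result of k = 1 means at least
# one participant is unique and further aggregation would be needed.
k = df.groupby(["age_band", "sex"], observed=True).size().min()
print(f"Smallest quasi-identifier group size: k = {k}")
```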

The problem with de-identification is that, if pursued too enthusiastically, the
scientific usefulness of the data may decline. Data perturbation, for example, may
make sense in the context of an epidemiological study with many thousands of
subjects, but unless done with great care, it may distort the statistical analysis in a
clinical trial with just a few hundred participants.
Ideally therefore, the documentation of the de-identification process (which
should always be available with the de-identified data) should indicate if the primary
analyses can be re-run and give the same results as with the original data. A useful
example of a de-identification process being applied in practice is provided by
Keerie et al. (2018), who used guidance provided by the UK’s Medical Research
Council (MRC 2015). The de-identification process included explicit confirmation
that the original analysis could be replicated.
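
A hedged sketch of such a replication check is shown below; a two-sample t-test on synthetic data stands in for whatever the actual primary analysis was.

```python
import numpy as np
import pandas as pd
from scipy import stats

def primary_analysis(df):
    """Stand-in primary analysis: two-sample t-test on the outcome."""
    active = df.loc[df["arm"] == "active", "outcome"]
    control = df.loc[df["arm"] == "control", "outcome"]
    return stats.ttest_ind(active, control)

# Synthetic stand-in for the locked primary dataset.
rng = np.random.default_rng(1)
original = pd.DataFrame({
    "subjid": [f"{i:03d}" for i in range(100)],
    "arm": ["active", "control"] * 50,
    "outcome": rng.normal(size=100),
})

# De-identification limited to dropping direct identifiers leaves the
# analysis variables untouched, so the primary result should replicate.
deidentified = original.drop(columns=["subjid"])

r_orig = primary_analysis(original)
r_deid = primary_analysis(deidentified)
replicates = bool(np.isclose(r_orig.statistic, r_deid.statistic)
                  and np.isclose(r_orig.pvalue, r_deid.pvalue))
print(f"Primary analysis replicates after de-identification: {replicates}")
# Perturbation-based techniques would require re-running this check,
# since they can shift the statistics, especially in small trials.
```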
De-identification is almost always a necessary part of data preparation for re-use
and applies to data classified as either pseudonymized or anonymized – the need to
prevent re-identification from the data itself is the same in each case. Nor does it
matter if the term “de-identification” lacks any formal significance within the
jurisdiction’s legal framework (for instance, it is not defined in the GDPR); the
process will still be required to protect study participants.

Anonymized Versus Pseudonymized Data

One of the problems of any discussion around data sharing and re-use is that the
terms “anonymized data” and “pseudonymized data” are subject to a range of
interpretations, both in common use and more formally within legal frameworks.
It is therefore usually necessary to preface any discussion of secondary IPD re-use by
setting out the definitions used in any particular case (e.g., as in Ohmann et al.
(2017)).
From the point of view of preparing data, it is necessary to be clear about the legal
implications, if any, of labelling data either pseudonymized or anonymized, and then
about how the various datasets in question are categorized along this dimension.
In some cases it will also be necessary to clarify how data can be moved from a
pseudonymized to an anonymized state, to take advantage of what are usually lesser
restrictions around anonymized data.
Both anonymized and pseudonymized data prepared for re-use should be de-
identified, and the datasets themselves are likely to be identical. The difference is
only that with pseudonymized data there is additional information, kept separately,
that can be used to link the data back to the individual participants.
All clinical trial data starts off as pseudonymized – it can always be linked back to
the participants at the source clinical sites. During data collection it remains pseudo-
nymized, and the linkage must remain during the legal retention period, which as
described above may be decades. But the fact that a pseudonymized version of the
data exists does not necessarily mean that data released for secondary use is also
pseudonymized (although, inevitably, it depends on how those terms are defined). Some
have argued that if a dataset is de-identified and released without the pseudo-
nymizing linkage data, with the recipients having no (legal) means of obtaining
that linking data, it is “effectively anonymized.” Whether this distinction is applica-
ble within any particular legal jurisdiction would need to be clarified.
It has also been proposed that for extra security the participant identifiers attached
to an “effectively anonymized” dataset should also be regenerated as a separate,
independent set with no links to those used in the primary study. The difficulties
with that are that (a) it does not stop the data theoretically being matched back to the
pseudonymized data, using the full sets of data values linked to each participant, and
(b) if a new signal is detected in secondary analysis that may be relevant to the
treatment of some of the participants, it makes it much more difficult to make that
information known to the participant and/or their clinician.

The Role of Consent

Data classified as pseudonymized generally requires consent from the data donors,
the study participants, for any particular usage. This may, however, depend on the
legal basis claimed for the data processing, with the options available being depen-
dent, as always, on the local legal framework. If the legal basis is something other
than consent (e.g., is “public interest”), then the presence or not of an associated
consent is likely to be irrelevant. This is particularly important in a public health
emergency such as the COVID-19 pandemic. Many states retain the right to process
individual health and research data in the interests of public health, with suitable
safeguards but irrespective of the presence or not of explicit consent. Secondary re-
use of data may therefore be allowed under such regulations.
Even when consent is the basis of processing pseudonymized data, as it often is
with the primary study, the difficulty with consent for secondary use is that it cannot
be fully informed. By definition, at the time of the primary study, the nature of any
secondary usage is unknown. Attempts have been made to promote the use of a
“general consent” for secondary use for research purposes, linked to assurances
about data de-identification and data location (as has been proposed for bio-bank
materials), but it is not clear if such a consent would be acceptable in all jurisdictions.
Consent therefore remains a tricky issue whose relevance urgently needs clarifi-
cation in circumstances where data remains classified as pseudonymized. Having
said that, whether consent is deemed relevant from a legal standpoint or not, there
remains an ethical imperative to inform the study participant of any plans to make
the data they provide available for data sharing. Such information should be pro-
vided as part of the information sheets given to participants when they enter a study,
and include relevant details such as the de-identification measures that will be
applied, the location of data storage, and restrictions that will be placed on any sec-
ondary users (in turn meaning that establishing these details is a necessary part of
initial study planning).
The question then arises as to whether a participant should be able to object to
their data being re-used beyond the primary study, and be able to withdraw their data
from such re-use, either at the beginning of the study or at any time afterwards, even
after data collection has ceased. Again, different legal jurisdictions may have
something to say about whether this is possible or desirable, and what mechanisms
might be necessary to put in place to support it.

The Role of Data Use Agreements

Although it is possible that anonymized and heavily de-identified IPD may simply
be released into the public domain, the difficulties with guaranteeing full
anonymization while retaining scientific utility mean that in many cases investiga-
tors and sponsors will prefer to control access to the data, wrapping that access
within a formal “data use agreement” (or “data sharing agreement”). This allows
them, for example, to insist that applicants for access to the data provide a full
explanation of their reasons, e.g., in a research protocol, and possibly also provide
evidence of ethical approval of that protocol. It also allows them to insist, within a
contractual agreement, that the data applicants will not misuse the data, e.g., by
trying to identify participants or by passing it on to third parties. The agreement can
also clarify the nature of the data access – which might be by on-screen access only,
perhaps even in a specified location, rather than by a simple file download.
Data use agreements can therefore provide a powerful and flexible mechanism for
reducing risks associated with data re-use. Whether or not they can modify the legal
position in respect of allowing data re-use in any particular case will depend, as usual,
on the local legal framework – on whether, for example, it demands that proportionate
risk management measures are put in place. Even if not strictly required, however, a
data use agreement is evidence of good intent and can help to protect sponsors from
reputational damage. It is noteworthy that two major data repositories managing
secondary re-use of data from the pharmaceutical industry – ClinicalStudyDa-
taRequest.com and the Yale University Open Data Access project – both insist on
data use agreements being in place before data is released (CSDR 2020; Yoda 2020).

Maximizing Scientific Value with Data Standards

Traditionally clinical studies have been designed in relative isolation, and study
datasets have therefore also tended to be idiosyncratic, each with a distinct set of
differently defined and coded data points, often categorized in different ways.
Unfortunately this can be a huge problem when trying to compare and/or aggregate
data from different studies, making those processes error prone, time-consuming,
and costly, as well as constraining what comparisons are possible.
The use of standards and conventions for data definitions, however, allows data to
be compared and/or aggregated much more easily across studies (and also, as
described in the chapter on the study data system, allows studies to be designed
more quickly and efficiently). As secondary re-use of data increases, it is vital that
investigators and study designers maximize the inter-operability, and thus the poten-
tial scientific value, of the data they generate, by increasing the use of data standards.
Data standardization operates at various levels:

• The data points selected
• The detailed definition of those data points
• How the data is categorized – the “controlled vocabularies” used in categorized
questions
• How the data is structured and coded in the database
• How the data is described (i.e., associated metadata)

The data points selected will depend upon the outcomes and safety signals to be
measured, as specified in the protocol. The nature of clinical trials means that
sometimes a study will have novel end points and safety signals, but a high
proportion of the data points collected will be very similar across studies. This
applies not just to the common variables found in most studies (e.g., demographics,
medical history, vital signs, adverse events, concomitant medications) but often also
to more disease-specific measures. One aid to selecting outcome measures is the
COMET (Core Outcome Measures in Effectiveness Trials) initiative, designed to
identify and support the generation of core outcome sets (COS) in trials, with a core
outcome set being defined as “an agreed standardised set of outcomes that should be
measured and reported, as a minimum, in all clinical trials in specific areas of health
or health care.” COMET maintains a database of papers that describe the generation
and content of over 400 core outcome sets (Comet 2020).
Professional, national, and international bodies may also develop outcome
measures and measuring schemes for particular disease areas, such as cancer staging
(e.g., TNM) and tumor measurement (e.g., RECIST) or, as an example of a response
to a specific disease threat, the set of data points developed by ISARIC for COVID-
related trials (ISARIC 2020). A further source of standardized data items is
published and validated questionnaires, e.g., those dealing with aspects of the quality
of life of participants or for assessing cognition or mood. Care must be taken that the
instrument is valid for the population under study (and translations in a multi-
national/multi-lingual context must also be validated), but in general using a pre-
existing questionnaire or rating scale will be more useful, less expensive, and more
generalizable than trying to develop a study-specific instrument.
Ensuring detailed definition of data points is critical for meaningful comparison,
within as well as between studies, and underlines the need for good descriptive
metadata that makes such definitions explicit. Although many data points are
relatively unambiguous (date of surgery, weight, blood biochemistry, etc.), some
are not. A notorious example is “blood pressure,” which for consistency should be
further characterized by position (lying, sitting, standing), and perhaps by timing
(e.g., before or after a procedure), but rarely is. Time points for events, despite their
importance for analysis, can also be ill-defined. Is a recurrence of a tumor dated
from the date the patient first reported the associated symptoms, the date of the scan
that confirmed disease progression (possibly one of several scans and tests), or the
date of the multi-disciplinary meeting that formally confirmed the recurrence diag-
nosis? There may be several weeks between these dates, so some consistent rules
need to be developed, applied, and described.
The use of different categorization schemes can also be a headache when com-
paring data, though it is sometimes possible to find mapping schemes between the
major controlled vocabularies. MedDRA is widely used for adverse event reporting
(and is mandated for such use within the EU) and is a de facto international standard
for that purpose. But medical history, for example, might be gathered using
MedDRA, ICD, SNOMED CT, or MeSH terminology, among others. Unfortunately
there are few formal requirements or guidelines regarding the use of one controlled
vocabulary scheme over another – the choice may come down to such factors as
previous training and/or exposure to different systems, the existence or cost of
licenses, the practicalities of use, and the required levels of granularity and compat-
ibility with other systems. Whatever controlled vocabularies are selected, however,
they will almost certainly be more informative and easier to analyze than free text.
The most comprehensive and established framework for using data standards in
clinical research, one that is both global in scope and internationally recognized, is
that provided by CDISC, the Clinical Data Interchange Standards Consortium.
Beginning in 1997, CDISC has provided a broad range of standards and related
tools, covering all phases of the clinical study life cycle, including schema for
structuring pre-clinical data, for study protocols, for data collection, for data trans-
port, and for data submission and analysis. It also provides lists of questionnaires and
controlled vocabularies (CDISC 2020).
The key CDISC resources, in terms of study design and data re-use, are the
Clinical Data Acquisition Standards Harmonization (CDASH) standards and the
Therapeutic Area (TA) user guides. Taken together, these provide a means of
structuring, coding, and defining the data in a consistent fashion, especially those
relating to the data domains commonly found across studies – demographics,
adverse events, subject characteristics, vital signs, treatment exposure, etc.
CDASH and the TA guides are currently used much more within the pharmaceu-
tical industry than the non-commercial sector. The FDA, in the USA, and the
PMDA, in Japan (though not yet the EMA in the EU) have stipulated that data
submitted in pursuance of a marketing authorization must use CDISC’s Study Data
Tabulation Model (SDTM), a standard designed to provide a consistent structure to
submission datasets. Creating SDTM structured data is far easier if the original data
has been collected using CDASH, which is designed to support and map across to
the submission standard.
The CDASH system is relatively simple conceptually, but it is comprehensive,
and it does require an initial investment of time to appreciate the full breadth of data
items that are available and how they can be used. It provides standardized terms for
common study and demographic variables (e.g., SUBJID, SITEID, BRTHDAT,
AGE) and then uses a prefix-suffix system to define further variables in various
domains – 24 such domains are listed in CDASH 2.0. The prefix is a two-letter code
for the domain (AE = adverse events, MH = medical history, CM = concomitant
medication, EX = exposure to the drug under investigation, PR = procedures, etc.).
The suffix indicates the type of data value, so, for instance:

• The start date of an AE event = AESTDAT
• The name of the adverse event = AETERM
• Whether the AE is still ongoing = AEONGO
• The outcome of the AE = AEOUT
• The start date for a concomitant medication = CMSTDAT
• Whether the concomitant medication is ongoing = CMONGO
• The name of the concomitant medication = CMTRT
• The individual dose of the concomitant medication = CMDSTXT
• The start of the investigative treatment = EXSTDAT
• The individual dose of the treatment = EXDSTXT
• The units of the treatment dose = EXDSU
• The route of the treatment dose = EXROUTE
etc., etc.
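
To illustrate the convention, the short sketch below splits CDASH-style names into a domain prefix and a suffix; the domain map covers only the handful of examples just given.

```python
# Minimal sketch: interpret CDASH-style variable names by their
# two-letter domain prefix. Only the domains mentioned above are mapped.
DOMAINS = {
    "AE": "adverse events",
    "MH": "medical history",
    "CM": "concomitant medication",
    "EX": "exposure",
    "PR": "procedures",
}

def describe(variable: str) -> str:
    """Split a CDASH-style name into domain prefix and suffix."""
    prefix, suffix = variable[:2], variable[2:]
    domain = DOMAINS.get(prefix, "unknown domain")
    return f"{variable}: domain = {domain}, element = {suffix}"

for name in ["AESTDAT", "CMONGO", "EXROUTE"]:
    print(describe(name))
# AESTDAT: domain = adverse events, element = STDAT
# CMONGO: domain = concomitant medication, element = ONGO
# EXROUTE: domain = exposure, element = ROUTE
```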

The Therapeutic Area (TA) user guides supplement both the CDASH and the
SDTM standards, providing a steadily growing list of therapeutic or disease area
specific terminology and detailed explanations of how SDTM and CDASH defini-
tions can be applied. The list of therapeutic area standards already developed, or
being developed, is available on the CDISC website. In August 2020 there were 44
areas listed (from acute kidney injury, Alzheimer’s, and asthma through to tubercu-
losis, vaccines, and virology).
An evaluation of CDASH and any relevant TA standards is highly recommended
because they represent a relatively complete system for standardizing data collec-
tion. Readers are referred to the CDISC website (CDISC 2020), which provides
comprehensive implementation guides for each standard. While trials units in the
non-commercial sector have not been forced into preparing CDASH and SDTM
datasets, many have already experimented with using parts of the system. Ultimately,
wide use of the CDISC standards could enable an SDTM-based archiving and data
sharing model that could be used across all sectors of clinical research, allowing a
huge pool of more inter-operable data to become available.
One caveat is that using CDASH can have consequences for the structure of the
data that is exported from a CDMS, i.e., the nature of the analysis datasets, and it is
therefore important that statisticians are happy about the data being presented to
them in this way. Traditionally, CDMS systems generate a series of tables, each
corresponding to an eCRF, with the data arranged as a row/subject visit within that
table. Because the CDASH approach tends to make greater use of small repeating
groups of questions, it creates many “ribbon”-shaped tables instead, each focused on
a particular domain, with relatively few data fields in each but often with many rows.
These tables are organized as one row/event, where the “event” is represented by a
cluster of related data points. Some statisticians may be wary of accepting data in this
form, preferring to transform it to a more traditional structure, or face modifying
their normal approach to analysis and the library of tools they have established. In
other words, making use of the CDISC standards requires the full understanding and
cooperation of the statisticians tasked with analyzing the exported data.
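
As an illustration of the reshaping that may be involved, the sketch below pivots a hypothetical “ribbon”-shaped vital signs table (one row per measurement event) into the traditional one-row-per-subject-visit layout using pandas; the column names merely echo CDISC-style conventions and are not taken from any real study.

```python
import pandas as pd

# Hypothetical "ribbon"-shaped vital signs table: one row per
# measurement event, with relatively few columns but many rows.
ribbon = pd.DataFrame({
    "SUBJID":  ["001", "001", "001", "002", "002", "002"],
    "VISIT":   ["BASELINE"] * 6,
    "VSTEST":  ["SYSBP", "DIABP", "PULSE"] * 2,
    "VSORRES": [128, 82, 70, 135, 88, 64],
})

# Pivot to a traditional layout: one row per subject visit, one column
# per measurement, which many existing analysis programs expect.
wide = (ribbon
        .pivot(index=["SUBJID", "VISIT"], columns="VSTEST",
               values="VSORRES")
        .rename_axis(columns=None)
        .reset_index())
print(wide)
#   SUBJID     VISIT  DIABP  PULSE  SYSBP
# 0    001  BASELINE     82     70    128
# 1    002  BASELINE     88     64    135
```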
The final aspect of standardization to consider is the generation of descriptive
metadata for the study data – the characterization of the data points: their codes,
names, descriptions, types, ranges, possible values, etc. This has traditionally been
done using a variety of methods, from simple “data dictionaries” in spreadsheets
through to XML schemas, for example, using the CDISC Operational Data Model
(ODM). To make this metadata more useful, and in particular more easily searched
and processed by machines, it would be useful to have such metadata in a standard
format, the most appropriate – because it exists specifically for this purpose – being
CDISC’s “Define.xml” standard.
Unfortunately, current use of Define.xml seems very limited outside of the
pharmaceutical industry. There is a need for CDMS developers to incorporate
Define.xml exports in their systems, for tools to help statisticians and others read
or search Define.xml files more easily, for tools to allow machines to search and/or
describe Define.xml content, and for funders, trials units, sponsors, and investigators
to push for greater consistency in generating metadata using this single standard
rather than the variety of approaches that currently exist. The first part of re-using
data is to understand what is in it, and without a consistent approach to metadata, that
is going to be more time-consuming and costly than it should be.
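
By way of illustration only, the sketch below uses the Python standard library to generate a heavily simplified metadata fragment in the spirit of ODM/Define.xml; a real Define.xml file conforms to the full CDISC schema and would normally be produced and validated by dedicated tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical data dictionary entries for two variables.
items = [
    {"OID": "IT.AESTDAT", "Name": "AESTDAT", "DataType": "date",
     "Description": "Start date of the adverse event"},
    {"OID": "IT.AEOUT", "Name": "AEOUT", "DataType": "text",
     "Description": "Outcome of the adverse event"},
]

# Build a simplified, Define.xml-flavored fragment: one ItemDef per
# variable, each with a human-readable description.
mdv = ET.Element("MetaDataVersion", OID="MDV.1", Name="Study metadata")
for item in items:
    item_def = ET.SubElement(mdv, "ItemDef", OID=item["OID"],
                             Name=item["Name"], DataType=item["DataType"])
    text = ET.SubElement(ET.SubElement(item_def, "Description"),
                         "TranslatedText")
    text.text = item["Description"]

ET.indent(mdv)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(mdv, encoding="unicode"))
```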

Managing Data and Data Repositories

Currently, many of the clinical research data objects made available for sharing are
simply retained by the research team that produced them, somewhere on the disk
storage allocated to their department. The alternative, and other things being equal the
preferred option in the longer term, would be to make a conscious decision to move the
whole data package (i.e., datasets and related documents) to a dedicated data repos-
itory. This might be the institution’s or the company’s own data repository, specifically
set up for storing the research outputs of its staff, or it might be a third-party repository
– perhaps a general one storing all types of scientific data, or one specializing in
clinical research data, or even one specializing in a particular disease area.
Note that putting a copy of the data in a repository does not mean granting public
access to it; it simply means preparing the data for possible sharing and then
advertising that it is available. Those wishing to use it would still have to meet
any conditions that the researchers stipulated, e.g., provide a rationale for their usage
and/or adhere to a data use agreement. The data in the repository could be pseudon-
ymous (i.e., could be linked if required to the pseudonymous data held securely by
the researchers) or anonymous (could not be practically linked to that data)
depending on legal requirements.
The advantages of using a separate, dedicated data repository (including one set
up by the researchers’ own institution) include:

• Long-term data management. The original research team (or collaboration) will
change its composition, or may even cease to exist, and it may then become
difficult or impossible for data to be managed and requests for it to be properly
considered.
• Transfer of data to a repository helps to ensure that preparation of the data for
sharing (e.g., de-identification, provision of metadata) occurs and that the data
and related documents are properly described.
• Advertising the data and metadata in a repository’s catalog can help to make that
data and related documents more easily discoverable.
• It can, depending on the arrangements made with the repository, relieve the
original research team/sponsor of the need to review requests and even of the
need to make the decisions about agreeing to such requests.
• Anticipating transfer to a repository aids in explicitly identifying data preparation
and sharing costs at an early stage of the trial.

The problem is that most existing data repositories are not, yet, well organized to
manage the sensitive personal data generated by clinical research and have only
limited facilities for controlled access. The default for data repositories in most
scientific domains is open public access, with the only control a possible embargo
period on data release, so controlled access to sensitive data presents a challenge.
A recent study (Banzi et al. 2019) looked at data repositories that were potentially
available to non-commercial researchers for clinical research data. Twenty-five such
repositories were identified and assessed against eight key criteria (filtered down
from an initial list of 34), seen as particularly relevant to clinical researchers and their
data storage needs. The criteria were that the repository should have:

• Guidelines for data upload and storage
• Support for data de-identification
• Data quality controls
• Contracts for upload and storage
• Exposure of metadata
• Application of identifiers
• Flexibility of access
• Plans for long-term preservation

None of the repositories fully demonstrated all of the eight items included in the
indicator set, although three were judged as demonstrating or partially demonstrating
all of them. Other repositories appeared less suitable in a variety of ways, although
this may have been because in many cases the relevant information was not available
publicly on the repository’s website – many repositories do not do a good job of
advertising their services.
This situation may improve but at the moment it is clear that the full potential for
data re-use for clinical research data is hampered by the lack of suitable places to
store that data in the long term. This problem also underscores the need for robust
and public assessments of data repositories, so that potential users can make an
informed decision. Various general schemes have been proposed for this (e.g., see
CoreTrustSeal 2020), but they have not yet been expanded to include specialist
certification schemes for groups with particular requirements. Such a development
will be necessary, however, in order that clinical researchers can make informed
decisions about the storage of their data.

Trials Unit Systems for Managing Secondary Re-use of Individual Participant Data

Managing the secondary use of clinical study IPD is complex, with a range of
technical, resource, and legal issues to consider. In many cases decisions will be
required as part of study planning – for example, deciding how to integrate data
standards in the study’s database design and what to include about potential data re-
use in the information sheets prepared for participants. Even when decisions and
activities could be postponed until the end of the study (e.g., deciding where data
should be stored in the long term, de-identifying the data), they should be anticipated
at the beginning of the study in order to estimate the resources required for those
activities and include the associated costs in bids for funds (not least because the
impetus for data sharing often comes from funders).
Managing the potential re-use of data is also a relatively new activity – one more
aspect of running a trial to add to all the other responsibilities and requirements faced
by investigators and operational managers. So how should a trials unit (using that
term in the most general sense, i.e., the trial management department in a pharma-
ceutical company, CRO, university, or hospital) integrate managing data re-use with
the other services it offers to investigators and sponsors? It seems clear that two
broad types of activity are necessary:

• A general preparation, of systems and staff, to understand and be prepared for the
various aspects of data re-use
• Study-specific activity, to manage the details of data re-use in the context of a
particular study. The latter will be split into two time points:
– That required during study planning, and
– That required at study end

As a brief practical guide to supporting data re-use, but also to summarize many
of the points made earlier in this chapter, suggestions for the main elements of each
of these activities are listed below:

General Preparation
• Clarify the legal regulations and requirements for data sharing in the relevant
legal jurisdiction(s). “Relevant” usually means those in which any study partic-
ipants live, and not just the jurisdiction of the trials unit itself. Among the things
to be clarified are the legal basis, or bases, under which data re-use is justified, the
role of consent (if any), the definitions and relevance of anonymized and pseudo-
nymized data, the need for data de-identification, and the need to demonstrate risk
assessment and/or risk management.
• Clarify any existing policies and procedures relevant to data sharing of the parent
organization (if there is one, e.g., a hospital, university, or company), and
incorporate them as necessary into the trials unit’s own procedures.
• For external sponsors or funders involved with a lot of studies, clarify their
policies and procedures relevant to data sharing (and data retention), with a
view to incorporating them, as necessary, into the trials unit’s own procedures.
• Ensure sufficient staff are familiar with regulations and policies relating to data re-
use, as described above, for study-specific work in this area to be carried out
effectively. Consider giving one or two roles operational oversight of the prepa-
ration for data re-use.
• Ensure sufficient staff are familiar with data standards and their application in
local systems, so that standards such as CDASH can be applied – perhaps at relatively
low levels initially but increasing over time. (Application of data standards may
require separate SOPs).
• Explore the options available for long-term data storage and data management,
within the department, within the larger parent organization, and within external
third-party repositories.
• Develop an SOP for preparing data for re-use, to be applied in the context of any
particular trial (unless all the decisions relating to data re-use in that trial are taken
entirely by an external sponsor). Integrate it into more general SOPs on study
preparation so that the data re-use is considered, planned, and costed at the outset
of the trial’s setup. The elements of the SOP are described in more detail in the
study-specific section below.
• Develop an SOP on responding to requests for data, to be applied in the context of
any particular trial (unless all such requests in that trial are managed entirely by an
external sponsor).

Study-Specific Activity (within study planning):
• Any IPD sharing policies/procedures of the study sponsor (or other data control-
ler) and/or the study funder should be checked and the implications of these for
IPD sharing identified (if not already known from the more general review
described above).
• The IPD sharing policies and expectations of any collaborating groups should
also be checked and incorporated into the proposed plan for IPD secondary
use.
• The sponsor as the data controller should normally take overall responsibility for
making decisions about possible data re-use, though may delegate this in practice
to a Trial Management Group (TMG). The TMG may in turn delegate the
function to a smaller group.
• The datasets to be made available for sharing should be identified, together with
an estimate of the time points when they will become available (e.g., in months
after the primary paper has been published).
• The documents to be made available for sharing should be identified (e.g.,
protocols, results summary) together with an estimate of the time points when
they will become available. Times are likely to differ between different
documents.
• Depending on the legal requirements, the need for a specific consent to enable
data re-use needs to be considered. If such a consent is deemed necessary, it and
associated information will need to be written and incorporated into the consent
forms and other participant documents.
• Likely de-identification steps need to be identified to aid in estimating the work
that will be involved.
• The pros and cons of transferring data and documents to a dedicated data
repository, in the same or an external organization, should be considered. At
this stage this might only be a “decision in principle,” but it may have a cost
implication and therefore needs to be considered.
• The extent and type of the use of data standards in the study should be decided,
planned, and documented.
• The type of access to be offered needs to be decided, at least in principle. For
example, access might be public, or by prior request to the investigator, with a
reasoned scientific justification, or it might be initially to an expert review panel
who could screen requests and recommend appropriate action.
• The likely costs of preparing data for sharing and then storing it in the long term
should be estimated. These may include the costs required for (a) de-identifi-
cation, (b) checking the impact of de-identification on scientific utility, (c)
preparing metadata, (d) identifying a suitable repository and negotiating with
it, and (e) long-term storage costs, if any. Costs should be included in bids for
funds.
• Text needs to be prepared so that the data sharing plan can be summarized (a)
within the protocol, (b) within the trial registration entry (or entries) and (c) within
patient information sheets.

Study-Specific Activity (at study end):
• The plans for data re-use should be reviewed, in the light of available resources,
possible changes in legislation (or the interpretation of that legislation), changed
scientific expectations about data sharing, etc.
• The data preparation required to support IPD sharing should be carried out
according to the agreed strategy. This will include (a) application of de-identifi-
cation techniques and (b) checking of the impact of de-identification on the
analyses carried out on the data.
• If and as required by local regulation, risk assessment and risk management
documents may have to be prepared. The latter may need to include descriptions
of the data use agreements to be employed.
• Additional metadata for the dataset(s) and documents should be prepared. This
includes (a) descriptive metadata for the de-identified dataset(s), e.g., using
CDISC’s Define.xml, and (b) provenance/discovery metadata, for use in “adver-
tising” the data and documents, e.g., in the context of a data repository. Both these
steps may need to be repeated multiple times, as different datasets and documents
become available.
• If a dedicated external repository is to be used, datasets and documents should be
transferred to that repository. The transfer should be subject to formal agreements
that stipulate the responsibilities of each party.

Conclusion

Study data management does not end with the end of the study. Data must be
retained for a set period to allow re-analysis if necessary, and, increasingly, data –
or a de-identified subset of it – is expected to be made accessible to others, if they can
justify the reasons for that access.
The simple retention of data is not difficult and has long been a requirement, but it
does require clear planning and resourcing, and it needs to be comprehensive – all
forms of the data need to be considered and archived or destroyed as necessary.
Because interest in the data may have waned at study end, it is important that all
decisions relating to data retention are taken at the beginning of the study, by the
sponsor but usually in collaboration with the study’s operational managers, and
that all activities are properly resourced.
Making data available for secondary re-use is a more complex, active process that
is relatively new for most trialists and trials units, but it is increasingly becoming an
expectation. The details of the processing required will inevitably depend on the
legal framework that applies, but as a minimum data will need to be de-identified. To
make the shared data more useful, it will also be important to ensure that the data is
as inter-operable as possible, by incorporating data standards into the study design
from the outset (applying such standards retrospectively can be done in theory, but in
practice is a very difficult and costly process). Finally, to make the data and
associated documents available in the long term, it will often be necessary to transfer
them to a dedicated data repository.
Again, the only way this end of study activity can be delivered efficiently is by
planning and resourcing it from the beginning of the study planning process. That
means setting up systems (including adequately trained staff as well as relevant
SOPs) that allow preparation for data re-use to be integrated into the rest of study
management.

Key Facts

1. Clinical trial data must be retained, for potential re-analysis and investigation,
for periods determined by the relevant legal jurisdiction(s).
2. Data retention is also required for compliance with Good Clinical Practice
(GCP).
3. Data should be retained both centrally and at each clinical site.
4. The sponsor has the final decision with regard to the details (files, format,
location, etc.) of retained data.
5. Arrangements for retaining data in the long term should be established by the
sponsor, in collaboration with the trial’s operational managers (e.g., CRO or
trials unit) as part of study planning.
6. In recent years there has been increasing pressure on sponsors and investigators
to make de-identified individual participant data and study documents available
to others, for secondary research purposes.
7. Funders and publishers in particular have been keen to encourage secondary re-
use, as a mechanism for raising both the cost-effectiveness and the quality of
research.
8. The regulations governing secondary re-use will vary from one legal jurisdiction
to another and over time.
9. Almost all data will need to undergo de-identification before it is suitable for
secondary re-use. A variety of techniques have been published, but the stronger
the de-identification applied, the greater the risk to the scientific utility of the data.
10. Data use agreements offer an additional level of risk management around
secondary use and are an important reason why access to the data should
often be controlled rather than freely available.
11. To maximize the value of secondary use, it is important to make data as inter-
operable as possible, by the use of data standards – e.g., with common outcome
sets, consistent data definitions and categorizations, standardized data structures
and codes, and a standardized metadata scheme.
12. In the longer term, data is best transferred to a dedicated data repository.
Unfortunately, at the moment, few existing repositories are well adapted to
managing sensitive personal data available under controlled access.
13. Sponsors and study operational managers need to develop systems and pro-
cesses to support the preparation of data for re-use. This includes adequate
training of staff as well as SOPs and other quality documents.
14. Although much of the activity related to data re-use occurs at the end of the
study, much of the planning for it needs to take place at the beginning, as part of
the general study planning process.

Cross-References

▶ Archiving Records and Materials
▶ De-identifying Clinical Trial Data
▶ Design and Development of the Study Data System
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Responsibilities and Management of the Clinical Coordinating Center

References
Banzi R, Canham S, Kuchinke W et al (2019) Evaluation of repositories for sharing individual-
participant data from clinical studies. Trials 20:169. https://doi.org/10.1186/s13063-019-3253-3.
Available at https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-019-3253-3.
Accessed 13 Aug 2020
Bierer B, Crosas M, Pierce H (2017) Data authorship as an incentive to data sharing. N Engl J Med
376:1684–1687. https://doi.org/10.1056/NEJMsb1616595. Available at https://www.nejm.org/
doi/10.1056/NEJMsb1616595. Accessed 14 June 2020
BMJ (2020) BMJ author Hub: data sharing. Available at https://authors.bmj.com/policies/data-
sharing/. Accessed 8 June 2020
CDISC (2020) CDISC standards in the clinical research process. Available at https://www.cdisc.
org/standards. Accessed 13 Aug 2020
Chan A, Song F, Vickers A et al (2014) Increasing value and reducing waste: addressing inacces-
sible research. Lancet 383:257–266. https://doi.org/10.1016/S0140-6736(13)62296-5
COMET (2020) Core outcome measures in effectiveness trials. Available at http://www.comet-
initiative.org/. Accessed 13 Aug 2020
CoreTrustSeal (2020) CoreTrustSeal certification. Available at https://www.coretrustseal.org/.
Accessed 13 Aug 2020
CSDR (2020) ClinicalStudyDataRequest.com: data sharing agreement. Available at https://
clinicalstudydatarequest.com/Help/Help-Data-Sharing-Agreement.aspx. Accessed 13 Aug
2020
Davey C, Aiken A, Hayes R, Hargreaves J (2015) Re-analysis of health and educational impacts of
a school-based deworming programme in western Kenya: a statistical replication of a cluster
quasi-randomized stepped-wedge trial. Int J Epidemiol 44:1581–1592. https://doi.org/10.1093/
ije/dyv128
Doshi P, Dickersin K, Healy D et al (2013) Restoring invisible and abandoned trials: a call for
people to publish the findings. BMJ 346. https://doi.org/10.1136/bmj.f2865. Available at https://
www.bmj.com/content/346/bmj.f2865. Accessed 5 June 2020
EDIC (2020) The epidemiology of diabetes interventions and complications. Available at https://
edic.bsc.gwu.edu/. Accessed 7 June 2020
EMA (2018) Guideline on the content, management and archiving of the clinical trial master file
(paper and/or electronic). EMA/INS/GCP/856758/2018 Good Clinical Practice Inspectors
Working Group (GCP IWG). Available at https://www.ema.europa.eu/en/documents/scien
tific-guideline/guideline-content-management-archiving-clinical-trial-master-file-paper/elec
tronic_en.pdf. Accessed 5 June 2020
European Commission (2003) Directive 2003/63/EC, amending Directive 2001/83/EC relating to medicinal products for human use. Available at https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/dir_2003_63/dir_2003_63_en.pdf. Accessed 5 June 2020
European Commission (2014) Regulation 536/2014 of the European Parliament and of the Council
of 16 April 2014 on clinical trials on medicinal products for human use. Available at https://ec.
europa.eu/health/sites/health/files/files/eudralex/vol-1/reg_2014_536/reg_2014_536_en.pdf.
Accessed 5 June 2020
Geraghty K (2016) ‘PACE-Gate’: when clinical trial evidence meets open data access (Editorial).
J Health Psychol. https://doi.org/10.1177/1359105316675213. Available at https://journals.
sagepub.com/doi/10.1177/1359105316675213. Accessed 8 June 2020
Henry D, Fitzpatrick T (2015) Liberating the data from clinical trials (editorial). BMJ 351:h4601.
https://doi.org/10.1136/bmj.h4601
HHS.Gov (2020) Guidance regarding methods for de-identification of protected health information
in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy
rule. Available at https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identi
fication/index.html#standard
ICH (2016) International council for harmonisation of technical requirements for pharmaceuticals
for human use (ICH) guidelines for good clinical practice E6(R2). Available at https://database.
ich.org/sites/default/files/E6_R2_Addendum.pdf. Accessed 5 June 2020
ICMJE (2020) International Committee of Medical Journal Editors: data sharing. Available at http://
icmje.org/recommendations/browse/publishing-and-editorial-issues/clinical-trial-registration.
html#two. Accessed 13 Aug 2020
IDDO (2020) Infectious diseases data observatory: Ebola. Available at https://www.iddo.org/ebola/
data-sharing/accessing-data. Accessed 7 June 2020
Institute of Medicine (2015) Sharing clinical trial data, maximizing benefits, minimizing risk.
Appendix B. Concepts and methods for de-identifying clinical trial data. National Academies
Press, Washington, DC. Available at https://www.nap.edu/read/18998/chapter/10. Accessed 12
Aug 2020
ISARIC (2020) Clinical data collection – the COVID-19 case report forms (CRFs). Available at
https://isaric.tghn.org/COVID-19-CRF/. Accessed 13 Aug 2020
Keerie C, Tuck C, Milne G et al (2018) Data sharing in clinical trials – practical guidance on
anonymising trial datasets. Trials 19:25. https://doi.org/10.1186/s13063-017-2382-9. Available
at https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-017-2382-9. Accessed 12
Aug 2020
Keller MB et al (2001) Efficacy of paroxetine in the treatment of adolescent major depression: a
randomized, controlled trial. J Am Acad Child Adolesc Psychiatry 40(7):762–772. https://doi.
org/10.1097/00004583-200107000-00010. Available at http://www.dcscience.net/keller-2001-
paroxetine.pdf. Accessed 5 June 2020
Kiley R, Peatfield T, Hansen J, Reddington F (2017) Data sharing from clinical trials – a research
funder’s perspective. N Engl J Med 377:1990–1992. https://doi.org/10.1056/NEJMsb1708278.
Available at https://www.nejm.org/doi/full/10.1056/NEJMsb1708278. Accessed 8 June 2020
Le Noury J, Nardo J, Healy D et al (2015) Restoring study 329: efficacy and harms of paroxetine and
imipramine in treatment of major depression in adolescence. BMJ 351. https://doi.org/10.1136/
bmj.h4320. Available at https://www.bmj.com/content/351/bmj.h4320. Accessed 5 June 2020
Lemmens T (2013) Pharmaceutical knowledge governance: a human rights perspective. J Law Med
Ethics 41(1):163–184
Lemmens T, Telfer C (2012) Access to information and the right to health: the human rights case for
clinical trials transparency. Am J Law Med 31(1):63–112
Levin M, Cunnington AJ, Wilson C et al (2019) Effects of saline or albumin fluid bolus in
resuscitation: evidence from re-analysis of the FEAST trial. Lancet Respir Med 7:581–593.
https://doi.org/10.1016/S2213-2600(19)30114-6. Available at https://www.thelancet.com/
journals/lanres/article/PIIS2213-2600(19)30114-6/fulltext. Accessed 8 June 2020
Maitland K, Kiguli S, Opoka R et al (2011) Mortality after fluid bolus in African children with
severe infection. N Engl J Med 364(26):2483–2495. https://doi.org/10.1056/NEJMoa1101549.
Epub 2011 May 26. Available at https://www.nejm.org/doi/10.1056/NEJMoa1101549?url_
ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200www.ncbi.nlm.nih.
gov. Accessed 8 June 2020
Maitland K, Gibb D, Babiker A et al (2019) Secondary re-analysis of the FEAST trial (correspon-
dence). Lancet Respir Med 7(10):E29. https://doi.org/10.1016/S2213-2600(19)30272-3
Miguel E, Kremer M (2004) Worms: identifying impacts on education and health in the presence of
treatment externalities. Econometrica 72(1):159–217. https://doi.org/10.1111/j.1468-0262.2004.
00481.x.
MRC (2015) Good practice principles for sharing individual participant data from publicly funded
clinical trials. MRC Hub for Trials methodology research, UKCRC, Cancer Research UK,
Wellcome Trust. Available at https://www.methodologyhubs.mrc.ac.uk/files/7114/3682/3831/
Datasharingguidance2015.pdf. Accessed 12 Aug 2020
NEJM (2016) The SPRINT data analysis challenge. Available at https://challenge.nejm.org/pages/
about. Accessed 8 June 2020
Ohmann C, Banzi R, Canham S et al (2017) Sharing and reuse of individual participant data from
clinical trials: principles and recommendations. BMJ Open 7:e018647. https://doi.org/10.1136/
bmjopen-2017-018647. Available at https://bmjopen.bmj.com/content/bmjopen/7/12/e018647.
full.pdf. Accessed 12 Aug 2020
Özler B (2015) Worm wars: a review of the reanalysis of Miguel and Kremer’s deworming study.
World Bank Blogs. Available at https://blogs.worldbank.org/impactevaluations/worm-wars-
review-reanalysis-miguel-and-kremer-s-deworming-study. Accessed 8 June 2020
Peloquin D, DiMaio M, Bierer B, Barnes M (2020) Disruptive and avoidable: GDPR challenges to
secondary research uses of data. Eur J Hum Genet 28:697–705. https://doi.org/10.1038/s41431-020-
0596-x. Available at https://www.nature.com/articles/s41431-020-0596-x. Accessed 12 Aug 2020
Ramokapane K, Rashid A, Such J (2016) Assured deletion in the cloud: requirements, challenges
and future directions. Conference paper at ACM, October 2016. https://doi.org/10.1145/
2996429.2996434. Available at http://eprints.lancs.ac.uk/81611/1/Assured_deletion_Final_
version.pdf. Accessed 7 June 2020
Reichman J (2009) Rethinking the role of clinical trial data in international intellectual property law:
the case for a public goods approach. Marquette Intellect Prop Law Rev 13(1):1–68
Research Data Alliance (2020) RDA COVID-19 recommendations and guidelines (5th Release).
Available at https://www.rd-alliance.org/system/files/RDA%20COVID-19%3B%20recommen
dations%20and%20guidelines%2C%205th%20release%20%28final%20draft%29%2028%
20May%202020.pdf. Accessed 7 June 2020
RIAT Support Center (2020) Restoring invisible and abandoned trials. Available at https://
restoringtrials.org/. Accessed 5 June 2020
Riley R, Lambert P, AboZaid G (2010) Meta-analysis of individual participant data: rationale,
conduct, and reporting. BMJ 340:c221. https://doi.org/10.1136/bmj.c221. Available at https://www.
bmj.com/content/340/bmj.c221. Accessed 7 June 2020
SPRINT Research Group (2015) A randomized trial of intensive versus standard blood-pressure
control. N Engl J Med 373:2103–2116. https://doi.org/10.1056/NEJMoa1511939. Available at
https://www.nejm.org/doi/full/10.1056/NEJMoa1511939. Accessed 8 June 2020
Tangcharoensathien V, Boonperm, Jongudomsuk P (2010) Sharing health data: developing country
perspectives. Bull World Health Organ 88(6):468–469. https://doi.org/10.2471/BLT.10.079129.
Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2878166/. Accessed 14 June 2020
Torjesen I (2018) Pressure grows on Lancet to review “flawed” PACE trial (News article). BMJ 362:
k3621. https://doi.org/10.1136/bmj.k3621
UKRI (2020) Common principles on data policy. Available at https://www.ukri.org/funding/
information-for-award-holders/data-policy/common-principles-on-data-policy/. Accessed 7
June 2020
US Code of Federal Regulations (2020) Title 21, Chapter I, 312 D – responsibilities of sponsors and
investigators, section 312.62 investigator recordkeeping and record retention. Available at
https://www.govregs.com/regulations/expand/title21_chapterI_part312_subpartD_section312.
61#title21_chapterI_part312_subpartD_section312.62. Accessed 5 June 2020
Van den Eynden V, Knight G, Vlad A et al (2016) Towards open research: practices, experiences,
barriers and opportunities. Wellcome Trust. https://doi.org/10.6084/m9.figshare.4055448
Vickers A (2016) Sharing raw data from clinical trials: what progress since we first asked, “whose
data set is it anyway?”. Trials 17:227. Available at https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC4855346/. Accessed 8 June 2020
Wellcome Trust (2020) Sharing research data and findings relevant to the novel coronavirus
(COVID-19) outbreak. Available at https://wellcome.ac.uk/coronavirus-covid-19/open-data.
Accessed 7 June 2020
White P, Goldsmith K et al (2011) Comparison of adaptive pacing therapy, cognitive behaviour
therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome
(PACE): a randomised trial. Lancet 377(9768):823–836. https://doi.org/10.1016/S0140-
6736(11)60096-2. Available at https://www.sciencedirect.com/science/article/pii/
S0140673611600962. Accessed 8 June 2020
WHO (2015) Developing global norms for sharing data and results during public health emergen-
cies. Available at https://www.who.int/medicines/ebola-treatment/data-sharing_phe/en/.
Accessed 7 June 2020
Wilkinson MD, Dumontier M et al (2016) The FAIR guiding principles for scientific data manage-
ment and stewardship. Sci Data 3:160018. https://doi.org/10.1038/sdata.2016.18
Yoda (2020) Yale open data access project: data use agreement training. Available at https://
yalesurvey.ca1.qualtrics.com/jfe/form/SV_0P7Kl30x4aAZDRX?Q_JFE=qdg. Accessed 13
Aug 2020
Part III
Regulation and Oversight
24 Regulatory Requirements in Clinical Trials

Michelle Pernice and Alan Colley

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Fundamentals of Global Regulatory Affairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Regulatory Strategy: “Black and White” vs. “Gray Zone” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
The “Black and White” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
The “Gray Zone” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
Hypothetical Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
Regulatory Affairs Considerations for Clinical Trials in the USA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Submitting an IND to FDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Maintaining the IND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
Regulatory Affairs Considerations for Clinical Trials in the European Union . . . . . . . . . . . . . . . . 470
Evolution of EU Clinical Trials Legislation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
EudraLex “The Rules Governing Medicinal Products in the European Union” . . . . . . . . . . . 471
Clinical Trials Facilitation and Coordination Group (CTFG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
EU Regulatory Agency Advice on Clinical Development and Clinical Trials . . . . . . . . . . . . . 472
Submitting a CTA in the EU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Voluntary Harmonization Procedure (VHP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Maintaining the Clinical Trial Authorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
Regulatory Affairs Considerations for Clinical Trials in Other Countries . . . . . . . . . . . . . . . . . . . . . 476
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478

M. Pernice (*)
Dynavax Technologies Corporation, Emeryville, CA, USA
e-mail: [email protected]
A. Colley
Amgen, Ltd, Cambridge, UK
e-mail: [email protected]


Abstract
Understanding the regulatory requirements for initiating and conducting clinical
trials is a crucial starting point and success factor in any plan to advance drug
development in humans. Regulatory requirements go beyond what is considered
compliance with good clinical practice (GCP) and other standards. Regulations
provide the guardrails and opportunities for safe, efficient, and purposeful drug
development. While regulations provide the groundwork of what is to be con-
sidered “right” and “wrong” in drug development, there is a level of uncertainty
that is intentionally left for sponsor interpretation in order to provide flexibility.
Further, regulators usually represent the views of the country or region within
their specific purview, a structure which lends itself to dissonance between
different regulations/guidelines, furthering the need for sponsor interpretation.
Such interpretation is conveyed in the finished clinical trial application (CTA) or
investigational new drug (IND) application and then subject to the regulators’
review and approval. As important as it is to understand the written requirements,
it’s equally important to understand how and when to engage with the regulators
to expedite drug development. If done well, the combination of understanding the
regulations, implementing sponsor interpretation, and utilizing opportunities for
engagement with regulatory agencies can lead to ultimately deliver useful treat-
ments to patients.
In this chapter, global, regional, and national clinical trial regulatory consid-
erations will be described to enable the reader to understand the principles and
practice of conceptualizing, submitting, initiating, and completing clinical trials
in the regulated environment of drug development.

Keywords
Regulatory · FDA · EMA · Marketing authorization · BLA/NDA · MAA · IND ·
CTA · Approval · Sponsor

Introduction

When a patient goes to see a doctor and walks out with a prescription, the ability for
that treatment to be prescribed is the result of regulatory approval or “marketing
authorization.” Marketing authorization is only granted once sufficient clinical trial
data are generated to prove that the benefits of the treatment outweigh the risks. As
explained in the US Code of Federal Regulations (CFR), the purpose of conducting
clinical trials is to distinguish the effect of a drug from other influences (e.g., rule out
“placebo effect”). The Food and Drug Administration (FDA) considers adequate and
well-controlled studies to be the primary basis for determining whether there is
“substantial evidence” to support the claims of effectiveness for new drugs. Sub-
stantial evidence is defined in Section 505(d) of the Food, Drug, and Cosmetic
(FD&C) Act as “evidence consisting of adequate and well controlled
investigations, including clinical investigations, by [qualified experts who could fairly and responsibly conclude that the drug will have the effect it purports or is
represented to have in the labeling].” The clinical trial data which serve to comprise
substantial evidence as the basis of such an approval are submitted to the regulator in
a marketing authorization application (MAA) or, in the USA, a new drug application
(NDA)/biologics licensing application (BLA). Standardly, the MAA or BLA/NDA
will contain the data from Phase 1, Phase 2, and Phase 3 clinical trials. However,
there are exceptions to this standard where development may be expedited (e.g.,
Phase 2 data being considered sufficient for approval under FDA’s Accelerated
Approval program) based on unmet need and other factors.
As the name denotes, Phase 1 trials are the starting point to understand how the
drug candidate will perform in humans from a safety perspective, including identi-
fication of the appropriate dose(s) to be studied further. Phase 2 trials are meant to
develop information on the potential efficacy and to generate more safety data in the
larger population. Phase 3 trials are intended to provide sufficient data to prove that
the benefits outweigh the risks of the experimental treatment, using a clinically
meaningful endpoint (e.g., survival or mortality). Typically, the Phase 3 trial(s) are
referred to as the “registrational” or “pivotal” trials. However, in situations where
there is a high unmet medical need and new treatment options are urgently needed,
an earlier-phase trial (e.g., Phase 2 trial) can serve as the “registrational” or “pivotal”
trial. This would usually occur under a specific regulatory designation (e.g., “Accel-
erated Approval” in the USA and “Conditional Approval” in the EU), which is
sought by the manufacturer of the product (referred to as the “sponsor”) and granted
by the regulator (e.g., the FDA in the USA, European Medicines Agency [EMA] in
the EU).
There are regulatory requirements that apply across all phases of clinical trials. As
an example, all clinical trials must comprehensively disclose the risks of the trial to
the people volunteering to be enrolled and receive the volunteers’ consent to be
treated and have the results used as data. There are also regulations and regulatory
guidance that apply to specific trial phases. For example, in a Phase 3 study where
the investigational treatment is compared against an approved therapy, the design
will be required to control for possible bias between treatment arms (e.g., the
protocol procedures must employ blinding) in order to generate a reliable, clinically
meaningful conclusion (e.g., proven superiority or non-inferiority).
In addition to Phase 1–3 data intended for inclusion in the submission of an
MAA, NDA, or BLA, there are also clinical trials which may be conducted after the
product’s approval, referred to as Phase 4 clinical trials or post-marketing studies.
Phase 4 trials may be required by the regulator to answer remaining questions about
the efficacy or safety of the now-approved product or could be initiated voluntarily
by the sponsor company. Either way, Phase 4 trials usually seek to generate more
information on the treatment over a longer duration of time, in a larger population or
in specific patient populations not sufficiently represented in the original Phase 1–3
trials. Phase 4 trials aren’t the only trials that can be conducted after a product
approval, however. As the approved product continues to be explored in other
disease areas and/or patient populations, additional Phase 1–3 trials with that product
may be conducted (e.g., a product approved for the treatment of adult patients with
melanoma may then be included in a new Phase 1 trial to assess the product’s safety
in pediatrics with a different malignancy).
Clinical trials are often conducted in more than one country. This is partly due to
the intent for the product to ultimately be approved in multiple countries, and
therefore data that is representative of each country’s population and local medical
practice is likely to be required by that country’s regulator. This is also due to the
need to expeditiously accrue patients to a trial, necessitating the ability to recruit
study volunteers from a larger population than what would be feasible in a single
country. While conducting clinical trials globally should lead to data that are more
representative of the real-world patient population, this also opens the sponsor up to
inconsistencies between requirements and advice received from different national
and regional regulators. As an example, the approval to conduct a clinical trial in the
USA is the subject of FDA review of the Investigational New Drug (IND) applica-
tion, which includes a multitude of detailed documents (e.g., information on the
manufacturing of the product, the preclinical data on the product). Thereafter, when a
subsequent trial is proposed (e.g., after the completion of the Phase 1 trial, a Phase 2
trial will be proposed), that individual trial’s clinical protocol will be submitted “to
the IND.” The clinical trial protocol is just one document, usually less than 200
pages, whereas there are typically 30 or more documents amounting to thousands of
pages in the original IND. To conduct the same Phase 2 trial in the EU, a new clinical
trial application (CTA) must be submitted to the regulators of those countries within
the EU where the clinical trial will be conducted, even though a CTA was submitted
for the original Phase 1 trial. There are some documents that may be prepared to
support both an IND and CTA filing, whereas there are a number of other documents
that only serve to support one or the other. Unlike in the USA, where after the FDA
review of the original IND subsequent studies require less documentation, in the EU,
CTAs are submitted each time a new study is proposed. While certain initial CTA
documents can be referenced or resubmitted for subsequent CTAs, the submission
package still tends to be much larger than what is required for such subsequent
studies in the USA.
As exemplified above, in order to be successful in developing a new treatment for
patients, it is important for the sponsor to have capabilities to understand not only
basic global regulatory requirements but also the details of individual country
regulations.

Fundamentals of Global Regulatory Affairs

Unlike the IND and CTA differences described, some regulations are consistent
globally, or have been harmonized between regions, and comprise the bedrock of
initial clinical development inception and planning. Developed by regulators around
the world, good manufacturing practice (GMP) and good clinical practice (GCP)
form the fundamental basis of what is required throughout global clinical develop-
ment. Adhering to the standards set forth in these practices is a requirement for
clinical trials to be conducted safely and ethically. Various bodies globally have
published what comprise the principles of GMP and GCP, including the World
Health Organization (WHO) and the International Council of Harmonization
(ICH). ICH was founded in 1990 with the mission to achieve greater harmonization
worldwide to ensure that safe, effective, and high-quality medicines are developed
and registered in the most resource-efficient manner. Since then, ICH has developed
guidelines across the various pillars of a drug development program, which include
the development of the experimental product’s quality attributes (referred to as the
“Q” category; e.g., stability and shelf life), the generation of preclinical toxicology
data (“S” category) and clinical efficacy and safety (“E” category) data, and multi-
disciplinary topics which apply across categories (“M” category; e.g., standardized
medical terminology). Such foundational, globally relevant guidance was developed based on core principles often reflected in individual country regulations and, in turn, also influences the future evolution of those regulations.

Regulatory Strategy: “Black and White” vs. “Gray Zone”

Within the practice of regulatory affairs, there are clear-cut regulations that need to
be understood and adhered to, provided through the rules and regulations written in
“black and white,” meaning there is intentionally little room for interpretation
considering the criticality of the concept (e.g., regulations that govern patient safety
and adverse event reporting). Because regulators and sponsors consider adherence to these rules to be more than just “regulatory affairs,” this practice is more often termed
“compliance.”

The “Black and White”

In the USA, written regulations are laid out in the CFR. The CFR documents all
actions that are required under the applicable federal law which, in this case, is the
Federal Food, Drug, and Cosmetic Act (FD&C Act, codified into Title 21 Chap. 9
of the US Code). (The CFR is organized into a hierarchical series, exemplified here consistent with the subject focus of this book chapter. The hierarchy begins with Titles, and Title 21 of the CFR contains the “Food and Drugs” regulations. Next are Chapters classified by regulatory entity; Chapter I covers the Food and Drug Administration, Department of Health and Human Services. Chapters are broken down into Subchapters, where Subchapter D covers “Drugs for Human Use.” Within Subchapter D there are a number of Parts, including Part 312, which describes “Investigational New Drug Applications.” Collectively, this example selection within the CFR is referred to as “21 CFR Part 312”; “Chapter I” and “Subchapter D” are implied by “Part 312” because Chapter I, Subchapter D is the only component of 21 CFR that contains a “Part 312.”) Among other topics, the CFR contains the central principles of safely and ethically conducting a clinical trial. For a regulatory affairs
professional with a purview that includes the USA, the CFR is often the first pillar
of decision-making and considered a necessity to achieve “compliance.” This section considers the CFR, and comparable written regulations globally, as “The Black and White,” since the rules and regulations set forth in the CFR are most
often standard requirements in order to achieve regulatory approval to conduct a
clinical trial and the subsequent approval of an MAA/BLA/NDA based on data from the
conducted clinical trials. These regulations tend to avoid overprescriptive details and
instead issue broadly impactful rules. An example can be found in 21 CFR Part 312.20
(a) which states, “A sponsor shall submit an IND to FDA if the sponsor intends to
conduct a clinical investigation with an investigational new drug that is subject to §312.2
(a)” (where §312.2(a) refers to 21 CFR Part 312.2(a) which addresses the products
within scope of the overall 21 CFR Part 312 regulations). The concept that a sponsor
must submit an IND to FDA prior to conducting a clinical investigation with an
investigational new drug within the scope referred to is not a subject that should
typically be considered negotiable. This requirement is necessary to enable the appro-
priate oversight of the investigational clinical trial by the regulator, a critical component
of safe and ethical drug development.

The “Gray Zone”

The next pillar of decision-making for a regulatory affairs professional is found in written regulatory guidance or guidelines. Written guidance elaborates on the written
regulation text, offering more detailed description of the intent of the regulation
based on the regulator’s current thinking. Most often, guidance documents will begin
with reference to the statutory and regulatory requirements that comprise the basis of
the topic being presented for guidance (e.g., text from the CFR). Meaning, written
guidance introduces a combination of the “black and white” and the “gray zone.”
The “gray zone” can be interpreted as regulator-issued considerations that could be
the subject of further discussion and alignment, even negotiation, between the
sponsor, regulator, and other stakeholders (e.g., healthcare professionals, patients,
and patient advocates). Most written guidance will contain introductory text to this
effect, as an example FDA guidance text includes:

In general, FDA’s guidance documents do not establish legally enforceable responsibilities. Instead, guidances describe the Agency’s current thinking on a topic and should be viewed
only as recommendations, unless specific regulatory or statutory requirements are cited. The
use of the word ‘should’ in Agency guidances means that something is suggested or
recommended, but not required.

This flexibility is often evident in the guidance text itself being inconclusive or
situation-dependent, leaving the “door open” for the sponsor to consider what is the
most appropriate proposal for the particular investigational drug, specific patient
population, and disease landscape. Diseases that are life-threatening or otherwise
remain a high unmet medical need are of particular relevance for further consider-
ation, discussion, and even negotiation with the regulator. With the betterment of
public health as a shared common goal between all stakeholders (regulators,
sponsors, healthcare professionals, and patients themselves), circumstances which challenge the chances of a patient’s success are replete with opportunity to be open,
collaborative, and communicative between stakeholders, namely, between the spon-
sor and regulator.
Another helpful component to guide decisions to meet the regulators’ expectations is regulatory precedent (e.g., examples of past agreements between a sponsor
and a regulator). Every approved product is issued a public document describing the
product’s core information. In the USA, this document is called the US Prescribing
Information (USPI), and in the EU, it is called the Summary of Product Character-
istics (SmPC). Documents of this type will reflect a multitude of information,
including the pivotal clinical trials conducted in order to achieve the marketing
authorization. Further, regulators often publish documents publicly which stipulate
the rationale behind the decision to approve the product. In the USA, this document
is called the Summary Basis of Approval (SBA), and in the EU, this document is
called the European Public Assessment Report (EPAR). Within these documents, a
description of the registrational pivotal trials will be described, often including the
primary endpoint, selected secondary endpoints, sample size, demographics, specific
safety assessments, and results. Such precedents are helpful to understand circum-
stances when products were successfully developed and approved in addition to
what is provided in official labeling.
Once the plans for a particular clinical trial’s design are formed, a sponsor can
choose to seek regulatory agency advice. This represents the final major pillar of
decision-making for a regulatory affairs professional and where the opportunity to
really explore the “gray zone” comes into action. Some regulator advice procedures
are described in the CFR, denoting how common and how strongly encouraged they
tend to be. Of note, 21 CFR 312.47, entitled “Meetings,” presents the below
alongside additional information:

Meetings between a sponsor and the agency [FDA] are frequently useful in resolving
questions and issues raised during the course of a clinical investigation. FDA encourages
such meetings to the extent that they aid in the evaluation of the drug and in the solution of
scientific problems concerning the drug, to the extent that FDA's resources permit. The
general principle underlying the conduct of such meetings is that there should be free, full,
and open communication about any scientific or medical question that may arise during the
clinical investigation.

Such engagement constitutes the third pillar of regulatory wherewithal to guide drug
development in an ethical, safe, and productive manner. The ability to combine
information and learning from each of the pillars outlined into a plan that suits the
aim of all stakeholders is what constitutes a regulatory strategy. Regulatory strategies
are developed by regulatory affairs professionals for each critical decision within a
drug candidate’s development plan. Such critical decisions span across the life of the
drug’s development and touch on topics both with immediate need and with long-
term impact, such as the optimal timing for an initial IND submission, whether or not
to develop in pediatric populations, and choosing a marketing authorization pathway
to aim toward throughout the drug’s development.

Hypothetical Case Study

A hypothetical case study can be found within the requirements surrounding clinical
trial endpoint selection in potentially registrational clinical trials for patients with a
disease of high unmet need, using certain cancers as an example of such disease.
When designing a clinical trial, the regulatory affairs professional is often tasked
with selecting and confirming the appropriate endpoint of a given clinical trial. In
this case study, consider that the team has requested the regulatory affairs profes-
sional to advise on whether there are any other endpoints that would be acceptable to
regulators for a marketing authorization, as a mortality endpoint may take too long to
reach in the clinical trial setting, considering the present-day unmet need of this
population.
Knowing that the clinical trial being designed is intended to study a patient
population of high unmet need, and aims to pursue a pathway that is as expedited
as possible toward marketing authorization, the regulatory affairs professional could
start with comprehending the “black and white” around the clinical trial results and
data requirements to support such an application in the USA. Accordingly, in
searching first through the CFR, the regulatory affairs professional will find 21
CFR part 314, subpart H entitled, “Accelerated Approval of New Drugs for Serious
or Life-Threatening Illnesses,” including 21 CFR part 314.510, “Approval based on
a surrogate endpoint or on an effect on a clinical endpoint other than survival or
irreversible morbidity,” which states:

FDA may grant marketing approval for a new drug product on the basis of adequate and
well-controlled clinical trials establishing that the drug product has an effect on a surrogate
endpoint that is reasonably likely, based on epidemiologic, therapeutic, pathophysiologic, or
other evidence, to predict clinical benefit or on the basis of an effect on a clinical endpoint
other than survival or irreversible morbidity. Approval under this section will be subject to
the requirement that the applicant study the drug further, to verify and describe its clinical
benefit, where there is uncertainty as to the relation of the surrogate endpoint to clinical
benefit, or of the observed clinical benefit to ultimate outcome. Postmarketing studies would
usually be studies already underway. When required to be conducted, such studies must also
be adequate and well-controlled. The applicant shall carry out any such studies with due
diligence.

Based on this, the regulatory affairs professional can advise the team that a clinical
trial should be proposed to FDA containing an endpoint that acts as a “surrogate”
that is “reasonably likely” to predict what a traditional, clear clinical benefit endpoint
would assess (e.g., mortality). Acknowledging that there are currently no approved
therapies in the malignancy being studied, and therefore nothing to design a com-
parative, head-to-head trial against, the regulatory affairs professional may consider
whether a Phase 2 study would be sufficient for initial marketing authorization.
Along with guiding the team in planning such a clinical trial, the team will also need
to be advised to plan for a post-marketing study including a certain clinical benefit
endpoint. As this is written in “black and white,” it is pertinent information for the
drug development team. Next, the regulatory affairs professional knows to seek
information beyond the CFR and identifies written FDA guidance titled, “Clinical
Trial Endpoints for the Approval of Cancer Drugs and Biologics” which advises,
among other content:

Surrogate endpoints for accelerated approval must be reasonably likely to predict clinical
benefit (FD&C Act § 506(c)(1)(A); 21 CFR part 314, subpart H; and 21 CFR part 601,
subpart E). While durable objective response rate (ORR) has been used as a traditional
approval endpoint in some circumstances, ORR has also been the most commonly used
surrogate endpoint in support of accelerated approval. Tumor response is widely accepted by
oncologists in guiding cancer treatments. Because ORR is directly attributable to drug effect,
single-arm trials conducted in patients with refractory tumors where no available therapy
exists provide an accurate assessment of ORR. Whether tumor measures such as ORR or
PFS are used as an accelerated approval or traditional approval endpoint will depend on the
disease context and the magnitude of the effect, among other factors.

With this, the regulatory affairs professional is empowered to advise the team of
regulation- and guidance-supported suggestions of possible endpoints for the team to
consider within the context of this particular malignancy and patient population.
Further, the regulatory affairs professional recognizes that regulatory precedent plays heavily into the FDA written guidance. In searching through the
USPIs and SBAs of approved treatments for other forms of cancer that previously
represented a high unmet need, a number of surrogate endpoints can be identified
and noted as successful in serving as pivotal evidence to support the marketing
authorization of that particular product.
Drawing together the rules, regulations, guidance, and precedents, the team will
generate a proposed clinical trial design, including a selected surrogate endpoint,
to seek advice from major regulators. Such advice will generate a collaboration
between the regulator and sponsor to meet the common goal: the betterment of
public health.
With the refined clinical trial design, the sponsor will seek approval from the
regulators to conduct the study. This will entail multiple country- and region-specific
processes. In order to maximize the efficiency of the sponsor’s preparation, the
regulator’s review, and the applicability across countries, regulators worldwide
(FDA, EMA, and Japan’s Ministry of Health, Labour and Welfare (MHLW)) developed a set of specifications for applications to be submitted to regulators, entitled the
Common Technical Document (CTD) which is broken into five parts or “modules.”
Module 1 is a region-specific part that contains documents required by the regulator
in that specific country or region. Module 2 through Module 5 are constant interna-
tionally, shown in Fig. 1.
The electronic version of the CTD (eCTD) was developed by ICH, enabling
electronic submissions in lieu of the previous paper-based submissions. This struc-
ture is employed for marketing authorization applications (MAA, BLA, NDA) in all
participating countries (referred to as “ICH countries”), including but not limited to
the USA, EU, Japan, Canada, Switzerland, and Australia. It is also employed for
IND submissions in the USA to FDA but is not uniformly employed for CTA
submissions to regulators within the EU.
Fig. 1 The CTD triangle. [Figure: Module 1, regional administrative information (not part of the CTD); Module 2, the quality overall summary, non-clinical overview and summary, and clinical overview and summary; Module 3, quality; Module 4, non-clinical study reports; Module 5, clinical study reports]
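Because the CTD’s modular organization is strictly hierarchical, it can be convenient to represent it as a simple nested data structure when planning submission content. Below is a minimal, purely illustrative Python sketch (not an official eCTD schema; the dictionary keys and the helper function are hypothetical) capturing the split between the region-specific Module 1 and the internationally constant Modules 2–5 described above.

# Minimal illustrative sketch of the CTD's five-module organization
# (hypothetical structure; not an official eCTD schema).
CTD_STRUCTURE = {
    "module_1": {
        "scope": "regional",  # varies by regulator; not part of the CTD proper
        "contents": ["regional administrative information"],
    },
    "module_2": {
        "scope": "international",
        "contents": [
            "quality overall summary",
            "non-clinical overview and summary",
            "clinical overview and summary",
        ],
    },
    "module_3": {"scope": "international", "contents": ["quality"]},
    "module_4": {"scope": "international", "contents": ["non-clinical study reports"]},
    "module_5": {"scope": "international", "contents": ["clinical study reports"]},
}

def regionally_variable_modules(structure):
    """Return the modules whose contents differ by country or region."""
    return [name for name, info in structure.items() if info["scope"] == "regional"]

print(regionally_variable_modules(CTD_STRUCTURE))  # ['module_1']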

Regulatory Affairs Considerations for Clinical Trials in the USA

Submitting an IND to FDA

The submission of an IND to FDA requires that the details of the investigational
product’s development be explained across many documents. These documents are
organized within the eCTD structure when an IND is being submitted to FDA.
Module 1 contains a cover letter, administrative forms, table of IND contents, the
investigator’s brochure, and an introductory statement including a brief summary of
what the clinical development plan is foreseen to include. Module 2 contains the
summaries of all subsequent modules (Modules 3–5).
Module 3 will describe the drug substance and the drug product in terms of
ingredients, manufacturing process, name and address of the manufacturer, limits
imposed to maintain the manufactured products’ integrity, and testing results to
show that the products remain stable over time. The drug substance is the
manufactured active ingredient before it is prepared in the useable form, the drug
product. The drug product is the finished manufactured product prepared in a form
(“dosage form”) that is able to be used for immediate administration (e.g., a finished
tablet or capsule) or for preparation of administration (e.g., a vial containing the drug
in a solution to be diluted in a bag of inactive ingredients, like “normal saline,” for
intravenous infusion).
Module 4 includes the reports containing the results from preclinical studies,
meaning testing done in animals and “in vitro.” Such testing provides data on safety
and toxicology as well as what to expect of the drug in humans in terms of how it will
be absorbed, distributed, metabolized, and excreted (ADME).
Module 5 includes the clinical proposals in terms of how the investigational drug
will be handled in treating the people who have volunteered to participate in the
clinical trial. The clinical trial protocol and the informed consent document are among the most important documents in the IND submission, as they contain details of how
assessments will be made (e.g., how often the doctor will check the patients’
bloodwork, or when an x-ray will be done) and how the investigational drug should
be administered (e.g., route of administration, dose, frequency).
Once submitted to FDA, the IND will be reviewed by manufacturing, preclinical,
and clinical experts employed by FDA in the “review division.” The typical review
timeline for a new IND is 30 days, during which time the FDA may ask questions to
the sponsor, termed “requests for information.” Upon successful review of the IND,
the sponsor will receive a letter from FDA entitled “Study May Proceed” which
details the FDA acceptance of the proposal set forth by the sponsor in the IND.
Alternatively, if the FDA is not comfortable with the sponsor’s proposal in the IND,
the FDA can issue a “clinical hold” letter detailing what additional information is
needed prior to the sponsor being able to initiate the study in the USA. Such a
“clinical hold” can also be issued by the FDA during the ongoing conduct of the
study, if new information arises that leads the FDA to consider that study participants
are at undue risk under the current trial protocol.
After the initial IND review has successfully completed, future studies intended
for the same or similar patient population can be introduced to the FDA under this
now-approved IND, simply with the new protocol and additional new documents
supporting the new trial (e.g., toxicology studies for a new patient population). The
FDA will review the new documents and will issue questions to the sponsor if
necessary. Unlike the original IND, there is no defined formal review timeline.
Technically, sponsors can initiate a study in less than 30 days after submitting the
new protocol to the IND. However, it is fairly common practice for sponsors to wait
an “informal” 30-day period before initiating the study. This practice can be helpful
to reduce the risk that the FDA will ask questions or place the study on clinical hold
after the study has been initiated; however, the FDA is able to issue “requests for
information” at any time, including after the initial 30-day period.

Maintaining the IND

Amendments
As the IND receives initial approval and then stands as the core source of information
for an investigational product throughout the product’s development, it is common
practice for the contents of the IND to change over time. As an example, the IND
may have been initially submitted with the drug product in a vial formulation, and over
time, the sponsor developed a pre-filled syringe to facilitate ease of preparation and
use. Once the sponsor is ready to introduce the drug product utilizing the pre-filled
syringe formulation into clinical trials, an amendment to the IND will need to be filed.
Such changes can also occur in areas that impact the other modules within the IND (e.g., additional animal toxicology data become available, requiring an amendment to
Module 4 of the IND; the clinical trial is changed to also enroll patients with an earlier
stage of the disease, requiring an amendment to the protocol in Module 5).

Safety Reporting
One of the most critical aspects of maintaining the IND as the core source of
information for the investigational product is safety reporting. Over the course of a
clinical trial, the sponsor and the regulator will learn more about the safety of the
investigational product. Most of this learning will come from “adverse event
reporting.” “Adverse event” is defined as any untoward medical occurrence associ-
ated with the use of a drug in humans, whether or not considered drug related.
Throughout the conduct of the trial, adverse events will be reported based on the
study volunteers’ experiences while enrolled in the study. These reported events
must be promptly reviewed by the sponsor. This is one of the most important
regulations stipulated in “black and white” within the CFR and is also a unique
exception to regulations presented in the CFR which are typically restricted to
conduct in the USA. Safety reporting regulations in the CFR apply to safety reports
received from “foreign or domestic sources.” Accordingly, sponsors must review
safety reports received from every source and assess the reports for their potential
reportability to the FDA and impact on continued treatment within the ongoing
clinical trials. The CFR is also more prescriptive than usual with regard to the
required timeline of safety reporting, outlining which types of reports must be submitted no later than 7 calendar days (unexpected fatal or life-threatening suspected adverse reactions) and which no later than 15 calendar days (other serious and unexpected suspected adverse reactions).
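Because the 7- and 15-calendar-day clocks are among the few hard deadlines written into the CFR, sponsors routinely track them in their pharmacovigilance systems. The following minimal Python sketch illustrates the date arithmetic only, assuming the reporting categories summarized above; the function name and category labels are hypothetical, and the sketch is no substitute for the actual definitions in 21 CFR 312.32.

from datetime import date, timedelta

# Reporting windows in calendar days, as described above:
# 7 days for unexpected fatal or life-threatening suspected adverse reactions,
# 15 days for other serious and unexpected suspected adverse reactions.
REPORTING_WINDOWS = {"fatal_or_life_threatening": 7, "serious_unexpected": 15}

def ind_safety_report_due(awareness_date: date, category: str) -> date:
    """Latest date an IND safety report may be submitted, counting from the
    sponsor's initial receipt of the qualifying information."""
    return awareness_date + timedelta(days=REPORTING_WINDOWS[category])

# Example: sponsor becomes aware of a qualifying event on 1 March 2021.
print(ind_safety_report_due(date(2021, 3, 1), "fatal_or_life_threatening"))  # 2021-03-08
print(ind_safety_report_due(date(2021, 3, 1), "serious_unexpected"))  # 2021-03-16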

Annual Reporting
Further to the submission types that occur if and when a qualifying event occurs, as
described above, maintaining the IND also comes with a requirement to report certain
information annually, within 60 days of the anniversary date of when the FDA review
of the original IND came to a successful close. This routine report includes a summary of what occurred over the reporting period, spanning all topics encompassed in
an IND (e.g., manufacturing changes, the status of ongoing preclinical studies, clinical
and safety updates, status of the investigational product worldwide).

Regulatory Affairs Considerations for Clinical Trials in the European Union

Evolution of EU Clinical Trials Legislation

A major change to the legislation governing clinical trials in the European Union
(EU), the Clinical Trial Regulation (CTReg) EU No. 536/2014, currently awaits
implementation and could revolutionize the way clinical trials are run in the EU in
the next few years. The goal of the CTReg is to create a more favorable environment
for conducting clinical trials in the EU by addressing many of the criticisms leveled
at the current procedures implemented by the Clinical Trials Directive (CTDir),
Directive 2001/20/EC, in 2004. It has been widely acknowledged that implemen-
tation of the CTDir led to a complexity and lack of harmonization that had direct
effects on the cost and feasibility of conducting clinical trials in the EU.
To understand the current procedures for clinical trial authorizations (CTAs) in
the EU, and the evolving regulation of clinical trials, it is useful to consider some
aspects of the European legislative process in general and the history of clinical trials
legislation. Prior to 2004, clinical trials were regulated by the national legislation of
each individual member state (MS) of the EU, and significant differences existed
between the requirements and procedures in each country. In an attempt to harmo-
nize clinical trial conduct, and the CTA processes, the CTDir was implemented into
national legislation of each MS from 2004 onward. The failure of the CTDir to fully
harmonize the CTA processes stems largely from the fact that EU directives are legal
acts which require each MS to achieve a result without dictating the means of
achieving that result in national legislation. This has led to each of the 28 national
regulatory agencies (national competent authorities (NCAs)) having differing sub-
mission package requirements and/or procedures. In contrast to directives, regula-
tions are legal acts that apply automatically and uniformly to all EU countries as
soon as they enter into force, without needing to be transposed into national law. EU
regulations are binding in their entirety, on all EU countries, and therefore imple-
mentation of the CTReg can be expected to overcome the current lack of harmoni-
zation in European CTA procedures.

EudraLex “The Rules Governing Medicinal Products in the European Union”

The body of EU legislation in the pharmaceutical section is compiled into the ten
volumes of EudraLex. The basic legislation for medicinal products for human use is
contained in Volume 1 and includes the various directives and regulations pertinent
to both marketing authorization applications (MAAs) and CTAs. The basic legisla-
tion is supported by various guidelines, and Volume 10, “Guidelines for clinical
trials,” includes guidelines for CTAs approved under the current CTDir and guide-
lines intended to support the CTReg once it’s implemented. Volume 10 includes a
chapter on the CTA application, safety reporting, quality of investigational medicinal
products (IMPs), inspections, various additional guidelines, and finally links to
relevant basic legislation.

Clinical Trials Facilitation and Coordination Group (CTFG)

The Heads of Medicines Agencies (HMA) is a collaborative network of the heads of the NCAs from each EU member state. The CTFG was originally established
to coordinate the implementation of the CTDir and has subsequently taken an active role in harmonizing CTA assessment decisions and processes across the
NCAs.
In 2009, the CTFG established a process for the assessment of multinational
clinical trials (MN-CT) in the EU, the Voluntary Harmonization Procedure (VHP).
The VHP was established as a means of achieving a coordinated assessment of MN-
CTs within the existing legal framework for clinical trials established by the CTDir.
The VHP is discussed in more detail later but can be thought of as an intermediate step
between the CTDir and the CTReg with the latter being influenced by the experi-
ences of the VHP.
The CTFG publishes guidance relevant to CTAs and the conduct of clinical trials
in the EU. With sponsors proposing increasingly complex clinical trial designs, such
as basket trials investigating an IMP or multiple IMPs in a variety of populations, or
umbrella trials investigating several IMPs in a single population, the CTFG has
published a recommendation paper (CTFG, 2019) on initiation and conduct of such
trials.

EU Regulatory Agency Advice on Clinical Development and Clinical Trials

The European Medicines Agency (EMA) issues scientific guidelines on most aspects of drug development, and these will influence the design of clinical trials.
However, a sponsor may want to seek regulatory advice on the design of a clinical
trial prior to submitting the CTA, for example, to validate that the design will adequately
support the regulatory assessment of the benefit and risk during the review of the
MAA. In the EU, advice may be sought from the EMA or from individual NCAs.
A sponsor may also wish to seek advice from a specific NCA where it is planned to
run a clinical trial to facilitate approval of the forthcoming CTA application. In
contrast to the situation in the USA where the FDA is responsible for reviewing the
IND and the BLA/NDA, in the EU, the review of CTAs is the responsibility of
individual NCAs, whereas the EMA is responsible for the review of the MAA
submitted through the centralized procedure. If advice is sought, then sponsors
should consider whether advice is required at a pan-EU level from the EMA and/or
at a country level from one or more NCA. For example, a sponsor might plan to
seek advice from an NCA where it is planned to conduct a first-in-human study
with the aim of facilitating the review by that authority. Later in the product
development, it might be more appropriate to seek advice on the design of a
pivotal Phase 3 trial from the EMA especially if there is something novel about
the trial design, a lack of guidance, or some planned deviation from EMA guid-
ance. The minutes from scientific advice meetings, whether EMA or national, are
required to be included in the CTA application, and any deviation from the advice
received may need to be justified either proactively in the application or if
questions arise during the review.

Submitting a CTA in the EU

A sponsor planning to conduct a clinical trial in the EU will select one or more
European countries for participation by conducting a feasibility assessment that
evaluates a wide range of factors including identification of suitable investigators
and sites and availability of patients meeting the planned inclusion and exclusion
criteria for the trial. Since the access to medicines and standard of care can vary
between countries, this can sometimes influence the feasibility assessment.
Once the countries have been selected, a first step is to evaluate the specific
document requirements and procedures of the NCA and Ethics Committees (EC) in
each MS to plan the CTA submission. As discussed earlier, the exact requirements
will vary by MS because the CTDir has not been implemented in a harmonized
fashion. In addition to various administrative documents and country-specific
required documents, the core package submitted to all NCAs includes the protocol,
investigator’s brochure (IB), and the Investigational Medicinal Product Dossier
(IMPD). The IMPD includes information on the quality of any IMP in the trial as
well as relevant nonclinical and clinical data that is available. An overall benefit-risk
assessment for the trial should also be included unless already included in the
protocol. The possibility also exists to cross-refer to the nonclinical and clinical
data summarized in the IB. CTA submissions are not made in eCTD format, but sections of the IMPD typically follow the CTD headings, with the quality information presented like Module 3 of the CTD.
The CTDir states that assessment of a valid request for clinical trial authorization
by the NCA be carried out as rapidly as possible and may not exceed 60 calendar
days. The procedure will involve a validation phase to check all the necessary
documentation has been provided and is clear. This is followed by the review
phase, and usually the NCA will issue a list of questions (“Grounds for Non-
Acceptance”) requiring adequate responses prior to approval. The exact timelines
and procedures vary by MS and are also influenced by the potential for “clock stops”
when a sponsor is given additional time to respond to questions.
Typically, the regulatory and ethics procedures run in parallel, and a sponsor may
not start the trial in a MS until a favorable opinion is received from both the NCA
and the EC.

Voluntary Harmonization Procedure (VHP)

For a MN-CT, the sponsor has a choice of regulatory pathway, either submitting
separate CTAs in each MS via the relevant national procedures, as previously
described, or requesting assessment via a Voluntary Harmonization Procedure
(VHP). The decision to use VHP versus multiple national procedures should be
made on a case-by-case basis. The benefits of a single, harmonized procedure may
be attractive especially for trials involving many countries where there may be
significant operational benefits and the possibility of achieving a harmonized
outcome according to a single well-defined timetable. The CTFG particularly recommends use of the VHP for the review of CTAs for MN-CTs with a complex
design. However, the VHP is not without its challenges, including very short
timelines (10 days) for the response to questions, and overall VHP can be slower
than the national procedure in certain countries.
The VHP consists of three phases (not to be confused with clinical trial phases).
In Phase 1, the sponsor requests assessment via VHP, and validation of the applica-
tion takes place. It is important to remember the "voluntary" nature of the procedure: individual NCAs can decline to participate, and in that scenario, the sponsor must default to submission via the standard national procedures in that country. If a
sponsor requests VHP, then all EU NCAs planned to be involved should be included
in the request, and the sponsor should not mix between the national route and VHP,
unless advised to do so by the regulators. Sponsors are required to nominate a
Reference-NCA (REF-NCA) in the VHP request. The REF-NCA is responsible
for leading the scientific assessment in collaboration with the participating-NCA
(P-NCA).
Phase 2 of VHP is the assessment step led by the REF-NCA, and it is usual to receive a consolidated list of questions (or "Grounds for Non-Acceptance") on day 32 after the procedure starts. Sponsors have 10 calendar days to respond to questions. Following receipt of the sponsor's response, the REF-NCA continues the assessment with input from the P-NCAs. Depending on the acceptability of the response, Phase 2 will conclude around 56 to 78 days after the start of the procedure.
Phase 3 of VHP is the "national step" that formally concludes the CTA review according to the requirements of the CTDir. This national step does not include further scientific evaluation of the benefit-risk or quality aspects of the investigational medicinal product that were assessed during Phase 2 of VHP. Instead, the focus is on national aspects of the CTA, for example, clinical trial labels in the national language, the ICF, or the EC approval letter.

Maintaining the Clinical Trial Authorization

Amendments
The CTDir allows a clinical trial to be amended after it has started, and amendments
can be classified as non-substantial or substantial. An amendment to a trial is
considered substantial if the changes are likely to have a significant impact on the
safety or physical or mental integrity of trial participants or on the scientific value of
the trial. Guidance on what is typically considered substantial or not is found in the
European Commission communication 2010/C 82/01 (CT-1), and it is the sponsor’s
responsibility to assess any planned amendment on a case-by-case basis. A substan-
tial amendment must be submitted for review and can only be implemented once the
necessary NCA and/or EC approvals have been received. Non-substantial amend-
ments should be documented internally within the sponsor’s records and submitted
with the next substantial amendment.

Mechanisms also exist for sponsors to implement urgent safety measures to protect patients in trials without prior approval of a substantial amendment, with a follow-up submission to be made.

Other Maintenance Activities


Sponsors are required to notify NCAs of the end of the trial and to provide, within 1 year (or 6 months for pediatric trials), the results of the trial in a clinical trial report. Usually, this requirement is met by providing either a clinical study report (CSR) synopsis or the full CSR, depending on the MS.

Safety Reporting and Annual Reporting


Safety of patients participating in clinical trials is paramount, and sponsors are
required to notify NCAs of life-threatening suspected unexpected serious adverse
reactions (SUSARs) as soon as possible and in any case no later than 7 days after
becoming aware of the case. Other non-life-threatening SUSARs can be reported as
soon as possible but no later than 15 days after becoming aware of the case.
The requirement for an Annual Safety Report is met by submission of the
Development Safety Update Report (DSUR) following adoption in the EU of the
ICH guideline E2F on development safety update report. The DSUR is an annual
review and evaluation of safety information and describes new issues that may
impact the overall development program or specific clinical trials. The DSUR
describes known and potential risks and evaluates the impact of new safety infor-
mation on the clinical development.

Implementation of the Clinical Trials Regulation (CTReg)


When the CTReg becomes applicable, it will replace the CTDir and aspects of the
national legislation in each MS that implemented the CTDir. A key benefit of the
regulation will be a harmonized electronic submission and assessment process for
clinical trials conducted in multiple MS. Submissions to the NCAs and ECs will take
place via a new Clinical Trials Information System that includes a submission
“portal.” The CTA application will be subject to separate parallel scientific (Part I)
and ethical reviews (Part II). Part I is led by a reporting member state (RMS) in
coordination with the other concerned member states (CMS). Part II is the national
ethical assessment conducted independently in each MS, and national laws will still
apply for many documents submitted in Part II. It will be up to the individual MS to
determine exactly how to involve the NCA and EC in Part I and Part II of the
assessments to reach a single decision by country. Timelines will be harmonized; assuming there are no validation questions but questions are raised during the review, the overall assessment time will be 106 days. Some parallels with the VHP are evident, although
the CTReg takes harmonization and collaboration between MS much further.
The CTReg’s full implementation relies entirely on the full functionality of the
portal, but due to technical difficulties, testing of the portal was still ongoing as of
June 2019, and it seems unlikely that implementation will occur before 2020. The
portal will deliver secure workspaces for both sponsors and authorities and will
facilitate all interactions between them. The CTReg will become applicable 6 months
after the portal is confirmed as achieving full functionality through independent audit. Following implementation, a 3-year transition period will start; during the first
year, CTAs can be submitted under the old CTDir or the new CTReg systems. The
VHP will no longer be an option for new CTAs immediately once the CTReg is
implemented. During years 2 and 3, trials authorized under the CTDir can remain
under that system, while new CTAs must be submitted under the CTReg systems.
Finally, after 3 years, all trials must switch to the new CTReg system.
Other important aspects of the CTReg will be to simplify safety reporting and to
support the continued drive for transparency of clinical trial data in the EU, and
increasingly there will be proactive publication of clinical trial information via the
clinical trial information system. The implementation of the CTReg will present
sponsors, NCAs, and ECs with many challenges and will not entirely remove the
complexities associated with conducting a MN-CT in the EU. However, it can be
hoped that if the goals of the CTReg are achieved, a more favorable environment for
conducting trials in the EU will result and ultimately facilitate the development of
new medicines for patients.

Regulatory Affairs Considerations for Clinical Trials in Other Countries

In addition to the details discussed pertaining to the USA and EU, there are many
regulatory considerations regarding the conduct of clinical trials in other countries.
Some countries (e.g., China, South Korea, India, Russia) require that, in order to
achieve marketing authorization, patients from that country must be included in
clinical trials submitted within the marketing authorization application. This require-
ment can be at least partly due to concerns about ethnic differences in how a drug may be metabolized, or to notable differences in overall patient care between the countries studied and the country with local data requirements. While
each country has their own review process and procedures to assess the safety and
appropriateness of a proposed new clinical trial, most follow the same basic struc-
ture. This basic structure typically starts with the sponsor submitting a clinical trial
application containing cross-functional information spanning manufacturing, preclinical, and clinical data. Next, the local regulator reviews the submitted application and
may issue questions to be answered by the sponsor within a defined period of time.
Finally, if successful, the regulator will approve the proposed clinical trial to be
conducted in that country.
There are additions and exceptions to this basic structure, which a global sponsor must develop the capability to understand and anticipate. An exam-
ple can be found in Japan, where the Pharmaceuticals and Medical Devices
Agency (PMDA) is the regulator with purview over clinical trial applications
and marketing authorization applications. In Japan, the sponsor typically plans
to meet with PMDA prior to submitting the clinical trial application for a consul-
tation with PMDA to advise on the overall acceptability of the basic proposal for
the new clinical trial (e.g., checks whether a proposed clinical trial complies with
the requirements for regulatory submission).
During the clinical trial planning, the regulations for all countries selected by the
sponsor to be included in the recruitment for study volunteers must be taken into
account to enable timely approval and initiation of the trial in the given country. In
particular, if a country with local data requirements for marketing authorization is
not included in the clinical development of the product, additional, dedicated studies will likely need to be conducted if the sponsor aims to have marketing
authorization in that country. Unfortunately, it is not uncommon that the conduct of
these additional, dedicated studies can lead to years-long delays in access to the new
treatment in that country. Therefore, up-front planning leveraging regulatory acumen
and guidance is critical to the success of enabling global approval and access to new
medicines.

Summary

Regardless of country or region, sophisticated regulatory strategy is a critical component of the development of any drug candidate. While some rules and regulations seem clear and simply the subject of rote memorization or the ability to research and reference, most drug development decisions span beyond what is written in "black and white." Beyond coming to an agreement with a particular
regulatory authority on a complex topic, sponsors are often also tasked with inte-
grating differing advice and procedures from regulators globally. This frequently
proves to be very challenging and can even slow the development of a promising
new drug candidate. However, this divergence can also be a motivator behind what
can result in some of the best examples of innovation. Regardless of how divergent
the views of individual regulators and sponsors may be, the betterment of patients’
health is always the shared goal. Accordingly, the regulatory affairs professional’s
ability to maintain focus on the patient as the end goal is what will ultimately drive
innovative solutions to complex drug development challenges.

Cross-References

▶ ClinicalTrials.gov
▶ Cluster Randomized Trials
▶ Consent Forms and Procedures
▶ Data and Safety Monitoring and Reporting
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter Trials
▶ End of Trial and Close Out of Data Collection
▶ Evolution of Clinical Trials Science
▶ Good Clinical Practice
▶ Implementing the Trial Protocol
▶ International Trials
▶ Investigator Responsibilities
▶ Multicenter and Network Trials
▶ Participant Recruitment, Screening, and Enrollment
▶ Post-approval Regulatory Requirements
▶ Reporting Biases

References
Clinical Trial Regulation (CTReg) EU No. 536/2014. Available via European Commission. https://
ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/reg_2014_536/reg_2014_536_en.pdf.
Accessed 02 Sept 2019
Clinical Trials Directive (CTDir), Directive 2001/20/EC. Available via European Commission.
https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/dir_2001_20/dir_2001_20_en.
pdf. Accessed 02 Sept 2019
Code of Federal Regulations. Available via Electronic Code of Federal Regulations (e-CFR).
https://www.ecfr.gov/cgi-bin/ECFR?page=browse. Accessed 02 Sept 2019
European Commission communication 2010/C 82/01 (CT-1). Available via European Commission.
https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:082:0001:0019:EN:PDF.
Accessed 02 Sept 2019
Sect. 505(d) of the Food, Drug and Cosmetic (FD&C) Act. Available via FDA webpage on FD&C
Act Chap. V: Drugs and Devices. https://www.fda.gov/regulatory-information/federal-food-
drug-and-cosmetic-act-fdc-act/fdc-act-chapter-v-drugs-and-devices#Part_A. Accessed 02 Sept
2019
25 ClinicalTrials.gov

Gillian Gresham

Contents
ClinicalTrials.gov: History
ClinicalTrials.gov: Content and Features
Characteristics of Trials Registered in ClinicalTrials.gov
ClinicalTrials.gov Website Content
ClinicalTrials.gov: Registration and Results Reporting
Registering a Trial in ClinicalTrials.gov
Reporting Results in ClinicalTrials.gov
Quality Control Review of ClinicalTrials.gov Records
Downloading and Analyzing Content from ClinicalTrials.gov
Downloading Content for Analysis
Limitations of Analyzing Data from ClinicalTrials.gov
Conclusion
References

Abstract
ClinicalTrials.gov is a federally supported, web-based clinical trials registry
maintained by the United States (US) National Library of Medicine (NLM) at
the National Institutes of Health (NIH). It is available to health care professionals,
researchers, patients, and the public. Since its launch in 2000, over 325,000 clinical
research studies have been registered in ClinicalTrials.gov. Unlike other clinical
trial registries and databases, clinical trials registration for certain types of clinical
trials is mandated by law under Section 801 of the US Food and Drug Adminis-
tration Amendments Act (FDAAA 801). There are several components that make
up the ClinicalTrials.gov registration process, including trial registration itself,
results reporting, and the download and analysis of the ClinicalTrials.gov content.

G. Gresham (*)
Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials, https://doi.org/10.1007/978-3-319-52636-2_266

While the previous chapter focuses on clinical trials registration in general, this
chapter pertains to clinical trials registered in ClinicalTrials.gov. This chapter
provides an overview of the history of ClinicalTrials.gov, a description of the trials
currently registered in ClinicalTrials.gov, and a review of the Federal Requirements
for Registration in the United States. A summary of the registration process, trial
reporting, and data analysis procedures follows. The chapter concludes with an
overview of the limitations associated with the analysis and reporting of
ClinicalTrials.gov registration data.

Keywords
Clinical trials registration · ClinicalTrials.gov · Clinical trial · Interventional
study · Clinical trials database · Results reporting

ClinicalTrials.gov: History

Trial registration and its regulation in the United States, as we know them today, have evolved and expanded over the last 30 years. Key events and policies related to ClinicalTrials.gov are illustrated in the historical timeline (Fig. 1). ClinicalTrials.gov
definitions are consistent with those provided by the NIH and are listed in an online
glossary as part of the ClinicalTrials.gov website: https://clinicaltrials.gov/ct2/about-
studies/glossary. Some key definitions from the glossary are transcribed in Table 1.
International calls for trial registration first emerged in the late 1980s in response
to increasing awareness of publication and reporting biases (Dickersin 1990; Simes
1986). In 1986, Simes demonstrated the value of an international registry for clinical
trials using two case examples in ovarian cancer and multiple myeloma (Simes
1986). Simultaneous calls for registration were published at the turn of the twenty-
first century, providing additional examples of reporting biases and arguments for the
need for a comprehensive, prospective trial registry (Dickersin 1990; Dickersin and
Rennie 2003; Piantadosi 2017).
In 1997, the first federal law to require trial registration was passed under Section 113 of the Food and Drug Administration Modernization Act (FDAMA), relating to a data bank containing information on privately or federally funded trials being conducted under investigational new drug applications for serious or life-threatening diseases and conditions:

"A registry of clinical trials (whether federally or privately funded) of experimental treatments for serious or life-threatening diseases and conditions under regulations promulgated pursuant
to section 505(i) of the Federal Food, Drug, and Cosmetic Act, which provides a description of
the purpose of each experimental drug, either with the consent of the protocol sponsor, or
when a trial to test effectiveness begins. Information provided shall consist of eligibility
criteria for participation in the clinical trials, a description of the location of trial sites, and a
point of contact for those wanting to enroll in the trial, and shall be in a form that can be readily
understood by members of the public. Such information shall be forwarded to the data bank by
the sponsor of the trial not later than 21 days after the approval of the protocol.”

Fig. 1 ClinicalTrials.gov timeline. Key events: 1997, first US law requiring trial registration passes (Food and Drug Administration Modernization Act, FDAMA); 2000, the NIH National Library of Medicine (NLM) releases the online clinical trials registry ClinicalTrials.gov; 2005, the International Committee of Medical Journal Editors (ICMJE) requires trial registration; 2007, Congress passes the Food and Drug Administration Amendments Act (FDAAA) to expand ClinicalTrials.gov submission requirements; 2008, the Declaration of Helsinki revision promotes trial registration; 2014, the Notice of Proposed Rulemaking for FDAAA 801 is released for public comment; 2016, the Final Rule for FDAAA 801 and the NIH Policy are issued; 2017, the revised Common Rule (45 CFR 46) is issued

Table 1 Selected terms and definitions used on ClinicalTrials.gov

Clinical study: A research study involving human volunteers (also called participants) that is intended to add to medical knowledge. There are two types of clinical studies: interventional studies (also called clinical trials) and observational studies.

Interventional study (clinical trial): A type of clinical study in which participants are assigned to groups that receive one or more intervention/treatment (or no intervention) so that researchers can evaluate the effects of the interventions on biomedical or health-related outcomes. The assignments are determined by the study's protocol. Participants may receive diagnostic, therapeutic, or other types of interventions.

Observational study: A type of clinical study in which participants are identified as belonging to study groups and are assessed for biomedical or health outcomes. Participants may receive diagnostic, therapeutic, or other types of interventions, but the investigator does not assign participants to a specific intervention/treatment.

Expanded access: A way for patients with serious diseases or conditions who cannot participate in a clinical trial to gain access to a medical product that has not been approved by the US Food and Drug Administration (FDA). Also called compassionate use. There are different expanded access types.

ClinicalTrials.gov identifier (NCT number): The unique identification code given to each clinical study upon registration at ClinicalTrials.gov. The format is "NCT" followed by an 8-digit number (e.g., NCT00000419).

Funder type: Describes the organization that provides funding or support for a clinical study. This support may include activities related to funding, design, implementation, data analysis, or reporting. Organizations listed as sponsors and collaborators for a study are considered the funders of the study.

Sponsor: The organization or person who initiates the study and who has authority and control over the study.

Collaborator: An organization other than the sponsor that provides support for a clinical study. This support may include activities related to funding, design, implementation, data analysis, or reporting.

Phase: The stage of a clinical trial studying a drug or biological product, based on definitions developed by the US Food and Drug Administration (FDA). The phase is based on the study's objective, the number of participants, and other characteristics. There are five phases: early phase 1 (formerly listed as phase 0), phase 1, phase 2, phase 3, and phase 4. Not applicable is used to describe trials without FDA-defined phases, including trials of devices or behavioral interventions.

Phase 1: A phase of research to describe clinical trials that focus on the safety of a drug. They are usually conducted with healthy volunteers, and the goal is to determine the drug's most frequent and serious adverse events and, often, how the drug is broken down and excreted by the body. These trials usually involve a small number of participants.

Phase 2: A phase of research to describe clinical trials that gather preliminary data on whether a drug works in people who have a certain condition/disease (i.e., the drug's effectiveness). For example, participants receiving the drug may be compared to similar participants receiving a different treatment, usually an inactive substance (called a placebo) or a different drug. Safety continues to be evaluated, and short-term adverse events are studied.

Phase 3: A phase of research to describe clinical trials that gather more information about a drug's safety and effectiveness by studying different populations and different dosages and by using the drug in combination with other drugs. These studies typically involve more participants.

Phase 4: A phase of research to describe clinical trials occurring after FDA has approved a drug for marketing. They include postmarket requirement and commitment studies that are required of or agreed to by the study sponsor. These trials gather additional information about a drug's safety, efficacy, or optimal use.

Phase not applicable: Describes trials without FDA-defined phases, including trials of devices or behavioral interventions.

All definitions transcribed from the ClinicalTrials.gov glossary available at: https://clinicaltrials.gov/ct2/about-studies/glossary

The 1997 FDAMA law resulted in the subsequent release of ClinicalTrials.gov in 2000 by the NIH NLM as the primary registry for federally and privately funded trials conducted in the United States. At the time of its launch, ClinicalTrials.gov included information on over 4000 medical studies in over 47,000 locations across the United States (Zarin et al. 2017a). The launch of ClinicalTrials.gov was followed by FDA guidance for industry, issued in 2002 and withdrawn by the FDA in September 2017 (US guidance for industry 2002).
In 2004, the International Committee of Medical Journal Editors (ICMJE) implemented a policy that required registration of all clinical trials as a condition of consideration for publication (De Angelis et al. 2004). The policy applies to any trial that started enrollment after July 1, 2005, for which registration must occur before patient enrollment; trials that began enrollment before July 1, 2005, were required to register by September 13, 2005. The ICMJE registration policy represents an important landmark for trial registration: a significant increase in trial registration occurred after its implementation (Zarin et al. 2017a).
The World Health Organization (WHO) established a trial registration policy shortly after, in 2006, releasing a minimum trial registration dataset of 20 items (Appendix 10.12). Additional information regarding the history of the development of the WHO International Clinical Trials Registry Platform (ICTRP) is described in the preceding chapter, ▶ "Trial Registration." Additional international
efforts by the World Medical Association (WMA) to encourage trial registration
were made in 2008 at the 59th WMA General Assembly in Seoul, Republic of
Korea. At this time, the Declaration of Helsinki was amended to include trial
registration requirements initially outlined in Sections 19 and 30 (World Medical
Association 2013). These principles were again modified and re-ordered in 2013 at
the 64th WMA General Assembly, now corresponding to Sections 35 and 36 (World
Medical Association 2013). Section 35 indicates that every research study involving
human subjects should be registered in a public database, while Section 36 raises the
ethical obligation to publish and disseminate the results of research regardless of
whether the findings are statistically significant or “negative or inconclusive”
(Appendix 10.3). While not legally binding, the Declaration of Helsinki has
increased recognition and awareness of the importance and ethical obligations of
trial registration, especially among physicians conducting research in human
subjects.
The Food and Drug Administration Amendments Act (FDAAA) of 2007 became one of the most important and influential policies for trial registration in the United States. The FDAAA Public Law 110-85 was passed by Congress on September 27, 2007, and expanded registration and reporting requirements for ClinicalTrials.gov. Such requirements, as detailed in Section 801 of FDAAA, included expanding the clinical trial registration information for applicable clinical trials and adding a results database (FDAAA 801). The law also mandated submission of results for applicable clinical trials of drugs, biologics, and devices that were approved, cleared, or licensed by the FDA. The law includes the requirement that the responsible party of an applicable clinical trial must submit results within 1 year of completing data collection for the primary outcome, including summary results of the demographic and baseline characteristics, primary and secondary outcomes, points of contact, and agreements. Submission of adverse events, including frequent and serious adverse events, was not required by law until 2009 (FDAAA 801). Finally, the FDAAA law of 2007 introduced civil penalties of "not more than $10,000 for each day of the violation after such period until the violation is corrected" (FDAAA 801).

The Final Rule for FDAAA Section 801 was issued in September 2016, clarifying and expanding the definition of an applicable clinical trial and providing additional requirements regarding trial registration and reporting (Zarin et al. 2016). The NIH simultaneously issued a policy requiring that all NIH-funded trials be registered regardless of whether they are covered under FDAAA 801 requirements.

ClinicalTrials.gov: Content and Features

Characteristics of Trials Registered in ClinicalTrials.gov

This section describes the characteristics of trials currently registered in ClinicalTrials.gov; a summary of the registration process itself, trial reporting, and data analysis will then be provided.
ClinicalTrials.gov includes clinical trials being conducted in 207 countries, with
over a third being conducted in the United States only, half outside of the United
States, and the rest in both the United States and non-US countries. Study locations
were not specified in 12% of the registered trials. ClinicalTrials.gov is a living
database that is constantly being updated with new studies as well as undergoing modifications and revisions to study records and to the site itself. Therefore, counts will vary with time, and the following summary of characteristics reflects trial counts as of December 31, 2019. Overall,
there were 325,860 studies registered in ClinicalTrials.gov of which 256,924 (79%)
were interventional, 67,486 (19%) were observational, and 601 were expanded
access. Types of interventions include drugs or biologics (59%), behavioral inter-
ventions (31%), surgical procedures (10.5%), and devices (12.5%). Among the
registered trials, 175,691 were completed to date (December 31, 2019). Trials can
also be characterized by lead sponsor, where industry was lead sponsor for 106,775
trials as of December 31, 2019, the US Federal Government including NIH was lead
sponsor for 37,706 trials, and all other funding sources were lead sponsor for
184,040 trials. While industry tends to fund larger, randomized drug intervention
trials, the NIH focuses on smaller, early development studies (Gresham et al. 2018;
Ehrhardt et al. 2015). An increasing number of behavioral trials funded by NIH have
also been observed in the last 10 years, which may include exercise and nutritional
studies.

ClinicalTrials.gov Website Content

ClinicalTrials.gov is an online resource maintained by the NLM with a target audience of health-care professionals, researchers, patients, and the general public. Information and resources for different users are integrated throughout the website. The home page includes a search bar for users to find trials by recruitment
status, condition or disease, other terms (e.g., NCT identification number, inves-
tigator, drug name), and country. An advanced search allows users to further filter
their search by trial type, intervention, outcome measure, eligibility criteria,
location, phase, funder type, and recruitment status. The menu at the top of the
page includes five tabs: “Find Studies,” “About Studies,” “Submit Studies,”
“Resources,” and “About Site.” Within the “Find studies” menu, users can access
a map of the studies and information on how to search for studies as well as how to
use, find, and read a study record. The “About Studies” tab provides information
about the studies, a list of additional websites about studies, and the glossary of
common terms. Resources for administrators including registration guidelines
help with registering the studies; support and training materials as well as FAQs
can be found under the “Submit Studies” tab. A “Resources” tab includes a list of
selected publications, clinical trials alerts, RSS feeds, the metadata for
ClinicalTrials.gov, and information on downloading ClinicalTrials.gov content
for analysis. Finally, the top menu includes additional information about the site
where readers can learn more about the history of ClinicalTrials.gov; the history,
policy, and laws surrounding trials registration; and the terms and conditions of the
site. An additional link to the Protocol Registration and Results System (PRS) site
is available for administrators and study sponsors/investigators to access and
register their trials. To access the PRS site, users must have a PRS account linked
to their organization name, username, and password. More information on this will
be provided in a later section.
Clinical trials are organized by the ClinicalTrials.gov study identification number
(NCT number) which is unique to each trial registered. Every study record includes
the NCT number, which is listed at the top of the record along with the study title,
key dates (e.g., First posted, Results first posted, Last update), and names of the
sponsors, collaborators, and responsible party. Every record also includes a dis-
claimer that states the following:

The safety and scientific validity of this study is the responsibility of the study sponsor and
investigators. Listing a study does not mean it has been evaluated by the U.S. Federal
Government. Read our disclaimer for details.

Each trial record includes the ICMJE/WHO minimum 20-item Trial Data Set
(Appendix 10.12). Trial information is organized by tabs including “Study Details,”
“Tabular View,” and “Study Results.” Additional links to the disclaimer and
resources for patients on how to read and interpret a study record are also available.
The "Study Details" tab divides trial information by section: study description, study design, arms and interventions, outcome measures, eligibility criteria, contacts and locations, and more information (e.g., publications). Related citations are automatically identified from the NLM using the study identification number (NCT number) and added directly to the publications field.
The tabular view provides the same information as listed in the “Study Details”
page with some additional features and links. Links to study documents (e.g.,
protocol, consent forms) can be accessed and downloaded, if available. A “Change
History” link also exists, where a complete list of historical versions for the specific
study is available and posted to the ClinicalTrials.gov archive site. When applicable,
study results are posted under the results tab and organized by baseline table,
primary outcome measures, secondary outcome measures, and adverse events by treatment group (Section 5.3).

ClinicalTrials.gov: Registration and Results Reporting

Registering a Trial in ClinicalTrials.gov

Trial registration is required by law if the trial fits the definition of an "applicable clinical trial," as defined in the Final Rule (42 CFR Part 11) and under the FDAAA of 2007, Section 801. A checklist for determining whether a trial is considered an "applicable trial" is available online at https://prsinfo.clinicaltrials.gov/VoluntarySubmissionFlowchartChecklist.pdf.
To register a clinical trial, a ClinicalTrials.gov PRS account must first be
requested if one does not already exist for the organization from which the trial is
being registered. One PRS account per organization is established, for which
investigators and administrators from that organization can subsequently be added.
Once a PRS account has been created, the responsible party for the trial, defined as
the “sponsor of the trial unless and until a principal investigator has been designated
the responsible party in accordance with 42 CFR 11.4(c) (2),” can register their trial.
Only one record per trial should be created. Trial registration must occur within 21 days after enrollment of the first trial participant, and the record is posted publicly within 30 days after initial submission (Zarin et al. 2016).
Initial registration of the trial in ClinicalTrials.gov involves provision of the study
title, description, study type (interventional, observational, or expanded access), and
status (e.g., recruiting, completed, withdrawn, etc.). Specification of the study start
date, primary completion, and study completion dates must also be provided. These
dates may be actual or anticipated, depending on the study status. Details on the
sponsors and collaborators follow, in addition to information regarding study oversight, such as details on the US FDA-regulated drug and IND number (if applicable), the name and contact information of the human subjects review board, and the corresponding human subjects review board's approval number.
Brief and detailed summaries of the study, covering its purpose and general information, are included along with a selection of MeSH terms for conditions and keywords. Details on the study design (type, phase, number of
arms, masking, allocation, and enrollment) and a description of the study arms and
interventions follow. Eligibility criteria for the trial must be provided, with separate
fields for age, sex, and acceptance of healthy volunteers included.
Among the most important data elements to be registered are the study's primary and secondary outcomes. ClinicalTrials.gov requires detailed specification of the
outcome title, description, and timeframe or the specific timepoint at which the study
participant is assessed for that measure (Zarin et al. 2016). Some general data entry
tips for each data entry element, as obtained from the PRS information pages, have
been summarized in Table 2.

Table 2 Data entry tips for common data elements entered in ClinicalTrials.gov

Study status
Definition: Overall recruitment status: the recruitment status for the clinical study as a whole, based upon the status of the individual sites. Study start date: the estimated date on which the clinical study will be open for recruitment of participants, or the actual date on which the first participant was enrolled. Primary completion date: the date that the final participant was examined or received an intervention for the purposes of final collection of data for the primary outcome, whether the clinical study concluded according to the pre-specified protocol or was terminated. Study completion date: the date the final participant was examined or received an intervention for purposes of final collection of data for the primary and secondary outcome measures and adverse events (e.g., last participant's last visit), whether the clinical study concluded according to the pre-specified protocol or was terminated.
Data entry tips: Study status can alternate between the following: not yet recruiting (participants are not yet being recruited); recruiting (participants are currently being recruited, whether or not any participants have yet been enrolled); enrolling by invitation (participants are being, or will be, selected from a predetermined population); active, not recruiting (study is continuing, meaning participants are receiving an intervention or being examined, but new participants are not currently being recruited or enrolled); completed (the study has concluded normally; participants are no longer receiving an intervention or being examined); suspended (study halted prematurely but potentially will resume); terminated (study halted prematurely and will not resume; participants are no longer being examined or receiving intervention); withdrawn (study halted prematurely, prior to enrollment of first participant). If the trial registered is multisite and one of the individual sites is recruiting, then the overall recruitment status for the study must also be "recruiting." Once the first patient is enrolled, the study start date should be updated to include the actual date. Once the study has reached the study completion date, the study completion date should be updated to reflect the actual study completion date.

Study description
Definition: Brief summary: a short description of the clinical study, including a brief statement of the clinical study's hypothesis, written in language intended for the lay public. Detailed description: extended description of the protocol, including more technical information compared to the brief description.
Data entry tips: The brief summary should be brief and written for a lay audience (limit 5000 characters). The detailed description can include more technical information but should not include the entire protocol nor duplicate information that is already recorded in other data elements (limit 32,000 characters).

Study design
Definition: Study design: a description of the manner in which the clinical trial will be conducted, including the following information: primary purpose.
Data entry tips: Primary purpose can be selected from a drop-down menu and includes treatment, prevention, diagnostic, supportive care, screening, health services research, basic science, and device feasibility. The study phase should be selected based on the NIH definitions (Table 1). The interventional study model may include single group, parallel, crossover, factorial, or sequential. All the roles that are masked should be indicated, including participant, care provider, investigator, outcomes assessor, or open-label (no masking). Study allocation can be randomized or non-randomized; note that quasi-randomized is not a true form of randomization. Anticipated enrollment should be specified based on the primary outcome power calculation. Once the study is complete, the actual enrollment should be updated.

Arms and interventions
Definition: Arm: a pre-specified group or subgroup of participants in a clinical trial assigned to receive specific interventions (or no intervention). Intervention: a process or action that is the focus of a clinical study.
Data entry tips: The arm title should be concise but allow for easy distinction from one arm to another. The arm definition is selected from a drop-down menu and includes experimental, active comparator, placebo comparator, sham comparator, no intervention, or other. If the intervention is a drug, the generic name should be used as well as the dosage form, dose, frequency, and duration. Intervention type is selected from a drop-down menu and can include drug, device, biological/vaccine, procedure/surgery, radiation, behavioral, genetic, dietary supplement, combination product, diagnostic test, or other. If conducting an observational study, intervention name can be used to identify the intervention or exposure of interest.

Eligibility criteria
Definition: The eligibility module specifies the criteria for determining which people are (or are not) eligible to participate in the study.
Data entry tips: Enter age limits, if applicable; otherwise, enter "N/A (no limit)" from a drop-down menu. Sex refers to the classification of male or female based on biological distinctions, with drop-down options of "all," "male only," and "female only." Gender refers to the person's self-representation of gender identity; if applicable, a user can indicate that eligibility is based on gender in addition to descriptive information about gender criteria. When entering eligibility criteria, include headings for the inclusion and exclusion criteria followed by a bulleted list under each heading.

Outcome measures
Definition: Primary outcome: the outcome measure(s) of greatest importance specified in the protocol, usually the one(s) used in the power calculation. Most clinical studies have one primary outcome measure, but a clinical study may have more than one.
Data entry tips: When specifying an outcome, include the specific domain, method of aggregation, specific metric, and timepoint. Do not use acronyms. Each outcome measure should be presented separately, regardless of whether they share the same metric. If using a scale or questionnaire, specify the number of items, how they are scored, the minimum and maximum ranges, and how the scores are interpreted.

Definitions and information obtained from: https://register.clinicaltrials.gov/prs/html/definitions.html

It is the responsibility of the record owner to maintain and update the clinical trial
information within the required timeframes in accordance with Section 801 of
FDAAA and 42 CFR 11.64. Records for active studies are required to be updated
at least once a year with some data elements requiring more frequent updates. Once
the record has been reviewed for accuracy and modified as necessary, the verification
date will be updated, and the responsible party/PRS administrator can approve and
release the record.

Reporting Results in ClinicalTrials.gov

The ClinicalTrials.gov registry provides access to study results, regardless of whether they have been published. While all registered trials may submit their study results to ClinicalTrials.gov, results for applicable clinical trials, as previously defined, are required by the FDAAA to be submitted within 1 year after the trial's
primary completion date (date that the final subject was examined or received the
intervention for the purposes of final data collection for the primary outcome). Trials
that are not considered "applicable clinical trials" (non-ACTs), such as Phase 1 trials, feasibility studies, or observational studies, are not required to submit results. However,
under the NIH Policy, any trial that meets the NIH definition for clinical trial and is
funded in whole or in part by the NIH must provide summary results.
As of December 31, 2019, a search of the ClinicalTrials.gov registry identified
41,074 studies (interventional or observational) with posted results. This has
increased dramatically from the 2,178 records with results identified in September
2010 and 23,000 in 2016, probably as a result of the expanded FDAAA reporting
requirements (Zarin et al. 2011, 2016). It is anticipated that this number will continue
to grow as more applicable trials are completed after the January 18, 2017 imple-
mentation date. Understanding and training in the results submission process will
become more important to ensure timely and accurate data entry.

Results Submission Process


Results are entered and submitted using the PRS results page in a similar fashion to
the registration process. Results can be entered directly into the interactive tables
provided on the results tab or uploaded using XML files. Unlike journal publica-
tions, they are not accompanied by detailed narrative text to explain the results
(Zarin et al. 2017a). Results are displayed by study arm or comparison group, where
applicable, as well as combined totals. In addition to tabulations, statistical compar-
isons and summary results can be provided for the corresponding outcome data (e.g.,
within and between group differences). During the preparation of results for sub-
mission, it is important that the responsible party works with the study statistician
and other investigators to ensure complete and accurate information being entered
into the results system.
Results are organized into the following sections: participant flow, baseline
characteristics, outcome measures and statistical analyses, and adverse events. The
participant flow chart includes a summary of the progress of trial participants, by
comparison group, if applicable. The purpose of the flow chart is similar to that of a
CONSORT flow diagram, where it displays the number of participants at each stage
or study time interval (e.g., screening, randomization, treatment, study completion,
follow-up, etc.) (Appendix 10.5). The second results section to enter is the baseline
characteristics, which includes a table of demographic and baseline measures by
each comparison group and combined totals. All results will include the total number
of participants and the number of participants analyzed for each result at the
specified timepoint.
Baseline measures include age (continuous, categorical, or customized), sex or
gender, race, and ethnicity, as defined by the Office of Management and Budget
(OMB), region of enrollment (e.g., United States, Canada), and study-specific
measures. The study-specific measures are customized to the study population and
protocol and may include baseline anthropometric measures, clinical and diagnostic
characteristics, or other factors specific to the disease or condition under study.

Baseline measures may be summarized as counts, means, medians, least squares means, geometric means, numbers, or other. Measures of dispersion should also be
included, if applicable, where a standard deviation, standard error, interquartile
range, or full range may be provided. The system provides space to add the unit of
measure (e.g., lbs., mmHg, participants) and explanation if a particular entry is not
applicable.
Results for both primary and secondary outcomes, as defined and pre-specified in the study protocol and registration record, are entered by group along with the result
from the statistical analysis for that particular outcome. Each outcome should
include the type (mean, median, count, etc.), measure of dispersion/precision (e.g.,
standard deviation, confidence interval, interquartile range), and number of partici-
pants analyzed. Additionally, a description of the statistical test (e.g., superiority,
non-inferiority), hypothesis (e.g., p-value), method to calculate p-value (e.g., log
rank, ANOVA, regression), estimation parameter (e.g., hazard ratio, mean differ-
ence, odds ratio, etc.), and its corresponding dispersion/precision parameter should
be included.
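
To make these reporting elements concrete, the sketch below collects them into a single structure. This is purely illustrative and not the PRS data model; every field name and value is hypothetical.

# A hypothetical sketch (not the PRS data model) of the elements that
# accompany a single outcome measure result, as described above.
outcome_result = {
    "title": "Change in Systolic Blood Pressure at 12 Weeks",
    "groups": {
        "Treatment": {"n_analyzed": 120, "mean": -8.2, "sd": 6.1},
        "Placebo": {"n_analyzed": 118, "mean": -1.4, "sd": 5.9},
    },
    "statistical_analysis": {
        "test_type": "superiority",  # e.g., superiority, non-inferiority
        "p_value": 0.001,
        "method": "ANOVA",  # e.g., log rank, ANOVA, regression
        "estimation_parameter": "mean difference",
        "estimate": -6.8,
        "ci_95": (-9.1, -4.5),  # dispersion/precision of the estimate
    },
}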
The last component of the results reporting system is the adverse events section,
which includes a tabular summary of all anticipated or unanticipated serious adverse
events, as well as other adverse events that exceed a specific frequency threshold.
This information is categorized into three tables including all-cause mortality,
serious adverse events, and other adverse events. A checklist and adverse event
reporting template are available on the ClinicalTrials.gov Administrative Informa-
tion Page to assist with the preparation and submission of adverse event results:
https://clinicaltrials.gov/ct2/manage-recs/how-report#AdministrativeInformation
Adverse event reporting elements include the description of the adverse event
reporting system along with source vocabulary (e.g., MedDRA 10.0, CTCAE 5.0),
the collection method (systematic or non-systematic), the title, and description of
the adverse event. In the all-cause mortality table, the number of participants that
died from any cause and the number of participants that were assessed for death
should be included by group. For the serious adverse event table, the number of
participants who experienced each AE by adverse event term and organ system
must be provided including the number of participants affected, the number at risk
(denominator), and the number of events. The time frame, arm description,
adverse event collection approach, and all-cause mortality table are required for
trials that had primary completion dates after January 18, 2017, as per the updated
Final Rule.
A third table for other adverse events (not including serious adverse events) may be included, reporting events based on a frequency threshold of occurrence that each adverse event must exceed. The chosen threshold must be less than or equal to the allowed maximum of 5% within at least one of the comparison groups.
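
As an illustration of this frequency-threshold rule, the short sketch below flags which other (non-serious) adverse events would be reportable at a given threshold. The event counts, group names, and helper function are hypothetical, not part of ClinicalTrials.gov or the PRS.

# A minimal sketch of the "other adverse events" frequency-threshold
# rule described above; all names and numbers are hypothetical.
def reportable_other_events(events, group_sizes, threshold=0.05):
    """Return events whose frequency exceeds `threshold` in at least
    one comparison group.

    events: adverse event term -> {group name: participants affected}
    group_sizes: group name -> number at risk (denominator)
    """
    assert threshold <= 0.05, "threshold may not exceed the 5% maximum"
    reportable = {}
    for term, counts in events.items():
        freqs = {g: counts.get(g, 0) / n for g, n in group_sizes.items()}
        if any(f > threshold for f in freqs.values()):
            reportable[term] = freqs
    return reportable

# Headache (12/100 = 12%) exceeds 5% in the treatment arm; nausea does not.
events = {"Headache": {"Treatment": 12, "Placebo": 3},
          "Nausea": {"Treatment": 4, "Placebo": 2}}
print(reportable_other_events(events, group_sizes={"Treatment": 100, "Placebo": 100}))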
Once results have been entered, they can be released by the sponsor. They will
undergo verification by a PRS administrator, similar to the registration process, and
all queries and errors will be identified and addressed, prior to releasing the record to
the public.

Quality Control Review of ClinicalTrials.gov Records

All submitted trial records undergo a quality review prior to being released to the
public. Quality control review of ClinicalTrials.gov records includes both automated
validation rules incorporated within each item for entry and review by PRS staff. Once
the responsible party/PRS administrator has released and submitted the initial record,
PRS staff reviews the record for completeness and any additional errors, deficiencies,
or inconsistencies (ClinicalTrials.gov 2019). Implementation of standard quality con-
trol review criteria, standardized review comments, and similar training programs across
PRS staff ensures consistency of the reviews. Reviews are also audited by other review
staff members to ensure proper review of the records. Specific quality control review
criteria and accompanying documents are publicly available and can be found under
"Support Materials" at the PRS User's Guide and Review material link: https://clinicaltrials.gov/ct2/manage-recs/resources#ReviewCriteria. Review criteria are orga-
nized by data entry element, which are categorized within 13 different modules for
describing the study protocol. PRS review of the trial registration record is estimated to
take between 3 and 5 days for registration and within 30 days for results. Reviewers
provide comments throughout the record that address general issues, formatting, and
specific notes on the completeness and appropriateness of each data element or result.
Comments may be identified as "major," which must be corrected or addressed within 15 calendar days, or "advisory," which are meant to improve the clarity of the record and can be addressed within 25 days from when the PRS staff sent notification (ClinicalTrials.gov 2019). While they are able to identify errors in the entry
of information, the reviewers are not responsible for ensuring the scientific validity and
merit of the trial and cannot confirm that the information is compliant with policy or
legal requirements (Tse et al. 2018).
The most common problem identified upon quality review of registration infor-
mation is incomplete or insufficient information for the primary and secondary
outcomes. Common issues encountered when reviewing results include invalid or
inconsistent units of measure, insufficient information about scales, internal incon-
sistencies between different sections in the record, the inclusion of written results or
conclusions, and unclear baseline or outcome measures (Tse et al. 2018).
Once PRS comments have been received and addressed, the record owner will
resubmit the information for further review and comment. At this point, reviewers
may respond with additional comments and suggestions or release the record to the
public along with the assigned NCT identification number.

Downloading and Analyzing Content from ClinicalTrials.gov

Downloading Content for Analysis

There are two primary methods for downloading clinical trials content from
ClinicalTrials.gov. The first is directly from the ClinicalTrials.gov database,
where some search results are available for download. For instance, a search of
studies within a particular disease site may be conducted, and the total records or a
selection of records from the search can be exported and downloaded to different
formats (.csv, XML, plain text, tab-separated values, and PDF). The record for an
individual trial can also be downloaded directly from the study record page. The
downloaded content includes 20 fields in long format with trials listed by study ID.
Exported data types include NCT ID, title, study status, whether results are available,
conditions, interventions, outcome measures, phases, sponsors, gender/age, enroll-
ment (sample size), funders, study type (interventional or observational), design, and
key dates (start date, completion date, date last updated, etc.). While downloading
content directly from ClinicalTrials.gov can be a simple and efficient way to
access up-to-date study information, it is limited to 10,000 records at a time and does
not include all registration fields and study results.
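As a minimal sketch, an exported file can be explored with standard tools. The file name and column labels below are assumptions and should be checked against the actual export produced by the site.

```python
# A minimal sketch: summarizing a CSV export of ClinicalTrials.gov search
# results. File name and column labels are assumptions; verify against
# the actual downloaded export.
import pandas as pd

df = pd.read_csv("ctgov_search_export.csv")

print(df.shape)                             # records x fields
print(df["Study Type"].value_counts())      # interventional vs. observational
print(df["Study Results"].value_counts())   # whether results have been posted
print(df["Enrollment"].describe())          # sample size distribution
```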
The second method for downloading ClinicalTrials.gov content for analysis is
through the Clinical Trials Transformation Initiative (CTTI) Aggregate Analysis of
ClinicalTrials.Gov (AACT): https://www.ctti-clinicaltrials.org/aact-database. The
AACT database contains restructured and aggregated information on Clinical Trials
registered in ClinicalTrials.gov that is refreshed daily and available in different
formats (e.g., Oracle dmp, Pipe delimited text output, and SAS CPORT transport).
It also includes static versions of the databases that are updated monthly and
available for download. The cloud-based platform can be accessed upon free
registration and download of the required programs: https://aact.ctti-clinicaltrials.org/download.
The AACT database is a relational database linked by
NCT ID and organized by trial registration fields and categories. A comprehensive
data dictionary and schema are available on the website at
https://aact.ctti-clinicaltrials.org/schema. Data in the AACT database have been cleaned,
sorted, and created with additional calculated fields generated to facilitate and
improve the analysis of trials. The AACT CTTI database has also integrated the
MeSH thesaurus, thus improving search and indexing capabilities. Regardless of the
method used to obtain and analyze clinical trials data, it is important to take the
limitations of the registries into consideration when interpreting and reporting the
results.
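Before turning to those limitations, the following is a minimal sketch of querying the cloud-hosted AACT database from Python, assuming the PostgreSQL credentials issued at free registration. The host, database, and table names are assumptions that should be verified against the published schema.

```python
# A minimal sketch of querying AACT, assuming the free-registration PostgreSQL
# credentials. Host, database, and table names should be verified against the
# published schema (https://aact.ctti-clinicaltrials.org/schema).
import psycopg2

conn = psycopg2.connect(
    host="aact-db.ctti-clinicaltrials.org",  # assumed cloud host
    port=5432,
    dbname="aact",
    user="your_username",
    password="your_password",
)
with conn, conn.cursor() as cur:
    # Count registered studies by type, using the core `studies` table
    cur.execute(
        "SELECT study_type, COUNT(*) FROM studies "
        "GROUP BY study_type ORDER BY COUNT(*) DESC"
    )
    for study_type, n in cur.fetchall():
        print(study_type, n)
conn.close()
```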

Limitations of Analyzing Data from ClinicalTrials.gov

There are several limitations associated with the use and analysis of data from
ClinicalTrials.gov. First of all, the analysis is based on the assumption that all trials
are registered (Zarin et al. 2017b; Gresham et al. 2018). Although registration has
significantly improved over time, especially during the last decade, one cannot
assume that the studies registered in ClinicalTrials.gov are an unbiased representation
of the clinical research enterprise (Tse et al. 2018). A recent paper published by Tse
et al. (2018) identifies and describes ten common problems encountered when using
ClinicalTrials.gov for research (Tse et al. 2018). Some of the key issues raised
include the fact that ClinicalTrials.gov includes more than just interventional studies,
with approximately 20% of the registered studies being observational and 450 with
expanded access records (Tse et al. 2018). Thus an understanding of the definitions
and specific registration elements and requirements for each study type is essential.
Trial records may also be incomplete or incorrect, thus leading to potentially
inaccurate reports and interpretations of the trial data. For example, missing
registration fields, especially for optional data elements, or misclassification of
data elements can occur, making it difficult to estimate and compare trends in clinical
trials. Incomplete records may also be a result of the changing database elements
over time, where the ClinicalTrials.gov structure has evolved since its establish-
ment in 2000 (Zarin et al. 2017b). Mandatory data elements have also been added
and modified over time, such as the primary outcome measure data elements and
sub-elements (Tse et al. 2018). Data entered in the trial record can also be modified
at any time, making it difficult to determine which information is most appropriate
for analysis. While the change history of modifications can be accessed, it is
difficult to obtain and download previous versions of the trial record for analysis.
Furthermore, while quality review of the trial record and results is performed, it does
not include verification of the scientific merit and validity of the information
(Zarin et al. 2007).
Finally, a growing problem is duplicate registration of clinical trials,
which can occur within ClinicalTrials.gov or across different trial registries (e.g.,
ICTRP). Duplicates within the ClinicalTrials.gov database are often a result of
follow-on or expansion studies being registered as separate records (Tse et al.
2018). There is currently no automated way to identify duplicates, although searches
of the trial titles and acronyms, summaries, and eligibility can be used to identify
similar records. As a result of a growing number of international trial registries, there
are also duplicate registrations across multiple registries, where almost 45% of
duplicates go unobserved or undetected (van Valkenhoef et al. 2016). There are
currently no methods for identifying identical trials across registries; as a result,
duplicates distort and overestimate the apparent number of registered trials.
Prevention of duplicates across registries would require coordination and potential
linkage using one universal registration number within a single platform, such as
the World Health Organization's ICTRP (Zarin et al. 2007).
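A crude version of the title-similarity search described above can be sketched as follows. The similarity threshold and normalization are illustrative assumptions, not an established deduplication method.

```python
# Illustrative sketch of flagging possible duplicate registrations by title
# similarity; the threshold is an assumption, not an established method.
from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b, threshold=0.8):
    """True if two titles have a normalized similarity ratio >= threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

titles = {
    "NCT00000001": "A Phase II Trial of Drug X in Advanced Disease Y",
    "NCT00000002": "Phase II Trial of Drug X in Advanced Disease Y (Expansion)",
    "NCT00000003": "Observational Registry of Condition Z",
}

# Pairwise comparison is O(n^2); a full registry would need blocking or indexing
for (id1, t1), (id2, t2) in combinations(titles.items(), 2):
    if similar(t1, t2):
        print("Possible duplicate:", id1, id2)
```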

Conclusion

ClinicalTrials.gov is an important resource for researchers, policy makers, providers,
and the general public that provides access to information that, prior to registration
requirements, was difficult to obtain. As a result of key regulatory events and policies, registration is
now widely accepted as standard practice. ClinicalTrials.gov continues to evolve and
incorporate new methods and policies to further improve its overall quality and
function such as recent efforts to import study documents (e.g., protocols, consent
forms) into the registration record (Zarin et al. 2017a) or the incorporation of
individual participant data (IPD) into ClinicalTrials.gov to increase the accountabil-
ity and transparency of clinical trial data. Researchers are also beginning to use
ClinicalTrials.gov as a more efficient method for conducting systematic reviews
using automatic extraction of the quantitative data (Pradhan et al. 2019). While the
use and analysis of ClinicalTrials.gov registration data can provide valuable infor-
mation about a particular intervention, it is complex and requires an in-depth
understanding and knowledge of the registration and reporting requirements. Thus,
it is the responsibility of the lead sponsors as well as study investigators, staff, and
responsible parties to provide complete and accurate registration information in
order to contribute to scientific advancement and improve the clinical trials research
enterprise.

References
De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S, Laine C, Marusic A,
Overbeke AJPM, Schroeder TV, Sox HC, Van Der Weyden MB, International Committee of
Medical Journal Editors (2004) Clinical trial registration: a statement from the International
Committee of Medical Journal Editors. CMAJ 171(6):606–607
Dickersin K (1990) The existence of publication bias and risk factors for its occurrence. JAMA
263(10):1385–1389
Dickersin K, Rennie D (2003) Registering clinical trials. JAMA 290(4):516–523
Ehrhardt S, Appel LJ, Meinert CL (2015) Trends in National Institutes of Health funding for clinical
trials registered in ClinicalTrials.gov. JAMA 314(23):2566–2567
Gresham GK, Ehrhardt S, Meinert JL, Appel LJ, Meinert CL (2018) Characteristics and trends of
clinical trials funded by the National Institutes of Health between 2005 and 2015. Clin Trials
15(1):65–74
Piantadosi S (2017) Clinical trials: a methodologic perspective. Wiley, Hoboken
Pradhan R, Hoaglin DC, Cornell M, Liu W, Wang V, Yu H (2019) Automatic extraction of
quantitative data from ClinicalTrials.gov to conduct meta-analyses. J Clin Epidemiol
105:92–100
Simes RJ (1986) Publication bias: the case for an international registry of clinical trials. J Clin Oncol
4(10):1529–1541
Tse T, Fain KM, Zarin DA (2018) How to avoid common problems when using ClinicalTrials.gov
in research: 10 issues to consider. BMJ (Clinical Research Ed) 361:k1452
van Valkenhoef G, Loane RF, Zarin DA (2016) Previously unidentified duplicate registrations of
clinical trials: an exploratory analysis of registry data worldwide. Syst Rev 5(1):116
World Medical Association (2013) World Medical Association Declaration of Helsinki: ethical
principles for medical research involving human subjects. JAMA 310(20):2191–2194
Zarin DA, Ide NC, Tse T, Harlan WR, West JC, Lindberg DA (2007) Issues in the registration of
clinical trials. JAMA 297(19):2112–2120
Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC (2011) The ClinicalTrials.gov results
database–update and key issues. N Engl J Med 364(9):852–860
Zarin DA, Tse T, Williams RJ, Carr S (2016) Trial reporting in ClinicalTrials.gov—the final rule. N
Engl J Med 375(20):1998–2004
Zarin DA, Williams RJ, Tse T, Ide NC (2017a) The role and importance of clinical trial registries
and results databases. In: Gallin JI OF, Johnson LL (eds) Principles and practice of clinical
research. Academic, London, pp 111–125
Zarin DA, Tse T, Williams RJ, Rajakannan T (2017b) Update on trial registration 11 years after the
ICMJE policy was established. N Engl J Med 376(4):383–391
Funding Models and Proposals
26
Matthew Westmore and Katie Meadmore
M. Westmore (*) · K. Meadmore
University of Southampton, Southampton, UK
e-mail: [email protected]; [email protected]

Contents
Introduction
Types of Research Funding Agencies and Their Societal, Political, and Organizational Context
Political Context for Public Research Funding Agencies
Sources of Funding
Philosophies and Theories of Change of Funding Agencies
Whose Priority Is It Anyway?
Impact
Funder Policies
The Importance of Remit
The Impact of Funder Policies on Research Culture
Funding Models
Open Versus Commissioned Calls
Common Funding Models
Proposal Assessment Processes
Typical Application Route and Decision-Making Processes
Who Reviews the Applications?
Assessment Criteria
Success Rates
Tips for Success
Summary and Conclusions
Key Facts
Cross-References
References

Abstract
Clinical trials require funding – often a lot. Funders of clinical trials are not just
sources of funding however. They are actors in their wider research systems, have
their own philosophies, values, and objectives, and operate within different
political, social, and economic environments. While there are commonalities,
differences in their context and culture shape their approaches to funding deci-
sions, what they are looking for from the research community, and therefore how
to successfully engage with them. Understanding the commonalities and
differences between funding agencies, the types of funding models they may
use, what they are trying to achieve, and what the decision-making process looks
like may help increase the success of proposals.
This chapter summarizes the similarities and differences of clinical trial
funding agencies around the world and the implications for funding models and
proposals. It is primarily aimed at trialists seeking to understand and ultimately
succeed in applying and funding; it will also be of interest to research funding
agencies (RFAs) and regulators.

Keywords
Research funding agency · Funders · Funding model · Decision-making ·
Funding decision · Sources of funding · Proposals and applications

Introduction

This chapter summarizes the similarities and differences of clinical trial funding
agencies around the world and the implications for funding models and proposals. It
is primarily aimed at trialists seeking to understand and ultimately succeed in
applying for and obtaining funding; it will also be of interest to research funding agencies
(RFAs) and regulators.
Funders of clinical trials are not just sources of funding. They are actors in their
wider research systems, have their own philosophies, values, and objectives, and
operate within different political, social, and economic environments. While there
are commonalities, differences in their context and culture shape their approaches to
funding decisions, what they are looking for from the research community, and
therefore how to successfully engage with them.
Figure 1 shows a hierarchy of factors from the wider environment, through to the
internal organizational setting that ultimately affects how funding schemes are
designed and what is expected of applicants.

Types of Research Funding Agencies and Their Societal, Political, and Organizational Context

Clinical trial funding agencies are not all alike. Broadly, they share the ultimate aim
of improving human health through research, but the way they operate and the way
they measure success differ. These differences depend on the societal, political,
economic, research-system, and organizational contexts in which they operate.

Fig. 1 Hierarchy of factors that influence research funding agency policies and procedures and ultimately what applicants have to address

With many commonalities, these differences lead to different aims and theories of change
of how to achieve those aims. This section outlines some of those differences.
The information presented in this chapter is solely meant to provide an overview
of the different types of contexts in which funding organizations operate and are
molded by. To do this, we have necessarily caricatured different types of organiza-
tions and generalized their objectives, values, and approaches. We have done this in
good faith to inform readers in a simple way rather than to suggest that any actual
funding agency neatly fits the character.

Political Context for Public Research Funding Agencies

The allocation of funding for research is not just a technical process but is a
political one as well; this is especially true for publicly funded research. At what
level politicians should be involved, and how influential they should be, is a controversial topic and beyond
the scope of this work. What is important to understand is how the political context
flows through the hierarchy of factors, Fig. 1, through to the expectations placed
on researchers.

Politics (and therefore policy makers) can influence the research that is funded
in a spectrum of ways: from direct involvement in individual decision-making
(for example, in the prioritization of individual calls for research as the primary
customer of the eventual evidence) to setting the wider policy context in which
research funding agencies interpret their role (for example, in how some countries'
national policies for economic growth have been internalized in funding agencies as a
desire to demonstrate potential impact at the application stage).
When done well, this connects researchers with policy makers and ensures
research reflects the needs and desires of those that fund it through taxation; when
done badly, this represents an unacceptable imposition on academic freedom and
allows political bias to cast a shadow across research.
Politicians who take little interest in research are perhaps no less worrying than
those who take too much; a survey of Canadian members of parliament and/or senior
aides found that 32% knew nothing about the role of the Canadian Institutes of Health
Research (CIHR) despite it being the primary federal funder of research (Clark et al.
2007).
US President Barack Obama summed up the tension:

Obama pledged in a speech to protect “our rigorous peer-review system” to ensure that
research “does not fall victim to political manoeuvres or agendas” that could damage “the
integrity of the scientific process.” However, he added that it was important that “we only
fund proposals that promise the biggest bang for taxpayer dollars”. (Obama 2013)

Sources of Funding

One of the characteristics of funding agencies most influential on their funding
models is the source of their funding.
Public RFAs funded through public finance (e.g., taxation, public borrowing) are
accountable to society more broadly and therefore tend to operate in ways that
promote societal values such as inclusivity, democracy, fairness, and transparency.
Different public RFAs will operate in different areas of research depending on the
prevailing theory of change. An important characteristic is the sponsoring govern-
ment department or legislation. Research councils in the UK, for example, are
sponsored by the government department responsible for business, energy, and
industrial strategy, whereas the National Institute for Health Research (NIHR) is
sponsored by the department responsible for the UK’s health, public health, and
social care services. What they each look for from the research community only
makes sense within that context.
Until relatively recently, that is, the last 10–20 years, public funding has valued
long-term incremental advances in societal value as well as shorter-term scientific
breakthroughs. Public RFAs tend to be broad in their remits but may set strategic areas
of focus. Where they differ more markedly is in the domains of science in which they
operate. For example, the UK Clinical Research Collaboration assesses public and
philanthropic funding (UK Health Research Analysis – http://www.ukcrc.org).
It categorizes funding into disease categories and into research activity categories:
underpinning research, etiology, prevention, diagnosis and detection, treatment development,
treatment evaluation, disease management, and health services. The analysis shows
different RFAs focusing on different categories.
Philanthropic RFAs, funded by high net worth individuals or fundraising led by
the founders, may have objectives significantly influenced by their founders. This
may be because of direct lived experience, such as trusts established in someone’s
name, say a child of the founder who died from a particular disease, or because of
wider interests in helping humanity benefit from their “excess” wealth, or as is
sometimes termed “giving back to society.” These organizations are also
influenced by the founders in the way they operate. For example, founders
whose background is in venture capital may employ venture capital thinking and
ways of working, so-called philanthropic venture capital. They tend to focus more
on portfolios of research, value overall return on investment, and are often
characterized (or perhaps caricatured) as seeking rapid and transformative change
to health outcomes. Funding is often not limited to specific countries or specific
disciplines. They may be more utilitarian and tangible in their metrics of success
compared to other types of funders.
Medical research charities are philanthropic in their outlook but are heavily
influenced by their constituents and donor communities. Medical charities broadly
operate in four main domains: fund raising, lobbying, service provision, and research
funding. Different charities will have different profiles across these domains, with
some being predominantly active in one domain and having a lesser or no role in
others. This can lead to differences in the way they operate within the research
funding domain. For example, charities whose primary interest is lobbying and
service provision may see research funding as a powerful tool to drive fundraising.
They will therefore prioritize newsworthy and high-profile research, working with
esteemed institutions. Given that the overall organization culture and expertise
within the charity is focused on other areas, these charities’ research agendas are
driven predominantly by the research communities with whom they work and may
be more science driven, focusing on more headline grabbing exploratory and
discovery phases of research. Charities whose primary interest is in medical research
itself may be more driven by the immediate needs and perspectives of their current
patient communities. They are perhaps more likely to support a more balanced
research portfolio across prevention, cure, and symptom management and are
more likely to more actively steer the research agenda themselves.
Commercially funded clinical trials are dominated by the pharmaceutical sector,
followed by medical devices. These studies could be completely funded by the
company or collaborative research with funding coming from other partners –
other companies, charities, or public RFAs. Commercial funding (or investigator-
sponsored trials or studies as they are also referred to) focuses on developing
approaches to health problems for the benefit of people but that obviously have a
commercial emphasis and application. As this research is often for commercial gain,
the research outputs are not always open and accessible to all. In addition, to attract
private funding researchers might require a good understanding of market context
and awareness of potential conflicts of interest around intellectual property,
especially ownership of the research and freedom to disseminate knowledge.
Many organizations will fund so-called own account research. The original
funding into the organization comes from a variety of sources but there is very little
link between the original source of those funds and what research it goes to fund. The
research will be carried out by the employees of the organization itself. The
important characteristics in terms of what research would be prioritized, the pro-
cesses, and oversight are defined by the organization itself. Organizations will tend
to have a wide variety of different mechanisms and schemes aimed at supporting the
aims of the organization (rather than source funder). Commercial organizations fund
large amounts of research in this way and this will form the spine of their in-house
research and development pipelines; public institutions tend to fund small projects
(e.g., developmental, pilot, and feasibility studies in the context of this book) aimed
at developing opportunities and capacity for future external proposals to other RFAs.
A new democratized model of research funding is now also emerging that will
change the nature of the relationship between the source of funding and the recipient
researcher. Crowd funding approaches are being used directly by researchers to raise
funding for research projects from the public at large through (typically) online
platforms. Traditionally, research is funded by collecting a large number of small
contributions, such as through taxation or donations, but there is an intermediary
institution between the person making the contribution and the researcher who receives
it. Crowd funding removes that intermediary. On the assumption that the intermediary
is not just a source of funds (it influences priorities and the conduct of research,
plays a quality assurance role, and acts on behalf of other stakeholders), these new approaches will
ultimately have a significant impact on what research happens and how it is delivered.
A final category to explore is that of the transnational RFA: funders that have a role
to play in funding across national boundaries. The nature of these RFAs' expectations
differs depending on whether their funding comes from a single national entity, such as
a philanthropic trust with an interest in funding global health research such as the UK’s
Wellcome Trust or The Bill and Melinda Gates Foundation, or multinational collab-
oration with an interest in the collective advantage of the member states, such as the
European Union Horizon 2020 program. These different factors can again lead to
aims, expectations, policies, and procedures that are markedly different compared to
other RFAs. For example, the European Union’s Horizon 2020 values, indeed man-
dates, multinational collaborative research involving member states with high and low
national incomes. Its aims are not just around scientific or health benefits but to reduce
inequality across member states and to promote overall cohesiveness of the European
Union – these are political, not scientific or clinical aims.

Philosophies and Theories of Change of Funding Agencies

What funders are trying to achieve and how they believe they will achieve it also
significantly influence their policies and procedures. These could loosely be called
philosophies and theories of change. Some funders will be quite explicit about this
and others will be influenced by a wider set of norms, culture, and tacit knowledge.
There are philosophies and theories that apply very generally to research and those
that are very focused on clinical trials.
Starting generally, the oldest and most influential concept is the Haldane princi-
ple. This is the idea that politicians may set the overarching strategic allocation of
funding (e.g., what to spend on research into the liberal arts, what to spend on
engineering, what to spend on health-related research); these are political questions.
Beyond that, decisions about what to spend research funds on should be made by
researchers rather than politicians; these are technical questions. This principle has
underpinned the entire peer review process since the Haldane report was published
in 1918 (HMSO 1918) and has been influential in subsequent funding policy not just
in the UK but around the world. Haldane in its purest sense gives primacy to
academic freedom in deciding on the direction of research. This has been highly
successful and has led to many modern economies being based on the advances in
knowledge, culture, and technology that has resulted from it. Haldane has its
limitations however.
The research community may well be best placed to decide highly technical
scientific questions, but which research to support and how it should be
delivered are often subjective, value-laden issues. This is particularly the case when
the intended purpose of research is more utilitarian than enlightenment or non-
specific advances in knowledge; when the intended user of the research is not
another researcher (but, for example, a policy maker, clinician, or patient). This
requires a wider range of opinions, experiences, and expertise.
The first major challenge to the Haldane principle also originated in the UK. In
1971, the Rothschild Report (HMSO 1971) raised the issue, and proposed solutions
to it, that the research community and commercial funders while undoubtedly
successful in some fields were failing to deliver in others. Most notably in applied
research areas where parts of society needed more immediate answers to more
specific questions; in the context of this work, questions like is treatment A better
than treatment B? Rothschild developed the concept that applied R&D must have a
customer and that customer should be influential in deciding which research should
be carried out. Rothschild was and in some ways remains highly controversial. It has
nonetheless changed the nature of research and research funding.
Rothschild also led to the concept of market failure research funding, whereby
public funders should not support research that would happen anyway –
research that would be funded by commercial funders or philanthropic funders.
Doing so is not only unnecessary (and therefore a poor use of limited public funding
that could be spent on other areas) but can also lead to crowding out. Those that
would have funded in that area now do not either because they do not need to or
because it now does not make commercial sense; for example, public funding results
in public knowledge that cannot be protected for commercial gain. This has the result
that the additional public funding actually reduces the overall investment in an area
rather than increasing it.
Influenced by the work of Mariana Mazzucato in The Entrepreneurial State
(Mazzucato 2018), an alternative view has also developed. In certain circumstances,
public funding can indeed have the opposite effect whereby the injection of funding
in an area causes other private and philanthropic funders to also fund in that area: this
concept is called crowding-in (as opposed to crowding-out). The public funder must
of course choose the area and nature of its investment carefully – this in turn will
again change the ways in which it makes its funding decisions.
Turning more specifically to clinical trials, a common narrative in the development
of new treatments is the translational pathway. The US NIH defines translational
research as:

Translational research includes two areas of translation. One is the process of applying
discoveries generated during research in the laboratory, and in preclinical studies, to the
development of trials and studies in humans. The second area of translation concerns
research aimed at enhancing the adoption of best practices in the community. Cost-effectiveness
of prevention and treatment strategies is also an important part of translational
science. (Rubio et al. 2010)

How strongly the funder subscribes to this model, and sees their role in facilitat-
ing ideas move along the pathway, will have significant impact on not only the
methodology expected but also who is setting research priorities and what outcomes
would be of greatest interest.

Whose Priority Is It Anyway?

A critical aspect of a research funder's philosophy is to whom they feel accountable
and who therefore should define their priorities. A research funder who sees their
role in supporting the research community develop and deliver to their full potential
will pay greatest attention to that community in priority setting (c.f. Haldane
principle); a funder who sees their role in delivering direct patient benefit will look
to clinical and patient communities to set priorities (c.f. Rothschild). This is impor-
tant because when different stakeholders are consulted on research priorities, they
provide different answers – patient priorities don't completely align with clinician
priorities, nor with researchers' priorities; a further list might be generated
when asking policy makers. When you coproduce priorities with people from each
of these communities you end up with another different list. Understanding that will
inform applicants on how to set their own priorities, who to work with to do that, and
ultimately how to succeed in obtaining funding.
Of growing importance to funders of health-related research in general, and
clinical trials in particular, is the role of the patient, carer, consumer, lay represen-
tative, or member of the public above and beyond their role as participants recruited into trials.
Different terminology is used across the world and across funders – for the sake of
clarity, we will use patient and public. Funders are increasingly expecting patients
and the public to play a number of varied and more active roles. Again different
terminology is used such as involvement, advocacy, engagement; we will use
involvement. The UK’s INVOLVE (http://www.invo.org.uk) defines consumer
involvement in research as research being carried out “with” or “by” members of
the public rather than “to,” “about,” or “for” them. This includes, for example,
working with research funders to prioritize research, offering advice as members of a
project steering group, commenting on and developing research materials, and
undertaking interviews with research participants. More on this topic can be found
in section 3 ▶ Chap. 30, “Advocacy and Patient Involvement in Clinical Trials.”

Impact

Research impact is the effect research has beyond academia. There is no single
definition nor approach to measuring it, but it is often described as research that
has wider benefits and influences on society, culture, and the economy. It remains a
contentious and complex subject and also depends on the individual funder’s
context and where they sit in the translational pathway; one funder’s impact is
another funder’s input. What is universal, however, is every funder wants it. It
speaks to the funder’s fundamental purpose and it forms an important part of how
the funder is held to account by those providing the funding. A public funder has to
justify its overall impact to government, a philanthropic funder to its donors, and a
commercial funder to its shareholders. A discussion of the nature and role of research
impact is beyond the scope of this work but it is important to underline its
importance to the relationship between research funder, funded researcher, and
wider stakeholders.

Funder Policies

Funders encode all of the above into policies and procedures that guide their own
actions and the expectations placed on the research community. These will cover all
areas relating to the research over which the funder either has responsibility (such as
legislative requirements or financial rules ensuring good use of funds) or wish to
have influence (such as research integrity or transparency).
Different funders with different contexts will of course have different sets of
rules, policies, and procedures that must be understood and complied with.

The Importance of Remit

All funders will have limitations on the nature of research they will support. This
flows from the fundamental purpose of the organization through the intended
purpose of the scheme being applied to. Remits will operate at different levels –
the whole funder, a funding program, or a specific call. Different funders will specify
their remits differently; some might be methodologically driven (e.g., by clinical trial
phase), others by clinical area or health need. It cannot be overstated how important
it is to understand the remit of a call or program being applied to. Preparing
applications can be an enormous piece of work, yet up to 20% of applications (for
example, see the NIHR Health Technology Assessment success rates at
https://www.nihr.ac.uk/documents/hta-programme-success-rates/23178) can be deemed out of
remit and will not be considered for funding.

The Impact of Funder Policies on Research Culture

While the majority of the focus of RFAs is on the relevance, quality, and impact of
the research they support, funders are increasingly paying attention to how their
policies and procedures have a wider influence on research delivery and culture.
RFAs sit in a highly influential position and are increasingly using that position to
improve research: for example, the move from a Haldane-dominated view of the
world to Rothschild; the rise of the impact agenda; insistence on open access
publication; and wider clinical trial transparency.
Of particular importance in the global movement toward quality improvement in
research is the Research Waste and Rewarding Diligence Alliance (REWARD). The
REWARD Alliance was launched at the REWARD/EQUATOR Conference, 28–30
September 2015, stimulated by the seminal work of Iain Chalmers and Paul Glasziou
on avoidable research waste in 2009 (Chalmers and Glasziou 2009) and a series of
articles in the Lancet in 2014 (Lancet Series Research: increasing value, reducing
waste 2014), detailing expert consensus recommendations for all sectors of the
research ecosystem. The Alliance’s purpose is to facilitate efforts to maximize the
potential for research contributions by addressing five cross-cutting ideals: (1) The
“right” research priorities are set, with input from the users of research, including
patients and clinicians; (2) Studies are appropriately designed by building on what is
already known and are robustly conducted and analyzed through using up to date
methods to minimize bias; (3) Research regulation and management requirements
are proportionate to risks; (4) All information on research methods and study
findings are accessible; (5) Study reports are complete and usable. Both the
REWARD Alliance and the 2014 Lancet series noted that progress would require
action independently and collaboratively by different stakeholders, namely
researchers, funders, regulators, and publishers, with the inclusion of patients and
the public embedded in the activities of each of these stakeholder groups. An
international group of RFAs called Ensuring Value in Research
(http://www.ensuringvalueinresearch.org) has formed and developed a conceptual model and
ten guiding principles to address these issues. These are now beginning to inform
RFA policies globally (Fig. 2).
A second initiative particularly relevant to this work is the World Health
Organization's Joint statement on public disclosure of results from clinical trials (World
Health Organization 2017). This sets out a number of expectations regarding clinical
trial transparency, namely:

• Clinical trials must be registered in design specific registries
• Clinical trial protocols must be made publicly available at the start of studies
• Registry information must be kept up to date
• Summary findings must be made publicly available within 12 months of the completion of the primary study
• Full results must be made publicly available within 24 months of the completion
of the primary study
• Past registration and publication performance should be taken into account when
applying for new funding
• Individual patient level data should be shared

Fig. 2 Ensuring value in research conceptual model and guiding principles

At the time of writing, 21 RFAs had signed up and are now implementing policies
to deliver on this. Even where RFAs are not formal signatories, however, the
importance of transparency policies is growing, and they are likely to be part of the
requirements for funded researchers.

Funding Models

Given the different contexts, environments, aims, and objectives of funders, different
models of funding have been developed. Each will follow a different process of
decision-making and place different expectations on the research community during
the application and delivery phases of research. Fundamentally, however, all are
attempting to achieve the same aim. The delivery of relevant, high quality, usable,
508 M. Westmore and K. Meadmore

and accessible answers to specific research questions. Where they differ is how and
who crafts the research question.

Open Versus Commissioned Calls

The two most common funding models used are open call (or researcher-led or
response mode) and commissioned call (or targeted or contract research). In open
call, researcher-led or response mode funding models, the RFA sets a high level
remit and the research community develops the research question and methodology.
In contrast, in commissioned call funding models, the RFA, working with stake-
holders, fully specifies the research question and the research community compete to
be the best team to deliver it. Typically, in these cases applicants will be provided
with a specific brief or vignette and the program of research in the application must
address this. Researcher-led calls or responsive mode funding is where the
researchers drive the research questions and topics and can propose research ques-
tions on any topic (so long as they are within the organization’s remit).
Sitting between open and commissioned calls are thematic calls. Some funding
organizations also issue themed calls for research in areas that have been identified as
health challenges, scientific, clinical, or community priorities. These are specified
more tightly than open calls but more broadly than commissioned calls.
Across all of these models, the importance of remit should be restated. Applicants
must ensure they are not only within the remit of the funder or program but also the
call in question. Deviations from the call specification may be tolerated but this
would be a high-risk strategy and would have to be robustly defended by the
applicant.

Common Funding Models

Table 1 summarizes some common funding models.

Proposal Assessment Processes

Regardless of which RFA is applied to and where the funds are sourced (public,
philanthropic, commercial, etc.), all RFAs have to make decisions regarding which
research applications they should invest in. Good decision-making processes are
seen as integral and essential to the research process (Nurse 2015). This is no easy
challenge as the number of applications received by funding organizations is often
large and the amount requested by the applicants typically outweighs the amount of
resource available (Guthrie et al. 2018). As such, funding organizations have
rigorous processes in place to facilitate decision-making in order to whittle down
the number of competitive applications and ensure that funds are awarded to the best
applications. However, “best” does not have just one definition, and instead depends
on the organizational context and priorities.

Table 1 Common funding models used by different types of funder

Response mode (a.k.a. grant rounds, researcher-led, open call)
  Description: Funder specifies a broad remit to the program. Researchers develop and submit applications or proposals.
  Purpose: Allows the research community the flexibility to propose any project (within remit). Allows the funder to decide on a project-by-project basis.
  Typically used by: Public, philanthropic

Commissioned call (a.k.a. targeted)
  Description: Funder specifies a specific piece of research (e.g., may advertise a PICO) that applicants apply to deliver.
  Purpose: Allows the funder to fully specify the research project, normally in response to an identified need: for public and philanthropic funders, an evidence user's expressed need (e.g., patient community, policy maker, clinical community); for commercial funders, an internal R&D pipeline need.
  Typically used by: Public, philanthropic, commercial

Challenge areas (a.k.a. moon shots)
  Description: Funder sets out a major challenge facing society needing a coordinated and concerted response across a range of projects and disciplines.
  Purpose: A way of focusing a wide (possibly transdisciplinary) community to work over a longer period of time.
  Typically used by: Public, philanthropic

Strategic or thematic calls
  Description: Funder sets out a strategic or thematic area needing a number of projects to be funded.
  Purpose: Useful where there are a number of uncertainties needing addressing and where the funder wants to guide the research community but not fully specify the research.
  Typically used by: Public, philanthropic

Call-off contracts
  Description: Funder enters into a contract for future research where the individual projects are not known at the point of award.
  Purpose: Funder is able to initiate research very rapidly from a team that is established and expert.
  Typically used by: Public, commercial

Contract or grant funding
  Description: Public funder enters into a contract for research or issues a grant.
  Purpose: Grant funding typically comes with limited oversight from the funder; grants are efficient and useful instruments when academic freedom is paramount. Where the funder wants greater control over the delivery of the research project, a contract of research would be used.
  Typically used by: Public, commercial

Project funding
  Description: Funding of a single project. The project may have multiple subprojects that collectively address a narrow need.
  Purpose: Where a single subproject or a small number of interconnected subprojects are required.
  Typically used by: Public, philanthropic, commercial

Block funding
  Description: Research institution is awarded substantial funding with limited direction from the funder on what to use it for.
  Purpose: Provides long-term sustaining and capacity-building funding. Allows for the highest levels of academic freedom and creativity.
  Typically used by: Public, philanthropic, commercial

Infrastructure funding
  Description: Funding to provide infrastructure to support research projects funded by other means.
  Purpose: Provides long-term sustaining and capacity building to support a wider research community.
  Typically used by: Public, philanthropic, commercial

Sandpits and other variants
  Description: Prefunding workshops.
  Purpose: To bring research communities together to collaborate in ways that would not otherwise happen, e.g., where highly creative or radical approaches or transdisciplinary research is required.
  Typically used by: Public, philanthropic

RFAs also need to balance academic
freedom and creativity with accountability and value for money. This balance will
again vary depending on the nature of the funder and nature of the research.

Typical Application Route and Decision-Making Processes

There are many types of approaches and processes involved in allocating research
funding (see Table 2).
The overarching processes for allocating funding are largely similar across public
and philanthropic funders (Nurse 2015). Commercial, own account, self-funded, and
crowd-funded research follow a vast heterogeneity of processes that cannot be
usefully summarized here. This section therefore focuses on public and philan-
thropic funders.
In the current landscape, the use of triage, face-to-face committee meetings, and
external peer review comprises a typical approach by funders to decide which
applications to fund (see Fig. 3). This standard route has been developed over
many years to embed the Haldane Principle and principles of openness and fairness.
Typically, once an application has been submitted, it goes through an internal triage system.

Table 2 Stages for research fund allocation

Remit and competitiveness checking
  Brief description: Usually a gate-keeper for initial submissions. It is a type of internal triage in which applications are shortlisted into those that are believed to be within the program remit and are competitive (e.g., reasonable costs, methodology).
  Pros (not exhaustive): Reduces the number of applications that go to the next round or are sent out to review by excluding those that are weaker or not in remit.
  Cons (not exhaustive): Sometimes seen as a hidden review process that is not transparent and could filter out potentially good applications or let through bad applications. Increased burden to the funder: another process, and staff need appropriate training.

One-stage application
  Brief description: One-stage applications require submission of a full application upfront.
  Pros: Reduces the time needed to reach a decision.
  Cons: Increased burden to the funder, as there are potentially more full applications to review. Limits opportunity for feedback to applicants.

Two-stage applications
  Brief description: Require an expression of interest or a reduced application form at stage one. If the applicant is successful, they are then invited to submit a full application at stage two.
  Pros: Reduces the number of applications that go to the next round or are sent out to review by excluding those that are weaker. Ability to give feedback to applicants between stages.
  Cons: Increases the time for the decision-making process. Could filter out potentially good applications or let through bad applications.

External peer review
  Brief description: Applications are sent to experts in the field for comments and recommendations. External peer reviewers do not sit on the research program's funding committee.
  Pros: Funders gain expert opinion on the application.
  Cons: Biases exist (e.g., age, gender, stage of career); unreliable, as scores and comments can vary; high burden for funders to find reviewers; reliant on quality and timely reviews.

Face-to-face committee meetings
  Brief description: Applications are reviewed by a range of experts in different fields in a face-to-face meeting.
  Pros: Provides opportunity for thorough discussion and clarification.
  Cons: Biases exist; certain members of a committee may be better at arguing a case (and so more likely to get applications they like funded); high time and cost burden.

Virtual committee meetings
  Brief description: Applications are reviewed by a range of experts in different fields in a virtual environment (e.g., telephone conference or online such as Skype).
  Pros: Provides opportunity for thorough discussion and clarification; more inclusive than face-to-face as it reduces travel, time, and geographic constraints.
  Cons: Biases exist; certain members of a committee may be better at arguing a case (and so more likely to get applications they like funded); reliant on technology.

Inclusion of stakeholder perspectives
  Brief description: Applications are reviewed by lay people, patients/carers of people with a specific health condition, or people from a specific population.
  Pros: Funders gain a lay, patient, and/or population perspective.
  Cons: Biases exist (toward own health condition/population); can be difficult to find PPI for narrow criteria.

Sandpits and other variants
  Brief description: A sandpit model aims to bring together researchers, funders, and reviewers to interactively discuss and revise proposals at a workshop.
  Pros: Provides a forum for brainstorming to foster creativity and generate research proposals quickly; shorter timeframe for proposal review and revision.
  Cons: Relies on appropriate selection of participants; may not be inclusive as it involves 3–5 day workshops.

Random allocation for applications above a certain threshold
  Brief description: There are many different ways that this could be done, for example, sorting applications into three groups through peer review according to a certain threshold (e.g., not fundable, probably fundable, definitely fundable). Not fundable applications are declined and definitely fundable applications are accepted. Decisions in the middle tier are made through random allocation up to the amount of resource available.
  Pros: Eliminates biases; transparent.
  Cons: Uncertainty around outcome; may not capture very good applications; still needs reviewers and/or a committee for the initial ranking process. Even if statistically fair, the use of random chance to decide funding is not welcomed by the research community.
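
The random-allocation model in the final row of Table 2 can be made concrete with a short sketch. The tier labels, costs, and budget handling below are invented for illustration and do not reflect any particular funder's procedure.

```python
# Illustrative sketch of tiered random allocation: fund all "definitely
# fundable" applications, then draw the middle tier at random until the
# budget is spent. Tiers, costs, and budget are invented for illustration.
import random

applications = [
    {"id": "A1", "tier": "definitely fundable", "cost": 400_000},
    {"id": "A2", "tier": "probably fundable",   "cost": 350_000},
    {"id": "A3", "tier": "probably fundable",   "cost": 300_000},
    {"id": "A4", "tier": "probably fundable",   "cost": 250_000},
    {"id": "A5", "tier": "not fundable",        "cost": 500_000},
]

budget = 1_000_000
funded = [a for a in applications if a["tier"] == "definitely fundable"]
budget -= sum(a["cost"] for a in funded)

middle = [a for a in applications if a["tier"] == "probably fundable"]
random.shuffle(middle)  # the random-allocation step
for app in middle:
    if app["cost"] <= budget:
        funded.append(app)
        budget -= app["cost"]

print("Funded:", [a["id"] for a in funded], "| remaining budget:", budget)
```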

Applications which are considered competitive are then sent to peer
reviewers. Different RFAs approach peer review differently; some will rely on
sending applications singularly or in small numbers to individual external and
independent experts; others will send all applications in a call to a face-to-face
committee of experts; other funders do both.
If both external and committee peer review are used, external reviewers' comments
and recommendations on the proposal are considered at a funding committee
meeting (also referred to as panels or boards). Applications considered fundable may
then be ranked and a final list of funded applications is drawn up. Before applicants are
informed, the outcome may first need formal sign-off from the RFA's governance
structures and/or external sponsoring agency (such as a government department for
a public funder).

Fig. 3 Illustration of the common elements involved in an application route and potential areas for differences
Different funders will operate variations on the process steps included in this
model. For example, the Canadian Institutes for Health Research (CIHR) and the
Australian National Health and Medical Research Council do not send proposals out
to external peer review. Other funders will not hold face-to-face meetings; instead
making their decision via electronic panels, scoring, and discussion.
Differences in the way different process steps are carried out may also include the
assessment criteria against which applications are judged (see Table 3), whether the
application is in response to a commissioned call or researcher-led, whether appli-
cants can make revisions and rebuttals following feedback from reviewers, the
number of internal triage stages, and the scoring system used for rating the applica-
tion (Guthrie et al. 2018).

Table 3 Common assessment criteria used by funding organizations

Remit and relevance
  Does it fit the funder's objectives and research strategy?
  Does it fit the research program objectives (e.g., global health initiatives expect to see plans in the application to work collaboratively with partners in low- and middle-income countries)?
  Does it fit the specific call?

A need to generate evidence
  Is the question of importance and has it been tested before?
  Is it based on a systematic review?
  Is there equipoise?

Scientific rigor
  Is an appropriate methodology used?
  Will the outcome measures answer the research question?
  For example, is the sample size big enough and are there adequate recruitment strategies in place?

Innovation/originality of proposal
  Has the study previously been conducted?
  Is it investigating a topic using novel methods?
  Is it studying a novel topic?

Patient and public involvement
  Is there meaningful and sufficiently resourced patient and public involvement throughout the research project?

Team
  Is the team multidisciplinary?
  Does the team engage appropriate stakeholders?
  Does the team have an excellent track record, or has adequate mentoring and support been put in place?
  Is the team balanced and credible?
  What is the track record?

Stakeholder perspectives
  Different funders expect to see this at different levels, from stakeholders reading materials, to stakeholder steering group involvement throughout the project, to coproduction.

Value for money
  Does the application generate value for money given the importance of the question?
  Has the study been adequately and sufficiently costed?

Potential societal impact
  Who will benefit from this research?
  Will the research make a change in policy or practice?
  Is that credible? Is that value for money?

Potential research impact
  What is the pathway for academic impact?
  Will a paper be published?
  Who else will benefit from the results of this study?
  Will the research have potential for implementation in practice?

Intellectual property and commercialization review
  Do the outcomes from the study have the potential for IP or commercialization?
  How will that be managed? How will the IP be exploited?
  Is that appropriate given the nature of the funder? For example, a public funder will have a different view of commercialization to a commercial funder.

Conflicts of interest
  What are the actual or perceived conflicts of interest?
  How are they being managed?
  Do any residual conflicts threaten scientific integrity (bias) or confidence in the eventual results?

Although these processes are widely used, there is also much criticism surround-
ing them. For example, it is suggested that peer review (both external and internal
funding committees) is heavily biased and not reliable (Guthrie et al. 2017). Opin-
ions on what is fundable can be very subjective and vary widely. Largely these
issues, real or perceived, come from the long-term trend of research funding becom-
ing a professionalized, largely technical, and bureaucratic process. These trends are
in turn due to the desire for funders to operate fair, transparent and efficient
processes.
Over the last decade, funders have begun to explore variations on this typical approach as well as alternative processes for funding, such as sandpits. However, the evidence on how efficient and effective these approaches are – alternative and more traditional alike – remains limited. Future work needs to explore in which circumstances different approaches work best, how, and for whom.

Who Reviews the Applications?

The application is therefore reviewed by a large number of people. Under the Haldane principle, it is suggested that scientists should assess other scientific work, as they are best placed to make a judgment on the excellence of the proposal. Funders have since gone beyond this principle, and it is now widely recognized that relevant expertise extends well beyond scientific peers. Accordingly, many funding organizations seek external peer reviews and committees/panels that comprise a mix of experts: for example, academics, clinicians, health economists, methodologists, patients, and members of the public. Some funding organizations
will ask each reviewer to read the whole application and others will ask them to
focus on their areas of expertise. For example, a methodologist will look at the
methodology of the proposed research studies and a patient representative may
look at the aspects of the proposal relating to PPI. Each will then be asked to
score the application based on specific assessment criteria. Applicants need to
understand the nature of the decision makers in order to know how to
ensure their application is compelling and addresses the expectations of a wide
range of experts; for example, an application that is a tour de force methodolog-
ically will make a compelling case to methodologists but may fall flat when read
by the clinical or patient experts.

Assessment Criteria

To assess applications, each organization will have a set of specific assessment criteria that are applied to each application (see Table 3). These assessment criteria
may change from organization to organization and from research program to
research program (Guthrie et al. 2018). They will incorporate the funding organiza-
tions’ strategic research plans, as well as the aims and objectives of specific research
programs. Therefore, there is not a one size fits all and applicants are advised to
516 M. Westmore and K. Meadmore

check the remit of the organization and the research program/call they are submitting
to before writing on writing the application begins.
General or higher level assessment criteria usually consist of a few core values
that are important to a funder. For example, the UK’s NIHR states three general
assessment criteria: (1) need for the evidence; (2) value for money; (3) scientific
rigor. In a review of the UK research councils, Nurse (2015) suggests three key factors that should be considered when making funding decisions for scientific research: (1) who the researcher(s) are; (2) the content of the research program; and (3) the context within which the research is being undertaken. In
practice, more criteria that focus on more specific questions under these headings
are used during review.
Common assessment criteria include scientific rigor; remit and relevance (fit with the funder's and research program's objectives and research strategy); potential research impact; innovation/originality of proposal; value for money; a need to generate evidence; and potential societal impact (see Table 3). Funding organizations may use a combination of these criteria and may weight the criteria according to their values. For example, a funder of late phase pragmatic trials will weight meaningful and sufficiently resourced patient and public involvement throughout the research project more highly than a funder of early phase efficacy studies.

Success Rates

Not all submitted applications will get funded. Given the effort involved in devel-
oping and submitting an application, applicants have careful decisions to make
regarding where to submit their research proposals to enhance their chances of
success. In addition to checking the funding organization's remit and objectives,
applicants may also want to consider the success rates associated with a funding
organization or a particular research program. More interesting than the numeric
value of the success rate may well be the reasons behind it being high or low and
what the applicant can learn from that.
Success rates provide information about the percentage of applications that
receive funding from the total number of applications that are reviewed. Note that
there are also a number of applications (generally about 10–20%) which will not
make it past the first internal triage stage (i.e., remit and relevance). This stage is sometimes forgotten, and success rates often do not include these applications in their calculations; i.e., the true success rate could be lower than stated.
At each decision stage of the review process, some applications are rejected.
These figures are fairly consistent across funders internationally and are reported by
the funding organizations (usually found on their website). In general the overall
success rate of funding organizations is between 15% and 25% (for example, see https://www.timeshighereducation.com/news/uk-research-grant-success-rates-rise-first-time-five-years for UK examples and https://report.nih.gov/success_rates/ for NIH data). For those funding organizations that have a two-stage review process,
for example, the UK's NIHR and the Wellcome Trust, about 50% of applications are rejected at each stage.
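To make the arithmetic concrete, the short sketch below combines the illustrative figures quoted above – a 10–20% internal triage rate and roughly 50% rejection at each of two review stages. The numbers are purely illustrative, not data from any particular funder.

```python
# Illustrative arithmetic only; the rates are the example figures quoted
# in the text, not statistics from any particular funding organization.
applications_submitted = 100
triaged_out = 15                 # ~10-20% fail internal triage (remit/relevance)
reviewed = applications_submitted - triaged_out

stage_pass_rate = 0.5            # ~50% rejected at each of two review stages
funded = reviewed * stage_pass_rate * stage_pass_rate  # expected number funded

print(f"Reported success rate: {funded / reviewed:.0%}")                # 25%
print(f"'True' success rate:   {funded / applications_submitted:.0%}")  # 21%
```

The gap between the two figures is exactly the point made above: quoted success rates typically condition on an application having survived triage.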

Tips for Success

Taking all these considerations together – all the layers of the hierarchy of factors in Fig. 1 – there are a number of practical tips for success that seem simple yet are not always followed. Doing these things can have a big impact on success rates:

• Choose your funder and program carefully. Squeezing an ill-fitting idea into the
wrong funder or scheme is unlikely to work.
• Make sure it is in remit and has a chance of being competitive. Look at the
funder’s past portfolio to get an idea of the type and quality of projects previously
funded.
• Write your application specifically for the funder and scheme of choice.
• Consider the broader expectations of the funder, scheme, or expert reviewers.
Don’t just focus on one element such as methodology. Different funders will have
different criteria and will weight them differently.
• You need to convince the peer review experts, external or committee, that the
question is important. This will be highly dependent on the nature of the funder
and the makeup of the expert and peer reviewers. Consider who it is important to
and why. Challenge yourself – is it important or just interesting – the question
may be important but the proposed study might not.
• Cut-and-pasted high-level prevalence or incidence figures are not convincing
on their own. Funders will want to know what difference the proposed trial
will actually make to those that use, deliver, or plan health services and
treatments.
• Remember you will need to convince those outside of your specialty. That will
include trialists and clinicians working in other areas, methodologists and statis-
ticians, patients, and the public.
• Make sure there is a real research gap, so that the proposed trial will add to
what is already known, and that what you are proposing is plausible given
the existing evidence base. The best way of doing this is to base the new
proposal on a systematic review of the existing evidence. If there isn’t one,
do one.
• You need to convince the RFA that you have the right approach to delivering the
trial. This will include methodology but will also be much broader. How feasible
is it? Who are you partnering with? Do you have the right multidisciplinary team?
(clinicians, statisticians, patients, etc.).
• Make sure your sample size is credible and meaningful. Will it be achievable?
Will it change the meta-analysis?
• Consider wider issues around how you will do the research. Consider issues of
transparency, integrity, and openness.

• Consider value for money. Is the answer to the question worth the investment?
How much is the trial costing per participant? The more expensive studies will be
expected to make a bigger difference to society.

Summary and Conclusions

Funders of clinical trials are not just sources of funding; they are actors in their wider research systems, have their own philosophies, values, and objectives, and operate within different political, social, and economic environments. These will all affect their policies and practice and ultimately what applicants need to do to work successfully
with them. Different funding agencies will use a range of funding models depending
on what they are trying to achieve. The decision-making process will vary by funder
and by scheme. It is likely to be based on multiple criteria; all must be considered
(see Table 3). Applications will be reviewed by a range of experts and usually
beyond the field of expertise of the applicant.

Key Facts

• Not all funders of clinical trials are alike; they have their own sources of funding,
stakeholders, philosophies, values, and objectives, and operate within different
political, social, and economic environments. These will all affect their policies and
practice, and ultimately what applicants need to do to work successfully with them.
• Different funding agencies will use a range of funding models depending on what
they are trying to achieve; from open calls for proposals limited only by broad
remit statements through to commissioned calls where the funder specifies the full
research question.
• With nuanced variations, many funders make decisions following a standard
procedure involving internal review, external expert and peer review, and funding
committee review.
• Understanding the political, philosophical, and contextual issues of funding
agencies is important, but there are also some simple practical tips for success
for applicants that are useful across all funders.

Cross-References

▶ Advocacy and Patient Involvement in Clinical Trials

References
Chalmers I, Glasziou P (2009) Avoidable waste in the production and reporting of research evidence. Lancet 374(9683):86–89. https://doi.org/10.1016/S0140-6736(09)60329-9
Clark DR, McGrath PJ, MacDonald N (2007) Members of parliament's knowledge of and attitudes toward health research and funding. CMAJ 177(9):1045–1051. https://doi.org/10.1503/cmaj.070320
Guthrie S, Ghiga I, Wooding S (2017) What do we know about grant peer review in the health sciences? F1000Res 6:1335. https://doi.org/10.12688/f1000research.11917.2
Guthrie S, Ghiga I, Wooding S (2018) What do we know about grant peer review in the health sciences? An updated review of the literature and six case studies. RAND Corporation, Santa Monica. https://www.rand.org/pubs/research_reports/RR1822.html
HMSO (1918) Report of the Machinery of Government Committee under the chairmanship of Viscount Haldane of Cloan. HMSO, London. https://www.civilservant.org.uk/library/1918_Haldane_Report.pdf. Accessed 10 June 2020
HMSO (1971) A framework for Government research and development. HMSO, London
Mazzucato M (2018) The entrepreneurial state, 1st edn. Penguin, London
Nurse P (2015) Nurse review of research councils. GOV.UK. https://www.gov.uk/government/collections/nurse-review-of-research-councils. Accessed 27 June 2019
Obama B (2013) Public papers of the Presidents of the United States: Barack Obama, Book I, p 345. https://www.govinfo.gov/app/details/PPP-2013-book1/PPP-2013-book1-doc-pg342. Accessed 10 June 2020
Research: increasing value and reduce waste when research priorities are set [series] (2014) Lancet 383(9912):156–185, e3–e4. https://doi.org/10.1016/S0140-6736(13)62229-1
World Health Organization (2017) Joint statement on public disclosure of results from clinical trials. https://www.who.int/ictrp/results/jointstatement/en/. Accessed 28 June 2019
27 Financial Compliance in Clinical Trials
Barbara K. Martin

Contents
Introduction ..... 522
CMS Policy Regarding Reimbursement of Costs in Clinical Trials ..... 523
  Categorization of Devices ..... 523
  The Clinical Trial Policy ..... 525
  Coverage with Evidence Development ..... 526
Billing Compliance in Clinical Trials ..... 527
  Coverage Analysis ..... 527
  Qualifying for Medicare Coverage Under CTP ..... 528
  Device Classification and Medicare Coverage ..... 528
  Qualifying for Medicare Coverage with Evidence Development ..... 530
  Identifying Research Charges Billed to CMS ..... 530
  Medicare Advantage Plans ..... 531
Issues in Non-compliance ..... 531
  Subject Remuneration ..... 531
  Waiving of Co-pays ..... 532
  Reimbursement for Subject Injury ..... 533
  Billing Non-compliance ..... 535
Summary: Best Practices for Billing Compliance ..... 536
Key Facts ..... 537
References ..... 538

B. K. Martin (*)
Administrative Director, Research Institute, Penn Medicine Lancaster General Health,
Lancaster, PA, USA
e-mail: [email protected]


Abstract
Financial compliance considerations are an important aspect of the design and
funding of clinical trials. Such research often involves a mixture of sponsor
funding and insurance billing for the clinical services provided in the trial. In
the United States, what can be billed to insurance and what must be paid by a
sponsor are in general determined by the Centers for Medicare & Medicaid
Services (CMS). Other third-party payer policies largely mimic those of CMS.
Medicare reimbursement for clinical trials is determined by the interagency
agreement between the Food and Drug Administration and CMS regarding
investigational devices, the clinical trials policy, and CMS guidance on coverage
with evidence development. Non-compliance in research billing carries risk of
monetary penalty. To ensure compliance, providers and institutions must conduct
coverage analyses to determine if a trial qualifies for CMS coverage and, if it
does, which clinical items and services can be billed to CMS. Claims with items
and services being billed to CMS must be identified with research codes and
modifiers. While these policies and procedures have brought some clarity to
research billing, there are still murky waters that providers and institutions need
to navigate.
The risk from non-compliance is not theoretical. Several cases of large fines to
major research institutions have been well publicized. The imperative for having
a comprehensive program for billing compliance continues to mount, and the cost
of this necessary infrastructure must be part of the calculation of institutional
overhead for clinical research.

Keywords
Billing compliance · Coverage analysis · Clinical Trial Policy · Coverage with
evidence development · Qualifying clinical trials · Waiving of co-pays · Subject
remuneration · Subject injury

Introduction

The design and conduct of an appropriate, informative, and successful clinical trial
of course is multifaceted. The trial must be based on a scientific question worth
answering. It must be ethically sound. The design must give the trial a reasonable
chance to actually answer the question that it is intended to answer. It must be
conducted with rigor and integrity. Also importantly, it must be adequately and
appropriately funded.
The funding of clinical trials differs from and is more complex than the funding of
other research studies. That is, studies that involve clinical services of any type, from
diagnostic testing or monitoring to surgery to administration of drugs, biologics, or
devices, could be funded entirely by a sponsor, but they are more likely to involve a
mixture of sponsor funding and billing to insurance for some or all of the clinical services provided as part of the trial. In the United States, what can be billed to
insurance and what must be paid by a sponsor in general are determined by policy of
the Centers for Medicare & Medicaid Services (CMS). Many third-party payers have
policies and practices that largely mimic those of CMS. However, with CMS policy
come consequences for non-compliance that involve monetary and even criminal
penalties. Therefore, the issue of financial – particularly billing – compliance is now
an important concern in the conduct of clinical trials. This chapter explores the
history of current US policy and the resulting financial compliance considerations
that are now necessary. Acronyms frequently used in this chapter are explained in
Table 1.

CMS Policy Regarding Reimbursement of Costs in Clinical Trials

The standard for CMS reimbursement has always been anchored to the phrase
“reasonable and necessary” from the Social Security Act, which established the
Medicare program. More specifically, Medicare is intended to reimburse for clinical
services and products that are “reasonable and necessary for the diagnosis and
treatment of an illness or injury, or to improve the functioning of a malformed
body member” (42 US Code § 1395y). Experimental treatments generally have
not met this standard for reimbursement, as it has been interpreted to mean that the
service or product must be demonstrated to be safe and effective. However, confu-
sion has long existed around the “routine” testing and treatment that individuals
might receive as part of a clinical trial that they would also receive if they were not
enrolled in a clinical trial.

Categorization of Devices

The first clarification of CMS reimbursement policy came in 1995, with an interagency agreement (FDA 1995) between what was then the Health Care Financing Administration (HCFA) – the precursor to what became CMS in 2001 – and the US
Food and Drug Administration (FDA). In the years preceding the agreement, the
Office of the Inspector General (OIG) was investigating whether hospitals and
providers improperly billed Medicare for the facility and professional fees associated
with the implantation of cardiac devices being tested under investigational device
exemptions (IDEs) and was indeed finding this to be the case (Aaron and Gelband
2000). Device manufacturers claimed that they could not feasibly pay for all the
costs associated with the clinical research necessary to advance the development of
medical devices. Furthermore, FDA recognized that some “experimental” devices
represented refinements of existing technologies, and treatment of patients with
existing technologies would have resulted in similar charges outside of the clinical
trials. Concern that hospitals and providers might discontinue participation in device
trials out of fear of the OIG investigation and potential fines and penalties led to

Table 1 Frequently used acronyms

CED (coverage with evidence development): The mechanism by which the Centers for Medicare & Medicaid Services agrees to cover an item or service not otherwise covered, due to a shortage of adequate evidence that the item or service is reasonable and necessary, with the condition that data are collected on utilization and impact of the item or service as part of a protocol that the agency has reviewed and approved; the data generated are intended to be used to inform future coverage decisions; the agreement to provide coverage is published in a national coverage determination

CMS (Centers for Medicare & Medicaid Services): The federal agency within the US Department of Health and Human Services that oversees and administers the Medicare, Medicaid, and Children's Health Insurance programs; among other things, it determines what items and services these programs cover for their beneficiaries

CSP (coverage with study participation): A type of coverage with evidence development; the mechanism by which the Centers for Medicare & Medicaid Services agrees to cover an item or service not otherwise covered if beneficiaries are receiving the item or service in the context of clinical studies that the agency has reviewed and approved as specifying the process for gathering data and as providing protections and safety measures for beneficiaries

CTP (clinical trials policy): The commonly used name for the national coverage determination stating that the Centers for Medicare & Medicaid Services provides coverage for the costs of routine items or services delivered in the context of qualifying clinical trials

FDA (Food and Drug Administration): The agency within the US Department of Health and Human Services that, among other things, regulates the sale, labeling, and shipment of drugs, biologics, and medical devices

HCFA (Health Care Financing Administration): The former name for the federal agency, within the US government department then known as Health, Education, and Welfare, that oversaw and administered the Medicare and Medicaid programs

IDE (investigational device exemption): The means by which a device manufacturer obtains permission from the Food and Drug Administration to conduct human clinical studies to collect safety and effectiveness data for a new or modified medical device before the device is approved for marketing

IND (investigational new drug [application]): The means by which a pharmaceutical company obtains permission from the Food and Drug Administration to conduct human clinical trials (1) with an experimental new drug before it can be approved for marketing or (2) with an existing (approved) drug being tested for a new indication before a labeling change can be approved

NCD (national coverage determination): A determination by the Centers for Medicare & Medicaid Services as to whether Medicare will pay for an item or service; in the absence of a national coverage determination, an item or service is covered at the discretion of the local Medicare area contractor

OIG (Office of the Inspector General): Specifically, the Office of the Inspector General of the US Department of Health and Human Services (HHS), dedicated to protecting the integrity of HHS programs, combating fraud, waste, and abuse, and improving program efficiency; the majority of its resources go toward oversight of Medicare and Medicaid

consideration by the FDA and HCFA of a means to determine whether some devices
might legitimately be covered by Medicare.
In the interagency agreement between FDA and HCFA regarding reimbursement
of investigational devices, FDA agreed to categorize the clinical investigation of
medical devices to aid HCFA in its reimbursement decisions. Specifically, FDA
would label as Category A those investigations of Class III devices (requiring pre-
market approval) that are innovative and for which the safety and effectiveness of the
device has not been established (i.e., they are experimental). Category B investiga-
tions, on the other hand, would be those that involve devices where the incremental
risk of the device is the primary risk in question (i.e., the underlying questions of
safety and effectiveness have already been resolved). Therefore, devices in Category
B investigations were able to meet the criteria of “reasonable and necessary,” and the
devices and the associated hospital and professional charges could qualify for
reimbursement by HCFA.

The Clinical Trial Policy

Reimbursement for care in non-device trials remained unclear until 2000. Early in
that year, the Institute of Medicine of the National Academy of Sciences released the
report, “Extending Medicare Reimbursement in Clinical Trials” (Aaron and Gelband
2000). The report summarized the state of reimbursement at that time, suggesting
that although Medicare did not have a policy to reimburse for care in clinical trials,
and many private insurers had policies that excluded coverage, a significant propor-
tion of costs of patient care in clinical trials were indeed paid for by insurers. This
was because providers bill for the services, and without any obvious identification of
a beneficiary’s participation in a clinical trial, the insurers were none the wiser. That
current state, though not untenable, left patients and providers with uncertainty
regarding whether costs would be covered. The IOM report recommended an
explicit policy that “Medicare should reimburse routine care for patients in clinical trials in the same way it reimburses routine care for patients not in clinical trials.”
(Aaron and Gelband 2000).
As a result of the IOM report, President Clinton on June 7, 2000, issued an
executive order directing Medicare to create explicit policy and to immediately begin
to reimburse for the costs of routine services provided to participants in clinical trials
(The White House 2000). In response to the executive order, HCFA issued the
national coverage determination (NCD) for Routine Costs in Clinical Trials on
September 19, 2000 (CMS 2000). This NCD is widely referred to as the Clinical
Trial Policy and exists in much the same form today. To state first what the policy
excludes, it does not allow for coverage of investigational items or services them-
selves, unless an item or service is “otherwise covered outside the trial.” Addition-
ally, “items and services provided solely to satisfy data collection and analysis needs
and that are not used in the direct clinical management of the patient” are
not covered. What the policy does require is the coverage of routine costs from
“qualifying” clinical trials. Qualifying clinical trials must have therapeutic intent for
patients with a diagnosed disease, or intent to diagnose a clinical disease. The policy
also outlines seven desirable characteristics of a qualifying clinical trial. HCFA, or
now CMS, has never instituted a method for investigators to certify that their
research meets the seven desirable characteristics. Instead, clinical trials are deemed qualifying if they are federally funded, are conducted under an investigational new drug (IND) application, or meet criteria to be exempt from having an IND. The
policy also provides that if Medicare is billed for a study that is not qualifying, the
providers are liable for the costs and could be investigated for fraud.
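The deemed-qualifying logic is mechanical enough to sketch in code. The following is a minimal Python illustration of the rules as summarized above, using hypothetical field names; it is not a CMS-sanctioned determination tool.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    # Hypothetical fields illustrating the criteria summarized in this chapter
    therapeutic_or_diagnostic_intent: bool  # required of all qualifying trials
    federally_funded: bool                  # e.g., federal agency sponsorship
    under_ind: bool                         # conducted under an IND application
    ind_exempt: bool                        # meets criteria for IND exemption

def is_deemed_qualifying(t: Trial) -> bool:
    """Deemed-qualifying check per the CTP as summarized in this chapter."""
    if not t.therapeutic_or_diagnostic_intent:
        return False  # must treat or diagnose patients with a diagnosed disease
    return t.federally_funded or t.under_ind or t.ind_exempt
```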

Coverage with Evidence Development

The IOM in its report, in addition to advocating for coverage of costs of routine care
in clinical trials, urged HCFA “to use its existing authority to support selected trials
and to assist in the development of new trials” (Aaron and Gelband 2000). That is,
the agency should identify research that is of particular significance to the care of
its beneficiaries and provide reimbursement for more than just routine care.
The Committee pointed to the example of the National Emphysema Treatment
Trial, in which HCFA agreed to pay for lung volume reduction surgery (LVRS)
only when the procedure was performed as part of the trial, which was a randomized
comparison between LVRS and standard medical management of emphysema.
Response to this recommendation of the IOM report took longer. On July 12,
2006, CMS published guidance on “National Coverage Determinations with Data
Collection as a Condition of Coverage” (CMS 2006). The so-called coverage with
evidence development (CED) allows for coverage of a service under
specific conditions. One such condition, more specifically labeled “coverage with
study participation” (CSP), provides for reimbursement of the item or service “only
when provided within a setting in which there is a pre-specified process for gathering
additional data, and in which that process provides additional protections and safety
measures for beneficiaries, such as those present in certain clinical trials.”

The intention is to allow coverage of items and services for which CMS does not find
there to be enough evidence to support their use as “reasonable and necessary” but
for which additional data could help clarify the value of the service. The decision to
allow coverage under the CED mechanism is made as part of the CMS coverage
determination process, and as such, it is open to public comment (CMS 2014b).
When CMS revised the Clinical Trial Policy in July of 2007, it added reference to
coverage of items and services “when provided in a clinical trial that meets the
requirements defined in [a specific] national coverage determination” (CMS 2007).
This version of the CTP has remained unrevised for the subsequent decade and more.

Billing Compliance in Clinical Trials

CMS policies that began to be put in place a couple of decades ago have provided much-needed clarity to reimbursement for services provided in the context of clinical trials.
However, with this clarity has come the requirement for compliance, and the
possibility of penalty for non-compliance. This is a big concern to providers and
institutions that bill CMS and/or receive federal grant funds. This section will
discuss what is entailed in billing compliance in clinical trials.

Coverage Analysis

In the 2000 IOM report, the Committee discussed the status quo in reimbursement
and concluded that much of routine care in clinical trials was indeed already billed
to, and paid for, by insurance, mostly without the payers knowing when their
beneficiaries were in trials. However, because of the lack of clarity in policy, the
possibility remained that either providers or patients could be left holding the bill
after a denial of coverage or when sponsor support did not adequately cover services.
Providers and institutions differed in their billing practices. According to the Com-
mittee, only General Clinical Research Centers (GCRCs), typically funded by the
National Institutes of Health and located within major academic hospitals, seemed to
have a rigorous and consistent approach to billing for tests and services provided in
the context of their clinical trials (Aaron and Gelband 2000). These centers reviewed
all the charges for participants in their studies, often early phase research with
intensive treatment and monitoring, and determined which charges were for
unproven therapies or for tests and services that the participants would not have
had outside the clinical trial. These costs were not billed to CMS or other payers,
while routine charges were. At the time of the IOM report, this practice was
uncommon and mostly confined to GCRCs, which were able to devote substantial
resources to their clinical trial billing.
Today, this practice is considered imperative to a comprehensive system for
billing compliance. Such a system starts first with coverage analysis, the term that
has come to refer to the process of (1) determining if a trial qualifies for coverage
under the CTP, under CED, or as a Category A or B IDE study approved by CMS and (2) analyzing all activities required by the study and whether they will be
supported by study funds or billed to insurance. Then, as was modeled by
GCRCs, all clinical trial charges must be reviewed and triaged for payment
according to the coverage analysis.
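As a rough illustration of how an item-level coverage analysis feeds charge triage, consider the sketch below. The flags and categories are hypothetical simplifications of the CTP rules described in this chapter; a real coverage analysis is far more granular and is governed by the protocol, the clinical trial agreement, and payer rules.

```python
from enum import Enum

class PayTo(Enum):
    SPONSOR = "study/sponsor funds"
    INSURANCE = "Medicare/insurance (routine care)"

def triage_item(item: dict, trial_qualifies: bool) -> PayTo:
    """Assign a protocol-required item or service a payment route."""
    if not trial_qualifies:
        return PayTo.SPONSOR   # no routine-cost coverage without a qualifying trial
    if item["investigational"]:
        return PayTo.SPONSOR   # the investigational item or service itself
    if item["research_only"]:
        return PayTo.SPONSOR   # solely for data collection and analysis needs
    return PayTo.INSURANCE     # routine care, including monitoring for and
                               # treating adverse effects

items = [  # hypothetical protocol items
    {"name": "study drug", "investigational": True, "research_only": False},
    {"name": "CBC for toxicity monitoring", "investigational": False, "research_only": False},
    {"name": "extra PK-only blood draws", "investigational": False, "research_only": True},
]
for item in items:
    print(item["name"], "->", triage_item(item, trial_qualifies=True).value)
```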
Determining coverage for a study is the first challenge in billing compliance, as
discussed below.

Qualifying for Medicare Coverage Under CTP

The interagency agreement and the CTP have done much to clear up uncertainty
regarding clinical trial billing. However, a couple of issues remain unclear, and institutions and providers are still left to make their own determinations of the appropriateness of billing or to consult with their local Medicare area contractors (MACs).
One area that has generated much discussion, particularly in the oncology field, is
that of early phase clinical trials. As mentioned above, to qualify for coverage under
the CTP, a trial must have therapeutic intent for patients with a diagnosed disease or
intent to diagnose a clinical disease. Many phase I trials are designed with safety and
toxicity measures as the primary outcomes of interest. Can such trials be considered
therapeutic in nature, as required under the CTP? Many oncology researchers argue
that these trials do have therapeutic intent; if the drugs under study were not thought
to have the potential to be therapeutic, they would not be under study. Nonetheless, if
the measure of therapeutic intent is the primary outcome of the study for which it is
designed and powered, then phase I studies arguably might not meet this criterion.
A second murky area is that of research on a device, such as a diagnostic tool, that
did not require an IDE, or research on a procedure or technique, such as the use of
surgical intervention instead of medical management. Because these interventional
studies do not involve an IDE, IND, or IND exemption, the rules for covered devices
and qualifying trials do not apply. If the trial has federal sponsorship, it qualifies
under the default of the CTP, but if it does not, the trial can be caught in a dead zone
of no clarity on coverage. Federal sponsorship or CED coverage for a procedure or technique gaining a foothold in practice may be more likely for later phase studies.
However, non-sponsored early phase development may be hampered by virtue of
the lack of ability to bill for an innovative procedure or technique in the context of
a clinical trial.

Device Classification and Medicare Coverage

In the two decades following the interagency agreement that established the
categorization of devices by the FDA, this process in and of itself was not enough
to establish Medicare coverage. That is, mere classification by the FDA did not
guarantee that Medicare would cover routine costs or that it would cover the costs for
a Category B device. Local MACs had to be consulted and pre-authorization sought,
often for each patient enrolled. CMS, with FDA concurrence, subsequently made changes to its regulations regarding coverage of devices and routine costs as part
of IDE studies, with the intent to streamline its coverage determinations. Effective on
January 1, 2015, Medicare coverage determinations for IDE studies were centralized
(CMS 2014a). That is, sponsors are now required to submit IDE protocols to CMS
for a central review process. Studies that are approved for coverage of the investi-
gational device (Category B only) and routine costs are published on the CMS
website.
CMS and FDA further collaborated to revise the definitions of Category A and B
devices, to support CMS’s centralized decision-making on coverage (FDA 2017).
Each category has three similar sub-categories of devices, and whether or not there
are data to support the questions of safety and effectiveness determines the classi-
fication of the device as Category A or B. That is, a new device with no marketing
approvals will be considered Category A if “data on the proposed device or similar
devices do not resolve initial questions of safety and effectiveness” and Category B
if there is available information on the proposed device or similar devices that
supports the proposed device’s safety and effectiveness. If an approved device is
being studied for a new indication, it will be classified as Category A if the
information from the proposed or similar devices related to the previous indication
does not resolve questions of safety and effectiveness for the new indication and
Category B if it does. Finally, a proposed device that has “different technological
characteristics compared to a legally marketed device,” such that the information
from the marketed device cannot resolve the questions of safety and effectiveness, will
be considered Category A, while a device that has similar technological character-
istics would be classified as Category B if the information on the approved device
provides applicable data on safety and effectiveness. The amendment to rules on
device categorization further allows for changes in the categorization – most likely
from A to B but in some instances from B to A – as research on the device
progresses. FDA will categorize a device at the start of an IDE study but can consider
changes to the categorization with amendments or supplements to the study or at the
request of the sponsor.
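Note that the three paired sub-categories all pivot on the same deciding question: whether available data on the proposed or similar devices resolve the relevant questions of safety and effectiveness. A minimal sketch of that decision rule follows, as an illustration of this chapter's summary rather than FDA's actual procedure.

```python
def categorize_device(subcategory: str, data_resolve_questions: bool) -> str:
    """Return 'A' or 'B' per the paired sub-categories described above.

    subcategory is one of (illustrative labels, not FDA terminology):
      'new_device'     - no marketing approvals yet
      'new_indication' - approved device studied for a new indication
      'modified_tech'  - differing vs. similar technological characteristics
    data_resolve_questions: whether data on the proposed or similar devices
      resolve the relevant questions of safety and effectiveness
    """
    assert subcategory in {"new_device", "new_indication", "modified_tech"}
    # In every sub-category pair, resolved questions -> Category B;
    # unresolved questions -> Category A (experimental).
    return "B" if data_resolve_questions else "A"
```

Framed this way, it is also clear why categorization can change as research progresses: new data can flip the deciding question from unresolved to resolved (or, occasionally, the reverse).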
As a consequence of device categorization by FDA and centralized review by
CMS, billing in the device research space may seem more focused and defined.
However, questions about coverage may arise from non-significant risk (NSR)
device studies (FDA 2006) or studies that can be conducted without an IDE (21
CFR § 812.2(c)), as these studies are not reviewed by FDA or approved by CMS.
Rather, one or more IRBs serve as the surrogate for the FDA in reviewing and
approving the study. Such studies may involve, for example, investigations related to
the clinical use of a new imaging approach, or optimal clinical use of an approved
monitoring device. Providers and institutions may need to negotiate with sponsors
and local MACs over payment for approved items and services that are being used
outside of common practice. For monitoring devices, it is not just the cost of the
devices themselves that may be in question but also coverage for medical review of
the data being reported by those devices. Providers and institutions may need to
determine their risk tolerance for practices for which the “reasonable and necessary”
standard could perhaps be challenged.

Qualifying for Medicare Coverage with Evidence Development

When CMS issues a national coverage determination for a specific item or service,
the NCD may limit coverage to certain indications, populations, or providers and
centers with demonstrated expertise in delivery of the item or service. It also may
allow for reimbursement only when the item or service is provided in the context of
a research study, under the coverage with evidence development mechanism. For a
trial to qualify for reimbursement of the item or service under the CED mechanism,
CMS must review and approve the protocol. The items and services that may be
covered under CED are listed on the CMS webpage. The CMS webpage typically
references the NCD for the item or service it is covering under CED. The NCD then
lists the research questions that CMS is interested in having answered and the
research studies that are currently approved by CMS.
It is worth noting that the CED mechanism for coverage pertains only to items
and services that are covered under Medicare parts A and B. The CED mechanism of
coverage cannot be applied to self-administered drugs that fall under Medicare
part D.

Identifying Research Charges Billed to CMS

Charges that are billed to CMS as routine care provided in the context of clinical
trials are to be labeled as such in the claims. This requirement by CMS also has its
roots in the events surrounding the advent of the CTP. President Clinton’s exec-
utive order directed HCFA to establish a tracking system for the charges billed to
and reimbursed by Medicare that were generated in clinical research (The White
House 2000).
First, an International Classification of Diseases (ICD) code is required. The ICD-10 code Z00.6 identifies a charge as part of an “encounter for examination for normal comparison and control in [a] clinical research program.”
Second, the National Clinical Trials (NCT) number assigned by the clinicaltrials.gov registry is required. Conveniently, this registry had already been established by
the Food and Drug Administration Modernization Act of 1997 (FDAMA), devel-
oped in conjunction with the National Institutes of Health, and made publicly
accessible in 2000 (National Library of Medicine). Initially, federally and privately
funded clinical trials conducted under investigational new drug applications were
required to be registered and to provide this information to the public, healthcare
professionals, and researchers. In 2005, the International Committee of Medical
Journal Editors (ICMJE) began to require that authors seeking to publish clinical trial
results provide evidence of registration of the clinical trial. The ICMJE’s interest was
to promote clinical trial registration as a means of addressing the well-documented
bias in the publication and reporting of clinical trial results. The requirement for
registration of clinical trials has since been expanded by the FDA Amendments Act
of 2007 (FDAAA). As a result, this registry provides a comprehensive mechanism
for identifying clinical trials when billing for items and services provided in these
trials.

CMS additionally requires modifiers on charges for outpatient or professional clinical services that are billed to Medicare for a clinical trial (CMS 2008). The modifier
Q0 designates an experimental item or service. Experimental items and services are not
reimbursable under the CTP, so the use of this modifier is essentially limited to charges
for Category B devices and for items and services allowed under CED. The Q1 modifier
is used to designate all other charges for routine items and services.
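Taken together, these requirements amount to attaching three pieces of research metadata to each claim. The sketch below shows the idea with a hypothetical claim structure (not a real claim transaction format); the NCT number shown is a placeholder, not a real trial.

```python
def label_research_charge(charge: dict, nct_number: str,
                          investigational: bool) -> dict:
    """Attach the research identifiers described above to a charge."""
    labeled = dict(charge)
    labeled["diagnosis_code"] = "Z00.6"          # clinical research encounter (ICD-10)
    labeled["clinical_trial_number"] = nct_number
    # Q0 = investigational item/service (e.g., Category B device, CED item);
    # Q1 = routine item/service in a qualifying clinical trial.
    labeled["modifier"] = "Q0" if investigational else "Q1"
    return labeled

claim = label_research_charge(
    {"service": "complete blood count"},  # hypothetical routine service
    nct_number="NCT00000000",             # placeholder identifier
    investigational=False,
)
print(claim)  # includes Z00.6, the NCT number, and modifier Q1
```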
It is not clear if and how Medicare has used this tracking data. However, it is also
apparent that the labeling of claims could allow CMS “to undertake significant data-
mining to compare different institutions’ billing practices for the same research
study” and provides a “powerful mechanism for the government’s billing compli-
ance enforcement” (Meade & Roach LLP 2008).

Medicare Advantage Plans

Medicare Advantage plans pose another challenge to billing in clinical trials. When
Medicare Advantage plans came into being, the interagency agreement between FDA
and CMS regarding coverage of investigational devices was already in place, so the
cost of this benefit was calculated into the capitated payments made by CMS to the
Medicare Advantage plans. Therefore, Medicare Advantage plans are required to
cover costs from device trials that have been approved by CMS for Medicare billing.
Drug studies, on the other hand, get complicated. After the CTP went into effect in
2000, “CMS determined that the cost of covering these new benefits was not included”
in the capitated payments to Advantage plans (CMS 2014c). Therefore, it was decided
that CMS should pay for the covered clinical trial services outside of the capitated
payment rate. This means that, for a Medicare Advantage beneficiary, routine costs in
trials subject to the CTP need to be billed to Medicare Fee-for-Service rather than to the
beneficiary’s Advantage plan. Providers need to have a mechanism to redirect claims
appropriately. Then, the Medicare Advantage plan is required to cover the difference
between its beneficiary’s out-of-pocket costs and those incurred under Medicare Fee-
for-Service (CMS 2013). The capitated payment rates have yet to be adjusted, so this
band-aid solution has remained in place for a couple of decades.
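The resulting routing rule can be sketched as below, under the simplifying assumption that a study is either a CMS-approved IDE device study or a trial covered under the CTP; real routing logic would handle many more cases.

```python
def route_claim(beneficiary_plan: str, study_type: str) -> str:
    """Route a routine-cost research claim; illustrative simplification only.

    beneficiary_plan: 'MA' (Medicare Advantage) or 'FFS' (Fee-for-Service)
    study_type: 'device_ide' (CMS-approved IDE study) or 'ctp' (trial
                covered under the Clinical Trial Policy)
    """
    if beneficiary_plan == "MA" and study_type == "device_ide":
        return "Medicare Advantage plan"  # MA plans must cover approved IDE studies
    if beneficiary_plan == "MA" and study_type == "ctp":
        # Routine CTP costs go to Medicare Fee-for-Service; the MA plan then
        # covers the difference in the beneficiary's out-of-pocket costs.
        return "Medicare Fee-for-Service"
    return "Medicare Fee-for-Service"
```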

Issues in Non-compliance

This section is not intended to be a comprehensive review of laws and regulations, their interpretation, or their enforcement. However, it is intended to highlight issues
that require careful consideration when determining the financial arrangements for
covering the costs of clinical care provided in the context of clinical trials.

Subject Remuneration

Subject remuneration in clinical trials is a generally accepted practice, though some institutional review boards (IRBs) do not allow it and those that do typically want to review the reason for the remuneration, the amount, and the schedule of payment.
This chapter will not cover the larger discussion on the appropriateness of subject
remuneration but will only address the billing compliance issues that are raised in
some circumstances.
Subject remuneration in clinical trials, like any form of gift or payment, is
viewed by the OIG as an inducement to Medicare beneficiaries that could influence
their selection of a particular provider of healthcare services. The OIG, in its
Special Advisory Bulletin dated August 2002, stated that a person who offers
remuneration to Medicare beneficiaries “could be liable for civil money penalties
(CMPs) for up to $10,000 for each wrongful act” (HHS OIG 2002). However, the
Advisory Bulletin allows that providers may offer gifts and remuneration that fit
within five statutory exceptions; subject remuneration in clinical trials is potentially applicable to only one of these: the practices allowed in the “safe harbor” provisions of the federal anti-kickback statute. Payments related to clinical
trials (i.e., payments of industry sponsors to providers, payments of providers to
research subjects) are typically judged against the requirements of the “personal
services and management contracts” safe harbor of the anti-kickback statute (42
CFR § 1001.952(d)). This safe harbor category requires that the payments occur
under a written and signed agreement that details the services to be provided and
the schedule and term of the agreement (which cannot be for less than a year). It
also requires that the compensation for the clinical trial activities is set in advance,
is consistent with fair market value, and doesn’t exceed what is reasonably
necessary for the performance of the activities. Finally, the activities performed
under the agreement cannot involve business promotion, and the compensation for
the activities cannot be determined by the volume of referrals for services paid by
federal healthcare programs. If a clinical trial agreement is so constituted and
executed, providers should not incur liability for penalty for subject remuneration
set out in such an agreement. That said, it is prudent, when determining subject
remuneration, to have standard practices in place for the amount of compensation
for specific things such as extra time, travel, parking, or other expenses incurred by
subjects by virtue of being in the study. Stated in the converse, it is prudent to
avoid paying subjects when their research participation is not requiring much in
the way of time and expenses over and above what they would be investing for
standard care. For example, when a study is merely abstracting data on the results
of routine services, subjects are not incurring additional expenses by virtue of
being in the study, and providing them with remuneration could arguably consti-
tute inducement to receive those routine services as a research subject, and from
the research provider.

Waiving of Co-pays

As providers are likely well aware, CMS considers the waiving of co-pays by
providers to be in violation of the beneficiary inducements statute and the anti-
kickback statute. That is, such waivers are seen as inducements to use services paid for by Medicare or inducement to receive services from a specific provider. Waiving of co-pays in clinical trials is largely seen in the same light, with the additional
concern that it may sway beneficiaries to forego available proven therapies in favor
of an experimental one (HHS OIG 2008). However, leaving subjects with out-of-
pocket expenses that they would not have incurred, but for their participation in a
clinical trial, is also not satisfactory. The IOM Committee, in its 2000 report, put
forth the recommendation that subjects should not incur expenses for participation,
over and above what might be expected for standard care, but recognized that, while
this may be a guiding principle, it is not enforceable (Aaron and Gelband 2000).
Indeed, not only is it unenforceable but in some instances is very difficult to achieve
within the confines of current laws and their interpretations.
The OIG has issued a series of opinions in specific cases in which it has declined
to impose sanctions for the payment of co-pays. The OIG is clear that these opinions
are relevant to the specific cases only, but they are informative on the thinking of the
Department of Health and Human Services. For example, in one case of a trial of
oxygen therapy supported by the National Heart, Lung, and Blood Institute, inves-
tigators made the assertion that cost-sharing would result in the intervention group
having more expenses than the control group and could decrease compliance. The
OIG decided that the trial did not present risk of fraud and abuse, at least in part
because it was co-sponsored by CMS and was not a commercial, product-oriented
study (HHS OIG 2008). The OIG explicitly stated that “commercial or private
studies pose significantly different risks under the fraud and abuse authorities.” In
another opinion, the OIG declined to impose sanctions when the payment of co-pays
by Medicare beneficiaries would have unblinded the research subjects in a CED
study as to whether they were in the active treatment or sham procedure group (HHS
OIG 2015). In a third case, a trial sponsored by the National Cancer Institute, the
OIG concurred that payment of co-pays by subjects could create an economic barrier
to participation in a cancer prevention trial in people with HIV and could skew the
research population toward a higher and less representative socioeconomic status
(HHS OIG 2016). These instances are all good examples in which the requirement
for subjects to pay co-pays for care received as part of a clinical trial could affect the
feasibility and/or scientific validity of the study, but it is clear that unless the OIG
issues an opinion in a particular case, waiving of co-pays in clinical trials creates risk
for the investigators.

Reimbursement for Subject Injury

Subject injury is a somewhat sticky issue in clinical research. Let us first consider
clinical trials covered under the CTP. The CTP allows for billing of services related
to monitoring for, preventing, and treating adverse effects related to the investiga-
tional therapy. So, for example, in an oncology chemotherapy trial, treatment of
nausea and laboratory tests to monitor blood counts are billable. This seems quite
reasonable when medical monitoring and management of adverse effects are to
be expected with most therapies. It also is quite reasonable when it may be difficult to determine whether a clinical event is related to an investigational therapy or to the patient's larger clinical circumstances. However, there may be instances in which
patients clearly suffer significant subject injury from an investigational therapy. The
CTP provision for coverage of adverse effects was not intended to release industry
sponsors from liability for their products or to transfer catastrophic costs to insurers.
Additionally, it again is important to remember that when insurers are billed, in this
case for adverse effects of therapy, subjects frequently may share in those costs
through co-pays or deductibles. Therefore, it is important to carefully think through
subject injury costs and who should be responsible for them. Industry sponsors
historically have pledged to cover costs not covered by insurers, but this has proven
problematic in a couple of regards.
Again, co-pays are at issue. Even in the case of treatment of adverse effects
resulting from the study intervention, the waiving of co-pays could generate risk of
penalty. A second issue is that of Medicare Secondary Payer (MSP) rule, which
defines the circumstances under which Medicare is not obligated to pay until a
primary insurer has paid, or because a primary insurer can reasonably be expected to
pay for a covered item or service (42 US Code § 1395y). It has long been debated
whether the MSP is invoked in so-called conditional payment clauses in clinical trial
agreements; that is, does it violate MSP rules if a sponsor, in a clinical trial contract,
agrees to pay for subject injury costs in the event that such costs are not covered by
insurers? CMS in 2010 announced a policy for sponsor payments in clinical trials
that stopped short of answering this question. The policy states that when sponsors
pay for subject injury, they are acting as a liability insurer. It also states that such
payments must be reported to CMS, which allows CMS to ensure that it has not paid
for the same items and services. The naming of sponsors as liability insurers in
clinical trials has generated discomfort among providers about contractual provisions to
bill Medicare when there is a possibility of payment by the sponsor, i.e., another
insurer. In the absence of definitive policy by Medicare, sponsors seem to
recognize that to help providers mitigate compliance risk, they may need to accept
financial obligations for subject injury. Some sponsors have attempted to limit the
financial obligations by carving out subject injury payment for beneficiaries of
federal programs only, as there are no potential legal implications for billing private
insurers when there is a possibility of sponsor payment. However, such an approach
puts an additional compliance burden on providers, who must modify their clinical
trial billing practices based on patients’ source of insurance. Perhaps more ethically
unsatisfactory, it creates a situation in which the sponsor pays in whole for some
patients’ medical costs while other patients experience cost-sharing with their
insurers. In recognition of the above, it is the current recommendation of Model
Agreements & Guidelines International (MAGI), an organization of industry players
with the mission to standardize best practices for clinical operations, business, and
regulatory compliance, to avoid such carve-outs in the contract and to provide for
sponsor payment for subject injury. As pointed out by Meade & Roach, LLP, a law
firm that advises on research billing compliance, “This
arguably defers to the clinical trial agreement to define when – and presumably under
what myriad of circumstances – the sponsor will pay for complications and injury
related to the clinical trial. In essence, it becomes a matter of contract law as to
what . . . the sponsor will consider to be complications.” (Meade & Roach LLP 2010).
In a return to where we started in this section, it is important to note that the above
issues relate to clinical trials falling under the CTP. For non-qualifying studies,
device studies, and CED studies, there is no clear guidance on the billing of services
related to monitoring for, preventing, and treating adverse effects related to the
investigational therapy. Sponsors and providers are left to negotiate this space.

Billing Non-compliance

By far the issue that has gotten institutions in the most trouble has been the
inappropriate or double billing of Medicare and Medicaid for services provided in
the context of clinical research. There are a few well-known cases of penalties
imposed on major medical centers for research billing found to be in violation of
the False Claims Act (31 US Code § 3729). An early case is that of the University of
Alabama at Birmingham. The US Department of Justice announced in April of 2005
that it had reached an agreement with the university to pay $3.39 million to settle
allegations related to its research billing practices (DOJ 2005). The investigation of
the institution resulted from two lawsuits brought by whistleblowers – a former
physician and a former compliance officer at the medical school. It was alleged that
the university “unlawfully billed Medicare for clinical research trials that were also
billed to the sponsor of research grants.” In December of that same year, Rush
University Medical Center announced that it had voluntarily disclosed billing errors
to the federal government (RUMC 2005). Under the False Claims Act, the
government can impose fines up to three times the amount of the false claims.
Rush, because of its self-disclosure and cooperation with the investigation, was
only fined 50% of the amount of the false claims. With these penalties and restitution
of the claims, the fine paid by Rush reportedly totaled about $1 million, substantially
less than that paid by the University of Alabama at Birmingham.
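
To make the penalty arithmetic concrete, the following minimal Python sketch contrasts the statutory ceiling (up to treble the false claims, plus restitution of the claims) with a Rush-style outcome in which self-disclosure and cooperation reduce the fine to 50% of the claims. The function and the hypothetical claims amount are illustrative assumptions; actual settlements are negotiated and depend on many factors.

```python
# Illustrative sketch of False Claims Act exposure; not a model of how
# settlements are actually computed. The amounts and the self-disclosure
# discount are assumptions drawn from the cases described above.

def estimated_liability(false_claims: float, self_disclosed: bool = False) -> float:
    """Return restitution of the claims plus the fine.

    The fine is up to treble the false claims or, as in the Rush case,
    50% of the claims when the institution self-discloses and cooperates.
    """
    fine = false_claims * (0.5 if self_disclosed else 3.0)
    return false_claims + fine

# A hypothetical $500,000 in false claims:
print(estimated_liability(500_000))                       # 2000000.0 (ceiling)
print(estimated_liability(500_000, self_disclosed=True))  # 750000.0
```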
Five years later, the Tenet HealthSystem/USC Norris Cancer Center agreed to pay
$1.9 million (HHS OIG 2010). The health system was already operating under a 5-
year corporate integrity agreement with the OIG as part of the resolution of a wide
range of investigated fraudulent activities. Under the disclosure requirements of the
agreement, the health system revealed that it submitted improper claims for “(1)
items or services that were paid for by clinical research sponsors or grants under
which the clinical research was conducted; (2) items or services intended to be free
of charge in the research informed consent; (3) items or services that were for
research purposes only and not for the clinical management of the patient; and/or
(4) items or services that were otherwise not covered under the Centers for Medicare
& Medicaid Services (CMS) Clinical Trial Policy.” The fine for its research billing
practices paled in comparison to the more than $900 million paid by the health
system in 2006 to settle its other billing liabilities.
In 2013, Emory University admitted to overbilling Medicare and Medicaid in
clinical trials conducted at its Winship Cancer Institute (DOJ 2013). The case was
brought to light by a former research finance manager at the university. Emory
University agreed to pay $1.5 million to settle its claims related to billing for services
for which clinical trial sponsors either had paid or had agreed to pay.
To put these cases into perspective, one can compare them to the overall
recoveries by the federal government under the False Claims Act. In 2018, the
federal government collected $2.5 billion in such settlements from the larger
healthcare sector, the ninth consecutive year that this amount exceeded $2 billion
(DOJ 2018). In this light, non-compliance in clinical trial billing is a small fraction
of the cases pursued by the Department of Justice. However, for premier research
institutions, it is not only their clinical revenue but also their research grant
revenue that could be put in jeopardy by research billing non-compliance. As
these institutions have increasingly joined forces with community healthcare
systems and are expanding their research bases, the imperative for billing compli-
ance continues to mount.

Summary: Best Practices for Billing Compliance

In summary, this chapter has discussed the evolution of the rules regarding financial
compliance in clinical trials. The Institute of Medicine report set the stage for the
clinical trials policy, and the GCRCs referenced in the report were leaders in
establishing best billing compliance practices.
Today, a comprehensive program for financial compliance must include the
below elements.

• Determination of Qualification: Every research study that involves interaction of
a healthcare provider with a patient must be examined to determine if the study
meets the criteria of a qualifying clinical trial.
• Coverage Analysis: Every item, service, and activity that is required by the
clinical trial protocol must be analyzed to determine if it is billable or not. This
determination should be documented and referenced for billing review.
• Review and Comparison of Contract and Consent: The contract should be
examined carefully for what the sponsor is obligated to cover, especially
regarding subject injury. Any language that could invoke the Medicare as
Secondary Payer rule is best avoided. The consent must be reviewed for any
promises of coverage to subjects and should be consistent with the terms of the
contract.
• Billing Review: Charges that are incurred by participants in clinical research
studies should be reviewed in a systematic fashion to ensure that they are being
triaged correctly. There are a number of philosophies and mechanisms to achieve
this that run the gamut from a manual, charge-by-charge comparison against the
coverage analysis of 100% of participants’ items and services while enrolled in a
study to sophisticated, automated methods to find research charges and label them
with the appropriate codes and modifiers. Whatever the mechanism for billing
review, documentation that a charge was reviewed is necessary for quality control
and auditing of the process (a minimal sketch of such triage follows this list).
• Auditing: The loop is closed by audit of at least a subset of study subjects to
ensure that all expected research charges were identified and billed as appropriate
to the sponsor or the subject’s insurer. It can lead to further auditing if errors are
detected that could be more widespread or systemic.
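
As a concrete illustration of the billing review step, the following minimal Python sketch triages hypothetical charges against a simplified coverage analysis. The item names, routing labels, and protocol are invented for the example; production systems work from integrated billing and clinical trial management system data rather than in-memory dictionaries.

```python
# A minimal sketch of charge triage against a coverage analysis. All items
# and routings here are hypothetical.

COVERAGE_ANALYSIS = {
    "office_visit":   "insurer",  # routine care, billable with research modifiers
    "study_drug":     "sponsor",  # paid by the sponsor, never billed to insurance
    "research_mri":   "sponsor",  # research-only, not clinical management
    "cbc_monitoring": "insurer",  # monitoring for adverse effects under the CTP
}

def triage_charge(item: str) -> str:
    """Route a charge per the coverage analysis; hold unknown items for
    manual review rather than guessing, and document every decision."""
    return COVERAGE_ANALYSIS.get(item, "hold_for_review")

for charge in ["office_visit", "study_drug", "genetic_panel"]:
    print(charge, "->", triage_charge(charge))
# office_visit -> insurer
# study_drug -> sponsor
# genetic_panel -> hold_for_review
```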

Such a comprehensive program requires infrastructure and resources to support it.
The costs of the software to track research finances, the IT resources to customize,
implement, and integrate research-specific tools, and the personnel to conduct
coverage analyses, billing reviews, and audits are not insignificant. These costs
typically do not generate a return on investment; they are the costs of doing business
and of being compliant while doing so. If not covered as a direct cost in research
budgets, these costs need to be part of the calculation of institutional overhead, as
they are a necessary component of research infrastructure.

Key Facts

• In the United States, what can be billed to insurance and what must be paid by a
sponsor are largely determined by CMS.
• Medicare reimbursement for device trials is governed by an interagency agree-
ment between FDA and CMS, which has been in place since 1995.
• Medicare reimbursement for drug trials is governed by the Clinical Trial Policy,
which went into effect in 2000.
• CMS uses the mechanism of Coverage with Evidence Development to allow
coverage of items and services for which CMS does not find there to be enough
evidence to support their use as “reasonable and necessary” but for which
additional data could help clarify the value of the service.
• Research institutions need to have processes for determining at the start of a trial
if it qualifies for CMS coverage and for conducting coverage analyses to deter-
mine which clinical items and services are to be paid by the research study and
which can be billed to CMS.
• Charges that are billed to CMS as routine care provided in the context of clinical
trials are to be labeled as such through the use of modifiers and codes on the
claim.
• Subject remuneration, payment of co-pays, and reimbursement for subject injury
are three complicated issues for which special care must be taken to avoid
noncompliance.
• Noncompliance in research billing carries risk of monetary penalty, as set forth in
the False Claims Act.
• There have been four cases of large fines to research institutions for billing
noncompliance.
• A comprehensive program to ensure financial compliance in clinical trials
requires investment in research infrastructure by institutions.

References
Aaron HJ, Gelband H (eds) (2000) Committee on routine patient care costs in clinical trials for
medicare beneficiaries, Institute of medicine. Extending medicare reimbursement in clinical
trials. National Academy Press, Washington, DC
Centers for Medicare and Medicaid Services. National coverage determination (NCD) for routine
costs in clinical trials (310.1). Publication 100-3, Version 1. Effective date September 19, 2000
Centers for Medicare and Medicaid Services. Guidance for the public, industry, and CMS staff:
national coverage determinations with data collection as a condition of coverage: coverage with
evidence development. Issued July 12, 2006
Centers for Medicare and Medicaid Services. National coverage determination (NCD) for routine
costs in clinical trials (310.1). Publication 100-3, Version 2. Effective date July 9, 2007
Centers for Medicare and Medicaid Services. CMS manual system: medicare claims processing.
New HCPCS Modifiers when Billing for Patient Care in Clinical Research Studies Publication
100-04. Effective date January 1, 2008
Centers for Medicare and Medicaid Services. CMS manual system: medicare managed care.
Chapter 4, Benefits and beneficiary protections. Publication 100-16. Effective date August 23,
2013
Centers for Medicare and Medicaid Services. CMS manual system: medicare benefit policy.
Publication 100-02. November 6, 2014a
Centers for Medicare and Medicaid Services. Guidance for the public, industry, and CMS staff:
coverage with evidence development. Issued November 20, 2014b
Centers for Medicare and Medicaid Services. Medicare managed care manual. Chapter 8, Payments
to medicare advantage organizations. Revision 118. September 19, 2014c
Department of Health and Human Services, Food and Drug Administration, Center for Devices and
Radiological Health. Information sheet guidance for IRBs, clinical investigators, and sponsors:
significant risk and nonsignificant risk medical device studies. January 2006
Department of Health and Human Services, Food and Drug Administration, Center for Devices and
Radiological Health. FDA categorization of investigational device exemption (IDE) devices to
assist the centers for medicare and medicaid services (CMS) with coverage decisions: guidance
for sponsors, clinical investigators, industry, institutional review boards, and Food and Drug
Administration staff. December 5, 2017
Department of Health and Human Services, Food and Drug Administration, Office of Device
Evaluation. Implementation of the FDA/HCFA interagency agreement regarding reimbursement
categorization of investigational devices. IDE guidance memorandum #95-2. September
15, 1995
Department of Health and Human Services, Office of the Inspector General. Special advisory
bulletin. Offering gifts and other inducements to beneficiaries. August 2002
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
08-11. September 17, 2008
Department of Health and Human Services, Office of the Inspector General. Semiannual report to
Congress, Part III: legal and investigative activities related to medicare and medicaid. Fall 2010
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
15-07. May 28, 2015
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
16-13. December 13, 2016
Department of Justice. Press release: University of Alabama-Birmingham will pay U.S. $3.39
Million to resolve false billing allegations. April 14, 2005
Department of Justice, Office of Public Affairs. Press release: justice department recovers over $2.8
billion from false claims act cases in fiscal year 2018. December 21, 2018
Department of Justice, U.S. Attorney’s Office, Northern District of Georgia. Press release: Emory
University to pay $1.5 million to settle false claims act investigation. August 28, 2013
International Committee of Medical Journal Editors. Recommendations: publishing & editorial
issues: clinical trials. http://www.icmje.org/recommendations/browse/publishing-and-editorial-
issues/clinical-trial-registration.html. Accessed 29 May 2019
Meade & Roach LLP. Compliance advisory: new CMS research modifier rules. http://meaderoach.
com/advisory_newsletters.html. April 2008. Accessed 29 May 2019
Meade & Roach LLP. Compliance advisory: CMS issues clinical trials MSP instruction.
http://meaderoach.com/advisory_newsletters.html. July 2010. Accessed 29 May 2019
Model Agreements & Guidelines International. Clinical trial agreement template (Annotated).
https://www.magiworld.org/Standards?M=1&PK=47. Accessed 29 May 2019
National Library of Medicine. About site: history, policies, and laws. https://clinicaltrials.gov/ct2/
about-site/history. Last reviewed September 2018. Accessed 29 May 2019
Rush University Medical Center. News release. Rush settlement with government may help clarify
billing requirements for medicare patients in research studies: sets model for provider compli-
ance with national coverage decision on clinical trials. December 8, 2005
The White House, Office of the Press Secretary. President Clinton takes new action to encourage
participation in clinical trials: medicare will reimburse for all routine patient care costs for those
in clinical trials. June 7, 2000

Statutes and Regulations

21 CFR § 812.2(c)
42 CFR § 1001.952(d)
31 US Code § 3729
42 US Code § 1395y

28 Financial Conflicts of Interest in Clinical Trials
Julie D. Gottlieb

Contents
Definition
Introduction
Risks of Financial Conflicts of Interest: Reality and Appearance
Regulations and Other Important Standards
Developing and Implementing COI Policies for Clinical Trials
Disclosure
Review for FCOI in Clinical Trials
An Initial Consideration: Thresholds for Participation
Study Design and Planning
Study Conduct
Publication/Reporting
Documenting and Communicating FCOI Decisions: The Management Plan
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Because clinical trials are the gold standard for evaluating the safety and
efficacy of drugs and medical devices, they should be conducted as safely and
objectively as possible. Objectivity can be affected by study design and conduct,
but additional risks to objectivity – and possible risks to safety – may arise from
financial conflicts of interest (FCOIs). The explosion in recent decades of
financial relationships between medical researchers and the pharmaceutical and
medical device industry means it is very likely that at least some members of most
clinical trial study teams will have a financial tie with the sponsor or manufacturer
of the study drug or device. Payments of various types are ubiquitous. The data on
payments to physicians published by the Center for Medicare and Medicaid
Services (CMS) since 2014 offer evidence that approximately half of all
physicians have financial ties with industry (Tringale et al. 2017). Social science
studies have demonstrated that financial interests can give rise to conscious or
unconscious bias in favor of the product being tested or its manufacturer. In
addition, when institutions that are clinical trial sites have financial interests
related to a clinical trial, the institutional conflicts of interest (institutional
COIs) can create risks to the objectivity or safety of the study. So close attention
to conflict of interest issues is essential for protecting the integrity of clinical
trials. This chapter will examine the nature of the risks at issue, key regulations
and standards for addressing FCOIs in research, key elements of FCOI policies,
and approaches to evaluating and addressing FCOIs.

Keywords
COI committee · Data safety monitoring board · Disclosures · Financial conflicts
of interest · Institutional conflicts of interest · Management plan · Technology
licensing

Definition

A common definition of a conflict of interest is “a set of circumstances that creates a
risk that professional judgment or actions regarding a primary interest will be unduly
influenced by a secondary interest” (Lo and Bero 2017).

Introduction

The primary goals of those who conduct clinical trials must be to carry out safe and
objective research that has the potential to advance the science of human health. If
the safety of research participants (and future patients) and the objectivity of research
are of paramount importance, the potential risks of conflicts of interest must be
addressed.
Conflicts of interest (COIs) are ubiquitous in personal and professional life. While
there have been calls to consider the risks of intellectual COIs (Ioannidis and
Trepanowski 2018), such as strongly held personal beliefs, FCOIs in medical
research have received the most attention in large part because of the widespread
practice of industry payments to researchers and the incentives associated with
inventing, patenting, and licensing new technologies and starting new companies
in the biomedical field. And financial interests, unlike potentially competing intel-
lectual interests, can be measured.
Funders and consumers of academic research and the academic research com-
munity itself have intensified their focus on FCOIs because of the potential they have
to affect research, education, and clinical care. In the wake of various exposés and
investigative reporting in the media (Stolberg 2019), scrutiny of the impact of the
financial interests of physicians and biomedical researchers increased throughout the
early 2000s. National associations have issued recommendations and guidelines for
addressing the risks that FCOIs pose in education, clinical care, and to some extent
basic and animal research. In academia, most of the attention, including regulation
and association standards, has centered on FCOIs in clinical research because the
welfare of human research participants is at stake and because research results
directly impact medical care and treatment.
This chapter will address FCOIs in clinical trials, including the financial interests
of investigators and the institutions where research is often conducted. Not all
financial interests create the potential for conflict of interest or the appearance of
conflict of interest. If a physician who conducts clinical trials in interventional
cardiology owns stock in an energy company, for example, the financial interest is
unlikely to create the potential for conscious or unconscious bias related to the value
of a particular intervention. However, if the researcher owns stock in a company that
sells cardiac stents, that is likely to create an FCOI with her research. So one must
define the types of financial interests that have the potential to affect the objectivity
and safety of clinical research. Under Public Health Service (PHS) regulation on
FCOI, when a grantee institution is conducting research, an investigator must
disclose to the institution any financial interest “that reasonably appears to be related
to the Investigator’s institutional responsibilities.” Institutions must review the
disclosed interests to identify any “[F]inancial conflict of interest (FCOI),” which
is defined as “a significant financial interest that could directly and significantly
affect the design, conduct, or reporting of PHS-funded research” (eCFR – Code of
Federal Regulations 2019). FCOIs must be addressed with specific management
steps. The key roles and responsibilities of the parties involved in clinical research –
investigators, institutions, the committees that evaluate potential FCOIs, sponsors,
and journals – are set forth in Table 1.
Those conducting or administering clinical trials at hospitals, medical schools, or
research organizations that themselves may have financial interests in biomedical
research also must deal with institutional COI (Cigarroa et al. 2018). Although there
are no US regulations governing institutional COIs in research, many research
organizations include institutional COI in their FCOI policies. When the institution
where the research is being conducted has a financial interest in the outcome of
research, or its senior leaders have ties to companies with financial interests in a
study, there may be an actual or apparent institutional COI. For example, an
institution that is conducting research on a novel bone harvesting device that it
licensed to a start-up company has an institutional COI with the trial testing the
device. Of course, it is an institution’s agents (deans, department chairs, etc.) who act
on behalf of the institution. The risks arise from a concern that in a conscious or
unconscious effort to maximize the value of the product or manufacturer in which
the institution has a stake, institutional agents may make decisions that conflict with
the safety and objectivity of the research project. Another source of concern arises
from the personal financial interests of institutional officials when those interests are
related to the research they oversee. For instance, even if a research dean or hospital

Table 1 Roles and responsibilities of investigators, institutions, IRB and/or COI committees,
sponsors, and journals in the disclosure, review, and management of financial conflicts of interest

Investigator:
• Disclose personal financial interests to institution and/or IRB as required
• Comply with FCOI management plan as required
• Follow disclosure requirements of institution (e.g., to patients, study team members, sponsor)
and of journals

Institution:
• Establish, publicize, implement, and enforce FCOI policy
• Disclose institutional COIs for review
• Monitor compliance with management plans
• Report FCOIs to regulatory bodies as required
• Make FCOI information related to PHS-sponsored studies publicly available per PHS regulations

IRB and/or COI Committee:
• Review disclosures associated with specific clinical trials
• Develop and communicate management plan
• Identify any failures to comply with management plan

Sponsor:
• For FDA-regulated trials, collect disclosures from investigators, manage FCOIs, and report
FCOIs to FDA
• If a covered entity, report payments to physicians to the Centers for Medicare and Medicaid
Services under the Physician Payments Sunshine Act

Journals:
• Set policies regarding permissible COIs; require disclosure to journal and in publications

president is not directly involved in a particular study but has stock in the manufac-
turer of the study drug or device, she may – consciously or unconsciously – make
decisions affecting the safety and objectivity of the study. Even decisions that are not
intended to impact a study may be viewed as biased if the decision maker has a
related financial interest and that interest is not disclosed or steps are not taken to
protect the study.

Risks of Financial Conflicts of Interest: Reality and Appearance

There is a growing body of literature demonstrating strong associations between
financial interests and lapses in fealty to professional standards, including in medical practice
and research (Dana and Loewenstein 2003). Social science research has shown that
even modest gifts create an expectation of reciprocity, and studies have demonstrated
that there are strong associations between investigators’ ties with industry and
positive outcomes of related research (Ahn et al. 2017; Lundh and Bero 2017).
Research also indicates that many professionals, including physicians, tend to think
they are not susceptible to bias and believe they are less vulnerable to the influence of
payments and gifts than their colleagues (Cain 2008).
It is important to acknowledge, however, that even the appearance of a
researcher’s or an institution’s financial conflict of interest with a study may call
into question the integrity of a research project. This is as important as clearly
established causation or association. Disclosure of – or, more significantly, the
failure to disclose – ties between a researcher or research institution and industry
as they relate to clinical research can lead to skepticism on the part of a scientific
audience, negative news reports about the research, and doubt on the part of society
that medical research is being carried out honestly and is worthy of public support.
When patients suffer a bad outcome while participating in a clinical trial, the
presence of significant financial interests may strengthen an argument that financial
interests played a role in the harm to research participants. There are well-known
cases in which plaintiffs’ attorneys have linked investigators’ (and their institutions’)
financial interests with the harm suffered by research subjects in clinical trials
(Wilson 2009).

Regulations and Other Important Standards

Regulations on conflict of interest in research are varied and inconsistent with one
another. There are different standards and recommendations issued by accrediting
bodies, journals, professional societies, and national associations (Gottlieb 2015).
Institutional officials should be familiar with the array of relevant regulations and
standards when developing conflict of interest policies. A brief overview of these
standards follows, and additional detail appears elsewhere in this chapter.
Clinical research is subject to the Public Health Service (PHS) regulations on
objectivity in research if PHS support is involved. The Food and Drug
Administration (FDA) regulation applies (CFR – Code of Federal Regulations
Title 21 2019) if the trial data are to be used in marketing applications for FDA
approval of a drug, device, or biologic product. Separate standards are maintained by
the Association for the Accreditation of Human Research Protection Programs
(AAHRPP), which accredits Institutional Review Boards (IRBs), national
associations such as the Association of American Medical Colleges (AAMC) and
the Association of American Universities (AAU), journals, including those that
adhere to the standards set by the International Committee of Medical Journal
Editors (ICMJE), and professional societies such as the American Society of Clinical
Oncology (ASCO).
Public Health Service. The 1995 PHS regulation on FCOI (titled “Promoting
Objectivity in Research”) was substantially revised in 2011, and the revised
version went into effect in 2012. The regulation covers research supported by PHS
agencies (including, among others, the National Institutes of Health, Centers for
Disease Control and Prevention, FDA, and the Centers for Medicare and Medicaid
Services). It outlines the types of financial interests that investigators must report
to a recipient institution that applies for or receives federal research support; how
the disclosed interests must be reviewed for potential FCOI with federally funded
research projects; and the range of possible approaches to managing FCOIs. While
the regulation does not distinguish among different types of research and its focus
is on protecting research objectivity rather than research participant safety, it
acknowledges that FCOIs in research involving human participants carry the
greatest potential risk. Some institutions responded to the 2012 revisions by
applying the federal standards to all research regardless of funding source in
order to have a single, consistent standard for COI review. Others opted to apply
the regulation only to research with federal support, potentially limiting their
administrative burden but creating dual standards for federally funded research
and research with other sources of support. The 2012 revision lowered the financial
“floor” for annual income that must be reported from $10,000 to $5,000 and
expanded the requirements for reporting equity ownership. Other report-
able interests include royalties from intellectual property, honoraria, consulting
fees, and equity in publicly traded and privately held companies. Exceptions to
reporting requirements include income from US institutions of higher education
and service on certain US federal, state, and local government advisory panels.
However, income from non-excluded nonprofit organizations such as foundations
and foreign institutions of higher education must be disclosed. There also is a
requirement that institutions solicit disclosures of payment or reimbursement for
travel. Specified details about FCOIs that institutions have identified and managed
must be reported to the awarding agency and must be publicly disclosed on a
regularly updated website or upon request.
The PHS FCOI regulation does not require that research support from industry to
the recipient institution be disclosed and reviewed for potential FCOI. FDA COI
disclosure requirements do not include industry support for the “covered” study
(although they do include funds the sponsor may provide the institution that are not
directly supporting the covered study). However, there are reasons for institutional
policies to consider the role of industry research support, whether financial or in-
kind, in the course of their FCOI reviews. Some federally supported research pro-
jects also involve support from industry. Many institutions apply their FCOI polices
to research that is not federally funded but may be supported by industry (or
foundations that are closely tied to a biomedical company with an interest in the
research). Journals and professional societies typically require disclosure of research
support from industry. Institutions whose FCOI policies do not include research
supported by industry take the position that while grants or sponsored research funds
awarded to institutions may support a portion of an investigator’s salary, the funding
is administered by the institution, and it supports a variety of costs of conducting
the research. Those that do include industry support tend to view the sources of
support for an investigator’s salary and the research costs as potential sources of bias.
Finally, research grants made directly to investigators who are not part of academic
medical centers – for example, those in private medical practices – are more direct
personal payments to investigators and may be significant sources of conscious or
unconscious bias. In sum, support for the costs of research is treated heterogeneously
in the clinical research community.
Food and Drug Administration. The FDA’s regulation on conflict of interest for
clinical investigators applies to studies being used to support marketing applications
for drugs, medical devices, and biologic products. Its purpose is to identify situations
in which investigators involved in generating the data have FCOIs exceeding the
thresholds set by FDA, address the risk of bias, and thereby enhance the integrity of
the approval process for marketing of drugs and medical devices. The sponsor,
whether a company or a sponsor-investigator, must collect investigators’ financial
interest information, apply conflict of interest management measures to mitigate the
risks to data associated with FCOIs, and report the information to the FDA. FDA
regulation does not prohibit or restrict investigators with FCOIs from participating in
research, including as authors of resulting publications. However, in evaluating the
potential for bias associated with the data collected by the conflicted investigator, the
FDA takes into account the COI management measures applied to the study as well
as study design elements such as blinding, objective endpoints, and measurement of
endpoints by someone other than a conflicted investigator.
Some institutions have adopted a policy that if an investigator on an FDA-
regulated study has financial interests exceeding certain thresholds, that individual
may not serve as the sponsor-investigator for the study (and that significant financial
interests also disqualify non-investigators from acting as sponsors of FDA-regulated
studies).
World Medical Association. The Declaration of Helsinki, which sets forth
ethical principles for medical research involving human participants, requires that
prospective research participants be informed of any relevant conflicts of interest on
the part of investigators (World Medical Association Declaration of Helsinki 2013).
In 2004, the Office for Human Research Protections (OHRP) issued a guidance
document addressing FCOIs in human subject research conducted under the HHS or
FDA regulations. Although it does not have the force of regulation, the document
suggests approaches for investigators, institutions, and IRBs to consider in
addressing potential FCOIs. The guidance extends to IRB operations and cites the
regulation prohibiting IRB members with conflicting interests in a project from
participating in its initial or continuing review (HHS.gov 2016).
The Association for the Accreditation of Human Research Protection Programs
(AAHRPP), which has accredited a substantial majority of human research protec-
tion programs (e.g., Institutional Review Boards) in US research-intensive univer-
sities and medical schools, includes investigator and institutional conflict of interest
as elements in its evaluation process. So organizations seeking this important
credential for their human research protection programs need to address COI and
institutional COI policies as they prepare for the accreditation process.
The Association of American Medical Colleges (AAMC) issued guidance docu-
ments for dealing with COIs in human subject research in the early 2000s. The most
notable recommendation is that individual FCOIs in human subject research that
exceed certain thresholds should be subject to a “rebuttable presumption,” i.e., a
presumption that investigators with those interests may not participate in the
relevant human research project unless compelling circumstances justify an
exception. The AAMC’s recommendations challenge institu-
tions to set robust FCOI standards for their human subject research programs.
The International Committee of Medical Journal Editors (ICMJE) has issued a
series of recommendations, including recommendations for addressing conflicts of
interest (ICMJE | Recommendations | Author Responsibilities – Conflicts of Interest
2019) involving authors of journal articles as well as reviewers and editors. The
ICMJE developed a detailed COI disclosure form that journals can require authors
and those involved in the review of manuscripts to complete so there is transparency
about financial interests. The organization recommends that journals publish the
disclosure forms (or key FCOI information) with articles as well as a statement about
the authors’ access to study data. A large number of journals, including many
leading biomedical journals, claim to have adopted the ICMJE recommendations.

Developing and Implementing COI Policies for Clinical Trials

Organizations that conduct clinical trials should adopt, publish, and implement a
credible FCOI policy. The policy should:

• Outline the financial interests that must be disclosed to the organization as well
as when and how they should be disclosed.
• Describe the process and standards for review of those interests, whether by
the IRB or an ancillary committee charged with addressing FCOIs (COI
Committee).
• Require the institution to issue a written management plan designed to
manage, reduce, or eliminate risks associated with the FCOI.
• State that the institution will monitor compliance with the FCOI management
plan and address failures to comply.

Disclosure

COI policies should specify who must make disclosures of potential FCOIs, what
interests and other information must be disclosed, and the time frame for disclosure.
Policies should detail whether all or a subset of the following must be disclosed:
personal income in the form of fees, honoraria, or other payments; patents, patents
pending, and trademarks; royalty income or entitlement to royalty under a license
agreement; equity interests in publicly traded companies; equity interests in non-
publicly traded companies; fiduciary roles, such as service on boards of directors;
and the interests of immediate family members. Even organizations that do not
receive PHS support should consider adopting an FCOI policy – including disclo-
sure requirements – that complies with the regulation. The regulation applies to sub-
recipients, such as sites for clinical trials involving federal funds, and the agreement
between the recipient organization and the sub-recipient may require the latter to
have a PHS-compliant policy.
Some institutions require disclosure of financial interests in companies that
compete with the company that is developing the study product. Defining such
interests and evaluating them can be complex and nuanced.
Under PHS regulations, institutions must solicit disclosures of income and
payments that, in the aggregate, exceed $5,000 in the 12 months preceding the
disclosure; income from intellectual property (e.g., royalties); ownership of equity
worth more than $5,000 in publicly traded companies; any equity ownership in
privately held companies; and travel payments or reimbursements of over $5,000 in
a 12-month period. The disclosure requirements also apply to these interests if they
are held by immediate family members.
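
The disclosure floors just described lend themselves to simple rule logic. The Python sketch below encodes them as written in this chapter’s summary; the categories, field names, and thresholds are simplifications of that summary, not the regulation’s full definitions of significant financial interest.

```python
# A simplified encoding of the PHS disclosure floors summarized above.
# Categories and thresholds follow the chapter's summary, not the full
# regulatory text; interests of immediate family members count as well.
from dataclasses import dataclass

@dataclass
class Interest:
    kind: str      # "income", "public_equity", "private_equity",
                   # "royalty", or "travel"
    amount: float  # aggregate value over the trailing 12 months

def must_disclose(interest: Interest) -> bool:
    if interest.kind in ("income", "public_equity", "travel"):
        return interest.amount > 5_000   # $5,000 floor, in the aggregate
    if interest.kind in ("private_equity", "royalty"):
        return True                      # disclosable at any amount
    return False

print(must_disclose(Interest("income", 4_800)))      # False
print(must_disclose(Interest("private_equity", 1)))  # True
```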
The policy should specify when and how disclosures should be submitted and
updated. (Ideally, there should be an online or other easy-to-use process for submit-
ting and updating financial interest disclosures. There are turn-key systems on the
market, and many can be customized.) In the setting of clinical research, disclosures
should be made well before the study is approved by the IRB so there can be a
complete review of the potential risks of the financial interests. FCOI management
may necessitate a change in the protocol or in the roles of investigators. Making
changes to study design or personnel too late in the process may be costly and/or
inconvenient. The more specific and up-to-date disclosures are, the more robust and
effective the review can be. Since investigators’ financial interests may change
during the course of a clinical trial, FCOI policies need to require that disclosures
be updated in a timely way so that new circumstances can be addressed promptly.
Whether FCOIs are reviewed by an IRB or another body, such as a designated
COI committee, close integration between the information systems for FCOI
disclosure and human subject research is advisable. That, in addition to well-
coordinated administrative processes, will maximize the likelihood that FCOIs in
clinical trials are identified and addressed in a timely way.
Institutional COIs. Organizations that address institutional COIs need to
develop a system for informing the IRB or COI committee of the institution’s
financial interests as they relate to a particular study. The interests might include
income from licensing intellectual property; equity in start-up companies based on
inventions made at the institution; or the equity, royalty, consulting income, and
board service of senior institutional officials. Aggregating and efficiently communi-
cating the information can be challenging. Many institutional financial interests arise
from technology licensing activity, which is often administratively disconnected
from research administration and IRBs. Moreover, if an institution has an interest
in a drug or device but the inventor is not an investigator on the study, it may be
difficult to make the link between the study and the institution’s financial interest.
So institutions that take institutional COIs into consideration may need manual or
custom methods of matching institutional financial information with trial data.
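
To illustrate what such matching might look like, the Python sketch below joins hypothetical licensing records to trial records by product name. Every name and field here is invented, and in practice the match usually requires manual curation because product descriptions rarely align cleanly across licensing and IRB systems.

```python
# Hypothetical matching of technology-licensing records to trial records to
# surface potential institutional COIs. All names and fields are invented.

licenses = [
    {"technology": "Bone Harvesting Device", "licensee": "StartCo"},
]
trials = [
    {"irb_id": "IRB-001", "product": "bone harvesting device"},
    {"irb_id": "IRB-002", "product": "cardiac stent"},
]

def flag_institutional_coi(licenses, trials):
    """Yield (irb_id, licensee) pairs where a trial product matches a
    licensed technology after case normalization."""
    licensed = {l["technology"].lower(): l["licensee"] for l in licenses}
    for trial in trials:
        licensee = licensed.get(trial["product"].lower())
        if licensee is not None:
            yield trial["irb_id"], licensee

print(list(flag_institutional_coi(licenses, trials)))  # [('IRB-001', 'StartCo')]
```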

Review for FCOI in Clinical Trials

Substantive review is at the heart of the FCOI process. A robust review should
identify the risks that the investigators’ and institution’s financial interests may
generate in the context of a specific clinical trial and within the framework of
applicable policy and regulations. There should be a well-defined review process,
and the reviewers, whether the IRB members or the members of a COI committee,
should have relevant expertise (e.g., experienced clinical trialists, biostatisticians)
and independence. If the reviewing body is not the IRB, close coordination with the
IRB is essential.
Reviewers should be free of bias. They should disclose any competing personal
interests and should recuse themselves from a particular case if they have an interest
that may bias or appear to bias their review.

An Initial Consideration: Thresholds for Participation

One threshold question is whether a conflicted investigator should be permitted to
have any role in a clinical trial. While there are national guidelines (e.g., AAMC) that
recommend limits on the financial interests a clinical investigator may have, neither
PHS nor FDA regulations require that investigators whose financial interests exceed
certain levels be disqualified from participation. Many leading academic medical
centers do, however, set thresholds for permitting a conflicted investigator to partic-
ipate in a study. These institutions take the position that a researcher who has a
“significant” financial interest (however the institution defines it) may not participate
in a clinical trial unless the individual completely divests herself of the conflicting
financial interests. There are variations among even the most restrictive policies. For
instance, some draw the line at allowing conflicted researchers to have any role in a
trial; others prohibit conflicted investigators from serving as principal investigators
(PIs) but still allow them to be part of a study team.
Social science research has shown that even modest gifts or payments are sources
of potential bias (Dana and Loewenstein 2003), but regulators, institutions, and
others take into account the nature and size of financial interests. Many policies set
thresholds for cash income at a level that may allow for some compensated consult-
ing while, in their judgment, not creating undue bias or an appearance of unaccept-
able conflict of interest. Equity ownership in a publicly traded company is often
treated like cash. If the value of the stock does not exceed a specified limit, the
investigator may hold the stock and participate in the study. Equity ownership in a
privately held company – especially if the company has licensed the investigator’s
invention – is widely viewed as disqualifying the researcher from any but the most
limited participation in a trial in which the technology is being tested. This is because
any trial of the product is likely to directly and significantly impact the value of the
equity. Likewise, royalty income and entitlement to future royalty income through
inventorship of a study drug or device can create an incentive to demonstrate the
safety or efficacy of the product since that can affect regulatory approval, sales, and
ultimately personal income. Some institutions allow limited participation for inven-
tors of investigational drugs or devices in an attempt to balance the organization’s
drive for innovation and translational research with the risks of FCOIs. Finally,
service as a board member or officer of a company with an interest in the investi-
gational drug or device is often treated as a bar to participation in trials of the
company’s products because fiduciary roles require the individual to act in the best
interests of the company, a goal that may directly conflict with an investigator’s
obligation to carry out safe and objective research.
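
At the risk of oversimplifying, the participation rules just described can be summarized as a small decision function. In the Python sketch below, the $5,000 threshold and the category names are illustrative assumptions; actual policies vary by institution and are applied by committees case by case.

```python
# An oversimplified sketch of participation thresholds of the kind described
# above. The threshold and categories are assumptions, not any institution's
# actual policy.

CASH_THRESHOLD = 5_000  # hypothetical ceiling for income or public stock

def permitted_role(kind: str, amount: float = 0.0) -> str:
    if kind in ("consulting_income", "public_equity"):
        return "participate" if amount <= CASH_THRESHOLD else "manage_or_divest"
    if kind in ("private_equity", "royalty_on_study_product", "board_or_officer"):
        return "no_role_or_strictly_limited"  # widely treated as disqualifying
    return "participate"

print(permitted_role("public_equity", 3_000))  # participate
print(permitted_role("board_or_officer"))      # no_role_or_strictly_limited
```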
Some institutions set thresholds for the institution’s own involvement in a trial.
For example, if an institution, through a technology license to a start-up company,
holds equity in the manufacturer of an investigational drug and is entitled to royalty
on eventual sales of the drug, it may be prudent for trials of safety and efficacy to be
conducted at another institution – one that does not have conflicts of interest. Some
institutions require that the protocol and/or the institutional COI be reviewed by
another institution’s IRB or COI Committee, provided that institution does not itself
have a conflict of interest with the study. Another approach is to permit the conflicted
institution to participate, but not lead the study. That may involve allowing only a
small percentage of patients to be enrolled at the institution and ensuring that the
institution does not serve as coordinating center or have any other leadership role
in the study.
Institutions should clearly outline which FCOIs disqualify an investigator from
participating in a trial. To the extent a conflicted individual may be permitted to
participate, the institution must undertake a careful, detailed analysis of the study and
especially the features that are vulnerable to FCOI risks. Elements of the review are
described below. The review should result in a plan to ensure that risks to safety and
objectivity are minimized or mitigated and that there is transparency to all key parties
about any conflicts of interest.

Study Design and Planning

Clinical trials should be designed to answer scientific questions and not to support a
predetermined outcome. Certain study designs can help mitigate the risk that a
conflicted investigator will inject bias into the study. One option is to ensure that
the investigators are blinded to treatment and control arms. Investigators with FCOIs
generate greater risk for unblinded studies, especially Phase I or Phase II studies. For
example, in an oncology study comparing standard of care to standard of care
combined with an interventional therapy, an investigator with a financial interest in
the study drug may be tempted – consciously or unconsciously – to adjust dosages of
the standard therapy, as permitted in the protocol, to boost the apparent efficacy of
the intervention.

Study designs with objective endpoints that can be recorded, reviewed, and tested
by those without FCOIs are likely to be safer from bias than those with subjective
endpoints.
Transparency is a powerful tool for protection against bias on the part of a
conflicted investigator. Disclosing to all study team members that an investi-
gator has an FCOI builds in a measure of oversight. The study team should
also be informed of the measures put in place to address the FCOI, especially
since they may be charged with implementing parts of the management plan.
For example, a research coordinator who is a consent designee should know if
the PI has a conflict of interest so he can clearly inform prospective subjects of
the COI (Friedman et al. 2007). Study team members should be advised about
how to raise any concerns related to the FCOI and should be protected if they
do so.
Expanding a study to more than one center and vesting greater authority in
another center, e.g., as coordinating center, especially if the PI and/or her institution
have financial interests in the study, can mitigate the risks that any bias in the conduct
of the study at the conflicted center will unduly influence the outcome.
Data Safety Monitoring Boards. Establishing independent data safety
monitoring boards (DSMBs), especially where the DSMB is informed of the FCOI
and formally charged with addressing any FCOI-related risks, can offer powerful
protection for trials with conflicts of interest. This is especially true if there is an
institutional COI, as long as the DSMB members are independent of the institution.
To ensure a DSMB is truly independent, its members should have no conflicting
financial interests. Ideally, the DSMB members should not be appointed by the
industry sponsor, and if an industry sponsor wants to have a nonvoting representa-
tive on a DSMB, that individual may provide information but should not participate
in or be present during discussion or voting.

Study Conduct

Recruitment and Consent. FCOIs can inject risk into the recruitment and
consenting process. Investigators with FCOIs may be tempted to stretch or
expand enrollment criteria to favor a particular outcome. While such behavior
may represent noncompliance with a protocol, the risk can be lowered by
putting protective measures in place such as prohibiting those with FCOIs
from recruiting subjects to trials. Likewise, informed consent should be
obtained by individuals other than a conflicted investigator and ideally by
individuals who (a) know about the FCOIs and (b) are not supervised by the
investigator with an FCOI. Informed consent documents should clearly state if
there is an FCOI or an institutional COI and provide contact information for
prospective subjects who may have questions.
Intervention. The greatest risk to clinical trial participants may be the inves-
tigational intervention. Because administering a drug is fairly straightforward,
there is usually no reason a conflicted investigator needs to participate. If the
intervention involves an experimental device for which the procedure is novel
or very specialized, the conflicted investigator may be able to make a case that
his unique expertise is essential for the safety of the procedure. That argument
should be tested with independent senior experts in the field. If it is determined
that the conflicted investigator has unique expertise, and in particular that his
participation is important for subject safety, he may be permitted to implement
the procedure as long as other measures are put in place. For example, the
investigator may be allowed to carry out a limited number of interventions
provided he trains another physician to succeed and replace him in carrying out
the procedure. Assigning a non-conflicted investigator responsibility for
reporting adverse events and serious adverse events adds a layer of protection
if it is judged important for subject safety to allow the conflicted investigator to
carry out the intervention.
Data collection is another area of potential vulnerability. Ensuring that endpoints
are objective, that non-conflicted investigators collect the data, and that all study
team members have access to the study data can help address potential COI risks.
Data Analysis. A conflicted investigator’s bias, whether conscious or uncon-
scious, is likely to favor the interventional drug or device. So it is especially
important to protect data analysis from any conflict or appearance of conflict.
Potential approaches include engaging independent biostatisticians to advise on
analytical tools and conduct the analyses; avoiding cherry-picking of results;
and avoiding analyses designed to overweight or promote very modest effects of an
interventional drug or device.
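To see why uncontrolled cherry-picking is dangerous, consider the following minimal simulation (a hypothetical illustration, not part of the chapter): scanning many secondary endpoints and reporting any nominally significant one inflates the false-positive rate far above the nominal 5%, which is precisely what pre-specified analysis plans and independent statisticians guard against.

```python
import numpy as np

# Hypothetical illustration: false-positive inflation from endpoint cherry-picking.
rng = np.random.default_rng(0)
n_sims = 2000       # simulated trials with NO true treatment effect
n_per_arm = 100     # subjects per arm (illustrative)
n_endpoints = 10    # endpoints an analyst might scan for "significance"

false_positive_trials = 0
for _ in range(n_sims):
    treat = rng.normal(0.0, 1.0, (n_endpoints, n_per_arm))  # null: same mean
    ctrl = rng.normal(0.0, 1.0, (n_endpoints, n_per_arm))
    diff = treat.mean(axis=1) - ctrl.mean(axis=1)
    se = np.sqrt(treat.var(axis=1, ddof=1) / n_per_arm
                 + ctrl.var(axis=1, ddof=1) / n_per_arm)
    z = diff / se
    # Cherry-picking: declare success if ANY endpoint reaches |z| > 1.96
    if np.any(np.abs(z) > 1.96):
        false_positive_trials += 1

print(f"'Significant' trials with no real effect: {false_positive_trials / n_sims:.1%}")
# Roughly 1 - 0.95**10, about 40%, versus the nominal 5% for one pre-specified endpoint
```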

Publication/Reporting

If a conflicted investigator is allowed to participate in a trial in a way that qualifies
her for authorship, she must be included as an author on resulting publications.
However, steps are needed to protect against bias in the publication and ensure
transparency. First, she should not have a role (such as first or senior author) that
vests substantial authority over the publication. Her contributions should be
reviewed by the first and senior authors for potential bias. If for some reason a
conflicted investigator is permitted to serve as first or senior author, the institution
should consider having an independent expert with access to study data review the
manuscript for potential bias.
Second, whatever role the conflicted investigator is permitted to have, her
relevant financial interests should be disclosed (a) to the journal in accordance
with its requirements and (b) in the manuscript. Journals have varied disclosure
requirements, and the information they publish also varies. But it is the authors’
responsibility to understand and follow those requirements. Failure to adhere to
journal policies – and the scientific community’s expectation of transparency – can
lead to corrections, article retractions, being barred from publishing in the journal for
a period of time, bad publicity, and even disciplinary action by one’s employer
(Bauchner et al. 2018; Gottlieb and Bressler 2017).
Documenting and Communicating FCOI Decisions: The Management Plan

The reviewing body should outline a plan for dealing with the FCOIs or institutional
COIs associated with the trial. A written management plan should outline clearly the
activities in which the conflicted individual (or institution) may participate and under
what conditions; which activities the conflicted investigator may not participate in
and who will handle them instead; and the disclosures that should be made in various
settings. The management plan should be provided to the investigator and others
who have a need to know, and there should be infrastructure in place to monitor and
help ensure compliance with it. Ideally, there should be a process for the conflicted
individual or, in the case of an institutional COI, the responsible individual, to
document their agreement to comply with the management plan.
Policies should be flexible enough that in certain circumstances, a reviewing body
may determine that an FCOI generates such significant risks for a trial that the
individual (or the institution) may have no role in any part of the study. If this
determination is made, it should be communicated through a management plan.
Management plans should address, at a minimum, the following items:

• Limitations on the role of the conflicted investigator in areas such as:
– Recruiting subjects and obtaining consent
– Carrying out investigational intervention
– Collecting data
– Determining adverse events and serious adverse events
– Analyzing data
– Authoring publications and making presentations about the study
• If the reviewing organization employs or otherwise has appropriate authority over
the investigator, there may be limits imposed on the investigator’s relationship
with the sponsor or manufacturer of the study drug or device, such as:
– Limits on annual income and other payments from or interests in the company
– Limits on the investigator’s involvement with the company
– Prohibition on negotiating terms of the sponsored agreement on behalf of the
institution or the company.
• If the conflict involves the institution, the management plan should be commu-
nicated to all those with responsibility for and oversight of the study and should
outline any conditions being placed on conducting the study at the institution,
such as:
– If a multicenter trial, limit on number or percentage of subjects that may be
enrolled at the institution and restrictions on the institution’s leadership roles
(e.g., as coordinating center)
– Whether an outside IRB must review the protocol
– Whether an independent DSMB should be established and details of its
composition and charge
• Any changes that need to be made to the study design, such as:
– Blinding investigators to treatment and control arms
– Ensuring inclusion of objective endpoints
– Adding non-conflicted investigators with special expertise
• Requirements for disclosure of the FCOIs, including details that must be included
in the disclosure, to
– Study team members
– Prospective research subjects
– Medical journals and conferences
– Media and journalists reporting on study outcomes
– Regulatory bodies (if the conflicted investigator is responsible, e.g., as spon-
sor-investigator on an FDA-regulated trial).
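Purely as an illustration (no such schema appears in the regulations, and every field name below is hypothetical), the checklist above could be captured as a structured record so that limitations, conditions, and disclosure obligations are explicit and auditable:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ManagementPlan:
    """Hypothetical, illustrative record of an FCOI management plan."""
    investigator: str
    study_id: str
    # Activities the conflicted investigator may NOT perform
    prohibited_roles: List[str] = field(default_factory=lambda: [
        "recruiting subjects and obtaining consent",
        "determining adverse events and serious adverse events",
        "analyzing data",
    ])
    # Activities permitted only under stated conditions
    permitted_roles_with_conditions: List[str] = field(default_factory=list)
    # Limits on the relationship with the sponsor or manufacturer
    sponsor_relationship_limits: List[str] = field(default_factory=list)
    # Audiences that must receive FCOI disclosure
    disclosure_audiences: List[str] = field(default_factory=lambda: [
        "study team members", "prospective research subjects",
        "medical journals and conferences", "regulatory bodies",
    ])
    monitored_by: str = "COI office"
    agreement_documented: bool = False

# Example use (all values hypothetical)
plan = ManagementPlan(investigator="Dr. X", study_id="STUDY-001")
plan.sponsor_relationship_limits.append("limit on annual income from sponsor")
plan.agreement_documented = True
```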

Once a management plan has been issued, there should be oversight of
investigators’ compliance with the conditions in the plan. Monitoring may be
conducted by the IRB, a COI office, or institutional auditors. The degree of
monitoring applied to any particular study may depend on the level of risk associated
with the study, its potential impact on drug or device approval and medical practice,
and the nature and extent of the financial interests. Failures to adhere to the
management plan should be addressed promptly and thoroughly under the institu-
tion’s policies on research integrity and research compliance. Failures to comply
with PHS regulations may result in additional action by the awarding agency.

Summary and Conclusion

FCOIs are ubiquitous and have the potential to create bias and affect the safety of
clinical research. Financial relationships between the biomedical industry and those
who conduct essential research are likely to be a feature of clinical research well into
the future. Regulations and national standards to minimize or mitigate the risks of
these relationships will evolve, but investigators and research organizations need to
make disclosure, robust review, and careful management of FCOIs a fundamental
part of their culture. Adopting policies and procedures that are easy to understand
and follow and enforcing policies and procedures consistently will foster a culture of
compliance. Institutional leaders should communicate that addressing conflicts of
interest is a top priority and is part of a commitment to integrity in research, and they
should provide sufficient resources to support robust administration of the FCOI
policy. Researchers and institutions that demonstrate a commitment to transparency
and to mitigating undue influence of FCOIs in clinical trials will be viewed by the
scientific community, the public, and patients as credible and objective.

Key Facts

• Financial conflicts of interest are ubiquitous in clinical research and should be
disclosed, reviewed, and managed to ensure the objectivity and safety of clinical
trials.
• Management of financial conflicts of interest should be carried out by
knowledgeable individuals with the expertise and authority to ensure compliance
with and enforcement of management plan conditions.
• Disclosure and transparency regarding financial conflicts of interest are essential.
• Management of financial conflicts of interest may necessitate changes in study
design and/or in the roles of investigators with conflicts of interest.
• Some institutions that conduct clinical trials set limits on the conflicting financial
interests that investigators may have while leading or participating in clinical
trials.
• Financial conflicts of interest in research are regulated by the Public Health
Service and the Food and Drug Administration.
• National associations, professional societies, and accrediting bodies maintain
standards for addressing financial conflicts of interest.
• Most medical journals require disclosure of authors’ financial conflicts of interest.
• Congress and the media have intensified their scrutiny of financial conflicts of
interest in recent decades.

Cross-References

▶ Consent Forms and Procedures
▶ Data and Safety Monitoring and Reporting
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Fraud in Clinical Trials
▶ Implementing the Trial Protocol
▶ Institutional Review Boards and Ethics Committees
▶ Investigator Responsibilities
▶ Paper Writing
▶ Principles of Clinical Trials: Bias and Precision Control
▶ Reporting Biases
▶ Trial Organization and Governance

References
Ahn R, Woodbridge A, Abraham A et al (2017) Financial ties of principal investigators and randomized controlled trial outcomes: cross sectional study. BMJ 356:i6770. https://doi.org/10.1136/bmj.i6770
Bauchner H, Fontanarosa P, Flanagin A (2018) Conflicts of interests, authors, and journals. JAMA 320:2315. https://doi.org/10.1001/jama.2018.17593
Cain D (2008) Everyone's a little bit biased (even physicians). JAMA 299:2893. https://doi.org/10.1001/jama.299.24.2893
CFR – Code of Federal Regulations Title 21 (2019) In: Accessdata.fda.gov. Accessed 28 Jan 2019
Cigarroa F, Masters B, Sharphorn D (2018) Institutional conflicts of interest and public trust. JAMA 320:2305. https://doi.org/10.1001/jama.2018.18482
Dana J, Loewenstein G (2003) A social science perspective on gifts to physicians from industry. JAMA 290:252. https://doi.org/10.1001/jama.290.2.252
eCFR – Code of Federal Regulations (2019) In: Ecfr.gov. https://www.ecfr.gov/cgi-bin/text-idx?c=ecfr&SID=992817854207767214895b1fa023755d&rgn=div5&view=text&node=42:1.0.1.4.23&idno=42#sp42.1.50.f. Accessed 28 Jan 2019
Friedman J, Sugarman J, Dhillon J et al (2007) Perspectives of clinical research coordinators on disclosing financial conflicts of interest to potential research participants. Clin Trials 4:272–278. https://doi.org/10.1177/1740774507079239
Gottlieb JD (2015) Financial conflicts of interest in research. In: Suckow M, Yates B (eds) Research regulatory compliance. Elsevier Inc., London, pp 253–276
Gottlieb JD, Bressler NM (2017) How should journals handle the conflict of interest of their editors? JAMA 317:1757. https://doi.org/10.1001/jama.2017.2207
Hhs.gov (2016) Financial conflict of interest: HHS guidance (2004). In: HHS.gov. https://www.hhs.gov/ohrp/regulations-and-policy/guidance/financial-conflict-of-interest/index.html#. Accessed 4 Feb 2019
ICMJE | Recommendations | Author Responsibilities—Conflicts of Interest (2019) In: Icmje.org. http://www.icmje.org/recommendations/browse/roles-and-responsibilities/author-responsibilities--conflicts-of-interest.html. Accessed 28 Jan 2019
Ioannidis J, Trepanowski J (2018) Disclosures in nutrition research. JAMA 319:547. https://doi.org/10.1001/jama.2017.18571. Available at: https://jamanetwork.com/journals/jama/article-abstract/2666008
Lo B, Field MJ (eds) (2009) Principles for identifying and assessing conflicts of interests. In: Conflict of interest in medical research, education, and practice, 1st edn. National Academies Press, Washington, DC, pp 44–61. Available at: https://www.nap.edu/read/12598/chapter/4. Accessed 25 Jan 2019
Lundh A, Bero L (2017) The ties that bind. BMJ 356:j176. https://doi.org/10.1136/bmj.j176
Stolberg S (2019) Youth's death shakes new field of gene experiments on humans. In: Archive.nytimes.com. https://archive.nytimes.com/www.nytimes.com/library/national/science/012700sci-gene-therapy.html. Accessed 28 Jan 2019
Tringale K, Marshall D, Mackey T et al (2017) Types and distribution of payments from industry to physicians in 2015. JAMA 317:1774–1784. https://doi.org/10.1001/jama.2017.3091
Wilson RF (2009) Estate of Gelsinger v. Trustees of University of Pennsylvania: money, prestige, and conflicts of interest in human subjects research. In: Johnson SH, Krause JH, Saver RS, Wilson RF (eds) Health law and bioethics: cases in context
World Medical Association (2013) World Medical Association declaration of Helsinki. JAMA 310:2191–2194. https://doi.org/10.1001/jama.2013.281053
29 Trial Organization and Governance

O. Dale Williams and Katrina Epnere

O. D. Williams: Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA; Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
K. Epnere: WCG Statistics Collaborative, Washington, DC, USA

Contents
Introduction
Key Factors
Funding Source and Its Relationship to Trial Operations
Individual Organizational Units, Roles, and Structure
Committees, Committee Roles, and Committee Structure
Common Threats and Failures
Conclusion and Summary
Cross-References
References

Abstract
An issue impacting the success of many human efforts is the organizational and
management strategy required for their successful completion. This is an impor-
tant issue for any clinical trial as well. It is always a challenge to match the needs
required for a successful trial with the resources available in a management
strategy compatible with the experience and personalities of the collection
of investigators and staff involved. Clearly the simplest situation is the single-
site trial with a single investigator and few or no staff. In this situation, the
investigator has only himself or herself to organize and manage. While this is
no guarantee of success, it creates much less of a management burden than does
a multicenter, long-term trial, especially since such endeavors typically include
numerous investigators, central laboratories, reading centers, coordinating center,
and a large number of committees, each with its own purpose, requirements, and
personality. The organization and management (OM) issues for this situation are
critically important for the overall success of the trial. This chapter highlights
issues for such long-term, multicenter studies as these situations encompass all
the key, major issues.

Keywords
Organization and management · Organizational structure · Multicenter studies ·
Steering committee · Executive committee · Coordinating center

Introduction

The number of newly registered trials doubled from 9,321 in 2006 to 18,400 in 2014.
The number of industry-funded trials increased by 43%. Concurrently, the number of
NIH-funded trials decreased by 24% (Ehrhardt et al. 2015). In a recent communica-
tion, Meinert indicated ClinicalTrials.gov included almost 95,000 trials started
between 2014 and 2018 (personal communication Meinert 2019). This is a surpris-
ing number in many ways and raises the interesting question as to how many are well
organized and managed and how many will not meet their stated goals as a
consequence of inadequate OM.
It is often said that the inability to recruit adequate numbers of trial participants is
the most common cause of the failure of a clinical trial. The root cause of failure in
this case, however, is most likely due to an OM strategy that was not up to the task.
This situation was recognized early on in the history of multicenter trials in
the USA and was addressed in the Greenberg Report (1967) prepared in 1967 and
formally published in Controlled Clinical Trials in 1988. This report includes an
organization chart that has stood the test of time although the situation has evolved
in directions and magnitudes that were perhaps unimaginable in 1967. The key
components of this chart are listed in the discussion of committees below.
It is important to point out that it is not uncommon for the OM general issue to
receive inadequate attention from the earliest phases of trial planning as these issues,
critical as they are, often are much less interesting than the scientific and health-care
issues under consideration. The consequence of this lack of appropriate attention can
be catastrophic failure. Farrell and colleagues have repeatedly pointed out that even
though eminent trialists have written persuasively and repeatedly of the need for large,
randomized controlled trials, little attention has been given in the scientific literature
to the day-to-day and strategic management of such trials. Farrell emphasizes
that the knowledge and expertise gained in running earlier trials are not widely
disseminated, so new trials often have to begin from scratch. Because a randomized
trial involves a huge investment of time, money, and people, Farrell suggests it
should be managed like any other business (Farrell 1998; Farrell et al. 2010).
Multicenter clinical trials often operate under two separate but related organiza-
tion charts, one representing the funding structure and its accountability and
financial reporting expectations and one representing the committee structure for the
overall trial. The funding structure requirements necessarily address issues related to
inadequate performance of a trial’s individual funded entities. The committee struc-
ture performance expectations, while also critical, tend to be less concretely formu-
lated. The remainder of this chapter focuses on the latter.
The overall committee structure typically reflects a balance among the appro-
priate representation of stakeholders, expertise requirements, and operational
efficiency. The first of these may require large committees if there are large
numbers of clinical field sites and central units, which may be further augmented
should the expertise required not be available from these units. Such large numbers
of persons on committees may make it difficult or impossible to proceed with the
required efficiency. One strategy used is to create a steering committee, consisting
of representatives of all the stakeholders which has overall responsibility for the
trial. The role of the steering committee is to provide oversight of the trial on
behalf of the sponsor and funder and ensure that the trial is conducted in accor-
dance with the principles of GCP and relevant regulations. The steering committee
should focus on the progress of the trial, adherence to the protocol, and participant
safety (McDonald et al. 2014). A much smaller subcommittee, sometimes called an
executive committee, may be more directly responsible for day-to-day issues.
Since the OM strategy for a multicenter, long-term clinical trial typically has
a committee structure at its core, it might be worthwhile to reflect on the old adage
that a camel is a horse designed by a committee. The fact that a key committee exists
does not, unfortunately, necessarily mean it will function commendably. Success
requires the productive cooperation of all key stakeholders operating in a system that
recognizes and takes into account their individual needs as well as those of the
overall trial. The WRIST study group wrote a Guide on Organizing a Multicenter
Clinical Trial and stated that planning of multicenter clinical trials (MCCTs) is a long
and arduous task that requires substantial preparation time. They emphasized that an
essential asset in planning an MCCT is the fluidity with which all collaborators work
together toward a common vision. This means developing a consensus-based
study protocol and recruiting centers and co-investigators who are
dedicated, collaborative, and selfless in this team effort to achieve goals that cannot
be reached by a single-center effort (Chung et al. 2010).

Key Factors

A list of factors that may be helpful to consider when developing an OM plan for
a trial includes the following:

1. Funding source and its relationship to trial operations
2. Individual organizational units, roles, and structure
3. Committees, committee roles, and structure
4. Common threats and failures
A goal for the overall OM scheme can perhaps best be characterized by the simple
statement “Who reports to whom about what and when.” Which bodies need a
report? What types of reports are required? How often are reports required and in
what format? What data are required to be included in the report, for example,
recruitment data, safety data, and blinded or unblinded data? Who will produce the
reports (McDonald et al. 2014)? A scheme that identifies and clarifies roles, respon-
sibilities, and accountability for the entities involved is vitally important.
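As a purely hypothetical sketch of how "who reports to whom about what and when" might be pinned down (every producer, audience, and frequency below is invented for illustration), the scheme can be written as a small reporting matrix:

```python
# Illustrative reporting matrix: report -> (producer, audience, frequency, key content).
# All entries are hypothetical examples, not requirements from the chapter.
reporting_matrix = {
    "recruitment report": ("coordinating center", "steering committee", "monthly",
                           "actual vs projected enrollment by site"),
    "safety report": ("coordinating center", "DSMB", "per DSMB charter",
                      "adverse and serious adverse events"),
    "progress report": ("clinical centers", "funding agency", "quarterly",
                        "milestones, protocol adherence, financial status"),
}

for report, (producer, audience, frequency, content) in reporting_matrix.items():
    print(f"{report}: {producer} -> {audience}, {frequency}; includes {content}")
```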

Funding Source and Its Relationship to Trial Operations

Funding sources for clinical trials include for-profit entities such as pharmaceutical
firms; numerous US government agencies and those of other countries and the
European Union; various not-for-profit entities including foundations, societies,
and others; and international organizations such as the World Health Organization.
Each such source has its own expectations as to how trials it funds will be organized
and managed in the general sense and what its specific functional role will be for
a given trial. It is critically important that these expectations are clearly understood
by the investigative team at the very outset of a trial. The expectations and context
for pharmaceutical industry-sponsored trials are importantly different from those
sponsored by NIH, for example, and the OM scheme to be utilized needs to be fully
cognizant of these differences.
It should also be noted that some of the operational units may be funded through
subcontracts with other trial organizational units which are funded directly from the
funding agency. For example, some laboratories and reading centers may operate
under subcontracts to a coordinating center. In this case, the organizational unit
offering the subcontract has to have the resources and expertise to select and manage
the relationship with the entity under subcontract. Olmstead summarized that several
articles and surveys have addressed concerns of pharmaceutical company research
staff with the performance of their outside contract researchers. He classified these
issues into four categories – credibility, responsiveness, quality of product, and cost.
He concluded that strong emphasis on quality control and improved, automated data
management are key elements of improvement and added that improved organiza-
tion and management efforts on the part of contract researchers themselves will go
far to reduce the most obvious difficulties (Olmstead 2004).

Individual Organizational Units, Roles, and Structure

A trial may involve 30 or more organizational units. Examples include:

1. Clinical centers. Clinical centers are the core operational units for a trial.
The number of such units is usually determined by how many are needed to meet
the participant recruitment target. Clinical centers recruit and interact with trial partic-
ipants as required for the duration of the trial. They also are responsible for all
local research-related approvals and for collecting and transmitting data, typically
to a coordinating center. They also deal with biological samples, sent either to
a local laboratory or to a central laboratory, and for the collection and transmis-
sion of any images, again, to local readers or to a central reading center. They
participate in the trial committee structure as appropriate.
2. Coordinating center. The trial coordinating center is the heart of the trial, whether
it is a single-site or multicenter trial. Sometimes the broader function served by
this unit is divided into a clinical coordinating center and a data coordinating
center. In general, this combined entity is responsible for data-related and study
coordinating issues. The data coordinating component typically is responsible for
key elements of trial design and for data collection systems, data management,
and data analyses. This includes as well the design and testing of data collection
forms and data collection quality assessment. The data coordinating center
component also would prepare reports for trial overview committees such as
Data and Safety Monitoring Boards. The clinical coordinating center component
often is responsible for managing and reviewing the adverse and serious adverse
events. The coordinating center also typically provides staff support for at least
some committees. The data coordinating component team usually consists of chief investi-
gator(s), trial manager, programmer/IT support, database manager and/or data
clerks, and trial statistician (McDonald et al. 2014).
3. Central laboratory. Some trials require, in addition to the use of local laboratories,
more than one central laboratory. In general, central laboratories are responsible
for creating and maintaining shipping procedures for the transmission of samples
from the clinical centers to the lab. They are responsible for high-quality labora-
tory analyses for the parameters under their purview and for transmission of the
resulting data to the coordinating center. If abnormal results are considered
adverse or serious adverse events, they would be required to transmit appropriate
notifications. Importantly, they should participate in the appropriate standardiza-
tion programs and any quality control activities specific to the trial. They also may
serve as an archive for biological materials collected by the trial. Personnel from
the central lab also may participate in trial committees.
4. Central reading center. Some trials require the use of central reading centers for
images critical to the assessment of patient safety or trial outcomes. These centers
typically are responsible for the systems that transmit images from the clinical
centers to the reading center and for transmitting the results of assessments
they complete to the coordinating center. The center is expected to perform
high-quality assessments for the readings they undertake and to participate in
quality control activities as appropriate. They, like the central labs, may also serve
as an archive for images collected by the trial. Personnel from the center may
participate in trial committees.

These individual organizational units also have to be successfully organized and
managed for the overall trial’s OM to be successful. The individual units report, in
many senses, to their home institution and also to the organization structure for the
trial of which they are a part. This means that the unit leaders need to have the
experience and capability to successfully work with both sets of masters. Keeping in
mind that a trial may include more than 30 organization units, it would be ideal if
these 30+ units were each led by someone with appropriate OM experience and
capability. This does not always happen, and a poorly organized and managed unit
can jeopardize the overall trial.

Committees, Committee Roles, and Committee Structure

Since the core management strategy for many clinical trials is based on a committee
structure, the creation of the committees, selection of their members, and their
operational effectiveness and efficiency are of paramount importance. Decisions
need to be made up front as to which committees will be needed at least at the
outset. Often the first committee created is the steering committee, which includes
representatives of all the key stakeholders. Sometimes the chair is designated by
the funding agency and sometimes elected from the members. However this is done,
this person is key to the overall success of the trial and therefore needs to have the
requisite knowledge, experience, and personality for the task. There also needs to be
a succession plan that provides backup as needed.
The likelihood of this success may be enhanced by the following considerations:

1. Committee charge: A clear statement as to the role the committee is expected to
play is critical.
2. Committee chair: Someone with appropriate knowledge, experience, and person-
ality, possibly along with a designated co-chair, is a fundamental requirement.
This person is responsible for organizing and conducting committee meetings
and, in general, making sure the committee is satisfactorily addressing its com-
mitments. This likely will include making sure there are appropriate minutes that
include action items and assigned tasks. The status of progress on completing
these tasks should be addressed at subsequent meetings.
3. Committee members: The critical issue here is the inclusion of appropriate
stakeholders and expertise. It may be necessary to go outside the immediate
members of the overall team for this expertise. Committee members are respon-
sible for attending and participating fully in the meetings and their deliberations.
They also are responsible for completing any assigned tasks in a timely manner.
4. Committee staff: This issue is typically overlooked and is not always needed, but
when it is, there needs to be a mechanism for its provision. Staff typically are
responsible for arranging the required logistics for meeting and organizing
agendas and materials and preparing minutes and action item lists.
5. Scheduling meetings: A clearly delineated meeting schedule, with adjustments as
required and available suitably in advance, can be most helpful. One important
consequence of such a schedule is the ability of the members to put meetings on
their calendars well ahead of time and thus be more likely to be available for
meetings.
6. Meeting conduct: Factors that may facilitate the success of the committee meet-
ings include appropriate agendas prepared well ahead of the meeting, accompa-
nied by documents and materials as appropriate; efficiently conducted meetings
to include appropriate control of time devoted to individual items and speakers;
and clear minutes and follow-up on issues addressed in previous meetings.
7. Committee accountability: Most trial committees are in fact subcommittees
to a steering committee or similarly designated committee so that they report to
this higher committee. The steering committee should hold the subcommittees
accountable for meeting their charge in a high-quality, timely fashion. This
typically requires both written reports and presentations at the steering committee
meetings.

The trial’s committee structure has the responsibility to inform the funding
structure component of issues that need to be addressed for specific individual
operational entities. This may require special reports and/or special meetings.
The designated committees play key roles, and their successful operation is
critical to the success of the overall trial. Examples of committees, which typically
operate as subcommittees of and thus report to the steering committee include:

1. Steering committee. Includes representatives of stakeholders and funding entity
and may include outside experts. Responsible for the overall management of the
trial.
2. Executive committee. Typically, a subcommittee of the steering committee
that includes the steering committee chair, the director(s) of the coordinating
center, and representatives of the clinical centers. A relatively small committee
responsible for the more day-to-day activities.
3. Recruitment and retention committee. Responsible for developing and
implementing participant recruitment procedures and also participant retention
efforts.
4. Laboratory committee. Responsible for creating list of laboratory tests to be
done and monitoring the quality of laboratory performance.
5. Imaging committee. Responsible for creating list of imaging parameters to be
included and for monitoring the quality of the reading center performance.
6. Quality control committee. A subcommittee with overarching responsibilities
for data quality control. May mandate blind duplicate assessments for some key
variables and set standards for acceptable performance.
7. Endpoint committee. For those trials for which a judgment based on several data
sources may be required in the assessment of primary outcomes or endpoints,
a panel of experts may be required to make this assessment. This panel is
required to make this judgment for all such events in the trial.
8. Ancillary studies committee. Some trials include or obtain funds for ancillary
studies that add procedures or data to be collected in addition to those for the
main trial. This committee would overview that process with careful reference to
avoiding conflicts with the main trial.
9. Data form committee. Responsible for the development and testing of data
collection forms and sometimes overviews the data collection training and
certification procedures.
10. Publication and presentation committee. Responsible for overviewing the pub-
lication and presentation process for the trial. This includes efforts to help ensure
that trial publications are completed in a timely manner and also deals with
authorship conflict issues.
11. Data and safety monitoring board. An independent board of experts in the topic
of the trial and biostatistics responsible for trial integrity and participant safety.
Typically reports to the funding entity but also may report jointly to the steering
committee. Usually operates according to a charter established at the outset of
the trial. Reviews adverse and serious events and trial analysis reports.
12. Advisory committee. Some trials may involve an overarching advisory com-
mittee which is appointed by and reports to the funding entity. This committee
may assist with setting overall directions and with broad overview assessment of
trial progress and success.

Common Threats and Failures

As is the case for any endeavor such as a clinical trial, failure can occur. Some key
issues are:

1. Participant recruitment. One of the most common causes of a clinical trial failing
to be able to operate to completion is failure to recruit adequate numbers
of participants to undertake the randomization process. The trial OM process
sometimes is too slow to react to this crisis, and when it reacts, it does so with too
little too late. It is imperative that the OM process monitor recruitment status from
the very outset and react strongly to indications of recruitment problems.
The assumption should be that enrollment will be slower than projected, and
almost every trial should implement proactive measures to foster enrollment.
Frequent monitoring of actual versus projected enrollment by site to identify trends
gives the opportunity to consider protocol amendments, additional recruitment
funds, site closure, etc. (Allen 2015); a simple monitoring sketch follows this list.
The STEPS study analyzed 114 multicenter trials and showed that 45% failed to
reach 80% of the prespecified sample size. Less than one third of the trials recruited
their original target number of participants within the time originally specified, and
around one third had to be extended in time and resources. Trials that actually
recruited successfully shared a common factor – they had employed a dedicated trial
manager. The STEPS collaborators suggested that anyone undertaking trials should
think about the different needs at different phases in the life of a trial and put greater
emphasis on “conduct” (Campbell et al. 2007; Farrell et al. 2010).
2. Clinical center failure. Especially if the trial includes a rather large number of
clinical centers, one or more may not perform adequately. Such a situation may
jeopardize the trial. Since this may be a leadership problem at the clinical site, the
trial OM system may need to step in quickly and assist or replace as needed.
3. Coordinating center performance. Coordinating centers need to develop data
collection, data management, and data analysis systems and operate them in a
timely and high-quality fashion. If this does not happen, the consequence can be
severe.
Gathering clean data is among the most important steps to a successful clinical
trial. Even if sites are found and patients are recruited, inaccurate data will be of
no use to the sponsor. Consider doing early and routine review of
data – whether remotely or during a monitoring visit. Priority should be given to
primary endpoint data. Identifying data quality issues early on, correcting those
issues, retraining site personnel, and establishing preventative measures allow for
data issues to be addressed and resolved quickly before evolving into significant
problems (Allen 2015).
4. Committee failure. If a key committee lags behind and is causing delays in trial
development or operation, some corrective action may be needed. Sometimes a
new chair should be appointed.
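Here is the monitoring sketch referenced in item 1 above: a minimal, hypothetical check of actual versus projected enrollment by site, where every site name, accrual rate, and the 80% flag threshold are invented for illustration.

```python
# Hypothetical site-level enrollment check; all numbers are illustrative.
projected_per_month = {"site_A": 6, "site_B": 4, "site_C": 5}   # planned accrual rates
actual_enrolled = {"site_A": 30, "site_B": 9, "site_C": 24}     # observed to date
months_open = 5

for site, rate in projected_per_month.items():
    expected = rate * months_open
    ratio = actual_enrolled[site] / expected
    if ratio >= 0.8:  # illustrative threshold
        status = "on track"
    else:
        status = "FLAG: consider retraining, added recruitment funds, or closure"
    print(f"{site}: {actual_enrolled[site]}/{expected} enrolled ({ratio:.0%}) -> {status}")
```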

Conclusion and Summary

As described above, numerous entities typically are involved in the organization and
management of multicenter long-term trials so that there are numerous opportunities
for failure. Clearly, a clear and detailed organization and management strategy needs
to be established well before the onset of the trial. The strategy needs to provide an
unambiguous answer for the essential question “who reports to whom about what
and when.” Strong and experienced leadership closely connected with day-to-day
operations in a system that provides continuous monitoring and flexibility to adjust
to unexpected situations is key to success.

Cross-References

▶ Archiving Records and Materials
▶ ClinicalTrials.gov
▶ Data and Safety Monitoring and Reporting
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Evolution of Clinical Trials Science
▶ Funding Models and Proposals
▶ Multicenter and Network Trials
▶ Participant Recruitment, Screening, and Enrollment
▶ Publications from Clinical Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
References
Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, Entwistle V, Garcia J, Roberts I, Grant A, The STEPS Group (2007) Recruitment to randomised trials: strategies for trial enrolment and participation study. The STEPS study. Health Technol Assess (Winch Eng) 11(48):iii, ix–105
Chung KC, Song JW, WRIST Study Group (2010) A guide to organizing a multicenter clinical trial. Plast Reconstr Surg 126(2):515–523
Ehrhardt S, Appel LJ, Meinert CL (2015) Trends in National Institutes of Health funding for clinical trials registered in ClinicalTrials.gov. JAMA 314(23):2566–2567
Farrell B (1998) Efficient management of randomised controlled trials: nature or nurture. BMJ 317(7167):1236–1239
Farrell B, Kenyon S, Shakur H (2010) Managing clinical trials. Trials 11(1):78
Greenberg Report (1967) Organization, review, and administration of cooperative studies (Greenberg report): a report from the heart special project committee to the National Advisory Heart Council. Control Clin Trials 1988(9):137–148

Online Documents
Allen S (2015) Best practices for clinical trial operations. https://www.pharmoutsourcing.com/Featured-Articles/180536-Best-Practices-for-Clinical-Trial-Operations/
McDonald A, Lane A, Farrell B, Dunn J, Buckland S, Meredith S, Napp V (2014) Trial managers' network guide to efficient trial management. https://cdn.ymaws.com/www.tmn.ac.uk/resource/collection/77CDC3B6-133F-42E6-9610-F33FF5197D2F/tmn-guidelines-web_[amended_July_2014].pdf
Meinert C (2019) The trend in trials. https://jhuccs1.us/clm/PDFs/NameThatTune.pdf
Olmstead FL (2004) Improved organization and management of clinical trials. http://www.appliedclinicaltrialsonline.com/improved-organization-and-managementclinical-trials
30 Advocacy and Patient Involvement in Clinical Trials

Ellen Sigal, Mark Stewart, and Diana Merino

E. Sigal, M. Stewart, and D. Merino: Friends of Cancer Research, Washington, DC, USA

Contents
Introduction
Patient Engagement in Research and Drug Development
Primary Areas of Engagement
Challenges Associated with Incorporating Patients into Research and Drug Development
Barriers to Patient Engagement
The Contribution of Patient Advocacy to Research and Drug Development
Trial Designs and Endpoint Selection
Capturing and Measuring Patient Experience
Contributors to Data Generation
Future Areas of Innovation and the Evolving Clinical Trial Landscape
Cross-References
References

Abstract
Patient engagement in research and clinical trials has evolved over time. Patients
are no longer simply passive research subjects but are increasingly being inte-
grated into research teams and protocol review teams to help design, implement,
and disseminate clinical trial findings. While potential barriers exist for mean-
ingful patient engagement, mechanisms and methods to effectively engage
patients and advocacy groups are evolving, and resources and best practices are
continually being developed to assist researchers and patients. Additionally,
legislation and regulatory guidance are being instituted to promote patient
engagement and ensure it is a routine process for clinical trial development.
Developing patient-centered clinical trial designs has led to development of
innovative clinical trial infrastructures and statistical methods. Patient advocates
and organizations are also increasingly developing their own data sources and
clinical trials, which represent unique opportunities for researchers to partner
with patient groups to rapidly advance drug development.

Keywords
Patient advocacy · Drug development · Patient engagement · Patient-Centered
clinical trials

Introduction

The role of patients and advocates in clinical research and their involvement in
the regulation and oversight of clinical trials have substantially grown over time.
In just a few decades, patients have gone from being considered passive human
subjects whose clinical measures would contribute to answering research questions
to active participants and engaged stakeholders. This growing movement toward a
more patient-centered approach aims to provide the best healthcare for each patient,
which takes into consideration the patient’s own goals, values, and preferences
(Manganiello and Anderson 2011). This movement is rooted in early advocacy
efforts led by the HIV/AIDS community dating back to 1988, which resulted in
fundamental changes to the medical research paradigm.
The path from initial development of a new drug to entry of the new therapy into
the patient community relies on clinical trials, which represent the final step
in evaluating the safety and efficacy of new therapeutic approaches. Along this
developmental path, patients can provide critical input on many fronts: collecting
natural history information; endpoint selection; protocol design; consent and
eligibility criteria; clinical trial recruitment and retention strategies; design of
post-market safety studies; and dissemination of trial findings (Fig. 1).
[Fig. 1 Opportunities for patient involvement in the drug development process. The figure is a five-stage schematic (drug discovery; preclinical; clinical trials, Phase I/II/III; FDA review; post-marketing surveillance), with each stage listing example patient-engagement activities.]

A detailed analysis of several clinical trials indicates that 48% of all sites in a
given trial fail to meet their enrollment targets and more than 11% never enroll a
single patient (Kaitin 2013). It is estimated that less than 5% of adult cancer patients
enroll in a clinical trial despite many indicating their desire to participate in clinical
trials (Comis et al. 2003; Unger et al. 2016). Thus, significant barriers such as
clinical trial access, demographic and socioeconomic challenges, inappropriate or
excessive procedures, broad exclusion criteria, lack of patient-centric trial designs,
and patient and physician attitudes remain that hinder trial participation. While not
every barrier may be readily overcome, engaging patients early and often throughout
the entire research and drug development process can help ensure appropriately
designed trials that are viewed favorably by patients, answer questions important to
the patient community, and ultimately encourage participation.
A growing body of evidence describing the benefits of patient involvement in
research and clinical trials is slowly changing scientific, medical, and regulatory
practices. In their systematic review, Domecq and colleagues found that
patient engagement positively influenced research by increasing study enrollment
rates and helping researchers in securing funding, designing study protocols, and
choosing relevant outcomes (Domecq et al. 2014). Greater patient engagement in
research and clinical trials would help drug developers sponsor trials that are
more informed about the needs of the patients, which would translate to more
feasible and streamlined trial design generating better outcomes (Hanley et al.
2001; Tinetti and Basch 2013). Increased engagement could also reduce patient
accrual time due to improved enrollment, reduce patient attrition, and make findings
more applicable and relevant to the target population (Bombak and Hanson 2017),
which would significantly decrease trial costs. Implementation of mechanisms for
patient engagement can vary.

Patient Engagement in Research and Drug Development

Acknowledging that patients are central to research and drug development, several
national and international organizations have invested in clearly defining the role of
patient involvement in research practices and the need for the development of
innovative infrastructures that will help facilitate the incorporation of the patient
voice in all stages of the research process, including design, execution, and transla-
tion of research (Domecq et al. 2014). The Patient-Centered Outcomes Research
Institute (PCORI) was established in 2010 to improve the quality and relevance of
evidence available to help stakeholders make better-informed health decisions and
requires that all its funded research projects include patient input throughout the
entire research study (www.pcori.org). Patient engagement has been defined by
PCORI as “involvement of patients and other stakeholders throughout the planning,
conduct, and dissemination of the proposed project” and is becoming institutional-
ized and incorporated into several funding schemes (PCORI 2018). Patient-driven
research activities have ranged from pre-discovery funding for development and
acquisition of animal models and cell lines all the way to post-market study design
and value discussions.
The US Food and Drug Administration (FDA) recognizes that patients are experts
on living with their conditions, and as such, their voice is uniquely positioned to inform
stakeholders and provide the right therapeutic context for drug development as well as
perspective on the outcome measures that are most relevant to patients and evaluation
by regulatory agencies (Anderson and McCleary 2016). Patients may voice their
concern or support for the development of certain drugs and provide a firsthand
perspective on the proper balance of risk to benefit for a particular disease or patient
population. For instance, the patient voice was crucial when reintroducing Tysabri, a
monoclonal antibody used to treat multiple sclerosis, which had been previously
removed from the market following reports of lethal side effects. After the thorough
review of safety information, the FDA convened an advisory committee where patients
and caregivers were invited to testify. Weighing all evidence, including the advocates’
testimonies, the FDA found enough support to remarket the drug under a special
prescription program (Schwartz and Woloshin 2015). Additionally, the FDA has
formalized several initiatives to encourage the inclusion of the patient voice in medical
product development. Under the fifth authorization of the Prescription Drug User Fee
Act (PDUFA V) signed into law in 2012, the FDA began the Patient-Focused Drug
Development (PFDD) program with the intent to more systematically incorporate the
patient perspective into drug development (FDA 2018). From 2012 to 2017, the FDA
organized 24 disease-specific PFDD meetings that have helped capture patients’
experiences, perspectives, and priorities and enabled the incorporation of this mean-
ingful information into the drug development process and its evaluation. Duchenne
muscular dystrophy advocacy organizations helped to exemplify how patient and
advocates can successfully inform regulators, provide meaningful input into benefit
and risk assessments, and identify treatment priorities. To build on this success and
enable more patient advocacy organizations to shape and influence drug development,
the 21st Century Cures Act and PDUFA VI have tasked FDA with developing
additional guidance to describe approaches to gather patient experience data, quanti-
fying benefit and risks, and using patient-reported outcomes in treatment development.
Moreover, the newly formed FDA Oncology Center of Excellence (OCE) has made
PFDD a priority and is exploring innovative regulatory strategies that incorporate
patient input. Additionally, the National Cancer Institute (NCI) also encourages patient
advocates to be involved in the clinical trial process. The SWOG Cancer Research
Network, one of five NCI cooperative cancer research groups, has an advocate assigned
to every research committee who is involved in every stage of the process.

Primary Areas of Engagement

A systematic review that searched for reporting of patient engagement in controlled
trials and nonrandomized comparative trials conducted from May 2011 to June 2016
reviewed 2777 citations, of which only 23 clinical trials (17 randomized controlled
trials and 6 nonrandomized comparative studies) reported patient engagement
practices (Fergusson et al. 2018). The methods of engagement most commonly
reported involved the development of the research question, selection of outcome,
dissemination and implementation of results, and other activities, such as the
refinement of the study intervention and protocol review (Fergusson et al. 2018).
Thus, there is evidence showing that researchers have engaged patients, especially in
trials that reported following the community-based participatory research (CBPR)
methods as part of the study design; however, there is still more work needed to
get patients meaningfully involved in clinical research. Innovative methodologies,
such as CBPR, which aim to have more meaningful relationships with the target
population and more effective dissemination and implementation of results are key
in improving patient involvement in research (Chhatre et al. 2018).
Another systematic review assessed patient engagement in research including
randomized control trials, qualitative studies, single cohort studies, cross-sectional
studies, case reports, and systematic reviews (Domecq et al. 2014). This study found
that engagement was feasible and most commonly done in the beginning of the research
process (agenda setting and protocol development) and less commonly during the
execution and translation of research. The study also found no comparative effective-
ness research on patient engagement methods. The authors concluded that the lack of
this evidence is what may have led to inconsistent and vague reporting of patient
engagement research, preventing the incorporation of effective reporting methods.
Using the 2014 Health Information National Trends Survey, one study investi-
gated three aspects of patient engagement (interest, awareness, and participation as
research partners in the medical research process) to identify different levels of
engagement and the barriers that prevent it (Hearld et al. 2017). The study
consisted of a cross-sectional analysis that suggested modest levels of interest in
engaging in the research process among respondents. The study also found low levels
of awareness of ways in which patients could become involved in research and very
low levels of actual participation. Several factors, such as patient health status,
attitudes about their health and healthcare, and sociodemographic characteristics,
were also examined to provide insights into the types of patients most likely to be
engaged in the research process. The study suggested that higher socioeconomic status and positive patient attitudes were associated with increased interest in becoming involved in research, but found no association between respondents' demographic, socioeconomic, and environmental characteristics and actual participation. The authors concluded that raising awareness of engagement opportu-
nities would improve people’s interest in being engaged in research. Moreover, they
suggested further research to identify why patients who may be aware of research
opportunities are still reluctant to become active participants of the research process.

Challenges Associated with Incorporating Patients into Research and Drug Development

Attitudes toward a more patient-centered or patient-focused approach to care and research are continuing to shift, in part because of the increasing awareness that active patient participation in research can improve the credibility of study findings and their direct applicability to patients. In addition to the benefits observed for study sponsors and participants, greater patient involvement is also driven by a compelling ethical rationale: the participation of patients democratizes the research process (Domecq et al. 2014).
Data show a compelling relationship between the incidence of clinical trial enrollment and improvement in cancer population survival, and a recent survey indicates the value patient engagement can have in improving patient retention and accelerating trial accrual (Smith et al. 2015; Unger et al. 2016). However, several
challenges and concerns remain about the way patient engagement is being
conducted (Bombak and Hanson 2017).

Barriers to Patient Engagement

The most commonly described barriers to patient engagement were related to logistics and concerns about tokenistic engagement (Domecq et al. 2014). Tokenism refers to involving patients superficially. This can occur when a small number of participants, who may be only minimally involved in the research process, are taken to represent a far larger and more diverse patient group. This insincere form of patient inclusion discourages patients from seeking greater involvement in the research process, and it lessens the credibility of the patient voice. Indeed, various studies have found that people frequently perceive participation in clinical trials as meaningless or disempowering (Mullins et al. 2014), yet people often want to be informed, empowered, and engaged in their medical management (Davis et al. 2005). Some programs may require patients to undergo intensive training and may demand substantial time, interest, and, potentially, resources (Bombak and Hanson 2017). These requirements may create a preference for observable or quantitative skills over instinct and intuition and may bias the perspectives shared as part of
the study. The lack of incentives or payment for a patient’s time may also be a barrier
for some patients to become engaged in research. Moreover, various erroneous perceptions have been identified as barriers to engagement. Some studies have identified the detrimental perception that patients will not be objective in their decisions and will become a hurdle in the design and development process, or that patients and advocates are naïve about the research process and funding problems (Hanley et al. 2001; Bombak and Hanson 2017). These barriers should be assessed in more detail, and greater effort should be devoted to overcoming any perceived drawback that prevents patients from engaging in scientific research.
Historically, few mechanisms existed for systematic engagement of patients in the drug development continuum, and in the rare cases in which structures for patient participation exist, they may be disorganized or confusing (Hohman et al. 2015). Efforts to overcome these gaps should be undertaken, and learning modules and
information are available to provide best practices. In recognition of these potential
barriers, many patient advocacy organizations have research training programs
designed specifically for patients to help inform and prepare them to support research
studies. They can also provide mechanisms to connect patients with opportunities to
participate on advisory boards and research teams to support the development of
clinical trials. Most notably, the National Breast Cancer Coalition developed Project
LEAD Institute, which provides a series of courses that establish a foundation
of scientific knowledge to empower patients to participate actively and collaborate
with physicians, industry, and regulatory agencies. In addition, Fight Colorectal
Cancer has a Research Advocacy Training and Support (RATS) program, and
Susan G. Komen and the American Association for Cancer Research also
have programs to train advocates to support research studies. The Clinical Trials
Transformation Initiative (CTTI), a public-private partnership, helps develop and
drive adoption of practices within physician and patient communities to support
patient engagement that will increase the quality and efficiency of clinical trials.
The inclusion of patients as reviewers and on research teams has led to
more appropriately designed trials and the development of innovative clinical trial
designs and statistical methods. Additionally, studies have demonstrated that patient
involvement in the design and development of clinical trials is necessary to improve
the efficiency and relevance of drug development and evaluation.

The Contribution of Patient Advocacy to Research and Drug Development

The incorporation of the patient voice has directly impacted the way trials are
designed and conducted (Mullins et al. 2014). The way in which clinical trials are
designed can transform the evidence generation process to be more patient centered,
providing people with an incentive to participate or continue participating in
clinical trials. Providing better information to participants and incorporating
alternative trial designs can help minimize concerns that clinical trials are not patient centered and dispel doubts that prevent patients from becoming
meaningful participants in the planning and design of clinical trials. Addressing the
concerns and desires of patients has led to innovative strategies and designs to make
trials more patient centric.

Trial Designs and Endpoint Selection

Many new therapies in oncology are molecularly targeted against specific oncogenic
driver mutations that may be present in only a fraction of the patient population.
Although the advent of targeted therapies holds great promise for patients, it also
means that many patients may need to be screened before enough patients harboring
the necessary mutation are found. Additionally, patients may not have the mutation
of interest and will potentially have to seek out a variety of trials before finding a
match. Master protocols are one mechanism to assist with the development and
investigation of targeted therapies (Woodcock and LaVange 2017). Perhaps one of
the greatest efficiencies of the collaborative clinical trial system is its increased
benefit to patients seeking access to genomic screening technologies and
experimental therapies. Rather than being forced to undergo multiple screening
attempts and to move from trial to trial before ever being matched with a trial and
treatment arm, patients who are screened for inclusion in a master protocol study
need only be tested once to have a high likelihood of eventually participating in the
study. The variety of patient subgroups that are evaluated over the course of a master
protocol, as well as the use of non-match substudies, greatly increases patients’
chances of receiving a study treatment. Moreover, patients who participate in
master protocols are given access to a broad-based screening technology such
as next-generation sequencing (NGS), which efficiently screens patients for a
multitude of genomic markers and matches them to treatment arms based upon
this information. Some select master protocols include the BATTLE program,
LUNG-MAP for patients with lung cancer, and NCI-MATCH for patients with
solid tumors, lymphomas, and myeloma.
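To illustrate the one-time matching step that master protocols enable, the minimal sketch below (in Python, with invented gene names, arm labels, and matching rules not drawn from any specific protocol) checks a single NGS report against the biomarker criteria of several substudies and routes biomarker-negative patients to a non-match substudy:

# Hypothetical sketch of master protocol matching: one NGS panel result is
# checked against every substudy's biomarker criteria, rather than screening
# the patient trial by trial. Gene names and arm labels are invented.
from typing import Dict, List, Set

SUBSTUDY_MARKERS: Dict[str, Set[str]] = {
    "Substudy A (EGFR-targeted agent)": {"EGFR"},
    "Substudy B (ALK-targeted agent)": {"ALK"},
    "Substudy C (PI3K-targeted agent)": {"PIK3CA"},
}
NON_MATCH_ARM = "Non-match substudy"

def match_substudies(ngs_alterations: Set[str]) -> List[str]:
    """Return every substudy whose biomarker criteria the NGS report satisfies."""
    matches = [arm for arm, markers in SUBSTUDY_MARKERS.items()
               if markers & ngs_alterations]
    # Biomarker-negative patients are routed to a non-match substudy rather
    # than being turned away, as described above.
    return matches or [NON_MATCH_ARM]

print(match_substudies({"PIK3CA", "TP53"}))  # ['Substudy C (PI3K-targeted agent)']
print(match_substudies({"KRAS"}))            # ['Non-match substudy']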
Other patient-centric trial designs include pragmatic trials, adaptive trials, and trials
that incorporate Bayesian statistics and allow patient crossover to the experimental
treatment (Mullins et al. 2014). Pragmatic clinical trials can produce results that more
accurately reflect the outcomes a typical person could expect to experience. Adaptive
clinical trial designs allow for modifications to occur partway through the study based
on information collected as the trial progresses. Bayesian methods allow trialists to combine prior information with the data accumulated during the course of the trial and are often employed within adaptive designs. The resulting Bayesian analysis describes the probability of a treatment effect directly. While these trials provide many advantages for patients, they do have limitations. They can create logistical complications attributable to data management and study design as well as pose risks to the interpretability of the trial results. Trials that allow patients to cross over to the treatment arm, if it is shown to be superior to the control arm, can attenuate the estimated treatment effect. Additionally, the specific therapy under study may dictate which trial design is optimal, particularly if interim results cannot be obtained in time to inform an adaptive methodology. The needs of patients must always be balanced against the need to generate solid evidence of efficacy.
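To make the Bayesian idea concrete, the following minimal sketch (with invented interim counts, flat Beta(1, 1) priors, and an arbitrary threshold; it is an illustration, not a complete adaptive design) computes the posterior probability that an experimental arm has a higher response rate than control, the kind of quantity an adaptive rule might act on:

# Minimal sketch of a Bayesian interim comparison of two response rates.
# Counts, priors, and the decision threshold are all hypothetical.
import numpy as np

rng = np.random.default_rng(seed=1)

control_responses, control_n = 12, 40      # invented interim data
treatment_responses, treatment_n = 19, 38

# A Beta(1, 1) prior updated with binomial data yields a Beta posterior
post_control = rng.beta(1 + control_responses,
                        1 + control_n - control_responses, 100_000)
post_treatment = rng.beta(1 + treatment_responses,
                          1 + treatment_n - treatment_responses, 100_000)

# Posterior probability that the experimental arm has the higher response rate
prob_better = float(np.mean(post_treatment > post_control))
print(f"P(treatment response rate > control) = {prob_better:.3f}")

# An adaptive rule might shift the randomization ratio or stop early when this
# probability crosses a prespecified threshold (0.95 here is arbitrary).
if prob_better > 0.95:
    print("Interim criterion met: adapt per the prespecified rule")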
It is important to engage patients early to understand the endpoints that matter
most to them in all settings and stages of a disease. Mortality, for example, is an
important outcome measure but is often not the only important outcome to patients.
Especially in circumstances when chances of survival are relatively low, other outcomes, such as avoiding unnecessary diagnostic procedures or prolonging progression-free survival (PFS), are also important to patients. Clinical trials, therefore, must be designed with
the patient’s needs and preferences in mind within a given disease context. While
certain endpoints may be more meaningful to researchers, these endpoints may
ultimately not be meaningful to the patient group affected by the clinical trial.
With the exception of validated surrogate endpoints, a primary endpoint should
generally be a measure of something that is important to the patient (Vroom 2012).
These endpoints should measure not only how a patient survives but also how a
patient feels and functions.
The ascertainment of certain meaningful clinical endpoints, however, may
be burdensome and time-consuming for researchers, hindering potentially
lifesaving access for patients to the innovation under investigation. Recognizing this
problem, Friends of Cancer Research and the Brookings Institution convened a panel
of experts at a 2011 conference to discuss potential methods for streamlining the
FDA approval process for drugs that show large treatment effects early in develop-
ment while still ensuring drug safety and efficacy. The discussion at this conference
informed the creation of the “Advancing Breakthrough Therapies for Patients Act”
which established the FDA’s Breakthrough Therapy Designation (BTD). This des-
ignation defines a breakthrough therapy as a drug intended to treat a serious or life-
threatening disease or condition and for which preliminary evidence indicates that
the drug may demonstrate substantial improvement over existing therapies (FDA
Fact Sheet: Breakthrough Therapies). Once BTD is requested by the drug sponsor,
the FDA and sponsor work together to determine the most efficient path forward, and
if the designation is granted, the FDA will work closely with the sponsor to help
expedite the development and review of the drug. Because innovative designation
and approval pathways such as BTD take into consideration novel approval end-
points for clinical trials demonstrating higher rates of benefit in carefully selected
patients, it is especially critical that patients are involved in identifying and defining
the endpoints most important to them.
Given the broad benefits associated with patient involvement in scientific
research and clinical trials, it is crucial to focus on greater dissemination and
awareness. Strategies for the uptake and implementation of mechanisms for patient
involvement should involve patients and patient advocates, health professionals, and
drug developers. The creation of more educational resources to support researchers
and patients when coordinating the incorporation of the patient voice in clinical
trials would also improve the uptake of these mechanisms.

Capturing and Measuring Patient Experience

The patient voice is increasingly being incorporated into regulatory decision-making and has enabled the creation of more modern regulatory pathways.
A patient’s and their caregiver’s experience with the disease and treatment-related
symptoms, which may alter their function and health-related quality of life, is
important. Capturing this rich experience from both patients and their caregivers
helps provide key outcome information to consider in the evaluation of new agents.
A recent policy review article written by international regulatory professionals
from the USA, Europe, and Canada highlights the need for capturing the patient
experience from different sources and focuses on the use of rigorous PRO measures
to facilitate the regulatory decision-making process (Kluetz et al. 2018). Among the
many advantages that PRO measures provide, these data are critical for supporting
the benefit-risk assessment of experimental agents and useful when incorporated
into prescribing and product information as descriptive data to inform safety
and tolerability (Kim et al. 2018) or as a claim of treatment benefit. This information
is particularly important for addressing quality-of-life concerns that patients and caregivers may have.

All international regulatory agencies acknowledge that robust and accurate data
collected from the patient experience can be useful, as it complements existing
measurements of safety and efficacy, but warn that poorly defined PRO methodology
using heterogeneous analytical methods greatly hinders the incorporation of
PRO data in regulatory decision-making (Kluetz et al. 2018; Kuehn 2018;
Bottomley et al. 2018). The review recommends that sustained international collaboration
among regulatory agencies is required to improve patient experience collection and
standardize the assessment, analysis, and interpretation of patient data from clinical
trials.
The FDA has recognized that a central aspect of PFDD is the use of patient-
reported outcomes (PROs) as a way to incorporate the patient voice in drug
development and regulatory decisions. PROs are reported directly by the patient and describe the patient's health, quality of life, or functional status
(FDA-NIH Biomarker Working Group 2016). PRO measures can provide a better
understanding of treatment outcomes and tolerability from a patient perspective and
complement current measures of safety and efficacy (Kim et al. 2018). In 2009,
the FDA released guidance for industry on the use of PROs in medical product
development to support labeling claims and has worked with other advocacy
organizations, such as the Critical Path Institute, and industry to form working
groups that seek to engage patients and caregivers in the development of robust
symptom-measuring tools, such as the PRO Consortium. Although challenges exist
when seeking to collect patient and caregiver experience data, such as the need
for more personalized and dynamic measuring tools that keep up with the diversity
of novel drug classes with a wide variety of toxicities, greater efforts to ensure
consistency, reliability, and applicability of these data are warranted to support
robust use in the drug development space.

Contributors to Data Generation

Patients and advocacy organizations are also actively establishing their own data
sources to support clinical drug development and, in some instances, establishing
their own clinical trials. These include patient registries, online data-sharing
communities, wearable devices, and social media tools for capturing longitudinal
data points. Organizations such as the Genetic Alliance, the National Organization
for Rare Disorders, and Parent Project Muscular Dystrophy have launched regis-
tries to study the natural history of disease, burden of disease, expectations for
treatment benefits, and perspectives on tolerable harms and risks. These tools can
help inform academia and industry and incentivize further study into a particular
disease state. Through public-private partnerships, advocacy organizations are
also initiating clinical trials within their patient communities. For example, the
Leukemia and Lymphoma Society is leading the Beat AML Master Trial, which is
a collaborative trial to test targeted therapies in patients with acute myeloid
leukemia (AML) (Helwick 2018). Principal investigators should look for opportunities to utilize and integrate these data collection efforts into their research questions and studies in order to develop innovative partnerships that improve research logistics, outreach and communication, funding, and the prioritization of
clinical trials.

Future Areas of Innovation and the Evolving Clinical Trial Landscape

There has been great progress in patient engagement in clinical trials and in the advances being made by patient advocacy groups, and additional areas of opportunity continue to be identified. The development of more refined frameworks, models, best practices, and guidelines will help ensure that early-career investigators have the foundational knowledge to meaningfully engage patients and advocacy organizations in their research questions and drug development programs. Biopharma is investing heavily to accelerate development timelines. TransCelerate BioPharma Inc., a nonprofit organization that creates collaborations across the biopharmaceutical research and development community, has recently launched a new initiative around patient awareness and access
(TransCelerate 2018). Toolkits are available to assist research teams in engaging patient
advocacy organizations and participants to optimize clinical trial designs. Additionally,
some healthcare systems are partnering with cognitive computing platforms to help
physicians match, enroll, and support patients (Bakkar et al. 2018).
The incorporation of external data sources to streamline, augment, and support
clinical trial development is growing rapidly, due in large part to the advent of
technological solutions that include patient collaboration programs, crowdsourcing,
and the collection of big data and analytics. The US FDA is currently developing
guidance and a framework to describe how real-world evidence can support drug
development and regulatory decision-making. These external data sources represent
an opportunity to augment clinical trial data and can potentially result in more
streamlined drug development with fewer patients. These novel mechanisms of
data collection, as well as their use and implementation, will continue to require
the involvement of active advocates and consumers, who, through their experience,
will contribute greatly to the oversight and eventual success of future clinical trials.

Cross-References

▶ Bayesian Adaptive Designs for Phase I Trials


▶ Cross-over Trials
▶ Implementing the Trial Protocol
▶ Orphan Drugs and Rare Diseases
▶ Participant Recruitment, Screening, and Enrollment
▶ Patient-Reported Outcomes
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data
▶ Trials in Minority Populations

References
Anderson M, McCleary KK (2016) On the path to a science of patient input. Sci Transl Med 8:1–6.
https://doi.org/10.1126/scitranslmed.aaf6730
Bakkar N, Kovalik T, Lorenzini I et al (2018) Artificial intelligence in neurodegenerative disease
research: use of IBM Watson to identify additional RNA-binding proteins altered in
amyotrophic lateral sclerosis. Acta Neuropathol 135:227–247. https://doi.org/10.1007/s00401-
017-1785-8
Bombak AE, Hanson HM (2017) A critical discussion of patient engagement in research. J Patient
Cent Res Rev 4:39–41. https://doi.org/10.17294/2330-0698.1273
Bottomley A, Pe M, Sloan J et al (2018) Moving forward toward standardizing analysis of quality of
life data in randomized cancer clinical trials. Clin Trials 15:624–630. https://doi.org/10.1177/
1740774518795637
Chhatre S, Jefferson A, Cook R et al (2018) Patient-centered recruitment and retention for
a randomized controlled study. Trials 19:205. https://doi.org/10.1186/s13063-018-2578-7
Comis RL, Miller JD, Aldigé CR et al (2003) Public attitudes toward participation in cancer clinical
trials. J Clin Oncol 21:830–835. https://doi.org/10.1200/JCO.2003.02.105
Davis K, Schoenbaum SC, Audet AM (2005) A 2020 vision of patient-centered primary care.
J Gen Intern Med 20:953–957. https://doi.org/10.1111/j.1525-1497.2005.0178.x
Domecq JP, Prutsky G, Elraiyah T et al (2014) Patient engagement in research: a systematic review.
BMC Health Serv Res 14:1–9
FDA (2018) FDA voices: perspectives from FDA experts. https://www.fda.gov/newsevents/news
room/fdavoices/default.htm. Accessed 12 Nov 2018
FDA Fact Sheet: Breakthrough Therapies. https://www.fda.gov/regulatoryinformation/lawsenfor
cedbyfda/significantamendmentstothefdcact/fdasia/ucm329491.htm. Accessed 12 Nov 2018
FDA-NIH Biomarker Working Group (2016) BEST (Biomarkers, EndpointS, and other Tools)
Resource [Internet]. Food and Drug Administration (US), Silver Spring; Co-published by
National Institutes of Health (US), Bethesda
Fergusson D, Monfaredi Z, Pussegoda K et al (2018) The prevalence of patient engagement in
published trials: a systematic review. Res Involv Engagem 4:17. https://doi.org/10.1186/
s40900-018-0099-x
Hanley B, Truesdale A, King A et al (2001) Involving consumers in designing, conducting, and
interpreting randomised controlled trials: questionnaire survey. BMJ 322:519–523
Hearld KR, Hearld LR, Hall AG (2017) Engaging patients as partners in research: factors associated
with awareness, interest, and engagement as research partners. SAGE Open Med
5:205031211668670
Helwick C (2018) Beat AML trial seeking to change treatment paradigm. [Internet] The ASCO Post
Hohman R, Shea M, Kozak M et al (2015) Regulatory decision-making meets the real world.
Sci Transl Med 7:313fs46. https://doi.org/10.1126/scitranslmed.aad5233
Kaitin K (2013) 89% of trials meet enrollment, but timelines slip, half of sites under-enroll.
Tufts Cent Study Drug Dev Impact Rep 15:1–4
Kim J, Singh H, Ayalew K et al (2018) Use of pro measures to inform tolerability in oncology trials:
implications for clinical review, IND safety reporting, and clinical site inspections. Clin Cancer
Res 24:1780–1784. https://doi.org/10.1158/1078-0432.CCR-17-2555
Kluetz PG, O’Connor DJ, Soltys K (2018) Incorporating the patient experience into regulatory
decision making in the USA, Europe, and Canada. Lancet Oncol 19:e267–e274. https://doi.org/
10.1016/S1470-2045(18)30097-4
Kuehn CM (2018) Patient experience data in US Food and Drug Administration (FDA) regulatory
decision making: a policy process perspective. Ther Innov Regul Sci 52:661–668. https://doi.
org/10.1177/2168479017753390
Manganiello M, Anderson M (2011) Back to basics: HIV/AIDS advocacy as a model for catalyzing
change. AIDS 1–29. https://www.fastercures.org/assets/Uploads/PDF/Back2BasicsFinal.pdf
Mullins CD, Vandigo J, Zheng Z, Wicks P (2014) Patient-centeredness in the design of clinical
trials. Value Health 17:471–475. https://doi.org/10.1016/j.jval.2014.02.012
PCORI (2018) The value of engagement. https://www.pcori.org/about-us/our-programs/engage
ment/value-engagement. Accessed 12 Nov 2018
Schwartz L, Woloshin S (2015) FDA and the media: lessons from Tysabri about communicating
uncertainty. NAM Perspect 5. https://doi.org/10.31478/201509a
Smith SK, Selig W, Harker M et al (2015) Patient engagement practices in clinical research among
patient groups, industry, and academia in the United States: a survey. PLoS One 10:e0140232
Tinetti ME, Basch E (2013) Patients’ responsibility to participate in decision making and research.
JAMA 309:2331–2332. https://doi.org/10.1001/jama.2013.5592
TransCelerate (2018) Patient experience. http://www.transceleratebiopharmainc.com/initiatives/
patient-experience/. Accessed 12 Nov 2018
Unger JM, Cook E, Tai E, Bleyer A (2016) The role of clinical trial participation in cancer research:
barriers, evidence, and strategies. Am Soc Clin Oncol Educ Book 35:185–198. https://doi.org/
10.14694/EDBK_156686
Vroom E (2012) Is more involvement needed in the clinical trial design & endpoints?
Orphanet J Rare Dis 7:A38. https://doi.org/10.1186/1750-1172-7-S2-A38
Woodcock J, LaVange LM (2017) Master protocols to study multiple therapies, multiple diseases,
or both. N Engl J Med 377:62–70. https://doi.org/10.1056/NEJMra1510062
31 Training the Investigatorship
Claire Weber

Contents
Introduction
Trial Sponsor Team
The Sponsor Quality Manual and Quality System
External Training and Study-Specific Training
GCP CSPs [Including Contract Research Organizations (CROs)]
Investigator Site Team
PI Delegation of Authority
Site Monitoring Visits
Other Training Meetings
Other External Teams
Training Documentation and Files
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
The Investigatorship for clinical trials is a team with specialized experience who
are qualified by training and experience to successfully execute clinical trials. The
Investigatorship includes the trial sponsor, Good Clinical Practice (GCP) Con-
tract Service Providers (CSPs), and site Investigators, who may also include
outside experts such as Key Opinion Leaders (KOLs) and Data Monitoring
Boards (DMBs). GCP is the fundamental required training for all team members
conducting clinical trials. Training occurs throughout the lifecycle of the trial, and
each member of the team must have records of adequate training and qualifica-
tions to conduct the study for their identified role. This chapter explains the
Investigatorship members, the types of training conducted, and how training is
documented.

Keywords
Documentation · Monitoring · Delegation · File · Quality system

Introduction

Training is an essential and required component of conducting successful clinical trials. The Investigatorship must be qualified and trained on Good Clinical Practice
(GCP), the trial protocol under study, the use of the investigational product(s) (IP),
standard operating procedures (SOPs), trial protocol design and operations, the local and regional regulations and guidelines for clinical research, and Clinical Trial Applications (CTAs). Training is a part of risk control, in that training activities provide systematic safeguards to ensure adherence to standard operating procedures and to established processes and procedures.
The Investigatorship for executing the clinical trial consists of trial sponsor teams,
GCP Contract Service Providers (CSPs), and Investigator site teams. The
Investigatorship may also include Investigators such as Key Opinion Leaders
(KOLs) and independent Data Monitoring Boards (DMBs) who provide specialized
expertise.
Training occurs within each sector of the Investigatorship, and the foundation for
all clinical trials training is GCP. The trial sponsor is responsible for ensuring that
each Investigatorship team member is appropriately qualified and trained relevant to
their function, and that training is documented. This chapter describes the
Investigatorship team, types of training, and training records maintained during the
trial lifecycle.

Trial Sponsor Team

The trial sponsor team is made up of individuals who, based on their training and experience, will submit the clinical trial application (CTA) for the investigational
product (IP) under study and plan and implement the trial ensuring compliance with
International Council for Harmonisation (ICH) GCP and regulatory requirements.
The trial sponsor team includes qualified individuals in functional areas including
clinical science, clinical operations, technical operations (also known as supply
chain operations), information technology, biostatistics, data management, regula-
tory affairs, pharmacovigilance, and quality assurance. Since the trial sponsor team
is responsible for the submission documents for the CTA, they are also accountable
for the overall oversight of the clinical trial. It is therefore requisite that the trial sponsor team has adequate training in, and knowledge of, global regulations and guidelines so that this training can be applied throughout the trial.

The Sponsor Quality Manual and Quality System

The trial sponsor develops and maintains a quality manual or equivalent document
that describes the quality system in their organization. The manual explains the
organizational structure and quality assurance of the sponsor team functions. The
quality manual also refers to required trial sponsor team training requirements and
types of procedural training for controlled documents. In addition, it is customary for
the quality system to describe how issues are escalated and how continuous improvement areas are identified and addressed in managing quality and training for the implementation of clinical trials.
The qualifications of the trial sponsor team are documented for each team
member in curriculum vitae’s (CVs) and licenses relevant to their job description.
Each trial sponsor team member is required to have adequate training documentation
to perform their duties that are identified in job descriptions.
The hierarchy for training of controlled documents in the trial sponsor quality
system is described in Fig. 1, with the quality manual as the highest-level document,
and policies, procedures, work instructions, and records as successively lower levels, in that order.
Training requirements are recorded in a training curriculum for each sponsor team
member. An example training curriculum is described in Table 1.
Controlled document training can be performed in person as on-the-job training, group training, or read-and-understand training. The trainee signs training documentation confirming the date of the training, and the documentation is maintained in a sponsor trial master file.

Fig. 1 Organization of quality system documents

Table 1 Example of sponsor team training curriculum

Type/document number | Title | Biostatistics | Clinical research | Data management | Quality assurance | Regulatory affairs
SOP-00001 | GCP training procedure | X | X | X | X | X
GCP training | GCP annual training session | X | X | X | X | X
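A role-based curriculum of this kind lends itself to a simple, automatable completeness check. The Python sketch below (with invented document identifiers and trainee records; it is an illustration, not a prescribed system) maps each functional role to its required controlled documents and reports any missing sign-offs:

# Hypothetical sketch of a curriculum-compliance check against a role-based
# training matrix like the one in Table 1. Identifiers are invented.
REQUIRED_TRAINING = {
    "Biostatistics": {"SOP-00001", "GCP annual training session"},
    "Quality assurance": {"SOP-00001", "GCP annual training session"},
}

def training_gaps(role: str, completed: set) -> set:
    """Return required trainings that the team member has not yet documented."""
    return REQUIRED_TRAINING.get(role, set()) - completed

# Example trainee record with one documented training
print(training_gaps("Biostatistics", {"SOP-00001"}))
# -> {'GCP annual training session'}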

External Training and Study-Specific Training

Sponsor teams may arrange trial-specific trainings and may attend courses such as
seminars, webinars, or conferences to further their education and skills specific to
their job duties. For these trainings, the trainee will retain a certificate of attendance or the agenda/sign-in log and maintain it in the sponsor trial master file.

GCP CSPs [Including Contract Research Organizations (CROs)]

GCP CSPs are another important part of the Investigatorship. GCP CSPs are pro-
viders who perform trial development and execution services such as data manage-
ment, statistical analysis, Randomization and Trial Supply Management (RTSM),
and laboratory analysis. Contract Research Organizations (CROs) are one type of GCP CSP, specific to study monitoring.
Contract Research Organizations (CROs) are defined as:

A person or an organization (commercial, academic, or other) contracted by the sponsor to perform one or more of a sponsor's trial-related duties and functions. (ICH E6 (R2) Glossary
Section 1.20)

Each GCP CSP team member is also qualified by training and experience to
perform their job duties.
Each GCP CSP will maintain a quality system, training curricula, and documen-
tation in a similar way to the trial sponsor team. The trial sponsor team maintains
adequate oversight of the GCP CSPs to ensure that the GCP CSP staff are qualified
and have a training system and documented training records. For US Investigational
New Drug (IND) trials, the transfer of regulatory obligations from the trial sponsor team to the GCP CSP for important functions identified in the Code of Federal Regulations (CFR) is documented by the trial sponsor on the FDA Form 1571 Investigational New Drug Application (Section 15) and forwarded to the FDA.

The trial sponsor team provides specialized training (e.g., detailed IP and trial-
specific training) to the GCP CSPs throughout the trial, and this training is
documented and maintained in the trial master file. It is important to note that the
trial sponsor team and GCP CSP team partner on many aspects to implement the
clinical trial, and training development and management of training records is an
essential part of this collaboration.

Investigator Site Team

The Investigator site team is made up of lead Investigators [also known as principal
investigators (PIs)], subinvestigators, and other site study personnel who are respon-
sible for executing the trial according to GCP, health authority, institutional review
board (IRB)/ethics committee (EC), and local regulations and guidelines. The PI is
defined as:

A person responsible for the conduct of the clinical trial at a trial site. If a trial is conducted
by a team of individuals at a trial site, the investigator is the responsible leader of the team
and may be called the principal investigator (ICH E6 (R2) Glossary Section 1.34).

A subinvestigator is defined as:

Any individual member of the clinical trial team designated and supervised by the investi-
gator at a trial site to perform critical trial-related procedures and/or to make important trial-
related decisions (e.g., associates, residents, research fellows) (ICH E6 (R2) Glossary
Section 1.56).

The PI and subinvestigators who are responsible for the conduct of the study
under a US IND are documented on the FDA 1572 form or equivalent. The PI
supervises the investigation at the Investigator site, and other members of the
Investigator team may include subinvestigators, study coordinators, pharmacists,
and laboratory personnel.
Each Investigator site team member must be qualified by education and experi-
ence to perform their functions at the study site, and the PI has overall responsibility
for delegating tasks to other qualified team members. The site/institution will also
have a quality system and procedures requiring training.

PI Delegation of Authority

The delegation by the PI is documented in a Delegation of Authority Log that is updated as applicable during the trial. An example of a delegation of authority log is
as follows:

Delegation of Authority Log

[STUDY NAME]
Site Number:

The purpose of this form is to: a) serve as the Delegation of Authority Log and b) ensure that the individuals performing study-related tasks/procedures are appropriately trained and authorized by the investigator to perform the tasks/procedures. This form should be completed prior to the initiation of any study-related tasks/procedures. The original form should be maintained at your site in the study regulatory/study binder. This form should be updated during the course of the study as needed.

Tasks/procedures that may be delegated (marked as applicable for each individual):
• Obtain Informed Consent
• Assess Inclusion and Exclusion Criteria
• Medical History
• Medication History/Concomitant Medication
• Physical Examination
• Collect Vital Signs
• Review Vital Signs and Labs for Clinical Significance
• Laboratory Specimen Collection/Shipping
• Administration of Investigational Product (IP)
• IP Accountability
• AE Inquiry and Reporting
• AE/SAE Interpretation (severity/relationship to IP)
• Case Report Form (CRF) Completion
• Source Document Completion
• Regulatory Document Maintenance
• Administrative
• Other (specify)

For each individual, the log records (please print): NAME, STUDY ROLE, SIGNATURE, INITIALS, and DATES OF STUDY INVOLVEMENT.

I certify that the above individuals are appropriately trained, have read the Protocol and pertinent sections of 21 CFR 50 and 56 and ICH GCPs, and are authorized to perform the above study-related tasks/procedures. Although I have delegated significant trial-related duties, as the principal investigator, I still maintain full responsibility for this trial.

Investigator Signature:                    Date:

Source: National Institutes of Health (NIH) Delegation of Authority Log, Version 2.0, 24 April 2014
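Sites increasingly keep such logs electronically. A minimal Python sketch of how log entries might be represented and checked follows (names, tasks, and dates are invented, and this is an illustrative data model, not the format of the NIH template):

# Illustrative sketch: Delegation of Authority Log entries as records, with a
# check that a task performed on a given date was actually delegated.
from dataclasses import dataclass
from datetime import date
from typing import Set

@dataclass
class Delegation:
    name: str        # team member, as printed on the log
    tasks: Set[str]  # delegated tasks/procedures
    start: date      # dates of study involvement
    end: date

LOG = [
    Delegation("J. Smith, Study Coordinator",
               {"Obtain Informed Consent", "Collect Vital Signs"},
               date(2021, 1, 4), date(2021, 12, 31)),
]

def is_authorized(name: str, task: str, on: date) -> bool:
    """True if the named person was delegated the task for the given date."""
    return any(d.name == name and task in d.tasks and d.start <= on <= d.end
               for d in LOG)

print(is_authorized("J. Smith, Study Coordinator",
                    "Obtain Informed Consent", date(2021, 6, 1)))  # True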

The trial sponsor team and/or the GCP CSP team collect training qualification
documentation from the Investigator site team members (e.g., CVs, licenses, docu-
mentation of GCP training, etc.). They also provide the Investigator site team with
specialized trial-specific trainings at site monitoring visits, Investigator group meet-
ings, and other trial meetings. Each of these trainings is documented and forwarded
to the trial master file and maintained in the Investigator site files.

Site Monitoring Visits

A site monitor identified by the trial sponsor team and/or the CSP has the respon-
sibility of monitoring the Investigator site. Monitoring is defined as:

The act of overseeing the progress of a clinical trial, and of ensuring that it is conducted,
recorded, and reported in accordance with the protocol, standard operating procedures
(SOPs), GCP, and the applicable regulatory requirement(s). (ICH E6 (R2) Glossary Section
1.38).

The site monitor performs monitoring as part of a monitoring plan according to the trial sponsor and/or GCP CSP standard operating procedures. The site monitor
conducts a pre-qualification visit prior to selecting the site, which includes an
assessment that the Investigator site team is qualified by training and experience.
After selection, the site monitor performs a site initiation visit ensuring that the
Investigator site team is properly trained to conduct the study, including review of
the Delegation of Authority Log, team member qualifications, start-up, recruitment, enrollment, IP administration and accountability, study protocol procedures, file maintenance, electronic systems such as electronic data capture (EDC) and RTSM, and any other operational requirements for the trial.
The monitoring visits are documented in monitoring visit reports, and a follow-up
letter is sent to the PI confirming the activities performed at each monitoring visit.
These reports and letters are filed in the sponsor trial master file. The visit report is
defined as:

A written report from the monitor to the sponsor after each site visit and/or other trial-related communication according to the sponsor's SOPs. (ICH E6 R2 Glossary Section 1.39)

To document initial training prior to the commencement of the trial, the site initiation monitoring visit report is required to be filed at the Investigator site:

To document that trial procedures were reviewed with the investigator and the investigator’s
trial staff. (ICH E6 R2, Section 8.2.20)

During the study, the site monitor conducts monitoring visits at regular intervals,
and at the end of the study, the site monitor conducts a close-out visit. Each interim
visit may include training as applicable, and further reviews of the Delegation of
Authority Log to ensure the Investigator site team continues to be trained and
qualified. The close-out visit includes specific training about final trial documenta-
tion and closure.

Other Training Meetings

Investigator meetings and communications between the trial sponsor team, GCP
CSP team, and Investigator site team throughout the study are held to ensure
adequate study training for new and existing team members as applicable.

Other External Teams

Other external team members such as KOLs and DMBs are qualified by education
and experience to provide their expertise for the study. Training and communications
with these external teams are documented according to the trial sponsor's required
training procedures and filed in the sponsor trial master file.

Training Documentation and Files

Training documentation can be summarized as follows:



• Quality system and controlled document trainings (e.g., procedures, policies, etc.)
• GCP training certifications
• Professional training certifications
• Continuing education unit (CEU) accreditations
• CVs, job descriptions, and relevant licenses documenting qualifications
• Electronic system training, including granting and revoking access to systems (e.g., EDC, RTSM, pharmacovigilance system, etc.)
• Trial-specific training including agendas and attendance at Investigator meetings
and other communications
• Site pre-qualification visits, initiation visits, interim visits, and close-out visits
• Monitoring visit reports and follow-up letters to Investigators

Training files for the entire Investigatorship are maintained as part of the sponsor
trial master file and Investigator site files and must be made available to regulatory
authorities upon request.

Summary and Conclusion

The Investigatorship is made up of the trial sponsor team, GCP CSPs, and Investi-
gator site teams, which may include external Investigator experts. Each team member
must be adequately qualified and trained to perform their duties to conduct the
clinical trial, and Investigators must adequately delegate authority to appropriate
team members. GCP training is the foundation of all training for clinical trials.
Training documentation is an important aspect of the trial and is maintained
throughout the trial by the sponsor in the sponsor trial master file, and by the
Investigators in the Investigator site file.

Key Facts

The facts covered in this chapter include: definitions of the Investigatorship team; their roles, duties, and requirements; and the overall guiding principles of ICH GCP for ensuring training and maintenance of the Investigatorship site files.

Cross-References

▶ Investigator Responsibilities
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance

References
Code of Federal Regulations, Title 21, Part 312
Department of Health and Human Services, Food and Drug Administration, Investigational New
Drug Application (IND) (Title 21 Code of Federal Regulations (CFR) Part 312) FDA 1572
(21 CFR 312.53(c))
Department of Health and Human Services, Food and Drug Administration, Investigational
New Drug Application (IND) (Title 21 Code of Federal Regulations (CFR) Part 312) Form
1571 03/19
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice, Section
8.2.20
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.20
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.56
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.38
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.34
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.39
National Institutes of Health (NIH) Delegation of Authority Log, Version 2.0, 24 April 2014
32 Responsibilities and Management of the Clinical Coordinating Center
Trinidad Ajazi

Contents
Introduction
Responsibilities of the Clinical Coordinating Center
Clinical Trial Development and Operations
Site Selection
Site Management
Regulatory Compliance
Trial Sponsorship and Investigational New Drug Application
Inspection Readiness
Quality Management
Standard Operating Procedures
Quality Control
Quality Assurance
Oversight of CROs
Adverse Event and Safety Monitoring
Management of Clinical Coordinating Centers
Research Administration
Industry Collaborations
Institutional Approval of Clinical Coordinating Center
Management
Resource Management
Operational Efficiency and Project Management
Risk Management
Clinical Coordinating Center and Trial Governance
Clinical and Data Coordinating Center Integrated Functions
Clinical Trial Network Group Coordinating Center
Summary
Key Facts
Cross-References
References


Abstract
Clinical coordinating centers of investigator-initiated multi-site clinical trials
have a myriad of responsibilities throughout the life cycle of clinical trials from
trial concept development to completion. At the core of all clinical research is the
dual mandate to protect human subjects and ensure trial data integrity. National
regulations and international guidelines are designed to enable regulatory com-
pliance and achievement of these mandates. Integrated within clinical coordinat-
ing center activities are quality management mechanisms designed to monitor,
control, and assure patient safety and data integrity.
This chapter summarizes the responsibilities of the clinical coordinating center
with emphasis on efficient trial development and site selection, presents regula-
tory compliance requirements, focuses on practices for quality management, and
describes clinical coordinating center management and network groups.

Keywords
Multi-site · Investigator-initiated · Clinical coordinating center · Clinical trial
operations · Site management · Quality management · Regulatory compliance ·
Sponsor oversight · Research administration

Introduction

Clinical trials are often initiated by investigators in collaboration with their col-
leagues across multiple institutions. These multi-site investigator-initiated trials
require coordinating centers to lead the implementation of the trial. The primary
types of multi-site coordinating centers are the clinical coordinating center (CCC)
and the data coordinating center (DCC). The clinical coordinating center is respon-
sible for clinical trial operations, and the data coordinating center is responsible for
statistics and data management functions. Some of the functions performed by the
CCC and DCC may be combined under one coordinating center. The clinical and
data coordinating centers may be integrated under one organization or reside in
separate organizations. Collectively the CCC and DCC are responsible for imple-
mentation of multi-site clinical trials.
The clinical coordinating center of a multicenter research network provides the
infrastructure for operationalizing clinical trials across participating centers. The
primary responsibilities of the clinical coordinating centers include clinical trial
operations and site management, quality management, regulatory compliance, com-
munications, administration, and study results publication. The CCC is broadly
responsible for oversight of all trial-related activities, inclusive of applicable clinical
trial sponsor responsibilities.
In general, investigators seek to advance scientific discovery and positively
impact healthcare practices. Clinical research in a global environment has grown
increasingly complex. Regulatory oversight is paramount in all aspects of clinical
trials. At the core of all operational activities is the dual mandate to protect the rights
and well-being of human subjects and to ensure research data integrity. Management
of the CCC requires strong leadership and clinical research expertise. This chapter
focuses on the responsibilities and management of the clinical coordinating center
and is written from the perspective of nonprofit, academic-based clinical trial centers.

Responsibilities of the Clinical Coordinating Center

The clinical coordinating center is responsible for administrative support of governance and leadership functions, clinical trial development, project management, site
management, regulatory affairs and compliance, quality control and assurance,
research administration, trial registration and reporting, publication, and other oper-
ational services. How functional units are organized within coordinating centers varies, and their responsibilities overlap. The statistical and data management functions of the DCC are critical to
trial implementation and should be synchronized with clinical trial operations. These
activities are represented in Table 1 (Clinical coordinating center areas of responsi-
bility). Some of these activities are further described below; research administration
is described in the Management of Clinical Coordinating Centers section of the
chapter.

Clinical Trial Development and Operations

Clinical trial development begins with concept development by scientific investigators. The study design, with its primary and secondary endpoints, is subject to scientific review both at the coordinating center level and at the level of the funding
agency or program (e.g., National Institutes of Health (NIH)) sponsoring the
research. In addition to scientific review, the concept should also undergo an
operational review with preliminary assessment of feasibility for required resources,
trial budget and funding, specialized procedures, competing trials, patient
populations, accrual barriers, Medicare coverage analysis, and other potential logis-
tical challenges.
Clinical trial development is a team effort. Central to development are the project
managers or protocol coordinators, the multi-site principal investigator (study PI),
and the study statistician. These roles are key to efficient study development. Tools that enable operational efficiency include an upfront project plan, timeline projections, and accepted model protocol and informed consent document templates.
Protocols may include several co-chairs, including co-chairs for translational
research aspects of the study. Other key members of the protocol development
team include the medical officer, data manager, statistical programmer, study elec-
tronic data capture (EDC) builder, biorepository and reference laboratory personnel,
regulatory affairs manager, and liaisons from disciplines such as nursing and phar-
macy, as applicable. Additional reviewers responsible for assessing logistical feasi-
bility at clinical sites and patient engagement include clinical research professionals,
community investigators, and patient advocates. The study PI is the primary author of the protocol.

Table 1 Clinical coordinating center areas of responsibility

Clinical trial development and operations: Project management; Study design/concept development; Protocol development; Trial activation; Protocol revision/amendment; Trial closure/termination; Trial publication; Committee support; Medical and clinical oversight; Research collaborations; Investigational product management; Trial-related reporting.

Site selection and management: Site selection; Site start-up; Site management; Patient recruitment/accrual enhancement; Site and personnel roster and credential management; Site retention.

Collaboration with statistics and data management (in the areas noted): Statistical design; Statistical analysis plan; Patient enrollment and randomization; Data collection; Data management; Data quality monitoring; Data and statistical analysis; Study reporting; Publication; Data privacy and security; Data and safety monitoring board reporting.

Regulatory affairs and compliance: Investigational new drug (IND); Investigational device exemption (IDE); Institutional review board/ethics committee; Clinical trial registration; Conflict of interest/financial disclosure; Trial master file (TMF) maintenance/essential regulatory documents; Data and safety monitoring board support; Pharmacovigilance/drug safety management; Scientific misconduct/serious GCP breach; Compliance: Code of Federal Regulations (Food and Drug Administration/Office for Human Research Protections); Compliance: International regulatory authorities; Data privacy and security.

Information technology/systems (shared with DCC): Patient enrollment and randomization; Electronic data capture; Clinical trial management system (CTMS); Trial master file (TMF); Reporting system; Website; Learning management system (LMS); Adverse event/safety management.

Quality management: Policies and standard operating procedures; Quality control/monitoring; Centralized monitoring; Key performance evaluation; On-site clinical monitoring; Quality assurance; Audit; Inspection readiness; Corrective and preventive action; Good clinical practice (GCP) training.

Communications and education: Public relations; Newsletters; Education and training; Meetings management; Website content management; Social media strategy; Patient advocate relationships; Other partnerships.

Research administration and finance: Budgeting and finance; Grants administration; Legal, contracts, and trial agreements; Fundraising; Human resources; Governance (Steering/executive committee; Publication committee/charter).
The protocol document and related model informed consent document undergo
several stages of authoring, review, and revision, based on the standard operating
procedure (SOP) of the coordinating center. Case report forms are developed in
parallel. Prior to release and activation of the study to participating sites for imple-
mentation, the relevant funding agency (e.g., NIH), regulatory authorities (e.g., Food
and Drug Administration (FDA)), central institutional review board (CIRB), and
other regulatory review bodies must review and approve the protocol. Prior to
implementation at the site level, the institutional review board (IRB) or ethics
committee of record must review and approve the protocol, according to relevant
regulations. See Code of Federal Regulations (CFR) Title 21, Part 50 (Protection of Human Subjects) and Part 56 (Institutional Review Boards).
During the life cycle of a clinical trial, the protocol, clinical operations, and
statistical teams are responsible for amending the protocol, when monitoring of the
clinical trial signals the need to adjust trial parameters to improve safeguards for
patient safety or mitigate logistical barriers in the conduct of the trial.

Site Selection

Selection of high-performing clinical trial sites is key to the successful initiation and
completion of a clinical trial. Site personnel directly control trial participant recruit-
ment, informed consent, visit schedules, trial procedures, data collection, and data
submission. The CCC chooses sites with demonstrated ability to conduct clinical
trials, in order to best ensure the integrity of research. High-performing sites have the
structure, resources, and standard processes to recruit participants, enroll patients on
study, adhere to protocol requirements, comply with applicable regulations, and
submit data in a timely manner. Investigators at high-performing sites promote
clinical trial participation, assess adverse events in real time, and provide necessary
oversight of sub-investigators and research personnel.
Substandard sites are prone to GCP non-compliance, low accrual rates, protocol
deviations, and delinquent and low-quality data submission. Such sites consume disproportionate CCC resources for monitoring and management, with low returns. It is
imperative to select qualified sites for successful clinical trial initiation, execution,
and completion.
Depending on the complexity of trial requirements and the number of sites
needed for successful trial completion, site selection could occur in multiple steps:

1. Identification of potential sites.
2. Site feasibility questionnaire and evaluation.
3. Site qualification visit, if applicable.
4. Final site selection.

During the site selection and startup process, coordinating center staff evaluate
the following considerations:

1. Investigator credentials and research interests (curriculum vitae).
2. Investigator financial conflict of interest disclosure.
3. Patient population, registries.
4. Recruitment or outreach practices.
5. Previous history related to contracting and clinical trial startup timelines.
6. Institutional review board and site scientific committee review timelines.
7. Facilities and equipment, including specialized study-specific requirements.
8. Laboratory and biospecimen processing capabilities, as applicable.
9. Experience with clinical trial technologies, e.g., electronic data capture, elec-
tronic medical records, and interactive response technology.
10. Staff resources, experience and training (Good Clinical Practice (GCP) at
minimum).
11. Site policies and standard operating procedures.
12. Quality reports, e.g., available quality assurance reports.
13. Regulatory inspection reports, e.g., FDA debarment and FDA inspection
databases.
14. Satellite or affiliate management and oversight plan, if applicable.

Investigators in the FDA debarment database are not allowed to participate in clinical trials. The coordinating center staff need to assess the applicability of any FDA Form 483 that has been issued to the site. The FDA Form 483 is issued to firm management at the conclusion of an inspection when an investigator has observed conditions that, in their judgment, may constitute violations of the Food, Drug, and Cosmetic (FD&C) Act and related acts. Coordinating center staff carefully consider
the effectiveness of any applicable corrective and preventive action plan (CAPA)
implemented by the site in response to an FDA 483.
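To make the screening step concrete, the following sketch (in Python, purely illustrative) checks a candidate site against a debarment list and the feasibility checklist. The field names, checklist items, and debarment entries are all hypothetical; a real implementation would draw the debarment set from the FDA's public database and the checklist items from the feasibility questionnaire above.

```python
# Hypothetical pre-selection screen; all names and fields are illustrative.
DEBARRED_INVESTIGATORS = {"dr_example_debarred"}  # would come from FDA's list

def passes_preselection(site):
    """Return True if the PI is not debarred and all required checklist
    items from the feasibility questionnaire were confirmed."""
    if site["pi"] in DEBARRED_INVESTIGATORS:
        return False
    required = {"irb_timeline_ok", "gcp_training_current", "edc_experience"}
    return required <= site["checklist"]  # subset test: all items present

candidate = {
    "pi": "dr_a_smith",
    "checklist": {"irb_timeline_ok", "gcp_training_current", "edc_experience"},
}
print(passes_preselection(candidate))  # -> True
```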

Site Management

Following site selection, the CCC works with participating sites to prepare them for
activation of the trial. Site startup activities may include conducting a site initiation visit (SIV) to confirm that sites have the appropriate facilities and equipment, to provide study-specific training to investigators and site staff, and to review and collect regulatory documentation. Release of investigational product could be contingent on
successful completion of an SIV, in addition to execution of the clinical trial
agreement, IRB approval, access and training in the electronic data capture system,
and other clinical trial management systems, as appropriate.
The CCC is responsible for tracking participating sites and active research
personnel. A clinical trial management system (CTMS) is useful for maintaining
site contacts and other relevant information. The CCC is responsible for maintaining
a trial master file (TMF) of essential documents and ensuring that participating sites
maintain essential documents in the investigator site file (ISF) or regulatory binder
(electronic or hard copy). These documents must be maintained for continued
enrollment of research subjects and clinical trial participation. ICH GCP sect. 8 pro-
vides guidance on required essential documents. Essential documents include the
statement of investigator on FDA Form 1572, IRB approvals (initial, amendment,
and continuing review) for research, and the delegation of authority log (DOA). The
CCC and DCC should have checks in place to ensure that sites with expired or
missing essential documents do not register research subjects to the trial. Upon
submission of the DOA to the CCC and during monitoring visits, the CCC confirms
that tasks delegated by the clinical investigator are appropriate and staff have
received the proper training, prior to performance of research-related tasks.
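A minimal sketch of such a check is shown below, assuming a simple mapping from essential document names to expiration dates (None for documents that do not expire). The document names and data model are illustrative, not those of any particular CTMS or EDC product.

```python
from datetime import date

# Illustrative essential-document set; a real system would track many more.
ESSENTIAL_DOCS = {"fda_form_1572", "irb_approval", "delegation_of_authority_log"}

def site_may_register_subjects(docs_on_file, today):
    """Allow subject registration only if every essential document is
    present and unexpired (an expiry of None means it does not expire)."""
    for doc in ESSENTIAL_DOCS:
        if doc not in docs_on_file:
            return False  # document never collected
        expiry = docs_on_file[doc]
        if expiry is not None and expiry < today:
            return False  # document expired
    return True

docs = {
    "fda_form_1572": None,
    "irb_approval": date(2020, 6, 30),  # continuing review has lapsed
    "delegation_of_authority_log": None,
}
print(site_may_register_subjects(docs, date(2020, 9, 7)))  # -> False
```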
Upon activation of a clinical site, coordinating center personnel maintain com-
munication with sites to promote accrual of subjects to the clinical trial. In conjunc-
tion with the principal investigator, CCC staff develop a communication plan to educate sites regarding the clinical trial and develop accrual enhancement materials such as trial websites, social media posts, brochures, newsletters, etc. It is a good idea to host meetings with participating site investigators and clinical research professionals to address questions and concerns, report on trial developments, and share best practices and frequently asked questions.
It is important that the trial principal investigator is available to address site inquiries
related to eligibility, patient and treatment management, disease status evaluation,
adverse event reporting, and other inquiries that affect safety of research participants
or the conduct of the clinical trial. Documentation of these inquiries and any related
actions and decisions should be tracked by CCC personnel. During these interactions,
the PI and CCC staff should monitor for potential protocol deviations. They should
advise sites of any deviations that should be reported to the IRB and in the electronic
data capture system. CCC personnel should be in close contact with the data managers
of the trial to share concerns for site management and data management. Research
nurses and clinical trial managers at the CCC assist the PI in addressing inquiries from
participating sites. The regulatory manager is also available to respond to inquiries
related to informed consents and IRB-related questions.

Regulatory Compliance

Compliance with regulatory requirements is required for all clinical trial activities.
The CCC ensures that the clinical trial is conducted according to the regulations
described below. In the United States, clinical trials are developed and implemented
according to the Code of Federal Regulations (CFR). These regulations are codified
to provide the rules for implementing laws enacted by Congress.
All clinical trials funded in whole or in part by Department of Health and Human Services (HHS) agencies are required to follow Title 45 CFR Part 46 – Protection of Human Subjects. 45 CFR 46 includes the
following subparts: A, Basic HHS Policy for Protection of Human Subjects; B,
Additional Protections for Pregnant Women, Human Fetuses and Neonates Involved
in Research; C, Additional Protections Pertaining to Biomedical and Behavioral
Research Involving Prisoners as Subjects; D, Additional Protections for Children
Involved as Subjects in Research; and E, Registration of Institutional Review
Boards. Subpart A, as adopted by multiple HHS agencies, including the National
Institutes of Health (NIH), is also known as Federal Policy for the Protection of
Human Subjects or the Common Rule.
The Food and Drug Administration (FDA) is responsible for regulating clinical
trials of drugs, biological products, and medical devices. Clinical trials utilizing
investigational products are under the jurisdiction of the FDA. Title 21 of the CFR
contains the rules of the FDA. There are several key parts of Title 21 (Food and Drug Administration) that pertain to clinical trials, including Part 50 – Protection of Human Subjects (Subpart A, General Provisions; Subpart B, Informed Consent of Human Subjects; Subpart D, Additional Safeguards for Children); Part 56, Institutional Review Boards; Part 312, Investigational New Drug Application; and Part 812, Investigational Device Exemptions. In addition, management of clinical trials
involving investigational products that are submitted to the FDA for marketing
approval needs to take into account 21 CFR Part 314 (Applications for FDA
Approval to Market a New Drug). Other important parts of the CFR include Part
54 (Financial Disclosure by Clinical Investigators) and Part 11 (Electronic Records;
Electronic Signatures). All trials involving investigational products or devices are
conducted in accordance with CFR Title 21.
Clinical trials are conducted on a global basis. However, sponsors and investiga-
tors conducting clinical trials are required to abide by country-specific regulations. In
order to address variations in country-specific requirements for developing pharma-
ceuticals in multiple countries, the International Council for Harmonisation of
Technical Requirements for Pharmaceuticals for Human Use (ICH) developed
guidelines to harmonize the conduct of clinical trials globally for participating
countries. ICH guidelines cover multiple common topics including Quality (Q),
Safety (S), Efficacy (E), and Multiple Disciplinary (M) Guidelines. E6 has become
the international standard for Good Clinical Practice (GCP) and has been
implemented in the United States. Compliance with ICH GCP ensures the protection
of human subjects and clinical trial data integrity.
Per FDA regulations, an institutional review board is an appropriately constituted
group that has been designated to review and monitor biomedical research involving
human subjects. It has the authority to approve, require modification in, or disap-
prove research. The IRB assures the protection of the rights and welfare of humans
participating as subjects in research. IRBs review research protocols and related
materials, including the informed consent document.
Laws and regulations have been implemented to govern the conduct of clinical
trials for the purposes of ensuring the protection of human subjects and clinical trial
integrity. The policies and procedures implemented by clinical trial coordinating
centers must comply with applicable regulations and related guidance documents.
The federal laws and regulations, as well as related guidance documents and GCP
principles, provide the foundation for all coordinating activities described in the
following sections of this chapter.
Trial Sponsorship and Investigational New Drug Application

Clinical trials that include an intervention with an investigational drug or biological product require submission of an Investigational New Drug Application (IND) to the
Food and Drug Administration (FDA). FDA approval of an IND provides the
sponsor with the authorization to administer an investigational product to human
subjects. These trials must be conducted according to 21 CFR Part 312 (Investiga-
tional New Drug Application). The IND is filed by the sponsor of the trial.
According to 21 CFR 312.3, a sponsor is a person who takes responsibility for
and initiates a clinical investigation. The sponsor may be an individual or pharma-
ceutical company, governmental agency, academic institution, private organization,
or other organizations.
If the IND is filed by the principal or lead investigator of a multi-site clinical trial whose site is also one of the clinical sites for the trial, the PI is considered the sponsor-investigator and is therefore responsible for initiating and conducting the clinical trial. The sponsor-investigator IND is cross-referenced with the IND filed by the pharmaceutical company. The coordinating center is responsible for working with the principal investigator to prepare the IND for submission. The CCC also works with the pharmaceutical partner to secure the necessary documents required from the pharmaceutical company, including the Investigator Brochure (IB) and cross-reference or letter of authorization. The pharmaceutical company
may file the IND and retain official sponsorship of the trial while transferring
specified sponsor responsibilities to a coordinating center. In general, the coor-
dinating center is responsible for ensuring that sponsor responsibilities according
to 21 CFR Part 312, Subpart D (Responsibilities of Sponsors and Investigators)
are met.
Sponsors are generally responsible for selecting qualified investigators, providing
them with the information they need to conduct an investigation properly, ensuring
proper monitoring of the investigation(s), ensuring that the investigation(s) is
conducted in accordance with the general investigational plan and protocols
contained in the IND, maintaining an effective IND with respect to the investiga-
tions, and ensuring that FDA and all participating investigators are promptly
informed of significant new adverse effects or risks with respect to the drug (21
CFR 312.50).
General investigator responsibilities are described in sect. 312.60 and include ensuring that an investigation is conducted according to the signed investigator statement (FDA Form 1572), the investigational plan (i.e., the protocol), and applicable
regulations; protecting the rights, safety, and welfare of subjects under the investi-
gator's care; and controlling drugs under investigation. The investigator is responsible for obtaining informed consent from each human subject.
ICH GCP E6(R2) emphasizes the need for sponsor oversight and the sponsor’s
responsibilities for quality management including risk management, monitoring,
quality control, and quality assurance. Quality management is discussed in later
sections of this chapter.
Inspection Readiness

Coordinating centers (CCC and DCC) must be ready for inspection by a regulatory
authority (RA), such as the FDA. Inspection readiness must be built into clinical
trial operations, quality monitoring, and quality assurance activities. Good docu-
mentation practice is a basic requirement for inspection readiness, in all areas of
research. Regulatory authorities will look for contemporaneous documentation
showing sponsor oversight. Evidence that the coordinating center monitored the
trial and maintained regulatory documentation in a timely manner is critical during
an inspection. The FDA Bioresearch Monitoring Program (BIMO) provides a copy of the Compliance Program Guidance Manual (CPGM) utilized in FDA inspections. The CCC should utilize the CPGM as a guide for ensuring inspection readiness.

Quality Management

The CCC and DCC are responsible for quality management at multiple levels.
Quality management should be conducted as a partnership between clinical opera-
tions, statistics and data management, as well as research collaborators. The principles and practice of quality management are evolving. The quality focus varies by the type of research organization and its aims, structure, and resources. In all cases, the coordinating centers seek compliance with regulations, integrity of safety and efficacy data, and the protection and well-being of subjects. The following is only a slice of the discussions surrounding quality management, with a focus on the aspects of quality management conducted by the CCC and DCC.

Standard Operating Procedures

At a basic level, the coordinating center is responsible for developing and maintaining standard operating procedures. Responsible staff need to periodically
review SOPs to ensure that they are in line with regulatory changes, as well as
changes in systems or structure. The SOPs should cover all aspects of clinical trial
development, management, oversight, conduct, and completion. In some cases,
institutional policies and SOPs may determine how the CCC functions. The SOPs
should provide consistency across the institution, but they need to account for
variations in trial designs and circumstances. A good practice is to have levels of
policies, SOPs, and working instructions as controlled documents that have increas-
ing levels of detail. The working instructions could be more detailed and provide
step-by-step instructions. These could be changed and tweaked as needed. Study-
specific plans could augment SOPs but must remain consistent with the SOPs. This
creates balance and flexibility while retaining structure. One point to remember is that SOPs are only as good as the training provided on them. Staff training must be prescribed and maintained in a timely fashion.
Quality Control

The CCC is responsible for quality control (QC) and trial monitoring activities, as
well as quality assurance (QA) activities. The latter is usually conducted in the form
of systematic audits to ensure compliance. In November 2016, the International
Council for Harmonisation (ICH) Integrated Addendum to ICH E6(R1) Guideline
for Good Clinical Practice, E6(R2), was released. The FDA released its related
Guidance for Industry for E6(R2) in March 2018. The addendum to ICH GCP E6
sect. 5 promotes implementation of a system to manage quality throughout all stages
of the trial. According to ICH GCP E6(R2), a risk-based approach should define the
quality management system. This includes identification of processes and data critical to ensuring human subject protection and the reliability of trial results. Risk management includes risk identification, evaluation,
control, communication, review, and risk reporting.
Since the addendum was released, research organizations have developed strat-
egies and processes for risk-based centralized monitoring. There are varying appli-
cations of this approach. At the core of these activities is the development of
analytical reports and metrics that can be reviewed centrally and/or remotely.
Evaluation of key performance indicators (KPIs), also known as key risk indicators
(KRIs), such as data quality, data timeliness, query rates, protocol deviation rates,
serious adverse event rates, and subject enrollment levels, assists the coordinating
center in identifying sites that might be at risk. The determination leads to decisions
regarding increased monitoring, auditing, and/or other corrective or preventive
actions. Central review of monitoring KPIs is a joint activity between the CCC
and DCC.
With the use of electronic systems, source data review and verification can be
conducted remotely. The electronic tools for central monitoring are integrated with
the electronic data capture system (EDC) maintained by the DCC. A central monitor
assigned either by the DCC or CCC can review certified copies of source documents
uploaded into the EDC system or a source document portal. A central monitor can
remotely access a site’s electronic medical record (EMR) system. Electronic source
data documentation is reviewed against data recorded into the EDC. The success of
central monitoring is dependent on careful planning. The critical data points for
eligibility, trial intervention or treatment, adverse event/safety monitoring, and trial
endpoint evaluation should be identified to focus the central monitoring activity.
Central review of data, coupled with KPI metric reviews, enables the CCC and DCC
to identify sites that require intervention, follow-up, increased on-site monitoring,
auditing, or other forms of remediation.
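As a hedged illustration of this kind of centralized KPI review, the sketch below flags sites whose query rate is far above the study-wide benchmark. The metrics, threshold, and site data are invented for the example; a production system would draw these figures from the EDC or CTMS reporting database and apply thresholds agreed in the monitoring plan.

```python
from statistics import median

# Hypothetical per-site metrics pulled from the EDC/CTMS reporting database.
site_metrics = {
    "site_001": {"queries": 12, "forms_submitted": 400},
    "site_002": {"queries": 90, "forms_submitted": 380},
    "site_003": {"queries": 15, "forms_submitted": 410},
}

# KRI: data queries per submitted form.
rates = {s: m["queries"] / m["forms_submitted"] for s, m in site_metrics.items()}

# Flag sites whose rate exceeds twice the study-wide median; flagged sites
# become candidates for increased monitoring, audit, or a CAPA.
benchmark = median(rates.values())
at_risk = sorted(s for s, r in rates.items() if r > 2 * benchmark)
print(at_risk)  # -> ['site_002']
```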
Medical oversight by a designated medical officer or medical monitor is key to
quality management activities. The medical officer provides clinical expertise and
judgment. The medical officer is a point of escalation for other members of the study
team. The medical officer is also responsible for considering potential impact on
patient safety and recommending changes to the protocol based on trends analysis
reports. Such recommendations are discussed with the trial statistician and other
members of the study team.
The monitoring plan for a clinical trial should account for both centralized (or remote) monitoring and on-site monitoring. Routine on-site monitoring may be
planned at a decreased frequency in combination with centralized monitoring. A
level of source data verification conducted through on-site monitoring should be
documented in the clinical monitoring plan. At-risk sites, identified through central
monitoring, are given priority for on-site monitoring.

Quality Assurance

Quality assurance is an independent and recommended activity, usually conducted in the form of audits. Audits include site, third-party organization (TPO)/contract research organization (CRO), system, and internal audits. The
scope of the audit is identified in the audit plan. Auditor assessments include a
review of compliance with regulations, GCP guidelines, SOPs, and trial plans. The
auditor also reviews clinical trial management by the CCC. The audit should
not be confused with monitoring, an ongoing quality control activity. In fact, audits
assess the effectiveness of monitoring activities, clinical trial operations, and data
center activities. An investigator site audit could reveal not only non-compliance by
the investigator but also failures in other aspects of trial conduct at the sponsor-
investigator CCC or CRO level. For example, investigator non-compliance could be
coupled with monitoring observations if the monitors did not detect the non-com-
pliance. Audit observations need to be conveyed to the affected parties in a timely
manner followed by implementation of a corrective and preventive action (CAPA)
plan with a root cause analysis, timelines, and measures for success. The coordinat-
ing center is responsible for quality assurance and evaluating the effectiveness of any
CAPA implemented, according to SOPs.

Oversight of CROs

If the multicenter trial is large enough (e.g., 1000 research subjects and 50 sites), it might be necessary to outsource monitoring functions to a contract research organization (CRO). The ICH E6(R2) addendum also emphasized the need for sponsors
to provide oversight to third parties contracted to perform responsibilities on their
behalf. A few tips regarding oversight include review of CRO qualification and staff
training, as well as frequent communication with CROs to share information on
changes to the trial, provide instruction, and address inquiries. If the CRO is
responsible for monitoring, review the monitoring reports and set and review metrics
for trip report completion, site follow-up, and compliance with the monitoring plan.
If the CRO is responsible for the trial master file, generate reports on timely
completion and check accuracy of the TMF. Implement a pathway for the CRO to
escalate issues to the coordinating center staff. Establish guidelines for how critical
and major non-compliance need to be addressed and the SOP for CAPAs to follow.
Be clear on what SOPs are applicable. Develop a quality management plan with the
CRO. Document all continuous oversight activities.
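One simple way to make such oversight measurable is sketched below: tracking whether the CRO completes monitoring trip reports within the contracted window. The 14-day window and the visit dates are assumptions for illustration; the actual window would come from the CRO agreement or monitoring plan.

```python
from datetime import date, timedelta

# (visit date, trip report finalized date) pairs -- hypothetical data.
visits = [
    (date(2021, 4, 1), date(2021, 4, 9)),
    (date(2021, 4, 20), date(2021, 5, 18)),  # late report
    (date(2021, 5, 3), date(2021, 5, 12)),
]

CONTRACTED_WINDOW = timedelta(days=14)  # assumed contractual requirement

on_time = sum(1 for visit, report in visits
              if report - visit <= CONTRACTED_WINDOW)
print(f"Trip reports on time: {on_time}/{len(visits)}")  # -> 2/3
```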
Adverse Event and Safety Monitoring

The multi-site principal investigator and responsible personnel at the CCC and DCC
are required to continuously review expedited or serious adverse events, as they are
reported by participating sites. Following the FDA Guidance for Industry and
Investigators for Safety Reporting Requirements for INDs and BA/BE Studies,
investigators must assess if a reported adverse event meets the requirement for expedited reporting to the FDA of a suspected unexpected serious adverse reaction (SUSAR), in accordance with 21 CFR 312.32. The principal investigator is
also responsible for reporting the SUSAR to the pharmaceutical partner. The CCC
notifies participating sites of SUSARs (IND safety reports) and provides guidance on
any related changes to risks associated with an investigational product and the
informed consent document.
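The decision logic behind that assessment can be summarized in a short sketch: an event triggers expedited IND safety reporting when it is serious, unexpected (not consistent with the Investigator Brochure), and a suspected adverse reaction (a reasonable possibility that the drug caused it). The field names below are hypothetical; the criteria paraphrase 21 CFR 312.32.

```python
# Hedged paraphrase of the 21 CFR 312.32 criteria; field names are invented.
def requires_expedited_report(event):
    return (event["is_serious"]
            and not event["consistent_with_investigator_brochure"]
            and event["reasonable_possibility_drug_caused"])

event = {
    "is_serious": True,
    "consistent_with_investigator_brochure": False,  # i.e., unexpected
    "reasonable_possibility_drug_caused": True,      # suspected reaction
}

if requires_expedited_report(event):
    print("SUSAR: file IND safety report; notify pharmaceutical partner "
          "and participating sites.")
```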

Management of Clinical Coordinating Centers

Management of the clinical coordinating center involves oversight by both the CCC
leadership and research administration at the institutional level. Research adminis-
tration provides the infrastructure and determines the policies that govern the
responsibilities and management of a CCC.

Research Administration

Research administration at an institution may comprise several offices that administer various areas of sponsored research management and compliance includ-
ing grants and contracts, financial administration, research integrity and compliance,
core facilities, research computing, and legal counsel. An institutional review board
(IRB) is part of research administration and must approve clinical research activities,
prior to clinical trial initiation. A multi-site coordinating center trial requires
approval from research administration of the multi-site principal investigator’s
institution. The multi-site PI may also be referred to as a sponsor-investigator,
especially for clinical trials that involve investigational products requiring submission of a sponsor-investigator IND. The PI serves as both the multi-site PI and the PI at
their site.
There are several avenues for funding multi-site investigator-initiated clinical
trials. The coordinating center administration is responsible for raising funds and
fiscal management. Sources of funding include nonprofit foundations, federal gov-
ernment agency grants, cooperative agreements or contracts, as well as public-
private partnerships with pharmaceutical and biotechnology companies. The
National Institutes of Health (NIH) and other federal agencies have a multitude of
funding opportunities. Research proposals are usually submitted in response to a
request for proposal (RFP) or funding opportunity announcement (FOA) from a
funding agency. The multicenter research proposal is prepared by the principal
investigator and co-investigators with assistance from coordinating center staff.
These proposals contain the research plan, a description of the clinical coordinating
center capabilities, and the budget. Research proposals to a funding agency are
submitted through research administration and must have the approval of an autho-
rized institutional official.
In addition to pre-award assistance, research administration provides post-award
services including financial administration and financial reporting. The institution
may also issue sub-award agreements or subcontracts to external investigators, or
vendor service agreements to third-party organizations (e.g., central laboratories). In
case the data coordinating center is not at the same institution, funding and collab-
oration agreements with the data coordinating center would be issued. Research
administration will assist with negotiation of contracts and budgets with funding
sponsors, participating sites, and subcontractors, enforcing legal and regulatory
requirements for research administration, as necessary. The multi-site PI and their
team are responsible for monitoring deliverables and ensuring that applicable terms
of award are met. At specified time points, a progress report for the research plan and
financial utilization is submitted to the funding agency by research administration.
Another important function of research administration is overseeing potential financial conflicts of interest by investigators. When potential conflicts of interest
arise, institutional policies for conflict management or elimination are implemented.

Industry Collaborations

In addition to grants management of federally funded clinical trials, research administration will work with the multi-site PI of an investigator-initiated trial and the CCC in managing collaborations with industry. A common example of such a collaboration
is a pharmaceutical company providing access to an investigational product for use
in a trial. The collaboration could include additional funding support from the
company for other aspects of the trial. The deliverables, including data sharing
requirements, are addressed in the investigator-initiated clinical trial agreement
(CTA). The CTA may include the following:

1. Clinical trial agreement components: conduct of the study; human subject enroll-
ment requirement; participating site requirements; quality assurance and regula-
tory inspection readiness; vendor and subcontractor compliance; study data
ownership and data sharing; record keeping; confidentiality; publication; safety
reporting; inventions; compliance with law; term and termination; indemnifica-
tion and insurance; payment and payment schedule.
2. Budget (exhibit).
3. Statement of work (exhibit).

Finance and contracts staff, in conjunction with project management personnel, negotiate the budget and agreement and ensure that all trial activities are funded
appropriately. The budget staff works closely with study team members to determine
research tests and procedures that are not covered by insurance. This could be part of
a Medicare coverage analysis review. The study team also determines if funding
requests are needed for correlative science. The CCC and research administration
teams are responsible for tracking the terms and milestones of all agreements. This is
a shared administrative and finance function. The scientific and resource benefits that translate into successful trial completion are well worth the time and effort required to negotiate the collaborations.

Institutional Approval of Clinical Coordinating Center

Prior to initiation of the clinical trial and CCC activities, the multi-site PI’s institution
may require approval of the CCC, per institutional policies. Research administration
and the IRB will review the CCC to ensure compliance with regulatory requirements
for the protection of human subjects. Institutional requirements include review of the
clinical trial protocol and informed consent, the data and safety monitoring plan, as
appropriate for a multi-site trial. The institution may request feasibility questionnaire
for external sites. Institutional evaluation includes review of coordinating center
responsibilities, qualifications and training of research staff, site selection, site
management, data management procedures, statistical analysis plan, investigational
product distribution and accountability, pharmacovigilance and safety reporting,
protocol deviation monitoring, central and on-site monitoring, accrual plan, project
management, and multi-site communication plans, as applicable. The relevant infor-
mation may be contained in the protocol and other study plan documents. If
applicable for interventional trials, the institution may require the implementation
of an independent data and safety monitoring board.
The requirements for approval of a CCC may be obtained from research admin-
istration. For reference, a good example of CCC institutional requirements can be
found on the website of Dana Farber/Harvard Cancer Center (DF/HCC).

Management

The coordinating center, multi-site PI, and participating site PIs are collectively
responsible for meeting the criteria for grant awards and contracts. This requires
collaborative management practices by the leadership and management team. The
management team of the coordinating center is responsible for efficient and effective
management of all centralized clinical trial activities. Continuous monitoring, doc-
umentation, and progress evaluations are necessary.
The multi-site principal investigator is responsible for fulfilling the obligations of
the sponsor-investigator, including the initiation and management of the clinical trial
at all clinical trial sites. The CCC may already be established within a program at the
institution with a medical director who regularly oversees the activities of the CCC,
or it may be newly established under the direction of the multi-site PI. The structure
of the CCC management and staff may differ slightly between institutions. However,
directors or senior managers who oversee clinical trial operations, administration,
project management, and regulatory affairs are key to managing a clinical coordi-
nating center. Research nurses, regulatory managers, clinical trial or project man-
agers, and research coordinators are important staff members for a fully functional
CCC.

Resource Management

Successful completion of a multi-site clinical trial is heavily dependent on the personnel staffing the clinical and data coordinating centers, as well as third-party
service providers. CCC managers need to review the staffing plan periodically to
ensure that sufficient staff are available to manage the trial. Staff planning is closely
tied to the budget planning. Balancing anticipated resource needs and financial
resources is a challenging and shared function between operations, administration,
and finance managers. The CCC cannot afford the mistake of understaffing and consequently risking trial timelines, resulting in regulatory non-compliance, over-
sight gaps, or other unmet trial obligations.
Initially hiring qualified personnel, followed by an ongoing training and education program, is necessary for any health research organization. Staff engagement, open feedback, and communication loops are key to optimal CCC management and
quality improvements.

Operational Efficiency and Project Management

In conjunction with the DCC, the clinical coordinating center is responsible for
timely trial development to keep pace with scientific advancements and meet the
requirements of the funding agency. Taking too long to launch a clinical trial may
affect the relevancy of the scientific question and the impact of the results on clinical
practice. In partnerships with industry, inefficient and slow timelines affect the
ability of the industry partner to submit marketing applications for investigational
products. In a competitive environment, delays place the CCC and its partner at a
disadvantage.
Operational efficiency is achieved and monitored in part through careful project
planning and management. The coordinating center needs to develop target timelines
and milestones for protocol development. The CCC utilizes tools and computer applications for careful and constant monitoring of deadlines, with the goal of meeting
target timelines. Project managers are accountable to CCC leadership for ensuring
trial activation, progress, and completion. Project managers coordinate the activities
of internal and external personnel and coordinate execution of many processes.
During the study, project managers work closely with data analysis staff to generate
reports for trends analysis reporting to oversee the conduct of the study. Contingency
and escalation plans are part of the CCC SOPs and may be incorporated into the
project plan or other study-specific plans.
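A minimal sketch of milestone monitoring is given below; the milestone names, target dates, and escalation rule are invented, and in practice a project manager would maintain this in a dedicated project-management tool feeding CCC leadership reports.

```python
from datetime import date

# (milestone, target date, actual completion date or None) -- hypothetical.
milestones = [
    ("Protocol finalized",     date(2021, 3, 1),  date(2021, 2, 24)),
    ("First site activated",   date(2021, 6, 1),  None),
    ("First patient enrolled", date(2021, 7, 15), None),
]

def overdue(milestones, today):
    """Milestones past their target date with no completion recorded."""
    return [name for name, target, completed in milestones
            if completed is None and today > target]

print(overdue(milestones, today=date(2021, 6, 10)))
# -> ['First site activated']  (escalate per the contingency plan)
```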
Risk Management

Ultimately, the multi-site PI is responsible for the conduct of a clinical trial at all
participating sites. Beyond the study design and compliance with clinical trial
regulations, successful completion of a clinical trial starts with careful planning by
CCC management. This includes ensuring that adequate human and financial
resources are available throughout the life cycle of the trial. The management team
needs to budget for personnel at the CCC and participating sites and personnel and
resources for data coordinating center activities. There are a multitude of costs that
must be included in the budget calculations, including clinical trial management
systems, electronic data capture system, data and record storage, trial supply distri-
bution, training, travel, site recruitment, patient recruitment, monitoring, project
management, special equipment, laboratory services, site payments, and other
costs related to subcontracts and vendor management. Throughout the life of the
clinical trial, the management team is responsible for tracking expenses and ensuring
that the trial remains within budget.
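The sketch below illustrates one simple form of that expense tracking, comparing spend against budget by category and flagging categories nearing exhaustion. The categories echo the list above; every figure and the 90% review threshold are invented for the example.

```python
# Hypothetical budget and actuals, in dollars.
budget  = {"site_payments": 500_000, "monitoring": 120_000, "edc_system": 60_000}
actuals = {"site_payments": 310_000, "monitoring": 118_500, "edc_system": 25_000}

REVIEW_THRESHOLD = 0.90  # assumed trigger for a budget review

for category, planned in budget.items():
    spent = actuals.get(category, 0)
    fraction = spent / planned
    flag = "  <-- review" if fraction > REVIEW_THRESHOLD else ""
    print(f"{category:14s} {spent:>9,} / {planned:>9,} ({fraction:5.1%}){flag}")
```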
The CCC management team is primarily responsible for risk assessment at the beginning of the project and for risk management while the clinical trial is ongoing. Recent
developments in clinical research, exemplified by changes to ICH GCP guidelines and
FDA guidance, place a major emphasis on risk management. For example, a 2019
funding opportunity (PAR-19-329) posted by the National Heart, Lung and Blood
Institute (NHLBI) titled Clinical Coordinating Center for Multi-Site Investigator-
Initiated Clinical Trials emphasizes the focus on both risk management and opera-
tional efficiency, as well as the role of project management to proactively mitigate
risks. The NHLBI requires a trial management plan that describes the strategy of the
CCC to “ensure that management activities of the clinical trial are met including
directly supporting the needs of scientific leadership to identify barriers, make timely
responses, and optimize the allocation of limited resources” with a risk assessment and
management plan that identifies a range of contingencies and solutions.
Identification of potential risks at the beginning of the trial is key to successful
risk management. Monitoring of key risk indicators (KRIs), as part of quality
management, ensures corrective and preventive action in a continuous improve-
ment manner. Key risk considerations and indicators often depend on the type and
complexity of the trial. CCCs often consider the patient population, competing
clinical trial opportunities, accrual rate projections, expected screen failure rate,
investigational new drug or device status, data timeliness and quality, adverse
event rates, and protocol deviations rates. An example of a risk assessment tool is
the Risk Assessment and Categorization Tool (RACT) developed by TransCelerate
to assist with risk-based monitoring implementation.
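In the spirit of such tools (but not reproducing RACT itself), a weighted risk-scoring sketch is shown below: each risk category is rated for impact and probability, and the total score drives the intensity of the monitoring plan. The categories, ratings, and cutoff are all assumptions for illustration.

```python
# Hypothetical risk categories, each rated 1 (low) to 3 (high).
risk_ratings = {
    "investigational_product_under_ind": {"impact": 3, "probability": 2},
    "complex_eligibility_criteria":      {"impact": 2, "probability": 3},
    "high_expected_screen_failure":      {"impact": 1, "probability": 3},
}

score = sum(r["impact"] * r["probability"] for r in risk_ratings.values())

# Assumed cutoff mapping the total score to a monitoring strategy.
plan = ("enhanced on-site monitoring" if score >= 12
        else "standard risk-based centralized monitoring")
print(score, "->", plan)  # -> 15 -> enhanced on-site monitoring
```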

Clinical Coordinating Center and Trial Governance

The multi-site clinical trial and the CCC may be governed by a steering committee or
executive committee, chaired by the multi-site PI. Members of the executive
committee may include multi-PIs of participating sites and statistical center represen-
tation. At the beginning of the trial, the executive committee is concerned with trial design, funding, and site selection. Upon trial initiation, this committee is charged with
monitoring the progress of the trial and making decisions regarding issues escalated by
CCC management. These issues often fall along the lines of key risk indicators
described above. The multi-site PI and members of the executive committee serve
as champions of the trial with other investigators and other research collaborators.
Upon analysis of trial results, they are responsible for overseeing implementation of DSMB recommendations and publication of study results. The governance responsi-
bilities of the executive committee may be documented in a charter.

Clinical and Data Coordinating Center Integrated Functions

The coordinating center may include the CCC and the DCC within the same
institution, or the DCC may be part of a separate institution. The primary responsi-
bilities of the DCC include statistical design, data management, data analysis, and
publication of study results. Collectively the functions of the CCC and DCC
encompass the breadth of centralized clinical trial management activities.
The functions of the DCC must be integrated with clinical operations throughout
the life cycle of a clinical trial: development and activation, accrual phase, follow-
up/data maturation, data analysis and reporting, and close-out. The DCC and CCC
leaders are part of the multi-site management team and included in all levels of
management discussion, planning, and review. DCC personnel including statisti-
cians and data managers are members of the study team and key to both protocol
development and trial management. The SOPs and workflow between the CCC and
DCC are developed to dovetail and support all units. Frequent and open communi-
cations happen on a day-to-day basis for ongoing study management.
Alignment of CCC and DCC resources is important in multi-site coordinating
center obligations for timely trial activation and completion, as well as ongoing trial
management. The DCC is crucial not only for data analysis and results reporting but
also in ongoing quality management analyses. Figure 1 depicts examples of integra-
tion between the clinical and data coordinating centers, over the life cycle of a
clinical trial.

Clinical Trial Network Group Coordinating Center

This chapter focused on the responsibilities of the clinical coordinating center for a multi-site clinical trial. Coordinating centers could be established to manage one or
several trials. The number and type of participating clinical sites are configured
based on trial needs. These participating sites could evolve into an established
clinical trial network. Clinical trial networks that conduct large-scale clinical trials
with a portfolio of clinical trials may be referred to as network groups or cooperative
research groups. The operation centers of these network groups have all of the
components of an investigator-initiated coordinating center with some advantages and disadvantages.

Fig. 1 Clinical coordinating center (CCC) and data coordinating center (DCC) integrated function examples:

Trial development: Statisticians and clinical investigators develop the study design; the CCC develops the protocol and related documents for trial operations, incorporating statistics and data management logistics; the DCC develops the statistical analysis plan; the DCC develops case report forms and builds the study in electronic data capture; the DCC programs the patient registration and randomization system.

Trial management: The CCC manages and monitors sites during the trial accrual phase; the DCC collects and manages trial data; the CCC and DCC jointly monitor the study and sites for quality and key performance indicators; the DCC provides study reports to the Data and Safety Monitoring Board, and the CCC provides administrative support and disseminates DSMB statements; the DCC and CCC jointly manage trial-related inquiries from sites and any emergent safety-related issues or other study status changes; the DCC and CCC jointly manage protocol amendments and other site notifications; training and education materials are developed jointly.

Trial completion: The DCC monitors studies to determine completion status and termination of follow-up; the CCC manages study termination notices and instructions; the DCC statistician conducts the statistical analysis for the DSMB report and subsequent publication; statisticians and clinical investigators write study publications, and the CCC manages the publication process.

An example of established network groups is the National
Clinical Trials Network (NCTN) groups, funded by the National Cancer Institute
(NCI). The groups include the Alliance for Clinical Trials in Oncology, ECOG-
ACRIN, NRG Oncology, SWOG, Children’s Oncology Group, and Canadian Can-
cer Trials Group. The coordinating centers of these groups are a paired set of an
operations center (i.e., clinical coordinating center) and statistics and data manage-
ment center (i.e., data coordinating center).
The advantages of these types of networks include a membership model for
participating sites. These participating sites are required to adhere to membership
performance requirements, in order to participate in network trials. Site selection and
qualification is streamlined for sites interested in clinical trial participation. The
established nature of the group’s policies and procedures and centralized infrastructure
provided by the NCI itself offer an opportunity to launch peer-reviewed clinical trials
in a uniform fashion. These networks offer consistency in how trials are managed,
monitored, and audited. They can leverage the utilization of a standard electronic data
capture system and data management procedures across multiple clinical trials. These
large networks enjoy the scientific participation and expertise of investigators across
the country, as members of both scientific and administrative committees. Clinical trial
development procedures have been standardized with the aim of optimizing opera-
tional efficiency. They are able to monitor institutional performance across multiple
trials and sites, offering education and training opportunities for improvement of
clinical trial conduct. Nevertheless, these groups face challenges related to resource
and funding constraints inherent in managing a large portfolio of clinical trials. As with
all multi-site coordinating centers, these network coordinating centers face similar
challenges for completing trial accrual, managing quality and ensuring regulatory
compliance for the protection of human subjects, trial integrity, and data quality.
Research collaborations with other research groups, including international partners,
as well as industry partnerships are important to the success of collaborative clinical
research. Within the academic research and the investigator-initiated clinical research
community, the choice for multicenter clinical trials includes implementation through
a network group or an institution-based coordinating center.

Summary

The clinical coordinating center is responsible for managing all stages of a clinical
trial life cycle: trial development, activation, accrual, follow-up, results reporting,
and closure. Multi-site clinical trial management functions include risk management,
project management, protocol development, site selection and management,

reporting to regulatory authorities, data and safety monitoring, centralized and remote monitoring, quality control and assurance, pharmacovigilance and safety reporting, regulatory affairs, communications, grants and contracts management, and financial and research administration. The purpose of the coordinating center is to enable and facilitate clinical research operations. A summary of the clinical coordinating center responsibilities is shown in Fig. 2.

Fig. 2 Clinical coordinating center responsibilities: clinical trial development and operations; regulatory affairs and compliance; quality management (control, monitoring, and assurance); governance; site selection and management; research and financial administration; communications and education; and information technology/systems.
Quality management processes, integrating both ongoing quality control and
assurance mechanisms, protect against the risk of non-compliance with applicable
regulations. Documented sponsor oversight, accountability, and inspection readiness
are not optional. Clinical coordinating centers must strive for inspection readiness at
all times. Clinical coordinating center management is responsible for ensuring
regulatory compliance. They need to obtain sufficient resources for efficient clinical
trial operations and implement risk and quality management policies and procedures.

Key Facts

1. Coordinating centers are responsible for clinical operations, site management, regulatory compliance, communications, administration, information systems,
and quality management.
2. At the core of all clinical trial operations is the protection of human subjects and
data integrity.
3. Selection of high-performing sites is important for successful trial initiation and
completion.
4. Site management requires constant communication with participating sites.
5. Project management, planning, and execution ensure that trial timelines and
milestones are met.
6. Quality management includes both quality control and quality assurance.
7. Risk-based monitoring and risk management processes aid in meeting GCP
guidelines.
8. Inspection readiness must be ongoing and constant with documented sponsor
oversight.
9. Clinical coordinating centers and data coordinating centers collectively manage
clinical trials.
10. Coordinating centers of network groups coordinate a portfolio of clinical trials
for a large site network.

Cross-References

▶ Good Clinical Practice
▶ Multicenter and Network Trials
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance

References
Food and Drug Administration, Investigational New Drug (IND) Application Resources. Available online. Accessed 07 Sep 2020. https://www.fda.gov/drugs/types-applications/investigational-new-drug-ind-application
E6(R2) Good Clinical Practice: Integrated Addendum to ICH E6(R1). Available online. Accessed 07 Sep 2020. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e6r2-good-clinical-practice-integrated-addendum-ich-e6r1
Food and Drug Administration (2017) Compliance Program 7348.810 Bioresearch Monitoring, Sponsors, Contract Research Organizations and Monitors. Available online. Accessed 07 Sep 2020. https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/fda-bioresearch-monitoring-information/compliance-program-7348810-bioresearch-monitoring
Dana Farber/Harvard Cancer Center, Investigator-Sponsored Multi-center Clinical Trials. Available online. Accessed 07 Sep 2020. https://www.dfhcc.harvard.edu/research/clinical-research-support/office-of-data-quality/services-support/dfhcc-multi-center-trials/
NHLBI Funding Opportunity, PAR-19-329, Clinical Coordinating Center for Multi-Site Investigator-Initiated Clinical Trials. Available online. Accessed 07 Sep 2020. https://grants.nih.gov/grants/guide/pa-files/par-19-329.html
TransCelerate Risk-Assessment and Categorization Tool. Available online. Accessed 07 Sep 2020. https://www.transceleratebiopharmainc.com/initiatives/risk-based-monitoring/
33 Efficient Management of a Publicly Funded Cancer Clinical Trials Portfolio

Catherine Tangen and Michael LeBlanc

Contents
Introduction: SWOG Statistics and Data Management Within the NCTN 616
Disease Committee Structure, Interactions, and Study Development 617
Disease Committee Structure Within the Statistical Center 617
Communications Within and Between the Statistical Center 618
Protocol Development Process 619
Using Expert Cross-Disease Teams (Cores), Strategic Meetings, and Standardized Policies 620
Recruitment and Retention Core 621
Patient-Reported Outcomes (PRO) Core 621
FDA Application Core 621
Clinical Trial Methods Core and Translational Medicine Methods Core 622
Rave® Study Build Core 622
Training Opportunities 623
Standardized Publication Policy 623
Standardized Data Collection, Coding, and a Comprehensive Portfolio Database 623
In-House Tools for Design, Study Monitoring, and Analysis of Clinical Trials 625
Comprehensive Statistical Reporting Tool 625
Specimen Tracking Application 626
Public Use Statistical Design Calculators 627
Automatic Monthly Study Reports 627
Site Performance Metrics Reports 628
Portfolio-Wide Data Safety Monitoring Committee 629
General Structure 629
Standard Report Formatting 630
General Interim Analysis Strategies 632
Statistical Center External Interactions 633
Data Sharing 633
Biospecimen Sharing 633
Summary and Conclusion 634

Data Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Biospecimen Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635

C. Tangen (*) · M. LeBlanc
SWOG Statistical Center, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
e-mail: [email protected]; [email protected]

Abstract
Implementing, managing, and reporting any single well-designed
cancer clinical trial is an extremely complex and expensive undertaking. However,
within our group, there is the opportunity to simultaneously design, implement,
monitor, and analyze up to approximately 100 publicly funded clinical trials across
the development spectrum. Operationally, we aggressively seek to optimize and
standardize processes and software that are common across studies, increase effi-
ciency, and focus on the quality and reliability of study results. Implementing novel
software applications increases the quality and efficiency of data evaluation, mon-
itoring, and statistical analysis across multiple disease and study types. Conventions
and cross-study tools are the key to quality monitoring and analysis of a complex
portfolio of studies. A strategy of fully utilizing the commonalities across the trials
leads to better-quality results for any given study in the portfolio, as well as more
efficient utilization of public funds to conduct the studies.
In this chapter the structure and processes are described in the context
of the SWOG Cancer Research Network, one of the four National Clinical Trials
Network (NCTN) adult cancer clinical trial groups funded by the US National
Institutes of Health (NIH) under a cooperative agreement with the US National
Cancer Institute (NCI).

Keywords
Clinical trial · Cancer · Statistics · Portfolio · Data management · Protocol
development · Statistical design · Translational medicine · Software · Data safety
monitoring committee · Database

Introduction: SWOG Statistics and Data Management Within the NCTN

The mission of the SWOG Cancer Research Network, as a partner in the NCTN (see
references for website link), is to significantly improve lives through cancer clinical
trials and translational research. The SWOG Network Operations Center, which
includes the office of the Group Chair, the contracts and legal team, and communi-
cations, is located in Portland, Oregon, and the SWOG Operations Center, which
oversees protocol development and audit functions, is located in San Antonio,
Texas. We will focus on trial portfolio strategies at the SWOG Statistics and Data
Management Center (Statistical Center) located in Seattle, WA, where all trial and
ancillary study data reside. The Statistical Center is led by the Director, referred to as
the Group Statistician. There are currently 12 statistical faculty who receive some
fraction of funding from SWOG. The faculty typically have other non-SWOG
research interests and receive additional funding from other grant activities outside
of SWOG. An additional 15 SRAs (MS-degree statisticians) and 2 Statistical Unit
Assistants (BS-degree statisticians), who are fully funded by
SWOG activities, round out the statistical team. The goal of the Statistical Center
is to provide leadership in the statistical design and data management of oncology
clinical trials for SWOG and to safely and efficiently monitor and report on clinical
investigations over a portfolio of clinical trials. Critically, the Group must analyze
the clinical outcomes in a consistent and reproducible way. The portfolio of managed
trials includes both trials to evaluate new cancer treatments (both single and multi-
arm Phase II and randomized Phase II, Phase II/III, and Phase III trials) and other
studies involving cancer prevention, supportive care and symptom management,
palliative care, as well as trials of comparative effectiveness of treatments. These
nontreatment studies include both randomized and cohort studies and are conducted
in collaboration with the NCI Community Oncology Research Program (NCORP)
(see references for website link).
The SWOG Statistical Center designs, implements, and manages their trial
portfolio through (A) the SWOG disease committee structure, interactions, and
protocol development; (B) use of expert teams, strategic meetings, and standardized
policies; (C) use of standardized data collection, coding, and a comprehensive
portfolio database; (D) development of in-house tools for design, monitoring,
and analysis of clinical trials; (E) utilization of a portfolio-wide Data and Safety
Monitoring Committee; and (F) standardization of our interactions with outside
groups including biospecimen and data sharing. Expanded descriptions follow.

Disease Committee Structure, Interactions, and Study Development

Disease Committee Structure Within the Statistical Center

The Statistical Center structure, and the primary work of SWOG, is accomplished
through anatomic disease committees. Each committee is assigned at least one Ph.D.
statistician (faculty), one or more master’s level statistician(s) referred to as
Statistical Research Associates (SRA), and one or more data coordinator(s) who
work as part of a larger team with the clinical and translational medicine members of
the disease committee. Within the Statistical Center, these committees function
under the direction of the faculty statistician(s), with priorities set in consultation
with the respective clinical disease committee chair and the Group Statistician.
During study development, statisticians work with the study chair to develop the
trial design and help lead the protocol through the SWOG and NCI approval
processes and protocol implementation. Assessments of feasibility, experimental
design, sample size, randomization schemes, data analysis plans, and key elements
of data collection are further refined by the statisticians and the study team during the
development process. Statisticians and the protocol coordinators work with the study
chair to launch the proposed trial. Several statisticians have responsibilities and
methodological skills that address general needs across diseases, which facilitates
standardization within the Statistical Center (see the “Using Expert Cross-Disease
Teams (Cores), Strategic Meetings, and Standardized Policies” section).

Communications Within and Between the Statistical Center

Communications are critical for integrating the work of the statisticians with data
management staff. Important in-person meetings at the Statistical Center include chief
meetings with senior faculty, senior SRAs, and data management and applications
development management. Chief meetings are used to set priorities within the Statis-
tical Center and to discuss how to respond to new initiatives, regulatory changes, and
other challenges. Other important meetings that primarily include Statistical Center
faculty and staff are the twice-monthly meetings of all Statistical Center statisticians to
discuss policy issues, programming and software needs, standards and guidelines, and
statistical issues. One of these twice-monthly meetings includes a statistical analysis or
methodology presentation and discussion. There is a monthly meeting of the SRAs,
where they discuss study implementation issues, evaluate workloads and priorities, and
share ideas. This also serves as a forum for continued training. Monthly disease
committee meetings are led by a faculty statistician with attendance by the respective
disease committee statisticians and data coordinators. This meeting serves to set
analysis and data evaluation priorities, to summarize accrual and adverse event issues,
and to discuss any concerns for ongoing trials. In addition, electronic case report form
(CRF) development ideas, design concerns, study structure setup in the database, and
evaluation requirements are discussed for studies in development. There is also a
weekly statistical capsule review meeting and protocol review meeting. These meet-
ings are concept- or protocol-specific (see Protocol Development below for more
details). Finally, the biannual Statistical Center all-staff meeting is a forum to showcase
scientific accomplishments over the past 6 months, to introduce new staff, and to
discuss performance goals for the upcoming 6 months.
Strong links exist between the Operations Center and the Statistical Center.
Importantly, the Group Statistician and Group Chair have a scheduled weekly one-
on-one teleconference, and they are in almost daily email contact to promote
the efficient scientific and administrative functioning of SWOG. Senior statistical
faculty also have close contact with the Director of Operations, who works with the
Group Chair to oversee and direct the operational and administrative activities of
SWOG in support of its scientific missions.
Statistical and other data-related issues are fundamental to any new study’s
approval. The Group Statistician and Statistical Center Deputy Director are members
of the SWOG Executive Committee (“triage”), along with the Group Chair, execu-
tive officers, Director of Operations, and Director of Protocols. The statistical design
review by the Group Statistician and Deputy Director at this meeting is an integral
component of each study evaluation. The committee meets weekly and reviews
capsules for scientific soundness, feasibility, appropriateness to SWOG’s mission,


priorities, and resources. Approved studies are assigned a SWOG study number and
sent for approval at the NCI prior to protocol development. Concepts often need to
go through several reviews prior to approval by triage. This senior level of review
provides strong scientific feedback and consistency and typically results in a high
rate of approval at the NCI.

Protocol Development Process

As described in the Disease Committee section above (“Disease Committee
Structure, Interactions, and Study Development”), for new studies there are initial
conferences with the committee statistician to discuss appropriate study design
options, eligibility criteria, endpoints, accrual estimates (and hence feasibility),
preliminary statistical plans, and proposed sample sizes. In addition, preliminary
discussions of eligibility, translational medicine questions, and details of treatment
are also considered at this time.
As the first review step in our protocol development process, the short study
proposal (capsule) is discussed at an internal statistical review meeting at the
Statistical Center to assess the proposed study design and analysis plans. At this
meeting, translational medicine, patient-reported outcome concepts, and trial design
are discussed. This review is performed by the Group Statistician and Deputy
Director, a designated rotating faculty statistician, and the SRAs, who are usually
divided into two teams to share the workload. Once a capsule has been reviewed,
it goes on to Triage review by the SWOG Executive Committee (Fig. 1).
During protocol development there is a high level of interaction among the
study’s scientific leadership, statisticians, data management staff, and application
development (Rave) staff. After trial activation, interactions shift more strongly to
the statistics and data management staff. Involvement of the protocol coordinator is
very high during protocol development and diminishes afterward, but it continues
throughout trial management as amendments and other protocol issues arise.
Reengineering Protocol Implementation and Development (RaPID) Review:
Once there is a working version of the protocol, an in-person meeting (typically for
Phase III trials) or an extended conference call (for Phase II trials) is scheduled to
carefully review all sections of the protocol; this is known as the RaPID review
meeting. This meeting includes the study chair(s), the protocol coordinator, Director
of Protocols, Statistical Center statisticians, and data coordinators from the respec-
tive disease committee. In a subsequent conference call with the study statisticians
and study chairs, data elements to be captured on the study case report forms are
finalized. This is the first step in the study build process. Having all study personnel
focused together for 4–5 h on the protocol development is an extremely efficient way
to identify and address key logistical and scientific issues for the trial, and it leads to
faster finalization of the document.
Protocol Review Committee: After the RaPID meeting and an additional review
for compliance to standards are conducted by the SWOG Operations Center, the
protocol is submitted to the Protocol Review Committee (PRC) at the Statistical
Center.

Fig. 1 Protocol standardized workflow: protocol documents flow among the Network Group
Operations Center (protocol coordination), the Network Group Chair’s Office (contracts, budgets),
study chairs, sites, and the Statistics and Data Management Center (statistics, data management,
applications development, information technology)

The PRC meets weekly or as needed; committee members receive a copy of
the protocol, draft data collection forms, and a copy of the NCI Protocol Submission
Worksheet. This committee is chaired by one of the faculty statisticians and consists
of at least one other faculty statistician, the Deputy Director, and 5–6 master’s degree
level statisticians, with the goal of having multiple independent reviewers for each
study. The study team, the Director of Protocols, and the protocol coordinator from
the SWOG Operations Center attend by teleconference. Study chairs are encouraged
to participate and do so as their schedules permit.
The review provides critiques and recommendations to eliminate internal incon-
sistencies and provide clarification in the protocol document, especially for eligibil-
ity, data and specimen collection, and other implementation issues. Moreover, this
review ensures consistency across studies and disease committees for our approach
to the design and conduct of trials.

Using Expert Cross-Disease Teams (Cores), Strategic Meetings, and Standardized Policies

There are many commonalities among trials conducted across diseases within
our group. Trials are becoming more complex due to the addition of extensive
biospecimen collection and high-dimensional lab assessments, the emerging
importance of quality of life and other patient-reported outcomes (PROs), the intent
of sponsoring companies to pursue regulatory registration of their drugs, and the wide
variety of clinical trial designs that may be applicable to a given study. To avoid the
need to develop this type of expertise within each disease committee, and to bring the
best processes, statistical methods, and data collection strategies to every study, expert
teams, or “cores,” have been developed that can be accessed by all committees when
designing, conducting, and analyzing clinical trials. These cores include recruitment and
retention, patient-reported outcomes, FDA application intent, clinical trial design,
and translational medicine. These cores include members from both statistics and
data management.

Recruitment and Retention Core

This group provides enhanced statistical input for accruing minority and medically
underserved populations. We have a committed Statistical Center team of staff and
faculty with expertise in these populations to support the activity of SWOG.
The SWOG Statistical Center directly serves trial recruitment goals with a full-
time staff expert (Recruitment and Retention Coordinator) for accrual-related issues
on trials and a faculty statistician with research interests in the design and analysis of studies
involving accrual and representativeness. The Recruitment and Retention Coordi-
nator works closely with study leadership, study teams, and the SWOG Recruitment
and Retention Committee to provide expertise and support in the development
of study-specific recruitment and retention strategies and materials. This coordinator
also participates in NCI working groups related to this mission.

Patient-Reported Outcomes (PRO) Core

This core team oversees the scientific review and conduct of PRO sub-studies
for treatment trials. The PRO core administers the review process for new PRO
proposals, including assessing scientific merit, feasibility, and resource allocation
within the Symptom Control and Quality of Life Committee. For approved sub-
studies, the PRO core provides the statistical design, monitoring, and analysis
resources for the conduct of the PRO study, guided by a set of key design and
analysis principles developed for PRO studies. The staff, funded primarily through
NCORP, includes a faculty statistician, master’s level statisticians, and PRO expert
data coordinators.

FDA Application Core

The goal of this core team is to ensure efficiency in process and procedures for FDA
registration trials across disease committees. This is accomplished by review of
case report forms with extended data requirements and validated data systems,
including biomarker-driven treatment assignment. The team provides training with
emphasis on regulations and additional documentation appropriate for study
management at the Statistical Center and study sites. Core team members function as
resources to ensure patient safety and data integrity checks across these trials. The
team includes members from both statistics and data management, including senior
statisticians, a CDISC expert, a SAS programmer, a statistician experienced with
prior FDA submissions, and a senior data management consultant. Funding for
activities in this core beyond NCI requirements and expectations comes from non-
NIH sources.

Clinical Trial Methods Core and Translational Medicine Methods Core

These cores have been established with the goal of introducing new designs and
analysis strategies, enhancing consistency across disease committees, and providing
a sounding board for ideas. Members of these groups include SWOG statistical
faculty as well as non-SWOG faculty who may not be directly supported by our
grants but who have methodological interests in clinical trials or translational
medicine. A goal of the cores is to assess solutions for trial design and TM analyses
that may be appropriate across committees. Each core identifies and facilitates short
topics of discussion and relevant journal papers for the broader group to review
during statistics meetings, thereby aiding the dissemination of new approaches and
methods. These cores are also responsible for maintaining guideline documentation
for best practices within SWOG. These cores ensure that state-of-the-art solutions
are used and help identify new statistical methodologies that are needed for the best
conduct of clinical trial activities.
Data science skills are supplemented by leveraging unique expertise located
outside of the Statistical Center but within the parent institution. These individuals
are included as necessary for special projects, and funding flows from appropriate
grant sources for a finite amount of time to address special topics such as biomarker
treatment designs, tumor heterogeneity, genomics, mobile data, computational lin-
guistics, and natural language processing.

Rave® Study Build Core

Medidata Rave® is a configurable electronic data capture (EDC) system that
includes data capture, management, and monitoring. Patient data received through
Rave ® are accessible by Clinical Research Associates (CRAs), data coordinators,
study chairs, statisticians, monitors, and auditors, with appropriate permission con-
trols for access in place. The Rave ® application facilitates a collaborative and
uniform environment for data capture and review at multiple levels. For trials that
are double-blinded, Rave ® and all the systems and tools outlined in this section blind
all users to treatment assignment to avoid any potential bias in the review of the
patient’s data. The Statistical Center has developed expertise in Rave ® in the areas of
custom functions, calendaring, case report form presentation, and edit checks, all of
which help to personalize the user experience to the specific patient and study while
at the same time providing consistent quality processes across studies. Our philos-
ophy is to maximize data accuracy and to create case report forms that are easy to
understand and personalized to the patient and that include comprehensive and
informative edit checks. Our approach is consistently rigorous, whether the trial
has FDA registration potential or not.
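The edit-check philosophy can be illustrated with a small sketch. The following R fragment shows the kind of range and cross-field consistency check a study build might apply; the field names and limits are hypothetical, and real Rave® edit checks are configured within the EDC system rather than written this way.

  # A toy edit check of the kind applied to CRF data (hypothetical fields/limits)
  check_vitals <- function(crf_row) {
    msgs <- character(0)
    if (is.na(crf_row$weight_kg) || crf_row$weight_kg <= 0 || crf_row$weight_kg > 400)
      msgs <- c(msgs, "weight_kg missing or outside plausible range")
    if (!is.na(crf_row$visit_date) && !is.na(crf_row$consent_date) &&
        crf_row$visit_date < crf_row$consent_date)
      msgs <- c(msgs, "visit date precedes consent date")
    if (length(msgs) == 0) "pass" else msgs
  }

  row1 <- data.frame(weight_kg = 72,
                     visit_date = as.Date("2020-03-01"),
                     consent_date = as.Date("2020-02-15"))
  check_vitals(row1)  # returns "pass"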

Training Opportunities

To ensure quality, consistency, and reliability in SWOG’s numerous clinical trials,
effective training is critical for all personnel involved in the design, conduct, and
analysis of the studies. That includes data coordinators, statisticians, clinical research
associates, and medical study leadership. The Statistical Center is actively involved
in developing these training programs which cover topics such as protocol develop-
ment, data monitoring, ethics, statistical design, and trial management for clinical
investigators. This training takes place in person at our biannual group meetings and
via web-based technology which tracks completion of courses.

Standardized Publication Policy

Authorship for SWOG also follows standards, including statistical representation.
For primary results of the primary endpoint, whether in abstract or manuscript form,
SWOG policy dictates that the first/lead author is the study chair. The lead contributing
biostatistician is listed as second author, followed by the study co-chairs involved in
study management and evaluation as listed in the protocol. The policy also provides
guidance on other types of publications and presentations.

Standardized Data Collection, Coding, and a Comprehensive Portfolio Database

The first steps of quality and statistical standardization are derived from (1) devel-
opment of protocols that are clearly stated and inclusive of all criteria and procedures
and (2) data collection necessary to address the key objectives of the trial.
The SWOG data capture system has evolved over time, starting with paper CRFs
scanned into the database via an optical character recognition system, and then
moving to an in-house EDC system that allowed users to enter and amend data from a
web-based portal as well as upload source documentation. This was used until
the acquisition and implementation of Rave®, mandated by the NCI in 2014 for all
network trial groups.
Rave® has excellent features for single studies. However, to best support cross-
portfolio strategies, data are dynamically integrated across trials: CRF data
from Rave® are uploaded to the SWOG database. The variables are then mapped into
standardized coding across studies. As described in the “Comprehensive Statistical
Reporting Tool” section below, this allows for cross-study reporting, improved
umbrella trial support, more extensive patient follow-up, and further exploratory
analysis opportunities. A key feature of the SWOG Statistical Center study design
processes is the mapping of data elements to a standard set of domains and codes,
which facilitates efficient analysis. Where possible, many of the coding conventions
are standardized across types and stages of cancer.
Unlike some other NCTN groups that leave their data stored at a remote, central
location, the Statistical Center, in cooperation with Medidata, uses Rave® Web
Services to create a process for downloading data from Rave® into the SWOG
database on a nightly basis. An important advantage of our approach is that it allows
for unified monitoring and reporting across SWOG coordinated studies. Having the
data from Rave® CRFs stored in the SWOG database is critical to the Statistical
Center data management and statistical analysis processes. It allows continued use of
our suite of custom-built applications, such as patient evaluation tools, and the
Statisticians’ Report Worksheet (SRW, described in a later section) and other reports
that are informed by data in the SWOG database. Having Rave® data in the SWOG
database also allows us to combine both Rave® and clinical trials data collected on
earlier pre-Rave ® EDC for further analysis. This facilitates our ability to conduct
SWOG database analyses that combine multiple SWOG trials over a long period of
time.
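To make the mapping step concrete, the following is a minimal R sketch of mapping a study-specific CRF variable to a standard cross-study name; the table layout, variable names, and codes are hypothetical illustrations, not the actual SWOG schema.

  # Minimal sketch of mapping Rave CRF variables to standardized names
  crf <- data.frame(patient_id = c("P001", "P002"),
                    rave_var   = "WORST_GRADE_AE",
                    rave_value = c("3", "1"),
                    stringsAsFactors = FALSE)

  # Portfolio-wide mapping table: study variable -> standard domain and name
  map <- data.frame(rave_var = "WORST_GRADE_AE",
                    domain   = "AE",
                    std_name = "ae_grade",
                    stringsAsFactors = FALSE)

  std <- merge(crf, map, by = "rave_var")   # attach standardized names
  wide <- reshape(std[, c("patient_id", "std_name", "rave_value")],
                  idvar = "patient_id", timevar = "std_name",
                  direction = "wide")
  names(wide) <- sub("^rave_value\\.", "", names(wide))
  wide  # one row per patient with standardized column names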
Having an organization that manages a portfolio of trials also allows for a unified
approach with respect to network security, information exchange security, access
controls, and disaster recovery and contingency plans. The Statistical Center
approaches data security and confidentiality with a focus on the confidentiality,
integrity, and availability of data. Processes and procedures involving network
services and data management applications are influenced by federal requirements
for computer, network, and data security. Network security is based on best practices
for electronic computing and networking and regulatory compliance. Security is
addressed through a defense-in-depth approach. Multiple layers of defense are
utilized to address potential security vulnerabilities. Policies and procedures for
disaster recovery are modified as needed and reviewed annually. Outside profes-
sional consultation, review, and auditing provide additional feedback and result in
updates to policies, procedures, and training as appropriate. All web application
traffic is secured and protected from tampering or eavesdropping by use of industry
standard cryptographic protocols. Operating system and database controls restrict
inappropriate access privileges to data, files, and other objects that require protection
from modification.

In-House Tools for Design, Study Monitoring, and Analysis of Clinical Trials

The Statistical Center is successful in creating custom software applications
specifically designed for the needs of a group running a wide array of clinical studies.
To maintain the necessary control of those applications, enhancements are applied as
needed. These tools leverage the standardized data collection and comprehensive
database structure previously described. Key tools to manage the SWOG portfolio
are described below:

Comprehensive Statistical Reporting Tool

SWOG has conducted hundreds of clinical trials, making standardization of reports
critical. However, because the group works in a wide variety of cancer settings, the
ability to customize reports is also necessary. To help achieve this balance, the Statistical Center
uses a custom-built statistical analysis reporting software application. This trial
platform-based reproducible research tool (Statisticians’ Report Worksheet, SRW)
epitomizes our statistical philosophy: While there is flexibility across diseases and
studies, common coding conventions and cross-study tools are the key to quality
monitoring and analysis of a portfolio of studies as numerous and complex as those
in the NCTN. SRW is the primary tool used by statisticians across disease commit-
tees to construct standardized reports, tables, and graphs for all studies. SRW
incorporates a web-based interface and the data from our SWOG database to create
ready-to-print reports.
SRW extracts data from the database to create SAS datasets, i.e., a “snapshot” of
the patient data, and then automatically archives the resultant dataset. SRW outputs
summaries in common formats such as Word, HTML, and PDF. SRW is used for
reporting SWOG studies in the semiannual Report of Studies (ROS) and Data and
Safety Monitoring Committee (DSMC) reports. This application is facilitated by
having the Rave ® data from every study directly incorporated into the SWOG
database together with data from our pre-Rave ® electronic data capture system as
previously described. The SRW application uses standardized modules to compose
study reports featuring standardized charts, tables, graphs, and descriptive informa-
tion. Statisticians can customize the tables to adapt to their study requirements.
Examples of customizable features are label definitions, table format information,
and selected text such as objectives, patient population, accrual goals, and study
summaries. The use of SRW provides a consistent approach to testing and verifying
analysis code for commonly required analyses rather than having all analyses
based on individual SAS programs. This standard reproducible-research mechanism
(e.g., Gentleman and Temple Lang 2007; Iqbal et al. 2016) is feasible because
the Statistical Center implements standardized coding choices of primary data
elements including eligibility, treatment status, adverse events, and outcome, across
disease committees. Our primary time-to-event outcomes are uniformly defined and
named across studies. The reporting mechanism normalizes how Phase II and Phase
III study data are presented but with study-specific flexibility with respect to tables.
A collage representation of a study report is presented in Fig. 2.

Fig. 2 A tiled representation of a report from the Statisticians’ Report Worksheet (SRW) program
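The snapshot-and-archive idea behind SRW can be sketched in a few lines of R; the function names, file paths, and summary table are hypothetical stand-ins for the much richer SRW modules.

  # Freeze the analysis dataset with a dated archive copy, then build a
  # standardized summary table from that frozen snapshot.
  snapshot_study <- function(study_data, study_id, archive_dir = "archive") {
    dir.create(archive_dir, showWarnings = FALSE)
    path <- file.path(archive_dir,
                      paste0(study_id, "_", format(Sys.Date(), "%Y%m%d"), ".rds"))
    saveRDS(study_data, path)  # frozen snapshot that the report is built from
    path
  }

  accrual_table <- function(study_data) {
    # standardized accrual-by-eligibility summary reused across study reports
    addmargins(table(study_data$arm, study_data$eligible,
                     dnn = c("Arm", "Eligible")))
  }

  d <- data.frame(arm = c("A", "A", "B"), eligible = c("Yes", "No", "Yes"))
  snapshot_study(d, "S0000")
  accrual_table(d)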

Specimen Tracking Application

SWOG has developed a sophisticated system for tracking shipment of biological
specimens from the point of collection to arrival at laboratories for analysis and/or to
the specimen biorepository for banking for future use. All banked specimens
have been consolidated in a single bank, the Nationwide Children’s Hospital
Biopathology Center in Columbus, Ohio. During the last 5 years, it tracked specimen
submission for over 100 SWOG studies, representing approximately 59,000 sub-
missions. The Specimen Tracking System is fully metadata-driven and generic
enough for any study but allows for customization as needed. A new study can be
set up in less than an hour.

For all SWOG studies, CRAs use the application to log specimens and indicate
when those specimens are shipped to the appropriate biorepository or laboratory.
Specimen-specific questions can be configured to gather information about the
specimens for use by the laboratory processing the sample (e.g., when slides were
cut, how long a sample was frozen). Laboratory staff use the application to indicate
when those shipments are received and in what condition and to indicate if the
specimens were aliquoted and/or shipped to another destination. Laboratory staff can
also enter assay test results which are communicated in real time to CRAs at the
institutions or the Statistical Center for eligibility, stratification, and/or treatment
decisions.
Every patient registered to a SWOG study is assigned a pseudo patient ID that can
be used when transmitting data to a laboratory. Only Statistical Center staff can link
these pseudo patient IDs to the clinical data. Thus, a laboratory performing an assay
with specimens received from the bank will not have treatment assignment or
clinical outcome access when performing the assay. Prior to merging clinical data
with lab data, the Statistical Center requires that the lab send us their data so that it
can be stored in our database for future use.
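A minimal R sketch of the pseudo patient ID idea follows; all object names are hypothetical. The point is that a laboratory sees only the pseudo ID, while the link to clinical data stays at the Statistical Center.

  clinical <- data.frame(patient_id = c("SW-1001", "SW-1002"),
                         arm = c("Experimental", "Standard"))

  # Link table held only at the Statistical Center
  set.seed(1)
  link <- data.frame(patient_id = clinical$patient_id,
                     pseudo_id  = sprintf("PS-%05d",
                                          sample.int(99999, nrow(clinical))))

  # What the laboratory returns: pseudo IDs plus assay results, no arm data
  lab <- data.frame(pseudo_id = link$pseudo_id, assay = c(2.4, 3.1))

  # Merge performed only at the Statistical Center after the lab returns data
  merge(merge(lab, link, by = "pseudo_id"), clinical, by = "patient_id")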

Public Use Statistical Design Calculators

Every year, a large number of prospective clinical trial and translational medicine
statistical design specifications must be evaluated in order to identify the optimal
design. To facilitate this statistical development and to encourage standards across
the portfolio, SWOG has designed a suite of trial power and sample size calculators.
Clearly, efficient methods are needed to facilitate evaluation of the potential design
and underlying model scenarios. While many sample size and power calculators are
available for simple trial designs, continued development of improved tools for
design and analyses involving multiple subgroups and complex trial monitoring
are needed. Re-implementation and expansion of existing tools to be mobile acces-
sible are ongoing to move them to a cloud-based setting using OpenCPU, a system
for embedded scientific computing and reproducible research (Ooms 2014). Inte-
gration of both JavaScript-based and R-based calculations can be accomplished
while retaining a common look and feel in the mobile environment. As an example,
Fig. 3 shows a statistical calculator that provides an interaction test for a predictive
marker and randomized treatment assignment in terms of survival outcome data.
While each type of calculator requires different input parameters, there is a relatively
standard presentation of input and output parameters across the set of power and
sample-size calculators.
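As a hedged illustration of the kind of calculation such a calculator performs, the following back-of-envelope R function approximates the number of events needed for a marker-by-treatment interaction test with a survival endpoint, assuming 1:1 randomization within each marker group so that var(log HR) is roughly 4/d for d events in a group. It is a sketch, not the SWOG calculator itself.

  interaction_events <- function(hr_pos, hr_neg, p_pos,
                                 alpha = 0.05, power = 0.90) {
    delta <- log(hr_pos) - log(hr_neg)        # interaction on the log-HR scale
    z <- qnorm(1 - alpha / 2) + qnorm(power)  # two-sided interaction test
    ceiling(z^2 * (4 / p_pos + 4 / (1 - p_pos)) / delta^2)  # total events
  }

  # Example: HR 0.6 in marker-positive patients, 1.0 in marker-negative,
  # with 40% of events expected in the marker-positive group
  interaction_events(hr_pos = 0.6, hr_neg = 1.0, p_pos = 0.4)  # ~672 events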

Automatic Monthly Study Reports

The comprehensive SWOG database also facilitates the generation of ongoing study
monitoring data summaries which are reviewed by statisticians, data coordinators,
and in some cases study chairs. For instance, summary reports of adverse events,
SAEs, and treatment data are generated monthly and emailed to study team members
for careful monitoring of trial data.

Fig. 3 An example of one of the web-based calculators used in the design of clinical and
translational medicine studies for SWOG. Each calculator uses standard coloring for input and
output parameters and includes a linked help file
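A toy R illustration of the kind of monthly adverse event roll-up described above; the data layout is hypothetical.

  ae <- data.frame(study = c("S1", "S1", "S2"), grade = c(3, 1, 4))
  # Count of grade 3+ adverse events per study over the reporting month
  aggregate(grade ~ study, data = ae, FUN = function(g) sum(g >= 3))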

Site Performance Metrics Reports

The SWOG Institutional Performance Report (IPR) measures the timeliness of data and
specimen submission across all SWOG studies. Additional metrics are in develop-
ment including assessment of responsiveness to queries, serious adverse event
(SAE) reporting timeliness, patient eligibility rates, and specimen quality indicators.
Site principal investigators and their staff, as well as SWOG leadership, receive these
reports on a regular basis, monitor the progress of sites of concern, and provide
intervention and support as appropriate, with disciplinary action as a last resort.
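A minimal R sketch of a data-submission timeliness metric of the kind the IPR might track; the column names and on-time rule are hypothetical.

  subs <- data.frame(site     = c("S1", "S1", "S2"),
                     due      = as.Date(c("2020-01-10", "2020-02-10", "2020-01-10")),
                     received = as.Date(c("2020-01-08", "2020-02-20", "2020-01-25")))
  subs$on_time <- as.numeric(subs$received - subs$due) <= 0

  # Percent of submissions on time, by site
  aggregate(on_time ~ site, data = subs, FUN = function(x) 100 * mean(x))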

Portfolio-Wide Data Safety Monitoring Committee

General Structure

A single Data Safety Monitoring Committee (DSMC) monitors all SWOG-coordinated
Phase III and randomized Phase II clinical trials. Occasionally, single-arm
trials are also monitored by the DSMC. For very large prevention trials, a separate
DSMC is formed in which some members are chosen for their expertise outside of
cancer. Typically, approximately 30 trials are being monitored at any
given time, representing diverse cancers, stages, and treatment modalities. There are
numerous benefits of having one DSMC for all trials. There is consistency in
communication and decision-making and greater efficiency in reviewing all the
trials at the same time. There is also a shared understanding of the National Clinical Trials
Network (NCTN) structure, as well as trust and respect among members with diverse
expertise who become quite familiar with each other. However, to lessen the review
burden on the members, it is important to standardize reports so that the look and feel
is similar across trials, aiding the scientific review.
The membership of the DSMC follows the National Cancer Institute (NCI)
DSMC policy. The SWOG DSMC policy can be found on the public side of the
SWOG website (SWOG.org) under policies and procedures. Members are appointed
for 3-year terms (renewable once). The committee includes physicians and statisti-
cians from within and outside SWOG who are selected based on their experience,
reputation for objectivity, absence of conflicts of interest, and knowledge of good
clinical trial methodology. The committee includes a patient advocate and a voting
statistician from outside SWOG. Three nonvoting members of the DSMC come
from the NCI – two physicians and an NCI statistician. The SWOG Group Statistician
is also a nonvoting member of the DSMC. The majority of voting DSMC
members are not affiliated with SWOG, and voting quorums for a DSMC meeting
require that the majority of voting members do not belong to SWOG.
The primary responsibility of the DSMC is to review interim analyses of
outcome data (prepared by the study statistician) and to recommend whether the
study needs to be changed or terminated based on these analyses. The committee
also determines whether and to whom confidential outcome results should be
released for planning purposes prior to the public reporting of study results. The
DSMC reviews reports of related external studies as needed to evaluate whether a
SWOG study needs to be changed or terminated or if communication is needed
with participating patients. The DSMC reviews interim toxicity data, although ongoing
toxicity monitoring is primarily the responsibility of the study committee. The
DSMC reviews major modifications to the study proposed by the study committee
prior to their implementation (e.g., termination, dropping an arm based on toxicity
results or other trials reported, increasing or decreasing target sample size or
duration). The DSMC meets twice yearly at a minimum. Each year, one meeting
is held face-to-face in conjunction with the SWOG group meeting, and the other
biannual meeting is conducted as a conference call. There are attendance requirements.

Standard Report Formatting

Standardized formatting of DSMC reports and communications is critical to
efficiently overseeing so many different trials. As previously described, SWOG follows
common statistical principles in terms of interim monitoring for both futility and
efficacy, so the specifications and reporting in the DSMC reports have a similar
presentation across studies. Every 6 months the status of each trial being monitored
is developed into a report by the study statistician, and each study report is reviewed
by the Group Statistician and Deputy Director for accuracy and consistency.
Each report contains two parts: a cover letter and the study report generated by
SRW. The cover letter provides an overview of the study including current accrual
status, any notable issues related to safety or feasibility, concerns about design
assumptions, or external information. A table showing the number of planned
interim analyses and their expected schedule is included. Although each trial has
its own unique features, the table structure looks similar to the following (Table 1).
Also included in the cover memo are results of any conducted interim analysis and
interpretation with respect to prespecified statistical boundaries. The memo also
reminds the DSMC when prior interim analyses were conducted, when the next
analysis is projected to occur, and of any prior permissions the DSMC has granted
for the trial. The overall format of these memos is kept similar across studies, so the
DSMC members know where to look to find specific pieces of information.
The second part of the DSMC report is prepared using our Statisticians’ Report
Worksheet (SRW), a tool that was described previously. Each report has a similar
look and feel. The face sheet includes information about study title, study phase,
activation date and date of accrual closure if applicable, study leaders, and a schema
of the study design. Following that header, there is text about primary and secondary
objectives, the eligible study population, accrual goals, and a summary of the
ineligibility, toxicity assessment and most frequent or severe adverse events, major
treatment deviations, and other notable aspects of the trial. Standardized tables about
accrual, eligibility, patient characteristics, treatment status, and detailed adverse
events are also included. Response rates and time to event data are included as
appropriate for formal interim analyses.
A compiled notebook of all the reports is carefully reviewed by the Group
Statistician and Deputy Director of the Statistical Center before it is finalized. The
DSMC receives the notebook at least 3 weeks prior to the biannual DSMC meeting
along with a suggested draft agenda for the meeting. The DSMC members are
invited to add to the agenda as they see fit. The study chair may also prepare
a report for the DSMC addressing specific toxicity concerns or other concerns
about the conduct of the study. However, the study chair does not have access to
the DSMC study report prepared by the statistician. The statistician’s report may
contain recommendations on whether to close the study, whether to report the
results, whether the design assumptions should be adjusted, whether to continue
accrual or follow-up, and/or whether a DSMC discussion is needed. Unless a DSMC
discussion is requested by a DSMC member, the study report is accepted without
discussion.
Table 1 Example interim analysis table provided to the DSMC for an ongoing monitored trial: a Phase III two-armed trial testing superiority of an experimental agent

Interim analysis | Expected time since start of trial | # of expected events, standard arm | # of expected events, experimental arm (assuming Ha: HR = 0.75) | % of expected death information | Superiority one-sided α (Ho: HR = 1.0) | Futility one-sided α (Ho: HR = 0.75)
1 | 3.2 years | 107 | 88 | 38% | N/A | 0.01
2 | 4 years | 190 | 158 | 67% | 0.005 | 0.01
3 | 5 years | 246 | 207 | 86% | 0.005 | 0.01
Final | 5.75 years | 283 | 240 | 100% | 0.022 | N/A

The DSMC meeting includes three parts. The first part is an open session in which
members of the study team and respective disease committee leadership may be invited
by the DSMC to answer questions or present their requests. Following the open session,
there is a closed session limited to DSMC members and possibly the study statistician in
which outcome results will be presented either by a member of the DSMC, the
designated SWOG Statistician, or the study statistician. A fully closed executive session
follows in which the DSMC discusses outcome results and then votes. At the fully
closed executive session, those present are limited to DSMC members.
The DSMC provides written recommendations to the SWOG Group Chair. If the
Group Chair agrees, the recommendations are forwarded to the National Cancer Institute
for evaluation. Details of this process of communication and required actions
are covered in our DSMC policy.
Individuals invited to serve on the DSMC (voting and nonvoting) disclose to the
Group Chair any potential, real, or perceived conflicts of interest. The Statistical
Center representative to the DSMC is also a member of SWOG’s Conflict Manage-
ment Committee, serving as a liaison between the two committees.

General Interim Analysis Strategies

While there is some study flexibility, the SWOG Statistical Center sets standards
with respect to interim analysis strategies. Stopping rules are based on group
sequential designs to preserve overall error rates but allow for early stopping if
extreme results are observed. In addition to the specification of Type I and Type II
errors, a typical design for a Phase III study would call for the specification of a small
number of interim analyses, between two and five, with a small probability of
concluding that treatment is efficacious under the null hypothesis. The timing
of interim analyses is based on overall information or event calculation for time-
to-event studies. SWOG statisticians also typically define a one-sided test of “futil-
ity” using a similar early stopping rule based on testing the alternative hypothesis,
rather than performing a test based on conditional power. For many studies, critical
p-values are chosen for interim stopping at a small number of conservative early
assessments that test the alternative hypothesis (e.g., Green et al. 2016; Fleming et al.
1984). Some plans also include an assessment at 50% information time that stops the
trial if the estimated hazard ratio does not favor the experimental treatment. This
guideline is easy to describe and can be less conservative than testing the alternative
hypothesis. With respect to the time scale of interim analyses, the approach supported
by Freidlin et al. (2016), using combined-arm event information, is followed in most
instances. However,
there is flexibility depending on the specific trial features. Regardless of the interim
analysis strategy, design properties such as power, Type I error, and stopping
probabilities must be assessed and presented in the statistical section of the protocol.
Most Phase III trials involve two treatment arms. Trials with more than two arms
or biomarker-based subgroups are fully addressed in the interim analysis plans.
To facilitate interpretation by the SWOG DSMC, the report includes a table of
estimates/p-values and actions defined by the statistical analysis plan in the protocol.
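A minimal R sketch of the decision logic described above, with efficacy testing Ho: HR = 1 and futility testing the alternative Ha: HR = 0.75, each one-sided, under the standard approximation se(log HR) = 2/sqrt(D) for a 1:1 randomized trial with D combined-arm events. The alpha levels mirror the illustrative Table 1 and are not a prescription.

  interim_decision <- function(loghr_hat, events, hr_alt = 0.75,
                               alpha_eff = 0.005, alpha_fut = 0.01) {
    se <- 2 / sqrt(events)
    p_eff <- pnorm(loghr_hat / se)              # small when HR is well below 1
    p_fut <- pnorm((loghr_hat - log(hr_alt)) / se,
                   lower.tail = FALSE)          # small when HR is well above 0.75
    if (p_eff < alpha_eff)      "stop: efficacy boundary crossed"
    else if (p_fut < alpha_fut) "stop: futility (alternative rejected)"
    else                        "continue"
  }

  interim_decision(log(0.95), events = 190)  # observed HR 0.95 -> "continue"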

Statistical Center External Interactions

SWOG’s Statistical Center staff need to effectively interact with external entities to
carry out their clinical trial mission. Interested investigators approach SWOG
to access biospecimens, trial data, or both. Standard, transparent processes need to
be in place to handle and evaluate these queries.

Data Sharing

There are two paths for obtaining trial data from the group’s studies: through the
NCTN/NCORP Data Archive and by direct application to SWOG to request
data. The Statistical Center follows established procedures for archiving SWOG
data with the official NCTN/NCORP Data Archive. Data from qualifying
studies must be archived at the NCI within 6 months of publication. Qualifying
datasets include data from recently reported Phase III trials. The scope of
required data sharing has also increased to include data used in many second-
ary analyses of NCTN and NCORP trial data. Standard operating procedures
(SOPs) are developed at the Statistical Center to document detailed steps of the
processes. A statistician creates the files required to be archived with the NCI
that are sufficient to reproduce all results reported in the primary manuscript.
The Statistical Center administrator assists with creating the data dictionary and
reviews the datasets to ensure compliance with the guidelines. For trial data
that are not stored in the NCTN/NCORP Data Archive, a second path is used
to obtain data. Investigators submit a brief proposal to SWOG that includes
some background for their proposed data analysis, objectives, statistical anal-
ysis methods to be used, and data elements requested. After evaluating for
feasibility and ensuring no overlap with ongoing work, the SWOG Executive
Committee approves the proposal, and a data usage agreement is executed
between the investigator and SWOG. Requested data are then shared with the
investigator. A pseudo-patient ID is used to link records. Data are shared in the
preferred format of the investigator which typically involves Excel spreadsheets
or SAS datasets.
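A minimal R sketch of packaging a shareable dataset together with its data dictionary; the variable names and file names are hypothetical.

  shared <- data.frame(pseudo_id = c("PS-00001", "PS-00002"),
                       os_months = c(24.1, 30.5),
                       os_event  = c(1L, 0L))
  dictionary <- data.frame(
    variable    = names(shared),
    description = c("Pseudo patient ID linking records across files",
                    "Overall survival time in months",
                    "Death indicator: 1 = died, 0 = censored"))
  write.csv(shared,     "trial_data_shared.csv",     row.names = FALSE)
  write.csv(dictionary, "trial_data_dictionary.csv", row.names = FALSE)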

Biospecimen Sharing

Requests for SWOG specimens are common and may arise from SWOG, other
NCTN groups, or nonaffiliated investigators. For SWOG-led intergroup studies,
the specimens are usually housed in the SWOG biorepository at Nationwide
Children’s Hospital. Appropriate permissions are required, usually from the
NCI Correlative Sciences Steering Committee. Once a material usage agree-
ment (MUA) and data usage agreement (DUA) (if applicable) are executed and
communicated to the biobank and Statistical Center, statisticians use our linked
biospecimen inventory at the Statistical Center to produce pull lists that have
the proper required consent for the translational study, which is then commu-
nicated to the biorepository. With the introduction of the NCI Navigator
system, there will be increased opportunities and effort for Statistical Center
statisticians both to support assessing the feasibility of proposed translational
medicine studies and, where appropriate, collaborate on the design and analysis
of the resulting studies.
Most trial specimens can be requested via a new resource recently launched
by the NCI: NCTN Navigator. Cancer researchers interested in conducting
studies using biological specimens and clinical data collected from cancer
treatment trials in the NCTN can use this resource. It includes information
about specimens, such as tumor and blood samples, donated by patients in
NCI-sponsored clinical trials. The clinical trials included in Navigator are
published Phase III studies that evaluated cancer treatments. Investigators can
use the NCTN Navigator website to search the inventory for specimens with
specific characteristics. Investigators who develop proposals and get approval
can use the specimens, along with the trial participants’ clinical information, in
their research. SWOG has a full specimen inventory database from our unified
biobank at Nationwide Children’s Hospital. Specimen data are linked to SWOG
clinical trial data elements to enable efficient overall specimen data manage-
ment. This simplifies specimen utilization for patients meeting various clinical
criteria as well as the creation of reports or datasets that combine data from the
clinical and biorepository databases. It also enhances the efficiency of interac-
tions with the NCI Navigator system. Our enhanced database includes coding
for projects for which specimens are requested and indicates the disbursement
of specimens, project completion, and the return of unused specimens to the
SWOG biorepository.

Summary and Conclusion

Our approach to designing, monitoring, and analyzing a diverse portfolio of
trials is to use standardized processes and software tools. In addition to the
software tools and reports described, a set of NCI/NCTN tools that work across
the NCTN groups is used, and standardized software applications are employed
to increase the quality and efficiency of study implementation, data evaluation,
monitoring, and reporting across SWOG’s portfolio of clinical trials. The structure
of the SWOG comprehensive database of clinical and biorepository inventory data
enhances our efficiency. In addition, it facilitates data sharing and the use of
archived biospecimens by outside researchers. Although the specifics of the
software and procedures for efficient trial portfolio management in the NCTN
setting may not be directly applicable to single institutions or to organizations
external to the NCI, standardized trial design, protocol development, and trial
data elements, together with expert teams to support the group, are good practices
for any group overseeing multiple simultaneously running clinical trials.

Key Facts

To provide statistical design and data management and to efficiently monitor and report
over a portfolio of clinical trials, communication must be facilitated between
statisticians and data management staff, including regularly scheduled meetings
among senior faculty, senior statistical research associates, data management, and
applications development management.
A standardized, multidisciplinary process for development and review of commonly
structured protocols yields clear, scientifically sound documents that promote quality
and efficiency in the conduct of our trials and provide enrolling sites with a format
with which they are familiar.
Portfolio-wide processes are developed to avoid the need to develop expertise
(e.g., recruitment and retention, patient-reported outcomes, FDA application intent,
trial design, study build, translational medicine) within each disease committee. We
form expert teams (or cores) that can be accessed by all committees when designing,
conducting, and analyzing clinical trials.
A key feature of our study design processes is the mapping of data elements to a
standard set of domains and codes, which facilitates efficient analysis and enables
us to conduct analyses that combine multiple trials over a long period of
time.
Creation of custom software applications helps to address the needs of a group
running a wide array of clinical trials. Because of standardized data collection and a
comprehensive database structure, applications such as a standardized yet flexible
statistical report writing tool, a specimen tracking system, and site performance
metrics reports can be developed.
Having an organization that manages a portfolio of trials allows for a unified
approach with respect to network security, information exchange security, confiden-
tiality, access controls, and disaster recovery and contingency plans. Additionally,
cross-portfolio training can be applied to address common features and issues,
recognizing that study-specific training also is necessary. A portfolio-wide Data
Safety Monitoring Committee can also be utilized.
Standard, transparent processes need to be in place to handle and evaluate queries
from external entities wishing to access biospecimens, trial data, or both.

References
Fleming TR, Harrington DP, O’Brien PC (1984) Designs for group sequential tests. Control Clin
Trials 5(4):348–361
Freidlin B, Othus M, Korn EL (2016) Information time scales for interim analyses of randomized
clinical trials. Clin Trials 13(4):391–399. https://doi.org/10.1177/1740774516644752. PMID
27136947
Gentleman R, Temple Lang D (2007) Statistical analyses and reproducible research. J Comput
Graph Stat 16:1–23. https://doi.org/10.1198/106186007X178663
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. CRC Press,
Boca Raton

Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis J (2016) Reproducible research practices
and transparency across the biomedical literature. PLoS Biol 14(1):e1002333. https://doi.org/
10.1371/journal.pbio.1002333. PMID: 26726926. PMCID: PMC4699702
NCI Community Oncology Research Program (NCORP). www.cancer.gov/research/areas/clinical-
trials/ncorp
NCI’s National Clinical Trials Network (NCTN). www.cancer.gov/research/areas/clinical-trials/
nctn
Ooms J (2014) The OpenCPU system: towards a universal interface for scientific computing
through separation of concerns. https://arxiv.org/abs/1406.4806
34 Archiving Records and Materials

Winifred Werther and Curtis L. Meinert

Contents
Introduction
Trial Master File (TMF)
Prerequisites
Key Study Documents
   Study Protocol
   Consent Forms
   Data Collection Forms
   Investigator's Brochure
Key Communications
   IRB Transmissions and Communications
   Reports of Adverse Events
   Directives from Sponsors and Regulatory Agencies
   Inquiries from Persons or Journalists Concerning the Trial
The Trial Data System and Database
Registration
Other Study Documents
Access to the Archive
TMF Retention Time
Summary and Conclusions
Key Facts
Cross-References
References

W. Werther (*)
Center for Observational Research, Amgen Inc, South San Francisco, CA, USA
e-mail: [email protected]
C. L. Meinert
Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore,
MD, USA
e-mail: [email protected]


Abstract
An archive, in the context of a trial, is a collection of documents and records
relevant to the design and conduct of the trial maintained as a historical reposi-
tory. Archiving is a process that starts before the first person is enrolled and
continues to the end of the trial when all analyses are complete and the investi-
gator group disbands.
So, when a trial is finished, money has run out, and investigators have
dispersed, what do you have archived and where? The answer to the first question
is “everything you may need later,” and the answer to the second is “someplace
readily accessible far into the foreseeable future.” Both answers are correct but
not helpful because the first question requires a crystal ball of what might be
needed and the second requires a place like the Smithsonian and there are no
Smithsonians for archiving records of clinical trials.
This chapter is about the process of archiving and about what to archive.

Keywords
Trial master file · Archiving · Electronic

Introduction

An archive is a place where records or historical materials are stored and preserved.
The place may be a physical location, like the National Archives where records can
be accessed and viewed, or an electronic address serving the same purpose, the latter
usually the case for clinical trials. The International Council for Harmonisation
(ICH) good clinical practice (GCP) guidelines are foundational for archiving guidance, as set forth by the European Medicines Agency (EMA 2018).
But even if there were no legal requirements for documentation and archiving,
investigators would document on their own. They need documentation should they
need to retrace steps or check on what they did. They need documentation if
questions arise from outside the trial regarding what they did or how they did it.
Archiving can be a safeguard for questions that might occur during and after
conduct of the trial. A few examples of questions follow. First, investigators in
VIGOR (Vioxx Gastrointestinal Outcomes Research) published their results in
November 2000 in the NEJM (Bombardier et al. 2000). The NEJM expressions of
concern regarding counts in VIGOR came 5 years later (Curfman et al. 2005, 2006).
Second, troubles in the National Surgical Adjuvant Breast and Bowel Project
(NSABP) came from falsified data in breast cancer trials (Crewdson 1994). Third,
the University Group Diabetes Program (UGDP) was a randomized multicenter
secondary prevention trial designed to test whether commonly used treatments for
type 2 diabetes were useful in delaying the cardiovascular and neurological sequelae
of the disease (UGDP 1970a). The trial started in 1960 and finished in 1978. About midway through, investigators stopped the use of one of the treatments, tolbutamide
(an oral drug widely regarded as safe and effective in the diabetic community),
because there were concerns regarding safety (UGDP 1970b). The decision brought
an avalanche of criticisms from diabetologists and ultimately led to a review of the
trial by a special committee commissioned by the International Biometric Society.
The committee met several times from 1972 through 1974 and published its report in
JAMA in 1975 (Gilbert et al. 1975). The first meeting of the committee was at the
coordinating center for the trial in the fall of 1972. The first thing the committee
wanted to see was a description of the randomization procedure used in the trial,
written in 1960, before the trial started. The problem was that, somehow, after all
those years in a dark filing cabinet, various sentences, “crystal clear” when written,
had morphed into puzzling statements.
Lesson: Foundational documents, like the system for randomization, should be
read and reviewed by multiple members of the investigational team before archiving.
As these historical examples show, investigators and sponsors can and should expect many different reasons for needing and using the archive during and after clinical trial activities.

Trial Master File (TMF)

The TMF, broadly, is a collection of documents and files created over the course of a
trial that enables sponsors, monitors, agencies, authorities, or persons to check and
reconstruct what was done. The TMF is discussed in the European Medicines
Agency Guideline on the content, management, and archiving of the clinical trial
master file (paper and/or electronic) (EMA 2018). The TMF is often managed and
maintained electronically and is referred to as the electronic TMF or eTMF.
The executive summary of the Guideline on the content, management, and
archiving of the clinical trial master file provides an overview of the intention of
the TMF and is quoted here:

Trial master file (TMF) plays a key role in the successful management of a trial by the
investigator/institutions and sponsors. The essential documents and data records stored in the
TMF enable the operational staff as well as monitors, auditors and inspectors to evaluate
compliance with the protocol, the trial’s safe conduct and the quality of the data obtained.
This guideline is intended to assist the sponsors and investigators/institutions in complying
with the requirements of the current legislation (Directive 2001/20/EC and Directive 2005/
28/EC), as well as ICH E6 Good Clinical Practice (GCP) Guideline (‘ICH GCP guideline’),
regarding the structure, content, management and archiving of the clinical trial master file
(TMF). The guidance also applies to the legal representatives and contract research organi-
sation (CROs), which according to the ICH GCP guideline includes any third party such as
vendors and service providers to the extent of their assumed sponsor trial-related duties and
functions. The ICH GCP guideline provides information in relation to essential documents to
be collected during the conduct of a clinical trial. The risk-based approach to quality
management also has an impact on the content of the TMF. To ensure continued guidance
once the Clinical Trials Regulation (EU) No. 536/2014 (‘Regulation’) comes into
application, this guidance already prospectively considers the specific requirements of the
Regulation with respect to the TMF.

The table of contents of the Guideline on the content, management, and archiving
of the clinical trial master file is a good reference point when planning a TMF and is
provided below:

1. Executive summary
2. Introduction
3. Trial master file structure and contents
3.1. Sponsor and investigator trial master file
3.2. Contract research organisations
3.3. Third parties-contracted by investigator/institution
3.4. Trial master file structure
3.5. Trial master file contents
3.5.1. Essential documents
3.5.2. Superseded documents
3.5.3. Correspondence
3.5.4. Contemporariness of trial master file
4. Security and control of trial master file
4.1. Access to trial master file
4.1.1. Storage areas for trial master file
4.1.2. Sponsor/CRO electronic trial master file
4.1.3. Investigator electronic trial master file
4.2. Quality of trial master file
5. Scanning or transfers to other media
5.1. Certified copies
5.2. Other copies
5.3. Scanning or transfer to other media
5.4. Validation of the digitisation and transfer process
5.5. Destruction of original documents after digitisation and transfer
6. Archiving and retention of trial master file
6.1. Archiving of sponsor trial master file
6.2. Archiving of investigator/institution trial master file
6.3. Retention times of trial master file
6.4. Archiving, retention and change of ownership/responsibility
7. References

Prerequisites

The study team needs to document in real time for the TMF to be sufficient to allow
people or authorities to reconstruct how the trial was conducted after it is finished. To
accomplish this there must be understanding prior to starting the trial of what gets
documented and by whom. Also required are understandings as to where documents
will be stored and whether in hard or electronic form. To accomplish timely archiving, the roles and responsibilities of the trial team will be defined by the coordinating center.
The quality and extent of documentation can be expected to vary by who funds
the trial, its size, and location. Small, single-center, investigator-initiated trials may not be as well documented as multicenter international trials. Sponsors subject to
oversight by regulatory agencies can be expected to be quite compulsive about
maintaining the TMF. Indeed, typically in those cases it will be the sponsor who maintains the majority of the documentation, either directly or through a contract research organization (CRO).
Clearly, if there are two or more administrative parties in a trial who are responsible for archiving, then there must be agreement on division of labor as to who maintains the archive and who produces archived copies of documents. Duplicate documents, produced by different parties, are not virtues when it comes to documentation because, invariably, they will differ.
Experience teaches that documenting is not a favorite activity of trialists. More
likely than not, investigators assume it will be done by “somebody else,” and their
eyes glaze when protocols and procedures for documentation are discussed at
investigator meetings.

Key Study Documents

The study protocol, consent forms, and data collection forms are at the heart of the
trial. All three should be open to the public, except for details that have the potential
of biasing results, for example, details regarding masking and randomization
schemes. The Investigator’s Brochure is another key study document that is neces-
sary when the clinical trial involves an investigational drug.

Study Protocol

The protocol is roughly akin to a blueprint for a building but far less detailed than
blueprints. The protocol, unlike blueprints, allows room for clinical judgment. To
facilitate inclusion of the proper details for the trial conduct, SPIRIT (Standard
Protocol Items: Recommendations for Interventional Trials) is a published document
with a 33-item list of information to be included in protocols (Chan et al. 2013).
Protocols should be open to the public. One way to accomplish this is by posting
on registration sites, such as ClinicalTrials.gov. ClinicalTrials.gov has a field to
include protocols, but it is only sparingly used: 5,721 postings out of 145,844
completed trials, 3.9% of completed trials, as of 3 April 2020 (US NLM 2020).
Publishing protocols as standalone manuscripts in a clinical trials journal is one way to make them public, but perhaps the best way is as supplemental material in results publications. Even if that were done for every results publication, the practice would cover only a fraction of trials, since the majority of trials are never published. A stumbling block in publicizing protocols is the desire of proprietary sponsors to keep competitors from knowing what they are doing. Posting protocols may give away business secrets.
The reality is that protocols are subject to change over the course of a trial through
the documentation of protocol amendments. The study archive postings need to be
updated if they are to retain their informative value. Some of the changes may be
minor, but some can be substantive changes to enrollment criteria, changes in
dosage schedules, or addition or deletion of treatments in the trial.
The study protocol and all subsequent amendments are archived in the TMF.

Consent Forms

Consents are necessary prerequisites to enrolling persons into clinical trials. All
transactions concerning consents must be archived in case questions arise later regarding the content of each version of the informed consent, when each version was used at the trial sites, and when it was signed by trial participants.
Consents may be oral or written depending on settings and circumstance. The
language and content of the informed consent document are controlled by local
IRBs, even if the trial is done with a central IRB (NIH 2016; FDA 2006).
Clinics in multicenter trials may be provided with prototype consent statements
prepared by the coordinating center or some other leadership center in the trial, but
individual clinics and local IRBs are free to change language or add statements to the
prototype, provided the primary information transmitted remains unaltered. Individ-
ual clinics are responsible for archiving their own consent statements, as approved
by local IRBs. The trial leadership is responsible for archiving the master consent statement and changes thereto in the TMF.

Data Collection Forms

The data collection schedule is outlined in the protocol. Copies of data collection
forms (electronic or paper) and changes thereto during the trial should be
documented in sufficient detail to permit reconstruction of the data collection effort,
if necessary, after the trial is finished. Electronic data collection systems need to have
audit trail functions to track changes to data collection during the conduct of the trial.
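To illustrate what an audit trail function records, here is a minimal Python sketch using an invented file-based log; real electronic data capture systems implement this inside a validated database with access controls, but the captured elements (who, when, old value, new value, and reason) are the essential ones.

import json
from datetime import datetime, timezone

class AuditTrail:
    """Append-only change log: entries are added, never edited or deleted."""

    def __init__(self, path: str):
        self.path = path

    def log_change(self, subject_id: str, field: str,
                   old_value, new_value, user: str, reason: str) -> None:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "subject_id": subject_id,
            "field": field,
            "old_value": old_value,
            "new_value": new_value,
            "user": user,
            "reason": reason,  # data changes are expected to carry a reason
        }
        # Append mode: one JSON record per line, earlier entries untouched
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

trail = AuditTrail("audit_trail.jsonl")
trail.log_change("SUBJ-001", "systolic_bp", 140, 104,
                 user="site_coordinator", reason="transcription error")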

Investigator’s Brochure

An Investigator's Brochure exists only if the trial involves an investigational product under the control of the FDA or other regulatory agencies. The brochure serves to
summarize information about the product, recommended doses, and likely side
effects. The holder of the investigational new drug (IND) application is responsible
for updates to the brochure and distribution to IRBs and study centers involved in
studying the product. Changes to the Investigator’s Brochure over time should be
archived in the TMF.

Key Communications

Communications and correspondence regarding clinical trials are archived in the TMF, and requirements are described in the EMA Guideline (EMA 2018). There are many types of correspondence to consider when organizing them for the archive.

IRB Transmissions and Communications

Individual study centers are responsible for communications to and from their
respective IRBs and for archiving same. The study coordinating center, office of
the chair, sponsor, or some other party in multicenter trials is responsible for
archiving communications to and from the study parent IRB, including communi-
cations concerning the study protocol and changes to it, prototype consent forms,
and data collection forms.

Reports of Adverse Events

Adverse events must be reported to IRBs and all clinics in the trial. Typically, the
clinic in which the event occurred reports the event to the study coordinating center
or like leadership center in multicenter trials, and it in turn reports the event to all
study centers and sponsors. Sponsors have an obligation to maintain an adverse
event database for the investigational or marketed product and an obligation to report
adverse events to regulatory authorities with specific timelines. Regulatory author-
ities may place clinical trials on hold based on adverse event reporting. Communi-
cations on adverse event reporting should be included in the study archive.

Directives from Sponsors and Regulatory Agencies

Directives from sponsors and regulatory agencies concerning the trial must be
communicated to study IRBs and study centers and implemented as indicated.
Archiving is the responsibility of the study coordinating center or like leadership center in the trial.

Inquiries from Persons or Journalists Concerning the Trial

Queries from persons or the press concerning the trial should be logged with details
as to resolution. Questions from patients in trials usually are addressed at the clinic
level. Multicenter trials will have structures for dealing with questions from the press
or others not involved in the trial. The usual course is to refer those queries to the
study chair or some other responsible person in the organization structure of the trial.
Correspondence should be logged and archived.

The Trial Data System and Database

The trial data system, including the database, is the soul of the trial. In this electronic age, it will likely comprise dozens of programs to construct and manage the data system and as many to monitor and analyze data during and after completion of the trial, many of which will be updated or changed over the course of the trial. The developer and operator of the data system are responsible for archiving data system programs. Data analysts are responsible for archiving analysis programs.
The prize possession of a trial is its data. Obviously, the finished, identified dataset must be archived, but it must first be cleaned of outstanding edits and checks. Once the dataset is frozen for archiving, changes or updates are nuisances, especially if the changes affect counts or results in published papers.
The archive must be secure, password protected, and in a location likely to allow
access for a minimum of 20 years after deposit.
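One practical safeguard when freezing a dataset for archive is to deposit a checksum manifest alongside the files, so that anyone opening the archive years later can verify that nothing has changed. The following Python sketch, with an invented manifest format, illustrates the idea.

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(archive_dir: str, manifest_name: str = "MANIFEST.json") -> None:
    """Record a digest for every file in the frozen archive directory."""
    root = Path(archive_dir)
    manifest = {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != manifest_name
    }
    (root / manifest_name).write_text(json.dumps(manifest, indent=2))

def verify(archive_dir: str, manifest_name: str = "MANIFEST.json") -> list:
    """Return the names of files whose checksums no longer match."""
    root = Path(archive_dir)
    manifest = json.loads((root / manifest_name).read_text())
    return [name for name, digest in manifest.items()
            if sha256_of(root / name) != digest]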
But nothing is forever, and, hence, eventually files may be unreadable because
technologies change. When the UGDP ended, the dataset was deposited at the
National Technical Information Service (NTIS) on magnetic tape. Even if it still
exists there, it would be hard to find anyone capable of reading magnetic tapes.
Datasets can have value long after trials are finished. The Coronary Drug Project
(CDP) ran from 1966 to 1985 (CDP 1973). Just recently a person requested the
dataset to do a follow-up study of enrollees. The dataset disappeared when the
institution housing the coordinating center for the trial ceased to exist in 2010.
Investigators must decide whether to produce a deidentified dataset for use by people outside the research group. Increasingly, the expectation is that there will be a deidentified dataset available, but deidentifying data is no mean task. It takes time, costs money, and requires skilled people to do the deidentifying. Clinical research teams can hire experts on deidentification to ensure proper procedures are followed. If done, the set may be available on request or may be deposited with a commercial enterprise specializing in such services.
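As a taste of what deidentification involves, the Python sketch below applies two commonly used techniques, dropping direct identifiers and shifting dates by a per-subject random offset so that within-subject intervals are preserved while calendar dates are not. The column names are hypothetical, and this is nowhere near a complete procedure; a real effort must be reviewed by deidentification experts against standards such as the HIPAA Safe Harbor provisions.

import random

# Hypothetical direct identifiers to drop outright
DIRECT_IDENTIFIERS = {"name", "address", "phone", "ssn", "medical_record_no"}

def deidentify(rows: list, id_column: str = "subject_id") -> list:
    """Drop direct identifiers and date-shift each subject's visit days."""
    offsets = {}  # one stable random offset per subject
    out = []
    for row in rows:
        sid = row[id_column]
        offsets.setdefault(sid, random.randint(-180, 180))
        clean = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        # Shift a day-number column; real data would need date parsing
        if "visit_day" in clean:
            clean["visit_day"] = int(clean["visit_day"]) + offsets[sid]
        out.append(clean)
    return out

rows = [{"subject_id": "S1", "name": "J. Doe", "visit_day": 10},
        {"subject_id": "S1", "name": "J. Doe", "visit_day": 38}]
print(deidentify(rows))  # names are gone; the 28-day interval is preserved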
Another use of deidentified data may be participation in meta-analyses or pooling
of placebo-treated patients across trials to better understand the underlying patient
population. Cooperative groups and groups led by medical associations are leading
some efforts to pool deidentified patient-level data.

Registration

Registration on websites for trials, such as ClinicalTrials.gov, is a form of archiving, though not mentioned in the EMA guidelines for archiving. However, the EMA does
maintain EudraCT (European Union Drug Regulating Authorities Clinical Trials
Database), which is the European database for all interventional clinical trials on
medicinal products authorized in the European Union and outside the EU if they are
part of the Pediatric Investigation Plan from 1 May 2004 onwards. It has been
established in accordance with Directive 2001/20/EC. Protocol and results information on interventional clinical trials have been made publicly available through the European Union Clinical Trials Register since September 2011 (EMA 2020).
Trials are to be registered prior to the start of enrollment, and registrations are to be updated through completion of the trial. The ClinicalTrials.gov website has a field for
posting protocols and logging updates to it over the course of the trial and for listing
citations to publications from the investigator group.
Results, without written comments, are to be posted to the website within 1 year
after completion of the trial. The bad news is that only a small fraction of the
registrations contains posted results. For example, for trials completed in 2018,
only 13% had posted results, as of 9 April 2020 (US NLM 2020).

Other Study Documents

Trials, like people, need a curriculum vitae (CV) to list key facts, activities, and accomplishments. An example is the CV for the National Emphysema Treatment Trial (NETT 1999) posted at trialsmeinertsway.com (Meinert 2020). Its content is as below:

1. Background and rationale
2. Design summary
3. Summary of pulmonary rehabilitation program
4. Consent, data collection, and telephone contact schedule through Dec 2002
5. Substudies
6. Landmark events
7. Participating centers, groups, and committees
8. Publications
9. Presentations
10. Meetings
11. Site visits
12. Meetings/conference calls and site visits by year of study
13. Support statement
14. Contract numbers, funding period, and ClinicalTrials.gov number
15. Repositories
16. Items on file at the National Technical Information Service
17. NETT website
18. Accessing the NETT Limited Access Dataset

Other documents that should be archived:

• Manuals of operations and study handbooks, data collection forms, and revision histories
• Policy and procedures memoranda (memos having force of protocol distributed to study centers)
• Funding history
• Photographs and digital media images

Access to the Archive

Trials, especially multicenter trials, have a website built specifically for use by
investigators in the trial. Typical study websites will include the Investigator’s
Brochure, current version of the study protocol, study handbooks and manuals,
copies of data collection forms, and other information of importance to investigators
in the trial. Access must be password protected. If public access to the protocol,
consent forms, and data collection forms is provided, it will be provided on a public
website.
Access to the TMF or eTMF is controlled by the coordinating center, and staff working on the trial will be given access according to their roles; for example, an editor can upload documents, while staff who only need to view files will be given read-only access.
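A minimal sketch of such role-based access, with invented role names and permissions, might look like the following in Python; production eTMF systems enforce rules like these in the application layer and audit every access.

# Hypothetical roles mapped to the actions they may perform
ROLE_PERMISSIONS = {
    "editor": {"read", "upload"},
    "monitor": {"read"},
    "site_staff": {"read"},
    "admin": {"read", "upload", "manage_users"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the role is permitted to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("editor", "upload")
assert not is_allowed("site_staff", "upload")  # read-only access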

TMF Retention Time

For trials conducted under Directive 2001/20 of the European Union, the sponsor
and investigator must ensure that the documents in the TMF are retained for at least
5 years after the conclusion of the trial or in accordance with national regulations; for
example, Germany requires a 10-year period of retention (de Mey 2018). Some
countries require 20 years or longer.

Summary and Conclusions

Archiving records and materials is a critical activity during the conduct of clinical
trials. The repository for the archive is referred to as the TMF. The most widely used guideline on creating and maintaining the TMF is published by the EMA. Successful archiving includes specified roles and responsibilities of the staff charged with archiving for the trial. Special attention should be paid to key documents of the trial, including the protocol and consent forms and the various versions used during the trial. Correspondence and communications with trial sponsors, press, and others are included in the archive. The data system and database demand special consideration when being locked for archiving. One method for archiving publicly is
to include the trial protocol and results in a clinical trial registry, such as
ClinicalTrials.gov. Access to the archive is controlled by the leadership of the trial.
Retention for archives varies by country and region and needs to be taken into
consideration when planning the archive.

Key Facts

• The trial master file (TMF) serves as the archive for a clinical trial and can be
paper and/or electronic.
• Guidelines on the TMF have been published by the EMA.
• Many considerations go into creating and maintaining the TMF including key
documents, communications, data systems, and other documents. All versions of
documents used during the trial are included in the archive.
• Registration of the trial offers an opportunity to provide documents from the
archive to the public.
• Access to the TMF is controlled by the clinical trial leadership.
• Retention time is a consideration when choosing where to house the TMF.

Cross-References

▶ Good Clinical Practice
▶ Regulatory Requirements in Clinical Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Trial Organization and Governance

References
Bombardier C, Laine L, Reicin A, Shapiro D, Burgos-Vargas R, Davis B, Day R, Ferraz MB,
Hawkey CJ, Hochberg MC, Kvien TK, Schnitzer TJ for the VIGOR Study Group (2000)
Comparison of upper gastrointestinal toxicity of rofecoxib and naproxen in patients with
rheumatoid arthritis. N Engl J Med 343:1520–1528
Chan A-W, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jeric K, Hróbjartsson A,
Mann H, Dickersin K, Berlin J, Doré C, Parulekar W, Summerskill W, Groves T, Schulz K, Sox
H, Rockhold FW, Rennie D, Moher D (2013) SPIRIT 2013 statement: defining standard
protocol items for clinical trials. Ann Intern Med 158:200–207
Coronary Drug Project Research Group (1973) The coronary drug project: design, methods, and
baseline results. Circulation 47(Suppl I):I-1–I-50
Crewdson J (1994) Fraud in breast cancer study. Chicago Tribune, 13 March 1994
Curfman GD, Morrissey S, Drazen JM (2005) Expression of concern: Bombardier et al., compar-
ison of upper gastrointestinal toxicity of Rofecoxib and naproxen in patients with rheumatoid
arthritis. N Engl J Med 343:1520–1528. N Engl J Med 2005;353:2813–2814
Curfman GD, Morrissey S, Drazen JM (2006) Expression of concern reaffirmed. N Engl J Med 354:
1193
De Mey C (2018) Archiving – how long? https://www.acps-network.com/2018/11/08/ct-lost-in-delegation-2/. Accessed 23 Dec 2020
European Medicines Agency (EMA) Good Clinical Practice Inspectors Working Group (2018)
Guideline on the content, management and archiving of the clinical trial master file (paper and/
or electronic) https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-content-
management-archiving-clinical-trial-master-file-paper/electronic_en.pdf. Accessed 23 Dec
2020
European Medicines Agency (EMA) (2020) EudraCT public home page. https://eudract.ema.europa.eu/. Accessed 23 Dec 2020

Gilbert JP, Meier P, Rümke CL, Saracci R, Zelen M, White C (1975) Report of the Committee for the Assessment of Biometric Aspects of Controlled Trials of Hypoglycemic Agents. JAMA 231:583–608
Meinert CL (2020) Trials Meinert's Way. https://jhuccs1.us/clm/default.asp. Accessed 9 Apr 2020
National Emphysema Treatment Trial Research Group (1999) Rationale and design of the National Emphysema Treatment Trial (NETT): a prospective randomized trial of lung volume reduction surgery. Chest 116:1750–1761
United States Food and Drug Administration (FDA) (2006) Using a Centralized IRB Review
Process in Multicenter Clinical Trials Guidance for Industry. https://www.fda.gov/regulatory-
information/search-fda-guidance-documents/using-centralized-irb-review-process-multicenter-
clinical-trials. Accessed 23 Dec 2020
United States National Institutes of Health (NIH) (2016) Final NIH policy on the use of a single Institutional Review Board for multi-site research. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-16-094.html. Accessed 23 Dec 2020
United States National Library of Medicine (US NLM) (2020) ClinicalTrials.gov https://www.
clinicaltrials.gov/. Accessed 3 Apr 2020
University Group Diabetes Program Research Group (1970a) A study of the effects of hypoglyce-
mic agents on vascular complications in patients with adult-onset diabetes: I. Design, methods,
and baseline characteristics. Diabetes 19(suppl 2):747–783
University Group Diabetes Program Research Group (1970b) A study of the effects of hypoglyce-
mic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results.
Diabetes 19(Suppl 2):785–830
35 Good Clinical Practice

Claire Weber

Contents
Introduction
GCP Definition
   ICH and GCP
GCP Historical Timeline
   Key Aspects of ICH GCP
   GCP Documents Also Known as Essential Documents
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Good clinical practice (GCP) is an international quality standard that is provided
by the International Council on Harmonization (ICH), an international body that
defines standards, which governments can transpose into regulations for all
phases of clinical trials involving human subjects. GCP applies to the trial
sponsor team, the institutional review boards (IRB)/ethics committees (EC) and
the investigator site teams. This chapter describes the GCP concepts, a GCP
historical timeline, and how GCP in all phases of clinical trials and drug devel-
opment through regulatory approval is the standard for clinical research.

Keywords
ICH · GCP · Ethical · Consent · Privacy · Regulations · Guidelines · Sponsor ·
IRB/EC · Investigator

C. Weber (*)
Excellence Consulting, LLC, Moraga, CA, USA
e-mail: [email protected]


Introduction

Clinical research is conducted according to a set of standards which has been formalized in many international guidelines and regulations. GCP must be instituted
in all clinical research and is the standard for designing, conducting, recording, and
reporting trials that involve the participation of human subjects. The primary goals of
GCP are to protect all research participants and assure that only worthy treatments
are approved for use for future patients. If the clinical study follows GCP, the data
generated from the trial will be mutually accepted by many of the regulatory
agencies around the world in support of an approval to market the drug. GCP
encompasses local and regional laws, directives, regulations, guidance documents,
and standard operating procedures (SOPs) for use by the trial sponsor, the IRB/EC,
and the investigator site team.

GCP Definition

Good clinical practice is defined as:

An international ethical and scientific quality standard for designing, conducting, recording,
and reporting trials that involve the participation of human subjects. Compliance with this
standard provides public assurance that the rights, safety, and well-being of trial subjects are
protected, consistent with the principles that have origin in the Declaration of Helsinki, and
that the clinical trial data are credible (ICH E6 [R2] Introduction Page 1).

And it is further defined as:

A standard for the design, conduct, performance, monitoring, auditing, recording, analyses,
and reporting of clinical trials that provide assurance that the data and reported results are
credible and accurate, and that the rights, integrity, and confidentiality of trial subjects are
protected (ICH E6 [R2] Glossary Section 1.24).

GCP, combined with good manufacturing practice (GMP) standards, good labo-
ratory practice (GLP) standards, good pharmacovigilance practice (GPVP), good
distribution practice (GDP), and good documentation practice (GDoP) are referred to
as GxP. GxP applies to all aspects of drug development. This chapter only pertains to
GCP, but it should be noted that GCP shares some common elements with definitions
of other areas of GxP, since they each are standards to ensure drug products are safe,
pure and not adulterated, and effective. Although GCP was developed for clinical
investigational drug studies, the principles are also used in medical device studies,
and other social/behavioral studies.

ICH and GCP

The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) is a joint effort by the regulatory authorities and pharmaceutical industry representatives of Europe, Japan, and the USA. Since 1990, the ICH has published many guidelines. Informed by a
lengthy list of predicate documents (e.g., the Nuremberg Code, Declaration of
Helsinki, European Community and Nordic Guidelines, and various local regional
laws and regulations), GCP is codified in principles called ICH GCP (ICH E6
(R2) Pages 8–10) that are widely recognized as authoritative in defining the obliga-
tions of sponsors, investigators, and institutional review boards and ethics commit-
tees (IRBs/ECs). ICH E6 is the primary GCP guideline; however, there are other
important ICH guidelines for clinical research that also address GCP including (but
not limited to):

• ICH E2A–E2F: Pharmacovigilance
• ICH E7: Studies in support of special populations: Geriatrics
• ICH E8: General considerations for clinical studies
• ICH E11: Clinical investigation of medicinal products in the pediatric population
• ICH E19: Safety data collection

GCP Historical Timeline

The following key events were instrumental for the development of GCP:
The Nuremberg Code of 1947:

• On August 20, 1947, the judges delivered their verdict in the “Doctors Trial”
against Karl Brandt and 22 others. These trials focused on doctors involved in the
human experiments in concentration camps. The suspects were involved in over
3,500,000 sterilizations of German citizens.
• Instituted informed consent and absence of coercion, and voluntary participation.

The Declaration of Geneva of 1948:

• One of the first and most important actions of the World Medical Association (WMA), regarding a physician's dedication to the humanitarian goals of medicine
• A physician's oath, to be sworn at the time a person enters the medical profession, was added to the Declaration of Geneva and adopted by the General Assembly of the World Medical Association
• A pledge in view of the medical crimes that had been committed in Nazi Germany; it includes "I will maintain the utmost respect for human life; even under threat, I will not use my medical knowledge contrary to the laws of humanity"

The Declaration of Helsinki of 1964:

• Well-being of subjects takes precedence.
• Respect for persons.
• Protection of subjects' health and rights.
• Special protection for vulnerable populations.
• Safeguarding research subjects.
• Informed consent.
• Adhering to an approved research plan/protocol which is reviewed by an independent committee (IRB/EC).
• Autonomy – subjects must be able to quit at any time.
• Scientifically valid – study design and study conduct.
• Minimize the risk – harm, injury, and suffering.
• Biomedical research on human subjects must conform to scientific principles.
• Be based on valid laboratory and animal experimentation.
• Required studies to be conducted only by scientifically qualified persons and
supervised by a clinical competent medical authority.

1962 Kefauver Amendment to Food Drug and Cosmetic Act:

• FDA officials began to lobby members of Congress and draft legislation in the
late 1950s to address gaps in oversight in drug manufacturing and marketing.
• Drug manufacturers are required to prove to the FDA the effectiveness and safety
of the product before marketing.
• These efforts coincided with an FDA official's refusal to give a positive opinion on thalidomide, a drug used to treat morning sickness and nausea that was found to have caused hundreds of birth defects in Western Europe.
• Raised the issue of the importance of keeping good records and documentation.
• FDA had veto power over new drugs entering the market.
• Drugs now had to demonstrate evidence of effectiveness as well as safety, dramatically increasing the amount of time, resources, and scientific expertise required to develop a new drug. The modern clinical trial system was implemented; the 1962 Amendment required that demonstrations of effectiveness include "substantial evidence" from "adequate and well-controlled investigations."

The Medicines Act of 1968 from the Department of Health and Social Services
(DHSS):

• Merged a number of previous medical regulations to provide broad legal standards on the manufacture and supply of medicines which related to general practice.
• Introduced three categories of medicine: prescription-only drugs, which are
available only from a pharmacist if prescribed by a doctor; pharmacy medicines,
available only from a pharmacist but without a prescription; and general sales
medicine which may be bought from any shop without a prescription. It made
possession of prescription drugs without a prescription an offence.

The Belmont Report, 1979: ethical principles and applications are summarized in Table 1.

Table 1 The Belmont Report: summary of ethical principles

Respect for persons: individuals should be treated as autonomous agents; persons with diminished autonomy are entitled to protection.
   Application, informed consent: volunteer research participants, to the degree that they are capable, must be given the opportunity to choose what shall or shall not happen to them. The consent process must include three elements: information, comprehension, and voluntary participation.
Beneficence: human participants should not be harmed; research should maximize possible benefits and minimize possible risks.
   Application, assessment of risks and benefits: the nature and scope of risks and benefits must be assessed in a systematic way.
Justice: the benefits and risks of research must be distributed fairly.
   Application, selection of participants: there must be fair procedures and outcomes in the selection of research participants.

The timeline of these events, culminating with the Belmont Report, led to the ultimate framework for the ICH and GCP.

Key Aspects of ICH GCP

Consent
Consent is a critical aspect of GCP and is defined as:

A process by which a subject voluntarily confirms his or her willingness to participate in a particular trial, after having been informed of all aspects of the trial that are relevant to the
subject’s decision to participate. Informed consent is documented by means of a written,
signed, and dated informed consent (ICH E6 (R2) Glossary Section 1.28).

Trials with children and with impaired individuals may also need to include an assent signed by the subject, in addition to the consent signed by the legal representative.
GCP requires that all elements of consent/assent are appropriately obtained and
documented.

IRB/EC
The IRB/EC is defined as:

An independent body constituted of medical, scientific, and nonscientific members whose responsibility is to ensure the protection of the rights, safety, and well-being of human
subjects involved in a trial by, among other things, reviewing, approving, and providing
continuing review of trial protocol and amendments and of the methods and material to be
used in obtaining and documenting informed consent of the trial subjects (ICH E6
[R2] Glossary Section 1.31).

GCP requires that all studies are reviewed and overseen by IRB/ECs.

Privacy
The US Health Insurance Portability and Accountability Act (HIPAA) of 1996 Privacy
Rule and the European Union (EU) General Data Protection Regulation (GDPR),
and other similar international regulations are important aspects of GCP as safe-
guards to protect the privacy of personal health information and rights to examine
and obtain a copy of health records.
GCP requires that the subject’s privacy is protected.

International, Local, and Regional Laws, Directives, and Regulations


Regulatory authority documents (e.g., the US FDA Code of Federal Regulations [CFR],
EU Clinical Trial Directive, etc.) include the ICH GCP principles for the conduct of
clinical trials. Other international and local regional laws, directives, and regulations
also include ICH GCP principles. Sponsors, IRB/EC, and the investigator site team
document the operational details for incorporating GCP in standard operating pro-
cedures. Thus, regulations/guidelines and standard operating procedures reinforce
GCP, as shown in Fig. 1: ICH GCP overlaps with standard operating procedures and with international laws, regulations (CFR, CTD), and guidelines.

GCP Documents Also Known as Essential Documents

Essential documents are defined as:


Documents which individually and collectively permit evaluation of the conduct of a study
and the quality of the data produced (ICH E6 [R2] Glossary Section 1.23).

Examples of GCP documents for clinical trials include:

• Protocol
• Consent form
• Regulatory authority approvals (Country, IRB/IEC)
• Investigator’s brochure
• Plans – e.g., monitoring, medical oversight, risk management, statistical analysis
plan (SAP), blinding/masking, pharmacovigilance, etc.
• Investigator site source documents
• Investigator statement – FDA form 1572 and financial disclosure
• Electronic or paper case report form (CRF)
• Clinical study report (CSR)
• Standard operating procedures and training
• Trial master file (sponsor and investigator site)

Essential documents demonstrate that GCP is followed, and the trial master file is
the master archive of the essential documents.

Summary and Conclusion

GCP is an international quality standard that is provided by the ICH for all phases of
clinical trials, and drug development through regulatory approval. The goals of GCP
are to protect all research participants and assure that only worthy treatments are
approved for use for future patients. ICH E6 (R2) includes GCP principles (referred
to as ICH GCP) that are widely recognized as authoritative in defining obligations of
sponsors, investigator site teams, and IRB/ECs. The IRB/EC and implementation of
consent and privacy are critical aspects of GCP. ICH GCP principles are included in
standard operating procedures, as well as international, local, and regional laws,
directives, and regulations. Essential documents demonstrate that GCP is followed.
The trial sponsor team, the investigator site team, and IRB/EC are all responsible for
protecting human subjects who volunteer to participate and must be trained on and
follow GCP. If the clinical study follows GCP, the data generated from the trial will
be mutually accepted by many of the regulatory agencies around the world in
support of an approval to market the drug.

Key Facts

The facts covered in this chapter include: Goals and definitions of GCP and ICH,
historical timeline of key events leading to GCP, key aspects of GCP, clinical
research teams responsible for following GCP, and GCP essential documents.

Cross-References

▶ Archiving Records and Materials
▶ Clinical Trials, Ethics, and Human Protections Policies
▶ Consent Forms and Procedures
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Institutional Review Boards and Ethics Committees
▶ Investigator Responsibilities
▶ Regulatory Requirements in Clinical Trials
▶ Training the Investigatorship

References
Act of October 10, 1962 (Drug Amendments Act of 1962), Public Law 87-781, 76 STAT
780, which amended the Federal Food, Drug, and Cosmetic Act to assure the safety, effective-
ness, and reliability of drugs, authorize standardization of drug names, and clarify and
strengthen existing inspection authority
Clinical Trial Directive: Directive 2001/20/EC of the European Parliament and of the Council of
4 April 2001 on the approximation of the laws, regulations and administrative provisions of the
Member States relating to the implementation of good clinical practice in the conduct of clinical
trials on medicinal products for human use
Declaration of Helsinki of 1964
EMA GCP Directive 2005/28/EC
FDA Code of Federal Regulations (CFR), Title 21, Part 312
Health Insurance Portability and Accountability Act of 1996 (HIPAA)
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice
Guideline for Good Clinical Practice (Introduction, p 1, Glossary Section 1.24, pp 8–10,
Glossary Section 1.28, Glossary Section 1.31, Glossary Section 1.23)
Medicines Act 1968 c.67
Regulation (EU) 2016/679 (General Data Protection Regulation), and Directive 95/46/EC
The Belmont report: ethical principles and guidelines for the protection of human subjects of
research (1978). The Commission, Bethesda
The Nazi doctors and the Nuremberg Code: human rights in human experimentation (1995). Oxford
University Press, New York
World Medical Association (2001) World Medical Association declaration of Helsinki. Ethical
principles for medical research involving human subjects. Bull World Health Organ 79(4):
373–374. World Health Organization
36 Institutional Review Boards and Ethics Committees

Keren R. Dunn

Contents
Introduction
History of Research Ethics and Emergence of IRBs
   The National Commission and the Belmont Report
   Ethics Violations and Calls for Reform at the Turn of the Century
   Revision of the Common Rule
IRB Functions and Operations
   IRB Review of Research and Definitions
   IRB Review Levels
   IRB Records
   Criteria for IRB Approval
Informed Consent
   Documentation of Informed Consent
   Waivers of Informed Consent
Single IRB Review and IRB Reliance
Ethics Committees
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Institutional review boards (IRBs) are committees established in accordance with
US federal regulations to review and monitor clinical trials and other research
with human subjects. IRBs evolved from a history of egregious ethical violations
in research with human subjects and the ethics codes and declarations that ensued,
and were first mandated by US law in 1974, with the passing of the National
Research Act. IRBs help to ensure the protection of the rights and welfare of

K. R. Dunn (*)
Office of Research Compliance and Quality Improvement, Cedars-Sinai Medical Center, Los
Angeles, CA, USA
e-mail: [email protected]


human subjects by applying the ethical principles of the Belmont Report, respect
for persons, beneficence, and justice, in their review of research projects. They
have the authority to approve, require modifications to, or disapprove proposed
research. IRBs review plans to obtain and document informed consent from
research participants and can waive the requirements for informed consent in
certain circumstances. IRBs may exist within the institution where research is
being conducted or institutions can rely on an external IRB with a written
agreement. While the term IRB is unique to the USA, clinical trials internation-
ally adhere to the ethical principles of the Declaration of Helsinki, which requires
independent review by an ethics committee.

Keywords
Institutional review board · IRB · Ethics committee · Belmont Report · Common
Rule · Informed consent

Introduction

IRBs in the United States and ethics committees around the world conduct indepen-
dent review of research with human subjects and provide a core protection for the
rights and welfare of participants in clinical research. IRB or ethics committee
review is also critical to gaining and maintaining public trust in clinical research,
given the history of egregious ethical violations in its conduct.
This chapter provides an overview of the history of ethical violations in clinical
research and emergence of IRBs and ethics committees, a summary of IRB functions
and operations, an outline of the requirements for informed consent, and an overview
of recent changes to the system of IRB review and oversight. A timeline of key
milestones in research ethics and the establishment of IRBs is shown in Fig. 3.

History of Research Ethics and Emergence of IRBs

The foundation of modern-day research ethics in the USA and around the world
begins with the Nuremberg Code, which emerged in 1947 from the Nuremberg
trials, in which Nazi physicians were tried for their conduct of atrocious medical and
scientific experiments on prisoners in concentration camps (White 2020). The
Nuremberg Code includes ten basic principles for the conduct of ethical research
with human subjects, covering voluntary and informed consent, a favorable risk/
benefit assessment, the subject’s right to withdraw, and research expertise and
responsibility (U.S. Government Printing Office 1949; Rice 2008). In 1964, the World
Medical Association created the Declaration of Helsinki, an ethical code of conduct
that built upon the principles outlined in the Nuremberg Code, but added the tenets
that the interests of subjects must be placed above the interests of society and that
every subject should be given the best known treatment (Rice 2008). Additionally,
the Declaration of Helsinki expanded upon the requirement for voluntary and
informed consent from the Nuremberg Code to address the ethical participation of
children and compromised adults in research (White 2020). Despite US involvement
in the development of both the Nuremberg Code and Declaration of Helsinki, there
are multiple documented cases of serious research ethics violations in the USA
throughout the 1950s and 1960s (White 2020).
In 1966, a well-respected anesthesiologist from Massachusetts General Hospital,
Henry Beecher, published an article in the New England Journal of Medicine
outlining multiple examples of ethical violations he had garnered from a review of
publications in an “excellent journal” (Harkness et al. 2001). Beecher’s examples
included, among other violations, studies where known effective treatment was
withheld from subjects and studies where subjects were exposed to excessive and
unjustified risk of harm (Beecher 1966). In conclusion, Beecher advocated that “it is
absolutely essential to strive for (informed consent) for moral, sociologic and legal
reasons” (Beecher 1966). Additionally, he concluded, “there is the more reliable
safeguard provided by the presence of an intelligent, informed, conscientious,
compassionate, responsible investigator” (Beecher 1966). Notably, Beecher was
not an advocate for independent review and oversight, despite the influence his
work had on the emergence of the system of institutional review boards (IRBs)
(Harkness et al. 2001).
Perhaps the most infamous research ethics violation in the USA in the twentieth
century, the “Tuskegee Study of Untreated Syphilis in the Negro Male,” was exposed
in an article published in the Washington Star by Jean Heller in 1972 (White 2020).
The study began in 1932, when there were no safe and effective treatments available
for syphilis, and enrolled 600 African American men from the community around
Tuskegee, Alabama (White 2020). Although penicillin was proven to be an effective
treatment for syphilis by 1945 and was widely used, the men in the study were not
informed and not offered treatment so that the researchers could continue to learn
about the natural course of the disease (White 2020). The study continued for
40 years until it was publicly exposed in 1972 (White 2020).
Public outcry about the Tuskegee Study and other ethics violations, as well as
concern from the medical community following Beecher’s article, led Congress to
pass the National Research Act in 1974, which established federal regulations for the
protection of human subjects (45 CFR 46) and paved the way for the modern system
of institutional review boards (IRBs) for the oversight of research with human
subjects (Rice 2008). The National Research Act mandated that any entity applying
for an NIH grant or contract must provide assurances that it has established an IRB to
protect the rights of human subjects in biomedical and behavioral research
(US Congress Senate 1974).

The National Commission and the Belmont Report

The National Research Act also established the National Commission for the Protection
of Human Subjects of Biomedical and Behavioral Research (the Commission). The
duties of the Commission were: 1) to identify the basic ethical principles that should
guide the conduct of research with human subjects; 2) to develop guidelines for the
conduct of research with human subjects in accordance with those ethical principles;
and 3) to advise the secretary on administrative actions to apply the guidelines and on
any other matters related to the protection of human research subjects (US Congress
Senate 1974).
The Commission published multiple reports between 1975 and 1979. On
September 1, 1978, the Commission published a report entitled Institutional Review
Boards, which outlined recommendations for the IRB review mechanism and
evaluation of IRB performance, as well as steps to improve the ethical review
process (National Commission 1978). Figure 1 includes an excerpt from this report
on IRBs.

Excerpt from Institutional Review Boards (National Commission 1978)

The Commission's deliberations begin with the premise that investigators should not have
sole responsibility for determining whether research involving human subjects fulfills ethical
standards. Others, who are independent of the research, must share this responsibility,
because investigators are always in positions of potential conflict by virtue of their concern
with the pursuit of knowledge as well as the welfare of the human subjects of their research.

The Commission believes that the rights of subjects should be protected by local review
committees operating pursuant to federal regulations and located in institutions where
research involving human subjects is conducted. Compared to the possible alternatives of a
regional or national review process, local committees have the advantage of greater
familiarity with the actual conditions surrounding the conduct of research. Such committees
can work closely with investigators to assure that the rights and welfare of human subjects
are protected and, at the same time, that the application of policies is fair to the
investigators. They can contribute to the education of the research community and the
public regarding the ethical conduct of research. The committees can become resource
centers for information concerning ethical standards and federal requirements and can
communicate with federal officials and with other local committees about matters of
common concern.

Fig. 1 Transcribed excerpt from the Commission’s report: Institutional Review Boards
The Belmont Report, named for the location of the Commission’s four-day
intensive meetings at the Belmont Conference Center in 1976, was issued on
September 30, 1978, and published in the Federal Register on April 18, 1979 (National Commis-
sion 1979). The Belmont Report described the boundaries between the practice of
medicine and research, outlined the basic ethical principles to guide research with
human subjects, and delineated the application of these ethical principles (National
Commission 1979). Figure 2 includes a summary of the ethical principles and their
applications outlined in the Belmont Report.
The Belmont Report Ethical Principles and Applications

Respect for Persons. Principles: respect the autonomy of individuals; protect
individuals with diminished autonomy (e.g., prisoners). Application (informed
consent): the consent process includes information, comprehension, and
voluntariness; incomplete disclosure of information must be justified by its necessity
to accomplish the research goals, that there are no significant risks omitted from the
disclosure, and, when appropriate, there is a debriefing plan.

Beneficence. Principles: do not harm; maximize possible benefits and minimize
possible harms; consider both individual and societal benefits. Application
(assessment of risks and benefits): appropriate study design; risks must be justified,
and the benefit/risk ratio must be favorable; consider both the probability and
magnitude of any harm; consider risks of psychological, physical, legal, social, and
economic harm.

Justice. Principle: fair distribution of the benefits and burdens of research.
Application (selection of subjects): opportunities to participate in potentially
beneficial research must be distributed fairly; burdens of research should not be
borne unfairly by disadvantaged populations.

Fig. 2 Summary of the Belmont Report’s ethical principles and their applications

In 1981, revised regulations for the protection of human subjects (45 CFR 46)
incorporating most of the recommendations of the Belmont Report were signed by
the secretary of the Department of Health and Human Services (DHHS), and the
Food and Drug Administration (FDA) adopted similar regulations covering
requirements for IRBs (21 CFR 56) and informed consent (21 CFR 50) in
FDA-regulated clinical investigations (White 2020). In an effort to harmonize regulations
across the federal government, the Federal Policy for the Protection of Human
Subjects (45 CFR 46) was adopted in 1991 by 15 federal departments and agencies
to become known as the “Common Rule” (White 2020). The FDA has not signed on
to the Common Rule, but has committed to amending its own regulations at 21 CFR
parts 50 and 56 to align with the Common Rule to the extent possible (White 2020).

Ethics Violations and Calls for Reform at the Turn of the Century

In 1999 and 2001, there were two highly publicized tragic deaths of research subjects
participating in studies at separate renowned institutions (White 2020). Jesse
Gelsinger was born with a mild form of ornithine transcarbamylase (OTC) defi-
ciency that was well managed with diet and medication and had just turned 18 when
he volunteered to participate in a phase 1 gene therapy study for the treatment of OTC
deficiency (White 2020). Shortly after receiving the experimental gene therapy,
Gelsinger experienced an acute inflammatory response leading to multiorgan failure
and died just 4 days later (White 2020). This led to an investigation, which raised
questions about significant ethics and regulatory violations, including, among others,
whether Gelsinger was enrolled in violation of the eligibility criteria in the
IRB-approved protocol and a conflict of interest for the director of the gene studies
program that was not disclosed (White 2020). Ellen Roche was a healthy 24-year-old
lab technician when she volunteered in 2001 to participate in a physiology study in
which subjects were administered inhaled hexamethonium (White 2020). Within
24 h, Roche developed significant pulmonary abnormalities, which progressed to
multiorgan failure, and ultimately, she died within a month (White 2020). Concerns
raised from the investigation of this case included lack of identification of reported
complications associated with hexamethonium in the literature, failure to apply for
an investigational new drug application (IND) or to inquire with the FDA about the
need for an IND, lack of information in the consent form about the regulatory status
of hexamethonium, missing reports of complications in prior publications, and use
of a chemical grade agent, rather than pharmaceutical grade (White 2020).
In September 2000, in response to the Gelsinger tragedy (and before the tragedy
of Roche’s death), Donna Shalala, secretary of Health and Human Services,
published a plan and urgent call to action to strengthen protections for human
research subjects in the New England Journal of Medicine (Shalala 2000). Shalala
outlined several steps taken by the government, including expansion of the role of
the Office for Protection from Research Risks (OPRR) and renaming it the Office for
Human Research Protections (OHRP), along with the appointment of new leadership
(Shalala 2000). However, Shalala made the case that ultimate responsibility to
protect human subjects lies with the institutions performing research (Shalala
2000). With respect to IRBs, Shalala stated:

IRBs, the key element of the system to protect research subjects, are under increasing strain.
In June 1998, the Office of Inspector General of the Department of Health and Human
Services issued four investigative reports, which indicated that IRBs have excessive work-
loads and inadequate resources. At a number of institutions, IRB oversight was inadequate,
and on occasion, researchers were not providing the boards with sufficient information for
them to evaluate clinical trials fully.

During this time period, serious discussions about accreditation emerged as a
solution to ensuring the quality of IRBs and institutional systems for the protection
of human research subjects (Steinbrook 2002). The Association of American Med-
ical Colleges, along with six other partnering organizations, founded the Association
for the Accreditation of Human Research Protection Programs (AAHRPP) in 2001
(Steinbrook 2002). Calls for reform continued in the years that followed. Proposed
reforms included mandatory accreditation of IRBs and institutional human research
protection programs, credentialing of IRB personnel, and centralized IRBs for review of
multisite studies (Emanuel et al. 2004).

Revision of the Common Rule

A path to revise the Common Rule began in 2011, when the federal government
sought input from the public with the release of an advance notice of proposed
rulemaking (ANPRM). The ANPRM described shortcomings of the current regula-
tions, citing changes to the research enterprise since the Common Rule was first
enacted 20 years earlier, including “the proliferation of multi-site clinical trials and
observational studies, the expansion of health services research, research in the
social and behavioral sciences, and research involving databases, the Internet, and
biological specimen repositories, and the use of advanced technologies, such as
genomics” (DHHS 2011). The ANPRM sought public input with respect to the
protection of human subjects in research, while reducing burden, delay, and ambi-
guity for investigators (DHHS 2011).

1947: The Nuremberg Code
1964: Declaration of Helsinki
1974: National Research Act; HHS regulations established (45 CFR 46)
1978: The Belmont Report
1981: FDA regulations adopted
1991: The Common Rule adopted by 15 federal departments
2001: Association for the Accreditation of Human Research Protection Programs established
2017: Revised Common Rule published, effective 2019

Fig. 3 Timeline of key milestones in research ethics and establishment of IRBs
The final revised Common Rule was published in the Federal Register in January
2017 and became effective in January 2019. Significant revisions in the final rule
incorporated standards for the language and organization of informed consent forms,
including the required use of a concise and focused presentation of the key infor-
mation most likely to facilitate understanding and decision-making about whether or
not to participate in the research (Menikoff et al. 2017). Additional revisions reduced
burden on IRBs and researchers for the management and review of low-risk studies,
including elimination of the requirement for annual progress reports and IRB
continuing review for many of these studies and additional allowances for
researchers to screen potential research subjects based on review of medical records
or other information available to them (Menikoff et al. 2017). Additionally, the
revised Common Rule included provisions for broad informed consent, whereby
participants can agree to the secondary unspecified future research use of their
private identifiable information that was originally collected for clinical care or
other specific research studies (Menikoff et al. 2017). Lastly, the revised Common
Rule included a requirement for single-IRB review of multisite studies in most cases,
aiming to avoid the burden for study sponsors, researchers, and IRBs associated with
multiple IRB reviews of the same protocol (Menikoff et al. 2017).

IRB Functions and Operations

The Common Rule requires that institutions conducting or otherwise engaged in
research with human subjects supported by a federal department or agency provide
an assurance that they will comply with the Common Rule (45 CFR 46.103). This
assurance of compliance, submitted to the federal Office for Human Research
Protections, is called a Federalwide Assurance (FWA) and must be signed by an
institutional official authorized to act on behalf of the institution and to take
responsibility for the institution’s compliance with the Common Rule requirements
(45 CFR 46.103).
While the Common Rule only applies to research conducted or supported by a
federal department or agency, institutions submitting an FWA are required to provide
an assurance that all of their human subjects research activities will be guided by an
appropriate code, declaration, or statement of ethical principles such as the Decla-
ration of Helsinki by the World Medical Association or the Belmont Report. In
addition to a general assurance of compliance, the FWA provides assurances regard-
ing the institution’s written procedures for its IRB operations, that the institution will
provide copies of its written procedures to OHRP upon request, that the institution
ensures adequate resources for each of its IRBs, and that when the institution relies
on an external IRB, the reliance arrangement is documented in a written agreement.
Institutions must renew their FWA every 5 years and are required to submit updates
within 90 days of certain significant changes (OHRP 2021).
Since 2009, both the Common Rule and FDA regulations have required that IRBs
be registered before the IRB can be designated under an institution’s FWA and
before the IRB can review research involving FDA-regulated products (45 CFR
46 Subpart E, 21 CFR 56.106). Registration of IRBs requires the name and contact
information for the institution running the IRB, the name and contact information for
the IRB and IRB chairperson, approximate number of active protocols involving
FDA-regulated products reviewed by the IRB, and a description of the types of
FDA-regulated products used in research reviewed by the IRB. IRB registration
must be renewed every 3 years and updated within 90 days or 30 days of certain
significant changes (45 CFR 46 Subpart E, 21 CFR 56.106).
According to federal regulations, an IRB must have at least five members,
including at least one scientist, at least one nonscientist, and at least one member
who is not affiliated with the institution themselves or through an immediate family
member. IRB membership must be qualified through experience, expertise, and
diversity to promote respect for its recommendations and determinations. The IRB
should include members knowledgeable about institutional commitments, regula-
tions and applicable laws, and standards of professional conduct to facilitate effec-
tive review of proposed research. IRBs routinely reviewing research with potentially
vulnerable subject populations, including children, prisoners, cognitively impaired
individuals, or socioeconomically disadvantaged individuals, should include mem-
bers familiar and experienced with these populations (45 CFR 46.107 and 21 CFR
56.107). IRBs must be allocated resources, including sufficient staff and adequate
meeting space (45 CFR 46.108).

IRB Review of Research and Definitions

All research involving human subjects (see definitions in Table 1) that is conducted
or supported by a federal department or agency is subject to the regulations in the
Common Rule (45 CFR 46), including the requirements for IRB review and
approval and informed consent. Although these regulations only apply to federally
conducted or supported research, institutions with an FWA are required to apply
similar protections to all their research involving human subjects (OHRP 2021).
Additionally, the International Committee of Medical Journal Editors (ICMJE) notes
that authors should seek approval to conduct research from an independent review
body such as an IRB or ethics committee (ICMJE 2019), an additional incentive for
researchers to seek, and for institutions to require, IRB approval of all research with
human subjects. Since FDA oversight is focused on drugs, biologics, and medical
devices, different terminology is used to define the research subject to IRB review.
The FDA regulations at 21 CFR Parts 50 and 56 (informed consent and IRB review)
apply to clinical investigations, as defined in Table 1.

Table 1 Selected definitions transcribed from the Common Rule (45 CFR 46.102) and FDA
regulations (21 CFR 56.102)

Human subject: Common Rule: A living individual about whom an investigator
(whether professional or student) conducting research (i) obtains information or
biospecimens through intervention or interaction with the individual, and uses,
studies, or analyzes the information or biospecimens; or (ii) obtains, uses, studies,
analyzes, or generates identifiable private information or identifiable biospecimens.
FDA: An individual who is or becomes a participant in research, either as a recipient
of the test article or as a control. A subject may be either a healthy human or a patient.

Intervention: Includes both physical procedures by which information or
biospecimens are gathered (e.g., venipuncture) and manipulations of the subject or
the subject’s environment that are performed for research purposes.

Interaction: Includes communication or interpersonal contact between investigator
and subject.

Private information: Includes information about behavior that occurs in a context in
which an individual can reasonably expect that no observation or recording is taking
place, and information that has been provided for specific purposes by an individual
and that the individual can reasonably expect will not be made public (e.g., a medical
record).

Identifiable private information: Private information for which the identity of the
subject is or may readily be ascertained by the investigator or associated with the
information.

Identifiable biospecimen: A biospecimen for which the identity of the subject is or
may readily be ascertained by the investigator or associated with the biospecimen.

Research: A systematic investigation, including research development, testing, and
evaluation, designed to develop or contribute to generalizable knowledge.

Clinical trial (Common Rule only): A research study in which one or more human
subjects are prospectively assigned to one or more interventions (which may include
placebo or other control) to evaluate the effects of the interventions on biomedical or
behavioral health-related outcomes.

Clinical investigation (FDA only): Any experiment that involves a test article and
one or more human subjects and that either is subject to requirements for prior
submission to the Food and Drug Administration under section 505(i) or 520(g) of
the act, or is not subject to requirements for prior submission to the Food and Drug
Administration under these sections of the act, but the results of which are intended
to be submitted later to, or held for inspection by, the Food and Drug Administration
as part of an application for a research or marketing permit.

Test article (FDA only): Any drug for human use, biological product for human use,
medical device for human use, human food additive, color additive, electronic
product, or any other article subject to regulation under the act or under sections 351
or 354–360F of the Public Health Service Act.

Minimal risk: The probability and magnitude of harm or discomfort anticipated in
the research are not greater in and of themselves than those ordinarily encountered in
daily life or during the performance of routine physical or psychological examina-
tions or tests.
IRBs have the authority to approve, require modifications to, or disapprove
research and are required to conduct continuing review of research at least annually,
except that most minimal risk research and ongoing research that remains open only
for long-term data collection and analysis does not require continuing review in
accordance with the revised Common Rule (45 CFR 46.109 and 21 CFR 56.109).
IRBs are required to notify investigators and the institution in writing of their decisions
to approve or disapprove research activities. The reason for disapproval must be
explained in writing and the investigator must be given an opportunity to respond in
writing (45 CFR 46.109 and 21 CFR 56.109).

IRB Review Levels

When research is required to be reviewed at a convened IRB meeting, a quorum
must be met, meaning a majority of members must be present, including at least one
nonscientist. A majority of the members present must vote to approve the research
for it to be approved (45 CFR 46.108 and 21 CFR 56.108). IRB meetings can be
held via audio or video conference, but each member must have received review
materials prior to the meeting and must be able to participate in the discussion of
protocols actively and equally (DHHS OHRP & FDA 2017). Members who have a
conflict of interest related to research under review by the IRB may not participate in
the review of that research except to provide information requested by the IRB. The
IRB can invite consultants with expertise to assist with the review of research, but
the consultant cannot vote (45 CFR 46.107 and 21 CFR 56.107).
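The quorum and voting requirements above reduce to a simple arithmetic check. As
a purely illustrative sketch in Python (not drawn from any regulatory text, and
ignoring refinements such as mid-meeting recusal of conflicted members), the logic
might look like the following; the names Member, meets_quorum, and motion_passes
are hypothetical:

from dataclasses import dataclass

@dataclass
class Member:
    name: str
    scientist: bool

def meets_quorum(present, total_members):
    # Quorum: a majority of the IRB's full membership is present,
    # including at least one nonscientist.
    return len(present) > total_members / 2 and any(not m.scientist for m in present)

def motion_passes(present, total_members, votes_for):
    # Approval: quorum holds and a majority of the members present
    # vote to approve the research.
    return meets_quorum(present, total_members) and votes_for > len(present) / 2

# Example: a 7-member IRB with 4 members present (3 scientists, 1 nonscientist).
present = [Member("A", True), Member("B", True), Member("C", True), Member("D", False)]
print(meets_quorum(present, total_members=7))  # True: 4 > 3.5 and a nonscientist attends
print(motion_passes(present, 7, votes_for=3))  # True: 3 > 2.0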
Although many research submissions require review by the IRB at a convened
meeting, most minimal risk research is eligible for an expedited review procedure by
an IRB chairperson or other experienced designated member of the IRB. The
following categories of research are eligible for expedited review if they are deter-
mined to pose minimal risk and meet other specified conditions:

• Studies of drugs or medical devices where an IND or IDE is not required
• Research involving blood draws (depending on the subject population, volume,
and frequency of collection)
• Noninvasive collection of biological specimens such as saliva or nail clippings
• Noninvasive procedures such as MRI or ultrasound
• Secondary research use of data or specimens collected for other purposes (if not
exempt)
• Collection of voice, video, digital, or image recordings, and surveys, interviews,
or focus groups
• Continuing review of research that originally required review by the convened
IRB, but remains active for long-term follow-up of subjects or data analysis only,
or where no subjects have been enrolled and no additional risks have been
identified

A complete list of the categories of research eligible for expedited IRB review is
posted in the Federal Register (DHHS NIH 1998). Minor changes to previously
approved research can also be reviewed by expedited IRB review. While the IRB
chairperson or designated member has the authority to approve or require modifica-
tions to research activities eligible for expedited review, only the convened IRB has
the authority to disapprove research (45 CFR 46.110 and 21 CFR 56.110).
Certain categories of research with human subjects are exempt from the require-
ments for IRB review and informed consent under the Common Rule (45 CFR
46.104). Exempt research includes the following categories of research under
specified circumstances for each category:

• Education research
• Surveys, interviews, educational assessments, and observation of public behavior
• Benign behavioral interventions
• Research with information or biospecimens collected for other purposes (e.g.,
clinical)
• Federal research and demonstration projects
• Taste and food quality evaluation
• Storage of information or biospecimens for secondary research with broad
consent
• Secondary research with information or biospecimens under broad consent

Certain exempt research categories require that the IRB conduct “limited IRB
review,” a form of expedited review focused on protections of privacy and confi-
dentiality (45 CFR 46.104).
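Taken together, the preceding passages describe three levels of review: exempt
(possibly with limited IRB review), expedited, and review by the convened IRB. As
a rough illustration only (the names ReviewLevel and triage are hypothetical, and
the actual determinations rest on regulatory criteria far richer than three booleans),
the order of triage implied above might be sketched in Python as:

from enum import Enum

class ReviewLevel(Enum):
    EXEMPT = "exempt (limited IRB review where required)"
    EXPEDITED = "expedited review"
    CONVENED = "convened IRB meeting"

def triage(exempt_category, minimal_risk, expedited_category):
    # Exemption is considered first; otherwise, minimal risk research in a
    # listed category may receive expedited review; everything else goes to
    # the convened board.
    if exempt_category:
        return ReviewLevel.EXEMPT
    if minimal_risk and expedited_category:
        return ReviewLevel.EXPEDITED
    return ReviewLevel.CONVENED

print(triage(False, True, True).value)  # expedited review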

IRB Records

Both the Common Rule and FDA regulations require that IRBs prepare and maintain
records in paper or electronic form, documenting their activities (45 CFR 46.115 and
21 CFR 56.115). IRB records must be maintained for at least 3 years after comple-
tion of the research, and they must be made available to applicable federal oversight
agencies for inspection and copying upon request. Both the Common Rule and FDA
regulations note that IRB records must include the following:
• Research proposals for review
• Scientific evaluations associated with the proposals, if applicable
• Correspondence between the IRB and researchers
• Approved consent forms
• Progress reports submitted by the researchers
• Records of continuing review activities
• Reports of any injuries to subjects
• Statements of significant new findings provided to subjects
• Detailed IRB meeting minutes, which must include the following:
  • Meeting attendance
  • Actions taken by the IRB
  • The vote on actions taken, including numbers voting for, against, and abstaining
  • Basis for required changes to research or disapprovals
  • Summary of discussion of controverted issues and their resolution
• IRB membership rosters with details to show name, expertise, and affiliation of
each member
• Written procedures for IRB operations, including:
  • Conducting initial and continuing review of research
  • Reporting its review findings and actions to the investigator and institution
  • Determining which research projects require continuing review more frequently
and which projects need verification from independent sources that no material
changes have occurred since the last IRB review
  • Ensuring prompt reporting to the IRB of proposed changes in research and that
investigators will not implement changes until they have been reviewed and
approved by the IRB, except when necessary for immediate subject safety
  • Ensuring prompt reporting to the IRB, institutional officials, and applicable
regulatory agencies of any unanticipated problems involving risks to subjects or
others, any serious or continuing noncompliance, and any suspension or
termination of IRB approval
Additionally, the Common Rule requires further records, which are not generally
applicable to FDA-regulated research. The following records were added in the
revised Common Rule and are intended to discourage unnecessary time and effort
being spent on the review and oversight of low-risk studies:
  • Rationale for conducting continuing review of minimal risk research that would
not normally require continuing review
  • Rationale for an expedited reviewer determining that research with procedures
generally considered minimal risk is more than minimal risk and, therefore, not
eligible for expedited review

Criteria for IRB Approval

Whether research is reviewed by expedited review or at a convened IRB meeting,
regulations outline criteria for IRB approval of research. The IRB is required to
obtain and review information sufficient to determine whether the proposed research
meets the criteria for IRB approval. Table 2 outlines the general criteria for IRB
approval of research. In addition to the general criteria for IRB approval, regulations
outline additional protections for pregnant women, fetuses, and neonates, prisoners,
and children in research, including additional requirements for IRB membership,
criteria for inclusion of these potentially vulnerable populations, and additional
considerations for informed consent and child assent (45 CFR 46 Subparts B, C, and
D and 21 CFR 50 Subpart D).

Table 2 Criteria for IRB approval transcribed from 45 CFR 46.111 and 21 CFR 56.111

Minimizing risks: Risks to subjects are minimized (i) by using procedures that are
consistent with sound research design and that do not unnecessarily expose subjects
to risk, and (ii) whenever appropriate, by using procedures already being performed
on the subjects for diagnostic or treatment purposes.

Favorable benefit/risk ratio: Risks to subjects are reasonable in relation to
anticipated benefits, if any, to subjects, and the importance of the knowledge that
may reasonably be expected to result. In evaluating risks and benefits, the IRB
should consider only those risks and benefits that may result from the research (as
distinguished from risks and benefits of therapies subjects would receive even if not
participating in the research). The IRB should not consider possible long-range
effects of applying knowledge gained in the research (e.g., the possible effects of the
research on public policy) as among those research risks that fall within the purview
of its responsibility.

Equitable selection of subjects: Selection of subjects is equitable. In making this
assessment the IRB should take into account the purposes of the research and the
setting in which the research will be conducted. The IRB should be particularly
cognizant of the special problems of research that involves a category of subjects
who are vulnerable to coercion or undue influence, such as children, prisoners,
individuals with impaired decision-making capacity, or economically or
educationally disadvantaged persons. (Note: The language describing potentially
vulnerable populations was updated in the revised Common Rule. FDA regulations
still contain the original language, which also includes specific reference to pregnant
women, handicapped, or mentally disabled persons.)

Informed consent: Informed consent will be sought from each prospective subject or
the subject’s legally authorized representative.

Documentation of informed consent: Informed consent will be appropriately
documented or appropriately waived. (Note: FDA regulations do not include
provisions for IRBs to waive consent; however, FDA guidance has been issued on
this topic, which is described in the informed consent section of this chapter.)

Data and safety monitoring: When appropriate, the research plan makes adequate
provision for monitoring the data collected to ensure the safety of subjects.

Privacy and confidentiality: When appropriate, there are adequate provisions to
protect the privacy of subjects and to maintain the confidentiality of data.

Vulnerable subjects: When some or all of the subjects are likely to be vulnerable to
coercion or undue influence, such as children, prisoners, individuals with impaired
decision-making capacity, or economically or educationally disadvantaged persons,
additional safeguards have been included in the study to protect the rights and
welfare of these subjects. (Note: Like the section on equitable selection of subjects,
the language in this section was updated in the revised Common Rule. FDA
regulations still contain original language to describe potentially vulnerable
populations.)

Informed Consent

One of the key assertions outlined in the Nuremberg Code, the Declaration of
Helsinki, and the Belmont Report is that informed consent is critical to the ethical
conduct of research with human subjects. Therefore, it is not surprising that the
process and plans for documentation of informed consent are a significant focus of
the IRB review process. The informed consent of the subject or their legally
authorized representative is required for all research subject to IRB review unless
the IRB determines the research is eligible for a waiver or alteration of the require-
ments for informed consent. Researchers are required to provide information in
language that is understandable to the subject, and subjects must be given sufficient
opportunity to discuss and consider their decision to participate in a setting and
manner that minimizes any possibility of coercion or undue influence. Additionally,
regulations specify that the informed consent cannot include any exculpatory lan-
guage through which subjects appear to give up any legal rights (45 CFR 46.116 and
21 CFR 50.20). The revised Common Rule also specifies that subjects should be
given the information that a “reasonable person” would want in order to make a
decision about participation in the research, that the consent must begin with a
concise summary of key information, and that the informed consent must be
organized in a manner that facilitates understanding of the reasons why one may or
may not want to participate (45 CFR 46.116).

Documentation of Informed Consent

Generally, informed consent must be documented using an IRB-approved consent
form that contains all the basic and applicable additional elements of informed
consent. Table 3 includes a listing of the basic and additional elements of informed
consent. The subject or their legally authorized representative must be given suffi-
cient time to read the consent form or have it read to them before they sign it. A copy
of the consent form is required to be given to the person signing. Both the Common
Rule and FDA regulations allow for the use of a short form written consent
document stating that the required elements of informed consent have been pre-
sented to the subject orally, along with an IRB-approved written summary of the oral
presentation (45 CFR 46.117 and 21 CFR 50.27). The short form option is often
approved by IRBs as an option for documenting informed consent, along with the
use of an interpreter, for subjects who require an informed consent process in a
non-English language where the need for a written translation of the full informed
consent form had not been anticipated. The Common Rule also allows the IRB to
waive the requirement for obtaining a signed informed consent form for certain
minimal risk research or where a breach of confidentiality is the primary risk and the
signed consent form would be the only record identifying the subject (45 CFR
46.117).

Table 3 Elements of informed consent transcribed from 45 CFR 46.116 and 21 CFR 50.25

Basic elements of informed consent (45 CFR 46.116(b) and 21 CFR 50.25(a)):
1. A statement that the study involves research, an explanation of the purposes of the
research and the expected duration of the subject’s participation, a description of the
procedures to be followed, and identification of any procedures that are experimental.
2. A description of any reasonably foreseeable risks or discomforts to the subject.
3. A description of any benefits to the subject or to others that may reasonably be
expected from the research.
4. A disclosure of appropriate alternative procedures or courses of treatment, if any,
that might be advantageous to the subject.
5. A statement describing the extent, if any, to which confidentiality of records
identifying the subject will be maintained.
6. For research involving more than minimal risk, an explanation as to whether any
compensation and an explanation as to whether any medical treatments are available
if injury occurs and, if so, what they consist of, or where further information may be
obtained.
7. An explanation of whom to contact for answers to pertinent questions about the
research and research subjects’ rights, and whom to contact in the event of a
research-related injury to the subject.
8. A statement that participation is voluntary, refusal to participate will involve no
penalty or loss of benefits to which the subject is otherwise entitled, and the subject
may discontinue participation at any time without penalty or loss of benefits to which
the subject is otherwise entitled.
9. One of the following statements about any research that involves the collection of
identifiable private information or identifiable biospecimens:
  a. A statement that identifiers might be removed from the identifiable private
information or identifiable biospecimens and that, after such removal, the informa-
tion or biospecimens could be used for future research studies or distributed to
another investigator for future research studies without additional informed consent
from the subject or the legally authorized representative, if this might be a possibility, or
  b. A statement that the subject’s information or biospecimens collected as part of
the research, even if identifiers are removed, will not be used or distributed for future
research studies.
Notes: Item number 9 was added in the revised Common Rule and is not included in
FDA regulations. There is one additional basic element of consent required under
FDA regulations: the possibility that the FDA may inspect the records.

Additional elements of informed consent, to be included when appropriate (45 CFR
46.116(c) and 21 CFR 50.25(b)):
1. A statement that the particular treatment or procedure may involve risks to the
subject (or to the embryo or fetus, if the subject is or may become pregnant) that are
currently unforeseeable.
2. Anticipated circumstances under which the subject’s participation may be termi-
nated by the investigator without regard to the subject’s or the legally authorized
representative’s consent.
3. Any additional costs to the subject that may result from participation in the
research.
4. The consequences of a subject’s decision to withdraw from the research and
procedures for orderly termination of participation by the subject.
5. A statement that significant new findings developed during the course of the
research that may relate to the subject’s willingness to continue participation will be
provided to the subject.
6. The approximate number of subjects involved in the study.
7. A statement that the subject’s biospecimens (even if identifiers are removed) may
be used for commercial profit and whether the subject will or will not share in this
commercial profit.
8. A statement regarding whether clinically relevant research results, including
individual research results, will be disclosed to subjects, and if so, under what
conditions.
9. For research involving biospecimens, whether the research will (if known) or
might include whole genome sequencing (i.e., sequencing of a human germline or
somatic specimen with the intent to generate the genome or exome sequence of that
specimen).
Notes: Item numbers 7, 8, and 9 were added in the revised Common Rule and are not
included in FDA regulations. There is one additional element of consent required
under FDA regulations for applicable clinical trials (21 CFR 50.25(c)): a statement
and brief description about registration of the clinical trial on ClinicalTrials.gov, a
clinical trial registry. There are separate requirements for “broad consent” for
storage, maintenance, and secondary research use of identifiable private information
or identifiable biospecimens defined at 45 CFR 46.116(d); these requirements have
been omitted from this table for brevity.

Waivers of Informed Consent

For general research to be eligible for a waiver or alteration of informed consent
under the Common Rule (45 CFR 46.116(f)), the IRB must find that all the following
criteria are met:

• The research involves no more than minimal risk to the subjects
• The research could not practicably be carried out without the requested waiver or
alteration
• If the research involves using identifiable private information or identifiable
biospecimens, the research could not practicably be carried out without using
such information or biospecimens in an identifiable format
• The waiver or alteration will not adversely affect the rights and welfare of the
subjects
• Whenever appropriate, the subjects or legally authorized representatives will be
provided with additional pertinent information after participation
While the Common Rule allows an IRB to waive or alter requirements for
informed consent, FDA regulations do not. However, FDA issued guidance in
2017 noting an intent to update its regulations to allow waivers and alterations of
informed consent for certain minimal risk clinical investigations, which would align
with the waivers allowed under the Common Rule. In the meantime, the FDA notes it
will not object to IRBs allowing such waivers (DHHS, FDA 2017). FDA regulations
do allow for an exception from the requirement for informed consent to treat a
patient/subject with an investigational drug or device in a life-threatening emergency
situation. When this exception from the requirement for informed consent is used, a
report must be submitted to the IRB within 5 business days (21 CFR 50.23).
Both DHHS and the FDA also allow the IRB to approve a waiver of the
requirements for informed consent in research involving human subjects in emer-
gency medical situations (e.g., heart attack, stroke, and trauma) where it may not be
possible to obtain consent from the subject or their legally authorized representative.
Application of this exception requires significant consideration, time, effort, and
planning by the researchers and the IRB. Additional required steps and protections
necessary for the IRB to grant a waiver of consent for emergency research include
community consultation, public disclosure about the research, and procedures to
inform subjects or their representative about the research and their right to discon-
tinue participation at the earliest opportunity (21 CFR 50.24).

Single IRB Review and IRB Reliance

The NIH released a policy that became effective in January 2018, requiring the use
of a single IRB for the review of multisite human subjects research funded by the
NIH (NIH 2016). Subsequently, the revised Common Rule required the use of a
single IRB for multisite research, a requirement that became effective in January
2020. In a single IRB review model, institutions are required to document reliance
on an IRB they do not operate, along with the delineation of responsibilities for each
entity, the relying institution and the reviewing IRB (45 CFR 46.103(e)). FDA
regulations allow single IRB review of
multisite clinical investigations, but it is not required.
The purpose of single IRB review was summarized in the NIH notice as follows
(NIH 2016):

The goal of this policy is to enhance and streamline the IRB review process in the context of
multi-site research so that research can proceed as effectively and expeditiously as possible.
Eliminating duplicative IRB review is expected to reduce unnecessary administrative bur-
dens and systemic inefficiencies without diminishing human subjects protections. The shift
in workload away from conducting redundant reviews is also expected to allow IRBs to
concentrate more time and attention on the review of single site protocols, thereby enhancing
research oversight.

In preparation for the federal mandates and to support and facilitate single IRB
review, beginning in 2014, the NIH funded an initiative to develop a standard
national master IRB reliance agreement (Cobb et al. 2019). This evolved into the
Streamlined, Multisite, Accelerated Resources for Trials (SMART) IRB Platform,
the foundation of which is the SMART IRB Master Common Reciprocal IRB
Authorization Agreement (SMART IRB Agreement), an umbrella agreement
among the participating institutions (Cobb et al. 2019). Eligible institutions can join
the SMART IRB Agreement, which eliminates the need for participating institutions
to negotiate a new IRB reliance agreement for each study reviewed under the single
IRB review model (Cobb et al. 2019). After joining, an institution can decide on a
study-by-study basis whether to use the SMART IRB Agreement (Cobb et al. 2019).

Ethics Committees

While the operation of IRBs in the USA is established by federal regulations,
research ethics committees (RECs) internationally evolved from the Declaration of
Helsinki, which has included independent committee review and oversight of
biomedical research with human subjects since 1975 (WMA 2013). An excerpt
from the Declaration of Helsinki is included in Fig. 4.
While regulations governing the requirements for ethics committee review and
operation of ethics committees vary around the world, significant efforts toward
harmonization were realized with the publication of the International Ethical
Guidelines for Biomedical Research Involving Human Subjects in 1993 by the
Council for International Organizations of Medical Sciences (CIOMS) (White
2020). These guidelines aimed to provide direction in the application of ethical
principles from the Declaration of Helsinki, particularly in developing countries,
and included a section on the constitution and responsibilities of ethical review
committees (White 2020).

Excerpt on Ethics Committees from the Declaration of Helsinki (WMA 2013)

The research protocol must be submitted for consideration, comment, guidance and
approval to the concerned research ethics committee before the study begins. This
committee must be transparent in its functioning, must be independent of the researcher,
the sponsor and any other undue influence and must be duly qualified. It must take into
consideration the laws and regulations of the country or countries in which the research is to
be performed as well as applicable international norms and standards but these must not be
allowed to reduce or eliminate any of the protections for research subjects set forth in this
Declaration.

The committee must have the right to monitor ongoing studies. The researcher must provide
monitoring information to the committee, especially information about any serious adverse
events. No amendment to the protocol may be made without consideration and approval by
the committee. After the end of the study, the researchers must submit a final report to the
committee containing a summary of the study’s findings and conclusions.

Fig. 4 Excerpt on ethics committees from the Declaration of Helsinki (WMA 2013)

Summary and Conclusion

IRBs in the USA and ethics committees internationally have been a cornerstone in the
review and oversight of clinical trials and other research involving human subjects
since the 1970s, with the passing of the National Research Act in 1974 and amendment
of the Declaration of Helsinki in 1975. IRBs are guided by the ethical principles of the
Belmont Report and carry out their responsibilities for the protection of the rights
and welfare of human research subjects in accordance with the Common Rule (45 CFR
46) and/or FDA regulations at 21 CFR parts 50 and 56, depending on the scope and
funding source of the research they are reviewing. In order to approve research, IRBs
are required to ensure that risks are minimized, that the risk/benefit ratio is favorable,
that there is an adequate plan for data and safety monitoring and for the protection of
privacy and confidentiality, that the selection of subjects is equitable, and that there is
a plan to obtain and document informed consent from subjects. Additionally, IRBs give special con-
sideration to the protection of potentially vulnerable subject populations. The changing
research landscape and some highly publicized tragedies led to calls for reform around
the turn of the twenty-first century, which resulted in the development of programs for the
accreditation of IRBs and institutional human research protection programs, a revised
Common Rule that became effective in 2019, and the trend toward single IRB review
of multisite studies. While responsibility for the protection of human subjects is shared
among multiple parties, IRBs and ethics committees play a critical role in the review
and oversight of clinical trials and research with human subjects.

Key Facts

• IRBs, first mandated by US law in 1974, are established in accordance with
federal regulations to review and monitor clinical trials and other research with
human subjects.
• IRBs help to ensure protection of the rights and welfare of human subjects by
applying the ethical principles of the Belmont Report: respect for persons, benef-
icence, and justice.
• IRBs consider the regulatory criteria for approval of research, including plans to
obtain and document informed consent from research participants.
• IRBs may exist within the institution where research is being conducted or
institutions can rely on an external IRB with a written agreement. Single IRB
review of multisite studies is mandated by NIH policy and the Common Rule.
• Outside the USA, independent review of research with human subjects is
conducted by ethics committees.

Cross-References

▶ Clinical Trials, Ethics, and Human Protections Policies
▶ Consent Forms and Procedures

References
Beecher HK (1966) Ethics and clinical research. N Engl J Med 274(24):1354–1360. https://doi.org/
10.1056/NEJM196606162742405
Cobb N, Witte E, Cervone M, Kirby A, MacFadden D, Nadler L, Bierer BE (2019) The SMART
IRB platform: a national resource for IRB review for multisite studies. J Clin Transl Sci 3(4):
129–139. https://doi.org/10.1017/cts.2019.394
Department of Health and Human Services (2011) Human subjects research protections: enhancing
protections for research subjects and reducing burden, delay, and ambiguity for investigators.
Fed Register 76(143):44512–44531. https://www.federalregister.gov/documents/2011/07/26/
2011-18792/human-subjects-research-protections-enhancing-protections-for-research-subjects-
and-reducing-burden. Accessed 26 Jun 2021
Department of Health and Human Services, FDA (2017) IRB Waiver or alteration of informed
consent for clinical investigations involving no more than minimal risk to human subjects
guidance for sponsors, investigators, and institutional review boards. https://www.fda.gov/
media/106587/download. Accessed 26 Jun 2021
Department of Health and Human Services, NIH (1998) Protection of human subjects: catego-
ries of research that may be reviewed by the Institutional Review Board (IRB) through an
expedited review procedure. Fed Register 63(216):60364–60367. https://www.hhs.gov/
ohrp/regulations-and-policy/guidance/categories-of-research-expedited-review-procedure-
1998/index.html. Accessed 4 Jun 2021
Department of Health and Human Services OHRP and FDA (2017) Minutes of Institutional
Review Board (IRB) meetings guidance for institutions and IRBs. https://www.hhs.gov/
ohrp/minutes-institutional-review-board-irb-meetings-guidance-institutions-and-irbs.
html-0. Accessed 26 Jun 2021
Emanuel EJ, Wood A, Fleischman A, Bowen A, Getz KA, Grady C, Levine C, Hammerschmidt
DE, Faden R, Eckenwiler L, Muse CT, Sugarman J (2004) Oversight of human participants
research: identifying problems to evaluate reform proposals. Ann Intern Med 141(4):282–291.
https://doi.org/10.7326/0003-4819-141-4-200408170-00008
Harkness J, Lederer SE, Wikler D (2001) Laying ethical foundations for clinical research. Bull
World Health Organ 79(4):365–366
International Committee of Medical Journal Editors (2019) Recommendations for the conduct,
reporting, editing, and publication of scholarly work in medical journals. http://www.icmje.org/
icmje-recommendations.pdf. Accessed 1 Jul 2021
Menikoff J, Kaneshiro J, Pritchard I (2017) The common rule, updated. N Engl J Med 376:613–
615. https://doi.org/10.1056/NEJMp1700736
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research
(1978) Reports and recommendations institutional review boards. https://www.hhs.gov/ohrp/
regulations-and-policy/belmont-report/access-other-reports-by-the-national-commission/index.
html. Accessed 30 Jun 2021
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research
(1979) The Belmont report. https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-
508c_FINAL.pdf. Accessed 4 Jun 2021
National Institutes of Health (2016) Final NIH policy on the use of a single institutional review
board for multi-site research. NOT-OD-16-094. https://grants.nih.gov/grants/guide/notice-files/
NOT-OD-16-094.html. Accessed 1 Jul 2021
Office for Human Research Protections (OHRP) (2021) Assurance process frequently asked
questions. https://www.hhs.gov/ohrp/register-irbs-and-obtain-fwas/fwas/assurance-process-
faq/index.html. Accessed 1 Jul 2021
Rice TW (2008) The historical, ethical, and legal background of human-subjects research. Respir
Care 53(10):1325–1329
Shalala D (2000) Protecting research subjects – what must be done. N Engl J Med 343(11):808–
810. https://doi.org/10.1056/NEJM200009143431112
678 K. R. Dunn

Steinbrook R (2002) Improving protection for research subjects. N Engl J Med 346(18):1425–1430.
https://doi.org/10.1056/NEJM200205023461828
US Congress Senate (1974) (Reprint of) National Research Act. https://www.govinfo.gov/content/
pkg/STATUTE-88/pdf/STATUTE-88-Pg342.pdf. Accessed 4 Jun 2021
U.S. Government Printing Office (1949) The Nuremberg Code: Trials of war criminals before the
Nuremberg military tribunals under control council law No. 10, vol 2. pp. 181–182. https://
history.nih.gov/display/history/Nuremberg+Code. Accessed 4 Jun 2021
White MG (2020) Why human subjects research protection is important. Ochsner J 20(1):16–33.
https://doi.org/10.31486/toj.20.5012
World Medical Association (2013) WMA Declaration of Helsinki – ethical principles for medical
research involving human subjects as amended by the 64th WMA General Assembly, Fortaleza,
Brazil. https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-
medical-research-involving-human-subjects/. Accessed 4 Jun 2021
Data and Safety Monitoring and Reporting
37
Sheriza Baksh and Lijuan Zeng

Contents
Introduction 680
DSMB Organization 682
Formation 682
Charter 684
Meeting Types 684
Meeting Settings: In-person vs Remote 687
Quorum 687
Independence 688
Confidentiality 688
DSMB Meetings 690
Structure of Meetings 690
Recommendations and Follow-up 693
Summary and Conclusions 695
Key Facts 696
Cross-References 696
References 696

Abstract
Data and safety monitoring boards (DSMBs) comprise clinical experts, statisticians, and other representatives with pertinent experience who collectively monitor the data and conduct of ongoing clinical trials to ensure the safety of trial participants and the integrity of the trial. Over the years, the use of DSMBs has become more frequent, and their mandate has expanded to evaluating interim efficacy results, making recommendations for early termination of a trial, conducting sample size reassessments, and supporting the technical aspects of a trial through other recommendations. Given the complex issues a DSMB may face, it is important that the board receive the support it needs from relevant parties in order to function effectively and independently and to make informed judgments. This chapter starts by introducing when a DSMB is warranted and provides guidance on the formation of a DSMB, highlighting approaches to ensuring adherence to data confidentiality and principles of independence. The chapter then provides an overview of different types of DSMB meetings, templates for a DSMB charter, and considerations for open and closed reports. Lastly, a listing of guidance documents on DSMBs from regulatory agencies and others is provided for reference.

Keywords
Data Monitoring Committee · Data Safety Monitoring Board · Interim data sharing

S. Baksh (*)
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]
L. Zeng
Statistics Collaborative, Inc., Washington, DC, USA

Introduction

A data and safety monitoring board (DSMB), also known as a data monitoring committee (DMC) or an independent data and safety monitoring committee (IDMC), serves as an integral part of many trials in ensuring study participant safety, assessing data integrity, and monitoring study progress. In this chapter, we use DSMB as an umbrella term to refer to data and safety monitoring boards, data monitoring committees, and independent data and safety monitoring committees.
A study’s DSMB serves as an independent resource for study investigators
and sponsors to ensure the integrity of the data, the ethical conduct of the study,
and the safety of study participants. DSMBs are often formed as part of large, phase 3, multicenter clinical trials, but they may also be used in smaller, phase 1 and 2 clinical trials where study participants comprise a vulnerable population or where interventions are high risk. Additionally, DSMBs
may be needed in emergency trials, when consent might be waived (Eckstein
2015). The independence of a DSMB enables study recommendations to be
made in the best interests of the study population and for the maximum benefit
for the intended target population. Note that not all trials need a DSMB. For example, having a DSMB may not be practical for trials with fast enrollment or short duration, nor is one necessary for trials of non-critical indications or low-risk investigational drugs.
The DSMB can have a variety of duties based upon the needs of a particular study
or request from the study Sponsor. These are often outlined in a DSMB charter or a
data and safety monitoring plan (DSMP) that the DSMB, study investigators, and
study Sponsor agree upon at the beginning of the study. Depending on the timing of
the formation of the DSMB, the DSMB may have varying input on the development
of the study protocol. Among their duties are reviewing study protocols, statistical
analysis plans, consent documents, and other participant-facing documents, advising
the trial’s Steering Committee, evaluating data for stopping the trial, and reviewing
interim analyses (Clemens et al. 2005). Through the course of the study, the DSMB
periodically meets to review and discuss the emerging data and the study perfor-
mance so as to provide recommendations in line with the jurisdiction outlined in the
charter. Recommendations may stem from the discussions in these meetings. A
summary of the recommendations from the meeting may also be shared with the
institutional review boards and other regulatory bodies to keep them abreast of any
potential safety concerns for study participants.
While not explicitly required for all clinical trials, the jurisdiction for DSMBs
has been spelled out by various regulatory and governmental agencies across the
world. While there might be slight variations in what each agency requires, each of these governing bodies has outlined the following as integral to a functional and
effective DSMB: primacy of patient safety, ensuring data integrity, and continual
oversight of study performance metrics. Table 1 lists key guidance documents
that outline the purview of DSMBs from various regulatory agencies across the
world. Investigators undertaking clinical trials in specific countries should seek
to abide by requirements outlined in the guidance documents pertinent to the
countries in which their trials are being conducted. While this list is not exhaus-
tive, it provides a sample of what one can expect when organizing a DSMB
across countries.

Table 1 Guideline documents for DSMBs by country/multi-governmental organizations

Country/Organization | Governing body | Document
United States | National Institutes of Health | NIH Policy for Data and Safety Monitoring (National Institutes of Health 1998)
United States | Food and Drug Administration | Guidance for clinical trial sponsors: establishment and operation of clinical trial data monitoring committees (FDA 2006)
European Union | European Medicines Agency | Guideline on Data Monitoring Committee (European Medicines Agency 2005)
Japan | Pharmaceuticals and Medical Devices Agency | Guideline on Data Monitoring Committee (PFSB/ELD notification No. 0404-1) (Pharmaceutical and Food Safety Bureau 2013)
Australia | National Health and Medical Research Council | Data Safety Monitoring Boards (DSMBs) (National Health and Medical Research Council 2018)
Brazil | Agência Nacional de Vigilância Sanitária, Ministry of Health | Resolution of the Board of Directors – RDC No. 9 (ANVISA 2015)
Tanzania | Tanzania Food and Drugs Authority | Guidelines for Application to Conduct Clinical Trials in Tanzania (Tanzania Food and Drugs Authority 2017)

DSMB Organization

Formation

Once a study has been funded and initial planning is underway, Sponsors may elect
to appoint a DSMB to assist with study oversight. One goal in forming an effective
DSMB is to ensure the expertise necessary for monitoring the risks and benefits to
study participants with a limited number of individuals. In some instances, including an ethicist or patient advocate, or both, on the DSMB might be prudent.
Members with this perspective can be especially helpful when the study involves
participants for whom consent is waived or for whom their condition or the studied
intervention is of a sensitive or controversial nature. While the optimal number of
DSMB members is often up for debate, the expertise of the members should be
balanced for discussion of trial issues and consensus formation.
The independence of a DSMB is essential in order that members consider both the safety of trial participants and the potential risks and benefits to the intended target patient population for the intervention under study. This holistic approach to trial integrity depends on the DSMB's independence from competing interests, research activities, and financial incentives. Without these assurances, neither those charged with study oversight nor the general public can be confident that the recommendations stemming from the DSMB are in the best interest of patient safety and the corresponding benefit-risk profile. There are many ways to protect against potential or perceived bias. Among these strategies is disclosure of conflicts of interest (COIs) at the time of DSMB formation and at the beginning of each data review.
In some situations, a DSMB may choose to designate voting and non-voting
members. In these situations, both voting and non-voting members participate in
discussions of study data; however, the voting members are tasked with deciding
upon study recommendations, including determination of continuation with recruit-
ment. Best practices typically recommend against this, however, and instead advo-
cate for recommendations stemming from consensus views in the closed session
(Fleming et al. 2017). While compositions may vary from trial to trial, DSMBs may
have a clinical expert, statistician, clinical trialist, patient advocate or representative,
and/or a Sponsor representative (Fig. 1). Non-voting members of the DSMB tend to
be those from the investigative team, and the voting members are generally those
who remain independent from the study activities. Including sponsor representatives
in the closed sessions of a DSMB is more common in government-sponsored trials
than in industry-sponsored trials, where industry sponsors usually hire an Indepen-
dent Statistical Reporting Group (ISRG) for preparing and presenting closed and/or
open reports to DSMB (Fig. 2). Given that the principal investigator is steeped in the
clinical area, he/she may recommend individuals best suited to adjudicate patient
safety and interests for a disease area, but the Sponsor ultimately signs off on the
members for the DSMB. Members of the clinical study team are not typically present
in either the open or closed session of the DSMB meeting; however, the principal
investigator may attend the open portion of the meeting to provide a scientific and
operational update of the study, and answer questions from the DSMB.

Fig. 1 (diagram) Example of DSMB composition in government-sponsored trials. Typical voting members (shown in blue rectangles): Chair, Clinical Specialist (1-2 members), Statistician, Ethicist/Patient Advocate. Typical non-voting members (shown in red ovals): Sponsor, Study Statistician.

Fig. 2 (diagram) Example of DSMB composition in industry-sponsored trials. Voting members (shown in blue rectangles): Chair, Clinical Specialist (1-2 members), Statistician, Ethicist/Patient Advocate. Non-voting member (shown in red oval): Independent Statistical Reporting Group.

Study statisticians (in government-sponsored trials) and members of the ISRG (in industry-sponsored trials) may present or orient the DSMB to the study materials and explain the analyses that were conducted, as well as any assumptions underlying those analyses.
The study Sponsor will appoint a chair for the DSMB. This individual is
usually either an expert in the clinical area under study or a statistician. The chair
will be tasked with running the flow of the meeting, coordinating with other
DSMB members for consensus on recommendations, and providing a central
voice for any clinical concerns with study conduct. The chair will communicate
with the Sponsor to relay the DSMB’s recommendations on continuance of the
study. This recommendation is often also shared with the IRBs overseeing the study.
Because of the influential nature of this role, it is imperative that the DSMB
chair is able to lead the group, encourage all members to express their views, and
forge a consensus.

Charter

The DSMB charter serves as a guideline for DSMB operations, outlining DSMB
responsibilities, providing principles for guiding DSMB decisions, and describing
procedures and workflow for the DSMB (Herson 2017; Fleming et al. 2017). The
trial sponsor usually prepares a draft charter which is later reviewed collectively by
the sponsor, DSMB, ISRG, and any other key parties involved. Table 2 provides an
outline of the organization of a typical DSMB charter.
Although the DSMB may vary in composition and practices depending on the
study, core elements of the DSMB charter remain similar across studies. Templates
for DSMB charters have been proposed in reference books (e.g., Ellenberg et al.
2019; Herson 2017). The DAMOCLES (Data Monitoring Committees: Lessons, Ethics, Statistics) Study Group (2005) also provides templates for DSMB charters.

Meeting Types

The types of DSMB meetings that are held during the trial should be described in the
DSMB Charter. The objectives, frequency, and schedule of meetings are generally
decided upon during the formation of the DSMB in conjunction with the investiga-
tors and Sponsor. The main meeting types are as follows:

Initial/Organizational/Kick-Off Meeting
The initial DSMB meeting, also known as the organizational or kick-off meeting,
should ideally be held prior to the first patient first visit. During this meeting, DSMB
members can get acquainted with each other and the sponsor’s study team, exchange
thoughts on the study design, and share their own experiences and insights. The
sponsor or investigator usually presents the current version of the protocol and
DSMB charter, and the independent reporting statistician may present the draft

Table 2 Example of Charter contents

1. Introduction (A brief description of trial information; purpose and key parties involved)
2. Committee members/organization
3. Confidentiality, independence, conflict of interest disclosure
4. Committees related to safety or trial conduct
5. Responsibilities of the parties involved
a. DSMB (Chair, Statistician, other members)
b. Sponsor or its designee
c. Independent Statistical Reporting Group
d. Contract Research Organization
e. Executive Committee/Steering Committee
6. DSMB meetings
a. Types of meeting: Kick-off/Initial/Organizational meeting;
safety review; interim analysis; final closure meeting; ad-hoc meeting
b. Meeting frequencies and format (in-person or teleconference)
c. Quorum
d. Voting or reaching consensus
7. Meeting documentation
a. Open session minutes/notes
b. Closed session minutes/notes
c. Executive session minutes/notes
d. DSMB recommendation
8. Data review plan
a. Safety review contents
b. Efficacy review contents
9. Interim analysis plan
a. Statistical guidelines
10. Organization diagram and data flow
11. Communication flow related to
a. Safety concerns
b. Pre-specified interim analysis results
c. Regular safety review meeting recommendation
12. Duration, disbandment of DSMB
13. Appendix
a. Recommendation form format
b. DSMB contact information
c. Sponsor key personnel contact information
d. Independent statistical group contact information

report templates to solicit any feedback from the DSMB during the early stages of
interaction. This is a valuable opportunity for study investigators to gather input
from other leaders in the clinical field.
To have a productive meeting, the study materials such as protocols, important
forms, patient-facing materials, the draft DSMB charter, and other relevant materials
should be made available to the DSMB prior to the initial meeting. Shortly after the
meeting, the Sponsor or DSMB, or both, will approve and sign the charter, according to the Sponsor's SOPs.

Meetings Following First Patient First Visit


The DSMB Charter should outline how frequently the DSMB will be meeting. Soon
after the initial meeting, a representative of the ISRG will schedule the first data
review meeting, which usually takes place after a specified number of patients are
enrolled, or at a pre-specified timeframe, whichever occurs first. Because of the inherent uncertainty in the rate of enrollment, it is operationally easier to meet at a pre-specified time, in that the meeting can be scheduled ahead of time.
In cases where enrollment is slow, having the first data review meeting at a pre-specified timeframe (e.g., at 6 months after first patient first visit) ensures that the DSMB has an opportunity to review available data from patients already
enrolled, monitor performance metrics, and assess safety profiles on the study
drug. In scenarios where enrollment is much faster than expected, the first data
review meeting can be held on a date earlier than originally planned.

Safety Review Meetings


The frequency of safety review meetings specified in the charter depends on the study's disease area, study design, expected accrual, and expected safety profile from previous studies (when applicable). For example, oncology trials commonly hold safety review meetings every three or four months, biannually, or annually. On the other hand, in orphan disease trials where the sample size is inherently small, the DSMB may meet once data from a threshold number of new patients become available, reviewing data from newly enrolled patients as well as cumulative data from all enrolled.

Interim Analysis Review Meetings


To control the overall type-I error rate of the trial, the criteria, statistical analyses, and
actions taken toward each potential outcome for interim analyses should be
pre-specified in the protocols, DSMB charters, statistical analysis plan (SAP),
and/or additional interim analysis plan documents. Refer to ▶ Chap. 59, “Interim
Analysis in Clinical Trials” for more details on interim analysis.
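To make the alpha-spending idea concrete, the short sketch below computes the cumulative one-sided alpha spent at a few information fractions under a Lan-DeMets O'Brien-Fleming-type spending function. It is an illustrative calculation only, not taken from any particular trial's SAP; the function name and the chosen information fractions are hypothetical.

from scipy.stats import norm

def of_alpha_spent(t, alpha=0.025):
    # Lan-DeMets approximation to an O'Brien-Fleming boundary:
    # cumulative one-sided alpha spent at information fraction t
    z = norm.ppf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: cumulative alpha spent = {of_alpha_spent(t):.5f}")
# Very little alpha is spent early (about 0.00001 at t = 0.25), preserving
# nearly the full one-sided 0.025 for the final analysis.

Note that the actual stopping boundaries at each look require numerical integration over the joint distribution of the sequential test statistics; dedicated group-sequential software is normally used for that step.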

Ad-hoc Meetings
Between pre-specified safety and efficacy review meetings, the DSMB and Sponsor
may request ad-hoc meetings to review ad-hoc analyses, address emerging safety
issues from monthly safety reports (or SAE narratives), or discuss important new
information external to the trial. When the DSMB requests an ad-hoc meeting, the
details of the meetings should not be communicated to the Sponsor until the
conclusion of the trial unless the DSMB issues a recommendation to modify or
terminate the trial in response to findings from the meetings. The documentation for
the ad-hoc meetings should still follow the same process as the periodic data review
meetings. Note that the implications of additional analyses should be considered and
factored into the alpha-spending as specified in the SAP a priori.

Final Results/End-of-Trial Meeting


As the trial results become available after the final database lock, Sponsors may plan a
meeting (also known as the End-of-Trial Meeting) to share with the DSMB the final
analysis results and interpretation. Some sponsors may choose to present a press release of the study results; others may share a publication draft with the
sponsors to solicit any feedback from the DSMB regarding their experience during the
course of the trial.

Meeting Settings: In-person vs Remote

The actual format of a given DSMB meeting will depend on scheduling, DSMB preferences, and the complexity of issues to be discussed at the meeting. In-person
DSMB meetings, which often allow for more effective interactions and communi-
cation, are preferred at the initial meeting, interim efficacy/futility analysis meeting,
final meeting, and/or other pre-specified review meetings.
For example, it is generally preferable to have an in-person meeting for the
study kick-off. This allows DSMB members to get familiarized with each other and
share their experiences. When an important decision is made regarding whether the
DSMB is recommending early termination of a study due to safety, efficacy, or
futility, it is valuable to have DSMB members in the same room, if possible, in
order to assess the benefit and risk profiles of study drugs carefully, thoughtfully
exchange their opinions and concerns, and come to a consensus when there is conflicting feedback. Moreover, having the DSMB meet in-person on a
regular basis (e.g., annually) is recommended. However, meeting in-person may
not always be necessary or efficient. Once the DSMB becomes very familiar with
the trial or has observed no major safety issue after numerous meetings, it may be
sufficient to meet by teleconference or videoconference. Meeting in-person may
not be possible for ad-hoc discussions on emerging trial issues given the short
notice and not practically feasible if the DSMB needs to closely monitor the trial
population and meet frequently (i.e., every other week) to review new information.

Quorum

DSMB members should make every attempt to attend each meeting either
in-person or by teleconference. However, in cases where not all members can be
present, the DSMB Chair, or designee, should contact any absent individual before
or after the meeting, or both, to obtain their opinion in writing after their review of
all materials discussed during the meeting. The inclusion of opinions from absent
members is at the discretion of the DSMB, as outlined and pre-specified in the
charter.
Usually, at a minimum, the DSMB Chair and the DSMB Statistician should be
present to hold a meeting. However, many charters require that all voting members be
present, unless there are extenuating circumstances, when making a recommendation to the Sponsor related to early termination or any other modification of the study protocol.

Independence

To provide an objective assessment of the benefit and risk profile of study drugs and
make recommendations on the studies, members of the DSMB must remain inde-
pendent and avoid all COIs that could affect their decision making. COIs can arise in many situations: some are easy to ascertain (for example, financial or research interests), while others can be harder to avoid or cannot be fully eliminated. For example, owning shares or investing in the sponsor's company stock is an obvious financial COI, which precludes one from serving on the committee.
DSMB members usually receive some honorarium (financial compensation) from
the sponsors for their time serving on the boards; however, the amount of the
honorarium should not be so high that it might potentially bias the DSMB’s decision
making. For more details on financial COIs, refer to ▶ Chap. 28, “Financial Con-
flicts of Interest in Clinical Trials.”
Other than financial incentives, potential research-driven or intellectual COIs are
also common among clinical and statistical experts who are usually involved in or
serve as consultants for multiple research projects. In general, the investigators in a
trial may not serve on a DSMB for a competing trial. One may not even serve on the DSMBs of competing trials at the same time, to avoid inadvertently sharing confidential information across trials.
It is not, however, uncommon for a single DSMB to monitor multiple ongoing trials in the same or related programs, as this allows the DSMB to more efficiently make informed recommendations based on information from the associated trials.
Requiring members to be completely free from any COIs is difficult to achieve given
the varying subject matter expertise required on each board. As such, full disclosure
of any potential COI is critical to avoid compromising the DSMB’s recommenda-
tions as the trial proceeds. If, through the course of the study, any of these tenets of
independence have changed, the members should disclose their status to the chair of
the DSMB and the study Sponsor who will decide whether the member is still
sufficiently independent to remain on the Board.

Confidentiality

Maintaining confidentiality of all interim information (data, analyses, meeting discussions, documentation, etc.) from an ongoing trial is one of the most crucial principles for protecting the integrity and credibility of the trial.
In general, limited trial data aggregated across treatment groups can be presented in the open session to facilitate an informative discussion between sponsors and the
DSMB to address issues related to trial conduct or management (e.g., enrollment,
dropout, protocol deviations, or timeliness of data from different sources).

Unblinded comparative safety and efficacy data should be accessible only to the
DSMB and ISRG. The FDA Guidance (2006) states: “Even for trials not conducted
in a double-blind fashion, where investigators and patients are aware of individual
treatment assignment and outcome at their sites, the summary evaluations of com-
parative unblinded treatment results across all participating centers would usually
not be available to anyone other than the DSMB.”
Although it may be tempting to use positive trial data from interim analyses to inform subsequent planning of product development, caution is needed when interpreting and relying on immature trial results from interim analyses, as studies (Woloshin et al. 2018; Wayant and Vassar 2018) have shown results that are inconsistent in magnitude and even direction between interim assessments and final analyses at the end of trials. The spread of unreliable interim comparative
efficacy data may adversely affect patient adherence to study drugs, recruitment,
and long-term follow-up (Ellenberg et al. 2019). Inappropriate release of interim data
could even lead to early termination due to breach of confidentiality (see example
below regarding the LIGHT trial, Nissen et al. 2016). The FDA Guidance on
Establishment and Operation of Clinical Trials Data Monitoring Committees
(2006) states the following:

Knowledge of unblinded interim comparisons from a clinical trial is generally not necessary
for those conducting or sponsoring the trial; further, such knowledge can bias the outcome of the study by inappropriately influencing its continuing conduct or the plan of analyses.
Unblinded interim data and the results of comparative interim analysis, therefore, should
generally not be accessible by anyone other than DSMB members or the statistician(s)
performing these analyses and presenting to the DSMB.

In some cases, the DSMB needs to notify the Sponsor and release safety data so that the Sponsor can inform regulatory agencies. When DSMBs observe an increased
risk in certain safety events, the committee may raise the concern to the Sponsor and
recommend informing investigators and patients. The DSMB may recommend
collecting additional information to support further safety reviews, or on some
occasions, modifying the trial procedures to protect patients in the trial. DSMBs,
in this case, may share relevant safety data with limited individuals from the Sponsor
to support subsequent procedures.
In anticipation of the need to release unblinded data and meeting materials from the DSMB, a data access plan or other relevant SOPs are important to limit the spread of
confidential information by specifying when the data could be shared, who will have
access, and how unblinded materials and results will be communicated or trans-
ferred. Often, the Sponsor may appoint a ‘firewalled’ group, internally or externally
(e.g., members from Executive Committees or Steering Committees), to receive
these data if the DSMB recommendation warrants this.
The following example shows the detrimental impact of inappropriate handling of confidential interim trial data and highlights the importance of maintaining confidentiality of data from an ongoing trial to preserve its integrity.
Example of early termination due to inappropriate public release of confidential
interim data by the sponsor – the LIGHT trial (Nissen et al. 2016)

The LIGHT trial compared the effect of naltrexone-bupropion to placebo on major adverse cardiovascular events (MACE) in overweight and obese patients with cardiovascular risk factors. The study used a two-stage noninferiority design, with an interim analysis at 25% information time to rule out an upper bound of the 95% confidence interval of the hazard ratio (HR) exceeding 2.0 for regulatory approval, followed by an analysis at study completion to exclude a HR of 1.4 post-approval. After approximately 25% of the expected events, the first interim analysis was conducted; the data ruled out the null hypothesis, as the upper confidence bound of the HR did not exceed the 2.0 risk margin. The DSMB released the initial noninferiority analysis results to a core team from the sponsor, according to the data access plan, for regulatory filing while the blinded trial continued as planned. However, the Sponsor disseminated the unblinded results far beyond the intended core team. Even worse, the Sponsor publicly released the 25% interim analysis by applying for a patent. Subsequently, while the trial was ongoing, the Sponsor reported to the SEC that the HR was 0.59 (95% CI, 0.39–0.90) (SEC 2015). Around the same time, the results of the 50% information time interim analysis became available, showing a HR of 0.88 (99.7% CI, 0.57–1.34), which was less favorable than the HR released in the SEC document. The study was terminated early because of the release of confidential trial data. Had the sponsor properly handled the confidential information from the interim analysis, the study would have continued to completion, and information on the long-term safety profile of the study drug would have been learned.
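As an aside, the interim noninferiority criterion in this example can be reproduced with simple arithmetic on the log hazard ratio scale. The sketch below back-calculates the standard error implied by the reported interim HR of 0.59 (95% CI, 0.39–0.90) and checks the upper confidence bound against the 2.0 margin; it is a rough illustration, not the trial's actual analysis.

import math

hr, lo, hi = 0.59, 0.39, 0.90                    # reported interim HR and 95% CI
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE of log(HR) implied by the CI
upper = math.exp(math.log(hr) + 1.96 * se)       # upper 95% bound, about 0.90
print(f"upper bound = {upper:.2f}; within the 2.0 margin: {upper < 2.0}")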

DSMB Meetings

Structure of Meetings

DSMB meetings, much like many other types of study meetings, are often a
dance between study investigators, experts in the field, and other vested parties,
such as the study Sponsor. As such, the meeting structure reflects this power
dynamic and enables important information to reach the DSMB members in the
closed session, while offering others from the investigative team an opportunity
to weigh in during the open session. Additionally, any interpretation of the
recommendations, summaries, or subsequent actions taken must be done in
light of these interpersonal dynamics. A typical data review meeting consists of an open session, a closed session, and an optional executive session (described below) afterwards. Each session has a predetermined roster, agreed
upon and outlined in the DSMB charter. Adhering to these agreements maintains
data integrity while allowing for recommendations and decision-making in light
of study data presented by treatment group. Both the open and closed sessions
may have an accompanying report with the data to be reviewed and discussed.
We have provided a sample table of contents in Table 3. While not exhaustive,
this list contains elements one might consider presenting in a meeting following the first patient first visit.

Table 3 Components of a DSMB Report

Open Session Report
1. Study overview
2. Scientific updates
3. Recent protocol amendment(s)
4. Statistical analysis plan
5. Performance metrics
a. Screenings over time
b. Randomizations over time
c. Data currency
d. Protocol deviations
e. Withdrawals over time
6. Baseline characteristics, aggregated
7. Safety outcomes
a. Adverse events, aggregated
b. Serious adverse events, aggregated
Closed Session Report
All the following data are presented by treatment group
1. Baseline characteristics
2. Study disposition and treatment status
3. Protocol deviations
4. Efficacy outcomes
5. Safety outcomes
a. Adverse events
b. Serious adverse events
c. Unanticipated events
d. Safety laboratory assessments

Open Session
The open session, attended by representatives from sponsors or investigators, the
DSMB, and the ISRG, provides the DSMB opportunities to discuss with the Sponsor
issues related to data quality, trial conduct, and trial management in a blinded
manner. Topics include but are not limited to enrollment, dropouts, timeliness of
data from different sources, protocol deviations, and inclusion/exclusion questions.
The Sponsor can use this opportunity to seek advice from the DSMB on emerging
trial issues.
The open session usually starts by checking with the DSMB to assess whether any new conflicts of interest have arisen. The study team or Sponsor representative then may
take the lead in presenting their perspectives on the trial progress and new informa-
tion from relevant clinical programs or literature external to the trial that may have an
impact on the study.
The Sponsor representatives should also provide updates regarding action items
from the Sponsor from previous meetings if they were not resolved soon after the
meetings. In general, the open session for a periodic safety data review should be
concise to ensure that the DSMB has enough time to discuss contents by treatment
group in the closed session. The open session is usually accompanied by a corresponding, confidential open session report, which may contain a brief overview
of the study with any protocol changes that have been made since the last DSMB
meeting. There may also be a discussion of any proposed changes for the DSMB to
provide input; however, the input from the DSMB should not be based on unblinded
trial data. The Sponsor might include a copy of the DSMB charter and the statistical
analysis plan as a quick reference for all in attendance. In addition to these study
documents, the open report may contain performance metrics, baseline characteris-
tics of the study population, and aggregated safety data. It is generally not advised to
present efficacy data in the open session; however, if such data must be discussed in
this forum, efficacy data should be presented in aggregate.
Some sponsors and study teams may choose to blind themselves from certain data
domains even for open-label trials. For example, certain laboratory endpoints can be used to make tentative inferences about statistical evidence concerning the safety or efficacy of the study drug. The DSMB needs to be cautious during the open session discussion to avoid inadvertently disclosing any unblinded information to the sponsors.

Closed Session
In the closed session, attended by DSMB and ISRG only, the DSMB reviews data on
such issues as enrollment, trial status, safety, and efficacy presented by treatment group, and discusses the overall benefit and risk profile of the study drugs. Variations abound in how to approach a closed session: some DSMB Chairs may lead the discussions; others may assign members with different expertise to lead topics related to different issues; still others may delegate the high-level review to an ISRG statistician who is most familiar with the data and reports and is able to highlight new information since the previous reviews, answer questions related to the data, and inter-
pret the presentations included in the closed session report. Regardless of the
meeting styles, all DSMB members should have thoroughly reviewed the reports
prior to the meetings.
To facilitate a productive data review and discussion during the closed session,
the closed report should contain comprehensive data that are presented in a com-
prehensible manner (Buhr et al. 2018). Depending on the study objective and the
focus of the review, the structure and contents of the reports may vary. Typically, the
closed report starts with an executive summary table of the study, highlighting high-
level study status, safety, and/or efficacy events by treatment group, followed by
more detailed summaries presented in tables and figures. When detailed information
on patients and events is needed, listings of events of interest are provided to supplement the review. The data in the closed reports should be presented by unblinded treatment group to inform assessments of the relative benefit-to-risk
profiles. In general, the closed reports should cover pre-specified efficacy assess-
ments, adverse events with corresponding clinical details by treatment group, pro-
tocol deviations and subsequent actions, and any unanticipated events that may have
occurred. These data are also discussed in light of the information presented in the
scientific updates from the open session. The DSMB may discuss areas where they
may like to request additional analyses or subgroup analyses to better understand the
patterns that are emerging.
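To make the open-versus-closed reporting distinction concrete, the sketch below builds an aggregated adverse event summary (suitable for an open report) and the same summary by treatment group (closed report only) from a small, entirely hypothetical dataset; the column names are illustrative and do not follow any particular data standard.

import pandas as pd

# Hypothetical participant-level adverse event records (illustrative only)
ae = pd.DataFrame({
    "subject_id": [101, 102, 103, 104, 105, 106],
    "treatment": ["A", "A", "B", "B", "A", "B"],  # unblinded arm codes
    "ae_term": ["Nausea", "Headache", "Nausea", "Rash", "Nausea", "Headache"],
    "serious": [False, False, True, False, False, True],
})

# Open report: events aggregated across arms, with no treatment column
open_summary = ae.groupby("ae_term").agg(
    events=("subject_id", "size"), serious=("serious", "sum"))

# Closed report: the same summary presented by treatment group
closed_summary = ae.groupby(["treatment", "ae_term"]).agg(
    events=("subject_id", "size"), serious=("serious", "sum"))

print(open_summary, closed_summary, sep="\n\n")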
At the end of the closed session, the DSMB should strive to come to a consensus
with respect to recommendations for the trial continuation, instead of using a
majority vote approach. In situations where consensus cannot be reached, a superma-
jority may be recorded with discussions and rationale for the decision-making
documented in the closed session minutes. In addition, the DSMB can discuss and
request ad-hoc analyses for the ISRG or Sponsor to address in follow-up meetings or
correspondences.

Executive Session
After the closed session, there may be an optional executive session limited to the
actual members of the DSMB, where the DSMB has the opportunity to escalate
action items to the Sponsor and communicate the meeting recommendations ver-
bally. Whether an executive session is warranted is at the discretion of the DSMB.
There are typically no data prepared specifically for discussion during the executive session. The outcome of this executive session, however, is recorded in the
meeting minutes and might be shared with the IRB or other regulatory authorities.

Recommendations and Follow-up

A DSMB meeting can result in a variety of outcomes with implications for the trajectory of the study. Typically, the DSMB can recommend one of four things: 1) continuation of
the study without modification, 2) continuation of the study with recommended
modifications, 3) termination of the study, or 4) suspension of enrollment pending
resolution of issues or concerns. Each of these options carries considerable risks and
benefits to the final interpretation of trial results and overall conclusions for the
patient population. In addition to these overarching recommendations, the DSMB
may also recommend that the investigative team amend the current protocol, change
enrollment strategies, improve the speed and accuracy of data entry, open or close
clinical sites, audit clinical sites, as well as other changes to trial activities. These
suggested changes may emanate from changes in trial data or other external factors
discussed during the meeting. After the conclusion of the meeting, the DSMB chair, in consultation with the other members of the DSMB, typically prepares a letter recommending continuation or termination of the trial along with any other
suggested changes or additional analyses. This letter is then submitted to all the
IRBs involved in the trial. Below, we highlight a few real-world examples of DSMB
recommendations for consideration.
Example of external regulatory authorities intervening in trial conduct – the
ATMOSPHERE trial (Swedberg et al. 2016)
The Aliskiren Trial to Minimize Outcomes in Patients with Heart Failure
(ATMOSPHERE) trial provides an example of several external factors determining
the trajectory of a clinical trial. In this study, participants were randomized to
enalapril, aliskiren, or a combination of both drugs for the prevention of death
from cardiovascular causes or hospitalization for heart failure. Aliskiren had previ-
ously been approved for patients with hypertension in the United States and
European Union. Concurrent with ATMOSPHERE, aliskiren was also used in two similar trials, ASTRONAUT and ALTITUDE, in slightly different populations. The
DSMB for ATMOSPHERE also served as the DSMB for ASTRONAUT and they
were aware of the accumulating data from ALTITUDE. ASTRONAUT, which had
closed recruitment, showed a higher proportion of participants with renal dysfunc-
tion on aliskiren than on placebo (14.1% vs. 10.2%). ALTITUDE had accumulated
69% of projected events and reported increased adverse events associated with
aliskiren. After reviewing the data from ALTITUDE and ASTRONAUT, but not
ATMOSPHERE, the Clinical Trials Facilitation Group of the European Union requested that the sponsor, Novartis, discontinue aliskiren in all patients with diabetes in ATMOSPHERE. Despite the assurance from the DSMB for ATMOSPHERE that
they had carefully considered the data from ALTITUDE and ASTRONAUT in their
recommendation to proceed, Novartis complied with the request of the Clinical
Trials Facilitation Group to pause the treatment among diabetic patients. Because of the censoring of follow-up time during the treatment pause, the study had to be extended for an additional year to meet the targeted number of events.
Example of DSMB stopping trial based on primary outcome – the EOLIA trial
(Harrington and Drazen 2018)
In the ECMO to Rescue Lung Injury for Severe ARDS (EOLIA) trial, investiga-
tors studied the use of extracorporeal membrane oxygenation (ECMO) compared to
standard of care in the treatment of severe acute respiratory distress syndrome
(ARDS) (Combes et al. 2018). Because of the nature of the intervention, treating
clinicians were unmasked to the treatment groups. Consequently, following the protocol, those randomized to standard of care could be switched to ECMO during the course of treatment for rescue use. By the end of the trial, 28% of
those randomized to the standard of care had switched to ECMO, with 57% of these
crossover participants dying. The investigators noted that the high proportion of
crossover inhibited their ability to draw conclusions about the use of ECMO for the
primary outcome of mortality at 60 days. They did, however, see a significant effect
of ECMO on the secondary outcome of treatment failure, defined as death in the
ECMO group versus death or crossover in the standard of care group. After roughly
three-quarters of the projected participants were enrolled, the DSMB stopped the
trial for futility at the fourth interim analysis. Critics of this decision contend that had
the trial continued, investigators might have had greater evidence for the secondary
outcomes, some of which were trending toward favoring ECMO as a treatment.
These critics encourage future DSMB members to treat the stopping guidelines as
true guidelines and consider the impact of these decisions on other outcomes, both
safety and efficacy, in the trial (Harrington and Drazen 2018).
Example of emerging evidence influencing DSMB – the MOXCON trial (Pocock
et al. 2004)
The MOXonidine CONgestive Heart Failure (MOXCON) trial provides a classic
example of a trial that was halted for safety concerns. The study was designed to
investigate the use of moxonidine for the prevention of all-cause mortality in patients
with NYHA class II–IV heart failure (Cohn et al. 2003). Initially powered to detect a
20% reduction in all-cause mortality, the study required 724 deaths. Of note, a
concurrent dose-finding trial of moxonidine was not completed at the time of the
start of MOXCON. While concerns were raised, MOXCON was permitted to start,
despite the fact that it was studying the highest dose used in the dose-finding study. An interim analysis conducted when roughly one-quarter of the expected enrollment had occurred showed a trend of increased mortality with the use of moxonidine. Despite
the early indicators of increased risk for mortality, the small numbers of deaths
combined with the lack of safety concerns in the then completed dose-finding trial
led the DSMB to recommend continuation with a planned teleconference before the
next 6-month safety analysis. At this analysis, a nominal p-value of less than 0.05
was observed, with 37 deaths in the moxonidine group and 20 deaths in the placebo
group. After an investigation of the potential causes of the deaths, time to death,
dosing, other serious adverse events, and baseline characteristics, the DSMB was put in the uncomfortable position of recommending termination after less than 10% of
the expected deaths had occurred.
The DSMB then discussed this concern with the MOXCON Executive Commit-
tee and came to a consensus to recommend stopping randomization and closing out
participants currently enrolled and being treated in the trial. They ultimately left the
final decisions up to the MOXCON Senior Management. In their published debrief
of this experience, the DSMB also noted that they recognized the difficult position
the Executive Committee faced: continuing to proceed with MOXCON despite a
contrary recommendation from the DSMB could potentially raise serious concerns
about the interpretation of the results at the completion of the trial. The importance of
this power dynamic and delineation of roles is especially evident at the time of
decision-making.
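As a rough plausibility check on the mortality imbalance in the MOXCON example (37 vs. 20 deaths), one can ask how surprising such a split would be by chance alone. The sketch below assumes 1:1 randomization and comparable follow-up (details not stated above), so that under the null hypothesis each death is equally likely to fall in either arm; it is a back-of-the-envelope calculation, not the trial's actual analysis.

from scipy.stats import binom

deaths_active, deaths_placebo = 37, 20
n = deaths_active + deaths_placebo
# Tail probability of 37 or more of the 57 deaths falling in one arm
p_one_sided = 1 - binom.cdf(deaths_active - 1, n, 0.5)
p_two_sided = min(1.0, 2 * p_one_sided)  # crude two-sided version
print(f"approximate two-sided p = {p_two_sided:.3f}")  # roughly 0.03, below 0.05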
Each of these examples provides a unique snapshot of how DSMB recommendations can be unpredictable and determinative of a study's direction. Regardless of
how impactful or mundane the recommendations, they are recommendations. Ulti-
mately, decision-making for the study rests with the study Sponsor, as do the
consequences of those decisions.

Summary and Conclusions

A DSMB plays an integral role in the conduct of many multicenter clinical trials. It serves as a check on competing interests in the name of participant safety and a balance to the inherent biases trial investigators and Sponsors may hold regarding the outcome of the trial. While a DSMB is not an overarching governing body for a clinical trial, its recommendations do carry weight, and when presented to the
outside scientific community, can influence the interpretation of trial results as
illustrated in the examples in this chapter. Developing a DSMB charter and defining the purview of this group can seem like a rudimentary task at the outset of a trial, but it can impede or enhance the utility of the DSMB in the conduct of the trial. It is imperative that the members of the DSMB, trial investigators, and trial
Sponsors carefully consider the needs of the study, the independence of the DSMB,
and the potential concerns of the patient group impacted by the study results when
developing this document. Through this collaborative and intentional effort, the
DSMB can best serve in its capacity to monitor the trial for study integrity, safety,
and efficacy.

Key Facts

1. DSMBs best serve as an independent check on data integrity and patient safety
and as a balance to the decision-making power of study leadership.
2. While DSMBs provide recommendations for the trajectory of a trial, their
suggestions command respect from the broader scientific community, as the
recommendations balance competing interests of other study stakeholders.
3. Investing in the development of a comprehensive DSMB Charter, with consideration for the needs of the study and the patient population, will not anticipate every possible decision to be made, but it will provide the parameters for effective decision-making when difficult situations arise.

Cross-References

▶ Financial Conflicts of Interest in Clinical Trials


▶ Interim Analysis in Clinical Trials
▶ Issues for Masked Data Monitoring

References
ANVISA (2015) Resolution of the board of directors – RDC no. 9. Ministry of Health. Retrieved
from http://antigo.anvisa.gov.br/documents/10181/3503972/RDC_09_2015_COMP.pdf/
e26e9a44-9cf4-4b30-95bc-feb39e1bacc6
Buhr KA, Downs M, Rhorer J, Bechhofer R, Wittes J (2018) Reports to independent data
monitoring committees: an appeal for clarity, completeness, and comprehensibility. Ther
Innov Regul Sci 52(4):459–468. https://doi.org/10.1177/2168479017739268. Epub 2017
Nov 13
Clemens F, Elbourne D, Darbyshire J, Pocock S (2005) Data monitoring in randomized controlled
trials: surveys of recent practice and policies. Clin Trials 2(1):22–33. https://doi.org/10.1191/
1740774505cn064oa
Cohn JN, Pfeffer MA, Rouleau J, Sharpe N, Swedberg K, Straub M, ... Wright TJ (2003) Adverse
mortality effect of central sympathetic inhibition with sustained-release moxonidine in patients
with heart failure (MOXCON). Eur J Heart Fail 5(5):659–667. https://doi.org/10.1016/s1388-
9842(03)00163-6
Combes A, Hajage D, Capellier G, Demoule A, Lavoué S, Guervilly C, ... Mercat A (2018)
Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome. N Engl
J Med 378(21):1965–1975. https://doi.org/10.1056/NEJMoa1800385
DAMOCLES Study Group (2005) A proposed charter for clinical trial data monitoring committees:
helping them do their job well. Lancet 365:711–722

Eckstein L (2015) Building a more connected DSMB: better integrating ethics review and safety
monitoring. Account Res 22(2):81–105. https://doi.org/10.1080/08989621.2014.919230
Ellenberg SS, Fleming TR, DeMets DL (2019) Data monitoring committees in clinical trials: a
practical perspective, 2nd edn. Wiley, Hoboken, NJ
European Medicines Agency (2005) Guideline on data monitoring committees. (EMEA/CHMP/
EWP/5872/03 Corr). European Medicines Agency, London. Retrieved from https://www.ema.
europa.eu/en/documents/scientific-guideline/guideline-data-monitoring-committees_en.pdf
Fleming TR, DeMets DL, Roe MT, Wittes J, Calis KA, Vora AN, Meisel A, Bain RP, Konstam MA,
Pencina MJ, Gordon DJ, Mahaffey KW, Hennekens CH, Neaton JD, Pearson GD, Andersson
TL, Pfeffer MA, Ellenberg SS (2017) Data monitoring committees: promoting best practices to
address emerging challenges. Clin Trials 14(2):115–123. https://doi.org/10.1177/
1740774516688915. Epub 2017 Feb 1. PMID: 28359194; PMCID: PMC5380168
Harrington D, Drazen JM (2018) Learning from a trial stopped by a data and safety monitoring
board. N Engl J Med 378(21):2031–2032. https://doi.org/10.1056/NEJMe1805123
Herson J (2017) Data and safety monitoring committees in clinical trials, 2nd edn. Taylor & Francis,
Boca Raton, FL
National Health and Medical Research Council (2018) Data safety monitoring boards (DSMBs).
(978-1-86496-004-4). National Health and Medical Research Council. Retrieved from www.
nhmrc.gov.au/guidelines-publications/EH59C
National Institutes of Health (1998) NIH policy for data and safety monitoring. National Institutes
of Health. Retrieved from https://grants.nih.gov/grants/guide/notice-files/not98-084.html
Nissen SE, Wolski KE, Prcela L et al (2016) Effect of naltrexone-bupropion on major adverse
cardiovascular events in overweight and obese patients with cardiovascular risk factors: a
randomized clinical trial. JAMA 315(10):990–1004. https://doi.org/10.1001/jama.2016.1558
Pharmaceutical and Food Safety Bureau (2013) Guideline on data monitoring committee (PFSB/
ELD notification No.0404-1). Ministry of Health, Labour and Welfare, Japan. Retrieved from
https://www.pmda.go.jp/files/000232300.pdf
Pocock S, Wilhelmsen L, Dickstein K, Francis G, Wittes J (2004) The data monitoring experience
in the MOXCON trial. Eur Heart J 25(22):1974–1978. https://doi.org/10.1016/j.ehj.2004.
09.015
Swedberg K, Borer JS, Pitt B, Pocock S, Rouleau J (2016) Challenges to data monitoring
committees when regulatory authorities intervene. N Engl J Med 374(16):1580–1584. https://
doi.org/10.1056/NEJMsb1601674
Tanzania Food and Drugs Authority (2017) Guidelines for application to conduct clinical trials in
Tanzania, 3rd edn. Retrieved from https://www.tmda.go.tz/uploads/publications/
en1554368837-TANZANIA%20CLINICAL%20TRIAL%20GUIDELINES-%202017.pdf
U.S. Food and Drug Administration (2006) Guidance for clinical trial sponsors: establishment and
operation of clinical trial data monitoring committees. March 2006. Available at: https://www.
fda.gov/media/75398/download
U.S. Securities and Exchange Commission (2015) From 8-K. Orexigen Therapeutics, Inc. File
number 001-33415. March 3, 2015. Available at: https://www.sec.gov/Archives/edgar/data/
1382911/000119312515074251/d882841d8k.htm
Wayant C, Vassar M (2018) A comparison of matched interim analysis publications and final
analysis publications in oncology clinical trials. Ann Oncol 29:2384–2390. https://doi.org/10.
1093/annonc/mdy447
Woloshin S, Schwartz LM, Bagley PJ, Blunt HB, White B (2018) Characteristics of interim
publications of randomized clinical trials and comparison with final publications. JAMA 319
(4):404–406. https://doi.org/10.1001/jama.2017.20653
38 Post-Approval Regulatory Requirements

Winifred Werther and Anita M. Loughlin

Contents
Introduction ... 700
History of US and EU Regulations ... 702
  History of Post-Approval Studies in the USA ... 702
  History and Legal Framework of Post-Approval Studies in Europe ... 703
Post-Approval Terminology and Definitions ... 706
Post-Approval Study Designs ... 706
  Clinical Trials ... 706
  Observational Studies ... 713
Enforcement of Post-Approval Studies by Regulatory Agencies ... 719
  US PMC and PMR Enforcement ... 720
  EU PAM Enforcement ... 720
Systematic Reviews of Post-Approval Studies in the USA and EU ... 721
  Reviews of Post-Approval Studies in the USA ... 721
  Reviews of Post-Approval Studies in the EU ... 721
Summary and Conclusions ... 722
Key Facts ... 723
Cross-References ... 723
References ... 723

Abstract
Health authorities throughout the world have regulations for requesting additional
research in the post-approval setting. This chapter focuses on the regulations in
the USA and European Union (EU). The history of post-approval studies can be
traced through changing regulations enforced by the US Food and Drug
Administration (FDA) and the EU European Medicines Agency (EMA).
W. Werther (*)
Center for Observational Research, Amgen Inc, South San Francisco, CA, USA
e-mail: [email protected]

A. M. Loughlin
Corrona LLC, Waltham, MA, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_256

Post-approval studies are either clinical trials (interventional) or observational
(non-interventional) studies. Choosing a study design may be influenced by the
strengths and weaknesses of the design options and available data sources.
Imposed post-approval studies are reviewed for compliance by the regulatory
agencies. For clinical trials that are ongoing at the time of approval, often these
are classified as post-marketing commitment (PMC) in the USA or post-authorization
measure (PAM) in the EU. Findings of these trials can be submitted to the
health authorities for addition to the prescribing information. The FDA and EMA
both track progress on PMCs/PMRs and PAMs, respectively.

Post-approval studies are necessary to continually gather data on the safety and
effectiveness of approved drugs. These studies are regulated by health authorities,
included in registries (e.g., ClinicalTrials.gov, ENCePP), and tracked to
completion. This chapter reviews the history of the regulations, terminology, study
designs, and systematic reviews of the published post-approval studies.

Keywords
Post approval · Post marketing · Post authorization · Pharmacovigilance ·
Pharmacoepidemiologic

List of Abbreviations
CFR Code of Federal Regulations
EMA European Medicines Agency
EU European Union
FDA Food and Drug Administration
FDAAA Food and Drug Administration Amendments Act
MAH Market authorization holder
PAES Post-authorization efficacy study
PAM Post-authorization measure
PAS Post-authorization study
PASS Post-authorization safety study
PMC Post-marketing commitment
PMR Post-marketing requirement
PREA Pediatric Research Equity Act
REMS Risk evaluation and mitigation strategy
USA United States

Introduction

The collection of information on safety and efficacy of medical treatments often does
not end with the approval of medical products. Health authorities throughout the
world have regulations for requesting additional research in the post-approval
setting. Post-approval research can include clinical trial methodologies, as well as
observational studies.
The data included in regulatory submissions for approval of medical interventions,
specifically drugs and devices, are limited by the scope of the pre-approval clinical
trials. There may be limitations in safety and efficacy based on patient
characteristics, duration of therapy, size of patient population, and the ability to identify rare
outcomes. A few examples of when additional trials and studies may be requested by
the regulatory agencies are as follows. First, if pre-approval or registrational studies
were limited to specific age groups, additional studies in unstudied age groups likely
to receive the treatment may be required. Second, if the outcome required for
approval in a clinical trial is defined as event-free survival or progression-free
survival, additional data collection to provide estimates for overall survival may be
required in the post-approval setting. Lastly, rare safety outcomes are difficult to
estimate in pre-approval trials, and for treatments for chronic diseases, large obser-
vational studies may be required to further quantify known safety events and to
identify new safety events.
Some high-profile drug withdrawals, where risks outweighed the benefits to
patients in the post-approval setting, have provided additional motivation in support
of required post-approval safety studies. The following are two high-profile drug
withdrawals. In 2004, the post-approval trials for rofecoxib (Vioxx) identified an
increased risk of heart attack and stroke, which led to its removal from the US
market (Krumholtz 2007; Prakash and Valentine 2007). In 2005, the occurrence of
progressive multifocal leukoencephalopathy (PML) led to the issuance of a Food and
Drug Administration (FDA) Drug Safety Communication and voluntary withdrawal
from the market for natalizumab (Tysabri) (FDA 2018; Kappos et al. 2011). In the
case of natalizumab, the drug was reintroduced to the market when its benefit-risk
profile for the difficult-to-treat relapsing-remitting form of multiple sclerosis was
maintained by minimizing the risk through patient selection, detailed safety
monitoring recommendations for the early detection of PML, and recommendations
for the management of PML. In addition, post-approval studies were conducted.
In this chapter, we will focus on the post-approval regulatory requirements in the
USA and the European Union (EU). For terminology, in the USA, the term post
marketing is used for drugs and post approval is used for medical devices, while in
Europe, the term post authorization is used for both drugs and devices. In this
chapter, we will use the term post approval to refer to post marketing and post
authorization for drugs and devices. Regulatory agencies can require additional
studies, or they may enter an agreement with a sponsor or market authorization
holder (MAH) for additional studies that are deemed voluntary, and therefore not
required. The history of required and voluntary post-approval studies can be traced
through changes in regulations. This chapter will not address the conduct of clinical
trials for new indications of approved drugs, as those trials while conducted in the
post-approval setting are required to follow the same regulations for approval of a
new drug.

History of US and EU Regulations

History of Post-Approval Studies in the USA

In the USA, the 1997 FDA Modernization Act introduced requirements for the
FDA regarding post-approval studies, referred to in US regulations as post-
marketing studies or post-marketing requirements (PMRs). In 1999, the FDA
published the rule regarding post-marketing commitment (PMC), which was defined
as studies, including clinical trials, conducted by an applicant after FDA has approved
a drug for marketing or licensing that were intended to further refine the safety,
efficacy, or optimal use of a product or to ensure consistency and reliability of product
quality. In 2006, as a complement to the final rule from 1999, the FDA issued a
guidance for industry on PMCs. In 2007, the FDA Amendments Act (FDAAA), which
clarified reasons for post-marketing studies, was signed into law by the US president.
FDAAA included a new provision that gave the FDA the authority to require a risk
evaluation and mitigation strategy (REMS), in addition to PMRs and PMCs. In 2011,
a new guidance for industry on post-marketing studies and clinical trials was
released, implementing Section 505(o)(3) of the Federal Food, Drug, and Cosmetic
Act, which stated that the FDA can require post-approval clinical trials and studies
(FDA 2011). In this guidance, clinical trials were defined as any prospective
investigation in which the applicant or investigator determines the method of
assigning the drug product or other interventions to one or more human subjects,
and studies were defined as all other investigations.
In the USA, with the 2007 FDAAA, there was a change in the rationale for
requesting a post-approval study from a sponsor. Before 2007, the following three
reasons were used when requiring post-marketing studies:

• Post-marketing studies or clinical trials to demonstrate clinical benefit for drugs
approved under the Accelerated Approval requirements in 21 Code of Federal
Regulations (CFR) 314.510 and 21 CFR 601.41
• Deferred pediatric studies (21 CFR 314.55(b) and 601.27(b)), where studies are
required under the Pediatric Research Equity Act (PREA)
• Studies or clinical trials to demonstrate safety and efficacy in humans that must be
conducted at the time of use of products approved under the Animal Rule (21
CFR 314.610(b)(1) and 601.91(b)(1))

After FDAAA in 2007, the reasons for a post-marketing study were broadened to:

• Assess a known serious risk related to the use of the drug
• Assess signals of serious risk related to the use of the drug
• Identify an unexpected serious risk when available data indicate the potential for a
serious risk

There are four mechanisms that provide the FDA with the authority to require
PMRs: Accelerated Approval, the Animal Rule, the Pediatric Research Equity Act,
and FDAAA. These authorities are described in Table 1.

Table 1 Post-marketing requirement authorities of the US Food and Drug Administration.
(Adapted from Wallach et al. 2018)

Accelerated Approval pathway (implemented 1992)
  Purpose: To expedite the approval of novel drugs that treat serious diseases and
  fill unmet medical needs, on the basis of surrogate or intermediate endpoints
  "reasonably likely" to predict clinical benefit
  Requirement: FDA has the authority to require post-market studies or clinical
  trials to confirm efficacy

Animal Rule (implemented 2002)
  Purpose: To allow for the approval of novel drugs when human efficacy studies and
  field trials are not ethical and feasible
  Requirement: When feasible and ethical, FDA can require post-market studies in
  humans

Pediatric Research Equity Act (PREA) (implemented 2003)
  Purpose: To provide pediatric use information in drug product labeling for drugs
  and biological products developed for indications that occur in both adult and
  pediatric populations. FDA can approve novel drugs for use in adults without
  corresponding studies for the same indication in the relevant pediatric population
  Requirement: FDA can include deferred pediatric studies or clinical trials as
  post-marketing requirements

Food and Drug Administration Amendments Act (FDAAA), Section 505(o)(3)
(implemented 2007; effective March 2008)
  Purpose: To provide additional information for novel treatments approved under
  Section 505 of FDAAA or Section 351 of the Public Health Services Act
  Requirement: FDA can require post-market studies that assess known serious risks,
  signs of serious risks, or unexpected serious risks related to the use of a novel
  drug

History and Legal Framework of Post-Approval Studies in Europe

The European Medicines Agency (EMA), the EU Member States, and the European
Commission are responsible for implementing and operating the legislation that
deals with post-approval studies, referred to in EMA legislation as post-authorization
studies, including pharmacovigilance studies. Pharmacovigilance studies are
research studies with the objective of studying drug safety. The EMA plays a key
role in coordinating activities relating to post-approval studies by working with a
wide range of stakeholders including the European Commission, pharmaceutical
companies, national medicines regulatory authorities, patients, and healthcare
professionals to ensure effective implementation and operation of the
pharmacovigilance legislation, which includes post-authorization safety studies
(PASS). Post-authorization efficacy studies (PAES) are another type of study
conducted in the post-approval setting. However, PAES are not part of the
pharmacovigilance legislation.

Post-Authorization Safety Study (PASS)


The 2010 European Pharmacovigilance Legislation created the legal framework for
post-authorization studies (PAS), including PASS. This was the biggest change in
EU regulations since 1995 and was implemented in 2012 (EMA 2012).
Based on Directive 2001/83/EC and Regulation (EC) No 726/2004, the EMA
may require PAS (Goedecke 2017). The pharmacovigilance legislation includes
directives and legislations that can be found at the EMA website (EMA 2012).
The following text on the rationale for the directive and regulation is directly from
the EMA website:

The development of the pharmacovigilance legislation was based on the observation
that adverse drug reactions (ADRs), 'noxious and unintended' responses to a
medicine, caused around 197,000 deaths per year in the EU.
Because of this, in 2005 the European Commission began a review of the European
system of safety monitoring including sponsoring an independent study, as well
as extensive public consultation through 2006 and 2007.
This process resulted in the adoption of a Directive and Regulation by the European
Parliament and Council of Ministers in December 2010, bringing about signifi-
cant changes in the safety monitoring of medicines across the EU.

Per the directive and regulation implemented in 2012 and the EMA website
(EMA 2020a), a PASS is a study that is carried out after a drug has been authorized.
The purpose of the PASS is to evaluate the safety and benefit-risk profile of a drug
and support regulatory decision-making. A PASS aims to (1) identify, characterize,
or quantify a safety hazard; (2) confirm the safety profile of a drug; or (3) measure the
effectiveness of risk management measures. Risk management measures are activ-
ities carried out by the sponsor to assess the risks associated with drugs. Risk
management measures are tracked in risk management plans (RMP). Sponsors are
required to submit an RMP to the EMA when applying for a marketing authorization.
A PAS design is either a clinical trial or an observational study. A PAS is either imposed or
voluntary. The EMA’s Pharmacovigilance Risk Assessment Committee (PRAC) is
responsible for assessing the protocols of imposed PASS and for assessing their
results. A voluntary PASS is conducted by sponsors on their own initiative. Non-
imposed PAS that are requested by the EMA in RMPs are deemed voluntary PASS.
An RMP includes activities agreed upon by EMA and sponsors to continually study
the risks of a drug.
EMA has published guidance on the format and content of study protocols and
final study reports for non-interventional studies, together with the PRAC assessment
report templates. The guidance is based on Commission Implementing Regulation
No 520/2012 of 19 June 2012, which was implemented in January 2013. For
clinical trials, sponsors should follow the instructions in volume 10 of the rules
38 Post-Approval Regulatory Requirements 705

governing medicinal products in the European Union (EU). Further guidance for
PASS is available in the following document: Guideline on good pharmacovigilance
practices: Module VIII – Post-authorisation safety studies (EMA 2017).

Post-Authorization Efficacy Studies (PAES)


In the EU, similar to PASS, PAES may be voluntary or imposed. However, the
legislation behind the PAES is not the same as the pharmacovigilance regulations
(EMA 2016). A PAES can be imposed by a competent authority, either centrally or
nationally. One way that a PAES can be imposed is within the scope of Delegated
Regulation (EU) No 357/2014, which states:

• At the time of granting the initial marketing authorization (MA) where concerns
relating to some aspects of the efficacy of the medicinal product are identified and
can be resolved only after the medicinal product has been marketed [Art 9(4)(cc)
of REG/Art 21a(f) of DIR]
• After granting of a MA where the understanding of the disease or the clinical
methodology or the use of the medicinal product under real-life conditions
indicates that previous efficacy evaluations might have to be revised significantly
[Art 10a(1)(b) of REG/Art 22a(1)(b) of DIR]

Also, PAES can be imposed outside of the scope of Delegated Regulation (EU)
No 357/2014. PAES may be imposed in the following specific situations:

• A conditional MA granted in accordance with Article 14(7) of Regulation (EC)
No 726/2004
• A MA granted in exceptional circumstances and subject to certain conditions in
accordance with Article 14(8) of Regulation (EC) No 726/2004 or Article 22 of
Directive 2001/83/EC
• A MA granted to an advanced therapy medicinal product in accordance with
Article 14 of Regulation (EC) No 1394/2007
• The pediatric use of a medicinal product in accordance with Article 34(2) of
Regulation (EC) No 1901/2006
• A referral procedure such as initiated in accordance with Articles 31 or 107i of
Directive 2001/83/EC or Article 20 of Regulation (EC) No 726/2004

The recommended study designs for PAES include randomized and non-randomized
designs. Consideration for clinical trial and observational study methodologies
is described in the PAES guidance document (EMA 2016) as follows:

Clinical trial design options for the design of PAES could include explanatory and
pragmatic trials. Explanatory trials generally measure the benefit of a treatment
under ideal conditions to establish whether the treatment works. Pragmatic trials
examine interventions under circumstances that approach real-world practice,
with more heterogeneous patient populations, possibly less-standardized treatment
protocols and delivery in routine clinical settings as opposed to a research
environment. Minimal or no restrictions may be placed on modifying dose,
dosing regimens, co-therapies or comorbidities, or treatment switching.
Non-randomized (for treatment) studies may be considered for investigating post-
authorization benefits where one or more of the following situations apply:
randomization is unethical or unfeasible, outcomes are infrequent, the generaliz-
ability of randomized trials is particularly limited, outcomes are highly predict-
able, or effect sizes are very large. Observational PAES may additionally be
useful to identify effect modifiers, namely factors that result in important differ-
ences in the level of efficacy of the drug between patients within the authorized
indication and which may not have been detectable in the pivotal trials conducted
prior to authorization.

Post-Approval Terminology and Definitions

Terminology used by the FDA and EMA is described in Tables 2 and 3 and includes
term, definition, terminology usage, examples, timing, reporting, and registration.
Briefly, the FDA uses the term risk evaluation and mitigation strategy (REMS) to
track post-approval safety studies that can be either post-marketing requirements
(PMRs) or post-marketing commitments (PMCs). However, PMRs and PMCs can be
conducted outside of REMS. Timing and reporting are described in the tables. The
EMA uses post-authorization measures (PAMs) to track post-authorization safety
studies (PASS) and post-authorization efficacy studies (PAES) (Tables 2 and 3).

Post-Approval Study Designs

Post-approval studies are either clinical trials (interventional) or observational
(non-interventional) studies. Choosing a study design may be influenced by the
strengths and weaknesses of the design options. Table 4 below provides points to
consider for each study design: strengths, weaknesses, and usefulness in the
post-approval setting.

Clinical Trials

Providing results from clinical trials in the post-approval setting can be necessary
under many conditions: for example, when confirmatory efficacy findings are
required, when patients with unique characteristics or with new indications are
studied, or when efficacy measures have changed significantly during the conduct
of the registrational trials.
Clinical trial designs may include pragmatic trials, as well as synthetic trials.
Large pragmatic trials are trials where simple designs are used to study large
numbers of patients with high external validity (Patsopoulos 2011). Synthetic trials
are clinical trials that use real-world data or pooled clinical trial data to recreate
clinical trial arms with the intent to provide comparative analyses within the data
source or as comparators to external clinical trials. Synthetic trials provide
comparative effectiveness results by analyzing existing data sources without
collecting new information (Berry et al. 2017; Zauderer et al. 2019).
Table 2 Terminology for post-approval studies for the USA

Risk evaluation and mitigation strategy (REMS)
  Definition: A required risk management strategy that can include one or more
  elements to ensure that the benefits of a drug outweigh its risks
  Terminology usage: Describes the elements required in the risk management strategy
  Examples/components: Elements of REMS: medication guide; patient package insert;
  communication plan; elements to assure safe use (ETASU); implementation system
  Timing: Before or at approval, or after approval if FDA becomes aware of new safety
  information
  Reporting: Must include PMR and PMC updates

Post-marketing requirement (PMR)
  Definition: Clinical trials and studies required for any or all of three purposes:
  to assess a known serious risk related to the use of the drug; to assess signals of
  serious risk related to the use of the drug; to identify an unexpected serious risk
  when available data indicates the potential for serious risk
  Terminology usage: Describes all required post-marketing studies or clinical
  trials, including those required by four authorities: FDAAA, the Pediatric Research
  Equity Act, Accelerated Approval, and the Animal Rule
  Examples/components: Observational pharmacoepidemiologic studies; meta-analyses;
  clinical trials with safety endpoint evaluated; safety studies in animals; in vitro
  laboratory safety studies; pharmacokinetic studies or clinical trials; studies or
  clinical trials to evaluate drug interactions or bioavailability
  Timing: At time of approval, or after approval if FDA becomes aware of new safety
  information
  Reporting: Annual to FDA
  Registration: Voluntary registration at ClinicalTrials.gov for clinical trials and
  studies

Post-marketing commitment (PMC)
  Definition: Studies (including clinical trials), conducted by an applicant after
  FDA has approved a drug for marketing or licensing, that were intended to further
  refine the safety, efficacy, or optimal use of a product or to ensure consistency
  and reliability of product quality
  Terminology usage: Describes studies and clinical trials that applicants have
  agreed to conduct, but that will generally not be considered as meeting a statutory
  purpose and so will not be required
  Examples/components: Drug and biologic quality studies; pharmacoepidemiologic
  studies on natural history of disease or background rates for adverse events in a
  population not treated with the drug; studies and clinical trials for non-serious
  risk or safety signals; clinical trials with primary endpoint related to further
  defining efficacy
  Timing: At time of approval
  Reporting: Annual to FDA
  Registration: Voluntary registration at ClinicalTrials.gov for clinical trials and
  studies

Reference: FDA REMS Overview_121110-cln.pdf from FDA website
https://www.fda.gov/aboutfda/transparency/basics/ucm325201.htm
Table 3 Terminology for post-approval studies for the European Union

Post-authorization measures (PAM)
  Definition: Additional data post authorization, as it is necessary from a public
  health perspective to complement the available data with additional data about the
  safety and, in certain cases, the efficacy or quality of authorized medicinal
  products
  Terminology usage: PAMs fall within one of the following categories [EMA codes]:
  specific obligation [SOB]; Annex II condition [ANX]; additional pharmacovigilance
  activity in the risk management plan (RMP) [MEA] (e.g., interim results of
  imposed/non-imposed interventional/non-interventional clinical or nonclinical
  studies); legally binding measure [LEG] (e.g., cumulative review following a
  request originating from a PSUR or a signal evaluation [SDA], Corrective
  Action/Preventive Action (CAPA), pediatric [P46] submissions, MAH's justification
  for not submitting a requested variation); recommendation [REC] (e.g., quality
  improvement)

Post-authorization safety study (PASS)
  Definition: Any study relating to an authorized medicinal product conducted with
  the aim of identifying, characterizing, or quantifying a safety hazard, of
  confirming the safety profile of the medicinal product, or of measuring the
  effectiveness of risk management measures
  Terminology usage: Includes clinical trials and non-interventional studies; may be
  imposed (required) or voluntary (not required)
  Examples/components: PASS categories in the RMP for clinical trials or
  non-interventional studies: Category 1, imposed PASS; Category 2, specific
  obligation; Category 3, required as part of the RMP (categorization is from GVP
  V.B.6.3)
  Timing: As a condition of granting marketing authorization, or after granting if
  there are concerns about risks of the authorized medicinal product
  Reporting: Imposed, non-interventional PASS reporting to PRAC within 12 months of
  end of data collection; abstract of results posted to the EU PAS Register
  (www.encepp.eu)
  Registration: Clinical trials must be registered at the European Union Clinical
  Trials Portal and Database (www.clinicaltrialsregister.eu); non-interventional
  studies must be registered at the European Union Post-Authorization Study (EU PAS)
  Register (www.encepp.eu)

Post-authorization efficacy study (PAES)
  Definition: Post-authorization efficacy studies (PAES) of medicinal products are
  studies conducted within the authorized therapeutic indication to complement
  available efficacy data in the light of well-reasoned scientific uncertainties on
  aspects of the evidence of benefits that should be, or can only be, addressed
  post-authorization (not a legal definition; a working definition from
  EMA/PDCO/CAT/CMDh/PRAC/CHMP/261500/2015,
  Draft-scientific-guidance-post-authorisation-efficacy-studies-first-version_en.pdf)
  Terminology usage: Clinical trials and non-interventional studies
  Timing: From 2014 to 2017, all PAES have been imposed for conditional marketing
  authorization or marketing authorization under exceptional circumstance and other
  conditions
  Registration: Clinical trials must be registered at the European Union Clinical
  Trials Portal and Database

Table 4 Strengths and weaknesses of common post-approval study designs

Clinical trial: randomized
  Strengths: Specify safety and/or efficacy outcomes; specify exposure; for
  comparative trials, use randomization and masking to reduce bias
  Weaknesses: Operationally complex; size of study has limitations, and follow-up
  may result in long duration to answer question; limited generalizability
  Usefulness in post-approval setting: Answer specific safety/efficacy question
  identified in pre-approval setting

Clinical trial: pragmatic
  Strengths: Minimal criteria for exposure to mimic real-world use of approved
  product; simplified protocol compared to pre-approval trial protocols
  Weaknesses: Size of study can be large to compensate for diverse patient exposure
  data
  Usefulness: Expected wide use of drug quickly to foster enrollment and faster
  study conduct

Clinical trial: synthetic
  Strengths: Data are already collected; comparative safety and efficacy possible
  Weaknesses: Not yet accepted by regulatory agencies
  Usefulness: Conducted for exploratory or confirmative results

Observational: prospective cohort study
  Strengths: Identify incidence of safety event; accurate measurement of exposure
  Weaknesses: If outcome is rare, large sample size and long duration may be
  necessary; size and duration of study may drive up cost
  Usefulness: Randomization is not ethical; studying exposed patients only, with no
  need for a control group

Observational: retrospective cohort study
  Strengths: Data are collected and available from administrative healthcare data or
  patient registries; identify incidence of safety event
  Weaknesses: Relies on previous documentation of exposures and outcomes; delay from
  approval until the data are available for retrospective review
  Usefulness: When exposure period is short and data will accumulate quickly for
  timely analysis

There are operational concerns when conducting trials in the post-approval
setting including the need to maintain equipoise. In the accelerated/expedited
approval setting, ongoing trials are typically listed as required post-approval trials
so that the trials continue to completion. This is especially important if approvals are
granted on surrogate outcomes. The ongoing trials will often provide hard outcomes,
such as overall survival, as compared to surrogate outcomes such as disease progression.
Clinical trials in special populations, such as pediatric patients, may be conducted
in the post-approval setting as part of a pediatric investigation plan. For diseases that
occur rarely in pediatric populations, these trials are not typically completed at the
time of filing for regulatory approval and thus continue in the post-approval setting.

Observational Studies

Observational studies complement interventional studies and are particularly suited
for studies where randomization is not ethical, for studying broader populations,
specifically populations not well represented in clinical trials, and for understanding
actual results (e.g., safety and effectiveness) in real-world practice.
Pharmacoepidemiologic safety studies are designed to assess the risk associated with
drug exposure and to test prespecified hypotheses (ISPE 2015; Berger et al. 2012).
Guidance for the
design and conduct of post-approval safety studies in the USA and EU has been
developed. Table 5 provides some suggested sources for these guidance documents
and guidelines for pharmacoepidemiologic research (Berger et al. 2012, 2017;
Dreyer et al. 2010; ENCePP 2010, 2017; FDA 2013; ISPE 2015).

Table 5 Sources for guidance for observational studies for drug safety research
Guidance for safety studies
European Network of Centers for Pharmacoepidemiology and Pharmacovigilance (ENCePP),
Guidelines on Good Pharmacovigilance (GVP) – Module VIII – Post-authorization safety studies
(Revision 3). 2017. Online: https://www.ema.europa.eu/en/documents/scientific-guideline/
guideline-good-pharmacovigilance-practices-gvp-module-viii-post-authorisation-safety-studies-
rev-3_en.pdf. Accessed 12 Jun 2020.
FDA. Best Practices for Conducting and Reporting Pharmacoepidemiology Safety Studies
Using Electronic Healthcare Data Sets. May 2013. Online: https://www.fda.gov/regulatory-
information/search-fda-guidance-documents/best-practices-conducting-and-reporting-
pharmacoepidemiologic-safety-studies-using-electronic. Accessed 12 Jun 2020.
Guidelines for pharmacoepidemiologic studies
ENCePP, Guidelines on Methodological Standards in Pharmacoepidemiology (Revision 7),
2010. Online: http://www.encepp.eu/standards_and_guidances/documents/
ENCePPGuideonMethStandardsinPE_Rev7.pdf, Accessed 12 Jun 2020.
International Society for Pharmacoepidemiology (ISPE), Guidelines for Good
Pharmacoepidemiology Practice (Revision 3), 2015. Online: https://www.pharmacoepi.org/
resources/policies/guidelines-08027/#1. Accessed 12 Jun 2020.
Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/
or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force
on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 2017;26
(9):1033–1039.
Berger ML, Dreyer N, Anderson F, Towse A, Sedrakyan A, Normand SL. Prospective
observational studies to assess comparative effectiveness: the ISPOR good research practices task
force report. Value Health. 2012;15(2):217–230.
Dreyer NA, Schneeweiss S, McNeil B, et al.; on behalf of the GRACE Initiative. GRACE
principles: recognizing high-quality observational studies of comparative effectiveness. Am J
Manag Care 2010;16:467–71.

The design of the observational study should be guided by the research question,
the availability of data to answer the question, and an understanding of the limita-
tions of the study to be conducted. This section will describe study designs com-
monly used for post-authorization studies but is not intended to replace textbooks on
epidemiologic study design.
A thorough study protocol provides the parameters of the observational study,
including the following sections that define the design and conduct of the study
(FDA 2013; ENCePP 2010; ISPE 2015; Dreyer et al. 2010).

• Research Question and Objectives – this section defines both the issue that
leads to the study and the specified hypotheses or outcomes that will be
measured. This should include both primary and secondary objectives.
• Study Design – this section provides the overall research design (e.g., cohort,
case-control) that will be used to answer the research question.
• Study Population – this section describes the source population and the com-
parison groups. In defining the source population, the protocol will outline the
inclusion and exclusion criteria for the study population. For example, the study
population may restrict to patients with a specific diagnosis likely to receive the
drug or may include a subset of this population (e.g., pregnant women and their
infants). Within the source population, the comparison groups in post-approval
studies are defined by either exposure to a specific drug(s) of interest or by the
presence of specific safety outcome of interest.
• Data Source – this section describes the data used to assess the research question.
Data sources include primary data collection, patient-based or exposure-based
registries, and secondary data sources (e.g., administrative claims data and
electronic health records).
• Data Collection and Covariates – these sections describe both how drug
exposures of interest and safety outcomes are operationally defined and mea-
sured, as well as how other risk factors, comorbidity, co-medication, potential
confounders, and effect modifying variables are defined and measured.
• Analysis Plan – this section describes both how confounding or other biases
are assessed and/or controlled and the statistical methods used to describe and
compare the comparison groups, including the occurrence of the outcome (e.g.,
incidence) and measures of association (e.g., relative risk, odds ratios, mean
differences) with their confidence intervals. It may describe additional planned
analyses such as intention-to-treat analysis, as-treated analysis, subgroup analyses,
sensitivity analyses, and meta-analytic techniques to combine findings across
data sources.
• Limitations of Research Methods – this section describes any potential limita-
tions of study design, data sources, and analytic methods, including issues of
confounding, bias, generalizability, and random error, as well as efforts made to
reduce limitations with the proposed research plan.

Changes to the study protocol should be documented in the final report, and the impact
of those changes on the interpretation of the study should be discussed (FDA 2013).

Observational Study Data Sources


The choice and effectiveness of treatment may be affected by practice setting
(academic vs. community hospital), healthcare environment (e.g., commercial
insurer, single-payer system, or fee for service), and the experience of healthcare
providers (specialist vs. general practice), as well as by the availability of the
patient's medical history. Therefore, careful consideration must be made when
choosing the appropriate data source.

Sources with Primary Data Collection


Post-approval observational studies can recruit patients by specific disease or by
specific drug exposures, follow these patients over time, and collect pertinent data,
prospectively, to assess the safety and effectiveness of a medical product of interest.
Primary data collection may include patient-reported data, physician-reported data,
and data abstracted from the patients’ medical records. In addition, data can be
collected to supplement patient data obtained in a secondary data source. The types
of data not well documented in secondary data sources are duration of disease,
behavioral factors (e.g., smoking, alcohol consumption, exercise), drug adherence,
and patient quality of life. These data can be collected directly from the patient via
survey- or interview-based data collection. Medical record abstraction may be used to
collect additional patient information (e.g., potential confounding factors) or the
reason for discontinuation of a drug, and can be used to collect data to validate a
diagnosis identified in secondary data (ENCePP 2010).
Registries, specifically patient disease registries and pregnancy registries, are
examples where patients are recruited, and data is collected prospectively during
follow-up (Blumenthal 2017; Gliklich et al. 2014). Patient disease registries collect
clinical data related to the disease onset, progression, and treatment course. At
registry enrollment patient details are collected, including demographics, lifestyle
and behavioral characteristics, patients’ medical history, past and current drug
utilization, and comorbidities. At regular intervals, prospectively, these data are
updated. Important to post-approval study data is the identification of an initiation
of new medications, or the onset of new symptoms or disease diagnosis. Patient
disease registries are common for difficult-to-treat chronic diseases (e.g., multiple
sclerosis and rheumatoid arthritis) and rare diseases (e.g., cystic fibrosis and Pompe
disease (lysosomal storage disorder)) (NIH 2019). Patient registries are a useful
source of patients for recruitment into clinical trials, as well as to identify patients for
observational studies.
Because clinical trials conducted to assess the safety and efficacy of a new
medical product often do not include pregnant women, a post-approval requirement
may include the establishment of a pregnancy exposure registry. A pregnancy
exposure registry enrolls women who have taken a drug of interest when they are
pregnant. The registry will prospectively collect information on maternal events and
delivery outcomes (e.g., spontaneous abortion) through the end of pregnancy and
will collect data on infant events (e.g., infant death or major congenital
malformations) usually in the first year of life (FDA 2020a).

One strength of primary data collection, such as in registries, is that patients are
well characterized over time, and needed information on patient exposure, outcomes,
and potentially confounding factors can be captured. A second strength for registries
is the ability to collect information on patient using survey tools and validated
instruments. Patient-reported health data may include quality of life, symptoms
(e.g., pain scores or fatigue), use of over-the-counter medications, patient prefer-
ences, behavioral data, family history, and biological specimens. Yet, there are
limitations; following many patients for a long time and the collection of prospective
data are both time-consuming and very expensive. As in clinical trials, protocol-
specified inclusion and exclusion criteria help to limit systematic selection bias in
registries, and the use of validated and standardized assessments can reduce mis-
classification of disease and outcomes; yet while patient-reported events are essential
to registry data, these data are subjective and certain forms of bias, such as recall
bias, may influence the data.

Sources with Secondary Data Collection


The use of large, longitudinal, healthcare databases, such as electronic health record
(EHR) databases, or health administrative claims databases has improved the efficien-
cies of observational studies. However, observational studies conducted using these
sources must be performed within the limitations of data. A primary limitation of both
data sources is that EHR and claims data are gathered for the purpose of patient care
and billing, respectively, and not for research. Therefore, the fitness of the database to
define exposure, outcomes, and potential confounding variables needs to be considered
carefully. Hall et al. published a guidance and a checklist for selection of an appropriate
healthcare database for observational studies (Hall et al. 2012).

Electronic Health Records


An electronic health record (EHR) is a digital version of an individual patient's medical
chart, captured in real time, and is intended to provide a broad view of a patient’s
characteristics, medical history and comorbidity, drug utilization and treatment
history, and the documentation of new onset of diagnoses. When using different
EHR data sources, investigators must understand if the patient records include the
entire record of patient care, or just a portion of it. For example, patients receiving
treatment from multiple physicians, offices, or hospitals might have their care data
captured in several different medical record data sources. Therefore, investigators
using an EHR data source should describe the steps taken to ensure complete capture
of patient care over time to facilitate the likelihood that all exposures and safety
outcomes of interest will be captured. An example of an EHR source in Europe is the
Clinical Practice Research Datalink (CPRD) that includes all people attending
general practitioners in the UK. Other databases that link both electronic medical
records with administrative health data are found in Italy (regional), the Netherlands
(national), and other Nordic countries (e.g., Sweden (national healthcare database
and disease registries) and Denmark (national)). Canada has a national EHR system,
and data from three provinces are available for pharmacoepidemiologic research.
In the USA, EHR data are available from large healthcare networks
with a shared electronic medical record system (e.g., Kaiser Permanente) and EHR
systems (e.g., Optum EHR Research Database, Flatiron) that integrate electronic
medical records across different systems into a common data platform, inclusive of
data from a generalizable national sample of private practices, hospitals, and inte-
grated health networks.

Health Administrative Claims Data


Health administrative claims data are a comprehensive source of longitudinal med-
ical care data, including patient enrollment data (demographics), outpatient and
inpatient medical claims (diagnoses, procedures, and laboratory claims), pharmacy
claims (medications dispensed), and associated costs. These data are captured by
health insurers, across all healthcare providers caring for the patient, for the purpose of
billing. Investigators using administrative claims data sources should address con-
tinuity of coverage (enrollment and disenrollment), particularly for claims data
sources in the USA, because patients often enroll and disenroll in different health
plans in relation to changes in employment or other life circumstances. Such
documentation allows only periods of enrollment during which data are available
on the patients of interest to be included in the study, and periods of disenrollment
when data are not available on patients can be appropriately excluded. Definitions of
enrollment or continuous coverage should be developed and documented, particu-
larly in studies using more than one data source. While generally used as de-
identified data (patient private information is redacted), in instances and with the
right approvals, claims data can be linked to other sources, including EHR systems,
registry data (e.g., cancer registries), and vital records (e.g., National Death Index)
that improve the capture of long-term outcomes such as cancer and death.
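
As a concrete illustration of the continuous-coverage logic described above, the
following Python sketch (not from the chapter) collapses a patient's enrollment
segments into continuous coverage periods, bridging short administrative gaps. The
(start, end) segment layout and the 30-day gap threshold are assumptions chosen for
this example, not a prescribed standard.

from datetime import date

def continuous_coverage(segments, max_gap_days=30):
    # Merge sorted (start, end) enrollment segments into continuous periods,
    # ignoring gaps of at most max_gap_days (a common administrative allowance).
    segments = sorted(segments)
    periods = [list(segments[0])]
    for start, end in segments[1:]:
        if (start - periods[-1][1]).days <= max_gap_days:
            periods[-1][1] = max(periods[-1][1], end)  # bridge the short gap
        else:
            periods.append([start, end])  # true disenrollment: new period
    return [tuple(p) for p in periods]

segments = [
    (date(2016, 1, 1), date(2016, 6, 30)),
    (date(2016, 7, 15), date(2017, 3, 31)),  # 15-day gap: bridged
    (date(2018, 1, 1), date(2018, 12, 31)),  # 9-month gap: new coverage period
]
print(continuous_coverage(segments))

Only person-time falling inside such periods would then contribute to a study, so
that exposures and outcomes occurring during disenrollment are not silently missed.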
In Europe and Canada, administrative data is captured as part of national and
regional healthcare data sources, as described above. Examples of US sources of
administrative claims data include Medicare, Medicaid, Veterans Administration
System, as well as data available from commercial insurers (e.g., UnitedHealthcare,
HealthCore, and MarketScan). Certain differences affect whether a non-US data
source can be used to address specific drug safety hypotheses in a way that is
relevant to the US population. Various factors in non-US healthcare systems, such
as medication tiering (e.g., first-line, second-line) and patient coverage selection,
influence the degree to which patients on a given therapy in other countries might
differ in disease severity from patients on the same therapy.

Other Data Sources


Other secondary sources of health data that are useful tools in safety assessments are
pharmacy claims data, national vital status records (e.g., the National Death Index, or
Birth Registry), cancer registry data, and telehealth data.

Observational Study Designs

Cohort Studies
The efficiencies gained by using large healthcare databases make cohort studies a
viable alternative to clinical trials for large comparative effectiveness and safety
studies. Cohort studies identify a population at risk and an exposure to medical
products of interest and follow patients over time for the occurrence of events. In
cohort studies, the comparison cohorts are selected from the same population at risk
yet are unexposed at time of enrollment into the cohort and are similarly followed
over time for the occurrence of events. Cohort studies provide the opportunity to
determine the incidence rate of adverse events in addition to the relative risk of an
adverse event. They are useful for identification of multiple events in the same study.
In addition, cohort studies are useful for examining safety concerns in special
populations, such as children, the elderly, pregnant women, or patients with
comorbid conditions that are often underrepresented in clinical trials (EMA 2017;
FDA 2013).
Prospective and retrospective cohort studies serve different purposes in the post-
approval setting. Prospective cohort studies are used for safety studies early after
approval and release of a drug into the market. Retrospective cohort studies are
conducted in secondary data sources when a drug has been on the market for some
time and there is a new safety concern. See Table 4 for the strengths and weaknesses
of prospective and retrospective cohort study designs.
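
To make the basic cohort quantities above concrete, the following Python sketch
computes incidence rates, the rate ratio, and an approximate 95% confidence interval
from hypothetical counts and person-time; all numbers are invented for illustration.

import math

exposed = {"events": 30, "person_years": 12_000}
unexposed = {"events": 25, "person_years": 20_000}

ir_exp = exposed["events"] / exposed["person_years"]        # 2.50 per 1,000 PY
ir_unexp = unexposed["events"] / unexposed["person_years"]  # 1.25 per 1,000 PY
rate_ratio = ir_exp / ir_unexp

# Approximate 95% CI for the rate ratio, computed on the log scale
se_log_rr = math.sqrt(1 / exposed["events"] + 1 / unexposed["events"])
lo = rate_ratio * math.exp(-1.96 * se_log_rr)
hi = rate_ratio * math.exp(1.96 * se_log_rr)

print(f"Rate ratio: {rate_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f})")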

Other Observational Designs


Other designs have been proposed to assess the associations between intermittent
exposure (e.g., vaccination) and short-term events. These designs include case-
control, self-controlled case-series, case-crossover, and case-time control studies.
In these designs, only cases are used, and the control information is obtained as
unexposed person-time experience of the cases themselves. One important strength
of these designs is that confounding variables that do not change over time within
individuals are automatically matched (EMA 2017).
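
A minimal sketch of the self-controlled idea, with invented numbers: each case
contributes both exposed and unexposed person-time, and the rate ratio compares
event rates within the same individuals, so stable personal characteristics cancel out.

# Aggregate events and person-time inside vs. outside the risk windows
events_exposed, years_exposed = 12, 50.0
events_unexposed, years_unexposed = 40, 400.0

rate_ratio = (events_exposed / years_exposed) / (events_unexposed / years_unexposed)
print(f"Within-person rate ratio: {rate_ratio:.2f}")  # 2.40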
Meta-analyses are common in observational research. These analyses involve
statistical techniques that integrate and summarize the results across several studies
with the same or similar research objectives and can extend the understanding of the
research question. They are important in identifying how both differences in research
design and data source affect results, as well as for obtaining an overall risk estimate
(Chou and Helfand 2005; ENCePP 2010).
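
As a sketch of one common pooling approach (inverse-variance weighting under a
fixed-effect model; the chapter does not prescribe a particular method), the
following Python example combines study-level relative risks. The three studies and
their confidence intervals are invented inputs.

import math

# (relative risk, 95% CI lower, 95% CI upper) for three hypothetical studies
studies = [(1.8, 1.1, 2.9), (1.4, 0.9, 2.2), (2.1, 1.2, 3.7)]

log_rrs, weights = [], []
for rr, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE recovered from the CI
    log_rrs.append(math.log(rr))
    weights.append(1 / se ** 2)  # inverse-variance weight

pooled_log = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(f"Pooled RR: {math.exp(pooled_log):.2f} "
      f"(95% CI {math.exp(pooled_log - 1.96 * pooled_se):.2f}-"
      f"{math.exp(pooled_log + 1.96 * pooled_se):.2f})")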

Bias in Observational Studies


Threats to validity of observational studies to assess drug safety include selection bias,
misclassification, immortal time bias, channeling, and confounding by indication
(Berger et al. 2017; FDA 2013). Selection bias occurs when there is selective recruit-
ment of study populations such that populations are not representative of the
populations you are trying to compare. Misclassification occurs when there is incorrect
information about either the exposure, outcome, or covariates that describe the under-
lying populations. Immortal time refers to a period of cohort follow-up time during
which an outcome of interest could not have occurred. Immortal time bias arises when
the period between cohort entry and date of first exposure to a drug, during which the
event of interest has not occurred, is either misclassified or simply excluded and not
accounted for in the analysis (Berger et al. 2017; Suissa 2007, 2008). Channeling and
confounding by indication occur when the estimate of the effect (exposure → outcome)
results from imbalance of determinants of disease (or their proxies) across compared
groups (FDA 2013). Channeling refers to the situation where drugs are prescribed to
patients differently based on the presence or absence of factors prognostic of patient
outcomes. Confounding by indication is a type of channeling bias that occurs when the
indication, which is associated with drug exposure, is an independent risk factor for the
outcome. Biases that threaten pharmacoepidemiologic safety studies conducted in
secondary data sources, and methods to handle these biases, need to be taken into
consideration when planning these studies.
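
The following Python sketch illustrates the immortal time problem with invented
dates: the interval between cohort entry and first exposure must be counted as
unexposed person-time (or handled with a time-varying exposure definition), not
credited to the exposed group.

from datetime import date

cohort_entry = date(2020, 1, 1)
first_exposure = date(2020, 7, 1)   # drug started 6 months after cohort entry
end_of_follow_up = date(2021, 1, 1)

# Biased accounting: the patient is labeled "exposed" from cohort entry, crediting
# the exposed group with 6 months during which the patient had to remain
# event-free in order to become exposed at all (the immortal time).
biased_exposed_days = (end_of_follow_up - cohort_entry).days   # 366

# Correct accounting: person-time before first exposure is unexposed.
unexposed_days = (first_exposure - cohort_entry).days          # 182
exposed_days = (end_of_follow_up - first_exposure).days        # 184
print(biased_exposed_days, unexposed_days, exposed_days)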
Study designs with new users, with active comparators, or that are matched by
disease risk score are methods that reduce these biases, in that a comparison is made
between patients with the same indication initiating different treatments (ENCePP
2010).
Study design choices that make the study groups more similar are important tools
for controlling for confounding and biases. The goal of the study design is to
facilitate comparisons of people with similar chance of benefiting from the treatment
or experiencing harm. There are a few epidemiologic and statistical methods used to
handle confounding in pharmacoepidemiologic studies (e.g., restriction, matching,
adjustment, and weighting). Methods, such as propensity score (PS) matching and
inverse probability treatment weighting (IPTW) using the PS, are two common ways
to reduce bias in comparative safety studies using real-world large secondary data
sources (Austin 2011; Austin and Stuart 2015; Rosenbaum and Rubin 2007/1983).
PS is defined as the conditional probability of being treated with the drug of interest,
given an observed set of pretreatment characteristics. PS is estimated using logistic
regression, where treatment group is the dependent variable. Potential independent
variables in the logistic regression will include a priori specified characteristics,
potential confounding factors, and effect modifying factors. Exposed and unexposed
treatment groups are matched based on the PS, using a greedy matching algorithm.
Treatment groups matched by PS should be well balanced with respect to known
(and possibly unknown) confounders; therefore, the outcomes observed across
treatment groups can be directly compared.
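
A minimal sketch of this propensity-score workflow on simulated data, shown here
with stabilized inverse probability of treatment weighting (one of the two
approaches named above). The variable names, effect sizes, and confounding
structure are assumptions invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
age = rng.normal(60, 10, n)
comorbidity = rng.binomial(1, 0.3, n)
X = np.column_stack([age, comorbidity])

# Treatment assignment depends on the covariates (confounding by indication)
p_treat = 1 / (1 + np.exp(-(-6 + 0.08 * age + 0.8 * comorbidity)))
treated = rng.binomial(1, p_treat)

# Propensity score: P(treated | covariates), estimated by logistic regression
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Stabilized IPTW weights
p_marginal = treated.mean()
weights = np.where(treated == 1, p_marginal / ps, (1 - p_marginal) / (1 - ps))

# Balance check: the weighted covariate means should be similar across groups
for grp in (1, 0):
    m = treated == grp
    print(f"group {grp}: weighted mean age = "
          f"{np.average(age[m], weights=weights[m]):.1f}")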

Enforcement of Post-Approval Studies by Regulatory Agencies

When they are imposed, post-approval studies are reviewed for compliance by the
regulatory agencies. For clinical trials that are ongoing at the time of approval, often
these are classified as PMC in the USA or PAM in the EU. Findings of these trials
can be submitted to the health authorities for addition to the prescribing information,
including expansion of the indications and/or update of the efficacy data. In
contrast, imposed observational studies for safety concerns can lead to changes in
prescribing information and add to the list of potential adverse effects.
EMA both track progress on PMC/PMRs and PAMs, respectively.
In the USA, PMC and PMR studies are registered at ClinicalTrials.gov, whether
they are clinical trials or observational studies. In Europe, the EMA publishes the
protocols, abstracts, and final study reports of PAS in the EU PAS register hosted on
the European Network of Centres in Pharmacoepidemiology and Pharmacovigilance
(ENCePP) website.

US PMC and PMR Enforcement

The FDA publishes an annual report that provides a summary of the progress of
PMC and PMR that were agreed upon at the time of medicinal product approval.
This annual report is required according to the FDA Modernization Act of 1997 and
is published to the Federal Register. The report includes data from the PMR/PMC
database maintained by the FDA (2020b). The PMR/PMC database is searchable
and available to the public at the FDA website https://www.accessdata.fda.gov/
scripts/cder/pmc/.
The most recent FDA annual report includes fiscal year 2018. PMC/PMRs are
categorized as pending, ongoing, delayed, terminated, submitted, fulfilled, and
released. In addition, PMRs/PMCs may be characterized as open or closed. Open
PMRs/PMCs comprise those that are pending, ongoing, delayed, submitted, or termi-
nated, whereas closed PMRs/PMCs are either fulfilled or released. Open PMRs are
described as on- or off-schedule. On-schedule PMRs/PMCs are those that are pending,
ongoing, or submitted. Off-schedule PMRs/PMCs are those that have missed one of
the milestone dates in the original schedule and are categorized as either delayed or
terminated.
The fiscal year 2018 annual report shows that 69% of PMR/PMC annual status
reports were received on time. For those that were open but not yet due, 79% of new
drug application and 86% of biologics license application PMRs were progressing
on schedule, and most open PMCs – 76% for new drug applications and 84% for
biologics license applications – were also on schedule (FDA 2019).

EU PAM Enforcement

The EMA assesses compliance with the specific obligations in PAMs through
analysis of its database, including due dates. The assessment is conducted annually,
at both the annual renewal (for conditional marketing authorizations) and the annual
reassessment (for marketing authorizations under exceptional circumstances) (EMA
2020b).
When issues of non-compliance with PAM are identified, the relevant EMA
committees can take one or more of the following actions:

• Letter to the MAH by the chair of the committee
• Oral explanation by the MAH to the committee
• Initiation of a referral procedure with a view to vary/suspend/revoke the Marketing
Authorization in light of Article 116 of Directive 2001/83/EC
• Inspection to be performed upon request of the committee(s)

Also, if the medicinal product has conditional approval, the marketing authori-
zation can be varied, suspended, or revoked.
Such regulatory action regarding non-compliance with a PAM may be made public
by the Agency on its website, e.g., in the European public assessment report
(EPAR) of the affected product.

Systematic Reviews of Post-Approval Studies in the USA and EU

Several analyses of post-approval studies conducted in the USA and EU have been
published. A summary of these analyses is provided below.

Reviews of Post-Approval Studies in the USA

In the USA, following accelerated or expedited approval, the FDA may require
additional confirmatory clinical trials. These required trials and their results have
been the subject of several systematic review studies (Beaver et al. 2018; Naci et al.
2017; Wallach et al. 2018).
Naci et al. studied characteristics of pre- and post-approval clinical trials reviewed
at the US FDA from 2009 to 2013 (Naci et al. 2017). They reported on trials for 22
drugs with 24 indications examined. Of these post-approval trials, 42% (10 of 24
indications studied) confirmed the efficacy of a previously analyzed surrogate
endpoint within 3 years of Accelerated Approval. Among the 58% of post-approval
trials that had not confirmed the indication at the time of review, half were still
ongoing, and the other half had been terminated, had failed to confirm results, or had
been delayed by more than a year. For two indications, the post-approval trial failed
to confirm clinical benefit, yet these findings did not result in reversal of the
approval, and no additional trials were imposed.
Wallach et al. studied post-approval studies required by the US FDA between
2009 and 2012, allowing for at least 4 years of follow-up (Wallach et al. 2018).
Among the 134 prospective cohort studies, registries, and clinical trials, 102 (76%)
were registered on ClinicalTrials.gov. There were 65 completed studies, and 47
(72%) of these had reported results in ClinicalTrials.gov or in a publication. How-
ever, most (32 of 47, 68%) did not report results in the timeframe stated in the post-
marketing requirement.
Beaver et al. reviewed the accelerated US FDA approvals of oncology and
malignant hematology medicinal products from 1992 to 2017 (Beaver et al. 2018).
They identified 93 products with Accelerated Approvals for new indications. Of
these, 51 (55%) completed their post-approval studies and confirmed benefit within
a median of 3.4 years, while 5 (5%) post-approval studies concerned indications that
were withdrawn from the market. The remainder have ongoing confirmatory studies.

Reviews of Post-Approval Studies in the EU

In the EU, several studies have been conducted to measure compliance with the EU
regulation and registration of PAS.

Blake et al. conducted the first review of the PAS register maintained by ENCePP
(Blake et al. 2011). This analysis included PAS required by the EMA between 2007 and
2009. As assessed in 2009, 60 PAS had been registered for 32 medicinal products:
52 had progressed to data collection, 7 were deemed no longer necessary by the
Committee for Medicinal Products for Human Use (CHMP), and one study was still
awaiting a final decision. Of the 47 studies being “carried out” at the time of publication,
14 were randomized controlled trials; the remainder were either non-controlled trials
or observational (non-interventional) studies.
Engel et al. specifically studied PASS protocols reviewed under the EU
pharmacovigilance legislation (Engel et al. 2017). During 2012 to 2015, PRAC
reviewed 189 PASS protocols, of which 58 (31%) were imposed and 131 were
voluntary but required in the RMP for the medicinal product. Of 57 studies with
protocols available in ENCePP, 67% used primary data collection and 33% used
secondary data; in addition, the authors report that 65% did not include a comparator
population. Only 2 of the 57 protocols explicitly stated that hypothesis-testing analyses
were planned, suggesting that very few PASS used a clinical trial design. The
authors did not report results on interventional vs. non-interventional study design.

Summary and Conclusions

Health authorities throughout the world have regulations for requesting additional
research in the post-approval setting. This chapter focuses on the regulations in the
USA and EU. The history of post-approval studies can be traced through changing
regulations enforced by the FDA and EMA.
Specific terminology for post-approval studies is used by the FDA and EMA.
Briefly, the FDA uses the term risk evaluation and mitigation strategy (REMS) to
track post-approval safety studies that can be either post-marketing requirements
(PMRs) or post-marketing commitments (PMCs). However, PMRs and PMCs can also be
conducted outside of a REMS. The EMA uses post-authorization measures (PAM) to
track post-authorization safety studies (PASS) and post-authorization efficacy stud-
ies (PAES).
Post-approval studies are either clinical trials (interventional) or observational
(non-interventional) studies. Choosing a study design may be influenced by the
strengths and weaknesses of the design options, as described in Table 4.
Imposed post-approval studies are reviewed for compliance by the regulatory
agencies. For clinical trials that are ongoing at the time of approval, often these are
classified as PMC in the USA or PAM in the EU. Findings of these trials can be
submitted to the health authorities for addition to the prescribing information. The
FDA and EMA both track progress on PMC/PMRs and PAMs, respectively.
In conclusion, post-approval studies are necessary to continually gather data on
the safety and effectiveness of approved drugs. These studies are regulated by health
authorities, included in registries (e.g., ClinicalTrials.gov, ENCePP), and tracked to
completion. Given the inclusion of post-approval studies in public databases, they
are the subject of systematic reviews.

Key Facts

• Post-approval studies can be required by regulatory agencies or voluntarily
conducted by sponsors and market authorization holders.
• Post-approval studies use clinical trial or observational methodologies.
• Post-approval studies that are post-authorization safety studies (PASS) are regis-
tered in the ENCePP PAS study registry.
• Required post-approval studies are tracked by regulatory authorities for comple-
tion and compliance with timelines.

Cross-References

▶ Introduction to Meta-Analysis
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data
▶ Regulatory Requirements in Clinical Trials

References
Austin PC (2011) An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behav Res 46(3):399–424
Austin PC, Stuart EA (2015) Moving towards best practice when using inverse probability of
treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in
observational studies. Stat Med 34(28):3661–3679
Beaver JA, Howie LN, Pelosof L, Kim T, Liu J, Goldberg KB, Sridhara R, Blumenthal GM,
Farrell AT, Keegan P, Pazdur R, Kluetz PG (2018) A 25-year experience of US Food and Drug
Administration approval of malignant hematology and oncology drugs and biologics. JAMA
Oncol. https://doi.org/10.1001/jamaoncol.2017.5618. Published online March 1, 2018
Berger ML, Dreyer N, Anderson F, Towse A, Sedrakyan A, Normand SL (2012) Prospective
observational studies to assess comparative effectiveness: the ISPOR good research practices
task force report. Value Health 15(2):217–230
Berger ML, Sox H, Willke RJ, Brixner DL, Eichler HG, Goettsch W, Madigan D, Makady A,
Schneeweiss S, Tarricone R, Wang SV, Watkins J, Mullins CD (2017) Good practices for real-
world data studies of treatment and/or comparative effectiveness: recommendations from the
joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making.
Pharmacoepidemiol Drug Saf 26(9):1033–1039
Berry DA, Elashoff M, Blotner S, Davi R, Beineke P, Chandler M, Lee DS, Chen LC, Sarkar S
(2017) Creating a synthetic control arm from previous clinical trials: application to establishing
early end points as indicators of overall survival in acute myeloid leukemia (AML). ASCO
abstract. J Clin Oncol 35(15_Suppl):7021. https://doi.org/10.1200/JCO.2017.35.15_suppl.
7021. Published online May 30, 2017
Blake KV, Prilla S, Accadebled S, Guimier M, Biscaro M, Persson I, Arlett P, Blackburn S, Fitt H
(2011) European Medicines Agency review of post-authorisation studies with implications for
the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance.
Pharmacoepidemiol Drug Saf 20:1021–1029
Blumenthal S (2017) The use of clinical registries in the United States: a landscape survey. EGEMS
(Wash DC) 5(1):26. https://doi.org/10.5334/egems.248. Published 2017 Dec 7
Chou R, Helfand M (2005) Challenges in systematic reviews that assess treatment harms. Ann
Intern Med 142(12 Pt 2):1090–1099
Dreyer NA, Schneeweiss S, McNeil B, Berger ML, Walker AM, Ollendorf DA, Gliklich RE, on
behalf of the GRACE Initiative (2010) GRACE principles: recognizing high-quality observa-
tional studies of comparative effectiveness. Am J Manag Care 16:467–471
Engel P, Almas MF, DeBruin ML, Starzyk K, Blackburn S, Dreyer NA (2017) Lessons learned on
the design and the conduct of Post-Authorization Safety Studies: review of 3 years of PRAC
oversight. Br J Clin Pharmacol 83:884–893
European Medicines Agency (2012) Legal framework: pharmacovigilance. Available at https://
www.ema.europa.eu/en/human-regulatory/overview/pharmacovigilance/legal-framework-
pharmacovigilance. Accessed 12 June 2020
European Medicines Agency (2016) Scientific guidance on post-authorisation efficacy studies.
Available at https://www.ema.europa.eu/en/documents/scientific-guideline/scientific-guidance-
post-authorisation-efficacy-studies-first-version_en.pdf. Accessed 12 June 2020
European Medicines Agency (2017) Guideline on good pharmacovigilance practice (GVP). Avail-
able at https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-good-
pharmacovigilance-practices-gvp-module-viii-post-authorisation-safety-studies-rev-3_en.pdf.
Accessed 12 June 2020
European Medicines Agency (2020a) Post-authorisation safety studies (PASS). Available at https://
www.ema.europa.eu/en/human-regulatory/post-authorisation/pharmacovigilance/post-authori
sation-safety-studies-pass-0. Accessed 12 June 2020
European Medicines Agency (2020b) Post-authorisation measures: questions and answers. https://
www.ema.europa.eu/en/human-regulatory/post-authorisation/post-authorisation-measures-ques
tions-answers. Accessed 12 June 2020
European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP)
(2010) Guidelines on methodological standards in Pharmacoepidemiology (Revision 7), 2010.
Online: http://www.encepp.eu/standards_and_guidances/documents/ENCePPGuideonMeth
StandardsinPE_Rev7.pdf. Accessed 12 June 2020
European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP)
(2017) Guidelines on Good Pharmacovigilance (GVP) – Module VIII – Post-authorization
safety studies (Revision 3). Online: https://www.ema.europa.eu/en/documents/scientific-
guideline/guideline-good-pharmacovigilance-practices-gvp-module-viii-post-authorisation-
safety-studies-rev-3_en.pdf. Accessed 12 June 2020
Food and Drug Administration (2011) Guidance for industry postmarketing studies and clinical
trials – implementation of section 505(o)(3) of the Federal Food, Drug, and Cosmetic Act.
Available at https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryIn
formation/Guidances/UCM172001.pdf or https://www.fda.gov/regulatory-information/search-
fda-guidance-documents/postmarketing-studies-and-clinical-trials-implementation-section-
505o3-federal-food-drug-and. Accessed 12 June 2020
Food and Drug Administration (2013) Best practices for conducting and reporting pharmacoepi-
demiology safety studies using electronic healthcare data sets. May 2013. Online: https://www.
fda.gov/regulatory-information/search-fda-guidance-documents/best-practices-conducting-
and-reporting-pharmacoepidemiologic-safety-studies-using-electronic. Accessed 12 June 2020
Food and Drug Administration (2018) FDA drug safety communication: new risk factor for
Progressive Multifocal Leukoencephalopathy (PML) associated with Tysabri (natalizumab).
Available at https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communi
cation-new-risk-factor-progressive-multifocal-leukoencephalopathy-pml. Accessed 12 June
2020
Food and Drug Administration (2019) FDA in brief: FDA issues annual report on efforts to hold
industry accountable for fulfilling critical post-marketing studies of the benefits, safety of new
drugs. https://www.fda.gov/news-events/fda-brief/fda-brief-fda-issues-annual-report-efforts-
hold-industry-accountable-fulfilling-critical-post. Accessed 12 June 2020
Food and Drug Administration (2020a) List of pregnancy exposure registries updated 17 Jan 2020.
Online: https://www.fda.gov/science-research/womens-health-research/list-pregnancy-expo
sure-registries. Accessed 20 Feb 2020
Food and Drug Administration (2020b) Postmarketing requirements and commitments: reports.
https://www.fda.gov/drugs/postmarket-requirements-and-commitments/postmarketing-require
ments-and-commitments-reports. Accessed 12 June 2020
Gliklich R, Dreyer N, Leavy M (eds) (2014) Registries for evaluating patient outcomes: a user’s
guide, 3rd edn. Two volumes. (Prepared by the Outcome DEcIDE Center [Outcome Sciences,
Inc., a Quintiles company] under Contract No. 290 2005 00351 TO7.) AHRQ Publication No.
13(14)-EHC111. Agency for Healthcare Research and Quality, Rockville. http://www.
effectivehealthcare.ahrq.gov/registries-guide-3.cfm
Goedecke T (2017) EU PASS/PAES Requirements for Disclosure. Available at https://www.ema.
europa.eu/en/documents/presentation/presentation-eu-pass/paes-requirements-disclosure-
thomas-goedecke_en.pdf. Accessed 12 June 2020
Hall GC, Sauer B, Bourke A, Brown JS, Reynolds MW, LoCasale R (2012) Guidelines for good
database selection and use in pharmacoepidemiology research [published correction appears in
Pharmacoepidemiol Drug Saf. 2012;21(11):1249. Casale, Robert Lo [corrected to LoCasale,
Robert]]. Pharmacoepidemiol Drug Saf 21(1):1–10. https://doi.org/10.1002/pds.2229
International Society for Pharmacoepidemiology (2015) Guidelines for good pharmacoepi-
demiology practice (Revision 3), 2015. Online: https://www.pharmacoepi.org/resources/
policies/guidelines-08027/#1. Accessed 12 June 2020
Kappos L, Bates D, Edan G, Eraksoy M, Garcia-Merino A, Grigoriadis N, Hartung HP, Havrdová
E, Hillert J, Hohlfeld R, Kremenchutzky M, Lyon-Caen O, Miller A, Pozzilli C, Ravnborg M,
Saida T, Sindic C, Vass K, Clifford DB, Hauser S, Major EO, O’Connor PW, Weiner HL, Clanet
M, Gold R, Hirsch HH, Radü EW, Sørensen PS, King J (2011) Natalizumab treatment for
multiple sclerosis: updated recommendations for patient selection and monitoring. Lancet
Neurol 10(8):745–758
Krumholz HM, Ross JS, Presler AH, Egilman DS (2007) What have we learnt from Vioxx? BMJ
334(7585):120–123. https://doi.org/10.1136/bmj.39024.487720.68
Naci H, Smalley KR, Kesselheim AS (2017) Characteristics of preapproval and postapproval
studies for drugs granted accelerated approval by the US Food and Drug Administration.
JAMA 318(7):626–636. https://doi.org/10.1001/jama.2017.9415
National Institutes of Health (2019) List of registries, last reviewed 18 Nov 2019. Available at:
https://www.nih.gov/health-information/nih-clinical-research-trials-you/list-registries.
Accessed 12 June 2020
Patsopoulos NA (2011) A pragmatic view on pragmatic trials. Dialogues Clin Neurosci
13:217–224
Prakash S, Valentine V (2007) Timeline: the rise and fall of Vioxx November 10, 2007. https://
www.npr.org/2007/11/10/5470430/timeline-the-rise-and-fall-of-vioxx
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies
for causal effects. Biometrika 70(1):41–55
Suissa S (2007) Immortal time bias in observational studies of drug effects. Pharmacoepidemiol
Drug Saf 16(3):241–249
Suissa S (2008) Immortal time bias in pharmaco-epidemiology. Am J Epidemiol 167(4):492–499
Wallach JD, Egilman AC, Dhruva SS, McCarthy ME, Miller JE, Woloshin S, Schwartz LM, Ross
JS (2018) Postmarket studies required by the US Food and Drug Administration for new drugs
and biologics approved between 2009 and 2012: cross sectional analysis. BMJ 361:k2031
Zauderer MG, Grigorenko A, May P, Kastango N, Wagner I, Caroline A (2019) Creating a synthetic
clinical trial: comparative effectiveness analyses using electronic medical record. JCO Clin
Cancer Inform. https://doi.org/10.1200/CCI.19.00037. Published online June 21, 2019
Part IV
Bias Control and Precision
Controlling for Multiplicity, Eligibility,
and Exclusions 39
Amber Salter and J. Philip Miller

Contents
Introduction
Multiplicity
  Introduction
  Sources of Multiplicity
  Adjustment for Single Sources of Multiplicity
  Adjustments for Multiple Sources of Multiplicity
  Software
  Summary
Eligibility and Exclusion
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
Multiple comparison procedures play an important role in controlling the accu-
racy of clinical trial results while trial eligibility and exclusions have the potential
to introduce bias and reduce external validity. This chapter introduces the issues
and sources of multiplicity and provides a description of the many different
procedures that can be used to address multiplicity primarily used in the confir-
matory clinical trial setting. Additionally, trial inclusion/exclusion criteria and
enrichment strategies are reviewed.

A. Salter (*) · J. P. Miller
Division of Biostatistics, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
e-mail: [email protected]; [email protected]


Keywords
Multiple comparison procedures · Inclusion/exclusion criteria · Enrichment
strategies

Introduction

Clinical trial design continues to evolve and become increasingly complex due, in
part, to efforts to make the evaluation of new treatments more efficient. The use of
multiple outcomes, dose levels, and/or populations creates challenges for decision-
making, especially concern over making incorrect conclusions about the efficacy or
safety of a treatment. As the multiplicity increases, so does the probability of making
a false conclusion. Multiple strategies have been developed to maintain strong
control over the error rate in clinical trials. Regulatory bodies such as the Food and
Drug Administration (FDA) and the European Medicines Agency (EMA) have both
recognized this issue and provided guidance on aspects of multiplicity in
confirmatory clinical trials. In addition to issues of multiplicity, eligibility criteria
and exclusions have the potential to add bias and reduce the external validity of a
clinical trial.

Multiplicity

Introduction

Multiplicity problems in clinical trials result from conducting many comparisons
within a single trial. Scenarios include evaluating multiple outcomes, dose levels, or
patient populations. The chances of making incorrect conclusions regarding the
hypotheses being tested increase as the multiplicity increases. Phase III trials are
primarily used to demonstrate specific efficacy and safety claims of a drug, treat-
ment, or device. In this setting, there is a need to control the increased probability of
making an error. The consequence of this type of error in this confirmatory phase of
treatment development could lead to adoption of a treatment with no beneficial
effect.
Confirmatory clinical trials are designed primarily to establish evidence that a
drug, treatment, or device is effective and safe. Most clinical trials make decisions
about the success of the trial by constructing a hypothesis to test predetermined
objectives. The hypothesis testing framework is traditionally associated with two
types of error: type I (α) and type II (β) errors. Type I error, or rejecting a true null
hypothesis, is controlled at a prespecified level; however, in the multiplicity setting,
this error for each test has the potential to inflate the trial-level error rate when no
adjustments are made for multiple comparisons. For confirmatory clinical trials, the
type I error mandated by most regulatory agencies is a two-sided 5% level. Con-
ventionally, the power (1-β) of a trial is set at 0.8 or above. While multiplicity
increases the type I error in hypothesis testing, the statistical methods to control
multiplicity may differentially affect the power of a trial. The effects on power
should be examined in the trial planning stages.
Recognition of the importance of controlling multiplicity is increasing. While
not every confirmatory trial is conducted to obtain regulatory approval, the EMA
published a guidance document on multiplicity in clinical trials (EMA (European
Medicines Agency) 2017), and the FDA released guidance on multiple endpoints in
clinical trials in 2017 (FDA (U.S. Food and Drug Administration) 2017). The
development of these guidance documents was partly a result of sponsors' increased
efforts to improve the efficiency of clinical trials. This efficiency can be gained by
having a trial evaluate more than one outcome or more than one population.
However, the need to control multiplicity in this setting is critical to maintain
scientific rigor.
Safety measures are an important component in clinical trials, and specific safety
outcomes or concerns are based on experiences in earlier phase trials. Specific safety
outcomes are one type of safety measure and may be formally tested
in a Phase III trial in conjunction with an efficacy outcome. Other safety measures,
such as adverse events, may be considered more descriptive or exploratory in nature
(Dmitrienko and D’Agostino 2018). If adverse event data are compared between
groups, this constitutes a multiplicity issue with the potential to
identify false-positive safety concerns. Application of methods such as the double
false discovery rate has been proposed to more rigorously evaluate adverse events
(Mehrotra and Heyse 2004) and reduce the complexity of safety profiles, especially
in large drug or vaccine trials.
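As an illustration of the underlying machinery (the double false discovery rate additionally groups adverse events, e.g., by body system, before applying the adjustment), a minimal sketch of a single-stage Benjamini-Hochberg false discovery rate step in base R, with hypothetical adverse-event p-values:

```r
## Sketch: Benjamini-Hochberg false discovery rate adjustment applied to
## hypothetical adverse-event p-values (one building block of the double FDR)
ae_p <- c(rash = 0.001, nausea = 0.040, headache = 0.200, fatigue = 0.012)
p.adjust(ae_p, method = "BH")  # flag events whose adjusted p-value stays small
```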
Statistical adjustment is not always necessary when multiplicity is present. The
objectives of the analysis need to be considered, and scenarios exist that do not
require adjustment for multiplicity. Examples include co-primary outcomes, where
success of the trial requires all outcome p-values to be below the significance level,
and supplemental analyses of a single outcome (e.g., covariate-adjusted or
per-protocol analyses) (FDA (U.S. Food and Drug Administration) 2017;
Proschan and Waclawiw 2000). While such scenarios, as with co-primary outcomes,
may not affect the type I error rate, their possible effect on power needs to be
addressed in the design stage.

Sources of Multiplicity

Multiplicity problems arise in clinical trials from a variety of sources, such as
evaluating treatment effects for several outcomes, for multiple dose levels, testing a
prespecified set of potential moderator variables on subgroups or for multiple compo-
nent outcomes. Interim analyses are another source of multiplicity. Statistical methods
developed specifically for this issue are addressed in ▶ Chap. 59, “Interim Analysis in
Clinical Trials.” Clinical trials with just one multiplicity factor present are considered
to have a single source multiplicity while those with more than one factor have a
multiple source multiplicity. An example of multiple source multiplicity is a trial
evaluating several dose levels in a general population and a targeted subgroup of the
population.
Different methods are used for single and multiple sources of multiplicity. The choice
of adjustment procedure draws on clinical and statistical information. The multiplicity
procedure chosen should be in line with the clinical trial objectives, and its effect on
statistical power should be investigated. Simulations are often used to evaluate the
effect of the procedures on power.

Adjustment for Single Sources of Multiplicity

For single sources of multiplicity, adjustment methods fall into two main categories:
single step and hypothesis ordered methods. Single step methods test all hypotheses
simultaneously, while ordered methods test hypotheses in a stepwise manner with
the order based on the data (size of p-values) or are prespecified based on strong
clinical information or prior studies. These methods can be step-up or step-down, where
the significance level changes as the procedure progresses through the set of null
hypotheses being tested, due to error rate being transferred upon rejection of a
prior null hypothesis. The step-up procedure will order the hypotheses from largest
to smallest according to the p-values. The step-down procedures in data-driven
hypotheses ordering will arrange the hypotheses from smallest to largest based on
their associated p-values, and the testing ceases when a hypothesis fails to be
rejected. Within these categories of adjustment methods, distributional information
about the hypothesis tests is relevant to the choice of multiplicity method. Increased
knowledge regarding the joint distribution of the test statistics among the hypotheses
being tested leads to more powerful procedures being chosen (Dmitrienko and
D’Agostino 2013). Nonparametric procedures make no assumptions regarding the
joint distribution while semiparametric procedures assume the hypothesis tests
follow a distribution but have an unknown correlation structure (Dmitrienko et al.
2013). Additionally, there are parametric procedures, such as the Dunnett’s test
(Dunnett 1955), which assume an explicit distribution for the joint distribution of
hypothesis tests and are associated with classical regression and analysis of variance
and covariance models.
Single step procedures control the error rate using simple decision rules in order
to adjust the significance level. The Bonferroni correction is a classic example of a
single step nonparametric multiplicity adjustment where the overall error rate is
divided by the number of tests being conducted to obtain an adjusted significance level
for all tests (α/m) (Dunn 1961). For example, if the overall error rate is 0.05 and three
tests are being conducted, the adjusted significance level for all three tests is 0.0167.
The Bonferroni method can also be applied by assigning prespecified weights to
different tests to account for clinical importance or other factors in the multiplicity
adjustment (FDA (U.S. Food and Drug Administration) 2017).
Other single step procedures include the Simes and Šidák semiparametric multi-
plicity adjustment methods (Šidák 1967; Simes 1986). Both are uniformly more
powerful than the Bonferroni. The Simes procedure is a global null hypothesis test,
similar to the omnibus F test in an analysis of variance (ANOVA). While the
procedure is able to identify if at least one null hypothesis is false, it does not
identify which of the specific hypotheses is false. The Šidák procedure adjusts the
significance level for each of the m hypotheses tested at the overall error
level α: each hypothesis is tested at the adjusted level $p_i \le 1 - (1 - \alpha)^{1/m}$. As in the example for the Bonferroni correction, if
there are three tests being conducted at an overall error rate of 0.05, the adjusted
significance level using the Šidák procedure is $1 - (1 - 0.05)^{1/3} = 0.0170$.
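These adjusted levels are easy to verify directly; the fragment below simply evaluates the two formulas for m = 3 tests at an overall α of 0.05.

```r
alpha <- 0.05; m <- 3
alpha / m                  # Bonferroni-adjusted significance level: 0.0167
1 - (1 - alpha)^(1 / m)    # Sidak-adjusted significance level: 0.0170
```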
While the single step procedures have the advantage of simplicity and relative
ease of implementation, stepwise procedures are more powerful procedures. The
Simes and Šidák procedures can be utilized as stepwise procedures with data-driven
hypothesis ordering and have increased power compared to the single step proce-
dure. The step-down Šidák procedure tests each hypothesis sequentially according to
ordered p-values (smallest to largest). The first $i = 1, \ldots, m - 1$ hypotheses have an
adjusted significance level of $p_i \le 1 - (1 - \alpha)^{1/(m - i + 1)}$, where subsequent hypotheses
are tested only if the ith hypothesis is rejected. The final (mth) hypothesis is tested at a
significance level of α. Thus, if there are three hypotheses being tested at an
overall error rate of 0.05, the first hypothesis would be tested at an adjusted
significance level of 0.0170, the second hypothesis would be tested at 0.0253, and
the third hypothesis tested at 0.05.
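The sequence of step-down Šidák thresholds in this example can be reproduced with a single vectorized expression:

```r
alpha <- 0.05; m <- 3
1 - (1 - alpha)^(1 / (m - 1:m + 1))   # 0.0170, 0.0253, 0.0500
```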
Other stepwise procedures include the nonparametric Holm procedure (Holm
1979) and the semiparametric Hochberg and Hommel procedures (Hochberg
1988; Hommel 1988). The Holm procedure is a step-down procedure utilizing the
Bonferroni procedure, where each hypothesis (ordered by increasing p-value) is tested
at a level adjusted for the number of hypotheses remaining to be tested. To illustrate
for a set of three hypotheses at a significance
level of 0.05, the first hypothesis is tested at the 0.0167 (α/3), the second hypoth-
esis is tested at 0.025 (α/2), and the third hypothesis is tested at 0.05 (α/1). The
Hochberg and Hommel procedures are stepwise extensions of the Simes’ proce-
dure based on step-up algorithms. Starting with the largest p-value, the Hochberg
procedure compares the hypothesis p-value to the adjusted significance level
determined by p(m – i + 1)  α/i, and if the hypothesis is rejected, then all m
hypotheses are rejected. Otherwise, the next hypotheses are tested in a similar
manner until a hypothesis is rejected or the final m hypothesis is reached. At the
mth hypothesis, the adjusted significance level is pm  α/m. With three hypotheses
being tested at the 0.05 significance level, the first hypothesis with the largest p-
value is tested at the 0.05 significance level, the second and third hypotheses are
tested at the 0.025 and 0.0167 levels, respectively. The Hommel procedure is
similar to the Hochberg procedure; however, instead of depending only on the p-
value associated with the null hypothesis currently being tested, as is the case with the
Hochberg procedure, the Hommel procedure applies additional conditions when a
hypothesis fails to be rejected, which can increase the number of hypotheses rejected.
In particular, the Hommel procedure incorporates the p-values of the preceding
hypotheses when deciding whether to reject the current null hypothesis
(Dmitrienko et al. 2013).
Multiple testing procedures used for prespecified hypothesis ordering, such as the
nonparametric fixed-sequence or fallback procedures, incorporate prior clinical and
logical information. The order in which the hypotheses will be tested is defined
before the trial begins. The fixed-sequence method places the more important
hypotheses earlier in the sequence, each tested at the trial significance level. If a
hypothesis is rejected, then the next hypothesis is tested; however, if a hypothesis
fails to be rejected, the testing ceases and the remaining hypotheses fail to be
rejected. The fallback procedure is a more flexible approach to prespecified ordering
that allows for other hypotheses to be tested in the event the preceding hypothesis
fails to be rejected (FDA (U.S. Food and Drug Administration) 2017; Wiens 2003).
The fixed sequence of hypotheses is maintained, but the type I error is divided
among the hypotheses being tested. The division of the significance level uses
weights ($w_i$) that are nonnegative and sum to 1. The first hypothesis is tested at the
adjusted significance level determined by $p_1 \le \alpha w_1$; if it fails to be
rejected, the second hypothesis is tested at $p_2 \le \alpha w_2$. However, if the first hypothesis
is rejected, the unused alpha is passed on to the next test in the sequence, so the
second hypothesis is tested at $\alpha(w_1 + w_2)$.
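A minimal sketch of this logic follows; the function name, p-values, and weights are illustrative, and with weights (1, 0, ..., 0) the procedure reduces to fixed-sequence testing.

```r
## Sketch: fallback procedure for a prespecified hypothesis ordering
## p: p-values in the prespecified testing order; w: nonnegative weights summing to 1
fallback_test <- function(p, w, alpha = 0.05) {
  reject <- logical(length(p))
  carry  <- 0                          # alpha carried forward from a rejected test
  for (i in seq_along(p)) {
    level     <- alpha * w[i] + carry  # weight for this test plus any carried alpha
    reject[i] <- p[i] <= level
    carry     <- if (reject[i]) level else 0
  }
  reject
}
fallback_test(p = c(0.02, 0.30, 0.01), w = c(0.5, 0.3, 0.2))
```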

Adjustments for Multiple Sources of Multiplicity

The procedures used for multiple sources of multiplicity have an additional inherent
complexity in that several sources of multiplicity must be addressed simultaneously. A
common manifestation of multiple sources of multiplicity is multiple
families of hypotheses in the form of a hierarchy of clinical trial objectives (primary,
secondary, and tertiary objectives). The strategy usually employed in this setting is
called a gatekeeping procedure and tests the hypotheses in the first (primary objec-
tives) family with a single source adjustment method. The second family of hypoth-
eses is tested with a multiplicity adjustment only if the primary family has
demonstrated statistical success. The first family of hypothesis tests acts as a
gatekeeper to testing the second family of hypotheses. The gatekeepers can be
designed to be serial or parallel where serial gatekeepers require all hypotheses in
the first family to be rejected before proceeding to the second family of hypotheses,
while parallel gatekeepers only require at least one hypothesis to be rejected. These
procedures allow for the error rate to be transferred to subsequent families of testing.
These approaches can be further extended to allow retesting by transferring error
back from subsequent families to previous families (e.g., from the second family
of hypotheses back to the first). The choice of procedure again depends on the
clinical objectives of the trial. Trials which have used these procedures include the
lurasidone trial in schizophrenia and CLEAN-TAVI in severe aortic stenosis
(Haussig et al. 2016; Meltzer et al. 2011).
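A simplified sketch of a serial gatekeeper is shown below, assuming a Holm adjustment within each family; the function name and p-values are hypothetical, and an actual gatekeeping strategy would be fully prespecified in the statistical analysis plan.

```r
## Sketch: serial gatekeeping with a Holm adjustment within each family;
## the secondary family is tested only if ALL primary hypotheses are rejected
serial_gatekeeper <- function(p_primary, p_secondary, alpha = 0.05) {
  primary_rej <- p.adjust(p_primary, method = "holm") <= alpha
  secondary_rej <- if (all(primary_rej)) {
    p.adjust(p_secondary, method = "holm") <= alpha
  } else {
    rep(FALSE, length(p_secondary))    # gate stays closed
  }
  list(primary = primary_rej, secondary = secondary_rej)
}
serial_gatekeeper(p_primary = c(0.010, 0.020), p_secondary = c(0.030, 0.200))
```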

Software

Software implementation of these procedures is found in SAS and R software. SAS
has a procedure PROC MULTTEST to compute the adjusted p-values for the more
common procedures. R packages have been developed to implement many of the
multiple testing procedures, such as multcomp and multxpert (Dmitrienko and
D’Agostino 2013).
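For example, the base R function p.adjust reproduces adjusted p-values for several of the single-source procedures described above; the three p-values are illustrative.

```r
p <- c(0.012, 0.021, 0.041)
sapply(c("bonferroni", "holm", "hochberg", "hommel"),
       function(m) p.adjust(p, method = m))   # one column per procedure
```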

Summary

Multiplicity issues arise from various sources in clinical trials and are defined as the
evaluation of different aspects of treatment efficacy simultaneously (Dmitrienko
et al. 2013). The more commonly encountered multiplicity problems are found in
the use of multiple outcomes, composite outcomes and their components, multiple
doses, and multiple subgroups or populations. One or a combination of these may be
found in a clinical trial, and as the number of multiple comparisons increases, the
probability of making a false conclusion, or type 1 error, increases. This inflation of
the type 1 error has the consequence of incorrectly concluding that a treatment is
efficacious or safe. Multiple methods for controlling for multiplicity have recently
been developed (Dmitrienko et al. 2013). These methods range from simple to
complex where there is a need to handle multiple sources of multiplicity in a clinical
trial. The choice of adjustment to be used should be based on clinical and statistical
information and predefined in the statistical analysis plan for a clinical trial (Gamble
et al. 2017).

Eligibility and Exclusion

A clinical trial aims to have a sample that is representative of the population to
which the treatment would be applied clinically if the trial is positive. The selection of
individuals for inclusion in a clinical trial is based on predefined eligibility criteria.
Criteria in clinical trials may be used to limit the heterogeneity of the trial population,
to limit interference with obtaining the outcome measure (e.g., excluding individuals with
comorbidities at high risk of dying from an unrelated cause prior to the outcome
assessment), and to address safety concerns (e.g., pregnant women or individuals at
criteria are to create selection bias in the study population and reduced external
validity for the trial. One of the primary ways to control selection bias is through
randomization and allocation concealment. By using randomization to assign indi-
viduals to a treatment or intervention, there will be balance in the known and
unknown factors, on average, between the groups. Yet, randomization alone does
not completely eliminate selection bias. There is still a chance for individuals to be
selectively enrolled in a trial should those in charge of recruitment know what the
next treatment allocation will be. Without concealment, those enrolling patients may
take eligible individuals whom they perceive may do worse and randomize them
when they know a placebo assignment is likely. It is established that clinical trial
populations differ from the larger clinical population. The eligibility criteria often
create a highly selected population that can limit the external validity of the trial.
Recommendations to improve the reporting of the eligibility criteria are encouraged
in order to increase awareness among clinicians reading the findings to more
appropriately assess who the results apply to.
Run-in periods (on placebo or active treatment) and enrichment strategies are
examples of exclusion mechanisms in clinical trials. These occur prior to randomization
and are utilized to select or exclude individuals from a trial. Using a run-in period can
help reduce the number of noncompliant individuals who are randomized or exclude
those who experience adverse events (Rothwell 2005). Enrichment strategies
focus on recruiting individuals who are likely to respond well in a trial such as
nonresponders to a previous treatment. These exclusion criteria have the potential to
reduce the external validity of the study.

Summary and Conclusion

There is a need for inclusion/exclusion criteria in clinical trials, but their use may
result in potential bias or lack of generalizability of the study results. Enrichment
strategies are useful in limiting individuals who may discontinue participation in a
trial or not respond well to a treatment. While these strategies may result in retaining
more subjects (and consequently more power), the external validity of the trial may be
compromised.

Key Facts

• Awareness is increasing for the need to identify and address multiplicity issues in
confirmatory clinical trials.
• There are many procedures which control the type I error inflation resulting from
multiplicity issues in clinical trials. The choice of procedure should be based on
clinical and statistical information and determined during the design phase of a
clinical trial.
• Eligibility criteria and exclusion strategies may be necessary to implement in a
clinical trial; however, careful review of the potential biases as a result should be
conducted.

Cross-References

▶ Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other
Issues of Multiplicity
▶ Interim Analysis in Clinical Trials

References
Dmitrienko A, D’Agostino R (2013) Traditional multiplicity adjustment methods in clinical trials.
Stat Med 32(29):5172–5218. https://doi.org/10.1002/sim.5990
Dmitrienko A, D’Agostino RB (2018) Multiplicity considerations in clinical trials. N Engl J Med
378(22):2115–2122. https://doi.org/10.1056/NEJMra1709701
Dmitrienko A, D’Agostino RB Sr, Huque MF (2013) Key multiplicity issues in clinical drug
development. Stat Med 32(7):1079–1111. https://doi.org/10.1002/sim.5642
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64. https://doi.
org/10.1080/01621459.1961.10482090
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a
control. J Am Stat Assoc 50(272):1096–1121. https://doi.org/10.1080/01621459.
1955.10501294
EMA (European Medicines Agency) (2017) Guideline on multiplicity issues in clinical trials.
Retrieved from www.ema.europa.eu/contact
FDA (U.S. Food and Drug Administration) (2017) Multiple endpoints in clinical trials: guidance
for industry. Retrieved from http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryIn
formation/Guidances/default.htm
Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Doré C, . . . Loder E (2017) Guidelines for
the content of statistical analysis plans in clinical trials. JAMA 318(23): 2337. https://doi.org/
10.1001/jama.2017.18556
Haussig S, Mangner N, Dwyer MG, Lehmkuhl L, Lücke C, Woitek F, . . . Linke A (2016). Effect
of a cerebral protection device on brain lesions following transcatheter aortic valve implantation
in patients with severe aortic stenosis. JAMA 316(6): 592. https://doi.org/10.1001/jama.
2016.10302
Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika
75(4):800–802. https://doi.org/10.1093/biomet/75.4.800
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70.
https://www.jstor.org/stable/4615733
Hommel G (1988) A stagewise rejective multiple test procedure based on a modified Bonferroni
test. Biometrika 75(2):383–386. https://doi.org/10.1093/biomet/75.2.383
Mehrotra DV, Heyse JF (2004) Use of the false discovery rate for evaluating clinical safety data.
Stat Methods Med Res 13(3):227–238. https://doi.org/10.1191/0962280204sm363ra
Meltzer HY, Cucchiaro J, Silva R, Ogasa M, Phillips D, Xu J, . . . Loebel A (2011) Lurasidone in the
treatment of schizophrenia: a randomized, double-blind, placebo- and olanzapine-controlled
study. Am J Psychiatry 168(9): 957–967. https://doi.org/10.1176/appi.ajp.2011.10060907
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21(6):527–539. https://doi.org/10.1016/S0197-2456(00)00106-9
Rothwell PM (2005) External validity of randomised controlled trials: “to whom do the results of
this trial apply?”. Lancet 365(9453):82–93. https://doi.org/10.1016/S0140-6736(04)17670-8
Šidák Z (1967) Rectangular confidence regions for the means of multivariate normal distributions. J
Am Stat Assoc 62(318):626–633. https://doi.org/10.1080/01621459.1967.10482935
Simes RJ (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika
73(3):751–754. https://doi.org/10.1093/biomet/73.3.751
Wiens BL (2003) A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharm Stat
2(3):211–215. https://doi.org/10.1002/pst.064
Principles of Clinical Trials: Bias and
Precision Control 40
Randomization, Stratification, and Minimization

Fan-fan Yu

Contents
Introduction
  Assignment Without Chance: A Motivating Example
Bias
Methods of Randomization
  Simple Randomization
  Restricted Randomization
Minimization
  Synonyms: Covariate-Adaptive Randomization, Dynamic Randomization, Strict Minimization
Other Methods
Practicalities and Implementation
  Unequal Allocation
  Checks on the Actual Randomization Schedule
  Assessing Balance of Prognostic Factors
  Accounting for the Randomization in Analyses
Conclusion and Key Facts
Cross-References
References

Abstract
The fundamental difference distinguishing observational studies from clinical
trials is randomization. This chapter provides a practical guide to concepts of
randomization that are widely used in clinical trials. It starts by describing bias
and potential confounding arising from allocating people to treatment groups in
a predictable way. It then presents the concept of randomization, starting from a
simple coin flip, and sequentially introduces methods with additional restrictions

to account for better balance of the groups with respect to known (measured) and
unknown (unmeasured) variables. These include descriptions and examples of
complete randomization and permuted block designs. The text briefly describes
biased coin designs that extend this family of designs. Stratification is introduced
as a way to provide treatment balance on specific covariates and covariate
combinations, and an adaptive counterpart of biased coin designs, minimization,
is described. The chapter concludes with some practical considerations when
creating and implementing randomization schedules.
By the chapter's end, statisticians or clinicians designing a trial may distinguish
generally what assignment methods may fit the needs of their trial and whether or
not stratifying by prognostic variables may be appropriate. The statistical prop-
erties of the methods are left to the individual references at the end.

F.-f. Yu (*)
Statistics Collaborative, Inc., Washington, DC, USA
e-mail: [email protected]

Keywords
Selection bias · Assignment bias · Randomization · Allocation concealment ·
Random assignment · Permuted block · Biased coin · Stratification ·
Minimization · Covariate-adaptive randomization

Introduction

An apple a day keeps the doctor away.

How does one test this hypothesis? Researchers conducting an observational
cohort trial might gather a group of like individuals, follow them for a period of
time, and record whether they made non-wellness visits to their primary care
physician. The analysis would look at the relationship between this outcome and
whether or not those with and without the outcome ate apples or not. Investigators of
a clinical trial, however, would approach this differently by preemptively assigning
apples to one group of people, no apples to another, and then observe whether they
made non-wellness visits to their physician. This approach directly tests whether an
apple-eating lifestyle affects health. Alternatively, a clinical trial could home in on the
vitamin C in apples, giving one group a dose equivalent to that in apples and another
group placebo.
The long-standing lure of a miracle health benefit from everyday foods still drives
medical research. Such was the case not for apples, but carrots: observational studies
in the early 1990s showed evidence that people who consumed more fruits and
vegetables rich in beta-carotene had lower rates of heart disease and cancer. It was
not clear, however, whether health benefits were the direct result of beta-carotene,
antioxidant vitamins and other nutrients in beta-carotene-rich foods, dietary habits in
general, or other behaviors. A series of long-term, large-scale randomized clinical
trials followed to provide direct tests of the benefits of beta-carotene on these health
outcomes.

In a “classic” parallel-group clinical trial, people are assigned one of two different
therapies, often to compare a new treatment intervention to placebo or standard of
care. In the case of beta-carotene, one trial, the Physicians’ Health Study, random-
ized two groups of men to a 12-year supplementation regimen of either beta-carotene
or beta-carotene placebo (Hennekens et al. 1996). The outcome for such a trial is
compared between the groups, with the goal of obtaining an estimate of treatment
effect that is free of bias and confounding (which will be addressed later in this
chapter).
In an observational trial, the adjustment of the exposure effect between groups of
people often occurs in the analysis through a stratified analysis or by including
potential confounders in a regression model. Although a clinical trial analysis can
apply the same adjustment methods, designers of a clinical trial can control bias and
confounding at the start of the trial. One way to do so is through randomization, the
process of assigning individuals to treatment groups using principles of chance for
assignment. Because the presence of bias can greatly affect the interpretation and
generalizability of a clinical trial, many aspects in trial design, including randomi-
zation, exist to ensure its minimization.
Randomization is the fundamental difference distinguishing observational studies
from clinical trials. In controlling for biases and confounding, randomization forms
the basis for valid inference. Careful thought should be given to its elements,
described in this chapter (block size, strata, assignment ratio, and others) and,
naturally, discussed between statisticians and a trial’s clinical leadership.
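As a simple preview of the mechanics discussed later in the chapter, the R fragment below generates a complete ("coin-flip") randomization list for a two-arm trial; the arm labels, 1:1 allocation, and sample size are illustrative.

```r
set.seed(20220501)                  # fixed seed for a reproducible schedule
assignments <- sample(c("A", "B"), size = 20, replace = TRUE)
table(assignments)                  # groups are balanced only on average
```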

Assignment Without Chance: A Motivating Example

To begin, consider two examples, more generic in nature, of nonrandom assignment
for a multicenter trial enrolling participants. Investigators are aware of the assign-
ment process but are masked to treatment:

1. Assign the first 50 people who consent to the trial to treatment A and the next 50
to treatment B.
2. Assign alternating treatments: odd-numbered participants receive A and even-
numbered participants receive B.

These approaches are simple and systematic but have drawbacks. In the first
approach, there is a high likelihood that participants within groups are more alike
than those in opposing groups. This could occur if Dr. X had all the earlier
appointments and Dr. Y the later ones.
In the second approach, the pattern is predictable. From observing the previous
set of participants, the caring Dr. Compassion has figured out that A is the novel,
active treatment. A patient he has known for many years has been very sick, and Dr.
Compassion believes that this patient may benefit from the experimental therapy in
the trial. The patient would be tenth in line and therefore slated to receive B, the
control therapy. Dr. Compassion – whether consciously or unconsciously – decides
to hold his patient back in the enrollment order, so that the patient receives the active
treatment A instead.

Bias

These situations show two examples of bias that could easily occur during assign-
ment of treatment groups. In both cases, the treatment schedule is predictable. If the
trial is unmasked, or if the novel treatment is obvious despite masking, then, like Dr.
Compassion, investigators may manipulate the timing of participant enrollment so
that certain participants receive certain treatments. Both cases are prone to selection
bias. This bias occurs when investigators have knowledge of the treatment assign-
ment and the selection of a participant for a trial is based on that knowledge. Such
selection could occur in an unmasked trial or if the randomization scheme’s assign-
ment pattern is predictable.
This issue of predictability raises the importance of allocation concealment.
The risk of investigator-influenced assignments and selection bias can be minimized
if the investigators do not know what the next assignment will be. Note that this is
different from blinding or masking, which seeks to conceal the treatment altogether.
Two easy ways to conceal assignments before they are handed out are (1) avoiding
easy assignment patterns and (2) avoiding publicly available lists, such as the
notorious example of one tacked up on the nurses’ station bulletin board. Allocation
concealment is possible in an unmasked trial, as long as investigators are unaware of
the assignment before a participant receives the intervention.
In the first case, the participants who enroll early may share certain baseline
characteristics that differ from those of participants who enroll later. These baseline
characteristics are often prognostic factors for the disease under study. For trials enrolling
over long periods of time, demographic shifts do occur. The characteristics of
participants, which sometimes reflect changed and improved standards of care,
may differ temporally depending on when they enter the trial. Byar et al. (1976)
described the Veterans’ Administration Cooperative Urological Research Group trial
in participants with prostate cancer. Earlier recruited participants had shorter survival
than those who entered later. A similar contrast arises with prevalent versus incident
cases of disease. Those available at first might have had the disease for a long time;
incident cases that arise during the trial may be more rapidly (or more slowly)
progressive. When the assignment results in prognostic factors that are unequally
distributed across the treatment groups, then the effect of the treatment on the final
outcome may be confounded with the effect of the factor. This is an example of
assignment bias.
Mitigating bias results in more accurate estimates of treatment differences. To
see this mathematically, consider a hypothetical trial in diabetic children as
presented in Matthews (2000). Hemoglobin A1c is a measure of average blood
glucose levels over the past 2–3 months. HbA1c levels tend to be higher in
adolescents (9–10%) than in young children (6–7%). Consider a trial comparing

active treatment (A) to placebo (B) showing no treatment effect. Matthews nicely
shows how assignment bias with respect to age grouping affects the treatment
difference. Assuming there is no treatment difference, the expected value of
HbA1c is μ1 for children and μ2 for adolescents. The mean HbA1c in group A
may be expressed as the sum of all nA children’s observations plus the sum of all
(N  nA) adolescent observations:

P
nA P
N
Xi þ Xi
i¼1 i¼nA þ1
HA ¼
N

The mean for group B is similarly calculated. The expected HbA1c in each group is

$$E(\bar{H}_A) = \frac{n_A \mu_1 + (N - n_A)\mu_2}{N}, \qquad E(\bar{H}_B) = \frac{n_B \mu_1 + (N - n_B)\mu_2}{N}$$

Mathematically, the expected treatment effect can then be expressed as the difference between the expected response in children, $\mu_1$, and adolescents, $\mu_2$, multiplied by a factor dependent on the number of children in each group, $n_A$ and $n_B$:

$$E(\bar{H}_A) - E(\bar{H}_B) = \frac{(n_A - n_B)}{N}(\mu_1 - \mu_2)$$

Recall that there is no actual treatment effect; thus the expected difference above should equal 0. Since children and adolescents have different HbA1c levels, $\mu_1 < \mu_2$. If the number of children in groups A and B is balanced ($n_A = n_B$), then the above expression is indeed equal to 0. The balancing of the prognostic factor, age, provides an unbiased estimate of the treatment effect. If the numbers of children in groups A and B are not balanced ($n_A \neq n_B$), the expected treatment effect is non-zero: the estimate is biased, showing a treatment difference when there actually is none. For example, with $\mu_1 = 6.5\%$, $\mu_2 = 9.5\%$, $N = 50$, $n_A = 30$, and $n_B = 20$, the expected "effect" is $(10/50) \times (-3\%) = -0.6\%$ despite no true difference.
Randomization may help to avoid this type of bias, making sure the two groups
are similar. In that way, the observable difference between the two groups is the
result of the treatment.

Methods of Randomization

Simple randomization, as described below, may produce treatment groups of different sizes. Blocking and stratification, which place restrictions on the randomization, address imbalances in group sizes and in important covariates, respectively.

Simple Randomization

Synonyms: Complete Randomization, Fair Coin Flip


The simplest example of randomization is the fair coin flip. Each participant who is
in the queue to participate is randomized independently to one treatment or another
with 50% probability. The electronic equivalent of the coin flip – more efficient,
current, and most importantly reproducible – would be a generated random number for each person, who receives treatment A if the number is less than 0.5 and treatment B otherwise. Reproducibility of experi-
ments is important for scientific validity. It allows corroboration of trial methods and
results and protects against fraud.
A pre-generated, formalized randomization list (synonyms: randomization sched-
ule, randomization scheme) would pre-generate each assignment so that one can
refer to the next entry on the list rather than flipping a coin for each assignment.
As participants are enrolled, their place in the queue corresponds with a specific
assignment to A or B in the list. One way to do this would be to:

(a) First, generate a sequence of numbers from 1 to n, where n is the total sample
size.
(b) Then, perform an independent coin flip (probability 0.5 for each treatment) for each number in the sequence. For example, in R, the command rbinom(n, 1, 0.5) generates a 0/1 value for each sequence number; assign treatment A for one value and treatment B for the other (a sketch appears below).

Note that this process introduces no selection bias, because each successive
participant’s assignment is completely random and independent of each other.
A graphical depiction of simple randomization appears in Fig. 1, which uses a
game-board spinner to depict the coin probabilities.
Fun fact. A computer-generated coin flip is actually pseudorandom, as it is the result of an algorithm-based number generator, which makes the result reproducible
given an initial number or “seed.” A series of ideal coin flips is randomness in its
purest form, but it is not reproducible.
Because a coin flip is binomial, the large sample theory of binomial distribu-
tions applies. An assignment of 50% in each group becomes more likely as the
number of coin flips, or trials, increases. With smaller sample sizes, the assign-
ment of people to one group or another is likely to be unequal for some period of
time.
Clinical trials aiming for a 1:1 assignment seek to attain equal experience with
both treatments. As discussed above, using simple randomization may not guarantee
that assignment when the sample size is small. In reality, however, the assignment
from simple randomization will not be far from 1:1; the larger the sample size, the
greater the likelihood of balance. Lachin (1988b) states one need not worry about the
imbalance for trials of more than 200 people.

Fig. 1 Simple randomization. A graphical depiction of simple randomization: a game-board spinner stands in for the coin flip; each of the n sequence numbers receives treatment A if the generated value is below 0.5 and treatment B otherwise.

Restricted Randomization

An alternative that avoids imbalance in participant numbers in each group is to put restrictions on the randomization process. For trials under 200
participants, additional conditions on the randomization procedure may ensure
more equal assignment of participants between the two treatment groups. As
discussed below, these conditions in restricted randomization also extend to better
balance of prognostic variables between the two treatment groups.

Random Assignment
A step beyond simple randomization’s fair coin flip is random assignment.
An analogy is a typical American elementary school game of musical chairs,
modernized so that no one is eliminated and everyone wins: for an equal number
of boys and girls (1:1 assignment of treatments A and B) and the same number of
chairs, music plays while the children dance freely inside the circle of chairs
(randomization placement); after the music stops, the children scramble to find a
new chair, forming a new seating arrangement (random assignment of treatments).
More formally, for a 1:1 assignment, this procedure pre-specifies the exact sample
size in advance and then restricts the randomization to half the participants receiving

Table 1 Random assignment

(a) 5 As and 5 Bs, with a sort variable generated using the SAS function RANUNI(465):

Sequence number | Group | Sort variable
001 | A | 0.0075949673
002 | A | 0.0912527778
003 | A | 0.9183023315
004 | A | 0.6203701001
005 | A | 0.5451987076
006 | B | 0.7139676347
007 | B | 0.4853155145
008 | B | 0.236651111
009 | B | 0.1542221867
010 | B | 0.9534464944

(b) Scramble and reorder, sorting by the sort variable in ascending order:

Sort variable (ascending) | Sequence number | Group | Final randomization order
0.0075949673 | 001 | A | 1
0.0912527778 | 002 | A | 2
0.1542221867 | 009 | B | 3
0.236651111 | 008 | B | 4
0.4853155145 | 007 | B | 5
0.5451987076 | 005 | A | 6
0.6203701001 | 004 | A | 7
0.7139676347 | 006 | B | 8
0.9183023315 | 003 | A | 9
0.9534464944 | 010 | B | 10

A and the other half receiving B. Then, true to the method’s name, it randomly
allocates each participant’s placement in the list sequence. A list constructed pro-
grammatically could do the following for 100 planned assignments:

(a) List 100 sequence numbers: 1–50 as A and 51–100 as B.


(b) Scramble the numbers by using a random number generator to assign a random
number to each of the 100 assignments, and then sort the list by the order of the
random numbers. The RANUNI function in SAS is useful for this.

The schema below (Table 1) illustrates the process on a smaller scale for 10
assignments, using a seed number of 465 with the SAS RANUNI function for the
reordering.
In practice, pre-generated randomization lists provide more assignments than the
planned sample size in order to account for higher-than-expected enrollment, poten-
tial errors, and other just-in-case scenarios.
Fun fact. Random assignment is the simplest form of a permuted block design –
it’s a single block of size n.

Permuted Block Designs


One concern with random assignment is the possibility of a long run of a single
treatment – for example, treatment B occurring many times in a row. Another
problem may be unbalanced group sizes at some time during the randomization.
One way to address these issues is to use blocking, a concept that puts restrictions on
the allocated numbers of participants within each group by permuting the assignment
sequence in smaller subgroupings (blocks). This method achieves treatment balance
within each block rather than over half the subjects in the trial.
Blocks, then, are assignment groups of predetermined sizes, and treatments are
allocated within a block, sitting like shelves in a bookcase. With the last block as a

base and the first block on top, the blocks are then “stacked” to produce a tower,
which comprises the randomization list.
In the simplest case, 1:1 assignments use even block sizes, while 2:1 randomiza-
tions use multiples of 3.
How does one build the tower of blocks? First, figure out the different sequences
for particular block sizes. For instance, a block size of 2 has only two sequence
options:

Sequence option number 1 2
Sequence AB BA

Then, decide on the number of assignments – typically a multiple of the block size chosen. To produce a randomization list of 100 assignments, first
generate a random list comprised of 50 numbers. Each number is either 1 or 2
(sampling with replacement), representing the two block types above. An example
for the first ten participants would be a sequence of 1, 1, 2, 1, 2. This corresponds to a
randomization list of AB|AB|BA|AB|BA. The first person randomized is assigned
treatment A, the second treatment B, the third treatment A, etc.
A block size of 4 has six block sequence options:

Sequence number 1 2 3 4 5 6
Sequence AABB BBAA ABAB BABA ABBA BAAB

A list of 100 assignments using a block size of 4 would produce a list of 25 randomly selected block sequence numbers. Each of those 25 numbers corresponds
to one of the six block types above.
Example. A trial with two treatments (A and B) and a 1:1 assignment uses a
permuted block design for randomization. The block size is 4. Although the design
of the trial specifies enrolling 80 participants, the randomization list generates 100
assignments to be “on the safe side.” A partial list of the first 20 assignments appears
below in Table 2. The program generates 25 block numbers with replacement from
the block sequence list {1, 2, 3, 4, 5, 6}. The selected block sequence numbers
appear in the second column. The corresponding assignment sequences of As and Bs
appear in the third column. A unique randomization number appears in the fourth
column. This number can serve either as the participant ID or maps uniquely to a
separate participant ID.
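A minimal sketch of such a list in R, following the six admissible sequences above (seed and names are illustrative):

```r
# Permuted block design: block size 4, 1:1 assignment, 25 blocks = 100 slots.
set.seed(465)
sequences <- c("AABB", "BBAA", "ABAB", "BABA", "ABBA", "BAAB")
picks <- sample(1:6, 25, replace = TRUE)       # block sequence numbers
assignments <- unlist(strsplit(sequences[picks], ""))
schedule <- data.frame(
  rand_no = 1000 + seq_along(assignments),     # randomization numbers, as in Table 2
  block   = rep(1:25, each = 4),
  group   = assignments
)
head(schedule, 8)
```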

Choosing Block Sizes


The choice of block size for a randomization list depends on the trial size and specific
features of the trial. Ideally, a well-chosen block size can lower the ability to predict
future treatments by lowering the predictability of patterns and therefore protecting
the masking of treatment groups. The decision of block size should also consider the
longest acceptable “run” of a single treatment. An example appears above in Table 2
between the consecutive block sequence numbers of 1 and 2. As seen here, the
longest run of a single treatment using blocks of size 4 is four in a row.

Table 2 First 20 assignments for a permuted block design (block size of 4)


Block number Sequence number Assignment group Randomization number
1 4 B 1001
A 1002
B 1003
A 1004
2 3 A 1005
B 1006
A 1007
B 1008
3 1 A 1009
A 1010
B 1011
B 1012
4 2 B 1013
B 1014
A 1015
A 1016
5 5 A 1017
B 1018
B 1019
A 1020
...
25 Etc. Etc. Etc.

In many studies, randomization is stratified by trial site. If many sites are expected
to enroll few subjects, or if a trial is small, a smaller block size may be appropriate.
This helps to ensure better balance of treatments within a block and prevents the
majority of participants from receiving a single treatment at one site. Larger studies
with many randomized at each site are able to accommodate larger block sizes.
Example. A trial of 100 people has 10 sites but expects enrollment to occur at the
2 main sites located in major metropolitan areas. Trial coordinators at the smaller
sites expect few enrollees. The randomization, which stratifies by trial site, uses a
block size of 8. The first block in the randomization list for one of the smaller sites
has assignment sequence AAAABBBB. Only four participants enroll at this site; all
four therefore receive treatment A. The analysis of the outcome cannot disentangle
the effect of this site from the effect of treatment. Because the effect of treatment is
potentially confounded with the effect of site, a smaller block size would be
appropriate here.
As a rule of thumb: block sizes of 4 to 6 are typical for studies with sites that are
expected to enroll only a few participants, 2 is a small block, and 8 is considered
large. Block sizes greater than 8 need careful consideration in relation to the size of
the trial. They are not recommended for small studies. Long runs, such as
AAAABBBB|BBBBAAAA, may occur with a block size of 8, and a block may

not fill completely as seen in the example above. In the case of a 12-participant trial
with this randomization, the trial would actually be a 2:1 randomization instead of
the intended 1:1. This defeats the purpose of blocking to achieve better balance
between treatment groups. While this example may be extreme, it is still an impor-
tant consideration within blocks and for a stratified randomization (more on this later
in section “Stratified Randomization”).
For trials with more than two treatment groups, the block size should be a
multiple of the number of treatment groups if the assignment is 1:1. A trial with
three treatment groups and block sizes of 2 and 4 make less sense than a trial with
block sizes of 3 and 6.

Keeping the Block Size Secret


In certain situations, investigators like Dr. Compassion, whom we met earlier in the
chapter’s introduction, may be able to predict, with a fair degree of accuracy, the next
treatment in the assignment sequence if they know the block size.
Example. In a placebo-controlled masked trial, an investigator has noticed the
telltale effects of the active group, a prostanoid therapy: symptoms of nausea and
diarrhea, jaw pain, and flushing. By observing participants, she has noticed that the
sequence thus far at her site is likely to have been placebo, active, active. Knowing
that the block size is 4, she can predict with certainty that the fourth participant will be
assigned the active treatment. Similarly, if investigators know that the block size is 2,
then it is easy to predict all of the even-numbered assignments.
Similar situations could arise with larger block sizes; the probability of predicting
the treatment increases at the ends of blocks. Thus, one important aspect of permuted
block designs is to limit knowledge of the block size to a select few, preferably to the
statisticians at the data coordinating center who generate the randomization and who
may have access to unmasked data during the trial. Keeping the blocking informa-
tion from investigators decreases the potential for selection bias (Lachin et al. 1988).
In the absence of this measure, the potential for selection bias decreases as a function
of the block size unless random blocks are used in the randomization scheme (Matts
and Lachin 1988; see section “Mix It Up: Using Random Permuted Blocks with
Unequal Block Sizes”).
This issue of keeping the block size secret is especially important in unmasked studies, which have a greater potential for selection bias if investigators are able to guess the ordering of assignments. Investigators may be more suscepti-
ble to influencing who gets which assignment if the randomization uses permuted
blocks and the block size is known. Below are several other methods to reduce the
predictability of treatments within a permuted block design.

Use, or Add On, Block Sizes of 2


Block sizes of 2 are considered small and not always ideal because of the predict-
ability of assignment. In some situations, however, particularly in stratified random-
izations (see section “Stratified Randomization”), block sizes of 2 help to minimize
the possibility that participants within a stratum are randomized to the same
treatment:

• A trial has several centers but only a few enrollees per center are expected.
Initiation of sites often occurs in groups and sequentially over time. Thus,
enrollment at certain times in the trial – for example, in the first few months –
may occur only at a few sites. To minimize the chance that sites enroll participants
from the same treatment group, consider a block size of 2.
• A small trial has multiple sites with anywhere from four to eight participants
expected per site. Randomization will be stratified by site. Use a block size of 2
first to guarantee treatment balance for the first two randomized, and then mix it
with block sizes of 4.
• A block size of 6 may run the risk of having this assignment: AAABBB|
BBBAAA, a run of 6 Bs in a row at a single site. Rather than using a block
size of 6, mix block sizes within a randomization; for example, combine a block
size of 4 with a block size of 2. For continuous runs such as AABB|BA, the
maximum run of any one treatment in this case is 3.

An alternative for a larger trial is to mix more than two block sizes – for example,
sizes of 2, 4, and 6.

Mix It Up: Using Random Permuted Blocks with Unequal Block Sizes
A way to address the selection bias that may occur by predicting treatment
assignments at the ends of blocks is to mix up the block sizes and use random
block sizes rather than fixed ones. Rather than choosing among the six possible
blocks of size 4, one could choose among blocks of size 4 or 2. For a list of 100
numbers,

1. First, generate a list of numbers, each corresponding to a block length of either 2 or 4.
2. For each block, generate a sequence number within that block type.
3. Select the sequence corresponding to the block type and sequence number.

An example of the random block length sequence is 4, 2, 4, 4, 2. The sequence numbers within each block type are 6, 2, 3, 1, 1. Recall the list of six sequence options defined earlier for a block of size 4:

Sequence number 1 2 3 4 5 6
Sequence AABB BBAA ABAB BABA ABBA BAAB

and the two sequence options for a block size of 2:

Sequence option number 1 2
Sequence AB BA

The corresponding randomization assignments for the first 16 participants are

BAAB | BA | BABA | AABB | AB

Random block length sequence: 4, 2, 4, 4, 2
Sequence number: 6, 2, 3, 1, 1
Randomization assignment: BAAB, BA, BABA, AABB, AB
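A sketch of the mixed-block procedure in R (block sizes, seed, and names are illustrative):

```r
# Random permuted blocks with block sizes mixed between 2 and 4.
set.seed(465)
seq2 <- c("AB", "BA")
seq4 <- c("AABB", "BBAA", "ABAB", "BABA", "ABBA", "BAAB")
assignments <- character(0)
while (length(assignments) < 100) {
  size  <- sample(c(2, 4), 1)                                   # random block length
  block <- if (size == 2) sample(seq2, 1) else sample(seq4, 1)  # sequence within type
  assignments <- c(assignments, strsplit(block, "")[[1]])
}
assignments <- assignments[1:100]   # trim to the planned list length
```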

Urn-Adaptive Randomization Designs

Synonyms: Adaptive Randomization, Dynamic Randomization


A trial sponsor may want to have treatment balance during a trial in real time
rather than just at the end, for example, when the trial has staggered entry of
participants and when the total number of participants is not entirely known.
Enter urn-adaptive randomization designs, extensions of restricted randomized
designs. The general principle is this: rather than using a fair coin toss as
described earlier, urn-adaptive designs use a biased coin. For now, say this coin
has a heavier tail side and is weighted 30:70 heads/tails. When a new participant
is enrolled, look to see which treatment group has fewer people, and then flip the
coin. If it lands tails (remember, there is a 70% chance of this), then the person
goes to the group with fewer people. If it lands heads, then the person goes to the
group with more people.
This is an example of Efron’s biased coin design, where the first participant, or
first several participants, is randomized by simple randomization. Generalizing the
above (where p = 0.70), for the k-th participant, consider the difference in the
number of people between the groups, A − B.

If A − B < 0 (more Bs), randomize to A with probability p, where p > 0.5.
If A − B > 0 (more As), randomize to A with probability 1 − p.
If A − B = 0, randomize to A with probability 0.5.
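A minimal sketch of this rule in R (the function name, seed, and counts are illustrative):

```r
# Efron's biased coin: the under-represented arm is favored with probability p.
efron_next <- function(nA, nB, p = 0.70) {
  diff <- nA - nB
  prob_A <- if (diff < 0) p else if (diff > 0) 1 - p else 0.5
  if (runif(1) < prob_A) "A" else "B"
}
set.seed(465)
counts <- c(A = 0, B = 0)
for (k in 1:20) {
  arm <- efron_next(counts["A"], counts["B"])
  counts[arm] <- counts[arm] + 1
}
counts   # group sizes stay close throughout enrollment
```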

Note that p is constant even when there is imbalance. The design is summarized
visually in Fig. 2, again using a game-board spinner instead of a coin for illustration
purposes.
Two other designs are Wei’s urn design (1978) and its generalization, Smith’s
generalized biased coin design. Wei’s urn design is similar to Efron’s except that p
fluctuates depending upon the balance between the two groups. Both are urn models
with n balls labeled A and n balls labeled B. For Wei’s urn design, when the k-th
person is randomized, a “ball” is picked from the urn. If the ball is labeled A, then:

• The person is randomized to group A.


• The A ball is returned to the urn.
• m balls labeled B are added to the urn.
• Repeat for the (k + 1)st person randomized.

Fun fact. Complete randomization is the special case in which no balls are added to the urn (m = 0).
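A sketch of Wei's urn scheme in R (the default values n = 1 and m = 1, the function name, and the seed are illustrative assumptions):

```r
# Wei's urn: draw a ball, assign that arm, return it, add m opposite-label balls.
wei_urn <- function(n_participants, n = 1, m = 1) {
  urn <- c(A = n, B = n)                    # current ball counts
  out <- character(n_participants)
  for (k in seq_len(n_participants)) {
    draw <- sample(c("A", "B"), 1, prob = urn / sum(urn))
    out[k] <- draw
    other <- if (draw == "A") "B" else "A"
    urn[other] <- urn[other] + m            # bias future draws toward balance
  }
  out
}
set.seed(465)
table(wei_urn(30))
```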

Fig. 2 Efron's biased coin design, using a 30:70 "biased coin" as depicted by a game-board spinner: when the current allocation has more Bs (A − B < 0), the next participant is randomized to A with p = 0.70; with more As (A − B > 0), with p = 0.30; when allocation is equal (A − B = 0), with p = 0.50.

Because the probability of assignment is biased toward the group with fewer
assignments, urn-adaptive designs adapt the probability of choosing the next treatment
on the basis of the assignment ratio thus far. This helps maintain balance as the trial is
ongoing, but does not guarantee complete balance at the end of the trial. The procedure reduces the predictability of the assignments and thereby the associated selection bias. For smaller trials, urn designs provide balance along the way
and behave more like complete randomization as the sample size gets large.

Stratified Randomization
Let’s return to the earlier example of a trial of diabetic children, where HbA1c tends
to be higher in adolescents as compared to young children. An imbalance in the

number of children in group A versus group B could occur using the randomization
methods just described and could lead to a biased treatment effect. An alternative is
to achieve treatment balance within each age grouping, rather than achieving balance
between treatments over all participants. This method, stratified randomization,
achieves balance within pre-chosen strata (young children vs. adolescents) defined
by important prognostic factors (age grouping), whose levels may affect the outcome
(HbA1c). In the simplest case of a two-strata factor, the randomization list is
essentially two lists, one for each stratum.
One major goal of stratification is to minimize the chances of one treatment
occurring primarily within a single factor – for instance, the majority of adolescent
trial participants receiving treatment B – such that the analysis cannot disentangle the
effect of the factor from the effect of treatment. This helps avoid correlation between
predictors (for factors not associated with the outcome) and confounding (for factors
associated with the outcome).
A trial with two two-level factors, such as age and gender, has four strata: male
pediatric, male adult, female pediatric, and female adult. Statisticians will often
picture this as a 2 by 2 table and refer to each stratum as a “cell.” A trial with two
treatment groups will therefore have eight cells. This trial will have four separate
randomization lists, one for each stratum. Within each stratum, randomization may
occur using random assignment or permuted blocks.
Fun fact. For random assignment, stratification may be viewed as blocking, with
each stratum acting as one large block using simple randomization.
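A minimal sketch of a stratified scheme in R, with one permuted-block list per stratum (the strata labels, block counts, and seed are illustrative):

```r
# Stratified randomization: an independent permuted-block list for each stratum.
set.seed(465)
permuted_blocks <- function(n_blocks, sequences) {
  unlist(strsplit(sample(sequences, n_blocks, replace = TRUE), ""))
}
seq4 <- c("AABB", "BBAA", "ABAB", "BABA", "ABBA", "BAAB")
strata <- c("female_pediatric", "female_adult", "male_pediatric", "male_adult")
lists <- setNames(lapply(strata, function(s) permuted_blocks(10, seq4)), strata)
# At enrollment, each participant takes the next unused entry
# from his or her stratum's list:
lists$female_pediatric[1:8]
```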
A real challenge in designing stratified trials is selecting the most important strata
on which to achieve balance. A true story is a discussion among researchers who
were planning a trial. Each clinician felt very strongly about a prognostic factor
whose levels would affect the outcome. The list grew to include gender, baseline
disease status, age category, a disease-specific clinical characteristic, and a bio-
marker. When the trial statistician pointed out that there were now at least 32 strata,
and therefore 64 cells, for a 100-person trial, the researchers had to step back to re-
prioritize as a group.
As seen in Table 3, the number of strata quickly multiplies as the number of
prognostic variables increases. A risk of including so many factors is that numerous
strata may result in certain “cells” having few or no people. Having empty cells, or
many cells with a single person, not only goes against achieving balance but also

Table 3 Number of prognostic factors and strata for a trial with two treatment groups

Two-level factors | Example | Number of strata | Number of cells
1 | Gender | 2 | 4
2 | Gender, age (pediatric vs. adult) | 4 | 8
3 | Gender, age, baseline disease status (WHO class I/II vs. III/IV) | 8 | 16
4 | Gender, age, baseline disease status, genetic biomarker | 16 | 32
N | | 2^N | 2^(N+1)

presents problems when analyzing the data. For the analysis, many trial teams
choose to pool strata that have only one or two people randomized.
Another operational consideration for limiting the number of strata is the
possibility for mis-stratification. Investigators are humans; they may enter the
wrong stratum criterion when randomizing a participant. Deciding how to handle
mis-stratifications then becomes a challenge in the conduct and interpretation of
an analysis. For example, if a woman is stratified as a man, should she be
analyzed as a man, to reflect the actual randomization, or as a woman, because
that is what she is?
There is some debate as to whether, and when, studies should stratify. Lachin
et al. (1988) recommend stratification for trials with fewer than 100 participants. For
larger trials, the advantages are negligible for efficiency; they recommend stratifying
by center but not by other prognostic factors. Others argue that investigators may
want to stratify for other scientific reasons – such as characteristics of a disease – that
may affect trial outcome. An example is the breast cancer trial in Table 5, which
stratified by first- versus second-line therapy. With an outcome of progression-free
survival, it was important to monitor that the randomization obtained balance between those who were farther along in their treatment (second-line therapy) and those who were not.
A special consideration is stratification by clinical site, which was addressed
earlier in the discussion of block sizes in section “Permuted Block Designs.”
Because of the sequential nature of site initiation in a trial and the similarities in
patient care within a site, many studies will stratify randomization by site. This helps
to avoid confounding of the treatment effect by site and ensures balance within site.
Because blocking is usually employed within site, unmasked studies should
probably avoid stratifying by site. The prediction of treatment patterns at the ends
of blocks is much easier in this setting.

Minimization

Synonyms: Covariate-Adaptive Randomization, Dynamic


Randomization, Strict Minimization

When a trial has many prognostic factors needing balancing, minimization may
provide a good assignment alternative compared to more traditional randomization
methods. Recall the trial mentioned earlier with the five different two-level prog-
nostic variables and the resulting problematic 64 cells for 100 people. That trial may
have been a candidate for minimization if the clinicians decided that each of the five
variables was equally important for stratification.
Minimization refers to minimizing the treatment imbalance over several
covariates by the use of a dynamic, primarily nonrandom method. As an alternative
to stratified block randomization, minimization allows balancing on many prognos-
tic variables in real time. The method uses information on prognostic factors – the
stratification variables used with the randomization methods above – to determine

Table 4 Participants randomized to two groups, by strata

After 15 participants
Factor | Stratum | Group A | Group B | A − B
Gender | Female | 4 | 2 | 2
Gender | Male | 5 | 4 | 1
Age | <18 | 3 | 3 | 0
Age | 18+ | 3 | 6 | −3

where the imbalance is. Then, generally, the method chooses the arm that best
minimizes the imbalance and assigns the next participant to that arm. Similar to
biased coin designs, the next assignment is partially determined by the treatment
group with fewer people. Here, the assignment is done by defining a weighted metric
that combines the treatment differences across all the strata for a covariate. This
weighted metric then determines the probability of assignment to that arm. The goal of minimization, like urn-adaptive randomization, is to ensure a small
absolute difference between the numbers randomized in each treatment group. The
difference between urn-adaptive methods and minimization is that minimization
uses stratum-specific differences to minimize the differences between treatment
groups.
First introduced by Taves (1974), a deterministic version of the method with the
four strata in Table 4 would randomize the first set of participants using complete
randomization (Table 4). Within each stratum, calculate the differences for A − B as
seen in the last column of the table; positive values indicate more participants in A
and vice versa. To determine the assignment for the 16th participant, add the
differences for the subject-specific strata.

If sum < 0 (more Bs), randomize to A.


If sum > 0 (more As), randomize to B.
If sum = 0 (equal assignment), randomize to A with probability = 0.5.

Example: In Table 4, 15 people have been randomized. The 16th subject is a pediatric female. To determine this person's assignment, add the differences A − B for
these two strata, 2 + 0 = 2. This indicates that because currently A has more people
for this combination of factors, this person receives treatment B. Update the table
using this person’s information, and then allocate the next person.
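A sketch of this deterministic rule in R, seeded with the counts from Table 4 (the function and matrix layout are illustrative):

```r
# Taves' strict minimization for two factors (gender and age group).
taves_next <- function(counts, gender, age) {
  d <- sum(counts[c(gender, age), "A"] - counts[c(gender, age), "B"])
  if (d < 0) "A" else if (d > 0) "B" else sample(c("A", "B"), 1)
}
counts <- matrix(c(4, 5, 3, 3,    # group A: Female, Male, <18, 18+
                   2, 4, 3, 6),   # group B: Female, Male, <18, 18+
                 ncol = 2,
                 dimnames = list(c("Female", "Male", "<18", "18+"), c("A", "B")))
taves_next(counts, "Female", "<18")   # pediatric female: sum = 2 + 0 = 2, so "B"
```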
Pocock and Simon (1975) independently proposed a similar method but used a
probability p > 0.5 for assigning participants to a specific arm. Proschan et al. (2011)
refer to Taves’ deterministic method as strict minimization, and Pocock and Simon’s
method as minimization.

If sum < 0, randomize to A with probability p.


If sum > 0, randomize to B with probability p.
If sum = 0, randomize to A with probability = 0.5.

A p of 0.8 gives a relatively high probability for receiving the treatment currently
in “deficit.” Pocock and Simon generally prefer a p of 0.75.
This method balances marginally over all covariates rather than within stratum as
for stratification using permuted block designs. While the method achieves better
balance in real time on the selected factors, unlike conventional randomization
methods, it does not guarantee balance on unspecified factors. It works best in
small trials (e.g., trials of <100 people). Pocock and Simon have generalized this
method to three or more groups, which is not covered here (Fig. 3).

Fig. 3 Minimization. A graphical depiction of Taves' "strict minimization" versus Pocock and Simon's "minimization": when the sum of factor-specific treatment differences (A − B) has more Bs (sum < 0), Taves assigns to A while Pocock and Simon randomize to A with p = 0.75; with more As (sum > 0), Taves assigns to B while Pocock and Simon randomize to B with p = 0.75; with equal allocation (sum = 0), both randomize to A with p = 0.50.



One disadvantage to dynamic randomization is its computational complexity. Unlike conventional randomization, a pre-generated list prior to trial start is not
feasible. Although a list of probabilities for p may be pre-generated, dynamic
methods require real-time monitoring of the strata and current imbalance for each
participant enrolled and “randomized.” In more complicated methods, such as urn
designs with dynamic p, several lists for p will need pre-generation. The added
complexity introduces the potential for computational error and, if implemented
incorrectly, may counter the intention for increased real-time balance.

Other Methods

Additional approaches to randomization models include urn models where the distri-
bution of treatment “balls” within urns is based on the responses observed so far. These
response-adaptive randomizations include randomized play the winner (Wei and
Durham 1978) and drop the loser (Ivanova 2003), among others. The basic premise
is that if one treatment is showing better response than the other, then the assignment
probabilities can favor the better treatment. This type of randomization is more suitable
with trials where responses are observed quickly, and addresses ethical concerns about exposing participants to treatments that may not be effective. This
chapter does not address these methods further.

Practicalities and Implementation

Unequal Allocation

Although this chapter focused on 1:1 assignments, some trials may choose to use
other assignment ratios. A common alternative is 2:1, which, for a fixed total sample size, has less power than its 1:1 counterpart. Figure 4 displays the total sample size needed for a continuous, normally distributed outcome, where power is represented by

$$\Phi\left(\frac{\Delta}{\sigma}\cdot\frac{1}{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} - 1.96\right),$$

where $\Delta/\sigma = 0.65$ and power of 80%, 90%, and 95%.
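A sketch evaluating this expression in R for a fixed total sample size (the sample sizes shown are illustrative):

```r
# Power as a function of the per-group sample sizes, with Delta/sigma = 0.65
# and a two-sided test at alpha = 0.05 (critical value 1.96).
power_fn <- function(n1, n2, delta_over_sigma = 0.65) {
  pnorm(delta_over_sigma / sqrt(1 / n1 + 1 / n2) - 1.96)
}
power_fn(38, 38)   # 1:1 allocation, N = 76: about 81% power
power_fn(51, 25)   # 2:1 allocation, N = 76: about 76% power
```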
For an often small loss in power (or increase in sample size), however (Fig. 4), in
some trials investigators will prefer to have unequal allocation because of nonstatistical
reasons. Some trials may face high costs in obtaining the control treatment from sponsors. In rare diseases, unequal allocation gives more people access to the novel treatment, and a
single trial will therefore have more experience with the novel treatment.
An example of unequal allocation comes from a trial of a novel gene therapy,
delivered by subretinal injection to the eye, which promised to restore vision to
blind participants who had a particular mutation. The sponsor and investigators did
not want to burden control participants with a sham injection procedure, especially
when many participants would be children. As a result, masking the treatment groups
was not possible. With the potential to regain vision from blindness or to stop the path

Fig. 4 Total sample size as a function of the assignment ratio for 80%, 85%, and 90% power

toward blindness, everyone recruited was keen on receiving the novel treatment in a
rare disease population where finding patients was already difficult. The final design
used a 2:1 randomization with an extension period. Despite a small loss in power
compared to a 1:1 randomization in an already small trial, this design limited the
number of control participants, and the extension period allowed the opportunity for
controls to receive treatment after a year on the main trial (Russell et al. 2017).

Checks on the Actual Randomization Schedule

Prior to finalizing a complete or permuted block randomization scheme for a trial, a


statistician should run a series of checks on the schedule using the intended final
seed. If the seed number is 8675309, then prior to calling pseudorandom number
generator functions, one can use set.seed(8675309) in R and call streaminit(8675309) in SAS. Other functions, such as ranuni in SAS, also take the seed as the
main argument. The final randomization scheme should be checked for what ran-
domization strives to achieve: balance and desired assignment of treatment groups
overall, within blocks, and within strata.
Ideally, a randomization schedule should be reproducible when using the same
seed as the original schedule. The seed and the block size are best kept from the

sponsors, sites, and investigators as discussed in section “Permuted Block Designs.”


A few checks may include:

• Check for patterns to ensure that the distribution of possible block permutations is
not unusual. An example is to ensure that all A–B blocks do not all occur early in
the list and all B–A blocks at the end. Another is to avoid a long string of Bs
occurring such as a block size of 6; a run of AAABBB|BBBAAA|AAABBB
gives two long runs of As and Bs, respectively. One might want to consider a
different seed to achieve runs that alternate more between the treatment groups.
• Check whether the distribution of the position of treatment assignments within blocks is well-balanced; for example, for blocks of size 6, check whether As occur more often in positions 5 and 6.
• Check that there are no patterns of transitions between treatment assignments; for example, whether across all blocks there are more A-to-B transitions than B-to-A.
• If the schedule uses an abbreviated treatment group identifier (A, B or 1, 2), then it should also carry a decoded variable ("active," "placebo").
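A few of these checks are easy to automate; a sketch in R, assuming a schedule data frame with block and group columns as in the permuted-block sketch earlier:

```r
# Automated checks on a generated randomization schedule.
table(schedule$group)                         # overall balance
table(schedule$block, schedule$group)         # balance within every block
runs <- rle(schedule$group)                   # run-length encoding of assignments
max(runs$lengths)                             # longest run of a single treatment
pos <- ave(seq_along(schedule$group), schedule$block, FUN = seq_along)
table(position = pos, group = schedule$group) # A/B balance by within-block position
```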

Assessing Balance of Prognostic Factors

Reports of randomized and unrandomized studies typically present a table of demo-


graphic and baseline characteristics as part of the overall summary of the trial analysis.
The table shows the distributions of participant characteristics to see how comparable or
“balanced” the groups are. “Balanced” means that important characteristics are distrib-
uted similarly in each treatment group. Table 5 displays a subset of the demographic and
baseline characteristics from a large breast cancer trial comparing the effects of epoetin
alfa to the best standard of care among participants who develop anemia during
chemotherapy (Leyland-Jones et al. 2016). Because the trial randomized a large number of women (nearly 2,100), one would expect that randomization would result in similar distributions
of prognostic factors in the two groups. As seen in Table 5, this is the case.
Fun fact. An interesting aspect of baseline tables is that people often request a p-value to provide formal comparisons of differences between treatments for each variable summarized. Informally, such a p-value is taken as the probability that the observed differences are the result of chance alone. Randomization, however, is a mechanism assigning people by chance to one group or another. Therefore, the probability that any difference in a baseline table of a randomized trial arose by chance is, by definition, 1 (of course, unless the process of randomization was flawed), making formal baseline tests uninformative.
For randomized clinical trials, tables of baseline characteristics also help show
whether randomization is doing its job. Data and Safety Monitoring Boards
(DSMBs), who review interim data of ongoing trials, will use such tables to monitor
imbalances of important prognostic factors partway through ongoing trials and
whether the trend persists over time. They will also monitor to see if the trial’s
baseline population reflects the target population. If not, a DSMB may encourage the
trial sponsor to further its efforts to recruit particular types of subjects.
For the final analysis of a trial, such a table helps describe the trial population to
evaluate how generalizable the trial results are to the larger population.

Table 5 A typical demographics/baseline characteristics table


Best standard of care Epoetin alfa
N = 1,048 N = 1,050
Age, years
Median 52.0 52.0
Range (min, max) 23, 81 24, 79
Race, n (%)
White 724 (69.1) 692 (65.9)
Asian 304 (29.0) 335 (31.9)
Black 4 (0.4) 4 (0.4)
Weight, kg
Mean (SD) 67.0 (16.08) 67.6 (16.56)
BMI, kg/m2
Mean (SD) 26.3 (5.35) 26.4 (5.64)
Stage at initial diagnosis, n (%)
I 58/1036 (6) 52/1029 (5)
II 323 (31) 331 (32)
III 303 (29) 325 (32)
IV 336 (32) 306 (30)
Unknown 16 (2) 15 (2)
Line of chemotherapy, n (%)
First line 828/1048 (79) 837/1050 (80)
Second line 220 (21) 213 (20)
Baseline tumor-related characteristics, n (%)
HER2/neu-positive 407/1044 (39) 405/1048 (39)
Had prior surgery 740/1048 (71) 753/1050 (72)
Had prior chemotherapy 851/1048 (81) 849/1050 (81)
Taken from Leyland-Jones et al. (2016), American Society of Clinical Oncology

Accounting for the Randomization in Analyses

Models Underlying Randomization and Inference


The analysis of data includes making certain assumptions about the underlying
distributions. This is relevant because the underlying distributions form the
basis for statistical tests, which inform inference. Although the clinical trial
populations are treated in analyses as if they were true random samples from the
larger population, sometimes they are not. The three main theoretical models for the
underlying population of a trial sample (Lachin 1988a) are briefly described below.
The population model. This is the idea that any sample drawn randomly for
the trial, including the treatment groups resulting from randomization, is a represen-
tative sample from an infinitely larger population. All samples have the same
underlying distribution, and clinical responses among individual people in the
sample are independent.
A homogeneous population model assumes that the people in the sample satisfy
the same inclusion and exclusion criteria. In this model, the assignment of treatments
does not affect the type I error rate or power of a test. A heterogeneous population

model assumes that people differ in terms of their (baseline) characteristics and are
sampled from multiple populations. The underlying distribution of participant
responses is a function of the participant’s characteristics.
The invoked population model. Randomized groups may be similar with respect
to baseline variables, but each group may not necessarily be a perfect sampling
distribution from the larger population. The reality is that recruitment for a trial’s trial
population is far from a random sampling procedure of an infinitely large population.
In fact, much of it is nonrandom, targeting specific hospitals and communities and
selecting participants who satisfy certain eligibility criteria. The only random ele-
ment comes from the act of randomization itself (Rosenberger et al. 2018). The data
are still analyzed as if they were a random sample representative of the infinitely
larger population. While the randomized participants may be somewhat representa-
tive of the larger population, this belief still requires a leap of faith and is appropri-
ately called the invoked population model (Lachin 1988a). It invokes the assumption
that the analysis and inferences are from samples of the larger, homogeneous
population where the underlying distributions are the same.
The randomization model. Another approach is to say that the underlying distri-
butions of the treatment groups are not expected to be similar or are unknown.
In fact, there is no way ever to know the underlying distribution or to even make
assumptions about them under the invoked population model. Although this sounds
philosophical, this situation is tangible in the real-world setting of a small trial. Here,
the sample size may be too small to assume normality, which is the basis of many
tests. An alternative is to make no assumptions about the underlying distributions,
and therefore tests of treatment differences do not rely on those assumptions. Instead,
the test solely compares whether the outcome is related to the treatment.
In a randomization test, the basic idea is to assume that treatment label has
nothing to do with a person’s outcome. The null hypothesis is that the participant’s
responses are unaffected by the treatment. The observed difference, then, is only the
result of how the participants were allocated. The test is actually multiple rounds of
reshuffling and is sometimes referred to as “re-randomization.” To perform the test,

• Randomize the assignment of treatment label to the participant; participants keep


their outcomes but jumble their treatment labels.
• Repeat the analysis with the new labels 10,000 times or so, without replacement. This
process is really sampling from the distribution of randomization permutations.
• Next, calculate the test statistic and/or the p-value from the model.
• Then line up all the p-values in order, and see if the original result observed in the
first place is one of the extreme outcomes.
• The randomization test p-value is the proportion of new p-values that are as or
more extreme than the original observed in the actual dataset.

If the two groups were really not different, the reassignments are unlikely to
produce significant differences.
The benefit of the randomization test is that it requires no distributional assump-
tions; the disadvantage is that the computationally intensive process can be time-
consuming and complex programmatically.
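A minimal sketch of such a test in R, using simple relabeling on toy data (for a blocked or stratified trial, the relabeling step should instead reuse the trial's actual randomization method):

```r
# Randomization test for a difference in means between two arms.
set.seed(465)
outcome <- c(rnorm(20, mean = 0.3), rnorm(20, mean = 0))   # toy data
arm <- rep(c("A", "B"), each = 20)
obs_diff <- mean(outcome[arm == "A"]) - mean(outcome[arm == "B"])
null_diffs <- replicate(10000, {
  relabel <- sample(arm)                     # reshuffle treatment labels only
  mean(outcome[relabel == "A"]) - mean(outcome[relabel == "B"])
})
# Proportion of reshuffled results as or more extreme than the observed one:
mean(abs(null_diffs) >= abs(obs_diff))
```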

Fun fact. A randomization test is often referred to as a permutation test, but they
are not technically the same thing. A permutation test assumes that the data are
exchangeable and that all outcomes in the permutation have the same likelihood.
Rosenberger et al. (2018) show that this may be the case for random assignment, but not
under other randomization designs.

Randomization Method
Earlier parts of the chapter reference adjusting for prognostic factors when the
treatment groups are not balanced. In most cases, the analysis needs to account for
the randomization in order to control the type I error properly.
In a stratified randomization, many statisticians advocate including the stratifica-
tion factors as covariates in the analysis. One reason for this is that in stratifying,
participants within a stratum are more alike; the stratification induces correlation
among those participants (Kahan and Morris 2012). This affects the variances of the
treatment difference. If the analysis ignores the stratification, and therefore the
correlation, then the standard error of the treatment difference is larger than the
truth. This in turn reduces power and inflates p-values relative to an analysis that accounts for the stratification.
If one is performing a permutation or randomization test for the analysis, then re-
randomization should use the same method as the original randomization. For example, if assignment occurred using a permuted block design, then re-randomization should use the same design in order for the tests to have the proper type I error rate.
For a trial where assignment is determined using minimization, the method of
analysis is less clearly defined. Taves advocated including the factors used for minimi-
zation as covariates in the analysis. Others have argued for a randomization test to
control alpha, although conducting one may be complicated. Re-randomization for trials
using minimization may have issues if there is unequal allocation (Proschan et al. 2011) and
may also be unnecessary, producing similar results as a t-test or test of proportions
(Buyse 2000). Further review of this discussion appears in Scott et al. (2002).

Randomization Errors: FAQs


Q What if someone is misrandomized?
A In an intent-to-treat (ITT) analysis [covered in another chapter], use the random-
ized treatment group. In an as-treated analysis situation, use the actual treatment
group. (Note that some people do not accept as-treated analyses as valid.)
Q What if someone is mis-stratified?
A If randomization used a permuted block design, use the individual’s actual
(correct) stratification in the analysis. If someone is mis-stratified as a male
when she is really female, then use female in the analysis. (Note that some
people maintain that the person should remain mis-stratified).
Q For treatment errors: what if someone is randomized to A but receives B?
A In a true ITT analysis, the analysis uses the randomized assignment. Receiving
the wrong treatment could mean receiving the wrong one-time treatment, which differs from receiving one wrong dose out of many. The
philosophy, especially in a Phase 3 setting, is that the assigned treatment

regimen is the main comparison. A sensible extra analysis would analyze the
data using the actual treatment received.

Conclusion and Key Facts

The choice of randomization method depends, of course, on the size and needs of the
trial, with input from the trial sponsor and investigators. The table below (Table 6)
summarizes considerations for the different types of assignment methods discussed
in this chapter.

Table 6 Randomization properties and methods


Simple Random Permuted Urn-
randomization assignment block Stratification adaptive Minimization
Treatment balance ✓ ✓ ✓ ✓ ✓
at trial’s end for
N > 200
N ≤ 200 ✓ ✓
Possible treatment ✓ ✓
imbalance during
assignment
process
Treatment balance ✓ ✓
during assignment
process in real
time
Pre-generated ✓ ✓ ✓ ✓
assignment list
Treatment balance ✓ ✓ ✓ ✓
within unspecified
factors
Treatment balance ✓ ✓ ✓
within 1–3
specified factors
Treatment balance ✓
within >3
specified factors
Special ✓ ✓
recommendations
for N < 100
Dynamic ✓ ✓
Complex ✓ ✓
programming
Allocation ✓ ✓ ✓
concealment in
unmasked trial
With random/ ✓
mixed blocks only

Cross-References

▶ Randomization and Permutation Tests

References
Buyse M (2000) Centralized treatment allocation in comparative clinical trials. Applied Clinical
Trials 9:32–37
Byar D, Simon R, Friedewald W, Schlesselman J, DeMets D, Ellenberg J, Gail M, Ware J
(1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80
Hennekens C, Buring J, Manson J, Stampfer M, Rosner B, Cook NR, Belanger C, LaMotte F,
Gaziano J, Ridker P, Willett W, Peto R (1996) Lack of effect of long-term supplementation with
beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J
Med 334:1145–1149
Ivanova A (2003) A play-the-winner type urn model with reduced variability. Metrika 58:1–13
Kahan B, Morris T (2012) Improper analysis of trials randomized using stratified blocks or
minimisation. Stat Med 31:328–340
Lachin J (1988a) Statistical properties of randomization in clinical trials. Control Clin Trials
9:289–311
Lachin J (1988b) Properties of simple randomization in clinical trials. Control Clin Trials
9:312–326
Lachin JM, Matts JP, Wei LJ (1988) Randomization in clinical trials: Conclusions and recommen-
dations. Control Clin Trials 9(4):365–374
Leyland-Jones B, Bondarenko I, Nemsadze G, Smirnov V, Litvin I, Kokhreidze I, Abshilava L,
Janjalia M, Li R, Lakshmaiah KC, Samkharadze B, Tarasova O, Mohapatra RK, Sparyk Y,
Polenkov S, Vladimirov V, Xiu L, Zhu E, Kimelblatt B, Deprince K, Safonov I, Bowers P,
Vercammen E (2016) A randomized, open-label, multicenter, phase III study of epoetin alfa
versus best standard of care in anemic patients with metastatic breast cancer receiving standard
chemotherapy. J Clin Oncol 34:1197–1207
Matthews J (2000) An introduction to randomized controlled clinical trials. Oxford University
Press, Inc., New York
Matts J, Lachin J (1988) Properties of permuted-block randomization in clinical trials. Control Clin
Trials 9:345–364
Pocock S, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors in
the controlled clinical trial. Biometrics 31:103–115
Proschan M, Brittain E, Kammerman L (2011) Minimize the use of minimization with
unequal allocation. Biometrics 67(3):1135–1141. https://doi.org/10.1111/j.1541-0420.2010.
01545.x
Rosenberger W, Uschner D, Wang Y (2018) Randomization: the forgotten component of the
randomized clinical trial. Stat Med 38(1):1–12
Russell S, Bennett J, Wellman J, Chung D, Yu Z, Tillman A, Wittes J, Pappas J, Elci O, McCague S,
Cross D, Marshall K, Walshire J, Kehoe T, Reichert H, Davis M, Raffini L, Lindsey G, Hudson
F, Dingfield L, Zhu X, Haller J, Sohn E, Mahajin V, Pfeifer W, Weckmann M, Johnson C,
Gewaily D, Drack A, Stone E, Wachtel K, Simonelli F, Leroy B, Wright J, High K, Maguire A
(2017) Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with
RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase 3 trial. Lancet 390:849–860
Scott N, McPherson G, Ramsay C (2002) The method of minimization for allocation to clinical
trials: a review. Control Clin Trials 23:662–674
Taves DR (1974) Minimization: a new method of assigning patients to treatment and control
groups. Clin Pharmacol Ther 15:443–453
Wei L, Durham S (1978) The randomized play-the-winner rule in medical trials. J Am Stat Assoc
73(364):840–843
41 Power and Sample Size

Elizabeth Garrett-Mayer
American Society of Clinical Oncology, Alexandria, VA, USA
e-mail: [email protected]

Contents
Introduction 768
Type I and Type II Errors 769
Illustrations of Power 769
Trade-Offs in Power Calculations 772
Clinically Meaningful Effect Sizes and Sample Size 772
Choosing Alpha and Beta (or Power) Levels 773
Power Calculations for Common Trial Designs 773
Comparative Studies 773
Single-Treatment Studies 777
Power Calculations for Non-inferiority Studies 779
Approaches for Calculation of Power and Sample Size 780
Available Software and Websites 780
Simulation Studies for Power Calculations 781
Power Calculations for Fixed Sample Size Studies 781
Alternatives to Power 782
Precision 782
Sample Size Calculations in Bayesian Settings 782
Practical Considerations 783
Evaluability of Patients 783
Interim Analyses and Early Stopping Rules 783
Summary and Conclusions 784
References 785

Abstract
A critical component of clinical trial design is determining the appropriate sample
size. Because clinical trials are planned in advance and require substantial
resources per patient, the number of patients to be enrolled can be selected to
ensure that enough patients are enrolled to adequately address the research
objectives and that unnecessary resources are not spent by enrolling too many
patients. The most common approach for determining the optimal sample size in
clinical trials is power calculation. Approaches for power calculations depend on
trial characteristics, including the type of outcome measure and the number of
treatment groups. Practical considerations such as trial budget, accrual rates, and
drop-out rates also affect the study team’s plan for determining the planned
sample size for a trial. These practical aspects of sample size determination are
also discussed.

Keywords
Power · Sample size · Type I error · Type II error · Clinically meaningful ·
Effect size

Introduction

A critical component of clinical trial design is determining the appropriate sample size.
Because clinical trials are planned in advance and require substantial resources per
patient, the number of patients to be enrolled can be selected to ensure that enough
patients are enrolled to adequately address the research objectives and that unneces-
sary resources are not spent by enrolling too many patients. The most common
approach for determining the optimal sample size in clinical trials is power calculation.
In clinical trials evaluating a new treatment regimen relative to a standard treatment,
power is the probability of concluding that the new treatment is superior to the
standard treatment if the new treatment really is superior to the standard treatment.
In designing a trial, the research team wants to ensure that the power of the trial is
sufficiently high. If the trial does not have sufficient power, the team risks
incorrectly concluding that a promising treatment has low efficacy.
The concept of power is based on hypothesis testing, a method used in most phase
II and phase III clinical trials. As an example, consider a randomized trial with two
treatment groups, an experimental treatment and a standard-of-care treatment, and
assume that the outcome of interest is a binary indicator of response (i.e., a patient
responds or does not respond to the assigned treatment). When a research team
embarks on a trial, they have a hypothesis about the level of response for the
treatment under study that would be considered “a success” relative to the control
group. If the researchers are treating a condition where the standard treatment leads
to a 10% response rate in patients, then perhaps a 25% response rate would be
considered sufficiently high in the experimental treatment to pursue further study. In
this example, to design the trial, the known information and assumptions regarding
response rates in the standard of care and new treatment are used to set up the
hypothesis test with two hypotheses: the null hypothesis (H0) and the alternative
hypothesis (H1). H0 represents the response rate if the new treatment is no better than
the standard of care; H1 represents the response rate if the new treatment is better
than the standard of care. When developing a power calculation, these are usually
written in the format

H0: p1 = 0.10, p2 = 0.10

H1: p1 = 0.10, p2 = 0.25,

where p1 is the assumed response rate (or response probability) in the standard
treatment and p2 is the assumed response rate in the experimental treatment.

Type I and Type II Errors

In our hypothesis testing example assuming one of the two hypotheses is true, at the
end of the trial, the research team will either choose the correct or the incorrect
hypothesis. If the null hypothesis is true, but the data collected lead the research team
to choose the alternative hypothesis, then the team has made a type I error. If the
alternative hypothesis is true, but the data lead the research team to choose the null
hypothesis, then the team has made a type II error. Table 1 shows the possible
outcomes that a research team can make.

Table 1 Possible results of a clinical trial designed using hypothesis testing

                                                          Truth
                                                   H0                H1
Hypothesis selected based on     H0      correct (✓)       Type II error
trial results                    H1      Type I error      correct (✓)
When designing a clinical trial, the research team wants to minimize making
errors and sets the type I and II error rates to relatively low levels. Traditionally, type
I error rates are set to values between 2.5% and 10%; type II errors are usually in the
range of 10–20%. Note that the type I error rate is also called the alpha (α) level of
the hypothesis test and the type II error rate the beta (β) level of the test. Power is 1
minus beta (1 − β). Because it is desirable to keep the type II error relatively low
(≤20% in most trials), the power is usually at least 80% in well-designed studies.
In our example, there are four elements to include to calculate our optimal sample
size: (1) the response rate under the null hypothesis, (2) the response rate under the
alternative hypothesis, (3) the type I error rate, and (4) the type II error rate. As will
be seen in later sections, when you have other types of outcomes, you may need
additional information to perform the power calculation (e.g., the assumed variance
if the outcome is a continuous variable).

Illustrations of Power

Fig. 1 Illustration of power and alpha levels for varying sample sizes with response rate as the outcome in a randomized trial with two treatment groups. Panels a, b, c, and d are ordered vertically from top to bottom.

Graphical displays illustrating power are shown in Fig. 1. In panel a, there are two
bell curves (i.e., distributions) where the x-axis is the difference in proportions from
our example. Each curve represents the response rate in the experimental group
minus the response rate in the control group under one of our hypotheses. The black
distribution represents the null hypothesis, where the difference in response rates is
0 (i.e., the response rate in both groups is 0.10 if the null hypothesis is true); the red
distribution represents the alternative hypothesis where the difference in response
rates is 0.15 (i.e., a 0.25 response rate in the experimental group minus a 0.10
response rate in the control group). These curves demonstrate the distributions of
expected differences in response rates. For the black curve, it is very likely we will
see a difference in response rate in the range of −0.05 to 0.05, given the height of the
curve over that range. However, if the alternative is true, it is rather unlikely that we
will see differences in that range, noted by the low height of the red curve in the
region of −0.05 to 0.05.
Although there is not substantial overlap in the red and black curves in Fig. 1a,
there is some overlap, suggesting there are some resulting observed differences in
response rates that are similarly consistent with both hypotheses. For example, if the
trial is completed and the difference in response rates is 0.07, this difference is about
equally well-supported by both H0 and H1 as can be seen by the height of the curves
at 0.07 on the x-axis. It is in this region where type I and II errors are likely to be
made. In Fig. 1a, the black hashed sections represent the tails of the null distribution
curve and, more specifically, the tails of the curve that correspond to the alpha level.
In this example, the alpha level has been set to 0.05 (or 5%), meaning that each tail
has 2.5% of the area under the curve. If the difference in response rates lies in one of
these tails, then the null hypothesis is rejected, as it is considered relatively unlikely
if the null hypothesis is true. Thus, the vertical black lines in Fig. 1a define the
rejection regions: if the difference in proportions is outside the vertical black lines,
then the null hypothesis is rejected because the data collected are inconsistent with
H0, relative to H1.
Focusing on the alternative distribution now, the area under the red curve to the
right of the rejection threshold line represents the power, which is the probability
of rejecting the null hypothesis if the alternative hypothesis is true. This is illustrated
in Fig. 1b where the red shaded area shows the power. Thus, in our example in
Fig. 1a, b, the vertical lines define the rejection region; the black hashed areas show
the alpha level of the trial and the red shaded region the power of the trial.
A critical aspect of the trial characteristics shown in Fig. 1a, b is the sample size.
In Fig. 1a, b, the sample size in each group is 100. Fixing alpha at 0.05, this leads to a
power of 90% (and, thus, a type II error of 10%). Figure 1c, d shows the effects on
the shapes of the distributions that represent our trial when we change the sample
size and the effects on power. In Fig. 1c, the sample size per group is 40. This leads to
wider distributions and more overlap in the distributions. Assuming that alpha is
maintained at 0.05 (i.e., 2.5% of the area in each tail defines the rejection region), the
power drops to 54% (i.e., 54% of the area under the red curve is to the right of
vertical rejection region threshold). This suggests that enrolling 40 patients per group
is not enough: with only 40 patients per group, if the experimental treatment is better
than the control treatment, we only have a 54% chance of making that conclusion at
the end of the trial, even if the observed differences in response rates are close to the
hypothesized difference of 0.15. We call this an “underpowered” trial because the
power is too low.
Figure 1d shows a trial that is “overpowered.” With a sample size of 160 per
group, there is almost no overlap in the curves. Fixing the alpha again at 0.05, there
is almost no region of the red curve that is to the left of the rejection threshold, and
the power is 98%. While the research team will be pleased to know that they have a
high chance of finding a significant difference in treatments if the treatments are
different, many would argue that this trial is wasteful because it utilizes too many
resources and could be completed without enrolling so many patients.

Trade-Offs in Power Calculations

From the previous section, there were four quantities that were specified to calculate
the power of the trial: (1) the response rates under H0; (2) the response rates under
H1; (3) the alpha level; and (4) the sample size. (As noted above, with other
outcomes, the assumed variance may also be required.) In theory, one can specify
the power and solve for any of the other four quantities. However, in most trials the
null hypothesis and the alpha level are prespecified. Ideally, the research team would
solve for the sample size based on the other quantities, but due to resource con-
straints, many trials have an upper limit on a feasible sample size, and thus power or
the alternative hypothesis is determined based on the sample size limitations.

Clinically Meaningful Effect Sizes and Sample Size

It is important to ensure that the alternative hypothesis represents a clinically
meaningful difference or clinically meaningful effect size. That is, the difference in
response rates should represent a difference that would lead experts in the area to
conclude that the experimental treatment represents a meaningful improvement in
response and worthy of either further study or should be used regularly in clinical
practice (depending on the phase of the trial and other supporting evidence).
Additionally, the alternative hypothesis should not be unrealistic: it is not useful to
assert a very large effect size as the alternative hypothesis if it is not likely attainable.
The required sample size will be small, but the trial is likely to fail to find a difference, and
even a moderately large observed difference would not lead to rejection of the null
hypothesis. If the alternative hypothesis in our example was set to difference in
response rates of 0.50 (i.e., the assumed response rate in the experimental group is
0.60 under the alternative hypothesis), the required sample size would only be 42
patients (21 per treatment), but the observed difference in response rates would have
to be relatively large to reject the null hypothesis. Looking back after the trial, if the
research team had seen a response rate of 33% in the experimental group and 10% in
the control group, the team might be disappointed to conclude that they cannot reject
the null even though the difference in response rates was 23%; the p-value for this
result would be 0.13 using Fisher's exact test.
Similarly, research teams should be discouraged from seeking small differences,
as they may not be clinically meaningful. This has been addressed in cancer clinical
trials by numerous authors, concerned that anticancer therapies may be approved for
use in cancer patients due to statistical significance, but may not confer any
meaningful improvement in survival (Sobrero and Bruzzi 2009). Studies like this led
to efforts to define clinically meaningful differences in cancer clinical trials, with the
goal of ensuring that trials would be designed with appropriate levels of power and
sample size to ensure that detectable effect sizes would be clinically meaningful
(Ellis et al. 2014).

Choosing Alpha and Beta (or Power) Levels

There are conventions in clinical trials that have been used for many decades,
leading to almost no consideration given to appropriate selection of alpha and beta
levels. Most commonly, one will see alpha set to 5% and beta set to 20% (i.e., power
set to 80%). There is nothing inherently correct about these levels; they are simply
the most commonly chosen. Strong arguments can be made that setting alpha low for a phase III trial is
appropriate: making a type I error when deciding whether or not to approve an
experimental agent is a very serious error. That is, a type I error would lead to
approving an agent when the agent is no better than the control group.
From an approval standpoint, making a type II error is less grievous; not approving
an effective treatment is less worrisome than approving ineffective treatments. Thus,
for trials that are intended to provide direct evidence for approval of the agent,
setting alpha substantially lower than beta may be sensible. However, in earlier
phase trials, setting alpha and beta to similar levels may be a better strategy. In many
early efficacy trials in cancer research, alpha and beta are both set to 10%, suggesting
that each type of error has equally bad implications. In this setting, the research team
is more willing to take an ineffective agent to the next phase of research (higher
alpha), but less willing to discard an effective agent (lower beta).

Power Calculations for Common Trial Designs

Different areas of medical research tend to use different primary outcomes in their trials,
leading to differences in test statistics used in hypothesis tests and thus in how power
calculations are performed. Most outcomes fall into one of three categories: contin-
uous, binary, or time-to-event outcomes. The example in the previous section was based
on a comparison of response rates and a binary outcome. In the following sections,
comparative and single-treatment studies are reviewed for each of these outcomes.

Comparative Studies

Binary Outcomes
A randomized trial with a binary outcome example was developed in section “Intro-
duction.” For binary outcomes, there are various options and assumptions that can be
used in power calculations. In Fig. 1, a normal approximation was used, which is
simple to calculate and works well when the response probabilities are not close to 0 or
1 in either group, and the sample size is relatively large. Other normal approximations
are also used which differ in their approach for estimating the denominator of the test
statistic (i.e., the standard error of the difference in proportions). Depending on the
sample size and the assumed response rates, the power estimates from these
approximations may be very similar or quite different. When planning a trial, the
approach used to calculate the power or sample size should be consistent with the
approach used to analyze the data at the end of the trial (Table 2).

Table 2 Differences in power using different power calculation approaches for a randomized trial with a binary indicator of response as the outcome, assuming response probabilities of 0.10 and 0.25 in the control and experimental treatments under the alternative hypothesis, respectively

Power calculation type      Sample size per treatment      Power (%)
Normal approximation 1      40                             54
Normal approximation 2      40                             31
Chi-square test             40                             42
Fisher's exact test         40                             33
Normal approximation 1      100                            90
Normal approximation 2      100                            80
Chi-square test             100                            83
Fisher's exact test         100                            76
Normal approximation 1      160                            98
Normal approximation 2      160                            93
Chi-square test             160                            96
Fisher's exact test         160                            94
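For readers who want to reproduce calculations like these, the sketch below uses Python's statsmodels package (an illustrative choice; dedicated packages such as NQuery or PASS would serve equally well). It uses the arcsine ("Cohen's h") effect size and a normal approximation; as Table 2 emphasizes, the answers depend on the approximation, so this sketch will not match every row of the table.

```python
# Illustrative power calculation for the two-proportion example
# (p1 = 0.10 vs p2 = 0.25) using the arcsine ("Cohen's h") effect size
# and a normal approximation. Other approximations give other numbers,
# which is exactly the point of Table 2.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.25, 0.10)  # arcsine-transformed difference
analysis = NormalIndPower()

for n_per_group in (40, 100, 160):
    power = analysis.power(effect_size=effect, nobs1=n_per_group,
                           alpha=0.05, ratio=1.0, alternative='two-sided')
    print(f"n = {n_per_group} per group: power = {power:.2f}")

# Solving instead for the sample size that achieves 90% power:
n_needed = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.90)
print(f"n per group for 90% power under this approximation: {n_needed:.0f}")
```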

Continuous Outcomes
In the previous example with a binary outcome, in addition to knowing the power
and alpha, one only needed to know the expected response rates under the null and
alternative hypotheses. When the outcome of the trial is a continuous variable, and
the goal is to compare the means between two groups, the research team must set null
and alternative hypotheses for the means in the groups, and they must also make an
assumption about the variance of the outcome. For example, assume that a trial is
being planned to evaluate the efficacy of vitamin D supplementation in individuals
with vitamin D deficiency where individuals are randomized to a low dose of
vitamin D (400 IU) in one group and a high dose in another group (2000 IU). The
outcome is 25(OH)D, which is a measure of vitamin D in the blood. The research
team assumes (based on their previous research) that the standard deviation of 25
(OH)D is approximately 14 ng/mL in individuals who do not have deficiency. The
research team plans to compare 25(OH)D levels between the two groups after
6 months of supplementation using a two-sample t-test.
Fig. 2 Illustration of the effect of the standard deviation on power in a trial with a continuous outcome. Panel a is on the top; panel b is on the bottom.

In the previous example, the width of the curves that determined power (Fig. 1)
was determined by both the assumed response rates under the null and alternative
hypotheses and the sample sizes in each group. When using a continuous outcome,
the means under the null and alternative hypotheses and the sample size factor into
the power calculation, but so does the assumed standard deviation. Thus, to
calculate power, the following are required: alpha, sample size, difference in
means under the null hypothesis (usually 0), the difference in means under the
alternative hypothesis, and the standard deviation in each group.
The researchers expect that the mean 25(OH)D levels will be 55 ng/mL in the
low-dose group and 65 ng/mL in the high-dose group after 6 months of supple-
mentation. Under the null hypothesis, the means would be the same; and under
the alternative hypothesis, the difference in means would be 10 ng/mL:

H0: μ2 − μ1 = 0 ng/mL

H1: μ2 − μ1 = 10 ng/mL

To achieve 90% power with a two-sided alpha level of 5%, and assuming that the
standard deviation is 14 ng/mL in each group, the research team would need to enroll
42 patients in each group. Figure 2a shows the distributions of the difference in
means under the null and the alternatives, assuming a sample size of 42 per group,
and a standard deviation (SD) of 14 ng/mL in each group. All else being the same, if
the assumed SD were larger, the power would decrease. Having a larger SD in each
group adds more variance and thus more imprecision in the estimates. Figure 2b
shows the effect of the larger SD on power if the sample size remains 42 per group.
Notice that the curves are wider, the overlap is greater, and the area under the
alternative distribution curve representing power (i.e., the red shaded portion) is
smaller. If the SD is 16 ng/mL in each group instead of 14, the power decreases from
90% to 82%. In the example above, it was assumed that the standard deviation in the
two groups is the same (14 ng/mL). Power calculations can also be performed
assuming different standard deviations in each group.
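A hedged sketch of the vitamin D calculation, again using the statsmodels package; the standardized effect size is the assumed difference in means divided by the assumed common standard deviation.

```python
# Sketch of the vitamin D sample size calculation: difference in means of
# 10 ng/mL, assumed SD of 14 ng/mL in each group, two-sided alpha of 0.05,
# 90% power, analyzed with a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

effect = 10 / 14  # standardized effect size = difference in means / SD
analysis = TTestIndPower()

n_per_group = analysis.solve_power(effect_size=effect, alpha=0.05,
                                   power=0.90, alternative='two-sided')
print(f"required n per group: {n_per_group:.1f}")  # roughly 42, as above

# Power if the SD turns out to be 16 ng/mL rather than 14:
power_sd16 = analysis.power(effect_size=10 / 16, nobs1=42, alpha=0.05)
print(f"power with SD = 16 ng/mL: {power_sd16:.2f}")  # near the 82% above
```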

Time-to-Event Outcomes
In many trials, the outcome of interest is a time interval. For example, in many phase
III randomized clinical trials in cancer research, survival time is the outcome, that is,
the time from enrollment on the trial until death. The challenging aspect of survival
time as an outcome is that not all patients have the event of interest (death) when the
data are analyzed. The patients who are still alive at the time of data analysis have
survival times that are “censored,” meaning that we know that they lived for a certain
amount of time but we do not know their actual survival time (which will occur in the
future). Statisticians have approaches for analyzing time-to-event outcomes, such as
survival time. Randomized trials with time-to-event outcomes which have inferences
based on the hazard ratio (i.e., the ratio of the event rates in the two groups being compared) only
require a few elements: the hazard ratio under the null hypothesis (usually assumed
to be 1, meaning equal event rates), the hazard ratio under the alternative, and the
type I and type II errors. With these quantities, one can solve for the number of
events required to achieve the desired power. While the simplicity is convenient,
when planning a trial, knowing the number of events needed is not sufficient: clinical
trial protocols require that the number of patients enrolled be stated. Using additional
information, including the expected accrual rate and the minimum amount of time
each patient will be followed for events combined with the required number of
events to achieve the desired power, the number of patients required for enrollment
can be calculated.
The hypothesis for evaluating a time-to-event outcome could be set up as follows,
where the null hypothesis assumes there is no difference in the event rates in the two
groups; the alternative in this example assumes that the event rate in group 2 (λ2) is
half as large as the event rate in group 1 (λ1). If this were a cancer treatment trial with
two treatments with survival time as the outcome, the alternative assumes that the
rate of death occurring in group 2 is half of that in group 1, meaning the treatment in
group 2 doubles the expected survival time:

H0: λ2/λ1 = 1

H1: λ2/λ1 = 0.5

Going back to the characteristics that affect sample size, if the hazard ratio were
0.75 instead of 0.5, a larger sample size would be required due to the smaller
difference in event rates between the two treatments. And, knowing that the power is driven by
the number of events, the sample size for a trial with 2 events per month would need
to be larger than a trial with 20 events per month to be completed in a similar amount
of time. Sample size for trials based on time-to-event outcomes also depends on the
accrual rate and the planned length of follow-up. For example, based on the expected
accrual rate of ten patients per month, a trial with only 1 year of follow-up (after the
last patient has enrolled) will require more patients than a trial with 2 years of follow-
up because the latter trial will observe more events prior to stopping the trial.
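A minimal sketch of the events calculation, assuming Schoenfeld's approximation for a two-sided comparison with 1:1 allocation (one standard choice among several); converting events to patients then requires the accrual and follow-up assumptions discussed above, which specialized software handles.

```python
# Required number of events for a two-group hazard-ratio comparison,
# using Schoenfeld's approximation with equal allocation and a two-sided
# test. Accrual rate and follow-up are then needed to turn events into
# a number of patients to enroll.
import math
from scipy.stats import norm

def events_needed(hazard_ratio, alpha=0.05, power=0.90):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(4 * z**2 / math.log(hazard_ratio)**2)

print(events_needed(0.5))   # about 88 events
print(events_needed(0.75))  # about 508 events: smaller effect, far more events
```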

Single-Treatment Studies

In single-treatment studies, the null hypothesis is based on an "historical control"
estimate of a control setting (e.g., the effect under the standard of care). Using our
previous example, imagine that instead of performing a trial with two treatments
where patients are randomized to the experimental and the control group, informa-
tion from patients previously treated with the control treatment is used to develop an
appropriate null hypothesis regarding what an ineffective response rate would be,
but no patients are enrolled in a control condition. In practice, patients are enrolled to
the experimental treatment, and at the end of the trial, their response rate (p) is
compared to what would have been expected had they been treated using the control
treatment. This is written (using our previous terminology) as

H0: p = 0.10

H1: p = 0.25

where p is the response rate and H0 represents a response rate too low for further
consideration and H1 a response rate that is sufficiently high for further study of the
treatment. The power calculation includes the same elements of alpha, beta, null, and
alternative hypothesis, but the calculation will have smaller sample sizes than the
comparative trial. This is due to lower variance – with only one treatment group
included, and a fixed comparator, there is only variance in the experimental
condition.
The same trade-offs between type I and II errors, clinical effect size, and sample
size are present in single-treatment studies as in randomized trials.

Single-Treatment Trial, Binary Outcome

Single Stage
In the example in section “Single-Treatment Studies,” with a null response rate of
0.10 and an alternative response rate of 0.25, the sample size can be calculated using
either an approximation (e.g., a normal approximation or chi-square approximation)
or an exact test. Because single-treatment studies tend to be smaller than comparative
studies, using Fisher’s exact test is often preferred (recall that the approximations
work best when sample sizes are large). If we assume a type I error (two-sided) of
0.05 and power of 90%, this trial would require 59 patients. If the research team
determined that they did not have sufficient resources to enroll 59 patients and could
only enroll 50 patients, the trial would have 83% power.
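The sketch below reproduces the exact binomial machinery with scipy. The construction assumed here puts the whole 0.05 rejection mass in the upper tail, which recovers the power quoted above for n = 59 and n = 50; that construction is an assumption, and a strict 0.025 upper tail would require somewhat more patients.

```python
# Exact binomial design for H0: p = 0.10 vs H1: p = 0.25. The rejection
# threshold r is the smallest response count whose upper-tail probability
# under H0 is at most alpha; power is the exact upper tail under H1.
from scipy.stats import binom

def exact_design(n, p0=0.10, p1=0.25, alpha=0.05):
    # smallest r with P(X >= r | n, p0) <= alpha
    r = next(r for r in range(n + 1) if binom.sf(r - 1, n, p0) <= alpha)
    return r, binom.sf(r - 1, n, p1)

for n in (50, 59):
    r, power = exact_design(n)
    print(f"n = {n}: reject H0 if >= {r} responses; power = {power:.2f}")
```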

Multistage
As described in section “Interim Analyses and Early Stopping Rules” (Practical
Considerations), many trials are designed to have sufficient sample size to maintain
power and alpha when interim analyses or early stopping rules are incorporated into
the design. One common single-treatment trial design for binary outcomes is the
Simon two-stage design which includes two stages, where n1 patients are enrolled in
the first stage and n2 patients are enrolled in the second stage, and uses a one-sided
test (Simon 1989). After n1 patients are enrolled, the number of responses is
compared to a predefined threshold (r1). If there are r1 or fewer responses, the trial
stops for futility. That is, if there are r1 or fewer responses at the end of stage 1, it is
unlikely that sufficient responses could be seen by the end of stage 2 to reject the null
hypothesis, and so no more patients are enrolled. If more than r1 responses are seen at
the end of stage 1, the trial continues, enrolling an additional n2 patients (for a total of
n1+n2 patients at the end of stage 2). At the end of stage 2, the total number of
responses (from stages 1 and 2 combined) is counted, and the null hypothesis is
rejected if there are sufficient responses.
Technically, there are many designs that can fit the criteria (due to the flexibility
induced by allowing the early look). Simon suggested a criterion that minimizes the
expected sample size of the trial if the null hypothesis is true. He referred to this
version of the design as the “optimal” two-stage design.
Because the trial has two “looks” at the data, the type I and II errors will differ
from that in a trial with only one look. In this type of trial, early stopping is only
allowed for futility, meaning there are two opportunities for a type II error and only
one opportunity for a type I error. To ensure that the errors are controlled, the sample
size is usually slightly larger than if the trial was performed in a single stage. For
example, a single-stage trial with H0: p = 0.20 versus H1: p = 0.40 requires a sample
size of 42 for a one-sided alpha of 0.05 to maintain a power of 0.90. Simon’s optimal
two-stage design requires 54 patients, with 19 patients in stage 1 and stopping early
for futility if fewer than 5 patients have responses. Other two-stage designs with alpha
of 0.05 and power of 90% could be selected that would allow a sample size closer to
42, but they would not meet the optimality criterion defined above.
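Operating characteristics of a two-stage design can be computed exactly from binomial sums, as in the sketch below. The stage-1 values (n1 = 19, stop if 4 or fewer responses; total n = 54) come from the text; the final threshold assumed here (reject H0 only if the total exceeds r = 15 responses) is taken from Simon's published optimal design for these error rates and is an assumption on that basis.

```python
# Exact operating characteristics of a Simon-style two-stage design.
from scipy.stats import binom

def two_stage_oc(p, n1=19, r1=4, n=54, r=15):
    n2 = n - n1
    # reject H0 only if stage-1 responses exceed r1 AND total responses exceed r
    reject = sum(binom.pmf(x, n1, p) * binom.sf(r - x, n2, p)
                 for x in range(r1 + 1, n1 + 1))
    early_stop = binom.cdf(r1, n1, p)            # futility stop after stage 1
    expected_n = n1 + (1 - early_stop) * n2      # expected enrollment
    return reject, early_stop, expected_n

alpha, pet0, en0 = two_stage_oc(p=0.20)  # behavior under H0
power, _, _ = two_stage_oc(p=0.40)       # behavior under H1
print(f"type I error = {alpha:.3f}, power = {power:.3f}")   # ~0.05 and >=0.90
print(f"P(early stop | H0) = {pet0:.2f}, expected n under H0 = {en0:.1f}")
```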

Single-Treatment Trial, Continuous Outcome


Single-treatment studies with continuous outcomes include primarily the same
information as for two-treatment studies for continuous outcomes. If one
were to undertake a trial of high-dose vitamin D supplementation described as a
single-treatment trial, the trial would enroll patients into a high-dose vitamin D
group (2000 IU). The null hypothesis was an expected mean of 25(OH)D level of
41 Power and Sample Size 779

55 ng/mL, which is what the research team assumed would occur in a low-dose
(400 IU) setting; the researchers assumed that giving a high dose would lead to a
higher mean, a mean of 65 ng/mL. This would be set up as follows:

H0: μ = 55 ng/mL

H1: μ = 65 ng/mL

The additional requirements to complete the power calculation would be the
assumed standard deviation (14 ng/mL from above) and the alpha and power levels.
With alpha of 0.05 and power of 90%, the required sample size was only 17. Note
that the sample size in the comparative trial described earlier was 84 (42 patients
per treatment). As expected, the sample size for the single-treatment trial is smaller,
due to the lower variance with just one treatment.
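A small normal-approximation sketch for this one-sample setting follows. Reproducing the n = 17 quoted above requires a one-sided alpha of 0.05, which is an assumption here about the calculation used; a two-sided 0.05 test would need about 21 patients.

```python
# Normal-approximation sample size for H0: mean = 55 vs H1: mean = 65 ng/mL
# with an assumed SD of 14 ng/mL. One-sided alpha 0.05 reproduces n = 17.
import math
from scipy.stats import norm

def one_sample_n(delta, sd, alpha=0.05, power=0.90, two_sided=False):
    z_alpha = norm.ppf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) * sd / delta) ** 2)

print(one_sample_n(delta=10, sd=14))                  # 17
print(one_sample_n(delta=10, sd=14, two_sided=True))  # 21
```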

Single-Treatment Trial with Time-to-Event Outcome


When designing a single-treatment trial with a time-to-event outcome, the standard
approaches are (1) to compare the median event time to a presumed median event
time in a control setting or (2) to compare the event rate to a presumed event rate in a
control setting. The first of these will require a larger sample size, because the
median event time tends to be imprecise. In the example below, this is expressed
as null hypothesis of a median (m) of 6 months and an alternative hypothesis median
of 9 months. Power calculations can be based on a test for differences in medians
(Brookmeyer and Crowley 1982).

H0: m = 6 months

H1: m = 9 months

Comparing event rates is more efficient but requires the assumption that the
event rate is constant over time. For example, the example below shows a null
hypothesis with an event rate (λ) of 0.30; the alternative hypothesis event rate is 0.20.
If the assumption that the event rate is constant over time is untrue, the inferences
from the trial could be invalid. Power calculations in this setting can be based on a
test of the rate parameter from the exponential distribution.

H0: λ = 0.30

H1: λ = 0.20
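One rough way to size this test, sketched below, relies on the exponential assumption: with d observed events, the log of the estimated rate is approximately normal with variance 1/d, so the usual normal-theory formula applies on the log-rate scale. This particular approximation is a choice made here for illustration; other parametric and exact approaches exist.

```python
# Rough events calculation for H0: rate = 0.30 vs H1: rate = 0.20, assuming
# exponential event times so that log(rate estimate) is approximately
# normal with variance 1/d for d observed events.
import math
from scipy.stats import norm

def events_for_rate_test(rate0, rate1, alpha=0.05, power=0.90):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil((z / math.log(rate0 / rate1)) ** 2)

print(events_for_rate_test(0.30, 0.20))  # about 64 events
```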

Power Calculations for Non-inferiority Studies

Most trials are designed as superiority studies; that is, the goal is to show that a
treatment is better than another treatment setting (which might be an active treat-
ment, a placebo, or no treatment at all). However, some trials have the primary
objective of showing that a new treatment is similarly effective to another treatment.


For example, the standard-of-care treatment might be reasonably efficacious but
have unpleasant side effects. A new treatment may be hypothesized to have similar
efficacy to the standard of care but a better side effect profile. In this example, the
research team would want to design a trial to show that for the efficacy outcome, the
new treatment is non-inferior to the standard of care. Non-inferiority trials require
the research team to define a margin of non-inferiority to set up the hypothesis test.
That is, how much worse could the new treatment be without being considered
significantly worse? As an example, if the standard of care had a response rate of
50%, setting a non-inferiority margin of 5% would mean that a response rate of
45% for the new treatment would be considered non-inferior.
The hypothesis test is set up in the opposite way from superiority trials. The null
hypothesis assumes that the new treatment is worse than the standard of care; the
alternative hypothesis assumes that the new treatment is equal to or better than the
standard of care (based on the margin of non-inferiority). Power is still the
probability that the null hypothesis is rejected if the alternative is true; alpha is
still the probability that the null hypothesis is rejected if the null hypothesis is true.
For obvious reasons, non-inferiority margins should be rather small. It would be
hard to argue that a treatment with a response rate that is 10–20% worse than the
standard of care is non-inferior. As a result, non-inferiority trials seek to focus on
small differences in treatment and require large sample sizes to demonstrate non-
inferiority for levels of power and alpha that are comparable to those in superiority
trials.
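The sketch below makes the size of these trials concrete, using the 50% response rate and 5% margin from the example above. The one-sided alpha of 0.025 and 90% power are assumptions added here, and the formula is a standard normal approximation rather than the only possible method.

```python
# Normal-approximation sample size for a non-inferiority comparison of
# response rates: both treatments assumed to respond at 50%, margin 5%.
import math
from scipy.stats import norm

def ni_sample_size(p_std, p_new, margin, alpha=0.025, power=0.90):
    z = norm.ppf(1 - alpha) + norm.ppf(power)            # one-sided test
    variance = p_std * (1 - p_std) + p_new * (1 - p_new)
    return math.ceil(z**2 * variance / (p_new - p_std + margin) ** 2)

print(ni_sample_size(0.50, 0.50, 0.05))  # roughly 2,100 patients per group
```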

Approaches for Calculation of Power and Sample Size

Available Software and Websites

Although there are formulas for power and sample size calculations that can be
implemented by hand (or with a calculator), clinical trial power calculations are
performed using software packages specifically designed for power and sample size
estimation (e.g., NQuery, PASS, GPower, EAST), standard all-purpose statistical
software (e.g., SAS, Stata, R, or SPSS), or using online websites (e.g., www.crab.org,
http://powerandsamplesize.com).
Sample size estimation software packages provide many options and user-
friendly interfaces for standard trial designs and some more complex settings, such
as ANOVA, clustered designs, and early stopping rules. Many academic institutions
and companies involved in trial design have a license for at least one package, and
some are free to download (e.g., GPower). Statistical packages like SAS and Stata
include standard trial design options for power calculations (e.g., one- and two-
sample comparisons of mean and proportions). While R has standard designs
available, it also has user-contributed libraries that allow users to perform power
calculations for more complex designs (e.g., cluster-randomized trials can be
designed using the “clusterPower” library.

Simulation Studies for Power Calculations

Not all trial designs fall into the categories discussed above. For example, for a
trial with a longitudinal design where the primary objective includes comparing
trajectories (or slopes) based on repeated measures per individual in two or more
groups, there may not be a standard software package that will suit the trial design
needs. In that case, sample size could be determined based on simulation studies.
That is, just as above where assumptions were made regarding effect size, variance,
sample size, and alpha, assumptions are made for all the relevant parameters, and
trials are simulated under the set of assumptions. For each trial simulated under the
parameters consistent with the alternative hypothesis, the trial results are analyzed,
and it is determined if the null hypothesis is rejected or accepted (at the predefined
alpha level). The proportion of simulated trials for which the null hypothesis is
rejected provides an estimate of the power. To get a precise estimate of power, the
number of simulated trials has to be reasonably large. If the power is too low for a
given sample size, the sample size can be increased, and the simulations can be
performed using the larger sample size; this can be repeated until the desired power
level is reached.
While simulations allow the research team to investigate trial properties for
complex designs, they have the drawback that they usually involve quite a few
parameters for which the research team needs to make assumptions. However,
preliminary information on which to base those assumptions is not always available,
which leads the research team to consider a range of values for some parameters.
Additionally, simulations require skill in programming and depth of knowledge of
statistics and probability distributions. Simulations can be time-consuming: they
may take considerable time to develop and undertake (depending on the number of
simulations, the number of parameters, and the range of parameters considered) and
to summarize and present in graphical or tabular format.
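A minimal simulation skeleton is sketched below, using the vitamin D t-test example so the result can be checked against the analytic answer (about 90% with 42 patients per group); the same pattern extends directly to designs with no closed-form power formula.

```python
# Power by simulation: simulate trials under H1, analyze each one, and
# estimate power as the proportion of simulated trials that reject H0.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2022)
n_sims, n_per_group, alpha = 10_000, 42, 0.05

rejections = 0
for _ in range(n_sims):
    low = rng.normal(loc=55, scale=14, size=n_per_group)   # low-dose group
    high = rng.normal(loc=65, scale=14, size=n_per_group)  # high-dose group
    if ttest_ind(high, low).pvalue < alpha:
        rejections += 1

print(f"simulated power: {rejections / n_sims:.3f}")  # close to 0.90
```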

Power Calculations for Fixed Sample Size Studies

In clinical trials, the sample size is planned prior to embarking on the trial, and so the
planning for the sample size incorporates the clinical effect size, alpha, power, and
variance. Once the sample size is determined, there may be additional analyses the
research team plans to perform to address secondary objectives of the trial. It is
unlikely that the sample size would be increased to ensure sufficient power for
secondary objectives, but it can be helpful to calculate power for secondary analyses
or to report the effect size that is detectable for secondary analyses based on a fixed
power and the sample size for the trial.
In some cases, after the trial is completed, the data from the trial may be used for
“secondary data analyses,” which implies that the data were collected but not
intended to be used for these analyses. The researchers may write a proposal for
these additional data analyses and the proposal should justify that the objectives of
the secondary data analyses can be achieved with the sample size of the dataset. This
can be a helpful perspective for the research team proposing the analyses as it will
determine the effect sizes that are detectable for the predetermined sample size for a
given power level. The research team may have an overpowered trial, in which case
they may want to use a relatively small alpha level to conclude “significance.” If the
trial is underpowered (i.e., the sample size is too small to detect clinically relevant
effect sizes), the research team may decide that the analysis should not be conducted,
or they may decide to perform the analysis, but will be cognizant of the low power
and take it into account when interpreting their results, and they may choose to avoid
significance testing altogether.

Alternatives to Power

Inferences for some trials are not based on traditional hypothesis tests, and therefore
power is not relevant.

Precision

Some trials have goals of estimation. For example, a single-treatment trial of a new
agent in a patient population defined by a biomarker could have the primary
objective of estimation of median progression-free survival. Instead of testing
that median survival is larger than a null value, the sample size can be motivated
by the precision with which the median survival is estimated. If the research team
wants to estimate the median survival with a 95% confidence interval with a width
no greater than 2 months, the team can use information regarding the expected
accrual rate, together with assumptions about possible observed values of median
survival, to determine how many patients to enroll to ensure that the width is
within 2 months.
Although this is a sensible approach, there are no standard guidelines that allow
one to determine what a sufficiently precise confidence interval might be. Addi-
tionally, at the end of the trial, the 95% confidence interval still needs to be
interpreted, and a decision has to be made regarding whether or not further study
of the agent should be pursued in the patient population. The lower limit of the
confidence interval in this case could be used to determine if the treatment should
be pursued in additional research studies and for calculating effect sizes for future
studies.
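One hedged way to make this precision target operational is sketched below. It assumes exponential survival (so that, with d events, the log of the estimated rate is approximately normal with variance 1/d) and an assumed true median of 9 months; both assumptions, and the 2-month target width, are illustrative rather than taken from any specific trial.

```python
# Precision-based planning for median survival, assuming exponential
# survival times: a 95% CI for the median is roughly m*exp(+/- z/sqrt(d))
# with d observed events, so solve for the d that meets the target width.
import math
from scipy.stats import norm

def events_for_ci_width(median, max_width, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    d = 1
    while median * (math.exp(z / math.sqrt(d))
                    - math.exp(-z / math.sqrt(d))) > max_width:
        d += 1
    return d

print(events_for_ci_width(median=9, max_width=2))  # roughly 313 events
```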

Sample Size Calculations in Bayesian Settings

The approaches in this chapter have focused primarily on frequentist interpretations
of hypothesis testing and frequentist calculations. Bayesian statistics use
different concepts for analysis and interpretation of data. Thus, Bayesian trial
designs differ in the way that they determine sample sizes and in the concepts that
they use that are similar to type I and II errors. Although some quantities can be
directly calculated, power and sample size calculations for Bayesian trial designs
are calculated using simulations because Bayesian trial designs tend to be adaptive.
See Lee and Chu for more details, including advantages to Bayesian trial designs
(Lee and Chu 2012).

Practical Considerations

Evaluability of Patients

Clinical trials must state a planned sample size for review and approval by institu-
tional review boards (IRBs) and for other scientific review committees. Technically,
studies should not enroll beyond the planned sample size, and doing so can result in
punitive actions for not following the trial protocol. But, in the practical implemen-
tation of trials, not all patients contribute information to address the primary objec-
tive and are deemed “inevaluable” as per a definition in the protocol. For example, a
patient may enroll in the trial but drop out of the trial prior to receiving the trial
treatment, and thus the patient did not receive treatment and has no information on
the outcome of interest. Some trials may have approaches for how this patient may
contribute to the analysis, but many trials would deem this patient as inevaluable. In
trials in which there are expected to be patients who are inevaluable, the research
team should take this into account when planning the sample size. For example, if
the sample size calculation requires 60 patients for sufficient power and the team
anticipates 10% of patients will be inevaluable, the projected enrollment should be at
least 67 (67 × (1 − 0.10) = 60.3).
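The inflation arithmetic is simple enough to show in a few lines, matching the numbers above.

```python
# Inflating planned enrollment for inevaluable patients (60 evaluable
# patients needed, 10% of enrollees expected to be inevaluable).
import math

def inflate_enrollment(n_evaluable, inevaluable_rate):
    return math.ceil(n_evaluable / (1 - inevaluable_rate))

print(inflate_enrollment(60, 0.10))  # 67, since 67 * 0.90 = 60.3 >= 60
```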

Interim Analyses and Early Stopping Rules

The examples provided in previous sections (except for the Simon’s two-stage
design) have assumed single-stage designs, where the trial enrolls patients and
does not evaluate the data until the trial ends. In practice, many trials have early
stopping rules or are designed to have interim analyses, which may affect the trial’s
enrollment.

Early Stopping Rules


Trials with early stopping rules allow trials to stop prior to total planned enrollment
if there is early evidence to stop early. Stopping rules can be included for early
evidence of futility and/or efficacy. As noted above, futility stopping allows an
early look where a type II error could be made. Early stopping for efficacy (i.e.,
early evidence suggests the alternative hypothesis is true) allows an early look
where a type I error could be made. Trials in which both futility and efficacy
stopping are included will need to account for these looks at the data and increase
the sample size accordingly to ensure that the desired power and alpha can be
achieved. If the power calculations do not account for the early looks, the true type
I and II error rates will be higher than assumed.

Interim Analyses
In addition to early stopping rules, other analyses can be planned to take place in the
midst of a trial. For example, the research team may be required to make assumptions
regarding the trial, such as the accrual rate or the event rate, without much preliminary
data, and the team designs the trial to adjust the sample size to account for any
incorrect assumptions. This does not allow the team free rein to make changes to
the trial based on interim results – this type of analysis must be planned carefully in
advance with clearly defined plans for how any changes would be made. In addition,
these mid-trial analyses are usually planned so that the research team is blinded from
knowing efficacy estimates. One type of interim analysis that could be planned is an
estimation of “conditional power” where the trial team determines the likelihood of
rejecting the null hypothesis based on the evidence in the data collected thus far. If the
conditional power is very high, the trial may continue without any revisions to the
sample size; if the conditional power is moderate (e.g., between 50% and 80%), the
sample size may be increased to bring the power up to a level of 80% or higher; if the
conditional power is low, the trial may be discontinued due to futility (i.e., with a low
conditional power, a very large sample size increase would be required, suggesting
that the detectable effect size would be considered too small to be clinically relevant).
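A sketch of conditional power by simulation follows, using the two-proportion example from earlier in the chapter. The interim counts are hypothetical, and the chi-square final test is one convention among several; a fuller treatment would also distinguish between assuming the current trend and assuming the original design hypotheses.

```python
# Conditional power by simulation at a hypothetical interim look (100 per
# group planned, 50 per group observed). The remaining patients are
# simulated under the currently estimated response rates.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n_planned, n_interim = 100, 50          # per-group planned and interim sizes
resp_ctrl, resp_exp = 4, 10             # hypothetical interim response counts

def conditional_power(n_sims=5_000, alpha=0.05):
    p_ctrl, p_exp = resp_ctrl / n_interim, resp_exp / n_interim
    n_remaining = n_planned - n_interim
    rejections = 0
    for _ in range(n_sims):
        total_ctrl = resp_ctrl + rng.binomial(n_remaining, p_ctrl)
        total_exp = resp_exp + rng.binomial(n_remaining, p_exp)
        table = [[total_exp, n_planned - total_exp],
                 [total_ctrl, n_planned - total_ctrl]]
        if chi2_contingency(table)[1] < alpha:  # p-value of final test
            rejections += 1
    return rejections / n_sims

print(f"conditional power under the current trend: {conditional_power():.2f}")
```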

Summary and Conclusions

Power and sample size calculations are an important part of clinical trial planning.
There are standard approaches for many traditional designs, including those for
continuous, binary, or time-to-event outcomes, and for single-treatment and for
randomized trials. For more complex designs, including those with adaptive designs,
interim analyses, or early stopping, more sophisticated calculations need to be
performed to ensure that type I and type II errors are controlled. While some
power calculations can be conducted using standard software or online tools, it is
wise to engage a biostatistician to ensure the calculations are performed properly and
any nuances of the design have been accounted for.

Key Facts
• Type I and II errors must be controlled when designing trials to ensure that
inferences lead the research team to correct conclusions with high probability.
• There are many types of trial designs, and power and sample size calculations can
be performed using simple approaches (which are available in software or online)
or using complex simulation studies.
• Practical constraints need to be taken into account in addition to the output from
formulas for power or sample size.

References
Brookmeyer R, Crowley JJ (1982) A confidence interval for the median survival time. Biometrics
38:29–41
Ellis LM, Bernstein DS, Voest EE, Berlin JD, Sargent D, Cortazar P, Garrett-Mayer E, Herbst RS,
Lilenbaum RC, Sima C, Venook AP, Gonen M, Schilsky RL, Meropol NJ, Schnipper LE (2014)
American Society of Clinical Oncology perspective: raising the bar for clinical trials by defining
clinically meaningful outcomes. J Clin Oncol 32(12):1277–1280
Lee JJ, Chu CT (2012) Bayesian clinical trials in action. Stat Med 31(25):2955–2972
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Sobrero A, Bruzzi P (2009) Incremental advance or seismic shift? The need to raise the bar of
efficacy for drug approval. J Clin Oncol 27(35):5868–5873
42 Controlling Bias in Randomized Clinical Trials

Bruce A. Barton
Department of Population and Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, USA
e-mail: [email protected]

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Selection Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Cautionary Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
Performance Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
Detection Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
Attrition Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
Reporting/Publication Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
Other Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801

Abstract
Clinical trials are considered to be the gold standard of research designs at the top
of the evidence chain. This reputation is due to the ability to randomly allocate
subjects to treatments and to mask the treatment assignment at various levels,
including subject, observers taking measurements or administering question-
naires, and investigators who are overseeing the performance of the study. This
chapter deals with the five major causes of bias in clinical trials: (1)
selection bias, or the biased assignment of subjects to treatment groups; (2)
performance bias, or the collection of data in a way that favors one treatment
group over another; (3) detection bias, or the biased detection of study outcomes
(including both safety and efficacy) to favor one treatment group over another;
(4) attrition bias, or differential dropout from the study in one treatment group
compared to the other; and (5) reporting and publication bias, or the tendency of
investigators to include only the positive results in the main results paper (regard-
less of what is specified in the study protocol) and the tendency of journals to
publish only papers with positive results. While other biases can (and do) occur
and are also described here, they tend to have lower impact on the integrity of the
study. The definitions of these biases will be presented, along with how to
proactively prevent them through study design and procedures.

Keywords
Treatment randomization · Treatment masking · Selection bias · Performance
bias · Detection bias · Attrition bias · Reporting bias

Introduction

Clinical trials are generally considered the least biased of any research study design
and are widely regarded as the gold standard of research designs (Doll 1998).
The two major factors credited for the lower risk of bias in clinical trials are the use
of random treatment assignment for subjects and masking of the assigned treat-
ment. No matter how it is performed, treatment randomization “levels the playing
field” so that the treatment groups are typically similar in terms of baseline patient
characteristics, medical history, etc. Assignment to treatment is not influenced by
factors of any kind – for example, severity of disease, previous history, gender, age,
or location do not affect the probability of randomization for any subject. The
treatment masking of the assigned treatment alleviates any bias of the observer to
record data in a way that is preferential to one treatment group over the other.
However, clinical trials are not impervious to bias, although the risk is cer-
tainly reduced compared to other designs (Lewis and Warlow 2004). Perhaps the
best compendium of biases in clinical trials is the Cochrane Handbook for
Systematic Reviews of Interventions (Higgins et al. 2008), which lists the poten-
tial biases in clinical trials: (1) selection bias, (2) performance bias, (3) detection
bias, (4) attrition bias, (5) reporting/publication bias, and (6) other sources of bias.
The Cochrane Collaboration has developed a tool for assessing the risk of bias,
which is recommended for the evaluation of studies to include in each systematic
review submitted to the Cochrane Reviews, but is also useful to help investigators
reduce, if not eliminate, bias in their studies. In addition, the CONSORT State-
ment and, in particular, the Elaboration and Extension information can help
investigators avoid inadvertent biases in their study design, protocol, and
reporting.
The following sections describe the types of problems that can result in each type
of bias, how to detect these problems, how serious they are, and how a researcher can
design a study that avoids them.

Sources of Bias

Selection Bias

Selection bias refers to how subjects are allocated to treatment groups. There
are many examples in the literature of nonrandomized studies yielding positive
results for a treatment only to have a subsequent randomized study overturn those
results (e.g., Wangensteen et al. 1962; Ruffin et al. 1969). The underlying reason for
the positive results in a nonrandomized setting is that the investigators give the
treatment to patients preselected to be responsive. Without a randomized
control group, it is not possible to obtain an unbiased estimate of the treatment effect. Bias can also arise
if patient outcomes are compared to historical or other non-contemporaneous (or
nonrandomly chosen) controls. The solution to this source of bias is the randomiza-
tion of subjects between (or among) treatment/control groups.
In a randomized clinical trial, subjects are randomly allocated to treatment/control
groups according to a masked allocation sequence, either static or dynamic. It is
important to understand that the randomization must be masked so that the investi-
gators are not able to determine what the next treatment assignment might be. For
example, if the “randomization” is based on the last digit of the patient’s medical
record number, investigators will be able to determine what treatment the current
patient will receive – and advise the patient (either directly or indirectly) on whether
to enroll in the study based on that knowledge and their own bias related to treatment
efficacy. In RCTs, randomization means masked randomization with barriers built
into the sequence to prevent determining the next treatment allocation. The section
below discusses approaches to do this.
Typically, this is done through a randomization process as part of the allocation;
for example, subjects are assigned to treatment groups based on a sequence of
random numbers. The complete process is as follows.
Before the study is started, a sequence of random numbers is generated, either
from a table of random numbers or from a computer program/web site. That
sequence can take a number of forms, ranging from single digits (e.g., 1, 2, 3, ...)
to multiple digits (e.g., 001, 002, 003, ...) or (especially from computer programs)
any decimal value between 0.0 and 1.0. Once the full range of the random
numbers is known, subjects who receive numbers in the lower half of the sequence
of random numbers are assigned to one treatment group, while subjects who receive
numbers in the top half are assigned to the other treatment group. Once the treatment
assignments have been determined for each random number, the numbers are put
back into the original order to achieve a fully randomized list (Lachin 1988).
This simple approach, however, does not assure balance of the numbers of
subjects between the treatment groups. It also does not avoid a lengthy string of
the same treatment. The usual approach to assure the balance is through a permuted
block randomization (Matts and Lachin 1988). In this approach to random alloca-
tion, random sequences are generated in groups of numbers, known as “blocks.”
Each block of, say, four random numbers is sorted by the random numbers in the
block with the lowest two numbers assigned to one treatment group and the other
two numbers assigned to the other treatment group. The random numbers are then
put back into the original sequence, forcing a balance between the numbers of
subjects assigned to each treatment group while generating a truly random sequence.
The downside to this approach is that, if it becomes known what the block size is,
the final treatment assignment in the block is easy to determine. The way to eliminate
this potential bias is to randomly set the block size (e.g., for a particular study,
the block sizes could be 2, 4, or 6) so that it is not possible to determine the next
treatment assignment without knowledge of the block size that is currently being
filled. This information is, of course, hidden from the investigators as only the
random sequence of treatment assignments is available to the randomization system.
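To make the mechanics concrete, the following sketch (Python; all function and parameter names are hypothetical) generates an allocation sequence from permuted blocks of randomly varying size. It is a minimal illustration of the technique just described, not a validated randomization system; a production system would also log each assignment and keep the sequence hidden at the coordinating center.

```python
import random

def permuted_block_sequence(n_subjects, block_sizes=(2, 4, 6),
                            arms=("A", "B"), seed=20220101):
    """Generate an allocation sequence using permuted blocks.

    Block sizes are drawn at random from `block_sizes` so that an observer
    who deduces one block size still cannot predict the final assignment
    in the next block. A fixed seed makes the sequence reproducible.
    """
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_subjects:
        size = rng.choice(block_sizes)                      # random block size
        block = [arms[i % len(arms)] for i in range(size)]  # balanced block
        rng.shuffle(block)                                  # permute the block
        sequence.extend(block)
    return sequence[:n_subjects]

allocation = permuted_block_sequence(24)
print(allocation)
print("A:", allocation.count("A"), "B:", allocation.count("B"))
```

Note that truncating the final block can leave a small residual imbalance, at most half of one block, which is one reason the largest block size is usually kept modest.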
In addition to the allocation bias described above, there is a potential bias related
to an imbalance of factors that are highly predictive of outcomes. For example, in
studies of diabetes in which HbA1c is the outcome, body mass index (BMI) could be
an important predictor of HbA1c. To assure a balance of BMI in the two treatment
groups so that a biased outcome is not generated by a BMI imbalance, stratification
can be used. Patients’ BMI could be classified as “normal” (BMI 18–25) or “ele-
vated” (BMI >25.0). The patients with normal BMI would have a separate random-
ization to assure an equal distribution of normal BMI patients between the treatment
groups, as would the patients with elevated BMI. It should be noted that typically
there is a separate randomization within the clinical center in a multicenter clinical
trial so that the treatments are balanced within each clinic, eliminating any treatment
bias within center.
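Stratified randomization then amounts to maintaining one independent allocation list per stratum, as in this sketch (again illustrative; the site and BMI labels and the compact block generator are assumptions for the example):

```python
import random

def blocks(n, rng, arms=("A", "B"), sizes=(2, 4)):
    # Minimal permuted-block generator for one stratum
    seq = []
    while len(seq) < n:
        block = list(arms) * (rng.choice(sizes) // len(arms))
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n]

rng = random.Random(7)
strata = [(clinic, bmi) for clinic in ("site01", "site02")
          for bmi in ("normal", "elevated")]
# One independent masked list per clinic-by-BMI stratum; each incoming
# subject draws the next unused assignment from his or her stratum's list.
lists = {s: blocks(50, rng) for s in strata}
print(lists[("site01", "elevated")][:8])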
Typically, a formal stratification using this approach would be limited to three
variables because of the complexity of maintaining the randomizations across multiple
strata. For situations where it is important to balance the randomization for more than
three important factors, the alternative is a minimization or dynamic allocation
strategy. This strategy keeps track of the treatment allocations for each factor of
interest. As patients are randomized using a standard approach, an imbalance may
develop with, say, females so that more females are assigned to one treatment group
(Treatment A) compared to the other (Treatment B). The usual approach, to preserve
an element of randomization, is to reduce the probability of being assigned to
Treatment A so that Treatment B has a higher probability of assignment for the
next female. This reduction in the probability of assignment to Treatment A is a
function of the imbalance, so that the usual probability of assignment to Treatment A
of 0.50 is reduced to 0.40 or even 0.25 depending on the decisions during the study
design phase related to how much tolerance to imbalance is possible. The adjustment
of this probability also implies that an a priori randomization scheme is not possible
since the probability of assignment will vary for each new patient depending on the
balance across the important factors. For each new patient ready for randomization,
this recalculation of probability of assignment to Treatment A (and the associated
probability of assignment to Treatment B) must take into account any imbalance for
each important factor, so that an overall probability is calculated. Thus, the random-
ization system must be available at all times with access to the previous randomiza-
tions so that these calculations can be made in real time.
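The sketch below illustrates one simple version of this dynamic strategy, in the spirit of Pocock-Simon minimization with a biased coin. The factors, arm labels, and the favoring probability of 0.75 are illustrative assumptions rather than a prescribed implementation.

```python
import random

def minimization_assign(history, new_subject, factors, rng, p_favor=0.75):
    """Assign a subject by minimization with a biased coin.

    `history` holds (subject, arm) pairs already randomized; each subject
    is a dict of factor levels, e.g. {"sex": "F", "bmi": "elevated"}.
    The arm that would leave the marginal factor counts more balanced is
    favored with probability `p_favor`, preserving some randomness.
    """
    def imbalance(arm):
        total = 0
        for f in factors:
            counts = {"A": 0, "B": 0}
            for subj, a in history:
                if subj[f] == new_subject[f]:
                    counts[a] += 1
            counts[arm] += 1          # hypothetically assign the new subject
            total += abs(counts["A"] - counts["B"])
        return total

    imb = {arm: imbalance(arm) for arm in ("A", "B")}
    if imb["A"] == imb["B"]:
        return rng.choice(["A", "B"])
    favored, other = ("A", "B") if imb["A"] < imb["B"] else ("B", "A")
    return favored if rng.random() < p_favor else other

rng = random.Random(11)
history = []
for _ in range(12):
    subj = {"sex": rng.choice(["F", "M"]),
            "bmi": rng.choice(["normal", "elevated"])}
    arm = minimization_assign(history, subj, ["sex", "bmi"], rng)
    history.append((subj, arm))
print([arm for _, arm in history])
```

Because each assignment depends on all previous assignments, a system like this must run in real time, which is the point made above.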
In a cluster-randomized study, the randomization is generated at the facility level (e.g., nursing homes, ERs, hospitals) so that everyone recruited from that facility
receives the same treatment. In this type of study, masking is less of an issue since
everyone is getting the same treatment. However, the main problem is that the
baseline characteristics of the randomized units may not be similar, as they are
when the randomization is at the individual level. For cluster-randomized studies, a
recently described approach, the minimal sufficient balance randomization method
(Zhao et al. 2011), and other approaches can balance the covariates of the random-
ization units so that even a small number of clusters (e.g., nursing homes within a
city) will be reasonably balanced on important covariates.
Unfortunately, there are few, if any, flexible randomization programs available
online or even in the more advanced statistical software packages (e.g., SAS, Stata,
SPSS, R). Programs can certainly be written in these languages to implement any of
the above techniques, but programming knowledge of each package is required. The
implementation of adaptive randomization strategies, in particular, will require
specialized programming. While there are a few R-based programs, such as
RRApp, that provide some of this functionality, the more advanced techniques still
require custom programming.

Cautionary Note

Many database management systems, such as REDCap, have randomization capabilities built into the system. While the primary purpose is to provide convenient randomizations within a single system, a secondary purpose is to allow for real-time
verification of eligibility of a patient before he/she is randomized to a study. Under
intention-to-treat principles, once a patient is randomized, he/she is in the study
without exception. To avoid randomizing ineligible patients, verification of eligibil-
ity criteria is essential before the randomization is issued.
It is critical that all randomizations have tracking for patient eligibility and group
assignment that cannot be changed after the randomization decision is issued.
REDCap offers this capability, as do other similar systems. Such a record verifies
the randomization that was issued for the subject, which is critical if any dispute over
the “intention-to-treat” paradigm develops. In addition, it can be used to routinely
check on the performance of the randomization system, especially in terms of
verification of eligibility.
Selection bias also includes a bias in the actual performance of the randomization
of subjects. This can include randomization under difficult situations, such as
transport to emergency departments, in trauma situations, or when the subject is
not conscious. In situations where it may not be possible to contact a central
randomization facility (e.g., data coordinating center or research pharmacy) to
randomize the subject, recent publications have proposed a “step-forward” random-
ization design (Zhao et al. 2010). Database systems, such as REDCap, have a mobile
app which can be programmed to have the next randomization ready, but in a masked
format. In the case of REDCap, even if the cell phone/tablet is not in contact with the
central REDCap database, the next randomization is available – and the individual
can still be checked for eligibility prior to the randomization. Once contact with
REDCap is restored, the information is transmitted to the main study database, and
the next randomization is downloaded.
If the randomization cannot be done electronically, then an opaque sealed enve-
lope system can be used, but it is much easier to get around the masked treatment
assignments and “game” the masking. Various methods are used to maintain the
mask of the next randomization in the nonelectronic randomization. For example,
the person obtaining the randomization must sign the opaque envelope (along with
the date and time) on the seal on the back of the envelope as well as signing the
actual card with the treatment assignment. A third party, not otherwise associated
with the study, could be in charge of the randomization envelopes and log the date,
time, and requestor for each envelope. Once the envelope is opened, that subject is in
the study.
Berger and Christophi (2003) have made the point that allocation
concealment (i.e., no ability to predict the next random treatment allocation) is
critical to the successful randomization of patients to form equivalent treatment
groups. The impact of selection bias is hard to estimate, although a recent review
of oral health intervention studies (Saltaji et al. 2018) indicated that studies with
inadequate or unknown method of sequence generation/masking had larger effect
sizes (difference in effect size = 0.13; 95% CI, 0.01–0.25) than studies where the
generation/masking of treatment allocation was adequate.
Finally, some studies use a “run-in” period to assess a subject’s ability to adhere to
the study requirements and to the intervention as well as to detect any early adverse
effects (Laursen et al. 2019). There is some concern that a run-in period can create a
selection bias because the run-in explicitly tests the ability of subjects to adhere to the
intervention, whereas, in studies without a run-in, the ability of people to adhere to the
intervention becomes part of the study post-randomization. Because of this selection
bias, the results of these studies could be less generalizable than otherwise. However,
the counterargument is that, with an intervention that requires that the subject performs
a task in some way, medical personnel would not prescribe the intervention without
knowing if the subject could (and would) perform the task. In the HIPPRO study, for
example, the intervention was the wearing of hip pads to prevent hip fractures in
nursing home residents. If the resident was not able (or willing) to wear the pads daily
during the run-in period, he/she was not included in the main study. In reality, medical
personnel would not ask a nursing home resident to wear hip pads if the resident was
not capable of wearing them all day every day. So, in some cases, a run-in period
makes sense. In a drug study, however, a run-in period could eliminate people who
experience a specific side effect, while, in reality, these people could be prescribed the
drug. So the question of whether to include a run-in period should be considered
carefully to be sure that generalizability is not compromised. In a systematic review by
Laursen et al. (2019) of 470 clinical trials reported in Medline in 2014, 5% (25 studies)
had a run-in period of varying design and duration. Of the 25 studies, 23 had
incomplete reports of the run-in period in the study results paper. Industry-sponsored
studies were more likely to have run-in periods than studies funded by other sources.

Performance Bias

Performance bias refers to the collection of data from subjects in a way that does not
accurately reflect subject responses, i.e., the collection of data in a way that is
favorable to the data collector’s treatment of choice. If the data collection staff are
not masked, a number of biases can enter into the data, including exaggeration (or
diminishing) of outcomes, failing to record adverse events, failing to administer all
data collection forms in a neutral way, and misinterpreting laboratory results. For
example, a data collector/interviewer could record “headaches” as an expected and
nonserious adverse event for study participants in one treatment group but the same
symptoms as unexpected and serious for the other group. To avoid this bias, all clinic
staff responsible for any form of data collection must be masked to the treatment
assignment. This is inclusive of all data collection staff or anyone who can influence
the staff, such as the principal investigator of the study or of the clinical site. This
implies that only the person obtaining the randomization would be unmasked – such
as pharmacy staff who are packaging or providing the study treatments in a study of
medications. In these clinical trials, the clinic staff and even the laboratory staff
should be masked to study treatment.
Care must be taken if there are laboratory results that could unmask the study
physicians or nurses. For example, in studies of a new treatment in type 2 diabetics
with uncontrolled diabetes, HbA1c or fasting glucose values could unmask the
physician reviewing the results. Similarly, in patients with sickle cell disease,
patients taking hydroxyurea will have higher fetal hemoglobin levels than those
not taking it. In these situations, the laboratory results may need to be first sent to the
data center to be masked (especially placebo results) prior to forwarding to the clinic
staff, but the true results always need to be recorded in the patient’s medical records.
In the WARCEF study (Pullicino et al. 2006), in which patients were randomized to
aspirin or warfarin (coumadin), the INRs for the warfarin patients were put into the
study/clinical records without modification, but the data center generated INRs for
the aspirin patients to avoid unmasking those patients. The process of masking
laboratory results depends on the condition being studied, but generally involves
imputing a random level in one (or both) groups that will not distinguish between the
treatments. This will need to be stated explicitly in the protocol and to the IRB along
with the explanation of how these results will be handled in the clinical center. It is
important to note that the actual results be recorded in a fashion that will not
jeopardize patient care but will mask the information for study personnel. Depending
on the electronic health record system used at each site, this may need specialized
programming to be done effectively.
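As a concrete, deliberately simplified illustration of this kind of procedure, the sketch below substitutes sham values for the comparator arm while preserving the true value for the medical record. The distribution parameters and names are hypothetical; a real study would prespecify the sham-generation rule in the protocol and base it on the observed distribution in the treated arm.

```python
import random

rng = random.Random(42)

def sham_inr(rng):
    # Illustrative only: draw a plausible-looking INR near a hypothetical
    # therapeutic target of 2.5, so masked staff cannot spot the comparator.
    return round(rng.gauss(2.5, 0.4), 1)

def report_inr(true_inr, arm):
    # The true value always goes to the patient's medical record; the
    # masked study report carries a sham value for the non-warfarin arm.
    study_value = true_inr if arm == "warfarin" else sham_inr(rng)
    return {"medical_record": true_inr, "study_report": study_value}

print(report_inr(2.8, "warfarin"))   # reported unchanged
print(report_inr(1.0, "aspirin"))    # sham value substituted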
In studies of behavioral interventions, the same is true, although it may be
difficult to maintain masking if there is a difference in the level of “attention” in
the intervention group compared to the control group. Many behavioral clinical trials
now include an equivalent number of sessions for both groups to avoid a bias created
by this additional attention in the intervention group.
Subjects should also be masked to their treatment group (if possible) so that all
reporting by the subjects is accurate and not related to the perceived effects of the
treatment. This is particularly true in studies where subjective or patient-reported outcomes are used, including occurrence of nonserious adverse events as well as
quality of life scales. Because consent forms typically list expected adverse events of
the treatments (and these expected AEs are readily available on various drug
information web sites), subjects may be more inclined to report symptoms, such as
headaches or colds, as treatment-related AEs if they know that they are in the
intervention group. It has been reported in the research journals (The Lancet Oncol-
ogy Editorial 2014) and even the lay press (Marcus 2014) that subjects in masked
clinical trials have formed groups on social media, such as Facebook, to offer
support for others and to compare treatment effects. Participants in these social
groups typically compare symptoms and effects and try to determine (based on
other information on the internet) which treatment they are on. If subjects do
determine what treatment they are on, this can lead to other “downstream” biases,
such as detection bias or attrition bias, both discussed below.
All clinic personnel should have a complete understanding of the protocol and
protocol-specific procedures. This can be accomplished through webinars and online
training sessions. It can also be reinforced through testing of personnel on the
protocol – with emphasis on specific elements that are important to the different
types of clinic staff. There should also be refresher sessions through the study.
Finally, ongoing monitoring of study performance is critical to identify problems
and biases before they critically disrupt the integrity of the study. This involves
generating Quality Assurance/Quality Control reports for the study leadership and
for the Data and Safety Monitoring Board to review. Included in these reports should
be presentations of data quality (e.g., time to submit information after a visit, percent
of data that is “clean” on initial entry), measurement quality (e.g., number of
measurements within study-defined limits and with normal variability, variability
of measurements across staff taking them), laboratory quality (e.g., results from
masked duplicate assays, variability of coefficients of variation across time), and
quality assurance activities (e.g., retraining, site visit reports). These reports should
include techniques such as Shewhart plots (Dunn 2019) to display unusual variability
(either excessive variability or a lack of it) across visits and across clinic staff.
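A minimal Shewhart-style check might look like the following sketch, which flags observations outside mean plus or minus 3 standard deviations. The example data are invented; real QC reports would compute limits from an in-control reference period and add the usual run rules.

```python
import statistics

def shewhart_flags(values, k=3):
    """Return (index, value) pairs falling outside mean +/- k*SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]

# e.g., days from clinic visit to data submission at one site
lags = [2, 3, 2, 4, 3, 2, 21, 3, 2, 4, 3, 2]
print(shewhart_flags(lags))   # the 21-day lag is flagged for follow-up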

Detection Bias

Detection bias refers to a systematic bias in determining outcomes in subjects by treatment groups (Wirtz et al. 2017; Rundle et al. 2017; Dusingize et al. 2017). Even
in a well-masked study, there are frequently unavoidable indicators of treatment
group in a subject that can contribute to a biased determination of an outcome,
particularly a subjective outcome, such as pain level, but it can also influence the
assessment of a clinical outcome, such as myocardial infarction in cardiology or
acute chest syndrome in sickle cell disease or even cause of death. For example, in a
study of multiple sclerosis treatments, the initial assessment of stage of disease and
progression by unmasked study neurologists showed an advantage of the
intervention over the standard of care treatment. However, when the same assess-
ment was conducted by masked neurologists, there was no advantage shown by the
analysis of the masked results (Noseworthy et al. 1994). In general, masked
observers produce smaller treatment effect sizes that are also more reproducible
(Hróbjartsson et al. 2012). A recent state-of-the-art review by Kahan et al. (2017)
indicated that the best approach to adjudication of events and outcomes in a clinical
trial depends on the nature of the study and of the events/outcomes that are subject
to adjudication.
The approach that can typically minimize or even eliminate detection bias is to
engage a group of clinicians, unrelated to the study, to determine the outcome based
on prespecified criteria listed in the final study protocol. It is important that these
criteria are finalized before any outcome data are reviewed. This Outcomes Adjudi-
cation Committee would receive only data, reports, and notes that are de-identified
and on which any reference to the study (and any potentially unmasking informa-
tion) is redacted. Minutes should be taken at Committee meetings and become
part of the study documents. Decisions by the committee should be clearly indicated
in the Committee meeting minutes and should be entered into the study database by
the Committee secretary and verified by the Committee chair or designee. The
adjudications could only be changed by the Committee chair through the database
audit trail and such actions noted in subsequent Committee meeting minutes.
The independent adjudication could also include a review of unexpected serious
adverse events (SAEs) to verify the relatedness of the SAE to the study therapy (i.e.,
drug, behavior modification, device, or biologics). The same basic approach as for
clinical outcomes should be taken with SAEs.
An additional detection bias arises when subject recall bias prevents determining
whether an outcome, especially a soft or subjective outcome, has occurred. For example, in
studies of sickle cell disease, if a patient is feeling better in general, he/she may
forget about the pain episode 2 weeks ago, while a patient who feels miserable may
not. Electronic daily pain diaries (especially cell phone apps) have been very useful
in capturing transient subject outcomes with corroborative information, such as
prescription use, in a number of disease areas, such as sickle cell disease, atrial
fibrillation, and diabetes glucose monitoring. A number of these apps are being
paired with sensors to help determine if, for example, a sickle cell pain crisis is about
to start, if an episode of atrial fibrillation has started or is imminent, or if a subject's
continuous glucose monitor is indicating out-of-control blood sugar levels. With the
availability of wearable devices, including those that can run apps, it is practical to
design studies that collect daily information on these types of outcomes or adherence
information (such as length of Transcendental Meditation or yoga practice sessions)
to avoid recall bias.

Attrition Bias

Attrition bias can have two causes. First, some outcomes, although recorded in the
database, may be excluded from analysis for a variety of reasons. Some reasons may
be technical in nature (e.g., outcome not assessed within prespecified time window
or not assessed using protocol specified lab test). Others may be more logistic (e.g.,
subject did not receive protocol mandated intervention, patient found to be ineligible
after randomization). Second, subjects may have dropped out of the study or can no
longer be located for follow-up. These subjects may have dropped out of the study
for reasons related to treatment (Hewitt et al. 2010), so that it is critical to keep this
“missingness” to a minimum and preferably less than 5%. Differential attrition
between the two treatment groups may be an indication that side effects (or even
treatment effects) are not acceptable to subjects, and, rather than confront clinic staff
with that decision, the subjects are walking away quietly. It is important that the
Informed Consent Form (ICF) be written in such a way as to allow indirect (at least)
searching for subject information, including vital status (through the National Death
Index). If the subject has died, the causes of death (through the NDI) can be obtained
and would be important to complete the mortality information.
A systematic review (Akl et al. 2012) assessed the reporting, extent, and handling
of loss to follow-up and its potential impact of treatment effects in randomized
controlled trials published in the five top medical journals. The authors calculated the
percentage of trials in which the relative risk would no longer be significant when
participant’s loss to follow-up varied. In 160 trials, with an average loss to follow-up
of 6%, and assuming different event rates in the intervention groups relative to the
control groups, between 0% and 33% of trials were no longer significant.
The least biased approach to analysis of a clinical trial in general is the intention-
to-treat (ITT) approach. This approach, devised at the time of the Anturane Study
controversy (The Anturane Reinfarction Trial Research Group 1978; Temple and
Pledger 1980), has three principles: (1) all patients are analyzed in the treatment
group to which they were assigned; (2) outcomes for each subject must be recorded;
and (3) all randomized subjects should be included in the analysis. The problem is
that it is rare that a study has the outcome(s) for all subjects. So, some form of
imputation is usually required to satisfy all three of the ITT principles. The rule of
thumb is that, if the level of missing outcome data is 5% or less, it will not affect the
overall study results and imputation is not critical. If the level of missingness is 10%
or more, multiple imputation is a good clinical practice and should be performed. If
the level of missingness is 20% or more, imputation can overly influence the results
and, in a sense, drive the results. In these situations, other approaches for dealing
with the missingness would be necessary. These approaches will depend on the
study, but could include checking other sources for information (NDI, Social
Security, Medicare, all-payer claims databases, or contacting other family members).
Briefly, multiple imputation is a strategy that generates expected outcomes for
patients missing them (Sterne et al. 2009). This is usually done using a model-based
approach based on observed data. However, because even model-based imputation
can produce the same expected outcome for multiple patients, the end result of a
single imputation is likely to yield a smaller standard deviation (and, thus, standard
error of the regression coefficients), making it easier to reject the null hypothesis than
is warranted. The solution is to produce multiple sets of imputed data, each with a
random variation of the imputed values designed to restore the full variability of the
outcome. The same strategy can be used to generate expected predictors when
important predictors are missing. The multiply imputed data sets are then analyzed
using an analysis stratified by imputation data set and the results combined to
produce a single analytic result.
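For illustration, the sketch below simulates a trial, imputes missing outcomes m = 20 times, and pools the treatment-effect estimates by Rubin's rules. The imputation model is deliberately simplified (arm-specific draws from observed values plus random noise rather than a full model-based procedure), and all names and parameters are hypothetical; only the pooling formula is standard.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-arm trial with a continuous outcome and ~15% missingness
n = 200
arm = rng.integers(0, 2, n)              # 0 = control, 1 = treatment
y = 1.0 * arm + rng.normal(0.0, 2.0, n)  # true treatment effect of 1.0
y[rng.random(n) < 0.15] = np.nan         # outcomes missing (at random here)

m = 20                                   # number of imputed data sets
obs = ~np.isnan(y)
estimates, variances = [], []
for _ in range(m):
    y_imp = y.copy()
    for a in (0, 1):
        pool = y[obs & (arm == a)]       # observed outcomes in this arm
        miss = np.isnan(y_imp) & (arm == a)
        # Draw from the observed arm-specific values and add noise so each
        # imputed data set varies, restoring the lost variability
        y_imp[miss] = rng.choice(pool, miss.sum()) + rng.normal(0, 0.5, miss.sum())
    diff = y_imp[arm == 1].mean() - y_imp[arm == 0].mean()
    var = (y_imp[arm == 1].var(ddof=1) / (arm == 1).sum()
           + y_imp[arm == 0].var(ddof=1) / (arm == 0).sum())
    estimates.append(diff)
    variances.append(var)

# Rubin's rules: pooled estimate; within- plus between-imputation variance
q_bar = float(np.mean(estimates))
total_var = float(np.mean(variances)) + (1 + 1 / m) * float(np.var(estimates, ddof=1))
print(f"pooled effect {q_bar:.2f}, SE {np.sqrt(total_var):.2f}")
```

The between-imputation term (1 + 1/m) * B is what prevents the artificially small standard errors that a single imputation would produce.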
A second approach to the analysis, sometimes called the “modified intention to
treat,” excludes subjects who have not received a protocol-specified minimum
“dose” of the intervention. The problem with the modified ITT approach is that
people who drop out early due to immediate adverse events are excluded from the
analysis – the exact problem that ITT was designed to prevent. This approach is used
frequently in oncology studies that enroll patients with advanced disease. The
rationale is that a number of these patients do not live long enough after randomi-
zation to receive the minimum dose – or potentially any dose. Thus, using ITT for
efficacy analysis could artificially reduce the success rate in those studies.
A third approach, the “per-protocol” approach, excludes subjects who do not
receive the protocol-specified complete dose for the treatment to which they were
assigned. It is typical that studies report the ITT as the primary analysis and the per-
protocol as the secondary analysis of the primary outcome. Thus, the readers see the
most unbiased result as well as the “full dose” result. An additional concept that is
sometimes included under the per-protocol approach is to analyze subjects according
to the treatment received, not as randomized. The rationale for this is that this is a
“cleaner” approach to estimating treatment effect, rather than keeping people in their
original groups, regardless of what they received. The ITT analysis tends to attenuate
the estimated treatment effect toward the null, whereas the "as-treated" approach tends
to report the effect of the treatment actually received.
The study may also inadvertently cause differential attrition. In a study of growth
in children (not a clinical trial, but the lesson is valuable), the blood pressure of the
children was measured at one of the visits, and a note was given to the children with
elevated blood pressure to take home to their parents. A higher proportion of the
children who received the note did not return for future visits compared to those who
did not receive the notes. Actions may have unintended consequences and need to be
pilot tested for acceptance by subjects.
Attrition bias can lead to nonrandom missingness (as in the example above), so
that study results could be compromised, even if the nonrandomness is recognized. If
follow-up (and, thus, outcome) data are missing not at random, interpretation of the
study is not straightforward and could be curtailed to certain subgroups. In conjunc-
tion with the Data and Safety Monitoring Board and QA/QC reports discussed
above, analyses of missingness should be included so that the DSMB can identify
early any nonrandom missingness and potential causes.

Reporting/Publication Bias

Reporting bias relates to the reporting of significant treatment comparisons and the
underreporting of nonsignificant comparisons. As Chan and Altman (2005) note,
this could be the most substantial of the biases that can affect clinical trials. The
Catalog of Bias lists several types of selective reporting of outcomes: (1) reporting
only those outcomes that are statistically significant, (2) adding new outcomes after
reviewing the data that are statistically significant, (3) failing to report the safety data
(i.e., adverse events) from the trial, and (4) changing outcomes of interest to include
only those that are statistically significant (Catalog of Bias Collaboration 2019). The
CONSORT statement and associated checklist (Consolidated Standards of Reporting
Trials; http://www.consort-statement.org) provide a comprehensive list of items that
should be included in the reporting of clinical trials (Schulz et al. 2010; Moher
et al. 2010). Most medical journals now subscribe to the CONSORT principles,
including the principles that all primary and secondary outcomes should be reported
(Item 17a) and all adverse events reported (Item 19). With the enforced use of
CONSORT by the ICMJE (International Committee of Medical Journal Editors),
which requires that all outcomes and adverse events be reported for a clinical trial,
the reporting bias should be minimized (Thomas and Heneghan 2017). This does
assume that studies will follow CONSORT and that journals verify compliance.
It should also be noted that outcome data must be posted on ClinicalTrials.gov,
along with the study protocol and adverse event information. There are strict
timelines for posting the outcome results from the study, with financial penalties
for failure to comply. This requirement will also tend to diminish this bias in the
future. A number of systematic reviews of publications versus protocols filed on
ClinicalTrials.gov or other publicly accessible data sources have documented that
between 6% and 12% of reported studies have different primary outcomes than
specified in the protocol or a different analytic approach (Dwan et al. 2008, 2013,
2014; Zhang et al. 2017; Perlmutter et al. 2017). This is complicated by the
possibility that protocols were updated after data were viewed (or even analyzed),
and it is not possible to review previous protocol versions, indicating that this may be
a substantial underestimate of the problem. So even the manuscript statement
“Analyses were conducted according to the protocol” is not necessarily meaningful.
There is no clear way to determine if the analyses were conducted using prespecified
analytic techniques or if a number of analytic approaches were used until one that
produced a significant result was found and the statistical section of the protocol (or
the SAP) was changed to reflect the new approach. One indicator of this possibility is
if an "esoteric" statistical technique is used without a clear explanation of why it was chosen.
The second aspect of this type of bias, the publishing bias, is the tendency of
journals to publish studies with significant results. This is much more difficult for an
investigator or research group to counter. This bias can have a wide-reaching impact
since meta-analyses use predominantly published results, although more are starting
to include results posted on ClinicalTrials.gov. Because meta-analyses are frequently
used in reports to policy makers regarding health care, this bias can lead to the
exaggeration of the efficacy of a new medication or procedure and, potentially, the
underestimation of safety issues. Because the sample of patients in a clinical trial
does represent a single sample in meta-analytic terms, investigators working on
meta-analyses need to be very careful of the publication bias in RCTs. Countering
this bias is not easy, since journal editors control what gets published; investigators
involved in negative studies should argue in the article Discussion section (as well as
in the cover letter to the editor) that publication of the
negative results is important to keep the literature balanced. Statisticians and epide-
miologists who develop meta-analyses for treatments of specific conditions should
be careful to search ClinicalTrials.gov for negative, non-published results to enhance
the balanced inclusion of studies in the meta-analysis.

Other Sources of Bias

Other sources of bias include statistical programming quality control concerns. The
data collected by a clinical trial will be recorded in a database, such as REDCap. The
data is typically longitudinal with repeated drug administration, clinical visits,
laboratory tests, adverse events, and outcome adjudication results. Assembling
these data into an analytic database can be challenging, requiring the merging of
multiple data sets into a set of longitudinal records for each patient. This aspect of
each study involves high-risk programming that must be closely checked through a
documented quality control process. The programs that determine the outcomes
for each patient are also high risk. All of these programs need to be
subjected to multiple layers of quality control to verify the accuracy of the prepara-
tion of the data for analysis. The actual analysis programming is less risky because
those programs are using the prepared analytic data sets. However, with today’s
statistical software, a mistake in one line of computer code can reverse the treatment
groups for efficacy and for safety outcomes.
Another source of bias occurs in cluster-randomized studies where it is known
that recruitment of subjects in the control facility is more difficult than in the
intervention facility. It will likely be necessary to allow a longer recruitment period
in control facilities to achieve the appropriate sample size. There is also concern that
the characteristics of the subjects recruited in control facilities could be different than
those in intervention facilities. These characteristics should be monitored during
recruitment to verify that the two cluster-randomized treatment groups are similar.
The DSMB reports should contain information on patient characteristics in these
studies. In addition, the characteristics of the facilities should be compared as well.
In the event of an imbalance in patient characteristics between treatment groups, if
discovered early in the study, an adaptive randomization plan could be implemented
so that the patient characteristics would balance themselves prior to the end of
recruitment.
In studies that are not FDA monitored, a statistical analysis plan (SAP) often either does
not exist or is vague and loosely followed. FDA typically requires a SAP to be
filed before the final analysis can be conducted. The reason is simple – if the analytic
techniques are not prespecified, the investigators are able to select the analytic
technique that supports their research without concern for what was stated in a
SAP. While a preliminary analysis approach was likely included in an NIH applica-
tion, that can easily be dismissed as preliminary and not even reported. If a reduced
version of a SAP is included in the protocol (and, thus, on ClinicalTrials.gov), it is
much harder to ignore it. But few protocols include much more than a cursory
explanation of the analysis plan unless there is FDA oversight, in which case the
major elements of the SAP are included in the protocol.

Summary

This chapter describes the major sources of bias in RCTs and possible solutions. In most
cases, depending on the nature of the study, other solutions can be found in addition to
those described here. There is no “push button” approach to safeguarding a trial from
bias – every RCT is different and will require different approaches to controlling and,
hopefully, eliminating bias. New sources of bias can arise through social media. For
example, a slight difference in the appearance of a placebo (or the inherent differences
between behavioral study treatments/conditions), combined with a study-related social
media group, can give study participants enough information to unmask a study,
resulting in misleading patient-reported outcomes.
Investigators, therefore, must be constantly watchful for bias to proactively prevent bias
from occurring and to retroactively correct existing problems.

Key Facts

1. Biases can still occur in randomized clinical trials, the study design that is
considered to be the gold standard.
2. Random treatment group assignment and data collection masked to subjects’
treatment group assignment will prevent most biases.
3. Random treatment assignments are always required for an RCT; treatment group
masking can be more challenging.
4. Other types of bias, such as reporting and publication bias, are unrelated to
randomization and masking. While reporting bias can be avoided by following
the CONSORT statement and checklist, publication bias is in the hands of the
journal editors.

Cross-References

▶ Adherence Adjusted Estimates in Randomized Clinical Trials
▶ Administration of Study Treatments and Participant Follow-Up
▶ ClinicalTrials.gov
▶ Financial Conflicts of Interest in Clinical Trials
▶ Design and Development of the Study Data System
▶ Fraud in Clinical Trials
▶ Good Clinical Practice
▶ Intention to Treat and Alternative Approaches
▶ Masking of Trial Investigators
▶ Masking Study Participants
▶ Missing Data
▶ Participant Recruitment, Screening, and Enrollment
▶ Patient-Reported Outcomes
▶ Principles of Clinical Trials: Bias and Precision Control
▶ Reporting Biases

References
Akl AE, Briel M, You JJ, Sun X, Johnston BC, Busse JW, Mulla S, Lamontagne F, Bassler D, Vera
C, Alshurafa M, Katsios CM, Zhou Q, Cukierman-Yaffe T, Gangji A, Mills EJ, Walter SD, Cook
DJ, Schünemann HJ, Altman DG, Guyatt GH (2012) Potential impact on estimated treatment
effects of information lost to follow-up in randomized controlled trials (LOST-IT): systematic
review. BMJ 344:e2809
Berger VW, Christophi CA (2003) Randomization technique, allocation concealment, masking, and
susceptibility of trials to selection bias. J Mod Appl Stat Methods 2(1):80–86
Catalog of Bias Collaboration (2019). Catalog of Bias, November 19. Retrieved from catalogofbias.
org
Chan A-W, Altman DG (2005) Identifying outcome reporting bias in randomised trials on PubMed:
review of publications and survey of authors. BMJ 330(7494):753
Doll R (1998) Controlled trials: the 1948 watershed. BMJ 317:1217
Dunn K (2019) Shewhart charts, July 17. Retrieved from https://learnche.org/pid/process-monitoring/shewhart-charts
Dusingize JC, Olsen CM, Pandeya NP, Subramaniam P, Thompson BS, Neale RE, Green AC,
Whiteman DC, Study QS (2017) Cigarette smoking and the risks of basal cell carcinoma and
squamous cell carcinoma. J Invest Dermatol 137(8):1700–1708
Dwan K, Altman DG, Arnaiz JA, Bloom J, Chan AW, Cronin E, Decullier E, Easterbrook PJ, Von
Elm E, Gamble C, Ghersi D, Ioannidis JP, Simes J, Williamson PR (2008) Systematic review of
the empirical evidence of study publication bias and outcome reporting bias. PLoS One 3(8):
e3081
Dwan K, Gamble C, Williamson PR, Kirkham JJ, Reporting Bias Group (2013) Systematic review
of the empirical evidence of study publication bias and outcome reporting bias – an updated
review. PLoS One 8(7):e66844
Dwan K, Altman DG, Clarke M, Gamble C, Higgins JP, Sterne JA, Williamson PR, Kirkham JJ
(2014) Evidence for the selective reporting of analyses and discrepancies in clinical trials: a
systematic review of cohort studies of clinical trials. PLoS Med 11(6):e1001666
Editorial (2014) #Trial: clinical research in the age of social media. Lancet Oncol 15(6):539
Hewitt CE, Kumaravel B, Dumville JC, Torgerson DJ, Trial Attrition Study Group (2010)
Assessing the impact of attrition in randomized controlled trials. J Clin Epidemiol 63(11):
1264–1270
Higgins JPT, Altman DG, on behalf of the Cochrane Statistical Methods Group and the Cochrane Bias Methods Group (2008) Assessing risk of bias in included studies. In: Higgins JPT, Green S (eds) Cochrane handbook for systematic reviews of interventions. Wiley, Chichester
Hróbjartsson A, Thomsen AS, Emanuelsson F, Tendal B, Hilden J, Boutron I, Ravaud P, Brorson S (2012) Observer bias in randomized clinical trials with binary outcomes: systematic review of
trials with both blinded and unblinded assessors. BMJ 344:e1119
Kahan BC, Feagan B, Jairath V (2017) A comparison of approaches for adjudicating outcomes in
clinical trials. Trials 18:266
Lachin J (1988) Properties of simple randomization in clinical trials. Control Clin Trials 9:312–326
Laursen DRT, Paludan-Muller AS, Hrobjartsson A (2019) Randomized clinical trials with run-in
periods: frequency, characteristics and reporting. Clin Epidemiol 11:169–184
Lewis SC, Warlow CP (2004) How to spot bias and other potential problems in randomised
controlled trials. J Neurol Neurosurg Psychiatry 75:181–187. https://doi.org/10.1136/
jnnp.2003.025833
Marcus AD (2014) Researchers FRET as social media lift veil on drug trials: online chatter could
unravel carefully built construct of ‘blind’ clinical trials. Wall Street Journal, July 29
Matts J, Lachin J (1988) Properties of permuted-block randomization in clinical trials. Control Clin
Trials 9:327–344
Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M,
Altman DG (2010) Explanation and elaboration: updated guidelines for reporting parallel group
randomised trials. BMJ 340:c869
Noseworthy JH, Ebers GC, Vandervoort MK, Farquhar RE, Yetisir E, Roberts R (1994) The impact
of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial.
Neurology 44(1):16–20
Perlmutter AS, Tran VT, Dechartres A, Ravaud P (2017) Statistical controversies in clinical
research: comparison of primary outcomes in protocols, public clinical-trial registries and
publications: the example of oncology trials. Ann Oncol 28(4):688–695
Pullicino P, Thompson JLP, Barton B, Levin B, Graham S, Freudenberger RS (2006) Warfarin
versus aspirin in patients with reduced cardiac ejection fraction (WARCEF): rationale, objec-
tives, and design. J Card Fail 12(1):39–46
Ruffin JM, Grizzle JE, Hightower NC, McHardy G, Shull H, Kirsner JB (1969) A cooperative double-blind evaluation of gastric freezing in the treatment of duodenal ulcer. NEJM 281(1):16–19
Rundle A, Wang Y, Sadasivan S, Chitale DA, Gupta NS, Tang D, Rybicki BA (2017) Larger men
have larger prostates: detection bias in epidemiologic studies of obesity and prostate cancer risk.
Prostate 77(9):949–954. https://doi.org/10.1002/pros.23350
Saltaji H, Armijo-Olivo S, Cummings GG, Amin M, da Costa BR, Flores-Mir C (2018) Impact of
selection bias on treatment effect size estimates in randomized trials of oral health interventions:
a meta-epidemiological Study. J Dent Res 97(1):5–13
Schulz KF, Altman DG, Moher D, for the CONSORT Group (2010) CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 340:c332
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR
(2009) Multiple imputation for missing data in epidemiological and clinical research: potential
and pitfalls. BMJ 338:b2393
Temple R, Pledger G (1980) The FDA’s critique of the Anturane Reinfarction trial. NEJM
303(25):1488–1492
The Anturane Reinfarction Trial Research Group (1978) Sulfinpyrazone in the prevention of cardiac
death after myocardial infarction. NEJM 298(6):289–295
Thomas ET, Heneghan C (2017) Catalogue of bias collaboration, outcome reporting bias.
In: Catalogue of biases. http://www.catalogueofbiases.org//outcomereportingbias
Wangensteen OH (1962) Achieving "physiological gastrectomy" by gastric freezing. JAMA 180(6):439
Wirtz HS, Calip GS, Buist DSM, Gralow JR, Barlow WE, Gray S, Boudreau DM (2017) Evidence
for detection bias by medication use in a cohort study of breast cancer survivors. Am J
Epidemiol 185(8):661–672
Zhang S, Liang F, Li W (2017) Comparison between publicly accessible publications, registries, and protocols of phase III trials indicated persistence of selective outcome reporting. J Clin
Epidemiol 91:87–94
Zhao W, Ciolino J, Palesch Y (2010) Step-forward randomization in multicenter emergency
treatment clinical trials. Acad Emerg Med 17(6):659–665
Zhao W, Hill MD, Palesch Y (2011) Minimal sufficient balance—a new strategy to balance baseline covariates and preserve randomness of treatment allocation. Stat Methods Med Res 24(6):989–1002
Masking of Trial Investigators
43
George Howard and Jenifer H. Voeks

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
Why Mask Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
How to Mask Investigators in Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813

Abstract
The substantial investment of both time and money to mount a clinical trial would
not be made without an underlying belief that the new treatment is likely
beneficial. While a lack of definitive evidence can underpin the equipoise of
investigators that is necessary to mount a new trial, the success in previous early
phase trials (or even animal models) provides a natural foundation for an expected
benefit in subsequent phase trials. Both investigators and patients can share this
belief, and these expectations of treatment efficacy for new therapies introduce
the potential for bias in clinical trials. The benefits, completeness, and reporting of
masking in clinical trials are described, as are approaches for implementing
and maintaining the mask.

Keywords
Masking · Blinding · Assessment of outcomes

G. Howard (*)
Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
J. H. Voeks
Department of Neurology, Medical University of South Carolina, Charleston, SC, USA
e-mail: [email protected]


Introduction

The substantial investment of both time and money to mount a clinical trial would
not be made without an underlying belief that the new treatment is likely beneficial.
While a lack of definitive evidence can underpin the equipoise of investigators that is
necessary to mount a new trial, the success in previous early phase trials (or even
animal models) provides a natural foundation for an expected benefit in subsequent
phase trials. Such an underlying belief in efficacy was demonstrated when investigators,
asked to guess whether patients were assigned to active treatment or placebo in a
trial treating depression, were more likely to guess active treatment among patients
who had better clinical outcomes (and also among patients with more adverse events)
(Chen et al. 2015). Likewise, patients either have a predisposition or are transferred a
confidence that active treatment is superior; even patients in early-phase cancer trials
are optimistic that new therapies will be beneficial (Sulmasy
et al. 2010; Jansen et al. 2016). These expectations of treatment efficacy for new
therapies introduce the potential for bias in clinical trials (see Fig. 1).

Why Mask Randomized Trials

Masking (or blinding) of the treatment assignment stands as one of the pillars to
protect the study from potential biases introduced through these expectations. These
expectations could consciously or unconsciously influence the messaging by investi-
gators to subjects in the description of expected outcome and adverse events, the

Fig. 1 Cartoon demonstrating expectation of beneficial efficacy of experimental therapies. Reprinted from the New Yorker with permission

Table 1 Potential benefits accruing depending on those individuals successfully masked. (From Schulz and Grimes (2002))

Individual masked: Potential benefit
Participants:
  Less likely to have biased psychological or physical responses to intervention
  More likely to comply with trial regimens
  Less likely to seek additional adjunct interventions
  Less likely to leave trial without providing outcome data, including lost to follow-up
Trial investigators:
  Less likely to transfer their inclinations or attitudes to participants
  Less likely to differentially administer co-interventions
  Less likely to differentially adjust dose
  Less likely to differentially withdraw participants
  Less likely to differentially encourage or discourage participants to continue trial
Assessors:
  Less likely to have biases affect their outcome assessments, especially with subjective outcomes of interest

diligence of the surveillance for outcomes by investigators, and the adjudication of suspected study events. Through masking, the treatment assignment can be obscured
to three groups: study participants, study investigators, and those assessing study
outcomes. While the language is not precise, “double mask” commonly refers to
obscuring treatment assignment to all three groups (hence, sometimes alternatively
referred to as “triple masking”) (Schulz and Grimes 2002; Bang et al. 2004). While
more confusion surrounds the use of the term "single mask," which can refer to
obscuring treatment assignment to any of these three groups, it most frequently refers
to obscuring treatment assignment to the participants (Schulz and Grimes 2002; Bang
et al. 2004). Schulz and Grimes (2002) offer a list of the benefits to the study in
masking each of these groups (see Table 1), with 5 of the 10 advantages accruing to the
masking of investigators. A study without masking is commonly referred to as “open
label.” This chapter focuses on methods for masking of investigators, including both
double masking and the single masking of investigators.
Meta-analyses of specific therapies, where some of the component studies were
masked and others were not, offer the opportunity to assess the magnitude of
potential bias attributable to a failure to mask. These meta-analyses can estimate
the difference in treatment effect between masked and unmasked studies, with any
larger treatment effect in open-label studies assumed to arise from a bias introduced
by the lack of masking. For example, a recent analysis of 64 meta-analyses including
540 trials of oral health treatments estimated the standardized difference in treatment
effect between masked and unmasked/inadequately masked trials. Among these
trials, 71% provided adequate masking of patients, and 59% provided adequate
masking of the outcome assessor. The standardized difference in the treatment effect
was 0.12 (95% CI, 0.00 to 0.23) larger in trials where the patient was not masked to
the treatment and 0.19 (95% CI, 0.06 to 0.32) for trials where the assessor was not
masked. In the same set of analyses, masking of caregivers and principal
investigators was not associated with differences in estimated treatment effect
between adequately and inadequately masked studies (Saltaji et al. 2018). Similarly
among trials in osteoarthritis, there was a substantially larger treatment effect among
inadequately masked studies than in adequately masked studies when the magnitude
of the overall treatment effect was larger (difference in treatment effect = −0.79;
95% CI, −1.02 to −0.50), but (not surprisingly) little difference in treatment effect
between unmasked and masked studies for treatments with smaller overall treatment
effects (difference in treatment effect = −0.02; 95% CI, −0.10 to 0.06) (Nuesch
et al. 2009). Other meta-analyses using a similar approach also show unmasked
studies tend to have a larger treatment effect than those that are masked, and the bias
is larger when subjective outcomes are used (Savovic et al. 2012; Page et al. 2016).
Hence with this approach, there is strong empirical evidence that the lack of masking
can introduce bias to overestimate the magnitude of the treatment effect.
The current (2010) Consolidated Standard for Reporting Trials (CONSORT)
statement has two statements for reporting of masking: (1) “if done, who was
blinded after assignment to interventions (for example, participants, care providers,
those assessing outcomes)” and (2) “if relevant, description of the similarities of
interventions” (Schulz et al. 2010). Despite this requirement, numerous authors have
noted that the documentation of the methods underpinning masking is commonly
poorly reported (Armijo-Olivo et al. 2017; Boutron et al. 2006). In a systematic
review of 819 trials published in major medical journals in 2004, only 472 (58%)
reported the methods of masking, with the authors speculating that the failure to
describe the methods of masking is in part a product of an underemphasis of their importance
in the CONSORT guidelines (Boutron et al. 2006). However, it also seems that
pressure from word-count policies of journals could also be a contributor.
Approaches to assess the completeness masking in trials are largely based on
asking patients to report their suspected assignment group or to respond that they do
not know to which group they are assigned. Two approaches have been proposed to
quantify the quality of masking and to provide for statistical inference. The approach
of James and colleagues is related to the kappa statistic for disagreement and heavily
weights the proportion of subjects responding that they do not know to which group
they are assigned and provides a single index of the quality of masking across the
treatment groups (James et al. 1996). The dependence of this index on the proportion
that is unsure of their assignment implies that if this proportion is large (>30%),
unmasking in one or both arms may not be detected (Bang et al. 2004). Additionally,
that the approach provides a single index pooling results across multiple treatment
arms may result in failing to detect unmasking in a single arm (e.g., in the case of a
strong treatment effect in the experimental arm, but little effect in the control arm).
These shortcomings were overcome by the subsequent proposal for masking indices
by Bang that provides an estimate of the quality of masking for specific treatment
groups. Under the assumption that those uncertain of their assignment are masked, the
Bang can be interpreted as the proportion of unmasked patients in each treatment
(Bang et al. 2004).
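
To make the mechanics concrete, the following minimal sketch computes the Bang index for a single arm from the three response counts, with an approximate normal-theory confidence interval based on the multinomial variance of a difference of proportions; the function name and the illustrative counts are hypothetical, not taken from any of the cited trials.

from math import sqrt

def bang_index(n_correct, n_incorrect, n_dont_know):
    """Bang blinding index for one treatment arm.

    BI = (proportion guessing their assignment correctly) minus
    (proportion guessing incorrectly), with "don't know" responses
    kept in the denominator. BI near 0 suggests successful masking;
    BI near 1 suggests complete unmasking of the arm.
    """
    n = n_correct + n_incorrect + n_dont_know
    p_c, p_i = n_correct / n, n_incorrect / n
    bi = p_c - p_i
    # Normal-theory SE for a difference of two multinomial proportions
    se = sqrt((p_c + p_i - (p_c - p_i) ** 2) / n)
    return bi, (bi - 1.96 * se, bi + 1.96 * se)

# Hypothetical arm: 60 guess correctly, 20 incorrectly, and 20 respond
# "don't know" -> BI = 0.40, read as roughly 40% of the arm unmasked.
print(bang_index(60, 20, 20))

Computed separately for each arm, the index supports exactly the arm-specific interpretation described above.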
However, the success in masking is frequently not assessed. Only 40 of 2,467
(1.6%) psychiatric trials (Freed et al. 2014), and only 23 of 408 (5.6%) pain trials
(Colagiuri et al. 2019), reported assessing masking with information for meta-analysis
of the masking index. Likewise, the success of efforts to mask was reported in only
31 of 1,599 (2%) trials described as masked, included in the Cochrane Central
Register of Controlled Trials, and published in 2001 (Hrobjartsson et al. 2007).
The guideline for reporting the success of masking included in the 2001 CONSORT
guidelines was removed from the 2010 CONSORT guidelines because of a lack of
empirical evidence supporting the practice and concerns regarding the validity of
assessments (Schulz et al. 2010). Reports that do assess the success of masking are
inconsistent: for example, the previously mentioned meta-analysis of psychiatric
trials showed masking success in both arms (Freed et al. 2014), while masking was
not successful in the pain trials (Colagiuri et al. 2019).
Success in masking appears higher in trials with smaller effect sizes (Freed et al. 2014).
However, as many as half of masked trials conducted some assessment of the quality
of masking without reporting it in the literature (Bello et al. 2017), and 20% or less
of the trials with formal assessments of masking reported these results in publications
(Hrobjartsson et al. 2007; Bello et al. 2017). Hence, reporting bias and other factors
challenge efforts to describe whether masking can be successfully implemented and
factors associated with masking success.
Among the mechanisms through which the lack of masking could introduce bias is
the possibility that knowledge of the treatment assignment may affect the
vigilance of investigators in their surveillance of potential study outcomes. For
example, an unmasked investigator may consciously or unconsciously more aggres-
sively probe potential symptoms for patients in the placebo arm, “ensuring” that the
events presumed to be more common are not missed. Conversely, the investigator may
have a higher threshold to declare an outcome event in the actively treated arm. But, as
outcomes become more objective, there is less judgment in probing for potential
events and less leeway for judgment by the investigator assessing outcomes. Hence,
with objective outcomes, there is less opportunity for investigators (or subjects) to
introduce bias, with little contribution for very objective outcomes such as mortality or
outcomes determined through direct measurement (e.g., weight loss, blood pressure,
lipid levels, etc.). This possibility that masking is less important with increasing
objectivity of the outcome is supported by empirical evidence. For example, in
analyzing the difference in the treatment effect between 532 trials with adequate
masking versus 272 with inadequate masking, there was no evidence of a difference
for all-cause mortality (ratio of odds ratios = 1.04; 95% CI, 0.95 to 1.14), while for
other outcomes the difference between masked and unmasked trials was significant
(ratio of odds ratios = 0.83; 95% CI, 0.70 to 0.98). When these differences were
assessed at the threshold where outcomes were classified as "objective" versus
"subjective" outcomes, the ratio of odds ratios was 1.01 (95% CI, 0.92 to 1.10) for
objective outcomes and 0.75 (95% CI, 0.61 to 0.82) for subjective outcomes (Wood
et al. 2008). However,
while such evidence does support a lower importance of masking with outcomes that
are more objective, it is important to remember there are other pathways through
which unmasked investigators could introduce bias. For example, in the setting of
intensive care units (ICUs), knowledge of the treatment assignment could affect the
decision to provide or withhold life support therapy and thereby have an effect on
mortality as an outcome (Anthon et al. 2017). However, a systematic review with a
published a priori protocol (Anthon et al. 2017) provided little support that this
theoretical possibility is a major concern. The authors considered published
systematic reviews and reanalyzed the data, clustering the included studies into those
with and without masking. The results of this effort showed that for the primary
outcome of death (at the longest follow-up time), only 1 of 22 studies showed a larger
treatment effect for those unmasked than masked (odds ratio = 0.58; 95% CI,
0.35–0.98 versus odds ratio = 1.00; 95% CI, 0.87–1.16) (Anthon et al. 2018). With
22 assessments (and testing interaction at α = 0.10), this is nominally fewer treatments
showing heterogeneity of effect than expected. Similar findings were shown for other
outcomes including in-hospital and in-ICU mortality (Anthon et al. 2018). Still, even
in studies with very objective outcomes, the possibility that alternative pathways could
introduce bias should be carefully considered before abandoning masking.
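
For readers unfamiliar with the "ratio of odds ratios" metric used in these meta-epidemiological comparisons, the following minimal sketch shows the usual normal-theory calculation on the log-odds scale; the numbers and the assumption of independent subgroup summaries are purely illustrative.

from math import exp, log, sqrt

def ratio_of_odds_ratios(or_unmasked, se_unmasked, or_masked, se_masked):
    """Ratio of odds ratios (ROR) comparing unmasked with masked trials.

    Standard errors are on the log-odds-ratio scale. For outcomes where
    an odds ratio below 1 means benefit, ROR < 1 indicates a larger
    apparent benefit in the unmasked subgroup.
    """
    log_ror = log(or_unmasked) - log(or_masked)
    se = sqrt(se_unmasked ** 2 + se_masked ** 2)  # independent subgroups
    return exp(log_ror), (exp(log_ror - 1.96 * se), exp(log_ror + 1.96 * se))

# Hypothetical subgroup summaries: OR 0.60 (SE 0.10) in unmasked trials
# versus OR 0.80 (SE 0.08) in masked trials -> ROR = 0.75.
print(ratio_of_odds_ratios(0.60, 0.10, 0.80, 0.08))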
There is a growing literature supporting the position that estimation of treatment
effects based on events as reported by site investigators is quite similar to results
when information is centrally retrieved and processed by adjudication committees
(Ndounga Diakou et al. 2016). However, it is important to recognize that the decision to
use or not use adjudication committees differs fundamentally from the decision to
mask or not mask studies. That is, it is straightforward to maintain a mask within the
clinics for pharmacological treatments with active versus placebo treatments, and
hence the decision for the use of adjudication becomes a comparison of masked local
determination of outcomes versus the masked central adjudication of outcomes.
However, for other trials where maintaining a mask within the clinical center is
problematic (such as surgical trials), providing adequate masking for the determina-
tion of outcomes may require either additional staff in the clinical centers who are
masked to treatment allocation (and trust that a "wall" is maintained between the
masked and unmasked clinical center staff) or the use of a central adjudication
committee that can be masked (discussed below).
It does seem intuitive that investigators (and subjects) could be influenced by the
knowledge of the treatment allocation, and this intuition is supported by empirical
data showing a bias for larger treatment effects in unmasked studies. While a large
number of studies are masked, there appears to be substantial room for improvement
in the reporting of the methods for implementing the mask, and the methods and
benefit for assessing the success of masking remain questionable.

How to Mask Investigators in Randomized Trials

While double-masked active-versus-placebo treatments in pharmacologic trials are
frequently the first thing that comes to mind when discussing masking in clinical
trials, the masking of treatment assignment is much more complex for many trials. Different treatments
give rise to a spectrum of challenges to the masking of investigators, with perhaps
surgical trials giving rise to the largest number of issues. Here, clearly those providing
the therapy cannot be masked, nor can patients generally be masked (without the use
of sham surgery, which is frequently considered unethical (Macklin 1999)), nor can many
assessors be masked from the scars associated with procedures. Implementation of
masking in lifestyle treatments (e.g., diet, exercise, etc.) is similarly difficult to
implement. However, as noted above, the lack of masking can give rise to biased
estimates of treatment effect, and as such the effort to provide the most complete
masking feasible is central to the good conduct of studies. As such, an array of tools
and approaches have been developed to reduce bias. Karanicolas and colleagues have
proposed three criteria to consider in the implementation of these approaches:
they should successfully conceal the group allocation, they should not impair the
ability to successfully assess outcomes, and they must be acceptable to the individuals
assessing the outcome (Karanicolas et al. 2008).
Boutron and colleagues’ outstanding systematic review of methods for masking
studies among 819 trials offered an effective strategy for classifying masking tools
and approaches, specifically whether they primarily (1) mask patients and healthcare
providers, (2) maintain the mask of patients and healthcare providers, or (3) support
the masking of assessors of outcomes (Boutron et al. 2006). We will follow this
structure in the review of these methods.
Among approaches to support the masking of patients and healthcare providers,
by far, the most common technique is the central preparation of oral/topical active
treatments with masked alternative treatments, an approach employed by 193 of 336
(57%) of studies reporting approaches to mask the patient and healthcare provider
(Boutron et al. 2006). This approach for masking is nearly ubiquitous in pharmaco-
logical clinical trials, and use of a central pharmacy effectively masks the treatment
assignment from the investigators. While this approach is common, the effort and
cost to identify and contract with a central pharmacy partner for the trial are
considerable, and the time line for implementation and production of active and
placebo treatment should not be underestimated. In addition, investigators in the
central pharmacy have unique experience and insights that are often remarkably
useful for the trial; it is critical to identify and involve these scientists as early as
possible in the trial planning process. Specifically, while the active drug may be
readily available, the creation of a placebo treatment with similar characteristics
sometimes requires encapsulation to conceal the active drug, or the addition of
flavors to mask the taste. Care and due diligence are still required: although never
published in the reports from the trial, the investigators in the Vitamin Intervention in
Stroke Prevention (VISP) trial fortunately had the foresight to bioassay the first batch
of active and placebo medications provided by the central pharmacy, finding the
placebo to have levels of the treatment medications (folate, B6, and B12) nearly
indistinguishable from the active medication (a problem that was, of course, resolved
prior to the onset of the trial). Trials with an active alternative treatment (e.g., a trial of teriparatide
versus risedronate for new fractures in postmenopausal women (Kendler et al.
2018)) offer additional challenges, where it could be difficult to produce treatments
that appear similar even with encapsulation. Such a situation may call for a placebo
to be created for each of the two active treatments, or a “double-dummy” design.
Once masking of treatment assignment is established, efforts need to focus on
maintaining that mask during patient follow-up, when several factors work to
potentially unmask the treatment assignment. A particular challenge to maintaining
the masking of investigators arises in pharmacological trials that require dose
adjustments, where the mask can be maintained by having a centralized office that creates
the adjustment orders with the inclusion of sham adjustments for those patients on
placebo. The investigators can also be partially or completely unmasked by the
availability of results of laboratory or other assessments at the clinical site. This
possibility can be reduced by the use of a central laboratory or reading facility with
only selected information required for safety being returned to the clinical site. The
masking of investigators is also challenged by the occurrence of specific adverse
events; again, the use of a central facility to process and report adverse events can
reduce this possibility, as can systematic treatments to prevent adverse events that
are applied equally in both treatment groups. Finally, it is critical for the investigators
to avoid "messaging" to the patient about the therapeutic effect, and the expected
side effects have to be carefully considered to maintain the mask.
However, there are treatments where maintaining the mask in the clinical center is
quite difficult or even impossible. Examples would include randomization to surgery
versus medical management for the management of asymptomatic carotid stenosis
(Howard et al. 2017), randomization to Mediterranean diet versus alternative diets
(Estruch et al. 2013), or randomization to different treatment algorithms (such as
different blood pressure levels (SPRINT Research Group et al. 2015)). In this case, a
first-line approach to provide masking is to have independent clinic staff who are not
involved in providing treatment be masked to the treatment allocation and assess the
trial outcomes; however, such an approach requires faith that the clinic staff will
maintain a “wall” between staff who may know each other well. Alternatively,
outcomes can be centrally processed by trial staff that are masked to treatment
allocation, an approach referred to as a prospective randomized open blinded end-
point (PROBE) design (Hansson et al. 1992). Examples of the approach include the
video recording of a neurological examination with centralized scoring of the
modified Rankin score that serves as the primary study outcome (Reinink et al.
2018) or the retrieval of medical records for suspected stroke events that can be
redacted to mask treatment allocation and adjudicated by clinicians who are masked
to treatment allocation (Howard et al. 2017). Even with the use of PROBE designs,
investigators must be careful to not let the actions of the unmasked clinic staff
introduce bias. For example, the clinic staff could be more sensitive to the detection
of potential events in the medically managed group and be more likely to report these
events for the central adjudication. This can be partially overcome by the introduc-
tion of triggers, such as a 2-point increase in a clinical stroke scale, and requiring
records to be provided each time the trigger occurs. This potential bias can also be
reduced by setting a very low threshold for suspected events, so that many more
records are centrally reviewed with a relatively small proportion being adjudicated as
a study outcome. That PROBE approaches could reduce but not eliminate bias is
supported by a meta-analysis of oral anticoagulants to reduce stroke risk estimating
the treatment difference between trials using a double mask approach (4 trials)
versus a PROBE design (9 trials). This analysis observed a nonsignificantly
(p = 0.16) larger effect for stroke prevention in the PROBE studies (relative risk
= 0.76; 95% CI, 0.65–0.89) than for the double mask studies (relative risk = 0.88;
95% CI, 0.78–0.98) and a significantly larger effect (p = 0.05) for the prevention of
hemorrhagic stroke in the PROBE trials (relative risk = 0.33; 95% CI, 0.21–0.50) than
the double mask studies (relative risk = 0.55; 95% CI, 0.41–0.73) (Lega et al. 2013).
While the use of PROBE methods likely reduces bias in outcome ascertainment, it is
not clear that these methods are as widely used as possible. For example, in a review of
171 orthopedic trials, masking of clinical assessors was considered feasible in 89% of
studies and masking of radiographic assessors in 83% of trials; however, less than 10%
of these trials used masked assessors (Karanicolas et al. 2008).
While simple active/placebo masking is possible for some treatments, many trials
will require creativity and determination to implement masking of investigators.
Additionally, once masking is in place, efforts need to be directed to maintain the
mask.

Conclusions

Masking stands as one of the pillars to reduce or eliminate bias in the conduct of
clinical trials. Without masking, intentional or unintentional prejudice can influence
the outcome of the trial, and as such ignorance is truly bliss.

References
Anthon CT, Granholm A, Perner A, Laake JH, Moller MH (2017) The effect of blinding on
estimates of mortality in randomised clinical trials of intensive care interventions: protocol for
a systematic review and meta-analysis. BMJ Open 7(7):e016187
Anthon CT, Granholm A, Perner A, Laake JH, Moller MH (2018) No firm evidence that lack of
blinding affects estimates of mortality in randomized clinical trials of intensive care interven-
tions: a systematic review and meta-analysis. J Clin Epidemiol 100:71–81
Armijo-Olivo S, Fuentes J, da Costa BR, Saltaji H, Ha C, Cummings GG (2017) Blinding in
physical therapy trials and its association with treatment effects: a meta-epidemiological study.
Am J Phys Med Rehabil 96(1):34–44
Bang H, Ni L, Davis CE (2004) Assessment of blinding in clinical trials. Control Clin Trials 25
(2):143–156
Bello S, Moustgaard H, Hrobjartsson A (2017) Unreported formal assessment of unblinding
occurred in 4 of 10 randomized clinical trials, unreported loss of blinding in 1 of 10 trials. J
Clin Epidemiol 81:42–50
Boutron I, Estellat C, Guittet L et al (2006) Methods of blinding in reports of randomized controlled
trials assessing pharmacologic treatments: a systematic review. PLoS Med 3(10):e425
Chen JA, Vijapura S, Papakostas GI et al (2015) Association between physician beliefs regarding
assigned treatment and clinical response: re-analysis of data from the Hypericum Depression
Trial Study Group. Asian J Psychiatr 13:23–29
Colagiuri B, Sharpe L, Scott A (2019) The blind leading the not-so-blind: a meta-analysis of
blinding in pharmacological trials for chronic pain. J Pain 20:489–500
Estruch R, Ros E, Salas-Salvado J et al (2013) Primary prevention of cardiovascular disease with a
Mediterranean diet. N Engl J Med 368(14):1279–1290
Freed B, Assall OP, Panagiotakis G et al (2014) Assessing blinding in trials of psychiatric disorders:
a meta-analysis based on blinding index. Psychiatry Res 219(2):241–247
Hansson L, Hedner T, Dahlof B (1992) Prospective randomized open blinded end-point (PROBE)
study. A novel design for intervention trials. Prospective Randomized Open Blinded End-Point.
Blood Press 1(2):113–119
Howard VJ, Meschia JF, Lal BK et al (2017) Carotid revascularization and medical management for
asymptomatic carotid stenosis: protocol of the CREST-2 clinical trials. Int J Stroke
12(7):770–778
Hrobjartsson A, Forfang E, Haahr MT, Als-Nielsen B, Brorson S (2007) Blinded trials taken to the
test: an analysis of randomized clinical trials that report tests for the success of blinding. Int J
Epidemiol 36(3):654–663
James KE, Bloch DA, Lee KK, Kraemer HC, Fuller RK (1996) An index for assessing blindness in
a multi-centre clinical trial: disulfiram for alcohol cessation – a VA cooperative study. Stat Med
15(13):1421–1434
Jansen LA, Mahadevan D, Appelbaum PS et al (2016) Dispositional optimism and therapeutic
expectations in early-phase oncology trials. Cancer 122(8):1238–1246
Karanicolas PJ, Bhandari M, Taromi B et al (2008) Blinding of outcomes in trials of orthopaedic
trauma: an opportunity to enhance the validity of clinical trials. J Bone Joint Surg Am
90(5):1026–1033
Kendler DL, Marin F, Zerbini CAF et al (2018) Effects of teriparatide and risedronate on new
fractures in post-menopausal women with severe osteoporosis (VERO): a multicentre, double-
blind, double-dummy, randomised controlled trial. Lancet 391(10117):230–240
Lega JC, Mismetti P, Cucherat M et al (2013) Impact of double-blind vs. open study design on the
observed treatment effects of new oral anticoagulants in atrial fibrillation: a meta-analysis. J
Thromb Haemost 11(7):1240–1250
Macklin R (1999) The ethical problems with sham surgery in clinical research. N Engl J Med 341
(13):992–996
Ndounga Diakou LA, Trinquart L, Hrobjartsson A et al (2016) Comparison of central adjudication
of outcomes and onsite outcome assessment on treatment effect estimates. Cochrane Database
Syst Rev 3:MR000043
Nuesch E, Reichenbach S, Trelle S et al (2009) The importance of allocation concealment and
patient blinding in osteoarthritis trials: a meta-epidemiologic study. Arthritis Rheum
61(12):1633–1641
Page MJ, Higgins JP, Clayton G, Sterne JA, Hrobjartsson A, Savovic J (2016) Empirical evidence
of study design biases in randomized trials: systematic review of meta-epidemiological studies.
PLoS ONE 11(7):e0159267
Reinink H, de Jonge JC, Bath PM et al (2018) PRECIOUS: PREvention of Complications to
Improve OUtcome in elderly patients with acute Stroke. Rationale and design of a randomised,
open, phase III, clinical trial with blinded outcome assessment. Eur Stroke J 3(3):291–298
Saltaji H, Armijo-Olivo S, Cummings GG, Amin M, da Costa BR, Flores-Mir C (2018) Influence of
blinding on treatment effect size estimate in randomized controlled trials of oral health inter-
ventions. BMC Med Res Methodol 18(1):42
Savovic J, Jones H, Altman D et al (2012) Influence of reported study design characteristics on
intervention effect estimates from randomised controlled trials: combined analysis of meta-
epidemiological studies. Health Technol Assess 16(35):1–82
Schulz KF, Grimes DA (2002) Blinding in randomised trials: hiding who got what. Lancet 359
(9307):696–700
Schulz KF, Altman DG, Moher D (2010) CONSORT 2010 statement: updated guidelines for
reporting parallel group randomised trials. J Pharmacol Pharmacother 1(2):100–107
SPRINT Research Group, Wright JT Jr, Williamson JD et al (2015) A Randomized Trial of
Intensive versus Standard Blood-Pressure Control. N Engl J Med 373(22):2103–2116
Sulmasy DP, Astrow AB, He MK et al (2010) The culture of faith and hope: patients’ justifications
for their high estimations of expected therapeutic benefit when enrolling in early phase oncology
trials. Cancer 116(15):3702–3711
Wood L, Egger M, Gluud LL et al (2008) Empirical evidence of bias in treatment effect estimates in
controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ
336(7644):601–605
44 Masking Study Participants
Lea Drye

Contents
Definitions
Introduction
Goals of Masking
Alerting Participants That Masking Will Be Used
Operationalizing Masking
Placebos and Shams
Mechanics
Unmasking
Conclusion
Key Facts
Cross-References
References

Abstract
Masking or blinding in clinical trials refers to the process of keeping the identity
of the assigned treatment hidden from specific groups of individuals such as
participants, study staff, or outcome assessors. The purpose of masking is to
minimize conscious and unconscious bias in the conduct and interpretation of a
trial. Masking participants in clinical trials is a key methodological procedure
since patient expectations can introduce bias directly through how a participant
reports patient-reported outcomes but also indirectly through his or her willing-
ness to participate in and adhere to study activities.
The complexity of operational aspects of masking participants is often
underestimated. Masking is facilitated by placebos, dummies, sham devices, or
sham procedures/surgeries. The success of masking depends on how closely the
placebo or sham matches the active treatment. Creation of a completely identical
placebo is generally possible only when active drug and matching placebo are
provided by the manufacturer. Masking of participants becomes more complicated
if there are more than two experimental treatment groups or an active control,
if treatments are taken at different intervals or via different routes, or if sham
devices or procedures are required.
Trials in which participants are masked should have procedures in place to
unmask. Most unmasking is routine unmasking in which investigators commu-
nicate treatment assignment with participants after treatment and follow-up are
complete. In addition to this routine unmasking, masked trials should have pro-
cedures to immediately unmask at any hour of the day in the event of an
emergency.

Keywords
Blind · Mask · Single mask · Double mask · Unmask · Placebo · Sham

Definitions

Mask or blind: Withholding treatment assignment identification from a group or
groups of individuals in a clinical trial.
Single mask: Withholding treatment assignment identification from a single
group, usually used to refer to withholding treatment assignment from participants.
Double mask: Withholding treatment assignment identification from two groups
of individuals, usually used to refer to withholding treatment assignment from both
participants and from study staff.
Unmask: Unintentional or intentional revealing of the treatment assignment to
groups of individuals who were previously masked.

Introduction

Masking in clinical trials is the process of keeping one or more parties (e.g.,
participants, study staff, outcome assessors, data analysts) unaware of the identity
of the treatment assignment during the conduct of the trial. The purpose of masking
is to prevent conscious or subconscious notions and expectations regarding the
treatment effects from affecting outcomes. To minimize behavior that can lead to
differential effects on outcomes, the preferred design strategy is to mask as many
individuals as is practically possible while maintaining safety. Blinding is a term
synonymous with masking that is frequently used.
The term single masking is usually used to refer to the masking of study partici-
pants, while double masking is usually used to refer to masking of study participants
and study staff. Additional levels of masking, such as masking of data analysts or
treatment effects monitoring committees (also known as data and safety monitoring
boards), may also be used. It is important to note that the terms single, double, and
triple masked, which are used to describe the level of masking, are not universally
standardized. Readers must evaluate the description of masking in trial publications or
study documentation such as the protocol to understand which groups were masked.

Goals of Masking

Masking is a crucial methodologic feature of randomized controlled trials. While
randomization minimizes selection bias and confounding in the assignment of
treatment, it does not prevent subsequent differential reporting or assessment of
outcomes or behaviors that indirectly affect outcomes. Masking should not be
confused with allocation concealment, which is preventing the disclosure of upcom-
ing treatment assignments until enrollment.
The masking of participants is particularly important for patient-reported out-
comes such as symptom scales, adverse events, and concomitant medications.
However, the participant’s knowledge of treatment assignment could affect out-
comes in less direct ways through his or her willingness to continue participation
in study activities, adherence to assigned treatment, avoidance of other treatments,
and risk behaviors. There are no analytical techniques that can “correct” for biased
assessment of outcomes.
The effect of masking on outcomes was explored in Hrobjartsson et al. (2014). The
authors reported results of a systematic review of 12 randomized clinical trials including
3869 patients in which the trial had one sub-study involving masked participants and
another, otherwise identical, sub-study involving unmasked participants. In trials with
patient-reported outcomes, the authors reported that effect sizes based on unmasked
participants were exaggerated by an average of 0.56 standard deviations compared to
masked participants. In addition to the effect on patient-reported outcomes, the average
risk of participant attrition in the control groups of RCTs including more than 2 weeks
follow-up duration was 7% (4% to 11%) in the unmasked treatment groups versus 4%
(2% to 6%) in masked treatment groups, and more participants in the unmasked control
groups used co-interventions compared to masked control groups.

Alerting Participants That Masking Will Be Used

Study staff should explain to study participants that treatments will be masked and
the reasons for masking. Masking should not be attempted if accomplishing it
requires lying to participants or deception.
Both the Common Rule and FDA clinical trial regulations require (45 CFR Part
46.116 and 21 CFR 50.25) that descriptions of procedures related to research, such
as masking, should be included in the informed consent process. Participants should
be informed that they will be kept unaware of their treatment assignments in masked
studies as well as whether the study staff or physician will be aware of the treatment
assignment.
Operationalizing Masking

Placebos and Shams

Operationally, masking participants is facilitated by placebos and shams. Both serve
the same purpose regarding bias control.
Placebo or dummy treatments are inert or inactive substances that are taken or
applied as a substitute for the active treatment to prevent the participant from
knowing which treatment he or she received. In device trials, both the terms placebo
device and sham device are used.
Sham procedures or surgeries refer to something done to study participants to prevent
them from knowing which treatment they received. Sham procedures have additional
ethical concerns given that they are not inert like placebo treatments. They generally
involve potential risks to participants due to sedation and possibly surgical wounds.

Mechanics

The success of masking depends on how closely the placebo or sham matches the
active treatment. The brief statements regarding masking in most published reports
of clinical trials belie how complicated the task of masking truly is.
Masking participants in drug trials depends on whether pills, tablets, patches,
injections, liquid formulations, etc. can be made to match the active treatment with
respect to obvious characteristics such as size, shape, and color but also with respect
to smell and taste, particularly if the drug is known to have a characteristic feature. In
practice, this is generally possible only when active drug and matching placebo are
provided by the manufacturer since formulating an identical product to one that is
marketed is not legal. In the Alzheimer’s Disease Anti-inflammatory Prevention
Trial (ADAPT), Bayer and Pfizer supplied investigators with identical placebos
matching their marketed products (Martin et al. 2002).
When the manufacturer does not provide matching placebo, overencapsulation is
a technique that can be used to produce identical active and placebo capsules for a
trial of treatments that are in pill, tablet, or capsule formulation. This process is
expensive and may require lab testing to confirm bioavailability. Depending on how
large the overencapsulated study drug becomes, it can add additional eligibility
criteria for participation, i.e., participants must be able to swallow the capsule to
enroll.
Masking of drugs goes beyond creating identical product. The packaging also
must be identical. This will require repackaging of the drug in some situations, which
adds cost and time and also may have implications for product stability.
Masking of participants in drug trials becomes even more complicated if there are
more than two treatment groups or an active control, or if the treatments are taken at
different intervals or via different routes. In these cases, placebos will need to be
created so that participants take active and placebo treatments at the appropriate
treatment intervals and routes for all treatments. If one treatment requires adminis-
tration twice as often as another, all participants must take study treatment at the
most frequent interval to maintain masking. Similarly, if one treatment is a tablet and
another is an injection, participants in both groups must receive both tablets and
injections. For example, the Oral Psoriatic Arthritis Trial (OPAL) Broaden phase
3 trial compared tofacitinib to adalimumab for psoriatic arthritis in patients with
inadequate response to previous disease-modifying antirheumatic drugs (Mease
et al. 2017). Patients were randomized in a 2:2:2:1:1 ratio to receive tofacitinib
5 mg twice daily, tofacitinib 10 mg twice daily, adalimumab 40 mg administered
subcutaneously once every 2 weeks, placebo with switch to the 5 mg tofacitinib at
3 months, or placebo with switch to the 10 mg tofacitinib at 3 months. In order to
mask the trial, all patients had to take two tablets twice daily and receive biweekly
injections. The content of the tablets and injections varied according to treatment
group.
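
The logic of a double-dummy schedule such as OPAL Broaden's can be captured in a simple arm-to-regimen mapping; the sketch below paraphrases only the three main arms described above (the placebo-switch arms are omitted, and the labels and data structure are hypothetical rather than taken from the trial's systems).

# Every arm receives both tablets and an injection so that the regimen
# looks identical across groups; only the contents differ.
DOUBLE_DUMMY = {
    "tofacitinib_5mg":  {"tablets_twice_daily": "active tofacitinib",
                         "injection_every_2wk": "placebo"},
    "tofacitinib_10mg": {"tablets_twice_daily": "active tofacitinib",
                         "injection_every_2wk": "placebo"},
    "adalimumab":       {"tablets_twice_daily": "placebo",
                         "injection_every_2wk": "adalimumab 40 mg"},
}

def dispense(arm):
    """Return the masked regimen for a randomized arm; identical in
    form for every arm, differing only in content."""
    return DOUBLE_DUMMY[arm]

print(dispense("adalimumab"))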
Masking of participants in device or surgery trials is difficult but not always
impossible. Sham devices are manufactured to appear the same as active devices
but are manipulated so that they do not function as required to administer the
treatment. The Escitalopram versus Electrical Current Therapy for Treating
Depression Clinical Study (ELECT-TDCS) compared transcranial direct-current
stimulation with escitalopram in patients with major depressive disorder (Brunoni
et al. 2017). Patients received active or placebo escitalopram and active or sham
transcranial direct-current stimulation. Sham transcranial direct-current stimula-
tion was accomplished using fully automated devices that were programmed to
turn off the current automatically after 30 s. In a sham-controlled trial of 5 cm H2O
and 10 cm H2O of continuous positive airway pressure (CPAP) in patients with
asthma, “sham” CPAP was delivered via identical devices calibrated by the
manufacturer to deliver pressure at less than 1 cm H2O with masked display of
pressure level and intake flow rates and noise levels similar to the active devices
(Holbrook et al. 2016).
In a sham surgery, an imitation procedure is performed to mimic the active
surgery. This might include patients receiving anesthesia, having scopes inserted,
having incisions, etc. Therefore, sham surgeries do carry risks that are more difficult
to justify ethically and have been used less often than placebos. If patients are not
under general anesthesia, then the surgical team may also have to mimic sounds,
smells, and dialogue of surgery so that patients cannot distinguish whether or not
they underwent the actual surgery. Any imaging to check success of surgery or
medication to prevent infection must also be mimicked in patients assigned to sham.
While difficult, sham surgeries are not impossible to perform if risks to patients can
be minimized. In a trial investigating transplantation of retinal pigment epithelial
cells as a treatment for Parkinson’s disease, surgeons performed not only skin
incision but also burr holes in the skull. In the sham group, the burr holes did not
penetrate the dura mater (Gross et al. 2011).
Unmasking

To the extent possible, it is important to maintain participant masking until outcomes
assessment is complete. In reality, no masking scheme is perfect. Participants who
are determined to figure out their treatment assignment may be able to do so by
comparing their treatments to other participants to look for tell-tale subtle differences
between the treatments or, in the case of overencapsulated study drug, by opening
the capsules to examine the contents.
Most unmasking occurs after treatment and follow-up are complete as a matter of
process in closing out a trial. During the trial, there should be few instances where
unmasking is required. When participants are experiencing side effects, unmasking
is usually not needed as the study drug can simply be stopped. However, all masked
trials should have procedures in place to immediately unmask at any hour of the day
in emergency situations. This might be accomplished through a study website, 24-h
call-in service or through tear-off labels on containers or devices. Situations which
require immediate unmasking are:

• The treatment assignment is needed to care for the participant because decisions
on how to proceed depend on which treatment the participant has received
particularly in an emergency setting.
• Potential allergic reaction.
• Potential overdose of the participant or another person.
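
A minimal sketch of the kind of audited emergency-unmasking service described above follows; the storage scheme, function names, and reason codes are hypothetical, and in practice the assignments would live in a secured randomization system rather than in application code.

import datetime

_ASSIGNMENTS = {"P-1001": "active", "P-1002": "placebo"}  # hypothetical
_AUDIT_LOG = []

ALLOWED_REASONS = {"clinical management", "allergic reaction", "overdose"}

def emergency_unmask(participant_id, requester, reason):
    """Reveal one participant's assignment, recording who asked and why.

    The audit trail lets trial leadership review every unmasking event
    and confirm it met one of the prespecified emergency criteria.
    """
    if reason not in ALLOWED_REASONS:
        raise ValueError(f"not an approved emergency reason: {reason}")
    _AUDIT_LOG.append({
        "participant": participant_id,
        "requester": requester,
        "reason": reason,
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return _ASSIGNMENTS[participant_id]

print(emergency_unmask("P-1001", "on-call physician", "allergic reaction"))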

Occasionally, a participant will be adamant that he or she be told the treatment
assignment during conduct of the trial. In these cases, the study staff have no choice
but to unmask the participant.

Conclusion

Masking minimizes conscious and unconscious bias in the conduct and interpretation of
a trial and as such is a key methodological procedure. While the importance of
participant masking is well understood, the complexity of its implementation is often
underestimated. It is rarely possible to create or purchase a completely identical placebo
or sham. Masking of participants is difficult if there are more than two treatment groups
or an active control, if the treatments are taken at different intervals or via different routes,
or if sham procedures are required. Investigators should have procedures in place for
routine unmasking of participants after treatment and follow-up are complete as well as
procedures to immediately unmask at any hour of the day in the event of an emergency.

Key Facts

• Masking (also called blinding) is used to minimize the likelihood of differential
treatment or assessments of outcomes due to conscious or unconscious bias.
Cross-References

▶ Administration of Study Treatments and Participant Follow-Up
▶ Issues for Masked Data Monitoring
▶ Masking of Trial Investigators
▶ Patient-Reported Outcomes

References
Brunoni AR, Moffa AH, Sampaio-Junior B, Borrione L, Moreno ML, Fernandes RA, Veronezi BP,
Nogueira BS, Aparicio LVM, Razza LB, Chamorro R, Tort LC, Fraguas R, Lotufo PA, Gattaz
WF, Fregni F, Bensenor IM (2017) Trial of electrical direct-current therapy versus Escitalopram
for depression. N Engl J Med 376(26):2523–2533
Gross RE, Watts RL, Hauser RA, Bakay RA, Reichmann H, von Kummer R, Ondo WG, Reissig E,
Eisner W, Steiner-Schulze H, Siedentop H, Fichte K, Hong W, Cornfeldt M, Beebe K,
Sandbrink R (2011) Intrastriatal transplantation of microcarrier-bound human retinal pigment
epithelial cells versus sham surgery in patients with advanced Parkinson's disease: a double-
blind, randomised, controlled trial. Lancet Neurol 10(6):509–519
Holbrook JT, Sugar EA, Brown RH, Drye LT, Irvin CG, Schwartz AR, Tepper RS, Wise RA, Yasin
RZ, Busk MF (2016) Effect of continuous positive airway pressure on airway reactivity in
asthma. A randomized, sham-controlled clinical trial. Ann Am Thorac Soc 13(11):1940–1950
Hrobjartsson A, Emanuelsson F, Skou Thomsen AS, Hilden J, Brorson S (2014) Bias due to lack of
patient blinding in clinical trials. A systematic review of trials randomizing patients to blind and
nonblind sub-studies. Int J Epidemiol 43(4):1272–1283
Martin BK, Meinert CL, Breitner JC (2002) Double placebo design in a prevention trial for
Alzheimer's disease. Control Clin Trials 23(1):93–99
Mease P, Hall S, FitzGerald O, van der Heijde D, Merola JF, Avila-Zapata F, Cieslak D, Graham D,
Wang C, Menon S, Hendrikx T, Kanik KS (2017) Tofacitinib or Adalimumab versus placebo for
psoriatic arthritis. N Engl J Med 377(16):1537–1550
45 Issues for Masked Data Monitoring
O. Dale Williams and Katrina Epnere

Contents
Introduction
Main Focus
Some Current Guidelines and Opinions
Implications and Suggestions
Key Suggestions
Cross-References
References

Abstract
The essential, primary purpose of a clinical trial is to provide a fair test for the
comparison of treatments, drugs, strategies, etc. A challenge to this fairness is the
appropriate utilization, or lack thereof, of masking or blinding. Masking generally
refers to restricting knowledge as to the treatment group assignment for the
individual or, in the case of a Data and Safety Monitoring Board (DSMB), to
the summary of information comparing treatment groups. Fundamentally,
masking is important to consider for those situations wherein knowledge of the
treatment assignment could alter behavior or otherwise impact inappropriately on
trial results. Masking may, however, while protecting against this bias, make it
more difficult for the DSMB properly to protect trial participants from undue risk
of adverse or serious adverse events. While there are several dimensions to this
overall situation, this chapter addresses the important issue as to whether a trial’s
DSMB should be fully aware of which treatment group is which as it reviews data
summaries for an ongoing trial.

Keywords
Data and safety monitoring board · Data monitoring committee · Masking ·
Blinding · Open report · Closed report · Interim analysis · Risk/benefit

Introduction

DSMBs, sometimes called Data Monitoring Committees (DMCs) or Safety and
Data Monitoring Boards (SDMBs), have been utilized and referenced in the
context of clinical trials since the 1960s (Greenberg Report 1967; Gordon et al.
1998; Wittes 1993). From that time to the present, the field of clinical trials
methods and applications has grown enormously (▶ Chap. 37, “Data and Safety
Monitoring and Reporting”). This period experienced an enormous expansion of
the types of research questions that the field addresses. Some of these require long-
term, complex trials, some address treatment regimens with inherent adverse event
and serious adverse event issues, and some address those regimens without such
issues. Global policies impacting on the conduct of trials should take this diversity
into account.
The typical situation providing context for this discussion has to do with how
reports prepared for the Board’s review are organized and how Board meetings
tend to be conducted. For the reports, their presentation schedule is roughly defined
at the outset; this schedule is followed unless a critical issue arises that needs
attention prior to a scheduled report release or meeting. These reports tend to have
three components: an investigator’s report to the DSMB, the open report, and the
closed report. The investigator’s report tends to deal with overall trial progress,
responses to Board recommendations, synopsis of the protocol, review of the
protocol history and amendments, and emerging results from relevant clinical
studies. Typically, the open and closed reports contain a core set of tables and
figures that are based on an agreed-upon data analysis plan. The typical open report
provides recruitment and compliance information, demographic and baseline char-
acteristic summaries for the combined treatment groups, and other relevant infor-
mation. The closed report may include some of the same information, adverse
event data, and outcome summaries, all by treatment group. “The closed report
should allow the DMC to assess the risk/benefit (▶ Chap. 99, “Safety and Risk
Benefit Analyses”) of the study treatments as well as the integrity of the data,
including completeness and timeliness, used in the interim analyses” (▶ Chap. 59,
“Interim Analysis in Clinical Trials”) (Neaton et al. 2018).
The meetings tend to follow this same structure with separate open and closed
sections. Access to the open reports and participation in their discussion typically are
limited to investigators and others involved in the study under careful confidentiality
restrictions. Access to the closed report and participation in the closed session are
restricted to Board members and those responsible for preparing the closed reports.
For publically funded trials, representatives of the funding agency may attend
(Anand et al. 2011; Bierer et.al 2016; DeMets et al. 2004; Wittes et al. 2007).

Main Focus

This chapter focuses on how these treatment groups are identified in the closed reports
and during the consequent Board discussions.
Context is a critical issue, elements of which include:

1. No masking – Also sometimes called “open label,” generally implies neither the
participants enrolled in the trial, the staff conducting the trial, nor the DSMB are
masked to treatment assignment. Even for this situation, trial site staff and potential
enrolled participants typically – and importantly – are masked with respect to the
treatment assignment for participants in line to be randomized.
2. Almost no masking – A special case of the “no masking” situation whereby the
process of assessing the trial’s primary outcome is done in a masked fashion. That
is, the individuals or panels assessing or measuring trial outcomes for individual
enrolled participants do so without knowledge as to the participant’s treatment
group. This is often considered a critically important need for any trial.
3. Single-masked trial – Generally refers to the situation where the participants
enrolled in the trial do not know which treatment they are receiving, but the
trial site staff and others are aware of treatment assignment (▶ Chap. 44,
“Masking Study Participants”).
4. Double masked – Generally refers to the situation where neither the enrolled
participants nor the trial site staff are aware of the treatment assignments.
5. Triple masked – Generally refers to the double-masked situation plus the masking
of the DSMB when reviewing ongoing results by treatment group.
6. Only DSMB unmasked – The critical question herein is whether the DSMB
should be masked or should be the only operational entity, apart from the staff
preparing reports for the DSMB, that is unmasked.

There is an important difference between the masking issues for a DSMB relative to
those for the other components mentioned above. For the others, the frame of reference
is possible bias for data items for individual enrolled participants. For the DSMB, the
judgment is whether differences between the treatment groups as represented by
summary data merit some action. This is, by its very nature, a broader, more important
assessment of benefit and risk, often with societal implications (▶ Chap. 99, "Safety
and Risk Benefit Analyses”).
In this context, the operative question is whether the Board should be masked as
to the identity of the treatment groups and how should this operational decision be
made and implemented. Some options are listed below:

1. Masked mandated – The implication is that during the ongoing trial, the Board
would review differences between treatment groups with the treatment groups
identified by codes (e.g., group A and B; a sketch of such coding appears after
this list). They would learn the actual identity of
said groups only at the end of the trial. This strategy gives considerable weight to
the concern that the knowledge of treatment group could bias the interpretation of
interim results and lead to perhaps inappropriate action. It gives lesser weight to
the concern that the strategy would perhaps impair the Board’s ability to address
adverse event issues in a timely and appropriate manner.
2. Unmasked mandated – The implication is that the likelihood of the Board not
properly being able to assess harm or benefit without knowing the identity of the
treatment groups outweighs the concern about possible bias.
3. Something in between – If so, who decides and how will it be structured?
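
As an illustration of the "masked mandated" option, the following minimal sketch shows how a reporting statistician might assign neutral coded labels to the treatment arms for a closed report while holding back a sealed key; the function and labeling scheme are hypothetical, not a description of any specific trial's system.

import random

def code_arms(arms, seed=None):
    """Randomly assign neutral labels (A, B, ...) to treatment arms.

    The coded labels appear in the closed report; the key is held back
    (e.g., in a sealed envelope) until the Board decides to unmask.
    """
    rng = random.Random(seed)
    labels = [chr(ord("A") + i) for i in range(len(arms))]
    shuffled = arms[:]
    rng.shuffle(shuffled)
    key = dict(zip(labels, shuffled))               # sealed: label -> arm
    coded = {arm: lbl for lbl, arm in key.items()}  # used to relabel data
    return coded, key

coded, sealed_key = code_arms(["intervention", "control"], seed=2022)
print(coded)  # e.g., {'control': 'A', 'intervention': 'B'}
# sealed_key is retained only by the staff preparing the closed report.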

Although Boards typically have purview over additional issues, the two most
central issues tend to be:

1. Adverse events, including those considered attributable to a trial treatment regimen and those not so considered
2. Trial outcomes, with a priority focus on the primary, design outcome but also on
those outcomes considered secondary or otherwise not primary

Operational implications can be quite different for these two issues: a masking
policy for one may well not be appropriate for the other. The adverse event situation
typically involves the Board’s review of individual events in addition to summaries
comparing treatment groups. Further, the review of adverse events may be under-
taken on the occurrence of each individual event along with a careful investigation of
differences between treatment groups subsequently at a Board’s formal meeting. The
fundamental issue is risk to the enrolled individuals. A careful assessment of this risk
may require that the Board be informed as to treatment assignment.
Outcome comparisons may be made at each of the Board’s formal meetings. These
comparisons are critical to the purpose of the trial, and any factors that could impact on
the fairness of the comparison need careful attention. In this context, some Board
members may prefer to be masked and others may not. A special case is the issue of
interim analyses. There is often pressure for the Board not to formally review interim
analyses dealing with primary outcomes. Further, if the Board also is not able to
conduct “informal” looks at such efficacy data, there is an interesting situation with
respect to masking. If there are no such interim analyses or looks, then the Board is
inherently totally masked as to the assessment of outcomes prior to the end of the trial
(▶ Chap. 59, “Interim Analysis in Clinical Trials”). In this case, arguments for or
against the Board being masked with respect to primary outcomes are moot.

Some Current Guidelines and Opinions

The critical aspects of assessing the pros and cons of masking the Board are the
expectations and opinions of funding agencies, of governmental agencies, of inves-
tigators, and of experts in the field. Highlighted below are excerpts from relevant
documents and publications:
1. From the European Medicines Agency (EMA) (European Medicines Agency 2005):

A Data Monitoring Committee is a group of independent experts external to a study
assessing the progress, safety data and, if needed critical efficacy endpoints of a clinical
study. In order to do so a DMC may review unblinded study information during the
conduct of the study and provide the sponsor with recommendations regarding study
modification, continuation or termination. Operating procedures describing how the
DMC works and how it communicates with other study participants (e.g. with the data
centre or the sponsor) should be in place at the start of the trial. [..] procedures should also
describe how the integrity of the study with respect to preventing dissemination of
unblinded study information is ensured.

Note that the EMA does not mandate unblinded reports; rather it indicates that the
DMC “may review unblinded study information.”

2. From the US Food and Drug Administration (FDA) (US DHHS FDA CBER
CDER CDRH 2016):

We recommend that a DMC have access to the actual treatment assignments for each study
group. Some have argued that DMCs should be provided only coded assignment information
that permits the DMC to compare data between study arms, but does not reveal which group
received which intervention, thereby protecting against inadvertent release of unblinded
interim data and ensuring a greater objectivity of interim review. This approach, however,
could lead to problems in balancing risks against potential benefits in some cases.

Note that this report references possible problems in some cases in “balancing risk
against potential benefits” (▶ Chap. 99, “Safety and Risk Benefit Analyses”) and
thus recommends that the Board be unblinded.

3. From the National Heart, Lung, and Blood Institute (NHLBI) (National Heart,
Lung, and Blood Institute National Institutes of Health 2014):

NHLBI monitoring boards:

• Are convened to protect the interests of research subjects and ensure that they are not
exposed to undue risk.
• Operate without undue influence from any interested party, including study investigators
or NHLBI staff.
• Are encouraged to review interim analysis of study data in an unmasked fashion.

These guidelines reference interim analysis and “encourage” unmasked DSMBs.

4. From the Clinical Trials Transformation Initiative (CTTI) (Clinical Trials Transformation Initiative 2016):

DMCs must periodically review the accumulating unmasked safety and efficacy data by
treatment group, and advise the trial sponsor on whether to continue, modify, or terminate a
trial based on benefit-risk assessment, as specified in the DMC Charter, protocol, and/or
statistical analysis plan. During conduct of the trial, DMCs should periodically review by
treatment group and in an unmasked fashion: primary and secondary outcome measures,
deaths, other serious and non-serious adverse events, benefit-risk assessment, consistency of
efficacy and safety outcomes across key risk factor sub-groups.

These guidelines require periodic reviews during the conduct of the trial by treatment
group in an unmasked fashion.

5. From the US Department of Health and Human Services (DHHS) (Department of Health and Human Services, Office of Inspector General 2013):

The ability of DSMBs to monitor trial progress and ensure the safety of patients may be
compromised without access to unmasked data.

Importantly, in these documents the weight of the argument on masked vs unmasked
DSMBs is on the side of the Board being unmasked. Note also that these
documents tend not to reference the opinions of those on the firing line, so to
speak, namely, the Board members themselves and study investigators.
The DSMB masking issue has also been addressed in the scientific literature.
Important examples include the following:

1. The Clinical Trials Transformation Initiative (CTTI) group conducted a survey and a
set of focus groups that consisted of DSMB members, statistical data analysis
center representatives, patients and/or patient advocate DSMB members, institu-
tional review board and US Food and Drug Administration representatives, and
industry, government, and nonprofit sponsors:

Participants indicated that the primary responsibility of a DMC is to be an independent advisory
body representing the interests of trial participants. [..] DMCs should have access to unmasked
study data in order to periodically review the accumulating safety and efficacy findings and
advise the sponsor on whether to continue, modify, or terminate a trial based on an assessment
of risks and benefits. Unmasked interim analyses should be identified in the charter and agreed
upon beforehand. Charter [..] should fully address whether the DMC will have access to
unmasked data at the subject level and aggregate level. (Calis et al. 2017; Lewis et al. 2016)

2. Chen-Mok et al. described the experiences and challenges in data monitoring for
clinical trials within an international tropical disease research network:

The interim reports discussed during closed sessions were presented using treatment
codes (eg, A and B), with any needed unblinding done in an executive session of voting
members only. The executive secretary kept sealed envelopes containing treatment
decoding information [..] These envelopes were available to members for each study
being reviewed at a meeting. DSMB members began to consider the arguments for fully
unblinded reviews and began to move toward more easily unblinding reports. However,
members did not achieve a clear position regarding automatic unblinding of reports.
(Chen-Mok et al. 2006)

3. Holubkov et al. summarized the role of the DSMB in the comparative pediatric Critical Illness Stress-Induced Immune Suppression (CRISIS) Prevention Trial:

It is difficult to conjecture whether the DSMB being unmasked at time of the first interim
analysis [..] would have led to different decisions regarding study continuation and timing of
subsequent data reviews. Blinded review requires simultaneous consideration of different
possible scenarios, and the CRISIS DSMB members were sufficiently comfortable with the
two possibilities to maintain masking until the second data review. (Holubkov et al. 2013)

4. Recent publications by Fleming et al. have suggested that:

. . . DMCs should have full access to unblinded accumulating data on safety and efficacy
throughout the clinical trial. Some believe a DMC should receive only safety data or that a
DMC that receives efficacy data only by blinded codes (e.g., Group A versus Group B) will
be more objective in assessing interim data. The consensus of the expert panel was that such
blinding was counterproductive, even potentially dangerous to the safety of the study
participants. By having access to unblinded data on all relevant treatment outcomes, the
DMC can develop timely insights about safety in the context of a benefit-to-risk assessment,
as well as about irregularities in trial conduct or in the generation of the DMC reports.
(Fleming et al. 2017; see also Fleming et al. 2018; DeMets and Ellenberg 2016)

5. In 1998, the New England Journal of Medicine published Meinert’s opinion regard-
ing masked DSMB reporting:

Masked monitoring is thought to increase the objectivity of monitors by making them less
prone to bias. What is overlooked is what masking does to degrade the competency of the
monitors. The assumption underlying masked monitoring is that recommendations for a
change in the study protocol can be made independently of the direction of a treatment
difference, but this assumption is false. Usually, more evidence is required to stop a trial
because of a benefit than because of harm. Trials are performed to assess safety and efficacy,
not to “prove” harm. Therefore, it is unreasonable to make the monitors behave as if they
were indifferent to the direction of a treatment difference. (Meinert 1998)

Implications and Suggestions

First and foremost, the distinction among the three masking options listed above can
be described simply as a difference in when unmasking occurs. For Option 1 (Masked
Mandated), unmasking would occur only at the end of the trial. For Option
2 (Unmasked Mandated), unmasking would occur at the outset. For Option 3 (Something
in Between), there may be masking at the outset but unmasking later, in accordance with
decisions made jointly by the funding entity, the Board itself, and others as appropriate.
The guidelines and opinions expressed above appear to be mostly in the context of
clinical research for which adverse events and serious adverse events are important.
In this context and in view of the information above, a reasonable conclusion is that
Option 1 (Masked Mandated) is neither practical nor tenable for many of these trials.

For those investigative settings in which there is limited concern about risk and some
concern about judgment bias, that strategy may be more acceptable. Nevertheless,
Option 3 is likely to be a better alternative than mandating masking at the outset and
maintaining it until the trial’s end.
Regarding Option 2, some language in relevant guidelines and discussions seems
to assume that the only options are Option 1 and Option 2, that is, the assumption
appears to be that any masking would be mandated in such a manner that unmasking
would occur only at the end of the trial. Nevertheless, there is considerable apparent
force behind the recommendation that Option 2 be the operative strategy.
There are, however, some concerns with Option 2 (Unmasked Mandated) that deserve
consideration. The most compelling is the opinion of the members of the Board, the funding
entity, and any other entity with an explicit role in the Board’s deliberations. This
collection necessarily has a clear understanding of the needs of the trial and should be
well positioned to judge the issues critical to its success. There certainly have been instances
where the masking strategy was discussed at the outset and the decision was for the
Board to be masked. In this circumstance there is a clear understanding that the Board
can choose to be unmasked at any point at which it seems appropriate to do so. When
considering unmasking, the discussions tend to focus, as they should, on jointly assessing
adverse events and primary and other outcomes.
An important issue is how the masking strategy utilized is reflected in the reports of
analyses summarizing adverse events and outcomes. If Option 1 is utilized, the reports
would necessarily have the treatment groups coded in some way, say Treatment A and
Treatment B. For Option 2, this would be unnecessary, and the treatment groups could
be clearly indicated. For Option 3, it may be necessary to code as per Option 1 and
then unmask this coding scheme when the decision to unmask the Board is made.
However, it may well be prudent to use coded labels (Buhr et al. 2018) in the reports in
any case as this may help prevent the identification of the treatment groups should the
reports be accessed inappropriately. If the Board is operating unmasked, then it would
simply need access to the interpretation of the codes.
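To make the coded-label strategy concrete, the following is a minimal Python sketch of how a statistical data analysis center might assign neutral codes to the treatment groups for Board reports, keeping the key in a separate, access-controlled file that an unmasked Board could simply be given. The two-arm setup, the function, and the file name are hypothetical illustrations of the idea discussed above, not a prescribed procedure.

```python
import json
import random

def make_coded_labels(arms):
    """Randomly map neutral codes (A, B, ...) to the real arm names.

    The shuffle uses the operating system's entropy source so the
    coding cannot be reconstructed from a seed.
    """
    codes = [chr(ord("A") + i) for i in range(len(arms))]
    shuffled = list(arms)
    random.SystemRandom().shuffle(shuffled)
    return dict(zip(codes, shuffled))

# Hypothetical two-arm trial: reports show only "Group A" and "Group B";
# the key is written to a separate, access-controlled file.
key = make_coded_labels(["active drug", "placebo"])
with open("dsmb_code_key.json", "w") as f:
    json.dump(key, f)
print("Report labels:", ", ".join(f"Group {c}" for c in sorted(key)))
```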

Key Suggestions

A simple strategy consistent with the apparent purpose of the guidelines and
opinions above is listed below. This strategy accommodates the wide variety of
trials and questions they address and the opinions of the Board members, funding
entity, study leaders, and others as appropriate:
Step 1. At the outset of the functioning of the Board, have a clear discussion of the
masking strategy that seems most appropriate for the trial in question. This discus-
sion would necessarily involve the funding entity and others as appropriate.
Step 2. If all agree that an unmasked approach is most appropriate and should be
used from the outset, then proceed accordingly. However, it still may be prudent to
code the labels for the treatment groups in reports, with the code readily available to
the Board, as a strategy to diminish the likelihood of inadvertent knowledge of trial
status by someone who otherwise would not have access to this information.

Step 3. If all agree that beginning with a masked approach is preferable, then a
prudent strategy would be to reconsider this decision at each subsequent meeting so
that the mask can be readily lifted if appropriate.
It should be noted that this strategy is not novel (Buhr et al. 2018). It has been used
and is being used for clinical trials both recent and underway. It puts the critical
decision as to the most appropriate masking strategy in the hands of those responsible
for the Board’s operation for a specific study. Thus, it takes into account the specific
characteristics of both the study in question and the concerns of the appointed Board.

Cross-References

▶ Data and Safety Monitoring and Reporting


▶ Interim Analysis in Clinical Trials
▶ Masking Study Participants
▶ Safety and Risk Benefit Analyses

References
Anand SS, Wittes J, Yusuf S (2011) What information should a sponsor of a randomized trial
receive during its conduct? Clin Trials 8(6):716–719
Bierer BE, Li R, Seltzer J, Sleeper LA, Frank E, Knirsch C, Aldinger CE, Lavine RJ, Massaro J,
Shah A, Barnes M, Snapinn S, Wittes J (2016) Responsibilities of data monitoring committees:
consensus recommendations. Ther Innov Regul Sci 50(5):648–659
Buhr KA, Downs M, Rhorer J, Bechhofer R, Wittes J (2018) Reports to independent data
monitoring committees: an appeal for clarity, completeness, and comprehensibility. Ther
Innov Regul Sci 52(4):459–468
Calis KA, Archdeacon P, Bain RP, Forrest A, Perlmutter J, DeMets DL (2017) Understanding the
functions and operations of data monitoring committees: survey and focus group findings. Clin
Trials 14(1):59–66
Chen-Mok M, VanRaden MJ, Higgs ES, Dominik R (2006) Experiences and challenges in data
monitoring for clinical trials within an international tropical disease research network. Clin
Trials 3(5):469–477
DeMets DL, Ellenberg SS (2016) Data monitoring committees—expect the unexpected. N Engl J
Med 375(14):1365–1371
DeMets D, Califf R, Dixon D, Ellenberg S, Fleming T, Held P, Packer M (2004) Issues in regulatory
guidelines for data monitoring committees. Clin Trials 1(2):162–169
Fleming TR, DeMets DL, Roe MT, Wittes J, Calis KA, Vora AN, Gordon DJ (2017) Data
monitoring committees: promoting best practices to address emerging challenges. Clin Trials
14(2):115–123
Fleming TR, Ellenberg SS, DeMets DL (2018) Data monitoring committees: current issues. Clin
Trials 15(4):321–328
Gordon VM, Sugarman J, Kass N (1998) Toward a more comprehensive approach to protecting
human subjects. IRB: A Review of Human Subjects Research 20(1):1–5
Holubkov R, Casper TC, Dean JM, Anand KJS, Zimmerman J, Meert KL, Nicholson C (2013) The
role of the data and safety monitoring board in a clinical trial: the CRISIS study. Pediatr Crit
Care Med 14(4):374
Lewis RJ, Calis KA, DeMets DL (2016) Enhancing the scientific integrity and safety of clinical
trials: recommendations for data monitoring committees. JAMA 316(22):2359–2360
Meinert CL (1998) Masked monitoring in clinical trials—blind stupidity? N Engl J Med 338:1381–1382
Neaton JD, Grund B, Wentworth D (2018) How to construct an optimal interim report: what the
data monitoring committee does and doesn’t need to know. Clin Trials 15(4):359–365
Wittes J (1993) Behind closed doors: the data monitoring board in randomized clinical trials. Stat
Med 12(5–6):419–424
Wittes J, Barrett-Connor E, Braunwald E, Chesney M, Cohen HJ, DeMets D, Walters L (2007)
Monitoring the randomized trials of the Women’s Health Initiative: the experience of the data and
safety monitoring board. Clin Trials 4(3):218–234
Greenberg Report (1988) Organization, review, and administration of cooperative studies
(Greenberg Report): a report from the Heart Special Project Committee to the National Advisory
Heart Council, 1967. Control Clin Trials 9:137–148

Online Documents

Department of Health and Human Services, Office of Inspector General (2013) Data and safety
monitoring boards in NIH clinical trials: meeting guidance, but facing some issues.
https://oig.hhs.gov/oei/reports/oei-12-11-00070.pdf
U.S. Department of Health and Human Services Food and Drug Administration, Center for
Biologics Evaluation and Research (CBER), Center for Drug Evaluation and Research (CDER),
Center for Devices and Radiological Health (CDRH) (2016) Guidance for clinical trial sponsors:
establishment and operation of clinical trial data monitoring committees.
https://www.fda.gov/downloads/regulatoryinformation/guidances/ucm127073.pdf
European Medicines Agency Committee for Medicinal Products for Human Use (2005) Guideline
on data monitoring committees. https://www.ema.europa.eu/documents/scientific-guideline/guideline-data-monitoring-committees_en.pdf
National Heart, Lung, and Blood Institute, National Institutes of Health (2014) NHLBI policy for
data and safety monitoring of extramural clinical studies. https://www.nhlbi.nih.gov/grants-and-training/policies-and-guidelines/nhlbi-policy-data-and-safety-monitoring-extramural-clinical-studies
Clinical Trials Transformation Initiative (CTTI) (2016) CTTI recommendations: data monitoring
committees. https://www.ctti-clinicaltrials.org/files/recommendations/dmc-recommendations.pdf
46 Variance Control Procedures
Heidi L. Weiss, Jianrong Wu, Katrina Epnere, and O. Dale Williams

Contents
Introduction 834
What Is Variance? 834
What Are the Main Sources of Variance in a Clinical Trial? 834
Why Does Variance in a Clinical Trial Matter? 835
When Is Variance Uncomfortably Large? 835
How to Control Variance Through Clinical Trial Design and Data Collection and Analysis? 836
Control Variance Through Clinical Trial Design 836
Control Variance Through Data Collection 837
Control Variance Through Data Analysis 838
Variance As a Data Quality Assessment Tool 838
Conclusion/Key Recommendations 840
Cross-References 840
References 840

Abstract
This chapter covers the concepts of variance and sources of variation for clinical
trial data. Common metrics to quantify the extent of variability in relation to the
mean are introduced as are clinical trial design techniques and statistical analysis
methods to control and reduce this variation. The uses of variance as a data quality
assessment tool in large-scale, long-term multicenter clinical trials are highlighted.
H. L. Weiss · J. Wu
Biostatistics and Bioinformatics Shared Resource Facility, Markey Cancer Center, University of
Kentucky, Lexington, KY, USA
e-mail: [email protected]; [email protected]
K. Epnere (*)
WCG Statistics Collaborative, Washington, DC, USA
O. D. Williams
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials, https://doi.org/10.1007/978-3-319-52636-2_218

Keywords
Clinical trial · Variance · Systematic errors · Measurement errors · Random error ·
Coefficient of variation · Technical error · Matched design · Crossover design ·
Repeated measures design · Power · Sample size · Analysis of covariance ·
Multiple regression analysis · Data quality assessment

Introduction

Variance is one of the first topics presented during any basic statistics course,
typically occurring shortly after discussions of the mean and other measures of
central tendency. Subsequent presentations tend to portray variance in the context
of making judgments about the mean and as the denominator in equations that
include the mean or other point estimates in the numerator. This tends to indicate
that variance is a necessary evil in the computations required for other measures, but
otherwise of limited value. This is certainly not true for experimental research and
especially not true for large-scale, long-term multicenter clinical trials.

What Is Variance?

Variance is something that can be measured, so what is this something? One simple
and somewhat intuitive way to express this is to consider variance as being a
consequence of the distances between all the numbers in a dataset. Thus, the closer
together these numbers are, the lower the variance and the further apart they are the
higher the variance. Mathematically, the sample variance is the sum of squared
deviations of each value from the mean value, divided by the sample size minus one.
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, and standard deviation: $s = \sqrt{s^2}$,

where $x_i$ is the value of the $i$th element, $\bar{x}$ is the sample mean, and $n$ is the sample size.
Thus, the units for the variance are the square of the units for the numbers used for
the calculation. To get back to the units of the initial numbers, the square root of the
variance, the standard deviation, is used. The magnitude of the variance, per se, is
not very informative. It, however, can be highly informative relative to the mean or
other relevant point estimates or to compare the variability of one set of numbers to
that of another set. This comparison can serve as a critically important data quality
assessment and monitoring tool.
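As a concrete illustration of these definitions, the short Python sketch below computes the sample variance and standard deviation for two hypothetical datasets with the same mean but different spread; the numbers are invented for illustration only.

```python
import math

def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def sample_sd(x):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(x))

tight = [9.8, 10.1, 10.0, 9.9, 10.2]   # numbers close together
loose = [7.0, 13.0, 10.0, 8.5, 11.5]   # same mean, further apart
print(sample_variance(tight), sample_sd(tight))  # 0.025, ~0.158
print(sample_variance(loose), sample_sd(loose))  # 5.625, ~2.372
```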

What Are the Main Sources of Variance in a Clinical Trial?

Clinical trials, by definition, are comparisons involving persons. In this context,
measurements provide data representing differences, that is, variation, among
persons and within persons. The process of making the measurements is expected to
provide data that represent the underlying true value at the point in time the
measurement was made as well as any deviation from this true value due to the
measurement process. Measurement error is typically unavoidable so the critical
issue is understanding its magnitude and taking steps to reduce it to more comfort-
able levels should doing so be warranted.
More broadly, in a parallel two group clinical trial, the variances can be divided
into two sources: the variance between the treatment groups and the variances within
the treatment groups. In general, the variance between the groups should be a
consequence of the treatment effects. The variances within the groups, however,
reflect the inherent differences among the individuals within the groups plus the
errors that occur in the process of making the data measurements. More specifically,
measurement errors occurring during data collection can often be due to data
collection instrument or process variability, data transfer or transcription errors,
simple calculation errors or carelessness. It is good practice for a clinical trial to
maximize the treatment effect and to reduce measurement error by using appropriate
methods for the study design, data collection, and data analyses.

Why Does Variance in a Clinical Trial Matter?

Data variation is unavoidable in clinical research and such variation due to system-
atic or random errors can cause unwanted effects and biases. Systematic error (bias)
is associated with study design and execution. When bias occurs, the results or
conclusions of a trial may be systematically distorted especially should the biases
affect one treatment group more or less than the other. These can be quantified and
avoided. On the other hand, variances due to random error occur by chance and add
noise to the system, so to speak, and thus reduce the likelihood of finding a
significant difference between treatment groups (FDA 2019). The publication by Barraza
et al. (2019) discusses these two concepts in more detail. Furthermore, variances
have great impact on the sample size estimation and precision of outcome measure-
ments of a trial. Underestimation of the variance could result in lower statistical
power to detect treatment differences than would otherwise be the case. It can also
reduce the ability to comfortably compare the results of one trial to those of other
studies.

When Is Variance Uncomfortably Large?

There are two aspects for assessing the magnitude of variance. One is the variance of
a set of numbers in relation to the average for that set. This is often assessed by the
use of the coefficient of variation (CV),

Coefficient of variation: $\mathrm{CV} = \dfrac{s}{\bar{x}} \times 100\%$,

where $s$ is the standard deviation and $\bar{x}$ is the sample mean.
The CV simply is the standard deviation divided by the average, expressed as a percentage. Generally, if the
standard deviation is more than 30% of the average, the data may be too highly
variable to be fully useful.
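A minimal Python sketch of this calculation and the 30% rule of thumb follows; the readings are hypothetical.

```python
import statistics

def coefficient_of_variation(x):
    """CV: standard deviation as a percentage of the mean."""
    return statistics.stdev(x) / statistics.mean(x) * 100

readings = [118, 132, 125, 140, 121, 129]  # hypothetical measurements
cv = coefficient_of_variation(readings)
verdict = "possibly too variable" if cv > 30 else "acceptable"
print(f"CV = {cv:.1f}% ({verdict})")
```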
The other aspect is technical error (TE), which can be based on the differences
between two independent measures of a variable. For example, a clinical chemistry
laboratory may be sent two vials of material for analyses, which represent one
sample that has been split. The identity of the two samples would not be known
by the laboratory. This process could be repeated for several samples so that a dataset
based on the assays of these paired observations can be created. These data can be
used to calculate the Technical Error for this measurement process, where
Technical error of measurement: $\mathrm{TE} = \sqrt{\dfrac{\sum_{i=1}^{n} d_i^2}{2n}}$,

where $d_i$ is the difference between measurements made on a given object on two occasions (or by two workers) and $n$ is the number of such pairs.
Detailed instructions for calculating TE are described by Perini et al. (2005). For
this situation, data quality is classified as very good if the relative TE (RTE, the TE
expressed as a percentage of the mean) is <10%, good if 10% ≤ RTE < 20%,
acceptable if 20% ≤ RTE < 30%, and not acceptable if RTE ≥ 30%. The target for
key outcome measures in a clinical trial should be <10%; however, there are no
universally accepted cut-off levels.
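The sketch below, in Python, computes the TE from hypothetical blind split-sample pairs and grades the relative TE using the cut-offs just listed; the assay values are invented.

```python
import math

def technical_error(pairs):
    """TE = sqrt(sum of squared pair differences / (2n)) over n pairs."""
    diffs = [a - b for a, b in pairs]
    return math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))

def relative_te(pairs):
    """TE expressed as a percentage of the overall mean measurement."""
    values = [v for pair in pairs for v in pair]
    return technical_error(pairs) / (sum(values) / len(values)) * 100

# duplicate assays on five blind split samples (hypothetical)
pairs = [(5.1, 5.0), (4.8, 4.9), (5.3, 5.2), (5.0, 5.0), (4.7, 4.9)]
rte = relative_te(pairs)
grade = ("very good" if rte < 10 else "good" if rte < 20
         else "acceptable" if rte < 30 else "not acceptable")
print(f"relative TE = {rte:.2f}% ({grade})")  # ~1.68% (very good)
```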

How to Control Variance Through Clinical Trial Design and Data Collection and Analysis?

Systematic error and measurement error can be reduced by 1) using appropriate
statistical designs, blocking, or stratification, 2) successfully implementing standard
data collection procedures for key measures and other data quality enhancements,
and 3) using appropriate methods of statistical analysis. Particularly important for
minimizing measurement errors is the careful use of high-quality methods for data
collection with clear procedures and sound quality assurance and quality control
methods for instruments and assays to be used. For example, measurement error can
be reduced by clear and detailed specifications on standard procedures for measuring
clinical and biological outcomes in the protocol. Further, the impact of measurement
error can also be reduced through data analysis by using statistical techniques and
methods discussed below.

Control Variance Through Clinical Trial Design

Some of the tools that can be used to control or minimize the impact of larger than
perhaps desirable variances are described below. These will be helpful in some, but
not all situations.

1. Randomization: The process of randomly allocating trial participants to the different
treatment groups has many desirable consequences (Suresh 2011). This process
results in participants tending to be spread evenly across the treatment groups in terms of
age, gender, race, genotype, education status, smoking habits, etc. Hence, the
potential for systematic differences between groups is reduced, the within group
variances are more likely to be similar and the overall variance for trial measure-
ments is likely less in face of the reduced systematic differences. The potential
confounding between prognostic factors and outcome variable is also diminished.
2. Matched or paired design: In a matched or paired design, first create pairs of
subjects such that the individuals within each pair are as alike as the
situation permits. Then, within each pair, randomly assign one subject to treatment and
the other to control. The primary outcome can then be based on the differences within pairs
across the full trial (Simon and Chinchilli 2007). This strategy has the potential to
substantially reduce the confounding between prognostic factors and outcomes.
3. Cross-over design: A more general paired design is a cross-over design. In the
simplest cross-over trial, each subject receives two different treatments, A and
B. Half the subjects receive A first and then, after a washout period, are crossed
over to B. The remaining subjects receive B first and are then crossed over
to A. Each person thus serves as their own control, eliminating or at least
substantially reducing the impact of among-person variability on the outcome assess-
ment (Simon and Chinchilli 2007). The use of cross-over designs is not without
risks, however, as there may be difficulty in implementing a fully effective washout
process. For this reason, this design is somewhat infrequently used.
4. Repeated measurements design: Replication provides an efficient way to increase
the precision of studies. For example, if the population variance is σ², then the
variance of the sample mean based on n observations is σ²/n. Thus, the precision
in measuring the mean can be increased arbitrarily with sufficient replications
(a small simulation illustrating this relationship appears after this list). Using a
repeated measurements design, the variance due to within-subject differences is
diminished, perhaps substantially (Tango 2016).
5. Increasing the sample size: In general, increasing sample size reduces the overall
error variance and thus increases the study power (Biau et al. 2008).
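As referenced in item 4, the brief Python simulation below compares the empirical standard deviation of a mean of n replicate measurements with the theoretical σ/√n (equivalently, variance σ²/n); the assumed population mean of 100 and SD of 10 are arbitrary.

```python
import random
import statistics

random.seed(1)
SIGMA = 10.0  # assumed population standard deviation

def sd_of_sample_mean(n, reps=2000):
    """Empirical SD of the mean of n replicate measurements."""
    means = [statistics.mean(random.gauss(100, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

for n in (1, 4, 16):
    empirical = sd_of_sample_mean(n)
    theory = SIGMA / n ** 0.5  # i.e., variance sigma^2 / n
    print(f"n={n:2d}: empirical SD of mean {empirical:.2f}, theory {theory:.2f}")
```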

Control Variance Through Data Collection

Measurement error cannot be eliminated completely, but it can be reduced tremendously
by clear and detailed procedures for measuring clinical and biological outcomes
in the protocol. Measurement error occurring during the data collection in laboratories
is often due to transcription errors, simple calculation errors, or carelessness. Therefore,
training of clinicians, nurses, and laboratory technicians is important.
Biologic and clinical data often are generated with the use of sophisticated
instrumentation, assays, computers, or questionnaires. The clinical personnel must
not only understand the rationale and newly developed technology but also be able
to perform consistently throughout the study according to the procedure specified in
the protocol (Chow and Liu 2014). Training on use of electronic case report forms
(eCRFs), data specification on types of variables within eCRFs, and use of clinical
trial management database systems are important.

Control Variance Through Data Analysis

Several statistical techniques and methods can be used in analysis stage of clinical
trial to control the variance.

1. ANCOVA: The analysis of covariance (ANCOVA) is a useful statistical analysis
method to improve the precision of a clinical trial. The error comes from
extraneous variables that vary randomly within the groups. If such extraneous
variables cannot be controlled by the experimenter but can be observed along
with outcome variable, then ANCOVA can adjust the outcome variable for the
effect of the concomitant variable. If such adjustment is not performed, the
concomitant variables could inflate the error variance and make the treatment
differences difficult to detect (Wang et al. 2019).
2. Analyzing change from baseline: When the outcome was also measured before the
patients were randomized, analyzing change from baseline can reduce the
variability among subjects. For example, let X be the outcome variable in a
treatment group and B the corresponding baseline measurement, and assume

$\mathrm{var}(X) = \mathrm{var}(B) = \sigma^2 \quad \text{and} \quad \mathrm{corr}(X, B) = r.$

The analysis ignoring baseline is based on the outcome X with variance σ²,
whereas the analysis based on change from baseline has variance

$\mathrm{var}(X - B) = 2\sigma^2(1 - r).$

Thus, the analysis of the difference X − B (change from baseline) involves a
smaller variance if r > 0.5; this condition is often met because of the typically
marked positive correlation between baseline and outcome levels. If the correlation
is less than 0.5, then using change from baseline introduces extra noise into the
analysis and is not recommended (Matthews 2006; EMA 2015). A simulation
illustrating this rule appears after this list.
3. Multiple regression analysis: Using regression analysis can separate the covariate
variance from the error variance, thus reducing the error variance for the treatment
assessment. Typical covariates to consider include different sites/centers, demo-
graphic and baseline clinical characteristics associated with the trial outcome that
were not controlled for in the trial design (EMA 2015).
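As referenced in item 2, the following Python simulation checks the change-from-baseline rule: it generates baseline and outcome values with a chosen correlation r and equal variances, and compares var(X) with var(X − B); the sample size and seed are arbitrary.

```python
import random
import statistics

random.seed(2)

def compare_variances(r, n=20000):
    """Return (var(X), var(X - B)) when corr(X, B) = r and var = 1."""
    b = [random.gauss(0, 1) for _ in range(n)]
    # construct X with correlation r to baseline B and unit variance
    x = [r * bi + (1 - r * r) ** 0.5 * random.gauss(0, 1) for bi in b]
    d = [xi - bi for xi, bi in zip(x, b)]
    return statistics.variance(x), statistics.variance(d)

for r in (0.3, 0.5, 0.8):
    vx, vd = compare_variances(r)
    # theory: var(X - B) = 2(1 - r); smaller than var(X) = 1 only if r > 0.5
    print(f"r={r}: var(X)={vx:.2f}, var(X-B)={vd:.2f}, theory={2 * (1 - r):.2f}")
```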

Variance As a Data Quality Assessment Tool

Measures of variance can be an important data quality assessment tool. In addition to
the CV and TE discussed above, additional approaches also can be utilized. Some
examples are included below.

Technician Performance. Especially for large-scale, long-term, multicenter
clinical trials, numerous measurements are conducted by a variety of technical
staff, often with more than one such person at each clinical center. One example is
the measurement of blood pressure (BP) with a cuff and stethoscope – the use of
automated devices also has similar issues. For example, suppose there are 12 BP
technicians and the calculation of the variances for each of these over a prespecified
time interval indicates that the BP measurements for one of the technicians have a
much higher variance than the others. This situation merits further investigation.
The first step would be to prepare the frequency distribution of the measurements for
this technician and compare it to those for the other technicians. If the measurements
for this technician are simply spread out more across the distribution than the others,
it likely means this technician is not as careful as the others or perhaps does not hear
adequately well. If the distribution is more similar to those for other technicians
except for several outliers, then the source of the outliers needs to be ascertained and
corrections made as necessary.
Further, an examination of the results for the same technician over several time
periods may identify some periods with variances higher than the others. A look at
the frequency distribution of the data for this technician or for the time periods of
concern may provide some clues. One possibility is that this technician does not
follow the protocol carefully.
Sometimes variances can be too small. For many, but not all, issues, larger-than-
desirable variances are the concern. However, there are circumstances in which
the calculated variance is too small. One such example, again for BP
measurements, is a situation in which the systolic BP measurements completed
by a specific technician had a distribution with a reasonable mean and variance.
The diastolic BP measurements also had a distribution with a reasonable
mean and variance; however, when the distribution of the differences between
systolic and diastolic BP was examined, the mean was very near 30 and the
variance was near zero. It appears that the technician measured the systolic
pressure and then, rather than taking the time to complete the measurement
process correctly, simply subtracted 30 from the systolic measurement and
recorded the result as the diastolic pressure.
Other examples can occur if instruments malfunction and get locked at one
value or a restricted range of values. In all these cases, simple examinations of
the frequency distributions can often provide insights as to the cause of the
issue.
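A minimal Python sketch of this kind of screening follows. For each (hypothetical) technician it computes the variance of the systolic readings and the variance of the systolic minus diastolic differences; a near-zero difference variance flags the fabricated-diastolic pattern described above. All identifiers and readings are invented, and in practice many more readings per technician and per time period would be examined.

```python
import statistics
from collections import defaultdict

# (technician id, systolic, diastolic) -- hypothetical readings
records = [("T01", 128, 84), ("T01", 141, 91),
           ("T02", 130, 100), ("T02", 152, 122),  # SBP - DBP = 30 always
           ("T03", 126, 80), ("T03", 150, 95)]

by_tech = defaultdict(list)
for tech, sbp, dbp in records:
    by_tech[tech].append((sbp, dbp))

for tech, vals in sorted(by_tech.items()):
    sbp_var = statistics.variance([s for s, _ in vals])
    diff_var = statistics.variance([s - d for s, d in vals])
    flag = "  <- investigate" if diff_var < 1 else ""
    print(f"{tech}: var(SBP)={sbp_var:.1f}, var(SBP-DBP)={diff_var:.1f}{flag}")
```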
Clinical Center Laboratory Performance. For those trials that use a local
laboratory at each of the participating clinical centers, it can be informative to simply
compare the variances for each assay utilized across the different clinical centers,
typically for a specific time interval. Further, comparing the variances within labo-
ratory over different time intervals can also be informative. Examining the frequency
distributions for those laboratories or time intervals can often provide clues as to the
reasons for the higher variances. Outliers, correct values or not, also can create this
situation. This can quickly be examined using frequency distributions. There may
well be data groupings other than time intervals for which this type of assessment
would be informative.

Conclusion/Key Recommendations

• Variance measures the deviation from (or the spread around) the mean in a
dataset, and it allows us to compare different datasets.
• The coefficient of variation and the technical error can be used to assess the
amount of variability in a dataset.
• There are several clinical trial design tools as well as data collection and
analysis methods that can be used to control variance. Underestimation of the
variance could result in lower statistical power to detect treatment differences
and it can also reduce the ability to comfortably compare the results between
studies.
• Variance can serve as a very useful data quality assessment tool in clinical trials.

Cross-References

▶ Controlling Bias in Randomized Clinical Trials


▶ Cross-over Trials
▶ Data Capture, Data Management, and Quality Control
▶ Single Versus Multicenter Trials
▶ Power and Sample Size

References
Barraza F, Arancibia M, Madrid E, Papuzinski C (2019) General concepts in biostatistics and
clinical epidemiology: random error and systematic error. Medwave 19(7):e7687
Biau DJ, Kernéis S, Porcher R (2008) Statistics in brief: the importance of sample size in the
planning and interpretation of medical research. Clin Orthop Relat Res 466(9):2282–2288
Chow SC, Liu JP (2014) Design and analysis of clinical trials: concepts and methodologies, 3rd edn.
Wiley, Hoboken, NJ. https://www.wiley.com/en-ug/Design+and+Analysis+of+Clinical+Trials%3A+Concepts+and+Methodologies%2C+3rd+Edition-p-9780470887653
European Medicines Agency (EMA) (2015) Guideline on adjustment for baseline covariates in
clinical trials. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-adjustment-baseline-covariates-clinical-trials_en.pdf. Accessed 23 Mar 2021
Matthews J (2006) Introduction to randomized controlled clinical trials, 2nd edn, chapter 6, p 78.
Chapman & Hall/CRC Texts in Statistical Science. https://www.routledge.com/Introduction-to-Randomized-Controlled-Clinical-Trials/Matthews/p/book/9781584886242
Perini TA, de Oliveira GL, Ornellas JdS, de Oliveira FP (2005) Technical error of measurement in
anthropometry. Rev Bras Med Esporte 11(1):81–85. https://doi.org/10.1590/S1517-86922005000100009
Simon LJ, Chinchilli VM (2007) A matched crossover design for clinical trials. Contemp Clin
Trials 28(5):638–646. https://doi.org/10.1016/j.cct.2007.02.003
Suresh K (2011) An overview of randomization techniques: an unbiased assessment of outcome in
clinical research. J Hum Reprod Sci 4(1):8–11. https://doi.org/10.4103/0974-1208.82352
Tango T (2016) On the repeated measures designs and sample sizes for randomized controlled trials.
Biostatistics 17(2):334–349. https://doi.org/10.1093/biostatistics/kxv047
U.S. Department of Health and Human Services Food and Drug Administration Center for Drug
Evaluation and Research Center for Biologics Evaluation and Research (2019) Enrichment
strategies for clinical trials to support determination of effectiveness of human drugs and
biological products guidance for industry. https://www.fda.gov/media/121320/download.
Accessed 25 Mar 2021
Wang B, Ogburn E, Rosenblum M (2019) Analysis of Covariance (ANCOVA) in randomized trials:
more precision, less conditional bias, and valid confidence intervals, without model assump-
tions. Biometrics 75:1391–1400
47 Ascertainment and Classification of Outcomes
Wayne Rosamond and David Couper

Contents
Introduction 844
Masking to Treatment Assignment 845
Competing Risks 845
Types of Outcomes 845
Major Clinically Recognized Events 845
Asymptomatic Subclinical Measurements 846
Patient-Reported Outcomes (PROs) 847
Time to Event 847
Models of Event Ascertainment and Classification 848
Outcome Event Identification 849
Obtaining Diagnostic Data Elements 850
Development of Data Capture Instruments 850
Training in Data Capture 851
Classification of Clinical Events 851
Ethics 852
Administrative Oversight of Outcome Ascertainment 853
Conclusion 853
Key Facts 853
References 853

W. Rosamond (*)
Department of Epidemiology, Gillings School of Global Public Health, University of North
Carolina, Chapel Hill, NC, USA
e-mail: [email protected]
D. Couper
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina,
Chapel Hill, NC, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials, https://doi.org/10.1007/978-3-319-52636-2_233

Abstract
Successful completion and valid conclusions from a clinical trial rely on having
complete and accurate outcome data. Misclassification of outcome events can
introduce systematic error and bias as well as reduce the statistical power of the
trial. Outcomes of interest in clinical trials vary and can include major clinically
recognized events; asymptomatic subclinical measurements; and/or patient-cen-
tered reported outcomes. Final classification of study outcomes often involves use
of standardized computer algorithms, processing of materials and review with
outcome classification committees, and linkage with electronic data sources.
Processes for accomplishing the goals of outcome ascertainment and classifica-
tion can be designed as centralized systems, de-centralized networks of investi-
gators, or a hybrid of these two methods. Challenges to obtaining valid outcomes
include ensuring complete follow-up of study participants, use of standardized
event definitions, capture of relevant diagnostic information, establishing pro-
tocols for review of potential events, training of clinical review teams, linkage to
data sources across various platforms, quality control, and administrative oversight
of the process. Designers of clinical trials need to consider carefully their
approach for event identification, capture of diagnostic data, utilization of stan-
dardized diagnostic algorithms and/or clinical review committees, and mecha-
nisms for maintaining data quality.

Keywords
Event ascertainment · Outcome classification · Adjudication · Bias

Introduction

In the process of conducting clinical trials and observational studies, there is often
a heavy focus on treatment and exposure assessment. Although important, this
attention can occur at the cost of less consideration of the complexities of complete
identification and valid classification of outcomes. Successful completion and valid
conclusions of a clinical trial rely on having complete and accurate outcome
data, particularly for the primary outcome. In clinical trials, if missingness or
misclassification of outcome events is unrelated to treatment group assignment,
this may merely reduce the statistical power of the trial and bias results toward the
null. However, if the missingness or misclassification varies across treatment or
exposure groups, this introduces systematic error in the results of the trial and may
bias findings in either direction. There are many challenges to obtaining valid
outcomes in clinical trials. These include ensuring complete follow-up of study
participants, use of standardized event definitions, capture of relevant diagnostic
information, establishing protocols for review of potential events, training of clinical
review teams, linkage to data sources across various platforms, quality control, and
administrative oversight of the process. This chapter focuses on methods to obtain
information needed for the full assessment of trial outcomes and the process of using
that information to determine the outcomes for all participants.

Masking to Treatment Assignment

The best designed clinical trials take particular care to reduce the potential for
differential misclassification to occur across treatment groups. Even if participants
and the investigators and staff involved in treatment provision cannot be masked,
it is desirable that those involved in any aspect of the outcome ascertainment
and classification be unaware of the participants’ treatment group assignments.
Otherwise, classifications may be either consciously or subconsciously influenced
by knowledge of the treatment group.

Competing Risks

The definition of outcome or the statistical methods for analyzing them need to
account for the potential for the outcome to be missing because of competing risks.
For instance, in a trial in the elderly of a method to reduce the risk of decline in
cognitive function, some participants may be missing information about change
in cognitive function because they die before having a follow-up assessment of
cognitive function. There are accepted statistical approaches to address competing
risks, such as the Fine and Gray model for competing risks in time-to-event analyses
(Fine and Gray 1999). Details of these methods are addressed elsewhere in this
monograph.
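For illustration only, the Python sketch below implements from scratch the standard nonparametric cumulative incidence estimator for an event of interest in the presence of a competing risk such as death. It is a didactic sketch of the quantity such analyses target, not a substitute for the Fine and Gray regression model, and the toy data are invented.

```python
def cumulative_incidence(times, causes, cause_of_interest=1):
    """Nonparametric cumulative incidence with competing risks.

    times  : follow-up time for each participant
    causes : 0 = censored, 1 = event of interest, 2 = competing event
    Returns (time, cumulative incidence) at each event-of-interest time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv = 1.0          # overall event-free survival just before t
    cif = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_interest = d_any = n_removed = 0
        while i < len(data) and data[i][0] == t:
            d_interest += data[i][1] == cause_of_interest
            d_any += data[i][1] != 0
            n_removed += 1
            i += 1
        cif += surv * d_interest / n_at_risk   # S(t-) * d1 / n at risk
        surv *= 1 - d_any / n_at_risk          # update overall survival
        n_at_risk -= n_removed                 # drop events and censored
        if d_interest:
            out.append((t, round(cif, 3)))
    return out

# toy data: cognitive decline (1) with death (2) as a competing risk
times = [2, 3, 3, 4, 5, 6, 7, 8]
causes = [1, 2, 0, 1, 2, 1, 0, 0]
print(cumulative_incidence(times, causes))  # [(2, 0.125), (4, 0.275), (6, 0.425)]
```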

Types of Outcomes

Outcomes of interest in clinical trials may be considered in three main categories.
These categories include (1) major clinically recognized events, (2) asymptomatic
subclinical measurements, and (3) patient-centered reported outcomes. Although
most clinical trials have a single primary outcome in just one of these domains,
some seek to capture information on all three types of outcomes for investigation as
secondary outcomes.

Major Clinically Recognized Events

Events that generally come to the attention of medical care services are often primary
outcomes in clinical trials. Examples of major clinically recognized outcomes
include acute myocardial infarction, acute decompensated heart failure, stroke, all-
cause or cause-specific mortality, exacerbations of chronic obstructive pulmonary
disease and asthma, venous thromboembolism, gestational diabetes, diabetes
mellitus, major infections, trauma, and injury. Events such as acute myocardial
infarction generally have a well-defined time of occurrence. Onset dates of other
events such as diabetes or heart failure are less well identified. By definition,
clinically recognized outcomes involve contact with medical personnel, though
often not with staff involved in the clinical trial. For instance, there is generally no
expectation that a participant who has an acute myocardial infarction will be treated
in hospitals in connection with investigators in the clinical trial. The ease with which
such potential events can be identified depends on the type of health-care system in
the country in which the trial is conducted. In a country with a single-payer health-
care system, information about hospitalizations is collected centrally, and with the
appropriate permissions, it is relatively straightforward to obtain the medical records
needed for event classification and adjudication. In the USA, if a trial is done using
participants from a managed care consortium such as Kaiser Permanente, the
situation is similar to a country with a single-payer health system, except that
participants may move to a different health-care system during follow-up. When
participants are not all in a single managed care consortium, identification of
potential events and obtaining the medical records needed for classification are
much more complex.

Asymptomatic Subclinical Measurements

Asymptomatic subclinical assessments can be either primary or secondary
outcomes of clinical trials and encompass outcomes, conditions, or stages of
conditions that generally do not come to the attention of health-care systems.
They require independent assessment through either study specific clinical visits,
home visits, or contact of study participants through phone, email, or other
means. Examples of asymptomatic subclinical outcomes include results from
imaging (e.g., white matter lesions in the brain measured by magnetic resonance
imaging (MRI); vessel wall thickness of carotid arteries measured by b-mode
ultrasound; microvascular narrowing measured by retinal photography; benign
electrocardiographic abnormalities from electrocardiography (ECG); coronary
calcium measured by computed tomography (CT) scans); biomarker measure-
ments (e.g., serum lipoproteins, cardiac troponin levels); and standardized ques-
tionnaire assessments (e.g., cognitive function tests, assessments of diet and
physical activity, activities of daily living, range of motion). Outcomes that involve
measurement or administration of questionnaires or tests at an in-person study visit
may not require adjudication. For instance, in the blood pressure example, the
measured blood pressure is the outcome. Similarly, in the COMBINE trial, a
participant’s outcome was obtained from the structured drinking assessment
(Anton et al. 2006). In such instances, procedures need to be in place to ensure
good-quality data, such as appropriate training of study personnel, regular equip-
ment checks, and quality control checks, but there is no adjudication of the
outcomes themselves.

Patient-Reported Outcomes (PROs)

Patient-centered or patient-reported outcomes are measures of patients’ direct
experiences of health conditions and health care (Weldring and Smith 2013).
Patient-reported outcomes are directly reported by the patient without interpretation
of the patient’s response by a clinician or anyone else and pertain to the patient’s
health, quality of life, or functional status associated with health care or treatment.
These outcomes may be measured in absolute terms, such as a patient’s rating of
the severity of pain, and new onset of nausea following administration of a new drug
and may include functional status, health service satisfaction, and quality of life.
Outcomes such as hospital readmissions may also be included as patient centered
and can be measured using administrative claims database or self-report. In the
COMBINE study of treatments for alcohol dependence, the co-primary outcomes of
percent days abstinent during the 16-week treatment period and time to relapse to
heavy drinking were determined from structured in-person interviews. Such inter-
views could potentially be conducted by telephone (though this was not done in
COMBINE). In the Aging and Cognitive Health Evaluation in Elders (ACHIEVE)
trial, the primary outcome is change in global cognitive function, which requires
assessment at an in-person interview (Deal et al. 2018). A key secondary outcome
includes a diagnosis of dementia. Many patient-reported outcome assessments such
as dementia do not require an in-person interview. They can be done using a brief
telephone interview with the participant or an informant or using hospital records
and death certificates. An example of a randomized pragmatic clinical trial that
used patient-reported outcomes as the primary study outcome is the Comprehensive
Post-Acute Stroke Services (COMPASS) trial (Duncan et al. 2017). Patient-centered
outcomes in COMPASS were collected from telephone surveys administered at
90 days post-hospital discharge. A centralized survey research calling center adminis-
tered the phone interview. The primary outcome was patient-reported functional
status as measured by the 16-item Stroke Impact Scale (SIS-16). The SIS-16 is a self-
reported questionnaire that can be completed by the patient or a proxy and was
selected because it is an outcome that matters to patients, their caregivers, and stroke
experts. Interviewers were masked to treatment group and used standardized scripts
and interviewing guidelines. COMPASS utilized reminder letters, additional phone
contacts, mailed surveys, and proxy interviews to increase follow-up rates.

Time to Event

Time-to-event (“survival”) methods are typically used to analyze outcomes from
clinical trials. Participants who do not have the event of interest during the trial are
censored in the analysis, either at the time they are lost to follow-up or administra-
tively at a designated date for the end of follow-up, whichever is earlier. Censoring
should be at the last time at which the participant was known to be free of the event,
even if that was earlier than the administrative censoring date, because absence of
information about an event does not automatically imply the participant did not have
the event. Other outcomes such as heart failure can be thought of as clinical
syndromes with a diffuse event onset time. Occurrence time may be defined as
first onset of symptoms, which in the case of heart failure could be progressive over
an extended period of time. Some trials may choose to define onset of these types of
events as the time the condition requires hospitalization or the time it is diagnosed in
the outpatient setting. In the case of heart failure, the actual condition may have been
present for some time prior to this defined start date.
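A minimal Python sketch of deriving time-to-event analysis variables under the censoring convention described above follows; censoring is at the earlier of the last event-free contact and the administrative censoring date, and all participant dates are hypothetical.

```python
from datetime import date

def follow_up(randomized, event_date, last_contact, admin_censor):
    """Return (days of follow-up, event indicator) for one participant.

    Participants without the event are censored at the earlier of their
    last event-free contact and the administrative date, since absence
    of information does not imply absence of the event.
    """
    if event_date is not None and event_date <= admin_censor:
        return (event_date - randomized).days, 1
    censor = min(last_contact, admin_censor)
    return (censor - randomized).days, 0

admin = date(2021, 1, 1)
print(follow_up(date(2020, 1, 1), date(2020, 6, 1), date(2020, 12, 1), admin))  # (152, 1)
print(follow_up(date(2020, 1, 1), None, date(2020, 9, 1), admin))               # (244, 0)
```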

Models of Event Ascertainment and Classification

The goal of the event ascertainment and classification component of clinical trials is
to completely identify all events and establish valid event classification for each.
There are several models to identify and classify events. The best choice to employ
depends partially on the type of events targeted by the trial and on the size and
duration of the clinical trial. Operational structure and resources of the trial also
influence the methods used. Models for accomplishing the goals of event ascertain-
ment and classification can be grouped into three types: centralized systems,
decentralized systems, or a hybrid of these two models.
A centralized model is one that establishes special clinics where study partici-
pants return for asymptomatic subclinical outcome assessment through a clinic
examination, biomarker measurement, and/or questionnaire evaluation. Clinical
outcomes may also be determined at central special clinics but would most likely
come to attention of other health-care providers within and outside the sphere of the
clinical trial investigators. Even though events may be identified in hospitals
and clinics outside of special centralized centers, medical records and diagnostic
elements are often sent to centralized reading (e.g., electrocardiograms) and/or
abstraction centers that employ specially trained medical record abstractors. Using
centralized reading centers or review of diagnostic elements helps reduce variation in
clinical practice fashions and increases standardization.
Decentralized outcome assessments are also common. Studies that employ a
decentralized system rely on identifying and obtaining medical records from all
facilities utilized by participants. These facilities could be anywhere around the
world. A considerable effort is required to identify and to obtain complete sets of
diagnostic information from the various medical systems. These decentralized
systems may also incorporate home visits with participants in comparison with
having participants return to one or more central specially created clinical sites.
An example of this method in a major observational study is the REGARDS study
(Howard et al. 2005). In this study of approximately 30,000 participants, mobile
units were sent to the homes of participants to capture information on subclinical and
patient-reported conditions. Clinical events were identified through participant self-
report. Records for reported hospitalizations were then sought and abstracted
centrally.

Outcome Event Identification

Outcome ascertainment systems in clinical trials strive for complete capture of the
outcomes of interest. This often involves identification of a wide net of events from
which outcomes of interest are further evaluated and classified. The type of approach
depends on the type of outcomes (i.e., clinical, subclinical, patient-reported) that are
of most interest to the study. Studies often apply highly sensitive selection criteria for
potential events in order to ensure complete and comprehensive case ascertainment.
Clinical trials use a variety of methods that can include participant self-report of
potential events, searches of electronic medical record (EMR) lists obtained
from selected health-care facilities and clinics, utilization of wearable devices by
participants (e.g., transdermal patch electrocardiographic (ECG) monitors permit
extended noninvasive ambulatory monitoring for atrial fibrillation and other cardiac
conditions (Heckbert et al. 2018)), and periodic participant examinations (e.g.,
sequential ECG evaluations to identify silent myocardial infarction).
An example of using participant self-report to obtain comprehensive ascertain-
ment of outcomes in a large observational cohort study is the Atherosclerosis
Risk in Communities (ARIC) study (The ARIC investigators 1989). Briefly,
study participants are contacted by phone twice annually to obtain self-reported
hospitalizations for any reason. Medical records of all reported hospitalizations are
sought. In addition to identifying potential events from patient self-report, electronic
files of hospital discharges are obtained from hospitals in the regions from which the
cohort was drawn. These files are searched using participants’ information to
identify hospitalizations for study participants. Approximately 10% of total study
outcomes are identified from searching electronic files of discharges from selected
hospitals that were not otherwise identified from participant self-report. The result of
this case identification approach is a comprehensive list of hospitalized outcomes
from which event classification and validation can proceed. In the Hispanic
Community Health Study/Study of Latinos (HCHS/SOL), a similar approach is
used but includes participant self-report of all visits to an emergency department
not leading to hospitalization (Sorlie et al. 2010).
It is important that clinical trials establish a clear and detailed description of the
outcome of interest. This is key to the rigor and reproducibility of findings in
the context of other studies and patient populations. An example of this level of
outcome description is the RIVUR (Randomized Intervention for Children with
Vesicoureteral Reflux) trial, a double-mask placebo-controlled trial of antimicrobial
prophylaxis; the primary outcome to evaluate treatment efficacy was recurrence of F/
SUTI (febrile or symptomatic urinary tract infection (UTI)) (RIVUR Trial Investiga-
tors et al. 2014). Suspected recurrent UTI events were reviewed and adjudicated to
determine if they met the RIVUR criteria for a primary outcome. The definition of
recurrent F/SUTI required the presence of fever or urinary tract symptoms, pyuria
based on urinalysis, and culture-proven infection with a single organism. A UTI was
defined as recurrent only if its onset occurred more than 2 weeks from the last day of
appropriate treatment for the preceding UTI or following a negative urine culture or
it was an infection with a new organism. The study had a UTI Classification
Committee (UCC). All reported medical care visits required data collection using
standardized study procedures, with data entered into the central data management
system (DMS). Those visits where a potential UTI was identified were reviewed and
classified by the UCC, using standardized criteria to adjudicate each event according
to the study definitions. When an algorithm in the data management system identi-
fied a potential outcome, relevant data were sent to two randomly selected members
of the UCC. Each of the two UCC members classified the event and entered their
responses into the DMS. If the classifications by the two reviewers disagreed, the
UCC met in person or by conference call to come to a final decision.
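As an illustration of this two-reviewer adjudication workflow, the Python sketch below (reviewer rules and event fields are hypothetical, not the RIVUR data management system) lets concordant classifications stand and sends discrepancies to a committee decision:

```python
import random

def adjudicate(event, reviewers, committee):
    """Two randomly selected reviewers classify the event independently;
    concordant classifications stand, discrepancies go to the committee."""
    first, second = random.sample(reviewers, 2)
    c1, c2 = first(event), second(event)
    return c1 if c1 == c2 else committee(event)

# Hypothetical reviewer rules loosely echoing the F/SUTI definition above.
def reviewer_a(e):
    return "F/SUTI" if e["fever_or_symptoms"] and e["pyuria"] else "no event"

def reviewer_b(e):
    return "F/SUTI" if e["single_organism_culture"] else "no event"

event = {"fever_or_symptoms": True, "pyuria": True, "single_organism_culture": False}
print(adjudicate(event, [reviewer_a, reviewer_b], committee=lambda e: "no event"))
```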

Obtaining Diagnostic Data Elements

Once possible outcome events are identified, clinical trials must have standardized
approaches to obtain the relevant diagnostic information needed for event validation
and classification. Traditional methods of manual abstraction by trained medical
records abstractors have been widely used and are successful at reliable collection of
diagnostic elements from medical records. More recent approaches employ natural
language processing programs to capture information from EMR text fields on
symptom presentation, disease course, and other relevant diagnosis elements (e.g.,
presence of cardiac chest pain, worsening of difficulty breathing). Electronic medical
records can also be an efficient method to capture structured data elements (e.g.,
laboratory values, test results, medications) needed to validate and classify study
outcomes. Although the capture of diagnostic elements from EMR relying solely on
computer-based methods has great potential for efficiency, challenges remain in the
area of interoperability across EMR platforms and establishing and maintaining
acceptable sensitivity and specificity compared to traditional medical record review.
Once highly reliable and valid data are obtained (either by electronic or manual
approach), these data can then be used in computerized standard event classification
algorithms. Hybrid approaches to diagnostic data capture are also used. Structured
data elements captured from EMR combined with manual abstraction guided by
natural language processing are an example of a hybrid data collection approach.
Computer systems can be used to search text fields and locate and highlight
key diagnostic information that can then be confirmed by manual overread
by trained study personnel.

Development of Data Capture Instruments

Studies typically develop online computer systems for reviewers to use in event
classification. After decisions have been made about the information that needs to be
captured for event classification, case report forms need to be developed and
programmed to be used for data entry by abstractors. These systems usually need
to include not only fields for capturing specified data elements but also to allow
inclusion of narrative sections of the medical record and uploading of images, such
as MRIs and other components in electronic formats, such as ECGs. Once data entry
for a participant is complete, an algorithm and/or clinician uses the information to
decide the event type. When two clinicians or a clinician and the algorithm have
reviewed an event, the system compares the reviews. If there are discrepancies, they
are resolved either by mutual agreement or adjudication by an additional reviewer.
The system needs to be able to incorporate such resolutions or adjudications and
record the final decision as to the nature of the event (see section “Classification of
Clinical Events”).

Training in Data Capture

Standardized initial training and recurrent recertification of staff involved in outcome
data capture are essential to maintain high-quality outcome ascertainment and
validation. This is important for obtaining data about clinical events as well as
when outcomes are derived from participant self-report using interview
questionnaires.

Classification of Clinical Events

Once relevant diagnostic data elements captured for potential study outcomes
are available, classification of events can proceed. Methods for determining final
classification of study outcomes vary and include use of standardized computer
algorithms, processing for review with outcome classification committees, and
linkage with electronic data sources (e.g., clinical registries, administrative claims,
mortality registries, and death indexes).
Computer diagnostic algorithms for determining final study events exist for many
major clinical outcomes. For example, a widely used algorithm for classifying acute
myocardial infarction utilized data on cardiac pain symptoms, biomarker evidence,
and electrocardiographic evidence (Luepker et al. 2003). The results of this and
similar algorithms are a spectrum of certainty of classification such as definite,
probable, suspect, or no acute myocardial infarction. Input from additional reading
of electrocardiograms can be incorporated to identify subclasses of myocardial
infarction, namely, ST segment MI (STEMI) or non-ST segment elevation MI
(NSTEMI). While widely used in trials, a limitation of these types of algorithms is
that they are not specific enough to classify subcategories of events based on newer
universal definitions of myocardial infarction (i.e., acute myocardial infarction
subtype 1 through subtype 5 (Thygesen et al. 2018)). Clinical overread of diagnostic
information is required to produce valid subtyping of these events.
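To make the structure of such algorithms concrete, the following Python sketch maps symptom, biomarker, and ECG evidence onto a spectrum of diagnostic certainty; the rules are invented for illustration and are not the published Luepker et al. (2003) criteria:

```python
def classify_mi(cardiac_pain: bool, biomarkers: str, ecg: str) -> str:
    """Map symptom, biomarker, and ECG evidence ("diagnostic", "equivocal",
    or "normal") to a spectrum of diagnostic certainty. The rules below are
    invented for illustration; they are NOT the Luepker et al. criteria."""
    if biomarkers == "diagnostic" and (cardiac_pain or ecg != "normal"):
        return "definite MI"
    if biomarkers == "equivocal" and cardiac_pain and ecg == "diagnostic":
        return "probable MI"
    if cardiac_pain or biomarkers == "equivocal" or ecg == "equivocal":
        return "suspect MI"
    return "no acute MI"

print(classify_mi(cardiac_pain=True, biomarkers="diagnostic", ecg="equivocal"))
```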
Other major outcomes such as stroke, heart failure, and respiratory disease are
less well suited for reliance on diagnostic algorithms and may require processing for
final diagnostic classification with outcome review committees. Outcome review
committees are commonly used by clinical trials to determine the final diagnostic
classification of study outcomes. Methods used to establish and operate outcome
review committees vary across studies. Some studies use a consensus model
whereby all eligible cases are reviewed by all or a subset of committee members,
with one member being assigned as the primary reviewer. Under this model, the
primary reviewer summarizes the case for the committee, and the case is discussed
among all committee members in a conference or webinar. A consensus diagnosis is
the result of this method. Another common model is the independent reviewer
model with adjudication of disagreements between original reviews. Under this
method, eligible cases are independently reviewed and classified by reviewers
on the committee. Disagreement between reviews on the case classification is
adjudicated by a third reviewer (often the chair of the committee) to create the
final study classification. Each reviewer selects a classification by following
reviewer guidelines and case law. The reviewer guidelines and case law should be
periodically reviewed by members of the committee and modified as needed in
order to be consistent both within the committee and with contemporary clinical
guidelines. For clinical trials, it is important that reviewers be
masked to treatment arm when completing their classification of cases.
Annual recertification of review committee members is recommended. For annual
recertification training, all reviewers independently classify a set of selected cases and
then review each response in a group setting, seeking clarification from the established
guidelines and making new case law as needed. Retention of reviewers is important to
help maintain high levels of quality control in the outcome classification process.
Methods to ensure retention of committee members include financial compensation on a
per case basis or percent effort and/or involvement of review committee members in the
development of publications and other scholarly products from the study. Another
important aspect of maintaining high-quality outcome assessment in studies using case
review committees is the use of online systems to record reviewers’ classification and
ongoing quality control reporting. In the setting where two reviewers classify each event,
with a third reviewer adjudicating disagreements, quality control reports would typically
provide information about how frequently reviewers disagree and, when there is adju-
dication, how frequently the adjudicator agrees with each of the two initial reviewers.
The use of online, web-based data systems is important in managing the work of these
committees and in keeping outcome classification on preestablished timelines. An online
event reviewer data collection system with real-time data checks and help menus that
allows reviewers remote access to view standardized summaries of diagnostic data
elements, imaging, physicians’ notes, procedure notes, and medication lists while they
are completing an event classification review form is helpful for ensuring high-quality
data on outcome classification as well as successful management of a review committee.
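As a simple illustration of such quality control reporting, the Python sketch below (with invented toy records) computes how often two reviewers disagree and, among adjudicated cases, how often the adjudicator sides with each initial reviewer:

```python
from collections import Counter

# Invented toy records: (reviewer 1, reviewer 2, adjudicator or None).
reviews = [
    ("definite", "definite", None),
    ("definite", "probable", "definite"),
    ("no event", "probable", "probable"),
    ("probable", "probable", None),
]

disagreements = [r for r in reviews if r[0] != r[1]]
print(f"Reviewer disagreement rate: {len(disagreements) / len(reviews):.0%}")

# Among adjudicated cases, how often does the adjudicator side with each reviewer?
agreement = Counter()
for r1, r2, adj in disagreements:
    if adj == r1:
        agreement["reviewer 1"] += 1
    if adj == r2:
        agreement["reviewer 2"] += 1
for who, k in agreement.items():
    print(f"Adjudicator agreed with {who} in {k}/{len(disagreements)} adjudicated cases")
```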

Ethics

Clinical trials usually require all participants to provide informed consent at the time
of entry into the study. There are some types of trials for which a waiver of consent
may be granted, such as a trial of a new method of CPR for treating out-of-hospital
cardiac arrest (Aufderheide et al. 2011). If medical records are needed for outcome
identification and classification, participants also need to sign an agreement allowing
their medical records to be obtained from physicians and hospitals.

Administrative Oversight of Outcome Ascertainment

Administrative staff are usually required to manage the outcome ascertainment
process. Although many aspects may be automated, an administrator is often
responsible for assigning events to reviewers based on information about their
workload, and for following up when reviewers are overdue in completing their
assigned cases. Experienced reviewers are typically busy clinicians who have to
fit reviewing into hectic schedules, so having an administrator keeping track of
progress is critical for timely classification of events.

Conclusion

Accurate ascertainment of study outcomes is a critical component of a clinical trial.
Outcomes need to be defined unambiguously and procedures put in place to
maximize completeness and accuracy of outcome classification.

Key Facts

• Complete and accurate outcome ascertainment and classification are important
  for making valid conclusions from clinical trials.
• Methods for identifying and classifying outcomes vary depending on the aims of
the trial and nature of the outcomes of interest.
• Design of clinical trials needs to consider models for event identification,
methods to capture relevant diagnostic data, utilization of standardized diagnostic
algorithms and/or clinical review committees, and mechanisms for maintaining
data quality.

References
Anton RF, O'Malley SS, Ciraulo DA, Cisler RA, Couper D, Donovan DM, Gastfriend DR,
Hosking JD, Johnson BA, LoCastro JS, Longabaugh R, Mason BJ, Mattson ME, Miller WR,
Pettinati HM, Randall CL, Swift R, Weiss RD, Williams LD, Zweben A; COMBINE Study
Research Group (2006) Combined pharmacotherapies and behavioral interventions for alcohol
dependence: the COMBINE study: a randomized controlled trial. JAMA 295(17):2003–2017
Aufderheide TP, Frascone RJ, Wayne MA, Mahoney BD, Swor RA, Domeier RM, Olinger ML,
Holcomb RG, Tupper DE, Yannopoulos D, Lurie KG (2011) Standard cardiopulmonary
resuscitation versus active compression-decompression cardiopulmonary resuscitation with
augmentation of negative intrathoracic pressure for out-of-hospital cardiac arrest: a randomized
trial. Lancet 377(9762):301–311
Deal JA, Goman AM, Albert MS, Arnold ML, Burgard S, Chisolm T, Couper D, Glynn NW,
Gmelin T, Hayden KM, Mosley T, Pankow JS, Reed N, Sanchez VA, Richey Sharrett A,
Thomas SD, Coresh J, Lin FR (2018) Hearing treatment for reducing cognitive decline: design
and methods of the Aging and Cognitive Health Evaluation in Elders randomized controlled
trial. Alzheimers Dement (N Y) 4:499–507. ClinicalTrials.gov entry NCT03243422
Duncan PW, Bushnell CD, Rosamond WD, Jones Berkeley SB, Gesell SB, D'Agostino RB Jr,
Ambrosius WT, Barton-Percival B, Bettger JP, Coleman SW, Cummings DM, Freburger JK,
Halladay J, Johnson AM, Kucharska-Newton AM, Lundy-Lamm G, Lutz BJ, Mettam LH,
Pastva AM, Sissine ME, Vetter B (2017) The Comprehensive Post Stroke Services (COMPASS)
study: design and methods of a cluster randomized pragmatic trial. BMC Neurol 17(1):133
Fine JP, Gray RJ (1999) A proportional hazards model for the subdistribution of a competing risk.
J Am Stat Assoc 94:496–509
Heckbert SR, Austin TR, Jensen PN, Floyd JS, Psaty BM, Soliman EZ, Kronmal RA (2018)
Yield and consistency of arrhythmia detection with patch electrocardiographic monitoring: the
multi-ethnic study of atherosclerosis. J Electrocardiol 51(6):997–1002
Howard V, Cushman M, Pulley L, Gomez C, Go R, Prineas R, Graham A, Moy C, Howard G
(2005) The reasons for geographic and racial differences in stroke study: objectives and design.
Neuroepidemiology 25:135–143
Luepker RV, Apple FS, Christenson RH, Crow RS, Fortmann SP, Goff D, Goldberg RJ, Hand MM,
Jaffe AS, Julian DG, Levy D, Manolio T, Mendis S, Mensah G, Pajak A, Prineas RJ, Reddy KS,
Roger VL, Rosamond WD, Shahar E, Sharrett AR, Sorlie P, Tunstall-Pedoe H, AHA Council
on Epidemiology and Prevention; AHA Statistics Committee; World Heart Federation
Council on Epidemiology and Prevention; European Society of Cardiology Working Group
on Epidemiology and Prevention; Centers for Disease Control and Prevention; National
Heart, Lung, and Blood Institute (2003) Case definitions for acute coronary heart disease
in epidemiology and clinical research studies: a statement from the AHA Council on Epidemi-
ology and Prevention; AHA Statistics Committee; World Heart Federation Council on Epide-
miology and Prevention; the European Society of Cardiology Working Group on Epidemiology
and Prevention; Centers for Disease Control and Prevention; and the National Heart, Lung, and
Blood Institute. Circulation 108(20):2543–2549
RIVUR Trial Investigators, Hoberman A, Greenfield SP, Mattoo TK, Keren R, Mathews R, Pohl HG,
Kropp BP, Skoog SJ, Nelson CP, Moxey-Mims M, Chesney RW, Carpenter MA (2014) Antimi-
crobial prophylaxis for children with vesicoureteral reflux. N Engl J Med 370(25):2367–2376
Sorlie PD, Avilés-Santa LM, Wassertheil-Smoller S, Kaplan RC, Daviglus ML, Giachello AL,
Schneiderman N, Raij L, Talavera G, Allison M, Lavange L, Chambless LE, Heiss G (2010)
Design and implementation of the Hispanic Community Health Study/Study of Latinos.
Ann Epidemiol 20:629–641
The ARIC investigators (1989) The atherosclerosis risk in communities (ARIC) study: design and
objectives. Am J Epidemiol 129:687–702
Thygesen K, Alpert JS, Jaffe AS, Chaitman BR, Bax JJ, Morrow DA, White HD, Executive Group
on behalf of the Joint European Society of Cardiology (ESC)/American College of Cardiology
(ACC)/American Heart Association (AHA)/World Heart Federation (WHF) Task Force for the
Universal Definition of Myocardial Infarction (2018) Fourth universal definition of myocardial
infarction (2018). J Am Coll Cardiol 72(18):2231–2264
Weldring T, Smith S (2013) Patient-reported outcomes (pROs) and patient-reported outcome
measures (pROMs). Health Serv Insights 6:61–68
48 Bias Control in Randomized Controlled Clinical Trials
Diane Uschner and William F. Rosenberger

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
Restricted Randomization in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
Covariate Imbalances and Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
Correct Guesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
Conditional Allocation Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
Type I Error Probability and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
Multi-arm Trials (Generalizations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
Chronological Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866
Impact on Type I Error Probability and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
Planning for Bias at the Design Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
Robust Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
Randomization Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873

Abstract
In clinical trials, randomization is used to allocate patients to treatment
groups, because this design technique tends to produce comparability across
treatment groups. However, even randomized clinical trials are still susceptible
to bias. Bias is a systematic distortion of the treatment effect estimate. This
chapter introduces two types of bias that may occur in clinical trials, selection
bias and chronological bias. Selection bias may arise from predictability of the
randomization sequence, and different models for predictability are presented.
Chronological bias occurs due to unobserved time trends that influence patients’
responses, and its effect on the rejection rate of parametric hypothesis tests for the
treatment effect will be revealed. It will be seen that different randomization
procedures differ in their susceptibility to bias. A method to reduce bias at the
design stage of the trial and robust testing strategies to adjust for bias at the
analysis stage are presented to help to mitigate the potential for bias in random-
ized controlled clinical trials.

Keywords
Selection bias · Chronological bias · Restricted randomization · Type I error ·
Power

D. Uschner
Department of Statistics, George Mason University, Fairfax, VA, USA
e-mail: [email protected]

W. F. Rosenberger (*)
Biostatistics Center, The George Washington University, Rockville, MD, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_219

Introduction

Clinical trials aim at comparing the efficacy and safety of therapeutic agents across
treatment groups. It is crucial that the groups are comparable with respect to the
demographic features and other prognostic variables. In practice, it is not possible
to create comparability deterministically across the groups, particularly, as some
underlying prognostic variables, such as pharmacological properties, may still be
unknown.
Randomization tends to balance groups with respect to known and unknown
covariates and is therefore commonly regarded as the key component of clinical
trials that provides comparability of treatment groups (Armitage 1982). In addition,
randomized treatment allocation allows the effective concealment of treatments from
patients and investigators. When the treatment assignment is deterministic, the
concealment of allocations is inherently difficult. Allocation concealment is often
referred to as double-blinding, while a loss of concealment is called unblinding.
Double-blinding is important, if possible, to achieve an unbiased assessment of the
outcomes of the trial.
Despite the favorable properties of randomization, a randomized clinical trial can
still suffer from a lack of comparability among the treatment groups. Biases may
arise from different sources. For example, long recruitment times, changes in study
personnel, or learning curves during surgical procedures may cause time trends that
affect the outcomes of patients in the trial. It is intuitively clear that time trends will
lead to a bias of the treatment effect, when patients that arrive early in the allocation
process are allocated to one group and those that arrive later are allocated to the other
group. This bias has been termed chronological bias (Matts and McHugh 1978).
Randomizing patients in blocks has been recommended as a means to create more
similar groups in the course of the trial (ICH 1998). However, blocking introduces
predictability of the upcoming treatment assignments. In particular, the pharmaco-
logical effects or the side effects of an intervention may be easily distinguishable
from those of a standard or placebo intervention. These effects will cause unblinding
of the past treatment allocations and may in turn make future allocations more
predictable. Predictability can introduce selection bias (Rosenberger and Lachin
2015), a bias caused by a systematic covariate imbalance of the treatment groups.
Several randomization procedures have been developed in the literature to miti-
gate the effects of chronological bias and selection bias. Each randomization proce-
dure represents a trade-off between balance and randomness: greater balance
leads to higher predictability, and greater randomness leads to more susceptibility to
time trends. The extreme of complete randomness is the toss of the fair coin, also
called complete randomization or unrestricted randomization. The other extreme
is small blocks of two, where after a random allocation to one group, the next patient
will be deterministically allocated to the other group. When the allocation
of a patient depends on the treatment assignments of the previous patients, a
randomization procedure is called restricted. Section “Restricted Randomization in
Clinical Trials” reviews restricted randomization procedures that are used to mitigate
bias in randomized trials.
Section “Covariate Imbalances and Predictability” shows how predictability and
covariate imbalances can be measured in clinical trials. Chronological bias is the
focus of section “Chronological Bias.” Section “Planning for Bias at the Design
Stage” presents an approach to minimize the susceptibility to bias at the design stage.
Section “Robust Hypothesis Tests” introduces hypothesis tests that are robust to
bias. The chapter closes with a Summary in section “Summary and Conclusions.”

Restricted Randomization in Clinical Trials

Consider a randomized clinical trial with an experimental agent E and a control agent
C. Patients enter the trial sequentially and are allocated randomly into one of the two
groups. When the groups are expected to be balanced in the end of the trial, the
random allocation can be achieved by the toss of a fair coin, and a patient will be
allocated to the experimental group when the coin shows heads and to the control
group when the coin shows tails. Let an even n ∈ ℤ>0 be the total sample size of the
trial. Then the allocation of patient i, for i ∈ {1, . . ., n}, is denoted by

$$t_i = \begin{cases} 1 & \text{if patient } i \text{ is allocated to group } E \\ 0 & \text{if patient } i \text{ is allocated to group } C. \end{cases}$$

The allocation ti of patient i is the realization of a Bernoulli random variable Ti ~
Bern(0.5). The sequence T = (T1, . . ., Tn) with realization t = (t1, . . ., tn) is called the
randomization sequence. The set of all randomization sequences with total
sample size n is given by Ωn = {0, 1}^n.
The sample size in group E after the allocation of patient i is denoted by
$N_E(i) = \sum_{j=1}^{i} t_j$, and in group C by NC(i) = i − NE(i). The imbalance after patient
i is defined as the difference in group sizes after the allocation of patient i and is
given by
$$D_i = N_E(i) - N_C(i) = 2\sum_{j=1}^{i} t_j - i.$$

The imbalance Di is a random variable that describes a random walk (i, Di) for
i = 1, . . ., n. Figure 1 shows a realization of the random walk in heavy black and all
the possible realizations in light gray.
There is a one-to-one correspondence between the set of randomization
sequences and realizations of the random walk. Each random walk corresponds to
a randomization sequence, and each randomization sequence describes a unique
random walk. Using a fair coin toss, each randomization sequence has the same
probability

$$P(T = t) = \frac{1}{|\Omega_n|} = \frac{1}{2^n},$$

where |Ωn| is the cardinality of the set Ωn. In other words, for each patient i, the
random walk has the same probability to go either up or down. As there are no
restrictions to the randomization process, this randomization procedure is usu-
ally called unrestricted randomization or complete randomization. In particular,
complete randomization only maintains the allocation ratio in expectation but
may result in high imbalances in the course of the trial, as well as in the end of the
trial. As Fig. 1 shows, complete randomization (CR) allows imbalances as high
as n.
Several randomization procedures have been proposed to achieve better balance
in the clinical trial. A randomization procedure is a (discrete) probability distribution
on the set of randomization sequences. Randomization procedures other than the
uniform distribution are called restricted randomization procedures.

Fig. 1 Random walk of the randomization sequence t = (1, 1, 1, 1, 0, 0, 0, 0), figure generated
using the randomizeR package (Uschner et al. 2018b)

The most commonly used randomization procedures are random allocation rule
(RAR) and permuted block randomization (PBR). Random allocation rule forces
randomization sequences to be balanced at the end of the trial by giving zero weight
to unbalanced sequences and equal probability to balanced sequences. As there are
$\binom{n}{n/2}$ balanced sequences, the probability of a sequence t is

$$P(T = t) = \begin{cases} \binom{n}{n/2}^{-1} & \text{if } D_n(t) = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Permuted block randomization forces balance not only at the end of a trial but at
M points in the trial. The interval between two consecutive balanced points in the
trial is called a block. Let every block contain m = n/M patients, where M and m are
positive integers, and let b = m/2 be the number of patients allocated to E and C in
each block. Using PBR, the probability of a sequence is
$$P(T = t) = \begin{cases} \binom{2b}{b}^{-n/(2b)} & \text{if } D_{j \cdot m}(t) = 0 \text{ for } j = 1, \ldots, M \\ 0 & \text{else.} \end{cases}$$

A different way to achieve balance was suggested by Berger et al. (2003). They
promote the maximal procedure (MP), a randomization procedure that achieves
final balance and does not exceed a maximum tolerated imbalance b for the running
imbalance maxi |Di|. All remaining sequences Ωn,MP are realized with equal probability:

$$P(T = t) = \begin{cases} \frac{1}{|\Omega_{n,MP}|} & \text{if } \max_i |D_i(t)| \le b \text{ and } D_n(t) = 0 \\ 0 & \text{else.} \end{cases}$$

Figure 2 illustrates the set of sequences. The cardinality of the set of sequences
of Ωn,MP depends on n and the imbalance boundary b. There is no closed form,
and the generation of the randomization sequences requires an ingenious algo-
rithm proposed by Salama et al. (2008) and implemented in Uschner et al.
(2018b).
Another approach, one that does not force balance at the end of the trial, is Efron’s
biased coin design (EBCD). Here, the probability of the next treatment assignment
is based on the current imbalance of the random walk. Let 1/2 < p ≤ 1. Then the
probability to assign the next patient to group E is given by

$$P(T_{i+1} = 1 \mid T_1, \ldots, T_i) = \begin{cases} p & \text{if } D_i < 0 \\ \frac{1}{2} & \text{if } D_i = 0 \\ 1 - p & \text{if } D_i > 0. \end{cases}$$

Fig. 2 Set of sequences of the maximal procedure for sample size n = 8 with imbalance tolerance
b = 2, figure generated using the randomizeR package (Uschner et al. 2018b)

The set of all sequences of EBCD is Ωn, but the probability distribution
is different. Sequences with high imbalances have a lower probability. In
other words, the probability mass is concentrated about the center of the random
walk.
Chen’s design and its special case the big stick design (BSD) were developed to
avoid the high imbalances still possible in EBCD. Here, an imbalance boundary b is
introduced for the random walk, and a deterministic allocation is made to the other
treatment group once the random walk attains the imbalance boundary on one side of
the random walk. Using Chen’s design, the probability to allocate the next patient in
group E is
$$P(T_{i+1} = 1 \mid T_1, \ldots, T_i) = \begin{cases} 1 & \text{if } D_i = -b \\ p & \text{if } -b < D_i < 0 \\ \frac{1}{2} & \text{if } D_i = 0 \\ 1 - p & \text{if } 0 < D_i < b \\ 0 & \text{if } D_i = b. \end{cases}$$

The special (and more common) case of the big stick design results if p = 1/2.
Note that despite the similar set of sequences of the big stick design and the
maximal procedure, their probability distributions are very different. The maximal
procedure gives equal probability to all its sequences. The big stick design, however,
introduces deterministic allocations (i.e., allocations with probability one) every time
the imbalance boundary is hit. As a consequence, sequences that run along the
imbalance boundary have higher probability than those in the middle of the alloca-
tion tunnel.
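The allocation rules above translate directly into short sequence generators. Below is a minimal Python sketch (an illustration, not the randomizeR implementation) that draws randomization sequences under complete randomization, Efron’s biased coin design, and the big stick design:

```python
import random

def complete_randomization(n):
    """Unrestricted randomization: a fair coin toss for every patient."""
    return [random.randint(0, 1) for _ in range(n)]

def efron_bcd(n, p=2/3):
    """Efron's biased coin: favor the under-represented arm with probability p."""
    t, d = [], 0                        # d = D_i, the current imbalance N_E - N_C
    for _ in range(n):
        prob = 0.5 if d == 0 else (p if d < 0 else 1 - p)
        ti = 1 if random.random() < prob else 0
        t.append(ti)
        d += 2 * ti - 1
    return t

def big_stick(n, b=2):
    """Big stick design: fair coin until the imbalance boundary b is hit,
    then a deterministic allocation back toward balance."""
    t, d = [], 0
    for _ in range(n):
        if d == -b:
            ti = 1                      # boundary hit: force group E
        elif d == b:
            ti = 0                      # boundary hit: force group C
        else:
            ti = random.randint(0, 1)
        t.append(ti)
        d += 2 * ti - 1
    return t

print(big_stick(8))                     # e.g., [1, 1, 0, 1, 0, 0, 1, 0]
```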

Covariate Imbalances and Predictability

A principal goal of randomization in clinical trials is to produce treatment groups
that are comparable with respect to known and unknown prognostic variables.
Therefore, randomization is the basis of a meaningful and unbiased comparison of
the primary outcome between the treatment groups and allows conclusions about the
treatment effect. At the same time, randomization allows the implementation of
masking and allocation concealment. A systematic covariate imbalance of the
treatment groups is called selection bias. Berger (2005) distinguishes three types
of selection bias. First-order selection bias is defined as the bias that results in a non-
randomized trial, when the allocation sequence is not generated in advance, and
treatments can be assigned based on the patients’ and physicians’ preferences. The
bias that results in a randomized controlled trial that does not use allocation
concealment to mask allocations from patients and physicians is called second-
order selection bias. Bias that results from predictability of the randomization
sequence due to unsuccessful allocation concealment of past treatment assignments
in combination with a known target allocation ratio has been termed third-order
selection bias (Berger 2005). Third-order selection bias may occur when past
treatment assignments are unmasked due to side effects or when the nature of the
treatment makes masking of past allocation unfeasible, e.g., in a surgical procedure.
While first- and second-order selection bias can be alleviated by a more careful
study design, third-order selection bias is harder to mitigate. Intuitively, the more
restrictions a randomization procedure induces, the higher the potential for predict-
ability. This section reviews the formal measures of predictability that have been
introduced in the literature.

Correct Guesses

The first to propose a measure of selection bias were Blackwell and Hodges (1957).
Under the assumption that the investigator knows the target allocation ratio as well as
past treatment assignments, they investigate the influence of an investigator who
consciously seeks to make one treatment appear better than the other irrespective of
the presence of a treatment effect. Assuming that the investigator favors the experi-
mental treatment, he might include a patient with better expected response in the trial
when he expects the experimental treatment to be allocated next. Conversely, he would
include a patient with worse expected response, when he expects the next treatment
assignment to be to the control group. Blackwell and Hodges propose two models for
the guess of the investigator. The first model, coined the convergence strategy (CS),
assumes that the investigator guesses the treatment that has so far been allocated less.
Let gCS(i, t) denote the guess for allocation i using the convergence strategy, and let R ~
Bern(0.5) be a Bernoulli random variable. Using the convergence strategy, the inves-
tigator’s guess for the ith allocation is given by

$$g_{CS}(i, t) = \begin{cases} 1 & N_E(i-1, t) < N_C(i-1, t) \\ 0 & N_E(i-1, t) > N_C(i-1, t) \\ R & N_E(i-1, t) = N_C(i-1, t), \end{cases}$$

where a value of 1 corresponds to the experimental treatment, and a value of 0
corresponds to the control treatment. The second model is termed the divergence
strategy (DS). Here the investigator guesses the treatment that has so far been
allocated more frequently. The investigator’s guess for the ith allocation is thus given
by

$$g_{DS}(i, t) = \begin{cases} 1 & N_E(i-1, t) > N_C(i-1, t) \\ 0 & N_E(i-1, t) < N_C(i-1, t) \\ R & N_E(i-1, t) = N_C(i-1, t). \end{cases}$$

A correct guess is the event that the investigator guesses the treatment that will in
fact be allocated next, i.e., g(i, t) = ti for g ∈ {gCS, gDS}. The number of correct
guesses of a randomization sequence is then defined as

$$G(t) = \sum_{i=1}^{n} I(g(i, t) = t_i).$$

With this notation, the expected number of correct guesses E(G) is given by

$$E(G) = \sum_{t \in \Omega_n} P(T = t) \cdot G(t),$$

where P(T = t) is the sequence probability as induced by the randomization
procedure.
It is intuitively clear that the convergence strategy induces a correct guess each
step the random walk reduces its imbalance. The divergence strategy induces a
correct guess each time the imbalance is increased. In addition, every time the
random walk is balanced, a correct guess is made with probability 1/2.
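E(G) can be computed by enumeration for small trials or estimated by Monte Carlo simulation. The following minimal Python sketch (sample size and replication numbers are arbitrary) estimates the expected number of correct guesses under the convergence strategy for the random allocation rule:

```python
import random

def random_allocation_rule(n):
    """RAR: a uniformly random sequence with exactly n/2 allocations to E."""
    t = [1] * (n // 2) + [0] * (n // 2)
    random.shuffle(t)
    return t

def correct_guesses_cs(t):
    """G(t) under the convergence strategy: guess the arm allocated less
    often so far, breaking ties with a fair coin."""
    g, ne, nc = 0, 0, 0
    for ti in t:
        guess = 1 if ne < nc else 0 if ne > nc else random.randint(0, 1)
        g += (guess == ti)
        ne += ti
        nc += 1 - ti
    return g

n, reps = 20, 10_000
mean_g = sum(correct_guesses_cs(random_allocation_rule(n)) for _ in range(reps)) / reps
print(f"Estimated E(G) for RAR with n={n}: {mean_g:.2f}")  # well above n/2 = 10
```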

Conditional Allocation Probability

Rosenberger and Lachin (2015) recently proposed a metric that is equivalent to
the expected number of correct guesses but does not rely on the guessing model.
They propose to investigate the expected difference between the conditional and
the unconditional allocation probability. In a clinical trial with target allocation
ratio 12, each patient has the probability 12 to be allocated to either of the treatment
groups. In other words, for each position i  {1, . . ., n}, the random walk has the
same overall probability to go up or down.

Denote the probability to receive the experimental treatment by p, and denote
the conditional probability to receive the experimental treatment, given the past
treatment assignments, by ϕi = P(Ti = 1 | T1, . . ., Ti−1). Then the predictability of a
sequence t is given by

$$\rho_{PRED}(t) = \sum_{i=1}^{n} (\phi_i(t) - p)^2.$$

Clearly, if the allocation is completely random, ρPRED(t) = 0. When each
allocation is deterministic (which technically is not a randomized design anymore),
ρPRED(t) is maximized and takes the value ρPRED(t) = n ∙ (1 − p)^2. The predict-
ability of a randomization procedure is again given by the weighted mean of the
sequence predictability, namely,

$$\rho_{PRED} = \sum_{t \in \Omega_n} P(T = t) \cdot \rho_{PRED}(t).$$

It turns out that ρPRED is mathematically equivalent to the expected number of
correct guesses minus the target allocation p, as shown by Rosenberger and Lachin
(2015).
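For small n, ρPRED can even be computed exactly by enumerating Ωn. A Python sketch for the random allocation rule, where the conditional allocation probability takes the simple form ϕi = (n/2 − NE(i − 1))/(n − (i − 1)), might look as follows:

```python
from itertools import combinations

def rho_pred_rar(n):
    """Exact rho_PRED for the random allocation rule, by enumerating all
    balanced sequences (all of which are equiprobable under RAR)."""
    total, count = 0.0, 0
    for pos in combinations(range(n), n // 2):    # positions allocated to E
        t = [1 if j in pos else 0 for j in range(n)]
        ne, rho = 0, 0.0
        for i, ti in enumerate(t):
            phi = (n // 2 - ne) / (n - i)          # conditional prob. of E
            rho += (phi - 0.5) ** 2
            ne += ti
        total += rho
        count += 1
    return total / count

print(f"rho_PRED for RAR with n = 8: {rho_pred_rar(8):.3f}")
```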

Type I Error Probability and Power

Proschan (1994) was the first to propose and investigate the influence of the
convergence strategy on the type I error rate of a hypothesis test of the treatment
effect. Let the primary outcome Y follow a normal distribution
Y ~ N(μE ∙ T + μC ∙ (1 − T), σ^2). If the variance σ^2 is known, the null hypothesis
H0: μE = μC can be tested using a Z-test; under the assumption of independent
and identically distributed responses, the standardized difference in means

$$D = \frac{\bar{Y}_E - \bar{Y}_C}{\sigma \sqrt{1/N_E + 1/N_C}}$$

follows a standard normal distribution.
Assume that a higher outcome Y can be regarded as better and that the
investigator favors the experimental group, although the null hypothesis is true,
i.e., μ = μE = μC. Then the influence of the convergence strategy on the responses can
be modeled as follows:

$$E(Y_i) = \begin{cases} \mu + \eta & N_E(i-1, t) < N_C(i-1, t) \\ \mu - \eta & N_E(i-1, t) > N_C(i-1, t) \\ \mu & N_E(i-1, t) = N_C(i-1, t), \end{cases}$$

where η > 0 denotes the selection effect, the extent of bias introduced by the
investigator. It is assumed that η > 0 to account for the fact that the treatment E is
preferred and higher outcomes are regarded as better.

Under this assumption, the responses Y1, . . ., Yn are not identically distributed
anymore but are still independent. Proschan gave an asymptotic formula for the type
I error probability when the random allocation rule is used and investigated the rejection
rate in simulations for various values of n and η. It turns out that the rejection
probability exceeds the planned significance level even for small values of η.
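This inflation is easy to reproduce by simulation. The sketch below (Python; η, n, and the number of replicates are arbitrary illustration values) generates trials under the biasing policy above with the random allocation rule and estimates the rejection rate of the known-variance Z-test under the null hypothesis:

```python
import random, statistics

def type1_error_cs(n=20, eta=0.5, reps=20_000, crit=1.96):
    """Monte Carlo type I error of the Z-test when the convergence strategy
    biases enrollment (random allocation rule, mu_E = mu_C = 0, sigma = 1)."""
    rejections = 0
    for _ in range(reps):
        t = [1] * (n // 2) + [0] * (n // 2)
        random.shuffle(t)                          # random allocation rule
        ye, yc, ne, nc = [], [], 0, 0
        for ti in t:
            shift = eta if ne < nc else -eta if ne > nc else 0.0
            (ye if ti else yc).append(random.gauss(shift, 1.0))
            ne += ti
            nc += 1 - ti
        se = (1 / len(ye) + 1 / len(yc)) ** 0.5    # sigma = 1 known
        d = (statistics.mean(ye) - statistics.mean(yc)) / se
        rejections += abs(d) > crit
    return rejections / reps

print(f"Rejection rate under H0: {type1_error_cs():.3f}")  # clearly above 0.05
```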
Kennes et al. (2011) extended the approach of Proschan to permuted block
randomization. As expected, the type I error inflation increases with smaller block
sizes. Ivanova et al. (2005) adapted the approach for binary outcomes and introduced
a guessing threshold to reflect a possibly conservative investigator. The influence
of various guessing thresholds was also investigated by Tamm and Hilgers (2014),
who further generalized the approach to investigate the influence of predictability
on the t test. Rückbeil et al. (2017) investigated the impact of selection bias on time-
to-event outcomes.
Langer (2014) gave an exact formula for the rejection rate of the t test conditional
on the randomization sequence, when the convergence strategy is used. The
approach was published in Hilgers et al. (2017) and implemented in the randomizeR
R package in Uschner et al. (2018b). The rejection probability conditional on the
randomization sequence t is given by

$$r(t) = P\left(|S| > t_{n-2}\left(1-\tfrac{\alpha}{2}\right) \,\middle|\, t\right) = F\left(t_{n-2}\left(\tfrac{\alpha}{2}\right), n-2, \delta, \lambda\right) + 1 - F\left(t_{n-2}\left(1-\tfrac{\alpha}{2}\right), n-2, \delta, \lambda\right),$$
where S is the test statistic of the t-test, tn – 2(γ) is the γ-quantile of the t-distribution
with n – 2 degrees of freedom, and F(∙, n – 2, δ, λ) is the distribution function of the
doubly noncentral t-distribution with n – 2 degrees of freedom and non-centrality
parameters δ, λ that both depend on the randomization sequence t. Figure 3 shows
the distribution of the type I error probability for the maximal procedure and the big
stick design, both with imbalance tolerance b = 2, and for the random allocation
rule. All are based on the total sample size n = 20 and normally distributed outcomes
with group means μE = μC = 2 and equal variance σ^2 = 1.
Notably, all randomization procedures contain sequences with rejection proba-
bilities as high as 100%. These are the alternating sequences. The big stick design
has most sequences concentrated around the 5% significance level. The random
allocation rule is similar to the big stick design but introduces more variability.
The maximal procedure, despite having a similar set of sequences as the big stick
design, has a higher probability for sequences that exceed the significance level
substantially.

Multi-arm Trials (Generalizations)

The approach to assess susceptibility based on the rejection probability (see section
“Type I Error Probability and Power”) was generalized to multi-arm trials by

Fig. 3 Distribution of the type I error probability under the convergence strategy with η = 4 for
three randomization procedures, figure generated using the randomizeR package (Uschner et al.
2018b)

Uschner et al. (2018a). They proposed models for selection bias in multi-arm trials
that generalize the convergent guessing strategy of Blackwell and Hodges (1957);
see section “Correct Guesses.” Let K ≥ 2 denote the number of treatment groups,
and assume that a randomization procedure with equal allocation ratio is used for the
allocation of patients to the K groups. In the two-arm case, it is assumed that the
investigator favors one treatment over the other. In the multi-arm case (K > 2), it is
assumed that the investigator favors a subset of the K treatment groups and
dislikes the rest. Similarly to the two-arm case, it is assumed that the investigator
would like to make his favored groups appear better than the disliked groups,
despite the null hypothesis H0: μ1 = . . . = μK being true. Under this assumption,
the investigator would thus try to include a patient with better expected
response when he guesses that one of his favored groups will be allocated next.
Let F ⊆ {1, . . ., K} denote the subset of favored treatment groups, and let the
complement F^C = {1, . . ., K} \ F denote the treatment groups that are not favored
by the investigator. A reasonable strategy for the investigator would be to guess that
one of his favored groups will be allocated next, when all of the groups in F have
fewer patients than the remaining groups. Under this assumption, the expected
response is given by

$$E(Y) = \mu + \eta \cdot b,$$

and the components of the bias vector b are


$$b_i = \begin{cases} 1 & \text{if } \max_{j \in F} N_j(i-1) < \min_{k \in F^C} N_k(i-1) \\ -1 & \text{if } \min_{j \in F} N_j(i-1) > \max_{k \in F^C} N_k(i-1) \\ 0 & \text{else.} \end{cases}$$

As the responses are no longer identically distributed under the null
hypothesis, the test statistic SF of the F-test of the hypothesis H0: μ1 = . . . = μK no longer follows a
central F-distribution. However, conditional on the realized randomization
sequence, SF follows a doubly noncentral F-distribution with noncentrality
parameters that depend on the bias vector.
Ryeznik and Sverdlov (2018) propose to investigate selection bias in multi-arm
groups based on the forcing index (FI) that measures the difference between
the conditional allocation probability and the target allocation probability of the
randomization procedure. Their approach extends the approach of Rosenberger and
Lachin (2015) (see section “Conditional Allocation Probability”). In a trial with K
treatment groups, write Ti = j if patient i is allocated to group j, i = 1, . . ., n, j = 1,
. . ., K. Then ϕi,j = P(Ti = j | T1, . . ., Ti−1) denotes the conditional probability that
patient i will be allocated to treatment j, similarly as in section “Conditional
Allocation Probability.” The Euclidean distance between the vector of conditional
allocation probabilities ϕi = (ϕi, 1, . . ., ϕi, K) of patient i and the vector of target
allocation probabilities p = ( p1, . . ., pK) is defined as the forcing index of patient i,
$$FI_i = \sqrt{\sum_{j=1}^{K} \left(\phi_{i,j} - p_j\right)^2}.$$

The forcing index for the randomization sequence is then given by
$FI = \frac{1}{n}\sum_{i=1}^{n} FI_i$. Clearly, the closer the conditional allocation is to the target allocation,
the more random a randomization sequence can be considered. Therefore, a low
forcing index is desirable to reduce predictability.
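In code, the forcing index is a one-liner over the sequence of conditional allocation probabilities. A Python sketch with a hypothetical three-arm example:

```python
import math

def forcing_index(cond_probs, target):
    """Mean Euclidean distance between the conditional and the target
    allocation probabilities over a randomization sequence."""
    return sum(math.dist(phi_i, target) for phi_i in cond_probs) / len(cond_probs)

# Invented conditional allocation probabilities for three patients, K = 3 arms:
phis = [(1/3, 1/3, 1/3), (0.5, 0.25, 0.25), (0.0, 0.5, 0.5)]
print(f"FI = {forcing_index(phis, (1/3, 1/3, 1/3)):.3f}")
```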

Chronological Bias

Chronological bias arises when the treatment effect is distorted due to a time trend
that affects the patients’ responses, i.e., when later observations tend to be
systematically higher or lower than previous observations. According to Matts and
McHugh (1978), who coined the term chronological bias, clinical trials with a long
recruitment phase are particularly prone to suffer from the hidden effects of time.
The idea of investigating the effect of an unobserved covariate, such as time, on the
estimation of the treatment effect is due to Efron (1971), who termed the resulting
systematic distortion of the treatment effect in a linear model accidental bias.
The susceptibility of a randomization procedure to chronological bias may be
measured by the degree of balance it yields (Atkinson 2014). In a trial with two
treatment arms, a randomization sequence t is said to attain final balance, when the

achieved allocation ratio at the end of the trial is equal to the target allocation
ratio p,

$$N_E(n, t) = n \cdot p.$$

When the target allocation is p = 0.5, the maximum difference in group size
yields a measure for imbalance throughout the trial,

$$MI = \max_{k=1, \ldots, n} |D_k(t)|,$$

where the difference at time k is given by Dk(t) = NE(k, t) − NC(k, t). A generali-
zation of this approach to multiple treatment groups was presented by Ryeznik and
Sverdlov (2018).

Impact on Type I Error Probability and Power

A more direct way of measuring the susceptibility of a randomization procedure to
chronological bias is by estimating the effects of an unobserved trend on the
rejection probability of a parametric test. In a two-arm clinical trial, if the primary
outcome Y follows a normal distribution Y ~ N(μE ∙ T + μC ∙ (1 − T), σ^2), we can
use a t-test to test the hypothesis H0: μE = μC, assuming that the variance σ^2 is
unknown.
In order to assess the effect of a time trend on the type I error rate of the t-test,
Tamm and Hilgers (2014) assume that the responses are affected by a trend τ(i),

$$E(Y_i) = \mu_E \cdot T_i + \mu_C \cdot (1 - T_i) + \tau(i).$$

They propose three different shapes of trend: linear, logarithmic, and stepwise.
Under linear time trend, the expected response of the patients increases evenly
proportional to a factor θ with every patient included in the trial, until reaching θ
after n patients. Linear time trend may occur as a result of gradually relaxing in- or
exclusion criteria throughout the trial. The shift of patient i, where i = 1, . . ., n, is
given by the formula

$$\tau(i) = \frac{i}{n} \cdot \theta.$$
Under a logarithmic time trend, the expected response of the patients increases
logarithmically with every patient included in the study. A logarithmic time trend
may occur as a result of a learning curve, e.g., in a surgical trial. Under a
logarithmic trend, the shift of patient i, where i = 1, . . ., n, is given by the formula

$$\tau(i) = \log\left(\frac{i}{n}\right) \cdot \theta.$$

Under a step trend, the expected response of the patients increases by θ
after a given point n0 in the allocation process. A step trend may occur if a new device
is used after the point n0 or if the medical personnel changes at this point. Under a step
trend, the shift of patient i, where i = 1, . . ., n, is given by the formula

$$\tau(i) = 1_{\{n_0 \le i \le n\}} \cdot \theta.$$
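The three trend shapes are straightforward to encode; a minimal Python sketch of the formulas above:

```python
import math

def tau_linear(i, n, theta):
    """Linear trend: shift grows evenly, reaching theta at patient n."""
    return (i / n) * theta

def tau_log(i, n, theta):
    """Logarithmic trend, following the formula above."""
    return math.log(i / n) * theta

def tau_step(i, n0, theta):
    """Step trend: a shift of theta from patient n0 onward."""
    return theta if i >= n0 else 0.0

print([round(tau_linear(i, 10, 1.0), 1) for i in range(1, 11)])
print([round(tau_step(i, 6, 1.0), 1) for i in range(1, 11)])
```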

Rosenberger and Lachin (2015) present the results of a simulation study in which
they investigate the average type I error rate and power of the t-test under a linear
time trend for various designs. They find that the mean type I error rate of the designs
does not suffer from chronological bias, but power can be deflated substantially.
Moreover, more balanced designs lead to better control of power. Tamm and Hilgers
(2014) investigate the permuted block design with various block sizes concerning
it’s susceptibility to chronological bias. They find that strong time trends can lead to
a deflation of the type I error rate if large block sizes are used.
It is, however, not necessary to rely on simulation. As in the case of selection
bias (see section “Type I Error Probability and Power”), the impact of chronological
bias on the rejection probability of the t-test, conditional on the randomization
sequence, can be calculated using the doubly noncentral t distribution. Figure 4
shows the exact distribution of the type I error probability under a linear time trend
with θ = 1 for the random allocation rule, the maximal procedure, and the big stick
design, the latter two with maximum tolerated imbalance b = 2, for sample size n = 20.
For all three designs, most randomization sequences yield a rejection
probability that is below the nominal significance level of 5%. The variance of the

Fig. 4 Distribution of the type I error probability under a linear time trend, figure generated using
the randomizeR package (Uschner et al. 2018b)

random allocation rule is higher than for the other designs. The big stick design seems
to best attain the significance level.

Planning for Bias at the Design Stage

Hilgers et al. (2017) proposed a framework called ERDO (Evaluation of
Randomization procedures for Design Optimization) for the choice of a randomiza-
tion procedure at the design stage of a clinical trial. Based on the idea that bias that
may occur during a trial can be anticipated from previous knowledge of similar trials,
their approach consists of assessing a large number of randomization procedures
with respect to the anticipated bias and choosing the least susceptible procedure
for the design of the new trial. The design is
therefore optimal with respect to mitigating the anticipated bias. The assessment is
facilitated by the R package randomizeR (Uschner et al. 2018b), which allows the
assessment of a large number of restricted randomization procedures, particularly
those presented in section “Restricted Randomization in Clinical Trials.” In the first
step, the objective of choosing an optimal randomization procedure for the study is
stated, and prior information as given by previous studies is gathered. Then, the
assumptions underlying the study are presented. Information from previous studies
is used to estimate the shape of the time trend, its effect size θ, and the selection
effect η. Also, the metric of assessment (e.g., mean type I error rate) is stated.
Based on these assumptions, a comprehensive evaluation of various randomization
procedures will be conducted. Finally, the randomization procedure that best miti-
gates the biases assumed in the assessment is chosen as the optimal randomization
procedure for the study.

Robust Hypothesis Tests

An alternative approach to mitigating the influence of bias in a study is to adjust for
bias in the analysis stage of a trial. For the case of normally distributed responses in a
trial with two treatment arms and permuted block randomization where the
responses are influenced by selection bias, Kennes et al. (2015) proposed to include
the selection bias in a linear model in order to estimate the selection bias effect and to
get an unbiased estimate of the treatment effect. They further propose an asymptotic
likelihood ratio test for the treatment effect, adjusted for bias, and for the selection
effect. The adjusted test controls the type I error substantially better when the
selection effect is medium to large, without losing much power.
Uschner et al. (2018a) employed a similar approach for multi-arm trials, inves-
tigating the effects of selection bias on the F-test and comparing several different
block sizes. Figure 5 is a reprint of their Fig. 6 and shows the results of a simulation
study based on the biasing policy for multi-arm groups that is outlined in section
“Multi-arm Trials (Generalizations).” The selection effect η is assumed to be a
fraction ρ of Cohen’s effect size f = f16,3 = 0.9829 for 3 treatment arms with 16

Fig. 5 Power of the F-test. Panel A, adjusted for selection bias; panel B, unadjusted for selection
bias. Both panels assume total sample size n = 48, K = 3 treatment groups, and selection effect
η = ρ ∙ f16,3 with ρ ∈ {0, 0.5, 1, 2}. (Originally published in (Uschner et al. 2018a), under Creative
Commons Attribution (CC BY 4.0) license)

subjects each. A total number of 10,000 trials was simulated under the assumption
μ1 = c ∙ f, μ2 = c ∙ f, μ3 = 0, where c is chosen such that the effect size of the
comparison results in 80% power of the F-test (c = 3/√2), and the estimated power
is given by the proportion of trials that led to a rejection of the null hypothesis.
Panel A shows that the power of the adjusted test is close to the nominal power
when the block size is large but reduces to about 67% when a small block size is
used. The magnitude of the selection effect does not have an impact on the power of
the adjusted test. Panel B shows the power of the unadjusted test. As expected, the
power increases with increasing selection effect, as a result of an overestimation of
the treatment effect. Small block sizes lead to the heaviest inflation, reflecting the
higher susceptibility to selection bias.

Randomization Tests

An approach to mitigate the influence of chronological bias is given by randomization
tests. They do not rely on parametric assumptions and are thus more robust to
deviations and biases. Under the null hypothesis of a randomization test, the patient’s
responses are independent of the treatment the patient receives. Under this
hypothesis, the response vector is fixed, and the treatment assignments are random.
A test statistic is then chosen, e.g., the difference in means statistic, and computed for
the complete set of randomization sequences (or a Monte Carlo sample thereof),
yielding a distribution of the test statistic under the null hypothesis. The p-value is
computed as the probability of obtaining a more extreme test statistic than the one
that was observed in the trial.
Rosenberger and Lachin (2015) simulated the type I error rate and power for
several randomization procedures, using the model

Yi ∼ N(Δ ∙ Ti + 4 ∙ τ(i) − 2, 1),

where Δ ∈ {0, 1}, and τ(i) is a linear time trend as in section “Chronological Bias.”
Their results show that the randomization test maintains the 5% significance level
for all randomization procedures. The power is decreased for all randomization
procedures except permuted blocks with small block sizes. The results indicate that a
smaller degree of balance leads to greater power loss.
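
A Monte Carlo randomization test for a single trial generated under this model
(Δ = 0, τ(i) = i/n, permuted blocks of size 4; the block size and the number of
resamples are arbitrary choices for illustration) can be sketched as:

set.seed(3)
n     <- 48
gen   <- function() as.vector(replicate(n / 4, sample(rep(0:1, 2))))
t_obs <- gen()                                  # sequence actually used
y     <- rnorm(n, mean = 4 * (1:n) / n - 2)     # responses under Delta = 0
stat  <- function(t) mean(y[t == 1]) - mean(y[t == 0])
ref   <- replicate(10000, stat(gen()))          # re-randomize, responses fixed
mean(abs(ref) >= abs(stat(t_obs)))              # two-sided Monte Carlo p-value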
The advantages of randomization tests are that they require little effort to compute,
they can handle heterogeneity in the data that may arise from bias, and, unlike
parametric hypothesis tests, they do not rely on the assumption of random sampling
from a distribution.
They are therefore the natural choice for hypothesis tests when bias is anticipated
in the data.

Summary and Conclusions

Randomization is a design technique that helps to reduce bias in clinical trials.
A variety of randomization procedures has been developed in the literature to address
different types of bias. At the design stage of a clinical trial, it is crucial to select a
randomization procedure that adequately addresses the biases that may potentially
occur during the trial. Several restricted randomization procedures are available
in the literature to address that need. It has been claimed (Taves 2010) that
covariate adaptive allocation, such as minimization, is less susceptible to selection
bias. The controversy surrounding this claim and other issues regarding the
analysis of covariate adaptive randomization are discussed in Rosenberger and
Lachin (2015).
Investigators should take into account information from previous trials to
evaluate the potential for bias in a prospective trial, as suggested by Hilgers et al.
(2017). When no such information is available, several different scenarios should be
anticipated in a sensitivity analysis. For example, clinical trials for rare diseases
usually have long recruitment periods in order to identify a larger number of
participants that are eligible for participation in the trial. Due to the slow recruitment,
changes in study personnel or co-medication can be anticipated. Therefore, several
different shapes and parameters of time trend should be investigated to assess the
potential for chronological bias. A randomization procedure that performs well in
all scenarios should consequently be chosen for the design of the trial.
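
Such a sensitivity analysis can reuse the simulation skeleton from the first sketch
above, swapping in different trend shapes; the linear, step, and logarithmic shapes
and θ = 1 below are illustrative assumptions:

set.seed(4)
n <- 48; theta <- 1
trends <- list(linear = function(i) theta * i / n,
               step   = function(i) theta * (i > n / 2),
               log    = function(i) theta * log(i) / log(n))
sapply(trends, function(tau) mean(replicate(2000, {
  t <- rbinom(n, 1, 0.5)                 # complete randomization, as an example
  y <- tau(1:n) + rnorm(n)               # time trend only, no treatment effect
  t.test(y[t == 1], y[t == 0])$p.value < 0.05
})))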
Trials should be assessed for the potential of predictability when the nature of the
intervention makes it impossible to blind the investigator or the patient, such as in the
case of a surgical intervention. When the investigator knows or can guess
the treatment assignment of a patient, there is a potential of predictability of the
future allocations. Other examples of potential predictability are trials where one
of the treatments has a side effect that the other treatment does not. Furthermore, when
the trial has a single center, or even multiple centers but the randomization is
stratified by center, the potential for predictability is high. In any of these cases,
the available randomization procedures should be assessed with respect to their
potential for selection bias, using a variety of selection bias parameters. Again, a
procedure that performs well in all investigated scenarios should be chosen. Lastly, if
little is known about the nature of the trial, a combination of chronological and
selection bias, as proposed by Hilgers et al. (2017), can be used as a basis for the
assessment that will determine the choice of the design. The combination approach
will ensure that a trial is protected if both biases occur during the trial.
By choosing a randomization procedure for a particular clinical trial that reflects
anticipated bias, the susceptibility to bias can be substantially reduced. While it is
recommended to include all available randomization procedures in the assessment, ran-
domization procedures that promote balance, such as the permuted block randomiza-
tion or the big stick design, should particularly be taken into account when
chronological bias is anticipated. Procedures that support randomness, such as com-
plete randomization, Efron’s biased coin design, or the big stick design with larger
imbalance tolerance, are especially recommended when predictability is an issue.
At the analysis stage, when researchers suspect that bias may have affected the
results of their trial, they can use testing strategies to detect and adjust for potential
bias. The Berger-Exner test (Berger and Exner 1999) can be applied to detect the
presence of selection bias with a high accuracy, if all assumptions are met
(Mickenautsch et al. 2014). Altman and Royston (1988) recommend using cumu-
lative sums of the outcomes to detect the presence of time trends. A more general
approach to control bias is to use methods that are robust to bias. When a parametric
test is used for the treatment effect, it is possible to control for a specific bias by
estimating its effect from the data and thus adjusting the treatment effect for the bias.
When the focus is not estimation, but testing of the null hypothesis of no treatment
effect, randomization tests are recommended to control the effect of bias, particularly
chronological bias, on the type I error probability. As randomization tests do not
rely on parametric assumptions, their results are robust to biases that arise from
heterogeneity in the patient stream, e.g., due to chronological bias.

Key Facts

• The results of a clinical trial can be affected by bias despite randomization.
• The susceptibility to bias varies with the randomization procedure that is
employed.
• The susceptibility to bias can be mitigated by choosing a suitable randomization
procedure at the design stage of a trial.
• Sensitivity analyses are recommended to evaluate the impact of the clinical
scenario on the trial results.

Cross-References

▶ Cross-over Trials
▶ Evolution of Clinical Trials Science
▶ Factorial Trials
▶ Fraud in Clinical Trials
▶ Issues for Masked Data Monitoring
▶ Masking Study Participants
▶ Multi-arm Multi-stage (MAMS) Platform Randomized Clinical Trials
▶ Principles of Clinical Trials: Bias and Precision Control
▶ Reporting Biases

References
Altman DG, Royston JP (1988) The hidden effect of time. Stat Med 7(6):629–637. https://doi.org/
10.1002/sim.4780070602
Armitage P (1982) The role of randomization in clinical trials. Stat Med 1:345–353
Atkinson AC (2014) Selecting a biased-coin design. Stat Sci 29(1):144–163. https://doi.org/
10.1214/13-STS449
Berger VW (2005) Quantifying the magnitude of baseline covariate imbalances resulting from
selection bias in randomized clinical trials. Biom J 47(2):119–127. https://doi.org/10.1002/
bimj.200410106
Berger VW, Exner DV (1999) Detecting selection bias in randomized clinical trials. Control Clin
Trials 20(4):319–327. https://doi.org/10.1016/S0197-2456(99)00014-8
Berger VW, Ivanova A, Deloria Knoll M (2003) Minimizing predictability while retaining balance
through the use of less restrictive randomization procedures. Stat Med 22(19):3017–3028.
https://doi.org/10.1002/sim.1538
Blackwell D, Hodges JL (1957) Design for the control of selection bias. Ann Math Statist
28(2):449–460. https://doi.org/10.1214/aoms/1177706973
Efron B (1971) Forcing a sequential experiment to be balanced. Biometrika 58(3):403–417
Hilgers RD, Uschner D, Rosenberger WF, Heussen N (2017) ERDO – a framework to select an
appropriate randomization procedure for clinical trials. BMC Med Res Methodol 17(1):159.
https://doi.org/10.1186/s12874-017-0428-z
ICH (1998) International conference on harmonisation of technical requirements for registration of
pharmaceuticals for human use. ICH harmonised tripartite guideline: statistical principles for
clinical trials E9
Ivanova A, Barrier RC Jr, Berger VW (2005) Adjusting for observable selection bias in block
randomized trials. Stat Med 24(10):1537–1546. https://doi.org/10.1002/sim.2058
Kennes LN, Cramer E, Hilgers RD, Heussen N (2011) The impact of selection bias on test decisions
in randomized clinical trials. Stat Med 30(21):2573–2581. https://doi.org/10.1002/sim.4279
Kennes LN, Rosenberger WF, Hilgers RD (2015) Inference for blocked randomization under
a selection bias model. Biometrics 71(4):979–984. https://doi.org/10.1111/biom.12334
Langer S (2014) The modified distribution of the t-test statistic under the influence of selection
bias based on random allocation rule. Master’s thesis, RWTH Aachen University, Germany
Matts JP, McHugh RB (1978) Analysis of accrual randomized clinical trials with balanced groups
in strata. J Chronic Dis 31(12):725–740. https://doi.org/10.1016/0021-9681(78)90057-7
Mickenautsch S, Fu B, Gudehithlu S, Berger VW (2014) Accuracy of the Berger-Exner test
for detecting third-order selection bias in randomised controlled trials: a simulation-based
investigation. BMC Med Res Methodol 14(1):114. https://doi.org/10.1186/1471-2288-14-114
Proschan M (1994) Influence of selection bias on type I error rate under random permuted block
designs. Stat Sin 4(1):219–231
Rosenberger W, Lachin J (2015) Randomization in clinical trials: theory and practice. Wiley series
in probability and statistics. Wiley, Hoboken
Rückbeil MV, Hilgers RD, Heussen N (2017) Assessing the impact of selection bias on test
decisions in trials with a time-to-event outcome. Stat Med 36(17):2656–2668
Ryeznik Y, Sverdlov O (2018) A comparative study of restricted randomization procedures for
multi-arm trials with equal or unequal treatment allocation ratios. Stat Med 37(21):3056–3077.
https://doi.org/10.1002/sim.7817
Salama I, Ivanova A, Qaqish B (2008) Efficient generation of constrained block allocation
sequences. Stat Med 27(9):1421–1428. https://doi.org/10.1002/sim.3014
Tamm M, Hilgers RD (2014) Chronological bias in randomized clinical trials arising from different
types of unobserved time trends. Methods Inf Med 53(6):501–510
Taves DR (2010) The use of minimization in clinical trials. Contemp Clin Trials 31(2):180–184.
https://doi.org/10.1016/j.cct.2009.12.005
Uschner D, Hilgers RD, Heussen N (2018a) The impact of selection bias in randomized multi-arm
parallel group clinical trials. PLoS One 13(1):1–18. https://doi.org/10.1371/journal.
pone.0192065
Uschner D, Schindler D, Hilgers RD, Heussen N (2018b) randomizeR: an R package for the
assessment and implementation of randomization in clinical trials. J Stat Softw 85(8):1–22.
https://doi.org/10.18637/jss.v085.i08
Part V
Basics of Trial Design
49 Use of Historical Data in Design
Christopher Kim, Victoria Chia, and Michael Kelsh

Contents
Introduction
Study Design
  Data Sources
  Defining Variables
Bias
  Selection Bias
  Information Bias
Analytic Methods
Examples of the Use of Historical Comparators to Test Efficacy
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
The goal of clinical research of new disease treatments is to evaluate the potential
benefits/risks of a new treatment, which generally requires comparisons to a
control group. The control group is selected to characterize what would have
happened to a patient if they had not received the new therapy. Although a
randomized controlled trial provides the most robust clinical evidence of treat-
ment effects, there may be situations where such a trial is not feasible or ethical,
and the use of external or historical controls can provide the needed clinical
evidence for granting conditional or accelerated regulatory approvals for novel
drugs. Data from previous clinical trials and real-world clinical studies can
provide evidence of the outcomes for patients with the disease of interest.
However, many methodologic issues must be considered, such as the
appropriateness of data sources, the specific data that need to be collected,
application of appropriate inclusion/exclusion criteria, accounting for bias, and


statistical adjustment for confounding. This chapter reviews the settings in which
the use of historical controls or comparators may be appropriate, as well as study
designs, limitations to using real-world comparators, and analytic methods to
compare real-world data to clinical trial data. When these considerations are
appropriately handled, use of historical controls can have immense value by
providing further evidence of the efficacy, effectiveness, and safety in the devel-
opment of novel therapies and by increasing the efficiency of the conduct of clinical trials
and regulatory approvals.

Keywords
Real-world comparators · Real-world evidence · Clinical trials · Controls ·
Historical comparator · Statistical methods · Propensity score

Introduction

The goal of clinical research of new disease treatments is to evaluate the potential
benefits/risks of a new treatment, which generally requires comparisons to a control
group. The control group is selected to characterize what would have happened to
a patient if they had not received the new therapy. To minimize bias in this compar-
ison, clinical trial study designs typically involve randomization of eligible
patients to treatment and control groups and, where feasible, blinding investigators
to patient treatment status. This approach provides the most robust clinical evidence
of treatment effects. However, for a variety of reasons and under a number of
circumstances, this well-established study design may not be feasible or ethical,
and the use of existing “real-world” (RW) data (i.e., external or historical controls)
can provide the needed clinical evidence for granting conditional or accelerated
regulatory approvals for novel drugs.
Randomized controlled trials (RCTs) may not be feasible because the disease
under study is so rare, creating a significant challenge in recruiting a sufficient
number of patients or requiring an unreasonably long time period for patient
recruitment. Similarly, if a new treatment is targeting a relatively rare “molecular-
identified” subgroup of patients, this may require screening a large number of
patients to identify the target patient population. In addition, analytical challenges
such as patient crossover from the control group to the new treatment group can
threaten the study integrity and confound survival analyses. For these feasibility
reasons and other ethical considerations (see below), physicians and patients may
refuse to participate in RCTs under these circumstances.
The ethical considerations that can impose significant challenges include a lack of
equipoise resulting from the early findings of new treatment or the dismal outcomes
of the current standard of care (SOC) for many serious life-threatening illnesses.
Even if a disease is not life threatening, a study could involve potential invasive/
risky monitoring or follow-up procedures, administered to controls who have little to
gain in terms of improvement or benefit in their disease status under the current SOC.
All of these scenarios could render an RCT as unethical.
Under such circumstances single-arm clinical trials, accompanied with historical
controls, can provide the needed comparative data for evaluation of outcomes for
patients who did not receive the new treatment. Such studies can be faster and more
efficient than RCTs and still generate needed clinical evidence. In some cases,
comparative data may be obtained from readily available study group data from
completed clinical trial treatment and/or control cohorts, patient cohorts from pre-
vious clinical case series, or meta-analyses of previous RCT or observational patient
data. However, for more precise comparisons, accounting for important clinical
characteristics likely requires individual patient-level data. Use of these data is the
focus of the discussion in this chapter.
The broad concept of historical controls has been described previously (Pocock
1976) and has been given various labels such as “nonrandomized control group,”
“external control,” “synthetic control,” “natural history comparison,” or “historical
comparator.” Although there are subtle distinctions across these different labels,
generally they are used to refer to a comparison group which is not randomized, can
be concurrent or historical, and can be derived from a single or multiple sites or data
sources. For purposes of discussion in this chapter, we will use the term “historical
control” to describe this type of comparison group. These data could be historical or
contemporaneous with the new treatment group; with either type of data, similar
study design and analytical principles would apply (with the exception of the
assessment of trends over time for the historical data). The goal of using historical
controls is to provide evidence of the expected patient outcomes in the hypothetical
randomized control group in the absence of randomized control study. Additionally,
the data could improve the generalizability of the study findings.
Considering the potential ethical and feasibility challenges described above,
situations that favor the use of historical controls involve the following:

• A rare disease with a large unmet treatment need.
• Severe or unfavorable outcomes for patients receiving the current SOC.
• The clinical/biological mechanism of the new treatment is well-characterized.
• The natural history of disease is well-understood.
• The new therapy appears to provide significant improvement over current SOC.
• The target patient population is well-defined and can be identified in RW or
clinical data sources.
• Disease and outcome classifications are similar between the new treatment group
and “real-world” controls, and these measures are collected using objective and
repeatable measurements.
• For historical data, disease prognosis and treatment patterns remain relevant for
the period under study.

This list is not intended to be exhaustive of all potential scenarios where historical
controls may be advantageous or feasible, nor is it a requirement that all of these
attributes be present for the use of historical controls in lieu of a randomized control
group. Data quality and accessibility are key components to the use of historical
controls for evidence generation and regulatory review. Each disease area and
specific indication may necessitate one type of data or another.
Controls derived from historical clinical trial data, when “reused” for other
studies, are considered observational data (Desai et al. 2013) and, because they
involve rigorous data collection and validation procedures, can provide robust
historical control data. These data are often limited in how many patients may be
eligible to be evaluated as they represent highly selected patient populations. The
characteristics of the included patients should be carefully considered for represen-
tativeness and comparability to the population being considered. Other data sources
include electronic health records, disease registries, insurance administrative claims
records, and clinical data abstracted for use as historical controls (USFDA 2018).
These data can often provide large numbers of patients but may not involve data
collection as meticulous as that of clinical trials. But when the inclusion/exclusion
criteria are matched to that of the population being considered, the outcomes
can truly reflect what would occur if a patient did not receive the investigational
intervention. Regardless of the type of data source, historical control data need to
include a sufficient number of patients, critical exposure, outcome, and covariate
information and systematic and sufficient follow-up to provide reliable comparative
information. The potential for bias can be reduced by assuring similar covariate,
endpoint, and exposure (e.g., previous treatment information, comorbidity data)
definitions, similar SOC across the historical patient cohort, and unbiased patient
selection processes. It is also important to fully understand data sources/data sys-
tems: how data are recorded, who are the patient populations captured in the data
source (e.g., an understanding of patient referral patterns), reasons for missing data,
geographic distribution of data, variation in standards of care, and other critical data
characteristics or system attributes (Dreyer 2018).
In addition to consideration and evaluation of the quality of historical control data
sources, appropriate study design and analytical methods that are aimed at reducing
bias and improving comparability/balance between patients receiving new treatment
and historical controls are critical aspects in providing an accurate comparison to the
new treatment patient group. Study design considerations and analytical strategies to
address bias due to confounding generally and, in particular, confounding by
indication include descriptive, stratified, weighted analyses, multivariate modeling,
propensity score (PS) methods for weighting and matching, and, when appropriate,
Bayesian approaches for statistical analysis. Sensitivity analysis should also be
proposed to assess impacts of study assumptions. A key to the scientific integrity
and successful regulatory evaluation of this process is the upfront specification of
these design and analysis strategies. Further details, examples from oncology, and
discussion on these topics are provided in this chapter.
In the first section of this chapter, appropriate study design is discussed. Selecting
the appropriate data source for the historical control group is critical, and evaluating
what data will satisfy the needs is the first key step. Next, defining the study variables
between the trial and historical controls is also critical. Without appropriately defined
exposures and endpoints, any comparisons will be limited. Finally, appropriate
considerations for bias are discussed. Randomization typ-
ically balances differences between patient populations, but use of historical controls
will almost certainly be biased without proper design and analysis. After study
design, a number of different analytic options are presented, and examples of
comparing single-arm trials to historical controls are discussed.

Study Design

Data Sources

There are four primary options for sourcing a study’s data, each with its own
strengths and limitations: (1) prospective primary data collection; (2) clinical data
abstraction from patient charts or clinical databases with a standardized case report
form; (3) existing secondary databases, such as comprehensive electronic medical
record (EMR) data, administrative claims data, EMR data linked to administrative
claims, disease registry data, or national health screening data; and (4) former
clinical trial data. Which option makes the most sense will be determined by the
disease of interest and what the data are intended to contextualize. The specific
details of each approach are described below.
Primary prospective data collection will provide the most in-depth and complete
data but will be the most costly and time-consuming effort. This option may be
needed if the required data are not regularly collected in clinical practice and cannot
be derived from routinely collected measures (e.g., in graft-versus-host disease,
assessment of response to therapy per the NIH 2014 Consensus Response
Criteria Working Group is widely collected and reported in clinical trials but is not
integrated into routine clinical care (Lee et al. 2015)). In situations where temporality
of the data is particularly important, for example, if standard of care has changed
substantially over time, prospective data collection may be the only feasible option.
For instance, in 2005, the use of bortezomib was fully approved for the treatment of
relapsed or refractory multiple myeloma after compelling data from the phase 3
trial demonstrated bortezomib superiority to dexamethasone in overall survival
(Richardson et al. 2005). The introduction of bortezomib changed the SOC of
multiple myeloma, and as a result, a reasonable comparison of outcomes for
myeloma patients would need to include data after 2005 when bortezomib became
a backbone of myeloma therapy. Additionally, some endpoints may have a much
more complicated ascertainment than what is routinely conducted under typical
clinic care and cannot be derived from existing data. In these circumstances, the
only way to collect these assessments is to prospectively design a study that collects
such data. This approach will require the most time as sites will need to be enrolled
and each patient will need to be screened and provide informed consent.

Retrospective extraction of clinical data using a standardized case report form can
provide depth of clinical data with less complexity and time than a prospectively
designed study. This option is a good choice if the data needed are commonly
collected but not necessarily in a structured field of an EMR
(e.g., to measure response to therapy for an acute lymphoblastic leukemia
patient (Gokbuget et al. 2016a)). The primary benefit of doing a retrospective data
collection directly from clinical sites is that most centers will have years of data
available. With a specific case report form, a focused effort can extract just the
necessary data. In some instances, centers may maintain a database or registry that
contains most, if not all, of the data elements needed, which will streamline the data
abstraction process. However, many centers do not keep a routine database of
clinical data for research purposes. The biggest barrier for these sites will be the
process of medical chart abstraction which requires extensive staff support. The
process of data abstraction and entry is often a slow and expensive process due to the
labor involved. Additionally, many sites and investigators may be less interested in
participating in retrospective studies which may not be as novel or impactful as
investigational therapies.
Use of databases can be the most time- and cost-efficient method. This option is
feasible when the primary endpoints are routinely collected in everyday medical
practice in structured EMR or insurance claims diagnosis (e.g., incidence of bleeding
events in patients with thrombocytopenia (Li et al. 2018)). The most common types
of data used would likely be an EMR linked to an administrative claims database.
These data are more likely to be highly generalizable as these data provide a large
sampling of centers and providers from a geographic region. Additionally, the
sample size provided by these datasets is likely to be far greater than a clinical
study. Despite these advantages, however, the appropriateness of the databases needs
to be considered. For instance, existing electronic databases may lack the specificity
and depth of data needed for comparative purposes to clinical trial data. Additionally,
because the data are not provided on a protocol or for research purposes, there may
be missing elements that were never or rarely collected. Some covariate and end-
point assessments may be less frequently or sporadically measured, as these data-
bases reflect real-world medical practice. Lastly, many types of endpoints cannot be
assessed in these types of databases. Understanding the limitations of the data is key
to understand if this approach is feasible.
Previous clinical trials can provide robust control data. Clinical trials are
considered the gold standard of clinical evidence as they are highly controlled
and perform thorough data collection. Many variables may be collected for
completeness. Additionally, adherence to medications is often more closely
monitored and most potential variables are recorded. This allows investigators
to evaluate a wide range of variables during the design and analysis phase.
However, clinical trials tend to have highly selected populations, which limits
the generalizability and applicability to many populations. Often, most of these
data will not mimic the studied population, intervention, or inclusion/exclusion
criteria of the population to be compared to, making it difficult to use these data
as a source for controls.

Defining Variables

Defining exposure/treatment can vary depending on the type of data collection and
study being conducted. In prospective data collection, exposure definition, dates,
duration, dose, and any changes due to adverse events can be matched exactly to the
trial assessment schedule and definitions. In retrospective data abstraction efforts, it
may be straightforward to identify what specific regimen or protocol was anticipated
for the patient. However, there may be specific details missing such as the exact
dose(s) administered or adjustments for toxicity. Using large databases, treatment
regimen typically must be derived using an algorithm which can be prone to errors
and assumptions, particularly for multi-agent treatment regimens. For prescriptions
that are filled through a pharmacy and not administered in a clinic setting, it is only
known with certainty that the prescription was picked up, not necessarily whether
it was actually taken by the patient. This highlights that in using historical data,
assumptions must be made and algorithms developed that need to be validated where
possible and/or evaluated in sensitivity analyses (Table 1).
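
As an illustration, the sketch below derives treatment episodes from entirely
hypothetical pharmacy fill data using a hypothetical 30-day grace-period rule; a
production algorithm would also need to handle stockpiling, switching, and
in-clinic administration:

claims <- data.frame(
  patient = c(1, 1, 1, 2, 2),
  fill    = as.Date(c("2020-01-05", "2020-02-01", "2020-06-10",
                      "2020-03-15", "2020-04-10")),
  supply  = 30)                                # days supplied per fill
episodes <- function(d, gap = 30) {
  d <- d[order(d$fill), ]
  ## a new episode starts when the gap since the previous fill exceeds
  ## its days of supply plus the allowed grace period
  new <- c(TRUE, diff(d$fill) > d$supply[-nrow(d)] + gap)
  tapply(d$fill, cumsum(new), range)           # first and last fill per episode
}
by(claims, claims$patient, episodes)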
Collection of relevant prognostic covariates is important for assessment of patient
population comparability. In a prospective data collection, all baseline covariates can
be ascertained with a complete baseline assessment. In a retrospective collection,
typically, disease-relevant clinical covariates will be routinely collected. However, a
complete assessment as typically done on a trial will not be conducted unless a
patient has specific medical conditions necessitating it. In a database, some labs or
imaging data may not be routinely available. Claims typically do not contain specific
lab values; EMR typically do not contain comorbid conditions captured outside of
that specific clinic.
Carefully determined endpoints are critical to study success. In prospective data
collection, endpoints can follow assessment schedules and definitions just like in the
clinical trial. In retrospective data sources where death is captured, overall survival is
easiest to evaluate because of limited heterogeneity in endpoint determination (death
is death). For composite endpoints such as relapse-free survival, event-free survival,
and progression-free survival, ascertainment can vary depending on frequency of
assessments, which may be somewhat less frequent and subject to some heteroge-
neity. However, in administrative or EMR database studies, response to therapy may
be difficult to ascertain systematically because of heterogeneity in timing or because
response assessment is not systematically recorded in real-world medical practice data. Some
procedures to ascertain response may not be conducted if there is clear evidence of
no treatment response (e.g., if bone marrow aspirate required for response assess-
ment but patient exhibits overt symptoms). Other times, a proxy measure may be
sufficient for some endpoints due to similarity in timing of events (e.g., time to next
treatment in some scenarios can be similar to progression-free survival). However,
these proxies should be validated as a facsimile of the endpoint in question prior to
use in a comparative analysis. Lastly, caution should be exercised when assessing
safety endpoints from retrospective or databases as they may not be systematically
captured except for expected/known toxicities common to the disease/treatments
resulting in a visit to the hospital. This may lead to biased ascertainment of adverse
events and lead to inappropriate comparisons. Generally, these issues do not apply to
prospectively collected data if the assessment schedule mirrors that of the trial.

Table 1 Suitability of data collection methods for exposure, covariate, and endpoint ascertainment

Prospective
  Exposure: collect exact dates, duration, dosing, and changes
  Covariates: can do a full baseline assessment for all relevant covariates
  Endpoints: set the exact definitions of endpoints and the assessment schedule

Retrospective chart abstraction or clinical database
  Exposure: identify treatments, but may be missing exact doses and fine clinical details
  Covariates: demographics and disease-relevant clinical characteristics; some comorbidities
  Endpoints: can be routinely collected in medical practice, but schedule of assessments is less frequent than in a trial

Claims, registry, or EMR database
  Exposure: can be inconsistent in details, may require algorithms to identify treatments
  Covariates: claims can identify demographics and comorbidities; EMR can identify demographics and some clinical characteristics
  Endpoints: may lack some endpoints that require lab or imaging results

Bias

In 1976, Pocock described the use of historical controls in clinical trials (Pocock
1976), and the acceptability of a historical control group requires that it meet the
following conditions:

1. Exposure: such a group must have received a precisely defined standard treatment
which must be the same as the treatment for randomized controls.
2. Patient selection: the group must have been part of a recent clinical study which
contained the same requirements for patient eligibility.
3. Outcome: the methods of treatment evaluation must be the same.
4. Covariates: the distributions of important patient characteristics in the group
should be comparable with those in the new trial.
5. Site selection: the previous study must have been performed in the same organi-
zation with largely the same clinical investigators.
6. Confounding: there must be no other indications leading one to expect differing
results between the randomized and historical controls.

Only if all these conditions are met can one safely use the historical controls as
part of a randomized trial. Although meeting all of these conditions would result in
an ideal historical comparator, it is not always feasible. In previous sections, types of
data sources, how to define exposures, outcomes, and covariates are discussed.
Later, analytic methods to assess and control for confounding will be discussed,
and in this section, Pocock’s conditions, bias, and how to mitigate bias are
discussed. Appropriate control for confounding as described in the analytics section
may reduce bias (Greenland and Morgenstern 2001).

Selection Bias

Selection bias may be introduced if the patients selected for the historical compar-
ators are not comparable to the clinical trial-treated subjects (other than treatment
exposure) or do not contain the same patient eligibility requirements. Prospective
study designs would minimize this bias the most, as investigators are able to define
eligibility criteria similarly to that of the clinical trial, with the goal to include
patients that would have been able to participate in the clinical trial. For studies
using existing data, either from clinical sites or through large existing databases,
careful selection of patients restricting inclusion and exclusion criteria is required;
however, not all clinical trial eligibility criteria may be found in these types of data
sources. Additionally, for existing data sources, selection of patients when the
outcome is known can bias the results to produce a favorable evaluation for the
drug or device under study. This bias may be mitigated by including all patients who
meet the eligibility criteria. If a random sample of patients is selected, then the
outcome must be blinded. Random selection can help provide a mix of patients at
various lines of therapy. When there are multiple treatments received over time (e.g.,
different lines of therapy), bias may be introduced when selecting the treatment line
for which to assess outcomes. For instance, if subjects in the clinical trial had to have
previously failed at least two prior lines of therapy and the majority of subjects only
had two prior lines of therapy, bias would be introduced if the majority of patients in
the historical comparator had three prior lines of therapy.
In addition to bias resulting from patient selection, for studies using existing data
from clinical sites, investigators need to ensure the clinical sites, including type of site
(e.g., academic hospitals, large specialty centers), country of site, standard treatments
used, and types of patients undergoing treatment at those clinics are comparable to the
clinical trial sites and subjects. However, the study does not necessarily have to
be performed at the same clinical sites or with the same investigators.
Finally, the time period in which the historical comparator is drawn from will
need to be carefully assessed for comparators that are nonconcurrent. For instance, if
there have been significant changes in medicine or technology over time (e.g., earlier
diagnosis of disease, changes in treatment effectiveness, or better supportive care
measures), having nonconcurrent comparators can bias the results to favor the drug
or device. Thus, it is important to carefully assess changes in the treatment landscape
over time when selecting patients for historical comparators.

Information Bias

Appropriate measurement of the exposure, outcomes, and covariates was previously
discussed, and appropriate measurement and minimization of missing data can
reduce information or measurement bias. When data are collected from multiple
existing data sources (e.g., multiple clinical sites), standardization of data collection
forms will minimize measurement error. An issue comparative effectiveness obser-
vational research is the inappropriate selection of follow-up time. This creates a bias
in favor of the treatment group where during a period of follow-up, an event cannot
occur due to a delay or wait period for the treatment to be administered. This is
known as immortal time bias. Immortal time bias can be appropriately handled by
assigning follow-up when events can occur and using a time-varying analysis, not a
fixed-time analysis. Finally, the length of follow-up time after the treatment exposure
can impact outcomes. For instance, when following patients for death, the longer the
follow-up time, the more likely events will accrue. The historical comparator must
have a comparable follow-up time as the clinical trial patients.
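
The sketch below illustrates the point with simulated data in which treatment does
not affect the hazard: a naive fixed-time analysis credits the pre-treatment waiting
period to the treated group and looks spuriously protective, whereas the counting-
process (time-varying) analysis does not (survival package; all parameter values
are arbitrary):

library(survival)
set.seed(5)
n     <- 200
wait  <- rexp(n, 1.0)                # time from baseline to planned treatment start
death <- rexp(n, 0.2)                # event time, independent of treatment
treated <- death > wait              # only patients surviving the wait get treated
rows <- do.call(rbind, lapply(seq_len(n), function(i) {
  if (treated[i])
    data.frame(start = c(0, wait[i]), stop = c(wait[i], death[i]),
               trt = 0:1, event = c(0, 1))
  else
    data.frame(start = 0, stop = death[i], trt = 0, event = 1)
}))
coxph(Surv(death) ~ treated)                        # naive: spuriously protective
coxph(Surv(start, stop, event) ~ trt, data = rows)  # time-varying: HR near 1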

Analytic Methods

Prior to beginning any comparative analyses, it is important to characterize each
patient population by evaluating the important covariates in both populations. The
first step is to make sure that the inclusion/exclusion criteria for the historical
comparator match as closely as possible to the clinical trial subjects. Once the
primary criteria are matched between populations, other baseline clinical character-
istics can be described and evaluated for differences. These covariates should be as
balanced as possible so the outcomes comparisons can be meaningful and not
attributed to one population being sicker/healthier.
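
One commonly used balance metric (not named above, but standard in this setting)
is the standardized mean difference; absolute values above roughly 0.1 are often
taken to flag meaningful imbalance. A sketch with hypothetical summary statistics
for age:

smd <- function(m1, m0, s1, s0) (m1 - m0) / sqrt((s1^2 + s0^2) / 2)
smd(m1 = 61.2, m0 = 58.4, s1 = 9.1, s0 = 10.3)   # trial vs. historical cohort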
As data between studies are collected at different time points and on different
schedules, the heterogeneity in endpoints must be accounted for when defining the
analysis of time-dependent variables and time-to-event endpoints. These variables
will include treatment initiation, treatment response assessment, and overall survival.
Alignment on these variables is important so as not to create biased analyses that are
invalid and inappropriately favor one group over another as can occur with inap-
propriate follow-up creating immortal time bias.
Several analytic and design approaches are available to account for information
bias. Two analytic methods are the use of simulation-extrapolation or regression
calibration to account for measurement error of an exposure/treatment. Depending
on how much is known about the type of misspecification in the distribution of the
variable, simulation-extrapolation may have less bias if the true measurement dis-
tribution is unknown or misspecified compared to regression calibration. Sensitivity
analysis with subsets of data to evaluate the consistency of results can also help to
identify data/bias issues. During study planning stages, designing data collection and
review with multiple checks from data reviewers can also help identify and avoid
confirmation biases that are inherent to variables where human judgment is required.
When conducting comparisons between two separate study populations, several
different options are available. The simplest method is to conduct a weighted analysis
based on a baseline covariate (or a few). This is a straightforward method of adjust-
ment based on levels of a covariate. However, there may be uncontrolled confounding
when using such a simple adjustment method as it only adjusts for a few characteristics
with simple groupings. But when there are not many prognostic covariates to consider
or the populations are relatively well balanced on measured covariates without
adjustment, simple weighting may provide adequate and easy-to-interpret compari-
sons. Another option is to use a multiple regression model. Multiple regression will
adjust estimates of an endpoint measure with the assumptions normally associated
with fitting multiple covariates in the model. However, many of these assumptions
such as linearity or distributions of the covariates may not be met leading to a violation
of model assumptions and resulting in biased estimates.
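
As a minimal sketch of such a weighted comparison (all numbers hypothetical), a
historical complete remission (CR) rate can be standardized to the trial's mix of a
single risk stratum:

hist_cr <- c(good = 0.35, poor = 0.15)   # historical CR rate by risk stratum
hist_n  <- c(good = 120,  poor = 180)    # historical stratum sizes
trial_p <- c(good = 0.60, poor = 0.40)   # stratum mix among trial subjects
weighted.mean(hist_cr, hist_n)           # crude historical CR rate
sum(trial_p * hist_cr)                   # CR rate reweighted to the trial mix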
Another method that is more flexible and reduces bias in comparisons is the use of
propensity score as adjustment (D’Agostino 1998). Propensity scores estimate the
probability of being assigned to a treatment group based on baseline covariates
entered in a model. First, the propensity score for each patient is derived using a
logistic model with many baseline covariates. It is possible to account for many
covariates, including interactions between variables, in deriving the propensity
scores. A wide range of variables can be accounted for compared to a traditional
multivariate logistic or Cox regression model. The distribution of propensity score
for both cohorts should be described and then the adjustment method can be chosen.
Two predominant methods of using propensity scores exist: matching or
weighting. When sample size is abundant, matching has some advantages. However,
when sample size is a concern, weighting provides a bit more flexibility at the cost of
less direct matching. Risk of outcomes associated with treatment exposure can be
presented as propensity score-adjusted odds ratios (OR) or hazard ratios (HR) with
95% CIs. Additionally, other covariates may be assessed and adjusted for in the risk
models. Propensity score models have been used in comparing phase 2 single-arm
trials to historical or real-world studies. One example, discussed below, compared
blinatumomab to historical standard-of-care chemotherapy for relapsed/refractory
acute lymphoblastic leukemia (Gokbuget et al. 2016b); another compared alectinib
for non-small cell lung cancer to real-world clinic outcomes (Davies et al. 2018).
When sample size of
the assessed studies is small, statistical power can be boosted by including use of a
Bayesian prior. This method “borrows” data from a similar statistical analysis
(typically the same disease with similar drug) to augment the effect estimates. This
may lend additional credibility to study results if the a priori data are relevant.
Additional analytics to consider are outlined well elsewhere (Lim et al. 2018).
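
A compact sketch of this workflow on simulated data (hypothetical covariates;
ATT-style weights that reweight controls toward the treated group; the survival
package supplies the weighted Cox model):

library(survival)
set.seed(6)
n   <- 400
dat <- data.frame(age = rnorm(n, 60, 8), prior_lines = rpois(n, 2))
dat$trt   <- rbinom(n, 1, plogis(-0.05 * (dat$age - 60) - 0.3 * dat$prior_lines))
dat$os    <- rexp(n, 0.05 * exp(0.02 * (dat$age - 60) - 0.4 * dat$trt))
dat$death <- 1                                 # no censoring, for simplicity
ps <- glm(trt ~ age + prior_lines, family = binomial, data = dat)$fitted.values
w  <- ifelse(dat$trt == 1, 1, ps / (1 - ps))   # controls reweighted to the treated
coxph(Surv(os, death) ~ trt, data = dat, weights = w, robust = TRUE)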

Examples of the Use of Historical Comparators to Test Efficacy

Lim et al. describe several examples where drug approvals have used historical
comparator data in settings of rare diseases, including oncology and other life-
threatening conditions (Lim et al. 2018). One of these examples was for acute
lymphoblastic leukemia (ALL) and the use of historical comparators to put the
results from the blinatumomab single-arm phase 2 clinical trial into context
(Gokbuget et al. 2016b). In this example, historical data were pooled from large
study groups and individual clinical sites treating patients with Philadelphia
chromosome-negative, B-precursor, relapsed, or refractory ALL with standard of
care chemotherapy in Europe and the United States.
Outcomes, such as complete remission and overall survival, in the historical
comparator patients were either weighted to the distribution of important clinical
prognostic predictors in the blinatumomab trial subjects or were estimated using
propensity scores and inverse probability of treatment weighting. Weighted analyses
of the historical comparators provided a complete remission (as defined by the study
groups) estimate of 24% [95% confidence interval (CI), 20–27%] and a median
overall survival of 3.3 months (95% CI, 2.8–3.6 months). In the propensity score
model, the predicted complete response was 27% (95% CI, 23–30%) in the historical
comparators, with a statistically significant twofold increase in the odds of achieving
complete remission in the blinatumomab subjects versus the historical comparators,
and a statistically significant hazard ratio for overall survival (0.53, 95% CI, 0.39–
0.73) favoring blinatumomab. Several sensitivity analyses were conducted with
alternative treatment effect analyses and outlier stabilizations. Yet, the results
remained consistent and robust. The authors raised potential issues in residual
confounding, temporality of historical data (data included patients as far back as
1990), and heterogeneity in data collection.
Interestingly, the weighted complete remission and overall survival data from the
historical comparator was similar to the data in controls from the subsequently
completed randomized phase 3 clinical trial of blinatumomab compared to standard
of care chemotherapy (Kantarjian et al. 2017). Complete remission in the subjects
receiving standard of care chemotherapy was 24.6% (95% CI, 17.6–32.8%), and the
median overall survival was 4.0 months (95% CI, 2.9–5.3 months). This example
provides a good illustration of the use of historical comparators to put single-arm
clinical trial data into context and was reassuringly confirmed to provide data similar
to the standard of care arm in the randomized phase 3 clinical trial.
In another example, data from two phase 2 trials of alectinib for the treatment of
anaplastic lymphoma kinase-positive (ALK) non-small cell lung cancer (NSCLC)
were pooled and compared to ALK NSCLC patients treated with ceritinib from the
Flatiron electronic health records database, analyzed with IPTW adjustment
(Davies et al. 2018). The primary endpoint was overall survival.
At baseline, the ceritinib group was older, had less pretreatment, and had fewer CNS
metastases. These covariates were balanced after adjustment with IPTW. The
weighted survival of the alectinib group was 24 months (95% CI, 21-NR) and
in the ceritinib group was 16 months (95% CI, 16–19) with a hazard ratio of
0.65 (95% CI, 0.48–0.88) favoring the alectinib group. The results were evaluated
against a number of sensitivity analyses (i.e., exclusion of outliers, inclusion of
additional covariates in model specification, alternative treatment weighting
methods), and the results were found to be consistent. The authors pointed out
several limitations in the analyses such as residual confounding due to unmeasured
covariates, differential follow-up time between the two treatment groups, and
inherent differences in the data collection practices between groups. Despite these
limitations, these data demonstrated that newer generation of tyrosine kinase inhib-
itors can reduce the risk of death in ALK NSCLC patients.

Summary and Conclusion

The use of historical controls can serve to provide an alternate form of evidence for
understanding treatment effects in nonrandomized studies. In appropriate situations
where a randomized trial may be unethical and/or the unmet need is great, historical
controls may be particularly useful for contextualizing study outcomes in the absence
of randomized trials. However, many methodologic issues must be considered,
such as the appropriateness of data sources, the specific data that need to
be collected, application of appropriate inclusion/exclusion criteria, accounting for
bias, and statistical adjustment for confounding. When these considerations
are appropriately handled, use of historical controls can have immense value by
providing further evidence of the efficacy, effectiveness, and safety in the develop-
ment of novel therapies and by increasing the efficiency of the conduct of clinical trials and
regulatory approvals.

Key Facts

• External or historical controls can provide the needed clinical evidence for
granting conditional or accelerated regulatory approvals for novel drugs.
• Methodologic considerations for the use of external or historical controls must be
considered in order to appropriately compare real-world data to clinical trial data.
• When the methodologic considerations are appropriately handled, use of histor-
ical controls can have immense value by providing evidence to support the
efficacy and safety of novel therapies.

Cross-References

▶ Evolution of Clinical Trials Science

Funding Statement and Declarations of Conflicting Interest CK, VC, and MK are employees
and shareholders of Amgen Inc.

References
D’Agostino RB Jr (1998) Propensity score methods for bias reduction in the comparison of a
treatment to a non-randomized control group. Stat Med 17:2265–2281
Davies J et al (2018) Comparative effectiveness from a single-arm trial and real-world data:
alectinib versus ceritinib. J Comp Eff Res. https://doi.org/10.2217/cer-2018-0032
Desai JR, Bowen EA, Danielson MM, Allam RR, Cantor MN (2013) Creation and implementation
of a historical controls database from randomized clinical trials. J Am Med Inform Assoc 20:
e162–e168. https://doi.org/10.1136/amiajnl-2012-001257
Dreyer NA (2018) Advancing a framework for regulatory use of real-world evidence: when real is
reliable. Ther Innov Regul Sci 52:362–368. https://doi.org/10.1177/2168479018763591
Gokbuget N et al (2016a) International reference analysis of outcomes in adults with B-precursor
Ph-negative relapsed/refractory acute lymphoblastic leukemia. Haematologica 101:1524–1533.
https://doi.org/10.3324/haematol.2016.144311
Gokbuget N et al (2016b) Blinatumomab vs historical standard therapy of adult relapsed/refractory
acute lymphoblastic leukemia. Blood Cancer J 6:e473. https://doi.org/10.1038/bcj.2016.84
Greenland S, Morgenstern H (2001) Confounding in health research. Annu Rev Public Health
22:189–212. https://doi.org/10.1146/annurev.publhealth.22.1.189
Kantarjian H et al (2017) Blinatumomab versus chemotherapy for advanced acute lymphoblastic
leukemia. N Engl J Med 376:836–847. https://doi.org/10.1056/NEJMoa1609783
Lee SJ et al (2015) Measuring therapeutic response in chronic graft-versus-host disease.
National Institutes of Health consensus development project on criteria for clinical trials in
chronic graft-versus-host disease: IV. The 2014 Response Criteria Working Group report. Biol
Blood Marrow Transplant 21:984–999. https://doi.org/10.1016/j.bbmt.2015.02.025
Li S, Molony JT, Cetin K, Wasser JS, Altomare I (2018) Rate of bleeding-related episodes in elderly
patients with primary immune thrombocytopenia: a retrospective cohort study. Curr Med Res
Opin 34:209–216. https://doi.org/10.1080/03007995.2017.1360852
Lim J et al (2018) Minimizing patient burden through the use of historical subject-level data in
innovative confirmatory clinical trials: review of methods and opportunities. Ther Innov Regul
Sci. https://doi.org/10.1177/2168479018778282
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic
Dis 29:175–188
Richardson PG et al (2005) Bortezomib or high-dose dexamethasone for relapsed multiple
myeloma. N Engl J Med 352:2487–2498. https://doi.org/10.1056/NEJMoa043445
USFDA USFaDA (2018) Real-world Evidence. https://www.fda.gov/scienceresearch/
specialtopics/realworldevidence/default.htm. Accessed 15 Aug 2018
50 Outcomes in Clinical Trials

Justin M. Leach, Inmaculada Aban, and Gary R. Cutter

Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]; [email protected]; [email protected]

Contents
  Introduction, Definitions, and General Considerations  892
    What Is an Outcome?  892
    Outcomes in Clinical Trials  892
    Where Are We Going?  893
  Types of Outcome Measures  894
    Clinical Distinctions Between Outcomes  894
    Quantitative and Qualitative Descriptions of Outcomes  897
    Safety  899
  Choosing Outcome Measures  900
    Nonstatistical and Practical Considerations  901
    Assessing Outcome Measures  901
    Statistical Considerations  904
    Reporting Outcomes  906
  Multiple Outcomes  907
    Multiple (Possibly Related) Measures  907
    Longitudinal Studies  910
  Summary/Conclusion  912
  Key Facts  912
  References  913

Abstract
Selecting outcomes for clinical trials requires a wide range of considerations relating
to clinical interpretation, ethics relating to therapy effectiveness and safety, and
statistical optimality of measures. Appropriate outcome choice plays a key role in
determining the usefulness and/or success of a study and can affect whether a
proposed study is viewed favorably by funding and regulatory agencies. Many
regulatory and funding agencies provide guidance on the types of outcomes that
are appropriate or acceptable in various contexts, and it is important to understand
regulatory guidelines, definitions, and expectations when choosing clinical out-
comes. This chapter provides an overview of the clinical, practical, and statistical
considerations in choosing outcomes, with a focus on the intersection between the
considerations themselves and the standards and definitions provided by regulatory
and funding agencies. Section “Introduction, Definitions, and General Consider-
ations” introduces basic definitions and broad considerations in clinical trials out-
comes. Section “Types of Outcome Measures” discusses clinical distinctions
between outcomes, mathematical descriptions of outcomes, and safety consider-
ations. Section “Choosing Outcome Measures” discusses both statistical and prac-
tical considerations in outcome choice, introduces approaches for evaluating the
quality of selected outcomes, and makes distinctions in reporting outcomes. Section
“Multiple Outcomes” examines the intricacies involved in using multiple outcomes,
specifically multiple outcomes consisting of different, but possibly related, mea-
sures, and longitudinal studies that measure the same outcome at multiple times.

Keywords
Primary outcomes · Secondary outcomes · Biomarker · Multiple measures

Introduction, Definitions, and General Considerations

What Is an Outcome?

All studies consist of taking measurements of varying levels of complexity. While
finished studies make included measurements appear obvious and necessary, careful
planning is necessary to ensure achievement of study goals. There are two key
classes of measurements in studies: outcomes (dependent variables) and predictors
(independent variables). In the most basic sense, studies ask whether measured
predictors can account for variation in outcomes. In clinical trials, the most important predictors are typically therapies or drugs, and the outcomes are measures with clinical significance. For example, suppose we study the effectiveness of a cigarette smoking
cessation method. The (primary) outcome would likely be cigarette cessation (or
not); the study then seeks to answer whether the cessation method was associated
with higher rates of cigarette smoking cessation. However, decisions regarding
outcomes may be more complex than one would naively expect. For smoking
cessation, the broad outcome of interest is whether a subject quit smoking, but
researchers and regulators may care about how one defines quitting and/or how
long the cessation lasts. Which measures most adequately capture relevant clinical
concerns and are within the practical limits of conducting a trial?

Outcomes in Clinical Trials

The FDA defines a clinical outcome as “an outcome that describes or reflects how an
individual feels, functions, or survives" (FDA-NIH Biomarker Working Group).
Choices regarding outcomes in clinical trials are often further constrained compared
to studies in general. Ethical considerations guide the choice of outcomes, and the
regulators, i.e., the Food and Drug Administration (FDA), the European Medicines Agency (EMA), institutional review boards, and/or other relevant funding agencies, ensure that these ethical considerations are addressed. Many clinical trials involve therapies that carry significant
risks, especially in the case of surgical interventions or drugs. In phase III trials, the
FDA requires that the effects of therapies under consideration be clinically mean-
ingful to come to market (Sullivan n.d.). The expectation is that a therapy’s benefits
will sufficiently outweigh the risks. The FDA gives three reasons for which patients
reasonably undertake treatment risks:

1. Increased survival rates
2. Detectable patient benefits
3. Decreased risk of disease development and/or complications

Outcomes inherently vary in importance. A primary outcome should be a
measure capable of answering the main research question and is expected to
directly measure at least one of the above reasons for taking risks. Treatment
differences in primary outcomes generally determine whether a therapy is
believed to be effective. Researchers often measure a single primary outcome
and several secondary outcomes. Secondary outcomes may be related to the
primary outcomes, but of lower importance or may not be inherently feasible
to use as a primary outcome due to duration of the study needed to assess them
or the sample size required to defend the study as adequately powered. Outcomes
measuring participant safety must ensure that the risk-benefit ratio is sufficiently
high. In cigarette cessation trials, perhaps smoking cessation maintained for 6 weeks serves as the primary outcome, while cessation after 6 months and at 1 year serve as secondary outcomes. The longer-duration cessation is actually more important
but may make the size and/or duration of the trial not feasible due to expected
recidivism or losses to follow-up. Note that therapies not involving drugs
typically still require recording adverse events. For example, smoking cessation
therapy studies may be concerned about depression or withdrawal symptoms
(Motooka et al. 2018). When conducting an exercise study to improve fitness in
disabled multiple sclerosis patients, we need to be cognizant of falls and thus
measure and record their occurrences.

Where Are We Going?

The focus of this chapter is the mathematical, clinical, and practical considerations
necessary to determine appropriate outcomes. Section “Types of Outcome Mea-
sures” introduces and discusses biomarkers and direct and surrogate outcomes and
defines mathematical descriptions of variables. Section “Choosing Outcome Mea-
sures” considers the clinical, practical, and statistical considerations in outcome
choice. Finally, Section “Multiple Outcomes” examines the benefits and complica-
tions of using multiple outcomes.

Types of Outcome Measures

Clinical Distinctions Between Outcomes

Direct Endpoints
The FDA defines direct endpoints as outcomes that directly describe patient well-
being; these are categorized as objective or subjective measures. Objective measures
explicitly describe and/or measure clinical outcomes and leave little room for
individual interpretation. Some common objective measures are as follows:

1. Patient survival/death.
2. Disease incidence; e.g., did the subject develop hypertension during the study
period given they were free of hypertension at the start of the study?
3. Disease progression; e.g., did the subject’s neurological function worsen during
the study period?
4. Clinical events; e.g., myocardial infarction, stroke, multiple sclerosis relapse.

Subjective measures often depend upon a subject’s perception. For health out-
comes, this is often in terms of disease symptoms or quality of life (QoL) scores.
Subjective endpoints are complicated by their openness to interpretation, either between or within subjects' responses or raters' assessments, and whether or
which measures adequately capture the quality of interest is often debatable.
Ensuring unbiased ascertainment and uniformity of measurement interpretation is
difficult when the outcome is, say, QoL or global impressions of improvement, compared to objective endpoints such as death or incident stroke. Measure assess-
ment is covered in detail in section “Choosing Outcome Measures.”
Note that regulatory agencies prefer direct endpoints as primary outcomes,
particularly for new drug approval. Several issues arise from using what we will denote as the elusive surrogate measures or biomarkers, and these issues can make their use less than optimal.

Surrogate Endpoints
Surrogate endpoints are substitutes for direct or clinically meaningful endpoints and
are typically employed in circumstances where direct endpoints are too costly, are
too downstream in time or complexity, or are unethical to obtain. Few true surrogates
exist if one uses the definition provided by Prentice (1989). In the Prentice definition,
the surrogate is tantamount to the actual outcome of interest, but this is often
unachievable. While there is some concurrence on the existence of a so-called
surrogate, these are often laboratory measures or measurable physical attributes
from subjects, such as CD4 counts in HIV trials, although still lacking in meeting
the Prentice definition.
Surrogate endpoints may avoid costly or unethical situations, but the researcher
must provide strong evidence that the surrogate outcome is predictive of, correlated
with, and/or preferably in the therapeutic pathway between the drug or treatment and
expected clinically significant benefit. Importantly, while the Prentice criteria argue
for complete replacement of the endpoint by the surrogate, the generally accepted
goal of a surrogate endpoint is to be sufficiently predictive of the direct endpoint.
In the case of sufficiently severe illness, researchers may obtain “accelerated
approval” for surrogate endpoints, but further trials demonstrating the relation
between surrogate and direct endpoints are typically required despite initial
approval. Surrogate endpoints can be classified into the following stages of valida-
tion (Surrogate Endpoint Resources for Drug and Biologic Development n.d.):

1. Candidate surrogate endpoints are in the process of proving their worth as
predictors of clinical benefits to subjects.
2. Reasonably likely surrogate endpoints are “endpoints supported by strong mech-
anistic and/or epidemiologic rationale such that an effect on the surrogate end-
point is expected to be correlated with an endpoint intended to assess clinical
benefit in clinical trials, but without sufficient clinical data to show that it is a
validated surrogate endpoint” (FDA-NIH Biomarker Working Group). These are
more likely to receive accelerated approval than candidate surrogate endpoints.
3. Validated surrogate endpoints “are supported by a clear mechanistic rationale and
clinical data providing strong evidence that an effect on the surrogate endpoint
predicts a specific clinical benefit” (FDA-NIH Biomarker Working Group).
Validated surrogate endpoints are generally accepted by funding agencies as
primary outcomes in clinical trials and generally are not required to provide
further studies in support of the relationship between the surrogate and direct
endpoint.

For validation, regulatory agencies prefer more than one study establishing the
relationship between direct and surrogate endpoints. A major drawback to surrogate
outcomes is that relationships between surrogate and direct endpoints may not be
causal even when the correlation is strong; even if the relationship is (partially)
causal, surrogate outcomes may not fully predict the clinically relevant outcome,
especially for complicated medical conditions. Two problems thus arise:

1. A drug could have the desired beneficial effect on the surrogate outcome but also
have a negative effect on an (possibly unmeasured) aspect of the disease, render-
ing the drug less effective than anticipated/believed.
2. Drugs designed to treat a medical condition may have varying mechanisms of
action, and it does not follow that validated surrogate endpoints are equally valid
for drugs with differing mechanisms of action.

These drawbacks can bias the estimate of a benefit-risk ratio, especially in smaller or
shorter studies, where there may be insufficient sample size or follow-up time to
capture a representative number of adverse events. Pairing underestimation of
adverse events with too-optimistic beliefs regarding the therapeutic benefits can
result in overselling a mediocre or relatively ineffective therapy.
Surrogates are often used in phase II studies initially before they can be accepted
as legitimate for phase III clinical outcomes as surrogate endpoints are not often
clinically meaningful in their own right. Phase II trials can use biomarkers or
indicators of processes that are not necessarily surrogates.

Biomarkers
Biomarkers are “a defined characteristic that is objectively measured as an indicator
of normal biological processes, pathologic processes, or responses to an exposure or
intervention, including therapeutic interventions” (FDA-NIH Biomarker Working
Group). Biomarkers are often useful as secondary outcomes regarding subject safety
or validation that a therapy induces the expected biological response or a primary
outcome in phase II proof of concept trials. Most validated surrogate endpoints are in
fact biomarkers.
Biomarkers are often chosen as outcomes for the same reasons that surrogates are
used: shortened trials, smaller sample sizes, etc. However, biomarkers are often more
specific to certain treatments than surrogates. For example, in multiple sclerosis,
MRI reveals small areas of inflammation when viewed after injection of a contrast agent, gadolinium. Gadolinium-enhanced lesions are used in phase II trials as proof-of-concept primary outcomes, but they do not clearly predict disability out-
comes, which are the goal of disease-modifying therapy, and they are clinically
meaningful only through their repeated linkage to successful drug treatment.
Sormani et al. have shown that they are acting as surrogates at the study level
(Sormani et al. 2009, 2010). These counts of enhancing lesions seem to be bio-
markers for inflammation, and their absence following treatment has been taken as a
sign of efficacy. However, there are now drugs that virtually eliminate these enhanc-
ing lesions, yet progression of disability still occurs, so they are not a good choice for
an outcome comparing two effective drugs where both may eliminate enhancing
lesions but have differences in their effects on disability.
Biomarkers are also useful not only as outcome variables, but as predictors of
outcomes on which to enrich a trial making it easier to see changes in the biomarkers
or primary outcomes. Biomarker-responsive trials select individuals who have
shown that with certain biomarkers or certain levels of biomarkers, the participants
are at increased risk of events or are more responsive to treatment. This seems like a
rational approach, but there are several caveats to the uncritical use of this selection.
Simon and Maitournam point out that the efficiency of these designs is often not seen
unless the proportion of biomarker-positive responders is less than 50% and the
response in those who are biomarker negative is negligible (Simon and
Maitournam 2004). The reason for this counterintuitive finding is that the cost of screening can overwhelm the gains from removing the dampening of the response by biomarker-negative individuals, making biomarker selection an added logistic issue while not enhancing the design over simple increases in sample size and stratification. In
other situations, the biomarker’s behavior needs to be carefully considered. In
Alzheimer’s trials, it has been argued that more efficient trials could be done if
patients were selected based on a protein, tau, found in their cerebrospinal fluid. This is
because patients with these proteins have more rapid declines in their disease as
measured by the usual cognitive test outcomes. However, Kennedy et al. (2015)
showed that when designing a study based on selection for tau positivity, the gains in
sample size reduction due to the greater cognitive declines, which make percent
changes easier to detect, were offset by the increased variation in cognitive decline
among the biomarker positive subset. This results from assuming that the variance is
the same or smaller in the biomarker-positive subset compared to the larger
population.
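To make the offset concrete, consider a minimal numeric sketch (the values of the expected decline δ and its standard deviation σ are purely illustrative, not taken from any actual trial). The per-arm sample size for comparing mean declines scales with (σ/δ)², so enrichment helps only if δ grows proportionally faster than σ:

# Hypothetical values: enrichment increases the detectable decline (delta)
# but also its standard deviation (sigma); required sample size scales
# with (sigma / delta) ** 2.
all_comers = (4.0 / 1.0) ** 2   # sigma = 4.0, delta = 1.0 -> 16.0
enriched = (6.5 / 1.5) ** 2     # sigma = 6.5, delta = 1.5 -> ~18.8
print(all_comers, enriched)     # enrichment here increases the required n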

Quantitative and Qualitative Descriptions of Outcomes

Thus far we have definitions of outcomes that relate to what we want to measure
rather than how to quantify the measurement. The biological quantity of interest may
be clear, but decisions about how to measure that quantity can affect the viability of a
study or the reasonableness of the results. Outcomes are described as either quan-
titative or qualitative. In the following sections, we distinguish between these, give
several examples, and discuss several common subtypes of outcomes.

Quantitative Outcomes
Quantitative outcomes are measurements that correspond to a meaningful numeric
scale and can be broken down into continuous and discrete measurements (or
variables). In mathematical terms, continuous variables can take any of the infinite
number of values between any two numbers in its support; they are uncountable. On
the other hand, discrete variables are countable. A few examples should help clarify
the difference.
Systolic blood pressure is a continuous outcome. In theory, blood pressure can
take any value between zero and infinity. For example, a subject participating in a
hypertension study may have a baseline systolic blood pressure measurement of
133.6224. We may round this number for simplicity, but it is readily interpretable as
it stands. In most cases discrete quantitative outcomes consist of positive whole
numbers and often represent counts. For example, how many cigarettes did a subject
smoke in the last week? The answer is constrained to nonnegative whole numbers: 0,
1, 2, 3, . . .. While perhaps it is possible to conceive of smoking half a cigarette, the
researcher needs to decide a priori whether the data collection system should accept fractions or to develop clear rules so that discrete values are recorded in the same way for all participants.
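As a sketch of such a pre-specified recording rule, the following function rejects negative entries and applies a hypothetical rounding convention for partial cigarettes (the rule itself is invented for illustration; a real trial would document its own):

def record_cigarette_count(raw):
    # A priori rule for a weekly cigarette count: reject negative entries;
    # count any partial cigarette as one whole cigarette.
    value = float(raw)
    if value < 0:
        raise ValueError("Counts must be nonnegative.")
    if not value.is_integer():
        value = int(value) + 1  # round partial cigarettes up, per the rule
    return int(value)

assert record_cigarette_count("14") == 14
assert record_cigarette_count("2.5") == 3  # rounded up per the stated rule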

Categorical Outcomes
Categorical outcomes, or qualitative variables, have neither natural order nor inter-
pretation on a numeric scale and result from dividing study participants into cate-
gories. Many drugs are aimed at reducing the risk of negative health outcomes,
which are often binary in nature, and common trial aims are reducing the risk of
death, stroke, heart attack, or progression in multiple sclerosis. The use of these
binary outcomes is not simply convenience or custom, but rather they are much more
easily interpreted as clinically meaningful. To say you have reduced blood pressure
in a trial by 4.5 mmHg is by convention a positive result, but it is not in and of itself
immediately clinically meaningful, whereas, if the group that experienced
4.5 mmHg greater change had lower mortality rates, it would be easier to say this is
clinically meaningful.
Categories need not be binary in nature. For example, consider a study where it is
likely that, in the absence of treatment, patient health is expected to decline over
time. A successful drug might slow the decline, stop the decline but not improve
patient health, or improve patient health, and so researchers could categorize the
subjects as such.

Nominal Versus Ordinal Outcomes


Ordinal outcomes are perhaps best understood as a hybrid of continuous and
categorical outcomes. Ordinal outcomes have a natural order, but do not correspond
to a consistent, objective numerical scale; moving up/down from one rank to another
need not correspond to the same magnitude change and may vary by individual; e.g.,
categories of worse, the same, or better patient health can be interpreted as an ordinal
outcome since there is a natural order. However, it is not immediately clear that the
“same” vs. “better” health indicates the same benefit for all patients or is necessarily
equal in magnitude to the difference between “worse” and the “same” health. In
contrast, nominal outcomes have neither natural ordering nor objective interpretation
on a numeric scale. In the context of clinical trials, nominal variables are more likely
to be predictors than outcomes; for example, we may have three treatment groups
with no natural order, placebo, drug A, and drug B, or they can have an order such as
placebo, low dose, and high dose. The former requires certain forms of analyses,
while the latter allows us to take advantage of the natural ordering that occurs among
the dose groups.

Common Measures
Often outcomes are raw patient measures, e.g., patient blood pressure or incident
stroke. However, summary measures are often relevant. Incidence counts the number
of new cases of a disease per unit of time and can be divided into cumulative
incidence such as that occurring over the course of the entire study or the incidence
per unit of time, 30-day mortality following surgery, etc. Incidence can pertain to
both chronic issues, e.g., diabetes, and discrete health events, e.g., stroke, myocar-
dial infarction, or adverse reaction to a drug.
Another common summary measure is the proportion: how many patients out of
the total sample experienced a medical event or possess some quality of interest? The
incidence proportion or cumulative incidence is the proportion of previously healthy
patients who developed a health condition or experienced an adverse heath event.
Incidence is also commonly described with an incidence rate; that is, per some
number of patients, called the radix, often 1000, how many will develop the
condition, experience the event, etc.; e.g., supposing 4% of patients developed
diabetes during the study, then an incidence rate would say that we expect 40 out
of every 1000 (nondiabetic) subjects to develop diabetes. Note that incidence differs
from prevalence, which is the proportion of all study participants who have the
condition. Prevalence can be this simple proportion at some point in time, known as
point prevalence or period prevalence, the proportion over some defined period. The
period prevalence is often used in studies from administrative databases and counts
the number of cases divided by the average population over the period of interest,
whereas the point prevalence is the number of cases divided by the population at one
specific point in time.
A measure related to incidence is the time to event; that is, how long into the study
did it take for the incident event to occur? This is often useful for assessing a
therapy’s effect on survival or health state. For example, a cancer therapy may be
considered successful in some cases if patient survival is lengthened; similarly, some
therapies may be considered efficacious if they extend the time to a stroke or other
adverse health events.
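These summary measures are simple to compute once the denominators are fixed; the sketch below uses invented counts and assumes complete follow-up (real analyses must account for losses to follow-up and person-time):

n_at_risk = 500        # participants free of diabetes at baseline
new_cases = 20         # incident diabetes cases during the study
existing_cases = 35    # prevalent cases (preexisting plus new) at the final visit
n_total = 520          # all participants assessed at that visit

cumulative_incidence = new_cases / n_at_risk  # 0.04, i.e., 4%
rate_per_1000 = 1000 * cumulative_incidence   # 40 per 1000 subjects at risk
point_prevalence = existing_cases / n_total   # proportion with the condition
print(cumulative_incidence, rate_per_1000, point_prevalence)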

Safety

Measuring/Summarizing Safety
In addition to efficacy outcomes, safety outcomes are also important to consider. We
mentioned above that there is a necessary balance between the risks and potential
benefits of a therapy. Thus, we need information regarding potential risks, particu-
larly side effects and adverse events. It is possible that a therapy could be highly
effective for treating a disease and yet introduce additional negative health conse-
quences that make it a poor option. Safety endpoints can be direct or surrogate
endpoints. Some therapies may increase the risk of adverse health outcomes like
stroke or heart attack; these direct endpoints can be collected. We often classify
events into side effects, adverse effects, and serious adverse effects.
Side Effects: A side effect is an undesired effect that occurs when the medication
is administered regardless of the dose. Unlike adverse events, side effects are mostly
foreseen by the physician, and the patient is told to be aware of the effects that could
happen while on the therapy. Side effects also typically resolve on their own with time.
Adverse Events: An adverse event is any new, undesirable medical occurrence or
change (worsening) of an existing condition in a subject that occurs during the study,
whether or not considered to be related to the treatment.
Serious Adverse Events: A serious adverse event is defined by regulatory
agencies as one that suggests a significant hazard or side effect, regardless of the
investigator’s or sponsor’s opinion on the relationship to investigational product.
This includes, but may not be limited to, any event that (at any dose) is fatal, is life
threatening (places the subject at immediate risk of death), requires hospitalization
or prolongation of existing hospitalization, is a persistent or significant disability/
incapacity, or is a congenital anomaly/birth defect. Important medical events that
may not be immediately life threatening or result in death or hospitalization but may
jeopardize the subject or require intervention to prevent one of the outcomes listed
above, or result in urgent investigation, may be considered serious. Examples
include allergic bronchospasm, convulsions, and blood dyscrasias.
Collecting and monitoring these are the responsibility of the researchers as well as
oversight committees such as Data and Safety Monitoring Committees. Collection of
these can be complicated, such as when treatments are tested in intensive care units
where nearly all actions could be linked to one or the other type of event, to relatively
straightforward. Regulators have tried to standardize the recording of these events
into System Organ Classes using the Medical Dictionary for Regulatory Activities
(MedDRA) coding system. This standardized and validated system allows for
mapping of a virtually infinite vocabulary of events into medically meaningful
classes of events – infections, cardiovascular, etc. for comparison between groups
and among treatments. These aid in the assessment of benefits versus risks by
allowing comparisons of the rates of these medical events that occur within specific
organs or body functions.

Obstacles to Measuring Safety


It is often the case that direct safety-related endpoints have relatively low incidence
rates, especially within the time frame of many clinical trials since many medical
conditions manifest after extended exposure; i.e., often weeks, months, or years pass
before health conditions manifest. Thus, surrogate endpoints are often necessary, and
biomarkers are useful, especially in cases where information on drug toxicity is
needed. Using biomarkers to assess toxicity is integral to altering the patient’s dose
or ceasing treatment before more severe health problems develop (FDA-NIH Bio-
marker Working Group). Laboratory assessments measure ongoing critical func-
tions, and we often use flags or cut points to identify evolving risks, such as three times the upper limit of normal to flag liver function tests or white cell counts to indicate
infections. One of the major obstacles to assessing safety is that neither researchers
nor regulators can give a specific frequency above which a treatment is considered
unsafe. For some events, such as rare fatal events, the threshold may be just a few
instances, and for other situations where the participants are extremely sick, high
rates of adverse events may be tolerated, such as in cancer trials and treatments.
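A laboratory flag of the kind described above reduces to a simple threshold check. In this sketch the upper limit of normal (ULN) for alanine aminotransferase (ALT) is an assumed reference value chosen for illustration:

ALT_ULN = 40.0  # U/L; assumed reference limit for this illustration

def flag_alt(alt_value, multiplier=3.0):
    # Flag results exceeding multiplier x ULN for safety review.
    return alt_value > multiplier * ALT_ULN

print(flag_alt(95.0))   # False: below 3 x ULN = 120 U/L
print(flag_alt(150.0))  # True: exceeds 3 x ULN and would trigger review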

Choosing Outcome Measures

While some health-related outcomes have obvious metrics, others do not. For
instance, if a study is conducted on a drug designed to lower or control blood
pressure or cholesterol, then it is straightforward to see that the patient’s blood
pressure is almost certainly the best primary outcome. However, for many complex
medical conditions, arriving at a reasonable metric requires a considerably more
twisted, forking path. For example, in multiple sclerosis (MS), the aim is to reduce
MS-related disability in patients, but “disability” in such patients is a multi-
dimensional problem consisting of both cognitive and physical dimensions and
thus requiring a complex summary metric. Sometimes the choice involves selecting a metric, and other times it involves deciding how to use the metric appropriately. For example,
smoking cessation therapy studies should record whether patients quit smoking, but
at a higher level, we may debate just how long a subject must have quit smoking to
be considered a verified nonsmoker, or we may require biological evidence of
cessation such as cotinine levels, whereas in MS studies, the debate is more often
over which metric most adequately captures patient disability.

Nonstatistical and Practical Considerations

The most important consideration in choosing an outcome measure is to ensure that
the outcome measure possesses the ability to capture information that can answer
relevant scientific questions of interest, and there can sometimes be a debate about
which metrics are most appropriate. In MS studies there has been considerable
concern that many commonly used measures of MS-related disability cannot suffi-
ciently capture temporal change nor adequately incorporate or detect patient-per-
ceived quality of life (Cohen et al. 2012).
More practical concerns involve measure interpretability and funding agency
approval. Established metrics are more likely to be accepted by funding agencies
such as the FDA and NIH, and a considerable amount of work is often necessary to
make the case for a new metric. Some of the regulatory preference for established
measures is no doubt based in a disposition toward “historical legacy,” but we note
that there can be good reasons for preferring the status quo in this case (Cohen et al.
2012). Specifically, comparing studies becomes more difficult when different
outcome measures are used, complicating interpretation of a body of literature.
Therefore, new measures must often bring along detailed and convincing cases for
their superiority over established measures. Physicians and other medical profes-
sionals must be able to readily interpret trial results in terms of practical implications
on their patients, and if an outcome is difficult to practically interpret, it may be
resisted even if it possesses other desirable qualities. For example, the Multiple
Sclerosis Functional Composite (MSFC) was proposed to answer criticisms of the
established and regulator-preferred Expanded Disability Status Scale (EDSS) (Cutter
et al. 1999). However, despite its good qualities and improvement on the EDSS in
many aspects, the MSFC is resisted by regulators primarily because its mathematical
nature, a composite z-score of three functional tests, is a barrier to physician
interpretation (Cohen et al. 2001, 2012). Interpretability is closely tied to ensuring
that measures are clinically meaningful in addition to possessing desirable metric
qualities.

Assessing Outcome Measures

Validity
Validity is the ability of the outcome metric to measure that which it claims to
measure. In cases where the outcome is categorical, it is common to assess validity
with sensitivity and specificity. Sensitivity is the ability of a metric to accurately
determine patients who have the medical condition or experienced the event, and
specificity is the ability of a metric to accurately determine which patients do not
have the medical condition or did not experience the event. Both sensitivity and
specificity should be high for a good metric; for example, consider the extreme case
where a metric always interprets the patient as having a medical condition. In such a
case, we will identify 100% of the patients with the medical condition (great job!)
and 0% of the patients without the medical condition (poor form!). Note that while
these concepts are often understood in terms of medical conditions and events, they
need not be confined in such a way.
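In the binary case, both quantities follow directly from the 2 × 2 table of metric results against true status; the counts below are hypothetical:

true_pos, false_neg = 45, 5    # diseased subjects: flagged vs. missed
true_neg, false_pos = 90, 10   # healthy subjects: cleared vs. misflagged

sensitivity = true_pos / (true_pos + false_neg)  # 0.90
specificity = true_neg / (true_neg + false_pos)  # 0.90
print(sensitivity, specificity)
# The degenerate metric described above, which calls everyone diseased,
# attains sensitivity 1.0 but specificity 0.0.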
For continuous, and often ordinal, measures, assessing validity is somewhat more
complicated. One could impose cutoffs on the continuous measure to categorize the
variable, only then using sensitivity or specificity to assess validity. However, this is
a rather clumsy approach in many cases; we want continuous outcome measures to
capture the continuous value with as little measurement error as possible. This is
often more relevant for medical devices. For example, a wrist-worn measure of
blood glucose would need to be within ± a certain amount of the actual glucose
level in the blood to demonstrate validity. Often individuals use regression analyses
to demonstrate that a purported measure agrees with a gold standard, but it should be
kept in mind that a high correlation by itself does not demonstrate validity. A
regression measure should have a slope of 1 and an intercept of 0 to indicate validity.
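The sketch below illustrates that check on simulated data; a real device validation would also report confidence intervals for the slope and intercept and limits of agreement:

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
gold = rng.uniform(70, 180, size=100)               # reference glucose, mg/dL
device = 2.0 + 0.98 * gold + rng.normal(0, 4, 100)  # simulated device readings

fit = linregress(gold, device)
print(fit.intercept, fit.slope, fit.rvalue)
# A high r alone does not establish validity: readings of 30 + 0.5 * gold
# would correlate perfectly while disagreeing badly with the reference.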
Sensitivity is also used to describe whether an outcome measure can detect
change at a reasonable resolution. Consider a metric for disability, on a scale from
1 to 3, where higher scores indicate increased disability. This will be a good metric if, generally, an increase in a patient's disability results in a corresponding increase on the scale. But if the measure is too coarse, then it could be the case, for example, that many patients are having disability increases that are not sufficient to move them from a 1 to a 2 on the scale. A measure this insensitive to worsening at the participant level would have high specificity (because substantial disability would have occurred before the scale recognized it) but poor sensitivity, because a negative result does not indicate the participant hasn't progressed. When the metrics pertain to patient well-
being or health dimensions of which a patient is conscious, it is expected that when
the patient notices a change, the metric will reflect those changes. This is particularly
important for determining the effectiveness of therapies in many cases. A measure
that is insensitive to change could either mask a therapy’s ineffectiveness by
incorrectly suggesting that patient conditions are not generally worsening or on the
flip side portray an effective therapy as ineffective since it will not detect positive
change. Further, if a participant feels they are worsening but the measure is insensitive, this can lead to dropping out of the trial.

Reliability
Reliability is a general assessment of the consistency of a measure’s results upon
repeated administrations. There are several relevant dimensions to reliability. A
measure can be accurate, but not reliable. This occurs because on average the
measure is accurate but highly variable. Various types or aspects of reliability are
often discussed. Perhaps most prominent is interrater reliability. Many trials require
raters to assess patients and assign scores or measures describing the patient’s
condition, and interrater reliability describes the consistency across raters when
presented with the same patient or circumstance. A reliable measure will result in
(properly trained) raters assigning the same or similar scores to the same patient.
When a metric is proposed, interrater reliability is a key consideration and is
typically measured using a variant of the intraclass correlation coefficient (ICC),
which should be high if the reliability is good (Bartko 1966; Shrout and Fleiss 1979).
Intersubject reliability is also a concern; that is, subjects with similar or the same
health conditions should have similar measures. This differs from interrater reliabil-
ity in that it is possible for raters to be highly consistent within a subject, but
inconsistent across subjects, or vice versa. Interrater reliability measures whether
properly trained raters assign sufficiently similar scores to the same patient; that is, is
the metric such that sufficiently knowledgeable individuals would agree about how
to score a specific subject? Intersubject reliability measures whether a metric assigns
sufficiently similar scores to sufficiently similar subjects.
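As a minimal sketch, a one-way random-effects ICC can be computed from a subjects-by-raters matrix of scores. The data are invented, and other ICC variants (e.g., two-way models) are often more appropriate in practice:

import numpy as np

scores = np.array([[4, 4, 5],   # rows: subjects; columns: raters
                   [2, 3, 2],
                   [5, 5, 5],
                   [1, 2, 1],
                   [3, 3, 4]], dtype=float)

n, k = scores.shape
grand_mean = scores.mean()
subject_means = scores.mean(axis=1)
ms_between = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
ms_within = ((scores - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc1, 3))  # near 1 when raters agree closely (~0.89 here)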

Other Concerns
There are several other issues in evaluating outcome measures. Practice effects occur
when patients’ scores on some measure improve over time not due to practice rather
than therapy. Studies involving novel outcome measures should verify that either no
practice effects are present or that the practice effects taper off; for example, in
developing the Multiple Sclerosis Functional Composite (MSFC), practice effects
were observed, but these tapered off by the fourth administration (Cohen et al. 2001).
Practice effects are problematic in that they can lead to overestimates of a treatment
effect because the practice effect improvement is ignored when comparing a post-
intervention measure to a baseline measure. In randomized clinical trials, we can
assume both groups experience equivalent practice effects and the difference
between the two groups is still an unbiased estimate of the treatment effect, but
how much actual improvement was achieved is biased unless the practice effects can
be eliminated prior to baseline by multiple administrations or adjusted for in the
analyses. Practice effects are often complex and require adjustments to the measure
or its application; for example, the Paced Auditory Serial Addition Test (PASAT), a
measure of information processing speed (IPS) in MS, was shown to have practice
effects that increased with the speed of stimulus presentation and was more prom-
inent in relapse-remitting MS compared to chronic-progressive MS (Barker-Collo
2005). Therefore, using PASAT in MS research requires either slower stimulus
presentation or some correction accounting for the effects.
Another source of (unwanted) variability in outcome measures, particularly sub-
jective measures, is response shift. Response shift occurs when a patient’s criteria for a
subjective measure change over the course of the study. It is clearly a problem if the
meaning of the same recorded outcome is different at different times in a study, and
therefore response shift should be considered and addressed when subjective measures
and/or patient-reported outcomes are employed (Swartz et al. 2011). This is often the
case with long-term chronic conditions such as multiple sclerosis, where participants report on their quality of life in the early stages of the disease and, when reassessed years later with increased disability, record the same quality of life scores.
Adaptation and other factors are at the root of these response shifts, but outcome
measures that are subject to this type of variability can be problematic to use.

Statistical Considerations

In addition to whether a measure is informative to the clinical questions of interest,
there are statistical concerns relating to the ability to make comparisons between
treatment groups and answer scientific or clinical questions of interest with the data
on hand. This section defines the relevant statistical measures and then describes
their practical import in outcome choice.

Statistical Definitions
In hypothesis testing, variable selection, etc. there are two kinds of errors to
minimize. The first is the false positive, formally Type I error, which is the proba-
bility that we detect a therapy effect, given that one does not exist. False positives are
typically controlled by assignment of the statistical significance threshold, α, which
is generally interpreted as the largest false-positive rate that is acceptable; by
convention, α = 0.05 is usually adopted.
The second error class is the false negative, or Type II error, which occurs when
we observe no significant therapy effect, when in reality one exists. Power is given
as one minus the false-negative rate and refers to the probability of detecting a difference,
given that one exists. For a given statistical method, the false-positive rate should be
as low as possible and the power as high as possible. However, there is a tension in
controlling these errors because controlling one in a stronger manner corresponds to
a reduction in ability to control the other.
There are several primary reasons that these errors arise in practice (and theory).
Sampling variability allows for occasionally drawing samples that are not represen-
tative of the population; this problem may be exacerbated if the study cohort is
systematically biased in recruitment or changes over time during the recruitment
period so that it is doubtful that the sample can be considered as drawn from the
population of interest. The second primary reason for errors is sample size. In the
absence of compromising systematic recruitment bias, a larger sample size can often
increase the chance that we detect a treatment difference if one exists. The funda-
mental reason for improvement is that sample estimates will better approximate
population parameters with less variation about the estimates, on average. Small
sample sizes can counterintuitively make it difficult to detect significant effects,
overstate the strength of real effects, and more easily find spurious effects. This is
because in small samples a relatively small number of outliers can have a large
biasing effect, and in general sampling variability is larger for smaller compared to
larger samples.
The choice of statistical significance thresholds can also contribute to errors in
inference. If the threshold is not sufficiently severe, then we increase the risk of
detecting a spurious effect. Correspondingly, a significance threshold that is too
severe may prevent detection of treatment differences in all but the most extreme
cases. Errors related to the severity of threshold can affect both large and small
samples.
Note that caution must be employed with respect to interpreting differences.
Statistically significant differences are not necessarily clinically significant or
meaningful. It is rare that two populations will be exactly equal in response to a
treatment, but small differences between groups, while statistically significant at a
large sample size, may not represent a sufficient benefit to the patients.
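These quantities combine in the standard a priori sample-size calculation. The sketch below uses the normal approximation for a two-arm comparison of means; the inputs are illustrative:

from scipy.stats import norm

alpha, power = 0.05, 0.90
delta = 5.0    # clinically meaningful difference, e.g., in mmHg
sigma = 12.0   # SD estimate gleaned from similar studies

z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
n_per_arm = 2 * (z * sigma / delta) ** 2
print(round(n_per_arm))  # ~121 per arm, before any inflation for dropout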

Common Statistical Issues in Practice


Statistical significance thresholds, power, and clinically meaningful treatment dif-
ferences are determined a priori. Using these values and some estimate of variance
gleaned from similar studies, we can calculate the necessary sample size. In cases
with continuous outcomes, the calculations are often relatively straightforward and
tend to have few, if any, additional restrictions beyond (usually) normality on the
outcome’s values. However, the situation is more complicated when the outcomes
are no longer continuous (or normal). A common problem is near or complete
separability when the outcome is binary, that is, when almost all the patients have
one or the other outcome. Model fitting problems will arise when separability applies
to the whole study sample but also when (nearly) all patients in one group have one
outcome and (nearly) all patients in the other group have the other outcome. This is
especially true in retrospective and observational studies which seek to make
comparisons among subgroups within the population. For example, in a study
presented at the European Association for Cardiothoracic Surgery in 2016, a retro-
spective study of a vein graft preservation by a buffered solution compared to saline
for use during harvesting and bypassing heart vessels attempted to adjust for the two
time periods of comparison, the saline period prior to the introduction of the new
product and after introduction. However, when the propensity scores were plotted by
type of storage solution used, there was almost complete separation of the two
populations before and after (Haime 2016). Here nearly all saline patients had a more favorable risk profile, as the willingness to perform bypasses had been directed toward younger, healthier patients; after the buffered solution was adopted, the willingness to accept higher-risk patients for bypass grew as well.
Another common way such a situation arises is when the study length is too short
for a sufficient number of events to arise. This is an issue whether the study is
collecting time-to-event outcomes or simply recording incidence at the study termi-
nus. In such cases it is highly relevant to know the required number of events
necessary to detect a clinically significant difference. This problem is generalized
to cases where there are more than two categories for the outcome, e.g., ordinal or
multinomial data. In such cases where the study length cannot be extended to a
sufficient length, nonbinary outcomes and/or surrogate outcomes may be necessary
alternatives.
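The required number of events can be sketched with Schoenfeld's approximation for a 1:1 randomized time-to-event comparison (the hazard ratio below is illustrative); it makes plain that events, not enrolled patients, drive the power:

import math
from scipy.stats import norm

alpha, power = 0.05, 0.80
hazard_ratio = 0.75  # clinically significant effect to detect

z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
events = 4 * z ** 2 / math.log(hazard_ratio) ** 2
print(math.ceil(events))  # ~380 events needed across both arms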

Common Simplifications and Their Up- and Downsides


As noted above, due to the ease of interpretation and the clinical meaningfulness of
binary events, a common approach to analysis is to categorize continuous measures
so long as the cutoffs are clinically relevant and decided before conducting a study.
As discussed in the previous section, this simplification can cause analysis and
interpretation issues if there are not sufficient numbers of patients in each category.
There is also a tendency to reanalyze the data to better understand what has happened
and that can lead to arbitrary cutoffs; without the a priori specification of the outcome, finding a cut point that "works" certainly increases the chances of a false-positive result.
On the other hand, it is sometimes useful to treat a discrete variable as continuous;
the most common instance of this is count data, where counts are generally very large.
In some cases, ordinal data may be treated as continuous with reasonable results: even though this implicitly assumes that each step on the scale has the same meaning, it provides a simple summary of the response. Nevertheless, treating
these ordinal data as ranks can be shown to have reasonable properties in detecting
treatment effects. However, one should use caution when analyzing ordinal data by
applying statistical methods designed for continuous outcomes; in particular, models
for continuous outcomes perform badly when the number of ranks is small, and/or the
distribution of the ordinal variable is skewed or otherwise not approximately normally
distributed (Bauer and Sterba 2011; Hedeker 2015).
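A quick simulated contrast of the two approaches on skewed four-level ordinal scores (the response distributions are invented):

import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
placebo = rng.choice([1, 2, 3, 4], size=60, p=[0.55, 0.25, 0.15, 0.05])
active = rng.choice([1, 2, 3, 4], size=60, p=[0.35, 0.30, 0.25, 0.10])

print(ttest_ind(active, placebo).pvalue)     # treats steps as equal units
print(mannwhitneyu(active, placebo).pvalue)  # uses only the ranks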

Reporting Outcomes

There are several main approaches for assessing outcomes: patient-reported out-
comes, clinician-reported outcomes, and observer-reported outcomes. We define and
discuss each below.

Patient-Reported Outcomes
Patient-reported outcomes (PROs) are outcomes dependent upon a patient’s subjec-
tive experience or knowledge; PROs do not exclude assessments of health that could
be observable to others and may include the patient’s perception of observable health
outcomes. Common examples include quality of life or pain ratings (FDA-NIH
Biomarker Working Group). These outcomes have gained a lot of acceptance
since the Patient-Centered Outcomes Research Institute (PCORI) came into exis-
tence. The FDA and other regulators routinely ask for such outcomes as they are
indicative of the meaningfulness of treatments. Rarely have patient-reported out-
comes been used as primary outcomes in phase III trials, except in those instances,
such as pain, where the primary outcomes are only available in this manner. Most
often they are used to provide adjunctive information on the patient perspective of
the treatments or study. Nevertheless, researchers should be cautioned not simply to
accept the need for PROs, but rather think carefully about what and when to measure
PROs. PROs are subjective assessments and can be influenced by a wide variety of
variables that may be unrelated to the actual treatments or interventions under study.
Asking a cancer patient about their quality of life during chemotherapy may not reflect the benefits of improved survival because of the timing of the ascertainment.
Similarly, a participant in a trial who is severely depressed may be underwhelmed
with the benefits of a treatment that doesn’t address this depression. In addition, the
frame of reference needs to be carefully considered. For example, when assessing
quality of life, should one use a tool that is a general measure, such as the Short Form-36 Health Survey (SF-36), or one that is specific to the disease under study?
This depends on the question being asked and should be factored into the design for
any and all data to be collected.
PROs, like many outcomes, are subject to biases. If participants know that they
are on an active treatment arm versus a placebo, then their reporting of the specific
outcome being assessed may be biased. Similarly, participants who know or suspect
that they are on a placebo may report they are doing poorly simply because of this
knowledge rather than providing accurate assessments as per the goal of the instru-
ment. A general rule is that assessments should be blinded whenever possible. A more detailed discussion of the intricacies involved in PROs is found in Swartz et al. (2011), and the FDA provides extensive recommendations and discussion (FDA 2009).

Clinician-Reported Outcomes
Clinician-reported outcomes (CRO) are assessments of patient health by medical or
otherwise healthcare-oriented professionals and are characterized by dependence on
professional judgment, algorithmic assessment, and/or interpretation. These are typi-
cally outcomes requiring medical expertise, but do not encompass outcomes or
symptoms that depend upon patient judgment or personal knowledge (FDA-NIH
Biomarker Working Group). Common examples are rating scales or clinical events,
e.g., Expanded Disability Status Scale, stroke, or biomarker data, e.g., blood pressure.

Observer-Reported Outcomes
Observer-reported outcomes (OROs) are assessments that require neither medical
expertise nor patient perception of health. Often OROs are collected from parents,
caregivers, or more generally individuals with knowledge of the patient’s daily life
and often, but not always, are useful for assessing patients who cannot, for reasons of
age or impairment, reliably assess their own health (FDA-NIH Biomarker Working
Group). For example, in epilepsy studies caregivers often keep seizure diaries to
establish the nature and number of seizures a patient experiences.

Multiple Outcomes

Most studies have multiple outcomes (e.g., primary, secondary, and safety out-
comes), but it is sometimes desirable or necessary to include multiple primary
outcomes. These generally consist of repeatedly measuring the same (or similar)
outcomes over time and/or including multiple measures, which can encompass
multiple primary outcomes or multiple secondary outcomes. This section describes
common situations where multiple outcomes are employed and discusses relevant
considerations arising thereof.

Multiple (Possibly Related) Measures

When the efficacy of a clinical therapy is dependent upon more than one dimension,
it may be inappropriate to prioritize one dimension or ignore lower-priority
dimensions. For example, in a trial of thymectomy, a surgical procedure by which
one’s thymus is removed, to control myasthenia gravis, a neuromuscular disease, a
joint outcome was needed (Wolfe et al. 2016). The treatments were thymectomy plus
prednisone versus prednisone alone. The primary outcome was the clinical condition
of the participant over 3 years and the amount of prednisone utilized to control the
disease. The need for both outcomes was due to the fact that the clinical condition
could be made better by using more prednisone, so analyzing the clinical condition
as the outcome would not correctly answer the question of how well a participant
was doing; nor would the amount of prednisone alone, since using less prednisone could come at the expense of the clinical condition.
Primary outcomes are generally analyzed first. For a therapy to achieve an
efficacy “win,” it usually must meet some criteria pertaining to success in the
primary endpoints (Huque et al. 2013). This may consist of all, some proportion,
or at least one of the primary endpoints achieving significance by specified criteria
and is typically conditional on demonstration of acceptable safety outcomes.
Primary outcomes that must all be significant in order to demonstrate efficacy are
called coprimary (FDA 2017). Significance in secondary outcomes tends to be
supportive in nature and is generally not considered an efficacy “win” in the absence
of significance for the therapy on primary outcome terms. Additionally, tertiary and
exploratory outcomes are often reported, but conditional on primary endpoint
efficacy. Nevertheless, regulators often refer to the “totality of the evidence”
when evaluating any application for licensing, and there have been treatments
approved when showing statistically significant effectiveness on secondary out-
comes, but not primary outcomes. This is more often done when there are no or
few treatments available for a condition.

Composite Outcomes
Composite outcomes are functions of several outcomes. These can be relatively
simple, e.g., the union of several outcomes. Such an approach is common for time-
to-event data, e.g., major adverse cardiovascular events (MACE), where an event is
defined as the first occurrence of death, stroke, or heart attack. Composite outcomes
can also be more complex in nature. For example, many
MS trials are focused on MS-related disability metrics, which tend to be composites
of multiple outcomes of interest and which may be related to either physical or
cognitive disability; two common options are the Expanded Disability Status Scale
(EDSS) and the Multiple Sclerosis Functional Composite (MSFC).
Composite events are often used in time-to-event studies to increase the number
of outcomes and thus, for the same relative risk reduction, increase the power, since
the power of a time-to-event trial is directly related to the number of events. When
such events are not reasonably correlated, however, care must be taken that the
signal is not dominated by noise. For example, as noted previously, MS impacts patients differently
and variably. The EDSS assesses seven functional systems and combines them into a
single ordinal number ranging from 0 to 10 in 0.5 increments. For a person who is
impacted in only one functional system, the overall EDSS may not be moved even
by changes in this one functional system, thereby reducing its sensitivity to change.
In composites, such as z-scores of multiple tests, it is recommended that no more
than four or five components be used, because most composites are averages or sums
of the individual items, and so any signal can be diluted by the noise of the unaffected
components. Consider five ordinal scales, four of which vary only due to measurement
error and only one of which lies in the affected domain. If all five scales are 2 at
baseline and, at follow-up, all but one remain 2 while the one that changes goes from
2 to 4, the overall average score moves only from 2.0 to 2.4 because of the lack of
change in four of the five measures. Many measurements are more variable than in
this example, and thus understanding or even identifying changes becomes more
difficult because of the signal-to-noise problem.
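As a minimal numeric sketch of this dilution effect (in Python, mirroring the
hypothetical scales above):

# Averaging five ordinal scales dilutes a change that occurs in only one
# component; the values match the hypothetical example in the text.
baseline = [2, 2, 2, 2, 2]
followup = [2, 2, 2, 2, 4]  # only one scale responds to treatment

mean_baseline = sum(baseline) / len(baseline)  # 2.0
mean_followup = sum(followup) / len(followup)  # 2.4
print(mean_baseline, mean_followup)  # a 2-point change surfaces as only 0.4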

Multiple Comparisons/Testing Considerations


Multiple testing issues arise when outcomes are to be evaluated individually,
either as components of a composite outcome or without a global test. When
efficacy depends on more than one outcome, controlling the false-positive rate
becomes more complicated, and there are two general metrics for false-positive
control, each of which has multiple approaches to control. The family-wise error rate
(FWER) is the probability of one or more false positives in a family of tests; control
of the FWER is divided into weak control, which controls the FWER under the
complete null hypothesis, i.e., when no outcomes have a significant treatment effect,
and strong control, which controls the FWER when any subset of the outcomes has
no significant treatment effect. Note that in confirmatory trials, strong control of the
FWER is often required (Huque et al. 2013; FDA 2017). Alternatively, the false
discovery rate (FDR) is the expected proportion of false rejections in a family. A
good heuristic for determining which false-positive rate is appropriate is whether a
single false positive would significantly affect the interpretation of the study. FWER
is usually appropriate if a false positive would invalidate the study, while in contrast,
FDR is often appropriate in the presence of a large number of tests, where a few false
positives would not alter the study interpretation.
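As an illustrative sketch only (not a prescription for any particular trial), the two
error rates can be contrasted on the same set of p-values using statsmodels; the
p-values below are invented:

# Contrast FWER control (Bonferroni) with FDR control (Benjamini-Hochberg)
# on the same hypothetical p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.012, 0.024, 0.041, 0.300]

reject_fwer, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_fdr, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject_fwer)  # Bonferroni: typically fewer rejections
print(reject_fdr)   # BH: more rejections, tolerating some false discoveries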
The utility of controlling complex false-positive rates was once viewed with
suspicion, not least because many traditional methods resulted in severe decreases
in statistical power (see, e.g., Pocock 1997). However, advances in methodology
for control and concern about study reliability have renewed focus on controlling
false-positive rates; e.g., Huque et al. (2013) provide an overview of approaches to
FWER control, and Benjamini and Cohen (2017) propose a weighted FDR control-
ling procedure in similar contexts. In addition to improvements in the quality of the
procedures for false-positive control, regulatory agencies are also more aware and
concerned with false-positive control (FDA 2017). The pharmaceutical industry is
closely regulated in this regard, but the academic community has been less well-
policed with respect to these multiple comparison issues. However, the requirement
to list trials with ClinicalTrials.gov has made it more concrete that these issues must
be decided in advance. Often, such testing approaches are not determined explicitly
at the initiation of the protocol, but codified at the time the statistical analysis plan is
created, prior to locking the database and, in double-blind studies, prior to unblinding.
These details are often in the statistical analysis plan and, thus, not available on
ClinicalTrials.gov. While the specific corrections are beyond the scope of this
chapter, it is useful to contrast traditional approaches to false-positive control with
modern extensions.
In many cases traditional methods for controlling FWER or FDR have been
adapted to handle multiple testing in a more nuanced manner. Traditionally, one
defined a family of tests and then applied a particular method for controlling false
positives, but a study may reasonably consist of more than one family of tests; e.g.,
one may divide primary and secondary outcome analyses into two separate families.
Furthermore, families need not be treated as having equal importance, which is the
basis for hierarchically ordered families, the so-called step-down approach. Under
this approach, the FWER or FDR is controlled by requiring that a “win” criterion be
achieved in the family of primary endpoints before testing secondary (and
possibly tertiary) outcomes. The two most common frameworks are α-propagation
and gatekeeping. α-Propagation divides the significance level across a series of
ordered tests, and when a test is significant, its portion of the α is “propagated” or
passed to the next test. Gatekeeping approaches depend on a hierarchy of families. In
regular gatekeeping, a second family of tests is only tested if the first family passes
some “win” criteria, but some gatekeeping procedures allow for retesting, and many
methods incorporate both gatekeeping and α-propagation. A detailed discussion is
found in Huque et al. (2013).
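As a minimal sketch of the simplest member of this family, a fixed-sequence
(hierarchical) procedure tests pre-ordered hypotheses at the full α and stops at the
first failure; the ordering and p-values below are hypothetical:

# A fixed-sequence (hierarchical) testing procedure: hypotheses are
# pre-ordered, each is tested at the full alpha, and testing stops at the
# first non-significant result.
def fixed_sequence_test(ordered_pvals, alpha=0.05):
    """Return rejection decisions for pre-ordered hypotheses."""
    decisions = []
    gate_open = True
    for p in ordered_pvals:
        reject = gate_open and (p <= alpha)
        decisions.append(reject)
        if not reject:
            gate_open = False  # once a test fails, all later tests fail
    return decisions

# Primary endpoint first, then ordered secondary endpoints.
print(fixed_sequence_test([0.004, 0.030, 0.080, 0.010]))
# -> [True, True, False, False]; the fourth fails despite p = 0.010,
#    because the gate closed at the third hypothesis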
Defining power and calculating the required sample size in complex multiplicity
situations may not be straightforward (Chen et al. 2011). However, using traditional
methods in the absence of updated methodology is likely to yield conservative
results: many methods control the FWER at or below the specified significance
level, so larger sample sizes are required to achieve the desired power. Note that there are no
multiplicity issues when considering coprimary endpoints, since each must success-
fully reject the null hypothesis, but power calculations are nonetheless complicated
and require considering the dependency between test statistics; unnecessarily large
sample sizes will be required if an existing dependency is ignored (Sozu et al. 2011);
for a detailed discussion on power and sample size with coprimary endpoints, see
Sozu et al. (2012).
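The flavor of such a calculation can be sketched for two coprimary endpoints whose
test statistics are correlated, following the general idea in Sozu et al.; the effect
sizes, correlation, and sample size below are hypothetical:

# Power for two coprimary endpoints: both one-sided z-tests must reject.
import numpy as np
from scipy.stats import norm, multivariate_normal

alpha = 0.025                   # one-sided significance level
n = 200                         # per-arm sample size (hypothetical)
delta = np.array([0.25, 0.30])  # standardized effect sizes (hypothetical)
rho = 0.5                       # correlation between the two endpoints

zcrit = norm.ppf(1 - alpha)
means = delta * np.sqrt(n / 2)  # expected z-statistics in a two-arm trial
cov = [[1.0, rho], [rho, 1.0]]

# P(Z1 > zcrit and Z2 > zcrit) by inclusion-exclusion on the joint normal.
p_both_below = multivariate_normal(mean=means, cov=cov).cdf([zcrit, zcrit])
p1_below = norm.cdf(zcrit - means[0])
p2_below = norm.cdf(zcrit - means[1])
power = 1 - p1_below - p2_below + p_both_below
print(power)  # joint power is below either marginal power unless rho = 1

Ignoring the dependency (treating the endpoints as independent) understates power
here and would lead to an unnecessarily large sample size, as noted above.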

Longitudinal Studies

What Are Longitudinal Studies?


Multiple outcomes also arise when following patients over time, recording measure-
ments for outcomes at several time points throughout the trial. The circumstances for
such a situation can be quite varied. In Alzheimer’s disease, for example, interest
lies in the trajectory or slope of the decline in cognitive function over time; in
chronic obstructive pulmonary disease, in the rate of lung function decline. In many cases
where adverse medical events (e.g., death, stroke, etc.) are recorded, it is of interest
to know when the event occurred, and follow-up may be conducted at pre-specified
times to determine whether or not an event occurred for each (qualifying) patient; in
such cases therapies may be distinguished by whether they prolong the time to event
instead of, or in addition to, whether they prevent the event entirely. Some events,
such as exacerbations (relapses) in MS, can occur repeatedly over time, and thus
assessing these for the intensity of their occurrence (often summarized as the
annualized relapse rate) is common. Other less extreme examples involve recording
patient attributes, e.g., quality of life assessments or biomarkers like blood pressure
or cholesterol; these and similar situations assess changes over time and address the
existence and/or form of differences between treatment groups.

Benefits
A major benefit of recording patients at multiple time points is that it allows for a
better understanding of the trajectory of change in a patient’s outcome; for example,
nonlinear trends may be discovered and modeled, thereby allowing researchers to
seek understanding of the clinical reasons for each part of the curve. For example,
longitudinal studies of blood pressure have established that human blood pressure
varies according to a predictable nonlinear form (Edwards and Simpson 2014). Such
a model may be used to better define and evaluate healthy ranges for patient blood
pressure. Second, patient-specific health changes and inference are available when a
patient is followed over time. A simple example is given by comparing a cross-
sectional study, a rarity in clinical trials, to a pre-post longitudinal study. Whereas in
a cross-sectional study, we only have access to observed group differences in raw
scores, longitudinal studies provide insight into whether and how a particular
patient’s metrics have changed over time; generalizing beyond pre-post to many
observations allows better modeling for individuals as well as groups. Additionally,
unless the repeated measures are perfectly correlated, the power to detect group
differences is generally increased when using longitudinal data and may require a
smaller sample size to detect a specified treatment difference.
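As a rough sketch of why this correlation matters, consider a pre-post design
analyzed by change scores, for which Var(change) = 2σ²(1 − ρ); the required
per-arm sample size shrinks as the within-patient correlation ρ grows. All inputs
in the following fragment are hypothetical:

# Per-arm sample size for comparing mean change between two groups.
# Higher within-patient correlation rho shrinks the required n.
from scipy.stats import norm

def n_per_arm(delta, sigma, rho, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var_change = 2 * sigma**2 * (1 - rho)  # variance of a change score
    return 2 * var_change * (z / delta) ** 2

for rho in (0.0, 0.5, 0.8):
    print(rho, round(n_per_arm(delta=5, sigma=10, rho=rho)))
# roughly 126, 63, and 25 per arm for these inputs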

Drawbacks and Complications


The increased use and acceptance of the (generalized) linear mixed model (GLMM)
have allowed increased flexibility in modeling dependence and handling missing
outcomes compared to traditional methods such as repeated measures ANOVA or
MANOVA, which require strong assumptions and have trouble including patients
with missing observations (Hedeker and Gibbons 2006). While GLMMs rest on some
assumptions that are rarely testable, they are far more flexible. However, although
modeling has greatly improved in recent decades, longitudinal studies still have
drawbacks and complications. An obvious drawback is the corresponding increase
in study cost; trials must balance the benefit of additional information with the
ability to pay for that information. Power is generally increased in these repeated
measures designs, but few studies seem designed based on the balance of the gain
per dollar for each measurement.
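A minimal sketch of such a model fit with statsmodels is shown below; the data
file, column names, and random-intercept structure are assumptions for illustration:

# Fit a linear mixed model to longitudinal trial data with statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

# df has one row per visit: columns id, treat (0/1), time, y (hypothetical).
df = pd.read_csv("longitudinal_trial.csv")

# Random intercept per patient; the treat:time interaction estimates the
# difference in slopes (rates of change) between treatment groups.
model = smf.mixedlm("y ~ treat * time", data=df, groups=df["id"])
result = model.fit()
print(result.summary())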
A perhaps more pressing concern is that, while missing data are no longer an
impediment to modeling and inference, the reason that missingness occurs is relevant
to the interpretation of trial results (Carpenter and Kenward 2007). Data missing
completely at random (MCAR) are associated with neither observed nor unobserved
outcomes of interest. Data missing at random (MAR) may be associated with
observed, but not unobserved, outcomes. Data missing not at random (MNAR) are
associated with unobserved outcomes. The assumption of MCAR is often not
well-substantiated. In particular, complete case analyses, in which participants
with missing observations are excluded, can bias results (Mallinckrodt et al. 2003;
Powney et al. 2014). Likewise, many common imputation methods, such as last
observation carried forward, are often not valid. GLMMs are valid when the MAR
assumption is reasonable and include all subjects without requiring imputation, but
verifying that missingness is not MNAR is difficult. In the interest of transparency,
studies should report reasons for dropout and prepare detailed and well-justified
approaches for handling missing data (Powney et al. 2014). Performing sensitivity
analyses is essential, not just to corroborate what the primary analysis showed, but
to provide evidence that the results are not the product of hidden biases.
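A small simulation sketch of the complete-case problem under MAR: follow-up is
missing more often for patients with worse (lower) baseline scores, so the
complete-case follow-up mean overstates the population mean (all numbers are
hypothetical):

# Complete-case bias under MAR: dropout depends on the observed baseline.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
baseline = rng.normal(50, 10, n)
followup = baseline + rng.normal(0, 5, n)  # true follow-up mean is 50

# Patients with lower (worse) baseline scores are less likely to be observed.
p_obs = 1 / (1 + np.exp(-(baseline - 50) / 5))
observed = rng.random(n) < p_obs

print(followup.mean())            # ~50, the population mean
print(followup[observed].mean())  # complete-case mean, biased upward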

Summary/Conclusion

It is imperative that researchers think deeply about which outcomes to employ in
studies and are aware of the various issues and complications that can arise from
those choices. This chapter has introduced outcomes in clinical trials, delineated
their different purposes and manifestations, and discussed regulatory and statistical
issues in selecting appropriate study outcomes so that researchers have a clear idea of
what to consider as they make choices on which outcomes to include in their studies.

Key Facts

• Primary outcomes should be measures capable of answering the main research/
clinical question and are often expected to measure rates, scale outcomes, detect-
able patient benefits, risk of disease development, and/or complications, while
secondary outcomes are often related to the primary outcomes but typically of
lesser importance and/or may be infeasible due to sample size limitations; both
are typically related to assessing the efficacy of a therapy.
• Direct endpoints directly describe patient well-being, and may be objective, i.e.,
explicitly measure clinical outcomes, or subjective, which often depend on
subject self-report.
• Surrogate or biomarker endpoints substitute for direct endpoints and are often
employed when direct endpoints are infeasible and/or unethical to measure. Many
are biomarkers, which objectively measure biological processes, pathologies,
and/or responses to exposures or interventions, whereas surrogate endpoints are
tantamount to the outcome of interest itself.
• Safety outcomes measure negative health consequences such as side effects or
adverse events and help assess the trade-offs between the benefits and risks of
therapies.
• Choosing outcome measures requires both practical and statistical considerations.
Practical considerations include the ability to capture the physical phenomenon of
interest and interpretability, while statistical considerations include validity,
reliability, and the ability to answer the clinical question.
• Outcome measurements may depend on a patient’s subjective experience
(patient-reported), require some degree of medical/professional expertise (clini-
cian-reported), or be measurable by some other third party who is neither a
medical professional nor the patient (observer-reported).

References
Barker-Collo SL (2005) Within session practice effects on the PASAT in clients with multiple
sclerosis. Arch Clin Neuropsychol 20:145–152. https://doi.org/10.1016/j.acn.2004.03.007
Bartko JJ (1966) The intraclass correlation coefficient as a measure of reliability. Psychol Rep
19:3–11
Bauer DJ, Sterba SK (2011) Fitting multilevel models with ordinal outcomes: performance of
alternative specifications and methods of estimation. Psychol Methods 16(4):373–390. https://
doi.org/10.1037/a0025813
Benjamini Y, Cohen R (2017) Weighted false discovery rate controlling procedures for clinical
trials. Biostatistics 18(1):91–104. https://doi.org/10.1093/biostatistics/kxw030
FDA-NIH Biomarker Working Group (2016) BEST (Biomarkers, EndpointS, and other Tools)
Resource [Internet]. Food and Drug Administration (US), Silver Spring, MD; co-published by
National Institutes of Health (US), Bethesda, MD. Available from: https://www.ncbi.nlm.nih.
gov/books/NBK326791/
Carpenter JR, Kenward MG (2007) Missing data in randomised controlled trials – a practical guide.
National Institute for Health Research, Birmingham
Chen J, Luo J, Liu K, Mehrotra DV (2011) On power and sample size computation for multiple
testing procedures. Comput Stat Data Anal 55:110–122
Cohen JA, Cutter GR, Fischer JS, Goodman AD, Heidenreich FR, Jak AJ, . . . Whitaker JN (2001) Use
of the multiple sclerosis functional composite as an outcome measure in a phase 3 clinical trial.
Arch Neurol 58: 961–967
Cohen JA, Reingold SC, Polman CH, Wolinsky JS (2012) Disability outcome measures in multiple
sclerosis clinical trials: current status and future prospects. Lancet Neurol 11:467–476
Cutter GR, Baier ML, Rudick RA, Cookfair DL, Fischer JS, Petkau KS, . . . Reingold S (1999)
Development of a multiple sclerosis functional composite as a clinical trial outcome measure.
Brain 122(5): 871–882. https://doi.org/10.1093/brain/122.5.871
Edwards LJ, Simpson SL (2014) An analysis of 24-hour ambulatory blood pressure monitoring data
using orthonormal polynomials in the linear mixed model. Blood Press Monit 19(3):153–163.
https://doi.org/10.1097/MBP.0000000000000039
FDA (2009) Guidance for industry patient-reported outcome measures: use in medical product
development to support labeling claims. U.S. Department of Health and Human Services Food
and Drug Administration. Retrieved from https://www.fda.gov/downloads/drugs/guidances/
ucm193282.pdf
FDA (2017) Multiple endpoints in clinical trials: guidance for industry. U.S. Department of Health
and Human Services Food and Drug Administration. Retrieved from https://www.fda.gov/
downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm536750.pdf
Haime M (2016) Somahlution announces study results showing DuraGraft ® vascular graft treat-
ment improves long-term outcomes in coronary artery bypass grafting surgery. European
Association for Cardio-Thoracic Surgery Annual Meeting, Barcelona. Retrieved from https://
www.somahlution.com/vascular-graft-treatment/
Hedeker D (2015) Methods for multilevel ordinal data in prevention research. Prev Sci
16(7):997–1006. https://doi.org/10.1007/s11121-014-0495-x
Hedeker D, Gibbons RD (2006) Longitudinal data analysis. Wiley, Hoboken
Huque MF, Dmitrienko A, D’Agostino R (2013) Multiplicity issues in clinical trials with multiple
objectives. Stat Biopharmaceut Res 5(4):321–337. https://doi.org/10.1080/19466315.2013.
807749
Kennedy RE, Cutter GR, Wang G, Schneider LS (2015) Using baseline cognitive severity for
enriching Alzheimer’s disease clinical trials: how does mini-mental state examination predict
rate of change? Alzheimer’s Dementia: Transl Res Clin Interven 1:46–52. https://doi.org/10.
1016/j.trci.2015.03.001
Mallinckrodt CH, Sanger TM, Dubé S, DeBrota DJ, Molenberghs G, Carrol RJ, . . . Tollefson GD
(2003) Assessing and interpreting treatment effects in longitudinal clinical trials with missing
data. Biol Psychiatry 53:754–760
Motooka Y, Matsui T, Slaton RM, Umetsu R, Fukuda A, Naganuma M, . . . Nakamura M (2018)
Adverse events of smoking cessation treatments (nicotine replacement therapy and non-nicotine
prescription medication) and electronic cigarettes in the Food and Drug Administration Adverse
Event Reporting System, 2004–2016. SAGE Open Med 6:1–11. https://doi.org/10.1177/
2050312118777953
Pocock SJ (1997) Clinical trials with multiple outcomes: a statistical perspective on their design,
analysis, and interpretation. Control Clin Trials 18:530–545
Powney M, Williamson P, Kirkham J, Kolamunnage-Dona R (2014) A review of handling missing
longitudinal outcome data in clinical trials. Trials 15:237. https://doi.org/10.1186/1745-6215-
15-237
Prentice RL (1989) Surrogate endpoints in clinical trials: definition and operational criteria. Stat
Med 8(4):431–440. https://doi.org/10.1002/sim.4780080407
Shrout PE, Fleiss JL (1979) Intraclass correlations: uses in assessing rater reliability. Psychol Bull
86(2):420–428
Simon R, Maitournam A (2004) Evaluating the efficiency of targeted designs for randomized clinical
trials. Clin Cancer Res 10:6759–6763. https://doi.org/10.1158/1078-0432.CCR-04-0496
Sormani MP, Bonzano L, Luca R, Cutter GR, Mancardi GL, Bruzzi P (2009) Magnetic resonance
imaging as a potential surrogate for relapses in multiple sclerosis: a meta-analytic approach. Ann
Neurol 65:268–275. https://doi.org/10.1002/ana.21606
Sormani MP, Bonzano L, Luca R, Mancardi GL, Ucceli A, Bruzzi P (2010) Surrogate endpoints for
EDSS worsening in multiple sclerosis. A meta-analytic approach. Neurology 75(4):302–309.
https://doi.org/10.1212/WNL.0b013e3181ea15aa
Sozu T, Sugimoto T, Hamasaki T (2011) Sample size determination in superiority clinical trials with
multiple co-primary correlated endpoints. J Biopharm Stat 21:650–668. https://doi.org/10.1080/
10543406.2011.551329
Sozu T, Sugimoto T, Hamasaki T (2012) Sample size determination in clinical trials with multiple
co-primary endpoints including mixed continuous and binary variables. Biom J 54(5):716–729.
https://doi.org/10.1002/bimj.201100221
Sullivan EJ (n.d.) Clinical trials endpoints. U.S. Food and Drug Administration. Retrieved Novem-
ber 19, 2018, from https://www.fda.gov/downloads/Training/ClinicalInvestigatorTrai
ningCourse/UCM337268.pdf
Surrogate Endpoint Resources for Drug and Biologic Development (n.d.) U.S. Food and Drug
Administration. Retrieved November 19, 2018, from https://www.fda.gov/Drugs/
DevelopmentApprovalProcess/DevelopmentResources/ucm606684.htm
Swartz RJ, Schwartz C, Basch E, Cai L, Fairclough DL, Mendoza TR, Rapkin B (2011) The king’s
foot of patient-reported outcomes: current practice and new developments for the measurements
of change. Qual Life Res 20:1159–1167. https://doi.org/10.1007/s11136-011-9863-1
Wolfe GI, Kaminski HJ, Aban IB, Minisman G, Kuo H-C, Marx A, . . . Evoli A (2016) Randomized
trial of thymectomy in myasthenia gravis. N Engl J Med 375(6):511–522. https://doi.org/10.
1056/NEJMoa1602489
51 Patient-Reported Outcomes

Gillian Gresham and Patricia A. Ganz

Contents
Introduction
Role of PROs in Clinical Trials
PROs to Support Labeling Claims in the United States
PROs to Support Labeling Claims in Other Countries
Types of PROs
Health-Related QOL (HRQOL)
Healthcare Utility and Cost-Effectiveness
PROs Developed by the National Institutes of Health (NIH)
Patient-Reported Common Terminology Criteria for Adverse Events (PRO-CTCAE)
Design Considerations
Selection of the Instrument
Modes of Administration and Data Collection Methods
Frequency and Duration of PRO Assessments
Other Design Considerations
Clinical Trial Protocol Development
Reporting PRO Results from Clinical Trials
Summary and Conclusion
Key Facts
References

G. Gresham
Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center,
Los Angeles, CA, USA
e-mail: [email protected]
P. A. Ganz (*)
Jonsson Comprehensive Cancer Center, University of California at Los Angeles,
Los Angeles, CA, USA
e-mail: [email protected]


Abstract
Patient-reported outcomes (PROs) are defined as any report that comes directly
from a patient. Their use as key outcomes in clinical trials has increased signif-
icantly, especially during the last decade. PROs encompass a variety of measure-
ments including health-related quality of life (HRQOL), symptoms, functional
status, safety, utilities, and satisfaction ratings. Selection of the PRO in a trial will
depend on a variety of factors including the trial’s objectives, study population,
disease or condition, as well as the type of treatment or intervention. PROs can be
used to inform clinical care and to support drug approval and labeling claims.
This chapter will provide an overview of the different types of PROs
with examples and their role in healthcare within the context of clinical trials.
Summaries of important regulatory documents including the FDA PRO guidance
and recommendations (SPIRIT and CONSORT PRO extensions) will also be
provided. Considerations when designing clinical trials are described in the last
section, highlighting important issues and topics that are unique to PROs. Many
methodologic and analytic features of PROs are similar to those of any outcomes
used in clinical trials; thus they require the same methodological rigor with
special attention to missing data. This chapter is written with a focus on the use
of PROs in interventional trials in the United States, although most information
can be applied to any context. Information presented in this chapter is relevant to
clinicians, researchers, policy makers, regulatory and funding agencies, as well as
patients. When used appropriately, PROs can generate high-quality data about the
effects of a particular intervention on a patient’s physical, psychological, func-
tional, and symptomatic experience.

Keywords
Patient-reported outcomes (PROs) · Quality of life · Health-related quality of
life · Health status · Symptoms · Adverse events · NIH PROMIS

Introduction

During the last two decades, patient-reported outcomes (PROs) have increasingly
been used in clinical trials to inform clinical care and to support drug approval and
labeling claims. PROs are defined as: “any report of the status of a patient’s health
condition that comes directly from the patient, without interpretation of the patient’s
response by a clinician or anyone else” (FDA 2009). A list of key terms related to
PROs and their definitions is included in Table 1.
There are several key regulatory and academic events that led to the increased
incorporation of PROs in clinical trials. Some of the first research studies to use
PROs included the Alameda County Human Population Laboratory Studies, the
RAND Health Insurance Experiment, and the Medical Outcomes Study occurring in
the 1970s and 1980s (Ganz et al. cited in Kominski 2013).

Table 1 List of key terms and definitions related to PROs


Term Definition
Comparative effectiveness The generation and synthesis of evidence that compares the
research (CER) benefits and harms of alternative methods to prevent, diagnose,
treat, and monitor a clinical condition or to improve the delivery
of carea
Construct validity Evidence that relationships among items, domains, and concepts
conform to a priori hypotheses concerning logical relationships
that should exist with other measures or characteristics of patients
and patient groupsb
Content validity Evidence from qualitative research demonstrating that the
instrument measures the concept of interest including evidence
that the items and domains of an instrument are appropriate and
comprehensive relative to its intended measurement concept,
population, and use. Testing other measurement properties will
not replace or rectify problems with content validityb
Criterion validity The extent to which the scores of a PRO instrument are related to
a known gold standard measure of the same concept. For most
PROs, criterion validity cannot be measured because there is no
gold standardb
Domain A subconcept represented by a score of an instrument that
measures a larger concept comprised of multiple domains. For
example, psychological function is the larger concept containing
the domains subdivided into items describing emotional function
and cognitive functionb
Health-related quality of life A subcomponent of overall quality of life that relates to health
(HRQOL) and that focuses on the patient’s own perception of well-being
and the ability to function as a result of health status or disease
experiencea
Item An individual question, statement, or task (and its standardized
response options) that is evaluated by the patient to address a
particular conceptb
Outcomes of care Refers to specific indicators of what happens to the patient once
care has been rendereda
Patient-reported outcome A measurement based on a report that comes directly from the
(PRO) patient (i.e., study subject) about the status of a patient’s health
condition without amendment or interpretation of the patient’s
response by a clinician or anyone else. A PRO can be measured
by self-report or by interview provided that the interviewer
records only the patient’s response.b
Process of care Refers to the content of the medical and psychological
interactions between patient and providera
Quality-adjusted life years A measure of life expectancy with adjustment for quality of life
(QALY) that integrates mortality and morbidity to express health status in
terms of equivalents of well-years of lifea
Quality of life A range of human experience, including but not limited to access
to the daily necessities of life such as food and shelter,
intrapersonal and interpersonal response to life events, and
activities associated with professional fulfillment and personal
happinessa
Questionnaire A set of questions or items shown to a respondent to get answers
for research purposes. Types of questionnaires include diaries and
event logsb
Reliability The ability of a PRO instrument to yield consistent, reproducible
estimates of true treatment effectb
Sign Any objective evidence of a disease, health condition, or
treatment-related effect. Signs are usually observed and
interpreted by the clinician but may be noticed and reported by
the patientb
Structure of care Refers to how medical and other services are organized in a
particular institution or delivery systema
Symptom Any subjective evidence of a disease, health condition, or
treatment-related effect that can be noticed and known only by the
patientb
a Definitions transcribed from Ganz et al. (2014)
b Definitions transcribed from US Department of Health and Human Services: Food and Drug
Administration (2009)

In July 1990, the US National Cancer Institute held an all-day workshop to discuss the inclusion of PROs
in cancer clinical trials, informing subsequent strategies for their use in federally
funded cancer trials (Nayfield et al. 1992). Additional regulatory advancements in
the early 2000s include the draft of the EMA Reflection Paper on HRQL (2005) and
release of the Food and Drug Administration (FDA) draft guidance in 2006. In 2009,
the FDA published final guidance regarding the use of PROs for medical product
development to support labeling claims (FDA 2009). The guidance provides infor-
mation related to the evaluation of a PRO instrument and clinical trial design and
protocol considerations. This guidance will be used throughout this chapter to
support recommendations for the design of clinical trials that incorporate PROs.
Another key event in the history of PROs in the United States was the establish-
ment of the Patient-Centered Outcomes Research Institute (PCORI) as part of the
2010 Affordable Care Act (Frank et al. 2014). PCORI is an independent nonprofit,
nongovernmental organization that funds a wide range of research that incorporates
the patient perspective and will improve healthcare delivery and patient outcomes. It
is the largest public research funder that focuses on comparative effectiveness
research (CER), having funded over $2 billion in research and related projects
to date. A unique feature of PCORI includes the active engagement of patients and
stakeholders throughout the research and review process (Frank et al. 2014). The
PCORI merit review includes five criteria that address the “impact of the condition
on the health of individuals and populations, population for the study to improve
health care and outcomes, technical merit, patient-centeredness, and patient and
stakeholder engagement” (Frank et al. 2014). Additional funding information and
details about PCORI can be found at the following web link: https://www.pcori.org/.
The purpose of this chapter is to provide a summary of the role of PROs within
the context of clinical trials, describe different types of PROs currently being used,
and include PRO-specific considerations to take into account when designing a


clinical trial. Additional information regarding outcomes more broadly and the
analysis of PROs are provided in ▶ Chaps. 50, “Outcomes in Clinical Trials” and
▶ 92, “Statistical Analysis of Patient-Reported Outcomes in Clinical Trials,”
respectively.

Role of PROs in Clinical Trials

PROs play different, but equally important roles across each phase of drug devel-
opment (e.g., Phase I, II, III, IV). The role of the PRO depends largely on the clinical
trial endpoint model, defined in the FDA 2009 PRO guidance as “a diagram of the
hierarchy of relationships among all endpoints, both PRO and non-PRO, that
corresponds to the clinical trial’s objectives, design, and data analysis plan” (FDA
2009). The conceptual framework for each endpoint model is illustrated in Figs. 1
and 2 as adapted from the FDA guidance. A PRO may be defined as the primary,
secondary, or exploratory endpoint in a clinical trial. If used as a primary endpoint,
the methods for sample size determination and power calculations should be
included, as described further in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials.” Regardless of the type of outcome, the
specific PRO hypothesis and objectives should be clearly stated a priori along with
details of the instrument’s conceptual framework and measurement properties in the
protocol.
The use of PROs across all phases of clinical trials, as registered in clinicaltrials.
gov between 2000 and 2018, has increased over time as displayed in Fig. 3. Based on
a general search of clinicaltrials.gov, we identified approximately 146 trials that
included PROs or HRQOL as outcomes in 2000, 693 in 2005, 1056 in 2010, and
1505 in 2018.

Fig. 1 Endpoint model: Treatment of disease X


Fig. 2 Endpoint model: Treatment of symptoms associated with disease Y

Fig. 3 Number of clinical trials with PROs registered in clinicaltrials.gov by phase

The majority of trials registered during this period were phase 2 (n = 6768) and
phase 3 (n = 5807), followed by phase 4 (n = 2715) and phase 1 (n = 1769).
In early development trials (phase I/II), PROs can provide important information
about specific toxicities and symptoms before the treatment progresses to the next
phase (Lipscomb et al. 2005; Basch et al. 2016). In the earlier phases of clinical
trials, disease-specific measures and PROs may be more useful and clinically
relevant to clinicians and patients (Spilker 1996). Because early phase trials enroll
selective groups of subjects, PROs should be employed cautiously (Lipscomb et al.
2005). Early phase trials are less likely to support PRO labeling claims but can
provide early insight into toxicity and tolerability of new drugs or devices.
PROs are especially informative in situations where a disease may be well-
controlled by either of the interventions being compared but the symptom profile
and HRQOL are different. Thus, the use of PRO assessments as primary outcomes in
middle to late development (Phase II–III) studies is appropriate in this setting and
can help inform clinical decisions (Piantadosi 2017). In middle and late development
trials, PROs can provide important information regarding the baseline and treatment
symptom profiles as well as provide additional data for comparing the tolerability
between treatments (Basch et al. 2016). Thus, well-designed phase III trials with
PROs may inform policy and support labeling claims. PROs may also play a role in
post-marketing (phase IV) studies with regard to the long-term treatment effects and
safety surveillance (Basch et al. 2016).

PROs to Support Labeling Claims in the United States

There has been an increasing call for the incorporation of the patient voice into FDA
drug approval and labeling claims. This led to the release of PRO-specific FDA
guidance for industry (FDA 2009). The 2009 FDA guidance for use of PROs to
support labeling claims provides recommendations for areas that should be
addressed in PRO documents that are submitted to the FDA for review (FDA
2009). To ensure high-quality PRO data that is used to support the labeling claims,
the guidance focuses on the evaluation of the PRO instrument and design consider-
ations. Briefly, these areas include as follows: (I) the PRO instrument being used
along with instructions; (II) targeted claims or target product profile related to the
trial outcome measures (e.g., disease or condition, intended population, data analysis
plan); (III) the endpoint model; (IV) the PRO instrument’s conceptual framework;
(V) documentation for content validity; (VI) assessment of other measurement
properties (e.g., reliability, construct validity); (VII) interpretation of scores; (VIII)
language translations (if applicable); (IX) the data collection method; (X) any
modifications to the original instrument with justification; (XI) the protocol includ-
ing PRO-specific content; and (XII) key references. Detail and explanation for each
of these topics are included in the Appendix of the final 2009 FDA guidance
document and described within the context of clinical trial design in a later section.
While the FDA PRO guidance marks a significant advancement for the field, the
use of PROs to support labeling claims and their inclusion in published reports of
clinical trials remains low (Basch 2012). In the United States, it has been reported
that approximately 20-25% of all drug approvals were supported by PROs between
2006, when the FDA draft guidance was released, and 2015, despite the fact that
50% of drug approval packages included PRO endpoints (Basch 2012; DeMuro
et al. 2013). A literature review identified 182 NDAs between 2011 and 2015, for
which 16.5% had PRO labeling, defined as any treatment benefit related to PROs
that are mentioned in the FDA product label (Gnanasakthy et al. 2017). Authors
found that the majority of PRO labeling has been based on primary outcomes in
PRO-dependent diseases (e.g., mental, behavioral, and neurodevelopment disorders,
diseases of the respiratory system, diseases of the musculoskeletal system, etc.) as
compared to non-PRO-dependent diseases (e.g., neoplasms, infection and parasitic
diseases, diseases of the circulatory system, etc.). The majority of new drug
approvals supported by PRO labeling were in diseases of the nervous system,
musculoskeletal system, and genitourinary system. For example, there were 23
new drugs approved in diseases of the nervous system between 2006 and 2015,
for which over half (n = 13) were approved based on PRO labeling (Gnanasakthy
et al. 2017). Treatment approvals included gabapentin enacarbil for restless legs
syndrome, where the International Restless Legs Syndrome (IRLS) Rating Scale was
used to support labeling, perampanel and eslicarbazepine acetate to treat seizures
using patient diaries to record seizures, and tasimelteon or suvorexant to treat sleep
disorders using patient-reported sleep latency and time.
Other notable PROs used to support drug approvals included the Pain Visual
Analog Scale (VAS) for conditions of the musculoskeletal system (e.g., arthritis),
diaries for rescue medication in diseases of the respiratory system, and other diaries
and self-reported accounts of gastrointestinal or genitourinary symptoms. Of the 69
new drug approvals in neoplasms between 2006 and 2015, none of them had PRO
labeling (Gnanasakthy et al. 2017). A second review identified three treatments that
used PRO data to support their FDA approval, including the Brief Pain Inventory-
Short Form (BPI-SF) for abiraterone in prostate cancer (2011), the Myelofibrosis
Symptom Assessment Form (MFSAF) to support the approval of ruxolitinib for
myelofibrosis (2011), and the Visual Symptom Assessment Questionnaire-Anaplas-
tic Lymphoma Kinase (VSAQ-ALK) to support the approval of crizotinib for non-
small cell lung cancer (Gnanasakthy et al. 2016). Authors suggest that the low rates
of PRO labeling in oncology are due to the fact that the development of cancer drugs
mostly relies on survival-related outcomes and tumor growth to assess treatment
efficacy (Gnanasakthy et al. 2017; Kluetz and Pazdur 2016). However, the inclusion
of PROs in pivotal cancer studies may help increase PRO labeling in future drug
approvals and are essential to the development and review of oncology drugs (Basch
2018). Additional recommendations for sponsors and the FDA to increase the
success of including PROs in clinical trials and US drug labels have been provided
in Basch (2012).

PROs to Support Labeling Claims in Other Countries

While other countries have not issued formal guidelines specific to PROs such as the
US 2009 FDA guidance for PROs, recommendations exist for the use of PROs to
support the evaluation and approval of drug products. Perspectives of international
regulatory scientists on the incorporation of PROs into the regulatory
decision-making process have been described in a recent paper by Kluetz
et al. (2018). In Europe, the European Medicines Agency (EMA) published a
reflection paper that provides recommendations on HRQOL evaluation within the
context of clinical trials (European Medicines Agency 2005; DeMuro et al. 2013). A
review published in 2013 compared PRO label claims granted by the FDA to those
granted by the EMA between 2006 and 2010 (DeMuro et al. 2013). The authors
found that the EMA granted more PRO labels with 47% of products with at least one
EMA-granted PRO label compared to 19% by the FDA. They also observed that the
majority of FDA claims focus on symptoms, while the EMA-granted claims are
more likely to approve treatments based on higher-order concepts such as HRQOL,
patient global ratings, and functioning. Of the 52 PRO label claims granted by both
agencies, 14 products were approved by both the FDA and EMA (DeMuro et al.
2013). The UK National Institute for Clinical Excellence (NICE) has also provided
guidance for the inclusion of patients and public in treatment evaluations (NICE
2006).
In most parts of Europe, Australia, and Canada, PROs are included as an
important component of health technology assessment (HTA). The WHO defines
HTA as: “The systematic evaluation of properties, effects, and/or impacts of health
technology.” HTA was first developed in the United States in the 1970s as a policy
tool and later introduced to Europe, Canada, the United Kingdom, and Scandinavia
(Banta 2003). HTA is used to support healthcare policy and inform reimbursement
and coverage insurers (Banta 2003). Although the United States supports HTA
research, also as a component of comparative effectiveness research, no formal
HTA agencies exist in the United States. Within the context of clinical trials,
survival, QOL, and cost-effectiveness outcomes generated from clinical trials are
subsequently used to inform HTA. Thus, it is important that trials are designed to
incorporate these important and informative outcomes in order to generate high-
quality evidence.

Types of PROs

PROs encompass a variety of measurements including health-related quality of life


(HRQOL), symptoms, functional status, safety, utilities, and satisfaction ratings
(FDA 2009; Calvert et al. 2013). The following sections describe some key PRO
measurements that can be included as outcomes in clinical trials.
There are several instruments that have been developed to target different
populations with different purposes. PRO assessments can be categorized into
generic and disease-specific instruments based on the concepts they are measuring.
Generic measures may be useful when evaluating core domains of function and well-
being across various populations and to detect differential effects on more general
aspects of health status (Spilker 1996; Lipscomb et al. 2005). While they may allow
comparison across interventions, they may not focus adequately on a specific area of
interest and may not be as responsive as disease-specific measurements (Spilker
1996). Generic measures may be appropriate when there is a large trade-off between
length of life and quality of life, which feeds into the concept of quality-adjusted life
years (QALY) and quality-adjusted time without symptoms or toxicity (Q-TWIST),
where generic population-based assessments have been used (see discussion of
QALYs later).
Utility measures are a second type of generic instrument, which “reflect the
preferences of patients for different health states” (Spilker 1996). Their use may be
appropriate when it is of interest to assess health states as they relate to death or if
investigators want to conduct cost-utility analyses. Because utility measures provide
a single summary score of the net change in HRQL, they do not allow for exami-
nation of effect on the different aspects of quality of life (Spilker 1996).
Finally, disease-specific measures focus on the distinguishing features of func-
tional status and well-being that are specific to different diseases or conditions (e.g.,
heart failure, cancer), populations (e.g., frail or elderly), functions (e.g., sleep, sexual
function), or problems (e.g., pain) (Spilker 1996). Because they only include the
aspects of PROs that are relevant to the patients being studied, they may result in
improved responsiveness. Often, several disease-specific measures are used together
in a battery to obtain a more comprehensive picture of the impact of treatment
interventions (Spilker 1996). For example, a clinical trial that is evaluating the effect
of a particular treatment on HRQL may include measures of physical function, pain,
sleep, and side effects as they relate to that treatment. Some weaknesses of disease-
specific assessments are that they may be limited in terms of populations and
interventions, their domains may be restricted to those relevant to the specific
disease, population, function, or problem, and they do not allow cross-condition
comparisons (Spilker 1996).

Health-Related QOL (HRQOL)

HRQOL is a special type of PRO that is widely used in clinical trials (Piantadosi
2017). It is considered an outcome of care that focuses on a “patient’s own percep-
tion of well-being, and the ability to function as a result of health status or disease
experience” (Ganz et al. 2014). The World Health Organization (WHO) defines
health as: “A state of complete physical, mental and social well-being and not merely
the absence of disease or infirmity,” and it has been schematically represented in
three levels of QOL that included the overall assessment of well-being, broad
domains, and the specific components of each domain (Spilker 1996). Early reviews
of the MEDLINE database demonstrated an exponential increase in the number of
studies that used QOL evaluation, with only five studies identified in 1973 and 9291
articles in 2010 (Testa and Simonson 1996; Ganz et al. 2014). By 2018, this number
was just over 34,000 articles using “quality of life” as a topic heading, 1532 of
which were human clinical trials. In a clinicaltrials.gov search of the term “quality of
life,” a total of 28,886 interventional trials started between 01/01/2000 and 12/31/
2018 were identified. It is anticipated that this number will continue to grow as the
relevance and value of QOL are increasingly recognized.
HRQOL is measured using instruments (self-administered questionnaires) that
contain questions or items that are subsequently organized into scales (Ganz et al.
2014). HRQOL instruments may be general, addressing common components of
functioning and well-being, or disease-targeted, which focus on the specific impact
of the particular organ dysfunctions that affect HRQOL (Patrick and Deyo 1989 in:
Ganz et al. 2014). HRQOL is considered multidimensional, consisting of different
domains (physical, psychological, economic, spiritual, social) and specific
measurement scales within each domain (Testa and Simonson 1996; Spilker 1996).
The overall assessment of well-being or global HRQOL score is often achieved by
summing scores across the different domains or may contain a global rating scale for
summary. Scores can also be established by each of the broad domains (e.g., physical
domain) or of the specific item (e.g., symptoms, functioning, disability).
One of the most common generic HRQOL instruments, emerging from the
Medical Outcomes Study (RAND Corporation), is the SF-36. It is composed of 36
items across 8 scales (physical functioning, role-physical, bodily pain, general
health, vitality, social functioning, role-emotional, mental health), which can be
further grouped into physical health and mental health summary measures. An
abbreviated, but equivalent version of the SF-36 was also developed through the
MOS and includes 12 items (SF-12) summarized into the mental and physical
domains. The SF-36 and SF-12 have been particularly useful in comparing the
relative burden of different diseases. They also have potential for evaluating the
QOL burden from different treatments (Spilker 1996).
An example of a disease-specific instrument of QOL is the European Organiza-
tion for Research and Treatment of Cancer (EORTC) QLQ-C30 questionnaire
(Aaronson et al. 1993). The QLQ-C30 is a cancer-specific questionnaire that was
developed from an international field study of 305 patients from centers across 13
different countries. The QLQ-C30 is composed of nine multi-item scales, which
include functional scales (physical, role, emotional, cognitive, social), symptom
scales (fatigue, nausea/vomiting, pain, dyspnea, sleep disturbance, appetite loss,
constipation, diarrhea, financial impact), and a global health and QOL scale. The
EORTC questionnaire and its many cancer-specific modules have been used
throughout the world to evaluate HRQOL outcomes in cancer treatment trials
(https://www.eortc.org/). A comprehensive inventory of HRQOL instruments, with
information on over 2000 generic and disease-specific instruments, is available
online at https://eprovide.mapi-trust.org/.

Healthcare Utility and Cost-Effectiveness

HRQOL can be used as utility coefficients to weight or adjust for outcomes such as
survival or progression-free survival and inform policy and healthcare decisions. For
example, quality-adjusted life years (QALY) are a measure of life expectancy that
adjusts for QOL (Kominski 2014). QALYs are often used to evaluate programs and
assist in decision-making; the Institute of Medicine (IOM) Committee on Public
Health Strategies to Improve Health recommended that QALYs be used to monitor
the health status of all communities (Institute of Medicine 2011).
Consequently, there have been an increasing number of studies in which QALYs
have appeared in the literature, and advancements in methodologies for measuring
and reporting QALYs have been made (Ganz et al. in Kominski 2014). QALYs can
also be combined with cost evaluations, where the cost per QALY can be used to
show the relative efficiency of different health programs. While QALYs have played
an important role in health policy, they are also associated with limitations and
obstacles for which the Affordable Care Act states that QALYs cannot be used for
resource allocation decisions, and thus studies funded by PCORI do not include
these assessments (Ganz et al. in Kominski 2014). However, in the United Kingdom,
and other jurisdictions, their assessments play an important role in drug and device
approval processes.
Another measure that incorporates quantity and quality of life is the quality-
adjusted time without symptoms of disease and toxicity of treatment (Q-TWIST).
It is based on the concept of QALYs and represents a utility-based approach to QOL
assessment in clinical trials (Spilker 1996). Q-TWIST is calculated by subtracting
the number of days with specified disease symptoms or treatment-induced toxicities
from the total number of days of survival (Lipscomb et al. 2005). It requires
calculation of QOL-oriented clinical health states, for which treatments can then
be compared using a weighted sum of the mean duration of each health state with
weights being utility-based (Spilker 1996; Lipscomb et al. 2005). Often, the area
under the Kaplan-Meier survival curves is partitioned to calculate the average time a
patient spends in each clinical health state. An example of where Q-TWIST was
informative is in the case of adjuvant chemotherapy with tamoxifen for breast cancer
versus tamoxifen alone, where findings demonstrated the additional burden the
cytotoxic chemotherapy imposed on patients (Gelber et al. cited in Lipscomb et al.
2005). Another example is in a trial that demonstrated lengthened progression-free
survival (0.9 months) with zidovudine in patients with mildly symptomatic HIV
infection. However, Q-TWIST showed that patients who were treated with zidovu-
dine did worse (Spilker 1996). Additional details and analysis information related to
QALYs and Q-TWIST are described in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials.”
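As a schematic numeric sketch of these two quantities (all utilities, durations, and
health states below are hypothetical illustrations of the partitioned-survival idea
described above):

# QALY: sum of (utility weight x years spent in that health state).
states = [(1.0, 2.0),   # 2 years in full health
          (0.7, 1.5),   # 1.5 years with moderate symptoms
          (0.4, 0.5)]   # 6 months with severe symptoms
qaly = sum(u * t for u, t in states)  # = 3.25 quality-adjusted life years

# Q-TWIST: partition survival into TOX (time with toxicity), TWIST (time
# without symptoms or toxicity), and REL (time after relapse/progression),
# then combine with utility weights for TOX and REL.
u_tox, u_rel = 0.5, 0.5          # utility weights (hypothetical)
tox, twist, rel = 0.4, 2.1, 0.8  # mean years in each state (hypothetical)
q_twist = u_tox * tox + twist + u_rel * rel  # = 2.7
print(qaly, q_twist)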

PROs Developed by the National Institutes of Health (NIH)

NIH PROMIS®
In 2004, the NIH established the Patient-Reported Outcomes Measurement Information System (PROMIS®) as part of a multicenter cooperative group with the purpose of developing and validating PROs for clinical research and practice (Cella et al. 2010). PROMIS® consists of over 300 self-reported and parent-reported measures of global, physical, mental, and social health. The domain mapping process was based on the WHO framework of physical, mental, and social health. Qualitative research and item response theory (IRT) analysis of 11 large datasets informed an item library of close to 7000 PRO items that were further reviewed and evaluated in field testing (Cella et al. 2010). PROMIS® measurement tools have been validated in adults and children from the general population and those living with chronic conditions. Additional information and PROMIS® measures are available through the official information and distribution center, HealthMeasures (http://www.healthmeasures.net/index.php).
The adult PROMIS® measures framework is composed of the PROMIS® profile domains and additional domains, which are further categorized into physical, mental, or social health components (HealthMeasures 2019). Physical health profile domains include fatigue, pain intensity, pain interference, physical function, and sleep disturbance. Mental health profile domains include anxiety and depression, and social health includes the ability to participate in social roles and activities. A general global health measure also makes up the self-reported health framework. The complete framework and a list of additional PROMIS® domains can be accessed at http://www.healthmeasures.net/explore-measurement-systems/promis/intro-to-promis.
A framework for pediatric self-reported and proxy-reported health was also
developed including the same health measures (physical, mental, social, global
health) with slightly different profile and additional domains. For instance, physical
health adds mobility and upper extremity function to the list of physical profile
domains, while sleep disturbance is not included. Anxiety and depressive symptoms
represent profile domains for mental health, while peer relationships are assessed as
part of social health.
PROMIS® measures can be administered using computer adaptive testing or on paper through short forms or profiles (HealthMeasures 2019). The PROMIS® self-report measures are intended to be completed by respondents themselves without help from others; if respondents are unable to answer on their own, a parent or proxy measure may be used. Computer adaptive tests and short forms can be imported into common data platforms and web applications such as REDCap, Epic, OBERD, the Assessment Center (SM), and the Assessment Center Application Programming Interface (API), which can connect any data collection software application with the full library of PROMIS® measures.
PROMIS® items include questions accompanied by Likert-type responses (e.g., not
at all, very little, somewhat, quite a lot, cannot do), which are associated with a
numerical score (0–5). The sums of the scores can then be converted into standardized
T-scores through the HealthMeasures scoring service, automatic scoring in data
collection tools, or manually. The T-score metric has a mean of 50 and standard deviation of 10, making it easy to compare scores to reference populations including the general
population and clinical samples (e.g., cancer, pain populations) (Cella et al. 2007).
PROMIS® scores can also be converted to similar items from different instruments
such as the SF-36. For example, a PROMIS® physical T-score for physical function
can be linked to the SF-36 physical function score (www.prosettastone.org). This
allows for PRO evaluation and comparisons even when different measures are used.
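The T-score metric itself is a simple linear rescaling: if z is a standardized score (mean 0, SD 1 relative to the reference population, for example an IRT-estimated trait level), then T = 50 + 10z. A minimal sketch:

```python
def t_score(z):
    """Map a standardized score (mean 0, SD 1 in the reference
    population) onto the T-score metric (mean 50, SD 10)."""
    return 50 + 10 * z

print(t_score(0.0))  # 50.0, at the reference-population mean
print(t_score(1.5))  # 65.0, 1.5 SD above the reference mean
```

For symptom domains, higher T-scores indicate more of the measured construct, so a fatigue T-score of 65 reflects substantially more fatigue than the reference-population average.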
The use of PROMIS® measures in clinical trials has increased substantially since their development, with over 2000 observational or interventional studies identified in clinicaltrials.gov as of January 2021. The majority of trials that use NIH PROMIS® measures are for pain conditions, musculoskeletal diseases, and mental and psychotic disorders. Their use is also increasing in cancer clinical trials, especially breast cancer, and diseases of the central nervous system.

Patient-Reported Common Terminology Criteria for Adverse Events (PRO-CTCAE)

In 2014, the NCI developed PRO-CTCAE in order to incorporate the patient perspective and improve symptom monitoring in cancer (Basch et al. 2014; Dy et al. 2014; Atkinson et al. 2017; Dueck et al. 2015), reflecting toxicities and adverse
events typically measured by clinical trial staff as part of the Common Terminology
Criteria for Adverse Events (CTCAE) assessment system. The PRO-CTCAE mea-
surement system includes 78 treatment toxicities that patients can systematically use
to document the frequency, severity, and interference of each toxicity (Basch et al.
2014). PRO-CTCAE includes a library of 124 questions from which items can be selected as relevant to the specific trial. Each question includes a measure of the
frequency of the AE (e.g., “never,” “rarely,” “occasionally,” “frequently,” “almost
constantly”), the severity (“none,” “mild,” “moderate,” “severe,” “very severe”), or
the interference with usual or daily activities (“not at all,” “a little bit,” “somewhat,”
“quite a bit,” “very much”) (Basch et al. 2016). The patient-reported AEs can be
systematically collected at baseline, during active treatment, and during follow-up
using a pre-populated questionnaire.
An electronic platform for PRO-CTCAE also exists, allowing for customized data collection tailored to a particular treatment schedule and the incorporation of patient reminders and clinician alerts (Dy et al. 2011; Basch et al. 2017b). Studies of the PRO-CTCAE and the electronic PRO-CTCAE symptom collection system have demonstrated their feasibility, validity, and reliability in cancer patients (Dy et al. 2011; Basch et al. 2017a). Their use has also been associated with enhanced care, improved QOL, and survival, possibly as the result of earlier responsiveness to patient symptoms by medical personnel (Basch et al. 2014, 2017; Aaronson et al. 1993).

Design Considerations

The following sections provide summaries of PRO-specific considerations to account for when designing a clinical trial and to include in the clinical trial protocol.

Selection of the Instrument

Selection of the PRO in a trial will depend on a variety of factors including the trial’s
objectives, study population, disease or condition, as well as the type of treatment or
intervention to a certain extent (Piantadosi 2017). When designing a clinical trial, the
instrument(s) used should be specified a priori and appropriately selected for the
specific population being enrolled in the clinical trial. For instance, additional
measurement considerations may need to be accounted for when assessing PROs
in pediatric, cognitively impaired, or seriously ill patients (FDA 2009). Investigators
should first determine whether an adequate PRO instrument exists to assess and
measure the concepts of interest (FDA 2009). In some cases, a new PRO instrument
may be developed or modified, with additional steps that would need to be taken to
ensure validity and reliability.
Validity and reliability should be supported before using an instrument in a clinical trial. The FDA guidance requires that content validity, as well as other validity and reliability properties, be established as a component of FDA review. The content validity of an instrument, or the extent to which the instrument measures the concept
of interest, should be supported by evidence from the literature or preliminary
studies and established prior to the evaluation of other measurement properties
(FDA 2009). The PRO instrument should also demonstrate reliability, or the ability
to yield consistent, reproducible estimates of the true treatment effect (e.g., test-retest
reliability), as well as construct validity, the relationships among items, domains, and
concepts, and criterion validity, or the extent to which the scores of a PRO instru-
ment are related to a gold standard measure of the same concept, if available (FDA
2009). FDA definitions of validity and reliability are included in Table 1, and a
detailed description of how validity and reliability are assessed is described in
▶ Chap. 50, “Outcomes in Clinical Trials.”
The FDA also provides guidance on the review of PRO instrument characteristics
including the modes of administration and data collection methods, the frequency
and duration of assessments, as well as other considerations specific to the clinical
trial design, as described in the following sections.

Modes of Administration and Data Collection Methods

There are several different ways that PROs can be administered, including self-administered, interviewer-administered, telephone-administered, surrogate- (or proxy-) administered, or a combination of modes (Spilker 1996; Lipscomb
et al. 2005; FDA 2009). When selecting the mode of administration and data
collection method in a trial, it is important to consider its intended use, the cost,
and how missing data can be reduced (Lipscomb et al. 2005). While self-administered PROs require minimal resources, they are associated with an increased likelihood of missing items, misunderstanding, and lower response rates
(Spilker 1996). For both face-to-face and telephone interviews, response rates are
maximized, while missing data and errors of misunderstanding are minimized
(Spilker 1996). Disadvantages of these methods are that more time and resources
are required to train the interviewers and administer the questionnaires. Addition-
ally, for telephone interviews, the format of the instrument is further limited
(Spilker 1996). A third option is the use of surrogate responders to complete the
assessments. Advantages of using surrogates or proxies are that it is more inclusive
of patients who may not be able to complete the questionnaires themselves such as
children and those who are cognitively impaired or have language barriers (Spilker
1996). A risk associated with using surrogate responders is that the perceptions of
the surrogate may be different from those of the target group and not accurately
represent the patient’s perspective. For example, proxy reports of more observable
domains such as physical or cognitive function domains may be overestimated,
while symptoms or signs may be underestimated by proxy respondents (Spilker
1996). Thus, it is important to consider the strengths and weaknesses of each mode
of administration and identify the mode that is most relevant and appropriate for
each context.
Methods to collect PRO data from either self-administered or interviewer-administered questionnaires include entry on paper by the patient or interviewer or using
computer-assisted assessments. While paper-based methods of assessment are the
most widely used and may be preferred by some patients, they can result in higher
risk of missing items or data due to skipped questions or pages. Alternatively,
computer-assisted assessments, such as electronic PROMIS ® questionnaires, can
include skip patterns, data checks, and forced responses to ensure complete data. The
FDA guidance has highlighted specific issues associated with the use of electronic
PRO instruments including the entry, maintenance, and transmission of electronic
data (FDA 2009). If the electronic PRO instrument is used as the source document,
additional requirements must be met including 21 CFR Part 11 compliance and a
plan to ensure data security and integrity. As part of the FDA PRO guidance, the
FDA will review the clinical trial protocol to determine the steps used to ensure that
the patients complete the entries at the specified period using the appropriate
administration mode (FDA 2009).

Frequency and Duration of PRO Assessments

The frequency and duration of PRO assessments must correspond to the specific
research question and objectives of the particular clinical trial. The frequency of
assessments will depend on the natural history of disease, the timing of the thera-
peutic and diagnostic interventions, and the likelihood of changes in the outcome
within the time period (Lipscomb et al. 2005; FDA 2009). Clinical trials with PROs
will often require at least one baseline assessment and several PRO assessments over
the course of the study period. Assessments should be frequent enough to capture the
meaningful change without introducing additional burden to the patients. They
should also not be more frequent than the specific period of recall, explained in
the next section, as defined in the instrument. For example, if an instrument has a 1-
month recall period, assessments should not occur weekly or daily.
The duration of the assessment will also depend on the research question and
should cover the period of time that is sufficient to observe changes in the outcome
(Lipscomb et al. 2005). Investigators should also consider whether they are inter-
ested in specific changes that occur during therapy or the long-term effect of the
therapy on that particular outcome. Therefore, the duration of follow-up with a PRO
assessment may be the same as that of other measures of efficacy or may be longer in
duration if the study objectives require continued assessment. In the former case, it is
important that efforts are made to reduce missing data and loss to follow-up.
Investigators should also take the recall period for the PRO instrument into
consideration when designing a trial. This is defined as the period of time patients
are asked to consider in responding to a PRO or question and can be momentary or
retrospective of varying lengths (FDA 2009). The recall period will depend on the
purpose and intended use of the PRO instrument, the concept being measured, and
the specific disease and treatment schedule. Items with shorter recall periods are
preferred over longer retrospective ones, as patients are likely to be influenced by their current
51 Patient-Reported Outcomes 931

state during the time of recall (FDA 2009). Thus, the use of PRO instruments with short recall periods administered at regular intervals (e.g., 2, 4, 6 weeks) may enhance the quality of the data and reduce the risk of recall bias, though more frequent administration carries a greater chance of missing data.

Other Design Considerations

The FDA PRO guidance also reviews clinical trial design principles unique to PRO
endpoints. The first consideration relates to masking (blinding) and randomization of
trial participants. In the case of open-label trials, patients may overestimate the
treatment benefit in their responses, while those who are not receiving active
treatment may underreport any potential improvements (FDA 2009). The guidance suggests administering the PRO assessments prior to clinical assessments or procedures to minimize the potential influence on patient perceptions. It is rare that such open-
label trials will be adequate to support labeling claims based on PRO instruments. In
masked clinical trials, there is still a possibility for inadvertent unblinding if a
treatment has obvious effects, such as adverse events. Consequently, similar over-
or underreporting of the treatment effect may occur if a patient thinks they are
receiving one treatment over the other. To decrease the risk of possible unblinding,
the guidance suggests using response options that ask for current status, not giving
patients access to previous responses, and using instruments that include many items
about the same concept (FDA 2009). Investigators should also take specific host
factors into consideration, where randomization may not achieve balance with
regard to the specific PROs (psychological and functional outcomes) that partici-
pants have at baseline or develop as a result of treatment.
The FDA guidance provides additional recommendations for clinical trial quality
control when using PROs in order to ensure standardized assessments and processes.
Specifically, the protocol should include information on how both patients and
interviewers (if applicable) will be trained for the PRO instrument along with
detailed instructions. The protocol should also include instructions regarding the
supervision, timing, and order of questionnaire administration as well as the pro-
cesses and rules for the questionnaire review for completeness; documentation of
how and when data are filed, stored, and transmitted to/from a clinical trial site; and
plans for confirmation of the instrument’s measurement properties using the clinical
trial data (FDA 2009).
A third recommendation as it relates to the use of PROs is to provide detailed
plans on how investigators will minimize and handle missing data (FDA 2009).
Because longitudinally measured PROs are subject to informative missingness, they
can introduce bias and interfere with the ability to compare effects across groups.
Thus, the protocol should include plans for collecting reasons that patients
discontinued treatment or withdrew their participation. Efforts should also be
made to continue to collect PRO data, regardless of whether patients discontinued
treatment, and a process should be established for how to obtain PRO measurement
before or after patient withdrawal to prevent loss to follow-up. Details on statistical
methods to account for missing data in the analysis plan are further described in
▶ Chap. 92, “Statistical Analysis of Patient-Reported Outcomes in Clinical Trials,”
of this book. Despite more stringent guidance on the assessment of PROs in clinical
research, there remains a need for a more standardized and coordinated approach to
further improve the efficiency with which PROs are collected and to maximize the
benefits of PROs in healthcare (Calvert et al. 2019).

Clinical Trial Protocol Development

It is essential that clinical trial protocols incorporating PROs are designed with the
same methodological rigor and detail as any other clinical trial protocol. To improve
standardization and enhance quality across clinical trial protocols, the SPIRIT
(Standard Protocol Items: Recommendations for Interventional Trials) statement
was developed, with the most recent version being published in 2013 (Chan et al.
2013) (Appendix 10.7). It consists of 33 recommended items to include in clinical
trial protocols organized by protocol section. To address PRO content-specific
recommendations, a PRO extension of the SPIRIT statement was developed in
2017. In addition to the 33 checklist items from the SPIRIT statement, the PRO
extension includes 11 extensions and 5 elaborations that focus on PRO-specific
issues across each protocol section (Calvert et al. 2018). The PRO-specific elaborations and extensions to the standard SPIRIT checklist are paraphrased in the following section.
As part of the administrative information and introduction components of the trial
protocol, PRO elaborations and extensions include SPIRIT-5a specifying the indi-
vidual(s) responsible for the PRO content of the trial; SPIRIT-6a describing the
PRO-specific research question and rationale for PRO assessment and summarizing
PRO findings in relevant studies; and SPIRIT-7 stating the PRO objectives or
hypotheses. PRO extensions related to the methods section of the protocols are:

SPIRIT-10 Specify any PRO-specific eligibility criteria. If PROs are only collected in
a subsample, provide rationale and description of methods for obtaining the PRO
subsample.
SPIRIT-12 Specify the PRO concepts/domains used to evaluate the intervention and,
for each one, the analysis metric and principal time point or period of interest.
SPIRIT-13 Include a schedule of PRO assessments (with rationale for the time
points) and specify time windows and whether order of administration will be
standardized if using multiple questionnaires.
SPIRIT-14 State the required sample size and how it was determined (if PRO is the
primary endpoint). If sample size is not established based on PRO, discuss the
power of the principal PRO analyses.
SPIRIT-18a(i) Justify the PRO instrument that will be used and describe domains,
number of items, recall period, instrument scaling, and scoring. Provide informa-
tion about the instrument measurement properties, interpretation guidelines,
patient acceptability and burden, as well as the user manual, if available.
SPIRIT-18a(ii) Include the data collection plan that outlines the different modes of
administration and the setting.
SPIRIT-18a(iii) Specify whether more than one language version will be used and
state whether translated versions have been developed and will be used.
SPIRIT-18a(iv) If trial requires a proxy-reported outcome, state and justify the use of
a proxy respondent and cite the evidence of the validity of the proxy assessment,
if available.
SPIRIT-18b(i) Specify the PRO data collection and management strategies used to
minimize missing data.
SPIRIT-18b(ii) Describe the process of PRO assessment for participants who dis-
continue or deviate from the assigned intervention.
SPIRIT-20a State PRO analysis methods and include plans for addressing multiplic-
ity and type 1 error.
SPIRIT-20c State how missing data will be described and outline methods for
handling missing items or entire assessments.
SPIRIT-22 State whether PRO data will be monitored during the study to inform
patient care and how it will be managed in a standardized way.

In summary, the SPIRIT-PRO extension provides important consensus-based guidance on PRO-specific information that should be included in clinical trial protocols. Currently, literature reviews suggest that PRO-specific content is frequently absent from or incomplete in clinical trial protocols (Kyte et al. 2014). By following the SPIRIT-PRO statement and checklist, researchers can
enhance the quality of the design, conduct, and analysis of clinical trials for which
PROs are primary, secondary, or exploratory outcomes. By conducting high-quality
clinical trials with PRO-specific content, results can contribute to the global PRO
evidence base and appropriately inform decision-making, labeling claims, clinical
guidelines, and health policy (Calvert et al. 2018).

Reporting PRO Results from Clinical Trials

Standards exist to improve the quality and completeness of reporting PROs from clinical trials (Calvert et al. 2013). The Consolidated Standards of Reporting Trials (CONSORT) statement, first published in 1996 to improve clinical trial reporting (Appendix 10.5), is intended as a tool for authors, reviewers, and consumers and is endorsed by major journals and editorial groups. Since then, extensions have been developed to address specific trial designs and methods, including a CONSORT PRO extension published in 2013 with recommendations for RCTs that incorporate PROs, in order to facilitate interpretation of PRO results and inform clinical care (Calvert et al. 2013). The CONSORT PRO extension describes five specific recommendations that supplement the standard CONSORT guidelines. The specific PRO items include the following: (1) PROs should be identified as primary or secondary outcomes in the abstract; (2) the scientific background and rationale, as well as a description of the hypothesis and relevant domains (for multidimensional instruments), should be provided; (3) evidence for the
validity and reliability of the instrument should be provided or cited; (4) statistical
approaches to handle missing PRO data should be explicitly stated; and (5) PRO-
specific limitations and generalizability of results should be discussed (Calvert et al.
2013). The CONSORT PRO extension includes specific examples and explanations
for each of these recommendations (Calvert et al. 2013). The complete CONSORT
PRO statement and checklist can be accessed online at http://www.consort-statement.org/extensions.

Summary and Conclusion

PROs are important sources of information for the evaluation of clinical trial out-
comes. There is an increasing emphasis on the use of PROs to inform healthcare policy
and clinical care. Routine collection of PROs has also been shown to improve quality
of care, engagement, and survival in some cases (Basch 2010). To ensure high-quality
PRO data, guidance and recommendations exist for the design of trials (FDA PRO
guidance), protocol development (SPIRIT-PRO), as well as the reporting of PROs
(CONSORT). The role of PROs in clinical trials will depend on the study objectives,
the disease/condition, and the treatment/intervention. It is important to recognize some
of the challenges with using PROs in clinical trials, such as the difficulty of validating different PRO measures, missing data, and the reporting biases that inherently exist in
their assessment. Given the added value that PROs provide as well as their potential
for improving patient-centered care, PROs should be incorporated into clinical trial
design following the available guidelines and recommendations that exist to ensure
high-quality PRO data.

Key Facts

– The use of PROs in clinical trials can provide added value about the efficacy and
tolerability of an intervention
– PROs are increasingly being used as primary or secondary outcomes to support
labeling claims in the United States and inform clinical decision making
– There remains a need to standardize methods and coordinate efforts in the
collection, assessment, analysis, and reporting of PROs
– Guidance and recommendations have been established to improve the quality and
methodologic rigor of clinical trials that incorporate PROs

References
Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H,
Fleishman SB, de Haes JC et al (1993) The European Organization for Research and Treatment
of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in
oncology. J Natl Cancer Inst 85(5):365–376
Atkinson TM, Stover AM, Storfer DF, Saracino RM, D’Agostino TA, Pergolizzi D, Matsoukas K,
Li Y, Basch E (2017) Patient-reported physical function measures in cancer clinical trials.
Epidemiol Rev 39(1):59–70
Banta D (2003) The development of health technology assessment. Health Policy 63(2):121–132
Basch E (2010) The missing voice of patients in drug-safety reporting. N Engl J Med 362(10):865–869
Basch E (2012) Beyond the FDA PRO guidance: steps toward integrating meaningful patient-reported outcomes into regulatory trials and US drug labels. Value Health 15(3):401–403
Basch E (2018) Patient-reported outcomes: an essential component of oncology drug development and regulatory review. Lancet Oncol 19(5):595–597
Basch E, Reeve BB, Mitchell SA, Clauser SB, Minasian LM, Dueck AC, Mendoza TR, Hay J,
Atkinson TM, Abernethy AP, Bruner DW, Cleeland CS, Sloan JA, Chilukuri R, Baumgartner P,
Denicoff A, Germain DS, O’Mara AM, Chen A, Kelaghan J, Bennett AV, Sit L, Rogak L, Barz
A, Paul DB, Schrag D (2014) Development of the National Cancer Institute’s patient-reported
outcomes version of the common terminology criteria for adverse events (PRO-CTCAE). J Natl
Cancer Inst 106(9):dju244. https://doi.org/10.1093/jnci/dju244. PMID: 25265940; PMCID:
PMC4200059
Basch E, Rogak LJ, Dueck AC (2016) Methods for implementing and reporting patient-reported
outcome (PRO) measures of symptomatic adverse events in cancer clinical trials. Clin Ther 38
(4):821–830
Basch E, Deal AM, Dueck AC, Scher HI, Kris MG, Hudis C, Schrag D (2017a) Overall survival results of a trial assessing patient-reported outcomes for symptom monitoring during routine cancer treatment. JAMA 318(2):197–198
Basch E, Dueck AC, Rogak LJ, Minasian LM, Kelly WK, O’Mara AM, Denicoff AM, Seisler D, Atherton PJ, Paskett E, Carey L, Dickler M, Heist RS, Himelstein A, Rugo HS, Sikov WM, Socinski MA, Venook AP, Weckstein DJ, Lake DE, Biggs DD, Freedman RA, Kuzma C, Kirshner JJ, Schrag D (2017b) Feasibility assessment of patient reporting of symptomatic adverse events in multicenter cancer clinical trials. JAMA Oncol 3(8):1043–1050
Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD (2013) Reporting of
patient-reported outcomes in randomized trials: the CONSORT PRO extension. JAMA 309
(8):814–822
Calvert M, Kyte D, Mercieca-Bebber R, Slade A, Chan A-W, King MT, and the SPIRIT-PRO Group (2018) Guidelines for inclusion of patient-reported outcomes in clinical trial protocols: the SPIRIT-PRO extension. JAMA 319(5):483–494
Calvert M, Kyte D, Price G, Valderas JM, Hjollund NH (2019) Maximising the impact of patient reported outcome assessment for patients and society. BMJ 364:k5267
Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B,
Rose M (2007) The patient-reported outcomes measurement information system (PROMIS):
progress of an NIH roadmap cooperative group during its first two years. Med Care
45(5 Suppl 1):S3
Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S, Amtmann D, Bode R, Buysse D, Choi S,
Cook K, Devellis R, DeWalt D, Fries JF, Gershon R, Hahn EA, Lai JS, Pilkonis P, Revicki D,
Rose M, Weinfurt K, Hays R (2010) The patient-reported outcomes measurement information
system (PROMIS) developed and tested its first wave of adult self-reported health outcome item
banks: 2005–2008. J Clin Epidemiol 63(11):1179–1194
Chan A-W, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jerić K, Hróbjartsson A,
Mann H, Dickersin K, Berlin JA, Doré CJ, Parulekar WR, Summerskill WSM, Groves T,
Schulz KF, Sox HC, Rockhold FW, Rennie D, Moher D (2013) SPIRIT 2013 statement:
defining standard protocol items for clinical trials. Ann Intern Med 158(3):200–207
DeMuro C, Clark M, Doward L, Evans E, Mordin M, Gnanasakthy A (2013) Assessment of PRO label claims granted by the FDA as compared to the EMA (2006–2010). Value Health 16(8):1150–1155
Dueck AC, Mendoza TR, Mitchell SA, Reeve BB, Castro KM, Rogak LJ, Atkinson TM, Bennett
AV, Denicoff AM, O’Mara AM, Li Y, Clauser SB, Bryant DM, Bearden JD 3rd, Gillis TA,
Harness JK, Siegel RD, Paul DB, Cleeland CS, Schrag D, Sloan JA, Abernethy AP, Bruner DW,
Minasian LM, Basch E (2015) National Cancer Institute PRO-CTCAE Study Group. Validity
and Reliability of the US National Cancer Institute’s Patient-Reported Outcomes Version of the
Common Terminology Criteria for Adverse Events (PRO-CTCAE). JAMA Oncol 1(8):1051–
1109. https://doi.org/10.1001/jamaoncol.2015.2639. Erratum in: JAMA Oncol. 2016 Jan;2
(1):146. PMID: 26270597; PMCID: PMC4857599
Dy SM, Roy J, Ott GE, McHale M, Kennedy C, Kutner JS, Tien A (2011) Tell us™: a web-based
tool for improving communication among patients, families, and providers in hospice and
palliative care through systematic data specification, collection, and use. J Pain Symptom
Manag 42(4):526–534
Dy SM, Walling AM, Mack JW, Malin JL, Pantoja P, Lorenz KA, Tisnado DM (2014) Evaluating
the quality of supportive oncology using patient-reported data. J Oncol Pract 10(4):e223–e230
European Medicines Agency, Committee for Medicinal Products for Human Use (2005) Reflection paper on the regulatory guidance for the use of health related quality of life (HRQL) measures in the evaluation of medicinal products. Retrieved 3 Mar 2019 from https://www.ema.europa.eu/en/regulatory-guidance-use-health-related-quality-life-hrql-measures-evaluation-medicinal-products
Frank L, Basch E, Selby JV, for the Patient-Centered Outcomes Research Institute (2014) The PCORI perspective on patient-centered outcomes research. JAMA 312(15):1513–1514
Ganz PA, Hays RD, Kaplan RM, Litwin MS (2014) Measuring health-related quality of life and
other outcomes. In: Kominski GF (ed) Changing the U.S. health care system: key issues in
health services policy and management. Wiley, San Francisco, pp 307–341
Gnanasakthy A, DeMuro C, Clark M, Haydysch E, Ma E, Bonthapally V (2016) Patient-reported
outcomes labeling for products approved by the Office of Hematology and Oncology Products
of the US Food and Drug Administration (2010–2014). J Clin Oncol 34(16):1928–1934
Gnanasakthy A, Mordin M, Evans E, Doward L, DeMuro C (2017) A review of patient-reported
outcome labeling in the United States (2011–2015). Value Health 20(3):420–429
HealthMeasures. PROMIS (2019) Explore measurement systems. Retrieved March 20, 2019 from
http://www.healthmeasures.net/explore-measurement-systems/promis/intro-to-promis
Institute of Medicine (2011) Leading health indicators for healthy people 2020: letter report. National Academies Press, Washington, DC. Retrieved 15 Mar 2019 from http://books.nap.edu/openbook.php?record_id=13088&page=R1
Kluetz PG, Pazdur R (2016) Looking to the future in an unprecedented time for cancer drug
development. Semin Oncol 43(1):2–3
Kluetz PG, O’Connor DJ, Soltys K (2018) Incorporating the patient experience into regulatory decision making in the USA, Europe, and Canada. Lancet Oncol 19(5):e267–e274
Kyte D, Duffy H, Fletcher B, Gheorghe A, Mercieca-Bebber R, King M, Draper H, Ives J,
Brundage M, Blazeby J, Calvert M (2014) Systematic evaluation of the patient-reported
outcome (PRO) content of clinical trial protocols. PLoS One 9(10):e110229
Lipscomb JG, Gotay CC, Snyder C (2005) Outcomes assessment in cancer: measures, methods, and
applications. Cambridge University Press, Cambridge
Nayfield SG, Ganz PA, Moinpour CM, Cella DF, Hailey BJ (1992) Report from a National Cancer
Institute (USA) workshop on quality of life assessment in cancer clinical trials. Qual Life Res 1
(3):203–210
Piantadosi S (2017) Clinical trials: a methodologic perspective. Wiley, Hoboken
Spilker B (1996) Quality of life and pharmacoeconomics in clinical trials. Lippincott-Raven
Publishers, Philadelphia
Testa MA, Simonson DC (1996) Assessment of quality-of-life outcomes. N Engl J Med 334
(13):835–840
US Department of Health and Human Services (USDHHS): Food and Drug Administration (2006) Draft guidance for industry. Patient-reported outcome measures: use in medical product development to support labeling claims. February 2006. www.ispor.org/workpaper/FDAPROGuidance2006.pdf. Accessed 1 Mar 2019
US Department of Health and Human Services (USDHHS): Food and Drug Administration (2009) Guidance for industry. Patient-reported outcome measures: use in medical product development to support labeling claims. December 2009. www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf. Accessed 1 Mar 2019
52 Translational Clinical Trials

Steven Piantadosi

Contents

Introduction 940
Definitions 941
Issues in Translational Trials 942
Examples 944
Reducing Uncertainty 944
Sample Size 946
Safety Versus Efficacy 948
Summary and Conclusion 949
Key Facts 949
Cross-References 949
References 950

Abstract
Translational research and clinical trials are often discussed, especially in academic centers, from the perspective that such efforts are difficult or endangered.
Yet every new therapeutic must pass through this stage of investigation where
promising evidence supports staged development with the goal of product regis-
tration. Many clinical investigators instinctively know that relatively small clin-
ical trials can be essential in translation, but this is often contrary to the statistical
rigors of later development. This chapter attempts to reconcile these equally valid
perspectives.

Keywords
Translational research · Translational clinical trials · Biomarkers · Information ·
Entropy · Sample size

S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]

Introduction

The terms basic research, clinical research, and translational research require some
practical definitions. Basic and clinical investigations are the ends of a spectrum and
therefore have always been part of the research landscape by default. Basic implies
that the research does not need an immediate application but represents knowledge
for its own sake. Clinical is the direct application to care or prevention of illness –
literally “at the bedside.” Without fanfare, translational research takes place between
the two ends of the spectrum, so it too has always been with us but perhaps less
obvious.
Historically, academic centers seem to have concentrated on basic and clinical
research, while commercial entities focused on translational research. Translational
research began to be characterized explicitly mostly in the 1990s and later.
Translation is difficult to define universally, but many academic institutions and
sponsors try to foster it alongside basic and clinical research at least in the sense that
they know it when they see it. A problem with a precise definition is that there is no
single anchor for the domain of translation like the laboratory or clinic. In fact, the
apparent gap between basic and clinical research often seems to be widening (Butler
2008). There is a National Center for Advancing Translational Sciences (NCATS) as
part of the National Institutes of Health. But NCATS does not offer a simple clear
definition.
Translational clinical trials are elusive because the label is often applied even
when the research has conventional developmental purposes such as dose finding. A
search of ClinicalTrials.gov (National Library of Medicine 2019) found only 243 tri-
als with the term “translational” out of over 300,000 entries in the database. The
search filters were “recruiting,” “enrolling by invitation,” “interventional,” “early
phase 1,” “phase 1,” and “phase 2.” Removing all filters except “interventional”
raised the number to 2888. These are imperfect snapshots. For example, a random-
ized trial came up under the filter “early phase 1.” ClinicalTrials.gov does not have
an explicit filter for translational trials.
The clinical scope of trials captured was universal as one might expect. Most of
the trials were probably not translational in the sense to be developed in this chapter.
But many trials have components or sub-studies that have translational objectives.
Many other studies in the database used similar descriptors but were not obviously
interventional trials. It seems unlikely that fewer than one in a thousand trials is translational, but this low number points to nonuniformity in the way the concept
is used. The characterizations of translational trials in this chapter will push toward
common concepts and usage.
Presently there are about 100 medical journals in at least 5 languages with the
word “translational” in their title or self-identified as translational. Most of these are
online. A wide range of disciplines and diseases are represented among them,
including pediatrics, cardiovascular disease, cancer, psychiatry, immunology, and
informatics. As one might expect from the minority of clinical trials that are
classified as translational, publications in such journals seem to emphasize techno-
logical and pre-developmental studies.
This chapter will attempt to define and characterize translational clinical trials as a
distinct type apart from the typical developmental classifications such as phase I, II,
or III. Major conceptual differences are that unlike developmental trials, translational
clinical trials are unlikely to employ clinical outcome measures and may beget
laboratory experiments as readily as additional clinical trials. Due to the high
uncertainty regarding treatment effects when they are begun, even the relatively
weak evidence produced from translational trials is informative for initiating or
suspending developmental steps.

Definitions

For this chapter, I will define translational research simply as converting observa-
tions from the laboratory into clinical interventions. This definition keeps away from
the ends of the spectrum in the following sense. Basic science observations must
already exist: translational research does not discover them. Clinical interventions
are created and await confirmation: translational research does not prove health
benefits.
With this simple definition of translational research, a translational clinical trial
(TCT) will be seen to be something special. First, although a TCT takes place in the
clinic, it relies heavily on laboratory foundations. Second, a TCT will not provide
strong evidence for therapeutic efficacy. Third, a TCT must inform the subsequent
clinical trials or laboratory experiments that will lead to strong evidence.
For the purposes of this chapter, I will use the following definition of a transla-
tional clinical trial (Piantadosi 2005):

A clinical trial where the primary outcome: 1) is a biological measurement (target) derived
from a well-established paradigm of disease, and 2) represents an irrefutable signal regarding
the intended therapeutic effect. The design and purposes of the trial are to guide further
experiments in the laboratory or clinic, inform treatment modifications, and validate the
target, but not necessarily to provide reliable evidence regarding clinical outcomes.

Therapy acts on a signal derived from a disease model but measured in the clinic
where it implicates definitive effects (Fig. 1). A TCT should inform the design of
subsequent experiments by reducing uncertainty regarding effects on the target.
Hence it should be designed to yield useful evidence whether the treatment succeeds
or fails. It must contain two explicit definitions for lack of effect: one for each study
subject and another for the entire study cohort. A TCT will not carry formal
hypothesis tests regarding therapeutic efficacy. One result for such a trial might be
that the treatment requires modifications or replacement: translational trials are
circular between the laboratory and clinic.
A translational treatment is not fixed. Imagine needing to modify a small mole-
cule or antibody based on an initial human trial. Not only might the treatment itself
change, but it is being tracked by a signal derived from the laboratory and measured
in people. Clinical outcomes will come only in later more definitive studies.
Fig. 1 Translational clinical trial paradigm. The irrefutable signal is derived from understanding of
the disease model but it is measured in human subjects. The treatment modifies the signal which
then implies alteration in a definitive outcome

Table 1 Additional characteristics of translational trials

The disease model must contain relevant elements of the actual disease process
The target draws support from the disease model to implicate an effect on the definitive outcome
There can be no targets (surrogates) for safety
Can observe high risks but cannot reliably establish safety
Evidence for efficacy can evolve prior to strong evidence regarding safety, reversing the typical sequence
Informs the design of subsequent experiments by reducing uncertainty regarding effects on the target

Therefore, TCTs are pre-developmental. Such studies can sometimes be nested within typical developmental clinical trials. Most often they will be single-cohort studies. Additional requirements and characteristics of a translational trial are listed in Table 1.

Issues in Translational Trials

TCTs may focus on any of several natural questions. The most pointed question is targeting: does the therapy hit the biological target, and does it appear to produce the intended effect? A small molecule or biologic might target a cellular receptor, enzyme, antibody ligand, or gene, for example. The answer to this question implies quantitative measurement of the relevant product.
Another question in translation might be signaling: are there products or effects (signals) downstream from the target that reveal the intended effect after treatment? If we observe and measure the signal, we can reasonably infer that the target was hit. A signal might be a change in activity level or the switching of a gene's expression, for example.
A TCT could focus on feasibility: can we successfully implement and refine an unproven complex method for delivering a therapy? Feasibility is vital when it is a legitimate question but a straw man otherwise. Feasibility is not an appropriate objective when the ability to administer a treatment is neither complex nor questionable. Such questions are sometimes chosen merely to deflect criticism from small or poorly designed studies.
In a feasibility study, two definitions for infeasible are essential because it represents
a type of failure mentioned above. One definition pertains to each study participant so
there is an outcome measure relevant to the primary objective in every subject. A
second definition of infeasible refers to the entire study cohort so there will be a
prespecified measure of success. That tolerance specification also needs a precision
or confidence level. For example, suppose our translational trial addresses delivery
feasibility and we have an appropriate list of show-stopping problems that could be
encountered. Assume we can continue development only if 85% or more of subjects
have no clinically significant delivery problems. The smallest possible study that can
satisfy our requirement is 20 subjects, none of whom could have feasibility problems.
Then the lower 95% confidence bound on the success rate would exceed 85%.
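The bound behind this example can be computed directly. When all n subjects succeed, the exact one-sided lower confidence bound on the success rate is the p solving p^n = alpha, that is, alpha^(1/n). A minimal sketch, assuming the exact (Clopper-Pearson) formulation at 95% confidence:

```python
def lower_bound_all_successes(n, alpha=0.05):
    """Exact one-sided lower confidence bound on a success rate when
    all n subjects succeed: the p solving p**n = alpha."""
    return alpha ** (1.0 / n)

for n in (15, 20, 30):
    print(n, round(lower_bound_all_successes(n), 3))
# n=20 gives ~0.861, so a flawless run of 20 subjects places the
# feasibility rate above the 85% tolerance with 95% confidence.
```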
The potential scope of delivery issues in translation is wide as a few examples will
illustrate. A drug might depend on its size, polarity, or solubility for reaching its target.
The dose or schedule will determine blood and tissue levels. For oral medications
subject behavior, adherence, or diet may also be factors. Individual genetic or epige-
netic characteristics can affect drug exposure via metabolism. For gene or cell
therapies, properties of the vector, dose or schedule, need for replication, immunolog-
ical characteristics of the recipient, and individual genetic or epigenetic characteristics
may affect delivery. Devices or skill-dependent therapies may depend on procedural
technique, function of a device, or the anatomy of the subject or disease. This daunting
array of issues indicates that study goals must be set thoughtfully, and off-the-shelf
drug development designs may not be appropriate for many feasibility questions.
TCTs may involve biomarkers at any of several levels. They are not unique in
this regard, but some uses of biomarkers are specific to translational trials. A
biomarker is an objective measurement that informs disease status or treatment effects. Surrogate outcomes likewise track the effects of treatment, but they also change in proportion to the way a definitive outcome would respond to the treatment. From the
perspective of trial design, a biomarker creates a subset in the study population. It
predicts whether a treatment works or carries information regarding prognosis. A
principal design question for a biomarker is if the study population should be
enriched with respect to it. If the biomarker indicates definitively whether treatment
will work, the population should be selected accordingly.
In some cases, companion diagnostics are at issue in translational trials. A
companion diagnostic is the test that reveals the biomarker level or presence. Such
a test might be new and could be based on evolving technology. The diagnostic test
could therefore be refined alongside the therapeutic. Different companion
diagnostic tests, if they exist, could yield different study compositions and poten-
tially different results.

Examples

As a hypothetical example of a TCT, suppose we have a local gene therapy for brain
tumors to be delivered using a well-studied (safe) lentiviral vector. The virus delivers
the gene product to tumor cells where it stops their growth and kills them. Suppose
further that production of the delivery vector is straightforward, injecting the viral
particles is feasible, and the correct dose is known from previous studies. The
treatment will be administered days prior to routine surgical resection of the tumor
so effects can be measured clinically and in the specimens.
An appropriate design for the first human trial in this circumstance will not
emphasize a dose question. To rule out adverse events at a 10% threshold requires
at least 30 subjects yielding zero such events. Then the upper 95% confidence bound
on the adverse event rate would be 0.1. This would be a relatively large early clinical
trial. However, we might reliably establish the presence of gene product in resected
tumor cells with many fewer subjects. Furthermore, seeing a handful of tumor
responses prior to surgery would also be promising evidence of efficacy. Hence a
translational trial could reveal more about efficacy than safety.
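The safety side of this arithmetic mirrors the feasibility bound above: with zero adverse events in n subjects, the exact one-sided upper confidence bound on the event rate is 1 - alpha^(1/n). A minimal sketch:

```python
def upper_bound_zero_events(n, alpha=0.05):
    """Exact one-sided upper confidence bound on an event rate when
    zero events are observed in n subjects: the p solving (1 - p)**n = alpha."""
    return 1.0 - alpha ** (1.0 / n)

print(round(upper_bound_zero_events(30), 3))  # ~0.095, i.e., roughly 10%
```

This is the familiar "rule of three" in another guise: for moderate n the bound is close to 3/n.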
An example of an actual translational trial is in the ClinicalTrials.gov database at
record NCT02427581 which is a breast cancer vaccine trial (Gillanders et al. 2019).
Subjects in this trial have not had a complete response after chemotherapy and are at
very high risk for disease recurrence. The planned sample size is 15, and the primary
outcome measure is grade and frequency of adverse events at 1 year. Secondary
outcomes of the trial are immunogenicity measures for the vaccine.
Another translational trial is illustrated by a test of mushroom powder for
secondary breast cancer prevention in 24 subjects (Palomares et al. 2011). The
trial was formally described as “dose finding” although up to 13 g of white button
mushroom was not a difficult tolerance challenge, especially compared to typical
cancer therapies. The translational objectives were aromatase inhibition, with
response defined as a 50% decrease in free estradiol. No participants met the
predefined response criterion in this trial according to the report.
A final example of a translational clinical trial is a test of valproate for
upregulating CD20 levels in patients with chronic lymphocytic leukemia (Scialdone
et al. 2017). This trial was planned in four subjects but one dropped out due to a
hearing disorder. No upregulation of CD20 mRNA or protein could be detected
in vivo in cells from patients on this trial according to the report.

Reducing Uncertainty

Clinicians want translational clinical trials to be small. I have often heard excellent
clinical investigators indicate that there would be much to learn from an early test of
a new therapeutic idea in a handful of subjects, say 6 or 8, for example. This is
consistent with some of the examples just cited. Resource limitations are partly at the
root of clinicians’ concerns to make TCTs small. But the reasoning is more consid-
ered than resources alone. Small experiences can be critical when uncertainty is high.
But statisticians tend to disrespect this notion because statistical learning is
connected to narrowing confidence or probability intervals around estimands. A
handful of observations does not do much by those measures.
Both perspectives are correct. Statisticians are usually interested in definitive
evidence in service of decisions. Clinical investigators often seek ways to reduce
uncertainty regarding the biological effects of a new therapy as a prerequisite for
further clinical trials. In settings of high uncertainty, relatively few observations can
reduce uncertainty and lead in the correct direction for further experimentation. Here
I will illustrate some hypothetical circumstances of high uncertainty and its reduction
using small trials.
Suppose that our therapy can yield a positive, neutral, or negative outcome.
Before the trial, assume we are maximally uncertain as to the effect that the treatment
will produce. This means that we hypothesize that each outcome is equally likely.
We hope our experiment will reveal a dominance of positive outcomes. When the
trial is over, we assess the information gained and decide what studies should be
performed next. We will temporarily put aside concerns over sample size and focus
only on the outcome frequencies produced by such a trial.
Table 2 shows a hypothetical example. Before the trial, each outcome is assumed to have an equal chance of happening. After the trial, suppose half the subjects have a positive result, and 25% each have negative or neutral results. The gain in information can be calculated using entropy (Gillanders et al. 2019) and yields a value of 0.06. Alternatively, the relative information in these results can be calculated using the Kullback-Leibler divergence (Kullback and Leibler 1951) and yields a value of 0.05. The Kullback-Leibler divergence is sometimes called relative entropy and can be taken as a measure of surprise. Most of us do not have an intuitive feel for these information values, but both indicate a small gain in information. Depending on the clinical context, this might be a very promising result because half the subjects seemed to benefit from the therapy.
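These quantities are simple to compute. A minimal sketch for the Table 2 probabilities, using natural logarithms; the exact decimals depend on rounding conventions, so small differences from the quoted values are expected:

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), or relative entropy, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

before = [1/3, 1/3, 1/3]    # maximal uncertainty
after = [0.25, 0.25, 0.50]  # observed outcome frequencies
print(round(entropy(before) - entropy(after), 3))  # ~0.059 nats of uncertainty removed
print(round(kl_divergence(before, after), 3))      # ~0.057 nats of relative entropy
```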
The consequences of assuming too much prior to the experiment can be
serious. Consider the hypothetical results in Table 3 where optimism prior to
the trial was very strong. The same trial results from Table 2 are measured
against that very optimistic initial hypothesis. The information value is 0.52,
and the divergence is 0.28, indicating that before and after are quite different.
However, we seem to know less after the trial than before because we have “lost”
information. In a sense this is true because the initial hypothesis implied too

Table 2 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Before the trial, maximum uncertainty regarding the treatment effect was assumed
Outcome probabilities
Time Neg Neutral Pos
Before 0.33 0.33 0.33
After 0.25 0.25 0.50
Table 3 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Before the trial, strong assumptions were made regarding the treatment effect. Implications are
different compared to Table 2, despite the outcome data being identical
Outcome probabilities
Time Neg Neutral Pos
Before 0.05 0.10 0.85
After 0.25 0.25 0.50

Table 4 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Prior assumptions were strong. Formally there is no gain in information but the results are
divergent and carry biological implications
Outcome probabilities
Time Neg Neutral Pos
Before 0.10 0.50 0.40
After 0.50 0.40 0.10

much certainty. This might be a very unpromising result if the strong optimism
prior to the trial was justified.
As a third example, consider the scenario in Table 4. There the anticipated
outcome probabilities are the same as those observed after the trial but rearranged.
The information gain is apparently zero. The divergence is 0.5, indicating before and
after are substantially different. Such a result would likely prevent development
because half the subjects show negative outcomes.
These examples have assumed we have reasonable estimates of the classification
or response probabilities when the TCT is complete. In small sample sizes, the
calculated information and its variance are biased (Butler 2008). The bias can be
substantially reduced in modest sample sizes, meaning that we can then obtain
reasonable quantitative estimates of information gain.
These simple examples show that simple prior hypotheses coupled with outcome
summaries yield quantifiable information and relative information. It seems appro-
priate to assume we are uncertain before human data are obtained. The clinical
setting and properties of existing treatments must be brought into the assessment of
how promising results are.

Sample Size

We can now return to the question of how large such trials need to be. What does it
take to get reasonable estimates of information gain? Simulation according to the
following algorithm provides an answer. We begin with a fixed sample size and
combinatorically enumerate all possible outcomes, calculating entropy and diver-
gence for each. Each outcome has a probability of occurring according to a multi-
nomial distribution that is assumed to represent the truth of nature. From the
probabilities we construct the cumulative distribution functions (CDFs) of entropy
and divergence. Then by varying the sample size, we can observe its effect on the
CDF or expected values of entropy and divergence. The impact of different assumed
true multinomial distributions can also be studied.
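A minimal sketch of this enumeration, assuming three outcome categories and a uniform prior and truth, is given below; all names and defaults are illustrative, not from the chapter. The running sum of the outcome probabilities, ordered by divergence, is the CDF described above.

```python
from math import factorial, log

def multinomial_prob(counts, probs):
    """Probability of observing the given counts under a multinomial."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, pr in zip(counts, probs):
        p *= pr ** c
    return p

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def divergence_distribution(n, truth, prior):
    """Enumerate every 3-category outcome of a size-n trial; return
    (divergence, probability) pairs sorted by divergence."""
    pts = []
    for n1 in range(n + 1):
        for n2 in range(n + 1 - n1):
            counts = (n1, n2, n - n1 - n2)
            freq = tuple(c / n for c in counts)
            pts.append((kl(freq, prior), multinomial_prob(counts, truth)))
    return sorted(pts)

uniform = (1/3, 1/3, 1/3)
for n in (6, 12, 18, 36):
    pts = divergence_distribution(n, truth=uniform, prior=uniform)
    expected = sum(d * p for d, p in pts)
    print(n, round(expected, 4))   # expected divergence falls toward 0 as n grows
```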
CDFs for entropy and divergence from such a simulation with varying sample
sizes are shown in Figs. 2 and 3. Small sample sizes with coarse estimates of the
CDF are biased as one would expect intuitively. This is demonstrated by the
rightward shift of the CDFs as sample size increases in Figs. 2 and 3. However,
the bias diminishes rapidly with increasing sample size, especially for the KL
divergence CDF. Very similar curves result from alternative true multinomial distri-
butions. This indicates that nearly bias-free estimates of information and divergence
for a TCT can be obtained using small sample sizes. Large sample sizes will, of
course, increase the precision of estimated outcome probabilities, but a picture of
(relative) information emerges sooner. For example, returning to Table 1, the simu-
lations suggest that sample sizes in the range of 12–18 subjects give an accurate
assessment of the distribution of KL divergence values for all possible outcomes.
There is a sense in which the results regarding sample size seem too good to be
true. Small sample sizes relative to conventional standards are useful with respect to
acquiring information. We know such sample sizes would be inadequate for most
consequential clinical decisions such as comparing treatments. But the usefulness of
information measures depends on context. When uncertainty about a treatment effect
is high, as it is in translation, evidence weaker than what would otherwise be needed
to change medical practice is valuable and can guide subsequent experimental steps.

Fig. 2 CDF of entropy as defined in Shannon (1948). Three outcome categories were used as in
Table 1 in the text, and the multinomial probabilities were uniform. Curves correspond to sample
sizes ranging from 6 to 36. A reference curve for N = 100 is shown to indicate how small sample
sizes perform relative to it

Fig. 3 CDF of Kullback-Leibler divergence. The same conditions as in Fig. 2 apply

A TCT
shares as many characteristics with lab experiments as with definitive clinical trials.
The decisions faced after a TCT concern the treatment effect on the irrefutable signal
and the best next experiment. Whether or not to accept a treatment for wide use is
not an issue. Therefore, lesser evidence is required to guide the next experimental
steps.

Safety Versus Efficacy

Another dimension for translational clinical trials is the extent to which they can
reassure investigators as to the safety of the intervention in question. Safety is a
ubiquitous concern in clinical trials but is more aspirational than real in small studies.
There are two strong reasons for this. One is biological: there are no surrogates for
safety, so we never have more promising evidence than zero events in the trial
cohort. In contrast, efficacy may show promise via the irrefutable signal which is its
intended purpose.
A second reason why proof of safety is unattainable in TCTs derives from
statistical considerations. Safety is not a measurement but is a clinical judgment
based on an informative absence of events. Only sizable experiences are informative.
For example, if 10% is the upper tolerance for serious adverse events, data would
have to show zero events in 30 subjects for the confidence bound on the event rate to
be below the 10% limit. A 5% limit would require 60 event-free subjects. These are
realistic tolerances for serious adverse events putting proof of safety out of reach of
many TCTs that enroll only a few subjects.
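The arithmetic behind these figures follows from the exact binomial bound for zero events: with 0 events in n subjects, the one-sided upper 100(1 − α)% confidence limit on the event rate is 1 − α^(1/n). A quick check of the quoted sample sizes (a sketch):

```python
# Upper one-sided 95% confidence limit after observing 0 events in n subjects:
# solve (1 - p)**n = alpha for p, giving p = 1 - alpha**(1/n)
alpha = 0.05
for n in (30, 60):
    upper = 1 - alpha ** (1 / n)
    print(n, round(upper, 3))   # 30 -> 0.095 (< 10%); 60 -> 0.049 (< 5%)
```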
To contrast, far fewer subjects could show evidence of efficacy. For example, a
handful of disease responses out of 10 or 12 participants in a TCT could be very
promising. Hence, we must be alert to the chance to learn about efficacy before
safety in such trials. This is a reverse of the stereotypical sequence.

Summary and Conclusion

The goal of this chapter has been to describe features of translational clinical trials
that codify them as a distinct type of study. There are trials that fit this paradigm very
well. However, many studies labeled as such have only certain components that are
translational. Still others are relatively ordinary developmental trials that carry a
translational label probably to make them appear more vital because of the impor-
tance of the term.
The key feature of a translational trial is the intersection of a well-characterized
laboratory model of disease with the actual human condition. An irrefutable signal is
derived and validated from laboratory studies but measured as an outcome in the
clinical trial. If it does not reflect favorably on the treatment, investigators must
either abandon or redesign the therapy.
Statistical precision is not obtained in the small experiences of many translational
trials. But measurable gains in information relative to a state of maximal uncertainty
can be obtained. The acceptability of this perspective may be a point of divergence
for the instincts of clinician investigators versus statisticians. In any case, transla-
tional trials must provide enough information to justify performing more expensive
developmental studies or to terminate or not begin development at all. Trialists
should be alert to the needs of clinical studies in translation so they are not forced
into unserviceable commonplace designs.

Key Facts

Translation is a key step in the evolution of a new therapeutic idea, whether viewed
as predevelopment or as part of staged development. It constitutes the bridge
between basic science and preclinical investigations and human trials. Relatively
small translational trials can reduce the considerable uncertainty regarding therapeu-
tic effects at this stage of development. Such trials rely on biological models as
implemented in the laboratory and biological targets and signaling. Simple
approaches to quantifying information gained show that small sample sizes can
reduce uncertainty enough to guide subsequent clinical trials.

Cross-References

▶ Power and Sample Size



References
Butler D (2008) Translational research: crossing the valley of death. Nature 453(7197):840–842
Gillanders WE et al (2019) Safety and immunogenicity of a personalized synthetic long peptide
breast cancer vaccine strategy in patients with persistent triple-negative breast cancer following
neoadjuvant chemotherapy. ClinicalTrials.gov identifier: NCT02427581
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
National Library of Medicine (2019) ClinicalTrials.gov. https://clinicaltrials.gov
Palomares MR et al (2011) A dose-finding clinical trial of mushroom powder in postmenopausal
breast cancer survivors for secondary breast cancer prevention. J Clin Oncol 29(15_suppl):
1582–1582
Piantadosi S (2005) Translational clinical trials: an entropy-based approach to sample size. Clin
Trials 2:182–192
Scialdone A et al (2017) The HDAC inhibitor valproate induces a bivalent status of the CD20
promoter in CLL patients suggesting distinct epigenetic regulation of CD20 expression in CLL
in vivo. Oncotarget 8(23):37409–37422. https://doi.org/10.18632/oncotarget.16964
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423
53 Dose-Finding and Dose-Ranging Studies

Mark R. Conaway and Gina R. Petroni

Contents
Introduction
Designs Based on Increasing Dose-Toxicity Curves
  Rule-Based Algorithm
  Interval-Based Methods for Dose-Finding
  Model-Based Methods for Dose-Finding
  Semiparametric and Order-Restricted Methods for Dose-Finding
Evaluating Methods for Dose-Finding and Dose-Ranging
  Operating Characteristics
  Ease of Implementation and Adaptability
  Principles
Extensions Beyond Single-Agent Trials with a Binary Toxicity Outcome
  Time-to-Event Toxicity Outcomes
  Combinations of Agents
  Heterogeneity of Participants
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
There is a growing recognition of the importance of well-designed dose-finding
studies in the overall development process. This chapter is an overview of designs
for studies that are meant to identify one or more doses of an agent to be tested
in subsequent stages of the drug development process. The chapter also provides
a summary of dose-finding designs that have been developed to meet the chal-
lenges of contemporary dose-finding trials, including the use of combinations of
agents, more complex outcome measures, and heterogeneous groups of
participants.

Keywords
Interval-based methods · Model-based methods · Operating characteristics ·
Coherence · Combinations of agents · Patient heterogeneity

Introduction

This chapter describes the design of studies that have the goal of identifying one or
more doses of an agent to be tested in subsequent stages of the drug development
process. Piantadosi (2017) makes a distinction between dose-ranging studies, in
which doses are to be explored without a pre-specified objective, and dose-finding,
where the objective is to find a dose that meets a pre-specified criterion such as a
target rate of toxicities. This chapter provides a description of designs for both dose-
ranging and dose-finding studies.
Some early-phase studies are designed as randomized trials (Eussen et al. 2005;
Partinen et al. 2006; Schaller et al. 2010; Vidoni et al. 2015), with participants
allocated randomly among several pre-specified doses. Design considerations for
these trials, such as the choice of outcome measures, sample size, the use of
stratification factors, or interim monitoring, are common to randomized clinical
trials and are covered in Sections 4 (▶ Bias Control and Precision) and 5 (▶ Basics
of Trial Design). The distinguishing feature of the designs described in this chapter is
that allocation of participants to study dose is done sequentially; the choice of a dose
for a participant is determined by the observed outcomes from participants previ-
ously treated in the trial. These trials often involve the first use of an agent or
combination of agents, and the sequential allocation is intended to avoid exposing
participants to undue risk of adverse events. While there are examples of sequential
dose-ranging studies in many fields, including anesthesiology (Sauter et al. 2015)
and addiction research (Ezard et al. 2016), these designs are most commonly
associated with “phase I” trials in oncology.

Designs Based on Increasing Dose-Toxicity Curves

A primary goal of a dose-finding trial is to learn about the safety of, and adverse events
related to, the study agent(s); such trials are often labeled "phase I." Historically, for phase I trials
involving cytotoxic agents in oncology, a main objective was to identify the “max-
imum tolerated dose” (MTD). This remains a main objective even for noncytotoxic
agents. The MTD is defined as the highest dose that can be administered to
participants with an “acceptable” level of toxicity where toxicity is assessed based
upon observed adverse events. The amount of toxicity at a given dose level is
considered “acceptable” if the proportion of participants treated at that dose who
experience a “dose-limiting toxicity” (DLT) is less than or equal to a target level of
toxicity. The definition of a DLT is study specific, depends on the type of agent being
studied, and is used to set the study target level. Traditionally, the target level has
been in the range of 20–33%. Participants are sequentially assigned to dose levels
with the starting dose being the lowest dose. Dose allocations can be made for
individual participants or for groups of participants, referred to as cohorts. Each
participant is assigned to a single dose level and is
observed on a binary outcome measure specifying whether or not the participant
experienced a DLT. Many of the methods for these trials were developed for
cytotoxic agents, where it is assumed that the dose-toxicity and dose-efficacy
relationships are monotonic, in which the probability of a DLT and the potential
for clinical benefit, often termed “efficacy,” both increase with dose (see Fig. 1).
In this setting, the MTD is to be chosen from a pre-specified set of doses,
d1 < d2 < . . . < dK, with the probability that a participant given dose level dk
experiences a DLT denoted by πk, k = 1, . . ., K. The probability of a DLT is assumed
to increase with dose, π1 < π2 < . . . < πK. At any point in the trial, there are nk
participants who have been observed on dose level dk, and of the participants treated,
Yk have experienced a DLT. The target level of toxicity is denoted by θ.
Although the discussion in this chapter will center on dose-finding and dose-
ranging studies in oncology, the designs can be applied more widely to any clinical
setting in which both the probability of an adverse event and efficacy can be
expected to increase with dose. One example is the study of a new anesthetic. The
probability of an adverse event and efficacy, defined in this case as sufficient
sedation, both increase with dose. As in the oncology studies, the goal is to find
the highest dose that can be administered safely, with the added requirement that the
dose yields sufficient sedation to an acceptable proportion of participants.
Fig. 1 An example of a monotonic dose-toxicity relationship

For these trials the statistical design revolves around two questions: (1) How
should doses be allocated to participants as the trial proceeds? and (2) At the
end of the trial, what dose should be
nominated as the MTD? There are, of course, many other clinical and statistical
issues to be made in carrying out a dose-finding or dose-ranging trial, including the
pre-specification of dose levels and the definition of a DLT (Senderowicz 2010), but
this chapter focuses primarily on the statistical issues of dose allocation and estima-
tion of the MTD at the end of the study.
Methods in this situation are broadly categorized as “rule-based,” in which
decisions to decrease, increase, or assign the same dose level for a new participant
are determined by rules for the observed proportion of toxicities at the current dose,
or “model-based,” in which a parametric model is fit to all the accumulated data and
used to guide dose allocation and the estimation of the MTD (Le Tourneau et al.
2009). In practice, the distinction between rule-based and model-based is not
completely clear, as there are methods that use rule-based dose allocation (Storer
1989; Stylianou and Flournoy 2002; Ji et al. 2007) but use a parametric model or
isotonic regression at the end of the trial to estimate the MTD. Other methods (Shen
and O’Quigley 1996; Wages et al. 2011a) are model-based but start with an initial
rule-based stage before using the parametric model. In this chapter, methods are
designated as rule-based or model-based depending on how participants are allo-
cated to doses.

Rule-Based Algorithm

The 3+3 Algorithm for Dose-Ranging


The “3+3” algorithm was used originally for finding the MTD for a single cytotoxic
agent and applies decision rules based upon outcomes from cohorts of three partic-
ipants. Rogatko et al. (2007) reviewed phase I clinical trials reported over a 15-year
period, from 1991 to 2006, and found that 98% of the trials used some version of the
3+3 algorithm. More recently, Conaway and Petroni (2019b) noted that in 2018, a
leading cancer journal published 37 articles that report on dose-finding or dose-
ranging studies. Of the 37 studies, 32 (86%) still used the 3+3, even though few of
the studies involved finding the MTD. There are a few versions of this algorithm
with varying decision rules (Storer 1989; Piantadosi 2017).
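For reference, one common version reduces to the decision function sketched below; this is an illustration of one published variant, not a canonical statement of the algorithm.

```python
def three_plus_three(n_dlt, n_treated):
    """One common version of the 3+3 decision rule at the current dose
    (a sketch; published variants differ in details)."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate to the next dose"
        if n_dlt == 1:
            return "treat 3 more participants at the current dose"
        return "stop; declare the next lower dose the MTD"
    if n_treated == 6:
        if n_dlt <= 1:
            return "escalate to the next dose"
        return "stop; declare the next lower dose the MTD"
    raise ValueError("decisions are made after 3 or 6 participants")
```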
Notwithstanding the prevalence of its use in applications, the 3+3 is generally
dismissed in the statistical literature for its poor operating characteristics. Lin and
Shih (2001) provide an in-depth evaluation of the properties of algorithmic designs
in general, including the 3+3. The authors present results for three scenarios, each
with five dose levels, and show that if the target toxicity were 25%, the percent of
times that the 3+3 correctly selected the true MTD was only 26%, 30%, and 29% in
the three scenarios. Given that choosing a dose at random, without even doing a trial,
would recommend the correct MTD 20% of the time, the results for the 3+3 design
are not impressive. Lin and Shih (2001) also demonstrate that despite the perception
that the 3+3 targets the dose with a 1/3 probability of a DLT, the 3+3 algorithm
design does not have a target toxicity level. Thus, it remains an applied design that
does not target a specific population characteristic but instead describes the outcome
from a small sample which provides little information on what proportion of future
patients would experience a DLT. Similar results are found in Storer (1989), Reiner
et al. (1999), and Iasonos et al. (2008). More general versions of this design, “A+B,”
have been proposed (Ananthakrishnan et al. 2017). An R Shiny app that can be used
to investigate the operating characteristics of A+B designs, including the 3+3, is
given by Wheeler et al. (2016).

Biased Coin Designs


Durham et al. (1997) describe a method for sequential dose allocation and estimation
of the MTD based on random walks which allocates participants to dose level in
cohorts of 1. If the current participant has been treated at dose level dk and experiences
a DLT, the next participant is treated at the next lower dose level dk  1. If the current
participant does not experience a DLT, the next participant is treated at the current
dose, dk, with probability θ/(1θ) or at the next higher dose level dk + 1 with
probability 1[θ/(1θ)], where θ <0.5 is the pre-specified target level of acceptable
toxicity. The trial ends when a pre-study specified total of participants have been
accrued to the study. Durham, Flournoy, and Rosenberger (1997) and Stylianou and
Flournoy (2002) propose several estimators for the MTD at the end of the trial,
including estimates based on logistic regression, means based on the number of
participants assigned to each dose level, or based on isotonic regression. There are a
number of good features to this design. It relies only on the assumption of an
increasing dose-toxicity curve, the escalation and de-escalation rules are easy to
implement, and both the small sample properties and asymptotic distribution theory
for the estimators have been derived (Durham and Flournoy 1994, 1995).
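The random walk is easy to simulate. The sketch below implements the escalation and de-escalation rules just described; the toxicity curve is hypothetical, and the handling of the lowest and highest dose levels is an assumption made for the illustration.

```python
import random

def biased_coin_trial(true_tox, theta, n_total, seed=1):
    """Simulate biased-coin dose allocation (a sketch of the design above).
    true_tox: assumed true DLT probability at each dose level;
    theta: target toxicity probability, theta < 0.5."""
    rng = random.Random(seed)
    stay_prob = theta / (1 - theta)
    k, path = 0, []          # start at the lowest dose level
    for _ in range(n_total):
        path.append(k + 1)   # record 1-indexed dose level
        if rng.random() < true_tox[k]:
            k = max(k - 1, 0)                   # DLT: de-escalate
        elif rng.random() >= stay_prob:
            k = min(k + 1, len(true_tox) - 1)   # non-DLT: escalate ...
        # ... otherwise stay at the current dose, probability theta/(1-theta)
    return path

print(biased_coin_trial([0.05, 0.10, 0.20, 0.35, 0.50], theta=0.20, n_total=24))
```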

Interval-Based Methods for Dose-Finding

The original interval-based method is the “cumulative cohort design” (CCD) pro-
posed by Ivanova et al. (2007). Given a target toxicity probability, θ, the term
“interval-based” derives from basing decision rules for treating the next set of
participants at a lower, higher, or the same dose as the current cohort of participants
on intervals placed around θ. If the current dose level is dk and there are Yk
participants experiencing a DLT out of the nk participants who have been treated at
that dose, the decision rules proposed by Ivanova et al. (2007) are:

• If q̂ ≤ θ − ΔL, "escalate the dose": treat the next group of participants at dose dk+1.
• If q̂ ≥ θ + ΔU, "de-escalate the dose": treat the next group of participants at dose dk−1.
• If θ − ΔL < q̂ < θ + ΔU, "stay": treat the next group of participants at dose dk.

where q̂ = Yk/nk is the proportion of participants treated at dose dk who have
experienced a DLT. The interval values ΔL and ΔU are chosen to allocate, on average,
as many participants as possible to the MTD. Ivanova et al. (2007) simplify the
design further by taking ΔL = ΔU = Δ so that the decision intervals are symmetric
around the target θ. They recommend Δ = 0.09 for target toxicity probabilities
between 0.10 and 0.25 and Δ = 0.10 for target toxicity probabilities of 0.30 and 0.35.
As an example, with a target toxicity probability of 0.20, the cumulative cohort
design would indicate escalating to the next higher dose if the observed proportion of
toxicities at the current dose is less than or equal to 0.11. If the observed proportion
of toxicities is 0.29 or greater, the next cohort of participants would be treated at the
next lower dose. The current dose would be allocated if the observed proportion of
participants experiencing a DLT is strictly between 0.11 and 0.29. This process of adaptively
recommending doses to participants continues until a pre-specified number of
participants have been observed.
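The rule, and the worked example above, can be expressed directly (a sketch):

```python
def ccd_decision(n_dlt, n_treated, theta=0.20, delta=0.09):
    """Cumulative cohort design rule (a sketch): compare the observed DLT
    proportion at the current dose with the interval theta +/- delta."""
    q_hat = n_dlt / n_treated
    if q_hat <= theta - delta:
        return "escalate"
    if q_hat >= theta + delta:
        return "de-escalate"
    return "stay"

print(ccd_decision(1, 10))   # 0.10 <= 0.11 -> escalate
print(ccd_decision(2, 10))   # 0.11 < 0.20 < 0.29 -> stay
print(ccd_decision(3, 10))   # 0.30 >= 0.29 -> de-escalate
```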
A number of other interval-based designs have been proposed, including the
modified toxicity probability interval (mTPI) method (Ji et al. 2007) and a subse-
quent modification of this method, the mTPI-2 (Guo et al. 2017). Liu and Yuan
(2015) proposed a Bayesian optimal interval design (BOIN). A related method, the
“Keyboard design,” was proposed by Yan, Mandrekar, and Yuan (2017) and is
equivalent to the mTPI-2 design. In this chapter, these methods are categorized as
“interval-based” methods, although the authors also use the term “model-assisted” to
refer to this class of method. All of these methods share a common goal, to develop
designs for dose-finding studies that are simple to implement and have better
operating characteristics than the 3+3. The methods differ on the criteria on which
the intervals are derived, and the use of a point estimate, such as q̂ in the cumulative
cohort design, or the Bayesian posterior probability associated with the intervals
indicating de-escalation, escalation, or staying at the current dose.
In addition, these interval-based methods, unlike the cumulative cohort design,
implement additional rules on the dose assignment process. A dose, and all higher
doses, can be eliminated from consideration if the observed proportion of toxicities
at the current dose is too great. The need for such a rule is clear with a simple, if
somewhat unrealistic, example. Suppose that four dose levels are under consider-
ation, and the levels were chosen such that the true toxicity probabilities associated
with dose levels 1 and 2 are near 0 and dose levels 3 and 4 are near 1. Without these
stopping rules, the design would oscillate between recommending dose escalation
from dose level 2 to 3 and de-escalation from dose level 3 to 2. Other than the CCD,
the other interval-based methods also implement rules that stop the trial if too much
toxicity is observed at the lowest dose level.
As a practical matter, these designs differ little in their decision rules. Table 1
shows the decision rules for a target toxicity level of 0.20, for up to ten participants
treated at a dose, and the possible number of participants observed to have a DLT.
The CCD method was modified to have the same dose elimination rules as BOIN
and Keyboard. The values for CCD, BOIN, and Keyboard (mTPI-2) were computed
from the “get.boundary” function in R (Lin 2018). In this table, if only a single entry
appears, all three methods gave the same recommendation. In the few cases where
the recommendations differ, the recommendations from each of the methods are
given in the order CCD, BOIN, and Keyboard. There are few points of disagreement
among the three methods and the BOIN and Keyboard methods never disagree. The
Table 1 Comparison of dose allocation rules for three interval-based methods. Rows give the
number of participants with a DLT at the current dose; columns give the number of participants
observed at the current dose. A row begins at the column where that many DLTs first becomes
possible

DLTs  1    2    3    4      5    6      7      8      9      10
0     E    E    E    E      E    E      E      E      E      E
1     D    D    D    S;D;D  S    S      S;E;E  S;E;E  S;E;E  E
2          D    D    D      D    D      S;D;D  S;D;D  S      S
3               R    R      R    R      D      D      D      D
4                    R      R    R      R      R      R      D
5                           R    R      R      R      R      R
6                                R      R      R      R      R
7                                       R      R      R      R
8                                              R      R      R
9                                                     R      R
10                                                           R
E, escalate dose; S, stay at current dose; D, de-escalate; R, de-escalate and remove the dose from
consideration

similarity in the dose allocation rules is not surprising; Clertant and O’Quigley
(2017, 2019) develop a semiparametric method that can be calibrated to produce
identical operating characteristics to all of these individual methods.

Model-Based Methods for Dose-Finding

The most widely recognized model-based method for dose-finding trials is the
continual reassessment method (CRM). The original CRM paper proposed a sin-
gle-stage design; a later version (Shen and O’Quigley 1996) is a two-stage design
using a rule-based algorithm in the first stage and maximum likelihood estimation in
the second stage. An excellent overview of the theoretical properties and guidelines
for the practical application of the method is given in Cheung (2011) and O’Quigley
and Iasonos (2012). The CRM assumes a parametric model for the dose-toxicity
curve, but it does not require that the model be correct across all the doses under
consideration. The model only needs to be increasing in dose and be such that there
is a parameter value that enables the function to equal the target value, θ, at the true
MTD. The original CRM paper discussed one- and two-parameter models but
focused primarily on one-parameter models because the simpler models tended to
have better properties in terms of identifying the correct MTD.
The most common implementation of the CRM uses the “empiric” model for πk;
the probability of a DLT at dose level dk is assumed to be equal to

πk = (φk)^exp(a)
where 0 < φ1 < φ2 < . . . < φK < 1 are pre-specified constants, often referred to as
the “skeleton” values and “a” is a scalar parameter to be estimated from the data. The
parametrization exp(a) ensures that the probability of toxicity is increasing in dose
for all values of the parameter a. The original CRM paper (O’Quigley et al. 1990)
was a Bayesian method that put a prior distribution on the parameter, a. The paper
provided guidance on eliciting a gamma prior for exp(a) but noted that in many
cases, the special case of an exponential prior with mean 1 gave satisfactory
performance. The skeleton values can be based on the prior, as suggested in the
original CRM paper, or by using the method of Lee and Cheung (2009) where the
skeleton values are calibrated in a way to give good performance for the CRM across
a variety of true dose-toxicity curves.
Once the prior and skeleton values are chosen, the first participant is assigned to
the dose level with prior probability closest to the target θ. After that, the CRM
allocates participants sequentially, with each participant assigned to the dose level
with the model-based estimated probability of toxicity closest to the target. To be
specific, suppose that j − 1 participants have been observed on the trial, with nk ≥ 0
participants observed on dose level k, k = 1, . . ., K. Of the nk participants, Yk
participants experienced a DLT. Using the data accumulated from the j − 1 partici-
pants observed, the updated model-based toxicity probabilities are

π̂k = (φk)^exp(â)

where â can be the posterior mean computed via numerical integration, an approx-
imation to the posterior mean as in O’Quigley et al. (1990), the posterior mode, or
the posterior median (Chu et al. 2009). The next participant is assigned to the dose
level k with the estimated toxicity probability closest to the target, where “closest” is
according to a pre-specified measure of distance between the estimate and the target.
The original paper uses a quadratic distance, but asymmetric distance functions,
which give greater loss to deviations above the target than below, could also be used.
The updating of “a” and the allocation of participants to the dose with updated
toxicity probability closest to the target continues until a pre-specified number of
participants have been observed. At the end of the study, the MTD is taken to be the
dose that the next participant would have received had the trial not ended.
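The update step can be sketched numerically. The fragment below evaluates the posterior on a grid, assuming the exp(a) ~ Exponential(1) prior and a hypothetical skeleton and data set; it returns the dose with estimated DLT probability closest to the target. It is an illustration, not the canonical implementation.

```python
import numpy as np

def crm_recommend(skeleton, dose_levels, dlts, theta):
    """One Bayesian CRM update for the empiric model pi_k = skeleton_k**exp(a)
    (a sketch). dose_levels: 0-indexed dose per participant; dlts: 0/1."""
    a = np.linspace(-4, 4, 2001)          # grid over the parameter a
    prior = np.exp(a - np.exp(a))         # density of a when exp(a) ~ Exp(1)
    loglik = np.zeros_like(a)
    for k, y in zip(dose_levels, dlts):
        p = skeleton[k] ** np.exp(a)
        loglik += np.log(p) if y else np.log(1.0 - p)
    post = prior * np.exp(loglik)
    post /= np.trapz(post, a)
    a_hat = np.trapz(a * post, a)                   # posterior mean of a
    p_hat = np.asarray(skeleton) ** np.exp(a_hat)   # updated DLT probabilities
    return int(np.argmin(np.abs(p_hat - theta))), p_hat

skeleton = [0.05, 0.12, 0.20, 0.30, 0.40]           # hypothetical skeleton
k_next, p_hat = crm_recommend(skeleton, [0, 1, 2, 2], [0, 0, 1, 0], theta=0.20)
print(k_next + 1, np.round(p_hat, 3))
```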
A two-stage version of the CRM is presented by Shen and O’Quigley (1996). The
first stage uses a rule-based design and continues until at least one participant
experiences a DLT and one participant does not experience a DLT. Once heteroge-
neity in responses is observed, the trial proceeds as in the original CRM, except that
the estimate of the parameter “a” is based on maximum likelihood. The paper uses a
rule-based design using single-participant cohorts: if a participant does not experi-
ence a DLT, the next participant is treated at the next higher dose level, but the
authors note that any rule-based design could be used in stage I until heterogeneity is
observed.
The “escalation with overdose control” (EWOC) method of Babb et al. (1998) is a
popular method for dose-finding. As with the previous dose-finding methods, Babb
et al. (1998) set a target toxicity probability, θ, and assume that the true MTD,
defined as the dose that has a toxicity probability equal to the target, is in a pre-
specified interval [Xmin, Xmax]. Their method is designed to identify the MTD while
providing “overdose control,” limiting the proportion of participants exposed to
doses above the MTD.
The EWOC design is based on a two-parameter model for the probability of a
DLT at dose x in [Xmin, Xmax]. One of several possibilities for the dose-toxicity
relationship model is the logistic model:

logit[P(DLT | x)] = β0 + β1x

where β1 is restricted to be greater than 0 so that the probability of a DLT is


increasing in x. Babb et al. (1998) propose a Bayesian method, assuming priors on
the pair (β0, β1).
The first participant or cohort of participants is assigned dose Xmin. From there,
the study proceeds much like the original CRM. Data from the first j-1 participants is
used to update the posterior distribution for (β0, β1) and used to guide the dose
administered to the jth participant. Unlike the CRM, which allocates the jth partici-
pant to the dose that has a toxicity probability estimated to be closest to the target, the
EWOC method assigns the dose, x*, such that the posterior probability that the
toxicity probability associated with x* exceeds the target is equal to a pre-specified
value.
This process of updating the posterior distribution of (β0, β1) and allocating
participants to doses based on the updated posterior distribution continues until a
pre-specified number of participants have been observed. At the end of the trial, the
MTD is chosen as the value that minimizes the expected loss with respect to the
posterior distribution of the MTD. This loss is taken as an asymmetric loss function,
penalizing overdosing more than underdosing.
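A grid-based sketch of the dose-selection step follows. The flat priors, grids, starting data, and feasibility bound are illustrative assumptions; in practice, priors on (β0, β1) would be specified and the recommended dose clipped to [Xmin, Xmax].

```python
import numpy as np

def ewoc_next_dose(doses, dlts, theta, alpha, b0_grid, b1_grid):
    """EWOC dose selection on a grid posterior (a sketch, flat priors assumed):
    return the dose x* whose posterior probability of exceeding the MTD
    equals the feasibility bound alpha."""
    B0, B1 = np.meshgrid(b0_grid, b1_grid, indexing="ij")
    loglik = np.zeros_like(B0)
    for x, y in zip(doses, dlts):
        p = 1.0 / (1.0 + np.exp(-(B0 + B1 * x)))
        loglik += np.log(p) if y else np.log(1.0 - p)
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    # Posterior distribution of the MTD, the dose with P(DLT) = theta
    mtd = (np.log(theta / (1.0 - theta)) - B0) / B1
    order = np.argsort(mtd.ravel())
    cdf = np.cumsum(post.ravel()[order])
    return mtd.ravel()[order][np.searchsorted(cdf, alpha)]   # alpha-quantile

x_star = ewoc_next_dose([0.0, 0.5], [0, 0], theta=0.33, alpha=0.25,
                        b0_grid=np.linspace(-5, 1, 120),
                        b1_grid=np.linspace(0.1, 6, 120))
print(round(float(x_star), 2))
```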
Babb et al. (1998) showed that EWOC and the Bayesian CRM with a symmetric
loss function had similar properties for identifying the MTD and were more efficient
than any of the rule-based designs they considered which include the four up-and-
down designs of Storer (1989) and two methods based on stochastic approximation.
On average, the EWOC method tended to treat more participants on low and possibly
sub-therapeutic doses than did the CRM but treated fewer participants at dose levels
above the MTD than did CRM. Over all the simulations, the average proportion of
participants with DLTs with CRM was almost exactly equal to the target level (33%);
this proportion was between 25% and 30% for EWOC. An excellent overview of
EWOC and its extensions is given in Tighiouart and Rogatko (2014).

Semiparametric and Order-Restricted Methods for Dose-Finding

These methods do not fall neatly into either the interval-based or model-based
classifications. The semiparametric dose-finding method of Clertant and O’Quigley
(2017) is more of a class of methods than a specific design. If parametric conditions
are added to the class, the result is the CRM design. Using less structure on the class
results in the interval-based designs, including the CCD, mTPI-2, and BOIN.
Clertant and O’Quigley (2017) present results for a “semiparametric” design that
corresponds to CRM with an additional nuisance parameter. This formulation leads
to a design that reduces the dependence of the CRM on a single model and produces
results similar to those of the CRM.
The methods of Leung and Wang (2001) and Conaway et al. (2004) are based on
methods for order-restricted inference. These methods are, in a way, model-based
designs but rely only on the assumption that the probability of a DLT increases with
dose. The methods do not specify a full parametrized parametric model. Recent work
(Wages and Conaway 2018) has shown that the order-restricted methods are
competitive in performance with the CRM (the method that, as described in the
following section, generally has the best operating characteristics) in dose-finding
studies over a wide range of scenarios.

Evaluating Methods for Dose-Finding and Dose-Ranging

There are a number of criteria on which the methods can be compared. These include
the statistical properties, the ease of implementation and adaptability to changes in
the study conduct, and principles for conducting early-stage studies.

Operating Characteristics

Comparisons of the properties of the methods are complicated by the lack of
consensus in the criteria on which the methods should be judged and the necessity
of focusing on selected true dose-toxicity scenarios. The most common criterion for
evaluation is the “percent correct selection” (PCS) (Cheung 2011), the proportion of
times that the method correctly selects the true MTD, or how often the method
selects a dose within a certain range, such as 5 or 10 percentage points within the
target toxicity. Comparisons may be also made on the percent of participants treated
at the MTD or at doses close to the MTD or on the basis of the proportion of
participants treated at doses above the MTD.
The accuracy index (Cheung 2011) is a useful measure that takes into account the
entire distribution of dose recommendations:
Accuracy index = 1 − K × [Σk ρk P(design selects dose k)] / [Σk ρk], with both sums over k = 1, . . ., K

where ρk is a measure of the deviation of the true toxicity probability, πk, at dose k
from the target toxicity probability θ. Cheung (2011) gives several choices for ρk,
including an absolute deviation, ρk = |πk − θ|. The accuracy index has a maximum
value of 1, which occurs when the design always recommends the correct MTD.
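With the absolute-deviation choice of ρk, the index is computed directly (a sketch with hypothetical numbers):

```python
def accuracy_index(true_tox, theta, selection_probs):
    """Cheung's accuracy index with rho_k = |pi_k - theta| (a sketch)."""
    rho = [abs(p - theta) for p in true_tox]
    weighted = sum(r * s for r, s in zip(rho, selection_probs))
    return 1 - len(rho) * weighted / sum(rho)

# Hypothetical true DLT probabilities, target 0.20, and a design's
# simulated distribution of recommended doses
print(accuracy_index([0.05, 0.12, 0.20, 0.35, 0.50], 0.20,
                     [0.05, 0.20, 0.55, 0.15, 0.05]))   # ~0.55
```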
With few exceptions, comparisons among the methods are made based on
simulations, and historically, these simulations were done using a limited number
of true dose-toxicity curves. Even when exact small sample results are available
(Durham and Flournoy 1994, 1995; Lin and Shih 2001), these results depend on the
true unknown underlying dose-toxicity curve. This can make comparisons difficult,
since every method has some scenarios under which it will perform well. A benchmark tool for
evaluating the properties of designs is given in O’Quigley et al. (2002) and Paoletti
et al. (2004). The benchmark cannot be used in practice because it requires knowl-
edge of the true underlying dose-toxicity curve; however, it has been shown to be
useful in investigating the efficiency of proposed designs (Wages et al. 2013) in the
context of studies with monotone dose-toxicity curves.
It is important, when evaluating a design, to consider the performance across a
broad range of scenarios, varying the location of the MTD and the steepness of the
dose-toxicity curve. To this end, a number of families of dose-toxicity curves have
been proposed. Evaluating methods for specific dose-toxicity curves randomly
sampled from the family of curves is intended to test the method across a range of
curves that vary in MTD location and steepness. One of the first families of curves
was generated by Paoletti et al. (2004); subsequent proposals can be found in Horton
et al. (2017), Clertant and O’Quigley (2017), and Conaway and Petroni (2019a).

Ease of Implementation and Adaptability

Rule-based methods have the practical advantage of being simpler to carry out
because all of the decision rules can be laid out in a table prior to starting the
study. The model-based methods also have some practical advantages over rule-
based methods. Model-based methods can enroll participants even if the follow-up
period for previously enrolled participants is not yet complete. Model-based
methods can accommodate revisions to data errors; on subsequent review, partici-
pants thought not to have had DLTs could be found to have had DLTs or, vice versa,
participants thought to have had DLTs could be reclassified upon further review as
not having had one. Subsequent allocations can proceed based on models fit to the
corrected data.

Principles

Cheung (2005) defined the principle of coherence for single-agent dose-finding and
dose-ranging studies. By this definition, a method is coherent for dose escalation if
the method does not increase the dose following an observed DLT and coherent for
dose de-escalation if the method does not decrease the dose following the observa-
tion of a non-DLT. This is an important principle in implementing a study, since
clinicians can be reluctant to follow an incoherent design, particularly one that is
incoherent in dose escalation.
The 3+3, despite its poor operating characteristics, is at least coherent by this
definition. The biased coin design (Durham et al. 1997), EWOC (Tighiouart and
Rogatko 2014), and the semiparametric CRM (Clertant and O’Quigley 2017) are all
coherent. Cheung (2011) shows that the one-stage Bayesian CRM is coherent and
that the two-stage CRM is coherent as long as it does not produce an incoherent
transition between the rule-based and model-based stages. Clertant and O’Quigley
(2017) show that the semiparametric CRM is coherent. In general, the interval-based
methods are not coherent by this definition, and in practice, incoherent decisions
occur frequently with these designs (Wages et al. 2019 under review).
Proponents of interval-based designs defined a separate principle and unfortunately also used
the term “coherence” (Liu and Yuan 2015). The principle, defined as “long-term
memory coherence,” means that a method will not increase the dose for the next
participant if the observed proportion of participants at the current dose who have
experienced a DLT exceeds the target toxicity rate and the method will not reduce the
dose for the next participant if the observed proportion of DLTs at the current dose is
less than the target. By construction, the interval-based designs are all “long-term
memory coherent.”
An example from Wages et al. (2019 under review) serves to illustrate the
difference between the original definition of coherence and “long-term memory
coherence.” Table 2 shows the dose allocations from BOIN and Keyboard for a
simulated trial using the settings in Scenario 2 in Table S2 of Zhou et al. (2011).
Decisions at participants 9 and 13 represented incoherence in dose de-escalation; in
each case, the next participant is treated at a lower dose immediately following a
non-DLT. Both of these decisions are “long-term memory coherent.” The decision at
participant 16 is incoherent in dose escalation. Even though this participant
experienced a DLT on dose level 2, both BOIN and Keyboard treat the next
participant at a higher dose, dose level 3, which has an observed proportion of
toxicities greater than the target.

Table 2 Example of incoherent dose escalations

                                                  Accumulated data on each dose
                                                  (# DLTs/# participants)
Participant  Dose assigned  Outcome  Decision     1    2    3    4    5
1            1              Non-DLT  Escalate     0/1  0/0  0/0  0/0  0/0
2            2              Non-DLT  Escalate     0/1  0/1  0/0  0/0  0/0
3            3              Non-DLT  Escalate     0/1  0/1  0/1  0/0  0/0
4            4              Non-DLT  Escalate     0/1  0/1  0/1  0/1  0/0
5            5              DLT      De-escalate  0/1  0/1  0/1  0/1  1/1
6            4              DLT      De-escalate  0/1  0/1  0/1  1/2  1/1
7            3              DLT      De-escalate  0/1  0/1  1/2  1/2  1/1
8            2              Non-DLT  Escalate     0/1  0/2  1/2  1/2  1/1
9a           3              Non-DLT  De-escalate  0/1  0/2  1/3  1/2  1/1
10           2              Non-DLT  Escalate     0/1  0/3  1/3  1/2  1/1
11           3              DLT      De-escalate  0/1  0/3  2/4  1/2  1/1
12           2              Non-DLT  Escalate     0/1  0/4  2/4  1/2  1/1
13a          3              Non-DLT  De-escalate  0/1  0/4  2/5  1/2  1/1
14           2              Non-DLT  Escalate     0/1  0/5  2/5  1/2  1/1
15           3              DLT      De-escalate  0/1  0/5  3/6  1/2  1/1
16b          2              DLT      Escalate     0/1  1/6  3/6  1/2  1/1
17           3              Non-DLT  De-escalate  0/1  1/6  3/7  1/2  1/1

a Not coherent in dose de-escalation
b Not coherent in dose escalation

Extensions Beyond Single-Agent Trials with a Binary Toxicity Outcome

Single-agent dose-finding or dose-ranging trials are becoming less frequent, partic-
ularly in oncology, giving way to studies that involve combinations of agents or
studies that involve heterogeneous groups of participants. Designs that have been
developed to meet the challenges of studies of combinations of agents or studies
conducted in heterogeneous groups of participants are discussed in this chapter.
Other features of contemporary dose-finding or dose-ranging trials include the study
of noncytotoxic agents or studies that collect preliminary assessments of efficacy, but
space limitations preclude a full discussion of methods in these areas.

Time-to-Event Toxicity Outcomes

The original CRM paper (O’Quigley et al. 1990) noted that the observation of a
toxicity does not occur immediately. As a result, there may be new participants ready
to enroll in the study before all the prior participants have been observed for a DLT.
O’Quigley et al. (1990) suggested treating the new participants at the last allocated
dose, or given the uncertainty in the dose allocations, treating participants one level
above or one level below the most recent model-recommended dose.
Cheung and Chappell (2010) proposed an extension to the continual reassessment
method known as the “time-to-event CRM” (TITE-CRM). This method allows for a
weighted toxicity model, with weights proportional to the time that the participant
has been observed. They consider a number of weight functions, but simulation
results suggested that a simple linear weight function of the form w(u) = u/T, where
T is a fixed length of follow-up observation time for each participant, is adequate.
If a participant is observed to have a toxicity at time u < T, the follow-up time u is set
to equal T. If DLT information has been observed for all participants, the method
reduces to the continual reassessment method of O’Quigley et al. (1990).
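A sketch of the linear weight and the corresponding likelihood contribution follows; the function names are assumed for illustration.

```python
from math import log

def tite_weight(u, T):
    """Linear TITE-CRM weight: fraction of the follow-up window completed."""
    return min(u / T, 1.0)

def loglik_contribution(p_k, dlt, u, T):
    """One participant's weighted log-likelihood term: pi_k for a DLT
    (full weight), and 1 - w * pi_k for a participant still event-free."""
    if dlt:
        return log(p_k)            # toxicity observed: weight set to 1
    return log(1.0 - tite_weight(u, T) * p_k)
```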
Normolle and Lawrence (2006) discuss the use of the TITE-CRM in radiation
oncology studies, where the toxicities tend to occur late in the follow-up period.
Polley (2011) observes that in studies with rapid participant accrual and late toxic-
ities, the TITE-CRM can allocate too many participants to overly toxic doses. The
paper has a comparison of a modification of the TITE-CRM that was suggested in
the original TITE-CRM paper, as well as a modification that incorporates wait times
between participant accruals. A version of EWOC with time-to-event endpoints is
described in Mauguen et al. (2011) and Tighiouart et al. (2014a).
A modification of the 3+3, called the rolling 6 (Skolnik et al. 2008), is meant to
reduce the time to completion of a dose-ranging trial by allowing participants to
enter a trial before complete information is available for participants in the prior
cohort. Zhao et al. (2011) has shown that this method is less efficient and less
accurate than TITE-CRM.

Combinations of Agents

The simplest form of a study with a combination of two agents is depicted below.
Each of two agents (A and B) is being studied at two dose levels (“low” and “high”).
The probability of a DLT with agent A at level "a" and agent B at level "b" is denoted
by πab. The problem of dose-finding or dose-ranging differs from the single-agent
case in that the probabilities of toxicity no longer follow a complete order, in which
the ordering or any two toxicity parameters are known, but instead follow a “partial
order” (Robertson et al. 1988). In a partial order, there are pairs of parameters whose
ordering is not known. For example, it is not known whether πHL > πLH or πHL < πLH.
A second distinction from the single-agent case is that in a combination study, there
may be more than one “MTD,” meaning more than one dose combination with a
toxicity probability close to the target.
The initial suggestion for studies of drug combinations was to lay out a specific
ordering of the combinations (Korn and Simon 1993; Kramar et al. 1999). While
simple to implement, this approach follows only one path through the combinations
and could have poor properties in identifying an MTD, particularly if the assumed
ordering is incorrect.
As in the single-agent case, methods can broadly be classified as rule-based or
model-based methods. The interval-based “Bayesian optimal interval design
(BOIN)” has been extended to studies involving combinations (Lin and Yin 2017).
Based on the observed proportion of participants experiencing a DLT at the current
dose, a decision is made to “escalate,” “de-escalate,” or “stay” at the current dose.
More than one dose combination might be considered an “escalation” or a “de-
escalation,” and Lin and Yin (2017) propose pre-specifying, for each dose combi-
nation, a set of “admissible escalation doses” and “admissible de-escalation doses.”
A similar idea had been proposed by Conaway et al. (2004), who used the estimation
method of Hwang and Peddada (1994) as well as “possible escalation sets” for each
dose combination to guide dose combination allocations. Similarly, bivariate iso-
tonic regression was the basis of the method proposed by Wang and Ivanova (2005).
This method estimates the probability of a DLT for each combination under the
assumption that for a fixed row, the toxicity probabilities increase across columns
and, for a fixed column, the toxicity probabilities increase across rows. In Table 3,
for example, it is known that πLL < πLH, πHL < πHH, πLL < πHL and πLH < πHH.
The majority of methods for dose-finding and dose-ranging for combinations of
agents are model-based. Extensions of the CRM were proposed by Wages et al.
(2011a, b). These methods consider either a subset or all possible orders of the
toxicity probabilities. For example, in Table 3 for the simplest case, there are two
possible orders:
Table 3 A study with two agents, each at two dose levels

           Agent B
Agent A    Low    High
High       πHL    πHH
Low        πLL    πLH

Order 1: πLL < πLH < πHL < πHH
Order 2: πLL < πHL < πLH < πHH

The CRM is fit separately within each order. After each participant is observed,
the recommendation is based on the order that yields a greater value of the likeli-
hood. In many studies with more than two dose levels for each of the agents, there
can be too many possible orderings to specify all of them. In this case, clinical
judgment can be used to guide the choice of orders under consideration, or the
default set of orders recommended by Wages and Conaway (2013) can be used.
Yin and Yuan (2009) also generalize the single-agent CRM for studies consider-
ing combinations of J levels of agent A and K levels of agent B. They pre-specify
values p1 < p2 < . . . < pJ for agent A and values q1 < q2 < . . . < qK for agent B and
use a model for the probability of a DLT that depends on the pre-specified values as
well as three parameters to be estimated from the data. At the end of the trial, the
estimate of the MTD is the dose with the estimated DLT probability closest to the
pre-specified target.
Other model-based methods are based on a full mathematical specification of the
probabilities of toxicity for each combination. Thall et al. (2003) propose a nonlinear
six-parameter model to describe how the probability of toxicity depends on the dose
combination.
All of the previous methods are for studies in which a discrete set of combinations
have been pre-specified. For dose-finding studies, Shi and Yin (2013) and Tighiouart
et al. (2014b) have generalized the EWOC method for combinations. Both of these
generalizations use a logistic model for the probability of toxicity that included main
effects for the dose level of each combination and a multiplicative interaction term to
specify the joint effect of the two agents.

Heterogeneity of Participants

In some dose-finding trials, there are several groups of participants, and the goal is to
estimate a MTD within each group. These groups may be defined by the participants’
degree of impairment at baseline (Ramanathan et al. 2008; LoRusso et al. 2012) or
genetic characteristics (Kim et al. 2013). For example, Ramanathan et al. (2008)
enrolled 89 participants with varying solid tumors to develop dosing guidelines for
the administration of imatinib in participants with liver dysfunction. Prior to dosing,
participants were stratified into “none,” “mild,” “moderate,” or “severe” liver
dysfunction at baseline, according to serum total bilirubin and AST. A similar
classification is used by LoRusso et al. (2012). Kim et al. (2013) define three groups
of participants according to the number of defective alleles, either 0, 1, or 2.
In each of these cases, parallel phase I studies were conducted within each group
but did not account for the expectation that the MTD would be lower in the more
severely impaired participants at baseline or in the subset of participants with a
greater number of defective alleles. In these cases, even with an efficient design,
given the sample sizes typically seen in phase I trials, ignoring the orderings among
the groups can lead to reversals in the MTD estimates, meaning that the estimated
MTDs in the groups can contradict what is known clinically (Horton et al. 2019b).
Furthermore, even in cases where the ordering is not known, running parallel studies
can be inefficient compared to a design that uses a model to pool information from all
participants across all groups in order to estimate the dose-toxicity relationship.
O’Quigley, Shen, and Gamst (1999) and O’Quigley and Paoletti (2003) were the
first to investigate the consequences of using parallel trials. More recently, Raphael
et al. (2010) discussed the use of parallel trials of heavily pretreated and lightly
pretreated participants in dose-finding trials in pediatric participants and
recommended that parallel trials only be undertaken when there is a strong rationale
for doing so.
To avoid the issues with parallel groups, a number of statistical methods have
been proposed for estimating MTDs when there is heterogeneity among participants.
All of the methods proposed to date for accounting for participant heterogeneity in
dose-finding are either generalizations of model-based methods or order-restricted
methods for single-agent trials. O’Quigley, Shen, and Gamst (1999) and O’Quigley
and Paoletti (2003) account for participant heterogeneity by adding a covariate to
represent participant characteristics. In O’Quigley and Paoletti (2003), the ordering
of probabilities of DLTs between the groups is known, and this knowledge is
incorporated into the model through the prior distribution.
An alternative to adding a covariate to account for discrete groups in dose-finding
studies is to combine model-based methods and order-restricted inference. The first
method to do this is Yuan and Chappell (2004), who are estimating the MTD for a
single agent in each of G ordered groups. In Yuan and Chappell (2004), the single-
agent CRM is applied separately to the data in each group. Using the bivariate
isotonic regression estimator (Robertson et al. 1988), the resulting DLT probability
estimates are modified so that the estimates increase with increasing dose within
each group, and there are no reversals, meaning there are no dose levels where
a lower-risk group has greater DLT probability estimates than a higher-risk group.
Once the isotonic estimates are computed, dose allocation proceeds as in the single-
group CRM: the next participant in group “g” is allocated to the dose with an
estimated toxicity probability in group “g” closest to the target θ. Similar ideas are
found in Conaway (2017a, b), which propose methods for completely ordered
groups, such as the example in Ramanathan et al. (2008), or partially ordered groups, in
which some of the orderings between the groups are unknown.
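
To make the allocation step just described concrete, here is a minimal Python sketch. The matrix of isotonized DLT probability estimates, the target θ = 0.25, and the function name are all illustrative assumptions, not values taken from the cited papers.

import numpy as np

# Illustrative isotonized DLT probability estimates: rows are ordered risk
# groups (lower risk first), columns are increasing dose levels. Estimates
# increase across doses within each group and across groups within each
# dose, so there are no reversals.
est = np.array([[0.05, 0.12, 0.22, 0.35],   # lower-risk group
                [0.10, 0.20, 0.33, 0.48]])  # higher-risk group
theta = 0.25  # target DLT probability (illustrative)

def next_dose(group):
    # Allocate to the dose whose estimated DLT probability in this group
    # is closest to the target theta.
    return int(np.argmin(np.abs(est[group] - theta)))

print(next_dose(0), next_dose(1))  # dose index 2 for group 0, 1 for group 1

Note that, because the estimates have no reversals, the higher-risk group is allocated a dose no higher than the lower-risk group, which is exactly the clinical constraint the isotonic adjustment enforces.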
Several methods take advantage of the discrete dose levels often found in dose-
finding studies. The “shift model” (O’Quigley 2006), described more fully in
O’Quigley and Iasonos (2014), takes a different approach to generalizing the CRM
to two ordered groups. For two groups, the assumption underlying this method is
that the MTD in group 2 will be Δ dose levels less than the MTD in group 1, with Δ a
nonnegative integer. O’Quigley and Iasonos (2014) restrict Δ to be 0, 1, 2, or 3
levels, but their method applies to any shift in the MTD. Horton et al. (2019a)
generalize the shift model to more than two groups and to either completely or
partially ordered groups.
Babb and Rogatko (2001) extend the EWOC method to allow for a continuous
covariate. In their application (Babb and Rogatko 2001), the covariate was protec-
tive, with increasing levels of the covariate associated with a lower probability of a
DLT. Data from a previous study of the agent allowed the investigators to set bounds
on the permissible doses for a participant with a specific covariate value. With this
method, participants can receive individualized doses according to their level of the
covariate. Similar methods are found in Tighiouart et al. (2012).

Summary and Conclusion

This chapter has presented a number of designs for dose-finding and dose-ranging
studies for a single agent and in which the primary objective of the study is to
establish a maximum safe dose. Many of the applications of these designs are in
oncology, which is currently undergoing a change in the complexity and objectives
of dose-finding and dose-ranging studies. This chapter has also reviewed a number
of designs that have been developed recently to meet the challenges of contemporary
dose-finding and dose-ranging studies.

Key Facts

1. The dose-finding design must be conducted in a way that addresses the study
objectives.
2. Dose-finding studies are an integral part of the overall drug development process.
3. The commonly used “3+3” algorithm has poor operating characteristics.
4. Interval-based dose-finding designs are simple to implement and provide better
operating characteristics than the “3+3.”
5. In general, model-based dose-finding designs have superior operating character-
istics and provide the flexibility needed to handle data revisions and delayed
dose-limiting toxicities.

Cross-References

▶ Bayesian Adaptive Designs for Phase I Trials
▶ Dose Finding for Drug Combinations
▶ Implementing the Trial Protocol
▶ Interim Analysis in Clinical Trials
▶ Monte Carlo Simulation for Trial Design Tool
▶ Participant Recruitment, Screening, and Enrollment
▶ Power and Sample Size
▶ Principles of Clinical Trials: Bias and Precision Control

References
Ananthakrishnan R, Green S, Chang M, Doros G, Massaro J, LaValleya M (2017) Systematic
comparison of the statistical operating characteristics of various phase I oncology designs.
Contemp Clin Trials Commun 5:34–48
Babb J, Rogatko A (2001) Patient specific dosing in a cancer phase I clinical trial. Stat Med
20:2079–2090
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17:1103–1120
Cheung YK (2005) Coherence principles in dose-finding studies. Biometrika 92:203–215
Cheung YK (2011) Dose finding by the continual reassessment method. Chapman and Hall/CRC
Biostatistics Series, New York
Cheung YK, Chappell R (2010) Sequential designs for phase I clinical trials with late-onset
toxicities. Biometrics 56:1177–1182
Chu PL, Lin Y, Shih WJ (2009) Unifying CRM and EWOC designs for phase I cancer clinical trials.
J Stat Plann Inference 139:1146–1163
Clertant M, O’Quigley J (2017) Semiparametric dose finding methods. J R Stat Soc Ser B 79
(5):1487–1508
Clertant M, O’Quigley J (2019) Semiparametric dose finding methods: special cases. Appl Stat 68
(2):271–288
Conaway M (2017a) A design for phase I trials in completely or partially ordered groups. Stat Med
36(15):2323–2332
Conaway M (2017b) Isotonic designs for phase I trials in partially ordered groups. Clin Trials 14
(5):491–498
Conaway M, Petroni G (2019a) The impact of early stage design on the drug development process.
Clin Cancer Res 25(2):819–827
Conaway M, Petroni G (2019b) The role of early-phase design-response. Clin Cancer Res 25
(10):3191
Conaway M, Dunbar S, Peddada S (2004) Designs for single- or multiple-agent phase I trials.
Biometrics 60:661–669
Durham S, Flournoy N (1994) Random walks for quantile estimation. In: Gupta S, Berger J (eds)
Statistical decision theory and related topics V. Springer, New York, pp 467–476
Durham S, Flournoy N (1995) Up-and-down designs I: stationary treatment distributions. In:
Flournoy N, Rosenberger W (eds) Adaptive designs. Institute of Mathematical Statistics,
Hayward, pp 139–157
Durham S, Flournoy N, Rosenberger W (1997) A random walk rule for phase 1 clinical trials.
Biometrics 53(2):745–760
Eussen S, de Groot L, Clarke R, Schneede J, Ueland P, Hoefnagels W, van Staveren W (2005) Oral
cyanocobalamin supplementation in older people with vitamin B12 deficiency: a dose-finding
trial. Arch Intern Med 165:1167–1172
Ezard N, Dunlop A, Clifford B, Bruno R, Carr A, Bissaker A, Lintzeris N (2016) Study protocol: a
dose-escalating, phase-2 study of oral lisdexamfetamine in adults with methamphetamine
dependence. BMC Psychiatry 16:428
Guo W, Wang S-J, Yang S, Lynna H, Ji Y (2017) A Bayesian interval dose-finding design
addressing Ockham’s razor: mTPI-2. Contemp Clin Trials 58:23–33
Horton B, Wages N, Conaway M (2017) Performance of toxicity probability interval based designs
in contrast to the continual reassessment method. Stat Med 36:291–300
Horton BJ, Wages NA, Conaway MR (2019a) Shift models for dose-finding in partially ordered
groups. Clin Trials 16(1):32–40
Horton BJ, O’Quigley J, Conaway M (2019b) Consequences of performing parallel dose finding
trials in heterogeneous groups of patients. JNCI Cancer Spectrum. https://doi.org/10.1093/
jncics/pkz013. Online ahead of print
Hwang J, Peddada S (1994) Confidence interval estimation subject to order restrictions. Ann Stat
22:67–93
Iasonos A, Wilton AS, Riedel ER, Seshan VE, Spriggs DR (2008) A comprehensive comparison of
the continual reassessment method to the standard 3+3 dose escalation scheme in phase I dose-
finding studies. Clin Trials 5(5):465–477
Ivanova A, Flournoy N, Chung Y (2007) Cumulative cohort design for dose-finding. J Stat Plann
Inference 137:2316–2327
Ji Y, Li Y, Bekele B (2007) Dose-finding in phase I clinical trials based on toxicity probability
intervals. Clin Trials 4:235–244
Kim K, Kim H, Sym S, Bae K, Hong Y, Chang H, Lee J, Kang Y, Lee J, Shin J, Kim T (2013) A
UGT1A1*28 and *6 genotype-directed phase I dose-escalation trial of irinotecan with fixed-
dose capecitabine in Korean patients with metastatic colorectal cancer. Cancer Chemother
Pharmacol 71:1609–1617
Korn E, Simon R (1993) Using tolerable-dose diagrams in the design of phase I combination
chemotherapy trials. J Clin Oncol 11:794–801
Kramar A, Lebecq A, Candalh E (1999) Continual reassessment methods in phase I trials of the
combination of two agents in oncology. Stat Med 18:849–864
Le Tourneau C, Lee J, Siu L (2009) Dose escalation methods in phase I clinical trials. J Natl Cancer
Inst 101:708–720
Lee S, Cheung YK (2009) Model calibration in the continual reassessment method. Clin Trials
6:227–238
Leung D, Wang Y-G (2001) Isotonic designs for phase I trials. Clin Trials 22:126–138
Lin R (2018) R codes for interval designs. https://github.com/ruitaolin/IntervalDesign
Lin Y, Shih W (2001) Statistical properties of traditional algorithm-based designs for phase I cancer
clinical trials. Biostatistics 2(2):203–215
Lin R, Yin G (2017) Bayesian optimal interval design for dose finding in drug-combination trials.
Stat Methods Med Res 26(5):2155–2167
Liu S, Yuan Y (2015) Bayesian optimal interval designs for phase I clinical trials. J R Stat Soc Ser C
Appl Stat 32:2505–2511
LoRusso P, Venkatakrishnan K, Ramanathan R, Sarantopoulos J, Mulkerin D, Shibata S, Hamilton
A, Dowlati A, Mani S, Rudek M, Takimoto C, Neuwirth R, Esseltine D, Ivy P (2012)
Pharmacokinetics and safety of Bortezomib in patients with advanced malignancies and varying
degrees of liver dysfunction: phase I NCI Organ Dysfunction Working Group Study NCI-6432.
Clin Cancer Res 18(10):1–10
Mauguen A, Le Deleya M, Zohar S (2011) Dose-finding approach for dose escalation with overdose
control considering incomplete observations. Stat Med 30:1584–1594
Normolle D, Lawrence T (2006) Designing dose-escalation trials with late-onset toxicities using the
time-to-event continual reassessment method. J Clin Oncol 24:4426–4433
O’Quigley J (2006) Phase I and phase I/II dose finding algorithms using continual reassessment
method. In: Crowley J, Ankherst D (eds) Handbook of statistics in clinical oncology, 2nd edn.
Chapman and Hall/CRC Biostatistics Series, New York
O’Quigley J, Iasonos A (2012) Dose-finding designs based on the continual reassessment method.
In: Crowley J, Hoering (eds) Handbook of statistics in clinical oncology, 3rd edn. Chapman and
Hall/CRC Biostatistics Series, New York
O’Quigley J, Iasonos A (2014) Bridging solutions in dose-finding problems. J Biopharm Stat 6(2):185–197
O’Quigley J, Paoletti X (2003) Continual reassessment method for ordered groups. Biometrics
59:430–440
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase I
clinical trials in cancer. Biometrics 46(1):33–48
O’Quigley J, Shen L, Gamst A (1999) Two sample continual reassessment method. J Biopharm Stat
9:17–44
O’Quigley J, Paoletti X, Maccario J (2002) Nonparametric optimal design in dose finding studies.
Biostatistics 3(1):51–56
Paoletti X, O’Quigley J, Maccario J (2004) Design efficiency in dose finding studies. Comput Stat
Data Anal 45:197–214
Partinen M, Hirvonen K, Jama L, Alakuijala A, Hublin C, Tamminen I, Koester J, Reess J (2006)
Efficacy and safety of pramipexole in idiopathic restless legs syndrome: a polysomnographic
dose-finding study – the PRELUDE study. Sleep Med 7:407–417
Piantadosi S (2017) Clinical trials: a methodologic perspective, 3rd edn. Wiley, Hoboken
Polley M (2011) Practical modifications to the time-to-event continual reassessment method for
phase I cancer trials with fast patient accrual and late-onset toxicities. Stat Med 30:2130–2143
Ramanathan R, Egorin M, Takimoto C, Remick S, Doroshow J, LoRusso P, Mulkerin D, Grem J,
Hamilton A, Murgo A, Potter D, Belani C, Hayes M, Peng B, Ivy P (2008) Phase I and
pharmacokinetic study of Imatinib Mesylate in patients with advanced malignancies and
varying degrees of liver dysfunction: a study by the National Cancer Institute Organ Dysfunc-
tion Working Group. J Clin Oncol 26:563–569
Raphael M, le Deley M, Vassal G, Paoletti X (2010) Operating characteristics of two independent
sample design in phase I trials in paediatric oncology. Eur J Cancer 46:1392–1398
Reiner E, Paoletti X, O'Quigley J (1999) Operating characteristics of the standard phase I clinical
trial design. Comput Stat Data Anal 30(3):303–315
Robertson T, Wright FT, Dykstra R (1988) Order restricted statistical inference. Wiley, New York
Rogatko A, Schoeneck D, Jonas W, Tighiouart M, Khuri F, Porter A (2007) Translation of
innovative designs into phase I trials. J Clin Oncol 25(31):4982–4986
Sauter A, Ullensvang K, Niemi G, Lorentzen H, Bendtsen T, Børglum J, Pripp A, Romundstad L
(2015) The shamrock lumbar plexus block: a dose-finding study. Eur J Anaesthesiol
32:764–770
Schaller S, Fink H, Ulm K, Blobner M (2010) Sugammadex and neostigmine dose-finding study for
reversal of shallow residual neuromuscular block. Anesthesiology 113:1054–1060
Senderowicz A (2010) Information needed to conduct first-in-human oncology trials in the United
States: a view from a former FDA medical reviewer. Clin Cancer Res 16(6):1719–1725
Shen L, O’Quigley J (1996) Continual reassessment method: a likelihood approach. Biometrics
52:673–684
Shi Y, Yin G (2013) Escalation with overdose control for phase I drug combination trials. Stat Med
32:4400–4412
Skolnik JM, Barrett JS, Jayaraman B, Patel D, Adamson PC (2008) Shortening the timeline of
pediatric phase I trials: the rolling six design. J Clin Oncol 26(2):190–195
Storer B (1989) Design and analysis of phase I clinical trials. Biometrics 45(3):925–937
Stylianou M, Flournoy N (2002) Dose finding using the biased coin up-and-down design and
isotonic regression. Biometrics 58(1):171–177
Thall P, Millikan R, Mueller P, Lee S-J (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59:487–496
Tighiouart M, Rogatko A (2014) Dose finding with escalation with overdose control (EWOC) in
cancer clinical trials. Stat Sci 25(2):217–226
Tighiouart M, Cook-Wiens G, Rogatko A (2012) Incorporating a patient dichotomous characteristic
in cancer phase I clinical trials using escalation with overdose control. J Probab Stat 10:Article
ID: 567819
Tighiouart M, Liu Y, Rogatko A (2014a) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9(3):e93070
Tighiouart M, Piantadosi S, Rogatko A (2014b) Dose finding with drug combinations in
cancer phase I clinical trials using conditional escalation with overdose control. Stat Med 33
(22):3815–3829
Vidoni ED, Johnson DK, Morris JK, Van Sciver A, Greer CS, Billinger SA et al (2015) Dose-
response of aerobic exercise on cognition: a community-based, pilot randomized controlled
trial. PLoS One 10(7):e0131647
Wages NA, Conaway MR (2013) Specifications of a continual reassessment method design for
phase I trials of combined drugs. Pharm Stat 12(4):217–224
Wages N, Conaway M (2018) Revisiting isotonic phase I design in the era of model-assisted dose-
finding. Clin Trials 15(5):524–529
Wages N, Conaway M, O’Quigley J (2011a) Dose-finding design for multi-drug combinations. Clin
Trials 8:380–389
Wages N, Conaway M, O’Quigley J (2011b) Continual reassessment method for partial ordering.
Biometrics 67:1555–1563
Wages N, Conaway M, O’Quigley J (2013) Performance of two-stage continual reassessment
method relative to an optimal benchmark. Clin Trials 10:862–875
Wages NA, Iasonos A, O’Quigley J, Conaway MR (2019) Coherence principles in interval-based
dose-finding. Submitted
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics
61:217–222
Wheeler G, Sweeting M, Mander A (2016) AplusB: a web application for investigating A+B
designs for phase I cancer clinical trials. PLoS One. https://doi.org/10.1371/journal.pone.0159026.
Published: July 12, 2016
Yan F, Mandrekar S, Ying Y (2017) Keyboard: a novel Bayesian toxicity probability interval design
for phase I clinical trials. Clin Cancer Res 23(15):3994–4003
Yin G, Yuan Y (2009) Bayesian dose finding in oncology for drug combinations by copula
regression. Appl Stat 58(2):211–224
Yuan Z, Chappell R (2004) Isotonic designs for phase I cancer clinical trials with multiple risk
groups. Clin Trials 1(6):499–508
Zhao L, Lee J, Mody R, Braun T (2011) The superiority of the time-to-event continual reassessment
method to the rolling six design in pediatric oncology phase I trials. Clin Trials 8(4):361–369
54 Inferential Frameworks for Clinical Trials

James P. Long and J. Jack Lee

Contents
Introduction
Inferential Frameworks: Samples, Populations, and Assumptions
Frequentist Framework
  Optimizing Trial Design Using Frequentist Statistics
  Limitations and Guidance on Frequentist Statistical Inference
Bayesian Framework
  Sequential Design
  Bayesian Computation
  Prior Distributions and Schools of Bayesian Thought
  Guidance on Bayesian Inference
Inferential Frameworks: Connections and Synthesis
  Reducing Subjectivity and Calibrating P-Values with Bayes Factors
  Model-Based and Model-Assisted Phase I Designs Constructed Using Inferential Frameworks
  Model-Based and Model-Assisted Phase II Designs
Inferential Frameworks and Modern Trial Design Challenges
  Precision Medicine, Master Protocols, Umbrella Trials, Basket Trials, Platform Trials, and Adaptive Randomization
  Multiple Outcomes and Utility Functions
Summary and Conclusion
Cross-References
References

J. P. Long · J. J. Lee (*)
Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA
e-mail: [email protected]; [email protected]
Abstract
Statistical inference is the process of using data to draw conclusions about
unknown quantities. Statistical inference plays a large role both in designing
clinical trials and in analyzing the resulting data. The two main schools of
inference, frequentist and Bayesian, differ in how they estimate and quantify
uncertainty in unknown quantities. Typically, Bayesian methods have clearer
interpretation at the cost of specifying additional assumptions about the unknown
quantities. This chapter reviews the philosophy behind these two frameworks
including concepts such as p-values, Type I and Type II errors, confidence
intervals, credible intervals, prior distributions, posterior distributions, and
Bayes factors. Application of these ideas to various clinical trial designs including
3 + 3, Simon’s two-stage, interim safety and efficacy monitoring, basket,
umbrella, and platform drug trials is discussed. Recent developments in comput-
ing power and statistical software now enable wide access to many novel trial
designs with operating characteristics superior to classical methods.

Keywords
Bayes factors · Bayesian · Confidence intervals · Credible intervals · Frequentist ·
Hypothesis testing · P-value · Prior distribution · Sequential designs · Statistical
inference

Introduction

Statistical inference is the process of using data to draw conclusions about unknown
quantities. Rigorous frameworks for statistical inference ensure that the conclusions
are backed by some form of guarantee. For example, in drug development, one is
interested in estimating the response rate of a new agent with certain precision and/or
concluding whether the new agent has at least 30% response rate with certain
confidence. The two dominant frameworks for statistical inference, Bayesian and
Frequentist, both provide these guarantees in the form of probabilistic statements.
Inference frameworks play an important role both in designing clinical trials and in
analyzing and interpreting the information contained in the collected data.
Consider the popular oncology Phase I 3 + 3 dose escalation design with a set of
predefined dose levels (Storer 1989). Starting at the lowest dose level, three patients
are enrolled. If none experience a toxicity, the dose is escalated. If one experiences a
toxicity, an additional three are enrolled at the same dose level. Among the additional
three patients, if none experience a toxicity, the dose is escalated. If one experiences
a toxicity, the current dose or one dose lower is defined as the maximum tolerated
dose (MTD). If two or more patients experience toxicity in three or six patients, the
dose exceeds the MTD.
Table 1 lists the probability of dose escalation for given toxicity probability p
based on Eq. 1.
Table 1 Probability of dose escalation in the 3 + 3 design for given toxicity probability

Toxicity probability    Escalation probability
0.1                     0.91
0.2                     0.71
0.3                     0.49
0.4                     0.31
0.5                     0.17
0.6                     0.08

\text{Escalation Probability} = (1 - p)^{3} + 3p\,(1 - p)^{5} \qquad (1)

The table provides assurance that as the probability of toxicity of the current dose
increases, the chance of escalation decreases. This table requires simple probability
computations to construct.
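
As a check on Table 1, Eq. 1 can be evaluated directly. The short Python sketch below (the function name is ours) reproduces the tabulated values.

def escalation_probability(p):
    # Escalate if 0/3 toxicities, or 1/3 toxicities followed by 0/3
    # toxicities in the expansion cohort (Eq. 1).
    return (1 - p) ** 3 + 3 * p * (1 - p) ** 5

for p in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"p = {p:.1f}: P(escalation) = {escalation_probability(p):.2f}")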
The 3 + 3 design (and variants; see Le Tourneau et al. (2009), Lin and Shih
(2001)) is a rule-based method in which the trial design and interpretation of results
are typically not performed within a formal statistical inference framework. This
results in several weaknesses:

1. The algorithm does not explicitly produce an estimate of the toxicity rate at the
MTD. Studies show that the targeted toxicity rate is not the widely believed 33%
and depends on extraneous factors such as the number of dose levels used (Lin
and Shih 2001). In most typical applications, the targeted toxicity rate resulting
from a 3 + 3 design is between 23% and 28%, but it can be much lower or higher
(Smith et al. 1996).
2. Depending on the circumstances, investigators can elicit maximum targeted
toxicities for drugs (higher for drugs with higher potential efficacy and lower
for preventive agents), yet the 3 + 3 algorithm cannot incorporate this information
when finding the MTD.

A statistical inferential framework is necessary for addressing these issues. Such a
framework can provide a formal method for evaluating the clinical trial data, e.g.,
answering questions about the drug’s toxicity profile after the data has been collected
and providing an estimate of the probability of toxicity at the reported MTD. Equally
important, the framework can be used to design a trial with favorable operating
characteristics, e.g., ensure that enough, but not too many, patients are enrolled at
each of the potential dose levels, that the dose levels and patient allocations are
sensible, etc. Computer simulations show that 3 + 3 is deficient as a trial design
(O’Quigley et al. 1990; O’Quigley and Chevret 1991). Despite these weaknesses,
reviews suggest that over 90% of Phase I oncology drug trials use a 3 + 3 design
(Le Tourneau et al. 2009).
The last 250 years have seen extensive methodological development in the
frequentist and Bayesian inferential frameworks. Frequentist inference views param-
eters, unknown quantities of interest, as fixed. Probabilities are considered only with
respect to the data given these fixed parameters. Conversely, Bayesian inference
views parameters as random following a particular statistical distribution. The data is
fixed since it has been collected. Bayesian inference applies probabilistic statements
to estimate these unknowns directly given the data. While the frameworks exhibit
important differences, the fact that both have been successfully used in nearly every
scientific field demonstrates their breadth and versatility. These characteristics are
critical for clinical trials which seek answers to a broad range of questions, including
evaluation of efficacy, safety monitoring, dose identification, treatment assignment,
adaptation to interim events, go/no-go decision on drug development, cost-benefits
analysis, etc.
In the context of clinical trials, frequentist inference dominated the design and
analysis of clinical trials in the twentieth century. However, in the last 25 years,
Bayesian methods have increased in popularity due to a number of factors including
advances in computing power, better computational algorithms, deficiencies in
frequentist measures of evidence (such as the p-value), and the need for complex,
adaptive trial designs (Biswas et al. 2009; Lee and Chu 2012; Tidwell et al. 2019).
This chapter is organized as follows. Section “Inferential Frameworks: Samples,
Populations, and Assumptions” contrasts inferential frameworks with rule-based
methods for inference and discusses sampling assumptions common to both
frequentist and Bayesian frameworks. The two dominant inferential frameworks,
frequentist and Bayesian, are reviewed in sections “Frequentist Framework” and
“Bayesian Framework,” respectively. Popular designs are discussed such as the
frequentist Simon’s optimal two-stage design (Simon 1989) and the Bayesian design
of Thall et al. (1995). In section “Inferential Frameworks: Connections and
Synthesis,” efforts to connect the frameworks as well as Bayesian-frequentist hybrid
designs are reviewed. Section “Inferential Frameworks and Modern Trial Design
Challenges” describes recent developments and challenges in trial design. Section
“Summary and Conclusion” concludes with a discussion.

Inferential Frameworks: Samples, Populations, and Assumptions

Consider a hypothetical Phase II trial of an experimental cancer drug in which the
investigator would like to determine p, the percentage of all patients in some
population (e.g., patients with Stage III pancreatic cancer) who would achieve
complete or partial response if given the drug. It is costly and unethical (this is an
experimental drug) to treat every patient with the new drug, so it is given to a sample
of n = 10 patients, of whom six achieve responses. One would like to answer
questions such as:

1. What is a reasonable best guess (i.e., estimate) for p? Typically, this sample-based
estimate of p is denoted by p̂ (read “p hat”).
2. Is the new treatment better than standard of care? If standard of care produces a
response in 20% of patients, this statement becomes is p greater than 20%?
3. What is a range of plausible values for p?
4. What is the probability that the response rate is greater than 40%?
All of these questions except question #4 can be addressed in the frequentist and
Bayesian inferential frameworks. Question #4 can only be answered by the Bayesian
framework because it assumes that the unknown response rate is random, while in the
frequentist framework, the unknown response rate is considered fixed and does not
have a distribution. Note that the unknown, population response rate p is treated
separately from the sample-based estimate p̂. This separation of sample and population
is critical for formal inferential reasoning and is not typically discussed in the 3 + 3
dose finding algorithm. In 3 + 3, the MTD is defined based on the sample. In fact, 3 + 3
produces a number more akin to MTD-hat, an estimate of the true MTD. The 3 + 3 design
does not precisely define the true MTD which it is attempting to estimate. However,
one could posit the true MTD as the highest of the prespecified doses which will
produce a toxicity in no more than 33% of all patients in the population. With this
definition, one can now ask questions akin to 2 and 3, such as how likely is MTD-hat
to be no higher than MTD? The 3 + 3 design does not offer any answer to this seemingly
important question and in fact makes this question difficult to even ask by not
conceptually separating the estimate, MTD-hat, from what is being estimated, MTD.
Inferential frameworks require the existence of a population on which inferences
are to be drawn. Typically, the population of interest in clinical trials is all patients
who would receive the treatment if it were to be approved by a regulatory agency. In
the case of the experimental cancer drug, this may be all present and future patients
in the United States with Stage III pancreatic cancer.
Nearly all clinical trial designs, both Bayesian and frequentist, assume that the
sample is collected by randomly selecting patients from the population with each
patient being equally likely to be selected. In practice this random sampling is not
done for several reasons including that there do not exist easily accessible lists of all
patients in the population and that the resulting selection would result in a geograph-
ically diverse set of patients who would be difficult to treat. Instead patients are
enrolled at one or a small set of usually academic institutions (thus geographically
biasing the sample) and must consent to be treated (thus biasing the sample towards
individuals with a high desire for new treatments). The resulting sample may differ
from the population in terms of socioeconomic status, ethnicity, present health
condition, disease severity, etc.
The extent to which deviations in sampling assumptions occur is situation
dependent. Phase II clinical trials typically seek to demonstrate treatment efficacy
relative to efficacy on historical controls. Observed differences between treatment
and control may be the result of treatment efficacy or differences between trial
subjects and historical controls, e.g., the trial may systematically select healthier
patients than the historical control group. In this setting, standard Bayesian and
frequentist methods will both produce biased results. Phase III double-blinded
randomized studies are less likely to suffer from selection bias caused by deviations
in sampling assumptions. Yet problems can arise regarding generalization of study
findings to patient groups not meeting study eligibility criteria. See Jüni et al. (2001)
for a discussion of these issues. In addition, journals are more likely to publish
studies with positive findings. Hence, readers must be aware that the publication bias
may portray a rosier picture than what the truth is.
Frequentist Framework

Frequentist approaches to addressing questions 1, 2, and 3 above are now discussed.
There are many variants of these approaches tailored to fit specific types of data
(Casella and Berger 2002; Agresti and Franklin 2009). A few of the most common
versions are reviewed with an emphasis on correct frequentist interpretation of the
resulting quantities.
With regard to question 1, the best guess of p is referred to as a point estimator.
Common frequentist methods for point estimation include maximum likelihood
estimators (MLEs) and method of moment (MOM) estimators. Often these methods
agree with intuition. In the Phase II cancer trial, the most intuitive estimator for p is
the percentage of patients in the sample who achieved response. MLE and MOM
both require a likelihood function, the probability of observing data given a particular
value of the unknown parameter. The most natural likelihood for the cancer example
with n = 10 patients is the binomial distribution with the mathematical form

f(x \mid p) = \binom{10}{x} \left( \frac{p}{100} \right)^{x} \left( 1 - \frac{p}{100} \right)^{10 - x} \qquad (2)

where x is the number of responses. Figure 1 provides a graphical illustration of the
function f(x | p = 20%). From the graph, the probability of observing six responses,
assuming a population response rate of 20%, is about 0.0055, calculated by
mathematically computing f(x = 6 | p = 20%) in Eq. 2.

Fig. 1 Binomial probability mass function: f(x | p = 20%) plotted against x, the number of responses (0–10). This is a graphical illustration of Eq. 2 with p = 20%
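
For readers who want to verify this number, a one-line computation with scipy (assuming it is installed) evaluates Eq. 2 at x = 6:

from scipy.stats import binom

# Probability of exactly 6 responses in 10 patients when the true
# response rate is 20% (Eq. 2); prints roughly 0.0055.
print(binom.pmf(6, 10, 0.20))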
With the binomial model, both the MLE and MOM methods suggest using the
intuitive estimator of the percentage of patients who achieved response, in this case
60%. Statistical theory for these estimation methods provides theoretical guarantees
on its quality. While the percentage of responses in the sample may seem like the
only reasonable estimator, the Bayesian inferential approach discussed in the next
section typically suggests a different estimate. In this response example, MLE and
MOM theory do not produce a particularly surprising result. However, for more
complicated statistical models, these methods can be used to derive estimators when
intuition fails. For example, suppose dozens of candidate biomarkers (SNPs, protein
expression levels, prior treatments, etc.) are measured and some of them are associ-
ated with the response status. MLE, in conjunction with a logistic regression
statistical model, could be used to identify combinations of biomarkers which
predict response.
Hypothesis testing in conjunction with p-values is used to answer 2. Typically, the
hypothesis that the experimental treatment does not exceed standard of care
(p ≤ 20%) is termed the null hypothesis, while experimental treatment beating
standard of care ( p > 20%) is termed the alternative hypothesis. These may be
written as

H0: p ≤ 20%
Ha: p > 20%

A decision about whether to accept Ha may be based on the p-value, the
probability of observing a result as extreme or more extreme than the one actually
obtained, assuming that the null hypothesis is true. Since 6 out of 10 responses were
observed, the p-value is the probability of getting 6, 7, 8, 9, or 10 responses assuming
the true response rate in the population is 20%, which equals about 0.0064. Note that
the p-value is not the probability that the null hypothesis is correct, i.e., the
probability that p is less than 20% (or equals 20%). While this would seem like
the most natural definition, frequentist statistics cannot apply probabilities to the
unknown quantity p because p is assumed to be fixed. Many statisticians see the
convoluted definition of the p-value as a major disadvantage of frequentist hypoth-
esis testing.
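
The p-value quoted above can be reproduced by summing the upper tail of the binomial distribution; a minimal sketch using scipy:

from scipy.stats import binom

# One-sided p-value: P(X >= 6) when X ~ Binomial(10, 0.20), i.e., the
# chance of a result at least as extreme as 6/10 responses under the null.
p_value = binom.sf(5, 10, 0.20)  # sf(5) = P(X > 5) = P(X >= 6)
print(p_value)                   # roughly 0.0064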
The p-value does have an interpretation in the context of error rates. Consider
rejecting the null hypothesis (i.e., concluding that p is greater than 20% and that the
experimental drug works) if the p-value is less than some threshold, usually taken to
be 0.05. Then the probability of rejecting the null hypothesis when the null hypoth-
esis is actually true, known as a Type I error or the false positive rate, is equal to the
threshold.
Another way to understand p-values is to imagine running the trial over and
over again with a separate set of n patients. Suppose the experimental drug is not
effective. Then if one uses a p-value threshold of 0.05 to reject the null
hypothesis, in 5% of the hypothetical trials, one will erroneously conclude that
the drug is effective. The mental experiment of rerunning the trial and calculating
the p-value is an important component of frequentist reasoning. The distribution
of p-values from these hypothetical experiments is known as the sampling
distribution.
Confidence intervals are used for addressing 3. Frequentist statisticians use
confidence intervals to determine a range of plausible values for an unknown
parameter. In the context of the example, with six in ten responses, a 95% confidence
interval for p may span 31–83% (Wilson binomial confidence interval; Wilson
1927). Here, again, the natural interpretation is to claim that the probability that p
is in the interval 31–83% is 95%. However, the frequentist inferential framework
cannot apply probabilities to p, so this statement does not have meaning. Instead, the
95% refers to what would occur if the trial were run again and again, and in each
trial, one made a 95% confidence interval. Approximately 95% of these intervals
would contain p. However, any individual interval either does or does not contain p.
Again, under the frequentist approach, the data is random while the parameter is
unknown but fixed.
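
The 31–83% interval can be computed from the Wilson score formula; the sketch below implements it directly (the helper name is ours) rather than relying on a particular library routine.

import math
from scipy.stats import norm

def wilson_ci(x, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = norm.ppf(1 - (1 - level) / 2)
    phat = x / n
    center = (phat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z / (1 + z ** 2 / n)) * math.sqrt(
        phat * (1 - phat) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

print(wilson_ci(6, 10))  # roughly (0.31, 0.83)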
Figure 2 illustrates this concept. Here 100 hypothetical trials were run each with
10 patients. The true response rate is 40%. For each trial, a 95% confidence interval
is produced. The horizontal lines in Fig. 2 show the confidence intervals for each
trial. The black lines denote trials in which the resulting confidence interval includes
the true response rate, while the red lines denote trials in which the resulting
confidence interval does not include the true response rate. For about 95% of trials
(in this particular example, 96%), the confidence interval will include the true
response rate. However, under the frequentist paradigm, any particular trial confi-
dence interval either does or does not contain the true response rate, so the 95%
interpretation can only be applied to the average coverage probability of confidence
intervals, not any single interval. The “frequentist” framework studies the long-run
frequency properties of an estimator, hence the name. In
both the hypothesis testing and confidence interval estimation, the frequentist
approach answers the question of interest indirectly.

Fig. 2 Illustration of correct interpretation of frequentist confidence interval: 95% confidence intervals for p from 100 hypothetical trials, plotted against percentage remission (0–100)
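
The repeated-trial thought experiment behind Fig. 2 is easy to simulate. The following sketch reuses the wilson_ci helper from the previous example and an arbitrary random seed, and counts how many of 100 simulated intervals cover the true rate of 40%.

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
p_true, n_trials, n = 0.40, 100, 10
covered = 0
for _ in range(n_trials):
    x = rng.binomial(n, p_true)        # responses in one hypothetical trial
    lo, hi = wilson_ci(x, n)           # wilson_ci from the sketch above
    covered += (lo <= p_true <= hi)
print(f"{covered} of {n_trials} intervals cover p = {p_true}")
# In the long run, about 95% of such intervals cover the true rate.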
Optimizing Trial Design Using Frequentist Statistics

In addition to analyzing the results of a clinical trial, the frequentist inferential
framework is used to design trials with favorable operating characteristics. In simple
cases, the main design decision is selecting the sample size. For example, the sample
size could be chosen such that the width of the resulting confidence interval is no
wider than 10%. Alternatively, the sample size could be chosen so that one will
successfully reject the null hypothesis with a specified probability assuming some
true difference between treatment and control, i.e., power calculations. These design
decisions ensure that the trial will accrue enough patients to make conclusions (such
as to proceed to a Phase III trial) but not incur excess costs in terms of time, money,
and patient exposure to experimental drugs. Note that 3 + 3, as a fixed algorithmic
design, cannot be easily modified to achieve any of these objectives.
While a simple trial design may involve determining a single sample size, more
complicated designs can often achieve more efficient allocation of resources. For
example, Simon’s optimal two-stage design for Phase II trials terminates the trial
early if the experimental treatment is deemed ineffective based on an interim analysis
of the data (Simon 1989). Simon’s method chooses two sample sizes n1 and n2. After
accrual of n1 patients, the data is examined to determine if the treatment response rate
is sufficient to merit further investigation. If yes, additional n2 patients are enrolled,
and the combined sample of n1 + n2 is analyzed for efficacy. If not, the trial
terminates and the treatment is deemed ineffective. The methodology for determin-
ing effectiveness follows the frequentist hypothesis testing framework. Simon’s
optimal two-stage design determines sample sizes n1 and n2 which minimize the
expected number of patients to be treated when the null hypothesis is true while
maintaining Type I and Type II error guarantees. Alternatively, Simon’s minimax
design minimizes the maximum total sample size, subject to the specified Type I and
Type II error rates.
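
The operating characteristics of any candidate two-stage design can be computed from binomial probabilities alone. The sketch below does so; the specific boundary values (r1/n1 and r/n) are hypothetical inputs chosen for illustration, not Simon's published optimum for these rates.

from scipy.stats import binom

def two_stage_oc(r1, n1, r, n2, p):
    """Two-stage design: stop after stage 1 if responses <= r1 out of n1;
    otherwise enroll n2 more and declare efficacy if total responses > r.
    Returns (P(early termination), P(declare efficacy)) at true rate p."""
    pet = binom.cdf(r1, n1, p)
    reject = sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n2, p)
                 for x1 in range(r1 + 1, n1 + 1))
    return pet, reject

# Hypothetical boundaries for a null rate of 20% versus an alternative of 40%:
pet0, alpha = two_stage_oc(3, 13, 12, 30, 0.20)  # Type I error under the null
_, power = two_stage_oc(3, 13, 12, 30, 0.40)     # power under the alternative
exp_n_null = 13 + (1 - pet0) * 30                # expected sample size, null true
print(alpha, power, exp_n_null)

Searching over (r1, n1, r, n2) for the smallest expected sample size under the null, subject to the error constraints, is exactly the optimization Simon's optimal design performs.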

Limitations and Guidance on Frequentist Statistical Inference

Frequentist statistics is often criticized for the complicated interpretation of its
output. The p-value can be used to claim whether or not statistical significance is
reached for rejecting the null hypothesis while controlling Type I error. However, it
cannot be directly translated into the probability that the null hypothesis is true, a
quantity that is likely of greater interest. In order to assign probabilities to the null
and alternative hypothesis, one needs prior probabilities on these quantities, a feature
of Bayesian statistics.
The p-value threshold of 0.05 is commonly used to denote statistical significance
and is often the boundary between publishable and non-publishable results. The 0.05
threshold is quite arbitrary. Statisticians have argued for cutoffs of 0.005 and 0.001 to
denote significant and highly significant results (Johnson 2013). The following key
points were outlined in a recent statement from the American Statistical Association
regarding context, process, and purpose of p-values (Wasserstein and Lazar 2016):
• P-values can indicate how incompatible the data are with a specified statistical
model.
• P-values do not measure the probability that the studied hypothesis is true, or the
probability that the data were produced by random chance alone.
• Scientific conclusions and business or policy decisions should not be based only
on whether a p-value passes a specific threshold.
• Proper inference requires full reporting and transparency.
• A p-value, or statistical significance, does not measure the size of an effect or the
importance of a result.
• By itself, a p-value does not provide a good measure of evidence regarding a
model or hypothesis.

The frequentist hypothesis testing paradigm does not offer a method to conclude
that the null hypothesis is true. The result of the test is generally a p-value and a
conclusion that the null was rejected at some Type I error control rate (usually 0.05)
or the null hypothesis was not rejected. The conclusion is never that the null
hypothesis is true or highly likely to be true. Bayesian hypothesis testing permits
reaching these conclusions as will be discussed in Sections “Bayesian Framework”
and “Reducing Subjectivity and Calibrating P-Values with Bayes Factors.”
In other words, since null hypothesis significance testing (NHST) calculates the
p-value assuming that the null hypothesis Ho is true, the p-value is not the probability
that Ho is true. No specific alternative hypothesis H1 needs to be specified. No
estimation of the treatment effect is given. The inference is based on unobserved data
and violates the likelihood principle (Berger and Wolpert 1988).
In a recent special issue of the American Statistician, a collection of 43 articles
provides further discussion on p-values and statistical inference in general (Wasser-
stein et al. 2019). There are a few “Do’s” and “Don’t’s” offered in the editorial. For
the “Don’t’s”:

• Don’t base your conclusions solely on whether an association or effect was found
to be “statistically significant” (i.e., the p-value passed some arbitrary threshold
such as p < 0.05).
• Don’t believe that an association or effect exists just because it was statistically
significant.
• Don’t believe that an association or effect is absent just because it was not
statistically significant.
• Don’t believe that your p-value gives the probability that chance alone produced the
observed association or effect or the probability that your test hypothesis is true.
• Don’t conclude anything about scientific or practical importance based on statis-
tical significance (or lack thereof).

For the “Do’s”:

• Accept Uncertainty
• Be Thoughtful
– Thoughtfulness in the big picture, context, and prior knowledge


– Thoughtful alternatives and complements to p-values
– Thoughtful communication of confidence
• Be Open
– Openness to transparency and to the role of expert judgment
– Openness in communication
• Be Modest
– Being modest requires a reality check, in recognizing there is not a “true
statistical model” underlying every problem.
– Be modest about the role of statistical inference in scientific inference, encour-
aging others to reproduce your work.

Bayesian Framework

Bayesian statistics applies probabilistic statements directly to unknown quantities. It
assumes that the data is fixed and the parameter is random. For example, one may
state that given the data collected, the probability that the response rate for the
experimental treatment is greater than the response rate for standard of care (20%) is
95%. In mathematics this is

P(p > 20%) = 0.95

This statement avoids the opaque interpretation of p-values and can be seen as a
benefit of the Bayesian inferential framework (Berger 2003). Bayesian approaches to
questions 1, 2, 3, and 4 are now discussed in the context of the response example.
More complex analyses are discussed in references (Berry et al. 2010; Gelman et al.
2013).
In order to make probabilistic statements, prior knowledge about the unknown
parameters is formalized into a prior distribution. The prior is then combined with
the data to produce a posterior distribution. The mathematical machinery for com-
bining the prior with the data is Bayes theorem, from whence the framework gets its
name (Bayes 1763). Bayes theorem states that for two events A and B

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.

The left side of the equation is read “the probability that A is true given that B is
true.” In the context of the response example, A could be “the population response
rate p is greater than 20%” and B could be data such as six out of ten patients
responded. The left side of the equation, P(A|B), then reads the probability that the
response rate is greater than 20% (in the population) given that six out of ten
responses were observed in the sample.
The central concept of Bayesian statistics is information synthesis. Specifically,
Bayesian 1-2-3 is that prior plus data becomes posterior. The current posterior can be
considered as an updated prior for future data acquisition. The Bayesian method
takes a “learn-as-we-go” approach to perform continual learning by synthesizing all
available information at hand.
Note that it is a misconception that frequentist statisticians do not use Bayes
theorem. Bayes theorem is a mathematical fact, accepted and used by all statisticians.
The frequentist framework objects to the representation of unknown quantities using
distributions, in particular the prior, not the existence of Bayes theorem.
Figure 3 demonstrates simple Bayesian statistics with the response example. The
x-axis represents different possible response rates p for the experimental drug. The
blue and red curves and area under the corresponding curves represent prior knowl-
edge about p (prior distribution) and the updated knowledge of p after observing the
data (posterior distribution). The word prior refers to prior to data collection. The
mathematical form for this prior is a beta distribution. For Fig. 3, left panel,
\pi(p) = \frac{1}{B(0.6,\,1.4)} \left( \frac{p}{100} \right)^{-0.4} \left( 1 - \frac{p}{100} \right)^{0.4} \qquad (3)

This prior favors low response percentages with a mean response rate of 0.3 and
an effective sample size of 2. The source of prior knowledge can be subjective and is
sometimes controversial in Bayesian statistics. However, it is reasonable to formu-
late the prior distribution based on response rates observed with past experimental
drugs.
The prior distribution in Eq. 3 is combined with the likelihood function of Eq. 2 to
produce a posterior distribution. The posterior distribution of p after observing six
responses out of ten evaluated patients is mathematically computed by

f ðx ¼ 6jpÞπ ðpÞ    
1 p 5:6 p 4:4
π ðpjx ¼ 6Þ ¼ Ð ¼ 1 :
f ðx ¼ 6jpÞπ ðpÞdp Bð6:6, 5:4Þ 100 100

Fig. 3 (Left) Prior and posterior distributions (density versus p, the response percentage). The prior is Beta(0.6, 1.4), while the posterior is Beta(6.6, 5.4). (Right) A second example with the same data but a different prior. The prior is Beta(1.4, 0.6) and the posterior is Beta(7.4, 4.6). The posterior distribution is different than (Left), representing the subjective nature of Bayesian analyses
The first equality is Bayes theorem, while the second equality involves algebraic
manipulations. Notice the probabilities have shifted considerably after observing the
data of six responses out of ten patients. While the prior states that the probability that p is
greater than 20% is only 53%, the posterior assigns greater than 99% chance to this
event.
The posterior distribution in red is used to answer questions 1, 2, 3, and 4. The
most common Bayesian point estimator is the posterior mean or the average value of
p as indicated by the posterior distribution. With six out of ten responses, the
posterior mean is 55%. This is different than the sample percentage of response of
60%. The reason for this difference is that the prior distribution (in blue) favored low
percentages and is still exerting an effect on the point estimate even after collecting
the data. As the amount of data increases, the prior will have less and less effect. The
posterior mean estimator will become closer to the percentage of responses in the
sample. For example, with 60 responses out of 100 patients, the posterior mean
(using the blue prior) is 59%. Since the typical frequentist estimator is 60%, the
sample proportion, this example illustrates that as the sample size increases, Bayes-
ian and frequentist methods become more in agreement. This is common in many
other settings.
For addressing question 2, one can use the posterior distribution which shows
there is a 99.6% chance that the experimental treatment exceeds standard of care
(20% response). This is determined by calculating the percentage of the red posterior
area which is greater than 20% on the x-axis. This efficacy result would likely form
the basis for proceeding to a Phase III trial. For example, one could decide to proceed
to a Phase III trial if there is at least a 95% chance that the experimental treatment
exceeds standard of care. In practice, such decisions usually involve a number of
factors including toxicity analysis. In section “Multiple Outcomes and Utility
Functions” statistical methodology for formal incorporation of multiple objectives
(e.g., efficacy and safety) in decisions is discussed.
For addressing question 3, the red posterior distribution indicates there is a 95%
chance that the response rate is between 28% and 81%. This is known as a 95%
credible interval, the Bayesian equivalent of a confidence interval. These endpoints,
28% and 81%, are the 2.5 and 97.5 percentiles of the red curve (i.e., the area of the
red region to the left of 28% is 0.025, and the area of the red region to the left of 81%
is 0.975).
For addressing question 4, the tail probability of the response rate greater than
40% can be calculated by summing up or integrating the area under the curve from
the response rate of 0.4 to 1.0, which is 85%.
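
All four numerical answers above follow from the Beta(6.6, 5.4) posterior; a minimal scipy sketch (working on the 0–1 scale rather than percentages):

from scipy.stats import beta

posterior = beta(6.6, 5.4)              # posterior after 6/10 responses
print(posterior.mean())                 # question 1: about 0.55
print(posterior.sf(0.20))               # question 2: P(p > 20%), about 0.996
print(posterior.ppf([0.025, 0.975]))    # question 3: 95% credible interval,
                                        #   roughly (0.28, 0.81)
print(posterior.sf(0.40))               # question 4: P(p > 40%), about 0.85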
The frequentist testing paradigm treats the null and alternative hypotheses asym-
metrically, and there is no clear way to conclude the null is true or make a statement
about confidence in the null hypothesis. In contrast Bayesian hypothesis tests treat
the null and alternative symmetrically. A posterior probability for each hypothesis
may be reported (sum of probabilities of null and alternative will of necessity
equal 1). In section “Reducing Subjectivity and Calibrating P-Values with Bayes
Factors,” we will discuss Bayesian hypothesis testing using Bayes factor.
Sequential Design

In many data analysis applications, the sample size is fixed and the data is analyzed
only after collection. However, in clinical trials patients accrue sequentially, offering
the opportunity for stopping based on interim analysis of safety and efficacy results.
The Bayesian framework is particularly simple because the reason for stopping
formally has no impact on the subsequent data analysis (Berger and Wolpert
1988). A caveat to this message is that data-dependent stopping rules can increase
sensitivity to prior distribution assumptions. For example, Bayesian credible inter-
vals constructed using conservative priors can have anti-conservative coverage
probability (lower than the specified amount) when data-dependent stopping rules
are used (Rosenbaum and Rubin 1984). Frequentist measures of evidence, such as p-
values, explicitly require incorporating the reason for stopping. This can be
implemented a priori, such as in Simon’s optimal two-stage design, in which interim
stopping is accounted for in the hypothesis test decision with the specified Type I and
Type II error rates.
Thall et al. (1995) proposed a popular Bayesian sequential design for Phase II clinical
trials. The design is used for monitoring multiple outcomes, such as safety and
efficacy. Data can be monitored patient by patient, but typically interim analyses are
conducted in cohort sizes of five or ten to reduce logistical burden. At each interim
analysis, such as five patients with toxicity and efficacy data available, the method
produces a probability distribution for the parameters, similar to Fig. 3. Prior to the
trial, tolerable safety and efficacy boundaries are established. For example, stop if the
probability of efficacy being greater than 20% is less than 5% or the probability of
toxicity greater than 25% is greater than 95%. Thus, the trial is stopped whenever
one becomes confident in efficacy less than 20% or toxicity greater than 25%. From
these probabilistic thresholds, one can determine stopping boundaries (number of
toxicities or treatment failures which will terminate the trial) at various interim
analysis points. In this way the trial design produces simple rules for continuing or
terminating the trial, much like a 3 + 3 design, but with probabilistic guarantees
about the decisions. In addition to stopping boundaries, operating characteristics of
the trial such as the probability of stopping early given some efficacy and toxicity
levels can be tabulated prior to trial initiation.
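
A minimal sketch of how such stopping boundaries can be tabulated, using the beta-binomial model of this chapter and the futility rule quoted above. The Beta(0.6, 1.4) prior and the interim sample sizes are illustrative choices of ours, not the specification of Thall et al. (1995).

from scipy.stats import beta

a, b = 0.6, 1.4               # illustrative prior
target, cutoff = 0.20, 0.05   # stop if P(response rate > 20% | data) < 5%

def stopping_boundary(n):
    """Largest response count x (out of n) that still triggers stopping;
    -1 means no futility stop is possible at this interim look."""
    boundary = -1
    for x in range(n + 1):
        if beta(a + x, b + n - x).sf(target) < cutoff:
            boundary = x
        else:
            break  # the posterior probability increases with x
    return boundary

for n in (5, 10, 15, 20):
    print(n, stopping_boundary(n))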

Bayesian Computation

Although Bayes theorem was published more than 250 years ago, its use was limited
to conjugate models in which the posterior distribution has the same parametric form
as the prior distribution (such as the beta-binomial example discussed in the previous
section). In these cases, analytic solutions are available for computing the posterior
distribution. Bayesian computation beyond conjugate cases can be demanding.
Since the late 1980s, a combination of faster computers and better algorithms
(e.g., Gibbs sampling, Metropolis-Hastings sampling, and general Markov Chain
Monte Carlo (MCMC)) has made Bayesian computation for clinical trial data sets
feasible and even routine. Software tools such as BUGS, JAGS, STAN, and SAS
PROC MCMC allow easy implementation of a wide spectrum of Bayesian models
(Spiegelhalter et al. 1996; Plummer 2003; Chen 2009; Carpenter et al. 2017).
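For intuition about what these samplers do, a random-walk Metropolis sampler for a single binomial rate fits in a few lines; this toy sketch, with an arbitrary non-conjugate prior, is only meant to illustrate the machinery that BUGS, JAGS, Stan, and PROC MCMC automate.

```python
# Toy random-walk Metropolis sampler for a binomial rate p under an
# arbitrary non-conjugate prior (a standard normal on the log-odds).
import numpy as np

rng = np.random.default_rng(0)
x, n = 6, 10                                 # observed: 6 responses in 10 patients

def log_post(p):
    if not 0.0 < p < 1.0:
        return -np.inf                       # outside the support
    log_lik = x * np.log(p) + (n - x) * np.log(1 - p)
    log_prior = -0.5 * np.log(p / (1 - p)) ** 2
    return log_lik + log_prior

draws, p = [], 0.5
for _ in range(20000):
    prop = p + 0.1 * rng.standard_normal()   # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(p):
        p = prop                             # accept; otherwise keep current p
    draws.append(p)

print(np.mean(draws[2000:]))                 # posterior mean after burn-in
```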

Prior Distributions and Schools of Bayesian Thought

The posterior distribution is computed using Bayes theorem and is dependent
upon both the prior distribution and the data. Different prior distributions can
produce different conclusions from the data. For example, in Fig. 3 (right), the
prior belief about effectiveness has been changed (relative to Fig. 3 (left)) to a
Beta(1.4,0.6) in order to favor high response rates. Using the same data as
before, six out of ten observed responses, the posterior is now the red curve in
the right plot. Notice that the two posterior distributions differ even though the
data collected are the same. This will result in different numerical summaries of
the posterior and possibly even different decisions about whether to proceed to
the next clinical trial phase. For example, with the Fig. 3 (left) posterior,
the probability the response rate is greater than 40% is about 85%, while for
the Fig. 3 (right) posterior, it is about 94%.
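Because the Beta prior is conjugate to the binomial likelihood, both posterior summaries quoted above can be checked in closed form; the short sketch below assumes the two priors shown in Fig. 3.

```python
# Conjugate check of the Fig. 3 posterior probabilities: a Beta(a, b) prior
# combined with 6 responses in 10 patients yields a Beta(a + 6, b + 4) posterior.
from scipy.stats import beta

for a, b in [(0.6, 1.4),    # Fig. 3 (left) prior
             (1.4, 0.6)]:   # Fig. 3 (right) prior, favoring high response rates
    print(f"Beta({a},{b}) prior: P(p > 0.4 | data) =",
          round(1 - beta.cdf(0.4, a + 6, b + 4), 2))
# approximately 0.85 and 0.94, matching the values quoted in the text
```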
Within Bayesian statistics, there are different views about how to construct prior
distributions. Spiegelhalter et al. (2004) identified four schools of Bayesian thinking
which may represent a continuum from frequentist to fully Bayesian methods:

1. Empirical: Prior distributions are constructed from data, usually in the context of
a hierarchical model which has multiple levels of parameters (see diagnostic
testing example below for a definition and example of a hierarchical model).
While empirical Bayesian analyses use Bayes theorem, the resulting inferences
typically reported are frequentist (e.g., confidence intervals rather than credible
intervals, MLEs rather than posterior means). Many statisticians who consider
themselves frequentists use empirical Bayes methods.
2. Reference (or Objectivist): A set of default, or reference, prior distributions are
used which do not attempt to incorporate subjective knowledge about the param-
eters. Default priors may be chosen to have favorable mathematical properties
(Jeffreys 1946). Reference Bayesians may employ non-informative/vague priors
such as Uniform(0, 1), or improper priors that are not probability densities, such
as a normal prior with infinite variance; neither is meant to represent actual
prior belief.
3. Proper (or Subjectivist): The prior is chosen to reflect subject matter knowledge
about the parameter. Different practitioners will have different beliefs about the
parameter, resulting in different informative priors and different inferences.
4. Decision Theoretic: Utility functions, which assign numeric values to different
outcomes, are combined with posterior distributions, which reflect uncertainty
about the state of nature, to make decisions. The particular choice of utility
function adds an additional level of subjectivity to the analysis, for example, in
the trade-off of toxicity versus efficacy or cost versus benefit.

An example from diagnostic testing may help illustrate the differences, similarities,
and continuity in these schools of thought. Suppose a test for lung cancer has 95%
sensitivity and 90% specificity. For a particular individual, let the parameter θ equal 1 if
the person has cancer and 0 if not. Let Y equal 1 if the test is positive for the individual
and 0 if negative. What is the probability that this person has cancer given she tests
positive, i.e., P(θ = 1 | Y = 1)? The following steps illustrate a dynamic back-and-forth
among the various schools of statistical thinking on how to address this question:

(i) Since this question involves computing probabilities of parameters, it can
naturally be addressed in a Bayesian framework once a prior for θ has been
chosen. Supposing this individual was selected randomly from a population,
the appropriate prior (the probability the individual has the disease prior to
administering the test) is the disease prevalence in the population, termed p.
Mathematically, P(θ = 1) = p. The post-test probability is then computed using
Bayes formula

\[ P(\theta = 1 \mid Y = 1) = \frac{P(Y = 1 \mid \theta = 1)\,P(\theta = 1)}{P(Y = 1 \mid \theta = 1)\,P(\theta = 1) + P(Y = 1 \mid \theta = 0)\,P(\theta = 0)} = \frac{0.95\,p}{0.95\,p + 0.1\,(1 - p)} \]

For example, with p = 0.1, the post-test probability is 0.51. This could be
viewed as a subjective Bayesian analysis in which the prior was chosen based
on the practitioner's belief about disease prevalence. (A numerical sketch of
steps (i)–(iii) appears after this list.)
(ii) A frequentist statistician may object to this analysis as subjective. Where does
the disease prevalence number originate? The frequentist statistician may
engage in a literature search and find that 100 individuals from the population
were given a gold standard lung cancer test (always provides correct result),
and X had cancer. The frequentist computes a (binomial) MLE estimate
$\hat{p} = X/n$, the sample proportion who have the disease. The post-test
probability estimate is

\[ \hat{P}(\theta = 1 \mid Y = 1) = \frac{0.95\,\hat{p}}{0.95\,\hat{p} + 0.1\,(1 - \hat{p})} \]

This could be viewed as a simple empirical Bayes model or as a frequentist
application of the invariance property of MLEs (Theorem 7.2.10 of Casella and
Berger (2002)). This is a hierarchical model because there are two levels of
parameters, a parameter describing the population (p in this case) and a parameter
describing an individual (θ in this case). Empirical Bayesians consider individual-level
parameters random, but population-level parameters fixed. If $\hat{p} = 0.1$, then
the empirical Bayes estimate will match the subjectivist Bayesian estimate from (i).
(iii) Now that the frequentist (or empirical Bayesian) has included X formally in the
analysis, a Bayesian would seek an analysis that uses both Y and X. While the
frequentist treated the population-level parameter p as fixed, the Bayesian will
put a prior on this quantity. This distribution represents belief about the
population prevalence prior to observing the gold standard data X. A subjec-
tivist Bayesian will incorporate beliefs into this prior, e.g., most diseases are not
common, so the prior on p will favor disease prevalences less than 0.5. In
contrast, the two objective Bayesian priors for this model are Beta(1/2,1/2)
(Jeffreys prior) and Beta(1,1) (suggested by Laplace); see Mossman and Berger
(2001).
(iv) While the empirical, reference, and proper Bayesian will all report
$\hat{P}(\theta = 1 \mid Y = 1)$, a decision theoretic Bayesian will go a step further and seek
to use the data X and Y (along with the prior distributions) to make some choice
of action. For example, given that an individual tests positive, should she
undergo an additional invasive test which has some potential side effects?
This decision requires balancing many factors, including the risks associated
with having undetected disease and the side effects of the invasive test, along with
the measure $\hat{P}(\theta = 1 \mid Y = 1)$. The consideration of all these factors is typically
formalized in a utility function.
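The numerical sketch below works through steps (i)–(iii) with the quantities above; the gold-standard counts X = 10 out of n = 100 are an illustrative assumption.

```python
# Post-test probability P(theta = 1 | Y = 1) under the subjective, empirical,
# and objective Bayes routes; X = 10 diseased out of n = 100 is assumed.
from scipy.stats import beta

sens, spec = 0.95, 0.90

def post_test(p):
    return sens * p / (sens * p + (1 - spec) * (1 - p))

print(post_test(0.1))          # (i) subjective prior p = 0.1 -> about 0.51
X, n = 10, 100                 # assumed gold-standard data
print(post_test(X / n))        # (ii) empirical Bayes / MLE plug-in
# (iii) objective Bayes with the Jeffreys Beta(1/2, 1/2) prior on p:
# average the post-test probability over the Beta(0.5 + X, 0.5 + n - X)
# posterior by Monte Carlo.
p_draws = beta(0.5 + X, 0.5 + n - X).rvs(100_000, random_state=0)
print(post_test(p_draws).mean())
```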

Mossman and Berger (2001) discuss this problem in the more complex case
where the sensitivity and specificity must be estimated from data and confidence
bounds (rather than simply point estimates) for $\hat{P}(\theta = 1 \mid Y = 1)$ are desired. They
find the objectivist Bayesian method has desirable properties relative to frequentist/
empirical Bayes analyses.
The schools of thought above involve considerable overlap. Some statisticians
advocate using multiple approaches within the same problem. For example, cali-
brated Bayes recommends using Bayesian (either reference or proper) analysis for
fitting statistical models while using frequentist methods to test the quality of the
model itself (Little 2006).

Guidance on Bayesian Inference

Several important elements for applying Bayesian methods are listed below.

1. Define the primary endpoint(s) under the Bayesian probability model.
2. Determine prior information.
(a) Historical data of similar studies.
(b) Elicited expert opinion.
(c) Non-informative or objective priors.
3. Extensive simulations.
(a) Calibrate design parameters to reach desirable operating characteristics such
as accuracy for making correct decisions, study accrual and duration, early
stopping probabilities, etc.
(b) Maintain the desirable frequentist properties such as controlling Type I and
Type II errors.
4. Sensitivity analysis.

An important principle of Bayesian analysis is that one should not manipulate
the prior in order to obtain desirable results after the data are collected. After setting the
prior, sensitivity analysis can be applied to study its influence. For example, the two
posteriors in Fig. 3 are based on two priors (with different parameters a and b) and
can help illustrate to what extent posterior conclusions are influenced by the prior.

Inferential Frameworks: Connections and Synthesis

Rule-based, frequentist, and Bayesian trial designs all exist and are in use because
each approach has strengths: rule-based methods are simple to follow, frequentist
methods avoid prior selection, and Bayesian methods are flexible. The strengths of
each approach have motivated trial design and statistical methodology which incor-
porate ideas from multiple frameworks.

Reducing Subjectivity and Calibrating P-Values with Bayes Factors

In the response example above, the prior belief about p (blue curve in Fig. 3 left)
implied a prior belief about whether the response rate of the experimental treatment
exceeded that of the standard of care (20%). In particular, prior to collecting any
data, one assumed a 53% chance that the experimental treatment had response rate
greater than 20% (the area under the blue curve to the right of 20%) and hence a 47%
chance that the experimental treatment was worse than standard of care (the area
under the blue curve to the left of 20%). The act of assigning prior probabilities to the
null and alternative hypothesis may be seen as especially subjective.
One Bayesian remedy for this problem is to construct prior distributions for the
null hypothesis and alternative hypothesis separately. One may then calculate the
probability of the data given the null hypothesis is true and the probability of the data
given the alternative hypothesis is true. The ratio of these two quantities is known as
the Bayes factor. Letting D denote data, H0 denote the null hypothesis, and H1 denote
the alternative hypothesis, the Bayes factor links the posterior odds with the prior
odds:

\[ \underbrace{\frac{P(H_0 \mid D)}{P(H_1 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid H_0)}{P(D \mid H_1)}}_{\text{Bayes factor}} \times \underbrace{\frac{P(H_0)}{P(H_1)}}_{\text{prior odds}} \qquad (4) \]

Table 2 (adapted from Goodman (1999) Table 1) relates the Bayes factor, the prior
probability of the null hypothesis, and the posterior probability of the null hypoth-
esis. For example, with a Bayes factor of 1/5 and a prior probability on the null
hypothesis of 90%, the posterior probability of the null is 64%. This is determined by
noting that a 90% prior on the null is equivalent to a 9:1 odds, so the posterior odds is
$\frac{1}{5} \times \frac{9}{1} = \frac{9}{5}$. The posterior odds is then converted to a
posterior probability as $\frac{9/5}{9/5 + 1} \approx 0.64$. Reporting the Bayes factor
(BF) does not require specifying the prior odds (i.e., the Bayes factor does not
depend on the prior probability of the null hypothesis) and is thus perceived as
testing a hypothesis “objectively” (Berger
1985) and removes “an unnecessary element of subjectivity” (Johnson and Cook
2009).
In Fig. 3 (left), the Bayes factor can be computed as a way to remove the influence
of the initially specified prior probability of the null, P(H0) = 0.47, and only consider
the conditional priors P(p | H0) and P(p | H1) when making a decision about the
veracity of the null hypothesis. The Bayes factor can be determined for Fig. 3 (left)
by noting that P(H0 | D) = P(p < 0.2 | data) = 0.0041 and thus P(H1 | D) =
P(p > 0.2 | data) = 0.9959, resulting in posterior odds of 0.0042. The prior distribution
Beta(0.6,1.4) implies P(H0) = 0.47 and P(H1) = 0.53, with prior odds = 0.87. Thus,
the Bayes factor, the posterior odds divided by the prior odds, is 0.0048
in favor of the alternative hypothesis that the response rate is greater than 0.2.
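The Bayes factor just described can be verified directly from the Beta distributions underlying Fig. 3 (left); this sketch uses the conjugate posterior Beta(6.6, 5.4) implied by six responses in ten patients.

```python
# Bayes factor for H0: p < 0.2 versus H1: p > 0.2 from Fig. 3 (left):
# prior Beta(0.6, 1.4), data 6/10 responses, posterior Beta(6.6, 5.4).
from scipy.stats import beta

prior, post = beta(0.6, 1.4), beta(6.6, 5.4)
prior_odds = prior.cdf(0.2) / (1 - prior.cdf(0.2))  # P(H0) / P(H1)
post_odds = post.cdf(0.2) / (1 - post.cdf(0.2))     # P(H0|D) / P(H1|D)
print(post_odds / prior_odds)                       # about 0.005 in favor of H1
```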
A user of the Bayes factor can specify his own prior probability and then compute
the posterior probability of the alternative or consider several possible prior proba-
bilities (as in Table 2). For example, a Bayes factor of 1/100 is considered strong–
very strong evidence for the alternative hypothesis because even assuming a 90%
prior probability on the null, the posterior probability on the null is 8%.
Bayes factors can also be used to calibrate p-values in an attempt to find more
objective cutoffs than 0.05. Johnson (2013) developed connections between Bayes
factors and p-values based on Bayesian uniformly most power tests. He argued that
the p-value thresholds of 0.005 and 0.001 should be used to denote significant and
highly significant results in clinical trials, rather than the typical 0.05.
A simple version of this idea can be seen with the Gaussian null hypothesis
testing problem. Suppose the common scenario of testing a simple null hypothesis
(population mean equals some constant) and that under the null and alternative
hypotheses, the frequentist test statistic has a normal distribution (Z score).

Table 2 Bayes factors, prior probabilities, and posterior probabilities (change in probability of the null, i.e., P(H0) to P(H0|D))

Strength of evidence | Bayes factor | From (%) | To (%)
Weak                 | 1/5          | 90       | 64
                     |              | 50       | 17
                     |              | 25       | 6
Moderate             | 1/10         | 90       | 47
                     |              | 50       | 9
                     |              | 25       | 3
Moderate–strong      | 1/20         | 90       | 31
                     |              | 50       | 5
                     |              | 25       | 2
Strong–very strong   | 1/100        | 90       | 8
                     |              | 50       | 1
                     |              | 25       | 0.3

Table 3 P-values, Bayes factors, and posterior probabilities (change in probability of the null, i.e., P(H0) to P(H0|D))

P-value (Z score) | Minimum Bayes factor | From (%) | To (%)
0.1 (1.64)        | 0.26 (1/3.8)         | 75       | 44
                  |                      | 50       | 21
                  |                      | 17       | 5
0.05 (1.96)       | 0.15 (1/6.8)         | 75       | 31
                  |                      | 50       | 13
                  |                      | 26       | 5
0.03 (2.17)       | 0.095 (1/11)         | 75       | 22
                  |                      | 50       | 9
                  |                      | 33       | 5
0.01 (2.58)       | 0.036 (1/28)         | 75       | 10
                  |                      | 50       | 3.5
                  |                      | 60       | 5
0.001 (3.28)      | 0.005 (1/216)        | 75       | 1
                  |                      | 50       | 0.5
                  |                      | 92       | 5

Then, for a given Z-score (equivalently, a p-value), one can derive a minimum Bayes factor
and hence a maximum level of support for the alternative hypothesis. Table 3
(adapted from Goodman (1999) Table 2) displays these relations. The common p-
value threshold of 0.05 implies a minimum Bayes factor of 0.15. If one assigns a
50% chance that the null hypothesis is true a priori, then the posterior on the null is
13%. Thus a p-value of 0.05 could be considered moderate, but certainly not
definitive, evidence for the alternative.
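For a Gaussian point null, the minimum Bayes factor underlying Table 3 has the closed form exp(-Z^2/2) (Goodman 1999); the sketch below reproduces the table rows under a chosen prior probability on the null.

```python
# Calibrating two-sided p-values with the minimum Bayes factor
# MBF = exp(-Z^2 / 2) for a Gaussian point null (Goodman 1999).
import numpy as np
from scipy.stats import norm

def calibrate(p_value, prior_null=0.50):
    z = norm.isf(p_value / 2)                # two-sided Z score
    mbf = np.exp(-z ** 2 / 2)                # minimum Bayes factor for H0
    post_odds = mbf * prior_null / (1 - prior_null)
    return mbf, post_odds / (1 + post_odds)  # smallest attainable P(H0 | D)

for p in (0.1, 0.05, 0.03, 0.01, 0.001):
    mbf, post = calibrate(p)
    print(f"p = {p}: MBF = {mbf:.3f}, min P(H0|D) = {post:.3f}")
# reproduces the 50%-prior rows of Table 3 (e.g., p = 0.05 -> about 13%)
```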
For more on Bayes factors, see Johnson and Cook (2009) for a Phase II single-
arm trial design and Goodman (1999) and Kass and Raftery (1995) for general
reviews. The website https://biostatistics.mdanderson.org/softwaredownload/ hosts
software for implementing clinical trial designs with Bayes factors.

Model-Based and Model-Assisted Phase I Designs Constructed Using Inferential Frameworks

The Continual Reassessment Method (CRM) is a Bayesian Phase I trial design
meant to address some of the deficiencies of the 3 + 3 design (O’Quigley et al. 1990). CRM
assumes a dose-toxicity relationship curve and, using a Bayesian framework,
updates estimates of the curve with each new patient. The next patient is given the
dose with estimated toxicity probability closest to the targeted toxicity probability.
CRM can estimate the maximum tolerated dose (MTD) at any desired target toxicity
probability, e.g., at 15%, 30%, or 40%, etc. depending on the respective clinical trial
setting. It can escalate and de-escalate dose levels an arbitrary number of times and
produce an estimate of the toxicity probability at each dose level, features absent in
traditional 3 + 3 designs.
The traditional 3 + 3 remains popular likely due to its simplicity. While the rule-
based 3 + 3 dose levels can be assigned by any clinical trialist, CRM requires
evaluating the posterior distribution at each cohort, a task that requires computer
software and a statistician.
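To make the contrast concrete, the sketch below implements a minimal one-parameter CRM update on a grid; the skeleton, prior, and target are illustrative assumptions, and real trials would use validated software.

```python
# Minimal one-parameter CRM: power model p_j = skeleton_j ** exp(a) with a
# N(0, 1) prior on a, evaluated on a grid. All settings are illustrative.
import numpy as np

skeleton = np.array([0.05, 0.12, 0.25, 0.40])  # assumed prior DLT probabilities
target = 0.25                                  # targeted toxicity probability
a_grid = np.linspace(-3, 3, 601)
prior = np.exp(-0.5 * a_grid ** 2)             # N(0, 1) prior, unnormalized

def next_dose(doses_given, dlts):
    """Update the posterior over a and recommend the next dose level."""
    p = skeleton[np.asarray(doses_given)] ** np.exp(a_grid)[:, None]
    lik = np.prod(np.where(np.asarray(dlts), p, 1 - p), axis=1)
    post = prior * lik
    post /= post.sum()
    # posterior-mean DLT probability at each dose level
    p_hat = (skeleton ** np.exp(a_grid)[:, None] * post[:, None]).sum(axis=0)
    return int(np.argmin(np.abs(p_hat - target))), p_hat

dose, p_hat = next_dose(doses_given=[0, 0, 1, 1, 1], dlts=[0, 0, 0, 1, 0])
print(dose, p_hat.round(3))
```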
To address this concern, new trial designs have attempted to merge the good
operating characteristics of CRM with the simplicity of rule-based methods such as
3 + 3. These represent one sort of hybrid trial. A particular example of this new
approach is the Phase I Bayesian Optimal Interval Design (BOIN) (Liu and Yuan
2015). In BOIN, users specify a toxicity interval. The design seeks a dose with
toxicity in the interval. BOIN escalates or de-escalates doses based on the results of a
Bayesian hypothesis test of whether the current dose toxicity is in the acceptable
toxicity interval. Similar to CRM, the method can result in an arbitrary number of
escalations and de-escalations. However, unlike CRM, these decisions are made
only based on the toxicities observed at the current dose, not other doses. This
feature results in simple interval boundaries for the escalation decision which can be
pre-computed at the design level. Thus the trial runs without consulting computer
software, much like the 3 + 3 design. Figure 4 displays a decision flowchart for
a BOIN trial design with a targeted toxicity probability of 0.3. Dose escalation, de-
escalation, and retention decisions are entirely based on comparing the DLT rate at
the current dose to fixed thresholds provided in the diagram. BOIN designs (includ-
ing these operational flowcharts) can be created using freely available web applica-
tions (http://trialdesign.org).

Model-Based and Model-Assisted Phase II Designs

The Phase II sequential monitoring design of Thall et al. (1995) (TSE) proposes early
stopping for safety and efficacy based on posterior probabilities. A disadvantage of
this procedure is that, unlike Simon’s two-stage design (Simon 1989), the method does
not provide a recommendation about proceeding to a Phase III trial. TSE suggest
performing a separate analysis on the final data using either a frequentist or Bayesian
framework. A challenge of using frequentist methods in this setting is that standard
frequentist hypothesis tests will not control Type I or Type II error at the specified level
due to the previous sequential monitoring. Control of these errors is considered
desirable by regulatory agencies because they are not based on prior distributions
and are thus seen as more objective than traditional Bayesian hypothesis testing.
Recent Bayesian designs calibrate parameters to obtain desirable frequentist
properties, such as Type I and Type II error control, while simultaneously making
interim decisions in a straightforward Bayesian probabilistic manner. Two such
Phase II designs are the predictive probability design (Lee and Liu 2008) and the
Bayesian Optimal Phase II design (BOP2) (Zhou et al. 2017).
The predictive probability design adds early stopping for efficacy based on the
probability that the trial will achieve its objective, given all current information.
Fig. 4 BOIN flowchart with a targeted probability of toxicity of 0.3. BOIN retains the operational
simplicity of 3 + 3 but with better statistical properties

The null hypothesis is that the true experimental efficacy rate p is no better than
standard of care efficacy rate p0 (i.e., H0 : p ≤ p0). The targeted efficacy for the
experimental treatment is p1. At the final sample size N, the null hypothesis will
be rejected, and the treatment determined more effective than standard of care, if
P( p > p0 |data on N patients) > θT. Rather than waiting until all N patient responses
have been observed, the trial computes the probability of attaining this result in
cohorts of arbitrary size. The trial is terminated early if the probability falls below
θL (treatment determined ineffective early) or above θU (treatment determined
effective early). Similar to Simon’s two-stage design, the predictive probability
design optimizes over N, θU, θL, and θT to obtain specified Type I and Type II error
rates and minimum sample size. By controlling Type I and Type II error, the design
ensures good frequentist properties while making decisions based on straightfor-
ward posterior probabilities.
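A minimal sketch of the predictive probability calculation is shown below, assuming a conjugate Beta prior; the prior, p0, θT, and maximum sample size are illustrative, and the actual design further searches over these parameters.

```python
# Sketch of the predictive probability of trial success at the final sample
# size N, given interim data, under a conjugate Beta prior (illustrative values).
from scipy.stats import beta, betabinom

a, b = 1.0, 1.0           # assumed Beta prior on the efficacy rate p
p0, theta_T = 0.20, 0.90  # standard-of-care rate and final success threshold
N = 40                    # assumed maximum sample size

def predictive_probability(x, n):
    """P(final analysis rejects H0 | x responses among n patients so far)."""
    m = N - n                                        # patients yet to enroll
    pp = 0.0
    for y in range(m + 1):                           # y future responses
        w = betabinom.pmf(y, m, a + x, b + n - x)    # predictive weight
        final = 1 - beta.cdf(p0, a + x + y, b + N - x - y)
        pp += w * (final > theta_T)                  # success indicator at N
    return pp

print(predictive_probability(x=8, n=20))  # compare to theta_L and theta_U
```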
The BOP2 design operates in a similar manner to Thall et al. (1995) (TSE) but
tunes the posterior probability cutoffs at each interim cohort in order to control Type
I error at some predefined threshold while maximizing power (i.e., minimizing Type
II error). The posterior probability cutoffs become more stringent as the trial pro-
gresses, requiring more evidence of treatment efficacy. Like TSE, BOP2 can monitor
multiple endpoints, such as safety and efficacy.
To illustrate BOP2, consider a hypothetical Phase II single-arm study with a
maximum of 50 patients. Each patient will be evaluated for a binary treatment
response and a binary toxicity response. BOP2 requires a user-specified null hypoth-
esis probability of efficacy, toxicity, and efficacy AND toxicity. These probabilities
could be determined from historical controls and are here assumed to be P(Eff) = 0.2,
P(Tox) = 0.4, and P(Eff & Tox) = 0.08. BOP2 then constructs a vague prior
distribution belonging to the Dirichlet class with these historical control probabilities
as the prior parameter means.
The power is computed at a user input value under the alternative hypothesis, here
specified as P(Eff) = 0.4, P(Tox) = 0.2, and P(Eff & Tox) = 0.08. A Type I error rate,
here 0.05, is chosen, as well as interim monitoring at 10, 20, 30, and 40 patients. BOP2
then seeks probability stopping thresholds to maximize power. These probability
thresholds are converted into stopping criteria in terms of number of toxicities and
number of responses at each interim cohort. The stopping boundaries are listed in
Table 4. This design obtains a power of 0.95 to reject the null hypothesis and conclude
that the treatment is more efficacious and less toxic than the null hypothesis.
One can also assume particular efficacy and toxicity values and calculate the
probability of stopping, the probability of claiming the treatment is acceptable, and
the expected sample size. These are known as the operating characteristics for the
trial and are contained in Table 5. Higher efficacy and lower toxicity probabilities
lead to lower chance of early stopping and higher chance of claiming acceptable. At
the null value of Pr(Eff) = 0.2 and Pr(Tox) = 0.4 (second row), the probability of
claiming acceptable is 3.57%, below the specified 5% threshold. At the alternative
value of Pr(Eff) = 0.4 and Pr(Tox) = 0.2, the power of the test is 95.36%.
The dynamic performance of the BOP2 design can be visualized with animations.
A still image from one such animation is contained in Fig. 5. Patients are monitored
in cohorts of size 10 up to a maximum of 50. At each interim cohort, a Go/No Go
decision is made based on the number of responses (left plot) and toxicities (right
plot). The study shown did not stop early because the number of responses and
toxicities stayed within the green regions.

Table 4 Stopping rules for a BOP2 trial design with the null hypothesis of P(Eff) = 0.2, P(Tox) = 0.4, and P(Eff & Tox) = 0.08, 5% Type I error

Cohort size | STOP IF # responses ≤ | OR # toxicities ≥
10          | 0                     | 6
20          | 3                     | 10
30          | 5                     | 13
40          | 8                     | 16
50          | 12                    | 18
Table 5 Operating characteristics for a BOP2 trial design with the null hypothesis of P(Eff) = 0.2, P(Tox) = 0.4, and P(Eff & Tox) = 0.08, 5% Type I error, and the alternative hypothesis of P(Eff) = 0.4, P(Tox) = 0.2, and P(Eff & Tox) = 0.08, 95% power

Pr(Eff) | Pr(Tox) | Pr(Eff & Tox) | Early stopping (%) | Claim acceptable (%) | Sample size
0.2     | 0.2     | 0.04          | 66.63              | 15.9                 | 32.6
0.2     | 0.4     | 0.08          | 87.17              | 3.57                 | 25.3
0.2     | 0.6     | 0.12          | 99.93              | 0                    | 13.9
0.4     | 0.2     | 0.08          | 3.59               | 95.36                | 48.9
0.4     | 0.4     | 0.16          | 61.85              | 20.89                | 34.3
0.4     | 0.6     | 0.24          | 99.73              | 0.03                 | 14.9

Fig. 5 Visualization of a BOP2 study


Inferential Frameworks and Modern Trial Design Challenges

The flexibility of the Bayesian and frequentist inferential frameworks enables adaptation
and development of new statistical methodology to address challenges and opportunities
in modern medicine. Here two recent directions for trial design are reviewed with an
emphasis on how the inferential frameworks are impacting design decisions.

Precision Medicine, Master Protocols, Umbrella Trials, Basket Trials, Platform Trials, and Adaptive Randomization

The advent of widely available genetic tests enables selective targeting of drugs to
specific patient subpopulations. For example, cancer patients may be screened for
dozens of genetic mutations and then given a treatment which produces optimal
results for their genetic profile. Whereas traditional clinical trials focused on answering
a single question (is this drug effective for a given patient population?), newer trial
designs seek to find optimal matches between patient profiles and particular drugs.
Master protocols coordinate several closely linked investigations into a single
trial, enabling efficient use of resources (Mandrekar et al. 2015; Redman and
Allegra 2015; Renfro and Sargent 2016; Woodcock and LaVange 2017). For
example, umbrella trials select patients from a certain disease site (e.g., lung
cancer), perform genetic testing on patients, and assign them to multiple treatments
according to the matched drug targets. Basket trials take patients from multiple
disease sites but with a certain mutation (e.g., BRAF mutation) and assign them to
the corresponding target therapy (e.g., BRAF inhibitors) (Redig and Jänne 2015;
Simon et al. 2016; Hobbs and Landin 2018). Platform trials provide effi-
cient screening of multiple treatments in a certain disease in which a steady flow
of patients is available (Berry et al. 2015; Hobbs et al. 2018). A common control
group such as the standard of care can be incorporated as the reference group.
New treatments can be added to the platform and evaluated. If a treatment is
promising, it can “graduate” and, if a treatment is not promising, it can be dropped
from the platform. The trial can run perpetually to efficiently screen for effective
treatments (Simon 2017).
Accrual of sufficient sample sizes can be challenging because of the number of
disease and treatment combinations. For example, the Biomarker-integrated
Approaches of Targeted Therapy for Lung Cancer Elimination (BATTLE) trial, an
umbrella trial for non-small cell lung cancer, tested 4 treatments across 4 genetic
marker strata for a total of 16 possible marker-treatment combinations, applying a
Bayesian hierarchical model with the 8-week disease control rate as the primary
endpoint (Zhou et al. 2008; Kim et al. 2011; Liu and Lee 2015; Simon 2017). Pre-
specifying a fixed sample size for each combination would result in a large overall
sample size due to the large number of combinations. Instead the trial used Bayesian
adaptive randomization in which the most promising combinations continued accru-
ing patients while combinations with disappointing response rates were suspended.
With adaptive randomization designs, suspended combinations may be reopened if
initially promising combinations begin to show poor performance. This accrual
strategy optimizes use of limited resources and has the potential to speed up drug
development (Berry 2015).
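One common implementation of Bayesian adaptive randomization assigns each arm a randomization probability related to the posterior probability that it is the best arm; the sketch below is a generic Monte Carlo version with assumed interim counts, not the BATTLE algorithm itself.

```python
# Sketch of outcome-adaptive randomization: weight each arm by the posterior
# probability that its response rate is highest (assumed interim data).
import numpy as np

rng = np.random.default_rng(0)
responses = np.array([3, 6, 1])    # responses per arm so far (assumed)
patients = np.array([10, 10, 10])  # patients per arm so far (assumed)

draws = np.column_stack([
    rng.beta(1 + r, 1 + n - r, size=50_000)   # Beta(1, 1) prior per arm
    for r, n in zip(responses, patients)
])
p_best = np.bincount(draws.argmax(axis=1), minlength=len(responses)) / len(draws)
print(p_best)  # next-patient randomization weights (often tempered, e.g., p_best**0.5)
```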
The I-SPY 2 Phase II platform trial for women with locally advanced breast
cancer (Barker et al. 2009) is testing standard neoadjuvant chemotherapy against
five new chemotherapy drugs, each being added to the standard regimen. Each drug
is tested in a minimum of 20 and maximum of 120 patients. Patients are randomized
based on hormone receptor status, HER2 status, and MammaPrint score. As the trial
progresses, biomarker-drug combinations with more favorable outcomes (patholog-
ical complete response after resection) are assigned more patients so that they can
accumulate more data. These promising arms then proceed to separate Phase III
trials. The GBM AGILE trial for glioblastoma multiforme includes many of the inferential
elements of I-SPY 2 while adding a transition from Phase II to III as part of the trial
for the promising arms (Alexander et al. 2018). The NCI-MATCH trial and a BRAF trial are
examples of basket trials (Hyman et al. 2015; Mullard 2015). Platform trials are
currently underway for several cancer types (Herbst et al. 2015; Alexander and
Cloughesy 2018).
Multiple Outcomes and Utility Functions

Clinical trials typically collect data on several outcomes, such as safety and
efficacy, thus offering several metrics on which to compare treatments. This can
make decisions about identifying the “best” treatment difficult, as one drug may be
more effective but also more toxic. A simple and common solution is to define a
maximum toxicity threshold and then select as superior the treatment with accept-
able toxicity and highest efficacy. Alternatively a new drug may undergo a
noninferiority trial for efficacy and then be deemed superior based on safety
(Mauri and D’Agostino Sr 2017).
Decision theory offers a more nuanced alternative. In decision theory, a utility
function determined a priori by the clinician is used to summarize potentially
complex treatment effects as a single number. Murray et al. (2016) discuss utility
functions in the context of cancer patients where treatment efficacy may be
recorded as complete response, partial response, stable disease, and progressive
disease (four categories) while nonfatal toxicities may be recorded as none, minor,
and major (three categories). Each patient has one of 13 possible responses to
treatment (four possible efficacies × three toxicity levels + death). The clinician
then defines a desirability, or utility, of each of these possible 13 responses. Murray
et al. (2016) recommend assigning a utility score of 100 to the best possible
response (complete response with no toxicity) and 0 to the worst outcome
(death). The utility is then computed for each patient enrolled in the trial. The
distribution of utilities in treatment and control groups can be compared using
either frequentist or Bayesian methods. Hypothesis testing can be used to deter-
mine if the experimental treatment convincingly delivers better average utility than
standard of care.
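A minimal sketch of this utility-based comparison is given below; the 13-category utility scores and the outcome counts for the two arms are illustrative assumptions, not the elicited values of any actual trial.

```python
# Utility-based comparison over 13 outcome categories
# (4 efficacy levels x 3 toxicity levels + death), following the structure
# described in Murray et al. (2016); all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
# rows: CR, PR, SD, PD; columns: no/minor/major nonfatal toxicity
utility = np.array([[100, 85, 60],
                    [ 80, 65, 45],
                    [ 50, 40, 25],
                    [ 25, 15,  5]], dtype=float).ravel()
utility = np.append(utility, 0.0)                 # 13th category: death

counts_trt = np.array([6, 4, 2, 5, 3, 2, 4, 3, 2, 2, 1, 1, 1])  # assumed
counts_ctl = np.array([3, 2, 1, 4, 3, 2, 6, 4, 3, 4, 3, 2, 1])  # assumed

def mean_utility(counts, m=50_000):
    """Posterior draws of the mean utility under a Dirichlet(1, ..., 1) prior."""
    return rng.dirichlet(counts + 1, size=m) @ utility

diff = mean_utility(counts_trt) - mean_utility(counts_ctl)
print((diff > 0).mean())  # posterior P(treatment has higher mean utility)
```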
The AWARD-5 trial used a utility function to find the optimal dose of dulaglutide
for treating type 2 diabetes patients (Skrivanek et al. 2014). The trial combined four
safety and efficacy measures (glycosylated hemoglobin A1c versus sitagliptin,
weight, pulse, and diastolic blood pressure) into a clinical utility index (CUI) with
larger values indicating more favorable profile. The trial computed posterior distri-
butions for CUI at various dose levels and recommended a dose based on these
distributions.

Summary and Conclusion

Inferential frameworks enable clinicians to conduct clinical trials and draw princi-
pled conclusions from the resulting data. Four essential aspects of inferential frame-
works are:

1. Assumptions relating to the data collection of the sample from the population:
Both frequentist and Bayesian methods will produce valid results when the
sample is selected in an unbiased manner. Random sampling of patients,
treatment randomization, and objective evaluation of treatment outcomes blinded
to the treatment assignment are critical for obtaining valid inferences.
2. Separation of the concepts of the parameter estimate ($\hat{p}$) from the true unknown
parameter in the population (p): Both Bayesian and frequentist statistics separate
sample-based estimates from true population-based quantities. The particular
estimated values produced by Bayesian and frequentist statistical methods are
often different.
3. The frequentist framework assumes that the parameter is fixed and the data are random.
Conversely, the Bayesian framework assumes that the parameter is random and the data
are fixed.
4. Quantification of uncertainty in how close the estimate $\hat{p}$ is to the true unknown
population value p: Here Bayesian and frequentist methods differ. Bayesian
statistics uses the posterior distribution of the parameter to quantify uncertainty,
while frequentist statistics uses sampling distributions. Bayesian measures of
uncertainty typically have a more straightforward interpretation, at the cost of
having to specify a prior distribution.

While constructing new trial designs involves complex statistical considerations,
software for implementing existing designs is increasingly available via
web applications and free downloads. The website http://trialdesign.org hosts web
applications for implementing over 30 trial designs, both frequentist and Bayesian,
including classical methods such as Simon’s optimal two-stage and newer Bayes-
ian adaptive methods for basket and platform trials. This software is freely
available to the scientific research community and requires only an Internet
browser to use.
In the past, the Bayesian and frequentist schools were combative, countering each
other's points in fierce and heated debate. At present, the two schools of inference
are competitive: there is abundant literature comparing the pros and cons of each
approach. In the future, they will be more cooperative. Bayesian and frequentist
approaches offer complementary views, and each can learn from the other.
Convergence of Bayesian and frequentist methods was inconceivable in the past
but is inevitable in the future.

Cross-References

▶ Adaptive Phase II Trials
▶ Bias Control in Randomized Controlled Clinical Trials
▶ Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other
Issues of Multiplicity
▶ Dose Finding for Drug Combinations
▶ Essential Statistical Tests
▶ Power and Sample Size
▶ Statistical Analysis of Patient-Reported Outcomes in Clinical Trials
References
Agresti A, Franklin CA (2009) Statistics: the art and science of learning from data. Prentice Hall,
Upper Saddle River
Alexander BM, Cloughesy TF (2018) Platform trials arrive on time for glioblastoma. Oxford
University Press US
Alexander BM et al (2018) Adaptive global innovative learning environment for glioblastoma:
GBM AGILE. Clin Cancer Res 24(4):737–743
Barker A et al (2009) I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant
chemotherapy. Clin Pharmacol Ther 86(1):97–100
Bayes T (1763) LII. An essay towards solving a problem in the doctrine of chances. By the late Rev.
Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S. Philos Trans
R Soc Lond 53:370–418
Berger JO (1985) Statistical decision theory and Bayesian analysis. Springer Science & Business
Media
Berger JO (2003) Could Fisher, Jeffreys and Neyman have agreed on testing? Stat Sci 18(1):1–32
Berger JO, Wolpert RL (1988) The likelihood principle. IMS
Berry DA (2015) The brave New World of clinical cancer research: adaptive biomarker-driven trials
integrating clinical practice with clinical research. Mol Oncol 9(5):951–959
Berry SM et al (2010) Bayesian adaptive methods for clinical trials. CRC press
Berry SM et al (2015) The platform trial: an efficient strategy for evaluating multiple treatments.
JAMA 313(16):1619–1620
Biswas S et al (2009) Bayesian clinical trials at the University of Texas MD Anderson cancer center.
Clin Trials 6(3):205–216
Carpenter B et al (2017) Stan: a probabilistic programming language. J Stat Softw 76(1)
Casella G, Berger RL (2002) Statistical inference. Duxbury Pacific Grove, Belmont
Chen F (2009) Bayesian modeling using the MCMC procedure. Proceedings of the SAS Global
Forum 2008 Conference. SAS Institute Inc., Cary
Gelman A et al (2013) Bayesian data analysis. Chapman and Hall/CRC
Goodman SN (1999) Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med
130(12):1005–1013
Herbst RS et al (2015) Lung Master Protocol (Lung-MAP) – a biomarker-driven protocol for
accelerating development of therapies for squamous cell lung cancer: SWOG S1400. Clin
Cancer Res 21(7):1514–1524
Hobbs BP, Landin R (2018) Bayesian basket trial design with exchangeability monitoring. Stat Med
37(25):3557–3572
Hobbs BP et al (2018) Controlled multi-arm platform design using predictive probability. Stat
Methods Med Res 27(1):65–78
Hyman DM et al (2015) Vemurafenib in multiple nonmelanoma cancers with BRAF V600
mutations. N Engl J Med 373(8):726–736
Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc
Lond A Math Phys Sci 186(1007):453–461
Johnson VE (2013) Revised standards for statistical evidence. Proc Natl Acad Sci 110(48):19313–
19317
Johnson VE, Cook JD (2009) Bayesian design of single-arm phase II clinical trials with continuous
monitoring. Clin Trials 6(3):217–226
Jüni P et al (2001) Assessing the quality of controlled clinical trials. BMJ 323(7303):42–46
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
Kim ES et al (2011) The BATTLE trial: personalizing therapy for lung cancer. Cancer Discov 1
(1):44–53
Le Tourneau C et al (2009) Dose escalation methods in phase I cancer clinical trials. J Natl Cancer
Inst 101(10):708–720
Lee JJ, Chu CT (2012) Bayesian clinical trials in action. Stat Med 31(25):2955–2972
Lee JJ, Liu DD (2008) A predictive probability design for phase II cancer clinical trials. Clin Trials
5(2):93–106
Lin Y, Shih WJ (2001) Statistical properties of the traditional algorithm-based designs for phase I
cancer clinical trials. Biostatistics 2(2):203–215
Little RJ (2006) Calibrated Bayes: a Bayes/frequentist roadmap. Am Stat 60(3):213–223
Liu S, Lee JJ (2015) An overview of the design and conduct of the BATTLE trials. Chin Clin Oncol
4(3)
Liu S, Yuan Y (2015) Bayesian optimal interval designs for phase I clinical trials. J R Stat Soc Ser C
Appl Stat 64(3):507–523
Mandrekar SJ et al (2015) Improving clinical trial efficiency: thinking outside the box. American
Society of Clinical Oncology educational book. American Society of Clinical Oncology. Annual
Meeting
Mauri L, D’Agostino RB Sr (2017) Challenges in the design and interpretation of noninferiority
trials. N Engl J Med 377(14):1357–1367
Mossman D, Berger JO (2001) Intervals for posttest probabilities: a comparison of 5 methods. Med
Decis Mak 21(6):498–507
Mullard A (2015) NCI-MATCH trial pushes cancer umbrella trial paradigm. Nature Publishing
Group
Murray TA et al (2016) Utility-based designs for randomized comparative trials with categorical
outcomes. Stat Med 35(24):4285–4305
O’Quigley J, Chevret S (1991) Methods for dose finding studies in cancer clinical trials: a review
and results of a Monte Carlo study. Stat Med 10(11):1647–1664
O’Quigley J et al (1990) Continual reassessment method: a practical design for phase 1 clinical
trials in cancer. Biometrics 46(1):33–48
Plummer M (2003) JAGS: a program for analysis of Bayesian graphical models using Gibbs
sampling. In: Proceedings of the 3rd international workshop on distributed statistical computing.
Austria, Vienna
Redig AJ, Jänne PA (2015) Basket trials and the evolution of clinical trial design in an era of
genomic medicine. J Clin Oncol 33(9):975–977
Redman MW, Allegra CJ (2015) The master protocol concept. Seminars in oncology. Elsevier
Renfro L, Sargent D (2016) Statistical controversies in clinical research: basket trials, umbrella
trials, and other master protocols: a review and examples. Ann Oncol 28(1):34–43
Rosenbaum PR, Rubin DB (1984) Sensitivity of Bayes inference with data-dependent stopping
rules. Am Stat 38(2):106–109
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10
(1):1–10
Simon R (2017) Critical review of umbrella, basket, and platform designs for oncology clinical
trials. Clin Pharmacol Ther 102(6):934–941
Simon R et al (2016) The Bayesian basket design for genomic variant-driven phase II trials.
Seminars in oncology. Elsevier
Skrivanek Z et al (2014) Dose-finding results in an adaptive, seamless, randomized trial of once-
weekly dulaglutide combined with metformin in type 2 diabetes patients (AWARD-5). Diabetes
Obes Metab 16(8):748–756
Smith TL et al (1996) Design and results of phase I cancer clinical trials: three-year experience at
MD Anderson Cancer Center. J Clin Oncol 14(1):287–295
Spiegelhalter DJ et al (1996) BUGS: Bayesian inference using Gibbs sampling, version 0.5
(version ii). http://www.mrc-bsu.cam.ac.uk/bugs
Spiegelhalter DJ et al (2004) Bayesian approaches to clinical trials and health-care evaluation.
Wiley
Storer BE (1989) Design and analysis of phase I clinical trials. Biometrics 45(3):925–937
Thall PF et al (1995) Bayesian sequential monitoring designs for single-arm clinical trials with
multiple outcomes. Stat Med 14(4):357–379
Tidwell RSS et al (2019) Bayesian clinical trials at The University of Texas MD Anderson Cancer
Center: an update. Clin Trials:1740774519871471
Wasserstein RL, Lazar NA (2016) The ASA’s statement on p-values: context, process, and purpose.
Am Stat 70(2):129–133
Wasserstein RL et al (2019) Moving to a world beyond “p< 0.05”. Taylor & Francis
Wilson EB (1927) Probable inference, the law of succession, and statistical inference. J Am Stat
Assoc 22(158):209–212
Woodcock J, LaVange LM (2017) Master protocols to study multiple therapies, multiple diseases,
or both. N Engl J Med 377(1):62–70
Zhou X et al (2008) Bayesian adaptive design for targeted therapy development in lung cancer – a
step toward personalized medicine. Clin Trials 5(3):181–193
Zhou H et al (2017) BOP2: bayesian optimal design for phase II clinical trials with simple and
complex endpoints. Stat Med 36(21):3302–3314
55 Dose Finding for Drug Combinations

Mourad Tighiouart
Cedars-Sinai Medical Center, Los Angeles, CA, USA
e-mail: [email protected]

Contents
Introduction 1004
Dose Finding to Estimate the Maximum Tolerated Dose Curve 1006
  Model 1006
  Trial Design 1007
  Operating Characteristics 1008
  Application to the CisCab Trial 1009
Attributable Toxicity 1011
  Dose-Toxicity Model 1012
  Dose Allocation Algorithm 1013
  Simulation Studies 1015
Phase I/II Dose Finding 1017
  Stage I 1017
  Stage II 1017
Discrete Dose Combinations 1023
  Illustration 1023
Summary and Conclusion 1025
Cross-References 1028
References 1028

Abstract
We present early phase cancer clinical trial designs for drug combinations
focusing on continuous dose levels. For phase I trials, the goal is to estimate
the maximum tolerated dose (MTD) curve in the two-dimensional Cartesian
plane. Parametric models are used to describe the relationship between the
doses of the two agents and the probability of dose limiting toxicity (DLT).
Trial design proceeds using cohorts of two patients receiving doses according
to univariate escalation with overdose control (EWOC) or continual reassessment
method (CRM). The maximum tolerated dose curve is estimated as a function of
Bayes estimates of the model parameters. In the case where some DLTs can be
attributed to one agent but not the other, we describe how these parametric
designs can be extended to account for an unknown fraction of attributable
DLTs. For treatments where efficacy is resolved after a few cycles of therapy, it is
standard practice to perform single-arm or randomized phase II trials using the MTD(s)
obtained from a phase I trial. In our setting, we show how the MTD curve is
carried forward into a phase II trial where patients are allocated to doses likely to have
high probability of treatment efficacy using a Bayesian adaptive design. The
methodology is illustrated with an application to an early phase trial of cisplatin
and cabazitaxel in advanced stage prostate cancer patients with visceral metastasis.
Finally, we describe how these methods are adapted to the case of a pre-specified
set of discrete dose combinations.

Keywords
Dose finding · Drug combinations · MTD · DLT · EWOC · CRM · Adaptive
designs · Attributable toxicity · Efficacy · Cubic splines

Introduction

Early phase cancer clinical trials are small studies aimed at identifying tolerable
doses with promising signal for efficacy. These trials use drug combinations of
cytotoxic, biologic, immunotherapy, and/or radiotherapy agents to better target
different signaling pathways simultaneously and reduce potential tumor resistance
to chemo- or targeted therapy. However, most of these trials are designed to estimate
the maximum tolerated dose (MTD) of a single agent for fixed dose levels of the
other agents. This approach may provide a single safe dose for the combination, but
it may be suboptimal in terms of therapeutic effects. Statistical designs that allow
more than one drug to vary during the trial have been studied extensively in the last
decade (see, e.g., Thall et al. 2003; Wang and Ivanova 2005; Yin and Yuan 2009a, b;
Braun and Wang 2010; Wages et al. 2011; Shi and Yin 2013; Tighiouart et al. 2014b,
2016, 2017b; Riviere et al. 2014; Mander and Sweeting 2015). Some of these
designs are aimed at identifying a single MTD, whereas others can recommend
more than one MTD combination and even an infinite number of MTDs (Tighiouart
et al. 2014b, 2016, 2017b). Most of these methods use a parametric model for the
dose-toxicity relationship

\[ P(T = 1 \mid \mathrm{dose} = \mathbf{x}) = F(\mathbf{x}, \xi), \qquad (1) \]

where x = (x₁, . . . , x_k) is the dose combination of k drugs, F is a known link function,
T is the indicator of dose limiting toxicity (DLT), and ξ ∈ ℝ^d is an unknown
parameter. Let S be the set of all dose combinations available in the trial. The
MTD is defined as the set C of dose combinations x such that the probability of
DLT for a patient given dose combination x equals a target probability of DLT θ:
\[ C = \{\mathbf{x} \in S : F(\mathbf{x}, \xi) = \theta\}. \qquad (2) \]

An alternative definition of the MTD is the set of dose combinations x that satisfy
|F(x, ξ) − θ| ≤ δ since the set C in (2) may be empty. This can happen, for example,
when S is finite and the MTD is not part of the dose combinations available in the
trial. The threshold parameter δ is referred to as a 100 × δ point window in Braun
and Wang (2010) and is pre-specified by the clinician. In general, the above methods
proceed by treating successive cohorts of patients with dose escalation starting from
the lowest dose combination and the model parameters and estimated probabilities of
toxicities are sequentially updated. Dose allocation to the next cohort of patients is
carried out by minimizing the risk of exceeding the target probability of DLT θ
according to some loss function. In section “Dose Finding to Estimate the Maximum
Tolerated Dose Curve” of this chapter, we present a drug combination design based
on escalation with overdose control (EWOC) (Babb et al. 1998; Tighiouart et al.
2005, 2012a, b, 2014a, 2017a; Tighiouart and Rogatko 2010, 2012; Chen et al.
2012a; Wheeler et al. 2017; Diniz et al. 2019). We will focus on drug combination of
two agents with continuous dose levels, and the goal of the trial is to estimate the
MTD curve. The design proceeds by treating consecutive cohorts of two patients
receiving different dose combinations determined using univariate EWOC. In sec-
tion “Attributable Toxicity,” we extend model (1) to account for an unknown fraction
of attributable DLTs. This may arise when combining drugs with different mecha-
nisms of action such as Taxotere and metformin. The design is similar to the one in
section “Dose Finding to Estimate the Maximum Tolerated Dose Curve” except that
the estimated doses for the next cohort of patients use the continual reassessment
method criteria (CRM) (O’Quigley et al. 1990; Faries 1994; Goodman et al. 1995;
O’Quigley and Shen 1996; Piantadosi et al. 1998).
In section “Phase I/II Dose Finding” of this chapter, we show how the estimated
MTD curve from a phase I trial can be used in a phase II study with the goal of
determining a dose combination along the MTD curve with maximum probability of
efficacy. This setting corresponds to a phase I/II cancer clinical trial design where the
MTD is first determined in a phase I trial and then is used in a phase II trial to
evaluate treatment efficacy. Such situations occur when response evaluation takes
a few cycles of therapy or when the phase I and II patient populations are different. In the
case where both toxicity and efficacy are resolved within one or two cycles of
therapy, sequential designs that update the probabilities of DLT and efficacy are
used instead, and the goal is to determine a tolerable dose combination with
maximum probability of treatment response. Finally, we show how the methods
described in sections “Dose Finding to Estimate the Maximum Tolerated Dose
Curve” and “Attributable Toxicity” can be adapted to the setting of a discrete set
of dose combinations in section “Discrete Dose Combinations.” Properties of these
designs are evaluated by presenting operating characteristics derived under a large
number of practical scenarios. For phase I trials, summary statistics of safety and
precision of the estimate of the MTD curve are calculated. For the phase II trial,
Bayesian power and type I error probabilities are provided under scenarios favoring
the alternative and null hypotheses, respectively.
Dose Finding to Estimate the Maximum Tolerated Dose Curve

Model

Consider the dose-toxicity model of the form

\[ P(T = 1 \mid x, y) = F(\eta_0 + \eta_1 x + \eta_2 y + \eta_3 xy), \qquad (3) \]

where T is the indicator of DLT, T = 1 if a patient given the dose combination (x, y)
exhibits DLT within one cycle of therapy, and T = 0 otherwise, x ∈ [Xmin, Xmax] is
the dose level of agent A1, y ∈ [Ymin, Ymax] is the dose level of agent A2, and F is a
known cumulative distribution function. Here, Xmin, Xmax and Ymin, Ymax are the
lower and upper bounds of the continuous dose levels of agents A1 and A2, respectively.
Suppose that the doses of agents A1 and A2 are standardized to be in the
interval [0, 1] using the transformations h1(x) = (x − Xmin)/(Xmax − Xmin) and
h2(y) = (y − Ymin)/(Ymax − Ymin), and that the interaction parameter η3 > 0.
We will assume that the probability of DLT increases with the dose of any one
of the agents when the other one is held constant. A necessary and sufficient
condition for this property to hold is to assume η1, η2 > 0. The MTD is defined as
any dose combination (x*, y*) such that

\[ \mathrm{Prob}(T = 1 \mid x^*, y^*) = \theta. \qquad (4) \]

The target probability of DLT θ is set relatively high when the DLT is a reversible
or nonfatal condition and low when it is life threatening. We reparameterize model
(3) in terms of parameters clinicians can easily interpret. One way is to use ρ10, the
probability of DLT when the levels of drugs A1 and A2 are 1 and 0, respectively; ρ01,
the probability of DLT when the levels of drugs A1 and A2 are 0 and 1, respectively;
and ρ00, the probability of DLT when the levels of drugs A1 and A2 are both 0. It can
be shown that
\[ \begin{cases} \eta_0 = F^{-1}(\rho_{00}) \\ \eta_1 = F^{-1}(\rho_{10}) - F^{-1}(\rho_{00}) \\ \eta_2 = F^{-1}(\rho_{01}) - F^{-1}(\rho_{00}) \end{cases} \qquad (5) \]

Using (3), the definition of the MTD in (4), and reparameterization (5), we obtain
the MTD curve C as a function of the model parameters ρ00, ρ01, ρ10, and η3 and
target probability of DLT θ as
\[ C = \left\{ (x^*, y^*) : y^* = \frac{F^{-1}(\theta) - F^{-1}(\rho_{00}) - \left( F^{-1}(\rho_{10}) - F^{-1}(\rho_{00}) \right) x^*}{F^{-1}(\rho_{01}) - F^{-1}(\rho_{00}) + \eta_3 x^*} \right\}. \qquad (6) \]

This reparameterization allows the MTD curve to lie anywhere within the dose
range [Xmin, Xmax] × [Ymin, Ymax]. If there is strong a priori belief that
$\Gamma_{A_1|A_2=0}$, the MTD of drug A1 when the level of drug A2 is equal to Ymin,
is in the interval [Xmin, Xmax] and that $\Gamma_{A_2|A_1=0}$, the MTD of drug A2
when the level of drug A1 is equal to Xmin, is in the interval [Ymin, Ymax], then the
reparameterization $(\rho_{00}, \Gamma_{A_1|A_2=0}, \Gamma_{A_2|A_1=0}, \eta_3)$
is more convenient (see Tighiouart et al. 2014b for more details on this
reparameterization).
A prior distribution on the model parameters is placed as follows. ρ01, ρ10, and η3
are independent a priori with ρ01 ~ beta(a1, b1), ρ10 ~ beta(a2, b2), and, conditional
on (ρ01, ρ10), ρ00/min(ρ01, ρ10) ~ beta(a3, b3). The prior distribution on the
interaction parameter η3 is a gamma with mean a/b and variance a/b². If
Dk = {(xi, yi, Ti)} is the data after enrolling k patients to the trial, the posterior
distribution of the model parameters is

\[ \pi(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3 \mid D_k) \propto \prod_{i=1}^{k} \left( G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i) \right)^{T_i} \left( 1 - G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i) \right)^{1 - T_i} \times \pi(\rho_{01})\, \pi(\rho_{10})\, \pi(\rho_{00} \mid \rho_{01}, \rho_{10})\, \pi(\eta_3), \]

where
  
\[ G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i) = F\!\left( F^{-1}(\rho_{00}) + \left( F^{-1}(\rho_{10}) - F^{-1}(\rho_{00}) \right) x_i + \left( F^{-1}(\rho_{01}) - F^{-1}(\rho_{00}) \right) y_i + \eta_3 x_i y_i \right). \qquad (7) \]

Features of the posterior distribution are estimated using WinBUGS (Lunn et al.
2000) and JAGS (Plummer 2003).

Trial Design

Dose escalation/de-escalation proceeds by treating cohorts of two patients
simultaneously. It is based on the escalation with overdose control (EWOC) principle where
at each stage of the trial, the posterior probability of overdosing a future patient is
bounded by a feasibility bound α (see, e.g., Babb et al. 1998; Tighiouart et al. 2005;
Tighiouart and Rogatko 2010, 2012). For a given cohort, one subject receives a new
dose of agent A1 for a given dose of agent A2 that was previously assigned, and the
other patient receives a new dose of agent A2 for a given dose of agent A1 that was
previously assigned. Specifically,

(i) The first two patients receive the same dose combination (x1, y1) ¼ (x2, y2) ¼ (0,
0) and let D2 ¼ {(x1, y1, T1), (x2, y2, T2)}.
(ii) In the second cohort, patients 3 and 4 receive doses (x3, y3) and (x4, y4), respectively, where y3 = y1, x4 = x2, x3 is the α-th percentile of π(Γ_{A1|A2=y1} | D2), and y4 is the α-th percentile of π(Γ_{A2|A1=x2} | D2). Here, π(Γ_{A1|A2=y1} | D2) is the posterior distribution of the MTD of drug A1 given that the level of drug A2 is y1, given the data D2.
(iii) In the i-th cohort of two patients, if i is even, then patient (2i − 1) receives dose (x_{2i−1}, y_{2i−3}), and patient 2i receives dose (x_{2i−2}, y_{2i}), where x_{2i−1} = Π⁻¹_{Γ_{A1|A2=y_{2i−3}}}(α | D_{2i−2}) and y_{2i} = Π⁻¹_{Γ_{A2|A1=x_{2i−2}}}(α | D_{2i−2}). If i is odd, then patient (2i − 1) receives dose (x_{2i−3}, y_{2i−1}), and patient 2i receives dose (x_{2i}, y_{2i−2}), where y_{2i−1} = Π⁻¹_{Γ_{A2|A1=x_{2i−3}}}(α | D_{2i−2}) and x_{2i} = Π⁻¹_{Γ_{A1|A2=y_{2i−2}}}(α | D_{2i−2}). Here, Π⁻¹_{Γ_{A1|A2=y}}(α | D) denotes the inverse cdf of the posterior distribution π(Γ_{A1|A2=y} | D).
(iv) Repeat step (iii) until N patients are enrolled in the trial, subject to the following stopping rule.
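Steps (ii) and (iii) require only posterior draws of (ρ00, ρ01, ρ10, η3): at a fixed level y of drug A2, each draw is mapped to the corresponding MTD of drug A1 by solving (6) for x*, and the new dose is the α-th percentile of these values, truncated to the standardized dose range. A minimal Python sketch, assuming a logistic link and an (S, 4) array of posterior draws (e.g., extracted from a JAGS fit):

import numpy as np
from scipy.special import logit

def next_dose_A1(samples, y, theta, alpha):
    # alpha-th percentile of the posterior of Gamma_{A1|A2=y}, the MTD of A1 at A2 = y
    rho00, rho01, rho10, eta3 = np.asarray(samples).T
    num = logit(theta) - logit(rho00) - (logit(rho01) - logit(rho00)) * y
    den = (logit(rho10) - logit(rho00)) + eta3 * y
    gamma_a1 = num / den        # x* solving (6) at fixed y, one value per posterior draw
    return float(np.clip(np.quantile(gamma_a1, alpha), 0.0, 1.0))

The dose of A2 at a fixed level of A1 is obtained symmetrically, with the roles of ρ01 and ρ10 exchanged.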

Stopping Rule
Enrollment to the trial is suspended for safety if P(P(T = 1 | (x, y) = (0, 0)) > θ + ξ1 | data) > ξ2, i.e., if the posterior probability that the probability of DLT at the minimum available dose combination in the trial exceeds the target probability of DLT is high. The design parameters ξ1 and ξ2 are chosen to achieve desirable model operating characteristics.
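Under model (3), the probability of DLT at the minimum combination (0, 0) is simply ρ00, so this rule reduces to a Monte Carlo average over the posterior draws; a short Python sketch:

import numpy as np

def suspend_enrollment(rho00_draws, theta, xi1, xi2):
    # P(P(T = 1 | (0,0)) > theta + xi1 | data) > xi2, with P(T = 1 | (0,0)) = rho00
    return np.mean(np.asarray(rho00_draws) > theta + xi1) > xi2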
At the end of the trial, we estimate the MTD curve using (6) as
    Cest = {(x*, y*) : y* = [F⁻¹(θ) − F⁻¹(ρ̂00) − (F⁻¹(ρ̂10) − F⁻¹(ρ̂00)) x*] / [F⁻¹(ρ̂01) − F⁻¹(ρ̂00) + η̂3 x*]},    (8)

where ρ̂00, ρ̂01, ρ̂10, and η̂3 are the posterior medians given the data DN.

Operating Characteristics

The performance of this design is evaluated for a prospective trial by assessing the safety of the trial and the efficiency of the estimated MTD curve under various plausible scenarios elicited by the clinician in collaboration with the statistician.

Safety
For trial safety, the percent of DLTs across all patients and all simulated trials is reported, in addition to the percent of trials with an excessive DLT rate, for example, greater than θ + 0.1. The latter is an estimate of the probability that a prospective trial will result in a high rate of DLTs under a given scenario.

Efficiency
Uncertainty about the estimated MTD curve is evaluated by the pointwise average
bias and percent selection. For i = 1, . . ., m, let Ci be the estimated MTD curve from the i-th trial and Ctrue be the true MTD curve, where m is the number of simulated trials. For every point (x, y) ∈ Ctrue, let

    d(i)_(x,y) = sign(y′ − y) × min_{(x*,y*) ∈ Ci} {(x − x*)² + (y − y*)²}^{1/2},    (9)

where y′ is such that (x, y′) ∈ Ci. This is the minimum relative distance of the point (x, y) on the true MTD curve to the estimated MTD curve Ci. Let

    d̄_(x,y) = (1/m) Σ_{i=1}^{m} d(i)_(x,y).    (10)
Equation (10) can be interpreted as the pointwise average bias in estimating the
MTD.
Let Δ(x, y) be the Euclidean distance between the minimum dose combination (0, 0) and the point (x, y) on the true MTD curve, and let 0 < p < 1. Let

    P_(x,y) = (1/m) Σ_{i=1}^{m} I(|d(i)_(x,y)| ≤ p Δ(x, y)).    (11)

This is the pointwise percent of trials for which the minimum distance of the point (x, y) on the true MTD curve to the estimated MTD curve Ci is no more than (100 × p)% of the true MTD. This statistic is equivalent to drawing a circle with center (x, y) on the true MTD curve and radius pΔ(x, y) and calculating the percent of trials with MTD curve estimate Ci falling inside the circle. It gives the percent of trials with MTD recommendation within (100 × p)% of the true MTD and is interpreted as the pointwise percent selection for a given tolerance p.
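Given the output of m simulated trials, both of these summaries are straightforward to compute. A Python sketch, assuming each estimated MTD curve is stored as an array of points sorted by increasing x:

import numpy as np

def pointwise_metrics(true_pts, est_curves, p):
    # true_pts: (n, 2) array of points on the true MTD curve;
    # est_curves: list of (k, 2) arrays, one estimated curve per simulated trial
    m = len(est_curves)
    bias = np.zeros(len(true_pts))                  # will hold Eq. (10)
    select = np.zeros(len(true_pts))                # will hold Eq. (11)
    delta = np.linalg.norm(true_pts, axis=1)        # Delta(x, y): distance from (0, 0)
    for C in est_curves:
        d2 = ((true_pts[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        dmin = np.sqrt(d2.min(axis=1))              # unsigned minimum distance, Eq. (9)
        yprime = np.interp(true_pts[:, 0], C[:, 0], C[:, 1])   # y' with (x, y') on Ci
        bias += np.sign(yprime - true_pts[:, 1]) * dmin / m
        select += (dmin <= p * delta) / m
    return bias, select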

Application to the CisCab Trial

The algorithm described in section "Trial Design" was used to design the first part of a phase I/II trial of the combination of cisplatin and cabazitaxel in patients with prostate cancer with visceral metastasis. A recently published phase I trial of this combination by Lockhart et al. (2014) identified the MTD of cabazitaxel/cisplatin as 15/75 mg/m². That trial used a "3 + 3" design exploring three pre-specified dose levels: 15/75, 20/75, and 25/75. In part 1 of the trial, nine patients were evaluated for safety, and no DLT was observed at 15/75 mg/m². In part 2 of the study, 15 patients were treated at 15/75 mg/m², and 2 DLTs were observed. Based on these results and other preliminary efficacy data, it was hypothesized that there exists a series of dose combinations which are tolerable and active in prostate cancer. Cabazitaxel dose levels are selected in the interval [10, 25] mg/m², and cisplatin dose levels in the interval [50, 100] mg/m², administered intravenously. The plan is to enroll N = 30 patients and estimate the MTD curve. The target probability of DLT is θ = 0.33, and a logistic link function for F(·) in (3) was used. DLT is resolved within one cycle (3 weeks) of
treatment. Although the algorithm dictates that the first two patients receive dose combination 10/50 mg/m², the clinician Dr. Posadas preferred to start with 15 mg/m² cabazitaxel and 75 mg/m² cisplatin, since this combination was tolerable based on the results of the published phase I trial and a number of patients he had treated at this combination. The prior distributions were calibrated so that the prior mean probability of DLT at the dose combination 15/75 mg/m² equals the target probability of DLT. Specifically, informative priors ρ01, ρ10 ~ beta(1.4, 5.6) were used for the model parameters and, conditional on (ρ01, ρ10), ρ00/min(ρ01, ρ10) ~ beta(0.8, 7.2); a vague prior for η3 with mean 20 and variance 540 was used so that E(P(DLT | (15, 75))) ≈ 0.33 a priori. Operating characteristics were derived by simulating
m = 2000 trial replicates under various scenarios for the true MTD curve. Figure 1 shows the true and estimated MTD curves obtained using (6) with the parameters ρ00, ρ01, ρ10, and η3 replaced by their posterior medians averaged across all 2000 simulated trials. Scenario A, shown in the left panel of Fig. 1, is a case where the true MTD curve passes through a point very close to the dose combination (15, 75) identified as the MTD in the previous trial. Scenario B, shown in the right panel, is a case where the MTD curve is well above this dose combination. In each case, the estimated MTD curves are very close to the true MTD curves. This is also evidenced by the pointwise bias and percent selection (graphs included in the supplement). The trial was also safe, since the percent of trials with DLT rate above θ + 0.1 was 3.5% for the scenario on the left and 5.0% for the scenario on the right.
Figures 2 and 3 show the pointwise average bias and percent selection for tolerances p = 0.05, 0.1 under scenarios A and B. In both cases, the absolute average bias is less than 0.05, which corresponds to 5% of the standardized dose range of either agent. We conclude that the pointwise average bias is practically negligible. The pointwise percent selection when p = 0.05 varies between 40% and 90% under scenario A for most doses and between 60% and 75% under scenario B. These are reasonable percent selections, comparable to dose combination phase I trials. Other scenarios were included in the clinical protocol.
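The stated prior calibration can be checked by Monte Carlo simulation from the priors. The sketch below assumes the logistic link and the dose standardizations h1, h2 given earlier (a gamma with mean 20 and variance 540 has shape 20²/540 and scale 540/20); it illustrates the check rather than reproducing the protocol's own calibration code:

import numpy as np
from scipy.special import expit, logit

rng = np.random.default_rng(1)
S = 200_000
rho01 = rng.beta(1.4, 5.6, S)
rho10 = rng.beta(1.4, 5.6, S)
rho00 = np.minimum(rho01, rho10) * rng.beta(0.8, 7.2, S)
eta3 = rng.gamma(shape=20.0 ** 2 / 540.0, scale=540.0 / 20.0, size=S)
x = (15 - 10) / (25 - 10)      # standardized cabazitaxel dose for 15 mg/m^2
y = (75 - 50) / (100 - 50)     # standardized cisplatin dose for 75 mg/m^2
p_dlt = expit(logit(rho00) + (logit(rho10) - logit(rho00)) * x
              + (logit(rho01) - logit(rho00)) * y + eta3 * x * y)
print(p_dlt.mean())            # prior mean of P(DLT | 15/75); the calibration targets 0.33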

Fig. 1 True and estimated MTD curves under two different scenarios for the MTD curve. The gray diamonds represent the last dose combination from each simulated trial along with a 90% confidence region

Fig. 2 Pointwise average bias (left) and percent selection (right) under scenario A

Fig. 3 Pointwise average bias (left) and percent selection (right) under scenario B

Attributable Toxicity

In section "Dose Finding to Estimate the Maximum Tolerated Dose Curve" and Eq. (3), a DLT event is assumed to be caused by drug A1, drug A2, or both. In some applications, some DLTs can be attributed to one agent but not the other. For example, in a drug combination trial of Taxotere, a known cytotoxic agent, and metformin, a diabetes drug, in advanced or metastatic breast cancer patients, the clinician expects that some DLTs can be attributed to either agent or to both. For instance, a grade 3 or 4 neutropenia can be attributed only to Taxotere and not to metformin. In this section, we present a dose combination trial design that accounts for an unknown fraction of attributable DLTs.

Dose-Toxicity Model

Let Fα(·) and Fβ(·) be parametric models for the probability of DLT of drugs A1 and A2, respectively. We specify the joint dose-toxicity relationship using the Gumbel copula model (see Murtaugh and Fisher 1990) as

    π(δ1,δ2) = Prob(δ1, δ2 | x, y) = Fα(x)^{δ1} [1 − Fα(x)]^{1−δ1} Fβ(y)^{δ2} [1 − Fβ(y)]^{1−δ2} + (−1)^{δ1+δ2} Fα(x) [1 − Fα(x)] Fβ(y) [1 − Fβ(y)] (e^γ − 1)/(e^γ + 1),    (12)

where x and y are the standardized dose levels of drugs A1 and A2, respectively, δ1 and δ2 are the binary indicators of DLT attributed to drugs A1 and A2, respectively, and γ is the interaction coefficient. Similar to section "Dose Finding to Estimate the Maximum Tolerated Dose Curve," we assume that the probability of DLT, π = 1 − π(0,0), increases with the dose of either agent when the other is held constant. A sufficient condition for this property to hold is that Fα(·) and Fβ(·) are increasing functions with α > 0 and β > 0. We take Fα(x) = x^α
and Fβ(y) = y^β. Using (12), if the DLT is attributed exclusively to drug A1, then

    π(δ1=1, δ2=0) = x^α (1 − y^β) − x^α (1 − x^α) y^β (1 − y^β) (e^γ − 1)/(e^γ + 1).    (13)

If the DLT is attributed exclusively to drug A2, then

    π(δ1=0, δ2=1) = y^β (1 − x^α) − x^α (1 − x^α) y^β (1 − y^β) (e^γ − 1)/(e^γ + 1).    (14)

If the DLT is attributed to both drugs A1 and A2, then

    π(δ1=1, δ2=1) = x^α y^β + x^α (1 − x^α) y^β (1 − y^β) (e^γ − 1)/(e^γ + 1).    (15)
Equation (13) represents the probability that drug A1 causes a DLT and drug A2 does not. This can happen, for example, when a type of DLT specific to Taxotere, such as grade 4 neutropenia, is observed; this type of DLT can never be observed with metformin. It can also happen when the clinician attributes a grade 4 diarrhea to Taxotere but not to metformin because of a low dose level of the latter, even though both drugs have this common type of side effect. The fact that the dose level y is present in Eq. (13) is a result of the joint modeling of the two marginals and accounts for the probability that drug A2 does not cause a DLT. The latter case is, of course, based on the clinician's judgment. Equations (14) and (15) can be interpreted similarly. The probability of DLT is

    π = Prob(DLT | x, y) = π(δ1=1, δ2=0) + π(δ1=0, δ2=1) + π(δ1=1, δ2=1)
      = x^α + y^β − x^α y^β − x^α (1 − x^α) y^β (1 − y^β) (e^γ − 1)/(e^γ + 1).    (16)

The MTD is any dose combination (x, y) such that Prob(DLT | x, y) = θ. It follows that the MTD set C(α, β, γ) is

    C(α, β, γ) = {(x, y) : y = [(−(1 − x^α − κ) + {(1 − x^α − κ)² − 4κ(x^α − θ)}^{1/2}) / (2κ)]^{1/β}},    (17)

where

    κ = x^α (1 − x^α) (e^γ − 1)/(e^γ + 1).
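To make the copula model concrete, the following Python sketch implements the overall DLT probability (16) and the MTD curve (17), the latter obtained by solving the quadratic in u = y^β; the parameter values in the self-check are illustrative only:

import numpy as np

def prob_dlt(x, y, alpha, beta_, gamma_):
    # overall probability of DLT at standardized doses (x, y); Eq. (16)
    t = (np.exp(gamma_) - 1) / (np.exp(gamma_) + 1)
    xa, yb = x ** alpha, y ** beta_
    return xa + yb - xa * yb - xa * (1 - xa) * yb * (1 - yb) * t

def mtd_y(x, alpha, beta_, gamma_, theta):
    # dose y of A2 placing (x, y) on the MTD set C(alpha, beta, gamma); Eq. (17)
    t = (np.exp(gamma_) - 1) / (np.exp(gamma_) + 1)
    xa = x ** alpha
    kappa = xa * (1 - xa) * t
    b, c = 1 - xa - kappa, xa - theta
    u = (-b + np.sqrt(b * b - 4 * kappa * c)) / (2 * kappa)   # u = y^beta
    return u ** (1 / beta_)

# self-check: points returned by mtd_y should satisfy prob_dlt = theta
x = np.linspace(0.10, 0.25, 5)
y = mtd_y(x, 1.1, 1.1, 1.0, 0.3)
print(np.round(prob_dlt(x, y, 1.1, 1.1, 1.0), 3))   # all entries ~0.3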
Let T be the indicator of DLT: T = 1 if a patient treated at dose combination (x, y) experiences a DLT within one cycle of therapy that is due to either drug or both, and T = 0 otherwise. Among patients treated with dose combination (x, y) who exhibit a DLT, suppose that an unknown fraction η of these patients has a DLT with known attribution, i.e., the clinician knows whether the DLT is caused by drug A1 only, drug A2 only, or both drugs A1 and A2. Let A be the indicator of DLT attribution when T = 1. It follows that for each patient treated with dose combination (x, y), there are five possible toxicity outcomes: {T = 0}, {T = 1, A = 0}, {T = 1, A = 1, δ1 = 1, δ2 = 0}, {T = 1, A = 1, δ1 = 0, δ2 = 1}, and {T = 1, A = 1, δ1 = 1, δ2 = 1}. Using Eqs. (13),
(14), (15), and (16) and Fig. 4, the likelihood function is

    L(α, β, γ, η | data) = ∏_{i=1}^{n} [(η π_i^{(δ1i, δ2i)})^{Ai} (π_i (1 − η))^{1−Ai}]^{Ti} (1 − π_i)^{1−Ti},    (18)

where π_i and π_i^{(δ1i, δ2i)} denote (16) and the relevant attributed probability among (13)–(15) evaluated at (xi, yi), and the posterior distribution of the model parameters is

    π(α, β, γ, η | data) ∝ L(α, β, γ, η | data) × π(α, β, γ, η).    (19)
Features of the posterior distribution are estimated using JAGS (Plummer 2003).
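For reference, the log of the likelihood (18) can be written directly by enumerating the five outcome types; the Python sketch below is an illustration, not the author's JAGS model:

import numpy as np

def loglik_attrib(params, data):
    # params = (alpha, beta, gamma, eta); data is a list of tuples (x, y, T, A, d1, d2),
    # with A, d1, d2 ignored when T = 0, and d1, d2 ignored when A = 0
    alpha, beta_, gamma_, eta = params
    t = (np.exp(gamma_) - 1) / (np.exp(gamma_) + 1)
    ll = 0.0
    for x, y, T, A, d1, d2 in data:
        xa, yb = x ** alpha, y ** beta_
        w = xa * (1 - xa) * yb * (1 - yb) * t
        pi10 = xa * (1 - yb) - w          # Eq. (13): DLT attributed to A1 only
        pi01 = yb * (1 - xa) - w          # Eq. (14): DLT attributed to A2 only
        pi11 = xa * yb + w                # Eq. (15): DLT attributed to both
        pi = pi10 + pi01 + pi11           # Eq. (16): overall DLT probability
        if T == 0:
            ll += np.log1p(-pi)
        elif A == 1:
            pi_d = {(1, 0): pi10, (0, 1): pi01, (1, 1): pi11}[(d1, d2)]
            ll += np.log(eta * pi_d)
        else:
            ll += np.log((1 - eta) * pi)
    return ll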

Dose Allocation Algorithm

Dose escalation is similar to section "Dose Finding to Estimate the Maximum Tolerated Dose Curve" except that the univariate continual reassessment method (CRM) (O'Quigley et al. 1990) is carried out to estimate the next dose instead of EWOC. In a cohort of two patients, the first one receives a new dose of agent A1 given the dose y of agent A2 that was previously assigned. The new dose of agent A1 is defined as xnew = argmin_u |P̂(DLT | u, y) − θ|, where y is fixed and P̂(DLT | u, y) is computed using Eq. (16) with α, β, γ replaced by their posterior medians. The other patient receives a new dose of agent A2 given the dose of agent A1 that was previously assigned. Specifically, the design proceeds as follows:
Fig. 4 A chance tree illustrating the five possible outcomes we can find in a trial
(i) The first two patients receive the same dose combination (Xmin, Ymin).
(ii) In the i-th cohort of two patients,
   • If i is even, patient (2i − 1) receives dose combination (x_{2i−1}, y_{2i−1}), where x_{2i−1} = argmin_u |P̂(DLT | u, y_{2i−3}) − θ| and y_{2i−1} = y_{2i−3}. For ethical reasons, if a DLT was observed in the previous cohort of two patients and was attributable to drug A1, then x_{2i−1} is further restricted to be no more than x_{2i−3}. Patient 2i receives dose combination (x_{2i}, y_{2i}), where y_{2i} = argmin_v |P̂(DLT | x_{2i−2}, v) − θ| and x_{2i} = x_{2i−2}. If a DLT was observed in the previous cohort of two patients and was attributable to drug A2, then y_{2i} is further restricted to be no more than y_{2i−2}.
   • If i is odd, patient (2i − 1) receives dose (x_{2i−1}, y_{2i−1}), where y_{2i−1} = argmin_v |P̂(DLT | x_{2i−3}, v) − θ| and x_{2i−1} = x_{2i−3}. If a DLT was observed in the previous cohort of two patients and was attributable to drug A2, then y_{2i−1} is further restricted to be no more than y_{2i−3}. Patient 2i receives dose (x_{2i}, y_{2i}), where x_{2i} = argmin_u |P̂(DLT | u, y_{2i−2}) − θ| and y_{2i} = y_{2i−2}. If a DLT was observed in the previous cohort of two patients and was attributable to drug A1, then x_{2i} is further restricted to be no more than x_{2i−2}.
(iii) Repeat step (ii) until the maximum sample size is reached, subject to a safety stopping rule as described in section "Trial Design."

Here, we used the univariate CRM instead of EWOC to estimate the next dose for computational efficiency. A comparison of the two methods in the drug combination setting can be found in Diniz et al. (2017).
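With continuous doses, this argmin is conveniently approximated by a fine grid search. A self-contained Python sketch (the plug-in probability uses Eq. (16) with fixed parameter values standing in for the posterior medians):

import numpy as np

def make_prob_dlt(alpha, beta_, gamma_):
    # DLT probability (16) with the parameters fixed, e.g., at their posterior medians
    t = (np.exp(gamma_) - 1) / (np.exp(gamma_) + 1)
    def p(x, y):
        xa, yb = x ** alpha, y ** beta_
        return xa + yb - xa * yb - xa * (1 - xa) * yb * (1 - yb) * t
    return p

def next_dose(prob_fn, fixed_y, theta, grid, ceiling=None):
    # argmin over the grid of |P_hat(DLT | u, fixed_y) - theta|; `ceiling` enforces
    # the attribution-based restriction (no escalation past the previous dose)
    if ceiling is not None:
        grid = grid[grid <= ceiling]
    return grid[np.argmin(np.abs(prob_fn(grid, fixed_y) - theta))]

# next dose of A1 at A2 = 0.15, with plug-in parameters (1.1, 1.1, 1.0) and theta = 0.3;
# the new dose of A2 at a fixed dose of A1 is obtained with the dose arguments swapped
p_hat = make_prob_dlt(1.1, 1.1, 1.0)
print(next_dose(p_hat, 0.15, 0.3, np.linspace(0.05, 0.30, 200)))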

Simulation Studies

Dose levels of drugs A1 and A2 are standardized to be in the interval [0.05, 0.30], and we consider three scenarios for the true MTD curve, shown by the black dashed curves in Fig. 5. We evaluate the effect of toxicity attribution in these three scenarios using four different values of η: 0, 0.1, 0.25, and 0.4. These values are reasonable because higher values of η are very rare in practice. Data are randomly generated as follows. For a given dose combination (x, y), a binary indicator of DLT, T, is generated from a Bernoulli distribution with probability of success computed using Eq. (16). If {T = 1}, we generate the attribution outcome A from a Bernoulli distribution with probability of success η. If {T = 1, A = 1}, we attribute the DLT to drug A1, drug A2, or both with equal probabilities. We assume that the model parameters α, β, γ, and η are independent a priori. We assign vague prior distributions to α, β, and γ as in Yin and Yuan (2009a), with α ~ Uniform(0.2, 2), β ~ Uniform(0.2, 2), and γ ~ Gamma(0.1, 0.1). The prior distribution for the fraction of attributable toxicities η is Uniform(0, 1). The true parameter values for each scenario are as follows: in scenario 1, α = β = 0.9 and γ = 1; in scenario 2, α = β = 1.1 and γ = 1; and in scenario 3, α = β = 1.3 and γ = 1. For each scenario, m = 1000 trials are simulated. The target risk of toxicity is fixed at θ = 0.3, the sample size is n = 40, and the values of ξ1 and ξ2 are 0.05 and 0.8, respectively. Figure 5 shows the estimated MTD curves for each scenario as a function of η. In general, increasing η up to 0.4 yields estimated MTD curves closer to the true MTD curve.
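This generative scheme is simple to reproduce; a Python sketch, where prob_fn is the true dose-toxicity function, e.g., make_prob_dlt from the sketch above:

import numpy as np

rng = np.random.default_rng(0)

def simulate_outcome(x, y, prob_fn, eta):
    # one patient outcome under the simulation scheme described in the text
    T = rng.binomial(1, prob_fn(x, y))            # DLT indicator, success prob from (16)
    if T == 0:
        return (0, 0, 0, 0)                       # (T, A, delta1, delta2)
    A = rng.binomial(1, eta)                      # attribution known with probability eta
    if A == 0:
        return (1, 0, 0, 0)
    d1, d2 = [(1, 0), (0, 1), (1, 1)][rng.integers(3)]   # equal-probability attribution
    return (1, 1, d1, d2)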
Table 1 shows the average percent of toxicities as well as the percent of trials with toxicity rates greater than θ + 0.05 and θ + 0.10 for scenarios 1–3.

Fig. 5 Estimated MTD curves for m = 1000 simulated trials. The black dashed curve represents the true MTD curve, the gray dashed lines represent the contours at θ ± 0.05 and θ ± 0.10, and the solid curves represent the estimated MTD curves for each value of η

Table 1 Operating characteristics summarizing trial safety in m = 1000 simulated trials

                          Average % of    % of trials with toxicity    % of trials with toxicity
                          toxicities      rate > θ + 0.05              rate > θ + 0.10
Scenario 1   η = 0.00     33.62           25.90                        4.10
             η = 0.10     32.67           22.60                        4.80
             η = 0.25     31.55           17.60                        2.70
             η = 0.40     30.70           13.30                        2.00
Scenario 2   η = 0.00     30.64            9.40                        0.90
             η = 0.10     29.69            7.30                        0.40
             η = 0.25     28.76            5.00                        0.20
             η = 0.40     28.04            4.10                        0.30
Scenario 3   η = 0.00     27.47            2.00                        0.00
             η = 0.10     26.80            1.80                        0.00
             η = 0.25     25.99            1.30                        0.00
             η = 0.40     25.37            0.70                        0.00

Fig. 6 Pointwise percent of MTD recommendation for m = 1000 simulated trials. Solid lines represent the pointwise percent of MTD recommendation when p = 0.2, and dashed lines represent the pointwise percent of MTD recommendation when p = 0.1

In general, we observe that increasing the fraction of toxicity attributions η reduces the average percent of toxicities and the percent of trials with toxicity rates greater than θ + 0.05 and θ + 0.10. These results show that the design is safe in the sense that the probability that a prospective trial will result in an excessive rate of toxicity (greater than θ + 0.10) is less than 5%. Figure 6 shows the pointwise percent of MTD recommendation for the three proposed scenarios for each value of η. In general, increasing the value of η increases the pointwise percent of MTD recommendation, reaching up to 80% correct recommendation when p = 0.2 and up to 70% correct recommendation when p = 0.1. Based on these simulation results, we conclude that in the continuous dose setting, the approach of partial toxicity attribution generates safe trial designs and efficient estimation of the MTD. Further details about the approach and computer code can be found in Jimenez et al. (2019).

Phase I/II Dose Finding

In this section, we describe a phase I/II design with the objective of determining a tolerable dose level that maximizes treatment efficacy. For treatments where efficacy is ascertained in a relatively short period of time, such as one or two cycles of therapy, sequential designs for updating the joint probability of toxicity and efficacy and estimating the optimal dose have been studied extensively in the literature (for single-agent trials, see, e.g., Murtaugh and Fisher (1990), Thall and Russell (1998), Braun (2002), Ivanova (2003), Thall and Cook (2004), Chen et al. (2015), and Sato et al. (2016); for dose combination trials, Yuan and Yin (2011), Wages and Conaway (2014), Cai et al. (2014), Riviere et al. (2015), and Clertant and Tighiouart (2017)). For treatments where response evaluation takes a few cycles of therapy, it is standard practice to perform a two-stage design in which a maximum tolerated dose (MTD) of a new drug or combination of drugs is first determined, and this recommended phase II dose is then studied in stage II and evaluated for treatment efficacy, possibly using a different population of cancer patients from stage I (see Rogatko et al. 2008; Le Tourneau et al. 2009; Chen et al. 2012b for a review of this paradigm). For drug combination phase I trials, more than one MTD can be recommended at the conclusion of the trial, and choosing a single MTD combination for efficacy study may result in a failed phase II trial, since other MTDs may present higher treatment efficacy. Hence, adaptive or parallel phase II trials may be more suitable for searching for an optimal dose combination that is well tolerated with the desired level of efficacy.

Stage I

Stage I proceeds as in section "Dose Finding to Estimate the Maximum Tolerated Dose Curve." Let Cest be the estimated MTD curve obtained at the end of the phase I trial, and suppose it is defined for x ∈ [X1, X2] and y ∈ [Y1, Y2]. Here, [X1, X2] ⊆ [Xmin, Xmax] and [Y1, Y2] ⊆ [Ymin, Ymax]. Let E be the indicator of treatment response, such as tumor shrinkage: E = 1 if we have a positive response after a pre-defined number of treatment cycles and E = 0 otherwise. Let p0 be the probability of efficacy of the standard of care treatment. We propose to carry out a phase II study to identify dose combinations (x, y) ∈ Cest such that P(E = 1 | (x, y)) > p0.

Stage II

For every dose combination (x, y) ∈ Cest, let x be the unique vertical projection of (x, y) on the interval [X1, X2]. Next, denote by z ∈ [0, 1] the standardized dose of x ∈ [X1, X2] using the transformation z = h3(x) = (x − X1)/(X2 − X1). In the sequel, we refer to z as a dose combination, since there is a one-to-one transformation mapping z ∈ [0, 1] to (x, y) ∈ Cest, x ∈ [X1, X2], y ∈ [Y1, Y2]. We model the probability of treatment response given dose combination z in Cest as

    P(E = 1 | z, ψ) = F(f(z; ψ)),    (20)

where F is a known link function, f(z; ψ) is an unknown function, and ψ is an unknown parameter. A flexible way to model the probability of efficacy along the MTD curve is the cubic spline function

    f(z; ψ) = β0 + β1 z + β2 z² + Σ_{j=3}^{k} βj (z − κj)₊³,    (21)

where ψ = (β, κ), β = (β0, . . ., βk), and κ = (κ3, . . ., κk) with κ3 = 0. Let Dm = {(zi, Ei), i = 1, . . ., m} be the data after enrolling m patients in the trial, where Ei is the response of the i-th patient treated with dose combination zi, and let π(ψ) be a prior density on the parameter ψ. The posterior distribution is

    π(ψ | Dm) ∝ ∏_{i=1}^{m} [F(f(zi; ψ))]^{Ei} [1 − F(f(zi; ψ))]^{1−Ei} π(ψ).    (22)
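A direct Python sketch of this efficacy model, using the logistic link adopted later for the CisCab trial (κ3 = 0 is fixed and κ4, κ5 are the free knots):

import numpy as np

def f_spline(z, beta_, knots):
    # cubic spline (21): beta0 + beta1*z + beta2*z^2 + sum_j beta_j * (z - kappa_j)_+^3
    z = np.asarray(z, dtype=float)
    out = beta_[0] + beta_[1] * z + beta_[2] * z ** 2
    for bj, kj in zip(beta_[3:], knots):
        out = out + bj * np.clip(z - kj, 0.0, None) ** 3
    return out

def prob_response(z, beta_, knots):
    # P(E = 1 | z, psi) with the logistic link; Eq. (20)
    return 1.0 / (1.0 + np.exp(-f_spline(z, beta_, knots)))

def log_posterior_eff(beta_, free_knots, z_obs, E_obs, sigma2=1e4):
    # unnormalized log posterior (22): binomial likelihood, beta ~ N(0, sigma2 I),
    # kappa3 = 0 fixed, (kappa4, kappa5) uniform on the ordered triangle in (0, 1)
    k4, k5 = free_knots
    if not (0.0 <= k4 < k5 <= 1.0):
        return -np.inf
    p = prob_response(np.asarray(z_obs), beta_, (0.0, k4, k5))
    loglik = np.sum(np.asarray(E_obs) * np.log(p)
                    + (1 - np.asarray(E_obs)) * np.log1p(-p))
    return loglik - np.sum(np.asarray(beta_) ** 2) / (2 * sigma2)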

Let pz be the probability of treatment efficacy at dose combination z, and denote by p0 the probability of efficacy of a poor treatment or a treatment not worthy of further investigation. An adaptive design is used to conduct a phase II trial in order to test the hypothesis

    H0: pz ≤ p0 for all z    versus    H1: pz > p0 for some z.

Trial Design
(i) Randomly assign n1 patients to dose combinations z1, . . ., z_{n1} equally spaced along the MTD curve Cest so that each combination is assigned to one and only one patient.
(ii) Obtain a Bayes estimate ψ̂ of ψ given the data D_{n1} using (22).
(iii) Generate n2 dose combinations from the standardized density F(f(z; ψ̂)), and assign them to the next cohort of n2 patients.
(iv) Repeat steps (ii) and (iii) until a total of n patients have been enrolled in the trial, subject to pre-specified stopping rules.

This algorithm can be viewed as an extension of a Bayesian adaptive design for selecting a superior arm among a finite number of arms (Berry et al. 2011) to selecting a superior arm from an infinite number of arms.
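Step (iii) amounts to sampling doses from the density proportional to F(f(z; ψ̂)) on [0, 1], which can be approximated by inverse-cdf sampling on a grid. A Python sketch; the spline estimate in the example is made up purely for illustration:

import numpy as np

def sample_next_doses(density, n2, rng, grid_size=1001):
    # draw n2 doses z in [0, 1] from the density proportional to F(f(z; psi_hat)),
    # via inverse-cdf sampling on a grid (step (iii) of the algorithm)
    z = np.linspace(0.0, 1.0, grid_size)
    w = density(z)
    cdf = np.cumsum(w)
    cdf = cdf / cdf[-1]
    return np.interp(rng.random(n2), cdf, z)

# illustration with a made-up posterior estimate F(f(z)) = logistic(-1 + 4z - 3z^2)
density = lambda z: 1.0 / (1.0 + np.exp(-(-1.0 + 4.0 * z - 3.0 * z ** 2)))
print(sample_next_doses(density, 5, np.random.default_rng(7)))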
Decision rule. At the end of the trial, we accept the alternative hypothesis if

    Max_z [P(F(f(z; ψ)) > p0 | Dn)] > δu,    (23)

where δu is a design parameter.
Stopping rules. For ethical considerations and to avoid exposing patients to subtherapeutic doses, we stop the trial for futility after j patients are evaluable for efficacy if there is strong evidence that none of the dose combinations are promising, i.e., Max_z [P(F(f(z; ψ)) > p0 | Dj)] < δ0, where δ0 is a small pre-specified threshold. In cases where the investigator is interested in stopping the trial early for superiority, the trial can be terminated after j patients are evaluable for efficacy if Max_z [P(F(f(z; ψ)) > p0 | Dj)] > δ1, where δ1 ≥ δu is a pre-specified threshold, and the corresponding dose combination z* = argmax_u {P(F(f(u; ψ)) > p0 | Dj)} is selected for future randomized phase II or III studies.

CisCab Trial (Continued)


In section "Application to the CisCab Trial," we described the phase I part of the CisCab trial, in which 30 patients are enrolled and the MTD curve is estimated. In stage II, n = 30 patients will be enrolled to identify dose combinations along the MTD curve with maximum clinical benefit rate. Clinical benefit is defined as a complete response, partial response, or stable disease within three cycles of treatment. The probability of a poor clinical benefit is p0 = 0.15, and we expect that a tolerable dose combination achieves a clinical benefit rate of p = 0.4. We present simulations based on six scenarios: three favoring the alternative hypothesis and three supporting the null hypothesis. A logistic link function F(u) = (1 + exp(−u))⁻¹ is used in (20), and f(z; ψ) is modeled as a cubic spline function with two knots in (0, 1). This is a very flexible class of efficacy curves and accommodates cases of constant probability of efficacy along the MTD curve, high probability of efficacy around the middle of the MTD curve, and high probability of efficacy at one or both edges of the MTD curve. Vague priors are placed on the model parameters by assuming that β ~ N(0, σ²I6) with σ² = 10⁴ and (κ4, κ5) ~ Unif{(u, v) : 0 ≤ u < v ≤ 1}. It can be shown that the induced prior mean and variance of the probability of treatment response are Eprior(F(f(z; ψ))) ≈ 0.5 and Varprior(F(f(z; ψ))) ≈ 0.25 for all dose combinations z ∈ [0, 1]. The initial number of patients enrolled in the trial was set to n1 = 10, and n2 = 5 was used in the adaptive randomization phase of the design. The design parameter for the decision rule in (23) was taken as δu = 0.8. In each scenario, we simulated M = 2000 trial replicates.
The true probability of response curves under scenarios (a, b, c) are shown in blue in Fig. 7. The black horizontal lines correspond to the probability of a poor treatment response, p0 = 0.15, and the green horizontal lines represent the target probability of response, p = 0.4. Scenario (a) is a case where the probability of efficacy is maximized near the middle of the estimated MTD curve, with dose combinations in the interval (0.03, 0.76) having probability of efficacy greater than p0 = 0.15. The target probability of response is achieved at a single dose combination, z = 0.42. Scenario (b) is a case where higher doses of cisplatin and lower doses of cabazitaxel achieve higher efficacy. Specifically, standardized dose combinations in the interval (0.00, 0.49) have probability of efficacy greater than or equal to p0 = 0.15. Scenario (c) is an unusual situation where the probability of efficacy is maximized at the edges of the MTD curve. In this case, dose combinations in the interval (0.00, 0.41) ∪ (0.90, 1.00) have probability of efficacy greater than or equal to p0 = 0.15. Corresponding to these scenarios are situations (d–f) favoring the null hypothesis, shown in Fig. 7d, e, f. The true probability of response curves shown in blue have been shifted downward so that the probability of response equals p0 at only one dose combination for scenarios (d), (e), and (f).
Fig. 7 True and estimated efficacy curves under six scenarios favoring the null and alternative hypotheses

Operating Characteristics
For each scenario favoring the alternative hypothesis, we estimate the Bayesian power as

    Power ≈ (1/M) Σ_{i=1}^{M} I[Max_z {P(F(f(z; ψi)) > p0 | Dn,i)} > δu],    (24)

where P(F(f(z; ψi)) > p0 | Dn,i) is estimated using an MCMC sample of ψi,

    P(F(f(z; ψi)) > p0 | Dn,i) ≈ (1/L) Σ_{j=1}^{L} I[F(f(z; ψi,j)) > p0],    (25)

where ψi,j, j = 1, . . ., L, is an MCMC sample from the i-th trial. For scenarios favoring the null hypothesis, (24) is the estimated Bayesian type I error probability. The optimal or target dose from the i-th trial is

    zi* = argmax_v {P(F(f(v; ψi)) > p0 | Dn,i)}.    (26)

We also report the estimated efficacy curve obtained by replacing ψ in (20) by the average of the posterior medians across all simulated trials,

    F(f(z; ψ̄)),    (27)

where ψ̄ = (β̄, κ̄), β̄l = (1/M) Σ_{i=1}^{M} β̂i,l for l = 0, . . ., 5, κ̄k = (1/M) Σ_{i=1}^{M} κ̂i,k for k = 4, 5, and β̂i,l, κ̂i,k are the posterior medians from the i-th trial. Finally, we also report the mean posterior probability of declaring the treatment efficacious at each dose combination z,

    (1/M) Σ_{i=1}^{M} P(F(f(z; ψi)) > p0 | Dn,i).    (28)
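These quantities reduce to indicator averages over the MCMC output. A Python sketch, assuming the response probabilities F(f(z; ψi,j)) have already been evaluated on a grid of z values for each trial:

import numpy as np

def prob_promising(resp_probs, p0):
    # Eq. (25): resp_probs is an (L, K) array of F(f(z_k; psi_{i,j})) over the
    # L MCMC draws of one trial and a grid of K dose combinations z_k
    return (resp_probs > p0).mean(axis=0)

def bayesian_power(trials_resp_probs, p0=0.15, delta_u=0.8):
    # Eq. (24): fraction of simulated trials whose maximum posterior probability
    # of exceeding p0 passes the threshold delta_u
    return np.mean([prob_promising(rp, p0).max() > delta_u
                    for rp in trials_resp_probs])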

The estimated efficacy curves, shown as black dashed lines in Fig. 7 and computed using Eq. (27), are fairly close to the true probability of efficacy curves in all scenarios except for scenario (c) near the lower edge of the MTD curve. The mean posterior probability of efficacy curve, shown as a red dashed line and computed using Eq. (28), is 80% or more at dose combinations where the true probability of efficacy is maximized for scenarios (a, b) and close to 80% for scenario (c). Similar conclusions can be drawn for scenarios favoring the null hypothesis, where the maximum of the mean posterior probability of efficacy is less than 50%. Figure 8 shows the estimated density of the target dose z* defined in Eq. (26) under scenarios favoring the alternative hypothesis (a–c); the shaded region corresponds to dose combinations with true probability of efficacy greater than p0 = 0.15.

Fig. 8 Estimated density of the target dose combination under three scenarios favoring the
alternative hypothesis

Table 2 Bayesian power, type I error, and coverage probabilities

Scenario   Power    Scenario   Prob(Type I error)   Coverage prob.
(a)        0.896    (d)        0.100                0.964
(b)        0.921    (e)        0.190                0.897
(c)        0.810    (f)        0.143                0.937

The mode of these densities is close to the target doses. Moreover, the estimated probabilities of selecting a dose with true probability of efficacy greater than p0 = 0.15 vary between 0.90 and 0.96 across the three scenarios. The Bayesian power for scenarios (a–c) and the type I error probability for scenarios (d–f), estimated using Eq. (24) with threshold δu = 0.8, are reported in Table 2. Power varies between 0.81 and 0.92, and the type I error probability varies between 0.10 and 0.19. The coverage probability in the last column of Table 2 is the estimated probability of selecting a dose with true probability of efficacy greater
than p0 = 0.15. We conclude that the design has good operating characteristics in identifying tolerable dose combinations with maximum benefit rate. We refer the reader to Tighiouart (2019) for sensitivity analyses regarding n1, n2, δu, and other values of p0 and effect size.

Discrete Dose Combinations

In sections "Dose Finding to Estimate the Maximum Tolerated Dose Curve," "Attributable Toxicity," and "Phase I/II Dose Finding," the methodologies for estimating the MTD curve and the tolerable dose combination with maximum probability of efficacy were described for continuous dose levels of the two agents. These methods can be adapted to the case of pre-specified discrete dose levels as follows. Let (x1, . . ., xr) and (y1, . . ., ys) be the doses of agents A1 and A2, respectively. Following the notation of section "Dose Finding to Estimate the Maximum Tolerated Dose Curve," Xmin = x1, Xmax = xr, Ymin = y1, and Ymax = ys, and the doses are standardized to be in the interval [0, 1]. Dose escalation proceeds using the algorithms described in sections "Trial Design" and "Dose Allocation Algorithm," where the recommended continuous doses in steps (ii) and (iii) of the algorithms are rounded to the nearest discrete dose levels. At the end of the trial, a discrete set Γ satisfying conditions (a) and (b) below is selected as the set of MTDs. Let d((xj, yk), Cest) be the Euclidean distance between dose combination (xj, yk) and the estimated MTD curve Cest.

(a) Let ΓA1 = ∪_{t=1}^{s} {(x*, yt) : x* = argmin_{xj} d((xj, yt), Cest)}, ΓA2 = ∪_{t=1}^{r} {(xt, y*) : y* = argmin_{yj} d((xt, yj), Cest)}, and Γ0 = ΓA1 ∩ ΓA2.
(b) Let Γ = Γ0 \ {(x, y) : P(|P(DLT | (x, y)) − θ| > δ1 | Dn) > δ2}.

The set Γ0 in (a) consists of dose combinations closest to the MTD curve, obtained by first minimizing the Euclidean distances across the levels of drug A1 and then across the levels of drug A2. Doses in Γ0 that are likely to be either too toxic or subtherapeutic are excluded in (b). The design parameter δ1 is selected after consultation with a clinician. The parameter δ2 is selected after exploring a large number of practical scenarios when designing a trial. In our experience with the sample sizes and scenarios used in Wang and Ivanova (2005), we found that δ2 = 0.3, 0.35 results in good design operating characteristics.
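Conditions (a) and (b) are easy to operationalize once posterior draws of P(DLT | (xi, yj)) are available at every discrete combination; a Python sketch:

import numpy as np

def discrete_mtd_set(x_levels, y_levels, curve_pts, p_dlt_draws, theta, delta1, delta2):
    # curve_pts: (k, 2) points on the estimated MTD curve Cest (standardized doses);
    # p_dlt_draws: (S, r, s) posterior draws of P(DLT | (x_i, y_j))
    def dist(x, y):
        return np.sqrt(((curve_pts - np.array([x, y])) ** 2).sum(axis=1)).min()

    D = np.array([[dist(x, y) for y in y_levels] for x in x_levels])     # r x s distances
    G_A1 = {(int(np.argmin(D[:, j])), j) for j in range(len(y_levels))}  # closest x per y_t
    G_A2 = {(i, int(np.argmin(D[i, :]))) for i in range(len(x_levels))}  # closest y per x_t
    G0 = G_A1 & G_A2                                                     # condition (a)
    # condition (b): exclude combinations likely too toxic or subtherapeutic
    return {(i, j) for (i, j) in G0
            if np.mean(np.abs(p_dlt_draws[:, i, j] - theta) > delta1) <= delta2}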

Illustration

We consider five scenarios studied in Wang and Ivanova (2005), shown in Table 3. The sample size is n = 54 for the first four scenarios and n = 60 for the last scenario.

Table 3 Dose limiting toxicity scenarios with θ = 0.2

             Dose level
             1      2      3      4      5      6
Scenario 1
  3          0.08   0.13   0.20   0.29   0.40   0.53
  2          0.05   0.08   0.13   0.20   0.29   0.40
  1          0.03   0.05   0.08   0.13   0.20   0.20
Scenario 2
  3          0.05   0.08   0.11   0.15   0.21   0.29
  2          0.04   0.06   0.09   0.13   0.18   0.25
  1          0.04   0.05   0.08   0.11   0.15   0.21
Scenario 3
  3          0.20   0.30   0.41   0.53   0.65   0.70
  2          0.10   0.20   0.25   0.32   0.41   0.50
  1          0.03   0.05   0.13   0.20   0.27   0.35
Scenario 4
  3          0.20   0.40   0.47   0.56   0.65   0.76
  2          0.08   0.13   0.20   0.32   0.41   0.50
  1          0.03   0.05   0.08   0.13   0.17   0.20
Scenario 5
  4          0.20   0.29   0.40   0.53
  3          0.13   0.20   0.29   0.40
  2          0.08   0.13   0.20   0.29
  1          0.05   0.08   0.13   0.20

The target probability of DLT is θ = 0.2, and the prior distributions for ρ00, ρ01, ρ10, and η3 are as described in section "Dose Finding to Estimate the Maximum Tolerated Dose Curve" with hyperparameters ai = bi = 1, i = 1, . . ., 3. A tight gamma(1, 1) prior was put on the interaction parameter η3, since the model in Wang and Ivanova (2005) has two parameters with no interaction coefficient. We assess the performance of the method by simulating m = 2000 trials and calculating the accuracy index introduced in Cheung (2011),

    AI_n = 1 − K × [Σ_{k=1}^{K} Δk p_{n,k}] / [Σ_{k=1}^{K} Δk],    (29)

where n is the trial sample size, K is the number of discrete doses available in the trial, p_{n,k} is the probability of selecting dose k in a trial with n patients, and Δk is a distance measure between the true probability of DLT pk at dose k and the target probability of DLT θ. It can be shown that AI_n < 1, and higher values of AI_n are desirable. We also report a measure of percent selection defined as follows. For a given scenario, let Γδ = {(xi, yj) : |P(DLT | (xi, yj)) − θ| < δ} be the set of true MTDs, where the threshold parameter δ is fixed by the clinician. Let Γi be the set of estimated MTDs at the end of the i-th trial as described in section "Discrete Dose Combinations," i = 1, . . ., m. The percent of MTD selection is

    %Selection = (1/m) Σ_{i=1}^{m} I(Γi ⊆ Γδ).    (30)

This statistic is an estimate of the probability that, for a given scenario, a prospective trial will recommend a set of dose combinations that are all MTDs. Other measures of percent selection when recommending more than one MTD can be found in Diniz et al. (2017). Table 4 gives the summary statistics of the accuracy index using the square discrepancy (sq) Δk = (pk − θ)², the absolute discrepancy (abs) Δk = |pk − θ|, and the overdose error (od) Δk = α*(θ − pk)+ + (1 − α*)(pk − θ)+ with α* = 0.25, the percent selection with δ = 0.1, and the safety of the trial for the proposed approach (conditional EWOC) and the two-dimensional design of Wang and Ivanova (2005). Conditional EWOC performs well relative to the two-dimensional design under scenarios 1, 3, and 5 according to the three measures of discrepancy. Scenario 2 is more complex due to the location of the true MTDs, and this is reflected by the negative values of the accuracy index across the three measures of discrepancy for the two approaches. The two-dimensional design performs better than conditional EWOC under scenario 4 according to two of the three discrepancy measures. When the accuracy index AI_n is averaged across the five scenarios, conditional EWOC performs better than the two-dimensional design for each discrepancy measure. The percent selection is higher using conditional EWOC under scenarios 1, 3, and 4 and higher on average across all five scenarios relative to the two-dimensional design. The last three columns of Table 4 show that the trial is safe using both approaches under these five scenarios. Other simulation results using informative priors matching the priors used in Wang and Ivanova (2005) lead to similar conclusions and much higher percent selection under scenarios 1, 3, 4, and 5. Further details can be found in Tighiouart et al. (2017b).
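Both summary measures are simple functions of the simulation output; a Python sketch of (29) and (30):

import numpy as np

def accuracy_index(p_select, p_true, theta, kind="abs", a_star=0.25):
    # Eq. (29); p_select[k] = probability of selecting dose k, p_true[k] = true DLT prob
    p_select, p_true = np.asarray(p_select), np.asarray(p_true)
    if kind == "sq":
        delta = (p_true - theta) ** 2
    elif kind == "abs":
        delta = np.abs(p_true - theta)
    else:   # overdose error "od"
        delta = (a_star * np.clip(theta - p_true, 0.0, None)
                 + (1 - a_star) * np.clip(p_true - theta, 0.0, None))
    return 1.0 - len(delta) * float(np.dot(delta, p_select)) / delta.sum()

def percent_selection(estimated_sets, true_mtd_set):
    # Eq. (30): fraction of trials whose recommended set lies inside the true MTD set
    return float(np.mean([set(g) <= set(true_mtd_set) for g in estimated_sets]))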

Summary and Conclusion

Model-based designs for drug combinations in early-phase cancer clinical trials have been studied extensively in the last decade. For phase I trials, these methods are designed to estimate one or more MTDs for use in future phase II trials. It is important to note that designs that recommend more than one MTD for efficacy studies should be used, as this may decrease the likelihood of a failed phase II trial. In this chapter, we focused on dose finding using two drugs with continuous dose levels. For a phase I trial design, consecutive cohorts of two patients were treated simultaneously with different dose combinations to better explore the space of doses. The method was studied in Tighiouart et al. (2014b, 2016, 2017b) and Diniz et al. (2017) via extensive simulations and was shown to be safe in general with a high percent of MTD recommendation. We also showed how it was applied to design the first part of the CisCab trial using a relatively small sample size and how to calibrate the prior distributions of the model parameters. In practice, active involvement of the clinician is required at the design stage of the trial to facilitate prior calibration and to specify scenarios with various locations of the true MTD set of doses.
Table 4 Operating characteristics for the two designs

                     Accuracy index               % Selection   Mean % DLTs   % Trials: DLT     % Trials: DLT
                     sq       abs      od                                     rate > θ + 0.05   rate > θ + 0.10
Scenario 1
  Cond. EWOC         0.37     −0.13    0.07       75.2          15.05         0.30              0.00
  Two-dim            0.17     −0.45    −0.08      58.8          16.56         0.65              0.00
Scenario 2
  Cond. EWOC         −0.34    −0.63    −0.62      53.6          12.54         0.00              0.00
  Two-dim            −0.13    −0.72    −0.82      81.6          13.37         0.08              0.00
Scenario 3
  Cond. EWOC         0.80     0.40     0.46       92.65         19.08         6.30              0.00
  Two-dim            0.70     0.20     0.43       68.58         20.08         6.90              0.10
Scenario 4
  Cond. EWOC         0.57     0.10     0.20       45.95         17.10         2.30              0.00
  Two-dim            0.62     0.05     0.36       39.93         19.35         3.55              0.00
Scenario 5
  Cond. EWOC         0.54     0.01     0.23       78.55         16.34         0.01              0.00
  Two-dim            0.27     −0.5     −0.12      87.40         17.72         0.88              0.00
Average
  Cond. EWOC         0.39     −0.05    0.07       69.19         16.02         1.78              0.00
  Two-dim            0.33     −0.28    −0.05      67.26         17.42         2.41              0.02

It is well known that optimal treatment protocols use drug combinations that have nonoverlapping toxicities. However, cancer drugs with nonoverlapping toxicities of any grade are rare. In this chapter, we described situations where the clinician is able to attribute the DLT to one or more drugs in an unknown fraction of patients by extending the previous statistical models. This is practically useful when the two drugs do not have many overlapping toxicities (see, e.g., Miles et al. 2002 for some examples of drug combination trials with these characteristics). We showed by simulations that as the fraction of attributable toxicities increases, the rate of DLT decreases, and there is a gain in the precision of the estimated MTD curve. In cases where we expect a high percent of overlapping DLTs, designs that do not distinguish between drug attributions, listed in the introduction and described in section "Dose Finding to Estimate the Maximum Tolerated Dose Curve," may be more appropriate. It is also important to note that the method relies on clinical judgment regarding DLT attribution.
In the second part of the chapter, we showed how the estimated MTD curve from a phase I trial is carried to a phase II trial for efficacy study using Bayesian adaptive randomization. This design can be viewed as an extension of the Bayesian adaptive design comparing a finite number of arms (Berry et al. 2011) to comparing an infinite number of arms. In particular, if the dose levels of the two agents are discrete, then methods such as the ones described in Thall et al. (2003), Wang and Ivanova (2005), and Wages (2016) can be used to identify a set of MTDs in stage I, and the trial in stage II can be done using adaptive randomization to select the most efficacious dose. Unlike phase I/II designs that use toxicity and efficacy data simultaneously and require a short period of time to resolve efficacy status, the use of a two-stage design is sometimes necessary in practice if it takes a few cycles of therapy to resolve treatment efficacy or if the populations of patients in phases I and II are different. In fact, for the CisCab trial described in section "Phase I/II Dose Finding," efficacy is resolved after three cycles (9 weeks) of treatment, and patients in stage I must have metastatic, castration-resistant prostate cancer, whereas patients
in stage II must have visceral metastasis. The uncertainty of the estimated MTD
curve in stage I is not taken into account in stage II of the design in the sense that the
MTD curve is not updated as a result of observing DLTs in stage II. This is a
limitation of this approach since patients in stage II may come from a different
population and may have different treatment susceptibility relative to patients in
stage I. This problem is also inherent to single agent two-stage designs where the
MTD from the phase I trial is used in phase II studies and safety is monitored
continuously during this phase. Due to the small sample size, methods to estimate
the MTD curves for each subpopulation in the phase I trial (Diniz et al. 2018) may
not be appropriate. An alternative design would account for first-, second-, and third-cycle DLTs in addition to the efficacy outcome at each cycle. In addition, the nature of
DLT (reversible vs. nonreversible) should be taken into account since patients with a
reversible DLT are usually treated for that side effect and kept in the trial with dose
reduction in subsequent cycles. For the CisCab trial, a separate stopping rule using
Bayesian continuous monitoring for excessive toxicity is included in the clinical
protocol.

Cross-References

▶ Adaptive Phase II Trials
▶ Bayesian Adaptive Designs for Phase I Trials
▶ Dose-Finding and Dose-Ranging Studies
▶ Inferential Frameworks for Clinical Trials
▶ Interim Analysis in Clinical Trials

Acknowledgments This work is supported in part by National Institutes of Health grant R01 CA188480-01A1; the National Center for Research Resources, grant UL1RR033176, now at the National Center for Advancing Translational Sciences, grant UL1TR000124; P01 CA098912; and U01 CA232859-01.

References
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17:1103–1120
Berry SM, Carlin BP, Lee JJ, Muller P (2011) Bayesian adaptive methods for clinical trials.
Chapman & Hall, Boca Raton
Braun TM (2002) The bivariate continual reassessment method: extending the CRM to phase I trials
of two competing outcomes. Control Clin Trials 23:240–256
Braun TM, Wang SF (2010) A hierarchical Bayesian design for phase I trials of novel combinations
of cancer therapeutic agents. Biometrics 66:805–812
Cai C, Yuan Y, Ji Y (2014) A Bayesian dose finding design for oncology clinical trials of
combinational biological agents. Appl Stat 63:159–173
Chen Z, Tighiouart M, Kowalski J (2012a) Dose escalation with overdose control using a quasi-
continuous toxicity score in cancer phase I clinical trials. Contemp Clin Trials 33:949–958
Chen Z, Zhao Y, Cui Y, Kowalski J (2012b) Methodology and application of adaptive and
sequential approaches in contemporary clinical trials. J Probability Stat 2012:20
Chen Z, Yuan Y, Li Z, Kutner M, Owonikoko T, Curran WJ, Khuri F, Kowalski J (2015) Dose
escalation with over-dose and under-dose controls in phase I/II clinical trials. Contemp Clin
Trials 43:133–141
Cheung YK (2011) Dose-finding by the continual reassessment method, 1st edn. Chapman & Hall,
Boca Raton
Clertant M, Tighiouart M (2017) Design of phase I/II drug combination cancer trials using conditional continual reassessment method and adaptive randomization. In: JSM proceedings, biopharmaceutical section. American Statistical Association, Alexandria, pp 1332–1349
Diniz MA, Li Q, Tighiouart M (2017) Dose finding for drug combination in early cancer phase I trials using conditional continual reassessment method. J Biom Biostat 8:381. https://doi.org/10.4172/2155-6180.1000381
Diniz MA, Kim S, Tighiouart M (2018) A Bayesian adaptive design in cancer phase I trials using
dose combinations in the presence of a baseline covariate. J Probab Stat 2018:11
Diniz MA, Tighiouart M, Rogatko A (2019) Comparison between continuous and discrete doses for
model based designs in cancer dose finding. PLoS One 14:e0210139
Faries D (1994) Practical modifications of the continual reassessment method for phase I cancer
clinical trials. J Biopharm Stat 4:147–164
Goodman S, Zahurak M, Piantadosi S (1995) Some practical improvements in the continual
reassessment method for phase I studies. Stat Med 14:1149–1161
Ivanova A (2003) A new dose-finding design for bivariate outcomes. Biometrics 59:1001–1007

Jimenez JL, Tighiouart M, Gasparini M (2019) Cancer phase I trial design using drug combinations
when a fraction of dose limiting toxicities is attributable to one or more agents. Biom J 61
(2):319–332
Le Tourneau C, Lee JJ, Siu LL (2009) Dose escalation methods in phase I cancer clinical trials.
J Natl Cancer Inst 101:708–720
Lockhart AC, Sundaram S, Sarantopoulos J, Mita MM, Wang-Gillam A, Moseley JL, Barber SL,
Lane AR, Wack C, Kassalow L, Dedieu JF, Mita A (2014) Phase I dose-escalation study of
cabazitaxel administered in combination with cisplatin in patients with advanced solid tumors.
Investig New Drugs 32:1236–1245
Lunn DJ, Thomas A, Best N, Spiegelhalter D (2000) WinBUGS – a Bayesian modelling frame-
work: concepts, structure, and extensibility. Stat Comput 10:325–337
Mander A, Sweeting M (2015) A product of independent beta probabilities dose escalation design
for dual-agent phase I trials. Stat Med 34:1261–1276
Miles D, Von Minckwitz GJ, Seidman AD (2002) Combination versus sequential single-agent
therapy in metastatic breast cancer. Oncologist 7:13–19
Murtaugh PA, Fisher LD (1990) Bivariate binary models of efficacy and toxicity in dose-ranging
trials. Commun Stat Theory Methods 19:2003–2020
O’Quigley J, Shen LZ (1996) Continual reassessment method: a likelihood approach. Biometrics
52:673–684
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase I
clinical trials in cancer. Biometrics 46:33–48
Piantadosi S, Fisher JD, Grossman S (1998) Practical implementation of a modified continual
reassessment method for dose-finding trials. Cancer Chemother Pharmacol 41:429–436
Plummer M (2003) JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd international workshop on distributed statistical computing (DSC 2003), Vienna, Austria
Riviere M, Yuan Y, Dubois F, Zohar S (2014) A bayesian dose-finding design for drug combination
clinical trials based on the logistic model. Pharm Stat 13:247–257
Riviere MK, Yuan Y, Dubois F, Zohar S (2015) A Bayesian dose-finding design for clinical trials
combining a cytotoxic agent with a molecularly targeted agent. J R Stat Soc Ser C 64:215–229
Rogatko A, Gosh P, Vidakovic B, Tighiouart M (2008) Patient-specific dose adjustment in the
cancer clinical trial setting. Pharm Med 22:345–350
Sato H, Hirakawa A, Hamada C (2016) An adaptive dose-finding method using a change-point
model for molecularly targeted agents in phase I trials. Stat Med 35:4093–4109
Shi Y, Yin G (2013) Escalation with overdose control for phase I drug-combination trials. Stat Med
32:4400–4412
Thall PF, Cook JD (2004) Dose-finding based on efficacy toxicity trade-offs. Biometrics 60:684–693
Thall PF, Russell KE (1998) A strategy for dose-finding and safety monitoring based on efficacy
and adverse outcomes in phase I/II clinical trials. Biometrics 54:251–264
Thall PF, Millikan RE, Mueller P, Lee SJ (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59:487–496
Tighiouart M (2019) Two-stage design for phase I/II cancer clinical trials using continuous-dose
combinations of cytotoxic agents. J R Stat Soc Ser C 68(1):235–250
Tighiouart M, Rogatko A (2010) Dose finding with escalation with overdose control (EWOC) in
cancer clinical trials. Stat Sci 25:217–226
Tighiouart M, Rogatko A (2012) Number of patients per cohort and sample size considerations
using dose escalation with overdose control. J Probab Stat 2012:16
Tighiouart M, Rogatko A, Babb JS (2005) Flexible Bayesian methods for cancer phase I clinical
trials. Dose escalation with overdose control. Stat Med 24:2183–2196
Tighiouart M, Cook-Wiens G, Rogatko A (2012a) Escalation with overdose control using ordinal
toxicity grades for cancer phase I clinical trials. J Probab Stat 2012:18. https://doi.org/10.1155/
2012/317634
Tighiouart M, Cook-Wiens G, Rogatko A (2012b) Incorporating a patient dichotomous characteristic
in cancer phase I clinical trials using escalation with overdose control. J Probab Stat 2012:10

Tighiouart M, Liu Y, Rogatko A (2014a) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9:e93070
Tighiouart M, Piantadosi S, Rogatko A (2014b) Dose finding with drug combinations in cancer
phase I clinical trials using conditional escalation with overdose control. Stat Med 33:3815–
3829
Tighiouart M, Li Q, Piantadosi S, Rogatko A (2016) A Bayesian adaptive design for combination of
three drugs in cancer phase I clinical trials. Am J Biostat 6:1–11
Tighiouart M, Cook-Wiens G, Rogatko A (2017a) A Bayesian adaptive design for cancer phase I
trials using a flexible range of doses. J Biopharm Stat 31:1–13
Tighiouart M, Li Q, Rogatko A (2017b) A Bayesian adaptive design for estimating the maximum
tolerated dose curve using drug combinations in cancer phase I clinical trials. Stat Med 36:280–
290
Wages NA (2016) Identifying a maximum tolerated contour in two-dimensional dose finding. Stat
Med 36:242–253
Wages NA, Conaway MR (2014) Phase I/II adaptive design for drug combination oncology trials.
Stat Med 33:1990–2003
Wages NA, Conaway MR, O’Quigley J (2011) Continual reassessment method for partial ordering.
Biometrics 67:1555–1563
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics
61:217–222
Wheeler GM, Sweeting MJ, Mander AP (2017) Toxicity-dependent feasibility bounds for the
escalation with overdose control approach in phase I cancer trials. Stat Med 36:2499–2513
Yin GS, Yuan Y (2009a) A latent contingency table approach to dose finding for combinations of
two agents. Biometrics 65:866–875
Yin GS, Yuan Y (2009b) Bayesian dose finding by jointly modelling toxicity and efficacy as time-
to-event outcomes. J R Stat Soc Ser C Appl Stat 58:719–736
Yuan Y, Yin G (2011) Bayesian phase I/II adaptively randomized oncology trials with combined
drugs. Ann Appl Stat 5:924–942
Middle Development Trials
56
Emine O. Bayman

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
Single-Arm Versus Two-Arm Phase II Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
Frequentist Two-Stage Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1033
Pitfalls with Conventional Frequentist Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035
Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035
How to Construct Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036
Noninformative Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036
Beta-Binomial Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037
Bayesian Phase II Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1038
Predictive Probability Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1039
Oncology Example with the PP Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1042
Frequentist Two-Stage Design Versus Bayesian PP Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1043
Bayesian Phase I–II Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044

Abstract
Phase I trials are the first application of a new treatment in humans. The main goal of a phase I trial is to establish the safety of the new treatment and determine the maximum tolerated dose for use in a subsequent phase II clinical trial. When moving from a phase I to a phase II trial, the focus shifts from toxicity (safety) to efficacy. In phase II trials, the aim is to decide whether the new treatment is sufficiently promising relative to the standard therapy that it can be included in a large-scale phase III clinical trial.

E. O. Bayman (*)
University of Iowa, Iowa City, IA, USA
e-mail: [email protected]


In this chapter, frequentist one-arm two-stage phase II clinical trials are introduced first. Then, a brief background on Bayesian trials is provided. Finally, a one-arm Bayesian design using the predictive probability approach is explained. Calculations or software to implement the examples are also provided when available.

Keywords
Phase II · Clinical trial · Bayesian · Predictive probability · Two-stage design

Introduction

After one or more successful phase I trials have been completed, phase II trials may be initiated. Phase II clinical trials aim to decide whether a new treatment is sufficiently promising, relative to the standard therapy, to include in large-scale randomized clinical trials. Phase II trials provide a bridge between small phase I trials, where the maximum tolerated dose is determined, and large-scale randomized phase III trials. Compared to phase I trials, phase II trials enroll larger groups of patients, generally 40 to 200. Compared to phase III trials, phase II trials tend to use surrogate markers as earlier endpoints (e.g., tumor shrinkage within the first few weeks instead of survival at 5 years) to shorten the study duration. Generally, the sample size is not large enough to have sufficient power. The design provides decision boundaries, a probability distribution for the sample size at termination, and operating characteristics under fixed response probabilities with the new treatment.
There are three basic requirements for any clinical trial: (1) the trial should
examine an important research question; (2) the trial should use rigorous methodol-
ogy to answer the question of interest; and (3) the trial must be based on ethical
considerations and assure that risks to subjects are minimized. Because of the small
sample size, meeting these requirements in early-phase clinical trials can be more
challenging compared to phase III trials. Therefore, the importance of study planning
is magnified in these settings.

Single-Arm Versus Two-Arm Phase II Trials

Phase II studies can be single arm, where only the treatment of interest is tested, or
two-arm with a concurrent control group. One-arm designs are used more frequently,
to expedite phase II clinical trials, and are presented here.
The main goal of phase II studies is to provide an assessment of the efficacy of the
treatment of interest. Accordingly, the goal is to determine whether the new treatment is
sufficiently promising to justify inclusion in large-scale randomized trials; other-
wise, ineffective treatments should be screened out. In addition, the safety profile of
the treatment of interest is further characterized in phase II trials. Generally, a binary
primary endpoint of favorable/unfavorable outcome (efficacy/no efficacy) is used in
phase II designs. If the probability of efficacy is higher than a predetermined
threshold at the end of the phase II trial, then the treatment of interest will be tested
in a larger phase III clinical trial (Yuan et al. 2016).

Frequentist Two-Stage Designs

The most commonly used frequentist two-stage designs are Gehan's design (Gehan
1961) and Simon's optimal and minimax designs (Simon 1989). The optimal
design minimizes the expected sample size under the null hypothesis, while the
minimax design minimizes the maximum total sample size. The user specifies the
fixed target response rate for the new therapy (p1) and the existing treatment (p0),
along with the type I (α) and type II (β) error rates, to obtain the sample sizes and
stopping boundaries for each stage of the two-stage design. In this setting, the type
I error rate can be interpreted as the probability of finding the treatment of interest
efficacious and recommending it for further study when it is not in fact efficacious.
Similarly, the type II error rate is the probability of finding the treatment of interest
not efficacious and not recommending it for further study when it is in fact
efficacious. Therefore, in phase II trials, it is more important to control the type II
error rate than the type I error rate so that efficacious treatments are not missed.
Type I and II error rates are larger than in phase III trials, typically around 10%.
Because of the small sample size, exact methods are preferred when available
(Jung 2013). More complex phase II designs with more than one interim monitoring
look also exist (Yuan et al. 2016).
At the end of the first stage, frequentist two-stage designs allow early termination
of the trial for futility if the interim data indicate that the new treatment is not
effective (Lee and Liu 2008). Both the optimal and minimax designs can be implemented
online for pre-specified inputs using the NIH Biometric Research Program website:
https://linus.nci.nih.gov/brb/samplesize/otsd.html.

Example 1 Let the current favorable response rate with the standard therapy be
30% (p0 = 0.3). The new treatment is expected to increase the favorable response rate to
50% (p1 = 0.5). The outcome will be recorded as favorable versus unfavorable for
each patient. Both type I and type II error rates will be kept below 10%. The null and
alternative hypotheses for this study can be written as H0: p ≤ 0.3 versus H1: p ≥ 0.5.
The stopping boundaries for each of the two stages can be calculated for both the
optimal and minimax designs from the website provided above.
For the optimal design, 22 (n1) patients should be enrolled in the first stage. If
there are 7 (r1) or fewer favorable outcomes out of the 22 patients, the trial should be
stopped early for futility, and the new treatment should be declared ineffective.
If there are more than 7 favorable outcomes, 24 more patients should be enrolled in
the second stage, so that the overall sample size of the study is 46 (n). If there are 17
(r) or fewer favorable outcomes out of the 46 patients, the new treatment will be declared
ineffective. If there are more than 17 favorable outcomes, the null hypothesis will be
rejected and the new treatment will be declared effective (Fig. 1).

Fig. 1 Frequentist two-stage design example (schematic): in stage I, n1 = 22 patients are enrolled and x1 favorable outcomes are observed; if x1 ≤ r1 = 7, stop for futility; if x1 > r1, go to stage II and enroll n2 = 24 more patients (n = 46 total) with x2 favorable outcomes; if x1 + x2 ≤ r = 17, the new treatment is declared ineffective, and if x1 + x2 > r, it is declared effective
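
The stopping boundaries and sample sizes above can also be checked in R. The following is a minimal sketch, assuming the clinfun package is installed (its ph2simon function implements Simon's two-stage designs; the NIH website gives equivalent results):

# Simon two-stage designs for p0 = 0.3, p1 = 0.5, alpha = beta = 0.10
library(clinfun)
ph2simon(pu = 0.3, pa = 0.5, ep1 = 0.10, ep2 = 0.10)
# The printed output lists r1/n1 and r/n for both the optimal and minimax designs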

The probability of early termination under the null hypothesis, PET(p0), is the
probability of observing 0 to 7 favorable outcomes out of the first 22 subjects when
the probability of a favorable outcome is 0.3. Let x1 be the number of favorable
outcomes among the first n1 patients in the first stage. The probability of early
termination can be calculated in R as PET(p0) = Pr(x1 ≤ r1 | H0) =
sum(dbinom(0:7, 22, 0.3)) = 0.67.
The expected sample size of the study under the null hypothesis is a weighted
combination of terminating early at the end of the first stage and continuing to enroll
all 46 patients. Therefore, it can be written as E(N | p0) = [n1 × PET(p0)] + [n × (1 −
PET(p0))] = [22 × 0.67] + [46 × (1 − 0.67)] = 29.9.
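
These two quantities can be scripted directly in base R; a minimal sketch for the Example 1 optimal design follows:

# Operating characteristics of the Example 1 optimal design under H0: p = 0.3
n1 <- 22; r1 <- 7; n <- 46; p0 <- 0.3
PET <- pbinom(r1, n1, p0)        # P(x1 <= r1), same as sum(dbinom(0:7, 22, 0.3))
EN  <- n1 * PET + n * (1 - PET)  # expected sample size under H0
c(PET = PET, EN = EN)            # approximately 0.67 and 29.9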
A good two-stage frequentist design would have type I and type II error rates
lower than the initial constraints (0.1 for each in our example), high probability of
early termination, and small expected sample size under the null hypothesis (Lee and
Liu 2008).

Oncology Example: Ray-Coquard et al. investigated the antitumor activity of
sorafenib in patients with metastatic or advanced angiosarcomas in a phase II clinical
trial (Ray-Coquard et al. 2012). The primary endpoint was the progression-free rate
(PFR) at 9 months after the initiation of sorafenib. Only the part of the study including
superficial angiosarcoma patients is considered here. Simon's two-stage mini-
max design was used. Based on a previous study using paclitaxel in angiosarcoma
patients, the 9-month PFR was assumed to be 12.7% (p0 = 0.127). It was assumed
that sorafenib would increase the 9-month PFR to 31.7% (p1 = 0.317). The type I and
type II error rates were selected as 10% and 5%, respectively.
The stopping boundaries according to the minimax design are as follows. Enroll 26
patients in the first stage. If 3 or fewer of the 26 patients are progression-free at
9 months, the trial should be stopped early for futility, and sorafenib should be
declared ineffective. If more than 3 patients are progression-free, 17
more patients should be enrolled in the second stage. If there are 8 or fewer favorable
outcomes out of the 43 patients, sorafenib will be declared ineffective with respect to
the 9-month PFR.

In this study, only 1 of the 26 patients enrolled in the first stage was progression-free
at 9 months. Therefore, no more patients were enrolled in stage 2, and
the study was stopped early for futility. It was concluded that sorafenib is ineffective
with respect to PFR at 9 months (Ray-Coquard et al. 2012).

Pitfalls with Conventional Frequentist Designs

Multistage designs have better statistical properties than single-stage designs because
they allow users to incorporate the interim data in decision-making (Lee and Liu 2008).
The two-stage design also has limitations. As an extreme case, consider
Example 1 presented above: p0 = 0.3, p1 = 0.5, n1 = 22, r1 = 7, n = 46, and r = 17.
Assume 8 favorable outcomes were observed at the first stage of the study, so 24
more patients should be enrolled in the second stage. To show efficacy at
the end of the second stage, more than 17 favorable outcomes are needed overall, that
is, at least 10 more favorable outcomes among the 24 second-stage patients. If no
favorable outcome is observed among the first 16 of these patients, at most 8 favorable
outcomes can occur among the remaining 8 patients, for a total of at most 16, so it is
impossible to declare efficacy at the end of the second stage (Fig. 2). However,
investigators cannot stop the study at this point under this design. In other words, eight
more patients must be enrolled in this study even though their outcomes cannot change
the overall conclusion. Therefore, more flexible designs that allow users to incorpo-
rate interim data at multiple stages of the study are needed.

Bayesian Methods

By incorporating prior information and allowing more frequent monitoring, Bayesian
designs may require a smaller sample size and therefore may take less time to
conduct than frequentist designs. The opportunity to arrive at the same
decision with a smaller sample size makes Bayesian designs especially appealing
for phase I and phase II clinical trials, where preliminary data are limited.
Bayesian methods are being used increasingly in the design and
analysis of clinical trials (Biswas et al. 2009).

Fig. 2 Frequentist two-stage design with an extreme case (schematic): x1 = 8 favorable outcomes in stage I (n1 = 22); among the n2 = 24 stage II patients (N = 46), the first 16 yield 0 favorable outcomes, so the trial cannot conclude efficacy regardless of the outcomes of the remaining 8 patients



As with frequentist designs, the statistical analysis plan should be predefined in
Bayesian designs. Similarly, the prior information should be identified in advance and justified.
A study design can be a stand-alone Bayesian design or a hybrid combining
Bayesian and frequentist approaches for different outcomes. The FDA Guidance for
the Use of Bayesian Statistics in Medical Device Clinical Trials requires that the
frequentist properties of Bayesian procedures be investigated (FDA 2010).
To implement most of the commonly used frequentist designs for phase II trials,
the clinician must specify a single value for the patients' favorable outcome rate on
the standard therapy, p0. In many cases there is uncertainty regarding p0. In contrast to
frequentist designs, in Bayesian designs the parameter of interest, p0, is considered a
random variable with a prior distribution with density π(p0). Both in planning the
phase II trial and in interpreting its results, a more realistic approach should explicitly
account for the clinician's uncertainty regarding p0.
The design and conduct of phase II clinical trials would benefit from statistical
methods that can incorporate external information into the design process. With the
Bayesian design, the prior information and uncertainty can be quantified into a
probability distribution. This prior information can be updated and easily
implemented in a sequential design strategy.
Bayesian inference requires a joint distribution of the unknown parameter p and
the data y. This is usually specified through a prior distribution π(p) over the
parameter space and a likelihood, the conditional distribution of the data y
given the parameter p. Bayesian inference about p is through the posterior distri-
bution, the conditional distribution of p given y.
As data accumulate, the prior distribution is updated, and the posterior distribu-
tion from the previous step becomes the prior distribution. Therefore, there is
continuous learning as data accumulate with the Bayesian approach:

P(p | y) ∝ L(y | p) π(p).    (1)

How to Construct Prior Distributions

If historical data are available for the standard therapy, they may be incorporated
formally into the trial design and subsequent statistical inferences. If no such data
exist, clinical experience and a clinician's current belief regarding the
efficacy of the standard therapy may be represented by a probability distribution on
p0. This prior probability distribution can be elicited from the subjective opinions of
experts in the field (Chaloner and Rhame 2001) or from the subjective opinion of the
investigator. In this case the Bayesian approach becomes even more appropriate.

Noninformative Prior Distributions

When the prior distribution and the posterior distribution are from the same family,
this is called conjugacy. For example, the beta prior distribution is a conjugate
family for the binomial likelihood (Gelman et al. 2004), and the normal
distribution with a known variance is conjugate to itself (Chen and Peace 2011).
When the prior distribution of p is not conjugate, the posterior distribution must
be calculated numerically. It is often mathematically convenient to use a conjugate
family of distributions so that the posterior distribution follows a known paramet-
ric form, although most real applied problems cannot be solved with conjugate prior
distributions alone.
If there is some information about the distribution, the parameters of the prior
distribution can be derived from it. For example, a binomial endpoint, in terms of
favorable versus unfavorable outcomes, is commonly used in phase II clinical trials. In
such a case, because of the conjugacy, using a beta prior distribution makes the
calculations easier (Gelman et al. 2004). Assume, from the historical data, that
the median and the upper confidence bound are known for the favorable outcome
rate. Using these two values, a search algorithm can be used to find the parameters
of the prior distribution for the favorable outcome rate. It is advisable to use a larger
standard error to add some uncertainty to the prior distribution (Lynch 2007).
When conducting Bayesian analyses, it is recommended to use different prior
distributions as a sensitivity analysis to assess the robustness of the results
(Gelman et al. 2004).
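
As an illustration of such a search, the following sketch uses base R's optim to find beta parameters matching an assumed prior median of 0.30 and an assumed upper 97.5% bound of 0.50; both target values are hypothetical and would come from the historical data in practice:

# Find Beta(a, b) whose median is near 0.30 and whose 97.5th percentile is near 0.50
target <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])   # keep the parameters positive
  (qbeta(0.50, a, b) - 0.30)^2 + (qbeta(0.975, a, b) - 0.50)^2
}
fit <- optim(c(0, 0), target)
exp(fit$par)                           # fitted (a, b) of the beta prior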
Another approach is to make statistical inferences from a posterior distribution
based on simulation (Chen and Peace 2011). Modern computational methods can be
used to calculate posterior distributions. For example, WinBUGS is a popular
software package specifically developed for Bayesian analyses; it implements
Markov chain Monte Carlo methods to generate a random sample from essentially
any posterior distribution, and a large class of prior distributions can be specified in
it. R packages such as R2WinBUGS (Sturtz et al. 2005) allow users to run
WinBUGS from within R, which provides a relatively easy framework for many analyses.
Additionally, there are stand-alone R packages, such as MCMCpack (Martin et al.
2011), that can be used for Bayesian analyses. MCMCpack covers an
extensive list of statistical models, such as hierarchical longitudinal models and the
multinomial logit model (Chen and Peace 2011).
Bayesian inferences are made based on the posterior distribution. It should also be
noted that there is no p-value in a Bayesian analysis. Instead, the 95% (or 1 − type I
error rate) credible interval of the posterior distribution can be used to evaluate the
strength of evidence in the results.
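
For a beta posterior, an equal-tailed credible interval is available in closed form from the beta quantile function. A minimal sketch, using the Beta(12.65, 4.35) posterior that arises in Example 2 below:

# Equal-tailed 95% credible interval for p under a Beta(12.65, 4.35) posterior
qbeta(c(0.025, 0.975), 12.65, 4.35)   # approximately (0.53, 0.92)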

Beta-Binomial Example

Assume p is the favorable outcome rate of the new treatment, and the interest is to
test the following hypotheses in a one-arm phase II clinical trial: H0: p ≤ p0 versus
H1: p > p1.
Let Y1, Y2, . . ., Yn denote the patient responses to the new treatment, with Yi = 1
for a success and Yi = 0 for a failure. Xn = Y1 + Y2 + . . . + Yn denotes the total
number of favorable outcomes among the n subjects treated. Xn follows a binomial
distribution with parameters n and p. Because of the conjugacy of the beta prior
distribution for the binomial likelihood, it is common to use a beta prior

distribution for the favorable response rate, p. Let the prior distribution for p follow
a beta distribution with parameters a and b:

π(p) = [Γ(a + b) / (Γ(a)Γ(b))] p^(a−1) (1 − p)^(b−1)

The mean of this beta distribution is a/(a + b).


The posterior distribution of the favorable response rate, given Xn, follows
another beta distribution:

p(p | Xn) ∝ L(Xn | p) π(p)
∝ p^Xn (1 − p)^(n−Xn) · p^(a−1) (1 − p)^(b−1)
∝ p^(Xn+a−1) (1 − p)^(n−Xn+b−1),    (2)

so that p | Xn ~ Beta(a + Xn, b + n − Xn).

The mean of this posterior distribution is (a + Xn) / (a + b + n). Therefore, the
prior distribution can be interpreted as contributing a + b patients, of whom
a have favorable outcomes and b have unfavorable outcomes. It is
recommended to keep the worth of the prior distribution, a + b, relatively small
compared to the number of actual patients, n (Geller 2004). Some authors
recommend choosing a equal to the mean probability of a favorable outcome
and b as 1 − a, 2 − a, or 3 − a, depending on how much weight investigators plan
to put on the prior distribution (Zohar et al. 2008). In other words, if the
expected favorable outcome rate is 40%, a can be selected as 0.4, and b can be 0.6,
1.6, or 2.6.
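
A short sketch of this conjugate update, using the hypothetical Beta(0.4, 0.6) prior above together with illustrative data of 9 favorable outcomes in 20 patients:

# Beta-binomial conjugate update, per Eq. (2)
a <- 0.4; b <- 0.6          # prior worth a + b = 1 patient, prior mean 0.40
Xn <- 9; n <- 20            # illustrative data: 9 favorable outcomes in 20
a_post <- a + Xn            # posterior is Beta(a + Xn, b + n - Xn)
b_post <- b + n - Xn
a_post / (a_post + b_post)  # posterior mean (a + Xn)/(a + b + n) = 9.4/21, about 0.45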

Bayesian Phase II Clinical Trials

Efficacy, safety, and cost of the proposed therapy are assessed in phase II trials
(Stallard 1998). In traditional frequentist two-stage phase II clinical trials, the data
can be assessed at only two stages. In contrast, Bayesian methods allow users to
examine the interim data at multiple stages by updating the posterior probability of
the parameters and making relevant predictions and decisions. At each stage, the
posterior distribution can be used to draw inferences concerning the parameter of
interest. Accordingly, at each stage, there are three possible actions (Lee and Liu
2008):

I: Stop the study for futility and declare that the new drug is not promising.
II: Stop the study for efficacy and declare that the new drug is promising.
III: Continue the phase II study until the next inspection or until the maximum sample size
is reached.

Predictive Probability Approach

The Bayesian decision in phase II clinical trials is based on predictive probabilities
(PP). The approach introduced by Lee and Liu for single-arm designs is
presented here (Lee and Liu 2008). The PP is obtained by calculating the probability
of rejecting the null hypothesis (concluding efficacy) should the trial be conducted to
the maximum planned sample size (Nmax), based on the current data from the
patients already enrolled in the study (Lee and Liu 2008). Then, depending on the
strength of this probability, the decision to continue or stop (go/no go) is made.
Assume the response is binary and the data are monitored continuously. The goal is
to provide simple and practical guidelines to decide whether the new treatment is
promising relative to the standard therapy while accounting for the uncertainty
regarding the response rates in each group. The trial continues until the new
treatment is shown with high posterior predictive probability to be either promising
or not promising, or until Nmax is reached.
First, the posterior probability of the response rate being greater than the pre-
specified alternative response probability is calculated. If this posterior
probability is greater than a pre-specified threshold (θT), the design declares
efficacy. Therefore, if Pr(p > p0 | X1, X2) > θT, the new treatment is deemed
efficacious and the study proceeds to a phase III clinical trial.
The steps of the Bayesian PP approach are as follows. First, the prior distribution
for the favorable outcome rate is pre-specified. Second, a group of n patients is
enrolled, and a favorable (X) versus unfavorable (n − X) outcome is observed for
each of these n patients. At this point (see Fig. 3), the posterior
distribution for the favorable outcome rate is obtained based on the prior information
and the data. As shown in Eq. (2), the posterior distribution based on n patients,
of whom X had favorable outcomes, is Beta(a + X, b + n − X). This posterior
distribution is then used as the prior distribution in the calculations for the m future
patients not yet observed.
Let Y be the number of favorable outcomes among the potential m = Nmax − n
future patients.

Fig. 3 Bayesian design with the PP approach (schematic): for the n observed patients, x ~ Bin(n, p) with prior p ~ Beta(a, b) and posterior p | x ~ Beta(a + x, b + n − x); for the m = Nmax − n patients not yet observed, Y ~ Beta-Binom(m, a + x, b + n − x) and p | x, y ~ Beta(a + x + y, b + n − x + m − y)



The distribution of the Y future responses can be derived as

P(y | x) = ∫₀¹ p(y | p) p(p | x) dp

= ∫₀¹ (m choose y) p^y (1 − p)^(m−y) · [Γ(a + b + n) / (Γ(a + x)Γ(b + n − x))] p^(a+x−1) (1 − p)^(b+n−x−1) dp

= [m! / (y!(m − y)!)] · [Γ(a + b + n) / (Γ(a + x)Γ(b + n − x))] ∫₀¹ p^(y+a+x−1) (1 − p)^(b+n−x+m−y−1) dp

= [Γ(m + 1) / (Γ(y + 1)Γ(m − y + 1))] · [Γ(a + b + n) / (Γ(a + x)Γ(b + n − x))] · [Γ(y + x + a)Γ(b + n − x + m − y) / Γ(a + b + n + m)],

and follows a beta-binomial distribution with parameters (m, a + x, b + n − x).


For the m potential future patients, the number of favorable outcomes, Y, will be
between 0 and m. The probability of observing each possible value Y = i, i = 0,
1, . . ., m, can be calculated from this beta-binomial distribution.
In addition, when Y = i favorable outcomes out of the m patients are observed, the
posterior distribution of the favorable outcome rate will follow another beta distri-
bution: p | X = x, Y = i ~ Beta(a + x + i, b + n − x + m − i). From this beta
distribution, the probability that p > p0 can be calculated and will be called Bi:

Bi = Prob(p > p0 | x, Y = i)

Then, this Bi value is compared to the threshold value θT. If Bi is greater than
θT for that realization of Y = i, it is expected that the new treatment will be
efficacious at the end of the trial. The predictive probability is the weighted average
of the indicator of a positive trial (Bi > θT) over the possible outcomes of the trial
continuing until Nmax patients are enrolled (Lee and Liu 2008). The predictive
probability approach looks at the strength of evidence for concluding efficacy at the
end of the trial, based on the current evidence in terms of the prior information and
the data. The decision to stop the study early for efficacy or futility, or to continue
because the current data are not conclusive, depends on this PP. If the PP is high,
it is expected that the new treatment will be efficacious at the end of the study,
given the current data. On the other hand, a low PP indicates that the new treatment
may not have sufficient activity by the end of the study. To prevent any ambiguity,
lower (θL) and upper (θU) stopping thresholds should be pre-specified:

PP = Σᵢ₌₀..ₘ Pr(Y = i | x) · I[Pr(p > p0 | x, Y = i) > θT] = Σᵢ₌₀..ₘ Pr(Y = i | x) · I[Bi > θT]

The decision of early stopping or continuing the trial will be based on the
following thresholds.

If PP < θL: given the current information, it is unlikely that the response rate
will be larger than p0 at the end of the trial. Stop for futility and reject H1.
If PP > θU: the current data suggest that, if the same trend continues, it is highly
likely that the treatment will be efficacious at the end of the trial. Stop for efficacy
and reject H0.
If θL < PP < θU: continue to the next stage until reaching Nmax patients.

Both the lower (θL) and upper (θU) stopping thresholds are set between 0 and 1. It is
advisable to stop early if the drug is not promising; therefore, θL is chosen close
to 0. In contrast, if the drug is promising, it is better not to stop the trial early;
therefore, θU is chosen close to 1.

Example 2 Assume an investigator wants to design a study with a binary favorable/
unfavorable outcome. Suppose she does not want to enroll more than 30 (Nmax)
subjects in this trial. To date, she has enrolled 16 (n) subjects and observed
favorable outcomes in 12 (x) of the 16 subjects. Assume Y is the number of
favorable outcomes for the 14 future subjects. What is the probability that the
favorable outcome rate is greater than 65% at the end of the trial?
The following vague beta prior distribution for the favorable outcome rate can be
used: Beta(0.65, 0.35). Note that, as explained above, the mean of this beta distri-
bution is 0.65 (= 0.65/(0.65 + 0.35)), and the worth of the prior distribution is only
1 patient (0.65 + 0.35 = 1).

In this setting, p ~ Beta(0.65, 0.35) a priori; after observing x = 12 favorable outcomes among the n = 16 enrolled patients, p | x ~ Beta(12.65, 4.35); and, given y favorable outcomes among the m = 14 future patients (Nmax = 30), p | x, y ~ Beta(12.65 + y, 18.35 − y).

A table showing each possible number of favorable outcomes for the 14 future
patients can be created (Table 1).
Note that, for example, to calculate Prob(Y = 11 | X = 12) in column 2, the
beta-binomial distribution should be used: beta-binomial(i = 11, m = 14, 12.65,
4.35). The dbetabinom.ab function in the VGAM package in R can be used to calculate this
probability: dbetabinom.ab(11, 14, shape1 = 12.65, shape2 = 4.35). Similarly, to
calculate the probability that p > 0.65 given x = 12 and i = 11, presented in column 3,
the beta distribution Beta(23.65, 7.35) should be used; in R, 1 − pbeta(0.65, 23.65,
7.35) can be used. The last column is an indicator function showing whether
Bi is greater than the threshold θT = 0.9.
For values of Y between 0 and 10, Bi is less than 0.9. In other words, if 0 to 10
favorable outcomes are observed out of the 14 future patients, the null hypothesis
will fail to be rejected, and it will be concluded that the new treatment is not effective.

Table 1 Calculation of Bi and PP

Y = i    Prob(Y = i | x)    Bi = Prob(p > 0.65 | x, Y = i)    I(Bi > 0.9)
0        0.0000             0.0031                            0
1        0.0001             0.0088                            0
2        0.0004             0.0223                            0
3        0.0017             0.0504                            0
4        0.0051             0.1015                            0
5        0.0128             0.1832                            0
6        0.0275             0.2981                            0
7        0.0516             0.4393                            0
8        0.0856             0.5909                            0
9        0.1261             0.7319                            0
10       0.1635             0.8450                            0
11       0.1832             0.9225                            1
12       0.1706             0.9672                            1
13       0.1209             0.9885                            1
14       0.0509             0.9968                            1

Nmax = 30, n = 16, x = 12, m = 14, a = 0.65, b = 0.35, θT = 0.9, θL = 0.1, and θU = 0.95

On the other hand, if the number of favorable outcomes among the 14 future
patients is 11 or more, the new treatment will be deemed effective.
Finally, the PP for this table can be calculated as 0.1832 + 0.1706 + 0.1209 +
0.0509 = 0.5256. This PP value is greater than θL = 0.1; therefore, the trial cannot be
stopped for futility. It is less than θU = 0.95; thus, it cannot be stopped for efficacy.
According to this PP value, based on the interim data, the study should continue
because the evidence is insufficient to draw a definitive conclusion in favor of stopping
early for either futility or efficacy.
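
The full calculation behind Table 1 can also be scripted from first principles in base R. The sketch below writes the beta-binomial probability mass function directly with the beta function rather than relying on an add-on package; with the Example 2 inputs it should reproduce the Bi column and the PP of 0.5256:

# Predictive probability of concluding efficacy at Nmax, from first principles
pred_prob <- function(x, n, nmax, a, b, p0, theta_t) {
  m <- nmax - n
  i <- 0:m
  # beta-binomial pmf for Y = i future successes, given the current data
  pr_y <- choose(m, i) * beta(a + x + i, b + n - x + m - i) / beta(a + x, b + n - x)
  # B_i = Pr(p > p0 | x, Y = i) from the updated beta posterior
  B_i <- 1 - pbeta(p0, a + x + i, b + n - x + m - i)
  sum(pr_y * (B_i > theta_t))
}
pred_prob(x = 12, n = 16, nmax = 30, a = 0.65, b = 0.35, p0 = 0.65, theta_t = 0.9)
# returns about 0.5256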
This predictive probability value can also be calculated using the R package
“ph2bayes” with the following R code:

predprob(12, n = 16, nmax = 30, alpha_e = 0.65, beta_e = 0.35, p_s = 0.65, theta_t = 0.9)

Oncology Example with the PP Approach

Consider again the oncology example presented earlier, in which the 9-month PFR was
assumed to be 12.7% (p0 = 0.127) for the standard therapy group, and sorafenib was
expected to increase the 9-month PFR to 31.7% (p1 = 0.317). The same example
can also be handled with the Bayesian PP approach. Let both type I and type II error
rates be 10%. In addition to the existing frequentist assumptions, assume the
data will start to be monitored after outcomes from the first 14 (Nmin = 14) patients are
observed and that Nmax = 43. Let the response rate follow a beta prior distribution with

parameters 0.127 and 0.873 (= 1 − 0.127). Note that this distribution has a prior mean of
0.127 (= 0.127/(0.127 + 0.873)) and is worth only one patient (0.127 + 0.873 = 1).
The corresponding rejection regions for this study would be 0/14, 1/24, 2/29, 3/
33, 4/37, 5/39, 6/41, 7/42, and 8/43. The trial stops for futility the first time the
number of favorable outcomes falls into the rejection region. In the actual study,
Ray-Coquard et al. observed 1 patient with PFR at 9 months out of the 26 enrolled
patients (Ray-Coquard et al. 2012). Consistent with the frequentist design, the Bayesian
design would also recommend stopping at this point. Indeed, with the PP approach,
the study would have been stopped after 14 patients if there had been no favorable
outcome, or after only 24, instead of 26, patients if there had been only 1 patient with
a favorable outcome.
The rejection regions can be calculated using the following R code from the “ph2bye”
package:

DT.design(type = "PredP", a = 0.127, b = 0.873, nmin = 14, nmax = 43, p0 = 0.127, p1 = 0.317, theta0 = 0.001, theta1 = 0.9, theta_t = 0.85)

In addition, following the approach of Lee and Liu (2008), the M.D. Ander-
son Cancer Center group developed software (https://biostatistics.mdanderson.org/
SoftwareDownload/) that implements further calculations to determine stopping
regions for futility and to search for Nmax, θL, θT, and θU. To use
the software, users specify the minimum (Nmin) and maximum (Nmax) sample
sizes, the response rates of the standard therapy (p0) and the new treatment (p1)
groups, the type I and type II error rates, and the parameters of the beta prior distribution
for success in the standard therapy group (Berry et al. 2011). It is recommended
to start monitoring the data after the first ten patients (Nmin = 10) have been treated
and their outcomes observed. The software can be run for different Nmax values,
searching the θL and θU space to generate designs satisfying both type I and type II
error rate criteria (Berry et al. 2011).

Frequentist Two-Stage Design Versus Bayesian PP Approach

In contrast to looking at the data only at the end of the first and second stages with
the frequentist two-stage design, the Bayesian PP approach allows users to assess
the data continuously after outcomes from at least ten patients have been observed. In
other words, for the extreme case presented in Fig. 2, the frequentist design
could not be stopped even if no favorable outcome were observed for the first 16
patients in the second stage of the study. The Bayesian design would allow users to
stop the study much earlier, in fact after the first six unfavorable outcomes in the
second stage of the study, for futility. The Bayesian PP approach allows more
frequent monitoring and is therefore more flexible than the frequentist two-stage
design. The Bayesian PP design enables users to stop at any time if the accumu-
lating evidence does not support the new treatment's efficacy over the standard
therapy.

Bayesian Phase I–II Trials

In phase I trials it is most common to focus only on the outcome of toxicity, and in
phase II trials only on the outcome of efficacy. However, it is also possible, and may be
more efficient, to consider a bivariate binary outcome of efficacy and toxicity in a
single phase I–II clinical trial. Readers can learn more about Bayesian phase I–II
trials in Yuan, Nguyen, and Thall (Yuan et al. 2016).

Summary and Conclusion

The main goal of phase II studies is to provide an assessment of the efficacy of the
treatment of interest. Frequentist two-stage designs are common and easy to imple-
ment; however, the data are assessed only at the end of the two stages. The Bayesian
approach allows synthesis of external information into the design and permits updating
the evidence based on the accumulated data, more frequent monitoring, and calculation
of predictive probabilities based on the current information (Biswas et al. 2009).
Accordingly, Bayesian designs are more flexible than frequentist two-stage
designs and allow a continual decision among stopping the study for futility or
efficacy and enrolling more patients. With the availability of computer programs
implementing the Bayesian predictive probability approach, Bayesian designs are now
easier to implement and are being used increasingly (Biswas et al. 2009).

Key Facts

• With the Bayesian design, it is possible to incorporate external information and
update evidence based on the accumulated data.
• Bayesian designs using the predictive probability approach are more flexible than
frequentist two-stage designs and allow continuous monitoring.

Cross-References

▶ Dose-Finding and Dose-Ranging Studies
▶ Randomized Selection Designs

References
Berry SM, Carlin BP, Lee JJ, Muller P (2011) Bayesian adaptive methods for clinical trials.
Chapman & Hall/CRC Biostatistics Series, vol 38. CRC Press, Boca Raton
Biswas S, Liu DD, Lee JJ, Berry DA (2009) Bayesian clinical trials at the University of Texas M. D.
Anderson Cancer Center. Clin Trials (London, England) 6:205–216. https://doi.org/10.1177/
1740774509104992
56 Middle Development Trials 1045

Chaloner K, Rhame FS (2001) Quantifying and documenting prior beliefs in clinical trials. Stat Med
20:581–600. https://doi.org/10.1002/sim.694
Chen DG, Peace KE (2011) Clinical trial data analysis using R. CRC, Boca Raton, pp 1–357
Gehan EA (1961) The determination of the number of patients required in a preliminary and a
follow-up trial of a new chemotherapeutic agent. J Chronic Dis 13:346–353
Geller NL (2004) Advances in clinical trial biostatistics. Biostatistics 13:1–52
Gelman A, Carlin JB, Stern HS, Rubin D (2004) Bayesian data analysis, 3rd edn. Chapman & Hall/CRC
texts in statistical science, Boca Raton, FL
FDA (2010) Guidance for the use of Bayesian statistics in medical device clinical trials. U.S.
Department of Health and Human Services, Food and Drug Administration, Center for Devices
and Radiological Health, Division of Biostatistics, Office of Surveillance and Biometrics.
http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071121.pdf
Jung S-H (2013) Randomized phase II cancer clinical trials. Chapman & Hall/CRC Biostatistics
Series. CRC Press/Taylor & Francis Group, Boca Raton
Lee JJ, Liu DD (2008) A predictive probability design for phase II cancer clinical trials. Clin Trials
(London, England) 5:93–106. https://doi.org/10.1177/1740774508089279
Lynch SM (2007) Introduction to applied Bayesian statistics and estimation for social scientists.
Statistics for social and behavioral sciences. Springer, New York
Martin AD, Quinn KM, Park JH (2011) MCMCpack: Markov Chain Monte Carlo in R. J Stat Softw 42
Ray-Coquard I et al (2012) Sorafenib for patients with advanced angiosarcoma: a phase II Trial
from the French Sarcoma Group (GSF/GETO). Oncologist 17:260–266. https://doi.org/10.
1634/theoncologist.2011-0237
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Stallard N (1998) Sample size determination for phase II clinical trials based on Bayesian decision
theory. Biometrics 54:279–294
Sturtz SL, Ligges U, Gelman A (2005) R2WinBUGS: a package for running WinBUGS from R. J
Stat Softw 12:1–16. https://doi.org/10.18637/jss.v012.i03
Yuan Y, Nguyen HQ, Thall PF (2016) Bayesian designs for phase I-II clinical trials. Chapman &
Hall/CRC Biostatistics Series. CRC Press, Taylor & Francis Group, Boca Raton
Zohar S, Teramukai S, Zhou Y (2008) Bayesian design and conduct of phase II single-arm clinical
trials with binary outcomes: a tutorial. Contemp Clin Trials 29:608–616. https://doi.org/10.
1016/j.cct.2007.11.005
57 Randomized Selection Designs

Shing M. Lee, Bruce Levin, and Cheng-Shiun Leu

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1048
Considerations for Designing Randomized Selection Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
Approaches to the Subset Selection Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
Acceptable Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051
Fixed Versus Random Subset Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1052
Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053
The Simon et al. (1985) Fixed Sample Size Procedure (SWE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053
Prescreening Using Simon’s Two-Stage Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
The Steinberg and Venzon (2002) Design (SV) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
The Levin-Robbins-Leu Family of Sequential Subset Selection Procedures . . . . . . . . . . . . . . 1055
Other Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
Applications of Randomized Selection Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1060
Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065

Abstract
The general goal of a randomized selection design is to select one or more
treatments from several competing candidates to which patients are randomly
assigned, in such a way that selected treatment(s) are likely to be better than those
not selected. For example, if one treatment is clearly superior to all the others, we
may demand that the procedure select that treatment with high probability. The
experimental treatments could be different doses of a drug or intensities of a
behavioral intervention, different treatment schedules, modalities, or strategies, or
different combinations of treatments. The hallmark feature of a selection design is
its ability to achieve its stated goals with surprisingly few participants compared
with traditional “phase III” trials, precisely because it eschews the formal hypoth-
esis test paradigm with its tight control over type 1 error rates. These designs can
be used in clinical research to screen for treatments that are worthy of further
evaluation in a subsequent confirmatory clinical trial and to discard unpromising
treatments. Thus, they are ideal for middle development settings where we are
interested in selecting promising treatments under circumstances typically limited
by smaller sample sizes. In this chapter, we discuss the randomized selection
designs of Simon, Wittes, and Ellenberg, Steinberg and Venzon, and the Levin-
Robbins-Leu family of sequential subset selection procedures. The first two
designs select a single treatment, while the latter allows for sequential elimination
of inferior treatments, sequential recruitment of superior treatments, and may be
used to select treatments with fixed or variable subset sizes.

Keywords
Selection paradigm · Correct selection · Subset selection · Acceptable set · Phase
2 designs · Selection trials

Introduction

The goal of a randomized selection design is to select a truly best treatment (given a
suitable definition of “best”), or more generally to select a subset of b ≥ 1 treatments,
ideally containing the b truly best treatments, using methods that have certain
desirable operating characteristics. For example, assuming that one treatment is
truly better than all the rest to a prespecified degree, we may wish to select that
best treatment with a prespecified high probability of correct selection. Furthermore,
we may wish to select a treatment or a subset of treatments that are reasonably
“acceptable” (suitably defined) with prespecified high probability under any and all
circumstances, irrespective of the true efficacy differences among treatments. When
selecting treatments, the size of the subset b is typically fixed in advance – most often
being b ¼ 1 – but we may also wish to select varying numbers of “best” treatments in
a data-dependent manner. Alternatively, we may wish to select a subset of treatments
of varying size that achieves a prespecified high probability of containing the one
truly best treatment. We may even wish to produce a reliable ranking of the c
candidate treatments. These treatments can be different treatment schedules, doses,
or strategies, and may include the standard of care. The statistical theory of ranking
and selection, with its strong ties to multiple decision theory, provides an overarch-
ing framework for all such selection goals.
It may be helpful here to introduce a taxonomy of randomized selection designs.
We shall say that such a trial is a pure selection trial if its primary purpose is to select
(or identify or rank) treatments with goals as stated above with no further intention of
making statements of statistical significance. Indeed, there is no concern about the

type 1 error rate because one simply cannot commit a type 1 error when one doesn’t
declare efficacy differences statistically significant at the 0.05 level! To the contrary,
in pure selection designs we are singularly uninterested in the null hypothesis of no
difference between treatments. What we care about is making correct selections (or,
more generally, acceptable selections – see below) with high probability when there
are clinically meaningful and worthwhile differences even if we cannot “prove” that
they are so. But if there are no meaningful or worthwhile efficacy differences
among any of the candidates, then we are generally indifferent to which treatment
(or treatments) is (or are) selected, other things being equal such as side effects,
tolerability, or costs. This is the so-called indifference zone approach of
Robert Bechhofer (1954). We will have more to say about this approach in section
“Considerations for Designing Randomized Selection Trials.”
At this point the reader might be wondering how could one dare to conduct a
clinical experiment without some sort of tight control over type 1 error? It seems an
unfortunate tendency in some quarters reflexively to identify “clinical research” as
synonymous with “testing a null hypothesis at the 0.05 level of significance.” There
are good reasons regulatory agencies or journal editors insist on such a definition.
However, not every important question can or should be addressed this way. For
example, when kids in the schoolyard are choosing up sides for a baseball game, the
team captains are surely not interested in testing the null hypothesis that all the kids
have the same talent, controlling the type 1 error rate. To the contrary, the captains
want to select the kids best able to help their teams based on observations of their
performance. Similarly for choosing a winning horse at the race track or a profitable
portfolio for investment. As a further example, during the Ebola outbreak of 2014 in
West Africa, some argued (cogently, we believe) that the regulatory dictum – “we
must use a control group to test the hypothesis of no efficacy while controlling the
type 1 error rate, end of debate” – was misguided in the sense that the regulatory
mandate was perhaps not the most pressing or important question to settle at that
very moment. Rather, a pure randomized selection design of active treatments could
have been implemented rapidly to select the best candidate treatment option or
options, followed by careful rollout and watchful observation to see if patients
stopped dying. Best available supportive care, which in West Africa was no better
than no treatment at all given the resource poor environment, would not have been
required. It would seem the selection goal was urgent and important enough to set
aside the agnostic need to control the risk of a type 1 error. Even if the optimal
standard of care were available in West Africa and had some efficacy, more rapid
results might have been reached by including it in a pure randomized selection
design than to insist on its use to demonstrate another treatment’s significant
improvement over it. See Caplan et al. (2015a, b) for further discussion.
We continue the taxonomy of randomized selection trials in the following
sections. In section “Considerations for Designing Randomized Selection Trials”
some important considerations for the design of selection trials are introduced. In
section “Designs,” we discuss some specific pure selection designs that have been
proposed in the literature. In section “Other Designs,” we briefly mention some other
designs such as when selection procedures are used as a preliminary step for a

randomized controlled trial or when they are used to formally test whether better-
than-placebo treatments exist and if so, to select them. For simplicity in this chapter
we shall focus exclusively on the case of binary outcomes as the clinical endpoint of
interest, such as tumor response, and we shall use “response” or “success” synon-
ymously. A selection design for time to event outcomes is briefly mentioned in
section “Other Designs.” In section “Applications of Randomized Selection
Designs,” we present two examples of actual selection trials to illustrate some
practical implementation considerations, one using the prescreening selection
approach and the other using an adaptive Levin-Robbins-Leu procedure, both
discussed in section “Designs.”
We conclude with a brief discussion in section “Discussion and Conclusion.”

Considerations for Designing Randomized Selection Trials

Approaches to the Subset Selection Problem

A key operating characteristic of any randomized selection trial is its probability of
correct selection or PCS, which is the probability of selecting truly best treatments.
Most often one specifies a fixed target number b ≥ 1 of best treatments in advance
and then uses a procedure that selects b treatments. This is called a fixed subset size
procedure. The PCS is then the probability that the selected subset is in fact the b
truly best treatments. Fixed subset size procedures use the indifference zone
approach of Bechhofer (1954) which requires that whatever procedure is used, it
shall achieve a minimum pre-specified PCS, say P*, if and when the best treatments’
true success probabilities are sufficiently better than the remaining c–b treatments’
success probabilities. We say in such cases that the c success probabilities (or
parameters) fall in the preference zone. For the case of binary outcomes considered
in this chapter, by “sufficiently better” we shall mean that the odds on success for the
bth best treatment exceeds the odds on success of the next best treatment by a
prespecified multiplicative factor, namely, the design odds ratio of θ > 1. The choice
of θ will depend on the context, but it generally signifies a clinically meaningful and
worthwhile degree of relative treatment efficacy. The complement of the preference
zone in the parameter space is the indifference zone, so-called because presumably
one will be indifferent to the fact that the PCS may not exceed P* when the
parameters are not sufficiently distinct. Clearly if there are no differences between
response probabilities, we are indifferent to which treatments are selected (other
factors such as side effects, costs, etc., being equal). At the opposite extreme, though,
one should stop being indifferent to the shortfall of the PCS below P* as the
separation between the bth best and next-best treatment approaches θ. This often
helps to guide the choice of the design odds ratio.
The main drawback of the indifference-zone approach is that in practice one
doesn’t know whether the population parameters fall in the preference or the
indifference zone. This was the principal motivation for Shanti S. Gupta’s subset

selection approach (Gupta 1956, 1965), which selects a subset of treatments using a
fixed, prespecified sample size, the goal of which is to capture the one best treatment
in the subset with a prespecified high probability, no matter what the true response
probabilities are. To guarantee that, the size of the subset necessarily has to vary
randomly according to the observed data. For example, if there are only small
differences between success probabilities, all c treatments might have to be
“selected.” Clearly, the only role played by selection of subsets of size one or greater
in the Gupta approach is to assure capture of the best treatment among those selected
with high probability. More generally, to assure capture of b best populations with
high probability, subsets of size b or greater would have to be selected, with size
varying according to the observed data.
We shall not pursue the Gupta subset selection approach any further, referring the
reader instead to the book by Bechhofer et al. (1995), because in the sequel we shall
mean something very different by the term “subset selection.” Henceforth, by subset
selection we shall mean any procedure whose goal is explicitly to select subsets of b
best treatments, when b is fixed in advance. In section “Random Subset Size
Selection with LRL Procedures” we also briefly consider subset selection procedures
that identify subsets of random size, but still the goal will be to select best subsets of
treatments, albeit of varying size. Such subset selection procedures are called
random subset size procedures.

Acceptable Subset Selection

The most familiar goal for a pure randomized selection trial in the indifference zone
approach is to correctly identify the best treatment with high probability if a minimal
clinically meaningful and worthwhile difference exists between the best and the
second-best treatment. However, if the success probabilities for the several best
treatments are close to one another, assuring a high probability of correct selection is
neither meaningful nor possible. When several treatments are close to best in
efficacy, we may be indifferent as to which is selected, but we should still want a
high probability of selecting one of those near-best treatments if not technically the
best. This leads to the general notion of acceptable subset selection which offers
another resolution to the dilemma posed by ignorance of the true parameter values
and which, unlike Gupta’s approach, stays within the indifference zone approach.
We specify these ideas in some more detail next.
Precisely because we generally don’t know whether the treatment response
probabilities fall in the preference or indifference zone, we will want to know
whether or not a procedure will select, if not best subsets, then “acceptable” subsets,
with a prespecified high probability irrespective of the true population parameters.
We shall refer to such a property as acceptable subset selection, where the phrase,
“with pre-specified probability irrespective of the true population parameters”
should be understood. The following notions were introduced by Bechhofer et al.
(1968, p. 337, hereinafter BKS) and elaborated upon by Leu and Levin (2008b).

For any given success odds vector w = (w1, . . ., wc), where wj = pj/(1 − pj), and
design odds ratio θ ≥ 1, we define certain integers s and t given by functions of w and
θ, say s = s(w,θ) and t = t(w,θ), such that

ws+1 < wb+1·θ ≤ ws

and

wt+1 ≤ wb/θ < wt,

as illustrated in the diagram of Fig. 1 (a number line ordering the success odds around wb and wb+1).


These inequalities define an open interval of odds containing wb and wb+1 such
that w1, . . ., ws are at least as great as the upper endpoint of the interval, which is given
by θ times wb+1, and such that wt+1, . . ., wc are no greater than the lower endpoint of
the interval, which is given by 1/θ times wb. BKS reasoned that with the above
configuration of true success odds, in order to be deemed “acceptable” any selection
of b treatments ought to contain the s best treatments and ought to exclude the c − t
worst treatments, while it would be acceptable to select the remaining b − s treat-
ments in any manner from among the (s + 1)st to tth best. In general, then, a θ-
acceptable subset is any subset of treatments in which the s = s(w,θ) best treat-
ments are selected and all selected treatments are among the t = t(w,θ) best treat-
ments. In other words, a θ-acceptable subset must include all treatments better than
the upper endpoint of the defined interval around wb and wb+1 and exclude all
treatments worse than its lower endpoint. It is easy to see that if w is in the preference
zone, then s = t = b and the only θ-acceptable subset is the correct subset,
(b) = (1, . . ., b). At the other extreme, if the treatments all have equal success odds,
s = 0 and t = c, meaning any b-tuple would be θ-acceptable, in which case we are
completely indifferent as to which treatments are selected (other things like cost,
side effects, etc., being equal). The neighborhood around wb and wb+1 creates
different regions of the parameter space with a gradation from maximal preference of
which treatments to select to maximal indifference.
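
Under these definitions, s counts the treatments whose success odds are at least θ times wb+1, and t counts those whose odds exceed wb/θ. A minimal R sketch, assuming the odds vector w is sorted in decreasing order (the function name and the example values are illustrative only):

# s(w, theta) and t(w, theta) for sorted success odds w, subset size b
acceptable_bounds <- function(w, b, theta) {
  s <- sum(w >= theta * w[b + 1])  # treatments any acceptable subset must include
  t <- sum(w > w[b] / theta)       # treatments eligible for an acceptable subset
  c(s = s, t = t)
}
p <- c(0.5, 0.45, 0.3, 0.2, 0.1)            # response rates for c = 5 treatments
acceptable_bounds(p / (1 - p), b = 1, theta = 1.5)
# gives s = 0, t = 2: selecting either of the two best treatments is acceptable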

Fixed Versus Random Subset Sizes

When planning a pure randomized selection trial, it is of course important to decide on
how many treatments to select, and if a fixed number of treatments, b, is decided upon,
there should be a cogent reason for doing so. For example, there may be several good

candidate treatments but the research budget allows for selecting two but not more. Or,
we want to screen for three promising candidates in a search process, neither more nor
less. Even though most frequently selection procedures aim to identify a single best
candidate, that design specification should still be justified.
Often, however, it will not be clear to investigators how best to specify a desired
subset size prior to the experiment. For practical reasons we may wish to constrain
the size of the selected subset prior to the experiment. For example, budgetary
constraints may force us to select at most a given number of treatments, but we
may be content to select fewer than that number. Or, evidence from pilot work may
suggest that at least another given number of promising treatments may exist among
the c candidates, in which case we may wish to identify at least that number of good
treatments. Such practical needs of clinical research call for random subset size
selection procedures. Random subset size selection procedures should have the
acceptable subset selection property.

Designs

The Simon et al. (1985) (hereinafter “SWE”) design is a classical fixed sample size
selection procedure to select a best treatment, such as discussed in Gibbons et al.
(1977). We also discuss an adaptation of the SWE design following a collection of
parallel, single-arm, Simon two-stage designs used for prescreening the candidate
treatments. The prescreening allows incorporating historical control information to
help assure that the selected treatments are better than the current standard of care
(Liu et al. 2006). Another extension of the SWE design by Steinberg and Venzon
(2002) allows for interim stopping and early termination by requiring a prespecified
difference between treatment arms in order to select the most promising treatment at
interim looks. The aforementioned designs select one treatment given a fixed sample
size. Finally, we discuss the Levin-Robbins-Leu (LRL) family of sequential
subset selection procedures, which aim to select subsets of one or more treatments.
Extensions allow the size of the selected subset to vary, possibly with prespecified
constraints on the (random) size of the selected subsets for practical purposes. Both
the fixed- and random-subset size LRL procedures offer acceptable subset selection
with high probability, irrespective of true treatment efficacies, while allowing
sequentially adaptive elimination and recruitment of treatments.

The Simon et al. (1985) Fixed Sample Size Procedure (SWE)

Suppose we are interested in identifying the one best treatment (b ¼ 1), i.e., the
treatment with highest success probability, from the c candidates. The SWE proce-
dure draws a fixed sample of size n for each of the c treatment arms, then selects the
treatment with the largest observed number of responses, or equivalently largest
proportion of responses. If there are ties among the largest, select one randomly as
the best (or, more realistically, select one on practical grounds such as side effects,
cost, etc.). Simon et al. (1985) provide formulas to calculate the probability of
correct selection (PCS) under the least favorable configuration, that is, the proba-
bility of correctly identifying the best treatment if a clinically meaningful difference
of Δ exists between the response probability of the best treatment (say p + Δ) and the
response probabilities of the remaining c–1 treatments (each assumed equal to p).
This configuration is called “least favorable” because any set of response probabil-
ities wherein the best differs from the others by Δ or more will have a PCS no smaller
than that for the least favorable configuration. For given c, p, and Δ, the sample size n
per arm may then be determined using the formulas by trial and error to achieve the
prespecified level of PCS. Table 3 of the SWE paper provides explicit sample sizes
per arm in the special case of an absolute difference in response rates of Δ = 0.15 for
p ranging from 0.2 to 0.8 in increments of 0.1 with c = 2, 3, or 4 treatments that
achieve a 90% PCS under the least favorable configuration. Alternatively, tables
provided in Gibbons et al. (1977) may be used.
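For readers who wish to experiment with the design, the following sketch computes the PCS of the SWE rule exactly under the least favorable configuration, with random tie-breaking as described above, and finds the per-arm sample size by trial and error. The function names are ours, and the code is a direct numerical evaluation rather than a transcription of the SWE formulas.

```python
# Exact PCS for the SWE fixed sample size procedure under the least favorable
# configuration, with uniformly random tie-breaking among tied leaders.
from math import comb
from scipy.stats import binom

def swe_pcs(n, c, p, delta):
    """PCS when the best arm has rate p + delta and the c - 1 others rate p."""
    best, inf = binom(n, p + delta), binom(n, p)
    pcs = 0.0
    for k in range(n + 1):
        pk = best.pmf(k)                 # best arm scores k responses
        below = inf.cdf(k - 1)           # an inferior arm scores fewer than k
        tie = inf.pmf(k)                 # an inferior arm ties at k
        # t = number of inferior arms tied with the best; a random tie-break
        # selects the best arm with probability 1 / (t + 1).
        for t in range(c):
            pcs += (pk * comb(c - 1, t) * tie**t
                    * below**(c - 1 - t) / (t + 1))
    return pcs

def swe_sample_size(c, p, delta, target_pcs=0.90):
    """Smallest per-arm n achieving the target PCS, by trial and error."""
    n = 1
    while swe_pcs(n, c, p, delta) < target_pcs:
        n += 1
    return n

# Example: c = 3 arms, p = 0.5, delta = 0.15, 90% PCS (cf. SWE Table 3).
print(swe_sample_size(c=3, p=0.5, delta=0.15))
```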

Prescreening Using Simon’s Two-Stage Design

Liu et al. (2006) describe an adaptation of the SWE procedure following pre-
screening of several candidate treatments using Simon’s two-stage design (Simon
1989). Patients are randomized to the c candidate treatments of interest and parallel
single-arm phase II trials are conducted using Simon’s two-stage design to screen out
nonpromising treatments. We then apply the SWE selection rule among the c0
treatments that passed the prescreening step. Note that the sample size for this
adaptation is based on the sample size calculation for the original Simon two-stage
trials and may not be adequate to guarantee a high probability of correct selection. Of
course, the number of treatments that will pass the prescreening step is not known in
advance and this complicates the calculation of the overall probability of correct
selection. One feature is clear though, which is that the overall PCS for the Liu et al.
proposal must be smaller than the probability of declaring a treatment promising in
the Simon two-stage design under the design alternative. This is because not only
must the best treatment pass the Simon prescreening, but in case other less effica-
cious treatments do too, the best must have more responses than the competitors. We
illustrate this feature in the example given in Applications of Randomized Selection
Designs.

The Steinberg and Venzon (2002) Design (SV)

The design by Steinberg and Venzon (2002) is an extension of the SWE approach. It
differs by allowing an interim look at the data to perform an early selection and
terminate the trial. The maximum sample size for the design is calculated using the
same method as the SWE procedure. In addition, let d be a prespecified integer such
that if the difference in the number of successes between the apparently best and
second-best treatment arm is at least d at the interim look, the procedure stops,
whereupon the leading treatment is selected as the best. Otherwise, the trial con-
tinues to complete its planned accrual. If there is no early stopping, the SWE
selection rule is applied. Steinberg and Venzon describe how to determine d to
limit the probability of making an incorrect selection in a two-arm trial. They also
provide a table with values of d that limit the probability of making an incorrect
selection to 0.5%, 1%, 5%, or 10%, assuming a difference of 0.10 or 0.15 in the
probability of response between the two treatment arms. These authors also propose
their interim stopping rule for use in the context of parallel Simon two-stage screening
trials, in effect allowing for early stopping in the design discussed in section “Pre-
screening Using Simon’s Two-Stage Design.” Early stopping takes place if the best
response tally exceeds the Steinberg and Venzon criterion d among the subset of
treatments that have passed the first stage of screening. If the criterion is not met,
enrollment continues to complete the second stage with the passing subset and a final
selection takes place as in section “Prescreening Using Simon’s Two-Stage Design.”
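The operating characteristics of the SV rule are easy to explore by simulation. The sketch below assumes, purely for illustration, a single interim look after half of the per-arm accrual; the values of n and d in the example are hypothetical. It estimates the probability of early stopping and the probability of an incorrect selection in a two-arm trial.

```python
# Monte Carlo sketch of the Steinberg-Venzon early-selection rule, assuming
# (our assumption, for illustration) one interim look at half accrual.
import numpy as np

rng = np.random.default_rng(2023)

def sv_simulate(n_per_arm, d, p_best, p_other, reps=100_000):
    """Estimate P(early stop) and P(incorrect selection)."""
    n1 = n_per_arm // 2                      # interim look at half accrual
    early, incorrect = 0, 0
    for _ in range(reps):
        x1 = rng.binomial(n1, p_best)        # interim tallies
        y1 = rng.binomial(n1, p_other)
        if abs(x1 - y1) >= d:                # early selection of the leader
            early += 1
            if y1 > x1:
                incorrect += 1
        else:                                # continue to full accrual, SWE rule
            x = x1 + rng.binomial(n_per_arm - n1, p_best)
            y = y1 + rng.binomial(n_per_arm - n1, p_other)
            if y > x or (y == x and rng.random() < 0.5):
                incorrect += 1
    return early / reps, incorrect / reps

# Hypothetical example values of n and d:
p_early, p_wrong = sv_simulate(n_per_arm=53, d=6, p_best=0.5, p_other=0.35)
print(f"P(early stop) = {p_early:.3f}, P(incorrect) = {p_wrong:.3f}")
```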

The Levin-Robbins-Leu Family of Sequential Subset Selection Procedures

The Levin-Robbins-Leu (LRL) family of sequential subset selection procedures are
experimental designs that aim to identify the best treatment, or more generally, the
best subset of b ≥ 1 treatments among c > b candidates. The procedures allow for
two adaptive features, namely, sequential elimination of apparently inferior treat-
ments as soon as the current weight of evidence indicates that such treatments are not
among the b best treatments; and/or sequential recruitment of apparently superior
treatments as soon as the current weight of evidence indicates that such treatments
are among the b best treatments. Four members comprise the LRL family of subset
selection procedures: the nonadaptive member with neither the elimination nor
recruitment feature; elimination but no recruitment; recruitment but no elimination;
and both elimination and recruitment features. Here we shall present only the
nonadaptive LRL procedure, which we shall denote by procedure N, and the adaptive
procedure featuring both elimination and recruitment, which we shall denote by
procedure E/R. The LRL family of procedures is discussed fully in Leu and Levin
(2008a, b, 2017) and Levin and Leu (2013, 2016).

LRL Procedure N
The sampling proceeds vector-at-a-time, meaning that patients are assigned randomly
to each of the c treatments in blocks of size c. If desired, the blocking on c
patients can also incorporate matching on prognostic factors. Let $r_j^{(n)}$ denote the
number of responses in treatment j after n rounds of c-tuplets have been randomized,
for j = 1, ..., c. We call the $r_j^{(n)}$ "response tallies." In addition, let
$r_{[1]}^{(n)} \geq r_{[2]}^{(n)} \geq \cdots \geq r_{[c]}^{(n)}$ denote the ordered response tallies, where the subscripts refer to ordering
the number of responses from the greatest ( j = 1) to the least ( j = c). Now let d be a
positive integer chosen in advance. Procedure N stops the first time that the
difference between the bth largest and (b + 1)st largest response tally equals d, i.e., after

$$N_N = N_N(b, c, d) = \inf\left\{ n \geq 1 : r_{[b]}^{(n)} - r_{[b+1]}^{(n)} = d \right\}$$

rounds of c-tuples of patients have had their endpoints observed. At stopping time $N_N$, select the unique
set of treatments with the b largest response tallies as the b best.
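A minimal simulation of procedure N may make the stopping rule concrete. The sketch below assumes binary outcomes observed round by round and uses our own function names; it returns the selected subset and the number of rounds $N_N$.

```python
# Sketch of LRL procedure N: vector-at-a-time sampling, stopping when the gap
# between the b-th and (b+1)-st largest tallies reaches d.
import numpy as np

rng = np.random.default_rng(1)

def lrl_procedure_N(p, b, d, rng=rng):
    """p: true response probabilities of the c arms.
    Returns (set of selected arms, number of rounds N_N)."""
    c = len(p)
    tallies = np.zeros(c, dtype=int)
    n = 0
    while True:
        n += 1
        tallies += rng.random(c) < np.asarray(p)   # one patient per arm
        ordered = np.sort(tallies)[::-1]           # r[1] >= r[2] >= ... >= r[c]
        if ordered[b - 1] - ordered[b] == d:       # gap between b-th, (b+1)-st
            best = np.argsort(-tallies)[:b]        # unique set by construction
            return set(best.tolist()), n

# Example: select the b = 1 best of c = 3 arms with d = 4.
selected, rounds = lrl_procedure_N(p=[0.79, 0.65, 0.40], b=1, d=4)
print(selected, rounds)
```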

LRL Procedure E/R

As above, choose positive integer d in advance. Begin randomizing patients
vector-at-a-time, but now eliminate the apparently inferior arm or arms as soon
as their response tallies fall d responses behind the arm or arms with the currently held
bth largest tally. Similarly, recruit the apparently superior arm or arms as soon as
their response tallies pull d responses ahead of the arm or arms with the currently held
(b + 1)st largest tally. By "eliminate" we mean that an arm is withdrawn from the
competition with no further patients allocated to it and is classified as outside the
set of b best treatments. By "recruit" we mean that an arm is withdrawn from the
competition with no further patients allocated to it and is selected to be among the
set of b best treatments. Note that there is no claim that the first-recruited arm is
equal to the best arm, merely that it qualifies as among the b best. Similarly, there is
no claim that the first-eliminated arm is equal to the worst arm, merely that it does
not qualify as among the b best.
To specify procedure E/R a bit more precisely, let $N_{E/R}$ denote the time of first
elimination and/or recruitment in a c-treatment experiment with treatment labels in
the set C = {1, ..., c}. In what follows the subset C will change composition as
treatments are eliminated or recruited, resulting in a sequence of subsets
$C \supset C' \supset C'' \supset \cdots$ of treatment labels still in competition, and we shall let the
sequence $c > c' > c'' > \cdots$ denote the sizes of the corresponding subsets. Then $N_{E/R}$
is defined as

$$N_{E/R} = N_{E/R}(b, C, d) = \inf\left\{ n \geq 1 : r_{[b]}^{(n)} - r_{[c]}^{(n)} = d \right\} \wedge \inf\left\{ n \geq 1 : r_{[1]}^{(n)} - r_{[b+1]}^{(n)} = d \right\}.$$

At round $n = N_{E/R}$, we eliminate all treatments i with $r_i^{(n)} = r_{[c]}^{(n)}$ if $r_{[b]}^{(n)} - r_{[c]}^{(n)} = d$
and/or we recruit all treatments i with $r_i^{(n)} = r_{[1]}^{(n)}$ if $r_{[1]}^{(n)} - r_{[b+1]}^{(n)} = d$. If fewer than b
arms are recruited and/or fewer than c − b arms are eliminated, the procedure
continues, starting from the current tallies of the remaining subset of treatments,
say $C' \subset C$, and iterates with time of next elimination and/or recruitment
$N_{E/R}(b', C', d)$, wherein $c' = |C'|$ replaces c and $b'$ is b minus the total number
of arms recruited by time $N_{E/R}(b, C, d)$. Continuing in this way, we stop whenever
there is a recruitment of however many treatments are required to fill out the subset
of size b and, simultaneously, an elimination of the other currently remaining
treatments, at which point a total of b arms will have been recruited and c − b
arms eliminated. Upon stopping we declare the subset of recruited treatments as the
b best. The procedure always identifies a well-defined subset of b treatments with
no ties.

A Lower Bound Formula for the PCS of LRL Procedures


Let $w = (w_1, \ldots, w_c)$ denote the vector of true odds on response with $w_j = p_j/(1 - p_j)$ for
j = 1, ..., c. Because the procedures are symmetric with respect to permutations of the
labels of the treatments, we may assume without loss of generality that
$w_1 \geq w_2 \geq \cdots \geq w_c$, even though no such assumption is required in practice. Leu
and Levin (2008b) proved that the following lower-bound formula holds for the PCS
when using procedure N. For any true success odds vector w, the PCS, which we
denote as $P_w[\mathrm{cs}]$ to reflect its dependence on the parameters, is bounded from below by

$$P_w[\mathrm{cs}] \geq \frac{(w_1 \cdots w_b)^d}{\sum_{(b)} w_{(b)}^d}.$$

The sum in the denominator is over all possible b-tuples, which we denote generically
as $(b) = (i_1, \ldots, i_b)$ with integers $1 \leq i_1 < \cdots < i_b \leq c$, and where the summands
use the convenient notation $w_{(b)}^d = w_{i_1}^d \cdots w_{i_b}^d$. The above lower bound allows us to
choose d in designing selection experiments as follows. It is easy to see that the right-
hand side of the above inequality is minimized for any w in the preference zone,
$\mathrm{Pref}(b, c, \theta) = \{w : w_b/w_{b+1} \geq \theta\}$, at $w = w^* \equiv w \cdot (\theta, \ldots, \theta, 1, \ldots, 1)$ for any positive
constant w. Then

$$P_{w^*}\{\text{correct selection}\} \geq \theta^{db} \Big/ \sum_{i=0}^{b \wedge (c-b)} \binom{b}{i} \binom{c-b}{i} \theta^{d(b-i)}.$$

It follows that we can choose a value of d depending only on b, c, θ, and P*, say
d = d(b, c, θ, P*), such that the right-hand side of the inequality is at least P* for any
given P* satisfying $\binom{c}{b}^{-1} < P^* < 1$. We exclude the trivial case $P^* = \binom{c}{b}^{-1}$
since no formal procedure is needed to achieve that lax goal.
Levin and Leu (2013) demonstrate that the lower bound formula holds in fact for
each adaptive member of the LRL family of sequential subset selection procedures
for any number of treatments $c \leq 7$, and numerical evidence strongly supports the
conjecture that the inequality holds in complete generality, as it does for the non-
adaptive procedure N.
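The displayed inequality translates directly into a design tool: evaluate the lower bound at the least favorable configuration and increase d until the bound reaches P*. The following is a sketch of that calculation (our code, not the authors' software).

```python
# Choosing the LRL reference criterion d from the lower bound at the least
# favorable configuration w* = (theta, ..., theta, 1, ..., 1).
from math import comb

def lfc_lower_bound(b, c, theta, d):
    """Lower bound on PCS at the least favorable configuration."""
    denom = sum(comb(b, i) * comb(c - b, i) * theta**(d * (b - i))
                for i in range(min(b, c - b) + 1))
    return theta**(d * b) / denom

def choose_d(b, c, theta, p_star):
    """Smallest d whose lower bound meets the required probability P*."""
    assert 1 / comb(c, b) < p_star < 1
    d = 1
    while lfc_lower_bound(b, c, theta, d) < p_star:
        d += 1
    return d

# Example: b = 1 of c = 3 arms, design odds ratio theta = 2, P* = 0.80.
print(choose_d(b=1, c=3, theta=2.0, p_star=0.80))
```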

Acceptable Subset Selection for LRL Procedures


An analogous lower bound formula holds for the probability of selecting θ-acceptable
subsets with LRL procedure N. One simply replaces the term $(w_1 \cdots w_b)^d$ in the
numerator of the lower bound for PCS with a sum of analogous terms corresponding
to each type of θ-acceptable subset. Numerical evidence further suggests that this
more general lower bound formula holds for each adaptive member of the LRL
family. Using the lower bound formula, Leu and Levin (2008b) proved a key result,
namely, that for any $1 \leq b < c$, any given design odds ratio θ > 1, and any given
probability P* with $\binom{c}{b}^{-1} < P^* < 1$, the LRL procedure N selects θ-acceptable
subsets with probability at least P* for any and all true success probabilities. Thus
we can say that the LRL family of subset selection procedures has the acceptable
subset selection property mentioned in section “Acceptable Subset Selection.” See
Leu and Levin (2008b) and Levin and Leu (2016) for further details.
Other operating characteristics for LRL subset selection procedures, such as the
expected number of rounds (number of vectors) $E_w[N_N(b, c, d)]$, the expected total
number of patient outcomes (sample size), the expected number of failures, etc., are typically
obtained via simulation.

Random Subset Size Selection with LRL Procedures


An extension of the LRL family of subset selection procedures provides for random
subset size selection. Briefly, one simultaneously monitors the accumulating
response tallies with each of the fixed subset size reference criteria d, say $d_b$, in
LRL procedure N for b = 1, ..., c − 1. (The various $d_b$ criteria need not be equal.) The
first time any stopping criterion is met – that is, the first time $r_{[b]}^{(n)} - r_{[b+1]}^{(n)} \geq d_b$ for
some b – we select the subset of b treatments with the largest response tallies. If two
or more criteria are simultaneously met, we select the subset with the smallest size.
Because the first separation between tallies to meet its corresponding criterion is not
predetermined, the resulting selected subset has a random size. Constraints may be
imposed on the size of the final selected subset as follows. At the time the first db
criterion is met, if the size of the subset that would be selected exceeds an upper
constraint, sampling continues only among treatments in that subset, i.e., the other
treatments are eliminated, and monitoring continues for the next separation in
response tallies among that subset, and so on, until the final selected subset size
meets the upper constraint. On the other hand, if the size of subset at the first criterion
separation is smaller than a lower constraint, the subset is “recruited” and sampling
resumes with the remaining treatments monitoring for additional best treatments, and
so on, until all constraints are met. Thus subsets of random size can still be selected
with the adaptive features of elimination and recruitment. Leu and Levin (2017)
provide further details and they show that the LRL family of procedures extended to
random subset sizes continues to provide for acceptable subset selection.
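As a rough sketch of the unconstrained version of this extension, the code below monitors all c − 1 gap criteria simultaneously and selects the smallest qualifying subset at the first separation; the upper and lower constraints described above, and the elimination/recruitment adaptivity, are omitted for brevity.

```python
# Sketch of random subset size selection (unconstrained): run the procedure-N
# criteria d_1, ..., d_{c-1} in parallel and stop at the first gap to reach
# its criterion, preferring the smallest qualifying subset.
import numpy as np

rng = np.random.default_rng(7)

def lrl_random_subset(p, d, rng=rng):
    """p: true response rates; d: criteria d_b for b = 1, ..., c-1.
    Returns (selected subset, subset size b, rounds)."""
    c = len(p)
    tallies = np.zeros(c, dtype=int)
    n = 0
    while True:
        n += 1
        tallies += rng.random(c) < np.asarray(p)
        ordered = np.sort(tallies)[::-1]
        for b in range(1, c):                       # smallest size first
            if ordered[b - 1] - ordered[b] >= d[b - 1]:
                subset = np.argsort(-tallies)[:b]
                return set(subset.tolist()), b, n

# Example: c = 4 arms, a common criterion d_b = 4 for every subset size.
print(lrl_random_subset(p=[0.7, 0.65, 0.5, 0.4], d=[4, 4, 4]))
```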

Other Designs

Oftentimes, randomized selection designs appear as a preliminary stage in larger
studies, such as in adaptive phase III RCTs with a preliminary selection stage. A
distinguishing feature of such studies is the simultaneous use of the data from the
preliminary selection stage together with the subsequent efficacy data from the larger
trial (as opposed to using a pure selection design first, followed by a completely
independent subsequent phase III trial, such as contemplated above). We refer to
such applications as a preliminary selection design in the sense that the primary
purpose of such trials is that of the confirmatory phase III study, i.e., to test a null
hypothesis of no efficacy (even with the preferred treatment in the selection stage)
with traditional control of the type 1 error rate, whereas the early-stage selection
feature is of secondary interest, used for the sake of seamless efficiency leading into
the larger study. Because such preliminary selection typically introduces some
degree of selection bias in the final evaluation of all the data, special statistical
adjustments must be made to account for that bias. For example, if the null hypoth-
esis were true, one would be capitalizing purely on chance by selecting the treatment
with the apparently best performance, thereby introducing a selection bias when
comparing that treatment to a placebo control. We shall not discuss preliminary
selection designs further here except to note some examples of such methods: see
Stallard and Todd (2003), Stallard and Friede (2008), Levy et al. (2006), Kaufmann
et al. (2009), Levin et al. (2011), and the various methods used in adaptive trial
design (see, e.g., Coffey et al. 2012, and references cited therein).
A sequential selection design featuring a response-adaptive randomized play-
the-winner rule with sequential elimination of inferior treatments was studied by
Coad and Ivanova (2005). They show a desirable savings in total sample size
compared with fixed sample size procedures together with a palpable increase in
the proportion of patients allocated to the superior treatment. They demonstrate
the practical benefits of their procedure using a three-treatment lung cancer study
and argue that these benefits extend also to dose-finding studies. The same
authors also studied the selection bias involved in the maximum likelihood
estimators of the success probabilities after the trial stops, using a key identity
between the bias and the expected reciprocal stopping time; see Coad and
Ivanova (2001).
A novel design called the comparative selection design was introduced by Leu
et al. (2011) which combines features of a pure selection design with a hypothesis
test. One supposes there are one or more active candidate treatments and one or more
placebo or other control arms (such as an attention control group or best available
standard of care). The primary goal is to test the null hypothesis that there does not
exist a better-than-placebo (BTP) subset of active treatments against the alternative
that there does exist a BTP subset of active treatments; and, if the null hypothesis is
rejected in favor of the alternative, to select one such BTP subset of active treat-
ments. The type 1 error rate may be controlled at conventional levels of statistical
significance and the probability of correctly selecting a BTP subset of active
treatments can be made arbitrarily high given a sufficiently wide separation between
the efficacy of the BTP active treatments and that of the placebo or other control
arms. We refer the reader to Leu et al. (2011) for further details of the comparative
selection design.
While we have focused exclusively on randomized selection designs with binary
outcomes in this chapter, a selection design has also been proposed for time-to-event
outcomes. We refer the reader to Liu et al. (1993) for further details and to Herbst
et al. (2010) for an application of the design to evaluate concurrent chemotherapy
with Cetuximab versus chemotherapy followed by Cetuximab in patients with lung
cancer.

Applications of Randomized Selection Designs

Though randomized selection designs have much to offer, they have seldom been
applied in practice. Here we provide two examples of trials using such designs.
The first example is a recently published paper by Lustberg et al. (2010). The
study evaluated two schedules of the combination of mitomycin and irinotecan in
patients with esophageal and gastroesophageal adenocarcinoma with the goal of
selecting the most promising schedule. Patients were randomized (1:1) to either
6 mg/m2 mitomycin C on day 1 and 125 mg/m2 irinotecan on days 2 and 9 or 3 mg/
m2 mitomycin C on days 1 and 8 and 125 mg/m2 irinotecan on days 2 and 9. The
Simon two-stage design was used to prescreen the treatments with the primary
outcome being response to treatment. Each treatment arm was designed to detect a
20% difference with an alpha of 0.10 and a beta of 0.10, assuming response rates of
30% under the null and 50% under the alternative hypothesis. This required
enrolling 28 patients in the first stage of the Simon two-stage design. If 8 or more
responded then an additional 11 patients would be enrolled for a total of 39 patients
per arm. Treatment(s) with 16 or more responders would be considered worthy of
further evaluation and the SWE procedure would be used to select the most prom-
ising one.
Prescreening with Simon’s two-stage design followed by an application of the
SWE procedure can be considered as an overall selection procedure. Though the
probability of passing the screening step with Simon’s two-stage procedure is 0.900
in the example at the design alternative of a 50% response rate, the overall PCS is
only 0.888 for c = 2 treatments assuming the inferior treatment has a response rate of
30%. This is because there is a non-negligible chance the inferior treatment will pass
the Simon prescreen (with probability 0.094) followed by a small chance that its
response tally will actually exceed (or equal) that of the better treatment, leading to
an incorrect selection (or a 50% chance of an incorrect selection in the case of a tie).
As the response rate of the inferior treatment approaches that of the superior
treatment, the PCS decreases. For example, with a 35% response rate, the overall
PCS is 0.855 and with a 40% response rate, the overall PCS falls to 0.780.
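These overall PCS figures are straightforward to reproduce by Monte Carlo. The sketch below encodes the stated two-stage design (28 patients, continue if at least 8 respond, pass if at least 16 of 39 respond) for each arm and applies the SWE rule to the arms that pass; at the design alternative of 50% versus 30%, the estimate should land near the 0.888 quoted above.

```python
# Monte Carlo sketch of the overall PCS for parallel Simon two-stage
# prescreens followed by the SWE selection rule (c = 2 arms).
import numpy as np

rng = np.random.default_rng(42)

def simon_arm(p, rng=rng):
    """Return (passed prescreen, total responses out of 39) for one arm."""
    stage1 = rng.binomial(28, p)
    if stage1 < 8:                      # fail stage 1
        return False, stage1
    total = stage1 + rng.binomial(11, p)
    return total >= 16, total           # pass if >= 16 of 39

def overall_pcs(p_best, p_inferior, reps=100_000):
    correct = 0.0
    for _ in range(reps):
        pass_b, r_b = simon_arm(p_best)
        pass_i, r_i = simon_arm(p_inferior)
        if pass_b and not pass_i:
            correct += 1
        elif pass_b and pass_i:
            if r_b > r_i:
                correct += 1
            elif r_b == r_i:
                correct += 0.5          # random tie-break
    return correct / reps

# Design alternative (50% vs 30%): estimate should be near 0.888.
print(overall_pcs(0.50, 0.30))
```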
In the conduct of the actual study, only 6 mg/m2 mitomycin C on day 1 and
125 mg/m2 irinotecan on days 2 and 9 passed the screening and was considered
worthy for further evaluation. Thus, selection of the most promising treatment was
not needed.
The second example is a randomized selection trial the present authors designed
using the Levin-Robbins-Leu selection procedures. This study is currently enrolling
patients at Columbia University Irving Medical Center to evaluate three types of
garments (gloves and socks) for the prevention of taxane-induced peripheral neu-
ropathy, with the goal of selecting the most promising intervention and evaluating it
in a large randomized controlled clinical trial. Breast-cancer patients are being
randomized in triplets to cryotherapy (cold garments), compression therapy (tight-
fitting garments), or placebo (loose garments) with stratification for the chemother-
apy schedule. Previous smaller studies have shown that both cryotherapy and
compression may be efficacious at preventing taxane-induced peripheral neuropathy.
A “response” is defined as a change from baseline to 12 weeks of less than 5 points
in the FACT NTX, a patient self-reported scale of neuropathy symptoms. Though the
selection could have been conducted without the placebo arm, it was included in the
trial because the investigators were interested in obtaining preliminary effect size
estimates for future studies with this unmasked, self-reported endpoint.
The trial uses LRL procedure E/R with c = 3 and b = 1. In this case the trial will
stop the first time two intervention arms have been eliminated and the leading arm is
recruited. The criterion for elimination of an intervention is defined as a difference of
4 between the currently greatest and smallest response tallies. If and when one arm is
eliminated, patients are randomized in pairs to the remaining interventions. If at any
time two arms should happen to be tied with 4 fewer responses than the intervention
currently in the lead, then both trailing interventions are eliminated at that time. Once
the second elimination criterion is reached, the remaining intervention with the
largest response tally is selected as the preferred intervention. The trial design
modifies the original LRL procedure in two ways. First, sequential monitoring will
begin only once 45 patients in 15 triplets have been enrolled. This minimum
enrollment requirement will provide unbiased estimates of response proportions
for the three arms uncomplicated by selection effects. Second, the trial will be
stopped if the second elimination does not occur at or before 100 outcomes have
been observed, thus converting the “open” LRL procedure to a “closed” procedure
by truncation. If the trial stops by truncation, the intervention with the largest
response tally among the remaining competitors is selected. If there are ties for the
largest tally at time of truncation, one intervention is selected according to other
considerations (safety, ease of compliance, etc.).
The elimination criterion of a lead of 4 between largest and smallest response
tallies and the maximum sample size of 100 patients were chosen to achieve a PCS
of at least 80% for any true success probabilities lying in the preference zone
characterized by an odds ratio of 2.0 or greater between the true response probabil-
ities of the best two interventions. Based on published studies evaluating the
interventions of interest, we assumed a success rate of 79%. Moreover, experience
suggests a 40% response rate for the control arm is plausible. Five scenarios were
evaluated based on these rates. The first scenario corresponds to response probabil-
ities in the least favorable configuration in the preference zone with an odds ratio of
2.0 (79%, 65%, and 65%). The second scenario corresponds to the abovementioned
design alternative based on the observed rates from previously conducted studies
(79%, 65% and 40%). The third and fourth scenarios describe the operating charac-
teristics of the selection procedure inside the indifference zone, assuming smaller
differences in the response rates for the two interventions. The third scenario
assumes that both interventions are superior to the placebo with a smaller difference
between them (such that the odds ratio between the best and second-best intervention
is less than 2.0); the response rates are 0.75, 0.65, 0.40. The fourth scenario assumes
that both interventions are superior to the placebo with no difference between them
(an odds ratio of 1.0); the response rates are 0.65, 0.65, 0.40. The fifth scenario
illustrates a case where there is no true difference in the rates for all three arms, that is
all interventions have a response rate of 0.65.

Table 1 Operating characteristics of the selection procedure for peripheral neuropathy prevention

             Scenario 1           Scenario 2           Scenario 3           Scenario 4           Scenario 5
             (0.79, 0.65, 0.65)   (0.79, 0.65, 0.40)   (0.75, 0.65, 0.40)   (0.65, 0.65, 0.40)   (0.65, 0.65, 0.65)
P[cs]        0.838                0.912                0.831                0.498                0.336
P[as]        0.838                0.912                0.999                0.997                1.00
P[N = 45]    0.134                0.272                0.214                0.153                0.075
P[trunc]     0.303                0.163                0.231                0.324                0.489
P[N < 60]    0.322                0.514                0.435                0.343                0.201
P[N < 80]    0.534                0.719                0.638                0.537                0.362
Mean N       75.2                 65.7                 69.8                 74.8                 83.1
Median N     76                   59                   65                   75                   99

P[cs] is the probability of correct selection overall
P[as] is the probability of an acceptable selection overall
P[N = 45] is the probability of reaching a decision after exactly 45 patients have been randomized
P[trunc] is the probability that the trial will be truncated before the second elimination time
P[N < 60] is the probability that the total number of patients will be less than 60 at stopping
P[N < 80] is the probability that the total number of patients will be less than 80 at stopping
Mean N is the mean of the distribution of the (random) total number of patients
Median N is the median of the distribution of the total number of patients

The operating characteristics of the design shown in Table 1 were evaluated
by simulation studies using 100,000 replications per scenario. The characteristics
evaluated were the PCS, the probability of an acceptable selection, the
probability of stopping at the first look with N = 45 patients, the probability of
truncation at N = 100 patients, the probability of the trial concluding with a sample
size below 60 or below 80 (to assess accrual feasibility), and the mean and median
sample size.
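A simulation in the spirit of those used for Table 1 is sketched below for c = 3 and b = 1, with the elimination gap of 4, monitoring beginning at 45 patients, and truncation at 100 outcomes. Protocol details beyond those stated in the text (for example, exactly how ties are broken at truncation) are our assumptions, so the estimates will only approximate the tabulated values.

```python
# Monte Carlo sketch of the modified LRL procedure E/R used in the peripheral
# neuropathy trial (c = 3, b = 1, d = 4, burn-in of 15 triplets, truncation
# at about 100 outcomes).
import numpy as np

rng = np.random.default_rng(2021)

def neuropathy_trial(p, d=4, burn_in_rounds=15, max_n=100, rng=rng):
    """p: true response rates of the 3 arms; returns (selected arm, total n)."""
    live = [0, 1, 2]
    tallies = np.zeros(3, dtype=int)
    n = rounds = 0
    while True:
        for arm in live:                          # one patient per live arm
            tallies[arm] += rng.random() < p[arm]
        n += len(live)
        rounds += 1
        if rounds < burn_in_rounds:               # no monitoring before 45
            continue
        lead = tallies[live].max()
        # Eliminate every live arm trailing the current leader by >= d.
        live = [a for a in live if lead - tallies[a] < d]
        if len(live) == 1:                        # second elimination reached
            return live[0], n
        if n >= max_n:                            # truncation: pick the leader,
            leaders = [a for a in live if tallies[a] == lead]
            return int(rng.choice(leaders)), n    # ties broken at random here

def prob_correct(p, reps=20_000):
    best = int(np.argmax(p))
    return sum(neuropathy_trial(p)[0] == best for _ in range(reps)) / reps

# Scenario 2 of Table 1; the estimate should be near the tabulated 0.912.
print(prob_correct([0.79, 0.65, 0.40]))
```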
At the end of the trial the sample proportion of patients with a change in FACT
NTX < 5 from baseline to week 12 will be reported for each arm. Additionally, the
likelihood of the response tallies for the intervention selected together with the first
runner-up will be calculated. This likelihood is given by
$$L\left(p_i, p_j \mid r_i^{(n)}, r_j^{(n)}\right) = p_i^{r_i^{(n)}} (1 - p_i)^{n - r_i^{(n)}} \, p_j^{r_j^{(n)}} \left(1 - p_j\right)^{n - r_j^{(n)}},$$

where $r_i^{(n)}$ and $r_j^{(n)}$ are the observed tallies for the selected and first runner-up
intervention and where $p_i$ and $p_j$ are the respective true response probabilities. The
likelihood of the observed response tallies will also be calculated under the assump-
tion that we erred in our selection and that the true probabilities are those for the two
interventions transposed, namely, $L\left(p_j, p_i \mid r_i^{(n)}, r_j^{(n)}\right)$. The likelihood ratio, or LR, is
the ratio of these two likelihoods. It can be shown that LR equals the true odds ratio
raised to the fourth power,

$$LR = \frac{L\left(p_i, p_j \mid r_i^{(n)}, r_j^{(n)}\right)}{L\left(p_j, p_i \mid r_i^{(n)}, r_j^{(n)}\right)} = \left\{\frac{p_i/(1 - p_i)}{p_j/\left(1 - p_j\right)}\right\}^4,$$

in the case where the trial ends meeting the selection criterion. In the case of
truncation, the exponent 4 is replaced by $r_i^{(n)} - r_j^{(n)}$. LR will be evaluated at the
adjusted sample proportions of $p_i$ and $p_j$, namely, $\left(r_i^{(n)} + 0.5\right)/(n + 1)$ and
$\left(r_j^{(n)} + 0.5\right)/(n + 1)$, respectively.
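The post-trial likelihood ratio is simple to compute from the final tallies using the adjusted proportions just described; the tallies in the example below are hypothetical.

```python
# Post-trial likelihood ratio from observed tallies, using the adjusted
# sample proportions (a direct transcription of the formulas above).
def likelihood_ratio(r_i, r_j, n, truncated=False):
    """r_i, r_j: tallies for the selected and runner-up arms after n rounds."""
    p_i = (r_i + 0.5) / (n + 1)               # adjusted proportions
    p_j = (r_j + 0.5) / (n + 1)
    odds_ratio = (p_i / (1 - p_i)) / (p_j / (1 - p_j))
    exponent = (r_i - r_j) if truncated else 4
    return odds_ratio**exponent

# Hypothetical illustration: 30 vs 26 responses after n = 40 rounds.
print(likelihood_ratio(30, 26, 40))           # about 6.4 here
```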
The likelihood ratio is an important measure of the weight of evidence in favor of
a correct selection after the trial concludes (see, e.g., Royall 1997 and 2000). For
example, it indicates strong evidence of correct selection if LR > 10 or only weak
evidence if it is near 1, and if the placebo arm should actually be the selected
intervention arm, there would presumably be either weak or even strong evidence
against either active intervention being the best. Thus the LR will play a crucial role
in deciding whether or not to mount a subsequent phase III trial.

Discussion and Conclusion

We began this chapter with the somewhat provocative assertion that not every
clinical research problem can – nor should – be addressed with a study design that
tests a null hypothesis of no treatment differences while controlling the type 1 error
rate to conventional levels such as 0.01, 0.05, or 0.10. Selection problems, addressed
by randomized selection trials, are prime examples. While a pure selection design
can always be viewed as a multiple decision procedure that tests the null hypothesis
of no difference which rejects that hypothesis when one selects the best performing
treatment, this view misses the point, which is that when the goal is to select a best
treatment, one really doesn’t care about the null hypothesis. Indeed, the reason pure
selection designs can achieve good PCS (a.k.a. “power” in the hypothesis test
context) with smaller sample sizes than conventional phase III designs is exactly
because selection trials control the type 1 error rate only at level α = 1/c (or $\binom{c}{b}^{-1}$
for selecting subsets of size b). Pure selection trials are precisely the right tool for the
job when a choice between competing alternatives must be made with no negative
consequences if the treatments are all of equal efficacy.
Some authors have raised concerns with the use of randomized selection designs
in clinical research apart from the hypothesis testing issue; see, e.g., Rubinstein et al.
(2009), echoed by Green et al. (2016). Referring to the original SWE design,
Rubinstein et al. (2009, p.1886) write,

The weakness in the original design is that it does not assure that the (sometimes nominally)
superior experimental regimen is superior to standard therapy. It was occasionally argued
that an ineffective experimental regimen could act as a control arm for the other regimen, but
the design was not constructed to be used in this way, since, as designed, one of the two
experimental regimens would always be chosen to go forward, even if neither was superior
to standard treatment.

To address this concern the authors suggest prescreening the candidates with Simon’s
two-stage design, citing Liu et al. (2006). The concern presumes, of course, that
standard therapy is not one of the treatments considered for selection. Apart from
potential problems with the rate of accrual, there is no intrinsic reason why standard
treatment cannot be included among those to be studied in a selection trial, and this
option should be carefully considered in the planning stages of the trial. The concern
largely evaporates when standard treatments are included.
Whether or not standard treatment is included among the candidates, the stated
concern does raise an important question: If one is to give up on statements of
statistical significance in the selection paradigm, what then can be said about the
quality of the selected treatment(s) based on the accumulated data, a question that will
inevitably be asked once the trial ends? We believe that an assessment of the weight of
evidence using likelihood ratio methodology is perhaps the most appropriate answer,
such as was illustrated in the peripheral neuropathy selection trial discussed in section
“Applications of Randomized Selection Designs.” This approach is not only reason-
able insofar as it addresses the right question – How strong is the evidence in favor of
having selected the truly best treatment? – it also accords with the current trend away
from too-heavy reliance on null hypothesis significance testing and p-values. Weight-
of-evidence considerations are especially germane if standard treatment is included
among the candidate treatments, but even if not, likelihood ratios against historical
control parameters can be quite illuminating.
Such weight-of-evidence considerations complement the more traditional
frequentist response to the concern that (a) one has high confidence P* that the
selected treatments are acceptable because we use a procedure that has such an
operating characteristic (though acknowledging that that does not pertain to any
particular trial result); and (b) a descriptive review of the maximum likelihood
estimates of the treatment response probabilities, possibly together with an estimate
of the selection bias adhering thereto following the methods of Coad and Ivanova
(2001), can be revealing.
Nevertheless, we suspect some researchers will be unable to overcome the reflex
to test some hypothesis, in which case the preliminary selection design or the
comparative selection trial mentioned in section “Other Designs” may hold appeal.
With the former design, a confirmatory phase III trial ultimately assesses the efficacy
of the selected treatment against a control treatment. With the latter design, given
several active treatments and possibly several control treatments, it seems quite
natural to test whether there is a better-than-placebo active treatment (or a subset
of them) and even more reasonable to then wish to correctly identify it (or them) with
a high probability of correct (or acceptable) selection.

Finally, in this chapter we have focused on the LRL family of sequential subset
selection procedures because we believe its flexibility, adaptive features, and acceptable
subset selection property make it an attractive option for randomized selection trials.

References
Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal
populations with known variances. Ann Math Stat 25:16–39
Bechhofer RE, Kiefer J, Sobel M (1968) Sequential identification and ranking procedures. University
of Chicago Press, Chicago
Bechhofer RE, Santner TJ, Goldsman DM (1995) Design and analysis of experiments for statistical
selection, screening, and multiple comparisons. Wiley, New York
Caplan A, Plunkett C, Levin B (2015a) Selecting the right tool for the job (invited paper). Am J
Bioeth 15(4):4–10. (with open peer commentaries, pp. 33-50)
Caplan A, Plunkett C, Levin B (2015b) The perfect must not overwhelm the good: response to open
peer commentaries on “selecting the right tool for the job”. Am J Bioeth 15(4):W8–W10
Coad DS, Ivanova A (2001) Bias calculations for adaptive urn designs. Seq Anal 20:91–116
Coad DS, Ivanova A (2005) Sequential urn designs with elimination for comparing K ≥ 3
treatments. Stat Med 24:1995–2009
Coffey CS, Levin B, Clark C, Timmerman C, Wittes J, Gilbert P, Harris S (2012) Overview, hurdles,
and future work in adaptive designs: perspectives from an NIH-funded workshop. Clin Trials 9
(6):671–680
Gibbons JD, Olkin I, Sobel M (1977) Selecting and ordering populations: a new statistical
methodology. Wiley, New York
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. Chapman and
Hall/CRC Press, Boca Raton
Gupta SS (1956) On a decision rule for a problem in ranking means, mimeograph series 150,
Institute of Statistics. University of North Carolina, Chapel Hill
Gupta SS (1965) On some multiple decision (selection and ranking) rules. Technometrics 7:225–
245
Herbst RS, Kelly K, Chansky K, Mack PC, Franklin WA, Hirsch FR, Atkins JN, Dakhil SR, Albain
KS, Kim ES, Redman M, Crowley JJ, Gandara DR (2010) Phase II selection design trial of
concurrent chemotherapy and cetuximab versus chemotherapy followed by cetuximab in
advanced-stage non-small-cell lung cancer: Southwest Oncology Group study S0342. J Clin
Oncol 28(31):4747–4754
Kaufmann P, Thompson JLP, Levy G, Buchsbaum R, Shefner J, Krivickas LS, Katz J, Rollins Y,
Barohn RJ, Jackson CE, Tiryaki E, Lomen-Hoerth C, Armon C, Tandan R, Rudnicki SA,
Rezania K, Sufit R, Pestronk A, Novella SP, Heiman-Patterson T, Kasarskis EJ, Pioro EP,
Montes J, Arbing R, Vecchio D, Barsdorf A, Mitsumoto H, Levin B, for the QALS Study Group
(2009) Phase II trial of CoQ10 for ALS finds insufficient evidence to justify phase III. Ann
Neurol 66:235–244
Leu C-S, Levin B (2008a) A generalization of the Levin-Robbins procedure for binomial subset
selection and recruitment problems. Stat Sin 18:203–218
Leu C-S, Levin B (2008b) On a conjecture of Bechhofer, Kiefer, and Sobel for the Levin-Robbins-
Leu binomial subset selection procedures. Seq Anal 27:106–125
Leu C-S, Levin B (2017) Adaptive sequential selection procedures with random subset sizes. Seq
Anal 36(3):384–396
Leu C-S, Cheung Y-K, Levin B (2011) Chapter 15, Subset selection in comparative selection trials.
In Bhattacharjee M, Dhar SK, Subramanian S (eds) Recent advances in biostatistics: false
discovery, survival analysis, and other topics. Series in biostatistics 4:271–288. World Scientific
Levin B, Leu C-S (2013) On an inequality that implies the lower bound formula for the probability
of correct selection in the Levin-Robbins-Leu family of sequential binomial subset selection
procedures. Seq Anal 32(4):404–427
Levin B, Leu C-S (2016) On lattice event probabilities for Levin-Robbins-Leu subset selection
procedures. Seq Anal 35(3):370–386
Levin B, Thompson JLP, Chakraborty B, Levy G, MacArthur RB, Haley EC (2011) Statistical
aspects of the TNK-S2B trial of Tenecteplase versus Alteplase in acute ischemic stroke: an
efficient, dose-adaptive, seamless phase II/III design. Clin Trials 8:398–407
Levy G, Kaufmann P, Buchsbaum R, Montes J, Barsdorf A, Arbing R, Battista V, Zhou X,
Mitsumoto H, Levin B, Thompson JLP (2006) A two-stage design for a phase II clinical trial
of coenzyme Q10 in ALS. Neurology 66:660–663
Liu PY, Moon J, LeBlanc M (2006) Phase II selection designs. In: Crowley J, Ankerst DP (eds)
Handbook of statistics in clinical oncology, 2nd edn. Chapman and Hall/CRC, Boca Raton, pp
155–164
Liu PY, Dahlberg S, Crowley J (1993) Selection designs for pilot studies based on survival.
Biometrics 49:391–398
Lustberg MB, Bekaii-Saab T, Young D et al (2010) Phase II randomized study of two regimens of
sequentially administered mitomycin C and irinotecan in patients with unresectable esophageal
and gastroesophageal adenocarcinoma. J Thorac Oncol 5:713–718
Royall R (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall R (2000) On the probability of observing misleading statistical evidence. J Am Statist Assoc
95(451):760–768
Rubinstein L, Crowley J, Ivy P, LeBlanc M, Sargent D (2009) Randomized phase II designs. Clin
Cancer Res 15(6):1883–1890
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Simon R, Wittes RE, Ellenberg SS (1985) Randomized phase II clinical trials. Cancer Treat Rep
69:1375–1381
Stallard N, Friede T (2008) A group-sequential design for clinical trials with treatment selection.
Stat Med 27(29):6209–6227
Stallard N, Todd S (2003) Sequential designs for phase III clinical trials incorporating treatment
selection. Stat Med 22(5):689–703
Steinberg SE, Venzon DJ (2002) Early selection in a randomized phase II clinical trial. Stat Med
21:1711–1726
58 Futility Designs

Sharon D. Yeatts and Yuko Y. Palesch

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1068
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1069
Superiority Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Single-Arm Futility Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1071
Case Study: Creatine and Minocycline in Early Parkinson Disease . . . . . . . . . . . . . . . . . . . . . . . 1073
Concurrently Controlled Futility Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
Case Study: Deferoxamine in Intracerebral Hemorrhage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1075
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
Sample Size Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1078
Protocol Adherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1079
Sequential Futility Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081

Abstract
Limited resources require that interventions be evaluated for an efficacy signal in
Phase II prior to initiation of large and costly confirmatory Phase III clinical trials.
The standard concurrently controlled superiority design is not well-suited for
this evaluation. Because the Phase II superiority design is often underpowered to
detect clinically meaningful improvements, investigators are left to make
subjective decisions in the face of a nonsignificant test result. The futility design
reframes the statistical hypothesis in order to discard interventions which do not
demonstrate sufficient promise. The alternative hypothesis is that the effect is less
than some minimally worthwhile threshold. In this way, the trial can be
appropriately powered to evaluate whether the intervention is worth pursuing in
Phase III and thus provides a clear “no go” signal. We briefly describe the
superiority design in order to compare and contrast with the futility design. We
then describe both the single-arm and concurrently controlled futility designs and
present case studies of each. Lastly, we discuss some key considerations related to
sample size calculation and interim analysis.

Keywords
Phase II · Futility design · Single-arm futility design · Concurrently controlled
futility design · Calibration control

Introduction

Each phase of clinical testing has its own objectives, and the optimal trial design
should be tailored to the research question at hand. Phase I trials are typically
designed to identify the dose (or range of doses) which has desired properties,
usually related to safety, and there is a growing body of statistical literature describ-
ing various dose-finding/dose-ranging designs, such as the continuous reassessment
method (Garrett-Mayer 2006). In Phase II, the selected doses are evaluated for an
efficacy signal, in addition to further assessment of safety. Those with sufficient
promise then proceed to a confirmatory evaluation of efficacy in Phase III, for which
the randomized controlled clinical trial is generally regarded to be the gold standard.
Sacks et al. (2014) conducted a retrospective evaluation of marketing applications
submitted to the US Food and Drug Administration for new molecular entities and
found that only 50% were approved on first submission. Similarly, Hwang et al.
(2016) found that 54% of 640 novel therapeutics entering confirmatory testing
between 1998 and 2008 failed and the failure was related to efficacy in 57% of
these. This success rate appears to differ by therapeutic area, and the data is
conflicting. Hwang et al. (2016) report a failure rate of nearly 70% in cancer, whereas
Sacks et al. (2014) report a first round approval rate of 72% in oncology. Djulbegovic
et al. (2008) evaluated 624 Phase III clinical trials completed by the National Cancer
Institute cooperative groups between 1955 and 2000 and found that only 30% of
randomized comparisons were statistically significant and 29% were inconclusive
(defined as “equal chance that standard treatment better than experimental or vice
versa”). Among trials evaluating treatments for acute ischemic stroke, Kidwell et al.
(2001) reported that 23% were considered positive by the reporting authors, but only
3% yielded a positive response on a prespecified primary endpoint at the typical level
of significance of 0.05. Chen and Wang (2016) reviewed 430 drugs considered for
the treatment of stroke between 1995 and 2015 and found that 70% were
discontinued.
Given the disappointing performance of candidate treatments in Phase III clinical
trials, there is a need to better screen therapies prior to the implementation of
expensive Phase III clinical trials (Brown et al. 2011; Sacks et al. 2014; Levin 2015).
Clinical trial conduct requires extensive resources financially (for research personnel
efforts and infrastructure support) and in terms of patients with the condition of
interest. The available resources are clearly limited, and properly vetting interven-
tions in Phase II allows the finite resources available to be targeted toward
confirming efficacy in those with most promise.
The standard concurrently controlled Phase II trial design is often powered to
detect very large effect sizes in order to keep the sample size feasible, and hence, it
can be criticized as an underpowered Phase III trial (Levin 2015). Failure to find
significance may be the result of inadequate power at effect sizes which are still
clinically meaningful. The trial results, then, do not provide a clear “go/no go” signal
as to whether the intervention should move forward for confirmatory efficacy
testing. Consequently, even when the outcome analysis fails to achieve statistical
significance, the standard Phase II design assumes that the intervention will move
forward.
Rather than evaluating whether an intervention has sufficient promise, the futility
design seeks to discard an intervention which clearly lacks sufficient promise. The
statistical implication of this distinction is impactful – the futility design can be
appropriately powered to declare futility (and hence provide a clear “no go” signal)
when an intervention has little or no effect.

Background

The methodologic basis for the futility design stems from the field of cancer clinical
trials. To eliminate ineffective therapies from future development, a single-arm
clinical trial would be conducted in order to compare the resulting outcome to
some minimally acceptable level (Herson 1979).
In recent years, the futility design has received increased attention, particularly in
the field of neurology, as a mechanism to weed out interventions which are not
sufficiently promising. The IMS I Investigators (2004) adapted the futility design to
the acute ischemic stroke treatment with the single-arm futility trial evaluating the
effect of intravenous plus intra-arterial tPA. In 2005, Palesch et al. applied this
methodology to six past Phase III trials in ischemic stroke and found that the futility
design could have prevented three such trials for which the treatment was ultimately
determined to be ineffective. The NINDS NET-PD Investigators used the single-arm
futility design to test whether creatine or minocycline (2006), as well as Co-Q10 or
GPI-1485 (2007), warranted definitive confirmatory testing for Parkinson disease.
Kaufmann et al. (2009) conducted a concurrently controlled, adaptive, two-stage
selection and futility design to evaluate the promise of coenzyme Q10 in
amyotrophic lateral sclerosis (ALS).
As we have stated previously, the traditional concurrently controlled Phase II
clinical trial, designed to evaluate the efficacy signal in a test for superiority, can be
criticized as an underpowered Phase III trial. We first briefly review the superiority
setting in order to demonstrate this point. We then introduce the single-arm futility
design and discuss its advantages and disadvantages. Finally, we describe the
concurrently controlled futility design.

Superiority Setting

Consider a two-arm concurrently controlled clinical trial designed to evaluate
whether there is a difference between the experimental arm and the control arm in
the proportion of subjects with good outcome. Let π represent the true proportion of
subjects with good outcome, such that πtx is the good outcome proportion associated
with the intervention and πctrl is the good outcome proportion associated with the
control. The null hypothesis states that the treatment effect, defined as the absolute
difference in the proportions of subjects with good outcome, is zero, and the
alternative hypothesis states that the treatment effect is not zero:

$$H_0: \pi_{tx} - \pi_{ctrl} = 0$$
$$H_A: \pi_{tx} - \pi_{ctrl} \neq 0$$

A Type I error is the rejection of a true null hypothesis, which here means that the
treatment arms are declared different when in fact they are not. The commonly used
term “false positive” reflects both the statistical framework and the conclusion about
the intervention; the investigators declare a positive finding (a difference between the
treatments) when none exists. In the superiority setting, the level of significance,
which reflects our willingness to make this error, is typically set at 0.05. The
scientific community may be more or less willing to tolerate this error, depending
on its consequences. The type, safety profile, and cost of the intervention or other
considerations may factor into the general willingness to accept the conclusion that a
treatment is efficacious when it is not.
A Type II error is the failure to reject a false null hypothesis, which here means
that the statistical test fails to conclude a difference when in fact one exists. Again,
the commonly used term “false negative” reflects both the statistical framework and
the conclusion about the intervention. The investigators consider the trial to be
negative (are unable to declare a difference between the treatments) despite a
nonzero treatment effect. The willingness to accept such an error is typically set at
0.2 or less; in other words, the statistical power of the trial is set to 0.8 or greater.
Again, the scientific community may be more or less willing to tolerate this error,
depending on the same factors as for the Type I error.
In order to justify the criticism of a Phase II superiority design as an underpow-
ered Phase III trial, consider a hypothetical Phase II trial intended to evaluate the
efficacy signal associated with a new treatment for intracerebral hemorrhage. As the
binomial proportion has maximum variance at 0.5, we assume this to be the control
proportion to represent the worst case scenario. A concurrently controlled superiority
trial of 300 subjects, 150 in each arm, has 81% power to detect an improvement of 16
percentage points. In stroke, recent confirmatory trials have been designed to detect a
minimum clinically relevant difference of 10 percentage points; the design has only
42% power to detect this difference. As a result, there is a high likelihood of a
statistically nonsignificant finding, even when a clinically relevant treatment effect
exists. One can tweak the design in various ways to improve the power. For example,
increasing the level of significance (the alpha level) and testing against a one-sided
alternative hypothesis will both improve the power for a given sample size. How-
ever, even at a one-sided 0.10 level of significance, this design would have only 68%
power to detect a 10% absolute improvement over control. As the statistical test is
very likely to be not significant, leading to a failure to reject the null hypothesis of no
difference between the groups, it is not clear how the efficacy signal can be reliably
evaluated in this way.
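The power figures quoted in this paragraph follow from the standard normal-approximation calculation for comparing two proportions, sketched below (a generic calculation, not tied to any particular trial's software).

```python
# Normal-approximation power for the two-sample test of proportions,
# reproducing the figures quoted above.
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p_ctrl, p_tx, n_per_arm, alpha=0.05, sides=2):
    p_bar = (p_ctrl + p_tx) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)          # SE under H0
    se1 = sqrt((p_ctrl * (1 - p_ctrl) + p_tx * (1 - p_tx)) / n_per_arm)
    z_crit = norm.ppf(1 - alpha / sides)
    return norm.cdf((abs(p_tx - p_ctrl) - z_crit * se0) / se1)

print(power_two_proportions(0.50, 0.66, 150))                       # ~0.81
print(power_two_proportions(0.50, 0.60, 150))                       # ~0.42
print(power_two_proportions(0.50, 0.60, 150, alpha=0.10, sides=1))  # ~0.68
```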

Single-Arm Futility Design

In the single-arm futility design, all subjects enrolled are treated with the interven-
tion, in order to compare the outcome against a prespecified reference value. The
design is intended to establish whether the outcome on the intervention represents
less than some minimally clinically relevant improvement over the prespecified
reference value, which would lead us to declare the intervention futile. The alterna-
tive hypothesis, then, represents futility, and conversely, the null hypothesis assumes
that the intervention is not futile. Let πtx represent the true proportion of subjects with
good outcome on the intervention, and let π0 represent this clinically relevant
improvement over the reference; we now refer to this improvement as the futility
threshold. The statistical hypotheses are written as shown:

$$H_0: \pi_{tx} \geq \pi_0$$
$$H_A: \pi_{tx} < \pi_0$$

A Type I error is still the rejection of a true null hypothesis; in the context of
futility, this means that the treatment response is declared to be less than the
threshold when in fact it is not. The commonly used term “false positive” here
reflects the statistical framework but does not well describe our conclusions about
the intervention; the investigators declare a negative finding (the intervention is
futile) when it is not. The prespecified level of significance should take into account
both the consequences of this error and the phase of the study. The consequence,
here, is that a useful intervention may be unnecessarily discarded. We want to
minimize the chance of abandoning effective therapies, certainly, but the community
may be more willing to tolerate a Type I error in the futility context than in the
superiority context, where the result of such an error is that patients are unnecessarily
exposed to an ineffective therapy. In addition, the sample size associated with Phase
II trials is expected to be relatively small, at least in comparison to the confirmatory
setting. Balancing these needs, a 0.10 level of significance has been suggested
(Tilley et al. 2006). Note that the alternative hypothesis is necessarily one-sided,
as we wish to discard only interventions for which the response is less than the
threshold, and the level of significance should be allocated as such.

A Type II error is still the failure to reject a false null hypothesis; in the context of
futility, this means that the statistical test fails to conclude that the treatment is futile
despite a treatment response which is less than the threshold. Again, the commonly
used term “false negative” reflects the statistical framework but not our conclusions
about the intervention; although the response is less than the specified threshold, the
intervention is not declared futile. The consequence is that an ineffective therapy will
be moved forward for a definitive efficacy evaluation. As our objective is to discard
ineffective therapies, we want to limit the chance of this error; however, because
additional testing is required before declaring the intervention efficacious, our
tolerance for the Type II error can be greater than that for the Type I error.
There is a great efficiency to this single-arm approach in terms of sample size
savings. Let us re-envision the previously described concurrently controlled superi-
ority design as a single-arm futility design. Assume, as before, that the literature
suggests a good outcome proportion of 0.5 associated with the control; further
assume that 10 percentage points is the minimum worthwhile improvement
required to warrant further investigation. The statistical hypothesis,
then, is written as shown:

H0: πtx ≥ 0.6
HA: πtx < 0.6

Calculating the sample size in order to achieve 80% power for declaring futility
when there is no improvement associated with the intervention (i.e., when πtx = 0.5
under the alternative hypothesis), we find that a total sample size of 110 subjects is
required to evaluate futility using a one-sided 0.10 level of significance. This is in
stark contrast to the superiority approach, which was underpowered to detect
clinically relevant improvements in good outcome with a sample size of 300
subjects.
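As a numerical check, the minimal sketch below assumes the standard normal-approximation formula for a one-sample test of a proportion against a nonzero null; the function name is ours, not from the cited literature.

```python
import math
from scipy.stats import norm

def n_single_arm_futility(pi0, pi_alt, alpha, power):
    """Sample size for H0: pi >= pi0 vs HA: pi < pi0, powered at
    pi = pi_alt (one-sample proportion, normal approximation)."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    num = za * math.sqrt(pi0 * (1 - pi0)) + zb * math.sqrt(pi_alt * (1 - pi_alt))
    return math.ceil((num / (pi0 - pi_alt)) ** 2)

# futility threshold 0.6, powered at the historical control rate 0.5
print(n_single_arm_futility(0.6, 0.5, alpha=0.10, power=0.80))  # 110
```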
The advantages to this single-arm approach are self-evident. All subjects will
receive the intervention of interest, which may increase a potential participant’s
willingness to enroll. Perhaps more importantly, the trial can be appropriately
powered at what might be considered a typical Phase II sample size. Comparison
of the outcome proportion against a fixed value yields very real savings in terms of
the sample size. This fixed value can be derived from the literature available on
outcomes associated with the control intervention, often referred to as historical
control data. Decision-making based on the use of such historical control data is also
considered to be the primary drawback of the single-arm approach.
Chalmers et al. (1972) summarized the arguments against randomization, both
practical and ethical, but ultimately concluded that randomization is necessary to
evaluate efficacy and toxicity, even in the early stages of evaluation. Without a
concurrent control, one cannot be sure that the historical response would apply to the
enrolled population. Estey and Thall (2003) describe this problem as “treatment-
trial” confounding. Clinical management, imaging availability, or outcome ascer-
tainment may have changed over time, thus altering outcomes. Subtle differences in
eligibility criteria may result in slightly different populations across trials; such
differences in baseline characteristics, whether known or unknown, may impact
outcomes. When historical data are used for comparison, the observed effect reflects
a combination of these trial-specific effects and the true treatment effect, and the
specific contribution of each to the observed effect cannot be determined (Estey and
Thall 2003). It has also been suggested that outcome is altered simply because of
participation in a trial, a phenomenon sometimes referred to as the Hawthorne effect
and sometimes referred to as the placebo effect. Because of these limitations, a
concurrently controlled design may be preferred.
Pocock (1976) argues that "acceptable" historical control data, where they exist,
should not be ignored; he presents a case for formally incorporating both randomized
and historical controls and describes the associated statistical inference. His definition of
acceptability is based on six conditions related to consistency and comparability of
the treatment, eligibility criteria, evaluation, baseline characteristics, investigators,
and trial conduct. The historical control data typically available would not meet most
of the six conditions.
To address concern over the applicability of the specified reference value in a
single-arm design, Herson and Carter (1986) proposed the use of a calibration
control, a small group of randomized concurrent controls. The calibration control
arm is not directly compared to the intervention arm but used to evaluate the
relevance of the historical control data in the current study. The utility of the
calibration control arm is demonstrated by the NINDS NET-PD Investigators in
the case study below.

Case Study: Creatine and Minocycline in Early Parkinson Disease

A brief summary of a randomized, double-blind, futility trial of creatine and minocycline in early Parkinson disease is provided here. The interested reader is
referred to NINDS NET-PD Investigators (2006) for a detailed description of the
rationale, methods, and findings. Participants were randomly allocated to one of (1)
active creatine and placebo minocycline, (2) placebo creatine and active
minocycline, or (3) placebo creatine and placebo minocycline. The analysis plan
specified that each of the active arms would be evaluated using a single-arm futility
design, based on historical control data derived from the Deprenyl and Tocopherol
Antioxidant Therapy of Parkinsonism (DATATOP) Trial (The Parkinson Study
Group 1989). The placebo arm was included as a calibration control – to confirm
the historical control assumptions on which the design was based, not for a direct
comparison against the active arms. The primary outcome was the change in the
Unified Parkinson’s Disease Rating Scale (UPDRS), where an increase represents
worsening. The futility threshold was defined as 30% less progression on the
UPDRS than the increase (10.65, 95% CI 9.63–11.67) observed in DATATOP
(NINDS NET-PD Investigators 2006); that is, 10.65 × (1 − 0.30) ≈ 7.46. As shown,
the alternative hypothesis describes futility as a mean (μ) increase (worsening) of more than 7.46 points:

H0: μ ≤ 7.46
HA: μ > 7.46

The single-arm futility analysis was conducted as planned, and there was not
sufficient evidence to declare either creatine or minocycline futile. However, the
mean change observed in the calibration control arm (8.39) was less than anticipated
based on the historical control data (10.65), and as a result, the futility threshold was
not consistent with 30% less progression than control. However, the investigators
had planned for this possibility during the design phase. A series of prespecified
sensitivity analyses were undertaken using the calibration control data to update the
historical control response in various ways, and the conclusions were not substan-
tively altered. This example demonstrates the potential concern over evaluating
futility using historical control data and highlights the potential utility of a concur-
rent control, whether for calibration, as described above, or for direct comparison, as
introduced below.

Concurrently Controlled Futility Design

In the concurrently controlled futility design, subjects are randomly allocated to either an intervention or a control arm, in order to compare the effect of the
intervention against some minimum clinically relevant improvement. Let πtx and
πctrl represent the true proportion of subjects with good outcome on the intervention
and control, respectively, and let δ represent the futility threshold, defined as a
minimum clinically relevant improvement in outcome. As above, the alternative
hypothesis represents futility, and the null hypothesis assumes that the intervention is
not futile.

H0: πtx − πctrl ≥ δ
HA: πtx − πctrl < δ

Again, comparing these hypotheses to those in the standard two-arm superiority design, where the alternative hypothesis is that the two arms differ (HA: πtx − πctrl ≠ 0),
allows us to evaluate the interpretation of the corresponding statistical errors, their
implications, and our willingness to tolerate them. These are as described above in
the case of the single-arm futility design.
For comparison purposes, let us revisit, again, the previously described concur-
rently controlled superiority design as a concurrently controlled futility design.
Assume, as before, that experience suggests a good outcome proportion of 0.5
associated with the control; further assume that 10 percentage points is the
minimum worthwhile improvement required to warrant further investiga-
tion. The statistical hypothesis, then, is written as shown:

H0: πtx − πctrl ≥ 0.10
HA: πtx − πctrl < 0.10

Calculating the sample size in order to achieve 80% power for declaring futility
when there is no improvement associated with the intervention (i.e., when
πtx − πctrl = 0 under the alternative hypothesis), we find that a total sample size of
451 subjects is required to evaluate futility using a one-sided 0.10 level of signifi-
cance. Although larger than one might expect for a Phase II trial, this sample size
would yield only 57% power to detect an absolute 10% improvement in the two-
tailed superiority design and is a dramatic increase over the 110 subjects required for
the single-arm futility design. However, the inclusion of a concurrent control group
for direct comparison with the intervention avoids the pitfalls associated with the use
of historical control data to derive the futility threshold and allows for concurrent
estimation of the treatment effect.
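The 451-subject figure follows from the futility sample size formula presented in the "Sample Size Considerations" section below; a minimal sketch of that calculation (the function name is ours) is given here.

```python
import math
from scipy.stats import norm

def n_total_futility(p_ctrl, p_tx, delta, alpha, power):
    """Total sample size (1:1 allocation) for H0: p_tx - p_ctrl >= delta
    vs HA: p_tx - p_ctrl < delta, powered at the assumed (p_ctrl, p_tx)."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    eps = p_tx - p_ctrl                      # assumed true treatment effect
    var = p_ctrl * (1 - p_ctrl) + p_tx * (1 - p_tx)
    return math.ceil(2 * (za + zb) ** 2 * var / (eps - delta) ** 2)

# powered at no improvement (both rates 0.5), futility threshold 10 points
print(n_total_futility(0.5, 0.5, delta=0.10, alpha=0.10, power=0.80))  # 451
```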

Case Study: Deferoxamine in Intracerebral Hemorrhage

A brief summary of the Intracerebral Hemorrhage Deferoxamine (i-DEF) trial is provided here; the interested reader is referred to Selim et al. (2019) for a detailed
description of the rationale, methods, and findings. This multicenter, randomized,
double-blind, placebo-controlled trial was designed to evaluate whether
deferoxamine is futile for the purpose of improving good outcome, defined via
modified Rankin Scale score 0–2. A weighted average derived from the available
literature suggests a good outcome rate in the control arm of 28%. Recently
conducted Phase III trials in this patient population have targeted a minimum
clinically important difference of 10%. Noting that effect size estimates in confir-
matory trials tend to be smaller than the earlier phase counterparts (potentially due to
greater heterogeneity with a larger number of participating clinical sites in Phase III),
it was decided that a treatment effect less than 12% in favor of deferoxamine would
be considered futile, resulting in the following statistical hypotheses:

H0: πtx − πctrl ≥ 0.12
HA: πtx − πctrl < 0.12

In order to evaluate futility with 80% power when the two treatments have the
same proportion, using a one-sided 0.10 level of significance, 253 subjects are
required. The sample size was inflated to account for loss to follow-up, consent
withdrawal, etc., resulting in a maximum sample size of 294 subjects. At the
conclusion of the trial, the observed good outcome rate in the control arm was
slightly higher than anticipated (34%, vs. the anticipated 28%). The power of the
trial to declare futility is affected by the discrepancy between the observed control
outcome and the assumed control outcome, particularly in binary outcome studies.
However, the futility threshold is not dependent on the assumed control response,
and the statistical test for futility is based on the observed response in both groups.
Therefore, the concurrent control approach allows the design to compensate, to some
extent, for the drawbacks of the single-arm approach, albeit with a larger sample
size.
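Under the same normal-approximation formula (a sketch of the arithmetic, not the investigators' code), the stated i-DEF design assumptions reproduce the 253-subject figure:

```python
import math
from scipy.stats import norm

# i-DEF assumptions: control rate 0.28, powered at equal outcome rates,
# futility threshold 0.12, one-sided alpha = 0.10, power = 0.80
p_ctrl = p_tx = 0.28
delta, alpha, power = 0.12, 0.10, 0.80
za, zb = norm.ppf(1 - alpha), norm.ppf(power)
var = p_ctrl * (1 - p_ctrl) + p_tx * (1 - p_tx)
n_total = 2 * (za + zb) ** 2 * var / (p_tx - p_ctrl - delta) ** 2
print(math.ceil(n_total))  # 253; inflated to 294 in the trial for attrition
```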

Analysis

Standard statistical testing procedures can be appropriately modified to reflect the key components of the futility design: the one-sided nature of the alternative
hypothesis and the nonzero null value. The futility hypothesis can be analyzed via
statistical hypothesis test, with the corresponding one-sided p-value used to support
the conclusion. It is important to note, however, that the p-value is often used to
describe the level of evidence supporting a difference between two treatments, which
would not be a correct interpretation in the futility design. As a result, it may be
preferable to conduct the futility evaluation using a one-sided confidence boundary
on the treatment effect. This would serve to both provide a consistent reminder of the
futility threshold and prevent confusion in interpretation.
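As an illustration of the confidence-bound approach, the sketch below uses a simple Wald-type bound and entirely hypothetical counts; futility is declared when the one-sided upper confidence bound for the treatment effect falls below the threshold δ.

```python
from scipy.stats import norm

def futility_by_bound(x_tx, n_tx, x_ctrl, n_ctrl, delta, alpha):
    """Declare futility if the one-sided upper (1 - alpha) Wald
    confidence bound for p_tx - p_ctrl lies below delta."""
    p_tx, p_ctrl = x_tx / n_tx, x_ctrl / n_ctrl
    se = (p_tx * (1 - p_tx) / n_tx + p_ctrl * (1 - p_ctrl) / n_ctrl) ** 0.5
    upper = (p_tx - p_ctrl) + norm.ppf(1 - alpha) * se
    return upper, upper < delta

# hypothetical data: 120/226 good outcomes on treatment, 113/225 on control
print(futility_by_bound(120, 226, 113, 225, delta=0.10, alpha=0.10))
# upper bound ~0.089 < 0.10, so futility would be declared here
```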

Sample Size Considerations

Recall the sample size calculation for comparing two independent proportions in a
superiority design. Let z_{1−α/2} and z_{1−β} represent the corresponding quantiles from
the standard normal distribution. Assume the true control response to be πctrl, and let
πtx be derived from the minimum clinically important improvement (ε) over the
assumed control response, such that πtx − πctrl = ε. The sample size required to
achieve power (1 − β), using a two-sided α level of significance, and assuming equal
allocation to the treatment arms, is defined according to the formula below:

n = 2 [(z_{1−α/2} + z_{1−β})² / ε²] × [πctrl(1 − πctrl) + πtx(1 − πtx)]

The sample size calculation for the futility design follows the same algebraic
formulation but reflects the key components of the futility design:

1. The level of significance is one-sided, as suggested by the trial objectives.


2. Because the superiority setting generally tests against a null value of 0, a
placeholder for the null value is often omitted from the formula. The futility
hypothesis, however, tests against a nonzero null value, the futility threshold δ,
which must be reflected in the calculation.

Using the same notation as above, and letting δ represent the futility threshold, the
sample size required to achieve power (1 − β) for evaluating the futility hypothesis is
defined as shown below:

n = 2 [(z_{1−α} + z_{1−β})² / (ε − δ)²] × [πctrl(1 − πctrl) + πtx(1 − πtx)]

Another distinguishing feature of the futility calculation is the effect size for
which the trial is powered. In the superiority setting, a trial is designed to achieve
adequate power to declare superiority under the assumption that some minimum
clinically relevant difference ε exists. In the futility setting, however, a trial is
designed to achieve adequate power to declare futility under the assumption that
any potential improvement in outcomes is less than an effect which is minimally
worthwhile from a clinical perspective. A scenario where there is no improvement
associated with treatment, for instance, would be considered truly futile, and so one
might assume ε = 0 for the power calculation. One might instead wish to target a
scenario where there is a small but clinically uninteresting improvement in out-
comes. In either case, the trial would have more than adequate power to declare
futility if the treatment decreases good outcomes.
The formulas provided here assume equal allocation to each treatment arm but are
easily modified to allow for an unequal allocation ratio.
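One common form of that modification, for a k:1 (treatment:control) allocation, is sketched below; this is a standard normal-approximation variant offered as an illustration, not a formula reproduced from this chapter.

```python
import math
from scipy.stats import norm

def n_ctrl_futility(p_ctrl, p_tx, delta, alpha, power, k=1.0):
    """Control-arm size for the futility test with n_tx = k * n_ctrl;
    k = 1 recovers the equal-allocation formula above."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    eps = p_tx - p_ctrl
    return ((za + zb) ** 2
            * (p_ctrl * (1 - p_ctrl) + p_tx * (1 - p_tx) / k)
            / (eps - delta) ** 2)

n_c = n_ctrl_futility(0.5, 0.5, delta=0.10, alpha=0.10, power=0.80, k=2.0)
print(math.ceil(n_c), math.ceil(2.0 * n_c))  # control arm, treatment arm
```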
We previously mentioned that the design parameters of the superiority design
could be revised in order to improve operating characteristics. Considering instead
a one-sided superiority hypothesis (i.e., HA: πtx − πctrl > 0), a total sample size of
451 subjects yields 81% power to detect a 10% absolute improvement under a one-
sided 0.10 level of significance. Given that the one-sided superiority design and the
futility design have approximately equivalent operating characteristics under con-
sistent assumptions, Levin (2012) argues that the futility approach is more consis-
tent with Phase II objectives and the possible conclusions are more concrete. As
described in the table below, the form of the statistical hypothesis has a very real
implication for the resulting inference. You may recall from introductory statistics
that there are only two plausible hypotheses, and we retain our belief in the null
hypothesis, unless the data overwhelmingly contradict it, leading us to believe in
the alternative hypothesis. In the one-sided superiority setting, a nonsignificant test
result leads to a statement that “there is insufficient evidence to conclude that the
intervention is better” and therefore requires that we accept as plausible that the
intervention does not have a positive effect. In the futility setting, however, a
nonsignificant test result leads to a statement that “there is insufficient evidence to
conclude that the treatment effect is less than δ” and allows us to accept as
plausible that the effect of the intervention is at least minimally worthwhile. In
either case, the confidence interval may be used to evaluate which effect sizes
remain plausible based on the available data, but in the futility setting, the decision-
making process does not rely on such post hoc evaluations of the resulting
confidence interval.

                          Superiority setting                  Futility setting
Statistical hypotheses    H0: πtx − πctrl ≤ 0                  H0: πtx − πctrl ≥ δ
                          HA: πtx − πctrl > 0                  HA: πtx − πctrl < δ
Significance              p-value < α                          (1 − α) × 100% one-sided
evaluated via                                                  confidence bound
Nonsignificant result     There is insufficient evidence to    There is insufficient evidence to
means that                conclude that the intervention       conclude that the intervention
                          is better                            is futile
Implications for          Unclear, given the potential for     Confirmatory efficacy evaluation is
future study              criticism as an underpowered         warranted to definitively evaluate
                          Phase III trial                      treatment efficacy

Interim Analysis

Interim analysis for statistical futility is somewhat common in today's funding climate. An interim analysis for statistical futility allows the trial to terminate early
if it becomes overwhelmingly clear that the trial cannot accomplish its stated
objective. Understandably, the funding agency may want to cut their losses if, during
the course of the study, it becomes clear that the null hypothesis is very unlikely to be
rejected at the conclusion of the trial (i.e., it is statistically futile to continue). Group
sequential monitoring approaches, such as alpha- and beta-spending functions, can
be applied to the futility design as well. However, such interim analysis for statistical
futility is inherently different than the futility design we have described, and as
before, it is important to note that the consequences of such interim analysis depend
on the formulation of the statistical hypotheses.
In the superiority setting, an interim analysis for statistical futility allows the trial
to terminate if it becomes clear that the trial will not demonstrate a difference
between the treatment arms. From both a logistical standpoint and an ethical one,
termination of a superiority trial in the face of statistical futility may be of interest.
One would not want to continue randomizing participants to an experimental
intervention for which there is little chance of showing a benefit. In the context of
the futility design, however, an interim analysis for statistical futility allows the trial
to terminate if it becomes clear that the trial will not demonstrate that the intervention
is futile. Early termination in this scenario is likely not in the best interest of either
the scientific community or the funding agency, as continued enrollment would
allow a more precise estimate of both the control response and the treatment effect,
as well as a more detailed evaluation of the safety profile of the intervention. For this
reason, the remainder of our interim analysis discussion is focused on interim
analysis for efficacy.
In the context of the futility design, interim analysis for evidence of “efficacy”
would allow the trial to terminate if it became overwhelmingly clear that the
intervention is futile. It may be worth terminating the trial early in that case, but
the operating characteristics of such an analysis should be evaluated during the study
design phase. The usual alpha- and beta-spending functions applicable to superiority
designs are also applicable in the context of futility designs, but the function which is
considered optimal for superiority may not be so for futility. The O’Brien-Fleming
spending function (1979) spends alpha more conservatively than its Pocock coun-
terpart (1977), which means that it is more difficult to terminate under O’Brien-
Fleming. This may be desirable in the context of the superiority design, where the
consequence of such termination is to declare the intervention efficacious, thereby
making it available as a treatment option to the target population. In the context of
the futility design, however, one might wish to be more liberal in terms of early
stopping, given that the consequence of such termination is to declare that the
intervention is not sufficiently promising to warrant further study.
Consider a concurrently controlled futility design, sized at 451 subjects in order to
achieve 80% power to declare futility against a 10% absolute improvement, with a
prespecified interim analysis to be conducted after 50% of subjects have completed
the primary follow-up period. The O’Brien-Fleming boundary would call for termi-
nation only if the estimated treatment effect were greater than 3.6 percentage points
in the wrong direction (in favor of the control arm); under the alternative hypothesis
of no difference between the arms, the trial would have a 30% likelihood of
terminating early to declare the intervention futile. The Pocock boundary, on the
other hand, would call for termination if the estimated treatment effect were greater
than 0.2 percentage points in the wrong direction; under the alternative hypothesis of
no difference, the trial would have a 49% likelihood of terminating early to declare
the intervention futile.
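Taking the boundary effect sizes quoted above as given, the stated stopping probabilities can be reproduced from the normal distribution of the interim z statistic; the sketch below is arithmetic illustration only, not a derivation of the O'Brien-Fleming or Pocock boundaries themselves.

```python
from scipy.stats import norm

n_per_arm = 113                # ~50% of 451 subjects under 1:1 allocation
delta, p = 0.10, 0.5           # futility threshold; assumed common outcome rate
se = (2 * p * (1 - p) / n_per_arm) ** 0.5

z_mean = (0.0 - delta) / se    # interim z under the alternative of no difference
for name, boundary_effect in [("O'Brien-Fleming", -0.036), ("Pocock", -0.002)]:
    z_boundary = (boundary_effect - delta) / se
    print(name, round(norm.cdf(z_boundary - z_mean), 2))
# ~0.30 and ~0.49 (up to rounding of the stated boundary effect sizes)
```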
When considering interim analysis, investigators should be aware that treatment
effect estimates can be unstable with small sample sizes and tend to stabilize over
time as outcomes accrue. Decision-making in the interim, when the effect estimates
may yet be unstable, could lead to termination on a random high or low. Bassler et al.
(2010) conducted a systematic review and meta-analysis and concluded that trials
which are terminated early for benefit yield biased estimates of treatment effect.
Early termination yields a smaller than anticipated sample size and a correspond-
ingly imprecise estimate of the treatment effect. In the superiority design, a precise
estimate of the treatment effect is desirable, especially when the intervention is
shown to be efficacious. One could argue, however, that concern over the impreci-
sion of the effect estimate can be overcome if there is truly overwhelming evidence
of benefit. In the futility design, a precise estimate of the treatment effect may not be
as important in the face of a futility declaration.

Protocol Adherence

The occurrence of protocol violations (treatment crossovers, failure to administer the intervention correctly, inclusion of ineligible subjects, etc.) tends to dilute the
treatment effect. In the superiority design, the result is a movement of the estimate
away from the alternative of superiority and toward the null hypothesis; the sample
size is often increased in order to compensate for the corresponding reduction in
power. In the futility design, the result of this dilution is a movement of the
estimate toward the alternative hypothesis. The sample size cannot be increased
to compensate for the resulting increase in the Type I error probability. While it is
always important to encourage strict adherence to the protocol, investigators
should be aware that protocol nonadherence can make a futility declaration more
likely.

Sequential Futility Designs

One can easily envision the futility design as a second stage in a sequential early
phase trial, following either a dose-finding or dose-selection stage. Levy et al. (2006)
developed a two-stage selection and futility design to sequentially select the dose of
Coenzyme Q10 and subsequently to evaluate the futility of the selected dose, in
ALS. The trial design is briefly described here; the interested reader is referred to
Levy et al. (2006) for details. The first (selection) stage was designed and conducted
according to statistical selection theory, with the sample size determined in order to
yield a high probability that the superior dose would be selected. Subjects were
randomly allocated to one of three treatment arms (one of two active doses or a
concurrent placebo), and at the conclusion of this stage, the preferred active dose was
selected. The second stage was designed and conducted according to the futility
design. Subjects were randomly allocated to one of two treatment arms (the active
dose selected in the first stage or a concurrent placebo). At the conclusion of the
second stage, the futility analysis compared the active dose selected in the first to the
concurrent placebo, using the subjects randomized to those arms in both stage 1 and
stage 2. Levy et al. (2006) note that a bias is introduced into the final futility
evaluation because the best dose was selected in the first stage and that same data
are used in the second stage analysis, and their methodology includes an appropriate
bias correction.

Summary and Conclusion

The futility design can be used to provide a clear “no go” signal for evaluating
whether an intervention shows sufficient promise to warrant confirmatory testing.
The single-arm futility design yields dramatic sample size savings, but the need to
derive a fixed reference value can be difficult. This drawback can be overcome using
either a calibration control or a concurrent control, to directly compare against the
intervention. The statistical hypotheses, as stated in the futility design, are more in
keeping with the Phase II objective than the hypotheses of the superiority design.
Because the alternative hypothesis is used to describe futility, a statistically signif-
icant finding indicates that the intervention does not warrant confirmatory efficacy
testing, whereas a nonsignificant finding suggests that the intervention should be
moved forward for further evaluation.

Key Facts

The Phase II trial is often used to evaluate whether an intervention has sufficient
efficacy signal to warrant confirmatory testing. The typical design, a superiority
design powered to detect large effect sizes, can be criticized as an underpowered
Phase III trial. The futility design reverses the statistical hypotheses, in order to weed
out interventions which do not warrant further testing. The design provides a clear
“no go” signal as to whether an intervention should be moved to Phase III efficacy
evaluation.

Cross-References

▶ Middle Development Trials


▶ Randomized Selection Designs
▶ Use of Historical Data in Design

References
Bassler D, Briel M, Montori VM, Lane M, Glasziou P, Zhou Q, Heels-Ansdell D, Walter SD,
Guyatt GH, The STOPIT-2 Study Group (2010) Stopping randomized trials early for benefit and
estimation of treatment effects: systematic review and meta-regression analysis. J Am Med
Assoc 303:1180–1187
Brown SR, Gregory WM, Twelves CJ, Buyse M, Collinson F, Parmar M, Seymour MT, Brown JM
(2011) Designing phase II trials in cancer: a systematic review and guidance. Br J Cancer
105:194–199
Chalmers TC, Block JB, Lee S (1972) Controlled studies in clinical cancer research. N Engl J Med
287:75–78
Chen X, Wang K (2016) The fate of medications evaluated for ischemic stroke pharmacotherapy
over the period 1995–2015. Acta Pharm Sin B 6:522–530
Djulbegovic B, Kumar A, Soares HP, Hozo I, Bepler G, Clarke M, Bennett CL (2008) Treatment
success in cancer: new cancer treatment successes identified in phase 3 randomized controlled
trials conducted by the National Cancer Institute-Sponsored Cooperative Oncology Groups,
1955–2006. Arch Intern Med 168(6):632–642
Estey EH, Thall PF (2003) New designs for phase 2 clinical trials. Blood 102:442–448
Garrett-Mayer E (2006) The continual reassessment method for dose-finding studies: a tutorial.
Clin Trials 3:57–71
Herson J (1979) Predictive probability early termination plans for phase II clinical trials. Biometrics
35:775–783
Herson J, Carter SK (1986) Calibrated phase II clinical trials in oncology. Stat Med 5:441–447
Hwang TJ, Carpenter D, Lauffenburger JC, Wang B, Franklin JM, Kesselheim AS (2016) Failure of
investigational drugs in late-stage clinical development and publication of trial results. JAMA
Intern Med 176:1826–1833
IMS Investigators (2004) Combined intravenous and intra-arterial recanalization for acute ischemic
stroke: the Interventional Management of Stroke Study. Stroke 35:904–911
Kaufmann P, Thompson JL, Levy G, Buchsbaum R, Shefner J, Krivickas LS, Katz J, Rollins Y,
Barohn RJ, Jackson CE, Tiryaki E, Lomen-Hoerth C, Armon C, Tandan R, Rudnicki SA,
Rezania K, Sufit R, Pestronk A, Novella SP, Heiman-Patterson T, Kasarskis EJ, Pioro EP,
Montes J, Arbing R, Vecchio D, Barsdorf A, Mitsumoto H, Levin B, QALS Study Group (2009)
Phase II trial of CoQ10 for ALS finds insufficient evidence to justify phase III. Ann Neurol
66:235–244
Kidwell CS, Liebeskind DS, Starkman S, Saver JL (2001) Trends in acute ischemic stroke trials
through the 20th century. Stroke 32:1349–1359
Levin B (2012) Chapter 8: Selection and futility designs. In: Ravina B, Cummings J, McDermott
MP, Poole M (eds) Clinical trials in neurology. Cambridge University Press, Cambridge
Levin B (2015) The futility study – progress over the last decade. Contemp Clin Trials 45:69–75
Levy G, Kaufmann P, Buchsbaum R, Montes J, Barsdorf A, Arbing R, Battista V, Zhou X,
Mitsumoto H, Levin B, Thompson JLP (2006) A two-stage design for a phase II clinical trial
of coenzyme Q10 in ALS. Neurology 66:660–663
NINDS NET-PD Investigators (2006) A randomized, double-blind, futility clinical trial of creatine
and minocycline in early Parkinson disease. Neurology 66:664–671
NINDS NET-PD Investigators (2007) A randomized clinical trial of coenzyme Q10 and GPI-1485
in early Parkinson disease. Neurology 68:20–28
O’Brien PC, Fleming TR (1979) A multiple testing procedure for clinical trials. Biometrics
35:549–556
Palesch YY, Tilley BC, Sackett DL, Johnston KC, Woolson R (2005) Applying a phase II futility
study design to therapeutic stroke trials. Stroke 36:2410–2414
Parkinson Study Group (1989) Effect of deprenyl on the progression of disability in early
Parkinson’s disease. N Engl J Med 321:1364–1371
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic
Dis 29:175–188
Pocock SJ (1977) Group sequential methods in the design and analysis of clinical trials. Biometrika
64:191–199
Sacks LV, Shamsuddin HH, Yasinskaya YL, Bouri K, Lanthier ML, Sherman RE (2014) Scientific
and regulatory reasons for delay and denial of FDA approval of initial applications for new
drugs, 2000–2012. JAMA 311:378–384
Selim M, Foster LD, Moy CS, Xi G, Hill MD, Morgenstern LB, Greenberg SM, James ML, Singh
V, Clark WM, Norton C, Palesch Y, Yeatts SD, on behalf of the iDEF Investigators (2019)
Deferoxamine mesylate in patients with intracerebral haemorrhage (i-DEF): a multicenter,
placebo-controlled, randomized, double-blind phase 2 trial. Lancet Neurol 18(5):428–438
Tilley BC, Palesch YY, Kieburtz K, Ravina B, Huang P, Elm JJ, Shannon K, Wooten GF,
Tanner CM, Goetz GC, on behalf of the NET-PD Investigators (2006) Optimizing the ongoing
search for new treatments for Parkinson disease: using futility designs. Neurology 66:628–633
59 Interim Analysis in Clinical Trials

John A. Kairalla, Rachel Zahigian, and Samuel S. Wu

J. A. Kairalla (*) · S. S. Wu
University of Florida, Gainesville, FL, USA
e-mail: johnkair@ufl.edu; sw45@ufl.edu

R. Zahigian
Vertex Pharmaceuticals, Boston, MA, USA
e-mail: [email protected]

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
Types of Data Used in Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1086
Applications of Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
Methods of Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089
Non-comparative Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089
Comparative Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1090
Planning Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1092
Oversight and Maintaining Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1094
Applications and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1098
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100

Abstract
Modern randomized controlled trials often involve multiple periods of data
collection separated by interim analyses, where the accumulated data is analyzed
and findings are used to make adjustments to the ongoing trial. Various endpoints
can be used to influence these decisions, including primary or surrogate outcome
data, safety data, administrative data, and/or new external information. Example
uses of interim analyses include deciding if there is evidence that a trial should be
stopped early for safety, efficacy, or futility or if the treatment allocation ratios
should be modified to optimize trial efficiency and better align the risk-benefit
ratio. Additionally, a decision could be made to lengthen or shorten a trial based
on observed information. To avoid unwanted bias, studies known as adaptive
design clinical trials pre-specify these decision rules in the study protocol.
Extensive simulation studies are often required during study planning and
protocol development in order to characterize operating characteristics and
validate testing procedures and parameter estimation. Over time, researchers
have gained a better understanding of the strengths and limitations of employing
interim analyses in their clinical studies. In particular, with proper planning and
conduct, adaptive designs incorporating interim analyses can provide great ben-
efits in flexibility and efficiency. However, an increase in infrastructure for
development and planning is needed to successfully implement adaptive designs
and interim analyses and allow their potential advantages to be achieved in
clinical research.

Keywords
Adaptive design · Early stopping · Flexible design · Futility · Interim analysis ·
Interim monitoring · Group sequential · Safety monitoring · Nuisance parameter ·
Sample size

Introduction

This chapter describes the concept of interim analysis (IA, also generally referred to
as interim monitoring) in clinical trials and how such analyses are used to enhance and
optimize the conduct of clinical studies. A specific focus is placed on the use of
interim analyses in a class of clinical trials known as adaptive designs (ADs). The
chapter begins with an overview including definitions, a brief history, and motiva-
tions. It then describes the type of data used in IAs to inform decision-making and
various possible applications. Possible study adjustments based on IA data include
sample size re-estimation (SSR), early stopping, safety monitoring, and treatment
arm modification. Following descriptions of planning considerations, oversight, and
results reporting, various examples are summarized. Discussion topics include
highlighting the emergence of Bayesian methodology in clinical trials with IAs
and describing some logistical barriers that must be addressed in order for clinical
research to benefit from interim decision-making.

Background and Motivation

Evidence of efficacy and safety of new interventions is usually provided through the
conduct and analysis of randomized controlled trials (RCTs). Traditionally, RCTs are
largely inflexible: many design components such as meaningful treatment effects,
outcome variability, patient population, and primary endpoint are specified and fixed
before trial enrollment begins. The trial is then sized and conducted with statistical
power (such as 80%) and an allowable type I error rate (such as 5% for 2-sided tests)
in mind for the given set of study assumptions, with analysis conducted once all of
the information has been collected. However, if study assumptions are incorrectly
specified, the trial may produce inaccurate or ambiguous results, with significant
time and resources largely wasted. Additionally, ignoring accruing safety informa-
tion during study conduct could lead to important ethical concerns. For these
reasons, various forms of IAs are included in many modern trial plans. The US
Food and Drug Administration (FDA) defines an IA as “any examination of data
obtained from subjects in a trial while that trial is ongoing. . . [including] . . .baseline
data, safety outcome data, pharmacokinetic, pharmacodynamic or other biomarker
data, or efficacy outcome data” (FDA 2018). The idea of IAs was described in the
1967 Greenberg Report (Heart Special Project Committee 1988), which highlighted
the potential benefits to stopping a trial early for efficacy or futility. The Greenberg
Report also illustrated the necessity of an independent Data Monitoring Committee
(DMC, also known as Data Safety and Monitoring Board or similar) to evaluate
interim data and provide recommendations. In a trial with IAs, the information that is
collected partway through a trial is used to inform a trial’s future in some manner.
This added flexibility can be very appealing to researchers, stakeholders and spon-
sors of the trial, regulatory agencies, and study participants. Accumulating data can
inform about efficacy, event rates, variability, accrual rates, protocol violations,
dropout rates, and other useful study elements. Using this information, various
study decisions can be made, including closing a study for safety, efficacy, or futility,
updating intervention allocation ratios, altering treatment dosing and regimens,
changing primary endpoints, re-estimating the sample size, or altering the study
population of interest.
A well-known fact of repeated testing of hypotheses is a potentially inflated
overall type I error rate: each time the data is evaluated, there is an additional chance
of making a false-positive conclusion (Armitage et al. 1969). In 1977, Pocock
proposed repeated significance testing with equally sized groups of sequentially
evaluated subjects using a fixed, but reduced nominal significance level to control
the overall type I error rate (Pocock 1977). This began the development and
implementation of a popular class of study designs called group sequential methods
(GSMs).
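A small simulation illustrates the inflation (an illustrative sketch with an arbitrary look schedule; five looks at a nominal 0.05 level give an overall error rate near the classic 14% figure of Armitage et al. 1969):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, looks = 20_000, np.array([100, 200, 300, 400, 500])
z_crit = 1.96                                 # nominal two-sided 0.05 each look

x = rng.standard_normal((n_sims, looks[-1]))  # data under the null hypothesis
z = x.cumsum(axis=1)[:, looks - 1] / np.sqrt(looks)
print((np.abs(z) > z_crit).any(axis=1).mean())  # ~0.14 rather than 0.05
```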
An alternative approach comes from the idea of partitioning a trial into distinct
stages separated by IAs. Each stage is analyzed separately, and the data is combined
using a pre-specified combination function (Bauer and Kohne 1994). This method
can be implemented for both ADs and flexible designs, which allow for planned or
unplanned study modifications between stages, including early stopping, trial exten-
sions, and many other stagewise modifications, without type I error rate inflation
(Proschan and Hunsberger 1995).
Adaptive designs (ADs) are an important class of RCTs that incorporate IAs in
which accumulating data is used to inform how the study should proceed using only
pre-specified modification rules (FDA 2018). A beneficial property of this rules-
based approach is that study operating characteristics (e.g., power, type I error rate,
expected sample size) can be exhaustively explored under various scenarios before
trial implementation by undertaking sensitivity analyses based either on known
theory or simulation studies (Kairalla et al. 2012). An implementation of ADs that
has experienced much attention and methodological development over the last
30 years is sample size re-estimation (SSR). Most SSR procedures aim to stabilize
the study power by updating study planning assumptions using observed data, such
as using an interim estimate of the treatment effect (Cui et al. 1999), or modifying the
sample size based on an updated nuisance parameter: a planning parameter, such as
variance, that is not of primary concern but that affects the statistical properties of a
study (Wittes and Brittain 1990). SSR procedures may or may not include early
stopping features in addition to repowering a trial. Methodological development for
ADs as a broad category is significant, and implementation frequency is increasing.
Years of statistical development and discussion followed before regulatory guidance
documents were released in Europe and the United States (EMA 2007; FDA 2018).
When correctly implemented, IAs can lead to improvements in resource and
statistical efficiency since there is potentially a higher chance of correctly detecting
a treatment effect if one exists or stopping a trial early to save resources if a
conclusion is clear. There are also ethical benefits for early stopping and safety
monitoring using IAs. These advantages motivate the remainder of this chapter as
they highlight the potential benefits to consider when planning a clinical trial using
IAs.

Types of Data Used in Interim Analysis

Non-comparative versus Comparative Data: IAs in RCTs use various types of information to make decisions and inform a trial's conduct. This information can
be either non-comparative or comparative: non-comparative information does not
reveal treatment assignment in any manner, whereas decisions made using accumu-
lating comparative information involve knowledge of actual or masked (e.g., A vs.
B) treatment assignments. While investigator’s knowledge of non-comparative
information (e.g., accrual rates, pooled variance, or event rates) poses less of a
concern to trial integrity, it is also more limited in IA possibilities. Of note, a
particular IA can use non-comparative information regardless of whether or not
the trial team is masked to treatment assignment (FDA 2018).
Administrative: Various administrative items can be used to inform the decisions
made in IAs. These usually consist of non-comparative study elements such as
accrual rates or overall event rates. A study team (together with a DMC) could
make decisions at interim when faced with administrative data. For example, if the
accrual or event rates are lower than expected, the desired statistical power for a
given effect size will not be reached in an allotted time frame. In order to increase
recruitment, the DMC could suggest relaxing eligibility criteria, closing the trial
early, or extending the accrual period.
Nuisance Parameters: In the initial design phase, assumptions must be made
about nuisance parameters to properly size a trial. However, at IA periods, one can
evaluate the accumulated data and use it to modify initial assumptions. The most
common nuisance parameters are estimates of variance in continuous outcome
settings and estimates of control group event rates in binary outcome settings.
While most methods use comparative information by taking advantage of group
assignment (such as using residual variance), non-comparative information can also
be used to incorporate nuisance parameter updates (Gould and Shih 1992).
Safety: If an intervention is not proven to be relatively safe, it will not be approved by the
FDA or other regulatory bodies regardless of its efficacy for a given endpoint. Early
phase clinical trials have safety endpoints, such as dose-limiting toxicities, as
primary outcomes. Confirmatory trials also include safety and adverse event mon-
itoring, with both explicit rules and ad hoc safety considerations taken into account
at IAs and at final evaluations.
Study and Surrogate Endpoints: Generally, the main study endpoint in a late-
phase RCT is intervention efficacy. Decisions made at IAs often involve comparative
outcome data among the different treatment arms. In this case, a decision to
terminate a trial is reflected directly in the trial’s efficacy outcome, e.g., a study
could be stopped if there is significant evidence that a new treatment is superior to
standard therapy. A surrogate endpoint is a response variable that is assumed to
correlate directly to the primary endpoint. With scientific rationale and research
justification, it may be possible to use this surrogate outcome as a short-term
substitute for the primary outcome of interest at IAs to reduce study timelines. For
example, the minimal residual disease is a quantifiable value of residual blood cancer
that is correlated with relapse risk; it has been considered as a possible surrogate for
event-free survival, which would require waiting until enough patients have relapses
or other events to conduct an IA.
External Information: Information gathered outside of an enrolling trial (e.g.,
from a similar trial) may alter knowledge of expected study outcomes or safety
information. Rather than permanently stopping the ongoing trial, changes can be
made at interim periods. Unplanned IAs may result, often with study amendments
justifying the changes and resulting statistical properties. In order to maintain
validity, it is important that internal comparative data are not considered when
making such trial modifications. Some designs (e.g., flexible designs) can handle
this naturally by analyzing the sequential cohorts separately.

Applications of Interim Analysis

IAs can be ad hoc in nature (typically requiring an amendment to make study modifications), can use flexible design methodology in a manner not anticipated
during study design, or can be incorporated into formal, pre-planned ADs. They are
useful in both early phase exploratory studies and late-phase confirmatory studies.
While the most traditional application of IAs is study outcome monitoring (including
early stopping rules for efficacy and futility), they can also inform valuable safety
monitoring and monitoring of administrative data, such as patient enrollment rates.
Some general areas for application of IAs are described here.
Flexible Designs: Flexible designs are characterized by having multiple stages
and IAs that can be both planned and unplanned. Although flexible design methods
have advantages and proponents, according to FDA guidance, unplanned
modifications create difficulty with statistical properties and trial interpretations
and should be limited to pressing issues such as unexpected toxicity in a particular
treatment arm or in response to unexpected outside information (FDA 2018). While
statistical approaches to incorporating changes and controlling type I error for
flexible designs exist (Proschan and Hunsberger 1995), there are other potential
statistical and scientific issues associated with unplanned design changes, including
reduced interpretability, inefficiency, and potential violation of principles of statis-
tical inference (Burman and Sonesson 2006).
Adaptive Designs: Brannath et al. state that “Many designs have been suggested
which incorporate adaptivity, however, are in no means flexible, since the rule of
how the interim data determine the design of the second part of the trial is assumed to
be completely specified in advance” (Brannath et al. 2007). ADs are a subset of
flexible designs characterized by formal or binding procedures with resulting deci-
sions pre-specified in the study protocol. By adhering to a specified plan, uncertain
sources of bias can be avoided. Extensive planning can be undertaken to enumerate
and describe study operating characteristics before the study plan is implemented.
Elements of the study that should be pre-defined in the study protocol include the
number and timing of the planned IAs, the types of adaptations and/or possible
stopping scenarios, and the statistical inference methods that will be used to prevent
erroneous conclusions (FDA 2018). Of note, although GSMs have a somewhat
unique developmental history, they do allow a study to stop early for efficacy or
futility according to pre-defined rules, falling under the general definition of ADs.
Thus, they are the most widely used and well-known form of AD.
Phases of Study: ADs have gained considerable acceptance in early “learning
stage” trial settings, where various information about the characteristics of a drug can
be gathered, with less focus on tight control of false-positive probabilities. An AD in
an exploratory setting may gather valuable information regarding treatment dosing,
safety, pharmacodynamics, and patient response that can be used in future confir-
matory studies. For example, investigators can use ADs in exploratory settings to
determine the maximum tolerated dose, or the highest dose that is deemed to be safe
enough for further research (Garrett-Mayer 2006). In late learning stage designs,
ADs have been used in dose-ranging studies to find efficacious doses to pass on to
confirmatory phase trials (e.g., see the ASTIN study (Krams et al. 2003)). Using IAs
in the exploratory phase can efficiently push promising treatments down the devel-
opment pipeline in a setting where type I errors are less important, since confirma-
tory trials are still required.
Adaptive methodology can be applied to confirmatory trials in order to evaluate
the safety and efficacy of a particular intervention. Generally, confirmatory trials are
held to a higher standard with regard to statistical rigor, and thorough justifications
and simulations must be conducted (FDA 2018). Adaptive enrichment designs,
sample size re-estimation, and adaptive seamless designs are applications of IAs in a
confirmatory setting that will be discussed.
Safety Monitoring: Accumulating data can inform investigators about potential
safety concerns of a particular intervention or dose. Adaptations at IAs are often
planned with both efficacy and safety in mind. For example, exploratory dose-
finding trials have adaptations planned on safety and toxicity. Additionally, in
confirmatory trials, interventions will not gain regulatory approval unless they
have been shown to have an acceptable risk-benefit ratio. It is necessary in the
planning phase to consider the minimum amount of data required in order to obtain
sufficient safety information; this must be accounted for when planning GSMs (and
other IA plans) that may stop early for efficacy (FDA 2018).
Futility Monitoring: Futility stopping in a RCT is an appealing option that can
improve overall clinical research efficiency by stopping a trial when there is statis-
tical evidence that a trial is unlikely to show efficacy if allowed to continue. Futility
rules can be binding or nonbinding; as futility stopping does not increase the type I
error rate, both can be appropriate so long as they are accounted for transparently in
the statistical analysis (FDA 2018). In non-binding cases, results can be summarized
and presented with recommendations to the DMC, which makes decisions regarding
trial alterations or continuations. One class of futility monitoring rules is based on
repeatedly testing the alternative hypothesis at a fixed significance level (such as
0.005) and stopping for futility if the alternative hypothesis is rejected at any point
(Anderson and High 2011; Fleming et al. 1984). An alternative approach is stochas-
tic curtailment based on conditional power arguments (Lachin 2005). Here, evidence
to stop for futility is based on a low probability of correctly detecting a statistically
significant result at the end of the study, given the current data.
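A minimal sketch of a conditional power calculation under the "current trend" assumption (one-sided test, information fraction t) follows; the formula is a standard Brownian-motion approximation, and the numbers are hypothetical.

```python
from scipy.stats import norm

def conditional_power(z_interim, t, alpha=0.025):
    """P(final z exceeds its critical value | interim z at information
    fraction t), assuming the current trend (drift z_interim / sqrt(t))
    continues for the remainder of the trial."""
    z_crit = norm.ppf(1 - alpha)
    drift = z_interim / t ** 0.5
    mean_final = z_interim * t ** 0.5 + drift * (1 - t)
    return norm.sf((z_crit - mean_final) / (1 - t) ** 0.5)

# weak signal halfway through: conditional power ~4% may suggest futility
print(conditional_power(z_interim=0.5, t=0.5))
```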

Methods of Interim Analysis

Once interim data is collected, it is used to inform trial modifications that can better
achieve study goals. Methodological results and analytic descriptions of IAs in RCTs
are extensive and will not be comprehensively reviewed here. Additionally, compli-
cated and novel designs are frequently proposed that do not neatly fall into one of the
below categories. However, some key design classes are highlighted to summarize
the various approaches to using interim data in RCTs.

Non-comparative Designs

IAs using non-comparative data do not include information about masked or unmasked treatment assignment. Trial adaptations in this scenario are based on
aggregate data across treatment assignments, and pooled analysis is typically used.
These methods are appealing from a regulatory standpoint because decisions made
based on non-comparative data have negligible effect on type I error rates. One area
of application is in updating nuisance parameter values in SSR designs.
To account for the fact that the pooled variance estimate is not independent of the
treatment effect, adjustments are made based on the planned treatment effect (Friede
and Kieser 2011; Gould and Shih 1992). Another application of non-comparative
ADs is in looking at outcome data (such as event rates) for a particular biomarker
group in order to assess and optimize an enrichment strategy (FDA 2018).

Comparative Designs

Adaptations that are made based on comparative data, or data that uses information
about treatment assignment, can affect the overall type I error rate more drastically
than adaptations based on non-comparative data and thus must be justified and
accounted for in statistical methods (FDA 2018). There are many described ADs
using comparative data, with some broad classes summarized here.
Group Sequential Methods: GSM clinical trials involve several prospectively
planned IAs on sequentially enrolled groups of subjects and involve a decision
about whether a trial should stop early based on observed interim treatment
outcomes. At an IA, a trial can be stopped for efficacy or futility, and the type I
error inflation inherent to repeated uncorrected significance testing is controlled
through developed stopping bounds. As mentioned in the “Background and Moti-
vation” section, the first proposed bounds involved a fixed, but adjusted nominal
significance level at all testing points (Pocock 1977). Other notable GSM stopping
bounds include the popular O’Brien-Fleming bounds that start conservative and
become more liberal as a study accrues more information (O’Brien and Fleming
1979) and the more flexible GSMs that utilize α-spending functions to allow for
flexible number and timing of analyses (Lan and DeMets 1983). GSMs have ethical
and efficiency advantages by reducing expected sample size versus fixed sample
designs, as well as versus other forms of IAs such as adaptive combination tests and
SSRs (Jennison and Turnbull 2006; Tsiatis and Mehta 2003). However, several
issues must be considered during planning and before the decision is made to stop
early. For one, when using flexible bounds such as those described by Lan and
DeMets (1983), the decision to perform an analysis should be specified by calendar
time or fractions of available information rather than influenced by observed trends.
Additionally, efficacy stopping should be rule-based and involve transparent
reporting that includes the stopping bounds considered at each IA. Finally, it is
important to consider how much additional precision, as well as secondary outcome
and safety information, is lost by stopping a trial early. To account for this, one
approach is to allow the first IA only after a minimum fraction of the planned sample
has been evaluated.
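
The effect of repeated testing, and how group sequential bounds control it, can be seen in a small simulation. Below is a minimal sketch under the null hypothesis with four equally spaced looks; the O'Brien-Fleming-type constant 2.024 is a standard tabulated value for this configuration, and all other settings are illustrative.

```python
# Minimal simulation sketch: type I error of repeated testing at four
# equally spaced looks, uncorrected versus O'Brien-Fleming-type bounds.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_per_stage, n_stages = 50_000, 25, 4

x = rng.standard_normal((n_sims, n_stages * n_per_stage))  # data under H0
looks = np.arange(1, n_stages + 1) * n_per_stage
z = np.cumsum(x, axis=1)[:, looks - 1] / np.sqrt(looks)    # interim z-statistics

t = looks / looks[-1]                    # information fractions 0.25, ..., 1
naive = np.full(n_stages, 1.96)          # uncorrected two-sided 0.05 tests
obf = 2.024 / np.sqrt(t)                 # O'Brien-Fleming-type bounds

for name, bounds in [("uncorrected", naive), ("OBF-type", obf)]:
    reject = (np.abs(z) >= bounds).any(axis=1)
    print(f"{name:12s} overall type I error ~ {reject.mean():.3f}")
# Typically prints ~0.13 for uncorrected testing and ~0.05 for the OBF bounds.
```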
Adaptive Combination Tests: In RCTs with IAs, outcomes for independent
cohorts of participants can be evaluated across time, with standardized test statistics
calculated separately in each cohort. By combining these test statistics in a pre-
determined way, type I error rate can be controlled regardless of whether or not the
trial is following pre-defined adaptation rules (Bauer and Kohne 1994; Proschan and
Hunsberger 1995). P-values or independent test statistics from different stages can
be combined using techniques such as Fisher’s p-value combination criterion or the
weighted inverse normal method. In general, results at IAs are inspected, and if there
is evidence of significance or futility, the trial may stop early. Otherwise, the study
continues to enroll an additional cohort of patients under a possibly modified
design intended to control the overall type I error rate. The advantage of this
approach lies in the flexibility that allows for a variety of planned and unplanned
adaptations in addition to early stopping.
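
As a minimal sketch, the two combination rules mentioned above can be written directly; the stage-wise p-values and equal weights here are illustrative, and in practice the weights would be fixed at the design stage.

```python
# Minimal sketch of two-stage p-value combination tests for independent,
# one-sided stage-wise p-values p1 and p2.
import math
from scipy.stats import norm, chi2

def inverse_normal(p1, p2, w1=math.sqrt(0.5), w2=math.sqrt(0.5)):
    """Weighted inverse normal combination (w1^2 + w2^2 = 1). Because the
    weights are fixed in advance, the test stays valid even if the stage-2
    sample size was modified using stage-1 data."""
    z = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
    return 1 - norm.cdf(z)

def fisher(p1, p2):
    """Fisher's combination: -2*log(p1*p2) ~ chi-square (4 df) under H0."""
    return chi2.sf(-2 * (math.log(p1) + math.log(p2)), df=4)

print(inverse_normal(0.10, 0.04))   # combined one-sided p-value
print(fisher(0.10, 0.04))
```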
Sample Size Re-estimation: When calculating a sample size in the initial design
phase, information is required about the desired statistical power and significance
threshold, and assumptions must be made regarding the treatment effect and nui-
sance parameters. To protect a study from incomplete knowledge during planning,
the assumptions can be re-evaluated, and the sample size can be re-estimated at IAs
in order to ensure that the desired statistical power is achieved. These early phases
that inform later stages are often referred to as internal pilots when based on
inspection of nuisance parameters, such as variance (Wittes and Brittain 1990). In
a more controversial application, the SSR can also involve inspection of the treat-
ment effect; if the treatment effect is less than the desired a priori planning value but
is still deemed important or promising, the sample size can be increased (Cui et al.
1999; Mehta and Pocock 2011). Some concerns with re-estimation based on the
treatment effect include decreased interpretability, lost efficiency, and the risk of
pursuing differences that are not clinically meaningful (Proschan 2009). Properly
planned GSMs have been shown
to have efficiency advantages versus planned and unplanned SSRs (Jennison and
Turnbull 2006; Tsiatis and Mehta 2003); however, the rigid maximum sample size
limits GSMs in certain situations. Tsiatis proposes updating the maximum sample
size at each IA in an adaptive GSM (Tsiatis 2006). Mehta and Pocock showed that
the adaptive approach is still beneficial in that the initial sample size commitment is
small and additional resources are only asked for if needed (Mehta and Pocock
2011).
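
In its simplest form, the internal pilot idea reduces to re-running the design-stage sample size formula with an updated variance estimate. A minimal sketch for a two-arm comparison of means follows; the formula is the standard normal-approximation sample size, and all numerical inputs are illustrative.

```python
# Minimal internal-pilot sketch: re-estimate the per-group sample size for
# a two-arm comparison of means once the interim (pooled, blinded) standard
# deviation estimate is available.
import math
from scipy.stats import norm

def per_group_n(sigma, delta, alpha=0.025, power=0.90):
    """Normal-approximation sample size per group for mean difference delta."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return math.ceil(2 * (z * sigma / delta) ** 2)

planned = per_group_n(sigma=10.0, delta=5.0)   # design-stage assumption
updated = per_group_n(sigma=13.0, delta=5.0)   # interim variance estimate
print(planned, "->", max(planned, updated))    # often restricted to increases only
```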
Adaptive Randomization: For various reasons, it may be beneficial to modify the
allocation ratio in which new study participants are randomized to each treatment
arm. As one example, randomization may be covariate-adaptive, with allocation
ratios changing based on the observed covariate allocations to better achieve covar-
iate balance between treatment groups beyond that achieved by simple randomiza-
tion. Alternatively, randomization may be response-adaptive, where observed
treatment outcomes from the data are used to inform how the trial should proceed.
In a parallel group setting, this can fulfill an ethical desire to increase the chance of
giving patients a superior treatment. By noting that dropping or adding an arm is
equivalent to adjusting its allocation ratio to zero or a nonzero value, response-
adaptive randomization strategies can also be used in dose-ranging trials, where
the trial initially evaluates several different doses and selects a dose based on
comparative data, such as dose-limiting toxicities in early phase studies. Adaptive
platform trials are a specific example of allocation ratio modifications, where
prospectively planned adaptations are used to compare multiple treatment arms to
one common control arm, with arms added and removed at IAs. However, these
studies have many complexities that could limit their practical benefit (FDA 2018).
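
One simple Bayesian response-adaptive scheme, sketched below for binary outcomes, allocates each new patient by drawing once from each arm's Beta posterior and assigning the arm with the larger draw (a Thompson-sampling-style rule). This is an illustration only; a real trial would add a burn-in period, allocation bounds, and formal type I error control.

```python
# Minimal sketch of Bayesian response-adaptive allocation for binary
# outcomes using Beta posteriors (Thompson-sampling-style rule).
import numpy as np

rng = np.random.default_rng(7)
true_p = np.array([0.30, 0.45])   # simulated "true" response rates
s = np.zeros(2)                   # successes per arm
f = np.zeros(2)                   # failures per arm (Beta(1,1) priors)

for _ in range(200):
    draws = rng.beta(1 + s, 1 + f)   # one posterior draw per arm
    arm = int(np.argmax(draws))      # allocate to the arm that "won"
    y = rng.random() < true_p[arm]
    s[arm] += y
    f[arm] += 1 - y

print("patients per arm:", (s + f).astype(int))   # drifts toward better arm
print("posterior means :", (1 + s) / (2 + s + f))
```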
Enrichment Designs: A particular treatment may be more effective in a group of
people who have a particular biomarker or genetic characteristic. For example, a
drug to reduce coronary heart disease may show better results in patients with high
baseline blood pressure. Targeting a population with known risk factors is known as
enrichment. When targeted risk factors are not known at study design, adaptive
enrichment provides a mechanism for a trial to determine promising subgroup(s) of
patients to continue on the trial while minimizing efforts toward non-promising
groups. Statistical power for chosen subgroups is increased, and improved precision
of undiluted effect sizes can be achieved. Correctly implemented adaptive
enrichment studies have the capacity to preserve the overall type I error rate while
combining the data across multiple stages (Bhatt and Mehta 2016). However,
drawbacks include increased complexity, lack of generalizability of results to subsets
not included, and potentially biased treatment effect estimates.
Endpoint Modification: An adaptive endpoint selection design could be
considered when there is a high degree of uncertainty about treatment effect sizes
on multiple patient outcomes in the design phase. Statistical methods have been
developed to avoid multiple testing problems (Hommel 2001). However, changing
the primary endpoint may reduce interpretability and complicate the regulatory
process (FDA 2018; Proschan 2009). Investigators may also attempt to change
their primary hypothesis from one of superiority to non-inferiority in cases where
two active treatments are being compared. If this is the case, a non-inferiority margin
must be pre-specified before enrollment in order to avoid type I error inflation (Hung
et al. 2006).
Adaptive Seamless Designs: Adaptive seamless designs, which combine research
phases into a single protocol, allow efficiency gains by reducing the time it would
normally take to move between the phases as well as by using information from the
early phase in final analysis. Most research focus has been placed on adaptive phase
II/III designs (Stallard and Todd 2010); however, early development adaptive
seamless designs exist as well. Consider a phase I/IIa design with the first stage
focusing on dose-finding and the second stage focusing on safety and efficacy
confirmation. The first stage can incorporate multiple doses; at the IA, the arm
with the best risk-benefit ratio is chosen for the second stage. Since participants
from both stages are used to inform the primary aim (which could introduce bias),
adaptive seamless designs can be relatively complex, and caution must be weighed
against the potential increased efficiency and reduced study timeline.

Planning Considerations

Limitations: Although ADs with IAs provide many advantages, there are limitations
that must be considered. For example, terminating a trial early is appealing to
sponsors because it saves money and resources and can result in effective treatments
being available more quickly. However, evidence collected from a smaller trial is not
as precise or reliable; a larger trial allows for more information to be collected on
subgroups as well as secondary endpoints and important safety data.
Studies with ADs, and IAs more generally, are not a cure for inadequate planning;
in fact, they generally require much more up-front planning than fixed sample
studies. This planning process will likely be lengthier and involve more complicated
logistical considerations, offsetting some of the time advantages. The added effi-
ciency and flexibility must justify the increase in study complexity and the accom-
panying difficulties in interpretability. Derived analytic methods and numeric
justifications (e.g., simulations) must be used in ADs to avoid bias and type I error
inflation, which may be compromised if unplanned IAs arise. Adhering to a rigid AD
implementation plan is difficult in practice for complicated RCTs involving
hundreds or thousands of subjects across many sites. Any deviations must be
documented and subsequent study properties ascertained as well as possible given
the actual procedures followed.
Additionally, timing should be considered when contemplating the usefulness of
IAs in RCTs. Designs work best when the accrual rate is predictable and
outcomes are known relatively quickly. If accrual is fast and outcomes are
not known until years of follow-up are completed, then any advantages of efficiency
in the study design will be mitigated by the fact that full enrollment is complete
before IA results are known. It is important that the design fits the setting and
expected outcome time frames, and the usefulness and feasibility of an IA plan
must be carefully considered when comparing potential trial designs in the planning
stage (Bhatt and Mehta 2016).
Estimation Bias: Much focus when discussing statistical methods for IAs is on
hypothesis testing and control of the type I error rate. However, any publication or
results reporting when a trial is complete would include information about the
observed treatment effect, a figure which will be widely cited for a high-profile
study. Biased treatment effects from naïve estimates are a concern for any IA plan,
especially those involving early stopping. Large fluctuations of early-stage estimated
treatment effects could induce stopping, leading to bias and possible overestimation
(Bassler et al. 2010). This is also true for estimation of secondary endpoints that are
not involved directly in stopping rules but are correlated with the primary endpoint
(FDA 2018). Methods can be incorporated prospectively into a trial plan to adjust for
potential estimation bias for some designs, but this is an area of research not as well
understood or developed as control of type I error rates (Shimura 2019). The extent
of potential bias should be explored methodologically or via simulations, with bias
corrections considered and estimates and confidence intervals presented with inter-
pretational caution (FDA 2018; Kimani et al. 2015).
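
The overestimation phenomenon is easy to demonstrate by simulation. In this minimal sketch, a modest true effect is simulated with an efficacy look halfway through; the bound and effect size are illustrative.

```python
# Minimal simulation sketch: naive treatment effect estimates conditional
# on early stopping for efficacy overestimate a modest true effect.
import numpy as np

rng = np.random.default_rng(3)
n_sims, n_half, theta = 50_000, 50, 0.2    # true standardized effect 0.2

m1 = rng.normal(theta, 1, (n_sims, n_half)).mean(axis=1)  # stage-1 mean
m2 = rng.normal(theta, 1, (n_sims, n_half)).mean(axis=1)  # stage-2 mean
z1 = m1 * np.sqrt(n_half)
stopped = z1 >= 2.80                       # conservative interim efficacy bound

print("true effect              :", theta)
print("mean estimate | stopped  :", m1[stopped].mean())      # inflated
print("mean estimate, full trial:", ((m1 + m2) / 2).mean())  # ~unbiased
```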
Information Sharing and Operational Bias: Comparative results at IAs during
trial conduct should be carefully guarded. However, even revealing study decisions
based on comparative interim data to those involved with the conduct or manage-
ment of a trial can lead to substantial bias and unpredictable trial complications. For
example, if a statistical plan is known in detail, and a study design changes in a
transparent manner such as increasing the sample size to a particular number, then it
may be possible for investigators to speculate, infer, or back-calculate treatment
effect results. Among other issues, this could compromise study integrity by affect-
ing enrollment and retention for patients currently enrolled and cause hesitancy with
sponsors to further support the trial. To limit this bias, the body tasked with
evaluating the interim data (typically the DMC) should include statistical expertise
and a clear understanding of the specified design, and should be independent
from those directly involved with the conduct of the trial (Bhatt and Mehta 2016;
FDA 2018). Data coordinating centers are useful in creating separation between
those directly conducting a trial on a day-to-day level and those responsible for
analyzing and reporting IA findings to the DMC. Additionally, a study could
consider reporting the details of AD algorithms somewhere other than a public
study protocol, such as in a DMC charter.
Role of Simulations: ADs, despite their pre-specified analysis plans, often involve
complicated components whose operating characteristics are hard to discern across
possible true conditions, such as treatment effects, event rates, and nuisance parameters.
In order to justify the validity and advantages of a design, extensive simulation
studies can be conducted. Simulation studies use clinical knowledge, programming,
and computing technology to create a virtual clinical trial framework that incorpo-
rates all IA rules, including SSR, early stopping, and treatment allocation changes.
Simulations can provide information about expected study duration and can quan-
tify power, expected sample sizes, and potential biases. By conducting a sensitivity
analysis, researchers can justify their design and explore optimization of study
components such as critical value thresholds and sample sizes (FDA 2018; Pallmann
et al. 2018).
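
As a minimal sketch of such a simulation study, the function below estimates the rejection probability and expected sample size of a hypothetical two-stage design with early stopping for efficacy or (binding) futility; the bounds, sample sizes, and effect sizes are illustrative assumptions rather than a recommended design.

```python
# Minimal sketch of simulating operating characteristics for a two-stage
# design with early stopping for efficacy (z1 >= 2.80) or binding futility
# (z1 < 0) at the halfway look.
import numpy as np

def operating_characteristics(theta, n_half=50, n_sims=50_000, seed=11):
    rng = np.random.default_rng(seed)
    m1 = rng.normal(theta, 1, (n_sims, n_half)).mean(axis=1)
    m2 = rng.normal(theta, 1, (n_sims, n_half)).mean(axis=1)
    z1 = m1 * np.sqrt(n_half)
    stop_eff, stop_fut = z1 >= 2.80, z1 < 0.0
    go_on = ~(stop_eff | stop_fut)
    z_final = (m1 + m2) / 2 * np.sqrt(2 * n_half)
    reject = stop_eff | (go_on & (z_final >= 1.98))
    n_used = np.where(go_on, 2 * n_half, n_half)
    return reject.mean(), n_used.mean()

for theta in [0.0, 0.3]:   # null and one plausible alternative
    p_rej, e_n = operating_characteristics(theta)
    print(f"theta={theta}: P(reject)={p_rej:.3f}, E[N]={e_n:.1f}")
```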
Generalizability: After adaptation, the results of a trial may not be generalizable
to the original study population. For example, in a trial using enrichment strategies,
demographics may change over the course of the study based on selected biomarker
groups, and study results may be restricted to a particular subgroup. Additionally, a
flexible design with changes at IA to the sample size, participant entry criteria, or
primary study endpoint may lead to stagewise study results that are
not similar enough for a broad interpretation of a resulting hypothesis test. Careful
consideration is necessary when interpreting study conclusions for trials with IAs
(Pallmann et al. 2018).

Oversight and Maintaining Integrity

Protocol: Before conducting an RCT, a study protocol must be developed that
outlines the study design and the intentions of investigators. Since RCTs involving IAs
are generally more complicated than fixed sample or historically well-understood
designs such as GSMs, detailed study planning results and study operating charac-
teristics (including simulation results) are reported in a study protocol. This includes
rationale and information about the chosen study design, a description of potential
IAs and their timing, and appropriate statistical methods. The protocol should outline
how statistical validity will be maintained and how bias will be minimized through-
out the trial. The FDA or other regulatory bodies, the DMC, and/or study sponsors
will review the study protocol and provide suggestions before it is approved and the
trial begins enrollment. Depending on study adaptations being considered, there may be
agreement for certain details (such as exact SSR procedures) to be excluded from the
publicly available protocol and documented separately (e.g., in a DMC charter).
Data Monitoring Committee: DMCs are composed of some combination of
statisticians, epidemiologists, pharmacists, ethicists, patient advocates, and others
who are responsible for overseeing IAs in RCTs (Ellenberg et al. 2003). To
maintain trial integrity, it is
important that the DMC is both intellectually and financially independent from those
conducting and sponsoring the trial. The DMC should ideally be involved in
protocol development and approval to ensure that the entire team is aligned
and that the committee understands the design and its responsibilities. At IAs, the
DMC considers participant recruitment and compliance to treatment, intervention
safety, quality of study conduct, proper measurement of response, and the primary
and secondary outcome results. It is recommended that the DMC be unblinded when
it conducts or reviews IAs, and its recommendations should follow the previously
agreed-upon protocol whenever possible. Interim data should be reviewed for
intervention efficacy and safety signals, recruitment rates, and subgroup indications.
Open reports containing pooled, aggregate information (such as enrollment rates and
serious adverse events) can be shared more broadly, and confidential closed reports
with specific efficacy information are generated separately for closed DMC review.
The DMC should then pass recommendations to a blinded trial steering committee,
whose role is to oversee the trial conduct (Pallmann et al. 2018).
Interactions with the FDA: In addition to reviewing protocol information and
study design properties, the FDA often reviews marketing applications that highlight
the results of a completed RCT. These can include new drug applications (NDAs) or
biologic license applications (BLAs). In an AD setting, the FDA’s primary concerns
of safety and efficacy are coupled with complicated design components that may not
be readily understood without further communication. As a result, applications with
ADs are often reviewed with greater scrutiny than nonadaptive designs. As
described previously, simulations and other justifications are required to rationalize
the advantages of a complicated analysis plan with the chosen study parameters. It is
also important that the FDA is able to see that the overall type I error rate is
controlled despite repeated IAs. Unless patient safety is at risk, results from IAs in
ongoing trials are generally not shared with the FDA until the conclusion of the trial
(FDA 2018).
Reporting: In the United States, most RCTs that involve human volunteers are
required to be registered through the National Library of Medicine at ClinicalTrials.
gov in order to provide transparency and give the public, patients, caregivers, and
clinical researchers access to trial information. ClinicalTrials.gov is a database that
summarizes publicly available information and results for registered domestic and
international clinical trials. Additionally, a clinical trial should meet the minimum
reporting standards outlined in the CONSORT guideline, last updated in 2010
(Moher et al. 2010). This guideline helps investigators worldwide provide complete,
transparent reporting with regard to trial design, participant recruitment, statistical
methods, and results. The guideline specifically mentions the necessity of reporting
IAs (item 7b), regardless of whether or not pre-specified rules are used in decision-
making. Investigators are required to report how many interim looks the DMC
completed, along with their purpose and the statistical methods implemented.

Applications and Examples

Group Sequential Trials: Consider the Beta-Blocker Heart Attack Trial (Beta-
Blocker Heart Attack Study Group 1981), a large, multicenter, double-blind RCT
enrolling patients with recent myocardial infarction. The primary aim was to
compare mortality in patients taking the active medication, propranolol, versus a
placebo. After the independent DMC determined a clear treatment benefit, the trial
was terminated 9 months early. The interim data revealed 7% mortality in the
propranolol group (135 deaths), compared to 9.5% mortality in the placebo group
(183 deaths). In making their recommendation, the DMC ensured no potential
confounding existed due to baseline demographics, study compliance, and unantic-
ipated side effects. They were able to conclude that the outcome was unlikely to
change if the study continued and, for ethical reasons, it was appropriate to dissem-
inate the trial information as quickly as possible. The study is notable as one of the
first major trials to incorporate relatively new GSM monitoring. The rules, based on
O’Brien-Fleming-type stopping bounds, were not originally part of the trial design
but were incorporated early in the trial implementation. Other examples of pre-
specified GSM trials that stopped early include the Herceptin Adjuvant Trial,
designed to test the effect of trastuzumab after adjuvant chemotherapy in those
diagnosed with HER2-positive breast cancer, and the EXAMINE trial, which studied
the use of alogliptin in patients who were considered to be high risk for cardiovas-
cular disease (Piccart-Gebhart et al. 2005; White et al. 2013). Finally, a GSM design
combined with information-based SSR was utilized in a multinational study of
vitamin D and chronic kidney disease, termed the PRIMO study. The protocol
allowed for interim nuisance parameter estimates to modify uncertain design
assumptions, and the interim treatment effect was used to influence efficacy-based
decision rules. Ultimately, it was found at IA that no SSR was necessary to achieve at
least 85% statistical power (Pritchett et al. 2011).
Changing Hypotheses: The EXAMINE trial had an additional adaptive feature in
which the hypothesis to be tested could vary from superiority to non-inferiority. The
protocol would have allowed the trial to proceed for an additional 100 primary events if,
at interim, the probability of detecting superiority, assuming it existed, was above
20%. The IA found that the conditional power was less than 20%, and the study was
terminated early, declaring non-inferiority (Bhatt and Mehta 2016; White et al.
2013).
Potential Uncertainty and Bias: Consider the recently completed TOTAL trial,
which studied thrombectomy as an adjunct to traditional percutaneous coronary
intervention (PCI) for ST-segment elevation myocardial infarction (STEMI). At an IA
(n = 2,791), the combined intervention group had an observed lower death rate than
those with only PCI (p = 0.025), along with no significant evidence of a
difference in stroke occurrence; one could deduce from this information that the
combined intervention is advantageous. However, the trial continued to completion
(n = 10,064) and found no significant group-wise difference in death rates
(p = 0.48). Additionally, 1.2% of the combined intervention group experienced
stroke at the end of 1 year, compared to only 0.7% of the PCI-only group (p = 0.015),
highlighting an adverse event that was not evident at interim. Due to the results of
this study, thrombus aspiration is no longer recommended in the guidelines to reduce
mortality in those with STEMI. If the TOTAL trial had shown an increased early
signal and stopped at the IA, the benefits associated with the combined intervention
could have been overestimated, and the dangerous association of thrombectomy
with stroke may not have been initially exposed (Jolly et al. 2018).
Exploratory Response-Adaptive Design: Consider an adaptive double-blind
phase II RCT for patients with bipolar depression testing two interventions using a
2 × 2 factorial trial design (Savitz et al. 2018). This trial examined the anti-inflamma-
tory effects of the antibiotic minocycline and low-dose aspirin, which is known to
quicken the response to SSRIs. After half of the patients were evaluated, an IA was
conducted for futility testing. This analysis determined that two of the four inter-
vention groups were diverging and provided the only opportunity for a powered
result. Thus, enrollment was adjusted with the remaining participants randomized
only to the double intervention group and the double placebo group (while
maintaining blinding). Additionally, two of the three primary outcomes were
reduced to exploratory outcomes due to evidence of insufficient power. This trial
proceeded without pre-specified rules, and the protocol was amended to adjust the
design going forward; the trial still maintained some integrity through the blinded
nature of the analysis, transparency of design decisions, and the preliminary nature
of its findings.
Adaptive Seamless Design: A large phase IIb/III trial was recently conducted to
efficiently move a potentially more effective nine-valent human papillomavirus
vaccine through clinical development (Chen et al. 2015). In the first phase, 1,240
women were equally randomized to three doses of the new vaccine or to active
control (four-valent) vaccine. After an IA examined safety data and immunogenicity
(used as a short-term biomarker endpoint), one experimental dose was selected, and
subjects who received that dose or the control continued follow-up and contributed
to the final analysis. An additional 13,400 women were enrolled in the second phase
and randomized to the two continuing arms. By not considering the primary end-
point efficacy of viral infections at the IA, claiming small correlation between
immunogenicity and infection rates, and using a conservative analysis technique in
the confirmatory stage, the investigators justified the seamless design without com-
plex statistical correction. Despite documented challenges, the study met its efficacy
goals while shortening the clinical development time frame for the new vaccine
formulation.
Allocation Ratio Modifications: In a two-stage AD RCT examining HIV preven-
tion methods in Malawi, pregnant women attending an antenatal care clinic were
instructed to encourage their male partners to participate in HIV testing (Choko et al.
2019). In the first stage, participants were randomized to six groups including
standard of care (clinic invitation letter) and five other arms defined by HIV self-
testing coupled with different incentives. At the IA, enrollment was discontinued in
groups not showing improvement over the standard of care (p-value >0.2); in stage
2, randomized enrollment continued equally for the remaining groups. Results
showed that secondary distribution of HIV self-testing increased the number of
males being tested for HIV, and with incentives, men were more likely to access
care and prevention services.
Sample Size Re-estimation: The CHAMPION PHOENIX trial, which evaluated
the effects of cangrelor on ischemic complications of percutaneous coronary inter-
vention, incorporated SSR to adjust the trial if observed relative risks differed from
the assumed rates (Leonardi et al. 2012). The IA was performed after 70% of patients
had completed a short follow-up. The sample size was to be increased if the DMC
found the interim results to be in a “promising zone” (as opposed to being clearly
favorable or unfavorable) (Bhatt and Mehta 2016). Ultimately, the sample size was
not increased since the results were “favorable” at IA. Another example, the
CARISA trial, was a double-blind, three-group parallel trial to determine whether
ranolazine improves treadmill exercise duration in patients with severe chronic
angina (Chaitman et al. 2004). An IA based on an updated standard deviation
using aggregate data was scheduled after half of the patients were followed for
12 weeks. This “internal pilot”-based SSR allowed the study to maintain stable
statistical power despite incorrect initial assumptions.
Enrichment Designs: In a two-stage adaptive enrichment study to test rizatriptan
for the treatment of acute migraines in people aged 6–17, the first stage randomized
participants at a 20:1 ratio of placebo to intervention (Ho et al. 2012). To enrich the
sample by excluding false responders, any patients who noted a quick improvement
in migraine symptoms after the first stage were dropped. Of the remaining non-
responders, those who took the active treatment in stage 1 were allocated to placebo
in stage 2, whereas those assigned to the placebo in stage 1 were randomized equally
to rizatriptan and placebo in stage 2. Ultimately, efficacy of rizatriptan was shown,
and the drug is now approved by the FDA for acute migraine treatment in this age
group.

Discussion

Two discussion points are worth further consideration: the development and poten-
tial of ADs using Bayesian methodology and the evolving need for infrastructure in
the implementation of RCTs incorporating novel and complicated IAs.
Bayesian Methods: Statistical methods using the Bayesian framework com-
bine prior information with new information to update posterior distributions of
interest. While the use of Bayesian methods in early phase ADs has been
accepted for years (Garrett-Mayer 2006), interest in their potential use in confir-
matory ADs has ramped up considerably over the last decade (Berry et al. 2010;
Brakenhoff et al. 2018). To those in the field, the “learn as you go” nature of ADs
seems like a natural fit for Bayesian reasoning. For example, to address the
uncertainty associated with estimating nuisance parameters in the design phase,
Brakenhoff proposed a Bayesian solution combining prior knowledge with data
collected at IA (Brakenhoff et al. 2018). Bayesian methods can be useful for
predictive modeling, for dose escalation studies, and when statisticians want to
explicitly incorporate results from previous or external trials. Computationally
intensive simulations are critical to validating adaptive Bayesian designs and
ensuring that statistical operating characteristics are being maintained (FDA
2018). An example comes from a phase II enrichment oncology trial using a
hierarchical Bayesian design to examine the benefits of a treatment in the whole
study population and subpopulations defined by histologic subtype. The hierar-
chical component borrows treatment effect information from one group and uses
it to influence estimation of the treatment effect for another group, making it more
likely to correctly conclude efficacy or futility (Berry et al. 2013). Additionally,
the trial allows for early stopping for efficacy or futility based on continuously
updated posterior estimates of treatment efficacy.
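
As a minimal sketch of Bayesian interim reasoning, the Beta-Binomial model below computes the posterior probability that a response rate exceeds a reference value, together with the predictive probability of meeting an end-of-trial success criterion; the prior, interim data, and success threshold are all illustrative.

```python
# Minimal Beta-Binomial sketch: posterior and predictive probabilities at
# an interim look of a single-arm binary-response trial.
from scipy.stats import beta, betabinom

a0, b0 = 1, 1          # Beta(1,1) prior on the response rate p
x, n = 12, 30          # interim responses out of patients evaluated
n_rem = 30             # patients still to be enrolled (max N = 60)
p0 = 0.30              # reference response rate
success_total = 24     # illustrative success criterion: >= 24/60 responses

a, b = a0 + x, b0 + (n - x)             # posterior is Beta(a, b)
post = 1 - beta.cdf(p0, a, b)           # P(p > p0 | interim data)
need = success_total - x                # responses still required
pred = 1 - betabinom.cdf(need - 1, n_rem, a, b) if need > 0 else 1.0
print(f"P(p > p0 | data)            = {post:.3f}")
print(f"Predictive P(trial success) = {pred:.3f}")
```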
Infrastructure: Although adaptive and flexible designs with IAs have gained
significant popularity in private industry, there are barriers preventing their wide-
spread use in publicly funded research. In order to apply for a grant, extensive
simulation studies must have already been conducted in order to verify the validity of
the study design in a particular setting. Therefore, sufficient infrastructure and
resources must be available before the grant is awarded so that the necessary time
can be spent on up-front planning. These issues were discussed among representa-
tives from government, academia, and industry at the Scientific Advances in Adap-
tive Clinical Trial Designs Workshop in November of 2009 (Coffey et al. 2012).
Ultimately, the creation of networks across clinical research helps address the
infrastructure issue by pooling resources and expertise, increasing the feasibility of complicated
trial approval. Examples of these networks include the federal Clinical and Transla-
tional Science Award program and the Network for Excellence in Neuroscience
Clinical Trials (NeuroNEXT). With sufficient resources and infrastructure, adaptive
clinical trial designs with IAs can continue to attain their potential and improve trial
efficiency.

Summary and Conclusions

IAs in clinical trials are powerful tools that, when properly employed, greatly benefit
clinical efficiency, ethics, and the chance of a successful trial. Pre-specified ADs in
particular (including GSMs) have the advantage of known possible decisions and
enumerated study operating characteristics being available for scrutiny before a trial
begins. Flexible designs incorporating unplanned analyses while controlling type I
error rate can also be useful when unanticipated situations occur during trial conduct.
As design understanding and ease of implementation catch up to methodological
development, advanced IA designs will benefit health research and patient outcomes
in the decades ahead.

Key Facts

• In randomized controlled trials, interim analyses occur periodically during data
accumulation to consider adjustments to the ongoing trial.
• These analyses can allow for greater study flexibility and efficiency by updating
design considerations with actual information collected during the trial.
• Adaptive designs are a special class of randomized controlled trials with pre-
specification of modification rules, which improve a priori understanding of study
operating characteristics.
• Through continued methodological and computational advancements, increased
investment in trial infrastructure, and careful planning and implementation,
successful interim analyses have increasingly become commonplace in clinical
research.

Cross-References

▶ Adaptive Phase II Trials
▶ Bayesian Adaptive Designs for Phase I Trials
▶ Biomarker-Driven Adaptive Phase III Clinical Trials
▶ Data and Safety Monitoring and Reporting
▶ Futility Designs

References
Anderson J, High R (2011) Alternatives to the standard Fleming, Harrington, and O’Brien futility
boundary. Clin Trials 8(3):270–276. https://doi.org/10.1177/1740774511401636
Armitage P, McPherson C et al (1969) Repeated significance tests on accumulating data. J R Stat
Soc Ser A 132(2):235–244. https://doi.org/10.2307/2343787
Bassler D, Briel M et al (2010) Stopping randomized trials early for benefit and estimation of
treatment effects: systematic review and meta-regression analysis. JAMA 303(12):1180–1187.
https://doi.org/10.1001/jama.2010.310
Bauer P, Kohne K (1994) Evaluation of experiments with adaptive interim analyses. Biometrics 50
(4):1029–1041. https://doi.org/10.2307/2533441
Berry S, Carlin B et al (2010) Bayesian adaptive methods for clinical trials. CRC Press, Boca Raton
Berry S, Broglio K et al (2013) Bayesian hierarchical modeling of patient subpopulations: efficient
designs of Phase II oncology clinical trials. Clin Trials 10(5):720–734. https://doi.org/10.1177/
1740774513497539
Beta-Blocker Heart Attack Study Group (1981) The beta-blocker heart attack trial. JAMA 246
(18):2073–2074
Bhatt D, Mehta C (2016) Adaptive designs for clinical trials. N Engl J Med 375(1):65–74.
https://doi.org/10.1056/NEJMra1510061
Brakenhoff T, Roes K et al (2018) Bayesian sample size re-estimation using power priors. Stat
Methods Med Res. https://doi.org/10.1177/0962280218772315
Brannath W, Koenig F et al (2007) Multiplicity and flexibility in clinical trials. Pharm Stat J Appl
Stat Pharm Ind 6(3):205–216. https://doi.org/10.1002/pst.302
Burman C, Sonesson C (2006) Are flexible designs sound? Biometrics 62(3):664–669. https://doi.
org/10.1111/j.1541-0420.2006.00626.x
Chaitman B, Pepine C et al (2004) Effects of ranolazine with atenolol, amlodipine, or diltiazem on
exercise tolerance and angina frequency in patients with severe chronic angina: a randomized
controlled trial. JAMA 291(3):309–316. https://doi.org/10.1001/jama.291.3.309
Chen Y, Gesser R et al (2015) A seamless phase IIb/III adaptive outcome trial: design rationale and
implementation challenges. Clin Trials 12(1):84–90. https://doi.org/10.1177/
1740774514552110
Choko A, Corbett E et al (2019) HIV self-testing alone or with additional interventions, including
financial incentives, and linkage to care or prevention among male partners of antenatal care
clinic attendees in Malawi: an adaptive multi-arm, multi-stage cluster randomized trial. PLoS
Med 16(1). https://doi.org/10.1371/journal.pmed.1002719
Coffey C, Levin B et al (2012) Overview, hurdles, and future work in adaptive designs: perspectives
from a National Institutes of Health-funded workshop. Clin Trials 9(6):671–680. https://doi.org/
10.1177/1740774512461859
Cui L, Hung H et al (1999) Modification of sample size in group sequential clinical trials.
Biometrics 55(3):853–857. https://doi.org/10.1111/j.0006-341X.1999.00853.x
Ellenberg S, Fleming T et al (eds) (2003) Data monitoring committees in clinical trials: a practical
perspective. Wiley, Chichester
European Medicines Agency (2007) Reflection paper on methodological issues in confirmatory
clinical trials planned with an adaptive design. Retrieved from http://www.ema.europa.eu
Fleming T, Harrington D et al (1984) Designs for group sequential tests. Control Clin Trials 5
(4):349–361. https://doi.org/10.1016/S0197-2456(84)80014-8
Food and Drug Administration (2018) Adaptive designs for clinical trials of drugs and biologics:
guidance for industry. Retrieved from https://www.fda.gov
Friede T, Kieser M (2011) Blinded sample size recalculation for clinical trials with normal data and
baseline adjusted analysis. Pharm Stat 10(1):8–13. https://doi.org/10.1002/pst.398
Garrett-Mayer E (2006) The continual reassessment method for dose-finding studies: a tutorial. Clin
Trials 3(1):57–71. https://doi.org/10.1191/1740774506cn134oa
Gould A, Shih W (1992) Sample size re-estimation without unblinding for normally distributed
outcomes with unknown variance. Commun Stat Theory Methods 21(10):2833–2853.
https://doi.org/10.1080/03610929208830947
Heart Special Project Committee (1988) Organization, review and administration of cooperative
studies (Greenberg report): a report from the Heart Special Project Committee to the National
Advisory Council, May 1967. Control Clin Trials 9:137–148
Ho T, Pearlman E et al (2012) Efficacy and tolerability of rizatriptan in pediatric migraineurs: results
from a randomized, double-blind, placebo-controlled trial using a novel adaptive enrichment
design. Cephalalgia 32(10):750–765. https://doi.org/10.1177/0333102412451358
Hommel G (2001) Adaptive modifications of hypotheses after an interim analysis. Biom J 43(5):581–
589. https://doi.org/10.1002/1521-4036(200109)43:5<581::AID-BIMJ581>3.0.CO;2-J
Hung H, O’Neill R et al (2006) A regulatory view on adaptive/flexible clinical trial design. Biometr
J 48(4):565–573. https://doi.org/10.1002/bimj.200610229
Jennison C, Turnbull B (2006) Efficient group sequential designs when there are several effect sizes
under consideration. Stat Med 25(6):917–932. https://doi.org/10.1002/sim.2251
Jolly S, Gao P et al (2018) Risks of overinterpreting interim data: lessons from the TOTAL trial
(thrombectomy with PCI versus PCI alone in patients with STEMI). Circulation 137
(2):206–209. https://doi.org/10.1161/CIRCULATIONAHA.117.030656
Kairalla J, Coffey C et al (2012) Adaptive trial designs: a review of barriers and opportunities. Trials
13(1):145. https://doi.org/10.1186/1745-6215-13-145
Kimani P, Todd S et al (2015) Estimation after subpopulation selection in adaptive seamless trials.
Stat Med 34(18):2581–2601. https://doi.org/10.1002/sim.6506
Krams M, Lees K et al (2003) Acute stroke therapy by inhibition of neutrophils (ASTIN). Stroke 34
(11):2543–2548. https://doi.org/10.1161/01.STR.0000092527.33910.89
Lachin J (2005) A review of methods for futility stopping based on conditional power. Stat Med 24
(18):2747–2764. https://doi.org/10.1002/sim.2151
Lan K, DeMets D (1983) Discrete sequential boundaries for clinical trials. Biometrika 70:659–663.
https://doi.org/10.1093/biomet/70.3.659
Leonardi S, Mahaffey K et al (2012) Rationale and design of the Cangrelor versus standard therapy
to achieve optimal Management of Platelet Inhibition PHOENIX trial. Am Heart J 163
(5):768–776. https://doi.org/10.1016/j.ahj.2012.02.018
Mehta C, Pocock S (2011) Adaptive increase in sample size when interim results are promising: a
practical guide with examples. Stat Med 30(28):3267–3284. https://doi.org/10.1002/sim.4102
Moher D, Hopewell S et al (2010) CONSORT 2010 explanation and elaboration: updated guide-
lines for reporting parallel group randomised trials. J Clin Epidemiol 63(8):e1–e37. Retrieved
from www.consort-statement.org
O’Brien P, Fleming T (1979) A multiple testing procedure for clinical trials. Biometrics 35
(3):549–556. https://doi.org/10.2307/2530245
Pallmann P, Bedding A et al (2018) Adaptive designs in clinical trials: why use them, and how to
run and report them. BMC Med 16(1):29. https://doi.org/10.1186/s12916-018-1017-7
Piccart-Gebhart M, Procter M et al (2005) Trastuzumab after adjuvant chemotherapy in HER2-
positive breast cancer. N Engl J Med 353(16):1659–1672. https://doi.org/10.1056/
NEJMoa052306
Pocock S (1977) Group sequential methods in the design and analysis of clinical trials. Biometrika
64(2):191–199. https://doi.org/10.1093/biomet/64.2.191
Pritchett Y, Jemiai Y et al (2011) The use of group sequential, information-based sample size re-
estimation in the design of the PRIMO study of chronic kidney disease. Clin Trials 8
(2):165–174. https://doi.org/10.1177/1740774511399128
Proschan M (2009) Sample size re-estimation in clinical trials. Biometr J 51(2):348–357.
https://doi.org/10.1002/bimj.200800266
Proschan M, Hunsberger S (1995) Designed extension of studies based on conditional power.
Biometrics 51(4):1315–1324. https://doi.org/10.1016/0197-2456(95)91243-6
Savitz J, Teague T et al (2018) Treatment of bipolar depression with minocycline and/or aspirin: an
adaptive, 2 × 2 double-blind, randomized, placebo-controlled, phase IIA clinical trial. Transl
Psychiatry 8(1):27. https://doi.org/10.1038/s41398-017-0073-7
Shimura M (2019) Reducing overestimation of the treatment effect by interim analysis when
designing clinical trials. J Clin Pharm Ther 44(2):243–248. https://doi.org/10.1111/jcpt.12777
Stallard N, Todd S (2010) Seamless phase II/III designs. Stat Methods Med Res 20(6):626–634.
https://doi.org/10.1177/0962280210379035
Tsiatis A (2006) Information-based monitoring of clinical trials. Stat Med 25(19):3236–3244.
https://doi.org/10.1002/sim.2625
Tsiatis A, Mehta C (2003) On the inefficiency of the adaptive design for monitoring clinical trials.
Biometrika 90(2):367–378. https://doi.org/10.1093/biomet/90.2.367
White W, Cannon C et al (2013) Alogliptin after acute coronary syndrome in patients with type 2
diabetes. N Engl J Med 369(14):1327–1335. https://doi.org/10.1056/NEJMoa1305889
Wittes J, Brittain E (1990) The role of internal pilot studies in increasing the efficacy of clinical
trials. Stat Med 9(1–2):65–72. https://doi.org/10.1002/sim.4780090113
Part VI
Advanced Topics in Trial Design
60 Bayesian Adaptive Designs for Phase I Trials
Michael J. Sweeting, Adrian P. Mander, and Graham M. Wheeler

Contents
Introduction
Escalation with Overdose Control (EWOC)
  Example 1: Dose-Escalation Cancer Trial
  Varying the Feasibility Bound
  Toxicity-Dependent Feasibility Bounds
  Software
Time-to-Event Endpoints
  Example 2: Dose Escalation of Cisplatin in Pancreatic Cancer
  Software
Toxicity Grading
  Ordinal Toxicity Gradings
  Toxicity Score Approach
  Software
Dual Endpoints
  The EffTox Design
  Example 3: The Matchpoint Trial
  Other Approaches for Joint Modeling of Efficacy and Toxicity
Dual-Agent and Dose-Schedule-Finding Studies
  Extensions to the CRM
  Dose Toxicity Surface Models
  Example 4: Nilotinib plus Imatinib in Stromal Tumors
  Bayesian Model-Free Approaches
  Dose-Schedule Finding Designs
Summary and Conclusion
References

M. J. Sweeting (*)
Department of Health Sciences, University of Leicester, Leicester, UK
Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
e-mail: [email protected]; [email protected]
A. P. Mander
Centre for Trials Research, Cardiff University, Cardiff, UK
e-mail: [email protected]
G. M. Wheeler
Imperial Clinical Trials Unit, Imperial College London, London, UK
Cancer Research UK & UCL Cancer Trials Centre, University College London, London, UK
e-mail: [email protected]


Abstract
Phase I trials mark the first experimentation of a new drug or combination of
drugs in a human population. The primary aim of a cancer phase I trial is to seek
a safe dose or range of doses suitable for phase II experimentation. Bayesian
adaptive designs have long been proposed to allow safe dose escalation and
dose finding within phase I trials. There are now a vast number of designs
proposed for use in phase I trials though widespread application of these designs
is still limited. More recent designs have focused on the incorporation of
multiple sources of information into dose-finding algorithms to improve trial
safety and efficiency. This chapter reviews some of the papers that extend the
simple dose-escalation trial design with a binary toxicity outcome. Specifically,
the chapter focuses on five key topics: (1) overdose control, (2) use of partial
outcome follow-up, (3) grading of toxicity outcomes, (4) incorporation of both
toxicity and efficacy information, and (5) dual-agent or dose-scheduling
designs. Each extension is illustrated with an example from a real-life trial
with reference to freely available software. These extensions open the way to a
broader class of phase I trials being conducted, leading to safer and more
efficient trials.

Keywords
Dose finding · Dose escalation · Phase I trial design · Toxicity · CRM

Introduction

Phase I trials mark the first experimentation of a new drug in a human population. A
primary objective is to identify tolerable doses while ensuring the trial is safe,
acknowledging the necessary balance of risk versus benefit for participants. In
oncology phase I trials, cytotoxic anticancer drugs may have severe toxicity at
high doses, yet at low doses little efficacy is expected from the drug. A goal of
such trials is therefore to minimize the number of patients allocated to ineffective or
excessively toxic doses, and efficient trial designs are required to achieve this and to
meet ethical considerations (Jaki et al. 2013).
Phase I trials are conducted as dose-escalation studies, where the dose of the drug
under consideration can be adapted as new patients are sequentially recruited into the
trial, using dose and outcome data from previously enrolled patients. Designs are
often inherently Bayesian in nature since decisions about dose escalation must be
made early in the trial when few or no results are available, and thus prior beliefs of
the dose-toxicity relationship (and corresponding uncertainty) are often needed. The
key quantity in most phase I dose-escalation trials is the maximum tolerated dose
(MTD). This is often defined as the dose that has a probability of dose-limiting
toxicity (DLT) that is equal to a prespecified target toxicity limit (TTL), which is
commonly chosen in cancer trials to be between 20% and 33% (Le Tourneau et al.
2009). A DLT is defined as a drug-induced toxic effect or severe adverse event that is
considered unacceptable due to its severity or irreversibility, thus preventing an
increase in the dose of the treatment. This definition of the MTD assumes that
there is an underlying continuous dose-toxicity relationship, and is central to most
model-based phase I designs. An alternative set of designs, called rule-based
designs, define the MTD based on the observed proportion of patients in the trial
that experience a DLT at a dose level. These designs are not considered in this
chapter.
It has been over 30 years since the seminal publication of the continual
reassessment method (CRM) (O’Quigley et al. 1990), which was proposed as a
model-based adaptive design for dose-escalation phase I trials. The CRM in its
simplest form is a one-parameter dose-toxicity model that uses previous dose and
DLT outcomes to assign new patients to dose levels as they enter the trial and aims to
estimate the MTD. The CRM, and most designs for phase I trials, is based on the
assumption of monotonicity, whereby the probability of observing a DLT increases
with dose. Since the key interest is in estimating the MTD, a model with a single
parameter is sufficient for local estimation of dose response (i.e., if focus is on a
single point estimate) (O’Quigley et al. 1990). However, phase I trials may require
more complex designs that consider other features, such as limiting the chance of
severe overdosing, using partial data from patients who are still under follow-up for
DLTs, using toxicity outcomes based on graded responses rather than a dichotomous
outcome (DLT or no DLT), considering both toxicity and efficacy outcomes, and
designing trials where two drugs are to be administered and their dose levels adapted
in combination.
The focus of this chapter is to provide a broad overview of some of the more
advanced issues in model-based (Bayesian) adaptive designs for phase I trials and
key considerations that have led to these designs being proposed. The chapter is not
intended to be all-encompassing, but should provide the reader with a flavor of some
of the methodological developments in the area that extend the CRM approach, and
to highlight practical considerations for researchers wishing to apply these methods.
Examples from real-life trials are given throughout the chapter, along with recom-
mendations of freely available software available to apply the methods. For a more
in-depth discussion of the CRM and some of its earlier extensions, readers should
refer to the ▶ chapter 53, “Dose-Finding and Dose-Ranging Studies” by Conaway
and Petroni in section ▶ “Basics of Trial Design” of this book. While this chapter
covers some recent designs for dual-agent phase I trials, a more comprehensive
discussion of designs for drug combination dose finding is given in section
▶ “Basics of Trial Design” of this book by Tighiouart.

Escalation with Overdose Control (EWOC)

The CRM was the first model-based adaptive design for phase I dose escalation
studies and has been implemented in both Bayesian and frequentist frameworks
(O’Quigley et al. 1990; O’Quigley and Shen 1996). The design makes use of the
dose and toxicity data accumulating as the trial progresses to make dose selection
decisions, giving it a significant advantage over traditional rule-based designs such
as the 3 + 3 method (Iasonos et al. 2008; Le Tourneau et al. 2009). Nevertheless, a
number of modifications have been proposed since the original design to counter
safety concerns about possible overdosing. These include rules that sometimes
override model recommendations including always starting at the lowest dose,
avoiding dose-skipping when escalating, and treating more than one patient at
each dose level (Faries 1994; Korn et al. 1994; Goodman et al. 1995; Piantadosi
et al. 1998).
An alternative approach to control overdosing is to modify the CRM dose-finding
algorithm. After each patient, the CRM estimates the posterior distribution of the
MTD and uses the middle of the distribution (e.g., the mean or median) to recom-
mend the dose to administer to the next patient. However, at least early on in the trial,
the posterior mean or median MTD estimate may fluctuate wildly leading to some
patients receiving doses high above the true MTD. Overdosing can also occur if the
prespecified model is incorrect (Le Tourneau et al. 2009). To overcome this problem
Babb et al. (1998) developed the Escalation With Overdose Control (EWOC) design,
which modifies the CRM so that it recommends the α quantile of the MTD
distribution to the next patient, where α < 0.5. The quantile, α, is known as the
feasibility bound and governs the predicted chance of overdosing in the trial. For
each successive patient the predicted probability of overdosing is α, whereas for the
CRM using the median of the MTD distribution the predicted probability is 0.5. Low
values of α will result in more cautious escalation; the trade-off for this cautious
escalation is that the dose sequence allocated through the trial will generally take
longer to converge to the true MTD. In notation, let $F_n(x) = P(\mathrm{MTD} \le x \mid \mathcal{D}_n)$
denote the probability that the MTD is less than or equal to dose $x$ given the data
$\mathcal{D}_n$ collected from the previous $n$ patients, namely the doses allocated
$x_1, \ldots, x_n$ and the corresponding DLT outcome indicator variables $y_1, \ldots, y_n$.
The EWOC design selects the dose $x_{n+1}$ for patient $n + 1$ such that

$$F_n(x_{n+1}) = \alpha.$$

The EWOC model has an attractive decision-theoretic loss-function interpretation.
Given the feasibility bound $\alpha$, the EWOC model minimizes the risk of toxicity
based on the asymmetric loss function

$$L(x, \gamma) = \begin{cases} \alpha(\gamma - x) & \text{if } x \le \gamma \\ (1 - \alpha)(x - \gamma) & \text{if } x > \gamma \end{cases}$$
where γ is the true MTD. Hence a higher penalty is given to overdosing, and this
implies that treating a patient δ units above the MTD is (1 – α)/α times worse than
treating them δ units below the MTD.
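As a quick numerical illustration of this loss-function view (not taken from the cited papers), the following base R sketch checks that the dose minimizing the posterior expected loss is the α quantile of the MTD distribution; the lognormal posterior draws are purely hypothetical.

```r
## Minimal sketch: the minimizer of the posterior expected EWOC loss is the
## alpha-quantile of the MTD distribution (posterior draws are hypothetical).
set.seed(1)
alpha     <- 0.25
mtd_draws <- rlnorm(2e4, meanlog = log(20), sdlog = 0.5)

expected_loss <- function(x)
  mean(ifelse(x <= mtd_draws, alpha * (mtd_draws - x), (1 - alpha) * (x - mtd_draws)))

doses <- seq(5, 60, by = 0.25)
doses[which.min(sapply(doses, expected_loss))]  # close to...
quantile(mtd_draws, alpha)                      # ...the 25% posterior quantile
```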
In practice, with a discrete set of dose levels $d_1, \ldots, d_K$, the EWOC design selects
the dose that is within a certain tolerance, $T_1$, of the EWOC target dose
$x_{n+1} = F_n^{-1}(\alpha)$ and for which the predicted probability of the MTD being less than
the dose is within a certain tolerance, $T_2$, of the feasibility bound. For patient n + 1
the next recommended dose is therefore

$$\max\{d_1, \ldots, d_K : d_i - x_{n+1} \le T_1 \text{ and } F_n(d_i) - \alpha \le T_2\}.$$

A dose-toxicity model often used with the EWOC method is the two-parameter
logistic model, where

$$\pi(x) = p(\mathrm{DLT} \mid \mathrm{dose} = x) = \mathrm{logit}^{-1}(\beta_0 + \beta_1 x)$$

and x is the dose, either on the original dose scale or standardized. For example,
given a reference dose xR and using log(x/xR) as a standardized dose, the intercept β0
has the interpretation of being the log-odds of toxicity at the reference dose (see, e.g.,
Neuenschwander et al. 2008). By placing a bivariate normal prior distribution on the
parameters β0 and log(β1) we ensure a monotonically increasing dose-toxicity
relationship since β1 (the slope) is forced to be positive. An alternative parameterization,
originally proposed in the EWOC formulation (Babb et al. 1998), is to define
$\rho_0 = \pi(x_{\min}) = p(\mathrm{DLT} \mid \mathrm{dose} = x_{\min})$ as the probability of a DLT at the lowest dose,
$x_{\min}$, and γ as the MTD. Then it can be shown that

$$\mathrm{logit}(\rho_0) = \beta_0 + \beta_1 x_{\min}$$

and

$$\mathrm{logit}(\theta) = \beta_0 + \beta_1 \gamma,$$

where θ is the TTL. The rationale for the re-parameterization is that it may be easier
to specify prior distributions for γ and ρ0, which then can be translated to priors for β0
and β1 (using MCMC for example). In a phase I trial of 5-fluorouracil (5-FU) Babb
et al. (1998) propose independent Uniform (xmin, xmax) and (0, θ) distributions for γ
and ρ0, respectively, which forces the MTD to exist in the prespecified dose range. In
further investigations by Tighiouart et al. (2005) a joint prior for γ and ρ0 with
negative correlation structure was found to perform well and which generally
resulted in a safer trial. An issue with this parameterization and choice of priors is
that the MTD has prior (and hence posterior) probability of 1 of lying between xmin
and xmax. One solution proposed by Tighiouart et al. (2018) is to reparametrize the
EWOC model in terms of ρ0 and ρ1, the probabilities of DLT at the minimum and
maximum doses, respectively.
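The translation from priors on (ρ0, γ) to priors on (β0, β1) is just a change of variables, since the two logit equations above can be solved directly; a minimal sketch follows, in which θ, the dose range, and the uniform priors are illustrative values only.

```r
## Solving logit(rho0) = beta0 + beta1*xmin and logit(theta) = beta0 + beta1*gamma
## for (beta0, beta1); theta, xmin, xmax, and the priors are illustrative only.
logit <- function(p) log(p / (1 - p))
theta <- 1/3; xmin <- 140; xmax <- 425
rho0  <- runif(1e4, 0, theta)          # independent uniform priors, as in the text
gamma <- runif(1e4, xmin, xmax)
beta1 <- (logit(theta) - logit(rho0)) / (gamma - xmin)
beta0 <- logit(rho0) - beta1 * xmin    # induced prior draws for (beta0, beta1)
```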

Example 1: Dose-Escalation Cancer Trial

Neuenschwander et al. (2008) describe a dose-escalation cancer trial designed to
characterize the safety, tolerability, and pharmacokinetic profile of a drug. Fifteen
doses were prespecified as doses that could be experimented on during the trial:
1, 2.5, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 150, 200, and 250 mg. The trial initially
recruited five cohorts of individuals of sizes 3, 4, 5, 4, and 2 to doses 1, 2.5, 5, 10,
and 25 mg, respectively, with no DLTs experienced in the first four dose levels and
2 (out of 2) DLTs seen at dose 25 mg. The target toxicity was 30% and the original
CRM model recommended continued escalation (to dose 40 mg) for cohort 6, using
a one-parameter power model and a recommendation rule based on the point
estimates for the probability of DLT at each dose. This unexpected recommendation
led to further critical re-evaluation of the CRM approach.
An alternative two-parameter model with standardized dose log(x/250) and a
non-informative bivariate lognormal prior on the untransformed parameters was
used (see prior B in Neuenschwander et al. (2008)). Figure 1 shows the posterior
distribution of the MTD from this model after the first five cohorts had been
recruited, where the dose axis is truncated to doses ≤ 40 mg. The potential doses that
can be tested within the trial are shown as points on the x-axis. At a feasibility bound
of α = 0.25, the inverse cumulative distribution function of the MTD, denoted
$F_n^{-1}(0.25)$ on the figure, is 18.2 mg. To choose from the discrete dose levels, suppose
we set strict thresholds such that the next dose is not more than 1 mg above
$F_n^{-1}(0.25)$ and the probability that the MTD is below the next dose is not more
than α + 0.05 = 0.30; that is, we set $T_1 = 1$ and $T_2 = 0.05$. Dose 20 mg does not
satisfy the first criterion, and therefore the recommended next dose would be 15 mg,
which satisfies both constraints. This contrasts with a recommended next dose of
25 mg if the median of the distribution, $F_n^{-1}(0.5)$, is used, which would correspond
to the sixth cohort receiving the same dose as the fifth cohort.

Fig. 1 Posterior distribution of the maximum tolerated dose (MTD) from Example 1 after five
cohorts of patients have been recruited, with the α = 0.25 and 0.5 quantiles, $F_n^{-1}(0.25)$ and $F_n^{-1}(0.5)$
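A short sketch of this discrete selection rule follows. The posterior draws below are a hypothetical stand-in for the fitted two-parameter model, so only the rule itself (with the tolerances $T_1$ and $T_2$ of the text) is being illustrated.

```r
## Sketch of the discrete-dose EWOC rule; mtd_draws stand in for posterior
## MTD samples from a fitted model and are purely illustrative.
select_ewoc_dose <- function(mtd_draws, doses, alpha = 0.25, T1 = 1, T2 = 0.05) {
  x_next <- unname(quantile(mtd_draws, alpha))           # F_n^{-1}(alpha)
  Fn     <- sapply(doses, function(d) mean(mtd_draws <= d))
  ok     <- (doses - x_next <= T1) & (Fn - alpha <= T2)
  if (!any(ok)) return(NA)
  max(doses[ok])                                         # largest admissible dose
}

set.seed(2)
doses <- c(1, 2.5, 5, 10, 15, 20, 25, 30, 40, 50, 75, 100, 150, 200, 250)
select_ewoc_dose(rlnorm(1e5, log(30), 0.6), doses)
```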

Varying the Feasibility Bound

Different proposals have been made for choosing the final dose recommended for phase
II study at the end of an EWOC trial (Babb et al. 1998; Berry et al. 2010). One
potentially undesirable feature of choosing a central estimate from the posterior
MTD distribution (e.g., mean, median, or mode) is that the estimate may be larger
than any dose experimented on in the trial. It may also be undesirable to choose the dose
that would be given, based on the feasibility bound, if a new patient were recruited
into the trial, since the final recommended dose is then acknowledged to have posterior
probability (1 − α) of being less than the MTD. An alternative approach, originally
proposed by Babb and Rogatko (2001) and later by Chu et al. (2009), is to vary the
feasibility bound as the trial progresses, specifically increasing the bound until it reaches
0.5, at which point the EWOC method behaves like a CRM (with decisions based
on the posterior median). The rationale is that early on in the trial there is a lot of
uncertainty about the value of the MTD and hence more chance of administering
doses that are much greater than the MTD; once a number of patients have been
recruited, the magnitude of overdosing will be smaller and hence the feasibility bound can
be raised. This hybrid approach should therefore converge more quickly to the MTD than the
traditional EWOC method while also ensuring that the recommended phase II dose
coincides with the central estimate from the MTD distribution.

Toxicity-Dependent Feasibility Bounds

Increasing the feasibility bound during the trial is often done using a step-wise
procedure. However, this approach can lead to incoherence; that is,
despite the most recent patient experiencing a DLT, the recommendation may be to
treat the next patient at a higher dose (Wheeler et al. 2017). While both the
unmodified CRM and EWOC approaches have been shown to be coherent (the
latter for n ≥ 2) (Cheung 2005; Tighiouart and Rogatko 2010), coherence violations
may occur when the EWOC approach is used with an increasing feasibility bound (Wheeler
2018). To overcome this issue, Wheeler et al. (2017) introduced a toxicity-dependent
feasibility bound that guarantees coherence and in which the feasibility bound
increases as a function of the number of non-DLT responses observed.

Software

EWOC-type designs can be fitted using a number of software packages. The bcrm
package in R allows the user to fit EWOC-type designs by specifying the quantile of
the MTD distribution that should be used for dose-escalation decisions (Sweeting
et al. 2013). The package allows users to conduct a trial interactively or to investigate
operating characteristics via simulation; however, it only allows specification
of prior distributions on the regression parameters of the two-parameter
logistic model. Alternatively, the ewoc package, also in R (Diniz 2018), is specifically
designed for EWOC designs, allowing the user to explicitly set priors for ρ0 (the
probability of DLT at the minimum dose) and γ (the MTD). Users are limited,
however, to independent Beta prior distributions for ρ0 and γ, or priors can be placed
on ρ0 and ρ1, as proposed by Tighiouart et al. (2018). Finally, a graphical user
interface application by Dinart et al. (2020) is available to download, allowing
users to run and simulate EWOC trials with minimal programming experience
(https://github.com/ddinart/GUIP1).

Time-to-Event Endpoints

Many designs for dose-escalation studies require that, for a patient's DLT outcome to
be included in dose-escalation decision making, the patient must be observed until the
end of the DLT observation period or until a DLT occurs, whichever comes
first. In practice, patients may be recruited to trials while other patients are still receiving
treatment, so complete outcomes will not be available for all patients even
though a decision on dose allocation for the next patients is required. To accommodate
these situations, partial DLT observations may be used to estimate DLT risks at
each dose level, conditional on the absence of a DLT up to the current time. This also
offers the benefit of reducing the overall trial duration.
Cheung and Chappell (2000) proposed an adaptation to the CRM design to
accommodate partial DLT outcomes, known as the Time-to-Event CRM (TITE-
CRM). Under the TITE-CRM, the likelihood for the single model parameter a is
weighted according to the proportion of each patient’s DLT window for which a DLT
has not been observed. That is, for patients 1, ..., n, let $x_i$ and $y_{i,t}$ be the dose given
and the current DLT outcome at time t for patient i, and let $\mathcal{D}_n$ be the set of all data for
patients 1, ..., n. The likelihood for parameter a is defined as

$$L(a \mid \mathcal{D}_n, t) = \prod_{i=1}^{n} \{\pi(x_i; a)\}^{y_{i,t}} \{1 - w_{i,t}\,\pi(x_i; a)\}^{1 - y_{i,t}},$$

where $w_{i,t} = 0$ if a DLT has been observed by time t (i.e., $y_{i,t} = 1$) and $w_{i,t} = (t - t_{i,0})/T$
if $y_{i,t} = 0$ and $t \le T + t_{i,0}$, where $t_{i,0}$ is the time at which patient i started treatment and
T is the length of the DLT observation window. As t increases, the contribution of
patient i to the likelihood, in the absence of a DLT, increases. The rest of the trial
design process is similar to that of the CRM.
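A minimal sketch of this weighted likelihood is given below, assuming the one-parameter power ("empiric") model $\pi(x; a) = \text{skeleton}(x)^{\exp(a)}$ and a N(0, 1.34) prior on a (a common choice, not necessarily that of any particular trial); all data are invented.

```r
## TITE-CRM weighted log-likelihood for the power model; all data invented.
tite_loglik <- function(a, skel, level, y, t_follow, Tobs) {
  w <- ifelse(y == 1, 1, pmin(t_follow / Tobs, 1))  # weight is irrelevant when y = 1
  p <- skel[level]^exp(a)
  sum(y * log(p) + (1 - y) * log(1 - w * p))
}
log_post <- function(a, ...) tite_loglik(a, ...) + dnorm(a, 0, sqrt(1.34), log = TRUE)

skel  <- c(0.10, 0.15, 0.20, 0.25)   # skeleton (as in Example 2 below)
level <- c(1, 1, 2, 2, 3)            # dose levels given
y     <- c(0, 0, 0, 1, 0)            # current DLT indicators
tfoll <- c(9, 9, 9, 4, 3)            # weeks of follow-up so far
a_hat <- optimize(function(a) -log_post(a, skel, level, y, tfoll, Tobs = 9),
                  interval = c(-5, 5))$minimum
round(skel^exp(a_hat), 3)            # plug-in DLT risk estimates per dose
```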
Extensions to the TITE-CRM have also been proposed. Braun et al. (2003)
extended the TITE-CRM to adapt the length of schedule (which they refer to as
dose) both between and within patients, in order to identify the Maximum Tolerated
Cumulative Dose (cumulative, as the schedule may change when a patient is on
treatment, and it is the total length of administration that is of interest to the
investigators). Braun (2006) also generalized the TITE-CRM approach to borrow
information on the timing of toxicity across patients. Furthermore, Mauguen et al.
(2011) and Tighiouart et al. (2014) have combined the TITE approach with the
EWOC trial design, thus allowing for overdose control methods to be used in dose
escalation studies with partial observations over a patient’s DLT window.
A similar approach was employed by Ivanova et al. (2016) in their Rapid
Enrollment Design (RED); rather than using the weighting structure of Cheung
and Chappell (2000), they proposed that a patient who has been followed up for
proportion t/T of their DLT window without a DLT being observed has experienced
1 – t/T of a temporary DLT. As t increases, the weighting ascribed to the patient’s
DLT risk goes down, and this in turn updates the likelihood. The subtle difference
between these two approaches is as follows: in the TITE-CRM, a patient who has
completed 70% of their DLT window without having a DLT is included as 0 DLTs
out of 0.7 patients; in the RED, the same patient would be included as 0.3 DLTs out
of one patient. TITE endpoints have also been included in the design of combination
therapy dose-escalation studies (Wages et al. 2013; Wheeler et al. 2019).

Example 2: Dose Escalation of Cisplatin in Pancreatic Cancer

Muler et al. (2004) report the results of a phase I trial with the objective of
identifying the MTD of cisplatin when given with fixed doses of gemcitabine and
radiotherapy in patients with pancreatic cancer. The investigators planned to inves-
tigate four dose levels (20, 30, 40, and 50 mg/m2), with 30 mg/m2 as the starting dose
and the target toxicity level chosen as 20%. Dose-escalation decisions were
recommended using the TITE-CRM design, which used a one-parameter logistic
model, with an exponential prior distribution on the model parameter. DLT was
defined as either Grade 4 thrombocytopenia, Grade 4 neutropenia lasting more than
7 days, or any other adverse event of at least grade 3, and the DLT observation
window was 9 weeks from start of treatment. Prior to the study starting, the skeleton
DLT probabilities at each dose were chosen to be 10%, 15%, 20%, and 25%,
respectively.
Figure 2 shows the entry and follow-up times for the 18 patients who were
considered evaluable for toxicity. Patients 1 through 4 were allocated to the starting
dose, and enough DLT-free follow-up time was observed before patient 5 was allocated
to 40 mg/m2 (on June 26, 2000). Patient 9 was allocated to 50 mg/m2 on November 27, 2000.
Patients 11 and 12 experienced DLTs during follow-up at 50 mg/m2, leading to
patient 16 receiving the lower dose of 40 mg/m2 and patient 18 receiving 30 mg/m2
following a further DLT in patient 17. At the end of the trial, the 40 mg/m2 dose
had an expected posterior DLT probability of 0.204 (95% credible interval
0.064–0.427).

Fig. 2 Trial conduct for Muler et al. (2004)

Software

The R software package dfcrm can be used to design a TITE-CRM trial.
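A hypothetical call mirroring Example 2 is sketched below; the argument and element names follow our reading of the dfcrm documentation and should be checked against the installed package before use, and the patient data are invented.

```r
## Hypothetical titecrm() call in the spirit of Example 2 (data invented).
library(dfcrm)
fit <- titecrm(prior    = c(0.10, 0.15, 0.20, 0.25),  # skeleton
               target   = 0.20,                       # target toxicity level
               tox      = c(0, 0, 0, 1, 0),           # DLT indicators so far
               level    = c(1, 1, 2, 2, 3),           # dose levels given
               followup = c(9, 9, 9, 4, 3),           # weeks followed
               obswin   = 9)                          # DLT window (weeks)
fit$mtd   # recommended dose level for the next patient
```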

Toxicity Grading

Much of the literature in dose-finding trial designs focuses on the binary DLT
outcome in order to identify the MTD. Toxicities are usually graded using the
Common Terminology Criteria for Adverse Events (CTCAE) published by the US
National Cancer Institute (http://evs.nci.nih.gov/ftp1/CTCAE/About.html) and in
turn the dose-limiting toxicities will be tailored to the trial in question. Toxicities
are grouped under different System Organ Classes, and each toxicity (e.g., "nausea,"
"dermatitis," "neutrophil count decreased") is graded from 0 (no toxicity) to
5 (death). A DLT is usually defined as any toxicity of grade 3 or above, whereas a
non-DLT is a toxicity within grades 0–2. This simplification of the outcome helps with
dose-escalation decision making, but it is well known to discard information, so that
estimation of the MTD may be less efficient. Using
graded toxicities in dose-escalation designs can potentially give trialists more
information about the speed at which dose escalation should occur; for example, if grade
2 toxicities are being observed, trialists may wish to slow escalation,
since this may be indicative of more severe toxicities at nearby higher doses (Van
Meter et al. 2012). There has been an increase in the number of papers that handle
toxicity gradings directly. The model-based methods for handling toxicity grades fall
into two broad categories: those that use ordinal toxicity gradings directly and those
that use a score-based (continuous) outcome.

Ordinal Toxicity Gradings

A number of phase I designs have been proposed in the literature that incorporate
ordinal toxicity outcomes. Approaches have used either a proportional odds
(PO) model (Van Meter et al. 2011; Tighiouart et al. 2012), a continuation ratio
(CR) model (Van Meter et al. 2012), or a multinomial extension of the CRM power
model (Iasonos et al. 2011) to account for the ordinal toxicity outcome. The PO
model relies on the assumption that the odds of a more severe toxicity grade relative
to any less severe toxicity is constant among all possible toxicity grades (Van Meter
et al. 2012). That is the odds that the toxicity grade is 2 versus <2 is the same as the
odds that the toxicity is 3 versus <3, etc. Meanwhile, the CR method models the
probability that the toxicity is at level g given it is greater than or equal to g but relies
on its own assumption of homogeneity of grade-specific dose effects (Cole and
Ananth 2001). However, with these assumptions, the models can focus on estimating
just one quantile of interest, namely the dose that gives the target probability of
observing grades of toxicity that define a DLT. Information from non-DLT grades are
used to refine the estimation of the relationship between dose and the common odds
ratio. To avoid assumptions imposed by the PO or CR methods, a nonparametric
approach has been proposed using a multidimensional isotonic regression estimator
(Paul et al. 2004). This allows nonparametric estimation of quantiles for each
toxicity grade subject to order constraints and based on a corresponding set of
prespecified probabilities for each grade.

Toxicity Score Approach

There are several other approaches of note that collapse the ordinal toxicities into a
single equivalent toxicity score (between 0 and 1) such as a beta regression model
(Potthoff and George 2009) or a quasi-Bernoulli likelihood approach (Yuan et al.
2007; Ezzalfani et al. 2013). The latter uses a standard CRM model but requires a
clinically meaningful toxicity score to be assigned to each grade of toxicity.

Another approach uses an ordinal probit regression with a latent variable for each
toxicity type under consideration (Bekele and Thall 2004; Lee et al. 2010). The
probability that the toxicity is at a given level (grade) g = 0, ..., G is then modeled
using the probit model and G − 1 cutoff parameters. Bekele and Thall (2004) used a
multivariate ordinal probit regression approach that allowed multiple toxicity types
(myelosuppression, dermatitis, liver toxicity, nausea/vomiting, and fatigue), each
graded, to be modeled simultaneously with correlation. The authors then quantified
the severity of each toxicity type and grade by eliciting numerical weights. For each
dose under consideration the posterior expected probability of each toxicity type and
grade was multiplied by its associated severity weight and the sum of these across types
and grades gave the overall total toxicity burden (TTB) for that dose. Dose escalation
then proceeded by assigning the next patient the dose with TTB closest to a prespecified
target TTB (elicited through a set of scenario analyses with the oncologists).
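The TTB computation itself is simple arithmetic once the posterior probabilities and severity weights are in hand; the following sketch uses entirely invented toxicity types, weights, and probabilities.

```r
## Sketch of a total toxicity burden (TTB) calculation; all numbers invented.
weights <- c(myelo_g3 = 1.0, myelo_g4 = 2.5, derm_g3 = 1.5, liver_g3 = 2.0)
post_prob <- rbind(                       # rows = doses, cols = type/grade events
  dose1 = c(0.10, 0.02, 0.05, 0.03),
  dose2 = c(0.18, 0.05, 0.08, 0.06),
  dose3 = c(0.28, 0.10, 0.12, 0.10))
ttb <- drop(post_prob %*% weights)        # TTB per dose
target_ttb <- 0.9                         # elicited target (hypothetical)
names(ttb)[which.min(abs(ttb - target_ttb))]  # dose assigned to the next patient
```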

Software

The R package ordcrm allows the user to fit both the ordinal PO and CR CRM
models.

Dual Endpoints

The focus of most dose-escalation designs is purely on toxicity, and the common
assumption is that as dose increases, so do both the risk of toxicity and the efficacy
of the drug. However, it may be more prudent to model the dose-efficacy relationship
as well as dose-toxicity. Efficacy may plateau beyond a certain dose, and
increasing a dose with no gain in efficacy but a potential increase in
toxicity would be unwise. Many approaches have therefore been proposed
to jointly model dose-efficacy and dose-toxicity outcomes.

The EffTox Design

Thall and Cook (2004) proposed what has come to be known as the EffTox design, a
Bayesian approach that models the efficacy and toxicity risks per dose and uses the
trade-off between toxicity and efficacy to select dose levels for new patients.
Specifically, logistic functions are assumed for the dose-toxicity and dose-efficacy
curves, that is,

$$\mathrm{logit}(\pi_T(x; \beta_T)) = \beta_{T,0} + \beta_{T,1} x$$

and

$$\mathrm{logit}(\pi_E(x; \beta_E)) = \beta_{E,0} + \beta_{E,1} x + \beta_{E,2} x^2.$$



The dose-efficacy relationship includes the quadratic term $\beta_{E,2} x^2$ to permit a
turning point in the curve. Both $\pi_T$ and $\pi_E$ are combined using the Gumbel copula
model, so that the probability of each toxicity-efficacy outcome (a, b), where
a and b take value 0 if toxicity or efficacy (respectively) does not occur and 1 if it
does, is given as

$$\pi_{a,b} = \pi_T^a (1 - \pi_T)^{1-a}\, \pi_E^b (1 - \pi_E)^{1-b} + (-1)^{a+b}\, \pi_T (1 - \pi_T)\, \pi_E (1 - \pi_E)\, \frac{e^{\phi} - 1}{e^{\phi} + 1},$$

which for patients 1, ..., n gives the likelihood

$$L(\beta_E, \beta_T, \phi \mid \mathcal{D}_n) = \prod_{i=1}^{n} \pi_{a_i, b_i}(x_i).$$

As per Thall and Cook (2006) and Brock et al. (2017), prior beliefs on the
efficacy and toxicity at each dose must be elicited from the clinicians, along with
the prior Effective Sample Size (ESS). It is then possible to transform these prior
beliefs and ESS onto the model parameters {βT,0, βT,1, βE,0, βE,1, βE,2, φ} using
specialist software (EffTox software, MD Anderson, https://biostatistics.mdanderson.org/softwaredownload/SingleSoftware.aspx?Software_Id=2;
or the R package trialr). Thall et al. (2014) show how different prior effective sample
sizes affect the operating characteristics of the EffTox design, including the
probability of selecting each dose as the optimum dose and the probability of
terminating the trial early.
The key step under the EffTox approach is to define a utility function that reflects
the trade-offs between efficacy and toxicity that the trial team are willing to accept.
To do this, three target trade-offs are specified: $\pi_1^* = (\pi_{T,1}^*, 1)$, where $\pi_{T,1}^*$ is the
maximum acceptable toxicity probability when efficacy is guaranteed; $\pi_2^* = (0, \pi_{E,2}^*)$,
where $\pi_{E,2}^*$ is the minimum acceptable efficacy probability when toxicity is guaranteed
not to occur; and $\pi_3^* = (\pi_{T,3}^*, \pi_{E,3}^*)$, an intermediate target between the two marginal
targets $\pi_1^*$ and $\pi_2^*$ that, with a contour fitted through all three target trade-offs,
will provide a suitably steep contour to encourage escalation to doses that are estimated
to have substantially higher efficacy probabilities with only a limited increase in toxicity
risk (Yuan et al. 2017; Brock et al. 2017). Thall et al. (2014) use $L^p$ norms to model the
utility contours, specifically

$$u(\pi_T, \pi_E) = 1 - \left[ \left( \frac{1 - \pi_E}{1 - \pi_{E,2}^*} \right)^p + \left( \frac{\pi_T}{\pi_{T,1}^*} \right)^p \right]^{1/p},$$

where p determines the extent of the curvature of the contours. The utility function
u allows the desirability of a dose level to be evaluated from its estimated
probabilities of toxicity and efficacy. The value of p is obtained by solving
$u(\pi_{T,3}^*, \pi_{E,3}^*) = 0$, so that all three target trade-offs lie on the neutral contour
u = 0. We may then recommend for the next patient the dose that maximizes this
utility, subject to any other constraints one may wish to use in the trial. For example,
if we have a target minimum efficacy $\underline{\pi}_E$ and a target maximum toxicity $\overline{\pi}_T$,
then for selected cutoffs $p_E$ and $p_T$, only doses that satisfy the following
constraints are available for recommendation:

$$\Pr(\pi_E(x) \ge \underline{\pi}_E) > p_E$$

and

$$\Pr(\pi_T(x) \le \overline{\pi}_T) > p_T.$$
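A minimal sketch of the utility and admissibility calculations follows; the contour parameters, thresholds, and posterior draws are all invented for illustration.

```r
## EffTox-style utility and admissibility check; all inputs are illustrative.
utility <- function(piT, piE, piE2 = 0.40, piT1 = 0.70, p = 2.07)
  1 - (((1 - piE) / (1 - piE2))^p + (piT / piT1)^p)^(1 / p)

set.seed(3)
piT_draws <- rbeta(1e4, 2, 18)    # hypothetical posterior draws for one dose
piE_draws <- rbeta(1e4, 9, 11)
admissible <- mean(piE_draws > 0.45) > 0.03 && mean(piT_draws < 0.40) > 0.05
utility(mean(piT_draws), mean(piE_draws))   # utility at the posterior means
```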

Example 3: The Matchpoint Trial

Brock et al. (2017) described how they designed the Matchpoint trial, a dose-finding
study of Ponatinib plus chemotherapy in patients with chronic myeloid leukemia in
blastic transformation phase, using the EffTox design. The aim of the study was to
identify the dose of Ponatinib that produced a minimum efficacy response rate of
45%, with an acceptable toxicity level of at most 40%. Four doses were considered:
15 mg every second day, 15 mg daily, 30 mg daily, and 45 mg daily. Clinicians
specified prior toxicity and efficacy probabilities as shown in Table 1 and, with the
help of the trial team, chose cutoffs for admissible doses of $p_E = 0.03$ and
$p_T = 0.05$. These low thresholds meant that even weak beliefs about the efficacy and
toxicity probabilities would still allow doses to be admissible.

For their three target trade-off points, the team chose three points in the toxicity-efficacy
space that they felt had equal utility and solved simultaneous equations to
identify what $\pi_1^*$, $\pi_2^*$, and $\pi_3^*$ should be. This resulted in $\pi_2^* = (0, 0.40)$ and
$\pi_1^* = (0.70, 1)$, giving p = 2.07. The resultant utility curves for different utility/desirability
levels are shown in Fig. 3; the neutral contour is shown in blue and yields an interior
target point of $\pi_3^* = (0.4, 0.5)$. Any other point lying on this curve could be selected
as an interior target point. A trial-and-error approach was used to select the ESS
based on the operating characteristics from simulation studies, similar to the
approach of Thall et al. (2014). For the Matchpoint trial, the investigators set the
ESS to 1.3 to obtain prior distributions for their model parameters.
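As a minimal check of the quoted contour parameter, p = 2.07 can be recovered by requiring the interior point (0.4, 0.5) to lie on the neutral contour (the two marginal targets lie on it for any p):

```r
## Recovering p: with pi*_{E,2} = 0.40 and pi*_{T,1} = 0.70, solve
## u(0.4, 0.5) = 0 for p; the marginal targets satisfy u = 0 automatically.
f <- function(p) ((1 - 0.5) / (1 - 0.40))^p + (0.4 / 0.70)^p - 1
uniroot(f, c(1, 10))$root   # approximately 2.07
```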
The Matchpoint trial is currently ongoing, so results are not available. However,
Brock et al. (2017) provided results of simulation studies to assess their trial design
across six scenarios with different dose-efficacy and dose-toxicity relationships.
Figure 4 shows the probability of selecting each dose (or not recommending a
dose due to safety) across the six scenarios for their proposed design, along with the
true toxicity and efficacy probabilities in each scenario.

Table 1 Dose levels and prior probabilities for the Matchpoint trial

Dose level        1                      2            3            4
Ponatinib dose    15 mg every other day  15 mg daily  30 mg daily  45 mg daily
Prior Pr(Eff)     0.20                   0.30         0.50         0.60
Prior Pr(Tox)     0.025                  0.05         0.10         0.25

Fig. 3 Utility contours elicited for the Matchpoint trial. Green circles show the three trade-off
points used to fit contours. Blue line shows the neutral contour fitted through the trade-off points

Fig. 4 Simulation results for the EffTox design proposed for the Matchpoint trial. Gray shaded
area shows the set of points that are more desirable than the elicited neutral contour. Areas of points
are proportional to the probability of choosing that point as the MTD. Scenario orderings for "no
dose chosen" results are 1 (black), 2 (red), 3 (orange), 4 (purple), 5 (blue), and 6 (brown)

Other Approaches for Joint Modeling of Efficacy and Toxicity

Additional Bayesian approaches for jointly modeling efficacy and toxicity outcomes
have been proposed in the literature. Thall and Russell (1998) describe a proportional
odds approach to dose escalation where there are three measurable outcomes
concerning adverse events and the onset of Graft versus Host Disease (GvHD): no
severe toxicities and no GvHD (outcome 1), no severe toxicities and only moderate
GvHD (outcome 2), and either severe toxicity or severe GvHD (outcome 3). The aim
is to find the dose that has an expected probability of outcome 2 of at least 50%, but
with the expected probability of outcome 3 being no greater than 10%. A parsimonious
modeling approach is used, whereby $\gamma_j(x) = P(\text{Outcome} > j)$, and

$$\gamma_0(x) = 1, \qquad \gamma_1(x) = \frac{\exp(\mu + \alpha + \beta x)}{1 + \exp(\mu + \alpha + \beta x)}, \qquad \gamma_2(x) = \frac{\exp(\mu + \beta x)}{1 + \exp(\mu + \beta x)}.$$


Model parameters are updated by standard Bayesian methods, and the dose
with the highest probability of outcome 2 (i.e., moderate GvHD and no severe
toxicity) is then chosen, subject to the constraint that $\gamma_2(x) \le 0.10$. A related approach by
Braun (2002) extended the CRM to jointly model the probabilities of severe toxicity
and disease progression.
Zhang et al. (2006) proposed a continuation-ratio approach for jointly modeling
efficacy and toxicity. Specifically, given dose, three probabilities are of interest:
ψ 0(x), the probability of no efficacy and no DLT; ψ 1(x), the probability of efficacy
and no DLT; ψ 2(x), the probability of DLT, regardless of efficacy status. These
probabilities are then modeled to allow toxicity to increase monotonically with dose,
but for ψ 1(x) to be non-monotonic with dose:

$$\log\left(\frac{\psi_1(x)}{\psi_0(x)}\right) = \alpha_1 + \beta_1 x$$

and

$$\log\left(\frac{\psi_2(x)}{1 - \psi_2(x)}\right) = \alpha_2 + \beta_2 x.$$

Then, with constraints on the model parameters, specifically α1 > α2 and β1,
β2 > 0, the above equations can be solved to give expressions directly for ψ0(x),
ψ1(x), and ψ2(x). Given a target toxicity level θ, the dose-finding algorithm is based
on two decision functions:

$$\delta_1(x) = I[\psi_2(x) < \theta]$$

and

$$\delta_2(x) = \psi_1(x) - \lambda \psi_2(x),$$

where λ is the weight for the toxicity risk of dose x relative to its efficacy. The dose $x^*$
for the next patient is that which satisfies $\delta_1(x^*) = 1$ and $\delta_2(x^*) = \max_{x \in \Xi} \delta_2(x)$,
where Ξ is the dose range (or set of doses) under consideration. Other approaches
using the continuation ratio model have been published since, including those for
combination therapy trials (Mandrekar et al. 2007, 2010).
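A minimal sketch of this decision rule for a discrete dose set is given below; the parameter values are invented for illustration.

```r
## Zhang et al. (2006)-style dose selection; parameter values are invented.
expit <- function(z) exp(z) / (1 + exp(z))
select_dose <- function(x, a1, b1, a2, b2, theta, lambda) {
  psi2 <- expit(a2 + b2 * x)               # P(DLT)
  psi1 <- (1 - psi2) * expit(a1 + b1 * x)  # P(efficacy and no DLT)
  safe <- psi2 < theta                     # delta_1(x) = 1
  if (!any(safe)) return(NA)
  score <- psi1 - lambda * psi2            # delta_2(x)
  x[safe][which.max(score[safe])]
}
select_dose(x = 1:6, a1 = -3, b1 = 0.8, a2 = -5, b2 = 0.7,
            theta = 0.30, lambda = 0.5)
```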
Dragalin and Fedorov (2006) proposed using optimal design theory for dose-
finding studies with joint endpoint data. For a joint probability model py,z (x), where
x is the dose, y is the binary efficacy outcome, and z is the binary toxicity outcome,
the authors suggest either a Gumbel-type bivariate logistic regression, such as that
used in the EffTox design, or a Cox bivariate binary model. In both cases, an analytical
expression for the Fisher Information Matrix (FIM) is obtained. A common choice of
optimization is the D-optimality criterion, which chooses the dose for the next
individual patient that maximizes the determinant of the FIM. An optimal design
allows the trial to obtain as much information as possible about the joint probability
model. However, the optimal dose may not always be a safe dose. Therefore the
range of doses from which the optimal dose is chosen can be restricted to doses
within the therapeutic range (above the posterior estimate of the minimum effective
dose and below the posterior estimate of the MTD). Other constraints for defining
admissible doses are also explored by Dragalin and Fedorov (2006).
The use of D-optimality and an admissible dose range aims to blend together two
goals in drug development: doing what is best for the population (by learning as
much as possible about the dose-efficacy and dose-toxicity relationships) and doing
what is best for the patient (by giving them the dose that has a controlled toxicity risk
but some efficacy benefit). Optimal design-theoretic approaches for dose-finding
studies have also been proposed by others (Pronzato 2010; Padmanabhan et al.
2010a; Padmanabhan et al. 2010b; Dragalin et al. 2008).

Dual-Agent and Dose-Schedule-Finding Studies

After exploring different endpoints and joint modeling of efficacy and toxicity
outcomes for single-agent dose-escalation designs, a natural progression for research
and application of such designs was into trials where two or more treatment-related
quantities were to be adapted. Since many treatment plans are formed from
combinations of drugs, or even different treatment modalities, dose-finding studies
may wish to vary the dosage/level of multiple treatments to find one or more
maximum tolerated dose combinations, or optimal biological dose combinations.
Furthermore, even with a single-agent study with one treatment being adapted, it
may be of interest to explore different dose administration schedules (e.g., 200 mg
daily for two weeks versus 100 mg daily for three weeks) to identify a maximum
tolerated dose-schedule combination. As treatments for patients become more com-
plex, so too must trial designs.
Harrington et al. (2013) conducted an extensive review of available trial designs for
dual-agent dose-escalation studies, many of which may also be applied to dose-schedule
finding trials. The review discusses both rule- and model-based approaches and, though
they were few in number, identified several case studies of their implementation in practice. From this
and more recent reviews (see Wages et al. 2016; Hirakawa et al. 2018; Riviere et al.
2014), we discuss several designs to illustrate the range of available approaches.

Extensions to the CRM

Studies of dual-agent combinations are complicated by the lack of a well-defined
ordering with respect to increasing toxicity (and also efficacy, if monotonic); that is,
the relationship between dose combinations and toxicity risk is only partially
ordered. While we may safely assume that, holding all else constant, an increase
in one treatment will maintain or increase the risk of toxicity, and that this property
also holds when both treatments are increased, we cannot be sure what the relationship
is when one treatment is increased and the other is decreased. Figure 5 shows an
example dose-toxicity grid with six combinations. The grid is partially ordered, in
that we know, for example, that d5 is no more toxic than d6 but is at least as toxic as
both d1 and d3; however, we do not know whether d5 is more or less toxic than d2
or d4.

Fig. 5 Dose-toxicity grid and partial ordering



There exist five possible simple orders:

1 → 2 → 3 → 4 → 5 → 6
1 → 2 → 3 → 5 → 4 → 6
1 → 3 → 2 → 4 → 5 → 6
1 → 3 → 2 → 5 → 4 → 6
1 → 3 → 5 → 2 → 4 → 6

It is the investigator's job to identify one or more maximum tolerated dose
combinations while accounting for potential uncertainty in the true dose-toxicity
relationship given the partial ordering.
Wages et al. (2011a, b) proposed an extension to the standard CRM design, the
Partially Ordered CRM (POCRM), which accounts for the partial ordering structure
in a dual-agent dose-escalation study setting and conveniently allows for a single-
agent dose-escalation approach to be used in a multi-agent setting. For a study of
K dose combinations, under the partial order structure present for the K dose
combinations, let us assume there are M possible simple orders. Let $\pi_m(d_k)$ be the
probability of DLT at dose combination k = 1, ..., K under the assumption that the
true dose-toxicity order is that specified by order m = 1, ..., M. The dose-toxicity
function π may be a one-parameter power or logistic model, as is often used for the CRM.
For each possible order m, we may obtain the likelihood for the data and, given some
prior belief on how likely order m is to be the true simple ordering for the dose-toxicity
relationship, generate a posterior probability that order m is indeed the true
simple order. That is, letting f(m) be the prior belief that order m is the true simple
order, and $L_m(\mathcal{D}_n)$ the likelihood of the model parameters given current trial data $\mathcal{D}_n$
under simple order m, the posterior probability that order m is the true simple order is

$$\psi(m) = \frac{L_m(\mathcal{D}_n)\, f(m)}{\sum_{l=1}^{M} L_l(\mathcal{D}_n)\, f(l)}.$$

We may then choose order $m^* = \arg\max_{m = 1, \ldots, M} \psi(m)$ as the best guess of
the true simple order and apply the CRM for single-agent phase I trials to the dose
combinations under this specified ordering. Alternatively, we may randomly select
an ordering using the ψ(m) as selection probabilities; this may be beneficial if two
or more orderings have the same or very similar posterior weightings. An extension
of this approach including efficacy outcomes has also been proposed (Wages and
Conaway 2014).
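A sketch of the ψ(m) computation follows, assuming a one-parameter power model $\pi_m(d; a) = \text{skeleton}_m(d)^{\exp(a)}$ with a normal prior on a; the orderings match Fig. 5, while the skeleton and trial data are invented.

```r
## Posterior order probabilities for the POCRM; skeleton and data invented.
orders <- list(c(1,2,3,4,5,6), c(1,2,3,5,4,6), c(1,3,2,4,5,6),
               c(1,3,2,5,4,6), c(1,3,5,2,4,6))
skeleton <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.50)
d <- c(1, 2, 3, 3, 4)   # combinations given (labels as in Fig. 5)
y <- c(0, 0, 0, 1, 1)   # DLT outcomes

marg_lik <- function(ord) {          # marginal likelihood under order m
  skel_m <- numeric(6); skel_m[ord] <- skeleton
  integrate(Vectorize(function(a) {
    p <- skel_m[d]^exp(a)
    prod(p^y * (1 - p)^(1 - y)) * dnorm(a, 0, sqrt(1.34))
  }), -10, 10)$value
}

L   <- sapply(orders, marg_lik)
f_m <- rep(1 / 5, 5)                 # uniform prior over the simple orders
psi <- L * f_m / sum(L * f_m)        # posterior probability of each order
```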

Dose Toxicity Surface Models

Other approaches model the entire dose-toxicity surface. Gasparini (2013) describes
several models for dose-toxicity surfaces, which include logistic-type and copula-
type models employed by Thall et al. (2003), Wang and Ivanova (2005), Yin and

Yuan (2009a), and Yin and Yuan (2009b). Further to these, extensions of the EWOC
designs for the combination therapy setting have also been proposed for dual-agent
phase I dose-escalation studies (Jimenez et al. 2018; Tighiouart et al. 2017; Diniz
et al. 2017; Tighiouart 2018); we do not cover these here as they are discussed in
other areas of this book.

Example 4: Nilotinib plus Imatinib in Stromal Tumors

Bailey et al. (2009) used an extension of the logistic model to conduct a dose-escalation
study of nilotinib plus imatinib in adult patients with imatinib-resistant
gastrointestinal stromal tumors. Five doses of nilotinib {100, 200, 400, 600, 800} mg
and two doses of imatinib {600, 800} mg were considered, though patients could
also be given nilotinib alone (i.e., an imatinib dose of 0 mg). For each dose combination,
the posterior probabilities that the DLT risk π(x) lies in the underdose interval
[0, 0.20), the target interval [0.20, 0.35), the excessive-toxicity interval [0.35, 0.60),
or the unacceptable-toxicity interval [0.60, 1] are computed. These probability masses
are used for dose-escalation decisions.
The model in this trial was a four-parameter logistic-type model. For dose a of
nilotinib and dose b of imatinib, the probability of DLT at dose combination (a, b) is

$$\mathrm{logit}(\pi(a, b)) = \log(\alpha) + \beta \log\left(\frac{a}{a_R}\right) + \gamma_1 [b \ge 600] + \gamma_2 [b \ge 800],$$

where $a_R$ is a reference dose level for nilotinib and [·] denotes the indicator
function, taking value 1 if true and 0 otherwise. Using suitable priors on {α, β, γ1,
γ2}, the trial proceeds like most others in this chapter: after dose and DLT status data
are collected, Bayesian methods are used to update the posterior DLT risks per dose
and to calculate the four aforementioned interval probabilities. The dose
for the next patient is then the one with the largest probability of being in the target
range, that is, $x_{n+1} = \arg\max_{x \in \chi} P(\pi(x) \in [0.20, 0.35))$, subject to the constraint that
$P(\pi(x) \in [0.35, 1]) \le 0.25$. In addition, it was possible for patients to be
dosed at combinations with a smaller target interval probability than that of $x_{n+1}$
on the basis of additional clinical data, or if $x_{n+1}$ was a combination where one of the
drugs would be increased by more than 100% over the current highest level.
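A minimal sketch of this dose-combination model is given below; the reference dose, parameter values, and cumulative indicator coding are illustrative assumptions, not the trial's actual settings.

```r
## Dose-combination DLT model in the style of Example 4; all values invented.
p_dlt <- function(a, b, alpha, beta, g1, g2, aR = 400) {
  lp <- log(alpha) + beta * log(a / aR) + g1 * (b >= 600) + g2 * (b >= 800)
  1 / (1 + exp(-lp))   # inverse logit
}
p_dlt(a = 800, b = 400, alpha = 0.25, beta = 1.2, g1 = 0.5, g2 = 0.9)
```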
To begin, seven patients were recruited to (800, 0) and one had a DLT. Based on
the posterior interval probabilities and dose-escalation rules, three patients were
treated at (200, 800), none of whom had DLTs; during this time an additional two
patients received (800, 0) and neither experienced a DLT. Figure 6 shows the trial
progress and the allocation of patients to different dose combinations throughout the
trial, with Bayesian inference and evaluation of the model performed in five stages.

Fig. 6 Trial progress of nilotinib plus imatinib study. Evaluations (Eval) denote when a decision
was made to open up a new dose combination to recruitment based on estimates of target interval
and overdose interval probabilities

During the course of this trial, several key changes were made. First, severe skin
rash was added to the definition of DLT, which meant that, among the five patients who
received the (800, 800) combination, four were classed as having experienced DLTs;
this would previously have been only one patient under the old DLT definition.
Second, the investigators agreed to open up a 400 mg dose of imatinib after
observing the four DLTs at (800, 800). This meant that (a) the dose-toxicity model
had to be modified to include an additional parameter, so the model became

$$\mathrm{logit}(\pi(a, b)) = \log(\alpha) + \beta \log\left(\frac{a}{a_R}\right) + \gamma_0 [b \ge 400] + \gamma_1 [b \ge 600] + \gamma_2 [b \ge 800],$$

and (b) the prior distributions for γ1 and γ2 needed to be modified. Once this was
completed, posterior estimates for DLT probabilities and dosing interval probabilities
were recalculated. A further 16 patients were recruited to the new (800, 400)
combination, and three of these patients experienced DLTs. At the end of the study,
the (800, 400) dose, which had the largest probability mass in the target range of
[0.20, 0.35) and satisfied the aforementioned overdose constraint, was selected as the
maximum tolerated dose combination. Figure 7 shows the posterior probability mass
for target or overdosing at each dose combination.

Fig. 7 Summary of posterior probabilities of target (green circles) and overdosing (red circles) per
dose combination. Area of circles is proportional to probability. Combinations with red circles have
$P(\pi(x) \in [0.35, 1]) > 0.25$, so the target probability is not shown. Combinations with green circles
all have $P(\pi(x) \in [0.35, 1]) \le 0.25$

Bayesian Model-Free Approaches

Choosing a model for the above designs requires careful consideration, and appro-
priate priors need to be chosen. Furthermore, for a true dose-toxicity surface with a
very asymmetric shape (i.e., increasing one drug adds little to DLT risk, but
increasing the other drug adds a lot), it may be difficult to obtain reliable estimates
of DLT risks under some models. In light of this, Bayesian approaches have been
proposed where a model for the dose-toxicity surface is not required.
Lin and Yin (2017) proposed a Bayesian Optimal Interval (BOIN) design for
combination trials, whereby the dose for the next patient is either increased,
decreased, or maintained according to whether the estimated probability of DLT at the
current dose falls within predefined intervals. Let θ be the target toxicity level, $\pi_{jk}$ the
DLT risk at dose combination (j, k), and $\Delta_L$ and $\Delta_U$ the lower and upper limits for
the target DLT interval. If $\hat{\pi}_{jk}$, the number of DLTs at combination (j, k)
divided by the number of patients treated at (j, k), is less than $\theta - \Delta_L$, then the next patient
receives either combination (j, k + 1) or (j + 1, k), whichever combination maximizes
$P(\pi_{jk} \in (\theta - \Delta_L, \theta + \Delta_U) \mid \mathcal{D}_n)$. If $\hat{\pi}_{jk}$ is greater than $\theta + \Delta_U$, then the next patient
receives either combination (j, k − 1) or (j − 1, k), whichever combination maximizes
$P(\pi_{jk} \in (\theta - \Delta_L, \theta + \Delta_U) \mid \mathcal{D}_n)$. Otherwise, the next patient receives the
current dose.
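A sketch of this decision rule follows, using Beta(1, 1) priors so that the interval probabilities have closed forms; the target, interval limits, and data are all invented.

```r
## Combination BOIN-style decision; Beta(1, 1) priors, invented data.
boin_next <- function(n, y, j, k, theta = 0.30, dL = 0.05, dU = 0.05) {
  p_int <- function(jj, kk)          # P(pi_{jj,kk} in target interval | data)
    pbeta(theta + dU, 1 + y[jj, kk], 1 + n[jj, kk] - y[jj, kk]) -
    pbeta(theta - dL, 1 + y[jj, kk], 1 + n[jj, kk] - y[jj, kk])
  phat <- y[j, k] / n[j, k]
  if (phat < theta - dL)      cand <- list(c(j, k + 1), c(j + 1, k))  # escalate
  else if (phat > theta + dU) cand <- list(c(j, k - 1), c(j - 1, k))  # de-escalate
  else return(c(j, k))                                                # stay
  cand <- Filter(function(z) all(z >= 1) && z[1] <= nrow(n) && z[2] <= ncol(n), cand)
  if (length(cand) == 0) return(c(j, k))
  cand[[which.max(sapply(cand, function(z) p_int(z[1], z[2])))]]
}

n <- matrix(c(3, 3, 0, 3, 3, 0), nrow = 2, byrow = TRUE)  # patients per combination
y <- matrix(0, 2, 3)                                      # no DLTs observed yet
boin_next(n, y, j = 1, k = 2)                             # escalates, here to (2, 2)
```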
Mander and Sweeting (2015) proposed the Product of Independent beta Probabilities
dose Escalation (PIPE) design, which focuses on identifying the Maximum
Tolerated Contour (MTC) dividing safe from unsafe doses. A working model is
first set up whereby each combination is given an independent beta prior distribution,
that is, for combination (j, k), $\pi_{jk} \sim \mathrm{Beta}(a_{jk}, b_{jk})$. With these independent priors
and trial data $\mathcal{D}_n$, the posterior distribution for each combination is also a beta
distribution, and it is easy to calculate the probability that the combination is less toxic
than the TTL, $p_{jk} = P(\pi_{jk} \le \theta \mid a_{jk}, b_{jk}, \mathcal{D}_n)$. The PIPE design then considers each
possible contour that divides the dose-toxicity surface into safe and unsafe combinations
while satisfying the assumption of monotonicity. Using the working model,
the probability of each such contour being the MTC can then be calculated. For contour
$C_m$, let $C_m[j, k] = 1$ if combination (j, k) is above the contour (i.e., unsafe) and
$C_m[j, k] = 0$ if it is below (i.e., safe). Then

$$P(\mathrm{MTC} = C_m \mid \mathcal{D}_n) = \prod_{j=1}^{J} \prod_{k=1}^{K} p_{jk}^{1 - C_m[j,k]} \left(1 - p_{jk}\right)^{C_m[j,k]}.$$

The contour that maximizes the above expression is selected as the MTC, and
dose-escalation decisions may be made with the MTC as a guide. Software to
implement the PIPE design is available in the R package pipe.design.
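A sketch of the contour probability calculation for a small 2 × 3 grid follows; the priors, data, and candidate contour are all illustrative.

```r
## P(MTC = C | data) under the PIPE working model; 2 x 3 grid, invented data.
theta <- 0.30
a <- matrix(1, 2, 3); b <- matrix(1, 2, 3)     # Beta(1, 1) priors per combination
n <- matrix(c(3, 3, 0, 3, 0, 0), 2, 3, byrow = TRUE)
y <- matrix(c(0, 1, 0, 1, 0, 0), 2, 3, byrow = TRUE)
p <- matrix(pbeta(theta, a + y, b + n - y), 2, 3)  # P(pi_jk <= theta | data)

pr_contour <- function(C) prod(p^(1 - C) * (1 - p)^C)
C1 <- matrix(c(0, 0, 1,                         # 1 = above contour (unsafe)
               0, 1, 1), 2, 3, byrow = TRUE)
pr_contour(C1)   # compare across all monotone contours to find the modal MTC
```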

Dose-Schedule Finding Designs

Most of the methods for dual-agent phase I trials may be applied directly to dose-
schedule finding studies, simply by treating the different administration schedules as
another treatment that is varied. However, there are other approaches that consider
the subtleties of dose-schedule finding studies. Braun et al. (2007) used time-to-
toxicity outcomes to adaptively select dose-schedule combinations where schedules
of lower frequency/intensity are nested within more intense ones, with a motivating
example of using 5-azacitidine to treat patients with acute myeloid leukemia.
Meanwhile, O’Quigley and Conaway (2011) extended the CRM approach so that
the skeleton values varied according to the schedule with which patients were to be
treated.

Summary and Conclusion

This chapter provides an overview and summary of advanced statistical methodology
proposed for extending Bayesian adaptive model-based phase I trial designs. A
key factor common to many of these methods is the incorporation into dose-finding
algorithms of data beyond binary DLT outcomes. This can take the
form of a more granular classification of toxicity based on CTC criteria,
information on the timing of outcomes for patients during follow-up to allow
continual enrollment, the use of both toxicity and efficacy outcomes, and the use of
dual-agent or dose-schedule finding algorithms. As more complex trials are developed,
such methodology opens the way to the design of safer and more efficient phase
I trials, both in terms of time savings and sample size.
There is a large literature on phase I trial designs, and this chapter is not intended to
be all-encompassing. Instead, a wide variety of designs have been introduced to
provide a flavor of some of the methodological developments in the area. There
are a number of excellent review articles that allow further in-depth exploration of
the literature, including O’Quigley and Conaway (2011), Harrington et al. (2013),
O’Quigley et al. (2017), Zhou (2009), Wages et al. (2016), and Thall (2010).

References
Babb JS, Rogatko A (2001) Patient specific dosing in a cancer phase I clinical trial. Stat Med
20(14):2079–2090
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17(10):1103–1120
Bailey S, Neuenschwander B, Laird G, Branson M (2009) A Bayesian case study in oncology phase
I combination dose-finding using logistic regression with covariates. J Biopharm Stat 19(3):
469–484
Bekele BN, Thall PF (2004) Dose-finding based on multiple toxicities in a soft tissue sarcoma trial.
J Am Stat Assoc 99(465):26–35
Berry SM, Carlin BP, Lee JJ, Muller P (2010) Bayesian adaptive methods for clinical trials.
Chapman and Hall/CRC Press, Boca Raton, FL
Braun TM (2002) The bivariate continual reassessment method. Extending the CRM to phase I
trials of two competing outcomes. Control Clin Trials 23(3):240–256
Braun TM (2006) Generalizing the TITE-CRM to adapt for early- and late-onset toxicities. Stat
Med 25(12):2071–2083
Braun TM, Levine JE, Ferrara JLM (2003) Determining a maximum tolerated cumulative dose:
dose reassignment within the TITE-CRM. Control Clin Trials 24(6):669–681
Braun TM, Thall PF, Nguyen H, de Lima M (2007) Simultaneously optimizing dose and schedule
of a new cytotoxic agent. Clin Trials 4(2):113–124
Brock K, Billingham L, Copland M, Siddique S, Sirovica M, Yap C (2017) Implementing the
EffTox dose-finding design in the Matchpoint trial. BMC Med Res Methodol 17(1)
Cheung YK (2005) Coherence principles in dose-finding studies. Biometrika 92(4):863–873
Cheung YK, Chappell R (2000) Sequential designs for phase I clinical trials with late-onset
toxicities. Biometrics 56(4):1177–1182
Chu P-L, Lin Y, Shih WJ (2009) Unifying CRM and EWOC designs for phase I cancer clinical
trials. J Stat Plan Inference 139(3):1146–1163
Cole SR, Ananth CV (2001) Regression models for unconstrained, partially or fully constrained
continuation odds ratios. Int J Epidemiol 30(6):1379–1382
Dinart D, Fraisse J, Tosi D, Mauguen A, Touraine C, Gourgou S, Le Deley MC, Bellera C, Mollevi
C (2020) GUIP1: an R package for dose escalation strategies in phase I cancer clinical trials.
BMC Med Inform Decis Mak 20(134). https://doi.org/10.1186/s12911-020-01149-3
Diniz MA (2018) ewoc: Escalation with overdose control. R package version 0.2.0
Diniz MA, Li Q, Tighiouart M (2017) Dose finding for drug combination in early cancer
phase I trials using conditional continual reassessment method. J Biom Biostat 8(6)
Dragalin V, Fedorov V (2006) Adaptive designs for dose-finding based on efficacy–toxicity
response. J Stat Plan Inference 136(6):1800–1823
Dragalin V, Fedorov V, Wu Y (2008) Adaptive designs for selecting drug combinations based on
efficacy–toxicity response. J Stat Plan Inference 138(2):352–373
Ezzalfani M, Zohar S, Qin R, Mandrekar SJ, Deley M-CL (2013) Dose-finding designs using a
novel quasi-continuous endpoint for multiple toxicities. Stat Med 32(16):2728–2746
Faries D (1994) Practical modifications of the continual reassessment method for phase I cancer
clinical trials. J Biopharm Stat 4(2):147–164
Gasparini M (2013) General classes of multiple binary regression models in dose finding problems
for combination therapies. J R Stat Soc: Ser C: Appl Stat 62(1):115–133
Goodman SN, Zahurak ML, Piantadosi S (1995) Some practical improvements in the continual
reassessment method for phase I studies. Stat Med 14(11):1149–1161
Harrington JA, Wheeler GM, Sweeting MJ, Mander AP, Jodrell DI (2013) Adaptive designs for
dual-agent phase I dose-escalation studies. Nat Rev Clin Oncol 10(5):277–288
Hirakawa A, Sato H, Daimon T, Matsui S (2018) Dose finding for a combination of two agents. In:
Modern dose-finding designs for cancer phase I trials: drug combinations and molecularly
targeted agents. Springer, Tokyo, pp 9–40

Iasonos A, Wilton AS, Riedel ER, Seshan VE, Spriggs DR (2008) A comprehensive comparison of
the continual reassessment method to the standard 3 + 3 dose escalation scheme in phase I dose-
finding studies. Clin Trials 5(5):465–477
Iasonos A, Zohar S, O’Quigley J (2011) Incorporating lower grade toxicity information into dose
finding designs. Clin Trials J Soc Clin Trials 8(4):370–379
Ivanova A, Wang Y, Foster MC (2016) The rapid enrollment design for phase I clinical trials. Stat
Med 35(15):2516–2524
Jaki T, Clive S, Weir CJ (2013) Principles of dose finding studies in cancer: a comparison of trial
designs. Cancer Chemother Pharmacol 71(5):1107–1114
Jimenez JL, Tighiouart M, Gasparini M (2018) Cancer phase I trial design using drug combinations
when a fraction of dose limiting toxicities is attributable to one or more agents. Biom J 61
(2):319–332
Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon RM (1994) A comparison of
two phase I trial designs. Stat Med 13(18):1799–1806
Le Tourneau C, Lee JJ, Siu LL (2009) Dose escalation methods in phase I cancer clinical trials.
J Natl Cancer Inst 101(10):708–720
Lee SM, Cheng B, Cheung YK (2010) Continual reassessment method with multiple toxicity
constraints. Biostatistics 12(2):386–398
Lin R, Yin G (2017) Bayesian optimal interval design for dose finding in drug-combination trials.
Stat Methods Med Res 26(5):2155–2167
Mander AP, Sweeting MJ (2015) A product of independent beta probabilities dose escalation design
for dual-agent phase I trials. Stat Med 34(8):1261–1276
Mandrekar SJ, Cui Y, Sargent DJ (2007) An adaptive phase I design for identifying a biologically
optimal dose for dual agent drug combinations. Stat Med 26(11):2317–2330
Mandrekar SJ, Qin R, Sargent DJ (2010) Model-based phase I designs incorporating toxicity and
efficacy for single and dual agent drug combinations: methods and challenges. Stat Med 29(10):
1077–1083
Mauguen A, Le Deley MC, Zohar S (2011) Dose-finding approach for dose escalation with
overdose control considering incomplete observations. Stat Med 30(13):1584–1594
Muler JH, McGinn CJ, Normolle D, Lawrence T, Brown D, Hejna G, Zalupski MM (2004) Phase I
trial using a time-to-event continual reassessment strategy for dose escalation of cisplatin
combined with gemcitabine and radiation therapy in pancreatic cancer. J Clin Oncol 22(2):
238–243
Neuenschwander B, Branson M, Gsponer T (2008) Critical aspects of the Bayesian approach to
phase I cancer trials. Stat Med 27(13):2420–2439
O’Quigley J, Conaway M (2011) Extended model-based designs for more complex dose-finding
studies. Stat Med 30(17):2062–2069
O’Quigley J, Shen LZ (1996) Continual reassessment method: a likelihood approach. Biometrics
52(2):673
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase
1 clinical trials in cancer. Biometrics 46(1):33
O’Quigley J, Iasonos A, Bornkamp B (2017) Handbook of methods for designing, monitoring, and
analyzing dose-finding trials. Chapman and Hall/CRC Press, Boca Raton, FL
Padmanabhan SK, Dragalin V (2010a) Adaptive Dc-optimal designs for
dose finding based on a continuous efficacy endpoint. Biom J 52(6):836–852
Padmanabhan SK, Hsuan F, Dragalin V (2010b) Adaptive penalized D-optimal
designs for dose finding based on continuous efficacy and toxicity. Stat Biopharm Res
2(2):182–198
Paul RK, Rosenberger WF, Flournoy N (2004) Quantile estimation following non-parametric phase
I clinical trials with ordinal response. Stat Med 23(16):2483–2495
Piantadosi S, Fisher JD, Grossman S (1998) Practical implementation of a modified continual
reassessment method for dose-finding trials. Cancer Chemother Pharmacol 41(6):429–436
Potthoff RF, George SL (2009) Flexible phase I clinical trials: allowing for nonbinary toxicity
response and removal of other common limitations. Stat Biopharm Res 1(3):213–228

Pronzato L (2010) Penalized optimal designs for dose-finding. J Stat Plan Inference 140(1):
283–296
Riviere MK, Le Tourneau C, Paoletti X, Dubois F, Zohar S (2014) Designs of drug-combination
phase I trials in oncology: a systematic review of the literature. Ann Oncol 26(4):669–674
Sweeting M, Mander A, Sabin T (2013) bcrm: Bayesian continual reassessment method designs for
phase I dose-finding trials. J Stat Softw 54(13)
Thall PF (2010) Bayesian models and decision algorithms for complex early phase clinical trials.
Stat Sci 25(2):227–244
Thall PF, Cook JD (2004) Dose-finding based on efficacy-toxicity trade-offs. Biometrics 60(3):
684–693
Thall PF, Cook JD (2006) Using both efficacy and toxicity for dose-finding. In: Statistical methods
for dose-finding experiments. Wiley, New York, pp 275–285
Thall PF, Russell KE (1998) A strategy for dose-finding and safety monitoring based on efficacy
and adverse outcomes in phase I/II clinical trials. Biometrics 54(1):251–264
Thall PF, Millikan RE, Mueller P, Lee S-J (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59(3):487–496
Thall PF, Herrick RC, Nguyen HQ, Venier JJ, Norris JC (2014) Effective sample size for computing
prior hyperparameters in Bayesian phase I-II dose-finding. Clin Trials 11(6):657–666
Tighiouart M (2018) Two-stage design for phase I-II cancer clinical trials using continuous dose
combinations of cytotoxic agents. J R Stat Soc: Ser C: Appl Stat 68(1):235–250
Tighiouart M, Rogatko A (2010) Dose finding with escalation with over-dose control (EWOC) in
cancer clinical trials. Stat Sci 25(2):217–226
Tighiouart M, Rogatko A, Babb JS (2005) Flexible Bayesian methods for cancer phase I clinical
trials. Dose escalation with overdose control. Stat Med 24(14):2183–2196
Tighiouart M, Cook-Wiens G, Rogatko A (2012) Escalation with over-dose control using ordinal
toxicity grades for cancer phase I clinical trials. J Probab Stat 2012:1–18
Tighiouart M, Liu Y, Rogatko A (2014) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9(3):e93070
Tighiouart M, Li Q, Rogatko A (2017) A Bayesian adaptive design for estimating the maximum
tolerated dose curve using drug combinations in cancer phase I clinical trials. Stat Med 36(2):
280–290
Tighiouart M, Cook-Wiens G, Rogatko A (2018) A Bayesian adaptive design for cancer phase I
trials using a flexible range of doses. J Biopharm Stat 28(3):562–574
Van Meter EM, Garrett-Mayer E, Bandyopadhyay D (2011) Proportional odds model for dose-
finding clinical trial designs with ordinal toxicity grading. Stat Med 30(17):2070–2080
Van Meter EM, Garrett-Mayer E, Bandyopadhyay D (2012) Dose-finding clinical trial design for
ordinal toxicity grades using the continuation ratio model: an extension of the continual
reassessment method. Clin Trials 9(3):303–313
Wages NA, Conaway MR (2014) Phase I/II adaptive design for drug combination oncology trials.
Stat Med 33(12):1990–2003
Wages NA, Conaway MR, O’Quigley J (2011a) Continual reassessment method for partial order-
ing. Biometrics 67(4):1555–1563
Wages NA, Conaway MR, O’Quigley J (2011b) Dose-finding design for multi-drug combinations.
Clin Trials 8(4):380–389
Wages NA, Conaway MR, O’Quigley J (2013) Using the time-to-event continual reassessment
method in the presence of partial orders. Stat Med 32(1):131–141
Wages NA, Ivanova A, Marchenko O (2016) Practical designs for phase I combination studies in
oncology. J Biopharm Stat 26(1):150–166
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics 61(1):
217–222
Wheeler GM (2018) Incoherent dose-escalation in phase I trials using the escalation with overdose
control approach. Stat Pap (Berl) 59(2):801–811

Wheeler GM, Sweeting MJ, Mander AP (2017) Toxicity-dependent feasibility bounds for the
escalation with overdose control approach in phase I cancer trials. Stat Med 36(16):2499–2513
Wheeler GM, Sweeting MJ, Mander AP (2019) A Bayesian model-free approach to combination
therapy phase I trials using censored time-to-toxicity data. J R Stat Soc: Ser C: Appl Stat
68(2):309–329
Yin G, Yuan Y (2009a) Bayesian dose finding in oncology for drug combinations by copula
regression. J R Stat Soc: Ser C: Appl Stat 58(2):211–224
Yin G, Yuan Y (2009b) A latent contingency table approach to dose finding for combinations of two
agents. Biometrics 65(3):866–875
Yuan Z, Chappell R, Bailey H (2007) The continual reassessment method for multiple toxicity
grades: a Bayesian quasi-likelihood approach. Biometrics 63(1):173–179
Yuan, Y., Nguyen, H. Q., and Thall, P. F. (2017). Bayesian designs for phase I–II clinical trials. Boca
Raton, FL: Chapman and Hall Press/CRC
Zhang W, Sargent DJ, Mandrekar S (2006) An adaptive dose-finding design incorporating both
toxicity and efficacy. Stat Med 25(14):2365–2383
Zhou Y (2009) Adaptive designs for phase I dose-finding studies. Fundam Clin Pharmacol 24
(2):129–138
Adaptive Phase II Trials
61
Boris Freidlin and Edward L. Korn

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134
Interim Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1134
Phase II/III Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136
Adaptations Related to Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1138
Sample Size Reassessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1139
Outcome-Adaptive Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1140
Adaptive Pooling of Outcome Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1142
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1142
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1142
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1142

Abstract
Phase II trials are designed to obtain preliminary efficacy information about a new
therapy in order to assess whether the new therapy should be tested in definitive
(phase III) trials. Adaptive trial designs allow the design of a trial to be changed
during its conduct, possibly using accruing outcome data. Adaptations to
phase II trials considered in this chapter include formal interim monitoring,
phase II/III trial designs, adaptations related to biomarker subgroups, sample
size reassessment, outcome-adaptive randomization, and adaptive pooling of
outcome results across patient subgroups. Adaptive phase II trials allow for the
possibility of trials reaching their conclusions earlier, with more patients being
treated with therapies that have activity for them.

B. Freidlin (*) · E. L. Korn
Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer
Institute, Bethesda, MD, USA
e-mail: [email protected]; [email protected]

© This is a U.S. Government work and not under copyright protection in the U.S.;
foreign copyright protection may apply 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_276

Keywords
Biomarkers · Futility monitoring · Interim monitoring · Outcome-adaptive
randomization · Phase II/III · Sample size reassessment

Introduction

Phase II trials are designed to obtain preliminary efficacy information on a new
therapy to decide whether development should be pursued with definitive (phase III)
trials. Phase II trials are smaller than phase III trials because they relax the error-rate
requirements and can use shorter-term endpoints that are more sensitive to biologic
activity (regardless of whether they directly measure clinical benefit). For example,
oncology phase II trials often use a tumor response (shrinkage) endpoint (rather than,
say, overall survival). For evaluating a new agent, a single-arm trial with 32 patients
would allow one to distinguish a 20% response rate (interesting activity) from a 5%
response rate (uninteresting activity), with both false-positive and false-negative
error rates under 10% (Simon 1989). For evaluating addition of a new agent to the
standard therapy, a trial that randomly assigns 120 patients to receive either the
standard treatment or the standard treatment plus the new agent will have 90% power
at a one-sided 0.10 significance level to detect an increase in response rate from 10%
to 30% (Green et al. 2016). For randomized phase II trials, one can also use other
endpoints for which it would not be as easy to interpret efficacy in a single-arm trial,
e.g., progression-free survival.
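The arithmetic behind such designs is straightforward to check. The sketch below (a minimal illustration in Python, using the usual normal approximation for a one-sided two-sample comparison of proportions; the function and its pooled-variance form are our own choices, not taken from the designs cited above) reproduces the roughly 90% power quoted for the 120-patient randomized example.

```python
from math import sqrt
from scipy.stats import norm

def power_two_arm(p0, p1, n_per_arm, alpha_one_sided):
    """Normal-approximation power of a one-sided two-sample test of proportions."""
    pbar = (p0 + p1) / 2
    z_alpha = norm.ppf(1 - alpha_one_sided)
    num = abs(p1 - p0) * sqrt(n_per_arm) - z_alpha * sqrt(2 * pbar * (1 - pbar))
    den = sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    return norm.cdf(num / den)

print(power_two_arm(0.10, 0.30, 60, 0.10))  # roughly 0.9 for the 120-patient design
```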
Adaptive trial designs allow the course of the trial to be changed during its
conduct using accruing outcome data. Adaptive features of phase II trial designs
considered in this chapter are interim monitoring, phase II/III trial designs, adapta-
tions related to biomarkers, sample size reassessment, outcome-adaptive randomi-
zation, and adaptive pooling of outcome results.

Interim Monitoring

The most fundamental adaptive element of a clinical trial is formal interim
monitoring, which allows a trial to stop early when its scientific objectives
have been met: If it becomes clear in an ongoing phase II trial that the experi-
mental treatment is not going to be worth pursuing, then the trial should be
stopped (futility monitoring). This minimizes patient exposure to inactive toxic
treatments and conserves resources (Freidlin and Korn 2009). For example, in a
single-arm trial targeting a response rate of 20% versus 5%, a Simon optimal two-
stage design (Simon 1989) first accrues and evaluates 12 patients. The trial only
continues (to a total of 37 patients) if there is at least one response seen in these 12
patients. For a randomized phase II trial, the simple Wieand futility monitoring
rule (Wieand et al. 1994) stops the trial half-way through if the experimental arm
is doing worse than the standard treatment arm by any amount. For example, in a
120-patient randomized trial to detect an improvement in response rates from
10% to 30%, the trial would stop when the first 60 patients have been evaluated if
the observed response rate was lower in the experimental arm than the control
arm. With a time-to-event endpoint, Wieand futility monitoring is performed
when one-half of the required number of events for the final analysis is observed.
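To make the Simon design concrete, the sketch below computes its operating characteristics. It assumes the usual published boundaries for this setting (stop after stage 1 with zero responses among 12 patients; declare the agent active only if more than 3 responses are seen among all 37 patients), which are consistent with, though not fully spelled out in, the description above.

```python
from scipy.stats import binom

def prob_declare_active(p, n1=12, r1=0, n=37, r=3):
    """P(declare the agent active): pass stage 1 with more than r1 responses
    among the first n1 patients, then exceed r responses among all n."""
    n2 = n - n1
    return sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n2, p)
               for x1 in range(r1 + 1, n1 + 1))

print(prob_declare_active(0.05))  # false-positive rate under p0 = 5%
print(prob_declare_active(0.20))  # power under p1 = 20%
print(binom.pmf(0, 12, 0.05))     # chance of stopping at stage 1 under p0
```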
Unlike phase III trials, phase II trials generally do not include the possibility of
stopping (and discontinuing enrollment) early for positive results (efficacy stop-
ping), e.g., superiority of the experimental treatment over the standard treatment
in a randomized phase II trial. This is because it is useful to get more experience
with the treatment to inform phase III design (and patients are meanwhile not
being given an ineffective experimental therapy). However, it could be useful for
phase II trials to allow for the possibility of early reporting of positive results as
the trial continues to completion, especially if the trial is relatively large
or expected to take a long time to complete (e.g., due to the rarity of the disease).
For example, using a version of the Fleming approach (Fleming 1982) to the
single-arm two-stage design above, one can report the first-stage (12 patients)
results for efficacy with three or more responses as well as stopping the trial for
futility with zero responses. For randomized studies, it often would not be
acceptable to continue randomization once a positive result is reported. However,
in studies with time-to-event endpoints, where the outcome data requires non-
trivial follow-up to mature, it might still be useful to conduct an efficacy analysis
for potential early reporting once all patients have been enrolled and are off the
randomized treatments.
Multi-arm randomized phase II trials with multiple experimental arms being
compared to a standard treatment arm are efficient designs as compared to
performing separate randomized phase II trials for each experimental treatment
(because of the shared standard-treatment arm) (Freidlin et al. 2008). Futility
monitoring for each experimental arm/control arm comparison may increase
the efficiency further, allowing individual experimental arms to be closed early
(increasing the accrual rate on the remaining open arms). An example of such a
trial is SWOG 1500 (Pal et al. 2017), which has three experimental arms and
a standard treatment arm for metastatic papillary renal carcinoma. The trial design
included Wieand futility monitoring rules, and two of the experimental arms stopped
early for futility.
There is also the possibility of having a multi-arm trial with a “master protocol”
that accommodates adding new experimental treatment arms when they become
available for phase II testing. An example of such a trial is ISPY-2 (Barker et al.
2009), which is testing neoadjuvant experimental treatments for women with locally
advanced breast cancer. In addition to being less efficient than having all treatments
available for testing at the same time (because results from patients on an experi-
mental arm can only be compared to results from patients on the standard-treatment
arm who were randomized contemporaneously), trials with master protocols present
major logistical challenges to execute (Cecchini et al. 2019).

Phase II/III Trial Designs

A phase II/III design is a phase II trial with an adaptation to possibly extend it to
a phase III trial if the phase II results look sufficiently promising (Bretz et al. 2006;
Korn et al. 2012). The advantage of phase II/III design over a separate phase II trial
followed by a phase III trial (if the phase II trial is positive) is that the patient
outcomes from the phase II portion can be used in the phase III analysis. In addition,
a phase II/III trial, which requires a single protocol, reduces the development time in
that two protocols do not need to be written and receive separate regulatory
approvals. The disadvantage to using a phase II/III trial is that one is committing
early to the treatment that will be evaluated in a phase III trial (Korn et al. 2012;
Cuffe et al. 2014); in a setting where many new treatments are being developed, it
may be better to perform a stand-alone phase II trial and then make decision based on
what new treatments are available for testing in phase III.
If a phase II/III trial is appropriate, then one first specifies the phase III design
parameters: type 1 error (e.g., one-sided 0.025) and sufficient power (e.g., 90%)
to detect a clinically meaningful improvement in the phase III endpoint (e.g.,
improving median overall survival from 12 months to 16 months). The phase III
design parameters determine the phase III sample size (e.g., 500 patients random-
ized), or, for time-to-event endpoints, the number of required events for the final
analysis (e.g., 400 deaths). The phase II portion of the trial is then embedded in the
phase III trial by evaluating an appropriately selected phase II endpoint on an initial
set of randomized patients, typically using a similar design that one would have used
for a stand-alone phase II trial (e.g., targeting a 30% versus 10% improvement in
response rate in the first 120 randomized patients using the design described in the
Introduction). If this phase II analysis rejects the null hypothesis that the response
rates are equal at the 0.10 level, then the accrual is continued to the full 500 patients
for the phase III analysis. Note that phase II/III designs can have multiple experi-
mental arms with some or all of them dropped at the phase II stage.
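A quick simulation gives a feel for how often such an embedded phase II gate opens. The sketch below (our own simplification, using a pooled normal approximation for the one-sided response-rate test) estimates the probability of proceeding to the phase III stage for the 120-patient response-rate example above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def prob_proceed(p_ctrl, p_exp, n_per_arm=60, alpha=0.10, reps=100_000):
    """Simulated chance that the embedded phase II response-rate test
    (one-sided, pooled normal approximation) triggers continuation."""
    xc = rng.binomial(n_per_arm, p_ctrl, reps)
    xe = rng.binomial(n_per_arm, p_exp, reps)
    pbar = (xc + xe) / (2 * n_per_arm)
    se = np.maximum(np.sqrt(2 * pbar * (1 - pbar) / n_per_arm), 1e-12)
    z = ((xe - xc) / n_per_arm) / se
    return float((z > norm.ppf(1 - alpha)).mean())

print(prob_proceed(0.10, 0.10))  # under the null: close to alpha = 0.10
print(prob_proceed(0.10, 0.30))  # under the alternative: roughly 0.9
```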
For example, SWOG S1117 (Sekeres et al. 2017) is a phase II/III trial
that randomly assigned high-risk myelodysplastic syndrome and chronic
myelomonocytic leukemia patients to azacitidine (the standard treatment),
azacitidine+lenalidomide, or azacitidine+vorinostat arms. The phase II portion
of the trial was designed to enroll 240 patients (80 per arm). For each of
the combination versus azacitidine comparisons, this provided 81% power at
a one-sided 0.05 significance level to detect an increase from 35% to 55% in
response rates. If at least one of the combination arms was shown to be promising,
the trial would proceed to the phase III portion. The phase III portion was designed
to include a total of 452 patients randomly assigned to either azacitidine or the best
combination arm (including the phase II stage patients) to provide 80% power
to detect an overall survival hazard ratio of 1.4 at a one-sided significance level
of 0.025. The study was stopped after the phase II stage because both combination
arms failed to pass the phase II decision rule.
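Phase III event targets of this kind typically come from the Schoenfeld approximation for the log-rank test. The sketch below is our own implementation, with S1117's stated hazard ratio, power, and significance level plugged in; the trial's actual calculation may have incorporated further design details.

```python
from math import ceil, log
from scipy.stats import norm

def schoenfeld_events(hr, power=0.80, alpha_one_sided=0.025):
    """Schoenfeld approximation: events needed for a 1:1 randomized
    log-rank comparison targeting hazard ratio `hr`."""
    za, zb = norm.ppf(1 - alpha_one_sided), norm.ppf(power)
    return ceil(4 * (za + zb) ** 2 / log(hr) ** 2)

print(schoenfeld_events(1.4))  # deaths needed for the S1117-style comparison
```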
Operationally, short-term phase II endpoints like response are easier to employ in
the phase II/III framework since they allow a quick evaluation as to whether to
continue to the phase III stage of the trial. However, in many clinical settings, a time-
to-event endpoint (e.g., progression-free survival) is considered to be a more reliable
phase II measure of clinical activity (than response) for deciding whether to proceed
to the phase III trial. This leads to another key decision in the trial design: Should one
continue to accrue while waiting for the data from the phase II patients to mature, or
should one suspend accrual during this time? The advantage of suspending accrual is
that no additional patients will have been unnecessarily accrued in the event the
phase II analysis suggests not proceeding to phase III. In particular, in some settings,
all or practically all the phase III patients will be accrued before the data on the phase
II patients are mature enough to make a decision, negating any efficiency in using
a phase II/III design. The disadvantage of suspending accrual is that it will take
longer to get the phase III results (assuming the phase II analysis is positive),
especially if it takes time to ramp up accrual after the suspension. In addition, with
a long suspension, one can question the generalizability of the trial results because
the patient population may have changed over time. However, a changing patient
population can be of theoretical concern for any long trial (whether or not there is an
accrual suspension), so this is not a reason to avoid an accrual suspension in a phase
II/III trial (Freidlin et al. 2018).
An example of a phase II/III design with accrual suspension is given by RTOG-
1216 (Zhang et al. 2019) that compared radiation+docetaxel and radiation
+docetaxel+cetuximab arms to a standard radiation+cisplatin arm in advanced
head and neck cancer. The phase II design was based on randomly assigning 180
patients among the three arms (60 patients per arm), targeting for each of the two
experimental versus control comparisons a progression-free survival hazard ratio
of 0.6 (with 80% power at a one-sided 0.15 significance level). The phase II
analysis for each experimental versus control comparison was scheduled when a
total of 56 progression-free survival events were observed for the two arms. If at
least one of the experimental arms was significantly positive, the study would
proceed to the phase III portion with a total of 460 patients randomized between the
best experimental arm and the control arm, targeting an overall survival hazard
ratio of 0.67. The study was designed to suspend accrual once the 180-patient
phase II portion had completed enrollment, until the phase II analyses (requiring
56 events for each of the two experimental versus control comparisons) had been performed.
The protocol projected that the phase II analysis would take place approximately
1.3 years after completion of phase II accrual (but it actually took slightly over 2
years for the phase II data to mature).
An example of a phase II/III trial without an accrual suspension is given by
GOG0182-ICON5 for advanced-stage ovarian cancer (Bookman et al. 2009). This
trial had four experimental arms that were compared to a standard treatment arm.
The phase II endpoint was progression-free survival, and the phase III endpoint
was overall survival. The trial accrued 4312 patients rapidly before the phase II
analyses determined no difference between the experimental arms and the standard
treatment arm (there was also no difference in overall survival). If an accrual
suspension of 15 months had been used, the total sample size would have been
1740 (Korn et al. 2012).

Adaptations Related to Biomarkers

Biomarkers can be used to identify patients for whom an experimental treatment is
likely to be effective. For example, in metastatic colorectal cancer, it has been
well established that the benefit of anti-EGFR monoclonal antibodies (e.g.,
cetuximab and panitumumab) is restricted to patients with KRAS wild-type tumors
(Vale et al. 2012). In the context of phase II trials, it is not known whether
the experimental treatment will be effective in either the biomarker-positive or
biomarker-negative subgroups, but it is thought that it will be at least as effective
in the biomarker-positive subgroup as in the biomarker-negative subgroup. In this
case, a biomarker-stratified trial design, in which patients in the two subgroups are
separately randomized and analyzed, can be used (Freidlin et al. 2010). It is
important to perform interim monitoring separately in these subgroups, which allows
for the possibility of adaptively stopping accrual of biomarker-negative patients if it
appears that the experimental treatment is not working for them. One should also
consider stopping the whole trial (and not just accrual to the biomarker-positive
patients) if the biomarker-positive subgroup crosses a futility interim monitoring
boundary.
In addition to interim monitoring, a trial may be adaptively modified to increase
the number of patients with a specific characteristic, e.g., a specific histology or
a specific biomarker value. An example of this type of adaptation is given by the
design of the arms of the NCI-Pediatric MATCH trial (Allen et al. 2017). The generic
design for each arm (subprotocol) of this trial is the following: Twenty patients with
the specific tumor mutation are treated with the agent targeted for this mutation
(regardless of cancer histology). The analysis for this primary cohort uses a decision
rule of the agent/mutation being considered worthy of further study if at least 3 tumor
responses are seen in these 20 patients. Three possible adaptations included in the
design are the following. (1) If there are three or more responses in patients with
the same histology (after the primary cohort is enrolled), then up to ten patients (in
total) can be enrolled with the same histology (who have the mutation). (2) If at any
time there are three or more responses in the primary cohort but with less than three
responses in the same histology, then a cohort of 10 patients without the mutation
(regardless of cancer histology) will be enrolled. (3) If at any time there are three or
more responses in the primary cohort with the same histology, then a cohort of ten
patients without the mutation but with the same histology will be enrolled. The
purpose of these adaptive cohorts is to obtain more information about in which
patient subsets the agent is likely to be effective, while minimizing the exposure of
patient subsets where the agent is likely not to be effective.
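The operating characteristics of the "at least 3 responses in 20 patients" rule are easy to tabulate. The short sketch below (our own illustration) shows the probability that an agent passes the primary-cohort decision rule across a range of true response rates.

```python
from scipy.stats import binom

# P(at least 3 responses among 20 patients) for several true response rates
for p in (0.05, 0.10, 0.20, 0.30):
    print(f"true rate {p:.0%}: P(>=3 of 20) = {binom.sf(2, 20, p):.3f}")
```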
With a biomarker that putatively identifies patients for whom the experimental
therapy is better than the standard therapy, the aims of a randomized phase II trial can
be expanded from “Does the therapy warrant further testing?” to “Does the therapy
warrant further testing only in the biomarker-positive subgroup, in both the bio-
marker-positive and biomarker-negative subgroups (Freidlin et al. 2010), in
the whole population regardless of biomarker status, or not at all?” This allows
one to adapt the plan for a future phase III trial. This can be done informally by
examining experimental arm versus standard arm comparisons in the biomarker-
positive and biomarker-negative subgroups from a completed trial. However, this
post hoc approach may not work because of insufficient sample sizes in one or both
subgroups. A formal approach to this problem has been proposed (Freidlin et al.
2012), which requires a larger sample size than a single randomized phase II trial, but
less total sample size than performing separate randomized phase II trials in the
biomarker-positive and biomarker-negative subgroups.

Sample Size Reassessment

Sample size reassessment is a potential adaptation to the sample size of a trial based
on promising interim results from the trial. Although mostly used for phase III trials,
it has been recommended and used for phase II trials (Wang et al. 2012; Campbell
et al. 2014; Meretoja et al. 2014). In the phase II setting, a particular implementation
(Chen et al. 2004) initially starts with a plan to enroll a fixed number of patients to
target a potentially optimistic treatment effect with 90% power at a one-sided
significance level of 10%. When the information from the first half of the trial
becomes available, the design examines the (one-sided) p-value. If this p-value is
less than 0.18 but greater than 0.06, then the sample size is increased up to twice the
original sample size using a formula that depends on this p-value, with p-values
closer to 0.18 leading to larger increases. The idea is that one can gain power to reject
the null hypothesis when the interim results appear promising by increasing the
sample size.
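A rough feel for this mechanism can be obtained by simulation. The sketch below deliberately simplifies the Chen et al. rule: it doubles the sample size whenever the interim one-sided p-value falls in the promising zone (0.06, 0.18), rather than applying their graded formula, and it imposes Wieand futility stopping at the same look. The resulting averages are therefore only illustrative of designs like those compared in Table 1, not a reproduction of it.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def average_sample_size(p_ctrl, p_exp, n_init=84, n_max=168, reps=50_000):
    """Average total N under a simplified promising-zone rule with an
    interim look (and Wieand futility stop) after n_init/2 patients."""
    m = n_init // 4                    # 21 patients per arm at the interim
    xc = rng.binomial(m, p_ctrl, reps)
    xe = rng.binomial(m, p_exp, reps)
    stop = xe < xc                     # Wieand futility: experimental arm worse
    pbar = (xc + xe) / (2 * m)         # pooled normal approximation for the
    se = np.maximum(np.sqrt(2 * pbar * (1 - pbar) / m), 1e-12)
    pval = norm.sf(((xe - xc) / m) / se)   # one-sided p-value for exp > ctrl
    n = np.where(stop, n_init // 2,
                 np.where((pval > 0.06) & (pval < 0.18), n_max, n_init))
    return n.mean()

print(average_sample_size(0.20, 0.20))  # null: both arms 20%
print(average_sample_size(0.20, 0.45))  # alternative: 45% versus 20%
```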
Sample size reassessment is controversial (Burman and Sonesson 2006; Emerson
et al. 2011; Freidlin and Korn 2017; Mehta 2017). The issue is whether sample size
reassessment is a reasonable approach to ensuring an adequately powered trial.
A simple numerical example illustrates the issues: Consider a randomized phase II
trial targeting a response rate of 45% for the experimental therapy versus 20% for the
standard treatment arm. With a standard design (maximum of 92 randomized
patients with Wieand futility monitoring after 46 patients), the trial would have
90% power for a one-sided significance level of 10% (Green et al. 2016). With the
same level and power and using sample size reassessment, the initial sample size
can be set to 84, with a possible increase to 168 (based on the interim results after 42
patients have been evaluated, including Wieand futility monitoring at that time).
Theoretically, if a sponsor of a trial initially had resources only for an 84-patient trial
but not a 92-patient trial, then sample size reassessment would allow the sponsor to
obtain additional resources to increase the sample size based on promising interim
results. However, as is shown in Table 1, the sample size reassessment design is
inferior to the standard design as, on average, it would require more patients (in some
cases nontrivially increasing the sample size and duration of the trial). Note that in
addition to the efficiency issues, sample size reassessment designs raise some
integrity concerns when used with time-to-event outcomes (Freidlin and Korn 2017).

Table 1 Comparison of standard design with sample size reassessment design for comparing
response rates of 45% versus 20% (90% power at 0.1 significance level)

Sample size                              | Standard design (with Wieand futility monitoring) | Sample size reassessment (with Wieand futility monitoring)
Minimum                                  | 46 | 42
Maximum                                  | 92 | 168
Average under null hypothesis            | 72 | 73
Average under the alternative hypothesis | 91 | 95

Outcome-Adaptive Randomization

Outcome-adaptive randomization is a technique where the proportion of patients
randomized to the different treatment arms changes during the trial based on the
accruing outcome data (Lee and Chu 2012). The changes are made so that a higher
proportion of patients are randomized to the arm(s) that appear to be doing better.
Although this technique is superficially appealing, there are a number of caveats. First,
any time trends in the prognosis of patients accruing to the trial can potentially bias
(confound) the treatment results and lead to inflated type 1 error (Byar et al. 1976;
Korn and Freidlin 2011a). Therefore, this method is not recommended for definitive
phase III trials. Second, the operating characteristics of the method are poor for trials
with one experimental arm and one control treatment arm (Korn and Freidlin 2011a;
Thall et al. 2015), so the technique is also not recommended for two-armed phase II
randomized trials. Comparisons of outcome-adaptive randomization with standard
fixed randomization (with appropriate futility interim monitoring) are more nuanced
in multi-arm trials with multiple experimental agents: Simulations demonstrate similar
results, with the outcome-adaptive randomization yielding a slightly higher proportion
of patients with good outcomes but also a slightly longer trial with a larger number of
patients with bad outcomes (Korn and Freidlin 2011b). Finally, whether using out-
come-adaptive randomization is more or less ethical than using fixed randomization
has been debated (Hey and Kimmelman 2015).
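For concreteness, the sketch below implements one generic outcome-adaptive rule for a binary outcome: after an equal-randomization burn-in, patients are allocated in proportion to the posterior probability that each arm is best, under flat Beta priors. This is our own illustration of the general idea, not the algorithm of any particular trial; note that it assumes each outcome is observed before the next patient is randomized, an idealization real trials rarely meet.

```python
import numpy as np

rng = np.random.default_rng(2)

def adaptive_trial(p_true, n_burn_in=70, n_adaptive=130, n_draws=1000):
    """Outcome-adaptive randomization for a binary outcome using a
    posterior-probability-of-best allocation rule (Beta(1,1) priors)."""
    k = len(p_true)
    succ, fail = np.zeros(k), np.zeros(k)
    for i in range(n_burn_in + n_adaptive):
        if i < n_burn_in:
            probs = np.full(k, 1 / k)          # equal randomization
        else:
            draws = rng.beta(1 + succ, 1 + fail, size=(n_draws, k))
            probs = np.bincount(draws.argmax(axis=1), minlength=k) / n_draws
        arm = rng.choice(k, p=probs)
        if rng.random() < p_true[arm]:         # outcome observed immediately
            succ[arm] += 1
        else:
            fail[arm] += 1
    return (succ + fail).astype(int)           # final patients per arm

print(adaptive_trial([0.32, 0.50, 0.53, 0.46]))  # rates like BATTLE-2's arms
```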
An example of a multi-arm randomized phase II that used outcome-adaptive
randomization is the BATTLE-2 trial (Papadimitrakopoulou et al. 2016). In this trial,
patients with advanced non-small cell lung cancer were randomized among four
treatment arms (a control arm and three experimental arms), with the outcome being
disease control at 8 weeks (complete or partial tumor response or non-progressing
disease at 8 weeks). Equal randomization was used for the first 70 patients, and then
outcome-adaptive randomization was used for the next 130 patients (adjusted for
two biomarkers). For the 186 evaluable patients, the 8-week disease control rates
(DCR) were 32% (6/19) for the control arm, and 50% (18/36), 53% (37/70), and
46% (28/61) for the three experimental arms. One can calculate that if one had used
equal randomization throughout, then a trial with 120 evaluable patients would have
achieved the same statistical power as the 186 evaluable-patient outcome-adaptive
randomization trial for the pairwise comparisons between the experimental arms and
standard treatment arms (Korn and Freidlin 2017). The equal randomization trial
would be expected to have an overall DCR of 45% (54/120), less than the observed
rate of 48% (89/186) with the outcome-adaptive randomization, a slight plus for
outcome-adaptive randomization. However, equal randomization would have
resulted in a shorter trial, with fewer patients, and with fewer patients having a bad
outcome (66 versus 97) (Korn and Freidlin 2017).
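The expected disease-control count quoted for equal randomization follows from simple arithmetic over the observed arm-specific rates (our own check):

```python
# observed 8-week disease-control rates in BATTLE-2, by arm
rates = [6/19, 18/36, 37/70, 28/61]
expected = sum(30 * r for r in rates)   # 30 patients per arm under 1:1:1:1
print(f"{expected:.0f}/120 = {expected/120:.0%}")   # about 54/120, i.e. 45%
```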
Note that the outcome-adaptive randomization should not be confused with
statistical techniques designed to balance the distribution of baseline covariates
between the study arms (Pocock and Simon 1975). Unlike outcome-adaptive
randomization that uses study outcome to change the probability of treatment
assignment, these methods use prespecified baseline characteristics of the accruing
patients to modify the randomization and are not controversial.

Adaptive Pooling of Outcome Results

To account for potential heterogeneity in the activity of an experimental treatment in
different patient subgroups, phase II trials often consider the activity separately for
the different subgroups. For example, when evaluating an agent targeting a particular
molecular target, patients with tumors expressing the target may be separated into
histology-based subgroups. It may be reasonable to expect some degree of similarity
in the level of activity between the subgroups (although this is not assured). In this
case it could be attractive to borrow (pool) activity information across subgroups,
especially when the subgroups are small. To address this issue, a variety of adaptive
pooling (“adaptive information borrowing”) methods that allow the estimate of
activity in a given subgroup to be influenced by the outcomes in other subgroups
have been proposed. For example, in a trial that is designed to evaluate response
rates in different histologic subgroups, the response rate for a specific histology
could be estimated as a weighted average of the observed response rate for
that histology and the overall response rate across all histologies. The amount of
borrowing (pooling) across subgroups is determined by the weights, with weighting
the overall rate more representing more borrowing. Most of the adaptive approaches
for choosing weights use Bayesian hierarchical modeling (Thall et al. 2003). In this
approach the weights are determined by the spread of the observed histology-specific
rates: The narrower the spread, the more weight is given to the overall response
rate in estimating the individual histology response. That is, when the observed rates
are close to each other, the estimates of the histology-specific rates are pulled
toward the observed overall rate. On the other hand, when the observed rates
are far apart, the estimated rate for each histology stays close to the
corresponding observed histology-specific rate.
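The weighting logic can be illustrated with a simple empirical-Bayes stand-in for a full Bayesian hierarchical model (our own simplification): the weight placed on the overall rate grows as the estimated between-subgroup variance shrinks.

```python
import numpy as np

def adaptive_pool(x, n):
    """Empirical-Bayes sketch of adaptive pooling of subgroup response rates
    (a simple stand-in for a full Bayesian hierarchical model)."""
    x, n = np.asarray(x, float), np.asarray(n, float)
    p = x / n                                # observed subgroup rates
    p_all = x.sum() / n.sum()                # observed overall rate
    v = p_all * (1 - p_all) / n              # within-subgroup sampling variance
    tau2 = max(np.var(p, ddof=1) - v.mean(), 0.0)  # between-subgroup variance
    w = v / (v + tau2)                       # weight on overall rate; tau2 = 0 -> full pooling
    return w * p_all + (1 - w) * p

print(adaptive_pool([4, 5, 6], [20, 20, 20]))    # rates close: heavy pooling
print(adaptive_pool([1, 5, 12], [20, 20, 20]))   # rates far apart: little pooling
```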
Attractive as it sounds, however, borrowing comes at a price (i.e., just like in life
there is no free lunch): In most practical settings, adaptive pooling without proper
adjustment can result in nontrivially inflated rates of incorrectly declaring an agent
effective in subgroups where it does not work and incorrectly declaring an agent
ineffective in subgroups where it does work (Freidlin and Korn 2013). Vigorous
methodologic research in refining adaptive pooling designs is continuing (Chu and
Yuan 2018; Cunanan et al. 2019). It may not be possible to have a design that allows
the use of observed data to guide the pooling across subgroups without inflating the
design error rates (Kopp-Schneider et al. 2019).

Summary

Careful application of adaptive features like interim monitoring or biomarker-driven
adaptations can dramatically improve the efficiency of phase II trials. This could
accelerate development of new therapies, protect patients from exposure to poten-
tially toxic ineffective therapies, and conserve resources. On the other hand, out-
come-adaptive randomization has been long known to have suboptimal statistical
properties. Less understood methodologies like sample size reassessment and adap-
tive pooling should be considered carefully, as they provide questionable benefit (if
any). Moreover, these more complex statistical methods may pose major transpar-
ency and reproducibility challenges. Therefore, their use requires clear justification,
and their reporting requires clear description of the study design and conduct.

Key Facts

Application of certain adaptive features like interim monitoring or biomarker-driven
adaptations can improve efficiency of phase II trials. Outcome-adaptive randomiza-
tion has not been shown to improve design performance relative to the properly
designed/monitored fixed-randomization trials. Less understood methodologies like
sample size reassessment and adaptive pooling increase design complexity without
providing tangible benefit.

Cross-References

▶ Biomarker-Guided Trials
▶ Futility Designs
▶ Interim Analysis in Clinical Trials
▶ Multi-arm Multi-stage (MAMS) Platform Randomized Clinical Trials

References
Allen CE, Laetsch TW, Mody R, Irwin MS, Lim MS, Adamson PC, Seibel NL, Parsons DW,
Cho YJ, Janeway K, on behalf of the Pediatric MATCH Target and Agent Prioritization
Committee (2017) Target and Agent Prioritization for the Children’s Oncology Group –
National Cancer Institute Pediatric MATCH Trial. J Natl Cancer Inst 109:djw274
Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ (2009) I-SPY 2: an
adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol
Ther 86:97–100
Bookman MA, Brady MF, McGuire WP, Harper PG, Alberts DS, Friedlander M, Colombo N,
Fowler JM, Argenta PA, De Geest K, Mutch DG, Burger RA, Swart AM, Trimble EL, Accario-
Winslow C, Roth LM (2009) Evaluation of new platinum-based treatment regimens in
advanced-stage ovarian cancer: a phase III trial of the Gynecologic Cancer Inter Group. J Clin
Oncol 27:1419–1425
Bretz F, Schmidli H, König F, Racine A, Maurer W (2006) Confirmatory seamless phase II/III
clinical trials with hypothesis selection at interim: general concepts. Biom J 48:623–634
Burman CF, Sonesson C (2006) Are flexible designs sound? Biometrics 62:664–669
Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware
JH (1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80
Campbell BC, Mitchell PJ, Yan B, Parsons MW, Christensen S, Churilov L, Dowling RJ, Dewey H,
Brooks M, Miteff F, Levi C, Krause M, Harrington TJ, Faulder KC, Steinfort BS, Kleinig T,
Scroop R, Chryssidis S, Barber A, Hope A, Moriarty M, McGuinness B, Wong AA, Coulthard
A, Wijeratne T, Lee A, Jannes J, Leyden J, Phan TG, Chong W, Holt ME, Chandra RV, Bladin
CF, Badve M, Rice H, de Villiers L, Ma H, Desmond PM, Donnan GA, Davis SM, EXTEND-IA
Investigators (2014) A multicenter, randomized, controlled study to investigate EXtending the
time for Thrombolysis in Emergency Neurological Deficits with Intra-Arterial therapy
(EXTEND-IA). Int J Stroke 9:126–132
Cecchini M, Rubin EH, Blumenthal GM, Ayalew K, Burris HA, Russell-Einhorn M, Dillon H,
Lyerly HK, Reaman GH, Boerner S, LoRusso PM (2019) Challenges with novel clinical trial
designs: master protocols. Clin Cancer Res 25:2049–2057
Chen YH, DeMets DL, Lan KK (2004) Increasing the sample size when the unblinded interim result
is promising. Stat Med 23:1023–1038
Chu Y, Yuan Y (2018) Bayesian basket trial design using a calibrated Bayesian hierarchical model.
Clin Trials 15:149–158
Cuffe RL, Lawrence D, Stone A, Vandemeulebroecke M (2014) When is a seamless study
desirable? Case studies from different pharmaceutical sponsors. Pharm Stat 13:229–237
Cunanan KM, Iasonos A, Shen R, Gönen M (2019) Variance prior specification for a basket trial
design using Bayesian hierarchical modeling. Clin Trials 16:142–153
Emerson SS, Levin GP, Emerson SC (2011) Comments on ‘Adaptive increase in sample size when
interim results are promising: a practical guide with examples’. Stat Med 30:3285–3301
Fleming TR (1982) One-sample multiple testing procedure for phase II clinical trials. Biometrics
38:143–151
Freidlin B, Korn EL (2009) Monitoring for lack of benefit: a critical component of a randomized
clinical trial. J Clin Oncol 27:629–633
Freidlin B, Korn EL (2013) Borrowing information across subgroups: is it useful? Clin Cancer Res
19:1326–1334
Freidlin B, Korn EL (2017) Sample size adjustment designs with time-to-event outcomes: a caution.
Clin Trials 14:597–604
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14:4368–4371
Freidlin B, McShane LM, Korn EL (2010) Randomized clinical trials with biomarkers: design
issues. J Natl Cancer Inst 102:152–160
Freidlin B, McShane LM, Polley MY, Korn EL (2012) Randomized phase II trials designs with
biomarkers. J Clin Oncol 30:1–6
Freidlin B, Korn EL, Abrams JS (2018) Bias, operational bias, and generalizability in phase II/III
trials. J Clin Oncol 36:1902–1904
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. CRC Press,
New York
Hey SP, Kimmelman J (2015) Are outcome-adaptive allocation trials ethical? (and Commentary).
Clin Trials 12:102–127
Kopp-Schneider A, Calderazzo S, Wiesenfarth M (2019) Power gains by using external information
in clinical trials are typically not possible when requiring strict type I error control. Biom J.
https://doi.org/10.1002/bimj.201800395
Korn EL, Freidlin B (2011a) Outcome-adaptive randomization: is it useful? J Clin Oncol
29:771–776
Korn EL, Freidlin B (2011b) Reply to Y. Yuan et al. J Clin Oncol 29:e393
Korn EL, Freidlin B (2017) Adaptive clinical trials: advantages and disadvantages of various
adaptive design elements. J Natl Cancer Inst 109:djx013
Korn EL, Freidlin B, Abrams JS, Halabi S (2012) Design issues in randomized phase II/III trials.
J Clin Oncol 30:667–671
Lee JJ, Chu CT (2012) Bayesian clinical trials in action. Stat Med 31:2955–2971
Mehta C (2017) Commentary on Freidlin and Korn. Clin Trials 14:605–608
Meretoja A, Churilov L, Campbell BC, Aviv RI, Yassi N, Barras C, Mitchell P, Yan B, Nandurkar
H, Bladin C, Wijeratne T, Spratt NJ, Jannes J, Sturm J, Rupasinghe J, Zavala J, Lee A, Kleinig T,
Markus R, Delcourt C, Mahant N, Parsons MW, Levi C, Anderson CS, Donnan GA, Davis SM
(2014) The spot sign and tranexamic acid on preventing ICH growth – Australasia Trial (STOP-
AUST): protocol of a phase II randomized, placebo-controlled, double-blind, multicenter trial.
Int J Stroke 9:519–524
Pal SK, Tangen CM, Thompson IM, Shuch BM, Haas NB, George DJ, Stein MN, Wright JJ, Plets
M, Lara P (2017) A randomized, phase II efficacy assessment of multiple MET kinase inhibitors
in metastatic papillary renal carcinoma (PRCC): SWOG S1500. J Clin Oncol 35(15_suppl):
TPS4599
Papadimitrakopoulou V, Lee JJ, Wistuba II, Tsao AS, Fossella FV, Kalhor N, Gupta S, Byers LA,
Izzo JG, Gettinger SN, Goldberg SB, Tang X, Miller VA, Skoulidis F, Gibbons DL, Shen L, Wei
C, Diao L, Peng SA, Wang J, Tam AL, Coombes KR, Koo JS, Mauro DJ, Rubin EH, Heymach
JV, Hong WK, Herbst RS (2016) The BATTLE-2 study: a biomarker-integrated targeted therapy
study in previously treated patients with advanced non-small-cell lung cancer. J Clin Oncol
34:3638–3647
Pocock SJ, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors
in the controlled clinical trial. Biometrics 31:103–115
Sekeres MA, Othus M, List AF, Odenike O, Stone RM, Gore SD, Litzow MR, Buckstein R, Fang
M, Roulston D, Bloomfield CD, Moseley A, Nazha A, Zhang Y, Velasco MR, Gaur R, Atallah
E, Attar EC, Cook EK, Cull AH, Rauh MJ, Appelbaum FR, Erba HP (2017) Randomized phase
II study of azacitidine alone or in combination with lenalidomide or with vorinostat in higher-
risk myelodysplastic syndromes and chronic myelomonocytic leukemia: North American
Intergroup Study SWOG S1117. J Clin Oncol 35:2745–2753
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Thall PF, Wathen JK, Bekele BN, Champlin RE, Baker LH, Benjamin RS (2003) Hierarchical
Bayesian approaches to phase II trials in diseases with multiple subtypes. Stat Med 22:763–780
Thall P, Fox P, Wathen J (2015) Statistical controversies in clinical research: scientific and ethical
problems with adaptive randomization in comparative clinical trials. Ann Oncol 26:1621–1628
Vale CL, Tierney JF, Fisher D, Adams RA, Kaplan R, Maughan TS, Parmar MK, Meade AM (2012)
Does anti-EGFR therapy improve outcome in advanced colorectal cancer? A systematic review
and meta-analysis. Cancer Treat Rev 38:618–625
Wang S-J, Hung HMJ, O'Neill RT (2012) Paradigms for adaptive statistical information designs:
practical experiences and strategies. Stat Med 31:3011–3023
Wieand S, Schroeder G, O’Fallon JR (1994) Stopping when the experimental regimen does not
appear to help. Stat Med 13:1453–1458
Zhang QE, Wu Q, Harari PM, Rosenthal DI (2019) Randomized phase II/III confirmatory treatment
selection design with a change of survival end points: statistical design of Radiation Therapy
Oncology Group 1216. Head Neck 41:37–45
Biomarker-Guided Trials
62
L. C. Brown, A. L. Jorgensen, M. Antoniou, and J. Wason

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1146
Types of Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1147
Prognostic Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1147
Predictive Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148
The Life Course of a Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Discovery and Analytical Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Clinical Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Clinical Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Biomarker-Guided Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Nonadaptive Biomarker-Guided Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
Single-Arm Designs Including All Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
Enrichment Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1152
Marker-Stratified Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Hybrid Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Biomarker-Strategy Design with Biomarker Assessment in the Control Arm . . . . . . . . . . . . 1154
Biomarker-Strategy Design Without Biomarker Assessment in the Control Arm . . . . . . . . 1155
Biomarker-Strategy Design with Treatment Randomization in the Control Arm . . . . . . . . . 1156
Reverse Marker-Based Strategy Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1157
A Randomized Phase II Trial Design with Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1158

L. C. Brown (*)
MRC Clinical Trials Unit, UCL Institute of Clinical Trials and Methodology, London, UK
e-mail: [email protected]
A. L. Jorgensen
Department of Health Data Science, University of Liverpool, Liverpool, UK
e-mail: [email protected]
M. Antoniou
F. Hoffmann-La Roche Ltd, Basel, Switzerland
J. Wason
Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_168

Adaptive Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
Adaptive Signature Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
Outcome-Based Adaptive Randomization Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1160
Adaptive Threshold Enrichment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1161
Adaptive Patient Enrichment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1162
Adaptive Parallel Simon Two-Stage Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1163
Multi-arm Multi-stage Designs (MAMS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1164
Operational Considerations for Biomarker-Guided Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1165
Analysis of Biomarker-Guided Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1166
Analysis of Biomarker-Strategy Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
Analysis of Marker-Stratified Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1168
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1169

Abstract
This chapter describes the field of precision or stratified medicine and the role that
clinical trials play in the development and validation of markers, particularly
biomarkers, to inform management of patients. We begin by defining various
types of biomarker and describe the life cycle of a biomarker in terms of
discovery, analytical validation, clinical validation, and clinical utility. We pro-
vide a detailed overview of the many types of biomarker-guided trial designs that
have been described in the literature and then summarize the analytical methods
that are often used for biomarker-guided trials. Much of the research process for
biomarker-guided trials does not differ markedly from that used for non-
biomarker-guided trials, but particular attention must be given to selecting the
most appropriate trial design for the research question being investigated, and
we hope that this chapter helps with decisions on trial design and analysis in the
biomarker-guided setting.

Keywords
Biomarker · Stratified · Precision · Personalized · Validation · Prognostic ·
Predictive · Subgroup · Interaction

Introduction

We begin this chapter with a definition of what is meant by a biomarker-guided
trial. The field of stratified medicine (also known as precision or personalized
medicine) is dedicated to the identification of patient attributes that can be mea-
sured and used to make decisions on the management of their condition. These
patient attributes are often called biomarkers, and they can include anything from
complex laboratory tests to simple stratifiers such as gender, age, or stage of
disease. Identifying these biomarkers and proving that they are clinically useful
is not straightforward: many issues that are challenging in non-biomarker-guided
trials tend to be heightened in a biomarker-guided trial setting. This is, in part,
because evidence on how a biomarker performs is based on comparative
evidence between subgroups of patients, which by definition will have smaller
sample sizes than the non-stratified trial. Furthermore, there can be considerable
heterogeneity between patients in terms of their baseline characteristics as well as
their responses to treatments.
When undertaking a biomarker-guided clinical trial, many of the design consid-
erations are similar to those for non-biomarker-guided trials but there are particular
issues that require attention. This chapter will describe the various types of
biomarkers that can be used in clinical trials as well as the development and
validation process that is required before they can be recommended for routine use
in clinical practice. A large part of the chapter will be dedicated to describing the
different trial designs that have been developed and the advantages and disadvan-
tages of each to help decide which designs might be the most appropriate given the
research question. From a statistical perspective, sample size considerations and
analytical methods for biomarker trials are important, and these are summarized
here with references to the relevant literature.

Types of Biomarker

Various types of biomarkers are used for different clinical
applications. A comprehensive and detailed description of the different types of
biomarkers is provided in the FDA-NIH Working Group, Biomarkers, Endpoints
and Other Tools (BEST) guidelines (FDA-NIH Working Group 2018), where eight
distinct types of biomarker are defined. These are summarized in Table 1.
Biomarkers can be measured using binary, categorical, ordinal, or continuous
data, and appropriate statistical methods are required to ensure they are analyzed
correctly (both for getting robust results from biomarker assays and analyzing how
the biomarker data is associated with the prognosis or treatment effect in a trial). For
the purposes of this chapter, the most commonly used biomarkers in clinical trials are
prognostic and predictive biomarkers, and most of this chapter will be dedicated to
these types. Some biomarkers demonstrate both prognostic and predictive qualities,
so when designing a biomarker-guided trial, it is important to be aware of any
existing data that describe the discriminatory performance of the biomarker in
question, whether it be prognostic, predictive, or both.

Prognostic Biomarkers

Prognostic biomarkers stratify patients on the basis of the prognosis of the
disease in the absence of the new treatment being tested, and thus, they relate
to the natural history of the disease. Prognostic biomarkers can be important in
biomarker-guided trials as it is usually necessary to understand the behavior of
the disease under control conditions. If this is unclear, then it is possible that
changes in the course of the disease might be incorrectly attributed to the
treatment being tested. Furthermore, if a biomarker is prognostic, then the

Table 1 Summary of types of biomarkers (a)

Type of biomarker | Description
Diagnostic | A biomarker used to detect or confirm presence of a disease or condition of interest or to identify individuals with a subtype of the disease.
Monitoring | A biomarker measured serially for assessing status of a disease or medical condition or for evidence of exposure to (or effect of) a medical product or an environmental agent.
Pharmacodynamic/response | A biomarker used to show that a biological response has occurred in an individual who has been exposed to a medical product or an environmental agent.
Predictive | A biomarker used to identify individuals who are more likely than similar individuals without the biomarker to experience a favorable or unfavorable effect from exposure to a medical product or an environmental agent.
Prognostic | A biomarker used to identify likelihood of a clinical event, disease recurrence or progression in patients who have the disease or medical condition of interest.
Surrogate | A biomarker supported by strong mechanistic and/or epidemiologic rationale such that an effect on the surrogate biomarker endpoint is expected to be correlated with an effect on the endpoint intended to assess clinical benefit.
Safety | A biomarker measured before or after an exposure to a medical product or an environmental agent to indicate the likelihood, presence, or extent of toxicity as an adverse effect.
Susceptibility/risk | A biomarker that indicates the potential for developing a disease or medical condition in an individual who does not currently have clinically apparent disease or the medical condition.

(a) Summarized from the FDA-NIH Working Group, Biomarkers, Endpoints and Other Tools (BEST)
guidelines, updated May 2018 (FDA-NIH Working Group 2018)

event rate in the control arm will differ between strata and this may influence the
sample size needed for each subgroup. Prognostic biomarkers can potentially
also be useful for enriching trials or informing a treatment strategy that avoids
toxic treatment in patients who have, for example, a very good or very poor
prognosis (Freidlin 2014).

Predictive Biomarkers

Predictive biomarkers (also known as treatment-selection biomarkers) stratify patients on
the basis of their expected response (or not) to a particular treatment. For these types of
biomarkers, it can be particularly challenging to prove utility as evidence of interaction is
often required. In this situation, the target difference is a difference in treatment effects
between stratified groups rather than a difference in the overall effect between random-
ized groups. Often, attaining adequate statistical power for a test of interaction can lead to
a potentially unfeasible sample size, and tests for interaction are often underpowered in
biomarker-guided trials. Thus, validation of predictive biomarkers in independent
datasets is particularly important, and validation will be discussed later.
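To see why interaction tests are so demanding, the sketch below (our own illustration, assuming a normally distributed outcome, 1:1 randomization, and a given biomarker prevalence) compares the total sample size needed to detect an overall treatment effect with that needed to detect an interaction of the same magnitude; with 50% prevalence the latter is four times the former.

```python
from math import ceil
from scipy.stats import norm

def total_n_main(delta, sd=1.0, alpha=0.05, power=0.9):
    """Total N (1:1 randomization) to detect an overall mean difference delta."""
    za, zb = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ceil(4 * ((za + zb) * sd / delta) ** 2)

def total_n_interaction(delta, prevalence=0.5, **kw):
    """Total N to detect a difference of delta between the treatment effects in
    biomarker-positive and biomarker-negative patients: the variance of the
    interaction contrast scales by 1/prev + 1/(1 - prev)."""
    factor = 1 / prevalence + 1 / (1 - prevalence)
    return ceil(total_n_main(delta, **kw) * factor)

print(total_n_main(0.5), total_n_interaction(0.5))  # interaction needs ~4x the N
```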

The Life Course of a Biomarker

The development and testing of a biomarker ideally follow a typical life course (see
Fig. 1). However, this life course is often iterative rather than sequential and a return
to earlier stages in the development is often required as more data become available,
biomarkers are refined, and/or new, possibly better biomarkers emerge.

Discovery and Analytical Validity

Biomarkers can be discovered either retrospectively or prospectively. Retrospec-
tively, they can emerge from either preplanned or post-hoc subgroup analyses of
previous trials, but they can also be developed prospectively as part of the
hypothesis-driven design of a molecule that is specifically intended to work in a
biomarker-defined subgroup (targeted therapy). It is important to note that retro-
spective discovery of biomarkers can be influenced by selection bias in terms of
availability of data or tissue on which to retrospectively test the biomarker, so
caution should be taken in this regard and prospective testing is usually preferable.
Regardless of the mode of discovery, biomarkers that are to be used in a clinical
trial need to demonstrate analytical validity which means that the test should be
reliable (low risk of test failure), accurate (classify the patient into the correct
biomarker group), and repeatable (in terms of both inter- and intra-reproducibility).
High biomarker test failure rates can make the delivery of a biomarker-guided trial
infeasible, and misclassification errors can lead to the dilution of treatment effects
(see later section) and thus mask the potential of truly effective treatments.
Therefore, establishing analytical validity is an important step in the life course
of a biomarker and the extent to which the biomarker is analytically validated will
depend on the stage of development of the treatment being tested. For example,
during an early phase clinical trial, the development of the analytical validity of the
biomarker may still be in progress but it should be validated to a minimum standard
before integrating it into the next phases of clinical trial assessment. Furthermore,
the analytical validity of the test is required to determine the prevalence of the

Discovery and Clinical


development Clinical validity Utility
Analytical Clinical Independent Clinical
Discovery
validity testing validation Utility

Diagnostics and Evidence of Benefit if


quality control discriminatory rolled out
effect
Fig. 1 The life course of a biomarker from discovery to clinical testing and utility
1150 L. C. Brown et al.

biomarker, and this is an important consideration for the design and sample size
calculations for the subsequent trial.
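The effect-dilution point made above is easy to quantify: if a fraction of enrolled "biomarker-positive" patients are truly negative and derive no benefit, the apparent treatment effect shrinks proportionally and power falls. A minimal sketch (our own illustration, for a normally distributed outcome and a two-sided 5% test):

```python
from math import sqrt
from scipy.stats import norm

def diluted_power(delta, sd, n_per_arm, misclass):
    """Power of a two-arm comparison when a fraction `misclass` of enrolled
    'biomarker-positive' patients are truly negative and derive no benefit,
    so the apparent effect shrinks to (1 - misclass) * delta."""
    effect = (1 - misclass) * delta
    se = sd * sqrt(2 / n_per_arm)
    return norm.cdf(effect / se - norm.ppf(0.975))

for m in (0.0, 0.1, 0.2, 0.3):
    print(f"misclassification {m:.0%}: power = {diluted_power(0.5, 1.0, 85, m):.2f}")
```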

Clinical Validity

If a retrospective subgroup analysis from a clinical trial demonstrates evidence that a
particular biomarker might be an important stratifier for patient management, then this
finding needs to be validated in an independent dataset, ideally another randomized
clinical trial testing the treatment of interest. This is often a challenging aspect of
biomarker development as, unless the original trial has built a validation stage into its
design, it requires the completion of at least two randomized trials that have not only
tested the treatment but also have adequate tissue or data to be able to stratify the patients
into the biomarker positive and negative groups. For this reason, clinical validation
sometimes occurs in non-randomized, observational datasets or in treated cohorts of
patients who are then compared against an appropriately selected historical control
group. The use of randomized data should be regarded as the ideal for clinical validation
but this is not always possible. The absence of an unbiased control group can lead to
type 1 (false positive) and type 2 (false negative) errors. Another important consideration
when biomarkers are measured as continuous variables is ensuring reliable estimation of
the optimal threshold for classifying patients into the biomarker positive or negative
groups on the basis of their benefit (or lack of) with the treatment. Some trial designs
such as the adaptive signature design (described later) aim to develop and validate the
biomarker cutpoint within the same trial but this can be methodologically challenging.

Clinical Utility

Once a biomarker has passed through all the stages of analytical and clinical valida-
tion, it may be necessary to test the utility of the biomarker-guided treatment approach
against one that does not use the biomarker to make clinical decisions. These are
typically large and expensive trials as they are attempting to measure the real-world
effectiveness of biomarker-guided management including reliability, feasibility, and
acceptability for patients. If the biomarker testing is complex and expensive, then it
may be important to confirm that use of a biomarker-guided approach does indeed lead
to better outcomes for patients at a reasonable cost. As a result, cost-effectiveness
outcomes can become important in these clinical utility trials.

Biomarker-Guided Trial Designs

Similar to non-biomarker-guided trials, there are two broad categories of biomarker-guided
trials: nonadaptive and adaptive. Nonadaptive designs do not allow modifi-
cations of important aspects of the trial after its commencement, such as refining
biomarker subgroups, adding or dropping treatment arms, sample size, etc. Rather,

these factors are defined at the outset and remain fixed for the trial duration. This can
be problematic when there is uncertainty surrounding assumptions made at the
design stage. There is generally more potential for uncertainty when designing a
biomarker-guided trial. For example, new biomarkers or targeted treatments may
come to light once the trial is underway; the predictive ability of a potentially
promising biomarker may be lower than expected; and there may be uncertainty
regarding biomarker prevalence at the outset. Hence an adaptive design, which
allows adaptations based on accumulating data, can be an attractive alternative due
to its flexibility. However, while offering more flexibility, adaptive designs are more
complex and can raise both practical and methodological challenges which need
careful consideration.
A summary of the various adaptive and nonadaptive biomarker-guided trial
designs is provided below. More in-depth discussion of the various design options,
together with an overview of their methodology, guidance on sample size calcula-
tions, other statistical considerations, and their advantages and disadvantages in
various situations, is provided in two review articles by Antoniou et al. (Antoniou
et al. 2016, 2017) and as part of the DELTA2 guidance on sample size calculations
(Cook et al. 2018). Further guidance is also provided in an online tool available at
www.bigted.org.

Nonadaptive Biomarker-Guided Trial Designs

Single-Arm Designs Including All Patients

Such designs include the whole study population, with all patients prescribed the
same experimental treatment irrespective of biomarker status and with no comparison
to a control treatment (see Fig. 2). These trial designs can be useful for initial
identification and/or validation of a biomarker since they allow association to be tested

Fig. 2 Schematic for single-arm biomarker exploration design

between biomarker status and efficacy or safety of the experimental treatment. Their
aim is not to estimate the treatment effect, nor the clinical utility of a biomarker in a
definitive way, but to identify whether the biomarker is sufficiently promising to
proceed to a more definitive biomarker-guided randomized controlled trial.

Enrichment Designs

Enrichment designs involve entering and, if appropriate, randomizing only patients
who are positive for a particular biomarker, and comparing the experimental treatment
with the standard treatment only in this biomarker-positive subgroup.
Biomarker-negative patients are excluded from the study at the start, although they are
sometimes included later if sufficient evidence emerges of a treatment effect in the
biomarker-positive group. Unless this occurs, assessment of the efficacy of
the experimental treatment is limited to the biomarker-positive subgroup (see Fig. 3).
These trials are useful for testing treatment efficacy in a specific biomarker-
defined subgroup where there is mechanistic evidence to suggest that efficacy is
likely to be limited to those within that biomarker-positive subgroup, but this still
requires prospective validation. In this situation, these trials can result in cost savings
with biomarker-negative patients not randomized unnecessarily. Further, any treat-
ment effect is not inappropriately diluted due to inclusion of biomarker-negative
patients, particularly in the case where biomarker-positive prevalence is low. How-
ever, these designs are recommended only when both the cut-off for determination of
biomarker status and the analytical validity of the biomarker have been well
established. They are also only suitable where assessing biomarker status can be
done with a rapid turnaround time, to avoid delaying treatment.
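A practical corollary, sketched below with made-up numbers, is the screening requirement: recruiting only biomarker-positive patients means that roughly n / prevalence patients must be screened to randomize n.

```python
# Back-of-envelope screening burden for an enrichment design
# (illustrative numbers only).
def n_to_screen(n_randomized, prevalence):
    """Expected number of patients to screen to randomize n_randomized
    biomarker-positive patients."""
    return n_randomized / prevalence

print(n_to_screen(300, 0.15))   # ~2000 screened for 300 randomized
```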
Enrichment designs are particularly useful where it would not be appropriate to
randomize the biomarker-negative population into different treatment arms, for

Fig. 3 Schematic for enrichment designs. “R” refers to randomization of patients



example, where there is prior evidence that the experimental treatment is not beneficial
for them or is likely to cause them harm. However, when it remains unclear whether or
not biomarker-negative individuals will benefit from the novel treatment, the enrich-
ment design is not appropriate and alternative designs, which also assess effectiveness
in the biomarker-negative individuals, should be considered.

Marker-Stratified Designs

In marker-stratified design trials, individuals are first stratified into biomarker-positive
and biomarker-negative subgroups, then randomized within each of these
subgroups to either the experimental or control treatment. Consequently, there are
four treatment groups. This allows an assessment of treatment effect not only in the
study population overall but also in the biomarker-defined subgroups separately.
The design is useful when there is sufficient evidence that the experimental
treatment is more effective in the biomarker-positive subgroup than in the
biomarker-negative subgroup but insufficient data demonstrating that the experi-
mental treatment is of no benefit to biomarker-negative individuals. However, the
design may be unfeasible when the prevalence of one of the biomarker subgroups is
low, as this can result in chance imbalances between the randomized groups within
that subgroup (Fig. 4).

Hybrid Designs

In hybrid design trials, the entire population is firstly screened for biomarker status
and all individuals enter the trial. However, only biomarker-positive patients are
randomly assigned either to the experimental or control treatment, while all

Fig. 4 Schematic for marker-stratified designs. “R” refers to randomization of patients



Fig. 5 Schematic for hybrid design. “R” refers to randomization of patients

biomarker-negative patients receive the control treatment. The difference compared
to enrichment designs is that biomarker-negative patients are not excluded (see
Fig. 5). Such designs are recommended when there is compelling prior evidence
showing detrimental effect of the experimental treatment for a specific biomarker-
defined subgroup (i.e., biomarker-negative subgroup) or some indication of its
possible excessive toxicity in that subgroup, thus making it unethical to randomize
patients to the experimental treatment. The strength of the hybrid design is that, as
well as allowing evaluation of the treatment in the biomarker-positive group, it also
allows the prognostic value of the biomarker to be assessed.

Biomarker-Strategy Design with Biomarker Assessment in the Control Arm

In this trial design, the entire study population is tested for its biomarker status. Next,
patients irrespective of their biomarker status are randomized either to the
biomarker-based strategy arm or to the non-biomarker-based strategy arm. In the
biomarker-based strategy arm, biomarker-positive patients receive the experimental
treatment, whereas biomarker-negative patients receive the control treatment.
Patients who are randomized to the non-biomarker-based strategy arm receive the
control treatment irrespective of biomarker status (see Fig. 6).
This approach is useful when the aim is to test the hypothesis that a treatment
approach taking biomarker status into account is superior to that of the standard of
care – that is, the clinical utility of the biomarker. Further, the biomarker-based
strategy arm does not necessarily need to be limited to one experimental treatment –
in principle, a marker-based strategy involving many biomarkers and many possible
treatments could be tested. This type of design can inform researchers whether the
biomarker is prognostic, since both biomarker positive and negative patients are
exposed to the control treatment. However, it cannot definitively answer the question

Fig. 6 Schematic for biomarker-strategy design with biomarker assessment in the control arm. “R”
refers to randomization of patients

of whether the biomarker is predictive, since only biomarker-positive patients are
exposed to the experimental treatment.
Furthermore, these designs do not allow a direct comparison between the experimental
and control treatments, as they are designed to compare the biomarker strategies
rather than the treatments themselves.

Biomarker-Strategy Design Without Biomarker Assessment in the Control Arm

Here, patients are again randomized between testing strategies (i.e., biomarker-based
strategy and non-biomarker-based strategy) but the design differs in terms of timing of
biomarker evaluation. More precisely, first, patients are randomized to either the
biomarker-based strategy or to the non-biomarker-based strategy, and biomarkers are
evaluated only in patients who are assigned to the biomarker-based strategy arm.
Patients found to be biomarker-positive are then given the experimental treatment with
biomarker-negative patients given the control treatment. Again, those randomized to
the non-biomarker-based strategy receive the control treatment (see Fig. 7).
This design is useful in situations where it is either not feasible or ethical to test
the biomarker in the entire population due to several logistical (e.g., specimens not
submitted), technical (e.g., assay failure), or clinical reasons (e.g., tumor inacces-
sible); thus, biomarker status is obtained only in patients who are randomized to
the biomarker-based strategy arm. However, biomarker-positive and biomarker-

Fig. 7 Schematic for biomarker-strategy design without biomarker assessment in the control arm. “R” refers to randomization of patients

negative subgroups might be more imbalanced as compared to the first type of
biomarker-strategy design, since randomization is performed before evaluation of
biomarker status. This is especially likely when the number of patients is small.

Biomarker-Strategy Design with Treatment Randomization in the Control Arm

In this design, there is a second randomization between experimental and control
treatment in the non-biomarker-guided strategy arm. While the two previously
described biomarker-strategy designs can address the question of whether a
biomarker-based strategy is more effective than standard treatment, the biomarker-
strategy design with treatment randomization in the control arm allows a test of
whether the biomarker-based strategy is better not only than the standard treatment
but also than the experimental treatment in the overall, unselected population (see
Fig. 8).
Patients are first randomly assigned to either the biomarker-based strategy arm or
to the non-biomarker-based strategy arm. Next, patients who are allocated to the
non-biomarker-based strategy arm are further randomized either to the experimental
or to the standard treatment arm, irrespective of biomarker status. The ratio for
randomizing in the non-biomarker-based strategy arm should be informed by the
prevalence of the biomarker in the population as a whole, to ensure balance between
the study arms. Patients randomized to the biomarker-based strategy arm and who
are biomarker-positive are given the experimental treatment with biomarker-
negative patients given the control treatment.
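The allocation point above can be sketched as follows (a minimal illustration; the reading that the second-stage ratio should match the biomarker prevalence is an assumption consistent with the text):

```python
# Choosing the experimental:control randomization ratio in the
# non-biomarker-based strategy arm so that the expected treatment mix
# matches the biomarker-based strategy arm, where a fraction `prev`
# (the biomarker prevalence) receives the experimental treatment.
def control_strategy_arm_ratio(prev):
    return prev, 1 - prev   # (experimental weight, control weight)

exp_w, ctl_w = control_strategy_arm_ratio(prev=0.3)
print(f"experimental:control = {exp_w:.0%}:{ctl_w:.0%}")   # 30%:70%
```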

Fig. 8 Schematic for biomarker-strategy design with treatment randomization in the control arm.
“R” refers to randomization of patients

The clinical utility of the biomarker is evaluated by comparing treatment effect
between the biomarker-based strategy arm and non-biomarker-based strategy arm. It
is also possible to test whether the experimental treatment is more effective in the
entire population or in a biomarker-defined subgroup only, since both biomarker
subgroups are exposed to both treatments.
One benefit of this design as compared to the two previously discussed
biomarker-strategy designs is that it allows investigation of not only whether
the biomarker is prognostic but also whether it is a predictive treatment effect
modifier. A further strength is that it allows clarification of whether a result
indicating an advantage in favor of the biomarker-based strategy is due to a
true effect of the biomarker itself or due to a treatment effect irrespective of
biomarker status.

Reverse Marker-Based Strategy Design

Here, patients are randomized either to the biomarker-based strategy arm or the
reverse biomarker-based strategy arm. As in the previous three biomarker-strategy
designs, patients who are allocated to the biomarker-strategy arm receive the exper-
imental treatment if they are biomarker-positive whereas biomarker-negative
patients receive the control treatment. By contrast, patients who are randomly
assigned to the reverse biomarker-based strategy arm receive the control treatment
if they are biomarker-positive, whereas biomarker-negative patients receive the
experimental treatment (see Fig. 9).

Fig. 9 Schematic for reverse marker-based strategy design. “R” refers to randomization of patients

This design is recommended in cases where prior evidence indicates that both
experimental and control treatment are effective in treating patients, but the optimal
strategy has not yet been identified. The design enables the evaluation of an
interaction between the biomarker and different treatments. Additionally, it allows
estimation of the effect size of the experimental treatment compared to control
treatment for each biomarker-defined subgroup separately. Also, there is no chance
that the same treatment will be allocated to biomarker-positive patients in both arms
or to biomarker-negative patients in both arms. This is a problem in the other types of
biomarker-based strategy designs where there will be patients with the same bio-
marker status having the same treatment in both trial arms.
It is important to note that all biomarker-strategy designs will need a larger
sample size as compared to the marker-stratified designs.

A Randomized Phase II Trial Design with Biomarker

This is a biomarker-guided phase II clinical trial design which, when completed,


recommends which type of phase III trial design should be used. The trial starts with
biomarker assessment, with all patients randomized to either an experimental or
control treatment. An interim analysis is then undertaken in the biomarker-positive
subgroup. If the experimental treatment is found superior to control at a prespecified
level of significance, treatment effect is subsequently estimated in the biomarker-
negative subgroup. Based on the estimated treatment effect in the biomarker-
negative subgroup, and in particular its confidence interval, a recommendation is

Fig. 10 Schematic for a randomized phase II trial design with biomarkers. “R” refers to randomization of patients. CI refers to the confidence interval. Uncolored boxes refer to the first stage of the trial and colored boxes refer to the second stage of the trial. Different stages refer to the analysis and not to the trial design. (In the schematic, the experimental treatment is first tested for superiority in the biomarker-positive subgroup at level α1 = 0.10; if superior, an 80% CI for the hazard ratio in the biomarker-negative subgroup is compared against thresholds of 1.3 and 1.5 to choose between the enrichment, marker-stratified, and traditional designs; if not, superiority is tested in the entire population at level α = 0.05, leading either to a traditional design or to no further testing of the experimental treatment.)

given on the type of phase III trial design to be used (enrichment, marker stratified or
traditional with no biomarker). If in the interim analysis, however, the treatment
effect is not found to be significant in the biomarker-positive subgroup, the exper-
imental treatment is compared to control in the entire study population. If the overall
treatment effect is found significant at a prespecified level of significance, a tradi-
tional design with no biomarker assessment is recommended for phase III. Other-
wise, it is recommended that no phase III trial is undertaken for the experimental
treatment (see Fig. 10).
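A hedged sketch of this decision logic follows. The thresholds (1.3, 1.5) and α levels are the illustrative values shown in Fig. 10; the mapping of CI positions to recommended designs is one plausible reading of the schematic, assuming the hazard ratio is coded so that HR < 1 favors the experimental treatment.

```python
# Sketch of the phase III recommendation rule (assumptions as above).
def phase3_recommendation(pos_superior_at_0_10, neg_hr_ci_80, overall_superior_at_0_05):
    if pos_superior_at_0_10:
        low, high = neg_hr_ci_80     # 80% CI for HR in biomarker-negatives
        if high < 1.3:
            return "traditional design"        # benefit plausible in all patients
        if low > 1.5:
            return "enrichment design"         # no benefit expected in negatives
        return "marker-stratified design"      # benefit in negatives uncertain
    if overall_superior_at_0_05:
        return "traditional design"
    return "no phase III trial"

print(phase3_recommendation(True, (0.8, 1.2), None))    # traditional design
print(phase3_recommendation(True, (1.6, 2.4), None))    # enrichment design
print(phase3_recommendation(True, (1.1, 1.7), None))    # marker-stratified design
print(phase3_recommendation(False, None, False))        # no phase III trial
```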

Adaptive Designs

Adaptive Signature Design

The adaptive signature design was proposed for settings where a biomarker signa-
ture, defined as a set of biomarkers the combined status of which is used to stratify
patients into subgroups, is not known at the outset, and allows the development and
evaluation of a biomarker signature within the trial. Generally, this approach is
useful when there is no available biomarker at the start of the trial or when there

Fig. 11 Schematic for adaptive signature design. “R” refers to randomization of patients

are a great number of candidate biomarkers which could be combined to identify a
biomarker-defined subgroup (see Fig. 11).
The design begins with a comparison between the experimental and standard
treatment in the entire study population at a prespecified level of significance. If
treatment effect is statistically significant, the treatment is considered beneficial, and
the trial is closed. If the comparison in the overall population is not promising, then
the entire population is divided into two samples in order to develop a biomarker
signature in one sample and validate it in another. This is in order to identify a
biomarker signature that best identifies subjects for which the experimental treatment
is better than the standard treatment (the so-called “biomarker-positive” group). The
trial then continues, but recruiting only biomarker-positive patients, as determined
by the biomarker signature. Hence, this approach (i) identifies patients who benefit
from the experimental treatment during the initial stage of the study (at the interim
analysis); (ii) assesses the global treatment effect of the entire randomized study
population through a powered test, and (iii) assesses the treatment effect for the
biomarker-positive subgroup within patients randomized in the remainder of the
trial, the so-called “validation test.”
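A schematic sketch of the split-sample step is given below. This is a simplified illustration with simulated data, not the authors' algorithm: the choice of a logistic model fitted to treated patients as the "signature" and the 0/1 prediction cutoff are both assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))            # candidate biomarkers
treat = rng.integers(0, 2, size=n)     # randomized allocation (1 = experimental)
benefit = (X[:, 0] > 0)                # hidden truth: biomarker 0 drives benefit
y = (rng.random(n) < 0.3 + 0.3 * benefit * treat).astype(int)   # response

# Split the trial population: develop the signature on one half,
# validate in the other half.
dev, val = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Simplified signature development: model response among treated
# patients in the development half.
treated_dev = dev[treat[dev] == 1]
signature = LogisticRegression().fit(X[treated_dev], y[treated_dev])

# "Validation test": treatment effect among signature-positive patients
# in the validation half.
pos = val[signature.predict(X[val]) == 1]
effect = y[pos][treat[pos] == 1].mean() - y[pos][treat[pos] == 0].mean()
print(f"Effect in signature-positive validation patients: {effect:.2f}")
```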

Outcome-Based Adaptive Randomization Design

This design can be useful when the biomarkers are either putative or unknown at the
beginning of a phase II trial, and also when there are multiple targeted treatments and
biomarkers to be considered. It aims to test simultaneously both biomarkers and

Fig. 12 Schematic for outcome-based adaptive randomization design. “R” refers to randomization of patients (in the schematic, the allocation ratio within each biomarker subgroup starts at 1:1 and is adapted, e.g., to 1:3 or 2:1, at successive interim analyses)

treatments while providing more patients with effective therapies according to their
biomarker profiles (see Fig. 12).
The trial begins with the assessment of patients’ biomarker status. Within each
biomarker subgroup, patients are then randomized equally to one or more experi-
mental arms or a control arm. The design permits modification of the allocation
ratio between treatment arms over time, so that the arm(s) with the best observed
response rate receive(s) a higher proportion of randomized patients. This modification
of the allocation ratio is informed, at each interim analysis, by the accumulating
patient data on how well each treatment performs within each biomarker subgroup.
For example, when the data accrued so far suggest that a particular treatment is superior
to the others, the ratio is modified so that a higher number of patients are allocated accordingly.
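A minimal sketch of one common implementation of this idea, Bayesian response-adaptive randomization with Beta posteriors, is given below (illustrative, not the algorithm of any specific trial discussed here):

```python
# Within one biomarker subgroup: tilt allocation toward the arm with
# the higher posterior probability of having the best response rate.
import numpy as np

def allocation_probs(successes, failures, n_draws=10_000, seed=1):
    """successes/failures: responders and non-responders per arm."""
    rng = np.random.default_rng(seed)
    draws = np.column_stack([
        rng.beta(1 + s, 1 + f, size=n_draws)   # Beta(1, 1) prior per arm
        for s, f in zip(successes, failures)
    ])
    # Posterior probability that each arm is best; used as (possibly
    # stabilized) randomization weights at the next stage.
    return np.bincount(draws.argmax(axis=1), minlength=len(successes)) / n_draws

# Interim data: control 8/20 responders, experimental 13/20 responders.
print(allocation_probs(successes=[8, 13], failures=[12, 7]))
```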

Adaptive Threshold Enrichment Design

This design is based on prior knowledge that a specific biomarker-defined
subgroup (the biomarker-positive subgroup) is believed to benefit more from a novel
treatment as compared to the remainder of the study population (biomarker-negative
subgroup). The trial is conducted as follows: (i) accrue and randomize only
biomarker-positive patients to experimental or control treatment; (ii) conduct an
interim analysis in order to compare the experimental treatment with control

Fig. 13 Schematic for adaptive threshold enrichment design. “R” refers to randomization of
patients

treatment within the biomarker-positive subgroup; and (iii) if the interim result is
negative, then the accrual stops due to futility in the biomarker-positive subgroup
and the trial is closed without showing a treatment benefit; if the result is “promis-
ing” for the specific biomarker-positive subgroup, then the study continues with this
specific biomarker-positive subgroup and accrual also begins for biomarker-negative
patients. Thus, the trial continues with patients randomized from the entire popula-
tion. A “promising” result in the biomarker-positive subgroup at the interim stage is
claimed when the estimated treatment effect is above a particular prespecified
threshold (see Fig. 13).

Adaptive Patient Enrichment Design

This design adaptively modifies accrual to two predefined biomarker-defined
subgroups based on an interim analysis for futility. The trial is conducted as
follows: (i) accrue both biomarker-positive and biomarker-negative patients, and
randomize the two subgroups respectively to experimental or control treatment;
(ii) perform an interim analysis to evaluate treatment effect in the biomarker-
negative subgroup; (iii) if the interim result in that subgroup is “not promising,”
defined as the observed efficacy for the control group being greater than that for
the experimental group and the difference being insufficient to pass a prespecified futility

Fig. 14 Schematic for adaptive patient enrichment design. “R” refers to randomization of patients

boundary, then accrual of biomarker-negative patients stops, but the trial continues
by accruing additional biomarker-positive patients to replace the unaccrued
biomarker-negative patients until the prespecified total target sample size is achieved;
(iv) conversely, if the interim results are promising in the biomarker-negative patients,
the accrual of both biomarker-negative and biomarker-positive patients continues until
the total target sample size is achieved (see Fig. 14).

Adaptive Parallel Simon Two-Stage Design

This design allows the efficacy of a novel treatment, which possibly differs in the
biomarker-positive subgroup compared to the biomarker-negative subgroup, to be
tested. It requires a predefined biomarker with well-established prevalence (see
Fig. 15). The design begins with a first stage, which entails two parallel phase II
studies, one in the biomarker-positive and the other in the biomarker-negative
subgroup. Next, if activity is not observed in either biomarker subgroup during the
first stage, the trial stops; if activity of the experimental treatment is observed during
the first stage of the study for both the biomarker-positive and biomarker-negative
subgroups, additional patients from the general patient population are enrolled into
the second stage; if results of the first stage suggest that activity is limited to
biomarker-positive patients, the second stage continues with the recruitment of
additional biomarker-positive patients only. This design may improve the efficiency
of a trial, as it allows early identification that a particular experimental treatment
is beneficial in a specific biomarker-defined subgroup.
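As a toy sketch of the stage-1 decision (the design parameters n1 and r1 below are placeholders, not values from the chapter), each parallel subgroup study applies a Simon-type futility rule:

```python
# Stage-1 futility check of the Simon two-stage type, applied in
# parallel within each biomarker subgroup (placeholder parameters).
def continue_to_stage2(responses, n1=15, r1=3):
    """Continue a subgroup to stage 2 only if more than r1 responses
    are seen among its first n1 patients."""
    return responses > r1

print(continue_to_stage2(responses=2))   # False: stop this subgroup
print(continue_to_stage2(responses=6))   # True: proceed to stage 2
```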

Fig. 15 Schematic for adaptive parallel Simon two-stage design. “R” refers to randomization of patients

Multi-arm Multi-stage Designs (MAMS)

This design as originally proposed was not for biomarker-guided trials but rather was
aimed at testing multiple experimental treatments against a control treatment in the
same trial. However, it is also useful in a biomarker-guided context since it allows
patients to be allocated to a trial of a particular experimental treatment, based on their
biomarker status.
The first stage of a MAMS trial (the phase II stage) involves biomarker stratifi-
cation into one of a number of separate comparisons with each comparing an
experimental treatment with a control treatment. The comparison within which a
patient is included depends on their biomarker status, for example, patients positive
for biomarker 1 may be randomized in comparison 1 to either control or experimen-
tal treatment 1 while patients positive for biomarker 2 may be randomized into
comparison 2 to either control or experimental treatment 2. At the end of this first
stage, an interim analysis is undertaken within each comparison, comparing each
experimental treatment with the control treatment. Depending on the outcome of the
interim analysis, accrual of patients in a comparison either continues to the second
stage of the trial or the accrual of additional patients stops within that comparison
(see Fig. 16).

Fig. 16 Schematic for multi-arm, multi-stage (MAMS) design. “R” refers to randomization of patients

This design has the ability to simultaneously compare multiple experimental
treatments with a control treatment, therefore achieving results in less time as
compared with separate phase II trials to assess each novel treatment individually.
Depending on how long the actual endpoint takes to observe, the actual or an
intermediate endpoint can be used at the interim analysis stage. Generally, MAMS
designs are useful when (i) there are multiple promising treatments in phase II/III
studies; (ii) there is no strong belief that one treatment will be more beneficial
than another; (iii) adequate funds are available; (iv) there is an adequate number
of patients to be enrolled; and (v) there is an intermediate outcome measure that is
likely to be on the causal pathway to the primary outcome measure.
Benefits of this design are that the overall trial is unlikely to stop for futility as
multiple experimental treatments are tested and hence, it is unlikely that all exper-
imental arms will be ineffective and dropped. Further, the regulatory and adminis-
trative burden is reduced as compared to running several separate trials, while
unpromising experimental arms can be dropped in a quick and reliable way. Design
benefits of this approach have been reported and implemented in trials such as the
UK FOCUS4 trial in colorectal cancer and the US Lung Map trial in lung cancer
(Kaplan et al. 2013; Lung-Map: Master Protocol for Lung Cancer 2021).

Operational Considerations for Biomarker-Guided Trials

For many biomarkers, measurement can be relatively easy but for some of the more
complex laboratory-driven and imaging biomarkers, quality assurance is necessary
to demonstrate that the biomarker test is reliable and repeatable, particularly between
laboratories and between any investigators who are responsible for making

judgments on whether the patient is classified as biomarker positive or negative.


Much of this relates to the important work completed during the analytical validation
stage of the biomarker life course (see Fig. 1), but it is important that quality
assurance is performed at the start and during the course of any biomarker-guided
clinical trial.
From a regulatory and a patient perspective, there are also important approvals
that need to be in place to ensure the appropriate handling and archiving of both
patient tissue and clinical data collected from patients. These approvals will vary
internationally but all will need to comply with the ICH Guidelines on Good Clinical
Practice (ICH GCP guidelines n.d.). The effort required for appropriate handling of
patient tissue, along with the legal requirements for anonymization of patient data,
is not to be underestimated in these types of trials, and adequate resources must be
available to ensure that data are secure and test results are turned around in a timely
fashion so that patients can be entered into the trial without delays.

Analysis of Biomarker-Guided Trials

There are a number of sources of statistical uncertainty when analyzing biomarker-guided
trials (see Fig. 17). Furthermore, as described in the previous section, there
are a large number of designs available that utilize biomarkers. In many cases, the
primary analysis of biomarker-guided trials aims to demonstrate the performance of
the biomarker in terms of decision-making for patient management, and this will
typically be analyzed using a sensitivity and specificity approach, with the area under
the curve (AUC) of the receiver operating characteristic (ROC) curve as the summary
measure. In other cases, it will involve tests of interaction between the biomarker and
the treatment. Often both of these aspects are of interest.
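As a tiny sketch of the AUC computation just mentioned (made-up data; scikit-learn's roc_auc_score is one standard implementation):

```python
from sklearn.metrics import roc_auc_score

disease = [0, 0, 1, 1, 0, 1]                 # status per the gold standard
biomarker = [0.2, 0.5, 0.7, 0.9, 0.3, 0.4]   # continuous biomarker values
print(roc_auc_score(disease, biomarker))     # ~0.89; 1.0 = perfect discrimination
```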
The most appropriate statistical analysis is highly dependent on the design used,
in particular the hypothesis being tested. Ideally, any subgroup analyses should be
prespecified before the data are inspected and analyzed; inclusion in a signed and
dated statistical analysis plan is advised to document that the subgroup was
prespecified.

Fig. 17 Sources of statistical uncertainty when exploring the role of a biomarker in a stratified trial (randomization to treatment A or no treatment A, then biomarker X positive or negative, each with benefit or no benefit)

In subsequent subsections, we cover some more specific considerations for the
statistical analysis of common types of biomarker-guided trial.

Analysis of Biomarker-Strategy Designs

A biomarker-strategy design tests the hypothesis that using the biomarker to guide
treatment will result in superior outcomes to not using it. As discussed above,
biomarker-strategy designs can take various forms. The experimental arm may
allocate between a number of treatments depending on the results of the biomarker
test or may just choose between treatment or nontreatment. The control arm may
allocate all patients to one standard treatment or may randomize patients between
treatments.
In all cases, the primary analysis of a biomarker strategy will compare outcomes
between the experimental and control arm. Thus, methods of analysis will be similar
to traditional two-arm RCTs. As biomarker-strategy trials are often assessing the
effectiveness of implementing the strategy in routine practice, the primary analysis
should generally be by intention to treat, with patients analyzed within their ran-
domized groups. It is likely that some patients in the experimental arm may not
follow the specified treatment strategy due to practical issues. In this case, a
per-protocol analysis could be useful to determine whether an idealized version of
the biomarker strategy, where there were no errors or delays in assessing the
biomarker and no deviations from the recommended treatment occur, has a larger
advantage. This may be useful in determining if further refinements of the biomarker
test could be useful. If a per-protocol analysis is required, then it is important to
prespecify the definition of per-protocol clearly in the statistical analysis plan.
One issue with analysis of biomarker strategy designs is that there may be
heterogeneity of outcome within each arm which may violate assumptions made
by the analysis. If, for example, different types of patients are allocated to different
treatments within each arm, then the assumptions of some statistical tests that data
from within each arm is similarly distributed will not be true. Analyses that adjust or
stratify by the biomarker status may be more appropriate, but still would assume that
the mean treatment effect (i.e., the difference in effect of being on the experimental
arm compared with the control arm) is the same for different types of patients on
some scale. It is important to be clear about the assumptions of the analysis and to
ensure the results are robust to them.

Analysis of Marker-Stratified Designs

The gold standard design for testing whether a biomarker is predictive is the marker
stratified design. The statistical analysis will generally focus on: (1) the interaction
effect between biomarker and treatment on outcome and (2) the marginal effect of
the treatment. The primary analysis should be the question that the study was
powered to test.

Estimating and testing the interaction effect can be performed by fitting a suitable
regression model. This model should include parameters for: (1) the marginal effect
of treatment arm; (2) marginal effect of biomarker status; and (3) interaction between
treatment arm and biomarker status. For instance, with a normally distributed
outcome, the suitable linear model will be:

Y_i = \alpha + \beta T_i + \gamma B_i + \delta T_i B_i + \epsilon_i

where Y_i is the outcome for individual i, T_i is the treatment allocation (1 if
experimental, 0 if control), B_i is the biomarker status (e.g., 1 if positive, 0 if
negative), and \epsilon_i is a normally distributed error term. The parameters in the
model, \alpha, \beta, \gamma, and \delta, represent (respectively) the intercept, the
marginal effect of treatment, the marginal effect of biomarker, and the interaction
between biomarker and treatment. By fitting this model and finding the maximum
likelihood estimates \hat{\alpha}, \hat{\beta}, \hat{\gamma}, \hat{\delta}, we can
estimate the effect of the treatment in the biomarker-negative group (\hat{\beta}),
the effect of the treatment in the biomarker-positive group (\hat{\beta} + \hat{\delta}),
and the interaction effect (\hat{\delta}). The standard errors of these quantities can
be extracted from the model and used to form confidence intervals and Wald tests of
the null hypothesis that the true parameter (or sum of parameters) is 0.
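A short sketch of fitting this model with statsmodels (simulated data; the variable names mirror the equation, and the simulated effect sizes are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 400
treat = rng.integers(0, 2, n)     # T_i: 1 experimental, 0 control
marker = rng.integers(0, 2, n)    # B_i: 1 biomarker-positive, 0 negative
# Simulated truth: beta = 0.2, gamma = 0.5, delta = 0.8
y = 1.0 + 0.2 * treat + 0.5 * marker + 0.8 * treat * marker + rng.normal(size=n)

df = pd.DataFrame({"y": y, "treat": treat, "marker": marker})
fit = smf.ols("y ~ treat + marker + treat:marker", data=df).fit()

beta_hat = fit.params["treat"]              # treatment effect, biomarker-negatives
delta_hat = fit.params["treat:marker"]      # interaction effect
print(beta_hat, beta_hat + delta_hat)       # effects in the two subgroups
print(fit.pvalues["treat:marker"])          # Wald test of delta = 0
```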
One consideration that influences interaction and subgroup testing is the presence of
measurement error on the biomarker (see section on analytical validity of the bio-
marker). In epidemiological studies, measurement error can cause bias issues as well
as loss of precision when conducting interaction tests (Carroll et al. 2006). This persists
in randomized trials, meaning that the estimated interaction effect is likely to be
attenuated (i.e., biased towards 0) in the presence of measurement error (Aiken and
West 1991). However, as the measurement error of the baseline biomarker status is
independent of arm assignment, there is no inflation in the type I error for the interaction
test (Pennello 2013). More advanced analysis methods can be used to correct for bias
caused by measurement error. It is possible when analyzing a biomarker-guided trial that
some high-quality information is available from previous studies assessing the bio-
marker that could be incorporated into the analysis. For example, if information about
the sensitivity and specificity of the biomarker is available, this could be used within a
Bayesian framework to correct the measurement error.
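The attenuation described above is easy to reproduce by simulation; the sketch below (an illustrative 85% sensitivity and specificity, not chapter values) fits the interaction model with the true and with a misclassified biomarker:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 20_000
treat = rng.integers(0, 2, n)
marker = rng.integers(0, 2, n)                   # true biomarker status
y = 0.2 * treat + 0.5 * marker + 0.8 * treat * marker + rng.normal(size=n)

# Non-differential misclassification: 85% sensitivity and specificity.
flip = rng.random(n) > 0.85
observed = np.where(flip, 1 - marker, marker)

for b, label in [(marker, "true biomarker"), (observed, "misclassified")]:
    df = pd.DataFrame({"y": y, "treat": treat, "b": b})
    fit = smf.ols("y ~ treat + b + treat:b", data=df).fit()
    # Interaction is ~0.8 with the true biomarker and visibly smaller
    # (attenuated toward 0) with the misclassified one.
    print(label, round(fit.params["treat:b"], 2))
```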

Summary and Conclusions

Biomarker-guided trial designs have developed considerably over the last 15 years.
The trial design features described in this chapter provide a summary of what is
available and the selection of particular design features will depend upon the research
question being investigated. We have aimed to provide a high-level summary of issues
to consider but other designs will no doubt emerge in the coming years. We would
recommend further reading of the extensive literature in this field (Renfro et al. 2016;
Freidlin and Korn 2010; Buyse et al. 2011; Mandrekar and Sargent 2009; Freidlin
2010; Stallard 2014; Tajik et al. 2013; Simon 2010; Gosho et al. 2012; European

Medicines Agency 2015; Eng 2014; Baker 2014; Freidlin et al. 2012; Freidlin and
Simon 2005; Wang et al. 2009; Karuri and Simon 2012; Jones and Holmgren 2007;
McShane et al. 2009; Parmar et al. 2008; Wason and Trippa 2014).

Acknowledgments This work is based on research arising from UK’s Medical Research Council
(MRC) grants MC_UU_00004/09, MC_UU_12023/29, and MC_UU_12023/20.

References
Aiken LS, West SG (1991) Multiple regression: testing and interpreting interactions. Sage,
Newbury Park
Antoniou M, Jorgensen AL, Kolamunnage-Dona R (2016) Biomarker-guided adaptive trial designs
in phase II and phase III: a methodological review. PLoS One 11(2):e0149803. https://doi.org/
10.1371/journal.pone.0149803
Antoniou M, Kolamunnage-Dona R, Jorgensen AL (2017) Biomarker-guided non-adaptive trial
designs in phase II and phase III: a methodological review. J Pers Med 7(1). https://doi.org/10.
3390/jpm7010001
Baker SG (2014) Biomarker evaluation in randomized trials: addressing different research ques-
tions. Stat Med 33:4139–4140
Buyse M, Sargent DJ, Grothey A, Matheson A, de Gramont A (2011) Integrating biomarkers in
clinical trials. Expert Rev Mol Diagn 11:171–182
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear
models: a modern perspective, 2nd edn. Chapman Hall/CRC, Boca Raton
Cook JA et al (2018) DELTA2 guidance on choosing the target difference and undertaking and
reporting the sample size calculation for a randomised controlled trial. BMJ 363. https://doi.org/
10.1136/bmj.k3750
Eng KH (2014) Randomized reverse marker strategy design for prospective biomarker validation.
Stat Med 33:3089–3099
European Medicines Agency. Reflection Paper on Methodological Issues Associated with
Pharmacogenomic Biomarkers in Relation to Clinical Development and Patient Selection.
Available online: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2011/07/WC500108672.pdf (Accessed on 10 Oct 2015)
FDA-NIH Working Group, Biomarkers, Endpoints and Other Tools (BEST) guidelines, updated
May 2018
Freidlin B (2010) Biomarker-adaptive clinical trial designs. Pharmacogenomics 11(12):1679–1682
Freidlin B, McShane LM, Korn EL (2010) Randomized clinical trials with biomarkers: design
issues. J Natl Cancer Inst 102:152–160
Freidlin B, Simon R (2005) Adaptive signature design: an adaptive clinical trial design for
generating and prospectively testing a gene expression signature for sensitive patients. Clin
Cancer Res 11(21):7872–7878
Freidlin B, McShane LM, Polley M-YC, Korn EL (2012) Randomized phase II trial designs with
biomarkers. J Clin Oncol 30:3304–3309
Freidlin B, Korn EL (2014) Biomarker enrichment strategies: matching trial design to biomarker
credentials. Nat Rev Clin Oncol 11:81–90
Gosho M, Nagashima K, Sato Y (2012) Study designs and statistical analyses for biomarker
research. Sensors 12:8966–8986
ICH GCP guidelines: https://www.ich.org/products/guidelines/efficacy/efficacy-single/article/integrated-addendum-good-clinical-practice.html
Jones CL, Holmgren E (2007) An adaptive Simon two-stage Design for Phase 2 studies of targeted
therapies. Contemp Clin Trials 28(5):654–661

Kaplan R, Maughan T, Crook A, Fisher D, Wilson R, Brown L, Parmar M (2013)
Evaluating many treatments and biomarkers in oncology: a new design. J Clin Oncol 31(36):
4562–4568
Karuri SW, Simon R (2012) A two-stage Bayesian design for co-development of new drugs and
companion diagnostics. Stat Med 31(10):901–914
Lung-Map: Master Protocol for Lung Cancer: https://www.cancer.gov/types/lung/research/lung-map (Accessed 16 Jan 2021)
Mandrekar SJ, Sargent DJ (2009) Clinical trial designs for predictive biomarker validation:
theoretical considerations and practical challenges. J Clin Oncol 27(24):4027–4034
McShane LM, Hunsberger S, Adjei AA (2009) Effective incorporation of biomarkers into phase II
trials. Clin Cancer Res 15(6):1898–1905
Parmar MKB, Barthel FMS, Sydes M, Langley R, Kaplan R, Eisenhauer E et al (2008) Speeding up
the evaluation of new agents in cancer. J Natl Cancer Inst 100(17):1204–1214
Pennello G (2013) Analytical and clinical evaluation of biomarkers assays: when are biomarkers
ready for prime time? Clin Trials 10:666–676
Renfro LA, An MW, Mandrekar SJ (2016) Clinical trial designs incorporating predictive bio-
markers. Cancer Treat Rev 43:74–82
Simon R (2010) Clinical trial designs for evaluating the medical utility of prognostic and predictive
biomarkers in oncology. Pers Med 7:33–47
Stallard N, Hamborg T, Parsons N, Friede T (2014) Adaptive designs for confirmatory clinical
trials with subgroup selection. J Biopharm Stat 24:168–187
Tajik P, Zwinderman AH, Mol BW, Bossuyt PM (2013) Trial designs for personalizing cancer care:
a systematic review and classification. Clin Cancer Res 19:4578–4588
Wang S-J, Hung HMJ, O’Neill RT (2009) Adaptive patient enrichment designs in therapeutic trials.
Biom J Biometrische Zeitschrift 51(2):358–374
Wason JMS, Trippa L (2014) A comparison of Bayesian adaptive randomization and multi-stage
designs for multi-arm clinical trials. Stat Med 33(13):2206–2221
63 Diagnostic Trials

Madhu Mazumdar, Xiaobo Zhong, and Bart Ferket

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1172
Diagnostic Trial Type I: Evaluating Diagnostic Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1173
Assessment of Diagnostic Accuracy of Single Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1173
Comparing Diagnostic Accuracy of Multiple Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1174
Definitions of Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175
Sensitivity and Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1176
Positive and Negative Predictive Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
Receiver Operating Characteristics Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
The Area Under the ROC Curve (AUC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1178
Sample Size Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1179
Reporting Diagnostic Trials for Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1180
Diagnostic Trial Type II: Diagnostic Randomized Clinical Trials for Assessment of Clinical
Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1182
Test-Treatment Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1182
Evaluating a Single Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1183
Randomized Controlled Trial (RCT) of Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1183
Random Disclosure Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1184
Evaluating Multiple Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1186
Explanatory Versus Pragmatic Approaches for Test-Treatment Trials . . . . . . . . . . . . . . . . . . . . . . . . 1191
Reporting of Test-Treatment Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1191
Statistical Analysis and Sample Size Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1192
Economic Analysis in Test-Treatment Trials and Decision Models . . . . . . . . . . . . . . . . . . . . . . . 1192
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1193
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194

M. Mazumdar (*)
Director of Institute for Healthcare Delivery Science, Mount Sinai Health System, NY, USA
e-mail: [email protected]
X. Zhong · B. Ferket
Icahn School of Medicine at Mount Sinai, New York, NY, USA
e-mail: [email protected]; [email protected]


Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194

Abstract
The term diagnostic trial is generally used in two different ways. A diagnostic
trial type I describes studies that evaluate accuracy of diagnostic tests in detecting
disease or its severity. Primary endpoints for these studies are generally test
accuracy outcomes measured in terms of sensitivity, specificity, positive predic-
tive value, negative predictive value, and area under the receiver operating
characteristics curves. Although establishing an accurate diagnosis or excluding
disease is a critical first step to manage a health problem, medical decision-
makers generally rely on a larger evidence base of empirical data that includes
how tests impact patient health outcomes, such as morbidity, mortality, functional
status, and quality of life. Therefore, the diagnostic trial type II evaluates the
value of test results to guide or determine treatment decisions within a broader
management strategy. Typically, differences in diagnostic accuracy result in
differences in delivery of treatment, and ultimately affect disease prognosis and
patient outcomes. As such, in the diagnostic trial type II, the downstream
consequences of tests followed by treatment decisions are evaluated together in
a joint construct. These diagnostic randomized clinical trials or test-treatment
trials are considered the gold standard of proof for the clinical effectiveness or
clinical utility of diagnostic tests. In this chapter, we define the variety of accuracy
measures used for assessing diagnostic tests, summarize guidance on sample size
calculation, and bring attention to the importance of more accurate reporting of
study results.

Keywords
Diagnostic trial type I · Diagnostic trial type II · Test-treatment trial · Sensitivity ·
Specificity · Positive predictive value · Negative predictive value · Area under the
receiver operating characteristics curves

Introduction

Diagnostic tests (such as genetic or imaging tests) are health interventions used to
determine the existence or severity of a disease (Sun et al. 2013; Huang et al. 2017).
The development and introduction process of diagnostic tests is equivalent to the
development of other health technologies such as therapeutic drugs, and similarly the
purpose of diagnostic trials can be categorized according to different research
development phases: varying from exploratory to evaluation of clinical impact
(Pepe 2003). The field of diagnostic trials has grown tremendously in the last
40 years (Zhou et al. 2009).
The term diagnostic trial is generally used in two distinct ways in the literature.
The first, here labeled as diagnostic trial type I, is used for studies covering earlier

development phases that merely evaluate the accuracy of diagnostic tests in
detecting disease or severity of disease (Colli et al. 2014). Primary endpoints for
these studies are generally test accuracy measured in terms of sensitivity, specificity,
positive predictive value, negative predictive value, and area under the receiver
operating characteristics curves. These terms will be explained in more detail
below in the section “Definitions of Accuracy.” The goal of type I diagnostic trials
in the early, exploratory phase is to investigate whether the diagnostic test seems
promising in distinguishing disease from non-disease and meets criteria for mini-
mally acceptable diagnostic accuracy. The study design used in such early phase is
typically the retrospective case-control study. This chapter predominantly focuses on
type I diagnostic trials in later development phases which aim to confirm and refine
diagnostic accuracy and compare tests (Colli et al. 2014; Gluud and Gluud 2005;
Begg and Greenes 1983; Sackett and Haynes 2002).
Although establishing an accurate diagnosis is a critical first step in the manage-
ment of a health problem, medical decision-making generally relies on whether there
is net health benefit to patients in terms of improvements in morbidity, mortality,
functional status, and quality of life. Yet, differences in diagnostic accuracy gener-
ally also lead to differences in delivery of treatment, and medical tests thus ultimately
affect disease prognosis and patient outcomes. As such, in the final development
phase of new medical tests, the downstream consequences of tests followed by
treatment decisions should be ideally evaluated together. Here, we explain the role
of a late-phase diagnostic trial type II. The study design for these type II diagnostic
trials is equivalent to phase III clinical trials for therapeutic interventions and the
purpose is to evaluate test results within the broader management strategy. Such
diagnostic randomized clinical trials or test-treatment trials are considered the gold
standard of proof for clinical effectiveness or clinical utility of diagnostic tests. In this
chapter, we define the variety of accuracy measures used for diagnostic test assess-
ment, summarize guidance on the sample size calculation for various designs, and
bring attention to the importance of more accurate reporting of study designs and
results.

Diagnostic Trial Type I: Evaluating Diagnostic Accuracy

Assessment of Diagnostic Accuracy of Single Test

Establishing an accurate diagnosis is a critical first step in the management of a


health problem; type I diagnostic trials or diagnostic accuracy studies attempt to
answer this question. To design a type I diagnostic trial, investigators must recruit
subjects with and without the index disease and obtain valid and precise information
about the true disease status. Multiple terms have been used to describe assessment
of the true disease status, such as “gold standard,” “standard of reference,” and
“reference standard.” In this chapter, we use the term “gold standard.”
When using a case-control design, study subjects are enrolled in the trial based on
their disease status and test results are assessed retrospectively. The alternative is to

assess both test results and disease status after enrollment within a cohort study. Both
study designs, the case-control and the cohort design, are subject to a variety of
biases. The two most frequently encountered forms of bias are spectrum bias (in
case-control studies) and verification bias (in cohort studies). Spectrum bias in case-
control studies occurs when subjects with more severe disease than generally
observed are selected as cases and healthier subjects are selected as controls.
order to avoid an overestimation of diagnostic accuracy that results from such an
induced difference in case-mix between the study and target population, both cases
and controls should be randomly selected. Verification bias in cohort studies occurs
when the likelihood of obtaining true disease status depends on the results of the
diagnostic test. For example, invasive or expensive gold standard tests are oftentimes
solely or more frequently performed in subjects with positive test results. The
problem of verification bias in type I diagnostic trials is equivalent to missing
outcome data in cohort studies looking at exposure-outcome relationships. As
such, statistical inference about diagnostic accuracy would still be possible but
oftentimes requires the missingness at random (MAR) assumption. Solutions for
verification bias are available under MAR using, for example, the inverse of the
propensity of verification conditional on test results and other predictors of verifi-
cation (de Groot et al. 2011; Braga et al. 2012; Bruni et al. 2014; Kosinski and
Barnhart 2003). Verification bias often happens in studies in which it is not feasible
to obtain results of the gold standard in subjects thought to be at low risk.
Thompson et al. (2005) studied the operating characteristics of prostate-specific
antigen (PSA) in a setting where prostate biopsy, the gold standard, was recommended
only for men with PSA greater than 4.0 ng/ml or abnormal rectal examination results.
Harel and Zhou (2006) found that multiple imputation could help correct the
verification bias by assuming that the missingness of prostate biopsy results does not
depend on the true prostate cancer status itself, but may depend on the values of PSA,
rectal examination results, and other observed variables. As such, under the MAR
assumption, we can still obtain asymptotically unbiased estimates of diagnostic test
performance (e.g., sensitivity and specificity) with full information maximum likelihood.
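
To make the correction concrete, the following is a minimal sketch of the inverse-probability-of-verification weighting idea described above, assuming MAR. The simulated data, variable names, and the logistic verification model are all hypothetical illustration choices, not the approach of any specific study cited here.

```python
# Sketch: correcting verification bias by weighting verified subjects with the
# inverse of their estimated probability of verification (valid under MAR).
# The simulated data and the logistic verification model are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
test = rng.binomial(1, 0.3, n)                  # index test result (1 = positive)
age = rng.normal(60, 10, n)                     # an observed predictor of verification
disease = rng.binomial(1, 0.15 + 0.5 * test)    # true status (known only in simulation)

# Gold standard is performed more often after a positive test: verification bias
logit = -2 + 2.5 * test + 0.01 * (age - 60)
verified = rng.binomial(1, 1 / (1 + np.exp(-logit))).astype(bool)

# Step 1: model P(verified | test result, predictors); weights = 1/propensity
X = sm.add_constant(np.column_stack([test, age]))
propensity = sm.GLM(verified.astype(int), X,
                    family=sm.families.Binomial()).fit().predict(X)
w = 1 / propensity[verified]

# Step 2: weighted sensitivity/specificity among verified subjects only
d, t = disease[verified].astype(bool), test[verified].astype(bool)
print("IPW sensitivity:", np.sum(w[d & t]) / np.sum(w[d]))
print("IPW specificity:", np.sum(w[~d & ~t]) / np.sum(w[~d]))
# Naive complete-case estimates, biased because verification depends on the test
print("Naive sensitivity:", t[d].mean(), "Naive specificity:", (~t[~d]).mean())
```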

Comparing Diagnostic Accuracy of Multiple Tests

Often a diagnostic trial is initiated to compare a promising new diagnostic procedure
with an existing one, under the hypothesis that the new procedure is more accurate
and can replace the standard procedure. Study designs used for such
assessment of comparative diagnostic accuracy are the randomized controlled trial
(RCT) and the paired trial. The distinction lies in the methods of assigning the
diagnostic procedures to trial participants.

Randomized Controlled Trial (RCT)


RCTs are recommended to avoid various biases in the assessment of diagnostic
accuracy when comparing an experimental diagnostic procedure with the reference
procedure. In the uncontrolled setting, patients may undergo one of the two pro-
cedures due to a variety of reasons, such as lower cost, higher comorbidity, or
hospital policy. These reasons might not be documented if the study is of observa-
tional nature, hampering the necessary adjustment for confounding. Yet, even if a
diagnostic test is considered accurate in such a study after controlling for
confounding, it will not be clear whether the result is influenced by unobserved
confounders. In RCTs, patients undergo diagnostic procedures (i.e., experimental vs.
reference) according to the results of randomization. The randomization mechanism
rules out the potential impact of unmeasured confounders and thus helps to protect
the conclusion from biases (Braga et al. 2012).
For example, cervical cancer is one of the leading causes of cancer-related mortality
in sub-Saharan Africa. Visual inspection with acetic acid (VIA) is the standard test in
this setting, but visual inspection with Lugol’s iodine (VILI) is also a commonly
recommended diagnostic technique for detecting cervical cancer. Huchko et al. (2015)
conducted a randomized clinical trial to compare the diagnostic accuracy of VILI with
VIA among HIV-infected women in western Kenya. The trial
enrolled 654 women, who were randomized to undergo either VILI or VIA with
colposcopy (1:1 ratio). Any lesion suspicious for cervical intraepithelial neoplasia 2 or
greater (CIN2+) was then biopsied as the gold standard for determining true disease
status. To maximize the statistical power in a two-arm RCT, the randomization ratio is
usually set as 1:1, so the numbers of patients undergoing different procedures are
equal. However, ratios other than 1:1 can also be used and may be preferred for
practical reasons, such as reducing costs or enhancing the feasibility of recruitment
into or execution of an RCT.

Paired Trial
When diagnostic tests do not interfere with each other and can be done in the same
study subject, a trial with a paired design might provide a more efficient alternative.
For example, Ahmed et al. (2017) reported a diagnostic trial with a paired design
comparing two imaging tests for prostate cancer. Men with high
serum prostate-specific antigen (PSA) usually undergo transrectal ultrasound-guided
prostate biopsy (TRUS-biopsy), which can cause side effects such as bleeding, pain,
and infection. Multi-parametric magnetic resonance imaging (MP-MRI) might allow
avoiding these side effects and improve diagnostic accuracy. To test this idea, 576
men were enrolled and underwent an MP-MRI followed by a TRUS-biopsy. At the
end of the study, a template prostate mapping (TPM) biopsy was conducted for each
patient, and the result was adopted as true disease status (gold standard). Diagnostic
comparison was made on the paired results for the competing tests.
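
In such a paired design, the sensitivities of the two tests can be compared on the same diseased subjects using the discordant pairs, for example with McNemar's test. The sketch below uses hypothetical counts, not the actual PROMIS trial data.

```python
# Comparing paired sensitivities with McNemar's exact test on discordant pairs.
# The 2x2 counts among gold-standard-positive subjects are hypothetical.
from statsmodels.stats.contingency_tables import mcnemar

#                 Test B positive   Test B negative
# Test A positive       150               12
# Test A negative        40               28
table = [[150, 12],
         [40, 28]]
n_diseased = 150 + 12 + 40 + 28

sens_a = (150 + 12) / n_diseased        # sensitivity of test A ~ 0.70
sens_b = (150 + 40) / n_diseased        # sensitivity of test B ~ 0.83
result = mcnemar(table, exact=True)     # exact binomial test on the 12 vs 40 split
print(f"Sens A = {sens_a:.2f}, Sens B = {sens_b:.2f}, p = {result.pvalue:.4f}")
```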

Definitions of Accuracy

A binary outcome, defined as presence (positive) or absence (negative) of a certain
disease, is frequently used as the gold standard. The data can typically be presented
in a 2 × 2 table (Table 1), with columns representing the true disease status, usually
defined by the gold standard (e.g., TPM biopsy), and rows indicating the results of
either the experimental or the reference procedure (e.g., MP-MRI or TRUS-biopsy). A
diagnostic test that leads to a high proportion of positive results among patients
with true disease, and a high proportion of negative results among patients without
true disease, has good diagnostic accuracy.

Table 1 Underlying statistics for evaluation of a diagnostic test with binary outcomes

                            True disease (gold standard)
                            Disease           No disease
  Test results  Positive    True positive     False positive
                Negative    False negative    True negative

Sensitivity and Specificity

A pair of measures, sensitivity and specificity, is commonly used in discussing
diagnostic trials. Sensitivity answers the question “How likely is it that a patient with
the true disease can be correctly identified as having a positive result under a
diagnostic procedure?” The value of sensitivity varies from 0 to 1, with one
indicating a perfect test. A diagnostic procedure with high sensitivity is important
for identifying a serious and treatable disease. However, having a high sensitivity is
not always sufficient for a diagnostic procedure to be clinically useful, because
calculation of sensitivity focuses only on patients with the true disease. A test
with high sensitivity often achieves this at the cost of a high proportion of positive
results in patients without the true disease. Thus, achieving balance also requires
considering specificity, which answers the question “How likely is it that a patient without the
true disease can be correctly identified as negative under a diagnostic procedure?” In
a disease for which treatment is burdensome and costly, incorrectly claiming that
someone has the disease may lead to unnecessary treatment. Similar to sensitivity,
the value of specificity varies from 0 to 1; a procedure with specificity equal to 1
correctly identifies all patients without the true disease.
Table 2 gives the diagnostic results of MP-MRI. All 576 men in the trial
underwent MP-MRI; 418 were diagnosed as positive for prostate cancer and 158
were diagnosed as negative. Based on the gold standard, there were 230 patients with
true prostate cancer; thus, the sensitivity was 0.93 (= 213/230). On the other hand,
there were 346 patients without true prostate cancer; thus, the specificity was 0.41
(= 141/346).
Ideally, a perfect diagnostic procedure would have both sensitivity and specificity
equal to 1, such that all patients with and without the true disease can be correctly
identified. However, in practice, a clinician often needs to choose between a proce-
dure with high sensitivity and low specificity, versus one with low sensitivity and
high specificity. Values of sensitivity and specificity are not directly affected by the
prevalence of the target disease (Table 2).
Table 2 Diagnostic results of MP-MRI and impact of change in disease prevalence: PROMIS trial

  (A) Original MP-MRI with 40% prevalence       (B) Prevalence rate increased from 40% to 60%
            Disease   No disease   Total                  Disease   No disease   Total
  Negative       17          141     158        Negative       24           94     118
  Positive      213          205     418        Positive      322          136     458
  Total         230          346     576        Total         346          230     576

Positive and Negative Predictive Values

Two other measures of accuracy commonly used in diagnostic trials are positive and
negative predictive values. Positive predictive value (PPV) answers the question
“How likely is it that a patient has the true disease given a positive result?” Negative
predictive value (NPV) answers the question “How likely is it that a patient does not
have the disease given a negative result?” In the example, 418 and 158 patients were
diagnosed as positive and negative, respectively, based on MP-MRI results. Thus,
the PPV was 0.51 (= 213/418) and the NPV was 0.89 (= 141/158). Unlike sensitivity
and specificity, PPV and NPV are affected by disease prevalence. It is more likely to
find positive test results in a high-prevalence population compared to a low-preva-
lence population (Trevethan 2019). If the disease prevalence of prostate cancer
increases from 40% to 60%, for the same diagnostic procedure with a sensitivity
of 0.93 and a specificity of 0.41, the PPV would increase from 0.51 to 0.70 (= 322/
458) and the NPV would decrease from 0.89 to 0.80 (= 94/118). Conversely,
when disease prevalence decreases, the PPV of a diagnostic procedure will decrease
and NPV will increase, while the sensitivity and specificity remain constant (Table
2). PPV and NPV are important because posttest probabilities eventually determine
the clinical impact of subsequent treatment and the test-treatment strategy as a
whole. When the PPV of a diagnostic procedure is high, more benefit can be
expected from an efficacious treatment, whereas when the NPV is high, less harm
can be expected from foregoing treatment.
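
The following short sketch reproduces these calculations from the Table 2 counts and shows, via Bayes' theorem, how PPV and NPV move with prevalence while sensitivity and specificity stay fixed.

```python
# Accuracy measures from the MP-MRI counts in Table 2(A), and the effect of
# prevalence on the predictive values (sensitivity/specificity unchanged).
tp, fp, fn, tn = 213, 205, 17, 141

sens = tp / (tp + fn)     # 213/230 ~ 0.93
spec = tn / (tn + fp)     # 141/346 ~ 0.41
ppv = tp / (tp + fp)      # 213/418 ~ 0.51
npv = tn / (tn + fn)      # 141/158 ~ 0.89

def predictive_values(sens, spec, prev):
    """PPV and NPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

for prev in (0.40, 0.60):
    p, n = predictive_values(sens, spec, prev)
    print(f"prevalence {prev:.0%}: PPV = {p:.2f}, NPV = {n:.2f}")
# At 40% prevalence PPV ~ 0.51 and NPV ~ 0.89; at 60% PPV rises to ~ 0.70 while
# NPV falls to ~ 0.79 (0.80 from the rounded counts in Table 2(B)).
```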

Receiver Operating Characteristics Curve

When the diagnostic test has a continuous scale, researchers may face multiple
possible cutoff points, and each cutoff point leads to a pair of sensitivity and specificity
values. For example, Park et al. (2004) reported a study in which 70 patients with
solitary pulmonary nodules underwent plain chest radiography to determine whether
the nodules were benign or malignant. Chest radiographs were interpreted according
to a five-point scale: 1-definitely benign, 2-probably benign, 3-possibly malignant, 4-
probably malignant, and 5-definitely malignant. Thus, a positive result in this study
was based on four possible cutoff points: 2, 3, 4, and 5. Consequently, we can define
four diagnostic tests, each corresponding to a particular cutoff point and each yielding
a pair of sensitivity and specificity values. A diagnostic test with a lower cutoff point
leads to more patients with true disease diagnosed as positive (i.e., higher sensitivity)
and fewer patients without true disease diagnosed as negative (i.e., lower specificity).
Therefore, when the cutoff point moves from 5 to 2, the sensitivity of the diagnostic
test will increase and the specificity will decrease, and vice versa. Note that sometimes
a low score relates to a positive test, for example, lower cycle threshold (Ct) values in
reverse transcription polymerase chain reaction (RT-PCR) tests.

Fig. 1 Operating points, empirical and smooth ROC curves in the radiograph study
A receiver operating characteristics (ROC) curve is an effective tool for sum-
marizing the accuracy of a diagnostic procedure with a continuous measure when
there is the potential for multiple cutoff points. It is a two-dimensional plot of
sensitivity (true positive rate, TPR) against 1 − specificity (false positive rate, FPR)
that shows how the TPR varies with the FPR across all possible cutoff points. ROC
curves can be drawn using either parametric or empirical methods (Hajian-Tilaki
et al. 1997). To draw an empirical ROC curve for the data in the above example, the
four pairs of sensitivity and 1 − specificity values are plotted as discrete points,
called operating points, as shown in Fig. 1a. These operating points can then be
connected with the two endpoints (0, 0) and (1, 1) (Fig. 1b) by assuming a linear
relationship between sensitivity and 1 − specificity between two neighboring operat-
ing points. A smooth ROC curve can be fit using the parametric method by assuming
the diagnostic accuracy measurement follows a particular probabilistic distribution
(Fig. 1c). Distributions used for ROC curves include the binomial, Poisson, chi-squared,
gamma, and logistic distributions (Pepe 2003; Ogilvie and Douglas Creelman
1968; Swets 1986; Walsh 1997). Faraggi and Reiser (2002) provide an excellent
review of these details.
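
A brief sketch of the empirical construction just described, using the four cutoffs of the five-point radiograph scale. The per-score counts below are hypothetical, since the split of the 70 patients is not reported here.

```python
# Empirical ROC operating points for an ordinal 5-point scale, as in the
# radiograph example. The counts per score are hypothetical illustrations.
import numpy as np

# counts of patients at scores 1..5, by true status (from the gold standard)
malignant = np.array([2, 2, 6, 10, 16])   # n = 36 with disease
benign    = np.array([14, 10, 6, 3, 1])   # n = 34 without disease

points = []
for cutoff in (2, 3, 4, 5):               # positive if score >= cutoff
    tpr = malignant[cutoff - 1:].sum() / malignant.sum()   # sensitivity
    fpr = benign[cutoff - 1:].sum() / benign.sum()         # 1 - specificity
    points.append((fpr, tpr))

# Add the trivial endpoints (0, 0) and (1, 1) and sort to form the empirical curve
points = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```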

The Area Under the ROC Curve (AUC)

In clinical trials aiming to evaluate the overall performance of diagnostic procedures
that classify patients as with or without the disease based on a particular threshold
of a continuous measure, it is common to use the ROC curve as the primary outcome
due to its advantage in summarizing the variation of TPR and FPR across different
possible cutoff points. The accuracy of a diagnostic procedure in the ROC context is
widely measured by area under the ROC curve (AUC). The AUC provides the
average value of TPR given all the possible values of FPR. Considering both the
ranges of TPR and FPR are (0, 1), the AUC can take any value between 0 and 1. The
practical value of the AUC is reflected by a value that ranges from 0.5 (area under the
chance diagonal) to 1 (area under a ROC with perfect diagnostic ability). A higher
value of the AUC indicates better overall diagnostic performance. As with TPR and
FPR, the AUC is independent of disease prevalence. Considering that the AUC of a
diagnostic procedure from a trial is estimated based on a random sample, appropriate
statistical inference is necessary for making a conclusion, and the uncertainty around
the AUC is typically expressed with a confidence interval at a chosen level (e.g., 95%).
We can estimate the AUC using parametric (McClish 1989; Metz 1978) and
empirical methods (McClish 1989; Metz 1978; Obuchowski and Bullen 2018). Zhou
et al. (2009) reviewed the performance of both parametric and empirical estimators
of the AUC. When diagnostic procedures are evaluated based on continuous (e.g.,
biomarker) or quasi-continuous (e.g., a percent-confidence scale with range 0–100%)
measurements, both empirical and parametric estimators perform well, and the bias
is negligible. For discrete outcomes (e.g., the five-point radiography scale in the study
above), the empirical method sometimes underestimates the AUC. On the other hand,
parametric methods rely on distributional assumptions and sometimes perform
poorly in small diagnostic trials.
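
Continuing the hypothetical counts from the ROC sketch above, the empirical AUC can be computed either as the trapezoidal area under the operating points or, equivalently, as the Mann-Whitney probability that a diseased patient scores higher than a non-diseased one, with ties counted as one half.

```python
# Empirical AUC two ways: trapezoidal area under the operating points, and the
# equivalent Mann-Whitney statistic on the raw scores (ties counted as 1/2).
# Reuses the hypothetical `points`, `malignant`, `benign` from the sketch above.
import numpy as np

fprs, tprs = zip(*points)                      # sorted by increasing FPR
auc_trap = np.trapz(tprs, fprs)                # trapezoidal rule

scores_d  = np.repeat(np.arange(1, 6), malignant)   # scores of diseased patients
scores_nd = np.repeat(np.arange(1, 6), benign)      # scores of non-diseased patients
greater = (scores_d[:, None] > scores_nd[None, :]).mean()
ties    = (scores_d[:, None] == scores_nd[None, :]).mean()
auc_mw = greater + 0.5 * ties

print(f"AUC (trapezoid) = {auc_trap:.3f}, AUC (Mann-Whitney) = {auc_mw:.3f}")
```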

Sample Size Calculation

Determination of sample size plays an important role in designing a trial for
assessment of diagnostic accuracy. Too small a sample may lead to imprecise esti-
mation (i.e., a wide confidence interval), whereas obtaining a large sample is
costly and could require that patients undergo potentially unnecessary subsequent
testing with unknown risk from subsequent treatments (e.g., adverse events).
Depending on the study setting and analytical plan, sample size calculations can be
approached using two concepts: estimation and comparison. To determine the sample
size for a diagnostic trial that seeks to estimate the sensitivity (or specificity) of a
single diagnostic test, four essential elements must be considered: (1) a pre-
determined value of sensitivity; (2) the confidence level (1 − α); (3) the precision
of estimation, or the maximal marginal error, which is the maximum allowed difference
between the estimated sensitivity and the true value; and (4) disease prevalence. The
sample size calculation of a diagnostic trial based on a comparison of test accuracies
is hypothesis driven. It is applied to either a single-arm trial for comparing the
accuracy of a diagnostic procedure with a historical control, or a randomized trial for
comparing the accuracy of the experimental procedure versus an appropriate
control. In this situation, the targeted statistical power (i.e., 1 − type II error) and the
significance level of the hypothesis test (i.e., type I error) also need to be specified.
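
As an illustration of the estimation-based calculation, the sketch below turns the four elements listed above into a sample size using the usual normal-approximation formula. The numerical inputs are assumptions, and validated software such as PASS should be used in practice.

```python
# Sample size to estimate sensitivity with a given precision, combining the four
# elements above: assumed sensitivity, confidence level, maximal marginal error,
# and disease prevalence. Normal-approximation sketch; all inputs are assumed.
import math
from scipy.stats import norm

def n_for_sensitivity(sens=0.90, alpha=0.05, d=0.05, prevalence=0.40):
    z = norm.ppf(1 - alpha / 2)
    n_diseased = z**2 * sens * (1 - sens) / d**2   # diseased subjects required
    return math.ceil(n_diseased / prevalence)      # total enrollment needed

# ~346 subjects to estimate a sensitivity near 0.90 within +/- 0.05,
# with 95% confidence, when 40% of enrollees have the disease
print(n_for_sensitivity())
```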
Simel et al. (1991) provide a sample size formula based on the likelihood ratio,
defined as the ratio between sensitivity and (1 − specificity). Beam (1992) provided a
sample size formula that can be applied to trials with a paired design. A variety of
formulas for different settings and with improved efficiency have been provided by
many authors (Flahault et al. 2005; Fosgate 2009; Kumar and Indrayan 2011; Li and
Fine 2004; Liu et al. 2005; Obuchowski 1998; Steinberg et al. 2009). The statistical
software PASS implements most of these methods and is easy to use. Table 3
summarizes these procedures and the design corresponding to each procedure.

Table 3 Statistical software for sample size calculation under a specific design

  Software procedure                               Design of diagnostic trial
  PASS: Proportions/test for one sample            Observational study design for comparing
  sensitivity and specificity                      sensitivity and specificity of a new diagnostic
                                                   procedure to an existing standard procedure
  PASS: Proportions/test for paired                Matched-pair design for comparing sensitivities/
  sensitivities and specificities                  specificities of two diagnostic procedures
  PASS: Proportions/test for two                   Observational study design or RCT for comparing
  independent sensitivities and specificities      sensitivities/specificities of two diagnostic
                                                   procedures between two independent samples
  PASS: Proportion/confidence intervals for        Observational study design for estimating a single
  one-sample sensitivity                           sensitivity using confidence intervals
  PASS: Proportion/confidence intervals for        Observational study design for estimating a single
  one-sample specificity                           specificity confidence interval
  PASS: Proportion/confidence intervals for        Observational study design for estimating both
  one-sample sensitivity and specificity           sensitivity and specificity confidence intervals,
                                                   based on a specified sensitivity and specificity,
                                                   interval width, confidence level, and prevalence
  PASS: AUC-based test for one ROC curve           Observational study design for comparing the ROC
                                                   curve of a new diagnostic procedure to a standard
                                                   procedure
  PASS: ROC/test for two ROC curves                Matched-pair design for comparing the AUCs of
                                                   two diagnostic procedures
  PASS: ROC/confidence intervals for the           Observational study design for estimating a
  AUC                                              specified width of a confidence interval for AUC

  AUC area under the curve, RCT randomized clinical trial, ROC receiver operating characteristics.
  Observational study includes cohort and case-control designs

Reporting Diagnostic Trials for Accuracy

Several surveys have shown that studies evaluating diagnostic accuracy often fail to
transparently describe core elements of design and analysis, including how the
cohort was selected and what design parameters the sample size was based on, as
well as to comprehensively describe the study findings and how they will impact
clinical practice (Korevaar et al. 2014, 2015; Lijmer et al. 1999). They also find that
the recommendations from these studies are often overly generous and
optimistic. The Standards for Reporting of Diagnostic Accuracy Studies (STARD)
statement was developed to facilitate complete and transparent reporting. STARD
recommends a checklist of items that should be reported for diagnostic accuracy
studies. Although quite a few journals have adopted STARD into their instructions to
authors, uptake remains low (Korevaar et al. 2014).
Failures in reporting fall in many categories and can jeopardize the decision-
making on which diagnostic test should be used in clinical practice. For example, Hu
(2016) noted that in their systematic review of diagnostic studies of osteopontin for
ovarian cancer, few publications reported research results in accordance with
the STARD guideline. This hindered their ability to estimate the
risk of bias, and applicability concerns of the included studies remained unresolved.
They furthermore found that in some of the studies, the inclusion and exclusion
criteria for subject enrollment were not reported, and only the disease spectrum
and sample size were reported (Hu et al. 2015). Therefore, the authors performing the
systematic review did not know whether the prevalence of the target disease in the
study cohorts was consistent with the real world. Knowledge of the prevalence of the
target disease is vital because it can greatly affect estimates of test performance in the
presence of spectrum bias (Whiting et al. 2004). In summary, without knowledge of
inclusion/exclusion criteria and prevalence, it is difficult to decide under which
condition the diagnostic test should be adopted.
Many other types of reporting failures have been found that may prevent the
appropriate translation of research into clinical practice, and these also apply to
diagnostic research. How to improve reporting has, however, remained a vexing
problem despite many efforts. Reporting guidelines were first published simulta-
neously in a number of high-impact journals in the hope of wide adoption. In addition,
interventions have been attempted, including convincing journal publishers to make
the use of reporting guidelines a requirement for authors and reviewers, educat-
ing them on how to use reporting guidelines, and training them on how to evaluate
the quality of a manuscript through guideline-based scoring. However, most evalu-
ations show that a continued effort for further improvement is needed. One of the
most robust efforts so far is the development of the Enhancing the QUAlity and
Transparency Of health Research (EQUATOR) Network. The EQUATOR network,
established in 2006, is a global initiative that has brought together researchers and
journal editors with the aim of achieving accurate, complete, and transparent
reporting of health research studies to support research reproducibility and useful-
ness (EQUATOR Network 2017). Their work aims to increase the value of health
research and minimize avoidable waste of financial and human investments in health
research projects. The work of the EQUATOR investigators has already been
bestowed with an award from the Council of Science Editors for improvement of
scientific communication through the pursuit of high standards in reporting (Majeed
and Amir 2018). This kind of high-level recognition, regular publication of com-
mentaries on this topic, and making reporting guidelines part of regular training in
the medical school curriculum are bound to make a favorable impact.

Diagnostic Trial Type II: Diagnostic Randomized Clinical Trials for Assessment of Clinical Effectiveness

There are circumstances when diagnostic accuracy results from type I diagnostic
trials are considered sufficient to extrapolate about net health benefits. Yet, further
empirical evidence is often needed about how the test affects longer-term outcomes.
One way to better evaluate the potential utility of diagnostic tests is to investigate
how well test results match with future patient outcomes by determining the prog-
nostic value and/or the ability to modify treatment effects (predictive value). The
latter should be generally assessed in a randomized setting with testing performed at
baseline in all patients prior to the randomization to treatment(s) (Lijmer and
Bossuyt 2009). However, diagnostic tests are seldom used on their own, independent
of treatment. Test results generally guide or determine treatment decisions as part of
a broader management strategy. Thus, differences in diagnostic accuracy will likely
result in differences in delivery of treatment, which will ultimately affect disease
prognosis and patient outcomes. As such, the downstream consequences of tests
followed by treatment decisions should be evaluated together. The diagnostic
randomized clinical trial or the so-called test-treatment trial design is considered
the gold standard to provide such proof for the clinical effectiveness or clinical utility
of diagnostic tests (Ferrante di Ruffano et al. 2012, 2017). Sometimes consequences
beyond those for the health of patients need to be considered as well, including those
for use of resources in the healthcare sector and/or society. The goal is then for the
test-treatment trial to also provide evidence for changes in efficiency of care by
evaluation of economic outcomes, i.e., cost-effectiveness or cost-utility analysis. In
this chapter, a clinical perspective is taken for discussing the concepts of designing
test-treatment trials, although the broader healthcare sector and societal perspectives
are briefly discussed as well.

Test-Treatment Trial Designs

Optimizing the design of test-treatment trials requires a good prior understanding of
how the application of diagnostic tests may change outcomes in the target patient
population by conceptualizing potential underlying pathways. Figure 2 illustrates a
pathway showing how diagnostic tests in comparison with comparator strategies
(strategies of alternative diagnostic tests, recommended care or the current practice)
may affect health outcomes (Ferrante di Ruffano et al. 2012).
For an initial conceptualization of a test-treatment trial, researchers should start
by defining which alternative diagnostic and management pathways need to be
compared, while specifying where differences can be expected. These steps will
allow them to select mechanisms underlying the outcomes of interest (Mustafa et al.
2017). A helpful method to conceptualize a trial is using care pathway algorithms or
flow diagrams based on the components depicted in Fig. 2. Different test-treatment
applications can be defined depending on the medical decision problem:
single, replacement, triage, add-on, and parallel or combined testing. The optimal
design of the trial depends on the application type and the certainty about the
diagnostic accuracy of the test(s), as well as the added value of disclosing test
results, and treatment effectiveness.

Fig. 2 Simplified test-treatment pathway showing each component of a patient's management that can affect health outcomes (Ferrante di Ruffano et al. 2012)

Evaluating a Single Test

Randomized Controlled Trial (RCT) of Testing

This design can be used for comparing health outcomes following a new or
established diagnostic test to health outcomes from a comparator no test strategy
(in the most extreme case defined as treat all or treat none). As the comparator
strategy does not rely on testing, the randomization concerns the decision whether to
perform the testing or not. This trial can answer the question whether it would be
beneficial to avoid treatment in those who test negative, when the comparator is a
treat all strategy, or to offer treatment to those who test positive, when the compar-
ator is a treat none strategy. This scenario is illustrated in Fig. 3. The no test strategy
is oftentimes defined as usual care in which diagnostic and therapeutic interventions
following randomization are not protocolized by the investigators. For example, a
randomized trial was conducted in low-risk pregnant women to evaluate whether
routine ultrasonography in the third trimester improves severe adverse perinatal
outcomes compared with usual care (Henrichs et al. 2019). Routine ultrasonography
was associated with a higher antenatal detection of small for gestational age fetuses,
higher incidence of induction of labor, and lower incidence of augmentation of labor.
However, it did not significantly improve severe adverse perinatal and maternal
peripartum outcomes.

Random Disclosure Trial

Sometimes the medical decision problem pertains to understanding whether com-
munication of test results would affect treatment decisions and subsequent health
outcomes. If there are no ethical constraints about delaying the communication of test
results, the randomization point can occur after performance of the diagnostic test
(Fig. 4). Patients are thus randomized to disclosure of test results versus no (or
delayed) disclosure. In this trial design, randomization can be stratified by test results
to ensure more balanced groups. The random disclosure design also offers the option of
studying the prognostic value of the diagnostic test by statistical modeling of patient
outcomes observed conditional on test results in the non-disclosed arm. For exam-
ple, Modic et al. used a random disclosure design to investigate the effect of
disclosing imaging findings on outcome in patients with acute low back pain or
radiculopathy, as well as to determine the prognostic role of MR imaging for
physical disability due to low back pain and patient satisfaction (Modic et al.

2005).

Fig. 3 Trial of testing with treatment as comparator (Lijmer and Bossuyt 2009)

Fig. 4 Random disclosure trial, with treatment as comparator (Lijmer and Bossuyt 2009)

Patients underwent MR imaging at presentation and were then randomized to
either an early information arm (results provided to referring physician and patient
within 48 hours) or a blinded arm (both patient and physician were blinded to MR
imaging results). Improvement in function and other patient-reported outcomes at
6 weeks were similar in unblinded and blinded patients. Multivariable modeling of
imaging results did not reveal any relationship between herniation type, size, and
behavior over time with physical disability and patient satisfaction.

Evaluating Multiple Tests

For many medical decision problems, the question is not whether to test or not to
test, but which test or which combination of tests to use. Conceptually, the trial
design concerning such research questions is equivalent to designing a trial for a
single test as outlined above, with some modifications.

Comparative Test RCT


When the test-treatment strategies concerning two or more tests are intrinsically
different (e.g., because delivery of the tests and/or the process of achieving test
results vary), the optimal randomization point is when the decision is made about
which test to use (Fig. 5). An example of such a head-to-head or two-arm compar-
ison is the Prospective Multicenter Imaging Study for Evaluation of Chest Pain
(PROMISE) trial (Douglas et al. 2015). In this trial, patients with symptoms of
coronary heart disease were randomized to an initial strategy of coronary computed
tomographic angiography (CTA) or a diagnostic strategy using functional testing
(exercise electrocardiography, nuclear stress testing, or stress echocardiography).
Although the CTA strategy was associated with a lower incidence of invasive
catheterization showing no obstructive coronary artery disease at 90 days, the
composite endpoint of mortality and coronary outcomes did not differ between
groups over a median follow-up of 2 years. However, the PROMISE trial was not
designed to incorporate subsequent additional diagnostic tests or revascularization
procedures.

Discordant Test Results RCT


When two or more competing, mutually exclusive test-treatment strategies are
compared, usually one test is considered standard practice. On some occasions,
the competing tests have a similar delivery and process for generation of test results.
It can then be assumed that when the tests being compared have the same result
(either all positive or all negative), the subsequent management should be the same
and expected outcomes would be identical. When these conditions are satisfied and it
is feasible to perform both tests in all patients, the discordant test results trial design
or paired design (Fig. 6), in which only the patients with discordant test results are
randomized to treatment(s) and followed up, is the most efficient design (Lijmer and
Bossuyt 2009). This design, however, only allows for estimation of an absolute risk
difference between the test-treatment strategies and not a relative risk measure, and it
is rarely implemented in practice. Hooper et al. (2013), however, describe a good
decision analytic example based on the MINDACT trial, a trial comparing a 70-gene
expression profile with standard clinicopathologic criteria for determining which
patients with node-negative breast cancer should receive adjuvant chemotherapy
(Cardoso et al. 2007). A discordant test results trial was found to be about four times
more efficient than the conventional head-to-head trial in terms of required sample
size, and only around a third of the sample needed to be followed up, thus rendering
follow-up 12 times more efficient.

Fig. 5 Trial comparing two different tests (Lijmer and Bossuyt 2009)

Fig. 6 Discordant test results trial

Random Disclosure Trial


As with the RCT design investigating a single test-treatment strategy versus no
testing, the random disclosure trial explained above and depicted in Fig. 4 can be
extended to multiple tests as well. However, in this
situation, both diagnostic tests will need to be performed in all patients. In the next
step, patients are randomized to disclosure of one test only in each arm. The
subsequent management or treatment is solely done in that arm based on the results
of the disclosed test. The flowchart of such a trial with randomization after testing is
depicted in Fig. 7 for two tests. Note that randomization can also be conducted prior
to the testing, which may be preferable in case test results could affect clinical
equipoise.

Add-On, Triage, and Parallel or Combined Testing


Test-treatment strategies often consist of a series or combinations of multiple tests,
instead of a single test followed by patient management. There are clinical scenarios
in which a decision should be made about whether using the results of a new test
would provide useful additional information. For example, the Scottish Computed
Tomography of the Heart (SCOT-HEART) trial investigated the use of CTA in
addition to standard noninvasive stress testing (electrocardiography, radionuclide
scintigraphy, echocardiography, or MR imaging) versus noninvasive stress testing
alone for patients with stable chest pain (Newby et al. 2018). Adding CTA resulted in
a significantly lower rate of coronary events at 5 years by improving the use of
subsequent tests and treatments. For such add-on tests, the designs as presented for
the single test (Figs. 3 and 4) can be used and need only minor modification.
Another frequent clinical application is using results from a new (perhaps less
invasive) test to select patients for an established, more invasive test (triage). Lastly,
a new test may be used independently and in parallel with a conventional test, after
which results of both tests are interpreted simultaneously for the diagnosis. New
triage or parallel testing strategies can be compared to the established test-treatment
strategy in a randomized fashion, similar to the design depicted in Fig. 5. However,
alternative trial designs exist for evaluating new triage strategies to improve effi-
ciency. For example, a triage test could be performed in all patients, followed by
randomization of patients to the established test only when management based on the
established test seems questionable. Another option is to perform both the triage and
established tests in all patients, and then randomize only those for whom the
established test is negative, but the new test for triage is positive (Lijmer and Bossuyt
2009).

Fig. 7 Random disclosure trial

The Canadian Pulmonary Embolism Diagnosis Study (CANPEDS) Group,
for example, investigated whether additional diagnostic testing can be safely withheld in
patients with suspected pulmonary embolism who have negative erythrocyte
agglutination D-dimer test results. Four hundred and fifty-six patients with negative
erythrocyte agglutination D-dimer test results out of 1126 patients with suspected
pulmonary embolism were randomly assigned to no further diagnostic testing or a
ventilation-perfusion lung scan followed by ultrasonography of the proximal deep
veins of the legs. Results showed that additional diagnostic testing can be safely
withheld without increasing the frequency of venous thromboembolism during
follow-up (Kearon et al. 2006).

Explanatory Versus Pragmatic Approaches for Test-Treatment Trials

In general, results from pragmatic trials are considered more generalizable to current
practice than those from explanatory trials. Pragmatic test-treatment trials enroll
patients within a standard care setting and allow both clinical decision-making at the
discretion of the clinician and patient nonadherence to recommended care (Ferrante
di Ruffano et al. 2017). The PROMISE trial enrolled patients with stable chest pain
who were predominantly at high cardiovascular risk and for whom noninvasive
cardiovascular testing was considered necessary in an outpatient setting (Douglas
et al. 2015). The data coordinating center ensured all study sites had experienced
staff and used diagnostic procedures in agreement with guidelines. Local physicians
made all clinical management decisions at their discretion based on the test results in
both the functional testing and CTA study arms. Follow-up visits were scheduled at
60 days and at 6-month intervals after randomization. Clinical events were adjudi-
cated in a blinded fashion by an independent committee.
Likewise, the SCOT-HEART trial was conducted within a standard care outpa-
tient setting and management following diagnostic testing was done at the discretion
of the clinician in both study arms (Newby et al. 2018). There were, however, no
trial-specific visits planned, and routinely collected data on events were used from
the Information and Statistics Division and the electronic Data Research and Inno-
vation Service of the National Health Service (NHS) Scotland. There was no formal
event adjudication committee involved in the study, and study end points were
classified using diagnostic codes and procedural codes from discharge records.
Thus, both the PROMISE and SCOT-HEART trial can be considered pragmatic
trials, although the SCOT-HEART trial may be considered more so.

Reporting of Test-Treatment Trials

Nonetheless, findings from any trial (pragmatic or less pragmatic in nature) are more
likely to be translated into clinical practice when the protocol is published and
provides transparent and detailed information on all test-treatment pathways that
should be followed (e.g., decision trees or flow diagrams). Results should include
data on the diagnoses that were made, as well as data on how the test results may
have impacted subsequent clinical decision-making and outcomes. For example,
treatment decisions should be reported with stratification by test results to show the
extent to which clinical decisions were guided by the recommended test-treatment
protocols (Ferrante di Ruffano et al. 2017). The Template for Intervention Descrip-
tion and Replication (TIDieR) checklist and guide can be used to improve the
reporting of test-treatment trials.

Statistical Analysis and Sample Size Calculations

Statistical analysis plans and sample size estimations depend on the design of test-
treatment trials and outcome type (e.g., binary, continuous, time-to-event). For two-
arm designs, as in single test and comparative test trials, the statistical analysis is
relatively straightforward and preferably performed using the intention-to-treat
principle. Sample size and power formulas depend on disease prevalence, sensitiv-
ities, and specificities of test(s), and response rate of the treatment(s). For the paired
design used in discordant test results trials, formulas are more complicated, because
overall treatment response rates depend on how many patients had discordant test
results. This in turn is a function of the total number of patients, disease prevalence,
sensitivities, and specificities (Hooper et al. 2013; Lu and Gatsonis 2013).
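
A minimal sketch of how these ingredients interact in a two-arm test-treatment trial: each arm's expected proportion of good outcomes is a mixture over true-positive, false-negative, false-positive, and true-negative patients, and a standard two-proportion formula then yields the sample size. All numerical inputs below are illustrative assumptions, not values from any trial discussed here.

```python
# Sketch: expected proportion of good outcomes per arm in a test-treatment
# trial is a mixture over TP/FN/FP/TN patients; a two-proportion formula then
# gives n per arm. All inputs are illustrative assumptions.
from scipy.stats import norm

prev = 0.30                                    # disease prevalence
p_good_treated, p_good_untreated = 0.60, 0.30  # outcome if diseased
p_harm_overtreated = 0.30                      # harm if a false positive is treated

def good_outcome_rate(se, sp):
    """P(good outcome) when test positives are treated and negatives are not."""
    tp = prev * se               # diseased and treated
    fn = prev * (1 - se)         # diseased but untreated
    fp = (1 - prev) * (1 - sp)   # not diseased, exposed to treatment harm
    tn = (1 - prev) * sp         # not diseased, correctly untreated
    return (tp * p_good_treated + fn * p_good_untreated
            + fp * (1 - p_harm_overtreated) + tn * 1.0)

p1 = good_outcome_rate(se=0.90, sp=0.70)   # strategy built on test A
p2 = good_outcome_rate(se=0.75, sp=0.90)   # strategy built on test B
z = norm.ppf(0.975) + norm.ppf(0.80)       # two-sided alpha = 0.05, power = 0.80
pbar = (p1 + p2) / 2
n_per_arm = z**2 * 2 * pbar * (1 - pbar) / (p1 - p2)**2
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, n per arm = {n_per_arm:.0f}")
# The arms differ only through discordantly classified patients, which is why
# test-treatment trials typically require large sample sizes.
```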

Economic Analysis in Test-Treatment Trials and Decision Models

Usually when new tests are being evaluated against conventional tests or standard care,
it is critical for decision-makers to consider whether replacing existing care by
implementing the new test strategy would be cost-effective. Contemporary clinical
trials therefore frequently include a secondary economic analysis or analysis that
integrates clinical effectiveness and cost implications, i.e., cost-utility or cost-effective-
ness analysis. For example, the PROMISE trialists conducted an economic sub-study in
which costs of the initial outpatient testing strategies were estimated from administra-
tive data, costs of hospitalizations were estimated from uniform billing claims data, and
physician fees were assessed based on reimbursement rates (Mark et al. 2016).
The preferred health outcome used in a more comprehensive cost-effectiveness
analysis is the quality-adjusted life year (QALY), which combines morbidity and
mortality by considering both generic health-related quality of life and survival time.
When the cost-effectiveness analysis is conducted alongside an RCT, average
cumulative QALYs can be estimated by the integral of quality of life utility repeat-
edly scored at a scale from 0 (death) to 1 (perfect health) for each participant during
the trial follow-up period (Glasziou et al. 1998). Utility scores are generally derived
from generic quality of life questionnaires that can be mapped to community
preference weights obtained by standard gamble or time trade-off methods. Costs
can be estimated using a micro-costing approach or a more indirect gross-costing
approach, as was done in the PROMISE trial.

Table 4 Key differences between test-treatment trials and decision models for evaluating utility of
medical tests (Adapted from Bossuyt et al. (2012), PMID 22730450)

  Test-treatment trial                             Decision model
  Can compare few competing test-                  Can compare multiple competing test-treatment
  treatment strategies                             strategies
  Can evaluate a limited number of                 Can evaluate all relevant effectiveness and safety
  effectiveness and safety outcomes                outcomes based on multiple sources
  Restricted by a limited time horizon             Lifetime horizon is possible
  Uses empirical data; correlations can            Assumptions about model structure and parameters
  be observed and accounted for                    need to be made

For example, in the Netherlands, a cost-
utility analysis was performed parallel to a randomized controlled trial to determine
the cost-effectiveness of early referral for MR imaging by general practitioners
versus usual care alone in patients with traumatic knee symptoms. QALYs and
costs were estimated over the trial duration from a healthcare and societal perspec-
tive. Results from this analysis showed that MR imaging referral was more costly
(mean costs €1109 vs €837) and less effective (mean QALYS 0.888 vs 0.899). Thus,
usual care was deemed the dominant strategy for patients with traumatic knee
symptoms (van Oudenaarde et al. 2018).
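
As a concrete illustration of the trial-based QALY estimation described above, the sketch below integrates repeated utility scores over follow-up with the trapezoidal rule; the assessment times and utilities are hypothetical.

```python
# A minimal sketch of estimating a participant's QALYs as the area under
# repeated utility scores (0 = death, 1 = perfect health) over follow-up.
# Assessment times and utility values are illustrative assumptions.
import numpy as np

years   = np.array([0.0, 0.5, 1.0, 1.5, 2.0])       # assessment times (years)
utility = np.array([0.70, 0.78, 0.82, 0.80, 0.85])  # e.g., mapped from a generic
                                                    # quality of life questionnaire

qalys = np.trapz(utility, years)   # trapezoidal integral of utility over time
print(f"QALYs over 2 years of follow-up: {qalys:.3f}")
```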
Test-treatment trials are, however, often limited by the number of strategies that can
be evaluated, follow-up duration, and ability to evaluate outcomes with low event rates
(e.g., radiation risk) or large variability (cost data). For these reasons, decision models
are increasingly used to assess the cost-effectiveness of diagnostic tests by combining
and linking data from multiple sources. These sources can include evidence regarding
diagnostic accuracy, disease prevalence, immediate and future adverse event rates,
treatment efficacy, and quality of life. Similar to a cost-effectiveness analysis conducted
alongside the trial, decision models generally analyze outcomes relevant to the
healthcare sector and society. As such, preference weights and costs from multiple
components (formal and informal health care, and non-health care sectors) are assigned
to the modeled tests, treatments, events, and health states. Key differences in charac-
teristics of empirical test-treatment trials and decision models are listed in Table 4.
Decision models can also be used to extrapolate cost-effectiveness outcomes
beyond the trial duration, when it is expected that the trial follow-up is too short to
capture all potential future benefits and harms. For example, in the economic analysis
of the PROMISE trial, a parametric model was used to extrapolate costs from 90 days
to 3 years (Mark et al. 2016). The analysis showed that CTA and conventional
diagnostic testing resulted in similar costs through 3 years of follow-up.

Summary and Conclusions

In summary, the usefulness of diagnostic tests firstly depends on making an accurate
diagnosis and/or determining disease severity. However, diagnostic data are gener-
ally sought for a broader set of reasons, including improvement of health outcomes
in the target patient population. Key questions to ask are how good is a diagnostic
test at providing the desired answers concerning these outcomes, and what rules of
evidence should be used to judge the value of new tests. The ultimate determinant is
whether the clinical intervention imposed based on the diagnostic test result truly
helped improve a relevant clinical metric for patients. The encompassing field of
diagnostic trials helps address these questions through the various designs pre-
sented in this chapter. Researchers must weigh the strengths and weaknesses of each
of these designs, compute sample sizes with an eye toward feasibility, and report all
results transparently to ensure that the new information obtained is useful for clinical
practice and future studies.

Key Facts

• The field of diagnostic trials has grown tremendously in the last 40 years. Two
types of diagnostic trials have emerged: (I) studies that estimate and compare
accuracy of diagnostic procedures and (II) studies that estimate and compare
effectiveness of a treatment pathway triggered by specific results of diagnostic
tests.
• Each type of diagnostic trial has a variety of design options, and related sample
size computation and software tools are now available.
• Guidelines for reporting designs and results of diagnostic trials remain underused.

Cross-References

▶ Bayesian Adaptive Designs for Phase I Trials


▶ Biomarker-Driven Adaptive Phase III Clinical Trials
▶ Cluster Randomized Trials
▶ Introduction to Meta-Analysis
▶ Monte Carlo Simulation for Trial Design Tool
▶ Power and Sample Size
▶ Principles of Clinical Trials: Bias and Precision Control
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data
▶ Sequential, Multiple Assignment, Randomized Trials (SMART)

References
Ahmed HU, El-Shater Bosaily A, Brown LC, Gabe R, Kaplan R, Parmar MK, Collaco-Moraes Y et
al (2017) Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer
(PROMIS): a paired validating confirmatory study. Lancet 389(10071):815–822. https://doi.org/
10.1016/s0140-6736(16)32401-1
Beam CA (1992) Strategies for improving power in diagnostic radiology research. AJR Am J
Roentgenol 159(3):631–637. https://doi.org/10.2214/ajr.159.3.1503041
Begg CB, Greenes RA (1983) Assessment of diagnostic tests when disease verification is subject to
selection bias. Biometrics 39(1):207–215
63 Diagnostic Trials 1195

Bossuyt PM, Reitsma JB, Linnet K, Moons KG (2012) Beyond diagnostic accuracy: the clinical
utility of diagnostic tests. Clin Chem 58(12):1636–1643. https://doi.org/10.1373/clinchem.2012.
182576
Braga LH, Farrokhyar F, Bhandari M (2012) Confounding: what is it and how do we deal with it?
Can J Surg 55(2):132–138. https://doi.org/10.1503/cjs.036311
Bruni L, Barrionuevo-Rosas L, Albero G, Serrano B, Mena M, Gómez D, Muñoz J, Bosch FX, de
Sanjosé S (2014) Human papillomavirus and related diseases report. L’Hospitalet de Llobregat:
ICO Information Centre on HPV and Cancer
Cardoso F, Piccart-Gebhart M, Van’t Veer L, Rutgers E (2007) The MINDACT trial: the first
prospective clinical validation of a genomic tool. Mol Oncol 1(3):246–251. https://doi.org/10.
1016/j.molonc.2007.10.004
Colli A, Fraquelli M, Casazza G, Conte D, Nikolova D, Duca P, Thorlund K, Gluud C (2014) The
architecture of diagnostic research: from bench to bedside–research guidelines using liver
stiffness as an example. Hepatology 60(1):408–418. https://doi.org/10.1002/hep.26948
de Groot JA, Bossuyt PM, Reitsma JB, Rutjes AW, Dendukuri N, Janssen KJ, Moons KG (2011)
Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 343:
d4770. https://doi.org/10.1136/bmj.d4770
Douglas PS, Hoffmann U, Patel MR, Mark DB, Al-Khalidi HR, Cavanaugh B, Cole J et al (2015)
Outcomes of anatomical versus functional testing for coronary artery disease. N Engl J Med 372
(14):1291–1300. https://doi.org/10.1056/NEJMoa1415516
Faraggi D, Reiser B (2002) Estimation of the area under the ROC curve. Stat Med 21(20):3093–
3106. https://doi.org/10.1002/sim.1228
Ferrante di Ruffano L, Dinnes J, Taylor-Phillips S, Davenport C, Hyde C, Deeks JJ (2017) Research
waste in diagnostic trials: a methods review evaluating the reporting of test-treatment interven-
tions. BMC Med Res Methodol 17(1):32. https://doi.org/10.1186/s12874-016-0286-0
Ferrante di Ruffano L, Hyde CJ, McCaffery KJ, Bossuyt PM, Deeks JJ (2012) Assessing the value
of diagnostic tests: a framework for designing and evaluating trials. BMJ 344:e686. https://doi.
org/10.1136/bmj.e686
Fosgate GT (2009) Practical sample size calculations for surveillance and diagnostic investigations.
J Vet Diagn Investig 21(1):3–14. https://doi.org/10.1177/104063870902100102
Flahault A, Cadilhac M, Thomas G (2005) Sample size calculation should be performed for design
accuracy in diagnostic test studies. J Clin Epidemiol 58(8):859–862. https://doi.org/10.1016/j.
jclinepi.2004.12.009
Glasziou PP, Cole BF, Gelber RD, Hilden J, Simes RJ (1998) Quality adjusted survival analysis
with repeated quality of life measures. Stat Med 17(11):1215–1229. https://doi.org/10.1002/
(sici)1097-0258(19980615)17:11<1215::aid-sim844>3.0.co;2-y
Gluud C, Gluud LL (2005) Evidence based diagnostics. BMJ 330(7493):724–726
Hajian-Tilaki KO, Hanley JA, Joseph L, Collet JP (1997) A comparison of parametric and
nonparametric approaches to ROC analysis of quantitative diagnostic tests. Med Decis Mak
17(1):94–102. https://doi.org/10.1177/0272989x9701700111
Harel O, Zhou XH (2006) Multiple imputation for correcting verification bias. Stat Med 25(22):
3769–3786. https://doi.org/10.1002/sim.2494
Henrichs J, Verfaille V, Jellema P, Viester L, Pajkrt E, Wilschut J, van der Horst HE, Franx A, de
Jonge A (2019) Effectiveness of routine third trimester ultrasonography to reduce adverse
perinatal outcomes in low risk pregnancy (the IRIS study): nationwide, pragmatic, multicentre,
stepped wedge cluster randomised trial. BMJ 367:l5517. https://doi.org/10.1136/bmj.l5517
Hooper R, Díaz-Ordaz K, Takeda A, Khan K (2013) Comparing diagnostic tests: trials in people
with discordant test results. Stat Med 32(14):2443–2456. https://doi.org/10.1002/sim.5676
Huchko MJ, Sneden J, Zakaras JM, Smith-McCune K, Sawaya G, Maloba M, Bukusi EA, Cohen
CR (2015) A randomized trial comparing the diagnostic accuracy of visual inspection with
acetic acid to visual inspection with Lugol’s iodine for cervical cancer screening in HIV-infected
women. PLoS One 10(4):e0118568. https://doi.org/10.1371/journal.pone.0118568
Huang EP, Lin FI, Shankar LK (2017) Beyond correlations, sensitivities, and specificities: a
roadmap for demonstrating utility of advanced imaging in oncology treatment and clinical
trial design. Acad Radiol 24(8):1036–1049. https://doi.org/10.1016/j.acra.2017.03.002
1196 M. Mazumdar et al.

Hu ZD (2016) STARD guideline in diagnostic accuracy tests: perspective from a systematic


reviewer. Ann Transl Med 4(3):46. https://doi.org/10.3978/j.issn.2305-5839.2016.01.03
Hu ZD, Wei TT, Yang M, Ma N, Tang QQ, Qin BD, Fu HT, Zhong RQ (2015) Diagnostic value of
osteopontin in ovarian cancer: a meta-analysis and systematic review. PLoS One 10(5):
e0126444. https://doi.org/10.1371/journal.pone.0126444
Kearon C, Ginsberg JS, Douketis J, Turpie AG, Bates SM, Lee AY, Crowther MA et al (2006) An
evaluation of D-dimer in the diagnosis of pulmonary embolism: a randomized trial. Ann Intern
Med 144(11):812–821. https://doi.org/10.7326/0003-4819-144-11-200606060-00007
Kosinski AS, Barnhart HX (2003) A global sensitivity analysis of performance of a medical
diagnostic test when verification bias is present. Stat Med 22(17):2711–2721. https://doi.org/
10.1002/sim.1517
Korevaar DA, van Enst WA, Spijker R, Bossuyt PM, Hooft L (2014) Reporting quality of
diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adher-
ence to STARD. Evid Based Med 19(2):47–54. https://doi.org/10.1136/eb-2013-101637
Korevaar DA, Wang J, van Enst WA, Leeflang MM, Hooft L, Smidt N, Bossuyt PM (2015)
Reporting diagnostic accuracy studies: some improvements after 10 years of STARD. Radiol-
ogy 274(3):781–789. https://doi.org/10.1148/radiol.14141160
Kumar R, Indrayan A (2011) Receiver operating characteristic (ROC) curve for medical
researchers. Indian Pediatr 48(4):277–287. https://doi.org/10.1007/s13312-011-0055-4
Li J, Fine J (2004) On sample size for sensitivity and specificity in prospective diagnostic accuracy
studies. Stat Med 23(16):2537–2550. https://doi.org/10.1002/sim.1836
Liu A, Schisterman EF, Mazumdar M, Hu J (2005) Power and sample size calculation of compar-
ative diagnostic accuracy studies with multiple correlated test results. Biom J 47(2):140–150.
https://doi.org/10.1002/bimj.200410094
Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, Bossuyt PM (1999)
Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 282(11):1061–
1066. https://doi.org/10.1001/jama.282.11.1061
Lijmer JG, Bossuyt PM (2009) Various randomized designs can be used to evaluate medical tests. J
Clin Epidemiol 62(4):364–373. https://doi.org/10.1016/j.jclinepi.2008.06.017
Lu B, Gatsonis C (2013) Efficiency of study designs in diagnostic randomized clinical trials. Stat
Med 32(9):1451–1466. https://doi.org/10.1002/sim.5655
Mark DB, Federspiel JJ, Cowper PA, Anstrom KJ, Hoffmann U, Patel MR, Davidson-Ray L et al
(2016) Economic outcomes with anatomical versus functional diagnostic testing for coronary
artery disease. Ann Intern Med 165(2):94–102. https://doi.org/10.7326/m15-2639
McClish DK (1989) Analyzing a portion of the ROC curve. Med Decis Mak 9(3):190–195. https://
doi.org/10.1177/0272989x8900900307
Metz CE (1978) Basic principles of ROC analysis. Semin Nucl Med 8(4):283–298. https://doi.org/
10.1016/s0001-2998(78)80014-2
Majeed H, Amir E (2018) EQUATOR-Oncology: reducing the latitude of cancer trial design and
reporting. Nature Publishing Group
Mustafa RA, Wiercioch W, Cheung A, Prediger B, Brozek J, Bossuyt P, Garg AX, Lelgemann M,
Büehler D, Schünemann HJ (2017) Decision making about healthcare-related tests and diag-
nostic test strategies. Paper 2: a review of methodological and practical challenges. J Clin
Epidemiol 92:18–28. https://doi.org/10.1016/j.jclinepi.2017.09.003
Modic MT, Obuchowski NA, Ross JS, Brant-Zawadzki MN, Grooff PN, Mazanec DJ, Benzel EC
(2005) Acute low back pain and radiculopathy: MR imaging findings and their prognostic role
and effect on outcome. Radiology 237(2):597–604. https://doi.org/10.1148/radiol.2372041509
NCSS (2018) PASS (Power Analysis and Sample Size) software. NCSS, LLC, Kaysville, UT
EQUATOR Network (2017) What we do and how we are organised
Newby DE, Adamson PD, Berry C, Boon NA, Dweck MR, Flather M, Forbes J et al (2018)
Coronary CT angiography and 5-year risk of myocardial infarction. N Engl J Med 379(10):924–
933. https://doi.org/10.1056/NEJMoa1805971
63 Diagnostic Trials 1197

Obuchowski NA (1998) Sample size calculations in studies of test accuracy. Stat Methods Med Res
7(4):371–392. https://doi.org/10.1177/096228029800700405
Ogilvie JC, Douglas Creelman C (1968) Maximum-likelihood estimation of receiver operating
characteristic curve parameters. J Math Psychol 5(3):377–391
Obuchowski NA, Bullen JA (2018) Receiver operating characteristic (ROC) curves: review of
methods with applications in diagnostic medicine. Phys Med Biol 63(7):07tr01. https://doi.org/
10.1088/1361-6560/aab4b1
Park SH, Goo JM, Jo CH (2004). Receiver operating characteristic (ROC) curve: practical review
for radiologists. Korean J Radiol 5(1):11–18. https://doi.org/10.3348/kjr.2004.5.1.11
Pepe MS (2003) The statistical evaluation of medical tests for classification and prediction.
Medicine
Sackett DL, Haynes RB (2002) The architecture of diagnostic research. BMJ 324(7336):539–541.
https://doi.org/10.1136/bmj.324.7336.539
Simel DL, Samsa GP, Matchar DB (1991) Likelihood ratios with confidence: sample size estimation
for diagnostic test studies. J Clin Epidemiol 44(8):763–770. https://doi.org/10.1016/0895-4356
(91)90128-v
Steinberg DM, Fine J, Chappell R (2009) Sample size for positive and negative predictive value in
diagnostic research using case-control designs. Biostatistics 10(1):94–105. https://doi.org/10.
1093/biostatistics/kxn018
Sun F, Schoelles KM, Coates VH (2013) Assessing the utility of genetic tests. J Ambul Care
Manage 36(3):222–232. https://doi.org/10.1097/JAC.0b013e318295d7e3
Swets JA (1986) Indices of discrimination or diagnostic accuracy: their ROCs and implied models.
Psychol Bull 99(1):100–117
Thompson IM, Ankerst DP, Chen C, Scott Lucia M, Goodman PJ, Crowley JJ, Parnes HL, Coltman
CA (2005) Operating characteristics of prostate-specific antigen in men with an initial PSA level
of 3.0 ng/ml or lower. JAMA 294(1):66–70
Trevethan R (2019) Response: commentary: sensitivity, specificity, and predictive values: founda-
tions, Pliabilities, and pitfalls in research and practice. Front Public Health 7:408. https://doi.org/
10.3389/fpubh.2019.00408
van Oudenaarde K, Swart NM, Bloem JL, Bierma-Zeinstra SMA, Algra PR, Bindels PJE, Koes BW
et al (2018) General practitioners referring adults to MR imaging for knee pain: a randomized
controlled trial to assess cost-effectiveness. Radiology 288(1):170–176. https://doi.org/10.1148/
radiol.2018171383
Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J (2004) Sources of variation
and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 140(3):189–
202. https://doi.org/10.7326/0003-4819-140-3-200402030-00010
Walsh SJ (1997) Limitations to the robustness of binormal ROC curves: effects of model mis-
specification and location of decision thresholds on bias, precision, size and power. Stat Med 16
(6):669–679. https://doi.org/10.1002/(sici)1097-0258(19970330)16:6<669::aid-sim489>3.0.
co;2-q
Zhou X-H, McClish DK, Obuchowski NA (2009) Statistical methods in diagnostic medicine. John
Wiley & Sons
64 Designs to Detect Disease Modification

Michael P. McDermott
Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
e-mail: [email protected]

Contents
Introduction
Standard Single-Period Designs
Two-Period Designs
  Withdrawal Design
  Delayed Start Design
  Assumptions
  Eligibility Criteria
  Duration of Follow-up Periods
Statistical Considerations for Two-Period Designs
  Primary Analyses
  Strategies for Accommodating Missing Data
  Sample Size Determination
Summary and Conclusion
Cross-References
References

Abstract
Designing a trial to determine whether or not an intervention has modified the
underlying course of the disease is straightforward for certain conditions, such as
cancer, in which it is possible to directly measure the disease course. For many
other diseases, the disease course is latent, and one must rely on indirect measures
such as clinical symptoms to quantify the effects of interventions. In this case, it is
difficult with conventional trial designs to determine the extent to which
the treatment is modifying the disease course as opposed to merely alleviating
the symptoms of the disease. This distinction has become critically important in the study of treatments for neurodegenerative diseases such as Alzheimer’s disease and Parkinson’s disease, but it applies to many other diseases as well.
This chapter discusses proposed strategies for trial design to attempt to
distinguish between the disease-modifying and symptomatic effects of a treat-
ment in diseases with a latent disease course. Two-period designs, such as the
withdrawal design and the delayed start design, are being used for this purpose,
most commonly in neurodegenerative disease. In these designs, the first period
involves a standard randomization of participants to active and placebo treat-
ments. In the second period, those in the active treatment group are switched to
placebo (withdrawal design), or those in the placebo group are switched to active
treatment (delayed start design). These designs are reviewed in detail in terms of
their underlying assumptions, limitations, and strategies for statistical analysis.

Keywords
Two-period design · Withdrawal design · Delayed start design · Disease-
modifying effect · Symptomatic effect · Alzheimer’s disease · Parkinson’s
disease · Missing data · Noninferiority

Introduction

For many diseases, it is not possible to directly observe the underlying disease
process. Instead, clinical symptoms and/or function or, in some cases, even labora-
tory or biological markers might serve as indirect measures of this process. Exam-
ples of such conditions include diabetic peripheral neuropathy, depression, anemia,
osteoporosis, and neurodegenerative disease (e.g., Alzheimer’s disease, Parkinson’s
disease, and Huntington’s disease). In recent years, interest has increased substan-
tially in the problem of designing clinical trials to determine whether a treatment has
modified the underlying course of the disease or has merely exerted its effect on
disease symptoms.
This heightened interest in trial designs that can detect disease modification has
been most prominent in the area of neurodegenerative disease, specifically in
Alzheimer’s disease (AD) and Parkinson’s disease (PD). Although a wide variety
of effective treatments have been developed for these conditions, none have been
conclusively shown to modify the underlying course of the disease, and most are
believed to only alleviate disease symptoms. The discovery of a treatment that either
slows, halts, or even reverses underlying disease progression has been termed “the
highest priority in PD research” (Olanow et al. 2008). The aging of the population
has raised grave concerns regarding the global public health crisis posed by AD
(Cummings 2017) and PD (Dorsey and Bloem 2018), exacerbating the need for
disease-modifying treatments.
The term disease modification implies that the treatment has an enduring effect on
the course of the underlying disease. Modifications of a key pathological feature of
the disease, such as tau and β-amyloid protein levels in the brain in AD (Kaye 2000)
or the rate of loss of catecholaminergic neurons (primarily the dopaminergic projection from the substantia nigra to the striatum) in PD (Clarke 2004), are examples of
this. For a disease-modifying effect to be important, however, a clear benefit with
respect to the clinical course of the disease would also be required (Cummings
2009). Treatments that merely ameliorate the symptoms of the disease without
affecting the underlying disease process, on the other hand, would be expected to
lose their benefit relatively soon upon discontinuation.
Designing a trial to determine whether a treatment has an impact on the under-
lying disease is straightforward when a valid measure of the underlying disease is
available (e.g., tumor size in cancer or viral load in HIV infection). Some success has
been realized in establishing disease modification for treatments on the basis of a
combination of clinical and imaging markers. One example is in relapsing-remitting
multiple sclerosis, where several treatments are considered to be disease-modifying
on the basis of reductions in relapse rates and the appearance of new brain lesions
detected by magnetic resonance imaging, and of slowing of the accumulation of
disability (Sormani and Bruzzi 2013). Another example is in rheumatoid arthritis, for
which disease-modifying antirheumatic drugs (DMARDs) have been shown to
improve clinical, laboratory, and radiologic endpoints; see Emery et al. (2008) for
an example. Although a considerable amount of research has been (and continues to
be) devoted to establishing valid markers of underlying disease progression in
neurodegenerative disease, these efforts, so far, have been unsuccessful (Athauda
and Foltynie 2016; Cummings 2009; Cummings 2017; Vellas et al. 2008). In the
absence of such a measure, there are difficult challenges in designing a clinical trial
that can clearly distinguish between the symptomatic and disease-modifying effects
of the treatment.
In response to these challenges, special trial designs termed two-period designs
(McDermott et al. 2002), including the withdrawal and delayed start designs, have
been developed that attempt to distinguish between the symptomatic and disease-
modifying effects of treatment using clinical outcome measures. This chapter out-
lines the rationale for these designs and their specific features and assumptions.
Issues related to implementation, statistical considerations, and important limitations
of the designs are also discussed. Although these designs have been used mainly in
the context of neurodegenerative diseases such as AD and PD, they are more broadly
applicable.

Standard Single-Period Designs

It has been suggested by some that standard parallel group designs can be used to
infer a disease-modifying effect of an intervention by examining whether the pattern
of group differences in mean responses on a suitable clinical rating scale diverges
over time (Guimaraes et al. 2005; Vellas et al. 2008). For example, if the pattern
of change over time is linear in each treatment group, a group difference in the rate of
change (slope) would indicate an effect of treatment on the underlying progression of
the disease. The trouble with this interpretation is that such results are also
compatible with the interpretation of a very slow-onset symptomatic effect (Ploeger and Holford 2009). They are also compatible with the interpretation that the symptomatic effect of the treatment increases over time. It is quite plausible in a neuro-
degenerative disease, for example, for the magnitude of the symptomatic effect in a
participant to increase as the underlying disease worsens or as the score on the
clinical rating scale worsens.
Some trials attempting to discern the disease-modifying effects of an intervention
have relied on milestone endpoints in their design. The Deprenyl and Tocopherol
Antioxidative Therapy of Parkinsonism (DATATOP) trial was one of the earliest trials
to explicitly test a hypothesis concerning the disease-modifying effects of an inter-
vention, in this case two interventions, selegiline and vitamin E (The Parkinson Study
Group 1989; The Parkinson Study Group 1993). At the time the trial was designed, it
was believed that neither selegiline nor vitamin E had symptomatic effects. The trial
randomly assigned 800 participants with early, untreated PD in a 2 × 2 factorial design
to receive selegiline, vitamin E, both treatments in combination, or placebo, with the
primary outcome variable being the time from randomization until the development of
disability sufficient to require treatment with dopaminergic therapy, as judged by the
enrolling investigator. A substantial beneficial effect of selegiline was observed in
terms of delaying the need for dopaminergic therapy. On the other hand, an unantic-
ipated short-term effect of selegiline thought to be indicative of a symptomatic benefit
was also apparent (The Parkinson Study Group 1989), making the results difficult to
interpret with respect to mechanism. A very similar design was employed in a trial of
the same interventions in AD in which the primary outcome variable was the time
from randomization until death, institutionalization, loss of basic activities of daily
living, or a diagnosis of severe dementia, whichever occurred first (Sano et al. 1997).
The problem with this strategy is that such endpoints can be influenced by symptom-
atic effects as well as disease-modifying effects. Ideally, one would employ an
endpoint that is not influenced by a treatment with symptomatic benefit and that can
be ascertained in a reasonably short period of time.
An alternative approach to evaluating the disease-modifying effects of an inter-
vention is to combine a model for disease progression with a pharmacodynamic
model for drug effects, the latter facilitating inference concerning the mechanisms of
the drug effect (Holford 2015). These methods are analytically complex and rely on
several modeling assumptions, but they might overcome some of the limitations of
two-period designs discussed below and might facilitate understanding of the mech-
anisms of drug benefit (Holford and Nutt 2011).

Two-Period Designs

Withdrawal Design

In a seminal paper, Leber (1996) formally proposed the use of two-period designs to
attempt to distinguish between the symptomatic and disease-modifying effects of an
intervention. In the withdrawal design, participants are randomly assigned to receive
either active treatment or placebo in the first period (Period 1) and followed for a
fixed length of time. In the second period (Period 2), those who were receiving active
treatment are switched to placebo (A/P group), and those who were receiving
placebo remain on placebo (P/P group) (Fig. 1). Period 1 is chosen to be sufficiently
long to permit the emergence of a measurable disease-modifying effect of the
treatment. Period 2 is chosen to be long enough to eliminate (or “wash out”) any
symptomatic effect of the treatment from Period 1; the two periods do not have to be
of equal length. The purpose of the withdrawal maneuver is to determine whether
any portion of the treatment effect that is apparent at the end of Period 1 persists after
withdrawal of treatment, i.e., to distinguish between the short-term symptomatic
effect and the long-term disease-modifying effect. In theory, any difference in mean
response at the end of Period 2 in favor of the A/P group can be attributed to a
disease-modifying effect of the treatment.
A key assumption of the withdrawal design is the adequacy of the length of the
withdrawal period (Period 2). Consider, for example, the Early vs. Late L-dopa in

Fig. 1 Illustration of the results of a trial of a disease-modifying treatment using a withdrawal design. In this design, participants are randomly assigned to receive either active (A) or placebo (P) treatment in Period 1 followed by placebo treatment for all participants in Period 2. The notation “A/P” indicates the group that received active treatment in Period 1 followed by placebo treatment in Period 2. The plotted points are mean changes from baseline in the 13-item Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog 13) score, where positive changes indicate worsening. Disease modification is supported by a persisting difference in mean response between the A/P and P/P groups at the end of Period 2, with evidence that the group difference in mean response is not continuing to decrease over time near the end of this period.
Parkinson Disease (ELLDOPA) trial in which participants were randomly assigned to receive one of three dosages of levodopa or matching placebo and followed for
40 weeks. At the conclusion of the 40-week treatment period, participants underwent
a 2-week withdrawal of study medication and were reevaluated (The Parkinson
Study Group 2004). Participants receiving levodopa in Period 1, regardless of
dosage, continued to have substantially better mean scores on the Unified
Parkinson’s Disease Rating Scale (UPDRS) than those receiving placebo after the
withdrawal period, but it is not clear if the duration of the withdrawal period was
sufficient to completely eliminate the symptomatic effects of levodopa. It is inter-
esting to note that the underlying hypothesis being tested in the ELLDOPA trial was
that levodopa would be associated with a worsening of PD progression.
One problem with the withdrawal design is that there is no blinding with respect
to the treatment received during the withdrawal period (Period 2), which can result in
bias. In addition, participant retention during Period 2 might become a problem
depending on its duration since participants will be aware that they are not receiving
active treatment. A solution would be to add a third randomized group to the study in
which participants remain on active treatment in both periods (A/A), enabling the
blind to be maintained throughout the trial. Since the A/A group would have no
value in distinguishing between the disease-modifying and symptomatic effects of
the treatment (McDermott et al. 2002), it would be wise to assign relatively few
participants to this group to minimize the loss of efficiency.

Delayed Start Design

The withdrawal design is associated with concerns regarding participant recruitment and retention; this motivated Leber (1996) to propose an alternative two-period
design that he termed the randomized start design, now commonly known as the
delayed start design. The design is identical to that of the withdrawal design in
Period 1, but in Period 2 those who were receiving placebo are switched to active
treatment (P/A group), and those who were receiving active treatment remain on
active treatment (A/A group) (Fig. 2). Period 1 is chosen to be sufficiently long to
permit the emergence of a measurable disease-modifying effect of the treatment.
Period 2 is chosen to be long enough for the treatment to fully exert its symptomatic
effect; again, these periods do not have to be of equal length. The inference would be
that any difference in mean response at the end of Period 2 in favor of the A/A group
can be attributed to a disease-modifying effect of the treatment.
An important assumption of the delayed start design is that Period 2 is sufficiently
long to ensure that the P/A group will not continue to “catch up” to the A/A group. For
example, in the Attenuation of Disease Progression with Azilect Given Once-Daily
(ADAGIO) trial of rasagiline (Olanow et al. 2008; Olanow et al. 2009), participants
were randomly assigned with equal allocation to one of four groups: (1) rasagiline
1 mg/day for 72 weeks, (2) placebo for 36 weeks followed by rasagiline 1 mg/day for
36 weeks, (3) rasagiline 2 mg/day for 72 weeks, and (4) placebo for 36 weeks followed
by rasagiline 2 mg/day for 36 weeks. The design, therefore, included two delayed start

Fig. 2 Illustration of the results of a trial of a disease-modifying treatment using a delayed start design. In this design, participants are randomly assigned to receive either active (A) or placebo (P) treatment in Period 1 followed by active treatment for all participants in Period 2. The notation “P/A” indicates the group that received placebo treatment in Period 1 followed by active treatment in Period 2. The plotted points are mean changes from baseline in the 13-item Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog 13) score, where positive changes indicate worsening. Disease modification is supported by a persisting difference in mean response between the P/A and A/A groups at the end of Period 2, with evidence that the group difference in mean response is not continuing to decrease over time near the end of this period.

trials, one for the 1 mg/day dosage and one for the 2 mg/day dosage of rasagiline. The
trial unexpectedly produced conflicting results (Olanow et al. 2009). While the 1 mg/
day dosage yielded a pattern of mean UPDRS total scores over time that would be
expected from a drug that had at least a partial disease-modifying effect, the 2 mg/day
dosage did not demonstrate evidence of a disease-modifying effect as the delayed start
(P/A) group “caught up” to the early start (A/A) group in terms of mean response
during Period 2, as measured by the UPDRS total score.
Like the withdrawal design, the delayed start design has the problem that there is
no blinding with respect to the treatment received during Period 2. Again, one could
add a third randomized group to the study in which participants remain on placebo
throughout the trial (P/P) to address this problem, with relatively few participants
assigned to this group since it would have no value in distinguishing between the
disease-modifying and symptomatic effects of the treatment (McDermott et al.
2002). The addition of this third group, which would never receive active treatment,
might make it more difficult to recruit participants in the trial.

Assumptions

Simplified statistical models for the withdrawal and delayed start designs can be used
to illustrate the assumptions that each of these designs requires. Suppose that a
normally distributed outcome variable Y is measured on each participant at the end of
Period 1 (Y1) and at the end of Period 2 (Y2). A typical analysis of data from this
design would incorporate the additional longitudinal data collected and would likely
include certain covariates such as enrolling center and the baseline value of the
outcome variable, but these will be ignored here for simplicity. Additional details
regarding these models are described elsewhere (McDermott et al. 2002).
The models for the mean responses at the end of each period for the withdrawal
and delayed start designs are provided in Table 1. At the end of Period 1, participants
receiving placebo (i.e., those in the P/P and P/A groups) have a mean response μ1,
but participants receiving active treatment (i.e., those in the A/P and A/A groups)
have a mean response that also includes a treatment effect that is assumed to be a sum
of two components: a symptomatic effect (θS) and a disease-modifying effect (θD).
The data at the end of Period 1 can only be used to estimate the total treatment effect,
θS + θD, in that period; they cannot distinguish between these two components. In the
withdrawal design, for example, the difference in mean response between the A/P
and P/P groups would estimate θS + θD. Similarly, in the delayed start design, the
difference in mean response between the A/A and P/A groups would also estimate
θS + θD. The data from Period 2 are used to attempt to distinguish between the
symptomatic and disease-modifying components of that effect.
In the withdrawal design, participants who received placebo in both periods (P/P)
have a mean response μ2 at the end of Period 2. For the A/P group, which had active
treatment withdrawn in Period 2, it is assumed that the disease-modifying effect
acquired from active treatment during Period 1 is retained at the end of Period 2, but
that any symptomatic effect acquired during Period 1 disappears by the end of Period
2. The mean response in this group at the end of Period 2 is, therefore, μ2 + θD.
In the delayed start design, the P/A group receives active treatment in Period 2;
therefore, the mean response in this group at the end of Period 2 is μ2 + λT, i.e., is augmented by a total treatment effect λT acquired during this period that could consist of both symptomatic and disease-modifying components. Note, however, that λT is not necessarily equal to θS + θD since the total treatment effect acquired during Period 2 might not be the same as that acquired during Period 1. In the A/A group, the mean response at the end of Period 2 is μ2 + θD + δT; it is assumed that this group retains the disease-modifying effect (θD) and loses the symptomatic effect (θS) acquired during Period 1 but also acquires a total treatment effect δT during Period 2 that might differ from that acquired by the P/A group.

Table 1 Statistical models for mean responses in the withdrawal and delayed start designs

Design        | Group                  | End of Period 1 | End of Period 2
Withdrawal    | P/P                    | μ1              | μ2
Withdrawal    | A/P                    | μ1 + θS + θD    | μ2 + θD
Withdrawal    | Difference (A/P − P/P) | θS + θD         | θD
Delayed start | P/A                    | μ1              | μ2 + λT
Delayed start | A/A                    | μ1 + θS + θD    | μ2 + θD + δT
Delayed start | Difference (A/A − P/A) | θS + θD         | θD + δT − λT

Group indicates the Period 1/Period 2 treatment assignments, with P = placebo and A = active.
θS = symptomatic effect acquired during Period 1.
θD = disease-modifying effect acquired during Period 1.
λT = total treatment effect (symptomatic + disease-modifying) acquired by the P/A group during Period 2.
δT = total treatment effect (symptomatic + disease-modifying) acquired by the A/A group during Period 2.
The important assumptions of the withdrawal and delayed start designs are
illustrated by this simple model for the mean responses: (1) Period 1 is of sufficient
duration to permit the emergence of a measurable disease-modifying effect θD; (2)
the disease-modifying effect θD acquired during Period 1 persists at least through the
end of Period 2, but presumably longer; (3) Period 2 is of sufficient duration for the
symptomatic effect from Period 1 (θS) to completely disappear by the end of Period
2; and (4) withdrawal of active treatment does not modify (e.g., hasten) the disease
process in some way.
It can be seen from Table 1 that in the withdrawal design, the difference in
observed mean response between the A/P and P/P groups at the end of Period 2
will be an unbiased estimate of θD, the disease-modifying effect, under the assumed
statistical model. In the delayed start design, however, the difference in observed
mean response between the A/A and P/A groups at the end of Period 2 will not be an
unbiased estimate of θD under this model unless λT = δT, i.e., unless the total
treatment effect acquired during Period 2 is the same for the P/A and A/A groups.
The assumption, therefore, is that the total (symptomatic + disease-modifying) effect
of treatment received in Period 2 is the same regardless of whether or not the
participant received treatment during Period 1. Because of this assumption, it is
important to ensure that the duration of Period 2 is sufficient to allow the symptom-
atic effect of the treatment to become fully apparent in the P/A group.
Although the assumption that λT = δT is necessary in the delayed start design to
interpret the difference in observed mean response between the A/A and P/A groups
at the end of Period 2 as the magnitude of the disease-modifying effect of the
treatment, it is not a testable assumption in this design. The assumption could be
tested, however, using data from a complete two-period design (McDermott et al.
2002), i.e., a combination of the withdrawal and delayed start designs that includes
all four treatment arms (P/P, P/A, A/P, A/A) (Fig. 3). In this design, an unbiased
estimate of λT is the difference in observed mean response between the P/A and P/P
groups at the end of Period 2 (Table 1). Similarly, an unbiased estimate of δT is the
difference in observed mean response between the A/A and A/P groups at the end of
Period 2 (Table 1). A test of the null hypothesis λT = δT, then, could be based on the
difference between these estimates. If one were comfortable with the assumption of
λT = δT, a pooled estimate of θD could be formed from the withdrawal and delayed
start components of the design (McDermott et al. 2002). Such a design would also
promote blinding. Issues regarding allocation of participants to the different treat-
ment arms of a complete two-period design, including recruitment, dropout, and
statistical efficiency, are discussed in detail elsewhere (McDermott et al. 2002).

Fig. 3 Illustration of the results of a trial of a disease-modifying treatment using a complete two-
period design, which can be viewed as the combination of the withdrawal and delayed start designs.
In this design, participants are randomly assigned to receive either active (A) or placebo (P)
treatment in Period 1 followed by either active or placebo treatment during Period 2. The notation
“A/P” indicates the group that received active treatment in Period 1 followed by placebo treatment
in Period 2; similar notation is used for the other three groups

A slight variation on this design was used in two randomized trials of pegaptanib
sodium for the treatment of age-related macular degeneration (Mills et al. 2007). The
complete two-period design was also presented for a trial of propentofylline in AD
(Whitehouse et al. 1998), although the results of this trial do not seem to have been
published.
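
To make the estimands in Table 1 concrete, the following minimal sketch (in Python) simulates end-of-period responses under that simple model for all four arms of a complete two-period design and recovers the design-based estimates of θS + θD, θD, and δT − λT. All parameter values, the sample size, and the independence of responses across periods are illustrative assumptions, not features of any actual trial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters for the model of Table 1; negative
# effects denote benefit on a scale where higher scores are worse.
mu1, mu2 = 10.0, 14.0            # mean placebo response, end of Periods 1 and 2
theta_s, theta_d = -2.0, -1.5    # symptomatic and disease-modifying effects (Period 1)
lam_t = delta_t = -3.0           # total Period 2 effects, here taken equal (lambda_T = delta_T)
sigma, n = 6.0, 500              # residual SD and participants per arm

def arm(mean1, mean2):
    """End-of-period responses for one arm (independence across periods is a simplification)."""
    return (mean1 + sigma * rng.standard_normal(n),
            mean2 + sigma * rng.standard_normal(n))

y_pp = arm(mu1, mu2)                                          # P/P
y_pa = arm(mu1, mu2 + lam_t)                                  # P/A
y_ap = arm(mu1 + theta_s + theta_d, mu2 + theta_d)            # A/P
y_aa = arm(mu1 + theta_s + theta_d, mu2 + theta_d + delta_t)  # A/A

# Period 1 identifies only the total effect theta_S + theta_D
print("theta_S + theta_D:", y_ap[0].mean() - y_pp[0].mean())
# Withdrawal comparison at the end of Period 2 estimates theta_D
print("theta_D (withdrawal):", y_ap[1].mean() - y_pp[1].mean())
# Delayed start comparison estimates theta_D + delta_T - lambda_T
print("theta_D (delayed start):", y_aa[1].mean() - y_pa[1].mean())
# All four arms together permit a check of lambda_T = delta_T
print("delta_T - lambda_T:",
      (y_aa[1].mean() - y_ap[1].mean()) - (y_pa[1].mean() - y_pp[1].mean()))
```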

Eligibility Criteria

Depending on the disease in question, it might be helpful to enroll trial participants as soon as possible after disease diagnosis, with the thought that a disease-modifying
treatment might be more effective if given earlier in the disease course. This is
especially a concern in neurodegenerative diseases such as AD or PD. For example,
trials with a delayed start design in PD have restricted enrollment to participants who
were diagnosed within the past 18–24 months and do not yet require treatment with
dopaminergic therapy (Olanow et al. 2008; Schapira et al. 2010). Such consider-
ations have motivated the idea of investigating potential disease-modifying
treatments in participants with “pre-manifest” disease. This concept would be easier to apply in diseases where the genetic defect is known, such as Huntington’s disease;
for other conditions, identifying a population at high risk of developing manifest
disease within a relatively short period of time is a major challenge. Also, more
research is needed on identifying appropriate outcome measures before such trials
can be recommended (Kieburtz 2006; Vellas et al. 2007). In addition, there are
practical challenges in the design and execution of trials of potentially toxic treat-
ments in individuals who have pre-manifest disease (Kieburtz 2006).
Because trials attempting to distinguish between the disease-modifying and
symptomatic effects of an intervention have explanatory or mechanistic aims rather
than pragmatic aims, it is important to minimize the use of, or changes in, concom-
itant treatments during the trial. This is especially important for concomitant treat-
ments that might themselves have disease-modifying effects. For example, the
ADAGIO trial prohibited the use of levodopa, dopamine agonists, selegiline,
rasagiline, and coenzyme Q10 (> 300 mg/day) within 4 months of randomization.
Retention is another critical issue that must be considered in terms of eligibility
criteria. Exclusion of patients who have certain comorbid conditions or who will
likely need ancillary treatment during the trial might be indicated. In the ADAGIO
trial, for example, eligibility was restricted to patients who were judged by the site
investigator to not likely require symptomatic treatment in the subsequent 9 months.
A potential concern with this criterion is that it might yield a cohort of participants
with a slower underlying disease progression in whom a disease-modifying effect
might be more difficult to detect (Ahlskog and Uitti 2010; Clarke 2008). Such
restrictions on eligibility criteria need to be balanced with the ability to recruit
potentially large numbers of participants and considerations related to generalizing
the trial results (Clarke 2008).

Duration of Follow-up Periods

As noted above, in a two-period design, Period 1 should be chosen to be sufficiently long to allow a measurable disease-modifying effect to emerge. Also, convincing
support for the hypothesis of a disease-modifying effect of a treatment using either a
withdrawal design or a delayed start design would have to include evidence that the
group differences in mean response near the end of Period 2 are no longer decreasing
over time. For this reason, Period 2 should be chosen to be sufficiently long for the
symptomatic effect from Period 1 to completely disappear by the end of Period 2
and, in the delayed start design, for the symptomatic effect of the treatment to fully
emerge in Period 2. Clearly the duration of these periods will depend on the nature of
the treatment being studied, but practical aspects of study execution such as recruit-
ment and retention will also have to be considered.
In the ADAGIO delayed start study, Periods 1 and 2 were each 9 months in
duration. In the setting of inexorable progression of PD and the availability of
effective symptomatic treatments, 9 months might be the longest duration for Period
1 that would be considered practical. For the same reason, withdrawal designs in PD
might not be feasible unless any symptomatic effect associated with the treatment is
expected to disappear rapidly. In AD, a duration of 18 months is typically used for
Period 1 (Liu-Seifert et al. 2015). In diseases that are not progressive or have no
known effective treatment, Huntington’s disease being an example of the latter,
longer period durations might be feasible.

Statistical Considerations for Two-Period Designs

Primary Analyses

The primary analyses for withdrawal and delayed start designs aim to address three
scientific hypotheses in support of disease modification: (1) that there is an overall
effect of the treatment during Period 1; (2) that there remains a difference between
the groups (the A/P and P/P arms in the withdrawal design or the P/A and A/A arms
in the delayed start design) in Period 2; and (3) that the group differences in mean
responses near the end of Period 2 are not continuing to decrease over time.
Several authors have advocated for a comparison of the mean responses at the end
of Period 1 between those receiving active treatment and those receiving placebo to
address the first hypothesis (Liu-Seifert et al. 2015; McDermott et al. 2002; Zhang et
al. 2011); others have suggested a comparison of average slopes during this period,
perhaps including only time points after which the symptomatic effect or any
placebo effects are thought to have fully emerged (Bhattaram et al. 2009; Xiong et
al. 2014). For example, in the ADAGIO trial, the analyses involved comparisons of
the average slopes between the rasagiline and placebo groups in Period 1, where the
slopes were based on data from Week 12 to Week 36 (Olanow et al. 2008; Olanow et
al. 2009). The rationale for this strategy seems to be that increasing separation of the
active treatment and placebo groups over time with respect to mean response would
be expected in a trial of a disease-modifying agent. Although this strategy might be
more powerful if the assumption of a linear trajectory of response over time holds, it
should only be of interest in Period 1 to determine whether or not the treatment
groups differ with regard to mean response at the end of this period and not to
speculate about the mechanism of the treatment effect; the latter is addressed in the
second hypothesis. Also, this strategy requires strong assumptions concerning the
time point after which the symptomatic effect of the treatment has fully emerged
(Week 12 in ADAGIO) and linearity of the trajectory of response over time, which
might be problematic (Holford and Nutt 2011) and ended up being a main point of
contention as rasagiline was being considered for a disease-modification claim by
the Food and Drug Administration (Li and Barlas 2017).
There is consensus in the literature concerning the key analyses to address the
second hypothesis, namely, that these should involve group comparisons of the mean
responses at the end of Period 2 (Bhattaram et al. 2009; Liu-Seifert et al. 2015;
McDermott et al. 2002; Zhang et al. 2011). The analyses for the third hypothesis
should address the issue of whether or not the group differences in mean response
near the end of Period 2 are continuing to decrease over time. Decisions are required
as to how to quantify the evolution of the group difference in mean response over
time as well as which time points to include in the analysis. In the ADAGIO trial, the
investigators followed the recommendation of Bhattaram et al. (2009) to compare
the slopes of the two groups during Period 2, assuming a linear trajectory of response
over time in each group. They used the data from Weeks 48–72 in this comparison
because it was thought that the symptomatic effect of rasagiline would appear within
12 weeks of its initiation in the delayed start group at Week 36 (Olanow et al. 2009).
Evidence for disease modification would be supported by a finding that the group
differences in mean response are not continuing to decrease over time. Therefore, it
is appropriate to formulate the hypothesis testing problem as one involving non-
inferiority. Let βP/A be the slope (Weeks 48–72) in the delayed start (P/A) group, and
let βA/A be the corresponding slope in the early start (A/A) group. The following
statistical hypotheses were specified in the ADAGIO trial:

H0: βP/A − βA/A > δ  vs.  H1: βP/A − βA/A ≤ δ,

where δ is the noninferiority margin. If H0 is rejected, the conclusion would be that the slope in the P/A group is not meaningfully larger than the slope in the A/A group, as measured by the noninferiority margin δ. Convincing evidence of disease modification would require that δ be chosen to be quite small. The choice of δ = 0.15 UPDRS points/week in ADAGIO was not justified in the trial publications (Olanow et al. 2008; Olanow et al. 2009) and was much too large, being consistent with the group difference in mean responses shrinking by as much as 3.6 points (0.15 × 24) over the 24-week time period (Weeks 48–72), a value greater than the treatment effect observed during Period 1. Considering such a large difference to be nondecreasing over time would be clearly inappropriate. The estimate of βP/A − βA/A was 0.00 for the 1 mg/day dosage, with a 95% confidence interval of −0.04 to 0.04 (Olanow et al. 2009), indicating that the data are consistent with a group difference between the slopes of no more than 0.04 UPDRS points/week or with convergence of the group means by no more than approximately 1 UPDRS point (0.04 × 24) over
the 24-week period. One then has to decide whether this evidence is sufficient to
declare that the group difference in mean responses is not continuing to decrease
appreciably over time.
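
Once the Period 2 slopes have been estimated, the noninferiority comparison reduces to simple arithmetic. A minimal sketch, using the ADAGIO 1 mg/day point estimate with a standard error back-calculated from the reported 95% confidence interval (an approximation introduced here purely for illustration):

```python
from scipy import stats

est = 0.00          # estimated slope difference, P/A minus A/A (UPDRS points/week)
se = 0.04 / 1.96    # SE back-calculated from the reported 95% CI (-0.04, 0.04)
delta = 0.15        # noninferiority margin specified in ADAGIO (arguably too large)

# One-sided test of H0: slope difference > delta vs. H1: slope difference <= delta
z = (est - delta) / se
p_one_sided = stats.norm.cdf(z)
upper = est + 1.645 * se  # upper limit of the one-sided 95% confidence interval
print(f"z = {z:.1f}, one-sided p = {p_one_sided:.1e}, upper bound = {upper:.3f}")
print("noninferior at margin delta:", upper < delta)
```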
A slightly different approach to assessing the hypothesis that the group differ-
ences in mean response are not decreasing over time was proposed by Li and Barlas
(2017). They suggested a noninferiority test for a linear trend over time in the group
differences, which does not assume a linear trajectory over time in the mean
responses in each group. Liu-Seifert et al. (2015) suggested testing a noninferiority
hypothesis using only the differences in the group means at the end of Periods 1
and 2:

H0: Δ2 − 0.5Δ1 ≤ 0  vs.  H1: Δ2 − 0.5Δ1 > 0,

where Δ1 and Δ2 are the group differences in mean response at the end of Period 1
and at the end of Period 2, respectively. Rejection of H0 would imply that at least
50% of the total treatment effect observed during Period 1 is preserved after Period 2.
A problem with this approach is that it does not address the issue of whether the
group differences in mean response are decreasing over time.
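
The Liu-Seifert contrast can be computed from the estimated group differences at the ends of the two periods together with the covariance of those estimates, which would ordinarily come from the longitudinal model fit. The values below are invented solely to show the arithmetic:

```python
import math
from scipy import stats

# Assumed summaries from a longitudinal model fit (illustrative values only),
# oriented so that positive differences favor the early start (A/A) group
d1, d2 = 1.7, 1.1                      # group differences, end of Periods 1 and 2
var1, var2, cov12 = 0.16, 0.20, 0.05   # variances and covariance of the estimates

# Test H0: Delta2 - 0.5*Delta1 <= 0 vs. H1: > 0 (at least half the effect preserved)
contrast = d2 - 0.5 * d1
se = math.sqrt(var2 + 0.25 * var1 - cov12)  # Var(d2 - 0.5*d1) by the usual formula
z = contrast / se
p_one_sided = 1 - stats.norm.cdf(z)
print(f"estimate = {contrast:.2f}, z = {z:.2f}, one-sided p = {p_one_sided:.3f}")
```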
To adequately test the hypothesis that the group differences in mean response are
not decreasing over time, more frequent evaluations might be required in the latter
part of Period 2. The frequency of evaluations will depend on the disease and
treatment being studied. In the context of AD, Zhang et al. (2011) suggested monthly
evaluations in the final 3 months of Period 2; however, evaluations so close together
in time might not allow the slopes during this period to be estimated with sufficient
precision, and there could be problems with feasibility as well (Liu-Seifert et al.
2015).
Both Bhattaram et al. (2009) and Zhang et al. (2011) propose testing the three null
hypotheses of interest in sequence: (1) no group difference in average slopes (or
mean responses) in Period 1; (2) no group difference in mean response at the end of
Period 2; and (3) group differences in mean response are decreasing over time at a
rate that is greater than the specified noninferiority margin. Each hypothesis is tested
at a pre-specified significance level (say 5%), and one proceeds to test the next
hypothesis in the sequence if and only if the previous hypothesis is rejected. If one
takes the position, however, that all three null hypotheses would have to be rejected
in order for the treatment to be considered disease-modifying, then this would be an
example of reverse multiplicity (Offen et al. 2007) whereby the overall probability of
a false-positive result will be less than the significance level used for each of the three
tests, so correction for multiple testing would not be required. Of course, if it is
desired to make an efficacy claim about the treatment using data from Period 1 alone,
regardless of the mechanism of this effect, then an appropriate adjustment for
multiplicity would be necessary (D’Agostino Sr 2009).
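
The fixed-sequence procedure is simple to encode: each hypothesis is tested at the full significance level, and testing stops at the first non-rejection. A minimal sketch with placeholder p-values:

```python
def fixed_sequence(tests, alpha=0.05):
    """Hierarchical testing: proceed to the next hypothesis only if the current one is rejected."""
    rejected = []
    for label, p in tests:
        if p >= alpha:
            break  # stop; this and all subsequent hypotheses are not rejected
        rejected.append(label)
    return rejected

# Placeholder p-values for the three hypotheses, in their pre-specified order
print(fixed_sequence([
    ("treatment effect in Period 1", 0.010),
    ("group difference at end of Period 2", 0.030),
    ("noninferiority of Period 2 slope difference", 0.200),
]))  # -> rejects the first two; the sequence stops at the third
```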

Strategies for Accommodating Missing Data

Compared to standard clinical trial designs, the problem of missing data can be
exacerbated in two-period designs due to the long duration of follow-up and the fact
that the evidence concerning potential disease modification is derived from the data
acquired during Period 2. The 2010 National Research Council (NRC) report on The
Prevention and Treatment of Missing Data in Clinical Trials (National Research
Council 2010) has led to increased attention to how missing data are handled in
clinical trials. In particular, the report highlighted the shortcomings of simplistic
methods such as carrying forward the last available observation (LOCF) and so-
called complete case analyses that omit cases with missing data (Mallinckrodt et al.
2017; National Research Council 2010) and promoted the use of more principled
methods such as those based on direct likelihood, multiple imputation, and inverse
probability weighting (Molenberghs and Kenward 2007).
Most of the literature on the analysis of data from two-period designs favors the
use of so-called “mixed model repeated measures” (MMRM) analyses (Mallinckrodt
et al. 2008) that treat time as a categorical variable and use maximum likelihood to
estimate model parameters (e.g., mean treatment group responses at each individual
time point) using all available data, including all observed data from participants
who prematurely withdraw from the trial (Li and Barlas 2017; Liu-Seifert et al. 2015;
Zhang et al. 2011). Linear or nonlinear mixed effects models (Molenberghs et al.
2004) that specify a functional form for the relationship between response and time
can also be used for this purpose and might be more efficient than the MMRM
strategy if the specified functional form is (approximately) correct, but this could be
a strong assumption in practice. Multiple imputation can also be a useful strategy in
this setting (Little and Yau 1996; Schafer 1997).
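
As a rough illustration of a likelihood-based longitudinal analysis with time treated as categorical, the sketch below fits a linear mixed model to simulated long-format data (columns id, group, time, and y, all assumed here) using statsmodels. A random participant intercept stands in for the unstructured covariance of a full MMRM, so this is a simplified analogue rather than a faithful implementation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Assumed long format: one row per participant-visit for a delayed start trial
n, times = 100, [36, 48, 60, 72]
ids = np.repeat(np.arange(2 * n), len(times))
group = np.repeat(np.array(["A/A"] * n + ["P/A"] * n), len(times))
time = np.tile(times, 2 * n)
subj = 2.0 * rng.standard_normal(2 * n)  # random participant effect
y = 0.1 * time + 1.5 * (group == "P/A") + subj[ids] + rng.standard_normal(len(ids))
df = pd.DataFrame({"id": ids, "group": group, "time": time, "y": y})

# Time as categorical with a group-by-time interaction; all available visits
# contribute, so participants with missing later visits are retained (MAR).
model = smf.mixedlm("y ~ C(time) * group", df, groups=df["id"])
fit = model.fit(reml=True)
print(fit.summary())
```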
These methods all rely on the “missing at random” (MAR) assumption
concerning the missing data mechanism, namely, that the missingness depends
only on observed outcomes in addition to covariates, but not on unobserved out-
comes (Little and Rubin 2002). The reasonableness of this untestable assumption
depends on the clinical setting but also on the estimand of interest (International
Conference on Harmonization 2017; National Research Council 2010). The
estimand is the population quantity to be estimated in the trial and requires specifi-
cation of four elements: the target population, the outcome variable, the handling of
post-randomization (intercurrent) events, and the population-level summary for the
outcome variable. Key among these elements in the context of missing data is the
handling of intercurrent events such as discontinuation of study medication, use of
rescue medication, and use of an out-of-protocol treatment. The need for additional
treatment is particularly important for trials in PD, for which there are many
available effective treatments, but applies to AD and other conditions as well.
There are a number of options for dealing with this issue, including (1) withdrawing
the participant from the trial; (2) moving the participant directly into Period 2; and
(3) allowing the participant to receive additional treatment while continuing partic-
ipation in the trial. The second of these options only applies to participants who
require treatment in Period 1 and would not apply in the case of a withdrawal design.
The third option is consistent with strict adherence to the intention-to-treat principle
and might be sensible in a trial with a pragmatic aim, but it is not appealing in a trial
that aims to evaluate the disease-modifying effect of a treatment using a two-period
design, an aim that is explanatory or mechanistic.
In the ADAGIO trial, participants who were followed for at least 24 of the
scheduled 36 weeks in Period 1 were allowed to proceed directly into Period 2 if
judged by the enrolling investigator to require additional anti-parkinsonian medica-
tion. While this allows information to be obtained in these participants on the
mechanism of the effect of the treatment, the time scale for follow-up becomes
compressed for these participants, the implications of which are not entirely clear.
Also, if the active treatment has a beneficial effect regardless of its mechanism, the
early initiation of Period 2 might occur preferentially in those receiving placebo
during Period 1, which could complicate interpretation of the results. ADAGIO
participants who required additional treatment in Period 2 were withdrawn from the
trial at that time. Only participants who had at least one follow-up evaluation after
the start of Period 2 were included in the primary analyses of Period 2 data. Even
though participant retention in ADAGIO was quite good (Olanow et al. 2009),
exclusion of randomized participants from these analyses has the potential to introduce bias of unknown magnitude and direction. Methods such as propensity
score adjustment (D’Agostino Jr 1998) can be useful in reducing the bias resulting
from such participant exclusion (D’Agostino Sr 2009). The ADAGIO trial used the
MMRM strategy to deal with missing data in Period 2.
Given the explanatory aim of a trial with a two-period design, the strategy of
excluding data from participants after the introduction of required additional treat-
ment, or withdrawing participants from follow-up at that time, is arguably a reason-
able one. This would be consistent with specification of the disease modification
estimand as the group difference in mean response for all randomized participants at
the end of Period 2 that would have been obtained if all participants tolerated and
complied with treatment (Mallinckrodt et al. 2012; National Research Council
2010), a de jure estimand (Carpenter et al. 2013). An MMRM analysis, for example,
could yield an appropriate estimator for this quantity. Because the MAR assumption
is not testable, however, it would be important to perform analyses that examine the
sensitivity of the results to the assumptions that are made concerning the missingness
mechanism (Carpenter et al. 2013; Liu and Pang 2017; O’Kelly and Ratitch 2014;
Tang 2017). This was an emphasis of the NRC report (National Research Council
2010) and the draft addendum to the International Conference on Harmonization
guidance on Statistical Principles for Clinical Trials (International Conference on
Harmonization 2017) in the context of clinical trials in general.
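
One simple sensitivity analysis of this kind is a delta-adjustment “tipping point” search, in which imputed values in one arm are shifted by progressively larger amounts until the study conclusion changes. The sketch below applies the idea to a single end-of-Period-2 comparison using crude mean imputation, a deliberate simplification of the multiple-imputation machinery described in the references:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Illustrative observed data: end-of-Period-2 scores with some dropouts (NaN)
aa = rng.normal(10.0, 6.0, 150); aa[rng.random(150) < 0.15] = np.nan
pa = rng.normal(12.0, 6.0, 150); pa[rng.random(150) < 0.15] = np.nan

def p_value_with_shift(delta):
    """Impute missing values by the arm mean, then worsen imputed A/A values by delta."""
    aa_imp = np.where(np.isnan(aa), np.nanmean(aa) + delta, aa)
    pa_imp = np.where(np.isnan(pa), np.nanmean(pa), pa)
    return stats.ttest_ind(aa_imp, pa_imp).pvalue

# Increase delta until significance at the 5% level is lost (the tipping point)
for delta in np.arange(0.0, 10.1, 0.5):
    if p_value_with_shift(delta) >= 0.05:
        print(f"tipping point near delta = {delta}")
        break
```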

Sample Size Determination

The considerations for sample size determination that are unique to two-period designs
are the specification of the effect size for disease modification (θD) to be detected at the
end of Period 2 and the noninferiority margin for the third hypothesis that the group
differences in mean response are not decreasing over time. The effect size specified for
sample size determination in ADAGIO was chosen to be 1.8 points for the UPDRS
total score (Olanow et al. 2009), which was criticized by some to not represent a
clinically important effect (Clarke 2008). This group difference, however, must be
interpreted in the proper context: it is the benefit attributable to disease modification
that would accrue over the duration of Period 1, i.e., 36 weeks. This is a very short time
period relative to the expected duration of the disease. If this effect is truly due to
disease modification, it would be expected to continue to accrue over time, possibly
over many years. The observed effect of the 1 mg/day dosage of rasagiline (1.7 points
over 36 weeks) represents a 38% reduction in the change from baseline (Olanow et al.
2009); if this truly represents disease modification, an effect of this magnitude would
arguably be of major clinical importance. In a two-period design, the choice of effect
size for sample size determination should be based on a realistic expectation of the
magnitude of a disease-modifying effect that could accrue over a follow-up period that
is brief relative to the disease course and might not be very large.
The sample size required to determine whether the group difference in mean
responses is not continuing to decrease appreciably over time near the end of Period
2 could be quite large depending on the choice for the noninferiority margin;
considerations for choosing this margin are discussed in the ADAGIO example
above. Assumptions such as the time points included in this analysis and the residual
variability around the slopes would have to be carefully considered in the calcula-
tion. Additional factors that need to be considered in the sample size calculation
include intercurrent events (e.g., participant withdrawal and noncompliance) and
misdiagnosis (if applicable). Given the complexities that these considerations intro-
duce, the technique of simulation can be highly useful in assessing the required
sample size under a variety of design assumptions.
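
A minimal sketch of such a simulation, for the end-of-Period-2 comparison in a delayed start design under assumed values for the disease-modifying effect and outcome standard deviation (both illustrative), is given below; a realistic version would also simulate dropout, intercurrent events, and the slope noninferiority test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power_end_of_period2(n_per_arm, theta_d=1.8, sd=10.0, alpha=0.05, n_sim=2000):
    """Monte Carlo power for detecting a disease-modifying effect theta_d as a
    difference in mean response at the end of Period 2 (two-sample t test)."""
    hits = 0
    for _ in range(n_sim):
        early = rng.normal(0.0, sd, n_per_arm)        # A/A arm
        delayed = rng.normal(theta_d, sd, n_per_arm)  # P/A arm, worse by theta_d
        if stats.ttest_ind(early, delayed).pvalue < alpha:
            hits += 1
    return hits / n_sim

for n in (300, 500, 700):
    print(n, power_end_of_period2(n))
```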

Summary and Conclusion

There is great interest in developing interventions that can modify the course of
neurodegenerative diseases and other diseases in a meaningful way. The develop-
ment of reliable and valid methods to measure the underlying course of these
diseases is urgently needed, and this is a highly active area of research. In the
meantime, clinical trials in these conditions have to rely on rating scales, functional
measures, or other instruments to indirectly measure disease status. In this setting,
two-period designs represent a potentially attractive option to distinguish between
effects of interventions that are enduring (disease-modifying) vs. those that are short-
term/reversible (symptomatic).
Two-period designs are associated with several limitations, including uncer-
tainty regarding the required durations of the two periods; the assumption in the
delayed start design that the total (symptomatic + disease-modifying) effect of
treatment received in Period 2 is independent of whether or not the participant
received treatment during Period 1; potential difficulties with recruitment and
retention, particularly for the withdrawal design; potential compromise of
blinding; requirements of large sample sizes; the need for effective ancillary
treatments in some cases; and the problem of how to address the issue of missing
data from subjects who cease participation in the trial. Another limitation, in the
context of enrolling trial participants with relatively mild disease, is that the
outcome measure might lack sensitivity to assess disease-modifying effects, espe-
cially if there is a large symptomatic component to the effect of the intervention
(Olanow et al. 2009).
As discussed above, many of the assumptions of two-period designs cannot be
verified directly and need to be informed by knowledge of the intervention acquired
outside of the trial. Also, it will likely be difficult with a two-period design to discern
the mechanisms of interventions with a very slow onset and/or offset of a symptom-
atic effect (Holford and Nutt 2011; Ploeger and Holford 2009).
So far, the withdrawal and delayed start designs to detect disease modification
have been used mainly in the context of neurodegenerative disease. Definitive
demonstration of the disease-modifying effect of an intervention has not been
achieved to date with these designs, and the experience in the ADAGIO trial
illustrates some of the difficulties in achieving this goal. Additional experience
with these designs and the development of strategies to address their limitations will
eventually determine their usefulness in detecting the disease-modifying effects of
interventions.

Cross-References

▶ Estimands and Sensitivity Analyses
▶ Missing Data

References
Ahlskog JE, Uitti RJ (2010) Rasagiline, Parkinson neuroprotection, and delayed-start trials: still no
satisfaction? Neurology 74:1143–1148
Athauda D, Foltynie T (2016) Challenges in detecting disease modification in Parkinson’s disease
clinical trials. Parkinsonism Relat Disord 32:1–11
Bhattaram VA, Siddiqui O, Kapcala LP, Gobburu JV (2009) Endpoints and analyses to discern
disease-modifying drug effects in early Parkinson’s disease. AAPS J 11:456–464
Carpenter JR, Roger JH, Kenward MG (2013) Analysis of longitudinal trials with protocol
deviation: a framework for relevant, accessible assumptions, and inference via multiple impu-
tation. J Biopharm Stat 23:1352–1371
Clarke CE (2004) A “cure” for Parkinson’s disease: can neuroprotection be proven with current trial
designs? Mov Disord 19:491–498
Clarke CE (2008) Are delayed-start design trials to show neuroprotection in Parkinson’s disease
fundamentally flawed? Mov Disord 23:784–789
Cummings JL (2009) Defining and labeling disease-modifying treatments for Alzheimer’s disease.
Alzheimers Dement 5:406–418
Cummings J (2017) Disease modification and neuroprotection in neurodegenerative disorders.
Transl Neurodegener 6:25. https://doi.org/10.1186/s40035-017-0096-2
D’Agostino RB Jr (1998) Propensity score methods for bias reduction in the comparison of a
treatment to a non-randomized control group. Stat Med 17:2265–2281
D’Agostino RB Sr (2009) The delayed-start study design. N Engl J Med 361:1304–1306
Dorsey ER, Bloem BR (2018) The Parkinson pandemic – a call to action. JAMA Neurol 75:9–10
Emery P, Breedveld FC, Hall S, Durez P, Chang DJ, Robertson D, Singh A, Pedersen RD, Koenig
AS, Freundlich B (2008) Comparison of methotrexate monotherapy with a combination of
methotrexate and etanercept in active, early, moderate to severe rheumatoid arthritis (COMET):
a randomised, double-blind, parallel treatment trial. Lancet 372:375–382
Guimaraes P, Kieburtz K, Goetz CG, Elm JJ, Palesch YY, Huang P, Ravina B, Tanner CM, Tilley
BC (2005) Non-linearity of Parkinson’s disease progression: implications for sample size
calculations in clinical trials. Clin Trials 2:509–518
Holford N (2015) Clinical pharmacology = disease progression + drug action. Br J Clin Pharmacol
79:18–27
Holford NHG, Nutt JG (2011) Interpreting the results of Parkinson’s disease clinical trials: time for
a change. Mov Disord 26:569–577
International Conference on Harmonization (2017) ICH E9 (R1) addendum on estimands and
sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials:
Step 2b, 16 June 2017
Kaye JA (2000) Methods for discerning disease-modifying effects in Alzheimer disease treatment
trials. Arch Neurol 57:312–314
Kieburtz K (2006) Issues in neuroprotection clinical trials in Parkinson’s disease. Neurology 66(Suppl 4):S50–S57
Leber P (1996) Observations and suggestions on antidementia drug development. Alzheimer Dis
Assoc Disord 10(Suppl 1):31–35
Li JD, Barlas S (2017) Divergence effect analysis in disease-modifying trials. Statist Biopharm Res
9:390–398
Little RJA, Rubin DB (2002) Statistical analysis with missing data. John Wiley and Sons, Hoboken
Little R, Yau L (1996) Intent-to-treat analysis for longitudinal studies with drop-outs. Biometrics
52:1324–1333
Liu GF, Pang L (2017) Control-based imputation and delta-adjustment stress test for missing data
analysis in longitudinal clinical trials. Statist Biopharm Res 9:186–194
Liu-Seifert H, Andersen SW, Lipkovich I, Holdridge KC, Siemers E (2015) A novel approach to
delayed-start analyses for demonstrating disease-modifying effects in Alzheimer’s disease.
PLoS One 10(3):e0119632. https://doi.org/10.1371/journal.pone.0119632
Mallinckrodt CH, Lane PW, Schnell D, Peng Y, Mancuso JP (2008) Recommendations for the
primary analysis of continuous endpoints in longitudinal clinical trials. Drug Inf J 42:303–319
Mallinckrodt CH, Lin Q, Lipkovich I, Molenberghs G (2012) A structured approach to choosing
estimands and estimators in longitudinal clinical trials. Pharm Stat 11:456–461
Mallinckrodt C, Molenberghs G, Rathmann S (2017) Choosing estimands in clinical trials with
missing data. Pharm Stat 16:29–36
McDermott MP, Hall WJ, Oakes D, Eberly S (2002) Design and analysis of two-period studies of
potentially disease-modifying treatments. Control Clin Trials 23:635–649
Mills E, Heels-Ansdell D, Kelly S, Guyatt G (2007) A randomized trial of pegaptanib sodium for
age-related macular degeneration used an innovative design to explore disease-modifying
effects. J Clin Epidemiol 60:456–460
Molenberghs G, Kenward MG (2007) Missing data in clinical studies. John Wiley and Sons,
Chichester
Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ (2004)
Analyzing incomplete longitudinal clinical trial data. Biostatistics 5:445–464
National Research Council (2010) The prevention and treatment of missing data in clinical trials.
National Academies Press, Washington, DC
O’Kelly M, Ratitch B (2014) Clinical trials with missing data: a guide for practitioners. John Wiley
and Sons, Chichester
Offen W, Chuang-Stein C, Dmitrienko A, Littman G, Maca J, Meyerson L, Muirhead R, Stryszak P,
Baddy A, Chen K, Copley-Merriman K, Dere W, Givens S, Hall D, Henry D, Jackson JD,
Krishen A, Liu T, Ryder S, Sankoh AJ, Wang J, Yeh C-H (2007) Multiple co-primary endpoints:
medical and statistical solutions. Drug Inf J 41:31–46
Olanow CW, Hauser RA, Jankovic J, Langston W, Lang A, Poewe W, Tolosa E, Stocchi F,
Melamed E, Eyal E, Rascol O (2008) A randomized, double-blind, placebo-controlled, delayed
start study to assess rasagiline as a disease modifying therapy in Parkinson’s disease (the
ADAGIO study): rationale, design, and baseline characteristics. Mov Disord 23:2194–2201
Olanow CW, Rascol O, Hauser R, Feigin PD, Jankovic J, Lang A, Langston W, Melamed E, Poewe
W, Stocchi F, Tolosa E, the ADAGIO Study Investigators (2009) A double-blind, delayed-start
trial of rasagiline in Parkinson’s disease. N Engl J Med 361:1268–1278
Ploeger BA, Holford NHG (2009) Washout and delayed start designs for identifying disease
modifying effects in slowly progressive diseases using disease progression analysis. Pharm
Stat 8:225–238
Sano M, Ernesto C, Thomas RG, Klauber MR, Schafer K, Grundman M, Woodbury P, Growdon J,
Cotman CW, Pfeiffer E, Schneider LS, Thal LJ (1997) A controlled trial of selegiline, alpha-
tocopherol, or both as treatment for Alzheimer’s disease. N Engl J Med 336:1216–1222
Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, Boca Raton
Schapira AHV, Albrecht S, Barone P, Comella CL, McDermott MP, Mizuno Y, Poewe W, Rascol O,
Marek K (2010) Rationale for delayed-start study of pramipexole in Parkinson’s disease: the
PROUD study. Mov Disord 25:1627–1632
Sormani MP, Bruzzi P (2013) MRI lesions as a surrogate for relapses in multiple sclerosis: a meta-
analysis of randomised trials. Lancet Neurol 12:669–676
Tang Y (2017) An efficient multiple imputation algorithm for control-based and delta-adjusted
pattern mixture models using SAS. Statist Biopharm Res 9:116–125
The Parkinson Study Group (1989) Effect of deprenyl on the progression of disability in early
Parkinson’s disease. N Engl J Med 321:1364–1371
The Parkinson Study Group (1993) Effects of tocopherol and deprenyl on the progression of
disability in early Parkinson’s disease. N Engl J Med 328:176–183
The Parkinson Study Group (2004) Levodopa and the progression of Parkinson’s disease. N Engl J
Med 351:2498–2508
Vellas B, Andrieu S, Sampaio C, Wilcock G, the European Task Force Group (2007) Disease-
modifying trials in Alzheimer’s disease: a European task force consensus. Lancet Neurol
6:56–62
Vellas B, Andrieu S, Sampaio C, Coley N, Wilcock G, the European Task Force Group (2008)
Endpoints for trials in Alzheimer’s disease: a European task force consensus. Lancet Neurol
7:436–450
Whitehouse PJ, Kittner B, Roessner M, Rossor M, Sano M, Thal L, Winblad B (1998) Clinical trial
designs for demonstrating disease-course-altering effects in dementia. Alzheimer Dis Assoc
Disord 12:281–294
Xiong C, Luo J, Gao F, Morris JC (2014) Optimizing parameters in clinical trials with a randomized
start or withdrawal design. Comput Statist Data Anal 69:101–113
Zhang RY, Leon AC, Chuang-Stein C, Romano SJ (2011) A new proposal for randomized start
design to investigate disease-modifying therapies for Alzheimer disease. Clin Trials 8:5–14
65 Screening Trials

Philip C. Prorok
Division of Cancer Prevention, National Cancer Institute, Bethesda, MD, USA
e-mail: [email protected]

Contents
Introduction
Design Issues
Endpoints
Sample Size Calculation
Screening Trial Design Options
  Standard or Traditional Two Arm Design
  Continuous Screen Design
  Stop Screen Design
  Split Screen or Close Out Screen Design
  Delayed Screen Design
  Designs Targeting More Than One Intervention and Disease
Analysis Methods
  Follow-Up Analysis
  Evaluation Analysis
Monitoring an Ongoing Screening Trial
Conclusion
Cross-References
References

Abstract
The most rigorous approach to evaluating screening interventions for the early
detection of disease is the randomized controlled trial (RCT). RCTs are major
undertakings requiring substantial resources to enroll and follow large
populations over long time periods. Consequently, it is important that such trials
be carefully conducted to ensure high quality information and scientifically valid
results. The purpose of this chapter is to discuss some of the intricacies of
screening trial design, analysis, and monitoring. General design considerations
include the choice of interval between screens, the number of screening rounds,
and duration of follow-up. A crucial issue in screening trials is choice of the
proper outcome measure. This should reflect the impact of the intervention on the
clinical outcome for the disease of interest. In cancer screening, the most valid
endpoint is the trial population cause-specific mortality. Concerns about lead time
bias, length bias and overdiagnosis bias that render other endpoints questionable
are discussed. Following presentation of an approach to sample-size calculation
for these trials, there is a discussion of commonly employed data analysis
methods, including comparison of cause-specific mortality rates between
screened and control arms as the primary analysis. Lastly there is a discussion
of topics to address in monitoring an evolving screening trial. Examples from
completed or ongoing cancer screening trials are used throughout the
presentation.

Keywords
Screening · Early detection · Lead time · Length bias · Cancer

Introduction

Screening for the early detection of disease is considered by many to be an obvious
intervention strategy to help alleviate the burden of various diseases, particularly
cancer. However, it is not always recognized that screening interventions are not
automatically beneficial and that there are real or potential harms and costs associ-
ated with screening. Therefore, before screening is introduced into a population it is
important that a screening test and associated screening program be carefully
evaluated to ensure the benefits outweigh the harms. It is widely recognized that
the randomized controlled trial (RCT) is the most scientifically valid approach to
accomplish this. This chapter is a discussion of issues in the design, analysis, and
monitoring of such trials to evaluate disease screening, with examples drawn from
the cancer screening literature.

Design Issues

The term clinical trial often brings to mind the concept of an investigation aimed at
testing a clinical intervention or treatment in a group of individuals. The trial
participants are patients who have been diagnosed with some disease and have
sought treatment to alleviate their condition. Such therapy trials typically involve a
few hundred to perhaps a few thousand patients, last for perhaps a few years, and
seek to improve a clinical outcome such as reduced recurrence rate or improved
survival rate. Many such trials have been performed by cooperative groups and other
organizations in various countries. In contrast, relatively few screening trials have
been conducted due to their size, cost and duration. They generally involve thou-
sands of ostensibly healthy participants followed for many years to determine if the
screening intervention reduces the disease related death rate in the screened popu-
lation. Given these contrasting features of screening trials compared to therapy trials,
and acknowledging many well-known requirements of clinical trials in general, it is
important to give careful thought to a number of key considerations in designing
screening trials.
Informed consent is an initial consideration in screening trial design. Both pre and
post randomization consent have been used. In post randomization consent, partic-
ipants are chosen from nationwide or regional registration rolls, for example, and
randomly assigned to the trial arms. Those in the control arm receive their usual
medical care, and sometimes are not informed that they are in a trial. Those in the
intervention arm are asked to consent to screening after being randomized (e.g.,
Bretthauer et al. 2016). This approach has the advantage of being population based,
and the participants in the control arm are less likely to undergo the screening
procedure since they are not aware of the study. One disadvantage is that interven-
tion arm participants have to choose to be screened after they are already in the trial,
and invariably some do not, thereby reducing compliance and diluting any effect of
the screening. Further, it may be difficult to obtain information other than vital status
about control arm individuals because they have not agreed to participate. There
might also be ethical concerns about entering individuals into a study which they do
not know about.
Prerandomization consent, on the other hand, requires informed consent from all
participants before randomization into study and control arms (e.g., NLST Research
Team 2011; Prorok et al. 2000). This method may lead to greater compliance in the
screening arm and allows the collection of similar detailed information from both the
study and control arms because all participants agree to be part of the study. A
disadvantage is that it may be more difficult to recruit participants because many
may refuse randomization. There may also be substantial contamination in the
control arm because the controls are aware of the screening tests being used and
could, in theory, seek them elsewhere. This would also dilute any screening effect.
A major issue is the question of whether an available test is ready for evaluation in
a large scale randomized trial, and/or how to choose among several candidate tests.
There are no straightforward scientific answers since a standard set of criteria does
not exist. Hopefully there are preliminary data providing estimates of the key process
measures of the test: sensitivity (the probability of being test positive when disease is
present), specificity (the probability of being test negative when disease is absent),
and positive predictive value (the probability of having disease when the test is
positive). However, these data often emanate from studies involving small numbers
of individuals in a clinical setting, few of whom have preclinical disease that is the
target of a population screening program. Even when appropriate data exist, agreed-
upon threshold values for these parameters that would trigger the decision to
undertake a trial do not exist. It seems clear, however, that for population screening,
particularly for a relatively rare disease such as cancer, there is a requirement for very
high specificity (on the order of 95% or higher) because of low disease prevalence,
while sensitivity need not be so high, although a value of at least 80% is often
deemed preferable.
The issue also arises as to the number of screens or screening rounds and the
interval between screens to be used in a trial. The interval between screens is
typically chosen to be 1 or 2 years (e.g., Prorok 1995), although irregular intervals
have been used, but these may be more difficult to implement in practice in terms of
participant compliance. The number of screening rounds depends on the tradeoff
between a sufficient number to produce a statistically valid effect on the primary
outcome measure, if there is one, and the cost of adding additional rounds. Although
some cancer screening trials have involved screening for essentially the entire
follow-up period (Tabar et al. 1992), most have employed an abbreviated screening
period typically involving four or five screening rounds, with a subsequent follow-
up period devoid of screening (e.g., Miller et al. 1981; Shapiro et al. 1988). These
issues can be addressed using mathematical modeling (e.g., NLST Research Team
2011).
As an example, in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer
screening trial, the initial choice of four annual screens, at baseline plus three annual
re-examinations, was later expanded to six annual screens for PSA testing for
prostate cancer and CA125 testing for ovarian cancer. This was a trade-off between
enough screens to produce an effect and the anticipated resources (Prorok et al. 2000).
Three or four screening rounds were sufficient in some breast cancer screening trials
(e.g., Shapiro et al. 1988; Tabar et al. 1992). The annual interval between screens
was chosen as the most frequent yet practical interval if screening is shown to be
effective. Compared to less frequent screening, an annual interval also increases the
likelihood of detection of a broad spectrum of the preclinical conditions in the
natural history of the cancers under study. A longer interval might allow some
rapidly growing lesions, which might be a source of mortality but which could be
cured if found early, to escape detection.
Another design consideration involves the relationship between study duration,
sample size, and the expected timing of any effect or achievement of a maximal
effect. Sample size and study duration are inversely related. If only these two
parameters were involved, the relationship between follow-up cost versus recruit-
ment and screening cost would determine the design. For example, if follow-up costs
were substantial compared with those of recruitment and screening, a relatively
larger population would be recruited that would be screened and followed for a
shorter period to achieve the desired statistical validity. However, the issue of the
time at which the screening effect (reduction in mortality, see below) may occur must
also be considered. For those cancer screening trials that have demonstrated an
effect, a separation in the mortality rates between the screened and control groups
has often not begun to occur until 4–5 years or more into the study (e.g., Mandel
et al. 1993; Shapiro et al. 1988). Thus, even with a very large sample size, follow-up
may have to continue for many years to observe the full effect of the screening. A
follow-up period of at least 10 years is common (Prorok and Marcus 2010).
For example, in the PLCO trial a minimum of 10 years of follow-up was initially
decided upon to allow sufficient time for any mortality reduction from screening to
emerge. Follow-up intervals of 7 years or more were typically required in breast
cancer screening trials (e.g., Shapiro et al. 1988; Tabar et al. 1992), and it was
assumed in designing PLCO that the longer natural history of prostate cancer, and
perhaps other cancers under study, warranted a longer follow-up period. In the
National Lung Screening Trial (NLST), modeling of the disease and screening
processes resulted in the decision to capture endpoint events over an approximately
7 year period (NLST Research Team 2011). It must be recognized that these and
other design parameter choices were based on the best information at the time. In
some circumstances the value of a design parameter is found to be inaccurate once
a trial is underway. One particularly important parameter in this regard is the
control arm event rate. In the Minnesota trial of fecal occult blood testing for
colorectal cancer, the initially estimated screening and follow-up periods were both
extended to provide the opportunity for valid findings to emerge (Mandel et al.
1993).

Endpoints

The appropriate and most meaningful endpoint in a screening study is the clinical
event that the screening is aimed at preventing. For major chronic diseases such as
diabetes or cancer the intent of screening is to find the disease in an early phase so
that treatment can be initiated sooner, thereby preventing the most consequential
clinical outcome of such diseases, which is death (e.g., Echouffo-Tcheugui and
Prorok 2014; Prorok 1995). Particularly in cancer screening, the most valid endpoint
is the trial population cancer-specific mortality rate. This is the number of deaths
from the target cancer per unit time per unit population at risk (e.g., Prorok 1995).
The mortality rate provides a combined assessment of the impact of early detection
plus therapy. The unequivocal demonstration of reduction in the cancer mortality
rate for a population offered screening is justification for the cost of a screening
program and fulfills the implicit promise of benefit to those who elect to participate
in the program.
Careful study design and long-term follow-up of large populations are generally
required to obtain an accurate estimate of a mortality reduction. Consequently,
intermediate or surrogate outcome measures have been proposed. There are, how-
ever, critical shortcomings associated with these end points (Prorok 1995). The
shortcomings are a consequence of well-known biases that occur in screening pro-
grams: lead time bias, length bias, and overdiagnosis bias.
If an individual participates in a screening program, his or her disease may be
detected earlier than it would have been in the absence of screening. The amount of
time by which the diagnosis is advanced as a result of screening is called the lead
time. Because of the lead time, the point of diagnosis is advanced and survival as
measured from diagnosis is automatically lengthened for cases detected by screening
even if length of life is not increased. This is referred to as lead-time bias and renders
the case survival endpoint invalid (Prorok 1995).
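The mechanics of lead-time bias are easy to demonstrate by simulation. In the minimal sketch below (all distributions are illustrative assumptions), screen detection advances the date of diagnosis but, by construction, leaves the date of death unchanged; survival measured from diagnosis nonetheless appears longer among the screened cases.

import numpy as np

rng = np.random.default_rng(7)
n = 100_000
t_dx = rng.exponential(5.0, n)             # time from (unobserved) onset to clinical diagnosis
t_death = t_dx + rng.exponential(3.0, n)   # time from onset to death from the disease
lead = rng.uniform(0.0, 2.0, n)            # lead time gained by screen detection

surv_clinical = t_death - t_dx                              # survival from clinical diagnosis
surv_screened = t_death - np.clip(t_dx - lead, 0.0, None)   # survival from earlier screen detection

print("mean survival from diagnosis, unscreened:", round(surv_clinical.mean(), 2))
print("mean survival from diagnosis, screened:", round(surv_screened.mean(), 2))
print("mean time from onset to death, both scenarios:", round(t_death.mean(), 2))

The death times are identical in the two scenarios, yet measured survival is longer under screening by roughly the mean lead time.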
Length bias is the phenomenon that cases of disease detected by a screening
program are not a random sample from the general distribution of cases of preclinical
disease in the screened population. Instead, cases with longer duration preclinical
disease are overrepresented among the detected cases (Kafadar and Prorok 2009;
Prorok 1995). If, as seems reasonable, disease with long preclinical duration is slow-
growing preclinical disease that then progresses to slow-growing clinical disease, it
follows that cases of disease with more favorable progression rates are the ones more
likely to be detected by screening. Therefore, screen-detected cases will tend to have
characteristics of good prognosis, such as lack of involvement of regional lymph
nodes or longer survival from diagnosis. These good-prognosis cases have a more
favorable outcome even in the absence of screening.
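Length bias can be demonstrated just as simply. In the sketch below (uniform onset times and an exponential sojourn-time distribution are illustrative assumptions), a single screen at a fixed calendar time preferentially detects the cases with long preclinical durations.

import numpy as np

rng = np.random.default_rng(42)
n = 200_000
onset = rng.uniform(0.0, 50.0, n)    # calendar time at which the preclinical phase begins
sojourn = rng.exponential(2.0, n)    # preclinical duration; longer = slower growing
screen_time = 25.0                   # calendar time of the single screen

# A case is screen-detectable if it is in the preclinical phase when the screen occurs
detected = (onset <= screen_time) & (onset + sojourn > screen_time)
print("mean sojourn time, all cases:", round(sojourn.mean(), 2))
print("mean sojourn time, screen-detected cases:", round(sojourn[detected].mean(), 2))

With an exponential sojourn distribution, the mean preclinical duration among the screen-detected cases approaches twice the overall mean, the classical length-biased sampling result.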
Overdiagnosis bias is related to the concepts of lead-time bias and length bias, and
can be considered an extreme form of length bias. One can postulate the existence of a
nonprogressive or regressive preclinical disease state in which cases of the disease are
detectable by the screening test but would not progress to clinical disease during the
person’s lifetime in the absence of screening. This is a major concern in screening for
several cancers including prostate cancer and breast cancer (e.g., Andriole et al. 2012;
Welch et al. 2016). The detection of such cases cannot benefit the individual, but such
cases remain preclinical over a long time and, with repeated screenings, are therefore
more likely to be detected. The counterparts to these cases never surface clinically in
the control arm of a trial. Thus, there will be a higher proportion of early-stage cases in
the screened arm even if there is no mortality effect from screening.
Three often proposed alternative endpoints are case-finding rate or yield, case
survival, and stage of disease. The case finding rate or incidence rate can be an early
clue as to whether screening might be having an effect, as more cases should be
detected in the presence than in the absence of screening. However, this rate
generally yields little information on the effect of the screening program on disease
outcome (but see discussion below on incidence rate). Case finding should increase
in a screened population, at least initially, relative to an unscreened population,
because of lead-time bias. This can happen whether or not there is a mortality effect.
Furthermore, some borderline lesions found by modern screening modalities may
not be progressive disease. This results in overdiagnosis bias, as noted above. If this
occurs, individuals are treated unnecessarily and exposed to other possible risks of
screening. Thus, an increased disease rate in a screening program, in and of itself, is
only an indication of increased cost.
In contrast to mortality, which is a population measure, the case survival rate
(see ▶ Chap. 89, “Survival Analysis II”) refers only to cases of the target disease
within a population. The N-year survival rate is defined as the number of cases alive
after N years of observation divided by the number of cases diagnosed at the
beginning of the time period. Because there are losses to follow-up, this measure
is ordinarily calculated using life table methods. Survival does address the final
outcome of disease and suggests that screening could be effective. However, it may
not accurately reflect mortality because of lead time and length biases.
If screening is effective, this should be reflected in an increased case survival rate
as well as a reduction in the population mortality rate. However, any observed
increase in survival from time of diagnosis is, at least in part, a reflection of lead time.
For any case of disease that is screen detected, it is impossible to distinguish between
a true increase in survival time and an artificial increase due to lead time because lead
time cannot be directly observed for ethical reasons. Further, there is no universally
accepted procedure to estimate lead time or to adjust survival for lead time. Thus
case survival is not a valid measure of screening effectiveness.
Furthermore, even if one could adjust for lead time, length bias could still
confound survival comparisons. In comparing survival of cases in two groups, for
example between two subgroups of cases detected by different screening modalities,
cases in one subgroup may have a different distribution of natural histories than the
cases in another subgroup because of a modality-dependent sampling effect. Even if
one could adjust for lead time, any remaining survival difference could simply be a
consequence of the difference in disease natural history between the two subgroups
caused by differing sampling bias. Methodology has been developed to explore the
length bias effect on survival (e.g., Kafadar and Prorok 2009), but no general
methodology exists to either estimate the magnitude of a length bias effect or to
adjust survival for length bias. Approaches to separating the effects of treatment,
lead time, and length bias in certain circumstances have been proposed (Duffy et al.
2008; Morrison 1982).
Stage of disease at diagnosis, or a related prognostic categorization, can also be
used as an early indicator of screening effect, but it can be misleading and is
unsatisfactory as a final end point. The relationship between the magnitude of a
shift in the stage distribution of cases as a result of screening and the magnitude of a
reduction in mortality is not usually known. The detection of in situ or borderline
lesions can also affect the stage distribution but should have little impact, if any, on
mortality. The problem is most pronounced for stage I or localized cases where lead
time and length bias can lead to slow-growing, even nonprogressive, cases being
detected in stage I in a screened arm to a greater extent than in a control arm. Some
counterpart cases in the control arm may never surface clinically. As a result, the
screened arm will contain a higher proportion of stage I cases even if screening has
no effect on mortality. Or, the magnitude of a real mortality effect could be exag-
gerated by focusing on stage of disease. Thus, a proportional stage shift in a screened
arm can be a sign of early detection, but it is insufficient evidence to conclude that
there is an improvement in disease outcome.
A related measure that can be a reasonable surrogate endpoint in some screening
circumstances is the population incidence rate of advanced-stage disease. The
overall incidence rate or the rate of early-stage disease should increase with screen-
ing, as discussed above, rendering these measures invalid as endpoints. However, if
screening reduces the rate of advanced disease, disease that has metastasized and/or
is likely to lead to death, then it is reasonable to expect that the death rate from the
disease will also be reduced. Whether this is a valid substitute for mortality must be
established in a given setting. Advanced-stage disease must first be defined, then the
relationship between advanced disease and mortality must be established in properly
designed studies. Advanced stage rate is the primary endpoint in a breast cancer
screening trial comparing digital mammography with tomosynthesis (Pisano 2018).
Some screening tests for cancer, such as tests for cervical cancer and colorectal
cancer, do detect true precursor lesions. The subsequent removal of these lesions
then prevents the cancer from ever being clinically diagnosed, and consequently the
incidence rate of the cancer is reduced. The incidence rate is a meaningful endpoint
in such circumstances, but it is important to monitor the mortality rate as well, since
it is possible that cancers that are eliminated are not a major source of cancer deaths,
and so there may not be a direct correspondence between incidence effect and
mortality effect.

Sample Size Calculation

A crucial element of trial design is calculation of the number of participants required
for the trial. There are well-known clinical trial sample size calculation methods (see
other chapters in this book) that could potentially be adapted to screening trials. Also,
statistical formulas can be supplemented with modeling to tailor the calculations to a
specific trial (e.g., NLST Research Team 2011). Whatever the approach, in screening
there are several key issues that must be addressed. In particular, since screening trial
participants are ostensibly healthy, they may, despite having given informed consent, be inclined
not to undergo the screening test. Alternatively, those assigned to a control arm
might become aware of the intervention and get tested outside the trial protocol.
Thus noncompliance in both arms is an issue. Further, it is well known that
individuals who volunteer to participate in screening trials are not typical of the
general population, generally being healthier (e.g., Pinsky et al. 2007; Shapiro et al.
1988). This healthy screenee bias must be accounted for in sample size calculations.
One relatively straightforward approach to screening trial sample size calculation
is that used in the PLCO trial (Prorok et al. 2000). Let $N_C$ be the number of
individuals randomized to the control arm and $N_S$ be the number randomized to
the screened arm, with $N_S = f N_C$. The trial is designed to detect a
$(1-r) \times 100\%$ reduction ($0 < r < 1$) in the cumulative disease-specific death
rate over the duration of the trial. Further, let $P_C$ be the proportion of individuals
in the control arm who comply with the usual-care protocol and $P_S$ be the
proportion of individuals in the screened arm who comply with the screening
protocol. The total number of disease-specific deaths needed for a one-sided
$\alpha$-level significance test with power $1-\beta$ is given by

D = \frac{\left[(Q_C + f Q_S)\, Z_{1-\alpha} - \sqrt{Q_C Q_S}\,(1+f)\, Z_\beta\right]^2}{f\,(Q_C - Q_S)^2}

where $Q_C = r + (1-r)P_C$ and $Q_S = 1 - (1-r)P_S$. The number of participants in
the control arm is given by

N_C = \frac{D}{(Q_C + f Q_S)\, R_C\, Y}

where $Y$ is the duration of the trial from entry to end of follow-up in years and $R_C$
is the average annual disease-specific death rate in the control arm expressed in
deaths per person per year, adjusted for healthy screenee bias.
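These formulas are straightforward to program. The following is a direct transcription into Python; the inputs in the example call (a 25% mortality reduction, 90% and 85% compliance, equal allocation, a control-arm rate of 0.002 deaths per person-year, and a 10-year trial) are illustrative values, not the PLCO design parameters.

from math import sqrt
from scipy.stats import norm

def screening_sample_size(r, p_c, p_s, f, r_c, y, alpha=0.05, power=0.90):
    q_c = r + (1 - r) * p_c      # effective relative death rate, control arm
    q_s = 1 - (1 - r) * p_s      # effective relative death rate, screened arm
    z_a = norm.ppf(1 - alpha)    # one-sided critical value Z_{1-alpha}
    z_b = norm.ppf(1 - power)    # Z_beta, negative when power > 0.5
    d = ((q_c + f * q_s) * z_a - sqrt(q_c * q_s) * (1 + f) * z_b) ** 2 \
        / (f * (q_c - q_s) ** 2)
    n_c = d / ((q_c + f * q_s) * r_c * y)
    return d, n_c

d, n_c = screening_sample_size(r=0.75, p_c=0.90, p_s=0.85, f=1.0, r_c=0.002, y=10)
print(f"required deaths D = {d:.0f}, control-arm size N_C = {n_c:.0f}")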

Screening Trial Design Options

Standard or Traditional Two Arm Design

Most screening trials have used a traditional or standard two arm design targeting one
disease and aimed at addressing the basic question of whether the screening interven-
tion results in a reduction in cause-specific mortality. Participants in one arm receive
the screening test for a given disease and those in the other arm serve as a control
(unscreened or usual care) (e.g., Shapiro et al. 1988; Schroder et al. 2014; Yousaf-
Khan et al. 2017). Other standard trials have addressed the effect of adding one
screening modality to another (e.g., Miller et al. 1981). A related three arm design
has been used to compare different frequencies of screening (Mandel et al. 1993).
Several variants of this standard design are now discussed (Etzioni et al. 1995).

Continuous Screen Design

A natural design approach is to randomize individuals to an intervention or a control
arm and offer periodic screening in the intervention arm throughout the trial. If the trial
is of very long duration, the screening intervention in this design approximates
population screening over a long age range, such as might happen in a national public
health program. However, a drawback of this design is the potentially prohibitive cost
of screening all intervention group participants for the duration of the trial.

Stop Screen Design

The Stop Screen design is similar to the Continuous Screen design, except that
screening is offered for only a limited time in the intervention arm while follow-up
continues. Both arms are followed for the mortality endpoint until the end of the trial.
This is the design of choice when it is anticipated that a long time will be required
before a reduction in mortality can be expected to emerge, and when it would be
expensive or difficult to continue the periodic screening for the entire trial period.
Examples of this design are the Health Insurance Plan (HIP) of Greater New York
Breast Cancer Screening Study (Shapiro et al. 1988), the PLCO trial (Prorok et al.
2000), and the European prostate cancer screening trial (Schroder et al. 2014). As an
illustration, the HIP trial randomized 62,000 women aged 40–64. The intervention
arm was offered four annual screens consisting of two-view mammography and
clinical breast examinations. The screens were offered at entry and for the next
3 years. Women in the control arm followed their usual medical practices. Although
screening ended after 3 years, follow-up continued to year 15. By restricting the
screening period, the Stop Screen design can result in a considerable saving in cost
and effort relative to the Continuous Screen design. Importantly, the Stop Screen
design is the only one that allows a direct assessment of overdiagnosis, provided
compliance is high and follow-up is complete. However, analysis of the Stop Screen
design can be more complex than that of the Continuous Screen design. This is
because the difference in disease-specific mortality between the two arms may be
diluted by deaths that arise from cancers that develop in the intervention arm after
screening stops. (See Analysis section below).

Split Screen or Close Out Screen Design

The Split Screen design is related to the Stop Screen design. The difference is that at the
time the last screen is offered to the intervention arm, a screen is also offered to all
participants in the control arm. The Stockholm Breast Cancer screening trial is an
example of this design. (Friskell et al. 1991) Women were randomized to intervention
or control arms. The intervention was single-view mammography at an initial round
then two succeeding rounds performed 24–28 months apart. The control group was
offered a single screen, at approximately 4.5 years after study entry. One potential
advantage of this design is that comparable groups of cancer cases in the two trial arms
can theoretically be identified, which can potentially enhance the analysis (See Analysis
section). A downside is that some of the control arm cancers detected by screening may
benefit, and if so, any screening benefit in the intervention arm will be diluted.

Delayed Screen Design

In the Delayed Screen design, periodic screening is offered to control arm partici-
pants starting at some time after the start of the study, then screening continues in
both arms until the end of the intervention period. The UK Breast Cancer Screening
Age Trial followed this design (Moss et al. 2015). Women in the intervention arm
were offered annual screening starting at age 39–41 and continuing to age 47–48,
then at age 50–52 all women in both arms were offered periodic screening as part of
the National Health Care Program. Thus one can assess the impact of starting
periodic screening at age 39–41 relative to waiting until age 50–52. This design is
well suited for the situation where screening is the standard of care beginning at a
certain age, and the research question centers on the marginal benefit of introducing
screening at an earlier age.

Designs Targeting More Than One Intervention and Disease

As noted, RCTs to assess early detection interventions face several challenges. It is
necessary to recruit large numbers of healthy participants and follow them for many
years, with consequent expenditure of substantial resources. There is therefore
interest in exploring more efficient ways to conduct trials so as to share resources
and participant pools. A study design that can answer multiple questions in a single
study is one possible approach. Options include factorial, reciprocal control, and all-
versus-none designs (Freedman and Green 1990). These designs have rarely been
considered in screening, but the latter was used in the PLCO trial.
A major design issue for the PLCO trial was whether to undertake separate trials
for each of the four cancer sites and corresponding screening modalities under
investigation or combine them. An examination of the costs and logistics of separate
trials resulted in the decision to conduct one combined trial. The reciprocal control
and all-versus-none designs were the primary options (Prorok et al. 2000). The
reciprocal control design would have had three arms: one devoted to screening for
prostate or ovarian cancer, the second to colorectal cancer screening, and the third to
lung cancer screening. Since screening would be undertaken for only one cancer site
per gender in any given arm, the other two arms combined would serve as controls.
This design was not deemed feasible because of the cost of bringing all participants
in for screening and the anticipated substantial levels of contamination, because all
participants would be aware that participants in the other arms were receiving other
screening tests, which they would then request. A two arm all-versus-none design was
chosen instead. One arm served as a control, while screening for all cancers was
done in the other arm, in the spirit of a multiphasic screening endeavor. Use of the
all-versus-none design required the reasonable assumptions, for the cancers and
screening tests in PLCO, that the tests for each cancer do not detect any of the
other cancers, and that the endpoints, death from each of the four cancers, are not
related. In other circumstances these assumptions might not be as tenable.

Analysis Methods

Follow-Up Analysis

As discussed above (Endpoint section), for a screening RCT targeting a chronic
disease, the only generally valid end point is mortality. Specifically, some appropriate
measure of the target disease mortality from entry to the end of follow-up in the
population randomized to the intervention group is compared with that in the popu-
lation randomized to the control group. All deaths from the target disease that occur
throughout the trial in both arms are analyzed, including all that occur after screening
ceases if the trial does not use a Continuous Screen design. This has been termed a
follow-up analysis (Nystrom et al. 1993). This approach includes all endpoint events
that occur after randomization and is therefore consistent with the intent-to-screen
principle. This analysis should be done and reported for any screening trial.
Mortality in a particular trial arm can be measured by several quantities, including
(1) the average annual or cumulative mortality, which is the ratio of the number of
deaths from the disease of interest to the number of individuals randomized, (2) the
average annual or cumulative mortality rate, which is the ratio of the number of
deaths from the disease of interest to the number of person-years at risk of dying of
the disease, and (3) the survival distribution of the population using death from the
disease of interest as the endpoint, with the time of entry into the trial as the time
origin. To assess whether or not the screening intervention is of benefit, either the
difference or the ratio of the intervention and control group mortalities can be used.
The former is a measure of the absolute change in mortality due to the screening,
while the latter is a measure of the relative mortality change due to the screening.
Rate ratios, rate differences, and their confidence intervals can readily be calculated
(e.g., Ahlbom 1993).
Various statistical procedures can be used to test formally for a difference in the
mortality experience between the randomized arms. For the first measure, standard
procedures for comparing two proportions are available, such as Fisher’s exact test.
The cumulative mortality rates can be tested using Poisson methods for comparing
two groups. A test statistic is

Z = \frac{PY_S\, D_C - PY_C\, D_S}{\left\{ PY_C\, PY_S\,(D_C + D_S) \right\}^{1/2}},

where $D_C$ is the number of deaths from the disease of interest in the control arm
through the time of analysis, $D_S$ is the corresponding number of deaths in the
screened arm, $PY_C$ is the number of person-years at risk of death from the disease
of interest in the control arm through the time of analysis, and $PY_S$ is the
corresponding number of person-years in the screened arm. This statistic has an
approximately standard normal distribution. For comparing the survival distribu-
tions, nonparametric tests such as the logrank test are used. It is important to note that
these analyses involve all individuals randomized to the respective trial arms.
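As a concrete illustration, the sketch below computes the Poisson comparison statistic defined above, together with a rate ratio and its usual log-scale 95% confidence interval; the death counts and person-years are invented for illustration.

from math import sqrt, log, exp
from scipy.stats import norm

d_c, py_c = 420, 650_000   # control-arm deaths and person-years (illustrative)
d_s, py_s = 340, 655_000   # screened-arm deaths and person-years (illustrative)

# Poisson comparison statistic from the text
z = (py_s * d_c - py_c * d_s) / sqrt(py_c * py_s * (d_c + d_s))
p_one_sided = 1 - norm.cdf(z)

# Rate ratio (screened vs. control) with a log-scale 95% CI (e.g., Ahlbom 1993)
rr = (d_s / py_s) / (d_c / py_c)
se = sqrt(1 / d_s + 1 / d_c)
ci = (exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se))

print(f"Z = {z:.2f}, one-sided p = {p_one_sided:.4f}")
print(f"rate ratio = {rr:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")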
Additional approaches that have been used are Cox proportional hazards regres-
sion and Poisson regression. These methods offer the possibility of a more thorough
exploration of screening trial data. Further, with the availability of modern comput-
ing power, randomization tests are an option that should be considered since
these avoid the assumptions required for other procedures (see ▶ Chap. 94,
“Randomization and Permutation Tests”).
Related testing and modeling techniques have been suggested to address the
problem of the optimal timing of a screening trial analysis relative to the appearance
of an effect (e.g., Baker et al. 2002). For several cancer screening trials that have
reported a benefit, a pattern was exhibited where the endpoint rates in the two arms
were roughly equivalent for some random period after the start of the trial, after
which they separated gradually leading to a statistically significant difference (e.g.,
Shapiro et al. 1988; Schroder et al. 2014; Tabar et al. 1992). This implies that the
proportional hazards assumption often invoked in survival analysis does not hold
and other methods of analysis are required. One possibility would be a method that
in a sense ignores the period where there is no difference in the rates and uses only
data from the period where there is a difference. However, such an approach must
account for multiplicity in the choice of the time point when separation of the rates
begins, and must be done with appropriate statistical methods to obtain the correct
variance of the test statistic (Prorok 1995).

Evaluation Analysis

The follow-up analysis is generally the preferred choice, but the method is subject to
bias in the relative effect of the screening if the effect is diluted (described below)
during follow-up. Evaluation analysis is an attempt to adjust for this.
There are many screening trials in which the intervention arm is offered screening
for a limited time only, with the follow-up continuing thereafter to the end of the
study (e.g., see Stop Screen design above). During the period of follow-up after
screening ceases, those in the intervention arm, as is the case for those in the control
arm throughout the study, follow their usual medical care practices. If the post
screening follow-up period is lengthy, the mortality comparison will be subject to
error relative to a study in which screening continues.
The primary problem is that there can be a dilution of the effect in that the
mortality in both arms will become more alike as time from the end of screening
increases. The dilution can occur when some of those dying of the disease are
individuals whose disease was diagnosed during the post screening period. For
such deaths in the intervention arm, it is unlikely that screening could have any
beneficial impact on their mortality. Hence, their inclusion in the analysis dilutes the
screening effect. However, in the control arm, some cases may correspond to cases in
the intervention arm that were screen-detected and that did benefit from the screen-
ing. If, hypothetically, deaths among these control arm cases of the disease were to
be excluded from the analysis, the screening effect is diluted in that the control arm’s
mortality will be underestimated. Thus, deaths from the disease of interest that occur
among cases diagnosed after screening stops, incorrectly included or excluded, can
result in the observed mortalities of the two randomized arms appearing to be more
similar or dissimilar than they should. This can lead to erroneous conclusions about
the effectiveness of the screening program.
An approach to countering this problem is evaluation analysis (Nystrom et al.
1993). This applies to the Split Screen design. Recall in this design participants in the
control arm are screened once at the time of the last screen in the intervention arm.
The evaluation analysis then includes deaths that occur from randomization through
the end of follow-up, but only those deaths from the target disease that occur among
cases diagnosed from the time of randomization through and including the last
screen, in each arm. If the sensitivity of the screening test is very high, this can
create two groups of cases, one in each arm, that are comparable in terms of their
natural history distributions, and hence their expected mortality outcomes in the
absence of screening. Thus, analysis of the deaths confined only to those arising
from the comparable case groups can theoretically provide an unbiased analysis and
eliminate the dilution. A concern, however, is that most screening tests do not
possess very high sensitivity. Further, it is crucial that the control arm screen be
done exactly at the same time as the last screen in the intervention arm, a circum-
stance unlikely to arise in practice. Otherwise, the case groups will likely not be
comparable and the inference about a mortality effect can be biased (Berry 1998).
In some circumstances comparable case groups can arise naturally. This can
happen in a Stop Screen design when the number of cases in the control arm “catches
up” to that in the screened arm at some point during follow-up after screening stops.
Cases in the comparable groups up to the “catch up” point are then the source of
deaths for the mortality analysis. Deaths among cases diagnosed after this point are
excluded thereby mitigating dilution. This situation occurred in the HIP trial, where
at about 5 or 6 years after randomization the cumulative numbers of breast cancer
cases were very similar in the two arms (Shapiro et al. 1988). The mortality measures
and statistical methods used in the follow-up analysis, appropriately modified, can be
used for this analysis (Prorok 1995). However, successfully determining the appro-
priate “catch up” point to identify case groups for this analysis can be problematic.
Of additional concern is that with modern screening tests, there is the likelihood of
overdiagnosis, so that the control arm will never catch up to the screened arm.

Monitoring an Ongoing Screening Trial

Several categories of data and information are anticipated at various stages of a
screening trial. These relate to the population under study, acceptance of the screen-
ing test by the population, results and characteristics of the screening test, harms of
the intervention, and intermediate and final endpoints. These variables should be
examined on a regular basis for evidence to alter the protocol or stop the trial. They
are valuable in assessing the consistency of findings and can be examined within
important strata defined by age, gender, and other risk factors. Categories for
consideration (with particular reference to cancer trials) include:

1. Population Characteristics
The demographic, socioeconomic, and risk characteristics of the study partici-
pants, possibly including dietary and occupational histories. These data are useful
for describing the study population and assessing the comparability of the
screened and control arms and may be used in statistical adjustment procedures.
2. Coverage and Compliance
Determination of the proportion offered screening who actually undergo the
initial screening. This can inform the acceptability of the screening procedures
and indicate whether the level is consistent with that assumed in the trial design.
Compliance with each scheduled repeat screen should also be recorded.
3. Test Yield in the Screened Arm
The number of cases found at each screen should be recorded and related to the
interval cases not discovered by screening. This is important for gauging how
successful the screening test is in finding the disease.
4. Contamination
The amount of screening in the control arm outside the trial protocol should be
assessed. This is crucial for ascertaining the potential level of dilution of any
intervention effect. Ideally this would be ascertained at the individual level, but
sometimes sampling of the controls is used. Approaches aimed at minimizing
contamination include cluster rather than individual randomization and post
randomization consent, and methods exist to adjust for contamination in the
analysis (Baker et al. 2002; Cuzick et al. 1997).
5. Screening Test Characteristics
Determination of the detection capabilities of the screening test by estimating
sensitivity, specificity, and predictive value.
6. Diagnostic Follow-Up
Collection of medical records and related information on diagnostic procedures
subsequent to every positive screening test. The diagnostic process is also
tracked in both the screened and control arms for cases diagnosed as a result
of signs or symptoms. For cancer screening trials, the biopsy rate can be
calculated relative to each screen and for the program as a whole, and the biopsy
yield of cancers can be determined.
7. Disease Case Characteristics
Key histologic and prognostic variables should be determined for every case of
disease in both the screened and control arms. In cancer, these include histolog-
ical type and grade, lesion size, nodal involvement, and perhaps genetic or other
biomarkers. This information can be used for comparing cancer case subgroups
and in survival and other case-based analyses. Comparison between screen
detected and interval cancers is also of interest.
8. Stage of Disease
This should be ascertained for every cancer case in the trial population. This
information is used to compare the stage distribution of screen detected cases
versus other case subsets to suggest whether screening might have an impact on
mortality, and is necessary for defining stage-specific incidence rates.
9. Case Survival
When sufficient follow-up time accrues, survival of individuals in whom disease
is observed can be investigated. Although potentially biased as noted above
(Endpoint section), the survival distributions of all cases in the screened arm and
of screen-detected cases can be compared with the distributions of other case
subgroups to provide a suggestion of whether screening might have an effect on
disease outcome. Of interest is an order relationship in the survival rates where it
would be expected that the survival of screen-detected cases would exceed that
of control arm cases, which in turn would exceed that of interval cases.
10. Incidence Rate
Calculation of the disease incidence rate requires information on the time of
diagnosis of each case as well as the number of person-years at risk of disease
incidence in each time interval of follow-up. These data are of interest, partic-
ularly in a Stop Screen design, because a higher total incidence in the screened
arm relative to the control arm is expected until some point after screening stops.
If the rates do not equalize, this is evidence of overdiagnosis.
11. Advanced Stage Rate
For cancer screening trials, the incidence rate of advanced-stage cancer can be
calculated yearly and cumulatively for each randomized arm. This rate is often
considered to be a reasonable surrogate for mortality.
12. Mortality Rate
As noted, mortality rates are the basis for the primary inference regarding the
effectiveness of screening. Calculation of these rates requires the date and cause
of every death in the population as well as the number of person-years at risk of
death during each follow-up interval. Mortality rates should be compared
between the screened arm and the control arm. In addition, the death rates
from other causes should be scrutinized to assess the comparability of the
randomized populations, and all-cause mortality should be reported (see the
rate-calculation sketch after this list).
13. Therapy
The specific therapy used for every case of disease should be recorded. At a
minimum, this should be the initial therapy, but adjuvant therapy or treatment for
recurrence is valuable as well. This information is crucial for separating the early
detection component from the therapy component of any screening effect. That
is, within each stage of disease, the therapy distribution should be comparable
for each randomized group to eliminate any confounding effect of therapy in
assessing the impact of the screening.
14. Harms
Harms of screening include overdiagnosis, false positives, and complications of
the screening, diagnostic, and treatment procedures administered to trial partici-
pants. Complications include any adverse medical events and any mortality
potentially related to trial procedures, notably any procedures that follow a
positive screen.
15. Procedures and Costs
The ultimate decision whether to implement a screening program in a population
rests on a tradeoff between costs and benefits. To facilitate assessment of cost
and cost-effectiveness of the screening program, data can be collected on the
costs of all phases of the program in an evaluation trial. An alternative is to
record the procedures done in each phase so that costs can be assigned at a later
date. Included are efforts to recruit the population, the screening tests, diagnostic
procedures, treatment procedures, and efforts used to follow the population.
16. Sequential Monitoring and Interim Analysis
A process for regular, formal monitoring of safety issues and accumulating data
should be established early in the course of a screening trial. This is best
accomplished by the creation of a data and safety monitoring board (DSMB).
This board is comprised of experts not associated with the trial who can
therefore provide an independent assessment of trial progress. A DSMB typi-
cally uses statistical monitoring methods to examine emerging data. Accruing
mortality and secondary endpoints are examined regularly to determine if and
when a protocol change is warranted that would result in early termination of the
trial. Formal statistical procedures are available (e.g., Proschan et al. 2006).
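
A minimal computational sketch of the rate calculations referred to in items 10 and 12 is given below (in Python). All counts and person-year totals are hypothetical illustrations, not data from any trial; the sketch computes arm-specific mortality rates per person-year and an approximate 95% confidence interval for the rate ratio on the log scale.

```python
# Minimal sketch of person-year mortality rate comparisons in a
# screening trial; all numbers below are hypothetical.
import numpy as np
from scipy import stats

deaths_screen, py_screen = 320, 450_000.0    # screened arm
deaths_control, py_control = 400, 448_000.0  # control arm

rate_s = deaths_screen / py_screen           # deaths per person-year
rate_c = deaths_control / py_control
rr = rate_s / rate_c                         # mortality rate ratio

# Approximate 95% CI for the rate ratio on the log scale,
# using the usual Poisson variance 1/deaths for each log rate
se_log_rr = np.sqrt(1 / deaths_screen + 1 / deaths_control)
z = stats.norm.ppf(0.975)
ci = (rr * np.exp(-z * se_log_rr), rr * np.exp(z * se_log_rr))
print(f"rate ratio = {rr:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```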

Conclusion

As in other areas of research, much has been learned over time from completed and
ongoing screening trials. This chapter is an attempt to convey some of this knowl-
edge. Hopefully this will lead to improved trial design and analysis in the future.
Some additional insights are the following:
1. New screening tests can rapidly become widely used, especially in the U.S., often
without valid scientific evidence of benefit or proper assessment of harm. It is
therefore important to undertake rigorous trials as soon as possible when a new
test becomes available to take advantage of a window of opportunity, before
widespread use precludes establishment of a proper control;
2. Over-diagnosis has been indicated repeatedly, particularly in cancer screening.
This should be expected and accounted for in study design, analysis, and
interpretation;
3. A pilot phase prior to or at the beginning of a trial can be extremely valuable for
testing operational components and evaluating study centers. Although not
discussed in this chapter, pilot studies have been instrumental in several cancer
screening trials (e.g., NLST Research Team 2011; Prorok et al. 2000);
4. Quality assurance of all trial operations is crucial.

As has been stated previously, a screening trial is a major endeavor requiring a long-
term commitment by participants, investigators, and funding organizations. If a decision
is made to do such a trial, the necessary resources must be provided for the full study
duration. To accomplish this in the usual climate of resource competition and peer
review can be difficult. One strategy is full commitment sequentially to the primary
phases of such a trial, i.e., pilot, recruitment, screening, and follow-up, with funding for
each successive phase contingent on successful completion of the previous phase.
What is clear, however, is that such trials have been successfully conducted, and
that screening interventions for chronic diseases can and should be evaluated
rigorously.

Cross-References

▶ Randomization and Permutation Tests


▶ Survival Analysis II

References
Ahlbom A (1993) Biostatistics for epidemiologists. Lewis Publishers, Boca Raton, pp 61–66
Andriole GL et al (2012) Prostate cancer screening in the randomized prostate, lung, colorectal and
ovarian cancer screening trial: mortality results after 13 years of follow-up. J Natl Cancer Inst
104:125–132
Baker SG et al (2002) Statistical issues in randomized trials of cancer screening. BMC Med Res
Methodol 2:11. (19 September 2002)
Berry DA (1998) Benefits and risks of screening mammography for women in their forties: a
statistical appraisal. J Natl Cancer Inst 90:1431–1439
Bretthauer M et al (2016) Population-based colonoscopy screening for colorectal cancer: a ran-
domized trial. JAMA Intern Med 176:894–902
Cuzick J et al (1997) Adjusting for non-compliance and contamination in randomized clinical trials.
Stat Med 16:1017–1029
Duffy SW et al (2008) Correcting for lead time and length bias in estimating the effect of screen
detection on cancer survival. Am J Epidemiol 168:98–104
Echouffo-Tcheugui JB, Prorok PC (2014) Considerations in the design of randomized trials to
screen for type 2 diabetes. Clin Trials 11:284–291
Etzioni RD et al (1995) Design and analysis of cancer screening trials. Stat Methods Med Res 4:3–17
Freedman LS, Green SB (1990) Statistical designs for investigating several interventions in the
same study: methods for cancer prevention trials. J Natl Cancer Inst 82:910–914
Frisell J et al (1991) Randomized study of mammography screening – preliminary report on
mortality in the Stockholm trial. Breast Cancer Res Treat 18:49–56
Kafadar K, Prorok PC (2009) Effect of length biased sampling of unobserved sojourn times on the
survival distribution when disease is screen detected. Stat Med 28(16):2116–2146
Mandel JS et al (1993) Reducing mortality from colorectal cancer by screening for fecal occult
blood. N Engl J Med 328:1365–1371
Miller AB et al (1981) The national study of breast cancer screening. Clin Invest Med 4:227–258
Morrison AS (1982) The effects of early treatment, lead time and length bias on the mortality
experienced by cases detected by screening. Int J Epidemiol 11:261–267
Moss SM et al (2015) Effect of mammographic screening from age 40 years on breast cancer
mortality in the UK age trial at 17 years follow-up: a randomized controlled trial. Lancet Oncol
16:1123–1132
NLST Research Team (2011) The national lung screening trial: overview and study design.
Radiology 258:243–253
Nystrom L et al (1993) Breast cancer screening with mammography: overview of Swedish
randomized trials. Lancet 341:973–978
Pinsky PF et al (2007) Evidence of a healthy volunteer effect in the prostate, lung, colorectal and
ovarian cancer screening trial. Am J Epidemiol 165:874–881
Pisano ED (2018) Is tomosynthesis the future of breast cancer screening? Radiology 287:47–48
Prorok PC (1995) Screening studies. In: Greenwald P, Kramer BS, Weed DL (eds) Cancer
prevention and control. Marcel Dekker, New York, pp 225–242
Prorok PC, Marcus PM (2010) Cancer screening trials: nuts and bolts. Semin Oncol 37:216–223
Prorok PC et al (2000) Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening
trial. Controlled Clin Trials 21(6S):273S–309S
Proschan MA et al (2006) Statistical monitoring of clinical trials. Springer, New York
Schroder FH et al (2014) Screening and prostate cancer mortality: results of the european random-
ized study of screening for prostate cancer (ERSPC) at 13 years follow-up. Lancet
384:2027–2035
Shapiro et al (1988) Periodic screening for breast cancer: the health insurance plan project and its
sequelae, 1963–1986. The Johns Hopkins University Press, Baltimore
Tabar L et al (1992) Update of the Swedish two-county program of mammographic screening for
breast cancer. Radiol Clin N Am 30:187–210
Welch HG et al (2016) Breast cancer tumor size, overdiagnosis, and mammography screening
effectiveness. N Engl J Med 375:1438–1447
Yousaf-Khan U et al (2017) Final screening round of the NELSON lung cancer screening trial: the
effect of a 2.5 year screening interval. Thorax 72:48–56
66 Biosimilar Drug Development

Johanna Mielke and Byron Jones
Novartis Pharma AG, Basel, Switzerland
e-mail: [email protected]; [email protected]

Contents
Introduction ..... 1238
The Stepwise Approach to Biosimilarity ..... 1240
Testing for Equivalence in Biosimilar Trials ..... 1243
Case Study ..... 1245
  Step 1: Analytical Similarity ..... 1245
  Step 2: Nonclinical Studies ..... 1245
  Step 3: Clinical Studies ..... 1246
Selected Challenges in Biosimilar Development ..... 1247
  The Choice of Equivalence Margins in Efficacy Trials ..... 1247
  Interchangeability of Biosimilars ..... 1249
  Incorporating Additional Data in Clinical Efficacy Studies ..... 1251
  Operational Challenges in Biosimilar Development ..... 1254
Summary and Conclusion ..... 1255
Key Facts ..... 1256
Cross-References ..... 1256
References ..... 1257

Abstract
Biologics are innovative, complex large molecule drugs that have brought life-
changing improvements to patients in various disease areas like cancer, diabetes,
or psoriasis. Biosimilars are copies of innovative biologics. Their development is
currently a focus of attention because the patents of several important biologics
have expired, making it possible for competing companies to produce their own
biosimilar version of the drug. Although, at first sight, there seems to be some
similarity with the development of generics, which are copies of simple small
molecule drugs, there is an important distinction because of the complexity and
the variability inherent in the development of biologics. This chapter introduces
the studies and analyses required to obtain regulatory approval for marketing a
biosimilar and reviews several important regulatory concepts. In addition, several
important statistical challenges are highlighted and discussed.

Keywords
Follow-on biologics · Equivalence testing · Totality of the evidence ·
Biosimilarity · Extrapolation · Biologics · Comparability · Analytics ·
Switchability · Historical information

Introduction

Biologics (or large molecule drugs) have revolutionized the treatment of various
diseases and dramatically improved the life of many patients. However, they suffer
from the disadvantage that their costs are very high: it is estimated that the costs of
treatment with biologics are 22 times higher than those of a nonbiological drug
(Health Affairs Health Policy Brief 2013). That is why the question as to whether
biologics should be used as the first line treatment is still controversial in many
disease areas (e.g., see Finckh et al. (2009) for a discussion in rheumatoid arthritis)
and the access of patients to these life-changing products is often limited.
Previous experience with (small-molecule) nonbiological drugs showed that the
introduction of generics, that is, copies of the originator small-molecule drug,
substantially lowered drug prices and thus improved the access of patients to these
products. Generics are usually developed and produced by a competing company
and can be marketed after the patent of the original drug has expired. The analogue to
generics for biologics are the so-called biosimilars (also known as follow-on bio-
logics). These medical products are developed and approved as copies of already
marketed biologics.
However, while the concepts of generics and biosimilars are comparable, it is
important to note that small molecule drugs and biologics differ substantially
(Crommelin et al. 2005). While small molecules tend to have a well-defined and
stable chemical structure which can be easily identified, biologics are more complex
proteins with heterogeneous structures. In addition, small molecule drugs are chem-
ically engineered, but biologics are grown in living cells: this makes the manufacture
of biologics extremely sensitive to environmental changes (e.g., a small change of
temperature in the manufacturing site might influence the therapeutic effect of the
product). The high complexity of the molecule and the sensitive manufacturing
process makes it, even for the manufacturer of the originator, impossible to produce
an exact copy. That is why, in contrast to generics, which are chemically identical to
the original small molecule drug, biosimilars are only expected to be similar to the
originator product. The high complexity and the difficult characterization of the
molecules combined with the fact that biosimilars are only similar, but not identical
to the originator product, lead to a higher uncertainty as to whether the therapeutic
effect of the biosimilar is comparable to that of the originator product. Therefore, the limited
evidence which is required for gaining approval for a generic is not considered
sufficient for biosimilars. It should also be noted that the objectives of a biosimilar
development program are not entirely the same as that of the original product, since
the aim is to demonstrate comparability of the biosimilar and originator and not to
establish efficacy de novo (Christl et al. 2017). Indeed, regulatory agencies, for
example, the Food and Drug Administration (FDA) in the USA, have implemented a
separate regulatory pathway for biosimilars.
In this chapter, we discuss some of the most important concepts, statistical
challenges, and regulations of biosimilar development with a focus on the FDA’s
point of view. However, it should be noted that these are comparable to other highly
regulated markets (Cazap et al. 2018).
The foundation of biosimilar development in the USA lies in the Biologics Price
Competition and Innovation (BPCI) Act (FDA 2009), where the legal framework for approval of
biosimilars has been written into law. The FDA defines a biosimilar as “a biological
product that is highly similar to and has no clinically meaningful differences from an
existing FDA-approved reference [i.e., original] product” (FDA 2017a). Therefore,
for getting approval as a biosimilar, a sponsor (the developer of the biosimilar) needs
to demonstrate that patients who are taking the biosimilar can expect the same
efficacy and safety profile as patients who are taking the originator product.
For the showing of biosimilarity, the FDA recommends a stepwise approach
which is introduced in detail in section “The Stepwise Approach to Biosimilarity.”
Before that, two fundamental concepts for biosimilar development are introduced.
The most important concept in the biosimilar pathway in the USA is the idea of the
“totality of the evidence” (Christl et al. 2017): not one study in the development
program is considered pivotal, but all provided evidence (the results of all steps) is
considered important. When the decision is made on whether to approve or to reject
a biosimilar, all evidence is taken into account.
Another important concept is “extrapolation” (Weise et al. 2014). This relates to
the fact that the clinical trials, which are performed as part of the stepwise approach
(see section “The Stepwise Approach to Biosimilarity”), are only conducted in
selected indications. In contrast, the originator is normally approved for a wide
range of indications and the aim of the sponsor of a biosimilar is usually to gain
approval in all the same indications as the originator product. However, since the
clinical evidence is, in the context of the “totality of the evidence,” not pivotal, a so-
called extrapolation of the provided evidence to indications which were not explic-
itly studied in clinical trials is possible by appealing to scientific judgment. It should
be noted that the use of extrapolation is still a topic of debate, especially for products
for which the mechanism of action is not fully understood in all indications
(Schellekens and Moors 2015). Nonetheless, extrapolation has been used in all
biosimilar applications in the USA so far.
The rest of this chapter is structured as follows: after the introduction of the
stepwise approach in the section “The Stepwise Approach to Biosimilarity,” the
section “Testing for Equivalence in Biosimilar Trials” gives an overview of the
statistical methodology for testing for equivalence. Section “Case Study” gives a
case study that illustrates the stepwise approach using the development program of
the biosimilar Zarxio. Then, in section “Selected Challenges in Biosimilar Devel-
opment” selected challenges related to the design and analysis of biosimilar clinical
trials are discussed. Conclusions are presented in section “Summary and
Conclusion.”

The Stepwise Approach to Biosimilarity

In this section, the stepwise approach to biosimilarity is introduced with a focus on
the FDA’s terminology and regulations. However, it should be emphasized that the
way of thinking is comparable also to other highly regulated markets (e.g., in the
EU). The FDA’s proposed biosimilar development strategy consists of three main
steps which are illustrated in Fig. 1: analytical studies (Step 1), nonclinical studies
(Step 2), and clinical studies (Step 3), which are split into (a) pharmacokinetic (PK)
and pharmacodynamic (PD) studies and (b) therapeutic equivalence studies. It is
recommended that after each step the already obtained evidence is considered and
any residual uncertainty is identified before it is decided which additional studies are
necessary for the establishment of biosimilarity (FDA 2015b). The FDA’s expecta-
tions for each step are described in the overarching guideline on scientific questions
related to biosimilar development (FDA 2015b) and further outlined in topic-specific
guidelines on pharmacological data (FDA 2016).
Even though, in line with the idea of the “totality of the evidence,” all provided
evidence is important, the analytical studies (Step 1) are often considered the
foundation of biosimilar development. The aim of the analytical studies is to
establish comparability of the biosimilar and its originator at the molecular level,
that is, it should be confirmed that the biosimilar molecule and the originator
molecule are “highly” similar. However, due to the complexity of the molecule, it
is, with the current state-of-the-art technologies, not possible to characterize the
molecules sufficiently well with one single tool, as is done for small molecule drugs.

Fig. 1 Steps of biosimilar development (PK pharmacokinetic, PD pharmacodynamic). The figure depicts the stepwise pyramid: Step 1, analytical studies; Step 2, nonclinical studies; Step 3, PK/PD and therapeutic equivalence studies, all contributing to the totality of the evidence.

In fact, several assessments are performed in order to check different
characteristics of the molecule (so-called quality attributes) and it is assumed that if
all of these assessments indicate equivalence, then the molecules themselves are also
sufficiently similar. The structural characteristics, for example, the identity of the
primary sequence of amino acids, are analyzed with techniques like peptide mapping
or mass spectrometry. Also the biological activity needs to be comparable and, for
that, often bioassays are used which can assess comparability in terms of binding and
functionality (Schiestl et al. 2014). The way statistics can support the comparability
claims in analytical studies is still highly controversial: the FDA published (and
subsequently withdrew) a draft guideline (FDA 2018) that discussed the value of
statistics for the establishing of comparability. The FDA suggested using a risk-
based approach where the type of statistical methodology depends on the criticality
of the quality attribute. That is, for an attribute which is assumed to be strongly
related to the clinical outcome (e.g., the results of a bioassay which is assessing the
binding of the molecule to a target which is imitating the mechanism of action),
stricter criteria for comparability are applied compared to a quality attribute which is
assumed not to be critical for the therapeutic effect.
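
As a concrete illustration of this risk-based idea, the withdrawn draft discussed stricter, equivalence-style tests for the most critical quality attributes, with margins tied to the variability of originator lots; a margin of 1.5 times the reference standard deviation is often cited in commentary on that draft. The following minimal sketch assumes such a margin and uses hypothetical lot measurements.

```python
# Minimal sketch of an equivalence-style test for a critical quality
# attribute, with a margin proportional to the originator's (reference)
# lot-to-lot variability. All lot values are hypothetical.
import numpy as np
from scipy import stats

ref = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7])  # originator lots
test = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 9.8])            # biosimilar lots

delta = 1.5 * ref.std(ddof=1)        # margin from the reference SD (assumption)
diff = test.mean() - ref.mean()
se = np.sqrt(test.var(ddof=1) / len(test) + ref.var(ddof=1) / len(ref))
df = len(test) + len(ref) - 2
t_crit = stats.t.ppf(0.95, df)       # 90% CI, i.e., TOST at alpha = 0.05

ci = (diff - t_crit * se, diff + t_crit * se)
equivalent = ci[0] > -delta and ci[1] < delta
print(f"90% CI = ({ci[0]:.3f}, {ci[1]:.3f}), margin = ±{delta:.3f}, pass = {equivalent}")
```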
After all analytical studies are conducted, the FDA recommends classifying the
obtained comparability into one of four categories (FDA 2016): (1) insufficient
analytical similarity, (2) analytical similarity with residual uncertainty, (3) tentative
analytical similarity, and (4) fingerprint-like analytical similarity. In categories (1)
and (2), the sponsor needs to conduct additional analytical studies and/or to adjust
the manufacturing process. Categories (3) and (4) allow a sponsor to proceed to the
next step of biosimilar development. Dependent on the amount of residual uncer-
tainty, selective animal and clinical studies might be sufficient. Therefore, providing
a higher level of evidence (e.g., fingerprint-like analytical similarity instead of
tentative analytical similarity) might reduce the amount of required studies in the
following steps. On the other hand, demonstrating fingerprint-like similarity might
be challenging or even not possible in some cases.
In Step 2, studies in animals are conducted. The main aim of the animal studies is
to establish the toxicology profile of the proposed biosimilar. In some cases, the
PK and PD profiles of the biosimilar in animals are also compared to those of the originator.
However, it is clearly emphasized that the inclusion of animal PK and PD studies
does not lead to a negation of the need for clinical studies in humans. If there is no
relevant animal species, additional in vitro studies might be appropriate, for exam-
ple, with human cells. The extent of the required studies highly depends on the
success of the analytical studies which were performed in Step 1. This is stated in the
regulatory document issued by the FDA (2015b): “If comparative structural and
functional data using the proposed product provide strong support for analytical
similarity to a reference [originator] product, then limited animal toxicity data may
be sufficient to support initial clinical use of the proposed product.”
After Step 2 is completed successfully, the proposed biosimilar is used for the first
time in humans (Step 3). The “FDA expects a sponsor to conduct comparative
human PK and PD studies (if there is a relevant PD measure(s)) and a clinical
immunogenicity assessment" and, if this evidence is not sufficient for removing
residual uncertainties, comparative clinical trials are also required (FDA 2015b). The
aim of PK and PD equivalence studies is to confirm comparable exposure (PK) of the
proposed biosimilar and the originator and, if possible, to show that the way the drug
affects the body is sufficiently similar (PD). The results of PD studies can only be
considered an important piece of evidence if there exists a well-established PD
marker which can serve as a surrogate for the clinical outcome. In these cases, the
PK/PD studies may be seen as a more sensitive step for detecting potential differ-
ences between the biosimilar and the originator than clinical comparability studies.
For example, for the approval of Zarxio (Sandoz) the pharmacology studies reduced
the need for clinical comparability studies (Holzmann et al. 2016, see for details also
section “Case Study”).
The design and analysis of PK/PD studies are comparable to the studies which are
conducted for the showing of bioequivalence of generics. The preferred study design
(FDA 2016) for products with a short half-life (the time until half of the drug is
eliminated from the body) is a crossover design. Often two-period, two-treatment
crossover designs are used where subjects first take the biosimilar and then the
originator or vice versa. These studies have the advantage that each subject acts as
his or her own control, which reduces the variability and allows for smaller sample
sizes (Jones and Kenward 2014). In the case of a long half-life, parallel groups
designs are also acceptable (FDA 2016). The study population should consist of
healthy volunteers, if possible. This is expected to reduce the variability since
patients often have confounding factors (e.g., comorbidity). However, if this is not
feasible due to ethical reasons (e.g., known toxicology) or if a PD marker can only be
assessed in patients (e.g., in diabetes), then patients are preferred.
The analysis of PK/PD data is, compared to the other steps in biosimilar devel-
opment, standardized and leaves only a small degree of flexibility for the sponsor: a
response of the drug in the blood over time is measured after the drug is injected. For
PK analysis, the response of interest is the concentration of the drug in the blood, for
PD, it might be a well-established PD marker. Measures like the area under the
response vs. time curve (AUC) and the maximum response over time (Cmax) are
reported for each subject. The aim is to show that the ratio of the mean values of the
originator product and the proposed biosimilar for each of these measures as a
percentage lies within 80% and 125% with a prespecified confidence level 1 – α
where commonly α = 0.1 is used. This confidence level corresponds to a one-sided
significance level of 5% which is typically used for testing for superiority.
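
A minimal sketch of this calculation for log(AUC) is given below, with hypothetical values; for simplicity it treats the two-period crossover as paired observations and ignores the period and sequence effects that a full bioequivalence analysis would model.

```python
# Minimal sketch of the 90% confidence interval for the geometric mean
# ratio (GMR) of AUC, computed on the log scale; data are hypothetical.
import numpy as np
from scipy import stats

auc_bio = np.array([812.0, 950.0, 730.0, 1005.0, 880.0, 790.0])   # biosimilar
auc_orig = np.array([800.0, 985.0, 755.0, 990.0, 845.0, 810.0])   # originator
d = np.log(auc_bio) - np.log(auc_orig)   # within-subject log differences

n = len(d)
se = d.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.95, n - 1)        # 90% CI <-> TOST at one-sided 5%
ci = np.exp([d.mean() - t_crit * se, d.mean() + t_crit * se])

equivalent = 0.80 < ci[0] and ci[1] < 1.25
print(f"GMR 90% CI = ({ci[0]:.3f}, {ci[1]:.3f}), within 80-125%: {equivalent}")
```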
An assessment of immunogenicity (e.g., the potential to induce an immune
response, as for example anaphylaxis) is required since, with the current understand-
ing of highly complex molecules, it is not possible to reliably predict the immuno-
genicity purely based on analytical studies. Since immune responses might influence
the treatment effect and the safety profile, it is important to confirm similar immu-
nogenicity. Immunogenicity is mostly assessed as part of the clinical studies (Christl
et al. 2017) and the amount and type of immunogenicity assessment depends on the
active substance.
If the comparability of the products at the PK/PD level has been established, but
there still exists residual uncertainties, then clinical comparability studies in patients
are conducted. These studies should be carefully chosen to target the residual uncer-
tainty. The selection of endpoints, study population, study duration, and study design
need to be scientifically justified. In general, the approach should be selected which is
expected to be most sensitive to detect potential differences between the proposed
biosimilar and the originator (FDA 2015b). Therapeutic equivalence is typically
assessed at one chosen point in time using an equivalence testing approach (see
section “Testing for Equivalence in Biosimilar Trials”), that is, it is confirmed that
the characteristic of interest of the treatment response, for example, the mean value of a
chosen endpoint, after taking the biosimilar is neither smaller nor larger than after
taking the originator. A non-inferiority-type test, that is, the showing that a chosen
characteristic under treatment with the biosimilar is not larger or smaller, respectively,
than under treatment with the originator, might be acceptable in specific cases. For
example, one might consider a noninferiority design if a higher response can be ruled
out due to scientific reasons (e.g., saturation of the target with a specific dose, see
Schoergenhofer et al. (2018)). In terms of the study design, mostly parallel groups
designs are conducted which often are combined with an extension period in which the
effect of a single switch from the originator to the biosimilar is studied and the safety
and immunogenicity profile is compared between the switching and nonswitching
group. This type of assessment is explicitly required in the respective guideline (FDA
2015b). Safety and immunogenicity are usually assessed descriptively.
Taking all results into account, by referring to the concept of “totality of the
evidence” (see section “Introduction”), regulators make a decision if biosimilarity is
established or not. Consequently, this means that the failing of one analysis does not
necessarily lead to the failing of the biosimilar development program, as long as a
scientific justification is provided (for an example of approval with failed “compo-
nents of evidence,” see Mielke et al. 2016). It is important to note that the overall
assessment of biosimilarity is made by appealing to scientific judgment and not by a
quantitative decision-making approach. This makes the decision whether to
approve a biosimilar a subjective one. In the recent past, some strategies were
published on how to formalize the decision making process (e.g., the biosimilarity
index by Hsieh et al. 2013), but these suggestions have not yet made it into practice.
Information regarding the provision of clinical evidence for biosimilar approval
in practice can be found in Hung et al. (2017) for the USA, in Mielke et al. (2016,
2018a) for the European Union and in Arato (2016) for Japan.

Testing for Equivalence in Biosimilar Trials

Usually, the aim of a study in biosimilar development is to establish equivalence, that


is, it is necessary to confirm that a characteristic of interest measured after treatment
with the biosimilar is similar to the same characteristic of interest measured after
treatment with the originator. As a simplification, it is assumed that the aim is to
establish equivalence in the difference of the measurements, that is, to show that the
difference between two mean values is sufficiently small. However, ratios can
usually be transformed into differences with a logarithmic transformation so that
the same type of hypothesis can be assessed. More formally, let τB be a characteristic
of interest of the biosimilar (e.g., the mean value of log(AUC)) and τO be the same
characteristic of interest of the originator. Then, the aim is to test the hypotheses
(Wellek 2010):

H0 : |τB − τO| ≥ Δ  vs.  H1 : |τB − τO| < Δ,

where Δ is a positive value and called the equivalence margin. The choice of the
equivalence margin is discussed in more detail in section “The Choice of Equiva-
lence Margins in Efficacy Trials” and for the time being, it is assumed that an
equivalence margin Δ is provided.
There exist two common ways to test the abovementioned hypotheses: first, one
can split the equivalence hypothesis into two one-sided hypotheses. This approach is
commonly known as the two one-sided tests (TOST) approach (Schuirmann 1987).
The two sets of hypotheses are given by:
H0(1) : τB − τO ≤ −Δ  vs.  H1(1) : τB − τO > −Δ,

H0(2) : τB − τO ≥ Δ  vs.  H1(2) : τB − τO < Δ.
If both H0(1) and H0(2) are rejected, the overarching hypothesis H0 is also rejected
and equivalence can be claimed. In the following, the test statistics and decision rules
for the hypotheses H0(1) and H0(2) are illustrated using the example of a normally
distributed endpoint. For that, let τB be the expected value of the biosimilar and τO be
the expected value of the originator. The standard deviation of the originator is denoted
by σO, whereas the standard deviation of the biosimilar is denoted by σB. We assume
that both standard deviations are equal, that is, σB = σO. In addition, we assume a
parallel groups design with n subjects per group. Let ȳO and ȳB be the observed mean
values of the originator and the biosimilar, respectively. The estimated standard
deviations are denoted by σ̂O and σ̂B. The corresponding test statistics are then

ðyB  yO Þ þ Δ ðy  y B Þ þ Δ
Z1 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffi2ffi and Z 2 ¼ Oqffiffiffiffiffiffiffiffiffiffiffiffiffiffi :
2
σ^B σ^O σ^2B σ^2O
n þ n n þ n

Both test statistics follow, under the null hypotheses, a t-distribution with 2n – 2
degrees of freedom. Therefore, the null hypothesis is rejected if both realizations are larger
than the (1 – α)-quantile of a t-distribution with 2n – 2 degrees of freedom. A typical
choice for the significance level α is α ¼ 0.05.
The second strategy is based on a confidence interval approach: a (1 – 2α)-
confidence interval for the difference of the mean value is calculated. If this
confidence interval fully lies within

[−Δ, Δ],

the null hypothesis H0 is rejected and equivalence is claimed. In the case of a
normally distributed endpoint, the (1 − 2α)-confidence interval is given by
[ȳB − ȳO − t(1−α, 2n−2) · sqrt(σ̂B²/n + σ̂O²/n),  ȳB − ȳO + t(1−α, 2n−2) · sqrt(σ̂B²/n + σ̂O²/n)],

where t(β, k) is the β-quantile of the t-distribution with k degrees of freedom.


Comparing the confidence interval approach with the TOST-approach in this
example, one quickly realizes that both approaches lead to the same result. A more
detailed discussion on the connection between the TOST approach and the confi-
dence interval approach can be found in Hsu et al. (1994).
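
The following is a minimal computational sketch of the TOST and confidence interval calculations for the normally distributed case above; the function name and all numbers are hypothetical illustrations. As noted, rejecting both one-sided tests is the same as the interval lying entirely within [−Δ, Δ].

```python
# Minimal sketch of TOST and the matching (1 - 2*alpha) confidence
# interval for a parallel-groups design; data are simulated.
import numpy as np
from scipy import stats

def tost_equivalence(y_bio, y_orig, delta, alpha=0.05):
    """Test H0: |tau_B - tau_O| >= delta via two one-sided t-tests."""
    n_b, n_o = len(y_bio), len(y_orig)
    diff = np.mean(y_bio) - np.mean(y_orig)
    # Standard error from the per-group sample variances, as in the text
    se = np.sqrt(np.var(y_bio, ddof=1) / n_b + np.var(y_orig, ddof=1) / n_o)
    df = n_b + n_o - 2
    z1 = (diff + delta) / se    # rejects tau_B - tau_O <= -delta if large
    z2 = (delta - diff) / se    # rejects tau_B - tau_O >= +delta if large
    t_crit = stats.t.ppf(1 - alpha, df)
    reject = z1 > t_crit and z2 > t_crit
    ci = (diff - t_crit * se, diff + t_crit * se)  # (1 - 2*alpha) CI
    return reject, ci

rng = np.random.default_rng(1)
y_bio = rng.normal(loc=0.02, scale=1.0, size=100)    # biosimilar arm
y_orig = rng.normal(loc=0.00, scale=1.0, size=100)   # originator arm
print(tost_equivalence(y_bio, y_orig, delta=0.5))
```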
For normally distributed endpoints, typically the mean values are compared.
However, more complex approaches have also been proposed previously: for
example, Chow et al. (2009) proposed comparing the probability that the two
characteristics of interest do not differ by more than a prespecified value and Tsou
et al. (2013) discussed a consistency approach. These approaches have the advantage
that they focus not exclusively on the mean value but also take into account the
variability of the products. However, so far, the simple comparison of mean values, if
necessary adjusted for relevant covariates, is still the standard approach for
establishing equivalence in a biosimilar development program.
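
At the design stage it is also useful to know the probability that the TOST procedure will conclude equivalence for a given margin, sample size, and true difference. The sketch below uses the standard normal approximation to TOST power; the function and all inputs are hypothetical.

```python
# Minimal sketch of approximate TOST power under a normal approximation;
# all parameter values are hypothetical.
import numpy as np
from scipy import stats

def tost_power(n_per_arm, sigma, delta, theta=0.0, alpha=0.05):
    """Approximate power to conclude |tau_B - tau_O| < delta."""
    se = sigma * np.sqrt(2.0 / n_per_arm)
    z = stats.norm.ppf(1 - alpha)
    p = (stats.norm.cdf((delta - theta) / se - z)
         + stats.norm.cdf((delta + theta) / se - z) - 1)
    return max(0.0, p)

for n in (50, 100, 200):
    print(n, round(tost_power(n, sigma=1.0, delta=0.3), 3))
```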

Case Study

In 2015, Sandoz Inc., a Novartis Company, gained approval from the FDA to market
Zarxio, its biosimilar version of the reference biologic, Neupogen (active substance:
filgrastim). Neupogen is used to treat neutropenia (an abnormally low level of
neutrophils in the blood) which can occur, for example, in cancer patients undergo-
ing chemotherapy. This case study gives a brief description of some of the informa-
tion presented by Sandoz Inc. in its submission to the FDA to gain approval to
market Zarxio. This information is publicly available online (FDA 2015a) and the
summary given below is based directly on that text. The following subsections
briefly describe the contributions to the steps that were illustrated in Fig. 1 and
described in detail in section “The Stepwise Approach to Biosimilarity.”

Step 1: Analytical Similarity

Analytical similarity was assessed using multiple quality attributes, and some of
them are listed in Table 1. Overall, it was concluded that the proposed biosimilar is
“highly similar” to Neupogen.

Step 2: Nonclinical Studies

EP2006 was compared to Neupogen in animal studies for assessing the pharmacody-
namics (PD), toxicity, toxicokinetics, and local tolerance of the products. Of these
studies, one was a single dose tolerance study in rabbits and the other was a 28-day
multiple, repeat dose toxicology study in rats. According to Bewesdorff (2016), these
two studies used, respectively, a single group of 24 rabbits and two groups of 60 rats.

Table 1 Quality attributes and methods used to evaluate analytical similarity of EP2006 and
US-licensed Neupogen (partial list, for illustration)
Quality attribute   Method
Primary structure   N-terminal sequencing
                    Peptide mapping with ultraviolet (UV) and mass spectrometry detection
                    Protein molecular mass by electrospray mass spectrometry (ESI MS)
                    Protein molecular mass by matrix-assisted laser desorption ionization
                      mass spectrometry (MALDI-TOF MS)
                    DNA sequencing of the EP2006 construct cassette
                    Peptide mapping coupled with tandem mass spectrometry (MS/MS)
Bioactivity         Proliferation of murine myelogenous leukemia cells (NFS-60 cell line)
Receptor binding    Surface Plasmon Resonance
Protein content     RP-HPLC

Step 3: Clinical Studies

To assess PK and PD, four studies were reported, labeled as EP06-109, EP06-103,
EP06-105, and EP06-101. Each of these was a 2 × 2 crossover trial and involved
between 24 and 32 healthy subjects. For the PK assessment, the usual PK parame-
ters, for example, AUC and Cmax, were used and for the PD assessment, the
endpoints were the absolute neutrophil count (ANC) and the increase in CD34+
cell count.
For establishing equivalence of the PK profiles, equivalence was tested using the
usual criteria that the 90% confidence interval for the geometric ratios of the AUC
and Cmax parameters should lie within (80%, 125%), except for study EP06-101
which used the wider margin of (75%, 133%) for Cmax.
In the PD studies, equivalence was assessed using the criterion that the 95%
confidence interval for the ratio of geometric means for AUEC (area under the effect
curve over time) and the maximum ANC should lie within the (80%, 125%) interval,
except for study EP06-103 where the interval was (87.25%, 114.61%) for the
2.5 mcg/kg dose and (86.5%, 115.61%) for the 5 mcg/kg dose. According to the
publicly available information, there were no predefined equivalence criteria for
CD34+ and 95% and 90% confidence intervals for the ratio of the parameters
(AUEC and maximum CD34+ count) were reported. It is notable that, in general,
the choice of equivalence margins, especially for the PD parameters, is not fixed, but
may depend on the chosen endpoint. This is taken up in subsection “The Choice of
Equivalence Margins in Efficacy Trials” where the choice of margins is discussed.
For the clinical assessment of efficacy and safety, data from two trials were used:
EP06-301 and EP06-302. The latter trial was a double-blind parallel groups trial in
women with histologically proven breast cancer and the treatments were adminis-
tered over six cycles of chemotherapy. The study had four arms: (1) EP006 (E) given
repeatedly for all cycles, (2) Neupogen (N) given repeatedly for all cycles, (3) E and
N were alternated over the cycles in the order (N,E,N,E,N,E), and (4) E and N were
alternated over the cycles in the order (E,N,E,N,E,N). This design was planned to not
only assess similarity but also interchangeability. The concept of interchangeability
is discussed in subsection “Interchangeability of Biosimilars.” The endpoint in this
study was the duration of severe neutropenia. Study EP06-301 was a non-
comparative single arm study in which patients with breast cancer were treated
with chemotherapy and then one day later were given daily EP2006 until neutrophil
recovery.
In January 2015, the expert panel reviewing the Sandoz Inc. application unani-
mously recommended its approval and in March 2015, the FDA gave approval for
the biosimilar to be marketed for all five of the indications approved for Neupogen.

Selected Challenges in Biosimilar Development

In this section, selected challenges in biosimilar development are presented with a


focus on statistical issues. First, the choice of the equivalence margin Δ in theory and
practice is discussed before the assessment of interchangeability and strategies for
including additional information in the Phase III efficacy trials are described.

The Choice of Equivalence Margins in Efficacy Trials

Equivalent efficacy has to be established during biosimilar development. As


described in section “The Stepwise Approach to Biosimilarity,” typically one effi-
cacy endpoint is selected and compared under treatment with the biosimilar and with
the originator. Only if the selected endpoint is “similar” in both treatment groups,
therapeutic comparability is established. However, it is often not straightforward
what “similar” means. The degree of acceptable differences between the biosimilar
and the originator is reflected in the equivalence margin Δ (see section “Testing for
Equivalence in Biosimilar Trials”) which is supposed to give the maximal value such
that the difference in the endpoint is not considered relevant from a clinical point of
view. That is, if the null hypothesis of an equivalence test is rejected, this means that
differences larger than the equivalence margins Δ can be excluded with a pre-
specified probability 1 – α. This clearly shows that the choice of the equivalence
margin has a major influence on the test decision.
In PK (average) bioequivalence studies for generics, a standardized approach is
used for selecting the equivalence margins: typically, the equivalence margins for the
difference of the PK parameters on the log-scale are set to ±log(1.25) indepen-
dently of the active substance (Jones and Kenward 2014). For efficacy endpoints in
biosimilar development, this is, however, not possible because the acceptable dif-
ference (the degree of similarity) in the endpoints highly depends on the active
substance, the indication and the chosen endpoint. That is why a case-by-case
decision has to be made for each active substance and endpoint and the margins
are typically determined by negotiation with the regulatory agencies. In the follow-
ing, we discuss experiences with the choice of equivalence margins in applications to
the European Medicines Agency (EMA) instead of experience with the FDA since
the EMA has already approved more than 40 biosimilars and therefore allows for a
broader overview of current practice.
In their respective guidelines (CHMP 2014a), the EMA only states that “compa-
rability margins should be prespecified and justified on both statistical and clinical
grounds by using the data of the reference [original] product” and refers to a related
guideline on the choice of non-inferiority margins (CHMP 2005). The EMA aims at
being transparent in their decision making and that is why they publish so-called
European public assessment reports (EPARs) which give detailed information on the
provided evidence for approved biosimilars and are accessible by the general public.
Thus, it is possible to analyze the choice of margins in practice where, indeed, the
equivalence margins for the clinical comparability studies were usually prespecified
and only in a few cases were post hoc decisions made (Mielke et al. 2018a).
However, the information on the derivation provided in the EPARs was only in a
few cases reported in enough detail so that a reproduction of the equivalence margins
would be possible. In addition, there does not seem to be a standardized strategy for
the determination of equivalence margins. This is illustrated by using the regulatory
applications for Benepali (active substance: etanercept) and Rixathon (active sub-
stance: rituximab).
For Benepali, the chosen endpoint was the ACR20 responder rates (CHMP
2016): a subject is classified as an ACR20 responder if the relative improvement
in percentage according to the American College of Rheumatology (ACR) criterion
(Felson et al. 1993) when compared to baseline is larger than 20%. Three historical
studies were identified and combined using a random-effects meta-analysis and a
95% confidence interval for the difference in response rates of the originator vs.
placebo was obtained and is reported as (0.3103, 0.4996). The sponsor decided to
aim for a preservation of 50% of the effect of the originator vs. placebo and chose
Δ ¼ 0.15. For Rixathon, in contrast, the overall response rate to the treatment was
the chosen endpoint. The sponsor used only a single historical trial in a comparable
study population for deriving the equivalence margin. A 95% confidence interval for
the observed add-on effect of the originator was estimated and is given by (0.14,
0.34). The sponsor decided for an equivalence margin of Δ ¼ 0.12 which preserves
only 15% of the add-on effect of the originator.
It is acknowledged that the richness of historical data was different in the two
situations: while for Benepali three comparable studies were used with, in total, 460
patients enrolled, for Rixathon only one single study with 320 subjects was ana-
lyzed. If only limited data are available, this leads to a wider confidence interval and
that generally lowers the equivalence margin for a fixed percent of effect to be
preserved. A lower equivalence margin Δ makes it finally more difficult to claim
equivalence. Nonetheless, this example shows that the statistical approach for the
choice of the equivalence margins is not standardized yet. Due to the close
connection between the equivalence margin and the test result, it would be beneficial
if more concrete guidance was provided by regulatory agencies, specifically on the
percentage of effect to be preserved.
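
The "preservation of effect" arithmetic reported above for Benepali can be written out in a few lines. The sketch below simply reproduces the numbers quoted in the text and is illustrative only, not a general recipe.

```python
# Minimal sketch of a "preservation of effect" margin derivation, using
# the values reported for Benepali in the text.
lower_bound = 0.3103   # lower 95% CI bound for originator vs. placebo (ACR20)
preserve = 0.50        # fraction of the originator effect to be preserved

# The margin is the largest acceptable loss of effect
delta = (1 - preserve) * lower_bound
print(round(delta, 4))  # 0.1552, close to the chosen margin of 0.15
```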
Negotiation of the equivalence margin with the regulatory authorities by seeking
Scientific Advice is not mandatory in Europe. This is evident in some of the EPARs
in which it is explicitly stated that the EMA did not agree with the chosen margin.
One example is the application of Amgevita (active substance: adalimumab). There,
the sponsor decided to use a margin of (0.738, 1.355) for the risk ratio of ACR20
responders. The EMA (CHMP 2017) was concerned that this margin was too wide
because it “would correspond to an absolute margin of more than –16% on the
additive scale.” It was concluded that “however, in light of the results observed this
does not represent an issue that could compromise the reliability of the study.” It is
unclear how the EMA would have decided had the study results not also
supported the tighter margins. Therefore, an early discussion with regulatory
authorities on an acceptable choice of equivalence margins is recommended.

Interchangeability of Biosimilars

The primary efficacy endpoint in therapeutic equivalence studies (see section “The
Stepwise Approach to Biosimilarity”) is usually compared in a parallel groups
design, that is, it is confirmed that patients who are taking repeatedly the biosimilar
and patients who are taking repeatedly the originator respond comparably to the
treatment. The focus is typically on treatment-naive patients, that is, patients without
any relevant pre-treatment prior to the start of the study (FDA 2015b). In practice,
since biosimilars are often developed for chronic diseases, patients might want or
need to switch between the biosimilar and its originator once or even multiple times
during the duration of the treatment. While for the approval as a biosimilar in
Europe, no data on transition from the originator to the biosimilar or vice versa is
required, the FDA recommends assessing the impact on immunogenicity of a single
transition from the originator to the biosimilar (FDA 2015b). However, also in the
USA, usually no data are provided in the biosimilar application by the sponsor on the
impact of multiple switches and single crossovers from the biosimilar to the
originator.
To fill this gap, the FDA has the legal option to approve biosimilars as “inter-
changeable biosimilars.” According to BPCI Act (FDA 2009), a proposed product is
considered to be interchangeable, if (1) the proposed product is biosimilar to its
originator, (2) it “can be expected to produce the same clinical result as the reference
[originator] product in any given patient” and (3) “for a biological product that is
administered more than once to an individual, the risk in terms of safety or dimin-
ished efficacy of alternating or switching between use of the biological product and
the reference [originator] product is not greater than the risk of using the reference
product without such alternation or switch” where alternating relates to multiple
switches (e.g., biosimilar to originator back to biosimilar). It is important to note that
there is a clear hierarchy between “biosimilarity” and “interchangeability”: a product
which is interchangeable is also biosimilar; however, biosimilarity is only one part of


the showing of interchangeability.
In the past, it was not known with certainty what data and analysis are required for
the showing of interchangeability. However, in 2017, the FDA published a first draft
guidance (FDA 2017b) which outlined their expectation on studies for approval as
an interchangeable biosimilar. Since biosimilars are diverse, there is “no one-size-
fits-all approach across the product landscape to the data needed to demonstrate
biosimilarity. It follows that the data needed to demonstrate interchangeability are
also determined on a case-by-case basis depending on considerations such as the
complexity of the product, the reference [originator] product’s indications and the
potential for immune system complications.” (Christl 2018). Generally, the FDA
expects that applications will include data from clinical studies in which specifically
the effect of switching is studied for assessing its impact on efficacy and safety of the
product. For the study design, the FDA recommends (FDA 2017b) including so-
called switching (patients switch from the biosimilar to the originator and vice versa)
and non-switching sequences (continuous treatment with the biosimilar or the
originator). An example of the FDA’s proposed study design for a study with the
focus on interchangeability is outlined in Fig. 2. It should consist of a lead-in period
which needs to be sufficiently long for patients to reach steady-state PK (i.e.,
the rate of drug input is equal to the rate of elimination) before the first switch occurs.
A minimum of three switches is required and the last switch needs to be a switch
from the originator back to the biosimilar. After the last switch, a wash-out period of
at least three half-lives needs to be included. Following this wash-out, intensive PK
sampling is performed and Cmax and AUC of these PK profiles are compared by
calculating a 90% confidence interval (see section “Testing for Equivalence in
Biosimilar Trials”) for the ratio of the means of the biosimilar and originator for
both endpoints. If these intervals are both fully contained within 80–125%, the study is
considered to be successful. It should be noted that the interchangeability study may
be combined with the therapeutic equivalence study, which is performed for the
application for getting approval as a biosimilar, if the study is planned to appropri-
ately address both goals. In addition to the clinical studies, human factor studies

Time point for the


First switch after assessment of
PK steady state is interchangeability
reached

Originator
Originator
Biosimilar Originator Biosimilar

Wash-out period

Fig. 2 FDA’s proposed study design for a separate trial for establishing interchangeability
66 Biosimilar Drug Development 1251

might be required which focus on the question if the patients are able to use the
biosimilar device without any additional training.
In discussions among stakeholders, it can be seen that the above guidance has drawn
controversial opinions (Barlas 2017). Independently of the guidelines, a collection
of statistical methodologies for the assessment of interchangeability has evolved.
Compared to the recommended approach in the guideline, some of these methodol-
ogies are less focused on the mean value and are also sensitive to detect changes in
variability (e.g., Li and Chow 2017) or make better use of all measured data by using
the longitudinal assessments of the patients (e.g., Mielke et al. 2018d).
So far, no interchangeable biosimilar has gained approval. It is important to note
that, so far, no results of any study have been published which revealed that
biosimilars cannot be used interchangeably. In contrast, there exist several publica-
tions indicating that switching is not problematic (e.g., Jørgensen et al. 2017;
Benucci et al. 2017). Therefore, it is unclear if the concerns related to interchange-
ability will diminish when more experience with biosimilars in practice is gained or
if the complex clinical studies dedicated to the assessment of interchangeability will
be required in the future.
It should be noted that interchangeability is not a regulatory topic in Europe since
the EMA (2012) clearly states that “the Agency’s evaluations do not include
recommendations on whether a biosimilar should be used interchangeably with its
reference [originator] medicine” and recommends that “for questions related to
switching from one biological medicine to another, patients should speak to their
doctor or pharmacist.” The member states in Europe handle switching and alternat-
ing of biosimilars differently and no joint position is expected to evolve in the near
future (Moorkens et al. 2017).

Incorporating Additional Data in Clinical Efficacy Studies

Typically when clinical efficacy studies are conducted, already rich information on
the biosimilar and its originator has been gathered. The proposed biosimilar has
already been evaluated in analytical, animal, and human PK (and possibly PD)
studies. Even more information is available on the originator since this product is
already an established medical product: the sponsor of the originator conducted
several clinical efficacy studies for gaining market authorization and often the
product has additionally been assessed in postmarketing studies. Furthermore, also
academic institutes or health care providers might have conducted separate trials.
Therefore, it seems natural to include all available information into the showing of
similar efficacy. In the following, it is first assumed that historical information is
available for the originator only and the information is of the same type, that is, the focus
is on the incorporation of results from historical clinical trials (same endpoint) in the
showing of equivalent efficacy of the biosimilar and the originator. It should be noted
that historical information in biosimilar trials was already used in practice by
comparing the efficacy outcomes of a single-arm trial to a historical control trial,
for example, in the application for Zarzio in 2008 (CHMP 2008).
In general, the aim of including historical data is to lower the required sample size or, in other words, to increase the power (the probability of claiming equivalence) of the study. However, one also needs to consider the disadvantages of including all available knowledge in the assessment of equivalent efficacy: clearly, the statistical approach is more complicated, making it more difficult to analyze the study and to communicate the results to a nonstatistical audience. In addition, if the data in the new study follow a different distribution from the data in the historical trials (a so-called prior-data conflict), the Type I error rate (the probability of a false-positive decision) might be higher than the acceptable nominal level. The potential inflation of the Type I error rate, which is the patient's risk that a nonequivalent product will be called equivalent, is of particular concern to regulatory agencies. The use of historical information is common practice in some disease areas: for example, in rare diseases where it is challenging to recruit a sufficient number of patients in randomized trials, the use of prior information might be required so that the development is feasible at all. Also, in situations in which it is unethical to include a placebo group, a comparison of the active treatment to historical placebo data has already been used for regulatory approval. In these situations, a moderate inflation of the Type I error rate is considered acceptable.
In biosimilar development, the situation is quite different (Mielke et al. 2018c): biosimilars are not developed for rare diseases, so a sufficient number of subjects is available. In addition, the randomization of patients to the control group is not unethical, since the control group receives the originator, which is often still the standard of care. Nonetheless, even though the necessity for the inclusion of all available information is weaker for biosimilars, the inclusion of all available data remains desirable from a scientific point of view, especially in the context of the “totality of the evidence,” and can also speed up the development and bring the product to the patient earlier. The use of all information is therefore desirable for both the sponsor and the general public. However, for the reasons outlined above, regulatory expectations in terms of control of the Type I error rate can be expected to be stricter.
Several approaches for the incorporation of historical information have already
been proposed and an overview can be found in van Rosmalen et al. (2017). For
most approaches, it is possible to adjust the methodology to make it more robust
against a potential prior-data conflict and to limit the overall Type I error rate. In
the context of biosimilar development, Pan et al. (2017) developed a methodology
which features some tuning parameters for an improved control of the Type I error
rate. Mielke et al. (2018c) proposed not to aim for control of the Type I error rate
over the whole parameter space (e.g., for response rates between 0 and 1 for a binary
endpoint), but to focus instead on scenarios which are realistic in practice (e.g., true
response rates between 0.2 and 0.3). This idea is displayed in Fig. 3: the Type I error rate and the power are plotted against the rate of a binary characteristic of interest for the originator in the new study, for two hypothetical approaches. One of these approaches makes use of the historical data (solid lines, via a prior distribution, for example), while the other (dotted horizontal lines) does not incorporate the historical data.

Fig. 3 Acceptable operating characteristics dependent on a characteristic of interest of the originator: the vertical solid line indicates the center of the historical data; the dotted horizontal lines are the operating characteristics for a hypothetical approach which does not use the historical data; the non-constant solid lines correspond to a hypothetical approach which incorporates the historical data; the vertical dashed lines give the interval in which the Type I error rate has to be controlled

The vertical solid line gives the center of the
historical data (e.g., the mean of the prior distribution): an observed rate on the x-axis close to this line shows the Type I error rate and the power for situations in which the data on the originator in the new study approximately match the historical data. In contrast, if the observed rate of the characteristic of interest lies, for example, at 0.8 on the x-axis, this corresponds to the operating characteristics for a scenario with a clear prior-data conflict. The proposal of Mielke et al. (2018c) is to control the Type I error rate only for scenarios with a good to moderate fit between the historical data and the data in the new study, displayed in Fig. 3 as the region within the dashed vertical lines. This idea is motivated by the understanding that, since the originator is already an established product, there exists a rich collection of knowledge about it which can be used to plan a study that, while perhaps not identical, would at least provide results similar to those from previous studies. Outside the chosen interval, an increased Type I error rate and a lower power are acceptable, since one is reasonably certain that the true values will not lie outside the chosen interval. Mielke et al. (2018c) proposed a hybrid Bayesian-frequentist methodology for binary endpoints which has the operating characteristics described above.
The incorporation of information gathered during early development (preclinical,
animal, and human PK and PD) is less straightforward. Combest et al. (2014)
proposed constructing an informative prior for the efficacy assessment based on
preclinical data. However, the proposal is rather vague and does not give any
detailed information on the underlying methodology. The main challenge for an
approach like this will, most likely, be the connection of the preclinical assessment
with the clinical result: in contrast to the previously discussed examples where the
same endpoint was measured in the historical study as in the new efficacy study, it is
here necessary to combine completely different pieces of evidence. For example, the preclinical result may come from a bioassay, while the clinical endpoint is binary (responder, nonresponder). One then needs to establish a link between the different measurements, that is, how a difference of, for example, 0.2 in the bioassay relates to the chance of being a responder or nonresponder. Often, the connection between preclinical and clinical results is not known, and therefore even the establishment of equivalence margins for the most critical quality attributes is not straightforward. The aim of including information from early development in the clinical efficacy studies is therefore interesting but ambitious, and more research is required.
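As a purely illustrative sketch of this linkage problem, one could posit a calibration curve translating a bioassay difference into a shift in the clinical response probability; both the logistic form and all numbers below are hypothetical assumptions, not an established method:

```python
# Purely illustrative, hypothetical sketch: an assumed logistic calibration
# translating a preclinical bioassay difference into a shift in the clinical
# response probability. In practice, the slope would have to be estimated or
# elicited, which is exactly the hard part described above.
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

BASELINE_RATE = 0.25  # hypothetical clinical response rate of the originator
BETA0 = np.log(BASELINE_RATE / (1 - BASELINE_RATE))
BETA1 = 1.5           # assumed (hypothetical) calibration slope

def implied_rate_difference(bioassay_diff):
    """Clinical response-rate difference implied by a bioassay difference
    under the assumed calibration."""
    return expit(BETA0 + BETA1 * bioassay_diff) - BASELINE_RATE

for d in (0.0, 0.1, 0.2):
    print(f"bioassay difference {d:.1f} -> implied response-rate "
          f"difference {implied_rate_difference(d):+.3f}")
```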

Operational Challenges in Biosimilar Development

In the previous sections, challenges related to the design and analysis of biosimilar trials were described. However, it is important to emphasize that there are also multiple challenges which are not related to these aspects. Some of them are briefly discussed in the following.
Due to the complex nature of the processes needed to manufacture a biologic, the
batches of the drug that are produced over a given time period may vary in terms of
their exact analytical properties. This is understood by regulators, and manufacturers are expected to run so-called comparability studies at regular intervals to ensure that critical quality attributes of the biologic are maintained within agreed limits (ICH 2004). The manufacturer thus builds up a history of how batches vary over time, but this knowledge is not available to the developer of a biosimilar. The only knowledge that the biosimilar developer has comes from analysis of batches of the original biologic that are purchased on the open market. So, in some sense, the biosimilar developer has to chase a moving target in terms of showing
analytical similarity (Step 1 in Fig. 1). See Schiestl et al. (2011) and Mielke et al.
(2019) for examples and further discussion on this. See Berkowitz (2017) for a more
complete discussion of issues related to the structural assessment of biosimilarity.
As some biologics have a long half-life, this may preclude the use of cross-over trials to show equivalence of PK and PD markers for certain active substances. Parallel-group trials then have to be used, and these typically have much larger sample size requirements than those needed to provide evidence of equivalence of a nonbiologic (generic) drug with its reference using a cross-over trial, as the sketch below illustrates.
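A rough normal-approximation sketch, assuming the classical TOST sample size formula at a true difference of zero and purely hypothetical inputs, illustrates the gap:

```python
# Rough normal-approximation sketch of total sample size for showing
# equivalence of (log-scale) means with TOST, comparing a parallel-group
# design with a 2x2 cross-over design. All inputs are hypothetical.
from scipy.stats import norm

def z_total(alpha, power):
    # TOST approximation at a true difference of zero
    return norm.ppf(1 - alpha) + norm.ppf(1 - (1 - power) / 2)

def n_parallel_total(sigma, margin, alpha=0.05, power=0.9):
    """Total n over both groups for a parallel-group design."""
    n_per_group = 2 * sigma**2 * z_total(alpha, power)**2 / margin**2
    return 2 * n_per_group

def n_crossover_total(sigma, rho, margin, alpha=0.05, power=0.9):
    """Total n for a 2x2 cross-over; rho is the within-subject correlation,
    so the within-subject variance is (1 - rho) * sigma**2."""
    return 2 * (1 - rho) * sigma**2 * z_total(alpha, power)**2 / margin**2

sigma, margin, rho = 0.35, 0.2, 0.7   # hypothetical values
print(f"parallel total n ~ {n_parallel_total(sigma, margin):.0f}")
print(f"cross-over total n ~ {n_crossover_total(sigma, rho, margin):.0f}")
```

With these inputs the parallel-group design needs roughly 2/(1 - rho) times as many subjects in total as the cross-over, which is why losing the cross-over option for long half-life biologics is costly.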
Another challenge related to recruitment, mentioned by Weschler (2016), is that
experienced clinical investigators might prefer to be involved in the development of
innovative drugs rather than in the development of copies of existing drugs. This
might limit the number of research centers that are available to take part in a
biosimilar study.
Once the biosimilar has gained regulatory approval and is on the market, a further
challenge is to convince physicians to prescribe the new drug. In addition, patients
need to agree to use the biosimilar instead of the originator product. Surveys
studying the level of awareness of physicians and patients regarding biosimilars
indicated that the understanding of the concept of biosimilarity still needs to be
improved (Cohen et al. 2016; Jacobs et al. 2016). See Blackstone and Fuhr Jr (2017),
for example, for further discussion on issues related to competition in the biologic
and biosimilar market.

Summary and Conclusion

This chapter provided an overview of the necessary steps of a biosimilar development program: in contrast to the development of generics, biosimilar development is
a complex, expensive, and time-consuming procedure which consists of analytical,
nonclinical, and clinical studies. All evidence is combined using the concept of
“totality of the evidence” which means that there is not one pivotal study, but all
steps in the development program are considered important. The final decision as to whether to approve or reject a proposed biosimilar is based on scientific judgment, taking into account all provided evidence. Clearly, this requires a multidisciplinary team which is able to weigh the very different pieces of evidence. Due to the lack of
experience with biosimilars, the FDA announced that there will be an advisory
committee for the first biosimilar for each originator (Brennan 2016).
Biosimilar development worldwide is not harmonized yet and each regulatory
authority has its own regulations and guidelines. The main hurdles for global
biosimilar development are, however, not the different regulatory requirements,
but that regulatory agencies usually prefer studies which give a head-to-head
comparison of the originator product (one that is approved in the respective region)
and the proposed biosimilar. Nonetheless, many sponsors are not keen on conducting several separate biosimilar development programs (one per region), but prefer to conduct one global development program and use it for approval worldwide (e.g., Webster and Woollett 2017). The key to global biosimilar development is the idea of bridging: one or several studies are conducted which establish the comparability between the biosimilar, the originator used in most studies (e.g., the EU-sourced originator), and the local originator (e.g., the USA-sourced product). The focus is then not only on establishing similarity between biosimilar and originator, but also on showing comparability of the two originator products. This approach is compatible with the FDA’s requirement to compare the proposed biosimilar directly against the FDA-approved originator product using “analytical studies and at least one clinical PK study and, if appropriate, at least one PD study” (FDA 2015b), unless it is justified that any of these is not necessary. Animal and therapeutic equivalence studies might be performed using a non-USA-licensed originator product if the bridge between this originator product and the USA originator product
has been established. The European regulations are slightly different: the compara-
bility exercise on the analytical level needs to be conducted against an originator
which is authorized in the EU. In contrast, it is explicitly stated that the clinical and in
vivo nonclinical studies may be conducted with a non-EU originator if the relevance
of this non-EU originator has been demonstrated in bridging studies (CHMP 2014b).

In practice, global development programs are becoming more common. Among the recently approved applications in Europe, many used a strategy with bridging at the PK level (Mielke et al. 2018a). Herzuma (Celltrion Healthcare, active substance: trastuzumab) was the first approved product in Europe for which bridging studies were conducted only at the analytical level and no head-to-head comparisons of the EU-approved originator with the proposed biosimilar in human subjects were performed (i.e., no PK, PD, clinical comparability, or safety studies were conducted).
This chapter also highlighted some key statistical challenges: the choice of equivalence margins, the establishment of interchangeability, and the formal incorporation of additional information into the analysis of efficacy endpoints. These can only serve as a short introduction to statistical issues in biosimilar development; several other important topics deserve consideration, for example, the handling of multiplicity (e.g., Mielke et al. 2018b), the use of statistics in preclinical development (e.g., Tsong et al. 2017; Mielke et al. 2019), or the application of advanced statistical tools such as network meta-analysis for an improved efficacy assessment (e.g., Messori et al. 2017). With the increasing number of approved biosimilars, the number of tailored statistical methodologies developed specifically for biosimilar development is expected to increase further in the near future.

Key Facts

Biosimilars are developed as copies of already approved, innovative, large-molecule drugs which can be sold after the patent of the originator has expired. The develop-
ment and approval of biosimilars differs substantially from the development of
generics (copies of small-molecule drugs). Regulators recommend a step-wise
approach which consists of comparisons of structural and functional characteristics
of the molecules, nonclinical studies, and clinical studies. The data from these
studies are in most cases analyzed with an equivalence testing approach. Since biosimilars are a fairly new concept, many open questions remain, not only in terms of tailored statistical methodology but also in terms of regulatory guidance. More research is required to make biosimilar development more efficient.

Cross-References

▶ Cross-over Trials
▶ Essential Statistical Tests
▶ Introduction to Meta-Analysis
▶ Pharmacokinetic and Pharmacodynamic Modeling
▶ Use of Historical Data in Design

Acknowledgements The authors gratefully acknowledge the funding from the European Union’s
Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant
agreement No 633567 and from the Swiss State Secretariat for Education, Research and Innovation
(SERI) under contract number 999754557. The opinions expressed and arguments employed herein
do not necessarily reflect the official views of the Swiss Government.

References
Arato T (2016) Japanese regulation of biosimilar products: past experience and current challenges.
Br J Clin Pharmacol 82(1):30–40
Barlas S (2017) FDA guidance on biosimilar interchangeability elicits diverse views: current and
potential marketers complain about too-high hurdles. Pharm Ther 42(8):509
Benucci M, Gobbi FL, Bandinelli F, Damiani A, Infantino M, Grossi V, Manfredi M, Parisi S,
Fusaro E, Batticciotto A et al (2017) Safety, efficacy and immunogenicity of switching from
innovator to biosimilar infliximab in patients with spondyloarthritis: a 6-month real-life obser-
vational study. Immunol Res 65(1):419–422
Berkowitz SA (2017) Analytical characterization: structural assessment of biosimilarity, Chap 2. In:
Endrenyi L, Declerck P, Chow SC (eds) Biosimilar drug product development. CRC Press, Boca
Raton, pp 15–82
Bewesdorff M (2016) Biosimilars in the U.S. – the long way to their first approval. Master of drug
regulatory affairs, Rheinische Friedrich-Wilhelms-Universität Bonn
Blackstone E, Fuhr JP Jr (2017) Biosimilars and biologics. The prospect for competition, Chap 16.
In: Endrenyi L, Declerck P, Chow SC (eds) Biosimilar drug product development. CRC Press,
Boca Raton, pp 413–438
Brennan Z (2016) FDA to hold one advisory committee for each initial biosimilar. https://www.
raps.org/regulatory-focus%E2%84%A2/news-articles/2016/9/fda-to-hold-one-advisory-commi
ttee-for-each-initial-biosimilar. Accessed 07 June 2018
Cazap E, Jacobs I, McBride A, Popovian R, Sikora K (2018) Global acceptance of biosimilars:
importance of regulatory consistency, education, and trust. Oncologist 23:1188
CHMP (2005) Guideline on the choice of non-inferiority margins. http://www.ema.europa.eu/docs/
en_GB/document_library/Scientific_guideline/2009/09/WC500003636.pdf. Accessed 07 June
2018
CHMP (2008) Zarzio: EPAR public assessment report. http://www.ema.europa.eu/docs/en_GB/docu
ment_library/EPAR_-_Public_assessment_report/human/000917/WC500046528.pdf. Accessed
26 Oct 2015
CHMP (2014a) Guideline on similar biological medicinal products containing biotechnology-derived
proteins as active substance: non-clinical and clinical issues (revision 1). http://www.ema.europa.
eu/docs/en_GB/document_library/Scientific_guideline/2015/01/WC500180219.pdf. Accessed 22
Feb 2018
CHMP (2014b) Guideline on similar biological medicinal products (revision 1). http://www.ema.
europa.eu/docs/en_GB/document_library/Scientific_guideline/2014/10/WC500176768.pdf.
Accessed 22 Feb 2018
CHMP (2016) Benepali: EPAR – public assessment report. http://www.ema.europa.eu/docs/en_
GB/document_library/EPAR_-_Public_assessment_report/human/004007/WC500200380.pdf.
Accessed 07 June 2018
CHMP (2017) Amgevita: EPAR – public assessment report. http://www.ema.europa.eu/docs/en_
GB/document_library/EPAR_-_Public_assessment_report/human/004212/WC500225231.pdf.
Accessed 07 June 2018
Chow SC, Hsieh TC, Chi E, Yang J (2009) A comparison of moment-based and probability-based
criteria for assessment of follow-on biologics. J Biopharm Stat 20(1):31–45
Christl LA (2018) From our perspective: interchangeable biological products. https://www.fda.gov/
Drugs/NewsEvents/ucm536528.htm. Accessed 22 Feb 2018
Christl LA, Woodcock J, Kozlowski S (2017) Biosimilars: the US regulatory framework. Annu Rev
Med 68(1):243–254
Cohen H, Beydoun D, Chien D, Lessor T, McCabe D, Muenzberg M, Popovian R, Uy J (2016)
Awareness, knowledge, and perceptions of biosimilars among specialty physicians. Adv Ther
33(12):2160–2172
Combest A, Wang S, Healey B, Reitsma DJ (2014) Alternative statistical strategies for biosimilar
drug development. GaBI J 3(1):13–20
Crommelin D, Bermejo T, Bissig M, Damiaans J, Krämer I, Rambourg P, Scroccaro G, Strukelj B,
Tredree R (2005) Pharmaceutical evaluation of biosimilars: important differences from generic
low-molecular-weight pharmaceuticals. Eur J Hosp Pharm Sci 11(1):11–17
EMA (2012) Questions and answers on biosimilar medicines (similar biological medicinal products).
http://www.medicinesforeurope.com/2012/09/27/ema-questions-and-answers-on-biosimilar-medi
cines-similar-biological-medicinal. Accessed 22 Feb 2018
FDA (2009) Biologics price competition and innovation act. http://www.fda.gov/downloads/Drugs/
GuidanceComplianceRegulatoryInformation/ucm216146.pdf. Accessed 22 Feb 2018
FDA (2015a) Sandoz briefing book for application to market zarxio. https://patentdocs.typepad.
com/files/briefing-document.pdf. Accessed 11 Jan 2019
FDA (2015b) Scientific considerations in demonstrating biosimilarity to a reference product. https://
www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM
291128.pdf. Accessed 05 June 2018
FDA (2016) Clinical pharmacology data to support a demonstration of biosimilarity to a reference
product. https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/
Guidances/UCM397017.pdf. Accessed 05 June 2018
FDA (2017a) Biological product definitions. https://www.fda.gov/downloads/Drugs/Development
ApprovalProcess/HowDrugsareDevelopedandApproved/ApprovalApplications/TherapeuticBio
logicApplications/Biosimilars/UCM581282.pdf. Accessed 05 June 2018
FDA (2017b) Considerations in demonstrating interchangeability with a reference product. https://
www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM
537135.pdf. Accessed 22 Feb 2018
FDA (2018) FDA withdraws draft guidance for industry: statistical approaches to evaluate analyt-
ical similarity. https://www.fda.gov/Drugs/DrugSafety/ucm611398.htm. Accessed 17 Jul 2018
Felson DT, Anderson JJ, Boers M, Bombardier C, Chernoff M, Fried B, Furst D, Goldsmith C, Kieszak
S, Lightfoot R et al (1993) The American College of Rheumatology preliminary core set of disease
activity measures for rheumatoid arthritis clinical trials. Arthritis Rheumatol 36(6):729–740
Finckh A, Bansback N, Marra CA, Anis AH, Michaud K, Lubin S, White M, Sizto S, Liang MH
(2009) Treatment of very early rheumatoid arthritis with symptomatic therapy, disease-modify-
ing antirheumatic drugs, or biologic agents: a cost-effectiveness analysis. Ann Intern Med 151
(9):612–621
Health Affairs Health Policy Brief (2013) Biosimilars. https://www.healthaffairs.org/do/10.1377/
hpb20131010.6409/full/. Accessed 05 June 2018
Holzmann J, Balser S, Windisch J (2016) Totality of the evidence at work: the first U.S. biosimilar.
Expert Opin Biol Ther 16(2):137–142
Hsieh TC, Chow SC, Yang LY, Chi E (2013) The evaluation of biosimilarity index based on
reproducibility probability for assessing follow-on biologics. Stat Med 32(3):406–414
Hsu JC, Hwang JTG, Liu HK, Ruberg SJ (1994) Confidence intervals associated with tests for
bioequivalence. Biometrika 81(1):103–114
Hung A, Vu Q, Mostovoy L (2017) A systematic review of US biosimilar approvals: what evidence
does the FDA require and how are manufacturers responding? J Manag Care Spec Pharm
23(12):1234–1244
ICH (2004) Comparability of biotechnological/biological products subject to changes in their
manufacturing process, Q5E
Jacobs I, Singh E, Sewell KL, Al-Sabbagh A, Shane LG (2016) Patient attitudes and understanding
about biosimilars: an international cross-sectional survey. Patient Prefer Adherence 10:937–948
Jones B, Kenward M (2014) Design and analysis of cross-over trials, 3rd edn. Chapman & Hall/
CRC monographs on statistics & applied probability. Taylor & Francis. https://books.google.ch/
books?id=tuisBAAAQBAJ
Jørgensen KK, Olsen IC, Goll GL, Lorentzen M, Bolstad N, Haavardsholm EA, Lundin KE, Mørk
C, Jahnsen J, Kvien TK et al (2017) Switching from originator infliximab to biosimilar CT-P13
compared with maintained treatment with originator infliximab (NOR-SWITCH): a 52-week,
randomised, double-blind, noninferiority trial. Lancet 389(10086):2304–2316
Li J, Chow SC (2017) Statistical evaluation of the scaled criterion for drug interchangeability.
J Biopharm Stat 27(2):282–292
Messori A, Trippoli S, Marinai C (2017) Network meta-analysis as a tool for improving the
effectiveness assessment of biosimilars based on both direct and indirect evidence: application
to infliximab in rheumatoid arthritis. Eur J Clin Pharmacol 73(4):513. https://doi.org/10.1007/
s00228-016-2177-z
Mielke J, Jilma B, Koenig F, Jones B (2016) Clinical trials for authorized biosimilars in the
European Union: a systematic review. Br J Clin Pharmacol 82(6):1444–1457
Mielke J, Jilma B, Jones B, Koenig F (2018a) An update on the clinical evidence that supports
biosimilar approvals in Europe. Br J Clin Pharmacol 84(7):1415–1431
Mielke J, Jones B, Jilma B, König F (2018b) Sample size for multiple hypothesis testing in
biosimilar development. Stat Biopharm Res 10(1):39–49
Mielke J, Schmidli H, Jones B (2018c) Incorporating historical information in biosimilar trials:
challenges and a hybrid Bayesian-frequentist approach. Biom J 60(3):564–582
Mielke J, Woehling H, Jones B (2018d) Longitudinal assessment of the impact of multiple switches
between a biosimilar and its reference product on efficacy parameters. Pharm Stat 17(3):231–247
Mielke J, Innerbichler F, Schiestl M, Ballarini NM, Jones B (2019) The assessment of quality
attributes for biosimilars: a statistical perspective on current practice and a proposal. AAPS J 21:7
Moorkens E, Vulto AG, Huys I, Dylst P, Godman B, Keuerleber S, Claus B, Dimitrova M, Petrova
G, Sović-Brkičić L et al (2017) Policies for biosimilar uptake in Europe: an overview. PLoS One
12(12):e0190147
Pan H, Yuan Y, Xia J (2017) A calibrated power prior approach to borrow information from
historical data with application to biosimilar clinical trials. J R Stat Soc Ser C Appl Stat
66(5):979–996
Schellekens H, Moors E (2015) Biosimilars or semi-similars? Nat Biotechnol 33(1):19–20
Schiestl M, Stangler T, Torella C, Cepeljnik T, Toll H, Grau R (2011) Acceptable changes in quality
attributes of glycosylated biopharmaceuticals. Nat Biotechnol 29:310–312
Schiestl M, Li J, Abas A, Vallin A, Millband J, Gao K, Joung J, Pluschkell S, Go T, Kang HN
(2014) The role of the quality assessment in the determination of overall biosimilarity: a
simulated case study exercise. Biologicals 42(2):128–132
Schoergenhofer C, Schwameis M, Firbas C, Bartko J, Derhaschnig U, Mader RM, Plaßmann RS,
Jilma-Stohlawetz P, Desai K, Misra P et al (2018) Single, very low rituximab doses in healthy
volunteers-a pilot and a randomized trial: implications for dosing and biosimilarity testing. Sci
Rep 8(1):124
Schuirmann DJ (1987) A comparison of the two one-sided tests procedure and the power approach for
assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 15(6):657–680
Tsong Y, Dong X, Shen M (2017) Development of statistical methods for analytical similarity
assessment. J Biopharm Stat 27(2):197–205
Tsou HH, Chang WJ, Hwang WS, Lai YH (2013) A consistency approach for evaluation of
biosimilar products. J Biopharm Stat 23(5):1054–1066
van Rosmalen J, Dejardin D, van Norden Y, Löwenberg B, Lesaffre E (2017) Including historical
data in the analysis of clinical trials: is it worth the effort? Stat Methods Med Res.
https://www.ncbi.nlm.nih.gov/pubmed/28322129
Webster CJ, Woollett GR (2017) A ‘global reference’ comparator for biosimilar development.
BioDrugs 31(4):279–286
Weise M, Kurki P, Wolff-Holz E, Bielsky MC, Schneider CK (2014) Biosimilars: the science of
extrapolation. Blood 124(22):3191–3196
Wellek S (2010) Testing statistical hypotheses of equivalence and noninferiority, 2nd edn. CRC
Press, London
Weschler B (2016) Biosimilar trials differ notably from innovator studies. Appl Clin Trials. http://
www.appliedclinicaltrialsonline.com/biosimilar-trials-differ-notably-innovator-studies
67 Prevention Trials: Challenges in Design, Analysis, and Interpretation of Prevention Trials

Shu Jiang and Graham A. Colditz

Division of Public Health Sciences, Department of Surgery, Washington University School of Medicine, Saint Louis, MO, USA
e-mail: [email protected]; [email protected]

Contents
Introduction 1262
Trial Population 1263
The Disease Process and Identifying a Population at Risk Who Can Benefit from a Preventive Intervention 1264
Components of Intervention 1265
Sustainability of the Behavior Change 1265
The Time Course of the Intervention Within the Disease Process 1266
The Dose 1267
The Duration of “Exposure/Intervention” Needed to Produce Risk Reduction 1268
The Durability of the Impact of the Intervention After It Has Stopped 1269
Outcomes for Prevention Trials 1269
Analysis ITT and Adherence in Prevention Trials 1270
Biomarkers and Other Emerging Areas 1271
Interpreting Prevention Trials 1273
Conclusion 1273
Key Facts 1274
Cross-References 1274
References 1274

Abstract
Designing a prevention trial requires understanding the natural history of the
disease, and the likely length of intervention required to achieve a reduction in
incidence. The population must be suitable to contribute meaningful information to the outcomes under study, the intervention must be appropriate and likely to generate a favorable balance of risks and benefits for the typically disease-free population, and the primary outcome must be biologically plausible and clinically relevant. Given the
relatively long evolution of chronic diseases, prevention trials bring extra pres-
sures on two fundamental issues in the design of the trial: adherence to the
preventive intervention among participants who are otherwise healthy, and
sustained follow-up of trial participants. With growing emphasis on the compo-
sition of the trial participant population reflecting the overall population for
ultimate application of the results, there is the need for additional attention to
recruitment and retention of participants. This is fundamental to planning a
prevention trial. Planning for follow-up after the intervention is completed
helps place the intervention and outcomes in the context of the disease process
but adds complexity to recruitment. Never the less this adds to insights from
prevention trials. Improving risk stratification for identification of eligible partic-
ipants for recruitment to prevention trials can improve efficiency of the trials and
fit prevention trials in the context of precision prevention.

Keywords
Participants · Baseline risk · Diversity · Natural history · Sustainability ·
Intervention timing · Adherence

Introduction

The majority of chronic diseases can be prevented through a combination of lifestyle changes and preventive medications, and in some settings vaccines and screening or early detection. In 2002 the leading chronic diseases were cardiovascular disease, cancer, chronic respiratory disease, and diabetes, accounting for a combined 29 million deaths (Yach et al. 2004). Yet effective interventions to prevent these chronic diseases are often lacking. One central challenge for understanding the impact of prevention strategies for chronic disease is placing the change in exposure within the time course of disease development. The level and sources of evidence supporting change in individual exposure to reduce disease risk, or to prevent chronic illnesses, vary substantially across both the lifestyle exposures and the chronic diseases that are a focus of prevention interventions. We focus here on the design and interpretation of trials of such interventions. We do not address the design issues in screening trials (see ▶ Chap. 65, “Screening Trials”), nor do we tackle cluster randomized trials (see ▶ Chap. 77, “Cluster Randomized Trials”), which are increasingly used in implementation science studies. There, interventions such as changes in provider and patient behaviors (in clinical settings or schools, for example) may cluster study subjects within the clinic or classroom, and these units may in turn be clustered within health systems or school districts, leading to multiple levels of clustering with implications for design, size, intervention delivery, outcome evaluation, and analysis.
In this chapter, we consider several key factors that bear on the design and
interpretation of prevention trials. These factors include (1) the population in
which the intervention can be evaluated efficiently; (2) the underlying disease
process; (3) key components of the intervention and comparison/control
intervention; and (4) the outcome. For interventions we consider (a) sustainability of
the behavior change, (b) the time course of the intervention within the disease
process, (c) the dose, (d) the duration of “exposure” needed to effect risk reduction,
and (e) the durability of the impact of the intervention after it has stopped. Issues of
adherence to the intervention and approaches to analysis also impact the inference
from prevention trials when informing changes in policy and practice.

Trial Population

A key assumption for enrolling participants in a trial is that they will contribute
meaningful information to the outcome measures and that they are likely to engage
in the intervention for the duration of the study. Participant recruitment for preven-
tion trials brings added challenges beyond enrolling patients facing acute disease or
major catastrophic outcomes in the near term. The treatment setting poses fewer enrollment issues, as adherence to treatment options is likely to be high and the balance of risks and benefits can be conveyed in time frames that relate to the patient's situation at hand. For primary prevention trials, however, we enroll healthy individuals who may be at risk of future disease and must engage them in longer-term adherence to prevention strategies (pills, behaviors, or combinations) with endpoints observed well into the future. This challenge of recruitment to prevention trials has resulted in a number of trials where the prevalence of baseline behaviors produced a population not ideally selected to evaluate the intervention (see the Physicians' Health Study of aspirin to prevent cardiovascular death, with its very low cardiovascular disease incidence (Cairns et al. 1991), or calcium and vitamin D in the Women’s Health Initiative, where baseline calcium was above the threshold of benefit as determined from observational studies (Martinez et al. 2008)). On the other hand, the trials evaluating Tamoxifen for breast cancer prevention used a baseline estimate of breast cancer risk to identify women at elevated risk and so shift the balance of risks and benefits for those randomized (Fisher et al. 1998). Similar issues are discussed in ▶ Chap. 112, “Trials Can Inform or Misinform: ‘The Story of Vitamin A Deficiency and Childhood Mortality’”.
Eligibility and enrollment of the study population may also limit the application
of results beyond the trial. This issue is not limited to primary prevention, of course, but is also demonstrated by exclusions based on older age or the presence of major comorbidities, which limit generalizability and application of results (see Stoll et al. (2019), and pragmatic trials (Ware and Hamel 2011)).
Beyond disease severity, risk factor profiles and the like, many clinical trials are
underpopulated with minority participants (Chen et al. 2014). This is due, in part, to
eligibility criteria and a lack of engagement strategies tailored to minorities. Evidence
shows that concerted efforts to modify eligibility to include broader populations of
patients, and use of culturally tailored materials and processes, result in increased
research and trial accruals of minorities and their retention through the duration of
the trial (Warner et al. 2013). Thus, to generate results applicable to the broader
population, design of trials should increase eligibility to populations with multiple
comorbidities as experienced by populations with cancer disparities, and promote
the development of patient engagement approaches tailored to minority and under-
served populations.

The Disease Process and Identifying a Population at Risk Who Can Benefit from a Preventive Intervention

Epidemiologic and natural history studies often define risk factors and provide input
to models that classify risk of chronic disease. In defining the population for
recruitment to a prevention trial, the aim is to identify those with sufficiently high
risk of disease based on a combination of risk factors so that the benefit from a
preventive intervention will outweigh any possible adverse effects of the interven-
tion. Thus, selection of participants is in part driven by baseline disease risk being
sufficient to generate a research answer in a short funding time frame. For example,
after numerous meetings convened by NIH to discuss trial design for weight loss to
prevent chronic disease, NIDDK chose to move forward with a prevention trial of
intensive lifestyle intervention to prevent or delay development of diabetes (Knowler
et al. 2002).
Numerous examples from prior trials show that healthy volunteers are not
necessarily at sufficient risk to generate endpoints from the intervention. Improving
risk classification for entry to prevention trials is a major imperative. Much work is
ongoing in this area to define at-risk groups more precisely, whether by combining questionnaire risk factors, polygenic risk scores, or metabolomic profiles, to differentiate those who might respond to a prevention intervention from those who will not.
Take, for example, breast cancer, where chemoprevention shows marked differences between prevention of receptor-positive and receptor-negative disease. Overall, selective estrogen receptor modulators (SERMs) such as Tamoxifen and Raloxifene have been shown in randomized controlled prevention trials to reduce the risk of preinvasive and invasive breast cancer (Fisher et al. 1998; Martino et al. 2004). The separation of incidence curves is dramatic and clear within 2 years of initiating therapy. Like aspirin, SERMs also raise the challenge of weighing the risks and benefits of therapies, as well as the limitation of randomized trials in quantifying potential harms that are much less frequent than the primary trial end point. Tamoxifen increases the risk of uterine cancer, a finding confirmed by epidemiologic studies; Raloxifene, which looks to have a safer profile, does not (Chen et al. 2007). Yet the protection is limited to receptor-positive disease. Thus, identifying and enrolling those who are at highest risk of receptor-positive, but not receptor-negative, breast cancer could maximize benefits and reduce potential harms.
The emerging field of personalized medicine has been offering possibilities for improving risk prediction and stratification based on patient-specific demographic and clinical factors, medical histories, and genetic profiles. Models that are tailored to patients have provided high-quality recommendations for screening, accounting for individual heterogeneity (Sargent et al. 2005). Traditional approaches usually aim to investigate evidence of treatment differences by conducting subgroup analysis based on prior data to gain insights on markers that can better stratify patients according to their risk level. Other approaches include regression models which usually include interaction terms between treatment and the covariates, in order to examine whether these interactions are statistically significant as well as to estimate the true (undiluted) benefit of the intervention; a minimal sketch is given below. Effective classification of patients can thus be translated into statistical models which aim to minimize prediction error, where the optimal risk classifier leads to the best predicted outcome.
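A minimal sketch of such an interaction model, fitted to simulated (hypothetical) data, might look as follows; the treat:marker coefficient is the quantity of interest:

```python
# Minimal sketch (simulated, hypothetical data): a logistic model with a
# treatment-by-marker interaction to probe whether a baseline marker
# stratifies the benefit of a preventive intervention.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4000
marker = rng.binomial(1, 0.4, n)          # binary baseline risk marker
treat = rng.binomial(1, 0.5, n)           # randomized intervention
# Simulated truth: the intervention helps only marker-positive participants.
lin_pred = -2.0 + 0.5 * marker - 0.8 * treat * marker
event = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin_pred)))

df = pd.DataFrame({"event": event, "treat": treat, "marker": marker})
fit = smf.logit("event ~ treat * marker", data=df).fit(disp=0)
print(fit.summary().tables[1])            # inspect the treat:marker interaction
```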

Components of Intervention

Interventions for prevention range from one-time (or two-time) events (some vaccines), to longer-term use of preventive drugs such as aspirin, nutritional supplements, or the polypill for cardiovascular diseases (Yusuf et al. 2021), to lifestyle changes (components of diet, physical activity, sun exposure) (Knowler et al. 2002). The components of the intervention have major implications for the design and cost of prevention trials and the ultimate interpretation of the results, as discussed in the next sections.
The control or comparison intervention generates similar challenges for designing and implementing a trial. If usual care is the comparison, as in the Hypertension Detection and Follow-up Program in the 1970s (1979) or the Health Insurance Plan of New York mammography trial (Shapiro et al. 1985), the control arm may seek out the intervention and dilute the comparison. Similar issues arose in the Women’s Health Initiative trial discussed below.

Sustainability of the Behavior Change

For lifestyle interventions to change diet, physical activity, or other aspects of our lifestyle such as transportation and commuting, many issues arise relating to the sustainability of the intervention and its associated lifestyle changes, and to the use of approaches to document adherence.
Adherence to interventions has not been high in long-term primary prevention trials. The Tamoxifen breast cancer prevention trial (P1) was designed allowing 10% of women per year to discontinue Tamoxifen therapy, though the observed noncompliance was lower: 23.7% of women randomized to Tamoxifen stopped their therapy during the trial vs 19.7% of the placebo group (Fisher et al. 1998).
Women’s Health Initiative evaluation of menopausal hormone therapy, drop out was
42% for estrogen plus progestin and 38% for placebo. This exceeded the design
projections. Of note, women in the placebo group initiated hormone use through
their own clinical providers (10.7% by the sixth year) (Rossouw et al. 2002). Similar
adherence issues apply for diet interventions. In the low-fat diet intervention in the
Women’s Health Initiative, a total of 48,835 postmenopausal women were random-
ized to the dietary intervention (40%) or the comparison group (60%). The
intervention promoted dietary change with a goal of reducing intake of total fat to
20% of energy. Concomitant with this the participants would increase consumption
of vegetables and fruit to at least five servings daily, and also increase their intake of
grains to six servings daily. Estimated adherence in the intervention group was 57%
at year 3, 31% at year 6, and 19% at year 9, substantially lower than the adherence in
the comparison group (Prentice et al. 2006). Given the challenges of sustained
behavior change in otherwise healthy trial participants, adaptations of technology
such as text messaging and more real-time feedback have been studied as adjuncts to
motivating and sustaining lifestyle changes (Wolin et al. 2015). There is much
research ongoing to identify the most effective strategies to motivate and sustain
participation and adherence for different populations based on gender, age, and race/
ethnicity. As technology continues to evolve, and access increases, additional
insights should improve the design and interpretation of prevention trials.
One strategy that has been used to improve adherence in trial participants is an active run-in phase before randomization. For example, in the Physicians' Health Study, a randomized trial of aspirin and beta-carotene to prevent heart disease and cancer, an active run-in facilitated identification of those with an adverse tolerance of every-other-day aspirin.
Beyond examples such as these, and the continuing research to adapt and improve
approaches to enroll and sustain adherence to interventions over prolonged time
periods, adherence in prevention trials has major implications for design, analysis
and interpretation of trial results as discussed below.
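A crude numeric sketch of how nonadherence dilutes an intention-to-treat comparison (assuming, hypothetically, that participants who stop or start the intervention simply take on the other arm's risk) makes the design implications concrete:

```python
# Tiny numeric sketch (hypothetical inputs): crude dilution of the
# intention-to-treat risk ratio by nonadherence in the active arm and
# drop-in in the control arm. Control-arm risk is scaled to 1.
rr_full = 0.7                     # hypothetical risk ratio under full adherence
for dropout, drop_in in ((0.0, 0.0), (0.2, 0.1), (0.4, 0.2)):
    risk_active = (1 - dropout) * rr_full + dropout * 1.0
    risk_control = (1 - drop_in) * 1.0 + drop_in * rr_full
    itt_rr = risk_active / risk_control
    print(f"dropout={dropout:.0%}, drop-in={drop_in:.0%}: ITT RR ~ {itt_rr:.2f}")
```

Under this crude model, 20% dropout and 10% drop-in attenuate a true risk ratio of 0.7 to an observed intention-to-treat risk ratio of roughly 0.78, with a corresponding loss of power.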

The Time Course of the Intervention Within the Disease Process

Interpreting null results in prevention trials begs the question of whether the inter-
vention was delivered at an appropriate time in the disease process, or whether dose
and duration of the intervention were chosen correctly. First, we consider the timing
of the intervention.
The null RCTs of fiber and fruit and vegetables for prevention of polyp recurrence
amply illustrate Zelen’s concerns about the timing of the preventive intervention in the
disease process. Randomized trials of fiber and fruit and vegetables in the prevention
of colon polyp recurrence have not shown any benefit from increased intake (Alberts
et al. 2000; Schatzkin et al. 2000). Furthermore, in prevention trials addressing
recurrence of polyps, the extent of DNA damage accumulated across the colonic
mucosa at the time the eligibility polyp is detected certainly is not limited to only the
removed polyp. Thus we must ask of RCTs, at what stage in the disease process may
fiber play a role in protecting against colon cancer? Constraints of design in RCTs
usually limit to a narrow time point and defined dose of exposure (and specific
duration), which contrast with the richness of epidemiologic studies that can address
exposure over the life course and relate such exposure to disease risk.
Other nutritional agents have also been tested in chemoprevention trials in the
developed world and in China (Greenwald et al. 2007). Based on evidence
documenting that people in Linxian, China, had low intakes of several nutrients, a
randomized trial comparing combinations of retinol, zinc, riboflavin, niacin, vitamin
C and molybdenum, beta-carotene, vitamin E, and selenium was undertaken (Blot
et al. 1993). Significant reductions in mortality were observed for those who
received the combination of beta-carotene, vitamin E, and selenium (factor D), and
the reduction was greater for those who began the therapy at a younger age. These
results again emphasize the importance of the timing of exposure in the disease
process.
Stratification of the results by sex and age was planned a priori. There were no
statistically significant interactions with sex. However, when stratified by age, factor D
had a strong protective effect in individuals under age 55 but demonstrated almost no
effect in subjects aged 55 years or older (Qiao et al. 2009). This pattern was seen
consistently for total mortality, total cancer mortality, gastric cancer mortality, and
esophageal cancer mortality. Indeed, the effect of factor D on esophageal cancer was
reversed by age, showing a protective effect for younger individuals but a harmful
effect for older individuals. Further insight into the timing in the carcinogenic process
is provided by a separate RCT in Linxian (Limburg et al. 2005), which gave further
support for a preventive effect of selenium in subjects with preexisting esophageal
squamous dysplasia, the precursor lesion of esophageal squamous cell carcinoma.
Compared with control subjects, those with mild dysplasia who received 10 months of
daily supplementation with 200 μg of selenomethionine were more likely to have
regression and less likely to have progression of their esophageal squamous dysplasia.
Clear a priori definition of an analytic framework to address possible
mis-specification of timing in the natural history of the disease is an essential step
to position the trial analysis to address this issue of the underlying disease process.

The Dose

Often investigators move from treatment trials showing efficacy of an agent on disease outcomes to applying the agent for prevention. This sequence has been followed in examples such as aspirin for CHD and tamoxifen for breast cancer prevention, to name a few. In both heart disease and breast cancer prevention, lower doses have been chosen for prevention, in part to avoid potential adverse events that accumulate in the healthy population taking a drug to prevent future disease onset. In breast cancer prevention, the dose of Tamoxifen has been reduced to minimize menopausal symptoms and now shows significant benefits, with reductions in breast cancer events (DeCensi et al. 2019) and also in breast density, a marker of breast cancer risk (Eriksson et al. 2021). If we had more markers of response, we might shorten the time frame from development of these trials to endpoint ascertainment. Prevention trials typically are large, expensive, and of long duration, because we are interrupting a slow disease process, that is, chronic disease.
The initial Tamoxifen P1 breast cancer prevention trial screened 98,018 women, identifying 57,641 risk-eligible women, and randomized 13,388 participants to determine the worth of Tamoxifen in preventing breast cancer in women with a 5-year risk above 1.66%. Cumulative incidence through 69 months was 43.4/1000 women in the placebo group and 22.0/1000 in the Tamoxifen group (total 175 invasive cases). This trial cost $64 million (without costs for participant enrolment, follow-up visits, or drug/placebo). Subsequent trials compared Raloxifene and Tamoxifen (Martino et al. 2004), costing $134 million, and investigators secured drug and $30 million from Novartis for the STELLAR trial (a study to evaluate Letrozole and Raloxifene), but NCI withdrew support at the level of $55 million (The Lancet 2007; Parker-Pope 2007).
The need to balance benefits against adverse effects from interventions in otherwise healthy populations places emphasis on determining the lowest possible dose to achieve benefit and reduce the risk of adverse side effects. Earlier exploration of response at lower doses of possible preventive agents may speed the move to large-scale prevention or phase 3 trials. A lower dose reduces the risk of adverse events in many settings, but a sufficient framework for evaluation of response by dose, including biomarkers or risk profiles, would speed the path to efficient prevention trials. Promising options in precision-based approaches include prostaglandin pathways, BRAF and HLA class 1 antigen expression, among others (Jaffee et al. 2017).

The Duration of “Exposure/Intervention” Needed to Produce Risk Reduction

Dose and duration of intervention are typically informed a priori by observational data. Given the complexity of implementing a primary prevention RCT, choosing the correct dose and duration for the intervention is imperative.
Two factors interplay here: the cumulative exposure and the lag from the exposure to the observed benefit. This determination again requires consideration
of risks and benefits because adverse effects of most therapeutic interventions cannot
be completely avoided.
Trials for prevention have often focused on enrolling high-risk participants, often defined by family history or high-penetrance genetic markers of risk. Such restriction to high-risk populations increases the incidence of the outcome of interest and so shortens the required duration (and reduces the cost) of the trial by improving power for a finite number of participants screened for eligibility, recruited, randomized, and followed (when costs per participant are largely fixed). Largely due to the increase in contrast between the high-risk group and the control group in an RCT, fewer patients are usually required to reach the prespecified statistical power; this of course depends on the actual context of the trial. On the other hand, the higher-risk eligibility criteria may then limit the generalizability and applicability of the trial results. For broader prevention effectiveness, consideration of the disease process and the duration of “prevention” needed to achieve an observable reduction in incidence is essential to power a trial and to use its results to estimate population benefit, as the sketch below illustrates.
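A back-of-the-envelope sketch (using the standard two-proportion sample size formula; the relative risk of 0.7 and the grid of baseline risks are hypothetical) shows how strongly baseline risk drives trial size:

```python
# Back-of-the-envelope sketch: approximate per-arm sample size to detect a
# hypothetical relative risk of 0.7 as the baseline cumulative incidence over
# the trial period varies, using the standard two-proportion formula.
from scipy.stats import norm

def n_per_arm(p_control, rr, alpha=0.05, power=0.9):
    p_treat = rr * p_control
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return z**2 * variance / (p_control - p_treat)**2

for p0 in (0.005, 0.01, 0.02, 0.05):
    print(f"baseline risk {p0:.3f}: ~{n_per_arm(p0, rr=0.7):,.0f} per arm")
```

Under these hypothetical inputs, enrolling a population at five times the baseline risk cuts the required sample size roughly fivefold, which is the quantitative rationale for risk-based eligibility.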

The Durability of the Impact of the Intervention After It Has Stopped

Drawing on examples from the breast cancer prevention trials and the China/Linxian trial, we see that addressing the persistence of a prevention benefit after the cessation of the intervention requires planned additional follow-up beyond the primary hypothesis of the trial. If clearly defined as secondary hypotheses and analyses, continued follow-up of trial participants can answer key questions about the duration of effect of the trial intervention, and further inform evaluation of risks and benefits for prevention.
The additional insight on prevention gained from the precise knowledge of
exposure recorded in the randomized trial includes the added understanding of the
disease process after cessation of a precisely measured intervention. Continued
follow-up of trial participants has shown the durability of the effect of a prevention
agent. In the Linxian trial, factor D, which included selenium, vitamin E, and beta-
carotene, statistically significantly reduced total mortality, total cancer mortality, and
mortality from gastric cancer (Blot et al. 1993). An important question remained,
however: whether the preventive effects of factor D would last beyond the trial
period. The results of the continued follow-up showed that hazard ratios (HRs), as
indicated by moving HR curves, remained less than 1.0 for each of these end points
for most of the follow-up period; 10 years after completion of the trial, the group that
received factor D still showed a 5% reduction in total mortality and an 11% reduction
in gastric cancer mortality (Qiao et al. 2009).
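The idea of a moving HR curve can be illustrated with a small simulation sketch; the yearly event rate, the waning pattern after year 5, and the crude discrete-time hazard-ratio estimate below are all hypothetical stand-ins for a real survival analysis of individual follow-up data:

```python
# Illustrative sketch (simulated data, hypothetical rates): a "moving"
# hazard-ratio estimate in successive yearly windows, as computed when trial
# participants are followed after the intervention has stopped.
import numpy as np

rng = np.random.default_rng(3)
n = 20000                      # participants per arm
years = np.arange(0, 15)       # yearly windows over trial + post-trial period
base_rate = 0.01               # hypothetical yearly event rate, control arm
# Hypothetical truth: HR 0.7 during a 5-year intervention, waning toward 1.
true_hr = np.where(years < 5, 0.7, 1 - 0.3 * np.exp(-(years - 5) / 4))

at_risk = {"control": n, "active": n}
print("window   events (control/active)   estimated HR")
for t, hr in zip(years, true_hr):
    ev_c = rng.binomial(at_risk["control"], base_rate)
    ev_a = rng.binomial(at_risk["active"], base_rate * hr)
    est_hr = (ev_a / at_risk["active"]) / (ev_c / at_risk["control"])
    print(f"year {t:2d}          {ev_c:4d}/{ev_a:4d}              {est_hr:.2f}")
    at_risk["control"] -= ev_c   # remove those with events from the risk set
    at_risk["active"] -= ev_a
```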
Similar insight on the duration of protection has been provided from continued
follow-up of three tamoxifen trials, which showed benefit after the conclusion of
active therapy (Fisher et al. 2005). The calcium polyp prevention trial also reported
that the protection observed during the trial persisted for up to 5 years after
supplementation ended and may, in fact, have been stronger after, rather than during,
active intervention (Grau et al. 2007). With the exception of smoking cessation,
cessation of exposure to occupational carcinogens, and termination of drug use,
lifestyle factors (diet, energy balance, physical activity, sleep pattern or sun expo-
sure) rarely have a clearly demarcated cessation, thus requiring observational studies
to provide insight on the durability of effects and lag from exposure to disease. For
pharmacologic interventions, on the other hand, long-term follow-up is essential to
fully determine risks and benefits (Cuzick 2010).

Outcomes for Prevention Trials

While the gold standard in prevention of chronic diseases may historically be reduction in mortality, evolving technologies and changes in detection and treatment can bias estimates of prevention benefits. The initial community treatment of blood pressure was assessed in the Hypertension Detection and Follow-up Program, which screened over 150,000 adults 30–69 years of age to identify community-living adults with hypertension. Randomization to stepped care treatment of hypertension or community care (usual care) showed that 5-year mortality from all causes was significantly lower in the stepped care arm compared to community care (Writing Group 1979).
Evolving trial designs and scientific agreement on more proximate endpoints reflect the evolution of understanding of the underlying disease processes and the priority for interventions whose benefits exceed their harms.
Debate regarding endpoints has included the focus on mortality reduction vs a
reduction in incidence of disease. For example, the UK Doctors Study was designed to test whether aspirin 500 mg daily reduced incidence of and mortality from stroke, myocardial infarction, or other vascular conditions (Peto et al. 1988). The US Physicians' Health Study evaluated alternate-day aspirin (325 mg) vs placebo, randomizing 22,071 participants and following them for an average of 57 months (Steering Committee of the Physicians' Health Study Research Group 1988). Incident myocardial infarction was significantly reduced, but mortality was equivalent in each arm (44 cardiovascular deaths). Subsequent reporting from the Data Monitoring Board demonstrated the futility of continuing the trial for a mortality benefit (Cairns et al. 1991). They presented data on the substantially lower cardiovascular mortality than expected from age-comparable population rates, consistent with a baseline prevalence of current smoking of 12% (Glynn et al. 1994). Treatment of nonfatal myocardial infarction further complicated interpretation, as demonstrated by
Cook et al. (2002). While disease incidence is the primary endpoint for most
prevention trials, design features and clinical diagnosis and treatment must be
carefully monitored to avoid inducing bias in endpoint ascertainment.
The US FDA defines endpoints for drugs and biologics to be assessed as safe and
effective. They consider clinical outcomes and surrogate endpoints. Surrogate end-
points are used when clinical outcomes might take a long time to study (think
prevention trials of stroke, or cervical cancer, for example). The FDA now publishes
a list of acceptable surrogate endpoints for both adult and childhood diseases
(US Food and Drug Administration 2021). These endpoints typically must satisfy very strict surrogacy criteria to avoid apparent benefit for reduced clinical disease incidence when none is present. Across prevention trials, the importance of outcome choice and rigor of
confirmation applies in similar manner to the more general issues (see ▶ Chap. 47,
“Ascertainment and Classification of Outcomes”).

Analysis: ITT and Adherence in Prevention Trials

Zelen considered the challenges of primary prevention trials in the 1980s and
addressed both compliance and models of carcinogenesis as major impediments to
the use of RCTs to evaluate cancer prevention strategies (Zelen 1988). It is important
to contrast these issues in treatment trials and prevention trials. In treatment trials, we
typically take recently diagnosed patients and offer them, often in a life-threatening
situation, the option to participate in a trial of a new therapy compared with standard
therapy or placebo. Compliance or adherence to therapy is usually very high among
these highly motivated patients and outcomes are generally in a short to mid-term
time frame. In contrast, prevention trials recruit large numbers of healthy
participants, offer them a therapy, and then follow them over many years, since the
chronic diseases being prevented are relatively rare. With substantial nonadherence, often in the range of 20–40% over the duration of the trial, an intention-to-treat analysis, although unbiased for the effect of treatment assignment, will underestimate the effect of the intervention as actually received.
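A back-of-the-envelope calculation, with entirely hypothetical numbers, shows how drop-out from the intervention arm and drop-in from the control arm attenuate the observed intention-to-treat contrast toward the null.

    # Hedged sketch (hypothetical numbers): dilution of the ITT contrast
    # by nonadherence in both randomized arms.
    p_control = 0.10   # event risk without the intervention
    true_rr = 0.70     # risk ratio under full adherence
    drop_out = 0.30    # intervention participants who stop adhering
    drop_in = 0.10     # control participants who adopt the intervention

    p_treated = p_control * true_rr
    risk_int_arm = (1 - drop_out) * p_treated + drop_out * p_control
    risk_ctl_arm = (1 - drop_in) * p_control + drop_in * p_treated
    print(risk_int_arm / risk_ctl_arm)  # about 0.81, attenuated from 0.70
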
Issues in analysis using the a priori intention-to-treat (ITT) plan (see ▶ Chap. 82, "Intention to Treat and Alternative Approaches") and detailed approaches to modeling adherence over time in the prevention trial setting call for rigorous detail in the study protocol. Additional challenges that recur in the prevention trial setting include dropout and loss to follow-up that may be nondifferential, particularly in settings such as weight loss trials (Ware 2003). When endpoints such as weight loss or quality of life may reflect both engagement with the trial and adherence to the intervention, maximizing strategies to obtain endpoint data and retain participants in the study is fundamental to the integrity of the trial results.
Many of the design issues (population, intervention and control arms, adherence, and cost constraints) come together and must be balanced in the design of the trial. The protocols for most trials in the last 10 years have been posted online when the primary trial results are published. For older prevention trials, access to a full protocol, sample size considerations, and so forth may be harder to locate. The Women's Health Initiative published its protocol (Writing Group 1998), and the Diabetes Prevention Program web site at the NIH (NIDDK) provides access to the study protocol, with extensive details of a less complicated but still three-arm design. The principal objective of that trial was to prevent or delay development of Non-Insulin Dependent Diabetes Mellitus (type 2 diabetes) in persons at high risk with impaired glucose tolerance (Knowler et al. 2002; The Diabetes Prevention Program Research Group 1999). The protocol is available at http://www.bsc.gwu.edu/dpp.

Biomarkers and Other Emerging Areas

The nature of many chronic disease prevention/interception interventions requires a very long timeline for assessment of effectiveness. Validated biomarkers to improve
risk assessment, for example, to characterize “premalignancy” and to predict tumor
aggressiveness remain active areas of research. In Alzheimer’s disease (AD), for
example, susceptibility to AD is determined by both monogenic and polygenic risk
factors as well as environmental exposures. The evaluation of efficacy of interven-
tions to treat AD is highly dependent on the selection of cognitively normal
individuals years before the onset of AD. The need for biomarkers to predict
responsiveness to various interventions, to serve as surrogate endpoints for intervention trials, and to predict toxicities of prevention interventions also remains essential for progress in cancer prevention.
Artificial intelligence, analytics and applied statistics, engineering, and data
science bring opportunities to speed precision medicine and precision prevention.
A recent report from the National Academy of Medicine (NAM) reviews and
highlights opportunities, promises, and perils in application of AI in health care
(Matheny et al. 2019).

For prevention trials the challenge is to harness these resources to better stratify or
classify underlying disease risk. An increasing array of technologies allows
non-invasive imaging with increasing precision. Imaging is spatially defined, adapt-
able to a variety of instruments, minimally invasive, and sensitive to capturing
detailed information, and it supports the use of contrast agents. For primary preven-
tion of cancer – including prevention trials – imaging provides information on organ
health, such as sun damage to skin, liver fat or fibrosis, and breast density. For
secondary prevention, imaging identifies early disease in high-risk populations through such screenings as mammography, colonoscopy, colposcopy, lung computed tomography (CT), and dermoscopy; in prostate cancer, MRI may allow better stratification of patients who can forego biopsy when imaging shows evidence of indolent disease. For tertiary prevention, imaging is used to monitor a primary tumor
or metastasis. Advanced imaging techniques enable digital pathomics analyses of
cell shape, nucleus texture, stroma patterns, and tissue architecture arrangement.
Much of this is coupled with AI and ML to speed discovery and translation of
applications. The ultimate goal is often delivery of results at point of care, with
immediate decision-making and action. Importantly, point of care can increasingly
be used in under-resourced settings to potentially bridge access gaps and reduce cancer
health disparities. AI/ML methods perform well only when data sets are sufficiently large, often requiring huge training sets for optimal performance. Bringing these technologies to the point of care for evaluation of patient eligibility for prevention trials is a rapidly emerging area of study with much potential to increase the efficiency of prevention trials.
Interfaces with data science and machine learning in -omics and other applica-
tions beyond imaging are rapidly expanding. Opportunities for application in preci-
sion prevention include development of conventional analyses as well as AI/ML to handle disparate data types (imaging, omics, demographic, lifestyle, and environmental exposure data) and to generate actionable information.
Multidimensional data typically combines several lines of evidence, such as
whole-genome sequencing, gene expression, copy number variation, and methyla-
tion, to produce plots that can predict patient outcomes. These multidimensional data
can also vary over time (e.g., time-varying factors, markers, and images). The
approach with high-dimensional baseline covariates is being used in the ongoing NCI Precancer Atlas (PCA); these and other advances in application require novel analytic strategies and methods to verify that the AI and ML approaches are robust. Bringing
these approaches to risk classification will transform eligibility assessment for
prevention trials with precision approaches in the coming years.
There is great promise in the integration of multidimensional data into cancer risk
prediction. Risk stratification algorithms will be required. This work will build on
the record of methods development and application in cancer prevention for risk
models (both classic statistical models and Bayesian approaches) (Steyerberg 2009).
Strategies to bring multidimensional data to point of care for risk stratification and
precision prevention decision making will need integrated studies of communication
of these approaches and their interpretation (Klein and Stefanek 2007). At the same
time, coverage of populations regardless of socioeconomic status and race/ethnicity
is essential to eliminate disparities and provide complete population application for the multidimensional data studies. The underlying importance of cohort data for
model development and validation is well established and remains a priority for
precision prevention (Moons et al. 2012a, b).
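As a minimal sketch of the classic statistical side of this work, the following Python fragment fits a logistic risk model to simulated data and uses predicted risk to enrich a hypothetical eligibility pool; the covariates, coefficients, and enrollment threshold are all invented for illustration.

    # Minimal sketch (simulated data, hypothetical covariates): a logistic
    # risk model used to stratify candidates for a prevention trial.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    age = rng.normal(55, 8, n)
    biomarker = rng.normal(0, 1, n)
    logit = -6.0 + 0.06 * age + 0.8 * biomarker  # invented risk structure
    disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    X = np.column_stack([age, biomarker])
    risk = LogisticRegression().fit(X, disease).predict_proba(X)[:, 1]
    eligible = risk > np.quantile(risk, 0.8)   # enroll the top 20% of risk
    print(disease[eligible].mean(), disease.mean())  # enriched vs overall
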
For chemoprevention trials, there are particular challenges. These include:

a) Efforts to improve risk stratification of at-risk populations to permit better-defined study populations.
b) Novel trial designs that support different models of chemoprevention and will further change the landscape of prevention trials, such as intermittent exposure to agents otherwise too toxic for long-duration use (which could enable adoption of targeted agents that remove early cancer cells, for example), or long-term evaluation of interventions that modulate the immune system for unanticipated effects.

Interpreting Prevention Trials

Prevention trials may offer a range of stress tests for the design and interpretation of randomized trials. Not only are these trials often longer in duration, as they aim to reduce the incidence or onset of disease, but the volunteers willing to enroll may not reflect the distribution of risk factors in the broader population that motivated the scientific questions being addressed, adding to the challenges of interpreting and applying results. Sommer reviews some case studies in the chapter on vitamin A deficiency (see ▶ Chap. 112, "Trials Can Inform or Misinform: "The Story of Vitamin A Deficiency and Childhood Mortality"") and many have written critiques of other prevention trials when results do not "hold up" as expected from the motivation for the trial (Martinez et al. 2008; Tanvetyanon and Bepler 2008).
Recent experience with vaccines against COVID-19 demonstrates increasing public focus on trial design, protocol access, and almost real-time reporting of the race/ethnic and age composition of participants to hold trialists accountable for enrolling study populations reflecting the at-risk population. Despite these advances in the face of the COVID-19 pandemic, there remains much room for improvement in the recruitment of broader and more diverse populations of participants for prevention trials in general, and in the application of advancing methods of trial design to bring timely results for the prevention of chronic diseases.

Conclusion

Prevention trials allow the investigator to evaluate the magnitude of benefit of a preventive intervention in the context of the natural history of disease development. Through randomization, prevention trials avoid the self-selection to new uses of therapies or potentially preventive lifestyle patterns that can be confounded in observational settings by socioeconomic status, education, and access to prevention and
diagnostic health services. Control selection must be realistic and practical, given the usually long duration of prevention trials. Appropriate selection of the study population must balance sufficiently high risk to generate endpoints against breadth sufficient to support generalizability of the findings. Timing in the
disease process is under the control of the investigator, more so than in the setting of
treatment trials where the diagnosis of disease may set the timing for initiation of
therapy. Choosing when the intervention should start, and for how long, should be
grounded in the natural history of disease development and progression to clinical
endpoints. Adherence to therapies (intervention and control) in prevention trials can have a major impact on the interpretation of the findings and on the adequacy of the contrast between intervention and control arms to support a meaningful comparison.
This adds some complexity to design as illustrated through the Women’s Health
Initiative. Planned long-term follow-up can help maximize the value of prevention
trials bringing additional information to bear on the risks and benefits of the
preventive intervention. Despite their cost, prevention trials add much evidence to
strategies for risk reduction across many chronic conditions.

Key Facts

• Prevention trials offer results that remove self-selection bias in the evaluation of prevention approaches for chronic diseases.
• Choosing the population for inclusion in the trial balances the level of risk, the duration of trial needed to accrue sufficient endpoints to test the intervention, and the generalizability of the findings for prevention.
• Long trial duration may exacerbate challenges to adherence by otherwise healthy populations.
• Extended follow-up beyond the planned trial intervention may add important detail on the trade-offs between risks and benefits.

Cross-References

▶ Ascertainment and Classification of Outcomes
▶ Cluster Randomized Trials
▶ Intention to Treat and Alternative Approaches
▶ Screening Trials
▶ Trials Can Inform or Misinform: "The Story of Vitamin A Deficiency and Childhood Mortality"

References
Alberts DS, Martinez ME, Roe DJ, Guillen-Rodriguez JM, Marshall JR, van Leeuwen JB, Reid
ME, Ritenbaugh C, Vargas PA, Bhattacharyya AB, Earnest DL, Sampliner RE (2000) Lack of
effect of a high-fiber cereal supplement on the recurrence of colorectal adenomas. Phoenix
Colon Cancer Prevention Physicians’ Network. N Engl J Med 342(16):1156–1162. https://doi.
org/10.1056/NEJM200004203421602
Blot WJ, Li JY, Taylor PR, Guo W, Dawsey S, Wang GQ, Yang CS, Zheng SF, Gail M, Li GY et al
(1993) Nutrition intervention trials in Linxian, China: supplementation with specific vitamin/
mineral combinations, cancer incidence, and disease-specific mortality in the general popula-
tion. J Natl Cancer Inst 85(18):1483–1492. https://doi.org/10.1093/jnci/85.18.1483
Cairns J, Cohen L, Colton T, DeMets DL, Deykin D, Friedman L, Greenwald P, Hutchison GB,
Rosner B (1991) Issues in the early termination of the aspirin component of the Physicians’
Health Study. Data Monitoring Board of the Physicians’ Health Study. Ann Epidemiol 1(5):
395–405. https://doi.org/10.1016/1047-2797(91)90009-2
Chen MS Jr, Lara PN, Dang JH, Paterniti DA, Kelly K (2014) Twenty years post-NIH Revitaliza-
tion Act: enhancing minority participation in clinical trials (EMPaCT): laying the groundwork
for improving minority clinical trial accrual: renewing the case for enhancing minority partic-
ipation in cancer clinical trials. Cancer 120(Suppl 7):1091–1096. https://doi.org/10.1002/cncr.
28575
Chen WY, Rosner B, Colditz GA (2007) Moving forward with breast cancer prevention. Cancer
109(12):2387–2391. https://doi.org/10.1002/cncr.22711
Cook NR, Cole SR, Hennekens CH (2002) Use of a marginal structural model to determine the
effect of aspirin on cardiovascular mortality in the Physicians’ Health Study. Am J Epidemiol
155(11):1045–1053. https://doi.org/10.1093/aje/155.11.1045
Cuzick J (2010) Long-term follow-up in cancer prevention trials (it ain’t over ‘til it’s over). Cancer
Prev Res 3(6):689–691. https://doi.org/10.1158/1940-6207.CAPR-10-0096
DeCensi A, Puntoni M, Guerrieri-Gonzaga A, Caviglia S, Avino F, Cortesi L, Taverniti C, Pacquola
MG, Falcini F, Gulisano M, Digennaro M, Cariello A, Cagossi K, Pinotti G, Lazzeroni M,
Serrano D, Branchi D, Campora S, Petrera M, Buttiron Webber T, Boni L, Bonanni B (2019)
Randomized placebo controlled trial of low-dose tamoxifen to prevent local and contralateral
recurrence in breast intraepithelial neoplasia. J Clin Oncol 37(19):1629–1637. https://doi.org/
10.1200/JCO.18.01779
Eriksson M, Eklund M, Borgquist S, Hellgren R, Margolin S, Thoren L, Rosendahl A, Lang K,
Tapia J, Backlund M, Discacciati A, Crippa A, Gabrielson M, Hammarstrom M, Wengstrom Y,
Czene K, Hall P (2021) Low-dose tamoxifen for mammographic density reduction: a random-
ized controlled trial. J Clin Oncol 2021:JCO2002598. https://doi.org/10.1200/JCO.20.02598
Fisher B, Costantino J, Wickerham D, Redmond C, Kavanah M, Cronin W, Vogel V, Robidoux A,
Dimitrov N, Atkins J, Daly M, Wieand S, Tan-Chiu E, Ford L, Wolmark N, Other National
Surgical Adjuvant Breast and Bowel Project Investigators (1998) Tamoxifen for prevention of
breast cancer: report of the National Surgical Adjuvant Breast and Bowel Project P-1 study.
J Natl Cancer Inst 90:1371–1388
Fisher B, Costantino JP, Wickerham DL, Cecchini RS, Cronin WM, Robidoux A, Bevers TB,
Kavanah MT, Atkins JN, Margolese RG, Runowicz CD, James JM, Ford LG, Wolmark N
(2005) Tamoxifen for the prevention of breast cancer: current status of the National Surgical
Adjuvant Breast and Bowel Project P-1 study. J Natl Cancer Inst 97(22):1652–1662
Glynn RJ, Buring JE, Manson JE, LaMotte F, Hennekens CH (1994) Adherence to aspirin in the
prevention of myocardial infarction. The Physicians’ Health Study. Arch Intern Med 154(23):
2649–2657. https://doi.org/10.1001/archinte.1994.00420230032005
Grau MV, Baron JA, Sandler RS, Wallace K, Haile RW, Church TR, Beck GJ, Summers RW, Barry
EL, Cole BF, Snover DC, Rothstein R, Mandel JS (2007) Prolonged effect of calcium supple-
mentation on risk of colorectal adenomas in a randomized trial. J Natl Cancer Inst 99(2):129–136
Greenwald P, Anderson D, Nelson SA, Taylor PR (2007) Clinical trials of vitamin and mineral
supplements for cancer prevention. Am J Clin Nutr 85(1):314S–317S
Jaffee EM, Dang CV, Agus DB, Alexander BM, Anderson KC, Ashworth A, Barker AD, Bastani R,
Bhatia S, Bluestone JA, Brawley O, Butte AJ, Coit DG, Davidson NE, Davis M, DePinho RA,
Diasio RB, Draetta G, Frazier AL, Futreal A, Gambhir SS, Ganz PA, Garraway L, Gerson S,
Gupta S, Heath J, Hoffman RI, Hudis C, Hughes-Halbert C, Ibrahim R, Jadvar H, Kavanagh B,
Kittles R, Le QT, Lippman SM, Mankoff D, Mardis ER, Mayer DK, McMasters K, Meropol NJ,
Mitchell B, Naredi P, Ornish D, Pawlik TM, Peppercorn J, Pomper MG, Raghavan D, Ritchie C,
Schwarz SW, Sullivan R, Wahl R, Wolchok JD, Wong SL, Yung A (2017) Future cancer
research priorities in the USA: a Lancet Oncology Commission. Lancet Oncol 18(11):e653–
e706. https://doi.org/10.1016/S1470-2045(17)30698-8
Klein WM, Stefanek ME (2007) Cancer risk elicitation and communication: lessons from the
psychology of risk perception. CA Cancer J Clin 57(3):147–167. https://doi.org/10.3322/
canjclin.57.3.147
Knowler WC, Barrett-Connor E, Fowler SE, Hamman RF, Lachin JM, Walker EA, Nathan DM,
Diabetes Prevention Program Research Group (2002) Reduction in the incidence of type
2 diabetes with lifestyle intervention or metformin. N Engl J Med 346(6):393–403. https://
doi.org/10.1056/NEJMoa012512
Limburg PJ, Wei W, Ahnen DJ, Qiao Y, Hawk ET, Wang G, Giffen CA, Wang G, Roth MJ, Lu N,
Korn EL, Ma Y, Caldwell KL, Dong Z, Taylor PR, Dawsey SM (2005) Randomized, placebo-
controlled, esophageal squamous cell cancer chemoprevention trial of selenomethionine and
celecoxib. Gastroenterology 129(3):863–873
Martinez ME, Marshall JR, Giovannucci E (2008) Diet and cancer prevention: the roles of
observation and experimentation. Nat Rev Cancer 8(9):694–703
Martino S, Cauley JA, Barrett-Connor E, Powles TJ, Mershon J, Disch D, Secrest RJ, Cummings
SR (2004) Continuing outcomes relevant to Evista: breast cancer incidence in postmenopausal
osteoporotic women in a randomized trial of raloxifene. J Natl Cancer Inst 96(23):1751–1761
Matheny M, Israni S, Ahmed M, Whicher D (2019) Artificial intelligence in health care: the hope,
the hype, the promise, the peril, NAM special publication. National Academy of Medicine,
Washington, DC
Moons KG, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, Woodward M (2012a)
Risk prediction models: II. External validation, model updating, and impact assessment. Heart
98(9):691–698. https://doi.org/10.1136/heartjnl-2011-301247
Moons KG, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, Grobbee DE (2012b)
Risk prediction models: I. Development, internal validation, and assessing the incremental value
of a new (bio)marker. Heart 98(9):683–690. https://doi.org/10.1136/heartjnl-2011-301246
Parker-Pope T (2007) Do pills have a place in cancer prevention? Wall Street J 2007:D1
Peto R, Gray R, Collins R, Wheatley K, Hennekens C, Jamrozik K, Warlow C, Hafner B,
Thompson E, Norton S et al (1988) Randomised trial of prophylactic daily aspirin in British
male doctors. Br Med J 296(6618):313–316. https://doi.org/10.1136/bmj.296.6618.313
Prentice RL, Caan B, Chlebowski RT, Patterson R, Kuller LH, Ockene JK, Margolis KL, Limacher
MC, Manson JE, Parker LM, Paskett E, Phillips L, Robbins J, Rossouw JE, Sarto GE, Shikany
JM, Stefanick ML, Thomson CA, Van Horn L, Vitolins MZ, Wactawski-Wende J, Wallace RB,
Wassertheil-Smoller S, Whitlock E, Yano K, Adams-Campbell L, Anderson GL, Assaf AR,
Beresford SA, Black HR, Brunner RL, Brzyski RG, Ford L, Gass M, Hays J, Heber D, Heiss G,
Hendrix SL, Hsia J, Hubbell FA, Jackson RD, Johnson KC, Kotchen JM, LaCroix AZ, Lane DS,
Langer RD, Lasser NL, Henderson MM (2006) Low-fat dietary pattern and risk of invasive
breast cancer: the Women’s Health Initiative Randomized Controlled Dietary Modification
Trial. JAMA 295(6):629–642
Qiao YL, Dawsey SM, Kamangar F, Fan JH, Abnet CC, Sun XD, Johnson LL, Gail MH, Dong ZW,
Yu B, Mark SD, Taylor PR (2009) Total and cancer mortality after supplementation with
vitamins and minerals: follow-up of the Linxian General Population Nutrition Intervention
Trial. J Natl Cancer Inst 101(7):507–518
Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, Jackson RD,
Beresford SA, Howard BV, Johnson KC, Kotchen JM, Ockene J (2002) Risks and benefits of
estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s
Health Initiative randomized controlled trial. JAMA 288(3):321–333
Sargent DJ, Conley BA, Allegra C, Collette L (2005) Clinical trial designs for predictive marker
validation in cancer treatment trials. J Clin Oncol 23(9):2020–2027
Schatzkin A, Lanza E, Corle D, Lance P, Iber F, Caan B, Shike M, Weissfeld J, Burt R, Cooper MR,
Kikendall JW, Cahill J (2000) Lack of effect of a low-fat, high-fiber diet on the recurrence of
colorectal adenomas. Polyp Prevention Trial Study Group. N Engl J Med 342(16):1149–1155.
https://doi.org/10.1056/NEJM200004203421601
Shapiro S, Venet W, Strax P, Venet L, Roeser R (1985) Selection, follow-up, and analysis in the
Health Insurance Plan Study: a randomized trial with breast cancer screening. Natl Cancer Inst
Monogr 67:65–74
Steering Committee of the Physicians’ Health Study Research Group (1988) Preliminary report:
findings from the aspirin component of the ongoing Physicians’ Health Study. N Engl J Med
318(4):262–264. https://doi.org/10.1056/NEJM198801283180431
Steyerberg EW (2009) Clinical prediction models. A practical approach to development, validation,
and updating, Statistics for biology and health. Springer, New York. https://doi.org/10.1007/
978-0-387-77244-8
Stoll CRT, Izadi S, Fowler S, Philpott-Streiff S, Green P, Suls J, Winter AC, Colditz GA (2019)
Multimorbidity in randomized controlled trials of behavioral interventions: a systematic review.
Health Psychol 38(9):831–839. https://doi.org/10.1037/hea0000726
Tanvetyanon T, Bepler G (2008) Beta-carotene in multivitamins and the possible risk of lung cancer
among smokers versus former smokers: a meta-analysis and evaluation of national brands.
Cancer 113(1):150–157. https://doi.org/10.1002/cncr.23527
The Diabetes Prevention Program Research Group (1999) The Diabetes Prevention Program.
Design and methods for a clinical trial in the prevention of type 2 diabetes. Diabetes Care
22(4):623–634. https://doi.org/10.2337/diacare.22.4.623
The Lancet (2007) NCI and the STELLAR trial. Lancet 369(9580):2134. https://doi.org/10.1016/
S0140-6736(07)60987-8
US Food and Drug Administration (2021) Table of surrogate endpoints that were the basis of drug
approval or licensure. https://www.fda.gov/drugs/development-resources/table-surrogate-end
points-were-basis-drug-approval-or-licensure
Ware JH (2003) Interpreting incomplete data in studies of diet and weight loss. N Engl J Med
348(21):2136–2137. https://doi.org/10.1056/NEJMe030054
Ware JH, Hamel MB (2011) Pragmatic trials – guides to better patient care? N Engl J Med 364(18):
1685–1687. https://doi.org/10.1056/NEJMp1103502
Warner ET, Glasgow RE, Emmons KM, Bennett GG, Askew S, Rosner B, Colditz GA (2013)
Recruitment and retention of participants in a pragmatic randomized intervention trial at three
community health clinics: results and lessons learned. BMC Public Health 13:192. https://doi.
org/10.1186/1471-2458-13-192
Wolin KY, Steinberg DM, Lane IB, Askew S, Greaney ML, Colditz GA, Bennett GG
(2015) Engagement with eHealth self-monitoring in a primary care-based weight man-
agement intervention. PLoS One 10(10):e0140455. https://doi.org/10.1371/journal.pone.
0140455
Writing Group (1979) Five-year findings of the hypertension detection and follow-up program.
I. Reduction in mortality of persons with high blood pressure, including mild hypertension.
Hypertension Detection and Follow-up Program Cooperative Group. JAMA 242(23):
2562–2571
Writing Group (1998) Design of the Women’s Health Initiative clinical trial and observational
study. The Women’s Health Initiative Study Group. Control Clin Trials 19(1):61–109. https://
doi.org/10.1016/s0197-2456(97)00078-0
Yach D, Hawkes C, Gould CL, Hofman KJ (2004) The global burden of chronic diseases:
overcoming impediments to prevention and control. JAMA 291(21):2616–2622. https://doi.
org/10.1001/jama.291.21.2616
Yusuf S, Joseph P, Dans A, Gao P, Teo K, Xavier D, Lopez-Jaramillo P, Yusoff K, Santoso A,
Gamra H, Talukder S, Christou C, Girish P, Yeates K, Xavier F, Dagenais G, Rocha C,
McCready T, Tyrwhitt J, Bosch J, Pais P, International Polycap Study 3 Investigators (2021)
Polypill with or without aspirin in persons without cardiovascular disease. N Engl J Med 384(3):
216–228. https://doi.org/10.1056/NEJMoa2028220
Zelen M (1988) Are primary cancer prevention trials feasible? J Natl Cancer Inst 80:1442–1444
68 N-of-1 Randomized Trials
Reza D. Mirza, Sunita Vohra, Richard Kravitz, and Gordon H. Guyatt

Contents
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1280
History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
Introduction: Why Conduct an N-of-1 RCTs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
Limitations of Informal Trials of Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1282
How N-of-1 RCTs Address the Limitations of Informal Trials of Therapy . . . . . . . . . . . . . . . . . . . 1282
Five Reasons for Conducting N-of-1 RCTs to Improve Patient Care . . . . . . . . . . . . . . . . . . . . . . . . . 1282
N-of-1 RCTs Addressing Treatment Effects in a Group of Patients . . . . . . . . . . . . . . . . . . . . . . . . . . 1283
Determining Appropriateness for an N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1284
Designing an N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285
Choosing an Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285
Trial Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1286
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Collaboration with Pharmacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Advanced Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Interpreting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Visual Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Nonparametric Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Wilcoxon Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Parametric Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289


Student’s T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289


ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289
Aggregation of N-of-1 RCTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289
Reporting for N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1290
Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1290
An Example of an N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1291
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1293
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1293
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1294
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1294

R. D. Mirza
Department of Medicine, McMaster University, Hamilton, ON, Canada
e-mail: [email protected]
S. Vohra
University of Alberta, Edmonton, AB, Canada
e-mail: [email protected]
R. Kravitz
University of California Davis, Davis, CA, USA
e-mail: [email protected]
G. H. Guyatt (*)
McMaster University, Hamilton, ON, Canada
e-mail: [email protected]

Abstract
Single-subject trials have a rich history in the behavioral sciences, but a much
more limited history in clinical medicine. This chapter deals with a particular
single-subject design, the N-of-1 randomized control trial (RCT). N-of-1
RCTs are single-patient multiple crossover studies of an intervention and
usually one comparator. Typically, patients undergo pairs of treatment periods;
random allocation determines the order of intervention and comparator arms
within each pair and patients and clinicians are ideally blind to allocation.
Patients and clinicians repeat pairs of treatment periods as necessary to
achieve a convincing result. In the medical sciences, N-of-1 RCTs have
seen limited use, in part due to lack of familiarity and feasibility concerns
that arise in day-to-day clinical practice. Investigators may carry out a number
of N-of-1 RCTs of the same intervention and comparator as part of a formal
research study, aggregating across N-of-1 RCTs to develop population esti-
mates. N-of-1 RCTs have demonstrated their utility in clarifying whether a
clinical intervention is effective or not. Although N-of-1 trials have the
potential for improving patient outcomes, the few small randomized trials
comparing N-of-1 to conventional care have not demonstrated important
benefits.

Keywords
N-of-1 · Single-patient trial · Randomized controlled trial · Crossover trial ·
Personalized medicine

Definition

This chapter deals with a particular type of single-participant experiment, the N-of-1
randomized controlled trial (RCT). N-of-1 RCTs are prospective, single-patient trials
with repeated pairs of intervention and comparator periods in which the order is
randomized and patients and clinicians are ideally blinded with respect to allocation.
We will describe the history of N-of-1 RCTs, as well as the indications, design,
interpretation, reporting, and associated ethical issues.

History

Psychologists pioneered the use and development of single-subject designs, including N-of-1 trials (Kazdin 2011; Kratochwill 2013). The N-of-1 trial's debut in medicine came in a
1986 issue of the New England Journal of Medicine (Guyatt et al. 1986). Clinician-
scientists from McMaster University presented a patient with severe asthma who
was poorly controlled despite inhaled beta-agonist, anti-cholinergic (ipratropium),
theophylline, and oral prednisone. They conducted an N-of-1 RCT that demon-
strated that theophylline, far from improving the symptoms, made them considerably
worse. A second N-of-1 trial convinced the patient his ipratropium did in fact
provide benefit. Discontinuation of the theophylline and regular use of ipratropium
markedly improved symptoms and allowed gradual discontinuation of prednisone.
In that same article, the authors announced the creation of an N-of-1 clinical service allowing local physicians to refer patients with a therapeutic question. Over 3 years, the service completed 57 N-of-1 RCTs that provided a definite answer for 50 patients; in 15 (39%) of these, the results led to a change in the referring physicians' planned management. Given this success, other clinicians established N-of-1 services in their centers, including the University of Washington (Dr. Eric
Larson), the University of Alberta (Dr. Sunita Vohra), and a national Australian
N-of-1 service based out of the University of Queensland (Dr. Geoff Mitchell)
(Mirza et al. 2017).
As of 2021, N-of-1 services are minimally active. The N-of-1 RCT remains useful
for addressing clinical questions that meet certain criteria – see the next section – but
most clinicians remain unaware of its existence. As clinical research embraces patient-centered research and moves into the era of personalized medicine, there appears to be a resurgence of interest in N-of-1 RCTs (Kravitz et al. 2014; Shamseer
et al. 2015; Mirza et al. 2017).

Introduction: Why Conduct an N-of-1 RCT?

N-of-1 randomized controlled trials (RCTs) can be broken down into two major
categories depending on the underlying purpose. In one, the purpose is to improve
the care of individual patients by carrying out rigorous trials that leave patients and
clinicians confident that a particular treatment is, or is not, beneficial or harmful. By
ensuring applicability to the individual, N-of-1 RCTs represent the highest quality of
evidence.
The second reason for conducting N-of-1 RCTs is to determine the effect of an
intervention in a population. Conducting a series of N-of-1 RCTs allows investiga-
tors to provide an estimate of the proportion of patients who achieve an important
benefit, or who suffer troubling adverse effects, and thus establish the extent of
heterogeneity of response (Stunnenberg et al. 2018). Many patients and clinicians
considering the impact of a treatment are likely to find such a result more informative
than, for example, a mean effect.

Limitations of Informal Trials of Therapy

In routine practice, clinicians typically conduct informal trials of therapy. This entails starting a treatment and monitoring a patient's response. For a number of
reasons, this approach is prone to false-positive, and less frequently false-negative,
results.
First, patients may have been destined to get better (or worse) as a function of
natural history, in which case patients and clinicians may deem the treatment
responsible when improvement or worsening would have occurred without the
intervention. Second, both patients and clinicians may desire to meet each other’s
expectations; thereby, each is more likely to infer the success of a treatment. Third,
an apparent response may be due to a placebo rather than the intended biological
effect. Similarly, patients who expect an adverse effect of treatment may experience
that adverse effect even if the biological effect of treatment is not responsible – the
so-called nocebo effect (Barsky et al. 2002). Finally, an exposure other than the
treatment may have been responsible for an apparent response – for instance a week
of cloudy days with minimal sun exposure may be responsible for a decrease in symptoms in a patient with systemic lupus erythematosus.

How N-of-1 RCTs Address the Limitations of Informal Trials of Therapy

N-of-1 RCTs incorporate protections against the risks of bias that bedevil informal trials of therapy. Choosing chronic, stable diseases attenuates the risk of conflating treatment
benefit and natural history. Blinding patients and clinicians to allocation to treatment
versus comparator minimizes biases related to expectation and placebo effects.
Multiple crossover periods control the risk of the misleading impact of transient
third variables, as well as effectively addressing natural history effects (i.e., it is very
unlikely that natural history will correspond closely to the institution and withdrawal
of a beneficial treatment).

Five Reasons for Conducting N-of-1 RCTs to Improve Patient Care

Treatments, even if beneficial in a population, will seldom if ever achieve an important benefit in every individual in that population: in other words, treatment response is often, perhaps usually, heterogeneous. N-of-1 RCTs can sort out whether
an individual who would have been eligible for an RCT that has reported a positive
result is one of the fortunate responders to treatment, or unfortunate nonresponders.
Indeed, the N-of-1 trial can quantify treatment effect estimates specific to that
individual.
A second reason for conducting an N-of-1 RCT is for patients who would,
because of age restrictions, comorbidity, or concurrent therapy, have been excluded
from existing parallel group RCTs. A particular strength of N-of-1 RCTs is that they
can address the question of whether benefits extend to such individuals.
Third, some patients have symptoms that lack evidence-based management
options, or are refractory to standard medical management. Determined clinicians
may be tempted to trial off-label interventions to alleviate their patient’s suffering. In
these cases, an N-of-1 RCT allows for objective assessment of untested therapeutic
strategies.
Fourth, patients using a therapy with anticipated benefits may be experiencing
troubling symptoms for which the treatment they are using may, or may not, be
responsible. N-of-1 RCTs can provide definitive evidence confirming the culpability,
or exoneration, of the particular treatment (Joy et al. 2014).
Fifth, sometimes patients remain on a treatment for extended periods and it is
unclear whether there is any ongoing benefit. Given the rise of polypharmacy and the
increased recognition of its risks, the importance of reevaluating medications is
increasingly clear. RCTs often provide only limited data on the long-term efficacy of a treatment. N-of-1 RCTs can clarify whether or not a medication is providing ongoing benefit. A good use case is chronic proton pump inhibitor (PPI) therapy in an asymptomatic patient with a history of gastroesophageal reflux disease.

N-of-1 RCTs Addressing Treatment Effects in a Group of Patients

Soon after their introduction into medicine, proponents suggested N-of-1 RCTs may
hold promise as a tool for efficient early drug development. The proposal addressed
three major questions faced by drug developers before engaging in large, costly
parallel group RCTs. First, does the drug in question show sufficient promise to
justify drug development? Second, what patient population will be most responsive
to the drug? Third, what is the optimal dose to maximize benefit and minimize
adverse effects?
In drug development, these questions are managed with a combination of small efficacy studies and small studies in nonrepresentative healthy volunteers examining safety, tolerance, pharmacology, and drug disposition.
The efficacy studies are often unblinded and uncontrolled, instead using historical
reference groups. The data from these studies are of limited value due to bias and
limited power. The problem manifests when trying to use the data during the design
of the first large parallel group RCT. Investigators are forced to gamble on the most
efficacious dose (or doses if they opt for multiple treatment arms) and which
population is most likely to benefit (Guyatt et al. 1990b).
N-of-1 RCTs allow for methodologically robust small-scale studies that can
address whether a drug shows promise (Phase 3), which patient populations are
most responsive, and which doses are optimal (Phase 1). These principles are
demonstrated in an early N-of-1 RCT examining the role of amitriptyline in fibro-
myalgia (Guyatt et al. 1988). Low-dose amitriptyline is currently a first-line agent for
the treatment of fibromyalgia, but at the time of this N-of-1 RCT series – reported in
1988 – there was only one parallel group RCT suggesting benefit.

Table 1 Benefits of N-of-1 RCTs depending on purpose

Improving patient care | Drug development
Reliably answer clinical questions for patients regarding the efficacy of interventions | Identify if a drug shows promise to justify drug development
Patients will directly benefit from their participation | Identify the patient population that will be most responsive
Patients are guaranteed an intervention arm | Identify the optimal dose to maximize benefit and minimize adverse effects
Evidence can be generated for patients who would not qualify for clinical trials, due to age, comorbidity, or concurrent therapies | Low cost compared to large parallel group RCTs
 | Accelerated timeline compared to large parallel group RCTs

A group at McMaster conducted 23 N-of-1 RCTs that demonstrated rapid


onset of beneficial effect in a number of patients, strongly supporting the efficacy of
low-dose amitriptyline for fibromyalgia. The group went on to conduct similar
studies assessing tetrahydroaminoacridine in Alzheimer’s patients (no important
benefit at all) (Molloy et al. 1991), and the efficacy of home oxygen in reducing
symptoms in patients with chronic obstructive pulmonary disease with exertional
hypoxemia (beneficial in very few patients) (Nonoyama et al. 2007).
These success stories demonstrate how N-of-1 RCTs can address treatment
efficacy in a group of patients. Almost 30 years after the publication of the paper
suggesting their possible use in drug development, their implementation in this arena
remains an idea waiting to be tested (Table 1).

Determining Appropriateness for an N-of-1 RCT

For a patient to be deemed appropriate for an N-of-1 RCT, the clinical circumstances
must meet particular requirements. N-of-1 RCTs are useful when uncertainty exists
regarding treatment effect (either benefit or harm). Earlier in this chapter, we
provided examples of circumstances in which such uncertainty is likely to exist.
An N-of-1 RCT requires that specific clinical circumstances be met:

1. The outcome of interest (typically symptoms) should occur frequently, ideally daily.
Intervention period lengths must be tailored to outcome frequency. If the
outcome is infrequent, the requirement for treatment periods sufficiently long
for the outcome to be manifest may make the N-of-1 RCT excessively burden-
some for both patient and clinician. One exception is when treatments are
unusually expensive, in which case clinicians and patients may be particularly
motivated to complete the trial (Kravitz et al. 2008).
2. The condition should be chronic and stable.
Acute symptoms may represent transient conditions that are likely to resolve
spontaneously. By choosing a stable condition in terms of severity and symptoms,
clinicians reduce the random error that may make true treatment effects very
difficult to detect. Stability does not preclude frequently episodic conditions, such
as a child with multiple seizures a day.
3. Interventions should have rapid onset and termination of effect.
Rapid onset ensures that intervention periods can be a reasonable length. An
N-of-1 RCT with selective serotonin reuptake inhibitors would, for instance, be
prohibitively cumbersome given the 4–6 weeks required at a minimum for
treatment effect, and several weeks for tapering to discontinuation. If each
intervention period were 8 weeks in length, and there were three pairs of treatment periods,
the total trial length would be at least 48 weeks (sufficient time for spontaneous
resolution of the condition).
Rapid termination of action ensures that treatment effects do not influence
comparator periods, without requiring washout periods. Typically, if there are
residual effects, the treatment periods are lengthened and the patient/physician
team considers only the data after resolution of effects. For instance, if one expects
treatment effects to persist for a week, treatment periods can continue for 2 weeks,
and one can use data only from the second week. Alternatively, a washout period
can be used as a buffer between periods to prevent carryover effects.

Designing an N-of-1 RCT

N-of-1 RCTs represent multiple crossover trials of an intervention and one or more comparators that, to minimize risk of bias, include randomization of sequence order. Interventions are typically drugs (but may be nonpharmacologic or complementary and alternative medicine), compared in one of three ways: drug versus placebo, drug versus comparator drug, or high dose versus low dose of the same drug. For optimal rigor, clinicians and patients must be blind to allocation.
Blinding is not always possible (e.g., physical therapy). N-of-1 trials are particularly
amenable to being codesigned by patient and clinician, including with regard to
outcome measure selection. Typically outcomes are symptoms monitored daily.
Some researchers choose to use physiologic and biochemical variables as outcomes,
but the value of such surrogates for inferring patient-important benefit is limited. The
number of pairs – each pair including one period of each treatment and comparator –
continue until both patient and clinician are satisfied that superiority or equivalence
have been demonstrated. A run-in period may be employed for the same reason as in
other trials: establishing dose tolerability and compliance.

Choosing an Outcome

Outcomes can be measures of symptoms or physiologic variables. A 2016 systematic survey of 100 N-of-1 RCTs conducted between 1950 and 2013 using an ABAB
design and assessing a health intervention for a medical condition identified mea-
sures of symptoms as most common: Likert scales (55% of trials), visual analogue
scales (30%), patient diaries (26%), and patient-generated questionnaires (18%).
Physiologic outcomes were used in 35% of trials, including clinical tests such as blood
pressure or laboratory tests such as erythrocyte sedimentation rate (Punja et al.
2016). A single N-of-1 RCT can address more than one outcome. Regardless of
the outcome(s) chosen, clinicians should work with patients to identify patient-
important targets prior to starting the trial.
As mentioned earlier in the chapter, the outcome measure is ideally one that can
be measured frequently (e.g., daily) to ensure that there is enough data to analyze
within an intervention period (typically 5–14 days). Physiologic parameters should be ones that patients can measure themselves at their convenience, such as blood pressure or blood sugar concentrations. Automated tracking using cell phones and other monitoring devices is likely to prove increasingly useful for outcome mon-
itoring (Ryu 2012; Kravitz et al. 2018).
Likert scales are widely used outside of N-of-1 research for their simplicity,
allowing for patient familiarity, ease in understanding, and ease in interpretation.
There is evidence to suggest that seven-point scales are more sensitive in detecting
small differences in comparison to fewer response options, and are more convenient
than visual analogue scales (Guyatt et al. 1987; Girard and Ely 2008). If using a
Likert scale to assess symptoms, clinicians should consider including items specif-
ically assessing symptom interference with daily activities. An example of how one
might phrase this is presented below.
Please indicate how much your pain interferes with your everyday activities of
daily living, such as cooking, cleaning, and getting dressed:

1. No interference at all
2. A little interference
3. Some interference
4. Moderate interference
5. Much interference
6. Severe interference
7. I am unable to carry out these activities as a result of the interference

Trial Length

The duration of an N-of-1 RCT will depend on the number of days in each treatment
period, and the number of pairs of periods undertaken. Most often the trial addresses
a single intervention and a single comparator. Typically each treatment period will
range between 5 and 14 days (median: 10 days), the interquartile range of all
captured N-of-1 RCTs in the aforementioned systematic survey by Punja and
colleagues (Punja et al. 2016). In terms of the number of pairs of treatment periods (one period in which the patient receives the intervention and one period with the comparator), in Punja's survey 75% of trials required between 2 and 5 pairs (median: 3 pairs).
Based on these numbers, a typical N-of-1 RCT with a pair of treatment periods
will take 20 days (10 days for each of the two arms); with 3 such pairs, the total
duration would be 60 days.
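The arithmetic generalizes directly; a short sketch (the optional washout length is hypothetical, since washouts are not always used) computes the total duration.

    # Sketch: total N-of-1 RCT duration from period length and number of
    # pairs, with an optional washout between periods (hypothetical).
    def trial_duration_days(days_per_period, n_pairs, washout_days=0):
        periods = 2 * n_pairs                       # two periods per pair
        return periods * days_per_period + (periods - 1) * washout_days

    print(trial_duration_days(10, 1))   # 20 days, as in the text
    print(trial_duration_days(10, 3))   # 60 days, as in the text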

Randomization

Within an N-of-1 trial, the order of treatments is subject to randomization, in contrast to a randomized crossover trial, where the patient is the unit of randomization. Randomization is an essential component of N-of-1 RCTs: it controls for factors that may influence the outcome and vary over time, and it facilitates blinding of the clinician and patient. Clinicians can conduct randomization by tossing a coin, utilizing a computer algorithm, or consulting a randomization table, and can randomize the order in each pair separately.
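A minimal sketch of the computer-algorithm option, randomizing the order separately within each pair, is shown below; the arm labels and function name are illustrative only.

    # Sketch: randomize treatment order within each pair of an N-of-1 RCT.
    import random

    def nof1_schedule(n_pairs, seed=None):
        rng = random.Random(seed)
        schedule = []
        for _ in range(n_pairs):
            pair = ["intervention", "comparator"]
            rng.shuffle(pair)        # randomize order within this pair
            schedule.extend(pair)
        return schedule

    print(nof1_schedule(3, seed=42))
    # e.g., ['comparator', 'intervention', 'intervention', 'comparator',
    #        'intervention', 'comparator']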

Collaboration with Pharmacy

N-of-1 RCTs at their most rigorous are blinded to protect against the bias of patient
and clinician expectations, co-interventions, and placebo effects. To blind effectively
and efficiently, physicians should and often do collaborate with a local pharmacy to
prepare treatments and comparators that are identical in appearance, taste, texture,
and smell. Pharmacists can achieve this goal by crushing the active drugs and
repackaging in capsules. Placebos can be filled with an inert substance.
Pharmacists can also play a number of other important roles in N-of-1 RCTs. They can provide input on drug half-life and thus help determine whether washout periods are needed and how long each treatment period should be. Certainly, with increased scope of practice, Doctors of Pharmacy in particular can design, conduct, and interpret the N-of-1 trial. Pharmacy technicians can also help, particularly with monitoring drug compliance by conducting pill counts and assessing whether patients are refilling their medications at the correct time.

Advanced Techniques

Advanced trial design techniques allow for adaptive features. Adaptive arms allow for crossover, dose change, or discontinuation
of an intervention based on patient preference or preset outcomes, such as
adverse effects or response. An example of such a design is establishing a
predetermined stopping rule that minimizes patient exposure to an inferior
treatment. This is particularly valuable when comparing several treatment arms (Duan et al. 2013).
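One simple preset stopping rule, sketched below with hypothetical parameters (stop once the same arm has "won" k consecutive pairs), illustrates how such a rule can be made explicit in the protocol; the specific rule is chosen purely for illustration.

    # Sketch (hypothetical rule): stop the trial once one arm wins
    # k consecutive pairs of treatment periods.
    def should_stop(pair_winners, k=3):
        # pair_winners: one winning arm label per completed pair.
        if len(pair_winners) < k:
            return False
        return len(set(pair_winners[-k:])) == 1  # last k pairs agree

    print(should_stop(["A", "A"]))            # False: too few pairs
    print(should_stop(["B", "A", "A", "A"]))  # True: A won 3 straight pairs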

Interpreting the Data

There are a number of options for interpreting the data that depend on the goals of the trial, the trial design, and the data generated. Broadly, these can be broken down into statistical methods, which were used in 84% of the trials that Punja reported, and visual inspection alone, which clinicians and patients used in the other 16% (Punja et al. 2016).

Visual Inspection

Using visual inspection of the data, clinicians and patients examine a graph displaying repeated measures of the outcome of interest with specification of intervention and control arms. Features suggesting an arm is effective include: (1) minimal variability within periods; (2) consistency in the magnitude and direction of the difference between the arm of interest and the comparator arm; and (3) a difference between the arm of interest and its comparator that is large in comparison to the variability within periods. Review of the evidence collected after two or more pairs of periods can help determine whether to conduct further pairs.
The rationale for visual inspection is that both clinician and patient can intuitively
assess the components of efficacy – direction, magnitude, and consistency of effect –
in a straightforward manner that may satisfy both and simplify decision-making. The
limitation is the subjective nature of the assessment that can lead to inconsistent and
incorrect inferences. This methodology is appropriate only for individual patient
clinical decision-making rather than using the N-of-1 methodology to make infer-
ences about treatment effects in a population.
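A minimal plotting sketch of the kind of display described above is given below; the daily scores, period lengths, and randomized order are hypothetical, and matplotlib is assumed to be available:

    import matplotlib.pyplot as plt

    # Hypothetical daily symptom scores (7 = best) across two pairs of
    # 7-day periods, in randomized order: A = treatment, B = placebo
    periods = [("A", [5, 5, 6, 5, 6, 5, 5]),
               ("B", [4, 4, 5, 4, 4, 3, 4]),
               ("B", [4, 5, 4, 4, 4, 4, 3]),
               ("A", [6, 5, 5, 6, 5, 6, 5])]

    colors = {"A": "tab:blue", "B": "tab:orange"}
    start = 0
    for arm, scores in periods:
        days = list(range(start, start + len(scores)))
        plt.plot(days, scores, marker="o", color=colors[arm])
        start += len(scores)

    plt.xlabel("Study day")
    plt.ylabel("Symptom score (7 = best)")
    plt.show()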

Nonparametric Statistical Tests

Broadly speaking, nonparametric tests refer to those that do not assume the data are normally distributed; this makes them a more conservative choice. There are a number of nonparametric statistical tests available. We will focus on the Wilcoxon signed rank test and a quantitative randomization test.

Wilcoxon Signed Rank Test

The Wilcoxon signed rank test incorporates the relative size of the treatment differences through their ranks, but does not take the absolute magnitude of the differences into account. To conduct a Wilcoxon signed rank test, the differences within treatment cycles are ranked by absolute difference from smallest to largest (i.e., independent of direction). The sum of the ranks in favor of the treatment is compared to the sum of the ranks in favor of the comparator. Under the null hypothesis the sums would be expected to be equivalent.
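A minimal sketch of this test in Python, using hypothetical per-cycle mean scores (the numbers are illustrative only, not data from any trial); scipy.stats.wilcoxon performs the ranking and rank-sum comparison described above:

    from scipy.stats import wilcoxon

    # Hypothetical mean symptom scores per treatment cycle (higher = better)
    treatment = [5.2, 4.8, 5.6, 5.1]
    comparator = [4.1, 4.4, 4.9, 4.3]

    # One-sided test of whether treatment scores exceed comparator scores
    stat, p_value = wilcoxon(treatment, comparator, alternative="greater")
    print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")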
More sophisticated than either of the previous tests is a pure quantitative randomization test. This approach assesses not only the direction and size of the differences in comparing arms, but also the mean treatment difference. The probability of a given mean treatment difference is calculated as the proportion of randomizations that would lead to an outcome at least as extreme as the one observed, over the denominator of all possible randomizations. The null hypothesis for this test states that the expected mean treatment difference is zero.
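A sketch of such a randomization (permutation) test, again with illustrative numbers only: under the null hypothesis the treatment labels within each pair are exchangeable, so we enumerate all 2^k within-pair label swaps and count how often the mean difference is at least as large as the one observed:

    from itertools import product

    # Hypothetical mean scores per period, as (treatment, comparator) pairs
    pairs = [(5.2, 4.1), (4.8, 4.4), (5.6, 4.9), (5.1, 4.3)]
    diffs = [t - c for t, c in pairs]
    observed = sum(diffs) / len(diffs)

    # Enumerate all possible within-pair randomizations (sign flips)
    signs_space = list(product([1, -1], repeat=len(diffs)))
    count = sum(
        1 for signs in signs_space
        if sum(s * d for s, d in zip(signs, diffs)) / len(diffs) >= observed
    )

    # One-sided p-value: proportion of randomizations at least as extreme
    print(f"p = {count}/{len(signs_space)} = {count / len(signs_space):.4f}")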

Parametric Statistical Tests

Parametric statistical tests, by contrast, assume the data are normally distributed. The two most commonly used tests are the analysis of variance (ANOVA) and Student's t-test, which is a special case of the ANOVA model. Two factors will help guide which test to use. The first is the number of arms: if the trial under consideration is comparing three or more treatment arms, then ANOVA is the preferred approach, as it allows for a single analysis (F-test) across arms, which the t-test does not. The second, discussed under "ANOVA" below, is whether repeated observations can be treated as independent.

Student’s T-Test

The t-test is only appropriate for N-of-1 RCTs comparing two arms, regardless of whether the comparator is placebo or an alternative treatment. In the general case, Student's t-test can be either paired or unpaired, but in an N-of-1 RCT each intervention period is paired with a comparator period by design, and therefore the paired t-test constitutes the appropriate approach. To conduct the paired t-test one calculates a single value for each treatment period. So, for instance, if the patient has completed a daily diary for 7 days, and each day has answered three questions, the score for that period will be the mean of 21 observations. One then makes the same calculation for the paired control period and examines the difference in means, which the t-test addresses. The degrees of freedom for the test are the number of blocks (pairs) of treatment periods minus one. The t-test is routinely used for N-of-1 RCTs and is universally included in statistical packages.
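A minimal sketch of the paired t-test, again with illustrative numbers only (one mean score per period, computed as described above):

    from scipy.stats import ttest_rel

    # Hypothetical period means: each value is the mean of the daily diary
    # observations within one treatment or one control period
    treatment_periods = [5.2, 4.8, 5.6, 5.1]
    control_periods = [4.1, 4.4, 4.9, 4.3]

    # Paired t-test on within-pair differences; df = number of pairs - 1 = 3
    result = ttest_rel(treatment_periods, control_periods)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")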

ANOVA

Often ANOVA simply functions as an extension of the t-test and will provide the same result. There is at least one case in which this is not true. ANOVA may provide a different result when there is no dependency between one observation and the next. Under such circumstances the ANOVA can use each individual observation (in the example above, 7 observations per period instead of one). Unfortunately, independence of observations will rarely if ever hold: for most illnesses, good days tend to run together, as do bad ones.

Aggregation of N-of-1 RCTs

The results of individual N-of-1 RCTs can be aggregated to estimate population effects with power comparable to conventional RCTs, using similar analytic techniques. There are three common approaches to aggregation of N-of-1 data. The first method is to analyze the data as a traditional multipatient crossover trial. The analysis should be planned prospectively but can be done retrospectively. It is important that the analysis consider the possibility of carryover of treatment effect between periods; this is true of any crossover trial analysis.
The second method is to use conventional meta-analysis techniques (see Chap. 8.11 to learn more about meta-analysis). There are at least two benefits to meta-analyzing N-of-1 RCTs. The first is to generate more precise estimates of treatment effects, and of predictors of patient response versus nonresponse, sustained response, and susceptibility to side effects (Lillie et al. 2011). Second, in trials where N-of-1 methodology is compared to standard of care, meta-analysis can assess whether the N-of-1 methodology provides benefit over traditional clinical care.
The third method is a Bayesian analysis that has been adapted specifically for use in N-of-1 RCTs (Zucker et al. 2010). What distinguishes Bayesian analysis from other forms of aggregation is the requisite incorporation of preexisting estimates into the analysis: typically, analyses require prespecification of the population mean effect and variance. This is a liability in the many cases where this information is neither available in the literature nor easily estimated.
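As a sketch of the meta-analytic approach, each patient's N-of-1 trial contributes an effect estimate and standard error, which can then be pooled; the fixed-effect inverse-variance pooling below is shown for simplicity (random-effects models are common in practice), and the per-patient numbers are hypothetical:

    # Hypothetical per-patient treatment effects (mean difference) and SEs
    effects = [1.1, 0.6, 0.9, 0.2]
    ses = [0.4, 0.5, 0.3, 0.6]

    # Inverse-variance (fixed-effect) pooled estimate across N-of-1 trials
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5

    print(f"pooled effect = {pooled:.2f} (SE {pooled_se:.2f})")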

Reporting for N-of-1 RCT

The previous sections have focused on the use of N-of-1 RCTs for either clinical
practice or as part of a research endeavor. If the latter, the issue of how one reports
N-of-1 RCTs for a wider audience (typically in a publication) arises.
As with other types of trials, reporting standards for N-of-1 RCTs are maintained by the Consolidated Standards of Reporting Trials (CONSORT) group. To ensure optimal reporting of N-of-1 RCTs, CONSORT has published a standardized 25-item checklist (CENT), most recently updated in 2015 (Vohra et al. 2015). The recommendations were based on CONSORT recommendations for parallel group RCTs, and address reporting expectations for the title and abstract, specifying rationale and objectives in the introduction, trial design, patient selection, intervention, outcomes, sample size, randomization, allocation concealment, results, analyses, discussion, protocol registration, and funding.

Ethics

In addition to the ethical principles of everyday clinical care – autonomy, beneficence, and nonmaleficence – the ethical requirements for an N-of-1 trial depend on the purpose of the trial. Broadly, N-of-1 trials can be conducted to improve the care of an individual patient, to produce generalized knowledge, or as a blend of the two goals.
No additional scrutiny, including research ethics board (REB) approval, is required if the N-of-1 RCT is done as a part of clinical care. The use of randomization and blinding does not, in and of itself, determine whether a therapeutic model is research. Similarly, when the stated goal of the trial is quality improvement of care, there is no need for additional scrutiny. The way to conceptualize this is to appreciate that the physician is engaging in the practice of confirming his or her clinical hypothesis – in this case, whether an intervention is effective. The key prerequisite is informed consent from the patient. This approach is sensible when one considers that N-of-1 RCTs do not pose any increased risk to patients compared to informal trials of therapy, or prescription without monitoring for effectiveness. Indeed, the added rigor of close monitoring of benefits and adverse effects for a particular patient supports the position that N-of-1 RCTs represent optimal clinical care (Guyatt et al. 1990a; Molloy et al. 1991; Irwig et al. 1995; Nonoyama et al. 2007).
When the purpose is a blend of clinical care with a secondary research interest in
analyzing the data to inform future care, the clinical component represents the same
low risk as conducting N-of-1 RCTs for clinical care or quality assurance. Given the
additional intention of research, however, the project should undergo research ethics
board assessment to evaluate the risk of analysis of anonymized data. This is
typically an expedited review given the low-risk nature, and should be considered
equivalent to a chart review.
When investigators conduct trials to create generalized knowledge, ethical stan-
dards for other research apply. One purpose in formal research is investigating the
impact of N-of-1 methodology on patient outcomes. The alternative is the classic
research-oriented N-of-1 RCT with the goal of producing generalizable insights
regarding a therapeutic intervention in a population. These models must meet the
standards for clinical research, including full REB approval and federal regulatory
oversight, as appropriate (Punja et al. 2014).

An Example of an N-of-1 RCT

The following is an example of one of the early N-of-1 RCTs conducted as part of McMaster's clinical service. A 34-year-old woman with a past medical history significant for scleroderma was referred for evaluation of treatment for persistent weakness, in the context of possible myasthenia gravis. Two separate encounters with specialists revealed electromyographical findings atypical for the disease, and so whether treatment with pyridostigmine would provide benefit remained uncertain. This trial met our criteria: the patient experienced symptoms daily, her disease was chronic and stable, there was uncertainty about therapeutic benefit, and pyridostigmine has rapid onset and termination of effect.
The intervention was pyridostigmine 30 mg by mouth twice daily, and the trial was placebo controlled. Each treatment period was 7 days, and the outcome measure was daily ratings of weakness and energy levels.
Figure 1 presents the patient's reported data using a seven-point Likert scale, where 7 represents the highest level of function and 1 represents the lowest. There were four pairs of treatment periods. Unsurprisingly, the patient did not have 100% adherence to symptom charting.
Visual inspection reveals that the treatment appears consistently better than the placebo. This is particularly clear in Figs. 2 and 3, which show the mean symptom score in each treatment period and the differences in each pair, respectively.

Fig. 1 N-of-1 RCT results: Mean daily Likert score

Fig. 2 N-of-1 RCT mean period score



Fig. 3 N-of-1 RCT treatment and placebo difference scores

A two-tailed paired t-test comparing the differences in symptom scores across the four pairs of periods was conducted to confirm the visual inspection. The results, shown in Fig. 4, confirm a clear benefit of treatment.

Summary and Conclusion

N-of-1 RCTs are unique among experimental studies in giving physicians the ability to answer clinical questions for individual patients in a methodologically rigorous way. Other designs – including parallel group RCTs, observational studies, and meta-analyses – are limited to answering questions at the population level. For this reason, N-of-1 trials have been suggested as the pinnacle of the evidence pyramid. Aggregation of N-of-1 RCTs by meta-analysis and Bayesian techniques allows for treatment effect estimates at the population level. By conducting an N-of-1 RCT, physicians are afforded the opportunity to offer optimal care to the individual patients whom they serve.

Key Facts

• N-of-1 RCTs are single-patient multiple crossover trials that seek to answer a
clinical question and improve patient care. Multiple N-of-1 RCTs can also inform
treatment effects in a population.

Fig. 4 N-of-1 RCT t-test results

• N-of-1 RCTs constitute the highest quality evidence for a particular patient’s care,
because the evidence is specific to the individual patient (OCEBM Levels of
Evidence Working Group 2011).
• Varied analytic techniques can inform the interpretation of N-of-1 RCTs including
nonstatistical techniques (i.e., visual inspection) and statistical techniques includ-
ing both nonparametric and parametric tests.

Cross-References

▶ Introduction to Meta-Analysis

References
Barsky AJ, Saintfort R, Rogers MP, Borus JF (2002) Nonspecific medication side effects and the
nocebo phenomenon. JAMA 287:622–627

Duan N, Kravitz RL, Schmid CH (2013) Single-patient (n-of-1) trials: a pragmatic clinical decision
methodology for patient-centered comparative effectiveness research. J Clin Epidemiol 66:S21–
S28. https://doi.org/10.1016/j.jclinepi.2013.04.006
Girard TD, Ely EW (2008) Delirium in the critically ill patient. Handb Clin Neurol 90:39–56.
https://doi.org/10.1016/S0072-9752(07)01703-4
Guyatt G, Sackett D, Taylor DW et al (1986) Determining optimal therapy — randomized trials in
individual patients. N Engl J Med 314:889–892. https://doi.org/10.1056/NEJM198604033141406
Guyatt GH, Townsend M, Berman LB, Keller JL (1987) A comparison of Likert and visual
analogue scales for measuring change in function. J Chronic Dis 40:1129–1133
Guyatt G, Sackett D, Adachi J, Roberts R, Chong J, Rosenbloom D, Keller J (1988) A clinician’s
guide for conducting randomized trials in individual patients. CMAJ: Canadian Medical
Association Journal 139(6):497–503
Guyatt GH, Keller JL, Jaeschke R, Rosenbloom D, Adachi JD, Newhouse MT (1990a) The n-of-1
randomized controlled trial: clinical usefulness: our three-year experience. Annals of Internal
Medicine 112(4):293–299
Guyatt GH, Heyting A, Jaeschke R et al (1990b) N of 1 randomized trials for investigating new
drugs. Control Clin Trials 11:88–100
Irwig L, Glasziou P, March L (1995) Ethics of n-of-1 trials. Lancet (Lond) 345:469
Joy TR, Monjed A, Zou GY et al (2014) N-of-1 (single-patient) trials for statin-related myalgia. Ann
Intern Med 160:301–310. https://doi.org/10.7326/M13-1921
Kazdin A (2011) Single-case research designs: methods for clinical and applied settings, 2nd edn. Oxford University Press, New York
Kratochwill TR (ed) (2013) Single subject research: strategies for evaluating change. Academic Press
Kravitz RL, Duan N, White RH (2008) N-of-1 trials of expensive biological therapies: a third way?
Arch Intern Med 168:1030–1033. https://doi.org/10.1001/archinte.168.10.1030
Kravitz R, Duan N, Eslick I et al (2014) Design and implementation of N-of-1 trials: a user's guide. Agency for Healthcare Research and Quality, US Department of Health and Human Services, Rockville. www.ahrq.gov
Kravitz RL, Schmid CH, Marois M et al (2018) Effect of mobile device–supported single-patient multi-crossover trials on treatment of chronic musculoskeletal pain. JAMA Intern Med. https://doi.org/10.1001/jamainternmed.2018.3981
Lillie EO, Patay B, Diamant J et al (2011) The n-of-1 clinical trial: the ultimate strategy for
individualizing medicine? Per Med 8:161–173. https://doi.org/10.2217/pme.11.7
Mirza RD, Punja S, Vohra S, Guyatt G (2017) The history and development of N-of-1 trials. J R Soc
Med 110:330–340. https://doi.org/10.1177/0141076817721131
Molloy DW, Guyatt GH, Wilson DB et al (1991) Effect of tetrahydroaminoacridine on cognition,
function and behaviour in Alzheimer’s disease. CMAJ 144:29–34
Nonoyama ML, Brooks D, Guyatt GH, Goldstein RS (2007) Effect of oxygen on health quality of
life in patients with chronic obstructive pulmonary disease with transient exertional hypoxemia.
Am J Respir Crit Care Med 176:343–349. https://doi.org/10.1164/rccm.200702-308OC
OCEBM Levels of Evidence Working Group (2011) The Oxford 2011 Levels of Evidence. Oxford
Centre for Evidence-Based Medicine. https://www.cebm.net/index.aspx?o=5653
Punja S, Eslick I, Duan N, Vohra S, the DEcIDE Methods Center N-of-1 Guidance Panel (2014) An
ethical framework for N-of-1 trials: clinical care, quality improvement, or human subjects
research? In: Kravitz RL, Duan N (eds), and the DEcIDE Methods Center N-of-1 Guidance
Panel (Duan N, Eslick I, Gabler NB, Kaplan HC, Kravitz RL, Larson EB, Pace WD, Schmid
CH, Sim I, Vohra S). Design and implementation of N-of-1 trials: a user’s guide. AHRQ
Publication No. 13(14)-EHC122-EF. Agency for Healthcare Research and Quality, Rockville,
Chapter 2, pp. 13–22, January 2014. http://www.effectivehealthcare.ahrq.gov/N-1-Trials.cfm
Punja S, Bukutu C, Shamseer L et al (2016) N-of-1 trials are a tapestry of heterogeneity. J Clin
Epidemiol 76:47–56. https://doi.org/10.1016/j.jclinepi.2016.03.023
Ryu S (2012) Book review: mHealth: new horizons for health through Mobile technologies: based
on the findings of the second global survey on eHealth (global observatory for eHealth series,
volume 3). Healthc Inform Res 18:231. https://doi.org/10.4258/hir.2012.18.3.231

Shamseer L, Sampson M, Bukutu C, Schmid C (2015) CONSORT extension for reporting N-of-1
trials (CENT) 2015: explanation and elaboration. BMJ 76:18–46
Stunnenberg BC, Raaphorst J, Groenewoud HM et al (2018) Effect of Mexiletine on muscle
stiffness in patients with nondystrophic Myotonia evaluated using aggregated N-of-1 trials.
JAMA 320:2344. https://doi.org/10.1001/jama.2018.18020
Vohra S, Shamseer L, Sampson M et al (2015) CONSORT extension for reporting N-of-1 trials
(CENT) 2015 statement. BMJ 350:h1738. https://doi.org/10.1136/BMJ.H1738
Zucker DR, Ruthazer R, Schmid CH (2010) Individual (N-of-1) trials can be combined to give
population comparative treatment effect estimates: methodologic considerations. J Clin
Epidemiol 63:1312–1323. https://doi.org/10.1016/j.jclinepi.2010.04.020
Noninferiority Trials
69
Patrick P. J. Phillips and David V. Glidden

Contents
Introduction  1298
  Hypotheses and Notation  1299
Motivation for NI  1300
Case Studies  1300
  DISCOVER  1301
  STREAM  1301
Defining Margin of NI  1302
  The 95/95 Method  1302
  Combination Therapies  1303
  Public Health Clinical Criteria  1304
Design and Analysis  1305
  Sample Size  1305
  How to Design a NI Trial  1306
  Analysis  1308
  Choice of Analysis Populations and Estimands  1310
Further Challenges Unique to NI  1311
  Assay Sensitivity  1311
  Effect Preservation in Determining the NI Margin  1312
  Two Sides of the Same Coin: Superiority Versus NI  1313
  Testing for Noninferiority and Superiority in the Same Trial  1314
  Sensitivity of Trial Results to Arbitrary Margin and Control Arm Event Rate  1314
  Justification of Margin in Practice  1315
  Interim Analyses and Data and Safety Monitoring  1315

P. P. J. Phillips (*)
UCSF Center for Tuberculosis, University of California San Francisco, San Francisco, CA, USA
Department of Epidemiology and Biostatistics, University of California San Francisco, San
Francisco, CA, USA
e-mail: [email protected]
D. V. Glidden
Department of Epidemiology and Biostatistics, University of California San Francisco, San
Francisco, CA, USA
e-mail: [email protected]


Alternative Analyses and Designs and Innovative Perspectives on NI Trials  1316
  Bayesian Approaches to NI  1316
  Trial Designs to Evaluate Different Treatment Durations  1317
  Three-Arm NI Design  1318
  Pragmatic Superiority Strategy Trial  1318
  Averted Infections Ratio  1319
Conclusions and Recommendations for Design/Conduct/Reporting  1319
Key Facts  1320
Cross-References  1320
References  1320

Abstract
In this chapter we provide an overview of non-inferiority trials. We first introduce
two motivating examples and describe scenarios for when a non-inferiority trial is
appropriate. We next describe the procedures for defining the margin of non-
inferiority from both regulatory and public health perspectives and then provide
practical guidance for how to design a non-inferiority trial and analyze the
resulting data, paying particular attention to regulatory and other published
guidelines. We go on to discuss particular challenges unique to non-inferiority
trials including the importance of assay sensitivity, the enigma of effect preser-
vation, switching between non-inferiority and superiority, the interpretation of
results when event rate assumptions are incorrect, and the place of interim
analyses and safety monitoring. We conclude the chapter by addressing alterna-
tive methodologies and innovative perspectives on non-inferiority trials that have
been proposed in an attempt to mitigate these challenges, including Bayesian
approaches, alternative three-arm and pragmatic designs, and methods that
address different treatment durations and the averted infections ratio.
Keywords
Noninferiority · Margin of noninferiority · Assay sensitivity · Effect
preservation · Active control trial · Biocreep

Introduction

The objective of a superiority randomized clinical trial is to evaluate whether the investigational intervention has superior efficacy (or effectiveness, or safety, depending on the specific trial objectives) as compared to the control arm. The objective of a noninferiority (NI) trial, in contrast, is to evaluate whether the investigational intervention has efficacy that is not much worse than, or noninferior to, that of the control intervention. Critical to this determination of NI is how to quantitatively describe "not much worse"; this quantity is called the margin of NI, the largest reduction in efficacy that is still considered to be consistent with a finding of NI, and it must be prespecified prior to trial start. If there is sufficient evidence that the reduction in efficacy observed in the trial is no more than the margin of NI, then a conclusion of NI is appropriate.

Historically, NI trials were a subset of equivalence trials which had the objective
of showing that an investigational intervention was not much worse and not much
better than a control intervention (Wellek 2010). In practice, the dual objectives of
equivalence are less relevant to randomized clinical trials of interventions to improve
human health, apart from studies to demonstrate the bioequivalence of two pharma-
ceutical agents, and this chapter therefore relates exclusively to the NI trial design.
This chapter describes aspects related to the design, conduct, analysis, and interpre-
tation of NI trials, although one could extend many of these ideas to equivalence
trials if needed.
The most common NI trial design is a two-arm trial where the internal comparator
is an active control intervention which usually reflects a standard of care treatment,
and the focus of this chapter is therefore on this two-arm design; other variations on
this design are addressed in section “Alternative Analyses and Designs and Innova-
tive Perspectives on NI Trials.”

Hypotheses and Notation

The most common primary efficacy outcome of a clinical trial relates to the occurrence or nonoccurrence of an event of interest, e.g., death, failure, cure, or stable culture conversion. We therefore consider a treatment effect of the form θEC = pE − pC, where θEC is the true treatment effect, and where pE and pC are the proportions of participants with the event on the investigational and control arms, respectively (the former is sometimes called the experimental arm), and where the difference might be calculated on the linear scale (for a risk difference) or the log scale (for a risk ratio). Although this convention is used here, the discussion in this chapter can easily be extended to NI trials with other types of primary outcomes, such as continuous or ordinal.
The one-sided null and alternative hypotheses for a superiority and NI trial are shown in Table 1. For simplicity, and without loss of generality, we consider a negative θEC to correspond to a beneficial effect of the investigational intervention on the outcome of interest (e.g., a reduction in mortality or an increase in cure), and therefore δ > 0; we will use this convention throughout the chapter.
In setting the hypotheses of a NI trial alongside those of a superiority trial, the
only difference is in changing the number on the right-hand side of the equations
from 0 to δ; the hypotheses otherwise stay the same. In superiority trials, a minimum
treatment effect that has some analogy to the margin of noninferiority is used for
sample size and power calculations but not for hypothesis testing (see section “Two
Sides of the Same Coin: Superiority Versus NI”).

Table 1 A comparison of null and alternative hypotheses for superiority and NI trials

                          Superiority comparison    NI comparison
Treatment effect measure  pE − pC = θEC             pE − pC = θEC
Null hypothesis           H0: θEC = 0               H0: θEC = δ (δ > 0)
Alternative hypothesis    H1: θEC < 0               H1: θEC < δ
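The decision rule implied by Table 1 can be restated compactly; the following is a generic Wald-type sketch in our notation, not tied to any specific trial:

\[
Z = \frac{\hat{\theta}_{EC} - \delta}{\widehat{SE}(\hat{\theta}_{EC})}, \qquad \text{conclude NI if } Z < -z_{1-\alpha},
\]

which is equivalent to concluding NI when the upper bound \(\hat{\theta}_{EC} + z_{1-\alpha}\,\widehat{SE}(\hat{\theta}_{EC})\) of the confidence interval lies below \(\delta\); with \(\alpha = 0.025\), this is the upper bound of a conventional two-sided 95% confidence interval.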

Since interpretation of a NI trial necessarily involves consideration of historical trial data (see section "Defining Margin of NI" below), we will use θEP to denote the estimate of the effect of the investigational arm as compared to no treatment (placebo) and θCP to denote the estimate of the effect of the active control arm as compared to no treatment.

Motivation for NI

NI trials arise as an option in settings where there are one or more effective
treatments for a condition. Typically, a new product is being developed because
there is some hope that it offers superior efficacy, a better safety profile, simpler
administration, lower cost, or other advantages. This new product should be evalu-
ated in a randomized clinical trial. If there is an established treatment for the
condition, it is usually unethical to randomize trial participants to no treatment
(placebo), and an active control design must be adopted. There may be settings
where the condition under study is transient and not serious and a placebo could be
justified.
In either case, the candidate regimen needs to be evaluated in the context of
other effective regimens. The FDA guidance (Food and Drug Administration
Center for Drug Evaluation and Research (CDER) 2016) for such settings lays
out three major alternatives: a study which examines the incremental value of the
new therapy combined with established standard of care compared to standard of
care alone, a placebo-controlled trial of the new therapy among those who are not
candidates for the current standard of care, or an active controlled trial which
randomizes participants among the standard of care regimen and the candidate
intervention. When the first two options are not feasible or ethical, then the third
option, the NI trial, is used.
Because the new regimen may offer substantial advantages over the current standard of care, it would be enough to show that the new regimen is effective. It may not be superior, but we would want to avoid the situation of introducing the new drug if it is unacceptably worse than the standard of care. This study objective gives rise to the NI trial design. Specifically, our trial would have two objectives: to support the claim that the new regimen is superior to withholding the standard of care (i.e., effective relative to no treatment) and that it is not meaningfully less effective than the standard of care. A major issue here is the choice of the standard of care arm. NI studies formalize these standards by establishing a margin of NI in a formal statistical framework to determine when these two objectives are met.

Case Studies

Two case studies are used to illustrate various aspects of NI trials throughout this chapter and are introduced here.

DISCOVER

Nearly 1.7 million people are infected with HIV yearly, and no vaccine is currently
available. However, there are abundant safe and potent medications which can
suppress HIV replication. In this context, the paradigm of HIV pre-exposure pro-
phylaxis (PrEP) developed. PrEP involves using anti-HIV medication to prevent
HIV acquisition in an HIV negative person. Several randomized trials showed that
daily use of emtricitabine and tenofovir disoproxil fumarate (F/TDF) was a highly
effective PrEP regimen. There is a vigorous pipeline for the development of anti-
HIV drugs and/or delivery systems (e.g., long-acting injection) as candidates
for PrEP.
The DISCOVER study (Mayer et al. 2020) was a randomized double blind active
controlled trial evaluating the efficacy of daily oral emtricitabine and tenofovir
alafenamide (F/TAF) for PrEP. The trial’s primary objective was to show that,
among adults at high risk of acquiring HIV, F/TAF was effective in preventing
incident HIV infection. With a proven safe effective and available regimen (F/TDF),
it is no longer ethical to evaluate future PrEP candidates in trials with a placebo
control. Given that F/TDF is highly effective, it was considered unlikely that F/TAF
would be superior in preventing HIV infections. Instead, the major motivation for the adoption of F/TAF was that the reformulation should have fewer subclinical effects of tenofovir on kidney and bone density. This led the investigators to adopt a NI objective.
Participants took two pills daily: F/TAF (or matching placebo) and F/TDF
(or matching placebo). HIV infection was diagnosed in 7 and 15 participants on
the F/TAF and F/TDF arms, respectively, yielding a relative incidence of 0.47 (95%
CI: 0.19–1.15). Since the 95% confidence interval excluded the prespecified NI
margin of 1.62, NI was concluded.

STREAM

Tuberculosis kills more people than any other single pathogen. 1.2 million people
died from tuberculosis with 10 million new cases in 2019 (World Health Organiza-
tion 2020). When the bacteria develop resistance to the main drug, rifampicin, there
are few treatment options, although new drugs are in development. STREAM Stage
1 was a phase III trial conducted to evaluate a novel 9–11-month regimen for the
treatment of rifampicin-resistant TB and was the first phase III trial to specifically
evaluate any regimen for rifampicin-resistant TB (Nunn et al. 2014). A second stage
of the trial was conducted including regimens with new drugs; for the purposes of
this chapter, we refer to STREAM Stage 1 when referring to the STREAM trial. At
the time that the trial was designed, the standard of care, as recommended in WHO
2011 guidelines (World Health Organization 2011), included a cocktail of 4–7 drugs
given for at least 18 months. A series of nonrandomized interventional cohort studies
in Bangladesh (Van Deun et al. 2010) had identified this 9–11-month regimen

resulted in low rates of treatment failures and relapses with adequate safety. It was
calculated that just the cost of drugs for this regimen was approximately USD $270
(Van Deun et al. 2010), only one-tenth of the cost of drugs in the
WHO-recommended regimen (Floyd et al. 2012). Given these and other benefits
to patients and the health system of reducing treatment duration by half, STREAM
was designed as a NI trial. The primary efficacy outcome was a favorable status at
132 weeks, defined by cultures negative for Mycobacterium tuberculosis at
132 weeks and at a previous occasion, with no intervening positive culture or
previous unfavorable outcome. The margin of NI was 10%, with the primary
analysis being a calculation of the absolute risk difference in proportion favorable.
There were 424 participants randomized into the trial (Nunn et al. 2019), with twice
as many allocated to the intervention regimen in order to collect more safety data on
the intervention regimen. NI was demonstrated in both coprimary modified intention
to treat (mITT) and per protocol (PP) analysis populations. WHO guidelines for the
treatment of rifampicin-resistant TB were changed to include the STREAM regimen
while the trial was ongoing based on external observational cohort data but were
subsequently changed to remove this regimen as a recommended regimen, despite
the trial results, due to concerns with the injectable agent included in the regimen
(World Health Organization 2019).

Defining Margin of NI

The testing framework for NI requires specification of a NI margin, δ > 0. Ideally, this could be defined as the smallest clinically meaningful difference between the standard and investigational regimens based purely on subject matter grounds. However, this is difficult enough to define for planning a superiority trial and would seem to be even harder in the context of NI.
A major approach has been to define the NI margin by statistical criteria. The
approach aims to define a margin which meets two criteria: (i) that the trial would
establish that the investigational arm is superior to no treatment; and (ii) that is not
unacceptably worse than the control arm. These two objectives are met by using a
pair of margins which are typically referred to as the “M1” and “M2” margins,
respectively. Some regulatory guidance encourages that trials be adequately powered
to refute effects outside the M1 and M2 margins.

The 95/95 Method

Translating a difference between the investigational and control arms into a statement about the former's effect against no treatment requires a working estimate, θCP, of the control effectiveness compared to no treatment in the current trial. The 95/95 method (Rothmann and Tsou 2003) uses a meta-analysis of studies of the control regimen as the starting point for such an estimate – ideally randomized placebo-controlled trials of the control. From the meta-analysis, θCP is taken as the upper bound of a two-sided 95% confidence interval (in a setting where θCP > 0 indicates a treatment benefit of the control, the lower bound would be taken). Taking the value closer to the null than the point estimate, so-called discounting, introduces a conservatism. For example, the DISCOVER trial needed to estimate the effectiveness of F/TDF compared to placebo and used a meta-analysis of three placebo-controlled trials in similar populations to derive a meta-analysis of (log) relative hazards with upper bound of the 95% confidence interval θCP = −0.96.
The M1 margin is synonymous with a test of H0 : θEP = 0, which translates to H0 : θEC + θCP = 0. Thus, the M1 margin is a comparison of the treatment contrast between investigational and control against a null of −θCP. In the case of DISCOVER, the M1 margin would mean ruling out a log-relative hazard of F/TAF compared to F/TDF greater than 0.96 (HR > 2.62).
The M2 margin is derived as a tighter (more conservative) margin which ensures that some proportion of the control treatment effect (ρ : 0 < ρ < 1) is preserved by the investigational agent. This functions as a standard for how much worse the investigational agent can be relative to the effect of the control agent. The M2 margin is then a comparison against a null of −ρθCP. Note that this margin is closer to the null and thus requires more evidence to refute. The 95/95 method typically chooses 50% effect preservation, which corresponds to ρ = 0.5. In the case of DISCOVER, the M2 margin with 50% preservation requires ruling out a log-relative hazard of F/TAF compared to F/TDF greater than 0.96 × 0.5 = 0.48 (HR > 1.62).
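Restating the DISCOVER figures above as a worked calculation on the hazard ratio scale:

\[
\text{M1: } e^{0.96} \approx 2.62, \qquad \text{M2 (50\% preservation): } e^{0.96 \times 0.5} = e^{0.48} \approx 1.62 .
\]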
An alternative approach, known as the synthesis method, derives M1/M2 margins using the uncertainty in completed trials without applying discounting. Sample sizes for NI trials would be smaller than under the 95/95 method if there were no discounting and/or if trials were powered on the M1 margin alone.
Discounting is motivated as a hedge against: (i) selection of a nonoptimal control
therapy, (ii) changes in background treatment, and (iii) publication bias in the meta-
analysis. These are particular concerns in mature fields where many randomized
controlled trials have been conducted and where there are many estimates of the
effectiveness and many possible control comparators. While discounting is clearly
sensible in many settings, the 95% confidence interval has been shown to be highly
conservative (Sankoh 2008; Holmgren 1999; James Hung et al. 2003). It uses a
statistical criterion to handle an unquantifiable uncertainty about the control treat-
ment effect which would have been observed if the NI trial included a putative no
treatment arm. The effect preservation criterion, used to develop the M2 margin,
ensures that the conclusion of NI will only be made if a high proportion of the control
treatment effect is retained by the investigational regimen. The value of the "effect preservation" standard is further discussed in section "Effect Preservation in Determining the NI Margin."

Combination Therapies

The M1/M2 approach is further complicated when the intervention under evaluation
is not just a single agent but is a combination, as is increasingly common in many

disease areas. Where the candidate intervention and standard of care regimen have
common components, the calculation of M1 and derivation of the minimum margin
of NI will be less than if there are no common components.
An example of this is seen in tuberculosis which is usually treated with a
combination regimen with two drugs (rifampicin and isoniazid) given for 6 months
supplemented by two additional drugs (pyrazinamide and ethambutol) in the first
2 months. NI trials have been conducted to evaluate regimens that have one or two
drugs replaced with novel compounds and are given for shorter durations, com-
monly 4 months instead of 6.
The objective of these trials can be reframed as an evaluation of whether the effect of the new drug(s) has noninferior efficacy to the combined effect of the last 2 months of therapy and the effect of the drugs that were replaced, where each is added to a standard background therapy of the drugs that are common to both combination regimens in the first 4 months of treatment.
The FDA draft guidance for developing drugs for the treatment of pulmonary
tuberculosis (Food and Drug Administration Center for Drug Evaluation and
Research (CDER) 2013) provides a worked example where the effect of the last
2 months of therapy (for rifampicin-sensitive disease) is shown to be an absolute
difference of M1 (θCP) of 8.4% (95% CI 4.8%, 12.1%) from two previous trials of
4-month regimens, providing support for a margin of NI of 4.8% using this 95/95
approach for NI trials evaluating one- or two-drug substitution trials. A comparable
approach has been used to derive margins of 6% or 6.6% in recent drug-substitution
trials (Dorman et al. 2020; Gillespie et al. 2014; Jindani et al. 2014), using the
M1-type approach, without consideration of the M2.
In contrast, the M1 of the full effect of the entire standard of care regimen is more
like an absolute difference of 50–60% given the high effectiveness of 80–90% of the
standard 6-month regimen compared to an expected 30% cure from untreated
tuberculosis (Tiemersma et al. 2011). For this reason, in trials of new regimens
with only minor or no drugs in common with the standard of care, the M1/M2 approach can be used to justify margins of up to 12%, and consequently much smaller sample sizes, and still be described as preserving more than 75% (1 − 12%/50% = 0.76) of the treatment effect of the standard of care regimen (Tweed et al. 2021). This incongruity can discourage sponsors from including existing drugs in novel combination regimens that are more readily available with an established safety profile, in favor of only new drugs that are more expensive with less data on drug safety, often to the detriment of patients and health systems.

Public Health Clinical Criteria

In some contexts, the regulatory statistical criteria are not desirable or feasible, and
the margin of NI has been set on substantive grounds. For instance, the US FDA has
defined a SARS-CoV-2 vaccine to be noninferior (Food and Drug Administration
Center for Biologics Evaluation and Research (CBER) 2020) if θEC < 0.1 where the
parameters represent the vaccine efficacy of the new vaccine relative to a control.

This selection is based entirely on substantive grounds, in guidance which defines an acceptable vaccine efficacy (VE) as 0.50 with a study design that could rule out a VE of 0.30 or less.
Another example occurs in the STREAM trial. When the trial was started, WHO
guidelines for the treatment of rifampicin-resistant TB were based exclusively on
evidence from nonrandomized clinical studies (unfortunately, this is largely still the
case (World Health Organization 2019)), and it was therefore not possible to
construct the M1 from previous studies. A 10% margin in the absolute difference
was chosen based on discussions with trial investigators and clinicians and conse-
quently “considered to be an acceptable difference in efficacy, given the shorter
treatment duration” (Nunn et al. 2019). The regulatory guidance is somewhat
inconsistent in such settings. For example, ICH E10 states “The NI trial design is
appropriate and reliable only when the historical estimate of drug effect size can be
well supported by reference to the results of previous studies of the control drug”
(International Conference on Harmonisation of Technical Requirements for Regis-
tration of Pharmaceuticals For Human Use 2000) which seems to rule out a trial like
STREAM despite the urgent and obvious public health need for such a trial.
A third example of note is the BLISTER NI trial evaluating doxycycline as a
treatment of bullous pemphigoid compared to the much more toxic standard of care
of prednisolone. To derive the margin of NI, the investigators conducted “a survey of
dermatologists participating in the UK Dermatology Clinical Trials Network, where
participants were asked their opinion on various scenarios of possible gains in
safety” (Bratton et al. 2012). The chosen margin of 37% is large and “reflects the
fact that the majority of dermatologists would accept a substantial reduction in
treatment efficacy in exchange for a significant reduction in long-term adverse
events, including mortality” (Bratton et al. 2012).

Design and Analysis

Sample Size

Sample size formulae for NI trials are similar to those for superiority trials (see ▶ Chap. 41, "Power and Sample Size"), with a few differences. In addition to specifying the event rate (for a time-to-event end point) or proportion of events (for a binary end point) expected in the control arm, a key difference is that, instead of specifying a minimum clinically important difference between arms that the trial will be powered to detect, one must specify both the margin of NI and the expected event rate, or proportion of events, in the intervention arm. It is usually assumed that this event rate is the same as in the control arm (namely, that both arms have truly comparable efficacy). If there is compelling evidence to believe that the intervention arm will have slightly better efficacy than the control, then this will result in a smaller sample size, as was assumed in the STREAM trial, although investigators then run the risk that the trial will be underpowered if this assumption is incorrect. On the other hand, it might be prudent to assume that the intervention arm will have slightly lower efficacy than the control (although within the acceptable margin of NI), although the disadvantage is that this will greatly inflate the sample size. Considerations for the choice of type I error rate and power are the same for NI trials as for superiority trials.
choice of type I error rate and power are the same for NI trials as for superiority trials.
An oft repeated myth is that NI trials are larger than superiority trials. As a broad
statement, this is incorrect – NI trials can be larger or smaller than comparable
superiority trials, depending on the sample size assumptions. It is, however, true that
the sample size of a trial designed to show superiority to placebo, via the indirect
comparison in a NI trial design, is always larger than a superiority trial comparing the
intervention directly with placebo as noted below in section “Effect Preservation in
Determining the NI Margin.”
For a trial comparing proportions, the most commonly used formulae for sample size calculations come from Farrington and Manning (1990) (using their formula based on "maximum likelihood," which is more accurate than the approximate formula based on "observed values"); this is implemented in many statistical software packages for sample size calculations and is used in the fourth edition of a popular sample size formulae textbook (Machin et al. 2018), although earlier editions used the less accurate approximate formula.
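As a rough illustration of the structure of such a calculation, the sketch below uses the simpler normal-approximation ("observed values") formula for a risk-difference margin; the Farrington–Manning "maximum likelihood" version recommended above instead evaluates the variance at constrained maximum likelihood estimates and gives slightly different numbers. The function name and all numeric inputs are illustrative assumptions, not values from any trial discussed here:

    import math
    from scipy.stats import norm

    def ni_sample_size_per_arm(p_c, p_e, delta, alpha=0.025, power=0.90):
        """Approximate per-arm sample size for a NI trial with a binary
        outcome and risk-difference margin delta (> 0), assuming the
        assumed true difference p_e - p_c is smaller than delta.
        Uses the simple ("observed values") normal approximation."""
        z_alpha = norm.ppf(1 - alpha)  # one-sided significance level
        z_beta = norm.ppf(power)
        variance = p_e * (1 - p_e) + p_c * (1 - p_c)
        # Power is evaluated at the assumed true difference (p_e - p_c);
        # with equal assumed efficacy this term reduces to delta alone
        n = variance * ((z_alpha + z_beta) / (delta - (p_e - p_c))) ** 2
        return math.ceil(n)

    # Example: 80% success on both arms, 10% margin, one-sided alpha 0.025,
    # 90% power: roughly 337 participants per arm
    print(ni_sample_size_per_arm(p_c=0.80, p_e=0.80, delta=0.10))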

How to Design a NI Trial

After careful selection and justification of the margin of NI (section "Defining Margin of NI") and consideration of sample size requirements (section "Sample Size"), there are several further design aspects that must be addressed. Many of the considerations informing design aspects of NI trials are the same as those for superiority trials, including level of blinding, choice of sites, and length of follow-up (see Chapters 4.5, 2.2, and 2.11). Considerations regarding interim analyses are addressed in section "Interim Analyses and Data and Safety Monitoring."
In recognition of the particular complexities in NI trials, many NI design consid-
erations are addressed in the myriad guideline documents developed by regulators
and other international groups specific to NI trials.

Regulatory Guidelines
The ICH efficacy guidelines (numbered E1 through E20) provide guidelines on
various aspects of the design, conduct, and reporting of clinical trials, a selection of
which specifically addresses NI trials. Of note, many of the documents were
finalized more than 20 years ago when NI trials were less common and consequently
less well described and understood.
Aside from ICH E3 "Clinical Study Reports" (finalized November 1995), which briefly notes that the evidence in support of assay sensitivity is important in the NI clinical study report, ICH E9 "Statistical Principles for Clinical Trials" (finalized February 1998) addresses NI trials in several areas. The document notes "well known difficulties" associated with NI trials, which include the "lack of any measure of internal validity. . . thus making external validation necessary," and that "many flaws in the design and conduct of the trial will tend to bias the results towards a conclusion of equivalence." A major source of controversy in the interpretation of NI trials is the choice of a less effective control regimen that can maximize differences between the arms and increase the chance of showing NI, while complicating the assessment of the relative benefits of the investigational regimen versus an optimal control regimen.
The document states that the active control should ideally be a “widely used therapy
whose efficacy. . . has been clearly established and quantified in well designed and
well documented superiority trial(s)” and notes that the new NI trial “should have the
same important design features” as these previous superiority trial(s).
The document stresses that the trial protocol should “contain a clear statement
that this [NI] is its explicit intention” and should specify the margin of NI which
should be “justified clinically.” The document recognizes that, while the full analysis
set should be primary for a superiority trial, it is “generally not conservative” in a NI
trial and therefore “its role should be considered very carefully” as it “may be biased
towards demonstrating equivalence [NI]” in the presence of participants that with-
draw or are lost to follow-up.
In addition to adopting the ICH guidelines described above, the US FDA has
published the guidance for industry document “NI Clinical Trials to Establish
Effectiveness” (Food and Drug Administration Center for Drug Evaluation and
Research (CDER) 2016) (finalized November 2016), and the EMA committee for
medicinal products for human use (CHMP) has published the “Guidelines on the
choice of the NI margin” (Committee for Proprietary Medicinal Products 2006)
(implemented January 2006); other international regulatory agencies also have other
guidance documents on NI trials.
Both documents from the US FDA and the EMA provide broad guidance on the
design and conduct of NI trials. It is notable that the FDA document is 56 pages as compared to the 11 pages of the EMA document, although it was also finalized 10 years later and likely reflects the increased knowledge of, and controversy surrounding, these trials. Both documents state that the margin of NI should be justified
on statistical and clinical grounds, and the FDA document provides extensive
guidance on the former.

Other Guidelines
The CONSORT statement was developed to improve the quality and adequacy of
reporting of the results of randomized clinical trials and has undergone regular
updates in addition to extensions to specific types of trials, including NI trials in
2006 (Piaggio et al. 2006) and most recently in 2012 (Piaggio et al. 2012). This
CONSORT statement provides guidelines exclusively for the reporting of NI, and compliance is broadly required by most major medical journals prior to publication of trial results (http://www.icmje.org/recommendations/browse/manuscript-preparation/preparing-for-submission.html). It was noted that the reporting of NI trials is particularly poor (Piaggio et al. 2006), and more recent reviews have also come to the same conclusion (Rehal et al. 2016).
Key aspects of trial reporting of NI trials in the CONSORT statement include
particular rationale for the NI design, statement and justification of the margin of NI,
description of how eligibility criteria and choice of control compare to previous

superiority trials that established efficacy of the control, and clear description of
which among the primary and secondary efficacy and safety outcomes have NI
hypotheses and which have superiority hypotheses.
Other widely accepted guidelines of note are the SPIRIT statement defining
standard protocol items for clinical trials (Chan et al. 2013) and the guidance
document for the content of statistical analysis plans for clinical trials (Gamble
et al. 2017). Extensions of the SPIRIT guidelines for certain types of trials have
been developed, but there is, currently, no extension specifically for NI trials – this is
clearly a document that should be developed, if development is not already under-
way. Neither SPIRIT nor SAP guidance documents directly address NI trials beyond
instructions in the elaboration documents stating that the protocol and the SAP for a
NI trial should describe the framework (superiority or NI) for the primary and
secondary outcomes.
The EQUATOR network (https://www.equator-network.org/) provides an online repository of guidelines and reference documents related to the reporting of health research. A search of their database (May 2021) for the words "inferiority" or "equivalence" yields only the CONSORT extension for NI trials (described above).
There are three textbooks addressing the methodology of the design, conduct, and
analysis of NI trials published in 2010 (Wellek 2010), 2012 (Rothmann et al. 2012),
and 2015 (Ng 2015), curiously all by the same publisher; we are not aware of others.
Additional review articles relating to NI trials include a guidance document on how
to handle NI trials in the context of systematic reviews (Treadwell et al. 2012) and
some general guidelines on the reporting of NI trials that predate the CONSORT
extension (Gomberg-Maitland et al. 2003).

Analysis

In general, the methods of analysis for NI trials do not depart significantly from those
used for superiority trials. The approach is to calculate a point estimate and confidence
interval for the treatment effect, using methods appropriate to the type of outcome and
study objectives, at the desired level of significance, and ensuring that the analysis
reflects the trial design. Figure 1 shows a plot of confidence intervals against a margin
of NI to demonstrate different outcomes of NI clinical trials, with the upper bound
denoted by a square since this is the bound that is the focus of the hypothesis tests. In
a superiority trial, if the upper
bound of the confidence interval of the treatment effect is lower than the null value
(0.0 for a difference or log ratio), then this is evidence for superiority; in a NI trial, if
the upper bound of the confidence interval is less than the margin of NI, then this is
evidence for NI. If this condition is not met for a NI trial, the conclusion must be that
there is no evidence of NI, that is, the investigational arm is not noninferior. This is a
somewhat confusing double negative which is sometimes wrongly interpreted as
evidence of inferiority. No evidence of NI is comparable to the situation in a
superiority trial with no evidence of superiority, which is not the same as evidence
that the two arms under comparison are equivalent. It is universally true across
superiority and NI trials that absence of evidence should never be interpreted as
evidence of absence.

Fig. 1 Examples of potential outcomes from NI trials labeled with interpretation. The upper bound
is denoted by a square to show that this is the bound used for determination of NI
In the STREAM trial, the upper bounds of the 95% confidence intervals for the
absolute difference in the proportion with a favorable status in the coprimary mITT
and PP analysis populations were 9.5% and 9.1%, respectively, both lower than the
NI margin of 10%, therefore leading to a conclusion of NI. In the DISCOVER trial,
the upper bound of the 95% confidence interval for the HIV incidence rate ratio was
1.15, which was lower than the NI margin of 1.62, therefore also leading to a
conclusion of NI.
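
This decision rule is straightforward to compute. The sketch below (Python, using a simple Wald interval for a difference in proportions of an unfavorable outcome; the numbers are illustrative and not taken from either trial) returns the estimate, the confidence interval, and the NI conclusion:

    from math import sqrt

    def ni_decision(events_new, n_new, events_ctl, n_ctl, margin, z=1.96):
        """Wald CI for a difference in proportions; NI if the upper bound < margin."""
        p_new, p_ctl = events_new / n_new, events_ctl / n_ctl
        diff = p_new - p_ctl  # unfavorable outcome, so larger differences are worse
        se = sqrt(p_new * (1 - p_new) / n_new + p_ctl * (1 - p_ctl) / n_ctl)
        ci = (diff - z * se, diff + z * se)
        verdict = "NI demonstrated" if ci[1] < margin else "no evidence of NI"
        return diff, ci, verdict

    # Illustrative: 60/400 vs 50/400 unfavorable outcomes, margin of 10 percentage points
    print(ni_decision(60, 400, 50, 400, margin=0.10))

Here the upper bound (about 0.073) lies below the 0.10 margin, so NI would be concluded; had the upper bound exceeded the margin, the only valid conclusion would be no evidence of NI.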

Inferiority and NI
Although the conclusion of NI relates only to the upper bound of the confidence
interval (denoted with a square in Fig. 1), since the alternative hypothesis is
one-sided, the role of the lower bound of the confidence interval can nevertheless be
a source of confusion. The interpretation of the results of a NI trial needs careful
consideration when the lower bound of the confidence interval exceeds zero. This
would normally be interpreted as evidence that the intervention has inferior efficacy
as compared to the control, although this interpretation is inappropriate in a NI trial.
The margin of NI defines an acceptable margin of reduction in efficacy, and so
the interpretation must be no evidence of NI if the upper bound is not less than the
margin (scenario G), or actual evidence of NI if the upper bound is less than the
margin (scenario D).
This latter case is somewhat paradoxical and rare, although it is sometimes
observed. An interesting example is provided by the BLISTER trial (Williams
et al. 2017) (described in section “Public Health Clinical Criteria”) where the
difference in the primary outcome was 18.6% (90% CI 11.1%, 26.1%) and 18.7%
(90% CI 9.8%, 27.6%) in the modified intention to treat and per protocol analyses,
respectively. A large margin of NI of 37%, coupled with the finding of clear evidence
of improved safety, led the investigators to appropriately conclude that the interven-
tion was noninferior to standard treatment.
Strictly speaking, a conclusion of inferiority can only be made if the lower bound
exceeds the margin of NI (scenario H). To avoid this confusion regarding the lower
bound, confidence intervals for the results of NI trials are sometimes presented as
one-sided confidence intervals; see this approach in JAMA, where the direction of
the comparison is switched relative to our example (Kaji and Lewis 2015). ICH
E9 also recommends “only the lower margin [upper bound using our convention] is
needed for the active control NI trial” (International Conference on Harmonisation of
Technical Requirements for Registration of Pharmaceuticals For Human Use 1998).

Choice of Analysis Populations and Estimands

For a superiority trial, it is widely recommended that the primary analysis should
include all randomized participants in the treatment groups to which they were
allocated; this is regarded as an “intention-to-treat” (ITT) analysis (International
Conference on Harmonisation of Technical Requirements for Registration of Pharma-
ceuticals For Human Use 1998). This “Full Analysis Set,” as it is also sometimes
described, is preferred for superiority trials not only as it yields “estimates of treatment
effects which are more likely to mirror those observed in subsequent practice”
(International Conference on Harmonisation of Technical Requirements for Registra-
tion of Pharmaceuticals For Human Use 1998) but also because it provides a conser-
vative or protective analysis strategy whereby misclassification of outcomes from
participants who have had protocol violations is likely to dilute the treatment effect,
thereby reducing the chance of falsely demonstrating superiority. For exactly this
reason, this ITT analysis set may actually increase the chance of demonstrating NI
and is therefore not uniformly accepted as the default choice for the primary analysis
for a NI trial, or as noted in ICH E9 (R1): “its role in such [NI] trials should be
considered very seriously” (International Conference on Harmonisation of Technical
Requirements for Registration of Pharmaceuticals For Human Use 2019).
An alternative analysis population is the modified-ITT (mITT) population with
limited exclusions of randomized participants, usually those that violated eligibility
criteria but were erroneously randomized, provided entry criteria were measured
prior to randomization and all participants recruited undergo equal scrutiny for
eligibility violations (International Conference on Harmonisation of Technical
Requirements for Registration of Pharmaceuticals For Human Use 1998). A more
common alternative is the "Per-Protocol" (PP) population, where participants who did
not adequately adhere to the treatments under evaluation or other important aspects
of the trial protocol are excluded from the analysis. This is sometimes described as
an “As-treated” analysis population, although the latter also implies the additional
criterion of analyzing participants according to the treatment they actually received.
How a PP analysis is defined varies greatly between guidelines and published NI
trials (Rehal et al. 2016), and some recommend a limited interpretation and put an
emphasis on causal inference methodology for analysis to overcome limitations of
postrandomization exclusions (Hernan and Robins 2017). A full discussion of
different analysis sets for clinical trials is outside the scope of this chapter; readers
should look at chapter 7.2.
In the past, many have recommended that the (m)ITT and PP analysis populations
should be coprimary (Piaggio et al. 2006; International Conference on Harmonisation
of Technical Requirements for Registration of Pharmaceuticals For Human Use 1998;
Jones et al. 1996; D’Agostino et al. 2003; Committee for Proprietary Medicinal
Products 2002) for NI trials such that it is necessary to demonstrate NI in both analysis
populations in order to declare NI of the regimen. A more recent commentary also
supports this approach (Mauri and D’Agostino 2017). Other authors, however, rec-
ommend relegating a PP analysis to a secondary analysis (Wiens and Zhao 2007).
There is no mention, for example, of a PP analysis in the 2016 FDA guidance on NI
(Food and Drug Administration Center for Drug Evaluation and Research (CDER)
2016), although an “as-treated” analysis had been included in the earlier 2010 draft.
Much of the discussion of different analysis populations has been replaced by the
emphasis on a clear specification of the estimand of interest in the ICH E9
(R1) Addendum (International Conference on Harmonisation of Technical Require-
ments for Registration of Pharmaceuticals For Human Use 2019) (see ▶ Chap. 84,
“Estimands and Sensitivity Analyses”) which includes attributes specifying choice of
analysis populations and also how intercurrent events (events such as treatment
switching or discontinuation that affect or prevent observation of the primary out-
come) are handled in analysis. In this regard, the ICH E9 (R1) addendum providing
regulatory guidelines on specification of estimands addresses this controversy (Inter-
national Conference on Harmonisation of Technical Requirements for Registration of
Pharmaceuticals For Human Use 2019): “estimands that are constructed with one or
more intercurrent events accounted for using the treatment policy strategy present
similar issues for NI and equivalence trials as those related to analysis of the Full
Analysis Set under the ITT principle.” The addendum also recognizes the importance
of the PP-type analyses: “An estimand can be constructed to target a treatment effect
that prioritizes sensitivity to detect differences between treatments, if appropriate for
regulatory decision making.”

Further Challenges Unique to NI

Assay Sensitivity

A concern with any NI trial is the concept of “assay sensitivity.” A trial is said to
have “assay sensitivity” if it would detect the inferiority of an investigational
intervention if it were truly inferior. This is an issue of trial conduct and aligning
the trial context with assumptions. For instance, in the DISCOVER trial, if
adherence to F/TDF was poor, then effectiveness of F/TDF would be expected to be
low. In that setting, similarity of HIV incidence between F/TAF and F/TDF would
not be evidence of substantial effectiveness of F/TAF.
Assay sensitivity also depends on trial conduct. Poor quality of conduct
and design of a NI trial can directly lead to the proportion of participants
experiencing the primary event of interest being estimated with error, thereby inducing
bias in the estimate of the treatment effect. Quality issues such as high rates of loss to
follow-up or low specificity of outcome assessment can result in underestimation of
the true proportion of events, while issues such as laboratory contamination,
unnecessary use of rescue medication, and poor treatment adherence can result in
overestimation of the true proportion of events. Even if these errors are equally
distributed between arms, such quality issues can adversely increase the chance of
falsely declaring NI (a Type I error). If the true treatment effect, the difference in
proportion of events between arms, is larger than the margin of noninferiority but
there is underestimation of the proportion of events in each arm, the observed
difference between arms will be less than the true difference, and there is a chance
of falsely demonstrating noninferiority if the confidence interval is sufficiently
narrow. This is in contrast to the chance of falsely declaring superiority which is
not increased in this scenario of quality issues that are equally distributed between
arms when an ITT analysis is used (White et al. 2012). For these reasons, quality of
trial conduct is even more important in a NI trial than in a superiority trial.
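
A small simulation makes this mechanism concrete. In the sketch below (Python; the sample size, event proportions, ascertainment fraction, and margin are all illustrative assumptions), the true difference in event proportions (0.12) exceeds the margin (0.10), so NI should not be declared, yet equal under-ascertainment of events in both arms sharply inflates the false NI rate:

    import numpy as np

    rng = np.random.default_rng(1)
    n, margin, z, n_sim = 500, 0.10, 1.96, 10_000
    p_ctl, p_new = 0.10, 0.22  # true difference 0.12 exceeds the 0.10 margin

    for detect in (1.0, 0.5):  # fraction of true events ascertained, equal in both arms
        false_ni = 0
        for _ in range(n_sim):
            o_ctl = rng.binomial(n, p_ctl * detect) / n  # observed event proportions
            o_new = rng.binomial(n, p_new * detect) / n
            se = np.sqrt(o_ctl * (1 - o_ctl) / n + o_new * (1 - o_new) / n)
            if (o_new - o_ctl) + z * se < margin:  # upper bound below the margin
                false_ni += 1
        print(f"ascertainment {detect:.0%}: false NI rate {false_ni / n_sim:.1%}")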
ICH E10 “Choice of control group and related issues in clinical trials” (finalized
July 2000) also addresses this issue of assay sensitivity. There must be consistency in
aspects of trial design between the current NI trial and historical trials evaluating the
active control to provide a “fair effectiveness comparison with the control” to
support assay sensitivity, and the document specifically notes the choice of dose,
patient population, and choice and timing of outcomes. Two approaches to collec-
tively determine assay sensitivity are proposed: (1) historical evidence of sensitivity
to drug effects and (2) appropriate trial conduct, the former being evaluated before
the trial starts (as part of the derivation of the margin, see section “Defining Margin
of NI” above) and the latter once the trial is completed showing that the study
population was similar to that in previous trials and that the trial was “conducted
with high quality (e.g., good compliance, few losses to follow-up).” The document
highlights a number of specific aspects of this "appropriate trial conduct" that can
dilute the observed difference between treatments, thus reducing the assay sensitivity
of the trial, including poor adherence to therapy, use of nonprotocol medications, a
selective participant population that has a lower response rate, poorly applied
diagnostic criteria, and conscious or unconscious underreporting of end points.

Effect Preservation in Determining the NI Margin

A major reason that the sample size of a NI trial can be large is the required tightness
of an M2 margin when the 95/95 method is used. Snapinn and Jiang (2008) gave an
insightful critique of the M2 (preservation of effect) criterion and constructed
hypothetical scenarios where an investigational therapy was truly more effective
than the standard of care. They demonstrated scenarios whereby the standard was
evaluated in a randomized trial compared to placebo and "approved"; however, a NI
trial of the investigational therapy, even though that therapy was truly more effective,
could not exclude the M2 margin, and the investigational therapy was therefore "not
approved." Simply by being second, the investigational therapy was judged to be a
failure even though it would have been approved if it was first in class. Hence, under
effect preservation, the new agent is held to a higher standard. Stronger evidence is
required for approval, and the NI trial with an M2 margin is necessarily larger than
the original superiority trials of the first agent.
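
For readers who want the arithmetic, the sketch below works through one common form of the 95/95 fixed-margin calculation on the hazard ratio scale, using an invented historical control-versus-placebo estimate (it is not the derivation used in any of the trials discussed here):

    import math

    # Hypothetical historical meta-analysis, control vs placebo: HR 0.60 (95% CI 0.48, 0.75)
    hr_ci_upper = 0.75                 # conservative bound of the control's benefit
    M1 = 1 / hr_ci_upper               # ~1.33: control effect that can be relied upon
    M2 = math.exp(0.5 * math.log(M1))  # preserve 50% of M1 on the log scale
    print(f"M1 = {M1:.2f}, NI margin M2 = {M2:.2f}")  # M2 ~ 1.15 for the new-vs-control HR

Because M2 is the square root of M1 on this scale, the margin tightens quickly as the reliably demonstrated control effect shrinks, which is one reason sample sizes under the 95/95 method can become very large.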
The rationale for effect preservation is compelling, however, in some situations.
Consider, for example, the situation when the relevant clinical question is “should
the standard of care change from use of the established control as the recommended
therapy to use of the investigational intervention as recommended therapy?” This
would be a setting in which the objective is to substitute one therapy for another. If
so, it is sensible to demand more from the new drug. For instance, the old drug might
have an extensive safety record, having been shown over time to be safe and
effective in unselected nontrial populations. It might be generic or about to go
generic (likely resulting in considerable reductions in cost). Substituting a new
drug implies losing some intangibles (demonstrated long-term safety, generics, and
the access they can bring), and it is sensible to demand an additional assurance that
not much efficacy is lost. If the overall clinical objective is to evaluate a substitute,
then effectiveness standards should be high.
In other cases, effect preservation would be considerably less relevant. For
instance, the FDA guidance suggests a placebo-controlled superiority trial among
those who have a contraindication to approved therapies, which implicitly requires
only an M1-type margin. However, in some cases, the new agent may more subtly
lead to a greater therapeutic impact because it is more acceptable through its
simplicity or convenience (a shorter course of TB treatment, for example).
An M2 margin, with its criterion of effect preservation, is a poor match
for preference-sensitive decisions. It would seem to be a poor fit for contexts where
many people who could benefit from the standard therapy cannot or will not use it.

Two Sides of the Same Coin: Superiority Versus NI

Some authors have pointed out that the distinction between NI and "superiority"
trials is artificial in many contexts. Dunn et al. (2018a) note that in nonregulatory
situations, it can be unclear which regimen is the "control." Even if one is identified,
the smallest clinically significant difference should govern the choice of the NI
margin and the alternative for superiority. This yields identical sample sizes for
both types of studies. Further, if the data analysis de-emphasizes null hypothesis
significance testing in favor of estimation of the treatment contrast with its associated
confidence interval, then the analysis and conclusions should be identical for NI and
superiority questions. Hence, in many active controlled trials, the distinctions
between NI and superiority are not evident. This critique is not strictly relevant when
a regulatory-derived margin is used or when one regimen has such clear ancillary
advantages that the comparison between the arms is greatly asymmetric.
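
The sample-size symmetry is easy to verify with the usual normal approximation, since only the distance between the null and alternative hypotheses and the assumed variance enter the formula. A sketch (the event proportion and the 0.10 distance are illustrative, and a common-variance approximation is used):

    from math import ceil
    from statistics import NormalDist

    z = NormalDist().inv_cdf

    def n_per_arm(p, distance, alpha=0.025, power=0.9):
        # distance = gap between null and alternative difference in proportions;
        # variance approximated with a common event proportion p in both arms
        return ceil((z(1 - alpha) + z(power)) ** 2 * 2 * p * (1 - p) / distance ** 2)

    # Superiority: H0 diff = 0, alternative diff = 0.10
    # NI:          H0 diff = 0.10 (the margin), alternative diff = 0
    print(n_per_arm(p=0.30, distance=0.10))  # identical n for both framings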

Testing for Noninferiority and Superiority in the Same Trial

Related to this issue is a document published by the EMA CHMP “Points to consider
on switching between superiority and NI” (Committee for Proprietary Medicinal
Products 2000) that is focused on the relationship between superiority and NI
hypotheses within a single trial. The document is clear that it is acceptable to test
for superiority in a NI trial: "there is no multiplicity argument that affects this
interpretation because ... it corresponds to a simple closed test procedure," with
the proviso that “the intention-to-treat principle is given greatest emphasis” in the
superiority analysis which is likely to be different for the NI comparison. In any case,
every analysis plan for a NI trial should include a plan for a superiority test.
The document also notes that it can be appropriate to test for NI in a superiority
trial where superiority has not been demonstrated, provided a margin of NI has been
prespecified in the protocol and that the trial was “properly designed and carried out
in accordance with the strict requirements of a NI trial.” This includes the notion that
“in a NI trial the full analysis set and the PP [per protocol] analysis set have equal
importance and their use should lead to similar conclusions for a robust interpretation,"
which is a departure from the FDA document, which recognizes the challenges
with the ITT analysis but does not go so far as to describe
a per protocol analysis as of equal importance (see section “Choice of Analysis
Populations and Estimands” above).
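
The resulting closed test hierarchy is simple in structure, as the sketch below shows (illustrative inputs; in practice, as noted above, the two hypotheses would place different emphasis on the ITT and PP analysis sets):

    def closed_test(upper_bound, margin):
        """Closed test: NI first; only if NI is met, test superiority at the same level."""
        if upper_bound >= margin:
            return "no evidence of NI"
        if upper_bound < 0.0:  # interval also excludes the null value
            return "NI demonstrated; superiority also demonstrated"
        return "NI demonstrated; superiority not demonstrated"

    print(closed_test(upper_bound=-0.02, margin=0.10))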

Sensitivity of Trial Results to Arbitrary Margin and Control Arm Event Rate

A complication of NI trials is the centrality of the NI margin, which must be
prespecified before the trial starts, to inference that occurs after the trial has been
completed. For instance, trials with the same control group may derive different
margins. For example, a NI trial of an injectable PrEP agent (cabotegravir) versus
F/TDF used the 95/95 method to justify a margin of 1.23 hazard ratio for
cabotegravir versus F/TDF. In the same population, the DISCOVER trial specified
a much wider margin of 1.62. It is awkward to have trials of products for the same
population and indication yet with differing NI margins. Meeting the prespecified
margin is often required by regulators, and large inconsistencies in margins give the
product with a wider margin a greater chance at regulatory approval. Large
inconsistencies also contribute to confusion about the meaning of and standards for NI
among colleagues who are not deeply immersed in NI methods. By the 95/95 paradigm, a
different margin could be justified in the DISCOVER trial since a stronger control
efficacy was expected than in the other trial. This illustrates that fuller evaluation of
the evidence for NI should consider a variety of factors, including consideration of
margins used in similar trials.
Further complications can arise when the observed proportion of events in the
control arm deviates from that assumed in sample size calculations. This will
severely affect the power of both superiority and NI trials. Such a scenario can
also severely affect the interpretation of the results of a NI trial and increase the
risk of a false positive (Type I) error since the margin of NI is defined in relation
to the expected proportion of events in the control arm (see section “Defining
Margin of NI” above). For example, in the ISAR-safe trial, fewer events were
observed on the control arm than expected providing supposed strong evidence
for NI based on the prespecified margin (on the absolute difference in pro-
portions scale), but a NI conclusion was not accepted by the investigators
(Mauri and D’Agostino 2017). This limitation with a fixed margin of NI has
been recognized, and aside from Bayesian methods described below, alternative
approaches for a flexible margin of NI have been proposed including the use NI
frontier (Quartagno et al. 2020). This frontier is “a curve defining the most
appropriate NI margin for each possible value of control event risk” which is
proposed as a fixed arcsine difference frontier which is “power-stabilizing” and
has good properties particularly in that its asymptotic variance is independent of
the control arm event rate.
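
To make the frontier concrete, the sketch below maps a fixed margin on the arcsine-square-root scale back to the implied margin on the absolute risk difference scale for a range of control event risks (the value 0.08 for the arcsine-scale margin is purely illustrative):

    import math

    def implied_absolute_margin(p_control, delta=0.08):
        # largest acceptable new-arm risk when the frontier is fixed on the arcsine scale
        boundary = math.sin(math.asin(math.sqrt(p_control)) + delta) ** 2
        return boundary - p_control

    for p in (0.02, 0.05, 0.10, 0.20):
        print(f"control risk {p:.2f}: implied absolute margin {implied_absolute_margin(p):.3f}")

The implied absolute margin widens as the control event risk increases, which is precisely the adaptive behavior a frontier is intended to provide.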

Justification of Margin in Practice

Reports of NI trials frequently lack a rigorous justification for their conclusions.
A systematic review of NI trial reporting in high-impact medical journals (Rehal
et al. 2016) found that nearly half presented no justification for the NI margin, and
margins frequently were not chosen to show that the investigational regimen was
effective compared to no treatment and rarely ensured at least 50% preservation
of the control arm treatment effect (Tsui et al. 2019). Additional frequent problems
of the control arm treatment effect (Tsui et al. 2019). Additional frequent problems
include a lack of clarity on the type I error of the NI comparison and its direction and
whether results are consistent between ITT and PP analyses (Aberegg et al. 2018).
Most surprisingly, in 11% of cases, no clear secondary benefit of the investigational
regimen was discernible or reported.
As recognized by the extension to the CONSORT guidelines specifically for NI
trials (Piaggio et al. 2012), the special considerations of NI trials compel the clear
reporting of methods and justification for the margin, type I error rate, and primary
analysis. In addition, reports should include a rationale for the NI design and the effect
of sensitivity analyses such as the mITT and/or per protocol analysis.

Interim Analyses and Data and Safety Monitoring

In randomized clinical trials of untested interventions, an independent Data
Monitoring Committee (DMC) will review the results of interim analyses of safety and
efficacy data at regular intervals (see ▶ Chap. 37, “Data and Safety Monitoring
and Reporting”). Aside from adaptive trial designs where any number of features of a
clinical trial could be modified, and review of data quality and trial procedures,
a main task of the DMC will be to recommend whether the trial can continue
or whether it should be stopped before the scheduled end. This latter recommenda-
tion is usually only made when there is sufficient evidence for one of the
following: (1) unacceptable harm to trial participants, (2) overwhelming benefit for
one arm, or (3) lack of benefit of the investigational intervention. In a NI trial,
the consideration of stopping guidelines should be different from those for superiority
trials.
It is unlikely to be appropriate to stop a NI trial early for overwhelming evidence
of NI since the margin of NI is somewhat subjective, and it would normally be better
to continue the trial to get a better estimate of the treatment effect and also to
determine whether the intervention might actually be superior. For this reason, a
superiority comparison is recommended when evaluating evidence for overwhelm-
ing benefit, and one might consider a conditional power approach (Bratton et al.
2012), even in a NI trial (Korn and Freidlin 2018).
It will also be inappropriate to stop a NI trial for lack of benefit since lack
of benefit may still be consistent with a finding of NI. When evaluating
evidence for lack of benefit in a NI trial, the comparison should be against
the margin of NI (effectively evaluating sufficient lack of evidence for NI)
rather than against a null finding as would be usual in a superiority trial
(Bratton et al. 2012).
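
As a simplified illustration of interim monitoring in this setting, conditional power for ultimately demonstrating NI can be computed from the interim statistic testing against the margin, using a Brownian-motion approximation under the current trend (a sketch only; it does not reproduce the cited methods in detail):

    from statistics import NormalDist

    nd = NormalDist()

    def conditional_power_ni(z_interim, info_frac, alpha=0.025):
        # z_interim tests against the NI margin (not against zero difference)
        b_t = z_interim * info_frac ** 0.5  # Brownian motion value at information time t
        drift = b_t / info_frac             # current-trend estimate of the drift
        z_crit = nd.inv_cdf(1 - alpha)
        return 1 - nd.cdf((z_crit - b_t - drift * (1 - info_frac)) / (1 - info_frac) ** 0.5)

    print(f"{conditional_power_ni(z_interim=1.0, info_frac=0.5):.2f}")  # ~0.22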

Alternative Analyses and Designs and Innovative Perspectives on NI Trials

Due to the complexities and challenges in design, conduct, and interpretation of NI
trials, several alternative designs or new analyses have been proposed in recent
years. We highlight important developments here.

Bayesian Approaches to NI

Bayesian approaches are particularly compelling for NI trials and have been adapted
in a variety of ways (Simon 1999; Gamalo-Siebers et al. 2016).
In any NI study, margins are derived directly or indirectly based on historical data.
A Bayesian framework provides an intuitive and formal way to incorporate these
data through prior distributions. A key source of uncertainty is the value of θCP, about
which there might be expert opinion or preliminary data, an ideal setting for the use
of prior distributions. Bayesian methods can allow for discounting historical data
through shifting the prior toward a null distribution by using skeptical prior distri-
butions (Kirby et al. 2012) or through power priors (Ibrahim and Chen 2000), which
allow the prior to depend on the historical data through a flexible
weighting index. A lack of constancy can be handled by explicitly modeling
variation in the standard treatment effect by using a hyperprior on heterogeneity of
treatment effects (Neuenschwander et al. 2010).
This framework allows considerable flexibility in its use of prior information
but also allows more flexible statements about treatment effects through the
posterior distributions after the trial has been completed. For instance, what is
the posterior probability that the investigational regimen is superior to no treatment
or to the standard treatment (Spiegelhalter et al. 1994)? Similarly, the posterior
probability that the difference between the treatment and the standard exceeds a
given margin can be calculated (as was done as a secondary analysis in the
STREAM trial (Nunn et al. 2019), see supplementary appendix). This allows for
a richer and more nuanced examination than just the binary conclusion of the
frequentist fixed-margin paradigm, where the confidence interval either falls within the
NI margin or does not.
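
A minimal sketch of such a posterior statement, using a normal approximation for the estimated treatment difference and a skeptical normal prior centered at zero difference (the prior standard deviation, estimate, and margin are illustrative):

    from statistics import NormalDist

    def posterior_prob_ni(d_hat, se, margin, prior_sd=0.10):
        # precision-weighted normal-normal update; prior mean is zero difference
        w = (1 / prior_sd**2) / (1 / prior_sd**2 + 1 / se**2)  # weight on the prior mean
        post_mean = (1 - w) * d_hat
        post_sd = (1 / prior_sd**2 + 1 / se**2) ** -0.5
        return NormalDist(post_mean, post_sd).cdf(margin)      # P(true difference < margin)

    print(f"P(NI) = {posterior_prob_ni(d_hat=0.03, se=0.025, margin=0.10):.3f}")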

Trial Designs to Evaluate Different Treatment Durations

The evocatively named DOOR/RADAR approach (Evans et al. 2015) was proposed
as “a new paradigm in assessing the risks and benefits of new strategies to optimize
antibiotic use,” particularly to avoid the “complexities of non-inferiority trials” and
consequent large sample sizes when evaluating whether the duration of antibiotic use
can be reduced without a reduction in effectiveness. The idea is to prespecify an
ordinal clinical outcome that combines measures of efficacy and safety and then rank
trial participants by this clinical outcome measure, where those with a similar clinical
outcome are ordered by duration of antibiotic use with shorter duration given a
higher rank. This “Desirability of outcome ranking (DOOR)” is compared between
different antibiotic strategies in a “Response adjusted for duration of antibiotic risk
(RADAR)” superiority comparison, and mean sample sizes needed are much smaller
than comparable trials powered for NI.
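
At its core, the DOOR comparison is a pairwise ranking across arms, with ties in the clinical outcome broken by treatment duration. A minimal sketch (toy data; outcome codes and durations are invented for illustration):

    def door_probability(new, control):
        """Estimate P(better DOOR, new vs control). Each participant is a tuple
        (outcome_rank, duration_days), lower being better on both; ties count half."""
        wins = ties = 0
        for a in new:
            for b in control:
                if a < b:        # tuple order: clinical outcome first, then duration
                    wins += 1
                elif a == b:
                    ties += 1
        return (wins + 0.5 * ties) / (len(new) * len(control))

    new = [(1, 7), (1, 7), (2, 7), (1, 7), (3, 7)]           # short-duration strategy
    control = [(1, 14), (2, 14), (1, 14), (2, 14), (1, 14)]  # standard duration
    print(f"P(better DOOR) = {door_probability(new, control):.2f}")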
There has been some uptake of this methodology, but it suffers from replacing the
complexities of NI trials with a host of new complexities and a number of substantial
limitations (Phillips et al. 2016). These include the introduction of a new metric,
"the probability of a better DOOR for a randomly selected participant," with no clinical
interpretation; the same tendency as in NI trials for poor quality to increase the
chance of a false positive; and the obscuring of important clinical differences if an
important clinical outcome occurs in only a few trial participants. This latter point
was illustrated by applying DOOR/RADAR to the results of three NI comparisons
from two TB trials that resulted in conclusions of an absence of NI (that were widely
accepted by the clinical community) yet counterintuitively showed clear superiority
in DOOR with p < 0.001 in each case.
A more attractive alternative design for trials evaluating different durations of
therapy involves explicit modeling of the duration-response relationship and selec-
tion of the duration that achieves the desired cure proportion (Quartagno et al. 2018;
Horsburgh et al. 2013).
Three-Arm NI Design

One design that can overcome many of the challenges inherent in NI trials is a three-
arm trial in which a single investigational intervention arm is compared to an active
control intervention in a NI comparison and to a no-treatment control
in a superiority comparison. This three-arm trial allows for simultaneous demon-
stration that the investigational intervention is superior to no treatment and is
noninferior to the active control standard of care. This trial design is not possible
in many settings with established treatments where it would be inappropriate to
withhold treatment.
A risk with NI trials designed after a first investigational intervention has been
shown to be noninferior is the phenomenon of biocreep (D'Agostino et al.
2003; Nunn et al. 2008): if each subsequent successful NI trial results in an
investigational intervention "not much worse" than the previous one, the end result
can be an intervention that is considerably worse than the original standard of care
control and consequently not much better than placebo. A three-arm design is useful to
avoid the problem of biocreep. A three-arm NI trial of a second investigational
intervention would include both the first intervention shown to be noninferior and
the original standard of care control. The objective would be to demonstrate
NI of the second investigational intervention compared to the original standard of care
control, also allowing an internal randomized comparison between the two investiga-
tional interventions to support decision-making and facilitate informed patient choice.

Pragmatic Superiority Strategy Trial

If the investigational intervention has expected benefits that compensate for a lack of
improvement in efficacy, an alternative to conducting the much-maligned NI trial is to conduct a
pragmatic trial to evaluate the strategy of implementing the investigational interven-
tion. This pragmatic strategy trial would be designed to evaluate superiority in patient-
relevant outcomes in the intended clinical setting of the intervention, thereby avoiding
the limitations with NI (defining the margin, assay sensitivity, etc.) altogether. This
approach is likely to result in a larger sample size due to the heterogeneity in trial
participants and outcomes introduced by the pragmatic design elements, but the
additional cost of a larger trial may be offset by reduced complexities in a pragmatic
design. Where an investigational intervention has limited safety data, this sort of
design is likely to be less appropriate.
As an example of a potential pragmatic design, consider the BLISTER trial
described above. When the trial was designed, it was recognized that the investiga-
tional intervention could be used in future clinical practice as a first-line therapy to be
followed by the standard of care steroid treatment as second-line therapy, even if it
was less effective than steroid treatment, without much additional risk to the patient
due to its better safety profile. A pragmatic strategy trial that more directly addresses
this question than the NI design could be a two-arm trial comparing the strategy of
doxycycline as first-line therapy followed by steroid as rescue medication
with the alternative standard of care strategy of steroid medication as first-line
therapy, and evaluating superiority on a major clinical outcome such as death or
severe life-threatening adverse events.
An implicit assumption in the objective of shortening treatment for tuberculosis
(in the STREAM trial, for example) is that a shorter duration of
treatment results in a variety of benefits to the patient, the health system, and the
community. A follow-up trial to any treatment shortening TB trial to evaluate this
could therefore be a pragmatic strategy trial to compare the implementation of a short
regimen with the standard of care longer regimen to determine whether there are
community-level benefits such as reduced TB-related (or all-cause) mortality and
reduced health system and/or patient costs. Such a trial may need to be designed as a
cluster-randomized or stepped-wedge trial to properly account for the pragmatic
nature of the strategy, but randomization would be important, as compared to an
interventional cohort study, to improve the strength of evidence generated. The
BEAT Tuberculosis trial (https://clinicaltrials.gov/ct2/show/NCT04062201) is a
pragmatic superiority strategy trial evaluating the strategy of treatment shortening
coupled with standardized regimens based on drug-resistance profile, to see whether
the investigational strategy improves TB treatment outcomes among individuals with
a variety of types of drug-resistant TB in South Africa.

Averted Infections Ratio

In the regulatory framework, the M2 margin plays a major role. The M2 margin is
based on some fraction, typically 50%, of effect preservation. While this may seem
conceptually straightforward, measures of effect preservation involve subtle choices;
they depend on the estimands for the control and investigational arms (whether
multiplicative or additive) as well as on the scale on which effect preservation is judged.
Recent work in the HIV prevention context (Dunn et al. 2018b) has shown this scale
dependency. This work has reinforced that effect preservation is based on a “meta
estimand” and that in some contexts novel meta estimands are both useful and
interpretable, providing information beyond whether an M2 margin has been crossed.
In the context of HIV prevention trials, an alternative measure of effectiveness, the
Averted Infections Ratio (AIR), has been proposed based on the comparison of the
number of averted infections between treatment arms (Dunn et al. 2018b). This
measure is simple to interpret with both clinical and public health relevance and
overcomes limitations of scale dependency.
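
In its simplest form, the AIR compares the infections averted by the experimental regimen with those averted by the active control, each measured against a counterfactual placebo incidence that must be supplied from external data. A sketch with illustrative incidence rates per 100 person-years:

    def averted_infections_ratio(rate_exp, rate_ctl, rate_placebo):
        # infections averted by each regimen relative to the counterfactual placebo rate
        return (rate_placebo - rate_exp) / (rate_placebo - rate_ctl)

    print(f"AIR = {averted_infections_ratio(rate_exp=1.3, rate_ctl=1.1, rate_placebo=4.5):.2f}")

An AIR close to 1 indicates that the experimental regimen averts nearly as many infections as the control, which is the sense in which the measure sidesteps the scale dependency described above.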

Conclusions and Recommendations for Design/Conduct/Reporting

NI trials are most compelling when there is an investigational intervention which has
theoretical or known advantages over a control intervention. These might include
safety, simplicity, cost, or other desirable features for patients or health systems. The
key issue in the NI design is that the investigational intervention is a compelling choice
provided it is not much worse than the control intervention. The NI margin quantifies
this magnitude and is the key design parameter in NI trials. The choice of margin should
be prespecified and transparent. In developing it, consideration should be given to the
public health context, and the margin may be derived using data on historic control
intervention effects (when available). If such data are used, consideration must be given to how
applicable they will be to the current trial. At a minimum, margins are chosen to exclude
the possibility of declaring NI if the investigational treatment is not superior to no
treatment. Proper trial conduct (avoiding missing/mismeasured data and loss to follow-
up) is essential to preserve assay sensitivity. In addition, sensitivity analyses can also be
useful. Bayesian analyses have numerous advantages for importing historic knowledge
and summarizing conclusions. Clear and complete reporting of key design choices is
lacking in the medical literature of NI trials.

Key Facts

• NI trials are most compelling when there is an investigational intervention which
has theoretical or known advantages over control interventions such as safety,
simplicity, or cost.
• The key issue in the NI design is that the investigational intervention is a
compelling choice provided it is not much worse than the control intervention.
• The NI margin quantifies this magnitude and is the key design parameter in NI
trials. The choice of margin should be prespecified with a transparent justification.
• Proper trial conduct (avoiding missing/mismeasured data, loss to follow-up) is
essential to preserve assay sensitivity.
• Clear and complete reporting of key design choices is lacking in the medical
literature of NI trials.

Cross-References

▶ Data and Safety Monitoring and Reporting
▶ Estimands and Sensitivity Analyses
▶ Interim Analysis in Clinical Trials
▶ Masking of Trial Investigators
▶ Masking Study Participants
▶ Platform Trial Designs
▶ Power and Sample Size
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data

References
Aberegg SK, Hersh AM, Samore MH (2018) Empirical consequences of current recommendations
for the design and interpretation of noninferiority trials. J Gen Intern Med 33(1):88–96
Bratton DJ, Williams HC, Kahan BC, Phillips PP, Nunn AJ (2012) When inferiority meets
non-inferiority: implications for interim analyses. Clin Trials 9(5):605–609
Chan AW, Tetzlaff JM, Gotzsche PC, Altman DG, Mann H, Berlin JA et al (2013) SPIRIT 2013
explanation and elaboration: guidance for protocols of clinical trials. BMJ 346:e7586
Committee for Proprietary Medicinal Products (2000) Points to consider on switching between
superiority and non-inferiority. https://www.ema.europa.eu/en/documents/scientific-guideline/
points-consider-switching-between-superiority-non-inferiority_en.pdf
Committee for Proprietary Medicinal Products (2002) Points to consider on multiplicity issues in
clinical trials. https://www.ema.europa.eu/en/documents/scientific-guideline/points-consider-
multiplicity-issues-clinical-trials_en.pdf
Committee for Proprietary Medicinal Products (2006) Guideline on the choice of the non-inferiority
margin. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-choice-non-
inferiority-margin_en.pdf
D’Agostino RB Sr, Massaro JM, Sullivan LM (2003) Non-inferiority trials: design concepts and
issues - the encounters of academic consultants in statistics. Stat Med 22(2):169–186
Dorman SE, Nahid P, Kurbatova EV, Goldberg SV, Bozeman L, Burman WJ et al (2020) High-dose
rifapentine with or without moxifloxacin for shortening treatment of pulmonary tuberculosis:
study protocol for TBTC study 31/ACTG A5349 phase 3 clinical trial. Contemp Clin Trials 90:
105938
Dunn DT, Copas AJ, Brocklehurst P (2018a) Superiority and non-inferiority: two sides of the same
coin? Trials 19(1):499
Dunn DT, Glidden DV, Stirrup OT, McCormack S (2018b) The averted infections ratio: a novel
measure of effectiveness of experimental HIV pre-exposure prophylaxis agents. Lancet HIV 5
(6):e329–e334
Evans SR, Rubin D, Follmann D, Pennello G, Huskins WC, Powers JH et al (2015) Desirability of
outcome ranking (DOOR) and response adjusted for duration of antibiotic risk (RADAR). Clin
Infect Dis 61(5):800–806
Farrington CP, Manning G (1990) Test statistics and sample size formulae for comparative binomial
trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med 9(12):
1447–1454
Floyd K, Hutubessy R, Kliiman K, Centis R, Khurieva N, Jakobowiak W et al (2012) Cost and cost-
effectiveness of multidrug-resistant tuberculosis treatment in Estonia and Russia. Eur Respir J
40(1):133–142
Food and Drug Administration Center for Biologics Evaluation and Research (CBER) (2020)
Guidance for industry. Development and licensure of vaccines to prevent COVID-19.
U.S. Department of Health and Human Services. https://www.fda.gov/media/139638/download
Food and Drug Administration Center for Drug Evaluation and Research (CDER) (2013) Guidance
for industry. Pulmonary tuberculosis: developing drugs for treatment, draft guidance.
U.S. Department of Health and Human Services. https://www.fda.gov/media/87194/download
Food and Drug Administration Center for Drug Evaluation and Research (CDER) (2016) Guidance
for Industry. Non-inferiority clinical trials to establish effectiveness. U.S. Department of Health
and Human Services. https://www.fda.gov/media/78504/download
Gamalo-Siebers M, Gao A, Lakshminarayanan M, Liu G, Natanegara F, Railkar R et al (2016)
Bayesian methods for the design and analysis of noninferiority trials. J Biopharm Stat 26(5):
823–841
Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Dore C et al (2017) Guidelines for the
content of statistical analysis plans in clinical trials. JAMA 318(23):2337–2343
Gillespie SH, Crook AM, McHugh TD, Mendel CM, Meredith SK, Murray SR et al (2014) Four-
month moxifloxacin-based regimens for drug-sensitive tuberculosis. N Engl J Med 371(17):
1577–1587
Gomberg-Maitland M, Frison L, Halperin JL (2003) Active-control clinical trials to establish
equivalence or noninferiority: methodological and statistical concepts linked to quality. Am
Heart J 146(3):398–403
Hernan MA, Robins JM (2017) Per-protocol analyses of pragmatic trials. N Engl J Med 377(14):
1391–1398
Holmgren EB (1999) Establishing equivalence by showing that a specified percentage of the effect
of the active control over placebo is maintained. J Biopharm Stat 9(4):651–659
Horsburgh CR, Shea KM, Phillips P, Lavalley M (2013) Randomized clinical trials to identify
optimal antibiotic treatment duration. Trials 14(1):88
Ibrahim JG, Chen M-H (2000) Power prior distributions for regression models. Stat Sci 15(1):46–60
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (1998) Statistical principles for clinical trials (E9). https://database.
ich.org/sites/default/files/E9_Guideline.pdf
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (2000) Choice of control group and related issues in clinical trials
(E10). https://database.ich.org/sites/default/files/E10_Guideline.pdf
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (2019) Estimands and sensitivity analysis in clinical trials. E9(R1).
https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf
James Hung HM, Wang SJ, Tsong Y, Lawrence J, O’Neil RT (2003) Some fundamental issues with
non-inferiority testing in active controlled trials. Stat Med 22(2):213–225
Jindani A, Harrison TS, Nunn AJ, Phillips PP, Churchyard GJ, Charalambous S et al (2014) High-
dose rifapentine with moxifloxacin for pulmonary tuberculosis. N Engl J Med 371(17):1599–1608
Jones B, Jarvis P, Lewis JA, Ebbutt AF (1996) Trials to assess equivalence: the importance of
rigorous methods. BMJ 313(7048):36–39
Kaji AH, Lewis RJ (2015) Noninferiority trials: is a new treatment almost as effective as another?
JAMA 313(23):2371–2372
Kirby S, Burke J, Chuang-Stein C, Sin C (2012) Discounting phase 2 results when planning phase
3 clinical trials. Pharm Stat 11(5):373–385
Korn EL, Freidlin B (2018) Interim monitoring for non-inferiority trials: minimizing patient
exposure to inferior therapies. Ann Oncol 29(3):573–577
Machin D, Campbell MJ, Tan SB, Tan SH (2018) Sample size tables for clinical, laboratory and
epidemiology studies, 4th edn. Wiley, Hoboken
Mauri L, D’Agostino RB Sr (2017) Challenges in the design and interpretation of noninferiority
trials. N Engl J Med 377(14):1357–1367
Mayer KH, Molina JM, Thompson MA, Anderson PL, Mounzer KC, De Wet JJ et al (2020)
Emtricitabine and tenofovir alafenamide vs emtricitabine and tenofovir disoproxil fumarate for
HIV pre-exposure prophylaxis (DISCOVER): primary results from a randomised, double-blind,
multicentre, active-controlled, phase 3, non-inferiority trial. Lancet 396(10246):239–254
Neuenschwander B, Capkun-Niggli G, Branson M, Spiegelhalter DJ (2010) Summarizing historical
information on controls in clinical trials. Clin Trials 7(1):5–18
Ng T-H (2015) Noninferiority testing in clinical trials: issues and challenges. Taylor & Francis/CRC
Press, Boca Raton, xvii, 190 p
Nunn AJ, Phillips PPJ, Gillespie SH (2008) Design issues in pivotal drug trials for drug sensitive
tuberculosis (TB). Tuberculosis 88:S85–S92
Nunn AJ, Rusen I, Van Deun A, Torrea G, Phillips PP, Chiang CY et al (2014) Evaluation of a
standardized treatment regimen of anti-tuberculosis drugs for patients with multi-drug-resistant
tuberculosis (STREAM): study protocol for a randomized controlled trial. Trials 15(1):353
Nunn AJ, Phillips PPJ, Meredith SK, Chiang CY, Conradie F, Dalai D et al (2019) A trial of a
shorter regimen for rifampin-resistant tuberculosis. N Engl J Med 380(13):1201–1213
Phillips PP, Morris TP, Walker AS (2016) DOOR/RADAR: a gateway into the unknown? Clin
Infect Dis 62(6):814–815
Piaggio G, Elbourne D, Altman D, Pocock S, Evans S (2006) Reporting of noninferiority and
equivalence randomized trials: an extension of the CONSORT statement. JAMA 295(10):1152
Piaggio G, Elbourne DR, Pocock SJ, Evans SJ, Altman DG, CONSORT Group (2012) Reporting of
noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement.
JAMA 308(24):2594–2604
Quartagno M, Walker AS, Carpenter JR, Phillips PP, Parmar MK (2018) Rethinking non-inferiority:
a practical trial design for optimising treatment duration. Clin Trials 15(5):477–488. https://doi.
org/10.1177/1740774518778027
Quartagno M, Walker AS, Babiker AG, Turner RM, Parmar MKB, Copas A et al (2020) Handling
an uncertain control group event risk in non-inferiority trials: non-inferiority frontiers and the
power-stabilising transformation. Trials 21(1):145
Rehal S, Morris TP, Fielding K, Carpenter JR, Phillips PP (2016) Non-inferiority trials: are they
inferior? A systematic review of reporting in major medical journals. BMJ Open 6(10):
e012594
Rothmann MD, Tsou HH (2003) On non-inferiority analysis based on delta-method confidence
intervals. J Biopharm Stat 13(3):565–583
Rothmann MD, Wiens BL, Chan ISF (2012) Design and analysis of non-inferiority trials. Chapman
& Hall/CRC, Boca Raton, xvi, 438 p
Sankoh AJ (2008) A note on the conservativeness of the confidence interval approach for the selection
of non-inferiority margin in the two-arm active-control trial. Stat Med 27(19):3732–3742
Simon R (1999) Bayesian design and analysis of active control clinical trials. Biometrics 55(2):
484–487
Snapinn S, Jiang Q (2008) Preservation of effect and the regulatory approval of new treatments on
the basis of non-inferiority trials. Stat Med 27(3):382–391
Spiegelhalter DJ, Freedman LS, Parmar MK (1994) Bayesian approaches to randomized trials. J R
Stat Soc A Stat Soc 157(3):357–387
Tiemersma EW, van der Werf MJ, Borgdorff MW, Williams BG, Nagelkerke NJ (2011) Natural
history of tuberculosis: duration and fatality of untreated pulmonary tuberculosis in HIV
negative patients: a systematic review. PLoS One 6(4):e17601
Treadwell JR, Uhl S, Tipton K, Shamliyan T, Viswanathan M, Berkman ND et al (2012) Assessing
equivalence and noninferiority. J Clin Epidemiol 65(11):1144–1149
Tsui M, Rehal S, Jairath V, Kahan BC (2019) Most noninferiority trials were not designed to
preserve active comparator treatment effects. J Clin Epidemiol 110:82–89
Tweed CD, Wills GH, Crook AM, Amukoye E, Balanag V, Ban AYL et al (2021) A partially
randomised trial of pretomanid, moxifloxacin and pyrazinamide for pulmonary TB. Int J Tuberc
Lung Dis 25(4):305–314
Van Deun A, Maug AKJ, Salim MAH, Das PK, Sarker MR, Daru P et al (2010) Short, highly
effective, and inexpensive standardized treatment of multidrug-resistant tuberculosis. Am J
Respir Crit Care Med 182(5):684–692
Wellek S (2010) Testing statistical hypotheses of equivalence and noninferiority, 2nd edn. CRC
Press, Boca Raton, xvi, 415 p
White IR, Carpenter J, Horton NJ (2012) Including all individuals is not enough: lessons for
intention-to-treat analysis. Clin Trials 9(4):396–407
Wiens BL, Zhao W (2007) The role of intention to treat in analysis of noninferiority studies. Clin
Trials 4(3):286–291
Williams HC, Wojnarowska F, Kirtschig G, Mason J, Godec TR, Schmidt E et al (2017) Doxycy-
cline versus prednisolone as an initial treatment strategy for bullous pemphigoid: a pragmatic,
non-inferiority, randomised controlled trial. Lancet 389(10079):1630–1638
World Health Organization (2011) Guidelines for the programmatic management of drug-resistant
tuberculosis - 2011 update. World Health Organization
World Health Organization (2019) WHO consolidated guidelines on drug-resistant tuberculosis
treatment. World Health Organization, Geneva
World Health Organization (2020) Global tuberculosis report 2020. World Health Organization,
Geneva
70 Cross-over Trials

Byron Jones
Novartis Pharma AG, Basel, Switzerland

Contents
Introduction 1326
Notation 1326
Challenges When Designing a Cross-over Design 1329
  Efficiency of a Cross-over Design 1329
Example 1: 2 × 2 Design 1330
  Plotting the Data 1332
  Two-Sample t-Test 1334
  Fitting a Linear Model 1335
  Checking Assumptions 1336
  Testing for a Difference in Carry-Over Effects 1337
  Additional Use of the Random-Effects Linear Model 1339
Williams Cross-over Design 1340
  Introduction 1340
  Example of a Cross-over Trial with Five Treatments 1342
Example 3: Incomplete Block Design 1345
Use of Baseline Measurements 1348
Summary and Conclusion 1349
Key Facts 1350
Cross-References 1350
References 1350

Abstract
Cross-over trials have the potential to provide large reductions in sample size
compared to their parallel groups counterparts. In this chapter, three different
types of cross-over design and their analysis will be described. In the linear
models used to analyze cross-over data, the variability due to differences between
the subjects in the trial may be modeled as either fixed or random effects, and both
will be illustrated. In designs of the incomplete block type, where there are more

treatments than periods, the use of random subject effects enables between-
subject information on the treatment comparisons to be recovered, and how this
may be done will also be illustrated. Finally, if the so-called baseline measure-
ments have been taken at the beginning of each treatment period, these can be
used as covariates to reduce the variability of the estimated treatment compari-
sons. The use of baselines will be illustrated using the incomplete block design.

Keywords
Crossover trials · Fixed effects · Random effects · Incomplete block · Between-
subject information · Baseline information

Introduction

In a cross-over trial, each subject (i.e., a patient or a healthy volunteer) receives a


sequence of treatments over a set number of time periods. At the end of each period,
the clinical response of interest is measured and recorded. The effects of the different
treatments are then compared using the repeated measurements on the same subject.
These “within-subject” comparisons are to be contrasted with those that are obtained
from a parallel groups design where each subject receives only one of the treatments,
and comparisons between treatments are made “between-subjects.” Typically, the
variability of the within-subject comparisons is much smaller than the variability of
the between-subject comparisons, leading to cross-over designs that may require far
fewer subjects than their parallel groups counterparts to achieve the same power to
detect a treatment effect of a given size (see Piantadosi (1997), Sect. 16.2.1).
Three different types of cross-over design and their analyses will be described in
this chapter, and this will be done using examples based on real trials. In section
“Example 1: 2  2 Design,” two treatments are compared using two periods. In
section “Williams Crossover Design,” five treatments are compared using five
periods, and in section “Example 3: Incomplete Block Design” the trial is of the
incomplete block type and compares four treatments in three periods. Using this last
design, the recovery of any useful between-subject information that is typically
present in an incomplete design will be illustrated.
It is not unusual in a cross-over clinical trial to take measurements at the start of
each period as well as at the end of each period. The baseline measurements taken
at the beginning of the periods can sometimes be included as useful covariates to
reduce the variability in the estimated treatment comparisons. The use of baseline
information is illustrated in section “Use of Baseline Measurements.”
In the next section, the notation used in this chapter is introduced.

Notation

When discussing trials in a generic sense, those taking part will be referred to as
subjects. Depending on the actual context, these may be either patients or healthy
volunteers.

In a cross-over trial, the time the trial takes to complete is divided into a sequence of
time periods. Within each period, a subject receives either one of the treatments to be
compared or no treatment. The periods where no treatments are received are referred to
as either run-in or wash-out periods. The reason for these will be explained later.
Typically, a subject receives a sequence of different treatments spaced out over all time
periods. In the trial as a whole, a particular (usually small) number of different
treatment sequences will be used, and an example of such a trial is given in Table 1,
below. The subjects that are available for the trial are randomly assigned to the
different sequences, and the group of subjects assigned to a particular sequence is
referred to as a sequence group. Often, the random assignment is done in a way that
ensures all sequence groups are the same (or approximately the same) size.
In the cross-over trial illustrated in Table 1, there are two treatments to be
compared (A and B), over four time periods. The first time period is the “run-in”
period where baseline measurements are taken and subjects get acclimatized to the
trial procedures; in the second period, each subject receives one of the two treat-
ments; the third period is a “wash-out” period, and in the fourth period each subject
gets the treatment that was not administered in the second period. The trial has two
sequence groups, defined by the order in which the treatments are administered: AB
and BA.
In general, the number of treatments, periods, and sequence groups in the design
will be denoted as t, p, and s, respectively. The number of subjects in sequence group
i will be denoted as ni, i = 1, 2, . . ., s. Within sequence group i, the ni subjects receive all
the treatments in the specified treatment sequence for that group. Note that the total
number of subjects in the trial is the sum of the ni, i.e., $\sum_{i=1}^{s} n_i$.
If $y_{ijk}$ (i = 1, 2, . . ., s; j = 1, 2, . . ., p; k = 1, 2, . . ., ni) denotes the response observed
on the kth subject in period j of sequence group i, then a typical statistical model used
to explain a continuous response is:

$$y_{ijk} = \mu + \pi_j + \tau_{d[i,j]} + s_{ik} + e_{ijk}, \qquad (1)$$

where μ is a general intercept, $\pi_j$ is an effect associated with period j (j = 1, . . ., p),
$\tau_{d[i,j]}$ is the direct treatment effect associated with the treatment d[i, j] applied in period j
of sequence group i (d[i, j] = 1, . . ., t), $s_{ik}$ is the effect associated with the kth subject in
sequence group i (i = 1, . . ., s; k = 1, . . ., ni), and $e_{ijk}$ is an independent random error
term with zero mean and variance $\sigma^2$. We refer to this model as Model (1).
Note that the sik represent characteristics of the subjects, not the treatments. For
example, some subjects may have baseline values of the response of interest that are
higher or lower than others. Including the sik in the model takes account of some of
the variability in the response that is just the result of variation in baseline subject

Table 1 Typical structure of the 2 × 2 cross-over trial

Group    Run-in   Period 1   Wash-out   Period 2
1(AB)    -        A          -          B
2(BA)    -        B          -          A

characteristics. The sik do not account for variation in the treatment effects over the
periods: other parameters defined above account for this.
In a cross-over trial, there is the possibility that the effect observed in one period
may persist into the next period. This effect is known as a carry-over effect. If there is a
need to allow for carry-over effects, and these are additive to the treatment effects, then
the above model can be extended to include the carry-over effect term $\lambda_{d[i,j-1]}$:

$$y_{ijk} = \mu + \pi_j + \tau_{d[i,j]} + s_{ik} + \lambda_{d[i,j-1]} + e_{ijk}. \qquad (2)$$

Obviously, there can be no carry-over effect in the first period, i.e., $\lambda_{d[i,0]} = 0$. We
refer to this model as Model (2).
Models (1) and (2) are examples of fixed-effects models: In such models, the
effects (for periods, treatments, carry-overs and subjects) are constant, but unknown
values, to be estimated.
An alternative model, referred to here as the random-effects model, assumes that
the sik are independent random effects with mean 0 and variance $\sigma_s^2$.
The sik have mean zero because they represent the random relative effect of the
subject characteristics on the response (some values are higher and some are lower).
The amount the subject effects vary is measured by the variance parameter $\sigma_s^2$
(larger $\sigma_s^2$ implies greater variability).
In addition, the sik are assumed to be independent of the eijk. Of course, it could be
argued, quite rightly, that what is being referred to here as a random-effects model is
actually a mixed-effects model because it contains both random and fixed effects.
However, to maintain consistency with the literature, the term random-effects model
will be used in this chapter.
If not stated otherwise, it is assumed in the following that the sik in Models (1) and
(2) are random variables, as defined above.
Then the responses from periods j and j′ (j ≠ j′) on the same subject have
variances $\mathrm{Var}(y_{ijk}) = \mathrm{Var}(y_{ij'k}) = \sigma^2 + \sigma_s^2$ and covariance
$\mathrm{Cov}(y_{ijk}, y_{ij'k}) = \sigma_s^2$. This means that the correlation between any two
responses on the same subject is $\rho = \sigma_s^2 / (\sigma_s^2 + \sigma^2)$. As can be seen,
the correlation increases as $\sigma_s^2$ increases.
More complex correlation structures can be defined by making particular assump-
tions regarding the correlation structure of the sik or the eijk, but we do not do this
here. See Chi and Reinsel (1989) for how this may be done to produce an auto-
regressive correlation structure.
When any two measurements on the same subject are positively correlated, the
estimate of any treatment comparison has a smaller variance than would be obtained
from a parallel groups design with the same number of subjects (see Piantadosi
(1997), Sect. 16.2.1). So when used appropriately, the cross-over design requires
fewer subjects than a parallel groups design to achieve a desired level of power to
reject the null hypothesis of no treatment difference.
When the subject effects are assumed to be random, the model parameters can be
estimated using restricted maximum likelihood (REML), as introduced by Patterson
and Thompson (1971).
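
As a concrete illustration of this model fitting, a minimal sketch in R is given below; it is not code from the chapter and assumes a hypothetical long-format data frame dat with one row per subject-period, factor columns subject, period, and treatment, and a numeric response y. Model (1) is fitted by REML with random subject effects using the nlme package.

# Fit Model (1) with random subject effects by REML (hypothetical data 'dat')
library(nlme)
fit <- lme(y ~ period + treatment,   # fixed effects: mu, pi_j, tau_d[i,j]
           random = ~ 1 | subject,   # random subject effects s_ik
           data = dat, method = "REML")
summary(fit)    # REML estimates of the fixed effects
VarCorr(fit)    # estimates of sigma_s^2 (subject) and sigma^2 (residual)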

Challenges When Designing a Cross-over Trial

When designing a cross-over trial, several important questions need to be answered:

• Is the condition being treated a chronic condition for which a cross-over trial is
appropriate?
• Is the effect of the treatment likely to persist into a following period, i.e., are
carry-over effects expected?
• Can the carry-over effects be removed by including sufficiently long wash-out
periods?
• How many treatments are to be compared?
• How many periods can be used?
• Which treatment comparisons are important?
• What is the maximum sample size, i.e., the total number of subjects required?

Assuming that a cross-over design is an appropriate choice, the answers to the


above questions typically do not lead to a unique design and some way of choosing
between the alternatives is required. Jones and Kenward (2014) give tables of
alternative designs for up to nine treatments and nine periods. Rohmeyer (2014)
has provided an R package Crossover to accompany Jones and Kenward (2014) that
provides an easy-to-use graphical user interface (GUI) to locate designs for different
values of (t, p, s), either from the tables provided in Chapter 4 of Jones and Kenward
(2014) or by using a computer search algorithm. See Chapter 4 of Jones and
Kenward (2014) for examples of using the GUI and Rohmeyer (2014) for details
of the search algorithm. The Crossover package also allows a choice for how the
carry-over effects are modeled. The criterion used to distinguish between good and
bad cross-over designs is referred to as the efficiency of the design.

Efficiency of a Cross-over Design

An important criterion when choosing between designs, especially when all pairwise
comparisons between the treatments are of equal importance, is the efficiency of the
design. The efficiency of a particular comparison is the ratio of a theoretical lower
bound on the variance of an estimated pairwise difference of the treatment effects to
the variance achieved in the design of interest (Patterson and Lucas (1962)).
For the calculation of these variances, it is assumed that the sik are fixed effects.
This is because, when comparing designs, the main interest is in maximizing the
within-subject information.
The lower bound is a technical construct that has proved very useful for calibrat-
ing how good a design is relative to the (statistically) best it could be. In this
technically best design, statistical theory states that the estimator of the difference
between two treatment effects that has minimum variance is the difference between
the simple (unadjusted) means of those two treatments. The variance of this estima-
tor is then the variance of the difference of these two means, which can easily be
calculated for any design. For example, in most designs, the planned number of
responses, r, to be recorded on each treatment (A, B, and so on) is the same. If each
sequence group has size n, the planned total number of responses is N = s × n × p
and r = N/t. The variance of the difference of two means in this case is

$$V_{\text{bound}} = \frac{2\sigma^2}{r} \qquad (3)$$

and this is the technical lower bound for the difference of two treatments in the
design under consideration. For some designs, the numbers of responses on a
particular pair of treatments, i and j, say, may not be the same, equaling ri and rj,
respectively. In this case, the technical lower bound is

$$V_{\text{bound}} = \sigma^2 / r_i + \sigma^2 / r_j. \qquad (4)$$

If Vd is the variance of an estimated pairwise comparison in the design of interest,
then the efficiency, E, is defined as

$$E = \frac{V_{\text{bound}}}{V_d}. \qquad (5)$$

Basically, E measures how large Vd is when compared to the lower bound. The
larger Vd is, the smaller E is.
The value of Vd depends on the design and the model assumed for the response. It
can be calculated using the formulae for the least squares estimators of the treatment
comparisons in the design under consideration (see Jones and Kenward (2014),
Sect. 3.5, for some examples). Although Vd involves the within-subject variance $\sigma^2$ as
a constant multiplier, this cancels out in the ratio in Eq. (5), as Vbound also includes $\sigma^2$
as a constant multiplier. Therefore, E can be calculated prior to the collection of any
data. As already noted, E measures how large the variance of the difference of two
estimated treatment effects is in the design under consideration relative to the lower
bound. In an ideal design, E = 1, as then Vd = Vbound. The reason why Vd may be
greater than Vbound, and E < 1, is that the structure of the design is such that, even
after adjusting for the period and subject effects (and possibly carry-over effects) in
the statistical analysis (i.e., after fitting Model (1) or (2)), the lower bound on the
variance is still exceeded for some or all estimated pairwise comparisons.
Examples of the efficiency values for two types of design will be given in sections
“Williams Cross-over Design” and “Example 3: Incomplete Block Design,”
respectively.
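
To make the efficiency calculation concrete, the sketch below (an illustration, not code from the chapter) computes E for the three-treatment Williams design that appears later in Table 10, assuming fixed subject effects and Model (1). Vd is read off the (X'X)^{-1} matrix of the least squares fit, in units of σ², so no data are needed.

# Efficiency of a cross-over design under Model (1), fixed subject effects
design <- rbind(c("A","B","C"), c("B","C","A"), c("C","A","B"),
                c("C","B","A"), c("A","C","B"), c("B","A","C"))
s <- nrow(design); p <- ncol(design)
ntrt <- length(unique(as.vector(design)))
d <- data.frame(subject   = factor(rep(1:s, each = p)),
                period    = factor(rep(1:p, times = s)),
                treatment = factor(as.vector(t(design))))
X  <- model.matrix(~ subject + period + treatment, data = d)
V  <- solve(crossprod(X))             # (X'X)^{-1}, in units of sigma^2
cB <- as.numeric(colnames(X) == "treatmentB")
Vd <- drop(crossprod(cB, V %*% cB))   # Var(tau_B - tau_A) / sigma^2
r  <- s * p / ntrt                    # responses per treatment
E  <- (2 / r) / Vd                    # Eq. (5); equals 1 for this design

Swapping in the rows of another design (e.g., the incomplete block design of Table 16) should reproduce the efficiencies quoted for it later in this chapter.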

Example 1: 2 × 2 Design

The simplest cross-over design, known as the 2 × 2 design, compares two treatments
using two active treatment periods. The basic structure of this design is given, as
previously, in Table 1. Here s = 2, p = 2, and t = 2. In this design, as illustrated, there
are four periods: two active treatment periods (labeled as 1 and 2), a run-in period, and a
wash-out period. The purpose of the run-in period, as already noted, may be to

familiarize the subjects with the clinical trial procedures, collect baseline measure-
ments, remove the effects of a previous treatment, or confirm that a subject is able to
continue into the first period, for example. The purpose of the wash-out period is to
remove the effects of the drug given in Period 1 before the second drug is given in
Period 2. This should ensure that subjects are in the same clinical state at the start of
Period 2 as they were at the start of Period 1. It should be noted that in some trials the
wash-out period is not included. This may be because its inclusion will extend the time
the trial will take to complete beyond that which is reasonable or because there is
confidence that carry-over effects will not exist. Typically, to remove a pharmacological
carry-over effect of a single oral dose, the wash-out period should be at least five half-
lives of the drug (FDA (2013)). The half-life of a drug (see Clark and Smith (1981)) is
the time it takes for the concentration of the drug in the blood (or plasma) to reduce to
half of what it was at equilibrium. After one half-life, the concentration should drop by
(100 × 1/2)% = 50%, after two half-lives by 100 × (1/2 + 1/4)% = 75%, and after five
half-lives by 100 × (1/2 + 1/4 + 1/8 + 1/16 + 1/32)% ≈ 97%.
For Models (1) and (2), the fixed effects are displayed in Table 2 and Table 3,
respectively.
The data that will be used to illustrate the analysis of the 2 × 2 design are based on
actual data from a completed Phase III randomized, double-blind, placebo controlled
trial to assess the effect of drug A on exercise endurance in subjects with moderate to
severe chronic obstructive pulmonary disease (COPD). Drug B is a placebo treat-
ment, i.e., a drug with no active pharmacological ingredients. Exercise endurance
time in seconds was measured using a constant-load cycle ergometry test after 3
weeks of treatment.
After a 1-week run-in period, subjects were randomized to receive either A or B.
After 3 weeks of treatment, the drugs were withdrawn, and there was a 3-week
wash-out period. Patients who began on A then crossed over to B, and those who
began on B crossed over to A. Due to the presence of the wash-out periods, it is
assumed that there are no carry-over effects present and Model (1) applies.
Patients who received A first belong to the AB sequence group (Group 1), and
subjects who received B first belong to the BA sequence group (Group 2).
The data in Tables 4 and 5 give examples of observations similar to those
collected in the actual trial. The actual data have been modified to preserve their
confidentiality and to enable key ideas to be presented without too much

Table 2 The fixed effects in the model that excludes carry-over effects

Group    Period 1       Period 2
1(AB)    μ + π1 + τ1    μ + π2 + τ2
2(BA)    μ + π1 + τ2    μ + π2 + τ1

Table 3 The fixed effects in the model that includes carry-over effects

Group    Period 1       Period 2
1(AB)    μ + π1 + τ1    μ + π2 + τ2 + λ1
2(BA)    μ + π1 + τ2    μ + π2 + τ1 + λ2

Table 4 Example of observations from Group 1(AB) subjects (endurance time in seconds)
Subject Period 1(A) Period 2(B) Difference Sum
1 496 397 99 893
3 405 228 177 633
. . . . .
. . . . .
58 575 548 27 1123
59 330 303 27 633

Table 5 Example of observations from Group 2(BA) subjects (endurance time in seconds)
Subject number Period 1(B) Period 2(A) Difference Sum
2 270 332 −62 602
5 465 700 −235 1165
. . . . .
. . . . .
54 308 236 72 544
57 177 293 −116 470

complication. To save space, only the data for a few subjects in each group are
shown. Table 4 gives the data for the subjects in the AB group (Group 1). It can be
seen that there are two measurements for each subject, one for Period 1 when A was
received and one for Period 2 when B was received. Table 5 gives the corresponding
data for the subjects in the BA group (Group 2), where it is noted that in this group
the subjects received B first and then crossed over to A.
Also given in these two tables are the within-subject differences (Period 1 –
Period 2) and the sum of the two responses of each subject. These will be used to
estimate the treatment difference and the carry-over difference, respectively (see
sections “Two-Sample t-Test” and “Testing for a Difference in Carry-Over Effects”).
By taking the difference between the Period 1 measurement and the Period 2
measurement within each subject, a within-subject comparison of A versus B is
obtained in Group 1 and a within-subject comparison of B versus A is obtained in
Group 2.
In the following, the complete data from the 2 × 2 trial are analyzed to decide if
treatment A is superior to treatment B. Before that, however, we describe two useful
plots that focus on illustrating the strength of the difference (if any) between the
effects of A and B.

Plotting the Data

Before conducting formal hypothesis testing, it is always useful to plot the data that
are to be analyzed. For the 2 × 2 trial, a useful plot is the subject profiles plot, which
for this example is displayed in Fig. 1.

[Fig. 1: Subject profiles plot for the endurance data; left panel: sequence AB, right panel: sequence BA; x-axis: period (1, 2), y-axis: exercise score]

The left-hand plot is for the AB group, and the
right-hand plot is for the BA group. Looking at the plot for the AB group, for
example, it can be seen that there is a pair of points and a line connecting them for
each subject. The left-hand point is the response in Period 1, and the right-hand point
is the response in Period 2. It can be seen that some subjects have a much longer
endurance time on A compared to B (see the second line from the top for the AB
group) whereas others actually have a shorter endurance time on A compared to B
(some substantially so). It is clear that the between-subject variability in the
responses is large, and this is one of the situations where a cross-over trial is likely
to be preferred to a parallel groups design.
However, although it might be possible to get the impression from Fig. 1 that A
increases endurance time compared to B, it is not very clear and a significance test
must be performed to get a definitive answer. As a preliminary to this, the mean of
the responses of the subjects per group and period is calculated, as displayed in
Table 6.
A very useful way to display these means is the groups-by-periods plot, as given
in Fig. 2.

Table 6 The means for each group and period combination

Group    Period 1   Period 2
1(AB)    539.57     485.33
2(BA)    478.17     563.52

[Fig. 2: Group-by-period means for the endurance data; left panel: "Connect treatments", right panel: "Connect groups"; x-axis: period, y-axis: mean exercise time]

The means have been joined in different ways. In the left-hand panel, the

lines connect the same treatment, and in the right-hand panel the lines connect the
same group. The left-hand panel emphasizes the treatment difference in each period:
It can be seen that in each period there is a consistent pattern where A is higher than
B. The right-hand plot emphasizes the within-subject mean changes: In the AB
group, there is a definite decline in endurance from the first period to the second, and
in the BA group there is a definite increase in endurance time from the first period to
the second. This displays the clearest evidence so far that A is superior to B.

Two-Sample t-Test

Before immediately fitting Model (1), it is instructive to first see how the null
hypothesis of no treatment effect can be tested using the familiar two-sample t-
test. This requires the additional assumption that the within-subject differences in
response are normally distributed. In the absence of an assumption of normality, the
nonparametric Wilcoxon rank-sum test may be used (Jones and Kenward (2014),
Sect. 2.12 and Hollander and Wolfe (1999)), although the t-test is quite robust
against violations of this and other assumptions (Havlicek and Peterson (1974)).
The column headed “Difference” in both Tables 4 and 5 gives the within-subject
difference for each subject in each group. Each within-subject difference in Group 1
is an unbiased estimator of (τ1 – τ2) + (π1 – π2), where it is noted that the subject
effects and the general intercept have been canceled out. Similar reasoning leads to
the conclusion that the mean of the differences in Group 2 is an unbiased estimate of
(τ2 – τ1) + (π1 – π2). Consequently, the difference between the mean of the within-
subject differences in Group 1 and the mean of the within-subject differences in
Group 2 has expectation 2(τ1 – τ2). Therefore, a two-sample t-test comparing the
within-subject differences of the two groups is a test of the null hypothesis of no
difference between the treatments, H0: τ1 = τ2.
If $d_{ik} = y_{i1k} - y_{i2k}$ denotes the within-subject difference of subject k in group i, and
$\bar{d}_{i\cdot}$ the mean of these differences in Group i, then a pooled estimator of the variance
of the differences, $\sigma_d^2$, is:

$$\hat{\sigma}_d^2 = 2\hat{\sigma}^2 = \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( d_{ik} - \bar{d}_{i\cdot} \right)^2 / (n_1 + n_2 - 2). \qquad (6)$$

The pooled estimator of the variance is an application of a standard formula, as


given, for example, in Altman (1991), Sect. 9.6.1.
The estimator of the treatment difference, $\tau_d = \tau_1 - \tau_2$, is

$$\hat{\tau}_d = \left( \bar{d}_{1\cdot} - \bar{d}_{2\cdot} \right) / 2 \qquad (7)$$

and its standard error is

$$s_\tau = \sqrt{ \hat{\sigma}_d^2 \left( 1/n_1 + 1/n_2 \right) / 4 }. \qquad (8)$$

Calculating these statistics gives: $\hat{\sigma}_d^2 = 42393.3$, $\hat{\tau}_d = (54.233 - (-85.345))/2 = 69.79$,
and $s_\tau = 26.81$. The two-sample t-test statistic, $\hat{\tau}_d / s_\tau$, is then 69.79/26.81 = 2.60
on 57 degrees of freedom. This gives a p-value of 0.012 (two-sided) or 0.006 (one-
sided). There is clear evidence to reject the null hypothesis and to conclude that A is
superior to B and increases endurance time by 69.8 seconds on average, with a 95%
confidence interval for the difference of (16.10, 123.47).
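
The same calculation takes a few lines of R; the sketch below is illustrative, assuming hypothetical vectors d1 and d2 holding the (Period 1 − Period 2) differences for Groups 1 (AB) and 2 (BA).

# Two-sample t-test on the within-subject differences (hypothetical d1, d2)
n1 <- length(d1); n2 <- length(d2)
t.test(d1, d2, var.equal = TRUE)      # pooled t-test; p-value 0.012 in the text
tau_hat <- (mean(d1) - mean(d2)) / 2  # Eq. (7)
sd2 <- ((n1 - 1) * var(d1) + (n2 - 1) * var(d2)) / (n1 + n2 - 2)  # Eq. (6)
se_tau <- sqrt(sd2 * (1 / n1 + 1 / n2) / 4)                       # Eq. (8)
c(tau_hat, se_tau)  # should reproduce 69.79 and 26.81 on the trial data

Note that tau_hat/se_tau equals the two-sample t statistic, since numerator and denominator are both halved.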

Fitting a Linear Model

The analysis of variance table obtained from fitting Model (1) is given in Table 7,
and the corresponding estimated treatment and period effects are given in Table 8.
Although Table 7 does not give any new information regarding the treatment
effect, it does make clear that a large proportion of the total variability in the data is
accounted for by the between-subject variability. By this is meant that the between-
subjects SS are a large proportion of the Total SS (4,090,211/5,448,153 = 0.75). Also,
there is no evidence of a significant period difference.
In addition, from output not shown, $\hat{\rho} = \hat{\sigma}_s^2 / (\hat{\sigma}_s^2 + \hat{\sigma}^2) =
25262/(25262 + 21197) = 0.54$.
The period difference could also have been tested using the two-sample t-test. To
test for a period effect, the within-subject differences in one of the groups (e.g., Group 2)

Table 7 Analysis of variance for the endurance data [df = degrees of freedom, SS = sums of
squares, MS = mean square, and F = F-ratio]
Source df SS MS F P-value
Between-subjects 58 4,090,211 70,521
Periods 1 7136 7136 0.34 0.564
Treatments 1 143,639 143,639 6.78 0.012
Residual 57 1,208,209 21,197
Total 117 5,448,153

Table 8 Estimates of the fitted effects in Model (1)

Effect                          Estimate    Standard error
$\hat{\tau}_1 - \hat{\tau}_2$   69.789      26.809
$\hat{\pi}_1 - \hat{\pi}_2$     −15.556     26.809

are multiplied by −1 before the test is applied. This will change the expected value of
the difference in means to (τ1 – τ2) + (π1 – π2) − [(τ1 – τ2) + (π2 – π1)] = 2(π1 – π2), and
hence a test of the null hypothesis H0: π1 = π2 is obtained. Then the approach used in
section “Two-Sample t-Test” can be followed, giving a t-test statistic of −0.580 and
a two-sided p-value of 0.564, in agreement (as expected) with the corresponding
values in Table 7.

Checking Assumptions

One advantage of fitting a linear model is that it conveniently allows the checking of
assumptions made about the model.
For example, it is easy to check if the residuals from the fitted model are
approximately normally distributed or if there are any outliers. The raw residual is
the difference between the actual response and its prediction from the model. The
standardized residual is the raw residual divided by its standard error.
Figure 3 is a quantile-quantile plot (or Q-Q plot) of the ordered standardized
residuals taken from Period 1, which is a visual check for normality. In this plot, the
quantiles of the sample are plotted against the theoretical quantiles of the normal
distribution. The quantile for a particular value in the dataset is the percentage of data
points below that value. For example, the median is the 50th quantile. It is not
necessary to plot the residuals from both periods because within a subject they add
up to zero and so it is only necessary to take one per subject. If the standardized
residuals are normally distributed, the points in the Q-Q plot should lie on, or close
to, the diagonal straight line. In Fig. 3, it can be seen that there is some deviation of
the points from the line at the extremes, indicating some skewness in the distribution
of the residuals. However, when some of the standard tests for normality are applied
to this sample of standardized residuals, there is no evidence to reject the null
hypothesis that the residuals are normally distributed. For example, the p-value for
the Anderson-Darling normality test (Anderson and Darling (1952)) is 0.436, and
the p-value for the Shapiro-Wilk test (Shapiro and Wilk (1965)) is 0.531.

[Fig. 3: Q-Q plot of the standardized residuals from Period 1; x-axis: theoretical quantiles, y-axis: standardized residuals; points close to the diagonal line indicate normality]
There are five standardized residuals that are larger than 1.964 in absolute value,
with absolute values of 2.13, 2.17, 2.44, 2.25, and 2.32. That is, 5/59 ≈ 8.5% of the
residuals are “large,” compared to the expected percentage of 5% if they were truly
realizations from the standard normal distribution. However, none of the standardized
residuals is excessively large (i.e., >3 in absolute value), so there should be no
serious concerns regarding the normality of the residuals.
As noted earlier, if the assumption of normality is not fulfilled, a nonparametric
comparison of the treatment effects can be performed using the Wilcoxon rank-sum test
applied to the within-subject differences. For details, see Chapter 4 of Hollander and
Wolfe (1999), who also note that the loss of asymptotic relative efficiency when using
the Wilcoxon rank-sum test instead of the t-test is no greater than 13.6%. When the data
are actually independent and normally distributed, the loss in efficiency is 4.5%.
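
These checks are straightforward to carry out in R; a hedged sketch follows, assuming a hypothetical vector res1 of the Period 1 standardized residuals (the Anderson-Darling test comes from the nortest package).

# Normality checks on the Period 1 standardized residuals (hypothetical 'res1')
qqnorm(res1); qqline(res1)   # Q-Q plot as in Fig. 3
shapiro.test(res1)           # Shapiro-Wilk test (p = 0.531 in the text)
library(nortest)
ad.test(res1)                # Anderson-Darling test (p = 0.436 in the text)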

Testing for a Difference in Carry-Over Effects

An important assumption regarding the data in this example is that carry-over effects
are not present (or that they are equal and therefore do not enter into the expectation
of $\hat{\tau}_d$) and effectively Model (1) applies. Although testing for carry-over effects in
the 2 × 2 design is not recommended, for reasons that will be given shortly, it can be
done. For completeness, an explanation of how to do it is given below.
Suppose it is assumed that Model (2) applies. It will be recalled that the carry-over
parameters for treatments A and B are denoted by λ1 and λ2, respectively. The
difference between the carry-over effects is denoted by λd = λ1 – λ2.

In order to derive a test of the null hypothesis that λd = 0, it is noted that the
subject totals

$$t_{1k} = y_{11k} + y_{12k} \quad \text{for the kth subject in Group 1}$$

and

$$t_{2k} = y_{21k} + y_{22k} \quad \text{for the kth subject in Group 2}$$

have expectations

$$E[t_{1k}] = E[2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_1 + 2s_{1k} + e_{11k} + e_{12k}]$$

and

$$E[t_{2k}] = E[2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_2 + 2s_{2k} + e_{21k} + e_{22k}].$$

As the expectations of sik and eijk are both zero, the expectations of t1k and t2k
reduce to $E[t_{1k}] = 2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_1$ and
$E[t_{2k}] = 2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_2$, respectively.
If λ1 = λ2, then these two expectations are equal. Consequently, to test if λd = 0,
the familiar two-sample t-test can be applied to the subject totals.
The estimate of the difference in carry-over effects is $\hat{\lambda}_d = \bar{t}_{1\cdot} - \bar{t}_{2\cdot}$ and has
variance

$$\mathrm{Var}\big[\hat{\lambda}_d\big] = \sigma_T^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right),$$

where

$$\sigma_T^2 = 2\left( 2\sigma_s^2 + \sigma^2 \right).$$

An estimate of $\sigma_T^2$ is

$$\hat{\sigma}_T^2 = \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( t_{ik} - \bar{t}_{i\cdot} \right)^2 / (n_1 + n_2 - 2),$$

the pooled sample variance, which has (n1 + n2 – 2) degrees of freedom.
The estimated standard error of $\hat{\lambda}_d$ is

$$s_\lambda = \hat{\sigma}_T \sqrt{ \frac{1}{n_1} + \frac{1}{n_2} }.$$

On the null hypothesis that λ1 = λ2, the statistic $T_\lambda = \hat{\lambda}_d / s_\lambda$ has a Student's
t-distribution on (n1 + n2 – 2) degrees of freedom.

The estimate of λd is 16.79 with a standard error of 98.63 on 57 degrees of


freedom. The two-sided p-value is 0.865, indicating no evidence of a difference in
carry-over effects.
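
In R, this carry-over test amounts to a single call; the sketch below is illustrative, assuming hypothetical vectors tot1 and tot2 of the within-subject sums (Period 1 + Period 2) for Groups 1 and 2.

# Two-sample t-test on the subject totals: tests H0: lambda_1 = lambda_2
t.test(tot1, tot2, var.equal = TRUE)
mean(tot1) - mean(tot2)   # estimate of lambda_d (16.79 in the text)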
While it may be informative, as a follow-up, to test if the carry-over effects are
significantly different, this test should not be used as a preliminary test to decide
which model to fit to the data, as advocated by Grizzle (1965). Grizzle’s suggestion
was to use the result of the carry-over test to decide between two strategies: (1) if the
carry-over test is significant, exclude the data from the second period and base the
comparison of A and B only on the data from the first period; (2) if the carry-over
effect is not significant, then proceed as done above and use the data from both
periods to fit Model (1). Freeman (1989) showed that Grizzle’s suggestion produced
a biased test and an inflated Type I error rate. See Chapter 2 of Jones and Kenward
(2014) or Freeman (1989) for more details.
The best advice is to decide before the trial begins if a significant difference in
carry-over effects is expected or not. If it is expected, and cannot be removed by
using adequate wash-out periods, then the 2 × 2 design should not be used.
Alternative cross-over designs are available that allow for a difference in carry-
over effects, and Chapter 3 of Jones and Kenward (2014) should be consulted for
more information.
Typically, when carry-over effects are included in the statistical model for the
responses, the efficiency of the estimation of pairwise differences between the
treatment effects is lower than when the carry-over effects are not included in the
model. However, there are designs for which the efficiency of the estimators of
pairwise differences between the treatment effects is the same whether or not the
additive carry-over effects are included in the model. A simple example of this is the
design to compare two treatments using three periods that has the two sequence
groups defined by the sequences ABB and BAA. In this design, the estimators of the
treatment effects are independent of the estimators of the carry-over effects (see
Jones and Kenward (2014), Sect. 3.5, for more details).

Additional Use of the Random-Effects Linear Model

If the results of the trial are such that every subject provides responses in both
periods, the estimates of the treatment effects and the corresponding significance
tests are identical for both the fixed- and random-effects models. This is because the
subject effect cancels out in any within-subject comparison. When this is not the
case, the random-effects model permits the data from the single responses, provided
by some subjects, to be included in the analysis.
To illustrate this, a new dataset is constructed by adding the data given in Table 9,
to the previously used data that were partly shown in Tables 4 and 5. These data are
for subjects who dropped out of the trial after the first period for various (treatment
unrelated) reasons.
The random-effects model is fitted using the Kenward-Roger adjustment
(Kenward and Roger (1997)). This adjustment is a bias correction for the variance

Table 9 Period 1 endurance time (seconds) for subjects without a second period
Group Subject Treatment Time
1 60 1 130
1 61 1 302
1 62 1 653
1 63 1 467
2 64 2 364
2 65 2 155
2 66 2 226

of an estimator of a fixed-effects parameter in the random-effects model. See


Kenward and Roger (1997) for more details.
The p-values from this analysis, for the period and treatment effects, are 0.362
and 0.010, respectively. The estimated treatment difference is 70.63 with a standard
error of 26.59.
This is to be compared with the estimate (standard error) obtained from the fixed-
effects model: 69.79 (26.81). Including the additional data has resulted in a slightly
larger treatment difference and a slightly smaller standard error. Adding the addi-
tional data from Period 1 has made little difference to the conclusions. This is not an
unusual result as the information retrieved from the additional data can only be used
to compare treatments on a between-subject basis, and if the between-subject
variance is large, this additional information will have a low weight compared to
the within-subject information.
As noted earlier, the estimate of the within-subject correlation is estimated to be
0.54. This is not especially high, suggesting that the between-subject information
may be of some use, although here, the small amount of additional data from Period
1 was probably too little to make much of a difference.
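
A hedged sketch of this random-effects analysis follows, using the lme4/lmerTest packages (the chapter does not state which software was used) and assuming a hypothetical long-format data frame dat that now includes the Period 1 rows for the dropouts.

# Random subject effects with Kenward-Roger denominator degrees of freedom
library(lmerTest)   # loads lme4 and adds ddf options (requires pbkrtest)
fit <- lmer(y ~ period + treatment + (1 | subject), data = dat, REML = TRUE)
summary(fit, ddf = "Kenward-Roger")   # KR-adjusted tests of the fixed effects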
Random-effects models can also be useful when the cross-over design is of the
incomplete block type, i.e., where each subject does not get every one of the possible
treatments and therefore receives an incomplete set of treatments. An example of
such a design and its analysis will be given in section “Example 3: Incomplete Block
Design.”

Williams Cross-over Design

Introduction

A limitation of the 2 × 2 cross-over design is that it does not permit any carry-over
effects to be estimated using within-subject information. The use of adequate
washout periods, of course, can remove any pharmacological carry-over effects
and make redundant the need to estimate them. However, when more than two
treatments are to be compared over several periods, the use of long wash-out periods
can be impractical as the longer a trial takes, the greater the chance that subjects will

drop out before completing all the periods. Longer trials also mean that the time a
successful new drug will take to reach the patients who need it will be extended. If
wash-out periods are removed, or if there is doubt that any carry-over effects can be
removed using the maximum permissible length of wash-out period, then cross-over
designs that allow the estimated treatment effects to be adjusted for the presence of
any carry-over effects will be needed. In other words, if there are any additive carry-
over effects that cannot be removed by the design of the study, suitable alternative
designs will have to be used.
Fortunately, there are many such designs for two or more treatments in two or
more periods. Tables of suitable designs are given in Chapter 4 of Jones and
Kenward (2014), and these should be referred to for examples of designs not
illustrated in this chapter. One class of designs is the Williams design. These designs
fall into two types, depending on whether t, the number of treatments, is even or odd.
If t is even, then the basic design requires t subjects and t periods. If t is odd, then the
basic design requires 2t subjects and t periods. Examples of these basic designs for
t = 3 and t = 4 are given in Tables 10 and 11, respectively. The basic designs,
obtained from published tables, or by computer search, can be thought of as designs
with one subject allocated to each sequence group. The rows of these tables give the
sequences (which define the sequence groups) to be used in the trial. In the actual
trial, the available subjects are allocated at random to the sequences to form the
sequence groups. Usually, the sequence groups are of the same size. Whereas, the
2  2 design has two sequence groups, the Williams design has t or 2 t sequence
groups. An algorithm to determine the basic sequences in a Williams design for all
values of t is given in Chapter 4 of Jones and Kenward (2014).
As with all clinical trials, at the planning stage, it is necessary to determine how
many subjects in total are needed to achieve a given power, e.g., 90%, to detect a
given treatment difference of interest at a specified significance level (e.g., 0.05). In

Table 10 Williams cross-over trial to compare three treatments


Group Period 1 Period 2 Period 3
1 A B C
2 B C A
3 C A B
4 C B A
5 A C B
6 B A C

Table 11 Williams cross-over trial to compare four treatments


Group Period 1 Period 2 Period 3 Period 4
1 A D B C
2 B A C D
3 C B D A
4 D C A B

the example that follows for t = 5, 80 subjects were needed, requiring, therefore,
that 8 subjects be assigned at random to each of the 10 sequence groups.
To calculate the sample size required to ensure a given power for a particular
comparison, it is necessary to derive, for the particular design under consideration,
the standard error of the estimated comparison as a function of n, the size of each
sequence group (assuming equal group sizes are required). Once this has been
obtained, standard formulae for sample sizes to compare two groups can then be
modified to include this standard error, rather than the usual standard error for the
difference of two means. How to do this is beyond the scope of this chapter and
typically requires the use of purpose-written software.
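
As a hedged illustration (not the purpose-written software referred to above), the sketch below computes the power of a single pairwise comparison for this five-treatment Williams design. It uses the variance multiplier 0.2111σ²/n derived in the efficiency calculation later in this section, and residual degrees of freedom implied by fitting Model (2) with s = 10 groups of size n and p = 5 periods; delta and sigma are hypothetical inputs.

# Power for one pairwise comparison in the 5-treatment Williams design
power_williams <- function(n, delta, sigma, alpha = 0.05) {
  se    <- sigma * sqrt(0.2111 / n)  # SE of a pairwise difference, Model (2)
  df    <- 40 * n - 12               # residual df (308 when n = 8; cf. Table 14)
  tcrit <- qt(1 - alpha / 2, df)
  ncp   <- delta / se                # noncentrality under the alternative
  pt(-tcrit, df, ncp) + 1 - pt(tcrit, df, ncp)   # two-sided power
}
power_williams(n = 8, delta = 0.1, sigma = 0.12)  # hypothetical delta, sigma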
The general property of a Williams design is that, if each pair of consecutive periods
is considered, and the counting is over all the sequence groups in the basic design, then
each treatment occurs an equal number of times before each other treatment, except
itself. This ensures that each of the t possible carry-over effects occurs with each of the
other t – 1 treatments an equal number of times. For example, in Table 10, each
treatment occurs twice before each of the other two treatments. In Table 11, each
treatment occurs once before each of the other three treatments. This type of design
balances out the carry-over effects, and as a consequence, the estimates of the
treatment comparisons adjusted for the presence of carry-over effects can be made
using within-subject information. In addition, these designs typically have variances of
the estimated treatment effects that are not excessively large compared to the situation
where carry-over effects are assumed to be absent and are not adjusted for.

Example of a Cross-over Trial with Five Treatments

The data for this example are a simulated version of those taken from a Phase III,
randomized, double-blind, placebo-controlled trial. The trial compared a novel drug
at two doses (labeled here as B and C, where B is the lower dose) with a placebo drug
(labeled here as A) in subjects with moderate to severe Chronic Obstructive Pulmo-
nary Disease (COPD). Also included in the trial were two other drugs which acted as
positive controls: Salbutamol and a combination of Salmeterol and Fluticasone
(labeled here as D and E, respectively). The trial population consisted of 80 adult
males and females (age 40 years and over) with a clinical diagnosis of moderate-to-
severe COPD. The efficacy measurement of interest was the Forced Expired Volume
in 1 second (FEV1) measured in liters and taken at 5 minutes postdose. The objective
of the trial was to determine if either B or C or both were superior to A (Placebo).
Other pairwise comparisons were considered to be of secondary importance.
As there are five different treatments, this study was designed as a Williams
design with five periods and ten sequence groups as given in Table 12. A wash-out
period of 7 days was used between the treatment periods. Although the length of
the wash-out periods was considered adequate to remove any carry-over effects,
there was some uncertainty about this, and so a Williams design was used to ensure
that the treatment effects could be adjusted for carry-over effects. The primary

Table 12 Williams design to compare five treatments for the COPD


Group Period 1 Period 2 Period 3 Period 4 Period 5
1 D C A B E
2 E A B D C
3 B E C A D
4 C B D E A
5 A D E C B
6 E B A C D
7 C D B A E
8 D A C E B
9 A E D B C
10 B C E D A

analysis model is therefore Model (2) and contains terms for subjects, periods,
treatments, and carry-over effects.
To calculate the efficiency of this Williams design, the lower bound is calculated
using r, the number of times each treatment occurs in the basic design, i.e., r =
N/t = 50/5 = 10. Hence, Vbound = 2σ²/10 = 0.2σ². The variance of a pairwise
comparison in this design, assuming a model without carry-over effects, is also
0.2σ². Therefore, the design efficiency, in the absence of carry-over effects, is
100 × 0.2/0.2 = 100%. If, as in this example, carry-over effects are included in
the model, the variance of a pairwise comparison increases to 0.2111σ², giving an
efficiency of 100 × 0.2/0.2111 = 94.74%.
There is clearly a price to be paid for allowing for the presence of carry-over
effects, but in this case it is not high: The relative increase in sample size is less than
6% (100/94.74 = 1.055). That is, allowing for differing carry-over effects requires
about 6% more subjects.
Eight subjects were randomized to each of the sequence groups. As an illustra-
tion, the data for the subjects in the first sequence group are given in Table 13.
As a first step in the analysis of these data, the raw treatment means obtained from
the complete dataset are plotted in Fig. 4, where the treatments have been labeled as
A = 1, B = 2, . . ., and E = 5.
It can be seen that there is evidence that treatment C (high dose of the novel drug)
gives the highest mean FEV1 response, and treatment A (Placebo) has the lowest
mean.
This will be explored further by fitting Model (2) to the responses. As every
subject gets every treatment, and there are no missing values, it does not matter if the
subject effects are considered to be fixed or random. An example of a design where it
does matter will be given in the next section.
The analysis of variance table obtained by fitting this model is given in Table 14.
From Table 14, it can be seen that the p-value for carry-over is significant at the
0.05 level, indicating that there is evidence of differences between the carry-over
effects. In retrospect, therefore, it was wise that the Williams design had been used

Table 13 Group 1 treatment sequence (D C A B E) FEV1 (liters)


Subject Period 1 Period 2 Period 3 Period 4 Period 5
1 1.601 1.493 1.748 1.795 1.510
2 1.212 1.395 1.308 1.250 1.429
3 2.281 2.358 2.362 2.376 2.538
4 1.587 1.904 1.597 1.485 1.663
5 1.202 1.373 1.341 1.328 1.218
6 1.398 1.708 1.327 1.506 1.696
7 1.568 1.677 1.250 1.321 1.496
8 1.395 1.259 1.340 1.245 1.153

[Fig. 4: Plot of raw treatment means with 95% confidence intervals; x-axis: treatment (A–E), y-axis: FEV1]

because it allows estimation of the treatment differences, even in the presence of


carry-over effects.
To simplify the presentation, Table 15 shows only the comparisons of B
versus A and C versus A.
As was noted in the description of this trial, the primary objective was to
determine if either B or C or both are superior to A (Placebo). As there are two
comparisons of interest, it is necessary to apply a multiplicity adjustment to the
testing procedure in order to ensure that the family-wise error rate is not inflated
above 0.05. See Bretz et al. (2016), for example, for a general coverage of
approaches for dealing with multiple testing issues. One simple, although potentially
conservative, way to do this is to use a Bonferroni adjustment and halve the required
significance level of each test (i.e., test each one at significance level 0.025 instead of
0.05). It can be seen that both p-values are much smaller than 0.025, indicating both

Table 14 Analysis of variance for Williams design [df = degrees of freedom, SS = sums of
squares, MS = mean square, and F = F-ratio]
Source df SS MS F P-value
Between-subjects 79 68.3215 0.8648 60.34 <0.0001
Period 4 0.0178 0.0044 0.31 0.8711
Treatment 4 0.6688 0.1672 11.67 <0.0001
Carry-over 4 0.1505 0.0376 2.63 0.0348
Residual 308 4.4145 0.0143
Total 399 73.9643

Table 15 Pairwise differences in the fitted means for the Williams design
Parameter Estimate Standard error One-sided p-value
B-A 0.0470 0.0194 0.0082
C-A 0.1275 0.0194 <0.0001

are significant at the overall significance level of 0.05. In summary, it can be


concluded that both B and C are superior to A.
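
The Bonferroni step described above can be checked with base R's p.adjust, which multiplies each p-value by the number of tests; here 1e-04 stands in as an upper bound for the "<0.0001" entry of Table 15.

# Bonferroni-adjusted p-values for the two primary comparisons
p.adjust(c(B_vs_A = 0.0082, C_vs_A = 1e-04), method = "bonferroni")
# both adjusted values remain below 0.05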

Example 3: Incomplete Block Design

Quite often it is not possible to use as many periods in a cross-over trial as there are
treatments. This could be because there is a concern that subjects may not want to
stay in the trial long enough to complete all t periods and will drop out before the end
of the trial. Another reason could be that, with the inclusion of lengthy wash-out
periods, the trial may take too long if all t periods have to be completed. In any case,
trials with p < t are not uncommon, and the third illustrative example has t ¼ 4 and
p ¼ 3. Jones and Kenward (2014) give many examples of such incomplete block
designs and their sequence groups. Their associated efficiencies can be obtained by
referring to that book or by using the R package Crossover (Rohmeyer (2014)).
This example is also a multicenter, Phase III, cross-over trial where, for confiden-
tiality reasons, the actual data have been replaced with simulated values. The aim of the
trial was to assess the efficacy of a drug B in subjects with moderate to severe COPD.
The trial also included an active control, C, another treatment of interest (labeled as A)
and a placebo drug D. The main aim of the trial was to compare B with D, with the
other comparisons being of secondary interest. To limit the length of the trial, the four
treatments were given over three periods in an incomplete block design. The study
population consisted of a representative group of adult males and females aged 40 years
and over with a clinical diagnosis of moderate to severe COPD. The efficacy variable of
interest was the FEV1. The structure of the trial included a 14-day run-in period used to
assess eligibility of subjects for the study. Each treatment period lasted 14 days
followed by a 14-day wash-out period (after Periods 1 and 2 only).

The sequences used in this design are given in Table 16, and in the trial four
subjects were allocated to each sequence group. This design is fully balanced in the
sense that each estimated pairwise comparison has the same variance.
To calculate the efficiency of this incomplete block design, the lower bound is
calculated using the number of times each treatment occurs in the basic design, i.e.,
r = 9. Hence, Vbound = 2σ²/9 = 0.2222σ². The variance of a pairwise comparison in
the basic design, assuming a model without carry-over effects, is 0.2500σ². There-
fore, the design efficiency, in the absence of carry-over effects, is 100 × 0.2222/
0.2500 = 88.89%. If carry-over effects are included in the model, the variance of a
pairwise comparison increases to 0.3088σ², giving a relatively low efficiency of
100 × 0.2222/0.3088 = 71.96%. Fortunately, given the relatively long wash-out
periods, there was no expectation in this trial that carry-over effects would be
present.
As an illustration, the data from the first sequence group are given in Table 17.
The FEV1 values obtained at the end of each period are given for each subject,
along with the baseline measurement taken at the end of the run-in period and at the
end of each wash-out period before the start of the following period. These baseline
measurements will be ignored for now, and only the responses given in the columns
that have headings Period 1, Period 2, and Period 3 will be analyzed. The analysis
using baselines will be discussed in section “Use of Baseline Measurements.”

Table 16 Incomplete block cross-over design to compare four treatments in three periods
Group Period 1 Period 2 Period 3
1 A B C
2 B A D
3 C D A
4 D C B
5 A D B
6 B C A
7 C B D
8 D A C
9 A C D
10 B D C
11 C A B
12 D B A

Table 17 Group 1 (A B C) FEV1 (mL)


Subject Baseline 1 Baseline 2 Baseline 3 Period 1 Period 2 Period 3
1 256.25 278.53 284.06 265.16 311.23 284.20
2 262.32 244.86 253.60 301.31 264.86 272.95
3 174.87 206.42 204.57 226.60 253.41 245.73
4 232.53 259.66 226.48 244.33 267.89 240.39

As carry-over effects were not anticipated, they will not be included in the fitted
model. A new feature of the analysis of an incomplete block design is that the
question as to whether the subject effects sik are assumed to be fixed or random
effects is now highly relevant. If they are assumed to be fixed effects, the comparison
of treatments uses only the information that is available from within-subject com-
parisons between the responses on a subject. If the subject effects are assumed to be
random, then some additional information can be obtained from the between-subject
comparisons (often referred to as the inter-block information). This recovery of
information is typically most advantageous when the cross-over design has low
efficiency for the treatment comparisons or a low to moderate correlation between
the repeated measurements on each subject and a large number of subjects. As
already noted, this final example design has quite high efficiency (88.89%), so the
recovery of inter-block information may not make much of a difference to the
analysis of the data. Nevertheless, for the purposes of providing an illustration, the
inter-block information will be recovered. The parameter estimates obtained from
the fixed-subject effects model are given in Table 18. Note that the period effects are
the differences compared to Period 3, and the treatment effects are the differences
compared to treatment D.
When the subject effects are assumed to be random, the REML estimates of the
treatment and period effects are as given in Table 19. It can be seen that the standard
errors of the treatment estimates are slightly smaller when REML is used. It should be
noted that the Kenward-Roger adjustment (Kenward and Roger (1997)) has been used.
As part of the output from fitting this model using standard statistical software
(not shown), estimates of the variance components may also be obtained:
$\hat{\sigma}^2 = 361.35$ and $\hat{\sigma}_s^2 = 1062.78$. The within-subject correlation between any pair of
repeated measurements on a subject is $\rho = \sigma_s^2 / (\sigma_s^2 + \sigma^2)$ and is estimated as
1062.78/(1062.78 + 361.35) = 0.75. Therefore, not only does this design have a

Table 18 Parameter estimates obtained from model with fixed subject effects
Effect Estimate Standard error Two-sided p-value
Period 1 0.4381 3.8798 0.9103
Period 2 0.7450 3.8798 0.8482
Treatment A 31.5485 4.7518 < 0.0001
Treatment B 29.8806 4.7518 < 0.0001
Treatment C 29.1396 4.7518 < 0.0001

Table 19 Parameter estimates obtained using REML


Effect Estimate Standard error Two-sided p-value
Period 1 0.4381 3.8802 0.9103
Period 2 0.7450 3.8802 0.8482
Treatment A 31.8566 4.7261 < 0.0001
Treatment B 30.5531 4.7261 < 0.0001
Treatment C 30.2235 4.7261 < 0.0001

high efficiency, but the estimated within-subject correlation is also large, implying a
large between-subject variance. For well-designed cross-over trials, this is often the
case, and in this situation, as has been already mentioned, the recovery of inter-block
information is unlikely to make much difference to the estimates of the treatment
comparisons and their estimated standard errors.

Use of Baseline Measurements

When a measurement of the response is taken prior to the start of each period, these
baseline measurements may be useful in increasing the precision of the estimated
treatment effects. Whether they are useful or not depends on the degree and type of
correlation structure between the response and the baseline measurements. See
Chapter 5 of Jones and Kenward (2014) for more details.
Two approaches that may be considered when making use of baseline measure-
ments are to (1) analyze the change from baseline measurements, i.e., for each
subject and period replace the response by the difference between the response
and the baseline value for that period or (2) include the baseline measurements as
covariates. The inclusion of covariates is the recommended approach. See Senn
(2006) for further discussion on this.
In this section, the data from section “Example 3: Incomplete Block Design” are
reanalyzed, but now making use of the baseline measurements that were taken at the
start of each treatment period. Table 17 gives these baseline values for the subjects in
the first sequence group.
Typically, the analysis of the changes from baseline is only worth considering if
the response and its associated baseline are close together in time, compared to the
gap between periods and if the variability of the baselines is considerably less than
that of the response, which may happen if the baseline is, in fact, the average over
several baseline measurements, for example.
In a fixed subject effects model, it is sufficient to include the baseline (from each
period) as a single covariate.
However, if a random subject effects model is used to recover between-subject
information, two separate covariates must be included for each subject: Covariate (a)
is the average over the p baselines, and Covariate (b) is the difference from this
average of each of the p baseline measurements. See Chapter 5 of Jones and
Kenward (2014) for more details.
To illustrate the construction of the covariates, Table 20 shows their values for the
first subject in Table 17 (and for one other subject, number 48). For example, the
value of Covariate (a) is (256.25 + 278.53 + 284.06)/3 = 272.9467 ≈ 272.95, and the
value of Covariate (b) for Period 1 is 256.25 − 272.95 = −16.70.
Assuming that there are no carry-over effects due to the inclusion of the washout
periods (i.e., Model (1) is assumed), the extension of the previous analysis to include
baselines is now illustrated.
Table 21 shows the estimate and its standard error for the comparison of B versus
D, for a selection of models and where the subject effects are either fitted as fixed or

Table 20 Factors, response, baselines, and covariates for the first two subjects
Subject Period Drug Baseline Response Covariate (a) Covariate (b)
1 1 1 256.25 265.16 272.95 −16.70
1 2 2 278.53 311.23 272.95 5.58
1 3 3 284.06 284.20 272.95 11.11
48 1 4 185.10 203.00 177.07 8.03
48 2 2 190.20 213.08 177.07 13.13
48 3 1 155.90 206.93 177.07 −21.17

Table 21 Parameter estimates and standard errors from analyses with and without baselines (B
versus D)
Fixed subject effects Random subject effects
Model Estimate Standard error Estimate Standard error
(A) Response only 29.881 4.752 30.553 4.726
(B) Change from baseline 26.196 3.701 26.820 3.668
(C) Single period baseline 27.392 3.500 27.871 3.324
(D) Both baseline covariates 27.391 3.500 28.034 3.323

random effects. All models contain factors for the subjects, periods, and treatments.
The fitted models are: (A) response only and no covariates; (B) the change from
baseline in each period is used as the response, and no covariates are added; (C) the
response is fitted with each period baseline used as the covariate; and (D) the
response is fitted with Covariates (a) and (b).
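
A hedged sketch of how Covariates (a) and (b) can be constructed and Models (C) and (D) fitted in R, assuming a hypothetical long-format data frame dat with factor columns subject, period, and treatment and numeric columns baseline and y:

# Covariate (a): subject mean baseline; Covariate (b): per-period deviation
dat$cov_a <- ave(dat$baseline, dat$subject)   # default FUN is mean
dat$cov_b <- dat$baseline - dat$cov_a
# Model (C), fixed subject effects: single period-baseline covariate
fit_C <- lm(y ~ subject + period + treatment + baseline, data = dat)
# Model (D), random subject effects with both covariates, fitted by REML
library(nlme)
fit_D <- lme(y ~ period + treatment + cov_a + cov_b,
             random = ~ 1 | subject, data = dat, method = "REML")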
It is immediately clear that making use of the baselines does increase the precision
of estimation. For example, the standard error drops from 4.75 to 3.50 when one or
both covariates are added to the fixed effects model. In this example, the use of the
change from baseline is almost as good as using the covariates, although that may not
always be the case. Because it is already known from the analysis reported in the
previous section that there is little advantage in recovering the between-block infor-
mation, there is not a great difference in the respective results for the fixed and random
effects models. Indeed, as explained in Chapter 5 of Jones and Kenward (2014), the
use of both covariates is only needed when there is some incompatibility between the
within-subject and between-subject estimates. If there is little to no between-subject
information on a treatment comparison, as here, then it is not necessary to fit both
covariates. In fact, in such a situation, using the fixed effects model is recommended.

Summary and Conclusion

This chapter has concentrated on the design and analysis of continuous data from
cross-over trials. Examples where p ¼ t and p < t have been given, and the use of
period baseline covariates has been illustrated. Designs also exist for p > t, and Jones
and Kenward (2014) should be consulted for examples of these.

As a thorough treatment of the analysis of binary and categorical data from cross-
over trials is beyond the scope of this chapter, the reader is referred to Chapter 6 of
Jones and Kenward (2014) for a detailed coverage. However, it should be noted that
for binary data from the 2 × 2 cross-over design, simple analyses based on 2 × 2
contingency tables are available. Chapter 2 of Jones and Kenward (2014) gives the
details.

Key Facts

In a cross-over trial, each subject receives a series of treatments over a fixed number
of periods. When used appropriately, cross-over designs require fewer subjects to
achieve a given level of precision or power compared to their parallel groups
counterparts. When there are more than two treatments or periods, choices between
designs can be made using their efficiencies. Models used to fit data from cross-over
trials may include subject effects as fixed or random variables. For designs of the
incomplete block type, the use of random subject effects permits the recovery of
between-subject information on treatment comparisons. The use of baseline mea-
surements, taken before the start of each period, may be useful to increase the
precision of the treatment comparisons.

Cross-References

▶ Controlling for Multiplicity, Eligibility, and Exclusions


▶ Power and Sample Size

References

Altman DG (1991) Practical statistics for medical research. Chapman and Hall, London
Anderson TW, Darling DA (1952) Asymptotic theory of certain "goodness-of-fit" criteria based on
stochastic processes. Ann Math Stat 23:193–212
Bretz F, Hothorn T, Westfall P (2016) Multiple comparisons using R. CRC Press, Boca Raton
Chi EM, Reinsel GC (1989) Models for longitudinal data with random effects and AR(1) errors. J
Am Stat Assoc 84:452–459
Clark B, Smith D (1981) Introduction to pharmacokinetics. Blackwell Scientific Publications,
Oxford
FDA (2013) Guidance for industry: bioequivalence studies and pharmacokinetic endpoints for
drugs submitted under an ANDA. Food and Drug Administration
Freeman P (1989) The performance of the two-stage analysis of two treatment, two period crossover
trials. Stat Med 8:1421–1432
Grizzle JE (1965) The two-period change-over design and its use in clinical trials. Biometrics
21:467–480
Havliceck LL, Peterson NL (1974) Robustness of the t-test: a guide for researchers on the effect of
violations of assumptions. Psychol Rep 34:1095–1114
Hollander M, Wolfe D (1999) Nonparametric statistical methods, 2nd edn. Wiley, New York
Jones B, Kenward MG (2014) Design and analysis of cross-over trials, 3rd edn. CRC Press, Boca
Raton
Kenward MG, Roger JH (1997) Small sample inference for fixed effects estimators from restricted
maximum likelihood. Biometrics 53:983–997
Patterson HD, Lucas HL (1962) Change-over designs. North Carolina Agricultural Station, Tech
Bull 147
Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are
unequal. Biometrika 58:545–554
Piantadosi S (1997) Clinical trials: a methodological perspective. Wiley, New York
Rohmeyer K (2014) Crossover. R package, version 0.1-16
Senn S (2006) Change from baseline and analysis of covariance revisited. Stat Med 25:4334–4344
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples).
Biometrika 52:591–611
71 Factorial Trials

Steven Piantadosi and Susan Halabi

Contents
Introduction 1354
Characteristics of Factorial Designs 1355
  Interactions or Efficiency, But Not Both Simultaneously 1355
  Factorial Designs Are Defined by Their Structure 1355
  Factorial Designs Can Be More Efficient 1357
Design and Analysis of Factorial Trials 1358
  Design Without Interaction 1358
  Design with Interaction 1360
  Designs with Biomarkers 1361
  Analysis of Factorial Trials 1363
Treatment Interactions 1364
  Factorial Designs Are the Only Way to Study Interactions 1364
  Interactions Depend on the Scale of Measurement 1366
  The Interpretation of Main Effects Depends on Interactions 1366
  Analyses Can Employ Linear Models 1368
Examples of Factorial Designs 1370
Partial, Fractional, and Incomplete Factorials 1372
  Use Partial Factorial Designs When Interactions Are Absent 1372
  Incomplete Designs Present Special Problems 1373
Summary 1373
Summary and Conclusions 1374
Key Facts 1374
Cross-References 1374
References 1374

S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]
S. Halabi
Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
e-mail: [email protected]


Abstract
Factorial clinical trials test the effects of two or more therapies using a design that
can estimate interaction between therapies (Piantadosi 2017). (This chapter
revises, updates, and expands upon reference (Piantadosi 2017)) A factorial
structure is the only design that can assess treatment interactions, so this type of
trial is required for those important therapeutic questions. When interactions
between treatments are absent, which is not a trivial requirement, a factorial
design can estimate each of several treatment effects from the same data. For
example, two treatments can sometimes be evaluated using the same number of
subjects ordinarily used to test a single therapy. When possible, this demonstrates
a striking efficiency. For these reasons, factorial designs have an important place
in clinical trial methodology, and have been applied in a variety of settings, but in
particular in disease prevention.

Keywords
Factorial clinical trials · Treatment interactions · Factorial designs

Introduction

Factorial clinical trials test the effects of two or more therapies using a design that
can estimate interaction between therapies (Piantadosi 2017). A factorial structure is
the only design that can assess treatment interactions. When interactions between
treatments are absent, a factorial design can estimate each of several treatment effects
from the same data. For example, two treatments can sometimes be evaluated using
the same number of subjects ordinarily used to test a single therapy. For these
reasons, factorial designs have an important place in clinical trial methodology,
and have been applied in a variety of settings, but in particular in disease prevention.
Historically, control variables in experiments were called factors. For example,
a factor can be defined by the presence or absence of a single drug. A factor can
have more than one level, as indicated by different doses of the same drug. A factor
is not strictly qualitative: the choice between two treatments, A and B, is not a factor
(assuming that neither one is a placebo). Many factors have only two levels (present or
absent) and are therefore both ordinal and qualitative. In a factorial design, all
factors are varied systematically, with some groups receiving more than one
treatment, and the experimental groups are arranged in a way that may permit testing
whether a combination of treatments is better or worse than the individual treatments,
although the power for such comparisons is often limited.
The method of varying more than one factor or treatment in a single study was
used in agricultural experiments before 1900. It was developed and popularized by
R. A. Fisher (1935, 1960) and Yates (1935), and used to great advantage in both
agricultural and industrial experiments. In medicine, factorial designs have been
used more in prevention trials than therapeutic studies.

Factorial designs carry important assumptions that must be understood before
deciding if one is the best choice for a therapeutic question. The critical issue is to
distinguish between an investigation of treatment interactions versus efficient testing
of multiple noninteracting individual therapies; this distinction is developed in the
following section. More complete discussions of factorial designs, especially
pertaining to cancer prevention trials, can be found in Byar and Piantadosi (1985)
and Byar et al. (1993). Discussion of these designs related to cardiovascular trials,
particularly in the context of the ISIS-4 study (Flather et al. 1994), can be found in
Lubsen and Pocock (1994) and McAlister et al. (2003).

Characteristics of Factorial Designs

Interactions or Efficiency, But Not Both Simultaneously

Factorial designs embody an essential dichotomy mentioned above that is a source of
frequent misunderstanding. The same structural design can be used either to gain
substantial efficiency in questions about individual treatments, or to estimate the
interaction between treatments. Both objectives cannot be met in the same trial,
because they require very different sample sizes. The design of a factorial trial can
therefore appear conflicting or confusing unless we understand which purpose is
intended.
To summarize details below, a factorial design can estimate efficacy for each of two
therapies with one sample size only if the treatments are known not to interact with one
another (section “Factorial Designs Can Be More Efficient”), or if any interaction
between the treatments is negligible relative to the main effect of individual treatments.
This two-for-one efficiency is therefore predicated on strong biological knowledge.
When therapeutic interaction is the topic of inquiry, only a factorial structure can
estimate it, but at the cost of a sample size that is roughly four times larger than usual
(section “Factorial Designs Are the Only Way to Study Interactions”). Hence, the two
possible objectives implied by a factorial structure cannot be met simultaneously.
When the design is derived from the question rather than the other way around,
investigators are likely to employ the factorial structure appropriately.

Factorial Designs Are Defined by Their Structure

The simplest factorial design has treatments A and B, and four treatment groups
(Table 1). Assume n subjects are entered into each of the four treatment groups for a
total sample size of 4n and a balanced (equal allocation) design. One group receives
neither A nor B, a second receives both A and B, and the other two groups receive
only one of A or B. This is called a 2 × 2 (two-by-two) factorial design. Although
basic, this design illustrates many of the general features of factorial experiments.
The design generates enough information to test the effects of A alone, B alone, and
A plus B. The efficiencies in doing so will be presented below.

Table 1 Four treatment groups and sample size in a 2 × 2 balanced factorial design

                 Treatment B
Treatment A      No     Yes    Total
No               n      n      2n
Yes              n      n      2n
Total            2n     2n     4n

Table 2 Eight treatment groups and sample size in a 2 × 2 × 2 balanced factorial design

Group   A     B     C     Sample size
1       No    No    No    n
2       Yes   No    No    n
3       No    Yes   No    n
4       No    No    Yes   n
5       Yes   Yes   No    n
6       No    Yes   Yes   n
7       Yes   No    Yes   n
8       Yes   Yes   Yes   n

The 2 × 2 design generalizes to higher order designs in a straightforward manner.
For example, a factorial design studying three treatments, A, B, and C, is the 2 × 2 × 2.
Possible treatment groups for this design are shown in Table 2. The total sample size
is 8n if all treatment groups have n subjects.
Aside from illustrating the factorial structure, these examples highlight some of
the prerequisites and restrictions for using a factorial trial. First, the treatments must
be amenable to being administered in combination without changing dosage in the
presence of each other. For example, in Table 1, we would not want to reduce the
dose of A in the lower right cell where B is present. This requirement implies that the
side effects of the treatments cannot be cumulative to the point where the combina-
tion would be difficult to administer.
Second, it must be ethically acceptable to administer individual treatments or
administer them at lowered doses. In some situations, this means having a
no-treatment or placebo group in the trial. In other cases, A and B may be admin-
istered in addition to a standard therapy, so all groups receive some treatment. An
example might be a factorial trial of chemotherapy and prophylactic brain radiother-
apy in subjects with lung cancer, all of whom received chest radiotherapy. Third, we
must be genuinely interested in learning about the treatment combinations or else
some of the treatment groups would be unnecessary. Alternatively, to use the design
to achieve greater efficiency in studying two or more treatments, we must know that
some interactions do not exist.
Fourth, the therapeutic questions must be chosen appropriately. We would not use
a factorial design to test treatments that have exactly the same mechanisms of action,
such as two angiotensin converting enzyme (ACE) inhibitors for high blood pres-
sure, because either agent would answer the question. Treatments acting through
different mechanisms would be more appropriate for a factorial design. In some
prevention factorial trials, the treatments tested may target different diseases in the
same cohort.

Factorial Designs Can Be More Efficient

Although their scope is limited, factorial designs offer certain important efficiencies
or advantages when they are applicable. To illustrate how this occurs, consider the
2 × 2 design and the estimates of treatment effects that would result using an additive
model for analysis (Table 3). Assume that the responses are group averages of some
normally distributed response, denoted by Ȳ. The subscripts on Ȳ indicate which
treatment group it represents. Note that half the subjects receive one of the treat-
ments. This is also true in higher order designs. For the moment, further assume that
the effect of A is not influenced by the presence of B.
There are two estimates of the effect of treatment A compared with placebo in the
design, Ȳ_A − Ȳ_0 and Ȳ_AB − Ȳ_B. If B does not modify the effect of A, it is sensible to
combine or average them to estimate the overall, or main, effect of A, denoted here
by β_A,

\beta_A = \frac{(\bar{Y}_A - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_B)}{2}    (1)
Similarly,

\beta_B = \frac{(\bar{Y}_B - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_A)}{2}    (2)
Thus, in the absence of interactions, which means the effect of A is the same with
or without B, and vice versa, the design permits the full sample size to be used to
estimate two treatment effects.
Now suppose that each subject's response has variance σ² and that it is the same
in all treatment groups. We can calculate the variance of β_A to be

\text{var}(\beta_A) = \frac{1}{4} \cdot \frac{4\sigma^2}{n} = \frac{\sigma^2}{n}.
This is exactly the same variance that would result if A were tested against placebo
in a single two-armed comparative trial with 2n subjects in each treatment group.
Similarly,

\text{var}(\beta_B) = \frac{\sigma^2}{n}.

Table 3 Treatment effects in a 2 × 2 factorial design

                 Treatment B
Treatment A      No       Yes
No               Ȳ_0      Ȳ_B
Yes              Ȳ_A      Ȳ_AB
However, if we tested A and B separately, we would require 4n subjects in each
trial or a total of 8n subjects to have the same precision obtained from half as many
subjects in the factorial design. Thus, in the absence of interactions, these designs
allow great efficiency in estimating main effects. In fact, in the absence of interac-
tion, we get two trials for the price of one. Tests of both A and B can be conducted in
a single factorial trial with the same precision as two single-factor trials using twice
the sample size.
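This efficiency claim is easy to verify numerically. The R sketch below simulates a 2 × 2 factorial with no interaction and checks that the empirical variance of the estimator in Eq. 1 is close to σ²/n; the effect sizes are arbitrary illustrative values:

    # Empirical variance of the main-effect estimator of Eq. 1 under additivity
    set.seed(1)
    n <- 100; sigma <- 1
    est <- replicate(5000, {
      y0  <- rnorm(n, 0.0, sigma)   # neither treatment
      ya  <- rnorm(n, 0.5, sigma)   # A alone
      yb  <- rnorm(n, 0.3, sigma)   # B alone
      yab <- rnorm(n, 0.8, sigma)   # A and B together (additive: 0.5 + 0.3)
      ((mean(ya) - mean(y0)) + (mean(yab) - mean(yb))) / 2
    })
    var(est)   # approximately sigma^2 / n = 0.01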

Design and Analysis of Factorial Trials

The literature is rich in examples of therapeutic factorial designs (Henderson et al.
2003; Sikov et al. 2015), and there are different approaches to designing factorial
trials. Green (2005) provides a thorough explanation of the options in designing
factorial trials and of how to interpret such trials. We focus on the design of 2 × 2
factorial trials and discuss sample size computations for a time-to-event
endpoint, the most common endpoint that is utilized in phase III trials in cancer.
There are two main questions that need to be addressed in designing a factorial trial.
The first one pertains to whether one needs to adjust for the multiple hypotheses that
are being tested. The second one is whether an interaction among the treatment arms
exists. Unfortunately, there is no consensus in the literature on these two points. We
present three examples of: (1) a trial designed with no interaction without adjustment
for multiplicity, (2) a trial designed with no interaction and with adjustment for
multiplicity, and (3) a trial designed with interaction between the two treatment arms.
Moser and Halabi (2015) developed methodology for designing a factorial trial
when the primary endpoint is a time-to-event endpoint. A matrix formulation was
provided for calculating the required sample size to test a main effect or an interac-
tion term for a pre-specified type I error rate and power (Moser and Halabi 2015).

Design Without Interaction

Computing the sample size for a 2 × 2 factorial trial assuming no interaction between
the arms is straightforward. One can use the sample size formula for designing a trial
comparing two treatment arms (Moser and Halabi 2015; Rubinstein et al. 1981).
Suppose we are interested in testing the effect of two regimens in men with advanced
prostate cancer. Patients will be randomized with equal allocation to four treatment
groups: standard of care, experimental arm A, experimental arm B, or experimental
arms A + B (Table 4). Let λij be the hazard rate for the ith level (i = 1, 2) of treatment
A and the jth level (j = 1, 2) of treatment B. Overall survival (OS) is the primary
endpoint. Similar to Rubinstein et al. (1981), we make the
following assumptions:

Table 4 Hazard rates of a factorial design with two factors A and B for a time-to-event endpoint

                 Treatment B
Treatment A      No     Yes    Pooled
No               λ11    λ12    λ1.
Yes              λ21    λ22    λ2.
Pooled           λ.1    λ.2    λ..

(a) There is an accrual period [0, T], where T is the number of years. During this
accrual period, the patients enter the clinical trial according to a Poisson process
with n patients per year. The patients are randomized to the k treatment groups,
with probability P_j (0 < P_j < 1) of assignment to treatment j, where j = 1, 2, ..., k
and \sum_{j=1}^{k} P_j = 1.
(b) The patients are followed up for a period of τ years, known as the follow-up
period. The total length of the study is T + τ years.
(c) In treatment j, the failure or death times (the times from entry into the trial to
failure or death) are i.i.d. exponentials with hazard λ_j. Moreover, the failure times
across the treatment groups are assumed to be independent.
(d) The censoring times (the times from entry into the trial to loss to follow-up) are
i.i.d. exponentials with common hazard Φ_c.
(e) The failure times and censored times are independent.
(f) The censoring mechanism is random censoring.
(g) Constant treatment effect for both treatments A and B over time, i.e., the
proportional hazards assumption.

In designing such a 2 × 2 factorial trial without interaction, we are interested in
testing two hypotheses concerning the main effects. Based on historical data, the
median OS in men with advanced prostate cancer is 20 months in the standard of
care arm. Consider the data in Table 4. For the first factor, we are interested in
comparing the hazard rates for patients randomized to receive experimental treat-
ment A with patients who receive standard of care, i.e., the main focus is testing the
null hypothesis hazard ratio (Δ) = λ1./λ2. = 1 versus the alternative hypothesis
Δ = λ1./λ2. < 1. The target total sample size is 900 patients. With 722 deaths for
testing the efficacy of experimental arm A, the log-rank test has 85% power to detect
a 20% decrease in hazard rate (equivalent to an increase in median OS from
20 months to 25 months in patients randomized to the standard of care and exper-
imental treatment arm A, respectively; hazard ratio = 0.8) with a one-sided type I
error rate of 0.025. We make the following assumptions in the sample size compu-
tation: equal allocation to the treatment arms, OS follows the exponential distribu-
tion, fixed sample size (no interim analysis), an accrual rate of 30 patients/
month, an accrual period of 30 months, a follow-up period of 14 months, and a trial
duration of 44 months.
For testing the efficacy of treatment B (main effect), our objective is to compare
the hazard rates for patients randomized to standard of care or experimental arm B,
i.e., we focus on testing the null hypothesis Δ = λ.1/λ.2 = 1 versus the alternative
hypothesis Δ = λ.1/λ.2 < 1. The target number of deaths is 434 and the total sample

size is 750 patients. With 434 deaths, the log-rank test has 85% power to detect a
hazard ratio = 0.75 (assuming that the median OS = 20 months and 26.7 months in
the standard of care and treatment B arms, respectively) with a one-sided type I error rate
of 0.025. In designing the above trial, we base the sample size on testing the
hypothesis with the smallest effect size (comparing experimental arm A to the standard
of care) since its sample size is larger than what is required for testing the second
hypothesis (comparing experimental arm B to the standard of care). Thus, the target
sample size for this trial is 900 prostate cancer patients.
In the prostate cancer example we could adjust for the type I error rate using the
Bonferroni procedure (α/2) because we are testing two hypotheses. The required
number of events is 863 deaths for testing the first hypothesis, and the log-rank test
has 85% power to detect a 20% decrease in hazard rate (HR = 0.8) with a one-sided
type I error rate of 0.0125. As expected, we observe that the number of events has
increased drastically from 722 to 863 deaths (approximately a 20% increase). If we
assume the same sample size (900 patients) and the same accrual rate of 30 patients/
month, then the trial duration will be doubled from 44 months to 88 months.
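The event counts quoted above can be checked against the usual normal approximation for the log-rank test, under which the required number of deaths is d = 4(z_{1−α} + z_{1−β})²/(log HR)² for a one-sided level α. A hedged R sketch follows; this is only an approximation, and exact exponential-model calculations may differ slightly:

    # Approximate deaths required by the log-rank test (Schoenfeld's formula)
    events_needed <- function(hr, alpha, power)
      4 * (qnorm(1 - alpha) + qnorm(power))^2 / log(hr)^2
    events_needed(0.80, 0.025,  0.85)   # ~722 deaths (experimental arm A)
    events_needed(0.75, 0.025,  0.85)   # ~434 deaths (experimental arm B)
    events_needed(0.80, 0.0125, 0.85)   # ~863 deaths (Bonferroni-adjusted)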
Several authors have argued that there is no need to adjust for the type I error rate
in designing a factorial trial when several experimental arms are compared to a
control or the standard of care group (Freidlin et al. 2008; Wason et al. 2014;
Proschan and Waclawiw 2000). Their rationale is that such trials are designed to
answer the efficacy question for each experimental drug separately and as such the
results of one comparison should not influence the results of the other hypothesis.

Design with Interaction

Peterson and George (1993) developed the sample size required in a 2 × 2 factorial
trial in the presence of interaction when the endpoint is time-to-event.
Suppose we are interested in testing the effect of the interaction of the two
treatments in the prostate cancer example (Table 4), and our objective is to compare
the hazard ratios λ21/λ11 and λ22/λ12 (on a log scale). Let Δ1 = λ11/λ21,
Δ2 = λ12/λ22, and γ = Δ2/Δ1 be, respectively, the hazard ratio for patients
receiving treatment A versus standard of care, the hazard ratio for patients receiving
treatments A and B versus treatment B, and the ratio of the hazard ratios
(or interaction between the two treatments). The null hypothesis is
that there is no interaction between the two treatments (γ = 1) versus the alternative
hypothesis that there is an interaction between the treatment groups (γ ≠ 1). We assume
the median OS to be M11 = 19 months in patients randomized to standard of care,
M21 = 20 months in patients randomized to experimental arm A, M12 = 20 months in
patients randomized to experimental arm B, and M22 = 30 months in patients randomized
to experimental arms A and B (Table 4). In order to test this hypothesis, we need to
extend both the accrual period from 30 months to 48 months and the follow-up
period from 14 months to 30 months. The total sample size is now 1,440 patients and
the expected number of events is 1,151 at the end of the trial. The power to detect a
γ = 1.5 for the interaction between the treatment arms is 85%, assuming a one-sided
type I error of 0.025.
Another strategy to consider for the interaction between the treatments is to use
Simon's approach (Simon and Freedman 1997), which proposes inflating the sample
size by 30%. Thus, the sample size for the prostate cancer trial would be 1,170
(900 × 1.3) and the number of deaths 934. For the power computation, we assume
an accrual of 30 patients/month over a 48-month period and a follow-up period of
30 months. With 1,170 patients, the power to detect an interaction γ = 1.5 between the
two factors is 77%. Applying Simon's reasoning to the computation that we
performed above indicates, surprisingly, that the power for testing an interaction
term (γ = 1.5) is around 80%, assuming a one-sided type I error of 0.025.
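A rough normal-approximation counterpart for the interaction test illustrates why such trials must be large: with equal allocation across the four cells, the log ratio of hazard ratios has variance of approximately 16/d, where d is the total number of deaths. The R sketch below uses this crude approximation only; the Peterson and George (1993) calculations used in the text account for accrual, follow-up, and unequal event rates and give different numbers:

    # Crude approximation to the deaths needed to detect a ratio of hazard
    # ratios gamma in a 2 x 2 factorial (var(log gamma) ~ 16/d)
    d_interaction <- function(gamma, alpha = 0.025, power = 0.85)
      16 * (qnorm(1 - alpha) + qnorm(power))^2 / log(gamma)^2
    d_interaction(1.5)   # ~874 deaths by this approximation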
The main drawback to factorial trials is that often trials are not designed to test for
an interaction between the treatment groups and as a result such trials are usually
underpowered. While the examples above were based on testing superiority hypoth-
eses, a factorial trial can be designed to test a superiority and a non-inferiority
(or equivalence) hypothesis. For example, CALGB 80203 (NCT00077233) was
originally designed as a phase III 2 × 2 factorial trial to test two hypotheses. The
first hypothesis was to test whether the addition of C225 to FOLFOX or FOLFIRI
chemotherapy would improve OS in untreated metastatic colon cancer patients. The
second hypothesis was to test the equivalence of FOLFOX and FOLFIRI in OS in
untreated metastatic colon cancer patients. The trial was closed due to poor accrual.
Recently Freidlin and Korn (2017) argued that in designing factorial trials in
oncology, one needs to consider an interaction between the drugs as it is very likely
that the “no interaction” assumption is not a valid one. Moreover, the authors
advocate for matching the analysis with the trial design to achieve the objectives
of the trial.

Designs with Biomarkers

Tests of treatment by biomarker interaction are frequently implemented in cancer trials.
Gönen (2003) considers the planning of subgroup analyses for time-to-event out-
comes in a treatment by molecular marker factorial design. Factorial trials are the
only design in which an investigator can test for treatment-biomarker interaction. It
is important to note that testing for the treatment-biomarker interaction term
(a predictive marker) is often conducted after the treatment trial has been reported.
While treatment arm is defined by randomization, the biomarker status (e.g., positive
or negative biomarker) is defined by observation. For a time-to-event outcome, one
would use the proportional hazards model with:

\lambda(t \mid x_1, x_2) = \lambda_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2)

where λ_0(t) is the baseline hazard, x_1 = 0 or 1 represents the treatment arm, x_2 is the
biomarker level (usually measured as a continuous variable), and x_1 x_2 is the treat-
ment arm-biomarker interaction term. Under the null hypothesis, β_3 = 0, indicating
that there is no interaction between treatment arm and biomarker. If the interaction term
between a biomarker and a treatment is statistically significant, then the marker is consid-
ered to be a predictive biomarker of the outcome.
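In practice, the model above is fit directly. A minimal R sketch using the survival package, with a hypothetical data frame dat containing columns time, status, trt (0/1), and marker:

    # Treatment-by-biomarker interaction in a proportional hazards model
    library(survival)
    fit <- coxph(Surv(time, status) ~ trt * marker, data = dat)
    summary(fit)   # the trt:marker coefficient estimates beta3; its Wald
                   # test addresses H0: beta3 = 0 (no interaction)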
We illustrate the above concept by using an example from a phase III trial of
transitional cell carcinoma (Rosenberg et al. 2019) where 588 patients were ran-
domized to one of two treatment arms: (a) gemcitabine, cisplatin, and placebo, or
(b) gemcitabine, cisplatin, and bevacizumab. The investigator is interested in testing
the treatment by vascular endothelial growth factor (VEGF) level interaction using a
two-sided type I error rate of 0.05 to attain a power of 80%. To simplify the power
computation, we assume that the biomarker is binary (the VEGF level is dichoto-
mized based on an established cut point with positive and negative groups) and that
the prevalence of a positive VEGF marker is 30%. The primary endpoint is OS. The
median survival times (months) are assumed to be: M11 = 13.80 in patients random-
ized to gemcitabine, cisplatin, and placebo who have a negative VEGF biomarker;
M21 = 15.80 in patients randomized to gemcitabine, cisplatin, and placebo who have a
positive VEGF biomarker; M12 = 13.80 in patients randomized to gemcitabine,
cisplatin, and bevacizumab who have a negative VEGF biomarker; and M22 = 30.40
in patients randomized to gemcitabine, cisplatin, and bevacizumab who have a posi-
tive VEGF biomarker (Table 5). We also assume an accrual rate of 168 patients/
year for a total accrual period of 3.5 years and a follow-up period of 1 year. Using
the methodology developed by Peterson and George (1993), the null
hypothesis is that there is no interaction between the treatment arm-biomarker
groups (γ = Δ2/Δ1 = 1) versus the alternative hypothesis that there is an interaction
between treatment and biomarker (γ ≠ 1). With 428 deaths, the power is 83%, assuming a
two-sided type I error rate of 0.05, to detect an interaction (ratio of hazard ratios) of γ = 1.92.
The stratified biomarker trial is a powerful design that will test for the treatment
arm-biomarker interaction prospectively (Liu et al. 2014). The stratified biomarker
design is one where all patients regardless of their biomarker status are randomly
assigned to either an experimental arm or a control (Fig. 1) (Freidlin and Korn 2014).
For simplicity, we concentrate on a biomarker that is binary where patients can be
classified as either having a positive or negative biomarker status. The stratified
biomarker design is commonly used to test for the treatment-biomarker interaction
(Liu et al. 2014). In the stratified biomarker design, an investigator may test
whether the treatments differ in outcome within a biomarker group, whether the
clinical outcome within the same treatment differs between the biomarker groups, or
whether there is a treatment-biomarker interaction. An example of a stratified biomarker
trial is the MARVEL trial (NCT0073888) where tissues from consenting patients
were to be submitted for epidermal growth factor receptor (EGFR) evaluation. In the

Table 5 Median overall survival (months) for testing a treatment-biomarker interaction

                                           VEGF level
Treatment assignment                       Negative (low) VEGF   Positive (high) VEGF
Gemcitabine + cisplatin + placebo          M11 = 13.80           M21 = 15.80
Gemcitabine + cisplatin + bevacizumab      M12 = 13.80           M22 = 30.40

Fig. 1 Example of a stratified biomarker trial

MARVEL trial, about 1,200 non-small cell lung cancer patients were to be randomized to
either erlotinib or pemetrexed. The primary objective was to evaluate whether there
are differences in progression-free survival between erlotinib and pemetrexed within
the FISH positive and FISH negative subgroups. Unfortunately, the trial was closed
due to slow accrual. A stratified biomarker trial often requires a large sample size, a
validated cutoff point for the biomarker, and a biomarker prevalence appropriate for
testing the treatment-biomarker interaction.
When biomarkers are based on tumor tissue, the true status of the biomarker can be
misclassified. Sample size formulas for the stratified biomarker design that account
for misclassification error have been provided, and this remains an active research
area (Liu et al. 2014).

Analysis of Factorial Trials

Factorial trials have been analyzed inconsistently in the literature (Freidlin and Korn
2017). Some authors maintain the view that an interaction test should be provided
even if the trial was not designed to test for an interaction between the treatments
(Montgomery et al. 2003; Korn and Freidlin 2016). As an example, consider the
ECOG (E1199) 2 × 2 factorial trial in neoadjuvant breast cancer, in which patients
were randomized to paclitaxel administered every 3 weeks, paclitaxel given weekly,
docetaxel given every 3 weeks, or docetaxel given weekly (Sparano
2008). The study was designed with 86% power using a two-sided significance level
of 0.05 for testing each of the primary factors (treatment: paclitaxel vs. docetaxel;
schedule: weekly vs. every 3 weeks). No statistically significant differences in
disease-free survival (DFS) were observed between patients randomized to paclitaxel
versus docetaxel (p-value = 0.61), nor between weekly treatment and treatment every
3 weeks (p-value = 0.33). The authors performed a test of interaction between treatment and
schedule (p-value = 0.003). Furthermore, the authors compared individual arms and
demonstrated that patients receiving weekly paclitaxel had superior DFS compared
with patients who received paclitaxel every 3 weeks. The results persisted
in long-term follow-up of these patients (Sparano et al. 2015).

Treatment Interactions

We now consider more general circumstances where the effect of treatment A is
influenced by the presence of treatment B, and vice versa. In such cases, there is said
to be a treatment interaction. Although the sample size efficiencies just discussed
will be lost when this occurs, factorial designs become even more relevant.

Factorial Designs Are the Only Way to Study Interactions

One of the most consequential features of factorial designs is that they are the only
type of trial design that permits study of treatment interactions. This is because the
factorial structure has groups with all possible combinations of treatments, allowing
the responses to be compared directly. Consider, again, the two estimates of the
effect of A in the 2 × 2 design, one in the presence of B and the other in the absence of
B. The definition of an interaction is that the effect of A in the absence of B is
different from the effect of A in the presence of B. This difference can be estimated
by comparing

\beta_{AB} = (\bar{Y}_A - \bar{Y}_0) - (\bar{Y}_{AB} - \bar{Y}_B)    (3)

with zero. If β_AB is near zero, we would conclude that no interaction is present. It is
straightforward to verify that β_AB = β_BA.
An important principle of factorial trials is evident by examining the variance of
β_AB. Under the same assumptions as in section "Factorial Designs Can Be More
Efficient,"

\text{var}(\beta_{AB}) = \frac{4\sigma^2}{n},
which is four times larger than the variance for either main effect when an interaction
is known to be absent. Therefore, to have the same precision for an estimate of an
interaction effect as for a main effect, the sample size has to be four times larger. This
illustrates again why both the efficiency and interaction objectives cannot be simul-
taneously met in the same factorial study.
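The fourfold variance inflation is easily confirmed by simulation, continuing the additive setup used earlier (effect sizes again arbitrary):

    # Empirical variance of the interaction estimator of Eq. 3
    set.seed(2)
    n <- 100; sigma <- 1
    est <- replicate(5000, {
      y0  <- rnorm(n, 0.0, sigma); ya  <- rnorm(n, 0.5, sigma)
      yb  <- rnorm(n, 0.3, sigma); yab <- rnorm(n, 0.8, sigma)
      (mean(ya) - mean(y0)) - (mean(yab) - mean(yb))
    })
    var(est)   # approximately 4 * sigma^2 / n = 0.04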
When there is an AB interaction, we cannot use the estimators given above for the
main effects of A and B (Eqs. 1 and 2), because they assume that no interaction is
present. In fact, it is not sensible to talk about an overall main effect in the presence
of an interaction because Eqs. 1 or 2 would have us average over two quantities that

are not expected to be equal. Instead, we could talk about the effect of A in the
absence of B,

\beta'_A = \bar{Y}_A - \bar{Y}_0    (4)

or the effect of B in the absence of A,

\beta'_B = \bar{Y}_B - \bar{Y}_0    (5)

These are logically and statistically equivalent to what would be obtained from
stand-alone trials.
In the 2 × 2 × 2 design, there are three main effects and four interactions possible,
all of which can be estimated by the design. Following the notation above, the effects
are

\beta_A = \frac{1}{4}\left[(\bar{Y}_A - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_B) + (\bar{Y}_{AC} - \bar{Y}_C) + (\bar{Y}_{ABC} - \bar{Y}_{BC})\right]    (6)

for treatment A,

\beta_{AB} = \frac{1}{2}\left[(\bar{Y}_A - \bar{Y}_0) - (\bar{Y}_{AB} - \bar{Y}_B) + (\bar{Y}_{AC} - \bar{Y}_C) - (\bar{Y}_{ABC} - \bar{Y}_{BC})\right]    (7)

for the AB interaction, and

\beta_{ABC} = (\bar{Y}_A - \bar{Y}_0) - (\bar{Y}_{AB} - \bar{Y}_B) - \left[(\bar{Y}_{AC} - \bar{Y}_C) - (\bar{Y}_{ABC} - \bar{Y}_{BC})\right]    (8)

for the ABC interaction. The respective variances are σ²/(2n), 2σ²/n, and 8σ²/n. Thus the
precision of the two-way interactions relative to the main effect is 1/4, and for the
three-way interaction is 1/16.
When certain interactions are present, here again it will not be sensible to think of
the straightforward main effects. But the design can yield an alternative estimator for
β_A, for β_B, or for other effects.
Suppose that there is an ABC interaction. Then instead of βA, an estimator of the
effect of A in the absence of C would be

\beta'_A = \frac{1}{2}\left[(\bar{Y}_A - \bar{Y}_0) + (\bar{Y}_{AB} - \bar{Y}_B)\right]

which does not use \bar{Y}_{ABC} and implicitly assumes that there is no AB interaction.
Similarly, the AB interaction would be

\beta'_{AB} = (\bar{Y}_A - \bar{Y}_0) - (\bar{Y}_{AB} - \bar{Y}_B)

for the same reason. Thus, when high-order interactions are present, we must modify
our estimates of lower order effects, losing some efficiency. However, factorial
designs are the only ones that permit treatment interactions to be studied.

Table 6 Response data from a hypothetical factorial trial showing no interaction on an additive scale of measurement

                 Treatment B
Treatment A      No     Yes
No               5      10
Yes              10     15

Table 7 Response data from a hypothetical factorial trial showing no interaction on a multiplicative scale of measurement

                 Treatment B
Treatment A      No     Yes
No               5      10
Yes              10     20

Interactions Depend on the Scale of Measurement

In the examples just given, the treatment effects and interactions have been assumed
to exist on an additive scale. This is reflected in the use of sums and differences in the
formulas for estimation.
In practice, other scales of measurement, particularly a multiplicative one, may be
useful. As an example, consider the response data in Table 6 where the effect of
treatment A is to increase the baseline response by 5 units. The same is true of B, and
there is no interaction between the treatments on this scale because the joint effect of
A and B is to increase the response by 5 + 5 = 10 units.
In contrast, Table 7 shows data in which the effects of both treatments are to
multiply the baseline response by 2.0. Hence, the combined effect of A and B is a
fourfold increase, which is greater than the joint treatment effect for the additive
case. If the analysis model were multiplicative, Table 6 would show an interaction,
whereas if the analysis model were additive, Table 7 would show an interaction.
Thus, to discuss interactions, we must establish the scale of measurement.
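The point can be made concrete with the numbers in Tables 6 and 7 by computing the interaction contrast of Eq. 3 on the raw (additive) and log (multiplicative) scales, as in this short R sketch:

    # Interaction contrast on two scales (rows: A no/yes; columns: B no/yes)
    tab6 <- matrix(c(5, 10, 10, 15), nrow = 2, byrow = TRUE)   # Table 6
    tab7 <- matrix(c(5, 10, 10, 20), nrow = 2, byrow = TRUE)   # Table 7
    ix <- function(m) (m[2, 2] - m[2, 1]) - (m[1, 2] - m[1, 1])
    ix(tab6); ix(log(tab6))   # 0 additively; nonzero on the log scale
    ix(tab7); ix(log(tab7))   # nonzero additively; 0 on the log scale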

The Interpretation of Main Effects Depends on Interactions

In the presence of an interaction in the 2 × 2 design, there is not an overall, or main,
effect of either treatment. This is because the effect of A is different depending on
the presence or absence of B. In the presence of a small interaction, where all
subjects benefit regardless of the use of B, we might observe that the magnitude of
the overall effect of A is of some size and that therapeutic decisions are unaffected
by the presence of an interaction (Fig. 2a). This is known as a quantitative interac-
tion because it does not affect the direction of the treatment effect. For large
quantitative interactions, it may not be sensible to talk about overall effects
(Kahan 2013).
Fig. 2 Hypothetical examples of (a) (left) a quantitative interaction and (b) (right) a qualitative
interaction. Both panels plot overall survival probability against time since random assignment
(months) for the four treatment combinations No-No, No-Yes, Yes-No, and Yes-Yes.

In contrast, if the presence of B reverses the effects of A, then the interaction
is qualitative, and treatment decisions may need to be modified (Fig. 2b). We
would not talk about an overall effect of A, because it could be positive in the
presence of B and negative in the absence of B and could yield an average effect
near zero.

Analyses Can Employ Linear Models

Motivation for the estimators given above can be obtained using linear models.
There has been little theoretical work on analyses using other models. One exception
is the work by Slud (1994) describing approaches to factorial trials with survival
outcomes. Suppose we have conducted a 2 × 2 factorial experiment with group sizes
given by Table 1. We can estimate the AB interaction effect using a linear model of
the form

E(Y) = \beta_0 + \beta_A X_A + \beta_B X_B + \beta_{AB} X_A X_B    (9)

where the X’s are indicator variables for the treatment groups and βAB is the
interaction effect. For example,

X_A = \begin{cases} 1 & \text{for treatment group A,} \\ 0 & \text{otherwise.} \end{cases}

The design matrix has dimension 4n × 4, and its transpose is

X' = \begin{bmatrix} 1 \cdots & 1 \cdots & 1 \cdots & 1 \cdots \\ 0 \cdots & 1 \cdots & 0 \cdots & 1 \cdots \\ 0 \cdots & 0 \cdots & 1 \cdots & 1 \cdots \\ 0 \cdots & 0 \cdots & 0 \cdots & 1 \cdots \end{bmatrix},

where there are four blocks of n identical rows representing each treatment
group and the columns represent effects for the intercept, treatment A, treatment
B, and both treatments, respectively. The vector of responses has dimension
4n × 1 and is

Y' = \left[Y_{01}, \ldots, Y_{A1}, \ldots, Y_{B1}, \ldots, Y_{AB1}, \ldots\right]

By ordinary least squares estimation, the solution to Eq. 9 is

\hat{\beta} = (X'X)^{-1} X'Y.

When the interaction effect is omitted, the estimates will be denoted by \hat{\beta}^*. The
covariance matrix of the estimates is (X'X)^{-1}\sigma^2, where the variance of each
observation is \sigma^2.

We have

X'X = n \begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 2 & 1 & 1 \\ 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \qquad (X'X)^{-1} = \frac{1}{n} \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \end{bmatrix},

and

X'Y = n \begin{bmatrix} \bar{Y}_0 + \bar{Y}_A + \bar{Y}_B + \bar{Y}_{AB} \\ \bar{Y}_A + \bar{Y}_{AB} \\ \bar{Y}_B + \bar{Y}_{AB} \\ \bar{Y}_{AB} \end{bmatrix},

where \bar{Y}_i denotes the average response in the ith group. Then


\hat{\beta} = \begin{bmatrix} \bar{Y}_0 \\ -\bar{Y}_0 + \bar{Y}_A \\ -\bar{Y}_0 + \bar{Y}_B \\ \bar{Y}_0 - \bar{Y}_A - \bar{Y}_B + \bar{Y}_{AB} \end{bmatrix}    (10)

which corresponds to the estimators given in Eqs. 3, 4, and 5. However, if the test for
interaction fails to reject and the \hat{\beta}_{AB} effect is removed from the model, then

\hat{\beta}^* = \begin{bmatrix} \tfrac{3}{4}\bar{Y}_0 + \tfrac{1}{4}\bar{Y}_A + \tfrac{1}{4}\bar{Y}_B - \tfrac{1}{4}\bar{Y}_{AB} \\ -\tfrac{1}{2}\bar{Y}_0 + \tfrac{1}{2}\bar{Y}_A - \tfrac{1}{2}\bar{Y}_B + \tfrac{1}{2}\bar{Y}_{AB} \\ -\tfrac{1}{2}\bar{Y}_0 - \tfrac{1}{2}\bar{Y}_A + \tfrac{1}{2}\bar{Y}_B + \tfrac{1}{2}\bar{Y}_{AB} \end{bmatrix}.

The main effects for A and B are given above in Eqs. 1 and 2.
The covariance matrices for these estimators are
\widehat{\text{cov}}(\hat{\beta}) = \frac{\sigma^2}{n} \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \end{bmatrix}    (11)

and

\widehat{\text{cov}}(\hat{\beta}^*) = \frac{\sigma^2}{n} \begin{bmatrix} \tfrac{3}{4} & -\tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 & 0 \\ -\tfrac{1}{2} & 0 & 1 \end{bmatrix}.    (12)

In the absence of an interaction, the main effects of A and B are estimated
independently and with higher precision than when an interaction is present. The
interaction effect is relatively imprecisely estimated, indicating the larger sample
sizes required to have a high power to detect such effects.
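These closed-form results can be reproduced numerically. A small R sketch of the model in Eq. 9 with n = 2 subjects per cell (the responses are placeholders):

    # OLS estimation in the 2 x 2 factorial, following Eq. 9
    n  <- 2
    xa <- rep(c(0, 1, 0, 1), each = n)    # indicator for treatment A
    xb <- rep(c(0, 0, 1, 1), each = n)    # indicator for treatment B
    y  <- rnorm(4 * n)                    # placeholder responses
    X  <- cbind(1, xa, xb, xa * xb)       # design matrix (4n x 4)
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'Y
    # Equivalent: coef(lm(y ~ xa * xb)); dropping xa:xb gives the reduced model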

Examples of Factorial Designs

Factorial trials have received considerable attention in clinical trials (Sikov et al. 2015).
We list interesting examples of factorial trials in Table 8. Factorial designs are well
suited to prevention trials for reasons outlined above, but many therapeutic trials
have also utilized factorial designs because of the questions being addressed. One
important classic study using a 2 × 2 factorial design is the Physicians' Health Study
(Hennekens and Eberlein 1985). This trial was conducted in 22,000 physicians in the
USA and was designed to test the effects of (1) aspirin on reducing cardiovascular
mortality and (2) β-carotene on reducing cancer incidence. The trial is noteworthy in
several ways, including its test of two interventions in unrelated diseases, use of
physicians as subjects to report outcomes reliably, relatively low cost, and an
all-male, high-risk study population. This last characteristic led to some criticism,
which was probably unwarranted.
In January 1988, the aspirin component of the Physicians’ Health Study was
discontinued because evidence demonstrated convincingly that it was associated
with lower rates of myocardial infarction. The question concerning the effect of
β-carotene on cancer was addressed by continuation of the trial. In the absence of an
interaction, the second major question of the trial was unaffected by the closure of
the aspirin component and showed no benefit for β-carotene.
Another interesting example of a 2 × 2 factorial design is the α-tocopherol,
β-carotene (ATBC) Lung Cancer Prevention Trial, conducted in 29,133 male smokers in
Finland between 1987 and 1994 (The ATBC Cancer Prevention Study Group 1994). In this study, lung cancer
incidence was the sole outcome. It was thought possible that lung cancer incidence
could be reduced by either or both interventions. When this trial was stopped in
1994, there were 876 new cases of lung cancer in the study population during the
trial. Alpha-tocopherol was not associated with a reduction in the risk of cancer.
Surprisingly, β-carotene was associated with a statistically significant increased
incidence of lung cancer. There was no evidence of a treatment interaction. The
unexpected findings of this study have been supported by the recent results of
another large trial of carotene and retinol.
The Fourth International Study of Infarct Survival (ISIS-4) was a 2 × 2 × 2
factorial trial assessing the efficacy of oral captopril, oral mononitrate, and intrave-
nous magnesium sulfate in 58,050 subjects with suspected myocardial infarction
(McAlister et al. 2003). No significant interactions among the treatments were found
and each main effect comparison was based on approximately 29,000 treated versus
29,000 control subjects. Among the findings was demonstration that captopril was
associated with a small but statistically significant reduction in 5-week mortality.
The difference in mortality was 7.19% versus 7.69% (a difference of 143 deaths out of 4,319 in total),

Table 8 Examples of trials using factorial designs

Physicians' Health Study (Hennekens and Eberlein 1985). Design: 2 × 2. Cohort: healthy male
physicians, n = 22,071. Treatments: aspirin; β-carotene. Outcomes: CHD, cancer.

Linxian Nutrition Trial (Li et al. 1993). Design: 2⁴. Cohort: 4 Linxian communes, n = 29,584.
Treatments: retinol + zinc; riboflavin + niacin; ascorbic acid + molybdenum; selenium +
β-carotene + α-tocopherol. Outcomes: esophageal cancer; all-cause mortality.

ISIS-4 (Flather et al. 1994). Design: 2³. Cohort: acute MI patients, n = 58,050. Treatments: oral
captopril; oral mononitrate; IV magnesium sulfate. Outcomes: mortality at 5 weeks and 12 months.

Prevention of postoperative nausea and vomiting (Apfel et al. 2004). Design: 2⁶. Cohort: patients
at high risk for nausea and vomiting, n = 5,199. Treatments: ondansetron; dexamethasone;
droperidol; propofol or volatile anesthetic; nitrogen or nitrous oxide; remifentanil or fentanyl.
Outcomes: postoperative nausea and vomiting within 24 h.

Ipswich Childbirth Study (Grant et al. 2001). Design: 2 × 2. Cohort: women needing episiotomy
repair, n = 793. Treatments: repair, 2 stage or 3 stage; suture, polyglactin or chromic. Outcomes:
pain or re-suturing.

Thrombosis Prevention Trial (Thrombosis prevention 1998). Design: 2 × 2. Cohort: men at risk of
ischemic heart disease, n = 5,499. Treatments: warfarin + aspirin; warfarin + placebo aspirin;
placebo warfarin + aspirin; placebo + placebo. Outcomes: coronary death; fatal/nonfatal MI.

Women's Antioxidant Cardiovascular Study (Cook et al. 2007). Design: 2 × 2 × 2. Cohort: women
aged 40 and over at high risk, with a history of cardiovascular disease or three or more coronary
heart disease risk factors. Treatments: vitamin C; vitamin E; beta-carotene; folic acid/vitamin B6/
vitamin B12. Outcomes: myocardial infarction, stroke, coronary revascularization, or
cardiovascular death.

E1199 (Sparano et al. 2015). Design: 2 × 2. Cohort: neoadjuvant stage II and III breast cancer.
Treatments: paclitaxel vs. docetaxel; every 3 weeks vs. weekly. Outcomes: disease-free survival;
overall survival.

EDTA Chelation Trial (Lamas et al. 2014). Design: 2 × 2. Cohort: post-MI patients ≥50 years with
creatinine ≤2.0 mg/dL, n = 1,708. Treatments: EDTA chelation or placebo infusions; 6 caplets
daily of a 28-component multivitamin or placebo. Outcomes: composite of total mortality, MI,
stroke, coronary revascularization, or hospitalization for angina.

Adapted from Piantadosi 2017.

illustrating the ability of large studies to detect potentially important treatment
effects, even when they are small in relative magnitude. Mononitrate and magnesium
therapy did not significantly reduce 5-week mortality.

Partial, Fractional, and Incomplete Factorials

Use Partial Factorial Designs When Interactions Are Absent

Partial, or fractional, factorial designs are those that omit certain treatment groups by
design. A careful analysis of the objectives of an experiment, its efficiency, and the
effects it can estimate may justify not using some groups. Because many cells
contribute to the estimate of any effect, a design may achieve its intended purpose
without some of the cells.
In the 2 × 2 design, all treatment groups must be present to permit estimating the
interaction between A and B. However, for higher order designs, if some interactions
are known biologically not to exist, certain treatment combinations can be omitted
from the design and still permit estimates of other effects of interest. For example, in
the 2 × 2 × 2 design, if the interaction between A, B, and C is known not to exist, that
treatment cell could be omitted from the design and still permit estimation of all the
main effects. The efficiency would be somewhat reduced, however. Similarly, the
two-way interactions could still be estimated without Ȳ_ABC. This can be verified
from the formulas above.
Generally, partial high-order designs will produce a situation termed “aliasing” in
which the estimates of certain effects are algebraically identical to completely
different effects. If both are biologically possible, the design will not be able to
reveal which effect is being estimated. Naturally this is undesirable unless additional
information is available to the investigator to indicate that some aliased effects are
zero. This can be used to advantage in improving efficiency, and one must be careful
in deciding which cells to exclude. The reader is referred to Cox (1958) or Mason
et al. (1989) for a discussion of this topic.
The Women's Health Initiative (WHI) clinical trial was a 2 × 2 × 2 partial
factorial design studying the effects of hormone replacement, dietary fat reduction,
and calcium and vitamin D on coronary disease, breast cancer, and osteoporosis
(Assaf and Carleton 1994; Design of the Women’s 1998; Shumaker et al. 1998). The
study accrued 162,000 subjects into multiple clinical trials and finished the initial
study period in 2005 (Rossouw et al. 2002). The hormone therapy trials randomized
27,347 women in an estrogen plus progestin study and an estrogen alone study. The

dietary component of the study randomized 48,835 women, using a 3:2 allocation
ratio in favor of the control arm and 9 years of follow-up. The calcium and vitamin D
component randomized 36,282 women. Such a large and complex trial was not
without controversy early on (Marshall 1993), and presented logistical difficulties,
questions about adherence, and sensitivity to assumptions that could only roughly be
validated during design.

Incomplete Designs Present Special Problems

Treatment groups can be dropped out of factorial plans without yielding a frac-
tional replication. The resulting trials have been called incomplete factorial designs
(Byar et al. 1993). In incomplete designs, cells are not missing by design intent but
because some treatment combinations may be infeasible. For example, in a 2 × 2
design, it may not be ethically possible to use a placebo group. In this case, one
would not be able to estimate the AB interaction. In other circumstances, unwanted
aliasing may occur, or the efficiency of the design to estimate main effects may be
greatly reduced. In some cases, estimators of treatment and interaction effects are
biased, but there may be reasons to use a design that retains as much of the
factorial structure as possible. For example, they may be the only way to estimate
certain interactions.

Summary

Factorial trials are efficient under the assumption of no interaction between the
treatments, and this should be considered at the design stage. Factorial designs
may also be used for the purpose of detecting an interaction between the factors if
the trial is powered accordingly. Therefore, factorial trial designs are useful in two
circumstances. When two or more treatments do not interact, factorial designs can
test the main effects of each using smaller sample sizes and greater precision than
separate parallel group designs. When it is essential to study treatment interactions,
factorial designs are the only effective way to do so. The precision, however, with
which interaction effects are estimated is lower than that for main effects in the
absence of interactions. A factorial trial designed to detect an interaction has no
advantage in terms of the required sample size compared to a multi-arm parallel trial
for assessing more than one intervention.
When there are many treatments or factors, these designs require a relatively large
number of treatment groups. In complex designs, if some interactions are known not
to exist or are unimportant, it may be possible to omit some treatment groups, reduce
the size and complexity of the experiment, and still estimate all of the effects of
biological interest. Extra attention to the design properties is necessary to be certain
that fractional designs will meet the intended objectives. Such fractional or partial
factorial designs are of considerable use in agricultural and industrial experiments
but have not been applied frequently to clinical trials.

Ethical and toxicity constraints may make it impossible to apply either a full
factorial or a fractional factorial design, yielding an incomplete design. The proper-
ties of incomplete factorial designs have not been studied extensively, but they may
be the best design in some circumstances.
A number of important, complex, and recent clinical trials have used factorial
designs. Because of the low potential for toxicity, these designs have been more
frequently applied in studies of disease prevention. Examples include the Physicians’
Health Study and the Women's Health Trial. In medical studies, the design is employed
usually to achieve greater efficiency, since the treatments are unlikely to interact.

Summary and Conclusions

Factorial trials are efficient only when there is no interaction between the treatments,
and this should be considered at the design stage. Factorial designs must be used when
the intent is to study interactions, in which case the trial must be powered accordingly.
Interaction effects have roughly four times the variance of main effects and so require
much larger sample sizes. If many treatments and interactions are possible, factorial
designs may be impractical for therapeutic questions due to their large sample size and
complexity. Other constraints, such as the need to omit treatments or to administer all
therapies at full dose in some groups, may also make these designs unsuitable.

Key Facts

Factorial trials represent a structure that can test treatment by treatment interactions. In
the narrow circumstance that interactions are known to be absent, the factorial structure
can test the effects of two treatments using a sample size ordinarily used for a single
treatment. When interactions are the focus, sample size must be increased substantially
because they are estimated with less precision than “main effects.” Factorial designs are
often well suited to prevention questions, where they have been applied widely.

Cross-References

▶ Biomarker-Guided Trials
▶ Prevention Trials: Challenges in Design, Analysis, and Interpretation of Preven-
tion Trials

References
Apfel CC, Korttila K, Abdalla M, Kerger H, Turan A, Vedder I, . . . IMPACT Investigators (2004) A
factorial trial of six interventions for the prevention of postoperative nausea and vomiting. N
Engl J Med 350(24):2441–2451

Assaf AR, Carleton RA (1994) The Women’s Health Initiative Clinical Trial and Observational
Study: history and overview. R I Med 77(12):424–427
Byar DP, Piantadosi S (1985) Factorial designs for randomized clinical trials. Cancer Treat Rep 69
(10):1055–1063
Byar DP, Herzberg AM, Tan WY (1993) Incomplete factorial designs for randomized clinical trials.
Stat Med 12(17):1629–1641
Cook NR, Albert CM, Gaziano JM, Zaharris E, MacFadyen J, Danielson E, . . . Manson JE (2007)
A randomized factorial trial of vitamins C and E and beta carotene in the secondary prevention
of cardiovascular events in women: results from the Women’s Antioxidant Cardiovascular
Study. Arch Intern Med 167(15):13–27
Cox DR (1958) Planning of experiments. Wiley, New York
Design of the Women’s Health Initiative clinical trial and observational study. The Women’s Health
Initiative Study Group. (1998) Control Clin Trials 19(1):61–109
Fisher RA (1935) The design of experiments. Oliver and Boyd, Edinburgh/London
Fisher RA (1960) The design of experiments, 8th edn. Hafner, New York
Flather M, Pipilis A, Collins R, Budaj A, Hargreaves A, Kolettis T, . . . et al (1994) Randomized
controlled trial of oral captopril, of oral isosorbide mononitrate and of intravenous magnesium
sulphate started early in acute myocardial infarction: safety and haemodynamic effects. ISIS-4
(Fourth international study of infarct survival) Pilot Study Investigators. Eur Heart J 15(5):608–619
Freidlin B, Korn EL (2014) Biomarker enrichment strategies: matching trial design to biomarker
credentials. Nat Rev Clin Oncol 11(2):81–90
Freidlin B, Korn EL (2017) Two-by-two factorial cancer treatment trials: is sufficient attention being
paid to possible interactions? J Natl Cancer Inst 109(9). https://doi.org/10.1093/jnci/djx146
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14(14):4368–4371
Gönen M (2003) Planning for subgroup analysis: a case study of treatment-marker interaction in
metastatic colorectal cancer. Control Clin Trials 24(4):355–363
Grant A, Gordon B, Mackrodat C, Fern E, Truesdale A, Ayers S (2001) The Ipswich childbirth
study: one year follow up of alternative methods used in perineal repair. BJOG 108(1):34–40
Green S (2005) Factorial designs with time to event endpoints, pp 181–189
Henderson IC, Berry DA, Demetri GD, Cirrincione CT, Goldstein LJ, Martino S, . . . Norton L
(2003) Improved outcomes from adding sequential paclitaxel but not from escalating doxoru-
bicin dose in an adjuvant chemotherapy regimen for patients with node-positive primary breast
cancer. J Clin Oncol 21(6):976–983
Hennekens CH, Eberlein K (1985) A randomized trial of aspirin and beta-carotene among
U.S. physicians. Prev Med 14(2):165–168
Kahan BC (2013) Bias in randomised factorial trials. Stat Med 32(26):4540–4549
Korn EL, Freidlin B (2016) Non-factorial analyses of two-by-two factorial trial designs. Clin Trials
13(6):651–659
Lamas GA, Boineau R, Goertz C, Mark DB, Rosenberg Y, Stylianou M, . . . Lee KL (2014) EDTA
chelation therapy alone and in combination with oral high-dose multivitamins and minerals for
coronary disease: the factorial group results of the Trial to Assess Chelation Therapy. Am Heart
J 168(1):37.e5–44.e5
Li B, Taylor PR, Li J-Y, Dawsey SM, Wang W, Tangrea JA, . . . Blot WJ (1993) Linxian nutrition
intervention trials design, methods, participant characteristics, and compliance. Ann Epidemiol
3(6):577–585
Liu C, Liu A, Hu J, Yuan V, Halabi S (2014) Adjusting for misclassification in a stratified biomarker
clinical trial. Stat Med 33(18):3100–3113
Lubsen J, Pocock SJ (1994) Factorial trials in cardiology: pros and cons. Eur Heart J 15(5):585–588
Marshall E (1993) Women’s health initiative draws flak. Science 262(5135):838
Mason RL, Gunst RF, Hess JL (1989) Statistical design and analysis of experiments: with
applications to engineering and science. Wiley, New York
McAlister FA, Straus SE, Sackett DL, Altman DG (2003) Analysis and reporting of
factorial trials: a systematic review. JAMA 289(19):2545

Montgomery AA, Peters TJ, Little P (2003) Design, analysis and presentation of factorial
randomised controlled trials. BMC Med Res Methodol 3:26
Moser BK, Halabi S (2015) Sample size requirements and study duration for testing main effects
and interactions in completely randomized factorial designs when time to event is the outcome.
Commun Stat Theory Methods 44(2):275–285
Peterson B, George SL (1993) Sample size requirements and length of study for testing interaction
in a 2 × k factorial design when time-to-failure is the outcome. Control Clin Trials
14(6):511–522
Piantadosi S (2017) Factorial designs. In: Piantadosi S (ed) Clinical trials: a methodologic perspec-
tive. Wiley, Hoboken, pp 672–687
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21(6):527–539
Rosenberg J, Ballman KV, Halabi S, Watt C, Hahn O, Steen P, . . . Morris M (2019) CALGB 90601
(Alliance): randomized, double-blind, placebo-controlled phase III trial comparing gemcitabine
and cisplatin with bevacizumab or placebo in patients with metastatic urothelial carcinoma. J
Clin Oncol 37(15_suppl):4503–4503
Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, . . . Writing
Group for the Women’s Health Initiative (2002) Risks and benefits of estrogen plus progestin in
healthy postmenopausal women: principal results from the Women’s Health Initiative random-
ized controlled trial. JAMA 288(3):321–333
Rubinstein LV, Gail MH, Santner TJ (1981) Planning the duration of a comparative clinical trial
with loss to follow-up and a period of continued observation. J Chronic Dis 34(9):469–479
Shumaker SA, Reboussin BA, Espeland MA, Rapp SR, McBee WL, Dailey M, . . . Jones BN
(1998) The Women’s Health Initiative Memory Study (WHIMS): a trial of the effect of estrogen
therapy in preventing and slowing the progression of dementia. Control Clin Trials 19(6):604–
621
Sikov WM, Berry DA, Perou CM, Singh B, Cirrincione CT, Tolaney SM, . . . Winer EP (2015)
Impact of the addition of carboplatin and/or bevacizumab to neoadjuvant once-per-week
paclitaxel followed by dose-dense doxorubicin and cyclophosphamide on pathologic complete
response rates in stage II to III triple-negative breast cancer: CALGB 40603 (Alliance). J Clin
Oncol 33(1):13–21
Simon R, Freedman LS (1997) Bayesian design and analysis of two x two factorial clinical trials.
Biometrics 53(2):456–464
Slud EV (1994) Analysis of factorial survival experiments. Biometrics 50(1):25–38
Sparano JA (2008) Weekly paclitaxel in the adjuvant treatment of breast cancer. New Engl J Med
358(16):1663
Sparano JA, Zhao F, Martino S, Ligibel JA, Perez EA, Saphner T et al (2015) Long-term follow-up
of the E1199 phase III trial evaluating the role of Taxane and schedule in operable breast cancer.
J Clin Oncol 33(21):2353–2360
The ATBC Cancer Prevention Study Group (1994) The Alpha-Tocopherol, Beta-Carotene lung cancer
prevention study: design, methods, participant characteristics, and compliance. Ann Epidemiol
4(1):1–10
Thrombosis prevention trial: randomised trial of low-intensity oral anticoagulation with warfarin
and low-dose aspirin in the primary prevention of ischaemic heart disease in men at increased
risk (1998) Lancet 351(9098):233
Wason J, Mander A, Stecher L (2014) Correcting for multiple-testing in multi-arm trials: is it
necessary and is it done? Trials 15(1):1–7
Yates F (1935) Complex experiments. Suppl J R Stat Soc B2(2):181–247
72 Within Person Randomized Trials

Gui-Shuang Ying

Contents

Introduction
Rationale for Using Within Person Design
The Requirements for Within Person Design
  No Carry Across Effect
  Within Person Correlation
Trial Design Considerations
  Bias
  Recruitment
  Efficiency
  Generalizability
  Other Considerations
  Concurrent Treatment Versus Sequential Treatment
  Alternatives to the Within Subject Control Design
Power and Sample Size
  Sample Size for Continuous Outcome
  Sample Size for Binary Outcome
Statistical Analysis
  Analysis of Continuous Outcome Measures
  Statistical Comparison of Binary Outcome
Summary and Conclusion
Key Facts
References

Abstract
Within person randomized trials (i.e., trials using within subject controls) are
often employed for conditions that affect paired organs or two or more body sites
of a person. In within person trials, the paired organs or body sites of a person

G.-S. Ying (*)


Center for Preventive Ophthalmology and Biostatistics, Department of Ophthalmology, Perelman
School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
e-mail: [email protected]


receive two competing interventions either concurrently or sequentially, and the
outcome measures are taken from each of the paired organs or body sites. The within
person design is a useful and efficient tool because comparisons between two
interventions are made within the same person, thus removing the inter-person
variability. Within person trials are most commonly conducted in ophthalmology,
dentistry, and dermatology. However, within person trials pose some challenges,
including possible bias from the carry across effect, difficulty in recruiting
subjects with bilateral disease of similar characteristics, and limited
generalizability of the trial results. The within person correlations in outcome
measures also complicate the sample size determination and statistical analyses
of trial data from within person trials.
This chapter describes the rationale and requirements for employing within
person design, the considerations in designing within person trials in various
disease specialty areas. The appropriate methods for sample size calculation and
the statistical analysis for within person trials are also described. Real trials are used
throughout the chapter to demonstrate the trial design considerations, sample size
calculation, and statistical analysis of correlated data from within person trials.

Keywords
Within person trials · Within subject controls · Within person correlation ·
Inter-eye correlation · Paired design · Split-mouth design · Carry across effect

Introduction

Some diseases affect paired organs, body parts, or body sites of a subject (such as eyes,
ears, arms, or breasts) or two sites of a single organ, body part, or body site (such as
teeth or sides of the mouth). This feature provides a unique opportunity for designing
efficient clinical trials by using within subject controls. Different from conventional
parallel group trials, in which eligible persons are randomized to receive only one of the
study treatments (i.e., the randomization unit is the person), within person trials randomize
each organ or body site to treatment (i.e., the unit of randomization is the organ or
body site), and each person receives all study treatments (Paré 1575).
Within person design is efficient in that it enables the comparison between two
interventions within a person, eliminates the between-person variation, and hence
improves the efficiency in estimating the treatment effect. The trials using within
person controls do not have a generally accepted name, although some medical
specialties have their specific terms, such as “contralateral design” or “paired
design” in ophthalmology, “split-mouth design” in dentistry, and “split face” or
“split body” design in dermatology (ref: Machin and Fayers 2010). To encompass
all possible medical specialties and to align with the terminology used in the
published guidelines for Consolidated Standards of Reporting Trials (CONSORT)
(Pandis et al. 2017), trials using within subject controls are called within person trials
in this chapter. In ophthalmology, within person trials randomly assign treatment to
72 Within Person Randomized Trials 1379

one eye and another treatment (or control) to the fellow eye of the same person
(CAPT Research Group 2004). In dentistry, within person trials apply one treatment
to some teeth and another treatment to other teeth of the same person
(Pandis et al. 2013).
Within person trials, in which each person receives all study treatments, should not
be confused with trials in which randomization and treatment are at the person level
and all the organs or body sites of a person receive the same treatment and are in
the same comparison group. For example, in the Age-Related Eye Disease Study
(AREDS), the participants were randomized to one of the four treatment groups – (1)
zinc alone; (2) antioxidants alone; (3) a combination of antioxidants and zinc; or (4) a
placebo (The AREDS Research Group 1999) – to evaluate the effect of high doses of
vitamin C, vitamin E, beta-carotene, and zinc on the progression of age-related
macular degeneration (AMD) and cataract. As the two eyes of each participant received
the same systemic treatment (e.g., dietary supplements) and are in the same com-
parison group, the AREDS is not a within person trial; instead it can be viewed as a
type of cluster randomized trial, which is not discussed here. Although within person
trials have some similarities to cross-over trials, which are also not discussed here, they
differ from cross-over trials in that treatment and outcome measures are at the organ
or body site level rather than at the person level.
Within person trials have been used to evaluate a variety of preventive and
therapeutic treatments. Pandis et al. (2017) reported that approximately
2% of published randomized clinical trials employed a within person design.
Within person trials are most common in ophthalmology, dentistry, and
dermatology. In dentistry, a review of 413 clinical trials published in 8 high-impact
oral health journals from 1992 to 2012 found 43 (10%) dental trials used split-mouth
design (Koletsi et al. 2014). Another study found that 67 (24%) of 276 trials
published in implant dentistry journals between 1989 and 2011 used the split-
mouth design (Cairo et al. 2012). In ophthalmology, Lee et al. (2012) found that a within
person design was used in 9 (13%) of 69 ophthalmic trials published in the top four
general clinical ophthalmology journals (American Journal of Ophthalmology,
Archives of Ophthalmology, the British Journal of Ophthalmology, and Ophthal-
mology) between January and December of 2009.
This chapter describes the rationale and the requirements for employing within
person design, the considerations in designing within person trials, the sample size/
power determination, and the appropriate statistical approaches for analyzing corre-
lated data from within person trials. Examples of real within person clinical trials
are used to demonstrate the design, sample size calculation, and statistical analysis
for within person trials.

Rationale for Using Within Person Design

In parallel group trials that randomize persons to one of the treatments, the treatment
effect is determined by comparing the outcome measure between persons randomized
to one treatment and persons randomized to another treatment (i.e., through
between-person comparison). The treatment outcome measures are usually affected
by baseline characteristics (e.g., age, gender, disease severity, genetic factors, etc.),
which contribute to the variability of the outcome measure used to evaluate the
treatment effect. However, in within person trials, each person receives all study
treatments (e.g., the paired organs or body sites of the same person receive different
treatments), and the evaluation of the treatment effect is made by comparing outcomes
between the paired organs or body sites of the same person (i.e., through within-person
comparison). Using persons as their own controls removes the inter-person variability;
thus, within person trials reduce variability in treatment response and improve
efficiency, leading to smaller sample sizes and improved statistical power compared to
conventional parallel group trials.
The within person design is ideal for evaluating the efficacy of a single treatment
by using one organ or body site as control. In ophthalmology, as ocular diseases are
usually very symmetric, affecting both eyes of a person simultaneously (Murdoch
et al. 1997), ophthalmic trials often randomize one eye to study treatment, and the
other eye serves as control. For example, in the US Diabetic Retinopathy Study (The
Diabetic Retinopathy Study Group 1978), one eye of each eligible participant was
randomly assigned to immediate photocoagulation and the other eye to follow-up
without treatment. This type of paired-eye design is commonly used when the effects
of treatment are localized (such as laser treatment for diabetic retinopathy) to a
single eye.
As the resources available for clinical trials are usually limited and most trials
face challenges in enrolling and maintaining a sufficient number of subjects
over the course of the trial, the reduction in the required sample size compared with
the parallel group design makes within person trials very attractive.

The Requirements for Within Person Design

No Carry Across Effect

The most important assumption underlying the use of a within person design is that the
treatment effect is localized, i.e., there is no spill-over effect (also called carry
across effect) from therapy in one organ or body site to another. For example, the
treatment in one tooth has no effect on another tooth, or the treatment in one eye has
no effect on the fellow eye. In designing a within person trial to compare surgical
versus nonsurgical treatment for periodontal disease, it is desirable to demonstrate
that the sections of the mouth receiving surgical treatment are not affected
by the sections receiving nonsurgical therapy and vice versa. Unless this indepen-
dence can be demonstrated, the estimate may not be of the effect of surgical compared to
nonsurgical therapy but of the effect of surgical treatment in one section in conjunction
with nonsurgical treatment in another section, and it is not possible to obtain an
unbiased, independent estimate of either treatment.
The assumption of no carry across effect may not be met for some within person
trials, even when the treatment is localized to an organ/body site. For example, in the
initial One-eyed Trials of the Ocular Hypertension Treatment Study, the topical
β-blocker was given to the eye with the higher intraocular pressure (IOP) or to a randomly
selected eye if both eyes had the same IOP. After 2–6 weeks of topical medication in the
treated eye, it was found that the contralateral fellow eye had a mean (± standard
deviation) IOP reduction of 1.5 ± 3.0 mm Hg, as compared to the mean reduction of
5.9 ± 3.4 mm Hg in the treated eye, suggesting that the topical β-blocker has a contralat-
eral effect (Piltz et al. 2000). This carry across effect is likely due to systemic
absorption of the β-blocker, primarily through the nasolacrimal mucosa, resulting in
transport of the β-blocker to the contralateral eye through the blood stream (Piltz
et al. 2000).

Within Person Correlation

The measures taken from paired organs or body sites of the same person are usually
correlated. The within person design takes advantage of this high within person
correlation, which makes the within person trial more efficient than the parallel group
design. Reported correlation coefficients in ophthalmology (Katz 1988), dermatol-
ogy (Van et al. 2015), and orthodontics (Pandis et al. 2014) were 0.80, 0.80, and
0.50, respectively. Balk et al. (2012) calculated 811 within person
correlation coefficients from 123 studies. The median within person correlation
across all studies was 0.59 (interquartile range 0.40–0.81), and no heterogeneity
of correlation values across outcome types and clinical domains was observed.
In ophthalmology, a wide variety of inter-eye correlation coefficients
has been reported for various eye diseases and outcome measures (Ying et al. 2017a,
2018; Maguire 2020). The inter-eye correlation in refractive error can be as high as
0.90 in preschoolers but is only 0.43 in patients with neovascular age-related
macular degeneration (Ying 2017). The inter-eye agreement in referral-warranted
retinopathy of prematurity was reported to be 0.80 (Ying 2017). The gain in
efficiency from the within person design increases with the magnitude
of the within person correlation (i.e., the higher the within person correlation, the
greater the gain in efficiency and the greater the reduction in sample size compared
to the parallel group design).

Trial Design Considerations

In the simplest within person trials, two interventions (one of which may be a control
or standard treatment) are applied to the two paired organs or body sites of a person
through randomization, either concurrently or sequentially, and the outcome mea-
sures are assessed at each organ or body site. For example, in the Complications of
Age-Related Macular Degeneration Prevention Trial (CAPT), designed to evaluate whether
prophylactic laser treatment to the retina can prevent the development of advanced-stage AMD,
1052 participants with at least 10 large drusen (>125 μm) in both eyes were enrolled,
with 1 eye randomized to laser treatment for large drusen and the contralateral eye
as control (i.e., without treatment) (The CAPT Research Group 2004). Each partic-
ipant was followed up annually for at least 5 years to compare the incidence rates of
advanced-stage AMD between the treated eye and the contralateral observed eye of the
same participant.
Within person randomized trials present some particular challenges. When con-
templating a within person design for a clinical trial, careful consideration
should be given to issues associated with bias, efficiency, and the consequences for
recruitment and statistical analysis.

Bias

One potential problem of using within subject controls is the possibility of a carry
across effect. For example, an intervention applied to one eye can affect the other eye
systemically (Piltz et al. 2000); treatment in an area of the mouth can affect other
areas of the mouth locally (Lesaffre et al. 2009; Pandis et al. 2013); success or failure
of the first replacement hip in a patient requiring bilateral hip replacement can affect
the success or failure of the second hip (Lie et al. 2004).
The carry across effect has been the main concern in within person trials (Piltz et al.
2000; Lesaffre et al. 2009; Lie et al. 2004). A carry across effect can bias the estimates of
treatment efficacy and tends to dilute the treatment effect. However, the exact magni-
tude of bias due to a carry across effect is difficult to estimate (Hujoel 1998); thus, the
true treatment effect of the intervention cannot be accurately estimated. What can
be estimated is the treatment effect contaminated with the carry across effect. If the
intervention is thought to have a carry across effect, randomizing individual patients
to treatment groups (instead of using within subject controls) is preferred.
The carry across effect is similar to the temporal carry over effect in cross-over
trials, in which lingering effects of the first intervention may require adjustment for
different baselines before the second intervention or the use of washout periods. A
within person design is not appropriate to use if a substantial carry across effect or
contamination is expected. For example, in a study of oral lichen planus (Poon et al.
2006), the topical treatments applied to each side of mouth can have serious carry
across effect, so the split-mouth design should not be used. Similarly, in ophthal-
mology, the intravitreal injection of anti-vascular endothelial growth factor (anti-
VEGF) in the study eye can carry across to the contralateral eye (Acharya et al.
2011); the paired design for evaluating the efficacy of one anti-VEGF agent through
randomizing one eye to treatment and contralateral eye as control or for comparing
efficacy of two anti-VEGF agents within two eyes of the same subject is not ideal.

Recruitment

Within person trials require recruitment of individuals with similar disease condition
that affects paired organs or body sites of a person. However, identifying such
participants sometimes can be difficult, thus endangering recruitment. In
ophthalmology, many eye diseases are symmetric, such as refractive error, age-
related macular degeneration, and retinopathy of prematurity (Katz 1988; Quinn et al.
1995). This may not be the case in dentistry. It may be easy to find an individual with
a tooth having a cavity, but it can be challenging to find an individual with two teeth
having cavities of similar size, particularly on two sides of the mouth. For example, for
one periodontal disease trial, over 1500 patients were screened to find only
12 patients with symmetric periodontal lesions eligible for the study (Smith et al.
1980). This difficulty in identifying subjects with similar disease condition can be
a major obstacle to achieving the sample size required by a within person trial,
even though a smaller sample size is required for within person trials than for
parallel group trials. The stricter the criterion for the similarity of disease in
paired organs or body sites, the more difficult the recruitment will be. Such highly
selective recruitment of participants for within person trials will also hurt the
generalizability of trial findings.
In addition, the requirement of within person trials that each participant receive
all interventions could make some patients unwilling to participate in the
trial. Bunce et al. reported that in ophthalmic trials, some patients had very strong
opinions against enrolling both eyes into within person trials because this made
them feel like experimental units rather than people. These patients were most
comfortable with enrolling only one eye into the study even though both eyes were
eligible (Bunce and Wormald 2015).

Efficiency

The within person design takes advantage of the within person correlation (ρ) in
outcome measures to gain efficiency and reduce the sample size compared to
the parallel group design. Assuming all the parameters for the sample size calculation
are the same in the within person trial and the parallel group trial, the ratio between the
sample size (in terms of number of subjects) for the within person trial (N_paired) and
for the parallel group trial (N_parallel) can be calculated using the following formula
(Wang and Bakhai 2006):

N_paired / N_parallel = (1 − ρ) / 2   (1)

From Eq. (1), it is clear that the higher the within person correlation, the smaller the
ratio of sample sizes and the greater the gain in efficiency. If the within person
correlation is low, the gain in efficiency can be minimal, and the within person
design may not be appropriate.
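As an illustration, the following minimal Python sketch evaluates Eq. (1) over a range of correlations (the function name and the example values of ρ are illustrative choices, not taken from any cited trial):

def paired_to_parallel_ratio(rho):
    # Eq. (1): ratio of the number of subjects needed by a within person
    # trial to the number needed by a parallel group trial
    return (1 - rho) / 2

for rho in [0.0, 0.25, 0.50, 0.75, 0.90]:
    # higher within person correlation -> smaller ratio -> greater efficiency
    print(f"rho = {rho:.2f}: N_paired/N_parallel = {paired_to_parallel_ratio(rho):.3f}")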

Generalizability

Within person trials require each participant to have similar disease in paired organs
or body sites; it is uncertain whether within person trial results from patients with
bilateral disease can be generalized to patients with unilateral disease. Bilateral
disease sometimes indicates poorer clinical status than unilateral disease. For exam-
ple, diabetic neuropathy is a systemic consequence of diabetes that is considered
worse if multiple limbs are affected, and the need for multiple dental implants is
indicative of a worse dental condition. The tendency of higher disease severity in
subjects with bilateral disease than subjects with unilateral disease makes it uncertain
whether the treatment effect estimated from the within person trial is generalizable to
the patients with unilateral disease.
In the Early Treatment for Retinopathy of Prematurity (ETROP) Study,
269 infants with bilateral prethreshold ROP had one eye randomly assigned to
treatment with peripheral retinal ablation, and the fellow eye managed convention-
ally, and 70 infants with unilateral prethreshold ROP were randomized to receive
either treatment with peripheral retinal ablation or conventional management in the
single eye with prethreshold ROP. The rate of unfavorable two-year structural outcome
was higher in infants with bilateral prethreshold ROP than in infants with unilateral
prethreshold ROP: among infants with bilateral prethreshold ROP, the rate of
unfavorable structural outcome was 10.4% in eyes treated with peripheral retinal
ablation and 16.7% in eyes with conventional management (p = 0.003), while
among infants with unilateral prethreshold ROP, the rates were 0% and 3.3%,
respectively (p = 0.26) (The ETROP Cooperative
from patients with bilateral disease may not be generalizable to patients with
unilateral disease.

Other Considerations

As outlined in the CONSORT guidelines for within person trials (Pandis et al. 2017),
the design of within person trial also has to consider the following questions:

• What are the eligibility criteria for enrollment? The within person design needs to
consider two sets of eligibility criteria: the eligibility of the individual
participant and the eligibility of the organs (e.g., eyes) or body sites. For example, to be
eligible for the CAPT study, participants had to be at least 50 years of age and free
of conditions likely to preclude 5 years of follow-up (person level eligibility); each
eye had to have 10 or more drusen at least 125 μm in diameter within
2 disc diameters of the fovea, and standardized visual acuity had to be 20/40 or
better in each eye (eye level eligibility) (The CAPT Research Group 2004).
• What is the outcome of the within person trial? The outcome of a within person
trial should be specific to the organ or body site. A within person design is not
appropriate for a trial whose outcome is assessed at the person level. For example, in the
Dry Eye Assessment and Management (DREAM) Study (The DREAM Investi-
gator Group 2018), although dry eye disease is mostly bilateral (>90% of
participants had both eyes meet the enrollment criteria), the treatment
is systemic and the outcome measure of dry eye symptoms, the Ocular
Surface Disease Index (OSDI), is measured at the person level, so it is not
appropriate to use a within person design for the DREAM Study.
• Can the assessment of outcomes for efficacy and safety be adversely impacted by
the decision to treat the same patients with two different treatments?
• Are the paired organs/sites for each participant similar in terms of baseline
characteristics such as location, anatomy (e.g., tooth type), and severity of
disease?
• Will the treatments be administered concurrently or sequentially to the same
participant? If treatments are given sequentially, will baseline information be
recorded at the time of randomization or at the time of treatment administration?
Similarly, if the treatments are sequential, the outcome of the first intervention
could affect the outcome of the second intervention, and hence the applicability of
the within person trial findings to other settings can be questionable. For example,
early and late loaded implants or one hip replacement at a time can potentially
influence the outcome. In some cases, however, the sequential approach is
standard clinical practice, such as cataract surgery (Vasavada et al. 2012).
• How will the order of treatments and allocation to paired organs/body sites be
determined (e.g., right versus left)? In within person trials, randomization is
needed not only to determine which intervention is applied to which organ or
body site but also to determine which organ or body site is treated first (partic-
ularly if paired organs or body sites are not treated concurrently).
• Will there be any provision to monitor whether the assigned treatment is actually
applied to the correct organ or body site?
• Will the outcome evaluator be masked to the treatment assignment of each organ
or body site, and if so how?
• How will the blinding of treatment assignments to the organs/body sites of the same
subject be handled, and will accidental unblinding of the treatment of one organ
affect the other organ?

Concurrent Treatment Versus Sequential Treatment

When a subject is assigned to receive two treatments in a within person trial, a decision
needs to be made on whether the two treatments given to paired organs or body sites are
concurrent or sequential. With concurrent treatment, the two treatments are deliv-
ered at the same time or within a trivial interval following a specific or random
treatment order, whereas with sequential treatment, there is a non-trivial time lag
between the two interventions. With concurrent treatment, loss to follow-up will
automatically be the same across treatment groups, but side effects (particularly
systemic adverse events) from treatments may be difficult to attribute to a specific
treatment. Another concern in concurrent treatment is the possible confusion as to
which organ or body site receives which treatment, particularly when there is a long
treatment period. The traditional methods for monitoring compliance of treatment
might be insufficient in within person trials when participants are responsible for
administering the treatment (e.g., topical eye drops) by themselves.

For example, in a cataract surgery trial to determine if intraocular infusion of
low-molecular-weight heparin reduces postoperative inflammation in pediatric eyes
undergoing cataract surgery with intraocular lens (IOL) implantation (Vasavada
et al. 2012), among 20 children (40 eyes) undergoing bilateral surgery with IOL
implantation, the first eye was randomly assigned to receive enoxaparin in the
intraocular infusion fluid or not to receive enoxaparin, and the second eye received
the alternate treatment. The eye treated first was selected by a computer-generated table
of random numbers. In this trial, two eyes of a child did not undergo cataract
surgeries at the same time; instead the second eye underwent cataract surgery after
a gap of at least 2 weeks following surgery in the first eye, as this is the clinical
practice of cataract surgery.
Caution should be exercised in designing within person trials with sequential
treatments, because problems can arise from carry across effects or period effects
(the effect of an intervention is influenced by the period of delivery), and a baseline
adjustment may be required when the baseline characteristics are believed to change
between two sequential treatments. For example, in a split mouth trial for comparing
two types of dental implants, baseline characteristics and outcome of the second
dental implants might be influenced by the time interval between the two dental
implants and the status of the first implant. If the first early loaded implant results in a
poor outcome or the time interval between two dental implant operations is long, or
both, the patient might rely excessively on the other side of the mouth, which might
have a negative impact on the outcome of the second loaded implant. Conversely, if
the outcome in the first implant is good and the burden on the second implant is
small, a satisfactory outcome from the second implant can be more likely (Pandis
et al. 2017).

Alternatives to the Within Subject Control Design

Clinical trials for conditions that occur in multiple organs or body sites require
careful consideration of study design, because the choice has strong implications for
patient enrollment, statistical analysis, and the presentation of results. Besides the
within person design, possible alternative designs include:

• Include only one organ/site per subject either through random selection, use of
organ/site with the most severe disease, or at the discretion of clinician or
patient. For example, in the Comparison of Age-Related Macular Degeneration
Treatments Trials (CATT), 1185 participants with neovascular AMD were
randomized to treatment with intravitreal injection of ranibizumab or bevacizumab
on a monthly or PRN schedule. The trial required each study eye to have active
subfoveal choroidal neovascularization (CNV) and visual acuity between
20/25 and 20/320. The CATT enrolled only one eye per patient. When both eyes
of a participant met the enrollment criteria, the ophthalmologist and the patient
decided which eye would be enrolled (The CATT Research Group 2011).
The advantage of this design is the simplicity of the design and of the statistical analysis of
the trial data, but it may forgo the opportunity to collect more information efficiently.

• Randomize patients to a treatment, and treat paired organs/body sites with the
same treatment. This is a clustered randomized trial in which the clusters are
individual patients. For example, in a multi-center randomized clinical trial to
evaluate the efficacy of intravitreal injection of bevacizumab for stage 3+ ROP,
150 infants with stage 3+ ROP were randomized to receive intravitreal
bevacizumab or conventional laser therapy in both eyes (Mintz-Hittner
et al. 2011).
• Mixture of participants with unilateral and bilateral disease. Although many
diseases occur in paired organs or multiple body sites, the extent and severity
of disease may not be the same. The within person design that requires similar
disease condition in paired organs or body sites may significantly limit the
recruitment potential and also make the results not easily generalizable to the
patients with unilateral disease. In ophthalmology, some trials use hybrid design
which allows both eyes to be randomized if both eyes are eligible and allows one
eye to be randomized if only one eye is eligible (Lee et al. 2012; The ETROP
Cooperative Group 2006; Elman et al. 2010). For example, the ETROP enrolled
infants with prethreshold ROP in both eyes and also infants with prethreshold
ROP in one eye only. For infants with bilateral prethreshold ROP, one eye was
randomized to treatment, and the other (the control eye) was managed conven-
tionally. For infants with unilateral prethreshold ROP, a separate randomization
scheme assigned such infants to either treatment or conventional management
(The ETROP Cooperative Group 2006). The Diabetic Retinopathy Clinical
Research Network also used this hybrid design in several large clinical trials, as
the hybrid approach can lead to faster recruitment and reduced costs considering
the overall number of participants (Glassman and Melia 2015). In the Diabetic
Retinopathy Clinical Research Network Protocol I (Diabetic Retinopathy Clinical
Research Network, 2010), patients with one or two eligible eyes enrolled to
compare four treatments for diabetic macular edema including (A) prompt laser
(N ¼ 293 eyes); (B) 0.5 mg ranibizumab + prompt laser (N ¼ 187 eyes);
(C) 0.5 mg ranibizumab + deferred laser (N ¼ 188 eyes); and (D) 4 mg triam-
cinolone + prompt laser (N ¼ 186 eyes). For 528 patients with only 1 eye eligible,
they were randomized to 4 treatment groups with equal probability, while for
163 patients having both eyes eligible, the right eye was randomized to 1 of the
4 treatment groups and the left eye assigned to group A if right eye was not in
group A, and the left eye was randomized to 1 of the 3 remaining treatments with
equal probability if the right eye was in group A. Such a hybrid design can
complicate the statistical analysis of the trial data, because the statistical analysis
for the bilateral cases needs to adjust for the inter-eye correlation. Consistency of
treatment effect among bilateral patients and unilateral patients should be checked
before the assessment of overall treatment effect by combining the results from
these unilateral cases and bilateral cases (The ETROP Cooperative Group 2006).

Power and Sample Size

The sample size calculation for within person trials requires an estimate of the within
person correlation for the primary outcome measure. This correlation estimate can be
obtained from previous studies. If such data are not available in practice, the
sample size/power can be calculated assuming various degrees of within person
correlation (e.g., moderate correlation with a coefficient of 0.50, or high
correlation with a coefficient of 0.75) to see how sensitive the sample size
calculation is to the assumed within person correlation coefficient. The within
person correlation should not be ignored in the sample size calculation,
because ignoring it will result in an over-estimation of the
sample size. The degree of over-estimation depends on the within
person correlation, as demonstrated by Eq. (1): the higher the within person corre-
lation, the greater the over-estimation of the sample size. It is a common mistake: most
within person trials do not take the within person correlation into account in the
calculation of sample size or statistical power (Lee et al. 2012; Lesaffre et al. 2007;
Lai et al. 2007; Bryant et al. 2006).
Besides the assumption for the within person correlation, assumptions need to be
made for other parameters, including the expected mean and variance (σ²) or standard
deviation (SD) of a continuous outcome measure or the expected event rate of a
binary outcome for each treatment group, the type I error rate (α), and the desired
statistical power (1 − β).

Sample Size for Continuous Outcome

Assume for a two-arm within person trial that the expected mean is μ_A for treatment
A and μ_B for treatment B, with the same variance σ² in each group. Their mean difference
in outcome measure is d = μ_A − μ_B.
If the SD of the within person difference between treatment groups in the
primary continuous outcome measure is available, the sample size can be calculated
without needing to assume a within person correlation coefficient (ρ). If
such data are not available, the within person correlation has to be assumed in order to
calculate the SD of the within person difference d.
For a given within person correlation ρ, the variance of the within person
difference d can be calculated as

σ_d² = 2(1 − ρ)σ²   (2)

The required sample size N (i.e., the total number of subjects needed) for detecting a
mean difference of d in the primary outcome measure with statistical power 1 − β and
type I error rate α is

N = σ_d²(Z_{1−α/2} + Z_{1−β})² / d² + Z_{1−α/2}² / 2   (3)

Example: A within person trial is designed with one eye randomly assigned to
treatment A and the contralateral fellow eye assigned to treatment B. The primary
outcome is the visual acuity score (the number of letters read correctly from the
visual acuity chart) at one year after treatment. It is desirable to know how many
patients need to be enrolled to provide 90% power for detecting a five-letter difference in
visual acuity score between the two treatment groups at a 5% type I error rate. Previous
studies suggest the SD of the visual acuity score is approximately 14 letters in each
treatment group.
Assuming a moderate inter-eye correlation of ρ = 0.50, the variance of the mean
visual acuity difference between the two treatment groups calculated using Eq. (2) is
2 × (1 − 0.50) × 14² = 196. The sample size can then be calculated using Eq. (3) as:

N = σ_d²(Z_{1−α/2} + Z_{1−β})² / d² + Z_{1−α/2}² / 2 = 196 × (1.96 + 1.28)² / 25 + 1.96² / 2 = 84   (4)

So a total of 84 patients (168 eyes) needs to be enrolled. If a parallel group design
were used, with 1 eye per patient randomized to either treatment A or treatment B, a total
of 332 patients (332 eyes) would be needed to achieve the same statistical power as the within
person design.
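The calculation in Eqs. (2) and (3) is straightforward to script. Below is a minimal Python sketch (the function name and the rounding to the nearest whole subject are illustrative assumptions) that reproduces this worked example; looping it over ρ from 0 to 0.9 also reproduces the subject counts in Table 1 to within one subject of rounding:

from scipy import stats

def n_paired_continuous(d, sd, rho, alpha=0.05, power=0.90):
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    var_d = 2 * (1 - rho) * sd ** 2                       # Eq. (2)
    n = var_d * (z_a + z_b) ** 2 / d ** 2 + z_a ** 2 / 2  # Eq. (3)
    return round(n)  # subjects; each contributes a pair of eyes

print(n_paired_continuous(d=5, sd=14, rho=0.5))  # 84, matching Eq. (4)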
To demonstrate the impact of the inter-eye correlation on sample size, the sample
sizes at inter-eye correlations ranging from 0 to 0.9, calculated using Eqs. (2) and (3),
are provided in Table 1. Compared with the 332 eyes from 332 patients required by the
parallel group design to achieve the same statistical power, Table 1 clearly demonstrates
that the gain in efficiency from the within person trial leads to a reduced sample size,
and that the magnitude of the gain depends on the within person correlation:
the higher the inter-eye correlation, the smaller the sample size. For example,
when the inter-eye correlation is 0.5, the reduction in sample size is 75%
in terms of the number of patients and 50% in terms of the number of eyes.

Sample Size for Binary Outcome

For a two-arm within person trial with a binary outcome (e.g., treatment success or
failure), a 2 × 2 table (Table 2) can be laid out to estimate the parameters needed
for the sample size calculation.
The odds ratio for success of treatment B relative to treatment A is calculated as

Ψ = b/c   (5)

The discordant proportion in response between treatments A and B is calculated as

π_discordant = (b + c)/N   (6)
Table 1 Comparison of the sample size using the within person design and the parallel group design^a under various inter-eye correlations for an ophthalmology trial

Within person      Sample size from within person       % Reduction in number of       % Reduction in number of
correlation (ρ)    trial: number of subjects            subjects compared to the       eyes compared to the
                   (number of eyes)^b                   parallel group design^a        parallel group design^a
0.0                166 (332)                            50%                            0%
0.1                150 (300)                            55%                            10%
0.2                134 (268)                            60%                            20%
0.3                117 (234)                            65%                            30%
0.4                101 (202)                            70%                            40%
0.5                84 (168)                             75%                            50%
0.6                68 (136)                             80%                            60%
0.7                51 (102)                             85%                            70%
0.8                35 (70)                              90%                            80%
0.9                18 (36)                              95%                            90%

^a The parallel group design requires a total of 332 patients (332 eyes), assuming a standard deviation of 14 letters and 90% power to detect a mean difference of 5 letters in visual acuity between treatments A and B at a type I error rate of 0.05
^b Assumes a standard deviation of 14 letters and 90% power to detect a mean difference of 5 letters in visual acuity between treatments A and B at a type I error rate of 0.05

Table 2 The 2 × 2 table for the comparison of paired binary outcomes from a within person trial with N participants

                            Treatment B
Treatment A                 Failure      Success      Total                Anticipated proportions
Failure                     a            b            a + b                1 − π_A
Success                     c            d            c + d                π_A
Total                       a + c        b + d        N = a + b + c + d
Anticipated proportions     1 − π_B      π_B

For the within person design, the number of subjects needed (N) with a
two-sided type I error rate α and power 1 − β can be calculated using the
following formula:

N = [Z_{1−α/2}(Ψ + 1) + Z_{1−β}√((Ψ + 1)² − (Ψ − 1)² π_discordant)]² / [(Ψ − 1)² π_discordant]   (7)

In order to calculate the sample size N using Eq. (7), assumptions about the
expected odds ratio Ψ and the discordant proportion π_discordant need to be made.
If information on Ψ and π_discordant is not available, they can be estimated from the
anticipated treatment response rates π_A and π_B as follows:

Ψ = π_A(1 − π_B) / [π_B(1 − π_A)]   (8)

π_discordant = π_A(1 − π_B) + π_B(1 − π_A)   (9)

Example: Suppose a large within person trial similar to the Early Treatment for
Retinopathy of Prematurity (ETROP) Study (Good and Hardy 2001) will be designed to
test the hypothesis that earlier treatment in selected high-risk cases of acute ROP
results in better visual outcomes than conventional ROP management. A pilot
study in 30 infants with bilateral ROP provided the data in Table 3.

Table 3 The pilot data from a hypothetical within person trial

                         New early treatment
Conventional treatment   Unfavorable vision   Favorable vision
Unfavorable vision       3                    9
Favorable vision         6                    12

Based on these pilot data, the large multi-center within person trial is designed to
provide 90% power for detecting an odds ratio of Ψ = 9/6 = 1.5 at a type I error rate
of 0.05. From the pilot data, π_discordant = (9 + 6)/30 = 0.5. Using Eq. (7), the
sample size is

N = [1.96 × (1.5 + 1) + 1.28 × √((1.5 + 1)² − (1.5 − 1)² × 0.5)]² / [(1.5 − 1)² × 0.5] = 521   (10)

So 521 infants with bilateral ROP (1042 eyes) need to be enrolled, with one eye
treated with the new early treatment and the fellow eye with conventional treatment.
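This calculation is also easy to script. Below is a minimal Python sketch of Eq. (7), with Ψ and π_discordant taken from the Table 3 pilot data (the function and argument names are illustrative choices):

from math import sqrt
from scipy import stats

def n_paired_binary(psi, pi_disc, alpha=0.05, power=0.90):
    # Eq. (7): number of subjects for a paired binary outcome
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    num = (z_a * (psi + 1)
           + z_b * sqrt((psi + 1) ** 2 - (psi - 1) ** 2 * pi_disc)) ** 2
    return round(num / ((psi - 1) ** 2 * pi_disc))

# pilot data from Table 3: psi = 9/6 = 1.5 and pi_discordant = (9 + 6)/30 = 0.5
print(n_paired_binary(psi=1.5, pi_disc=0.5))  # 521, matching Eq. (10)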

Statistical Analysis

One of the major features of within person trials is that comparisons of outcomes
are made within persons. Because outcome measures from the paired organs or body
sites of the same person are correlated, the statistical analysis should appropriately
account for the correlation among measures from the same person. A common mistake
in the analysis of data from within person trials is the lack of adjustment for the within
person correlation in statistical comparisons of trial outcomes, leading to invalid
conclusions (Murdoch et al. 1997; Pandis et al. 2017; Zhang and Ying 2018).
For a within person trial involving two treatments in the paired organs/body
sites of each participant, the statistical methods for analyzing trial outcomes mea-
sured at the end of the trial are standard, such as the paired t-test for
comparison of a continuous outcome and the McNemar test for comparison of a binary
outcome. Similar to other trials, the statistical analyses of data from a within person
trial need to consider loss to follow-up and missing data, which can occur in
both organs/sites of a participant (e.g., due to a missed follow-up visit) or in a
single organ/site (e.g., due to poor image quality in one eye). In within person trials
with concurrent interventions, the losses to follow-up are usually equal between
treatment groups and thus are unlikely to bias the estimate of treatment effect,
but they will decrease the statistical power and limit the generalizability of the
trial results.
One advantage of within person trials is the elimination of confounding from
person-level baseline covariates, because these person-level baseline characteristics
are balanced across treatment groups; the statistical analysis and interpretation of the
treatment effect therefore do not need to address person-level confounders. How-
ever, imbalance in organ-specific or site-specific variables can still occur;
thus, the statistical analysis comparing outcomes between treatment groups needs to
account for imbalance in baseline variables at the organ/site level by using mixed
effects models or marginal models (Laird and Ware 1982; Liang and Zeger 1986;
Ying et al. 2017a, 2018).

Analysis of Continuous Outcome Measures

For a within person trial with n participants in which each participant received two
treatments (A and B), let the outcome measures of individual i be y_iA and y_iB for
the organ/site under treatment A and treatment B, respectively. The within person
difference d_i in the continuous outcome measure between treatments A and B is

d_i = y_iA − y_iB   (11)

The mean, standard deviation (SD), and standard error (SE) of the difference
between treatments A and B are

d̄ = (1/n) Σ_{i=1}^{n} d_i = ȳ_A − ȳ_B   (12)

SD(d) = √[ Σ_{i=1}^{n} (d_i − d̄)² / (n − 1) ]   (13)

SE(d̄) = SD(d) / √n   (14)

The corresponding paired t-test statistic for comparing the continuous outcome
between treatments A and B is

t = d̄ / SE(d̄)   (15)

which follows a t-distribution with n − 1 degrees of freedom.
If the distribution of the outcome measure is very skewed, a nonparametric test
such as the Wilcoxon signed rank test or the sign test can be used for the comparison
between treatment groups.
If the baseline covariates (at organ or body site level) need to be accounted for in
the comparison of outcome between treatment groups, the model-based analysis
needs to be used. However, such statistical model should account for the within
person correlation by using either the mixed effects model or marginal model (Laird
and Ware 1982; Liang and Zeger 1986; Ying et al. 2017a, 2018). When performing
the model-based analysis for within person trials, the correlation structure for the
within person correlation needs to be specified. The mis-specification of correlation
can potentially impact the model results. However, in within person trials of oph-
thalmology (with unique feature of cluster size of 2 due to two eyes of each subject),
the analyses using various specifications of correlation structure (unstructured,
compound symmetry, or working correlation) provide very similar results (Ying
et al. 2017, 2018).
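As an illustrative sketch of such a model-based analysis (not taken from any of the
cited trials), the following fits a marginal model with an exchangeable working
correlation in Python using statsmodels; the data layout, with one row per eye, and
all variable names are hypothetical:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: two rows (eyes) per participant; 'treat' is
# the eye-level treatment indicator and 'base_va' an eye-level baseline
# covariate. None of these values come from a real trial.
df = pd.DataFrame({
    "pid":     [1, 1, 2, 2, 3, 3, 4, 4],
    "treat":   [1, 0, 1, 0, 1, 0, 1, 0],
    "base_va": [70, 68, 55, 60, 80, 79, 65, 66],
    "outcome": [75, 70, 58, 59, 83, 80, 70, 64],
})

# Marginal (GEE) model with an exchangeable working correlation, so that the
# within-person correlation between the two eyes is accounted for.
fit = smf.gee("outcome ~ treat + base_va", groups="pid", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(fit.summary())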

Example: The data in Table 4 are from the 11 participants of the Choroidal
Neovascularization Prevention Trial (CNVPT) (The CNVPT Research Group 1998) with
equal visual acuity in their paired eyes at baseline. The primary outcome of the
CNVPT is the visual acuity score measured at the end of 4 years of follow-up. In the
CNVPT, one eye of each participant was randomized to laser treatment for drusen,
and the fellow eye was observed without treatment as the control. The calculations
using Eqs. (11), (12), (13), and (14) give a mean difference of 0.90 letters, with
an SD of 12.3 letters and an SE of 3.7 letters. The paired t-test gives
t = 0.90/3.7 = 0.24 with 10 degrees of freedom, and the two-sided p-value is 0.81.
The nonparametric Wilcoxon signed-rank test gives a p-value of 0.85.
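These calculations can be reproduced directly from the eye-level data in Table 4
(below); the following is a minimal sketch in Python using scipy, with rounded
values in the comments:

import numpy as np
from scipy import stats

# Visual acuity scores at 4 years for the 11 CNVPT participants (Table 4).
treated   = np.array([78, 68, 84, 77, 81, 86, 44, 63, 81, 86, 90])
untreated = np.array([54, 70, 83, 77, 82, 77, 60, 84, 82, 76, 83])

d = treated - untreated                     # within-person differences, Eq. (11)
print(d.mean())                             # ≈ 0.90 letters, Eq. (12)
print(d.std(ddof=1))                        # ≈ 12.3 letters, Eq. (13)
print(d.std(ddof=1) / np.sqrt(len(d)))      # ≈ 3.7 letters, Eq. (14)

print(stats.ttest_rel(treated, untreated))  # t ≈ 0.24 on 10 df, p ≈ 0.81, Eq. (15)
print(stats.wilcoxon(d))                    # Wilcoxon signed-rank test, p ≈ 0.85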

Table 4 Visual acuity from 11 participants of the CNVPT

             Visual acuity score (in letters) at 4 years
Patient ID   Laser-treated eye   Untreated eye   Difference (treated eye − untreated eye)
1            78                  54               24
2            68                  70               −2
3            84                  83                1
4            77                  77                0
5            81                  82               −1
6            86                  77                9
7            44                  60              −16
8            63                  84              −21
9            81                  82               −1
10           86                  76               10
11           90                  83                7
Mean         76.2                75.3             0.90
SD           13.3                10.0             12.3
SE           4.0                 3.0              3.7

Statistical Comparison of Binary Outcome

When the outcome measure of a within-person trial is binary (yes/no), it is not
appropriate to use the standard chi-square test, because it ignores the
within-person correlation. Instead, the McNemar test should be applied for the
comparison of proportions.
A presentation using the 2 × 2 paired tabulation format of Table 2 is desirable,
as it provides the counts of concordant and discordant pairs.

The proportion difference for the binary outcome is:

Δ = (b + d)/N − (c + d)/N = (b − c)/N        (16)
Under the null hypothesis that there is no difference between treatment groups,
b and c are expected to be equal, given a total of b + c discordant pairs. When
b or c is small (<5), a continuity correction is often applied using the formula:

χ² = (|b − c| − 1)² / (b + c)        (17)
In large samples, the McNemar test is

χ² = (b − c)² / (b + c)        (18)
The McNemar test statistic has 1 degree of freedom.

Example: The CAPT Study (The CAPT Research Group 2004) was designed to evaluate
whether prophylactic laser treatment to the retina can prevent the development of
advanced-stage age-related macular degeneration (AMD). One of the analyses of the
primary outcome compares the incidence rate of geographic atrophy (GA) between the
treated eye and the control eye of the same participant at 4 years of follow-up.
The cross-tabulation of GA incidence among the 997 participants who completed the
4-year follow-up is as follows:

                       Control eye
Laser-treated eye    No GA    GA    Total
No GA                892      32    924 (92.7%)
GA                    29      44     73 (7.3%)
Total                921      76    997

The McNemar test for comparing the GA incidence rate between the treated eye and
the untreated control eye is:

χ² = (29 − 32)² / (29 + 32) = 0.1475

with 1 degree of freedom; the corresponding two-sided p-value is 0.70.
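As a sketch, the statistic and its p-value can be computed in Python from the
discordant counts alone:

from scipy.stats import chi2

# Discordant pairs from the cross-tabulation above: b = GA in the control eye
# only (32), c = GA in the laser-treated eye only (29).
b, c = 32, 29

stat = (b - c) ** 2 / (b + c)   # Eq. (18): 9/61 = 0.1475
p = chi2.sf(stat, df=1)         # two-sided p-value, approximately 0.70
print(stat, p)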

Summary and Conclusion

For diseases that affect paired organs or two body sites, treatment can be either
systemic or organ-specific. When the treatment and the outcome measure are specific
to the organ or body site, a within-person design may be applied to improve the
efficiency of the trial. Such designs have been commonly used in ophthalmology,
dentistry, and dermatology. When the within-person correlation in the outcome
measure is high, the design can substantially reduce the sample size. However,
careful consideration needs to be given to the possibility of carry-across effects,
the feasibility of recruiting sufficient patients with bilateral disease, and the
limited generalizability of trial results to patients with unilateral disease. The
sample size calculation and the statistical analysis also have to account for the
within-person correlation in outcome measures.

Key Facts

• The within-person design is efficient because comparisons are made within the
same person.
• When the within-person correlation in the outcome measure is high, a within-person
design can substantially reduce the sample size.
• Within-person clinical trials are often used for conditions that affect paired organs
or multiple body sites of a person, such as in ophthalmology, dentistry, and
dermatology.
• Within-person trials pose challenges, including possible bias from carry-across
effects and the difficulty of recruiting subjects with a condition affecting
paired organs or multiple body sites.
• Within-person correlation should be taken into consideration in the sample size
determination and statistical analyses.

References
Acharya NR, Sittivarakul W, Qian Y et al (2011) Bilateral effect of unilateral ranibizumab in
patients with uveitis-related macular edema. Retina 31:1871–1876
Balk EM, Earley A, Patel K, Trikalinos TA, Dahabreh IJ (2012) Empirical assessment of within-
arm correlation imputation in trials of continuous outcomes. Methods Research Report.
(Prepared by the Tufts Evidence-based Practice Center under Contract No. 290-2007-
10055-I.) AHRQ Publication No. 12(13)-EHC141-EF. Agency for Healthcare Research and
Quality, Rockville
Bryant D, Havey TC, Roberts R, Guyatt G (2006) How many patients? How many limbs? Analysis
of patients or limbs in the orthopaedic literature: a systematic review. J Bone Joint Surg Am 88:
41–45
Bunce C, Wormald R (2015) Considerations for randomizing 1 eye or 2 eyes. JAMA Ophthalmol
133:1221
Cairo F, Sanz I, Matesanz P, Nieri M, Pagliaro U (2012) Quality of reporting of randomized clinical
trials in implant dentistry. A systematic review on critical aspects in design, outcome assessment
and clinical relevance. J Clin Periodontol 39:81–107
Diabetic Retinopathy Clinical Research Network (2010) Randomized trial evaluating Ranibizumab
plus prompt or deferred laser or triamcinolone plus prompt laser for diabetic macular edema.
Ophthalmology 117:1064–1077
Elman MJ, Aiello LP, Beck RW et al (2010) Diabetic retinopathy clinical research network.
Randomized trial evaluating ranibizumab plus prompt or deferred laser or triamcinolone plus
prompt laser for diabetic macular edema. Ophthalmology 117:1064–1077, e35
Glassman AR, Melia M (2015) Randomizing 1 eye or 2 eyes: a missed opportunity. JAMA
Ophthalmol 133:9–10
Good WV, Hardy RJ (2001) The multicenter study of early treatment for retinopathy of prematurity
(ETROP). Ophthalmology 108:1013–1014
Hujoel PP (1998) Design and analysis issues in split mouth clinical trials. Community Dent Oral
Epidemiol 26:85–86
Katz J (1988) Two eyes or one? The data analyst’s dilemma. Ophthalmic Surg 19:585–589
Koletsi D, Fleming PS, Seehra J, Bagos PG, Pandis N (2014) Are sample sizes clear and justified in
RCTs published in dental journals? PLoS One 9:e85949
Lai TYY, Wong VWY, Lam RF, Cheng AC, Lam DS, Leung GM (2007) Quality of reporting of key
methodological items of randomized controlled trials in clinical ophthalmic journals. Ophthal-
mic Epidemiol 14:390–398
Laird NM, Ware JH (1982) Random-effects models for longitudinal data. Biometrics 38:963–974
Lee CF, Cheng AC, Fong DY (2012) Eyes or subjects: are ophthalmic randomized controlled trials
properly designed and analyzed? Ophthalmology 119:869–872
Lesaffre E, Garcia Zattera M-J, Redmond C, Huber H, Needleman I, ISCB Subcommittee on
Dentistry (2007) Reported methodological quality of split-mouth studies. J Clin Periodontol 34:
756–761
Lesaffre E, Philstrom B, Needleman I, Worthington H (2009) The design and analysis of split-
mouth studies: what statisticians and clinicians should know. Stat Med 28:3470–3482
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73:13–22
Lie SA, Engesaeter LB, Havelin LI, Gjessing HK, Vollset SE (2004) Dependency issues in survival
analyses of 55,782 primary hip replacements from 47,355 patients. Stat Med 23:3227–3240.
https://doi.org/10.1002/sim.1905
Machin D, Fayers PM (eds) (2010) Randomized clinical trials. Wiley-Blackwell, West Sussex
Maguire MG (2020) Assessing Intereye symmetry and its implications for study design. Invest
Ophthalmol Vis Sci 61:27
Mintz-Hittner HA, Kennedy KA, Chuang AZ for BEAT-ROP Cooperative Group (2011) Efficacy
of intravitreal bevacizumab for stage 3+ retinopathy of prematurity. N Engl J Med 364:603–615
Murdoch IE, Morris SS, Cousens SN (1997) People and eyes: statistical approaches in ophthal-
mology. Br J Ophthalmol 82:971–973

Pandis N, Walsh T, Polychronopoulou A, Katsaros C, Eliades T (2013) Split-mouth designs in
orthodontics: an overview with applications to orthodontic clinical trials. Eur J Orthod 35:783–789
Pandis N, Fleming PS, Spineli LM, Salanti G (2014) Initial orthodontic alignment effectiveness
with self-ligating and conventional appliances: a network meta-analysis in practice. Am
J Orthod Dentofac Orthop 145(Suppl):S152–S163
Pandis N, Chung B, Scherer RW, Elbourne D, Altman DG (2017) CONSORT 2010 statement:
extension checklist for reporting within person randomized trials. BMJ 357:j2835
Paré A (1575) The James Lind Library. http://www.jameslindlibrary.org/pare-a-1575. Accessed
15 Mar 2019
Piltz J, Gross R, Shin DH et al (2000) Contralateral effect of topical beta-adrenergic antagonists in initial
one-eyed trials in the ocular hypertension treatment study. Am J Ophthalmol 130:441–453
Poon CY, Goh BT, Kim MJ, Rajaseharan A, Ahmed S, Thongsprasom K, Chaimusik M, Suresh S,
Machin D, Wong HB, Seldrup (2006) A randomized controlled trial to compare steroid with
cyclosporine for the topical treatment of oral lichen planus. Oral Surg Oral Med Oral Pathol
Oral Radiol Endod 102:47–55
Quinn GE, Dobson V, Biglan A et al (1995) Correlation of retinopathy of prematurity in fellow eyes
in the cryotherapy for retinopathy of prematurity study. The Cryotherapy for Retinopathy of
Prematurity Cooperative Group. Arch Ophthalmol 113:469–473
Smith DH, Ammons WF, Van Belle G (1980) A longitudinal study of periodontal status comparing
osseous recontouring with flap curettage. J Periodontol 51:367–375
The Age-Related Eye Disease Study Research Group (1999) The age-related eye disease study
(AREDS): design implications AREDS report no. 1. Control Clin Trials 20:573–600
The Choroidal Neovascularization Prevention Trial (CNVPT) Research Group (1998) Choroidal
neovascularization in the choroidal neovascularization prevention trial. Ophthalmology 105:
1364–1372
The Comparisons of Age-Related Macular Degeneration Treatment Trials (CATT) Research Group
(2011) Ranibizumab and bevacizumab for neovascular age-related macular degeneration.
N Engl J Med 364:1897–1908
The Complication of Age-Related Macular Degeneration Prevention Trial (CAPT) Research Group
(2004) The complications of age-related macular degeneration prevention trial (CAPT): ratio-
nale, design and methodology. Clin Trials 1:91–107
The Diabetic Retinopathy Study Research Group (1978) Photocoagulation treatment of prolifera-
tive diabetic retinopathy: the second report from the diabetic retinopathy study. Arch
Ophthalmol 85:82–106
The Dry Eye Assessment and Management Study (DREAM) Research Group (2018) N-3 fatty acid
supplementation and dry eye disease. N Engl J Med 378:1681–1690
The Early Treatment for Retinopathy of Prematurity (ETROP) Cooperative Group (2006) The early
treatment for retinopathy of prematurity study: structural findings at age 2 years. Br
J Ophthalmol 90:1378–1382
van Zuuren EJ, Fedorowicz Z, Carter B, Pandis N (2015) Interventions for hirsutism (excluding
laser and photoepilation therapy alone). Cochrane Database Syst Rev 4:CD010334
Vasavada VA, Praveen MR, Shah SK, Trivedi RH, Vasavada AR (2012) Anti-inflammatory effect
of low-molecular-weight heparin in pediatric cataract surgery: a randomized clinical trial. Am
J Ophthalmol 154:252–258.e4
Wang D, Bakhai A (2006) Chapter 10, clinical trials in practice. A practical guide to design, analysis
and reporting. Remedica, London
Ying GS, Maguire MG, Glynn R, Rosner B (2017a) Tutorial on biostatistics: linear regression
analysis of continuous correlated eye data. Ophthalmic Epidemiol 24:130–140
Ying GS, Pan W, Quinn GE et al (2017b) Inter-eye agreement of retinopathy of prematurity from
image evaluation in the telemedicine approaches to evaluating acute-phase ROP (e-ROP) study.
Ophthalmol Retina 1:347–354
Ying GS, Maguire MG, Glynn R et al (2018) Tutorial on biostatistics: statistical analysis for
correlated binary eye data. Ophthalmic Epidemiol 25:1–12
Zhang HG, Ying GS (2018) Statistical approaches in published ophthalmic clinical science papers:
a comparison to statistical practice two decades ago. Br J Ophthalmol 102:1188–1191
Device Trials
73
Heng Li, Pamela E. Scott, and Lilly Q. Yue

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1400
Drugs Versus Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1401
Mechanism of Action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1401
Safety and Efficacy/Effectiveness Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1401
Skill of the User . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1402
Implants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Placebo Effect and Sham Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Blinding or Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Design Considerations for Therapeutic Device Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Control Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1403
Blinding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1404
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1405
Clinical Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1406
Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1406
Error Rate Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1407
Special Considerations for Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1407
Imaging Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1408
Companion Diagnostic Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1408
Complementary Diagnostic Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1409
Next-Generation Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1409
Bayesian Design for Device Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1409
Prior Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1410
Bayesian Adaptive Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1410
Operating Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1411
Observational (Nonrandomized) Clinical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1411
Comparative Observational Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1412

Noncomparative Observational Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1412
Bias in Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1412
Outcome-Free Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1413
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1414
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1415

H. Li (*) · L. Q. Yue
Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring,
MD, USA
e-mail: [email protected]; [email protected]

P. E. Scott
Office of the Commissioner, U.S. Food and Drug Administration, Silver Spring, MD, USA

Abstract
This section provides an overview of clinical studies for medical devices. Impor-
tant differences between drugs and devices are highlighted. Specific topics
covered include Bayesian design and observational (nonrandomized) studies.
Special considerations are given to diagnostic devices.

Keywords
Medical device · FDA · Diagnostic device · Bayesian · Nonrandomized ·
Propensity score

Introduction

What is a medical device? A medical device is defined by law as “an instrument,
apparatus, implement, machine, contrivance, implant, in vitro reagent, or other
similar or related article, including any component, part, or accessory, which is
(1) recognized in the official National Formulary, or the United States Pharmaco-
poeia, or any supplement to them, (2) intended for use in the diagnosis of disease or
other conditions, or in the cure, mitigation, treatment, or prevention of disease in
man or other animals, or (3) intended to affect the structure or any function of the
body of man or other animals, and which does not achieve its primary intended
purposes through chemical action within or on the body of man or other animals and
which is not dependent upon being metabolized for the achievement of any of its
principal intended purposes” – [US Code, Food, Drug, and Cosmetic Act; 21 US
Code 321 (h)].
A simpler definition for medical device is a definition by exclusion. A medical
device is a medical product that does not have a chemical, metabolic, or biological
principle of action. As such, a medical device is any medical product that is not a
drug or biological product.
Medical devices range in complexity from simple products such as bedpans and
tongue depressors to complex products such as heart pacemakers and laser surgical
devices. The term also applies to in vitro diagnostic products and test kits for
diseases, conditions, or infections such as home test kits for HIV. Electronic radiat-
ing products with medical usage and claims such as x-ray machines also meet the
definition of medical devices. Certain software may be a medical device (US FDA
2016).

Drugs Versus Devices

Mechanism of Action

Medical devices do not function in the same manner as drugs. The mechanism of
action of a medical device is physical and localized, whereas the mechanism of
action for drugs is often chemical or biological and the effect can be localized or
systemic. The process of drug development is one of discovery. Drug discovery
involves the screening of candidate compounds to identify those promising enough
for development, further examination, and testing in animals and humans. Once a
drug is discovered, the chemical composition remains unchanged during the entire
development process. Although the dose, indications for use, dosing regimen,
preparation, and release of the drug can change, the chemical entity remains the
same. In contrast to drugs, devices are invented and often evolve through a series of
changes and improvements (Campbell and Yue 2016) during the course of develop-
ment and evaluation. As a result, the device that is marketed may be different from
the one that underwent testing. While any change to drug formulation may impact
the safety and effectiveness profile of the drug, minor device changes may have little
effect on the clinical performance or safety and effectiveness profile of the product.

Safety and Efficacy/Effectiveness Assessment

Clinical studies for medical devices are often preceded by bench/mechanical and
animal testing for reliability and biocompatibility. The premarket clinical testing of
investigational drugs is generally conducted in three stages or phases (I, II, and III).
Phase I trials are early development trials conducted in a small number of people
(e.g., 20–100) to evaluate safety issues such as adverse events and maximum
tolerated dose of the drug. If this early development study demonstrates that the
drug is not toxic, then clinical testing progresses to Phase II. During this middle
development stage, the drug is given to up to several hundred people with the
indicated condition to evaluate short-term safety and efficacy. After demonstration
of short-term safety and efficacy, the comparative treatment efficacy trial (Phase III)
is conducted in large numbers of people to demonstrate safety and efficacy. Typi-
cally, two Phase III trials are needed in order to market a drug.
The clinical testing of devices is generally characterized by the conduct of a
feasibility (pilot) study and a pivotal study (an analog of Phase III drug trial). Pilot
studies for medical devices may include only one investigator at one investigational
site with a small number of patients. The main focus of those studies is on safety
issues. Pilot studies are also used to obtain preliminary data to assess the learning
curve for device use, to generate estimates of effect sizes and variances for sample
size calculation, and to develop and refine study procedures for the pivotal trial.
Generally, one pivotal study is conducted to obtain data to evaluate safety and
effectiveness of devices prior to entry into the consumer marketplace. Both drugs
and medical devices have post-market studies referred to as Phase IV for drugs and
post-market studies for devices.
The standard for pharmaceutical drug regulation is one of substantial evidence
of safety and effectiveness obtained from well-controlled investigations (21 CFR
314.126), which historically have been well-controlled Phase III trials. The statu-
tory standard for approval and level of evidence required for devices is one of
reasonable assurance of safety and effectiveness based on valid scientific evidence.
Valid scientific evidence (21 CFR 860.7(c)(2)) is defined as “well-controlled
investigations, partially controlled studies, studies and objective trials without
matched controls, well-documented case histories conducted by qualified experts,
and reports of significant human experience with a marketed device” (21 CFR
860.7). The type of study required is very device specific. It may vary depending
on the indication for use and degree of experience with knowledge of the device. A
reasonable assurance of safety is obtained when “it can be determined, based upon
valid scientific evidence, that the probable benefits outweigh any probable risks,”
and can be demonstrated by establishing “the absence of unreasonable risk of
illness or injury associated with the use of the device for its intended uses and
conditions of use” (21 CFR 860.7(d)(1)). Similarly, a reasonable assurance of
effectiveness is obtained when “it can be determined, based upon valid scientific
evidence the use of the device for its intended uses will provide clinically signif-
icant results” (21 CFR 860.7(e)(1)). These criteria differ from the regulatory
standards set by law for drugs and biological products in the USA, and these
differences lead to a more varied approach in the studies and data required to
support market approval/clearance for devices.

Skill of the User

For drugs, the influence of physician technique or skill on the treatment outcome is
very low. Drugs are dispensed with instructions to the patient for medical use. It is
the responsibility of the patient or caregiver to comply with these instructions. As
such, the training for drug trials focuses on the protocol requirement, mechanism of
action of the drug, and potential adverse effects. In contrast, for medical devices, the
skill of physician or device user may play a significant role in the treatment outcome.
For example, the success of the device for implants and other devices that rely on
surgical technique may depend on the skill of the surgeon. The potential learning
curve on the part of the surgeon may have a major impact on the performance of the
product during the clinical study. The impact of the skill of the user can also be seen
in diagnostic trials such as diagnostic imaging devices that rely on skilled radiolo-
gists to read and interpret the images. Consequently, training requirements for device
clinical trials should include hands-on device training in addition to the protocol
requirements. Studies in which patients are treated with a medical device often assess
the contribution of the device user in addition to assessing contributions from the
device, disease, and patient.

Implants

Since a drug is metabolized, stopping the medication often addresses many adverse
effects. Devices that are implanted into the body pose a number of unique chal-
lenges. Many implantable devices cannot be easily removed once implanted in the
body. Consequently, the risk of removing the implant needs to be weighed against that
of the device remaining in the body.

Placebo Effect and Sham Control

The use of placebo control arm is common in drug clinical trials. However, in device
clinical trials it is often impractical or unethical to use a placebo (or sham) control
and withhold treatment, especially when considerable risk may be associated with
the sham surgery.

Blinding or Masking

In medical device clinical trials, it is often not possible to blind (or mask) the
treatment or implant that the patient is receiving or that the health-care provider
is delivering. Although it is possible to correctly guess the treatment in drug
trials, correct guessing is more likely in device trials, especially when the
procedures reveal treatment information.

Design Considerations for Therapeutic Device Trials

Like drugs, therapeutic devices are used to treat diseases. However, due to the
distinctions between drugs and devices highlighted in the previous section, certain
design considerations may figure more prominently for device trials. In this section
we discuss some of such considerations. The focus is on pivotal device trials. These
trials are intended as the primary clinical support for a marketing application, just as
Phase III drug trials.

Control Group

Medical devices may be invented for a range of purposes, from providing a therapy
that is superior to those currently available to offering a less invasive treatment
option for surgery. The choice of control group should reflect this purpose. Some-
times a device is meant to do both: serve as a more effective therapy than drugs for
patients too frail to undergo surgery and as a less invasive option for surgery for
patients who are less frail. If that is the case, then there need to be two separate trials,
one with medical therapy as control and the other with surgery as control (Svensson
et al. 2013). When a trial compares two different treatment modalities, patients and
physicians/surgeons are impossible to blind, and therefore artifacts such as the
placebo effect are difficult to rule out. In such a trial, the use of objective
endpoints (e.g., death or stroke) as primary endpoints would be advisable.
For a trial conducted to evaluate subjective endpoints such as pain or function,
one may consider using a so-called sham control when it is ethical and practical to do
so. A sham control has been broadly defined as a treatment or procedure that is
similar to the treatment or procedure under investigation but omits a key
therapeutic element of it. The riskier the sham control, the less likely it will be considered
ethical. A very risky sham control that is completely without benefit is seldom
justifiable. Of course, such judgment depends on the context. A recent example of
a sham-controlled device trial is the ORBITA study (Al-Lamee et al. 2018). The
therapy under investigation is percutaneous coronary intervention (PCI), the target
population is patients with stable chronic angina, and the endpoint is the exercise
time. Previously, a randomized trial of PCI (a device therapy) versus medical
management had found benefit on exercise tolerance (Parisi et al. 1992). The
ORBITA trial suggests that most of this apparent benefit is placebo effect, which
prompts the medical community to seriously question an accepted practice.
While clinical trials for first-of-a-kind devices often use medical therapy or
surgery as controls, later devices with similar indications may use other devices as
control. Such trials are often noninferiority trials. Sometimes the control can even be
designated as any device of the same indication that is commercially available,
allowing noninferiority claim to be made to an entire class of devices.
In certain device areas, randomized trials in which patients serve as their own
controls are conducted. These are not crossover trials: the treatments being
compared are not separated in time. Instead, they are administered simultaneously
to different parts of a patient. For example, to test an ophthalmic device, one may
randomly assign one eye to treatment and the other eye to control for each patient.
For such a design the experimental unit is an eye. In other device areas these within-
subject designs may use a limb, a blood vessel, or some other parts of body as
experimental units. The use of the subject as his/her own concurrent control allows
for the advantageous use of the correlation within the subject. This design is only
possible when the experimental device and control intervention effects are local and
do not overlap.

Blinding

Blinding refers to keeping key persons, such as patients, health-care providers, and
outcome assessors, unaware of the treatment administered. The purpose of blinding
is to minimize artifacts and biases coming from various sources, such as placebo
effect, performance bias (sometimes known as Hawthorne effects), and detection
bias (i.e., observer, ascertainment, and assessment bias) (Mansournia et al. 2017).
Therefore, it is important to use blinding when feasible. As mentioned in the
previous section, in a device trial blinding of patients and health-care providers is
not possible if the control group is medical management or surgery. When the
control group is another device, it may be possible to blind patients if the control
therapy is administered in a similar way as the investigational device therapy. If the
control device is visually indistinguishable from the investigational device, then it
may be possible to blind the health-care provider as well, which is the case with the
pivotal clinical trial for the TAXUS drug-eluting stent, where the control device is a
visually indistinguishable bare-metal stent (Stone et al. 2004). This is one of the few
device trials in which active control is used yet double blinding is possible.
Even when blinding of patients and providers is not feasible, the blinding of
outcome assessors could still be implemented in many circumstances, and it should
be implemented to forestall detection bias. The evaluation of some endpoints can be
conducted by examining video, audiotape, or photography, which can be sent to a
blinded “core lab” for interpretation. Clinical events could be adjudicated by a
blinded clinical events committee (CEC). For endpoints that need to be evaluated
by directly observing patients, the blinding of outcome assessors would involve
instructing the patients not to reveal the treatment they received. Of course, when
patients cannot be blinded, the assessors of patient reported outcomes (PRO) cannot
be blinded because they are the patients themselves.
The word blinding can also be used to refer to maintaining the confidentiality of
interim data. Interested readers are referred to Fleming et al. (2008) and Fleming
(2015). Yet another usage of “blinding” concerns the blinding of statisticians or data
analysts to outcome data, which will be discussed later in the subsection “Observa-
tional (Nonrandomized) Clinical Studies.”

Randomization

As the site-to-site variability tends to be relatively large for device therapy, it is
important to balance site distributions between the treatment arms in device trials.
Hence randomization is usually stratified by study site. Additional stratification on
key baseline covariates is sometimes desirable but is often limited by the average
number of patients per site, since this number is often relatively small given the
modest sample size of most device trials.
In some trials there is a time lag between randomization and the initiation of the
treatments being compared. For example, after randomization between device ther-
apy and surgery, it takes time to schedule the procedure for a device therapy and even
longer to schedule a surgery. In the meanwhile, events may happen that would
complicate the analysis of trial data. A patient may die or may decide to switch
treatment or drop out. Therefore, it is good practice to put a limit on this time lag in
the protocol, and to eliminate it where possible, particularly in some trials in which
device is compared to another device or sham procedure, such as the EchoCRT study
(Ruschitzka et al. 2013). In this study, all patients underwent device implantation
and were randomly assigned to have cardiac resynchronization therapy (CRT) capability
turned on or off. The randomization occurred after successful implantation of the
device and the adjustment of medical therapy for heart failure according to current
guidelines. There was no need for any time lag between randomization and treatment
initiation.

Clinical Endpoints

Pivotal device trials evaluate the safety and effectiveness of the device in the
population expected to be indicated. Accordingly, primary endpoints are divided
into one or more safety endpoints and one or more effectiveness endpoints. The
study would be considered successful if both the safety and effectiveness endpoints
are met. Occasionally, a single endpoint may play the dual role of a primary safety
and effectiveness endpoint.
The specification of clinical endpoints for a device trial often involves the concept
of device-related adverse events. These are events directly attributable to the device
itself. Therefore, it is imperative that the investigational device is precisely defined in
the protocol. The classification of whether an adverse event is device related requires
careful adjudication. Distinction is made between device-related and procedure-
related events. The latter are events that occur from the procedure, irrespective of
the device (Ouriel et al. 2013).
Beside primary endpoints, secondary endpoints that are not part of the study
success criteria are usually specified for a device trial. They may serve as the bases of
additional meaningful claims or provide further insight into the device effect or
mechanism of action. Sometimes it is possible to submit the primary endpoint results
to the regulatory agency for device approval while data collection for a secondary
endpoint is still ongoing.

Sample Size

Sample size is usually driven by study power, which is the probability of rejecting
the primary endpoint null hypotheses. Sometimes a powered secondary endpoint may
drive the sample size. While most device trials still adopt a fixed design, adaptive
designs, in which the sample size depends on the outcome data, are increasingly
common. In an adaptive design, the minimum sample size needs to be no smaller than
that required from a clinical perspective.
A recent example of a device trial using Bayesian adaptive design is the
SURTAVI trial (Reardon et al. 2017). The objective of the trial is to compare the
safety and efficacy of transcatheter aortic-valve replacement (TAVR) with surgical
aortic-valve replacement in patients who were deemed to be at intermediate risk for
surgery. It is a noninferiority trial with the primary endpoint being a composite of
death from any cause or disabling stroke at 24 months and a planned sample size of
1600. A Bayesian interim analysis was prespecified when 1400 patients had reached
12-month follow-up. Through Bayesian modeling, it was possible to calculate the
posterior probability of noninferiority in terms of the 24-month event rate even
though at the interim analysis not every patient had had 24-month follow-up. As it
turned out, this posterior probability exceeded the prespecified success threshold,
thus the trial could declare success early.

Error Rate Control

Type I error rate and type II error rate (one minus power), also called operating
characteristics, must be controlled in all hypothesis-driven pivotal device trials. Any
valid statistical approach to error rate control applies to device trials. For complex
adaptive designs, controlling the error rates is usually an iterative process. First a
tentative decision rule is set up so that operating characteristics can be obtained via
simulation. If the error rates are not satisfactory, then the decision rule is adjusted,
and simulation is carried out again. This process continues until one arrives at a
decision rule that leads to adequate error rate control. The success threshold for the
posterior probability of noninferiority in the SURTAVI trial was determined in this
fashion. In general, the simulation should be reasonably extensive by covering a
wide range of scenarios.

Special Considerations for Diagnostics

A diagnostic device is a medical device that is used to identify or assist in
identifying medical conditions of interest in a well-specified intended use population (Yu et al.
2016). Diagnostic devices can be classified into two broad categories: in vitro and
in vivo. In vitro devices are laboratory tests based on tissue or blood specimens
sampled from patients, such as genetic tests (Campbell et al. 2018). In vivo devices
involve test procedures performed directly on the patient, such as diagnostic imag-
ing. The data produced by a diagnostic device can be qualitative (e.g., dichotomous),
quantitative, or semi-quantitative (ordinal scale). Quantitative measures can be
transformed into dichotomous results via a threshold or cutoff value. Multiple cutoff
points may be applied to a quantitative test to generate ordinal categories. The
performance of a test producing a dichotomous result (positive or negative) is
measured by its sensitivity and specificity when there is a truth standard (a gold
standard test) (Pepe 2003). The diagnostic accuracy of a quantitative test can be
evaluated by receiver operating characteristic (ROC) analysis. The ROC plot is a
graph of the observed sensitivity versus 1 minus the observed specificity of the
diagnostic test, evaluated at all possible thresholds that one could use to dichotomize
the diagnostic test. One global measure of the diagnostic capability of the test is the
area under the ROC plot (AUC) (Pepe 2003; Zhou et al. 2009). Some diagnostic
devices are essentially instruments of measurement. Indicators of performance for
such devices include systematic bias, accuracy (agreement with the true value of the
measurand), imprecision (variability of repeated measurements), and limit of
detection (the lowest concentration of an analyte that can be reliably distinguished
from zero). The opposite of imprecision is precision.
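To make these measures concrete, the following minimal sketch computes sensitivity
and specificity at a single cutoff and the AUC for a small hypothetical data set
(the data, the cutoff of 5.0, and the use of scikit-learn are illustrative
assumptions, not taken from any referenced study):

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example: gold-standard disease status (1 = diseased) and a
# quantitative test result for ten patients.
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
score = np.array([8.1, 6.9, 4.4, 7.7, 3.2, 4.8, 2.9, 5.1, 3.7, 4.1])

positive = score >= 5.0                        # dichotomize at a cutoff of 5.0
sensitivity = positive[truth == 1].mean()      # 0.75: true positive fraction
specificity = (~positive)[truth == 0].mean()   # 0.83: true negative fraction
auc = roc_auc_score(truth, score)              # 0.92: area under the ROC plot

print(sensitivity, specificity, auc)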
The randomized clinical trial, which is the basis of the clinical testing of drugs
and some therapeutic medical devices, is not generally applicable to the evaluation of
diagnostic devices. It would be unethical to randomize a patient to receive an
investigational device as the sole basis for diagnosis in the presence of an effective
alternative method. In the area of diagnostic devices, the study design is generally
observational in nature where each patient serves as his/her own control and receives
both the standard diagnostic test and investigational diagnostic device. But some-
times randomized controlled trials do occur, as we will see below.

Imaging Devices

A particular type of in vivo test is diagnostic imaging. Diagnostic imaging tests often
involve readings and/or interpretations by persons who may be referred to as readers
(or operators, evaluators, etc.), and present unique study design problems. In the case
of readers, it is often a question of what information is available and when. In a
so-called sequential design, a reader is provided more information gradually to
observe how their ratings change. This design is typically used for diagnostic
devices that are intended to be adjunctive to the standard imaging evaluation.
Alternatively, a crossover design is typically used to compare a new imaging
modality with a standard modality on reader diagnostic accuracy. In a basic version
of the fully crossed, multi-reader multi-case (MRMC) crossover design, cases are
divided randomly into groups A and B, which are read in both modalities by all
readers in two reading sessions separated by a washout period of time. A crossover
design is usually used to compare a new imaging modality with a standard modality
on reader diagnostic accuracy.

Companion Diagnostic Devices

Predictive biomarkers inform on likely outcomes with specific treatments. They
have become increasingly important for precision (or personalized) medicine. They
are the basis for an in vitro companion diagnostic device, defined by FDA as a
diagnostic test essential for the safe and effective use of a corresponding therapeutic
product which is identified in the product label (US FDA 2014). A companion
diagnostic test often defines the intended use population of the corresponding
therapeutic product. For example, patients may be considered eligible for a drug
only if their companion test result is positive because only in that subpopulation have
the drug’s benefits been established to outweigh its risks. Restriction of the eligibility
of a drug to patients with positive companion test results can be supported by a
qualitative interaction between treatment and test on the clinical outcome in an
all-comers trial. However, clinical equipoise may not exist for randomly assigning
the drug to test negative patients because a priori they are anticipated to not benefit
from it. Thus, many drugs are evaluated with their companion diagnostics using an

enrichment strategy (US FDA 2012) of enrolling just the test positive patients into a
randomized trial of the drug. In an enrichment trial, a significance test for qualitative
interaction is not possible; the diagnostic is evaluated only for whether it has selected
a population in whom the drug’s benefits outweigh its risks. For an example of a
clinical trial for a companion diagnostic device, see Rosell et al. (2012).

Complementary Diagnostic Devices

In contrast, a complementary diagnostic device is not required for the use of the drug
but provides information about a population who may derive greater benefit. It can
help inform the discussion between prescriber and patient. A complementary diag-
nostic can be described as having a quantitative interaction with the drug effect
(Beaver et al. 2017). The treatment is beneficial in both test negative and test positive
patients, but the benefit is smaller in test negatives. Note that quantitative interac-
tions can be an artifact of the scale of measurement of the treatment effect. Trial
designs and practical considerations for clinical evaluation of predictive biomarker
tests have received extensive review, including Polley et al. (2013). A general
statistical framework for deciding if a treatment should be given to everyone or to
just a biomarker-defined subpopulation has been proposed (Millen et al. 2012).
Largely because of advances in precision medicine, subgroup analysis and its
various purposes have received renewed interest (Alosh et al. 2015).

Next-Generation Sequencing

Most diagnostic in vitro medical devices are designed to test for a single analyte
associated with disease. In contrast, microarray and next-generation sequencing
(NGS) technologies can be used to measure large numbers of genetic analytes
simultaneously, e.g., gene expressions and single-nucleotide polymorphisms
(SNPs), that may confer useful diagnostic information. This creates a huge simulta-
neous testing problem in need of a multiplicity adjustment. The adjustment need not
be as severe as the Bonferroni correction when correlation between the tests is
considered. Permutation-based methods can be used to take into account the corre-
lation (Dudoit et al. 2002). Alternatively, Bayesian approaches have also been
proposed (Newton et al. 2001; Efron et al. 2001). It is worth noting that the false
discovery rate (FDR) (Benjamini and Hochberg 1995) is increasingly being con-
trolled in such large multiplicity problems, as opposed to the more traditional and
more conservative familywise Type I error rate.
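As a sketch of this kind of adjustment, the following compares the Benjamini-Hochberg
FDR procedure (Benjamini and Hochberg 1995) with the Bonferroni familywise correction
on a hypothetical set of p-values, using the statsmodels package:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from simultaneously testing many genetic analytes.
pvals = np.array([0.0002, 0.009, 0.012, 0.041, 0.049,
                  0.20, 0.34, 0.58, 0.77, 0.91])

# Benjamini-Hochberg controls the false discovery rate at 5%; Bonferroni
# controls the more conservative familywise type I error rate.
rej_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
rej_bon, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

print(rej_fdr.sum(), rej_bon.sum())  # 3 rejections under BH versus 1 under Bonferroni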

Bayesian Design for Device Trials

Bayesian statistical methodology has been used for well over 10 years in medical
device clinical trials for premarket submissions. The Center for Devices and Radio-
logical Health (CDRH) has published a guidance document “Guidance for the Use of
Bayesian Statistics in Medical Device Clinical Trials” (US FDA 2010). The Bayes-
ian guidance covers many topics on study design and is an essential reference in
designing Bayesian medical device clinical trials that will be reviewed by FDA.

Prior Information

The incremental steps in which improvements are made in device development make
the Bayesian approach particularly suitable. Good prior information is often avail-
able from, for example, trials in other countries, earlier trials on previous device
versions, or possibly bench tests or animal studies. In such situations, the natural
mode of statistical inference is Bayesian. A Bayesian clinical trial for a
medical device may include prior information for the investigational device, for the
control therapy, or for both the investigational device and control therapy. Previous
device studies used as sources of prior information should be recent and similar to
current studies in terms of devices used, objectives, endpoints studied, protocol,
patient population, investigational sites, physician training, and patient management.
Covariates such as demographics and prognostic variables can be used to calibrate
previous studies to the current study. The use of prior information often leads to more
precise estimates enabling decision-makers to reach a decision on a device with
smaller and shorter trials.

Bayesian Adaptive Design

Bayesian inference is used in medical device trials not only where there is prior
information that can be incorporated into the current trial, but also where a flexible
adaptive clinical trial is being considered (Berry et al. 2011; Campbell 2011, 2013).
When there is no good prior information, the prior distributions used in a Bayesian
adaptive design are usually relatively noninformative. One of the most prominent
advantages of a Bayesian approach to adaptive design over a frequentist one is that
the Bayesian approach allows for the construction of likelihood models that can use
information obtained at the current time to calculate predictive distributions of
observations at later time points. Predictive probabilities are widely used in medical
device clinical trials and they serve many purposes. For example, at each interim
analysis for sample size adaptation, one could decide whether to stop accrual, to
continue enrollment, or to declare futility based on predictive probability of trial
success. A possible decision rule could be: If the predictive probability of trial
success given the data on the enrolled subjects exceeds a prespecified value, then
stop accrual; if the probability of trial success under the maximum sample size is
below a certain value, then declare futility; otherwise, enrollment will continue to the
next stage unless the maximum sample size is reached. Predictions can be made only
if the patients yet to be observed are exchangeable with the patients already
observed. In device trials, patients enrolled later in the study may not be
exchangeable with patients enrolled earlier if there is a learning curve associated
with the device.
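A minimal sketch of such a predictive-probability rule, for a hypothetical
single-arm trial with a binary endpoint under a Beta-Binomial model, is given below;
the prior, the sample sizes, the performance goal, and the success threshold are all
illustrative assumptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical single-arm success-rate trial: n_now of n_max patients have been
# observed at the interim look; the trial succeeds if the posterior probability
# that the rate exceeds a performance goal of 0.70 is at least 0.975.
n_now, s_now, n_max, goal = 60, 48, 100, 0.70
a0, b0 = 1, 1                                  # noninformative Beta(1, 1) prior

def success(successes, n):
    post = stats.beta(a0 + successes, b0 + n - successes)
    return post.sf(goal) >= 0.975

# Predictive probability of trial success: draw the rate from the current
# posterior, simulate the remaining patients, and apply the final-analysis rule.
sims = 10_000
p_draws = rng.beta(a0 + s_now, b0 + n_now - s_now, size=sims)
future = rng.binomial(n_max - n_now, p_draws)
pred_prob = np.mean([success(s_now + f, n_max) for f in future])

print(pred_prob)  # e.g., stop accrual early if this exceeds a prespecified bound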

Operating Characteristics

For medical device trial designs submitted to FDA for review, including Bayesian
adaptive designs, it is of paramount importance to thoroughly evaluate their operating
characteristics, including type I error rate, power, the distribution of sample size, and
probability of stopping at each interim look. In a regulatory environment, it is
necessary to control type I error rate and to maintain power at appropriate levels,
just as for a frequentist design. In general, when no prior data are used, the type I
error rate is controlled at the customary frequentist level. When prior data are used,
the type I error rate is often controlled at a higher level, with consideration given to
the credibility of the prior data and the knowledge of potential benefit-risk profile of
the investigational device. Due to the inherent complexity in the design of a
Bayesian clinical trial, specifically when prior data are used or a wide variety of
adaptations are planned, an adequate characterization of the operating characteristics
of any particular trial design usually needs extensive simulations. Simulations are
performed under various more or less plausible scenarios of parameters of interest,
evaluating the desirability of the operating characteristics. The process may be
iterative, in the sense that sample size, any interim decision rule, study success
criteria, and priors may need to be adjusted many times to achieve acceptable
operating characteristics.
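Continuing the hypothetical single-arm sketch above, the operating characteristics
of the final-analysis rule can be estimated by simulating trials under null and
alternative scenarios; in practice the rule would be adjusted and the simulation
rerun until both error rates are acceptable:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical evaluation by simulation of the rule sketched above: success if
# the posterior probability that the rate exceeds 0.70 is at least 0.975 after
# 100 patients, under a Beta(1, 1) prior.
n, goal, threshold, sims = 100, 0.70, 0.975, 20_000

def trial_succeeds(true_rate):
    s = rng.binomial(n, true_rate)
    return stats.beta(1 + s, 1 + n - s).sf(goal) >= threshold

type1 = np.mean([trial_succeeds(0.70) for _ in range(sims)])  # null: rate = goal
power = np.mean([trial_succeeds(0.85) for _ in range(sims)])  # alternative

print(type1, power)  # adjust the rule and re-simulate until both are acceptable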
The advantages of Bayesian methodology have resulted in its increasing use in
the medical device arena (Campbell and Yue 2016). It is the methodology of choice
in situations where there is prior information that can be used to augment the current
trial, or where a flexible adaptive clinical trial is desirable.

Observational (Nonrandomized) Clinical Studies

There is no standard study design or approach that is applicable to the clinical testing
of all devices. As has been stated earlier, the statutory standard for approval and level
of evidence required for devices is one of reasonable assurance of safety and
effectiveness based on valid scientific evidence, and valid scientific evidence can
be from partially controlled studies or studies and objective trials without matched
controls. This is different from drugs, which require substantial evidence of safety
and effectiveness obtained from well-controlled investigations. Depending on the
specific device, randomized controlled trials (RCT) may be either unnecessary or
unfeasible. As a result, observational (nonrandomized) studies play a substantial role
in the evaluation of investigational medical devices.
Observational studies could be comparative through the explicit use of a control
group or may be carried out without a control group (US FDA 2013).

Comparative Observational Study

A comparative observational study (referred to as a comparative clinical outcome
study in US FDA (2013)) could be conducted using an internal (but non-
randomized) control, an external control, or a hybrid internal/external control. An
internal control refers to a control group that is enrolled into the study concurrently
with the investigational medical device group (treated group). Which treatment a
patient undergoes is determined by a mechanism other than randomization, such as
clinical judgment based on patient characteristics and risk factors. A key feature of
this kind of design is that the patients in the treatment and control groups are enrolled
prospectively and treated contemporaneously. Unlike internal control, an external
control is constituted by patients treated outside the investigational study. One
example is historical control, which can be formed from patients collected from
earlier studies of an approved device. Another example is a control extracted from a
well-designed and executed national or international patient registry database. A
control group could also be constituted in part by internal control and in part by
external control. In all three cases, the treatment comparison is made using data
collected from the treated group and data collected from the control group.

Noncomparative Observational Study

In a noncomparative observational study (referred to as a noncomparative clinical
outcome study in US FDA (2013)), the medical device is evaluated by comparing
the data collected from patients enrolled into the study and treated with the inves-
tigational device to information extracted from outside the study, for example, via
synthesis of previously conducted studies or accumulated experience. A numerical
value is currently the most common form for the extracted information on a clinical
outcome of interest and such numerical values have been referred to by various
names in the medical device arena, including objective performance criterion (OPC),
performance goal, and target value (US FDA 2013). Statistical reasoning should be
applied with judiciousness in producing such numerical values. Sometimes the
comparison can also be to a standard established by clinical judgment.

Bias in Observational Studies

While observational studies could provide potential benefits, such as savings in cost
or time of conducting clinical studies, statistical and regulatory challenges also arise
regarding the validity of study design and the interpretability of study results. For
instance, the lack of randomization in observational studies often leads to a system-
atic difference in the distribution of baseline covariates between investigational
device group and control group, resulting in bias in treatment effect estimation.
There may be more differences in baseline covariate distributions between the
treatment groups in studies using external controls. Such differences lead to doubts

about treatment group comparability, and hence the interpretability of study results
(Li and Yue 2008). Fortunately, there exist some statistical methods that could be
used to reduce bias, including traditional matching and stratification on baseline
covariates, regression (covariate) analysis, and the propensity score methodology
developed by Rosenbaum and Rubin (1983, 1984). However, it is important to note
that all of the aforementioned statistical methods can adjust only for confounding
covariates that are observed and incorporated in the statistical model, not for
unobserved ones. Also, when there are large differences in baseline covariates
between two treatment groups, these statistical methods may not be able to mitigate
the bias. And, of course, none of these statistical methods can adjust for bias caused
by the separation in time between a treated group and its historical control (temporal
bias), or by differences in medical practice across regions. Therefore, it is
critical that bias minimization start at the design stage of an investigational study.

Outcome-Free Design

In an RCT, study design and outcome data analysis are clearly separated:
outcome data are not available at the design phase, in which their analysis is
prespecified. However, traditionally this is often not the case with observational
studies. It has been recognized that designing an observational study with outcome
data in sight could compromise the objectivity of the study design and make study
results difficult to interpret (Yue 2007; Yue et al. 2014; Li et al. 2016; Yue et al.
2016). Rubin advocates objective design of observational studies, i.e., prospective
study design without access to any outcome data (Rubin 2001, 2007, 2008). This
outcome-free principle can be realized for propensity score design (Yue et al. 2014;
Li et al. 2016) – in building a propensity score model to balance covariates between
treatment groups, only baseline covariates and the treatment indicator are needed;
outcome data do not need to be accessed. Propensity score design is an iterative
process (Austin 2011). The aim is to derive a proper propensity score estimation
model and grouping or weighting method(s) such that adequate balance in covariate
distributions is reached. Not accessing the outcome data eliminates the bias
caused by selecting a propensity score model that favors one of the treatments. For
confirmatory investigational studies, outcome-free propensity score design can be
implemented via a two-stage design process (Yue et al. 2014; Li et al. 2016). Stage
I occurs when the clinical protocol is being developed. Key elements of stage I
include (1) selection of appropriate control group or data source for control group,
(2) preliminary estimation of sample size, and (3) specification of covariates to be
collected in the study and used in the second design stage. An independent
statistician to perform the study design is identified at this stage. Stage II of the
design starts ideally as soon as all patients are enrolled and information on all
baseline covariates is available. In this stage, the independent statistician identified
in the first design stage estimates propensity score, matches all patients in the
investigational device group with patients in the control group according to the
estimated propensity score, assesses balance in covariate distributions, and

finalizes control group selection and sample size estimation as well as the statistical
analysis plan for future outcome analysis. All these need to be performed without
access to any outcome data (Yue et al. 2014; Li et al. 2016). The two-stage
framework has been successfully applied to medical device clinical studies
(Thourani et al. 2016).
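The following sketch illustrates the stage II mechanics under simplified assumptions (simulated data, greedy 1:1 nearest-neighbor matching, and standardized mean differences as the balance metric, all our own choices rather than those of the cited papers); the key point is that the propensity model sees only baseline covariates and the treatment indicator, never outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))                          # baseline covariates only
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # nonrandom assignment

# Estimate propensity scores from covariates and treatment indicator alone
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbor matching of treated patients to controls
controls = list(np.flatnonzero(treat == 0))
pairs = []
for i in np.flatnonzero(treat == 1):
    if not controls:
        break                                        # ran out of controls
    j = min(controls, key=lambda k: abs(ps[k] - ps[i]))
    controls.remove(j)
    pairs.append((i, j))

# Balance check: standardized mean difference per covariate after matching
t_idx = [i for i, _ in pairs]
c_idx = [j for _, j in pairs]
smd = (X[t_idx].mean(axis=0) - X[c_idx].mean(axis=0)) / X.std(axis=0)
print("post-matching standardized mean differences:", np.round(smd, 3))
```

In a confirmatory setting, these steps would be carried out by the independent statistician, and only once balance is deemed adequate would the outcome analysis proceed.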

Summary and Conclusion

While there are many commonalities between clinical trials for medical devices and
for drugs, device trials do have some unique challenges. It is relatively straightfor-
ward to implement placebo control in a drug trial, but in many cases it is not ethical
to give patients the device equivalent of a placebo, namely a sham device. Blinding
or masking is often impossible to do in a device trial, especially when the treated and
control arms involve different treatment modalities, such as when the comparison is
between device and drug therapies. Due to the rapid pace at which innovations are
made, the product life cycle of a medical device is relatively short. It is not
uncommon for a newly marketed device to become obsolete and be replaced by next-
generation technology within a couple of years. This means that large and lengthy
randomized clinical trials are often impractical in the medical device arena. A wide
variety of clinical study designs and statistical methodologies, such as those over-
viewed in this section, have been utilized in medical device clinical trials. We believe
that opportunities for clinical trial and statistical innovation will continue to expand
in the future.

Key Facts

• A medical device is a medical product that does not have a chemical, metabolic,
or biological principle of action. Medical device regulations are different from
drug regulations.
• Due to the rapid pace at which innovations are made, the product life cycle of a
medical device tends to be shorter than that of a drug.
• Bayesian design is more common in device trials than in drug trials.
• Nonrandomized studies are more common in the medical device world than in
the drug world.

Cross-References

▶ Bayesian Adaptive Designs for Phase I Trials
▶ Diagnostic Trials

References
Al-Lamee R, Thompson D, Hakim-Moulay D et al (2018) Percutaneous coronary intervention in
stable angina (ORBITA): a double-blind, randomised controlled trial. Lancet 391:331–340
Alosh M, Fritsch K, Huque M, Mahjoob K, Pennello G, Rothmann M, Russek-Cohen E, Smith F,
Wilson S, Yue LQ (2015) Statistical considerations on subgroup analysis in clinical trials. Stat
Biopharm Res 7:286–304
Austin P (2011) An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behav Res 46:399–424
Beaver JA, Tzou A, Blumenthal GM, McKee AE, Kim G, Pazdur R, Philip R (2017) An FDA
perspective on the regulatory implications of complex signatures to predict response to targeted
therapies. Clin Cancer Res 23:1368–1372
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J R Stat Soc B 57:289–300
Berry SM, Carlin BP, Lee JJ, Müller P (2011) Bayesian adaptive methods for clinical trials. CRC
Press, Boca Raton
Campbell G (2011) Bayesian statistics in medical devices: innovation sparked by the FDA. J
Biopharm Stat 21:871–887
Campbell G (2013) Similarities and differences of Bayesian designs and adaptive designs for
medical devices: a regulatory view. Stat Biopharm Res 5:356–368
Campbell G, Yue LQ (2016) Statistical innovations in the medical device world sparked by the
FDA. J Biopharm Stat 26:3–16
Campbell G, Li H, Pennello G, Yue LQ (2018) Medical devices. In: Armitage P, Colton T (eds)
Encyclopedia of biostatistics. Wiley, New York
Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Stat Sinica 12:111–139
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray
experiment. J Am Stat Assoc 96:1151–1160
Fleming TR (2015) Protecting the confidentiality of interim data: addressing current challenges.
Clin Trials 12(1):5–11
Fleming TR, Sharples K, McCall J (2008) Maintaining confidentiality of interim data to enhance
trial integrity and credibility. Clin Trials 5(2):157–167
Li H, Yue LQ (2008) Statistical and regulatory issues in non-randomized medical device clinical
studies. J Biopharm Stat 18:20–30
Li H, Mukhi V, Lu N, Xu Y, Yue LQ (2016) A note on good practice of objective propensity score
design for premarket nonrandomized medical device studies with an example. Stat Biopharm
Res 8:282–286
Mansournia MA, Higgins JP, Sterne JA, Hernán MA (2017) Biases in randomized trials: a
conversation between trialists and epidemiologists. Epidemiology 28(1):54
Millen BA, Dmitrienko A, Ruberg S, Shen L (2012) A statistical framework for decision making in
confirmatory multipopulation tailoring clinical trials. Drug Info J 46(6):647–656
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential
variability of expression ratios: improving statistical inference about gene expression changes
from microarray data. J Comput Biol 8:37–52
Ouriel K, Fowl RJ, Davies MG et al (2013) Reporting standards for adverse events after medical
device use in the peripheral vascular system. J Vasc Surg 58:776–786
Parisi AF, Folland ED, Hartigan P et al (1992) A comparison of angioplasty with medical therapy in
the treatment of single-vessel coronary artery disease. N Engl J Med 326(1):10–16
Pepe MS (2003) The statistical evaluation of medical tests for classification and prediction. Oxford University Press, Oxford
Polley MY, Freidlin B, Korn EL, Conley BA, Abrams JS, McShane LM (2013) Statistical and
practical considerations for clinical evaluation of predictive biomarkers. J Natl Cancer Inst 105:
1677–1683

Reardon MJ, van Mieghem NM, Popma JJ et al (2017) Surgical or transcatheter aortic-valve
replacement in intermediate-risk patients. N Engl J Med 376(14):1321–1331
Rosell R, Carcereny E, Gervais R et al (2012) Erlotinib versus standard chemotherapy as first-line
treatment for European patients with advanced EGFR mutation-positive non-small-cell lung cancer
(EURTAC): a multicentre, open-label, randomised phase 3 trial. Lancet Oncol 13(3):239–246
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies
for causal effects. Biometrika 70:41–55
Rosenbaum PR, Rubin DB (1984) Reducing bias in observational studies using subclassification on
the propensity score. J Am Stat Assoc 79:516–524
Rubin DB (2001) Using propensity scores to help design observational studies: application to the
tobacco litigation. Health Serv Outcomes Res Methodol 2:169–188
Rubin DB (2007) The design versus the analysis of observational studies for causal effects: parallel
with the design of randomized trials. Stat Med 26:20–36
Rubin DB (2008) For objective causal inference, design trumps analysis. Ann Appl Stat 2:808–840
Ruschitzka F, Abraham WT, Singh JP et al (2013) Cardiac-resynchronization therapy in heart
failure with a narrow QRS complex. N Engl J Med 369(15):1395–1405
Stone GW, Ellis SG, Cox DA et al (2004) A polymer-based, paclitaxel-eluting stent in patients with
coronary artery disease. N Engl J Med 350(3):221–231
Svensson LG, Tuzcu M, Kapadia S et al (2013) A comprehensive review of the PARTNER trial. J
Thorac Cardiovasc Surg 145(3S):S11–S16
Thourani VH, Kodali S, Makkar RR et al (2016) Transcatheter aortic valve replacement versus
surgical valve replacement in intermediate-risk patients: a propensity score analysis. Lancet 387:
2218–2225
U.S. Food and Drug Administration (2010) Guidance for industry and FDA staff: guidance for the
use of Bayesian statistics in medical device clinical trials. Available at https://www.fda.gov/
downloads/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm071121.pdf.
Accessed 9 Feb 2018
U.S. Food and Drug Administration (2012) Draft guidance on enrichment strategies for clinical
trials to support approval of human drugs and biological products. Available at https://www.fda.
gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm332181.pdf.
Accessed 9 Feb 2018
U.S. Food and Drug Administration (2013) Design considerations for pivotal clinical investigations
for medical devices: guidance for industry, clinical investigators, institutional review boards and
Food and Drug Administration Staff. Available at: https://www.fda.gov/downloads/
medicaldevices/deviceregulationandguidance/guidancedocuments/ucm373766.pdf. Accessed
9 Feb 2018
U.S. Food and Drug Administration (2014) In vitro companion diagnostic devices: guidance for
industry and Food and Drug Administration Staff. Available at: https://www.fda.gov/
downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/
UCM262327.pdf. Accessed 9 Feb 2018
U.S. Food and Drug Administration (2016) Draft guidance: Software as a Medical Device (SAMD):
clinical evaluation. Available at: https://www.fda.gov/ucm/groups/fdagov-public/@fdagov-
meddev-gen/documents/document/ucm524904.pdf. Accessed 9 Feb 2018
Yu T, Li Q, Gray G, Yue LQ (2016) Statistical innovations in diagnostic device evaluation. J
Biopharm Stat 26:1067–1077
Yue LQ (2007) Statistical and regulatory issues with the application of propensity score analysis to
non-randomized medical device clinical studies. J Biopharm Stat 17:1–13
Yue LQ, Lu N, Xu Y (2014) Designing pre-market observational comparative studies using existing
data as controls: challenges and opportunities. J Biopharm Stat 24:994–1010
Yue LQ, Campbell G, Lu N, Xu Y, Zuckerman B (2016) Utilizing national and international
registries to enhance pre-market medical device regulatory evaluation. J Biopharm Stat 26:
1136–1145
Zhou X-H, Obuchowski NA, McClish DK (2009) Statistical methods in diagnostic medicine, 2nd
edn. Wiley, New York
74 Complex Intervention Trials

Linda Sharples and Olympia Papachristofi

Contents
Introduction
Developing the Intervention
  Defining the Intervention Components
  Development of the Package
  Timing of Evaluation
Feasibility/Early Phase Studies
Evaluation/Statistical Methods for Trial Design
  Individually Randomized Designs
  Cross-Classified Designs
  Cluster Randomized Trials
  Stepped-Wedge Designs
  Sample Size Estimation for Trials with Clustering
Model Fitting and Analysis
Reporting
Implementation
Summary and Conclusion
Key Facts
Cross-References
References

L. Sharples (*)
London School of Hygiene and Tropical Medicine, London, UK
e-mail: [email protected]
O. Papachristofi
London School of Hygiene and Tropical Medicine, London, UK
Clinical Development and Analytics, Novartis Pharma AG, Basel, Switzerland
e-mail: olympia.papachristofi@novartis.com

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_245

Abstract
Clinical trial methodology was developed for pharmaceutical drug development
and evaluation. In recent years, trials have expanded to an increasingly diverse
range of interventions.
The term complex intervention describes treatments that are multicomponent
and include clustering due to specific components, such as the healthcare
provider, which cannot be separated from the package of treatment and influence
treatment outcomes. This chapter provides an overview of the main consider-
ations in the design and analysis of complex interventions trials.
Initial development of complex interventions is a multidisciplinary endeavor
and requires rigorous qualitative and quantitative methods. Understanding both
the intervention components and how they interact is crucial for successful
development and evaluation of the intervention.
Published guidance on methods for feasibility, piloting, or early phase trials of
complex interventions is scarce. However, there are well-established methods for
phase III trials of multicomponent interventions that involve clustering. The most
commonly used methods, including individually randomized trials with
random effects for clusters, cluster randomized trials, and stepped-wedge cluster
randomized trials, are described. Analysis focuses on generalized linear (mixed)
models; methods for sample size estimation that accommodate the extra variance
related to clustering are also provided for a range of designs in this setting.
With careful attention to the correlation structure induced by the chosen
design, results can be analyzed in standard statistical software, although small
numbers of clusters, and/or small within-cluster sizes, can cause convergence
problems.
Statistical analysis results of complex interventions trials, including
those relating to components of the intervention, need to be considered
alongside economic, qualitative, and behavioral research to ensure that complex
interventions can be successfully implemented into routine practice.

Keywords
Randomized · Cluster randomized · Clinical trial · Complex intervention ·
Multicomponent · Clustering · Healthcare provider · Stepped-wedge

Introduction

In recent years, clinical trial methodology has been applied in diverse healthcare
settings and to a wide range of interventions beyond fixed-dose, single-drug
treatments. For example, novel surgical procedures (Sharples et al. 2018),
multidisciplinary packages of care for chronic diseases (education, physiotherapy,
medicine) (Dickens et al. 2014), mental health coping strategies (Mohr et al. 2011),
and public health interventions (Emery et al. 2017) have all been the subject of

randomized controlled trials (RCTs). The interventions in these trials are generally
termed complex interventions.
Although complex interventions have no single, universally accepted definition,
there are two main issues that characterize them: (i) the multicomponent nature of the
intervention itself and (ii) a level of dependency between trial participants treated by
the same healthcare provider, or in a group setting, that induces clustering of
outcome measurements. Note that, in this chapter, we use the general term provider
for any person who delivers all or part of a complex intervention, including
surgeons, physicians, therapists, nurses, physiotherapists, and so on.
In (i), what distinguishes complex from non-complex interventions is the fact that
they are made up of a number of components, which may be interdependent (e.g.,
use of rescue treatment after failure of initial therapy) or independent (e.g., treatment
packages comprising psychotherapy, physiotherapy, and health education). Some
components of the intervention package may be of interest themselves and thought
of as fixed effects (e.g., physiotherapy or not). Other components may not be
of interest themselves (nuisance parameters), but may introduce some level of
dependency between trial participants, or clustering. For example, two patients
treated by the same surgeon may have more similar outcomes than two patients
treated by different surgeons, due to differences between surgeons in experience and
expertise, as well as local (to the hospital) population attributes. In general, specific
surgeons are not of interest themselves, but they represent a population of surgeons
who might ultimately provide the intervention of interest (an example of (ii) is given
later). In this case, providers are better defined and analyzed as random effects (e.g.,
surgeon performing a procedure, therapist providing group cognitive behavioral
therapy). What links these (fixed and random effects) components is that they can
all have an influence on the effectiveness of the package that makes up the complex
intervention, and differences (between patients) in the specific treatment package
received may manifest as heterogeneity in the outcome of interest. The multi-
component nature of the intervention may also mean that multiple outcomes are
used to assess the success of its different components, which may complicate the
choice of primary outcome (Richardson and Redden 2014).
Complex intervention trials may be further complicated due to other factors in
addition to the intervention itself. The context in which interventions are delivered
may be complex, an obvious example being the operating rooms in which surgical
procedures are performed; these contain many instruments, machines, and devices,
operated by multiple interacting clinical disciplines (Blencowe et al. 2015). Public
health trials are clearly dependent on the setting in which they are conducted and on
how that affects the delivery of the intervention (Emery et al. 2017). In particular,
trials in hard-to-reach populations may require designs such as snowballing (index
cases identifying other cases for study participation) that introduce dependency
between participants (Yuen et al. 2013). Low- and middle-income countries may
have diverse healthcare infrastructure, staffing, cultural, and economic factors that
add to the complexity of an intervention and its evaluation (Cisse et al. 2017).
All these factors affect the way a novel intervention is developed, the amount of
standardization required for each component, the assessment of treatment adherence,

the fidelity to the intervention as defined, and the statistical design and analysis of the
trial. In particular, any random effects or clustering inherent in the delivery of the
complex intervention will increase the sample size compared to a design in which a
simple intervention is applied at the patient level, as independent and identically
distributed patient outcomes cannot be assumed.
General characteristics of complex interventions have been described, with the
Medical Research Council (MRC) in the UK providing an early general framework
for design of such trials in 2000 (Medical_Research_Council 2000), which
was updated in 2008 (Medical_Research_Council 2008). Although useful, these
guidelines do not provide detail on specific methods of assessment. It is generally
appreciated that the design and analysis of complex intervention trials require
rigorous quantitative and qualitative research methods, with collaborative work
between, for example, statisticians, behavioral scientists, and clinical experts.
While this chapter provides some general discussion on the use of mixed methods
in this context, the focus is primarily on statistical methods in a broad sense.

Developing the Intervention

Defining the Intervention Components

Pharmaceutical interventions are manufactured under strict quality control so that the
dose of active drug is known exactly (see, e.g., Pocock 1983). Moreover, delivery
of the drug rarely depends on the context, as it is typically identical across physicians
and settings. In contrast, complex intervention trials are often embedded in the
clinical or public health setting that they aim to address, which influences the trial
design. Moreover, there may be components of the intervention that can be left to the
discretion of the provider, say the type of sutures used in a surgical intervention or
the number of sessions required in a psychotherapy trial. For these reasons, including
differences in the way the intervention is delivered between patients, complex
interventions are naturally evaluated using pragmatic trials (Loudon et al. 2013).
Although there are exceptions (such as evaluations of new diagnostic tests), in many
cases, the evaluation aims to reflect how the intervention would perform in the
setting for which it is intended, rather than in a tightly controlled setting, with highly
selected patients, as is the case in trials assessing drug efficacy.
Nevertheless, to maintain scientific quality and ensure that the intervention can be
reproduced by other providers, a clear definition of exactly what constitutes the
complex intervention, and how it will be delivered in the planned setting, is
paramount. The Template for Intervention Description and Replication (TIDieR)
checklist and guide provides general advice on the reporting of interventions
(Hoffmann et al. 2014); this section provides guidance specific to complex
interventions.
Focusing on surgical trial design, Blencowe et al. (2016) highlighted the impor-
tance of describing each component of a surgical procedure, as well as the level of
standardization and flexibility permitted within the surgical intervention package.

For example, consider the Amaze trial of ablation to treat abnormal heart rhythm in
patients already scheduled for cardiac surgery; the control group received the
scheduled cardiac surgery alone (Sharples et al. 2018). The complete procedure
involved treating 12 different sections of the heart, but because Amaze was designed
to reflect the treatment as it was currently used in the UK (pragmatic trial), the
protocol allowed surgeons to use their judgment and treat as many sections as
considered necessary. In addition, surgeons were also able to use their expertise to
decide whether to apply co-interventions, such as electrical stimulation (cardiover-
sion), if the initial operation was not fully effective. All other procedure components
were fixed according to protocols. The flexibility inherent in Amaze is common in
surgical trials where procedure success depends heavily on the training, skill, and
expertise of the surgeon; as a result, the intervention is a combination of both the
operation delivered and the surgeon who performs it. Similarly, psychotherapy
typically involves a set of protocols and techniques, which must be well-defined,
together with a psychotherapist delivering them. The therapist will have an influence
both on the content and delivery of the package, as well as on patient adherence
(Walwyn and Roberts 2017).
The level of standardization to be considered when defining the intervention has
been discussed in surgical (Blencowe et al. 2016), behavioral (Mars et al. 2013), and
public health (Perez et al. 2018) contexts. The main factors to consider can be
summarized as follows:

(i) Which intervention components are necessary and which are optional?
(ii) Under what circumstances is each component mandatory, prohibited, or
optional?
(iii) For each component, what delivery methods are mandatory, prohibited, or
optional?
(iv) For each component, which delivery methods have a strict definition and which
can be applied flexibly?
(v) What training or competency is required for providers of each component?

Careful definition of the components of a complex intervention allows monitoring
of the intervention delivery and its components. Successful interven-
tion delivery requires both patient compliance (e.g., attendance at therapy ses-
sions) and completion of each component of treatment according to plan by the
provider, termed fidelity. Both are important to establish why a successful
intervention worked and crucially to inform its subsequent implementation into
the intended healthcare setting. Fidelity is less of an issue in traditional drug
trials, where the common departure from treatment delivery is patient noncompliance
with treatment, due to either side effects or lack of efficacy (see, e.g., Pocock 1983).
Note that, because a complex intervention comprises multiple components and
the intervention and context are difficult to separate, ensuring that its delivery will
be identical in all cases is challenging. For example, in the Amaze trial (Sharples
et al. 2018), almost 9% of the variation in risk-adjusted outcomes was related to

differences between cardiac surgeons. Thus, no two patients can be assumed to have
had an identical procedure, and the surgeon can be considered integral to the
intervention package; therefore, consideration must be given to the criteria for
including a surgeon in the trial. This principle is important for healthcare providers
delivering complex interventions in general.
However, establishing when a provider is sufficiently experienced to be
randomized in an RCT is not straightforward (discussed in detail in the next
section). For instance, learning assessments are complicated because as surgeons
gain experience, they are likely to undertake more high-risk cases. Their perfor-
mance might seem not to improve or even to deteriorate with time, but this may
be due to the increasing risk profile of the cases they handle, so that the duration of
their learning period (and hence when they can be randomized in an RCT) may
be overestimated.

Development of the Package

Complex interventions cover such diverse areas that detailed discussion of the initial
derivation of the intervention itself is beyond the scope of this chapter. General
guidance on the design and evaluation of complex interventions can be found in the
MRC complex intervention guidance (Medical_Research_Council 2008);
the IDEAL (Idea, Development, Exploration, Assessment, and Long-term Study)
publications provide a phased development of the intervention, from conception to
final evaluation, for surgical trials (McCulloch et al. 2009), and MOST (Multiphase
Optimization Strategy) describes a framework for behavioral interventions (Collins
et al. 2007).
Development of the intervention package requires input from a range of clinical
and research disciplines. As with all interventions, it is also important to have
a comprehensive knowledge of the state of the art. This may involve some or all
of the following activities:

(i) For a multicomponent intervention, systematic reviews and meta-analyses for
each component will be required.
(ii) Detailed assessment of the intended healthcare setting, treatment pathways, and
other contextual factors which may require both qualitative and survey work.
(iii) Understanding of the theoretical underpinnings of the intervention and its
relation to behavioral change in patients and public health practice, to improve
the chance of trial success and inform its implementation.
(iv) Statistical modeling using early data sets (if available) can provide estimates of
the likely clinical efficacy and cost-effectiveness of the intervention.

The following section focuses on how statistical methods may contribute to the
development of complex interventions; details of the use of all above methods in this
context are provided in a review by Richards and Hallberg and references therein
(Richards and Hallberg 2015).

Timing of Evaluation

The timing of definitive evaluation of a complex intervention is closely associated
with the amount of expertise required for it to be implemented. A new treatment that
is technically demanding, or depends on an in-depth understanding of its theory and
practice, may require a period of training for providers to achieve satisfactory
performance. Training can take a number of forms, including classroom instruction,
training in individual intervention components, and mentored delivery of the inter-
vention as a whole.
To ensure that providers have reached the required expertise level for participa-
tion in a trial, results from early cases completed during their training should be
closely monitored. Moreover, treatment effects may change during the trial as the
package of care is rolled out from the innovators who have initially developed it to a
wider group of providers. Whether treatment effects increase or decrease during the
period of evaluation may not be predictable, so that reassessment of treatment effects
as the trial progresses is recommended.
Simple summary statistics can be useful in comparing the success rates for early
and later recipients of the intervention during its development and evaluation. More
formally, with a carefully selected clinical outcome, or a more sensitive surrogate,
performance over time can be modeled as a function of the chronological order of the
interventions delivered, adjusting for patient risk factors as necessary. For example,
Cook et al. describe model-based methods for monitoring outcomes of a new
surgical procedure during its development, focusing on three characteristics: (i) the
initial level of provider performance; (ii) the learning rate, measuring how quickly
performance improves; and (iii) an asymptote or plateau representing the level at
which performance stabilizes (Cook et al. 2004). Papachristofi et al. demonstrate
situations in which a particular surgeon has achieved the required level of perfor-
mance for inclusion in an RCT (see Fig. 1); this requires setting a predetermined
level of performance and monitoring results until they reach this level (1a), exceed it
(1b and 1d), or are within an acceptable distance defined by the precision of
the estimated performance level (1c). Papachristofi et al. also extend this work in

an attempt to pinpoint the time at which performance has stabilized (Papachristofi
et al. 2016a).

Fig. 1 Learning curve scenarios for RCT randomization: panels (a)–(d) plot a measure of performance against a measure of experience (lower values on the y-axis represent superior performance)
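As a rough illustration (not the exact formulation of Cook et al.), the sketch below fits an exponential learning curve with precisely those three characteristics, an initial level, a learning rate, and a plateau, to simulated performance data; lower values represent better performance, as in Fig. 1.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(case_no, initial, rate, plateau):
    # Performance decays exponentially from an initial level to a plateau
    return plateau + (initial - plateau) * np.exp(-rate * case_no)

rng = np.random.default_rng(1)
cases = np.arange(1, 81)                        # chronological case order
true = learning_curve(cases, 0.30, 0.08, 0.10)  # assumed true trajectory
perf = true + rng.normal(0, 0.02, cases.size)   # noisy observed performance

(initial, rate, plateau), _ = curve_fit(learning_curve, cases, perf,
                                        p0=[0.3, 0.1, 0.1])
print(f"initial={initial:.3f}  rate={rate:.3f}  plateau={plateau:.3f}")
```

The estimated plateau and the case number at which performance approaches it could then inform when a provider is ready for randomization.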
Such models are useful to identify the stage at which provider performance for the
new intervention has stabilized, so that a phase III RCT can be initiated. Early
randomization of patients and premature inclusion of providers who have not
reached a stable level of performance may lead to biased estimates of treatment
effects and false inference of the novel intervention’s effectiveness. However, an
RCT should only begin if equipoise still holds; delaying the RCT evaluation for too
long carries the risk that providers form strong opinions as to the best treatment for
their patients, despite the lack of scientific evidence of efficacy.
Such analyses of the intervention’s development period are only possible if there
is complete and rigorous data collection from the beginning of development of new
interventions, documenting the timing and nature of any changes in the intervention.
Trials can then begin when the intervention has been fully refined and when
individual providers have reached the required level of performance.

Feasibility/Early Phase Studies

Once developed, complex interventions must go through formal evaluation.
However, there are challenges in generalizing the traditional early phases of
evaluation used for new drugs to the case of complex interventions, partly due to
the frequent lack of surrogate endpoints for early evaluation, multiple outcomes
of interest, and the inherent clustering in many complex interventions (Wilson et al.
2016). There may also be difficulty in deciding the timing of a randomized trial,
particularly if the intervention requires development of new skills or competencies
(Papachristofi et al. 2016a).
As a result, early evaluation of complex interventions typically addresses
feasibility and piloting of a phase III trial. The aim is to finalize the finer points of
the intervention delivery and inform the design of the definitive trial. Such early
phase, often single-arm studies can be used to estimate patient adherence, provider
fidelity, variance components, and interactions between different components of the
intervention package. Variance-covariance components are a key element of trial
design that captures the degree of similarity between individuals within a cluster.
This similarity between individuals within a cluster is defined as the proportion
of total variation attributable to clustering and is quantified using the intra-class
correlation coefficient (ICC), denoted in this chapter by ρ.
Because some of the parameters required for designing the definitive trial are
second order and above (e.g., variance components, interactions), and as useful
surrogates are rarely available, the power for testing hypotheses at this early
evaluation stage is low, and the emphasis is instead on estimation of parameters
needed for trial design.
Recent work has provided some clear thinking around the relationship between
complex interventions assessment and phase II drug trials (Wilson et al. 2016); its

focus is on both hypothesis testing and Bayesian methods to inform the decision to
continue to the phase III trial. Challenges that have been identified include:

(i) The small number of clusters (often defined by the few innovators who
developed the intervention) at early stages
(ii) The small number of patients in each cluster
(iii) The lack of information on cluster sizes and the ICC
(iv) The number of endpoints that may be under investigation, with no clear
decision about the appropriate phase III primary outcome
(v) The lack of data to inform (compound) hypothesis tests and/or Bayesian
utilities when assessing multiple outcomes (e.g., both efficacy and safety)

Although this work and references therein provide some useful guidance, early
phase trial statistical methodology is not yet established in the field.

Evaluation/Statistical Methods for Trial Design

The pivotal phase of evaluation is the phase III RCT assessing clinically important
outcomes, usually in a large sample of patients: see, for example, Pocock (1983). As
the phase III trial aims to estimate treatment efficacy or effectiveness, all aspects of
the intervention package, including treatment protocols, treatment duration, and
safety profiles, should have already been established. In what follows, the most
commonly used designs for phase III complex intervention trials are described.

Individually Randomized Designs

Consider first trials in which randomization, treatment, and outcomes assessment are
all conducted at the individual patient level. In situations where the intervention is to
be evaluated as a package of care, ignoring any random effects, standard trial design
and analysis methods can be used. In order to introduce some notation, assume a
generalized linear model for patient i, i = 1, . . ., m, with x_i a categorical covariate
representing treatment allocation:

\[ g^{-1}(E[y_i]) = \beta_0 + \beta_1 x_i , \tag{1} \]

where y_i is a measure of outcome and g is an appropriate link function. For example,
for continuous outcomes, the identity link is used, and the residual error terms are
(usually) assumed to be independent and identically distributed as e_i ~ N(0, σ_e^2).
This section focuses on generalized linear models for illustration, but methods for
time-to-event outcomes are also available; further covariates can additionally be
incorporated (omitted here for clarity).
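As a minimal illustration (simulated data with arbitrary effect sizes), fitting model (1) for a continuous outcome with the identity link is simply ordinary least squares of the outcome on the treatment indicator:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({"x": rng.binomial(1, 0.5, n)})    # 1:1 individual allocation
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(0, 1.0, n)

fit = smf.ols("y ~ x", df).fit()                     # identity link => OLS
print(fit.params)   # beta_0 (intercept) and beta_1 (treatment effect)
```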
Supplementary exploratory analysis of fixed components of the package can
be undertaken based on subsets of package components or interactions between
components. Such exploratory analyses will likely be underpowered unless the
trial has been designed to accommodate them. However, paired with qualitative
exploration of the reasons for differential response to the intervention, they can
identify areas for further research, suggest components linked with worse
outcomes that need refinement, and inform the optimum method of implementing
the intervention in clinical practice.
Clustering is often inherent in complex intervention trials due to the nature of the
intervention itself. For example, patients may be allocated to open surgery or
minimal access (keyhole) surgery arms individually, but the well-known heteroge-
neity between surgeons’ outcomes means that outcomes for two patients treated by
the same surgeon will be more similar than outcomes for two patients treated by
different surgeons; that is, there is clustering by surgeon (Papachristofi et al. 2016b).
This violates the usual statistical assumption that patient outcomes are independent
and identically distributed. The extent to which violation of this assumption affects
the point estimate and precision of the treatment effect clearly depends on the degree
of similarity between outcomes of patients within the same cluster.
Patient clustering within treatment providers and hospitals introduces hierarchies
of care (patient within provider within hospital). Providers (clusters) can be modeled
as fixed effects; however, this may introduce many additional parameters in the
model that are not of interest in themselves, that is, nuisance parameters. Alterna-
tively, providers can be modeled as random effects; these treat the clusters as a
random sample from a population so that the specific treatment providers are only of
interest in that they represent a population of providers, to which the results will be
generalizable. In this case, accommodation of clustering during modeling using
random intercept terms should reduce bias and correct the type I error in the
treatment effect estimation (Papachristofi et al. 2016b; Kahan and Morris 2013).
The statistical model for patient i, i = 1, . . ., m in cluster j, j = 1, . . ., c can be
written:

\[ g^{-1}(E[y_{ij}]) = \beta_0 + \beta_1 x_{ij} + u_j \tag{2} \]

where u_j ~ N(0, σ_u^2) are the cluster-specific random intercepts for providers. For the case
of a linear model, the residual error terms for patient i in cluster j, e_ij | u_j ~ N(0, σ_e^2), are
now independent conditional on cluster occupancy. Such a model is termed a hierarchical
or nested design (see Fig. 2a). Normally distributed random effects (on some scale) are
almost universally used in the trials literature for the primary analysis, with other
distributional assumptions explored in sensitivity analyses.

Fig. 2 Illustration of individually randomized trials with clustering in (a) both trial arms, (b) the experimental arm only, (c) cross-classified, and (d) multiple membership multiple classification; the panels depict providers (surgeons and, in (c), anesthesiologists) linked to the patients they treat
other distributional assumptions explored in sensitivity analyses.
For the simple nested model (2) the ICC is given by

σ 2u
ρ¼ : ð3Þ
σ 2u þ σ 2e

We refer to this as the simple ICC. The ICC can be interpreted as the proportion of
the total variation that is attributable to between-cluster variance and is an important
parameter for sample size estimation for phase III trials (see section “Sample Size
Estimation for Trials with Clustering” below). High values for ρ indicate that
intervention delivery is quite heterogeneous between clusters relative to the
within-cluster variation and vice versa.
There are two main alternative scenarios to this simple design: first, the random
components affect outcomes differently in the intervention and control arms, and,
second, the treatment effect varies between clusters within the same arm (i.e.,
random coefficient for treatment).
The first scenario might arise in trials with very different treatment arms, for
example, a trial of a new technically demanding surgery (high variation) compared
with standard surgery (low variation) and can be modeled by
\[ g^{-1}(E[y_{ij}]) = \beta_0 + \beta_1 x_{ij} + u_j \delta(x_{ij}=1) + u'_j \delta(x_{ij}=0) \tag{4} \]

where u_j ~ N(0, σ_u^2) and u'_j ~ N(0, σ_{u'}^2) are the cluster-specific random effects in the
treatment and control arms, respectively, and δ is the treatment arm indicator function;
for a linear model, residual error terms are again expressed as e_ij | u_j ~ N(0, σ_e^2). Such
trials are described as partially nested if the control arm random effects are all zero
(e.g., a surgery versus medical management trial; see Fig. 2b). Assuming equal numbers
of patients per cluster and equal numbers of clusters per arm, the ICC can be written as

\[ \rho = \frac{0.5\left(\sigma_u^2 + \sigma_{u'}^2\right)}{0.5\left(\sigma_u^2 + \sigma_{u'}^2\right) + \sigma_e^2} \tag{5} \]

and comprises three variance terms; the two random effects variances are considered
independent since they are estimated in different clusters.
The second scenario might arise in a novel surgery versus standard surgery trial,
where the between-surgeon variation in outcomes manifests through heterogeneity
in the treatment effect; this can be modeled by
\[ g^{-1}(E[y_{ij}]) = \beta_0 + \beta_1 x_{ij} + u_j + u'_j x_{ij} , \tag{6} \]

where u_j ~ N(0, σ_u^2) and u'_j ~ N(0, σ_{u'}^2) are the cluster-specific random effects on the
intercept and treatment coefficient, respectively; for a linear model, e_ij | u_j, u'_j ~ N(0, σ_e^2)
are residual error terms. In the random coefficient model, correlation between the two
random effects parameters is possible (σ_{uu'} = r σ_u σ_{u'}), and the ICC can be written as

\[ \rho = \frac{\sigma_u^2 + \sigma_{u'}^2 + 2\sigma_{uu'}}{\sigma_u^2 + \sigma_{u'}^2 + 2\sigma_{uu'} + \sigma_e^2} . \tag{7} \]

Cross-Classified Designs

Extending the single random component model to accommodate intervention packages
with multiple random components is not straightforward since they are usually
not fully nested; that is, the resulting structure is not fully hierarchical. For example,
Papachristofi et al. (2016b) describe outcomes after surgical interventions including
both surgeons and anesthesiologists as random effects, while Roberts and Walwyn
(2013) consider both psychotherapists and general medical doctors as random effects
in mental health trials. In both these examples, two types of provider were expected
to influence outcomes, and pairs of providers were expected to share some, but not
all of their patients (Fig. 2c). Such components are said to be crossed and can be
analyzed using cross-classification methods (Browne et al. 2001). For illustration,
consider a two-level cross-classified model for patient i treated by the jth provider of
type A and the kth provider of type B, with treatment x_ijk and outcome y_ijk. The
corresponding cross-classified model may be written as

\[ g^{-1}(E[y_{ijk}]) = \beta_0 + \beta_1 x_{ijk} + u_j + v_k , \tag{8} \]

where u_j ~ N(0, σ_u^2) and v_k ~ N(0, σ_v^2) are independent cluster-specific random effects
for providers of type A and B, respectively; for a linear model, e_ijk | u_j, v_k ~ N(0, σ_e^2)
are residual error terms. This model can be extended to accommodate the two alternative
scenarios described above.
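Crossed random effects can be fitted in standard software, although interfaces vary. The sketch below uses the variance-components formulation in statsmodels (one all-encompassing group, with a separate variance component per provider type) on simulated data; the data-generating values and column names are illustrative only, and this is one possible encoding rather than a canonical one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 600
prov_a = rng.integers(0, 10, n)       # e.g., surgeon (type A provider)
prov_b = rng.integers(0, 8, n)        # e.g., anesthesiologist (type B provider)
x = rng.binomial(1, 0.5, n)
u = rng.normal(0, 0.8, 10)            # type A random effects
v = rng.normal(0, 0.6, 8)             # type B random effects
y = 1.0 + 0.4 * x + u[prov_a] + v[prov_b] + rng.normal(0, 1.0, n)
df = pd.DataFrame({"y": y, "x": x, "prov_a": prov_a, "prov_b": prov_b})
df["grp"] = 1                         # single group => effects are crossed

vc = {"prov_a": "0 + C(prov_a)", "prov_b": "0 + C(prov_b)"}
fit = smf.mixedlm("y ~ x", df, groups="grp", re_formula="0",
                  vc_formula=vc).fit()
print(fit.vcomp)                      # estimates of sigma_u^2 and sigma_v^2
```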
Assuming equal cluster sizes and independence between provider types, the ICC
for this two-provider crossed design is

\[ \rho = \frac{\sigma_u^2 + \sigma_v^2}{\sigma_u^2 + \sigma_v^2 + \sigma_e^2} . \tag{9} \]

When the two providers are not independent, resulting in correlated random
effects (σ_{uv} = r σ_u σ_v), the ICC can be written as

\[ \rho = \frac{\sigma_u^2 + \sigma_v^2 + 2\sigma_{uv}}{\sigma_u^2 + \sigma_v^2 + 2\sigma_{uv} + \sigma_e^2} . \tag{10} \]

In this case of two correlated random components, the ICC requires three
variance and one covariance components; if further random effects components
were to be accommodated, the number of terms in the models and the complexity
of the interrelationships between components would increase. Therefore, investiga-
tion of crossed components should focus on a small number of components that have
been identified as most important during trial design.
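At the design stage these ICC variants are simple functions of assumed variance components, so a small helper (our own convenience function, not from the literature) suffices to evaluate Eqs. (3), (5), (7), (9), and (10) for plug-in values:

```python
def icc(var_components, cov=0.0, sigma_e2=1.0, halve=False):
    """ICC from assumed variance components.
    var_components : list of random-effect variances (one or more)
    cov            : covariance between two correlated components (eqs. 7, 10)
    halve          : True for the two-arm design of eq. (5)"""
    between = sum(var_components) + 2.0 * cov
    if halve:
        between *= 0.5
    return between / (between + sigma_e2)

print(icc([1.0], sigma_e2=4.0))                      # eq. (3): 0.2
print(icc([1.0, 0.5], sigma_e2=4.0, halve=True))     # eq. (5)
print(icc([1.0, 0.5], cov=0.35, sigma_e2=4.0))       # eqs. (7)/(10)
```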
When designing complex intervention trials, it is essential to have a clear
understanding of the different components of variation and how these affect the
treatment effect estimates. For interventions with several components, high-level
interactions between fixed and random effects are difficult to robustly estimate. Our
recommendation is to rank components according to the level of clustering in the
primary outcome(s) and investigate them in a stepwise manner (Papachristofi et al.
2016b).
A generalization of the cross-classified model is the multiple membership multi-
ple classification (MMMC) model, which considers a second random component
that operates across several elements of the first (see Fig. 2d). An example of this
would be a surgical intervention that requires more than one surgeon to be involved,
or a psychological intervention provided by more than one therapist, say in a group
session. Each provider might work with a number of other providers during the trial.
Details of these models can be found in Browne et al. (2001).

Cluster Randomized Trials

The parallel cluster randomized controlled trial (CRCT) is an established design for
evaluating interventions that are either randomized to all patients within a predefined
group (cluster) or where the intervention is delivered at a group level (Eldridge and
Kerry 2012). Examples include trials involving care of dementia patients (Surr et al.
2016), primary care (Emery et al. 2017), and hospitals where interventions are
applied to individual wards or clinics (Erasmus et al. 2011). CRCTs are also an
attractive option when individuals within the same geographical or clinical area can
access information from other trial participants, so that individual randomization
may lead to contamination of treatment effects. For example, patients attending the
same clinic may share educational resources or coping strategies from a multi-
component intervention, despite being assigned to different treatment arms.
In the simplest CRCT, assuming common cluster size m, the statistical model for
patient i, i = 1, . . ., m in cluster j, j = 1, . . ., c, with x_j the categorical covariate
representing treatment allocation for cluster j, can be specified as follows:

\[ g^{-1}(E[y_{ij}]) = \beta_0 + \beta_1 x_j + u_j \tag{11} \]

where y_ij is a measure of outcome, g is an appropriate link function, and u_j ~ N(0, σ_u^2)
are the cluster-specific random effects; for a linear model, e_ij | u_j ~ N(0, σ_e^2) are residual
error terms. In this model, all patients in a cluster receive the same treatment, and the
treatment effect is common across clusters. Note that further patient- or cluster-
specific covariates have been omitted for clarity, although it is straightforward to
include them in these models; generalized linear models are again used for illustra-
tion, but methods for clustered time-to-event outcomes are also available.
Model (11) reflects a trial in which the outcome is assessed at the individual
participant level, but the same framework (with a suitable link function and error
structure) can accommodate cluster-level outcomes, which are common in CRCTs.
For example, the model in (11) is appropriate if hospitals as a whole are randomized
to active treatment or control, but outcomes are assessed for each patient, perhaps so
that patient-level covariates can be adjusted for. Alternatively, analysis could be
based on hospital-level outcomes, such as the proportion of patients per cluster that
have a successful outcome. These aggregated outcomes can be considered indepen-
dent and analyzed using standard statistical methods, with the outcomes weighted by
their precision or the number of cases within each cluster. While this approach results
in a simple analysis, it does not allow for inclusion of patient-level covariates and
thus may be less efficient than an individual patient data analysis. The associated
ICC is identical to the simple ICC for the two-level hierarchical model in Eq. (3).
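A minimal sketch of such a cluster-level analysis follows (simulated data; all settings arbitrary): per-cluster success proportions are regressed on the cluster-level treatment indicator by weighted least squares, with cluster sizes as the weights.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_clusters = 24
sizes = rng.integers(20, 61, n_clusters)             # patients per cluster
arm = np.repeat([0, 1], n_clusters // 2)             # cluster randomization
p_true = np.clip(0.30 + 0.10 * arm + rng.normal(0, 0.05, n_clusters), 0, 1)
prop = rng.binomial(sizes, p_true) / sizes           # cluster-level outcome

X = sm.add_constant(arm)
fit = sm.WLS(prop, X, weights=sizes).fit()           # weight clusters by size
print(fit.params)   # [control-arm proportion, difference due to treatment]
```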

Stepped-Wedge Designs

An alternative type of CRCT that is gaining in popularity is the stepped-wedge
design (SWD). In this design, all clusters of patients begin in the control group. A
proportion of clusters is then randomized to the experimental treatment arm at the
end of each of a number of predetermined time periods, that is, in steps, until all
clusters are in the experimental intervention arm in the final time period (see Fig. 3)
(Brown and Lilford 2006). The first period is often used to collect baseline data, and
the time period at which a cluster commences the experimental intervention is
determined at random.

Fig. 3 Illustration of a stepped-wedge cluster randomized trial. Each row represents a cluster, and each time period after the first represents a step; T marks periods on the experimental treatment and C marks control periods:

Step | Period: 1  2  3  4  5  6
  1  |         C  T  T  T  T  T
  2  |         C  C  T  T  T  T
  3  |         C  C  C  T  T  T
  4  |         C  C  C  C  T  T
  5  |         C  C  C  C  C  T

The most commonly reported SWD trials are cross-sectional,
in that new patients are recruited in each time period, and even though they may be
followed up over time for an event, the outcome is analyzed according to the period
of recruitment (e.g., see the FIT trial of hand hygiene compliance (Fuller et al.
2012)). An alternative, but less frequently used, design is the cohort SWD, in which
all patients are recruited in the first period and followed throughout the trial, with a
proportion of the patient clusters randomized to switch to the active intervention at
the end of each period. For example, Jordan et al. (2015) conducted a trial in which
dementia patients in care homes were switched to nurse-led prescribing of medica-
tions in a series of steps, until all patients were in the intervention arm.
The choice between the SWD and the traditional CRCT depends on the context
and nature of the intervention. The SWD is particularly suited to service delivery or
healthcare policy interventions, for which traditional CRCTs are logistically difficult
to implement. However, because in the SWD all clusters are exposed to the active
intervention by the end of the trial, it is more difficult to revert to the control
treatment if the experimental intervention proves to be ineffective. Thus, the SWD
is only appropriate if the intervention has negligible side effects and if it is consid-
ered very unlikely to be less effective than the standard of care. For example, it is
difficult to imagine how provision of sterile hand wash could introduce substantial
risk of side effects or an increase in infectious episodes.
Because SWD trials are conducted over a series of periods, they are not appro-
priate for interventions requiring long-term follow-up in order to establish an effect
or when treatment effects vary over time.
Analysis for a SWD uses a model with random effects for clusters, and fixed
effects for time periods, i.e., for patient i in cluster j at time period k:
\[ g^{-1}(E[y_{ijk}]) = \beta_0 + \beta_1 x_{ijk} + \omega_k + u_j \tag{12} \]

where y_ijk is a measure of outcome, g an appropriate link function, x_ijk a categorical
covariate representing treatment allocation, ω_k the effect of time period k, and
u_j ~ N(0, σ_u^2) are the cluster-specific random effects; for a linear model,
e_ijk | u_j ~ N(0, σ_e^2) are residual error terms.
Although the time period effects ωk are not of interest in themselves, their
inclusion in this model is important. On average there are more control clusters at
the beginning of the trial and more intervention clusters toward the end. If the period
effects are omitted from the model, any contextual or intervention changes over time
will result in bias in treatment effect estimates (Hemming et al. 2018).
The simple model (12) further relies on a number of assumptions that require justification, including:
(i) A common period effect across clusters and treatment arms
(ii) Constant period effects within a period (no smooth changes across time)
(iii) A common treatment effect across clusters and periods

Analysts may consider additional exploratory analyses that relax these assumptions using strata (groups of clusters)-by-period interactions, strata-by-treatment interactions, and treatment-by-period interactions. Although such analyses are underpowered, and their results should be interpreted cautiously, they may identify major departures from the simple model's assumptions (Hemming et al. 2018).
The ICC for model (12) is identical to the simple ICC in (3) for the two-level
hierarchical model.
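As an illustration, the following is a minimal sketch of fitting model (12) for a continuous outcome in Python with statsmodels; the data file and the column names y, treat, period, and cluster are hypothetical, and the R, Stata, and SAS alternatives are discussed in the Model Fitting and Analysis section below.

```python
# Sketch: linear mixed model for a cross-sectional SWD, per Eq. (12).
# Assumes a hypothetical long-format file with columns y, treat, period, cluster.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("swd_trial.csv")

# Fixed treatment effect, fixed period effects, random intercept per cluster.
fit = smf.mixedlm("y ~ treat + C(period)", data=df, groups=df["cluster"]).fit()
print(fit.summary())

# ICC from the fitted variance components, as in the simple ICC of Eq. (3).
sigma_u2 = fit.cov_re.iloc[0, 0]   # between-cluster variance
sigma_e2 = fit.scale               # residual variance
print("ICC =", sigma_u2 / (sigma_u2 + sigma_e2))
```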
Other designs that fit into the SWD framework have been published, including
CRCTs where the clusters are observed in multiple periods (Ukoumunne and
Thompson 2001) and cluster randomized crossover trials (Turner et al. 2007), but
these are outside the scope of this chapter.

Sample Size Estimation for Trials with Clustering

The sample size for trial designs with clustering is affected both by the cluster size
and the number of clusters. Methods for sample size estimation have been published
for a wide range of individually randomized trials with clustering (Walwyn and
Roberts 2010), CRCTs (Eldridge and Kerry 2012), and SWDs (Hemming and
Taljaard 2016). The general principles and some simple calculations based on
normally distributed random effects are provided in this section.
Typically, sample size estimation for trials with clustering involves calculation of
a design effect (DE), used to inflate the sample size estimate from the corresponding
trial design for independent, identically distributed outcomes; the simplest version is
given below.
The standard sample size estimate n for each arm of an individually randomized parallel two-group trial, with 1:1 allocation ratio and a normally distributed outcome, is given by

$$n = \frac{\left(Z_{\alpha/2} + Z_{1-\beta}\right)^{2} 2\sigma_e^{2}}{\Delta^{2}}, \qquad (13)$$
where Δ is the target mean difference in the outcome between the two arms, Zp is the
pth percentage point of the standard Normal(0, 1) distribution, and α and β are type I
and II error probabilities, respectively.
In a similar CRCT, where all patients in a cluster receive the same treatment, and the cluster size is fixed and known, the number of patients per arm n can be calculated using

$$n = \frac{\left(Z_{\alpha/2} + Z_{1-\beta}\right)^{2} 2\sigma_e^{2}}{\Delta^{2}}\left(1 + (m-1)\rho\right), \qquad (14)$$
where m is the fixed cluster size and ρ the ICC; the DE term 1 + (m − 1)ρ is the inflation factor due to clustering (Donner et al. 1981). In this simple case, the number of clusters required per arm would be c = n/m. Note that the DE increases with the number of patients per cluster and the size of the ICC. For clusters of size 1 or an ICC of zero (independence between clusters), the DE reduces to one, and the sample size formula reverts to that for the independent, identically distributed outcomes design.
In general, keeping the number of clusters fixed and increasing the within-cluster sample size is not efficient, in that such a strategy will require more trial participants overall than if the number of clusters were increased.
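As a worked example of Eqs. (13) and (14), the sketch below computes per-arm sample sizes with and without the design effect; the target difference, variance, cluster size, and ICC are illustrative values, not taken from the chapter.

```python
# Sketch: per-arm n for an individually randomized trial (Eq. 13) and for a
# CRCT with fixed cluster size (Eq. 14); all input values are illustrative.
import math
from scipy.stats import norm

def n_individual(delta, sigma_e, alpha=0.05, beta=0.10):
    """Per-arm n for a 1:1 parallel trial with a normal outcome (Eq. 13)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)  # standard normal quantiles
    return z**2 * 2 * sigma_e**2 / delta**2

def n_crct(delta, sigma_e, m, rho, alpha=0.05, beta=0.10):
    """Per-arm n for a CRCT: Eq. (13) inflated by the DE of Eq. (14)."""
    return n_individual(delta, sigma_e, alpha, beta) * (1 + (m - 1) * rho)

n_ind = n_individual(delta=0.5, sigma_e=1.0)
n_cl = n_crct(delta=0.5, sigma_e=1.0, m=20, rho=0.05)
# patients per arm (individual), patients per arm (CRCT), clusters per arm
print(math.ceil(n_ind), math.ceil(n_cl), math.ceil(n_cl / 20))
```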
When the cluster sizes are unequal, as is the case for most trials, the DE relies on knowing the mean and coefficient of variation of the cluster sizes and becomes

$$DE = 1 + \left(\bar{m}\left(1 + cv^2\right) - 1\right)\rho, \qquad (15)$$

where $\bar{m}$ is the average cluster size and cv is the coefficient of variation of the cluster
sizes (see, e.g., Eldridge et al. (2006)). The number of patients required per arm is
larger than for the equal cluster size case and increases as the variability between the
cluster sizes increases. The DE for unequal cluster sizes given in Eq. (15) is the most
commonly used; Eldridge et al. (2006) provide alternative formulations and further
discussion.
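A one-line implementation of Eq. (15), under illustrative inputs, shows how the DE grows with the variability of cluster sizes:

```python
# Sketch: design effect under unequal cluster sizes (Eq. 15); cv = 0 recovers
# the equal-cluster-size DE of Eq. (14). Inputs are illustrative.
def de_unequal(m_bar, cv, rho):
    return 1 + (m_bar * (1 + cv**2) - 1) * rho

print(de_unequal(m_bar=20, cv=0.0, rho=0.05))  # 1.95, equal clusters
print(de_unequal(m_bar=20, cv=0.4, rho=0.05))  # 2.11, variable cluster sizes
```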
Note that the above inflation factors can be applied to any sample size calcula-
tions for which the treatment effect is estimated from a generalized linear model; for
example, Δ in Eq. (14) may represent the log odds ratio or log rate ratio for the
treatment effect, with appropriate values for $\sigma_e^2$.
Advocates of SWD have argued that the additional time periods involved render
them more efficient than parallel CRCTs, requiring fewer patients (Woertman et al.
2013). However, this does not appear to be true in all cases, with the relative efficiency
of the two approaches dependent on design parameters such as the ICC, the size and
number of clusters, and the number of periods (Hemming and Taljaard 2016).
Assuming that the number of time periods (steps) s and the cluster size m have been fixed, the DE for a simple SWD has been derived as

$$DE = (s+1)\,\frac{1 + \rho(sm + m - 1)}{1 + \rho(sm/2 + m - 1)}\cdot\frac{3s(1-\rho)}{2\left(s^2 - 1\right)}. \qquad (16)$$

The sample size needed per time period can be obtained by equating this with the
total sample size for a SWD (m(s + 1)c) and solving a quadratic equation (Hemming
and Taljaard 2016). Close inspection of the DE shows that this will be most efficient
with large numbers of time periods, or steps, and small numbers of clusters at each
step.
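The sketch below tabulates the reconstructed DE of Eq. (16) for a few illustrative values of s, m, and ρ; it does not solve the quadratic equation for the per-period sample size.

```python
# Sketch: stepped-wedge design effect as in Eq. (16), for s steps, cluster
# size m, and ICC rho; parameter values are illustrative.
def de_swd(s, m, rho):
    ratio = (1 + rho * (s * m + m - 1)) / (1 + rho * (s * m / 2 + m - 1))
    time_factor = 3 * s * (1 - rho) / (2 * (s**2 - 1))
    return (s + 1) * ratio * time_factor

for s in (3, 6, 12):
    print(s, round(de_swd(s, m=20, rho=0.05), 2))
```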

Model Fitting and Analysis

All models described in this chapter can be implemented in standard statistical software such as R, Stata, and SAS. Fitting requires integration over random effects
distributions, which is exact if the response is normally distributed, but requires
approximate methods for generalized linear and other more complex models.
Papachristofi et al. provide an overview of available methods and their limitations,
with detailed referencing (Papachristofi et al. 2016b).
Adaptive Gauss-Hermite quadrature has been shown to yield more accurate estimates than other approximate methods and to perform well even in the case
of large clusters and high ICCs; however, it is slower than other methods
(Rabe-Hesketh and Skrondal 2012). It is implemented in Stata (gllamm procedure)
(Rabe-Hesketh and Skrondal 2012) and R (lme4 package) (Austin 2010). Marginal
(MQL) and penalized quasi-likelihood (PQL) are implemented in the general-pur-
pose multilevel modeling software MLwiN and use a Taylor expansion
approximation to transform nonlinear models. MQL is a quicker algorithm
but tends to underestimate random effects terms. PQL is not suitable for cases
combining small clusters with large ICCs and does not employ likelihood-based
methods, so that likelihood-ratio tests and information criteria cannot be used
(Goldstein and Rasbash 1996).
The restricted maximum likelihood (REML) approach involves maximizing a
likelihood form based on a transformation of the data, in order to remove the
effects of nuisance parameters. This method is unbiased for generalized linear
models unless the random effects variance is very small (Rabe-Hesketh and
Skrondal 2012).
Small numbers of clusters or clusters of small size may lead to poor estima-
tion of variance components, and all likelihood-based approaches may encoun-
ter convergence problems in these cases. The minimum number of clusters, as
well as patients per cluster, to ensure robust estimation of multilevel models has
been widely studied. The literature suggests that at least ten clusters per arm will
allow unbiased estimation of treatment effects, provided that the number of
patients per cluster is not too small (<5) (see Papachristofi et al. and references
therein for an overview (Papachristofi et al. 2016b)). Nevertheless, the decision
on the model structure will depend on the context of the trial and its specific
constraints.

Reporting

As the popularity of complex interventions has grown, so has the number of documents that guide the conduct and reporting of trials evaluating them. The
CReDECI guidelines provide criteria for reporting the development phases of a
complex intervention, including ways in which it was altered during initial
testing, as well as its piloting and evaluation (Mohler et al. 2015). CReDECI
aims to encompass both qualitative and quantitative research designs so that
the guidelines are quite general. The TIDieR guide for the reporting of complex
interventions recommends a 12-point checklist for describing the intervention
itself (Hoffmann et al. 2014). Additionally, the Consolidated Standards of Reporting
Trials (CONSORT) group have published extensions of their original reporting
guidance (Moher et al. 2010) covering CRCTs (Campbell et al. 2012) and abstracts
for non-pharmacological interventions (Boutron et al. 2017). These freely accessible
tools are helpful to guide initial trial design and contribute to improvements in
research reporting.

Implementation

A key element of the MRC framework for evaluation of complex interventions is the implementation of a successful treatment into routine healthcare (Medical_Research_Council 2008). As complex interventions are often fully embedded in the
healthcare setting they are intended for, and are evaluated using pragmatic trial designs
(Loudon et al. 2013), generalization of results should be straightforward compared
with more tightly controlled efficacy trials. The MRC guidance recommends a
reporting style for trials that is accessible to healthcare decision-makers; quantitative
(statistical) methods alone are unlikely to be sufficient for successful implementation.
First, an understanding of the behaviors that need to change for a successful treatment
to enter routine practice, including barriers and facilitators of change, is required. This
information must be elicited in a formal qualitative study alongside the randomized
phase III evaluation. Since the economic burden of the intervention implementation is
a key driver of behavioral change, complex intervention trials should include an
assessment of the costs and effects of the intervention for health providers and other
related services (Drummond et al. 2015). After the trial, a monitoring phase is recommended to assess whether benefits and harms observed in the trial manifest similarly in routine practice (Medical_Research_Council 2008). Details of these methods can be found in the review by Richards and Hallberg and references therein (Richards and Hallberg 2015).

Summary and Conclusion

Interest in complex interventions has grown substantially in recent years. They are
characterized by multicomponent treatment packages and clustering due to specific
components, such as the provider and implementation setting, which cannot be
separated from the package of treatment and have an influence on the outcome of
treatment. Development of complex interventions is a multidisciplinary endeavor and
requires a mixture of rigorous qualitative and quantitative methods. Although research
on feasibility, piloting, and early phase trials is sparse, there are well-established
methods for phase III trials of multicomponent interventions that involve clustering.
The most commonly used methods are individually randomized trials with random
effects for clusters, cluster randomized trials, and, more recently, stepped-wedge
cluster randomized trials. Sample size estimation methods exist for a range of designs.
With careful attention to the correlation structure induced by the chosen design, results
can be analyzed in standard statistical software. Post-trial implementation in routine
practice will depend on statistical, economic, qualitative, and behavioral analysis.

Key Facts

• Complex interventions typically have multiple components and are subject to clustering of patient outcomes.

• Both qualitative and quantitative methods are required in the development and
evaluation of complex interventions.
• Timing of the definitive evaluation of a complex intervention needs careful
consideration, taking into account its stage of development and treatment
equipoise of both patients and healthcare providers.
• Complex interventions are usually evaluated in pragmatic trials since they are
often embedded in the healthcare context they are intended for.
• A range of well-established trial designs for complex interventions are available,
including clustered individually randomized trials, cluster randomized trials, and
stepped-wedge designs.
• With careful definition of the correlation structure resulting from their design,
trials can typically be analyzed using standard statistical software packages.

Cross-References

▶ Cluster Randomized Trials
▶ Power and Sample Size

References
Austin PC (2010) Estimating multilevel logistic regression models when the number of clusters is
low: a comparison of different statistical software procedures. Int J Biostat 6
Blencowe NS, Brown JM, Cook JA, Metcalfe C, Morton DG, Nicholl J, Sharples LD, Treweek S,
Blazeby JM, Members of the, M. R. C. H. F. T. M. R. N. W (2015) Interventions in randomised
controlled trials in surgery: issues to consider during trial design. Trials 16:392
Blencowe NS, Mills N, Cook JA, Donovan JL, Rogers CA, Whiting P, Blazeby JM (2016)
Standardizing and monitoring the delivery of surgical interventions in randomized clinical trials.
Br J Surg 103:1377–1384
Boutron I, Altman DG, Moher D, Schulz KF, Ravaud P, Group, C. N (2017) CONSORT Statement
for randomized trials of nonpharmacologic treatments: a 2017 update and a CONSORT
extension for nonpharmacologic trial abstracts. Ann Intern Med 167:40–47
Brown CA, Lilford RJ (2006) The stepped wedge trial design: a systematic review. BMC Med Res
Methodol 6:54
Browne WJ, Goldstein H, Rasbash J (2001) Multiple membership multiple classification (MMMC)
models. Stat Model 1:103–124
Campbell MK, Piaggio G, Elbourne DR, Altman DG, Group, C (2012) Consort 2010 statement:
extension to cluster randomised trials. BMJ 345:e5661
Cisse MBM, Sangare D, Oxborough RM, Dicko A, Dengela D, Sadou A, Mihigo J, George K,
Norris L, Fornadel C (2017) A village level cluster-randomized entomological evaluation of
combination long-lasting insecticidal nets containing pyrethroid plus PBO synergist in Southern
Mali. Malar J 16:477
Collins LM, Murphy SA, Strecher V (2007) The multiphase optimization strategy (MOST) and the
sequential multiple assignment randomized trial (SMART): new methods for more potent
eHealth interventions. Am J Prev Med 32:S112–S118
Cook JA, Ramsay CR, Fayers P (2004) Statistical evaluation of learning curve effects in surgical
trials. Clin Trials 1:421–427

Dickens C, Katon W, Blakemore A, Khara A, Tomenson B, Woodcock A, Fryer A, Guthrie E


(2014) Complex interventions that reduce urgent care use in COPD: a systematic review with
meta-regression. Respir Med 108:426–437
Donner A, Birkett N, Buck C (1981) Randomization by cluster. Sample size requirements and
analysis. Am J Epidemiol 114:19
Drummond MF, Sculpher MJ, Claxton K, Stoddart GL, Torrance GW (2015) Methods for the
economic evaluation of health care programmes, 4th edn. University Press, Oxford
Eldridge S, Kerry S (2012) A practical guide to cluster randomised trials in health services research.
Wiley, Chichester
Eldridge SM, Ashby D, Kerry S (2006) Sample size for cluster randomized trials: effect of
coefficient of variation of cluster size and analysis method. Int J Epidemiol 35:1292–1300
Emery JD, Gray V, Walter FM, Cheetham S, Croager EJ, Slevin T, Saunders C, Threlfall T, Auret K,
Nowak AK, Geelhoed E, Bulsara M, Holman CDJ (2017) The Improving Rural Cancer
Outcomes Trial: a cluster-randomised controlled trial of a complex intervention to reduce time
to diagnosis in rural cancer patients in Western Australia. Br J Cancer 117:1459–1469
Erasmus V, Huis A, Oenema A, Van Empelen P, Boog MC, Van Beeck EH, Polinder S, Steyerberg
EW, Richardus JH, Vos MC, Van Beeck EF (2011) The ACCOMPLISH study. A cluster
randomised trial on the cost-effectiveness of a multicomponent intervention to improve hand
hygiene compliance and reduce healthcare associated infections. BMC Public Health 11:721
Fuller C, Michie S, Savage J, Mcateer J, Besser S, Charlett A, Hayward A, Cookson BD, Cooper
BS, Duckworth G, Jeanes A, Roberts J, Teare L, Stone S (2012) The Feedback Intervention Trial
(FIT)–improving hand-hygiene compliance in UK healthcare workers: a stepped wedge cluster
randomised controlled trial. PLoS One 7:e41617
Goldstein H, Rasbash J (1996) Improved approximations for multilevel models with binary
responses. J R Stat Soc Ser A-Stat Soc 159:505–513
Hemming K, Taljaard M (2016) Sample size calculations for stepped wedge and cluster randomised
trials: a unified approach. J Clin Epidemiol 69:137–146
Hemming K, Taljaard M, Forbes A (2018) Modeling clustering and treatment effect heterogeneity
in parallel and stepped-wedge cluster randomized trials. Stat Med 37:883–898
Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V,
Macdonald H, Johnston M, Lamb SE, Dixon-Woods M, Mcculloch P, Wyatt JC, Chan AW,
Michie S (2014) Better reporting of interventions: template for intervention description and
replication (TIDieR) checklist and guide. BMJ 348:g1687
Jordan S, Gabe-Walters ME, Watkins A, Humphreys I, Newson L, Snelgrove S, Dennis MS (2015)
Nurse-led medicines’ monitoring for patients with dementia in care homes: a pragmatic cohort
stepped wedge cluster randomised trial. PLoS One 10:e0140203
Kahan BC, Morris TP (2013) Assessing potential sources of clustering in individually randomised
trials. BMC Med Res Methodol 13:58
Loudon K, Zwarenstein M, Sullivan F, Donnan P, Treweek S (2013) Making clinical trials more
relevant: improving and validating the PRECIS tool for matching trial design decisions to trial
purpose. Trials 14:115
Mars T, Ellard D, Carnes D, Homer K, Underwood M, Taylor SJ (2013) Fidelity in complex
behaviour change interventions: a standardised approach to evaluate intervention integrity.
BMJ Open 3:e003555
Mcculloch P, Altman DG, Campbell WB, Flum DR, Glasziou P, Marshall JC, Nicholl J, Balliol C,
Aronson JK, Barkun JS, Blazeby JM, Boutron IC, Campbell WB, Clavien PA, Cook JA,
Ergina PL, Feldman LS, Flum DR, Maddern GJ, Nicholl J, Reeves BC, Seiler CM,
Strasberg SM, Meakins JL, Ashby D, Black N, Bunker J, Burton M, Campbell M, Chalkidou K,
Chalmers I, De Leval M, Deeks J, Ergina PL, Grant A, Gray M, Greenhalgh R, Jenicek M, Kehoe
S, Lilford R, Littlejohns P, Loke Y, Madhock R, Mcpherson K, Meakins J, Rothwell P, Summerskill
B, Taggart D, Tekkis P, Thompson M, Treasure T, Trohler U, Vandenbroucke J (2009) No surgical
innovation without evaluation: the IDEAL recommendations. Lancet 374:1105–1112
Medical_Research_Council (2000) A framework for the development and evaluation of RCTs for
complex interventions to improve health

Medical_Research_Council (2008) Developing and evaluating complex interventions: new


guidance
Moher D, Hopewell S, Schulz KF, Montori V, Gotzsche PC, Devereaux PJ, Elbourne D, Egger M,
Altman DG, Consolidated Standards of Reporting Trials, G (2010) CONSORT 2010
explanation and elaboration: updated guidelines for reporting parallel group randomised trials.
J Clin Epidemiol 63:e1–e37
Mohler R, Kopke S, Meyer G (2015) Criteria for Reporting the Development and Evaluation of
Complex Interventions in healthcare: revised guideline (CReDECI 2). Trials 16:204
Mohr DC, Carmody T, Erickson L, Jin L, Leader J (2011) Telephone-administered cognitive
behavioral therapy for veterans served by community-based outpatient clinics. J Consult Clin
Psychol 79:261–265
Papachristofi O, Jenkins D, Sharples LD (2016a) Assessment of learning curves in complex surgical
interventions: a consecutive case-series study. Trials 17:266
Papachristofi O, Klein A, Sharples L (2016b) Evaluation of the effects of multiple providers in
complex surgical interventions. Stat Med 35:5222–5246
Perez MC, Minoyan N, Ridde V, Sylvestre MP, Johri M (2018) Comparison of registered
and published intervention fidelity assessment in cluster randomised trials of public health
interventions in low- and middle-income countries: systematic review. Trials 19:410
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, Chichester
Rabe-Hesketh S, Skrondal A (2012) Multilevel and longitudinal modelling using Stata. Stata Press, Texas
Richards DA, Hallberg IR (2015) Complex interventions in health. Routledge, Oxford
Richardson E, Redden DT (2014) Moving towards multiple site outcomes in spinal cord injury pain
clinical trials: an issue of clustered observations in trial design and analysis. J Spinal Cord Med
37:278–287
Roberts C, Walwyn R (2013) Design and analysis of non-pharmacological treatment trials with
multiple therapists per patient. Stat Med 32:81–98
Sharples L, Everett C, Singh J, Mills C, Spyt T, Abu-Omar Y, Fynn S, Thorpe B, Stoneman V,
Goddard H, Fox-Rushby J, Nashef S (2018) Amaze: a double-blind, multicentre randomised
controlled trial to investigate the clinical effectiveness and cost-effectiveness of adding an
ablation device-based maze procedure as an adjunct to routine cardiac surgery for patients
with pre-existing atrial fibrillation. Health Technol Assess 22:1–132
Surr CA, Walwyn RE, Lilley-Kelly A, Cicero R, Meads D, Ballard C, Burton K, Chenoweth L,
Corbett A, Creese B, Downs M, Farrin AJ, Fossey J, Garrod L, Graham EH, Griffiths A,
Holloway I, Jones S, Malik B, Siddiqi N, Robinson L, Stokes G, Wallace D (2016) Evaluating
the effectiveness and cost-effectiveness of Dementia Care Mapping to enable person-centred
care for people with dementia and their carers (DCM-EPIC) in care homes: study protocol for
a randomised controlled trial. Trials 17:300
Turner RM, White IR, Croudace T, Group, P. I. P. S (2007) Analysis of cluster randomized
cross-over trial data: a comparison of methods. Stat Med 26:274–289
Ukoumunne OC, Thompson SG (2001) Analysis of cluster randomized trials with repeated
cross-sectional binary measurements. Stat Med 20:417–433
Walwyn R, Roberts C (2010) Therapist variation within randomised trials of psychotherapy:
implications for precision, internal and external validity. Stat Methods Med Res 19:291–315
Walwyn R, Roberts C (2017) Meta-analysis of standardised mean differences from randomised
trials with treatment-related clustering associated with care providers. Stat Med 36:1043–1067
Wilson DT, Walwyn RE, Brown J, Farrin AJ, Brown SR (2016) Statistical challenges in assessing
potential efficacy of complex interventions in pilot or feasibility studies. Stat Methods Med Res
25:997–1009
Woertman W, De Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL, Teerenstra S (2013) Stepped
wedge designs could reduce the required sample size in cluster randomized trials. J Clin
Epidemiol 66:752–758
Yuen WW, Wong WC, Tang CS, Holroyd E, Tiwari AF, Fong DY, Chin WY (2013) Evaluating the
effectiveness of personal resilience and enrichment programme (PREP) for HIV prevention
among female sex workers: a randomised controlled trial. BMC Public Health 13:683
75 Randomized Discontinuation Trials

Valerii V. Fedorov

Contents
Introduction  1440
Example: AD Design Versus Two-Arm RCT  1442
  Notations and Major Assumptions  1442
  Conventional Randomized Clinical Trial  1444
  Amery-Dony Design  1445
  Symmetric AD-Design  1446
  Sample Size Calculation  1447
Clinical Applicability  1448
  Generalization of Results to the Source Population  1448
  Ethical Aspects  1449
  Classification Hurdles  1450
  Place of RDT Designs in the Family of Enrichment Trial Designs  1450
Key Facts  1451
References  1451

Abstract
Randomized discontinuation trials are usually considered as a special case of
population enrichment trials. Most often they consist of a few stages with the
early ones targeting the selection of a subpopulation that responds or may respond
to an experimental treatment and allowing an early escape of patients doing
poorly. The succeeding stages validate the treatment’s superiority to placebo or
more generally to a comparator for the selected (enriched with potential
responders) subpopulation. The approach increases the trial feasibility and the
chances of correct response-to-treatment detection, often at the expense of
decreased ability to extrapolate results over the initially targeted population.
The first randomized discontinuation trials were used exclusively to test long-
term, non-curative therapies for chronic or slow evolving diseases. The develop-
ment of stochastic longitudinal models combined with Bayesian ideas and

V. V. Fedorov (*)
ICON, North Wales, PA, USA

the increase of computing power led to their acceptance in various therapeutic areas, most notably in oncology. If the statistically valid partitioning of the source
population is proven, then it might be viewed as evidence of the existence of
some classifiers and pave the road to precision medicine.

Keywords
Enrichment trials · Population enrichment · Randomized discontinuation trials ·
Screening · Subpopulation selection

Introduction

The basic idea of randomized discontinuation trials (RDTs) can be traced back to
the 1970s when Amery and Dony (1975) proposed the trial design (see Fig. 2) that
consists of an open-label stage with all patients exposed to the experimental treat-
ment, while at the second stage, all responders are randomized to placebo or an
experimental treatment arm. They showed that under rather mild assumptions, such
designs reduce patients’ exposure to placebo and mitigate the impact of placebo
responders on drug efficacy evaluation. Initially, RDTs were intended for chronic
and slow progressing diseases such as various types of angina, psychiatric disorders,
early stages of Parkinson's and Alzheimer's diseases, pain mitigation, and hypertension treatments.

[Fig. 1 Conventional two-arm design of a randomized clinical trial]

Rather detailed descriptions of early examples can be found in Chiron et al. (1996), Temple (1994), Fava et al. (2003), and FDA Guidance for
Industry (2019b). Among these, the trial for the validation of nifedipine as a
treatment of vasospastic angina was the first one to be accepted (in 1980) by
the FDA, see Temple (1994). The use of stochastic longitudinal models combined
with Bayesian techniques led to the increasing popularity of RDT in other therapeu-
tic areas, most notably in oncology, for cases when a reliable assay for selection of
sensitive patients is not available (Freidlin and Simon 2005; Korn et al. 2001; Ratain
et al. 2006; Rosner et al. 2002; Stadler et al. 2005; Trippa et al. 2012). In 2008,
Daugherty et al. (2008) mentioned that at least two oncology studies had been
completed using RDT, one with sorafenib (a positive trial) and one with carboxyaminoimidazole (a negative trial). Numerous references to the recent randomized
discontinuation/withdrawal trials in various therapeutic areas can be found in
the FDA Guidance for Industry (2019b) and directly on the FDA web site. So far, the terminology is not well established, and in publications "RDT designs" may appear as "enrichment/re-randomization designs, postdosed designs, pre-admission qualification designs, randomized withdrawal designs, randomized relapse designs, and randomized maintenance designs," cf. Grieve (2012). These variations suggest that the purposes of the trials and their clinical settings may be quite different.
Typically, the design of an RDT relies on the following assumptions:

• The experimental treatment will not cure the condition during the open-label
stage.
• The treatment effect(s) of the open-label stage will not be carried over to the
second stage. This assumption can be satisfied with adding a washout period
between the two stages if it is ethically admissible.
• Following Amery and Dony (1975), most publications assume that a treatment effect is binary: either there is or there is not a response to treatment. Accordingly, patients are labeled as either responders or nonresponders.

The later versions of RDT designs use slightly modified assumptions. For instance,
instead of “responders and non-responders,” one may consider patients with “positive
response, stable disease, and negative response,” or instead of stable tumor size, one
may consider a stationary tumor growth rate. The introduction of longitudinal models
as in Trippa et al. (2012) is a crucial part of the successful design of such RDT. The list
of different settings can be continued but in all of them there are two major steps:
population enrichment and treatment validation for the enriched subpopulation.
Often, the observed responses to treatment may be continuous (e.g., blood
pressure, duration of anginal pain), discrete (e.g., frequency of angina), or ordinal
(e.g., various scores in pain studies). Their dichotomization may lead to a significant
information loss, see Fedorov, Mannino, and Zhang (2008) and Uryniak et al. (2011),
and may be questionable when it is based solely on “investigator’s judgment of
success,” see Temple (1994). To minimize the impact of dichotomization, one can
use dichotomized data only to guide patient assignment to different treatment arms
and perform the final statistical analysis using original, non-dichotomized data.

Capra (2004) compared the power of RDT with that of RCT when the primary
endpoint is time-to-disease progression. Kopec et al. (1993) evaluated the utility and
efficiency of RDT when the endpoints are binary. They compared the relative sample
size required for a fixed power of RDT versus RCT under different scenarios and
parameter settings. The approaches are based on the outcomes solely from the
second stage, treating the open-label stage as a screening process. This simplifies
the statistical analysis, but the information contained in the open-label stage is
mostly wasted. Examples when the statistical analysis includes the information
from both the open-label and the treatment validation stages can be found in Fedorov
and Liu (2005, 2014), and Ivanova, Qaqish, and Schoenfeld (2010).
The next section presents the statistical aspects of the Amery-Dony design
(AD design) and its symmetric version, complemented with their comparison to
conventional randomized trial designs. Both types of AD designs are very basic
versions of RDT designs. However, their consideration allows illumination and
discussion of major properties that are common across all RDT designs published
so far, and/or currently used in clinical trials. Section “Clinical Applicability” will
address the major concerns associated with RDT implementation in medical
practice.

Example: AD Design Versus Two-Arm RCT

Let the eligible population consist of three basic subpopulations: treatment-only responders, placebo responders, and nonresponders, and let their fractions be πt, πp, and π− = 1 − πt − πp, respectively. The variances of the estimated πt and of the ratio R = πp/π+, where π+ = πt + πp is the fraction of all responders, will be used to compare the statistical efficiency of various trial designs. All trials considered here start with sampling (enrolling, accruing) n patients from the eligible population. For convenience, the major notations are repeated in Figs. 1, 2, and 3.

Notations and Major Assumptions

If the assumption of random sampling holds, then the following working model is
used:

• There exist infinitely many eligible patients of three mutually exclusive types:
treatment-only responders, placebo responders, and nonresponders.
• At each draw, the probabilities that a sampled patient belongs to one of these categories are πt, πp, and π− = 1 − πt − πp, respectively.
• An experiment consists of n random drawings with n = nt + np + n−, where nt, np, and n− = n − nt − np are the numbers of treatment-only responders, placebo responders, and nonresponders. In what follows, the triplet {n, nt, np} will be called the "complete data set."
[Fig. 2 Amery-Dony trial design]

[Fig. 3 Symmetric AD-design]

• The sampled nt, np, n− have a trinomial distribution, see Chap. 35 of Johnson, Kotz, and Balakrishnan (1997), with parameters (n, πt, πp, π−).

If nt and np are known exactly, then the maximum likelihood estimators (MLE) of πt, πp, and π− are very simple and readily available, cf. Chap. 35.6 of Johnson, Kotz, and Balakrishnan (1997):

$$\hat{\pi}_t = \frac{n_t}{n}, \quad \hat{\pi}_p = \frac{n_p}{n}, \quad \hat{\pi}_- = \frac{n - n_t - n_p}{n}. \qquad (1)$$

The variance-covariance matrix of the first two (the third one, $\hat{\pi}_-$, is their linear combination) is

$$\mathrm{Var}\left(\hat{\pi}_t, \hat{\pi}_p\right) = \frac{1}{n}\begin{pmatrix} \pi_t(1-\pi_t) & -\pi_t \pi_p \\ -\pi_t \pi_p & \pi_p\left(1-\pi_p\right) \end{pmatrix}. \qquad (2)$$

Given n, this matrix provides the lower bounds for the variances of any estimators
of πt, πp, or any function of them. The latter is often called the “estimand” (cf. FDA
Guidance for Industry 2019a).
If the function of interest ψ(π), π = (πt, πp)^T, is smooth enough, then the variance of its MLE $\hat{\psi} = \psi\left(\hat{\pi}_t, \hat{\pi}_p\right)$ is

$$\mathrm{Var}[\hat{\psi}] \simeq \frac{\partial \psi}{\partial \pi^{T}}\,\mathrm{Var}(\hat{\pi})\,\frac{\partial \psi}{\partial \pi}. \qquad (3)$$

For a linear function ψ, formula (3) is exact. Otherwise, it is valid asymptotically when n → ∞. For a reasonably large sample size n and moderate πt and πp, formula (3) serves as a good approximation and is often dubbed the "delta rule" or "delta method," cf. Oehlert (1992). For the two popular cases ψ(π) = πt + πp = π+ and ψ(π) = πt/(πt + πp) = Rt, the delta method readily provides

$$\mathrm{Var}[\hat{\pi}_+] = \frac{\pi_+(1-\pi_+)}{n} \quad \text{and} \quad \mathrm{Var}\left[\hat{R}_t\right] = \frac{R_t(1-R_t)}{n\left(\pi_t + \pi_p\right)}. \qquad (4)$$

Formulae (2), (3), and (4) will be used as benchmarks for the statistical efficiency of
various trial designs when this is expressed in terms of variances or more generally
covariance matrices of parameter estimators. It should be emphasized that these bench-
marks are reachable when it is possible to extract the complete data set {n, nt, np} from
the original data. In most cases, it is not.
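As a numerical check with hypothetical counts, the sketch below computes the MLEs of Eq. (1) and verifies that the delta method (3) applied to ψ = πt + πp reproduces the closed-form variance in Eq. (4):

```python
# Sketch: trinomial MLEs (Eq. 1) and benchmark variances (Eqs. 2-4);
# the counts n, n_t, n_p are hypothetical.
import numpy as np

n, n_t, n_p = 200, 40, 30            # total, treatment-only, placebo responders
pi_t, pi_p = n_t / n, n_p / n        # MLEs from the complete data set (Eq. 1)

# Variance-covariance matrix of (pi_t_hat, pi_p_hat), as in Eq. (2)
V = np.array([[pi_t * (1 - pi_t), -pi_t * pi_p],
              [-pi_t * pi_p,      pi_p * (1 - pi_p)]]) / n

# Delta method (Eq. 3) for psi = pi_t + pi_p; the gradient is (1, 1)
grad = np.array([1.0, 1.0])
var_delta = grad @ V @ grad
var_closed = (pi_t + pi_p) * (1 - pi_t - pi_p) / n   # Eq. (4)
print(var_delta, var_closed)                         # the two values agree
```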

Conventional Randomized Clinical Trial

The conventional two-arm RCT, see Fig. 1, is set up as follows. Suppose that n patients are randomly sampled from the eligible population and, out of them, n1 patients are randomized to the treatment arm and the remaining n2 = n − n1 patients to the placebo arm.

Let n1+ and n2p be the numbers of responders observed on the treatment arm and the placebo arm, respectively. Note that from these two outcomes, the exact values of nt or np are not available and estimator (1) cannot be used. However, the MLE of π+, πp, and πt, together with their variances, can be built using the observed n1+ and n2p:

$$\hat{\hat{\pi}}_+ = \frac{n_{1+}}{n_1} \quad \text{and} \quad \mathrm{Var}\left(\hat{\hat{\pi}}_+\right) = \frac{\pi_+(1-\pi_+)}{n_1}, \qquad (5)$$

$$\hat{\hat{\pi}}_p = \frac{n_{2p}}{n_2} \quad \text{and} \quad \mathrm{Var}\left(\hat{\hat{\pi}}_p\right) = \frac{\pi_p\left(1-\pi_p\right)}{n_2}, \qquad (6)$$

$$\hat{\hat{\pi}}_t = \hat{\hat{\pi}}_+ - \hat{\hat{\pi}}_p \quad \text{and} \quad \mathrm{Var}\left(\hat{\hat{\pi}}_t\right) = \frac{\pi_+(1-\pi_+)}{n_1} + \frac{\pi_p\left(1-\pi_p\right)}{n_2}, \qquad (7)$$

see notations in Fig. 1. As expected, the estimators presented in Eqs. (5), (6), and (7) have variances greater than the respective lower bounds (2) and (4) for any choice of n1 and n2, n1 + n2 = n. The validity of this statement for the first two is obvious. To verify the same statement for the treatment effect estimator $\hat{\hat{\pi}}_t$, one may use the following chain of inequalities:

$$\min_{n_1, n_2;\; n_1+n_2=n} \mathrm{Var}\left(\hat{\hat{\pi}}_t\right) = \frac{1}{n}\left[\sqrt{\pi_+(1-\pi_+)} + \sqrt{\pi_p\left(1-\pi_p\right)}\right]^2 \geq \frac{1}{n}\left[\pi_+(1-\pi_+) + \pi_p\left(1-\pi_p\right)\right] \geq \frac{1}{n}\,\pi_t(1-\pi_t), \qquad (8)$$

where the optimal allocation ratio is $n_1/n_2 = \sqrt{\pi_+(1-\pi_+)/\pi_p\left(1-\pi_p\right)}$, cf. Piantadosi (1997, Chap. 9.6). Note that the lower bound πt(1 − πt)/n is reached only when πp = 0 and n1 = n, i.e., for a trial with all patients allocated to the treatment arm.
In a practical setting, the variance of the estimated fraction πt of treatment responders significantly exceeds its lower bound πt(1 − πt)/n, especially in the presence of numerous placebo responders. This fact was a motivation for the development of trial designs that mitigate the negative impact of placebo responders on the statistical properties of trials. Usually, placebo responders and treatment nonresponders are withdrawn from the next trial stage to avoid their exposure to useless or harmful treatment(s).
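The following sketch, with hypothetical fractions πt = πp = 0.2 and n = 200, quantifies how far the optimally allocated RCT variance of Eq. (8) sits above the benchmark πt(1 − πt)/n:

```python
# Sketch: minimized RCT variance (Eq. 8) versus the trinomial benchmark
# (Eq. 2); the fractions and sample size are hypothetical.
import math

pi_t, pi_p = 0.2, 0.2
pi_plus = pi_t + pi_p
n = 200

a = pi_plus * (1 - pi_plus)
b = pi_p * (1 - pi_p)
var_rct_opt = (math.sqrt(a) + math.sqrt(b))**2 / n   # optimal allocation, Eq. (8)
var_bench = pi_t * (1 - pi_t) / n                    # lower bound, Eq. (2)
print(var_rct_opt, var_bench)   # the RCT variance clearly exceeds the benchmark
```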

Amery-Dony Design

In the Amery and Dony design (AD design), at the open-label stage all qualified n patients are assigned to the treatment arm, see Fig. 2. After completion of this stage, all n− nonresponders leave the trial, while all n+ responders, including placebo responders, are randomized between the placebo and treatment arms. Let n1+ and n2+ = n+ − n1+ be the numbers of patients assigned to the treatment and placebo arms,
respectively. It is assumed that one of the restricted randomization methods (for instance, randomization in blocks or an urn method, cf. Piantadosi 1997, Chap. 9.3) is applied to keep the ratio n2+/n1+ as close as possible to the targeted allocation rates ratio γ/(1 − γ). Most statements/derivations that follow will use the allocation rates ratio.
The results of the second stage are the number of placebo responders n2p out of the n2+ responders assigned to the placebo arm, the number of treatment-only responders n2t = n2+ − n2p, and the number of responders n1+ on the treatment arm, which is identical to the number of randomized patients n1+ and does not provide any new information. Hence, the most informative, albeit not very ethical, AD design would be executed with γ = 1.
The straightforward application of the maximum likelihood method leads
(cf. Fedorov and Liu 2014) to the following MLE of πt:

$$\hat{\pi}_t = \frac{n_{2t}}{\gamma n} \quad \text{and} \quad \mathrm{Var}(\hat{\pi}_t) = \frac{1}{n}\left[\pi_t(1-\pi_t) + \frac{1-\gamma}{\gamma}\,\frac{\pi_t \pi_p}{\pi_t + \pi_p}\right]. \qquad (9)$$

The choice of γ defines the "learn-or-treat" aspects of the AD design. It should be large enough to secure the validity of statistical inferences but not too large to violate ethical constraints by placing too many patients on the placebo arm. Note that for γ = 1, the variance defined in Eq. (9) coincides with its benchmark (Eq. 2), and it is always less than the variance of $\hat{\hat{\pi}}_t$ associated with any RCT design, see Eqs. (7) and (8). Interestingly, when γ ≥ 1/2 the AD design is always superior to the conventional RCT with equal enrollment rates for treatment and placebo arms, i.e., the variances of the maximum likelihood estimators of πt, πp, Rt, etc. are lower for AD designs than for RCT designs.
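A short sketch of the variance in Eq. (9), evaluated at hypothetical prior guesses, makes the learn-or-treat trade-off in γ explicit:

```python
# Sketch: variance of the AD-design MLE of pi_t (Eq. 9) as a function of the
# allocation rate gamma; pi_t, pi_p, and n are hypothetical.
def var_ad(pi_t, pi_p, n, gamma):
    return (pi_t * (1 - pi_t)
            + (1 - gamma) / gamma * pi_t * pi_p / (pi_t + pi_p)) / n

for gamma in (0.5, 0.75, 1.0):
    print(gamma, round(var_ad(0.2, 0.2, 200, gamma), 5))
# gamma = 1 reproduces the benchmark pi_t(1 - pi_t)/n, at the ethical cost of
# assigning all second-stage responders to placebo.
```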

Symmetric AD-Design

The analogy between RDT and cross-over designs opens the way to various mod-
ifications of RDT. For instance, similar to the conventional RCT, at the onset of the
trial, patients can be randomized to two compound arms. The first one, treatment-
placebo arm (TP arm), starts with the experimental treatment followed by discon-
tinuation of nonresponders and by the placebo assignment for responders. The
second one, the placebo-treatment arm (PT arm), starts with the placebo run-in
followed with the discontinuation of placebo responders and by the experimental
treatment assignment for the rest. See Fig. 3 for details and notations.
Unlike traditional cross-over design (cf. Piantadosi 1997, Chap. 16; Senn 1997,
Chap. 17), the number of patients at the second stage is less than at the first stage:
nonresponders leave the TP arm and placebo responders leave the PT arm. Under
assumptions made in section “Notations and Major Assumptions,” the withdrawal of
patients (n1− from the TP arm and n2p from the PT arm) does not lead to any loss of
information but improves the ethical profile of the respective clinical trials.
Let n1 and n2 patients be randomized to the TP and PT arms, respectively. As in the previous section, n = np + nt + n− and only n is known before the trial. After the
completion of the first stage, the number of nonresponders n1− for the TP arm and the number of placebo responders n2p for the PT arm will be known. These two groups leave the trial. All n1+ = n1p + n1t responders to treatment are assigned to placebo, and all n2− + n2t placebo nonresponders are assigned to the experimental treatment. One can observe that

$$n_t = n_{1t} + n_{2t}, \quad n_p = n_{1p} + n_{2p}, \quad n_- = n_{1-} + n_{2-},$$

i.e., n, nt, np, and n− are known, and the MLE (Eq. 1) can be calculated:

$$\hat{\pi}_t = \frac{n_{1t} + n_{2t}}{n_1 + n_2} = \frac{n_t}{n}, \quad \hat{\pi}_p = \frac{n_{1p} + n_{2p}}{n_1 + n_2} = \frac{n_p}{n}, \quad \hat{\pi}_- = \frac{n_{1-} + n_{2-}}{n_1 + n_2} = \frac{n_-}{n}. \qquad (10)$$
They have variances that coincide with their benchmark values, see Eq. (2). Hence, the symmetric AD design is always statistically superior to the conventional two-arm RCT design and to the original AD design, except for the case with γ = 1. The symmetric AD design assumes that all qualified patients are randomized to two compound arms (TP and PT) at the first stage, preserving most of the conventional RCT features related to blinding and operational integrity. Curiously enough, the variances of all estimators in Eq. (10) do not depend on n1 and n2, but only on the total number n of qualified patients. Consequently, a trialist may select any randomization rates to address ethical and operational constraints. This makes the symmetric AD design very attractive.
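For illustration, the sketch below computes the MLEs of Eq. (10) from hypothetical stage-wise counts of a symmetric AD design:

```python
# Sketch: MLEs from a symmetric AD design (Eq. 10) using hypothetical counts.
n1t, n1p, n1m = 22, 18, 60    # TP arm: treatment-only, placebo responders, nonresponders
n2t, n2p, n2m = 20, 16, 64    # PT arm: the same three categories
n = n1t + n1p + n1m + n2t + n2p + n2m

pi_t = (n1t + n2t) / n
pi_p = (n1p + n2p) / n
pi_m = (n1m + n2m) / n
print(pi_t, pi_p, pi_m)       # estimates whose variances attain the benchmark (Eq. 2)
```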
There exist a few designs similar to the symmetric AD-design, for instance, the
“sequential parallel comparison design” introduced in Fava et al. (2003). The major
distinction is that in the latter the treatment and placebo response rates may differ between stages 1 and 2; otherwise, the designs are identical. The statistical properties of sequential parallel comparison designs, for binary responses and with possible dropouts, are extensively discussed in Ivanova, Qaqish, and Schoenfeld (2010).

Sample Size Calculation

Required Variance
Given the sample size, the variances of estimated parameters (πt, πp, Rt, etc.) can be
found using formulae (2), (3), (4), (5), (6), (7), (8), (9), and (10). The same equations
allow calculation of sample sizes that are needed to reach a required variance V of
the estimated parameters. For instance, for estimating πt, the sample size is

$$n' = \frac{v'}{V} = \frac{1}{V}\left[\sqrt{\pi_+(1-\pi_+)} + \sqrt{\pi_p\left(1-\pi_p\right)}\right]^2, \qquad (11)$$

for the optimized conventional trial design and

$$n'' = \frac{v''}{V} = \frac{1}{V}\,\pi_t(1-\pi_t), \qquad (12)$$
for the symmetric AD design, see Eq. (2) and the comments to Eq. (10).
For an illustration, consider a toy scenario with the prior guesses πt = 0.2 and πp = 0.2 and the required V = 0.0044, i.e., √V ≈ πt/3. From Eqs. (11) and (12) it follows that v′ = 0.79 and v″ = 0.16, and respectively n′ = 180 and n″ = 36 (after rounding to the next integer).
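The toy calculation can be reproduced with the sketch below, taking the required variance as V = (πt/3)² ≈ 0.0044:

```python
# Sketch: the toy example for Eqs. (11) and (12) with pi_t = pi_p = 0.2
# and required variance V = (pi_t/3)**2.
import math

pi_t = pi_p = 0.2
pi_plus = pi_t + pi_p
V = (pi_t / 3)**2

v1 = (math.sqrt(pi_plus * (1 - pi_plus)) + math.sqrt(pi_p * (1 - pi_p)))**2
v2 = pi_t * (1 - pi_t)
print(round(v1, 2), round(v2, 2))            # 0.79 and 0.16
print(math.ceil(v1 / V), math.ceil(v2 / V))  # 179 and 36 (the text reports 180 and 36)
```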
The inverses of v 0 and v 00 can be interpreted as the (Fisher) information values
gained per one subject/observation, cf. Atkinson et al. (2014) and Fedorov (1972).
Formulae (2), (3), (4), (8), and (9) provide v 0 and v 00 for various designs and
estimands by choosing n ¼ 1. Note, that ratio n0 /n00 depends only on v 0 /v 00 :
hpffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
 i 2
n0 v0 π þ ð1  π þ Þ þ π p 1  π p
¼ ¼ ð13Þ
n00 v00 π t ð1  π t Þ

and the design efficiency ordering is the same either in terms of variances given
sample size or in terms of sample sizes given variance.

Hypotheses Testing
Let us consider testing of hypotheses and let

$$H_0: \psi = \psi_0 \quad \text{and} \quad H_A: \psi \geq \psi_A,$$

where ψ is a parameter/estimand of interest. Let α and β be the targeted type one and type two error rates, respectively. Under the assumption of asymptotic normality of the estimators $\hat{\psi}$, the required sample size for a one-sided test is

$$n \simeq v\left(\frac{Z_{1-\alpha} + Z_{1-\beta}}{\psi_A - \psi_0}\right)^2, \qquad (14)$$

where, as before, v is the variance of a single observation, cf. Fedorov and Liu (2014). Let us continue with the previous example, i.e., with prior guesses πt = 0.2, πp = 0.2 and respectively with v′ = 0.79 and v″ = 0.16. Let ψ0 = 0, ψA = πt = 0.2 and let α = 0.025, β = 0.1, i.e., Z1−α = 1.96, Z1−β = 1.28. Substituting these numbers into Eq. (14) gives n′ = 208 (the sample size for the optimized RCT design) and n″ = 42 (the sample size for the symmetric AD design). As in the previous subsection, n′/n″ = v′/v″.
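The sample sizes quoted above follow from Eq. (14), as the sketch below verifies using the chapter's rounded quantiles Z1−α = 1.96 and Z1−β = 1.28:

```python
# Sketch: one-sided sample sizes from Eq. (14) for the optimized RCT
# (v' = 0.79) and the symmetric AD design (v'' = 0.16).
import math

def n_required(v, psi_a, psi_0=0.0, z_alpha=1.96, z_beta=1.28):
    return math.ceil(v * ((z_alpha + z_beta) / (psi_a - psi_0))**2)

print(n_required(0.79, 0.2), n_required(0.16, 0.2))  # 208 and 42
```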

Clinical Applicability

Generalization of Results to the Source Population

The major concern over inferences generated by randomized discontinuation trials is their applicability to the originally targeted population. An immediate reaction to such a query is that no reasonable and scientifically sound response can be given
until the estimands of prime interest and their estimators are carefully defined
(cf. FDA Guidance for Industry 2019a).
For instance, if the fraction πt of the total source population is the estimand of
major interest, then the enrichment trial based on the symmetric AD design allows
estimation of this quantity, see either Eq. (10) or (11). Both formulae use data from
stages 1 and 2. The data from stage 2 are not sufficient to estimate πt but the
estimation of the subpopulation fraction of treatment responders ϕt = πt/π+ is still
possible. In general, the data from all stages of RDT provide information that allows
scientific inferencing for the source population, often more precisely than from
the RCT with the same sample size. Apparently, it is not true for the data collected
from the enriched subpopulation only. The statement that RDT, like conventional
RCT, may generate valid statistical inferences for the same estimands related to the
source population is not equivalent to the statement that the respective estimates will
be quantitatively and qualitatively close. RDT and RCT have different statistical and
operational profiles. The respective estimators may have different statistical proper-
ties and different operational biases. The RCT designs were specifically developed to
avoid disclosure of any information that may influence the behavior of either patients
or trial staff, or both, through careful randomization, double blinding, etc., to mini-
mize operational bias (see Piantadosi 1997, Chap. 5.3; Pocock 1983, Chap. 4). At the
same time, the RDT designs, which use open label run-in stages and interim
analyses, are more prone to such biases, see examples in Freidlin, Korn, and Abrams
(2018) and Kopec et al. (1993).

Ethical Aspects

The initial motivation for developing the RDT methodology was the very
ethically sound intention to find “a clinical trial design avoiding undue placebo
treatment,” see Amery and Dony (1975). They managed to reduce the prolonged
exposure of the patient(s) to placebo treatment, which is a typical ethical problem
in conventional randomized clinical trials. Returning to Figs. 1 and 2, one can
observe that, with randomization rates equal for the placebo and treatment arms, the total exposure times for RCT and RDT are TRCT = n/2 × [treatment duration]RCT and TRDT = n+/2 × [treatment duration at stage 2]RDT. Usually, n > n+, and therefore TRCT > TRDT.
Another ethically attractive feature of RDT is an opportunity to withdraw patients
who do not benefit from the experimental treatment. However, some critics
have argued that reassigning the confirmed treatment responders to the placebo
arm, as in RDT, is an unethical move and unacceptable, for instance, in oncology
trials. Thus, there are many pro and con aspects that need a thorough professional
discussion before and during crafting a trial protocol. In general, balancing the pros
and cons is and should be very specific for each therapeutic area. Interesting
examples can be found, for instance, in Daugherty et al. (2008), Fava et al. (2003),
FDA Guidance for Industry (2019b), Ratain et al. (2006), Sonpavde et al. (2006),
Stadler (2007), Stadler et al. (2005), and Temple (1994).

Classification Hurdles

The concept of “responder” is essential for RDT, and is based on the dichotomiza-
tion of continuous or discrete responses (cf. Uryniak et al. 2011; Fedorov et al.
2008). The poor selection of cutoff levels may lead to nonzero probabilities p1 of
false-negative (a “true” responder is assigned to the group of nonresponders) and p2
of false-positive classifications (a “true” nonresponder is assigned to the group of
responders). These probabilities can be appreciable for some response types: blood
pressure in the cardiovascular area and scores in psychiatry are typical examples. In
other therapeutic areas, it is natural to assume that the open-label stage of RDT takes
a shorter time than a single stage of a conventional RCT and therefore practitioners
resort to surrogate endpoints (cf. Burzykowski et al. 2005), which do not provide
results identical to measurements at the end of RDT. For instance, in oncology, the
tumor size change at the end of stage 1 might be used to separate responders and
nonresponders, while the actual endpoint could be the tumor size change at the end
of stage 2 that can be of the opposite sign. In general, false classification should be
taken into account whenever within-patient variability is expected to be comparable
with the population variability.
The statistical superiority of RDT over RCT (lower variances of estimated parameters, smaller sample sizes in hypothesis testing) diminishes as the probabilities of false classification increase: the "enriched" subpopulation will miss some responders falsely classified as nonresponders and will include some nonresponders falsely classified as responders. As a result, erroneously classified patients will be assigned to the wrong treatment arms. For the relatively simple AD designs, it was shown in Fedorov and Liu (2005, 2014) that the set of scenarios where RDTs dominate RCTs shrinks as p1 and p2 increase.
The negative role of the false classification could be mitigated if the inferential
part of data analysis were performed with original (non-dichotomized) data. The
reporting component may include the post-analysis dichotomization to make con-
clusions more comprehensible and transparent for the larger audience.

Place of RDT Designs in the Family of Enrichment Trial Designs

As was previously pointed out, randomized discontinuation trial designs are com-
monly viewed as special cases of enrichment strategies, see, for instance, Fedorov
and Liu (2007), FDA Guidance for Industry (2019b), Hallstrom and Friedman
(1991), Pablos-Méndez et al. (1998), and Temple (1994). The difference
between RDT and other types of population enrichment trials is that their design
and analysis does not rely on any disease-related labels (e.g., biomarkers and social
markers). All others rely on partitioning source populations into subpopulations with
specific labels. This partitioning is based on prior/historic data or some intelligent
guesses. The major goal of the respective enrichment strategies is the identification
of the treatment responsive subpopulations and thorough validation of the respective
efficacy-toxicity profiles.

RDTs pursue a seemingly easier goal, which is the proof of the existence of such
a subpopulation. However, it has to be reached from a more remote starting point
where the disease-informative population partitioning does not exist. The proof that
some patients benefit from the experimental treatment is very encouraging but not
sufficient for the prediction of outcomes for future patients. Further, to move closer
to precision medicine, the RDT methodology has to be complemented with statis-
tical methods that allow the posttrial selection of disease informative markers. The
situation is similar to that of unsupervised and supervised learning in artificial
intelligence or, more specifically, in the machine learning paradigm. In the
unsupervised case, given unlabeled/unmarked data, the algorithms try to make
sense by extracting patterns (responsive subpopulations) on their own. In the
supervised case, the algorithms learn on a labeled dataset (i.e., build “knowledge”)
and provide a prediction of what may happen to subsets with specific labels (sub-
populations with specific biomarkers), followed by field verification of the validity/
accuracy of this prediction on newly accrued data.

Key Facts

The idea of randomized discontinuation (withdrawal) trials was conceived to pursue two targets: to filter out the placebo effect noise while measuring the efficacy of
a novel treatment and to minimize exposure of enrolled subjects to placebo. To some extent, the RDT design methodology was borrowed from the theory of cross-over
designs: expose a patient to different treatments to gain more information and
minimize the impact of the between patient variability. This fact was mentioned in
the pioneering paper by Amery and Dony (1975) and it may help to develop more
sophisticated designs that allow effective comparison of several treatments and their
effects across multiple subpopulations.
It should be emphasized that the benefits of RDTs are gained at the expense of
limited applicability of trial results to the general population: often they are valid
only for the enriched subpopulations and, as was already mentioned, extra statistical
and medical effort is needed to determine the sound identifiers that can predict which
future patients will benefit from the new treatment. Without this knowledge, one has
to resort to the old-fashioned “trial and error” method to select the right treatment for
a given patient.

References
Amery W, Dony J (1975) Clinical trial design avoiding undue placebo treatment. J Clin Pharmacol
15:674–679
Atkinson AC, Fedorov VV, Herzberg AM, Zhang R (2014) Elemental information matrices and
optimal experimental design for generalized regression models. J Statis Plan Inference 144:
81–91
Burzykowski T, Molenberghs G, Buyse M (2005) The evaluation of surrogate endpoints. Springer,
New York

Capra WB (2004) Comparing the power of the discontinuation design to that of the classic
randomized design on time-to-event endpoints. Control Clin Trials 25:168–177
Chiron C, Dulac O, Gram L (1996) Vigabatrin withdrawal randomized study in children. Epilepsy
Res 25:209–215
Daugherty CK, Ratain MJ, Emanuel EJ, Farrell AT, Schilsky RL (2008) Ethical, scientific, and
regulatory perspectives regarding the use of placebos in cancer clinical trials. J Clin Oncol 26:
1371–1378
Fava M, Eveins A, Dorer D, Schoenfeld D (2003) The problem of the placebo response in clinical
trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach.
Psychother Psychosom 72:115–127
FDA Guidance for Industry (2019a) ICH E9 (R1) addendum on estimands and sensitivity analysis
in clinical trials to the guideline on statistical principles for clinical trials. https://www.fda.gov/
media/108698/download
FDA Guidance for Industry (2019b) Enrichment strategies for clinical trials to support determina-
tion of effectiveness of human drugs and biological products. https://www.fda.gov/ucm/groups/
fdagov-public/@fdagov-drugs-gen/documents/document/ucm332181.pdf
Fedorov VV (1972) Theory of optimal experiments. Academic, New York
Fedorov VV, Liu T (2005) Randomized discontinuation trials: design and efficiency.
GlaxoSmithKline biomedical data science technical report, 2005–3
Fedorov VV, Liu T (2007) Enrichment design. In: Wiley encyclopedia of clinical trials. Wiley,
Hoboken, pp 1–8
Fedorov VV, Liu T (2014) Randomized discontinuation trials with binary outcomes. J Stat Theory
Pract 8:30–45
Fedorov VV, Mannino F, Zhang R (2008) Consequences of dichotomization. Pharm Stat 8:50–61
Freidlin B, Simon R (2005) Evaluation of randomized discontinuation design. J Clin Oncol 23(22):
5094–5098
Freidlin B, Korn EL, Abrams JS (2018) Bias, operational bias, and generalizability in phase II/III
trials. J Clin Oncol 36(19):1902–1904
Grieve AP (2012) Discussion: Bayesian enrichment strategies for randomized discontinuation
trials. Biometrics 68:219–224
Hallstrom AP, Friedman L (1991) Randomizing responders. Control Clin Trials 12:486–503
Ivanova A, Qaqish B, Schoenfeld A (2010) Optimality, sample size, and power calculations for the
sequential parallel comparison design. Stat Med 30:2793–2803
Johnson NL, Kotz S, Balakrishnan N (1997) Discrete multivariate distributions. Wiley, New York
Kopec J, Abrahamowicz M, Esdaile J (1993) Randomized discontinuation trials: utility and
efficiency. J Clin Epidemiol 46:959–971
Korn EL, Arbuck SG, Pulda JM, Simon R, Kaplan RS, Christian MC (2001) Clinical trial designs
for cytostatic agents: are new approaches needed? J Clin Oncol 19:265–272
Oehlert GW (1992) A note on the delta method. Am Stat 46:27–29
Pablos-Méndez A, Barr RG, Shea S (1998) Run-in periods in randomized trials. J Am Med Assoc
279:222–225
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley, New York
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, New York
Ratain MJ, Eisen T, Stadler WM, Flaherty KT, Kaye SB, Rosner GL, Gore M, Desai AA, Patnaik A,
Xiong HQ, Rowinsky E, Abbruzzese JL, Xia C, Simantov R, Schwartz B, O'Dwyer PJ (2006)
Phase II placebo-controlled randomized discontinuation trial of sorafenib in patients with
metastatic renal cell carcinoma. J Clin Oncol 24:2505–2512
Rosner GL, Stadler WM, Ratain MJ (2002) Randomized discontinuation design: application to
cytostatic antineoplastic agents. J Clin Oncol 20:4478–4484
Senn SJ (1997) Statistical issues in drug development. Wiley, New York
Sonpavde G, Hutson TE, Galsky MD, Berry WR (2006) Problems with the randomized discontin-
uation design. J Clin Oncol 24:4669–4670
Stadler WM (2007) The randomized discontinuation trial: a phase II design to assess growth-
inhibitory agents. Mol Cancer Ther 6:1180–1185
Stadler WM, Rosner G, Small E, Hollis D, Rini B, Zaentz SD, Mahoney J (2005) Successful
implementation of the randomized discontinuation trial design: an application to the study of the
putative antiangiogenic agent carboxyaminoimidazole in renal cell carcinoma – CALGB 69901.
J Clin Oncol 23:3726–3732
Temple RJ (1994) Special study designs: early escape, enrichment, study in non-responders.
Commun Stat Theory Methods 23:499–531
Trippa L, Rosner GL, Müller P (2012) Bayesian enrichment strategies for randomized discontin-
uation trials. Biometrics 68:203–225
Uryniak T, Chan ISF, Fedorov V, Jiang Q, Oppenheimer L, Snapinn SM, Teng CH, Zhang J (2011)
Responder analyses – a PhRMA position paper. Stat Biopharm Res 3:476–487
76 Platform Trial Designs

Oleksandr Sverdlov, Ekkehard Glimm, and Peter Mesenbrink

O. Sverdlov (*) · P. Mesenbrink
Novartis Pharmaceuticals Corporation, East Hanover, NJ, USA
e-mail: [email protected]; [email protected]
E. Glimm
Novartis Pharma AG, Basel, Switzerland
e-mail: [email protected]

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1456
Background on Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1457
General Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1457
Single-Sponsor Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1460
Multisponsor Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1461
Statistical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1462
Choice of a Control Arm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1463
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1467
Data Monitoring and Interim Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1470
Sample Size and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1472
Data Analysis Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1474
Examples of Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1476
EPAD-PoC Study in Alzheimer’s Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1476
I-SPY COVID-19 Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1477
GBM AGILE Study in Glioblastoma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1478
FOCUS4 Study in Metastatic Colorectal Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1479
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1480
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1480

Abstract
Modern drug development is increasingly complex and requires novel
approaches to the design and analysis of clinical trials. With the precision
medicine paradigm, there is a strong need to evaluate multiple experimental
therapies across a spectrum of indications, in different subgroups of patients,
while controlling the chance of false positive and false negative findings. The
concept of master protocols provides a new approach to clinical trial design that
can help drug developers to enhance efficiency of clinical trials by addressing
multiple research questions within the same overall trial infrastructure. There are
three general types of trials requiring a master protocol: basket trials, umbrella
trials, and platform trials. The present chapter provides an overview of platform
trial designs. We discuss operating models for implementing platform trials in
practice, as well as some important statistical considerations for design and
analysis of such trials. We also discuss four real-life examples of platform trials:
the EPAD-PoC study in Alzheimer’s disease; the I-SPY COVID-19 study for
rapid screening of re-purposed and novel treatments for COVID-19; the GBM
AGILE study in glioblastoma; and the FOCUS4 study in metastatic colorectal
cancer.

Keywords
Master protocols · Multi-arm randomized controlled trials · Multiple comparisons

Introduction

Modern drug development is increasingly complex and requires novel approaches
to the design and analysis of clinical trials. With the precision medicine paradigm,
there is a strong need to evaluate multiple experimental therapies across a spectrum
of indications, in different subgroups of patients, while controlling the chance of
false positive and false negative findings. The concept of master protocols pro-
vides a new approach to clinical trial design that can help drug developers to
enhance efficiency of clinical trials by addressing multiple research questions
within the same overall trial infrastructure (Woodcock and LaVange 2017).
There are three general types of trials requiring a master protocol: basket trials,
umbrella trials, and platform trials. Basket trials evaluate a single investigational
compound in different indications to find the indication(s) that can be efficiently
treated with the given compound. Umbrella trials evaluate multiple investigational
compounds in one indication to find the “most promising” compounds, possibly
within different patient subgroups for the chosen indication. Platform trials eval-
uate multiple experimental treatments for a given indication in a perpetual manner,
and in theory, the platform trial can continue until the intervention(s) with the
desired risk/benefit profile are found.
Master protocols use a common trial infrastructure, often with a shared control
group, which may help streamline clinical operations and achieve enhanced and
expedited developmental decisions. At the same time, master protocols are complex,
may incur higher cost, and necessitate a lot of upfront planning and early engage-
ment with health authorities and other relevant stakeholders. Despite these chal-
lenges, the uptake of master protocols in drug development is increasing, and more
applications of these designs are expected in the near future (Park et al. 2019; Meyer
et al. 2020).

The present chapter provides an overview of one type of master protocols –
platform trials. Two other types of master protocols, basket and umbrella trials, are
beyond the scope of this chapter. For a recent book-length discussion on platform,
umbrella, and basket trials, see Antonijevic and Beckman (2019). Platform trials
represent a broad concept that can be applied in different stages of clinical develop-
ment. However, one should keep in mind an important distinction between explor-
atory studies and studies with a confirmatory component. The former studies are
common in phase II, where one of the objectives is to perform a screening of various
candidate compounds, eliminating suboptimal ones as early as possible, and focus-
ing research efforts on “most promising” treatment candidates. Such studies are
hypothesis generating and are not intended for marketing authorization. By contrast,
studies with a confirmatory component (e.g., phase II/III) should provide substantial
evidence to support claims for drug effectiveness, and therefore, they should incor-
porate some formal considerations of control of the type I error rate. In this chapter,
we intend to cover both phase II and phase II/III platform trial designs, emphasizing
the distinctions between the two types where appropriate.
Section “Background on Platform Trials” presents some general background on
platform trials and describes two distinct operating models – single-sponsor and
multisponsor – for implementing platform trials in practice. Section “Statistical
Considerations” presents some important statistical considerations for design and
analysis of platform trials. Section “Examples of Platform Trials” discusses four real-
life examples of platform trials – the EPAD-PoC study in Alzheimer’s disease; the
I-SPY COVID-19 study for rapid screening of re-purposed and novel treatments for
COVID-19; the GBM AGILE study in glioblastoma; and the FOCUS4 study in
metastatic colorectal cancer. Section “Summary and Conclusion” provides some
concluding remarks.

Background on Platform Trials

General Definitions

Using the terminology of Woodcock and LaVange (2017), the objective of a
platform trial is “to study multiple targeted therapies in the context of a single
disease in a perpetual manner, with therapies allowed to enter or leave the platform
on the basis of a decision algorithm.” Unlike standard randomized clinical trials
(RCTs) that are “intervention-focused,” platform trials are “disease-focused” (Berry
et al. 2015; The Adaptive Platform Trial Coalition 2019; Park et al. 2020). When
carefully designed and implemented, platform trials can potentially be more efficient
than a sequence of two-arm RCTs (Saville and Berry 2016).
Similar to platform trials, basket trials involve the application of a single therapy
to several subvariants of a disease, and umbrella trials evaluate several therapies in a
single disease. All of these trials require a master protocol describing the entire trial.
Typically, there is then one intervention-specific appendix (ISA) per intervention,
detailing the specifics of that intervention. In what follows, we will focus
on platform trials. However, many of the considerations given here also apply to
basket and umbrella trials.
Several recent systematic literature searches have revealed the growing popularity
and use of master protocol trials, and platform trials in particular (Siden et al. 2019;
Park et al. 2019; Meyer et al. 2020). More specifically, the number of identified
platform trials/total number of identified master protocols were 25/99 (Siden et al.
2019), 16/83 (Park et al. 2019), and 12/50 (Meyer et al. 2020). All of these references
highlight a rapid increase in the number of master protocols over the past 5 years,
and this trend is expected to continue.
Figure 1 presents an example of an open platform master protocol.
We consider a randomized, placebo-controlled, open platform trial evaluating
therapeutic effects of various investigational treatments (agents) in a selected indi-
cation. Figure 1a shows the structure of the master protocol. The core part (Sections
1 to 16) describes key design elements that remain the same across all agents in the
study. Section 17 details any information or procedures that are specific to a
particular agent. The platform trial will enroll patients in cohorts/substudies. In our
example, the study starts with Cohort 1, in which eligible patients will be random-
ized to TRT1 or Control (Fig. 1b). Interventions for future cohorts will become
available over time, and subsequent Cohorts 2, 3, and 4 are planned to be added to
the master protocol. In each cohort, there is a control group (not displayed in
Fig. 1a) that is assumed to be the current standard-of-care (SOC) treatment. We
assume that each subsequent cohort may include up to three investigational agents;
for example, in Cohort 2, we provision for TRT2, TRT3, and TRT4. Within each
cohort, eligible patients will be randomized among the available active treatment
arms or control.
At some prespecified time points in the study, interim analyses (IAs) will be
performed (Fig. 1b). At each interim analysis (IA1, IA2, IA3, ...), accrued data will
be analyzed and a predetermined statistical decision rule will be applied. The nature
of decisions will generally depend on the trial design and the study objectives.
Fundamentally, both the analyses and the types of decisions should be prespecified
in the protocol (not made on an ad hoc basis) to maintain the integrity and the
validity of the results.
Our considered example in Fig. 1 is typical for a phase II platform trial. In this
case, the following decisions can be considered for any given investigational
treatment arm:

(i) Advance the arm for further development (outside of the current master proto-
col), if it exhibits sufficient evidence of activity. Or
(ii) Drop the arm from the study, if there is sufficient evidence of lack of activity. Or
(iii) Continue the arm in the study to the next decision point, if the results are
indeterminate and the maximum sample size for this treatment arm has not been
reached.

A more complex scenario is a phase II/III platform trial, which may include IAs
during both phase II and phase III parts of the study. The IA decisions during a
phase II part of a phase II/III trial often would be based on a surrogate outcome
measure, such as some biomarker predictive for clinical efficacy, and these deci-
sions can be described using items (i)–(iii) above. However, additional IAs can be
also considered during a confirmatory (phase III) part of the study. In this case, the
interim decisions will be made based on accrued clinical outcome data, which may
be the primary efficacy endpoint. For any investigational treatment arm, the
decisions may be:
(iv) Declare superior efficacy of the treatment arm over the control and stop the trial
early, if clinical efficacy results are outstanding for this arm. Or
(v) Terminate the treatment arm for futility, if there is sufficient evidence of lack of
efficacy for this arm. Or
(vi) Continue the treatment arm to the next decision point (IA or end of study), if
evidence for efficacy and futility is inconclusive and the maximum sample size
for this arm has not been reached.

The efficacy decision would typically require a formal decision rule such as a
statistical test with type I error control. In contrast, the futility rule, even in phase III,
would typically not require the same level of formality. In addition to efficacy
assessments, the usual safety monitoring rules would commonly be applied, such
that interventions or the entire study may be stopped due to safety findings.
The open-endedness of the platform trial allows adding and removing of interven-
tions as the study is ongoing. If it is decided to introduce a new arm, an additional ISA
would be added to the master protocol (Fig. 1a). The randomization weights must be
updated accordingly, and new eligible subjects will be randomized to a specific ISA
and then to a specific treatment within the ISA (Fig. 1b). (Other approaches for
implementing randomization can be considered; e.g. subjects may be randomized
among all available study arms, as done in a parallel multi-arm trial.) In our example,
the initial randomization in Cohort 1 is 1:1 (TRT1 or Control), and thereafter it can be
modified, with potentially increased allocation ratio to novel experimental agents. The
choice of a randomization algorithm (e.g., if response-adaptive randomization is
utilized) should be discussed with the health authorities and it must be carefully
justified in the master protocol. We shall discuss response-adaptive randomization
(RAR) in more detail in section “Response-Adaptive Randomization”; for now, we
just make an important remark that RAR has both merits and limitations, and it may
potentially be utilized in a phase II platform trial or during the phase II screening part
of a phase II/III platform trial, but not during its confirmatory stage.

Single-Sponsor Platform Trials

Platform trials can provide a valuable framework for the development of novel
therapies within a biopharmaceutical company. While platform trials can be
designed in both early and late clinical development, the concept may be more
appealing in the early stages (e.g., phase II), where the objective is to perform a fast
screening of many investigational agents, many of which are potentially not effica-
cious. Suppose we have an indication with an unmet medical need for treatment and
assume that there are multiple potential candidate compounds for this indication
within the drug development portfolio of Company X. Once the safety of a particular
compound has been established (e.g., with first-in-human data and sufficient toxi-
cology data), it is ready to be further assessed in the clinical proof-of-mechanism and
clinical proof-of-concept (PoC) trials. A traditional 1:1 randomized controlled PoC
trial explores whether the investigational drug is likely to achieve the desired
therapeutic effect and whether it merits testing in a large-scale confirmatory trial in
patients. However, given a large variety of candidate compounds and their combi-
nations, running multiple PoC trials may be infeasible. A phase II platform trial is an
attractive option, if:

• There is a strong scientific rationale (e.g., common scientific hypothesis, consistently
defined population, and other design elements) for evaluating multiple
therapies and possibly their combinations in the chosen indication.
• The company has multiple candidate compounds/formulations for the chosen
indication.
• There is a strong need/interest in developing more than one compound in the
given indication; for example, due to different mechanisms of action or due to
inadequate preclinical models to make comparative assessment of compounds.

There are several options with respect to the structure of a clinical trial team
(CTT) for a platform trial within a single pharmaceutical Sponsor. One approach
would be to build upon some existing clinical teams in the given disease area. This
will ensure that the clinical lead and key team members are the same across
compounds, which helps establish consistency and efficient communication; how-
ever, it also requires considerable commitment and continuous support from the
team members. Another approach is to designate an “independent” platform trial
team that would collaborate with different compound teams within the company,
thus providing integrated efforts to develop the master protocol and ensure that the
design properly accommodates each compound. A third approach is to have an
external group run the trial on behalf of the company and tap into their disease area
knowledge. All of these approaches require substantial upfront planning and invest-
ment, greater than one would expect for a standard clinical development path
(Schiavone et al. 2019; Hague et al. 2019; Morrell et al. 2019).
As an example, consider a Novartis-sponsored phase II open-entry platform trial
evaluating efficacy and safety of novel spartalizumab combinations in previously
treated unresectable or metastatic melanoma (ClinicalTrials.gov Identifier:
NCT03484923). The study design is described in detail by Racine-Poon et al.
(2020). The design consists of two parts: the exploratory Part 1 in which candidate
treatments are evaluated for activity in a randomized manner, and the confirmatory
Part 2 in which the “winner” treatment arms from Part 1 are expanded to achieve the
desired level of predictive power for confirmatory statistical hypothesis testing on the
objective response rate (ORR). The study core team was formed on the basis of the
Novartis clinical oncology group, capitalizing on internal knowledge and relevant
subject matter expertise.

Multisponsor Platform Trials

Taking a broad perspective, the overall success rates of new drug development have
been disappointingly low (Scannell et al. 2012; Wong et al. 2019), despite rising
developmental costs (DiMasi et al. 2016). The need for innovation and moderniza-
tion of drug development through collaboration among multiple public and private
entities has become apparent over the years (Woodcock and Woosley 2008).
Crowdsourcing or multisponsor models, where different biopharmaceutical compa-
nies are working in a coordinated manner to develop new medicines for high unmet
medical needs, may provide a very useful framework for modern drug development
(Bentzien et al. 2015). One sensible model is when an academic research unit (or a
network of several academic centers) acts as the coordinator of the platform trial
activities. In this case, the academic research unit may: (i) secure funding for this
research through grants, (ii) develop the master protocol, (iii) build the trial infra-
structure, (iv) attract different pharma/biotech companies to participate and contrib-
ute their investigational compounds for the trial, etc.
Multisponsor platform trials are increasingly common in clinical research. Some
notable examples include the I-SPY 2 trial of novel neoadjuvant therapies in breast
cancer (Barker et al. 2009; Esserman et al. 2019), the I-SPY COVID-19 study of
promising therapeutic agents in critically ill COVID-19 patients (https://clinicaltrials.
gov/ct2/show/NCT04488081), the Systemic Therapy for Advancing or Metastatic
Prostate Cancer (STAMPEDE) study (James et al. 2009), just to name a few.
Multisponsor platform trials require more upfront planning than single-sponsor
trials, because of the need to build the operational platform infrastructure, obtain
alignment across the stakeholders, and get all necessary authorizations from health
authorities. In fact, the FDA guidance for industry “Adaptive designs for clinical trials
of drugs and biologics” (Food and Drug Administration 2019) states this explicitly:

...Because these (adaptive platform) trials may involve investigational agents from more
than one sponsor, may be conducted for an unstated length of time, and often involve
complex adaptations, they should generally involve extensive discussion with FDA...

The “independent” clinical development team on a platform trial would be liaising
with different companies interested in providing their investigational compounds and
co-sponsoring the study. From an individual sponsor’s perspective, there are both
advantages and disadvantages to having an independent clinical trial team. The
advantages include significant work by this independent team, which includes careful
coordination of tasks, starting from the development of a master protocol and involve-
ment in all subsequent activities as the study progresses. A major limitation is the lack
of full control over the study for any participating company. The “independent” team
may be less experienced in the disease area than some developers from the sponsor
companies. Moreover, while each company should be able to have a comparison of the
effects of their investigational assets to the shared control group, the direct between-
asset comparisons would be less common (Food and Drug Administration 2018).

Statistical Considerations

The design of a platform trial poses scientific, statistical, operational, and regulatory
challenges. In addition, the choice of a study design will depend on the disease area,
the competitive landscape, the established industry practices, and the development
phase, in particular whether the study is exploratory or confirmatory. The main objective of
any scientific experiment is to obtain reliable answers to the questions of interest.
Keeping this in mind, various design options should be judiciously evaluated at the
study planning stage. Comparing these options through simulations under different
experimental scenarios will be an essential step in selecting the design to be
implemented (Mayer et al. 2019).
We discuss some important statistical considerations and key design elements that
can be viewed as “building blocks” for constructing a platform trial. Our presenta-
tion here is high-level and nontechnical. The intent is to provide a succinct summary
of strategic considerations for design and analysis of platform trials and, where
appropriate, to provide references to relevant statistical methodology papers. Given
the novelty of the topic, our review here is by no means comprehensive; many new
designs and concepts are yet to emerge.

Choice of a Control Arm

The use of a control group is a fundamental principle of the design of any compar-
ative clinical trial. The main purpose of a control group is to minimize confounding
of the treatment effect with other factors (such as the natural history of the disease),
thereby improving the quality of statistical inference on the treatment effect. The
importance of the choice of a control group in clinical trials is well acknowledged
and is documented in the ICH E10 guideline (International Conference on
Harmonisation E10 2001). In platform trials, many of which are designed to evaluate
the effects of various experimental treatments, considerations on the control group
are particularly important. The FDA guidance on master protocols has the following
statement in this regard (Food and Drug Administration 2018):

...FDA recommends that a sponsor use a common control arm to improve efficiency in
master protocols where multiple drugs are evaluated simultaneously in a single disease (e.g.,
umbrella trials). FDA recommends that the control arm be the current SOC so that the trial
results will be interpretable in the context of U.S. medical practice...

In a recent literature review, Meyer et al. (2020) found that among 50 identified
master protocol trials, the majority (28 out of 50, 56%) had no control group. More
specifically, among the 12 identified platform trials, five trials were designed using
concurrent control, six trials included nonconcurrent control, and one trial had no
control group. The corresponding numbers for nine identified umbrella trials were four
(concurrent control), one (nonconcurrent control), and four (no control). Let us
discuss different possibilities for the control group in more detail.

Historical Controls
Historical data (e.g., from previous clinical trials in the same indication) provides
valuable information that may potentially supplement evidence from a new RCT
(Pocock 1976). However, one cannot simply rely on historical controls as a basis for
comparison, because there might be differences in the populations, for example, due
to change in medical care over time (Byar 1980).
There are different methods to utilize historical control data both in the design and
the analysis of clinical trials (Viele et al. 2014; Chen et al. 2018). Many phase II trials
in oncology simply use a historical reference value of the objective response rate
(ORR). For instance, a common approach to evaluate the activity of a new com-
pound is through Simon’s two-stage optimal design (Simon 1989) to test the
hypotheses H0: ORR = p0 vs. H1: ORR = p1, where p0 is the historical reference
value of the ORR, and p1 > p0 is some threshold representing promising activity.
In a platform trial, one is interested in evaluating multiple candidate treatments,
and so the study design may involve randomization to one of the treatment arms, but
the analysis for each arm is standalone (i.e., involves no comparison against control).
This approach was implemented in the platform trial in metastatic melanoma
(NCT03484923; Racine-Poon et al. 2020), where no adequate SOC is currently
available. In that study, the primary analysis for each “winner” arm that has been
promoted from Part 1 to Part 2 involved testing H0: ORR = 0.10 vs. H1: ORR = 0.30.
The lower bound of the 95% confidence interval using Clopper-Pearson’s exact
method for ORR was used as a criterion to decide whether a treatment warranted
further investigation in pivotal studies. Alternatively, the analysis could incorporate
relevant historical control data using some Bayesian borrowing technique, such as
hierarchical modeling (Viele et al. 2014). Such analysis would account for uncer-
tainty in the historical ORR, but it would also require careful assessment of assump-
tions necessary for a valid treatment comparison.
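
To make the Clopper-Pearson criterion concrete, the following minimal Python sketch computes the exact lower confidence bound and an exact binomial test against a historical reference ORR. The helper name and the interim counts are hypothetical illustrations, not values taken from the actual trial.

```python
# A minimal sketch (counts are hypothetical): exact lower confidence bound
# and exact binomial test against a historical reference ORR.
from scipy.stats import beta, binomtest

def clopper_pearson_lower(x: int, n: int, conf: float = 0.95) -> float:
    """Exact (Clopper-Pearson) lower confidence bound for a binomial proportion."""
    return 0.0 if x == 0 else beta.ppf((1 - conf) / 2, x, n - x + 1)

p0 = 0.10       # historical reference value of the ORR under H0
x, n = 9, 30    # hypothetical interim data: 9 responders among 30 patients

lower = clopper_pearson_lower(x, n)
pval = binomtest(x, n, p=p0, alternative="greater").pvalue
print(f"ORR = {x/n:.2f}, 95% CI lower bound = {lower:.3f}, exact p = {pval:.4f}")
# Promote the arm only if the lower confidence bound clears the reference value.
print("warrants further investigation" if lower > p0 else "does not warrant promotion")
```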

Concurrent Controls
In clinical settings where it is not feasible to run a series of standard adequately
powered two-arm RCTs (e.g., in rare diseases), a multiarm randomized platform trial
with a shared control group may be an appealing and efficient approach (Saville and
Berry 2016). For instance, platform trials evaluating multiple treatments from
different sponsors can benefit from borrowing of data from the pooled placebo
group for individual treatment comparisons. Since platform trials evaluate novel
treatments perpetually, some special considerations on the shared control are
required.
For illustrative purposes, consider a hypothetical platform trial with five experimental
treatment arms and Control (Fig. 2).
Suppose that for each comparison of experimental vs. control, 100 patients per arm
provide a sufficient sample size to test the treatment difference. The trial starts with
randomizing the initial 100 patients equally between TRT1 and Control. After that,
two new arms are added, and an additional 200 patients are randomized among TRT1,
TRT2, TRT3, and Control (50 per arm). At that point, TRT1 achieves its target
sample size, and the randomization is shifted to TRT2, TRT3, or Control such that
an additional 150 patients are randomized (50 per arm). Thereafter, a new arm TRT4 is
added (ISA2) and the next 100 patients are randomized between TRT4 and Control
(50 per arm). Finally, after TRT5 is added (ISA3), the trial continues with randomizing
an additional 150 patients among TRT4, TRT5, or Control, and the last
100 patients between TRT5 and Control. Overall, in this hypothetical study, each
treatment arm has 100 patients, and the Control arm has 300 patients.
Assume the primary outcome is available soon after randomization, and the data
analysis for each arm takes place after the target number of subjects have been
randomized and treated. In the analysis, different strategies for utilizing control data
are possible.

1. All accrued data in the Control arm at the time of analysis is utilized, treating all
observations as if they had been concurrently obtained. In our example, the size of
the control group for treatment comparison is 100 for TRT1, 150 for each of the
TRT2 and TRT3, 250 for TRT4, and 300 for TRT5. A larger size for the control
arm would enable more robust inference. A major assumption is that there are no
hidden confounders such as a time trend.
2. Only data from the Control arm that was part of the randomization sequence
concurrent with the given experimental treatment arm is utilized. The argument
here is that it is difficult to justify pooling of control observations that are
separated by some time interval. In our example, first 100 allocations to control
are concurrent with TRT1, allocations 51–150 to control are concurrent with
TRT2 and TRT3, allocations 151–250 to control are concurrent with TRT4, and
allocations 201–300 to control are concurrent with TRT5. Therefore, in this case,
the size of the control group for each treatment comparison is 100.
3. Pooling of data from the Control arm in the study is performed using some
statistical methodology. Several recent papers discuss approaches that may be
relevant in this context (Yuan et al. 2016; Galwey 2017; Hobbs et al. 2018; Jiao
et al. 2019; Tang et al. 2019; Normington et al. 2020). The methods include “test-
then-pool” strategy, dynamic pooling, Bayesian hierarchical modeling, to name a
few. These methods can be applied not only to the shared internal control arm, but
also to some historical control data or some relevant concurrent external data that
may become available as the platform trial is ongoing. They provide a compro-
mise between approaches #1 and #2 in that either historical information is down-
weighted, but not entirely discarded, or is included in the analysis only if
sufficiently similar to the concurrent data (which could also be interpreted as a
form of down-weighting, since it is included in the analysis with a probability less
than 1).
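
As a rough illustration of the “test-then-pool” idea mentioned in strategy 3, the Python sketch below pools nonconcurrent control data only when a similarity test does not detect a difference from the concurrent control. The function name, the similarity level, and the counts are hypothetical; the more principled dynamic-borrowing methods cited above would typically be preferred in practice.

```python
# A minimal "test-then-pool" sketch for a binary endpoint. Data and the
# similarity threshold are hypothetical illustrations.
import numpy as np
from scipy.stats import chi2_contingency

def test_then_pool(conc_ctrl, nonconc_ctrl, alpha_sim=0.10):
    """Pool nonconcurrent control data only if it is not demonstrably
    different from the concurrent control; inputs are (responders, n)."""
    x1, n1 = conc_ctrl
    x2, n2 = nonconc_ctrl
    table = np.array([[x1, n1 - x1], [x2, n2 - x2]])
    _, p, _, _ = chi2_contingency(table, correction=False)
    if p > alpha_sim:       # no evidence of drift: pool both control sources
        return x1 + x2, n1 + n2
    return x1, n1           # evidence of drift: concurrent control only

pooled_x, pooled_n = test_then_pool(conc_ctrl=(18, 100), nonconc_ctrl=(31, 130))
print(f"control responders/size used in the analysis: {pooled_x}/{pooled_n}")
```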

Some additional important notes should be made here. First, it may be difficult to
justify upfront the analytic strategy for handling control data, and several approaches
may have to be designated to ensure robust analysis. This argument applies to both
phase II and phase II/III platform trials. In fact, it is increasingly recognized (even in
phase III trials) that a single, albeit carefully prespecified, primary analysis may be
insufficient and it is prudent to have several sensitivity analyses. For instance, the
estimand framework (ICH E9(R1), 2020; Jin and Liu 2020) suggests to designate a
main estimator to serve as primary analysis and several sensitivity estimators
targeting the same estimand but under different assumptions for missing data
and/or censoring. Second, the investigational treatments may have different mechanisms
of action and/or different routes of administration (e.g., oral vs. injections), in
which case the concurrent placebo group may be different across experimental arms
and this should be accounted for in the analysis. Third, in some platform trials, if a
current active treatment shows evidence of superiority over SOC, then this treatment
may become the SOC and the control group would have to be changed for subse-
quent cohorts. This was the case, for instance, in PREVAIL II trial in Ebola (Dodd
et al. 2016; PREVAIL II Writing Group 2016) and in the currently ongoing I-SPY
COVID-19 trial (NCT04488081).

Randomization

Randomization in clinical trials mitigates potential for experimental bias, promotes
comparability of treatment groups, and validates the use of statistical methods in the
analysis.
The choice of both the allocation ratio and the randomization procedure to
implement the chosen allocation is an essential ingredient of the clinical trial design.
For multiarm trials where the number of treatment arms is fixed and predetermined,
an optimal allocation can be obtained according to the formulated study objectives
(Sverdlov and Rosenberger 2013; Sverdlov et al. 2020). In platform trials, deter-
mining optimal allocation may be more challenging due to the open-ended nature of
these trials and uncertainty in the total number of treatment arms to be tested in the
study. The chosen allocation can be implemented in practice using some sequential
randomization procedure with established statistical properties (Rosenberger and
Lachin 2015; Hu and Rosenberger 2006). Broadly speaking, randomization pro-
cedures can be classified into two major types: fixed allocation (e.g., equal) random-
ization and adaptive randomization. The latter class of procedures includes
covariate-adaptive, response-adaptive, and covariate-adjusted response-adaptive
randomization procedures (Rosenberger et al. 2012).
Below we discuss two approaches – equal and fixed unequal allocation random-
ization (section “Equal and Fixed Unequal Randomization”) and response-adaptive
randomization (section “Response-Adaptive Randomization”) – in the context of
platform trials.

Equal and Fixed Unequal Randomization


Consider a platform trial that initially starts with K ≥ 1 experimental treatment arms
and a control arm. A popular design choice is equal randomization (1:1:…:1) for
which m patients are randomized to arm k = 0, 1, …, K, where k = 0 is the control.
While equal allocation is optimal for some experimental objectives, such as
estimation of treatment contrasts (Sverdlov and Rosenberger 2013), unequal allocation
may sometimes be preferred. For instance, if K ≥ 1 experimental treatments are
compared against the common control, the optimal allocation ratio for the pairwise
comparison of experimental vs. control (assuming constant variance across all
groups and assuming the same value of the mean treatment difference over control)
is 1:…:1:√K (Dunnett 1955). An increased allocation to the control group may be
also attractive from the cost efficiency perspective, for example, if the control
treatment is significantly cheaper than experimental ones (Sverdlov and Ryeznik
2019). On the other hand, there may be nonstatistical rationales for increasing the
allocation proportion to the experimental arms; for example, in situations when
historical control data can be utilized to supplement the concurrent control data,
investigators may be interested in gaining more information on experimental
treatment arms.
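
The 1:…:1:√K ratio quoted above can be motivated by a standard optimization argument, sketched here under the stated constant-variance assumption (n patients per experimental arm, n₀ on control); this derivation is our illustration rather than part of the original text.

```latex
% Sketch: why the control arm gets sqrt(K) times the experimental allocation.
% Assumes a common variance sigma^2, n per experimental arm, n_0 on control.
\[
\operatorname{Var}(\bar{x}_j - \bar{x}_0)
  = \sigma^2\left(\frac{1}{n} + \frac{1}{n_0}\right), \qquad j = 1,\dots,K.
\]
% Minimize the total size N = Kn + n_0 subject to a fixed common contrast
% variance (Lagrange multiplier lambda):
\[
\frac{\partial}{\partial n}:\; K = \frac{\lambda}{n^2}, \qquad
\frac{\partial}{\partial n_0}:\; 1 = \frac{\lambda}{n_0^2}
\quad\Longrightarrow\quad n_0 = \sqrt{K}\, n,
\]
% i.e., the allocation ratio 1 : \dots : 1 : \sqrt{K}.
```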
In platform trials, the total number of experimental treatments may be unknown at
the trial start, and the allocation ratio may have to be determined adaptively. To
illustrate this, consider our earlier example where the study starts with one experi-
mental treatment and control, and more experimental arms are added over time
(Fig. 2a). We assume that m = 100 subjects per arm provide sufficient data to test
treatment difference vs. control. In Fig. 2a, the design uses equal allocation to
available arms throughout the study. If at some point new experimental arms are
added, the randomization scheme accommodates these arms accordingly: for exam-
ple, the design starts with 1:1 randomization (50 patients to each of TRT1 and
Control), then changes to 1:1:1:1 randomization (50 patients to each of TRT1,
TRT2, TRT3 and Control), etc. While this ensures 100 patients per experimental
arm, one may argue that we have over-allocation to the control arm. More specifi-
cally, in the described example in Fig. 2a, the size of the control group at the time of
the final analysis is 100 for TRT1 comparison, 150 for each of TRT2 and TRT3
comparisons, 250 for TRT4 comparison, and 300 for TRT5 comparison. This is
natural if experimental arms are added over time and we want to maintain equal
randomization throughout the study.
If we can assume exchangeability of observations in the control group, then
allocation to control may be gradually decreased over time. Generally speaking,
this idea may be applied in both phase II and phase II/III platform settings; however,
it requires careful considerations of the trial context, sample size, and other design
parameters. For instance, one could apply an allocation strategy as displayed in
Fig. 2b. After randomizing the initial 100 patients between TRT1 and Control, TRT2
and TRT3 are added at ISA1. The randomization ratio is changed to 1:1:2:2, so that
an additional 50 patients are assigned to each of TRT1 and Control and 100 patients to
each of TRT2 and TRT3. By the time TRT4 is added at ISA2, the Control arm
already has 100 patients. We modify the allocation ratio to 1:4, to assign an additional
25 patients to Control and 100 patients to TRT4. Once this change in randomization
has been applied and 100 additional assignments have been made (20 to Control and
80 to TRT4), TRT5 is added at ISA3. We decide to further amend the allocation ratio
to 1:2:10, to assign an extra 10 patients to Control, 20 patients to TRT4, and 100 patients
to TRT5. Overall, in this example the total sample size for the Control arm is
130 (compared to 300 in Fig. 2a). In the analyses of TRT1, TRT2, and
TRT3 vs. Control, the comparisons are based on 100 patients per arm, and in the
analyses of TRT4 and TRT5 vs. Control, the sample size is 100 per experimental arm
and it is 130 for the control arm. Such a dynamic modification of the allocation ratio
may impact type I error and type II error rates. With this approach, one should
exercise caution if the study has a confirmatory component. Of note, the issues of
multiplicity adjustments and strategies for proper utilization of the shared control in
platform trials are still emerging (Sridhara et al. 2021; Wason and Robertson 2021;
Berry 2020; Parker and Weir 2020; Bretz and Koenig 2020; Stallard et al. 2019;
Howard et al. 2018; Kopp-Schneider et al. 2020; Dodd et al. 2021). Some of them
will be briefly discussed in section “Sample Size and Power.”
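
As a quick arithmetic check of the staged allocation just described (Fig. 2b), the short Python snippet below tallies the per-arm totals; the stage sizes and ratios are exactly those quoted in the text, and the code itself is purely illustrative.

```python
# Tally of the staged allocation described above, verifying the per-arm totals.
from collections import Counter

# (allocation ratio as {arm: weight}, number of patients randomized at that stage)
stages = [
    ({"Control": 1, "TRT1": 1}, 100),
    ({"Control": 1, "TRT1": 1, "TRT2": 2, "TRT3": 2}, 300),
    ({"Control": 1, "TRT4": 4}, 100),   # the 1:4 stage was cut short at 100 patients
    ({"Control": 1, "TRT4": 2, "TRT5": 10}, 130),
]

totals = Counter()
for ratio, n in stages:
    w = sum(ratio.values())
    for arm, weight in ratio.items():
        totals[arm] += n * weight // w
print(dict(totals))  # Control ends with 130 patients; each TRT arm with 100
```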

Note that this example provides only one possibility of modifying the control
allocation ratio over time. A major assumption was that the randomization ratios
were prefixed (e.g., 1:1 up to patient 100, 1:1:2:2 for additional 300 patients, etc.)
such that there is no selection bias issue. If, however, these decisions are made
“pragmatically,” whenever a new treatment arm is added or dropped, then it is
important to ensure that the selected new randomization ratios are not dependent
on the observed response data; otherwise the procedure can no longer be regarded as
“fixed,” but it rather becomes response-adaptive, for which special considerations
are required; see section “Response-Adaptive Randomization.”
To implement the chosen equal or unequal allocation ratio, the simplest and most
common approach is the permuted block randomization which sequentially random-
izes cohorts of study participants in the desired ratio until the target sample size is
reached. Other randomization procedures with enhanced statistical properties can be
considered (Kuznetsova and Tymofyeyev 2011, 2014; Ryeznik and Sverdlov 2018).
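
A minimal Python sketch of permuted block randomization for an arbitrary fixed ratio is given below; the function name and the example ratio are illustrative, and a production system would add features such as stratification and allocation concealment.

```python
# A minimal permuted-block randomization sketch for an arbitrary allocation
# ratio. Arm labels and the example ratio are illustrative.
import random

def permuted_block_sequence(ratio: dict, n_subjects: int, seed: int = 42):
    """Generate assignments in blocks whose composition matches `ratio`
    exactly; within each block the order is randomly permuted."""
    rng = random.Random(seed)
    block = [arm for arm, w in ratio.items() for _ in range(w)]
    seq = []
    while len(seq) < n_subjects:
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n_subjects]

# Example: 1:1:2:2 allocation to Control, TRT1, TRT2, TRT3 (block size 6)
assignments = permuted_block_sequence(
    {"Control": 1, "TRT1": 1, "TRT2": 2, "TRT3": 2}, n_subjects=30)
print(assignments)
```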

Response-Adaptive Randomization
Response-adaptive randomization (RAR) can be applied in platform trials to
increase the chance of trial participants to receive an empirically better treatment
while maintaining important statistical properties of the trial design. Here, “empir-
ically better treatment” refers to a treatment that has been more successful in a
nonstochastic sense (e.g., simply has a greater observed proportion of responders) in
view of the data accrued in the trial thus far. RAR has a long history in the
biostatistics literature and it has been used occasionally in clinical trials (Hu and
Rosenberger 2006). In platform trials, RAR can potentially increase trial efficiency
in the sense that efficacious treatment arms can be identified quicker and quite
reliably (Saville and Berry 2016). There are both advantages and disadvantages of
RAR, and its implementation always requires careful considerations (Robertson
et al. 2020). For instance, one motivation for using RAR is to maximize the expected
number of successes in the trial, which may be particularly important in trials of rare
and life-threatening diseases with limited patient horizon (Palmer and Rosenberger
1999). Another possibility for application of RAR is trials of highly contagious
diseases such as Ebola where the hope is that the disease may be eradicated by the
investigational treatment or vaccine (Berger 2015). In all, various stakeholders’
perspectives should be taken into account when assessing the possibility of incor-
porating RAR in the design. A general consensus is that RAR may be useful in phase
II exploratory settings but less so in phase III confirmatory settings. It is also
instructive to quote the following recent perspective on RAR from the FDA (Food
and Drug Administration 2019):

...Response-adaptive randomization alone does not generally increase the Type I error
probability of a trial when used with appropriate statistical analysis techniques. It is
important to ensure that the analysis methods appropriately take the design of the trial into
account. Finally, as with many other adaptive techniques based on outcome data, response-
adaptive randomization works best in trials with relatively short-term ascertainment of
outcomes...

It should be noted that RAR designs rely on certain assumptions on responses (e.g.,
statistical model linking responses with effects of treatments and biomarkers, fast
availability of individual outcome data to facilitate model updates, and modifications
of randomization probabilities) and require calibration through comprehensive simu-
lations before they are implemented in practice. It is also important to acknowledge
that RAR designs may potentially have deteriorating performance if outcome data are
affected by time trends (Thall et al. 2015), and special statistical techniques are
required to obtain robust results in the analysis (Villar et al. 2018). However, different
RAR procedures vary in the statistical properties, and some issues pertinent to
particular RAR procedures, for example, high variability and potential loss in statis-
tical power of the randomized play-the-winner rule (Wei and Durham 1978), should
not be overgeneralized to all RAR procedures (Villar et al. 2020).
Several recent papers provide simulation reports on RAR for multiarm trials with
and without control arm (Wathen and Thall 2017; Viele et al. 2020a, b). One sensible
RAR approach is to skew allocation to the empirically best arm (if it exists) while
maintaining some allocation to the control (Trippa et al. 2012; Wason and Trippa
2014; Yuan et al. 2016). This would provide sufficient power to formally compare
the effects of the most successful experimental treatment against the control. One
extra challenge, however, is that new experimental arms are added over time and
RAR requires some burn-in period to ascertain estimates of treatment effects to
facilitate adaptations. Some efficient RAR designs for multiarm controlled platform
trials where experimental arms can be added/dropped during the course of the study
are available; see papers by Ventz et al. (2018), Hobbs et al. (2018), Kaizer et al.
(2018), Normington et al. (2020), to name a few.
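
As a deliberately simplified illustration of such control-protected RAR (not the specific algorithms of the papers just cited), the Python sketch below skews allocation among experimental arms according to the posterior probability of each arm being best under a beta-binomial model with uniform priors, while holding the control allocation at a fixed floor. All constants and counts are hypothetical.

```python
# A simplified Bayesian RAR sketch for binary outcomes: experimental arms are
# allocated in proportion to their posterior probability of being the best arm
# (beta-binomial model, uniform Beta(1, 1) priors), while the control keeps a
# fixed allocation floor. All tuning constants are hypothetical.
import numpy as np

rng = np.random.default_rng(7)

def rar_probabilities(successes, trials, control_floor=0.25, draws=10_000):
    """successes/trials: counts over the experimental arms only."""
    post = rng.beta(1 + np.asarray(successes),
                    1 + np.asarray(trials) - np.asarray(successes),
                    size=(draws, len(successes)))
    p_best = np.bincount(post.argmax(axis=1), minlength=len(successes)) / draws
    return control_floor, (1 - control_floor) * p_best

ctrl_p, exp_p = rar_probabilities(successes=[4, 9, 6], trials=[20, 20, 20])
print(f"control: {ctrl_p:.2f}, experimental arms: {np.round(exp_p, 2)}")
```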
An increasingly useful idea in RAR platform trials is inclusion of stratification using
genetic signatures or some other predictive biomarkers. In this case, RAR probabilities
for an individual participant are adjusted such that the participant has increased
probability to be assigned to the treatment that is putatively most efficacious given
their baseline biomarker profile. The research question is: which compound/biomarker
pairs are most promising to be taken further in development to more focused confir-
matory phase trials? This approach was applied, for instance, in the I-SPY 2 trial in
breast cancer (Barker et al. 2009) and in the BATTLE trial in non-small cell lung cancer
(Zhou et al. 2008; Kim et al. 2011). Both I-SPY 2 and BATTLE trials carried
hypothesis-generating value and aimed at identifying targeted therapies, but not for-
mally testing their clinical efficacy. A more elaborate design is GBM AGILE – an
ongoing seamless phase II/III platform trial in glioblastoma, which combines data from
promising treatments identified during a phase II multiarm Bayesian RAR part with the
data for these treatments during a phase III part to formally test clinical efficacy with
respect to overall survival and enable submissions (Alexander et al. 2018).

Data Monitoring and Interim Decision Rules

Platform trials involve data monitoring and various interim decisions. A key princi-
ple of any adaptive design is that adaptations must be carefully preplanned to ensure
statistically valid results. The FDA guidance on master protocols states (Food and
Drug Administration 2018):

...Master protocols evaluating multiple investigational drugs can add, expand, or discontinue
treatment arms based on findings from prespecified interim analyses or external new data.
Before initiating the trial, the sponsor should ensure that the master protocol and its
associated SAP describe conditions that would result in adaptations such as the addition of
a new experimental arm or arms to the trial, reestimation of the sample size based on the
results of an interim analysis, or discontinuation of an experimental arm based on futility
rules.

The guidance also emphasizes the importance of having an independent data
monitoring committee (IDMC) to conduct interim analyses and make recommendations.
The IDMC ensures trial integrity and mitigation of operational bias.
In a platform trial, various interim decisions can be made on ongoing investiga-
tional treatment arms; see section “General Definitions.” Also, a platform trial
protocol may provision for addition of new investigational arms. Some papers
discuss methodology to formally justify a decision to add new arms to an ongoing
trial (Elm et al. 2012; Cohen et al. 2015; Lee et al. 2019; Choodari-Oskooei et al.
2020).
Another important aspect is the timing and frequency of interim analyses (IAs).
Safety monitoring is usually performed continuously throughout the study, but
interim analyses of efficacy are less frequent.
Decision rules should be based on some statistical criteria and the corresponding
boundary. The statistical criterion may be:

• Observed treatment effect (e.g., point estimate, confidence interval, or test
statistic)
• Bayesian posterior probability of the treatment effects
• Conditional power (probability of rejecting the null hypothesis in the final
analysis given current data)
• Bayesian predictive probability of success (average conditional power or
“assurance”)

Often, it is possible to establish a one-to-one correspondence among the decision
rules; for example, a statement on Bayesian posterior probability or conditional
power can be transformed into a statement on the observed treatment effect (Gallo
et al. 2014). In practice, decision criteria will be considered together with other
design elements such as sample size, randomization, study endpoints, etc.
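
To illustrate one of the criteria above, the Python sketch below computes conditional power for a normal endpoint under the common “current trend” assumption, using the standard B-value formulation; the interim values and the one-sided significance level are hypothetical.

```python
# A sketch of conditional power under the "current trend" assumption for a
# normal endpoint, illustrating the correspondence between an interim test
# statistic and a go/no-go threshold. All numbers are hypothetical.
from scipy.stats import norm

def conditional_power(z_interim: float, t: float, alpha: float = 0.025) -> float:
    """Probability of crossing z_{1-alpha} at the final analysis, given the
    interim z-statistic at information fraction t, assuming the current
    estimated drift continues (B-value formulation)."""
    b_t = z_interim * t ** 0.5      # B-value at information fraction t
    drift = z_interim / t ** 0.5    # estimated drift under the current trend
    z_final = norm.ppf(1 - alpha)
    return norm.cdf((b_t + drift * (1 - t) - z_final) / (1 - t) ** 0.5)

for z in (0.5, 1.0, 1.5, 2.0):
    print(f"z_interim = {z:.1f} at t = 0.5 -> CP = {conditional_power(z, 0.5):.3f}")
```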
It is instructive to look at the development of decision rules by example. Consider
the OPTIM-ARTS design for a phase II open platform trial in melanoma (Racine-Poon
et al. 2020). The design consists of two parts: (1) a randomized, open platform phase to
screen multiple targeted therapy combinations for activity and (2) an expansion phase to
formally test promising treatments from the first phase. The primary efficacy end-
point is ORR at 20 weeks posttreatment initiation, and there is no control arm due to
lack of adequate SOC in this indication. The study starts with three arms and
randomizes patients in a 1:1:1 ratio. The first IA is planned after about 10 patients per
arm contribute data for evaluation of ORR. Subsequent IAs are planned approxi-
mately every five months thereafter. The maximum number of patients per arm in
Part 1 is capped at 30. To facilitate decision making in part 1, the ORR for each
treatment arm is modeled using a standard Bayesian beta-binomial model with
uniform prior. At a given IA, an arm can be: (i) expanded into Part 2, if Pr
(ORR > 0.20| data) > 0.70; (ii) stopped for futility, if Pr(ORR < 0.15| data) > 0.70;
or (iii) continued in Part 1, if neither (i) nor (ii) is met. If an arm has reached its cap of
30 patients and neither (i) nor (ii) is met, the arm is stopped and not pursued further.
If a decision to expand an arm is made, the sample size for Part 2 is determined
adaptively, using Bayesian shrinkage estimation to mitigate treatment selection bias
and to ensure >70% Bayesian predictive power to obtain significant final results.
The final analysis for each treatment arm in Part 2 is done using standard frequentist
methodology (exact binomial test), based on cumulative data from Part 1 and 2 for
this arm.
All decision rules/criteria in OPTIM-ARTS design are calibrated through Monte
Carlo simulation under various true values of ORR, to achieve desirable statistical
characteristics, such as reasonably high correct decision probabilities in part 1, and
high power and control of the type I error rate in Part 2. A combination of Bayesian
monitoring in Part 1 with formal hypothesis testing for selected treatment arms in
Part 2 allows flexible and statistically rigorous design.
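
The Part 1 rules above are simple to compute: with the uniform Beta(1, 1) prior, the posterior after x responders among n patients is Beta(1 + x, 1 + n − x). A minimal Python sketch with hypothetical interim counts follows.

```python
# A sketch of the interim decision rules described above: a beta-binomial
# model with a uniform Beta(1, 1) prior, so the posterior after x responders
# in n patients is Beta(1 + x, 1 + n - x). Interim counts are hypothetical.
from scipy.stats import beta

def interim_decision(x: int, n: int) -> str:
    pr_active = beta.sf(0.20, 1 + x, 1 + n - x)   # Pr(ORR > 0.20 | data)
    pr_futile = beta.cdf(0.15, 1 + x, 1 + n - x)  # Pr(ORR < 0.15 | data)
    if pr_active > 0.70:
        return f"expand into Part 2 (Pr(ORR>0.20)={pr_active:.2f})"
    if pr_futile > 0.70:
        return f"stop for futility (Pr(ORR<0.15)={pr_futile:.2f})"
    return "continue in Part 1"

for x, n in [(4, 10), (0, 10), (2, 10)]:
    print(f"{x}/{n} responders -> {interim_decision(x, n)}")
```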

Sample Size and Power

Sample size determination is an integral part of any clinical trial design, and the
platform trial is no exception. Some important considerations for the sample size
planning include the study objectives, the choice of a research hypothesis, primary
endpoint, study population, control and experimental treatment groups, statistical
methodology for data analysis, etc. The common statistical criteria for sample size
planning are statistical power and significance level (probability of a type I error);
however, additional criteria such as estimation precision and probabilities of correct
go/no-go decisions may be considered as well.
At the design stage, the sample size planning will likely be an iterative process
that may involve a combination of standard calculations and simulations. Suppose
we have $K \geq 1$ experimental treatment arms and a control arm, and we decide to use equal randomization with $m$ patients per arm. There are different ways to characterize power in a multi-arm setting (Marschner 2007). One way is to consider null hypotheses on individual treatment contrasts (experimental vs. control) as follows: $H_0^{(j)}: \Delta_j = 0$ vs. $H_1^{(j)}: \Delta_j > 0$, where $\Delta_j = \mu_j - \mu_0$ and $j = 1, \ldots, K$. Assuming individual responses on the $k$th treatment are normally distributed with mean $\mu_k$ and variance $\sigma^2$, the sample size $m = 2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2 / \Delta^2$ per arm (where $z_u$ is the 100$u$-th percentile of the standard normal distribution and $\Delta > 0$ is some clinically relevant value of the mean treatment difference) provides power of $(1 - \beta)$ for each of the $K$ comparisons. This assumes that each hypothesis is tested at significance level $\alpha$ and no multiplicity adjustment is made.
Another way is to consider simultaneous testing of $H_0: \cap_{j=1}^{K} H_0^{(j)}$ vs. $H_1$: not $H_0$. In this case, an investigator may wish to control the family-wise error rate (FWER), that is, the probability of rejecting any true null hypothesis, at some prespecified level $\alpha$. A conservative way of doing this would be to use a comparison-wise level of $\alpha/K$ (the Bonferroni approach). If the same standardized treatment effect $\Delta$ applies in all $K$ comparisons and the correlation between all test statistics is 1/2, then the requisite sample size per arm is $m = 2\sigma^2 (z_{1-\alpha/K} - u_{K,1/2,\beta})^2 / \Delta^2$, where $u_{K,1/2,u}$ is the 100$u$-th percentile of the $K$-variate normal distribution with zero mean and covariance matrix that has diagonal elements equal to 1 and all off-diagonal elements equal to 1/2. Alternatively, Dunnett’s (1955) procedure accounts for the positive correlation among contrasts with the common control. The sample size per arm to achieve power of $(1 - \beta)$ per comparison while maintaining the FWER at level $\alpha$ using Dunnett’s test is then $m = 2\sigma^2 (u_{K,1/2,1-\alpha} - u_{K,1/2,\beta})^2 / \Delta^2$. More on sample size calculations for multiple tests can be found in Horn and Vollandt (2000). A recent paper (Choodari-Oskooei et al. 2020) extended Dunnett’s procedure to the case of adding new research arms in platform trials, and the resulting approach was shown to be less conservative than the Bonferroni approach.
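To make these formulas concrete, the following sketch (ours, not from the cited references) estimates the equicoordinate percentile $u_{K,1/2,u}$ by Monte Carlo and evaluates the three per-arm sample size formulas above; the scenario values (K = 3, Δ = 0.4, σ = 1, one-sided α = 0.025, β = 0.2) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2023)

def equicorr_quantile(K, rho, u, n_sim=400_000):
    # Monte Carlo estimate of the 100u-th equicoordinate percentile u_{K,rho,u}:
    # the value c such that P(max_j Z_j <= c) = u for an equicorrelated
    # K-variate standard normal, built from a shared latent factor.
    W = rng.standard_normal((n_sim, 1))
    E = rng.standard_normal((n_sim, K))
    Z = np.sqrt(rho) * W + np.sqrt(1.0 - rho) * E
    return np.quantile(Z.max(axis=1), u)

def m_per_arm(K, delta, sigma, alpha=0.025, beta=0.20, method="unadjusted"):
    # Per-arm sample size for K experimental-vs-control contrasts.
    if method == "unadjusted":     # each test at one-sided level alpha
        return 2 * sigma**2 * (norm.ppf(1 - alpha) + norm.ppf(1 - beta))**2 / delta**2
    u_beta = equicorr_quantile(K, 0.5, beta)
    if method == "bonferroni":     # comparison-wise level alpha/K
        return 2 * sigma**2 * (norm.ppf(1 - alpha / K) - u_beta)**2 / delta**2
    if method == "dunnett":        # exact equicoordinate critical value
        u_crit = equicorr_quantile(K, 0.5, 1 - alpha)
        return 2 * sigma**2 * (u_crit - u_beta)**2 / delta**2

for method in ("unadjusted", "bonferroni", "dunnett"):
    print(method, round(m_per_arm(K=3, delta=0.4, sigma=1.0, method=method), 1))
```

The Dunnett critical value is smaller than the Bonferroni one, so the Dunnett sample size is never larger; this is the sense in which the exact procedure is less conservative.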
In the literature, there have been debates on whether adjustment for multiplicity is
required in multiarm trials with a shared control group (Proschan and Follmann
1995; Freidlin et al. 2008; Wason et al. 2014). For instance, Freidlin et al. (2008)
argued that multiarm trials that are designed with a common control group for
logistical efficiency do not require multiplicity adjustments. However, in multiarm
trials where the research questions of different comparisons are clinically related (e.g., the experimental arms represent different dose levels of a compound, or the trial evaluates the addition of an experimental agent to several backbone regimens against the control arm), adjustment for multiple comparisons would be appropriate.
Either way, such issues should be a part of the discussion with regulatory agencies
(Collignon et al. 2020).
The sample size planning for a platform trial should also take into consideration
the adaptive nature of the experiment, that is, that some arms can be stopped early for
futility and/or efficacy. How many effective treatments should be identified within
the platform trial before it can stop is another important part of the master protocol
planning. Interim analyses may inflate probabilities of type I/type II error, and
sample size planning should account for that. Many frequentist designs that make provision for interim decisions, such as group sequential designs (Jennison and Turnbull
2000), are well-established for confirmatory trials, and their validity depends on
adherence to the prespecified futility and/or efficacy rules. By contrast, Bayesian
designs do not formally incorporate considerations of the type I error rate control;
however, these designs can be fine-tuned via simulations to ensure they have
desirable statistical properties across a range of plausible experimental scenarios
and trial parameters. There are examples in the literature of how this can be done in practice (Quan et al. 2019; Ventz et al. 2017).

The theory of adaptive designs (see, e.g., Wassmer and Brannath 2016) allows for
many modifications (such as dropping or adding treatment arms, restricting recruit-
ment to subpopulations, changing sample size or randomization ratios, in theory
even changing endpoints) while maintaining the family-wise error rate. However,
these methods were originally not developed for very frequent adaptations; hence,
power loss can be severe when applying them in platform trials with many design
adaptations.
The uncertainty in the final sample size for the chosen platform trial design should always be quantified; ideally, not only the expected sample size but also the entire distribution of the sample size, per arm and overall, should be obtained and presented via simulations. The choice of the
experimental scenarios for simulations should be comprehensive but it will never
be exhaustive. There are some good industry practices on simulation of adaptive
trials in drug development (Mayer et al. 2019) that can be useful for sample size
planning for platform trial designs.
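As a concrete illustration of this recommendation, the sketch below (reusing the illustrative beta-binomial interim rules from the OPTIM-ARTS sketch earlier; all numeric settings remain assumptions) reports the simulated distribution of the realized per-arm sample size rather than just its mean.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)

def realized_n(true_orr, looks=(10, 15, 20, 25, 30)):
    # Per-arm sample size at which the arm stops, early or at the cap.
    x = rng.random(looks[-1]) < true_orr
    for n in looks:
        post = beta(1 + x[:n].sum(), 1 + n - x[:n].sum())
        if 1 - post.cdf(0.20) > 0.70 or post.cdf(0.15) > 0.70:
            return n                      # early stop: expansion or futility
    return looks[-1]                      # ran to the cap of 30

sizes = np.array([realized_n(0.25) for _ in range(10_000)])
print("expected n per arm:", sizes.mean())
print("distribution:", {int(n): int((sizes == n).sum()) for n in np.unique(sizes)})
```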

Data Analysis Issues

The analysis of any clinical trial should be reflective of the trial design. The statistical
analysis plan for a platform trial should include details of the planned analyses, both
interim and final. Since many platform trials have adaptive elements, some important
principles for adaptive designs naturally apply for platform trials. The FDA guidance
for industry “Adaptive designs for clinical trials of drugs and biologics” (Food and
Drug Administration 2019) makes the following statement that applies to all clinical
trials intended to provide substantial evidence of effectiveness:

. . .In general, the design, conduct, and analysis of an adaptive clinical trial intended to
provide substantial evidence of effectiveness should satisfy four key principles: the chance
of erroneous conclusions should be adequately controlled, estimation of treatment effects
should be sufficiently reliable, details of the design should be completely prespecified, and
trial integrity should be appropriately maintained...

The strong control of the type I error rate is a major requirement for any clinical
trial with a confirmatory component. Various interim decisions can inflate the type I
error rate. Thus, special statistical techniques are required to ensure the overall type I
error is maintained at a prespecified level. Some design methodologies, such as
group sequential designs (Jennison and Turnbull 2000) and adaptive designs
(Wassmer and Brannath 2016), specifically address the issue of the type I error
control by properly selecting interim stopping boundaries. Adaptive designs with
treatment (or subgroup) selection at interim, known as seamless phase II/III designs
(Bretz et al. 2009; Wassmer and Brannath 2016), provide ways to properly combine
data from the exploratory and confirmatory parts of the trial in the analysis (i.e.,
inferentially seamless designs) while controlling the type I error. For other designs
and analysis techniques, simulations can be used to evaluate the probability of false positive findings. For a platform trial with a confirmatory component, the control of the type I error rate is more complex due to uncertainty about the number of experimental treatment arms that will be tested in the study and the possibility of multiple registrations that may follow. Industry best practices on type I error considerations in
master protocols with shared control are still emerging (Sridhara et al. 2021).
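The inflation that arises from testing several arms against a shared control can be checked directly by simulation. The following minimal sketch (our illustration, assuming normally distributed outcomes, a one-sided test, and arbitrary settings of K = 4 arms and m = 100 patients per arm) estimates the family-wise error rate under the global null, with and without a Bonferroni adjustment.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def fwer(K=4, m=100, alpha=0.025, adjust=False, n_sim=20_000):
    # Family-wise error rate of K one-sided z-tests against a shared control,
    # estimated under the global null (all true effects zero).
    crit = norm.ppf(1 - (alpha / K if adjust else alpha))
    hits = 0
    for _ in range(n_sim):
        ctrl_mean = rng.standard_normal(m).mean()
        z = [(rng.standard_normal(m).mean() - ctrl_mean) / np.sqrt(2 / m)
             for _ in range(K)]           # shared control induces correlation
        hits += any(zj > crit for zj in z)
    return hits / n_sim

print("unadjusted:", fwer(adjust=False))   # well above the nominal 0.025
print("Bonferroni:", fwer(adjust=True))    # at or slightly below 0.025
```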
Another important aspect is estimation of treatment effects. Design adaptations
such as selection of an arm that exhibits the best interim results introduce positive
bias in the final estimation of the treatment effect. To address this bias, bias-corrected estimates accounting for design adaptations (Bowden and Glimm 2008; Stallard and Kimani 2018) can be reported.
bias generated by specific types of selections such as “pick-the-winner” or “drop-
the-loser” in multiarm situations. The insistence on unbiasedness inflates the mean
squared error (MSE) which, for several of these methods, is larger than that of the
corresponding maximum likelihood estimator (MLE). However, shrinkage estima-
tion techniques (which reduce, but not entirely eliminate bias) have been proven to
have lower MSE than the MLE (Carreras and Brannath 2013; Bowden et al. 2014).
The magnitude of the bias is situation dependent. It generally depends on the
“severity” of the selection (e.g., the number of treatment arms from which a winner
is picked), the size of the study, and the similarity of the underlying true but
unknown treatment effects. In a well-planned, large study with limited selection
options, it will often be small. However, in studies with a wide range of potential
selection decisions, it can be substantial.
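A few lines of simulation make the magnitude of "pick-the-winner" bias tangible. In this sketch (ours; all numbers illustrative), K = 5 experimental arms share the same true effect, each arm-level estimate is drawn from its sampling distribution, and the winner's estimate is summarized.

```python
import numpy as np

rng = np.random.default_rng(5)

# Even when all K arms share the same true effect, the estimate reported for
# the selected (best-looking) arm is biased upward. Settings are illustrative.
K, m, true_effect = 5, 50, 0.2
se = 1 / np.sqrt(m)                                    # SE of each arm-level mean
winners = rng.normal(true_effect, se, size=(50_000, K)).max(axis=1)
print(f"mean selected estimate: {winners.mean():.3f} "
      f"(true effect {true_effect}; bias {winners.mean() - true_effect:+.3f})")
```

Rerunning with larger m or fewer candidate arms shrinks the bias, consistent with the dependence on study size and "severity" of selection noted above.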
In addition to point estimates, confidence intervals are of interest. Construction of
confidence intervals accounting for multiple interim looks at the data and design
adaptations has been discussed in the literature (Neal et al. 2011; Kimani et al. 2014;
Kimani et al. 2020); however, no fully satisfactory construction method exists and
applications are very diverse. Overall, it may be prudent to report both stage-wise
unadjusted estimates and confidence intervals and adjusted quantities based on
combined data. The assessment of data homogeneity from different stages (both
baseline characteristics and the outcome data) is very important for interpretation of
the study results (Gallo and Chuang-Stein 2009; Friede and Henderson 2009). This
is explicitly documented in the EMA “Reflection paper on methodological issues in
confirmatory clinical trials planned with an adaptive design” (European Medicines
Agency 2007):

...Using an adaptive design implies that the statistical methods control the pre-specified type
I error, that correct estimates and confidence intervals for the treatment effect are available,
and that methods for the assessment of homogeneity of results from different stages are
pre-planned. A thorough discussion will be required to ensure that results from different
stages can be justifiably combined...

Some special considerations are required for reporting of the results of a platform
trial. For instance, how should the results of completed treatment arms be reported
while the main master protocol is still ongoing? In this regard, a good example is the
STAMPEDE (Systemic Therapy for Advancing or Metastatic Prostate Cancer) study
(James et al. 2009), which has been ongoing since 2005 while providing periodic
updates on the investigational treatments (comparisons) that have been completed in
due course.

Examples of Platform Trials

EPAD-PoC Study in Alzheimer’s Disease

The European Prevention of Alzheimer’s Disease (EPAD) Consortium was a public-private effort funded by the EU through the Innovative Medicines Initiative (IMI)
and the European Federation of Pharmaceutical Industries and Associations (EFPIA)
partners, which included pharmaceutical, biotechnology and related companies. It
started in January 2015 with the mission to develop improved models of Alzheimer’s disease and to create a research environment for optimized testing of novel treatments for the secondary prevention of Alzheimer’s dementia, and it ended in
October 2020 (https://www.imi.europa.eu/projects-results/project-factsheets/epad).
The EPAD project had several foundational elements, including the virtual registry
of potential research participants, the longitudinal cohort study (EPAD-LCS) to
provide information on disease progression in the presymptomatic phase, and the
PoC study (EPAD-PoC) to test new interventions in the earliest stages of
Alzheimer’s disease.
The EPAD PoC study was designed as a phase II, open platform, randomized,
placebo-controlled Bayesian adaptive trial (Ritchie et al. 2016). The master protocol
specified a common framework for all interventions and covered the inclusion
criteria, patient stratification, assessment schedule, study logistics, and other design
features. The ISAs would provide additional compound-specific details. The study
provided for a Clinical Candidate Selection Committee (CCSC) that would
determine which experimental therapeutics to include for testing in the study. In
general, a study compound would have already shown clinical safety and clinical
proof-of-mechanism (target engagement). The clinical PoC criteria were based on
the Repeatable Battery for the Assessment of Neuropsychology Status (RBANS),
planned to be analyzed through a repeated measurement model with Bayesian
decision criteria for futility and efficacy. The potential efficiency gains in the
EPAD-PoC study were envisioned to be due to an operationally streamlined design
and the use of the shared control group.
While EPAD-PoC was conceptually designed using the I-SPY2 study (Barker
et al. 2009) as a prototype, it also had some unique features. The EPAD-LCS
provided important observational data on subjects with a high risk to develop
Alzheimer’s disease, which formed the basis for development of longitudinal disease
models using subjects’ genetic information, biomarkers, and other risk factors. This,
in turn, would help identify and stratify participants for the EPAD-PoC study that
would subsequently advance most promising compounds for optimized testing in
large-scale phase III trials.

By the end of 2019, the EPAD-LCS had recruited and deeply phenotyped more than 2000 participants; however, the PoC study (EPAD-PoC) to test new interventions did not take place due to a lack of drug sponsors to run trials. The EPAD initiative finished in late 2020. Overall, this case study illustrates the complexity of clinical research in challenging indications such as Alzheimer’s disease and reinforces the importance of the lessons learned in this context.

I-SPY COVID-19 Study

The COVID-19 pandemic has been a major public health emergency since February 2020. Efforts are underway worldwide to develop effective vaccines and treatments against COVID-19 infection. Clinical development for COVID-19 treatments poses
several challenges:

• With the pandemic causing hundreds of thousands of deaths worldwide, there is a tremendous need for the speedy development of treatments and vaccines, as well as for their rapid production and dissemination.
• The pool of candidate treatments is huge, including drugs that have been already
approved for other indications, novel agents, and possibly their combinations.
• The SOC is likely to change over time, for example, due to identification of
efficacious therapies.

Several platform trials for rapid testing of various re-purposed and novel treat-
ments for COVID-19 were initiated in 2020 and are now ongoing. Here we discuss
just one of them, the I-SPY COVID-19 trial (ClinicalTrials.gov Identifier: NCT04488081). This is an
open-label, randomized, multiarm, active-controlled, Bayesian adaptive phase II
platform trial to rapidly screen promising agents for treatment of critically ill
COVID-19 patients. Eligible patients are stratified based on their status at entry
(ventilation vs. high-flow oxygen) before randomization. The primary endpoint is
time to recover to a durable (at least 48 h) level of 4 or less on the
WHO-recommended COVID-19 ordinal scale (WHO 2020) (time frame: up to
28 days).
The trial design (NCT04488081) describes four experimental arms (combinations
of novel agents with remdesivir) and an active comparator (remdesivir plus SOC),
and there is a provision to add more experimental agents to the study, depending on
the recruitment and the time course of COVID-19 in the USA. The sample size per
experimental arm is capped at 125 patients. The arms can be dropped early for
futility after enrollment of 50 patients. The arms exhibiting strong efficacy signals
can qualify for further development, in which case the enrollment to these arms will
cease, and new investigational arms can be added.
The I-SPY COVID-19 trial is a massive collaborative effort that involves several
university medical centers in the USA, pharma/biotech industry, and the FDA, with
the estimated enrollment of up to 1500 participants and estimated primary comple-
tion date of July 2022.

GBM AGILE Study in Glioblastoma

Alexander et al. (2018) described the design of the Glioblastoma (GBM) Adaptive
Global Innovative Learning Environment (AGILE) – an international, multiarm,
randomized, open platform, inferentially seamless study to identify effective thera-
pies for newly diagnosed and recurrent GBM within different biomarker-defined
patient subtypes (ClinicalTrials.gov Identifier: NCT03970447).
The trial employs a master protocol that allows multiple novel experimental
therapies and their combinations to be evaluated within the same trial infrastructure.
The design consists of two parts:

1. Phase II screening stage, designed using Bayesian adaptive randomization, to identify effective therapies within biomarker subtypes based on overall survival,
compared with a common control
2. Phase III confirmatory stage, which expands sufficiently promising treatment
arms from the first part and formally tests their clinical efficacy against the control
with respect to overall survival, in an inferentially seamless manner to enable
registration

The GBM AGILE design has several innovative features that are worth elaborating upon:

• Study participants are stratified into three subtypes of GBM: newly diagnosed
methylated (NDM), newly diagnosed unmethylated (NDU), or recurrent disease
(RD). Each experimental arm can have one enrichment biomarker, thought to be predictive of the outcome for the given arm. A combination of stratification and
enrichment biomarkers creates up to six different subtypes for patient randomi-
zation. Within each subtype, different SOC control arms and different experi-
mental drug combinations are considered.
• Bayesian adaptive randomization is applied such that within each stratum, 20% of participants are randomized to the control, and allocation to experimental arms is skewed such that greater proportions are assigned to arms with evidence of prolonged overall survival compared to the control group given the patient’s subtype (a minimal sketch of such an allocation rule follows this list).
• A longitudinal model linking the effects of treatments, covariates, biomarkers,
and overall survival is developed to predict the individual survival time. The
model can potentially be used to “speed up” the Bayesian response-adaptive randomization algorithm, which otherwise relies on survival times that are observed with natural delay.
• During the phase II screening part, treatment efficacy is assessed within predefined
biomarker “signatures” (that may be different from the stratification subtypes),
such that an experimental arm exhibiting very promising results for a particular
signature will be expanded into phase III confirmatory part within this signature.
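Below is a minimal sketch of the kind of allocation rule just described; it is our illustration, not the actual GBM AGILE algorithm. It fixes 20% of the allocation to control and skews the remainder toward experimental arms in proportion to their posterior probability of being best, using a binary "alive at landmark" stand-in for overall survival with Beta(1, 1) priors (the real design relies on a longitudinal survival model).

```python
import numpy as np

rng = np.random.default_rng(7)

def allocation_probs(successes, n, control_share=0.20, n_draws=5_000):
    # Arm 0 is the control; experimental shares are proportional to the
    # posterior probability that each experimental arm is the best one.
    s, n = np.asarray(successes), np.asarray(n)
    draws = rng.beta(1 + s, 1 + n - s, size=(n_draws, len(n)))
    p_best = np.bincount(draws[:, 1:].argmax(axis=1),
                         minlength=len(n) - 1) / n_draws
    return np.concatenate([[control_share], (1 - control_share) * p_best])

# Hypothetical interim data: control plus three experimental arms
print(allocation_probs(successes=[12, 15, 20, 11], n=[40, 40, 40, 40]))
```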

In summary, the GBM AGILE study provides an open platform for clinical investigation with both exploratory and confirmatory components. It can potentially enable faster, more efficient, and more ethically appealing development of therapies for glioblastoma.

FOCUS4 Study in Metastatic Colorectal Cancer

The FOCUS4 study (ISRCTN90061546) was a phase II/III randomized, stratified, platform trial of several targeted therapies for patients with advanced or metastatic
colorectal cancer. The study was sponsored by the UK Medical Research Council
(MRC) and designed and conducted by the MRC Clinical Trials Unit. The FOCUS4
study used a master protocol with independent comparison-specific protocols
corresponding to five different biomarker-based cohorts, with elements of multiarm
multistage (MAMS) design methodology (Kaplan et al. 2013; Kaplan 2015).
Eligible patients would receive an initial 16-week period of standard first-line
chemotherapy, and their tumor tissues would undergo several molecular assays to
determine the appropriate biomarker stratum for the patient. Within each stratum,
patients would be randomized in a 2:1 ratio to a targeted experimental treatment or
placebo. The five strata defined five subtrials (comparisons): FOCUS4-A for patients
with BRAF-mutant tumors; FOCUS4-B for patients with PIK3CA mutations or
PTEN loss; FOCUS4-C for patients with KRAS or NRAS mutations; FOCUS4-D
for patients whose tumor was wild-type for BRAF, PIK3CA, KRAS, and NRAS; and
FOCUS4-N for patients who could not be classified as any of the subtypes above.
The platform nature of the FOCUS4 study provided for adding new investigational therapies to the randomization process mid-trial or terminating early any arms that showed evidence of futility. More specifically, the FOCUS4
master protocol (FOCUS4 2019) specified four analysis stages for each biomarker-
defined comparison of experimental treatment vs. placebo: safety (stage I), lack-of-
sufficient-activity (stage II), efficacy for progression-free survival (PFS) (stage III),
and efficacy for overall survival (OS) (stage IV). Interim results for each stage would
be reviewed by the IDMC to guide subsequent decisions for each comparison.
Importantly, different substudies of FOCUS4 had their own targeted effect sizes
for PFS and OS and called for different maximum sample sizes.
Overall, FOCUS4 provided many valuable scientific and operational insights and
lessons learned (Hague et al. 2019; Morrell et al. 2019; Schiavone et al. 2019). The
study recruitment started in October 2014. The first full-length published results
were from FOCUS4-D subtrial (Adams et al. 2018). Based on data from 32 random-
ized patients, 16 to AZD8931 (a HER1, 2, and 3 inhibitor) and 16 to placebo, the
IDMC recommended closure of FOCUS4-D at the first preplanned interim analysis
for futility as it was found that AZD8931 was unlikely to improve PFS compared to
placebo in this population.
We accessed the recruitment chart of FOCUS4 (www.focus4trial.org/recruitmentoverall/). The listed numbers of randomized patients, as of November 30, 2019, were as follows: n = 6 for FOCUS4-B (recruitment closed in August 2018); n = 32 for FOCUS4-D (recruitment closed in April 2016); n = 60 for FOCUS4-C; and n = 246 for FOCUS4-N. The FOCUS4 website (www.focus4trial.org) also noted that recruitment into the study was suspended in March 2020 due to the COVID-19 pandemic and that the study closed follow-up of all patients on October 31, 2020.

Summary and Conclusion

In this chapter, we provided an overview of platform trial designs, an important type of master protocol used to evaluate multiple experimental treatments in a chosen indi-
cation within a common trial infrastructure. Platform trials can be cast as open-ended
randomized multiarm trial designs with or without a shared control arm, and the
experimental arms may be added/dropped during the course of the study on the basis
of a predefined decision algorithm. Platform trials have the potential to significantly
improve efficiency of clinical drug development by screening more experimental
therapies and answering more research questions in a systematic way.
Although the concepts of master protocols and platform trials are relatively new, these designs have already found broad use in practice, as evidenced by the exponential growth of publications on both methodological work and real clinical
trials (Park et al. 2019; Meyer et al. 2020). Amid the COVID-19 pandemic, several
platform trials to evaluate re-purposed therapies, novel experimental agents, and
possibly their combinations were initiated and are currently ongoing. The platform
trial model makes it feasible to assess a huge number of treatment options for the
pandemic in a scientifically rigorous and ethical manner.
Master protocol study designs require more upfront planning and early engage-
ment with health authorities and other relevant stakeholders. These studies are
operationally complex and require careful coordination and collaboration among
various functions in the drug development enterprise. Information technology is an
important ingredient and key to a successful implementation of these studies.
Statistical software for simulation of design operating characteristics can help
clinical investigators evaluate different design options under various experimental
scenarios at the study planning stage and select the best design option for the study
objectives (Meyer et al. 2021).
Finally, we would like to highlight the importance of precompetitive collabora-
tion and broad discussions among stakeholders in industry, academia, and health
authorities on master protocols, in particular on platform trials. Best industry prac-
tices on master protocols are still emerging, and we anticipate increasing interest in
both the methodology and applications of these designs in the near future.

References
Adams R, Brown E, Brown L, Butler R, Falk S, Fisher D, Kaplan R, Quirke P, Richman S,
Samuel L, Seligmann J, Seymour M, Shiu KK, Wasan H, Wilson R, Maughan T, FOCUS4 Trial
Investigators (2018) Inhibition of EGFR, HER2, and HER3 signalling in patients with colorectal
cancer wild-type for BRAF, PIK3CA, KRAS, and NRAS (FOCUS4-D): a phase 2-3 randomised
trial. Lancet Gastroenterol Hepatol 3(3):162–171

Alexander BM, Ba S, Berger MS, Berry DA, Cavenee WK, Chang SM, Cloughesy TF, Jiang T,
Khasraw M, Li W, Mittman R, Poste GH, Wen PY, Yung WKA, Barker AD, GBM AGILE
Network (2018) Adaptive global innovative learning environment for glioblastoma: GBM
AGILE. Clin Cancer Res 24(4):737–743
Antonijevic Z, Beckman RA (2019) Platform trials in drug development: umbrella trials and basket
trials. CRC Press, Boca Raton
Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ (2009) I-SPY 2: an
adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol
Ther 86(1):97–100
Bentzien J, Bharadwaj R, Thompson DC (2015) Crowdsourcing in pharma: a strategic framework.
Drug Discov Today 20(7):874–883
Berger VW (2015) Letter to the editor: a note on response-adaptive randomization. Contemp Clin
Trials 40:240
Berry SM (2020) Potential statistical issues between designers and regulators in confirmatory
basket, umbrella, and platform trials. Clin Pharmacol Ther 108(3):444–446
Berry SM, Connor JT, Lewis RJ (2015) The platform trial: an efficient strategy for evaluating
multiple treatments. JAMA 313(16):1619–1620
Bowden J, Brannath W, Glimm E (2014) Empirical Bayes estimation of the selected treatment mean
for two-stage drop-the-loser trials: a meta-analytic approach. Stat Med 33:388–400
Bowden J, Glimm E (2008) Unbiased estimation of selected treatment means in two-stage trials.
Biom J 50(4):515–527
Bretz F, Koenig F (2020) Commentary on Parker and Weir. Clin Trials 17(5):567–569
Bretz F, Koenig F, Brannath W, Glimm E, Posch M (2009) Adaptive designs for confirmatory
clinical trials. Stat Med 28:1181–1217
Byar DP (1980) Why data bases should not replace randomized clinical trials. Biometrics 36:
337–342
Carreras M, Brannath W (2013) Shrinkage estimation in two-stage adaptive designs with midtrial
treatment selection. Stat Med 32:1677–1690
Chen N, Carlin BP, Hobbs BP (2018) Web-based statistical tools for the analysis and design of
clinical trials that incorporate historical controls. Comput Stat Data Anal 127:50–68
Choodari-Oskooei B, Bratton DJ, Gannon MR, Meade AM, Sydes MR, Parmar MK (2020) Adding
new experimental arms to randomised clinical trials: impact on error rates. Clin Trials 17(3):
273–284
Cohen DR, Todd S, Gregory WM, Brown JM (2015) Adding a treatment arm to an ongoing clinical
trial: a review of methodology and practice. Trials 16:179
Collignon O, Gartner C, Haidich AB, Hemmings RJ, Hofner B, Pétavy F, Posch M, Rantell K,
Roes K, Schiel A (2020) Current statistical considerations and regulatory perspectives on the
planning of confirmatory basket, umbrella, and platform trials. Clin Pharmacol Ther 107(5):
1059–1067
DiMasi JA, Grabowski HG, Hansen RW (2016) Innovation in the pharmaceutical industry: new
estimates of R&D costs. J Health Econ 47:20–33
Dodd LE, Freidlin B, Korn EL (2021) Platform trials – beware the noncomparable control group. N
Engl J Med 384(16):1572–1573
Dodd LE, Proschan MA, Neuhaus J, Koopmeiners JS, Neaton J, Beigel JD, Barrett K, Lane
HC, Davey RT (2016) Design of a randomized controlled trial for ebola virus disease
medical countermeasures: PREVAIL II, the Ebola MCM study. J Infect Dis 213(12):
1906–1913
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a
control. J Am Stat Assoc 50:1096–1121
Elm JJ, Palesch YY, Koch GG, Hinson V, Ravina B, Zhao W (2012) Flexible analytical methods for
adding a treatment arm mid-study to an ongoing clinical trial. J Biopharm Stat 22:758–772
European Medicines Agency. Reflection paper on methodological issues in confirmatory clinical
trials with an adaptive design. London, 18 October 2007. Available from https://www.ema.
europa.eu/en/documents/scientific-guideline/reflection-papermethodological-issues-confirma
tory-clinical-trials-planned-adaptive-design_en.pdf
Esserman L, Hylton N, Asare S, Yau C, Yee D, DeMichele A, Perlmutter J, Symmans F, van’t Veer L, Matthews J, Berry DA, Barker A (2019) I-SPY2: unlocking the potential of the platform
trial. In: Antonijevic Z, Beckman RA (eds) Platform trial designs in drug development: umbrella
trials and basket trials. CRC Press, Boca Raton, pp 3–22
FOCUS4 master protocol (2019). http://www.focus4trial.org/media/1809/03a_focus4_master-
protocol-v70_11sep2019_clean.pdf
Food and Drug Administration. Master protocols: efficient clinical trial design strategies to expedite
development of oncology drugs and biologics. Guidance for industry (draft guidance).
September 2018. https://www.fda.gov/media/120721/download
Food and Drug Administration. Adaptive designs for clinical trials of drugs and biologics: guidance
for industry. November 2019. https://www.fda.gov/media/78495/download
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14(14):4368–4371
Friede T, Henderson R (2009) Exploring changes in treatment effects across design stages in
adaptive trials. Pharm Stat 8:62–72
Gallo P, Chuang-Stein C (2009) What should be the role of homogeneity testing in adaptive trials?
Pharm Stat 8:1–4
Gallo P, Mao L, Shih VH (2014) Alternative views on setting clinical trial futility criteria. J
Biopharm Stat 24(5):976–993
Galwey NW (2017) Supplementation of a clinical trial by historical control data: is the prospect of
dynamic borrowing an illusion? Stat Med 36:899–916
Hague D, Townsend S, Masters L, Rauchenberger M, Van Looy N, Diaz-Montana C, Gannon M,
James N, Maughan T, Parmar MK, Brown L et al (2019) Changing platforms without stopping
the train: experiences of data management and data management systems when adapting
platform protocols by adding and closing comparisons. Trials 20(1):294
Hobbs BP, Chen N, Lee JJ (2018) Controlled multi-arm platform design using predictive proba-
bility. Stat Methods Med Res 27:65–78
Horn M, Vollandt R (2000) A survey of sample size formulas for pairwise and many-to-one
comparisons in the parametric, nonparametric and binomial case. Biom J 42(1):27–44
Howard DR, Brown JM, Todd S, Gregory WM (2018) Recommendations on multiple testing
adjustment in multi-arm trials with a shared control group. Stat Methods Med Res 27(5):
1513–1530
Hu F, Rosenberger WF (2006) The theory of response-adaptive randomization in clinical trials.
Wiley, New York
International Conference on Harmonisation. ICH E9(R1) Addendum on Estimands and Sensitivity
Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials.
17 February 2020. https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e9-r1-
addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical-principles_en.pdf
International Conference on Harmonisation. E10: Choice of Control Group in Clinical Trials.
January 2001. https://www.ema.europa.eu/en/ich-e10-choice-control-group-clinical-trials
James ND, Sydes MR, Clarke NW, Mason MD, Dearnaley DP, Anderson J, Popert RJ, Sanders K,
Morgan RC, Stansfeld J, Dwyer J, Masters J, Parmar MK (2009) Systemic therapy for
advancing or metastatic prostate cancer (STAMPEDE): a multi-arm, multistage randomized
controlled trial. BJU Int 103(4):464–469
Jennison C, Turnbull BW (2000) Group sequential methods with applications to clinical trials. CRC
Press, Boca Raton
Jiao F, Tu W, Jimenez S, Crentsil V, Chen YF (2019) Utilizing shared internal control arms and
historical information in small-sized platform clinical trials. J Biopharm Stat 29(5):845–859
Jin M, Liu G (2020) Estimand framework: delineating what to be estimated with clinical questions
of interest in clinical trials. Contemp Clin Trials 96:106093
Kaizer AM, Hobbs BP, Koopmeiners JS (2018) A multi-source adaptive platform design for testing
sequential combinatorial treatment strategies. Biometrics 74(3):1082–1094
Kaplan R (2015) The FOCUS4 design for biomarker stratified trials. Chin Clin Oncol 4(3):35

Kaplan R, Maughan T, Crook A, Fisher D, Wilson R, Brown L, Parmar M (2013) Evaluating many
treatments and biomarkers in oncology: a new design. J Clin Oncol 31(36):4562–4568
Kim ES, Herbst RS, Wistuba II et al (2011) The BATTLE trial: personalizing therapy for lung
cancer. Cancer Discov 1:44–53
Kimani PK, Todd S, Renfro LA, Glimm E, Khan JN, Kairalla JA, Stallard N (2020) Point and
interval estimation in two-stage adaptive designs with time to event data and biomarker-driven
subpopulation selection. Stat Med 39(19):2568–2586
Kimani PK, Todd S, Stallard N (2014) A comparison of methods for constructing confidence
intervals after phase II/III clinical trials. Biom J 56(1):107–128
Kopp-Schneider A, Calderazzo S, Wiesenfarth M (2020) Power gains by using external information
in clinical trials are typically not possible when requiring strict type I error control. Biom J 62(2):
361–374
Kuznetsova OM, Tymofyeyev Y (2011) Brick tunnel randomization for unequal allocation to two or
more treatment groups. Stat Med 30(8):812–824
Kuznetsova OM, Tymofyeyev Y (2014) Wide brick tunnel randomization – an unequal allocation
procedure that limits the imbalance in treatment totals. Stat Med 33(9):1514–1530
Lee KM, Wason J, Stallard N (2019) To add or not to add a new treatment arm to a multi-arm study:
a decision-theoretic framework. Stat Med 38:3305–3321
Marschner IC (2007) Optimal design of clinical trials comparing several treatments with a control.
Pharm Stat 6:23–33
Mayer C, Perevozskaya I, Leonov S, Dragalin V, Pritchett Y, Bedding A, Hartford A, Fardipour P,
Cicconetti G (2019) Simulation practices for adaptive trial designs in drug and device develop-
ment. Stat Biopharm Res 11(4):325–335
Meyer EL, Mesenbrink P, Dunger-Baldauf C, Fülle HJ, Glimm E, Li Y, Posch M, König F (2020)
The evolution of master protocol clinical trial designs: a systematic literature review. Clin Ther
42(7):1330–1360
Meyer EL, Mesenbrink P, Mielke T, Parke T, Evans D, König F on behalf of EU-PEARL
(EU Patient-cEntric clinicAl tRial pLatforms) Consortium (2021) Systematic review of avail-
able software for multi-arm multi-stage and platform clinical trial design. Trials 22:183
Morrell L, Hordern J, Brown L, Sydes MR, Amos CL, Kaplan RS, Parmar MK, Maughan TS
(2019) Mind the gap? The platform trial as a working environment. Trials 20(1):297
Neal D, Casella G, Yang MCK, Wu SS (2011) Interval estimation in two-stage, drop-the-losers
clinical trials with flexible treatment selection. Stat Med 30:2804–2814
Normington J, Zhu J, Mattiello F, Sarkar S, Carlin B (2020) An efficient Bayesian platform trial
design for borrowing adaptively from historical control data in lymphoma. Contemp Clin Trials
89:105890
Palmer CR, Rosenberger WF (1999) Ethics and practice: alternative designs for phase III random-
ized clinical trials. Control Clin Trials 20:172–186
Park JJH, Harari O, Dron L, Lester RT, Thorlund K, Mills EJ (2020) An overview of platform trials
with a checklist for clinical readers. J Clin Epidemiol 125:1–8
Park JJH, Siden E, Zoratti MJ, Dron L, Harari O, Singer J, Lester RT, Thorlund K, Mills EJ (2019)
Systematic review of basket trials, umbrella trials, and platform trials: a landscape analysis of
master protocols. Trials 20:572
Parker RA, Weir CJ (2020) Non-adjustment for multiple testing in multi-arm trials of distinct
treatments: rationale and justification. Clin Trials 17(5):562–566
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic
Dis 29:175–188
PREVAIL II Writing Group (2016) A randomized, controlled trial of Zmapp for ebola virus
infection. N Engl J Med 375:1448–1456
Proschan MA, Follmann DA (1995) Multiple comparisons with control in a single experiment
versus separate experiments: why do we feel differently? Am Stat 49(2):144–149
Quan H, Zhang B, Lan Y, Luo X, Chen X (2019) Bayesian hypothesis testing with frequentist
characteristics in clinical trials. Contemp Clin Trials 87:105858
Racine-Poon A, D’Amelio A, Sverdlov O, Haas T (2020) OPTIM-ARTS – an adaptive phase II open platform trial design with an application to a metastatic melanoma study. Stat Biopharm
Res. https://doi.org/10.1080/19466315.2020.1749722
Ritchie CW, Molinuevo JL, Truyen L, Satlin A, Van der Geyten S, Lovestone S, on behalf of the
European Prevention of Alzheimer’s Dementia (EPAD) Consortium (2016) Development of
interventions for the secondary prevention of Alzheimer’s dementia: the European Prevention of
Alzheimer’s Dementia (EPAD) project. Lancet Psychiatry 3(2): 179–186
Robertson DS, Lee KM, López-Kolkovska BC, Villar SS (2020) Response-adaptive randomization
in clinical trials: from myths to practical considerations. https://arxiv.org/pdf/2005.00564.pdf
Rosenberger WF, Lachin J (2015) Randomization in clinical trials: theory and practice, 2nd edn.
Wiley, New York
Rosenberger WF, Sverdlov O, Hu F (2012) Adaptive randomization for clinical trials. J Biopharm
Stat 22(4):719–736
Ryeznik Y, Sverdlov O (2018) A comparative study of restricted randomization procedures for
multiarm trials with equal or unequal treatment allocation ratios. Stat Med 37:3056–3077
Saville BR, Berry SM (2016) Efficiencies of platform clinical trials: a vision of the future. Clin
Trials 13:358–366
Scannell JW, Blanckley A, Boldon H, Warrington B (2012) Diagnosing the decline in pharmaceu-
tical R&D efficiency. Nat Rev Drug Discov 11(3):191–200
Schiavone F, Bathia R, Letchemanan K, Masters L, Amos C, Bara A, Brown L, Gilson C, Pugh C,
Atako N, Hudson F et al (2019) This is a platform alteration: a trial management perspective on
the operational aspects of adaptive and platform and umbrella protocols. Trials 20(1):264
Siden EG, Park JJH, Zoratti MJ, Dron L, Harari O, Thorlund K, Mills EJ (2019) Reporting of
master protocols towards a standardized approach: a systematic review. Contemp Clin Trials
Commun 15:100406
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Sridhara R, Marchenko O, Jiang Q, Pazdur R, Posch M, Redman M, Tymofyeyev Y, Li X,
Theoret M, Shen YL, Gwise T, Hess L, Coory M, Raven A, Kotani N, Roes K, Josephson F,
Berry S, Simon R, Binkowitz B (2021) Type I error considerations in master protocols with
common control in oncology trials: report of an American Statistical Association Biopharma-
ceutical Section open forum discussion. Stat Biopharm Res. https://doi.org/10.1080/19466315.
2021.1906743
Stallard N, Kimani P (2018) Uniformly minimum variance conditionally unbiased estimation in
multi-arm multi-stage clinical trials. Biometrika 105(2):495–501
Stallard N, Todd S, Parashar D, Kimani PK, Renfro LA (2019) On the need to adjust for multiplicity
in confirmatory clinical trials with master protocols. Ann Oncol 30(4):506–509
Sverdlov O, Rosenberger WF (2013) On recent advances in optimal allocation designs for clinical
trials. J Stat Theory Pract 7(4):753–773
Sverdlov O, Ryeznik Y (2019) Implementing unequal randomization in clinical trials with hetero-
geneous treatment costs. Stat Med 38:2905–2927
Sverdlov O, Ryeznik Y, Wong WK (2020) On optimal designs for clinical trials: an updated review.
J Stat Theory Pract 14:10
Tang R, Shen J, Yuan Y (2019) ComPAS: a Bayesian drug combination platform trial design with
adaptive shrinkage. Stat Med 38:1120–1134
Thall PF, Fox P, Wathen JK (2015) Statistical controversies in clinical research: scientific and
ethical problems with adaptive randomization in comparative clinical trials. Ann Oncol 26(8):
1621–1628
The Adaptive Platform Trials Coalition (2019) Adaptive platform trials: definition, design, conduct
and reporting considerations. Nat Rev Drug Discov 18:797–807
Trippa L, Lee EQ, Wen PY, Batchelor TT, Cloughesy T, Parmigiani G, Alexander BM (2012)
Bayesian adaptive randomized trial design for patients with recurrent glioblastoma. J Clin Oncol
30(26):3258–3263
Ventz S, Cellamare M, Parmigiani G, Trippa L (2018) Adding experimental arms to platform clinical trials: randomization procedures and interim analysis. Biostatistics 19(2):199–215
Ventz S, Parmigiani G, Trippa L (2017) Combining Bayesian experimental designs and frequentist
data analysis: motivations and examples. Appl Stoch Model Bus Ind 33:302–313
Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, Hobbs B, Ibrahim JG,
Kinnersley N, Lindborg S, Micallef S (2014) Use of historical control data for assessing
treatment effects in clinical trials. Pharm Stat 13(1):41–54
Viele K, Broglio K, McGlothlin A, Saville BR (2020a) Comparison of methods for control
allocation in multiple arm studies using response adaptive randomization. Clin Trials 17(1):
52–60
Viele K, Saville BR, McGlothlin A, Broglio K (2020b) Comparison of response adaptive random-
ization features in multiarm clinical trials with control. Pharm Stat 19:602–612
Villar SS, Bowden J, Wason J (2018) Response-adaptive designs for binary responses: how to offer
patient benefit while being robust to time trends? Pharm Stat 17:182–197
Villar SS, Robertson DS, Rosenberger WF (2020) The temptation of overgeneralizing response-
adaptive randomization. Clin Infect Dis ciaa1027. https://doi.org/10.1093/cid/ciaa1027
Wason JMS, Stecher L, Mander AP (2014) Correcting for multiple-testing in multi-arm trials: is it
necessary and is it done? Trials 15:364
Wason JMS, Trippa L (2014) A comparison of Bayesian adaptive randomization and multi-stage
designs for multi-arm clinical trials. Stat Med 33:2206–2221
Wason JMS, Robertson DS (2021) Controlling type I error rates in multi-arm clinical trials: a case
for the false discovery rate. Pharm Stat 20:109–116
Wassmer G, Brannath W (2016) Group sequential and confirmatory adaptive designs in clinical
trials. Springer International Publishing, Cham
Wathen JK, Thall PF (2017) A simulation study of outcome adaptive randomization in multi-arm
clinical trials. Clin Trials 14(5):432–440
Wei LJ, Durham SD (1978) The randomized play-the-winner rule in medical trials. J Am Stat Assoc
73:840–843
World Health Organization. WHO R&D Blueprint Novel Coronavirus COVID-19 Therapeutic
Trial Synopsis, 2020. https://www.who.int/blueprint/priority-diseases/key-action/COVID-19_
Treatment_Trial_Design_Master_Protocol_synopsis_Final_18022020.pdf
Wong CH, Siah KW, Lo AW (2019) Estimation of clinical trial success rates and related parameters.
Biostatistics 20(2):273–286
Woodcock J, LaVange LM (2017) Master protocols to study multiple therapies, multiple diseases,
or both. N Engl J Med 377:62–70
Woodcock J, Woosley R (2008) The FDA critical path initiative and its influence on new drug
development. Annu Rev Med 59:1–12
Yuan Y, Guo B, Munsell M, Lu K, Jazaeri A (2016) MIDAS: a practical Bayesian design for
platform trials with molecularly targeted agents. Stat Med 35:3892–3906
Zhou X, Liu S, Kim ES, Herbst RS, Lee JJ (2008) Bayesian adaptive design for targeted therapy
development in lung cancer – a step toward personalized medicine. Clin Trials 5:181–193
77 Cluster Randomized Trials

Lawrence H. Moulton and Richard J. Hayes

Contents
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1488
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1488
Basic Characteristics of CRTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1489
Variability Across Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1490
Parameters to Be Estimated: Analysis Populations and Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1491
Analysis Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1491
Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1492
Cluster Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1493
Matching and Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1494
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1494
Randomization Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1495
Highly Constrained Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1496
Alternative Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1497
Sample Size and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1497
Minimum Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1497
Sample Size Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1498
Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1500
Individual-Level Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1500
Cluster-Level Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1500
Effects of Correlation Structure on Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1501
Reporting Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1501
Ethics and Data Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1502
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1502
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1503
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1503

L. H. Moulton (*)
Departments of International Health and Biostatistics, Johns Hopkins Bloomberg School of Public
Health, Baltimore, MD, USA
e-mail: [email protected]; [email protected]
R. J. Hayes
Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical
Medicine, London, UK
e-mail: [email protected]



Abstract
In a randomized clinical or field trial, when randomization units are comprised of
groups of individuals, many aspects of design and analysis differ greatly from
those of an individually randomized trial. In this chapter, we highlight those
features which differ the most, explaining the nature of the differences and
delineating approaches to accommodate them. The focus is on design, as many
readers will be familiar with the correlated data analysis techniques that are
appropriate for many (although not all) cluster randomized trials (CRTs). Thus,
the chapter begins by covering motivations for using a CRT design, basic
correlation parameters, the variety of potential estimands, delineation and ran-
domization of clusters, and sample size calculation. This is followed by sections
on the analysis and reporting of results, which highlight ways to handle the
multilevel nature of the data. Finally, ethical and monitoring considerations
unique to CRTs are discussed.

Keywords
Cluster randomized trial · Group allocation · Correlated data

Definition

A cluster randomized trial (CRT) is a randomized controlled trial (RCT) in which the random assignment of the treatment or experimental condition is performed on
sets of individuals, so that within any given set, all individuals are allocated to the
same study arm.

Introduction

The vast majority of RCTs employ a randomization scheme wherein trial partici-
pants are individually randomized. A typical arrangement in a therapeutic trial is to
identify eligible patients as they arrive, one by one, at a clinic or hospital, and assign
them to a study arm according to a fixed randomization list. The therapy to be tested,
or a control version, is then administered to each individual accordingly, and later the
individual’s response is recorded. In a cluster randomized trial, however, entire
groups of potential participants are defined or identified, with those in a given
group assigned the same experimental condition, which they may even experience
simultaneously. Group membership may be determined by geography, e.g., place of
residence or catchment area of a hospital, or by location where services are received:
all the children in a given classroom, say, or all the patients in a hospital ward. It can
also be determined by timing, with all the patients presenting on randomly selected
days receiving the experimental treatment and the patients on other days receiving
the standard-of-care treatment, with each day’s patients constituting a group.
Perhaps the first trial that was designed and appropriately analyzed as a CRT was
one of isoniazid administration to prevent tuberculosis, where randomization was
performed, for administrative ease, by groupings of wards in mental institutions
(Ferebee et al. 1963, as reported by Donner and Klar 2000). Still, it was not until
statistical and computing advances made in the 1980s, and uptake of these methods
in the 1990s, that CRTs became a common tool in the trialist’s design repertoire.
There are now a number of English-language books devoted to the subject (Murray
1998; Donner and Klar 2000; Hayes and Moulton 2017; Eldridge and Kerry 2012;
Campbell and Walters 2014), and hundreds of related methodological articles have
been published in the biostatistics and epidemiology literature.

Basic Characteristics of CRTs

The reasons for carrying out randomization at the group or cluster level are usually a
combination of (with examples):

1. The intervention to be tested can only be assigned to groups of people. (A campaign to raise public awareness of a health problem might employ messages delivered by radio or newspaper.)
2. It is logistically much easier to deliver the intervention in groups. (It may be too
difficult for staff in a clinic to continually switch their procedures for different
patients.)
3. It is more acceptable to the study population to receive the intervention in groups.
(In an indoor air pollution study where some households receive modern cook-
stoves, neighbors may become jealous and hence refuse to participate as
controls.)
4. An individually randomized approach might result in too much “contamination,”
with individuals assigned to the control arm taking up the intervention. (Materials
delivered to promote exclusive breastfeeding might be shared or discussed among
neighbors.)
5. It is desired to capture effects that are due to group dynamics. (Individuals in a
group may communicate with each other and reinforce health messages; deploy-
ment of a vaccine throughout a geographic cluster might reduce secondary and
tertiary transmissions and result in some herd protection, thereby increasing
overall effectiveness at the cluster level.)

The principal drawback to CRTs is that, in general, they require greater sample
size in terms of numbers of participants than do individually randomized trials.
There is almost always positive within-cluster correlation, which can reduce the
effective sample size to a large degree, as will be seen in the sample size section.
More participants translate into larger costs for trials that perform maneuvers at the individual level, and there can be cluster-level costs as well, due to increased
transportation and communications with community leaders or clinic directors.
A related feature of CRTs is that they often are comprised of relatively small
numbers of clusters, say 8–50, although they may have thousands of participants.
This is primarily due to the logistics and costs associated with adding each additional
cluster. Small numbers of clusters can engender inferential difficulties, both in terms
of small sample properties of statistical estimators and greater risks associated with
clusters that in one way or another become outliers.

Variability Across Clusters

There is almost always positive correlation among members of a designated group or cluster with respect to any characteristic, measured or not, including health outcomes. This correlation may be viewed as a result of cluster-to-cluster variability of
these characteristics. This point needs to be emphasized:

Between-cluster variability → Within-cluster correlation

Imagine classrooms in an elementary school. There is a large variation in children’s height by grade: first graders are on average shorter than second, etc.
Otherwise stated, first graders are more like each other in height than they are like
second graders. This within-grade correlation or dependence can be thought of in
another way: suppose we know the height of one randomly selected second grader; if a second student is then randomly selected from any grade, knowing that second student’s grade gives us information about their height relative to the first student’s height.
When planning or analyzing a CRT, it is important to account for within-cluster
correlation (or, equivalently, between-cluster variation). The two statistical measures
most often used in such circumstances are the coefficient of variation k and the
intracluster correlation coefficient ρ. To help explain these, consider the example of
1-year period prevalence of tuberculosis among patients attending HIV clinics in Rio
de Janeiro. Thus, the data elements are the number of patients enrolled in a clinic (the
denominator of the prevalence) and the number of them who were diagnosed with
tuberculosis in a given year (the numerator). There are two sources of variation to be
considered: (1) binomial variability (within-clinic) and (2) extra-binomial variability
(between-clinic). Each clinic has its own intrinsic prevalence Pj, j = 1, . . ., N clinics.
These are assumed to vary across clinics; the distribution of these Pj has its own
mean, π, and variance, σB², the between-clinic (cluster) variance. The clinics’
prevalences may differ because of differences in characteristics of their catchment
area populations, differences in staff skills, available diagnostic tests, etc. And for
any specific clinic, there will be variability in its outcome due to year-to-year
variation in exposure or detection. Then a measure of the variability of the true
clinic proportions is the coefficient of variation given by k = σB/π. Thus, the standard
deviation is scaled by the mean and becomes dimensionless – this makes it a fairly
transferable, or meaningful, measure that can be used in other studies or situations.


The intracluster correlation coefficient is also dimensionless and can be defined as

$$\rho = \frac{\sigma_B^2}{\sigma^2} = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2},$$

where W stands for “within-cluster”; for proportions, this is:

$$\rho = \frac{\sigma_B^2}{\pi(1 - \pi)}.$$

A third measure, the design effect (DEff for short), is not a measure of cluster
variability per se but rather a descriptive measure of the effect of within-cluster
correlation in the context of a given study design. It can be designated as:

$$\text{DEff} = \frac{\text{sample size required when accounting for within-cluster correlation}}{\text{sample size required when there is no such correlation}} = \frac{\text{actual sample size}}{\text{effective sample size}},$$

which is greater than 1 if the ICC > 0 or k > 0, the usual situation.

The DEff depends not only on the correlation but also on cluster size; it is therefore
more specific to the actual design being considered but, by the same token, less
generalizable or applicable to other studies.
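
To make these measures concrete, the short sketch below (Python; all clinic counts are invented for illustration) computes k, ρ, and the implied design effect from clinic-level prevalence data. The between-clinic standard deviation used here is a naive plug-in that still contains the within-clinic binomial component, which a more careful analysis would subtract out.

    import numpy as np

    # Hypothetical 1-year TB period prevalence data from six HIV clinics:
    # patients enrolled (denominators) and TB cases (numerators).
    enrolled = np.array([120, 250, 180, 300, 150, 200])
    cases = np.array([6, 20, 11, 30, 9, 14])

    p_j = cases / enrolled                  # clinic-specific prevalences P_j
    pi_hat = p_j.mean()                     # mean prevalence across clinics
    sigma_b = p_j.std(ddof=1)               # naive between-clinic SD

    k = sigma_b / pi_hat                            # coefficient of variation
    rho = sigma_b**2 / (pi_hat * (1 - pi_hat))      # ICC for proportions

    n_bar = enrolled.mean()       # average cluster size
    deff = 1 + (n_bar - 1) * rho  # design effect, treating clusters as equal-sized

    print(f"k = {k:.2f}, rho = {rho:.4f}, DEff = {deff:.1f}")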

Parameters to Be Estimated: Analysis Populations and Effects

Analysis Populations

When designing a study, it is important to think carefully about what one wishes to
estimate and among whom. In individually randomized trials, several analytic
populations may be specified, including intent-to-treat, as-treated, and per-protocol
populations (see ▶ Chap. 82, “Intention to Treat and Alternative Approaches”). The
same is true for CRTs, but there is an extra layer of complication due to randomizing
clusters of people.
In a typical individually randomized trial, participants are enrolled (consented and
given an identification code) and then assigned a randomized study treatment – a
strict intent-to-treat approach would then analyze all data collected from that point
on, regardless of whether the participant actually received the assigned treatment. In a CRT,
however, there can be two levels of treatment: the treatment condition the cluster
receives and the treatment received by participants. If clinics are randomized to have
or not have certain educational materials about stroke in their waiting rooms, the
intent-to-treat time at the clinic level may begin months before an individual patient
shows up at the clinic – the individual’s intent-to-treat time-at-risk for having a
stroke would begin when they enter the waiting room. That would mimic the long-
term effectiveness of the intervention were it ever to be adopted, as the individual
would be exposed to the materials from the first time they go to the clinic.
More complications may arise depending on who is considered as having been
randomized. If a cluster is defined by geographic residence, at the time of random-
ization, a strict approach might only follow up individuals who were resident at that
moment. However, that results in a closed cohort that ages over time and might not
be of as much interest as a dynamic cohort with people entering and leaving clusters.
On the other hand, people may move into a cluster because they know it is receiving
a treatment they want to receive. The particular nature of the intervention (e.g.,
applied at the cluster or at the individual level), whether it is masked, its general
availability, and potential biases all need to be considered in order to determine
exactly what parameters the study should try to estimate.

Effects

There are further questions regarding what events should be counted for which
analyses. Halloran and Struchiner (1991) introduced a nomenclature that helps
clarify how different parameters of interest may be estimated, as a function of
which individuals’ events are counted and compared, taking into consideration the
possibility of indirect or herd effects occurring within clusters. Figure 1 indicates
four possible effects and their estimation: direct, indirect, total, and overall effects.
Halloran et al. (1997) may be consulted for further details.
Fig. 1 Depiction of possibly estimable effects in a cluster randomized trial where a
subset of individuals in a cluster (unit) are enrolled and receive a study treatment
(e.g., a drug in the intervention units and a placebo in the control units), but all may
have outcomes measured. Two representative, equal-sized denominator clusters are
shown, with “explosion” symbols representing cases arising from the respective
populations. (The figure contrasts intervention units with control units, marks cases
among enrolled and non-enrolled individuals, and tabulates the attack-rate
comparisons defining the total, indirect, overall, and direct effects.)

In Fig. 1, comparing the attack rates among those enrolled in each trial arm
provides an estimate of the total effect of the intervention. The total effect is a
combination of the direct effect (the protection afforded to an individual enrolled in
an intervention cluster) and the indirect effect (the protection due to decreased
exposure, as a result of lower secondary transmission) of an intervention. If out-
comes can be measured among individuals who are not specifically enrolled in a
trial, say if there are population disease registries or health-care system data avail-
able, then indirect and overall effects can be measured. In such a situation, the
indirect effect can be directly estimated by comparing attack rates among those who
have not been enrolled in the trial (the rates of the purple cases in Fig. 1). Note that
people who enroll in a study tend to differ from those who do not; thus, it is best not
to compare the non-enrolled in the intervention clusters to everyone in the control
clusters. The overall effect is a combination not only of direct and indirect effects but
also the degree of coverage that has been attained, i.e., the uptake of the study
intervention. It compares attack rates among all those in the clusters who would have
been eligible for the trial, regardless of whether they were enrolled. Finally, the direct
effect can be estimated by comparing those enrolled to those not enrolled within
intervention clusters but may be too biased to be of interest, as there often will be
differences in the kinds of people who enter into a trial and those who do not.
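
As a purely numerical illustration of which groups are compared for each effect, the sketch below (Python) uses invented attack rates:

    # Hypothetical attack rates, by trial arm; all values are invented.
    enrolled_int, enrolled_ctl = 0.05, 0.10        # enrolled participants
    nonenrolled_int, nonenrolled_ctl = 0.08, 0.10  # non-enrolled residents
    all_int, all_ctl = 0.065, 0.10                 # everyone eligible

    total = 1 - enrolled_int / enrolled_ctl           # enrolled vs. enrolled
    indirect = 1 - nonenrolled_int / nonenrolled_ctl  # non-enrolled vs. non-enrolled
    overall = 1 - all_int / all_ctl                   # all eligible vs. all eligible
    direct = 1 - enrolled_int / nonenrolled_int       # within intervention clusters;
                                                      # prone to bias, as noted above
    print(f"total = {total:.0%}, indirect = {indirect:.0%}, "
          f"overall = {overall:.0%}, direct = {direct:.1%}")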

Cluster Specification

It may be clear how to define the clusters in a CRT: if the intervention is the
introduction of a new triage system in an emergency room, the unit of randomization
would be the hospital in which the ER is located, with patients (or patients with a
particular condition) arriving at its ER forming the cluster. With geographically
defined clusters, however, there may be many different options, including postal
codes, census tracts, towns, counties, states, or districts. In a given entire study area,
in general, the more clusters there are, the greater will be the study’s power, given
diminishing returns with respect to cluster size (more on this below in the sample
size section). However, the fewer the clusters, the less the potential for contamina-
tion occurring from control participants adopting or accessing intervention practices
or from control participants introducing pathogens into intervention communities,
both of which would tend to reduce observed effectiveness of an intervention. Also,
with fewer clusters, there can be reduced costs due to logistics, transportation, and
dealing with cluster-level communications and assent from gatekeepers. As an
example, in a pneumococcal conjugate vaccine study in infants on an American
Indian reservation, the randomization could have been performed at the level of the
Indian Health Service administrative units, of which there were eight. Although
mixing of infants in intervention areas with those in the control areas would have
been minimized, there would not have been much power. There were 110 smaller,
tribal organization units, but consultation with local staff indicated there might be
substantial contamination across them. In the end, the 110 areas were grouped into
38 randomization units according to a number of factors including location of
shopping areas and Head Start preschool program centers (Moulton et al. 2001).
If cross-cluster contamination is a risk, another strategy is to designate buffer
zones around the clusters from which outcome data are not collected, although
intervention or control activities might still be carried out in these zones. Note that
this may increase cost of the trial through requiring setting up the trial in larger
geographic areas or more clusters.

Matching and Stratification

As is often done in individually randomized trials, it can be advantageous to carry
out randomization of clusters within specified strata, so that within each stratum
there is a designated balance of the treatment conditions. Stratification has the dual
role of reducing variance by combining similar clusters into strata and of enforcing
balance according to the stratification factors so as to reduce confounding. As will be
seen in the section on randomization, there are other strategies that may be employed
to achieve balance with respect to possibly confounding variables, so that the main
advantage of stratification is to compare like-with-like.
When there are not many clusters, we have found that placing them in just 2–4
strata is often sufficient to substantially reduce within-stratum variability in out-
comes while only losing a few degrees of freedom in the analysis. Stratifying more
deeply to the point of pair-matching can minimize bias and improve the inferential
basis for causality but can have some limitations with respect to the statistical
analyses that can be carried out, due to lack of replication within the matched
pairs. In situations with potentially high variability in outcomes, pair-matching can
be very efficient – in the Mwanza STD trial, there was an estimated 13-fold relative
efficiency in using matched pairs as compared to an unstratified design (Hayes and
Moulton 2017). But if the matching is not done on variables strongly related to the
outcomes and there are fewer than ten pairs, one may gain power by ignoring the
matching at the time of analysis (Diehr et al. 1995), although it is best to specify this
in advance.

Randomization

Many aspects of randomization are covered in ▶ Chap. 40, “Principles of Clinical
Trials: Bias and Precision Control.” There are, however, two ways in which CRTs
often differ from the standard clinical trial that can affect choice of randomization
method. First, for most CRTs, all of the units of randomization are identified before
the randomization occurs. This can facilitate the process, as complicated systems
of assignment, either through distributed or centralized randomization, are not
necessary for concealment or minimization of assignment bias. Usually all that is
needed is a list of the geographic areas, or the clinics in a research network, a
pseudo-random number generator, and 5 min in a quiet room. An exception is
when units come online over time, an example of which is the randomized ring
vaccination strategy employed in the Ebola ça suffit! trial (Ebola ça suffit Ring
Vaccination Trial Consortium 2015), in which clusters of individuals were defined
as a randomization unit whenever a new Ebola case was identified. The second
potentially distinguishing factor is the number of randomization units, which may
be limited in a CRT. Certainly, there are many small phase I clinical trials, but these
are often looking at specific biologic responses to a new drug, say, which do not
vary greatly from person to person. CRTs may be investigating a combination of
behavioral and biologic effects and have additional cluster-level factors that can
introduce variability in the outcomes. In addition, a CRT with a small number of
clusters may still have thousands of participants and thus be rather expensive.
Thus, constructing a good randomization scheme can be more critical in a CRT
than in an individually randomized trial with the same number of randomization
units.
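
When all units really are identified in advance, the allocation can be as simple as the earlier remark about a list and a pseudo-random number generator suggests; a minimal sketch (unit names and seed are invented):

    import random

    # Twelve pre-identified randomization units, listed before randomization.
    units = [f"district_{i:02d}" for i in range(1, 13)]

    rng = random.Random(2023)   # a fixed seed makes the allocation auditable
    shuffled = units.copy()
    rng.shuffle(shuffled)
    print("intervention:", sorted(shuffled[:6]))
    print("control:     ", sorted(shuffled[6:]))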
Stratification to enforce balance was mentioned in the previous section. In an
individually randomized trial, age and gender may be the only covariates that affect
an outcome, and using four strata (female/male crossed with younger/older) may
suffice. But take the example of randomizing 12 census tracts in a city to a new drug
harm reduction strategy versus the status quo. One can easily obtain many baseline
covariates that would be desirable to have balanced: housing value, proportion of
residents with a college education, median household income, population density,
etc.; if there are ten such variables that are rendered dichotomous, that yields
2¹⁰ = 1024 potential strata. Clearly, with 12 randomization units, a standard strat-
ification strategy will not suffice. Any one of these covariates might end up with a
post-randomization imbalance that would not look good in a “Table 1” of a research
article and could lead one to question the study results. A multi-million dollar study
would not want to risk this kind of imbalance.

Randomization Criteria

A recent article on randomization methods for achieving covariate balance in CRTs
states there are two desirable criteria for a successful randomization: unbiasedness
and covariate balance (Morgan and Rubin 2012). Covariate balance can be defined
in a number of ways, depending on whether exact balance or a caliper-based near
balance is required and whether univariate or multivariate functions of covariates are
to be employed (e.g., achieving balance on the number of high-income recent
immigrants to an area). An unbiased randomization is one in which each random-
ization unit has the same probability to be selected for a given treatment arm as any
other unit. Stated a bit more formally, a design is unbiased if the expected value, over
all possible randomization allocations, of the difference in treatment means is equal
to the true difference (Bailey and Rowley 1987).
There is, however, a third criterion that is important for randomization, that of
“validity.” This criterion, perhaps first enunciated in R.A. Fisher’s Design of Exper-
iments (1947), relates to whether the assignments for different units may in some
way be linked to each other. Clearly, if under all possible allocations a given
geographic unit was always assigned the same treatment condition as a given
neighboring unit, the number of randomization units would effectively be reduced
by one. While unbiasedness is a first-moment (mean) consideration, validity is a
second-moment (variance) one: a “. . .scheme is said to be valid if the expectations of
the treatment mean square and the error mean square are equal in the absence of
treatment effects. . .” (Bailey 1987). An operationally useful result is that in a
completely (simple) randomized design, the design is valid if each pair of random-
ization units has the same probability of being allocated the same treatment (Bailey
and Rowley 1987).

Highly Constrained Randomization

Standard implementations of stratified or permuted block designs are both unbiased
and valid. Yet when balance is required with respect to many variables, or near-
balance across treatment arms on the marginal means of a variable is desired,
achieving a completely valid design becomes virtually impossible. For example,
determining a valid randomization scheme in a factorial design with just one
constraining variable required Bailey (1987) to use a deep knowledge of abstract
algebra. For more complicated situations, many methods have been proposed,
including simultaneous univariate restriction on sets of variables (Raab and Butcher
2001), Mahalanobis distance (Morgan and Rubin 2012), propensity scores (Xu and
Kalbfleisch 2010), or p-values (Bruhn and McKenzie 2009). If their constraining
criteria are not too strict, they will yield schemes that are nearly valid. It is easy to see
how a given constraint might induce dependence of allocation outcomes among
units. Suppose in the above community drug harm reduction strategy it was desired
that the total population in the intervention arm be within 10% of the
population size of the control arm. If there were one very large unit, and one very
small one, then among the acceptable allocations, there might be only a few that did
not have both of these units in the same study arm, as they would generally need to
be in the same arm to achieve the desired marginal balance.
A given system of constraints may result in a high proportion of the total
number of possible allocations being deemed unsuitable, resulting in lack of
uniformity in the number of times given pairs of units could be placed in the
same study arm. A “validity matrix” that consists of the numbers of times each pair
of units might be included in the same arm can be constructed and inspected either
from enumeration among the acceptable allocations or, if this number is too large,
from a random sample of those acceptable (Moulton 2004). Simulations have
shown that it takes a large departure from a uniform distribution of inclusion
probabilities coupled with high correlation of responses to meaningfully affect
Type I error. As with many aspects of experimental design, there is a trade-off: the
tighter the constraints, the greater the possibility of inadvertent linkage in treatment
assignments, thereby straying further from validity. The analyst can, after the
study, perform a randomization analysis, based on the set of potential allocations,
as a check, but it is convenient to have a randomization scheme that will allow the
standard array of statistical analyses to be conducted. Alternatively, covariate
adjustment with the constraining variables can mitigate the effects of a non-valid
scheme (Li et al. 2017).
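
A minimal sketch of such a constrained randomization, with a validity-matrix check, is given below; the unit populations and the 10% criterion mirror the hypothetical example above, and a full implementation would follow Moulton (2004).

    import itertools
    import random
    from math import comb

    import numpy as np

    # Invented populations for 12 randomization units.
    pop = np.array([23_000, 4_500, 12_000, 9_800, 15_200, 7_400,
                    18_900, 5_600, 11_300, 26_700, 8_100, 13_500])
    n_units = len(pop)

    # Keep allocations whose intervention-arm population total is within
    # 10% of the control-arm total.
    acceptable = []
    for combo in itertools.combinations(range(n_units), n_units // 2):
        p_int = pop[list(combo)].sum()
        p_ctl = pop.sum() - p_int
        if abs(p_int - p_ctl) <= 0.10 * p_ctl:
            acceptable.append(combo)
    print(f"{len(acceptable)} of {comb(n_units, n_units // 2)} allocations accepted")

    # Validity matrix: how often each pair of units shares an arm across the
    # acceptable allocations; large departures from uniformity flag linkage.
    V = np.zeros((n_units, n_units), dtype=int)
    for combo in acceptable:
        in_int = np.isin(np.arange(n_units), combo)
        V += np.equal.outer(in_int, in_int)
    np.fill_diagonal(V, 0)
    pairs = V[np.triu_indices(n_units, 1)]
    print("pairwise same-arm counts range from", pairs.min(), "to", pairs.max())

    random.seed(42)
    print("chosen intervention units:", random.choice(acceptable))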

Alternative Designs

Although the focus in this chapter is on two-arm, parallel design trials, there are
many possible variations in CRT design. Factorial trials are fairly common; the large
cost of a CRT can be more easily justified if two interventions can be evaluated for
nearly the price of one, as can be done with a 2 × 2 factorial design (Montgomery
et al. 2003). It is rare, however, to see more than two levels of two factors used, due
to the required larger number of clusters and high per-cluster costs.
An increasingly popular design is the stepped wedge design, which is a one-way
crossover trial with staggered implementation, so that clusters are scheduled to go
from control phase to intervention phase at randomly assigned times (steps), until all
clusters have received the intervention. Choice of this design can be motivated by
political considerations, when it is desirable to show a steady march toward all clusters
receiving an intervention. Interpretation of results, however, can be fraught with
difficulties, primarily due to secular trends and variable lengths of implementation
(Kotz et al. 2012). Hussey and Hughes (2007) provide a useful framework for
modeling these trials, but such models have strong assumptions regarding correlation
structures and commonality of effects across clusters that need to be carefully consid-
ered (Thompson et al. 2017). It must be noted that subjecting a population to a trial that
leads to inconclusive results can be ethically problematic. This design is perhaps best
used in situations where an intervention is going to be rolled out in a population
anyway, so that randomizing the rollout affords an opportunity to obtain a better
evaluation of the program than might otherwise be possible. A useful set of articles on
this design, and its many variants, may be found in Trials (Torgerson 2015).

Sample Size and Power

Minimum Number of Clusters

The reader may be acquainted with quasi-experimental designs of demonstration
projects that randomly assign one or two clusters to each treatment arm. Although
technically randomized, such studies lack the requisite robustness to provide
statistically solid answers to research questions. With a 2 versus 2 design,
one can perform a Student t-test with the four cluster means and base a statistical
conclusion on it, but there are not enough clusters to get any idea as to whether the
assumptions underlying the t-test are met. One cluster whose participants generate an
outlying mean response can drastically affect the outcome. It is best to conduct CRTs
that have a sufficient number of clusters so that at least a nonparametric check of the
results can achieve “statistical significance,” usually meaning obtaining a p-value
less than 0.05 from a two-sided hypothesis test. Thus, it is best to have at least eight
clusters in a simple randomized design or six pairs of clusters in a matched-pair
design. In the former, a rank sum test will achieve p < 0.05 if all four intervention
clusters are superior (or all inferior) to all four control clusters (p = 2/C(8,4) = 2/70 = 0.029).
In the latter, the results of every pair must be the same (intervention clusters all do
better than their matched control clusters, or vice versa; in which case
p = 2 × 2⁻⁶ = 0.031). Clearly, if in one of these designs just one cluster has
difficulties, perhaps withdrawing from the study, or undergoes substantial contam-
ination due to local events, the whole experiment is jeopardized.
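
Both small-sample p-values are easy to verify directly; a quick check (Python):

    from math import comb

    # 8 clusters, 4 per arm: all intervention clusters above (or all below)
    # all control clusters in a two-sided rank sum test.
    p_ranksum = 2 / comb(8, 4)       # 2/70
    # 6 matched pairs, every pair favoring the same arm (two-sided sign test).
    p_pairs = 2 * 0.5**6             # 2/64
    print(f"p_ranksum = {p_ranksum:.3f}, p_pairs = {p_pairs:.3f}")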

Sample Size Methods

The basic concepts involved in sample size determination for CRTs are the same as
for individually randomized trials, except that for CRTs there are two sizes that need
to be designated for a study: the number of randomization units (clusters) in each
trial arm (assuming for simplicity equal numbers of clusters per arm), N, and the
number of individuals in each cluster, n, or nj, j = 1, . . ., N, if the sizes vary by cluster.
Sometimes, the total number of clusters, 2N, is fixed, say the number of
convenient geopolitical units in an area, but it is possible to modify n – we may
have 40 districts to randomize and plan to measure the outcome among a random
sample of 200 individuals per district. In other circumstances, all individuals in a
cluster can be measured easily, say from electronic records in a clinic, but one needs
to decide on how many clinics to enter into the trial. It may be that both N and n are
fixed, but the follow-up time can be increased to obtain more person-years at risk and
hence more events. Typically, it costs less to increase cluster size than to increase the
number of clusters, as there can be extra per-cluster costs associated with commu-
nication, transportation, and gatekeeper signoff. Yet, as will be seen, there are
diminishing returns in terms of power with respect to increasing the size of clusters.
In individually randomized trials, a measure of response variability needs to be
specified: the individual level variance. In CRTs, this is also required, as well as
specification of a measure of between-cluster variability: either the coefficient of
variation k or the intracluster correlation coefficient ρ may be used.
The design effect in terms of ρ may be written as 1 + (n − 1)ρ (Kish 1965), an
increasing function of both cluster size and the ICC. Then the total number of
individuals in a trial arm can be found by multiplying the usual sample size formula
by this design effect:

$$Nn = \left(z_{\alpha/2} + z_\beta\right)^2 \frac{\sigma_0^2 + \sigma_1^2}{(\mu_0 - \mu_1)^2}\,\left[1 + (n - 1)\rho\right],$$

where σi² is the arm-specific variance of individuals’ responses in a given cluster
(these are the same as what were subscripted with W for “within,” above) and μi is
the true mean response in each arm (Donner and Klar 2000). Hence, adding one
cluster to each arm as an approximate small sample correction, we can take the
requisite number of clusters in each arm to be:

$$N = 1 + \left(z_{\alpha/2} + z_\beta\right)^2 \frac{\sigma_0^2 + \sigma_1^2}{n(\mu_0 - \mu_1)^2}\,\left[1 + (n - 1)\rho\right].$$

Using the coefficient of variation k to express variability across clusters yields the
similar formula:
 
$$N = 1 + \left(z_{\alpha/2} + z_\beta\right)^2 \frac{\left(\sigma_0^2 + \sigma_1^2\right)/n + k^2\left(\mu_0^2 + \mu_1^2\right)}{(\mu_0 - \mu_1)^2}.$$

These formulas can be modified to allow for unequal allocation of clusters in the
trial arms and to allow for differential levels of within-cluster correlation across trial
arms (Hayes and Bennett 1999). Other modifications include the special case of the
matched pairs design and accounting for varying cluster sizes: the greater the
variance, the greater the loss of power.
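
Both formulas translate directly into code; in the sketch below (function names are ours, and scipy is assumed for the normal quantiles), each includes the one-cluster small-sample correction:

    from math import ceil
    from scipy.stats import norm

    def clusters_per_arm_icc(mu0, mu1, sd0, sd1, n, rho, alpha=0.05, power=0.80):
        """Clusters per arm from the ICC-based formula (Donner and Klar 2000)."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        deff = 1 + (n - 1) * rho
        return ceil(1 + z**2 * (sd0**2 + sd1**2) * deff / (n * (mu0 - mu1)**2))

    def clusters_per_arm_cv(mu0, mu1, sd0, sd1, n, k, alpha=0.05, power=0.80):
        """Clusters per arm from the coefficient-of-variation formula
        (Hayes and Bennett 1999)."""
        z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
        num = (sd0**2 + sd1**2) / n + k**2 * (mu0**2 + mu1**2)
        return ceil(1 + z**2 * num / (mu0 - mu1)**2)

    # Blood pressure example from the table below: 100 vs. 95 mmHg, SD 10 mmHg.
    # With clusters of 10 and k = 0.05, about 23 clusters per arm give 80% power,
    # consistent with the 91% power the table shows for 30 clusters per arm.
    print(clusters_per_arm_cv(100, 95, 10, 10, n=10, k=0.05))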
The formula based on k cannot be used directly when dealing with responses that
may be negative, e.g., with anthropometric Z-scores; the formula based on ρ is not
appropriate for Poisson or rate (based on person-years of observation) response vari-
ables. An advantage of k is that it is easily interpretable, and reasonable values for it
can be posited even in the absence of good background data. For example, it may be
known that in a given district the incidence rate of rotavirus diarrhea in infants is 5 per
10 infant-years and that it is highly unlikely for this rate to vary by more than twofold
(from the lowest to the highest) across subdistrict health centers. If the true rates are
approximately normally distributed, it might be reasonable to assume about 95% of the
rates would be between 3.33 and 6.67 per 10 infant-years, so the standard deviation
would be about (6.67 − 5)/2 = 0.835, giving k = 0.835/5 = 0.17.
As already mentioned, when it comes to study power, there are greatly
diminishing returns to increasing cluster sizes for a specified number of clusters.
The following table gives an illustration of this for a continuous response variable,
diastolic blood pressure, which in a study population has a mean of 100 mmHg, with
a standard deviation of 10 mmHg. For a study design with 30 clusters in each arm,
power to detect a lowering to 95 mmHg is displayed as a function of cluster size for
several levels of k, with 5% Type I error.
Power to detect a 5 mmHg decrease given 30 clusters in each of the intervention and control arms,
population SD = 10 mmHg, and 5% Type I error

Coefficient of variation k   Cluster size n   Power
0.050                         10              91%
0.050                         50              96%
0.050                        500              97%
0.075                         10              67%
0.075                         50              72%
0.075                        500              74%
0.100                         10              46%
0.100                         50              49%
0.100                        500              50%

There is usually little variation across clusters in biologic parameters such as
blood pressure. In this situation, there is very little to be gained by increasing the
cluster size tenfold, from 50 to 500. For the two larger values of k, regardless of how
much cluster size is increased, the trial will be infeasible, in that not even 80% power
would be attainable. If k is thought to be as large as 0.075, more clusters will need to
be added to the design.
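
The table can be reproduced by inverting the k-based sample size formula (with its one-cluster correction) to solve for power; a sketch:

    from scipy.stats import norm

    def power_cv(mu0, mu1, sd, n, k, N, alpha=0.05):
        """Power for N clusters per arm, common SD, coefficient of variation k."""
        var_term = 2 * sd**2 / n + k**2 * (mu0**2 + mu1**2)
        z_beta = ((N - 1) * (mu0 - mu1)**2 / var_term) ** 0.5 \
            - norm.ppf(1 - alpha / 2)
        return norm.cdf(z_beta)

    for k in (0.050, 0.075, 0.100):
        row = [f"n={n}: {power_cv(100, 95, 10, n, k, N=30):.0%}"
               for n in (10, 50, 500)]
        print(f"k = {k}:", ", ".join(row))
    # Prints 91%, 96%, 97% for k = 0.05, and so on, matching the table.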

Statistical Analysis

Individual-Level Regression Methods

Since the 1980s, flexible regression methods for the analysis of correlated data have
been readily accessible to researchers. Maximum likelihood estimation of random
effects models (Laird and Ware 1982), and generalized estimating equations (GEE)
with robust variance estimation (Liang and Zeger 1986), have been the main
approaches for handling longitudinal or multilevel data. These methods are also
appropriate for the analysis of many CRTs, provided there are sufficient numbers of
clusters, say 10–15 in each trial arm. For designs with fewer clusters, the cluster-
level methods discussed in the next section may be preferable.
In general, random effects models, which have a “subject-specific” or conditional
(on cluster) interpretation, are best used when there are sufficient numbers of
observations and/or events in each cluster. GEE models, on the other hand, were
designed for situations with small numbers of observations per cluster. GEE yields
“population-averaged” (Neuhaus et al. 1991) estimates that are usually close to or
identical to the marginal values obtained when ignoring clusters but corrects for
over- or under-dispersion via empirical variance estimation. Several adjustments for
GEE models have been devised to correct the Type I error, which can be inflated
when there are relatively few clusters (Scott et al. 2017).
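
A brief sketch of the two regression approaches on simulated data (all data-generating values are invented), using the Python statsmodels package:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Simulated CRT: 30 clusters of 40 participants each.
    rng = np.random.default_rng(2023)
    n_cl, n_per = 30, 40
    cluster = np.repeat(np.arange(n_cl), n_per)
    treat = np.repeat(rng.permutation([0, 1] * (n_cl // 2)), n_per)
    y = (100 - 3 * treat + rng.normal(0, 2, n_cl)[cluster]   # cluster effect
         + rng.normal(0, 10, n_cl * n_per))                  # individual noise
    df = pd.DataFrame({"y": y, "treat": treat, "cluster": cluster})

    # Population-averaged estimate: GEE with an exchangeable working
    # correlation and robust (empirical) variance estimation.
    gee = smf.gee("y ~ treat", groups="cluster", data=df,
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
    print("GEE:  ", gee.params["treat"], gee.bse["treat"])

    # Cluster-specific (conditional) estimate: random-intercept model.
    mix = smf.mixedlm("y ~ treat", data=df, groups="cluster").fit()
    print("Mixed:", mix.params["treat"], mix.bse["treat"])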
Because these standard approaches for analyzing correlated data have been well
elaborated in the literature, we now focus on cluster-level methods.

Cluster-Level Methods

Within-cluster correlation can be handled directly by reducing the data to a single
summary measure for each cluster. For example, this could be the mean height-for-
age Z-score of all study children in a cluster, the number of deaths in a cluster
divided by the person-years of exposure in that cluster in the time interval of interest,
or the number of people with a disease condition divided by the number tested in a
cluster. Once these summaries are obtained, any analysis may be performed that
could be done with 2N uncorrelated observations. A Student t-test with associated
confidence interval can be applied to either these summaries or to log-transformed
measures, and weights may be incorporated if desired. The t-test is fairly robust to its
normality assumption, but there may be too few clusters to gauge how well the
assumption is met; a Wilcoxon rank sum test can be used as a check and to
downweight undue influence of “outlying” clusters.
If there is a sufficient number of clusters, cluster summaries can be used as
responses in regression models that adjust for cluster-level covariates (e.g., whether
there is a tertiary care facility in a geographic cluster, or the median income of cluster
residents). More often, adjustment for individual-level covariates will be required.
Even with very few clusters, the following two-stage method can be employed
(Bennett et al. 2002; Hayes and Moulton 2017): (1) In the first stage, ignoring
clusters, a regression of the outcome variable on all the adjusting variables is fit,
but with the treatment arm indicator(s) omitted; (2) in the second stage, residuals
from the fit in the first stage are calculated for each cluster – then the standard
analysis, say a t-test, is conducted on the residuals. This approach is especially useful
in matched pairs designs, where individual-level regression modeling is problematic.
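
A self-contained sketch of the two-stage method on simulated data (the covariate and effect sizes are invented):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import ttest_ind

    # Simulated CRT with an individual-level covariate to adjust for.
    rng = np.random.default_rng(7)
    n_cl, n_per = 20, 30
    cluster = np.repeat(np.arange(n_cl), n_per)
    treat = np.repeat([0, 1] * (n_cl // 2), n_per)
    age = rng.uniform(20, 70, n_cl * n_per)
    y = (50 - 2 * treat + 0.1 * age
         + rng.normal(0, 1, n_cl)[cluster] + rng.normal(0, 5, n_cl * n_per))
    df = pd.DataFrame({"y": y, "treat": treat, "cluster": cluster, "age": age})

    # Stage 1: regress on the adjusting covariate(s) only, ignoring clusters
    # and omitting the treatment indicator.
    df["resid"] = smf.ols("y ~ age", data=df).fit().resid

    # Stage 2: one summary (mean residual) per cluster, then a standard
    # t-test on the 2N uncorrelated cluster-level values.
    per_cluster = df.groupby("cluster").agg(resid=("resid", "mean"),
                                            treat=("treat", "first"))
    t, p = ttest_ind(per_cluster.query("treat == 1")["resid"],
                     per_cluster.query("treat == 0")["resid"])
    print(f"t = {t:.2f}, two-sided p = {p:.3f}")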

Effects of Correlation Structure on Analyses

In a geographically defined cluster, e.g., a village, the correlation structure of outcomes
among residents can be complex. Inter-person correlation may be stronger in the
more densely populated center and weaker in the sparser periphery. People may
talk with immediate neighbors about a behavioral intervention but also with classmates
in a school. There may be repeated measures on individuals over time, inducing
within-person correlation as well. One rarely knows the full extent of these correla-
tions. Happily, accounting for correlation at the cluster level, using any of the above
methods, automatically accounts for any correlation at all lower levels in the sense that
the variance of estimates of treatment effects will be estimated consistently. The more
accurately the correlation structure is modeled, however, the more efficient will be the
analysis. Typically, an equi-correlation structure is assumed, in the absence of any
other information, either through a random effects analysis or a GEE model with
specified exchangeable correlation. The further the departure from exchangeability, the
more additional clusters are required to achieve accurate variance estimation.

Reporting Results

Standardized reporting of trials has improved greatly with the publication of the
CONSORT Statement (see chapter “Reporting guidelines”) (Begg et al. 1996). A
specialized version has been produced for CRTs, the most recent of which is by
Campbell et al. (2012). Many checklist items are the same while calling for addi-
tional details on rationale for the cluster design, cluster definition, levels of masking,
how clustering is accounted for in the analyses, and estimates of the degree of
observed clustering. Not mentioned in this CONSORT, but desirable when there
are not too many clusters (say less than 40), is the display in a table or figure of
cluster-by-cluster outcome data, so that readers can identify the degree to which
clusters varied in size or in magnitude of response.

Ethics and Data Monitoring

Of the three precepts given in the Belmont Report (The National Commission for
the Protection of Human Subjects of Biomedical and Behavioral Research 1979)
– respect for persons, beneficence, and justice – it is respect for persons, expressed
through the informed consent process, that can be the most problematic for CRTs. It may not be
possible to obtain informed consent on the part of individuals when the interven-
tion is applied at the community level, say a public health information campaign.
In such situations, community leaders, political figures, clinic directors, or other
gatekeepers will need to be approached. This is often the case even when the
intervention is delivered directly to individuals, so that consent at both the
community and individual level is required. The justice principle also calls for
special consideration, as inequitable distribution (or perception thereof) of risks
and benefits may occur, perhaps with certain communities becoming stigmatized
for their role in the trial. Individuals who do not partake in the trial may suffer
adverse effects related to the treatment of the community in which they live – for
example, a mass administration of an antibiotic to children may result in resis-
tance to some organisms, making it difficult to cure adults who succumb to them.
The Ottawa Statement (Weijer et al. 2012) addresses further ethical details
specific to CRTs.
CRTs will typically have a Data Monitoring Committee (DMC), and perhaps also
a Steering Committee, Safety Monitor, or Data Monitor, each of which has some role
in overseeing or ensuring the ethical conduct of the study. For CRTs, the DMC has to
take a broader view than is done in individually randomized trials, considering the
impact on communities or clusters as a whole, which can include those not directly
involved in the trial. Monitoring for early evidence of effectiveness and early
stopping does not occur as frequently in CRTs as in individually randomized trials,
as long-term effects are often of interest, for example, reduction in secondary or
tertiary attack rates. When potential early stopping is of interest, the DMC needs to
be aware that information is accrued, relatively speaking, more rapidly in a CRT, due
to the diminishing returns within clusters of obtaining further information on
individuals in a cluster, as described above in the section on sample size (Hayes
and Moulton 2017).

Discussion

While this chapter explained some introductory statistical notions relevant to cluster
randomized trials, it also concentrated on aspects of how they differ from individ-
ually randomized trials that will prove useful even to seasoned statisticians and
trialists who have not worked with CRTs. In individually randomized trials, it is
usually the case that the intervention is applied at the individual level and data are
collected at the individual level. In CRTs, however, there may be three different
levels involved in randomization, intervention, and data collection. For example,
cluster, individual, and disease registry could be the levels, respectively. When
crossed with differing possible analysis populations, from intent-to-treat to per-
protocol, and different effects (e.g., indirect) of interest, the potential estimands
become myriad. As a consequence, investigators have to think long and hard
about exactly what answers a trial should be designed to provide. This will guide
cluster formation and specification of inclusion/exclusion criteria for enrollment,
intervention, and data collection.
By contrast, methods of analysis of CRTs are relatively straightforward. This
chapter has mentioned approaches for handling the particularly problematic ana-
lytic feature of CRTs, namely, that many trials involve small numbers of clusters.
That means methods relying on large-sample asymptotics may become suspect,
and we need to consider alternative methods or conduct additional sensitivity
analyses.
As new methods arise in the design and analysis of individually randomized
trials, say for covariate specification or causal inference for handling loss to follow
up, there will be parallel application of them to CRTs. Such transfer of methodology,
however, needs to be done carefully, especially when accounting for within-cluster
correlation and small numbers of clusters.

Cross-References

▶ Intention to Treat and Alternative Approaches
▶ Principles of Clinical Trials: Bias and Precision Control
▶ Reporting Biases

References
Bailey RA (1987) Restricted randomization: a practical example. J Am Stat Assoc 82:712–719
Bailey RA, Rowley CA (1987) Valid randomization. Proc R Soc Lond A Math Phys Sci
410:105–124
Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D,
Stroup DF (1996) Improving the quality of reporting of randomized controlled trials. The
CONSORT statement. JAMA 276:637–639
Bennett S, Parpia T, Hayes R, Cousens S (2002) Methods for the analysis of incidence rates in
cluster randomized trials. Int J Epidemiol 31:839–846
Bruhn M, McKenzie D (2009) In pursuit of balance: randomization in practice in development field
experiments. Am Econ J Appl Econ 1:200–232
Campbell MJ, Walters SJ (2014) How to design, analyse and report cluster randomised trials in
medicine and health related research. Wiley, West Sussex
Campbell MK, Piaggio G, Elbourne DR, Altman DG, CONSORT Group (2012) Consort 2010
statement: extension to cluster randomised trials. BMJ 345. https://doi.org/10.1136/bmj.e5661
Diehr PD, Martin C, Koepsell T, Cheadle A (1995) Breaking the matches in a paired t-test for
community interventions when the number of pairs is small. Stat Med 14:1491–1504
Donner A, Klar N (2000) Design and analysis of cluster randomised trials in health research.
Arnold, London
Ebola ça suffit Ring Vaccination Trial Consortium (2015) The ring vaccination trial: a novel cluster
randomised controlled trial design to evaluate vaccine efficacy and effectiveness during out-
breaks, with special reference to Ebola. BMJ 351. https://doi.org/10.1136/bmj.h3740
Eldridge S, Kerry S (2012) A practical guide to cluster randomised trials in health services research,
1st edn. Wiley, West Sussex
Ferebee SH, Mount FW, Murray FJ, Livesay VT (1963) A controlled trial of isoniazid prophylaxis
in mental institutions. Am Rev Respir Dis 88:161–175
Fisher RA (1947) The design of experiments, 4th edn. Hafner-Publishing Company, New York
Halloran ME, Struchiner CJ (1991) Study designs for dependent happenings. Epidemiology
2:331–338
Halloran ME, Struchiner CJ, Longini IM Jr (1997) Study designs for evaluating different efficacy
and effectiveness aspects of vaccines. Am J Epidemiol 146:789–803
Hayes RJ, Bennett S (1999) Simple sample size calculation for cluster-randomized trials. Int J
Epidemiol 28:319–326
Hayes RJ, Moulton LH (2017) Cluster randomised trials, 2nd edn. Chapman & Hall, Boca Raton
Hussey MA, Hughes JP (2007) Design and analysis of stepped wedge cluster randomized trials.
Contemp Clin Trials 28:182–191
Kish L (1965) Survey sampling. Wiley, New York
Kotz D, Spigt M, Arts ICW, Crutzen R, Viechbauer W (2012) Use of the stepped wedge design
cannot be recommended: a critical appraisal and comparison with the classic cluster randomized
controlled trial design. J Clin Epidemiol 65:1249–1252
Laird NM, Ware JH (1982) Random-effect models for longitudinal data. Biometrics 38:963–974
Li F, Turner EL, Heagerty PJ, Murray DM, Volmer WM, Delong EL (2017) An evaluation of
constrained randomization for the design and analysis of group-randomized trials with binary
outcomes. Stat Med 36:3791–3806
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73:13–22
Montgomery AA, Peters TJ, Little P (2003) Design, analysis and presentation of factorial
randomized controlled trials. BMC Med Res Methodol 3:26. https://doi.org/10.1186/1471-
2288-3-26
Morgan KL, Rubin DB (2012) Rerandomization to improve covariate balance in experiments. Ann
Stat 40:1263–1282
Moulton LH (2004) Covariate-based constrained randomization of group-randomized trials. Clin
Trials 1:297–305
Moulton LH, O’Brien KL, Kohberger R, Chang I, Reid R, Weatherholtz R, Hackell JG, Siber GR,
Santosham M (2001) Design of a group-randomized Streptococcus pneumoniae vaccine trial.
Control Clin Trials 22:438–452
Murray DM (1998) Design and analysis of group randomised trials. Oxford University Press,
New York
Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-
averaged approaches for analyzing correlated binary data. Int Stat Rev 59:25–35
Raab GM, Butcher I (2001) Balance in cluster randomized trials. Stat Med 20:351–365
Scott JM, deCamp A, Juraska M, Fay MP, Gilbert PB (2017) Finite-sample corrected
generalized estimating equation of population average treatment effects in stepped wedge
cluster randomized trials. Stat Methods Med Res 26:583–597. https://doi.org/10.1177/
0962280214552092
The National Commission for the Protection of Human Subjects of Biomedical and Behavioral
Research (1979) Protection of human subjects; Belmont report: notice of report for public
comment. Fed Regist 44:23191–23197
Thompson JA, Fielding KL, Davey C, Aiken AM, Hargreaves JR, Hayes RJ (2017) Bias and
inference from misspecified mixed-effect models in stepped wedge trial analysis. Stat Med
36:3670–3682
Torgerson D (ed) (2015) Stepped wedge randomized controlled trials. Trials
16:350, 351, 352, 353, 354, 358, 359
Weijer C, Grimshaw JM, Eccles MP, McRae AD, White A, Brehaut JC, Taljaard M, Ottawa Ethics
of Cluster Randomized Trials Consensus Group (2012) The Ottawa statement on the ethical
design and conduct of cluster randomized trials. PLoS Med 9. https://doi.org/10.1371/journal.
pmed.1001346
Xu Z, Kalbfleisch JD (2010) Propensity score matching in randomized clinical trials. Biometrics
66:813–823
Multi-arm Multi-stage (MAMS) Platform
Randomized Clinical Trials 78
Babak Choodari-Oskooei, Matthew R. Sydes, Patrick Royston, and
Mahesh K. B. Parmar

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1508
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1508
The MAMS Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1510
Advantages of MAMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1510
Example: STAMPEDE Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1511
MAMS Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1513
Design Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1513
Steps to Design a MAMS Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1518
Analysis at Interim and Final Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1519
Choosing Pairwise Design Significance Level and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1519
Intermediate and Definitive Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1520
Operating Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1521
MAMS Selection Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1525
Adding New Research Arms and Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1527
Software and Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1528
Considerations in Design, Conduct, and Analysis of a MAMS Trial . . . . . . . . . . . . . . . . . . . . . . . . . 1532
Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1532
Conduct Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1535
Analysis Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1535
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1537
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1538
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1539
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1539

Abstract
Efficient clinical trial designs are needed to speed up the evaluation of new
therapies. The multi-arm multi-stage (MAMS) randomized clinical trial designs
have been proposed to achieve this goal. In this framework, multiple
experimental treatments are compared against a common control arm in several
stages. This approach has several advantages over the more traditional designs
since it obviates the need for multiple two-arm studies, and allows poorly
performing experimental treatments to be discontinued during the study. To
further increase efficiency, Royston and colleagues proposed a particular class
of MAMS designs where an intermediate outcome can be used at the interim
stages, thus allowing phases II and III of evaluation to be incorporated under one
protocol. The MAMS Platform designs speed up the evaluation process even
further by allowing new treatments to be introduced for assessment during the
course of a MAMS trial.
In this chapter, we describe the rationale for Royston et al.’s MAMS design,
and discuss their underlying principles. An example in prostate cancer is used to
explain how the MAMS design can be realized in practice. We present analytical
solutions for the strong control of the type I and II error rates, and show how these
quantities and the required sample size can be calculated using available software.
We also describe the challenges in the design and statistical analysis of such trials,
and suggest how these difficulties should be addressed. The MAMS platform
design has been used in a variety of disease areas, and holds considerable promise
for speeding up the evaluation of new treatments where many new regimens are
available for testing in the randomized phase II and phase III trials.

Keywords
Multi-arm multi-stage randomized trials · MAMS designs · Platform protocols ·
Adaptive clinical trials · Intermediate outcome · STAMPEDE trial · Prostate
cancer

Introduction

Background

Randomized controlled trials (RCTs) are the gold-standard for testing whether a new
treatment is better than the current standard of care. Recent reviews of phase III trials
showed a success rate of around 40% in oncology trials, and only around a 7% chance
of approval for drugs entering phase I testing (Hay et al. 2014). In many disease
areas such as oncology, traditional randomized trials take a long time to complete
and are often expensive. Multi-arm, multi-stage (MAMS) trial designs have been
proposed to overcome these challenges. The MAMS design aims to speed up the
evaluation of new therapies and improve success rates in identifying effective ones
(Parmar et al. 2008). In this framework, a number of experimental arms are com-
pared against a common control arm and these pairwise comparisons can be made in
several stages. The multi-stage element of the MAMS design resembles the parallel-
group sequential designs, where the accumulating data are used to make a decision
whether to take a certain treatment arm to the next stage.

Comparing multiple new regimens against a single, common control arm in a
multi-arm approach removes the need for multiple two-arm trials with separate
control arms and reduces the overall required sample size. For example, Freidlin
et al. (2008) showed that comparing four experimental arms in parallel to a single
control (one five-armed trial) reduces the required sample size by 37% compared to
four separate two-arm trials assuming no adjustments for multiple testing are made.
In general, comparing K experimental arms to a single control reduces the overall
sample size by a factor of (K − 1)/(2K) compared to K separate two-arm trials (Freidlin
et al. 2008). The efficiency of the multi-arm element can be greatly increased by
incorporating a multi-stage element which introduces formal interim-monitoring
guidelines to allow stopping early for strong evidence of benefit of the experimental
agent (efficacy) or when it seems that the experimental agent will not be better than
the control treatment, that is, lack-of-benefit analysis. Figure 1 displays how a
MAMS design might compare with a traditional series of separate trials evaluating
the same agents.
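
The (K − 1)/(2K) saving quoted above is simple to tabulate; a quick check in Python:

    # Sample size saving of one K-arm trial vs. K separate two-arm trials,
    # with no multiplicity adjustment (Freidlin et al. 2008).
    for K in (2, 3, 4, 5):
        print(f"K = {K}: saving = {(K - 1) / (2 * K):.1%}")
    # K = 4 gives 37.5%, the ~37% figure cited above.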
This chapter describes the multi-arm multi-stage trial design proposed by
Royston et al. (2003, 2011), with a focus on the underlying principles and concepts.
We use the initial design of the STAMPEDE trial in prostate cancer to describe how
the design can be realized in practice. We then explain how to calculate the operating
characteristics of the design, and discuss the issues of treatment selection and adding
new research arms. Finally, we describe a number of challenges in the design and
analysis of MAMS trials, and make suggestions for tackling them. Throughout, we
use the acronym MAMS to refer to the multi-arm, multi-stage design described by
Royston et al. (2003). Other approaches to MAMS designs will be briefly discussed
in “Summary.”

Fig. 1 Schematic representation of traditional approach (left) and a multi-arm, multi-stage design
(right) of evaluating experimental treatments (T) against control (C). Traditional approach has a set
of separate phase II studies (activity), some of which may not be randomized, with follow-up phase
III trials (efficacy) in some interventions that pass the phase II stage. Multi-arm, multi-stage
approach provides a platform in which many interventions can be assessed against the control
arm simultaneously and are randomized from the phase II component onwards

The MAMS Approach

Royston et al. (2003) developed a framework for a multi-arm multi-stage design for
time-to-event outcomes which can be applied to designs in the phase II/III settings
(Royston et al. 2003). In this design, an intermediate (I) outcome can be used at the
interim stages to further increase the efficiency of the MAMS design by stopping
recruitment to treatment arms for lack-of-benefit at interim stages. Using an I outcome
in this way allows interim analyses to be conducted sooner and so recruitment to
poorly performing arms can be stopped much earlier than if the primary outcome of
the trial was used throughout. Examples of intermediate and primary or definitive (D)
outcomes are progression-free survival (PFS) and overall survival (OS) for many
cancer trials, and CD4 count and disease-specific survival for HIV trials. When
using an I outcome at interim stages, each of the experimental arms is compared in
a pairwise manner with the control arm using the I-outcome measure, for example, the
PFS (log) hazard ratio. Section “Steps to Design a MAMS Trial” outlines the steps that
should be taken to design a MAMS trial. Section “Choosing Pairwise Design Signif-
icance Level and Power” explains how to choose the stagewise stopping rules and
design power, and section “Intermediate and Definitive Outcomes” provides guidance
on how to choose an I outcome within this framework.
This design has been extended to binary outcomes with the risk difference
(Bratton et al. 2013), and can easily be extended to odds ratio as the primary effect
measure (Abery and Todd 2019). The design has also been extended to include
stopping boundaries for overwhelming efficacy on the definitive (D) outcome of the
trial (Blenkinsop et al. 2019). It is one of the few adaptive designs being deployed
both in a number of trials and across a range of diseases in the phase II and III
settings, including cancer (Sydes et al. 2012), tuberculosis (TB), and surgical trials
(ClinicalTrials.gov Identifier: NCT03838575) (Sydes et al. 2012; ROSSINI 2018;
MRC Clinical Trials Unit at UCL). One example is the STAMPEDE trial for men
with prostate cancer which is used as an example in this chapter for illustration
(Sydes et al. 2012). In the remainder of this chapter, the MAMS designs that utilize
the I-outcome for the lack-of-benefit analysis at the interim looks are denoted by
I 6¼ D. Designs that monitor all the arms on the same definitive (D) outcome
throughout the trial are denoted by I ¼ D.

Advantages of MAMS

The MAMS design has several advantages. First, several primary hypotheses or
treatments can be evaluated under one (master) protocol. This maximizes the chance
of identifying a new treatment which is better than the current standard (Parmar et al.
2014). Second, all patients are randomized from the start, which ensures a fair and
contemporaneous comparison. If an early-phase (phase II) element is built into the
design under the same protocol, the trial runs seamlessly through to phase III and, if
at all possible, carries the information from patients in the early evaluation into the
phase III evaluation. As a result, the overall trial duration will be markedly
reduced compared to separate phase II and III studies since on average many fewer
patients will be required in most situations.
Furthermore, a MAMS design can be part of trials with master protocols such as
platform or umbrella trials. Master protocols allow major adaptations such as ceasing
randomization to an experimental arm or introducing new comparisons through the
addition of new experimental arms – see section “Adding New Research Arms and
Comparisons.” Platform trials provide notable operational efficiency since evalua-
tion of a new treatment within an existing trial will typically be much quicker than
setting up a new trial (Schiavone et al. 2019). Therefore, fewer patients tend to be
exposed to insufficiently effective or harmful treatments as these treatments are
eliminated quickly from the study. This shifts the focus to the more promising
treatments as the trial progresses.
Finally, MAMS designs tend to be popular with patients, perhaps because the increased number of active treatments means that patients are more likely to receive a new treatment. Recruitment tends to improve markedly over time in MAMS trials, while many traditional designs may struggle to accrue – particularly within the context of platform trials, and when new treatments are to be tested in different, often biomarker-defined, subgroups of a specific disease. In summary, MAMS
platform designs are efficient because they share a control arm, allow for early
stopping for lack-of-benefit and adding new research arms, and are operationally
seamless.

Example: STAMPEDE Trial

STAMPEDE is a multi-arm multi-stage (MAMS) platform trial for men with pros-
tate cancer at high risk of recurrence who are starting long-term androgen depriva-
tion therapy (Sydes et al. 2009, 2012). In the initial 4-stage design, five experimental
arms with treatment approaches previously shown to be suitable for testing in a
phase II/III trial were compared to a control arm regimen. In the original design, all
patients received standard of care treatment, and further treatments were added to
this in the experimental arms. The primary analysis was carried out at the end of
stage 4, with overall survival as the primary outcome. Stages 1 to 3 used an
intermediate outcome measure of failure-free survival (FFS) to drop arms for
lack-of-benefit. As a result, the hypotheses at the interim stages concerned lack of benefit on failure-free survival. Claims of efficacy could not be made on this outcome, because such claims can only be made on the primary (D) outcome of the trial.
Royston et al.'s MAMS design is constructed by specifying a one-sided significance level αj and power ωj for each pairwise comparison in each stage j, j = 1, ..., J, along with the target treatment effect, for example the log hazard ratio (HR), for the outcome of interest in that stage. The one-sided significance level αj is the type I error rate for dropping an arm based on the accumulated trial data up to the end of stage j, and ωj is the corresponding power to continue under the target treatment effect. For sample size calculation, other stagewise design parameters are also required – see Steps 1–10 in section "Steps to Design a MAMS Trial." For example, the user should specify the accrual rate, allocation ratio, and the expected event rate in the control arm in trials with binary or time-to-event outcomes. Based on these and other design parameters, the overall operating characteristics of the design, the timing of each analysis, the critical hazard ratios for continuation, and the sample sizes can then be calculated using the nstage or nstagebin packages in Stata – see section "Software and Example."
Table 1 illustrates how the MAMS design was applied to the original comparisons of the STAMPEDE trial. It shows the design specification for the original treatment comparisons at each stage: the outcome measure, the target hazard ratio under the alternative hypothesis for the experimental arms (HR1), the pairwise (design) power at each stage (ωj, j = 1, 2, 3, 4), and the one-sided pairwise significance level (αj). Given these design parameters, the critical hazard ratio for dropping arms for lack-of-benefit and the control arm events required to trigger each analysis are calculated using the nstage program and included in Table 1. Note that the required numbers of control arm events in stages 2–4 are slightly higher than those previously reported in Sydes et al. (2012) and other references; they have been calculated using the latest (beta) version of the nstage program, to be officially released, which gives more accurate sample sizes – see section "Software and Example." The same design parameters were used for all five original comparisons. The one-sided design significance levels at the interim stages (αj, j = 1, 2, 3) can be used as "stopping boundaries" on the P-value scale to drop arms for lack-of-benefit. At the end of each stage, if the observed P-value for a comparison is larger than αj, recruitment (but not necessarily follow-up) to that experimental arm ceases; recruitment to the other experimental treatments and the control arm continues to the next stage.
In the MAMS design, all P-values are one-sided for the following reasons. At the
interim stages, the focus is on continuing with those experimental therapies that
show a prespecified level of benefit on the outcome of interest. There would be no

Table 1 Design specification for the 6-arm 4-stage STAMPEDE trial. HR1, ωj, and αj are the target hazard ratio (HR) under the alternative hypothesis for the experimental arms, the pairwise (design) power, and the significance level at each stage. The critical HR and the required control arm events for each stage are calculated given these design parameters

Stage (j)   Type       Outcome   HR1    ωj     αj      Critical HR   Contl. arm events
1           Activity   FFS       0.75   0.95   0.50    1.00          113
2           Activity   FFS       0.75   0.95   0.25    0.92          223
3           Activity   FFS       0.75   0.95   0.10    0.88          350
4           Efficacy   OS        0.75   0.90   0.025   –             437
interest in continuing with an experimental therapy which was no better than the
control regimen (including those which are detrimental). Thus, the interim decision
rules are distinctly one-sided. Considering the final stage on the D-outcome, any
therapy that is likely to be detrimental in terms of final outcome is very unlikely to
have passed the interim stages. It therefore seems inappropriate to test for differences
in both directions at the final stage.
Recruitment to the original comparisons of the STAMPEDE trial began late in 2005 and was completed early in 2013. The design parameters for the primary outcome at the final stage were a (one-sided) significance level of 0.025, a power of 0.90, and a target hazard ratio of 0.75 on overall survival, which requires 437 control arm deaths (i.e., events on overall survival). An allocation ratio of A = 0.5 was used for these original comparisons so that, over the long term, one patient was allocated to each experimental arm for every two patients allocated to control. Proportional hazards were assumed for both the FFS and OS outcomes.
Because distinct hypotheses were being tested in each of the five experimental arms, the design for STAMPEDE emphasized the pairwise comparisons of each experimental arm against the control arm, and control of the pairwise type I error rate (PWER) – see section "Operating Characteristics" for the definition. Of the initial five experimental arms, only three continued to recruit through to their final stage; recruitment to the other two arms was stopped at the second interim look for lack of sufficient activity. Since November 2011, new experimental arms have been added to the original design, with five new comparisons added between 2011 and 2018.

MAMS Design

In this section, we present the MAMS design more formally and discuss how it can be realized in practice. The design is presented for a variety of outcome measures, and we also outline how the operating characteristics of the design can be calculated in different scenarios.

Design Specification

Consider a J-stage trial where patients are randomized between K experimental arms (k = 1, ..., K) and a single control arm (k = 0). The parameter θjk represents the true difference in the outcome measure between experimental arm k and control at stage j, j = 1, ..., J.
For continuous outcomes, θjk could be the difference in the means of the two groups at stage j, μjk − μj0; for binary data, the difference in proportions, pjk − pj0; and for survival data, a log hazard ratio, log(HRjk). For simplicity of notation, we outline the design specification for the case where the same definitive (D) outcome is monitored throughout the trial, that is, I = D designs. Therefore, in all notation θ and
Z represent the definitive primary outcome treatment effect and the corresponding Z-test statistic comparing experimental arm k = 1, 2, ..., K to the control arm.
The test statistic comparing experimental arm k against the control arm at stage j can be defined as Zjk = θ̂jk·√Vjk, where Vjk is the inverse of the variance of the treatment effect estimator at stage j for pairwise comparison k – in statistical terms, Vjk is known as Fisher's (observed) information. In the literature, I is used instead of V as the standard notation for Fisher's information; since in this chapter the intermediate outcome is abbreviated as I, we use the rather nonstandard notation V for Fisher's information. A detailed discussion of information quantification in various outcome settings is provided in Lan and Zucker (1993). The Z-test statistic is (approximately) normally distributed, that is, Zjk ~ N(θjk·√Vjk, 1), and is standard normal under the null hypothesis, Zjk ~ N(0, 1). For normal or binary data, Vjk depends on the number of subjects in the study arms; in time-to-event (survival) settings, it depends on the number of events in the study arms. Table 2 presents the treatment effect measures for continuous, binary, and survival outcomes with the corresponding Fisher's (observed) information Vjk.
For example, consider a trial with a continuous outcome where the aim is to test whether the outcome of n1 individuals on experimental treatment E1 is on average better (here better means smaller, e.g., blood pressure) than that of n0 individuals in the control group (C) at stage j. The null hypothesis H0j1: μj1 ≥ μj0 is tested against the (one-sided) alternative hypothesis H1j1: μj1 < μj0. In this case, the one-sided type I error rate and power at stage j are αj1 and ωj1, where αj1 = Φ(zαj1), ωj1 = Φ(zωj1), and Φ(·) is the standard normal distribution function. In the MAMS design, αjk and ωjk are the key stagewise design parameters, and are needed for the sample size calculations – see section "Steps to Design a MAMS Trial" for the other design parameters.
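To make these quantities concrete, here is a minimal sketch, in Python, of the information and test-statistic calculations for a survival comparison using the formulas of Table 2. The chapter's own software is the Stata nstage suite; this sketch is illustrative only, and the event counts and estimated hazard ratio below are hypothetical.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical interim data for one pairwise comparison with a survival outcome
    e_control, e_exper = 100, 60          # observed events in each arm (hypothetical)
    theta_hat = np.log(0.85)              # estimated log hazard ratio (hypothetical)

    # Fisher's (observed) information for a log hazard ratio (Table 2):
    # Vjk = (1/ej0 + 1/ejk)^(-1)
    V = 1.0 / (1.0 / e_control + 1.0 / e_exper)

    # Z-test statistic Zjk = theta_hat * sqrt(Vjk); negative values favor the new arm
    Z = theta_hat * np.sqrt(V)
    p_one_sided = norm.cdf(Z)             # one-sided P-value

    # Stagewise boundary on the z-scale: lj = z_alphaj; for alpha1 = 0.5 this is 0
    alpha_1 = 0.5
    l_1 = norm.ppf(alpha_1)

    print(f"V = {V:.1f}, Z = {Z:.2f}, one-sided p = {p_one_sided:.3f}, l1 = {l_1:.2f}")
    print("continue" if p_one_sided <= alpha_1 else "drop for lack of benefit")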
Without loss of generality, assume that a negative value of θjk indicates a beneficial effect of treatment k. In trials with K experimental arms, a set of K null hypotheses is tested at each stage j:

    H0jk: θjk ≥ θ0j,   j = 1, ..., J;  k = 1, ..., K
    H1jk: θjk < θ0j,   j = 1, ..., J;  k = 1, ..., K

for some prespecified null effects θ0j. In practice, θ0j is usually taken to be 0 on a relevant scale, such as the log hazard ratio for survival outcomes or the mean difference for continuous outcomes. If the same definitive (D) outcome is monitored throughout the trial (I = D), then the true treatment effect θjk and θ0j are assumed constant for all j. Otherwise, θJk and θ0J correspond to the true and null effects on the definitive outcome, while θjk and θ0j correspond to the intermediate outcome for all j < J and are constant. For sample size and power calculations, a minimum target treatment effect (often the minimum clinically important difference), θ1j, is also required. For example, in the STAMPEDE design with five experimental arms and four stages, there are up to 20 sets of null and alternative hypotheses as above.
In this trial, the null (θ0j) and target (θ1j) treatment effects used in all stages (and comparisons) were 0 and log(0.75), respectively. In MAMS designs that use an intermediate (I) outcome measure at the interim stages, the primary null and alternative hypotheses, H0Jk and H1Jk, concern θJk, with the hypotheses at stages j < J playing a subsidiary role, mainly to calculate the interim stage sample sizes.

Table 2 Treatment effects, statistical information, and correlation between the test statistics of pairwise comparisons in trials with continuous, binary, and survival outcomes, with a common allocation ratio (A) in all pairwise comparisons. See also section "Software and Example" (and "Correlation Structure Between Pairwise Comparisons") for an example where an intermediate outcome is used in a MAMS design and for how to calculate the between-stage correlation structure in that case

Outcome                      Treatment effect or outcome measure (θ)   Fisher's information (V)                              Corr(Zjk, Zj′k), j′ > j   Corr(Zjk, Zjk′)
Continuous                   μjk − μj0                                 Vjk = [σ²j0/nj0 + σ²jk/(A·nj0)]⁻¹                     √(njk/nj′k)               A/(A+1)
Binary (risk difference)     pjk − pj0                                 Vjk = [pj0(1−pj0)/nj0 + pjk(1−pjk)/(A·nj0)]⁻¹         √(njk/nj′k)               A/(A+1)
Binary (log odds ratio)      log[pjk(1−pj0)/(pj0(1−pjk))]              Vjk = [1/(nj0·pj0(1−pj0)) + 1/(A·nj0·pjk(1−pjk))]⁻¹   √(njk/nj′k)               A/(A+1)
Binary (log risk ratio)      log(pjk/pj0)                              Vjk = [(1−pj0)/(nj0·pj0) + (1−pjk)/(A·nj0·pjk)]⁻¹     √(njk/nj′k)               A/(A+1)
Survival (log hazard ratio)  log(HRjk) = log(λjk/λj0)                  Vjk = [1/ej0 + 1/ejk]⁻¹                               √(ejk/ej′k)               A/(A+1)
The joint distribution of the Z-test statistics therefore follows a multivariate normal distribution, MVN(θ√V, Σ), where θ and V are the J × K matrices of the mean treatment effects and the corresponding Fisher's (observed) information (Vjk = 1/var(θ̂jk)), and Σ denotes the correlation matrix between the J × K test statistics – see the last two columns of Table 2; for example, in trials with time-to-event outcomes, Corr(Zjk, Zj′k) = √(ejk/ej′k) for j′ > j.
The one-sided stagewise significance level αj plays two key roles in the MAMS design. Together with the power ωj, it is used as a design parameter to calculate the required (cumulative) sample size at the end of stage j. Further, it acts as the stopping boundary for lack-of-benefit at the end of stage j. In principle, different stopping boundaries can be specified for each pairwise comparison in a MAMS design; for simplicity, here we assume the same stopping boundaries (αj) for all pairwise comparisons. Section "Choosing Pairwise Design Significance Level and Power" explains how to choose the stagewise stopping rules and design power. The interim lack-of-benefit stopping boundaries can also be defined on the Z-test statistic, since there is a one-to-one correspondence between the two scales, that is, lj = zαj for j = 1, ..., J − 1. For simplicity of notation, let L = (l1, ..., lJ−1) be the stopping boundaries for lack-of-benefit prespecified for the interim stages, corresponding to the one-sided significance levels α1, ..., αJ−1 in all k comparisons – see section "Choosing Pairwise Design Significance Level and Power." For example, for survival outcomes, where the treatment effect is measured by the (log) hazard ratio, L forms an upper bound, because the alternative treatment effect being targeted indicates a relative reduction in the (log) hazard compared to the control arm.
In designs that include interim stopping boundaries for overwhelming efficacy on the primary outcome measure, another set of significance levels, αj(E), should be specified at the design stage. This may be desirable for both investigators and sponsors, because identifying effective regimens earlier increases the efficiency of the design further by reducing the resources allocated to these arms. It may also result in stopping the trial early, progressing efficacious arms to the subsequent phase of the testing process or to regulatory approval, and thus expediting uptake of the treatment by patients. Two popular efficacy stopping boundaries are the Haybittle-Peto and O'Brien-Fleming stopping rules. Blenkinsop et al. (2019) investigated the impact of efficacy stopping rules on the operating characteristics of the MAMS design, and section "Software and Example" illustrates how to control the overall type I error rate in MAMS designs with both lack-of-benefit and efficacy stopping rules, using the STAMPEDE trial as an example. Let B = (b1, b2, ..., bJ) be the stopping boundaries for overwhelming evidence of efficacy on the primary outcome, where bj = zαj(E) at the interim stages and bJ is the threshold for assessing efficacy at the final analysis, corresponding to αJ. The two stopping boundaries meet at the final stage J to ensure that a conclusion can be made regarding efficacy. In I = D designs, the primary outcome test statistic is compared to the stopping boundaries at each stage, where one of three outcomes can occur (assuming binding boundaries – see section "Binding/Nonbinding Stopping Boundaries"):

• If bj < Zjk < lj, experimental arm k continues to the next stage (recruitment and treatment k continue).
• If Zjk ≥ lj, experimental arm k is "dropped" for lack of benefit (recruitment and treatment k stop).
• If Zjk ≤ bj, the corresponding null hypothesis can be rejected early, and experimental arm k is stopped early for overwhelming efficacy.

Note that follow-up of randomized subjects in the "dropped" experimental arms should continue to the planned end of the trial. This has two main advantages. First, follow-up can help capture the relevant information on safety endpoints. Second, any potential bias in the estimated treatment effects on the definitive outcome can be markedly reduced by following all patients up to the planned end of the trial and performing the analyses then, irrespective of whether recruitment was stopped early for lack of benefit (Choodari-Oskooei et al. 2013).
For the experimental treatments which pass through all interim stages without crossing the boundaries, at the final stage J the test statistic for each experimental arm is compared with the final-stage threshold bJ to assess efficacy, where one of two outcomes can occur:

• If ZJk > bJ, the test is unable to reject the final-stage null hypothesis H0Jk at level αJ.
• If ZJk ≤ bJ, reject H0Jk at level αJ and conclude efficacy for experimental arm k.
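A minimal sketch of these decision rules, in Python for illustration only, might look as follows; the boundary values are hypothetical, with a Haybittle-Peto-type interim efficacy level (p ≤ 0.0005) borrowed from the STAMPEDE illustration later in this chapter, and the sign convention follows the text (negative z favors the experimental arm).

    from scipy.stats import norm

    def interim_decision(z, l_j, b_j):
        """One arm at an interim stage; b_j (efficacy) < l_j (lack-of-benefit)."""
        if z <= b_j:
            return "stop early for overwhelming efficacy"
        if z >= l_j:
            return "drop arm for lack of benefit (follow-up continues)"
        return "continue to the next stage"

    def final_decision(z_J, b_J):
        """Final-stage decision on the definitive outcome."""
        return "reject H0Jk: efficacy" if z_J <= b_J else "cannot reject H0Jk"

    # Hypothetical boundaries: l1 = z_0.5 = 0 and a Haybittle-Peto-type b_j
    print(interim_decision(z=-1.1, l_j=norm.ppf(0.5), b_j=norm.ppf(0.0005)))  # continue
    print(final_decision(z_J=-2.1, b_J=norm.ppf(0.025)))                      # efficacy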

In designs which include efficacy stopping boundaries on the D-outcome measure and use the I-outcome measure for the lack-of-benefit analysis, there are two (correlated) outcome measures with their respective (correlated) test statistics. The test statistic on the I outcome is used at the interim stages j = 1, ..., J − 1 for the lack-of-benefit analysis only, to eliminate treatment arms that do not show the prespecified level of activity – that is, it is not used for the efficacy analysis. In section "Intermediate and Definitive Outcomes," we describe the impact of the I-outcome measure on the operating characteristics of the design, with guidance on how to choose the I outcome. Section "Software and Example" uses the STAMPEDE trial as an example to calculate the operating characteristics of a MAMS design with an intermediate outcome, as well as the correlation structure needed for this purpose.
Steps to Design a MAMS Trial

The following steps should be taken to design a MAMS trial with both lack-of-
benefit and efficacy stopping boundaries – see section “Considerations in Design,
Conduct, and Analysis of a MAMS Trial” for further guidelines on some of the
points:

1. Choose the number of experimental arms, K, and stages, J – see section "Design Considerations."
2. Choose the definitive (D) outcome and, optionally, the intermediate (I) outcome – see section "Intermediate and Definitive Outcomes."
3. Choose the null values for θ – for example, the (log) hazard ratios on the intermediate (θ0I) and definitive (θ0D) outcomes – see section "Software and Example."
4. Choose the minimum clinically relevant target treatment effect size – for example, in the time-to-event setting, the (log) hazard ratio on the intermediate (θ1I) and definitive (θ1D) outcomes.
5. Choose the control arm event rate (or median survival) in trials with a binary (or survival) outcome – see section "Software and Example."
6. Choose the allocation ratio A, the number of patients allocated to each experimental arm for every patient allocated to the control arm. For a fixed-sample (1-stage) multi-arm trial, the optimal allocation ratio (i.e., the one that minimizes the sample size for a fixed power) is approximately A = 1/√K – see the sketch after this list.
7. In I ≠ D designs, choose the correlation between the estimated treatment effects on the I and D outcomes. An estimate of the correlation can be obtained by bootstrapping relevant existing trial data – see sections "Correlation Structure Between Pairwise Comparisons" and "Software and Example," and Sect. 2.7.1 in Royston et al. (2011) for further details.
8. Choose the accrual rate per stage to calculate the trial's timelines – see section "Software and Example."
9. Choose a one-sided significance level for lack-of-benefit and the target power for each stage (αjk, ωjk). The chosen values of αjk and ωjk are used to calculate the required sample size for each stage – see sections "Choosing Pairwise Design Significance Level and Power" and "Design Considerations."
10. Choose whether to allow early stopping for overwhelming efficacy on the primary (D) outcome. If yes, choose an appropriate efficacy stopping boundary αj(E) on the D-outcome measure for each stage 1, ..., J, where αJ(E) = αJ. Possible choices are the Haybittle-Peto or O'Brien-Fleming stopping boundaries used in group-sequential designs, or one based on α-spending functions (Blenkinsop et al. 2019).
11. Given the above design parameters, calculate the control and experimental arm (effective) sample sizes required to trigger each analysis – that is, njk in trials with continuous and binary outcomes and ejk in trials with time-to-event outcomes – as well as the operating characteristics of the design, including the overall type I error rate and power. If the desired (prespecified) overall type I error rate and power have not been maintained – for instance, if the overall pairwise power is smaller than the prespecified value – steps 9–11 should be repeated until success. Similarly, if the overall type I error rate is larger than the prespecified value, one can choose a more stringent (lower) design alpha for the final stage, αJ, and repeat steps 9–11 until the desired overall type I error rate is achieved – see sections "Software and Example" and "Design Considerations."
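As a bookkeeping sketch of steps 1–10, the design inputs for the original STAMPEDE comparisons can be collected as below; Python is used purely for illustration, and step 11 is then carried out by nstage, as shown in section "Software and Example." The sketch also evaluates the approximate optimal allocation ratio from step 6.

    import math

    # Steps 1-10 for the original STAMPEDE comparisons, collected as plain data
    design = {
        "K_arms": 5, "J_stages": 4,                    # step 1
        "I_outcome": "FFS", "D_outcome": "OS",         # step 2
        "hr0": (1.0, 1.0), "hr1": (0.75, 0.75),        # steps 3-4: null and target HRs (I, D)
        "median_survival_years": (2, 4),               # step 5: control arm medians (I, D)
        "allocation_ratio": 0.5,                       # step 6
        "corr_I_D": 0.60,                              # step 7
        "accrual_per_year": 500,                       # step 8
        "alpha": (0.50, 0.25, 0.10, 0.025),            # step 9: one-sided, per stage
        "omega": (0.95, 0.95, 0.95, 0.90),             # step 9: per-stage power
        "efficacy_boundary": "Haybittle-Peto",         # step 10
    }

    # Step 6: optimal fixed-sample allocation ratio, A ~ 1/sqrt(K)
    print(f"A_opt = {1 / math.sqrt(design['K_arms']):.3f}  (STAMPEDE used A = 0.5)")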

Analysis at Interim and Final Stages

In a MAMS design, the end of each stage is reached when the accumulated trial data attain the predetermined (effective) sample size for that stage. The effective sample size is the number of subjects in designs with binary and continuous outcomes (Bratton et al. 2013), and the number of required events in designs with time-to-event outcomes (Royston et al. 2011). Reaching the end of each stage triggers an interim analysis of the accumulated trial data. The outcome of the analysis is a decision to discontinue recruitment to a particular experimental arm for lack-of-benefit on I (or D), to terminate the trial for efficacy on D, or to continue.
At each interim analysis, the treatment effects are estimated using an appropriate
analysis method. For example, the Cox proportional hazards model can be used to
estimate the log hazard ratio, and calculate the corresponding test statistic and P-values.
In I = D designs, the primary outcome test statistic is compared to the stopping boundaries at each stage, where one of the three outcomes set out in section "Design Specification" can occur. In I ≠ D designs with efficacy stopping boundaries for the D outcome which utilize an I outcome for the lack-of-benefit analysis at the interim stages, two sets of treatment effects, θ̂jk(I) (on I) and θ̂jk(D) (on D), are calculated, together with the corresponding test statistics and P-values. The outcome of the analysis is a decision to discontinue recruitment to a particular experimental arm for lack-of-benefit on I, to terminate the trial for efficacy on D, or to continue – see Sect. 2.1 in Blenkinsop and Choodari-Oskooei (2019) for further details on these decision rules.
At the final analysis J, the treatment effect is estimated on the primary outcome
for each experimental arm, and the observed P-value is compared against the final
stage significance level αJk. If the P-value is smaller than αJk, we reject the null
hypothesis corresponding to the definitive outcome and claim efficacy. Otherwise,
the corresponding null hypothesis cannot be rejected at the αJk level.
Section “Analysis Considerations” discusses the issue of analysis in more detail,
particularly the potential impact of the stopping rules on the average treatment effect.
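To illustrate an interim analysis of this kind, the following sketch simulates a single pairwise comparison with exponential survival times, fits a Cox proportional hazards model to estimate the log hazard ratio, and compares the resulting one-sided P-value with the stage-1 boundary. It assumes the Python lifelines package and is purely illustrative: all data are simulated, and this is not the analysis code of any trial.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n0, n1 = 400, 200                                  # 2:1 control:experimental allocation
    arm = np.r_[np.zeros(n0), np.ones(n1)]
    hazard = np.where(arm == 1, 0.75 * 0.35, 0.35)     # true HR of 0.75 (hypothetical rates)
    event_time = rng.exponential(1.0 / hazard)
    censor_time = rng.exponential(4.0, size=arm.size)  # random censoring (hypothetical)
    df = pd.DataFrame({"time": np.minimum(event_time, censor_time),
                       "event": (event_time <= censor_time).astype(int),
                       "arm": arm})

    # Estimate the log hazard ratio with a Cox proportional hazards model
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    log_hr = cph.params_["arm"]
    z = log_hr / cph.standard_errors_["arm"]
    p_one_sided = norm.cdf(z)                          # small p favors the experimental arm

    alpha_1 = 0.5                                      # stage-1 boundary (STAMPEDE, Table 1)
    print(f"log HR = {log_hr:.3f}, one-sided p = {p_one_sided:.4f}")
    print("continue" if p_one_sided <= alpha_1 else "drop for lack of benefit")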

Choosing Pairwise Design Significance Level and Power

The design stagewise type I error (αjk) and power (ωjk) are important in realizing a
MAMS design. Together with the target effect sizes, they are the main driver of the
stagewise sample sizes. The choice of their values is guided by two considerations. First, it is essential to maintain a high overall pairwise power, ωk, defined in Eq. (2) in section "Pairwise Type I Error Rate and Power," for each comparison in the trial. The implication is that, for testing the treatment effect at an interim analysis, the design interim-stage power ωjk (j < J) should be high, for example at least 0.95. For testing the treatment effect on the definitive outcome, the design pairwise power at the final stage, ωJk, should also be high, perhaps at least 0.90, which is higher than many academic trials might traditionally select. The main cost of using a larger number of stages is a (slight) reduction in the overall pairwise power ωk. For example, the overall pairwise power in the STAMPEDE trial with four stages is about 0.84 under binding stopping rules for lack-of-benefit – see Table 1 for the stagewise design values of αjk and ωjk. Under nonbinding rules, ωk is equal to the final-stage design power ωJk, that is, 0.90. In section "Software and Example," we describe how to calculate a lower bound for the overall pairwise power when an estimate of the between-stage correlation structure is not available.
Second, given the design stagewise power ωjk, the values chosen for αjk largely govern the effective sample sizes required at each stage and the stage durations. Generally, larger-than-traditional (more permissive) values of αjk are used at the interim stages, so that a decision on dropping or continuing arms can be made reasonably early, that is, with a relatively small sample size. It is necessary to use descending values of αjk, otherwise some of the stages become redundant. Royston et al. (2011) suggested a geometrically descending sequence of αjk values starting at α1k = 0.5, with αjk = 0.5^j (j < J) and αJk = 0.025; the latter mimics the conventional 0.05 two-sided significance level for tests on the D outcome. Further, Bratton (2015) proposed a family of α-functions and a systematic search procedure to find the stagewise (design) significance levels; these have been implemented in the nstagebinopt Stata program to find efficient MAMS designs (Choodari-Oskooei et al. 2022a). Section "Considerations in Design, Conduct, and Analysis of a MAMS Trial" discusses the issues around the timing and frequency of interim analyses in more detail, including both design and trial-conduct implications, and section "Design Considerations" addresses some of the challenges in choosing the stagewise (design) power and significance levels to increase the efficiency of a MAMS design.
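As a small numerical illustration of these choices, the sketch below generates the geometrically descending significance levels suggested by Royston et al. (2011) and the product-form lower bound on the overall pairwise power that applies when the stages are treated as independent; the bound using the full correlation structure is shown in section "Software and Example."

    import numpy as np

    J = 4
    alpha = [0.5 ** j for j in range(1, J)] + [0.025]  # 0.5, 0.25, 0.125, 0.025
    omega = [0.95, 0.95, 0.95, 0.90]                   # STAMPEDE stagewise design powers

    print("geometric stagewise alphas:", alpha)        # STAMPEDE chose 0.10 at stage 3
    # Lower bound on the overall pairwise power when the stages are treated as
    # independent (rho = 0): the product of the stagewise powers, ~0.77 here
    print("power lower bound:", round(float(np.prod(omega)), 3))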

Intermediate and Definitive Outcomes

The MAMS framework of Royston et al. (2003) allows the use of an intermediate (I) outcome at the interim stages, which can speed up the weeding out of insufficiently promising treatments. This markedly increases the efficiency of the design, since recruitment to unpromising arms is discontinued much faster than it would be otherwise. Choosing appropriate and valid intermediate and definitive (D) outcomes is key to the success of the MAMS design (Royston et al. 2011).
The basic assumption is that "information" on I accrues at the same or a faster rate than information on the D outcome, where information is defined as the
inverse of the variance of the treatment effect estimator. Another assumption is that the I outcome lies on the pathway between the treatments and the D outcome: if the null hypothesis is true for I, it must also hold for D. In this setting, the I outcome does not have to be a perfect or true surrogate outcome for the definitive outcome as defined by Prentice (1989). In the absence of an obvious choice for I, a rational choice at the interim stages is D itself, although in some settings the efficiency of the design is then reduced, since the interim analyses will be delayed compared with using a genuine I outcome. In this case, each pairwise comparison resembles a parallel group-sequential design. In the cancer context, typical intermediate and definitive outcomes might be PFS and OS, respectively: information on PFS is usually available sooner in a study, and in many cancer sites the treatment effect on PFS is highly positively correlated with that on OS (Royston et al. 2011).

Operating Characteristics

The operating characteristics of a conventional two-arm trial are quantified by the type I error rate and power of the design. In MAMS designs, the type I error can be controlled for each pairwise comparison or for a set (family) of pairwise comparisons; these are quantified by the pairwise (PWER) and familywise (FWER) type I error rates, respectively. The simplest (and perhaps most useful) measure of power in a MAMS trial is the pairwise power for the comparison of each experimental arm k against control. The correlation structure between the test statistics at different stages is required to calculate these quantities. In the following subsections, we explain how the correlation structure can be estimated, and we define the pairwise and familywise type I error rates and power for MAMS designs with (and without) an intermediate outcome measure.

Correlation Structure Between Pairwise Comparisons

In the MAMS design, correlation between the treatment effect estimates, and hence the corresponding test statistics, is induced in two ways. First, the shared control arm induces a correlation between the treatment effect estimates of pairwise comparisons at the same stage – see the last column of Table 2. Second, since patients accrued in each stage are included in the analyses of subsequent stages, a correlation is induced between the treatment effect estimates at different stages within each pairwise comparison – see the penultimate column of Table 2.
In I ≠ D designs, the correlation between the Z-test statistics at the interim stages and that of the final stage will decrease – which decreases the overall pairwise power; see Table 7 in Royston et al. (2011). In these cases, the formulas presented in the penultimate column of Table 2 are multiplied by ρ, the correlation between the treatment effects on the I and D outcomes at a fixed time-point in the evolution of the trial. For example, in the STAMPEDE trial, with time-to-event I and D outcomes where FFS is used as the I outcome at the interim stages and OS as the D outcome at the final analysis, the correlation between the FFS test statistics at the interim stages and the OS test statistic at the final stage is calculated as ρ·√(eIjk/eDJk), j = 1, 2, 3, where ρ is the correlation between the estimated log hazard ratios on the two outcomes at a fixed time-point. Note that if the I and D outcomes are identical, then ρ = 1 and ρ·√(eIjk/eDJk) reduces to √(eIjk/eDJk) (Royston et al. 2011). For other outcomes a similar formula can be derived (Bratton et al. 2013 for binary outcomes; Follmann et al. 2021 for continuous outcomes). Bootstrap analysis of individual patient data from similar previous trials can be used to assess ρ in a particular setting (Barthel et al. 2009). Section "Software and Example" explains how the correlation structure can be calculated when an I outcome is used at the interim stages in the STAMPEDE trial.
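A sketch of this calculation for the STAMPEDE design is shown below, using the control-arm event counts from Table 1 and ρ = 0.60; up to rounding, it reproduces the R4 matrix reported in section "Software and Example." Python is used here purely for illustration.

    import numpy as np

    e_I = np.array([113, 223, 350])    # control-arm FFS events, stages 1-3 (Table 1)
    e_D = 437                          # control-arm OS deaths at the final stage
    rho = 0.60                         # corr. of log HRs on I and D at a fixed time

    R = np.eye(4)
    for j in range(3):                 # interim-interim: sqrt(e_j / e_j')
        for jp in range(j + 1, 3):
            R[j, jp] = R[jp, j] = np.sqrt(e_I[j] / e_I[jp])
    for j in range(3):                 # interim-final: rho * sqrt(e_I_j / e_D_J)
        R[j, 3] = R[3, j] = rho * np.sqrt(e_I[j] / e_D)

    print(np.round(R, 2))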

Pairwise Type I Error Rate and Power

In a single-stage two-arm design, there is only one way to commit a type I error: at the final analysis. In trials with lack-of-benefit interim stopping boundaries, a type I error can only happen if the trial passes all the interim stopping boundaries – that is, conditional on the treatment arm passing all interim stages. In designs with both lack-of-benefit and efficacy stopping boundaries, there is an increased chance of a type I error, since a type I error can additionally be committed at the interim stages – that is, when an interim efficacy boundary is crossed. For simplicity of notation, we define the type I error rate and power only for designs with lack-of-benefit interim looks.
In designs with J stages and stopping boundaries for lack-of-benefit, Royston et al. (2011) showed that the overall pairwise type I error rate (PWER), αk, and power, ωk, for experimental arm k compared with control are

    αk = ΦJ(zα1k, ..., zαJk; R0J)   under θj = θ0j for all j    (1)

    ωk = ΦJ(zω1k, ..., zωJk; R1J)   under θj = θ1j for all j    (2)

where ΦJ is the J-dimensional multivariate normal distribution function with correlation matrix R0J or R1J, whose (j, j′)th entry is the correlation between the treatment effects in stages j and j′, that is, Corr(Zjk, Zj′k) in Table 2. The formulas for αk and ωk in designs with stopping boundaries for both lack-of-benefit and efficacy are more complicated and can be found in Blenkinsop et al. (2019).
In trials with an I outcome, the calculation of αk in Eq. (1) is made under the assumption that the null hypothesis H0 is true for both I and D. However, in this case the type I error rate is maximized when the experimental treatment is highly (indeed infinitely) effective on I but the null hypothesis is true for D. Therefore, the maximum pairwise type I error rate, αmax, is equal to the final-stage significance level, αJk (> αk) – see Bratton et al. (2016). In the STAMPEDE trial (Table 1), the overall pairwise type I error rate calculated from Eq. (1) is 0.013, as reported in Sydes et al. (2009). However, this holds only when H0 is true for both I and D. By the above argument, the maximum pairwise type I error rate for each pairwise comparison, αmax, is actually equal to the final-stage significance level of the trial: αmax = αJ = 0.025.
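A numerical sketch of Eqs. (1) and (2) for the STAMPEDE design is given below, using the (rounded) R4 correlation matrix derived in section "Software and Example" for both equations, as in the chapter's STAMPEDE illustration. It should approximately reproduce the pairwise type I error rate of 0.013 reported in Sydes et al. (2009) and the lower-bound pairwise power of 0.83; the sketch assumes Python with scipy and is illustrative, not the nstage implementation.

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    alpha = [0.50, 0.25, 0.10, 0.025]     # stagewise one-sided significance levels
    omega = [0.95, 0.95, 0.95, 0.90]      # stagewise design powers
    R4 = np.array([[1.00, 0.71, 0.57, 0.31],
                   [0.71, 1.00, 0.80, 0.43],
                   [0.57, 0.80, 1.00, 0.54],
                   [0.31, 0.43, 0.54, 1.00]])

    mvn = multivariate_normal(mean=np.zeros(4), cov=R4)
    pwer = mvn.cdf(norm.ppf(alpha))       # Eq. (1): Phi_4(z_alpha1,...,z_alpha4; R4)
    power = mvn.cdf(norm.ppf(omega))      # Eq. (2): Phi_4(z_omega1,...,z_omega4; R4)
    print(f"PWER ~ {pwer:.3f}, overall pairwise power ~ {power:.2f}")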

Binding/Nonbinding Stopping Boundaries

This raises important questions about the nature of the stopping boundaries for lack-of-benefit at the interim analyses. Are these boundaries strict rules which have to be adhered to, that is, binding stopping boundaries? Or are they simply stopping guidelines, with the decision to continue with the research treatment also depending on other factors, that is, nonbinding stopping boundaries? For instance, if the treatment effect for an experimental arm has an encouraging point estimate at an interim analysis but has not crossed the corresponding stopping boundary, yet the arm has a highly beneficial effect on an important secondary outcome (e.g., safety), then it may be desirable to continue the arm to the next stage of the study for further assessment. Ignoring interim lack-of-benefit stopping guidelines in a MAMS trial will not inflate the maximum pairwise type I error rate, αmax, since it is controlled by the final-stage significance level αJ. The interim stopping boundaries can therefore be considered "nonbinding" in these scenarios.
By contrast, lack-of-benefit stopping boundaries inflate the overall type II error rate: for a fixed sample size they decrease the overall pairwise power, ωk, since under the alternative hypothesis there is a chance of stopping experimental arms that are truly effective. It is therefore recommended to calculate the required sample size using nonbinding stopping boundaries for the type I error rate and binding stopping boundaries for power; this strongly controls both the overall type I and type II error rates at their prespecified levels. Further, in designs that include stopping boundaries for both lack-of-benefit and efficacy, the lack-of-benefit boundaries interact with the efficacy boundaries. For this reason, nonbinding lack-of-benefit boundaries are often a regulatory requirement. In such designs, the efficacy stopping boundaries have the potential to increase the type I error rate with no impact on power (Blenkinsop et al. 2019).

Familywise Type I Error Rate, All-Pair/Any-Pair Power

In multi-arm trials there are multiple ways to commit a type I error. In some settings it is required to control the overall type I error rate, that is, the familywise type I error rate (FWER), at a prespecified level for a set of pairwise comparisons of the experimental arms with the control arm. The FWER is the probability of incorrectly rejecting the null hypothesis for the primary outcome for at least one of the experimental arms from a set of comparisons in a multi-arm trial. Its value is higher than the PWER if no correction for multiple testing is made.
In trials with multiple experimental arms, the maximum possible FWER often needs to be calculated and known (Wason et al. 2014). In some multi-arm trials, this maximum value needs to be controlled at a predefined level. This is called strongly controlling the FWER, as it covers all eventualities, that is, all possible configurations of hypotheses (Bratton et al. 2016). Magirr et al. (2012) showed that the FWER is maximized
under the global null hypothesis, H0G, that is, when the null hypothesis that maximizes the pairwise type I error rate is true for all arms.
In a family of K "independent" pairwise comparisons, each with its own control group and a PWER of αk, the overall type I error rate (FWER) is

    FWER = Pr(reject at least one H0k | H0G)
         = Pr(reject H01 or H02 ... or H0K | H0G)
         = 1 − Pr(accept H01 and H02 ... and H0K | H0G)
         = 1 − ∏(k=1 to K) (1 − αk).

When α1 = α2 = ... = αK = α, the FWER can be calculated using the Šidák formula (Šidák 1967):

    FWERS = 1 − (1 − α)^K.    (3)

The Bonferroni correction can also be used to approximate the FWER. For example, if the family includes two (independent) pairwise comparisons, each with a (one-sided) PWER of α1 = α2 = 0.025, then FWERS = 0.0494 from Eq. (3). In this case, the Bonferroni correction also provides a good approximation, that is, α1 + α2 = 0.05.
To allow for the correlation between the test statistics of different pairwise comparisons, one can replace the term (1 − α)^K in Eq. (3) with an appropriate quantity to reduce the FWER and gain some efficiency in scenarios where strong control of the FWER is required. Dunnett (1955) developed an analytical formula for the FWER in multi-arm trials which takes account of this correlation structure. For designs with nonbinding stopping boundaries for lack-of-benefit, the maximum FWER can be computed using the Dunnett probability (Bratton et al. 2016):

    FWER = 1 − ΦK(z1−α1, ..., z1−αK; C)    (4)

where ΦK is the K-dimensional multivariate normal distribution function and C is the K × K between-comparison correlation matrix with off-diagonal elements equal to A/(A + 1), where A is the allocation ratio. For a family of two pairwise comparisons with a shared control and a PWER of α1 = α2 = 0.025, the FWER is 0.0455 from Dunnett's formula; in this case, the FWER is calculated more accurately than from Eq. (3) or the Bonferroni correction. In designs which require strong control of the FWER at 0.025, the simplest approach is to choose the final-stage significance level for each pairwise comparison, αJk, such that the maximum FWER from Eq. (4) is 0.025. For example, in a design with two pairwise comparisons with equal allocation to all arms (and no stopping boundary for an interim efficacy analysis), the final-stage significance level which controls the FWER at 0.025 is αJ = 0.0161 for both comparisons. This is larger than the significance levels obtained from the other methods, that is, the Bonferroni correction (αJ = 0.0125) or the Šidák formula (αJ = 0.0126), and therefore results in a lower sample size, increasing the efficiency of the design.
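The following sketch reproduces these numbers – the Šidák FWER from Eq. (3), the Bonferroni approximation, the Dunnett FWER from Eq. (4) for two comparisons with equal allocation (between-comparison correlation A/(A + 1) = 0.5), and the final-stage significance level that controls the FWER at 0.025. It is a Python illustration, not the nstage implementation.

    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import norm, multivariate_normal

    alpha, K = 0.025, 2
    print("Sidak:     ", 1 - (1 - alpha) ** K)        # Eq. (3): 0.0494
    print("Bonferroni:", K * alpha)                   # 0.05

    def dunnett_fwer(a, K=2, A=1.0):
        """Eq. (4): FWER = 1 - Phi_K(z_{1-a}, ..., z_{1-a}; C), off-diagonals A/(A+1)."""
        C = np.full((K, K), A / (A + 1.0))
        np.fill_diagonal(C, 1.0)
        z = np.full(K, norm.ppf(1.0 - a))
        return 1.0 - multivariate_normal(mean=np.zeros(K), cov=C).cdf(z)

    print("Dunnett:   ", round(dunnett_fwer(alpha), 4))   # ~0.0455
    # Final-stage alpha_J that controls the FWER at 0.025 (~0.0161 per the text)
    print("alpha_J:   ", round(brentq(lambda a: dunnett_fwer(a) - 0.025, 1e-4, 0.025), 4))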
In multi-arm designs, two other types of power can be calculated: any-pair and all-pair power. Any-pair power is the probability of showing a statistically significant effect, under the targeted effects, for at least one comparison; all-pair power is the probability of showing a statistically significant effect, under the targeted effects, for all comparisons. The three measures of power are identical in a two-arm trial, but in a multi-arm design the power measure of interest may depend on the objective of the trial (Choodari-Oskooei et al. 2020). For more complex designs, with both overwhelming-efficacy and lack-of-benefit stopping boundaries or with a new experimental arm added at a later stage, Eq. (4) can become quite complicated (Blenkinsop et al. 2019; Choodari-Oskooei et al. 2020).

MAMS Selection Designs

In a MAMS design, all experimental arms can reach their final stage of recruitment if they pass each interim-analysis activity boundary. As a result, the number of experimental arms recruiting at each stage cannot be predetermined, and the actual sample size of the trial can vary considerably, with its maximum reached when all treatment arms continue to the final stage. In trials with a large number of experimental arms, the maximum sample size might be too large either to achieve or for any funding agency to fund. In such cases, it may be more appropriate to prespecify the number of experimental arms that will be taken forward to each stage, alongside a criterion for selecting them. One example of such a design is the ROSSINI-2 surgical trial – see Fig. 2.
The selection of research arms can be based on the ranking of treatment effects, or on a combination of efficacy and safety results. Traditionally, selection of the most promising treatments has been made in phase II trials, where strict control of the operating characteristics is not a particular concern. In a MAMS selection design, the selection and confirmatory stages are implemented within one trial protocol, and selection of the most promising treatments can be made in multiple stages – see Fig. 2. Patients are randomized from the start to all the experimental and control arms, and the primary analysis of the experimental arms that reach the final stage includes all randomized individuals from the start. The advantages of MAMS selection designs are (lower) maximum sample sizes and simpler planning. However, the selection process might become complicated if all arms look promising at the selection stage, which might affect the operating characteristics of the design (Stallard et al. 2015). In that case, however, the interest lies less in the individual arms than in the process of taking an appropriate treatment forward to the next stage of the trial.
In MAMS selection designs, the primary aim is to select the most promising treatments with a high probability of correct selection, while strong control of the error rates is required in the phase III setting. The probability of correct selection is driven by the underlying treatment effects, the timing of selection, and the number of comparisons.

Fig. 2 Schematic representation of the ROSSINI-2 selection design with seven experimental arms and three stages. There are two prespecified subset selection stages in this trial

Stallard et al. (2015) developed analytical derivations for the type I
and II error rates in a two-stage design – more details are in Stallard and Todd (2003). Further research is needed to explore the operating characteristics of selection designs when an I-outcome measure is used to select the best-performing arms. In this case, the power of the design might be adversely affected if the rankings of the treatment effects on the I and D outcomes are not similar across experimental treatments; this might also affect the average treatment effects in the arms that reach the final stage. Simulations can be used to explore the operating characteristics of the design as well as the extent of bias in the average treatment effects. Key practical issues to consider in the simulations are the timing of selection, the selection criteria (e.g., ranking), and the number of experimental arms selected at each stage.
When there are several experimental treatments, it is sometimes more efficient to reduce the number of selected arms over multiple stages (Wason et al. 2017); otherwise, the probability of correct selection and the power of the design might be adversely affected. An example is the ROSSINI-2 design, where treatment selection is done in two stages. It has been shown that implementing treatment selection in the ROSSINI-2 trial could reduce the maximum sample size by up to 7% (370 patients) compared with the MAMS design with no selection imposed, without adversely affecting the operating characteristics of the trial. Simulations can be used to explore the impact of the number of treatment arms selected at each stage on the operating characteristics of the design.
In MAMS selection designs, power is lost only under extreme selection criteria. For example, in the ROSSINI-2 trial with seven research arms, at least four arms should be selected at the first interim analysis and at least three arms at the second stage to preserve power; in this case, the probability of selecting the truly best arm remains above 90%. The timing of the first selection stage is also important. The probability of correct selection is generally low if the selection of treatment arms is done too early in the course of the trial – for example, earlier than 15% of information time in some settings; see Stallard and Todd (2003) and Choodari-Oskooei et al. (2022b). For a design such as ROSSINI-2, it has been suggested not to select before a fifth of the total planned patients have been recruited.
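The kind of simulation described above can be sketched in a few lines. The toy Monte Carlo below (in Python; all effect sizes, standard errors, and selection rules are hypothetical, and the correlation between looks is ignored for simplicity) estimates the probability that the truly best of seven arms survives a ranking-based selection keeping the top four arms at the first look and the top three at the second.

    import numpy as np

    rng = np.random.default_rng(2)
    true_effect = np.array([-0.3, -0.2, -0.1, 0.0, 0.0, 0.0, 0.0])  # arm 0 is truly best
    se1, se2 = 0.15, 0.10        # hypothetical standard errors at the two looks
    n_sim, hits = 20_000, 0

    for _ in range(n_sim):
        z1 = true_effect + rng.normal(0.0, se1, 7)         # noisy stage-1 estimates
        keep1 = np.argsort(z1)[:4]                         # keep the 4 most promising arms
        z2 = true_effect[keep1] + rng.normal(0.0, se2, 4)  # stage-2 estimates (toy:
        keep2 = keep1[np.argsort(z2)[:3]]                  #   independent across looks)
        hits += 0 in keep2                                 # did the best arm survive?

    print(f"P(truly best arm survives selection) ~ {hits / n_sim:.3f}")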

Adding New Research Arms and Comparisons

Phase III randomized clinical trials can take several years to complete in some disease areas and require considerable resources. During this time, new promising treatments may emerge which warrant testing. The practical advantages of incorporating new experimental arms into an existing trial protocol have been clearly stated in previous studies, not least because it obviates the often lengthy process of initiating a new trial and the competition between trials to recruit patients (Ventz et al. 2017; Schiavone et al. 2019; Hague et al. 2019). Funding bodies and scientific committees may wish to strategically encourage such collaboration. The MAMS design framework can be implemented as a platform trial. One such example is the STAMPEDE trial, which has incorporated five new pairwise comparisons, with more to follow, each starting accrual more quickly than the original comparisons (Sydes et al. 2012).
When adding a new experimental arm, there are two major (statistical) considerations. First, a decision must be made on whether to control the type I error rate across both the existing and the new comparisons, that is, whether to adjust for multiplicity and control the FWER. The decision to focus control on the PWER or the FWER (for a set of pairwise comparisons) depends on the type of research questions being posed and whether they are related in some way, for example, testing different doses or durations of the same therapy, in which case control of the FWER is required. These are mainly practical considerations and should be determined on a case-by-case basis in the light of the rationale for the hypotheses being tested and the aims of the trial protocol. Second, if strong control of the FWER is required in this setting, the type I error rate for the set of comparisons must be calculated accordingly.
Choodari-Oskooei et al. (2020) developed a set of guidelines that can be used to decide whether multiplicity adjustment is necessary when adding a new experimental arm – that is, whether to control the PWER or the FWER in a particular design; see Fig. 2 in Choodari-Oskooei et al. (2020). The emerging consensus among the broader scientific community is that in most multi-arm trials, where the rationale for the research treatments (or combinations) in the existing and added comparisons is derived separately, the greater focus should be on controlling each pairwise error rate (Parker and Weir 2020; O'Brien 1983; Cook and Farewell 1996). In designs where the FWER for the protocol as a whole is required to be controlled at a certain
level, the overall type I error can be split accordingly between the original and added comparisons, and each can be powered using its allocated type I error rate.
To calculate the overall type I error, the Dunnett probability in Eq. (4) can be extended to control the FWER when new experimental arms are added to a MAMS (platform) trial (Choodari-Oskooei et al. 2020). The idea is to adjust the correlated test statistics by a factor that reflects the size of the shared control group used in the pairwise comparisons. This allows the operating characteristics to be calculated for a set of pairwise comparisons in a platform trial with either planned or unplanned addition of a new experimental arm. Choodari-Oskooei et al. showed that the FWER is driven more by the number of pairwise comparisons in the family than by the timing of the addition of the new arms. The shared control arm information common to the comparisons (i.e., shared control arm individuals for continuous and binary outcomes, and shared control arm primary outcome events for time-to-event outcomes) and the allocation ratio are required to calculate the FWER. However, the FWER can be estimated using the Bonferroni correction if there is no substantial overlap between the new comparison and the existing ones, or when the correlation between the test statistics of the new comparison and those of the existing comparisons is less than 0.30, that is, the correlation in the last column of Table 2. Finally, in a recent review, Lee et al. (2021) addressed the various statistical considerations that arise when adding a new research arm to platform trials.

Software and Example

The availability of user-friendly software is key to implementing all designs, including advanced designs such as MAMS. Meyer et al. (2021) conducted a systematic literature search to identify commercial and open-source software aimed at designing platform and multi-arm multi-stage clinical trials. The nstage suite in Stata and the MAMS package in R are freely available (Blenkinsop and Choodari-Oskooei 2019; Jaki et al. 2019). Two commercial packages that can be used for sample size calculations are EAST6 (http://www.cytel.com/software/east) and AddPlan (www.aptivsolutions.com/adaptive-trials/addplan6/). However, most of these programs handle only continuous outcome measures and, to our knowledge, none can yet accommodate the use of intermediate outcome measures at interim analyses. To use these packages effectively, a sound understanding of the underlying theory is generally required.
The user-written (and freely available) nstage and nstagebin programs in Stata calculate the sample size for MAMS designs with time-to-event and binary outcomes, respectively. Both programs are supported by a menu and input window, nstagemenu (Blenkinsop and Choodari-Oskooei 2019; Choodari-Oskooei et al. 2022a). They also calculate the operating characteristics of the design, such as the pairwise and familywise type I error rates, as well as the trial timelines based on the design assumptions, and both can accommodate the use of an intermediate outcome measure at interim analyses. The latest versions of nstage and nstagebin can be obtained from the SSC archive, a well-known repository for user-written Stata commands, by issuing the following command in Stata: ssc install nstage. Note that in this section we use a new (beta) version of the nstage program which gives more accurate sample sizes than previously reported; the new version will also be made available on the SSC archive.
We demonstrate how the nstage command can be used to design the STAMPEDE trial with time-to-event I and D outcomes, following the steps in section "Steps to Design a MAMS Trial." The design parameters for the original comparisons of the STAMPEDE trial were as follows. Five experimental treatments were to be compared against the control arm in four stages, specified using the arms(6 6 6 6) and nstage(4) options of the nstage command below. Each comparison was powered to detect a target hazard ratio of 0.75 on both the I and D outcome measures, hr0(1 1) hr1(0.75 0.75). From previous studies, an estimate of the correlation between survival times on the I (excluding D) and D outcomes was available, corr(0.60), as were the median survival times (in years) for the I and D outcomes, t(2 4). Patients were allocated to the control arm in a 2:1 ratio, aratio(0.5), to increase power. The accrual rate was assumed to be 500 patients per year in all stages, accrue(500 500 500 500). The stopping boundaries for lack-of-benefit were chosen as 0.5, 0.25, 0.10, and 0.025, alpha(0.50 0.25 0.10 0.025). The original design included only lack-of-benefit stopping boundaries – see Table 1. Here, we also include the Haybittle-Peto efficacy stopping boundary for illustration, esb(hp).
Given the above design parameters, nstage calculates the stagewise sample sizes
and overall operating characteristics of the design. The first table after the nstage
command shows the stagewise design specifications and the overall operating
characteristics of the trial. The second and third columns of the operating character-
istics table report the chosen stopping boundaries required for stopping for lack-of-
benefit and efficacy at each interim stage. In this example, each stage requires
p  0.0005 on the definitive outcome measure to declare efficacy early, that is,
Haybittle-Peto rule, shown under the column Alpha(ESB). There is no efficacy
boundary for the final stage, since it is equal to the final stage boundary for lack-of-
benefit, denoted in the column Alpha(LOB). Assuming exponential distribution
(and proportional hazards) for the FFS and OS, and based on the given median
survivals, the first of three interim looks is expected at 2.44 years and the final
analysis at 7.16 years. Since the design includes multiple pairwise comparisons, the
output also presents the maximum FWER, defined in Eq. (4), as the type I error
measure of interest. The upper bound for the overall pairwise power is 0.90 because
it assumes non-binding stopping boundaries for lack-of-benefit since the type I error
rate is maximized under this assumption. However, if we have an estimate of the
correlation between the FFS and OS (log) hazard ratios, ρ, we can calculate a lower
bound for the pairwise power using Eq. (2) with correlation matrix R4. We refer to
this as the correlation between treatment effects on I and D within the trial, not across
cognate trials. The $(j, j')$th entry of $R_J$ for the interim stages, that is, the
correlation between the estimated FFS (log) hazard ratios at interim stages $j$ and
$j'$, can be calculated from the formula presented in Table 2 – that is,
$\sqrt{e_{Ijk}/e_{Ij'k}}$ for $j' > j$ and $j, j' = 1, 2, 3$, where $e_{Ijk}$ is the
number of I-outcome events at interim stage $j$ for the $k$th comparison.
Specifically, the correlation of the FFS (log) hazard ratios is time-dependent, and its
value depends on the accumulated numbers of events at different times. However,
the correlation between the FFS (log) hazard ratios at the interim stages and that of
the OS at the final stage should be calculated using $\rho\sqrt{e_{Ijk}/e_{DJk}}$,
$j = 1, 2, 3$ (Royston et al. 2011). For other outcomes a similar formula can be
derived – for example, $\rho\sqrt{n_{jk}/n_{Jk}}$ for continuous outcomes; see
Follmann and Proschan (2021). In STAMPEDE, assuming a correlation of
$\rho = 0.60$ between the FFS and OS (log) hazard ratios, the $R_4$ correlation
matrix is
$$
R_4 =
\begin{pmatrix}
1 & 0.71 & 0.57 & 0.31 \\
0.71 & 1 & 0.80 & 0.43 \\
0.57 & 0.80 & 1 & 0.54 \\
0.31 & 0.43 & 0.54 & 1
\end{pmatrix}
$$

With this correlation structure, the lower bound for the overall pairwise power, ω,
is 0.83. If we do not have an estimate of the correlation between the treatment effects
on I and D, the lower bound for the overall pairwise power can be calculated
assuming ρ = 0, that is, no correlation between the I and D outcome measures,

$$
\omega_k = \Phi_J(z_{\omega_{1k}}) \cdot \Phi_J(z_{\omega_{2k}}) \cdot \Phi_J(z_{\omega_{3k}}) \cdot \Phi_J(z_{\omega_{4k}}) = \omega_{1k}\,\omega_{2k}\,\omega_{3k}\,\omega_{4k}
$$

In this case, the lower bound for the overall pairwise power is 0.77 (= 0.95 ×
0.95 × 0.95 × 0.90). The all-pairs and any-pairs power can also be obtained with the
return list command (output not shown).
nstage, nstage(4) alpha(0.5 0.25 0.1 0.025) omega(0.95 0.95 0.95 0.9) hr0(1 1)
hr1(0.75 0.75) accrue(500 500 500 500) arms(6 6 6 6) t(2 4) corr(0.60) aratio(0.5)
esb(hp)

Operating characteristics

Stage   Alpha(LOB)a   Alpha(ESB)a   Power   HR|H0   HR|H1   Crit.HR(LOB)   Crit.HR(ESB)   Lengthb   Timeb
1       0.5000        0.0005        0.950   1.000   0.750   1.000          0.439          2.436     2.436
2       0.2500        0.0005        0.950   1.000   0.750   0.920          0.512          1.189     3.625
3       0.1000        0.0005        0.950   1.000   0.750   0.882          0.553          1.161     4.786
4       0.0250        -             0.900   1.000   0.750   0.840          -              2.375     7.161

Max. Pairwise Error Rate 0.0256; Pairwise Power 0.9000
Max. Familywise Error Rate (SE) 0.1056 (0.0003)

LOB lack of benefit, ESB efficacy stopping boundary
a All alphas are one-sided
b Length (duration of each stage) is expressed in periods and assumes survival times are exponentially distributed. Time is expressed in cumulative periods
Sample size and number of events

Stage 1             Overall   Control   Exper.
Arms                      6         1        5
Acc. rate               500       143      357
Patientsa              1218       348      870
Eventsb                 343       113      230

Stage 2             Overall   Control   Exper.
Arms                      6         1        5
Acc. rate               500       143      357
Patientsa              1813       518     1295
Eventsb                 683       223      460

Stage 3             Overall   Control   Exper.
Arms                      6         1        5
Acc. rate               500       143      357
Patientsa              2393       684     1709
Eventsb                1085       350      735

Stage 4             Overall   Control   Exper.
Arms                      6         1        5
Acc. rate               500       143      357
Patientsa              3581      1023     2558
Eventsb                1332       437      895

a Patients are cumulative across stages
b Events are cumulative across stages but are only displayed for those arms to which patients are still being recruited. Events are for the I-outcome at stages 1–3 and the D-outcome at stage 4

...
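As a quick numerical check, and as a minimal sketch only (not part of the nstage output), the entries of R4 and the pairwise power bounds quoted earlier can be reproduced in Stata from the cumulative event numbers in the table above; under constant allocation the overall event ratios are essentially the per-comparison ratios, so small rounding differences are expected:

* Reproducing the R4 entries (rho = 0.60) and the rho = 0 power bound
display sqrt(343/683)         // stages 1 vs 2: ~0.71
display sqrt(343/1085)        // stages 1 vs 3: ~0.57
display sqrt(683/1085)        // stages 2 vs 3: ~0.80
display 0.60*sqrt(343/1332)   // stage 1 vs final D-outcome analysis: ~0.31
display 0.60*sqrt(683/1332)   // stage 2 vs final D-outcome analysis: ~0.43
display 0.60*sqrt(1085/1332)  // stage 3 vs final D-outcome analysis: ~0.54
display 0.95^3 * 0.90         // pairwise power lower bound when rho = 0: ~0.77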
Although the focus of the STAMPEDE trial was on the strong control of
the PWER, we demonstrate how the FWER could be controlled using this design.
The following command specifies that interim analyses should assess for efficacy and
the program should search for a design which controls the FWER at a maximum of
2.5%. The other design parameter inputs and options remain the same. The option for
controlling the FWER identified the final stage αJ required to ensure a maximum
FWER of 2.5% as 0.0043, which lengthened the time to final analysis. This could be
addressed by lengthening accrual, for instance, or an additional interim analysis could
be included in this case. Further simulations should be carried out to quantify the
impact of such changes on the operating characteristics of the design and the trial
timelines, particularly on power, which can be marginally reduced. Moreover, the
trial timelines presented in the outputs assume an exponential distribution for both
outcomes, which is restrictive. However, the key design quantities, that is, the
numbers of control-arm events needed to trigger the interim and final analyses,
assume only proportional hazards for the I and D (log) hazard ratios. So, if the
exponential assumption is breached, the effect is only to reduce the accuracy of the
projected times for each stage. In
practice, it is helpful to visually represent time to analysis and accrual using diagrams.
The following output shows the sample sizes required for the final stage of
the design, which has changed to achieve control of the FWER. In some settings,
the interim-stage stopping boundaries also have to be updated to achieve control of
the FWER, for example, by choosing more stringent (lower) P-value thresholds for
the efficacy stopping boundaries. The number of control-arm D-outcome events
required for the stage 4 analysis must be increased from 437 to 629 (629/437 ≈ 1.44,
a 44% increase) to ensure control of the FWER at 2.5%. This increase in the number
of events required would demand substantially greater resources; for this reason,
investigators should consider carefully at the design stage whether the focus of the
design is control of the FWER or of the PWER.
nstage, nstage(4) alpha(0.5 0.25 0.1 0.025) omega(0.95 0.95 0.95 0.9) hr0(1 1)
hr1(0.75 0.75) accrue(500 500 500 500) arms(6 6 6 6) t(2 4) corr(0.60) aratio(0.5)
esb(hp) fwercontrol(0.025)

Operating characteristics

Stage   Alpha(LOB)a   Alpha(ESB)a   Power   HR|H0   HR|H1   Crit.HR(LOB)   Crit.HR(ESB)   Lengthb   Timeb
1       0.5000        0.0005        0.950   1.000   0.750   1.000          0.439          2.436     2.436
2       0.2500        0.0005        0.950   1.000   0.750   0.920          0.512          1.189     3.625
3       0.1000        0.0005        0.950   1.000   0.750   0.882          0.553          1.161     4.786
4       0.0043        -             0.901   1.000   0.750   0.824          -              4.164     8.950

Max. Pairwise Error Rate 0.0054; Pairwise Power 0.8999
Max. Familywise Error Rate (SE) 0.0252 (0.0002)

LOB lack of benefit, ESB efficacy stopping boundary
a All alphas are one-sided
b Length (duration of each stage) is expressed in periods and assumes survival times are exponentially distributed. Time is expressed in cumulative periods
Sample size and number of events
...

Stage 4             Overall   Control   Exper.
Arms                      6         1        5
Acc. rate               500       143      357
Patientsa              4475      1279     3196
Eventsb                1939       629     1310

a Patients are cumulative across stages
b Events are cumulative across stages but are only displayed for those arms to which patients are still being recruited. Events are for the I-outcome at stages 1–3 and the D-outcome at stage 4

Considerations in Design, Conduct, and Analysis of a MAMS Trial

There are a number of important considerations when designing and implementing a
MAMS trial. This section addresses the challenges in design, conduct, and analysis
of MAMS trials and provides guidelines on how to overcome them successfully.

Design Considerations

The number of experimental arms that can be practically included in the trial is one
important consideration. There is no optimal number for this in a MAMS design as
the main drivers for the number of arms are: the number of treatments that are ready
and available for testing; the number of patients available; and the cost of undertak-
ing the protocol.
Another important design consideration is the shape of the interim stopping
boundaries, that is, the interim P-value thresholds for both lack-of-benefit and
efficacy. In Royston et al.'s MAMS framework, the stagewise significance levels
and powers are the design parameters and are chosen by the investigator. They are
needed to calculate the stagewise sample sizes. However, this approach is restrictive
for three reasons. First, an iterative trial-and-error approach is required in which
users must continually tweak the stagewise operating characteristics until a feasible
design – that is, a design with a particular (prespecified) overall type I error rate and
power – is found. Second, there are likely to be many feasible designs for any pair of
overall operating characteristics, some requiring smaller sample sizes than others.
Therefore, the chosen design may not be the most efficient, or optimal, one for a
particular true treatment effect. To address these difficulties, Bratton (2015) devel-
oped a systematic search procedure to find a large set of feasible designs and then
select those which satisfy the Bayesian optimality criteria defined by Jung et al.
(2004). These efficient designs are called admissible MAMS designs. This approach
finds admissible designs under a given loss function, which is a weighted sum of the
expected sample size under the global null hypothesis and the hypothesis under which
all experimental arms are effective. This method has been implemented in the
nstagebinopt Stata command. Choodari-Oskooei et al. (2022a) used this approach
to find the “optimal” stagewise significance levels and powers in the ROSSINI
2 design.
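To make this optimality criterion concrete, the loss function described above can be written as follows (a sketch in our own notation, following Jung et al. 2004; the weight $q$ is a design choice):

$$
L(d) = q \, \mathrm{E}\!\left[N(d) \mid H_{G0}\right] + (1 - q) \, \mathrm{E}\!\left[N(d) \mid H_{G1}\right], \qquad 0 \le q \le 1,
$$

where $N(d)$ is the total sample size of design $d$, $H_{G0}$ denotes the global null hypothesis, and $H_{G1}$ the alternative under which all experimental arms are effective. A feasible design is admissible if it minimizes $L(d)$ for some value of $q$.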
The timing and frequency of interim analyses are also important design consid-
erations. Previous simulation studies found that 3-stage designs tend to provide a
good trade-off between efficiency, as measured by the expected sample size (ESS),
and the maximum number of interim analyses that will be required (Bratton 2015).
Using four stages can reduce the ESS further when most arms are ineffective.
However, in most cases there is little to no additional reduction in ESS in designs
with five or more stages. The main cost of using a larger number of stages is a (slight)
reduction in the overall pairwise power. Also, the results of interim analyses are, in
practice, reviewed by the independent data monitoring committee (IDMC). It is,
therefore, important to ensure that a meaningful amount of information accumulates
so that the IDMC is not burdened with very frequent meetings, except in trials where
the experimental arms are thought to be toxic, in which case more frequent interim
analyses may be necessary to monitor safety. Traditionally, the IDMC meets at least
annually, but it need not be presented with a formal interim analysis at every
meeting. A practical overview of the establishment, purpose, and responsibilities of
IDMCs is provided by Ellenberg, Fleming, and DeMets (2019).
It should be noted that the criteria for, and implications of, stopping for lack of
benefit are quite different from those of stopping for overwhelming benefit. In the
former, the emphasis is usually on the current estimate of the “treatment effect” on
an intermediate outcome measure for I ≠ D designs; if it is small (or null/negative),
then we may conclude that a worthwhile treatment effect on the primary outcome
measure is also unlikely. Usually, stopping further randomizations
to a research arm for lack of (sufficient) benefit has no implications for either the
control arm or other experimental arms in the trial. In contrast, stopping an arm for
overwhelming benefit usually focuses on the need for a small P-value for the
“treatment effect” on the primary outcome measure. Furthermore, stopping for an
overwhelming benefit has direct implications for the control arm in particular, and
potentially for all of the other research arms, since it will affect the assessment of other
pairwise comparisons. If the efficacious arm is found to be unsafe, both any-pair and
all-pair powers will be reduced. The reduction in power can be overcome by
increasing the sample size for the other comparisons using the conditional error
approach (Jaki and Magirr 2013).
In a MAMS design, the stagewise sample sizes drive the timings of the interim
analyses. In trials with continuous and binary outcomes, the timing of an interim
analysis is typically based on observing a prespecified fraction of the total sample
size required for the final analysis. However, in superiority designs with time-to-event
outcomes, the timing of the interim analyses can instead be based on a prespecified
fraction of the total number of events in the control arm. There are two reasons for this
approach. Firstly, an event rate different to that anticipated for the trial overall, across
all arms, could either arise due to a different underlying event rate in all arms or due
to a hazard ratio different to that targeted initially. This level of ambiguity is removed
by using the control arm event rate as the deciding factor for when to conduct the
analysis. Secondly, when more than one experimental arm is recruited to, it is
unlikely that we shall observe the same hazard ratio in all comparisons, giving
different total numbers of events for each comparison. It is practically expedient
for pairwise comparisons started at the same time to have their interim analyses at the
same time. However, the calculation for the overall number of events assumes the
same event rate in all comparisons in the experimental arms. Moreover, Dang et al.
(2020) showed that monitoring the control arm events provides unbiased estimates
of the (Fisher) “information fraction” in group sequential trials with time-to-event
outcomes.
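As a minimal sketch of this event-driven triggering (hypothetical variable names; not part of nstage), the rule can be checked against an accumulating analysis dataset, here using the stage 2 control-arm target of 223 I-outcome events from the design output shown earlier:

* Assumes a patient-level dataset in memory with hypothetical variables:
* arm (0 = control) and event (1 = I-outcome event observed)
local stage2_target 223
quietly count if arm == 0 & event == 1
display "Control-arm I-outcome events so far: " r(N)
if r(N) >= `stage2_target' {
    display "Stage 2 interim analysis triggered"
}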
Another important design consideration is the choice between the control of the
PWER or FWER. This is a major decision which drives the required patients and the
cost of undertaking the protocol. Consensus is emerging that the most important
consideration to decide whether to control the PWER or FWER is the relatedness of
the research questions in each pairwise comparison (Choodari-Oskooei et al. 2020;
Parker and Weir 2020; Proschan and Waclawiw 2000). There are cases such as
examining different doses (or duration) of the same drug where the control of the
FWER might be necessary to avoid offering a particular therapy an unfair advantage
of showing a beneficial effect. However, in most multi-arm trials where the rationale
for research treatments (or combinations) is derived separately, the greater focus
should be on controlling each pairwise error rate.
Furthermore, the sample size and power calculations in trials with time-to-event
outcomes assume the treatment effects follow proportional hazards (PH). In general,
if the PH assumption is false, power is reduced and interpretation of the hazard ratio
(HR) as the estimated treatment effect is compromised (Royston and Parmar 2020).
For example, when there is an early treatment effect – where the HR is <1 in the
early follow-up and increases later – the research treatment may pass the interim-stage
lack-of-benefit threshold, but this may cause an important loss of power at the
final analysis of the D outcome. More serious are late effects, where lack of benefit
is likely to appear at the intermediate stages even when there is a demonstrable
treatment effect at the final stage. This is a generic problem of trials with time-to-
event outcomes, and is an area of ongoing methodological research.
Finally, sample size calculations in trials with continuous and binary outcomes
depend on the outcome variance and the underlying control arm event rate. The
values of these parameters are generally specified based on previous studies. Depar-
ture from the assumed values can adversely affect the operating characteristics of a
MAMS design. The impact on power increases with the number of stages and when
the outcome variance or the control arm event rate is overestimated (Mehta and
Tsiatis 2001). It is, therefore, necessary to assess these design assumptions during the
trial. Several methods have been proposed in the literature (Betensky and Tierney
1997; Proschan 2005). One common approach, which has minimal impact on the
operating characteristics, is to recalculate sample size using the revised values for
these parameters without unblinding the treatment effect (Proschan 2005).
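As a minimal sketch of such a blinded reassessment for a binary outcome (all numbers hypothetical), the per-comparison sample size can be recalculated with Stata's built-in power command, replacing the assumed control-arm event rate with a revised blinded estimate while keeping the target relative effect fixed:

* Original design assumption: control-arm event rate 0.20, relative risk 0.75
power twoproportions 0.20 0.15, alpha(0.025) power(0.9) onesided

* Blinded reassessment: pooled data suggest a control rate nearer 0.25;
* recalculate with the same relative effect (0.25 x 0.75 = 0.1875)
power twoproportions 0.25 0.1875, alpha(0.025) power(0.9) onesided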

Conduct Considerations

Besides decisions on the statistical aspects of MAMS studies, there are a number of
practical issues to consider when conducting a MAMS study which are not the focus
of this chapter. They have been extensively discussed in the literature (James et al.
2008, 2012; Sydes et al. 2009, 2012; Schiavone et al. 2019; Hague et al. 2019). Such
challenges include, for example, ensuring adequate supply of the treatments under
investigation which is much more complex due to the stochastic nature of the
demand on individual treatments; appropriately informing potential participants
before the trials; updating them with new information; and setting up and managing
scientific oversight committees. This highlights the need to garner large-scale col-
laboration bringing large parts of the research community together, to obtain signif-
icant and long-term funding, to obtain long-term commitment from the key research
leaders, to ensure that responsibilities (and also acclaim) are shared as widely as
possible, and to have operational structures and systems which allow the implemen-
tation of such long-term adaptive protocols. These challenges need to be addressed
when the protocol is at the design stage, as they will need to be resolved before any
funding is likely to be approved and released.

Analysis Considerations

A further challenge arises when estimating the treatment effects at the interim
analyses or the end of the study. While it is relatively easy to define statistical
bias, different definitions of an unbiased estimator are relevant in the MAMS design
(Robertson et al. 2021). An estimator is unconditionally unbiased if it is unbiased
when averaged across all possible realizations of an adaptive trial. In contrast, an
estimator is conditionally unbiased if it is unbiased only conditional on the
occurrence of a subset of trial realizations. For example, one might be interested in
an estimator only conditional on a particular experimental arm being selected at an
interim analysis; as such, the focus becomes a conditionally unbiased estimator.
It has been recognized that the maximum likelihood estimate (MLE) of the
treatment effect for trials with an interim selection rule can be potentially biased
(Piantadosi 2005). The average treatment effect for the trials that stop at interim
stages and those that reach the final stage will be different from the overall under-
lying treatment effect. This is known as the “selection” bias in the literature. This
bias generally tends to be larger the “earlier” the selection happens, that is, when the
decision to stop the treatment arm or continue to the next stage is based on a
relatively small amount of information (Choodari-Oskooei et al. 2013). However,
in trials with no efficacy stopping boundaries, the “selection” bias in the estimated
treatment effect of comparisons that reach their final stage is a greater consideration
than that of trials stopped early for lack of benefit. In this setting, an effective
experimental arm is very likely to reach the final stage of the trial, and the results are
more likely to be adopted into clinical practice.
Choodari-Oskooei et al. (2013) showed that in designs with lack-of-benefit stop-
ping boundaries the size of the selection bias in the comparisons that reach the final
stage is generally small. In fact, the bias is negligible if the experimental arm is truly
effective. Furthermore, using an I outcome at interim stages will reduce the selection
bias in the estimates of treatment effect on the primary outcome in both selected and
dropped treatment arms. In this case, the degree of bias depends on the correlation
between the intermediate and definitive outcome measures. This bias is markedly
reduced by continuing patient follow-up to the planned “end” of the trial and
performing analyses then, irrespective of whether recruitment was stopped early
for lack of benefit. It has been shown that the bias will be minimal if the first interim
stage is placed at a significance level of 0.30 or less (Choodari-Oskooei et al. 2013).
In designs with overwhelming efficacy stopping boundaries, the average treat-
ment effect for the comparisons that stop for efficacy will be different from the
underlying treatment effect. The difference would depend on the unknown underly-
ing treatment effect as well as the type of efficacy stopping boundary. Haybittle-Peto
and O’Brien-Fleming type boundaries are quite common in practice. With both
types of boundary, the probability of crossing the efficacy boundary will be very
small in the early stages of the trial (Freidlin and Korn 2009). For example, in a 4-stage
2-arm design with three equally spaced interim O’Brien-Fleming stopping bound-
aries, the chances of stopping for efficacy at the first, second, and third interim looks
are 0.001%, 0.2%, and 0.8% under the null hypothesis (Freidlin and Korn 2009).
With Haybittle-Peto efficacy stopping boundaries, the corresponding probabilities
are about 0.1% in all stages. Even under the alternative hypothesis of a hazard ratio
of 0.75, the chances of stopping for efficacy at the first interim stage are 0.3% and
7.2% for the O’Brien-Fleming and Haybittle-Peto stopping boundaries, respectively
(Freidlin and Korn 2009). In all these cases, the average treatment effect for trials
that cross the first interim stage efficacy boundary will be different from the
underlying treatment effect. However, as we have seen, the probability of crossing
the boundary will also be very small. For this reason, it can be argued that the bias in
the estimated treatment effect of trials that reach the final stage is a greater
consideration than that in stopped trials.
Several unbiased estimators of the treatment effect have been proposed to correct
for selection bias in these cases (Stallard and Kimani 2018; Bowden and Glimm
2008; Sill and Sampson 2007). These were mostly proposed for two-stage designs
with continuous, conditionally normal outcome variables. However, the proposed
unbiased estimators might not be preferred to the slightly biased standard (MLE)
estimator because their mean square errors are likely to be larger. Simulations should
be used in these cases to assess the degree of selection bias and the probability of
stopping under a range of realistic treatment effect sizes.
Robertson et al. (2021) provide a comprehensive overview of proposed approaches
to remove or reduce the potential bias in point estimation of treatment effects in an
adaptive design, as well as illustrating how to implement them. They also propose a
set of guidelines for researchers around the choice of estimators and the reporting of
estimates following an adaptive design (Robertson et al. 2022). Moreover, the
construction of appropriate (simultaneous) confidence intervals is more complex,
and specialized methods need to be considered.
Finally, it is useful to note that, when analyzing the trial at the interim stages,
power may be increased, as in any trial, by adjusting for covariates and stratification
factors. Since the early stages of a MAMS trial will contain relatively few patients,
the trial population across the arms is more likely to be unbalanced in terms of
potentially confounding covariates such as age. Accounting for these known, influ-
ential covariates in the analysis may increase the robustness of the results.

Summary

This chapter focused on the underlying principles of multi-arm multi-stage
randomized platform trial designs within the framework of Royston et al. (2003). In
general, MAMS designs are more complex than traditional designs and require
sound understanding of the underlying theory and practical challenges of
implementing them.
In the MAMS designs of this chapter, the stopping boundaries and selection
rules are preplanned. Any data-dependent deviation from the prespecified adapta-
tions can have an adverse effect on the operating characteristics of the design. In
this framework, strong control of the operating characteristics is achieved by
constructing a separate cumulative test statistic for each pairwise comparison
and monitoring it with respect to stopping boundaries that are adjusted for multiple
stages and/or testing multiple treatment arms. An alternative approach to the
MAMS design is to control the operating characteristics by combining indepen-
dent multiplicity adjusted P-values from the different stages of the trial in accor-
dance with a prespecified combination function and utilizing closed testing to
ensure strong control of the error rates (Posch et al. 2005). This method provides
more flexibility to make data-dependent adaptive changes at the end of each stage,
such as re-estimating the sample size, for the remainder of the trial. But, it is less
efficient than designs based on cumulative test statistics. Mehta and Patel (2006)
discussed the pros and cons of this approach and showed that the greater flexibility
of these designs comes at the cost of large increases in expected sample size.
Recently, Ghosh et al. (2020) extended the cumulative MAMS designs
(with I = D) to permit data-dependent adaptations such as sample size
re-estimation, and compared them with those based on combining independent
multiplicity adjusted P-values from the different stages. They showed that the
power gain from the cumulative test statistics approach can be substantial, by up
to 18%, and increases with the heterogeneity of underlying treatment effects. They
also showed that the power gain is larger for designs with extreme interim stopping
boundaries, that is, when it is more difficult to drop arms. Their findings are
consistent with results published in Koenig et al. (2008), Friede and Stallard
(2008), and Magirr et al. (2014).
This chapter described a class of multi-arm multi-stage trial designs incorpo-
rating repeated tests for both lack-of-benefit and efficacy of a new treatment
compared with a control regimen. Importantly, the interim lack-of-benefit analysis
can be done with respect to an intermediate outcome measure at a relaxed signif-
icance level. If carefully selected, such an intermediate outcome measure can
further increase the efficiency of the design compared to the other alternatives
where the same primary outcome is used at the interim stages (Ghosh et al. 2020).
This chapter demonstrated the mathematical calculation of the operating charac-
teristics of the designs with/without an intermediate outcome at interim stages, and
outlined advantages of the MAMS design over other alternatives. It demonstrated
how the MAMS design speeds up the evaluation of new treatment regimens in
phase II and III trials.

Key Facts

Multi-arm multi-stage randomized clinical trials are an efficient approach to study
several treatments within one protocol. In summary, the efficiency of a MAMS trial
derives from:

• Implementing a common control arm across several experimental treatments and
reducing the number of competing trials
• Randomizing patients from the outset, allowing comparative testing to start
sooner
• Discontinuing recruitment to unpromising arms, and consequently boosting
recruitment to the arms showing promise
• Using data from all patients in a given comparison in all analyses, thus maximiz-
ing information for each stage with control arm patients contributing to multiple
direct comparisons
• Increasing the probability of identifying at least one successful therapy from
many research arms
Cross-References

▶ Adaptive Phase II Trials
▶ Bias Control in Randomized Controlled Clinical Trials
▶ Biomarker-Guided Trials
▶ Controlling for Multiplicity, Eligibility, and Exclusions
▶ Data and Safety Monitoring and Reporting
▶ Futility Designs
▶ Interim Analysis in Clinical Trials
▶ Monte Carlo Simulation for Trial Design Tool
▶ Platform Trial Designs
▶ Power and Sample Size

Acknowledgments We are grateful to Professor Ian White for his helpful comments on the earlier
version of this chapter. This work is based on research arising from MRC grants MC_UU_00004/09
and MC_UU_12023/29.

References
Abery JE, Todd S (2019) Comparing the MAMS framework with the combination method in multi-
arm adaptive trials with binary outcomes. Stat Methods Med Res 28(6):1716–1730. https://doi.
org/10.1177/0962280218773546
Barthel FMS, Parmar MKB, Royston P (2009) How do multi-stage multi-arm trials compare to the
traditional two-arm parallel group design – a reanalysis of 4 trials. Trials. https://doi.org/10.
1186/1745-6215-10-21
Betensky RA, Tierney C (1997) An examination of methods for sample size recalculation during an
experiment. Stat Med 16:2587–2598
Blenkinsop A, Choodari-Oskooei B (2019) Multiarm, multistage randomized controlled trials with
stopping boundaries for efficacy and lack of benefit: an update to nstage. Stata J 19(4):782–802
Blenkinsop A, Parmar MKB, Choodari-Oskooei B (2019) Assessing the impact of efficacy stopping
rules on the error rates under the MAMS framework. Clin Trials 16(2):132–142. https://doi.org/
10.1177/1740774518823551
Bowden J, Glimm E (2008) Unbiased estimation of selected treatment means in two-stage trials.
Biom J 50(4):515–527
Bratton DJ (2015) PhD thesis: design issues and extensions of multi-arm multi-stage clinical trials.
UCL, London
Bratton DJ, Phillips PPJ, Parmar MKB (2013) A multi-arm multi-stage clinical trial design for
binary outcomes with application to tuberculosis. BMC Med Res Methodol 13:139
Bratton DJ, Parmar MKB, Phillips PPJ, Choodari-Oskooei B (2016) Type I error rates of multi-arm
multi-stage clinical trials: strong control and impact of intermediate outcomes. Trials 17:309.
https://doi.org/10.1186/s13063-016-1382-5
Choodari-Oskooei B, Parmar MKB, Royston P, Bowden J (2013) Impact of lack- of-benefit
stopping rules on treatment effect estimates of two-arm multi-stage (TAMS) trials with time
to event outcome. Trials 14:23
Choodari-Oskooei B, Bratton DJ, Gannon MR, Meade AM, Sydes MR, Parmar MK (2020) Adding
new experimental arms to randomised clinical trials: impact on error rates. Clin Trials 17(3):
273–284. https://doi.org/10.1177/1740774520904346
Choodari-Oskooei B, Bratton DJ, Parmar M (2022a) Facilities for optimising and designing multi-
arm multi-stage (MAMS) randomised controlled trials with binary outcomes. Stata J, submitted
Choodari-Oskooei B, Thwin S, Blenkinsop A, Widmer M, Althabe F, Parmar MKB (2022b)
Treatment selection in multi-arm multi-stage (MAMS) designs: with application to a postpartum
haemorrhage trial. Clin Trials, under review
Cook RJ, Farewell VT (1996) Multiplicity considerations in the design and analysis of clinical
trials. J R Stat Soc Ser A Stat Soc 159:93–110
Dang HM, Alonzo T, Franklin M, Mack JW, Krailo MD, Eckel SP (2020) Information
fraction estimation based on the number of events within the standard treatment regimen. Biom J
26:1960–1972. https://doi.org/10.1002/bimj.201900236
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a
control. J Am Stat Assoc 50(272):1096–1121
Ellenberg SS, Fleming TR, DeMets DL (2019) Data monitoring committees in clinical trials: a
practical perspective, 2nd edn. Wiley
Follmann D, Proschan M (2021) Two stage designs for phase III clinical trials. medRxiv. https://doi.
org/10.1101/2020.07.29.20164525
Freidlin B, Korn EL (2009) Stopping clinical trials early for benefit: impact on estimation. Clin
Trials 6:119–125
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14(14):4368–4371. https://doi.org/10.1158/1078-0432.CCR-
08-0325
Friede T, Stallard N (2008) A comparison of methods for adaptive treatment selection. Biom J
50(5):767–781. https://doi.org/10.1002/bimj.200710453
Ghosh P, Liu L, Mehta C (2020) Adaptive multiarm multistage clinical trials. Stat Med. https://doi.
org/10.1002/sim.8464
Hague D, Townsend S, Masters L et al (2019) Changing platforms without stopping the train:
experiences of data management and data management systems when adapting platform pro-
tocols by adding and closing comparisons. Trials 20:294. https://doi.org/10.1186/s13063-019-
3322-7
Hay M, Thomas DW, Craighead JL, Economides C, Rosenthal J (2014) Clinical development
success rates for investigational drugs. Nat Biotechnol 32(1):40–51. https://www.nature.com/
articles/nbt.2786.pdf
Jaki T, Magirr D (2013) Considerations on covariates and endpoints in multi-arm multi-stage
clinical trials. Stat Med 32(7):1150–1163. https://doi.org/10.1002/sim.5669
Jaki T, Pallmann P, Magirr D (2019) The R package MAMS for designing multi-arm multi-stage
clinical trials. J Stat Softw 88:4. https://doi.org/10.18637/jss.v088.i04
James ND, Sydes MR, Clarke NW, Mason MD, Dearnaley DP, Anderson J, Popert RJ, Sanders K,
Morgan RC, Stansfeld J, Dwyer J, Masters J, Parmar MK (2008) STAMPEDE: systemic therapy
for advancing or metastatic prostate cancer- a multi-arm multi-stage randomised controlled trial.
Clin Oncol (R Coll Radiol) 20(8):577–581
James ND, Sydes MR, Mason MD, Clarke NW, Anderson J, Dearnaley DP, Dwyer J, Jovic G,
Ritchie AW, Russell JM, Sanders K, Thalmann GN, Bertelli G, Birtle AJ, O’Sullivan JM,
Protheroe A, Sheehan D, Srihari N, Parmar MK (2012) Celecoxib plus hormone therapy versus
hormone therapy alone for hormone-sensitive prostate cancer: first results from the stampede
multiarm, multistage, randomised controlled trial. Lancet Oncol 13(5):549–558
Jung SH, Lee T, Kim K, George SL (2004) Admissible two-stage designs for phase II cancer
clinical trials. Stat Med 23(4):561–569
Koenig F, Brannath W, Bretz F, Posch M (2008) Adaptive Dunnett tests for treatment selection. Stat
Med 27:1612–1625. https://doi.org/10.1002/sim.3048
Lan KK, Zucker DM (1993) Sequential monitoring of clinical trials: the role of information and
Brownian motion. Stat Med 12:753–765
Lee KM, Brown LC, Jaki T, Stallard N, Wason J (2021) Statistical consideration when adding new
arms to ongoing clinical trials: the potentials and the caveats. Trials 22:203
Magirr D, Jaki T, Whitehead J (2012) A generalized Dunnett test for multi-arm multi-stage clinical
studies with treatment selection. Biometrika 99(2):494–501
Magirr D, Stallard N, Jaki T (2014) Flexible sequential designs for multi-arm clinical trials. Stat
Med 33:3269–3279
Mehta CR, Patel NR (2006) Adaptive, group sequential and decision theoretic approaches to sample
size determination. Stat Med 25:3250–3269. https://doi.org/10.1002/sim.2638
Mehta C, Tsiatis A (2001) Flexible sample size considerations using information-based interim
monitoring. Drug Inf J 35(4):1095–1112. https://doi.org/10.1177/009286150103500407
Meyer EL, Mesenbrink P, Mielke T, Parke T, Evans D, Konig F, EU-PEARL (EU Patient-cEntric
clinicAl tRial pLatforms) Consortium (2021) Systematic review of available software for multi-
arm multi-stage and platform clinical trial design. Trials 22:183. https://doi.org/10.1186/s13063-
021-05130-x
MRC Clinical Trials Unit at UCL. RAMPART Trial. https://www.rampart-trial.org/
O’Brien PC (1983) The appropriateness of analysis of variance and multiple-comparison pro-
cedures. Biometrics 39(3):787–788
Parker RA, Weir CJ (2020) Non-adjustment for multiple testing in multi-arm trials of distinct
treatments: rationale and justification. Clin Trials 17(5):562–566. https://doi.org/10.1177/
1740774520941419
Parmar MK, Barthel FM, Sydes M, Langley R, Kaplan R, Eisenhauer E, Brady M, James N,
Bookman MA, Swart AM, Qian W, Royston P (2008) Speeding up the evaluation of new agents
in cancer. J Natl Cancer Inst 100(17):1204–1214
Parmar MKB, Carpenter J, Sydes MR (2014) More multiarm randomised trials of superiority are
needed. Lancet 384(9940):283–284. https://doi.org/10.1016/S0140-6736(14)61122-3
Piantadosi S (2005) Clinical trials: a methodologic perspective, 2nd edn. Wiley, New York
Posch M, Koenig F, Branson M, Brannath W, Dunger-Baldauf C, Bauer P (2005) Testing and
estimation in flexible group sequential designs with adaptive treatment selection. Stat Med 24:
3697–3714. https://doi.org/10.1002/sim.2389
Prentice RL (1989) Surrogate endpoints in clinical trials: definition and operational criteria. Stat
Med 8(4):431–440. https://doi.org/10.1002/sim.4780080407
Proschan MA (2005) Two-stage sample size re-estimation based on a nuisance pa-rameter: a review.
J Biopharm Stat 15(4):559–574. https://doi.org/10.1081/BIP-200062852
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21:527–539
Robertson DS, Choodari-Oskooei B, Dimairo M, Flight L, Pallmann P, Jaki T (2021) Point
estimation for adaptive trial designs. Stat Med, under review. https://arxiv.org/abs/2105.08836
Robertson DS, Choodari-Oskooei B, Dimairo M, Flight L, Pallmann P, Jaki T (2022) Point
estimation for adaptive trial designs II: practical considerations and guidance. Stat Med, under
review. https://arxiv.org/abs/2105.08836
ROSSINI 2: Reduction of surgical site infection using several novel interventions trial protocol,
Tech. rep (2018). https://www.birmingham.ac.uk/Documents/college-mds/trials/bctu/rossini-ii/
R0SSINI-2-Protocol-V1.0-02.12.2018.pdf
Royston P, Parmar MK (2020) A simulation study comparing the power of nine tests of the
treatment effect in randomized controlled trials with a time-to-event outcome. Trials 21:315.
https://doi.org/10.1186/s13063-020-4153-2
Royston P, Parmar MK, Qian W (2003) Novel designs for multi-arm clinical trials with survival
outcomes with an application in ovarian cancer. Stat Med 22(14):2239–2256
Royston P, Barthel FM, Parmar MK, Choodari-Oskooei B, Isham V (2011) Designs for clinical
trials with time-to-event outcomes based on stopping guidelines for lack of benefit. Trials 12:81
Schiavone F, Bathia R, Letchemanan K et al (2019) This is a platform alteration: a trial management
perspective on the operational aspects of adaptive and platform and umbrella protocols. Trials
20:264. https://doi.org/10.1186/s13063-019-3216-8
Sidak Z (1967) Rectangular confidence regions for the means of multivariate normal distributions.
J Am Stat Assoc 62(318):626–633
Sill MW, Sampson AR (2007) Extension of a two-stage conditionally unbiased estimator of the
selected population to the bivariate normal case. Commun Stat Theory Methods 36:801–813
Stallard N, Kimani PK (2018) Uniformly minimum variance conditionally unbiased estimation in
multi-arm multi-stage clinical trials. Biometrika 105(2):495–501
Stallard N, Todd S (2003) Sequential designs for phase III clinical trials incorporating treatment
selection. Stat Med 22:689–703. https://doi.org/10.1002/sim.1362
Stallard N, Kunz CU, Todd S, Parsons N, Friede T (2015) Flexible selection of a single treatment
incorporating short-term endpoint information in a phase II/III clinical trial. Stat Med 34(23):
3104–3115. https://doi.org/10.1002/sim.6567
Sydes MR, Parmar MK, James ND, Clarke NW, Dearnaley DP, Mason MD, Morgan RC,
Sanders K, Royston P (2009) Issues in applying multi-arm multi-stage methodology to a clinical
trial in prostate cancer: the MRC STAMPEDE trial. Trials 10:39
Sydes MR, Parmar MK, Mason MD, Clarke NW, Amos C, Anderson J, de Bono JS, Dearnaley DP,
Dwyer J, Green C, Jovic G, Ritchie AW, Russell JM, Sanders K, Thalmann G, James ND (2012)
Flexible trial design in practice – stopping arms for lack-of-benefit and adding research arms
mid-trial in STAMPEDE: a multi-arm multi-stage randomized controlled trial. Trials 13(1):168
Ventz S, Alexander BM, Parmigiani G, Gelber RD, Trippa L (2017) Designing clinical trials that
accept new arms: an example in metastatic breast cancer. J Clin Oncol 35(27):3160–3168
Wason JMS, Stecher L, Mander AP (2014) Correcting for multiple-testing in multi-arm trials: is it
necessary and is it done? Trials 15:364
Wason J, Stallard N, Bowden J, Jennison C (2017) A multi-stage drop-the-losers design for multi-
arm clinical trials. Stat Methods Med Res 26(1):508–524. https://doi.org/10.1177/
0962280214550759
79 Sequential, Multiple Assignment, Randomized Trials (SMART)
Nicholas J. Seewald, Olivia Hackworth, and Daniel Almirall

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1544
Dynamic Treatment Regimens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1544
Scientific Questions about DTRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1547
Sequential, Multiple Assignment, Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1547
Returning to the Scientific Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1549
Other SMART Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1551
Power Considerations and Analytic Methods for Primary Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1553
Additional Considerations for Designing and Implementing a SMART . . . . . . . . . . . . . . . . . . 1555
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1557
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1557
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1558
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1558

Abstract
A dynamic treatment regimen (DTR) is a prespecified set of decision rules that
can be used to guide important clinical decisions about treatment planning. This
includes decisions concerning how to begin treatment based on a patient’s
characteristics at entry, as well as how to tailor treatment over time based on
the patient’s changing needs. Sequential, multiple assignment, randomized trials
(SMARTs) are a type of experimental design that can be used to build effective
DTRs. This chapter provides an introduction to
DTRs, common types of scientific questions researchers may have concerning the
development of a highly effective DTR, and how SMARTs can be used to address
such questions. To illustrate ideas, we discuss the design of a SMART used to
answer critical questions in the development of a DTR for individuals diagnosed
with alcohol use disorder.

N. J. Seewald (*) · O. Hackworth · D. Almirall


University of Michigan, Ann Arbor, MI, USA
e-mail: [email protected]; [email protected]; [email protected]

Keywords
Dynamic treatment regimen · Adaptive intervention · Tailoring variable ·
Sequential randomization · Multistage randomized trial

Introduction

In clinical settings, it is often necessary to treat patients using a sequential and
individually tailored approach, whereby treatment is adapted and readapted over
time based on both static and changing needs of the patient (Thall et al. 2000; Lavori
and Dawson 2014). A dynamic treatment regimen (DTR) is a prespecified set of
decision rules that can be used to guide clinicians on how to make such sequences of
treatment decisions (Murphy and Almirall 2009).
Investigators often have multiple scientific questions concerning the development
of effective DTRs. These questions often involve the effectiveness of the compo-
nents that make up a DTR: What is the best way to begin treatment? What is the best
treatment to provide patients who respond suboptimally or who fail to adhere to an
initial course of treatment? What is the best approach to monitoring patients for
response to treatment? How do treatments work in sequence, with or against each
other, to impact outcomes in the long term?
One type of clinical trial design that is useful for answering such questions is the
sequential, multiple assignment, randomized trial, or SMART (Lavori and Dawson
2004; Murphy 2005; Collins et al. 2014). Relative to standard multiarm randomized
trials, the SMART is unique in that it involves multiple stages of randomization:
Participants in a SMART may be randomized more than once across multiple stages
of the trial.
This chapter provides a brief introduction to DTRs (including the components
that make up a DTR) and SMART designs. Throughout, we illustrate ideas using a
SMART designed to answer critical questions in the development of a DTR for
individuals diagnosed with alcohol use disorder.

Dynamic Treatment Regimens

A dynamic treatment regimen (DTR) is a sequence of decision rules that can be used
to guide how treatment can be adapted and readapted to the individual in clinical
practice settings. These treatment adaptations can be in terms of the type of treat-
ment, mode of treatment delivery, treatment intensity or dose, or other intervention
components. As with other types of manualized interventions, the decision rules that
make up a DTR are prespecified and well operationalized; this helps to ensure that
they can be replicated by future clinicians or evaluated by future researchers. DTRs
are also referred to as adaptive interventions (Lei et al. 2012; Nahum-Shani et al.
2012b), adaptive treatment strategies (Murphy 2005; August et al. 2016; Nahum-
Shani et al. 2017), treatment policies (Lunceford et al. 2002; Wahed and Tsiatis
[Fig. 1 layout: naltrexone + medical management → Responders: naltrexone; Non-responders: behavioral intervention + medical management + naltrexone]

Fig. 1 Schematic of an example DTR for an adult receiving treatment for alcohol use disorder.
Nonresponse to treatment is defined as two or more heavy drinking days during the 8-week initial
study period

2004, 2006), multistage treatments (Thall and Wathen 2005), and multicourse
treatment strategies (Thall et al. 2002).
To make the idea of a DTR more concrete, consider as an example the treatment
of patients with alcohol use disorder. Naltrexone is a medication that diminishes the
pleasurable effects of alcohol (Oslin et al. 2006). Response to naltrexone is heter-
ogenous due to factors such as poor patient adherence, biological response to the
medication, low social support, and poor coping skills (Nahum-Shani et al. 2017).
As a result of this heterogeneity, it is important to offer a supportive intervention
along with the naltrexone medication. One such intervention is medical manage-
ment, a face-to-face clinical support intervention that includes monitoring for adher-
ence to treatment. A more intensive clinical support intervention is the combined
behavioral intervention, which includes components which target adherence to
medication and enhance the patient’s motivation for change. The intervention also
involves the patient’s family, when possible, and reinforces abstinence by empha-
sizing social support (Longabaugh et al. 2005; Lei et al. 2012). Hereafter, we refer to
the combined behavioral intervention as simply “behavioral intervention.”
Figure 1 illustrates an example DTR that involves the use of naltrexone, medical
management, and behavioral intervention. In this example DTR, the patient is
offered naltrexone alongside medical management for up to 8 weeks, with weekly
check-ins with the clinician as a part of medical management. If, at any of the weekly
check-ins during this 8-week period, the patient reports experiencing two or more
heavy drinking days, the patient is identified as a nonresponder and is offered
behavioral intervention in addition to naltrexone and medical management. If
instead the patient does not experience two or more heavy drinking days during
the 8-week period, then, at week 8, the patient is identified as a “durable responder”
and continues treatment with naltrexone but without medical management (Lei et al.
2012).
There are four main components of a DTR, all of them prespecified: (1) decision
points, (2) treatment options, (3) tailoring variables, and (4) decision rules. Decision
points are times in a patient’s care where a treatment decision is made. They can
occur at scheduled intervals, after a specific number of clinic visits, or be event-
based, such as the point at which a patient fails to respond or adhere to a treatment.
The timing of decision points should be based on scientific or practical consider-
ations which inform when treatment may need to be modified. For instance, in
adolescent weight loss, clinicians typically evaluate response to treatment after
about 3 months: This suggests a decision point should be placed at about this time
(Naar-King et al. 2016).
The second component of a DTR is the collection of treatment options available
at each decision point. This set may include aspects of treatment such as type of
treatment, intensity of treatment, and/or delivery method; see Lei et al. (2012) for
detailed examples. It may also include strategies for modifying treatment, such as
augmenting or intensifying an intervention, or staying the course (Pelham Jr. et al.
2016). The set of possible treatment options can be different at each decision point.
The third component is the tailoring variables, which are used to individualize
(“tailor”) treatment at each decision point. These could be static characteristics, such
as age or other demographic factors, known co-occurring conditions, or other
characteristics collected at intake. Tailoring variables could also be time-varying
characteristics that may change based on previous treatments, disease severity,
treatment preferences, or adherence.
The fourth component in a DTR is the decision rules. At each decision point, a
decision rule takes in the values of the tailoring variables and recommends a
treatment option (or set of options). The collection of decision rules over all decision
points is what makes up a DTR (Murphy and Almirall 2009).
In the alcohol use disorder DTR depicted in Fig. 1, there are two decision points
from the perspective of the clinician. The first is when treatment begins. The second
decision point is the first time the patient is identified as a nonresponder during the
first 8 weeks of treatment, or when the patient is identified as a responder at week
8. In the example DTR, there is only a single treatment option at the first decision
point: naltrexone with medical management. At the second decision point, there are
two treatment options: naltrexone with medical management and behavioral inter-
vention or naltrexone alone. In this example, there is a single tailoring variable,
which is the number of heavy drinking days reported by the patient following the
start of the initial intervention. This information is used to inform whether a patient
remains a responder for 8 weeks or triggers the nonresponse criterion (two or more
heavy drinking days) within the 8 weeks. The decision rule at the first decision point
is to offer all patients naltrexone with medical management. The decision rule at the
second decision point recommends withdrawing responders from medical manage-
ment at week 8 and offering behavioral intervention in addition to existing treatment
to any patient that triggers a nonresponse within the 8 weeks.
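Because the decision rules in a DTR are prespecified and well operationalized, they can be expressed directly in code. As a minimal sketch (hypothetical variable names, written in Stata for consistency with the previous chapter, and not taken from an actual study), the second decision rule above might be coded against a patient-level dataset as:

* hdd_8wk = number of heavy drinking days reported during the 8-week
* initial period (hypothetical variable)
generate byte nonresponder = (hdd_8wk >= 2) if !missing(hdd_8wk)
generate str60 stage2_treatment = cond(nonresponder == 1, ///
    "behavioral intervention + medical management + naltrexone", ///
    "naltrexone alone")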
DTRs also have applications to other clinical settings, for example, in prevention
medicine, implementation, or in special education. In prevention applications, DTRs
could help operationalize the transition between “universal” preventive interven-
tions, which target a large section of the population, and “selected” then “indicated”
preventive interventions, which target populations at progressively higher risk of
developing a disorder (August et al. 2016; Hall et al. 2019). Implementation focuses
on the uptake or adoption of evidence-based practices by systems of providers (e.g.,
clinics); here, a DTR can be used to guide how best to adapt (potentially costly)
organizational-level interventions that seek to improve the health of individuals at
the organization (Kilbourne et al. 2014, 2018; Quanbeck et al. 2020). In special
education, DTRs can be used to guide how best to adapt interventions designed to
improve behavioral and academic outcomes in educational settings (Kasari et al.
2014; Almirall et al. 2018).

Scientific Questions about DTRs

Researchers interested in developing high-quality DTRs often have unanswered
questions that cannot necessarily be answered based on the extant literature, or
expert clinical opinion. These questions typically concern the relative effectiveness
of different DTRs, the relative effectiveness of different DTR components at specific
stages, how the intervention components at different stages work with (or against)
each other, and questions related to how best to tailor treatment at different stages of
intervention.
Common questions are about which treatment option the DTR should begin with,
how to modify the initial treatment for nonresponders, how to best define or monitor
individuals for response/nonresponse, and the timing of decision points and thus
interventions.
Within the context of the alcohol use disorder example, one important question
concerns the definition of nonresponse to naltrexone. In the DTR shown in Fig. 1,
nonresponse is defined as a patient reporting 2 or more heavy drinking days.
However, it is unclear what amount of drinking behavior corresponds to non-
response to naltrexone, so researchers may have questions concerning how best to
monitor a patient for nonresponse. Other scientific questions might ask which
treatment options to offer as follow-up to naltrexone with medical management.
For instance, should clinicians provide more intense support for nonresponders, or
which treatment best maintains longer-term response for patients identified as
responders at week 8.

Sequential, Multiple Assignment, Randomized Trials

A sequential, multiple assignment, randomized trial (SMART) is a type of multistage
randomized trial design that aims to answer critical questions in the development of
DTRs, such as those described above. In a SMART, all participants move through
multiple stages of treatment. At each stage, participants may be randomized to a set
of feasible treatment options. The randomizations in a SMART correspond to
scientific questions about the development of an effective DTR. The treatment
options to which a participant is randomized at each stage may depend on participant
characteristics or prior treatment. As with other randomized trials, the randomiza-
tions at each stage allow investigators to make causal inferences about the relative
effectiveness of different treatment options at each stage, without having to make
unverifiable assumptions (Rubin 1974). These randomizations also allow investiga-
tors to make causal inferences about the relative effectiveness of different DTRs. As
with other randomized trials, each of the randomizations in a SMART can be
stratified based on factors believed to be associated with subsequent outcomes.
Fig. 2 Schematic of the ExTENd SMART; the treatment pathways terminate in subgroups labeled A–H. Circled R indicates randomization; treatments are boxed. The stringent definition of nonresponse is triggered when the participant reports 2 or more heavy drinking days in 1 week; the lenient definition, 5 or more heavy drinking days

Randomization probabilities may depend on any previously observed covariate, as long as the probabilities are known.
As an example, consider the two-stage ExTENd SMART study, led by David
Oslin, designed to inform the development of a DTR for the treatment of patients
with alcohol use disorder using naltrexone and patient support (Murphy and Almirall
2009). A diagram of this SMART is shown in Fig. 2. As we describe in more detail
below, the example DTR shown in Fig. 1 was drawn from this SMART. In ExTENd,
all participants began by receiving naltrexone with medical management. The first
randomization, which occurred at the outset of treatment, was to one of two definitions
of nonresponse: a stringent definition and a more lenient one. Thus, the first random-
ization was designed to compare two approaches to monitoring for nonresponse. The
stringent definition identifies a patient as a nonresponder if the patient reports
experiencing 2 or more heavy drinking days in 1 week. The lenient definition identifies
a patient as a nonresponder if the patient reports experiencing 5 or more heavy
drinking days in 1 week. The participants were then assessed weekly during the
8-week first stage based on their randomly assigned definition of nonresponse. If
and when a participant triggered nonresponse, the participant was immediately rerandomized to either receive behavioral intervention in addition to naltrexone with
medical management or behavioral intervention with medical management and pla-
cebo. This randomization was designed to investigate the effect of naltrexone in the
context of the behavioral intervention and medical management among nonresponders
to the first-stage treatment. If participants did not meet the definition of nonresponse at
the end of the 8 weeks, then they were classified as responders and immediately
rerandomized to either naltrexone alone or naltrexone alongside telephone disease management (medical management delivered via telephone, hereafter referred to as
“telehealth”). This randomization was designed to investigate the effect of continued,
lower-intensity medical management for responders to the first-stage treatment. All
patients continued their second treatment for the duration of the 24-week study.
Many SMARTs are designed with an embedded tailoring variable, meaning that
subsequent randomizations are restricted based on the participant’s value of a
tailoring variable. The ExTENd study is an example of a SMART with this feature:
The second-stage treatment options for nonresponders are different from those for
responders. SMARTs that include an embedded tailoring variable will, by design,
have a number of DTRs embedded within them. For example, there are eight DTRs
embedded in ExTENd by design (Table 1). All participants in this SMART were
randomly assigned to a sequence of treatments that is consistent with recommenda-
tions made by one or more of these eight embedded DTRs. Participants who follow
treatment pathways A and D in Fig. 2 are assigned treatments according to the
example DTR discussed above and depicted in Fig. 1. These patients begin with
naltrexone, medical management, and the stringent definition of nonresponse to
treatment and are subsequently provided naltrexone alone if they respond or have
behavioral intervention added to their therapy if they do not respond.
Note that within ExTENd, all participants are consistent with two of the eight
embedded DTRs. This example SMART is conceptually similar to a (2 × 2 × 2)
(fractional) factorial trial design (Murphy and Bingham 2009; Collins et al. 2014;
Vock and Almirall 2018). The first factor is naltrexone with medical management
and the stringent definition of nonresponse versus naltrexone with medical manage-
ment and the lenient definition of nonresponse. The second factor is restricted to
responders and is naltrexone alone versus naltrexone with telehealth among
responders. The third factor, restricted to nonresponders, is naltrexone with medical
management and behavioral intervention versus placebo with medical management
and behavioral intervention.
Two key differences from factorial designs are the sequential nature of treatment
delivery in a SMART, as well as the possible restriction of certain treatment options
to participants based on their response status. Scientific questions which motivate a
SMART are asked in the context of a sequence of treatments which are delivered at
multiple points in time: This is not typically captured by a standard factorial design.
Additionally, SMARTs which contain an embedded tailoring variable usually offer
different sets of treatment options to responders and nonresponders. Similarly, first-
stage treatment assignment may determine whether individuals are rerandomized, as
in the SMART depicted in Fig. 4. These SMARTs are therefore not fully crossed
designs (Nahum-Shani et al. 2012b).

Returning to the Scientific Questions

As stated before, the goal of SMART designs is to aid the development of DTRs.
Data collected in a SMART can be used to answer questions concerning which
intervention option to provide at critical decision points during care. For example, in
Table 1 Embedded dynamic treatment regimens (DTRs) in the ExTENd SMART (Fig. 2). The stringent definition of nonresponse is triggered when the participant reports 2 or more heavy drinking days in 1 week; the lenient definition, 5 or more heavy drinking days

Embedded DTR | Stage 1 treatment | Stage 2 treatment for responders | Stage 2 treatment for nonresponders | Subgroups in Fig. 2 consistent with DTR
1 | naltrexone + medical management + stringent nonresponse | naltrexone | behavioral intervention + medical management + placebo | A, C
2 | naltrexone + medical management + stringent nonresponse | naltrexone | behavioral intervention + medical management + naltrexone | A, D
3 | naltrexone + medical management + stringent nonresponse | naltrexone + telehealth | behavioral intervention + medical management + placebo | B, C
4 | naltrexone + medical management + stringent nonresponse | naltrexone + telehealth | behavioral intervention + medical management + naltrexone | B, D
5 | naltrexone + medical management + lenient nonresponse | naltrexone | behavioral intervention + medical management + placebo | E, G
6 | naltrexone + medical management + lenient nonresponse | naltrexone | behavioral intervention + medical management + naltrexone | E, H
7 | naltrexone + medical management + lenient nonresponse | naltrexone + telehealth | behavioral intervention + medical management + placebo | F, G
8 | naltrexone + medical management + lenient nonresponse | naltrexone + telehealth | behavioral intervention + medical management + naltrexone | F, H

ExTENd, researchers were interested in the comparison of the different definitions of nonresponse to naltrexone with medical management (i.e., a comparison of first-stage treatment options), averaged over subsequent treatment. This would help to answer a question about what amount of drinking behavior corresponds with nonresponse to naltrexone in the context of a DTR designed to increase the proportion of days abstinent from alcohol. Other scientific questions might involve a comparison of second-stage intervention options among responders, averaged over the first-stage definition of nonresponse; a similar comparison could be made among nonresponders.
Questions can also focus on comparisons of the DTRs embedded in a SMART
(Table 1). An example would be to compare the DTR shown in Fig. 1 (embedded
DTR 2) to embedded DTR 5 based on proportion of days abstinent from alcohol at the
end of the study. This type of comparison may be used to investigate the difference
between, say, the most and least intensive DTRs, or the most and least expensive.
Data from a SMART can also be used to answer questions about more highly
tailored DTRs beyond those included in the SMART (Laber et al. 2014; Nahum-
Shani et al. 2017). Researchers can collect information about “candidate” tailoring
variables and assess whether and how they could help match participants to subse-
quent intervention options. This could lead to more individualized DTRs. For
example, in ExTENd, it could be useful to investigate whether the nonresponse definition used should be further tailored to an individual’s baseline years of alcohol consumption. Support for this idea comes from evidence suggesting that individuals with more severe histories of alcohol use problems are more prone to relapse, which may warrant a more stringent definition of nonresponse to naltrexone for some individuals (Heilig and Egli 2006). It would also be useful to explore
whether the maintenance treatment for responders should be tailored based on their
proportion of nonabstinence days during the initial treatment with naltrexone. This is
based on the idea that although participants were categorized as responders, their
failure to achieve complete abstinence suggests that they may need additional
support to maintain their improvement (McKay 2005; Cable and Sacker 2007).
This deeper tailoring can be investigated using Q-learning; see Nahum-Shani et al.
(2012a, 2017) for details.

Other SMART Designs

The ExTENd SMART, in which all participants were randomized initially and both
responders and nonresponders were rerandomized, is just one type of SMART
design. The defining feature of a SMART is that at least some participants are
randomized more than once; below, we introduce three additional common
SMART designs. SMARTs may include more than two stages of randomizations
and provide more than two interventions at each randomization. However, for
simplicity, the three SMART designs described below have only two stages and
two intervention options at each randomization.
Many SMARTs use a so-called “prototypical” design in which all participants are
randomized in the first stage, but subsequent randomizations are restricted only to
nonresponders (Sherwood et al. 2016; August et al. 2016; Gunlicks-Stoessel et al.
2016; Naar-King et al. 2016; Pelham Jr. et al. 2016; Schmitz et al. 2018). A
schematic is given in Fig. 3. Note that the tailoring variable could be reversed so
that responders are the group that is rerandomized.

Fig. 3 “Prototypical” SMART design. All participants are randomized in the first stage; only nonresponders are rerandomized. There are four DTRs embedded in this design

This type of SMART design may
be helpful in a scenario in which there is an open scientific question about either
responders or nonresponders, but not both. For example, in the SMART described
by Pelham Jr. et al. (2016), participants who responded to first-stage treatment
continued on that treatment: The trial was not motivated by a question about
second-stage treatment for responders. Nonresponders, however, were rerandomized
between an intensified version of their first-stage intervention, or augmentation of
the intervention with another component. It should be noted that it is not necessary
that responders and nonresponders to different first-stage treatments be given the
same second-stage intervention options: Nonresponders to B, for instance, might be
rerandomized between treatments F and G.
In some contexts, scientific, practical, or ethical considerations limit the
treatment options available as follow-up to a particular first-stage intervention.
This type of consideration is accommodated by the SMART design described in
Fig. 4, in which participant rerandomization depends on both their response
status and previous treatment (Almirall et al. 2016; Kasari et al. 2014; Kilbourne
et al. 2014). In Fig. 4, participants who respond to treatment A are not
rerandomized, but the nonresponders are rerandomized. In the branch where
participants receive treatment B as their first stage treatment, no one is
rerandomized. This may be used if there are no practical or ethical treatment
options available to offer nonresponders to B, for example. In the SMART
described in Kasari et al. (2014), it was not feasible to rerandomize participants
who did not respond to one of the initial interventions. For these participants, the
only feasible option was to intensify their initial treatment. There are three DTRs
embedded in this type of SMART.
Fig. 4 A SMART design in which only nonresponders to a particular first-stage treatment are rerandomized. There are three DTRs embedded in this design

Not all SMARTs involve restricted randomization: In some designs, all par-
ticipants are rerandomized regardless of their response to previous treatment
(Fig. 5) (Chronis-Tuscano et al. 2016). In this scenario, investigators might
collect information on one or more candidate-tailoring variables but do not use
them when rerandomizing: The tailoring variable is not embedded in the design.
In the SMART shown in Fig. 5, all participants are randomized to either treatment
A or B, and then rerandomized to either treatment C or D regardless of their
response to first-stage treatment. There are four treatment paths embedded in this
design, but because there is no embedded tailoring variable, these are not DTRs
per se. These so-called “unrestricted” SMARTs are sequential, fully-crossed,
2 × 2 factorial designs. In this case, the factors are A versus B at stage 1 for all
participants, crossed with C versus D at stage 2 for all individuals. However, as
above, second-stage treatment options may depend on the first randomization
(i.e., individuals who receive B might be rerandomized between E and F rather
than C and D).

Power Considerations and Analytic Methods for Primary Aims

Like any other randomized trial, a SMART should be powered based on the primary
aim of the study. Here, we revisit three common primary aims for a SMART and
discuss power considerations and analysis for each. For simplicity, we restrict our
focus to two-stage studies with an outcome observed at the end of the trial and in
which all randomizations occur between two treatment options with equal probabil-
ity. More general situations are described by, e.g., Ogbagaber et al. (2016).
A common primary aim is the comparison of initial treatment options, averaged
over subsequent treatment. This is a two-group comparison: In the context of Fig. 2,
Fig. 5 An unrestricted SMART. All participants are randomized twice without regard to a tailoring variable. There are four nonadaptive treatment paths embedded in this design

for example, this compares the mean outcome across subgroups A, B, C, and D to
the mean outcome across subgroups E, F, G, and H. As such, standard two-group
comparison methods can be used for both analysis and power considerations. In the
continuous-outcome case, linear regression with an indicator for first-stage treatment
along with any prognostic baseline (prior to first-stage randomization) covariates can
be used; the minimum sample size can be calculated using the standard formula
$$N \geq \frac{4\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}$$

where δ is the smallest clinically relevant standardized effect size the investigator wishes to detect using a test with type-I error rate α/2 and power 1 − β. We use $z_p$ to denote the p-th quantile of the standard normal distribution.
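As an illustration, this formula takes only a few lines of R to evaluate; the effect size, α, and power below are assumed values chosen purely for the example, not figures from any particular trial.

n_stage1 <- function(delta, alpha = 0.05, power = 0.90) {
  # total sample size for comparing first-stage options, per the formula above
  4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2
}
ceiling(n_stage1(delta = 0.3))  # 467 participants for a standardized effect of 0.3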
A second common primary aim is the comparison of second-stage treatment
options among responders or nonresponders, averaged over initial treatment assign-
ment. In a prototypical SMART (Fig. 3), this involves comparing nonresponders
who received treatment D to those who received treatment E in stage 2. Again, this is
simply a two-group comparison among nonresponders, so we can use standard
methods for analysis restricted to the nonresponders. The formula for the total
sample size for the SMART is the same as above, upweighted by nonresponse
probabilities:
$$N \geq \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}} \left[\frac{1}{1 - P(R_A = 1)} + \frac{1}{1 - P(R_B = 1)}\right]$$

where $R_X$ is an indicator for whether a participant responded to first-stage treatment X ($R_X = 1$) or not ($R_X = 0$).
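A corresponding sketch in R; the response probabilities under each first-stage treatment are again assumed values for illustration.

n_stage2_nonresp <- function(delta, p_resp_A, p_resp_B, alpha = 0.05, power = 0.90) {
  # total SMART sample size, upweighted by the nonresponse probabilities
  2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2 *
    (1 / (1 - p_resp_A) + 1 / (1 - p_resp_B))
}
ceiling(n_stage2_nonresp(delta = 0.3, p_resp_A = 0.4, p_resp_B = 0.5))  # 857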
Investigators may also be interested in powering a SMART for a comparison of
two embedded DTRs which recommend different first-stage treatments. For exam-
ple, in a prototypical SMART such as the one shown in Fig. 3, this might be a
comparison of individuals who are consistent with the DTR which recommends A
then C for responders and D for nonresponders against those consistent with the
DTR which recommends B initially, then F for responders and G for nonresponders.
This comparison is often done using a regression model which allows for the
simultaneous estimation of mean outcomes under each of the embedded DTRs and
accounts for the facts that (1) some participants may be consistent with more than
one DTR, and (2) not all participants are randomized more than once. This can be
achieved using a so-called “weighted and replicated” approach (Nahum-Shani et al.
2012b).
In a prototypical SMART, responders are randomized only once whereas non-
responders are randomized twice. Therefore, there is imbalance by design in the
numbers of responders and nonresponders consistent with each embedded DTR. We
can correct for this imbalance with inverse-probability-of-treatment weights:
Assuming equal randomization, responders receive a weight of $(1/2)^{-1} = 2$ and nonresponders receive a weight of $(1/2 \times 1/2)^{-1} = 4$. Furthermore, responders to
treatment A are consistent with two embedded DTRs: The first recommends A, C for
responders, and D for nonresponders; the second recommends A, C for responders,
and E for nonresponders. The same holds for responders to treatment B. Regression
approaches which simultaneously estimate mean outcomes for all embedded DTRs
must account for this; see Appendix A of Nahum-Shani et al. (2012b) for additional
details.
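To make the weighting concrete, the R fragment below sketches a weighted mean outcome under one embedded DTR of the prototypical SMART in Fig. 3; the data frame dat and its column names are hypothetical stand-ins for illustration, not part of any published analysis.

# Hypothetical data: a1 = first-stage treatment ("A"/"B"), resp = 1 for
# responders, a2 = second-stage treatment for nonresponders ("D"/"E", NA for
# responders), y = end-of-study outcome
w <- ifelse(dat$resp == 1, 2, 4)  # (1/2)^-1 = 2 and (1/2 * 1/2)^-1 = 4

# Mean outcome under the embedded DTR "start with A; continue to C if
# response, switch to D if nonresponse"
consistent <- dat$a1 == "A" & (dat$resp == 1 | dat$a2 == "D")
weighted.mean(dat$y[consistent], w[consistent])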
Sample size formulae for a comparison of two embedded DTRs are surprisingly
straightforward and build on the standard formulae given above. The total sample
size for the SMART is
$$N \geq \mathrm{DE} \times \frac{4\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}$$

where DE is a “design effect” that accounts for differential randomization of responders and nonresponders in the second stage. In a prototypical SMART, DE = 2 − P(R = 1), assuming a common response rate across first-stage treatments. In the SMART shown in Fig. 4, DE = (3 − P(R = 1))/2; in an ExTENd-style SMART in which all participants are rerandomized (Fig. 2), DE = 2 (Oetting et al. 2011).
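The design-effect calculation is equally direct in R; the response probability and effect size below are assumed values used only to illustrate the formula.

n_dtr <- function(delta, DE, alpha = 0.05, power = 0.90) {
  # total SMART sample size for comparing two embedded DTRs
  DE * 4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / delta^2
}
# Prototypical SMART (Fig. 3) with common response probability P(R = 1) = 0.4
ceiling(n_dtr(delta = 0.3, DE = 2 - 0.4))  # 748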

Additional Considerations for Designing and Implementing a SMART

The SMART designs discussed above are representative of most of the SMARTs that
have been implemented to date. To our knowledge, most SMARTs in the field have
two stages with randomizations limited to two treatment options. However, as
mentioned above, SMART designs may include more than two stages of randomi-
zation, or more than two treatment options after a randomization. As with any
randomized trial, the design of a SMART is ultimately dictated by the scientific questions the investigator seeks to answer.
Each of the SMART designs discussed in this chapter is motivated by a different
set of scientific questions at multiple stages of a DTR. Because each randomization
in a SMART corresponds to an open question about subsequent treatment recom-
mendations, and the defining characteristic of a SMART is that some or all partic-
ipants are randomized more than once, questions that do not involve multiple stages
of treatment do not, by themselves, motivate a SMART. Almirall et al. (2018)
describe several “singly-randomized” alternatives to SMARTs in the context of
research on DTRs.
SMARTs often include standard-of-care control groups. Most commonly, this is
done by embedding a standard-of-care intervention as one of the DTRs. For instance,
in Fig. 3, one of the embedded DTRs may be a DTR that is commonly used in
practice or could recommend standard-of-care throughout. This type of SMART
would allow for comparisons of the other embedded DTRs against this standard-of-
care DTR.
An important consideration in the design of a SMART is the choice of embedded
tailoring variable, if included. Embedding a tailoring variable into the trial also
embeds it into any DTRs the trial is able to study, so its inclusion should be well
justified based on scientific, ethical, or practical considerations. The tailoring vari-
able is a component of the DTR. As such, its operating characteristics are part of the
intervention as well as the trial. Therefore, tailoring variables should be relatively
easily measured in a clinical setting and reliably identify responders and nonre-
sponders. A variable which may “misclassify” individuals is not a good choice of
tailoring variable, as it may make assignment to subsequent treatment unsystematic.
This is an issue that should be anticipated and designed around, rather than corrected
post hoc.
In a SMART, the same cohort of individuals participates in all stages of treatment,
and a single study consent process is used for all these individuals (prior to the first
stage randomization). SMARTs should not employ multiple consents (e.g., one at
each randomization point); doing so could severely limit the ability to make infer-
ences about the relative effects of the DTRs embedded in a SMART. Rather, the
single consent process should inform participants of all possible treatment sequences
to which they may be assigned during the study. Because the goal of a SMART is to
develop a high-quality DTR, participants in the trial should experience the DTR as
close to a real-world implementation as possible; a reconsent process would detract
from this goal. Should they wish, investigators could randomize participants to
DTRs at the start of the trial, though this should be carefully blinded to avoid
expectancy effects: Participants should not have knowledge of their future treatment
assignments.
Importantly, SMARTs are typically not adaptive trial designs despite having
similar terminology (Meurer et al. 2012). An adaptive trial design is a multistage
study in which accumulating data are used to modify characteristics of the trial as they are collected
(Dragalin 2006). In contrast, in a SMART, typically all participants move through
every stage of the trial and the trial design remains fixed; the goal is to learn how best
to adapt treatment to the changing needs of the individual. In adaptive trials, the trial
is adaptive; in SMARTs, the focus is on developing an adaptive treatment strategy
(a DTR). More recently, statisticians have begun to develop randomized trial designs
that are both sequentially randomized and adaptive (Cheung et al. 2015).
Readers interested in more in-depth information about SMARTs and DTRs might
see the books by Chakraborty and Moodie (2013), Kosorok and Moodie (2015), or
Tsiatis et al. (2019). In addition, Nahum-Shani et al. (2012b) and Ogbagaber et al.
(2016) provide tutorials on analytic strategies for comparing embedded DTRs in a
SMART with a continuous, end-of-study outcome. Nahum-Shani et al. (2020) also
provide a tutorial for analyzing SMARTs with longitudinal outcomes. For analytic
and sample size considerations for SMARTs with binary outcomes, see Kidwell et al.
(2018); survival outcomes, Feng and Wahed (2009), Li and Murphy (2011); and
continuous longitudinal outcomes, Lu et al. (2016), Li (2017), Dziak et al. (2019),
and Seewald et al. (2020). Recently, methods have been developed for clustered
SMARTs for developing clustered DTRs (NeCamp et al. 2017). Finally, for infor-
mation on estimating optimal DTRs from a SMART see Moodie et al. (2007),
Murphy (2003), Nahum-Shani et al. (2012a), or Zhao and Laber (2014).

Summary and Conclusion

Dynamic treatment regimens provide a guide for the type of sequential intervention
decision-making that arises naturally in clinical settings (Lavori and Dawson 2014).
Sequential, multiple assignment, randomized trials (SMARTs) are one type of exper-
imental design that can be used by researchers for developing DTRs. This chapter
discussed the components that make up DTRs, and scientific questions that researchers
may have about them. It then described how a SMART can be used to address these
scientific questions. The ExTENd SMART study – designed to develop a DTR for
adults with alcohol use disorder – was used to illustrate these ideas.
For clinical trial researchers interested in developing efficient and effective DTRs,
the SMART may be a useful design to consider. As discussed in the chapter, there are
different types of SMART designs. Ultimately, for researchers who choose to use a
SMART, the type of SMART design they choose should be grounded in the scientific
questions they are seeking to answer.

Key Facts

Sequential, multiple-assignment randomized trials (SMARTs) are experimental designs which aid in the development of sequences of treatments which are able to
adapt to an individual’s changing needs, called dynamic treatment regimens. The
key feature of a SMART is that some or all participants are randomized more than
once. Like any clinical trial, the design of a SMART is motivated by specific
scientific questions; for SMARTs, those questions are about dynamic treatment
regimens.
Cross-References

▶ Essential Statistical Tests
▶ Estimation and Hypothesis Testing
▶ Factorial Trials
▶ Multi-arm Multi-stage (MAMS) Platform Randomized Clinical Trials
▶ Power and Sample Size
▶ Principles of Clinical Trials: Bias and Precision Control

Acknowledgments Funding was provided by the National Institutes of Health (P50DA039838, R01DA039901) and the Institute of Education Sciences (R324B180003). Funding for the ExTENd
study, which was used to illustrate ideas, was provided by the National Institutes of Health
(R01AA014851; PI: David Oslin).

References
Almirall D, DiStefano C, Chang Y-C, Shire S, Kaiser A, Lu X, Nahum-Shani I, Landa R, Mathy P, Kasari C (2016) Longitudinal effects of adaptive interventions with a speech-generating device in minimally verbal children with ASD. J Clin Child Adolesc Psychol 45(4):442–456. https://doi.org/10.1080/15374416.2016.1138407
Almirall D, Nahum-Shani I, Lu W, Kasari C (2018) Experimental designs for research on adaptive
interventions: singly and sequentially randomized trials. In: Collins LM, Kugler KC (eds)
Optimization of behavioral, biobehavioral, and biomedical interventions: advanced topics,
Statistics for social and behavioral sciences. Springer International Publishing, Cham, pp
89–120. https://doi.org/10.1007/978-3-319-91776-4_4
August GJ, Piehler TF, Bloomquist ML (2016) Being ‘SMART’ about adolescent conduct problems
prevention: executing a SMART pilot study in a juvenile diversion agency. J Clin Child Adolesc
Psychol 45(4):495–509. https://doi.org/10/ghpbrn
Cable N, Sacker A (2007) Typologies of alcohol consumption in adolescence: predictors and adult
outcomes. Alcohol Alcoholism 43(1):81–90. https://doi.org/10/fpmm33
Chakraborty B, Moodie EEM (2013) Statistical Methods for Dynamic Treatment Regimes. Statis-
tics for biology and health. Springer New York, New York, NY. https://doi.org/10.1007/978-1-
4614-7428-9
Cheung YK, Chakraborty B, Davidson KW (2015) Sequential multiple assignment randomized
trial (SMART) with adaptive randomization for quality improvement in depression treatment
program: SMART with adaptive randomization. Biometrics 71(2):450–459. https://doi.org/10.
1111/biom.12258
Chronis-Tuscano A, Wang CH, Strickland J, Almirall D, Stein MA (2016) Personalized treatment
of mothers with ADHD and their young at-risk children: a SMART pilot. J Clin Child Adolesc
Psychol 45(4):510–521. https://doi.org/10/gg2h36
Collins LM, Nahum-Shani I, Almirall D (2014) Optimization of behavioral dynamic treatment
regimens based on the sequential, multiple assignment, randomized trial (SMART). Clin Trials
11(4):426–434. https://doi.org/10/f6cjxm
Dragalin V (2006) Adaptive designs: terminology and classification. Drug Inf J 40(4):425–435.
https://doi.org/10/ghpbrt
Dziak JJ, Yap JRT, Almirall D, McKay JR, Lynch KG, Nahum-Shani I (2019) A data analysis
method for using longitudinal binary outcome data from a SMART to compare adaptive
interventions. Multivar Behav Res 0(0):1–24. https://doi.org/10/gftzjg
Feng W, Wahed AS (2009) Sample size for two-stage studies with maintenance therapy. Stat Med
28(15):2028–2041. https://doi.org/10.1002/sim.3593
Gunlicks-Stoessel M, Mufson L, Westervelt A, Almirall D, Murphy SA (2016) A pilot SMART for
developing an adaptive treatment strategy for adolescent depression. J Clin Child Adolesc
Psychol 45(4):480–494. https://doi.org/10/ghpbrv
Hall KL, Nahum-Shani I, August GJ, Patrick ME, Murphy SA, Almirall D (2019) Adaptive
intervention designs in substance use prevention. In: Sloboda Z, Petras H, Robertson E, Hingson
R (eds) Prevention of substance use, Advances in prevention science. Springer International
Publishing, Cham, pp 263–280. https://doi.org/10.1007/978-3-030-00627-3_17
Heilig M, Egli M (2006) Pharmacological treatment of alcohol dependence: target symptoms and
target mechanisms. Pharmacol Ther 111(3):855–876. https://doi.org/10/cfs7df
Kasari C, Kaiser A, Goods K, Nietfeld J, Mathy P, Landa R, Murphy SA, Almirall D (2014)
Communication interventions for minimally verbal children with autism: a sequential multiple
assignment randomized trial. J Am Acad Child Adolesc Psychiatry 53(6):635–646. https://doi.
org/10.1016/j.jaac.2014.01.019
Kidwell KM, Seewald NJ, Tran Q, Kasari C, Almirall D (2018) Design and analysis considerations
for comparing dynamic treatment regimens with binary outcomes from sequential multiple
assignment randomized trials. J Appl Stat 45(9):1628–1651. https://doi.org/10.1080/02664763.
2017.1386773
Kilbourne AM, Almirall D, Eisenberg D, Waxmonsky J, Goodrich DE, Fortney JC, Kirchner JE et al (2014) Protocol: adaptive implementation of effective programs trial (ADEPT): cluster randomized SMART trial comparing a standard versus enhanced implementation strategy to improve outcomes of a mood disorders program. Implement Sci 9(1):132. https://doi.org/10/f6q9fc
Kilbourne AM, Smith SN, Choi SY, Koschmann E, Liebrecht C, Rusch A, Abelson JL et al (2018)
Adaptive school-based implementation of CBT (ASIC): clustered-SMART for building an
optimized adaptive implementation intervention to improve uptake of mental health interven-
tions in schools. Implement Sci 13(1):119. https://doi.org/10/gd7jt2
Kosorok MR, Moodie EEM (eds) (2015) Adaptive treatment strategies in practice: planning trials
and analyzing data for personalized medicine. Society for Industrial and Applied Mathematics,
Philadelphia, PA. https://doi.org/10.1137/1.9781611974188
Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA (2014) Dynamic treatment regimes:
technical challenges and applications. Electron J Stat 8(1):1225–1272. https://doi.org/10/
gg29c8
Lavori PW, Dawson R (2004) Dynamic treatment regimes: practical design considerations. Clin
Trials 1(1):9–20. https://doi.org/10/cqtvnn
Lavori PW, Dawson R (2014) Introduction to dynamic treatment strategies and sequential multiple
assignment randomization. Clin Trials 11(4):393–399. https://doi.org/10.1177/
1740774514527651
Lei H, Nahum-Shani I, Lynch K, Oslin D, Murphy SA (2012) A ‘SMART’ design for building
individualized treatment sequences. Annu Rev Clin Psychol 8(1):21–48. https://doi.org/10.
1146/annurev-clinpsy-032511-143152
Li Z (2017) Comparison of adaptive treatment strategies based on longitudinal outcomes in
sequential multiple assignment randomized trials. Stat Med 36(3):403–415. https://doi.org/10.
1002/sim.7136
Li Z, Murphy SA (2011) Sample size formulae for two-stage randomized trials with survival
outcomes. Biometrika 98(3):503–518. https://doi.org/10.1093/biomet/asr019
Longabaugh R, Zweben A, Locastro JS, Miller WR (2005) Origins, issues and options in the
development of the combined behavioral intervention. J Stud Alcohol Suppl (15):179–187.
https://doi.org/10/ghpb9f
Lu X, Nahum-Shani I, Kasari C, Lynch KG, Oslin DW, Pelham WE, Fabiano G, Almirall D (2016)
Comparing dynamic treatment regimes using repeated-measures outcomes: modeling consider-
ations in SMART studies. Stat Med 35(10):1595–1615. https://doi.org/10/gg2gxc
Lunceford JK, Davidian M, Tsiatis AA (2002) Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics 58(1):48–57. https://doi.org/10/bk2dj9
McKay JR (2005) Is there a case for extended interventions for alcohol and drug use disorders?
Addiction 100(11):1594–1610. https://doi.org/10/btpvtr
Meurer WJ, Lewis RJ, Berry DA (2012) Adaptive clinical trials: a partial remedy for the therapeutic
misconception? JAMA-J Am Med Assoc 307(22):2377–2378. https://doi.org/10/gf3pmm
Moodie EEM, Richardson TS, Stephens DA (2007) Demystifying optimal dynamic treatment
regimes. Biometrics 63(2):447–455. https://doi.org/10/ffcq8r
Murphy SA (2003) Optimal dynamic treatment regimes. J R Stat Soc B 65(2):331–355. https://doi.
org/10/dmmr89
Murphy SA (2005) An experimental design for the development of adaptive treatment strategies. Stat Med 24(10):1455–1481. https://doi.org/10.1002/sim.2022
Murphy SA, Almirall D (2009) Dynamic treatment regimens. In: Encyclopedia of Medical Decision
Making, 1:419–22. SAGE Publications, Thousand Oaks
Murphy SA, Bingham D (2009) Screening experiments for developing dynamic treatment regimes.
J Am Stat Assoc 104(485):391–408. https://doi.org/10/dk2gpv
Naar-King S, Ellis DA, Carcone AI, Templin T, Jacques-Tiura AJ, Hartlieb KB, Cunningham P, Jen
K-LC (2016) Sequential multiple assignment randomized trial (SMART) to construct weight
loss interventions for African American adolescents. J Clin Child Adolesc Psychol 45(4):428–
441. https://doi.org/10/gf4ks4
Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, Waxmonsky JG, Yu J,
Murphy SA (2012a) Q-learning: a data analysis method for constructing adaptive interventions.
Psychol Methods 17(4):478–494. https://doi.org/10.1037/a0029373
Nahum-Shani I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, Waxmonsky JG, Yu J,
Murphy SA (2012b) Experimental design and primary data analysis methods for comparing
adaptive interventions. Psychol Methods 17(4):457–477. https://doi.org/10.1037/a0029372
Nahum-Shani I, Ertefaie A, Lu X, Lynch KG, McKay JR, Oslin DW, Almirall D (2017) A
SMART data analysis method for constructing adaptive treatment strategies for substance use
disorders. Addiction 112(5):901–909. https://doi.org/10/ghpb9n
Nahum-Shani I, Almirall D, Yap JRT, McKay JR, Lynch KG, Freiheit EA, Dziak JJ (2020) SMART
longitudinal analysis: a tutorial for using repeated outcome Measures from SMART studies to
compare adaptive interventions. Psychol Methods 25(1):1–29. https://doi.org/10/ggttht
NeCamp T, Kilbourne A, Almirall D (2017) Comparing cluster-level dynamic treatment regimens
using sequential, multiple assignment, randomized trials: regression estimation and sample size
considerations. Stat Methods Med Res 26(4):1572–1589. https://doi.org/10.1177/
0962280217708654
Oetting AI, Levy JA, Weiss RD, Murphy SA (2011) Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout PE, Keyes KM, Ornstein K (eds)
Causality and psychopathology: finding the determinants of disorders and their cures. Oxford
University Press, New York, pp 179–205
Ogbagaber SB, Karp J, Wahed AS (2016) Design of sequentially randomized trials for testing
adaptive treatment strategies. Stat Med 35(6):840–858. https://doi.org/10.1002/sim.6747
Oslin DW, Berrettini WH, O’Brien CP (2006) Targeting treatments for alcohol dependence: the
pharmacogenetics of naltrexone. Addict Biol 11(3–4):397–403. https://doi.org/10/fgcfbk
Pelham WE Jr, Fabiano GA, Waxmonsky JG, Greiner AR, Gnagy EM, Pelham WE III, Coxe S et al
(2016) Treatment sequencing for childhood ADHD: a multiple-randomization study of adaptive
medication and behavioral interventions. J Clin Child Adolesc Psychol 45(4):396–415. https://
doi.org/10/gfn9xr
Quanbeck A, Almirall D, Jacobson N, Brown RT, Landeck JK, Madden L, Cohen A et al (2020)
The balanced opioid initiative: protocol for a clustered, sequential, multiple-assignment ran-
domized trial to construct an adaptive implementation strategy to improve guideline-concordant
opioid prescribing in primary care. Implement Sci 15(1):26. https://doi.org/10/gjh5tx
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688–701. https://doi.org/10.1037/H0037350
Schmitz JM, Stotts AL, Vujanovic AA, Weaver MF, Yoon JH, Vincent J, Green CE (2018) A
sequential multiple assignment randomized trial for cocaine cessation and relapse prevention:
tailoring treatment to the individual. Contemp Clin Trials 65(February):109–115. https://doi.
org/10/gc3tqr
Seewald NJ, Kidwell KM, Nahum-Shani I, Wu T, McKay JR, Almirall D (2020) Sample size
considerations for comparing dynamic treatment regimens in a sequential multiple-assignment
randomized trial with a continuous longitudinal outcome. Stat Methods Med Res 29(7):1891–
1912. https://doi.org/10/gf85ss
Sherwood NE, Butryn ML, Forman EM, Almirall D, Seburg EM, Lauren Crain A, Kunin-Batson
AS, Hayes MG, Levy RL, Jeffery RW (2016) The BestFIT trial: a SMART approach to
developing individualized weight loss treatments. Contemp Clin Trials 47(March):209–216.
https://doi.org/10.1016/j.cct.2016.01.011
Thall PF, Wathen JK (2005) Covariate-adjusted adaptive randomization in a sarcoma trial with
multi-stage treatments. Stat Med 24(13):1947–1964. https://doi.org/10/d5ztnt
Thall PF, Millikan RE, Sung H-G (2000) Evaluating multiple treatment courses in clinical trials.
Stat Med 19(8):1011–1028. https://doi.org/10/bmv5jc
Thall PF, Sung H-G, Estey EH (2002) Selecting therapeutic strategies based on efficacy and death in
multicourse clinical trials. J Am Stat Assoc 97(457):29–39. https://doi.org/10/dx3fkb
Tsiatis AA, Davidian M, Holloway ST, Laber EB (2019) Dynamic Treatment Regimes: Statistical
Methods for Precision Medicine. Monographs on statistics and applied probability 164. CRC
Press LLC, Milton
Vock DM, Almirall D (2018) Sequential multiple assignment randomized trial (SMART). In:
Balakrishnan N, Colton T, Everitt W, Piegorsch F, Teugels JL (eds) Wiley StatsRef: statistics
reference online. https://doi.org/10.1002/9781118445112.stat08073
Wahed AS, Tsiatis AA (2004) Optimal estimator for the survival distribution and related quantities
for treatment policies in two-stage randomization designs in clinical trials. Biometrics 60(1):
124–133. https://doi.org/10/dc4kfb
Wahed AS, Tsiatis AA (2006) Semiparametric efficient estimation of survival distributions in
two-stage randomisation designs in clinical trials with censored data. Biometrika 93(1):163–
177. https://doi.org/10/cgchp6
Zhao Y-Q, Laber EB (2014) Estimation of optimal dynamic treatment regimes. Clin Trials 11(4):400–
407. https://doi.org/10/f6cjrn
80 Monte Carlo Simulation for Trial Design Tool

Suresh Ankolekar, Cyrus Mehta, Rajat Mukherjee, Sam Hsiao, Jennifer Smith, and Tarek Haddad

Contents
Introduction 1564
Monte Carlo Simulations and Trial Design 1565
Case Study 1: The VALOR Trial 1567
Motivation of Adaptive Sample Size Re-Estimation 1567
Statistical Methodology 1568
VALOR Simulations 1570
Practicalities of Running an Adaptive Trial (With Reference to VALOR) 1574
Case Study 2: SPYRAL HTN OFF-MED Trial 1575
Motivation of Bayesian Design with Discount Prior Methodology 1575
Statistical Design 1576
Discount Prior Methodology in the Context of the SPYRAL Trial Design 1577
Compare 1577
Discount 1577
Combine 1578
Estimate 1579
Role of Simulation 1579
Validation of Simulation Tools 1580
Summary and Conclusion 1583
Key Facts 1583
Cross-References 1584
References 1584

S. Ankolekar (*)
Cytel Inc, Cambridge, MA, USA
Maastricht School of Management, Maastricht, Netherlands
e-mail: [email protected]
C. Mehta
Cytel Inc, Cambridge, MA, USA
Harvard T.H. Chan School of Public Health, Boston, MA, USA
e-mail: [email protected]
R. Mukherjee · S. Hsiao
Cytel Inc, Cambridge, MA, USA
J. Smith
Sunesis Pharmaceuticals Inc, San Francisco, CA, USA
T. Haddad
Medtronic Inc, Minneapolis, MN, USA


Abstract
Clinical trials often involve design issues with mathematically intractable com-
plexity. Being part of multi-phase drug development programs, the trial designs
need to incorporate prior information in terms of historical data from earlier
phases and available knowledge about related trials. Some trials with inherent
limits on data collection may need augmentation with simulated pseudo-data. For
planning of interim looks, group sequential and adaptive trials require accurate
timeline predictions of reaching clinical milestones involving complex set of
operational and clinical models. In general, clinical trial design involves an
interactive process involving interplay of models, data, assumptions, insights,
and experiences to address specific design issues before and during the trial. This
offers a rich context for simulation-centric modeling, the theme of this chapter.
We will focus on practical considerations of applying simulation modeling tools
and techniques to design and implementation of clinical trials. This will be
achieved through two real-life case studies and relevant illustrative examples
drawn from literature and our practical experience.

Keywords
Adaptive design · Bayesian discount · Sample-size re-estimation · Design
simulation · Power prior · Discount function · O’Brien-Fleming efficacy

Introduction

This chapter focuses on the application of Monte Carlo simulations for clinical trial
design. In view of the emphasis of the book on principles and practice, we will focus
on practical considerations of applying simulation modeling tools and techniques to
design and implementation of clinical trials. This will primarily be achieved through
two real-life case studies and relevant illustrative examples drawn from literature and
our practical experience.
The chapter is organized in five sections. The next section introduces the basic
simulation concepts and relates them to clinical trials. We deliberately take a
simulation-centric view of clinical trials in that section and make a case for an enhanced role of simulation techniques in their design and implementation. This will be followed by two detailed sections covering two real-life case studies, one completed and the other currently ongoing. Finally, we conclude the chapter with a few remarks.
Monte Carlo Simulations and Trial Design

Thompson (1999) gives a notion of simulation in terms of generation of pseudo-data on the basis of a model, a database, or the use of a model in the light of a database. The generation of the pseudo-data involves pseudorandom numbers, probabilistic models, and possibly real data from a historical clinical trial or a currently ongoing one. For some standard probabilistic models, it could directly involve a straightforward “inverse transformation” or, indirectly, “Accept-Reject” methods, typically described in most simulation textbooks, such as Robert and Casella (2010). For example, for an exponential distribution $f(x; \lambda) = \lambda e^{-\lambda x}$ with corresponding cumulative distribution $F(x; \lambda) = 1 - e^{-\lambda x}$, the pseudo-data can simply be generated by substituting a uniform random variate $U \sim U[0,1]$ for $F(X; \lambda)$ and solving for X, as $X = -\ln(1-U)/\lambda$. A generalized inverse transformation could be applied to generate pseudo-data for related distributions, say, $Y \sim \mathrm{Gamma}(\alpha, \beta)$ as $Y = \sum_{j=1}^{\alpha} X_j$, where each $X_j$ is generated from the related exponential distribution with $\lambda = \alpha/\beta$. If a cumulative distribution $F(X; \theta)$ is somehow not amenable to direct inverse transformation, then it is possible to generate the pseudo-data indirectly using the “Accept-Reject” method, where a simpler distribution $g(X; \gamma)$ is used to generate $Y \sim G(X; \gamma)$ in conjunction with $U \sim U[0,1]$, and Y is accepted as X only if $U \leq (1/M) \cdot f(Y)/g(Y)$, where M is a constant satisfying $f(x)/g(x) \leq M$ for all x. The pseudo-data can also be generated in conjunction with the accumulated data of an ongoing clinical trial. For example, in the widely used Poisson-Gamma model for enrolment, $\lambda$ for the Poisson process is generated from a posterior $\mathrm{Gamma}(\alpha, \beta)$, where the posterior parameters are computed as Bayesian updates of a prior $\mathrm{Gamma}(\alpha_0, \beta_0)$ with $\alpha = \alpha_0 + n$ and $\beta = \beta_0 + \tau$, where n is the realized enrolment at time $\tau$. More sophisticated computations could be carried out efficiently using Markov Chain Monte Carlo (MCMC) techniques with Metropolis-Hastings and Gibbs Sampler algorithms, as described in Suess and Trumbo (2010).
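These generation schemes are short to code. The R sketch below, offered as an illustration rather than production code, implements the inverse-transform generator for the exponential distribution, the sum-of-exponentials construction for the Gamma, and the Poisson-Gamma posterior update of an enrolment rate; all numerical inputs are illustrative assumptions.

set.seed(1)

# Inverse transform for Exponential(lambda): solve F(X) = U for X
r_exp_inv <- function(n, lambda) -log(1 - runif(n)) / lambda

# Gamma(alpha, mean beta) as a sum of alpha exponentials with rate alpha/beta
# (integer shape alpha assumed)
r_gamma_sum <- function(n, alpha, beta) {
  replicate(n, sum(r_exp_inv(alpha, lambda = alpha / beta)))
}

# Posterior draws of the Poisson enrolment rate after observing n_obs
# enrolments by calendar time tau, starting from a Gamma(alpha0, beta0) prior
r_post_rate <- function(ndraws, alpha0, beta0, n_obs, tau) {
  rgamma(ndraws, shape = alpha0 + n_obs, rate = beta0 + tau)
}

mean(r_exp_inv(1e5, lambda = 0.5))                  # close to 1/0.5 = 2
mean(r_gamma_sum(1e4, alpha = 3, beta = 6))         # close to 6
mean(r_post_rate(1e5, 2, 1, n_obs = 30, tau = 10))  # close to 32/11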
The purpose of this brief introduction to the generation of pseudo-data was to highlight its simplicity and to counter the rather harsh traditional view that simulation is a complex technique to be avoided whenever an analytic solution is possible. In practice, computing environments such as R and other industry
standard design tools have built-in functionality to implement the techniques. For
deeper theoretical and practical insights, the reader is referred to standard textbooks
on the subject, such as Robert and Casella (2010), Suess and Trumbo (2010), and
others.
Thompson (1999) further asserts that as a computational aid in dealing with and
creating models of reality, the simulation could potentially be an integral part of the
modeling process itself. It is a device for working with models, testing models, and
building new models, a kind of paradigm for realistic evolutionary modeling,
beyond simply being a mechanism for dealing with old modeling techniques, say,
the numerical approximation to pointwise evaluation. This assertion is central to our
simulation-centric view of clinical trials.
Thompson’s assertion also resonates with an interesting “retrospective” view of clinical trial design from Evans (2010): “Although clinical trials are conducted
prospectively, one can think of them as being designed retrospectively. That is,
there is a vision of the scientific claim (i.e., answer to the research question) that a
project team would like to make at the end of the trial. In order to make that claim,
appropriate analyses must be conducted in order to justify the claim. In order to
conduct the appropriate analyses, specific data must be collected in a manner
suitable to conduct the analyses. In order to collect these necessary data, a thorough
plan for data collection must be developed. This sequential retrospective strategy
continues until a trial design has been constructed....” The “retrospective” view
would straddle a relevant past and an imaginary future, involving relevant models/
data/assumptions carried forward from previous phases and available knowledge in
terms of insights/experiences and published literature, and various future design
options for the protocol and statistical analysis plan (SAP). The “retrospective”
strategy implies an interactive process involving interplay of models/data/assump-
tions/insights/experiences to address specific design issues, and offers a rich context
for simulation-centric modeling.
The context for a simulation-centric clinical trial design is further reinforced by a
Clinical Scenario Evaluation (CSE) approach introduced by Benda et al. (2010), and
subsequently refined by Friede et al. (2010). The approach, consisting of three components (data, analysis, and evaluation models), involves a thorough assessment
of multiple design and analysis strategies and their sensitivity to potential changes in
the underlying assumptions. Dmitrienko and Pukstenis (2017) describe an imple-
mentation of the approach in open-source R package, Mediana developed by Paux
and Dmitrienko (2016) and being currently maintained by representatives from
multiple biopharmaceutical companies. The implementation takes a simulation-
centric view of the clinical trial design. Accordingly, “decomposition into three
independent components provides a structured framework for clinical trial simula-
tions which enables clinical trial researchers to carry out a systematic quantitative
assessment of the operating characteristics of candidate designs and statistical
methods to characterize their performance in multiple settings.”
There is a clear trend of enhanced support for a simulation-centric view among the
industry standard tools used for clinical trial design. For example, East 6 (2018)
supports extensive simulation of clinical trial designs, including conditional simu-
lations for enrolment and clinical events predictions involving accumulated blinded
and unblinded data. Some of the features have been used in the real-life case studies
covered in the later sections.
The pseudo-data generated by the simulation model could be analyzed to explore
specific design issues. One of the dominant themes has been power computations in
the context of sample size and dose-finding studies. Chang (2011) offers a wide range of simulation algorithms to generate pseudo-data for classical and adaptive designs
and analyze it to compute power. Arnold et al. (2011) use a simulation study to
estimate power for individual and cluster-randomized designs. Antonijevic et al.
(2010) use a simulation study to assess the impact of Ph 2 dose selection strategy on Ph 3 probability of success.
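As a minimal illustration of this theme, the R sketch below estimates power for a two-arm comparison of a continuous endpoint by brute-force simulation; the inputs are assumptions chosen so that the simulated power lands near the analytic value of roughly 80%.

sim_power <- function(n_per_arm, delta, sd = 1, alpha = 0.025, nsim = 5000) {
  rejections <- replicate(nsim, {
    x <- rnorm(n_per_arm, mean = 0, sd = sd)      # control arm
    y <- rnorm(n_per_arm, mean = delta, sd = sd)  # experimental arm
    t.test(y, x, alternative = "greater")$p.value < alpha
  })
  mean(rejections)  # simulated power
}
set.seed(123)
sim_power(n_per_arm = 86, delta = 0.43)  # approximately 0.80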
Adaptive designs are necessarily simulation-centric by the very nature of the design and the range of issues involved in the continuum of design and implementation. Gao
et al. (2008) use an extensive simulation experiment to establish equivalence of their
proposed method of preserving type I error after an unblinded sample size
reestimation with two other methods. The scope of simulation in design and imple-
mentation of a clinical trial based on sample size reestimation will be elaborated in the
context of a real-life case study in the next section. Muller et al. (2007) consider
simulation-based methods for exploration and maximization of expected utility in
sequential decision problems, such as the optimal stopping problem in a clinical trial.
The problem involves analytically intractable expected utility integrals at each stage.
Jiang et al. (2014) use extensive simulations to evaluate their proposed Bayesian
prediction model relating a biomarker to the clinical endpoint for dichotomous
variables. Haddad et al. (2017) use a novel method for augmenting a Bayesian
medical device trial by using virtual patient pseudo-data, where the extent of
augmentation is controlled by a parameterized “discount function” based on simi-
larity between pseudo-data and observed data, as described in section “Case Study 2:
SPYRAL HTN OFF-MED Trial.” A related simulation approach that uses historical
data for augmentation instead of virtual patient pseudo-data is covered in the real-life
case study presented later in the section.

Case Study 1: The VALOR Trial

Acute myelogenous leukemia (AML) is a disease of the bone marrow with poor
prognosis and few available therapies, a continued area of unmet need. The VALOR
study was a phase 3, double-blind, placebo-controlled trial conducted at 101 inter-
national sites in 711 patients with AML. Patients were randomized 1:1 to vosaroxin
plus cytarabine (vos/cyt) or placebo plus cytarabine (pla/cyt) stratified by disease
status, age, and geographic location. The primary and secondary efficacy endpoints
were overall survival (OS) and complete response rate. This study is registered at
clinicaltrials.gov (NCT01191801).

Motivation of Adaptive Sample Size Re-Estimation

Prior to designing the VALOR trial, Sunesis Pharmaceuticals completed a single-arm Ph 2 trial in relapsed/refractory AML. Observed mean OS in that trial was approximately 7 months. Assuming an expected OS of 5 months in the control arm, the VALOR trial was powered at 90% for a hazard ratio (HR = 5/7 = 0.71) requiring 375
events and 450 patients. However, there can be uncertainty around Ph 2 estimates.
While HR greater than 0.71 (say 0.75) could still be clinically meaningful, powering
for this smaller effect size would require a large, initially unfeasible number of
patients. An adaptive approach allows the sample size to be conditional on observed
data accumulating in the first part of the Ph 3 trial, avoiding unnecessarily enrolling
patients at the start if the true HR is close to 0.71 and allowing additional patients to
be enrolled later if the effect size is smaller but still of meaningful magnitude. This
flexible approach allowed the exposure of patients and the expenditure of resources
to be conditional on observed results at the interim.

Statistical Methodology

VALOR was a two-arm trial with time to death as the primary endpoint. It was required
to have 100 × (1 − β) = 90% power to detect an improvement in median survival from
5 months on pla/cyt (the control arm) to 7 months on vos/cyt (the experimental arm)
(HR = 0.71) using a one-sided log-rank test at the significance level α = 2.5%. The
trial was designed with one interim analysis when 50% of the death events were
observed, at which point one of the following four decisions could have been taken:

(a) Terminate for overwhelming efficacy
(b) Terminate for futility
(c) Increase the number of death events and sample size
(d) Continue the trial as planned

In trials with a time to event endpoint, the power is driven by number of events,
say D, and not by sample size. Sample size plays an indirect role, however, since the larger the sample size, the earlier the required D events are reached and the shorter the
expected study duration. To meet the 90% power requirement, the target number of
death events D is given by the formula

D = [(zα + zβ) / ln(HR)]² × IF

where zγ is the upper (1 − γ) × 100% percentile of the standard normal distribution
and IF is an inflation factor to recover power loss due to spending some of the
available α for possible early stopping at the interim analysis. Values of IF for
different α-spending functions are available in Jennison and Turnbull (2000). An
analytical relationship between required number of events, sample size, patient
enrollment rates, and study duration is available in Kim and Tsiatis (1990). Based
on these considerations, the planned initial enrolment was for 450 patients and 375
events with the possibility of increasing both the planned events and sample size by
50% if the results of the interim analysis fell in the promising zone (see below). Note that the total enrolment target was increased to allow for 5% dropouts, giving an effective sample size of either 450 or 676 patients. Enrolment assumptions were tested periodically by
simulating the pooled study data (blinded to the treatment assignment) so that
accurate assessments of dates for the interim and final analyses could be obtained.
These simulations are described later in VALOR simulation.
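As a quick numerical check (our illustration, not taken from the trial report), the display above can be evaluated directly. For 1:1 allocation, a Schoenfeld-type version of the formula carries an extra factor of 4 (that is, 1/(r(1 − r)) for allocation fraction r = 0.5), which the display may fold into IF; the inflation-factor value below is assumed purely for illustration.

## Sketch of the target event count (Schoenfeld-type approximation,
## 1:1 allocation; the IF value is an assumption for illustration).
z_a <- qnorm(0.975)   # one-sided alpha = 0.025
z_b <- qnorm(0.90)    # 90% power
hr  <- 0.71
IF  <- 1.05           # assumed inflation factor for the interim look
D   <- 4 * ((z_a + z_b) / log(hr))^2 * IF
ceiling(D)            # close to the 375 planned events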
The decision to terminate for efficacy at the interim analysis time point would be
based on the O’Brien-Fleming efficacy boundary derived from the Lan and DeMets
(1983) α-spending function, invoked when 50% of the death events were observed,
and the appropriate amount of α was spent to ensure that the overall one-sided type-1
error remained 0.025. This approach results in a one-sided significance level of
0.001525 for the interim analysis (with 187 of the planned 375 events) and 0.0245
for the final analysis (with 375 events). The overall significance level of this test
procedure was guaranteed to be 0.025 (one-sided).
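The interim significance level quoted above can be reproduced from the O'Brien-Fleming-type Lan-DeMets spending function; the following small sketch uses the standard form of that function (the code itself is ours). Note that the 0.0245 final level is not simply 0.025 − 0.001525; it is derived from the joint distribution of the interim and final test statistics.

## One-sided O'Brien-Fleming-type Lan-DeMets alpha-spending function.
of_spend <- function(t, alpha = 0.025) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))
}
of_spend(0.5)   # ~0.001525, spent at the 50%-information interim
of_spend(1.0)   # 0.025, the total one-sided alpha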
If the efficacy criterion was not met, one of the remaining three decisions would
be taken based on the conditional power or probability of achieving statistical
significance at the end of the trial conditional on the results observed at the interim
analysis. Precise formulae for conditional power are available in Mehta and Pocock (2011) and Gao et al. (2008); a computational sketch follows the list below. The trial would now be modified as follows:

• Fix a maximum upper limit of 562 for the number of events and 676 for the
sample size.
• Compute the conditional power with the number of events being increased to 562,
based on a hazard ratio of 0.71 as specified at the design stage. If the conditional
power (CP-plan-562) so computed is less than 50%, the DSMB would recom-
mend stopping for futility. If the futility criterion was not met, continue as
discussed below.
• Compute the conditional power at 187 events, with the number of events equal to
375 (CP-obs-375) at the final analysis as initially specified, based on the hazard
ratio estimated at the interim analysis.
– If CP-obs-375 ≤ 30%, the results are considered unfavorable. Continue the
trial with no further change until 375 events are reached, and perform the final
analysis.
– If 30% < CP-obs-375 ≤ 90%, the results are considered promising. Increase
the number of events to 562 and sample size to 676.
– If CP-obs-375 > 90%, the results are considered favorable. Continue the trial
with no further change until 375 events are reached, and perform the final
analysis.
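A sketch of the conditional power computation underlying these rules is given below. This is a common form of the formula (cf. Mehta and Pocock 2011), with statistical information approximated by events/4; it is our illustrative rendering rather than the trial's validated code, and the interim z-value shown is hypothetical.

## Conditional power at the interim of a survival trial (sketch).
## z1: interim log-rank z (positive favors treatment); d1, d_max:
## interim and final event counts; hr: hazard ratio assumed for the
## remainder of the trial.
cond_power <- function(z1, d1, d_max, hr, alpha = 0.025) {
  b     <- qnorm(1 - alpha)   # final critical value (ignoring the
                              # negligible alpha spent at the interim)
  theta <- -log(hr)           # drift per unit of information
  i1    <- d1 / 4
  imax  <- d_max / 4
  1 - pnorm((b * sqrt(imax) - z1 * sqrt(i1)) / sqrt(imax - i1) -
              theta * sqrt(imax - i1))
}
cond_power(z1 = 1.5, d1 = 187, d_max = 562, hr = 0.71)  # CP-plan-562
## CP-obs-375 would instead plug in the interim estimate of the HR.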

The DSMB was allowed to exercise clinical judgement, based on its access to unblinded safety and efficacy data, to make minor adjustments to the sample size obtained by the above rules.
Because the number of events could potentially be increased in a data-dependent
manner at the interim look, the final analysis would not use the conventional log-
rank statistic to determine if statistical significance is reached. Instead it used the
weighted statistic proposed by Cui et al. (1999) in which the independent log-rank
statistics of the two stages are combined by prespecified weights that are equal to the
planned proportion of total events at which the interim analysis would be taken if
there were no change in the design. In the present case, the trial was designed for 375
events with an interim analysis at 187 events. The planned proportion was 0.5 for
each stage and the log-rank statistics for the two stages were combined with weights
that equal the square root of 0.5. Thus, if Z1 and Z2 are the standardized log-rank
statistics from the data before and after the interim analysis, the combined statistic
for the final analysis was

Zf = √0.5 · Z1 + √0.5 · Z2

In order to ensure preservation of type-1 error, the two weights √0.5 and √0.5 for
the two stages must be used even if the total number of events was increased at the
interim analysis. This could result in a slight loss of efficiency, which is offset by the
increase in events.
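The type-1 error preservation of this fixed-weight combination is easy to confirm by simulation; the sketch below (ours) exploits the fact that under the null hypothesis Z2 remains standard normal no matter how large the second stage was made in response to Z1.

## Under H0, Z1 and Z2 are independent N(0,1) regardless of any
## z1-based rule enlarging stage 2, so Zf with fixed weights is N(0,1).
set.seed(1)
z1 <- rnorm(1e5)
z2 <- rnorm(1e5)
zf <- sqrt(0.5) * z1 + sqrt(0.5) * z2
mean(zf > qnorm(1 - 0.025))   # ~0.025, as guaranteed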

VALOR Simulations

(a) Design-Stage Simulations: The VALOR trial was extensively simulated to ascertain its operating characteristics at the design stage as per the following pseudocode:

Initialize HR for the simulation scenario (e.g. 0.71, 0.74, 0.77, etc.)
For each of the 10000 simulations for the scenario
    Generate clinical events for each of the 450 patients
    Identify interim look dataset with earliest 187 events
    Analyze the interim dataset to compute the p-value
    If p-value < 0.001525 then
        Stop for overwhelming efficacy
    Else If p-value ≥ 0.001525
        Compute conditional power (cp-plan-562) for 562 events as
            planned at design stage
        If cp-plan-562 < 50% then
            Stop for futility
        Else If cp-plan-562 ≥ 50%
            Compute conditional power with observed HR and 375 events
                (cp-obs-375)
            If cp-obs-375 ≤ 30% or cp-obs-375 > 90% then
                Prepare and analyze dataset of 375 events for 450 patients
            Else If 30% < cp-obs-375 ≤ 90% then
                Increase the sample size to 676 patients and generate
                    their events
                Prepare and analyze dataset of 562 events for 676 patients
            End If [cp-obs-375]
        End If [cp-plan-562]
    End If [p-value]
End For [10000 simulations for the scenario]
Summarize the results of 10000 simulations for the specified HR
    scenario (e.g. 0.71, 0.74, 0.77, etc.)

Operating characteristics of the adaptive group sequential design under various scenarios, based on 10,000 simulations per scenario, are tabulated below in
Table 1 for hazard ratios of 0.71, 0.74, and 0.77. The operating characteristics
include probabilities, conditional powers, trial durations, and sample sizes asso-
ciated with unfavorable, promising, and favorable zones at the interim look. For
comparison purposes the operating characteristics of the two-look nonadaptive
group sequential design, with 375 maximum events, 450 patients, and no
reassessment of events or sample size are also displayed. Both designs have an
O’Brien-Fleming efficacy boundary and a futility boundary for terminating at the
interim look if the conditional power based on HR = 0.71 is below 50%.
Average power gains of 3% to 6% are obtained with the adaptive design at an
average cost of 50–70 additional subjects and an average increase in study
duration of 2–4 months. The real benefit of the adaptive design, however, lies
in its ability to learn from the interim results and avoid an underpowered trial.
This is evident from an examination of the zone-wise powers. For example,
under the pessimistic scenario HR = 0.77, the study is underpowered at 71%.
But if the interim results land in the promising zone, the conditional power of the adaptive design is boosted to 90% but remains 71% for the nonadaptive
design. This gain in power does come with a cost in terms of increased sample
size to 675 instead of 450 and the study duration is 38 months instead of 29.
However, these additional resource commitments would have to be made only
after observing the interim data, if promising. The simulation model adequately
supports the sample size reestimation decision, if any, as it consistently shows increased power associated with the promising zone over the range of hazard ratio scenarios.

Table 1 Zone-wise operating characteristics of adaptive and non-adaptive designs

Zone          P(Zone)   Conditional power          Duration (months)         Sample size
                        Non-adaptive   Adaptive    Non-adaptive   Adaptive   Non-adaptive   Adaptive

(a) Under the design-stage scenario: HR = 0.71
Unfavorable   0.12      56%            56%         30             30         447            447
Promising     0.27      87%            98%         30             39         375            562
Favorable     0.61      98%            99%         25             25         292            292
Average       1         91%            94%         27             29         324            373

(b) Under the moderately pessimistic scenario: HR = 0.74
Unfavorable   0.18      44%            44%         29             29         446            445
Promising     0.32      80%            94%         30             39         450            675
Favorable     0.50      97%            97%         25             25         404            406
Average       1         82%            87%         27             30         426            499

(c) Under the pessimistic scenario: HR = 0.77
Unfavorable   0.26      34%            33%         29             29         442            444
Promising     0.34      71%            90%         29             38         450            675
Favorable     0.40      94%            94%         26             26         412            412
Average       1         71%            77%         23             27         443            509

(b) Monitoring-Stage Simulations: In addition to the design stage, periodic simulations were also carried out during the monitoring stage of the ongoing trial to perform blinded data reviews for enrolment and clinical events predictions. The simulations used a Poisson-Gamma model with Bayesian updates based on enrolments already realized, as per the following pseudocode:

Initialize prior site enrollment rates, hazard rates, sample size
    scenarios (e.g. 450, 675)
Read dataset containing observed enrollments and clinical events
Compute posterior Gamma parameters for Poisson-Gamma model
Compute posterior parameters for hazard rates using exponential model
Compute posterior parameters for dropouts using exponential model
For each of the 1000 simulations for the scenario
    Activate remaining sites, if any, by sampling uniform distribution
        corresponding to Site Activation Plan
    Generate posterior enrollment rates by sampling Gamma distribution
        with posterior Gamma parameters
    Generate posterior hazard rates by sampling Gamma distribution
        with posterior hazard rate parameters
    Generate posterior dropout rates by sampling Gamma distribution
        with posterior dropout parameters
    Generate remaining enrollments using Poisson-Gamma model with
        above rates
    Generate dropouts for patients-at-risk by sampling exponential
        distribution with posterior dropout rates
    Generate clinical events for patients-at-risk by sampling exponential
        distribution with posterior hazard rates
End For [simulations]
Analyze the simulation database to generate prediction tables and plots
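A condensed, runnable version of the enrolment portion of this pseudocode is sketched below. Only the 303 realized enrolments, the 500-patient target, and the 101 sites come from the trial description; the prior parameters and the site-month exposure are hypothetical values introduced for illustration.

## Sketch of Bayesian Poisson-Gamma enrolment prediction (R).
a0 <- 1;  b0 <- 2       # hypothetical prior: ~0.5 patients/site-month
enr   <- 303            # enrolments realized so far
expo  <- 1000           # assumed site-months of exposure to date
sites <- 101            # active sites
post_a <- a0 + enr      # conjugate Gamma posterior for the rate
post_b <- b0 + expo

remaining <- 500 - enr  # further enrolments needed for the target
set.seed(1)
months_to_target <- replicate(1000, {
  lambda <- rgamma(1, shape = post_a, rate = post_b)  # per-site rate draw
  sum(rexp(remaining, rate = sites * lambda))         # waiting time of the
})                                                    # last needed arrival
quantile(months_to_target, c(0.1, 0.5, 0.9))  # predictive milestone window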

Figure 1 shows stochastic enrolment predictions made on the basis of 1,000 simulations of future enrolment given the realized enrolment of 303 patients. Accordingly, a target enrolment of 500 patients for a sample size of 450 after dropouts is predicted to be achieved around month 24. The increased target of 750 enrolments for the reestimated sample size of 676 is predicted to be achieved around month 32.

Fig. 1 Enrolment predictions after realized enrolment of 303 patients

Death and dropout events were simulated in a blinded manner using an exponential distribution, with prior pooled hazard rates based on the design-stage scenario (HR = 0.71) and an initial assumed dropout rate, updated in Bayesian fashion from the death and dropout events observed up to the review period. Figure 2 shows stochastic clinical events predictions made on the basis of 1,000 simulations given the observed 115 events for the realized enrolment of 303 patients. Accordingly, a target of 187 events to trigger the interim analysis is predicted to be achieved around month 21, and the final target of 375 events for an initial sample size of 450 is predicted to be reached around month 29. The clinical events are also predicted in the context of a potential increase of sample size and target number of events at the interim analysis. Accordingly, the revised target of 562 events for the reestimated sample size of 676 is predicted to be achieved around month 42.

Fig. 2 Clinical events predictions after observed 115 events for realized enrolment of 303 patients

Practicalities of Running an Adaptive Trial (With Reference to VALOR)

There are regulatory guidances on adaptive designs by both the FDA (Adaptive
designs for clinical trials of drugs and biologics, draft guidance, September 2018)
and EMA (Reflection paper on methodological issues in confirmatory clinical trials
planned with an adaptive design, adoption by CHMP October 2007) that emphasize
the need to prespecify analysis methods, minimize operational bias, and control the
type I error, as well as the unbiased point estimation of treatment effect.
When there is an interim look at efficacy, as in the promising zone methodology,
there are several practical considerations to minimize operational bias. First, con-
sider strict control around the availability and communication of interim results.
Decide as a sponsor what the message will be (if any) following the interim analysis.
Will a change in the total sample size be announced to sites, operational entities, or
investors? Various stakeholders (legal, regulatory, operational, medical, etc.) can be
consulted up front to ensure that there is agreement about the planned communica-
tion strategy and an understanding of any implications. The VALOR trial made use
of a special Access Control Execution System to both control and document the flow
of data and reports created at the interim and shared with DSMB members. The
promising zone as originally intended would increase the sample size by an amount
proportionate to the observed results at the interim. The VALOR trial employed an all-or-nothing 50% sample size increase instead of a proportionate increase, thus limiting the ability to back-calculate interim results based on the planned increase in
sample size.
The promising zone methodology allows for strict control of the type I error as
described in Jennison and Turnbull (2000). However, there are some practical con-
siderations that should be understood in the conduct and analysis of the trial. First, in
this design, the test statistic Z1 for the data at the interim is combined with the test
statistic Z2 for the data post-interim with prespecified weights as shown in the
equation for Zf at the end of the Statistical Methodology part of this case study. In
simulation and in theory, the data supporting the interim test statistic Z1 do not
change between the time of the analysis at the interim and final analysis while in
practice they may. Additional follow up (censor dates change) or data cleaning may
alter the value of the test statistic computed at the time of the interim and upon which
the sample size adjustment was made and the time of the final analysis. In the
VALOR trial, the value of the test statistic Z1 for the interim time point was
recomputed on final data before being combined with Z2 to produce the final
adjusted statistic. Thereby the test statistic Z1 was most representative of the final
cleaned data and did not require creating and submitting data packages of the interim
data to support an interim value incorporated into the final analysis. Second, a
secondary analysis of OS in the VALOR trial was a stratified log rank. It was
determined that the weighted Cui et al. (1999) method could also be applied to
interim and post-interim test statistics after the stratification.

Case Study 2: SPYRAL HTN OFF-MED Trial

SPYRAL HTN OFF-MED is an international randomized single-blind (patient masked) study evaluating safety and efficacy of treatment with the Symplicity
Spyral Multi-Node Electrode Renal Denervation System in patients with
uncontrolled hypertension in the absence of antihypertensive medications (Funded
by Medtronic; ClinicalTrials.gov Identifier: NCT02439749). The primary efficacy
endpoint is change in systolic blood pressure (SBP) as measured by 24-h ambu-
latory blood pressure monitoring (ABPM) from baseline to 3 months post-
procedure.
Prior studies did not find a consistent effect of renal denervation in reducing
blood pressure. In particular, a large randomized study reported by Bhatt et al.
(2014) in patients with resistant hypertension did not see a statistically significant
benefit of renal denervation compared to a sham procedure in reducing blood
pressure. Drug adherence was seen as a potentially important confounding factor
in that study.
SPYRAL HTN OFF-MED was initially conceived of as a proof of concept (PoC)
study to isolate the effect of renal denervation treatment (test group) versus a sham
procedure (control group) in a population without resistant hypertension, where
confounding by antihypertensive medications can be minimized by disallowing
the use of such medications during the study. The PoC study was to be expanded to a pivotal study if promising clinical benefits were observed.
Results of the PoC study, consisting of the first 80 consecutively randomized
patients, have now been published (Townsend et al. 2017) and the study has moved
into the planned pivotal phase under the same clinical investigation protocol.
Because the enrollment criteria and study procedures did not change when the
study moved into the pivotal phase, the sponsor plans to incorporate results from
the PoC phase in the analysis of the pivotal study.

Motivation of Bayesian Design with Discount Prior Methodology

Recognizing the potential for temporal bias and other unknown factors that may
impact the similarity of effect sizes in the two phases of the study, a Bayesian
discount prior method is used for the primary efficacy analysis of the pivotal study
as described in Haddad et al. (2017), whereby data from the PoC phase form the
basis of an informative prior distribution for the pivotal study. The prior information
is dynamically discounted with a factor between 0 and 1, based on the extent to
which the prior data is dissimilar to the data from the pivotal study.
The pivotal study is currently ongoing, and not all details of the statistical design
are publicly available at the time of writing. Our description of the design, simula-
tions, and operating characteristics should be considered illustrative of the general
approach and may not fully agree with the statistical plan of the trial.

Statistical Design

The study plans to randomize up to 433 patients total including both the PoC and
pivotal phases.
The primary efficacy analysis of the pivotal trial uses a baseline adjusted com-
parison of change in SBP from baseline to 3 months post-procedure. Let xi denote
the baseline SBP and yi the SBP change from baseline for the i-th patient. The linear
model of interest is

yi = βc Ii{control} + βt Ii{test} + βx xi + ϵi,  ϵi ~ Normal(0, σ²),  (1)

where Ii{test} is the indicator for the test group (1 for test and 0 for control) and Ii{control} = 1 − Ii{test}. The main parameter of interest is β = βt − βc, representing the baseline adjusted treatment effect. The primary efficacy hypothesis is H0 : β ≥ 0 versus HA : β < 0.
The analysis to evaluate the efficacy hypothesis in the pivotal trial assumes
separate power-prior (Ibrahim et al. 2015) normal distributions on βc and βt and
uniform prior on log(σ), a standard choice for non-informative prior distribution on
the variance term in a normal model. The prior distribution assumes zero correlation
among model coefficients. The power-prior approach allows the amount of borrow-
ing from historical data to be specified in terms of one parameter for the test group
(αt) and one parameter for the control group (αc). The parameter values range
between 0 and 1, with 0 indicating no borrowing and 1 indicating full borrowing
from historical data. These power prior parameters are calculated as part of the
discount prior method as described in the next section. The posterior distribution of β
obtained via this approach will then be used to estimate the posterior probability that
β < 0. The success criterion for this trial is that this posterior probability is greater than 0.975. This criterion aligns with the classical frequentist rule of using a one-sided test at the 2.5% level of significance.
Multiple interim analyses are planned for this study. At each interim analysis, the
decision to continue enrollment or stop enrollment for expected success or futility
will be based on the predictive probability of success, which is derived by imputing
the incomplete data from the posterior distributions of model parameters given
interim data, and then recalculating the posterior probability of success. This com-
pletion process is repeated several times. The proportion of runs where the posterior probability for β < 0 achieves the success criterion (> 0.975) is the predictive probability of success. For efficacy, imputations are carried out for patients who have enrolled prior to a particular interim analysis (hence their baseline SBP values are available) but have not yet completed their 3-month follow-up. For futility, imputations are carried out for patients who have been enrolled prior to the interim analysis but have not completed their 3-month follow-up, as well as for patients who have not yet been enrolled, up to the maximum sample size. Enrollment is stopped for
expected success if the predictive probability of success with the currently enrolled
patients is greater than 90%, and enrollment is stopped for futility if the predictive
probability of success at the maximum sample size is less than 5%. A similar
approach to interim decision-making is described in Berry (2011).

Discount Prior Methodology in the Context of the SPYRAL Trial Design

The discount prior method (Haddad et al. 2017) used in the SPYRAL trial was
developed collaboratively by statisticians from the sponsor and the United States
Food and Drug Administration (FDA) as part of the Medical Device Innovation
Consortium (MDIC). An R package (bayesDP) developed by Musgrove and Haddad
(2017) implementing this method is available. The method as it applies in the context
of the trial is described here. The reader is referred to the referenced papers for details
on the general methodology, and further to the R package documentation for details
on implementation.
The analysis to evaluate the primary efficacy hypothesis in the pivotal trial
assumes separate power-prior (Ibrahim et al. 2015) normal distributions on βc and
βt and uniform prior on log(σ). The power parameter of the power-prior for the test
group (αt) and for the control group (αc) are calculated as part of the discount prior
method, which comprises four steps: compare, discount, combine, and estimate.

Compare

The test and the control group data are separately used to fit the following model, using combined data from both phases of the study in the given arm:

yi = β̃0 + β̃1 Ii{current} + β̃x xi + ϵi,  ϵi ~ Normal(0, σ²),

where Ii{current} equals 0 if the data is from the pivotal phase and equals 1 if the data is from the PoC phase. Here a joint uniform prior is assumed for log(σ) and the model coefficients. The degree of agreement between the two phases can be measured by p = P(β̃1 > 0 | y). A value of p close to 0.5 indicates agreement, while deviation from 0.5 on either side indicates lack of agreement in terms of the distribution of the response variable after adjusting for covariates. Thus, we transform p to 2p if p < 0.5 and to 2(1 − p) if p ≥ 0.5, so that higher transformed values of p indicate higher levels of agreement. These calculations are carried out separately for each arm, resulting in pc and pt.
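A small sketch of this compare step for one arm is shown below, on synthetic data. The flat-prior posterior for the phase coefficient is approximated here by the normal sampling distribution of its least-squares estimate; the actual analysis fits the fully Bayesian model via the bayesDP package.

## Sketch of the compare step for one arm (synthetic data).
## y = SBP change from baseline, x = baseline SBP, current = 1 for PoC.
set.seed(42)
poc <- data.frame(x = rnorm(40, 160, 10));  poc$y <- -8 + rnorm(40, 0, 12)
piv <- data.frame(x = rnorm(120, 160, 10)); piv$y <- -7 + rnorm(120, 0, 12)
dat <- rbind(transform(poc, current = 1), transform(piv, current = 0))

fit <- lm(y ~ current + x, data = dat)
est <- coef(summary(fit))["current", "Estimate"]
se  <- coef(summary(fit))["current", "Std. Error"]
p   <- pnorm(est / se)                            # approx. P(beta1 > 0 | y)
p_agree <- if (p < 0.5) 2 * p else 2 * (1 - p)    # transformed similarity
p_agree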

Discount

The similarity measures pc and pt are each mapped to a discount value in the interval
[0, 1] by a discount function F(p). Examples include the identity function F(p) = p and the Weibull function F(p) = 1 − exp[−(p/λ)^k].

Fig. 3 Weibull discount function with shape k = 3 and scale λ = 0.5

The power prior parameters for each arm are
defined as αt = αmaxF( pt) and αc = αmaxF( pc), where αmax is a parameter between
0 and 1 defined at the beginning of the study to control the maximum level of
borrowing from PoC data. The discount function and any accompanying parameters
are also predefined to achieve certain operating characteristics. The same discount
function is used for both arms, which may yield different levels of discount based on
the values of pt and pc. Use of the Weibull function facilitates exploration of a wide
range of discount profiles using just two parameters, the shape k and scale λ. The
discount function used for one of our designs with k = 3 and λ = 0.5 is shown in Fig. 3.
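For concreteness, the Weibull discount function and the resulting power-prior weight amount to a couple of lines of R; the similarity value below is hypothetical.

## Weibull discount function mapped to a power-prior weight (sketch).
F_weibull <- function(p, k = 3, lambda = 0.5) 1 - exp(-(p / lambda)^k)
alpha_max <- 1                       # maximum permitted borrowing
p_t <- 0.8                           # illustrative similarity, test arm
alpha_t <- alpha_max * F_weibull(p_t)
alpha_t                              # ~0.98, near-full borrowing here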

Combine

The power prior method is used to combine the PoC and pivotal data, whereby informative normal priors based on PoC data are used for the linear model coefficients while applying a suitable level of discount according to the degree of similarity with the pivotal data. Thus, in the linear model (1), we use independent priors βc ~ N(β̂0c, τ̂²0c/αc) and βt ~ N(β̂0t, τ̂²0t/αt), where β̂0c (β̂0t) and τ̂²0c (τ̂²0t) are maximum likelihood estimates of the model parameters and their variances for the control group (test group) using PoC data. For the baseline variable, we do not apply a discount, and use βx ~ N(β̂0x, τ̂²0x), where β̂0x and τ̂²0x are the estimated baseline parameter and its variance from the linear model (1) fitted to the PoC data. For the variance term in (1), a flat prior on log(σ) is used. With these prior specifications, joint posterior samples for βc and βt are drawn conditional on pivotal trial outcomes (y), from which we generate a posterior sample for β = βt − βc concerning the mean SBP change difference between the test and sham groups.

Estimate

Using the posterior distribution from the combined pivotal and PoC data, the probability of a treatment effect favoring the test group is estimated as

P(β < 0 | y, y0, αt, αc)

Here y0 denotes the PoC data, which is needed for prior specification and determination of the prior discount levels αt and αc.

Role of Simulation

Simulations are critical in both the planning and implementation of the SPYRAL
study. In the planning stage, operating characteristics are evaluated in order to
optimize the design parameters and to facilitate discussion with regulatory author-
ities when seeking alignment on the design. The optimization process in this case
was not a formal procedure involving objective functions (such an approach, while
more rigorous, would have been computationally infeasible), but rather was iterative
and informal, whereby simulations were performed under several combinations of
realistic design parameters – such as sample size, timing and number of interim
looks, discount function parameters, early stopping thresholds – under a range of
plausible effect sizes including the null scenario (β = 0) and results were compared
across scenarios to determine the parameter combination(s) that provided the best
balance of type 1 error rate, power, and interim stopping probabilities in the
judgment of the study team. One advantage of the discount prior approach is the
flexibility to adjust the discount function to keep type 1 error rate at an acceptable
level without needing to change the success criterion.
While the trial is ongoing, simulations are used for repeatedly imputing the
incomplete data to derive estimates of the predictive probability of success for
interim decision-making. Furthermore, ad hoc simulations may be requested by
the Data Monitoring Committee should there be questions on how the efficacy
analysis would look if the trial were to progress under particular scenarios of interest.

Pseudocode for Simulation Model:

For every simulation
    Simulate enrollment of all patients
    Generate baseline BP and change in BP (ΔBP) for all patients
        in treatment and control groups
    For every interim analysis (IA) with sample size Nint
        Complete (impute) data for both baseline BP and change ΔBP
        Compute posterior of β
        Perform Nrep imputations to generate the following datasets:
            yimpES1: complete imputed dataset for the Nint patients
            yimpES2: complete imputed dataset for all patients
        Compute expected successes ES1 (for efficacy) and ES2
            (for futility) as follows:
            ES_ = Σ I[ P(β < 0 | yimpES_) > 0.975 ] / Nrep
        If (ES2 < 0.05) then
            Stop for futility
        Else If (ES1 > 0.9) then
            Stop for efficacy
        Else If (ES2 ≥ 0.05) AND (ES1 ≤ 0.9) then
            Continue to next IA
        End If
    End For (IA)
    Perform final analysis
End For (simulations)
Analyze simulation results to generate plots and tables

Programs to carry out the simulations were written in the R programming language, and the R package bayesDP was used to implement the discount prior analysis.
Plots showing power and type 1 error rates under two different designs are shown
in Figs. 4–5. Both designs assume a maximum sample size of 400 and have two
interim analyses, occurring when 280 and 320 subjects have been evaluated for 3-
month SBP change. One design uses the identity discount function and the other uses
a Weibull function with shape k = 3 and scale λ = 0.5 as shown earlier in Fig. 3. The
number of simulated trials was 8,000 for estimating power, and 15,000 for estimating type 1 error. Thus, the Monte Carlo standard error for power estimation is about 0.003–0.005, and for type 1 error it is about 0.001. The plots show cumulative power at
each look, defined as the probability of the trial stopping at or before that look and
achieving the success criterion. Using the identity discount function leads to power
that is similar or marginally higher in most cases, but it also gives higher type 1 error
rate (0.034) compared with the Weibull discount function (0.028). The higher type 1
error rate may be due in part to more aggressive borrowing from historical data with
the identity discount function compared to the Weibull discount function when the
current data are dissimilar to historical data (i.e., pc or pt close to 0). Additional
operating characteristics of interest (not shown) may include probability of futility
stopping at each look, or average percent prior information leveraged in each arm (αc
and αt).
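These Monte Carlo standard errors follow from the usual binomial formula, as the one-line check below (ours) confirms.

## Monte Carlo standard error of a simulated rejection probability.
mcse <- function(p, n) sqrt(p * (1 - p) / n)
mcse(0.85, 8000)    # ~0.004 for a power estimate near 85%
mcse(0.03, 15000)   # ~0.0014 for a type 1 error estimate near 3%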

Validation of Simulation Tools

Validation of the programs for the Bayesian computations and trial simulations was performed so as to provide assurance to the sponsor and regulatory reviewers that
the tools for design and implementation are working as intended in a manner
consistent with documentation. An independent team tested specific functions from the bayesDP package that are intended to be used in the Bayesian analysis, and tested the simulation program used to establish the trial performance characteristics that are described in the statistical analysis plan.

Fig. 4 Power versus treatment effect (difference in baseline adjusted SBP change at 3 months, measured in mmHg), using the identity discount function
For testing of bayesDP functions, specific test cases were derived such that the
output being tested can be determined exactly using theoretical knowledge when
possible. In cases where this cannot be done, outputs were compared with simulation
results obtained by independent means, or evaluated for consistency with known
theoretical properties.
For example, if the discount parameters are defined such that no borrowing from
the prior is allowed (e.g., by setting αmax = 0), then the posterior distribution of the
treatment effect β follows a scaled t-distribution, hence posterior samples generated
by bayesDP were compared with their expected theoretical values. The difference
(stochastic error) between the posterior sample and the theoretical distribution were
quantified using the Kolmogorov-Smirnov distance, and the average and maximum
distance over several runs was summarized in the validation report.

Fig. 5 Power versus treatment effect (difference in baseline adjusted SBP change at 3 months, measured in mmHg), using the Weibull discount function with shape 3 and scale 0.5

On the other hand, if no restrictions are placed on the amount of borrowing from the prior except
as dictated by the discount function (αmax = 1), then the posterior distribution of β no
longer has an analytically convenient form. In this case, the posterior samples from
bayesDP were compared with posterior samples obtained using Stan, a Bayesian
computation tool in common use.
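The distributional check described above can be sketched as follows; the posterior draws here are a stand-in (in the actual validation they would come from bayesDP with αmax = 0, compared against the appropriate scaled t-distribution).

## Sketch of the validation check: Kolmogorov-Smirnov distance between
## posterior draws and a known reference distribution.
set.seed(7)
draws <- rt(10000, df = 30)         # stand-in for no-borrowing draws
ks <- ks.test(draws, pt, df = 30)   # K-S distance from the reference t
ks$statistic                        # small values indicate agreement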
The strategy for validating the simulation program consisted of code review to
map the logic of a single simulated trial, code review to map the logic of the
execution and summary of multiple simulations, and using the program to run
repeated simulations under different scenarios and ensuring operating characteristics
behave as expected as the input parameters are allowed to vary. In particular, it was
verified that power is a monotone function of sample size and effect size.
The simulation program, bayesDP source code (available for download from the
Comprehensive R Archive Network), and validation report were made available to
regulatory authorities for review.

Summary and Conclusion

In this chapter, we have taken a simulation-centric view of clinical trials, with simulation as an integral part of the design and implementation of clinical trials. The simulation-centricity was guided by three related perspectives. Firstly, from a simulation perspective, as a generator of pseudo-data on the basis of a model or a database and a device for working with models, testing models, and building new models, simulation can be an integral part of the modeling process itself, a kind of paradigm for realistic evolutionary modeling. Secondly, from a design perspective, as an interactive process involving the interplay of models, data, and assumptions, simulation supports the process with insights and experiences to address specific design issues before and during the trial. Some designs, like complex adaptive designs, are necessarily simulation-centric by the very nature of the design and the range of issues involved in the continuum of design and implementation.
Finally, some trial design evaluation frameworks adopt simulation as a central part of
their assessment of multiple design and analysis strategies and their sensitivity to
potential changes in the underlying assumptions. Our two real-life case studies
illustrated the critical role of simulation in adaptive trials. The scope of simulation in the design and implementation of a clinical trial based on sample size re-estimation was elaborated in the context of the VALOR case study. A related simulation approach that uses historical data for augmentation instead of virtual patient pseudo-data was covered in the SPYRAL case study.

Key Facts

• Simulation is primarily concerned with the generation of pseudo-data on the basis of a model, a database, or the use of a model in the light of a database. The generation of the pseudo-data involves pseudorandom numbers, probabilistic models, and possibly real data from a historical clinical trial or a currently ongoing one.
• Simulation is a mechanism for working with models, testing models, and building
new models, a kind of paradigm for realistic evolutionary modeling, beyond
simply being a mechanism for dealing with old modeling techniques, say, the
numerical approximation to pointwise evaluation.
• Clinical trial design is an interactive process involving the interplay of models, data, assumptions, insights, and experiences to address specific design issues before and during the trial, offering a rich context for simulation-centric modeling.
• Clinical trials often involve design issues with mathematically intractable com-
plexity. Being part of multi-phase drug development programs, the trial designs
need to incorporate prior information in terms of historical data from earlier
phases and available knowledge about related trials. Some trials with inherent limits on data collection may need augmentation with simulated pseudo-data. Adaptive designs are necessarily simulation-centric by the very nature of the design and the range of issues involved in the continuum of design and implementation.
• For planning of interim looks, group sequential and adaptive trials require accurate timeline predictions of reaching clinical milestones, involving a complex set of operational and clinical models.
• The simulation-centric view is increasingly being supported by industry standard
software tools used for clinical trial design and reinforced by the evaluation
frameworks of the design and analysis strategies.

Cross-References

▶ Bias Control in Randomized Controlled Clinical Trials


▶ Cluster Randomized Trials
▶ Controlling for Multiplicity, Eligibility, and Exclusions
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Missing Data
▶ Power and Sample Size
▶ Statistical Analysis of Patient-Reported Outcomes in Clinical Trials
▶ Use of Resampling Procedures to Investigate Issues of Model Building and Its
Stability

References
Antonijevic Z, Pinheiro J, Fardipour P, Lewis RJ (2010) Impact of dose selection strategies used in
phase II on the probability of success in phase III. Stat Biopharm Res 2(4):469–486
Arnold B, Hogan D, Colford J, Hubbard A (2011) Simulation methods to estimate design power: an
overview for applied research. BMC Med Res Methodol 11:94
Benda N, Branson M, Maurer W, Friede T (2010) Aspects of modernizing drug development using
clinical scenario planning and evaluation. Drug Inf J 44:299–315
Berry SM (ed) (2011) Bayesian adaptive methods for clinical trials. Chapman & Hall/CRC
biostatistics series. CRC Press, Boca Raton. 305 p
Bhatt DL, Kandzari DE, O’Neill WW, D’Agostino R, Flack JM, Katzen BT (2014) A controlled
trial of renal denervation for resistant hypertension. N Engl J Med 370:1393–1401
Chang M (2011) Monte Carlo simulation for the pharmaceutical industry: concepts, algorithms, and
case studies. Chapman & Hall/CRC biostatistics series. CRC Press, Boca Raton
Cui L, Hung HMJ, Wang S (1999) Modification of sample size in group sequential clinical trials.
Biometrics 55:853–857
Dmitrienko A, Pukstenis E (2017) Clinical trial optimization using R. Chapman & Hall/CRC
biostatistics series. CRC Press, Boca Raton
East 6 (2018) Statistical software for the design, simulation and monitoring clinical trials. Cytel Inc.,
Cambridge, MA
Evans SR (2010) Fundamentals of clinical trial design. J Exp Stroke Transl Med 3(1):19–27
Friede T, Nicholas R, Stallard N, Todd S, Parsons NR, Valdes-Marquez E, Chataway J (2010)
Refinement of the clinical scenario evaluation framework for assessment of competing devel-
opment strategies with an application to multiple sclerosis. Drug Inf J 44:713–718

Gao P, Ware J, Mehta C (2008) Sample size re-estimation for adaptive sequential design in clinical
trials. J Biopharm Stat 18:1184–1196
Haddad T, Himes A, Thompson L, Irony T, Nair R (2017) Incorporation of stochastic engineering
models as prior information in Bayesian medical device trials. J Biopharm Stat 27:1089–1103
Ibrahim JG, Chen M-H, Gwon Y, Chen F (2015) The power prior: theory and applications. Stat Med
34(28):3724–3749
Jennison C, Turnbull BW (2000) Group sequential methods with applications to clinical trials.
Chapman and Hall/CRC, London
Jiang Z, Song Y, Shou Q, Xia J, Wang W (2014) A Bayesian prediction model between a biomarker
and clinical endpoint for dichotomous variables. Trials 15:500
Kim K, Tsiatis AA (1990) Study duration for clinical trials with survival response and early
stopping rule. Biometrics 46:81–92
Lan KKG, DeMets DL (1983) Discrete sequential boundaries for clinical trials. Biometrika
70:659–663
Mehta CR, Pocock SJ (2011) Adaptive increase in sample size when interim results are promising: a
practical guide with examples. Stat Med 30:3267–3284
Muller P, Berry D, Grieve A, Smith M, Krams M (2007) Simulation-based sequential Bayesian
design. J Stat Plann Inference 137:3140–3150
Musgrove D, Haddad T (2017) bayesDP: tools for the Bayesian discount prior function. https://CRAN.R-project.org/package=bayesDP
Paux G, Dmitrienko A (2016) Mediana: clinical trial simulations. R package version 1.0.4. http://gpaux.github.io/Mediana/
Robert C, Casella G (2010) Introducing Monte Carlo methods with R. Springer, New York
Suess E, Trumbo B (2010) Introduction to probability simulation and Gibbs sampling with R.
Springer, New York
Thompson JR (1999) Simulation: a modeler’s approach. Wiley, New York
Townsend RR, Mahfoud F, Kandzari DE, Kario K, Pocock S, Weber MA (2017) Catheter-based
renal denervation in patients with uncontrolled hypertension in the absence of antihypertensive
medications (SPYRAL HTN-OFF MED): a randomised, sham-controlled, proof-of-concept
trial. Lancet 390(10108):2160–2170
Part VII
Analysis
81 Preview of Counting and Analysis Principles

Nancy L. Geller

Contents
Introduction . . . 1590
Who Counts? Everyone Randomized . . . 1590
What Happens when Things Are Not Perfect? . . . 1591
Missing Outcome Data . . . 1591
Analyses Other than Intention to Treat . . . 1592
Analysis Principles in Complex Trials . . . 1593
Other Analyses . . . 1594
Summary and Conclusion . . . 1595
Key Facts . . . 1595
Cross-References . . . 1595
References . . . 1596

Abstract
This chapter provides an introduction to the Section on Analysis. The chapters in
this section range from elementary design and analysis considerations to many
more advanced topics. In this preview, each chapter is briefly mentioned in turn
and the reader is invited to delve more deeply into the individual chapter for
details.

Keywords
Analysis of clinical trials · Intention to treat · Missing outcome data · Non-
compliance · Statistical analysis plan · Analyses other than intent-to-treat

N. L. Geller (*)
National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022


S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_112

Introduction

This chapter covers a broad range of topics some elementary, and with some
advanced. The first section gives a broad overview of the chapter brief reference to
the contents of each of the sections

Who Counts? Everyone Randomized

A fundamental tenet of clinical trial methodology is that the analysis of a clinical trial
should account for everyone randomized. The basic principle of including all
randomized subjects in the primary analysis according to their randomized treatment
assignment (and not according to treatment received) is known as the intention-to-
treat (ITT) principle. Good discussions on this topic are found in Lachin (2000) and
Friedman et al. (2015). A summary of ITT and alternatives is given in ▶ Chap. 82,
“Intention to Treat and Alternative Approaches,” by Goldberg. There are also three
other chapters in this book that briefly discuss ITT: ▶ Chaps. 100, “Causal Inference:
Efficacy and Mechanism Evaluation,” ▶ 93, “Adherence Adjusted Estimates in
Randomized Clinical Trials,” and ▶ 84, “Estimands and Sensitivity Analyses.”
Because randomization assures balance on average in baseline factors, both
known and unknown, an ITT analysis is an unbiased comparison between the
treatments among all randomized subjects, presumably defined by the eligibility
criteria of the trial. An ITT analysis evaluates a treatment policy. It ignores
non-adherence, withdrawal from the trial, and treatment stoppages (even if the
protocol allows them). Even dropping randomized subjects who receive no treatment
can lead to biased results; for example, in a trial of treatment compared to placebo,
those assigned to active treatment who do not take the treatment make the treatment look more like placebo. Further, it is likely that adherers differ from non-adherers in ways that are difficult to assess.
A proper ITT analysis requires outcomes on all randomized subjects, and one should do one's best to obtain these outcomes, whether or not the subject remains on trial or takes the assigned treatment as planned.
In some trials, subjects have been excluded from the trial after they are random-
ized (whether or not they received their assigned treatments). This may well lead to
spurious results. A classic example of excluding randomized subjects from analysis
is the Anturane reinfarction trial (The Anturane Reinfarction Trial Research Group
1978, 1980), which was a double-blind placebo-controlled trial of sulfinpyrazone in recent post-myocardial infarction (post-MI) patients. The primary endpoint was sudden cardiac death within six months. Only those who received therapy for at least seven days were considered “analyzable.” This eliminated 145 of the 1620 patients randomized, more-or-less equally distributed in the two treatment
groups. Results reported were overwhelmingly positive for sulfinpyrazone. Since
sulfinpyrazone was an approved drug only for gout, the sponsor went before the
FDA to have the drug approved for a new indication, sudden cardiac death in
patients who were within 6 months post-MI. The FDA review of the data revealed
that the results depended on after-the-fact exclusions of events. Although the exclusions were equally distributed in the two treatment groups, the exclusions
eliminated many more events in the sulfinpyrazone group. Including all randomized
patients in the analysis completely changed the results and the new indication was
not granted (Temple and Pledger 1980).

What Happens when Things Are Not Perfect?

Much can go wrong, making the ITT principle not so easy to implement, even in the
two-armed randomized comparison with a well-defined single endpoint. Two simple
examples are missing outcome data and non-compliance. Often when there are only
a few subjects with missing outcome data, they are censored at their last follow up. This inherently assumes that the reason for the missingness is unrelated to treatment assignment, an assumption that usually has no basis. Non-compliance can make the outcomes of the treatments look more alike. Of course, a pharmaceutical company is interested in the effect of its product in those who take it.
These two issues have led to a great deal of statistical methodology to evaluate
clinical trial results accounting for missing outcome data and non-compliance.

Missing Outcome Data

Many trials ignore missing outcome data, often citing that only a small percent of those
randomized have been lost to follow up and have missing outcomes. Such an analysis
assumes that data are missing completely at random (MCAR), that is, that the trial
results would not change if we did have those missing data. That is a strong assump-
tion, as subjects are “lost” for reasons often related to treatment assignment. Examples
vary from experiencing toxicity (whether or not related to the trial) to just being “sick
and tired” of all of the necessary visits, to development of other conditions that require
the subject’s attention. One way in which the MCAR assumption is justified is to
compare baseline data of those missing outcome data to those with outcome data. Of
course, not finding a difference does not guarantee there is no difference.
In a case where a few subjects provide baseline data and no follow up
(i.e., subjects drop out after baseline) and, further, there is balance in drop outs
between treatment arms, the Guidance for Clinical Trials (1998) suggested that dropping such patients from the trial analysis may be justified. This is acceptable in some cases
(c.f. Choi et al. 2020), but usually there is some data beyond baseline even when
subjects are lost to follow up. There is a vast literature dealing with the problem of
missing outcome data, starting with the classical book by Little and Rubin (2002,
second edition). An introduction to methods for missing data is given in ▶ Chap. 86,
“Missing Data,” by Tong, Li, and Allen.
A common way to deal with missing outcome data is to use the data in the trial to
substitute or impute values for the missing data. A simple way to do this is to
consider the best and worst case scenarios. That is, the best case scenario would give
the most favorable values (among trial outcomes) for one treatment group and the
least favorable values for the other treatment group. The worst case scenario would
do the reverse. If there are few missing outcomes, these two methods could lead to
the same trial result, which would be highly satisfying. However, these simple
methods can give vastly different estimates of treatment effect, more so with more
missing data. They also underestimate the standard errors because the uncertainty in
the missing values is not considered. Methods that substitute one value for missing
data are called single imputation methods.
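As a toy illustration (entirely hypothetical data), best- and worst-case single imputation for a binary outcome takes only a few lines:

## Best/worst-case single imputation for a binary outcome (1 = success).
obs   <- c(1, 0, 1, 1, NA, 0, NA, 1)        # one arm, two missing values
best  <- replace(obs, is.na(obs), 1)        # most favorable fill-in
worst <- replace(obs, is.na(obs), 0)        # least favorable fill-in
c(best = mean(best), worst = mean(worst))   # bounds on the success rate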
Rather than replacing missing data with one substituted value, the distribution of
observed values may be used by devising models to predict the outcome variable
based on the complete data and then using these models to estimate the missing
outcome data. Doing this multiple times yields an estimate of each outcome as well
as a better estimate of its standard error than single imputation methods. There are
several different multiple imputation methods that may be used (Sterne et al. 2009).
Others advocate supervised learning as far superior for imputation (Chakrabortty and
Cai 2018). Thus, there is no definitive way to deal with missing outcome data. Six
different methods are described by Badr (2019). All methods require some statistical
assumptions and thus will often be controversial. The best that investigators can do is
to set forth the primary method that will be used for the primary analysis, as well as a
number of sensitivity analyses to give confidence to the primary results. How to
interpret results if some sensitivity analyses lead to a different trial outcome also
should be considered. Most important is to plan for how missing data will be dealt
with in the statistical analysis plan, because missing data are almost inevitable.
Some still call an analysis which ignores missing data (e.g., patient was lost to
follow up and dropped from the primary analysis) an ITT analysis, which is both
misleading and a misuse of the term.

Analyses Other than Intention to Treat

Many who undertake clinical trials, notably in the pharmaceutical industry, are
interested in other analyses than ITT analyses because the ITT estimates of treatment
effect “may not provide an intuitive or clinically meaningful estimate of treatment
effects” (Ruberg and Akacha 2017).
A treatment effect of primary interest, which may not be the ITT estimate, is
called an estimand. A description of estimands, including the need for careful
definition and for planning sensitivity analysis is provided in ▶ Chap. 84,
“Estimands and Sensitivity Analyses,” by Russek-Cohen and Petullo. Adherence-
adjusted analyses are also discussed in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials,” by Mazza and Dueck, and causal estimands
are discussed in ▶ Chap. 100, “Causal Inference: Efficacy and Mechanism Evalua-
tion,” by Landau and Emsley.
Ruberg and Akacha suggest that ITT analyses do not adjust for confounding
factors post-randomization (such as drug discontinuation or addition of rescue
medication). They require an explicit definition of what treatment effect is of primary
81 Preview of Counting and Analysis Principles 1593

interest (“relevant and meaningful”), which they call the estimand. They define four
estimands other than the ITT estimand. One is based on a composite variable which
combines a change in a symptom score with discontinuation of study drug due to an
adverse event. “Success” is improvement in symptom score and completion of
taking the study drug. A second estimand is the treatment effect if all subjects
adhered to study medication for the period of the trial. A third is what is the effect
on those who can take the study drug(s) without adverse events. The fourth is the
treatment effect for each subject before an adverse event or discontinuation.
Ruberg and Akacha claim that the probability of discontinuation of a study drug
due to either adverse events or lack of efficacy (or both) may be quantified and
treatment comparisons made using usual statistical methods if reasons for discon-
tinuation are carefully defined. They do not consider “physician choice” or “loss to follow-up” to be sufficiently detailed reasons. They consider administrative discontinuation (e.g., the patient moves away from the center) to be missing completely at random, so such patients are excluded. To estimate efficacy and safety for those able to adhere to study treatment, adherence must first be defined and would be trial specific, perhaps defined as taking 70% or 80% of the drug. The authors claim that this combination (the probability of discontinuing study drug due to adverse events, the probability of discontinuation for lack of efficacy, and efficacy in adherers) provides a more complete and meaningful description of drug effect than the ITT estimate.
While the FDA demands an ITT analysis, perhaps allowing other analyses, the European Medicines Agency (EMA) has published ICH E9 (R1), Addendum on
estimands and sensitivity analysis in clinical trials to the guideline on statistical
principles for clinical trials (ema.europa.eu/documents/scientific-guideline/ich-e9-
r1-addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical_
en.pdf). EMA allows treatment effect to be specified through an estimand (rather
than through the ITT estimate):

The definition of a treatment effect, specified through an estimand, should consider whether
values of the variable after an intercurrent event are relevant, as well as how to account for
the (possibly treatment-related) occurrence or non-occurrence of the event itself. More
formally, an estimand defines in detail what needs to be estimated to address a specific
scientific question of interest.

Analysis Principles in Complex Trials

Several chapters deal with elementary statistical methods to perform hypothesis tests
and parameter estimation (see ▶ Chaps. 83, “Estimation and Hypothesis Testing,” ▶ 87, “Essential Statistical Tests,” and ▶ 88, “Nonparametric Survival Analysis”). So why is there a need for so much more statistical methodology?
Over time, clinical trials have become more complex and often attempt to answer
more complicated questions than two-armed trials with one primary endpoint. In
many cases, incorporating covariates into the primary analysis increases the power to
detect differences. An introduction to regression methods for dichotomous or
polychotomous data is given in ▶ Chap. 91, “Logistic Regression and Related Methods,” by Diniz and Magalhães, and for censored data in ▶ Chap. 88,
“Nonparametric Survival Analysis,” by Lokhnygina. Extensions beyond the Cox
Proportional Hazards Model are summarized in ▶ Chap. 89, “Survival Analysis II,”
by Dignam.
Multiple endpoints and/or multiple treatments (multi-armed trials) are also fre-
quently of interest. Combined with these may be multiple subgroup analyses. Such
complex trials have several null hypotheses, and so give rise to the multiplicity
problem: what type of error control is needed for each null hypothesis and for all or
several of them simultaneously? ▶ Chapter 85, “Confident Statistical Inference with
Multiple Outcomes, Subgroups, and Other Issues of Multiplicity,” by Kil, Kazar,
Tang, and Hsu addresses several types of error control for multiplicity problems in the context of personalized medicine and carefully outlines how to obtain strong control
of Type I error. This means that the probability of rejecting at least one true null
hypothesis is controlled, even if some null hypotheses are false.
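As one concrete illustration, the Holm step-down procedure provides strong familywise error control over a set of null hypotheses; the sketch below applies it with statsmodels to placeholder p-values (the p-values are invented, not from any trial).

```python
# Strong familywise control across several null hypotheses (e.g., several
# endpoints or subgroups) via the Holm step-down procedure.
from statsmodels.stats.multitest import multipletests

pvals = [0.004, 0.018, 0.030, 0.250]            # one p-value per hypothesis
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for p, pa, r in zip(pvals, p_adj, reject):
    print(f"raw p={p:.3f}  adjusted p={pa:.3f}  reject: {r}")
```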
A simplifying approach to multiple outcomes in the two sample case is possible
when the outcomes can be prioritized (e.g., death is worse than recurrence without
death) so that patients in one treatment arm may be compared pairwise with patients in the other. A Mann-Whitney-Wilcoxon-type test may then be performed to decide if the two treatments differ. This methodology is described in ▶ Chap. 95, “Generalized
Pairwise Comparisons for Prioritized Outcomes,” by Buyse and Peron.
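A toy sketch of the idea, with invented outcome values and an assumed clinically relevant margin for the first-priority outcome, is shown below; each treated patient is compared with each control patient, moving to the lower-priority outcome only when the pair is tied on the higher one.

```python
# Toy generalized pairwise comparison ("net treatment benefit") sketch.
def compare(a, b, margin):
    if a - b > margin:
        return 1        # win for the first argument
    if b - a > margin:
        return -1       # loss
    return 0            # tie at this priority

trt = [(24, 3), (18, 5), (30, 2)]   # (survival months, symptom score)
ctl = [(20, 4), (25, 6)]
wins = losses = 0
for t in trt:
    for c in ctl:
        r = compare(t[0], c[0], margin=6)       # priority 1: longer survival wins
        if r == 0:
            r = compare(c[1], t[1], margin=0)   # priority 2: lower score wins
        wins += (r == 1)
        losses += (r == -1)
print("net benefit:", (wins - losses) / (len(trt) * len(ctl)))
```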
Recent work has considered multiplicity problems in the context of longitudinal
outcomes when covariates are also considered and linearity of the outcome over time
is not assumed (e.g., Jeffries et al. 2018). The joint analysis of survival and longitu-
dinal data is summarized in ▶ Chap. 90, “Prognostic Factor Analyses,” by Li.
Patient-reported outcomes often involve many of these multiplicity problems.
These are described in ▶ Chap. 92, “Statistical Analysis of Patient-Reported Out-
comes in Clinical Trials,” by Mazza and Dueck.
The main analysis principle is that the primary analysis should allow investigators
to properly answer the primary question of the trial, no matter how complex, and
maintain a prespecified Type I error rate, preferably with strong control.

Other Analyses

Potential prognostic factors are among the data collected in clinical trials for use in the
primary analysis or for modeling response. In Sect. 7.10, Li presents insight into the
process of finding prognostic factors as well as suggesting use of prognostic factors to
increase the power of the primary statistical analysis. The stability of statistical models
may be assessed by resampling procedures and this is described by Sauerbrei and
Boulesteix (▶ Chap. 96, “Use of Resampling Procedures to Investigate Issues of
Model Building and Its Stability”). ▶ Chapter 101, “Development and Validation of
Risk Prediction Models,” describes risk prediction models and how to develop and
validate them.
Several other statistical problems require specialized analyses. The nonlinear
nature of pharmacokinetic and pharmacodynamic processes requires special analyses
to relate drug exposure to response and several methods are described in ▶ Chap. 98,
“Pharmacokinetic and Pharmacodynamic Modeling,” by Kalaria, Wang, and
Gobburu. The potential for adverse events after a drug or device receives marketing
approval has led to pharmacovigilance, the study of adverse effects of drugs post-
marketing. ▶ Chapter 99, “Safety and Risk Benefit Analyses,” by Guo describes
many methods of benefit-risk analysis.
The main analysis principle here is that the analysis should reflect the specific
goals of the question being answered. There are often multiple methods that might be
used, and the onus is on the investigators, in particular the statistical investigators, to choose the one they consider most suitable, or even to derive new methods. Although controversial, ▶ Chap. 94, “Randomization and Permutation Tests,” advocates randomization tests for many situations, in particular when the assumptions of the usual approaches are unlikely to be met.
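A randomization test is straightforward to sketch: re-randomize the group labels many times and locate the observed statistic in the resulting re-randomization distribution. The data below are invented, and simple permutation of labels is used as a stand-in for re-running a trial's actual randomization scheme.

```python
# Hypothetical two-sample randomization (permutation) test on a mean difference.
import numpy as np

rng = np.random.default_rng(3)
trt = np.array([7.1, 6.4, 8.0, 7.7, 6.9])
ctl = np.array([6.0, 5.8, 6.5, 7.0])
obs = trt.mean() - ctl.mean()

pooled = np.concatenate([trt, ctl])
n_trt, reps, count = len(trt), 10_000, 0
for _ in range(reps):
    perm = rng.permutation(pooled)              # one re-randomization of labels
    diff = perm[:n_trt].mean() - perm[n_trt:].mean()
    count += abs(diff) >= abs(obs)
print(f"two-sided randomization p = {count / reps:.4f}")
```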

Summary and Conclusion

Clinical trials should be well designed to test a carefully posed primary hypothesis
and/or estimate a well-defined primary parameter. A statistical analysis plan (SAP)
should describe the methodology that will be used to answer the questions posed, both
primary and secondary. The plan should account for all randomized patients, even if
some are missing outcome data. The SAP should be completed before the data are
unblinded. There are many choices for statistical analysis for a given situation and they
are mentioned here and further described in the following chapters.

Key Facts

From the simple two-armed randomized trial with a single primary endpoint, clinical
trials have become more complex, with investigators working in broad research
areas, such as longitudinal data, multiple endpoints, multiple treatment arms, and
data with special characteristics, such as safety data and quality of life data. This
chapter covers many aspects of basic clinical trial analysis as well as many recent
developments.

Cross-References

▶ Adherence Adjusted Estimates in Randomized Clinical Trials
▶ Causal Inference: Efficacy and Mechanism Evaluation
▶ Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other
Issues of Multiplicity
▶ Development and Validation of Risk Prediction Models
▶ Essential Statistical Tests
▶ Estimands and Sensitivity Analyses
▶ Estimation and Hypothesis Testing
▶ Generalized Pairwise Comparisons for Prioritized Outcomes
▶ Intention to Treat and Alternative Approaches
▶ Joint Analysis of Longitudinal and Time-to-Event Data
▶ Logistic Regression and Related Methods
▶ Missing Data
▶ Pharmacokinetic and Pharmacodynamic Modeling
▶ Prognostic Factor Analyses
▶ Randomization and Permutation Tests
▶ Safety and Risk Benefit Analyses
▶ Statistical Analysis of Patient-Reported Outcomes in Clinical Trials
▶ Survival Analysis II
▶ Use of Resampling Procedures to Investigate Issues of Model Building and Its Stability

References
Badr W (2019) 6 different ways to compensate for missing values in a dataset (Data Imputation with
examples). https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-
values-data-imputation-with-examples-6022d9ca0779
Chakrabortty A, Cai T (2018) Efficient and adaptive linear regression in semi-supervised settings.
Ann Stat 46:1541–1572. https://doi.org/10.1214/17-AOS1594
Choi IJ, Kim CG, Lee JY, Kim Y-I, Kook M-C, Park B, Joo J (2020) Family history of gastric cancer and Helicobacter pylori treatment. N Engl J Med 382:427–436. https://doi.org/10.1056/NEJMoa1909666
Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB (2015) Fundamentals of
clinical trials, 5th edn. Springer. Chapter 18. ISBN 978-3-319-18539-2
ICH E9 (R1), Addendum on estimands and sensitivity analysis in clinical trials to the guideline on
statistical principles for clinical trials. http://ema.europa.eu/documents/scientific-guideline/ich-
e9-r1-addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical_en.pdf
ICH Harmonized Tripartite Guideline Statistical Principles for Clinical Trials E9 (1998). https://database.ich.org/sites/default/files/E9_Guideline.pdf
Jeffries NO, Troendle JF, Geller NL (2018) Detecting treatment differences in group sequential
longitudinal studies with covariate adjustment. Biometrics 74:1072–1081. https://doi.org/10.
1111/biom.12837
Lachin JM (2000) Statistical considerations in the intent-to-treat principle. Control Clin Trials 21:
167–189
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, Hoboken
Ruberg SJ, Akacha M (2017) Considerations for evaluating treatment effects from RCTs. Clin
Pharmacol Ther. https://doi.org/10.1002/cpt.869
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR
(2009) Multiple imputation for missing data in epidemiological and clinical research: potential
and pitfalls. Br Med J 338:b2393. https://doi.org/10.1136/bmj.b2393
Temple R, Pledger G (1980) The FDA’s critique of the Anturane Reinfarction trial. N Engl J Med
303:1488–1492
The Anturane Reinfarction Trial Research Group (1978) Sulfinpyrazone in the prevention of cardiac
death after myocardial infarction – the Anturane Reinfarction trial. N Engl J Med 298:289–295
The Anturane Reinfarction Trial Research Group (1980) Sulfinpyrazone in the prevention of sudden
death after myocardial infarction. N Engl J Med 302:250–256
82 Intention to Treat and Alternative Approaches

Judith D. Goldberg

Contents
Introduction
Randomized Controlled Clinical Trials (RCTs)
Examples of RCTs
  Example: Salk Vaccine Trial – Vaccine Efficacy
  Example: The HIP Breast Cancer Screening Study – Screening for Early Detection of Disease
  Example: The Polycythemia Vera Study Group PVSG-01 – A Randomized Multicenter Trial (Open Label) for Chronic Disease
  Example: Randomized Phase III Trial in Chronic Disease – MPD-RC 112 Phase III Trial of Frontline Pegylated Interferon Alpha-2a (PEG) Versus Hydroxyurea (HU) in High-Risk Polycythemia Vera (PV) and Essential Thrombocythemia (ET): NCT01258856
ITT Principle
Alternatives to ITT Population for Analysis
Missing Data
Alternative Approaches to Analysis
Noninferiority and Equivalence Trials
Cluster Randomized Trials
  Example: Cluster Randomized Trials and ITT – Online Wound Electronic Medical Record to Reduce Lower Extremity Amputations in Diabetics – A Cluster Randomized Trial [AHRQ: R01 HS019218-01]
Some Additional Design Considerations for ITT Analyses
Summary and Conclusions
Key Facts
Cross-References
References

J. D. Goldberg (*)
Department of Population Health and Environmental Medicine, New York University School of
Medicine, New York, NY, USA
e-mail: [email protected]


Abstract
“Intention to treat” or “intent to treat” (ITT) is the principal approach for the
evaluation of the treatment or intervention effect in a randomized clinical trial
(RCT). In an RCT, patients or subjects are randomized to one or more study
interventions according to a formal protocol that describes the entry criteria,
study treatments, follow-up plans, and statistical analysis approaches. In an
ideal trial, all randomized patients or subjects have the correct diagnosis, are
randomized correctly, comply with the treatment, and are evaluated according
to the study plan. These patients would have complete data and follow-up. In
this case, the ITT analysis that respects the randomization principle provides
unbiased tests of the null hypothesis that there is no treatment or intervention
effect. The goal in many cases is to establish the efficacy of a treatment or
intervention: does the planned treatment work? In practice, however, because
of the many ways in which the ideal is not the reality, an ITT analysis provides
a comparative evaluation of the effectiveness of the randomized intervention
strategy (does the strategy work), rather than of the efficacy of the planned
intervention itself. Examples of blinded, unblinded, screening, and drug clin-
ical trials are provided. Approaches to handling deviations from ideal are
described.

Keywords
Intent to treat (intention to treat) · Randomized controlled trial (RCT) ·
Compliance · Protocol · Efficacy · Effectiveness · Causality · Missing data

Introduction

The randomized clinical trial (RCT) is the gold standard that is used to establish the
efficacy or effectiveness of a new treatment or intervention. In an RCT, participants (subjects or patients) are assigned to one or more treatment or intervention groups using a prespecified random allocation scheme where allocation, ideally, is double blind: both participants and treating (and evaluating) staff are blinded (masked) to treatment assignment. Further,
under these ideal circumstances, only subjects who have met all of the eligibility
criteria for entry into the trial would be randomized as close to the initiation of the
treatment or intervention as possible. This set of all randomized patients or subjects
comprises what is generally defined as the “intent-to-treat” or “intention-to-treat”
(ITT) population. The ITT population is included in the analysis in the group to
which they were assigned (see, e.g., Ellenberg 1996; Piantadosi 1997; DeMets 2004;
Goldberg and Belitskaya-Levy 2008a; Friedman et al. 1998). Adherence to intention
to treat in the analysis requires that all randomized patients or subjects be included in
the analyses regardless of whether or not they received the assigned treatment,
complied with the trial requirements, completed the trial, or even met the entry
criteria for the trial. This approach is preferred for the analysis of RCTs since it
respects the principle of randomization and provides unbiased tests of the null
hypothesis that there is no treatment or intervention effect, although the estimate
of the treatment effect may still be biased (Harrington 2000). There are, however,
multiple ways in which the actual deviates from the ideal. RCTs come in many
flavors, have different objectives, and are conducted with varying levels of quality.
Different trial objectives, issues in trial implementation and conduct that include
missing data, patient/subject noncompliance, and differing degrees of follow-up lead
to deviations from the ideal that have to be recognized and handled in the analysis of
such trials.
Under the ITT principle, in a randomized trial, patients remain in the trial under
the following circumstances which have implications for analysis and interpretation
(Goldberg and Belitskaya-Levy 2008a, b):

• If the patient is found to not have the disease under study. This can occur when
final verification of disease status is based on special tests that are completed after
randomization or on a central review of patient eligibility.
• If the patient never receives a single dose of the study drug.
• If the patient does not comply with the assigned treatment regimen or does not complete the course of treatment.
• If the patient withdraws from the study for any reason.

This chapter reviews concepts for randomized trials, the issues regarding imple-
mentation of the operational definition of ITT in specific trials, and the implications
of the operational definition on the statistical analysis as the trial proceeds. These
issues range from evaluation of the impact of the deviation from the ITT model to the
consideration of other potential paradigms based on differing definitions of the
population included in the analysis. These alternatives range from various modifi-
cations of ITT (mITT), such as all treated patients, to all treated patients with the
correct diagnosis and to all treated patients who complied with assigned treatment,
among others. Further, errors in treatment allocation and diagnosis at the time of
randomization as well as missing outcomes and errors or misclassification of out-
comes and errors of measurement or misclassification of covariates including strat-
ification factors, multicenter deviations and heterogeneity, and missing data of all
kinds need to be considered.
Traditional approaches to handle these issues as well as recent alternative
approaches to analysis are described. Note that the deviations from the planned
randomization and the ITT paradigm move the randomized clinical trial to an
observational trial setting that leads to additional considerations in analysis.
Examples that illustrate the evolution of the concept of “intention to treat” and its
implications are provided to frame these issues. In addition, the interpretation of ITT
in noninferiority and equivalence RCTs is discussed, as are analogues to ITT analyses in nonrandomized trials and observational studies.

The discussion in this chapter also includes considerations for the statistical
analysis under the ITT paradigm for different types of clinical trials with different
types of objectives.

Randomized Controlled Clinical Trials (RCTs)

In the context of medical research and the search to improve treatments or other
interventions to improve outcomes for patients or participants, the RCT provides the
controlled experimental setting to evaluate the efficacy or effectiveness of the “new”
treatment or intervention compared to control (either placebo or another active
treatment) in an unbiased, ideally blinded manner. An RCT is conducted under a
clinical protocol that explicitly defines the trial objectives; the primary outcome(s);
how, when, and on whom the outcome(s) will be measured; and the measures of the
effects of the intervention (National Research Council 2010) with a focus on
prevention of missing data of all types. The benefits of this controlled experimental
approach are that any observed differences between the two (or more) groups with
respect to the outcome are attributable to the intervention. Both confounding and
selection bias are removed since neither the subject nor the investigator chooses the
treatment assignment (see, e.g., Harrington 2000). In what follows, several examples
of RCTs are provided. These trials illustrate many of the issues that arise in the
analysis and interpretation in the ITT framework and its alternatives.

Examples of RCTs

Example: Salk Vaccine Trial – Vaccine Efficacy

The Salk polio vaccine trial, a classic example of a prevention trial, established the
efficacy of the new killed virus vaccine to provide protection against paralysis or
death from poliomyelitis (Brownlee 1955; Francis et al. 1955; Meier 1957, 1989).
While there were safety issues associated with the use of a killed virus vaccine
(Meier 1957, 1989), the National Foundation for Infantile Paralysis (NFIP) advisory
committee agreed that the Salk vaccine was safe and could produce desired antibody
levels in children who had been tested. Thus, “it remained to prove that the vaccine
actually would prevent polio in exposed individuals. It would be unjustified to
release such a vaccine for general use without convincing proof of its effectiveness,
so it was determined that a large-scale ‘field-trial’ should be undertaken” (Meier
1989).
Various approaches to the design of such a trial were considered that included the
vital statistics approach, an observed control approach, and lastly, an RCT with
randomization to a placebo control group. While the ideal design was the RCT, the
general reluctance to randomize children to placebo injections led to the choice of an
observed control study in which children in grade 2 would receive the vaccine and
children in grades 1 and 3 would be observed for the occurrence of polio. The final
study, however, also included a double blind RCT in which 750,000 children were
randomized to injections with placebo or with the vaccine. The trial was conducted
in a relatively short timeframe with endpoints observed within the time period.
While the results of the observed control portion of the trial favored the vaccine,
the results of the RCT portion were unequivocal and provided compelling evidence
of the effectiveness of the vaccine. The primary results of the RCT were based on the
ITT analysis of all randomized children who were included in their assigned
treatment group regardless of whether or not they received the injections as planned;
that is, the primary comparison included those subjects who were randomized to be
vaccinated (including those who were not vaccinated) with the polio vaccine or the
placebo in each of the randomized groups. In the observed control study, it was
known who received and who did not receive the vaccination among the intervention
subjects, but not among the control subjects. The control group then inherently could
consist of subjects who would have received the vaccine and those who would not
have received it, so any fair comparison must consider all subjects in each group.

Example: The HIP Breast Cancer Screening Study – Screening for Early Detection of Disease

A classic example of a randomized trial of screening for the early detection of breast
cancer is the HIP Breast Cancer Screening Study. This RCT was designed to evaluate
the effectiveness of mammography, at the time an untested tool for early detection, in
combination with a clinical examination, to be compared with “usual care.” The
primary question was whether a screening program that incorporated mammography
could reduce mortality from breast cancer. The study was conducted in the Health
Insurance Plan of Greater New York, one of the first health maintenance organiza-
tions (HMO) in the USA. Sixty-two thousand women were randomly chosen across
all of the clinical sites in New York City. Of these women, approximately 30,000 were
randomized to be invited for an initial screening examination and 3 subsequent
annual examinations. The remaining 30,000 women were followed for diagnosis of
and mortality from breast cancer. This trial, initiated in 1963, predates the require-
ments for Institutional Review Board approvals and informed consent requirements
that have since become the norm. The trial design and results are described by
Shapiro et al. (1974, 1988).
Of the 30,000 women who were randomized to be invited for screening, 20,000
accepted the initial invitation; 59% of these women completed all 4 examinations. In
the 5 years of the screening study, 299 cases of breast cancer were diagnosed among
the women randomized to the screening group: 225 of these cases were detected
among women who had a screening examination; 74 cases were detected among
those who refused the invitations. There were 285 cases detected in the control group
(Shapiro et al. 1974). Table 1 shows the cumulative numbers of deaths in the first
5 years: those women who refused screening have a higher observed death rate from
breast cancer than those women who were screened; death rates for all causes reflect
the same phenomenon. The primary results of the trial rest on the comparison of the 5-year death rates in the total group randomized to screening (1.3/1000) with the total control group rate (2.0/1000).

Table 1 HIP Breast Cancer Screening Study: cumulative deaths in the first 5 years from entry

  Group                           Number of women   # BC deaths   BC deaths/1000   # All other deaths   All deaths/1000
  Total randomized to screening   31,000            39            1.3              837                  27
    Screened                      20,200            23            1.1              428                  21
    Refused                       10,800            16            1.5              409                  38
  Total randomized to control     31,000            63            2.0              879                  28
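The arithmetic behind this comparison, computed from the Table 1 figures, also shows why a naive "screened versus control" contrast would be misleading:

```python
# Breast cancer death rates per 1,000 women over 5 years, from Table 1.
itt_invited = (23 + 16) / (20_200 + 10_800) * 1000   # whole invited arm (ITT)
screened_only = 23 / 20_200 * 1000                   # self-selected subgroup
control = 63 / 31_000 * 1000
print(f"ITT invited arm: {itt_invited:.1f}/1000 vs control: {control:.1f}/1000")
print(f"screened only:   {screened_only:.1f}/1000 (flattered by self-selection)")
```

The screened-only rate (1.1/1000) looks better than the full invited arm (1.3/1000) only because women who accepted the invitation differed systematically from those who refused.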
The control group consists of a group of women who would have accepted the
invitation to be screened and those who would have refused. Within the control
group, who falls into each of these groups is unknown, and, therefore, the only fair
comparison is of the rates in the total randomized groups, the ITT comparison. Fink
et al. (1968) studied the “reluctant” participants who refused screening in the
group randomized to screening and identified differences between those who did and did not participate that were related to socioeconomic status and education as well as
presence of other health issues that took priority for them. While this trial was
randomized, it was not blinded in the sense that it is known which women were in
each group; the control group received usual care. Outcomes could more readily be
evaluated because of the nature of the HMO that did have medical records for all
participants. There was central review of biopsy and surgery records. Follow-up was
carried out for all women in the trial from these records and from death records.
Screening trials have additional considerations that will not be discussed here.
Note, however, that self-selection of participants, handled primarily through the ITT
primary analysis, can still be an issue in interpretation. Other issues that impact the
analysis and interpretation of the results of these trials include the distinctions between
diagnosis at an initial screen, reflecting disease prevalence, and diagnosis at subse-
quent screens, identifying incident cases; the definition of a positive screen in a multi-
modality setting; the lack of true follow-up of negatives on screening (false negatives);
lead time bias; length-biased sampling; and misclassification of disease status.

Example: The Polycythemia Vera Study Group PVSG-01 – A Randomized Multicenter Trial (Open Label) for Chronic Disease

In 1967, the National Cancer Institute supported an international multicenter randomized clinical trial to compare treatment with a radiotherapeutic agent, 32P,
administered as a monthly injection, a chemotherapeutic agent (chlorambucil, an
alkylating agent), and phlebotomy (at the time, standard of care) for patients with
polycythemia vera (PV). PV is a relatively rare chronic disease characterized by an
elevated hematocrit. This phase III clinical trial (Goldberg 2006; Goldberg and Shao
2008) was designed to evaluate the available treatments to develop definitive recommendations for care. Stroke, hemorrhage, leukemia, and death were the
expected negative outcomes. Patients were randomized between 1967 and 1974
from more than 40 institutions in 4 countries. While treatment was randomly
assigned, there was no blinding, and the timing and route of delivery differed for
each treatment. Diagnostic criteria were ill-defined, and capabilities for diagnosis
varied across centers and regions. The trial was planned to evaluate multiple
endpoints over a lengthy period of follow-up. Over this follow-up period, treatment
changed, compliance (or lack of compliance) was not well captured, and supportive
care changed. Frequent interim analyses (every 6 months) were conducted for the
semi-annual investigator meetings with reviews of patient accrual, patient eligibility,
and outcomes. The concept of ITT had not been developed; patients who were
randomized without any follow-up or who were found to be ineligible were excluded
from the analyses of “evaluable” patients. Randomized treatment assignments were
provided in sealed envelopes to each center; web-based study management was not
yet available.
Four hundred seventy-eight patients were randomized; 431 patients comprised the
randomized, eligible patient population. The results, reviewed regularly, were fraught
with issues of timeliness in reporting, multiple analyses of “dirty data,” and center and
regional variability in all aspects of implementation and follow-up of patients.
The primary study endpoint was time to first occurrence of a major endpoint
(major thrombotic event, development of acute leukemia or lymphoma, develop-
ment of nonhematologic cancer, or death). Follow-up was intensive through 1981,
was updated in 1987 with all events reported as of January 1, 1987, and again in
1993 (Berk et al. 1981, 1995). At the time of the 1987 update, 16.3% of the initial
431 randomized, eligible patients were still alive and actively in follow-up, 50.8%
had died while on study, 29.0% of the initial population had been removed from the
study for a variety of reasons including major protocol violations, and 3.9% were
irretrievably lost to follow-up.
Ineligibility rates after central review varied by treatment group (14% on phle-
botomy, 9% on chlorambucil, and 6% on 32P) and by region (4% in the US region
and 18% in the major non-US region). In particular, the early loss to follow-up rates
differed by region with the largest losses in the phlebotomy arm in the same non-US
region (23%) with the highest ineligibility rate (Goldberg and Koury 1989).
In summary, this trial illustrates all of the complexities associated with a long
enrollment period, ongoing treatment over time, lack of blinding at randomization,
difficulties with respect to long-term follow-up during which changes in treatment
and supportive care occur, relatively high ineligibility rates, differential follow-up
across treatment groups, and nonblinded data monitoring. The major results of the
trial included the identification of excess risks of leukemia and nonhematologic
malignancies associated with treatment with chlorambucil. To the extent possible,
sensitivity analyses to evaluate the effects of these many issues did not change the
interpretation of the results. Nevertheless, the lessons learned serve to inform the
design and conduct of trials and the use of the ITT paradigm for the analysis of the
primary objectives in randomized controlled trials.

Example: Randomized Phase III Trial in Chronic Disease – MPD-RC 112 Phase III Trial of Frontline Pegylated Interferon Alpha-2a (PEG) Versus Hydroxyurea (HU) in High-Risk Polycythemia Vera (PV) and Essential Thrombocythemia (ET): NCT01258856

In this randomized open label (unblinded) phase III trial, the primary objective was
to compare complete hematologic response rates determined by blinded central
review in patients randomized to treatment by PEG (new treatment) and HU
(standard of care, readily available by prescription) by the end of 12 months of
therapy with planned analyses in each of the two disease strata (PV and ET). Patients
were to be within 1 year of initial diagnosis and to be treatment naïve with less than
3 months of HU therapy. The trial was originally designed to randomize 612 patients
with 2 planned interim analyses, the first when 25% of the planned accrual had adequate time on study to be evaluable for response. Randomization began in September 2011 and
continued through June 2016 at 24 centers in 6 countries. The study was amended
multiple times because of slow enrollment to a final sample size of 170 patients
across the 2 disease strata with 1 interim analysis planned to be conducted when 75
patients were evaluable for response. Entry criteria were relaxed so that allowable
prior duration of disease was lengthened from less than 1 year in the original
protocol to less than 5 years in a situation where the diagnosis of the disease could
be made only at the time of an identified complication. Thus, this amendment would
enroll more patients with indolent disease who did not have an early complication
that would have rendered them ineligible for the study. This trial illustrates many
additional difficulties in the conduct of an RCT when one of the treatments is the current
standard of care available outside of the trial. In fact, the sponsor of the experimental
arm stopped drug supply for administrative reasons. While treatment assignments
were implemented using a blinded randomization scheme, the actual assigned
treatment was known to both the investigators and patients because of the different
methods of delivery. In this trial, 7% of the 86 HU patients never received any
treatment, while all of the 82 PEG patients received treatment. When the study was
closed with the reduced sample size, the final ITT CR rates were 37.2% in the HU
group and 35.4% in the PEG group (Mascarenhas et al. 2018).

ITT Principle

The International Conference on Harmonization ICH E9 (1998) guideline for statistical principles in randomized trials defines ITT as “the principle that the effect of a
treatment policy can be best assessed by evaluating on the basis of the intention to
treat a subject (i.e., the planned treatment regimen) rather than the actual treatment
given. It has the consequence that participants allocated to a treatment group should
be followed up, assessed, and analyzed as members of that group irrespective of their
compliance to the planned course of treatment” (See Little and Kang 2015; ICH E9
1998). ITT is the gold standard for the analysis of randomized trials. Many authors
have reviewed the issues that surround the use of ITT over recent years (see, for some
discussions, Goldberg and Belitskaya-Levy 2008a, b, c; Ellenberg 1996, 2005; DeMets 2004).
The analysis of a randomized trial based on the ITT principle can be considered a
comparison of the planned treatments, or of the treatment strategies as planned. The
general assumption is that the ITT analysis in a superiority or difference trial is
conservative. All patients who do not satisfy the conditions of the study at random-
ization or during the trial are effectively treatment “failures” when they are included
as randomized. When such subjects or patients are distributed equally among the different
treatment groups, then any differences between the groups are, in general, reduced,
and the effective sample size is reduced. If the results of the trial still favor
superiority, while the estimate of the treatment effect (or effectiveness) may be
attenuated, the conclusion would remain unchanged. On the other hand, if there
are differential effects associated with treatment group, the ITT analysis can inflate
the estimated effectiveness (Goldberg 1975).
The ITT paradigm yields an estimate of the difference or the relative difference in
the effectiveness of the planned treatment strategies in the simple two-arm random-
ized trial rather than of the actual efficacy in patients who met the entry criteria,
complied with the treatment, and had complete outcome data. For example, if 90%
of those randomized to a new treatment refused to comply with the treatment, even if
the success rate in the remaining 10% of patients was 100%, the estimated effectiveness for the ITT analysis would be 10%, which may be lower than the rate for the standard treatment strategy with a higher compliance rate.
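The arithmetic of this example can be written as a one-line function (a deliberately idealized sketch, with a hypothetical helper name, that ignores sampling variability):

```python
# Under ITT, noncompliers stay in the arm to which they were randomized,
# so observed effectiveness blends efficacy in compliers with outcomes
# in noncompliers.
def itt_effectiveness(compliance, efficacy_compliers, success_noncompliers=0.0):
    return compliance * efficacy_compliers + (1 - compliance) * success_noncompliers

print(itt_effectiveness(0.10, 1.00))   # 0.10, the 10% figure in the text
```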
Historically, trials were often planned with an increased sample size to allow
some flexibility for missing data of various kinds. However, even with an
inflation in planned sample size, the issues of the impact of missing data of all
types still remain to be addressed. The National Research Council report on “The
Prevention and Treatment of Missing Data in Clinical Trials” (2010) outlines
the key issues and provides some recommendations to deal with these issues.
Recent reviews of publications of randomized controlled clinical trial results in
various therapeutic areas suggest that the ITT principle is used less often than
would be expected and that attention to the implications of missing data in the
analyses is limited. For example, Royes et al. (2015) reported that of 91 trials
published in 5 major musculoskeletal journals over 2 years, only 38% used a
complete ITT analysis for the primary outcome. Bell et al. (2014), in a review of the handling of missing data in 77 RCTs published in top medical journals within a 6-month period following the NRC report, indicated that 95% of these trials reported some missing outcome data, with a median of 9% and a maximum of 70%. Further, complete case analysis was the most common approach to handling missing data in the primary analysis (45%), followed by simple imputation and
then model-based methods and multiple imputation. Most of the trials used an
ITT or modified ITT analysis with only 35% of trials reporting sensitivity
analysis.
The NRC report recommends a primary analysis under the assumption that data
are missing at random, followed by sensitivity analyses that weaken assumptions to
include data not missing at random.

Alternatives to ITT Population for Analysis

The alternatives frequently used in place of the ITT principle to define the study
population for comparison in an RCT include modifications to the ITT (mITT), per
protocol (PP), as treated (AT), and variations among these and other options to define
the groups that are being compared in a trial. Each of these alternatives requires
careful definition within a trial protocol to ensure that there is clarity with respect to
the details.
mITT is often only vaguely defined and requires specificity to even evaluate how
it would operationally impact the analysis of an RCT. For example, patients ran-
domized as eligible, but subsequently found to be ineligible, could be excluded. Or,
the modification could be just to include all patients randomized to a trial who
actually received at least one dose of study treatment as randomized. The closest to
the ITT paradigm would be to include these patients as randomized. In randomized
trials of infectious diseases, a modified ITT approach is often used. In this case,
outcomes in the two treatment groups are compared for patients who actually have
diseases caused by organisms sensitive to the treatments. But, in practice, treatment
is given presumptively since the results of sensitivity testing are often not available
at the initiation of treatment. In this case, the ITT approach provides an approxima-
tion to the treatment strategy that would be implemented in practice.
Per protocol (PP) populations are defined to include patients who met the
protocol criteria for entry, complied with the treatment regimen based on the trial
definition of compliance, and completed follow-up for the outcome. The PP approach
includes analysis of patients evaluable for response in oncology trials, for example,
in which only those patients who received sufficient treatment to be evaluated at the
primary response outcome assessment are included in the analysis, eliminating those
patients who might have deteriorated on treatment and went on to other treatments or
died prior to the evaluation time. Clearly, this can provide misleading results
depending on the distributions of these patients in the treatment groups being
compared.
As treated (AT) populations would include patients with assignment to the
treatment actually received rather than assignment to treatment as randomized. An
as-treated (AT) analysis assigns subjects or patients to the treatment-taken group
regardless of the randomized assignment. As Ellenberg (1996) points out, these AT
analyses assign subjects to groups based on their compliance in the randomized trial.
And the definition of compliance to the assigned treatment can be subjective. In an
AT analysis, for example, a subject assigned to the new treatment may actually take
the standard treatment. In a blinded RCT, subjects can take only the standard if
available outside the trial, a common issue in trials that use available treatments as
the standard. In the absence of blinding, when one or both treatments are available
outside the trial setting (such as MPD-RC 112 with both hydroxyurea and PEG
interferon available outside the trial), the problem is exacerbated. In fact, patients
and/or their physicians will comply with the assigned treatment only if it is not
available to them in other ways. In these settings, compliance rates on the random-
ized regimen often differ for the two (or more) treatment arms. The ITT analysis in
this setting provides an unrealistic assessment of the treatment effects, but any other
analyses are biased by the selection process associated with compliance.
Note that safety analyses, however, are conducted appropriately on an AT basis
with subjects assigned to the regimen actually taken, as distinct from the ITT
approach for effectiveness.
While there are multiple variants of these broad approaches to identifying the
populations for analysis in an RCT, each of which has limitations, if these multiple
analyses do not differ in any substantive way, the risk of incorrectly attributing
efficacy (or lack of efficacy) to a new treatment is reduced.
In the setting of a noninferiority or equivalence trial, the use of the ITT population
for analysis reduces any differences between the two groups favoring the conclusion
of noninferiority or equivalence (see, e.g., Kim and Goldberg 2001; Sanchez and
Chen 2006).

Missing Data

The consensus among clinical trialists is that the gold standard remains the ITT
analysis for the randomized trial. The primary source of deviation from the ITT
paradigm arises from missing data of some type. That said, the NRC report (2010)
and many other authors focus on the need to develop a careful protocol that considers
primary outcomes and includes plans to minimize missing data of all kinds. Missing
data can occur at every stage of a trial with differing implications for analysis.
That missing data are unrelated to treatment or outcome and are missing
completely at random (MCAR) is generally not the case in RCTs. Rather, missingness can be related to treatment but not to the outcome, that is, missing at random (MAR), a more plausible scenario. Such an assumption can lead to overly optimistic estimates of
treatment effects (see ▶ Chap. 84, “Estimands and Sensitivity Analyses”) but can be
useful. Lastly, missingness can be related to both treatment and outcome, that is, not
missing at random (NMAR), a scenario as noted in ▶ Chap. 84, “Estimands and
Sensitivity Analyses,” that is a more likely occurrence than one would like. ▶ Chapter 86, “Missing Data,” provides details of approaches to incorporate missing
data into the analysis of an RCT.
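A toy simulation (entirely hypothetical) makes the practical consequence concrete: when missingness depends on the unobserved outcome itself (NMAR), a complete-case estimate of the treatment effect is biased.

```python
# Simulated NMAR dropout: poor responders on the new treatment drop out,
# so a complete-case analysis overstates the treatment effect.
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
arm = rng.integers(0, 2, n)
y = 1.0 * arm + rng.normal(0, 1, n)          # true treatment effect = 1.0

# Worse (lower) outcomes are more likely to be missing, in the treatment arm only.
p_miss = np.where((arm == 1) & (y < 0.5), 0.6, 0.05)
observed = rng.random(n) > p_miss

full = y[arm == 1].mean() - y[arm == 0].mean()
cc = y[observed & (arm == 1)].mean() - y[observed & (arm == 0)].mean()
print(f"full-data effect ~ {full:.2f}; complete-case effect ~ {cc:.2f} (inflated)")
```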
Patients can be randomized to the incorrect treatment, in incorrect strata, or with
incorrect diagnoses at the outset. In some trials, randomization and treatment occur
based on a presumptive diagnosis while additional testing and review of the entry
criteria continue. At the study design stage, the goal is to randomize as close to the
initiation of treatment as possible with as much confirmed information as possible.
The handling of these types of errors impacts the analysis. In the ITT paradigm,
subjects or patients remain in the trial as randomized. Similarly, if the subject or
patient never receives the randomized treatment, the subject remains in the analysis
as randomized. If the patient does not comply with the assigned treatment, the patient
remains as randomized. And, if the patient withdraws from the study for any reason
at any point, the patient remains in the trial as randomized.

The emphasis in the NRC report (2010) is on minimizing missing data of any
kind. Some baseline data can be missing in any trial. Trials with a single treatment or
intervention encounter and an immediate assessment of the outcome have the
smallest potential for missing data. As the duration of the intervention increases,
the potential for missing data increases and includes subject withdrawal for many
potential reasons that may be related to the intervention (e.g., side effects). As the
length of the follow-up period after the completion of treatment increases, the
potential for loss to follow-up as well as for the use of alternative treatments
increases. Short-term treatment and follow-up minimize missing data; long-term treatment with end-of-treatment follow-up, and especially long-term post-treatment follow-up, provides more opportunities for missing data. Post-randomization missing data can become a major problem for analysis and interpretation (NRC 2010; Little and Kang 2015).
In short-term trials with, for example, a one-time intervention (e.g., vaccine,
screening test) and a short-term single outcome assessment, missing outcome data
should not pose a major problem. Most clinical trials, however, involve multiple
dosing or treatments over time with follow-up at planned intervals during the active
intervention phase and then long-term follow-up after the intervention is complete.
Because the likelihood that complete data are obtained as planned is reduced, there is
recent interest in extending the concept of the ITT approach through a focus on how
to evaluate the multiple objectives, including the primary objective, of the trial and
choice of the appropriate outcomes (measurements) to be used for these evaluations.
▶ Chapter 84, “Estimands and Sensitivity Analyses,” provides an overview and
summarizes the strategies for the choice of “estimands” for different scenarios
beyond the ITT analysis. In this context, the ITT estimand estimates the effect of
the randomization to treatment on outcome. Among strategies to address intercurrent
events (post randomization) including compliance/noncompliance (Little et al.
2009) are treatment policy estimands similar to the ITT approach, composite end-
points that include intercurrent events on treatment, hypothetical estimands, princi-
pal stratification causal estimates (Frangakis and Rubin 2002; Little and Rubin 2000),
and on-treatment estimates. Compliance can be incorporated into analyses in various
ways including as a covariate. However, the definition of compliance has to be clear
and consistent and defined in the protocol. These concepts are elucidated from a
regulatory standpoint in ICH E9 R1 (2017) and elsewhere.

Alternative Approaches to Analysis

The primary ITT analysis in an RCT provides, as noted above, an estimate of the effect of the treatment strategy defined by the protocol, with all randomized patients included as randomized. The reality in a trial is often quite different. In addition, different kinds
of trials have different requirements. The AT and PP populations can be analyzed, but
each has different interpretations with respect to trial results. Composite endpoints can
provide a single summary in an ongoing trial that includes events on treatment. For
example, such an endpoint in a survival-type trial could be based on progression-free
survival, the time to disease progression or death, whichever occurs first. The analysis
of long-term outcomes can also be confounded with use of rescue medications, side
effects, and response or lack of efficacy itself if any of these events contribute to
missing data, particularly with differential rates in the groups being compared. Of
course, it is still possible to have incomplete information with respect to the earlier
endpoint so that there is bias introduced when the first event is death that may or may
not have been preceded by disease progression. Analyses of multiple endpoints and
competing risk analyses can shed some light on these potential biases.
In longitudinal trials with repeated measurements of outcomes such as blood
pressure over time, mixed effects regression models and generalized estimating equation (GEE)
models can be used. However, again, the assumptions of such methods (e.g., missing
at random) can easily be violated by differences in the distributions and types of
intercurrent events between the treatment groups (Little and Kang 2015). Hogan
et al. (2004) summarize approaches to handling dropouts in the longitudinal setting.
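For concreteness, a hedged sketch of a GEE analysis of a repeated-measures outcome is shown below using statsmodels; the data frame, column names, and effect sizes are invented, and the working correlation structure and dropout assumptions would need justification in a real trial.

```python
# Hypothetical GEE analysis of repeated blood pressure measurements.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, visits = 60, 4
long = pd.DataFrame({
    "id": np.repeat(np.arange(n), visits),
    "visit": np.tile(np.arange(visits), n),
    "arm": np.repeat(rng.integers(0, 2, n), visits),
})
subject_effect = np.repeat(rng.normal(0, 5, n), visits)   # within-subject correlation
long["bp"] = (140 - 2 * long["visit"] - 3 * long["arm"] * long["visit"]
              + subject_effect + rng.normal(0, 4, len(long)))

# Exchangeable working correlation for repeated measures on the same subject;
# validity still rests on assumptions about the dropout mechanism.
model = smf.gee("bp ~ visit * arm", groups="id", data=long,
                cov_struct=sm.cov_struct.Exchangeable(),
                family=sm.families.Gaussian())
print(model.fit().summary())
```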
Various approaches to analysis of the different populations (AT, PP, compliers)
have been traditionally employed with known limitations. For example, single
imputation in an analysis based on PP patients can be viewed as a variant of what
is known as a “completers” or “complete case” analysis. The other extreme of this
approach is to use the first observation carried forward, best observation carried
forward, or “last observation carried forward” (LOCF) to replace missing outcome
data. In an LOCF analysis, the last available observation is used for each subject in
the analysis; this observation could even be the baseline pre-treatment observation.
These kinds of analyses are flawed and yield results that are biased in different ways.
While these methods have mostly been replaced by mixed effects regression models
and various other approaches, the differences in the results from all of these methods
can provide useful sensitivity analyses (see Thabane et al. 2013).
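For reference, the mechanics of LOCF are trivial to implement, which is part of why it was so widely used despite the flaws noted above; a pandas sketch with invented values:

```python
# LOCF: carry each patient's last available measurement forward in time.
import numpy as np
import pandas as pd

wide = pd.DataFrame({"baseline": [160, 155, 170],
                     "week4":    [150, np.nan, 162],
                     "week8":    [np.nan, np.nan, 158]},
                    index=["pt1", "pt2", "pt3"])

# Forward-fill along each row; for pt2 the "week8" value is simply the
# baseline measurement, which is exactly the criticized behavior.
locf = wide.ffill(axis=1)
print(locf)
```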
Other approaches that have been proposed for the analysis of RCTs that address
many of these issues are beyond the scope of this chapter. Bayesian methods can be
used to incorporate additional treatment information such as rescue medication for
treatment failure using data augmentation algorithms (Shaffer and Chinchilli 2004).
Selection models allow formal incorporation of potential outcomes and pattern
mixture models to model associations between observed exposures and outcomes
(Goetghebeur and Loeys 2002). Causal effect models can be used for realistic treatment assignment rules when the experimental treatment assignment (ETA) assumption is violated (van der Laan and Petersen 2007).

Noninferiority and Equivalence Trials

In the case of noninferiority or equivalence trials, an ITT analysis can bias the results
in favor of noninferiority or equivalence. This occurs because the effective sample
size is reduced by the inclusion of ineligible and noncompliant patients and the
difference between the groups is decreased favoring “no difference.” Hybrid ITT/PP
analyses that exclude noncompliant patients and incorporate the impact of missing
data in this setting have been proposed by Sanchez and Chen (2006).

Cluster Randomized Trials

Example: Cluster Randomized Trials and ITT – Online Wound Electronic Medical Record to Reduce Lower Extremity Amputations in Diabetics – A Cluster Randomized Trial [AHRQ: R01 HS019218-01]

The Online Wound Electronic Medical Record (OWEMR) was an informatics tool that synthesized diabetic foot ulcer (DFU) data to inform treatment decisions. The
primary objective of this two-arm cluster RCT was to assess the impact by 6 months
of the OWEMR and standard of care (OWEMR+SOC) compared to SOC alone on
lower limb amputation or death. In a cluster randomized trial, intervention or
treatment assignment is randomized to a group of individuals defined, for example,
by a classroom, a school, or a center in a multicenter trial. In such a trial, the cluster
or group is randomized, and all members of the group or unit receive the same
treatment assignment. Thus, the unit of randomization is the cluster. Sample size is
based on the number of clusters and, only in part, on the number of observations
within a cluster. Analyses are based on the cluster summary and can also be based on
the individual observations nested within cluster. In a cluster randomized trial, the
ITT analysis can be thought of as the analysis of all randomized clusters. However,
there can be instances in which within a cluster, individuals do not uniformly adhere
to the assigned treatment. Within a cluster, individual outcomes are often correlated.
This RCT was originally designed to include 3504 patients in 12 centers (clusters;
292 patients/cluster) each of which would be randomized to the tool+SOC or SOC
alone. Enrollment began in August 2011 and was expected to complete in September
2013. Six of the original 12 study sites were closed for poor accrual. Additional sites
were identified; 16 study sites (ranging in size from 0 to 295 patients) were included
in this study. Of the 1608 subjects who signed informed consent in these sites, 47
were screen failures and 1561 were enrolled in the trial. OWEMR+SOC centers
enrolled 1 to 295 subjects (total, 977; primary outcome rate, 14.7%), and SOC alone centers
enrolled 0 to 169 subjects (total, 584; primary outcome rate, 11.8%). When early
terminations were included as failures, the composite failure rates were, respectively,
36.9% and 42.1%. For the primary endpoint, the results favored SOC, but the
dropout and early termination rates were, in fact, higher in the SOC arm. This
illustrates the difficulty in attracting centers (and patients) to remain on a trial
when they are not randomized to the new intervention.

Some Additional Design Considerations for ITT Analyses

Careful consideration of all of the design details in the RCT protocol can facilitate
analysis and interpretation of the results. Some possible design modifications could
retain the features of the planned ITT analyses.
Often in RCTs, substudies using new and potentially expensive technologies are
incorporated into the trial. Frequently, these substudies are carried out in conve-
nience samples or “wherever data are available.” While a relatively large sample size
is often required for the primary and secondary endpoints, the inability to measure all

variables on all subjects/patients introduces missing data that may or may not be
missing at random. Nested random subsamples of the overall study population can
be used to measure these more expensive classes of variables, with smaller subsamples
as the cost of collection increases. These SMAR-type designs (Belitskaya-Levy et al.
2008; Goldberg 2006) enable integrated analyses in a setting of planned monotone
missingness with data missing at random.
In RCTs with response-dependent changes in treatment, patients can be ran-
domized to complete regimens at the initial randomization, using a treatment-
strategy randomization that incorporates arms in which patients are randomized
to continue treatment or to receive additional treatment if required. For example, in
combination therapy trials for hypertension, patients could be randomized to the
new treatment, standard treatment, or placebo at the initial stage. Subsequent
randomizations would allow various combinations within each of the original
treatment groups. Such a design would allow an unbiased comparison within
each original randomization group of patients with and without additional therapy
after the first stage.
Sequential Multiple Assignment Randomized Trials (SMARTs) allow the com-
parison of dynamic treatment strategies (DTS) or sets of decision rules for patient
management (see Almirall et al. 2014). In the SMART framework, patients are
randomized to different treatment branches that separate DTSs. In the classic ITT
framework, randomization occurs at the start of the trial, and subsequent treatment
changes after the initial randomization are not governed by randomization. The
ITT design converts to a SMART by randomizing at the points where treatment
decisions would change. Patient information would contribute to one or more DTSs until the
patient leaves the DTS. These designs are gaining traction particularly in areas such
as behavioral intervention trials for smoking cessation.

Summary and Conclusions

Intention to treat (ITT) is the preferred approach for the statistical analysis of
randomized clinical trials. A careful, unambiguous protocol should be developed
prior to the trial initiation, and a plan for statistical analysis should be in place prior
to the conclusion of the trial and its unblinding. Investigators must provide a careful
accounting of all patients randomized, all patients treated, and all post-randomiza-
tion exclusions (if any) for lack of efficacy, lack of compliance, or lack of safety (see,
e.g., DeMets 2004; Begg 2000; Lachin 2000). The ITT principle provides a para-
digm for the conduct of RCTs that focuses on reducing any biases in patient/subject
assignment or evaluation of outcomes. That said, the realities of clinical trial conduct
often make it necessary to carefully consider deviations from the ideal model in the
analysis. The NRC report (2010) and the ICH E9 guidance (2017) review the
alternatives for analysis based on the goals of the specific trial and advantages and
limitations of these alternatives. Regardless of the primary method of analysis,
sensitivity analyses should be conducted to evaluate the effects of missing and/or
erroneous data at all stages of the trial process from randomization errors to missing
outcome data as well as the impact of lack of compliance.

Key Facts

• Randomized clinical trial (RCT). An RCT is the gold standard experimental
paradigm to test a null hypothesis that a treatment or intervention effect is 0 versus
the alternative that this effect is not equal to 0 in the two-group trial.
• Blinding. An RCT ideally should be blinded for treatment assignment and
outcome assessments and to patients or subjects. In practice, variations of this
occur.
• Intention to treat (ITT). ITT is the principle for the statistical evaluation of an
RCT that includes all subjects or patients in the analysis with assignment to their
randomized treatment or intervention group regardless of whether or not they
received the planned regimen.
• Efficacy. Efficacy is the actual treatment effect when the intervention is delivered
as planned; an RCT that includes all randomized subjects and all data from these
subjects can, under ideal conditions, estimate this effect.
• Effectiveness. An ITT analysis of an RCT evaluates the effectiveness of the
planned treatment as assigned regardless of the deviations from the plan as the
trial progresses.
• Missing data in RCTs. Data can be missing in any RCT for multiple reasons. The
extent and nature of missing data in an RCT can impact results and interpretation
under the ITT paradigm and alternative analyses.

Cross-References

▶ Adherence Adjusted Estimates in Randomized Clinical Trials


▶ Causal Inference: Efficacy and Mechanism Evaluation
▶ Estimands and Sensitivity Analyses
▶ Missing Data
▶ Prevention Trials: Challenges in Design, Analysis, and Interpretation of Prevention Trials
▶ Screening Trials
▶ Sequential, Multiple Assignment, Randomized Trials (SMART)

References
Almirall D, Nahum-Shani I, Sherwood NE, Murphy SA (2014) Introduction to SMART designs for
the development of adaptive interventions: with application to weight loss research. Transl
Behav Med 4:260–274. PMCID: PMC4167891
Begg CB (2000) Commentary: ruminations on the intent-to-treat principle. Control Clin Trials
21:241–243
Belitskaya-Levy I, Shao Y, Goldberg JD (2008) Systematic missing-at-random (SMAR) design and
analysis for translational research studies. Int J Biostat 4(1):Article 15. https://doi.org/10.2202/
1557-4679.1046. PubMed PMID: 20231908; PubMed Central PMCID: PMC2835456

Bell ML, Fiero M, Horton NJ, Hsu C-H (2014) Handling missing data in RCTs; a review of the top
medical journals. BMC Med Res Methodol 14:118
Berk PD, Goldberg JD, Silverstein MN et al (1981) Increased incidence of acute leukemia in
polycythemia vera associated with chlorambucil therapy. NEJM 304:441–447
Berk PD, Wasserman LR, Fruchtman SM, Goldberg JD (1995) Treatment of polycythemia vera: a
summary of clinical trials conducted by the Polycythemia Vera Study Group. In: Wasserman LR,
Berk PD, Berlin NI (eds) Polycythemia vera and the myeloproliferative disorders, Chapter 15.
WB Saunders, Philadelphia, pp 166–194
Brownlee KA (1955) Statistics of the 1954 polio vaccine trials. J Am Stat Assoc 50(272):
1005–1013
DeMets DL (2004) Statistical issues in interpreting clinical trials. J Intern Med 255:529–537
Ellenberg J (1996) Intent-to-treat analysis vs as-treated analysis. Drug Inf J 30:535–544
Ellenberg J (2005) Intention to treat analysis: basic. Encyclopedia of biostatistics. John Wiley and
Sons, Ltd
Fink R, Shapiro S, Lewison J (1968) The reluctant participant in a breast cancer screening program.
Public Health Rep 83(6):479–490
Frangakis CE, Rubin DB (2002) Principal stratification in causal inference. Biometrics 58:21–29
Francis T Jr et al (1955) An evaluation of the 1954 poliomyelitis vaccine trials – summary report.
Am J Public Health 45(5):1–63
Friedman LM, Furberg CD, DeMets DL (1998) Fundamentals of clinical trials, 3rd edn. Springer,
New York
Goetghebuer E, Loeys T (2002) Beyond intention to treat. Epidemiol Rev 24:85–90
Goldberg JD (1975) The effects of misclassification on the bias in the difference between two
proportions and the relative odds in the fourfold table. J Am Stat Assoc 70:561–567
Goldberg JD (2006) The changing role of statistics in medical research: experiences from the past
and directions for the future. Invited paper, Proc Amer Stat Assoc. 1963–1969
Goldberg JD, Belitskaya-Levy I (2008a) In: Melnick E, Everitt BS (eds) Intent-to-treat principle.
Encyclopedia of quantitative risk assessment. Wiley, Chichester
Goldberg JD, Belitskaya-Levy I (2008b) In: Melnick E, Everitt BS (eds) Randomized controlled
trials. Encyclopedia of quantitative risk assessment. Wiley, Chichester
Goldberg JD, Belitskaya-Levy I (2008c) In: Melnick E, Everitt BS (eds) Efficacy. Encyclopedia of
quantitative risk assessment. Wiley, Chichester
Goldberg JD, Koury KJ (1989) In: Berry DA (ed) Design and analysis of multicenter trials. Chapter 7
in statistical methodology in the pharmaceutical sciences. Marcel Dekker, New York, pp 201–237
Goldberg JD, Shao YS (2008) In: Melnick E, Everitt BS (eds) Comparative efficacy trials (phase III
studies). Encyclopedia of quantitative risk assessment. Wiley, Chichester
Harrington DB (2000) The randomized clinical trial. J Am Stat Assoc 95:312–315
Hogan JW, Roy J, Korkontzelou C (2004) Tutorial in biostatistics: handling drop-out in longitudinal
studies. Stat Med 23:1455–1497
ICH (1998) E9: guideline on statistical principles for clinical trials. www.ich.org
ICH (2017) E9 R1: Addendum on estimands and sensitivity analyses in clinical trials. Step 2. www.
ich.org
Kim MY, Goldberg JD (2001) The effects of outcome misclassification and measurement error on
the design and analysis of therapeutic equivalence trials. Stat Med 20(14):2065–2078. PubMed
PMID: 1143942
Lachin JM (2000) Statistical considerations in the intent-to-treat principle. Control Clin Trials
21:167–189
Little R, Kang S (2015) Intention-to-treat analysis with treatment discontinuation and missing data
in clinical trials. Stat Med 34:2381–2390
Little RJA, Rubin DB (2000) Causal effects in clinical and epidemiological studies via potential
outcomes: concepts and analytical approaches. Annu Rev Public Health 21:121–145
Little RJA, Long Q, Lin X (2009) A comparison of methods for estimating the causal effects of a
treatment in randomized clinical trials subject to noncompliance. Biometrics 65:640–649
Mascarenhas J et al (2018) Results of the myeloproliferative neoplasms – research consortium
(MPN-RC) 112 randomized trial of pegylated interferon alfa-2a (PEG) versus hydroxyurea

(HU) therapy for the treatment of high risk polycythemia vera (PV) and high risk essential
thrombocythemia (ET). Blood 132:577. https://doi.org/10.1182/blood-2018-99-111946
Meier P (1957) Safety testing of poliomyelitis vaccine. Science 125:1067–1071. https://doi.org/
10.1126/science.125.3257
Meier P (1989) The biggest public health experiment ever: the 1954 field trial of the Salk
poliomyelitis vaccine. In: Tanur JM, Mosteller F, Kruskal WH, Lehmann EL, Link RF, Pieters
RS, Rising GR (eds) Statistics: a guide to the unknown, 3rd edn. Duxbury
National Research Council (2010) The prevention and treatment of missing data in clinical trials.
The National Academies Press, Washington, DC
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley-Interscience, New York
Royes J, Sims J, Ogollah R, Lewis M (2015) A systematic review finds variable use of the intention-
to-treat principle in musculoskeletal randomized controlled trials with missing data. J Clin
Epidemiol 68:15–24
Sanchez MM, Chen X (2006) Choosing the analysis population in non-inferiority studies. Stat Med
25:1169–1181
Shaffer M, Chinchilli V (2004) Bayesian inferences for randomized clinical trials with treatment
failures. Stat Med 23:1215–1228
Shapiro S, Goldberg JD, Hutchison GB (1974) Lead time in breast cancer detection and implica-
tions for periodicity of screening. Am J Epidemiol 100(5):357–366. PubMed PMID: 4417355
Shapiro S, Venet W, Strax P, Venet L (1988) Periodic screening for breast cancer: the health
insurance plan project and its sequelae, 1963–1986. Johns Hopkins, Baltimore
Stuart EA, Perry DF, Le H-N, Ialongo NS (2008) Estimating intervention effects of prevention
programs: accounting for noncompliance. Prev Sci 9:288–298
Thabane L, Mbuagbaw L, Zhang S, Samaan Z, Marcucci M, Ye C, Thabane M, Giangregorio L,
Dennis B, Kosa D, Debana VB, Dillenburg R, Fruci V, Bawor M, Lee J, Wells G, Goldsmith CH
(2013) A tutorial on sensitivity analyses in clinical trials: the what, why, when and how. BMC
Med Res Methodol 13:92. http://www.biomedcentral.com/1471-2288/13/92. Accessed 28 Apr
2018
US FDA (2016) Non-inferiority clinical trials to establish effectiveness: guidance for industry.
https://www.fda.gov/downloads/Drugs/Guidances/UCM202140
Van der Laan MJ, Petersen ML (2007) Causal effect models for realistic individualized treatment
and intention to treat rules. Int J Biostat 3(1):Article 3. http://www.bepress.com/ijb/vol3/iss1/3
83 Estimation and Hypothesis Testing

Pamela A. Shaw and Michael A. Proschan

Contents
Introduction
Estimation and Uncertainty for Continuous Endpoints
Estimation and Uncertainty for Noncontinuous Outcomes
Estimation of the Difference Between Groups
Hypothesis Testing
Special Topics in Hypothesis Testing
Exact Tests and Other Considerations for Choosing a Hypothesis Test
Multiple Comparisons
Noninferiority Versus Superiority
Controversies in Hypothesis Testing
Two-Sided Versus One-Sided Controversy
The P-Value Controversy
Summary and Conclusion
Key Facts
Cross-References
References

Abstract
This chapter presents basic elements of parameter estimation and hypothesis
testing. The reader will learn how to form confidence intervals for the mean,
and more generally, how to calculate confidence intervals for the one parameter
setting and for the difference between two groups. Principles of hypothesis testing
are detailed, including the choice of the null and alternative hypotheses, the
significance level, and implications for choosing a one-sided versus two-sided

P. A. Shaw (*)
University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
e-mail: [email protected]
M. A. Proschan
National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA
e-mail: [email protected]


test. The p-value is defined, and a discussion of controversies that have arisen over
its use is included. After reading this chapter, the reader will have a better
understanding of the necessary steps to set up a hypothesis test and make valid
inference about the quantity of interest. Other topics in this chapter include exact
hypothesis tests, which may be preferable for small sample settings, and the
choice of a parametric versus nonparametric test. The chapter also includes a brief
discussion of the implications of multiple comparisons on hypothesis testing and
considerations of hypothesis testing in the setting of noninferiority trials.

Keywords
Confidence interval · Hypothesis testing · Noninferiority trial · Parameter ·
p-value · Power · Significance · Standard error · Test statistic · Type I error ·
Type II error

Introduction

Suppose a trial of two interventions aimed at helping individuals lose weight is


conducted. Individuals on Intervention A lose more weight on average than those on
intervention B. Can one conclude that Intervention A is better? What information is
needed to be fairly certain that if an independent clinical trial of these interventions
were conducted, its data would lead to the same conclusion? Suppose the data in the
original trial showed that individuals on arm A lost 6 pounds and individuals on arm
B lost 2 pounds, on average. Could one now conclude Arm A was definitively better?
The answer is, it depends. It depends on whether 4 lb. is an important clinical
difference in weight change. It also depends on whether the observed 4 lb. difference
between groups is larger than the natural variability in average weight loss on each
arm.
This chapter considers two principal goals of statistical analysis: (1) estimat-
ing a parameter that summarizes the outcome of interest, e.g., effect of treatment
on weight change and (2) testing whether the value of the parameter of interest is
different in different groups, for example, is the effect of treatment on weight
change different for treatment A versus B. Both of these activities rely on
quantifying the amount of uncertainty in the estimate of the parameter of interest.
This allows calculation of the likelihood of a difference between two groups at
least as large as what was observed if treatment truly has no effect. We describe a
number of common summary statistics (mean, proportion, etc.) and their mea-
sures of uncertainty, i.e., the statistic’s variance. We define and present methods
to calculate confidence intervals that summarize the range of plausible parameter
values. Finally, we describe a formal framework for hypothesis testing that helps
determine whether signals observed in data can be distinguished from chance.
Statistical inference is the process of making statements from data about a
parameter of interest. Confidence intervals and hypothesis tests are the primary
tools of statistical inference.

Estimation and Uncertainty for Continuous Endpoints

The principal goal in a randomized clinical trial is to summarize the effect of an


intervention using data gathered from trial participants. The primary outcome chosen
to summarize the drug’s effect focuses on safety in early phase trials and efficacy in
later phase trials. To reliably quantify the degree of certainty about statements of
safety or efficacy of an intervention, one must specify in advance in the protocol and
statistical analysis plan the specific parameter of interest, and methods to estimate its
value. The parameter can be thought of as the true underlying value for a population
under study, such as the true average change in weight after 3 months of intervention
A. In this section, we focus on what is probably the most common parameter
estimate used in clinical trials for a continuous outcome, the mean, to illustrate the
general approach to summarizing an intervention effect.
Return to the example of the weight loss trial. The primary outcome is weight
change from baseline to end of trial. The distribution of weight change may look like
the bell curve of a normal distribution, or it may be skewed. The Central Limit
Theorem says that as the sample size increases, the distribution of the arithmetic
mean tends towards the normal distribution. This holds regardless of how weight
change is distributed, provided only that its variance is finite. This means we can use
attributes of the normal distribution to make statements about the mean. In particular,
with 95% probability, a normally distributed statistic will take on a value that is
within approximately two standard deviations of its mean. A normal distribution is
completely characterized by its mean and standard deviation. This underlying
principle is used to construct what is known as the confidence interval for the
unknown parameter of interest, say μA, the underlying mean weight change for
intervention A. Specifically, if the trial were repeated thousands of times, approxi-
mately 95% of such intervals would contain the true parameter value μA. The 95%
confidence interval can be approximated by x̄ ± 1.960 × SE, where SE is the estimated
standard error and +1.960 (−1.960) is the upper (lower) 2.5% quantile of a standard
normal distribution. As shown in Fig. 1, for the standard normal distribution (mean
0, standard deviation 1), 2.5% of the values are above the value 1.960. Thus,
z0.025 = 1.960 for this distribution and, by symmetry, 95% of the values are between
−1.960 and +1.960.

Fig. 1 The standard normal distribution, which has mean 0 and standard deviation 1,
is shown along with its upper 2.5% quantile, denoted by z0.025, where z0.025 = 1.960.
The probability that values from this distribution exceed 1.960 is 2.5%

For a sample mean of n independent observations with the same underlying
variance, the SE is estimated by s/√n, where s is the sample standard deviation.
Suppose in the weight loss trial there are 100 individuals on intervention A, the
sample mean is 6 and s = 20. Then the SE for the sample mean is SE = 20/√100 = 2.0,
and the 95% confidence interval for μA is approximately (2.08, 9.92). One can
construct a confidence interval of arbitrary confidence level L (e.g., 0.95) by adding
and subtracting zα/2 × SE to x̄ when the sample size is large, where zα/2 is the upper
α/2 quantile of the normal distribution and SE is the standard error for the mean.
In the weight loss trial, as is typical, the sample standard deviation had to be
estimated. This extra imprecision needs to be taken into account for the 95%
confidence interval to preserve its property of covering the true value 95% of the
time (particularly with modest sample sizes).
Quantiles from the t-distribution with n-1 degrees of freedom (tn-1,α), instead of
the normal distribution, will account for this extra uncertainty for a sample of size n.
Thus, confidence intervals are generally formed with x̄ ± tn−1,0.025 × SE. For large n,
tn−1,0.025 quickly becomes indistinguishable from z0.025. For example, for n = 100,
t99,0.025 = 1.984, and for n = 500, t499,0.025 = 1.965. The t-statistic applies here
specifically because the standardized quantity (x̄ − μ)/SE follows a t-distribution
with n-1 degrees of freedom. This t-distribution is symmetric about 0, like the
normal, but with wider tail probabilities. The t-distribution is sometimes referred
to as Student’s t distribution, named after the pseudonym under which William Sealy
Gosset first published the distribution for the standardized sample mean (Wendl
2016).
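
These calculations are easy to reproduce in software. The following sketch (in
Python with numpy and scipy; the summary values are those of the worked example,
and the variable names are ours, not from the text) computes the t-based 95%
confidence interval for the arm A mean:

    import numpy as np
    from scipy import stats

    n, xbar, s = 100, 6.0, 20.0               # arm A: sample size, mean, SD

    se = s / np.sqrt(n)                       # standard error of the mean: 2.0
    t_crit = stats.t.ppf(0.975, df=n - 1)     # t(99, 0.025) = 1.984
    print(xbar - t_crit * se, xbar + t_crit * se)   # about (2.03, 9.97)

The t-based interval, (2.03, 9.97), is slightly wider than the large-sample normal
interval (2.08, 9.92) given above, reflecting the extra uncertainty from estimating s.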

Estimation and Uncertainty for Noncontinuous Outcomes

Two other common parameters of interest for clinical trials are proportions from
binary outcomes and hazard ratios from time-to-event outcomes. For these param-
eters, the principles of the CI are the same. We can again rely on the Central Limit
Theorem and form the confidence interval by adding and subtracting the desired
quantile multiple of the appropriate SE. For proportions, the sample mean of the
binary outcome is the estimated proportion p̂, and the estimated SE is [p̂(1 − p̂)/n]1/2.
The 95% confidence interval for a population proportion is thus p̂ ± 1.960 × [p̂(1 − p̂)/
n]1/2. The log hazard ratio (β̂) and its SE can be directly estimated from the Cox
model, and it is on the log scale that the hazard ratio estimator is approximately
normal (Cox 1972; Hosmer et al. 2011). One can form the 95% confidence interval
by β̂ ± 1.960 × SE. This is also referred to as the Wald confidence interval for the log-
hazard ratio. For this and other statistics estimated with likelihood techniques, there
are other methods besides the Wald approach to estimate confidence intervals, such
as those that rely on the score or the likelihood ratio statistics (Casella and Berger
2002). While for large sample sizes, these three methods will produce similar results,
at smaller sample sizes there will be noticeable differences. The score statistic can be
a more efficient method than the more commonly used Wald technique; that is, with

the same data set, the score interval can be narrower, which is a desirable feature for a
confidence interval. See Lin et al. (2016) for a comparison of the different methods
of hazard ratio confidence interval estimation.
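
As a small numerical illustration (a sketch; the values of β̂ and its SE below are
hypothetical, not taken from a fitted Cox model), the Wald interval is computed on
the log scale and then exponentiated:

    import numpy as np

    beta_hat, se = -0.25, 0.10           # hypothetical log-HR estimate and its SE
    lo, hi = beta_hat - 1.960 * se, beta_hat + 1.960 * se
    print(np.exp(beta_hat))              # estimated HR, about 0.78
    print(np.exp(lo), np.exp(hi))        # 95% CI for the HR, about (0.64, 0.95)

Because this hypothetical interval excludes 1, such data would not be consistent
with equal hazards in the two arms.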
For discrete data that take on many values that have a natural order, such as count
data or an outcome taken from an ordinal scale, such as the Likert scale, it is common
to treat these outcomes like a continuous outcome to summarize a treatment effect.
That is, the mean value x̄ and sample standard deviation s are calculated for each arm
and a 95% confidence interval is formed using methods for a continuous outcome,
seen in the previous section. For an ordinal scale, one must first assign numeric
values; most commonly the positive integers are used. For example, for the 5-level
Likert scale, one can assign 1 = strongly disagree, 2 = disagree, 3 = neither agree
nor disagree, 4 = agree, and 5 = strongly agree. For data such as these, that start off
far from normally distributed, it may take larger sample sizes for the estimate of
interest, say x̄, to be well-approximated with a normal distribution. An alternative
approach, called nonparametric statistics, is discussed in a later section of this
chapter.

Estimation of the Difference Between Groups

A formal comparison of the difference between two arms of a clinical trial is often
desired. For quantities like the sample mean or proportion, which are approximately
normally distributed, this will be relatively straightforward. The difference is
approximately normally distributed, as any linear combination of two (jointly)
normal statistics will also be normally distributed. Thus, we can apply our general
confidence interval technique to form the CI for the difference. We can then examine
this confidence interval and see whether it contains the value 0, which would indicate
the data are consistent with no difference in the parameter of interest between
groups.
For example, suppose in the weight loss trial the mean weight change from
baseline is x̄ = 6 for arm A and ȳ = 2 for arm B. Suppose further arm A has a
sample standard deviation sx = 20 and m = 100 people and arm B has standard
deviation sy = 15 and n = 90 people. The estimated mean weight change difference
between arms is x̄ − ȳ = 4. If we assume the true SD in each arm may be different,
then the SE for this difference is calculated as the square root of the sum of the
variances for each mean: SE = √(20²/100 + 15²/90) = 2.550. An approximate
95% CI could be formed again using quantiles from a t distribution, but in this case,
the degrees of freedom (df) for the difference of means must be estimated. The
common approach is to use Satterthwaite’s formula (Rosner 2015), which yields

df = (sx²/m + sy²/n)² / [(sx²/m)²/(m − 1) + (sy²/n)²/(n − 1)].

The 95% confidence interval is then 4 ± t182.2371,0.025 × (2.550), or (−1.03, 9.03).
The confidence interval includes 0, which does not support the conclusion that there
is a difference in weight loss between the two arms. If one assumed that the two arms had a common true
variance, one could use a more efficient estimate of the common variance by pooling
data from the two arms and estimating a single SD. Since we generally do not know
whether arms have a common SD, the Satterthwaite method is preferable. A
common, yet faulty, approach is to use the same data to first conduct a hypothesis
test for equal variances between the two arms and then, based on results of that test,
decide which estimate of the SE to use. This procedure can yield slightly anti-
conservative confidence intervals.
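
The Satterthwaite calculation above can be reproduced directly from the summary
statistics (a sketch in Python; variable names are ours):

    import numpy as np
    from scipy import stats

    m, xbar, sx = 100, 6.0, 20.0      # arm A: size, mean, SD
    n, ybar, sy = 90, 2.0, 15.0       # arm B: size, mean, SD

    diff = xbar - ybar                               # 4.0
    se = np.sqrt(sx**2 / m + sy**2 / n)              # 2.550
    df = (sx**2 / m + sy**2 / n)**2 / (
        (sx**2 / m)**2 / (m - 1) + (sy**2 / n)**2 / (n - 1))  # about 182.2
    t_crit = stats.t.ppf(0.975, df=df)
    print(diff - t_crit * se, diff + t_crit * se)    # about (-1.03, 9.03)

With individual-level data, scipy.stats.ttest_ind(x, y, equal_var=False) carries out
the corresponding Welch t-test directly.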
In the case of paired differences for a continuous outcome, one can first form the
within-person difference d = x − y for each individual and then follow the usual
procedure for confidence intervals for a single continuous outcome.
For a clinical trial with binary outcomes, we can take a similar approach to
forming the confidence interval for the difference in proportions between two
independent groups. Denote the difference in sample proportions by p̂x − p̂y. Since
each p̂ is approximately normally distributed, so is the difference in proportions. The
SE for p̂x − p̂y can be estimated as

SE = √[p̂x(1 − p̂x)/m + p̂y(1 − p̂y)/n],

and the 95% CI becomes (p̂x − p̂y) ± 1.960 × SE. To determine whether the data are
consistent with no between-arm difference, one can again consider whether the CI
contains the value zero.
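
A sketch of the same calculation for a binary outcome (the event counts here are
hypothetical):

    import numpy as np

    x_events, m = 30, 100             # hypothetical events in arm A: p̂x = 0.30
    y_events, n = 18, 90              # hypothetical events in arm B: p̂y = 0.20

    px, py = x_events / m, y_events / n
    se = np.sqrt(px * (1 - px) / m + py * (1 - py) / n)
    print(px - py - 1.960 * se, px - py + 1.960 * se)   # about (-0.02, 0.22)

This hypothetical interval contains zero, so such data would be consistent with no
between-arm difference in proportions.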
When estimating the difference between intervention groups for other outcomes,
one simply needs to formulate a parameter in a statistical model which represents this
difference and estimate it. The 1 − α confidence interval could be formed with the
upper and lower α/2 quantile of the probability distribution appropriate for that
statistical model. A common approach parameterizes this difference in a regression
model. Estimates for both the parameter and its SE are straightforward. For censored
survival data, the hazard ratio is inherently a parameter for the difference between
arms, in this case a ratio. For a ratio, the value representing no difference between
arms is 1. Consequently, a confidence interval for the HR containing 1, or equiva-
lently a confidence interval of the log-hazard containing 0, is consistent with no
difference between arms.

Hypothesis Testing

In the previous section, the confidence interval was used to answer questions about
the parameter such as whether data were consistent with no difference between two
intervention arms. We can also directly answer questions about the value of a
parameter with a process called hypothesis testing. Confidence intervals and hypoth-
esis testing are intimately linked. In fact, as will be explained below, in many
situations, there is a 1–1 correspondence between the conclusion made from a
confidence interval for the value of a parameter (such as whether a value of zero is
consistent with the data) and the results of a hypothesis test. The hypothesis testing

framework provides a way to formalize the language and process for drawing
conclusions about parameter values from the data.
The hypothesis test can be described as consisting of five steps:

1) Formulate the null hypothesis.
2) Formulate the alternative hypothesis.
3) Set a level of significance.
4) Evaluate a test statistic for the hypothesis.
5) Estimate the p-value for the test statistic.

The null hypothesis is a statement about the value of the parameter that the data
will be collected to assess. For the parameter of interest μ, the null value is
represented by μ0. For example, if μ = μA − μB is the parameter for the difference in
the average weight change between arms A and B, one may set the null to be one of
no difference, i.e., H0: μ = μ0 = 0. For this one-dimensional parameter, the alternative
hypothesis (HA) can take on three possible forms: (i) μ < 0 (individuals on Arm A
have smaller weight change), (ii) μ > 0 (individuals on Arm A have bigger weight
change), and (iii) μ ≠ 0 (the average weight change for arms A and B is different).
The first two are examples of “one-sided” (also called “one-tailed”) alternative
hypotheses and the third is a “two-sided” alternative hypothesis. This distinction will
matter in terms of evaluating the strength of evidence against the null hypothesis. In
a randomized clinical trial, even though showing a difference in one direction is
more of interest (such as that the novel intervention A is superior to intervention B),
it is most typical to have a two-sided hypothesis. The issue of one-sided versus two-
sided has been the source of continued debate and some controversy, as will be
explained later in this chapter.
The basic idea of conducting a hypothesis test is to calculate a test statistic that
estimates how far the sample estimate of the parameter of interest is from its target
value under the null hypothesis (μ0). The assumed probability distribution for the
data is used to calculate the likelihood of a test statistic value at least as extreme as
the observed value, assuming the null hypothesis is true. This probability is called
the p-value. R.A. Fisher is credited with establishing the p-value’s dominance in the
scientific literature with his seminal 1925 book (Fisher 1925; Kyriacou 2016). For
approximately normal data, we can measure the departure from the null hypothesis
in terms of numbers of standard errors. That is, we form the statistic t = (μ̂ − μ0)/SE,
where μ̂ is the sample value for the parameter of interest and SE is the standard error
for μ̂. When μ̂ is a mean or difference of means, the test statistic t has an approximate
t distribution, with degrees of freedom that can again be approximated by the
Satterthwaite formula. In other settings, it is common to rely on the approximate
normality of t to calculate the p-value. Test statistics can also take on different
functional forms, for which their distribution must be derived in order to calculate
the p-value.
For the weight loss trial, one can set up the null and alternative hypotheses for the
between-arm difference in average change in weight (μ). The null and alternative
hypotheses are H0: μ = 0 and HA: μ ≠ 0, respectively. Recall μ̂ = x̄ − ȳ = 4 and
SE = 2.550. Relying on approximate normality of the sample means in the two
groups, we form the test statistic t = (4 − 0)/2.550 = 1.569. The reference distribution
is the Student t distribution, using the Satterthwaite formula again, with 182.2371
degrees of freedom. The percentile for 1.569 for the t182.2371 distribution is 0.9408.
The p-value is 2 × (1 − 0.9408) = 0.1184, which is the probability that t is at least as
extreme as 1.569 (t < −1.569 or t > 1.569). Note that here, p < 0.05 would have
occurred only if the magnitude of the test statistic were larger than t182.2371,0.025, or
equivalently, if the difference in sample means were more than t182.2371,0.025 standard
errors away from the null value of 0. Thus, two-sided hypothesis testing of the null
hypothesis is the same as checking whether the confidence interval includes the null
value μ0.
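
The test statistic and two-sided p-value for the weight loss example can be computed
as follows (a sketch continuing the same summary statistics):

    from scipy import stats

    t_stat = (4.0 - 0.0) / 2.550      # observed difference over its SE: 1.569
    df = 182.2371                     # Satterthwaite degrees of freedom
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=df))
    print(t_stat, p_value)            # 1.569, about 0.118

With the raw summary statistics, scipy.stats.ttest_ind_from_stats(6, 20, 100, 2, 15,
90, equal_var=False) gives the same test in one call.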
A significance level is chosen prior to conducting a hypothesis test, and if the p-
value is smaller than this level, the null hypothesis is rejected in favor of the
alternative. If the p-value is larger than the significance level, one fails to reject. This
does not mean that the null hypothesis is proven to be true. Though this language is
common, it is not correct to accept the null hypothesis as true. If the p-value is greater
than the significance level, as in the weight loss trial example, one can conclude only
that there was not enough evidence in the data to reject the null. The null hypothesis
might be true or the true difference might be too small to be reliably detected.
Alternatively, the null hypothesis may be true, but a rare event was observed, and the
null hypothesis was falsely rejected.
In a hypothesis test, two possible errors are: type I error (with probability denoted
by α), namely rejecting the null hypothesis when it is really true and type II error
(with probability β), namely failing to reject the null hypothesis when it is really
false. The significance level, known as the alpha level, is the maximum type I error
probability. The typical value set for alpha is 0.05. R.A. Fisher stated in the
theoretical development of experimental design “...If one in twenty does not seem
high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a
hundred. Personally, the writer prefers to set a low standard of significance at the 5
per cent point, and ignore entirely all results which fail to reach this level. A
scientific fact should be regarded as experimentally established only if a properly
designed experiment rarely fails to give this level of significance...” (Fisher 1925;
Hackshaw and Kirkwood 2011). Though popular, the significance level of 0.05 may
not be appropriate for every setting. For example, in a definitive phase III study, one
may choose to set alpha = 0.01. In early phase drug development, where the goal
is only to gather preliminary evidence and avoid type II error, alpha = 0.10 is one
common value.
An important concept in hypothesis testing is power, which is the probability of
rejecting the null hypothesis given the alternative is true. Power is the same as 1
minus the type II error rate, i.e., 1 − β, and in order to calculate power, one must
specify a specific alternative hypothesis. For example, in the weight loss trial,
suppose during the design of the trial investigators expected to enroll 190 subjects,
in a 1:1 ratio, onto two arms and expected the SD for weight loss to be 10 lb. in both
arms. Assuming weight change is normally distributed, there would be a 92.9%
chance of having a significant difference between the arms at the 0.05 level if the

true between-arm difference in average weight change was 5 lb. Having good power
helps interpret a null result. For an adequately powered study, one with a high chance
of rejecting the null in favor of the alternative of interest, a null result indicates the
data are not consistent with that alternative. If power was 92.9% in the weight loss
trial, and the null was not rejected, this is reasonably strong evidence that the
between-arm difference in weight change is smaller than 5 lb. In this example, the
observed sample standard deviations were 15 and 20 lb. on the two arms. If the true
underlying SD on each arm were 18 lb. and the true treatment difference 5 lb., power
with 95 per arm would be only 48%. Thus, with such low power, it is roughly equally likely
to reject or not reject the null. Therefore, failure to reject the null in this trial would
not provide reliable evidence that the alternative was false. In many settings, the
sample size is chosen so that power for an alternative of interest is at least 80% (20%
type II error rate). In definitive settings, such as a large phase III trial, 90% power is
often desirable.
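
Such power calculations are available in standard libraries. A sketch using the
statsmodels package (an assumption; any power calculator would do), where the
standardized effect size is the between-arm difference divided by the assumed
common SD:

    from statsmodels.stats.power import TTestIndPower

    # Design assumptions: 5-lb difference, common SD of 10 lb, 95 subjects per arm
    print(TTestIndPower().power(effect_size=5 / 10, nobs1=95,
                                alpha=0.05, ratio=1.0,
                                alternative='two-sided'))    # about 0.93

    # Under the less favorable assumption that the common SD is 18 lb
    print(TTestIndPower().power(effect_size=5 / 18, nobs1=95,
                                alpha=0.05))                 # about 0.48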

Special Topics in Hypothesis Testing

Exact Tests and Other Considerations for Choosing a Hypothesis Test

The validity of a hypothesis test relies on choosing the correct method or probability
distribution for the test statistic. This will depend on the distribution of the variables
being measured and the study design. For instance, if data are highly skewed and the
sample size is small, then using the t-test will likely result in an incorrect p-value.
Even if data are not severely skewed, small samples may mean that one cannot rely
on approximate normality of the test statistic to calculate the p-value. In this case, it
would be better to consider an exact method – one that does not rely on approximate
normality but rather uses the correct probability distribution of the test statistic.
One exact test is based on permuting the labels of treatment and control obser-
vations. Consider the strong null hypothesis that treatment has no effect on anyone.
The idea is to fix the data at their observed values, permute the treatment labels, and
compute the value of the test statistic assuming the permuted treatment labels were
the actual ones. After all, under the null hypothesis of no effect of treatment, the
same data would have been observed regardless of treatment received. Repeat this
process for all possible, or at least a large number of, permutations to generate a
reference distribution for the test statistic under the null hypothesis. The p-value is
the proportion of test statistic values in the reference distribution at least as extreme
as the observed test statistic value. For a one-sided test to determine whether
treatment produces larger outcome values than control, reference values “at least
as extreme” are those that are at least as large. For example, if the observed value of
the test statistic is 2.5, and only 1% of the reference distribution is 2.5 or larger, the p-
value is 0.01.
The permutation test can be used in many settings. When the outcome is
binary and the test statistic is the difference in sample proportions, the permuta-
tion reference distribution can be computed theoretically using probability, and

the permutation test is equivalent to Fisher’s exact test. If the sample size is large,
the permutation test is nearly identical to the z-test of proportions, which is
equivalent to the chi-squared test. When the outcome is continuous and the test
statistic is the difference in sample means, the permutation test is nearly identical
to the t-test if the sample size is large. The advantage of the permutation test is
that, without any further assumptions, it provides a valid test of the strong null
hypothesis that treatment has no effect on anyone.
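
A minimal Monte Carlo sketch of this procedure for the difference in means (the
outcome data below are hypothetical; with very small samples, all permutations can
be enumerated instead of sampled):

    import numpy as np

    rng = np.random.default_rng(0)
    treated = np.array([7.0, 3.5, 9.1, 5.2, 6.8])   # hypothetical outcomes
    control = np.array([2.1, 4.0, 1.5, 3.3, 2.9])

    observed = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control])
    n_t, n_perm = len(treated), 10000

    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        relabeled = rng.permutation(pooled)         # relabel under the strong null
        perm_stats[i] = relabeled[:n_t].mean() - relabeled[n_t:].mean()

    # One-sided p-value: proportion of relabelings at least as extreme as observed
    print(np.mean(perm_stats >= observed))

Recent versions of scipy also provide scipy.stats.permutation_test, which implements
this logic directly.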
A disadvantage of permutation tests is that, although they give nearly the same
answer as t-tests or z-tests of proportions when sample sizes are large, they can be
quite conservative for smaller sample sizes. For instance, with only 3 patients per
arm, the smallest possible two-sided p-value is 0.10 (i.e., a statistically significant
result at the conventional alpha level of 0.05 is impossible). Many would say that
such conservatism is appropriate if the sample sizes are that small.
A common error when determining the distribution of a test statistic is failure to
account for correlation between observations. For example, suppose one were
interested in comparing the efficacy of two weight loss interventions and married
couples were recruited and assigned to the same intervention. Since married indi-
viduals tend to share meals, their weight loss may be positively correlated. Failure to
account for the correlation leads to a higher than intended probability of falsely
declaring benefit of a diet. Interestingly, a permutation test that adheres to the
original randomization (i.e., both members of the couple receive the same treatment)
automatically accounts for such correlation and provides a valid p-value. Further
discussion of permutation tests is provided in the section on ▶ Chap. 94, “Random-
ization and Permutation Tests” in the Analysis chapter.

Nonparametric Versus Parametric Analysis


Two classes of statistical analysis are parametric and nonparametric methods. Para-
metric methods make more assumptions, e.g., that the data are normally distributed.
Nonparametric analyses make fewer assumptions. For example, the Wilcoxon rank
sum test comparing two arms is valid regardless of the shape of the distribution of data;
the only assumption is that the distribution is shifted in one arm relative to the other.
Other popular nonparametric methods include the Kruskal–Wallis test, instead of the
parametric one-way ANOVA, and rank regression instead of linear regression. The
reward for using parametric analysis is that, if the underlying assumptions are true,
power is better and conclusions may be stronger. On the other hand, if those assump-
tions are false, the parametric analysis may lead to incorrect conclusions.
Many people feel that clinical trials should provide valid results with as few
assumptions as possible. This would argue for nonparametric analysis of data.
Nonparametric analyses are often nearly as powerful as parametric analyses if
sample sizes are large. For example, the sample size required for a desired level of
power is only about 5% smaller for the t-test relative to the Wilcoxon rank sum test
when data actually are normally distributed. If data are not normally distributed, the
t-test may give invalid results. Even if data have a symmetric distribution, power for
a t-test can be substantially lower than that of a Wilcoxon test. The advantage of the
t-test seems outweighed by its disadvantages.
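
In software, the parametric and nonparametric analyses are often a one-line swap
(a sketch; x and y hold hypothetical outcomes for the two arms):

    from scipy import stats

    x = [6.0, 2.1, 11.3, 4.7, 8.2, 5.5]    # hypothetical weight changes, arm A
    y = [1.2, 3.4, 0.8, 2.5, 4.1, 1.9]     # hypothetical weight changes, arm B

    print(stats.ttest_ind(x, y, equal_var=False))              # Welch t-test
    print(stats.mannwhitneyu(x, y, alternative='two-sided'))   # Wilcoxon rank sum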

Transformations of data are common. A log transformation can greatly reduce


skew. Sometimes square roots or other monotone transformations (meaning that if
x < y, the transformed x is also less than the transformed y) are used. It is reasonable
to assume that some transformation of the data will result in approximate normality.
The ranks of the monotone transformed data are identical to those of the original
data, so a nonparametric test reaches the same conclusion for any monotone trans-
formation. Therefore, a nonparametric test may be viewed as first finding a trans-
formation that “normalizes” the data, then applying a test to compare means of
transformed data. This is equivalent to comparing medians of untransformed data.
A disadvantage of nonparametric methods is that they do not naturally facilitate
analyses that adjust for baseline imbalances in covariates. Parametric methods do
facilitate such an analysis. For instance, a linear regression model can incorporate
covariates, and it simplifies to a t-test when there are no covariates other than
treatment. A nonparametric analog, known as rank regression, is not as appealing
because ranks are inherently discrete. Regression parameters for rank regression,
which summarize covariate effects on the rank of the outcome, are generally more
difficult to interpret than those for parametric regression methods where the parameter
of interest relates to the outcome on a more natural scale (such as the mean value).

Multiple Comparisons

The problem that multiple comparisons create for hypothesis testing is best illus-
trated with the following analogy. In the popular game of darts, a circular target
board is placed at a certain distance from a player, which makes throwing a dart and
hitting a target difficult. The bullseye, a small ring in the center of the board, is worth
the most points. Compare two players. One hits the bullseye on the first attempt and
one takes 100 attempts to hit the bullseye. Though the first player could have been
lucky, it seems clear that the second player is not particularly good at hitting the
target. If the second player reported that he hit the target without specifying how
many attempts it had taken him, it would be difficult to conclude how good a player
he was. One might even incorrectly assume he had only thrown the dart once.
Similarly, suppose a large study examining whether a certain compound was effica-
cious at preventing cancer reported that treatment had a significantly lower incidence
of stomach cancer than control. It would be important for investigators of this study
to disclose how many different cancers were subjected to a hypothesis test comparing
that treatment with control. Looking at 10 cancers increases the probability that one
hypothesis test would be significant by chance alone, even if the risk for none of the
cancers was influenced by the treatment. The alpha level, below which a p-value is
declared significant, must be adjusted for multiple comparisons in order to preserve
the type I error rate. Many methods exist for adjusting the testing procedure to
accommodate multiple comparisons so that it maintains the desired type I error rate
(Hochberg and Tamhane 1987). Issues of multiple comparisons are considered
further in the section on ▶ Chap. 85, “Confident Statistical Inference with Multiple
Outcomes, Subgroups, and Other Issues of Multiplicity” in the Analysis chapter.
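
As a sketch of such an adjustment (the p-values below are hypothetical), the Holm
procedure available in statsmodels controls the familywise type I error rate across,
say, 10 cancer-site comparisons:

    from statsmodels.stats.multitest import multipletests

    # Hypothetical unadjusted p-values from 10 cancer-site comparisons
    pvals = [0.004, 0.03, 0.08, 0.12, 0.21, 0.34, 0.45, 0.58, 0.76, 0.91]

    reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method='holm')
    print(reject)        # only the comparison with p = 0.004 remains significant
    print(p_adjusted)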

Noninferiority Versus Superiority

Sometimes the goal of a clinical trial is to show not that the new treatment is superior
to, but rather that it is almost as good as, the standard treatment. Such a design is
called a noninferiority trial. Noninferiority trials are appealing if the standard
treatment is onerous or has serious side effects. Even if the new treatment is almost
as good as the standard, it may be preferred by patients. The ACTG 076 trial in the
United States and France had already demonstrated the preventive benefit of a longer
course of AZT, but the longer course was prohibitively expensive for developing
countries. A superiority trial randomizing HIV-infected mothers to a shorter course
of AZT or placebo drew criticism on ethical grounds (Lurie and Wolfe 1997). Some
critics argued that a trial demonstrating noninferiority of the short course to the
longer course was more ethical and could have indirectly shown that the short course
was superior to placebo.
In a noninferiority trial, a new treatment N is compared to a standard treatment S.
In a noninferiority setting, S has already been shown superior to placebo in a
previous trial by some amount M1. That is, if pS and p0 denote the proportions
with events, say a heart attack, in the standard and placebo arms in the previous trial,
p0 − pS = M1 > 0. Suppose one can show that N is not worse than S by more than a
prespecified noninferiority margin M, i.e., pN − pS ≤ M. Then pN − p0 = (pN − pS) +
(pS − p0) ≤ M − M1. As long as M is smaller than M1, one can conclude that N would
have beaten the placebo (that is, pN − p0 < 0), had the current trial used a placebo. The
noninferiority design begins by prespecifying a noninferiority margin M. A common
choice for the noninferiority margin is half of the known effect of S relative to
placebo, M = M1/2. That way, demonstration of noninferiority shows that the new
treatment preserves at least half of the benefit of the standard treatment seen in the
previous trial.
The null and alternative hypotheses are essentially reversed in a noninferiority
trial. The null hypothesis is that the new treatment is worse than the standard by more
than M: H0: pN − pS > M, and the alternative is that treatment is worse than the
standard by no more than M: H1: pN − pS ≤ M. Rejection of the null in favor of the
alternative hypothesis at the given alpha level demonstrates noninferiority. The
procedure is equivalent to constructing a 1 − 2α confidence interval for pN − pS and
declaring noninferiority if the upper limit of the interval is M or less. For example, if
the alpha level of the test of noninferiority is 0.05, the procedure is equivalent to
constructing a 90% confidence interval and declaring noninferiority if the upper limit
of the interval is M or less.
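
A sketch of this confidence interval procedure for a binary endpoint (the counts and
margin below are hypothetical): noninferiority is declared at one-sided alpha = 0.05
if the upper limit of the 90% CI for pN − pS is at most M.

    import numpy as np

    M = 0.05                         # hypothetical prespecified noninferiority margin
    n_events, n_total = 42, 400      # hypothetical events on new treatment N
    s_events, s_total = 40, 400      # hypothetical events on standard treatment S

    pn, ps = n_events / n_total, s_events / s_total
    se = np.sqrt(pn * (1 - pn) / n_total + ps * (1 - ps) / s_total)
    upper = (pn - ps) + 1.645 * se   # upper limit of the 90% CI
    print(upper, upper <= M)         # noninferiority declared if True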
One of the biggest drawbacks of noninferiority designs is that things that ought to
be bad in a clinical trial actually help demonstrate noninferiority. For instance,
suppose that the new drug is so ineffective that 100% of patients in arm N abandon
the new treatment and start taking the standard treatment. Then the observed
difference between N and S will be close to 0, making it easier to establish
noninferiority. That can’t be good! For this reason, even though intent-to-treat is
the primary analysis method for superiority trials, an as-treated analysis is often the
primary analysis in noninferiority trials. Another downside of noninferiority trials is

that they assume the effect of the standard treatment relative to placebo is
unchanged from the previous trial to the current one, the so-called constancy
assumption. That is why the noninferiority margin is often taken to be half of the
“known” effect of S relative to placebo.
Because noninferiority trials are inherently problematic, they should be avoided
whenever the question can be answered in another way. For example, when a
placebo is considered unethical, one could provide everyone the standard treatment
and test whether the new treatment has additional benefit. Another alternative to a
noninferiority design is a superiority design in patients who do not benefit from, or
cannot tolerate, the standard treatment. Noninferiority designs are also discussed
further in the “Equivalence and Noninferiority Designs” section in the Advanced
Topics in Trial Design chapter.

Controversies in Hypothesis Testing

Two-Sided Versus One-Sided Controversy

In 1988, many cardiologists believed that patients with a prior heart attack and
cardiac arrhythmias could reduce their risk of cardiac arrest and sudden death by
suppressing those arrhythmias. After all, studies showed clearly that heart attack
patients with arrhythmias were at increased risk of sudden death. Therefore, when
the Cardiac Arrhythmia Suppression Trial (CAST) tested the “suppression hypoth-
esis,” their original alternative hypothesis was that the antiarrhythmic arm would
have a lower risk of cardiac arrest/sudden death than the placebo arm. More
specifically, with λ denoting the log-hazard ratio for sudden death/cardiac arrest in
the antiarrhythmic arm relative to placebo, the null and alternative hypotheses were
H0: λ = 0 (no effect) and H1: λ < 0 (the antiarrhythmic arm is superior). As described
in Friedman et al. (1993), at its first review meeting, the Data and Safety Monitoring
Board (DSMB) recommended switching to the two-sided alternative hypothesis H1:
λ ≠ 0, which allows a decrease or increase in the risk of cardiac arrest/sudden death
in the antiarrhythmic arm. This was a prescient move; the trial stopped early because
the event rate was much higher in the antiarrhythmic arm (CAST Investigators
1989). CAST reminds us that interventions can cause harm. The prevailing view is
that one should always use a two-sided alternative hypothesis. Some medical
journals have gone so far as not allowing one-sided testing.
A counter-argument to two-sided testing is that there is no interest in proving
harm. If results in the treatment arm were going the wrong way, the trial would be
stopped before the evidence was sufficient to show actual harm. But this was not true
in CAST. The widespread misconception about the benefits of suppression of cardiac
arrhythmias needed to be dispelled before medical practice could change.
A better argument against two-sided tests is that the two errors, (1) falsely
declaring treatment beneficial and (2) falsely declaring treatment harmful, are very
different with vastly different consequences. Declaring a drug harmful when it
actually has no effect may not have serious consequences because that drug should

not be used anyway. On the other hand, declaring a drug beneficial when it is
ineffective is problematic because patients may eschew truly effective treatments
for the ineffective treatment. Therefore, it is important to consider each of the one-
sided error rates.

The P-Value Controversy

P-values have been viewed in the medical literature as the definitive measure of
evidence for many years. A counter-movement is underway to eliminate them. Both
viewpoints can be viewed as overreactions.
One must first understand what a p-value is and what role it plays. Imagine 10
people exposed to a level of radiation that is known to be 95% fatal. They are given
a new treatment, and half survive. How compelling is the evidence that the new
treatment saves lives? Are observed results consistent with chance? It can be
shown that the probability of 5 or more people out of 10 surviving a condition
that is 95% fatal is only 0.00006. The two possible conclusions are (1) the new
treatment saves lives or (2) the new treatment does not save lives, but an incredibly
rare event occurred. Chance is not a plausible explanation for the observed results.
A small p-value does not necessarily imply that the treatment effect was large. For
instance, suppose that 1,000 people had been exposed to a radiation level that is
known to be 95% fatal, and 80 people survived. The p-value in that case would be
0.00003, yet 92% still died. The observed treatment effect was small, but it was
large enough to effectively rule out chance. The sole purpose of a p-value is to see
whether results are consistent with chance; it is imperative to supplement p-values
with estimates and confidence intervals for the size of the treatment effect to
appreciate whether the effect is both statistically significant and clinically
important.
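To make the arithmetic concrete, here is a minimal sketch (ours, not part of the original chapter), written in Python and assuming the scipy library, that reproduces both binomial tail probabilities:

```python
from scipy.stats import binom

# P(5 or more of 10 survive) when each survives with probability 0.05:
print(binom.sf(4, 10, 0.05))     # ~0.00006

# P(80 or more of 1,000 survive): the p-value is tiny, yet 92% still died.
print(binom.sf(79, 1000, 0.05))  # ~0.00003
```

Note that `binom.sf(k, n, p)` returns P(X > k), so the arguments are 4 and 79 rather than 5 and 80.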
A criticism commonly levied against the p-value is that it is not reproducible. If
we repeat the same trial, the p-value may be completely different. This is especially
true if the true treatment effect is small. For example, if treatment has no true effect,
the p-value for many tests is uniformly distributed between 0 and 1, meaning that it is
equally likely to be large, medium, or small. Therefore, if treatment has no true
effect, we might see a relatively small p-value in the first trial and a much larger one
in the next trial. The p-value is less variable if treatment is truly effective. Repro-
ducibility worries are somewhat ameliorated by the common practice of lumping p-
values above 0.05, declaring them “not statistically significant.” One must bear in
mind that the purpose of a p-value is to determine whether chance provides a
plausible explanation for observed results.
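A small simulation sketch (hypothetical two-arm trials with normally distributed outcomes; Python assumed) illustrates this variability: under the null, repeated trials give p-values scattered across (0, 1), while under a real effect they cluster near zero.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def pvalues(effect, n_trials=5, n=100):
    # Repeat the same two-arm trial and collect the p-value from each repetition.
    return [
        ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
        for _ in range(n_trials)
    ]

print(pvalues(effect=0.0))  # no true effect: p-values roughly uniform on (0, 1)
print(pvalues(effect=0.5))  # true effect: p-values mostly small
```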
Another criticism of the p-value is that it depends not only on the observed results
but also on what action we would have taken if other results had been observed. In
other words, to compute a p-value, one must define what results are at least as
extreme as the observed results. Critics question the logic of computing the proba-
bility of the actual result or a more extreme result, when no more extreme result
occurred.

The p-value has limitations. Nonetheless, the p-value is useful for its intended
purpose. Its primary competitor is Bayesian methodology, which has its own criti-
cisms. Although not covered in this chapter, Bayesian methodology has received
considerable attention in clinical trials (Berry et al. 2010).

Summary and Conclusion

Two principal aims of statistics are to use data to (1) provide an estimate of a population parameter and (2) test whether two populations differ
with respect to this parameter. Sample estimates have uncertainty that can be
expressed with their associated confidence interval. Hypothesis testing is used to
make statements about the value of a parameter, such as whether treatment A is
superior to treatment B. To conduct a reliable hypothesis test, one must specify in
advance the null and alternative hypotheses for the parameter of interest and choose a
study design that has good power to reject the null in favor of the specified
alternative. The p-value summarizes the evidence against the null hypothesis. The
validity of the hypothesis test relies on calculating the p-value with a correctly
specified probability distribution. The study design and distribution of the study
outcome will determine which distribution is appropriate. In some cases, an exact or
nonparametric test may be desired to avoid unnecessary assumptions. When
interpreting results, one must remember that no reasonable hypothesis test has zero
type I or type II error. In many settings, other evidence, such as results from other
clinical trials or mechanistic laboratory studies, is useful to evaluate the totality of
evidence for the question under study.

Key Facts

• The confidence interval contains values of the parameter that are consistent with
study data. For a study repeated many times, the 95% confidence interval is
expected to contain the true value 95% of the time.
• A hypothesis test requires specifying a null and alternative hypothesis for the
parameter.
• The p-value is the proportion of test statistic values in the reference distribution at
least as extreme as the observed test statistic value. The chosen alternative
hypothesis determines whether this is a one-sided or two-sided p-value.
• Type I error rate, denoted by α, is the probability of rejecting the null hypothesis
when it is true. Type II error rate, denoted by β, is the probability of failing to
reject the null hypothesis when it is false.
• Power helps us interpret a null result; if power was high and the null was not
rejected, we can be reasonably confident that the effect was not as strong as
originally hypothesized. If the trial was not well-powered, a null result is difficult
to interpret.

Cross-References

▶ Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other Issues of Multiplicity
▶ Randomization and Permutation Tests

References
Berry SM, Carlin BP, Lee JJ, Muller P (2010) Bayesian adaptive methods for clinical trials. CRC
Press, Boca Raton
Cardiac Arrhythmia Suppression Trial (CAST) Investigators (1989) Preliminary report: effect of
encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after
myocardial infarction. NEJM 321(6):406–412
Casella G, Berger RL (2002) Statistical inference. Duxbury, Pacific Grove
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34(2):187–220
Fisher RA (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh
Friedman LM, Bristow JD, Hallstrom A et al (1993) Data monitoring in the cardiac arrhythmia
suppression trial. Online J Curr Clin Trials, Doc. No. 79 [5870 words; 53 paragraphs]
Hackshaw A, Kirkwood A (2011) Interpreting and reporting clinical trials with results of borderline
significance. BMJ 343:d3340
Hochberg Y, Tamhane AC (1987) Multiple comparison procedures. Wiley, Hoboken
Hosmer DW Jr, Lemeshow S, May S (2011) Applied survival analysis: regression modeling of
time-to-event data. Wiley, Hoboken
Kyriacou DN (2016) The enduring evolution of the p value. JAMA 315(11):1113–1115
Lin DY, Dai L, Cheng G et al (2016) On confidence intervals for the hazard ratio in randomized
clinical trials. Biometrics 72(4):1098–1102
Lurie P, Wolfe SM (1997) Unethical trials of interventions to reduce perinatal transmission of the
human immunodeficiency virus in developing countries. NEJM 337(12):853–856
Rosner B (2015) Fundamentals of biostatistics. Brooks/Cole, Boston
Wendl MC (2016) Pseudonymous fame. Science 351(6280):1406
84 Estimands and Sensitivity Analyses

Estelle Russek-Cohen and David Petullo

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1632
Randomization and Randomized Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1634
Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1635
Estimand Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1635
Intent to Treat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1636
Types of Trials and Measurements in Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1636
Strategies for Addressing Intercurrent Events when Formulating Estimands . . . . . . . . . . . . . . . . . 1637
Treatment Policy Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1638
Hypothetical Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1639
Principal Stratification Strategy: Estimands and Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . 1639
Some Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1641
Other Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1643
Importance of Selecting an Estimand at the Planning Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1643
Role of Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1644
Types of Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1644
Estimands and Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1645
Estimands in Studies with Time-to-Event Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1645
Estimands in Complex Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1646

Estimands and Meta-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1646
Network Meta-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1646
Estimands in Non-inferiority (NI) Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1647
Going from Estimand to Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1647
Benefit Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1648
Sensitivity Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1648
An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1649
Challenges for Clinical Trials to Evaluate Pain Medications . . . . . . . . . . . . . . . . . . . . . . . 1650
Estimands, Estimation, and Sensitivity Analysis Illustrated Using an FDA Example . . . 1651
An Estimand, Estimate, and a Tipping Point Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1653
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1654
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1655
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1655
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1655

This chapter reflects the views of the authors and should not be construed to represent FDA's views or policies.

E. Russek-Cohen (*)
Office of Biostatistics, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
e-mail: [email protected]

D. Petullo
Division of Biometrics II, Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
e-mail: [email protected]

Abstract
An estimand is a quantity used to define a treatment effect in a clinical trial. In many cases, clinical trial planners have skipped the step of defining the estimand in their rush to pick a test statistic and calculate planned sample size(s), sometimes leading to ambiguity about how the results of a trial were to be interpreted. In this chapter we describe
estimands in detail and explain the importance of defining estimands when planning
randomized trials and doing this before picking a test statistic to use in evaluating trial
outcomes. The estimand is key to defining the scientific question the trial needs to
address. When patients drop out of a randomized clinical trial or fail to follow the planned regimen, and stakeholders disagree on how this should affect the analysis, the interpretability of the trial can be called into question. A clear definition of
treatment effect ought to capture how dropouts and protocol violators will be handled.
In this chapter sensitivity analyses are tied to the definition of the estimand in a
trial. In practice, sensitivity analyses are often ad hoc and only addressed after a
study is completed. Considering both estimands and sensitivity analyses in
planning will improve the interpretation of results from completed randomized
trials. While regulators (e.g., the US Food and Drug Administration) have been
particularly interested in advancing these ideas, utilization of these ideas ought to
improve the interpretability of randomized trials more generally.

Keywords
Intent to treat · Intercurrent events · Protocol violations · Tipping point analyses ·
Treatment effect

Introduction

In other parts of this text, considerable attention is given to the planning of clinical trials, estimation of key summary measures, and finally the reporting of clinical trial results. The topic of an estimand may never enter into those discussions; indeed, in clinical trial textbooks written over a decade ago, the topic was unlikely to be covered at all. Yet estimands are defined as quantities used to capture treatment
effects within a clinical trial, and they are not always carefully considered during the
planning stage. Sensitivity analyses may be something you have seen before but, as
with estimands, are not often covered in a systematic way in textbooks. Discussions
on estimands force a clinical trial planning team to define the scientific question to be
answered in the trial. Sensitivity analyses are often used at the end of a trial to
confirm the results and the assumptions of any statistical methods used, but ought
to follow from first defining the estimand of interest. A systematic approach to
sensitivity analyses set up prior to starting the study is preferable to generating a
laundry list of data analyses after the study is over. This chapter stresses the
importance of selecting an appropriate estimand(s) and sensitivity analyses at the
planning stage to allow for a cleaner interpretation of results once the study is
completed.
If all studies went exactly “as planned” and everyone completed the trial without
exception to protocol guidelines, defining an estimand could be a trivial task and
possibly left till the end. However, as noted in a survey by Fletcher et al. (2017), the
overwhelming majority of clinical trials have missing data or protocol deviations.
Waiting till the study is over to decide how these issues will be addressed when determining treatment effects is not good science and is, frankly, naive.
Furthermore, to the extent that clinical trials mimic real-world use of a product,
dropouts and failure to take doses as prescribed are a common occurrence and one
should not be surprised by this at the end of the study. Regulators such as the US
Food and Drug Administration (FDA) and companies wanting to market a medical
product often negotiate success criteria for a clinical trial. The choice of an estimand
and showing how an estimate of treatment effect follows from it will be important.
However, if both groups are not on the same page, it would be painful to discover
this after a rather expensive clinical trial has been completed. So the desire
to prespecify estimands is of importance to regulators. It should be noted that in some cases there could be more than one acceptable estimand, so regulators and
companies need to communicate early in the development process.
The FDA commissioned a report by the National Academy of Sciences (NAS)
and its research arm, the National Research Council (NRC 2010), dealing with the
prevention and treatment of missing data in clinical trials. One motivation for the
NAS report was the arbitrary use of “last observation carried forward” (LOCF) as a
way of filling in missing values when subjects drop out of a clinical trial submitted to
the FDA. LOCF was easy to calculate, and there may be settings in which it makes sense, but the option was often used without justification. One general recommendation that came out of the NAS document was that one should design trials
that minimize the amount of missing data and estimands ought to be defined when
planning the trial. However, in spite of the NAS report, regulators realized that
current practice had not moved forward (LaVange and Permutt 2016). In 2014
multiple regulatory agencies and their industry counterparts under the umbrella of
the International Council for Harmonisation (ICH) agreed to develop an addendum
to an important international guideline on statistical principles in clinical trials (ICH
1998, 2014). The focus of the addendum is estimands and sensitivity analyses.

While regulators and their industry counterparts are now considering estimands
earlier in their deliberations, it would be unfortunate to think these discussions are
solely related to medical product approval. There are many clinical trials sponsored
by others (e.g., the National Institutes of Health) that have public health impacts, and thinking at the planning stage about how results will be interpreted should improve
the science. Practices such as increasing the sample size to account for an expected
dropout rate without thinking about why dropouts occur are a missed opportunity to
plan a better study.

Randomization and Randomized Clinical Trials

The majority of clinical trials reported in a drug or medical device label are
randomized, and such trials are considered the gold standard in establishing treat-
ment differences. In this chapter, for simplicity, we focus on clinical trials with two
treatment groups, most commonly a treatment group and a control group. However,
the principles here ought to have relevance to other kinds of trials, e.g., traditional
trials with more than two arms, pragmatic clinical trials (Ford and Norrie 2016) that
may harness electronic health records, and/or relax eligibility requirements to assess
something closer to real-world effectiveness of an intervention and to trials that rely
more heavily on data from other sources (e.g., using external control data).
Randomization in most trials should result in comparability of subjects in the two
treatment groups with respect to baseline characteristics, but comparability can be
lost depending on events that occur post-randomization. What is often missing from
the characterization of a treatment effect is how post-randomization events were
handled. The new ICH E9 R1 document defines events such as leaving a study early,
use of rescue medications, and so on as “intercurrent” events. When these events are
not balanced across treatment arms or the reasons for why these occur are not the
same, the interpretability of the study may be problematic. Therefore, when choos-
ing an estimand, one should consider all relevant intercurrent events. In many
therapeutic areas, these can impact a substantial portion of the study subjects. The
NAS report (NRC 2010) encourages FDA to explore which post-randomization
events (i.e., intercurrent events) are common and in what settings so future clinical
trials can be better planned. This kind of activity is still going on at FDA.
Randomized clinical trials have appeal because one can attribute causation,
namely, if the randomization was done appropriately and the study went according
to plan, observed significant treatment differences can be attributed to the difference
in treatments under investigation. However, in long-term studies or any trial where
there are more than a few dropouts and/or protocol violators, treatment effects are
harder to interpret. The issue is worse if the number of dropouts or the reasons for
dropouts and protocol violations differs among treatment arms. For example, a
dropout on a placebo arm could be due to ineffectiveness of the intervention, but
dropouts on the arm with a new drug could be due to serious side effects.

Causal Inference

The term estimand appears in the literature associated with causal inference (Little and Rubin 2000), where the role of confounding was recognized and where treatment effects estimated from anything other than a randomized clinical trial had to be interpreted with caution. At the heart of many causal inference discussions, the reader is asked to imagine how the same subject would respond if assigned to one treatment and then if assigned to the other, and to think of the treatment effect for that subject as the difference in the two values, Y(trt) − Y(control). The
value for that subject in the unobserved arm is the unobserved potential outcome. In
most instances one only gets to observe the outcome on one treatment, but under
randomization, one could imagine the two groups being comparable at baseline and
treatment effect could logically be interpreted to be the mean of the subject level
treatment effects.
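The following is a minimal simulation sketch of the potential-outcomes idea (illustrative numbers only, not from any trial): every subject carries an outcome under each treatment, randomization reveals one of the two, and the difference in observed arm means recovers the mean of the subject-level effects.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Each subject has two potential outcomes, only one of which is ever observed.
y_control = rng.normal(50, 10, n)           # outcome if assigned to control
y_trt = y_control + rng.normal(3, 2, n)     # outcome if assigned to treatment

z = rng.integers(0, 2, n)                   # randomized assignment (0/1)
y_obs = np.where(z == 1, y_trt, y_control)  # the single outcome we get to see

true_ace = (y_trt - y_control).mean()       # mean of subject-level effects
estimate = y_obs[z == 1].mean() - y_obs[z == 0].mean()
print(true_ace, estimate)                   # both are close to 3
```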

Estimand Framework

An estimand should be thoroughly vetted by the interdisciplinary team and should be stated in words so that both clinicians and statisticians can comprehend what is needed; in statistical terms, the estimand is the parameter of interest. After deciding on a clinically relevant estimand, one can decide how to estimate it; the method of estimation is referred to as an estimator and can be defined in words or in formulas, as appropriate. An estimate is then derived from the clinical trial data, and, if possible, an unbiased estimate of the treatment effect is desired.
There can be more than one estimand in a trial. For example, trials may have
multiple key endpoints, or multiple stakeholders may view the results of a trial
differently. However, the process of picking the estimand, the estimator, and the estimate would need to be repeated for each estimand. Some estimands can be regarded as primary, while others may be considered supportive (e.g., involving a different
handling of dropouts or protocol violators).
In Table 1, we see the components of an estimand defined. Previous clinical trial
textbooks focused on the need to define the population and the variable of interest
along with a summary statistic without specifically indicating how intercurrent
events are reflected in the estimand and estimator. This is new. In this definition, a
different consideration of intercurrent events would result in a different estimand
even for the same trial.
When different parties have different objectives, there may be different
estimands. For example, regulators primarily want to establish efficacy and safety in a specific context, namely, whether the drug or device works. Insurers may interpret the
results of a study differently since cost of a therapeutic intervention or the cost of
follow-up in the event of treatment failure may not be considered by regulators like
the FDA.

Table 1 Elements of an estimand (ICH 2017)

• Population: the population for which we want to address the scientific question;
• Variable: the measurement(s) taken in a time period or at a certain time point (e.g., blood pressure 24 weeks after randomization) or functions thereof (e.g., change from baseline to 24 weeks in blood pressure); the variable could be a composite measure that incorporates several individual components;
• Intervention effect: a description of how intercurrent events are reflected in the scientific question;
• Summary measure for the variable on which the treatment effect will be based (e.g., the variable
mean)
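As a concrete, purely hypothetical illustration, the four elements of Table 1 for a diabetes trial might be recorded as a simple structured object (every entry below is invented for illustration, not drawn from any protocol):

```python
# A hypothetical estimand specification organized by the four elements of Table 1.
estimand = {
    "population": "adults with type 2 diabetes inadequately controlled on metformin",
    "variable": "change in HgA1C from baseline to 24 weeks",
    "intercurrent_events": {
        "use of rescue medication": "composite strategy: counted as treatment failure",
        "treatment discontinuation": "treatment policy: subjects followed and measured anyway",
    },
    "summary_measure": "difference in mean change between treatment arms",
}
```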

It is common to see a statistical analysis plan (SAP) state that because the trial sponsor anticipated a 20% dropout rate, they would increase the sample size by some corresponding value (e.g., using 125% of the calculated sample size, since 1/(1 − 0.20) = 1.25), ignoring why missing data occur or that they may be imbalanced among the treatment arms. This is incorrect even if one does need an increase in sample size, since a failure to account for intercurrent events may still result in an uninterpretable trial. Others
involved in planning trials may ignore missing data issues altogether. Sponsors of
clinical trials would regularly use terms like intent to treat (ITT) that were interpreted
differently by other stakeholders. Some interpreted ITT as only recording data while
on treatment, yet others would follow subjects till the end irrespective of whether
they complied with the assigned treatment regime. In reality, many analysis plans
did not consider the basis of missing data and instead picked a method that used
simpler and possibly unrealistic assumptions such as missing completely at random
(see section “Types of Missingness”; Little and Rubin 2014).

Intent to Treat

The term “intent to treat” or ITT became much more commonly used in the clinical
trial literature after the publication of the ICH E9 document Statistical Principles in
Clinical Trials in 1998. This ICH guideline is used globally to assist companies in
designing clinical trials to generate evidence in support of drug approval. However,
people used the term ITT inconsistently after the 1998 document was published
and confused two concepts, namely, the need to account for all subjects enrolled in a
trial and the need to randomize subjects to avoid confounding due to imbalances
in baseline covariates. The new ICH document (ICH 2017) distinguishes these
concepts to add clarity to what treatment effect is being measured. See chapter on
“Intent to Treat.”

Types of Trials and Measurements in Trials

While there are many kinds of clinical trials, the topic of estimands has gained more
attention in the context of longitudinal studies where patients are randomized to one
of two (or more) treatments at baseline and patients are repeatedly assessed using the
same measurement at fixed time points prespecified in the protocol (e.g., every
month for 6 months). In O’Neill and Temple (2012), these have been called
symptom trials though the outcomes could be laboratory measurements. For exam-
ple, glycosylated hemoglobin or hemoglobin A1C (HgA1C) is measured at sched-
uled visits after subjects are randomized to treatment arms in trials that evaluate
drugs to treat diabetes. Subjects may drop out at various times, but most are likely to
stay until the end. These types of longitudinal studies are common in the assessment
of treatments for diabetes, depression, pain, allergies, and other possibly chronic
disorders. For such studies, the estimand is often defined in terms of a treatment
effect at the end of the observed time period (e.g., the difference in average HgA1C
after 6 months on assigned treatment). Information collected at earlier times may
improve the precision of the estimate of treatment effect, particularly when a subject
discontinues before 6 months on treatment. For other settings, perhaps the interest
may be the average treatment difference over the observed time period (e.g.,
evaluating a treatment for symptom relief for seasonal allergies).
One alternative class of trials would be outcome trials (O’Neill and Temple 2012),
and these focus on a single event for each subject but may fall into two categories
based on the endpoint utilized. For example, in infectious disease trials, the primary
focus may be on whether the treatment cures a subject of a disease and outcomes
correspond to subject status at a given time point (disease present or not at 6 months
after start of therapy). Dropouts are often regarded as treatment failures, and
dropouts are an important consideration when evaluating a trial. For several thera-
peutic areas, including oncology, time to a prespecified major clinical event is the
most common form of endpoint used, but, for example, it could be time until disease
progression (with an agreed to basis of how this is defined) or time until death due to
any cause (overall survival) (FDA-NIH 2018). In some instances (e.g., in drug trials
in cardiology), time to event is a composite outcome (e.g., time until stroke, heart
attack, or death, whichever comes first). In these settings, when subjects have not
had the event of interest, an observation is considered censored. How protocol
violators are handled or whether some of these events can be treated as censored
should be considered when planning a trial much as intercurrent events are addressed
in a longitudinal symptom trial.

Strategies for Addressing Intercurrent Events when Formulating Estimands

The ICH E9 R1 addendum (2017) defines a set of five strategies for selecting
estimands. These are referred to as treatment policy, composite, hypothetical, prin-
cipal stratum, and while on treatment. Statisticians, clinicians, and others with an
understanding of the disease including epidemiologists may need to weigh in on the
choice of an estimand. This would include selection of meaningful endpoints,
identifying clinically relevant intercurrent events likely to occur and then defining
the estimands in the presence of these intercurrent events. These strategies are
discussed below.

The strategies presented are not exhaustive nor are they mutually exclusive. For
example, Mallinckrodt et al. (2012) define de jure and de facto estimands, with de jure estimands focusing on what might have been had subjects completed the planned course of treatment and de facto estimands focusing on what is actually observed. However, these do not directly correspond to the five categories we provide below. The paper by Phillips et al. (2017) notes that even though Mallinckrodt et al. equate de jure estimands with efficacy and de facto estimands with effectiveness, given the restricted nature of who enrolls in trials relative to who may use a particular intervention once it is in practice, effectiveness may not be characterized in the most common clinical trials. Others have provided approaches
that may not be defined exactly as we have below (Permutt 2016).

Treatment Policy Strategy

Under this strategy, the occurrence of intercurrent events is treated as irrelevant, and so in the context of Table 1,
one would not need to state how each intercurrent event would be handled. The value
for the variable of interest will be the endpoint of interest (e.g., HgA1C at 6 months)
regardless of whether an intercurrent event occurs. So all subjects are accounted for
whether or not there is an intercurrent event. A key consideration for choosing
this estimand is subjects need to be followed even if they start using rescue
medication or fail to follow the treatment regime as described in the protocol. In
studies where subjects are exposed to treatments that are relatively short in duration
and assessments come with few missing data values, this may be the most sensible.
In areas where use of rescue medication is quite common, an estimand that is the
result of the treatment policy strategy may be hard to interpret if your goal
is assessing the impact of a new drug. This is because you are comparing treatment + rescue medication versus control + rescue medication, and the effect will be influenced not only by the treatment under investigation but also by rescue medications taken by study participants.
A treatment policy strategy could be acceptable if the only intercurrent events
were subjects crossing over to another treatment in the same trial. If rescue medica-
tions are designed to keep side effects down or are considered as appropriate with the
treatments under study, perhaps these are not intercurrent events that require special
attention in the definition of an estimand. But when subjects use various rescue
medications decided on by the subjects or their providers rather than the trial
sponsor, the estimand may no longer reflect the impact of the investigational drug.
This is particularly true if the “rescue medication” is another treatment for the same
indication.
When trial sponsors agree to a treatment policy strategy, all efforts should be
made to keep subjects in the study (NRC 2010). This could mean providing
incentives for subjects to stay in the trial and not miss visits. For those that are
genuinely lost to follow-up, one may need to consider a hybrid of a treatment policy
strategy and one of the strategies below.

In trials where overall survival is the endpoint of interest, a treatment policy
estimand would have some advantages. There is little subjectivity involved in
whether a person is alive or not though sponsors of a trial would need to determine
the status of subjects that may be lost to follow-up. Sometimes to reduce missing
data information external to the trial is used to determine the date of death for these
study participants.

Hypothetical Strategy

A scenario is envisaged in which the intercurrent event would not occur. One would
choose a value to reflect the scientific question of interest assuming a particular hypothetical scenario, e.g., what a subject's pain score would have been at the end
of the study had they completed 12 weeks of treatment. Assuming a subject is
comparable to a placebo subject once treatment is discontinued (Mehrotra et al.
2017) or was never on a treatment (as might be implied when using baseline
observation carried forward (BOCF)) (Phillips et al. 2017) may be sensible.
However, each of these approaches to dealing with an intercurrent event is distinct,
and treatment estimates that follow from each of these estimands may result in
different estimates of treatment effect.

Principal Stratification Strategy: Estimands and Causal Inference

Principal stratification is a means of adjusting for variables that are observed after
randomization. An overview of these methods is provided by Frangakis and Rubin
(2002), but several basic tutorials are available (e.g., Baker et al. 2016; Stuart et al.
2008; Dunn et al. 2005). Methods vary depending on the context of the study, the
kind of post-randomization event, and the properties of the variable that describes
the post-randomization event (e.g., dichotomous or continuous) and the primary
outcome variable for the trial. However, the theory behind principal stratification
relies heavily on the concept of causal estimands which we described briefly earlier.
Namely, one needs to imagine there is a potential outcome for every subject on each
treatment, but one only gets to observe one of them. Additional assumptions specific
to principal stratification imply the potential outcomes for each person do not depend
on the treatment status of others in the study. If subjects were not blinded to their
treatment assignment, their behavior could be influenced by the outcomes of others.
This section briefly describes some simpler examples.

Compliance as a Post-randomization/Intercurrent Event


In the estimand definition presented, failure to comply with the assigned treatment
would be an intercurrent event. Baker et al. (2016) and Stuart et al. (2008) focus on
randomized trials with two treatments (an active treatment and a control), and
compliance is modeled as an all or none event, namely, subjects will either follow
one treatment or the other. Dunn et al. (2005) start off with this same setting but then add discussions on missing data. The context for these is where patients can either
agree to be part of the treatment group they are assigned to via randomization
(“compliers”) or some may show some distinct preferences, namely, some subjects
would always stay with an active treatment (“always takers”) even if assigned to the
control and those that would always stay with a control treatment (the latter are
called “never takers” in the literature implying they would never take the active
treatment). Their models also imply that there are no subjects that are “defiers.”
Defiers would always go to the other treatment irrespective of which treatment they
are assigned to. The model assumes the distribution of always takers, never takers,
and compliers is the same in the two treatment arms, a logical result of randomiza-
tion. An additional assumption is that “always takers” and “never takers” will not
contribute to an overall treatment effect. Namely, the subject-specific causal
estimand is zero for an “always taker” and zero for a “never taker.” The estimate
of treatment effect in compliers is then an estimate of the treatment policy estimand
(namely, an estimate of treatment effect in everyone) divided by the estimate of the
fraction that would comply with their randomization assignment. Generally, the
treatment effect in compliers is expected to be larger than one derived using a
treatment policy estimand. This very basic model does not account for dropouts,
loss to follow-up, and so on. However, it does not require that we label a subject as a
complier prior to randomization.
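A minimal sketch of the estimator just described (hypothetical data; all-or-none compliance assumed), often called the complier average causal effect (CACE) or instrumental-variable estimate:

```python
import numpy as np

def cace(y, z, d):
    """Treatment effect in compliers under all-or-none compliance.

    y: outcome; z: randomized assignment (0/1); d: treatment received (0/1).
    """
    itt = y[z == 1].mean() - y[z == 0].mean()        # treatment policy estimate
    p_comply = d[z == 1].mean() - d[z == 0].mean()   # estimated fraction of compliers
    return itt / p_comply

# Toy data: one subject assigned to treatment never takes it.
z = np.array([1, 1, 1, 0, 0, 0])
d = np.array([1, 1, 0, 0, 0, 0])
y = np.array([12.0, 10.0, 6.0, 7.0, 6.0, 5.0])
print(cace(y, z, d))  # ITT of ~3.33 scaled up by 1/(2/3), giving ~5.0
```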
A hypothetical example in Dunn et al. (2005) illustrates the concepts. Patients are
randomized to be cared for as a day patient or an inpatient, whereas the treatment
received is either to be cared for as a day patient or an inpatient. Patients could in
theory elect to choose a treatment option other than that assigned via randomization,
and this could be the result of previous experience, costs, severity of the condition
being treated, ability to get to and from day care, etc. This in turn could impact their
outcome (e.g., going back to work within a week or not, a dichotomous outcome).
Compliance is all or none (they do comply with one of the two treatments and they
do not switch between day care and inpatient once they start a particular treatment).
This may not always be realistic so one should think through the assumptions before
selecting a model. Dunn et al. (2005) only consider compliance (and not the factors
that may drive compliance) and a binary response variable. The latent classes for this problem are as in the paragraph above but translated to this particular setting.
Compliers will stay with the randomized treatment. “Always takers” will be those
that are day patients irrespective of how they were assigned, and “never takers” are
those that are inpatient no matter how they were assigned via randomization. As with
Stuart et al. (2008) and Baker et al. (2016), closed form equations for an estimate of
the causal effect are presented with the same assumption that only compliers have
a subject-specific treatment effect that is non-zero.
Stuart et al. (2008) point to a two-stage regression model that can be used to
generalize beyond this simple example including incorporation of covariates that
can predict participation with assigned treatments and covariates that impact the
outcome or variable of interest. These have been implemented in statistics packages
but are beyond the scope of this chapter.

Noncompliance and Attrition as Post-randomization (Intercurrent) Events
In reality, as noted earlier in this chapter, many studies have dropouts. Dunn et al.
(2005) provide two approaches to handling missing data. One builds on the simple
day care example with six categories of subjects. The six categories are determined by crossing the latent compliance variable (always takers, compliers, and never takers) with treatment assigned via randomization. They assume that the mean outcome in each
of six categories is the same as that determined by those that are observed. This may
be an improvement over ignoring missing data but may not take advantage of what
else is known about each subject in the study. This is called latent ignorability since
one does not know the true designation of compliance status for each subject. Dunn
et al. (2005) present closed form equations in this setting for a causal estimate of
treatment effect in the simple case of a dichotomous outcome.
A more general approach when developing causal estimands in this setting is to
develop a set of regression models:

1. One needs regression equation(s) to predict the probability of being a complier for
each subject even though this is a latent variable. This could be a logistic
regression if one assumes compliance is described as compliers and never takers.
Note: When the trial involves a novel treatment, subjects that are “always takers”
will not have access to the novel treatment when assigned to the control. So they
may not participate, and compliance may be regarded as a dichotomous variable.
2. A second regression equation is used to predict who is likely to drop out or not in
studies in which dropouts are a concern and includes terms associated with
treatment assignment and the latent compliance variable in addition to baseline
variables.
3. A third equation predicts the response or outcome variable and includes both
baseline covariates and variables that relate to compliance class and treatment
assignment.

The covariates in each model need not be the same since there may be covariates
that drive one to stay in the trial and other covariates that help predict the outcome
variable. The model is fit iteratively, and Dunn et al. have proposed an approach
based on a variant of maximum likelihood called “ML EM.” The nice part is that one
can fit such models in a number of statistical packages though the fitting options are
apt to vary.

Some Cautions

Since latent variables are not “observed,” these approaches could be challenged
in that different assumptions regarding latent variables could yield different causal
estimates of treatment effect. However, ignoring imbalances in dropouts or arbi-
trarily using a treatment policy estimand without thinking about compliance is not
sensible. So, these models provide estimates of treatment effect allowing one to
challenge the robustness of treatment effect in the presence of certain intercurrent
events. When compliance is partial or dropouts are not adequately described by the
models used, the models are more approximate (Stuart et al. 2008; Baker et al. 2016).

Death Before a Fixed Time as a Post-randomization (Intercurrent) Event


If the treatment is designed to improve the quality of life (QOL) in a serious chronic
disease, mortality during the trial can make assessing the QOL endpoint a challenge.
If the QOL measure is taken at 12 months on the treatment, and the person dies
before then, one can treat survival as a dichotomous variable depending on whether
the person is alive and can be evaluated at 12 months. If the assumption that
mortality is not influenced by the treatment is plausible, one could elect to use a
“while on treatment” strategy which is described later on. However, if this is
unknown, one option is to use a principal stratification method that includes a
sensitivity analysis that looks at the impact of survival on estimates of treatment
effect (Chiba and Vanderweele 2011), i.e., by comparing those that are likely to
survive under either treatment. In practical terms, if mortality were higher in the
active treatment arm, survivors in the active group may be healthier to begin with,
and better QOL values may not be attributable to the treatment under study. Note that
this form of sensitivity analysis is quite different from the sensitivity analyses we
present later. However, it is an analysis designed to help the evaluator gain a better
understanding of what is attributable to the intervention of interest. Thus, it is in the
same spirit as the sensitivity analyses we discuss later. Large differences in survival
can make interpreting a QOL endpoint challenging in any case.
With death as an intercurrent event, a treatment policy estimand could be an issue.
However, a treatment policy estimand could be fine if overall survival is the endpoint
of interest. Context matters.

Per Protocol Analyses


In the original ICH E9 (1998), there is a discussion of a per protocol analysis.
Although not following the protocol is a post-randomization event, the
simple exclusion of protocol violators from an analysis would not be an example
of principal stratification, since there is no effort to break the data down by principal strata in the analysis and to develop a causal estimate of treatment effect.
In real terms, the nature of dropouts could vary by treatment arm, and the proportion
dropping out could be considerably different. Causal estimates are designed to
compare outcomes among similar individuals. See chapter on “Intent to Treat.”

Composite Strategy

The occurrence of an intercurrent event is taken to be a component of the variable, i.e.,
the intercurrent event is integrated with one or more other clinical measures of
interest. For example, in a rare disease setting, the focus may be on a treatment for multiple symptoms of the disease (e.g., headaches, diarrhea, etc.). The estimand may
be defined in terms of the average number of days per week without a symptom. The
use of rescue medications to alleviate any one symptom could be considered a day
with a symptom. Assuming the rare disease is chronic (so subjects may be on the
medication for a very long time in practice), one may wish to focus on the change
from baseline versus the last week on a treatment for a weekly average number of
days without symptoms.
Another approach that falls within a composite strategy would be a “responder”
analysis that bins subjects into two classes, namely, success and failure. Multiple
features can be considered in defining success or failure but dropouts are considered
failures. While this results in a loss of information and reduced power, how treatment effect will be captured is quite clear. Clarity alone, however, should not be the sole consideration in picking an estimand. Several estimands, including composite estimands, are illustrated in an example later in this chapter.
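A small sketch of such a responder analysis (hypothetical data and threshold): success requires both completing the study and meeting a clinical cutoff, so dropouts are counted as failures by construction.

```python
import numpy as np

# Hypothetical change-from-baseline data; np.nan marks subjects who dropped out.
change = np.array([-1.2, -0.8, np.nan, -0.1, -2.0, np.nan, -0.9])
completed = ~np.isnan(change)

# Composite responder: completed the study AND improved by at least 0.5 points.
responder = completed & (np.nan_to_num(change, nan=0.0) <= -0.5)
print(responder.mean())  # responder rate, with dropouts counted as failures
```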

While on Treatment Strategy


This would be a case where one is only interested in information prior to an
intercurrent event. For example, if the treatment is a palliative treatment (e.g., for
pain) in late-stage cancer, one would not want to consider data after a patient dies.
Analyses that treat values after death as missing rather than nonexistent can be
somewhat nonsensical. As in the section on a “Principal Stratification Strategy:
Estimands and Causal Inference,” if the treatment impacted survival in addition to
pain, an estimand based on this strategy could be hard to interpret.
A discussion of estimands when longitudinal data and survival are in the same
study can be found in Kurland et al. (2009).
Other Strategies

One can combine strategies. For example, in a chronic pain
trial where pain is assessed daily, use of rescue medication can be part of a
composite strategy by assuming the “worst pain over the past 24-hours” is not
impacted by short-term use of rescue medications. But treatment dropouts could be handled by either (1) collecting data on all subjects whether or not they take the assigned regimen or, more likely, (2) treating subjects that leave the study as treatment failures. However, these choices need to be carefully considered early on, as
they may have ramifications for what data needs to be collected in a protocol and
which statistical analyses make sense. Simulations may be necessary to see what
impact decisions on intercurrent events have on operating characteristics and
sample size requirements.

Other Considerations

Importance of Selecting an Estimand at the Planning Stage

By acknowledging which estimands are to be assessed, one can properly state what
data needs to be collected in the protocol. Choice of an estimand may mean one
needs to document why subjects leave, and trial sponsors may need to consider
incentives to keep subjects in the trial. Some advice on protocol writing is available on an NIH website, but it is geared toward investigators needing to come to FDA to have their study
plan approved (NIH-FDA 2017). It is common practice to have a protocol complete
prior to finalizing an SAP. But it is logical to say an initial draft of an SAP ought to be

evaluated with the protocol to be sure the right information will be collected. In the
early stages of most clinical trials, both the protocol and SAP are refined, but they
should be in sync with respect to the primary analyses.

Role of Covariates

The elements of an estimand (see Table 1) do not explicitly call for covariates. One
can consider covariates when the estimator or the estimate of treatment effect is
defined. Use of appropriate covariates can greatly improve the precision of an
estimate of treatment effect and the power of a test of hypotheses. However, consider
covariates that are reliably collected at screening or baseline so that use of covariates
does not generate a bigger missing data issue. Covariates can also be very useful
when deciding among approaches for imputing missing data (Little and Rubin
2014).

Types of Missingness

When selecting an estimand, one needs to think through the analysis options and
various anticipated patterns of missing data that may occur. This could be influenced
by the type of intercurrent events that are anticipated and the factors that influence
whether or not these occur. Of course, estimands that do not rely on data after a patient stops taking their assigned medication mean that data at that stage are not considered missing and may not always be collected (nor should they be imputed).
Similarly, there is no need to impute values for life after death in a sensible estimand.
See “▶ Chap. 86, Missing Data,” and Little and Rubin (2014).
An assumption of missing completely at random (MCAR) usually leads to an analysis that ignores the reason for the missing data. As we have discussed in the section on “Principal
Stratification Strategy: Estimands and Causal Inference,” if missing data is thought
to be unrelated to treatment assignment, intercurrent events may still impact the
precision of treatment effects, but the resulting estimator and estimate may still be
sensible. However, in most settings, this may be the least realistic. Missing at
random (MAR) usually means the chance of being missing is a function of terms
measured during the trial and incorporated into the analysis. Mehrotra et al. (2017)
suggest that MAR in a longitudinal study with repeated measures involves the
assumption that subjects who drop out are like the subjects who remained and had the same observed values up until the time dropout occurs. In drug trials, subjects often drop out because they are
not doing especially well on their assigned treatment group, so this assumption can
be misleading and can result in an overly optimistic estimate of treatment effect. One
of the most common approaches in analyzing the longitudinal studies we describe
here uses a mixed model with repeated measures (noted as MMRM), which does not
explicitly impute the gaps in the dataset. However, because the analysis is consistent
with assuming the data is MAR, it typically generates an overly optimistic estimate
of treatment effect.
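A simulation sketch of this optimism (all numbers invented): when subjects who respond poorly to treatment drop out before the final visit, an estimate based only on observed completers overstates the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# True 6-month changes from baseline; more negative is better (e.g., HgA1C).
y_ctrl = rng.normal(0, 5, n)
y_trt = rng.normal(-2, 5, n)   # treatment truly improves outcomes by 2 points

# Dropout depends on how a subject is doing: poor responders on treatment leave.
drop_trt = rng.random(n) < np.where(y_trt > 0, 0.6, 0.1)
drop_ctrl = rng.random(n) < 0.1

true_effect = y_trt.mean() - y_ctrl.mean()                      # about -2
completer_est = y_trt[~drop_trt].mean() - y_ctrl[~drop_ctrl].mean()
print(true_effect, completer_est)  # the completer estimate looks too favorable
```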

Missing not at random (MNAR) is missing data that is not MAR or MCAR and is
probably more common than one would like to admit. In studies with large amounts
of missing data, an analysis that is consistent with an MAR assumption would be
especially problematic even though one sees this in the literature on a regular basis.
The reality is values that would have occurred after a subject leaves a study are at
best a guess since these values do not really exist, so analysis methods that minimize
the assumptions made for unobserved values are the most appropriate.
For longitudinal studies, sometimes the term “monotone missingness” is used. In
the section on “Sensitivity Analysis,” an approach that assumes monotone mis-
singness (as the primary source of missing data in a trial) is discussed. Once a
subject drops out, they will no longer contribute data, and they do not return. If
clinicians are likely to use certain laboratory measurements to take a subject off a
treatment, how that information is reflected in the data analysis should be thought
out, but it would be inappropriate to consider that data as MCAR (Holzhauer et al.
2015).
Although estimands ought to be spelled out prior to selecting an estimator, there
likely will be an iterative process involved. Missing data and/or protocol violations
will have to be part of the conversation. This is normal when planning a clinical trial.

Estimands and Safety

There can be one estimand (or more) for efficacy and other estimands for safety. For
example, in vaccine trials submitted to FDA in support of vaccines for healthy
subjects, only subjects completing a prescribed regimen are typically included in
the efficacy calculations. Those included in the safety assessments are those who
receive at least one dose of the treatment assigned. So, efficacy and safety estimands
differ and the resulting estimates may not involve using data from the same subjects.
For trials where safety is a primary outcome, such as a safety study to rule out an
elevated cardiovascular risk for a drug to treat diabetes (FDA 2008), the estimand
may be defined in terms of an intent to treat policy strategy where all subjects are
included, whether or not they adhere to the assigned treatment regimen. This could
differ from a study in which a more traditional efficacy endpoint is being used, and
there are no prespecified hypotheses associated with safety.

Estimands in Studies with Time-to-Event Endpoints

In oncology and many cardiovascular trials, time to event data, such as time until
death from any cause (overall survival), are common endpoints of interest. The most
common measure of treatment efficacy in that setting is a hazard ratio (e.g., using a
Cox proportional hazards regression). It would be hard to represent the hazard ratio
as a causal estimand (see section on “Causal Inference”). But it would be inappro-
priate to think that some of the principles we have identified here do not apply.
In many time-to-event studies, not all the patients will die (or have the event of
interest) by the time the study is ended and that must be considered in the analysis.
Administrative censoring refers to subjects not having the event of interest at the
time the study is over and that would likely be considered non-informative (namely,
the time to event and the time at which censoring occurs are regarded as indepen-
dent). However, one would need to specify how other types of censoring are
considered, e.g., how would one treat patients that move onto another treatment
because of disease progression in a trial with overall survival as an endpoint. One may want to treat results differently if (1) patients moved onto another active treatment for the indication in question than if (2) patients moved from the active treatment to the control arm.
Clarity in endpoints and reasons for censoring can be relevant in other time to
event settings (FDA-NIH 2018). Planning in advance for how these are handled is
also better science.
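A minimal sketch (hypothetical setup) of how administrative censoring is typically encoded for a time-to-event analysis: subjects without the event by the study cutoff contribute their follow-up time with an event indicator of 0. Informative censoring, such as leaving the study to start another therapy, cannot be encoded this mechanically without further assumptions, which is the point made above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
study_end = 24.0                       # months of follow-up (administrative cutoff)

event_time = rng.exponential(30.0, n)  # latent time to the event, in months

# Administrative censoring: no event observed by the end of the study.
observed_time = np.minimum(event_time, study_end)
event = (event_time <= study_end).astype(int)  # 1 = event observed, 0 = censored

print(event.mean())  # fraction of subjects with an observed event
```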

Estimands in Complex Designs

Adaptive designs are more common today, and some designs can alter the estimand
after the start of the trial. One obvious case of this would be adaptive enrichment
(Rosenblum et al. 2016), where an interim analysis is planned. Based on the interim analysis, future recruitment may be restricted to a prespecified subgroup of patients, e.g., those
with more severe disease. This may change the estimand in that this changes the
intended population; it may also alter the frequency of certain outcomes (e.g., deaths
may occur at a higher frequency in a study of severely ill patients). The decision to
study a subgroup should not jeopardize the overall integrity of the study (FDA 2019).

Estimands and Meta-Analysis

Meta-analyses are common when there are multiple trials designed to answer similar
questions. One common objective in a meta-analysis is to provide a global estimate
of treatment effect, combining information from several trials. It would be challeng-
ing to include trials with different estimands (and perhaps a different consideration of
intercurrent events) into a meta-analysis. Trials of different duration are a challenge
particularly if effect size is apt to change with duration. In addition, intercurrent
events could be more likely if patients are observed over a longer time period. It may
be necessary to obtain patient-level (line) data and reanalyze with a common estimand
and a consistent approach to handling missing data. This could be a challenge
since many meta-analyses use summary measures from journal articles. Note this
concern is distinct from the issue of relying solely on published studies.
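As a sketch of the pooling step, the fixed-effect inverse-variance combination below (hypothetical per-trial estimates) shows why the trials must target the same estimand: the pooled number is only interpretable if each input estimates the same quantity.

```python
import numpy as np

# Hypothetical per-trial treatment-effect estimates (e.g., mean differences)
# and standard errors; the validity of pooling assumes a common estimand.
effects = np.array([0.40, 0.55, 0.30])
ses     = np.array([0.20, 0.15, 0.25])

w = 1.0 / ses**2                                  # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
print(pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se))
```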

Network Meta-Analysis

Network meta-analysis (Efthimiou et al. 2016) is often used in comparative effectiveness
studies because it allows comparisons of therapies not studied in the same trial, but
when a different population or different handling of intercurrent events occurs, the
resulting analysis may make no sense. These analyses share the same concerns as
traditional meta-analyses.

Estimands in Non-inferiority (NI) Studies

NI studies are common in a regulatory environment where one wants to allow drugs
and other medical products to compete with other products on the market by
showing comparability rather than superiority (see chapter on “Non-inferiority”).
NI studies use an already established therapy (e.g., a medical product already
approved at FDA) to serve as an “active control.” Previous studies are used to
formulate a margin, namely, by (1) considering how the active control compares to a
placebo and (2) defining how to compare a new treatment to an active control. The
margin needs to be small enough to demonstrate that the novel treatment is still
effective relative to placebo, and part of planning can involve trying to capture what
proportion of that effect needs to be preserved by the new product. The US
guidance on NI studies for drugs and biologics (FDA 2016) provides advice on
determining an NI margin, though in other settings the margin can be determined in
other ways. In the guidance, the margin is often determined using a meta-analysis
that compares a proposed active control arm to a placebo. It would be useful for
all the trials used in the meta-analysis and the planned trial to use the same estimand,
or at least to factor the use of different estimands into which trials are or are not
included when the margin is determined.
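The sketch below illustrates one common way of operationalizing this, a fixed-margin ("two confidence interval") construction along the lines discussed in the FDA guidance; the numbers and the 50% preservation fraction are hypothetical.

```python
# Hypothetical meta-analytic estimate of active control vs placebo
# (positive = control better than placebo), with its standard error.
m1_est, m1_se = 0.50, 0.10
M1 = m1_est - 1.96 * m1_se          # conservative (lower-bound) control effect
M2 = 0.5 * M1                       # preserve 50% of the control effect

# NI trial result: new treatment minus active control, with its 95% CI lower bound.
diff_est, diff_se = -0.05, 0.08
lower = diff_est - 1.96 * diff_se

# Non-inferiority is shown if the CI lower bound stays above -M2.
print("NI margin M2:", M2, "| CI lower bound:", lower, "| NI shown:", lower > -M2)
```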
Because having a treatment effect that is close to zero is usually declared a
success in this setting, non-inferiority studies with large numbers of intercurrent
events could be suspect, and minimizing intercurrent events is critical (Rothmann
et al. 2011).

Going from Estimand to Estimator

Different estimands may result in different estimators. However, a single
estimand could admit more than one estimator that is in alignment with it. As part
of the estimand decision, someone may decide that data after a patient goes off
treatment are not missing and therefore need not be collected or imputed. A choice
of estimand may also limit the choice of strategies for estimating missing values via
an imputation method (Phillips et al. 2017; see “▶ Chap. 86, Missing Data”).
Different choices of estimands and estimators are possible and may vary even
within the same broad therapeutic area. An extensive discussion of picking
estimators and subsequent choices of estimation in the context of type 2 diabetes is
in Holzhauer et al. (2015). They note that some estimators resulting from a particular
choice of estimand may affect the distribution of test statistics and ought to be
evaluated using simulations in advance of a study. Mehrotra et al. (2017) show more
than one analysis approach to addressing a specific estimand and illustrate using
multiple datasets.

Estimands need to be selected such that once a trial is completed and one uses
estimators consistent with an estimand, the results are interpretable. Regulators are
obligated to approve products that are both safe and effective, and estimates of
treatment effect will drive decision-making. However, in medical practice, there may
be an interest in comparing different treatments for a given patient. Different
estimands and different ways of estimating treatment effects could hamper using
summary data from various trials to decide which treatments are best.

Benefit Risk

While most trials may handle effectiveness and safety separately, there is value in
considering benefit and risk together. For example, in treating patients for certain
types of heart disease, drugs that reduce the risk of clots can come with an
elevated risk of serious bleeding, and one may wish to define an endpoint and
an estimand that formally weighs both kinds of events. Literature in this space is
limited.

Sensitivity Analyses

As noted earlier, sensitivity analyses are often performed at the completion of a
clinical trial. Prior to the ICH addendum (ICH 2017), sensitivity analyses were
often used but not always with an explicit tie to an estimand. Common sensitivity
analyses could include different treatment of outliers or challenging the distribu-
tional assumptions of an analysis (e.g., using a nonparametric method in place of a
t-test). Sensitivity analyses were focused on the data analysis aspects but not always
with a formal consideration of intercurrent events. There is a wealth of literature on
methods for imputing missing data and many robust methods to minimize the impact
of outliers. However, sponsors of a trial should be proactive in minimizing missing
data when designing the trial and anticipating what data needs to be collected to
provide an estimate consistent with a proposed estimand. Anticipated protocol
violations and how they will be addressed need to be described in the SAP and
considered when developing a sensitivity analysis.
Sensitivity analyses can challenge various aspects of the primary analysis. The
common use of covariates in randomized trials should improve the analysis without
affecting the estimand. However, where qualitative treatment-by-covariate
interactions arise, the new treatment may be better for some patients while harmful
for others, and common sense is needed. Subgroup analyses are inevitable
in clinical trials (Alosh et al. 2015), but these are normally not considered part of a
sensitivity analysis associated with a specific estimand.
By tying sensitivity analyses to the prespecified estimand, the sensitivity analysis
is tied to the underlying research question to be answered. Scharfstein et al. (2014)
have divided sensitivity analyses into ad hoc, local, and global forms. Ad hoc
sensitivity analyses correspond to the more basic assessments of assumptions of
the statistical analyses just described (though such analyses do not have to be
considered ad hoc), while global assessments look at the impact of dropouts on the
conclusions regarding the treatment effect. Their global approach relies on
monotone missingness, namely, once a patient skips a visit, they do not come back
later; this is a major source of missing data in many longitudinal studies.
Leuchs et al. (2015) discuss sensitivity analyses but refer to the use of alternative
estimands as a type of sensitivity analysis. In the ICH E9 R1 guidance addendum
(ICH 2017), these analyses are regarded as supportive analyses. So, terminology in
this space is evolving.
If sensitivity analyses are prespecified, there are likely to be fewer analyses, and a
clearer interpretation of the trial is more likely. Studies with a large degree of
noncompliance with a protocol are likely to fail certain sensitivity analyses, and all
sides involved in the planning ought to discuss what that might mean before the
study is over.
When one analyzes data assuming dropouts equate to failure (e.g., assigning
dropouts the worst rank using a rank-based test), it may be reasonable to regard that
analysis as primary, and a sensitivity analysis using the same estimand
may not be warranted. A primary analysis that still addresses intercurrent events but
is accompanied by a tipping point analysis is another approach. Ouyang et al.
(2017) describe “tipping point analyses used to explore how extreme and detrimental
outcomes among subjects with missing data need to be to overturn the positive
treatment effect attained in subjects with complete data.” There have been multiple
approaches to tipping point analyses, but they often involve challenging a primary
analysis by deciding what changes in values associated with intercurrent events lead
to a different conclusion regarding study success (e.g., Mehrotra et al. 2017;
Scharfstein et al. 2014; Campbell et al. 2011). Approaches differ by how missing
data and protocol violators are handled. Tipping point analyses minimize reliance on
methods that ignore the reasons for the missing data, or possibly make conservative
assumptions regarding likely values that substitute for a missing value. Most tipping
point results are captured in a table or graph and would require subject matter experts
to interpret.
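A minimal sketch of the dropouts-as-worst-rank analysis mentioned above follows, with hypothetical improvement scores and scipy's Mann–Whitney test as the rank-based test.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical improvement scores (baseline minus Week 16 pain; higher = better);
# np.nan marks subjects who discontinued treatment.
active  = np.array([2.0, 3.5, 1.0, np.nan, 4.0, np.nan, 2.5])
placebo = np.array([1.0, 0.5, np.nan, 2.0, 1.5, 0.0, np.nan])

# Assign dropouts a value worse than any observed outcome so that they all
# receive the worst ranks in the rank-based test.
worst = np.nanmin(np.concatenate([active, placebo])) - 1.0
active  = np.where(np.isnan(active), worst, active)
placebo = np.where(np.isnan(placebo), worst, placebo)

stat, p = mannwhitneyu(active, placebo, alternative="greater")
print(p)
```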
There are sensitivity analyses specific to time to event analyses, but these are
beyond the scope of this chapter. Distinguishing a primary analysis from a sensitivity
analysis when reporting results would be important.

An Example

An example from the US FDA is presented (the authors are not permitted to share the
line data). It illustrates the preceding discussion of estimands and sensitivity
analyses (namely, tipping point analyses) in the context of formal decision-making.
First, some issues that arise when designing clinical trials to evaluate pain medications
are discussed.

Challenges for Clinical Trials to Evaluate Pain Medications

Pain can be categorized according to duration, acute or chronic, as well as other
characteristics such as breakthrough pain (i.e., acute episodes of pain that occur on a
background of well-controlled chronic pain). Acute pain is defined as pain that is
self-limiting and generally requires treatment for no more than a few weeks (e.g.,
postoperative pain after various types of surgeries). In contrast, chronic pain is
defined as pain persisting longer than 3 months (e.g., chronic lower back pain and
pain associated with spinal cord injuries (SCI)). See the Initiative on Methods,
Measurements, and Pain Assessments in Clinical Trials (IMMPACT 2011) and
Analgesic, Anesthetic, and Addiction Clinical Trial Translation, Innovations,
Opportunities, and Network (ACTTION 2002) websites for more information
regarding different trial designs.
Pain, the primary efficacy endpoint of interest, is subjective in nature and is
measured by a subject self-reporting their pain. It is often measured daily on an
11-point numerical rating scale (NRS) where a score of 0 indicates no pain and 10 is
the worst pain possible. Other scales may be acceptable. In chronic pain, these
conditions could last a lifetime, but a randomized double-blind, adequate, and
well-controlled trial where each subject is on study medication for about 12 weeks
has been accepted as a reasonable assessment of long-term use. To ensure subjects
have adequate pain and to demonstrate an effect, subjects are generally required to
have a minimum pain score of at least 4 over a given timeframe to be eligible for
randomization into the trial. To determine if a drug is working, mean change from
baseline pain at the end of an agreed-upon time is used as a primary summary statistic,
though a difference in trimmed means (Permutt and Li 2017) could also be considered.
Measurements recorded between baseline and that time can improve
the efficiency of the analysis or possibly provide a basis for imputing missing values.
Often a mixed-effects model for repeated measures (MMRM) analysis is conducted
in which a difference in adjusted means at week 12 is the comparison of interest.
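The sketch below fits a rough stand-in for such a longitudinal model using statsmodels' random-intercept MixedLM on simulated data; a production MMRM with an unstructured covariance matrix would typically be fit in dedicated software (e.g., SAS PROC MIXED or the R mmrm package), so treat this only as an illustration of the model structure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated long-format data: pain scores at Weeks 4, 8, 12 for 40 subjects.
n, weeks = 40, [4, 8, 12]
treat = rng.integers(0, 2, n)
baseline = rng.uniform(4, 9, n)
df = pd.DataFrame({
    "subj":     np.repeat(np.arange(n), len(weeks)),
    "week":     np.tile(weeks, n),
    "treat":    np.repeat(treat, len(weeks)),
    "baseline": np.repeat(baseline, len(weeks)),
})
df["pain"] = (df["baseline"] - 0.10 * df["week"]
              - 0.50 * df["treat"] * df["week"] / 12
              + rng.normal(0, 1, len(df)))

# Random-intercept stand-in for an MMRM; the treatment-by-week interaction
# carries the adjusted treatment difference at Week 12.
fit = smf.mixedlm("pain ~ C(week) * treat + baseline", df, groups=df["subj"]).fit()
print(fit.summary())
```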
Chronic pain trials often have 30–50% of subjects discontinuing due to lack of
efficacy (placebo group) and/or intolerability (active drug). Two methods are com-
monly utilized to minimize these discontinuations:

• Allow subjects to use rescue medication to minimize study discontinuations due
to lack of efficacy. The protocol should specify the type and amount of rescue
medication that will be allowed during the study. Rescue medication use needs to
be done in a manner that does not interfere with scheduled pain assessments.
In chronic pain trials, short-term use of rescue medication is often not considered
when evaluating pain since the worst pain experienced in the previous 24 h
includes pain assessed prior to taking rescue medication. All use of rescue
medications is recorded. It would be a concern if treatment arm subjects used
more rescue medications than those on the placebo arm (Dworkin et al. 2005).
• To minimize discontinuations due to intolerability to study drug, during the first
2–3 weeks of the trial, often in an open-label fashion, subjects are titrated to an
effective dose of study drug. Only subjects who achieve an effective dose of
the experimental drug are randomized into the double-blind treatment phase.
Randomized subjects either stay on active treatment or switch to the control
group. If the control group is a placebo, subjects are tapered off the active drug
to minimize discontinuations due to withdrawal symptoms and to assist in
maintaining the blinding of the study. This is referred to as an enriched enrollment
withdrawal design. Even with this design, subjects still discontinue active treat-
ment for lack of efficacy and adverse events (Katz 2009).

In chronic pain trials, subjects that discontinue treatment will most likely switch
to other therapies. This makes an estimand based on a treatment policy strategy
difficult to interpret since it measures the impact of “treatment plus other therapies”
versus “placebo plus other therapies.” An estimand using a composite strategy that
considers treatment discontinuations as failures is more commonly used. The focus
is on the difference in the two treatments at week 12 as this is a chronic condition
although subjects may stay on a drug considerably longer if approved.
Sometimes one sees a responder analysis in this setting. A responder is a subject
that shows a prespecified improvement from baseline in pain, such as a 30% or 50%
improvement, which has been deemed to be clinically relevant. This responder
definition may also include use of rescue medication such as no use or less than a
specific amount. For example, if a subject uses rescue medication for 7 consecutive
days or longer, they are considered a non-responder in the primary analysis
(Dworkin et al. 2005). So, subjects are either a responder or a non-responder.
This approach has often been criticized as having less power, but it is easy to
implement.
A method that retains more information than a dichotomy of response utilizes
a continuous responder curve with an appropriate corresponding analysis. We
illustrate this in our FDA example below.
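A continuous responder curve is simple to compute: for each improvement threshold, take the proportion of subjects in each arm whose improvement meets or exceeds it, with dropouts coded as non-responders. A minimal sketch with hypothetical data:

```python
import numpy as np

def responder_curve(pct_improvement, thresholds):
    """Proportion of subjects achieving at least each improvement threshold;
    dropouts should be pre-coded as 0% improvement (non-responders)."""
    return [(pct_improvement >= t).mean() for t in thresholds]

# Hypothetical percent improvements in pain; dropouts coded as 0.
active  = np.array([45, 60, 10, 0, 80, 35, 0, 55])
placebo = np.array([20, 0, 30, 15, 0, 40, 5, 25])

thresholds = np.arange(10, 101, 10)
print(responder_curve(active, thresholds))
print(responder_curve(placebo, thresholds))
```

Plotting these proportions against the thresholds, one curve per arm, gives a display of the kind shown in Fig. 1 (see Farrar et al. 2006).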

Estimands, Estimation, and Sensitivity Analysis Illustrated Using an FDA Example

This example was submitted to FDA in a New Drug Application (NDA). Statistical
and clinical reviews are provided by FDA at the Drugs@FDA weblink (FDA 2020).
The study is briefly described below, along with a discussion of possible estimands
and estimators and the corresponding estimates of treatment effect. These were
previously presented (Petullo 2016).
This was an 18-week (4-week dose-adjustment phase, 12-week fixed-dose
maintenance phase, 1-week taper phase, 1-week follow-up phase) randomized,
double-blind, placebo-controlled, multicenter trial. Subjects were started on
150 mg per day or placebo and titrated to a target dose range of 150–600 mg per
day. Subjects were required to have a diagnosis of traumatic SCI of at least 1-year
duration with central neuropathic pain that had persisted continuously for at least
3 months or with remissions and relapses for at least 6 months. Concomitant
analgesics were allowed if subjects were on a stable dose regimen prior to
randomization. Subjects must have had an average daily pain score of at least 4 (0–
10 NRS) during the 7 days prior to randomization.
The study randomized 219 subjects, 108 in the placebo arm and 111 in the active
arm. Approximately 15–17% of subjects had missing data at Week 16 (15% in the
placebo arm and 17% in the active arm). Similar numbers dropped out for adverse
events and for lack of efficacy in both treatment arms.
When this study was planned, conducted, and reviewed by FDA, a primary
estimand of interest was not explicitly stated, but the estimand that corresponds to
the analysis to support approval is presented along with how it was estimated.
A composite strategy was used to define an estimand, i.e., to define the effect
of the active drug for treating neuropathic pain associated with SCI. Use of this
estimand did not require follow-up of subjects after treatment discontinuation as they
are considered treatment failures or non-responders. The four components of this
estimand are described below:

(A) Population: Subjects with traumatic SCI of at least 1-year duration with central
neuropathic pain that had persisted continuously for at least 3 months or with
remissions and relapses for at least 6 months.
(B) Variable: Change from baseline pain score at Week 16. (Baseline pain was defined
as the pain prior to the open-label (OL) titration phase.)
(C) Intercurrent Event: Subject did not complete 16 weeks of treatment. Subjects
that experienced the intercurrent event were considered as having no improve-
ment in baseline pain.
(D) Population-Level Summary: Difference in mean change in baseline pain at
Week 16 comparing subjects with neuropathic pain associated with spinal
cord injuries assigned to active drug versus those assigned to placebo.

The analysis used to estimate the population-level summary was an analysis of
covariance (ANCOVA) with treatment as a fixed effect and baseline pain as a
covariate. Subjects that discontinued were assigned the baseline value as the Week
16 value (i.e., a baseline observation carried forward (BOCF) strategy). Results from the ANCOVA analysis gave an
estimated mean change from baseline at Week 16 of 1.1 and 1.7 for placebo and
active drug, respectively. The estimated difference of 0.6 (95% CI: 0.1, 1.1) was
significant with a p-value<0.01 (FDA 2020). This was considered the primary
analysis at the time of review.
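A minimal sketch of this BOCF-plus-ANCOVA estimator, with hypothetical data and statsmodels, follows; the sign convention (positive change = improvement) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical subject-level data; week16 pain is NaN for dropouts.
df = pd.DataFrame({
    "treat":    [1, 1, 1, 0, 0, 0, 1, 0],
    "baseline": [6.0, 7.5, 5.5, 6.5, 8.0, 5.0, 7.0, 6.0],
    "week16":   [4.0, np.nan, 3.5, 6.0, np.nan, 4.5, 5.0, 5.5],
})

# BOCF: dropouts carry the baseline value forward, i.e., no improvement.
df["week16"] = df["week16"].fillna(df["baseline"])
df["change"] = df["baseline"] - df["week16"]     # positive = improvement

# ANCOVA: treatment as a fixed effect, baseline pain as a covariate.
fit = smf.ols("change ~ treat + baseline", data=df).fit()
print(fit.params["treat"], fit.conf_int().loc["treat"].values)
```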
An alternative estimand based on a composite strategy is also feasible. Parts
A and B of the estimand above are the same, but parts C and D are
changed:

(C) Intercurrent Event: Subject did not complete 16 weeks of treatment. Subjects
that fail to complete are assigned a poor outcome.
(D) Population-Level Summary: Difference in mean change from baseline pain at
Week 16 in the best 80% of subjects from each treatment arm comparing
subjects with neuropathic pain associated with spinal cord injuries (a logical
graphical display of the data in support of a trimmed mean analysis could be
continuous responder curves; see Fig. 1).

Fig. 1 Continuous responder curves – composite estimand

As noted above, 15% of placebo subjects discontinued versus 17% of active
subjects. Therefore, at least 17% of the data should be trimmed from both treatment
arms. To simplify, a 20% trimmed mean was selected. It is important to note that the
trimming fraction should be prespecified and should at a minimum trim all dropouts.
The differences in trimmed means can be compared using a permutation test, though
if one only wanted a p-value rather than an estimate of treatment effect, one could
use a rank test with failures assigned the worst rank.
The 20% trimmed mean change was 0.9 for placebo and 1.9 for the active arm. The
difference in trimmed means in this example was −1.0 (95% CI: −1.4, −0.5). Using
a permutation test, the difference was considered significant, with a p-value less than
0.001. The continuous responder curve in Fig. 1 is a useful way of graphically
displaying results when a trimmed mean is used (see Farrar et al. 2006).
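The sketch below, with hypothetical improvement scores and dropouts coded as the worst possible outcomes, computes a 20% trimmed-mean difference and a permutation p-value along the lines just described.

```python
import numpy as np

rng = np.random.default_rng(1)

def trimmed_mean_diff(a, b, trim=0.20):
    """Difference in arm means after discarding the worst `trim` fraction of
    each arm; dropouts are pre-coded as the worst possible outcomes."""
    def tm(x):
        x = np.sort(x)                     # worst (smallest) improvements first
        k = int(np.ceil(trim * len(x)))
        return x[k:].mean()                # mean of the best (1 - trim) fraction
    return tm(a) - tm(b)

# Hypothetical improvements; dropouts coded as -1, worse than any completer.
active  = np.array([2.5, 3.0, 1.5, -1.0, 4.0, 2.0, -1.0, 3.5, 1.0, 2.8])
placebo = np.array([1.0, 0.5, -1.0, 1.5, 2.0, 0.0, -1.0, 1.2, 0.8, 1.4])

obs = trimmed_mean_diff(active, placebo)
pooled = np.concatenate([active, placebo])
n = len(active)
perm = np.array([trimmed_mean_diff(p[:n], p[n:])
                 for p in (rng.permutation(pooled) for _ in range(10000))])
p_value = np.mean(np.abs(perm) >= abs(obs))
print(obs, p_value)
```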

An Estimand, Estimate, and a Tipping Point Analysis

For the estimands in the example above, subjects who discontinued treatment were
considered treatment failures, and imputation was simple. If the primary analysis
instead uses a more optimistic assumption, namely, that those who dropped out were
like those who stayed in up until they left, an analysis consistent with a missing at
random (MAR) assumption can be the primary analysis. A tipping point analysis
evaluating how sensitive the results are to this assumption is then worth considering.
Table 2 Tipping point analysis (Petullo et al. 2016)

Shift in mean change from baseline at Week 16: difference from placebo, mean (95% CI).
Rows: shift applied to missing Lyrica (active) outcomes; columns: shift applied to missing placebo outcomes.

              Placebo shift
Lyrica shift  0               0.5             1.0             1.5             2.0             2.5
 0.0          0.8 (0.3, 1.4)  0.9 (0.4, 1.5)  1.0 (0.4, 1.5)  1.1 (0.5, 1.6)  1.1 (0.6, 1.7)  1.2 (0.7, 1.8)
−0.5          0.8 (0.2, 1.3)  0.8 (0.3, 1.4)  0.9 (0.4, 1.5)  1.0 (0.4, 1.5)  1.1 (0.5, 1.6)  1.1 (0.6, 1.7)
−1.0          0.7 (0.1, 1.2)  0.8 (0.2, 1.3)  0.8 (0.3, 1.4)  0.9 (0.3, 1.5)  1.0 (0.4, 1.5)  1.1 (0.5, 1.6)
−1.5          0.6 (0.0, 1.2)  0.7 (0.1, 1.2)  0.7 (0.2, 1.3)  0.8 (0.3, 1.4)  0.9 (0.3, 1.5)  1.0 (0.4, 1.5)
−2.0          0.5 (0.1, 1.1)  0.6 (0.0, 1.1)  0.7 (0.1, 1.2)  0.7 (0.2, 1.3)  0.8 (0.2, 1.4)  0.9 (0.3, 1.5)
−2.5          0.4 (−0.1, 1.0) 0.5 (−0.1, 1.0) 0.6 (0.0, 1.2)  0.7 (0.1, 1.2)  0.7 (0.1, 1.3)  0.8 (0.2, 1.4)

Starting from a MAR assumption, the tipping point analysis varies the pain scores
for subjects with missing outcomes. The missing outcomes on each treatment arm
are allowed to vary independently, including scenarios where dropouts on the active
arm have worse outcomes than dropouts on the control arm. The goal is to explore
the plausibility of missing data assumptions under which there is no longer evidence
of a treatment effect. As seen in Table 2, the analysis only tips, i.e., p-value >0.05,
if missing data for active subjects are assumed to correspond to much worse outcomes
than missing data for placebo subjects. In this example, the various analyses did not
appear to impact the final conclusions. When they do, those involved in the analysis
and interpretation of the trial need to determine what this means. The choice of a
particular primary analysis and sensitivity analysis could depend on the degree of
missing data anticipated, and MAR as a starting assumption could be quite plausible
in settings where dropouts are infrequent. It may not be ideal in pain studies where
dropouts are common.
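A rough sketch of a delta-adjustment tipping point grid of the kind summarized in Table 2 is shown below. Everything here is hypothetical, and the grid reports p-values rather than shifted effect estimates, to show where the analysis tips; the median p-value across imputations is a crude shortcut, and a real analysis would combine imputations with Rubin's rules.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical change-from-baseline scores; NaN marks missing Week 16 outcomes.
active  = np.array([1.5, 2.0, 3.0, np.nan, 2.5, 1.0, np.nan, 2.2] * 5)
placebo = np.array([1.0, 0.5, 1.5, np.nan, 0.8, 1.2, 0.0, np.nan] * 5)

def shifted_pvalue(a, b, shift_a, shift_b, n_imp=200):
    """Impute missing values from completers' distribution, shifted by an
    arm-specific delta, then test the arm difference (median p over imputations)."""
    pvals = []
    for _ in range(n_imp):
        a_imp = np.where(np.isnan(a), rng.choice(a[~np.isnan(a)], a.size) + shift_a, a)
        b_imp = np.where(np.isnan(b), rng.choice(b[~np.isnan(b)], b.size) + shift_b, b)
        pvals.append(stats.ttest_ind(a_imp, b_imp).pvalue)
    return np.median(pvals)

# Rows: shifts applied to missing active-arm values; columns: placebo shifts.
grid = pd.DataFrame(
    {sb: {sa: round(shifted_pvalue(active, placebo, sa, sb), 3)
          for sa in [0.0, -0.5, -1.0, -1.5]}
     for sb in [0.0, 0.5, 1.0]}
)
print(grid)  # cells with p > 0.05 locate the "tipping" region
```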

Conclusion

The importance of considering estimands and sensitivity analyses, and their role
in the planning of clinical trials, was described here. It is possible to have more than
one plausible estimand and more than one analysis, but it is anticipated that all
those involved in the trial planning and assessment will understand what is going
to be presented when a trial is completed.
It is hoped that greater consideration at the planning stage of things that can go
awry will lead to better-written protocols and analysis plans. Thinking through the
options before a study starts could impact what data need to be collected and what is
regarded as missing, and will limit the number and kinds of analyses that are
reasonable given the question of interest. It is recognized that one cannot anticipate
everything that can happen in a clinical trial, but that does not mean planning to
prevent or minimize problems that have occurred in related trials is unimportant.
Some of the attention to this topic has been motivated by regulators, but the
principles described are not solely applicable to trials for medical product approvals.
Finally, we recognize that more complex analytical approaches to handling
missing data exist. See “▶ Chap. 86, Missing Data.”

Key Facts

• When the patterns of intercurrent events occurring after treatment assignment
differ by treatment arm, interpretation of treatment effects can be challenging or
even misleading.
• If one formulates an estimand without considering how intercurrent events will
factor into the estimand, the study may not answer the question the trial is
intended to answer.
• The statistical analysis plan ought to address how intercurrent events will impact
the estimated treatment effects to be calculated and the estimate ought to follow
logically from the proposed estimand.
• Sensitivity analyses designed to challenge the assumptions associated with a
primary set of analyses are best understood if planned in conjunction with the
formulation of an estimand.
• Methods to minimize intercurrent events are important in clinical trials.

Cross-References

▶ Missing Data

References
ACTTION (2002) Analgesic, Anesthetic, and Addiction Clinical Trial Translations, Innovations,
Opportunities, and Networks (ACTTION). www.acttion.org. Accessed Jan 2019
Alosh M, Fritsch K, Huque M, Mahjoob K, Pennello G, Rothmann M, Russek-Cohen E, Smith F,
Wilson S, Yue L (2015) Statistical considerations on subgroup analysis in clinical trials.
Stat Biopharm Res 7:286–303
Baker SG, Kramer BS, Lindeman KS (2016) Latent class instrumental variables: a clinical and
biostatistical perspective. Stat Med 35:147–160
Campbell G, Pennello G, Yue L (2011) Missing data in the regulation of medical devices.
J Biopharm Stat 21:180–195
Chiba Y, VanderWeele TJ (2011) A simple method for principal strata effects when the outcome is
truncated due to death. Am J Epidemiol 173:745–751
Dunn G, Maracy M, Tomenson B (2005) Estimating treatment effects from randomized clinical
trials with noncompliance and loss to follow-up: the role of instrumental variable methods.
Stat Methods Med Res 14:369–395
Dworkin RH, Turk DC, Farrar JT, Haythornthwaite JA, Jensen MP, Katz NP, Kerns RD, Stucki G,
Allen RR, Bellamy N, Carr DB, Chandler J, Cowan P, Dionne R, Galer BS, Hertz S, Jadad AR,
Kramer LD, Manning DC, Martin S, McCormick CG, McDermott MP, McGrath P, Quessy S,
Rappaport BA, Robbins W, Robinson JP, Rothman M, Royal MA, Simon L, Stauffer JW,
Stein W, Tollett J, Wernicke J, Witter J (2005) Core outcome measures for chronic pain trials:
IMMPACT recommendations. Pain 113(1–2):9–19
Efthimiou O, Debray TPA, van Valkenhoef G, Trelle S, Panaidou K, Moons KGM, Reitsma JB,
Shang A, Salanti G et al (2016) GetReal in network meta-analysis: a review of the methodology.
Res Synth Methods 7:236–263
Farrar JT, Dworkin RH, Max MB (2006) Use of the cumulative proportion analysis graph to
present data over a range of cut-off points: making clinical trial data more understandable.
J Pain Symptom Manag 31(4):369–377
FDA (2020) Drugs@FDA. http://www.accessdata.fda.gov/scripts/cder/daf
FDA-NIH Biomarker Working Group (2018) BEST (Biomarkers, EndpointS, and other Tools) Resource
[Internet]. Food and Drug Administration (US), Silver Spring; co-published by National Institutes
of Health (US), Bethesda. Glossary created 2016 Jan 28, updated 2018 May 2
Fletcher C, Tsuchiya S, Mehrotra D (2017) Current practices in choosing estimands and sensitivity
analyses in clinical trials: results of the ICH E9 survey. Ther Innov Regul Sci 51:69–76
Ford I, Norrie J (2016) Pragmatic trials. NEJM 375:454–463
Frangakis CE, Rubin DB (2002) Principal stratification in causal inference. Biometrics 58:21–29
Holzhauer B, Akacha M, Bermann G (2015) Choice of estimand and analysis methods in diabetes
trials with rescue medication. Pharm Stat 14:433–447
ICH (1998) E9: Guideline on statistical principles for clinical trials. http://www.ich.org
ICH (2014) E9 concept paper on estimands and sensitivity analyses. http://www.ich.org
ICH (2017) E9 R1: Addendum on estimands and sensitivity analyses in clinical trials. Step 2. www.
ich.org
IMMPACT (2011) Initiative on methods, measurement, and pain assessment in clinical trials
(IMMPACT). www.immpact.com. Accessed Jan 2019
Katz N (2009) Enriched enrollment randomized withdrawal trial designs of analgesics: focus on
methodology. Clin J Pain 25(9):797–807
Kurland BF, Johnson LL, Egleston BL, Diehr PH (2009) Longitudinal data with follow-up
truncated by death: match the analysis method to research aims. Stat Sci 24:211–222
Lavange LM, Permutt T (2016) A regulatory perspective on missing data in the aftermath of the
NRC report. Stat Med 35:2853–2864
Leuchs A, Zinserling J, Brandt A, Wirtz D, Benda N (2015) Choosing appropriate estimands in
clinical trials. Ther Innov Regul Sci 49:584–592
Little RJA, Rubin DB (2000) Causal effects in clinical and epidemiological studies via potential
outcomes: concepts and analytical approaches. Annu Rev Public Health 21:121–145
Little RJA, Rubin DB (2014) Statistical analysis with missing data, 2nd edn. Wiley, New York.
408pp
Mallinckrodt CH, Lin Q, Lipkovich I, Molenberghs G (2012) A structured approach to choosing
estimands in longitudinal clinical trials. Pharm Stat 11:456–461
Mehrotra D, Liu F, Permutt T (2017) Missing data in clinical trials: control based mean imputation
and sensitivity analysis. Pharm Stat 16:378–392
National Research Council (2010) The prevention and treatment of missing data in clinical trials.
National Academies of Science Press, Washington, DC
O’Neill RT, Temple R (2012) The prevention and treatment of missing data in clinical trials: an
FDA perspective on the importance of dealing with it. Clin Pharmacol Ther 91:550–554
Ouyang J, Carroll KJ, Koch G, Li J (2017) Coping with missing data in phase III pivotal registration
trials: Tolvaptan in subjects with kidney disease, a case study. Pharm Stat 16:250–266
Permutt T (2016) A taxonomy of estimands for regulatory clinical trials with discontinuations.
Stat Med 35:2865–2875
Permutt T, Li F (2017) Trimmed means for symptom trials with dropouts. Pharm Stat 16:20–28
Petullo D (2016) Statistical review and evaluation. https://www.accessdata.fda.gov/drugsatfda_
docs/nda/2012/021446Orig1s028StatR.pdf
Petullo D, Permutt T, Li F (2016) An alternative to data imputation in analgesic clinical trials.
American Pain Society Conference on Analgesic Trials, Austin Texas
Phillips A, Abellan-Andres J, Soren A, Bretz F, Fletcher C, France L, Garrett A, Harris R, Kjaer M,
Keene O, Morgan D, O'Kelly M, Roger J (2017) Estimands: discussion points from the PSI
estimands and sensitivity expert group. Pharm Stat 16:6–11
Rosenblum M, Qian T, Du Y, Qiu H, Fisher A (2016) Multiple testing procedures for adaptive
enrichment designs: combining group sequential and reallocation approaches. Biostatistics
17(4):650–662
Rothmann MD, Wiens BL, Chan ISF (2011) Design and analysis of non-inferiority trials. Chapman
& Hall/CRC Press, Boca Raton, 454 pp
Scharfstein D, McDermott A, Olson W, Wiegand F (2014) Global sensitivity analyses with
informative dropouts: a fully parametric approach. Stat Biopharm Res 6:338–348
Stuart EA, Perry DF, Le H-N, Ialongo NS (2008) Estimating intervention effects of prevention
programs: accounting for noncompliance. Prev Sci 9:288–298
US FDA (2008) Guideline for industry: diabetes mellitus-evaluating cardiovascular risk in new
antidiabetic therapies to treat Type 2 diabetes. Dec 2008. https://www.fda.gov/downloads/
Drugs/Guidances/UCM071627
US FDA (2016) Non-inferiority clinical trials to establish effectiveness: guidance for industry.
Nov 2016. https://www.fda.gov/downloads/Drugs/Guidances/UCM202140
US FDA (2018) Product label for Cymbalta. https://www.accessdata.fda.gov/drugsatfda_docs/
label/2008/022148lbl.pdf. Accessed Jan 2019
US FDA (2019) Adaptive designs for clinical trials for drugs and biologics: guidance for industry.
https://www.fda.gov/media/78495/download. Accessed 24 April 2020
US National Institutes of Health-Food and Drug Administration (2017) NIH-FDA Protocol tem-
plate. https://osp.od.nih.gov/clinical-research/clinical-trials. Accessed 9 Nov 2018
85 Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other Issues of Multiplicity

Siyoen Kil, Eloise Kaizar, Szu-Yu Tang, and Jason C. Hsu

Contents
Introduction 1660
Patient Targeting for a Targeted Therapy 1660
Strength of Error Rate Controls in Patient Targeting 1661
Respecting Logical Relationships 1664
Three Kinds of Outcome Measures 1664
A Common Misconception 1665
Binary Outcome 1666
Time-to-Event Outcome 1669
The Subgroup Mixable Estimation (SME) Principle 1671
Effect of the Prognostic Factor on Permutation Tests 1672
Data Model 1672
Predictive Null Hypothesis 1673
Test Statistic and Reference Distributions 1674
Numerical Study 1675
Summary and Conclusions 1678
Key Facts 1678
Cross-References 1678
References 1679

S. Kil
LSK Global Pharmaceutical Services, Seoul, Republic of Korea
E. Kaizar
The Ohio State University, Columbus, OH, USA
e-mail: [email protected]
S.-Y. Tang
Roche Tissue Diagnostics, Oro Valley, AZ, USA
e-mail: [email protected]
J. C. Hsu (*)
Department of Statistics, The Ohio State University, Columbus, OH, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_116

Abstract
This chapter starts with a thorough discussion of different multiple comparison
error rates, including weak and strong control for multiple tests and noncoverage
probability for confidence sets. With multiple endpoints as an example, it
describes which error rate controls would translate to incorrect decision rate
controls. Then, using targeted therapy as the context, this chapter discusses a
potential issue with some efficacy measures in terms of respecting logical rela-
tionships among the subgroups. A statistical principle that helps avoid this issue is
described. As another example of multiplicity-induced issues to be aware of, it is
shown that a permutation test for patient targeting may not control the Type I error rate
in some situations. Finally, a list of the key points and a summary of the
conclusions are given.

Keywords
Subgroups · Multiple comparisons · Prognostic effect · Permutation tests

Introduction

Multiplicity issues arise in clinical trials due to having multiple treatments, end-
points, and subgroups. Using precision medicine as the context, this chapter
describes multiple comparison principles that help ensure proper error rate control.
To start, the extent to which each multiple comparison Type I error rate control
ensures control of the incorrect decision rate is discussed. Then, the efficacy measures
that respect natural logical relationships among patient subgroups are enumerated.
The Subgroup Mixable Estimation (SME) principle for achieving statistical inference
that respects such logic is described. It is also shown that permutation testing is a
technique that should be avoided, as it does not produce a valid null distribution with
a discrete outcome even with only one subgroup classifier. Finally, a list of key
points is given, and a summary of the conclusions is provided.

Patient Targeting for a Targeted Therapy

Targeted therapies, which as Woodcock (2015) states are sometimes called “person-
alized medicine” or “precision medicine,” target specific pathways.
Suppose a companion diagnostic test (CDx) divides patients into a marker-
positive (g+) subgroup and its complementary marker-negative (g−) subgroup.
Call the entire patient population {g+, g−} “all-comers.” If all-comers or a patient
subgroup can confidently be inferred to receive clinically meaningful efficacy, then a
decision is made to target all-comers or that subgroup. If no patient group can be
identified as receiving clinically meaningful efficacy, then development of that
therapy has failed.
With a biomarker that may be predictive of treatment response, for every potential
cut-point value c of the biomarker, efficacy is assessed in the marker-positive
patients (g_c^+ patients, those with values ≥ c), the marker-negative patients
(g_c^− patients, those with values < c), and the all-comer {g+, g−} population.
Statistical methods for patient targeting should control the probability of incorrect
targeting, the probability that a targeted patient group does not derive clinically
meaningful efficacy from the treatment it is given.

Strength of Error Rate Controls in Patient Targeting

A statistical error rate control has meaning only if it translates to an incorrect
decision rate control.
Consider a two-arm randomized clinical trial (RCT). Denote “treatment” and
“control” by Rx and C, and assume there is no differential propensity in treatment
assignment, so that under the Rx and the C arms, the prevalence of the g+ subgroup
in the population is the same; denote it by γ+.
In this setting with multiple subgroups and potentially multiple endpoints, mul-
tiple comparisons are made. Tukey (1953), which has been reprinted as Tukey
(1994), defined three kinds of multiple comparison error rates: per comparison,
per family, and familywise. For confirmatory studies, the familywise error rate
(FWER) is most relevant. Inference in Tukey (1953) is in the form of confidence
intervals, and FWER is defined as the (maximum) probability that at least one of the
simultaneous confidence intervals fails to cover its true value. When FWER is
applied to tests of null hypotheses, there has been some confusion as to what it is.
We explain, in a test of hypotheses setting, what FWER control needs to be, in order
to control the rate of incorrect decision-making.

Null Control
Some methods for patient targeting offer error rate control under what can be called
the null null hypothesis. The null null hypothesis is that Rx has exactly the same
effect as C for all biomarker subgroups. In other words, there is no treatment effect or
biomarker effect whatsoever.
If the outcome is survival time, say, then under the null null, all patients come
from a single group with the same survival curve, regardless of whether they are
given Rx and C, or what biomarker value they have.
In a lucid paper written before the adoption of modern concepts of multiple
comparison error rate control, Miller and Siegmund (1982) suggest forming 2 × 2
tables of Rx vs. C and responder vs. nonresponder at every cut-point and then
selecting the cut-point with the maximum chi-square statistic value. Their critical
value calculation for testing that the observed differential efficacy between the g+
patients and g− patients at the sample maximum chi-square statistic is not just due to
random sample fluctuation is under the null null hypothesis.
As John W. Tukey would say, “controlling the Type I error rate testing a null null
is a null guarantee,” because there will surely be some difference between Rx and C
effects, if measured to enough decimal places. A null null is even more restrictive
than a complete null used in weak control.

Weak Control
The complete null is where all the null hypotheses are true. Controlling the Type I
error rate under the complete null is termed weak control.
When there are subgroups, the null hypothesis for each subgroup is Rx and C have
the same effect in that subgroup. So, under the complete null, the biomarker can have
an effect (a prognostic effect), but effects of Rx and C do not differ in each of the
subgroups.
Thus, weak control controls the probability of inferring Rx and C have different
effects in at least one subgroup, when in fact Rx and C have the same effect in all the
subgroups.
Suppose there is originally a single primary endpoint E1. If only weak Type I
error rate control is required, then one can game the system by artificially introducing
a second primary endpoint E2, a co-primary endpoint so that Rx is approved only if
its efficacy relative to C is shown in both endpoints E1 and E2. It would seem that
adding E2 would not make getting Rx approved easier.
Suppose the Statistical Analysis Plan (SAP) is a two-step process, as follows:

Step 1 Test, at the 5% level, the complete null hypothesis that there is no difference
between Rx and C for either endpoint; if the complete null hypothesis is rejected,
then go to Step 2; otherwise Stop.
Step 2 Infer Rx is better than C.

This procedure controls the Type I error rate weakly at 5%, but what is its
incorrect decision rate?
Suppose E2, a clinically meaningless endpoint, is chosen on the basis that it is a
sure bet that efficacy in this endpoint can be proven. Then, at the end of the study,
the complete null will surely be rejected, and Rx is guaranteed to be approved by
this procedure (regardless of what the data indicates). If Rx is in fact slightly worse
than C on endpoint E1, then the probability of an incorrect decision can in fact be
close to one-half. Clearly, weak control of the Type I error rate may not translate to
control of the incorrect decision rate. Requiring strong control ameliorates this
concern.
Similarly, weak control is insufficient to control the incorrect targeting rate,
because it does not control the probability of inferring Rx is better than C for g−
patients when Rx and C have the same effect in g−, if it is true that Rx is better than
C for g+ patients, for example.
Even with these known limitations, some methods for subgroup identification
rely on weak control. The machine learning approach of Lipkovich et al. (2011) and
the likelihood ratio testing approach of Jiang et al. (2007) compute the null distri-
butions by permutation which, as explained in Xu and Hsu (2007) and Kaizar et al.
(2011), requires the subtle MDJ (marginals-determine-the-joint) assumption to con-
trol Type I error rate weakly (The crux of the matter is that permutation generates a
null distribution assuming the joint distribution of observations are identical under
Rx and C across all biomarker values, while the complete null only specifies that the
marginal distributions are the same.).

MDJ If a subset of the null hypotheses that state under Rx and C the marginal
distributions of the observations are identical are true, then the joint distributions
of the observations are identical under Rx and C for that subset as well.

In applying any permutation-based technique, whether MDJ holds should be
checked.
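For reference, a minimal two-sample permutation test is sketched below (hypothetical data). Permuting the treatment labels generates a reference distribution that is valid when the observations are exchangeable between Rx and C, which is precisely the joint-distribution condition that the MDJ assumption addresses.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_pvalue(x, y, n_perm=10000):
    """Two-sample permutation test of the difference in means. Valid when
    Rx and C observations are exchangeable (the joint-distribution condition)."""
    obs = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    n = len(x)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(pooled)
        diffs[i] = p[:n].mean() - p[n:].mean()
    return np.mean(np.abs(diffs) >= abs(obs))

x = rng.normal(0.5, 1.0, 30)   # hypothetical Rx outcomes
y = rng.normal(0.0, 1.0, 30)   # hypothetical C outcomes
print(permutation_pvalue(x, y))
```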

Strong Control
Strong control of Type I error rate means that even if some of the null hypotheses are
false, the probability of rejecting at least one true null hypothesis is controlled.
Strong control would control the probability of incorrect decision if the null
hypotheses are appropriately formulated.

Confident Directional Control
Actionable inferences are directional in nature: Rx is better than C, or Rx is worse
than C. As explained in Section 5 of Lin et al. (2019), basing directional decisions on
confidence intervals automatically controls the directional error rate: if one infers
μ > 0 when its 95% confidence interval is entirely larger than zero, then the
probability of inferring μ > 0 when in fact μ  0 is at most 5%. Such methods are
what Tukey (1953) calls confident direction methods.
It is an incorrect perception that stepwise testing methods do not have
confidence sets. Actually, so long as the union of the null hypotheses being
tested by a multiple test covers the entire parameter space, the Partitioning
Principle (as described, e.g., in Huang and Hsu 2007) can be used to pivot the
test to a corresponding confidence set (The name The Partitioning Principle was
officially coined by Finner and Strassburger (2002). Historically, this principle
was independently developed by Helmut Finner and associates as well as
Takeuchi (1973, 2010) and Stefansson et al. (1988).). See Stefansson et al.
(1988), Hayter and Hsu (1994), Hsu (1996), and Hsu and Berger (1999) for
examples of using the pivoting technique, and Finner and Strassburger (2007)
and Strassburger and Bretz (2008) for additional examples. We thus urge more
attention be paid to checking whether proposed multiple tests have associated
confidence sets.
Respecting Logical Relationships

Subgroups can be defined by biomarkers or by other characteristics such as regions.
In the former case, decision-making involves assessing efficacy in the subgroups and
their mixtures. In the latter case, typical practice is to adjust for baseline differences
in the subgroups in assessing a presumed common efficacy across the subgroups,
though therapies are approved separately for each region. This chapter focuses on the
former situation.
Let μRx(x) and μC(x) denote the true effect of Rx and C at each biomarker value x.
Recall p(x) is the density of patient biomarker values in the population which, in our
RCT setting, is the same for Rx and C. Suppose a biomarker cut-point value c divides
the entire population into two subgroups, g_c^+ = {x ≥ c} and g_c^− = {x < c}.
Denote the true (unknown) efficacy in g_c^+, g_c^−, and all-comers {all} by
η_{g_c^+}, η_{g_c^−}, and η_{all}, respectively. Since all-comers are a mixture of
g_c^+ and g_c^−, it seems desirable for efficacy measures to meet the criterion that
efficacy for all-comers lies between the efficacies of the complementary subgroups:

Definition: An efficacy measure is logic-respecting if, for every cut-point c,

min(η_{g_c^+}, η_{g_c^−}) ≤ η_{all} ≤ max(η_{g_c^+}, η_{g_c^−}).